# Model Selection and Rationale

## Models Used in This Analysis:

### 1. Logistic Regression
- Purpose: Predicts probability of categorical outcomes
- Why chosen:
  - Excellent interpretability of feature importance
  - Efficient with limited computational resources
  - Provides probability scores for predictions
  - Works well when decision boundaries are roughly linear
  - Fast training and inference times

### 2. Random Forest Classifier
- Purpose: Ensemble learning method for classification
- Why chosen:
  - Handles non-linear relationships effectively
  - Robust against overfitting
  - Captures complex interactions between features
  - Provides feature importance rankings
  - Performs well with both numerical and categorical data

## Why Use Both Models?
1. Complementary Strengths:
   - Logistic Regression: Linear patterns and interpretability
   - Random Forest: Complex patterns and robustness

2. Model Comparison:
   - Validates if complexity adds value
   - Provides different perspectives on feature importance

3. Business Value:
   - Logistic Regression: Quick insights and easy deployment
   - Random Forest: Higher accuracy for complex patterns

Importing Necessary Libraries and Loading the Merged Dataset for Feature Engineering and Predictive Modeling.

In [None]:
#Loading libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the merged dataset
merged_data = pd.read_csv("merged_data.csv")

Handling Missing Data: Imputation for Numerical, Categorical, and Date Columns

In [None]:
#Feature Engineering#
# 1. Address Missing Data
# Numerical columns: Continuous and Count Data
numerical_columns = merged_data.select_dtypes(include=['float64', 'int64']).columns
for col in numerical_columns:
    if merged_data[col].isnull().sum() > 0:
        # Use median for imputation
        merged_data[col].fillna(merged_data[col].median(), inplace=True)

# Categorical columns: Impute with the mode
categorical_columns = merged_data.select_dtypes(include=['object']).columns
for col in categorical_columns:
    if merged_data[col].isnull().sum() > 0:
        # Use mode for imputation
        merged_data[col].fillna(merged_data[col].mode()[0], inplace=True)

# Date columns: Forward-fill or backward-fill for imputation
date_columns = [col for col in merged_data.columns if 'Date' in col or 'date' in col]
for col in date_columns:
    if merged_data[col].isnull().sum() > 0:
        # Forward-fill method
        merged_data[col] = pd.to_datetime(merged_data[col], errors='coerce')
        merged_data[col].fillna(method='ffill', inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  merged_data[col].fillna(merged_data[col].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  merged_data[col].fillna(merged_data[col].median(), inplace=True)


One-Hot Encoding for Categorical Variables: Transforming Categorical Data into Numerical Format

In [None]:
# 2. One-Hot Encoding for Categorical Columns
# Verify available columns
print("Available columns:", merged_data.columns)

# Specify columns to encode (adjust if needed)
specified_categorical_columns = ['Product_Name', 'Company_Name', 'Address']

# Dynamically filter to include only existing columns
categorical_columns_to_encode = [
    col for col in specified_categorical_columns if col in merged_data.columns
]

if categorical_columns_to_encode:
    # Update: Correcting the OneHotEncoder usage by removing the 'sparse_output' argument
    # which is not valid. This behavior is controlled by 'sparse' argument. Switching accordingly.
    encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
    encoded_features = pd.DataFrame(
        encoder.fit_transform(merged_data[categorical_columns_to_encode]),
        columns=encoder.get_feature_names_out(categorical_columns_to_encode)
    )

    # Add encoded features to the dataset
    merged_data = pd.concat([merged_data.reset_index(drop=True), encoded_features.reset_index(drop=True)], axis=1)

    # Drop original categorical columns after encoding
    merged_data = merged_data.drop(columns=categorical_columns_to_encode)
else:
    print("No valid categorical columns found for one-hot encoding.")

Available columns: Index(['Transaction_ID', 'Company_ID', 'Product_ID', 'Quantity',
       'Transaction_Date', 'Product_Price_x', 'Total_Cost', 'Product_Name',
       'Product_Price_y', 'Company_Name', 'Company_Profit', 'Address'],
      dtype='object')


TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'

Scaling Numerical Features: Standardizing Continuous Variables for Modeling

In [None]:
# 3. Scale Numerical Features
# Define numerical columns for scaling (ensure they exist)
numerical_columns = [
    col for col in ['Product_Price', 'Quantity', 'Total_Cost', 'Company_Profit']
    if col in merged_data.columns
]

if numerical_columns:
    scaler = StandardScaler()
    merged_data[numerical_columns] = scaler.fit_transform(merged_data[numerical_columns])
else:
    print("No valid numerical columns found for scaling.")

# Final Check: Ensure all features are processed
print("Feature Engineering Complete. Dataset Preview:")
print(merged_data.head())

Feature Engineering Complete. Dataset Preview:
   Transaction_ID  Company_ID  Product_ID  Quantity Transaction_Date  \
0             1.0        88.0         6.0  0.076948       2024-03-26   
1             2.0        29.0        19.0  0.984139       2024-07-09   
2             3.0        28.0        18.0 -0.830243       2024-04-13   
3             4.0        85.0        12.0  0.258386       2023-09-06   
4             5.0        47.0         3.0 -0.467367       2021-07-06   

   Product_Price_x  Total_Cost            Product_Name  Product_Price_y  \
0    194379.147964   -0.395485    RevenueVue Dashboard         179200.0   
1     97930.993380    0.013660        EcoNomix Modeler          95200.0   
2    126095.547778   -0.551349  DashSync Analytics Hub         134400.0   
3    131600.000000   -0.473417        BudgetMaster Pro          84000.0   
4     99575.609634   -0.824112    TrendWise Forecaster         100800.0   

            Company_Name  Company_Profit  \
0    Elite Consulting 88 

Defining Features and Target Variable, and Splitting Data for Training and Testing

In [None]:
# Define X (features) and y (target)
X = merged_data.drop(columns=['Product_ID'])
y = merged_data['Product_ID']

# Convert target variable to categorical (if necessary)
y = y.astype(str)

### Split Data ###
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Handling Non-Numerical Features and Implementing Multinomial Logistic Regression

## Logistic Regression Implementation
- Multi-class classification setup
- Maximum iterations set to handle convergence
- Provides probability scores for each class
- Features standardized for optimal performance
- Interpretable coefficients for feature importance

In [None]:
# Fix: Remove or encode non-numerical columns in the features
# Drop non-numerical columns such as 'Transaction_Date' to avoid conversion errors
X_numerical = X.select_dtypes(include=['number'])

# Split Data (again) after correction
X_train, X_test, y_train, y_test = train_test_split(X_numerical, y, test_size=0.2, random_state=42)

# Model 1: Multinomial Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Multinomial Logistic Regression
logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=42, max_iter=1000)

# Train the model
logistic_model.fit(X_train, y_train)

# Predictions
y_pred_logistic = logistic_model.predict(X_test)

# Evaluation Metrics
logistic_metrics = {
    "Model": "Multinomial Logistic Regression",
    "Accuracy": accuracy_score(y_test, y_pred_logistic),
    "Classification Report": classification_report(y_test, y_pred_logistic, output_dict=True)
}

# Print results
print("Multinomial Logistic Regression Metrics:")
print(f"Accuracy: {logistic_metrics['Accuracy']}")
print(classification_report(y_test, y_pred_logistic))

# Save predictions to a DataFrame
logistic_predictions = pd.DataFrame({"Actual": y_test, "Predicted": y_pred_logistic})
logistic_predictions.to_csv("logistic_predictions.csv", index=False)



Multinomial Logistic Regression Metrics:
Accuracy: 0.143
              precision    recall  f1-score   support

         1.0       0.00      0.00      0.00        82
        10.0       0.14      0.79      0.24       266
        11.0       0.00      0.00      0.00        92
        12.0       0.43      0.03      0.06        97
        13.0       0.08      0.01      0.02        77
        14.0       0.00      0.00      0.00        93
        15.0       0.00      0.00      0.00        83
        16.0       0.00      0.00      0.00       106
        17.0       0.00      0.00      0.00        82
        18.0       0.00      0.00      0.00        96
        19.0       0.00      0.00      0.00        85
         2.0       0.00      0.00      0.00        88
        20.0       0.14      0.70      0.23        90
         3.0       0.00      0.00      0.00        98
         4.0       0.00      0.00      0.00        94
         5.0       0.25      0.08      0.12        97
         6.0       0.00 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Implementing and Evaluating Random Forest Classifier

## Random Forest Implementation
- 100 decision trees (n_estimators=100)
- Random state set for reproducibility
- Automatic feature importance calculation
- Handles both numerical and categorical features
- Robust to outliers and noise in data

In [None]:
#Model 2: Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluation Metrics
rf_metrics = {
    "Model": "Random Forest Classifier",
    "Accuracy": accuracy_score(y_test, y_pred_rf),
    "Classification Report": classification_report(y_test, y_pred_rf, output_dict=True)
}

# Print results
print("Random Forest Classifier Metrics:")
print(f"Accuracy: {rf_metrics['Accuracy']}")
print(classification_report(y_test, y_pred_rf))

# Save predictions to a DataFrame
rf_predictions = pd.DataFrame({"Actual": y_test, "Predicted": y_pred_rf})
rf_predictions.to_csv("rf_predictions.csv", index=False)

Random Forest Classifier Metrics:
Accuracy: 0.8015
              precision    recall  f1-score   support

         1.0       1.00      1.00      1.00        82
        10.0       0.91      0.93      0.92       266
        11.0       1.00      1.00      1.00        92
        12.0       0.51      0.52      0.51        97
        13.0       1.00      1.00      1.00        77
        14.0       1.00      1.00      1.00        93
        15.0       0.49      0.51      0.50        83
        16.0       0.99      1.00      1.00       106
        17.0       0.49      0.49      0.49        82
        18.0       0.79      0.76      0.78        96
        19.0       1.00      1.00      1.00        85
         2.0       0.53      0.53      0.53        88
        20.0       1.00      1.00      1.00        90
         3.0       0.56      0.54      0.55        98
         4.0       1.00      1.00      1.00        94
         5.0       0.51      0.49      0.50        97
         6.0       1.00      1

Model Performance Comparison: Evaluating Logistic Regression and Random Forest Classifier

## Model Evaluation Strategy
### Metrics Used and Why:
1. Accuracy
   - Measures overall prediction correctness
   - Easy to communicate to stakeholders
   - Baseline performance indicator

2. Precision
   - Important for minimizing false positives
   - Critical for resource allocation decisions
   - Measures prediction reliability

3. Recall
   - Captures ability to find all positive cases
   - Important for not missing opportunities
   - Key for comprehensive coverage

4. F1-Score
   - Balances precision and recall
   - Single metric for model comparison
   - Handles class imbalance

In [None]:
#Model Comparison Matrix
# Consolidate metrics into a single table
comparison_table = pd.DataFrame({
    "Model": [logistic_metrics["Model"], rf_metrics["Model"]],
    "Accuracy": [logistic_metrics["Accuracy"], rf_metrics["Accuracy"]],
    "Precision (Weighted Avg)": [
        logistic_metrics["Classification Report"]["weighted avg"]["precision"],
        rf_metrics["Classification Report"]["weighted avg"]["precision"]
    ],
    "Recall (Weighted Avg)": [
        logistic_metrics["Classification Report"]["weighted avg"]["recall"],
        rf_metrics["Classification Report"]["weighted avg"]["recall"]
    ],
    "F1-Score (Weighted Avg)": [
        logistic_metrics["Classification Report"]["weighted avg"]["f1-score"],
        rf_metrics["Classification Report"]["weighted avg"]["f1-score"]
    ]
})

# Save comparison table to CSV
comparison_table.to_csv("model_comparison.csv", index=False)

# Print the comparison table
print("Model Comparison Table:")
print(comparison_table)

Model Comparison Table:
                             Model  Accuracy  Precision (Weighted Avg)  \
0  Multinomial Logistic Regression    0.1430                  0.061358   
1         Random Forest Classifier    0.8015                  0.801369   

   Recall (Weighted Avg)  F1-Score (Weighted Avg)  
0                 0.1430                 0.052301  
1                 0.8015                 0.801266  


Side-by-Side Prediction Comparison: Logistic Regression vs. Random Forest

In [None]:
# Predict on the same sample data
sample_data = X_test.iloc[:12]  # Predicting 12 datapoints
logistic_predictions = logistic_model.predict(sample_data)
rf_predictions = rf_model.predict(sample_data)

# Combine predictions into a single DataFrame
comparison_df = pd.DataFrame({
    'Sample Index': sample_data.index,
    'Actual': y_test.loc[sample_data.index].values,
    'Logistic Regression Prediction': logistic_predictions,
    'Random Forest Prediction': rf_predictions
})

# Display the comparison table
print("\nComparison of Predictions Side by Side:")
print(comparison_df)


Comparison of Predictions Side by Side:
    Sample Index Actual Logistic Regression Prediction  \
0           6252   19.0                           10.0   
1           4684   10.0                           10.0   
2           1731   10.0                           10.0   
3           4742    3.0                           10.0   
4           4521    3.0                           10.0   
5           6340    9.0                           20.0   
6            576   17.0                            6.0   
7           5202    3.0                           10.0   
8           6363   14.0                           10.0   
9            439    9.0                           20.0   
10          2750    1.0                           10.0   
11          7487    3.0                           10.0   

   Random Forest Prediction  
0                      19.0  
1                      10.0  
2                      18.0  
3                       3.0  
4                      15.0  
5                       

Comparing Predicted Probabilities and Performance of Logistic Regression vs. Random Forest

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Correctly reference the RandomForestClassifier object defined previously
rf_pred_probs = rf_model.predict_proba(X_test)  # Changed 'random_forest_model' to 'rf_model'

# --- Printing the predictions side by side ---
# Create a DataFrame for the predictions to print them side by side
logistic_pred_df = pd.DataFrame(logistic_model.predict_proba(X_test), columns=[f'Product_{i}' for i in range(1, len(logistic_model.classes_) + 1)])  # Fixed logistic_pred_probs
rf_pred_df = pd.DataFrame(rf_pred_probs, columns=[f'Product_{i}' for i in range(1, len(rf_model.classes_) + 1)])

# Print the predicted probabilities from both models side by side
comparison_df = pd.concat([logistic_pred_df, rf_pred_df], axis=1, keys=["Logistic Regression", "Random Forest"])
print("\nPredicted Probabilities from Both Models (Logistic Regression vs Random Forest):")
print(comparison_df)

# --- Accuracy and Metrics ---
# Predict classes for evaluation (select the product with the highest probability)
logistic_preds = logistic_model.predict(X_test)
rf_preds = rf_model.predict(X_test)  # Changed 'random_forest_model' to 'rf_model'

# Calculate accuracy scores
logistic_accuracy = accuracy_score(y_test, logistic_preds)
rf_accuracy = accuracy_score(y_test, rf_preds)

print(f"\nLogistic Regression Accuracy: {logistic_accuracy:.4f}")
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")

# --- Display Classification Reports ---
print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, logistic_preds))

print("\nRandom Forest Classification Report:")
print(classification_report(y_test, rf_preds))


Predicted Probabilities from Both Models (Logistic Regression vs Random Forest):
     Logistic Regression                                                    \
               Product_1 Product_2 Product_3 Product_4 Product_5 Product_6   
0               0.055181  0.094298  0.044384  0.041486  0.039665  0.043985   
1               0.058622  0.175505  0.061454  0.015255  0.050084  0.049743   
2               0.054494  0.206480  0.071431  0.010983  0.046753  0.055479   
3               0.055026  0.100695  0.051357  0.031429  0.043363  0.047297   
4               0.046622  0.123365  0.040585  0.049782  0.021819  0.046903   
...                  ...       ...       ...       ...       ...       ...   
1995            0.043241  0.090009  0.031813  0.068389  0.018976  0.039453   
1996            0.060423  0.155658  0.057386  0.012026  0.058520  0.041438   
1997            0.024687  0.176034  0.012659  0.132185  0.003059  0.027880   
1998            0.051149  0.055699  0.069459  0.009144  0.10

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


**Conclusion and Rationale**

Based on the evaluation metrics, the Random Forest Classifier significantly outperforms Multinomial Logistic Regression in terms of accuracy, precision, recall, and F1-score.

1. Multinomial Logistic Regression performs poorly with an accuracy of 14.3%, a low precision of 6.1%, and an F1-score of 5.2%, indicating that the model struggles to correctly classify the data. This suggests that the data may not follow a linear decision boundary, making logistic regression unsuitable for this problem.

2. Random Forest Classifier, on the other hand, achieves an accuracy of 80.15%, with consistently high precision, recall, and F1-score (~80.1%). This suggests that the model effectively captures patterns in the dataset, likely benefiting from its ensemble learning approach, which reduces variance and improves classification performance.

**Rationale:**

1. Model Suitability: The poor performance of logistic regression suggests that the problem is highly nonlinear, making tree-based methods like Random Forest more effective.

2. Feature Importance: Random Forest can capture complex relationships and interactions between features, whereas logistic regression assumes a linear relationship, which may not exist in the data.

3. Overfitting Concern: While 80.15% accuracy is strong, further validation is needed to ensure the Random Forest model is not overfitting, especially if the training accuracy is significantly higher. Cross-validation and hyperparameter tuning should be performed to confirm its robustness.

Next Steps: If overfitting is suspected, techniques such as feature selection, hyperparameter tuning, or testing alternative models (e.g., Gradient Boosting, XGBoost) should be explored.

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=6a672b84-99aa-4a43-8d4f-014dc3397bc0' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>