# Introduction
This notebook applies a `Random Forest` classifier to a **diabetes dataset**. The dataset was obtained from Kaggle and is available in my repo. A preview of the data is shown below.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/youronlydimwit/Data_ScienceUse_Cases/refs/heads/main/Predictions/Data/diabetes.csv')
df.head()

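| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | Pedigree | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |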


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Pregnancies    768 non-null    int64  
 1   Glucose        768 non-null    int64  
 2   BloodPressure  768 non-null    int64  
 3   SkinThickness  768 non-null    int64  
 4   Insulin        768 non-null    int64  
 5   BMI            768 non-null    float64
 6   Pedigree       768 non-null    float64
 7   Age            768 non-null    int64  
 8   Outcome        768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


This dataset is related to diabetes prediction, where the target variable (`Outcome`) indicates whether a patient has diabetes (`1`) or not (`0`).

# 1. Initial Hypothesis
- **Hypothesis:** There is a relationship between various medical indicators (e.g., glucose levels, BMI, age, etc.) and whether an individual has diabetes. We hypothesize that using these features, a Random Forest model can predict the outcome effectively.

# 2. Equation/Model
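Mathematically, the Random Forest prediction can be written as an average over its trees:

$$
f(x) = \frac{1}{N} \sum_{i=1}^{N} T_i(x)
$$

Where:
- \( f(x) \) is the final predicted outcome for input features \( x \),
- \( N \) is the number of trees in the forest,
- \( T_i(x) \) is the prediction from the \( i^{th} \) tree for the input features \( x \).

For classification, \( T_i(x) \) is the class-probability estimate of the \( i^{th} \) tree, and the predicted class is the one with the highest averaged probability. This is how scikit-learn's `RandomForestClassifier` combines its trees.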
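The averaging formula can be checked directly against a fitted forest. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the diabetes dataset) and compares the hand-computed average of the per-tree probabilities with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# T_i(x): class-probability estimate of each individual tree
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# f(x): average over the N trees, then pick the most probable class
manual_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]

print(np.allclose(manual_proba, forest.predict_proba(X_demo)))
print((manual_pred == forest.predict(X_demo)).all())
```

Both checks should print `True` if the forest averages its trees as described.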

# 3. Preprocessing
Before training the model, we need to preprocess the data:
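- **Check for missing values**: `df.info()` reports no null values, so no imputation is needed for the model to run. Note, however, that the preview shows zeros in `SkinThickness` and `Insulin`; such values are not physiologically plausible and probably stand in for missing measurements. They are left unchanged here.
- **Feature scaling**: Random Forest splits on feature thresholds, so it is essentially insensitive to feature scaling, unlike algorithms such as SVM or logistic regression. Standardization is applied below mainly for comparison, and the model is also trained on the unscaled features to confirm that the results do not change.
- **Splitting the data**: Split the data into training and testing sets (an 80-20 split is typical).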
- **One-hot encoding**: There are no categorical variables in this dataset, so this step is not needed.
- **Target variable**: The `Outcome` variable is already binary (0 or 1), so no transformation is needed.
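A null check alone can miss placeholder values. As a small sketch, the code below counts both nulls and zeros per column on the five preview rows, re-typed by hand so the example is self-contained. Treating zeros as "missing" is an assumption about how this dataset was recorded, not something the source states.

```python
import pandas as pd

# The five preview rows shown earlier, re-typed by hand for a self-contained example
preview = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

null_counts = preview.isnull().sum()   # what df.info() effectively reports
zero_counts = (preview == 0).sum()     # candidate placeholder values

print(pd.DataFrame({"nulls": null_counts, "zeros": zero_counts}))
```

Running the same two lines on the full `df` would show how widespread the zeros are before deciding whether to impute them.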

We will use the `scikit-learn` (`sklearn`) library for the analysis.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [6]:
# Features and target variable
X = df.drop('Outcome', axis=1)  # Features
y = df['Outcome']  # Target variable

In [7]:
# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.72
Confusion Matrix:
[[77 22]
 [21 34]]
Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.78      0.78        99
           1       0.61      0.62      0.61        55

    accuracy                           0.72       154
   macro avg       0.70      0.70      0.70       154
weighted avg       0.72      0.72      0.72       154



In [9]:
# Train the same Random Forest on the unscaled features for comparison
rf_model_norm = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model_norm.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_model_norm.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.72
Confusion Matrix:
[[77 22]
 [21 34]]
Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.78      0.78        99
           1       0.61      0.62      0.61        55

    accuracy                           0.72       154
   macro avg       0.70      0.70      0.70       154
weighted avg       0.72      0.72      0.72       154
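The two runs above report identical metrics with and without `StandardScaler`. The sketch below reproduces that comparison on synthetic stand-in data (an assumption), using the same seed for both forests so that only the scaling differs. Agreement is expected to be complete or very nearly so, apart from possible floating-point edge cases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split 80/20 as in the notebook
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)

# Same seed, so both forests draw the same bootstrap samples and feature subsets
raw_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
scaled_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(
    scaler.transform(X_tr), y_tr
)

# Fraction of test rows on which the two models predict the same class
agreement = (raw_model.predict(X_te) == scaled_model.predict(scaler.transform(X_te))).mean()
print(f"Share of identical test predictions: {agreement:.2f}")
```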



In [11]:
# Feature importance
importances = rf_model.feature_importances_
features = X.columns

# Create a DataFrame for visualization
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print(feature_importance_df)

         Feature  Importance
1        Glucose    0.258864
5            BMI    0.169984
7            Age    0.140931
6       Pedigree    0.123768
2  BloodPressure    0.088134
0    Pregnancies    0.076551
4        Insulin    0.076122
3  SkinThickness    0.065646


# First Interpretation of Model Results

### 1. **Accuracy: 0.72**
   - The model correctly predicted 72% of the instances in the test set (111 of 154), compared with about 64% for always predicting the majority class. This is a decent starting point, but it leaves room for improvement.

### 2. **Confusion Matrix:**
| | Predicted `0` | Predicted `1` |
|---|---|---|
| **Actual `0`** | 77 | 22 |
| **Actual `1`** | 21 | 34 |

The confusion matrix breaks the test predictions down into true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP):

- **True Negatives (TN) = 77**: The model correctly predicted "No Diabetes" (`0`) 77 times.
- **False Positives (FP) = 22**: The model incorrectly predicted "Diabetes" (`1`) 22 times when the actual outcome was "No Diabetes" (`0`).
- **False Negatives (FN) = 21**: The model incorrectly predicted "No Diabetes" (`0`) 21 times when the actual outcome was "Diabetes" (`1`).
- **True Positives (TP) = 34**: The model correctly predicted "Diabetes" (`1`) 34 times.
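The four counts above are enough to recompute the headline metrics by hand. The following sketch does the arithmetic with the standard definitions of precision, recall, and F1; the variable names are illustrative.

```python
# Counts read from the confusion matrix above
tn, fp, fn, tp = 77, 22, 21, 34

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 (Diabetes)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (No Diabetes): the roles of the counts are mirrored
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy={accuracy:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The rounded values match the classification report: 0.72 accuracy, 0.79/0.78/0.78 for class `0`, and 0.61/0.62/0.61 for class `1`.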

### 3. **Classification Report:**
This gives us more detailed performance metrics, including precision, recall, and F1-score for both classes (`0` and `1`):

- **Class `0` (No Diabetes)**:
  - **Precision = 0.79**: Of all the instances the model predicted as "No Diabetes" (`0`), 79% were actually correct.
  - **Recall = 0.78**: Of all the actual "No Diabetes" instances, 78% were correctly identified by the model.
  - **F1-score = 0.78**: The harmonic mean of precision and recall for class `0`. This score indicates a balanced performance for class `0`.

- **Class `1` (Diabetes)**:
  - **Precision = 0.61**: Of all the instances the model predicted as "Diabetes" (`1`), 61% were actually correct.
  - **Recall = 0.62**: Of all the actual "Diabetes" instances, 62% were correctly identified by the model.
  - **F1-score = 0.61**: The F1-score for class `1` is lower than that for class `0`, which indicates that the model performs worse on class `1`.

- **Macro Average**: The average of precision, recall, and F1-score across both classes (`0` and `1`), without considering their support (i.e., the number of instances in each class). Here, you have:
  - **Precision = 0.70**
  - **Recall = 0.70**
  - **F1-score = 0.70**

  The macro average treats both classes equally regardless of their size, so the weaker class `1` scores pull it below the overall accuracy of 0.72.

- **Weighted Average**: This metric accounts for the support (number of instances) of each class. The weighted averages are:
  - **Precision = 0.72**
  - **Recall = 0.72**
  - **F1-score = 0.72**

  The weighted average is dominated by the larger class `0` (99 of the 154 test instances), which is why it sits above the macro average. Weighted recall is always equal to accuracy, here 0.72.
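To make the difference between the two averages concrete, the sketch below rebuilds label vectors that reproduce the confusion matrix `[[77, 22], [21, 34]]` and averages the per-class F1-scores both ways. The reconstructed vectors are an illustration device, not the actual test rows.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Label vectors that reproduce the confusion matrix [[77, 22], [21, 34]]
y_true = np.array([0] * 99 + [1] * 55)
y_hat = np.array([0] * 77 + [1] * 22 + [0] * 21 + [1] * 34)

# Per-class arrays, indexed by class label
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_hat)

macro_f1 = f1.mean()                           # every class counts equally
weighted_f1 = np.average(f1, weights=support)  # classes count by their support

print(f"macro F1={macro_f1:.2f}, weighted F1={weighted_f1:.2f}")
```

The same numbers come out of `precision_recall_fscore_support(..., average="macro")` and `average="weighted"`.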

### 4. **Conclusion:**
- **Accuracy** (72%) shows that the model is making correct predictions most of the time, but there’s still a potential for improvement, especially for classifying `Diabetes` correctly.
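- The **confusion matrix** shows nearly symmetric errors (22 `False Positives` vs. 21 `False Negatives`), and the model predicts class `0` for 98 of the 154 test cases, close to the 99 actual cases. Because class `1` is the smaller class, a similar number of errors hurts its precision and recall more.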
- For **class `1` (`Diabetes`)**, both precision and recall are lower, which means the model is missing some instances of diabetes (`False Negatives`) and sometimes incorrectly identifying non-diabetes cases as diabetes (`False Positives`).
- The **macro and weighted averages** suggest a moderately balanced performance, but there’s room for improving the recall for class `1` (`Diabetes`) to reduce the number of `False Negatives`.

## **Proposed Improvement Methods:**
1. **Address Class Imbalance**: The test set contains 99 "No Diabetes" and 55 "Diabetes" cases, so the classes are moderately imbalanced. Consider techniques like oversampling the minority class, undersampling the majority class, or using class weights in the Random Forest model to handle the imbalance better.
2. **Hyperparameter Tuning**: Adjust the hyperparameters of the Random Forest, such as the number of trees (`n_estimators`), the maximum depth (`max_depth`), or the minimum samples required for a split, to potentially improve performance.
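As a rough illustration of what class weighting does, the sketch below computes the weights that `class_weight='balanced'` would assign for a 401 vs. 213 class split (the training-set counts reported later in this notebook). The label vector is rebuilt from those counts rather than taken from the data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector rebuilt from the training-set class counts (401 vs. 213)
y_train_demo = np.array([0] * 401 + [1] * 213)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train_demo)

# 'balanced' uses n_samples / (n_classes * count_per_class)
manual = len(y_train_demo) / (len(classes) * np.bincount(y_train_demo))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The minority class ends up with roughly twice the weight of the majority class, which is a milder correction than fixed weights such as `{0: 1, 1: 5}`.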

In [13]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

In [17]:
# Define the RandomForestClassifier
rf = RandomForestClassifier()

# Define the parameter distribution
param_dist = {
    'n_estimators': [50, 200, 100, 150],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': ['auto', 'sqrt', 'log2'],  # note: 'auto' was removed in scikit-learn 1.3; candidates that sample it fail to fit
    'bootstrap': [True, False],
    'class_weight': ['balanced', {0: 1, 1: 5}, {0: 1, 1: 7}, {0: 1, 1: 10}]
}

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, 
                                   n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)

# Fit the random search to the data
random_search.fit(X_train, y_train)

# Best hyperparameters
print(f"Best Hyperparameters: {random_search.best_params_}")

# Predict using the best model
y_pred = random_search.best_estimator_.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))


Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best Hyperparameters: {'n_estimators': 50, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'class_weight': 'balanced', 'bootstrap': False}
Accuracy: 0.7337662337662337
              precision    recall  f1-score   support

           0       0.84      0.73      0.78        99
           1       0.60      0.75      0.67        55

    accuracy                           0.73       154
   macro avg       0.72      0.74      0.72       154
weighted avg       0.75      0.73      0.74       154



185 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
82 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\sang.yogi\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\sang.yogi\Anaconda3\lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "C:\Users\sang.yogi\Anaconda3\lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\sang.yogi\Anaconda3\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    rais
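The failed fits reported above are consistent with the `'auto'` option for `max_features`, which scikit-learn removed in version 1.3 and now rejects during parameter validation. The sketch below shows a search space without it, on synthetic stand-in data. The smaller grid, `scoring="f1"`, and `error_score="raise"` are assumptions chosen for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, mildly imbalanced stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=0
)

param_dist = {
    "n_estimators": [25, 50],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 4],
    "max_features": ["sqrt", "log2", None],  # 'auto' omitted: rejected by scikit-learn >= 1.3
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),  # fixed seed for reproducible fits
    param_distributions=param_dist,
    n_iter=6,
    cv=3,
    scoring="f1",         # assumption: optimise F1 on class 1 instead of default accuracy
    error_score="raise",  # fail loudly instead of silently scoring NaN
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With `error_score="raise"`, an invalid parameter value stops the search at the first failing fit, which makes problems like this easier to spot than a warning after 500 fits.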

# Further processing with the addition of SMOTE Oversampling

In [18]:
from imblearn.over_sampling import SMOTE

In [19]:
# Instantiate SMOTE
smote = SMOTE(random_state=42)

# Fit and resample the training data
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Check the new class distribution
print(f"Original class distribution: {y_train.value_counts()}")
print(f"Resampled class distribution: {pd.Series(y_resampled).value_counts()}")

Original class distribution: Outcome
0    401
1    213
Name: count, dtype: int64
Resampled class distribution: Outcome
0    401
1    401
Name: count, dtype: int64
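For intuition about where the 188 extra class `1` rows come from, here is a simplified sketch of the interpolation idea behind SMOTE: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. The function `smote_like_oversample` and the random feature matrix are hypothetical illustrations, not imblearn's implementation or the real data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_minority, n_new, k=5, seed=42):
    """Simplified SMOTE-style sampler (illustration only, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbor_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_minority), size=n_new)          # random minority points
    chosen = neighbor_idx[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    lam = rng.random((n_new, 1))                                 # interpolation factor in [0, 1)
    return X_minority[base] + lam * (X_minority[chosen] - X_minority[base])

# Hypothetical minority-class features: 213 rows, matching the class-1 training count above
X_min = np.random.default_rng(0).normal(size=(213, 8))
X_new = smote_like_oversample(X_min, n_new=401 - 213)
print(X_new.shape)
```

Because the synthetic rows are interpolations, they stay inside the range of the existing minority samples. Resampling only the training split, as done above, keeps the test set free of synthetic data.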


In [20]:
# Best hyperparameters from the previous RandomizedSearchCV
best_params = random_search.best_params_

# Define the RandomForestClassifier with the best parameters
rf = RandomForestClassifier(
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    min_samples_leaf=best_params['min_samples_leaf'],
    max_features=best_params['max_features'],
    bootstrap=best_params['bootstrap'],
    class_weight='balanced',  # No effect here: SMOTE has already equalized the classes (401 vs. 401)
    random_state=42
)

# Train the model
rf.fit(X_resampled, y_resampled)

# Make predictions
y_pred = rf.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

Accuracy: 0.7597402597402597
              precision    recall  f1-score   support

           0       0.84      0.77      0.80        99
           1       0.64      0.75      0.69        55

    accuracy                           0.76       154
   macro avg       0.74      0.76      0.75       154
weighted avg       0.77      0.76      0.76       154



# Final Interpretation

### 1. **Accuracy: 75.97%**
   - This indicates that the model correctly predicted the outcome (Diabetes or No Diabetes) for about 76% of the test data. While this is a decent performance, there's room for improvement, particularly for classifying the minority class (Diabetes).

### 2. **Classification Report:**
This gives us more detailed performance metrics, including precision, recall, and F1-score for both classes (`0` and `1`):

- **Class `0` (No Diabetes)**:
  - **Precision = 0.84**: Of all the instances the model predicted as "No Diabetes" (`0`), 84% were actually correct.
  - **Recall = 0.77**: Of all the actual "No Diabetes" instances, 77% were correctly identified by the model. The remaining 23% were misclassified as `Diabetes`.
  - **F1-score = 0.80**: The harmonic mean of precision and recall for class `0`. A score of 0.80 indicates a good balance of precision and recall.

- **Class `1` (Diabetes)**:
  - **Precision = 0.64**: Of all the instances the model predicted as "Diabetes" (`1`), 64% were actually correct.
  - **Recall = 0.75**: Of all the actual "Diabetes" instances, 75% were correctly identified by the model. The remaining 25% were misclassified as `No Diabetes`.
  - **F1-score = 0.69**: The F1-score for class `1` reflects a reasonable trade-off between precision and recall. The model detects most diabetes cases (recall 0.75), but its precision of 0.64 leaves room for improvement.

- **Macro Average**: The average of precision, recall, and F1-score across both classes (`0` and `1`), without considering their support (i.e., the number of instances in each class). Here, you have:
  - **Precision = 0.74**
  - **Recall = 0.76**
  - **F1-score = 0.75**

  These averages suggest that, on balance, the model does a fair job of identifying both classes (Diabetes and No Diabetes) across the dataset.

- **Weighted Average**: This metric accounts for the support (number of instances) of each class. The weighted averages are:
  - **Precision = 0.77**
  - **Recall = 0.76**
  - **F1-score = 0.76**

  The weighted averages are at or slightly above the macro averages because class `0` (`No Diabetes`) has more test instances (99) than class `1` (`Diabetes`, 55), so its stronger scores carry more weight.

### 3. **Conclusion:**
- The model is relatively strong at predicting class `0` (`No Diabetes`), with good precision and recall, resulting in a higher F1-score.
- For class `1` (`Diabetes`), the model performs reasonably well but could be improved in terms of precision (0.64), since some non-Diabetes instances are incorrectly predicted as Diabetes. Recall is better at 0.75, up from 0.62 in the first model, but the model still misses a quarter of the actual diabetes cases (false negatives).
- Overall, accuracy rose from 0.72 to 0.76 relative to the first model. Further gains in precision and recall for the minority class (Diabetes) would do the most to improve real-world applicability, where detecting true diabetic cases is the main goal.

# Optional: Save the model for further deployment

In [None]:
# import joblib

# Save the trained model
# joblib.dump(rf, 'diabetes_rf_model_with_smote.pkl')

# Load the saved model
# rf_loaded = joblib.load('diabetes_rf_model_with_smote.pkl')
# new_data is assumed to be a DataFrame with the same feature columns as X
# X_new = new_data.drop(columns=['Outcome'], errors='ignore')  # Drop the Outcome column if present

# Apply the model to make predictions
# y_pred_new = rf_loaded.predict(X_new)

# Add predictions to the new dataframe (if you'd like)
# new_data['Predicted_Outcome'] = y_pred_new

# Display or save the new dataframe with predictions
# print(new_data.head())
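As a runnable counterpart to the commented-out cell, the sketch below saves a small forest with `joblib`, reloads it, and checks that the restored model predicts identically. The synthetic data, the temporary directory, and the file name `rf_demo.pkl` are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model (assumption: not the tuned model above)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_demo.pkl")  # hypothetical file name
    joblib.dump(model, model_path)      # serialize the fitted model to disk
    restored = joblib.load(model_path)  # load it back before the directory is removed

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A saved model should be loaded with the same scikit-learn version it was trained with, since pickled estimators are not guaranteed to be compatible across versions.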