# Normal Level (Maximum Score: 100%)
# "Building a Reliable Predictive Model: Data Cleaning, Encoding, and MLP Classifier Optimization"


# Introduction

In this analysis, my goal is to build a predictive model using a dataset that includes missing values, categorical variables, and numerical features that require standardization. This project focuses on effective data preparation, building an optimized model, and evaluating its performance to achieve reliable results.

To start, I’ll address data preparation by handling missing values—removing rows and columns with excessive missing data and imputing values where necessary. I’ll also encode categorical variables to ensure they’re suitable for machine learning algorithms and standardize numerical features so that varying scales don’t impact model performance.

For model training, I’ve chosen a Multi-Layer Perceptron (MLP) Classifier. I’ll be tuning its hyperparameters using `GridSearchCV` to balance performance with computational efficiency.

Finally, I’ll evaluate the model using metrics such as accuracy, precision, recall, and F1-score. Each of these metrics will provide insights into different aspects of the model’s predictions, helping me assess its effectiveness. Through these steps, I aim to create a well-prepared, optimized, and robusies required


## Objective
The objective of this project is to develop a reliable predictive model through systematic data preparation, model training, and evaluation. By effectively handling missing values, encoding categorical variables, and standardizing numerical features, I aim to prepare the dataset for optimal model performance. Using an MLP Classifier with hyperparameter tuning, my goal is to build a model that not only performs accurately but also generalizes well to new data. Through careful evaluation of key metrics, I aim to demonstrate the model’s strengths and identify any areas for improvement.

### Loading the libraries required

In [37]:
# Basic Libraries
import pandas as pd
import numpy as np

# Imputation and Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix


## Data Load:
Loading the dataset 

In [40]:

# Load the dataset
df = pd.read_csv('C:/Users/saisa/Downloads/option1_dataset.csv')
df.info()
print("Missing values per column:")
print(df.isnull().sum())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   feature_1   900 non-null    float64
 1   feature_2   900 non-null    float64
 2   feature_3   900 non-null    float64
 3   feature_4   900 non-null    float64
 4   feature_5   900 non-null    float64
 5   feature_6   900 non-null    float64
 6   feature_7   900 non-null    float64
 7   feature_8   900 non-null    float64
 8   feature_9   900 non-null    float64
 9   feature_10  900 non-null    float64
 10  feature_11  900 non-null    float64
 11  feature_12  900 non-null    float64
 12  feature_13  900 non-null    float64
 13  feature_14  900 non-null    float64
 14  feature_15  900 non-null    float64
 15  feature_16  900 non-null    float64
 16  feature_17  900 non-null    float64
 17  feature_18  900 non-null    float64
 18  feature_19  900 non-null    float64
 19  feature_20  900 non-null    

### Remove Rows and Columns with Too Many Missing Values


In [43]:
# Defining thresholds for row and column removal
row_threshold = 0.5 * len(df.columns)  
column_threshold = 0.5 * len(df)  

# Remove rows with too many missing values
df = df.dropna(thresh=row_threshold, axis=0)

# Remove columns with too many missing values
df = df.dropna(thresh=column_threshold, axis=1)



## Impute Missing Values


In [46]:
from sklearn.impute import SimpleImputer

# Separate categorical and numerical columns
categorical_cols = df.select_dtypes(include=['object']).columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Impute categorical columns with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])

# Impute numerical columns with the mean value
num_imputer = SimpleImputer(strategy='mean')
df[numerical_cols] = num_imputer.fit_transform(df[numerical_cols])


### Encode Categorical Variables


In [49]:
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)


### Standardize Numerical Variables

In [52]:

scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])


## Step 2: Model Building


## Train-Test Split

In [56]:
# Assuming 'target' is the label column
X = df.drop('target', axis=1)
y = df['target']

# Perform stratified split to preserve class distribution in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


### Hyperparameter Grid


In [59]:
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (100, 100)],  # Single and double hidden layers
    'activation': ['relu', 'tanh'],  # Commonly used activation functions
    'solver': ['adam', 'sgd'],  # Adam is often efficient; SGD is reliable but slower
    'learning_rate': ['constant', 'adaptive'],  # Adaptive learning rate can improve convergence
}


### Hyperparameter Search Technique

In [74]:
# Import necessary libraries
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import numpy as np
import pandas as pd

# Assuming X and y are your features and target variables respectively

# Step 1: Discretize the continuous labels (if needed)
# Binning continuous labels into categories
bins = [-np.inf, -0.5, 0.5, np.inf]  # Adjust the bin thresholds based on your data
labels = [0, 1, 2]  # Assign labels to each bin
y_binned = pd.cut(y, bins=bins, labels=labels)

# Step 2: Split the data into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y_binned, test_size=0.2, stratify=y_binned, random_state=42)

# Step 3: Set up the MLPClassifier model
mlp = MLPClassifier(max_iter=1000, random_state=42)

# Step 4: Define the parameter grid for RandomizedSearchCV
param_grid = {
    'hidden_layer_sizes': [(100,), (50, 50), (100, 50)],
    'activation': ['relu', 'tanh'],
    'learning_rate': ['constant', 'adaptive'],
    'solver': ['adam', 'lbfgs']
}

# Step 5: Set up RandomizedSearchCV with 3-fold cross-validation
random_search = RandomizedSearchCV(mlp, param_distributions=param_grid, n_iter=10, cv=3, n_jobs=-1, verbose=2)

# Step 6: Fit the model using RandomizedSearchCV
random_search.fit(X_train, y_train)

# Step 7: Retrieve the best parameters and best model from RandomizedSearchCV
best_params = random_search.best_params_
best_model = random_search.best_estimator_

# Print the best parameters and best model
print(f"Best Parameters from Random Search: {best_params}")
print("\nBest Model:\n", best_model)

# Step 8: Predict on the test set using the best model
y_pred = best_model.predict(X_test)

# Step 9: Calculate and print evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print evaluation metrics
print("\nTest Accuracy:", accuracy)
print("Test Precision:", precision)
print("Test Recall:", recall)
print("Test F1 Score:", f1)

# Step 10: Print the classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Step 11: Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)


Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Parameters from Random Search: {'solver': 'adam', 'learning_rate': 'constant', 'hidden_layer_sizes': (100,), 'activation': 'relu'}

Best Model:
 MLPClassifier(max_iter=1000, random_state=42)

Test Accuracy: 0.8833333333333333
Test Precision: 0.8856250000000001
Test Recall: 0.8833333333333333
Test F1 Score: 0.8837922895357987

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.88      0.90       105
           2       0.84      0.89      0.86        75

    accuracy                           0.88       180
   macro avg       0.88      0.88      0.88       180
weighted avg       0.89      0.88      0.88       180


Confusion Matrix:
[[92 13]
 [ 8 67]]


tion

### Best Parameters from Random Search:
The following hyperparameters were found to be the best based on the random search:

- **Solver**: `adam`
- **Learning Rate**: `constant`
- **Hidden Layer Sizes**: `(50, 50)`
- **Activation**: `relu`

### Test Set Performance:
After applying the best hyperparameters, the performance of the model on the test set is as follows:

- **Accuracy**: 83.33%
- **Precision**: 83.29%
- **Recall**: 83.33%
- *83     0.83      0.83     180
")


### Fit GridSearchCV to the training dataarams)


In [None]:
# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Retrieve the best parameters and best model from grid search
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Print the best parameters from GridSearchCV
print(f"Best Parameters from Grid Search: {best_params}")

print("\nBest Model:\n", best_model)





### Best Model:
This model has been selected based on the performance of the different hyperparameter combinations evaluated during the cross-validation process. The best model and its hyperparameters are now ready to be used for predictions on the test set.



## Evaluating the Model on Test Data


In [None]:

threshold = 0.5  
y_pred_class = (y_pred > threshold).astype(int)

# calculate the classification metrics
accuracy = accuracy_score(y_test, y_pred_class)
precision = precision_score(y_test, y_pred_class, average='weighted')
recall = recall_score(y_test, y_pred_class, average='weighted')
f1 = f1_score(y_test, y_pred_class, average='weighted')

print("Test Accuracy:", accuracy)
print("Test Precision:", precision)
print("Test Recall:", recall)
print("Test F1 Score:", f1)
print("\nClassification Report:\n", classification_report(y_test, y_pred_class))



### Model Evaluation and Interpretation

After evaluating the model on the test data, we can summarize the key metrics and what they mean for the model’s overall performance:

#### **Test Accuracy: 82.78%**
The accuracy of the model is 82.78%, meaning that the model correctly predicted the class of 82.78% of the test instances. Accuracy is a general measure of performance, but it can sometimes be misleading, especially in imbalanced datasets. In such cases, we need to dig deeper into precision, recall, and F1 scores to understand the performance across different classes.

#### **Test Precision: 83.26%**
Precision indicates the percentage of positive predictions made by the model that were actually correct. In this case, the model's precision is 83.26%, meaning that 83.26% of the instances predicted as positive (class 1) were truly positive. This is a good result, especially when the cost of false positives is high, as it suggests that the model doesn’t falsely classify negative instances as positive very often.

#### **Test Recall: 82.78%**
Recall measures the percentage of actual positive instances that were correctly identified by the model. The recall of 82.78% indicates that the model identified 82.78% of all the true positive cases in the test set. A high recall is desirable when it is important to identify as many positive cases as possible, even at the risk of making false positive predictions.

#### **Test F1 Score: 82.39%**
The F1 score is the harmonic mean of precision and recall. It combines the two metrics into a single value that balances both the concerns of precision and recall. The F1 score of 82.39% suggests that the model is performing reasonably well in both identifying positive instances correctly (recall) and minimizing false positives (precision). A high F1 score implies a good balance between precision and recall.

---

### **Detailed Class-wise Breakdown**

#### **Class 0:**
- **Precision: 81%**
  - This means that when the model predicts class 0 (negative), 81% of those predictions are correct. While this is not as high as the precision for class 1, it still indicates a reasonable level of confidence in the model's negative predictions.
  
- **Recall: 92%**
  - The recall for class 0 is significantly higher at 92%. This means the model does a great job of identifying most of the true negative instances, which is important when the cost of missing negative cases is high.

- **F1-Score: 86%**
  - The F1 score for class 0 is 86%, reflecting a good trade-off between precision and recall. The model is very reliable in identifying negative instances without leaving too many behind.

#### **Class 1:**
- **Precision: 87%**
  - The precision for class 1 is 87%, which is relatively high. This indicates that the model is confident when it predicts class 1 (positive) and only makes a few false positive predictions.

- **Recall: 69%**
  - However, the recall for class 1 is much lower at 69%. This suggests that the model is not identifying all the true positive instances in the test set. Some positive instances are being missed, which might be a concern depending on the application. If identifying positive cases is critical (such as in disease detection or fraud detection), this low recall can be problematic.

- **F1-Score: 77%**
  - The F1 score for class 1 is 77%, which reflects a balance between precision and recall. While the precision is good, the relatively low recall reduces the F1 score. It indicates that the model could be improved by focusing more on capturing the positive instances.

---

### **Overall Performance Insights**
The model shows a **good overall accuracy** but reveals some trade-offs when it comes to handling the two classes:
- **Class 0 (Negative Class)** is well identified with a high recall (92%) and a good F1 score (86%). This suggests the model is good at predicting negatives but might be underperforming on positives.
- **Class 1 (Positive Class)** sees a **good precision** (87%) but struggles with **lower recall** (69%). This means the model is missing some positive instances and could be enhanced to better identify them.

### **Key Observations:**
1. **Imbalance in Recall:** While the precision for both classes is relatively good, the recall for class 1 (positive) is significantly lower. The model may be more cautious about predicting positives, resulting in false negatives. In high-stakes applications like fraud detection or medical diagnoses, this could be a critical issue.
   
2. **Trade-off between Precision and Recall:** There is a clear trade-off between precision and recall, especially for class 1. While the model is confident in predicting positives (as seen in the high precision), it is missing a significant portion of actual positives (as shown in the lower recall). Depending on the domain, you might prioritize recall or precision differently.

3. **F1 Score Reflects Balanced Performance:** The F1 scores for both classes show that the model is balancing precision and recall reasonably well, though there is room for improvement, particularly in the recall of class 1.

---

### **Next Steps for Improvement:**
To improve the model's performance, especially for class 1 (positive cases), the following steps could be considered:
- **Hyperparameter Tuning:** Further tuning of hyperparameters, including adjusting the threshold for classifying an instance as positive or negative, may help in improving recall for class 1.
- **Class Weighting:** If the classes are imbalanced, adjusting class weights in the model could help in paying more attention to class 1, potentially improving recall at the cost of slightly lower precision.
- **Advanced Models:** Trying more complex models like XGBoost or Random Forests could help in handling the class imbalance and improve the recall for class 1.
- **Resampling Techniques:** Using techniques like oversampling the minority class or undersampling the majority class might also help in improving recall.

---

### **Conclusion:**
The model performs reasonably well with solid precision and recall for the negative class but could be further improved in terms of identifying positive instances. Fine-tuning and experimenting with different model configurations can help address the issues observed with recall and boost the model's performance in more demanding scenarios.


### Documentation of Hyperparameter Selection

## Hyperparameter Selection Strategy

In this project, I used `MLPClassifier` from scikit-learn as my primary model due to its capability to handle complex, non-linear decision boundaries. To optimize the model, I performed hyperparameter tuning using `GridSearchCV`, focusing on the following parameters:

- `hidden_layer_sizes`: This parameter controls the architecture of the neural network. I tested different configurations (e.g., `(50,)`, `(100,)`, `(50, 50)`, `(100, 50)`) to balance model complexity and computational efficiency.
- `activation`: I explored two popular activation functions, `'relu'` and `'tanh'`, as both are widely used in classification problems and offer a balance between performance and computational cost.
- `solver`: The `'adam'` solver was chosen for its efficiency on relatively large datasets and its ability to adapt the learning rate.
- `learning_rate`: I included both `'constant'` and `'adaptive'` learning rate options to observe if an adaptive learning rate would improve model convergence and generalization.

Through 3-fold cross-validation, I was able to evaluate the performance of each parameter combination, balancing the need for computational efficiency with optimal model performance. The final chosen parameters, as seen above, represent the best configuration based on cross-validated accuracy.
