# Modeling and Preprocessing

## Preprocessing

### Handling Missing Values:
Our dataset is already clean, and there are no missing values to address.

### Encoding Categorical Features:
All categorical features in our dataset are nominal. To prepare them for model training, we will use one-hot encoding. This process converts each categorical variable into a set of binary indicator columns, so the model does not infer an ordering among categories that have none.

## Normalization of Training Data:

Normalization brings numeric features to a similar scale, which prevents features with larger ranges from dominating during model training. We will apply min-max scaling, which maps each selected feature to the range 0 to 1, to the numeric features whose variance exceeds a chosen threshold.
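
Min-max scaling maps each value to `(x - min) / (max - min)`. The following is a minimal pure-Python sketch of that formula, using made-up values rather than the dataset; the helper name `min_max_scale` is hypothetical and is not part of the notebook, which relies on scikit-learn's `MinMaxScaler` instead.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread to rescale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative values only, not taken from the dataset.
ages = [20, 30, 40, 60]
scaled = min_max_scale(ages)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so a feature measured in thousands and one measured in single digits end up on the same footing.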

## Model Selection with Optuna:

Optuna is a hyperparameter optimization library. It assists in finding the best model and hyperparameters for our dataset.

Steps:
1. **Define Objective Function:**
   - Create a function that Optuna will optimize. This function typically includes the model training and evaluation steps.

2. **Configure Optuna Study:**
   - Set up an Optuna study, specifying the optimization direction (maximize or minimize) and the number of trials.

3. **Run Optuna Optimization:**
   - Execute the Optuna optimization process to explore different models and hyperparameter combinations.

4. **Select Best Model:**
   - Choose the best-performing model based on the results from the Optuna study.

## Hyperparameter Fine-Tuning:

After selecting the best model, further refine its performance by fine-tuning hyperparameters. This involves adjusting the configuration settings of the chosen model to achieve optimal performance.

## Model Evaluation:

Evaluate the final model on a separate validation set to assess its performance on unseen data. Common evaluation metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).

### Steps:
1. **Prepare Validation Data:**
   - Split the dataset into training and validation sets. The validation set serves as a proxy for unseen data during model evaluation.

2. **Train Final Model:**
   - Train the selected model on the training set, utilizing the optimized hyperparameters.

3. **Evaluate Model:**
   - Assess the model's performance on the validation set using chosen evaluation metrics.

4. **Adjust as Needed:**
   - Depending on the evaluation results, make adjustments or consider further optimizations.

These steps collectively guide us through the process of selecting, fine-tuning, and evaluating the best model for our specific dataset.


In [None]:
import pandas as pd

In [2]:
df = pd.read_csv("bank_data.csv")

In [3]:
import pandas as pd

# 'df' is the DataFrame loaded from bank_data.csv in the previous cell
# Categorical features to be one-hot encoded
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)

# Display the first few rows of the encoded DataFrame
df_encoded.head()


Unnamed: 0.1,Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,y,job_blue-collar,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,0,30,1787,19,79,1,-1,0,no,False,...,False,False,False,False,False,True,False,False,False,True
1,1,33,4789,11,220,1,339,4,no,False,...,False,False,False,True,False,False,False,False,False,False
2,2,35,1350,16,185,1,330,1,no,False,...,False,False,False,False,False,False,False,False,False,False
3,3,30,1476,3,199,4,-1,0,no,False,...,False,True,False,False,False,False,False,False,False,True
4,4,59,0,5,226,1,-1,0,no,True,...,False,False,False,True,False,False,False,False,False,True


**Explanation:**

1. **Importing Necessary Libraries:**
   - `import pandas as pd`: Imports the pandas library and aliases it as `pd` for convenience. Pandas is a powerful library for data manipulation and analysis in Python.

2. **Selecting Categorical Features:**
   - `categorical_features`: A list containing the names of categorical features in your DataFrame ('df'). These are the features that will be one-hot encoded.

3. **Performing One-Hot Encoding:**
   - `pd.get_dummies(df, columns=categorical_features, drop_first=True)`: Uses the `get_dummies` function from pandas to perform one-hot encoding on the specified categorical features. The `drop_first=True` parameter drops the first category in each feature to avoid multicollinearity.

4. **Creating a New DataFrame:**
   - `df_encoded`: Stores the new DataFrame with one-hot encoded features. The original DataFrame 'df' is unchanged.

5. **Displaying the First Few Rows:**
   - `df_encoded.head()`: Displays the first five rows of the one-hot encoded DataFrame as the cell output. This helps you inspect the changes and confirm the encoding was performed correctly.

This code is useful when dealing with categorical features in machine learning, as it transforms them into a format suitable for training models that require numerical input.
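
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical three-row-category frame (the `toy` DataFrame and its values are invented for illustration and are not the bank data). It compares the columns produced with and without dropping the first category.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself encodes nine columns of bank_data.csv.
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "single"],
    "age": [30, 33, 35, 59],
})

full = pd.get_dummies(toy, columns=["marital"])
reduced = pd.get_dummies(toy, columns=["marital"], drop_first=True)

print(list(full.columns))     # ['age', 'marital_divorced', 'marital_married', 'marital_single']
print(list(reduced.columns))  # ['age', 'marital_married', 'marital_single']
```

In `reduced`, a row with both `marital_married` and `marital_single` set to false represents the dropped category (`divorced`), so no information is lost while one redundant column is removed.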


In [4]:
# Encode the target feature ('y') with 1 for 'yes' and 0 for 'no'
df_encoded['y_encoded'] = df['y'].map({'yes': 1, 'no': 0})
df_encoded = df_encoded.drop('y', axis=1)
df_encoded.head()

Unnamed: 0.1,Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_blue-collar,job_entrepreneur,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown,y_encoded
0,0,30,1787,19,79,1,-1,0,False,False,...,False,False,False,False,True,False,False,False,True,0
1,1,33,4789,11,220,1,339,4,False,False,...,False,False,True,False,False,False,False,False,False,0
2,2,35,1350,16,185,1,330,1,False,False,...,False,False,False,False,False,False,False,False,False,0
3,3,30,1476,3,199,4,-1,0,False,False,...,True,False,False,False,False,False,False,False,True,0
4,4,59,0,5,226,1,-1,0,True,False,...,False,False,True,False,False,False,False,False,True,0


In [5]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 'df_encoded' is the one-hot encoded DataFrame from the previous cells
# Calculate the variance for each column
variances = df_encoded.var()

# Select columns with variance greater than 50
columns_to_normalize = variances[variances > 50].index.tolist()

# Creating a new DataFrame with only the selected columns to be normalized
df_to_normalize = df_encoded[columns_to_normalize]

# Applying Min-Max Scaling
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df_to_normalize), columns=columns_to_normalize)

# Concatenating the normalized DataFrame with the rest of the original DataFrame
df_result = pd.concat([df_encoded.drop(columns=columns_to_normalize, axis=1), df_normalized], axis=1)

# Display the result
df_result.head()


Unnamed: 0.1,campaign,previous,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,...,poutcome_other,poutcome_success,poutcome_unknown,y_encoded,Unnamed: 0,age,balance,day,duration,pdays
0,1,0,False,False,False,False,False,False,False,False,...,False,False,True,0,0.0,0.161765,0.068455,0.6,0.024826,0.0
1,1,4,False,False,False,False,False,False,True,False,...,False,False,False,0,0.000221,0.205882,0.10875,0.333333,0.0715,0.389908
2,1,1,False,False,False,True,False,False,False,False,...,False,False,False,0,0.000442,0.235294,0.06259,0.5,0.059914,0.379587
3,4,0,False,False,False,True,False,False,False,False,...,False,False,True,0,0.000664,0.161765,0.064281,0.066667,0.064548,0.0
4,1,0,True,False,False,False,False,False,False,False,...,False,False,True,0,0.000885,0.588235,0.044469,0.133333,0.073486,0.0


**Explanation:**

1. **Importing Necessary Libraries:**
   - `import pandas as pd`: Imports the pandas library for data manipulation.
   - `from sklearn.preprocessing import MinMaxScaler`: Imports the MinMaxScaler from scikit-learn for Min-Max scaling.

2. **Calculating Variance and Selecting Columns:**
   - `variances = df_encoded.var()`: Computes the variance for each column in the one-hot encoded DataFrame.
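   - `columns_to_normalize = variances[variances > 50].index.tolist()`: Selects columns with a variance greater than 50. Because variance depends on a feature's units, this threshold is a heuristic for picking out the wide-range numeric columns; the one-hot columns, whose variance is roughly 0.25 at most, are left untouched. In the output above, `campaign` and `previous` also fall below the threshold and stay unscaled.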

3. **Creating DataFrame for Normalization:**
   - `df_to_normalize = df_encoded[columns_to_normalize]`: Creates a new DataFrame containing only the selected columns to be normalized.

4. **Applying Min-Max Scaling:**
   - `scaler = MinMaxScaler()`: Initializes a MinMaxScaler.
   - `df_normalized = pd.DataFrame(scaler.fit_transform(df_to_normalize), columns=columns_to_normalize)`: Applies Min-Max scaling to the selected columns and creates a new DataFrame with the normalized values.

5. **Concatenating DataFrames:**
   - `df_result = pd.concat([df_encoded.drop(columns=columns_to_normalize, axis=1), df_normalized], axis=1)`: Concatenates the normalized DataFrame with the rest of the original DataFrame, dropping the columns selected for normalization.

6. **Displaying the Result:**
   - `df_result.head()`: Displays the first few rows of the resulting DataFrame for inspection.

This process is designed to normalize columns with significant variability (variance > 50) using Min-Max scaling. It ensures that these columns have values between 0 and 1, preventing certain features from dominating others during machine learning model training.
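
Note that the scaling cell above fits the scaler on the full dataset before any train/validation split, so validation rows contribute to the learned minimum and maximum. A common alternative, sketched below on synthetic data under the assumption that the split is done first, fits the scaler on the training rows only and reuses it for the validation rows. The names `X_demo`, `X_tr`, and `X_va` are placeholders and do not appear in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a numeric feature matrix (not the bank data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 2))

X_tr, X_va = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_va_scaled = scaler.transform(X_va)      # reuse the training min/max

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_va_scaled.shape)
```

With this ordering, validation values can fall slightly outside the 0 to 1 range, which is expected: the scaler has never seen them.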


In [6]:
# Drop 'Unnamed: 0', a row-index column carried over from the CSV that has no predictive meaning
df_result = df_result.drop("Unnamed: 0", axis=1)

In [7]:
import optuna
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier




**Explanation:**

1. **Importing Necessary Libraries:**
   - `import optuna`: Imports the Optuna library for hyperparameter optimization.
   - `from sklearn.model_selection import train_test_split`: Imports the `train_test_split` function from scikit-learn for splitting the dataset into training and validation sets.
   - `from sklearn.metrics import accuracy_score, f1_score`: Imports metrics such as accuracy and F1-score for model evaluation.
   - `from sklearn.linear_model import LogisticRegression`: Imports the Logistic Regression classifier.
   - `from sklearn.tree import DecisionTreeClassifier`: Imports the Decision Tree classifier.
   - `from sklearn.ensemble import RandomForestClassifier`: Imports the Random Forest classifier.

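2. **Data Splitting:**
   - `X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)`: Splits the data into training and validation sets using an 80-20 split ratio. This call is not part of the import cell above; it runs in the next cell, once `X` and `y` have been defined.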

3. **Model Evaluation Metrics:**
   - `accuracy_score` and `f1_score`: These metrics will be used to evaluate the performance of the models.

4. **Importing Classification Models:**
   - `LogisticRegression`, `DecisionTreeClassifier`, and `RandomForestClassifier`: These are classifiers from scikit-learn that will be used in the model selection process.

This code sets up the necessary libraries and functions for a machine learning classification task. The next steps would involve using Optuna for hyperparameter tuning, training the models on the training set, and evaluating their performance on the validation set.


In [None]:
# 'X' is the feature matrix, 'y' is the target variable
X = df_result.drop('y_encoded', axis=1)
y = df_result['y_encoded']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
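
The validation set produced by this split contains far fewer positive than negative labels (98 versus 807 in the reports further down). One option in that situation, which the notebook does not use, is a stratified split that keeps the class proportions identical in both subsets. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels; `X_demo` and `y_demo` are placeholders, not the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 10% positives (illustrative only).
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both subsets keep the 10% positive rate of the full label vector.
print(y_tr.mean(), y_va.mean())
```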


In [None]:
def objective_logistic_regression(trial):
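    # Log-uniform sampling of C; suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    C = trial.suggest_float('C', 1e-5, 1e5, log=True)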
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    return f1_score(y_valid, y_pred)

def objective_decision_tree(trial):
    max_depth = trial.suggest_int('max_depth', 1, 32)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
    model = DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples_split)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    return f1_score(y_valid, y_pred)

def objective_random_forest(trial):
    n_estimators = trial.suggest_int('n_estimators', 10, 100)
    max_depth = trial.suggest_int('max_depth', 1, 32)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    return f1_score(y_valid, y_pred)

**Explanation:**

1. **Objective Function for Logistic Regression:**
   - `def objective_logistic_regression(trial)`: Defines the objective function for hyperparameter tuning of the Logistic Regression model.
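   - `C`: The inverse regularization strength, sampled log-uniformly between 1e-5 and 1e5 so that every order of magnitude is explored equally. The sampling call is `trial.suggest_loguniform('C', 1e-5, 1e5)` in older Optuna code; Optuna 3.0 deprecates it in favor of the equivalent `trial.suggest_float('C', 1e-5, 1e5, log=True)`.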
   - `model = LogisticRegression(C=C)`: Initializes a Logistic Regression model with the suggested 'C' value.
   - `model.fit(X_train, y_train)`: Fits the model on the training data.
   - `y_pred = model.predict(X_valid)`: Generates predictions on the validation set.
   - `return f1_score(y_valid, y_pred)`: Evaluates the model's performance using the F1-score and returns it.

2. **Objective Function for Decision Tree:**
   - `def objective_decision_tree(trial)`: Defines the objective function for hyperparameter tuning of the Decision Tree model.
   - `max_depth = trial.suggest_int('max_depth', 1, 32)`: Suggests an integer value for the maximum depth of the tree.
   - `min_samples_split = trial.suggest_int('min_samples_split', 2, 20)`: Suggests an integer value for the minimum number of samples required to split an internal node.
   - `model = DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples_split)`: Initializes a Decision Tree model with the suggested hyperparameters.
   - `model.fit(X_train, y_train)`: Fits the model on the training data.
   - `y_pred = model.predict(X_valid)`: Generates predictions on the validation set.
   - `return f1_score(y_valid, y_pred)`: Evaluates the model's performance using the F1-score and returns it.

3. **Objective Function for Random Forest:**
   - `def objective_random_forest(trial)`: Defines the objective function for hyperparameter tuning of the Random Forest model.
   - `n_estimators = trial.suggest_int('n_estimators', 10, 100)`: Suggests an integer value for the number of trees in the forest.
   - `max_depth = trial.suggest_int('max_depth', 1, 32)`: Suggests an integer value for the maximum depth of the trees.
   - `min_samples_split = trial.suggest_int('min_samples_split', 2, 20)`: Suggests an integer value for the minimum number of samples required to split an internal node.
   - `model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split)`: Initializes a Random Forest model with the suggested hyperparameters.
   - `model.fit(X_train, y_train)`: Fits the model on the training data.
   - `y_pred = model.predict(X_valid)`: Generates predictions on the validation set.
   - `return f1_score(y_valid, y_pred)`: Evaluates the model's performance using the F1-score and returns it.

These objective functions are designed for Optuna's hyperparameter optimization, seeking hyperparameters that maximize the F1-score on the validation set for each respective model.


In [8]:
# Set up Optuna studies for each model
study_lr = optuna.create_study(direction='maximize')
study_dt = optuna.create_study(direction='maximize')
study_rf = optuna.create_study(direction='maximize')



[I 2024-02-01 18:30:09,934] A new study created in memory with name: no-name-66c64296-7fa3-43d1-97ad-21cad029fbaf
[I 2024-02-01 18:30:09,935] A new study created in memory with name: no-name-d60a4f62-ed4d-41da-93f3-e48c1ca1771f
[I 2024-02-01 18:30:09,938] A new study created in memory with name: no-name-496af122-9411-4cc6-82a0-a476e2797683


In [None]:
# Run Optuna optimization for each model
study_lr.optimize(objective_logistic_regression, n_trials=100)




In [None]:
study_dt.optimize(objective_decision_tree, n_trials=100)


In [None]:
study_rf.optimize(objective_random_forest, n_trials=100)


**Explanation:**

1. **Study Object:**
   - `study_rf`: Represents the study or experiment for hyperparameter optimization using Optuna. The `study_rf` object is created before calling the `optimize` method.

2. **Optimization Process:**
   - `study_rf.optimize(objective_random_forest, n_trials=100)`: Initiates the hyperparameter optimization process using the `optimize` method.
   - `objective_random_forest`: The objective function specific to Random Forest is passed as an argument. This function defines the metric to be optimized (in this case, the F1-score on the validation set).
   - `n_trials=100`: Specifies the number of trials or iterations to be conducted during the optimization process. In this case, it will perform 100 trials to search for optimal hyperparameters.

3. **Hyperparameter Search:**
   - Optuna iteratively explores different hyperparameter combinations for the Random Forest model by running trials. Each trial involves training the Random Forest model with a particular set of hyperparameters and evaluating its performance on the validation set using the F1-score.

4. **Objective Function Execution:**
   - The `objective_random_forest` function is executed for each trial, calculating the F1-score based on the model's predictions on the validation set.

5. **Optimization Results:**
   - Optuna keeps track of the best set of hyperparameters that maximize the objective function (F1-score) throughout the trials.

6. **Completion:**
   - Once the specified number of trials (`n_trials`) is completed, the optimization process concludes, and the study object (`study_rf`) contains information about the best hyperparameters found.

This code is an essential step in the hyperparameter tuning process, allowing the algorithm to automatically search for the most effective hyperparameters for the Random Forest classifier within the defined search space.


In [None]:
# Get the best hyperparameters for each model
best_params_lr = study_lr.best_params


# Train the best model for each algorithm on the entire training dataset
model_lr = LogisticRegression(**best_params_lr)
model_lr.fit(X_train, y_train)


In [16]:
best_params_dt = study_dt.best_params

model_dt = DecisionTreeClassifier(**best_params_dt)
model_dt.fit(X_train, y_train)

In [17]:
best_params_rf = study_rf.best_params

model_rf = RandomForestClassifier(**best_params_rf)
model_rf.fit(X_train, y_train)

In [18]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 'y_valid' holds the true labels for the validation set
model_list = [model_lr, model_dt, model_rf]
for model in model_list:
    # Making predictions
    y_pred = model.predict(X_valid)

    # Evaluating the model
    accuracy = accuracy_score(y_valid, y_pred)
    print(f"Accuracy: {accuracy}")

    # Additional evaluation metrics
    print("Classification Report:")
    print(classification_report(y_valid, y_pred))

    print("Confusion Matrix:")
    print(confusion_matrix(y_valid, y_pred))


Accuracy: 0.901657458563536
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       807
           1       0.60      0.28      0.38        98

    accuracy                           0.90       905
   macro avg       0.76      0.63      0.66       905
weighted avg       0.88      0.90      0.88       905

Confusion Matrix:
[[789  18]
 [ 71  27]]
Accuracy: 0.8895027624309392
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.94      0.94       807
           1       0.49      0.48      0.48        98

    accuracy                           0.89       905
   macro avg       0.71      0.71      0.71       905
weighted avg       0.89      0.89      0.89       905

Confusion Matrix:
[[758  49]
 [ 51  47]]
Accuracy: 0.9027624309392265
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.97      0.95       807


**Explanation:**

1. **Importing Evaluation Metrics:**
   - `from sklearn.metrics import accuracy_score, classification_report, confusion_matrix`: Imports necessary metrics from scikit-learn for evaluating classification models, including accuracy, classification report, and confusion matrix.

2. **Model Evaluation Loop:**
   - `model_list = [model_lr, model_dt, model_rf]`: Creates a list containing the trained classification models (`model_lr`, `model_dt`, and `model_rf`) that you want to evaluate.
   - `for model in model_list:`: Iterates through each model in the list.

3. **Making Predictions:**
   - `y_pred = model.predict(X_valid)`: Generates predictions using the current model (`model`) on the validation set (`X_valid`).

4. **Accuracy Calculation:**
   - `accuracy = accuracy_score(y_valid, y_pred)`: Calculates the accuracy of the model by comparing its predictions (`y_pred`) with the true labels for the validation set (`y_valid`).
   - `print(f"Accuracy: {accuracy}")`: Displays the calculated accuracy.

5. **Additional Evaluation Metrics:**
   - `print("Classification Report:")`: Prints a header for the classification report.
   - `print(classification_report(y_valid, y_pred))`: Generates and prints a detailed classification report, including precision, recall, and F1-score, for the model's predictions on the validation set.

6. **Confusion Matrix:**
   - `print("Confusion Matrix:")`: Prints a header for the confusion matrix.
   - `print(confusion_matrix(y_valid, y_pred))`: Computes and prints the confusion matrix to visualize the model's performance in terms of true positive, true negative, false positive, and false negative predictions.

This code segment is designed to systematically evaluate multiple classification models (Logistic Regression, Decision Tree, and Random Forest) on a validation set. It provides key performance metrics, including accuracy, a detailed classification report, and a confusion matrix for each model.
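
The per-class figures in the classification report follow directly from the confusion matrix. As a check, the sketch below recomputes precision, recall, and F1-score for the positive class from the first two matrices printed above, which correspond to Logistic Regression and the Decision Tree given the order of `model_list`. The helper name `positive_class_metrics` is hypothetical.

```python
def positive_class_metrics(cm):
    """Precision, recall and F1 for class 1 from a 2x2 confusion matrix.

    Layout follows scikit-learn: rows are true labels, columns are
    predictions, so cm = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Matrices copied from the printed output above.
cm_lr = [[789, 18], [71, 27]]  # Logistic Regression
cm_dt = [[758, 49], [51, 47]]  # Decision Tree

for name, cm in [("LR", cm_lr), ("DT", cm_dt)]:
    p, r, f1 = positive_class_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# LR: precision=0.60 recall=0.28 f1=0.38
# DT: precision=0.49 recall=0.48 f1=0.48
```

The recomputed values match the class-1 rows of the two reports.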


**Explanation: Choosing Logistic Regression Model**

1. **Competitive Accuracy:**
   - On the validation set, Logistic Regression reached an accuracy of 0.902, essentially tied with Random Forest (0.903) and ahead of the Decision Tree (0.890).
   - Accuracy measures the overall share of correct predictions. Because 807 of the 905 validation samples belong to the negative class, it should be read alongside the per-class precision, recall, and F1-score in the classification reports.

2. **Simplicity and Interpretability:**
   - Logistic Regression is a simple and interpretable model, making it easier to understand and explain to stakeholders.
   - The interpretability of the model is valuable in scenarios where transparency and clear communication of the decision-making process are essential.

3. **Applicability to Binary Classification:**
   - Logistic Regression is well-suited for binary classification problems, such as the prediction of whether a client will subscribe to a term deposit ('yes' or 'no') in this context.
   - It models the probability of the positive class, making it directly applicable to the problem at hand.

4. **Consideration of Business Objectives:**
   - The decision to choose the Logistic Regression model is aligned with the business objectives and requirements outlined in the initial stages of the project.
   - The focus on predicting term deposit subscriptions, a binary outcome, aligns with the strengths of Logistic Regression.

5. **Trade-off Consideration:**
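   - Other models offer more capacity and, on some metrics, better results: the Decision Tree reached a higher F1-score on the positive class (0.48 versus 0.38 for Logistic Regression). The simplicity and comparable accuracy of Logistic Regression were nonetheless deemed sufficient for the given task.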
   - The trade-off between model complexity and performance was carefully considered, favoring a model that strikes a balance suitable for practical deployment.

In conclusion, the choice of the Logistic Regression model is justified based on its competitive accuracy, simplicity, interpretability, applicability to the binary classification task, and alignment with business objectives.
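
When reading these accuracy figures, it helps to compare them with a majority-class baseline, that is, a model that always predicts 'no'. The sketch below derives that baseline from the class support in the classification reports and sets the three reported accuracies against it; the mapping of the third accuracy to Random Forest assumes the output order follows `model_list`.

```python
# Class support from the classification reports: 807 negatives, 98 positives.
n_neg, n_pos = 807, 98
baseline_accuracy = n_neg / (n_neg + n_pos)  # always predict class 0

# Accuracies copied from the printed output.
reported = {
    "Logistic Regression": 0.901657458563536,
    "Decision Tree": 0.8895027624309392,
    "Random Forest": 0.9027624309392265,
}

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
for name, acc in reported.items():
    print(f"{name}: {acc:.4f} ({acc - baseline_accuracy:+.4f} vs baseline)")
```

The baseline comes out at about 0.892, so all three models sit within roughly one percentage point of it, which is why the positive-class recall and F1-score carry more information here than accuracy alone.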
