# Lab Assignment 4
## Learning Objectives
* Demonstrate the understanding of machine learning algorithms and evaluation methods
* Demonstrate the capability of applying machine learning algorithms in practice

## Due Date
**Midnight, Thursday, November 28, 2023**

## Assignment Submission Instructions
When your file is ready, submit the following deliverables to the Lab Assignment 4 dropbox:
* Provide the link to your Google Colab notebook in the comments section; please make sure that **you enable the general access to your notebook with links before submission**. Failure to open your notebook will automatically lead to a grade of 0.
* Upload the notebook file with the `.ipynb` suffix to the submission drop box. The uploaded notebook should have the same content as the one shared through the link, include enough documentation of the code, and have all the outputs available.

## Others
As always, feel free to come to our office hours or email us if you run into any difficulties while completing the assignment. Good luck! For your convenience, I have created the text and code cells you might need for the lab assignment. Please also complete your contact information in the notebook.

## Student's Contact Information:
Name: Sarthak Haldar

Email: sarthakhaldar@arizona.edu

## Part 0: Download Bank Marketing Dataset
For this lab assignment, we will be working with the [Bank Marketing dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) hosted on the UCI machine learning repository.

For the banking industry, an important task is to market their products (e.g., a term deposit or a credit card) to potential customers. However, such tasks are usually challenging as banks need to **cautiously balance the cost of large-scale marketing campaigns and the profit of signing up more customers.**

To address this issue, machine learning models have been widely adopted by the banking industry to identify potential customers and improve marketing effectiveness. In this lab assignment, you are tasked with developing machine learning models to **predict whether a customer would sign up for a term deposit using various features collected by a bank.** You also need to evaluate the performance of each model and recommend the most preferred model to the stakeholders in the marketing department.

In the section below, we provide the code to download two csv files, namely `bank-train.csv` and `bank-test.csv`, for the Bank Marketing dataset. The `bank-train.csv` includes information on **32,158 customers** and the `bank-test.csv` includes information on another **8,040 customers**. For both datasets, there are 11 features that you can use for prediction. Below we list the detailed definitions for each feature:
* age: age of the customer
* housing: whether the customer has a housing loan (0 for no; 1 for yes)
* loan: whether the customer has a personal loan (0 for no; 1 for yes)
* contact: contact communication type (0 for cellular; 1 for telephone)
* campaign: number of contacts performed during this campaign and for this customer
* previous: number of contacts performed before this campaign and for this customer
* emp.var.rate: employment variation rate - quarterly indicator
* cons.price.idx: consumer price index - monthly indicator
* cons.conf.idx: consumer confidence index - monthly indicator
* euribor3m: euribor 3 month rate - daily indicator
* nr.employed: number of employees - quarterly indicator

The label you are going to predict has the name `y`, which indicates whether the customer signed up for the term deposit or not (0 for no; 1 for yes).


In [2]:
from urllib.request import urlretrieve
train = urlretrieve('https://drive.google.com/uc?export=download&id=18FrPPMPgwERqJMC2SGTlQa9zm8leDXlv',
            'bank-train.csv')
test = urlretrieve('https://drive.google.com/uc?export=download&id=1IDiZPO84visgoGPA6FIorgigp5Z-bcJH',
            'bank-test.csv')

## Part 1: Import and Process Data (0.5 Point)
In this section, you need to complete the code for importing both `bank-train.csv` and `bank-test.csv`. The data from `bank-train.csv` will be used for training machine learning models, whereas the data from `bank-test.csv` will be used to evaluate the performance of these models. For each csv file, please create separate variables that store the input features and labels.

In [6]:
import pandas as pd


train_data = pd.read_csv('bank-train.csv')
test_data = pd.read_csv('bank-test.csv')

In [7]:
train_data.head()

Unnamed: 0,age,housing,loan,contact,campaign,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,35,1,0,0,1,0,-1.8,93.075,-47.1,1.445,5099.1,0
1,53,1,1,1,1,0,1.1,93.994,-36.4,4.855,5191.0,0
2,46,0,0,0,1,0,-1.8,93.075,-47.1,1.423,5099.1,1
3,50,1,0,1,2,0,-0.1,93.2,-42.0,4.076,5195.8,0
4,39,0,0,1,3,0,1.4,94.465,-41.8,4.959,5228.1,0


In [20]:
X_train = train_data.drop('y', axis=1)
y_train = train_data['y']

In [9]:
test_data.head()

Unnamed: 0,age,housing,loan,contact,campaign,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,0,0,0,3,0,1.4,93.444,-36.1,4.963,5228.1,0
1,28,1,0,1,1,0,1.1,93.994,-36.4,4.86,5191.0,0
2,18,0,0,0,2,0,-3.0,92.713,-33.0,0.712,5023.5,1
3,31,0,0,0,5,0,1.4,93.444,-36.1,4.964,5228.1,0
4,46,0,0,0,1,0,1.4,93.444,-36.1,4.966,5228.1,0


In [21]:
X_test = test_data.drop('y', axis=1)
y_test = test_data['y']

## Part 2: Apply Machine Learning Classification Methods (6.5 Points)
In this section, you are tasked to train and evaluate various machine learning classification methods.

Specifically, you need to use the training data to separately train **k-NN, Naive Bayes, logistic regression, and decision tree methods.** Once you finish training these models, you then need to predict the labels based on input features from the test data and calculate the performance of each model regarding its accuracy, recall, precision, f1-score, and ROC-AUC.

For the k-NN method, you can specify the number of neighboring points (i.e., the value of k) to be any number you like. Similarly, you can specify the depth of the tree to be any value you like for the decision tree method.

Finally, for the logistic regression method, please also print out the coefficients estimated by the model, and explain the results in 50 words either in the code comment or in another text cell.

Below is the point distribution for this section:
* training and evaluation of k-NN: 1.5 points
* training and evaluation of Naive Bayes: 1.5 points
* training and evaluation of logistic regression: 1.5 points; explanation of logistic regression coefficients: 0.5 point
* training and evaluation of decision tree: 1.5 points

**Extra Points:** extra points will be awarded if you document your hyperparameter tuning process using tables or plots **(e.g., how you selected k in k-NN, the depth of the decision tree, or the list of features included in logistic regression)**.
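The five metrics listed above can be computed together with a small helper. The following is only a sketch: `evaluate_model` is a hypothetical helper name, and the data is a synthetic stand-in generated with `make_classification`, not the bank dataset. It assumes the model exposes `predict_proba`, which holds for the four classifier types named in this section.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_test, y_test):
    """Return accuracy, recall, precision, F1 and ROC-AUC for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    # ROC-AUC needs a score for the positive class, not hard 0/1 labels
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'f1': f1_score(y_test, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_test, y_score),
    }


# Synthetic, imbalanced stand-in data (assumption: NOT the bank dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=11,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
demo_metrics = evaluate_model(demo_model, X_te, y_te)
for name, value in demo_metrics.items():
    print(f'{name}: {value:.3f}')
```

In the notebook itself, the same helper could be called as `evaluate_model(knn_model, X_test, y_test)` once a model has been fitted.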


# KNN

In [47]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [48]:
knn_model = KNeighborsClassifier(n_neighbors=3)

In [49]:
# Training the model
knn_model.fit(X_train, y_train)

In [50]:
# Making predictions on the test data
y_pred = knn_model.predict(X_test)

In [51]:
# Evaluating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of kNN model: {accuracy * 100:.2f}%')

Accuracy of kNN model: 87.28%


## Hyperparameter tuning on kNN post model creation step (Extra credit)

In [52]:
# Defining the hyperparameters and their possible values
param_grid = {
    'n_neighbors': [3, 5, 7, 10],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1 for Manhattan distance, 2 for Euclidean distance
}

In [53]:
# Using GridSearchCV to find the best combination of hyperparameters
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

In [54]:
# Getting the best hyperparameters
best_params = grid_search.best_params_
print(f'Best Hyperparameters: {best_params}')

Best Hyperparameters: {'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}
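To show the tuning process as a table rather than only the winning combination, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a DataFrame. This is a sketch on synthetic stand-in data (an assumption, not the bank dataset); `demo_grid`, `demo_search` and `results_table` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: NOT the bank dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)

demo_grid = {
    'n_neighbors': [3, 5, 7, 10],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}
demo_search = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=5, scoring='accuracy')
demo_search.fit(X_demo, y_demo)

# One row per hyperparameter combination, best cross-validated accuracy first
results_table = (
    pd.DataFrame(demo_search.cv_results_)
    [['param_n_neighbors', 'param_p', 'param_weights',
      'mean_test_score', 'std_test_score']]
    .sort_values('mean_test_score', ascending=False)
    .reset_index(drop=True)
)
print(results_table.head())
```

The same pattern should apply to the notebook's own `grid_search` object after it has been fitted on `X_train` and `y_train`.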


In [55]:
# Creating the model with the best hyperparameters

best_knn_model = KNeighborsClassifier(n_neighbors=10, p = 1, weights = 'uniform')



In [56]:
# Training the model with the best hyperparameters
best_knn_model.fit(X_train, y_train)

In [57]:
# Make predictions on the test data
y_pred_best_knn = best_knn_model.predict(X_test)

In [58]:
# Evaluating the accuracy of the best kNN model
accuracy_best_knn = accuracy_score(y_test, y_pred_best_knn)
print(f'Accuracy of the best kNN model: {accuracy_best_knn * 100:.2f}%')

Accuracy of the best kNN model: 88.88%


### Hyperparameter tuning therefore produced a better model: test accuracy rose from 87.28% for the initial kNN model to 88.88% for the tuned model.

# Naive Bayes

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [18]:
# Creating a Naive Bayes model
nb_model = MultinomialNB()

In [22]:
# Training the model
nb_model.fit(X_train, y_train)

ValueError: ignored

Multinomial Naive Bayes model cannot handle negative values in the input data. This is because it's commonly used for data like word counts, which are non-negative.

So we will use GaussianNB, which assumes that the features follow a Gaussian distribution.
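The difference between the two variants can be reproduced on a tiny example. The matrix below is made up for illustration (an assumption, not rows from the bank data); its second column contains negative numbers, as `emp.var.rate` and `cons.conf.idx` do.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up illustrative values; the second column contains negatives
X_small = np.array([[1.0, -1.8], [2.0, 1.1], [3.0, -0.1], [4.0, 1.4]])
y_small = np.array([0, 1, 0, 1])

try:
    MultinomialNB().fit(X_small, y_small)
    multinomial_failed = False
except ValueError as err:
    # MultinomialNB rejects negative feature values
    multinomial_failed = True
    print(f'MultinomialNB raised ValueError: {err}')

# GaussianNB models each feature with a per-class normal distribution,
# so negative values are not a problem
gaussian_model = GaussianNB().fit(X_small, y_small)
print(f'GaussianNB fitted; classes seen: {gaussian_model.classes_}')
```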

In [23]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [24]:
# Create Gaussian Naive Bayes model
nb_model = GaussianNB()

In [25]:
# Train the model
nb_model.fit(X_train, y_train)

In [26]:
# Making predictions on the test data
y_pred_nb = nb_model.predict(X_test)

In [27]:
# Evaluating the accuracy of the Naive Bayes model
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f'Accuracy of Naive Bayes model: {accuracy_nb * 100:.2f}%')

Accuracy of Naive Bayes model: 74.75%


## Applying Grid Search Cross validation for tuning on Gaussian Naive Bayes post model creation step (Extra credit)

In [60]:
# Defining the hyperparameters and their possible values
param_grid = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6]  # Adjust these values based on your specific needs
}

In [61]:
# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(nb_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

In [62]:
# Getting the best hyperparameters
best_params = grid_search.best_params_
print(f'Best Hyperparameters: {best_params}')

Best Hyperparameters: {'var_smoothing': 1e-07}


In [63]:
# Creating Gaussian Naive Bayes model with the best hyperparameters
best_nb_model = GaussianNB(var_smoothing=1e-7)


In [64]:
# Training the model with the best hyperparameters
best_nb_model.fit(X_train, y_train)

In [65]:
# Making predictions on the test data
y_pred_best_nb = best_nb_model.predict(X_test)

In [66]:
# Evaluating the accuracy of the best Gaussian Naive Bayes model
accuracy_best_nb = accuracy_score(y_test, y_pred_best_nb)
print(f'Accuracy of the best Gaussian Naive Bayes model: {accuracy_best_nb * 100:.2f}%')

Accuracy of the best Gaussian Naive Bayes model: 74.75%


### Hyperparameter tuning did not change the accuracy of our Naive Bayes model: both the original and the tuned model have an accuracy of 74.75%.

# Logistic Regression

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [29]:
# Creating Logistic Regression model
logreg_model = LogisticRegression()

In [30]:
# Training the model
logreg_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [31]:
# Making predictions on the test data
y_pred_logreg = logreg_model.predict(X_test)

In [32]:
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)
print(f'Accuracy of Logistic Regression model: {accuracy_logreg * 100:.2f}%')

Accuracy of Logistic Regression model: 88.82%
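The convergence warning above suggests either raising `max_iter` or scaling the data. One possible way to do both is a pipeline that standardizes the features before fitting. This is a sketch on synthetic stand-in data (an assumption, not the bank dataset), with one column inflated to mimic a large-magnitude feature such as `nr.employed`; `scaled_logreg` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the bank dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=11, random_state=0)
# Inflate one column so the features live on very different scales
X_demo[:, 0] = X_demo[:, 0] * 100 + 5000

# Standardize every feature to zero mean and unit variance, then fit the model
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually needed
n_iterations = scaled_logreg.named_steps['logisticregression'].n_iter_[0]
print(f'lbfgs iterations used: {n_iterations}')
```

Because the scaler is part of the pipeline, it is fitted on the training data only and then reused unchanged when calling `predict` on test data.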


## Logistic Regression Coefficient explanation

Logistic Regression is a linear model used for binary classification, and it's often used when the dependent variable is categorical and has two classes. The logistic function (sigmoid) is applied to the linear combination of input features and their corresponding weights (coefficients) to transform the output into a probability between 0 and 1.

A Logistic Regression model looks like:

logit(p) = a + bX₁ + cX₂ (Equation **)

logit(p) is shorthand for log(p/(1-p)), where p = P{Y = 1}, i.e. the probability of “success”, or the presence of an outcome. X₁ and X₂ are the predictor variables, and b and c are their corresponding coefficients, each of which determines how strongly X₁ and X₂ influence the final outcome Y (or p). Finally, a is the intercept.

Since Y takes only the values 0 and 1, 1-p = P{Y=0}, so logit(p) = log(P{Y=1}/P{Y=0}). This quantity is called the log-odds. A coefficient b therefore means that a one-unit increase in X₁, holding the other features fixed, changes the log-odds by b, i.e. multiplies the odds by e^b.

Source: https://towardsdatascience.com/a-simple-interpretation-of-logistic-regression-coefficients-e3a40a62e8cf
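The coefficients themselves can be printed from a fitted model through its `coef_` and `intercept_` attributes. The sketch below uses synthetic stand-in data (an assumption, not the bank dataset) that merely borrows the eleven feature names, so the numbers it prints carry no meaning for the real problem; `coef_table` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

feature_names = ['age', 'housing', 'loan', 'contact', 'campaign', 'previous',
                 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
                 'euribor3m', 'nr.employed']

# Synthetic stand-in data with the same column names (assumption: NOT the bank dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=len(feature_names),
                                     random_state=0)
X_demo = pd.DataFrame(X_demo, columns=feature_names)

demo_logreg = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem; exp(coef) is the odds ratio
coef_table = pd.DataFrame({
    'feature': feature_names,
    'coefficient': demo_logreg.coef_[0],
    'odds_ratio': np.exp(demo_logreg.coef_[0]),
})
print(f'intercept: {demo_logreg.intercept_[0]:.4f}')
print(coef_table)
```

A positive coefficient (odds ratio above 1) raises the predicted odds of `y = 1` as the feature increases; a negative coefficient lowers them. Coefficients are only comparable in size across features when the features are on similar scales.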

## Hyperparameter tuning on Logistic Regression post model creation step (Extra credit)

In [67]:
# Defining the hyperparameters and their possible values
param_grid = {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [100, 200, 300]  # Adjust these values based on your specific needs
}


In [69]:
# Using GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(logreg_model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [70]:
# Getting the best hyperparameters
best_params = grid_search.best_params_
print(f'Best Hyperparameters: {best_params}')

Best Hyperparameters: {'C': 10, 'max_iter': 300, 'penalty': 'l2', 'solver': 'lbfgs'}


In [71]:
# Creating Logistic Regression model with the best hyperparameters
best_logreg_model = LogisticRegression(C=10, max_iter=300, penalty='l2', solver='lbfgs')


In [72]:
# Training the model with the best hyperparameters
best_logreg_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [73]:
# Making predictions on the test data
y_pred_best_logreg = best_logreg_model.predict(X_test)

In [74]:
# Evaluate the accuracy of the best Logistic Regression model
accuracy_best_logreg = accuracy_score(y_test, y_pred_best_logreg)
print(f'Accuracy of the best Logistic Regression model: {accuracy_best_logreg * 100:.2f}%')

Accuracy of the best Logistic Regression model: 89.14%


### Tuning therefore produced a slightly better model: accuracy rose from 88.82% for the previous model to 89.14% for the tuned Logistic Regression model.

# Decision Tree

In [33]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


In [34]:
# Create Decision Tree model
dt_model = DecisionTreeClassifier()

In [35]:
# Training the model
dt_model.fit(X_train, y_train)

In [36]:
# Making predictions on the test data
y_pred_dt = dt_model.predict(X_test)

In [37]:
# Evaluating the accuracy of the Decision Tree model
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f'Accuracy of Decision Tree model: {accuracy_dt * 100:.2f}%')

Accuracy of Decision Tree model: 85.88%


## Hyperparameter tuning on decision tree post model creation step (Extra credit)

In [39]:
# Defining the hyperparameters and their possible values
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

In [41]:
from sklearn.model_selection import GridSearchCV


# Using GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(dt_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

In [42]:
# Getting the best hyperparameters
best_params = grid_search.best_params_
print(f'Best Hyperparameters: {best_params}')

Best Hyperparameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}


In [43]:
# Creating the model with the best hyperparameters

best_dt_model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=1, min_samples_split=2)


In [44]:
# Training the model with the best hyperparameters
best_dt_model.fit(X_train, y_train)

In [45]:
# Making predictions on the test data
y_pred_best_dt = best_dt_model.predict(X_test)

In [46]:
# Evaluating the accuracy of the best Decision Tree model
accuracy_best_dt = accuracy_score(y_test, y_pred_best_dt)
print(f'Accuracy of the best Decision Tree model: {accuracy_best_dt * 100:.2f}%')

Accuracy of the best Decision Tree model: 89.15%


### Hyperparameter tuning therefore produced a better model: test accuracy rose from 85.88% for the initial decision tree model to 89.15% for the tuned model.

## Part 3: Summarize Your Findings and Make Recommendation (1 Point)
Please use the following text section to summarize your findings on the performance of different machine learning methods. Based on your findings, please make your recommendation regarding which machine learning model to use for future marketing campaigns. When making your recommendation, please keep in mind that there might be many more customers who declined to sign up for the deposit than customers who signed up.

# Findings

**k-Nearest Neighbors (kNN):**

The best hyperparameters for kNN were found to be {n_neighbors=10, p=1, weights='uniform'}.
The accuracy of the tuned kNN model on the test set was approximately 88.88%.

**Decision Tree:**

After hyperparameter tuning, the best hyperparameters were {max_depth=5, min_samples_leaf=1, min_samples_split=2}.
The accuracy of the Decision Tree model on the test set was approximately 89.15%.

**Gaussian Naive Bayes:**

The best hyperparameter found was {var_smoothing=1e-7}.
The accuracy of the tuned Gaussian Naive Bayes model on the test set was approximately 74.75%.

**Logistic Regression:**

The best hyperparameters for Logistic Regression were {C=10, max_iter=300, penalty='l2', solver='lbfgs'}.
The accuracy of the tuned Logistic Regression model on the test set was approximately 89.14%.




**Among the models tested, the tuned Decision Tree model achieved the highest test-set accuracy (89.15%), narrowly ahead of the tuned Logistic Regression model (89.14%).**

In [77]:
sample_input = [[35, 1, 0, 0, 1, 0, -1.8, 93.075, -47.1, 1.445, 5099.1]]

# Making predictions using the trained Decision Tree model
predictions = best_dt_model.predict(sample_input)

# Display the predictions
print(f'Predictions for custom input: {predictions}')

Predictions for custom input: [0]




In [81]:
sample_input2 = [[18, 0, 0, 0, 2, 0, -3.0, 92.713, -33.0, 0.712, 5023.5]]
# Making predictions using the trained Decision Tree model
predictions2 = best_dt_model.predict(sample_input2)

# Display the predictions
print(f'Predictions for custom input: {predictions2}')

Predictions for custom input: [1]




# Recommendation

It's essential to consider the trade-off between false positives and false negatives.
If false positives and false negatives carry different costs, we might want to consider other evaluation metrics such as precision, recall, or the F1 score.
Additionally, we could explore ensemble methods (e.g., Random Forest) or advanced techniques to further improve performance.
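
Why accuracy alone can mislead on imbalanced labels can be seen with a majority-class baseline. The sketch below uses made-up labels with roughly 11% positives; that rate is an assumption chosen for illustration, not a figure measured from the bank data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# Made-up labels with roughly 11% positives (assumption for illustration only)
y_fake = (rng.random(5000) < 0.11).astype(int)
# The baseline ignores its features, so a constant column is enough
X_fake = np.zeros((len(y_fake), 1))

# Always predicts the most frequent class, i.e. "no sign-up"
baseline = DummyClassifier(strategy='most_frequent').fit(X_fake, y_fake)
y_base = baseline.predict(X_fake)

baseline_accuracy = accuracy_score(y_fake, y_base)
baseline_recall = recall_score(y_fake, y_base, zero_division=0)
print(f'Baseline accuracy: {baseline_accuracy:.3f}')
print(f'Baseline recall:   {baseline_recall:.3f}')
```

A model that never predicts a sign-up scores high on accuracy here while finding no customers at all, which is why recall, precision, F1 and ROC-AUC on the real test set would be worth comparing before settling on a model.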