# Homework 4: Hyperparameter tuning
### Customer churn prediction 

In this assignment, you will apply hyperparameter tuning to machine learning models. You will be working with decision trees and random forests, using SciKit's functionality to find the best combination of hyperparameters for each model. You also need to select the best model. 

The datasets, named ```churn_train.csv``` and ```churn_test.csv```, contain information about telecom customers and their churn (switch to another telecommunications provider) behavior.  
This dataset has 4250 samples with 19 input features and a boolean target variable called "churn". The goal in this assignment is to predict with high accuracy which customers are likely to churn. For your convenicnce, the code for loading and encoding the training data is already provided.

Hyperparameters to be tuned for Randomforest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- n_estimators = number of trees in the foreset
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points placed in a node before the node is split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = method for sampling data points (with or without replacement)

Hyperparameters to be tuned for DecisionTreeClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Look at the Jupyter notebook from class for reference.

Evaluate the model you select on ```churn_test.csv```.

In [29]:
import pandas as pd
df = pd.read_csv("churn_train.csv")
df.head(5)
test_df = pd.read_csv("churn_test.csv")

In [30]:
df.shape


(3500, 20)

In [31]:
test_df.shape

(750, 20)

In [32]:
# Set X and y
X = df.drop(['churn'], axis=1)
y = df['churn']

In [33]:
X_test = test_df.drop(['churn'],axis=1)
y_test= test_df["churn"]


In [34]:
#Encode categorical variables as dummy variables
# Select non-numeric columns
non_numeric_cols = X.select_dtypes(include=['object']).columns
X = pd.get_dummies(X, columns=non_numeric_cols)
X.head()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,...,state_WI,state_WV,state_WY,area_code_area_code_408,area_code_area_code_415,area_code_area_code_510,international_plan_no,international_plan_yes,voice_mail_plan_no,voice_mail_plan_yes
0,146,31,202.5,91,34.43,241.4,108,20.52,169.6,77,...,False,False,False,True,False,False,True,False,False,True
1,126,0,103.7,93,17.63,127.0,107,10.8,329.3,66,...,False,False,False,False,True,False,True,False,True,False
2,61,20,254.4,133,43.25,161.7,96,13.74,251.4,91,...,False,False,False,False,True,False,True,False,False,True
3,116,0,197.9,84,33.64,168.1,113,14.29,239.8,145,...,False,False,False,True,False,False,True,False,True,False
4,103,24,111.8,85,19.01,239.6,102,20.37,268.3,81,...,False,False,False,True,False,False,True,False,False,True


In [35]:
non_numeric_cols_test = X_test.select_dtypes(include=['object']).columns
X_test = pd.get_dummies(X_test, columns=non_numeric_cols)
X_test.head()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,...,state_WI,state_WV,state_WY,area_code_area_code_408,area_code_area_code_415,area_code_area_code_510,international_plan_no,international_plan_yes,voice_mail_plan_no,voice_mail_plan_yes
0,137,0,243.4,114,41.38,121.2,110,10.3,162.6,104,...,False,False,False,False,True,False,True,False,True,False
1,84,0,299.4,71,50.9,61.9,88,5.26,196.9,89,...,False,False,False,True,False,False,False,True,True,False
2,68,43,147.7,95,25.11,259.3,108,22.04,237.1,106,...,False,False,False,True,False,False,True,False,False,True
3,141,37,258.6,84,43.96,222.0,111,18.87,326.4,97,...,False,True,False,False,True,False,False,True,False,True
4,74,0,187.7,127,31.91,163.4,148,13.89,196.0,94,...,False,False,False,False,True,False,True,False,True,False


In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

In [36]:
# Prepare the test set
test_df = pd.read_csv("churn_test.csv")
test_df.head(5)

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
0,NJ,137,area_code_415,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,no
1,OH,84,area_code_408,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,no
2,AR,68,area_code_408,no,yes,43,147.7,95,25.11,259.3,108,22.04,237.1,106,10.67,8.9,8,2.4,3,no
3,WV,141,area_code_415,yes,yes,37,258.6,84,43.96,222.0,111,18.87,326.4,97,14.69,11.2,5,3.02,0,no
4,RI,74,area_code_415,no,no,0,187.7,127,31.91,163.4,148,13.89,196.0,94,8.82,9.1,5,2.46,0,no


In [17]:
# Tune hyperparameters, find the best set of hyperparameters

# Random Forest
# Crossvalidating to find the best hyperparameters
param_grid_rf = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf = RandomForestClassifier()
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=5)
grid_search_rf.fit(X, y)
best_rf_model = grid_search_rf.best_estimator_

In [18]:
best_rf_model

In [19]:
# Find the best hyperparameters
# Decision Tree
param_grid_dt = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeClassifier()
grid_search_dt = GridSearchCV(estimator=dt, param_grid=param_grid_dt, cv=5)
grid_search_dt.fit(X, y)
best_dt_model = grid_search_dt.best_estimator_

In [20]:
best_dt_model

In [21]:
predicted_churn = best_dt_model.predict(X_test)

In [22]:
y_test = predicted_churn

In [23]:
# Train the model with the best combination of hyperparameters using Random Forest
# Create and train the Random Forest model
rf_model = RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=200)
rf_model.fit(X, y)

# Make predictions on the test data
rf_predictions = rf_model.predict(X_test)

# Evaluate the Random Forest model
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_classification_report = classification_report(y_test, rf_predictions)

In [24]:
# Create and train the Decision Tree model
dt_model = DecisionTreeClassifier(max_depth=10, min_samples_leaf=4, min_samples_split=5)
dt_model.fit(X, y)

# Make predictions on the test data
dt_predictions = dt_model.predict(X_test)

# Evaluate the Decision Tree model
dt_accuracy = accuracy_score(y_test, dt_predictions)
dt_classification_report = classification_report(y_test, dt_predictions)

In [25]:
# Evaluate the best model on the test set
print("Decision Tree Model Results:")
print(f"Accuracy: {dt_accuracy}")
print("Classification Report:\n", dt_classification_report)

Decision Tree Model Results:
Accuracy: 0.988
Classification Report:
               precision    recall  f1-score   support

          no       0.99      0.99      0.99       654
         yes       0.96      0.95      0.95        96

    accuracy                           0.99       750
   macro avg       0.98      0.97      0.97       750
weighted avg       0.99      0.99      0.99       750



In [26]:
# Accuracy evaluatuin for the RFM
print("\nRandom Forest Model Results:")
print(f"Accuracy: {rf_accuracy}")
print("Classification Report:\n", rf_classification_report)


Random Forest Model Results:
Accuracy: 0.9613333333333334
Classification Report:
               precision    recall  f1-score   support

          no       0.97      0.98      0.98       654
         yes       0.89      0.80      0.84        96

    accuracy                           0.96       750
   macro avg       0.93      0.89      0.91       750
weighted avg       0.96      0.96      0.96       750



Based on precision, we can inferthat the DTM is more accurate. Based on the f1-score we can see that the DTM is not only more accurate, but less prone to making a mistake of predicting a churn to be 0 or "no" when actually its 1 or "yes". 