
# **MIT5672_Lab4_VernerRoberts**



# Tackle the Telco Customer Churn dataset
In this lab assignment, you will work with the Telco Customer Churn dataset, a resource frequently employed in the telecommunications industry to forecast customer turnover. The dataset offers a range of customer-specific variables such as Monthly Charges and Contract Type, along with a 'Churn' indicator (Yes/No), signaling whether the customer has left the company.

Your objective is to apply five distinct ensemble techniques—Voting, Bagging, Random Forest, AdaBoost, and Stacking—to construct classification models that accurately predict customer churn. Ultimately, you will identify the most effective model based on its accuracy score.


Let's fetch the data and load it:

In [None]:
import pandas as pd

# Read data from URL
url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
df = pd.read_csv(url)

Let's first conduct exploratory data analysis (EDA) to understand the dataset better.

#### **Q1: Show the top few rows of the training set**

In [None]:
# Show the top few rows of the training set

df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


#### **Q2: Show basic information, e.g. the index dtype and columns, non-null values and memory usage**

In [None]:
# Show basic information, e.g. the index dtype and columns, non-null values and memory usage

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


#### **Q3: Use a method which returns description of the numerical data in the DataFrame, e.g. count, mean, std, min, 25%, 50%, 75%, max.**

In [None]:
#  Use a method which returns description of the numerical data in the DataFrame, e.g. count, mean, std, min, 25%, 50%, 75%, max.

df.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |


In [None]:
# Define features and target
X = df.drop(['customerID', 'Churn'], axis=1)
y = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

In [None]:
# Identify numerical and categorical columns
num_cols = X.select_dtypes(include=['float64', 'int64']).columns
cat_cols = X.select_dtypes(include=['object']).columns

In [None]:
print(num_cols)

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges'], dtype='object')


In [None]:
print(cat_cols)

Index(['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'TotalCharges'],
      dtype='object')
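
Note that `TotalCharges` appears in `cat_cols` above: it is read as `object`, so the categorical pipeline would one-hot encode every distinct value as its own feature. A minimal sketch of one possible fix is to coerce the column to numeric before selecting column types. It assumes the non-numeric entries are blank strings, which is worth checking on the real data; the toy frame `toy` and the name `num_cols_toy` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Telco data.
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 52.55],
    "TotalCharges": ["29.85", "1889.5", " "],  # strings, one blank entry
})

# errors="coerce" turns unparseable strings into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# TotalCharges is now picked up by the numeric selector.
num_cols_toy = toy.select_dtypes(include=["float64", "int64"]).columns
print(toy.dtypes)
print(list(num_cols_toy))
print("Missing TotalCharges:", toy["TotalCharges"].isna().sum())
```

Any `NaN` values produced this way would then be filled by whatever imputer the numeric preprocessing uses.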


#### **Q4: Create preprocessors for both numerical and categorical features by using make_pipeline**

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Create preprocessors for both numerical and categorical features by using make_pipeline

num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler()
)

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore")
)
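
To see what the two pipelines above do, here is a small sketch on hypothetical toy columns (`nums`, `cats`) with one missing value each. The pipelines are rebuilt under the names `num_demo` and `cat_demo` so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_demo = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_demo = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Hypothetical toy columns, each with one missing value.
nums = pd.DataFrame({"tenure": [1.0, 3.0, np.nan, 5.0]})
cats = pd.DataFrame(
    {"Contract": ["One year", "Month-to-month", np.nan, "Month-to-month"]}
)

# The missing tenure is filled with the median (3.0), then the column is standardized.
num_out = num_demo.fit_transform(nums)
print(num_out.ravel())

# The missing Contract is filled with the most frequent value, then one-hot encoded.
cat_out = cat_demo.fit_transform(cats).toarray()
print(cat_out)

# A category never seen during fit is encoded as all zeros instead of raising.
unseen = cat_demo.transform(pd.DataFrame({"Contract": ["Two year"]})).toarray()
print(unseen)
```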


#### **Q5: Combine preprocessors by using ColumnTransformer**

In [None]:
from sklearn.compose import ColumnTransformer

# Combine preprocessors by using ColumnTransformer

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols)
])


#### **Q6: Build base models: LogisticRegression and DecisionTreeClassifier**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Build base models: LogisticRegression and DecisionTreeClassifier

lr_model = LogisticRegression()
dt_model = DecisionTreeClassifier()



#### **Q7: Create a dictionary named `ensemble_models` as a container to hold five separate ensemble models: Voting, Bagging, Random Forest, AdaBoost, and Stacking**

In [None]:
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, StackingClassifier

# Create a dictionary named ensemble_models as a container to hold five separate ensemble models: Voting, Bagging, Random Forest, AdaBoost, and Stacking

ensemble_models = {
    'Voting': VotingClassifier(estimators=[('lr', LogisticRegression()), ('dt', DecisionTreeClassifier())], voting='hard'),
    'Bagging': BaggingClassifier(estimator=DecisionTreeClassifier()),
    'Random Forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(estimator=DecisionTreeClassifier()),
    'Stacking': StackingClassifier(estimators=[('lr', LogisticRegression()), ('dt', DecisionTreeClassifier())], final_estimator=LogisticRegression())
}

#### **Q8: Train-test split**

In [None]:
from sklearn.model_selection import train_test_split

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
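
If the churn classes are imbalanced, a plain random split can leave the train and test sets with slightly different class ratios. One option, shown here only as a sketch on hypothetical labels (`X_demo`, `y_demo`), is to pass `stratify` so both sets keep the same proportion of positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 25% positives.
y_demo = np.array([1] * 250 + [0] * 750)
X_demo = np.arange(1000).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```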


#### **Q9:**


1.   Construct a new pipeline which integrates the given `preprocessor` and a `classifier`.
2.   Utilize a `for` loop to iterate through each model in the ensemble_models dictionary.
3.   For each iteration, set the classifier in the pipeline to the current model.
4.   Train the pipeline using the `X_train` and `y_train` datasets.
5.   Compute the accuracy of the trained pipeline on the test dataset (`X_test` and `y_test`).
6.   Print out the accuracy of the model.




**Alternatively, you can create five models (Voting, Bagging, Random Forest, AdaBoost, and Stacking) individually instead of using `for` loop and `pipeline`.**

In [None]:
# create five models (Voting, Bagging, Random Forest, AdaBoost, and Stacking)

from sklearn.metrics import accuracy_score

voting_model = make_pipeline(preprocessor, ensemble_models['Voting'])
voting_model.fit(X_train, y_train)

bagging_model = make_pipeline(preprocessor, ensemble_models['Bagging'])
bagging_model.fit(X_train, y_train)

rf_model = make_pipeline(preprocessor, ensemble_models['Random Forest'])
rf_model.fit(X_train, y_train)

adaboost_model = make_pipeline(preprocessor, ensemble_models['AdaBoost'])
adaboost_model.fit(X_train, y_train)

stacking_model = make_pipeline(preprocessor, ensemble_models['Stacking'])
stacking_model.fit(X_train, y_train)

models = {
    "Voting": voting_model,
    "Bagging": bagging_model,
    "Random Forest": rf_model,
    "AdaBoost": adaboost_model,
    "Stacking": stacking_model,
}

for model_name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy of {model_name}: {accuracy}")




Accuracy of Voting: 0.8055358410220014
Accuracy of Bagging: 0.8005677785663591
Accuracy of Random Forest: 0.7899219304471257
Accuracy of AdaBoost: 0.7799858055358411
Accuracy of Stacking: 0.8211497515968772
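
To pick the most effective model by accuracy, the printed scores can be ranked directly. The sketch below copies the values from the output above into a dictionary named `accuracies` (a name chosen for this example). Because none of the ensemble models sets a `random_state`, the exact numbers will differ from run to run.

```python
# Accuracies copied from the printed output above.
accuracies = {
    "Voting": 0.8055358410220014,
    "Bagging": 0.8005677785663591,
    "Random Forest": 0.7899219304471257,
    "AdaBoost": 0.7799858055358411,
    "Stacking": 0.8211497515968772,
}

# Rank models from highest to lowest accuracy.
ranking = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<14}{acc:.4f}")

best_name, best_acc = ranking[0]
print(f"Best model: {best_name} ({best_acc:.4f})")
```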


#### **Q10: Click Share at the top right. Ensure sharing settings are set to "Anyone with the link can edit." Copy the shared link. Submit this link to the Canvas assignment page.**

#### **Bonus question (5pts): How do you get the feature importance of each variable?**

In [None]:
# Extracting the feature importances

import numpy as np

rf_importances = rf_model.named_steps['randomforestclassifier'].feature_importances_



In [None]:
# Getting the feature names after one-hot encoding (for categorical variables)

cat_feature_names = rf_model.named_steps['columntransformer'].transformers_[1][1].named_steps['onehotencoder'].get_feature_names_out(cat_cols)



In [None]:
# Combining numerical and one-hot-encoded categorical feature names

feature_names = np.concatenate([num_cols, cat_feature_names])



In [None]:
# Sorting feature importances in descending order and taking the indices

sorted_idx = np.argsort(rf_importances)[::-1]



In [None]:
# Printing feature importances

print("Feature Importances:")
for idx in sorted_idx:
    print(f"{feature_names[idx]}: {rf_importances[idx]}")




Streaming output truncated to the last 5000 lines.
TotalCharges_135.2: 0.00022823913479096708
TotalCharges_4512.7: 0.00022793106771927643
TotalCharges_2585.95: 0.00022748294042594444
TotalCharges_2497.35: 0.00022734632533193748
TotalCharges_131.05: 0.0002269705018599379
TotalCharges_1425.45: 0.00022659412301609969
TotalCharges_2665: 0.00022644467992997234
TotalCharges_5464.65: 0.00022620246594811896
TotalCharges_191.05: 0.00022571474708387587
TotalCharges_4822.85: 0.00022547697115704357
TotalCharges_108.15: 0.0002254488679947501
TotalCharges_4065: 0.00022529703886843382
TotalCharges_2375.4: 0.00022496243060696682
TotalCharges_1235.55: 0.0002247552106412565
TotalCharges_435.4: 0.00022382111762314573
TotalCharges_165.2: 0.00022362528979925675
TotalCharges_399.6: 0.00022349052860626975
TotalCharges_6578.55: 0.00022343429357865523
TotalCharges_874.8: 0.0002231022674129255
TotalCharges_7804.15: 0.0002228997325902986
TotalCharges_3655.45: 0.0002224995649895871
TotalCharges_2429