## Intro

This is a project to determine whether a person will churn or not for the telecom operator Interconnect. I have four datasets to work with for this project. A contract dataset, personal dataset, internet dataset, and a phone dataset. Each on contains different information based off of the customer ID. The EndDate column will be my target and the rest will be my features to help determine what customers will churn or not. 

## Data Preprocessing

In [1]:
import pandas as pd

Loading in the datasets

In [2]:
contract = pd.read_csv('/datasets/final_provider/contract.csv')

In [3]:
personal = pd.read_csv('/datasets/final_provider/personal.csv')

In [4]:
internet = pd.read_csv('/datasets/final_provider/internet.csv')

In [5]:
phone = pd.read_csv('/datasets/final_provider/phone.csv')

Checking each dataset to see what they contain

In [6]:
contract.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65


Contract dataset contains the contracts information

In [7]:
personal.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


Personal dataset contains the clients personal data

In [8]:
internet.head()

Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,Fiber optic,No,No,No,No,No,No


Internet dataset contains information about the internet services

In [9]:
phone.head()

Unnamed: 0,customerID,MultipleLines
0,5575-GNVDE,No
1,3668-QPYBK,No
2,9237-HQITU,No
3,9305-CDSKC,Yes
4,1452-KIOVK,Yes


Phone dataset contains information about telephone services

Merging the datasets into one dataset

In [10]:
df_merged = contract.merge(personal, on='customerID', how='left')\
                       .merge(internet, on='customerID', how='left')\
                       .merge(phone, on='customerID', how='left')

In [11]:
df_merged.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No


Datasets have been successfully merged on the customerID column

Checking df info

In [12]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   gender            7043 non-null   object 
 9   SeniorCitizen     7043 non-null   int64  
 10  Partner           7043 non-null   object 
 11  Dependents        7043 non-null   object 
 12  InternetService   5517 non-null   object 
 13  OnlineSecurity    5517 non-null   object 
 14  OnlineBackup      5517 non-null   object 
 15  DeviceProtection  5517 non-null   object 
 16  TechSupport       5517 non-null   object 


There are some missing values and some types that need to be changed. Also going to make the columns easier to read

Making the columns easier to read

In [13]:
df_merged.columns = df_merged.columns.str.lower()
df_merged.head()

Unnamed: 0,customerid,begindate,enddate,type,paperlessbilling,paymentmethod,monthlycharges,totalcharges,gender,seniorcitizen,partner,dependents,internetservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,multiplelines
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No


In [14]:
df_merged.columns = df_merged.columns.str.replace('customerid', 'customer_id') \
                                      .str.replace('begindate', 'begin_date') \
                                      .str.replace('enddate', 'end_date') \
                                      .str.replace('type', 'contract_type') \
                                      .str.replace('paperlessbilling', 'paperless_billing') \
                                      .str.replace('paymentmethod', 'payment_method') \
                                      .str.replace('monthlycharges', 'monthly_charges') \
                                      .str.replace('totalcharges', 'total_charges') \
                                      .str.replace('gender', 'gender') \
                                      .str.replace('seniorcitizen', 'senior_citizen') \
                                      .str.replace('partner', 'partner') \
                                      .str.replace('dependents', 'dependents') \
                                      .str.replace('internetservice', 'internet_service') \
                                      .str.replace('onlinesecurity', 'online_security') \
                                      .str.replace('onlinebackup', 'online_backup') \
                                      .str.replace('deviceprotection', 'device_protection') \
                                      .str.replace('techsupport', 'tech_support') \
                                      .str.replace('streamingtv', 'streaming_tv') \
                                      .str.replace('streamingmovies', 'streaming_movies') \
                                      .str.replace('multiplelines', 'multiple_lines')

df_merged.head()

Unnamed: 0,customer_id,begin_date,end_date,contract_type,paperless_billing,payment_method,monthly_charges,total_charges,gender,senior_citizen,partner,dependents,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,multiple_lines
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No


Checking the column types

In [16]:
df_merged.dtypes

customer_id           object
begin_date            object
end_date              object
contract_type         object
paperless_billing     object
payment_method        object
monthly_charges      float64
total_charges         object
gender                object
senior_citizen         int64
partner               object
dependents            object
internet_service      object
online_security       object
online_backup         object
device_protection     object
tech_support          object
streaming_tv          object
streaming_movies      object
multiple_lines        object
dtype: object

Changing dates to datetime and total charges to float

In [18]:
df_merged['begin_date'] = pd.to_datetime(df_merged['begin_date'], errors='coerce')
df_merged['end_date'] = pd.to_datetime(df_merged['end_date'], errors='coerce')

df_merged['total_charges'] = pd.to_numeric(df_merged['total_charges'], errors='coerce')

df_merged.dtypes

customer_id                  object
begin_date           datetime64[ns]
end_date             datetime64[ns]
contract_type                object
paperless_billing            object
payment_method               object
monthly_charges             float64
total_charges               float64
gender                       object
senior_citizen                int64
partner                      object
dependents                   object
internet_service             object
online_security              object
online_backup                object
device_protection            object
tech_support                 object
streaming_tv                 object
streaming_movies             object
multiple_lines               object
dtype: object

In [None]:
Changing object columns to category columns

In [19]:
categorical_columns = [
    'contract_type', 'paperless_billing', 'payment_method', 'gender',
    'partner', 'dependents', 'internet_service', 'online_security',
    'online_backup', 'device_protection', 'tech_support', 'streaming_tv',
    'streaming_movies', 'multiple_lines'
]

df_merged[categorical_columns] = df_merged[categorical_columns].astype('category')

df_merged.dtypes

customer_id                  object
begin_date           datetime64[ns]
end_date             datetime64[ns]
contract_type              category
paperless_billing          category
payment_method             category
monthly_charges             float64
total_charges               float64
gender                     category
senior_citizen                int64
partner                    category
dependents                 category
internet_service           category
online_security            category
online_backup              category
device_protection          category
tech_support               category
streaming_tv               category
streaming_movies           category
multiple_lines             category
dtype: object

Checking for duplicates

In [20]:
duplicate_rows = df_merged[df_merged.duplicated()]

num_duplicates = duplicate_rows.shape[0]

num_duplicates, duplicate_rows.head()

(0,
 Empty DataFrame
 Columns: [customer_id, begin_date, end_date, contract_type, paperless_billing, payment_method, monthly_charges, total_charges, gender, senior_citizen, partner, dependents, internet_service, online_security, online_backup, device_protection, tech_support, streaming_tv, streaming_movies, multiple_lines]
 Index: [])

There are no duplicates

Checking for missing values

In [21]:
missing_values = df_merged.isnull().sum()

missing_values[missing_values > 0]

end_date             5174
total_charges          11
internet_service     1526
online_security      1526
online_backup        1526
device_protection    1526
tech_support         1526
streaming_tv         1526
streaming_movies     1526
multiple_lines        682
dtype: int64

In [23]:
df_merged.head()

Unnamed: 0,customer_id,begin_date,end_date,contract_type,paperless_billing,payment_method,monthly_charges,total_charges,gender,senior_citizen,partner,dependents,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,multiple_lines
0,7590-VHVEG,2020-01-01,NaT,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,
1,5575-GNVDE,2017-04-01,NaT,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
3,7795-CFOCW,2016-05-01,NaT,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,
4,9237-HQITU,2019-09-01,2019-11-01,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No


filling in the missing end dates with a placeholder date then changing the placeholder date to 0 and the other dates to 1 indicating that they have churned and saving it in a new churn column and dropping the end_date column

In [24]:
df_merged['end_date'] = df_merged['end_date'].fillna(pd.Timestamp('2099-12-31'))

df_merged['churn'] = df_merged['end_date'].apply(lambda x: 0 if x == pd.Timestamp('2099-12-31') else 1)

df_merged = df_merged.drop(columns=['end_date'])

In [25]:
df_merged.head()

Unnamed: 0,customer_id,begin_date,contract_type,paperless_billing,payment_method,monthly_charges,total_charges,gender,senior_citizen,partner,dependents,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,multiple_lines,churn
0,7590-VHVEG,2020-01-01,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,,0
1,5575-GNVDE,2017-04-01,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No,0
2,3668-QPYBK,2019-10-01,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No,1
3,7795-CFOCW,2016-05-01,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,,0
4,9237-HQITU,2019-09-01,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No,1


Filling in the missing values in total charges column with the median

In [26]:
df_merged['total_charges'] = df_merged['total_charges'].fillna(df_merged['total_charges'].median())

Filling in the missing values in the service columns with No

In [28]:
service_columns = [
    'internet_service', 'online_security', 'online_backup', 
    'device_protection', 'tech_support', 'streaming_tv', 
    'streaming_movies', 'multiple_lines'
]
for column in service_columns:
    if 'No' not in df_merged[column].cat.categories:
        df_merged[column] = df_merged[column].cat.add_categories('No')
df_merged[service_columns] = df_merged[service_columns].fillna('No')

missing_values_after = df_merged.isnull().sum()

missing_values_after[missing_values_after > 0]

Series([], dtype: int64)

In [29]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   customer_id        7043 non-null   object        
 1   begin_date         7043 non-null   datetime64[ns]
 2   contract_type      7043 non-null   category      
 3   paperless_billing  7043 non-null   category      
 4   payment_method     7043 non-null   category      
 5   monthly_charges    7043 non-null   float64       
 6   total_charges      7043 non-null   float64       
 7   gender             7043 non-null   category      
 8   senior_citizen     7043 non-null   int64         
 9   partner            7043 non-null   category      
 10  dependents         7043 non-null   category      
 11  internet_service   7043 non-null   category      
 12  online_security    7043 non-null   category      
 13  online_backup      7043 non-null   category      
 14  device_p

Missing values have been handled

Label encoding the categorical columns

In [30]:
from sklearn.preprocessing import LabelEncoder

df_label_encoded = df_merged.copy()

label_encoder = LabelEncoder()

for column in categorical_columns:
    df_label_encoded[column] = label_encoder.fit_transform(df_label_encoded[column])

df_label_encoded.head()

Unnamed: 0,customer_id,begin_date,contract_type,paperless_billing,payment_method,monthly_charges,total_charges,gender,senior_citizen,partner,dependents,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,multiple_lines,churn
0,7590-VHVEG,2020-01-01,0,1,2,29.85,29.85,0,0,1,0,0,0,1,0,0,0,0,0,0
1,5575-GNVDE,2017-04-01,1,0,3,56.95,1889.5,1,0,0,0,0,1,0,1,0,0,0,0,0
2,3668-QPYBK,2019-10-01,0,1,3,53.85,108.15,1,0,0,0,0,1,1,0,0,0,0,0,1
3,7795-CFOCW,2016-05-01,1,0,0,42.3,1840.75,1,0,0,0,0,1,0,1,1,0,0,0,0
4,9237-HQITU,2019-09-01,0,1,2,70.7,151.65,0,0,0,0,1,0,0,0,0,0,0,0,1


Columns have been successfully label encoded

Since we have a beginning date, I would like to change that to tenure to see how long they have been a customer for. The reference date is from the contract information being valid as of Feburary 1, 2020

In [31]:
import numpy as np

reference_date = pd.Timestamp('2020-02-01')

df_label_encoded['tenure_months'] = ((reference_date - df_label_encoded['begin_date']) / np.timedelta64(1, 'M')).astype(int)

df_label_encoded = df_label_encoded.drop(columns=['begin_date'])

df_label_encoded.head()

Unnamed: 0,customer_id,contract_type,paperless_billing,payment_method,monthly_charges,total_charges,gender,senior_citizen,partner,dependents,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,multiple_lines,churn,tenure_months
0,7590-VHVEG,0,1,2,29.85,29.85,0,0,1,0,0,0,1,0,0,0,0,0,0,1
1,5575-GNVDE,1,0,3,56.95,1889.5,1,0,0,0,0,1,0,1,0,0,0,0,0,34
2,3668-QPYBK,0,1,3,53.85,108.15,1,0,0,0,0,1,1,0,0,0,0,0,1,4
3,7795-CFOCW,1,0,0,42.3,1840.75,1,0,0,0,0,1,0,1,1,0,0,0,0,45
4,9237-HQITU,0,1,2,70.7,151.65,0,0,0,0,1,0,0,0,0,0,0,0,1,5


Scaling the numerical features

In [32]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

numerical_columns = ['monthly_charges', 'total_charges', 'tenure_months']

df_label_encoded[numerical_columns] = scaler.fit_transform(df_label_encoded[numerical_columns])

df_label_encoded.head()

Unnamed: 0,customer_id,contract_type,paperless_billing,payment_method,monthly_charges,total_charges,gender,senior_citizen,partner,dependents,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,multiple_lines,churn,tenure_months
0,7590-VHVEG,0,1,2,-1.160323,-0.994242,0,0,1,0,0,0,1,0,0,0,0,0,0,-1.32308
1,5575-GNVDE,1,0,3,-0.259629,-0.173244,1,0,0,0,0,1,0,1,0,0,0,0,0,0.044891
2,3668-QPYBK,0,1,3,-0.36266,-0.959674,1,0,0,0,0,1,1,0,0,0,0,0,1,-1.198719
3,7795-CFOCW,1,0,0,-0.746535,-0.194766,1,0,0,0,0,1,0,1,1,0,0,0,0,0.500881
4,9237-HQITU,0,1,2,0.197365,-0.94047,0,0,0,0,1,0,0,0,0,0,0,0,1,-1.157266


Data has been successfully preprocessed and is ready for Model Training

## Model 

Splitting the dataset

In [33]:
from sklearn.model_selection import train_test_split

X = df_label_encoded.drop(columns=['customer_id', 'churn'])
y = df_label_encoded['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345, stratify=y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((5634, 18), (1409, 18), (5634,), (1409,))

Training several different models to try to get the best auc-roc and accuracy score

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

In [37]:
log_reg = LogisticRegression(max_iter=1000, random_state=12345)
random_forest = RandomForestClassifier(random_state=12345)
grad_boost = GradientBoostingClassifier(random_state=12345)

models = {'Logistic Regression': log_reg, 
          'Random Forest': random_forest, 
          'Gradient Boosting': grad_boost}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    
    auc = roc_auc_score(y_test, y_pred_prob)
    accuracy = accuracy_score(y_test, y_pred)
    
    results[name] = {'AUC-ROC': auc, 'Accuracy': accuracy}

results_df = pd.DataFrame(results).T

results_df

Unnamed: 0,AUC-ROC,Accuracy
Logistic Regression,0.822222,0.786373
Random Forest,0.877844,0.838183
Gradient Boosting,0.905726,0.863023


After training the several models, the gradient boosting model scored the highest in both auc-roc and accuracy

Tuning the gradient boosting model to try and get a higher score

In [38]:
param_sets = [
    {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 3, 'min_samples_split': 2, 'min_samples_leaf': 1},
    {'n_estimators': 200, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_split': 5, 'min_samples_leaf': 2},
    {'n_estimators': 100, 'learning_rate': 0.2, 'max_depth': 3, 'min_samples_split': 5, 'min_samples_leaf': 2},
    {'n_estimators': 200, 'learning_rate': 0.2, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 1}
]

# Dictionary to store the results
manual_results = {}

# Train and evaluate each parameter set
for i, params in enumerate(param_sets):
    model = GradientBoostingClassifier(random_state=42, **params)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    
    auc = roc_auc_score(y_test, y_pred_prob)
    accuracy = accuracy_score(y_test, y_pred)
    
    manual_results[f'Set {i+1}'] = {'Parameters': params, 'AUC-ROC': auc, 'Accuracy': accuracy}

# Convert the results to a dataframe for better visualization
manual_results_df = pd.DataFrame(manual_results).T
print(manual_results_df)

                                              Parameters   AUC-ROC  Accuracy
Set 1  {'n_estimators': 100, 'learning_rate': 0.1, 'm...  0.905685  0.863023
Set 2  {'n_estimators': 200, 'learning_rate': 0.1, 'm...  0.928484  0.885025
Set 3  {'n_estimators': 100, 'learning_rate': 0.2, 'm...  0.916742  0.880057
Set 4  {'n_estimators': 200, 'learning_rate': 0.2, 'm...  0.930166  0.891412


Set 4 had the best score. Going to use that for the model and save it as best_model

In [39]:
best_params = {
    'n_estimators': 200,
    'learning_rate': 0.2,
    'max_depth': 4,
    'min_samples_split': 2,
    'min_samples_leaf': 1
}

best_model = GradientBoostingClassifier(random_state=12345, **best_params)

## Conclusion

In conclusion for this project, I created a model that predcited customer churn. I opened up the datasets and preprocessed them (changed datatypes, filled missing values, scaled and label encoded the data) in order to train them for a model. I tried several different models and the gradient boosting model came up with the highest auc-roc and accuracy score. After getting the high score with the gradient boosting model, I then finely tuned it to try and get a higher auc-roc score and accuracy score which I was able to do. I ended up with an auc-roc score of 93% and an accuracy score of 89%. It is saved as "best_model". 