### Business Problem
- You are asked to develop a machine learning model that predicts which customers will leave the company.
- You are expected to perform the necessary data analysis and feature engineering steps before developing the model.

### Dataset Story
- The Telco customer churn dataset contains information about a fictitious telecom company that provided home phone and Internet services to 7043 customers in California in the third quarter. It shows which customers left, stayed, or signed up for service.
- The dataset consists of 21 variables and 7043 observations.
- customerID : Customer ID
- gender : Customer's gender
- SeniorCitizen : Whether the customer is a senior citizen (1, 0)
- Partner : Whether the customer has a partner (Yes, No), e.g. married, living together, or roommates
- Dependents : Whether the customer has dependents (Yes, No), e.g. child, mother, father, grandmother
- tenure : Number of months the customer has stayed with the company
- PhoneService : Whether the customer has phone service (Yes, No)
- MultipleLines : Whether the customer has more than one line (Yes, No, No phone service)
- InternetService : Customer's internet service type (DSL, Fiber optic, No)
- OnlineSecurity : Whether the customer has online security (Yes, No, No Internet service)
- OnlineBackup : Whether the customer has online backup (Yes, No, No Internet service)
- DeviceProtection : Whether the customer has device protection (Yes, No, No Internet service)
- TechSupport : Whether the customer receives technical support (Yes, No, No Internet service)
- StreamingTV : Whether the customer has streaming TV (Yes, No, No Internet service); indicates whether the customer uses the Internet service to stream television programs from a third-party provider
- StreamingMovies : Whether the customer has streaming movies (Yes, No, No Internet service); indicates whether the customer uses the Internet service to stream movies from a third-party provider
- Contract : Duration of the customer's contract (Month-to-month, One year, Two year)
- PaperlessBilling : Whether the customer has a paperless bill (Yes, No)
- PaymentMethod : Customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges : Amount charged to the customer monthly
- TotalCharges : Total amount charged to the customer
- Churn : Whether the customer churned (Yes, No), i.e. left within the last month or quarter

- Each row represents a unique customer. The variables cover customer services, account details, and demographic data.
- Services that customers signed up for => phone, multiple lines, internet, online security, online backup, device protection, technical support, and TV and movie streaming.
- Customer account information => how long they have been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
- Demographic information about customers => gender, age range, and whether they have partners and dependents.
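
The three variable groups described above can be written down as a small lookup, which may be handy later for group-wise analysis. This is only a sketch: the names `COLUMN_GROUPS`, `ID_COL`, `TARGET_COL`, and `all_columns` are illustrative and do not come from the notebook, and the assignment of each column to a group follows the description above.

```python
# Illustrative grouping of the 21 columns (names of these constants are assumptions).
COLUMN_GROUPS = {
    "services": [
        "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
        "OnlineBackup", "DeviceProtection", "TechSupport",
        "StreamingTV", "StreamingMovies",
    ],
    "account": [
        "tenure", "Contract", "PaperlessBilling", "PaymentMethod",
        "MonthlyCharges", "TotalCharges",
    ],
    "demographic": ["gender", "SeniorCitizen", "Partner", "Dependents"],
}
ID_COL = "customerID"
TARGET_COL = "Churn"


def all_columns():
    """Return the identifier, every grouped feature, and the target, in that order."""
    cols = [ID_COL]
    for group in COLUMN_GROUPS.values():
        cols.extend(group)
    cols.append(TARGET_COL)
    return cols


# Sanity check: the groups plus ID and target should account for all 21 variables.
print(len(all_columns()))
```

Once the data is loaded, comparing `set(all_columns())` with `set(df.columns)` is a quick way to confirm that no column was missed or misspelled.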

In [56]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import joblib
import graphviz
import pydotplus
import plotly.graph_objects as go
from scipy import stats
from datetime import date
import warnings
warnings.simplefilter(action="ignore")

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import GridSearchCV, cross_validate, RandomizedSearchCV, validation_curve
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, RobustScaler
from skompiler import skompile

In [57]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.3f' %x)

In [58]:
# Forward slash avoids the invalid "\W" escape sequence and also works on Windows
df = pd.read_csv('data/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### Exploratory Data Analysis

In [59]:
print("#"*50, "Shape", "#"*50)
print(df.shape, "\n")
print("#"*50, "Head", "#"*50)
print(df.head(5), "\n")
print("#"*50, "Tail", "#"*50)
print(df.tail(5), "\n")


################################################## Shape ##################################################
(7043, 21) 

################################################## Head ##################################################
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies        Contract PaperlessBilling              PaymentMethod  MonthlyCharges TotalCharges Churn
0  7590-VHVEG  Female              0     Yes         No       1           No  No phone service             DSL             No          Yes               No          No          No              No  Month-to-month              Yes           Electronic check          29.850        29.85    No
1  5575-GNVDE    Male              0      No         No      34          Yes                No             DSL            Yes           No              Yes          No          No           

In [60]:
print("#"*50, "NA", "#"*50)
print(df.isnull().sum(), "\n")


################################################## NA ##################################################
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64 



In [61]:
print("#"*50, "Types", "#"*50)
print(df.dtypes)

################################################## Types ##################################################
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


In [62]:
# TotalCharges was read as object; convert to float, turning unparseable entries into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
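
Note that the missing-value check in cell 60 ran while `TotalCharges` was still an object column, so its count of 0 says nothing about entries that `errors='coerce'` turns into NaN. The sketch below uses a small made-up frame (not the real data) to show why it may be worth re-checking after the conversion. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# A blank string is a valid (non-null) object value, so isnull() finds nothing yet.
na_before = int(toy["TotalCharges"].isnull().sum())

# Same conversion as in the notebook: unparseable strings become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# After conversion the blank entry shows up as a missing value.
na_after = int(toy["TotalCharges"].isnull().sum())
coerced_rows = toy[toy["TotalCharges"].isnull()]

print(na_before, na_after)
print(coerced_rows)
```

Running `df['TotalCharges'].isnull().sum()` on the real frame after cell 62 would show whether any rows were affected and need to be imputed or dropped before modelling.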

In [63]:
print("#"*50, "Quantiles", "#"*50)
print(df[['SeniorCitizen','tenure','MonthlyCharges','TotalCharges']].quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

################################################## Quantiles ##################################################
                0.000  0.050    0.500    0.950    0.990    1.000
SeniorCitizen   0.000  0.000    0.000    1.000    1.000    1.000
tenure          0.000  1.000   29.000   72.000   72.000   72.000
MonthlyCharges 18.250 19.650   70.350  107.400  114.729  118.750
TotalCharges   18.800 49.605 1397.475 6923.590 8039.883 8684.800


In [64]:
# SeniorCitizen is a 0/1 flag; store it as object so it is treated as categorical
df['SeniorCitizen'] = df['SeniorCitizen'].astype('O')

In [65]:
# Encode the target: 1 for 'Yes' (churned), 0 for any other value
df['Churn'] = df['Churn'].apply(lambda x: 1 if x=='Yes' else 0)
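
The lambda above maps every value other than `'Yes'` to 0, so a typo or a missing label would silently count as "not churned". A stricter alternative, sketched here under the assumption that the raw column holds only `'Yes'` and `'No'`, is to map explicitly and raise on anything else. The helper name `encode_churn` and the toy series are hypothetical, and the helper is meant for the raw labels, not for a column that has already been encoded.

```python
import pandas as pd


def encode_churn(labels: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1/0 and raise if any other label is present."""
    encoded = labels.map({"Yes": 1, "No": 0})
    # Values missing from the mapping become NaN; collect the offending labels.
    unexpected = labels[encoded.isnull()].unique()
    if len(unexpected) > 0:
        raise ValueError(f"Unexpected Churn labels: {list(unexpected)}")
    return encoded.astype(int)


# Toy labels for illustration only.
toy = pd.Series(["No", "No", "Yes", "No", "Yes"], name="Churn")
encoded = encode_churn(toy)

# With a 0/1 target, the mean is the share of churned customers.
churn_rate = encoded.mean()
print(encoded.tolist(), churn_rate)
```

The churn rate computed this way gives a first view of class balance, which matters when choosing evaluation metrics later.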

In [66]:
print("#"*50, "Head", "#"*50)
print(df.head(5), "\n")

################################################## Head ##################################################
   customerID  gender SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies        Contract PaperlessBilling              PaymentMethod  MonthlyCharges  TotalCharges  Churn
0  7590-VHVEG  Female             0     Yes         No       1           No  No phone service             DSL             No          Yes               No          No          No              No  Month-to-month              Yes           Electronic check          29.850        29.850      0
1  5575-GNVDE    Male             0      No         No      34          Yes                No             DSL            Yes           No              Yes          No          No              No        One year               No               Mailed check          56.950      1889.500      0
2  3668-QPYBK    