## Objective: Examine why customers have left in the past and which features matter most for predicting who will churn in the future.

### 1. Set up environment and import libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import sklearn

In [None]:
df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head(5)

### 2. Exploratory Data Analysis
In this section, we will:
- Look for missing values
- Check whether there are any outliers
- Examine how the churn rate is distributed across different features
- Remove unused columns

#### Looking for missing values

In [None]:
# Look for missing values
df.info()

TotalCharges is read in as an object (string) column by default, so we need to convert it to a float type.

In [None]:
df['TotalCharges'] = pd.to_numeric(df.TotalCharges, errors='coerce')
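
As a small illustration of what `errors='coerce'` does, the sketch below converts a toy string column in which one entry is blank. The toy values and the assumption that the unparseable entries are blank strings are for illustration only and are not taken from the dataset.

```python
import pandas as pd

# Toy column mimicking a numeric field that was read in as strings.
# The blank entry is an assumption made for illustration.
raw = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# Entries that cannot be parsed as numbers become NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')
n_missing = int(converted.isna().sum())

print(converted.dtype)
print(n_missing)
```

After the conversion, the unparseable entries show up in `isnull().sum()`, which is how the missing values become visible.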

In [None]:
df.info()

In [None]:
df.isnull().sum()

We remove all 11 rows with missing values in TotalCharges

In [None]:
df.dropna(inplace=True)

In [None]:
df.reset_index(inplace=True,drop=True)

In [None]:
df

In [None]:
# Look for Outliers
df.describe()

In [None]:
df['tenure']

In [None]:
num_features = ['tenure','MonthlyCharges','TotalCharges']
fig, axes = plt.subplots(1,3,figsize=(10,5))
# Draw one boxplot per numerical feature, each on its own axis
for ax, feature in zip(axes, num_features):
    sns.boxplot(ax=ax,data=df[feature],palette='Set2').set_title(feature)

#### Look for Outliers

To check for outliers, we define an outlier as any value more than 1.5 × IQR below the first quartile or above the third quartile, and check whether any data points fall outside that range.

In [None]:
# Create a function to find outliers using the 1.5 * IQR rule
def iqr_outliers(num_features):
    outlier_position = []
    for i in num_features:
        q1 = df[i].quantile(0.25)
        q3 = df[i].quantile(0.75)
        iqr = q3 - q1
        lower_tail = q1 - 1.5 * iqr
        upper_tail = q3 + 1.5 * iqr
        # Record the feature name once for every value outside the bounds
        for j in df[i]:
            if j > upper_tail or j < lower_tail:
                outlier_position.append(i)
    print("Outliers:", outlier_position)

iqr_outliers(num_features)

The output shows that there are no outliers in our dataset.
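
The same IQR rule can also be written without an explicit loop over values. The following is a minimal sketch on a toy DataFrame; the helper name `count_iqr_outliers` and the toy columns are hypothetical and only serve to show the rule flagging a value.

```python
import pandas as pd

def count_iqr_outliers(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (frame[columns] < lower) | (frame[columns] > upper)
    return outside.sum()

# Toy data: column 'a' contains one obvious outlier, column 'b' contains none
toy = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [10, 11, 12, 13, 14]})
counts = count_iqr_outliers(toy, ['a', 'b'])
print(counts)
```

Returning a count per column makes it easy to see at a glance which features, if any, would need closer inspection.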

#### Remove unused column

In [None]:
# Remove unused column
df.drop(columns='customerID',inplace=True)

#### Distribution of Churning Rate

In [None]:
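# Take labels from the value_counts index so that they always match the counts
churn_counts = df['Churn'].value_counts()
plt.pie(churn_counts,labels=churn_counts.index,autopct='%1.1f%%')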

From the pie chart above, the ratio of customers who did not churn to those who did (no/yes) is nearly 3:1. If we train on the data directly, the model will tend to predict that customers will not churn, because the non-churn class dominates. To deal with such an imbalanced dataset, a resampling technique is needed. I will demonstrate undersampling in the data preprocessing part of this project.
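
To make the risk concrete, here is a minimal sketch with toy labels at an assumed 75:25 split (not the dataset's exact counts). A "model" that always predicts the majority class looks acceptable on accuracy while finding no churners at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels with an assumed 3:1 no-churn/churn ratio (1 = churn)
y_true = np.array([0] * 75 + [1] * 25)

# A trivial "model" that always predicts the majority class (no churn)
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_recall = recall_score(y_true, y_majority)

print(baseline_accuracy)  # share of correct predictions
print(baseline_recall)    # share of actual churners that were found
```

This is why accuracy alone can be misleading on imbalanced data, and why the class balance is worth addressing before training.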

In [None]:
category_features = df.loc[:, ~df.columns.isin(num_features)]
category_features

In [None]:
def plot_categoricals(columns, title):
    fig, axs = plt.subplots(ncols=2, nrows=int(len(columns) / 2) + len(columns) % 2)
    fig.set_size_inches(15, 45)

    row = col = 0
    for column in columns:
        plot_title = '{}: {}'.format(title, column)
        sns.countplot(x=column, hue="Churn", data=category_features, ax=axs[row][col]).set_title(plot_title)

        if col == 1:
            col = 0
            row += 1
        else:
            col += 1

    # this prevents plots from overlapping
    plt.tight_layout()

In [None]:
plot_categoricals(category_features.columns.to_list(), 'Demographic Variables')


From the plots, we can draw some insights:
- Overall, the data is highly imbalanced.
- The churn rate is more or less the same for each gender, so gender is not considered an important factor.
- Internet Service: relatively more customers have fiber optic internet service. At the same time, the churn rate is much higher in this category.
- Online Security: relatively more customers do not have the online security service. At the same time, the churn rate is much higher in this category.
- Online Backup: relatively more customers do not have the online backup service. Again, the churn rate is much higher in this category.
- Tech Support: relatively more customers do not have the tech support service. Again, the churn rate is much higher in this category.
- Contract: most customers are on a month-to-month contract. The churn rate is much higher in this category.
- Payment Method: electronic check is the most common payment method. The churn rate is much higher in this category.
- Taken together, these observations suggest that some churn may be related to follow-up services, for example poor fiber optic service pushing customers to leave. We could investigate further to see whether some groups of customers share similar patterns, and tailor advice to specific clusters in order to win them back.

### 3. Data Preprocessing
In this section, we will cover:
- Feature Scaling
- Encoding
- Dealing with imbalanced dataset

#### Feature Scaling

Among all the features, three are numerical: tenure, MonthlyCharges and TotalCharges. As the boxplots above show, their value ranges differ greatly, so we use a min-max scaler to rescale each of them to the range 0 to 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
df[num_features] = scaler.fit_transform(df[num_features])
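
For reference, min-max scaling maps each value x to (x - min) / (max - min) within its column. The sketch below applies that formula by hand to a few toy values (chosen for illustration) and checks that it agrees with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column data; the values are arbitrary
x = np.array([[1.0], [18.0], [72.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation using scikit-learn
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value in each column maps to 0 and the largest to 1, so all three numerical features end up on a comparable scale.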

In [None]:
df

In [None]:
df['Churn'].value_counts()

#### Encoding

In this case, we map the binary features to 0/1 and use one-hot encoding for the categorical features with more than two categories.

In [None]:
df.columns

In [None]:
df['gender'] = df['gender'].replace({'Male':1,'Female':0})
df['Partner'] = df['Partner'].replace({'Yes':1,'No':0})
df['Dependents'] = df['Dependents'].replace({'Yes':1,'No':0})
df['PhoneService'] = df['PhoneService'].replace({'Yes':1,'No':0})
df['PaperlessBilling'] = df['PaperlessBilling'].replace({'Yes':1,'No':0})
df['Churn'] = df['Churn'].replace({'Yes':1,'No':0})

In [None]:
categorical_features = [
    'MultipleLines', 
    'InternetService', 
    'OnlineSecurity',
    'OnlineBackup', 
    'DeviceProtection', 
    'TechSupport', 
    'StreamingTV',
    'StreamingMovies', 
    'Contract',
    'PaymentMethod'
]

In [None]:
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(drop='first')
encoded_df =onehotencoder.fit_transform(df[categorical_features]).toarray()

In [None]:
encoded_df

In [None]:
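# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out (available since 1.0) replaces it
encoded_df = pd.DataFrame(encoded_df, columns=onehotencoder.get_feature_names_out(categorical_features))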
encoded_df

In [None]:
df = pd.concat([df,encoded_df],axis=1)
df

In [None]:
df.isna().sum()

In [None]:
df2 = df.drop(categorical_features,axis=1)
df2

In [None]:
df2.columns

In [None]:
df2['Churn'].value_counts()

#### Dealing with imbalanced dataset

In this case, we will use undersampling to reduce the majority class until the ratio of churned to non-churned customers is 50:50
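
Conceptually, random undersampling keeps every row of the minority class and randomly drops rows of the majority class until the classes are the same size. The sketch below shows the idea with plain pandas on toy data. The helper `undersample_majority` is hypothetical and is not how imbalanced-learn implements `RandomUnderSampler`.

```python
import pandas as pd

def undersample_majority(frame, target, random_state=0):
    """Randomly sample each class down to the size of the rarest class."""
    n_min = frame[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy data: 7 non-churners and 3 churners
toy = pd.DataFrame({'x': range(10), 'Churn': [0] * 7 + [1] * 3})
balanced = undersample_majority(toy, 'Churn')
class_counts = balanced['Churn'].value_counts()
print(class_counts)
```

The trade-off is that information from the discarded majority rows is lost, which matters more when the dataset is small.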

In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
y = df2['Churn'] # target
X = df2.drop(columns='Churn') # all features

In [None]:
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)

In [None]:
print(X_resampled.shape)
print(y_resampled.shape)

To confirm that our dataset is balanced, we plot a pie chart for illustration

In [None]:
plt.pie(y_resampled.value_counts(),labels=y_resampled.value_counts().index,autopct='%1.1f%%')

### 4. Model Training

In this section, we will do the following:
- Train-test Split

For the classification model, we will use machine learning algorithms listed as follows:
- Logistic Regression
- Random Forest
- KNN
- XGBoost

To evaluate their performance, we will focus on the recall score and try to minimise false negatives (cases where we predict that a customer will not churn but they actually do)
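
Recall is the share of actual churners that the model finds: TP / (TP + FN). The sketch below computes it by hand from a confusion matrix on toy labels (made up for illustration) and compares the result with `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (1 = churn): four actual churners, one of them missed by the model
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_recall = tp / (tp + fn)      # churners found / all actual churners
manual_precision = tp / (tp + fp)   # churners found / all predicted churners
sklearn_recall = recall_score(y_true, y_pred)

print(manual_recall, manual_precision, sklearn_recall)
```

Every false negative lowers recall directly, which is why it is the metric of interest when a missed churner costs more than a false alarm.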

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#### Train-test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=41)
print('Train data churn count')
display(y_train.value_counts())
print('Test data churn count')
display(y_test.value_counts())

#### Logistic Regression

In [None]:
# GridSearchCV
log_clf = Pipeline([
#    ( 'column-onehot', col_trans ),
    ( 'classifier', LogisticRegression() )
])
hyperparams = { 
    'classifier__C': np.linspace(0.0001, 0.01, 50),
    'classifier__max_iter': range(80, 111)
}

log_search = GridSearchCV(log_clf,  hyperparams, n_jobs = -1,cv=5, verbose=1)
log_search.fit(X_train, y_train)
y_pred = log_search.predict(X_test)
print("Best params", log_search.best_params_)
print("Best score", log_search.best_score_)
log_C = log_search.best_params_['classifier__C']
log_max_iter = log_search.best_params_['classifier__max_iter']
# print(classification_report(y_test, y_pred))
# tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
# print([tp,fp])
# print([fn,tn])
# log_score = recall_score(y_test,y_pred)
# print('recall score: {}'.format(log_score))
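
When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's default score, which is accuracy for classifiers. Since recall is the metric of interest here, one option is to pass `scoring='recall'`. The sketch below shows this on synthetic data. The data, the small grid and `max_iter=200` are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data, used only to keep the sketch self-contained
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([('classifier', LogisticRegression(max_iter=200))])
grid = {'classifier__C': [0.01, 0.1, 1.0]}

# scoring='recall' makes the search pick the candidate with the best mean recall
search = GridSearchCV(pipe, grid, scoring='recall', cv=5)
search.fit(X_toy, y_toy)

print("Best params", search.best_params_)
print("Best recall", search.best_score_)
```

With this setting, `best_score_` is the mean cross-validated recall rather than accuracy, so it is directly comparable with the recall reported on the test set.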

#### Random Forest

In [None]:
forest_clf = Pipeline([
    ( 'classifier', RandomForestClassifier() )
])
hyperparams = { 
    'classifier__n_estimators': [100,200],
    'classifier__max_depth': [2,6,8,10],
    'classifier__max_leaf_nodes': [10,20,30,40,50,60,70,80,90,100]
}

forest_search = GridSearchCV(forest_clf,  hyperparams, n_jobs = -1,cv=5,verbose=1)
forest_search.fit(X_train, y_train)
y_pred = forest_search.predict(X_test)
print("Best params", forest_search.best_params_)
print("Best score", forest_search.best_score_)
forest_n_estimators = forest_search.best_params_['classifier__n_estimators']
forest_max_depth = forest_search.best_params_['classifier__max_depth']
forest_max_leaf_nodes = forest_search.best_params_['classifier__max_leaf_nodes']

#### KNN

In [None]:
knn_clf = Pipeline([
    ( 'classifier', KNeighborsClassifier() )
])
hyperparams = { 
    'classifier__n_neighbors': np.arange(1,100,2),
    'classifier__weights': ['distance']
}

knn_search = GridSearchCV(knn_clf,  hyperparams, n_jobs = -1)
knn_search.fit(X_train, y_train)
y_pred = knn_search.predict(X_test)
print("Best params", knn_search.best_params_)
print("Best score", knn_search.best_score_)
n_neighbors = knn_search.best_params_['classifier__n_neighbors']
knn_weights = knn_search.best_params_['classifier__weights']

#### XGBoost

In [None]:
import xgboost as xgb

In [None]:
xgb_clf = Pipeline([
    ( 'classifier', xgb.XGBClassifier(
        booster='gbtree',
        learning_rate=0.3,
        base_score=0.5,
        colsample_bylevel=1, 
        colsample_bytree=1, 
        gamma=0,
        reg_alpha=0,
        random_state=40
        
    ) 
    )
])
hyperparams = { 
    'classifier__n_estimators': np.arange(500,800,50),
    'classifier__max_depth':[2,6,8,10],
    
}

xgb_search = GridSearchCV(xgb_clf,  hyperparams, n_jobs = -1,cv=5,verbose=2)
xgb_search.fit(X_train, y_train)
y_pred = xgb_search.predict(X_test)
print("Best params", xgb_search.best_params_)
print("Best score", xgb_search.best_score_)
xgb_n_estimators = xgb_search.best_params_['classifier__n_estimators']
xgb_max_depth = xgb_search.best_params_['classifier__max_depth']

In [None]:
algorithm = ['LogisticRegression','KNeighborsClassifier','RandomForestClassifier','XGBClassifier']
hyperparameters = [
    LogisticRegression(
        C = log_C, 
        max_iter=log_max_iter
    ), 
    KNeighborsClassifier(
        n_neighbors = n_neighbors, 
        weights = knn_weights
    ),
    RandomForestClassifier(
        n_estimators = forest_n_estimators,
        max_depth = forest_max_depth,
        max_leaf_nodes = forest_max_leaf_nodes
    ),
    xgb.XGBClassifier(  # the estimator class tuned in the grid search above
        n_estimators = xgb_n_estimators,
        max_depth = xgb_max_depth
    )
]

In [None]:
models=dict(zip(algorithm,hyperparameters))
print(models)

In [None]:
acc_score_list = []
recall_score_list = []
for name, algo in models.items():
    model = algo
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    acc_score_list.append(acc)
    recall_score_list.append(recall)
    print(name, acc)
    print(classification_report(y_test, y_pred))
    # confusion_matrix returns [[tn, fp], [fn, tp]]: rows are actual classes, columns are predicted classes
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()
    print([tp,fp])
    print([fn,tn])
    # Create the figure and axis together so that the heatmap is drawn on the sized figure
    fig, ax = plt.subplots(figsize=(10,5))
    ax.set_title(name)
    conf_matrix = pd.DataFrame(data=cm,columns=['Predicted: No Churn','Predicted: Churn'],index=['Actual: No Churn','Actual: Churn'])
    sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="Blues",ax=ax)
    plt.show()



In [None]:
print(acc_score_list)
plt.figure(figsize = (10,5))
sns.barplot(x = acc_score_list, y = algorithm , palette='pastel')
plt.title("Model Accuracies Score", fontsize=16, fontweight="bold")

print(recall_score_list)
plt.figure(figsize = (10,5))
sns.barplot(x = recall_score_list, y = algorithm , palette='pastel')
plt.title("Model Recall Score", fontsize=16, fontweight="bold")

As shown above, KNN did the best job among the four ML models.

### 5. Feature Importance evaluation

To go further, we investigate the importance of each feature to see which features have the greatest influence on the prediction, so that the remaining ones can be removed

In [None]:
model = KNeighborsClassifier(
        n_neighbors = 65, 
        weights = 'distance',
    )
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print([tp,fp])
print([fn,tn])

In [None]:
from sklearn.inspection import permutation_importance
importance = permutation_importance(model, X_train, y_train, scoring='recall')
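
To show what permutation importance measures, the sketch below implements the idea by hand: shuffle one feature at a time and record how much recall drops. The synthetic data, the logistic regression model and the helper `recall_drop` are assumptions for illustration. This is a simplified version of the idea, not scikit-learn's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)   # feature that determines the label
noise = rng.normal(size=n)    # feature unrelated to the label
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
baseline = recall_score(y_toy, clf.predict(X_toy))

def recall_drop(column, n_repeats=10):
    """Mean decrease in recall after shuffling one column of X_toy."""
    drops = []
    for _ in range(n_repeats):
        X_perm = X_toy.copy()
        # Shuffling breaks the link between this feature and the label
        X_perm[:, column] = rng.permutation(X_perm[:, column])
        drops.append(baseline - recall_score(y_toy, clf.predict(X_perm)))
    return float(np.mean(drops))

drop_signal = recall_drop(0)
drop_noise = recall_drop(1)
print(drop_signal, drop_noise)
```

A feature the model relies on produces a large drop when shuffled, while an irrelevant one produces a drop close to zero. The same computation can be run on held-out data to measure importance for generalisation rather than for fitting the training set.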


In [None]:
importance_score = importance['importances_mean'].tolist()
feature = X_train.columns.values.tolist()
feature_importance = {'Features':feature,'Score':importance_score}

In [None]:
feature_importance_df = pd.DataFrame(feature_importance)
feature_importance_df.sort_values('Score',ascending=False,inplace=True)

In [None]:
feature_importance_df['Features'].reset_index(drop=True)

In [None]:
feature_importance_df

The top 10 important features affecting the recall score are:
1. Monthly Charges
2. Tenure
3. Total Charges
4. InternetService_Fiber optic
5. StreamingMovies_Yes
6. MultipleLines_Yes
7. Partner
8. StreamingTV_Yes
9. gender
10. OnlineBackup_Yes

### 6. Recommendation

#### In view of the top 10 features, the Telco company could review its service in three aspects:

#### 1. Service Charges
They could consider fine-tuning prices or offering discounts or bundle prices on long-term contracts.

#### 2. Connection Stability of Service
They should review the connection stability of the internet service, especially for users connecting via fiber optic.

#### 3. Choice of Streaming Channel
Reviewing the selection of both streaming TV and streaming movies is one direction to consider. They may also give special offers, such as lower prices for streaming TV and movies, to customers who have a partner.