# INTRODUCTION
This is a case study for a fictional telecom company that provided home phone and Internet services. The goal of this report is to analyze customer data to predict their behavior (leave or stay) and develop focused customer retention programs.

> 
> Title: **Telco Churn Analysis: Who left? How to retain them?**
> 
> Author: Rita Chang
> 
> Date: 13 July, 2021
> 

# STEP 1. ASK 



#### 1.1 Business Objectives:

Before starting the analysis, we write down the questions we expect to answer:
1. Which product features cause customers to leave? ([#ANSWER](#discussion))
2. How can the company use the findings from the data to retain customers? ([#ANSWER](#RECOMMENDATIONS))


#### 1.2 Deliverables:
1. A clear summary of the business task
2. Documentation of any cleaning or manipulation of data
3. A summary of analysis
4. Supporting visualizations and key findings
5. Recommendations based on the analysis

# STEP 2. PREPARE



#### 2.1 Information on Data Source:
1. The data is publicly available on [Kaggle: Telco Customer Churn](https://www.kaggle.com/blastchar/telco-customer-churn).
2. Generated by IBM Samples Team.
3. About a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3.
4. Data collected includes (1) customers who left within the last month, (2) services that each customer has signed up for, (3) customer account information, and (4) demographic info about customers.


#### 2.2 Limitations of Data Set:
1. Because the data is artificially generated, we cannot verify its integrity or accuracy.

# STEP 3. PROCESS


We are using Python to prepare and process the data.

#### 3.1 Preparing the Environment and Importing data set

In [None]:
# data analysis
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns

#predict
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

import warnings
warnings.filterwarnings('ignore')

#read files
data = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

#### 3.2 Data cleaning and manipulation

*Steps*
1. Explore the data to become familiar with it
2. Check for null or missing values
3. Perform sanity check of data

**`Which features are available in the dataset?`**

In [None]:
print(data.columns.values)

**`Which features are categorical?`**

gender, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, Churn

**`Which features are numerical?`**

Discrete: tenure (whole months). Continuous: MonthlyCharges, TotalCharges.

In [None]:
data.head()

In [None]:
data.info()

From the output above, we note that:

1. TotalCharges is stored as object dtype and has to be converted to float64.
2. SeniorCitizen is a true/false item coded as 0/1, so we convert 0 to No and 1 to Yes to match the other categorical features.

In [None]:
# TotalCharges contains blank strings; errors='coerce' turns them into NaN.
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
# Map the 0/1 flag to readable labels (plain assignment instead of a chained inplace replace).
data["SeniorCitizen"] = data["SeniorCitizen"].map({1: "Yes", 0: "No"})

# Count missing values per column.
print(data.isnull().sum())
# Check the rows with a null TotalCharges to determine how to deal with them.
print(data[data.TotalCharges.isnull()])

# Drop every row that still contains a missing value.
data.dropna(axis=0, how='any', inplace=True)

Printing the rows with missing values shows that tenure is 0 for all of them.
We suppose the null "TotalCharges" values may be caused by the system not having been updated yet, so these rows can be inferred to be new customers.
Because only a few rows are affected (11 of 7043), they have little effect on the analysis, so we drop them directly.
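
The check described above can be reproduced in a few lines. The sketch below uses a small hypothetical frame (`toy`) instead of the real file, with blank `TotalCharges` strings on accounts that have `tenure == 0`. It is an illustration of the reasoning, not the exact cell that was run.

```python
import pandas as pd

# Hypothetical stand-in for the real file: blank TotalCharges on brand-new accounts.
toy = pd.DataFrame({
    "tenure": [0, 0, 5, 12],
    "MonthlyCharges": [52.55, 20.25, 70.00, 30.00],
    "TotalCharges": [" ", " ", "350.0", "360.0"],
})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

missing = toy[toy["TotalCharges"].isnull()]
n_missing = len(missing)
# True only if every row with a missing TotalCharges is a zero-tenure customer.
all_new_customers = bool((missing["tenure"] == 0).all())
share_missing = n_missing / len(toy)

print(n_missing, all_new_customers, round(share_missing, 2))
```

If `all_new_customers` were `False` on the real data, dropping the rows would need a different justification.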




Data cleaning and manipulation are complete, so the data is now ready to be analysed.

# STEP 4. ANALYZE

#### 4.1 Explore the distribution of features

**`What is the distribution of numerical feature values across the samples?`**

In [None]:
data.describe()

In [None]:
fig,axes = plt.subplots(1,3,figsize = (10,4))
plt.suptitle('Distribution of numerical feature values', fontsize = 20)
cols = ["tenure", "MonthlyCharges", "TotalCharges"]
ylabel = ["Months", "Dollars", "Dollars"]
for i in range(3):
    sns.violinplot(data=data[cols[i]],orient="v",ax=axes[i], color="#D3D3D3", cut=0)
    axes[i].grid(linestyle="--", alpha=0.5)
    axes[i].set_xticks(np.arange(1))
    axes[i].set_ylabel(ylabel[i], fontsize=15)
label1=["tenure"]
label2=["MonthlyCharges"]
label3=["TotalCharges"]
axes[0].set_xticklabels(label1, fontsize=15)
axes[1].set_xticklabels(label2, fontsize=15)
axes[2].set_xticklabels(label3, fontsize=15)
fig.tight_layout()
plt.show()

* There are 7032 samples in total.
* Tenure is spread widely. The mode is 1 month (8.7%).
* The top 25% of TotalCharges covers a wide range, with a maximum of 8684 dollars, but half of the users have spent less than 1397 dollars.

**`What is the distribution of categorical features?`**

In [None]:
data.describe(include=['O'])

**Demographic info**
* customerID is unique across the dataset (count=unique=7032).
* Customers are 50.5% male and 49.5% female.
* 70.2% of users don't have dependents.

**Services**
* 42% of users use the MultipleLines service.
* 78.4% of users signed up for the Internet service, and the most popular Internet service item is StreamingMovies with 38.8%.

**Customer account info**
* More than half of users are on a Month-to-month contract (55%).
* The churn rate is 26.6%.
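
The percentages quoted above come from category counts. A minimal sketch of how such shares can be computed is shown here on a hypothetical four-row frame. The helper name `share` is an assumption made for illustration and does not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})

def share(df, col):
    """Percentage of rows in each category of `col`, rounded to one decimal."""
    return (df[col].value_counts(normalize=True) * 100).round(1)

contract_share = share(toy, "Contract")
churn_share = share(toy, "Churn")
print(contract_share)
print(churn_share)
```

`value_counts(normalize=True)` returns fractions that sum to 1, so multiplying by 100 gives the percentages reported in the bullets.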

#### 4.2 Correlating features to goal (Churn)

**`What is the correlation between features and churn rate?`**

To see how well each feature correlates with Churn, we divide users into two groups according to Churn.

In [None]:
# Work on an explicit copy so later assignments do not touch (or warn about) `data`.
trans_df = data.iloc[:,1:21].copy()
# Encode the target as 1 (churned) / 0 (stayed) so that its mean is the churn rate.
trans_df["Churn"] = trans_df["Churn"].map({"Yes": 1, "No": 0})

#draw bar plot
cate_cols = ['SeniorCitizen', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
colors = ["Salmon", "#5F9EA0"]

fig, axes = plt.subplots(nrows = 8,ncols = 2,figsize = (15,40))

for i,feature in enumerate(cate_cols):
    # Features 0-7 go in the left column, features 8-15 in the right column.
    row, col = i % 8, i // 8
    # Row-normalised crosstab: share of retained / churned customers within each category.
    pct = pd.crosstab(trans_df[feature], trans_df['Churn'], normalize='index') * 100
    ax = pct.plot(kind='bar', color=colors, ax=axes[row, col])
    
    ax.set_title("Distribution of {} and % Churn".format(feature), fontsize = 20)
    ax.set_ylabel("Percentage", fontsize=15)
    ax.set_xlabel("{}".format(feature), fontsize=15)
    ax.legend(["No", "Yes"], fontsize=12)
    
    for patch in ax.patches:
        width, height = patch.get_width(), patch.get_height()
        ax.annotate('{:.0f}%'.format(height), (patch.get_x()+0.1*width, patch.get_y()+0.5*height),
                    color = 'Black',
                    weight = 'bold',
                    size = 20)

    ax.yaxis.set_major_formatter(mtick.PercentFormatter())
    plt.setp(ax.get_xticklabels(), rotation=10, fontsize=12)

plt.tight_layout()
plt.show()

**Demographic info**
* Senior citizens have a much higher churn rate: 42%.
* Customers without a partner or dependents are more likely to leave.

**Services**
* Customers with Fiber optic are very likely to churn.
* Customers who don't sign up for OnlineSecurity/OnlineBackup/DeviceProtection/TechSupport are very likely to churn.
* StreamingTV and StreamingMovies are the most used services, but they are also the services with the highest customer churn rate.

**Customer account info**
* Customers with paperless billing or an electronic payment method are very likely to churn.
* The shorter the contract, the higher the customer churn rate.
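
The bar charts above plot, for each category, the percentage of customers who churned. The same numbers can be read as a table. The sketch below assumes a hypothetical helper `churn_rate_by` and a toy frame in which Churn is still stored as "Yes"/"No".

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

def churn_rate_by(df, col, target="Churn"):
    """Churn rate (%) within each category of `col`."""
    # normalize="index" makes every row of the crosstab sum to 1.
    table = pd.crosstab(df[col], df[target], normalize="index") * 100
    return table["Yes"].round(1)

rates = churn_rate_by(toy, "Contract")
print(rates)
```

Sorting the result with `rates.sort_values(ascending=False)` puts the riskiest categories first, which is easier to scan than sixteen separate charts.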

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 4))
ax[0].set_title("Distribution of tenure by Churn", fontsize = 20)
item = data["Churn"].unique()
sns.kdeplot(data.tenure[(data["Churn"] == item[0]) ], color = "Gold", shade = True, ax = ax[0])
sns.kdeplot(data.tenure[(data["Churn"] == item[1]) ], ax = ax[0], color = "#5F9EA0", shade = True)
ax[0].legend(["{}".format(item[0]), "{}".format(item[1])], loc='upper right')
ax[0].set_ylabel('Density')
ax[0].set_xlabel('tenure')

sns.lineplot(x = "tenure", y = "Churn", data = trans_df, ax = ax[1])
ax[1].set_title("Churn rate of tenure", fontsize = 20)
plt.show()

* The tenure distribution of customers who left has a clear peak, whereas the distribution of customers who stayed is comparatively flat.
* The longer the tenure, the lower the churn rate.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 4))
ax[0].set_title("Distribution of Monthly Charges by Churn", fontsize = 20)
item = data["Churn"].unique()
sns.kdeplot(data.MonthlyCharges[(data["Churn"] == item[0]) ], color="Gold", shade = True, ax = ax[0])
ax[0] = sns.kdeplot(data.MonthlyCharges[(data["Churn"] == item[1]) ], ax = ax[0], color="#5F9EA0", shade= True)
ax[0].legend(["{}".format(item[0]), "{}".format(item[1])], loc='upper right')
ax[0].set_ylabel('Density')
ax[0].set_xlabel('Monthly Charges')

trans_df["MonthlyCharges_cut"] = pd.cut(trans_df.MonthlyCharges, bins=15, labels=np.arange(15))
sns.lineplot(x = "MonthlyCharges_cut", y = "Churn", data = trans_df, ax = ax[1])
ax[1].set_title("Churn rate of MonthlyCharges", fontsize = 20)
plt.show()

* Most customers do not sit at the very bottom of the price range, but the two groups peak at different values. Peak - Churn: around 80 dollars, Not Churn: around 20 dollars.
* The churn rate and monthly charges are not clearly related to each other.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 4))
ax[0].set_title("Distribution of Total Charges by Churn", fontsize = 20)
item = data["Churn"].unique()
sns.kdeplot(data.TotalCharges[(data["Churn"] == item[0]) ], color="Gold", shade = True, ax = ax[0])
ax[0] = sns.kdeplot(data.TotalCharges[(data["Churn"] == item[1]) ], ax = ax[0], color="#5F9EA0", shade= True)
ax[0].legend(["{}".format(item[0]), "{}".format(item[1])], loc='upper right')
ax[0].set_ylabel('Density')
ax[0].set_xlabel('Total Charges')

trans_df["TotalCharges_cut"] = pd.cut(trans_df.TotalCharges, bins=9, labels=np.arange(9))
sns.lineplot(x = "TotalCharges_cut", y = "Churn", data = trans_df, ax = ax[1])
ax[1].set_title("Churn rate of TotalCharges", fontsize = 20)
plt.show()

* The total charges of customers who left are tightly clustered around the peak, and the distribution for customers who stayed has a similar overall shape.
* The larger the TotalCharges, the lower the churn rate.

**`What is the correlation between categorical features and Churn?`**

In [None]:
colors = ["Salmon", "#5F9EA0"]
sns.catplot(x="Partner", y="Churn",
            hue="Dependents", col="SeniorCitizen",
            data=trans_df, kind="point",
            dodge=True, palette = colors,
            height=3, aspect=1.5)

Senior citizens who have neither a partner nor dependents are the most likely to leave, with a churn rate of about 50%.
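
The point plot shows the mean of Churn per segment. Tabulating the same means together with the segment size helps to judge how reliable each rate is. This is a sketch on hypothetical data, assuming Churn is already encoded as 0/1 as in `trans_df`.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "SeniorCitizen": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Partner":       ["No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
    "Dependents":    ["No", "No", "No", "No", "No", "No", "Yes", "Yes"],
    "Churn":         [1, 0, 0, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the churn rate; size is the number of customers in the segment.
segment = (
    toy.groupby(["SeniorCitizen", "Partner", "Dependents"])["Churn"]
       .agg(churn_rate="mean", customers="size")
       .reset_index()
       .sort_values("churn_rate", ascending=False)
)
print(segment)
```

A high rate in a segment with very few customers deserves less weight than the same rate in a large segment.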

**`How do different categories of people use the service?`**

In [None]:
cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]
colors = ["Salmon", "#5F9EA0"]
fig, ax = plt.subplots(3, 2, figsize=(15, 12))
df1 = pd.melt(data[(data["InternetService"] != "No") & (data["Partner"] =="Yes")][cols]).rename({'value': 'Has service'}, axis=1)
sns.countplot(data=df1, x='variable', hue='Has service', ax = ax[0, 0], palette = colors)
df2 = pd.melt(data[(data["InternetService"] != "No") & (data["Partner"] =="No")][cols]).rename({'value': 'Has service'}, axis=1)
sns.countplot(data=df2, x='variable', hue='Has service', ax = ax[0, 1], palette = colors)
df3 = pd.melt(data[(data["InternetService"] != "No") & (data["Dependents"] =="Yes")][cols]).rename({'value': 'Has service'}, axis=1)
sns.countplot(data=df3, x='variable', hue='Has service', ax = ax[1, 0], palette = colors)
df4 = pd.melt(data[(data["InternetService"] != "No") & (data["Dependents"] =="No")][cols]).rename({'value': 'Has service'}, axis=1)
sns.countplot(data=df4, x='variable', hue='Has service', ax = ax[1, 1], palette = colors)
df5 = pd.melt(data[(data["InternetService"] != "No") & (data["SeniorCitizen"] =="Yes")][cols]).rename({'value': 'Has service'}, axis=1)
sns.countplot(data=df5, x='variable', hue='Has service', ax = ax[2, 0], palette = colors)
df6 = pd.melt(data[(data["InternetService"] != "No") & (data["SeniorCitizen"] =="No")][cols]).rename({'value': 'Has service'}, axis=1)
sns.countplot(data=df6, x='variable', hue='Has service', ax = ax[2, 1], palette = colors)

for i in range(3):
    for j in range(2):
        ax[i,j].set_xlabel('Service Item', fontsize = 15)
        ax[i,j].set_ylabel('Num of customers',  fontsize = 15)
        ax[i,j].legend(bbox_to_anchor=(1.0, 1.0), loc='upper left', title="Has service")
        plt.setp(ax[i,j].get_xticklabels(), rotation=15, fontsize=12)

ax[0,0].set_title("Partner = Yes", fontsize = 20)
ax[0,1].set_title("Partner = No", fontsize = 20)
ax[1,0].set_title("Dependents = Yes", fontsize = 20)
ax[1,1].set_title("Dependents = No", fontsize = 20)
ax[2,0].set_title("SeniorCitizen = Yes", fontsize = 20)
ax[2,1].set_title("SeniorCitizen = No", fontsize = 20)
plt.subplots_adjust(wspace=0.4, hspace=0.7)
plt.show()

* Most customers without a partner or dependents do not sign up for additional services.
* Most senior citizen customers do not sign up for additional services either.

**`What is the correlation between features and MonthlyCharges?`**

In [None]:
col_name1 = ["SeniorCitizen", "Partner", "Dependents"]

fig, ax = plt.subplots(2, 2, figsize=(15,10))
print("Average:")
    
for i,j in enumerate(col_name1):
    item = data[j].unique()
    if i < 2:
        ax[0,i] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[0]) ], color="Gold", shade = True, ax = ax[0,i])
        ax[0,i] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[1]) ], ax =ax[0,i], color="#5F9EA0", shade= True)
        ax[0,i].set_title("Distribution of monthly charges by {}".format(j), fontsize = 20)
        ax[0,i].legend(["{}".format(item[0]), "{}".format(item[1])], loc='upper right')
        ax[0,i].set_ylabel('Density', fontsize = 15)
        ax[0,i].set_xlabel('Monthly Charges', fontsize = 15)
    elif i >= 2:
        ax[1,i-2] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[0]) ], color="Gold", shade = True, ax = ax[1,i-2])
        ax[1,i-2] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[1]) ], ax =ax[1,i-2], color="#5F9EA0", shade= True)
        ax[1,i-2].set_title("Distribution of monthly charges by {}".format(j), fontsize = 20)
        ax[1,i-2].legend(["{}".format(item[0]), "{}".format(item[1])], loc='upper right')
        ax[1,i-2].set_ylabel('Density', fontsize = 15)
        ax[1,i-2].set_xlabel('Monthly Charges', fontsize = 15)
    ave1 = data[data[j]==item[0]].MonthlyCharges.mean().round(2)
    ave2 = data[data[j]==item[1]].MonthlyCharges.mean().round(2)
    print(f"{j} ({item[0]}/{item[1]}) -->  {ave1} / {ave2}.")

fig.delaxes(ax[1,1])
plt.tight_layout()
plt.show()

In [None]:
col_name3 = ["StreamingTV", "Contract", "StreamingMovies", "InternetService"]

fig, ax = plt.subplots(2, 2, figsize=(15,10))
print("Average:")
for i,j in enumerate(col_name3):
    if i < 2:
        ax[i,0].set_title("Distribution of monthly charges by {}".format(j), fontsize = 20)
        item = data[j].unique()
        ax[i,0] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[0]) ], color="Salmon", shade = True, ax = ax[i,0])
        ax[i,0] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[1]) ], ax =ax[i,0], color="Gold", shade= True)
        ax[i,0] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[2]) ], ax =ax[i,0], color="#5F9EA0", shade= True)
        ax[i,0].legend(["{}".format(item[0]), "{}".format(item[1]), "{}".format(item[2])], loc='upper right')
        ax[i,0].set_ylabel('Density')
        ax[i,0].set_xlabel('Monthly Charges')
        if j != "Contract":
            ax[i,0].set_ylim(0,0.12)
    elif i >=2:
        ax[i-2,1].set_title("Distribution of monthly charges by {}".format(j), fontsize = 20)
        item = data[j].unique()
        ax[i-2,1] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[0]) ], color="Salmon", shade = True, ax = ax[i-2,1])
        ax[i-2,1] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[1]) ], ax =ax[i-2,1], color="Gold", shade= True)
        ax[i-2,1] = sns.kdeplot(data.MonthlyCharges[(data[j] == item[2]) ], ax =ax[i-2,1], color="#5F9EA0", shade= True)
        ax[i-2,1].legend(["{}".format(item[0]), "{}".format(item[1]), "{}".format(item[2])], loc='upper right')
        ax[i-2,1].set_ylabel('Density')
        ax[i-2,1].set_xlabel('Monthly Charges')
        if j != "Contract":
            ax[i-2,1].set_ylim(0,0.12)
    ave1 = data[data[j]==item[0]].MonthlyCharges.mean().round(2)
    ave2 = data[data[j]==item[1]].MonthlyCharges.mean().round(2)
    ave3 = data[data[j]==item[2]].MonthlyCharges.mean().round(2)
    print(f"{j} ({item[0]}/ {item[1]}/ {item[2]}) -->  {ave1} / {ave2} / {ave3}.")

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15,5))
print("Average:")

ax[0].set_title("Distribution of monthly charges by PaperlessBilling", fontsize = 20)
item = data["PaperlessBilling"].unique()
ax[0] = sns.kdeplot(data.MonthlyCharges[(data["PaperlessBilling"] == item[0]) ], color="Gold", shade = True, ax = ax[0])
ax[0] = sns.kdeplot(data.MonthlyCharges[(data["PaperlessBilling"] == item[1]) ], ax = ax[0], color="#5F9EA0", shade= True)
ax[0].legend(["{}".format(item[0]), "{}".format(item[1])], loc='upper right')
ax[0].set_ylabel('Density')
ax[0].set_xlabel('Monthly Charges')


ave1 = data[data["PaperlessBilling"]==item[0]].MonthlyCharges.mean().round(2)
ave2 = data[data["PaperlessBilling"]==item[1]].MonthlyCharges.mean().round(2)
print(f"PaperlessBilling ({item[0]}/{item[1]}) -->  {ave1} / {ave2}.")

ax[1].set_title("Distribution of monthly charges by PaymentMethod", fontsize = 20)
item = data["PaymentMethod"].unique()
ax[1] = sns.kdeplot(data.MonthlyCharges[(data["PaymentMethod"] == item[0]) ], color="Salmon", shade = True, ax = ax[1])
ax[1] = sns.kdeplot(data.MonthlyCharges[(data["PaymentMethod"] == item[1]) ], ax = ax[1], color="Gold", shade= True)
ax[1] = sns.kdeplot(data.MonthlyCharges[(data["PaymentMethod"] == item[2]) ], ax = ax[1], color="#5F9EA0", shade= True)
ax[1] = sns.kdeplot(data.MonthlyCharges[(data["PaymentMethod"] == item[3]) ], ax = ax[1], color="#D3D3D3", shade= True)
ax[1].legend(["{}".format(item[0]), "{}".format(item[1]), "{}".format(item[2]), "{}".format(item[3])], loc='upper right')
ax[1].set_ylabel('Density')
ax[1].set_xlabel('Monthly Charges')

ave1 = data[data["PaymentMethod"]==item[0]].MonthlyCharges.mean().round(2)
ave2 = data[data["PaymentMethod"]==item[1]].MonthlyCharges.mean().round(2)
ave3 = data[data["PaymentMethod"]==item[2]].MonthlyCharges.mean().round(2)
ave4 = data[data["PaymentMethod"]==item[3]].MonthlyCharges.mean().round(2)
print(f"PaymentMethod ({item[0]}/ {item[1]}/ {item[2]}/ {item[3]}) -->  {ave1} / {ave2} / {ave3} / {ave4}.")

plt.tight_layout()
plt.show()

**Demographic info**
* Senior citizens pay higher monthly charges than other customers. **Peak** - Senior Citizen: around 100 dollars, Not Senior Citizen: around 20 dollars.
* Having a partner or dependents does not have much effect on the distribution of monthly charges.

**Services**
* Customers who did not use PhoneService spent 25 dollars less per month than those who did, and their charges are concentrated in a narrow range.
* The three categories of InternetService can be clearly distinguished: customers using Fiber optic pay much more than the others. **Average** - Fiber optic: 91.5, DSL: 58.1, No Service: 21.1.
* There is no particular trend in the distribution of monthly charges by InternetService items, but customers who sign up for these services pay slightly more.

**Customer account info**
* Customers using paperless billing pay more than others. **Peak** - PaperlessBilling: around 90 dollars, Not PaperlessBilling: around 20 dollars.
* Customers who left pay more than others. **Peak** - Churn: around 80 dollars, Not Churn: around 20 dollars.
* Customers who pay by mailed check spend the least, and customers who choose the electronic payment method spend the most. The distributions of monthly charges for Bank transfer and Credit card are almost the same.
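
The averages printed by the cells above are group means of MonthlyCharges. A compact way to obtain them for any categorical column is a `groupby`. The sketch below uses a hypothetical toy frame, and the helper name `mean_charge_by` is an assumption.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "DSL", "DSL", "No"],
    "MonthlyCharges": [90.0, 100.0, 55.0, 65.0, 20.0],
})

def mean_charge_by(df, col, value="MonthlyCharges"):
    """Average of `value` for each category of `col`, highest first."""
    return df.groupby(col)[value].mean().round(2).sort_values(ascending=False)

avg = mean_charge_by(toy, "InternetService")
print(avg)
```

Unlike indexing `unique()` results by position, this works for any number of categories without repeating code.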

#### 4.3 Composition of different groups (Churn/No Churn)

In [None]:
#distribution of categorical features by Churn
def Donut_Chart(col_name):
    def counts(col_name, churn_YoN):
        # Category counts for one churn group; the index holds the category labels.
        return trans_df[trans_df["Churn"]==churn_YoN][col_name].value_counts()

    fig, ax = plt.subplots(1,2, figsize=(18,6))
    # A wedge width below 1 turns the pie into a donut (avoid shadowing the built-in `dict`).
    wedge_style = {"width": 0.5}

    # Churn is encoded as 0 (left panel, "No Churn") and 1 (right panel, "Churn").
    for i, title in enumerate(["No Churn", "Churn"]):
        group = counts(col_name, i)
        ax[i].pie(group.values,
                colors = ["Salmon", "Gold", "#5F9EA0", "#D3D3D3"],
                autopct='%1.1f%%', startangle = 90,
                pctdistance = 0.75,
                wedgeprops=wedge_style,
                textprops={'fontsize': 17, 'weight' : 'heavy'})
        ax[i].text(0, 0, title, ha="center", fontsize = 17, weight="bold")
        ax[i].legend(loc='upper left', fontsize=15, bbox_to_anchor=(0.9, 0.8), labels = group.index)

    plt.suptitle("The Composition of the {} grouped by Churn".format(col_name), fontsize = 30)
    fig.tight_layout()
    plt.show()

col = ['SeniorCitizen', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
for i in col:
    Donut_Chart(i)

# STEP 5. PREDICT

#### 5.1 Correcting by dropping features

In [None]:
# customerID is a unique identifier and carries no predictive information.
data = data.drop(['customerID'], axis=1)
# Encode the target as 1 (churned) / 0 (stayed).
data["Churn"] = data["Churn"].map({"Yes": 1, "No": 0})

#### 5.2 Cut numerical continuous feature into sections


*Steps*
1. Create feature sections and determine correlations with Churn
2. Replace feature with ordinals based on the sections
3. Remove the featureSection column


#### 5.2.1 tenure

In [None]:
data['tenureSection'] = pd.qcut(data['tenure'], 5)
data[['tenureSection', 'Churn']].groupby(['tenureSection'], as_index=False).mean().sort_values(by='tenureSection', ascending=True)

In [None]:
data.loc[ data['tenure'] <= 6, 'tenure'] = 0
data.loc[(data['tenure'] > 6) & (data['tenure'] <= 20), 'tenure'] = 1
data.loc[(data['tenure'] > 20) & (data['tenure'] <= 40), 'tenure'] = 2
data.loc[(data['tenure'] > 40) & (data['tenure'] <= 60.8), 'tenure'] = 3
data.loc[ data['tenure'] > 60.8, 'tenure'] = 4

data = data.drop(['tenureSection'], axis=1)
data.head()
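
The chained `.loc` assignments above work because every bin edge is larger than the ordinal codes 0-4, so already-recoded rows are never matched again. An alternative that does not depend on that ordering is `pd.cut` with explicit edges. The sketch below reuses the tenure edges from the cell above on a hypothetical series.

```python
import numpy as np
import pandas as pd

# Hypothetical tenure values chosen to sit on both sides of every bin edge.
tenure = pd.Series([1, 6, 7, 20, 21, 40, 41, 60, 61, 72], name="tenure")

# Bin edges copied from the cell above; intervals are closed on the right, e.g. (6, 20].
edges = [-np.inf, 6, 20, 40, 60.8, np.inf]

# labels=False returns the integer position of the bin instead of an Interval.
tenure_bin = pd.cut(tenure, bins=edges, labels=False)
print(tenure_bin.tolist())
```

The same call with the MonthlyCharges or TotalCharges edges would replace the corresponding blocks in 5.2.2 and 5.2.3.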

#### 5.2.2 MonthlyCharges

In [None]:
data['MonthlyChargesSection'] = pd.qcut(data['MonthlyCharges'], 5)
data[['MonthlyChargesSection', 'Churn']].groupby(['MonthlyChargesSection'], as_index=False).mean().sort_values(by='MonthlyChargesSection', ascending=True)

In [None]:
data.loc[ data['MonthlyCharges'] <= 20.5, 'MonthlyCharges'] = 0
data.loc[(data['MonthlyCharges'] > 20.5) & (data['MonthlyCharges'] <= 58.92), 'MonthlyCharges'] = 1
data.loc[(data['MonthlyCharges'] > 58.92) & (data['MonthlyCharges'] <= 79.15), 'MonthlyCharges'] = 2
data.loc[(data['MonthlyCharges'] > 79.15) & (data['MonthlyCharges'] <= 94.3), 'MonthlyCharges'] = 3
data.loc[ data['MonthlyCharges'] > 94.3, 'MonthlyCharges'] = 4

data = data.drop(['MonthlyChargesSection'], axis=1)
data.head()

#### 5.2.3 TotalCharges

In [None]:
data['TotalChargesSection'] = pd.qcut(data['TotalCharges'], 4)
data[['TotalChargesSection', 'Churn']].groupby(['TotalChargesSection'], as_index=False).mean().sort_values(by='TotalChargesSection', ascending=True)

In [None]:
data.loc[ data['TotalCharges'] <= 401.45, 'TotalCharges'] = 0
data.loc[(data['TotalCharges'] > 401.45) & (data['TotalCharges'] <= 1397.475), 'TotalCharges'] = 1
data.loc[(data['TotalCharges'] > 1397.475) & (data['TotalCharges'] <= 3794.738), 'TotalCharges'] = 2
# qcut(..., 4) yields only four sections, so everything above the third edge is the top bin.
data.loc[ data['TotalCharges'] > 3794.738, 'TotalCharges'] = 3

data = data.drop(['TotalChargesSection'], axis=1)
data.head()

#### 5.3 Create new feature combining existing features

We can create a new feature that combines tenure and MonthlyCharges. Although each of them shows its own trend, neither one on its own cleanly separates customers who churn from those who stay. We therefore suppose that the two together separate the groups better.

In [None]:
data['CombineTM'] = (4-data['tenure']) + data['MonthlyCharges']
data[['CombineTM', 'Churn']].groupby(['CombineTM'], as_index=False).mean().sort_values(by='Churn', ascending=False)
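
A worked example makes the scale of `CombineTM` concrete. Both inputs are the ordinal codes 0-4 from section 5.2, so the score runs from 0 to 8. The rows below are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "tenure": [0, 0, 4, 4, 2],          # ordinal tenure bin: 0 = newest, 4 = longest
    "MonthlyCharges": [4, 0, 0, 4, 2],  # ordinal charge bin: 0 = cheapest, 4 = most expensive
})

# Short tenure and high charges both push the score up.
toy["CombineTM"] = (4 - toy["tenure"]) + toy["MonthlyCharges"]
print(toy)
```

A score of 8 is a new customer on the most expensive tier and 0 is a long-standing customer on the cheapest tier. Note that a mid-range score such as 4 mixes quite different customers (new and cheap, long-standing and expensive, or average on both), which is worth keeping in mind when reading the grouped churn rates.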

#### 5.4 Converting categorical feature into numerical


In [None]:
df_dummies = pd.get_dummies(data)
df_dummies.head()
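
To illustrate what `pd.get_dummies` does to the frame, the sketch below encodes a hypothetical two-row example. Numeric columns pass through unchanged and every category of a text column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "tenure": [0, 3],
    "Contract": ["Month-to-month", "Two year"],
    "Partner": ["Yes", "No"],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Optional variant: drop the first level of each feature to remove redundant columns.
encoded_compact = pd.get_dummies(toy, drop_first=True)
print(encoded_compact.columns.tolist())
```

For a Yes/No feature the two indicator columns carry the same information. Whether to keep both (as the notebook does) or use `drop_first=True` is a design choice that mainly affects how linear-model weights are read.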

#### 5.5 Model

In [None]:
# Create train & validation data
X = df_dummies.drop(columns = ['Churn'])
y = df_dummies['Churn'].values
# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
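
The split above is random on every run. If reproducible results and equal churn rates in both parts are wanted, `train_test_split` accepts `random_state` and `stratify`. This is a sketch on synthetic data, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 customers, 266 of them churned (about the 26.6% reported earlier).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 266 + [0] * 734)

# stratify=y keeps the class ratio the same in both parts; random_state fixes the shuffle.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_tr), len(X_va), round(y_tr.mean(), 3), round(y_va.mean(), 3))
```

With an imbalanced target, an unstratified 20% sample can by chance contain noticeably more or fewer churners than the training part.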

In [None]:
# Running model
clfs = []

clfs.append(("LogReg", 
             Pipeline([("Scaler", StandardScaler()),
                       ("LogReg", LogisticRegression())])))

clfs.append(("KNN", 
             Pipeline([("Scaler", StandardScaler()),
                       ("KNN", KNeighborsClassifier())]))) 

clfs.append(("RandomForestClassifier", 
             Pipeline([("Scaler", StandardScaler()),
                       ("RandomForest", RandomForestClassifier())]))) 

clfs.append(("DecisionTreeClassifier", 
             Pipeline([("Scaler", StandardScaler()),
                       ("DecisionTree", DecisionTreeClassifier())])))

clfs.append(("SVM", 
             Pipeline([("Scaler", StandardScaler()),
                       ("SVM", SVC())])))

#draw accuracy distribution
results, names  = [], [] 

for name, model  in clfs:
    cv_results = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy', n_jobs=-1)    
    names.append(name)
    results.append(cv_results)    
    msg = "%s: %f (+/- %f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# boxplot algorithm comparison
# One column of fold accuracies per classifier (avoids shadowing the built-in `dict`).
df = pd.DataFrame({name: scores for name, scores in zip(names, results)})
df = pd.melt(df)
g = sns.catplot(data = df, x="variable", y="value", kind="box", hue = "variable", height=4, aspect=1.5)
plt.title('Accuracy of Different Classifier', fontsize=22)
plt.xlabel("Classifier", fontsize=20)
plt.ylabel("Accuracy of Models", fontsize=18)
g.set_xticklabels(rotation=20, fontsize=10)
plt.show()
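
The cross-validation above reports accuracy on the training part only, and the held-out `X_val`/`y_val` are not used. Since roughly three quarters of customers stay, a model that always predicts "stay" is already fairly accurate, so it can help to compare against that baseline and to look at recall for the churn class. The sketch below does this on synthetic data generated with `make_classification`. It is an assumption about how such a check could look, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 27% positives).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.73, 0.27], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = Pipeline([("Scaler", StandardScaler()),
                  ("LogReg", LogisticRegression())])
model.fit(X_tr, y_tr)
pred = model.predict(X_va)

# Accuracy of always predicting the majority class.
baseline_acc = max(y_va.mean(), 1 - y_va.mean())
acc = accuracy_score(y_va, pred)
# Share of actual churners that the model catches.
churn_recall = recall_score(y_va, pred)

print(f"baseline={baseline_acc:.3f} accuracy={acc:.3f} churn recall={churn_recall:.3f}")
```

For a retention programme, missing a customer who is about to leave is usually the costlier error, which is why recall on the churn class is a useful companion to accuracy.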

#### 5.5.1 LogisticRegression

The weights (coefficients) of the logistic regression indicate how strongly each feature is associated with the predicted value (Churn). A positive weight means the feature increases the predicted likelihood of churn; a negative weight means that the more pronounced the feature is, the less likely the customer is to leave.

In [None]:
# To get the weights of all the variables
model = LogisticRegression()
model.fit(X_train, y_train)
# One coefficient per feature column of X_train.
weights = pd.Series(model.coef_[0],
                 index=X_train.columns.values)
# Ten largest positive weights (features associated with churn).
weights.sort_values(ascending = False)[:10].plot(kind='barh')

In [None]:
# Ten most negative weights (features associated with staying).
weights.sort_values(ascending = False)[-10:].plot(kind='barh')
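
Logistic regression weights act on the log-odds, so exponentiating a weight gives an odds ratio, which is often easier to explain. The sketch below fits a model on synthetic data whose generating process (the `logit` line) is a hypothetical assumption, standardises the features so the weights are comparable, and converts them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
month_to_month = rng.integers(0, 2, n)   # 1 = month-to-month contract
tenure_bin = rng.integers(0, 5, n)       # ordinal tenure code 0-4

# Hypothetical generating process: month-to-month raises churn, long tenure lowers it.
logit = -1.0 + 1.5 * month_to_month - 0.5 * tenure_bin
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pd.DataFrame({"Contract_Month-to-month": month_to_month, "tenure": tenure_bin})
# Standardise so each weight refers to a one-standard-deviation change.
X_scaled = StandardScaler().fit_transform(X)

model = LogisticRegression().fit(X_scaled, churn)
weights = pd.Series(model.coef_[0], index=X.columns)
# exp(weight) > 1 means higher odds of churn, < 1 means lower odds.
odds_ratio = np.exp(weights)
print(odds_ratio.round(2))
```

An odds ratio is a statement about association in the fitted model, not proof that changing the feature would change a customer's behaviour.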

# STEP 6. DISCUSSION
<a id="discussion"></a>

Bringing the analysis above together, we can draw some conclusions for each of the two groups:

***

**`THE LEAVING CUSTOMERS (Churn = Yes)`**


> 26.6% of customers in this sample left within the last month. Most of them have **no partner or dependents**. They usually use **fiber optic** internet, and they **don’t sign up for additional services other than Streaming movies/Streaming TV**.
> 
> Most of them are on a **Month-to-month contract** and leave as soon as the contract expires. They tend to use **paperless billing** and **electronic payment**, and their **bills are usually higher** than those of other customers.

**Group's characteristics:**

(a) **Demographic info about customers:**
* 26% are senior citizens
* 50% are men and 50% are women
* 64% have no partner
* 83% have no dependents

(b) **Services:**
* 70% use fiber optic internet, and only 6% have no Internet service
* 65-80% do not use additional internet services such as 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', and 'TechSupport'
* 50% don’t use Streaming movies/Streaming TV, and 40% do

(c) **Customer account information:**
* 89% are on a Month-to-month contract
* 75% use paperless billing
* 57% use electronic payment methods
* Most of them leave after only one month
* Their monthly charges are concentrated in the higher range, and 60% pay more than the average monthly charge
* The maximum total charges are 8685 dollars, but 47% are concentrated in the range of 0-100 dollars
***
**`THE STAYING CUSTOMERS (Churn = No):`**

> 73.4% of customers in this sample stayed within the last month.

**Group's characteristics:**

(a) **Demographic info about customers:** 
* 13% are senior citizens
* 51% are men and 50% are women
* 47% have no partner
* 66% have no dependents

(b) **Services:**
* The largest share uses DSL internet (38%), but the three options are fairly evenly split (Fiber optic 35%, No 27%)
* 36-40% do not use additional internet services, and a similar number of customers do use them

(c) **Customer account information:**
* 43% are on a Month-to-month contract, 32% on a Two year contract, and 25% on a One year contract
* 54% use paperless billing
* The four payment methods have roughly equal shares (Electronic check/ Bank transfer (automatic)/ Credit card (automatic)/ Mailed check)
* Tenure is concentrated around 1 and 72 months, with customers spread fairly evenly across the months in between
* Most of them do not spend much: monthly charges are concentrated around 20 dollars (range: 18-120 dollars)
* Most total charges are concentrated at the low end; the higher the total charges, the fewer the customers

# STEP 7. RECOMMENDATIONS
<a id="RECOMMENDATIONS"></a>

> **Launch Special Program**

1. SeniorCitizen: A very high percentage of senior citizens left within the last month (42%). Our analysis shows that senior citizens tend to have high monthly charges and rarely sign up for additional services. We can therefore develop a preferential program tailored to these characteristics.

2. Contract: The analysis shows that the longer the contract, the less likely the customer is to leave, so we can add incentives to long-term contracts to encourage customers to sign them.

> **Marketing Promotion**

* Customers with the following characteristics are more likely to leave than others: Senior Citizen/ No Partner/ No Dependents. The churn rate of senior citizens without a partner or dependents is about 50%. These customers share a common behavior: they rarely sign up for additional services such as OnlineSecurity and OnlineBackup. We can target advertising at this type of customer.


> **Questionnaire**

* Customers using fiber optic internet have a high chance of leaving (42%). The average monthly charge for customers using fiber optic internet is about 33 dollars higher than for customers using DSL internet (DSL: 58.09/ Fiber optic: 91.5). Price may not be the only reason these customers leave; they may also be dissatisfied with the service itself. It would be a great help to survey leaving customers about whether they encountered any problems in use or have any suggestions for the service.

# Thanks for reading! :)