In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Telecom Churn Case Study

### Introduction
#### - What is churn

In industries built on subscriptions or yearly-renewal contracts, the **churn rate** is the proportion of customers who leave the company during a given time period.
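
As a minimal sketch of this definition, the rate is the number of customers lost during the period divided by the number of customers at the start of the period. The function name `churn_rate` and the numbers are illustrative assumptions, not taken from the dataset:

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the fraction of customers who left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical example: 1,000 customers at the start of the month, 50 of whom left.
example_rate = churn_rate(1000, 50)
print(f"Churn rate: {example_rate:.1%}")
```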

<br>

#### - Why it is important to analyze the churn rate

By observing the churn rate and studying what distinguishes the customers who stay from those who leave, a company can adjust its products, services, or marketing strategy to be more competitive in the market.

<br>

#### - Purpose of this case study

Based on the above, the main purposes of this case study are:
1. Understand the important features of customers who leave
2. Build a model that predicts the probability that a customer will churn, given his/her data

<br>


### Background Information

#### - Sources of data

The dataset used in this case study is the Telco Customer Churn dataset from Kaggle.

<br>

#### - Describe the data

Each row represents a customer; each column contains a customer attribute described in the column metadata.

The data set includes information about:

- Customers who left within the last month - the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

<br>

#### - Work Flow

This project is divided into the following parts:
1. Churn
2. Data analysis (with EDA)
3. Prediction model building
4. Deploy final model

We first run an EDA to find the insights this dataset offers; we then build machine learning models that predict the probability of churn from a customer's information.

# Ask 
#### Business Task
Predict each customer's probability of churn in order to develop focused customer retention programs.

<br>

#### Key stakeholders
Who will be interested in, and benefit from, this case study:
- Management
- Marketing Team
- Operation Team 

<br>

#### Questions that help us understand and reach the main tasks
- What are the important features behind the churn rate?
  - Who is most likely to churn?
  - What can be improved in order to keep customers?

# Prepare

**Import libraries**


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('ticks')
plt.rcParams["figure.figsize"] = (10,8)

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

**Load data file**


In [None]:
df = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
df.head()

In [None]:
df.shape

**Observations:**

The dataset has 7043 records.

In [None]:
df.columns

In [None]:
df.describe()

**Observations:**
- SeniorCitizen is a categorical column, since it only contains 1 and 0;
- 75% of customers have a tenure of 55 months or less, and the mean tenure is 32.4 months;
- The average monthly charge is \$64.76, and 25% of customers pay more than \$89.85 per month

Let's take a look at the target class.

In [None]:
plt.figure(figsize=(18,6))
df["Churn"].value_counts().plot(kind="barh")
plt.xlabel("Count")
plt.ylabel("Class")
plt.title("Count of Target Variable")

df["Churn"].value_counts()

Since the target class only takes the values Yes or No, this is a binary classification problem. **The target variable is also imbalanced across its classes; we have to take this imbalance into account in the later analysis and address it before building our models.**
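
One common way to account for the imbalance is to weight each class inversely to its frequency, sketched here on made-up labels rather than the real data. The helper `balanced_class_weights` is hypothetical; it uses the heuristic `n_samples / (n_classes * class_count)`, the same formula scikit-learn documents for `class_weight="balanced"`:

```python
import pandas as pd


def balanced_class_weights(labels: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = labels.value_counts()
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels with a 3:1 imbalance (not the real Churn column).
toy_churn = pd.Series(["No"] * 75 + ["Yes"] * 25)

# Class proportions, then the weight each class would receive.
print(toy_churn.value_counts(normalize=True))
weights = balanced_class_weights(toy_churn)
print(weights)
```

The rarer class receives the larger weight. Resampling the training data is another option; which approach suits this dataset best is left open here.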

In [None]:
# Feature dtypes
def df_summary(df):
    '''
    Take a dataframe and return a summary table with details of each column.
    '''
    # Create a dataframe called summary, indexed by column name
    summary = pd.DataFrame(df.dtypes, columns=['dtype'])

    # Number of missing values per feature
    summary['num_missing'] = df.isna().sum().values

    # Number of unique values per feature
    summary['num_uniques'] = df.nunique().values

    return summary


In [None]:
summary = df_summary(df)
summary

In [None]:
print("Unique values of each column: ")
for col in df.columns:
  print(f"{col}: \n{df[col].unique()}\n")

**Observations:**
- There seem to be no N/A values in the dataset, but we still have to check the string columns, since missing data may have been entered as string labels.
- TotalCharges should be numeric; we have to change its dtype.
- Tenure is counted in months; we could group it into years and cross-check it against the Contract column.
- In the MultipleLines column, "No phone service" has a similar meaning to "No" and should be merged into "No". The same logic applies to "No internet service" in "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", and "StreamingMovies".
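
The two string-related checks above can be sketched on a toy frame. The values below are made up and only mimic the shape of the raw columns; the variable names are illustrative:

```python
import pandas as pd

# Toy frame mimicking the raw columns (values are made up; every column holds strings).
toy = pd.DataFrame({
    "MultipleLines": ["No phone service", "No", "Yes"],
    "OnlineSecurity": ["No internet service", "Yes", "No"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# 1. Count cells that are empty or whitespace-only (hidden missing values).
blank_counts = toy.apply(lambda col: col.str.strip().eq("").sum())
print(blank_counts)

# 2. Collapse the "No ... service" labels into a plain "No".
service_cols = ["MultipleLines", "OnlineSecurity"]
toy[service_cols] = toy[service_cols].replace(
    {"No phone service": "No", "No internet service": "No"}
)
print(toy)
```

Whether merging the labels helps a model is a judgment call: it removes categories that duplicate information already carried by PhoneService and InternetService.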

# Data Cleaning and Analysis

**Create a copy of the dataset so that the original remains unchanged**


In [None]:
df_data = df.copy()

In [None]:
df_data['TotalCharges'] = pd.to_numeric(df_data['TotalCharges'], errors='coerce')
summary.loc["TotalCharges","dtype"] = "float64"

Since we have converted TotalCharges to numeric, let's check for N/A data again.

In [None]:
df_data.isna().sum() 

Let's take a closer look at the N/A data.

In [None]:
df_data[df_data["TotalCharges"].isna()]

Nothing looks very special here, so let's drop these rows.

In [None]:
df_data.dropna(inplace=True)

In [None]:
df_data.isna().sum() 

**Next is the tenure column**
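
A minimal sketch of how 12-month tenure bins behave at their edges, using the same bin edges and labels as the grouping cell in this section. The tenure values are hand-picked for illustration, not taken from the dataset:

```python
import pandas as pd

# Same edges and labels as the tenure grouping used in this notebook.
edges = list(range(1, 80, 12))                          # 1, 13, 25, ..., 73
labels = [f"{i} - {i + 11}" for i in range(1, 72, 12)]  # '1 - 12', ..., '61 - 72'

# Hand-picked tenures (in months) chosen to sit on or near bin edges.
toy_tenure = pd.Series([0, 1, 12, 13, 72])

# right=False makes each bin closed on the left and open on the right.
groups = pd.cut(toy_tenure, edges, right=False, labels=labels)
print(pd.DataFrame({"tenure": toy_tenure, "tenure_group": groups}))
```

With these edges a tenure of 0 falls outside every bin and is left as NaN, so it is worth checking `df_data["tenure"].min()` before binning.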

In [None]:
df_data["tenure"].max()

In [None]:
labels = [f"{i} - {i+11}" for i in range(1,72,12)]
df_data['tenure_group'] = pd.cut(df_data.tenure, range(1, 80, 12), right=False, labels=labels)

In [None]:
df_data['tenure_group'].value_counts()

For this case study, customerID will not be used; it will be dropped together with "tenure".

In [None]:
df_data.drop(columns=["tenure","customerID"],inplace=True)

In [None]:
df_data.head()

## Data Exploration

### Univariate Analysis

In [None]:
summary

In [None]:
# Put the columns into groups 
customer_count = ['gender', 'SeniorCitizen', 'Partner', 'Dependents']
contract_count = [ 'Contract', 'PaperlessBilling', 'PaymentMethod'] #  'MonthlyCharges' and 'TotalCharges' are not categorical columns.
phoneser_count = ['PhoneService', 'MultipleLines']
internetser_count = [ 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

In [None]:
def uni_plot(categorical_list):
  df_categoric = df.loc[:, categorical_list]
  for i in categorical_list:
    # One figure per column; an extra bare plt.figure() call would leave an empty figure behind
    plt.figure(figsize=(10,8))
    sns.countplot(x = i, data = df_categoric)
    plt.title(f"Distribution of {i}")
    # plt.xticks(rotation = 45)

In [None]:
uni_plot(customer_count)

**Observation:**
- Gender and Partner are each evenly distributed,
- Most of our customers are not senior citizens; only about 1/6 are
- About ⅔ of our customers have no dependents and about ⅓ have dependents


In [None]:
uni_plot(contract_count)

**Observation:**
- Around half of our customers signed a **month-to-month contract** with us.
- Electronic check is the most common payment method.


In [None]:
uni_plot(phoneser_count)

**Observation:**
- Most of our customers use our phone service; among them, around ½ have multiple lines.

In [None]:
uni_plot(internetser_count)

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x='tenure_group', data=df_data)
plt.title("Distribution of tenure_group")

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x='tenure_group', hue='Churn', data=df_data)
plt.title("Distribution of tenure_group by Churn")

### Bivariate Analysis


In [None]:
# Put the columns into groups 
customer_col = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', "Churn"]
contract_col = [ 'Contract', 'PaperlessBilling', 'PaymentMethod', "Churn"] #  'MonthlyCharges' and 'TotalCharges' are not categorical columns.
phoneser_col = ['PhoneService', 'MultipleLines', "Churn"]
internetser_col = [ 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', "Churn"]

In [None]:
def category_plot(categorical_list):
  df_categoric = df.loc[:, categorical_list]
  for i in categorical_list:
    # One figure per column; an extra bare plt.figure() call would leave an empty figure behind
    plt.figure(figsize=(10,8))
    sns.countplot(x = i, data = df_categoric, hue = "Churn")
    plt.title(f'{i} in term of CHURN')
    # plt.xticks(rotation = 45)

In [None]:
category_plot(customer_col)

**Observation:**
- Senior citizens seem to have a higher churn ratio,
- Customers without a partner seem more likely to leave the company, and the same holds for customers without dependents. Perhaps stability is not the first consideration for this group of customers.


In [None]:
category_plot(contract_col)

**Observation:**
- As expected, customers with short-term contracts are more likely to churn, and vice versa.
- Customers who pay by electronic check are also more likely to leave. However, electronic check is also the most common payment type, so we have to dig deeper to check whether there is a real relationship between payment method and churn.
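
One way to separate "more common" from "more likely to churn" is to compare churn rates within each group rather than raw counts. The sketch below uses made-up records; only the column names `PaymentMethod` and `Churn` come from the dataset:

```python
import pandas as pd

# Made-up records: the larger group has more churners but the same churn rate.
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check"] * 6 + ["Mailed check"] * 2,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
})

# Raw counts: larger groups show more churners even at the same churn rate.
counts = pd.crosstab(toy["PaymentMethod"], toy["Churn"])

# Row-normalised table: share of churners within each payment method.
rates = pd.crosstab(toy["PaymentMethod"], toy["Churn"], normalize="index")

print(counts)
print(rates)
```

Applied to the real data, the `Yes` column of the row-normalised table would show whether electronic-check customers churn at a higher rate, independent of how many of them there are.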


In [None]:
category_plot(phoneser_col)

**Observation:**
- Most of the customers use the phone service provided by the company.


In [None]:
category_plot(internetser_col)

**Observation:**
- Customers who use the fiber optic internet service are much more likely to churn than customers in the other two InternetService categories. Maybe something is wrong with the company's fiber optic service?



### Data Converting

#### Feature variable

In [None]:
df_data['gender'] = df_data['gender'].map({"Male":1,"Female":0}).astype("int")
df_data["Partner"] = df_data["Partner"].map({"No":0,"Yes":1}).astype("int")
df_data["Dependents"] = df_data["Dependents"].map({"No":0,"Yes":1}).astype("int")
df_data["PhoneService"] = df_data["PhoneService"].map({"No":0,"Yes":1}).astype("int")
df_data["PaperlessBilling"] = df_data["PaperlessBilling"].map({"No":0,"Yes":1}).astype("int")


#### Target variable

In [None]:
df_data["Churn"] = df_data["Churn"].map({"Yes":1, "No":0}).astype("int")

# another way to convert
# df_data["Churn"] = np.where(df_data["Churn"] == "Yes", 1, 0)

In [None]:
df_data

#### Categorical columns

In [None]:
data_dummies = pd.get_dummies(df_data)

In [None]:
data_dummies

In [None]:
plt.figure(figsize=(10,8))
Mth = sns.kdeplot(data_dummies.MonthlyCharges[(data_dummies["Churn"] == 0) ],
                color="Red", fill = True)  # fill replaces the deprecated shade argument
Mth = sns.kdeplot(data_dummies.MonthlyCharges[(data_dummies["Churn"] == 1) ],
                ax =Mth, color="Blue", fill= True)
Mth.legend(["No Churn","Churn"],loc='upper right')
Mth.set_ylabel('Density')
Mth.set_xlabel('Monthly Charges')
Mth.set_title('Monthly charges by churn')

**Observation:**
- Churn is more common at higher monthly charges. This may be related to the fiber optic finding, since fiber optic is normally more expensive than other internet services.

In [None]:
plt.figure(figsize=(10,8))
Tot = sns.kdeplot(data_dummies.TotalCharges[(data_dummies["Churn"] == 0) ],
                color="Red", fill = True)  # fill replaces the deprecated shade argument
Tot = sns.kdeplot(data_dummies.TotalCharges[(data_dummies["Churn"] == 1) ],
                ax =Tot, color="Blue", fill= True)
Tot.legend(["No Churn","Churn"],loc='upper right')
Tot.set_ylabel('Density')
Tot.set_xlabel('Total Charges')
Tot.set_title('Total charges by churn')

**Observation:**
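- Churn is higher at lower total charges, which is surprising at first. One possible explanation is that total charges accumulate with tenure, so low totals largely reflect short-tenure customers.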

### Data Correlation 

In [None]:
plt.figure(figsize=(20,6))
data_dummies.corr()['Churn'].sort_values(ascending = False).plot(kind='bar')

**Observation:**
- Features associated with a lower churn rate:
  - Longer contract terms and longer tenure,
  - "No internet service" in "Device Protection", "Streaming Movies", "Streaming TV", "Tech Support", "Online Backup", "Online Security"

- Features that have a positive relation with the churn rate:
  - Month-to-month contracts and short tenure,
  - Having internet service without "Online Security", "Tech Support", "Online Backup", or "Device Protection"
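
The ranking behind these observations comes from one-hot encoding the categorical columns and correlating each resulting column with the target. A minimal sketch on made-up data; the toy frame and its values are assumptions, only the column names echo the dataset:

```python
import pandas as pd

# Made-up data: month-to-month customers churn more often in this toy sample.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "Two year", "Two year", "Two year"],
    "Churn": [1, 1, 0, 0, 0, 0],
})

# One-hot encode the categorical column; dtype=int keeps the dummies numeric.
toy_dummies = pd.get_dummies(toy, dtype=int)

# Pearson correlation of each dummy with the target, sorted like the bar chart.
churn_corr = toy_dummies.corr()["Churn"].drop("Churn").sort_values(ascending=False)
print(churn_corr)
```

A positive value means the category appears more often among churners in the sample, a negative value less often. These are associations only and do not by themselves show that a feature causes or prevents churn.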

### Observation Sum-up
- Customers without a partner or dependents are more likely to churn.
- Customers with short tenure and a month-to-month contract are more likely to churn.
- Customers who use the fiber optic internet service are the most likely to churn among the internet service groups.


In [None]:
data_dummies.to_csv("telecom-churn_dummies_2.csv",index = False)