<h1 align='center'> Customer Churn Prediction </h1>

<img src='https://kranthi.me/wp-content/uploads/2020/04/Telecom_Churn_Prediction-e1587281300645.jpg' align='center'>

## Table of Contents

### [1. Introduction](#1)
### [2. Variable Description](#1)
### [3. Importing Libraries](#1)
### [4. Basic Understanding Dataset](#1)
### [5. Data Preprocessing](#1)
### [6. Exploratory Data Analysis](#1)

<a id=1 > </a>
## 1. Introduction

In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business goal.

To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.

In this project, we will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

### 2. Variable Description

1. **customerID** = ID number of each customer
2. **Gender** = Male or Female
3. **SeniorCitizen** = Whether the customer is a senior citizen or not (1, 0)
4. **Partner** = Whether the customer has a partner or not (Yes, No)
5. **Dependents** = Whether the customer has dependents or not (Yes, No)
6. **Tenure** = Number of months the customer has stayed with the company
7. **PhoneService** = Whether the customer has a phone service or not (Yes, No)
8. **MultipleLines** = Whether the customer has multiple lines or not (Yes, No, No phone service)
9. **InternetService** = Customer’s internet service provider (DSL, Fiber optic, No)
10. **OnlineSecurity** = Whether the customer has online security or not (Yes, No, No internet service)
11. **OnlineBackup** = Whether the customer has online backup or not
12. **DeviceProtection** = Whether the customer has device protection or not
13. **TechSupport** = Whether the customer has tech support or not
14. **StreamingTV** = Whether the customer has streaming TV or not
15. **StreamingMovies** = Whether the customer has streaming movies or not
16. **Contract** = Contract options for the customer (Month-to-month, One year, Two year)
17. **PaperlessBilling** = Whether the customer uses paperless billing or not
18. **PaymentMethod** = Payment method the customer uses
19. **MonthlyCharges** = Amount of monthly charges
20. **TotalCharges** = Amount of total charges
21. **Churn** = Whether the customer churned or stayed

### 3. Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
plt.style.use('ggplot')

pd.options.display.max_columns = 100
pd.options.display.max_rows = 9000

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('churn_data.csv')

### 4. Basic Understanding Dataset

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

- We see that the **SeniorCitizen** column is stored as an integer (1, 0) although it is really a category, and **TotalCharges** is stored as object although it holds numbers. We will need to reformat these columns in order to analyse them properly.
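A numeric column that shows up as object usually contains at least one value that cannot be parsed as a number. The sketch below shows one way to confirm this, using a made-up three-row frame (`toy` and its values are assumptions for illustration, not rows from `churn_data.csv`).

```python
import pandas as pd

# Toy frame mimicking the two columns discussed above (values are made up).
toy = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0],
    "TotalCharges": ["29.85", " ", "108.15"],  # the blank string hides a missing value
})

print(toy.dtypes)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
total_charges = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(total_charges.isna().sum())
print(total_charges.dtype, n_missing)
```

`pd.to_numeric(..., errors="coerce")` is an alternative to replacing the blank strings and then casting; both routes should end with the same float column.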

In [None]:
df.describe()

### Checking Missing Values

In [None]:
df.isnull().sum()

### Data Preprocessing

#### Deleting Redundant Columns

In [None]:
df.drop('customerID',axis=1,inplace=True)

#### Rename Columns

In [None]:
df.rename(columns={'gender':'Gender','tenure':'Tenure'},inplace=True)

#### Reformatting Our columns

In [None]:
df['SeniorCitizen'] = np.where(df['SeniorCitizen']==1,'Yes','No') ## changed the value to categorical for analysis

In [None]:
df['TotalCharges'] = df['TotalCharges'].replace(" ",np.nan)  # assign back instead of using inplace on a column selection

In [None]:
df['TotalCharges'] = df['TotalCharges'].astype('float64')

In [None]:
df.isnull().sum()

- After replacing the blank strings with NaN, 11 values appear as null. Total Charges is roughly Tenure multiplied by Monthly Charges, so let's look at these rows before deciding how to handle them.

In [None]:
df[df['TotalCharges'].isnull()] 

- We see that TotalCharges is not missing at random: wherever TotalCharges is missing, Tenure is 0, so these are presumably new customers who have not been billed yet.
- Since only 11 rows are affected, we can simply delete them.

In [None]:
df.dropna(axis=0,inplace=True)  # We deleted rows that have any missing values.

In [None]:
df.head()

In [None]:
df.isnull().sum() # Now We have no missing values.
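The approximation Total Charges = Tenure multiplied by Monthly Charges used in this section can be checked directly. The following is a minimal sketch on hypothetical rows (the `toy` frame and its numbers are assumptions, not values from the dataset); on the real data the same expressions could be run on `df` before the rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real values come from churn_data.csv.
toy = pd.DataFrame({
    "Tenure": [0, 2, 10],
    "MonthlyCharges": [52.55, 53.85, 70.00],
    "TotalCharges": [np.nan, 108.15, 695.00],
})

# Rows with a missing total, together with their tenure.
missing = toy[toy["TotalCharges"].isna()]
print(missing[["Tenure", "MonthlyCharges"]])

# Relative gap between the recorded total and Tenure * MonthlyCharges.
approx = toy["Tenure"] * toy["MonthlyCharges"]
rel_error = ((toy["TotalCharges"] - approx).abs() / toy["TotalCharges"]).dropna()
print(rel_error.round(3))
```

If the relative error is small across the dataset, filling a missing total with `Tenure * MonthlyCharges` would be a reasonable alternative to deleting the rows.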

### Exploratory Data Analysis

### Churn Rate

- Let's see the ratio of customers who churned.

In [None]:
print('Percentage of churn rate :  ',round((df[df.Churn=='Yes']['Churn'].count()/ df['Churn'].count())*100),'%') ### Percentage of people who left 

In [None]:
sns.countplot(x='Churn',data=df)
plt.title('Churn Rate')
plt.show()
print('-'*100)
print((df['Churn'].value_counts().to_frame()))

**1869** customers churned while **5163** customers stayed.
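The same counts can be turned into percentages in one step with `value_counts(normalize=True)`. The sketch below rebuilds a `Churn` series from the two counts quoted above rather than reading the file, so the series itself is a stand-in for `df['Churn']`.

```python
import pandas as pd

# Stand-in for df['Churn'], built from the counts above: 1869 churned, 5163 stayed.
churn = pd.Series(["Yes"] * 1869 + ["No"] * 5163, name="Churn")

counts = churn.value_counts()
# normalize=True returns proportions instead of counts.
rates = churn.value_counts(normalize=True).mul(100).round(2)
print(pd.DataFrame({"count": counts, "percent": rates}))
```

With these counts, about 26.58% of customers churned, so the classes are noticeably imbalanced.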

## Numerical Values

### Distributions of Numerical Values

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 5), sharey=True)
for i,j in enumerate(df.describe().columns):
    sns.histplot(x=df[j],bins=30,color='red',ax=axes[i])  # histplot replaces the deprecated distplot
    axes[i].set_title(j)
plt.show()

### Insights 

* **Tenure**
  * Customers tend to have either a long or a short tenure.
  * Customers with a short tenure might be new customers. We need to analyse this further.
* **Monthly Charges**
  * The most common monthly charge is around 20.
* **Total Charges**
  * Total charges are concentrated in the lowest range, roughly 0-250.

### Boxplots

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 5), sharey=True)
for i,j in enumerate(df.describe().columns):
    sns.boxplot(ax=axes[i],x=df[j],color='red')
    axes[i].set_title(j)
plt.show()

#### Insights

#### *Based on the boxplots, none of the individual numeric columns appears to have outliers.*
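The visual reading of the boxplots can be backed by a count. Seaborn's boxplot whiskers extend 1.5 times the interquartile range by default, so the sketch below counts values outside that range. The helper `iqr_outlier_count` and the `toy` frame are hypothetical names and values introduced for illustration; one `MonthlyCharges` value is deliberately extreme.

```python
import pandas as pd

def iqr_outlier_count(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Made-up values; 400.0 is planted as an obvious outlier.
toy = pd.DataFrame({
    "Tenure": [1, 5, 12, 24, 36, 48, 60, 72],
    "MonthlyCharges": [20.0, 25.0, 45.0, 55.0, 70.0, 80.0, 95.0, 400.0],
})

outlier_counts = {col: iqr_outlier_count(toy[col]) for col in toy.columns}
print(outlier_counts)
```

Applied to the three numeric columns of `df`, a count of zero for each would confirm the conclusion drawn from the plots.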

## Categorical Bivariate Analysis

- Six binary features (Yes/No)
- Nine features with three unique values each (categories)
- One feature with four unique values

## Insights

#### Let's see how many categorical columns are in the dataset.

In [None]:
len(df.describe(include=object).columns)

In [None]:
(df.describe(include=object).columns).tolist()
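The breakdown into binary, three-level, and four-level features can be derived with `nunique()`. This is a small sketch on three illustrative columns (the `toy` frame and its rows are assumptions); on the real data the same two lines could be applied to the categorical columns of `df`.

```python
import pandas as pd

# Illustrative categorical columns with 2, 3 and 4 distinct values.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes", "No"],
    "InternetService": ["DSL", "Fiber optic", "No", "DSL"],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Bank transfer (automatic)", "Credit card (automatic)"],
})

# Number of distinct values in each column.
levels = toy.nunique()
# How many columns have 2, 3, 4, ... distinct values.
columns_per_level_count = levels.value_counts().sort_index()
print(levels)
print(columns_per_level_count)
```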

### CountPlot

Let's see the number of customers who churned for each categorical column.

In [None]:
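for i in df.describe(include=object).columns:
    plt.figure(figsize=(8,6))
    sns.countplot(x=i,data=df,hue='Churn')  # pass the column as x=; recent seaborn versions require keywords
    plt.title('Churn Count by' + ' '+ str(i))
    plt.tight_layout()  # must be called, and before plt.show()
    plt.show()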

###  Takeaways

### Gender
* There is no significant difference between genders. Male and female customers have the same churn ratio.
### SeniorCitizen
* Most of the customers are not senior citizens.
### Partners
* Customers who don't have partners are more likely to churn.
### Dependents 
* Customers without dependents are also more likely to churn.
### Phone Service 
* Our phone service might not be as good as customers expected, because more of the customers who churned have phone service. Since the plot shows counts rather than rates, this should be checked against the churn rate of each group.
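Count plots show absolute numbers, so a larger group can contain more churners even when its churn rate is no higher. A minimal sketch of comparing rates instead, using `pd.crosstab` on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up rows: six customers with phone service, two without.
toy = pd.DataFrame({
    "PhoneService": ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"],
    "Churn":        ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
})

# Raw counts, which is what a count plot displays.
churn_counts = pd.crosstab(toy["PhoneService"], toy["Churn"])

# normalize="index" converts each row to proportions, so groups of
# different sizes become comparable.
churn_share = pd.crosstab(toy["PhoneService"], toy["Churn"], normalize="index")
print(churn_counts)
print(churn_share)
```

In this toy example the group with phone service has three times as many churners, yet both groups churn at the same rate. Running the same `crosstab` on `df` for each categorical column would show whether the takeaways above hold as rates.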

### Correlations

In [None]:
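df['Churn'] =  np.where(df['Churn']=='Yes',1,0)  # encode the target as 1/0 so it enters the correlation matrix
sns.heatmap(df.corr(method='pearson',numeric_only=True),annot=True)  # numeric_only skips the categorical columns
plt.title('Correlations between Variables')
plt.show()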

### Results

- Tenure is negatively correlated with Churn
- Monthly Charges is positively correlated with Churn
- Total Charges is negatively correlated with Churn
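To read these relationships as numbers rather than from the heatmap, the `Churn` column of the correlation matrix can be extracted and sorted. The sketch below uses a made-up frame (`toy` is an assumption) and assumes pandas 1.5 or later, where `DataFrame.corr` accepts `numeric_only`.

```python
import pandas as pd

# Made-up rows with an already encoded 1/0 target.
toy = pd.DataFrame({
    "Tenure": [1, 2, 3, 40, 50, 60],
    "MonthlyCharges": [90.0, 85.0, 80.0, 30.0, 25.0, 20.0],
    "Contract": ["Month-to-month"] * 3 + ["Two year"] * 3,  # ignored by numeric_only
    "Churn": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of every numeric column with Churn, most negative first.
churn_corr = (
    toy.corr(method="pearson", numeric_only=True)["Churn"]
    .drop("Churn")
    .sort_values()
)
print(churn_corr)
```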

##### Save Our Dataframe for model building section

In [None]:
df.to_csv('preprocessed_churn.csv',index=False)  # index=False avoids writing the row index as an extra column