# Exploratory Data Analysis on the Telco Churn Data Set


This notebook demonstrates a basic exploratory data analysis (EDA) on a sample data set of Telco customer data, which flags the customers who left within the last month.

We will perform the EDA in Python using common libraries such as
- __NumPy__: for basic operations on numerical data
- __pandas__: for reading, analyzing and transforming data in DataFrames
- __seaborn__: for visualizing data in charts and plots


In [None]:
#import the required libraries
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.ticker as mtick  
import matplotlib.pyplot as plt
%matplotlib inline

import os
print("Input data file:  ",os.listdir("../input"))

sns.set(style = 'white')



First we load the data into a pandas DataFrame object that we call "telco_base_data".<br>
See the pandas reference for an explanation of what a DataFrame is.<br>
This tutorial is also a good introduction to DataFrames:<br> https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python

In [None]:
telco_base_data = pd.read_csv('../input/WA_Fn-UseC_-Telco-Customer-Churn.csv')

## Get basic information on the shape and nature of the data
Show the variables included in the data set, which are now the columns of our DataFrame: telco_base_data.columns

In [None]:
print("\n".join(telco_base_data.columns.values))

Look at the first 5 records of the data

In [None]:
telco_base_data.head()

Check the shape (rows, cols) of the data frame 

In [None]:
telco_base_data.shape

Check the data types of all the columns. int64 (integer values) and float64 (floating-point values) are numeric data types; object can hold arbitrary Python objects and, in our case, holds strings.

In [None]:
telco_base_data.dtypes
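
As a minimal sketch of how pandas assigns these data types, the following uses a hypothetical three-row sample (the values are made up; only the column names mirror the Telco data). It also shows that a column of numbers stored as text is not treated as numeric.

```python
import pandas as pd

# Hypothetical sample; only the column names are taken from the Telco data.
sample = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0],                # whole numbers -> integer dtype
    "MonthlyCharges": [29.85, 56.95, 53.85],   # decimals -> float dtype
    "TotalCharges": ["29.85", " ", "108.15"],  # text -> not a numeric dtype
})

print(sample.dtypes)

# select_dtypes keeps only the columns pandas considers numeric
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

TotalCharges is missing from the numeric columns because its values are stored as strings.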

## Check the descriptive statistics of numeric variables

A DataFrame has the method *describe*, which computes basic descriptive statistics for the __numeric__ columns of the DataFrame.


In [None]:

telco_base_data.describe()
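
The following sketch, on a small made-up sample, illustrates that *describe* skips text columns by default and that its 75% row is the same value as the 0.75 quantile. The option `include="all"` is one way to get a summary of the text columns as well.

```python
import pandas as pd

# Hypothetical sample values, chosen for illustration only.
sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 72],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year", "Two year"],
})

# Default: only the numeric columns are summarized
numeric_summary = sample.describe()
print(numeric_summary)

# The 75% row is the third quartile of each column
third_quartile = sample["tenure"].quantile(0.75)
print(third_quartile)

# include="all" adds count/unique/top/freq for the text columns
full_summary = sample.describe(include="all")
print(full_summary)
```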

### First Findings 

SeniorCitizen is actually a categorical variable (0 or 1), so its 25%-50%-75% quartiles are not meaningful.

75% of customers have a tenure of 55 months or less.

The average monthly charge is USD 64.76, whereas 25% of customers pay more than USD 89.85 per month.

## Data Cleaning


**1.** Create a copy of the base data for manipulation and processing

In [None]:
telco_data = telco_base_data.copy()

**2.** Total Charges should be a numeric amount. Let's convert it to a numeric data type. With errors='coerce', values that cannot be parsed as numbers become NaN.

In [None]:
telco_data.TotalCharges = pd.to_numeric(telco_data.TotalCharges, errors='coerce')
telco_data.isnull().sum()
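
A minimal sketch of what `errors='coerce'` does, using a made-up series in which one entry is a blank string:

```python
import pandas as pd

# Hypothetical raw values; the blank string stands for a missing amount.
raw = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

# Unparsable entries are turned into NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")
print(converted)

missing_count = converted.isnull().sum()
print(missing_count)
```

Without `errors='coerce'`, the blank entry would make the conversion raise an error.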

**3.** As we can see, there are 11 missing values in the TotalCharges column. Let's check these records.

In [None]:
telco_data.loc[telco_data['TotalCharges'].isnull()]

**4. Missing Value Imputation**

Since these records make up only a very small share of the data set (about 0.16%), we could simply drop them from further processing.
Dropping missing values is done with the method *dropna*, e.g. telco_data.dropna(how = 'any', inplace = True)<br><br>
Alternatively, we can impute the missing values by replacing them with the mean of Total Charges.<br>
For this, we select the rows where TotalCharges is NaN (Not a Number, i.e. a missing numeric value) with the method *isna* and assign them the mean of all other values in that column.



In [None]:
mean_value_totalcharge = telco_data['TotalCharges'].mean()
# replace the missing values with the column mean
telco_data.loc[telco_data['TotalCharges'].isna(), 'TotalCharges'] = mean_value_totalcharge
# show the imputed rows
telco_data.loc[telco_data['TotalCharges'] == mean_value_totalcharge]
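
The two options, dropping versus imputing, can be compared side by side on a made-up three-row sample. This is only a sketch; `fillna` is used here as an alternative to the `loc` assignment in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
sample = pd.DataFrame({"TotalCharges": [100.0, np.nan, 300.0]})

# Option 1: drop the rows with a missing TotalCharges
dropped = sample.dropna(subset=["TotalCharges"])

# Option 2: replace the missing value with the column mean
# (mean() skips NaN, so it is computed from the two known values)
mean_value = sample["TotalCharges"].mean()
imputed = sample.fillna({"TotalCharges": mean_value})

print(len(dropped))
print(imputed)
```

Dropping shrinks the data set, while imputing keeps every row but adds values that were never observed.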

**5.** Data Binning

Let's look at the tenure data to understand how the values are distributed and how they relate to MonthlyCharges.
We do so by plotting them in a scatter plot, here implemented with the seaborn function *lmplot* (with the regression line switched off).

Finally, we divide customers into bins based on tenure: a tenure of 1 to 12 months is assigned the tenure group "0 y", 13 to 24 months the group "1 y", 25 to 36 months the group "2 y", and so on. Customers with a tenure of 0 months fall outside these bins and get no tenure group.

In [None]:
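# print the statistics explicitly; otherwise only the last expression of a cell is displayed
print(telco_data.tenure.describe())
# scatter plot of the first 250 customers, colored by churn
sns.lmplot(x="tenure", y="MonthlyCharges", data=telco_data.head(250), hue="Churn", aspect=2, height=7, fit_reg=False)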

In [None]:
# Get the max tenure
print(telco_data['tenure'].max()) #72

# Group the tenure in bins of 12 months: [1, 13), [13, 25), ..., [61, 73)
labels = ["{0} y".format(i) for i in range(0, 6, 1)]

telco_data['tenure_group'] = pd.cut(telco_data.tenure, range(1, 80, 12), right=False, labels=labels)

In [None]:
telco_data['tenure_group'].head(10)
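
To see where the bin boundaries fall, the same edges and labels can be applied to a few hand-picked tenure values. The values are made up for illustration; the edges and labels are the ones used above.

```python
import pandas as pd

labels = ["{0} y".format(i) for i in range(0, 6)]
edges = list(range(1, 80, 12))  # [1, 13, 25, 37, 49, 61, 73]

# Hand-picked tenure values around the bin boundaries
tenure = pd.Series([0, 1, 12, 13, 24, 25, 72])

# right=False makes each bin include its left edge and exclude its right edge
groups = pd.cut(tenure, edges, right=False, labels=labels)
print(pd.DataFrame({"tenure": tenure, "tenure_group": groups}))
```

A tenure of 12 months still belongs to "0 y", and a tenure of 0 is not covered by any bin, so its group is missing.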

**6.** Remove columns not required for processing

In [None]:
#drop column customerID and old tenure values
telco_data.drop(columns=['customerID', 'tenure'], inplace=True)
telco_data.head()

## Data Exploration / Visualization
**1.** Plot the distribution of the individual predictors by churn using seaborn's *countplot*.
This function automatically counts the values in a column *x* and shows them split by the target value 'Churn'.

In [None]:
for i, predictor in enumerate(telco_data.drop(columns=['Churn', 'TotalCharges', 'MonthlyCharges'])):
    plt.figure(i)
    sns.countplot(data=telco_data, x=predictor, hue='Churn')
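
The numbers behind such a count plot can also be shown as a table. The sketch below uses `pd.crosstab` on a made-up sample to produce the counts per category and churn value, which is the information that *countplot* draws as bars.

```python
import pandas as pd

# Hypothetical sample; the category names follow the Telco data.
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "One year", "Month-to-month"],
    "Churn": ["Yes", "No", "No", "No", "Yes"],
})

# One row per contract type, one column per churn value
counts = pd.crosstab(sample["Contract"], sample["Churn"])
print(counts)
```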

**2.** Convert the target variable 'Churn' into a binary numeric variable, i.e. Yes = 1, No = 0

In [None]:
telco_data['Churn'] = np.where(telco_data.Churn == 'Yes',1,0)

In [None]:
telco_data.head()
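
A small sketch of the conversion on made-up values. One convenient side effect of the 0/1 encoding is that the mean of the encoded column equals the churn rate.

```python
import numpy as np
import pandas as pd

# Hypothetical churn labels
churn = pd.Series(["Yes", "No", "No", "Yes"])

# np.where(condition, value_if_true, value_if_false), applied element-wise
encoded = np.where(churn == "Yes", 1, 0)
print(encoded)

# Share of customers that churned
churn_rate = encoded.mean()
print(churn_rate)
```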

**3.** Convert all the categorical variables into dummy variables

In [None]:
telco_data_dummies = pd.get_dummies(telco_data)
telco_data_dummies.head()
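
As an illustration of what *get_dummies* produces, here is a made-up sample with one numeric and one categorical column. Numeric columns are passed through unchanged, and each category becomes its own indicator column named `<column>_<value>`.

```python
import pandas as pd

# Hypothetical sample; the category names follow the Telco data.
sample = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

dummies = pd.get_dummies(sample)
print(dummies.columns.tolist())
print(dummies)
```

Depending on the pandas version, the indicator columns are stored as booleans or as 0/1 integers.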

**4.** Relationship between Monthly Charges and Total Charges

In [None]:
sns.lmplot(data=telco_data_dummies, x='MonthlyCharges', y='TotalCharges', fit_reg=False)

Total Charges increase as Monthly Charges increase - as expected.

**5.** Churn by Monthly Charges and Total Charges

Here we use a kernel density estimate (KDE) plot, created with the seaborn method *kdeplot*. It visualizes the distribution of the observations in a data set as a smooth curve, analogous to a histogram.
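
To give an idea of what a KDE computes, the following sketch implements a Gaussian kernel density estimate with NumPy only. The helper `gaussian_kde_1d`, the sample values and the fixed bandwidth of 10 are assumptions made for illustration; seaborn chooses its bandwidth automatically, so its curves will not match this sketch exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Made-up monthly charges
charges = [20.0, 25.0, 70.0, 80.0, 85.0, 90.0]
grid = np.linspace(-50.0, 160.0, 2101)
density = gaussian_kde_1d(charges, grid, bandwidth=10.0)

# A density integrates to (approximately) 1 over its range
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Unlike a histogram, the result does not depend on bin edges, but it does depend on the chosen bandwidth.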

In [None]:
# fill=True shades the area under the curve; it replaces the deprecated shade=True (seaborn >= 0.11)
Mth = sns.kdeplot(telco_data_dummies.MonthlyCharges[(telco_data_dummies["Churn"] == 0)],
                color="Red", fill=True)
Mth = sns.kdeplot(telco_data_dummies.MonthlyCharges[(telco_data_dummies["Churn"] == 1)],
                ax=Mth, color="Blue", fill=True)
Mth.legend(["No Churn","Churn"],loc='upper right')
Mth.set_ylabel('Density')
Mth.set_xlabel('Monthly Charges')
Mth.set_title('Monthly charges by churn')

**Insight:** Churn is high when Monthly Charges are high

In [None]:
# fill=True shades the area under the curve; it replaces the deprecated shade=True (seaborn >= 0.11)
Tot = sns.kdeplot(telco_data_dummies.TotalCharges[(telco_data_dummies["Churn"] == 0)],
                color="Red", fill=True)
Tot = sns.kdeplot(telco_data_dummies.TotalCharges[(telco_data_dummies["Churn"] == 1)],
                ax=Tot, color="Blue", fill=True)
Tot.legend(["No Churn","Churn"],loc='upper right')
Tot.set_ylabel('Density')
Tot.set_xlabel('Total Charges')
Tot.set_title('Total charges by churn')

**Surprising insight:** Churn is higher at lower Total Charges.

However, if we combine the insights from the three parameters, i.e. tenure, Monthly Charges and Total Charges, the picture becomes clearer: a higher Monthly Charge at a lower tenure results in a lower Total Charge. Hence, all three factors, **Higher Monthly Charge**, **Lower tenure** and **Lower Total Charge**, are linked to **High Churn**.
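
The reasoning can be made concrete with a rough calculation. It assumes that Total Charges are approximately tenure times Monthly Charges, which is a simplification because prices can change over time. The helper `approx_total_charges` and the two customers are hypothetical.

```python
def approx_total_charges(tenure_months, monthly_charge):
    """Rough estimate: total charges ~ tenure x monthly charge (an assumption)."""
    return tenure_months * monthly_charge

# New customer on an expensive plan: 3 months at USD 95
new_customer = approx_total_charges(3, 95.0)

# Long-term customer on a cheaper plan: 60 months at USD 55
long_term_customer = approx_total_charges(60, 55.0)

print(new_customer, long_term_customer)
```

Despite the higher monthly charge, the new customer has by far the lower total, which is why low Total Charges and high churn can appear together.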

In [None]:
#determine correlations
correlations = telco_data_dummies.corr()['Churn']
# now show the values except for column 'Churn' because correlation of Churn to Churn is 1.0
plt.figure(figsize=(20,10))
correlations.drop('Churn').sort_values(ascending = False).plot(kind='bar')
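
As a sketch of what the bars represent, the following computes the same kind of correlation on a made-up sample of 0/1 columns. *corr* uses the Pearson correlation by default, so a positive value means the dummy variable tends to be 1 when Churn is 1, and a negative value means the opposite.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 data; the column names follow the dummy naming scheme.
sample = pd.DataFrame({
    "Churn": [1, 1, 0, 0, 1, 0],
    "Contract_Month-to-month": [1, 1, 0, 0, 1, 1],
    "Contract_Two year": [0, 0, 1, 1, 0, 0],
})

# Correlation of every column with Churn, without Churn itself
correlations = sample.corr()["Churn"].drop("Churn")
print(correlations.sort_values(ascending=False))

# Cross-check one value against NumPy's Pearson correlation
check = np.corrcoef(sample["Churn"], sample["Contract_Month-to-month"])[0, 1]
print(check)
```

Correlation only describes association in the data; it does not show that a contract type causes churn.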

**Derived Insight:**

**HIGH** Churn is seen in case of **Month-to-month contracts**, **No online security**, **No tech support**, **First year of subscription** and **Fiber optic internet**

**LOW** Churn is seen in case of **Long-term contracts**, **Subscriptions without internet service** and **Customers engaged for 5+ years**

Factors like **Gender**, **Availability of PhoneService** and **Number of multiple lines** have almost **NO** impact on Churn
