# Telco Churn Analysis

**Dataset Info:**
Sample data set containing Telco customer data, including a flag for customers who left within the last month.

In [None]:
#import the required libraries
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.ticker as mtick  
import matplotlib.pyplot as plt
%matplotlib inline




**Load the data file**

In [None]:
telco_base_data = pd.read_csv("/kaggle/input/telcom-churns-dataset/TelcoChurn.csv")

Look at the top 5 records of data

In [None]:
telco_base_data.head()

Check the various attributes of data like shape (rows and cols), Columns, datatypes

In [None]:
telco_base_data.shape

In [None]:
telco_base_data.columns.values

telco_base_data.columns.values returns the column names of the DataFrame as a NumPy array.

In [None]:
# Checking the data types of all the columns
telco_base_data.dtypes

telco_base_data.dtypes returns the data type of each column in the telco_base_data DataFrame.

In [None]:
# Check the descriptive statistics of numeric variables
telco_base_data.describe()

The telco_base_data.describe() method is used to generate descriptive statistics of a pandas DataFrame called telco_base_data. It provides summary statistics for each numerical column in the DataFrame, such as count, mean, standard deviation, minimum value, maximum value, and quartile information.

SeniorCitizen is actually a categorical (0/1) variable, so its 25%-50%-75% quartiles are not meaningful.

75% of customers have a tenure of less than 55 months.

Average monthly charges are USD 64.76, whereas 25% of customers pay more than USD 89.85 per month.
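The "25% of customers pay more than" reading comes from the 75% row of describe(), which is the 75th percentile. A minimal sketch on made-up charges (the values below are assumptions for illustration, not taken from the dataset):

```python
import pandas as pd

# Made-up monthly charges, for illustration only.
charges = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], name="MonthlyCharges")

# The "75%" row of describe() is the 75th percentile of the column.
q75 = charges.quantile(0.75)

# Share of values strictly above the 75th percentile: one quarter here.
share_above = (charges > q75).mean()

print(q75, share_above)
```

The same reading applies to tenure: the 75% row gives the value below which three quarters of the customers fall.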

In [None]:
telco_base_data['Churn'].value_counts().plot(kind='barh', figsize=(8, 6))
plt.xlabel("Count", labelpad=14)
plt.ylabel("Target Variable", labelpad=14)
plt.title("Count of TARGET Variable per category", y=1.02);
plt.savefig('churn_count_plot.png')


This cell draws a horizontal bar plot of the counts of each 'Churn' category in the telco_base_data DataFrame. The x-axis shows the count, the y-axis shows the target variable categories, and the figure is saved to churn_count_plot.png.

In [None]:
100*telco_base_data['Churn'].value_counts()/len(telco_base_data['Churn'])

This divides the count of each 'Churn' category by the total number of rows and multiplies by 100. The result is a pandas Series whose index holds the categories and whose values hold the corresponding percentages.
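As a hedged aside, the same percentages can be obtained with the normalize=True option of value_counts(). The sketch below uses a made-up 'Churn' series rather than the real data:

```python
import pandas as pd

# Made-up target values, for illustration only.
churn = pd.Series(["No", "No", "No", "Yes"], name="Churn")

# Manual version, as in the cell above: counts divided by the number of rows.
manual = 100 * churn.value_counts() / len(churn)

# Equivalent version: let pandas compute the proportions directly.
normalized = 100 * churn.value_counts(normalize=True)

print(manual)
print(normalized)
```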

In [None]:
telco_base_data['Churn'].value_counts()

telco_base_data['Churn'].value_counts() returns a pandas Series with the unique values of the 'Churn' column as the index and their counts as the values.

* Data is highly imbalanced, ratio = 73:27
* So we analyse the other features separately for each target value to get some insights.

In [None]:
# Concise Summary of the dataframe, as we have too many columns, we are using the verbose = True mode
telco_base_data.info(verbose = True) 

The telco_base_data.info(verbose=True) method provides a summary of the telco_base_data DataFrame, including the number of non-null values, data types, and memory usage. By setting verbose=True, you'll get a detailed output with information about each column.

In [None]:
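# Percentage of missing values per column, as a two-column DataFrame
missing = (telco_base_data.isnull().mean() * 100).rename_axis('column').reset_index(name='pct_missing')
plt.figure(figsize=(16,5))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.pointplot(x='column', y='pct_missing', data=missing)
plt.xticks(rotation=90, fontsize=7)
plt.title("Percentage of Missing values")
plt.ylabel("PERCENTAGE")
# Save before showing; calling savefig after plt.show() typically writes a blank image
plt.savefig('percentage_of_missing_values.png')
plt.show()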

This cell calculates the percentage of missing values for each column in the telco_base_data DataFrame and draws it as a seaborn point plot.
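The percentage calculation can be checked on a small frame. This is a sketch on made-up data (the column names are borrowed from the dataset, the values are not); it relies on the fact that the mean of a boolean mask is the share of True values:

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values in one column.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": [29.85, np.nan, 108.15, np.nan],
})

# isnull() gives True/False per cell; the column mean is the missing share.
missing_pct = toy.isnull().mean() * 100

print(missing_pct)
```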

### Missing Data - Initial Intuition

* Here, no column reports any missing data. (Entries in TotalCharges that are not valid numbers only show up as missing after the numeric conversion in the Data Cleaning section.)

General Rules of Thumb:

* For features with few missing values: use regression to predict the missing values, or fill them with the mean of the values present, depending on the feature.
* For features with a very high number of missing values: it is usually better to drop those columns, as they give very little insight.
* There is no fixed rule for when to delete a column with many missing values, but a common cut-off is more than 30-40% missing. There is a catch, though: with columns such as Is_Car and Car_Type, people without a car will obviously have Car_Type as NaN (null), but that does not make the column useless, so the decision has to be taken wisely.
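The rules of thumb above can be sketched as follows. The frame, the 40% threshold and the "No car" fill value are all assumptions chosen for illustration; Is_Car and Car_Type are the hypothetical columns from the example in the text:

```python
import numpy as np
import pandas as pd

# Made-up frame: Car_Type is missing exactly where the person has no car.
toy = pd.DataFrame({
    "Is_Car": [1, 0, 0, 0, 1],
    "Car_Type": ["SUV", np.nan, np.nan, np.nan, "Sedan"],
    "Income": [50.0, 42.0, np.nan, 61.0, 58.0],
})

threshold = 0.4  # assumed cut-off, in line with the 30-40% rule of thumb
missing_share = toy.isnull().mean()

# Columns that exceed the threshold are only candidates for dropping.
candidates = missing_share[missing_share > threshold].index.tolist()

# Check whether the gaps are structural: Car_Type missing only when Is_Car == 0.
structural = bool(toy.loc[toy["Car_Type"].isnull(), "Is_Car"].eq(0).all())

# Structural gaps can be kept as an explicit category instead of dropping the column.
if structural:
    toy["Car_Type"] = toy["Car_Type"].fillna("No car")

print(candidates, structural)
```

Here Car_Type exceeds the threshold but is kept, because its gaps carry information rather than indicating poor data quality.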

## Data Cleaning


**1.** Create a copy of base data for manipulation & processing

In [None]:
telco_data = telco_base_data.copy()

**2.** Total Charges should be a numeric amount. Let's convert it to a numerical data type

In [None]:
telco_data.TotalCharges = pd.to_numeric(telco_data.TotalCharges, errors='coerce')
telco_data.isnull().sum()

**3.** As we can see there are 11 missing values in TotalCharges column. Let's check these records 

In [None]:
telco_data.loc[telco_data['TotalCharges'].isnull()]

**4. Missing Value Treatment**

Since these records make up a very small share of the total dataset (about 0.15%), it is safe to drop them from further processing.

In [None]:
#Removing missing values 
telco_data.dropna(how = 'any', inplace = True)

#telco_data.fillna(0)

The code telco_data.dropna(how='any', inplace=True) is used to drop rows from the telco_data DataFrame that contain any missing values (NaN values). The dropna() function is a pandas DataFrame method that removes rows or columns with missing values based on specified conditions.
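The conversion and removal steps can be traced on a small example. This is a sketch on made-up rows; the blank string standing in for an unparseable TotalCharges entry is an assumption about what such an entry might look like:

```python
import pandas as pd

# Made-up rows; the second TotalCharges entry cannot be parsed as a number.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# errors='coerce' turns unparseable entries into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isnull().sum())
share_missing = n_missing / len(toy)

# how='any' drops every row that has at least one missing value.
cleaned = toy.dropna(how="any")

print(n_missing, share_missing, len(cleaned))
```

This version returns a new DataFrame rather than using inplace=True, which keeps the original frame available for comparison.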

**5.** Divide customers into bins based on tenure, e.g. a tenure of up to 12 months is assigned to the tenure group 1-12, a tenure between 1 and 2 years to the group 13-24, and so on.

In [None]:
# Get the max tenure
print(telco_data['tenure'].max()) #72

This prints the maximum value of the 'tenure' column in the telco_data DataFrame, which is 72.

In [None]:
# Group the tenure in bins of 12 months
labels = ["{0} - {1}".format(i, i + 11) for i in range(1, 72, 12)]

telco_data['tenure_group'] = pd.cut(telco_data.tenure, range(1, 80, 12), right=False, labels=labels)

This cell groups the 'tenure' column into bins of 12 months and stores the bin labels in a new column called 'tenure_group'.

The first line builds the list of labels. Each label covers a range of 12 months, starting from 1 and incrementing by 12: "1 - 12", "13 - 24", and so on, up to "61 - 72", which covers the maximum tenure of 72 months.

The second line uses pd.cut() to assign each 'tenure' value to a bin. The second argument, range(1, 80, 12), specifies the bin edges 1, 13, 25, 37, 49, 61 and 73. The right=False parameter makes the intervals left-closed and right-open, so a tenure of 12 falls into "1 - 12" and a tenure of 13 into "13 - 24". The labels=labels parameter assigns the previously created labels to the corresponding bins.
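The boundary behaviour of the bins can be verified on a few hand-picked tenure values. This is a sketch; the tenure values are chosen for illustration, and the bin edges are passed as a list rather than a range object:

```python
import pandas as pd

# Same labels and edges as in the cell above.
labels = ["{0} - {1}".format(i, i + 11) for i in range(1, 72, 12)]
edges = list(range(1, 80, 12))  # [1, 13, 25, 37, 49, 61, 73]

# Hand-picked values around the bin boundaries.
tenure = pd.Series([0, 1, 12, 13, 72])
groups = pd.cut(tenure, edges, right=False, labels=labels)

print(groups.tolist())
```

A value below the first edge, such as 0, falls outside every bin and is assigned NaN.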

In [None]:
telco_data['tenure_group'].value_counts()

telco_data['tenure_group'].value_counts() returns a pandas Series with the tenure groups as the index and the number of customers in each group as the values. This shows how customers are distributed across the tenure groups.

**6.** Remove columns not required for processing

In [None]:
#drop column customerID and tenure
telco_data.drop(columns= ['customerID','tenure'], axis=1, inplace=True)
telco_data.head()

This cell drops the 'customerID' and 'tenure' columns from the telco_data DataFrame. The columns=['customerID', 'tenure'] parameter lists the columns to drop; because columns= is given, the additional axis=1 is redundant. inplace=True modifies telco_data directly instead of returning a new DataFrame.

head() then displays the first rows of the updated DataFrame.

## Data Exploration
**1.** Plot distribution of individual predictors by churn

### Univariate Analysis

In [None]:
for i, predictor in enumerate(telco_data.drop(columns=['Churn', 'TotalCharges', 'MonthlyCharges'])):
    plt.figure(i)
    sns.countplot(data=telco_data, x=predictor, hue='Churn')
    plt.savefig(f'countplot_{predictor}.png')
   

This loop creates a countplot for each predictor column in the telco_data DataFrame, excluding the 'Churn', 'TotalCharges', and 'MonthlyCharges' columns. Each countplot shows the number of churned and non-churned customers for each unique value of the predictor.

The drop() call removes the three excluded columns, and enumerate() iterates over the remaining column names, providing both the index (i) and the column name (predictor) in each iteration.

Within the loop, plt.figure(i) opens a new figure for each predictor. sns.countplot() then plots the predictor on the x-axis (x=predictor) and uses the 'Churn' column as the hue (hue='Churn'), which separates the bars into churned and non-churned customers.

Finally, plt.savefig() writes each countplot to its own PNG file.

**2.** Convert the target variable 'Churn' into a binary numeric variable, i.e. Yes = 1; No = 0

In [None]:
telco_data['Churn'] = np.where(telco_data.Churn == 'Yes',1,0)

This converts the 'Churn' column of the telco_data DataFrame from the strings 'Yes' and 'No' to the numerical values 1 and 0 using np.where() from NumPy.
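A small sketch of the encoding on made-up values, next to a possible alternative based on Series.map(). The alternative is a suggestion, not part of the original notebook; it differs in that np.where() sends every value other than 'Yes' to 0, whereas map() leaves unexpected values as NaN:

```python
import numpy as np
import pandas as pd

# Made-up target values, for illustration only.
churn = pd.Series(["Yes", "No", "No", "Yes"])

# As in the cell above: 1 where the value is 'Yes', 0 everywhere else.
encoded_where = np.where(churn == "Yes", 1, 0)

# Alternative: an explicit mapping, which exposes unexpected labels as NaN.
encoded_map = churn.map({"Yes": 1, "No": 0})

print(encoded_where.tolist(), encoded_map.tolist())
```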

In [None]:
telco_data.head()

**3.** Convert all the categorical variables into dummy variables

In [None]:
telco_data_dummies = pd.get_dummies(telco_data)
telco_data_dummies.head()

pd.get_dummies() replaces each categorical column in telco_data with one indicator (dummy) column per category. The resulting DataFrame, telco_data_dummies, keeps the numeric columns unchanged and contains the dummy columns in place of the original categorical ones.
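The effect on the columns can be seen on a small frame. This is a sketch with made-up rows; the column names are borrowed from the dataset:

```python
import pandas as pd

# Made-up frame with one categorical and two numeric columns.
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 99.65],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "Churn": [1, 0, 0],
})

dummies = pd.get_dummies(toy)

# Numeric columns are kept; 'Contract' is replaced by one column per category.
print(list(dummies.columns))
```

Each dummy column is named after the original column and the category, joined by an underscore.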

**9.** Relationship between Monthly Charges and Total Charges

In [None]:
sns.lmplot(data=telco_data_dummies, x='MonthlyCharges', y='TotalCharges', fit_reg=False)
plt.savefig('Monthlychargesvstotalcharges.png')

Total Charges increase as Monthly Charges increase, as expected.

This cell draws a scatter plot of 'MonthlyCharges' against 'TotalCharges' from the telco_data_dummies DataFrame using seaborn's lmplot(); fit_reg=False suppresses the regression line.

**10.** Churn by Monthly Charges and Total Charges

In [None]:
# fill=True is the current name of the older shade=True argument in seaborn
Mth = sns.kdeplot(telco_data_dummies.MonthlyCharges[(telco_data_dummies["Churn"] == 0) ],
                color="Red", fill = True)
Mth = sns.kdeplot(telco_data_dummies.MonthlyCharges[(telco_data_dummies["Churn"] == 1) ],
                ax =Mth, color="Blue", fill= True)
Mth.legend(["No Churn","Churn"],loc='upper right')
Mth.set_ylabel('Density')
Mth.set_xlabel('Monthly Charges')
Mth.set_title('Monthly charges by churn')
plt.savefig('KDE_montlychargesvschurn.png')

**Insight:** Churn is high when Monthly Charges are high

This cell draws kernel density estimation (KDE) plots to compare the distribution of monthly charges for churned and non-churned customers in the telco_data_dummies DataFrame. Two KDE plots are created using the kdeplot() function from seaborn:

* The first kdeplot() call plots the density of the monthly charges for rows where 'Churn' equals 0 (non-churned customers). The curve is colored red and shaded.
* The second kdeplot() call plots the density of the monthly charges for rows where 'Churn' equals 1 (churned customers). The curve is colored blue and shaded.
* The ax=Mth parameter in the second call draws the second curve on the same axes as the first, allowing for comparison.
* Mth.legend(["No Churn", "Churn"], loc='upper right') labels the two curves in the order in which they were drawn.
* Mth.set_ylabel('Density') and Mth.set_xlabel('Monthly Charges') set the axis labels.
* Mth.set_title('Monthly charges by churn') sets the title.
* Finally, plt.savefig() writes the plot to KDE_montlychargesvschurn.png.

In [None]:
# fill=True is the current name of the older shade=True argument in seaborn
Tot = sns.kdeplot(telco_data_dummies.TotalCharges[(telco_data_dummies["Churn"] == 0) ],
                color="Red", fill = True)
Tot = sns.kdeplot(telco_data_dummies.TotalCharges[(telco_data_dummies["Churn"] == 1) ],
                ax =Tot, color="Blue", fill= True)
Tot.legend(["No Churn","Churn"],loc='upper right')
Tot.set_ylabel('Density')
Tot.set_xlabel('Total Charges')
Tot.set_title('Total charges by churn')
plt.savefig('KDE_totalchargesvschurn.png')

**Surprising insight:** Churn is higher at lower Total Charges

However, if we combine the insights of 3 parameters, i.e. Tenure, Monthly Charges & Total Charges, then the picture becomes a bit clearer: a higher Monthly Charge at a lower tenure results in a lower Total Charge. Hence, all 3 factors, viz. **Higher Monthly Charge**, **Lower tenure** and **Lower Total Charge**, are linked to **High Churn**.
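The interplay of the three quantities can be illustrated with rough arithmetic. The sketch assumes that Total Charges are approximately Monthly Charges multiplied by tenure (ignoring price changes and one-off fees), and the two customers are invented:

```python
# Illustrative figures only; not taken from the dataset.
customers = {
    "new_high_plan": {"monthly": 90.0, "tenure_months": 3},
    "long_low_plan": {"monthly": 50.0, "tenure_months": 60},
}

# Assumption: total charges are roughly monthly charges times tenure.
approx_total = {
    name: c["monthly"] * c["tenure_months"] for name, c in customers.items()
}

print(approx_total)
```

The customer with the higher monthly charge ends up with the much lower total because of the short tenure, which is how high monthly charges and low total charges can both coincide with high churn.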

**11. Build a correlation of all predictors with 'Churn'**

In [None]:
plt.figure(figsize=(20,8))
telco_data_dummies.corr()['Churn'].sort_values(ascending = False).plot(kind='bar')
plt.savefig('bargraph_corr.png')

This cell draws a bar plot of the correlation between the 'Churn' column and every column in the telco_data_dummies DataFrame. plt.figure(figsize=(20, 8)) sets the figure size to 20 by 8 inches.

.corr()['Churn'] computes the correlations, and .sort_values(ascending=False) sorts the resulting Series in descending order. Because 'Churn' is included, its correlation with itself (1.0) appears as the first bar.

.plot(kind='bar') draws the sorted values, with the correlation on the y-axis and the column names on the x-axis; the bar heights show the sign and magnitude of each correlation.

Finally, plt.savefig() writes the plot to bargraph_corr.png.
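The ranking step can be reproduced without the plot on a small frame. This is a sketch with made-up 0/1 columns; dropping the 'Churn' row from the result is an optional variation that the original cell does not apply:

```python
import pandas as pd

# Made-up dummy-encoded frame, for illustration only.
toy = pd.DataFrame({
    "Churn": [1, 1, 0, 0, 1, 0],
    "Contract_Month-to-month": [1, 1, 0, 0, 1, 0],
    "Contract_Two year": [0, 0, 1, 1, 0, 1],
    "gender_Female": [1, 0, 1, 0, 0, 1],
})

# Correlation of every column with the target, strongest positive first.
corr_with_churn = (
    toy.corr()["Churn"]
    .drop("Churn")  # optional: remove the trivial self-correlation of 1.0
    .sort_values(ascending=False)
)

print(corr_with_churn)
```

Columns at the top of the sorted Series move together with churn, columns at the bottom move against it, and values near zero indicate little linear relationship.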

**Derived Insight:**

**HIGH** Churn is seen in case of **Month to month contracts**, **No online security**, **No Tech support**, **First year of subscription** and **Fibre Optics Internet**

**LOW** Churn is seen in case of **Long term contracts**, **Subscriptions without internet service** and **The customers engaged for 5+ years**

Factors like **Gender**, **Availability of PhoneService** and **# of multiple lines** have almost **NO** impact on Churn

This is also evident from the **Heatmap** below

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(telco_data_dummies.corr(), cmap="Paired")
plt.savefig('heatmat_corr.png')

This cell uses seaborn's heatmap() to visualize the correlation matrix of the telco_data_dummies DataFrame. The correlation values are represented as colors taken from the "Paired" colormap.

### Bivariate Analysis

In [None]:
new_df1_target0=telco_data.loc[telco_data["Churn"]==0]
new_df1_target1=telco_data.loc[telco_data["Churn"]==1]

This cell splits telco_data by churn status using .loc[]: new_df1_target0 contains the rows where 'Churn' equals 0 (non-churned customers), and new_df1_target1 contains the rows where 'Churn' equals 1 (churned customers).

Separating the data this way makes it easy to analyse each group on its own.

In [None]:
def uniplot(df, col, title, hue=None):
    # Plot styling
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"] = 20
    plt.rcParams['axes.titlesize'] = 22
    plt.rcParams['axes.titlepad'] = 30

    # Hue values (empty when no hue column is given); used only to size the figure
    temp = df[hue] if hue is not None else pd.Series(dtype=object)
    fig, ax = plt.subplots()
    width = len(df[col].unique()) + 7 + 4*len(temp.unique())
    fig.set_size_inches(width, 8)
    plt.xticks(rotation=45)
    plt.yscale('log')
    plt.title(title)
    ax = sns.countplot(data=df, x=col, order=df[col].value_counts().index, hue=hue, palette='bright', ax=ax)
    # No plt.show() here, so a following plt.savefig() still saves this figure

The uniplot() function creates a count plot with some additional customization using seaborn and matplotlib. It takes the following parameters:

* df: The pandas DataFrame containing the data to be plotted.
* col: The column name in the DataFrame for which the count plot will be created.
* title: The title of the plot.
* hue (optional): The column name in the DataFrame used for color differentiation in the count plot.

Inside the function, the seaborn style and context are set to 'whitegrid' and 'talk', respectively. The font size and padding for the axes labels and title are also adjusted.

A temporary series, temp, holds the hue data (it is empty when no hue column is given).

A figure and axes are created using plt.subplots(), and the width of the figure is derived from the number of unique values in the col column and the number of unique values in the hue column.

The x-axis labels are rotated by 45 degrees for better readability, and the y-axis scale is set to logarithmic using plt.yscale('log'), so bar heights should be read on a log scale.

The title of the plot is set using plt.title(title).

The count plot is created using sns.countplot(), where the data is specified as df, the x-axis column is specified as col, the order of the x-axis values is set based on the value counts of the column, and the hue column is specified as hue. The 'bright' palette is used for coloring the bars.

The function does not call plt.show(), so a plt.savefig() call placed after uniplot() in the same cell still saves the figure; with %matplotlib inline the figure is displayed when the cell finishes.

In [None]:
uniplot(new_df1_target1,col='Partner',title='Distribution of Partner for Churned Customers',hue='gender')

This cell plots the distribution of the "Partner" column for churned customers, differentiated by gender.

In [None]:
uniplot(new_df1_target0,col='Partner',title='Distribution of Partner for Non Churned Customers',hue='gender')
plt.savefig('bar_partner.png')

This cell plots the distribution of the "Partner" column for non-churned customers, differentiated by gender.

In [None]:
uniplot(new_df1_target1,col='PaymentMethod',title='Distribution of PaymentMethod for Churned Customers',hue='gender')
plt.savefig('bar_payment_method.png')


This cell plots the distribution of the "PaymentMethod" column for churned customers, differentiated by gender.

In [None]:
uniplot(new_df1_target1,col='Contract',title='Distribution of Contract for Churned Customers',hue='gender')
plt.savefig('bar_contract.png')

This cell plots the distribution of the "Contract" column for churned customers, differentiated by gender.

In [None]:
uniplot(new_df1_target1,col='TechSupport',title='Distribution of TechSupport for Churned Customers',hue='gender')
plt.savefig('bar_techsupport.png')

This cell plots the distribution of the "TechSupport" column for churned customers, differentiated by gender.

In [None]:
uniplot(new_df1_target1,col='SeniorCitizen',title='Distribution of SeniorCitizen for Churned Customers',hue='gender')
plt.savefig('bar_sensior.png')

This cell plots the distribution of the "SeniorCitizen" column for churned customers, differentiated by gender.

# CONCLUSION

These are some of the quick insights from this exercise:

1. Customers paying by electronic check are the highest churners
2. Contract Type - Monthly customers are more likely to churn because they have no contract terms and are free to leave.
3. Customers with no Online Security or no Tech Support are high churners
4. Non-senior citizens are high churners, judging by the count of churned customers
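The bivariate plots count churned customers, so a large group can lead by size alone. A hedged sketch of how counts and rates can be compared side by side; the numbers below are invented and say nothing about the real dataset:

```python
import pandas as pd

# Invented data: six non-senior and two senior customers.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 0, 0, 1, 1],
    "Churn":         [1, 1, 0, 0, 0, 0, 1, 0],
})

# Number of churned customers per group (what a count plot shows).
churn_counts = toy.groupby("SeniorCitizen")["Churn"].sum()

# Share of churned customers per group (the churn rate).
churn_rates = toy.groupby("SeniorCitizen")["Churn"].mean()

print(churn_counts)
print(churn_rates)
```

In this invented example the larger group has more churned customers while the smaller group has the higher churn rate, so checking both views is worthwhile before drawing conclusions.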



In [None]:
telco_data_dummies.to_csv('tel_churn.csv')

This cell saves the telco_data_dummies DataFrame as a CSV file named 'tel_churn.csv' using to_csv().
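By default, to_csv() also writes the row index as an unnamed first column. A minimal sketch of the difference, using an in-memory buffer and a made-up frame instead of a file on disk:

```python
import io
import pandas as pd

# Made-up frame, for illustration only.
toy = pd.DataFrame({"Churn": [1, 0], "MonthlyCharges": [29.85, 56.95]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

print(header_with)
print(header_without)
```

Whether to keep the index is a matter of preference; passing index=False avoids an extra unnamed column when the file is read back with pd.read_csv().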