In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

> **Analyzing a dataset on the churn rate of telecom operator clients**

Reading the data into a DataFrame

In [None]:
df = pd.read_csv('../input/telecom_churn.csv')

Checking the datatype for df

In [None]:
type(df)

Previewing the first five rows of the dataset

In [None]:
df.head()

Checking the dimensions of the dataset

In [None]:
df.shape

We have 3333 rows and 20 columns in the whole dataset.

Let us see all the names of the columns in the dataset

In [None]:
df.columns

To get some general information regarding the dataset we use the .info() function.

In [None]:
df.info()

The output confirms that df is a DataFrame and lists the name and data type of each column.

It also shows the non-null count for every column. Missing values would show up as a non-null count lower than the number of rows; there are none in this case. There are 16 numeric features, 1 boolean feature and 3 object features.
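As a cross-check on the info() output, missing values can also be counted directly. The sketch below is only an illustration: `toy` is a hypothetical frame with made-up values, not the real dataset, and `isna().sum()` gives the number of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from telecom_churn.csv
toy = pd.DataFrame({
    'State': ['KS', 'OH', None],
    'Total day minutes': [265.1, np.nan, 243.4],
    'Churn': [False, False, True],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing entries in the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

On the real data, the same call on df should report zero for every column if the info() output was read correctly.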

In [None]:
df['Churn']=df['Churn'].astype('int64')

We converted the target feature Churn from boolean to int64 with the astype method so that it is included in the numeric summary statistics.

To get summary statistics such as count, mean, std and quartiles for the numeric features, we use the describe() function.

In [None]:
df.describe()

We need the include parameter to see summary statistics for non-numeric features.

In [None]:
df.describe(include=['object'])

To check the distribution of our target feature Churn, we use value_counts().

In [None]:
df['Churn'].value_counts()

Out of 3333 customers, 2850 are loyal (Churn = 0).

In [None]:
df['Churn'].value_counts(normalize=True)

Passing normalize=True returns the fraction of the total in each class instead of the count: 85.5% of customers are loyal and 14.5% are not.

**Sorting**

We can sort the whole dataset by the values of one column, in ascending or descending order. Here we sort by the column 'Total day charge' with ascending=False, which gives descending order.

In [None]:
df.sort_values(by='Total day charge',ascending=False).head()

We can also sort by multiple columns. Let us try.

In [None]:
df.sort_values(by=['Churn','Total day charge'],ascending=[True,False]).head()

**Indexing and Retrieving Data**

A single column can be retrieved with df['column name']. Because Churn is coded as 0/1, its mean is the proportion of clients who churned.

In [None]:
df['Churn'].mean()

14.5% of clients leaving a company is not a good sign at all.
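The reason the mean of Churn can be read as a proportion is that the column only contains 0 and 1. A minimal sketch with a hypothetical series of flags (`churn`, made-up values) shows that the mean agrees with the normalized value counts:

```python
import pandas as pd

# Hypothetical 0/1 churn flags, for illustration only
churn = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name='Churn')

# The mean of a 0/1 column is (number of ones) / (number of rows)
share_via_mean = churn.mean()

# The same share taken from the normalized counts of the value 1
share_via_counts = churn.value_counts(normalize=True).loc[1]

print(share_via_mean, share_via_counts)
```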

Boolean indexing selects only the rows that satisfy a condition, so we can summarize other features for that subset.
For example, let us look at the averages of the numerical features for all the churned users.


In [None]:
# numeric_only=True skips text columns such as State, which cannot be averaged
df[df['Churn']==1].mean(numeric_only=True)

How much time, on average, do churned users spend on daytime calls?

In [None]:
df[df['Churn']==1]['Total day minutes'].mean()

What is the maximum length of international calls by loyal users who do not have any international plans? 

In [None]:
df[(df['Churn']==0) & (df['International plan']=='No')]['Total intl minutes'].max()

DataFrames can be indexed with loc[] and iloc[].
loc[] selects rows and columns by label, and a label slice includes its end point; iloc[] selects them by integer position, and a position slice excludes its end point.

In [None]:
df.loc[0:3]

In [None]:
df.iloc[0:3,0:5]
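A minimal sketch of how the two indexers treat slices, using a hypothetical frame `toy` with made-up values: with loc the slice 0:3 is read as labels and includes label 3, while with iloc the same slice is read as positions and stops before position 3.

```python
import pandas as pd

# Hypothetical toy data with the default integer row labels 0..4
toy = pd.DataFrame({
    'State': ['KS', 'OH', 'NJ', 'OK', 'AL'],
    'Area code': [415, 415, 408, 510, 408],
    'Total day calls': [110, 123, 114, 71, 113],
})

# loc: label-based, both ends of the slice are included
# rows 0, 1, 2, 3 and columns 'State' through 'Area code'
by_label = toy.loc[0:3, 'State':'Area code']

# iloc: position-based, the end of the slice is excluded
# rows 0, 1, 2 and columns 0, 1
by_position = toy.iloc[0:3, 0:2]

print(by_label.shape)
print(by_position.shape)
```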

What is the last line of the dataframe? 

In [None]:
df[-1:]

**Applying functions to cells, rows and columns**

To apply a function to every column, we use apply(); here we take the maximum of each column.

In [None]:
df.apply(np.max) 

Select all states starting with W using a lambda function.

In [None]:
df[df['State'].apply(lambda state:state[0]=='W')].head() 
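For string conditions like this one, pandas also offers vectorized string methods under the `.str` accessor. The sketch below, on a hypothetical frame `toy`, suggests that `str.startswith` builds the same boolean mask as the lambda:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'State': ['WV', 'OH', 'WA', 'NJ', 'WI'],
    'Churn': [0, 0, 1, 0, 0],
})

# Mask built element by element with a lambda
mask_apply = toy['State'].apply(lambda state: state[0] == 'W')

# Mask built with the vectorized string method
mask_str = toy['State'].str.startswith('W')

# Both masks select the same rows
same = bool((mask_apply == mask_str).all())
print(same)
print(toy[mask_str])
```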

We can use the map function to replace old values by new ones. 

In [None]:
d= {'Yes':True, 'No':False}
df['International plan']=df['International plan'].map(d) 
df.head() 

We can do the same thing using the replace function.

In [None]:
df=df.replace({'Voice mail plan':d}) 

In [None]:
df.head() 
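The two methods give the same result here because every value is either 'Yes' or 'No'. They differ when a value is missing from the dictionary, as this sketch with a hypothetical series `plan` (including a made-up 'Unknown' entry) illustrates:

```python
import pandas as pd

d = {'Yes': True, 'No': False}

# Hypothetical values; 'Unknown' does not occur in the real column
plan = pd.Series(['Yes', 'No', 'Unknown'], dtype=object,
                 name='International plan')

# map: values that are not keys of d become NaN
mapped = plan.map(d)

# replace: values that are not keys of d are left unchanged
replaced = plan.replace(d)

print(mapped.tolist())
print(replaced.tolist())
```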

**Groupby** splits the data by the values of one or more columns so that selected columns can be summarized separately for each group.

In [None]:
columns_to_show=['Total day minutes', 'Total eve minutes', 'Total night minutes']
df.groupby(['Churn'])[columns_to_show].describe(percentiles=[]) 

We can obtain specific aggregate functions by using agg().

In [None]:
columns_to_show = ['Total day minutes', 'Total eve minutes', 
                   'Total night minutes']

# String names avoid the FutureWarning that recent pandas versions emit for np.mean, np.std, etc.
df.groupby(['Churn'])[columns_to_show].agg(['mean', 'std', 'min', 'max'])

Summary tables: a contingency table built with crosstab shows how the observations are distributed across two variables. Pivot tables extend the idea to more than two variables.

In [None]:
pd.crosstab(df['Churn'],df['International plan'])

In a pivot table, we group the data by one feature and apply an aggregate function to other features within each group.

In [None]:
df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'],['Area code'],aggfunc='mean')
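A pivot table with a single grouping column and one aggregate function can be seen as a groupby in another form. The sketch below, on a hypothetical frame `toy` with made-up values, compares the two:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Area code': [408, 415, 415, 510, 408],
    'Total day calls': [100, 120, 80, 90, 110],
    'Total eve calls': [99, 103, 110, 88, 101],
})
cols = ['Total day calls', 'Total eve calls']

# Mean of each column per area code, written as a pivot table
via_pivot = toy.pivot_table(cols, ['Area code'], aggfunc='mean')

# The same means, written as a groupby
via_groupby = toy.groupby('Area code')[cols].mean()

# Compare the numbers column by column
same = np.allclose(via_pivot[cols].to_numpy(), via_groupby[cols].to_numpy())
print(same)
```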

**Dataframe Transformations**

To add a column holding the total number of calls (day, evening, night and international), we can do the following.

In [None]:
total_calls=df['Total day calls']+df['Total eve calls']+df['Total night calls']+df['Total intl calls']
df.insert(loc=len(df.columns),column='Total calls',value=total_calls)

In [None]:
df.head()

To remove columns, we use the drop function with axis=1.

In [None]:
df.drop(['Total calls'],axis=1,inplace=True)
df.head()

To remove rows, we pass their index labels to drop. Without inplace=True, this returns a new DataFrame and leaves df unchanged.

In [None]:
df.drop([1,2]).head()

Give a detailed contingency table between Churn and International plan along with a plot.

In [None]:
pd.crosstab(df['Churn'],df['International plan'],margins=True)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Graphics in retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina'

In [None]:
sns.countplot(x='International plan', hue='Churn', data=df)

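From the above results, the churn rate is much higher among customers with an International plan. The data alone does not say why; one possible, unverified explanation is that poorly maintained international plans leave these customers dissatisfied.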

Give a detailed contingency table between Churn and Customer service calls along with a plot.

In [None]:
pd.crosstab(df['Churn'],df['Customer service calls'],margins=True)

In [None]:
sns.countplot(x='Customer service calls', hue='Churn', data=df)

The churn rate increases substantially once the number of customer service calls reaches 4 or more.
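The count plot shows absolute counts, so the churn rate for each number of calls has to be estimated by eye. A minimal sketch of computing the rate directly, assuming a hypothetical frame `toy` with made-up values: `normalize='columns'` makes each column of the crosstab sum to 1, so the Churn = 1 row is the churn rate.

```python
import pandas as pd

# Hypothetical toy data, not the real distribution
toy = pd.DataFrame({
    'Customer service calls': [0, 0, 1, 1, 4, 4, 4, 5],
    'Churn':                  [0, 0, 0, 1, 1, 1, 0, 1],
})

# Share of loyal and churned customers within each number of calls
rates = pd.crosstab(toy['Churn'], toy['Customer service calls'],
                    normalize='columns')

# The row for Churn = 1 is the churn rate per number of calls
churn_rate = rates.loc[1]
print(churn_rate)
```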

Based on the above observation, create a new feature and observe its relationship with churn.

In [None]:
df['many_calls']=(df['Customer service calls']>3).astype('int')
pd.crosstab(df['Churn'],df['many_calls'],margins=True)

In [None]:
sns.countplot(x='many_calls', hue='Churn', data=df);

In [None]:
pd.crosstab(df['many_calls'] & df['International plan'] , df['Churn'])

Therefore, predicting that a customer is not loyal (Churn=1) in the case when the number of calls to the service center is greater than 3 and the International Plan is added (and predicting Churn=0 otherwise), we might expect an accuracy of 85.8% (we are mistaken only 464 + 9 times). This number, 85.8%, that we got through this very simple reasoning serves as a good starting point (baseline) for the further machine learning models that we will build.
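The 85.8% figure follows from the counts quoted above, and the rule itself can be written as a one-line prediction. The sketch below first redoes the arithmetic, then applies the same rule to a hypothetical frame `toy` with made-up values; `predicted` and `toy_accuracy` are illustrative names, not part of the dataset.

```python
import pandas as pd

# Figures quoted in the text: 3333 customers, 464 + 9 wrong predictions
n_total = 3333
n_errors = 464 + 9
accuracy_from_text = (n_total - n_errors) / n_total
print(round(accuracy_from_text, 3))

# The same rule applied to hypothetical toy data
toy = pd.DataFrame({
    'Customer service calls': [1, 5, 4, 0, 6, 2],
    'International plan': [False, True, True, False, False, True],
    'Churn': [0, 1, 1, 0, 1, 0],
})

# Predict churn only when there are more than 3 calls AND a plan
predicted = ((toy['Customer service calls'] > 3)
             & toy['International plan']).astype(int)

# Accuracy is the share of rows where the prediction matches Churn
toy_accuracy = (predicted == toy['Churn']).mean()
print(toy_accuracy)
```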