<center><h1> Customer Churn Prediction Model </h1>
<h4>By: Ravikumar Patel </h4></center>


Customer churn is the percentage of customers who stop using a company's services or products during a given period, which could be days, months, quarters, or years. For a monthly period, for example, the company subtracts the number of customers remaining at the end of the month from the number it had at the beginning of the month and expresses the difference as a percentage of the starting count. **Note**: New customers acquired during the period must be excluded from this calculation; otherwise the resulting churn rate will not be accurate.
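
The calculation described above can be sketched in a few lines of Python. The function name `churn_rate` and the example counts are hypothetical and only illustrate the arithmetic; they are not taken from the dataset.

```python
def churn_rate(customers_at_start: int, customers_remaining: int) -> float:
    """Return the churn percentage for one period.

    customers_remaining counts only customers who were already present at
    the start of the period; customers acquired during the period are excluded.
    """
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_remaining <= customers_at_start:
        raise ValueError("customers_remaining must be between 0 and customers_at_start")
    lost = customers_at_start - customers_remaining
    return lost * 100 / customers_at_start


# Hypothetical month: 1,000 customers at the start, 950 of them still active at the end.
monthly_churn = churn_rate(1000, 950)
print(f"Monthly churn: {monthly_churn:.1f}%")
```

Passing only the survivors of the starting cohort as `customers_remaining` is what keeps newly acquired customers out of the figure.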

The company wants to keep this churn percentage as close to 0% as possible. A high churn percentage means the company is losing revenue from departing customers, and without that revenue it cannot operate as profitably.

The American Customer Satisfaction Index, which measures overall customer satisfaction by sector according to a formula, reported that only 72.2% of customers are satisfied with their current telecom provider [Source](https://www.theacsi.org/acsi-benchmarks/benchmarks-by-sector). This suggests that the remaining 27.8% of customers are at high risk of leaving the company. It does not mean that satisfied customers will never leave, only that they are less likely to. In 2017, two Canadian telecommunications companies, Telus and BCE (Bell), reported that acquiring a new customer cost them 50 times more than retaining an existing one [Source](https://telecoms.com/opinion/churn-is-breaking-the-telecoms-market-heres-how-to-fix-it/).


The dataset used in this project is taken from [Kaggle](https://www.kaggle.com/abhinav89/telecom-customer). It contains over 100 features and 100,000 instances. It does not include any information about the time frame in which the data was collected. The features cover most aspects of a telecom customer, as both categorical and continuous values. To help reduce the customer churn percentage, companies use predictive models so that they can offer special perks to the customers who are most likely to leave.

The steps to develop a customer churn prediction model are as follows:

1.   Data Analysis
2.   Data Preprocessing
3.   Feature Selection
4.   Model Selection and Training
5.   Model Evaluation

In [1]:
# basic library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# to split the data into random smaller sets
from sklearn.model_selection import train_test_split

# load the dataset into a DataFrame
df = pd.read_csv("Telecom_customer churn.csv")

display(df.head())

# hold out 20% of the rows as a test set; both frames keep every column, including 'churn'
X_train, X_test = train_test_split(df, test_size=0.2, random_state=10)

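| Unnamed: 0 | rev_Mean | mou_Mean | totmrc_Mean | da_Mean | ovrmou_Mean | ovrrev_Mean | vceovr_Mean | datovr_Mean | roam_Mean | change_mou | ... | forgntvl | ethnic | kid0_2 | kid3_5 | kid6_10 | kid11_15 | kid16_17 | creditcd | eqpdays | Customer_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23.9975 | 219.25 | 22.5 | 0.2475 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -157.25 | ... | 0.0 | N | U | U | U | U | U | Y | 361.0 | 1000001 |
| 1 | 57.4925 | 482.75 | 37.425 | 0.2475 | 22.75 | 9.1 | 9.1 | 0.0 | 0.0 | 532.25 | ... | 0.0 | Z | U | U | U | U | U | Y | 240.0 | 1000002 |
| 2 | 16.99 | 10.25 | 16.99 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -4.25 | ... | 0.0 | N | U | Y | U | U | U | Y | 1504.0 | 1000003 |
| 3 | 38.0 | 7.5 | 38.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.5 | ... | 0.0 | U | Y | U | U | U | U | Y | 1812.0 | 1000004 |
| 4 | 55.23 | 570.5 | 71.98 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38.5 | ... | 0.0 | I | U | U | U | U | U | Y | 434.0 | 1000005 |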
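
The description of the dataset (its size and its mix of continuous and categorical features) can be checked with a quick summary. The sketch below is only illustrative: `summarize_frame` is a hypothetical helper, and `toy_df` is a tiny stand-in built from a few values of the preview above, not the real data.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Count rows, columns, and numeric vs. non-numeric columns."""
    numeric_cols = frame.select_dtypes("number").columns
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "numeric": len(numeric_cols),
        "non_numeric": frame.shape[1] - len(numeric_cols),
    }


# Tiny stand-in frame (assumption: the real df has the same column types).
toy_df = pd.DataFrame({
    "rev_Mean": [23.9975, 57.4925, 16.99],
    "eqpdays": [361.0, 240.0, 1504.0],
    "ethnic": ["N", "Z", "N"],
    "creditcd": ["Y", "Y", "Y"],
})

summary = summarize_frame(toy_df)
print(summary)
```

Calling `summarize_frame(df)` on the loaded dataset would report the actual counts.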


## Data Analysis

In [2]:
# the code in this cell is taken from tutorial 2 of the CSCI 4146 course
def buildContinuousFeaturesReport(features, data_df):
	conHead = ['Count', 'Miss %', 'Card.', 'Min', '1st Qrt.',
            'Mean', 'Median', '3rd Qrt.', 'Max', 'Std. Dev.']

	conOut_df = pd.DataFrame(index=features, columns=conHead)
	columns_df = data_df[features]

	#COUNT
	conOut_df[conHead[0]] = len(columns_df)

	#MISS % 
	conOut_df[conHead[1]] = columns_df.isna().sum() / len(columns_df) * 100

	#CARDINALITY
	conOut_df[conHead[2]] = columns_df.nunique()

	#MINIMUM
	conOut_df[conHead[3]] = columns_df.min()

	#1ST QUARTILE
	conOut_df[conHead[4]] = columns_df.quantile(0.25)

	#MEAN
	conOut_df[conHead[5]] = columns_df.mean()

	#MEDIAN
	conOut_df[conHead[6]] = columns_df.median()

	#3rd QUARTILE
	conOut_df[conHead[7]] = columns_df.quantile(0.75)

	#MAX
	conOut_df[conHead[8]] = columns_df.max()

	#STANDARD DEVIATION
	conOut_df[conHead[9]] = columns_df.std()

	return conOut_df

def buildCategoricalFeaturesReport(features, data_df):
	catHead = ['Count', 'Miss %', 'Card.', 'Mode', 'Mode Freq',
            'Mode %', '2nd Mode', '2nd Mode Freq', '2nd Mode %']

	columns_df = data_df[features]

	#preparing a dictionary for storing data
	stats_dict = {k: ['']*len(features) for k in catHead}

	#CARDINALITY
	stats_dict['Card.'] = columns_df.nunique()

	missing = columns_df.isna().sum() / len(columns_df) * 100

	for col in columns_df:
		values = columns_df[col].value_counts()
		index = features.index(col)

		#COUNT
		stats_dict['Count'][index] = len(columns_df)
		
		#MISS %
		stats_dict['Miss %'][index] = missing[col]

		#MODES
		mode = values.index[0]
		mode2 = values.index[1] if len(values.index) > 1 else mode
		stats_dict['Mode'][index] = mode
		stats_dict['2nd Mode'][index] = mode2

		#MODE FREQ
		modeCount = values.loc[mode]
		modeCount2 = values.loc[mode2]
		stats_dict['Mode Freq'][index] = modeCount
		stats_dict['2nd Mode Freq'][index] = modeCount2

		#MODE %
		miss = stats_dict['Miss %'][index]

		modePer = (modeCount/(len(columns_df)*((100-miss)/100)))*100
		stats_dict['Mode %'][index] = round(modePer, 2)

		modePer2 = (modeCount2/(len(columns_df)*((100-miss)/100)))*100
		stats_dict['2nd Mode %'][index] = round(modePer2, 2)
	
	output_df = pd.DataFrame.from_dict(stats_dict)
	return output_df

In [3]:
# build the data quality reports(DQRs) for continuous and categorical features 
dqr_continuous_listings = buildContinuousFeaturesReport(X_train.select_dtypes('number').columns.to_list(), X_train)
dqr_categorical_listings = buildCategoricalFeaturesReport(X_train.select_dtypes('object').columns.to_list(), X_train)

#display the reports (a single option_context call avoids the fragile backslash continuation)
with pd.option_context('display.max_rows', None, 'display.max_columns', None,
                       'display.float_format', '{:.2f}'.format):
    print("Data quality report for quantitative features")
    display(dqr_continuous_listings)

    print("\nData quality report for qualitative features")
    display(dqr_categorical_listings)

The data quality reports show that some features hold numeric values but are qualitative by nature, such as truck, rv, and lor. So, let's move those features to the correct category and run the report again.

In [None]:
# move truck, rv, lor, adults, income, numbcars to categorical analysis

lst_cat = ['truck', 'rv', 'lor', 'adults', 'income', 'numbcars']

dqr_continuous_listings = dqr_continuous_listings.drop(lst_cat)

lst_cat.extend(X_train.select_dtypes('object').columns.to_list())
dqr_categorical_listings = buildCategoricalFeaturesReport(lst_cat, X_train)

#display the reports (a single option_context call avoids the fragile backslash continuation)
with pd.option_context('display.max_rows', None, 'display.max_columns', None,
                       'display.float_format', '{:.2f}'.format):
    print("Data quality report for quantitative features")
    display(dqr_continuous_listings)

    print("\nData quality report for qualitative features")
    display(dqr_categorical_listings)
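
The features moved above were picked by reading the report. As a complementary check, numeric columns with very few distinct values can be listed automatically as candidates for review. This is a minimal sketch: `low_cardinality_numeric`, the threshold `max_unique`, and the toy data are all assumptions made for illustration, and a low count alone does not prove a feature is qualitative.

```python
import pandas as pd


def low_cardinality_numeric(frame: pd.DataFrame, max_unique: int = 10) -> pd.Series:
    """Return the number of distinct values for numeric columns at or below max_unique."""
    counts = frame.select_dtypes("number").nunique()
    return counts[counts <= max_unique]


# Toy data (hypothetical values); 'area' is non-numeric and is ignored.
toy_df = pd.DataFrame({
    "truck": [0, 1, 0, 0, 1, 0],
    "rv": [0, 0, 0, 1, 0, 0],
    "rev_Mean": [23.99, 57.49, 16.99, 38.0, 55.23, 41.2],
    "area": ["A", "B", "A", "C", "B", "A"],
})

candidates = low_cardinality_numeric(toy_df, max_unique=3)
print(candidates)
```

Each candidate still needs a look at its meaning before it is treated as categorical.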

### Quantitative feature analysis

The DQR for quantitative features shows that most mean or total features (names ending with '_Mean' or starting with 'avg', 'adj', or 'tot') have a large number of outliers, i.e. the difference between the 3rd quartile and the maximum value is very large. Box plots and histograms can provide more information about these features and support sound decisions on how to handle them. The share of missing values in these features is low, so handling them does not require much analysis.
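
The gap between the 3rd quartile and the maximum can also be quantified. The sketch below uses the common 1.5 × IQR fence, which is a swapped-in convention rather than something the report itself computes; `upper_outlier_share` and the toy values are hypothetical.

```python
import pandas as pd


def upper_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Percentage of non-missing values above Q3 + k * IQR."""
    clean = series.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return (clean > upper_fence).mean() * 100


# Toy usage values with one extreme entry (not real data).
toy = pd.Series([10, 12, 11, 13, 12, 11, 10, 500], name="mou_Mean")
share = upper_outlier_share(toy)
print(f"{share:.1f}% of values lie above the upper fence")
```

Applied column by column to `X_train`, this would give a rough ranking of which features are most affected.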

Some features contain very high values that, based on domain knowledge, are not feasible. This might be due to a collection error or an incorrectly entered value. These features need further analysis, along with "eqpdays". The "eqpdays" feature represents the age of the customer's current equipment in days, i.e. how long ago the customer got the equipment they are using. It contains negative values, which do not make sense unless they mean that new equipment is on its way (in shipping) to the customer. Since this information is not provided, analysis is required before making any decisions.
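
Before plotting, it can help to count how many rows are affected by the negative values. A minimal sketch on made-up numbers, assuming the column is named `eqpdays` as in the dataset:

```python
import pandas as pd

# Made-up equipment ages in days, including two negatives and one missing value.
toy = pd.DataFrame({"eqpdays": [361.0, 240.0, -3.0, 1504.0, -1.0, None]})

# Missing values compare as False, so they are not counted as negative.
negative = toy["eqpdays"] < 0
n_negative = int(negative.sum())
pct_negative = negative.mean() * 100

print(f"{n_negative} negative values ({pct_negative:.1f}% of rows)")
print(toy.loc[negative, "eqpdays"].tolist())
```

If the share turns out to be small, that is useful context when deciding how to treat these rows later.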

#### Features with a high number of outliers

In [None]:
# plot the boxplot and histogram for the given features from given dataframe
def plot(df, features):
    for i in features:    
        df.boxplot(column=i)
        plt.title('Boxplot of '+ i)
        df.hist(column=i)
        plt.title('Histogram of '+ i)
        plt.show()

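# quantitative features that do not show the heavy upper tail
lst_not_high_outliers = ['churn', 'months', 'uniqsubs', 'actvsubs', 'phones',
                         'models', 'forgntvl', 'Customer_ID']

# every remaining quantitative feature; sorted so the plots appear in a stable order
lst_high_outliers = sorted(set(dqr_continuous_listings.index) - set(lst_not_high_outliers))

plot(X_train, lst_high_outliers)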

#### Analysis of features with implausibly high values

In [None]:
# uniqsubs analysis
print("Check values in uniqsubs")
plt.title("uniqsubs values > 10")
plt.xlabel("Number of unique subscriptions")
X_train[X_train['uniqsubs'] > 10]['uniqsubs'].hist()
plt.show()

In [None]:
# actvsubs analysis
print("Check values in actvsubs")
plt.title("actvsubs values > 5")
plt.xlabel("Number of active subscriptions")
X_train[X_train['actvsubs'] > 5]['actvsubs'].hist(bins=50)
plt.show()

In [None]:
# phones analysis
print("Check values in phones")
plt.title("phones values > 10")
plt.xlabel("Number of phones issued")
X_train[X_train['phones'] > 10]['phones'].hist(bins=20)
plt.show()

In [None]:
# models analysis
print("Check values in models")
plt.title("models values > 5")
plt.xlabel("Number of models issued")
X_train[X_train['models'] > 5]['models'].hist()
plt.show()

#### Analysis of features with implausible negative values

In [None]:
# eqpdays analysis
print("Check values in eqpdays")
plt.title("eqpdays values < 5")
plt.xlabel("Number of days (age) of current equipment")
X_train[X_train['eqpdays'] < 5]['eqpdays'].hist()
plt.show()

plt.title("eqpdays")
plt.xlabel("Number of days (age) of current equipment")
X_train['eqpdays'].hist()
plt.show()

### Qualitative feature analysis

#### Features with more than 10% missing values



In [None]:
high_missing =  dqr_categorical_listings[(dqr_categorical_listings['Miss %'] > 10)].index.to_list()

for f in high_missing:
    print("Feature :", f)
    plt.title(f)
    plt.xlabel("Values")
    plt.ylabel("% of customers")
    X_train[f].value_counts(normalize=True, dropna = False).plot(kind='bar')
    plt.show()
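
The same 10% filter can be applied without going through the DQR by computing the missing share directly from a frame. The helper `high_missing_features` and the toy values below are hypothetical and serve only as an illustration.

```python
import pandas as pd


def high_missing_features(frame: pd.DataFrame, threshold: float = 10.0) -> pd.Series:
    """Return the missing percentage of columns above threshold, highest first."""
    miss_pct = frame.isna().mean() * 100
    return miss_pct[miss_pct > threshold].sort_values(ascending=False)


# Toy frame: 'ownrent' is 60% missing, 'dwlltype' 20%, 'creditcd' 0%.
toy_df = pd.DataFrame({
    "ownrent": [None, "O", None, "R", None],
    "dwlltype": ["S", None, "M", "S", "S"],
    "creditcd": ["Y", "Y", "N", "Y", "Y"],
})

flagged = high_missing_features(toy_df)
print(flagged)
```

On the notebook's data, the result for the qualitative columns should match the `Miss %` column of the categorical report.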

#### Analysis of features with a small number of values

In [None]:
lst_features = ['asl_flag', 'refurb_new', 'ownrent', 'dwlltype', 'infobase', \
              'kid0_2', 'kid3_5', 'kid6_10', 'kid11_15', 'kid16_17', \
              'creditcd', 'new_cell', 'prizm_social_one', 'hnd_webcap', \
              'marital', 'HHstatin']

for f in lst_features:
    print("Feature :", f)
    plt.title(f)
    plt.xlabel("Values")
    plt.ylabel("% of customers")
    X_train[f].value_counts(normalize=True, dropna = False).plot(kind='bar')
    plt.show()

#### Analysis of features with uncommon values

In [None]:
plt.title('dualband')
plt.xlabel("Values")
plt.ylabel("% of customers")
X_train['dualband'].value_counts(normalize=True, dropna = False).plot(kind='bar')
plt.show()

#### Analysis of features with a large number of values

In [None]:
lst_features = ['crclscod', 'area', 'dwllsize', 'ethnic']

for f in lst_features:
    print("Feature :", f)
    plt.title(f)
    plt.xlabel("Values")
    plt.ylabel("% of customers")
    X_train[f].value_counts(normalize=True, dropna = False).plot(kind='bar')
    plt.show()
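
Bar charts with many categories can be hard to read, so a single number per feature may help: the share of customers covered by the few most frequent values. This is a sketch under assumptions; `top_k_coverage`, the choice of `k`, and the toy values are illustrative only.

```python
import pandas as pd


def top_k_coverage(series: pd.Series, k: int = 3) -> float:
    """Percentage of rows covered by the k most frequent values (missing counts as a value)."""
    shares = series.value_counts(normalize=True, dropna=False)
    return shares.head(k).sum() * 100


# Toy column with seven distinct values (hypothetical data).
toy_area = pd.Series(["A", "A", "A", "B", "B", "C", "D", "E", "F", "G"], name="area")
coverage = top_k_coverage(toy_area, k=3)
print(f"Top 3 values cover {coverage:.0f}% of rows")
```

A high coverage means a handful of values dominates the feature, while a low one indicates a long tail of rare values.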