### Business Problem Overview

In the telecom industry, customers can choose among multiple service providers and actively switch from one operator to another. In this highly competitive market, the industry experiences an average annual churn rate of 15-25%. Because acquiring a new customer costs 5-10 times more than retaining an existing one, customer retention has become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business goal.
To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.

### Business Objective

The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.

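Prepaid customers can stop using a service without giving any notice, so it is hard to tell whether someone has actually churned or is only temporarily inactive. Churn prediction is therefore usually more critical (and non-trivial) for prepaid customers, and the term 'churn' should be defined carefully. Prepaid is also the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America.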

This project is based on the Indian and Southeast Asian market.

### Solution Overview

The project starts with understanding the data set, dropping unnecessary features, deriving new features and performing exploratory data analysis. Where possible and required, `IterativeImputer` from scikit-learn is used to fill gaps in numerical data by regression.
The project is limited to high-value customers, who are identified by applying a fixed definition to the data set. The resulting data set has around 30K entries.

For the South Asian market, a good enough churn prediction on the high-value customers is sufficient.
The algorithms below are applied to the data set, and the resulting models are compared.

**Principal Component Analysis  (PCA) with Logistic Regression** 

**Logistic Regression with Recursive Feature Elimination**

**Random Forest algorithm based model with Hyper Parameter tuning**

**Gradient Boosting - XGBoost algorithm**

After this, the results are compared based on the confusion matrix and accuracy, and a good model is chosen.
Based on the chosen model, the main features that affect churn are then identified so that business recommendations can be made.




In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline    
pd.options.display.float_format = '{:.2f}'.format
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")

## Data Preparation & Understanding 

In [None]:
#Load the data into a dataframe
telecom = pd.read_csv("../input/telecom-churn-data-set-for-the-south-asian-market/telecom_churn_data.csv", low_memory=False)

In [None]:
telecom.head()

In [None]:
telecom.shape

In [None]:
telecom.info()

### Check unique values

In [None]:
#mobile_number is unique
print(telecom.mobile_number.is_unique)
telecom.mobile_number.nunique()

### Columns with at least 70% missing data

In [None]:
# Percentage of missing values per column; list the columns at or above 70%
colmns_missing_data = round(100*(telecom.isnull().sum()/len(telecom.index)), 2)
colmns_missing_data[colmns_missing_data >= 70]

In [None]:
telecom.shape

### Filter High Value Customers
#### Define high-value customers as follows: 
- Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

- There are a lot of missing values in the data-recharge columns (`total_rech_data_*` and `av_rech_amt_data_*`), indicating that no data recharge was done in that month
- The NaN values should be replaced by 0
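
The percentile rule above can be sketched on a small toy frame. The column names mirror the data set, but the ten recharge values are invented for illustration only.

```python
import pandas as pd

# Toy data: column names mirror the data set, values are made up.
toy = pd.DataFrame({
    "total_rech_amt_6": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "total_rech_amt_7": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
})

# Average recharge over the good phase (months 6 and 7).
toy["av_rech_amt"] = (toy["total_rech_amt_6"] + toy["total_rech_amt_7"]) / 2.0

# X is the 70th percentile of that average; keep customers at or above it.
cutoff = toy["av_rech_amt"].quantile(0.70)
high_value = toy[toy["av_rech_amt"] >= cutoff].copy()

print(cutoff, len(high_value))
```

Using `>=` keeps customers sitting exactly on the cutoff, which matches the "more than or equal to X" wording of the definition.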

In [None]:
# Missing data-recharge values mean no recharge was made in that month, so fill them with 0
data_rech_cols = [f"{name}_{month}" for name in ("total_rech_data", "av_rech_amt_data") for month in (6, 7, 8, 9)]
telecom[data_rech_cols] = telecom[data_rech_cols].fillna(0)

In [None]:
#Total recharge amounts for months 6 and 7
#Total recharge amount logic = Total data recharge + Total recharge Amount. 
#if any of the data recharge columns are 0 then retain the total recharge amt column as is

telecom['total_rech_amt_6'] = np.where((telecom['total_rech_data_6'] != 0) & (telecom['av_rech_amt_data_6'] != 0),
                                            telecom['total_rech_data_6']*telecom['av_rech_amt_data_6']+telecom['total_rech_amt_6'],
                                            telecom['total_rech_amt_6'])

telecom['total_rech_amt_7'] = np.where((telecom['total_rech_data_7'] != 0) & (telecom['av_rech_amt_data_7'] != 0),
                                            telecom['total_rech_data_7']*telecom['av_rech_amt_data_7']+telecom['total_rech_amt_7'],
                                            telecom['total_rech_amt_7'])

In [None]:
# Filter high-value customers
telecom['av_rech_amt'] = (telecom["total_rech_amt_6"] + 
                          telecom["total_rech_amt_7"]) / 2.0
cutoff = telecom.av_rech_amt.quantile(.70)
print('70 percentile of first two months avg recharge amount: ', cutoff)
# Take an explicit copy so later in-place edits do not act on a view of telecom
telecom_hv = telecom[telecom['av_rech_amt'] >= cutoff].copy()

In [None]:
telecom_hv.shape

In [None]:
# We can drop total_rech_data_* and av_rech_amt_data_*
drop_data_columns = ["total_rech_data_6", "total_rech_data_7", "total_rech_data_8", "total_rech_data_9", 
                'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 'av_rech_amt_data_9']
telecom_hv.drop(drop_data_columns, axis=1, inplace=True)

#### The above columns can be dropped as they are not meaningful either for identifying high-value customers or for churn labelling

In [None]:
pd.set_option('display.max_rows', telecom_hv.shape[0]+1)

### Of the 99999 records in total, 30001 satisfy the high-value customer definition
### The churn prediction will be executed on the high-value customers

###  Derive Churn Label using the 'Churn Phase' data 
#### Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase
- 1 Churn
- 0 No Churn
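
The labelling rule can also be written without a row-wise `apply`. The sketch below is a vectorised equivalent on a toy frame; the three rows are invented, and only the column names come from the data set.

```python
import pandas as pd

# Toy churn-phase usage; values are invented for illustration.
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],
    "total_og_mou_9": [0.0, 3.0, 0.0],
    "vol_2g_mb_9": [0.0, 0.0, 5.1],
    "vol_3g_mb_9": [0.0, 0.0, 0.0],
})

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# A customer is labelled churned (1) only when every usage column is zero or less.
toy["Churn"] = (toy[usage_cols] <= 0).all(axis=1).astype(int)
print(toy["Churn"].tolist())
```

The third customer made no calls but used 2G data, so the label stays 0.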

In [None]:
def conditions(s):
    if ((s['total_ic_mou_9'] <= 0) & (s['total_og_mou_9'] <= 0) & (s['vol_2g_mb_9'] <= 0) & (s['vol_3g_mb_9'] <= 0)):
        return 1
    else:
        return 0

In [None]:
telecom_hv['Churn'] = telecom_hv.apply(conditions, axis=1)

#### Drop the Churn Phase Data set after Label Derivation as per Problem Instruction

In [None]:
telecom_hv = telecom_hv.loc[:,~telecom_hv.columns.str.endswith('_9')]
telecom_hv = telecom_hv.loc[:,~telecom_hv.columns.str.startswith('sep')]

In [None]:
telecom_hv.shape


#### Understand the Churn Rate & Imbalance

In [None]:
churn_rate = (sum(telecom_hv['Churn'])/len(telecom_hv['Churn'].index))*100
churn_rate

In [None]:
imbalance = (sum(telecom_hv['Churn'] != 0)/sum(telecom_hv['Churn'] == 0))*100
imbalance

- Churn rate: 8.14%
- Imbalance ratio (churned to non-churned customers): 8.66%
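
The two figures measure different things: the churn rate divides churners by all customers, while the imbalance ratio divides churners by non-churners. A minimal sketch on hypothetical labels (2 churners out of 20 customers, not the project data):

```python
import numpy as np

# Hypothetical labels: 2 churners out of 20 customers.
labels = np.array([1, 1] + [0] * 18)

churn_rate = labels.mean() * 100                       # churners as a share of all customers
imbalance = labels.sum() / (labels == 0).sum() * 100   # churners per 100 non-churners

print(round(churn_rate, 2), round(imbalance, 2))
```

For a churn rate p expressed as a fraction, the imbalance ratio equals p / (1 - p), so it is always slightly larger than the churn rate.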

## Data Prep, Exploratory Data Analysis & Feature Generation

In [None]:
#Study the dataset
telecom_hv.describe()

#### A few columns have the same value in every row
#### They cannot be used to explain any variance in the data set
#### It makes intuitive sense to drop these columns

In [None]:
nunique = telecom_hv.apply(pd.Series.nunique)
cols_to_drop = nunique[nunique == 1].index
cols_to_drop

In [None]:
telecom_hv.drop(cols_to_drop,axis=1,inplace=True)
telecom_hv.shape

#### Rows with more than 55% of data missing

In [None]:
# sum it up to check how many rows have all missing values
print("All null values:", telecom_hv.isnull().all(axis=1).sum())
# drop rows with 55% of missing data
telecom_hv = telecom_hv[(telecom_hv.isnull().sum(axis=1)/telecom_hv.shape[1])*100 < 55]
print("Record Count after Row/Column Data deletion:", telecom_hv.shape[0])

#### Date Format Alignment

### Box Plots, Bar Plots, Scatter plots and Correlation Matrix

In [None]:
#Create Bar Plot
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.barplot(x = 'Churn', y = 'arpu_6', data = telecom_hv)
plt.ylabel('Av Rev. Month 6')
plt.subplot(2,3,2)
sns.barplot(x = 'Churn', y = 'arpu_7', data = telecom_hv)
# Set the label after plotting so it is not overwritten by seaborn
plt.ylabel('Av Rev. Month 7')
plt.subplot(2,3,3)
sns.barplot(x = 'Churn', y = 'arpu_8', data = telecom_hv)
plt.ylabel('Av Rev. Month 8')


##### A drop in average revenue per user (ARPU) in month 8 indicates churn

In [None]:
#Create Bar Plot
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.barplot(x = 'Churn', y = 'total_og_mou_6', data = telecom_hv)
plt.subplot(2,3,2)
sns.barplot(x = 'Churn', y = 'total_og_mou_7', data = telecom_hv)
plt.subplot(2,3,3)
sns.barplot(x = 'Churn', y = 'total_og_mou_8', data = telecom_hv)


#### A drop in outgoing minutes of usage in month 8 indicates churn

In [None]:
#Create Bar Plot
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.barplot(x = 'Churn', y = 'total_ic_mou_6', data = telecom_hv)
plt.subplot(2,3,2)
sns.barplot(x = 'Churn', y = 'total_ic_mou_7', data = telecom_hv)
plt.subplot(2,3,3)
sns.barplot(x = 'Churn', y = 'total_ic_mou_8', data = telecom_hv)


#### A drop in incoming minutes of usage in month 8 indicates churn

In [None]:
#Create Bar Plot
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.barplot(x = 'Churn', y = 'onnet_mou_6', data = telecom_hv)
plt.subplot(2,3,2)
sns.barplot(x = 'Churn', y = 'onnet_mou_7', data = telecom_hv)
plt.subplot(2,3,3)
sns.barplot(x = 'Churn', y = 'onnet_mou_8', data = telecom_hv)


#### A drop in same-operator (on-net) calls in month 8 indicates churn

In [None]:
#Create Bar Plot
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.barplot(x = 'Churn', y = 'offnet_mou_6', data = telecom_hv)
plt.subplot(2,3,2)
sns.barplot(x = 'Churn', y = 'offnet_mou_7', data = telecom_hv)
plt.subplot(2,3,3)
sns.barplot(x = 'Churn', y = 'offnet_mou_8', data = telecom_hv)


#### A drop in calls to other operators (off-net) in month 8 indicates churn
#### It would help to merge months 6 and 7 into a single average representing the good phase

In [None]:
telecom_hv.shape

#### Extract related groups of columns separately to plot their correlations and spot highly correlated variables
#### The groups include totals, amounts, minutes of usage, OFFNET & ONNET, and 2G and 3G data

In [None]:
rech_data = telecom_hv.loc[:,telecom_hv.columns.str.contains('rech')]
tot_data = telecom_hv.loc[:,telecom_hv.columns.str.contains('tot')]
amt_data = telecom_hv.loc[:,telecom_hv.columns.str.contains('amt')]
ic_mou_data = telecom_hv.loc[:,(telecom_hv.columns.str.contains('ic') & telecom_hv.columns.str.contains('mou'))]
og_mou_data = telecom_hv.loc[:,(telecom_hv.columns.str.contains('og') & telecom_hv.columns.str.contains('mou'))]
net_mou_data = telecom_hv.loc[:,telecom_hv.columns.str.contains('net_mou')]
data3g = telecom_hv.loc[:,(telecom_hv.columns.str.contains('3g'))]
data2g = telecom_hv.loc[:,(telecom_hv.columns.str.contains('2g'))]

In [None]:
rech_data.shape

In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (25,25))
sns.heatmap(rech_data.corr(),annot = True)

#### Observations
- High correlation between the average recharge amount and the recharge amounts for months 6 and 7
- This is expected, as the average recharge amount is calculated from these columns for the purpose of filtering high-value customers
- There is a high correlation (80%) between data recharge for month 7 and recharge for month 8
- Any factor that correlates with month 8 is probably relevant to the churn prediction
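
Large heatmaps are hard to read by eye, so it can help to list the strongest pairs directly. The sketch below does this on a toy frame; the columns `a`, `b` and `c` are hypothetical stand-ins, not columns from the data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs.head())
```

The same pattern should work on any of the column groups extracted above, for example by passing `rech_data` in place of `toy`.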

In [None]:
tot_data.shape

In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (12,12))
sns.heatmap(tot_data.corr(),annot = True)

#### Observations
- There is greater than 70%, and in some cases 82%, correlation between months 7 and 8 for incoming and outgoing minutes of usage
- This is probably because heavy usage in month 7 tends to be followed by heavy usage in month 8: a customer with heavy usage is unlikely to churn, and vice versa

In [None]:
amt_data.shape

In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (10,10))
sns.heatmap(amt_data.corr(),annot = True)

#### Observations
- Some of this correlation is the same as the First Recharge Amount correlation
- There is also higher correlation between the Max Recharge Amount in month 8 (Bad Phase) and the Last Day Recharge Amount
- This could indicate that if a customer is not going to Churn then they Recharge for a higher amount in month 8 

In [None]:
#Create scatter plot to understand distribution of amounts
plt.figure(figsize=(25, 10))
plt.subplot(2,3,1)
sns.scatterplot(x = 'total_rech_amt_6', y = 'total_rech_amt_8', data = telecom_hv, hue = 'Churn')
plt.subplot(2,3,2)
sns.scatterplot(x = 'total_rech_amt_7', y = 'total_rech_amt_8', data = telecom_hv, hue = 'Churn')
plt.subplot(2,3,3)
sns.scatterplot(x = 'av_rech_amt', y = 'total_rech_amt_8', data = telecom_hv, hue = 'Churn')

In [None]:
ic_mou_data.shape

In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (36,36))
sns.heatmap(ic_mou_data.corr(),annot = True)

#### Observations

- Total incoming minutes of usage are almost entirely explained by local call usage and only slightly by STD calls
- Total incoming minutes of usage in month 8 are also correlated with month 7, indicating that a customer with high MOU in month 7 tends to keep a high MOU in month 8
- The STD incoming MOU is fully explained by the T2M minutes of usage
- High correlation between incoming T2T usage for months 6 and 7 and for months 7 and 8

In [None]:
og_mou_data.shape

In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (40,40))
sns.heatmap(og_mou_data.corr(),annot = True)

#### Observations

- Total outgoing minutes of usage are almost entirely explained by STD call usage and only slightly by local calls
- Total outgoing minutes of usage in month 8 are also correlated with month 7, indicating that a customer with high MOU in month 7 tends to keep a high MOU in month 8
- The STD outgoing MOU is highly correlated with the T2T minutes of usage
- High correlation between outgoing T2T usage for months 6 and 7 and for months 7 and 8

In [None]:
net_mou_data.shape


In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (6,6))
sns.heatmap(net_mou_data.corr(),annot = True)

#### Observations

- No Correlation between ONNET and OFFNET Minutes of usage
- High correlation between months 7 and 8 both for ONNET and OFFNET usage

In [None]:
data3g.shape

In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (18,18))
sns.heatmap(data3g.corr(),annot = True)

#### Observations

- 70% correlation between Average revenue per user and the 3G Volume of data usage for all Months

In [None]:
data2g.shape

In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (15,15))
sns.heatmap(data2g.corr(),annot = True)

#### Observations

- Very High correlation between the Recharge Sachets and the Count of Recharges for all months

#### Consider dropping columns where most of the values are 0 i.e. Greater than 75% 

In [None]:
check_cols = (telecom_hv[telecom_hv == 0].count(axis=0)/len(telecom_hv.index)*100)
check_cols = check_cols[check_cols > 75].index
check_cols

In [None]:
check_cols = check_cols[check_cols != 'Churn']

##### Numeric Features

In [None]:
telecom_n = telecom_hv.select_dtypes(include=np.number)

In [None]:
telecom_n.head()

In [None]:
telecom_n.shape

#### If there are any NAN values then fill them with 0

In [None]:
# Columns with more than 1% missing values
colmns_missing_data = round(100*(telecom_n.isnull().sum()/len(telecom_n.index)), 2)
cols = colmns_missing_data[colmns_missing_data>1]

In [None]:
cols

In [None]:
telecom_cat = pd.DataFrame(telecom_n,columns = ['mobile_number','night_pck_user_6','night_pck_user_7','night_pck_user_8','fb_user_6','fb_user_7','fb_user_8'])
telecom_n.drop(['night_pck_user_6','night_pck_user_7','night_pck_user_8','fb_user_6','fb_user_7','fb_user_8'],axis=1,inplace=True)


In [None]:
telecom_cat.shape


In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=10, verbose=0)
imp.fit(telecom_n)
imputed_df = imp.transform(telecom_n)

In [None]:
# Rebuild a DataFrame from the imputed array and restore the column names
new_df = pd.DataFrame(imputed_df)
new_df.columns = telecom_n.columns
new_df.head()

In [None]:
telecom_n = pd.merge(new_df, telecom_cat, on='mobile_number', how='inner')
telecom_n.head()
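
`IterativeImputer` returns a plain NumPy array, so row labels and column names have to be restored by hand. One possible way to keep them aligned is sketched below on a toy frame; the index values stand in for a customer key and are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

toy = pd.DataFrame(
    {"x1": [1.0, 2.0, np.nan, 4.0, 5.0],
     "x2": [2.0, 4.0, 6.0, np.nan, 10.0]},
    index=[101, 102, 103, 104, 105],  # stand-in for a customer key
)

imputer = IterativeImputer(max_iter=10, random_state=0)
# Rebuild the frame with the original index and column names.
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), index=toy.index, columns=toy.columns)
print(toy_imputed)
```

Observed values pass through unchanged; only the two missing cells are estimated by regressing each column on the other.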

##### Non Numeric Variables

In [None]:
# Everything that is not numeric (mainly the date columns)
telecom_nn = telecom_hv.select_dtypes(exclude=np.number)

In [None]:
telecom_nn.head()

#### Since we are not predicting any churn based on the above Date variables - is it safe to drop them? 

### Derived features based on the exploratory data analysis performed in the previous step
1. Average ARPU during months 6 and 7
2. Average OG minutes of usage during months 6 and 7
3. Average IC minutes of usage during months 6 and 7
4. Average OFFNET and ONNET minutes of usage for months 6 and 7

In [None]:
telecom_n['gp_avg_arpu'] = (telecom_n['arpu_6'] + telecom_n['arpu_7'])/2

- If a customer only joined in the bad phase, assign the bad-phase value to the good phase

In [None]:
telecom_n['gp_avg_arpu'] = np.where((telecom_n['arpu_8'] > 0) & (telecom_n['gp_avg_arpu'] == 0),telecom_n['arpu_8'],telecom_n['gp_avg_arpu'])                              

- Drop the individual month ARPU data as it's redundant

In [None]:
telecom_n.drop(['arpu_6','arpu_7'],axis=1,inplace=True)

In [None]:
telecom_n['total_og_mou_gp'] = (telecom_n['total_og_mou_6'] + telecom_n['total_og_mou_7'])/2

In [None]:
# If a customer only joined in the bad phase, reuse the month 8 outgoing usage for the good phase
telecom_n['total_og_mou_gp'] = np.where((telecom_n['total_og_mou_8'] > 0) & (telecom_n['total_og_mou_gp'] == 0),telecom_n['total_og_mou_8'],telecom_n['total_og_mou_gp'])

- Drop the individual month Outgoing usage data as it's redundant

In [None]:
telecom_n.drop(['total_og_mou_6','total_og_mou_7'],axis=1,inplace=True)

In [None]:
telecom_n['total_ic_mou_gp'] = (telecom_n['total_ic_mou_6'] + telecom_n['total_ic_mou_7'])/2

- If a customer only joined in the bad phase, assign the bad-phase value to the good phase

In [None]:
telecom_n['total_ic_mou_gp'] = np.where((telecom_n['total_ic_mou_8'] > 0) & (telecom_n['total_ic_mou_gp'] == 0),telecom_n['total_ic_mou_8'],telecom_n['total_ic_mou_gp'])                              

- Drop the individual incoming usage month data as it's redundant

In [None]:
telecom_n.drop(['total_ic_mou_6','total_ic_mou_7'],axis=1,inplace=True)

In [None]:
telecom_n['onnet_mou_gp'] = (telecom_n['onnet_mou_6'] + telecom_n['onnet_mou_7'])/2
telecom_n['offnet_mou_gp'] = (telecom_n['offnet_mou_6'] + telecom_n['offnet_mou_7'])/2

- Drop the individual ONNET and OFFNET usage month data as it's redundant

In [None]:
telecom_n.drop(['onnet_mou_6','onnet_mou_7','offnet_mou_6','offnet_mou_7'],axis=1,inplace=True)

In [None]:
telecom_n.fillna(0,inplace=True)
telecom_n.shape

#### Feature Generation - Introduce new Feature called Retain Factor
- Retain Factor is calculated as the ratio of the Bad phase Average Revenue / Good Phase Average revenue
- And the Ratio of the number of recharges in month 8 Vs Month 7
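
Both ratios can have a zero denominator, which produces `inf` or `NaN`. The sketch below shows one way to guard against that on toy values. Treating an undefined ratio as 0 is an assumption of this sketch, not the notebook's rule: the cells that follow cap any ratio above 1 (including `inf`) at 1.

```python
import numpy as np
import pandas as pd

# Toy values: column names mirror the notebook, numbers are invented.
toy = pd.DataFrame({
    "arpu_8": [50.0, 300.0, 0.0, 120.0],
    "gp_avg_arpu": [200.0, 250.0, 0.0, 0.0],
})

# Replace zero denominators with NaN so the division never yields inf.
ratio = toy["arpu_8"] / toy["gp_avg_arpu"].replace(0, np.nan)
# Cap at 1 and treat undefined ratios as 0 (an assumption of this sketch).
toy["retain_factor_arpu"] = ratio.clip(upper=1).fillna(0).round(2)
print(toy)
```

Which default is appropriate for a customer with no good-phase revenue is a modelling decision and worth checking against the data.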

In [None]:
#telecom_n.dtypes
telecom_n['retain_factor_arpu'] = round(telecom_n['arpu_8'] / telecom_n['gp_avg_arpu'],2)
telecom_n['retain_factor_rech'] = round(telecom_n['total_rech_num_8'] / telecom_n['total_rech_num_7'],2)
telecom_n['retain_factor_rech'] = np.where(telecom_n['retain_factor_rech'] > 1,1,telecom_n['retain_factor_rech'])
telecom_n['retain_factor_arpu'] = np.where(telecom_n['retain_factor_arpu'] > 1,1,telecom_n['retain_factor_arpu'])

In [None]:
#Deduce a factor for retaining the customer
telecom_n['retain_factor'] = np.where((telecom_n['retain_factor_arpu'] > 0.5) & (telecom_n['retain_factor_rech'] > 0.5),0,1)
telecom_n.drop(columns = ['retain_factor_rech','retain_factor_arpu'], axis=1, inplace=True)

## Perform PCA and Predict Churn

In [None]:
from sklearn.model_selection import train_test_split
# Assign feature variable to X
X = telecom_n.drop(['Churn','mobile_number'],axis=1)
# Assign response variable to y
y = telecom_n['Churn']
y.head()

### Feature Standardisation

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
XScale = scaler.transform(X)

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(XScale,y, train_size=0.7,test_size=0.3,random_state=100)
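
A common alternative is to split first and fit the scaler on the training rows only, so that no statistics from the test rows influence the transform. The sketch below shows that ordering on synthetic data; the `_toy` names and the generated values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(100)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.7, random_state=100)

# Fit on the training rows only, then apply the same transform to both splits.
scaler_toy = StandardScaler().fit(X_tr)
X_tr_s = scaler_toy.transform(X_tr)
X_te_s = scaler_toy.transform(X_te)
print(X_tr_s.mean(axis=0).round(3), X_te_s.shape)
```

The scaled training columns have mean 0 by construction, while the test columns are only approximately centred because they reuse the training mean and variance.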

In [None]:
X_train.shape

### Understand the data imbalance

In [None]:
y_train_imb = (y_train != 0).sum()/(y_train == 0).sum()
y_test_imb = (y_test != 0).sum()/(y_test == 0).sum()
print("Imbalance in Train Data:", y_train_imb)
print("Imbalance in Test Data:", y_test_imb)

In [None]:
count_class = telecom_n['Churn'].value_counts(sort=True)
count_class.plot(kind='bar',rot = 0)
plt.title('Churn Distribution')
plt.xlabel('Churn')

#### Handle data imbalance by Performing SMOTE oversampling on the data set
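
SMOTE creates new minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The function below is a simplified, hand-rolled illustration of that idea on four toy points; `smote_like` is a hypothetical helper and not the `imblearn` implementation used in the cells that follow.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority samples; ignore self-distance.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]            # k nearest minority neighbours
    base = rng.integers(0, len(X_min), size=n_new)        # random minority sample
    nb = neighbours[base, rng.integers(0, k, size=n_new)] # one of its neighbours
    gap = rng.random((n_new, 1))                          # position along the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6)
print(X_new.shape)
```

Every synthetic row lies on a line segment between two real minority rows, which is why oversampling should be applied to the training split only.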


In [None]:
### Other Sampling Techniques just for playing around
#from imblearn.combine import SMOTETomek
#from imblearn.under_sampling import NearMiss
#smk = SMOTETomek(random_state = 42)
#X_trainb,y_trainb = smk.fit_sample(X_train,y_train)

In [None]:
### Other Sampling Techniques just for playing around
#from imblearn.over_sampling import RandomOverSampler
#os = RandomOverSampler(sampling_strategy=1)
#X_trainb,y_trainb = os.fit_sample(X_train,y_train)

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
smt = SMOTE(random_state = 2) 
# fit_sample was removed from imbalanced-learn; fit_resample is the current method name
X_trainb,y_trainb = smt.fit_resample(X_train,y_train)

In [None]:
X_trainb.shape

In [None]:
y_trainb.shape

In [None]:
#Importing the PCA module
from sklearn.decomposition import PCA
pca = PCA(svd_solver='randomized', random_state=42)

In [None]:
#Doing the PCA on the train data
pca.fit(X_trainb)

In [None]:
pca.components_

In [None]:
colnames = list(X.columns)
pcs_df = pd.DataFrame({'PC1':pca.components_[0],'PC2':pca.components_[1], 'PC3':pca.components_[2],'Feature':colnames})
pcs_df.head(10)

In [None]:
%matplotlib inline
fig = plt.figure(figsize = (20,20))
plt.scatter(pcs_df.PC1, pcs_df.PC2)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
for i, txt in enumerate(pcs_df.Feature):
    plt.annotate(txt, (pcs_df.PC1[i],pcs_df.PC2[i]))
plt.tight_layout()
plt.show()

In [None]:
#Making the screeplot - plotting the cumulative variance against the number of components
%matplotlib inline
fig = plt.figure(figsize = (12,9))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

#### Looks like approx. 50 components are enough to describe 90% of the variance in the dataset
- We'll choose 50 components for our modeling
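
Instead of reading the count off the scree plot, the number of components can be computed from the cumulative explained variance. A minimal sketch on synthetic data follows; `X_toy` is generated here and is not the telecom data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Toy data with correlated columns so a few components carry most of the variance.
latent = rng.normal(size=(300, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

pca_toy = PCA().fit(X_toy)
cum_var = np.cumsum(pca_toy.explained_variance_ratio_)
# First index where cumulative variance reaches 90%, converted to a component count.
n_components_90 = int(np.argmax(cum_var >= 0.90)) + 1
print(n_components_90)
```

Applied to the fitted `pca` object above, the same two lines would give the exact component count for any variance target.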

In [None]:
#Using incremental PCA for efficiency - saves a lot of time on larger datasets
from sklearn.decomposition import IncrementalPCA
pca_final = IncrementalPCA(n_components=50)

#### Basis transformation - getting the data onto our PCs


In [None]:
df_train_pca = pca_final.fit_transform(X_trainb)
df_train_pca.shape

#### Creating correlation matrix for the principal components - we expect little to no correlation

In [None]:
#creating correlation matrix for the principal components
corrmat = np.corrcoef(df_train_pca.transpose())

In [None]:
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (50,30))
sns.heatmap(corrmat,annot = True)

In [None]:
# 1s -> 0s in diagonals
corrmat_nodiag = corrmat - np.diagflat(corrmat.diagonal())
print("max corr:",corrmat_nodiag.max(), ", min corr: ", corrmat_nodiag.min(),)
# we see that correlations are indeed very close to 0

#### There is no correlation between any two components! 


In [None]:
#Applying selected components to the test data - 50 components
telecom_test_pca = pca_final.transform(X_test)
telecom_test_pca.shape

## Multiple Logistic Regression Models with the Principal Components
#### Model 1 - Use the No. of Principal Components determined by PCA

In [None]:
#Training the model on the train data
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

learner_pca = LogisticRegression()
model_pca = learner_pca.fit(df_train_pca,y_trainb)

In [None]:
#Making prediction on the test data and reporting the ROC AUC to two decimals
pred_probs_test = model_pca.predict_proba(telecom_test_pca)[:,1]
"{:2.2f}".format(metrics.roc_auc_score(y_test, pred_probs_test))

In [None]:
# Predict Results from PCA Model
ypred_pca = model_pca.predict(telecom_test_pca)

In [None]:
# Confusion matrix 
confusion_PCA = metrics.confusion_matrix(y_test, ypred_pca)
print(confusion_PCA)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, ypred_pca))

#### Model 2 - Let PCA Choose the number of Principal Components explaining 90% of Variance

In [None]:
pca_again = PCA(0.90)

In [None]:
df_train_pca2 = pca_again.fit_transform(X_trainb)
df_train_pca2.shape
# we see that PCA selected 38 components

In [None]:
#training the regression model
learner_pca2 = LogisticRegression()
model_pca2 = learner_pca2.fit(df_train_pca2,y_trainb)

In [None]:
df_test_pca2 = pca_again.transform(X_test)
df_test_pca2.shape

In [None]:
#Making prediction on the test data
pred_probs_test2 = model_pca2.predict_proba(df_test_pca2)[:,1]
"{:2.2f}".format(metrics.roc_auc_score(y_test, pred_probs_test2))

In [None]:
# Predict Results from PCA Model
ypred_pca2 = model_pca2.predict(df_test_pca2)

In [None]:
# Confusion matrix 
confusion_PCA = metrics.confusion_matrix(y_test, ypred_pca2)
print(confusion_PCA)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, ypred_pca2))

#### Model 3 - Let PCA Choose the number of Principal Components explaining 95% of Variance

In [None]:
pca_again = PCA(0.95)

In [None]:
df_train_pca3 = pca_again.fit_transform(X_trainb)
df_train_pca3.shape
# we see that PCA selected 51 components

In [None]:
#training the regression model
learner_pca3 = LogisticRegression()
model_pca3 = learner_pca3.fit(df_train_pca3,y_trainb)

In [None]:
df_test_pca3 = pca_again.transform(X_test)
df_test_pca3.shape

In [None]:
#Making prediction on the test data
pred_probs_test3 = model_pca3.predict_proba(df_test_pca3)[:,1]
"{:2.2f}".format(metrics.roc_auc_score(y_test, pred_probs_test3))

In [None]:
# Predict Results from PCA Model
ypred_pca3 = model_pca3.predict(df_test_pca3)

In [None]:
# Confusion matrix 
confusion_PCA = metrics.confusion_matrix(y_test, ypred_pca3)
print(confusion_PCA)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, ypred_pca3))

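## PCA Model with Logistic Regression Conclusions
- Selecting variables through PCA is simple and easy
- We choose Model 1, which uses 50 components, as the best-fit model based on the confusion matrix and the accuracy scores
- The number of false positives (1445) is still quite high

- **Confusion Matrix for Model 1** (rows are actual classes, columns are predicted classes)

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 6818 | 1445 |
| Actual 1 | 137 | 571 |

- **Model Accuracy**: 82%, computed from the confusion matrix as (6818 + 571) / 8971

- **Classification Report for Model 1**

| Class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.98 | 0.83 | 0.90 | 8263 |
| 1 | 0.28 | 0.81 | 0.42 | 708 |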
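
The headline metrics can be recomputed from the four confusion-matrix counts reported for Model 1. The sketch below uses only those counts and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = np.array([[6818, 1445],
               [137, 571]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted churners who actually churn
recall = tp / (tp + fn)      # share of actual churners that are caught
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

The low precision for the churn class reflects the large number of false positives relative to true positives.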

   
 

#### Visualize the data to see if we can spot any patterns

In [None]:
# Function to map the colors as a list from the input list of x variables
def pltcolor(lst):
    cols=[]
    for l in lst:
        if l==0:
            cols.append('red')
        elif l==1:
            cols.append('blue')
        else:
            cols.append('green')
    return cols
# Create the colors list using the function above
cols=pltcolor(y_trainb)

In [None]:

%matplotlib inline
fig = plt.figure(figsize = (12,10))
plt.scatter(df_train_pca[:,0], df_train_pca[:,1], s=200,c = cols)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.gray()
plt.show()

#### PCA with 10 Principal Components just to illustrate the difference
- Model 4

In [None]:
pca_last = PCA(n_components=10)
df_train_pca4 = pca_last.fit_transform(X_trainb)
df_test_pca4 = pca_last.transform(X_test)
df_test_pca4.shape

In [None]:
#training the regression model
learner_pca4 = LogisticRegression()
model_pca4 = learner_pca4.fit(df_train_pca4,y_trainb)
#Making prediction on the test data
pred_probs_test4 = model_pca4.predict_proba(df_test_pca4)[:,1]
"{:2.2f}".format(metrics.roc_auc_score(y_test, pred_probs_test4))

#### We get good results with the chosen principal components, and results almost as good with just 10 principal components.

### Logistic Regression Algorithm with RFE

In [None]:
# Create a copy
telecom_LR_wPCA = telecom_n.copy()

In [None]:
telecom_LR_wPCA.shape

In [None]:
telecom_LR_wPCA['Churn'].value_counts()

In [None]:
plt.figure(figsize=(8,4))
telecom_LR_wPCA['Churn'].value_counts().plot(kind = 'bar')
plt.ylabel('Count')
plt.xlabel('Churn status')
plt.title('Churn status Distribution',fontsize=14)

In [None]:
# Create the correlation matrix, find features with correlation greater than 0.95 and drop those columns
corr_matrix = telecom_LR_wPCA.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

In [None]:
print(to_drop)

In [None]:
# Drop high correlated features
telecom_LR_wPCA.drop(columns=to_drop, inplace=True)

In [None]:
telecom_LR_wPCA.shape

#### Feature Generation - Introduce new Feature called Retain Factor
- Retain Factor is calculated as the ratio of the Bad phase Average Revenue / Good Phase Average revenue
- And the Ratio of the number of recharges in month 8 Vs Month 7

In [None]:
#telecom_LR_wPCA.dtypes
telecom_LR_wPCA['retain_factor_arpu'] = round(telecom_LR_wPCA['arpu_8'] / telecom_LR_wPCA['gp_avg_arpu'],2)
telecom_LR_wPCA['retain_factor_rech'] = round(telecom_LR_wPCA['total_rech_num_8'] / telecom_LR_wPCA['total_rech_num_7'],2)
telecom_LR_wPCA['retain_factor_rech'] = np.where(telecom_LR_wPCA['retain_factor_rech'] > 1,1,telecom_LR_wPCA['retain_factor_rech'])
telecom_LR_wPCA['retain_factor_arpu'] = np.where(telecom_LR_wPCA['retain_factor_arpu'] > 1,1,telecom_LR_wPCA['retain_factor_arpu'])

- If the ratio of bad-phase ARPU to good-phase ARPU is greater than 0.6 and the ratio of the number of recharges is also greater than 0.6, the customer retention ratio is considered high
- A customer with a high retention ratio is unlikely to churn

In [None]:
#Deduce a factor for retaining the customer
telecom_LR_wPCA['retain_factor'] = np.where((telecom_LR_wPCA['retain_factor_arpu'] > 0.6) & (telecom_LR_wPCA['retain_factor_rech'] > 0.6),0,1)
telecom_LR_wPCA.drop(columns = ['retain_factor_rech','retain_factor_arpu'], axis=1, inplace=True)

In [None]:
telecom_LR_wPCA.retain_factor.describe()

In [None]:
# Assign feature variable to X
X = telecom_LR_wPCA.drop(['Churn','mobile_number'],axis=1)
X.head()

In [None]:
# Assign the response variable to y
y_LR = telecom_LR_wPCA[['Churn']]
y_LR.head()

In [None]:
# Splitting the data into train and test
X_train_LR, X_test_LR, y_train_LR, y_test_LR = train_test_split(X, y_LR, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
X_train_LR.shape

### Balance data set by oversampling

In [None]:
smt = SMOTE(random_state = 2) 
# fit_sample was removed from imbalanced-learn; fit_resample is the current method name
X_train_LR,y_train_LR = smt.fit_resample(X_train_LR,y_train_LR)

In [None]:
X_train_LR.shape

In [None]:
data_imbalance = (y_train_LR != 0).sum()/(y_train_LR == 0).sum()
print("Imbalance in Train Data: {}".format(data_imbalance))

In [None]:
#X_train_LR.head()
columns = X.columns
X_train_LR = pd.DataFrame(X_train_LR)
X_train_LR.columns = columns


In [None]:
ycolumns = y_LR.columns
y_train_LR = pd.DataFrame(y_train_LR)
y_train_LR.columns = ycolumns

In [None]:
y_train_LR.shape

In [None]:
from sklearn.preprocessing import StandardScaler
# Standardise every feature to zero mean and unit variance
scaler = StandardScaler()
X_train_LR[columns] = scaler.fit_transform(X_train_LR[columns])
X_train_LR.retain_factor.describe()

In [None]:
X_train_LR.retain_factor.describe()

In [None]:
import statsmodels.api as sm

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

#### Feature Selection Using RFE

In [None]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=45)             # running RFE with 45 variables as output
rfe = rfe.fit(X_train_LR, y_train_LR.values.ravel())

In [None]:
rfe.support_

In [None]:
list(zip(X_train_LR.columns, rfe.support_, rfe.ranking_))

In [None]:
col = X_train_LR.columns[rfe.support_]

#### Assessing the model with StatsModels

In [None]:
X_train_sm = sm.add_constant(X_train_LR[col])
logm2 = sm.GLM(y_train_LR,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
# Creating a dataframe with the actual churn flag and the predicted probabilities
y_train_pred_final = pd.DataFrame({'Churn':y_train_LR.Churn, 'Churn_Prob':y_train_pred})
y_train_pred_final['MobileNumber'] = y_train_LR.index
y_train_pred_final.head()

#### Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0

In [None]:
y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
from sklearn import metrics

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.predicted )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))

- Training accuracy using RFE is Approx. 0.853

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_LR[col].columns
vif['VIF'] = [variance_inflation_factor(X_train_LR[col].values, i) for i in range(X_train_LR[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Some variables have a very high VIF, i.e. they are largely explained by the other features.
# It's best to drop them, as they add little to prediction and make the model unnecessarily complex.
# Keep only the variables with a VIF below 9
col = vif[vif['VIF'] < 9]
col = col.Features

In [None]:
# Let's re-run the model using the selected variables
X_train_sm = sm.add_constant(X_train_LR[col])
logm3 = sm.GLM(y_train_LR,X_train_sm, family = sm.families.Binomial())
res = logm3.fit()
res.summary()

In [None]:
y_train_pred = res.predict(X_train_sm).values.reshape(-1)

In [None]:
y_train_pred[:10]

In [None]:
y_train_pred_final['Churn_Prob'] = y_train_pred

In [None]:
# Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0
y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))

##### Let's check the VIFs again
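
As a reminder of what is being checked: the VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the others, so a feature that the others predict well gets a large VIF. Below is a minimal NumPy sketch on synthetic data. The helper `vif_of` and the columns `a`, `b`, `c` are hypothetical. Unlike this sketch, statsmodels' `variance_inflation_factor` regresses on the columns exactly as passed and does not add an intercept itself.

```python
import numpy as np

def vif_of(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # almost a copy of column a
X_demo = np.column_stack([a, b, c])
print([round(vif_of(X_demo, i), 1) for i in range(3)])
```

Columns `a` and `c` should come out far above the cut-off of 9 used here, while the independent column `b` stays close to 1.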

In [None]:
vif = pd.DataFrame()
vif['Features'] = X_train_LR[col].columns
vif['VIF'] = [variance_inflation_factor(X_train_LR[col].values, i) for i in range(X_train_LR[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Let's re-run the model using the selected variables
X_train_sm = sm.add_constant(X_train_LR[col])
logm4 = sm.GLM(y_train_LR,X_train_sm, family = sm.families.Binomial())
res = logm4.fit()
res.summary()

In [None]:
y_train_pred = res.predict(X_train_sm).values.reshape(-1)

In [None]:
y_train_pred[:10]

In [None]:
y_train_pred_final['Churn_Prob'] = y_train_pred

In [None]:
# Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0
y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train_pred_final.Churn, y_train_pred_final.predicted))

#### Making predictions on the Test Data set

In [None]:
X_test_LR = X_test_LR[col]
X_test_LR.head()

In [None]:
X_test_sm = sm.add_constant(X_test_LR)

In [None]:
X_test_LR.shape

In [None]:
# Making predictions on the test set
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred.shape

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred, the predicted probabilities, to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Converting y_test_LR to a dataframe
y_test_df = pd.DataFrame(y_test_LR)

In [None]:
# Keeping the original row index in a MobileNumber column
y_test_df['MobileNumber'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)
y_pred_final.head()

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'Churn_Prob'})

In [None]:
# Rearranging the columns
#y_pred_final = y_pred_final.reindex_axis(['CustID','Churn','Churn_Prob'], axis=1)

In [None]:
y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)

In [None]:
y_pred_final.describe(percentiles=[.25, .5, .75, .90, .95, .99])

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Churn, y_pred_final.final_predicted)

In [None]:
y_pred_final.shape

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Churn, y_pred_final.final_predicted )
confusion2

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_pred_final.Churn, y_pred_final.final_predicted))

## Conclusions of RFE & Logistic Regression Algorithm

- The model accuracy on the test data set is high, at 0.913
- The confusion matrix is better than that of the PCA-based model
- Other models were created with different VIF cut-offs, e.g. VIF < 5, but their accuracy and precision/recall were not as good
- Hence a model with VIF < 9 and a set of 25 variables is chosen here
- The precision and the recall for the churn class are still lower than optimal, as the model produces some false positives and false negatives
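
High accuracy together with weaker precision and recall is common here because only the training data was balanced with SMOTE, while the test set keeps its natural imbalance. The sketch below shows how the three numbers relate to a confusion matrix. The counts are hypothetical and are not this notebook's results.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows = actual class, columns = predicted class, class 1 = churn
confusion = np.array([[8000, 300],
                      [250, 450]])
tn, fp, fn, tp = confusion.ravel()

accuracy = (tp + tn) / confusion.sum()
precision = tp / (tp + fp)   # of the customers flagged as churn, how many churned
recall = tp / (tp + fn)      # of the customers who churned, how many were flagged

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these made-up counts the accuracy is dominated by the large non-churn class, while precision and recall describe only the churn class.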

#### The main predictor variables for Telecom Churn are

- total_ic_mou_8
- onnet_mou_8
- std_og_t2m_mou_8
- arpu_2g_8
- total_rech_num_8
- loc_ic_t2m_mou_7
- max_rech_data_8
- loc_ic_t2m_mou_6
- total_rech_num_7
- std_ic_t2t_mou_8
- count_rech_3g_8
- retain_factor
- std_ic_t2t_mou_7
- gp_avg_arpu
- loc_ic_t2t_mou_7
- aug_vbc_3g
- count_rech_2g_8
- last_day_rch_amt_8
- vol_2g_mb_8
- aon


## Alternate Model with Random Forest Algorithm

In [None]:
# create a copy first
telecom_wPCA_RF = telecom_LR_wPCA.copy()

In [None]:
# Assign feature variable to X
X_RF = telecom_wPCA_RF.drop(['Churn','mobile_number'],axis=1)
X_RF.head()

In [None]:
# Assign response variable to y
y_RF = telecom_wPCA_RF['Churn']
y_RF.head()

In [None]:
# Splitting the data into train and test
X_train_RF, X_test_RF, y_train_RF, y_test_RF = train_test_split(X_RF, y_RF, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
# Oversample the minority (churn) class in the training data only
smt = SMOTE(random_state=2)
X_train_RF, y_train_RF = smt.fit_resample(X_train_RF, y_train_RF)

In [None]:
X_train_RF.shape

In [None]:
X_train_RF = pd.DataFrame(X_train_RF)
X_train_RF.columns = X_RF.columns

In [None]:
y_train_RF.shape

#### Feature reduction using an L1-penalised Linear Support Vector Classifier

In [None]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

In [None]:
numerics = ['int16','int32','int64','float16','float32','float64']
numerical_vars = list(X_train_RF.select_dtypes(include=numerics).columns)
X_train_RF = X_train_RF[numerical_vars]
X_train_RF.shape

#### Perform L1 regularisation using the LinearSVC algorithm to do dimensionality reduction
#### This is a feature available as part of the scikit-learn library
- https://scikit-learn.org/stable/modules/feature_selection.html
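
The idea is that an L1 penalty drives the coefficients of uninformative features to exactly zero, and `SelectFromModel` then keeps only the features with non-zero coefficients. Here is a minimal sketch on synthetic data. The array `X_demo`, the labels `y_demo` and the choice of two informative columns are assumptions for illustration, not the telecom data.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
# Only the first two columns carry signal (an assumption of this toy setup)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X_demo)
selector = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False, random_state=0))
selector.fit(X_scaled, y_demo)

mask = selector.get_support()              # True for the columns that are kept
X_selected = selector.transform(X_scaled)  # data restricted to the kept columns
print(mask, X_selected.shape)
```

A smaller `C` means a stronger penalty and therefore fewer surviving features.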

In [None]:
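# L1-penalised linear SVC; SelectFromModel fits its own copy of this estimator
Linear_SVC = LinearSVC(C=0.1, penalty="l1", dual=False)
lasso_model = SelectFromModel(Linear_SVC, prefit=False)
# Fit on the scaled features; coefficients shrunk to zero mark the features to drop
lasso_model.fit(scaler.transform(X_train_RF.fillna(0)), y_train_RF)
lasso_model.get_support()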

In [None]:
np.sum(lasso_model.estimator_.coef_ == 0)

In [None]:
deleted_vars = X_train_RF.columns[(lasso_model.estimator_.coef_ == 0).ravel().tolist()]
deleted_vars

In [None]:
# Drop the eliminated features from both the train and the test data set so that the columns match
X_train_RF.drop(columns=deleted_vars, inplace=True)
X_test_RF.drop(columns=deleted_vars, inplace=True)


In [None]:
X_train_RF.shape

#### Default Hyperparameters - Fit the Random Forest with default hyperparameters

In [None]:
# Importing random forest classifier from sklearn library
from sklearn.ensemble import RandomForestClassifier

# Running the random forest with default parameters.
rfc_d = RandomForestClassifier()

In [None]:
# fit
rfc_d.fit(X_train_RF,y_train_RF)

In [None]:
# Making predictions
predictions = rfc_d.predict(X_test_RF)

In [None]:
# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

In [None]:
# Let's check the report of our default model
print(classification_report(y_test_RF,predictions))

In [None]:
# Printing confusion matrix
print(confusion_matrix(y_test_RF,predictions))

In [None]:
print(accuracy_score(y_test_RF,predictions))

#### Get List of Important Features from the Random Forest Classifier

In [None]:
importances = list(rfc_d.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train_RF.columns, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];

#### Now let's look at the hyperparameters which we can tune to improve model performance.
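
`GridSearchCV` can also search several of these hyperparameters jointly and report the winning combination through `best_params_`. Below is a minimal sketch on synthetic data. The `make_classification` data and the tiny grid are assumptions for illustration, not the telecom data or the values chosen in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the telecom data set)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}  # illustrative values
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
best_rf = search.best_estimator_   # refitted on all of X_demo with the best parameters
```

A joint grid grows multiplicatively with every added parameter, which is one reason to tune parameters one at a time, as done in the sections that follow.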

#### Tuning max_depth
Let's try to find the optimum values for max_depth

In [None]:
# Imports for k-fold cross-validated grid search
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

In [None]:
# GridSearchCV to find optimal max_depth
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
parameters = {'max_depth': range(4, 10, 2)}

# instantiate the model
rf = RandomForestClassifier()


# fit tree on training data
rf = GridSearchCV(rf, parameters, 
                    cv=n_folds, 
                   scoring="accuracy", return_train_score=True)
rf.fit(X_train_RF, y_train_RF)

In [None]:
# scores of GridSearch CV
scores = rf.cv_results_
pd.DataFrame(scores).head()

In [None]:
# plotting accuracies with max_depth
plt.figure()
plt.plot(scores["param_max_depth"], 
         scores["mean_train_score"], 
         label="training accuracy")
plt.plot(scores["param_max_depth"], 
         scores["mean_test_score"], 
         label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

#### Tuning n_estimators

In [None]:
# GridSearchCV to find optimal n_estimators
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
parameters = {'n_estimators': range(50, 200, 50)}

# instantiate the model (note we are specifying a max_depth)
rf = RandomForestClassifier(max_depth=6)

# fit tree on training data
rf = GridSearchCV(rf, parameters, 
                    cv=n_folds, 
                   scoring="precision", return_train_score=True)
rf.fit(X_train_RF, y_train_RF)

In [None]:
# scores of GridSearch CV
scores = rf.cv_results_
pd.DataFrame(scores).head()

In [None]:
# plotting precision (the scoring metric used in this search) with n_estimators
plt.figure()
plt.plot(scores["param_n_estimators"], 
         scores["mean_train_score"], 
         label="training precision")
plt.plot(scores["param_n_estimators"], 
         scores["mean_test_score"], 
         label="test precision")
plt.xlabel("n_estimators")
plt.ylabel("Precision")
plt.legend()
plt.show()

#### Tuning max_features

In [None]:
# GridSearchCV to find optimal max_features
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
parameters = {'max_features': [ 8, 12, 16, 20, 24]}

# instantiate the model
rf = RandomForestClassifier(max_depth = 6)

# fit tree on training data
rf = GridSearchCV(rf, parameters, 
                    cv=n_folds, 
                    
                   scoring="accuracy", return_train_score=True)
rf.fit(X_train_RF, y_train_RF)

In [None]:
# scores of GridSearch CV
scores = rf.cv_results_
pd.DataFrame(scores).head()

In [None]:
# plotting accuracies with max_features
plt.figure()
plt.plot(scores["param_max_features"], 
         scores["mean_train_score"], 
         label="training accuracy")
plt.plot(scores["param_max_features"], 
         scores["mean_test_score"], 
         label="test accuracy")
plt.xlabel("max_features")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

#### Tuning min_samples_leaf

In [None]:


# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
parameters = {'min_samples_leaf': range(30, 200, 50)}

# instantiate the model
rf = RandomForestClassifier()

# fit tree on training data
rf = GridSearchCV(rf, parameters, 
                    cv=n_folds,                   
                   scoring="accuracy", return_train_score=True)
rf.fit(X_train_RF, y_train_RF)

In [None]:
# scores of GridSearch CV
scores = rf.cv_results_
pd.DataFrame(scores).head()

In [None]:
# plotting accuracies with min_samples_leaf
plt.figure()
plt.plot(scores["param_min_samples_leaf"], 
         scores["mean_train_score"], 
         label="training accuracy")
plt.plot(scores["param_min_samples_leaf"], 
         scores["mean_test_score"], 
         label="test accuracy")
plt.xlabel("min_samples_leaf")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

**Fitting the final model with the best parameters obtained from grid search.**

In [None]:
# model with the best hyperparameters
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(bootstrap=True,
                             max_depth=10,
                             min_samples_leaf=50, 
                             min_samples_split=200,
                             max_features=22,
                             n_estimators=100)

In [None]:
# fit
rfc.fit(X_train_RF,y_train_RF)

In [None]:
# predict
predictions = rfc.predict(X_test_RF)

In [None]:
# evaluation metrics
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test_RF,predictions))

In [None]:
print(confusion_matrix(y_test_RF,predictions))

In [None]:
print(accuracy_score(y_test_RF,predictions))

#### Get numerical feature importances

In [None]:
importances = list(rfc.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train_RF.columns, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];

## Conclusions of Random Forest Algorithm

- The model accuracy on the test data set is very high, at 0.93
- The model was tuned over several hyperparameters (max_depth, n_estimators, max_features and min_samples_leaf) to improve its performance

#### The main predictor variables for Telecom Churn are

- loc_ic_mou_8
- total_ic_mou_8
- loc_ic_t2m_mou_8
- last_day_rch_amt_8
- max_rech_data_8
- loc_ic_t2t_mou_8
- max_rech_amt_8
- count_rech_2g_8
- total_og_mou_8
- loc_og_t2t_mou_8
- total_rech_num_8




## XGBoost - Queen Bee Algorithm!

In [None]:
import xgboost as xgb

In [None]:
x_xgboost, y_xgboost = telecom_n.drop(['Churn'],axis=1),telecom_n[['Churn']]

In [None]:
#Create a matrix for identifying important predictors
data_dmatrix = xgb.DMatrix(data=x_xgboost,label=y_xgboost)

In [None]:
#separate the data into train and test
X_train_xg, X_test_xg, y_train_xg, y_test_xg = train_test_split(x_xgboost, y_xgboost, test_size=0.3, random_state=123)

In [None]:
#Create XGBoost classifier model (binary:logistic is the objective for two-class problems)
xg_class = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)

In [None]:
#Train and Predict based on the model
xg_class.fit(X_train_xg,y_train_xg)

preds = xg_class.predict(X_test_xg)

In [None]:
print(accuracy_score(y_test_xg,preds))

In [None]:
print(confusion_matrix(y_test_xg,preds))

In [None]:
params = {"objective":"binary:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5,
                    num_boost_round=50,early_stopping_rounds=10,metrics="auc", as_pandas=True, seed=123)

In [None]:
cv_results.head()

In [None]:
#Perform KFold cross validation to obtain a better measure of Accuracy
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# shuffle=True is needed for random_state to take effect
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(xg_class, x_xgboost, y_xgboost, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
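
K-fold cross-validation trains and scores the model once per fold and reports the mean and spread of the fold scores. Below is a minimal sketch of the same pattern on synthetic data. A `DecisionTreeClassifier` stands in for the XGBoost classifier so that the example needs only scikit-learn; the data and the stand-in model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the telecom data set)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# random_state only takes effect when shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X_demo, y_demo, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (scores.mean() * 100, scores.std() * 100))
```

The standard deviation in brackets indicates how much the score depends on which rows land in the held-out fold.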

In [None]:
print(classification_report(y_test_xg,preds))

In [None]:
xg_class1 = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)

In [None]:
import matplotlib.pyplot as plt

# Create the large figure first so that the tree is drawn on it
fig, ax = plt.subplots(figsize=(50, 50))
xgb.plot_tree(xg_class1, num_trees=0, ax=ax)
plt.show()

In [None]:
# Create the figure first so that the chosen size applies to this plot
fig, ax = plt.subplots(figsize=(100, 100))
xgb.plot_importance(xg_class1, ax=ax)
plt.show()

##### **Fin**

- The machine learning model built with the **XGBoost** algorithm has been chosen as the best
- The reason is that its accuracy and precision/recall scores are the highest of all the algorithms compared
- XGBoost also performs better than most of the other algorithms tried