<a href="https://colab.research.google.com/github/tariqzia5/ML_Telecom-Churn-Prediction/blob/main/Telecom_Churn_Prediction_New.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset Location
https://www.kaggle.com/datasets/mnassrib/telecom-churn-datasets

# Problem Statement and Project Objective 

## **PROJECT OBJECTIVE**
In a hyper-competitive telecom market, customers are able to select from multiple-service providers and actively switch between services providers for various reason: poor customer service (e.g., long waiting time to resolve issues, have to make multiple call to resolve queries, etc.), poor network coverage, lack of self-service tools and platforms, etc. The Churn burn is further compounded by the entry of MVNOs who offer wide variety of services in a bundle (voice + data + OTT) and have deployed new technologies compared to incumbent service provider. Customers are willing to leave incumbent service providers if the brand sucks.

Further, service providers are focused on customer acquisition and seldom pay attention to customer retention untill the churn rate is peaking. However, it can cost more to acquire a new customer than retaining existing customers.

Predicting churn in a hyper competitive market is becoming important factor in customer relationship management and to maintain or improve ARPU.

# **ISSUES WITH THE DATASET**
1. It is not clear if the data provided is for a post-paid or pre-paid subscriber.
2. Additional variables that would have made difference in prediction are ARPU, clarity on account lentght (days?, weeks? months?), plan type, gender, sold directly by Orange or via Partner (this would have further helped in identifying where the Churn is happening), Roaming (Yes or No), Day Pack (Voice and Data), Night Pack (Voice and Data), Internet Plan (if any)


# **TARGET VARIABLE**

Churn is the target variable

# **Importing Files**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
telecom_churn = pd.read_csv('https://raw.githubusercontent.com/tariqzia5/ML_Telecom-Churn-Prediction/main/churn-bigml-80.csv')
telecom_churn

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2661,SC,79,415,No,No,0,134.7,98,22.90,189.7,68,16.12,221.4,128,9.96,11.8,5,3.19,2,False
2662,AZ,192,415,No,Yes,36,156.2,77,26.55,215.5,126,18.32,279.1,83,12.56,9.9,6,2.67,2,False
2663,WV,68,415,No,No,0,231.1,57,39.29,153.4,55,13.04,191.3,123,8.61,9.6,4,2.59,3,False
2664,RI,28,510,No,No,0,180.8,109,30.74,288.8,58,24.55,191.9,91,8.64,14.1,6,3.81,2,False


# **EXPLORE DATA**

**INFO**

In [2]:
telecom_churn.info() #to identify data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2666 entries, 0 to 2665
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   State                   2666 non-null   object 
 1   Account length          2666 non-null   int64  
 2   Area code               2666 non-null   int64  
 3   International plan      2666 non-null   object 
 4   Voice mail plan         2666 non-null   object 
 5   Number vmail messages   2666 non-null   int64  
 6   Total day minutes       2666 non-null   float64
 7   Total day calls         2666 non-null   int64  
 8   Total day charge        2666 non-null   float64
 9   Total eve minutes       2666 non-null   float64
 10  Total eve calls         2666 non-null   int64  
 11  Total eve charge        2666 non-null   float64
 12  Total night minutes     2666 non-null   float64
 13  Total night calls       2666 non-null   int64  
 14  Total night charge      2666 non-null   

In [None]:
telecom_churn.shape # No. of Rows and Columns

# **DATA DICTIONARY**

1. STATE - STRING (US States-- AZ, WA, LA, IN, NY...); should be 51 only
2. ACCOUNT LENGTH - INTEGER (NO OF DAYS, I.E., HOW OLD IS THE ACCOUNT?)
3. AREA CODE - INTEGER (STATE CODE)
4. INTENATIONAL PLAN - STRING (YES OR NO; DESCRIBES IF THE CUSTOMER HAS A SUBSCRIBED TO AN INTERNATIONAL PLAN OR NOT)
5. VOICE MAIL PLAN - STRING (YES OR NO; HAS THE CUSTOMER SUBSCRIBED TO VOICE MAIL FEATURE)
6. NUMBER OF VMAIL MESSAGES - INTEGER (NO. OF VOICE MAIL MESSAGES)
7. TOTAL DAY MINUTES - FLOAT (NUMBER OF VOICE MINUTES USED IN A DAY)
8. TOTAL DAY CALLS - INTEGER (NUMBER OF CALLS MADE IN A DAY)
9. TOTAL DAY CHARGE - FLOAT (DAY CHARGE CACULATED BASED ON NUMBER OF CALLS MADE IN A DAY)
10. TOTAL EVE MINUTES - FLOAT (TOTAL CALL LENGHT IN THE EVENING)
11. TOTAL EVE CALLS - INTEGER (TOTAL NUMBER OF CALLS MADE IN THE EVENING)
12. TOTAL EVE CHARGE - FLOAT (TOTAL CHARGE FOR CALLS MADE IN THE EVENING)
13. TOTAL NIGHT MINUTES - FLOAT (TOTAL LENGTH OF THE CALL AFTER 9PM; FREE UNLIMITED TIME)
14. TOTAL NIGHT CALLS - INTEGER (TOTAL NUMBER OF CALLS MADE AT NIGHT)
15. TOTAL NIGHT CHARGE - FLOAT (TOTAL COST OF NIGHT CALLS)
16. TOTAL INTL MINUTES - FLOAT (TOTAL NUMBER OF MINUTES FOR INTERNATIONAL CALLS)
17. TOTAL INTL CALLS - INTEGER (TOTAL NUMBER OF INTERNATIONAL CALLS)
18. TOTAL INTL CHARGES - FLOAT (TOTAL CHARGE FOR INTERNATIONAL CALLS)
19. CUSTOMER SERVICE CALLS - INTERGER (TOTAL NUMBER OF CALLS MADE TO CUSTOMER SERVICE CENTER)
20. CHURN - BOOLEAN (TRUE = CHURN OR FALSE = NO CHURN)

**DATA TYPES:**

THERE ARE TOTAL OF # OF BOOLEAN VARAIBLES (1), # OF FLOAT64 (8), # OF INT64 (8), AND OBJECT (3)



**CONVERT NON-NUMERICAL DATA INTO NUMERICAL**

In [None]:
telecom_churn['Churn_num']  = telecom_churn['Churn'].replace({False : 0, True: 1})
print(telecom_churn['Churn'].value_counts())
print(telecom_churn['Churn_num'].value_counts())

In [None]:
telecom_churn['Intl_plan_num']  = telecom_churn['International plan'].map({'No' : 0, 'Yes': 1}).astype(int)
print(telecom_churn['International plan'].value_counts())
print(telecom_churn['Intl_plan_num'].value_counts())

In [None]:
telecom_churn['Vmail_plan_num']  = telecom_churn['Voice mail plan'].map({'No' : 0, 'Yes': 1}).astype(int)
print(telecom_churn['Voice mail plan'].value_counts())
print(telecom_churn['Vmail_plan_num'].value_counts())

In [None]:
telecom_churn.info()

In [None]:
telecom_churn

In [None]:
telecom_churn.columns.values

In [None]:
telecom_churn.head(10) #first 10 observations; looking at first few observations

In [None]:
telecom_churn.tail(10) #looking at last few observations

In [None]:
#SUMMARY OF NUMERICAL COLUMNS
telecom_churn.describe() 

In [None]:
#Numerical Columns
num_vars = telecom_churn.columns[telecom_churn.dtypes != 'object']
#Non Numerical Columns
cat_vars = telecom_churn.columns[telecom_churn.dtypes == 'object']
print(num_vars)
print(cat_vars)

In [None]:
#CHECKING THE MISSING VALUE
telecom_churn[num_vars].isnull().sum().sort_values(ascending = False)/len(telecom_churn)

In [None]:
telecom_churn[cat_vars].isnull().sum().sort_values(ascending = False)/len(telecom_churn)

**NONE OF THE COLUMNS HAVE MISSING VALUES**

# **EXPLORING VARIABLES**

## **PLOTS**

In [None]:
print(telecom_churn['Churn_num'].value_counts())

In [None]:
sns.countplot(x="Churn_num", data = telecom_churn)
#sns.barplot(x = 'Churn')

**MOST CUSTOMERS DON'T SEEM TO LEAVE THE SERVICE PROVIDER**

In [None]:
print(telecom_churn['Intl_plan_num'].value_counts())

In [None]:
print(telecom_churn['Intl_plan_num'].value_counts())
sns.countplot(x="Intl_plan_num", order= telecom_churn['Intl_plan_num'].value_counts(ascending = False).index, data = telecom_churn)

In [None]:
print(telecom_churn['Vmail_plan_num'].value_counts())
sns.countplot(x="Vmail_plan_num", data = telecom_churn)

In [None]:
print(telecom_churn['Customer service calls'].value_counts())
sns.countplot(y="Customer service calls", data = telecom_churn)

In [None]:
sns.barplot(x = 'Churn_num', y = 'Customer service calls', data = telecom_churn) 

**ChURNED CUSTOMERS HAVE MADE MORE CALLS TO CUSTOMER SERVICE CENTERS**

In [None]:
sns.barplot(x = 'Churn_num', y = 'Intl_plan_num', data = telecom_churn)

In [None]:
sns.barplot(x = 'Churn_num', y = 'Vmail_plan_num', data = telecom_churn)

In [None]:
sns.barplot(x = 'Churn_num', y = 'Account length', data = telecom_churn)

## **DATA DISTRIBUTION - HISTOGRAMS**

In [None]:
telecom_churn.hist(figsize = (15,15))

## **EXPLORING VARIABLE - ACCOUNT LENGTH**


In [None]:
plt.boxplot(telecom_churn["Account length"])

In [None]:
account_percentile=telecom_churn['Account length'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(account_percentile,2)

In [None]:
account_percentile=telecom_churn['Account length'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(account_percentile,2)

In [None]:
telecom_churn['Account length'].describe()

**The data spread is uniform. Replacing the 100percentile with the median**

In [None]:
median_Acc_length=telecom_churn['Account length'].median()
telecom_churn['Account_length_new']=telecom_churn['Account length']
telecom_churn['Account_length_new'][telecom_churn['Account_length_new']>194] = median_Acc_length

In [None]:
telecom_churn.info()

In [None]:
telecom_churn.info()

## **EXPLORING VARIABLE - AREA CODES**

In [None]:
plt.boxplot(telecom_churn["Area code"])

In [None]:
plt.hist(telecom_churn["Area code"])

In [None]:
telecom_churn['Area code'].unique()

## **EXPLORING VARIABLE - NUMBER VMAIL MESSAGES**

In [None]:
plt.boxplot(telecom_churn["Number vmail messages"])

In [None]:
vmail_percentile=telecom_churn['Number vmail messages'].quantile([0.05, 0.1, 0.25, 0.5, 0.6,0.65,0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(vmail_percentile,2)

In [None]:
telecom_churn['Number vmail messages'].mean()

In [None]:
telecom_churn['Number vmail messages'].median()

In [None]:
mean_vmail=telecom_churn['Number vmail messages'].mean()
telecom_churn['number_vmail_new']=telecom_churn['Number vmail messages']
telecom_churn['number_vmail_new'][telecom_churn['number_vmail_new'] < 19 ] = mean_vmail

In [None]:
plt.boxplot(telecom_churn["number_vmail_new"])

In [None]:
telecom_churn.drop(['number_vmail_new'], axis =1)

In [None]:
telecom_churn.info()

## **EXPLORING VARIABLE - TOTAL DAY MINUTES**

In [None]:
plt.boxplot(telecom_churn["Total day minutes"])

In [None]:
day_minutes_percentile=telecom_churn['Total day minutes'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(day_minutes_percentile,2)

In [None]:
telecom_churn['Total day minutes'].mean()

In [None]:
telecom_churn['Total day minutes'].median()

Box plot is clearly visible, hence not treating outliers

## **EXPLORING VARIABLE - TOTAL DAY CALLS**

In [None]:
plt.boxplot(telecom_churn["Total day calls"])

In [None]:
day_calls_percentile=telecom_churn['Total day calls'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(day_calls_percentile,2)

In [None]:
telecom_churn['Total day calls'].mean()

In [None]:
telecom_churn['Total day calls'].median()

**box plot is clearly visible, hence not treating outliers**

## **EXPLORING VARIABLE - TOTAL DAY CHARGE**

In [None]:
plt.boxplot(telecom_churn["Total day charge"])

In [None]:
day_charge_percentile=telecom_churn['Total day charge'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(day_charge_percentile,2)

In [None]:
telecom_churn['Total day charge'].mean()

In [None]:
telecom_churn['Total day charge'].median()

**Box plot is clear; not treating for outliers**

## **EXPLORING VARIABLE = TOTAL EVE MINUTES**

In [None]:
plt.boxplot(telecom_churn["Total eve minutes"])

In [None]:
eve_minutes_percentile=telecom_churn['Total eve minutes'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(eve_minutes_percentile,2)

In [None]:
telecom_churn['Total eve minutes'].mean()

In [None]:
telecom_churn['Total eve minutes'].median()

**not treating any outliers**

## **EXPLORING VARIABLE - TOTAL EVE CALLS**

In [None]:
plt.boxplot(telecom_churn["Total eve calls"])

In [None]:
eve_calls_percentile=telecom_churn['Total eve calls'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(eve_calls_percentile,2)

In [None]:
telecom_churn['Total eve calls'].mean()

In [None]:
telecom_churn['Total eve calls'].median()

In [None]:
median_eve_calls=telecom_churn['Total eve calls'].median()
telecom_churn['Total_eve_calls_new']=telecom_churn['Total eve calls']
telecom_churn['Total_eve_calls_new'][telecom_churn['Total_eve_calls_new'] > 156.34] = median_eve_calls

In [None]:
telecom_churn['Total_eve_calls_new'].describe()

## **EXPLORE VARIABLE - TOTAL EVE CHARGE**

In [None]:
plt.boxplot(telecom_churn["Total eve charge"])

In [None]:
eve_charge_percentile=telecom_churn['Total eve charge'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(eve_charge_percentile,2)

In [None]:
telecom_churn['Total eve charge'].mean()

In [None]:
telecom_churn['Total eve charge'].median()

**data is evenly spread out and box plot is also clear, not treating for outliers**

## **EXPLORE VARIABLE - TOTAL NIGHT MINUTES**

In [None]:
plt.boxplot(telecom_churn["Total night minutes"])

In [None]:
night_min_percentile=telecom_churn['Total night minutes'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(night_min_percentile,2)

In [None]:
telecom_churn['Total night minutes'].mean()

In [None]:
telecom_churn['Total night minutes'].median()

**not treating for outliers, data is evenly spreadout. box plot is clear**

## **EXPLORING VARIABLE - TOTAL NIGHT CALLS**

In [None]:
plt.boxplot(telecom_churn["Total night calls"])

In [None]:
night_calls_percentile=telecom_churn['Total night calls'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(night_calls_percentile,2)

In [None]:
telecom_churn['Total night calls'].mean()

In [None]:
telecom_churn['Total night calls'].median()

**data is evenly spead out, box plot is clear, not treating outliers**

## **EXPLORING VARIABLE - TOTAL NIGHT CHARGE**

In [None]:
plt.boxplot(telecom_churn["Total night charge"])

In [None]:
night_charge_percentile=telecom_churn['Total night charge'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(night_charge_percentile,2)

In [None]:
telecom_churn['Total night charge'].mean()

In [None]:
telecom_churn['Total night charge'].median()

**data is spread out evenly, box plot is clear, not treating for outlier**

## **EXPLORING VARIABLE - TOTAL INTL MINUTES**

In [None]:
plt.boxplot(telecom_churn["Total intl minutes"])

In [None]:
intl_min_percentile=telecom_churn['Total intl minutes'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(intl_min_percentile,2)

In [None]:
telecom_churn['Total intl minutes'].mean()

In [None]:
telecom_churn['Total intl minutes'].median()

## **EXPLORING VARIABLE - TOTAL INTL CALLS**

In [None]:
plt.boxplot(telecom_churn["Total intl calls"])

In [None]:
intl_calls_percentile=telecom_churn['Total intl calls'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(intl_calls_percentile,2)

In [None]:
telecom_churn['Total intl calls'].mean()

In [None]:
telecom_churn['Total intl calls'].median()

In [None]:
#median_intl_calls=telecom_churn['Total intl calls'].median()
#telecom_churn['Total_intl_calls_new']=telecom_churn['Total intl calls']
#telecom_churn['Total_intl_calls_new'][telecom_churn['Total_intl_calls_new']> 13] = median_intl_calls

In [None]:
telecom_churn.info()

## **EXPLORING VARIBALES - CUSTOMER SERVICE CALLS**

In [None]:
plt.boxplot(telecom_churn["Customer service calls"])

In [None]:
cust_calls_percentile=telecom_churn['Customer service calls'].quantile([0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.991,0.992,0.993,0.994,0.995,0.996,0.997,0.998,0.999,1])
round(cust_calls_percentile,2)

In [None]:
telecom_churn['Customer service calls'].mean()

In [None]:
telecom_churn['Customer service calls'].median()

In [None]:
telecom_churn.info()

In [None]:
telecom_churn_new = telecom_churn.drop(columns = ['Churn','number_vmail_new','Total eve calls','Account length','International plan','Voice mail plan'], axis = 1)

# **LOGISTIC REGRESSION**

In [None]:
telecom_churn_new.info()

**TARGET VARIBALE Y = Churn_num**


In [None]:
from sklearn.linear_model import LogisticRegression
logistic1 = LogisticRegression()
logistic1.fit(telecom_churn_new[['Area code']+['Number vmail messages'] + ['Total day minutes'] + ['Total day calls'] + ['Total day charge']+['Total eve minutes'] + ['Total eve charge'] + ['Total night minutes'] + ['Total night calls'] + ['Total night charge']+['Total intl charge'] + ['Total intl calls'] + ['Total intl minutes'] + ['Customer service calls']+ ['Intl_plan_num'] + ['Vmail_plan_num'] +['Account_length_new'] + ['Total_eve_calls_new']],telecom_churn_new[['Churn_num']])
#logistic1.fit(telecom_churn_new[['Intl_plan_num']+['Vmail_plan_num']+ ['Account_length_new'] +['Total_eve_calls_new'] + ['Customer service calls']+ ['Total intl charge'] + ['Total intl calls'] + ['Total intl minutes'] + ['Total night charge'] + ['Total night calls'] + ['Total eve minutes'] +['Total day charge'] + ['Total day calls'] +['Total day minutes']],telecom_churn_new[['Churn_num']])

In [None]:
print("Intercept", logistic1.intercept_)
print("Coefficient", logistic1.coef_)

## **CONFUSION MATRIX & ACCURACY**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

In [None]:
predict1 = logistic1.predict(telecom_churn_new[['Area code']+['Number vmail messages'] + ['Total day minutes'] + ['Total day calls'] + ['Total day charge']+['Total eve minutes'] + ['Total eve charge'] + ['Total night minutes'] + ['Total night calls'] + ['Total night charge']+['Total intl charge'] + ['Total intl calls'] + ['Total intl minutes'] + ['Customer service calls']+ ['Intl_plan_num'] + ['Vmail_plan_num'] +['Account_length_new'] + ['Total_eve_calls_new']])
predict1
cm1 = confusion_matrix(telecom_churn_new[['Churn_num']],predict1)
print(cm1)

In [None]:
print("col sums", sum(cm1))
total1 = sum(sum(cm1))
print("Total", total1)

In [None]:
accuracy1 = (cm1[0,0]+cm1[1,1])/total1
accuracy1

## **MULTICOLLINEARITY**

In [None]:
#import statsmodels.formula.api as sm

#def vif_cal(input_data, dependent_col):
 #   x_vars=input_data.drop([dependent_col], axis=1)
  #  xvar_names=x_vars.columns
   # for i in range(0,xvar_names.shape[0]):
    #    y=x_vars[xvar_names[i]] 
     #   x=x_vars[xvar_names.drop(xvar_names[i])]
      #  rsq=sm.ols(formula="y~x", data=x_vars).fit().rsquared  
       # vif=round(1/(1-rsq),2)
        #print (xvar_names[i], " VIF = " , vif)

In [None]:
import statsmodels.formula.api as sm1
def vif_cals(x_vars):
    xvar_names=x_vars.columns
    for i in range(0,xvar_names.shape[0]):
        y=x_vars[xvar_names[i]] 
        x=x_vars[xvar_names.drop(xvar_names[i])]
        rsq=sm1.ols(formula="y~x", data=x_vars).fit().rsquared  
        vif=round(1/(1-rsq),2)
        print (xvar_names[i], " VIF = " , vif)

In [None]:
vif_cals(x_vars=telecom_churn_new.drop(["State", "Churn_num"], axis=1))

**Dropping variable one at a time that have the highest VIF or VIF > 5. An important point here is to have Industy knowledge on variables and which ones may be highly correlated, i.e., do not drop variables radomly**

In [None]:
vif_cals(x_vars=telecom_churn_new.drop(["State", "Churn_num", "Total day minutes"], axis=1))

In [None]:
vif_cals(x_vars=telecom_churn_new.drop(["State", "Churn_num", "Total day minutes","Total eve minutes"], axis=1))

In [None]:
vif_cals(x_vars=telecom_churn_new.drop(["State", "Churn_num", "Total day minutes","Total eve minutes","Total night minutes"], axis=1))

In [None]:
vif_cals(x_vars=telecom_churn_new.drop(["State", "Churn_num", "Total day minutes","Total eve minutes","Total night minutes","Total intl minutes"], axis=1))

In [None]:
vif_cals(x_vars=telecom_churn_new.drop(["State", "Churn_num", "Total day minutes","Total eve minutes","Total night minutes","Total intl minutes","Number vmail messages"], axis=1))

## **INDIVIDUAL IMPACT OF VARIABLES**

In [None]:
import statsmodels.api as sm1
m1=sm1.Logit(telecom_churn_new['Churn_num'],telecom_churn_new[['Area code'] + ['Total day calls'] + ['Total day charge'] + ['Total eve charge'] + ['Total night calls']+['Total night charge'] + ['Total intl calls'] +['Total intl charge'] + ['Customer service calls']+['Intl_plan_num'] + ['Vmail_plan_num'] + ['Account_length_new'] + ['Total_eve_calls_new']])
m1.fit()
print(m1.fit().summary())

**Output above:** 
1. Look at the Z-value: Thihs examples the null hypothesis - not impactful versus the alternative hypothesis, which is impactful.
2. Looking at the output, if the P>|Z| is  < 0.05, the variable is considered impactful. If the P>|Z| >= 0.05 it is not impactful
3. Looking at the summary output above, we can see Total day calls, total night charge, Account length new, and total night calls have P-value >0.05. We can safely drop these variables. 

In [None]:
import statsmodels.api as sm1
m2=sm1.Logit(telecom_churn_new['Churn_num'],telecom_churn_new[['Area code'] + ['Total day charge'] + ['Total eve charge'] + ['Total intl calls'] +['Total intl charge'] + ['Customer service calls']+['Intl_plan_num'] + ['Vmail_plan_num'] + ['Total_eve_calls_new']])
m2.fit()
print(m2.fit().summary())

## **CONFUSION MATRIX & ACCURACY**

# **MODEL VALIDATION**

## **SENSITIVITY AND SPECIFICITY**

In [None]:
import statsmodels.api as sm1
m2=sm1.Logit(telecom_churn_new['Churn_num'],telecom_churn_new[['Area code'] + ['Total day charge'] + ['Total eve charge'] + ['Total intl calls'] +['Total intl charge'] + ['Customer service calls']+['Intl_plan_num'] + ['Vmail_plan_num'] + ['Total_eve_calls_new']])
results = m2.fit()
print(results.summary())


#m2=sm1.Logit(telecom_churn['Churn_num'],telecom_churn[['Area code']+['Total day minutes']+['Total eve minutes'] + ['Total eve calls']+['Total intl charge']+['Customer service calls']+['Intl_plan_num']+['Vmail_plan_num']+['Total_intl_calls_new']])
#results = m2.fit()
#print(results.summary())

In [None]:
#CREATE THE CONFUSION MATRIX
#PREDICT THE VARIABLE
predictions = results.predict()
print(predictions[0:10])
len(predictions)

In [None]:
#converting predicted values into classes using threshold
threshold = 0.7
predicted_class1=[0 if x < threshold else 1 for x in predictions]
print(predicted_class1[0:10])

In [None]:
from sklearn.metrics import confusion_matrix

cm3 = confusion_matrix(telecom_churn_new["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm3)
total3 = sum(sum(cm3))
#From confusion matrix calcuate accuracy
accuracy1 = (cm3[0,0]+cm3[1,1])/total3
print('Accuracy', accuracy1)

sensitivity1 = cm3[0,0]/(cm3[0,0]+cm3[0,1])
print('Sensitivity : ', sensitivity1 )

specificity1 = cm3[1,1]/(cm3[1,0]+cm3[1,1])
print('Specificity : ', specificity1)

## **THRESHOLD**

In [None]:
#Threshold = 0.8
threshold = 0.8
predicted_class1=[0 if x < threshold else 1 for x in predictions]
#print(predicted_class1[0:10])
cm3 = confusion_matrix(telecom_churn_new["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm3)
total3 = sum(sum(cm3))
#From confusion matrix calcuate accuracy
accuracy1 = (cm3[0,0]+cm3[1,1])/total3
print('Accuracy', accuracy1)

sensitivity1 = cm3[0,0]/(cm3[0,0]+cm3[0,1])
print('Sensitivity : ', sensitivity1 )

specificity1 = cm3[1,1]/(cm3[1,0]+cm3[1,1])
print('Specificity : ', specificity1)

In [None]:
#Threshold = 0.2
threshold = 0.2
predicted_class1=[0 if x < threshold else 1 for x in predictions]
#print(predicted_class1[0:10])
cm3 = confusion_matrix(telecom_churn_new["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm3)
total3 = sum(sum(cm3))
#From confusion matrix calcuate accuracy
accuracy1 = (cm3[0,0]+cm3[1,1])/total3
print('Accuracy', accuracy1)

sensitivity1 = cm3[0,0]/(cm3[0,0]+cm3[0,1])
print('Sensitivity : ', sensitivity1 )

specificity1 = cm3[1,1]/(cm3[1,0]+cm3[1,1])
print('Specificity : ', specificity1)

In [None]:
#Threshold = 0.3
threshold = 0.3
predicted_class1=[0 if x < threshold else 1 for x in predictions]
#print(predicted_class1[0:10])
cm3 = confusion_matrix(telecom_churn_new["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm3)
total3 = sum(sum(cm3))
#From confusion matrix calcuate accuracy
accuracy1 = (cm3[0,0]+cm3[1,1])/total3
print('Accuracy', accuracy1)

sensitivity1 = cm3[0,0]/(cm3[0,0]+cm3[0,1])
print('Sensitivity : ', sensitivity1 )

specificity1 = cm3[1,1]/(cm3[1,0]+cm3[1,1])
print('Specificity : ', specificity1)

**By changing the threshold from to 0.2, we have further improved the specificty from 2.06% to 61.59%. However, the sensitivity has reduced from 99.8%  to 83.18%. if we further reduce the threshold, specificty will increase at the cost of sensitivity. we will be wrongly classifying customer who will churn vs. not churn.**

## **ROC**

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

actual = telecom_churn_new["Churn_num"]
False_positive_rate, True_positive_rate, thresholds = roc_curve(actual, predictions)
plt.figure(figsize=(10,10))
plt.title('ROC Curve', fontsize = 15)
plt.plot(False_positive_rate, True_positive_rate)
plt.plot([0,1],[0,1], 'r--')
plt.ylabel('True Positive Rate(Sensitivity)', fontsize = 15)
plt.xlabel('False Positive Rate(1-Sensitivity)', fontsize = 15)
plt.show()

## **AREA UNDER CURVE (AUC)**

In [None]:
roc_auc = auc(False_positive_rate, True_positive_rate)
roc_auc



```
# This is formatted as code
```

**The AUC is close to 1, i.e., 0.8, the model is considered to be good.**

## **PRECISION, RECALL, and F1 SCORE**

In [None]:
#F1 Score
threshold = 0.5
predicted_class1=[0 if x < threshold else 1 for x in predictions]
#print(predicted_class1[0:10])
cm3 = confusion_matrix(telecom_churn_new["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm3)
total3 = sum(sum(cm3))
#From confusion matrix calcuate accuracy
accuracy1 = (cm3[0,0]+cm3[1,1])/total3
print('Accuracy', accuracy1)

Precision_Class0 = cm3[0,0]/(cm3[0,0]+cm3[1,0])
print('Precision_Class0 : ', Precision_Class0 )

Recall_Class0 = cm3[0,0]/(cm3[0,0]+cm3[0,1])
print('Recall_Class0 : ', Recall_Class0 )

F1_Class0 = 2/((1/Precision_Class0)+(1/Recall_Class0))
print('F1_Class0 : ', F1_Class0 )

Precision_Class1 = cm3[1,1]/(cm3[0,1]+cm3[1,1])
print('Precision_Class1 : ', Precision_Class1 )

Recall_Class1 = cm3[1,1]/(cm3[1,0]+cm3[1,1])
print('Recall_Class1 : ', Recall_Class1 )

F1_Class1 = 2/((1/Precision_Class1)+(1/Recall_Class1))
print('F1_Class1 : ', F1_Class1 )

In [None]:
#F1 Score
threshold = 0.2
predicted_class1=[0 if x < threshold else 1 for x in predictions]
#print(predicted_class1[0:10])
cm3 = confusion_matrix(telecom_churn_new["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm3)
total3 = sum(sum(cm3))
#From confusion matrix calcuate accuracy
accuracy1 = (cm3[0,0]+cm3[1,1])/total3
print('Accuracy', accuracy1)

Precision_Class0 = cm3[0,0]/(cm3[0,0]+cm3[1,0])
print('Precision_Class0 : ', Precision_Class0 )

Recall_Class0 = cm3[0,0]/(cm3[0,0]+cm3[0,1])
print('Recall_Class0 : ', Recall_Class0 )

F1_Class0 = 2/((1/Precision_Class0)+(1/Recall_Class0))
print('F1_Class0 : ', F1_Class0 )

Precision_Class1 = cm3[1,1]/(cm3[0,1]+cm3[1,1])
print('Precision_Class1 : ', Precision_Class1 )

Recall_Class1 = cm3[1,1]/(cm3[1,0]+cm3[1,1])
print('Recall_Class1 : ', Recall_Class1 )

F1_Class1 = 2/((1/Precision_Class1)+(1/Recall_Class1))
print('F1_Class1 : ', F1_Class1 )

**Alternatively, we can use classification report to generate precision, Recall, F1-Score, and data (support)**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(telecom_churn_new["Churn_num"],predicted_class1))

**Precision** = Out of all the Customers that the model predicted would Churn, 38% actually did.

**Recall** = Out of all the Customers that actually Churned, the model only predicted this outcome correctly for 62% of those Customers.

**F1 Score** = Since this value isn’t very close to 1, it tells us that the model does a poor job of predicting whether or not Customers will Churn.

# **HANDLING UNBALANCED DATA**

In [None]:
telecom_churn_new.info()

In [None]:
print("Actual Data :", telecom_churn_new.shape)

#Frequency count on target column
freq=telecom_churn_new['Churn_num'].value_counts()
print(freq)
print((freq/freq.sum())*100)

#Classwise data
telecom_churn_new_class0 = telecom_churn_new[telecom_churn['Churn_num'] == 0]
telecom_churn_new_class1 = telecom_churn_new[telecom_churn['Churn_num'] == 1]

print("Class0 Actual :", telecom_churn_new_class0.shape)
print("Class1 Actual  :", telecom_churn_new_class1.shape)


**In the US market customer Churn rate in the telecom/wireless market is in the rage of 21% to 25%. Considering this we will boost the sample size of Class 1**

In [None]:
##Undersampling of class-0
## Consider half of class-0
telecom_churn_new_class0_under = telecom_churn_new_class0.sample(int(0.4*len(telecom_churn_new_class0)))
print("Class0 Undersample :", telecom_churn_new_class0_under.shape)

##Oversampling of Class-1 
# Lets increase the size by two times
telecom_churn_new_class1_over = telecom_churn_new_class1.sample(2*len(telecom_churn_new_class1),replace=True)
print("Class1 Oversample :", telecom_churn_new_class1_over.shape)

#Concatenate to create the final balanced data
telecom_churn_new_balanced=pd.concat([telecom_churn_new_class0_under,telecom_churn_new_class1_over])
print("Final Balannced Data :", telecom_churn_new_balanced.shape)

#Frequency count on target column in the balanced data
freq=telecom_churn_new_balanced['Churn_num'].value_counts()
print(freq)
print((freq/freq.sum())*100)

# **MODEL WITH BALANCED DATA**

In [None]:
import statsmodels.api as sm1
m3=sm1.Logit(telecom_churn_new_balanced['Churn_num'],telecom_churn_new_balanced[['Area code'] + ['Total day charge'] + ['Total eve charge'] + ['Total intl calls'] +['Total intl charge'] + ['Customer service calls']+['Intl_plan_num'] + ['Vmail_plan_num'] + ['Total_eve_calls_new']])
results = m3.fit()
print(results.summary())

**Droping variable total eve charge and Total Intl charge since its P>|Z| is greater than 0.05**

In [None]:
import statsmodels.api as sm1
m3=sm1.Logit(telecom_churn_new_balanced['Churn_num'],telecom_churn_new_balanced[['Area code'] + ['Total day charge'] + ['Total intl calls'] + ['Customer service calls']+['Intl_plan_num'] + ['Vmail_plan_num'] + ['Total_eve_calls_new']])
results = m3.fit()
print(results.summary())

## **UPDATED SENSITIVITY AND SPECIFICITY**

In [None]:
#CREATE THE CONFUSION MATRIX
#PREDICT THE VARIABLE
predictions = results.predict()
print(predictions[0:10])
len(predictions)

In [None]:
#converting predicted values into classes using threshold
threshold = 0.6
predicted_class1=[0 if x < threshold else 1 for x in predictions]
print(predicted_class1[0:10])


In [None]:
from sklearn.metrics import confusion_matrix

cm3 = confusion_matrix(telecom_churn_new_balanced["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm3)
total3 = sum(sum(cm3))
#From confusion matrix calcuate accuracy
accuracy1 = (cm3[0,0]+cm3[1,1])/total3
print('Accuracy', accuracy1)

sensitivity1 = cm3[0,0]/(cm3[0,0]+cm3[0,1])
print('Sensitivity : ', sensitivity1 )

specificity1 = cm3[1,1]/(cm3[1,0]+cm3[1,1])
print('Specificity : ', specificity1)

By changing the threshold to 0.6, we have further improved the specificty to 54.3percent. However, the sensitivity to 86.05percent. if we further reduce the threshold, specificty will increase at the cost of sensitivity. we will be wrongly classifying customer who will churn vs. not churn.

## **UPDATED PRECISION, RECALL AND F1SCORE**

**We can use classification report to generate precision, Recall, F1-Score, and data (support)**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(telecom_churn_new_balanced["Churn_num"],predicted_class1))

**Precision** = Out of all the Customers that the model predicted would Churn, 77% actually did.

**Recall** = Out of all the Customers that actually Churned, the model only predicted this outcome correctly for 54% of those Customers.

**F1 Score** = The value is close to 1, it tells us that the model does a good job of predicting whether or not Customers will Churn.

In [None]:
telecom_churn_new.info()

# **FEATURE ENGINEERING**

**Working with orginial data set**

In [None]:
telecom_churn.info() #to identify data types

In [None]:
#PRINT COLUMN NAMES
telecom_churn.columns.values

In [None]:
telecom_churn_fe=telecom_churn.drop(columns = ['Churn','State','International plan', 'Voice mail plan'], axis = 1)

In [None]:
telecom_churn_fe.info()

In [None]:
telecom_churn_fe.columns.values

In [None]:
telecom_churn_fe=telecom_churn_fe[['Churn_num','Account length', 'Area code','Number vmail messages', 'Total day minutes','Total day calls', 'Total day charge', 'Total eve minutes','Total eve calls', 'Total eve charge', 'Total night minutes', 'Total night calls', 'Total night charge', 'Total intl minutes','Total intl calls', 'Total intl charge', 'Customer service calls','Intl_plan_num', 'Vmail_plan_num','Account_length_new', 'number_vmail_new', 'Total_eve_calls_new']]

In [None]:
telecom_churn_fe.info()

In [None]:
#Numerical Columns
num_vars = telecom_churn_fe.columns[telecom_churn_fe.dtypes != 'object']
#Non Numerical Columns
cat_vars = telecom_churn_fe.columns[telecom_churn_fe.dtypes == 'object']
print(num_vars)
print(cat_vars)

In [None]:
pred_cols = telecom_churn_fe.columns.values[1:]
print(pred_cols)

In [None]:
x = telecom_churn_fe[pred_cols]
y = telecom_churn_fe['Churn_num']

## **TRANSFORMATION**

### **TOTAL MINUTES**

In [None]:
telecom_churn_fe['Total_minutes']=telecom_churn_fe['Total day minutes']+telecom_churn_fe['Total day minutes']+telecom_churn_fe['Total night minutes']+telecom_churn_fe['Total intl minutes']

### **TOTAL CALLS**

In [None]:
telecom_churn_fe['Total_calls']=telecom_churn_fe['Total day calls']+telecom_churn_fe['Total eve calls']+telecom_churn_fe['Total night calls']+telecom_churn_fe['Total intl calls']

### **TOTAL CHARGE**

In [None]:
telecom_churn_fe['Total_charge']=telecom_churn_fe['Total day charge']+telecom_churn_fe['Total eve charge']+telecom_churn_fe['Total night charge']+telecom_churn_fe['Total intl charge']

## **MODEL-1 WITH TOTAL MINUTES, TOTAL CALLS, AND TOTAL CHARGE**

In [None]:
x = telecom_churn_fe[['Total_minutes','Total_calls','Total_charge','Account length', 'Area code','Number vmail messages','Customer service calls','Intl_plan_num', 'Vmail_plan_num','Account_length_new', 'number_vmail_new', 'Total_eve_calls_new']]
y=telecom_churn_fe['Churn_num']


In [None]:
from sklearn.linear_model import LogisticRegression
logistic5 = LogisticRegression()
logistic5.fit(telecom_churn_fe[['Total_minutes','Total_calls','Total_charge','Area code','Number vmail messages','Customer service calls','Intl_plan_num', 'Vmail_plan_num','Account_length_new', 'number_vmail_new', 'Total_eve_calls_new']],telecom_churn_fe[['Churn_num']])

In [None]:
print("Intercept", logistic5.intercept_)
print("Coefficient", logistic5.coef_)

### **CONFUSION MATRIX**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

predict5 = logistic5.predict(telecom_churn_fe[['Total_minutes','Total_calls','Total_charge','Area code','Number vmail messages','Customer service calls','Intl_plan_num', 'Vmail_plan_num','Account_length_new', 'number_vmail_new', 'Total_eve_calls_new']])
predict5
cm5 = confusion_matrix(telecom_churn_fe[['Churn_num']],predict5)
print(cm5)

In [None]:
print("col sums", sum(cm5))
total5 = sum(sum(cm5))
print("Total", total5)

In [None]:
accuracy5 = (cm5[0,0]+cm5[1,1])/total5
accuracy5

### **MULTICOLLINEARITY**

In [None]:
import statsmodels.formula.api as sm5
def vif_cals(x_vars):
    xvar_names=x_vars.columns
    for i in range(0,xvar_names.shape[0]):
        y=x_vars[xvar_names[i]] 
        x=x_vars[xvar_names.drop(xvar_names[i])]
        rsq=sm5.ols(formula="y~x", data=x_vars).fit().rsquared  
        vif=round(1/(1-rsq),2)
        print (xvar_names[i], " VIF = " , vif)

In [None]:
telecom_churn_fe.info()

In [None]:
vif_cals(x_vars=telecom_churn_fe.drop(["Churn_num"], axis = 1))

In [None]:
vif_cals(x_vars=telecom_churn_fe.drop(["Total day minutes","Total day minutes","Total day charge","Total eve minutes","Total eve calls","Total night minutes","Total night calls","Total intl minutes","Total intl calls","Total intl charge", "Churn_num"], axis=1))

In [None]:
vif_cals(x_vars=telecom_churn_fe.drop(["Total_minutes","Total day minutes","Total day minutes","Total day charge","Total eve minutes","Total eve calls","Total night minutes","Total night calls","Total intl minutes","Total intl calls","Total intl charge", "Churn_num"], axis=1))

In [None]:
vif_cals(x_vars=telecom_churn_fe.drop(["Number vmail messages","Total_minutes","Total day minutes","Total day minutes","Total day charge","Total eve minutes","Total eve calls","Total night minutes","Total night calls","Total intl minutes","Total intl calls","Total intl charge", "Churn_num"], axis=1))

In [None]:
vif_cals(x_vars=telecom_churn_fe.drop(["Account length", "Number vmail messages","Total_minutes","Total day minutes","Total day minutes","Total day charge","Total eve minutes","Total eve calls","Total night minutes","Total night calls","Total intl minutes","Total intl calls","Total intl charge", "Churn_num"], axis=1))

In [None]:
vif_cals(x_vars=telecom_churn_fe.drop(["Vmail_plan_num","Account length", "Number vmail messages","Total_minutes","Total day minutes","Total day minutes","Total day charge","Total eve minutes","Total eve calls","Total night minutes","Total night calls","Total intl minutes","Total intl calls","Total intl charge", "Churn_num"], axis=1))

### **INDIVIDUAL IMPACT OF VARIABLES**

In [None]:
import statsmodels.api as sm5
m5=sm5.Logit(telecom_churn_fe['Churn_num'],telecom_churn_fe[['Area code'] + ['Total day calls'] + ['Total eve charge'] + ['Total night charge'] + ['Customer service calls']+['Intl_plan_num'] + ['Account_length_new'] + ['number_vmail_new']+ ['Total_eve_calls_new']+['Total_calls']+['Total_charge']])
m5.fit()
print(m5.fit().summary())

**Output above:** 
1. Look at the Z-value: Thihs examples the null hypothesis - not impactful versus the alternative hypothesis, which is impactful.
2. Looking at the output, if the P>|Z| is  < 0.05, the variable is considered impactful. If the P>|Z| >= 0.05 it is not impactful
3. Looking at the summary output above, we can see Total day calls, total eve charge, Account length new, Total_eve_calls_new have P-value >0.05. We can safely drop these variables.

In [None]:
import statsmodels.api as sm5
m5=sm5.Logit(telecom_churn_fe['Churn_num'],telecom_churn_fe[['Area code'] + ['Total night charge'] + ['Customer service calls']+['Intl_plan_num'] + ['number_vmail_new']+ ['Total_calls']+['Total_charge']])
m5.fit()
print(m5.fit().summary())

In [None]:
import statsmodels.api as sm5
m5=sm5.Logit(telecom_churn_fe['Churn_num'],telecom_churn_fe[['Area code'] + ['Customer service calls']+['Intl_plan_num'] + ['number_vmail_new']+ ['Total_calls']+['Total_charge']])
m5.fit()
print(m5.fit().summary())

### **CONFUSION MATRIX**

In [None]:
from sklearn.linear_model import LogisticRegression
logistic6 = LogisticRegression()
logistic6.fit(telecom_churn_fe[['Area code']+['Customer service calls']+['Intl_plan_num']+['number_vmail_new']+['Total_calls']+['Total_charge']], telecom_churn_fe[['Churn_num']])

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

predict6 = logistic6.predict(telecom_churn_fe[['Area code']+['Customer service calls']+['Intl_plan_num']+['number_vmail_new']+['Total_calls']+['Total_charge']])
predict6
cm6 = confusion_matrix(telecom_churn_fe[['Churn_num']],predict6)
print(cm6)

In [None]:
print("col sums", sum(cm6))
total6 = sum(sum(cm6))
print("Total", total6)


In [None]:

accuracy6 = (cm6[0,0]+cm6[1,1])/total6
accuracy6

**The accuracy has improved from 85.8% to 86.2%**

## **MODEL VALIDATION**

### **SENSITIVITY AND SPECIFICVITY**

In [None]:
import statsmodels.api as sm5
m5=sm5.Logit(telecom_churn_fe['Churn_num'],telecom_churn_fe[['Area code'] + ['Customer service calls']+['Intl_plan_num'] + ['number_vmail_new']+ ['Total_calls']+['Total_charge']])
results=m5.fit()
print(results.summary())

In [None]:
#Predict the variable
predictions=results.predict()
print(predictions[0:10])
len(predictions)

In [None]:
#converting predicted values into classes using threshold
threshold=0.8
predicted_class1=[ 0 if x < threshold else 1 for x in predictions]
print(predicted_class1[0:10])

In [None]:
from sklearn.metrics import confusion_matrix

cm7 = confusion_matrix(telecom_churn_fe["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm7)
total7 = sum(sum(cm7))
#From confusion matrix calcuate accuracy
accuracy7 = (cm7[0,0]+cm7[1,1])/total7
print('Accuracy', accuracy7)

sensitivity7 = cm7[0,0]/(cm7[0,0]+cm7[0,1])
print('Sensitivity : ', sensitivity7 )

specificity7 = cm7[1,1]/(cm7[1,0]+cm7[1,1])
print('Specificity : ', specificity7)

### **THRESHOLDS**

In [None]:
#Threshold = 0.2
threshold = 0.2
predicted_class1=[0 if x < threshold else 1 for x in predictions]
#print(predicted_class1[0:10])
cm7 = confusion_matrix(telecom_churn_fe["Churn_num"], predicted_class1)
print('Confusion Matrix : \n', cm7)
total7 = sum(sum(cm7))
#From confusion matrix calcuate accuracy
accuracy7 = (cm7[0,0]+cm7[1,1])/total7
print('Accuracy', accuracy1)

sensitivity7 = cm7[0,0]/(cm7[0,0]+cm7[0,1])
print('Sensitivity : ', sensitivity7 )

specificity7 = cm7[1,1]/(cm7[1,0]+cm7[1,1])
print('Specificity : ', specificity7)

By changing the threshold to 0.2, we have further improved the specificty from 1.5% to 59%. However, the sensitivity has reduced from 99.8% to 83.53%. if we further reduce the threshold, specificty will increase at the cost of sensitivity. we will be wrongly classifying customer who will churn vs. 
not churn **bold text**

### **ROC**

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

actual = telecom_churn_fe["Churn_num"]
False_positive_rate, True_positive_rate, thresholds = roc_curve(actual, predictions)
plt.figure(figsize=(10,10))
plt.title('ROC Curve', fontsize = 15)
plt.plot(False_positive_rate, True_positive_rate)
plt.plot([0,1],[0,1], 'r--')
plt.ylabel('True Positive Rate(Sensitivity)', fontsize = 15)
plt.xlabel('False Positive Rate(1-Sensitivity)', fontsize = 15)
plt.show()

### **AUC**

In [None]:
roc_auc = auc(False_positive_rate, True_positive_rate)
roc_auc


**The AUC is close to 1, i.e., 0.8, the model is considered to be good.**

### **PRECISION, RECALL, and F1 SCORE**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(telecom_churn_fe["Churn_num"],predicted_class1))

**Precision** = Out of all the Customers that the model predicted would Churn, 38% actually did.

**Recall** = Out of all the Customers that actually Churned, the model only predicted this outcome correctly for 59% of those Customers.

**F1 Score** = Since this value isn’t very close to 1, it tells us that the model does a poor job of predicting whether or not Customers will Churn.

# **CONCLUSION**

**1. Original Dataset**

*   Accuracy 0.7219917012448133
*   Sensitivity :  0.8562019758507134
*  Specificity :  0.5644329896907216

**2. New Dataset with new calculated Features like Total Calls, Total Minutes, and Total Charge**
*   Accuracy 0.7219917012448133
*   Sensitivity :  0.8353819139596137
*   Specificity :  0.5902061855670103

**ADDING NEW FEATURES LIKE TOTAL CALLS, TOTAL MINUTES, AND TOTAL CHARGE HASN'T IMPROVED THE MODEL ACCURACY COMPARED TO THE MODEL WITH ORGINAL DATA SET. HOWEVER WITH FEWER FEATURES WE ARE ABLE TO EXPLAIN THE MODEL**

# **RANDOM FORREST**

In [None]:
#Shape of the dataset
telecom_churn.shape

In [None]:
#Column names
telecom_churn.columns

In [None]:
#Column types
telecom_churn.info()

**There are 2666 records in the dataset. There are 26 variables in the data. The target varibale name is Churn_num; other variables are representing the data collected about each customer. There are numerical and string columns. We will work on numerical columns. Next step is to conduct basic data exploration on the predictor and target variables.**

In [None]:
#Summary of all numerical columns.
all_cols_summary = telecom_churn.describe()
print(all_cols_summary, 2)

In [None]:
#Target Variable
print(telecom_churn['Churn_num'].value_counts())

## **MODEL BUILDING AND VALIDATION**

**Creating a new dataset**

In [None]:
import pandas as pd
import sklearn as sk
import numpy as np
import scipy as sp
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [None]:
#Droping certain variables
telecom_churn_rf = telecom_churn.drop(columns = ['Churn', 'State', 'International plan', 'Voice mail plan', 'Account length', 'Number vmail messages', 'Total eve calls'], axis = 1)

In [None]:
telecom_churn_rf.shape

In [None]:
telecom_churn_rf.info()

In [None]:
telecom_churn_rf.columns

In [None]:
telecom_churn_rf=telecom_churn_rf[['Churn_num','Area code', 'Total day minutes', 'Total day calls', 'Total day charge', 'Total eve minutes', 'Total eve charge', 'Total night minutes', 'Total night calls', 'Total night charge', 'Total intl minutes', 'Total intl calls', 'Total intl charge', 'Customer service calls', 'Intl_plan_num', 'Vmail_plan_num', 'Account_length_new', 'number_vmail_new', 'Total_eve_calls_new']]

In [None]:
telecom_churn_rf.info()

In [None]:
#Defining train and test data
features=list(telecom_churn_rf.columns[1:18])
X=telecom_churn_rf[features]
y=telecom_churn_rf['Churn_num']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 55)
print("X_train shape ", X_train.shape)
print("y_train shape ", y_train.shape)
print("X_test shape ", X_test.shape)
print("y_test shape ", y_test.shape)

In [None]:
#Building Decision Tree on training data
telecom_clf = tree.DecisionTreeClassifier(max_depth=6)
telecom_clf.fit(X_train, y_train)


## **DECISION TREE RESULTS**

In [None]:
#Accuracy on train data
tree_predict8 = telecom_clf.predict(X_train)
cm8 = confusion_matrix(y_train, tree_predict8)
accuracy_train = (cm8[0,0]+cm8[1,1])/sum(sum(cm8))
print("Decision Tree Accuracy on Train data =  ", accuracy_train)

#Accuracy on test data
tree_predict9 = telecom_clf.predict(X_test)
cm9 = confusion_matrix(y_test, tree_predict9)
accuracy_test = (cm9[0,0]+cm9[1,1])/sum(sum(cm9))
print("Decision Tree Accuracy on Test data =  ", accuracy_test)

#AUC on Train data
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, tree_predict8)
auc_train = auc(false_positive_rate, true_positive_rate)
print("Decision Tree AUC on Train data =  ", auc_train)

#AUC on Test data
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, tree_predict9)
auc_test = auc(false_positive_rate, true_positive_rate)
print("Decision Tree AUC on Test data =  ", auc_test)


**The optimal depth for this data which max_depth = 6. The best decision tree gives us accuracy of 96% and AUC of 88% (closer to 1). Pelase note AUC is a better estimate. AUC is calcuated from the ROC Curve, which considers all threshold values. If the AUC value is approximately 1, the model is considered to be good. Also, the change of threshold the Accuracy can be changed.**

## **BUILDING RANDOM FOREST MODELS**

In [None]:
#BUILDING RANDOM FOREST MODEL
from sklearn.ensemble import RandomForestClassifier
telecom_forest = RandomForestClassifier(n_estimators= 350, max_features = 7, max_depth = 7) 
telecom_forest.fit(X_train, y_train)


**We are building 350 trees here. A higher number is prefered. The number of features is chosen based on Logistic Regression we notices 5-7 variables were explain the target variable. We fixed the max_depth at 10.**

In [None]:
#Accuracy on train data
forest_predict10 = telecom_forest.predict(X_train)
cm10 = confusion_matrix(y_train, forest_predict10)
accuracy_train = (cm10[0,0]+cm10[1,1])/sum(sum(cm10))
print("Random Forest Accuracy on Train data =  ", accuracy_train)

#Accuracy on test data
forest_predict11 = telecom_forest.predict(X_test)
cm11 = confusion_matrix(y_test, forest_predict11)
accuracy_test = (cm11[0,0]+cm11[1,1])/sum(sum(cm11))
print("Random Forest Accuracy on Test data =  ", accuracy_test)

#AUC on Train data
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, forest_predict10)
auc_train = auc(false_positive_rate, true_positive_rate)
print("Random Forest AUC on Train data =  ", auc_train)

#AUC on Test data
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, forest_predict11)
auc_test = auc(false_positive_rate, true_positive_rate)
print("Random Forest AUC on Test data =  ", auc_test)

**We can the Random Forest gives us 97% accuracy and AUC of 91% on the train data. And it give us accuracy of 94% and AUC of 84.94% on the test data. Comared to a singel decision Tree aobove the Random Forest gives 3% improvement on the test data.** 

# **BOOSTING**

## **MODEL BUILDING AND VALIDATION**

In [None]:
import pandas as pd
import sklearn as sk
import numpy as np
import scipy as sp
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
import time
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
telecom_churn_gbm = telecom_churn.drop(columns = ['Churn', 'State', 'International plan', 'Voice mail plan', 'Account length', 'Number vmail messages', 'Total eve calls'], axis = 1)

In [None]:
telecom_churn_gbm=telecom_churn_rf[['Churn_num','Area code', 'Total day minutes', 'Total day calls', 'Total day charge', 'Total eve minutes', 'Total eve charge', 'Total night minutes', 'Total night calls', 'Total night charge', 'Total intl minutes', 'Total intl calls', 'Total intl charge', 'Customer service calls', 'Intl_plan_num', 'Vmail_plan_num', 'Account_length_new', 'number_vmail_new', 'Total_eve_calls_new']]

In [None]:
#Defining train and test data
telecom_gbm_features=list(telecom_churn_gbm.columns[1:18])
X=telecom_churn_gbm[telecom_gbm_features]
y=telecom_churn_gbm['Churn_num']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 55)
print("X_train shape ", X_train.shape)
print("y_train shape ", y_train.shape)
print("X_test shape ", X_test.shape)
print("y_test shape ", y_test.shape)

## **DECISION TREE RESULTS**

In [None]:
#Building Decision Tree on training data
from sklearn import tree
telecom_tree = tree.DecisionTreeClassifier(max_depth=6)
telecom_tree.fit(X_train,y_train)

In [None]:
#Accuracy on train data
print("Decision Tree Results \n")
print("Accuracy on train data" , telecom_tree.score(X_train, y_train))
print("Accuracy on test data" , telecom_tree.score(X_test, y_test))

## **GRADIENT BOOSTING MODEL**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
telecom_gbm_model = GradientBoostingClassifier(n_estimators=100,learning_rate=0.01, verbose=1, max_depth=4) 
##fitting the gradient boost classifier
telecom_gbm_model.fit(X_train,y_train)

In [None]:
#Validation on the train and test data
#Confusion Matrix
predictions12 = telecom_gbm_model.predict(X_train)
acutals12 = y_train
cm12 = confusion_matrix(acutals12, predictions12)
print("Confusion Matrix on Train data", cm12)
accuracy12 = (cm12[0,0] + cm12[1,1])/(sum(sum(cm12)))
print("Decision Tree Accuracy on Train data =  ", accuracy12)

predictions13 = telecom_gbm_model.predict(X_test)
acutals13 = y_test
cm13 = confusion_matrix(acutals13, predictions13)
print("Confusion Matrix on Test data", cm13)
accuracy13 = (cm13[0,0] + cm13[1,1])/(sum(sum(cm13)))
print("Decision Tree Accuracy on Test data =  ", accuracy13)



In [None]:
#loop for different iterations
for i in range(5,500, 50):
    telecom_gbm_model = GradientBoostingClassifier(n_estimators= i, learning_rate=0.01, verbose=1, max_depth = 4)
    telecom_gbm_model.fit(X_train,y_train)
    print("N estimators = ", i)
    #Train data
    predictions = telecom_gbm_model.predict(X_train)
    actuals=y_train
    cm=confusion_matrix(actuals, predictions)
    accuracy=(cm[0,0]+cm[1,1])/(sum(sum(cm))) 
    print("Train accuracy", accuracy)
    #Test data
    predictions = telecom_gbm_model.predict(X_test)
    actuals=y_test
    cm=confusion_matrix(actuals, predictions)
    accuracy=(cm[0,0]+cm[1,1])/(sum(sum(cm))) 
    print("Test accuracy", accuracy)


**From the test results, we can see that the accuracy of the test data gets saturated at the 94% to 95% from 255 range/iterations. We can finalize the learning rate as 0.01 and number of iterations as 255.**