# Abstract 
**Regression**

This is a customer churn dataset. Its features range from customer demographics and account information to the services each customer has subscribed to.<br>
This work is my attempt to predict the **Total Charges** billed to each customer from the subscribed services and the other factors in the dataset.<br>
Linear Regression, Knn and Decision Tree algorithms are used to predict the Total Charges. The models' R² scores are compared and the best is chosen.



## Importing all the required packages

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor as Knn
from sklearn.metrics import mean_squared_error as mse
from sklearn.tree import DecisionTreeRegressor
import math as m

## Loading the dataset

In [None]:
data = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
data.head()


## Exploratory Data Analysis

In [None]:
data.shape

In [None]:
data.isna().sum()
data.dtypes

In [None]:
data.drop(columns=['customerID','Churn'],axis=1,inplace=True)

In [None]:
# Map the 0/1 flag to 'No'/'Yes'; assigning back avoids a chained in-place replace on a column
data['SeniorCitizen'] = data['SeniorCitizen'].replace({0:'No',1:'Yes'})
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'],errors='coerce')
data.drop(index=data.loc[data['TotalCharges'].isna()].index,axis=0,inplace=True)
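
The coercion step above can be illustrated on a toy column. This is a minimal sketch with made-up values (the `raw` series is hypothetical, not taken from the dataset): strings that cannot be parsed as numbers become `NaN`, which is what lets the `isna()` filter find and drop those rows.

```python
import pandas as pd

# Hypothetical values: two valid charges and one blank string
raw = pd.Series(['29.85', ' ', '1889.5'])

# errors='coerce' turns unparseable entries into NaN instead of raising
parsed = pd.to_numeric(raw, errors='coerce')

# Count how many entries could not be parsed
n_missing = int(parsed.isna().sum())
print(parsed)
print('unparseable entries:', n_missing)
```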

In [None]:
data1=data.copy(deep=True)
data.head()

In [None]:
sc = data.groupby(['gender','SeniorCitizen']).mean(numeric_only=True)
sc.reset_index(inplace=True)
sc
#px.bar(sc,x='gender',y='TotalCharges',facet_col='SeniorCitizen',category_orders={'SeniorCitizen':['No','Yes']})

In [None]:
part = data.groupby('Partner').sum(numeric_only=True)
part.reset_index(inplace=True)
part
px.pie(names=part['Partner'],values=part['TotalCharges'])

In [None]:
dep = data.groupby('Dependents').sum(numeric_only=True)
dep.reset_index(inplace=True)
px.pie(names=dep['Dependents'],values=dep['TotalCharges'])

In [None]:
deppat = data.groupby(['Partner','Dependents']).mean(numeric_only=True)
deppat.reset_index(inplace=True)
px.bar(deppat,x='Partner',y='MonthlyCharges',facet_col='Dependents')


In [None]:
gen = data.groupby(['gender','SeniorCitizen','Partner','Dependents']).mean(numeric_only=True)
gen.reset_index(inplace=True)
px.bar(gen,y='TotalCharges',x='gender',facet_col='SeniorCitizen',facet_row='Partner',color='Dependents',barmode='group')

In [None]:
px.scatter(x=data['tenure'],y=data['MonthlyCharges'],labels={'x':'tenure','y':'MonthlyCharges'})

In [None]:
ps = data.groupby('PhoneService').sum(numeric_only=True)
ps.reset_index(inplace=True)
px.pie(names=ps['PhoneService'],values=ps['TotalCharges'])

In [None]:
ml = data.groupby('MultipleLines').sum(numeric_only=True)
ml.reset_index(inplace=True)
px.pie(names=ml['MultipleLines'],values=ml['TotalCharges'])

In [None]:
Is = data.groupby('InternetService').sum(numeric_only=True)
Is.reset_index(inplace=True)
px.pie(names=Is['InternetService'],values=Is['TotalCharges'])

In [None]:
npi = data.groupby(['PhoneService','InternetService']).sum(numeric_only=True)
#npi.get_group(('No','Fiber optic'))['Total Charges'].sum()
npi.reset_index(inplace=True)
npi

In [None]:
data['Contract'].value_counts()

In [None]:
cont = data.groupby('Contract').mean(numeric_only=True)
cont.reset_index(inplace=True)
px.scatter(data,x='TotalCharges',y='tenure',facet_col='Contract')

In [None]:
for x in data.columns:
    print(x ,' : ' ,data[x].unique())
    print(data[x].value_counts())
    print('\n')

In [None]:
data.shape

## Linear Regression

In [None]:
noweb = data[data['InternetService'] =='No']
nophone = data[data['MultipleLines'] =='No phone service']

data.drop(index=noweb.index,axis=0,inplace=True)
data.drop(index=nophone.index,axis=0,inplace=True)
data.drop(columns=['MonthlyCharges'],inplace=True,axis=1)

In [None]:
data.head()

In [None]:
data_mod = pd.get_dummies(data,columns=['InternetService','Contract','PaymentMethod'],drop_first=True)
#'Gender','Partner','Dependents','Phone Service','Multiple Lines','Internet Service','Online Security'
data_mod.replace(to_replace='No',value='0',inplace=True)
data_mod.replace(to_replace='Yes',value='1',inplace=True)
data_mod['gender'] = data_mod['gender'].replace({'Female':'1','Male':'0'})
data_mod.head()
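
The `drop_first=True` argument in the `get_dummies` call above drops one level per encoded column, so the remaining dummy columns are not perfectly collinear with the regression intercept. A small sketch on a hypothetical toy frame (the `toy` data is made up for illustration):

```python
import pandas as pd

# Hypothetical toy column with three contract types
toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year', 'One year']})

# Full encoding: one indicator column per level
full = pd.get_dummies(toy, columns=['Contract'])

# drop_first=True removes the first level (in sorted order), which becomes the baseline
reduced = pd.get_dummies(toy, columns=['Contract'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Rows where every remaining indicator is 0 belong to the dropped baseline level, here `Month-to-month`.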


In [None]:
#cor = 
data_mod.corr()
#plt.figure(figsize=(40,40))
#sns.heatmap(cor, annot=True)
#plt.rcParams['figure.figsize'] = [40, 40]
#plt.rcParams['figure.dpi'] = 100
#data_mod.corr()

In [None]:
data_mod.dropna(subset=['TotalCharges'],inplace=True)
x = data_mod.drop(columns='TotalCharges',axis=1)
y = data_mod['TotalCharges']

In [None]:
x.isna().sum()
y.isna().sum()

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=123)

In [None]:
model = sm.OLS(y_train.astype(float),sm.add_constant(x_train.astype(float))).fit()
print(model.summary()) 

## Backward Elimination to select significant variables

In [None]:
def backward_regression(X, y,
                           threshold_out = 0.05, 
                           verbose=True):
    """Backward elimination: refit OLS and drop the feature with the largest
    p-value until every remaining p-value is at or below threshold_out."""
    included=list(X.columns)
    while included:
        model = sm.OLS(y.astype(float), sm.add_constant(pd.DataFrame(X[included].astype(float)))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()
        if worst_pval <= threshold_out:
            break
        worst_feature = pvalues.idxmax()
        included.remove(worst_feature)
        if verbose:
            print('Drop {} with p-value {} '.format(worst_feature, worst_pval))
    return included

In [None]:
backward_regression(x,y)

In [None]:
data_mod.drop(columns=['PaymentMethod_Credit card (automatic)',
                       'Contract_One year',
                       'PaymentMethod_Electronic check',
                       'SeniorCitizen',
                       'Partner',
                       'PaperlessBilling',
                       'Dependents'],inplace=True)

In [None]:
x = data_mod.drop(columns='TotalCharges',axis=1)
y = data_mod['TotalCharges']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=123)

In [None]:
model = sm.OLS(y_train.astype(float),sm.add_constant(x_train.astype(float))).fit()
print(model.summary()) 

## Model accuracy


In [None]:
model1 = LinearRegression()
model1.fit(x_train,y_train)
predict = model1.predict(x_test)

print(r2_score(y_test,predict))
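
For reference, `r2_score` reports the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below recomputes it by hand on made-up numbers (these are hypothetical values, not results from this notebook):

```python
# Hypothetical targets and predictions, for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: how far the predictions are from the targets
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))

# Total sum of squares: how far the targets are from their own mean
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2_manual = 1 - ss_res / ss_tot
print(round(r2_manual, 3))
```

A value close to 1 means the predictions explain most of the variance in the target; a model that always predicts the mean scores 0.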



## Regression when there is no internet service

In [None]:
# noweb is a slice of data, so build new frames instead of dropping in place
noweb = noweb.drop(columns=noweb.columns[7:14])
noweb = noweb.drop(columns='MonthlyCharges')
noweb

In [None]:
# Note: gender is coded Male=1/Female=0 here, the reverse of the other sections
noweb = noweb.replace({'No':'0','Yes':'1','Male':'1','Female':'0'})
noweb_mod = pd.get_dummies(noweb,columns=['Contract','PaymentMethod'],drop_first=True)
noweb_mod

In [None]:
corr1 = noweb_mod.corr()
sns.heatmap(corr1,annot=True)

In [None]:
x9 = noweb_mod.drop(columns='TotalCharges')
y9 = noweb_mod['TotalCharges']

In [None]:
x_train5,x_test5,y_train5,y_test5=train_test_split(x9,y9,test_size=0.3,random_state=125)

In [None]:
model2 = sm.OLS(y_train5.astype(float),sm.add_constant(x_train5.astype(float))).fit()
print(model2.summary())

In [None]:
model5 = LinearRegression()
model5.fit(x_train5,y_train5)
predict = model5.predict(x_test5)

print(r2_score(y_test5,predict))

## Knn

In [None]:
data_mod1 = pd.get_dummies(data1,columns=['InternetService','Contract','PaymentMethod','MultipleLines','PhoneService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies'])
#'Gender','Partner','Dependents','Phone Service','Multiple Lines','Internet Service','Online Security'
data_mod1.replace(to_replace='No',value='0',inplace=True)
data_mod1.replace(to_replace='Yes',value='1',inplace=True)
data_mod1['gender'] = data_mod1['gender'].replace({'Female':'1','Male':'0'})
data_mod1.head()

In [None]:
x1 = data_mod1.drop(columns=['TotalCharges'],axis=1)
y1 = data_mod1['TotalCharges']

In [None]:
x_train1,x_test1,y_train1,y_test1 = train_test_split(x1,y1,test_size=0.3,random_state=122)
x_train1.shape
x_test1.shape
y_train1.shape
y_test1.shape

In [None]:
scaler = MinMaxScaler()
x_train_stand = scaler.fit_transform(x_train1)
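# Reuse the scaler fitted on the training split; refitting on the test split would scale it differently
x_test_stand = scaler.transform(x_test1)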
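
A scaler is normally fitted on the training split only and then reused, unchanged, on the test split, so that both splits share the same minimum and maximum. The sketch below uses made-up one-column arrays (hypothetical values, not the notebook's data) to show how the two approaches differ:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature splits
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

# Fit on train, then apply the same min/max to test
scaler_demo = MinMaxScaler().fit(train)
test_shared = scaler_demo.transform(test)

# Refitting on test uses the test split's own min/max instead
test_refit = MinMaxScaler().fit_transform(test)

print(test_shared.ravel())
print(test_refit.ravel())
```

With the shared scaler a test value can fall outside [0, 1], which is expected: it simply lies outside the range seen during training.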

In [None]:
r2 = []
# Try k = 1..26 and record the test-set R² for each value
for k in range(1,27):
    kNN = Knn(n_neighbors=k,p=2,metric='minkowski')
    kNN.fit(x_train_stand,y_train1)
    r2.append(kNN.score(x_test_stand,y_test1))

In [None]:

# Index the scores by k (starting at 1) so the x-axis shows the number of neighbours
values = pd.Series(r2,index=range(1,len(r2)+1))
print(values)
plt.plot(values.index,values)
plt.xticks(values.index)
plt.show()
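
Because the scores are collected in a plain list, position 0 holds the score for k = 1. A sketch of reading off the best k with that offset in mind, using hypothetical scores (not the values from this run):

```python
# Hypothetical R² scores for k = 1, 2, 3, 4 (illustration only)
demo_scores = [0.70, 0.78, 0.84, 0.81]

# List positions start at 0 while k starts at 1, so add 1 to the best position
best_pos = max(range(len(demo_scores)), key=lambda i: demo_scores[i])
best_k = best_pos + 1

print('best k:', best_k, 'score:', demo_scores[best_pos])
```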

In [None]:
kNN = Knn(n_neighbors=11,p=2)
kNN.fit(x_train_stand,y_train1)
predictKnn = kNN.predict(x_test_stand)

In [None]:
r2_score(y_test1,predictKnn)

## Decision Tree


In [None]:
data_mod2 = pd.get_dummies(data,columns=['InternetService','Contract','PaymentMethod','MultipleLines','PhoneService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies'])
data_mod2.replace(to_replace='No',value='0',inplace=True)
data_mod2.replace(to_replace='Yes',value='1',inplace=True)
data_mod2['gender'] = data_mod2['gender'].replace({'Female':'1','Male':'0'})
data_mod2.head()
#data_mod2.drop(columns=['Monthly Charges'],axis=1,inplace=True)

In [None]:
x3= data_mod2.drop(columns='TotalCharges',axis=1)
y3 = data_mod2['TotalCharges']

In [None]:
x_train3,x_test3,y_train3,y_test3 = train_test_split(x3,y3,test_size=0.3,random_state=121)

In [None]:
tree1 = DecisionTreeRegressor()
tree1.fit(x_train3,y_train3)
predict3 = tree1.predict(x_test3)
r2_score(y_test3,predict3)

In [None]:
# Take the feature names from x3 so they line up with tree1.feature_importances_
feat_imp = pd.DataFrame({'Feature':x3.columns,'Importance':tree1.feature_importances_})
feat_imp.sort_values(by='Importance',ascending=False,inplace=True)
px.bar(x=feat_imp.Feature,y=feat_imp.Importance,labels={'y':'Score'},title="Feature importance with Tenure included")
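
`feature_importances_` follows the column order of the matrix the tree was fitted on, so taking the names from that same matrix keeps labels and scores aligned. A minimal sketch on made-up data (the frame `X_demo` and its columns are hypothetical):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features: one that drives the target and one constant column
X_demo = pd.DataFrame({'signal': [0, 1, 2, 3, 4, 5, 6, 7],
                       'constant': [1, 1, 1, 1, 1, 1, 1, 1]})
y_demo = X_demo['signal'] * 10.0

tree_demo = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Pair each importance with the column it was computed for
importance_demo = pd.Series(tree_demo.feature_importances_,
                            index=X_demo.columns).sort_values(ascending=False)
print(importance_demo)
```

The importances are normalised to sum to 1, so they describe relative contribution within one fitted tree rather than an absolute effect size.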

In [None]:
data_mod3 = pd.get_dummies(data,columns=['InternetService','Contract','PaymentMethod','MultipleLines','PhoneService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies'])
data_mod3.replace(to_replace='No',value='0',inplace=True)
data_mod3.replace(to_replace='Yes',value='1',inplace=True)
data_mod3['gender'] = data_mod3['gender'].replace({'Female':'1','Male':'0'})
data_mod3.head()
data_mod3.drop(columns='tenure',axis=1,inplace=True)

In [None]:
x4= data_mod3.drop(columns='TotalCharges',axis=1)
y4 = data_mod3['TotalCharges']

In [None]:
x_train4,x_test4,y_train4,y_test4 = train_test_split(x4,y4,test_size=0.3,random_state=121)
x_train4.shape
x_test4.shape
y_train4.shape
y_test4.shape

In [None]:
tree2 = DecisionTreeRegressor()
tree2.fit(x_train4,y_train4)
predict4 = tree2.predict(x_test4)
r2_score(y_test4,predict4)

In [None]:
# Take the feature names from x4 so they line up with tree2.feature_importances_
feat_imp_no_tenure = pd.DataFrame({'Feature':x4.columns,'Importance':tree2.feature_importances_})
feat_imp_no_tenure.sort_values(by='Importance',ascending=False,inplace=True)
px.bar(x=feat_imp_no_tenure.Feature,y=feat_imp_no_tenure.Importance,labels={'y':'Score'},title="Feature importance without Tenure")

## Conclusion

We predicted the Total Charges from the given data. By R² score the decision tree ranks highest (**0.98**), followed closely by linear regression (**0.96**), with Knn at **0.86**.

**Decision Tree:**<br>
We have an R² score of **0.98** with this model.
We compute the feature importance with Tenure and Monthly Charges included, and again with them left out.

With Tenure and Monthly Charges included, they dominate the other features and strongly influence the output variable.

With Tenure and Monthly Charges removed, we can see which of the other features matter.
Customers on a month-to-month contract appear to contribute most to the total charges, and the next most important feature is choosing fiber optic for their internet connection.

**Knn:**<br>
With Knn we get a good prediction score at **k = 11**, with an R² of **0.86**. This model is not very interpretable.

**Linear Regression:**<br>
With Linear Regression we find that the fiber optic connection has the largest weight, followed by Streaming TV and Streaming Movies. 
Phone connection has the strongest negative relation with the output. 
The model gives an R² score of **0.96**, which is pretty good.

When I run the model on customers with no internet connection, multiple phone lines contribute the most to the total charges, followed by tenure and paying by check.

**Verdict:**<br>
Given the models' outputs and scores, I would prefer the Linear Regression model to predict the output because:
<br>
The model is highly interpretable.<br>
Changes in the inputs can be traced to changes in the output to a high degree.