## State a hypothesis

In [1]:
import pandas as pd

data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')



## EDA

1. Covariance matrix
2. Correlation matrix
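
A minimal sketch of the two matrices, assuming a small made-up frame (`toy`) in place of the Telco data. The column names mirror numeric columns of the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric Telco columns.
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 48, 72],
    "MonthlyCharges": [30.0, 55.5, 70.0, 85.0, 110.0],
    "SeniorCitizen": [0, 0, 1, 0, 1],
})

# Covariance depends on each column's units; correlation is covariance
# rescaled to the range [-1, 1], so it is easier to compare across pairs.
cov_matrix = toy.cov()
corr_matrix = toy.corr()  # Pearson correlation by default

print(cov_matrix.round(2))
print(corr_matrix.round(2))
```

On the real data the same two calls would need numeric columns only, for example after the categorical features have been encoded.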

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


#### Analysis of numeric features

In [3]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |


#### Convert `TotalCharges` column from string to float

In [4]:
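# Count whitespace-separated tokens per value; a blank string yields 0 tokens.
l1 = [len(i.split()) for i in data['TotalCharges']]
l2 = [i for i in range(len(l1)) if l1[i] != 1]
print('Index Positions with empty spaces : ',*l2)

# Fill each blank with the previous row's value, then convert the column to float.
# Note: this assumes the first row (index 0) is never blank.
for i in l2:
    data.loc[i,'TotalCharges'] = data.loc[(i-1),'TotalCharges']

data['TotalCharges'] = data['TotalCharges'].astype(float)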

Index Positions with empty spaces :  488 753 936 1082 1340 3331 3826 4380 5218 6670 6754
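
As an alternative to the loop, `pd.to_numeric` with `errors="coerce"` can find and convert the blanks in one step. The sketch below uses a made-up series (`total_charges`) rather than the real column, and the median fill is only one possible choice, not the one used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for data['TotalCharges']: numbers stored as strings,
# with one blank entry like those at the index positions printed above.
total_charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Whitespace-only strings cannot be parsed as numbers; errors="coerce"
# turns them into NaN instead of raising an exception.
as_float = pd.to_numeric(total_charges, errors="coerce")

blank_positions = as_float[as_float.isna()].index.tolist()
print("Index positions with empty spaces:", blank_positions)

# One possible fill (an assumption, not the notebook's choice): the median.
filled = as_float.fillna(as_float.median())
print(filled.tolist())
```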


#### Drop the `customerID` column

In [5]:
data.drop(columns = ['customerID'], inplace = True)

#### Encode the categorical features

In [6]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

data_copy = data.copy(deep=True)

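# Categorical columns are those that describe() skips, i.e. the non-numeric ones.
categorical_feats = [col for col in list(data.columns) if col not in list(data.describe().columns)]

for col in categorical_feats:
    # Refit the same encoder on each column; classes get integer codes in sorted order.
    data_copy[col] = le.fit_transform(data_copy[col])

    # Print the code-to-label mapping while the encoder still holds this column's classes.
    print(col,' : ',data_copy[col].unique(),' = ',le.inverse_transform(data_copy[col].unique()))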


gender  :  [0 1]  =  ['Female' 'Male']
Partner  :  [1 0]  =  ['Yes' 'No']
Dependents  :  [0 1]  =  ['No' 'Yes']
PhoneService  :  [0 1]  =  ['No' 'Yes']
MultipleLines  :  [1 0 2]  =  ['No phone service' 'No' 'Yes']
InternetService  :  [0 1 2]  =  ['DSL' 'Fiber optic' 'No']
OnlineSecurity  :  [0 2 1]  =  ['No' 'Yes' 'No internet service']
OnlineBackup  :  [2 0 1]  =  ['Yes' 'No' 'No internet service']
DeviceProtection  :  [0 2 1]  =  ['No' 'Yes' 'No internet service']
TechSupport  :  [0 2 1]  =  ['No' 'Yes' 'No internet service']
StreamingTV  :  [0 2 1]  =  ['No' 'Yes' 'No internet service']
StreamingMovies  :  [0 2 1]  =  ['No' 'Yes' 'No internet service']
Contract  :  [0 1 2]  =  ['Month-to-month' 'One year' 'Two year']
PaperlessBilling  :  [1 0]  =  ['Yes' 'No']
PaymentMethod  :  [2 3 0 1]  =  ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
Churn  :  [0 1]  =  ['No' 'Yes']
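
Because the single `le` object is refitted on every column, after the loop it only remembers the classes of the last column. A minimal sketch of one way around this, assuming a hypothetical `encoders` dictionary that keeps one fitted encoder per column (the two-column `sample` frame is made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample using category values seen in the output above.
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "Churn": ["Yes", "No", "No", "Yes"],
})

encoders = {}            # column name -> fitted LabelEncoder
encoded = sample.copy()

for col in sample.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(sample[col])

# Any column can still be decoded after the loop has finished.
decoded_contract = encoders["Contract"].inverse_transform(encoded["Contract"])
print(list(decoded_contract))
```

Keeping the encoders around also makes it possible to apply the same mapping to new data with `transform` instead of refitting.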


In [7]:
data_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   int64  
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   int64  
 3   Dependents        7043 non-null   int64  
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   int64  
 6   MultipleLines     7043 non-null   int64  
 7   InternetService   7043 non-null   int64  
 8   OnlineSecurity    7043 non-null   int64  
 9   OnlineBackup      7043 non-null   int64  
 10  DeviceProtection  7043 non-null   int64  
 11  TechSupport       7043 non-null   int64  
 12  StreamingTV       7043 non-null   int64  
 13  StreamingMovies   7043 non-null   int64  
 14  Contract          7043 non-null   int64  
 15  PaperlessBilling  7043 non-null   int64  
 16  PaymentMethod     7043 non-null   int64  


In [18]:
print(data['TechSupport'].unique())
print(data_copy['TechSupport'].unique())

print(data['OnlineSecurity'].unique())
print(data_copy['OnlineSecurity'].unique())

print(data['Contract'].unique())
print(data_copy['Contract'].unique())

['No' 'Yes' 'No internet service']
[0 2 1]
['No' 'Yes' 'No internet service']
[0 2 1]
['Month-to-month' 'One year' 'Two year']
[0 1 2]


### Target variable visualisation

- There are fewer churned customers than not-churned customers.
- The dataset is imbalanced, with a not-churn : churn ratio of roughly 3:1.
- As a result, predictions are likely to be biased towards the not-churn class.
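
The class ratio can be checked directly with `value_counts`. This is a sketch on made-up labels (`churn`) that follow the roughly 3:1 split described above, not the real column.

```python
import pandas as pd

# Hypothetical labels with a 3:1 not-churn : churn split.
churn = pd.Series(["No"] * 75 + ["Yes"] * 25, name="Churn")

counts = churn.value_counts()                 # absolute counts per class
shares = churn.value_counts(normalize=True)   # fraction of rows per class
ratio = counts["No"] / counts["Yes"]

print(counts.to_dict())
print(shares.round(2).to_dict())
print(f"not-churn : churn = {ratio:.1f} : 1")
```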

### Divide categorical features into groups based on their names

In [19]:
user_feats = ['gender','SeniorCitizen','Partner','Dependents'] # Customer Information
services_feats = ['PhoneService','MultipleLines','InternetService','StreamingTV','StreamingMovies',
      'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport'] # Services Signed Up for!
payment_feats = ['Contract','PaperlessBilling','PaymentMethod'] # Payment Information

### Compare each categorical feat with the target variable



#### User features

- Churn is very similar for male and female customers.
- The number of SeniorCitizen customers is low (1142), and about 42% of them churned (476 of 1142).
- Customers living with a Partner churned less than those not living with a Partner.
- Similarly, churn is higher among customers who have no Dependents.
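
Per-group churn rates like these can be read off a row-normalised cross-tabulation. The sketch below uses a made-up frame (`toy`); the last two lines simply recompute the SeniorCitizen share from the two counts quoted above.

```python
import pandas as pd

# Hypothetical rows; the real notebook would use `data` instead.
toy = pd.DataFrame({
    "SeniorCitizen": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "No"],
})

# normalize="index" makes each row sum to 1, so the "Yes" column is the
# churn rate within that group.
churn_rate = pd.crosstab(toy["SeniorCitizen"], toy["Churn"], normalize="index")
print(churn_rate.round(2))

# The SeniorCitizen figure quoted above, computed from the two counts.
senior_churn_share = 476 / 1142
print(f"{senior_churn_share:.1%}")
```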

#### Services features


- For PhoneService, more customers were retained than churned, even among those with no phone service.

- For MultipleLines, the churn rate is about the same whether or not the customer has multiple lines.

- A high number of customers appear resistant to Fiber optic as their InternetService. By contrast, the graph suggests that customers prefer DSL.

- StreamingTV and StreamingMovies display near-identical graphs. Many customers churned whether or not they subscribed to StreamingTV or StreamingMovies, so the streaming content does not appear to be entirely at fault.

- The visualizations suggest that OnlineSecurity, OnlineBackup, DeviceProtection and TechSupport are crucial when it comes to serving customers.

- A high number of customers switched service provider when service was poor on the features mentioned above.


#### Payment features

- Churn is quite high for customers on a Month-to-month Contract. This is probably because these customers are trying out the available services and, to save money, commit to only one month at a time.

- Another possible reason is that the overall experience across the internet, streaming and phone services was inconsistent. Every customer has different priorities, so if any one of the three was not up to par, the customer cancelled the entire service.

- PaperlessBilling shows a high number of churned customers, probably because of payment or receipt issues.

- Customers clearly disliked the Electronic check PaymentMethod. Of the 2365 customers who paid by Electronic check, 1071 (about 45%) left the service. The company needs to either drop the Electronic check method or make it hassle-free and user-friendly.
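
Counts and rates per payment method can be tabulated together with a grouped aggregation. This sketch assumes a made-up frame (`toy`) in which `Churn` is already encoded as 0/1, as in `data_copy`, while `PaymentMethod` is left as text for readability.

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration.
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check"] * 4 + ["Mailed check"] * 4,
    "Churn": [1, 1, 0, 0, 1, 0, 0, 0],
})

# With Churn as 0/1, sum counts churned customers and mean is the churn rate.
summary = (
    toy.groupby("PaymentMethod")["Churn"]
    .agg(churned="sum", customers="count", churn_rate="mean")
    .sort_values("churn_rate", ascending=False)
)
print(summary)
```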


## Data cleaning

- a. [DONE] Encode all categorical variables.
- b. [WIP] Find features that are highly correlated with each other by plotting the correlation and covariance matrices. Within each such group, retain only the feature with the highest correlation with the target variable `Churn` and remove the rest.
- c. [WIP] Find missing values. Although no NULLs are present, some values are blank strings; find them and implement a way to deal with them.
- d. [NOT STARTED] Find any outliers (this applies only to numeric columns) and remove them.
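
Step b could be sketched as follows. The helper name `prune_correlated`, the 0.8 threshold and the toy values are all assumptions made for illustration; this is one simple pairwise rule, not a settled design for this notebook.

```python
import pandas as pd

def prune_correlated(df, target, threshold=0.8):
    """Return the feature names kept after pruning highly correlated pairs.

    For each feature pair whose absolute correlation exceeds `threshold`,
    the feature that is less correlated with `target` is dropped.
    """
    corr = df.corr().abs()
    target_corr = corr[target].drop(target)
    features = list(target_corr.index)
    dropped = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    return [f for f in features if f not in dropped]

# Hypothetical data: TotalCharges grows almost linearly with tenure.
toy = pd.DataFrame({
    "tenure": [1, 5, 10, 20, 30, 40, 50, 60],
    "TotalCharges": [50, 260, 480, 1010, 1490, 2050, 2480, 3020],
    "MonthlyCharges": [70, 30, 90, 40, 80, 20, 100, 50],
    "Churn": [1, 1, 1, 0, 1, 0, 0, 0],
})

kept = prune_correlated(toy, target="Churn")
print(kept)
```

Only one of `tenure` and `TotalCharges` survives here, because the pair is almost perfectly correlated in the toy data.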

## Train, validation and test split

1. [WIP] Ensure that each class has roughly equal samples. If not, undersample the majority class.
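
A minimal sketch of undersampling followed by a stratified three-way split. The toy frame, the 60/20/20 proportions and the `random_state` values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced frame: 75 not-churn rows, 25 churn rows.
toy = pd.DataFrame({
    "tenure": range(100),
    "Churn": [0] * 75 + [1] * 25,
})

# Undersample: keep every minority row plus an equally sized random
# sample of the majority class.
minority = toy[toy["Churn"] == 1]
majority = toy[toy["Churn"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority])

X, y = balanced.drop(columns="Churn"), balanced["Churn"]

# 60/20/20 split (an assumed ratio); stratify keeps the class balance
# the same in every part.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))
print(y_train.value_counts().to_dict())
```

Some workflows undersample only the training portion so that the validation and test sets keep the original class ratio; the order used here simply follows the step as written above.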

## Feature engineering and reduction

1. **MAX 6 raw/derived features**

- a. Use the correlation and covariance matrices to identify which features to remove.
- b. If more than 6 features remain, research more advanced techniques.
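
One candidate for step b is univariate selection with scikit-learn's `SelectKBest`, scored by mutual information. This is only a sketch of one option among several: the data is synthetic, the feature names `f0` to `f7` are made up, and `k=6` comes from the limit stated above.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)

# Hypothetical data: 8 integer-coded features; only f0 and f1 drive the target.
X = pd.DataFrame(rng.integers(0, 3, size=(200, 8)),
                 columns=[f"f{i}" for i in range(8)])
y = ((X["f0"] + X["f1"]) >= 3).astype(int)

# discrete_features=True suits label-encoded columns; random_state is
# fixed so repeated runs give the same scores.
score_func = partial(mutual_info_classif, discrete_features=True, random_state=0)
selector = SelectKBest(score_func=score_func, k=6).fit(X, y)

selected = list(X.columns[selector.get_support()])
print(selected)
```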

## Model building

### Model 1

### Model 2

## Implement hyperparameter tuning

## Model evaluation metrics

**Success metrics:** model accuracy on the test dataset > 70%
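
For context on this threshold, it helps to know what a trivial baseline scores. The sketch below uses made-up test labels with the roughly 3:1 not-churn : churn ratio noted earlier and a "model" that always predicts not-churn; it assumes the test set keeps that ratio, which would not hold if the test set were undersampled to balance.

```python
import numpy as np

# Hypothetical test labels: 0 = not-churn, 1 = churn, in a 3:1 ratio.
y_test = np.array([0] * 75 + [1] * 25)

# Baseline that always predicts the majority class (not-churn).
y_pred = np.zeros_like(y_test)

accuracy = (y_pred == y_test).mean()
# Recall on churners: share of actual churners predicted as churn.
churn_recall = (y_pred[y_test == 1] == 1).mean()

print(f"accuracy = {accuracy:.0%}, recall on churners = {churn_recall:.0%}")
```

Comparing a trained model against this kind of baseline, and looking at recall on the churn class alongside accuracy, shows whether the model has learned anything beyond the class ratio.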