# Mitigating Customer Attrition at TelCoCo

> ####  By Stephen Kipkurui 2022-03-13

## Project Goal


>### This project aims to statistically explore and identify the factors that contribute to the observed increase in customer churn at TelCoCo, with a main focus on answering the question '...why are customers parting ways with company services?' The findings are to be used in advising on possible measures to reduce the current customer loss rate.

## Project Description

> ### Until recently, business solutions were not solely based on information sourced from data. However, this pattern of thinking is rapidly changing, and more and more companies, including our competitors, are adapting to the idea of inferring meaning and decisions from data. It is therefore paramount that we also adapt to this change and apply statistics and computer science to understanding our customers' behavior; in this particular case, an increase in customer attrition observed by the sales department. Key features to be modeled in this project are:
 >- Relationships of churned customers among the various classifications of our key services
 >- Whether our pricing, the payment methods offered, and length of tenure form a pattern with attrition
 >- How other factors directly related to the customer, such as whether they are senior citizens or have dependents, affect these observations


## Exploratory Questions

### This project aims to understand these relationships, guided by the following questions:

>-  ####  Do customers with multiple phone lines churn more than customers with only one line? Following up on this question, what difference do the length of tenure and type of payment make for these customers?

>-  #### What if they have packaged internet services too? What relationship can we infer for customers with both services in relation to their monthly and total charges?

>-  #### Have these customers contacted customer support to have their issues addressed? Are those that have contacted customer support less likely to churn than those that did not?





## Data Dictionary

## Exploratory Telco Data Analysis

>#### Data in this project was freshly acquired from the TelCoCo online database resource, **telco_churn db**, on Thursday, March 10, 2022, at 10:00am CST. A copy is cached on the local machine for ease of access. The database tables used are: (customers, internet_service_types, contract_types, customer_contracts, customer_payments, and payment_types).

***

    SELECT c.customer_id, c.churn, c.gender, c.tenure, c.partner, c.senior_citizen,
           c.dependents, c.phone_service, c.multiple_lines, ist.internet_service_type,
           ct.contract_type, cc.paperless_billing, c.tech_support, cp.monthly_charges,
           cp.total_charges, pt.payment_type

    FROM customers c

       JOIN internet_service_types ist USING (internet_service_type_id)
       JOIN contract_types ct USING (contract_type_id)
       JOIN customer_contracts cc USING (customer_id)
       JOIN customer_payments cp USING (payment_type_id)
       JOIN payment_types pt USING (payment_type_id);

***
 

Call the __acquireTelco.py__ module and get the data from the online database through the __get_telco_data()__ function:


>- telco = acquireTelco.get_telco_data() 
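
The caching behavior mentioned above (query once, then reuse a local copy) might look roughly like the sketch below. This is not the code in acquireTelco.py: `get_cached_data`, `fake_fetch`, and the file name are hypothetical, and the database query is replaced by a stand-in function so the example runs without a connection.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def get_cached_data(cache_path: str, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached CSV if it exists; otherwise fetch, cache, and return."""
    path = Path(cache_path)
    if path.exists():
        return pd.read_csv(path)
    df = fetch()  # in the real project this would run the SQL query
    df.to_csv(path, index=False)
    return df


calls = []  # records how many times the "database" is queried


def fake_fetch() -> pd.DataFrame:
    """Stand-in for the database query (made-up rows, for illustration only)."""
    calls.append(1)
    return pd.DataFrame({"customer_id": ["a1", "b2"], "churn": ["Yes", "No"]})


with tempfile.TemporaryDirectory() as tmp:
    cache_file = str(Path(tmp) / "telco_churn.csv")
    first = get_cached_data(cache_file, fake_fetch)   # fetches and writes the cache
    second = get_cached_data(cache_file, fake_fetch)  # reads the cached copy
```

Passing the query as a function keeps the caching logic independent of how the database connection is made.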

## Data cleaning:

#### First, I created a module prepareTelco.py with a clean_split_telco_data() function that calls the other functions in the module to clean the data (drop unnecessary columns), encode the categorical data into numeric data, change the total_charges column datatype from object to floating point, and lastly split the data into train, validate, and test subsets. The data is split in the ratio test: 20%, train and validate: 80% (train subset: 56% and validate: 24%).

###  Data encoding Key:

###  Binary data - (churn, partner, dependents, phone_service, paperless_billing):

>- #### 'Yes' == 1 & 'No' == 0

###  Multivariate Data - (payment_type, contract_type, internet_service_type)

>- #### 'Electronic check' == 1, 'Mailed check' == 2, 'Bank transfer (automatic)' == 3, 'Credit card (automatic)' == 4

>- ####  'Month-to-month' == 1, 'One_year' == 2, 'Two_year' == 3

>- ####  'DSL' == 1, 'Fiber_optic' == 2, 'None' == 3
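
The encoding key above can be applied with pandas `map`, as in the minimal sketch below. It is an illustration rather than the code in prepareTelco.py: `encode_columns` and `enc_payment_type` are hypothetical names, the sample rows are made up, and treating blank total_charges values as missing is an assumption.

```python
import pandas as pd

# Encoding key copied from the lists above
BINARY_MAP = {"Yes": 1, "No": 0}
PAYMENT_MAP = {
    "Electronic check": 1,
    "Mailed check": 2,
    "Bank transfer (automatic)": 3,
    "Credit card (automatic)": 4,
}


def encode_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add encoded columns and convert total_charges from text to float."""
    out = df.copy()
    out["enc_churn"] = out["churn"].map(BINARY_MAP)
    out["enc_paperless_billing"] = out["paperless_billing"].map(BINARY_MAP)
    out["enc_payment_type"] = out["payment_type"].map(PAYMENT_MAP)
    # Assumption: values that cannot be parsed (e.g. blanks) become NaN
    out["total_charges"] = pd.to_numeric(out["total_charges"], errors="coerce")
    return out


# Made-up rows for illustration only
sample = pd.DataFrame({
    "churn": ["Yes", "No"],
    "paperless_billing": ["Yes", "No"],
    "payment_type": ["Electronic check", "Credit card (automatic)"],
    "total_charges": ["29.85", " "],
})
encoded = encode_columns(sample)
```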
    

### train, validate, test = clean_split_telco_data(telco)



Split ratios:

	Train Data: (5497669, 14) (exploratory data analysis dataset)
	Validate Data:(2463974, 14) (check for overfitting on the train dataset)
	Test Data: (2064720, 14) (predict unseen data behaviour)
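
A 56% / 24% / 20% split like the one described can be produced with two calls to scikit-learn's `train_test_split`: first hold out 20% for test, then take 30% of the remainder for validate (0.30 × 0.80 = 0.24). The sketch below runs on made-up data; `split_data`, the seed, and stratifying on the target are assumptions, not necessarily what clean_split_telco_data() does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df: pd.DataFrame, target: str = "enc_churn", seed: int = 123):
    """Split into train (56%), validate (24%), and test (20%)."""
    train_validate, test = train_test_split(
        df, test_size=0.20, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate,
        test_size=0.30,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Made-up data for illustration only
demo = pd.DataFrame({"enc_churn": [0, 1] * 500, "tenure": range(1000)})
train, validate, test = split_data(demo)
```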
    
    
A copy of the train, validate, and test data was saved in separate files. This allows faster program execution by calling the train data directly. Later in the program, the validate and test data are called from their respective files.

## Initial observation on our data:

#### On performing visualizations and running Python code to determine the factors driving churn, I determined that churn is highest within the paperless billing group, at 72.47% versus 55.71% for non-churn. I therefore narrowed my focus to understanding the cause of these observations.

#### In addition, I examined churn among the phone users and determined that the rates of churn and non-churn are equal. Although important, this is worth examining in a future project. This project will focus on paperless billing churn henceforth. My initial hypothesis is as follows:

### HO: Mean of churned customers using paperless billing = mean of non-churned customers not using paperless billing.

### H1: Mean of churned customers using paperless billing > mean of non-churned customers not using paperless billing.


## Statistical Test 


##### NOTE: For a 2-tailed test, we take the p-value as is. For a 1-tailed test, we reject HO in favor of a greater-than alternative when (p / 2) < α and t > 0, and in favor of a less-than alternative when (p / 2) < α and t < 0.

#### This test uses the independent two-sample t-test (one-tailed) for our analytical evaluation, with alpha set at 0.05 (a 95% confidence level).


Since the variances of the two samples are not the same, we'll pass equal_var=False (Welch's t-test).

    from scipy import stats

    alpha = 0.05

    # Compare the sample variances to decide on equal_var
    print(f'\n\tChurned Sample Variances: {churn_paperless_sample.var()}\n')

    print(f'\tNon-Churned Sample Variances: {no_churn_paperless_sample.var()}\n')

    # T-test (Welch's, since the variances differ)
    t, p = stats.ttest_ind(churn_paperless_sample, no_churn_paperless_sample, equal_var=False)

    print()

    print(f'\tT-value: {t} \n\n\tP-value: {p}\n')

    # Evaluate the hypothesis (one-tailed, so halve the two-tailed p-value)
    HO = 'Mean of churned customers using paperless billing  = mean of non-churned customers not using paperless billing'

    H1 = 'Mean of churned customers using paperless billing  > mean of non-churned customers not using paperless billing'

    if (p / 2) > alpha:

        print(f'\tFailed to reject HO:-> \n\n{HO}')

    elif t < 0:

        print(f'\tFailed to reject HO:-> \n\n{HO}')

    else:
        print()
        print(f'\tWe reject HO (Accept H1):-> \n\n{HO}')


        print('\n\n\nAdapted hypothesis:\n')
        print(f'= [ {H1} ]\n')
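
As a cross-check on the halving rule in the note above, SciPy (version 1.6 and later) accepts an `alternative` argument on `ttest_ind` that returns the one-tailed p-value directly. The sketch below uses synthetic samples, not the Telco data; `group_a` and `group_b` are hypothetical stand-ins for the churned and non-churned samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples with different means and variances (illustration only)
group_a = rng.normal(loc=70.0, scale=10.0, size=300)
group_b = rng.normal(loc=60.0, scale=15.0, size=300)

alpha = 0.05

# Two-tailed Welch's t-test
t_two, p_two = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-tailed Welch's t-test: H1 is mean(group_a) > mean(group_b)
t_one, p_one = stats.ttest_ind(
    group_a, group_b, equal_var=False, alternative="greater"
)

# When t > 0, the one-tailed p-value is half the two-tailed p-value
reject_h0 = p_one < alpha
```

Using `alternative="greater"` removes the need for a separate check on the sign of t, because a negative t then yields a large p-value.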


### Model Results

We reject HO (Accept H1):-> 

Mean of churned customers using paperless billing  = mean of non-churned customers not using paperless billing



## Adapted hypothesis:

####  Mean of churned customers using paperless billing  > mean of non-churned customers not using paperless billing


#### How accurate is this evaluation, or rather, how sure am I about the hypothesis?

#### I proceeded to test the paperless billing hypothesis by modeling. Key metrics to be determined are:
>- #### Accuracy (the number of correct predictions over the total number of instances evaluated)
>- #### Precision (the share of predicted positives that are actually positive)
>- #### Recall (the share of actual positives that the model correctly identified)
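
To make the three metrics concrete, the sketch below computes them from scratch on a small made-up set of labels. `classification_metrics` is a hypothetical helper written for illustration; the project itself relies on scikit-learn's reports.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels: 1 = churned, 0 = did not churn
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.625 0.6 0.75
```

Here 5 of 8 predictions are correct (accuracy), 3 of the 5 predicted churners actually churned (precision), and 3 of the 4 actual churners were found (recall).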

### Step I: Selected my paperless churn vs non-churn data from the train data


    # Create churn and non_churn paperless series
    churn_paperless_bill = train_df.enc_churn[train_df.enc_paperless_billing == 1]
    non_churn_paperless_bill = train_df.enc_churn[train_df.enc_paperless_billing == 0]

    # convert to list
    churn_paperless_bill = churn_paperless_bill.tolist()
    non_churn_paperless_bill = non_churn_paperless_bill.tolist()

    # Convert the series into dataframe and give labels
    df_train = pd.DataFrame({'actual_paperless_bill_churn': [churn_paperless_bill],
                            'predicted_paperless_bill_churn': [non_churn_paperless_bill]})

    df_train.head(5).T

### Step II: Fit the data on the evaluation models: Decision Tree, Random Forest, and K-Nearest Neighbor


#### NOTE: All models were fitted with the same features to determine the accuracy, precision, and recall. Remember that the data, already split into train, validate, and test, is loaded from separate files in this section.



All models followed the following process:

(1). Import the required sklearn metrics

(2). Create the respective model object

(3). Fit the model

(4). Visualize the model

(5). Make predictions

(6). Estimate probability 

(7). Compute accuracy

(8). Confusion matrix

(9). Classification report

(10). Evaluate our model with the out-of-sample data (validate data)
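
The steps above can be sketched for one model as follows. This is an illustrative outline on synthetic data, not the notebook code: the features, `max_depth`, and the random seeds are assumptions. Step 4 (visualization) is left out; `sklearn.tree.plot_tree` is one option for a decision tree.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix  # step 1
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and validate subsets (illustration only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 3))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
X_validate = rng.normal(size=(150, 3))
y_validate = (X_validate[:, 0] + 0.5 * X_validate[:, 1] > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=123)  # step 2
clf.fit(X_train, y_train)                                    # step 3

y_pred = clf.predict(X_train)                                # step 5
y_pred_proba = clf.predict_proba(X_train)                    # step 6
train_accuracy = clf.score(X_train, y_train)                 # step 7
cm = confusion_matrix(y_train, y_pred)                       # step 8
report = classification_report(y_train, y_pred)              # step 9

# Step 10: a large gap between train and validate accuracy suggests overfitting
validate_accuracy = clf.score(X_validate, y_validate)
print(report)
print(f"train: {train_accuracy:.2f}, validate: {validate_accuracy:.2f}")
```

The same sequence applies to `RandomForestClassifier` and `KNeighborsClassifier`, since all three share the `fit`, `predict`, `predict_proba`, and `score` methods.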


#### Please refer to the customer_churn_telcoco.ipynb file under modeling for the code on this section.

### Decision tree result:

    print(classification_report(y_train, y_pred))

                  precision    recall  f1-score   support

               0       0.76      1.00      0.86   4150754
               1       0.00      0.00      0.00   1339039

        accuracy                           0.76   5489793
       macro avg       0.38      0.50      0.43   5489793
    weighted avg       0.57      0.76      0.65   5489793


### Out-of-sample data:


    print('\nAccuracy of Decision Tree classifier on validate set: {:.2f}\n'
         .format(clf.score(X_validate, y_validate)))


Accuracy of Decision Tree classifier on validate set: __0.75__
    

### Random forest result:
 

    print(classification_report(y_train_rf, y_pred_rf))

                  precision    recall  f1-score   support

               0       0.76      1.00      0.86   4150754
               1       0.00      0.00      0.00   1339039

        accuracy                           0.76   5489793
       macro avg       0.38      0.50      0.43   5489793
    weighted avg       0.57      0.76      0.65   5489793


#### Validate model: Evaluate. Out-of-sample data:

    print('\nAccuracy of random forest classifier on validate set: {:.2f}\n'
         .format(rf.score(X_validate_rf, y_validate_rf)))


Accuracy of random forest classifier on validate set: __0.75__


### K-Nearest neighbor result

    print(classification_report(y_train_knn, y_pred_knn))

                  precision    recall  f1-score   support

               0       0.86      0.91      0.88   4150754
               1       0.65      0.53      0.58   1339039

        accuracy                           0.82   5489793
       macro avg       0.75      0.72      0.73   5489793
    weighted avg       0.81      0.82      0.81   5489793


#### Validate model: Evaluate. Out-of-sample data:

    print('\nAccuracy of k-nearest neighbor classifier on validate set: {:.2f}\n'
         .format(knn.score(X_validate_knn, y_validate_knn)))


Accuracy of k-nearest neighbor classifier on validate set: __0.74__



### Interpretation:

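#### Remember, accuracy is the number of correct predictions over the total number of instances evaluated.


Model I (Decision Tree): train accuracy is 76%, validate accuracy is 75%

Model II (Random Forest): train accuracy is 76%, validate accuracy is 75%

Model III (K-Nearest Neighbor): train accuracy is 82%, validate accuracy is 74%

##### The K-Nearest Neighbor model performed best on the train data at 82%, although its validate accuracy (74%) is slightly below that of the other two models (75%).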
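
One way to put these accuracy figures in context is to compare them with a majority-class baseline, that is, the accuracy of always predicting class 0 (no churn). The sketch below uses only the support counts from the classification reports above; the variable names are illustrative.

```python
# Class supports copied from the classification reports above
non_churn_support = 4150754  # class 0
churn_support = 1339039      # class 1

total = non_churn_support + churn_support

# Accuracy of a model that predicts class 0 for every customer
baseline_accuracy = non_churn_support / total

print(f"Total instances: {total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

The baseline rounds to 0.76, the same as the train accuracy of the Decision Tree and Random Forest. This matches their reports, where recall is 1.00 for class 0 and 0.00 for class 1: both models predicted class 0 for every training instance. Only the K-Nearest Neighbor model identified any churned customers on the train data.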

## Recommendations:

This project confirms that the reported churn within TelCoCo is real and that part of this churn is observed with paperless billing.

Offer customers paper billing and reassess churn after 6 months.