# Project Telco

* Learn to discern what turns the churn burn

## Goal

* Discover drivers of churn of Telco customers
* Use those drivers to develop a machine learning model that classifies each customer as churning (ending their contract) or not churning (renewing their contract with Telco)

## Imports

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

from scipy import stats

import wrangle as w
# import explore as e
# import modeling as m

## Acquire

* Data acquired from Codeup MySQL DB
* Data initially acquired on 25 Apr 2023
* It contained 7,043 rows and 21 columns before cleaning
* Each row represents a unique customer of Telco
* Each column represents an element of the customer account
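The "csv file found and loaded" message printed later suggests the data is cached locally after the first database pull. The body of `w.wrangle_telco_data` is not shown in this notebook, so the following is only a minimal sketch of that caching pattern. The names `load_with_cache` and `fetch` are hypothetical, and the database query itself is left to the caller.

```python
import os

import pandas as pd


def load_with_cache(csv_path, fetch):
    """Return the cached CSV if it exists; otherwise call fetch() and cache the result.

    fetch is any zero-argument callable returning a DataFrame, for example a
    wrapper around pd.read_sql against the project database (hypothetical here).
    """
    if os.path.exists(csv_path):
        print('csv file found and loaded')
        return pd.read_csv(csv_path)
    df = fetch()
    # index=False avoids writing the row index as an extra unnamed column
    df.to_csv(csv_path, index=False)
    print('data fetched and cached to csv')
    return df
```

Keeping the query behind a callable means the notebook can be rerun without database credentials once the CSV exists.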

## Prepare

**Prepare Actions**:

* Removed columns that did not contain useful information
* Renamed columns to promote readability
* Checked for nulls in the data
    - Nulls in total_charges all occurred where tenure was 0, so they were replaced with 0
* Checked that column data types were appropriate
* Encoded categorical variables
    - turned 'Yes'/'No' to 1/0
* Split data into train, validate and test (approx. 60/20/20), stratifying on 'churn'
* Outliers have not been removed for this iteration of the project
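The prepare actions above can be sketched roughly as follows. This is not the code in `wrangle.py`, which is not shown here. It assumes `total_charges` may arrive as text with blank entries, and the function names `prep_sketch` and `split_sketch`, the encoded column list, and the random seed are all illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prep_sketch(df, yes_no_cols=('partner', 'churn')):
    """Apply the null fix and the Yes/No encoding to a copy of df."""
    df = df.copy()
    # unparseable or missing total_charges become NaN, then 0
    df['total_charges'] = pd.to_numeric(df['total_charges'], errors='coerce').fillna(0)
    # map binary Yes/No columns to 1/0
    for col in yes_no_cols:
        df[col] = df[col].map({'Yes': 1, 'No': 0})
    return df


def split_sketch(df, target, seed=42):
    """Split into roughly 60/20/20 train/validate/test, stratified on target."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target])
    # 25% of the remaining 80% gives the 20% validate share
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed,
        stratify=train_validate[target])
    return train, validate, test
```

Stratifying on the target keeps the churn rate nearly identical across the three splits, which matters because churners are the minority class.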

In [2]:
# acquire, clean, and prepare
df = w.wrangle_telco_data()
df.head()

csv file found and loaded
data cleaned and prepped


Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,phone_service,multiple_lines,internet_service_type,online_security,online_backup,...,streaming_tv_Yes,streaming_movies_No internet service,streaming_movies_Yes,contract_type_One year,contract_type_Two year,internet_service_type_Fiber optic,internet_service_type_None,payment_type_Credit card (automatic),payment_type_Electronic check,payment_type_Mailed check
0,0002-ORFBO,Female,0,1,1,1,No,DSL,No,Yes,...,1,0,0,1,0,0,0,0,0,1
1,0003-MKNFE,Male,0,0,0,1,Yes,DSL,No,No,...,0,0,1,0,0,0,0,0,0,1
2,0004-TLHLJ,Male,0,0,0,1,No,Fiber optic,No,No,...,0,0,0,0,0,1,0,0,1,0
3,0011-IGKFF,Male,1,1,0,1,No,Fiber optic,No,Yes,...,1,0,1,0,0,1,0,0,1,0
4,0013-EXCHZ,Female,1,1,0,1,No,Fiber optic,No,No,...,1,0,0,0,0,1,0,0,0,1


In [4]:
# split into train, validate, and test
train, validate, test = w.split_data(df, 'churn')

data split
train -> (4225, 43); 59.99%
validate -> (1409, 43); 20.01%
test -> (1409, 43); 20.01%


#### A brief look at the data

In [5]:
# show head of train data
train.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,phone_service,multiple_lines,internet_service_type,online_security,online_backup,...,streaming_tv_Yes,streaming_movies_No internet service,streaming_movies_Yes,contract_type_One year,contract_type_Two year,internet_service_type_Fiber optic,internet_service_type_None,payment_type_Credit card (automatic),payment_type_Electronic check,payment_type_Mailed check
2332,3338-CVVEH,Male,0,0,0,1,Yes,Fiber optic,No,No,...,1,0,1,0,0,1,0,0,1,0
5275,7442-YGZFK,Male,0,0,0,1,Yes,DSL,No,No,...,0,0,0,0,0,0,0,1,0,0
6429,9102-OXKFY,Male,0,0,0,1,Yes,DSL,No,No,...,0,0,0,0,1,0,0,1,0,0
89,0141-YEAYS,Female,1,0,0,1,Yes,Fiber optic,No,Yes,...,0,0,0,0,0,1,0,0,0,0
6412,9079-YEXQJ,Female,0,0,0,1,Yes,Fiber optic,No,Yes,...,1,0,1,0,0,1,0,0,1,0


#### A summary of the data

In [7]:
# summary statistics for the numeric columns of train, transposed for readability
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
senior_citizen,4225.0,0.159053,0.365769,0.0,0.0,0.0,0.0,1.0
partner,4225.0,0.479527,0.49964,0.0,0.0,0.0,1.0,1.0
dependents,4225.0,0.305562,0.4607,0.0,0.0,0.0,1.0,1.0
phone_service,4225.0,0.907219,0.29016,0.0,1.0,1.0,1.0,1.0
paperless_billing,4225.0,0.60426,0.489067,0.0,0.0,1.0,1.0,1.0
monthly_charges,4225.0,65.273243,30.218179,18.4,36.45,70.75,90.35,118.75
total_charges,4225.0,2320.103183,2297.297588,0.0,392.65,1414.8,3902.45,8684.8
tenure,4225.0,32.562367,24.755164,0.0,9.0,29.0,56.0,72.0
churn,4225.0,0.265325,0.441559,0.0,0.0,0.0,1.0,1.0
female,4225.0,0.496095,0.500044,0.0,0.0,0.0,1.0,1.0


## Explore

* How often does a customer churn?

* Here you will explore your data then highlight 4 questions that you asked of the data and how those questions influenced your analysis
* Remember to split your data before exploring how different variables relate to one another
* Each question should be stated directly 
* Each question should be supported by a visualization
* Each question should be answered in natural language
* Two questions must be supported by a statistical test, but you may choose to support more than two
* See the following example, and read the comments in the next cell

In [17]:
# churn rate
(train.churn==1).mean()

0.26532544378698225

In [13]:
# show the unique values of every column with fewer than 5 distinct values
for col in train.columns:
    if train[col].nunique(dropna=False) < 5:
        print(col, train[col].unique())

gender ['Male' 'Female']
senior_citizen [0 1]
partner [0 1]
dependents [0 1]
phone_service [1 0]
multiple_lines ['Yes' 'No phone service' 'No']
internet_service_type ['Fiber optic' 'DSL' 'None']
online_security ['No' 'No internet service' 'Yes']
online_backup ['No' 'Yes' 'No internet service']
device_protection ['No' 'Yes' 'No internet service']
tech_support ['No' 'Yes' 'No internet service']
streaming_tv ['Yes' 'No' 'No internet service']
streaming_movies ['Yes' 'No' 'No internet service']
contract_type ['Month-to-month' 'Two year' 'One year']
payment_type ['Electronic check' 'Credit card (automatic)' 'Bank transfer (automatic)'
 'Mailed check']
paperless_billing [0 1]
churn [0 1]
female [0 1]
multiple_lines_No phone service [0 1]
multiple_lines_Yes [1 0]
online_security_No internet service [0 1]
online_security_Yes [0 1]
online_backup_No internet service [0 1]
online_backup_Yes [0 1]
device_protection_No internet service [0 1]
device_protection_Yes [0 1]
tech_support_No internet serv

In [18]:
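# work on a copy so the engineered columns do not alter train itself
train1 = train.copy()
# bin tenure into 6 equal-width bins labeled 1 (shortest) through 6 (longest)
train1['tenure_bin'] = pd.cut(train1.tenure, bins=6, labels=[1, 2, 3, 4, 5, 6]).astype(int)
train1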

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,phone_service,multiple_lines,internet_service_type,online_security,online_backup,...,streaming_movies_No internet service,streaming_movies_Yes,contract_type_One year,contract_type_Two year,internet_service_type_Fiber optic,internet_service_type_None,payment_type_Credit card (automatic),payment_type_Electronic check,payment_type_Mailed check,tenure_bin
2332,3338-CVVEH,Male,0,0,0,1,Yes,Fiber optic,No,No,...,0,1,0,0,1,0,0,1,0,1
5275,7442-YGZFK,Male,0,0,0,1,Yes,DSL,No,No,...,0,0,0,0,0,0,1,0,0,1
6429,9102-OXKFY,Male,0,0,0,1,Yes,DSL,No,No,...,0,0,0,1,0,0,1,0,0,5
89,0141-YEAYS,Female,1,0,0,1,Yes,Fiber optic,No,Yes,...,0,0,0,0,1,0,0,0,0,3
6412,9079-YEXQJ,Female,0,0,0,1,Yes,Fiber optic,No,Yes,...,0,1,0,0,1,0,0,1,0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5592,7874-ECPQJ,Female,0,0,1,1,No,,No internet service,No internet service,...,1,0,0,0,0,1,0,0,0,1
1739,2530-FMFXO,Male,0,1,1,1,Yes,Fiber optic,No,No,...,0,1,0,1,1,0,0,1,0,5
4993,7018-FPXHH,Male,0,1,1,1,No,DSL,Yes,Yes,...,0,0,0,1,0,0,0,0,0,5
2582,3692-JHONH,Female,1,1,0,1,Yes,Fiber optic,No,Yes,...,0,1,1,0,1,0,0,1,0,5


In [19]:
# count the internet add-on services each customer subscribes to
addons = ['online_security', 'online_backup', 'device_protection', 'tech_support', 'streaming_tv', 'streaming_movies']
train1['internet_packages'] = train1[[f'{c}_Yes' for c in addons]].sum(axis=1)
train1

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,phone_service,multiple_lines,internet_service_type,online_security,online_backup,...,streaming_movies_Yes,contract_type_One year,contract_type_Two year,internet_service_type_Fiber optic,internet_service_type_None,payment_type_Credit card (automatic),payment_type_Electronic check,payment_type_Mailed check,tenure_bin,internet_packages
2332,3338-CVVEH,Male,0,0,0,1,Yes,Fiber optic,No,No,...,1,0,0,1,0,0,1,0,1,2
5275,7442-YGZFK,Male,0,0,0,1,Yes,DSL,No,No,...,0,0,0,0,0,1,0,0,1,0
6429,9102-OXKFY,Male,0,0,0,1,Yes,DSL,No,No,...,0,0,1,0,0,1,0,0,5,1
89,0141-YEAYS,Female,1,0,0,1,Yes,Fiber optic,No,Yes,...,0,0,0,1,0,0,0,0,3,2
6412,9079-YEXQJ,Female,0,0,0,1,Yes,Fiber optic,No,Yes,...,1,0,0,1,0,0,1,0,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5592,7874-ECPQJ,Female,0,0,1,1,No,,No internet service,No internet service,...,0,0,0,0,1,0,0,0,1,0
1739,2530-FMFXO,Male,0,1,1,1,Yes,Fiber optic,No,No,...,1,0,1,1,0,0,1,0,5,4
4993,7018-FPXHH,Male,0,1,1,1,No,DSL,Yes,Yes,...,0,0,1,0,0,0,0,0,5,3
2582,3692-JHONH,Female,1,1,0,1,Yes,Fiber optic,No,Yes,...,1,1,0,1,0,0,1,0,5,4


**The following empty code block** is here to represent the countless questions, visualizations, and statistical tests 
that did not make your final report. Data scientists often create a myriad of questions, visualizations 
and statistical tests that do not make it into the final notebook. This is okay and expected. Remember 
that shotgun approaches to your data such as using pair plots to look at the relationships of each feature 
are a great way to explore your data, but they have no place in your final report. 
**Your final report is about showing and supporting your findings, not showing the work you did to get there!**

## You may use this as a template for how to ask and answer each question:

### 1) Question about the data
* Ask a question about the data for which you got a meaningful result
* Finding no connection can also be a meaningful result

### 2) Visualization of the data answering the question

* Visualizations should be accompanied by takeaways telling the reader exactly what you want them to get from the chart
* You can include these as bullet points under the chart
* Use your chart title to provide the main take-away from each visualization
* Each visualization should answer one, and only one, of the explore questions
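One way to follow the visualization guidance above is a bar chart of churn rate per category, with the takeaway written into the title. The sketch below is illustrative only: `plot_churn_rate` is a hypothetical helper, and the title wording is just one possible phrasing.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_churn_rate(df, feature, target='churn'):
    """Bar chart of churn rate per category, with the overall rate as a reference line."""
    rates = df.groupby(feature)[target].mean().sort_values(ascending=False)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(rates.index.astype(str), rates.values)
    # dashed line marks the overall churn rate for comparison
    ax.axhline(df[target].mean(), color='gray', linestyle='--', label='overall churn rate')
    ax.set_xlabel(feature)
    ax.set_ylabel('churn rate')
    # the title carries the main takeaway: which group churns most
    ax.set_title(f'Churn rate is highest for {feature} = {rates.index[0]}')
    ax.legend()
    return ax
```

In this notebook it could be called as `plot_churn_rate(train, 'contract_type')`, using only the train split.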

### 3) Statistical test
* Be sure you are using the correct statistical test for the type of variables you are testing
* Be sure that you are not violating any of the assumptions for the statistical test you are choosing
* Your notebook should run and produce the results of the test you are using (This may be done through imports)
* Include an introduction to the kind of test you are doing
* Include the Ho and Ha for the test
* Include the alpha you are using
* Include the readout of the p-value for the test
* Interpret the results of the test in natural language ("I reject the null hypothesis" is not sufficient)
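For two categorical variables, such as a contract type and churn, a chi-square test of independence is a common choice. The following is a minimal sketch of a readout that covers the checklist above. The helper name `chi2_report` and the default alpha of 0.05 are assumptions, not choices made elsewhere in this notebook.

```python
import pandas as pd
from scipy import stats


def chi2_report(df, feature, target='churn', alpha=0.05):
    """Run a chi-square test of independence and print a plain-language readout."""
    observed = pd.crosstab(df[feature], df[target])
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    print(f'H0: {feature} and {target} are independent')
    print(f'Ha: {feature} and {target} are associated')
    print(f'alpha = {alpha}, chi2 = {chi2:.2f}, p = {p:.4g}')
    if p < alpha:
        print(f'p < alpha: the data suggest {target} rates differ across {feature} groups')
    else:
        print(f'p >= alpha: no evidence that {target} rates differ across {feature} groups')
    # the smallest expected count helps check the test's assumption
    # (a common rule of thumb is that expected counts should be at least 5)
    return p, expected.min()
```

Returning the smallest expected cell count alongside the p-value makes it easy to confirm the assumption before trusting the result.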

### 4) Answer to the question
* Answer the question you posed of the data by referring to the chart and statistical test (if you used one)
* If the question relates to drivers, explain why the feature in question would/wouldn't make a good driver

## Exploration Summary
* After your explore section, before you start modeling, provide a summary of your findings in Explore
* Include a summary of your takeaways
* Include a summary of the features you examined and whether or not you will be going to Modeling with each feature and why
* It is important to note which features will be going into your model so the reader knows what features you are using to model on

## Modeling

### Introduction
* Explain how you will be evaluating your models
* Include the evaluation metric you will be using and why you have chosen it
* Create a baseline and briefly explain how it was calculated 

In [3]:
# If you use code to generate your baseline run the code and generate the output here

Printout should read: <br>
Baseline: "number" "evaluation metric"
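A common baseline for classification is to always predict the most frequent class. The sketch below assumes accuracy is the evaluation metric, which this notebook has not yet committed to, and `baseline_accuracy` is a hypothetical helper name.

```python
import pandas as pd


def baseline_accuracy(y):
    """Return the most common class in y and the accuracy of always predicting it."""
    most_common = y.mode()[0]
    return most_common, (y == most_common).mean()


# tiny illustrative target, NOT the Telco data: three non-churners, one churner
label, acc = baseline_accuracy(pd.Series([0, 0, 0, 1]))
print(f'Baseline: {acc:.4f} accuracy (always predicting {label})')
```

Given the train churn rate of about 0.2653 shown earlier, always predicting "no churn" on train would score about 0.7347 accuracy, so a useful model needs to beat that figure.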

### Best 3 Models
* Show the three best model results obtained using your selected features to predict the target variable
* Typically students will show the top models they are able to generate for three different model types
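Comparing three model types could look something like the sketch below. It runs on synthetic stand-in data rather than the Telco splits, the hyperparameters are arbitrary placeholders, accuracy is assumed as the metric, and `compare_models` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X_train, y_train, X_validate, y_validate):
    """Fit each model on train; return {name: (train accuracy, validate accuracy)}."""
    results = {}
    for model in models:
        model.fit(X_train, y_train)
        name = type(model).__name__
        results[name] = (model.score(X_train, y_train),
                         model.score(X_validate, y_validate))
        print(name)
        print(f'accuracy on train: {results[name][0]:.4f}')
        print(f'accuracy on validate: {results[name][1]:.4f}')
    return results


# synthetic stand-in data, NOT the Telco data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
candidates = [
    DecisionTreeClassifier(max_depth=3, random_state=42),
    RandomForestClassifier(max_depth=3, n_estimators=50, random_state=42),
    LogisticRegression(),
]
demo_results = compare_models(candidates, X_demo[:200], y_demo[:200],
                              X_demo[200:], y_demo[200:])
```

A large gap between train and validate accuracy for any one model is a sign of overfitting and a reason to prefer a simpler setting.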

## You may use this as a template for how to introduce your models:

### Model Type

In [4]:
# Code that runs the best model in that model type goes here 
# (This may be imported from a module)

Printout of model code should read: <br>
"Model Type" <br>
"evaluation metric" on train: "evaluation result" <br>
"evaluation metric" on validate: "evaluation result"

### Test Model
* Choose the best model out of the three as your best model and explain why you have chosen it
* Explain that you will now run your final model on test data to gauge how it will perform on unseen data
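The final check on test data might be sketched as below. The helper `report_on_test` is hypothetical and again assumes accuracy as the headline metric. The confusion matrix and classification report are optional extras that show where the errors fall.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def report_on_test(fitted_model, X_test, y_test):
    """Score an already-fitted model once on the held-out test split."""
    preds = fitted_model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(type(fitted_model).__name__)
    print(f'accuracy on Test: {acc:.4f}')
    # rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds, zero_division=0))
    return acc
```

The model should be fitted on train only and scored on test a single time, so the test score stays an honest estimate of performance on unseen data.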

In [5]:
# Code that runs the best overall model on test data (this may be imported from a module)

Printout of model code should read: <br>
"Model Type" <br>
"evaluation metric" on Test: "evaluation result" <br>

### Modeling Wrap 
* Give a final interpretation of how the model's test score compares to the baseline and whether you would recommend this model for production

## Conclusion

### Summary
* Summarize your findings and answer the questions you brought up in explore 
* Summarize how the drivers you discovered led or did not lead to a successful model 

### Recommendations
* Recommendations are actions the stakeholder should take based on your insights

### Next Steps
* Next Steps are what you, as a Data Scientist, would do if provided more time to work on the project

**Where there is code in your report there should also be code comments telling the reader what each code block is doing. This is true for any and all code blocks even if you are using a function to import code from a module.**
<br>
<br>
**Your Notebook should contain adequate markdown that documents your thought process, decision making, and navigation through the pipeline. As a Data Scientist, your job does not end with making data discoveries. It includes effectively communicating those discoveries as well. This means documentation is a critical part of your job.**

# README

Your README should contain all of the following elements:

* **Title** Gives the name of your project
* **Project Description** Describes what your project is and why it is important 
* **Project Goal** Clearly states what your project sets out to do and how the information gained can be applied to the real world
* **Initial Hypotheses** Initial questions used to focus your project 
* **Project Plan** Guides the reader through the different stages of the pipeline as they relate to your project
* **Data Dictionary** Gives a definition for each of the features used in your report and the units they are measured in, if applicable
* **Steps to Reproduce** Gives instructions for reproducing your work, for example, running your notebook on someone else's computer.