# Why Are Our Customers Churning?

**1. Project Plan**<br>
**2. Acquire and Split Data**<br>
**3. Create a Baseline Model**<br>
**4. Explore Data**<br>
**5. Create and Compare Different Models**<br>
**6. Predict on Test Model**<br>
**7. Exporting CSV with Predictions**

## 1. Project Plan

### Background

Our team leader wants us to find out why our customers are churning.

> Our team lead would like us to take a look at some of our recent customer data. We've been tasked with identifying areas that represent high customer churn.

> Aside from the more general question, *why are our customers churning?*, some other questions we will look to answer: Is there a price threshold for specific services where the likelihood of churn increases? Is there a negative impact once the price for those services goes past that point? If so, what is that point, and for which service(s)? There are numerous other possible questions.

> For this particular project she would like to see our code documentation and commenting buttoned-up. In addition, she'd like us to not leave any individual numbers or figures displayed in isolation. Adding context in these situations is necessary.
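
The price-threshold question above can be explored by binning monthly charges and comparing churn rates across bins. The following is a minimal sketch on made-up toy data, not the project's actual analysis; the helper name `churn_rate_by_price_bin` and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real values would come from the telco DataFrame.
toy = pd.DataFrame({
    "monthly_charges": [20.0, 25.0, 45.0, 55.0, 70.0, 75.0, 95.0, 105.0],
    "churn": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

def churn_rate_by_price_bin(df, bins):
    """Return the share of churned customers within each monthly-charge bin."""
    churned = df["churn"].eq("Yes")                      # boolean churn flag
    price_bin = pd.cut(df["monthly_charges"], bins=bins)  # right-closed intervals
    return churned.groupby(price_bin, observed=False).mean()

# Bin edges are arbitrary here; in practice they would be tuned per service.
rates = churn_rate_by_price_bin(toy, bins=[0, 30, 60, 90, 120])
print(rates)
```

A sharp jump in the churn rate between two adjacent bins would suggest a candidate threshold worth testing more carefully.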

### Goals

Our goal is to identify as many customer subgroups as possible that are more likely to churn than others. Our target audience is our team lead; however, she will present these findings to the Senior Leadership Team (SLT). We need to keep this final audience in mind for report readability and communicate concisely and clearly.

The deliverables for this project are the following data assets:

1. A report detailing our analysis in .ipynb format
2. A CSV with the customer_id, probability of churn, and prediction of churn
3. A slide deck explaining our analysis with the SLT audience in mind
4. All .py files necessary to reproduce the work
5. A detailed README in a GitHub repo containing all files for this project
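
To make the shape of the CSV deliverable concrete, here is a rough sketch of how such a file might be assembled from a fitted classifier. The data is synthetic, the column names `churn_probability` and `churn_prediction` are assumptions, and the output is written to an in-memory buffer rather than a real file.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in features, indexed by a hypothetical customer_id.
ids = pd.Index([f"id-{i:04d}" for i in range(40)], name="customer_id")
X = pd.DataFrame(
    {"monthly_charges": np.linspace(20, 110, 40), "tenure": np.arange(40) % 12 + 1},
    index=ids,
)
y = (X["monthly_charges"] > 65).astype(int)  # made-up churn label (1 = churned)

model = LogisticRegression(max_iter=1000).fit(X, y)

predictions = pd.DataFrame({
    "customer_id": X.index,
    # Column 1 of predict_proba is the probability of the positive class (1).
    "churn_probability": model.predict_proba(X)[:, 1],
    "churn_prediction": model.predict(X),
})

# A real run would pass a file path instead of a buffer.
buffer = io.StringIO()
predictions.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])
```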

In [8]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score
import sklearn.impute
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import env
import acquire
import prepare

pd.set_option('display.max_columns', None)

In [2]:
telco = acquire.get_telco_data()

In [3]:
telco.head()

| | payment_type_id | internet_service_type_id | contract_type_id | customer_id | gender | senior_citizen | partner | dependents | tenure | phone_service | ... | tech_support | streaming_tv | streaming_movies | paperless_billing | monthly_charges | total_charges | churn | contract_type | internet_service_type | payment_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 0003-MKNFE | Male | 0 | No | No | 9 | Yes | ... | No | No | Yes | No | 59.9 | 542.4 | No | Month-to-month | DSL | Mailed check |
| 1 | 4 | 1 | 1 | 0013-MHZWF | Female | 0 | No | Yes | 9 | Yes | ... | Yes | Yes | Yes | Yes | 69.4 | 571.45 | No | Month-to-month | DSL | Credit card (automatic) |
| 2 | 1 | 1 | 1 | 0015-UOCOJ | Female | 1 | No | No | 7 | Yes | ... | No | No | No | Yes | 48.2 | 340.35 | No | Month-to-month | DSL | Electronic check |
| 3 | 1 | 1 | 1 | 0023-HGHWL | Male | 1 | No | No | 1 | No | ... | No | No | No | Yes | 25.1 | 25.1 | Yes | Month-to-month | DSL | Electronic check |
| 4 | 3 | 1 | 1 | 0032-PGELS | Female | 0 | Yes | Yes | 1 | No | ... | No | No | No | No | 30.5 | 30.5 | Yes | Month-to-month | DSL | Bank transfer (automatic) |


In [4]:
X_train, y_train, X_validate, y_validate, X_test, y_test = prepare.split_telco(telco)

In [None]:
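def fill_na(df):
    """Replace blank-string placeholders (" ") with NaN, in place, and return df."""
    df.replace(to_replace = " ", value = np.nan, inplace = True)
    return df

def drop_na(df):
    """Drop rows containing any NaN, in place, and return df."""
    # dropna(inplace=True) returns None, so return the frame explicitly
    df.dropna(axis = 0, inplace = True)
    return df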

In [None]:
fill_na(telco)

In [None]:
drop_na(telco)

In [None]:
telco.total_charges = telco.total_charges.astype('float')
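
The cells above replace blank strings with NaN, drop the affected rows, and then cast `total_charges` to float. As an alternative sketch (not the notebook's own method), `pd.to_numeric` with `errors="coerce"` can fold the first and last steps together. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy frame mimicking a total_charges column stored as strings,
# with a blank string standing in for a missing value.
raw = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "total_charges": ["542.4", " ", "25.1"],
})

cleaned = (
    raw
    # Unparseable entries such as " " become NaN instead of raising.
    .assign(total_charges=pd.to_numeric(raw["total_charges"], errors="coerce"))
    # Drop only the rows where the conversion produced NaN.
    .dropna(subset=["total_charges"])
)
print(cleaned.dtypes["total_charges"], len(cleaned))
```

Restricting `dropna` to `subset=["total_charges"]` avoids discarding rows that are missing values in unrelated columns.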

In [None]:
telco.info()

In [None]:
def drop_na(df):
    df = df.dropna(axis = 0)
    return df

In [None]:
telco = drop_na(telco)

In [None]:
telco.head()

In [None]:
telco.shape

In [9]:
X_train.head()

| customer_id | gender | senior_citizen | online_security | online_backup | device_protection | tech_support | streaming_tv | streaming_movies | paperless_billing | monthly_charges | total_charges | tenure_years | phone_and_multi_line | partner_and_dependents | Electronic check | Mailed check | Credit card (automatic) | Bank transfer (automatic) | DSL | Fiber optic | None | Month-to-month | One year | Two year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3714-JTVOV | Female | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 74.15 | 3229.4 | 3.5 | 1 | 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3049-SOLAY | Female | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 95.2 | 292.85 | 0.25 | 2 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5035-PGZXH | Female | 0 | 0 | 2 | 2 | 0 | 2 | 2 | 1 | 106.8 | 5914.4 | 4.666667 | 2 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1051-EQPZR | Female | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 19.6 | 780.25 | 3.666667 | 1 | 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 8755-OGKNA | Female | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 19.5 | 1167.6 | 4.75 | 1 | 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
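
The `prepare` module is not shown here, so how these columns are built is not confirmed. The `tenure_years` values look like tenure in months divided by 12, and the trailing 0.0/1.0 columns look like one-hot encodings of payment, internet service, and contract type. The sketch below shows one way such columns might be derived, using made-up rows and assumed source column names.

```python
import pandas as pd

# Hypothetical rows; ids and values are invented for illustration.
toy = pd.DataFrame(
    {
        "tenure": [42, 3, 56],  # assumed to be in months
        "contract_type": ["Month-to-month", "One year", "Two year"],
    },
    index=pd.Index(["id-0001", "id-0002", "id-0003"], name="customer_id"),
)

# Assumption: tenure_years is tenure in months converted to years.
toy["tenure_years"] = toy["tenure"] / 12

# One-hot encode the categorical column into float indicator columns.
dummies = pd.get_dummies(toy["contract_type"], dtype=float)

encoded = pd.concat([toy.drop(columns=["tenure", "contract_type"]), dummies], axis=1)
print(encoded)
```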


In [6]:
X_train.shape

(4500, 24)

In [7]:
print('   train: %d rows' % X_train.shape[0])
print('validate: %d rows' % X_validate.shape[0])
print('    test: %d rows' % X_test.shape[0])

   train: 4500 rows
validate: 1125 rows
    test: 1407 rows
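
These counts sum to 7,032 rows, which works out to roughly 64% train, 16% validate, and 20% test. That is consistent with two successive 80/20 splits, although `prepare.split_telco` is not shown and may work differently. A minimal sketch that reproduces the same row counts, with `random_state=123` chosen arbitrarily:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the cleaned dataset: one entry per row (4500 + 1125 + 1407).
rows = np.arange(7032)

# First split: hold out 20% as the test set.
train_validate, test = train_test_split(rows, test_size=0.2, random_state=123)

# Second split: hold out 20% of the remainder as the validate set.
train, validate = train_test_split(train_validate, test_size=0.2, random_state=123)

print(len(train), len(validate), len(test))  # 4500 1125 1407
```

A real split on churn data would likely also pass `stratify=` with the target so each subset keeps a similar churn rate; that is omitted here because the original split logic is unknown.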
