# Why Are Our Customers Churning?

**1.** [**Project Plan**](#Project-Plan)<br>
**2.** [**Acquire and Split Data**](#acquire_and_split_data)<br>
**3.** [**Explore Data**](#explore_data)<br>
**4.** [**Create a Baseline Model**](#baseline_model)<br>
**5.** [**Create and Compare Different Models**](#modeling)<br>
**6.** [**Predict on Test Model**](#predict_test)<br>
**7.** [**Exporting CSV with Predictions**](#csv_export)

## 1. Project Plan

### Background

Our team leader wants us to find out why our customers are churning.

> Our team lead would like us to take a look at some of our recent customer data. We've been tasked with identifying areas that represent high customer churn.

> Aside from the more general question, *why are our customers churning?*, some other questions we will look to answer: Is there a price threshold for specific services where the likelihood of churn increases? Is there a negative impact once the price for those services goes past that point? If so, what is that point, and for which service(s)? These are among numerous other possible questions.

> For this particular project she would like to see our code documentation and commenting buttoned-up. In addition, she'd like us to not leave any individual numbers or figures displayed in isolation. Adding context to these situations is necessary.

### Goals

Our goal is to identify as many customer subgroups as possible that have a higher propensity to churn than others. Our immediate audience is our team lead; however, she will be presenting these findings to the Senior Leadership Team (SLT). We need to keep that final audience in mind with regard to report readability, and communicate concisely and clearly.

The deliverables for this project are the following data assets:

1. Report detailing our analysis in an .ipynb format
2. A CSV with the customer_id, probability of churn, and the prediction of churn
3. Slide Deck explaining our analysis with the SLT audience in mind
4. All .py files that are necessary to reproduce the work
5. Detailed README in a GitHub repo containing all files for this project

### Data Dictionary for Selected Features

#### *Target Variable*:

**churn** - defines whether or not the customer is still with telco: 0 == still with telco, 1 == they have churned


#### *Independent Variables*:

**gender** - gender identity of customer - 0 = female, 1 = male

**senior_citizen** - if the customer is a senior - 0 = not senior citizen, 1 = senior citizen

**online_security** - if the customer has online security through telco - 0 = No, 1 = No Internet Service, 2 = Yes

**online_backup** - if the customer has online backup through telco - 0 = No, 1 = No Internet Service, 2 = Yes

**device_protection** - if the customer has device protection through telco - 0 = No, 1 = No Internet Service, 2 = Yes

**tech_support** - if the customer has tech support through telco - 0 = No, 1 = No Internet Service, 2 = Yes

**streaming_tv** - if the customer has streaming tv service through telco - 0 = No, 1 = No Internet Service, 2 = Yes

**streaming_movies** - if the customer has streaming movie service through telco - 0 = No, 1 = No Internet Service, 2 = Yes

**paperless_billing** - if the customer has elected to have a paperless billing format - 0 = No, 1 = Yes

**monthly_charges** - the monthly charges per customer for all services - represented as a float, calculated in USD

**total_charges** - the total lifetime charges per customer for all services - represented as a float, calculated in USD

**tenure_years** - tenure of each customer - represented in total years, used the tenure (calculated in months) column from original data pull then divided by 12.

**phone_and_multi_line** - a combination of whether a customer has a phone line, and if they do, whether they have multiple lines (used the phone_service and multiple_lines columns from original pull) - 0 = No phone lines, 1 = Yes, but only one line, 2 = Yes, multiple lines

**partner_and_dependents** - a combination of whether a customer has a partner or dependents (used the partner and dependents columns from original pull) - 0 = No partner and No dependents, 1 = Yes, a partner and No dependents, 2 = No partner and Yes, dependents, 3 = Yes, partner and Yes, dependents

**Electronic check** - the customer pays with Electronic check, one hot encoded from the payment_type column from original pull - 0 = No, 1 = Yes

**Mailed check** - the customer pays with Mailed check, one hot encoded from the payment_type column from original pull - 0 = No, 1 = Yes

**Credit card (automatic)** - the customer pays with Credit card, one hot encoded from the payment_type column from original pull - 0 = No, 1 = Yes

**Bank transfer (automatic)** - the customer pays with Bank transfer, one hot encoded from the payment_type column from original pull - 0 = No, 1 = Yes

**DSL** - the customer has DSL internet service, one hot encoded from the internet_service_type column from original pull - 0 = No, 1 = Yes

**Fiber optic** - the customer has Fiber optic internet service, one hot encoded from the internet_service_type column from original pull - 0 = No, 1 = Yes

**None** - the customer does not have internet service, one hot encoded from the internet_service_type column from original pull - 0 = No, 1 = Yes

**Month-to-month** - the customer has a Month-to-month contract, one hot encoded from the contract_type column from original pull - 0 = No, 1 = Yes

**One year** - the customer has a One year contract, one hot encoded from the contract_type column from original pull - 0 = No, 1 = Yes

**Two year** - the customer has a Two year contract, one hot encoded from the contract_type column from original pull - 0 = No, 1 = Yes
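
The derived columns in this dictionary could be built roughly as follows. This is a minimal sketch on made-up rows, not the code in `prepare.py`: the toy frame `raw`, its values, and the choice to add boolean masks together are all assumptions made for illustration.

```python
import pandas as pd

# Toy rows shaped like the raw pull (values are made up for illustration)
raw = pd.DataFrame({
    "customer_id": ["A1", "B2", "C3", "D4"],
    "tenure": [12, 24, 6, 36],
    "phone_service": ["No", "Yes", "Yes", "Yes"],
    "multiple_lines": ["No phone service", "No", "Yes", "No"],
    "partner": ["No", "Yes", "No", "Yes"],
    "dependents": ["No", "No", "Yes", "Yes"],
    "contract_type": ["Month-to-month", "One year", "Two year", "Month-to-month"],
}).set_index("customer_id")

derived = pd.DataFrame(index=raw.index)

# Tenure arrives in months; the dictionary reports it in years
derived["tenure_years"] = raw["tenure"] / 12

# 0 = no phone lines, 1 = one line, 2 = multiple lines
derived["phone_and_multi_line"] = (
    (raw["phone_service"] == "Yes").astype(int)
    + (raw["multiple_lines"] == "Yes").astype(int)
)

# 0 = neither, 1 = partner only, 2 = dependents only, 3 = both
derived["partner_and_dependents"] = (
    (raw["partner"] == "Yes").astype(int)
    + 2 * (raw["dependents"] == "Yes").astype(int)
)

# One hot encode the contract type: one 0/1 column per category
contract_dummies = pd.get_dummies(raw["contract_type"]).astype(int)
derived = pd.concat([derived, contract_dummies], axis=1)
```

Adding the two masks works because each combined code is just a count (phone lines) or a weighted sum (partner = 1, dependents = 2), which keeps the mapping easy to read back from the dictionary.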

#### *Data Scaling*:

**Min/Max Scaler** - The deliverables require that we provide a model that performs better than the baseline at predicting customer churn. We'll be using classification models to make this prediction. We may or may not need to scale the data in general; however, a distance-based model like K Nearest Neighbors requires scaling, so we'll at least be scaling when testing that model.
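
A minimal sketch of min-max scaling, assuming the scaler is fit on the training split only and then reused on the other splits. The two toy frames and their values are hypothetical stand-ins, not the telco data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the train and validate splits (hypothetical values)
toy_train = pd.DataFrame({"monthly_charges": [20.0, 60.0, 100.0],
                          "tenure_years": [0.5, 2.0, 6.0]})
toy_validate = pd.DataFrame({"monthly_charges": [40.0, 120.0],
                             "tenure_years": [1.0, 3.0]})

# Fit on train only so nothing about validate leaks into the scaler
scaler = MinMaxScaler().fit(toy_train)

# transform() returns a NumPy array, so rebuild the DataFrames to keep labels
toy_train_scaled = pd.DataFrame(scaler.transform(toy_train),
                                columns=toy_train.columns, index=toy_train.index)
toy_validate_scaled = pd.DataFrame(scaler.transform(toy_validate),
                                   columns=toy_validate.columns, index=toy_validate.index)
```

Because the minimum and maximum come from train, values in the other splits can land outside the 0 to 1 range; that is expected rather than a bug.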

In [1]:
import numpy as np
import pandas as pd
from math import sqrt
from scipy import stats

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score
import sklearn.impute
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import env
import acquire
import prepare

pd.set_option('display.max_columns', None)

<a id='acquire_and_split_data'></a>

## 2. Acquire and Split Data

In [2]:
# We'll be pulling the telco data from our SQL servers. You'll need the acquire.py file and an env.py file for
# this data pull.

telco = acquire.get_telco_data()

### Let's take a look at the data

In [3]:
telco.head(10)

| | payment_type_id | internet_service_type_id | contract_type_id | customer_id | gender | senior_citizen | partner | dependents | tenure | phone_service | multiple_lines | online_security | online_backup | device_protection | tech_support | streaming_tv | streaming_movies | paperless_billing | monthly_charges | total_charges | churn | contract_type | internet_service_type | payment_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 0003-MKNFE | Male | 0 | No | No | 9 | Yes | Yes | No | No | No | No | No | Yes | No | 59.9 | 542.4 | No | Month-to-month | DSL | Mailed check |
| 1 | 4 | 1 | 1 | 0013-MHZWF | Female | 0 | No | Yes | 9 | Yes | No | No | No | No | Yes | Yes | Yes | Yes | 69.4 | 571.45 | No | Month-to-month | DSL | Credit card (automatic) |
| 2 | 1 | 1 | 1 | 0015-UOCOJ | Female | 1 | No | No | 7 | Yes | No | Yes | No | No | No | No | No | Yes | 48.2 | 340.35 | No | Month-to-month | DSL | Electronic check |
| 3 | 1 | 1 | 1 | 0023-HGHWL | Male | 1 | No | No | 1 | No | No phone service | No | No | No | No | No | No | Yes | 25.1 | 25.1 | Yes | Month-to-month | DSL | Electronic check |
| 4 | 3 | 1 | 1 | 0032-PGELS | Female | 0 | Yes | Yes | 1 | No | No phone service | Yes | No | No | No | No | No | No | 30.5 | 30.5 | Yes | Month-to-month | DSL | Bank transfer (automatic) |
| 5 | 1 | 1 | 1 | 0067-DKWBL | Male | 1 | No | No | 2 | Yes | No | Yes | No | No | No | No | No | Yes | 49.25 | 91.1 | Yes | Month-to-month | DSL | Electronic check |
| 6 | 2 | 1 | 1 | 0076-LVEPS | Male | 0 | No | Yes | 29 | No | No phone service | Yes | Yes | Yes | Yes | No | No | Yes | 45.0 | 1242.45 | No | Month-to-month | DSL | Mailed check |
| 7 | 2 | 1 | 1 | 0082-LDZUE | Male | 0 | No | No | 1 | Yes | No | No | No | No | No | No | No | Yes | 44.3 | 44.3 | No | Month-to-month | DSL | Mailed check |
| 8 | 1 | 1 | 1 | 0096-BXERS | Female | 0 | Yes | No | 6 | Yes | Yes | No | No | No | No | No | No | No | 50.35 | 314.55 | No | Month-to-month | DSL | Electronic check |
| 9 | 2 | 1 | 1 | 0096-FCPUF | Male | 0 | No | No | 30 | Yes | Yes | Yes | No | No | No | No | Yes | Yes | 64.5 | 1888.45 | No | Month-to-month | DSL | Mailed check |


In [4]:
telco.shape

(7043, 24)

In [5]:
telco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   payment_type_id           7043 non-null   int64  
 1   internet_service_type_id  7043 non-null   int64  
 2   contract_type_id          7043 non-null   int64  
 3   customer_id               7043 non-null   object 
 4   gender                    7043 non-null   object 
 5   senior_citizen            7043 non-null   int64  
 6   partner                   7043 non-null   object 
 7   dependents                7043 non-null   object 
 8   tenure                    7043 non-null   int64  
 9   phone_service             7043 non-null   object 
 10  multiple_lines            7043 non-null   object 
 11  online_security           7043 non-null   object 
 12  online_backup             7043 non-null   object 
 13  device_protection         7043 non-null   object 
 14  tech_sup

In [6]:
telco.describe()

| | payment_type_id | internet_service_type_id | contract_type_id | senior_citizen | tenure | monthly_charges |
|---|---|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 |
| mean | 2.315633 | 1.872923 | 1.690473 | 0.162147 | 32.371149 | 64.761692 |
| std | 1.148907 | 0.737796 | 0.833755 | 0.368612 | 24.559481 | 30.090047 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 18.25 |
| 25% | 1.0 | 1.0 | 1.0 | 0.0 | 9.0 | 35.5 |
| 50% | 2.0 | 2.0 | 1.0 | 0.0 | 29.0 | 70.35 |
| 75% | 3.0 | 2.0 | 2.0 | 0.0 | 55.0 | 89.85 |
| max | 4.0 | 3.0 | 3.0 | 1.0 | 72.0 | 118.75 |


#### Takeaways

- total_charges should appear in the .describe() output above, but it is missing because it has a non-numeric dtype

- After several attempts to change total_charges into a float, we discovered that some cells contained only empty spaces. Once we replace those, we can see a few missing values there as well, and we can change the dtype

- We decided to drop the 0 values (found in the total_charges column) from the data set, because we have plenty of data points for our analysis, and these customers haven't even had a chance to churn yet.

- After viewing the data following this change, we felt that keeping the customer_id was vital, but we'd like to 'take it out' of the data for future scaling, exploration, etc. So, we will set the index to the customer_id

- We have a ton of variables that are related, so we're looking to combine a few columns into single 'encoded' variables. The first combination is whether a customer has phone service at all and, if they do, whether they have one line or more: 0 for none, 1 for yes but only 1 line, 2 for yes and more than 1 line

- Next we look at the dependents and partner columns. We'll address them the same way: 0 for 'no and no', 1 for 'yes and no', 2 for 'no and yes', and 3 for 'yes and yes'.

- We felt that dropping the type_id fields for payment, internet_service, and contract was appropriate because these are artifacts of the joining process during the SQL pull

- We'll drop the columns partner, dependents, phone_service, tenure, and multiple_lines, because we've added 'encoded' versions to the dataframe

- Now we need to encode several remaining columns: payment_type, internet_service_type, and contract_type. We will use the one hot encoder because we're not 'ranking' the different options - they are categorical in nature.

- Now that all the one hot encoded dataframes are ready, we can do a *train, test split*. We'll split into train and test at 80/20, then split train into train and validate at 80/20.
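
The two-stage 80/20 split described in the last bullet can be sketched as below. This is an illustration on a synthetic frame, not the code in `prepare.py`; the `random_state` value and the use of `stratify` are assumptions, and the `*_toy` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data: 100 rows, 25% churned
toy = pd.DataFrame({"monthly_charges": np.linspace(20, 120, 100),
                    "churn": [0, 0, 0, 1] * 25})
X = toy.drop(columns="churn")
y = toy["churn"]

# First cut: hold out 20% of all rows as the test set
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

# Second cut: 20% of what remains becomes the validate set
X_train_toy, X_validate_toy, y_train_toy, y_validate_toy = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=123, stratify=y_rest)

print(len(X_train_toy), len(X_validate_toy), len(X_test_toy))
```

Splitting 80/20 twice leaves 64% of the rows in train, 16% in validate, and 20% in test.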

In [7]:
# We're now left with this code based on functions in the prepare.py file. They accomplish all items addressed
# above, and return all train, validate and test datasets.

X_train, y_train, X_validate, y_validate, X_test, y_test = prepare.split_telco(telco)

In [17]:
y_train = y_train.to_frame()

In [18]:
y_train

| customer_id | churn |
|---|---|
| 3714-JTVOV | Yes |
| 3049-SOLAY | Yes |
| 5035-PGZXH | No |
| 1051-EQPZR | No |
| 8755-OGKNA | No |
| ... | ... |
| 3640-PHQXK | Yes |
| 8593-WHYHV | Yes |
| 0455-XFASS | No |
| 5519-TEEUH | No |


In [None]:
print('   train: %d rows' % X_train.shape[0])
print('validate: %d rows' % X_validate.shape[0])
print('    test: %d rows' % X_test.shape[0])

In [None]:
X_train.head()

In [None]:
#telco[telco.customer_id == '3714-JTVOV']

<a id='explore_data'></a>

## 3. Explore Data

What are we really asking when we ask the question, why do our customers churn?

- The first important thing to know is that customer churn is part of consumer behavior.

- With that in mind, we want to understand what drove that behavior. To do that we need to walk in that customer's shoes and look through their eyes. Why did they cancel their service? What motivated them to leave?

During the exploration phase for this project we'll break down customer segments against known churn. This will give us an understanding of which segments are most likely to churn - and potentially give us actionable insights.

In [None]:
# Some basic Tableau graph exploration
# Wanted to see some of the totals for a few features against one another

from IPython.display import Image
Image(filename="img/tableau_exploration.png", width=800)

Takeaways:
    
- The split between female and male customers in the dataset is almost equal

- Far more customers have phone service than do not

- The customer base has fewer senior citizens than non-seniors

- Month-to-month is the predominant contract type

- Fiber optic is the most common internet service type, but DSL and no internet service are not far behind

- Electronic check is the only clear favorite among payment types, with the other types almost tied for second place

Let's dive deeper.

In [None]:
# Let's take a look at what is happening with just churn.

plt.figure(figsize=(16,9))
X_train.churn.value_counts().plot.bar().set_title("Churn")
plt.show()

In [None]:
# what is our typical monthly revenue per customer that has churned?
mean_charges = X_train[X_train.churn == 1].monthly_charges.mean()
mean_charges

In [None]:
# how many customers have churned in our train dataset?
churned = X_train[X_train.churn == 1].monthly_charges.count()
churned

In [None]:
mean_charges * churned

Takeaway:
    
- Our average monthly revenue per churned customer is $74.87

- We have 1190 churned customers in our train dataset

- Which means we have potentially lost somewhere in the neighborhood of $89,094.10 in monthly revenue from these churned customers
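
The arithmetic behind these three figures is a mean, a count, and their product, which is the same as summing the monthly charges of the churned rows. A small sketch on made-up numbers (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Made-up rows: 1 = churned, 0 = still a customer
toy = pd.DataFrame({"churn": [1, 0, 1, 0, 1],
                    "monthly_charges": [70.0, 30.0, 80.0, 55.0, 90.0]})

churned_rows = toy[toy["churn"] == 1]

mean_churned_charge = churned_rows["monthly_charges"].mean()  # average bill among churned customers
n_churned = len(churned_rows)                                 # number of churned customers
lost_monthly_revenue = mean_churned_charge * n_churned        # same as the sum of their bills
```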

In [None]:
# is there correlation in our variables?

X_train.corr()

In [None]:
# Though it looks busy, we can use this correlation table to locate some important info
# And the results are not what we'd expect
X_train.corr().iloc[11].sort_values()[0:10]

In [None]:
corr = X_train.corr()

plt.figure(figsize=(20,12))

ax = sns.heatmap(
    corr,
    #annot = True,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

Takeaways:
    
- We have a ton of variables
- Ideally, we'd like to have variables that are as independent as possible from one another.
- There are a few variables that seem to have weak correlation with one another. We can run a few tests to see if there is any significance in those findings.
- A few features seem to have a bit of correlation, though in the negative direction, which is interesting. Again, we can run a few tests on this to provide clarity.
- A possible hypothesis worth further investigation is the correlation between churn and month-to-month contracts.
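
One way to read a single column out of a busy correlation table is to select it by name rather than by position. The sketch below assumes the frame contains a numeric `churn` column, as the train data appears to; the toy values are made up.

```python
import pandas as pd

# Made-up rows: month-to-month customers churn more, long-tenure customers less
toy = pd.DataFrame({
    "Month-to-month": [1, 1, 1, 1, 0, 0, 0, 0],
    "tenure_years": [0.2, 0.5, 1.0, 3.0, 2.0, 4.0, 5.0, 6.0],
    "churn": [1, 1, 1, 0, 0, 0, 0, 0],
})

# Correlation of every feature with churn, sorted from most negative to most positive
churn_corr = toy.corr()["churn"].drop("churn").sort_values()
print(churn_corr)
```

Selecting by column name keeps working if columns are added or reordered, which a positional lookup such as `.iloc[11]` does not guarantee.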

In [None]:
sns.set(font_scale=1.2)
plt.figure(figsize=(10,6))
sns.barplot(x="senior_citizen", y="churn", data=X_train, ci=None)
plt.title('Senior Citizens & Churn')
plt.ylabel('Churn')
plt.xlabel('')
#plt.xticks(['No', 'Yes'])
# senior_citizen is encoded 0 = not a senior, 1 = senior, so the tick labels follow that order
plt.xticks(np.arange(2), ('No', 'Yes'))
#plt.xtick.label.set_fontsize(14) 

plt.show()

In [None]:

plt.figure(figsize=(10,6))
sns.barplot(x="gender", y="churn", data=X_train, ci=None)
plt.title('Gender & Churn')
plt.ylabel('Churn')
plt.xlabel('')
plt.xticks(np.arange(2), ('Female', 'Male'))

plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="online_security", y="churn", data=X_train, ci=None)
plt.title('Online Security & Churn')
plt.ylabel('Churn')
plt.xlabel('')
plt.xticks(np.arange(3), ('No', 'No Internet Service', 'Yes'))

plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="online_backup", y="churn", data=X_train, ci=None)
plt.title('Online Backup & Churn')
plt.ylabel('Churn')
plt.xlabel('')
plt.xticks(np.arange(3), ('No', 'No Internet Service', 'Yes'))

plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="tech_support", y="churn", data=X_train, ci=None)
plt.title('Tech Support & Churn')
plt.ylabel('Churn')
plt.xlabel('')
plt.xticks(np.arange(3), ('No', 'No Internet Service', 'Yes'))

plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="device_protection", y="churn", data=X_train, ci=None)
plt.title('Device Protection & Churn')
plt.ylabel('Churn')
plt.xlabel('')
plt.xticks(np.arange(3), ('No', 'No Internet Service', 'Yes'))

plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="streaming_tv", y="churn", data=X_train, ci=None)
plt.title('Streaming TV & Churn')
plt.ylabel('Churn')
plt.xlabel('')
plt.xticks(np.arange(3), ('No', 'No Internet Service', 'Yes'))

plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="streaming_movies", y="churn", data=X_train, ci=None)
plt.title('Streaming Movies & Churn')
plt.ylabel('Churn')
plt.xlabel('')
plt.xticks(np.arange(3), ('No', 'No Internet Service', 'Yes'))

plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="paperless_billing", y="churn", data=X_train, ci=None)
plt.title('Paperless Billing & Churn')
plt.ylabel('Churn')
plt.xlabel('')
plt.xticks(np.arange(2), ('No', 'Yes'))

plt.show()

In [None]:
# after reviewing the correlation plots and some of these barplots, we'd like to look at a few crosstabs
# and run some statistical tests

$H_0$: There is no significant correlation between household size and churn.

$H_a$: There is a significant correlation between household size and churn.

$\alpha$ = 0.05

In [None]:
# Chi2-Test for Household Size (0: Single, 1: Partner Only, 2: Dependents Only, 3: Partner & Dependents)

household_size = pd.crosstab(X_train.partner_and_dependents, X_train.churn)
household_size

In [None]:
chi2, p_household, degf, expected_household = stats.chi2_contingency(household_size)

print(expected_household)
print(f"The p value is: {p_household:.35f}. We reject the null hypothesis.")

#### Takeaways:
- The p-value is less than our alpha ($\alpha = 0.05$). Therefore, as a customer segment, household size seems to be statistically significant.
- Observed churn among customers with both a partner and dependents is about 47% below the expected count, which seems to be a rather large gap. Why is this group not churning?
- Single member households have the highest percentage of churn among all the groups.
- For partner-only households, the expected churn count is about 4% higher than the observed count. Not a big number, but it might be worth further investigation.
- For dependents-only households, the expected churn count is more than 20% higher than the observed count, which is a bit larger. Certainly worth looking into what might be going on here.
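
The observed-versus-expected comparisons above can be computed directly from the chi-square output instead of being read off by eye. A minimal sketch on a made-up 2x2 table (the counts are hypothetical, and the reject/fail-to-reject decision is derived from `p` rather than typed into the message):

```python
import pandas as pd
from scipy import stats

# Made-up crosstab: rows are household groups, columns are churn (0 = stayed, 1 = churned)
observed = pd.DataFrame(
    [[300, 200],
     [400, 100]],
    index=pd.Index([0, 3], name="partner_and_dependents"),
    columns=pd.Index([0, 1], name="churn"),
)

chi2, p, dof, expected = stats.chi2_contingency(observed)
expected = pd.DataFrame(expected, index=observed.index, columns=observed.columns)

# Percent difference of observed from expected, cell by cell
pct_diff = (observed - expected) / expected * 100

alpha = 0.05
reject_null = p < alpha
decision = "reject" if reject_null else "fail to reject"
print(pct_diff.round(1))
print(f"p = {p:.3g}; we {decision} the null hypothesis.")
```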

$H_0$: There is no significant correlation between gender and churn.

$H_a$: There is a significant correlation between gender and churn.

$\alpha$ = 0.05

In [None]:
# Chi2-Test for Gender (0: Female, 1: Male)

gender = pd.crosstab(X_train.gender, X_train.churn)
gender

In [None]:
chi2, p_gender, degf, expected_gender = stats.chi2_contingency(gender)

print(expected_gender)
print(f"The p value is: {p_gender:.2f}. We fail to reject the null hypothesis.")

#### Takeaways:
- The p-value is greater than our alpha ($\alpha = 0.05$). Churn rates for female and male customers are almost indistinguishable.
- Not sure it's worth doing more exploration of this segment of the customer base.
- Perhaps a Cramér's V could be computed to measure the strength of the association.
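
For reference, Cramér's V is $\sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ for a table with $r$ rows, $c$ columns, and $n$ observations; it runs from 0 (no association) to 1. A sketch of how it might be computed, using a hypothetical helper `cramers_v` and made-up tables (Yates' correction is switched off here, which is an assumption about how the statistic should be calculated):

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramér's V for a contingency table (hypothetical helper for illustration)."""
    table = np.asarray(table)
    # Uncorrected chi-square statistic
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Made-up tables: one where the groups behave almost identically, one where they differ sharply
v_weak = cramers_v([[250, 90], [255, 88]])
v_strong = cramers_v([[90, 10], [10, 90]])
print(round(v_weak, 3), round(v_strong, 3))
```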

$H_0$: There is no significant correlation between senior citizens and churn.

$H_a$: There is a significant correlation between senior citizens and churn.

$\alpha$ = 0.05

In [None]:
# Chi2-Test for Senior Citizen (0: Not A Senior, 1: Senior)

senior = pd.crosstab(X_train.senior_citizen, X_train.churn)
senior

In [None]:
chi2, p_senior, degf, expected_senior = stats.chi2_contingency(senior)

print(expected_senior)
print(f"The p value is: {p_senior:.26f}. We reject the null hypothesis")

#### Takeaways:
- The p-value is less than our alpha ($\alpha = 0.05$). It would seem senior_citizen is statistically significant.
- Observed churn among senior citizens looks to be about 60% higher than expected. This is a very large difference. However, the gap in size between the senior and non-senior customer groups is stark as well: senior citizens are not a large portion of our customer base.

<div class="alert alert-block alert-warning">

<b>Question:</b> Can we isolate some opportunies within groupby segments?

</div>

In [None]:
X_train.groupby(["senior_citizen"])[["churn","monthly_charges","tenure_years"]].mean()

Takeaway:

- As noted previously, senior citizens make up a small portion of our train data, but 42% are churning. Their monthly spend is also a bit higher.
- They seem to average a decent tenure though, so maybe we can offer some incentives to keep them longer and reduce churn.

In [None]:
X_train.groupby(["partner_and_dependents", "Month-to-month", "One year", "Two year"])[["churn","monthly_charges"]].mean()

Takeaway:

- Single member and Partner Only households have the highest average rates of churn, with Single member the highest.
- Month-to-month contracts have the highest churn rate in every segment.

In [None]:
X_train.groupby(["Month-to-month", "One year", "Two year", "DSL", "Fiber optic", "None"])[["churn","monthly_charges"]].mean()


Takeaway:

- Once again the highest churn rate is coming from month-to-month contracts. Worth noting these customers also have fiber optic service. Also, fiber optic service is more expensive than DSL.

- Though, we don't think fiber optic service alone is the problem, as the churn rate in other contract types is not nearly as high.

- No picket signs necessary for our fiber optic service, just need to work on extending the contract length per customer.

In [None]:
X_train.groupby(["partner_and_dependents",'Electronic check', 'Mailed check', 'Credit card (automatic)',
       'Bank transfer (automatic)'])[["churn","monthly_charges"]].mean()

Takeaway:

- It seems that electronic checks have the highest churn among payment types.

In [None]:
X_train.groupby(['online_security', 'online_backup','device_protection', 'tech_support'])[["churn","monthly_charges"]].mean()

Takeaway:

- Highest churn seems to be among those that don't take advantage of any of the additional services. Maybe they feel stuck or lost with a particular service, then churn since they have no place to turn.

- Those with all four services have the lowest rate of churn.

In [None]:
X_train.groupby(['streaming_tv', 'streaming_movies', 'paperless_billing'])[["churn","monthly_charges"]].mean()

<div class="alert alert-block alert-warning">

<b>Question:</b> Have any features popped up as being more useful for predicting churn?

</div>

Takeaway:

A few features do seem to show a correlation with predicting churn.

- Senior Citizen, specifically if the customer is a senior
- Household size, specifically if it is single member, then partner only
- Online Security, only
- Online Backup, only
- Having more than a single premium/additional feature, particularly all four
- Month-to-month contracts
- Fiber optic service, though this might be due to cost
- Electronic Check payment type

<div class="alert alert-block alert-warning">

<b>Question:</b> As we reviewed the list above, one of the features stood out again and again: month-to-month contracts. What does the churn rate look like at the end of one year on month-to-month vs a one year contract?

</div>

In [None]:
# basically right around/right after that one year mark is up
twelve_month_tenure = X_train[(X_train['tenure_years'] >= 1) & (X_train['tenure_years'] <= 1.12)]
twelve_month_tenure.groupby(['Month-to-month', 'One year', 'Two year'])[["churn"]].mean()

In [None]:
sns.barplot(y="churn", x="Month-to-month", data=twelve_month_tenure, ci=None)
plt.show()

Takeaway:

- Right at, or shortly after, the 12 month period, a much larger portion of the month-to-month have churned.

<a id='baseline_model'></a>

## 4. Create a Baseline Model

First, we will create a dataframe to store all of our predicted values during the modeling process.

The first two columns to be placed in it will be the actual values and the baseline model's predictions.
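
As an illustration of what a baseline can look like, the sketch below assumes the baseline always predicts the most frequent class in train and measures how often that is right. The toy labels and the `*_toy` names are hypothetical, and this may differ from how the baseline is encoded in this notebook.

```python
import pandas as pd

# Made-up training labels
toy_y = pd.Series(["No", "No", "Yes", "No", "Yes", "No"], name="churn")

evaluation_toy = pd.DataFrame({"actual": toy_y})

# Baseline: predict the most common class for every customer
most_common = toy_y.mode()[0]
evaluation_toy["baseline"] = most_common

# Share of rows where the baseline matches the actual value
baseline_accuracy = (evaluation_toy["actual"] == evaluation_toy["baseline"]).mean()
print(f"Baseline predicts '{most_common}' with accuracy {baseline_accuracy:.2f}")
```

Any model worth keeping should beat this number, since it is the accuracy available without looking at a single feature.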

In [None]:
#y_train.reset_index().drop(columns='customer_id')

In [19]:
# We create a new dataframe to store the predicted values from all the models.
# This makes it significantly easier to compare our models' performance against the actual values
# and the baseline.

# y_train was converted to a one-column DataFrame above; squeeze() turns it back into a
# Series (and leaves a Series unchanged) so it can be used as a single column here.
evaluation = pd.DataFrame({"actual": y_train.squeeze()})
evaluation["baseline"] = 0

In [15]:
evaluation

| customer_id | actual | baseline |
|---|---|---|
| 3714-JTVOV | Yes | 0 |
| 3049-SOLAY | Yes | 0 |
| 5035-PGZXH | No | 0 |
| 1051-EQPZR | No | 0 |
| 8755-OGKNA | No | 0 |
| ... | ... | ... |
| 3640-PHQXK | Yes | 0 |
| 8593-WHYHV | Yes | 0 |
| 0455-XFASS | No | 0 |
| 5519-TEEUH | No | 0 |


In [16]:
# squeeze() gives a Series whether y_train is currently a Series or a one-column DataFrame
y_train.squeeze().value_counts()

<a id='modeling'></a>

## 5. Create and Compare Different Models

<a id='predict_test'></a>

## 6. Predict on Test Model

<a id='csv_export'></a>

## 7. Exporting CSV with Predictions
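
The project plan calls for a CSV holding the customer_id, the probability of churn, and the churn prediction. One possible shape for that export is sketched below on made-up data; the model choice, the column names `probability_of_churn` and `prediction`, and the toy customers are all assumptions, not the final deliverable.

```python
import io
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up customers indexed by customer_id, with a single feature
X_toy = pd.DataFrame(
    {"monthly_charges": [20.0, 35.0, 50.0, 70.0, 85.0, 100.0]},
    index=pd.Index(["A1", "B2", "C3", "D4", "E5", "F6"], name="customer_id"),
)
y_toy = pd.Series([0, 0, 0, 1, 1, 1], index=X_toy.index, name="churn")

model = LogisticRegression().fit(X_toy, y_toy)

predictions = pd.DataFrame(
    {
        # Column 1 of predict_proba is the probability of class 1 (churned)
        "probability_of_churn": model.predict_proba(X_toy)[:, 1],
        "prediction": model.predict(X_toy),
    },
    index=X_toy.index,
)

# Written to an in-memory buffer here; passing a file path to to_csv would write a file instead
buffer = io.StringIO()
predictions.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping customer_id as the index means it is written as the first column of the CSV without any extra step.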