# Customer Churn Prediction

## Part 1 - DEFINE

### ---- 1 Define the problem ----    

The most important asset in any company is its people: the human capital. It’s important to find great talent and, more importantly, to keep that talent happy and loyal to the company.   

Salary is, without a doubt, a major factor in attracting great people and keeping them happy in the organization.   

In this project, we want to find out:    

With the data available, can we develop a model that predicts the salary for a specific job and profile?    

How accurate can we get?    

In [5]:
# Import your libraries
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import RFECV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
pd.options.display.max_columns = 999
pd.options.display.float_format = '{:,.2f}'.format
sns.set_style('dark')

# My info here
__author__ = "Sha Brown"
__email__ = "sha821@gmail.com"

# Part 2 - DISCOVER

## ---- 2 Load the data ----

In [7]:
# Load the data into a Pandas dataframe
customer_df = pd.read_csv('customer_churn.csv')

In [9]:
# Display the shape of the dataframe
customer_df.shape

(900, 10)

In [11]:
# Display the first 5 rows
customer_df.head()

Unnamed: 0,Names,Age,Total_Purchase,Account_Manager,Years,Num_Sites,Onboard_date,Location,Company,Churn
0,Cameron Williams,42.0,11066.8,0,7.22,8.0,2013-08-30 07:00:40,"10265 Elizabeth Mission Barkerburgh, AK 89518",Harvey LLC,1
1,Kevin Mueller,41.0,11916.22,0,6.5,11.0,2013-08-13 00:38:46,"6157 Frank Gardens Suite 019 Carloshaven, RI 1...",Wilson PLC,1
2,Eric Lozano,38.0,12884.75,0,6.67,12.0,2016-06-29 06:20:07,"1331 Keith Court Alyssahaven, DE 90114","Miller, Johnson and Wallace",1
3,Phillip White,42.0,8010.76,0,6.71,10.0,2014-04-22 12:43:12,"13120 Daniel Mount Angelabury, WY 30645-4695",Smith Inc,1
4,Cynthia Norton,37.0,9191.58,0,5.56,9.0,2016-01-19 15:31:15,"765 Tricia Row Karenshire, MH 71730",Love-Jones,1


In [14]:
# Show the summary statistics of the dataframe
customer_df.describe()

Unnamed: 0,Age,Total_Purchase,Account_Manager,Years,Num_Sites,Churn
count,900.0,900.0,900.0,900.0,900.0,900.0
mean,41.82,10062.82,0.48,5.27,8.59,0.17
std,6.13,2408.64,0.5,1.27,1.76,0.37
min,22.0,100.0,0.0,1.0,3.0,0.0
25%,38.0,8497.12,0.0,4.45,7.0,0.0
50%,42.0,10045.87,0.0,5.21,8.0,0.0
75%,46.0,11760.1,1.0,6.11,10.0,0.0
max,65.0,18026.01,1.0,9.15,14.0,1.0


## ---- 3 Check the quality of the data ----

In [16]:
# check to see if there are duplicated entries
customer_df.duplicated().sum()

0

In [18]:
# check to see if there are null values
customer_df.isnull().sum()

Names              0
Age                0
Total_Purchase     0
Account_Manager    0
Years              0
Num_Sites          0
Onboard_date       0
Location           0
Company            0
Churn              0
dtype: int64

## ---- 4 Data Processing ----

### Process 'Onboard_date'

In [31]:
# first need to convert Onboard_date into pandas datetime
customer_df['Onboard_date'] = pd.to_datetime(customer_df['Onboard_date'])

In [44]:
# create a column to show onboard year
customer_df['onboard_year'] = customer_df['Onboard_date'].dt.year
# create a column to show onboard month
customer_df['onboard_month'] = customer_df['Onboard_date'].dt.month
# create a column to show onboard weekday (Monday=0, Sunday=6)
customer_df['onboard_weekday'] = customer_df['Onboard_date'].dt.dayofweek
# create a column to show onboard time of day
customer_df['onboard_time'] = customer_df['Onboard_date'].dt.time

In [45]:
# check the dataframe and we can see new features have been added
customer_df.head(2)

Unnamed: 0,Names,Age,Total_Purchase,Account_Manager,Years,Num_Sites,Onboard_date,Location,Company,Churn,onboard_year,onboard_month,onboard_date,onboard_time,onboard_day,onboard_weekday
0,Cameron Williams,42.0,11066.8,0,7.22,8.0,2013-08-30 07:00:40,"10265 Elizabeth Mission Barkerburgh, AK 89518",Harvey LLC,1,2013,8,2013-08-30,07:00:40,4,4
1,Kevin Mueller,41.0,11916.22,0,6.5,11.0,2013-08-13 00:38:46,"6157 Frank Gardens Suite 019 Carloshaven, RI 1...",Wilson PLC,1,2013,8,2013-08-13,00:38:46,1,1
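
The `onboard_time` column produced by `.dt.time` holds Python `time` objects (object dtype), which most estimators cannot consume directly. A minimal sketch of one possible alternative is to keep the hour as an integer. The `sample` frame and the `onboard_hour` column name are assumptions made for illustration; the two timestamps are taken from the rows displayed above.

```python
import pandas as pd

# Small stand-in frame built from the two timestamps shown in the output above.
sample = pd.DataFrame({
    'Onboard_date': pd.to_datetime(['2013-08-30 07:00:40', '2013-08-13 00:38:46'])
})

# .dt.time yields datetime.time objects stored with object dtype.
time_dtype = sample['Onboard_date'].dt.time.dtype

# Hypothetical numeric feature: hour of day as an integer (0-23).
sample['onboard_hour'] = sample['Onboard_date'].dt.hour

# Weekday as in the notebook (Monday=0, Sunday=6).
sample['onboard_weekday'] = sample['Onboard_date'].dt.dayofweek
```

Whether the hour carries any signal for churn is an open question; the sketch only shows how to obtain a numeric column.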


### Process 'Location'

In [51]:
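# first attempt: take the two characters just before a 5-digit ZIP code
# (this position is wrong for ZIP+4 addresses such as 'WY 30645-4695')
customer_df['state'] = customer_df['Location'].apply(lambda x: x[-8:-6])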

In [52]:
customer_df.head(2)

Unnamed: 0,Names,Age,Total_Purchase,Account_Manager,Years,Num_Sites,Onboard_date,Location,Company,Churn,onboard_year,onboard_month,onboard_date,onboard_time,onboard_day,onboard_weekday,state
0,Cameron Williams,42.0,11066.8,0,7.22,8.0,2013-08-30 07:00:40,"10265 Elizabeth Mission Barkerburgh, AK 89518",Harvey LLC,1,2013,8,2013-08-30,07:00:40,4,4,AK
1,Kevin Mueller,41.0,11916.22,0,6.5,11.0,2013-08-13 00:38:46,"6157 Frank Gardens Suite 019 Carloshaven, RI 1...",Wilson PLC,1,2013,8,2013-08-13,00:38:46,1,1,RI


In [76]:
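# Extract the state from Location with a regular expression
# (?P<state>...) defines a named capturing group called 'state'
# note: str.extract returns the first match, i.e. the first two consecutive capital letters in the address
customer_df['state'] = customer_df['Location'].str.extract(r'(?P<state>[A-Z]{2})')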

In [83]:
customer_df.head(2)

Unnamed: 0,Names,Age,Total_Purchase,Account_Manager,Years,Num_Sites,Onboard_date,Location,Company,Churn,onboard_year,onboard_month,onboard_date,onboard_time,onboard_day,onboard_weekday,state
0,Cameron Williams,42.0,11066.8,0,7.22,8.0,2013-08-30 07:00:40,"10265 Elizabeth Mission Barkerburgh, AK 89518",Harvey LLC,1,2013,8,2013-08-30,07:00:40,4,4,AK
1,Kevin Mueller,41.0,11916.22,0,6.5,11.0,2013-08-13 00:38:46,"6157 Frank Gardens Suite 019 Carloshaven, RI 1...",Wilson PLC,1,2013,8,2013-08-13,00:38:46,1,1,RI
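
Both extraction approaches above depend on where the state code happens to sit in the string. A minimal sketch of a stricter variant anchors the match to the trailing ZIP code. It assumes every address ends with `<ST> <ZIP>` or `<ST> <ZIP+4>`; `STATE_ZIP` and `extract_state` are hypothetical names, and the third sample address is invented to show an earlier run of capitals.

```python
import re

# Assumed layout: "... <ST> <ZIP>" with an optional "-dddd" ZIP+4 suffix at the end.
STATE_ZIP = re.compile(r'\b(?P<state>[A-Z]{2}) \d{5}(?:-\d{4})?\s*$')

def extract_state(location):
    """Return the two-letter code that precedes the trailing ZIP, or None."""
    match = STATE_ZIP.search(location)
    return match.group('state') if match else None

samples = [
    "10265 Elizabeth Mission Barkerburgh, AK 89518",   # from the data shown above
    "13120 Daniel Mount Angelabury, WY 30645-4695",    # ZIP+4, from the data shown above
    "USNS Example FPO AA 12345",                       # hypothetical address
]

anchored = [extract_state(s) for s in samples]

# For comparison: the positional slice and the unanchored first-match pattern.
sliced = [s[-8:-6] for s in samples]
first_match = [re.search(r'[A-Z]{2}', s).group() for s in samples]
```

On the dataframe, the same pattern could be applied with `customer_df['Location'].str.extract(STATE_ZIP.pattern, expand=False)`, which returns a Series rather than a one-column DataFrame.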


In [88]:
churn_by_state = pd.DataFrame(customer_df.groupby('state')['Churn'].mean().sort_values(ascending=False))

In [89]:
churn_by_state

state,Churn
MH,0.44
WY,0.43
NJ,0.38
AS,0.36
DE,0.36
...,...
ID,0.06
SD,0.00
TX,0.00
PA,0.00
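
With 900 customers spread over many states, each state's mean churn rests on only a handful of rows, so rates such as 0.00 or 0.44 may reflect small samples more than real differences. A minimal sketch, assuming the same `state` and `Churn` columns, reports the group size next to the rate. The `toy` frame is invented data, and `churn_rate` and `customers` are hypothetical column names.

```python
import pandas as pd

# Hypothetical toy data standing in for customer_df[['state', 'Churn']].
toy = pd.DataFrame({
    'state': ['MH', 'MH', 'WY', 'WY', 'WY', 'TX'],
    'Churn': [1, 0, 1, 0, 0, 0],
})

# Named aggregation: mean churn and number of customers per state.
summary = (
    toy.groupby('state')['Churn']
       .agg(churn_rate='mean', customers='count')
       .sort_values('churn_rate', ascending=False)
)
```

Reading `churn_rate` together with `customers` makes it easier to judge which state-level differences are worth feeding into a model.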
