# Classification Project

Please fill out:
* Student name: Wayne Lam
* Student pace: Full Time
* Scheduled project review date/time: 
* Instructor name: Sean Abu
* Blog post URL:


## Import Modules, Packages, and Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import classification_funcs as cl_f

pd.set_option('display.max_columns', 300)

In [2]:
# Data from https://www.kaggle.com/becksddf/churn-in-telecoms-dataset
df = pd.read_csv('bigml_59c28831336c6604c800002a.csv')
df.head(3)

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |


In [3]:
model_df = cl_f.ModelDF(df,
                        cat_features=['area code', 'state', 'international plan', 'voice mail plan'], 
                        target='churn')

### Comments on Initial DataFrame Features
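- Area code: stored as an integer, but it is a label rather than a quantity; convert to categorical and create dummies
- International plan and voice mail plan: yes/no strings; convert to dummies
- State: convert to categorical and create dummies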
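
The conversions listed above can be sketched with plain pandas. This is only an illustration on a toy frame with made-up rows; it is not the code inside `classification_funcs`, whose internals are not shown here. The `drop_first=True` choice is an assumption: the 54 indicator columns reported in the cleaning log would be consistent with dropping one level per feature, assuming 51 state values and 3 area codes.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset.
toy = pd.DataFrame({
    'area code': [415, 408, 510, 415],
    'international plan': ['no', 'yes', 'no', 'no'],
    'total day minutes': [265.1, 161.6, 243.4, 299.4],
})

cat_features = ['area code', 'international plan']

# Cast to category first so numeric-looking labels (area code)
# are treated as levels rather than quantities.
for col in cat_features:
    toy[col] = toy[col].astype('category')

# Replace each categorical column with indicator columns.
# drop_first=True (an assumption) drops one level per feature.
encoded = pd.get_dummies(toy, columns=cat_features, drop_first=True)

print(list(encoded.columns))
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters mainly for linear models such as logistic regression.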

In [4]:
# Initial data cleaning
model_df._drop_columns(['phone number'])
model_df._cat_and_drop()
model_df._info()

---
List of columns dropped: ['phone number']
---
Added dummies for and dropped "area code"
Added dummies for and dropped "state"
Added dummies for and dropped "international plan"
Added dummies for and dropped "voice mail plan"
Now has 70 columns
---
Shape: (3333, 70)
There are 54 uint8 features
There are 8 float64 features
There are 7 int64 features
There are 1 bool features


## EDA

In [None]:
# Plot histograms of continuous/"semi-continuous" features
model_df._multi_plot(plot='hist')

In [None]:
# Plot LM plots of continuous/"semi-continuous" features in relation to churn rate
model_df._multi_plot(plot='lmplot')

### Comments on Initial EDA
- Number vmail messages: test whether churn rate differs between customers with no voicemail messages and customers with at least one (chi-squared)
- Total intl calls: right-skewed
- Customer service calls: right-skewed
- Total calls (international, evening, night) do not appear to increase churn rate
- Total minutes and charges seem to increase churn rate
    - Possible explanation: customers who make long calls may want unlimited-minutes plans
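
The right-skew observations above come from reading the histograms. A minimal sketch of how the same impression could be checked numerically uses `Series.skew()`; the values below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical call counts: most customers make few calls, one makes many.
calls = pd.Series([0, 1, 1, 1, 2, 2, 3, 9], name='customer service calls')

# Sample skewness: positive values indicate a longer right tail.
skewness = calls.skew()

print(f'{calls.name} skewness: {skewness:.2f}')
```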

In [5]:
# Null hypothesis: churn rate of (VM Messages = 0) = churn rate of (VM Messages > 0), alpha = 0.05
model_df._chi_sq('number vmail messages')

Reject Null Hypothesis
Chi-Squared: 34.77733072701296
p-value: 2.8067167404597397e-08
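
For readers without access to `classification_funcs`, a test of this kind can be sketched with `scipy.stats.chi2_contingency` on a 2x2 table of "has voicemail messages" against churn. The toy counts below are invented, and whether `_chi_sq` builds its table this way or applies the Yates continuity correction (SciPy's default for 2x2 tables) is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data with hypothetical counts: 60 customers without voicemail
# messages (20 churned) and 40 with voicemail messages (4 churned).
toy = pd.DataFrame({
    'number vmail messages': [0] * 60 + [25] * 40,
    'churn': [True] * 20 + [False] * 40 + [True] * 4 + [False] * 36,
})

# Split customers into "no messages" vs. "at least one message".
has_vm = toy['number vmail messages'] > 0

# 2x2 contingency table: rows = has_vm, columns = churn.
table = pd.crosstab(has_vm, toy['churn'])

# Chi-squared test of independence (Yates correction applied by default).
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05
decision = ('Reject Null Hypothesis' if p_value < alpha
            else 'Fail to Reject Null Hypothesis')
print(decision)
print('Chi-Squared:', chi2)
print('p-value:', p_value)
```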


### Chi-Squared Test of VM Messages
- Churn rate appears to differ between customers with 0 VM messages and customers with >0 VM messages (p-value well below alpha = 0.05)
- Next step: bin VM messages (e.g., 0 vs. >0)
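
One way the binning step could look, sketched on a toy column. The two-bin split (0 vs. >0) follows the groups used in the chi-squared test; the new column name `has vmail messages` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column with hypothetical message counts.
toy = pd.DataFrame({'number vmail messages': [0, 25, 26, 0, 0, 31]})

# Binary indicator: 1 if the customer has any voicemail messages, else 0.
toy['has vmail messages'] = (
    toy['number vmail messages'] > 0
).astype(np.uint8)

print(toy)
```

Since only customers with a voice mail plan can be expected to have messages, this indicator may largely duplicate the voice mail plan dummy; it would be worth checking before keeping both.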