# Analyze Telco Customer Churn: Preprocessing and Training Data Development
In preparation for a presentation to Telco executives about customer churn, the CFO is asking for an analysis of the factors that most affect churn at the company, along with churn predictions.

This notebook will prepare the data for fitting models.

## Summary of Process
1. **Load dataset**: The dataset cleaned in 1-Data_Prep was loaded into a dataframe. 
2. **Create dummy variables**: Since the vast majority of the variables are categorical, one-hot encoding was used to convert the categories into numerical values. One-hot encoding was used instead of binary encoding because each categorical variable has only a few distinct values (no more than four), so the number of added columns stays small.
3. **Standardize numeric variables**: Given that tenure and the monthly and total charges are on different scales, the three numeric variables were standardized. Standardization was used instead of min/max scaling because new data might contain values above the range seen here for any of the three variables.
4. **Split into training and testing sets**: An 80/20 train/test split was used as a starting point.
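
The reasoning behind step 3 can be illustrated with a minimal sketch. The numbers below are made up for illustration and are not taken from the Telco data. A min/max scaler fitted on a training range maps a larger new value outside the [0, 1] interval, whereas a standardized value is simply a larger z-score, which has no fixed bounds to violate.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical training values and one new, larger observation
train = np.array([[10.0], [20.0], [30.0], [40.0]])
new = np.array([[60.0]])

# Fit both scalers on the training values only
mm = MinMaxScaler().fit(train)
ss = StandardScaler().fit(train)

# Min/max scaling: (60 - 10) / (40 - 10), which lands above the upper bound of 1
mm_new = mm.transform(new)[0, 0]

# Standardization: (60 - mean) / std, an unbounded z-score
ss_new = ss.transform(new)[0, 0]

print(f"min/max: {mm_new:.3f}, standardized: {ss_new:.3f}")
```

Whether an out-of-range value actually matters depends on the model fitted later, so this is only one consideration when choosing a scaler.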

## Data Sources
- summary.csv: cleaned dataset from the 1-Data_Prep Notebook

## Import Libraries

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## File Locations

In [2]:
summary_file = Path.cwd() / "data" / "processed" / "summary.csv"

In [3]:
df = pd.read_csv(summary_file)

In [4]:
df.head().T

Unnamed: 0,0,1,2,3,4
gender,Female,Male,Male,Male,Female
SeniorCitizen,No,No,No,No,No
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No
OnlineBackup,Yes,No,Yes,No,No


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   object 
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod     7043 non-null   object 


## Create dummy features for categorical variables via one-hot encoding

In [6]:
# Drop the target variable: churn
X = df.drop(['Churn'], axis=1)

In [7]:
# Filter for the categorical variables
only_objects = X.select_dtypes(include='object')

In [8]:
# Confirm that only the object columns remain
only_objects.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod
0,Female,No,Yes,No,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check
1,Male,No,No,No,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check
2,Male,No,No,No,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check
3,Male,No,No,No,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic)
4,Female,No,No,No,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check


In [9]:
# Create dummy columns using one-hot encoding
X = pd.get_dummies(X, columns=only_objects.columns)
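
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a toy dataframe. The rows are invented for the example, and the column list is passed explicitly rather than derived from `select_dtypes`. It also shows the optional `drop_first=True` argument, which this notebook does not use: it drops one level per variable, which removes the perfectly redundant column (for example, `Partner_No` is always `1 - Partner_Yes`). That can matter for linear models, so treat it as an option to consider rather than a required step.

```python
import pandas as pd

# Toy rows for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "tenure": [1, 34, 60, 2],
})
cat_cols = ["Contract", "Partner"]

# Full one-hot encoding: one column per category level
full = pd.get_dummies(toy, columns=cat_cols, dtype=int)

# Reduced encoding: the first level of each variable is dropped
reduced = pd.get_dummies(toy, columns=cat_cols, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```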

In [10]:
X.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
tenure,1.0,34.0,2.0,45.0,2.0,8.0,22.0,10.0,28.0,62.0
MonthlyCharges,29.85,56.95,53.85,42.3,70.7,99.65,89.1,29.75,104.8,56.15
TotalCharges,29.85,1889.5,108.15,1840.75,151.65,820.5,1949.4,301.9,3046.05,3487.95
gender_Female,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0
gender_Male,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
SeniorCitizen_No,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
SeniorCitizen_Yes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Partner_No,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
Partner_Yes,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
Dependents_No,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0


## Standardize the numeric variables

In [11]:
# Filter for the numeric variables
only_nums = X.select_dtypes(include=['int', 'float'])

In [12]:
# Confirm that only the numeric columns remain
only_nums.head()

Unnamed: 0,tenure,MonthlyCharges,TotalCharges
0,1,29.85,29.85
1,34,56.95,1889.5
2,2,53.85,108.15
3,45,42.3,1840.75
4,2,70.7,151.65


In [13]:
# Instantiate StandardScaler (adapted from DataCamp's Feature Engineering for Machine Learning in Python course)
SS_scaler = StandardScaler()

In [14]:
# Standardize the numeric variables (each call refits the scaler, so only the TotalCharges statistics are retained afterward)
X['tenure_ss'] = SS_scaler.fit_transform(df[['tenure']])
X['MonthlyCharges_ss'] = SS_scaler.fit_transform(df[['MonthlyCharges']])
X['TotalCharges_ss'] = SS_scaler.fit_transform(df[['TotalCharges']])
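
To make the transformation concrete, the sketch below applies `StandardScaler` to the first five rows shown earlier and compares the result with the formula `(x - mean) / std`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`). Fitting all three columns in one call, as done here, is an alternative to the column-by-column approach and keeps every column's statistics in a single fitted scaler; the names `toy`, `num_cols`, `scaled`, and `manual` are introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of the numeric columns, copied from the output above
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 151.65],
})
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

# One scaler fitted on all three columns at once
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy[num_cols]),
    columns=[f"{col}_ss" for col in num_cols],
    index=toy.index,
)

# Manual z-score with the population standard deviation (ddof=0)
manual = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std(ddof=0)

print(np.allclose(scaled.to_numpy(), manual.to_numpy()))
```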

In [15]:
# Drop the non-standardized columns
X = X.drop(['tenure', 'MonthlyCharges', 'TotalCharges'], axis=1)

In [16]:
X.head().T

Unnamed: 0,0,1,2,3,4
gender_Female,1.0,0.0,0.0,0.0,1.0
gender_Male,0.0,1.0,1.0,1.0,0.0
SeniorCitizen_No,1.0,1.0,1.0,1.0,1.0
SeniorCitizen_Yes,0.0,0.0,0.0,0.0,0.0
Partner_No,0.0,1.0,1.0,1.0,1.0
Partner_Yes,1.0,0.0,0.0,0.0,0.0
Dependents_No,1.0,1.0,1.0,1.0,1.0
Dependents_Yes,0.0,0.0,0.0,0.0,0.0
PhoneService_No,1.0,0.0,0.0,1.0,0.0
PhoneService_Yes,0.0,1.0,1.0,0.0,1.0


## Split into testing and training datasets

In [17]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, df['Churn'], test_size=0.2, random_state=42)
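
One possible refinement, not used in this notebook, is a stratified split, which keeps the churn rate the same in the training and testing sets. The sketch below uses toy data and assumes the target holds the strings "Yes" and "No" (the `Churn` values are not displayed above, so that is an assumption); it also maps the target to 0/1, which some estimators and metrics expect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced toy target (25% "Yes"), for illustration only
toy_X = pd.DataFrame({"tenure_ss": range(20)})
toy_y = pd.Series(["Yes"] * 5 + ["No"] * 15, name="Churn")

# Assumed label values; adjust the mapping if the real column differs
y_bin = toy_y.map({"No": 0, "Yes": 1})

# stratify=y_bin preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, y_bin, test_size=0.2, random_state=42, stratify=y_bin
)

print(y_tr.mean(), y_te.mean())
```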

In [18]:
# Confirm size of X_train and X_test
X_train.shape, X_test.shape

((5634, 46), (1409, 46))
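
The shapes agree with the 80/20 split: 20% of 7043 rows is 1408.6, which is rounded up to 1409 test rows, leaving 5634 training rows.

A point worth considering for later iterations: because the scaler above was fitted on the full dataset before splitting, the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on randomly generated toy data; the value ranges and the names `toy`, `train`, and `test` are arbitrary choices for the example and do not describe the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated toy data (arbitrary ranges, illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=50),
    "MonthlyCharges": rng.uniform(18, 120, size=50),
})
num_cols = ["tenure", "MonthlyCharges"]

# Split first so the test rows never influence the scaling statistics
train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Fit on the training rows only, then apply the same transform to both subsets
scaler = StandardScaler().fit(train[num_cols])
train_scaled = scaler.transform(train[num_cols])
test_scaled = scaler.transform(test[num_cols])

print(train_scaled.mean(axis=0), test_scaled.mean(axis=0))
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centered, which is expected.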