# Preprocessing

## Table of contents
* [Overview](#Overview)
* [1. Imports](#1.-Imports)
* [2. Data Load](#2.-Load-data-and-drop-variables-to-run-dummy-variables)
* [3. Dummy Variables](#3.-Dummy-Variables)
* [4. Scale Data](#4.-Scale-Data)
* [5. Concatenate Dataframes](#5.-Concatenate-Dataframes)
* [6. Split train and test](#6.-Split-train-and-test-dataset)

## Overview

1. Select the categorical variables that are suitable for a machine learning model.
2. Create dummy variables from them.
3. Scale the int and float variables with StandardScaler.
4. Concatenate the dummy variables, the scaled data, and the dependent variable ('Churn Label').
5. Split the result into train and test datasets.

## 1. Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve

## 2. Load data and drop variables to run dummy variables

In [2]:
# data load
df = pd.read_pickle('/Users/hansangjun/Desktop/Springboard/Capstone2/data/telco_data/AfterWrangling.pkl')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   City               7043 non-null   object 
 1   Zip Code           7043 non-null   object 
 2   Latitude           7043 non-null   float64
 3   Longitude          7043 non-null   float64
 4   Gender             7043 non-null   object 
 5   Senior Citizen     7043 non-null   object 
 6   Partner            7043 non-null   object 
 7   Dependents         7043 non-null   object 
 8   Tenure Months      7043 non-null   int64  
 9   Phone Service      7043 non-null   object 
 10  Multiple Lines     7043 non-null   object 
 11  Internet Service   7043 non-null   object 
 12  Online Security    7043 non-null   object 
 13  Online Backup      7043 non-null   object 
 14  Device Protection  7043 non-null   object 
 15  Tech Support       7043 non-null   object 
 16  Streaming TV       7043 

In [8]:
# Before modeling, drop the location columns and the object columns with too many unique values.
df = df.drop(columns=['City','Zip Code', 'Latitude', 'Longitude', 'Churn Reason'])
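
One way to see which object columns have "too many" unique values is to count the distinct levels per column before dropping. The sketch below is illustrative only: the toy frame, the `max_levels` cut-off, and the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the telco data (values are made up for illustration).
toy = pd.DataFrame({
    "City": ["Los Angeles", "San Diego", "Fresno", "Oakland"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
}, dtype=object)

# Count the distinct values in each object-typed column.
cardinality = toy.select_dtypes(include=["object"]).nunique()

# Hypothetical cut-off: flag columns with more than 3 distinct values.
max_levels = 3
high_cardinality = cardinality[cardinality > max_levels].index.tolist()
print(high_cardinality)

# Drop the flagged columns, mirroring the df.drop(columns=...) call above.
toy_reduced = toy.drop(columns=high_cardinality)
print(toy_reduced.columns.to_list())
```

On the real data the threshold would need to be chosen by inspection; a column with thousands of levels would otherwise produce thousands of dummy columns.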

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             7043 non-null   object 
 1   Senior Citizen     7043 non-null   object 
 2   Partner            7043 non-null   object 
 3   Dependents         7043 non-null   object 
 4   Tenure Months      7043 non-null   int64  
 5   Phone Service      7043 non-null   object 
 6   Multiple Lines     7043 non-null   object 
 7   Internet Service   7043 non-null   object 
 8   Online Security    7043 non-null   object 
 9   Online Backup      7043 non-null   object 
 10  Device Protection  7043 non-null   object 
 11  Tech Support       7043 non-null   object 
 12  Streaming TV       7043 non-null   object 
 13  Streaming Movies   7043 non-null   object 
 14  Contract           7043 non-null   object 
 15  Paperless Billing  7043 non-null   object 
 16  Payment Method     7043 

## 3. Dummy Variables
1. Extract the object-type column names.
2. Remove the dependent variable from that list.
3. Create the dummy variables.

In [11]:
# Extract object type column names into a new list.
names_dummies = df.select_dtypes(include=['object']).columns.to_list()

# Dependent variable doesn't have to be a dummy variable.
names_dummies.remove('Churn Label')
print(names_dummies)

['Gender', 'Senior Citizen', 'Partner', 'Dependents', 'Phone Service', 'Multiple Lines', 'Internet Service', 'Online Security', 'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method']


In [12]:
# Create dummy variables.
df = pd.get_dummies(df, columns=names_dummies, prefix=names_dummies)

# Print the columns names
print(df.columns)

Index(['Tenure Months', 'Monthly Charges', 'Total Charges', 'Churn Label',
       'Churn Score', 'CLTV', 'Gender_Female', 'Gender_Male',
       'Senior Citizen_No', 'Senior Citizen_Yes', 'Partner_No', 'Partner_Yes',
       'Dependents_No', 'Dependents_Yes', 'Phone Service_No',
       'Phone Service_Yes', 'Multiple Lines_No',
       'Multiple Lines_No phone service', 'Multiple Lines_Yes',
       'Internet Service_DSL', 'Internet Service_Fiber optic',
       'Internet Service_No', 'Online Security_No',
       'Online Security_No internet service', 'Online Security_Yes',
       'Online Backup_No', 'Online Backup_No internet service',
       'Online Backup_Yes', 'Device Protection_No',
       'Device Protection_No internet service', 'Device Protection_Yes',
       'Tech Support_No', 'Tech Support_No internet service',
       'Tech Support_Yes', 'Streaming TV_No',
       'Streaming TV_No internet service', 'Streaming TV_Yes',
       'Streaming Movies_No', 'Streaming Movies_No internet servi

In [13]:
df.head()

| | Tenure Months | Monthly Charges | Total Charges | Churn Label | Churn Score | CLTV | Gender_Female | Gender_Male | Senior Citizen_No | Senior Citizen_Yes | ... | Streaming Movies_Yes | Contract_Month-to-month | Contract_One year | Contract_Two year | Paperless Billing_No | Paperless Billing_Yes | Payment Method_Bank transfer (automatic) | Payment Method_Credit card (automatic) | Payment Method_Electronic check | Payment Method_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 53.85 | 108.15 | Yes | 86 | 3239 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | 70.7 | 151.65 | Yes | 67 | 2701 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 8 | 99.65 | 820.5 | Yes | 86 | 5372 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 28 | 104.8 | 3046.05 | Yes | 84 | 5003 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 49 | 103.7 | 5036.3 | Yes | 89 | 5340 | 0 | 1 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |


The dummy variables have been created.
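
To make the encoding step concrete, here is a minimal sketch on a three-row toy frame (the values are made up). It follows the same calls as the cell above and also shows `drop_first=True`, an option of `pd.get_dummies` that the notebook does not use; it drops one level per variable, which some linear models prefer because the full set of dummies for a variable always sums to 1.

```python
import pandas as pd

# Toy data (made up); 'Churn Label' plays the role of the dependent variable.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "Two year", "One year"],
    "Churn Label": ["Yes", "No", "No"],
}, dtype=object)

# Same pattern as the notebook: object columns minus the dependent variable.
cols = toy.select_dtypes(include=["object"]).columns.to_list()
cols.remove("Churn Label")

# Full encoding: one column per level, as in the notebook.
full = pd.get_dummies(toy, columns=cols, prefix=cols)

# Alternative (not used in the notebook): drop the first level of each variable.
reduced = pd.get_dummies(toy, columns=cols, prefix=cols, drop_first=True)

print(full.columns.to_list())
print(reduced.columns.to_list())
```

Whether to drop a level depends on the model used later; tree-based models are generally indifferent to it.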

## 4. Scale Data
1. Collect the int and float columns.
2. Define the scaler.
3. Fit the scaler on scale_df.
4. Transform the data using the fitted scaler.
5. Convert the result to a dataframe.

In [14]:
# collect int and float
names_list = ['Tenure Months', 'Monthly Charges', 'Total Charges', 'Churn Score', 'CLTV']
names_list_SS = ['Tenure Months_SS', 'Monthly Charges_SS', 'Total Charges_SS', 'Churn Score_SS', 'CLTV_SS']
scale_df = df[names_list]

# define scaler
scaler = StandardScaler()

# fit the scaler on scale_df
scaler.fit(scale_df)

# transform the data using fitted scaler
scaled_df = scaler.transform(scale_df)

# make it dataframe
scaled_df = pd.DataFrame(scaled_df, columns=names_list_SS) 
scaled_df.head()

| | Tenure Months_SS | Monthly Charges_SS | Total Charges_SS | Churn Score_SS | CLTV_SS |
|---|---|---|---|---|---|
| 0 | -1.236724 | -0.36266 | -0.958066 | 1.268402 | -0.981675 |
| 1 | -1.236724 | 0.197365 | -0.938874 | 0.38565 | -1.436462 |
| 2 | -0.992402 | 1.159546 | -0.643789 | 1.268402 | 0.821409 |
| 3 | -0.177995 | 1.330711 | 0.338085 | 1.175481 | 0.509483 |
| 4 | 0.677133 | 1.294151 | 1.21615 | 1.407784 | 0.794358 |
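
StandardScaler standardizes each column as z = (x - mean) / std, using the population standard deviation (`ddof=0`). The sketch below checks this by hand on the five rows shown by `df.head()` earlier. Because the scaler here is fitted on those five rows only, the numbers will not match the table above, which comes from all 7043 rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of two numeric columns, copied from the df.head() output.
toy = pd.DataFrame({
    "Tenure Months": [2, 2, 8, 28, 49],
    "Monthly Charges": [53.85, 70.70, 99.65, 104.80, 103.70],
})

# Fit and transform in one call (equivalent to fit followed by transform).
scaled = StandardScaler().fit_transform(toy)

# Manual z-score; ddof=0 matches the population std that StandardScaler uses.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1 on the data the scaler was fitted on.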


## 5. Concatenate Dataframes

In [15]:
# drop original int and float variables
df.drop(columns=names_list, inplace=True)

# concatenating scaled_df and df along columns
df = pd.concat([scaled_df, df], axis=1)
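
`pd.concat(..., axis=1)` aligns rows by index label, not by position. It works here because both `scaled_df` and `df` carry the same default RangeIndex (0 to 7042). The toy sketch below, with made-up values and hypothetical names, shows what would happen if the indexes differed and how resetting the index avoids it.

```python
import pandas as pd

# left has a fresh RangeIndex 0..2, like a frame built from a NumPy array.
left = pd.DataFrame({"x_SS": [0.1, 0.2, 0.3]})

# right has a non-default index, e.g. after filtering rows.
right = pd.DataFrame({"flag": [1, 0, 1]}, index=[10, 11, 12])

# Labels do not overlap, so concat produces the union of rows filled with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first restores positional alignment.
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

A quick check of the row count and of `isna().sum()` after concatenation catches this kind of misalignment early.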

In [16]:
df.head()

Unnamed: 0,Tenure Months_SS,Monthly Charges_SS,Total Charges_SS,Churn Score_SS,CLTV_SS,Churn Label,Gender_Female,Gender_Male,Senior Citizen_No,Senior Citizen_Yes,...,Streaming Movies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,Paperless Billing_No,Paperless Billing_Yes,Payment Method_Bank transfer (automatic),Payment Method_Credit card (automatic),Payment Method_Electronic check,Payment Method_Mailed check
0,-1.236724,-0.36266,-0.958066,1.268402,-0.981675,Yes,0,1,1,0,...,0,1,0,0,0,1,0,0,0,1
1,-1.236724,0.197365,-0.938874,0.38565,-1.436462,Yes,1,0,1,0,...,0,1,0,0,0,1,0,0,1,0
2,-0.992402,1.159546,-0.643789,1.268402,0.821409,Yes,1,0,1,0,...,1,1,0,0,0,1,0,0,1,0
3,-0.177995,1.330711,0.338085,1.175481,0.509483,Yes,1,0,1,0,...,1,1,0,0,0,1,0,0,1,0
4,0.677133,1.294151,1.21615,1.407784,0.794358,Yes,0,1,1,0,...,1,1,0,0,0,1,1,0,0,0


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 49 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Tenure Months_SS                          7043 non-null   float64
 1   Monthly Charges_SS                        7043 non-null   float64
 2   Total Charges_SS                          7043 non-null   float64
 3   Churn Score_SS                            7043 non-null   float64
 4   CLTV_SS                                   7043 non-null   float64
 5   Churn Label                               7043 non-null   object 
 6   Gender_Female                             7043 non-null   uint8  
 7   Gender_Male                               7043 non-null   uint8  
 8   Senior Citizen_No                         7043 non-null   uint8  
 9   Senior Citizen_Yes                        7043 non-null   uint8  
 10  Partner_No                          

The info() output shows the scaled variables, the dummy variables, and the dependent variable in a single dataframe.

In [17]:
df.to_pickle('df_preprocessed_1016.pkl')

## 6. Split train and test dataset

In [18]:
X = df.loc[:, df.columns != 'Churn Label']
y = df.loc[:, df.columns == 'Churn Label']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=0)
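
In this notebook the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation. A common alternative ordering, sketched below under assumptions, splits first and fits the scaler on the training rows only. The toy data is randomly generated and the names `num_cols`, `X_train_scaled`, and `X_test_scaled` are hypothetical; this is not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated toy data (not the telco dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Tenure Months": rng.integers(1, 73, size=100),
    "Monthly Charges": rng.uniform(18.0, 120.0, size=100),
    "Churn Label": rng.choice(["Yes", "No"], size=100),
})
num_cols = ["Tenure Months", "Monthly Charges"]

X = toy.drop(columns="Churn Label")
y = toy["Churn Label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training rows only, then apply the same parameters to both sets.
scaler = StandardScaler().fit(X_train[num_cols])
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train[num_cols]), columns=num_cols, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test[num_cols]), columns=num_cols, index=X_test.index
)

print(X_train_scaled.mean().round(6).to_list())
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centered, which is expected.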

In [19]:
# check the dimensions of the split dataset
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4930, 48)
(2113, 48)
(4930, 1)
(2113, 1)
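
The shapes follow from the split arithmetic: with a float `test_size`, `train_test_split` rounds the test size up, so 0.3 × 7043 = 2112.9 becomes 2113 test rows and the remaining 4930 go to training. The sketch below reproduces that arithmetic and, as an optional extra that the notebook does not use, shows the `stratify` parameter, which keeps the class proportions equal in both subsets. The toy labels are made up.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

# Reproduce the row counts reported above.
n_samples = 7043
n_test = math.ceil(0.3 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Toy imbalanced labels (made up): 30 "Yes" and 70 "No".
y = np.array(["Yes"] * 30 + ["No"] * 70)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 30/70 ratio in both the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
train_share = (y_tr == "Yes").mean()
test_share = (y_te == "Yes").mean()
print(train_share, test_share)
```

Stratifying can be useful for churn data, where the positive class is usually the minority, but whether it is needed here depends on the class balance of 'Churn Label'.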
