# Capstone Step 4 Preprocessing and Training Data Development

 **Pre-processing and Training Data Development includes**  
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set

In [1]:
#importing libraries
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib

In [2]:
#Loading the csv file and storing it in a variable covid
covid = pd.read_csv("step_3_output.csv")

In [3]:
#Printing the first five rows of the dataframe
covid.head()

Unnamed: 0.1,Unnamed: 0,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,...,median_age,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
0,1,Asia,Afghanistan,2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83
1,2,Asia,Afghanistan,2020-01-02,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83
2,3,Asia,Afghanistan,2020-01-03,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83
3,4,Asia,Afghanistan,2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83
4,5,Asia,Afghanistan,2020-01-05,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83


In [4]:
#Printing the maximum and minimum values of date
covid['date'].min(), covid['date'].max()

('2019-12-31', '2020-08-02')

In [5]:
#Checking whether there are any NaN values in the dataframe
covid.isnull().sum()

Unnamed: 0                    0
continent                     0
location                      0
date                          0
total_cases                   0
new_cases                     0
total_deaths                  0
new_deaths                    0
total_cases_per_million       0
new_cases_per_million         0
total_deaths_per_million      0
new_deaths_per_million        0
new_tests                     0
total_tests                   0
total_tests_per_thousand      0
new_tests_per_thousand        0
tests_units                   0
stringency_index              0
population                    0
population_density            0
median_age                    0
gdp_per_capita                0
extreme_poverty               0
cardiovasc_death_rate         0
diabetes_prevalence           0
female_smokers                0
male_smokers                  0
handwashing_facilities        0
hospital_beds_per_thousand    0
life_expectancy               0
dtype: int64

<font color = 'teal'>Creating a datetime index</font>

In [6]:
#Converting the date column into index using pd.to_datetime()
covid.index = pd.to_datetime(covid["date"])
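
As a side note, the parse-then-drop steps can also be written as one chain. The sketch below is a minimal illustration on a made-up toy frame (the name `toy` and its values are assumptions, not part of the covid data); it shows that `set_index` removes the column for us and that a `DatetimeIndex` allows slicing by date label.

```python
import pandas as pd

# Toy frame standing in for the covid data (values are made up for illustration)
toy = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "new_cases": [0.0, 1.0, 3.0],
})

# Parse the date strings, then promote the column to the index.
# set_index removes "date" from the columns, so no separate drop is needed.
toy_indexed = toy.assign(date=pd.to_datetime(toy["date"])).set_index("date")

# A DatetimeIndex supports label-based slicing by date (both ends inclusive)
first_two_days = toy_indexed.loc["2020-01-01":"2020-01-02"]
print(first_two_days)
```

Either form gives the same datetime-indexed frame; the chained version simply avoids the intermediate state where `date` exists both as the index and as a column.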

In [7]:
#Dropping the date column from the dataframe and storing the result in covid
covid = covid.drop(columns=['date'])
#Printing the first five rows of the dataframe
covid.head()

date,Unnamed: 0,continent,location,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,...,median_age,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
2020-01-01,1,Asia,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83
2020-01-02,2,Asia,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83
2020-01-03,3,Asia,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83
2020-01-04,4,Asia,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83
2020-01-05,5,Asia,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,18.6,1803.987,11.5572,597.029,9.59,0.0,0.0,37.746,0.5,64.83


In [8]:
#Printing the shape of the dataframe 
covid.shape

(26081, 29)

<font color = 'teal'>Create dummy features for categorical variables</font>

Create dummy variables for continent. 

In [9]:
#Creating one indicator column per continent
df = pd.get_dummies(covid['continent'])
df.head()

date,Africa,Asia,Europe,North America,Oceania,South America
2020-01-01,0,1,0,0,0,0
2020-01-02,0,1,0,0,0,0
2020-01-03,0,1,0,0,0,0
2020-01-04,0,1,0,0,0,0
2020-01-05,0,1,0,0,0,0


Add the dummies back to the covid dataframe and remove the original column for continent.

In [10]:
covid=pd.concat([covid,df],axis=1)
covid=covid.drop('continent',axis=1)
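
The three steps above (encode, concatenate, drop) can also be done in a single call. The sketch below is a minimal illustration on a made-up toy frame (the name `toy` and its values are assumptions); passing `columns=[...]` to `pd.get_dummies` encodes the named column and removes the original in one step.

```python
import pandas as pd

# Toy frame standing in for the covid data (values are made up for illustration)
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia"],
    "new_cases": [0.0, 2.0, 5.0],
})

# Encode "continent" in place: the original column is dropped and the
# indicator columns are appended with the default "continent_" prefix
encoded = pd.get_dummies(toy, columns=["continent"])
print(encoded.columns.tolist())
```

Note that the column names differ from the notebook's (they carry a `continent_` prefix by default). For linear models, `drop_first=True` is an option worth considering, since a full set of indicator columns always sums to one per row.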

In [15]:
#Dropping the leftover 'Unnamed: 0' column, which is just the old row index from the CSV
covid = covid.drop(columns=['Unnamed: 0'])
#Printing the first five rows of the dataframe
covid.head()

date,location,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,new_tests,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,Africa,Asia,Europe,North America,Oceania,South America
2020-01-01,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00181,...,0.0,37.746,0.5,64.83,0,1,0,0,0,0
2020-01-02,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00362,...,0.0,37.746,0.5,64.83,0,1,0,0,0,0
2020-01-03,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00543,...,0.0,37.746,0.5,64.83,0,1,0,0,0,0
2020-01-04,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00724,...,0.0,37.746,0.5,64.83,0,1,0,0,0,0
2020-01-05,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00905,...,0.0,37.746,0.5,64.83,0,1,0,0,0,0


<font color = 'teal'>Standardize the magnitude of numeric features</font>

Using sklearn's preprocessing module, bring the features of the dataframe onto a common scale. The location column is not needed for modeling, so it is dropped here, along with tests_units. We also hold out our response variable, "total_cases", so that its true values are available when we review model performance.

In [16]:
# First we import the preprocessing package from the sklearn library
from sklearn import preprocessing

# Declare the explanatory variables, called X, by dropping 'location' and 'tests_units',
# plus the response 'total_cases' so that it does not leak into the features
X = covid.drop(['location', 'tests_units', 'total_cases'], axis=1)

# Declare a response variable, called y, and assign it the total_cases column of covid
y = covid['total_cases']

# Here we create a MinMaxScaler, which rescales each feature to the [0, 1] range,
# and call its fit() method with parameter X
scaler = preprocessing.MinMaxScaler().fit(X)

# Declare a variable called X_scaled, and assign it the result of calling the transform() method with parameter X
X_scaled = scaler.transform(X)
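
The cell above uses `MinMaxScaler`, while the section heading speaks of standardizing. The two are different transforms, and the sketch below compares them on a made-up three-value column (the array `values` is an assumption for illustration only): min-max scaling maps each feature to the [0, 1] range, whereas `StandardScaler` centers it to mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature column with three observations
values = np.array([[0.0], [5.0], [10.0]])

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(values)

# Standardization: (x - mean) / std, so the result has mean 0 and std 1
standard_scaled = StandardScaler().fit_transform(values)

print(minmax_scaled.ravel())
print(standard_scaled.ravel())
```

Which one suits the covid features better depends on the model used later; this comparison only shows what each transform does.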

<font color = 'teal'>Split into training and testing datasets</font>

Import train_test_split from sklearn.model_selection and create a 75/25 train/test split with y = total_cases. We start by using the total COVID-19 cases as our response variable for modeling.

In [17]:
# Import the train_test_split function from the sklearn.model_selection utility.  
from sklearn.model_selection import train_test_split

# Get the 1-dimensional flattened array of our response variable y by calling the ravel() function on y
y = np.ravel(y)

# Call the train_test_split() function with the first two parameters set to X_scaled and y 
# Declare four variables, X_train, X_test, y_train and y_test separated by commas 
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y, test_size=0.25, random_state=1)
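
The outline at the top lists "Apply scaler to the testing set" as the last step. One common way to do that, sketched below on synthetic data, is to split first, fit the scaler on the training rows only, and then apply the same fitted scaler to the test rows. All names here (`X_demo`, `y_demo`, `demo_scaler`, and so on) are hypothetical and the data is random; this is an illustration of the ordering, not the notebook's own cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and response (random values)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Split the raw, unscaled data into 75% train and 25% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Fit the scaler on the training rows only, so that the test rows
# do not influence the learned minimum and maximum
demo_scaler = MinMaxScaler().fit(X_tr)

# Apply the same fitted scaler to both sets
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the scaled test values may fall slightly outside [0, 1], which is expected: the test set is being treated as unseen data.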