#### Preprocessing Data

the preprocessing step

all the data we've been using so far has been pretty nice, it's been in a format that lets you plug and play into scikit-learn without any extra processing, but that won't be the case in the real world, you'll have to preprocess your data before you can build models

the scikit-learn api will not accept categorical features like "male" and "female" so you'll have to encode them numerically, you can do this by splitting the feature into a number of binary features called dummy variables, with 0 meaning the observation wasn't that category and 1 meaning it was. if you have 3 colors like red, blue, and purple, you only need 2 dummy variables (say red and blue) because if both blue and red are 0 you implicitly know that the color is purple

boxplots are useful for visualizing categorical features

you could use scikit-learn's OneHotEncoder() or pandas' get_dummies()
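
to see both options side by side, here's a minimal sketch on a made-up color column (the `colors` dataframe is hypothetical and has nothing to do with the auto data). both calls drop the first category alphabetically, so blue ends up as the implicit baseline here instead of purple, the idea is the same either way.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data, only for illustration
colors = pd.DataFrame({'color': ['red', 'blue', 'purple', 'blue']})

# pandas: drop_first=True keeps k-1 dummy columns for k categories
dummies = pd.get_dummies(colors, drop_first=True, dtype=int)
print(dummies)

# scikit-learn: drop='first' does the same thing
# the result is a sparse matrix so convert it to a dense array to look at it
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(colors[['color']]).toarray()
print(encoder.categories_)
print(encoded)
```

a row of all zeros means the dropped category, so the two blue rows come out as zeros in both versions.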

In [None]:
# encoding how mpg varies by US, ASIA, and EUROPE using dummy variables
import pandas as pd

# read in the dataframe 
df = pd.read_csv('auto.csv')
# apply the get_dummies() function
df_origin = pd.get_dummies(df)

print(df_origin.head())
# this will create 3 new binary features, looking at them you know if Europe and US are both 0 that the car's origin is Asia
# that 3rd column is redundant info so we can drop it

# drop the redundant column
df_origin = df_origin.drop('origin_Asia', axis=1)
print(df_origin.head())
# another option is to pass drop_first=True to get_dummies: df_origin = pd.get_dummies(df, drop_first=True)

# now that you have the dummy variables you can fit models just like before
# fit the ridge regression model to the data and compute its r squared
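from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

# assumption: mpg is the target and every other column is a feature
X = df_origin.drop('mpg', axis=1).values
y = df_origin['mpg'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# older scikit-learn accepted Ridge(normalize=True), that option was removed in version 1.2
# so the results here can differ from the ones you get with normalize=True
ridge = Ridge(alpha=0.5).fit(X_train, y_train)

ridge.score(X_test, y_test)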

In [None]:
# exercise example, exploring categorical features using a boxplot
# Import pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()

In [None]:
# exercise example, ridge regression with categorical features
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Instantiate a ridge regressor: ridge
# (older scikit-learn accepted normalize=True here, that option was removed in version 1.2)
ridge = Ridge(alpha=0.5)

# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)

# Print the cross-validated scores
print(ridge_cv)

#### Handling Missing Data

missing data can exist because of a lack of observation, a transcription error, corrupted data, etc.

sometimes you'll look at the info of a dataframe and all the features will have the correct number of non-null entries but you have to remember that missing values can be encoded in a bunch of different ways (like zeroes, question marks, negatives), for example, if insulin, BMI, or thickness of skin are 0 that doesn't make sense since that's not possible

you can make all those weird 0 entries NaN by using replace() on the relevant columns

to deal with missing data you could drop all rows that are missing data, but if you lose half the data that is bad
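
to make the hidden-missing-value problem concrete, here's a small sketch with a made-up dataframe (the column names mirror the diabetes data but the values are invented). it counts the suspicious zeros, converts them to NaN, and shows how many rows dropna() would throw away.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, zeros stand in for missing measurements
toy = pd.DataFrame({
    'insulin': [0, 94, 168, 0],
    'bmi': [33.6, 0.0, 28.1, 43.1],
    'age': [50, 31, 21, 33],
})

# columns where a value of 0 is physically impossible
cols = ['insulin', 'bmi']

# info() would report no nulls here, so count the zeros directly
zero_counts = (toy[cols] == 0).sum()
print(zero_counts)

# swap the zeros for NaN in just those columns
toy[cols] = toy[cols].replace(0, np.nan)
print(toy.isnull().sum())

# dropping incomplete rows keeps only 1 of the 4 rows
remaining = toy.dropna()
print(remaining.shape)
```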

a more robust option could be to **impute** (make an educated guess about) missing data, you could compute the mean of all the non-missing entries and then fill that in for all the missing ones

imputers are also known as transformers, any model that can transform data like the example below (using the transform method) is called a transformer
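
here's a tiny sketch of mean imputation on a made-up array so you can check the numbers by hand. it assumes a recent scikit-learn, where the imputer is `SimpleImputer` in `sklearn.impute`, and `X_toy` is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical toy feature matrix with two missing entries
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() learns the column means from the non-missing entries
imp.fit(X_toy)
print(imp.statistics_)  # column means: 2.0 and 15.0

# transform() fills each NaN with the mean of its column
X_filled = imp.transform(X_toy)
print(X_filled)
```

the fit/transform split matters because you can fit on the training data and then apply the same means to the test data.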

after you transform the data you could then fit a supervised learning model to it, you can do both at once using a scikit-learn pipeline object

In [None]:
# check out the dataset
df = pd.read_csv('diabetes.csv')
df.info()

df.head()

In [None]:
# replace the impossible 0 entries with NaN
import numpy as np

df['insulin'] = df['insulin'].replace(0, np.nan)
df['triceps'] = df['triceps'].replace(0, np.nan)
df['bmi'] = df['bmi'].replace(0, np.nan)
df.info()

In [None]:
# drop all rows with missing data
df = df.dropna()
df.shape

In [None]:
# impute missing data
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22, SimpleImputer replaces it
import numpy as np
from sklearn.impute import SimpleImputer

# the missing values are represented by NaN, strategy is to use the mean, SimpleImputer always imputes along columns
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit the imputer to the data
imp.fit(X)

# transform the data
X = imp.transform(X)

In [None]:
# imputing within a pipeline
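import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# SimpleImputer replaces the old Imputer, which was removed in scikit-learn 0.22
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# instantiate a logreg model
logreg = LogisticRegression()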

# build the pipeline object, construct a list of steps
# each step is a 2-tuple containing the name for the step and the estimator 
# all the steps except the last have to be a transformer, the last step has to be an estimator like a classifier or regressor 
steps = [('imputation', imp), ('logistic_regression', logreg)]
# pass the list to the pipeline constructor
pipeline = Pipeline(steps)

# split the data into training and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# fit the pipeline to the training set
pipeline.fit(X_train, y_train)
# predict on the test set
y_pred = pipeline.predict(X_test)

# compute accuracy
pipeline.score(X_test, y_test)

In [None]:
# exercise example, replace ? with NaN and then drop those suckers from the dataframe
# Convert '?' to NaN
df[df == '?'] = np.nan

# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop missing values and print shape of new DataFrame
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))

In [None]:
# exercise example, imputing the most frequent value in a pipeline, with a Support Vector Machine classifier
# Import the imputer (SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22)
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC #support vector classification, a type of SVM

# Setup the Imputation transformer: imp
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]

In [None]:
# exercise example, imputing missing data by using the pipeline interface and create a classification report
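# Import necessary modules
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Setup the pipeline steps: steps
# (SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22)
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('SVM', SVC())]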

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))

#### Centering and Scaling

centering and scaling (normalizing) your data is another important preprocessing step for machine learning

you can use df.describe() to check out the ranges of the feature variables. many ml models use some form of distance to inform them, so if you have features on really large scales they could unduly influence the model. one example is that knn explicitly uses distance when making predictions, and because of that we want features to be on a similar scale
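
a rough sketch of why scale matters for distances, using three made-up observations (the age and income features and their values are hypothetical). before scaling, the income column swamps the age column, after standardizing both features contribute on a similar footing.

```python
import numpy as np

# hypothetical observations: [age in years, income in dollars]
a = np.array([25.0, 50000.0])
b = np.array([60.0, 50100.0])  # very different age, almost the same income
c = np.array([26.0, 51000.0])  # almost the same age, somewhat different income

# raw euclidean distances, the income column dominates
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(raw_ab, raw_ac)

# standardize each column using the mean and std of these three points
points = np.vstack([a, b, c])
scaled = (points - points.mean(axis=0)) / points.std(axis=0)

# after scaling, a 35 year age gap counts about as much as a 1000 dollar income gap
scaled_ab = np.linalg.norm(scaled[0] - scaled[1])
scaled_ac = np.linalg.norm(scaled[0] - scaled[2])
print(scaled_ab, scaled_ac)
```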

scaling will have minimal effect when all the features are binary though :(

ways to normalize data:
- **standardization**: for a column, subtract the mean and divide by the standard deviation so that all the features are centered around 0 and have a variance of 1
- you could subtract the minimum and divide by the range of the data so the normalized dataset has minimum 0 and maximum 1
- normalize so that the data ranges from -1 to 1
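
the three options above can be sketched on a made-up column like this. `X_toy` is hypothetical, and using `MinMaxScaler` with `feature_range=(-1, 1)` for the last one is just one way to get a -1 to 1 range, the notes don't say which method was meant.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# hypothetical single feature column
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# standardization: subtract the mean, divide by the standard deviation
X_std = StandardScaler().fit_transform(X_toy)
print(X_std.mean(), X_std.std())

# same thing by hand to show there's no magic
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# min-max scaling: subtract the minimum, divide by the range
X_01 = MinMaxScaler().fit_transform(X_toy)
print(X_01.ravel())

# same idea rescaled so the data runs from -1 to 1
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_toy)
print(X_pm1.ravel())
```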

In [None]:
# check out the ranges of the feature variables
print(df.describe())

In [None]:
# scale with scikit-learn
from sklearn.preprocessing import scale

# pass the feature data to the scale() method
X_scaled = scale(X)

# you can compare the overall mean and st dev of the original and scaled data to see the change
print(np.mean(X), np.std(X)) # original
print(np.mean(X_scaled), np.std(X_scaled)) # scaled

In [None]:
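# you can also put a scaler in a pipeline object
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]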
pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

accuracy_score(y_test, y_pred) # you get .956

# performing knn without scaling
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
knn_unscaled.score(X_test, y_test) # you get .928 which isn't as good as the scaled data 

In [None]:
# CV and scaling in a pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)

# specify the hyperparameter space by creating a dictionary
# the keys are the pipeline step name followed by a double underscore, followed by the hyperparameter name
# the corresponding value is a list or array of the values to try for that particular hyperparameter
# in this example we're only tuning the n neighbors in the KNN model
parameters = {'knn__n_neighbors': np.arange(1, 50)}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

# perform a grid search over the parameters pipeline
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)

# predict on the estimator with the best found parameters and we do this on the holdout set
y_pred = cv.predict(X_test)

# print the best parameters, accuracy(score) , and classification report
print(cv.best_params_)
print(cv.score(X_test, y_test))
print(classification_report(y_test, y_pred))

In [None]:
# exercise example, the whole shebang
#  build a pipeline that includes scaling and hyperparameter tuning to classify wine quality

# Setup the pipeline
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

# Instantiate the GridSearchCV object, cv, with the pipeline and hyperparameter space with 3 fold cross validation
# 3 was the default in older scikit-learn, since version 0.22 the default is 5 so it's passed explicitly here
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

In [None]:
# exercise example, the whole shebang again
# build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data
# then tune the l1_ratio of your ElasticNet using GridSearchCV

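# Setup the pipeline steps: steps
# impute the missing data, scale the features, instantiate an elastic net regressor
# (SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet

steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]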

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))