In [1]:
import warnings
warnings.simplefilter('ignore')

# %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Cleaning and Preprocessing Data for Machine Learning

Preprocessing is the process of preparing data for analysis. There is no single "correct" way to do it; the right approach depends on the data and the type of analysis. In this notebook, we'll look at encoding categorical variables, scaling, and normalizing.

**Dataset:**  brain_categorical.csv

**Source:** R.J. Gladstone (1905). ["A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp105-123](https://doi.org/10.1093/biomet/4.1-2.105)

**Description:** Brain weight (grams) and head size (cubic cm) for 237 adults classified by gender and age group.

Variables/Columns
- **GENDER:** Gender  \[*Male* or *Female*\]
- **AGE:** Age Range  \[*20-46* or *46+*\]
- **SIZE:** Head size (cm^3)
- **WEIGHT:** Brain weight (grams)



### Read the csv file into a pandas DataFrame

In [2]:
Strain_Frame = pd.read_csv('../Resources/Strain_Frame.csv')
Strain_Frame.head()

Unnamed: 0,Strain,Rating,Type: Hybrid,Type: Indica,Type: Sativa,Effect: Creative,Effect: Energetic,Effect: Tingly,Effect: Euphoric,Effect: Relaxed,...,Descriptor: Potent,Descriptor: Body High,Descriptor: Head High,Descriptor: Daytime,Descriptor: Nighttime,Descriptor: Outside,Descriptor: Creative,Descriptor: Psychedelic,Descriptor: Lazy,Descriptor: Calm
0,100 Og,4.0,1,0,0,1,1,1,1,1,...,1,1,1,0,0,0,0,0,0,0
1,98 White Widow,4.7,1,0,0,1,1,0,0,1,...,1,1,0,0,0,0,0,0,0,0
2,1024,4.4,0,0,1,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,13 Dawgs,4.2,1,0,0,1,0,1,0,1,...,1,1,0,0,0,0,1,0,0,0
4,24K Gold,4.6,1,0,0,0,0,0,1,1,...,1,0,0,0,0,0,0,0,0,0


### Split data and labels and reshape

In [4]:
X = Strain_Frame[["Type: Hybrid", "Type: Indica", "Type: Sativa"]]
y = Strain_Frame["Rating"].values.reshape(-1, 1)
print(X.shape, y.shape)

(2351, 3) (2351, 1)


## Working with Categorical Data

What's wrong with the following code?

```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
```

Machine learning algorithms work with numerical data, so strings have to be converted into meaningful numbers. Common choices are integer, one-hot, and binary encoding. Sklearn provides a preprocessing module for these standard techniques. Pandas also provides a `get_dummies` function that generates binary encoded data from a DataFrame.
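As a minimal sketch of integer encoding, the example below maps a string column to integers with sklearn's `LabelEncoder`. The `toy` DataFrame and its `color` column are hypothetical and not part of the dataset above. Note that integer codes imply an ordering that may not exist in the data, which is one reason one-hot encoding is often preferred for input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, not taken from the dataset above
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Integer encoding: each distinct label is mapped to an integer.
# LabelEncoder sorts the labels, so blue -> 0, green -> 1, red -> 2.
encoder = LabelEncoder()
toy["color_int"] = encoder.fit_transform(toy["color"])

print(list(encoder.classes_))     # ['blue', 'green', 'red']
print(toy["color_int"].tolist())  # [2, 1, 0, 1]
```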

## Dummy Encoding (Binary Encoded Data)

Dummy Encoding (also known as One-Hot Encoding) transforms each categorical feature into new columns with a 1 (True) or 0 (False) encoding to represent if that categorical label was present or not in the original row. 

Pandas provides a shortcut to create Binary Encoded data.

In [None]:
# data = X.copy()

# # using get_dummies with a single column
# data_binary_encoded = pd.get_dummies(data, columns=["gender"])
# data_binary_encoded.head()

We can encode multiple columns using `get_dummies`.

In [None]:
# data = X.copy()

# #using get_dummies across all categorical columns
# data_binary_encoded = pd.get_dummies(X)
# data_binary_encoded.head()
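
To illustrate what `get_dummies` produces, here is a small self-contained sketch. The `toy` DataFrame and its values are made up for illustration; only the column names echo the commented-out cells above.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one string column
toy = pd.DataFrame({
    "size": [10, 20, 30],
    "gender": ["Male", "Female", "Male"],
})

# Only the listed string column is expanded into indicator columns;
# the numeric column passes through unchanged
encoded = pd.get_dummies(toy, columns=["gender"])

print(encoded.columns.tolist())  # ['size', 'gender_Female', 'gender_Male']
```

Depending on the pandas version, the indicator columns are stored as booleans or as small integers; either way they behave as 0/1 values when passed to a model.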

## Scaling and Normalization

The final step that we need to perform is scaling and normalization. Many algorithms will perform better with a normalized or scaled dataset. You may not see a difference with the Sklearn LinearRegression model, but other models that incorporate calculated distances into the training process may benefit from normalization. 

Additionally, normalization is beneficial when you're working with input features that use significantly different scales (e.g., age vs income).

Sklearn provides a variety of scaling and normalization options. The two most common are MinMaxScaler and StandardScaler. Use StandardScaler when you don't know anything about your data.
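
The difference between the two scalers can be sketched on a tiny array. The `data` values below are invented for illustration (think of the columns as age and income); they are not drawn from the notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical two-feature array on very different scales
data = np.array([[20.0, 20000.0],
                 [30.0, 50000.0],
                 [40.0, 80000.0]])

# MinMaxScaler maps each column onto the range [0, 1] by default
minmax = MinMaxScaler().fit_transform(data)

# StandardScaler centers each column on 0 and scales it to unit standard deviation
standard = StandardScaler().fit_transform(data)

print(minmax)
print(standard.mean(axis=0), standard.std(axis=0))
```

After either transform, both columns live on a comparable scale, so neither dominates a distance calculation simply because of its units.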

The first step is to split your data into Training and Testing using `train_test_split`.

In [5]:
from sklearn.model_selection import train_test_split

# Make sure X is a DataFrame so that .head() is available after the split
X = pd.DataFrame(X)

# Hold out a test set (25% of the rows by default); random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

X_train.head()

### StandardScaler

Now we fit a StandardScaler to the training data only. The fitted scaler can then be applied to any future data. Fitting on the training set and only transforming the test set keeps information about the test data out of the preprocessing step. Otherwise, statistics from the test data would leak into training and bias our evaluation.

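StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so the scaled training data has a mean of 0 and a standard deviation of 1. It shifts and rescales the data but does not change the shape of its distribution. We can see the difference in the following plots.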
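
The transform can also be checked by hand. The sketch below uses a made-up single-feature `train`/`test` pair and compares `StandardScaler` with the formula z = (x - mean) / std, where the mean and standard deviation come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test split of a single feature
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

# Fit on the training data only
scaler = StandardScaler().fit(train)

# Manual version: subtract the training mean and divide by the training
# standard deviation (NumPy's default ddof=0, the population formula)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler.transform(test), manual_test)
```

Because the test point is scaled with the training statistics, it can land well outside the range of the scaled training data; that is expected.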

### Fit the training data to the StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)

### Create variables to hold the scaled train & test data

In [None]:
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)

### Plot the scaled data

In [None]:
# Create your subplots
fig1 = plt.figure(figsize=(12, 6))
axes1 = fig1.add_subplot(1, 2, 1)
axes2 = fig1.add_subplot(1, 2, 2)

# Add title labels
axes1.set_title("Original Data")
axes2.set_title("Scaled Data")

# Using your max x & y values, set the plot axis limits for your original data
# (the first feature column is used, matching the scaled plot below)
maxx = X_train.iloc[:, 0].max()
maxy = y_train.max()
axes1.set_xlim(-maxx - 100, maxx + 100)
axes1.set_ylim(-maxy - 100, maxy + 100)

# Set limits for your scaled data
axes2.set_xlim(-3, 3)
axes2.set_ylim(-3, 3)

# Use a function to apply plot formatting, to avoid having to write it out twice
def set_axes(ax):
    ax.spines['left'].set_position('center')
    ax.spines['right'].set_color('none')
    ax.spines['bottom'].set_position('center')
    ax.spines['top'].set_color('none')
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
    
# apply formatting function to each axis
set_axes(axes1)
set_axes(axes2)

# plot the original data and the scaled data (first feature column in both)
axes1.scatter(X_train.iloc[:, 0], y_train)
axes2.scatter(X_train_scaled[:, 0], y_train_scaled[:])

# Put it all together

### Step 1) Convert Categorical data to numbers using Integer or Binary Encoding

In [None]:
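# Use the DataFrame loaded at the top of the notebook; get_dummies leaves
# columns that are already numeric (0/1) unchanged
X = pd.get_dummies(Strain_Frame[["Type: Hybrid", "Type: Indica", "Type: Sativa"]])
y = Strain_Frame["Rating"].values.reshape(-1, 1)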
X.head()

### Step 2) Split data into training and testing data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Step 3) Scale or Normalize your data. Use StandardScaler if you don't know anything about your data.

In [None]:
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)

X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)

### Step 4) Fit the Model to the scaled training data and make predictions using the scaled test data

In [None]:
# Generate the model and fit it to the scaled training data
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_scaled, y_train_scaled)

### Step 5) Plot the residuals

In [None]:
# create a residuals plot using the predictions for both test and train data
plt.scatter(model.predict(X_train_scaled), model.predict(X_train_scaled) - y_train_scaled, c="blue", label="Training Data")
plt.scatter(model.predict(X_test_scaled), model.predict(X_test_scaled) - y_test_scaled, c="orange", label="Testing Data")
plt.legend()

# create a horizontal line at y=0 to show how much error is in each prediction
plt.hlines(y=0, xmin=y_test_scaled.min(), xmax=y_test_scaled.max())
plt.title("Residual Plot")
plt.xlabel("Prediction")
plt.show()

### Step 6) Quantify your model using the scaled data

In [None]:
from sklearn.metrics import mean_squared_error

predictions = model.predict(X_test_scaled)
MSE = mean_squared_error(y_test_scaled, predictions)
r2 = model.score(X_test_scaled, y_test_scaled)

print(f"MSE: {MSE}, R2: {r2}")
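
Because the target was scaled, the MSE above is expressed in scaled units. As a hedged sketch of how it could be reported in the original units, the example below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `model_demo`), skips the train/test split for brevity, and relies on the scaler's `inverse_transform` method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): y is roughly 3x + 50
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3 * X_demo + 50 + rng.normal(0, 1, size=(100, 1))

X_s = StandardScaler().fit(X_demo)
y_s = StandardScaler().fit(y_demo)

model_demo = LinearRegression().fit(X_s.transform(X_demo), y_s.transform(y_demo))

# Predictions come out in scaled units; inverse_transform maps them back
pred_scaled = model_demo.predict(X_s.transform(X_demo))
pred_original = y_s.inverse_transform(pred_scaled)

mse_scaled = mean_squared_error(y_s.transform(y_demo), pred_scaled)
mse_original = mean_squared_error(y_demo, pred_original)
print(f"scaled MSE: {mse_scaled:.4f}, original-unit MSE: {mse_original:.4f}")
```

The two values differ by a factor equal to the variance of the target, while R2 is the same in either set of units.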