# Another data cleaning technique

Data scaling and normalization (and what the difference is between the two).




The first thing we'll need to do is load in the libraries and datasets we'll be using.

**Note:** If you get an error about missing libraries, stop Jupyter and install the missing one as shown in our previous lesson: `pip install your-missing-library`


In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# read in all our data
houses = pd.read_csv("../DATA/houses.csv")

# set seed for reproducibility
np.random.seed(0)

# Scaling vs. Normalization: What's the difference?
____

One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and they are very similar!  
In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, you're changing the *range* of your data while in normalization you're changing the *shape of the distribution* of your data. 


___

## Scaling

This means that you're transforming your data so that it fits within a specific scale, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like [support vector machines, or SVM](https://en.wikipedia.org/wiki/Support_vector_machine) or [k-nearest neighbors, or KNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). With these algorithms, a change of "1" in any numeric feature is given the same importance. 

For example, you might be looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar! With currency, you can convert between currencies. But what if you're looking at something like height and weight?
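To make the height-and-weight question concrete, here is a minimal sketch with made-up numbers (the `people` array and both helper functions are assumptions for illustration, not part of the lesson's dataset). Before scaling, a 2 kg difference looks five times "further away" than a 40 cm difference, simply because of the units. After min-max scaling, both columns contribute on the same 0-1 scale.

```python
import numpy as np

# Hypothetical toy data: each row is a person, columns are height (m) and weight (kg).
people = np.array([
    [1.5, 60.0],  # A
    [1.9, 60.0],  # B: 40 cm taller than A
    [1.5, 62.0],  # C: 2 kg heavier than A
])

def euclidean(a, b):
    """Straight-line distance between two rows."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def min_max(table):
    """Rescale every column of a 2-D array to the 0-1 range."""
    low = table.min(axis=0)
    high = table.max(axis=0)
    return (table - low) / (high - low)

# Distances on the raw values: the weight column dominates.
raw_ab = euclidean(people[0], people[1])
raw_ac = euclidean(people[0], people[2])

# Distances after scaling: both columns are on the same footing.
scaled = min_max(people)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

With only three points, each column's extremes land exactly on 0 and 1, so the two scaled distances come out equal. On real data they would merely become comparable.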



In [None]:
# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size = 1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns = [0])

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

The *shape* of the data doesn't change, but instead of ranging from 0 to 8ish, it now ranges from 0 to 1.
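As a rough check of that claim, min-max scaling can be written by hand as `(x - min) / (max - min)`. The sketch below is an illustration, not the `mlxtend` implementation: `min_max_scale` and `skewness` are hypothetical helper names. It confirms that the range becomes 0 to 1 while the skewness, one measure of shape, stays the same.

```python
import numpy as np

# A fresh exponential sample, drawn with its own generator for this sketch.
rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    unit = (values - values.min()) / span
    return unit * (new_max - new_min) + new_min

def skewness(values):
    """Third standardized moment: a simple number describing asymmetry."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

scaled = min_max_scale(sample)

print("range before:", sample.min(), sample.max())
print("range after: ", scaled.min(), scaled.max())
print("skewness before and after:", skewness(sample), skewness(scaled))
```

Because the transformation is linear, every point keeps its relative position, which is why the histogram looks identical apart from the axis labels.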

___
## Normalization

Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.

> **[Normal distribution:](https://en.wikipedia.org/wiki/Normal_distribution)** Also known as the "bell curve", this is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also called the Gaussian distribution.

In general, you'll only want to normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed, like t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA), Gaussian naive Bayes...

The method we're using to normalize here is called the [Box-Cox Transformation](https://en.wikipedia.org/wiki/Power_transform#Box%E2%80%93Cox_transformation). 
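For reference, the one-parameter Box-Cox transformation maps a positive value `x` to `(x**lam - 1) / lam` when `lam` is not zero, and to `log(x)` when `lam` is zero. The sketch below writes that formula by hand (`box_cox` and the toy `sample` are assumptions for illustration) and compares it with `scipy.stats.boxcox` for a fixed lambda.

```python
import numpy as np
from scipy import stats

def box_cox(values, lam):
    """One-parameter Box-Cox transform for strictly positive values."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        raise ValueError("Box-Cox needs strictly positive values")
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam

# Hypothetical toy values, chosen only to be positive.
sample = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

by_hand = box_cox(sample, 0.5)
by_scipy = stats.boxcox(sample, lmbda=0.5)  # fixed lambda: returns only the array

print(by_hand)
print(by_scipy)

# Without lmbda, SciPy also estimates lambda and returns (transformed, lambda).
transformed, fitted_lambda = stats.boxcox(sample)
print("fitted lambda:", fitted_lambda)
```

When no lambda is passed, `stats.boxcox` returns a pair, which is why code that only wants the transformed values takes element `[0]` of the result.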

In [None]:
# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")

Notice that the *shape* of our data has changed. Before normalizing it was L-shaped; after normalizing it looks like a "bell curve". 
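The change in shape can also be checked with a number instead of a plot. A minimal sketch, assuming skewness is an acceptable stand-in for "how lopsided the histogram is" (the sample here is drawn separately for this example, so values will differ from the cells above):

```python
import numpy as np
from scipy import stats

# Fresh exponential sample for this sketch.
rng = np.random.default_rng(0)
skewed_sample = rng.exponential(size=1000)

# Box-Cox with an estimated lambda; element [0] is the transformed data.
normalized_sample = stats.boxcox(skewed_sample)[0]

skew_before = stats.skew(skewed_sample)
skew_after = stats.skew(normalized_sample)

print("skewness before Box-Cox:", skew_before)
print("skewness after Box-Cox: ", skew_after)
```

A symmetric bell curve has a skewness of 0, so the value after the transformation should sit much closer to 0 than the value before it.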


# Scaling
___

Let's start by scaling the lot area of each house (the `LotArea` column).

In [None]:
# select the LotArea column
lotArea = houses.LotArea.dropna()

# scale the lot areas from 0 to 1
scaled_data = minmax_scaling(lotArea, columns = [0])

# plot the original & scaled data together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(houses.LotArea.dropna(), ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

Scaling changed the scales of the plots dramatically, but not the shape of the data: it looks like most houses have typical areas but a few have very large ones.

**Challenge**

In [None]:
# Challenge
# We just scaled the "LotArea" column. What about the "LotFrontage" column?


# Normalization
___


In [None]:
# Box-Cox only works on strictly positive values; LotArea is assumed to contain no zeros or negatives
positive_pledges = houses.LotArea 

# normalize the lot areas (w/ Box-Cox); [0] picks the transformed values
normalized_pledges = stats.boxcox(positive_pledges)[0]

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(positive_pledges, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_pledges, ax=ax[1])
ax[1].set_title("Normalized data")

It's not perfect, but it is much closer to normal.


**Challenge**

In [None]:
# Challenge 
# We looked at the "LotArea" column. What about the "LotFrontage" column? Does it have the same info?


In [None]:
x = houses.LotArea
y = houses.SalePrice

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( normalized_pledges, y, test_size=0.20, random_state=0)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)

model.fit(X_train[:, np.newaxis], Y_train)

Y_pred = model.predict(X_test[:, np.newaxis])

plt.scatter(X_train, Y_train)
plt.plot(X_test, Y_pred, color='#FF0000');

In [None]:
model.score(X_test[:, np.newaxis],Y_test)

Ha! If you remember the previous lessons, you can see that normalization just doubled the score for a single input!
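For context, the number returned by `LinearRegression.score` is the coefficient of determination, R². A minimal sketch on made-up data (the `x_toy` and `y_toy` arrays are assumptions, not the houses dataset) that computes it by hand as `1 - SS_res / SS_tot` and compares it with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: a noisy straight line.
rng = np.random.default_rng(0)
x_toy = np.arange(20, dtype=float)
y_toy = 3.0 * x_toy + 5.0 + rng.normal(scale=2.0, size=x_toy.size)

# scikit-learn expects a 2-D feature array, hence the extra axis.
toy_model = LinearRegression().fit(x_toy[:, np.newaxis], y_toy)
predicted = toy_model.predict(x_toy[:, np.newaxis])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - predicted) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_by_hand = 1.0 - ss_res / ss_tot

r2_sklearn = toy_model.score(x_toy[:, np.newaxis], y_toy)

print(r2_by_hand, r2_sklearn)
```

A score of 1 means the predictions match the targets exactly, and a score of 0 means the model does no better than always predicting the mean.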

**ADVANCED CHALLENGE**   
Can you apply the same to the 4 columns used in the previous lesson?

Possible tip: you may use `np.asarray([c1, c2, c3, c4]).T` to combine separate columns into one array.
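A minimal sketch of stacking separate 1-D arrays into a single 2-D feature matrix, using hypothetical toy arrays `c1`, `c2` and `c3` rather than the real columns. It shows two equivalent ways to get one row per sample and one column per feature, which is the layout `train_test_split` and `LinearRegression` expect.

```python
import numpy as np

# Hypothetical toy columns, each with one value per sample.
c1 = np.array([1.0, 2.0, 3.0, 4.0])
c2 = np.array([10.0, 20.0, 30.0, 40.0])
c3 = np.array([0.1, 0.2, 0.3, 0.4])

# Option 1: column_stack places each 1-D array in its own column.
features = np.column_stack([c1, c2, c3])

# Option 2: build a (3, 4) array from the list, then transpose it to (4, 3).
same_features = np.asarray([c1, c2, c3]).T

print(features.shape)
print(np.array_equal(features, same_features))
```

Note that the arrays go inside a single list; passing them as separate arguments to `np.asarray` would not stack them.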

In [None]:
H = houses[['LotFrontage','LotArea','OverallQual','OverallCond','SalePrice']].dropna()

# your code: normalize all 4 input columns
x2 = None

y2 = H.SalePrice

print("x2 shape:", x2.shape)
print("y2 shape:",y2.shape)

X2_train, X2_test, Y2_train, Y2_test = train_test_split( x2, y2, test_size=0.20, random_state=0)

Expected output:

```
x2 shape: (1201, 4)
y2 shape: (1201,)
```

In [None]:
model2 = LinearRegression(fit_intercept=True)

model2.fit(X2_train, Y2_train)

y2fit = model2.predict(X2_test)


In [None]:
print(X2_train[:,0].shape)
print(Y2_train.shape)

plt.scatter(X2_train[:,0], Y2_train , color='#2F08EC', marker="s")
plt.scatter(X2_test[:,0], Y2_test , color='#FF082C')
plt.scatter(X2_test[:,0], y2fit , color='#00FF00')

In [None]:
model2.score(X2_test,Y2_test)

In [None]:
model2.score(X2_train,Y2_train)

Expected output:


```
test: 0.685395007040821
train: 0.6779247339356669
```

**ADVANCED CHALLENGE 2**

Now try the same, but normalize only the first 2 columns, LotFrontage and LotArea, and leave OverallQual and OverallCond in their original condition.


In [None]:
## your code: normalize the first 2 input columns
x3 = None
y3 = H.SalePrice

## your code: split 80/20 %
X3_train, X3_test, Y3_train, Y3_test = None


In [None]:
model3 = LinearRegression(fit_intercept=True)

model3.fit(X3_train, Y3_train)

y3fit = model3.predict(X3_test)

In [None]:
model3.score(X3_test,Y3_test)

In [None]:
model3.score(X3_train,Y3_train)

Expected output:

```
test: 0.7151474283355291
train: 0.6929990270245685
```

Again! You are great again!