# Machine Learning




In [None]:
!pip install -U scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In the following cells we will load the data and define some useful plotting functions.


In [None]:
np.random.seed(72018)


def to_2d(array):
    return array.reshape(array.shape[0], -1)
    
def plot_exponential_data():
    data = np.exp(np.random.normal(size=1000))
    plt.hist(data)
    plt.show()
    return data
    
def plot_square_normal_data():
    data = np.square(np.random.normal(loc=5, size=1000))
    plt.hist(data)
    plt.show()
    return data

### Loading the California Housing Data


In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(download_if_missing=True)

data = np.c_[housing.data, housing.target]
columns = np.append(housing.feature_names, ["MedVal"])
housing_df = pd.DataFrame(data, columns=columns)

In [None]:
housing_df.head(15)

### Determining Normality


In [None]:
housing_df.MedVal.hist();

#### Using a Statistical Test


Without getting into Bayesian vs. frequentist debates, the following will suffice for the purposes of this lesson:

* We use D'Agostino's K^2 test (`normaltest`), which tests the null hypothesis that a sample comes from a normal distribution. It isn't perfect, but suffice it to say:
    * The test outputs a **p-value**. The _lower_ this p-value is, the _stronger_ the evidence that the distribution is not normal; a higher p-value means the data are more consistent with a normal distribution.
    * Frequentist statisticians would say that you accept that the distribution is normal (more specifically: fail to reject the null hypothesis that it is normal) if p > 0.05.
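
As a rough illustration of how to read the test output, the sketch below runs `normaltest` on two synthetic samples. Both samples and the seed are assumptions made for demonstration only; they are not part of the housing data.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(72018)  # illustrative seed, not from the lesson data

normal_sample = rng.normal(size=1000)
skewed_sample = np.exp(rng.normal(size=1000))  # log-normal, right-skewed

# normaltest returns the K^2 statistic and the p-value
stat_normal, p_normal = normaltest(normal_sample)
stat_skewed, p_skewed = normaltest(skewed_sample)

print(f"normal sample: statistic={stat_normal:.2f}, p-value={p_normal:.3g}")
print(f"skewed sample: statistic={stat_skewed:.2f}, p-value={p_skewed:.3g}")
```

The skewed sample should produce a much larger statistic and a p-value far below 0.05, so the null hypothesis of normality is rejected for it.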


In [None]:
from scipy.stats.mstats import normaltest # D'Agostino K^2 Test

In [None]:
normaltest(housing_df.MedVal.values)

The p-value is _extremely_ low. Our **y** variable, which we have been working with this whole time, is not normally distributed!


### Apply transformations to make the target variable more normally distributed for regression


* Log transformation
* Square root transformation
* Box-Cox transformation
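
Before applying these to the housing target, it may help to compare the three transformations side by side. The sketch below does so on a synthetic, strictly positive, right-skewed sample (an assumption for illustration; the names `sample`, `candidates`, and `k2` do not appear elsewhere in this notebook).

```python
import numpy as np
from scipy.stats import boxcox, normaltest

rng = np.random.default_rng(0)
sample = np.exp(rng.normal(size=1000))  # strictly positive, right-skewed

candidates = {
    "raw": sample,
    "log": np.log(sample),
    "sqrt": np.sqrt(sample),
    "boxcox": boxcox(sample)[0],  # boxcox returns (transformed values, fitted lambda)
}

# A smaller K^2 statistic means less evidence against normality
k2 = {name: normaltest(values).statistic for name, values in candidates.items()}
for name, value in k2.items():
    print(f"{name:>6}: K^2 = {value:.2f}")
```

Which transformation works best depends on the data: the log is a natural fit for this log-normal sample, but that need not hold for the housing target.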


### Log Transformation


In [None]:
log_medv = np.log(housing_df.MedVal)

In [None]:
log_medv.hist();

In [None]:
normaltest(log_medv)

Conclusion: The output is closer to a normal distribution, but it is still not completely normal.


### Square root Transformation

In [None]:
sqrt_medv = np.sqrt(housing_df.MedVal)  # transform the target only, not the full data array
plt.hist(sqrt_medv);

In [None]:
normaltest(sqrt_medv)

### Box-Cox Transformation


In [None]:
from scipy.stats import boxcox

In [None]:
bc_result = boxcox(housing_df.MedVal)
boxcox_medv = bc_result[0]
lam = bc_result[1]

In [None]:
lam

In [None]:
housing_df['MedVal'].hist();

In [None]:
plt.hist(boxcox_medv);

In [None]:
normaltest(boxcox_medv)

### Testing regression:


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler, 
                                   PolynomialFeatures)

In [None]:
lr = LinearRegression()

**Load the dataframe `housing_df`:**


In [None]:
data = np.c_[housing.data, housing.target]
columns = np.append(housing.feature_names, ["MedVal"])
housing_df = pd.DataFrame(data, columns=columns)

In [None]:
y_col = "MedVal"

X = housing_df.drop(y_col, axis=1)
y = housing_df[y_col]

**Create Polynomial Features**


In [None]:
pf = PolynomialFeatures(degree=2, include_bias=False)
X_pf = pf.fit_transform(X)

**Split the data into Training and Test Sets**   

The test size here is 0.3, which means we assign **70%** of the data to training and **30%** to testing.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pf, y, test_size=0.3, 
                                                    random_state=72018)

**Standardize the training data by calling `fit_transform()` on `X_train` with `StandardScaler`.**


In [None]:
s = StandardScaler()
X_train_s = s.fit_transform(X_train)
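
The scaler is fit on the training rows only, and the test rows are later transformed with those same training statistics. The sketch below illustrates this on synthetic arrays (the `*_demo` names and the data are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=10.0, scale=2.0, size=(200, 2))
X_te_demo = rng.normal(loc=10.0, scale=2.0, size=(50, 2))

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr_demo)  # learns mean/std from training rows
X_te_scaled = scaler_demo.transform(X_te_demo)      # reuses those training statistics

print("train column means:", X_tr_scaled.mean(axis=0).round(3))
print("test column means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and unit variance. The test columns are only approximately centered, which is expected: no information from the test set was used to fit the scaler.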


1. Fit regression
1. Transform testing data
1. Predict on testing data


In [None]:
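# y_train_bc is not defined earlier: Box-Cox transform the training target (lam2 is the fitted lambda)
y_train_bc, lam2 = boxcox(y_train)
y_train_bc.shape

In [None]:
lr.fit(X_train_s, y_train_bc)
X_test_s = s.transform(X_test)
y_pred_bc = lr.predict(X_test_s)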

#### Apply inverse transformations to be able to use these in a regression context


Every transformation used here has an inverse transformation. For example, the inverse of $f(x) = \sqrt{x}$ (for $x \geq 0$) is $f^{-1}(x) = x^2$. Box-Cox has an inverse transformation as well; notice that we have to pass in the lambda value that we found earlier:
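
For reference, the standard one-parameter Box-Cox transform for $y > 0$ and its inverse can be written as follows, where $\lambda$ corresponds to the fitted `lam` and $y^{(\lambda)}$ denotes the transformed value (notation assumed here, not taken from the notebook):

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln y, & \lambda = 0
\end{cases}
\qquad
y =
\begin{cases}
\left(\lambda\, y^{(\lambda)} + 1\right)^{1/\lambda}, & \lambda \neq 0 \\
\exp\left(y^{(\lambda)}\right), & \lambda = 0
\end{cases}
$$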


In [None]:
from scipy.special import inv_boxcox

In [None]:
# code from above
bc_result = boxcox(housing_df.MedVal)
boxcox_medv = bc_result[0]
lam = bc_result[1]

In [None]:
inv_boxcox(boxcox_medv, lam)[:10]

In [None]:
housing_df['MedVal'].values[:10]
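
In a regression setting, the same inverse is applied to the model's predictions so they can be scored on the original scale. The following is a minimal, self-contained sketch on synthetic data; the data, the seeds, and names such as `lam_tr` and `pred` are illustrative assumptions rather than the notebook's own variables.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
# Strictly positive, right-skewed target (Box-Cox requires positive values)
y_demo = np.exp(0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1]
                + rng.normal(scale=0.1, size=500))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Fit lambda on the training target only
y_tr_bc, lam_tr = boxcox(y_tr)

scaler = StandardScaler()
model = LinearRegression().fit(scaler.fit_transform(X_tr), y_tr_bc)

# Predict on the Box-Cox scale, then map back with the training lambda
pred_bc = model.predict(scaler.transform(X_te))
pred = inv_boxcox(pred_bc, lam_tr)

score = r2_score(y_te, pred)
print(f"lambda = {lam_tr:.3f}, R^2 on original scale = {score:.3f}")
```

Applied to this notebook, the analogous step would be `inv_boxcox(y_pred_bc, lam2)` followed by `r2_score(y_test, ...)`, assuming `lam2` is the lambda fitted on `y_train`.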