## Load and Clean Data

In [1]:
import pandas as pd

df = pd.read_csv('house_prices.csv')
# Keep only single-family homes, then drop rows with missing values.
df_new = df[df.BldgType == '1Fam'].copy()
df_new = df_new.dropna()
df_new.head()

Unnamed: 0,Id,BldgType,LotArea,GrLivArea,YearBuilt,YrSold,SalePrice
0,1,1Fam,8450,1710.0,2003,2008,208500
2,3,1Fam,11250,1786.0,2001,2008,223500
3,4,1Fam,9550,1717.0,1915,2006,140000
4,5,1Fam,14260,2198.0,2000,2008,250000
5,6,1Fam,14115,1362.0,1993,2009,143000


## Split Data Into Train and Test

In [3]:
# Split by sale year: sales before 2010 for training, 2010 onward for testing.
train_raw = df_new[df_new.YrSold < 2010].reset_index(drop=True)
test_raw = df_new[df_new.YrSold >= 2010].reset_index(drop=True)
train = train_raw[['GrLivArea', 'SalePrice']].copy()
test = test_raw[['GrLivArea', 'SalePrice']].copy()


## Get Features and Target

In [4]:
features = list(train.columns)
target = "SalePrice"
features.remove(target)

X_train = train[features].copy()
y_train = train[target].copy()

X_test = test[features].copy()
y_test = test[target].copy()

## Create Pipeline

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [24]:
steps = [('poly', PolynomialFeatures(degree=30)),
         ('rescale', MinMaxScaler()),
         ('lr', LinearRegression())]
pipeline_lr = Pipeline(steps)
pipeline_lr = pipeline_lr.fit(X_train, y_train)

<font color='red'>Assignment:</font> Calculate the train and test loss. Then plot **GrLivArea** against **SalePrice** for the test data and overlay the model's predictions to see how well the model fits the test data.
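
One possible starting point for this assignment is sketched below. It is not a reference solution. To stay self-contained it uses synthetic data in place of `house_prices.csv`, a lower polynomial degree than the notebook, and hypothetical names (`demo_pipeline`, `report_losses`, `grid`, `curve`) that do not appear in the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for GrLivArea and SalePrice (assumption: NOT the real data).
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, size=300)
price = 50_000 + 100 * area + rng.normal(0, 20_000, size=300)
X_demo_train, y_demo_train = area[:200].reshape(-1, 1), price[:200]
X_demo_test, y_demo_test = area[200:].reshape(-1, 1), price[200:]

# Same three steps as the notebook's pipeline, with a lower degree (assumption).
demo_pipeline = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("rescale", MinMaxScaler()),
    ("lr", LinearRegression()),
]).fit(X_demo_train, y_demo_train)


def report_losses(model, X, y):
    """Return MAE, MSE, RMSE, and R2 for a fitted model on (X, y)."""
    pred = model.predict(X)
    mse = mean_squared_error(y, pred)
    return {
        "mae": mean_absolute_error(y, pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": r2_score(y, pred),
    }


train_losses = report_losses(demo_pipeline, X_demo_train, y_demo_train)
test_losses = report_losses(demo_pipeline, X_demo_test, y_demo_test)
print("train:", train_losses)
print("test: ", test_losses)

# Predictions on an evenly spaced grid give a smooth curve that can be
# drawn over a scatter plot of the test points.
grid = np.linspace(X_demo_test.min(), X_demo_test.max(), 200).reshape(-1, 1)
curve = demo_pipeline.predict(grid)
```

With matplotlib, `plt.scatter(X_demo_test, y_demo_test)` followed by `plt.plot(grid, curve)` would draw the overlay. When adapting this to `pipeline_lr`, which was fitted on a DataFrame, wrapping the grid in a DataFrame with a `GrLivArea` column should avoid the feature-name warning that recent scikit-learn versions emit.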

# Lasso

<font color='red'>Assignment:</font> Use **Lasso** instead of **LinearRegression** in the **Pipeline**. Tune **alpha** in **Lasso** to find the value with the lowest test loss; that is your best model. Visualize your best model and calculate its R2 scores.
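
The alpha search can be sketched as a simple loop. This is a hedged sketch, not the expected answer: it uses synthetic data instead of the house price data, and the alpha grid, the polynomial degree, `max_iter`, and the helper `make_lasso_pipeline` are all assumptions chosen for illustration. It follows the assignment in selecting by test loss; in practice a separate validation split or cross-validation is usually preferred, so that the test set stays untouched.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for GrLivArea and SalePrice (assumption: NOT the real data).
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, size=300)
price = 50_000 + 100 * area + rng.normal(0, 20_000, size=300)
X_tr, y_tr = area[:200].reshape(-1, 1), price[:200]
X_te, y_te = area[200:].reshape(-1, 1), price[200:]


def make_lasso_pipeline(alpha, degree=5):
    """Hypothetical helper: the notebook's pipeline with Lasso as the last step."""
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("rescale", MinMaxScaler()),
        ("lasso", Lasso(alpha=alpha, max_iter=100_000)),
    ])


alphas = [0.1, 1, 10, 100, 1000]  # illustrative grid, not a recommendation
results = []
for alpha in alphas:
    model = make_lasso_pipeline(alpha).fit(X_tr, y_tr)
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    results.append((test_mse, alpha, model))
    print(f"alpha={alpha}: test MSE={test_mse:.3e}")

# The best model is the one with the lowest test loss.
best_mse, best_alpha, best_model = min(results, key=lambda r: r[0])
best_r2_train = r2_score(y_tr, best_model.predict(X_tr))
best_r2_test = r2_score(y_te, best_model.predict(X_te))
print("best alpha:", best_alpha, "train R2:", best_r2_train, "test R2:", best_r2_test)
```

scikit-learn may print a `ConvergenceWarning` for the smallest alphas, because polynomial features are strongly correlated; raising `max_iter` is one common response.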

<font color='red'>Question:</font> **Alpha** is a hyperparameter. What is a hyperparameter?

<font color='red'>Question:</font> How does **alpha** affect model complexity?

<font color='red'>Question:</font> How does **alpha** affect the coefficient values?
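
One way to explore the last two questions empirically is to count how many Lasso coefficients are nonzero at different values of **alpha**. The sketch below does this on synthetic data; the data, the alpha values, the degree, and the name `nonzero_by_alpha` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for GrLivArea and SalePrice (assumption: NOT the real data).
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, size=200)
price = 50_000 + 100 * area + rng.normal(0, 20_000, size=200)
X_demo = area.reshape(-1, 1)

nonzero_by_alpha = {}
for alpha in [1, 100, 10_000, 1_000_000]:
    model = Pipeline([
        # include_bias=False so every coefficient belongs to a real feature.
        ("poly", PolynomialFeatures(degree=5, include_bias=False)),
        ("rescale", MinMaxScaler()),
        ("lasso", Lasso(alpha=alpha, max_iter=100_000)),
    ]).fit(X_demo, price)
    coefs = model.named_steps["lasso"].coef_
    nonzero_by_alpha[alpha] = int(np.count_nonzero(coefs))
    print(f"alpha={alpha}: nonzero={nonzero_by_alpha[alpha]}, "
          f"largest |coef|={np.abs(coefs).max():.1f}")
```

Running the same loop on the real training data, and comparing the printed counts with the fitted curves, is a reasonable way to connect **alpha** to model complexity.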

# Ridge

<font color='red'>Assignment:</font> Try **Ridge** instead of **Lasso**.

<font color='red'>Question:</font> What is regularization? How does it work?
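
As a reference point for this question, the two objectives can be written out. The forms below follow the objectives stated in the scikit-learn documentation for `Lasso` and `Ridge`; the symbols ($X$ for the feature matrix, $y$ for the target, $w$ for the coefficients, $n$ for the number of samples) are notation introduced here, not taken from the notebook.

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
$$

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
$$

In both cases the first term measures fit to the training data and the second term penalizes large coefficients, with **alpha** setting the weight of the penalty.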

<font color='red'>Question:</font> What are "L1" and "L2" regularization, respectively? How do the effects of **Ridge** and **Lasso** on the coefficients differ?
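
To compare the two estimators side by side, one option is to fit both on the same features and count the coefficients that are exactly zero. This is a minimal sketch under stated assumptions: synthetic data, arbitrary alpha values (the two alphas are not on a comparable scale), and a hypothetical helper `fit_coefs`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for GrLivArea and SalePrice (assumption: NOT the real data).
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, size=200)
price = 50_000 + 100 * area + rng.normal(0, 20_000, size=200)
X_demo = area.reshape(-1, 1)


def fit_coefs(regressor):
    """Hypothetical helper: fit the pipeline and return the final step's coefficients."""
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=10, include_bias=False)),
        ("rescale", MinMaxScaler()),
        ("reg", regressor),
    ]).fit(X_demo, price)
    return model.named_steps["reg"].coef_


lasso_coefs = fit_coefs(Lasso(alpha=1000, max_iter=100_000))
ridge_coefs = fit_coefs(Ridge(alpha=10))

lasso_zeros = int(np.sum(lasso_coefs == 0))
ridge_zeros = int(np.sum(ridge_coefs == 0))
print("Lasso coefficients:", np.round(lasso_coefs, 1))
print("Ridge coefficients:", np.round(ridge_coefs, 1))
print("exact zeros -> Lasso:", lasso_zeros, "Ridge:", ridge_zeros)
```

Swapping in the real `X_train` and `y_train` and varying both alphas should make the difference between the two penalties visible in the printed coefficients.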