# Connect Intensive - Machine Learning Nanodegree

## Week 3. Build Regression Models

### Objectives    

- Implement regression models to predict housing prices  
    - Perform exploratory analysis with pandas, matplotlib, and seaborn
    - Build univariate and multivariate linear regression models     
    - Build decision tree regression models
- Evaluate the model performance  
    - Calculate performance metrics
    - Tune the model complexity and understand the performance change
  
### Prerequisites   

 - You should have the following python packages installed:
    - [numpy](http://www.scipy.org/scipylib/download.html)
    - [pandas](http://pandas.pydata.org/getpandas.html)
    - [matplotlib](http://matplotlib.org/index.html)
    - [seaborn](http://seaborn.pydata.org) 
    - [sklearn](http://scikit-learn.org/stable/install.html)
 - If you're rusty on basic Python programming and exploratory data analysis, check out the [Jupyter notebooks from week 1](https://github.com/yanfei-wu/Udacity_connect/tree/master/wk1). If you are not familiar with `sklearn`'s model building workflow, check out the [Jupyter notebook from week 2](https://github.com/yanfei-wu/Udacity_connect/tree/master/wk2).


---

## Step 0: Getting Started

As usual, we start by importing some useful libraries and modules (make sure you have `scikit-learn` installed) and reading the dataset into pandas. It is always a good idea to take a quick look at the dataset using the DataFrame's `head()`, `info()`, or `describe()` methods. 

**Run** the cells below to import the libraries, read the data, and take a quick look at the dataset. 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from IPython.display import display

%matplotlib inline

In [None]:
# set maximum rows to display
pd.options.display.max_rows = 15 # default is 60

In [None]:
# read the data 
data = pd.read_csv('./data/home_data.csv')
display(data.head())

In [None]:
data.info()

In [None]:
data.describe()

For this programming exercise, we are going to use a simple housing dataset with 7 features:  
- '`bedrooms`': number of bedrooms
- '`bathrooms`': number of bathrooms
- '`sqft_living`': living area size in square feet
- '`sqft_lot`': lot size in square feet  
- '`sqft_basement`': size of basement in square feet   
- '`floors`': number of floors  
- '`condition`': condition of the home coded with integers  

The label (or target variable) we are going to predict is the '`price`'.

---

## Step 1: Exploratory Data Analysis 

By now you should have a rough idea of the basic structure of the data. Remember, when working on a machine learning problem, you should always explore your dataset to gain more insight before you jump into model building. 


#### EXERCISE
By now, you should have some exploratory data analysis tools in your toolbox. I will leave it to you to get some in-depth understanding of the dataset you are going to build a model with. 

> **Something you might want to explore:** 
    > - Distribution of individual features 
    > - Correlation between pairs of features (You will find the DataFrame's `corr()` method useful. Also, check out `seaborn`'s `pairplot()` method.)
    > - (OPTIONAL) Experiment with feature combinations  
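As a starting point for the correlation item above, here is a minimal sketch of `corr()` on a small synthetic DataFrame. The data are made up for illustration (they are not the housing dataset), and the name `toy` is hypothetical; swap in `data` when you explore the real thing.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the real dataset)
rng = np.random.RandomState(0)
sqft = rng.uniform(500, 4000, size=200)
toy = pd.DataFrame({
    'sqft_living': sqft,
    'bedrooms': np.round(sqft / 800 + rng.normal(0, 0.5, size=200)).clip(1, None),
    'price': 150 * sqft + rng.normal(0, 50000, size=200),
})

# Pairwise Pearson correlations between all numeric columns
corr_matrix = toy.corr()

# Rank the features by their correlation with the target
price_corr = corr_matrix['price'].drop('price').sort_values(ascending=False)
print(price_corr)
```

Sorting the `price` column of the correlation matrix is a quick way to see which features are most linearly related to the target before you plot anything.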

---

## Step 2: Prepare the Data for Model Building

After you have explored your dataset, you usually have to do some preprocessing and cleaning to prepare it for your machine learning algorithms. For example, as you learned from last week's tutorial, you should check whether there are missing values in your dataset. If so, there are different strategies for dealing with them (e.g., dropping the feature or some form of imputation). When your dataset has categorical features, you should convert them to numerical variables, either by mapping the categories to numerical values or by using *one-hot encoding* to turn the features into dummy variables. 
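The steps described above can be sketched roughly as follows. This is only an illustration on a tiny made-up DataFrame: the `heating` column and all values are hypothetical, and median imputation is just one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one categorical column
# (column names and values are made up for illustration)
toy = pd.DataFrame({
    'sqft_living': [1200.0, np.nan, 2000.0, 1500.0],
    'heating': ['gas', 'electric', 'gas', 'oil'],
})

# 1. Count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# 2. Impute the missing numeric value with the column median
toy['sqft_living'] = toy['sqft_living'].fillna(toy['sqft_living'].median())

# 3. One-hot encode the categorical column into dummy variables
encoded = pd.get_dummies(toy, columns=['heating'])
print(encoded.columns.tolist())
```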

**Sometimes, it is also a good idea to scale the features because some machine learning algorithms do not perform well when the input numerical variables have different scales.** For example, in algorithms that measure the Euclidean distance between two points, the distance is dominated by the feature with the broadest range of values. Also, for regularized linear models (Ridge, Lasso, and ElasticNet), normalization is very important because the scale of a variable affects how much regularization is applied to that variable. (See the [Wikipedia](https://en.wikipedia.org/wiki/Feature_scaling) page on feature scaling for more.) 
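If you do want to scale features, one common option is `sklearn`'s `StandardScaler`, which rescales each column to zero mean and unit variance. The sketch below uses made-up numbers, and the names `X_train_toy` and `X_test_toy` are hypothetical. Note that the scaler is fit on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: bedrooms and lot size in square feet
X_train_toy = np.array([[2.0, 5000.0], [3.0, 7500.0], [4.0, 10000.0]])
X_test_toy = np.array([[3.0, 20000.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train_toy)
# Reuse the training statistics on the test data
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```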

For now, we will just use the data as it is and build linear regression models and decision tree models. 

---

## Step 3: Training, Testing, and Model Evaluation

### 1. Univariate Linear Regression 

Univariate linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). For example, we can use `sqft_living` as our feature to predict `price`. However, this feature alone probably does not have enough predictive power for the home price. We are only using it as an example to demonstrate linear regression with `sklearn`. 

But first, let's split our data into training and test sets using `train_test_split()`. For more information about `train_test_split()`, read the [documentation about the method](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). In Jupyter notebook, a handy shortcut to view the documentation is to use **shift + tab** with your cursor inside the parentheses.

In [None]:
# import train_test_split
from sklearn.model_selection import train_test_split  

# Split the original dataset into training and test data sets
# Note that we split the original dataset instead of `X` (features) and `y` (label) arrays 
train_set, test_set = train_test_split(data, test_size=0.2, random_state=21)

# Take a look at the first few rows of the training and test sets
display(train_set.head())
display(test_set.head())

# Verify that the data sets were split 80% training and 20% testing
print('The number of instances in the original data: {}'.format(data.shape[0]))
print('The number of instances in the training set: {}'.format(train_set.shape[0]))
print('The number of instances in the test set: {}'.format(test_set.shape[0]))

In [None]:
# Now let's separate features and label
X_train = train_set['sqft_living'].values.reshape(-1, 1) # sklearn expects a 2D array, so reshape when there is a single feature
y_train = train_set['price']
X_test = test_set['sqft_living'].values.reshape(-1, 1)
y_test = test_set['price']

Now we can train our model with `sklearn`. 

In [None]:
from sklearn.linear_model import LinearRegression

# CREATE regression object... in this example
lreg = LinearRegression()

# TRAIN the object using the method .fit()
lreg.fit(X_train, y_train)

# PREDICT labels for the train and test set using the method .predict()
y_pred_train = lreg.predict(X_train)
y_pred_test  = lreg.predict(X_test)

### Interpreting the coefficients

In [None]:
plt.plot(X_train, y_train, '.', X_train, y_pred_train, '-')
plt.xlabel('Living Area Size (sqft)')
plt.ylabel('Home Price (USD)')

In [None]:
print('Intercept of the regression line is %f.' % lreg.intercept_)
# coef_ is an array with one entry per feature, so take the first (and only) element
print('Slope of the regression line is %f.' % lreg.coef_[0])

#### QUESTION: What do the coefficients mean? 

**Answer:** 

> **Note:** The data above show so-called [heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity), i.e., the variability of one variable is unequal across the range of values of a second variable that predicts it. In this case, the spread of `price` around the regression line grows as `sqft_living` increases, so the model's predictions are much less consistent for homes with large living areas. Read more about how heteroscedasticity affects regression in this [Quora post](https://www.quora.com/How-would-homo-heteroskedasticity-affect-regression-analysis).
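One informal way to see this effect is to compare the spread of the residuals at small and large feature values. The sketch below uses synthetic data whose noise is deliberately made proportional to the feature; the data and the variable names are assumptions for illustration, not the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the noise standard deviation grows with x (an assumption for illustration)
rng = np.random.RandomState(0)
x = rng.uniform(500, 5000, size=2000)
y = 200 * x + rng.normal(size=2000) * 40 * x

# Fit a straight line and compute the residuals
model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# Compare the residual spread for the lower and upper halves of x
median_x = np.median(x)
spread_small = residuals[x < median_x].std()
spread_large = residuals[x >= median_x].std()
print('Residual std for small x: %.1f' % spread_small)
print('Residual std for large x: %.1f' % spread_large)
```

With homoscedastic data the two numbers would be close; a clearly larger spread in the upper half is a sign of heteroscedasticity.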

### Calculating Metrics
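
For reference, the metrics used in this section are commonly defined as follows. The notation is introduced here for convenience: $n$ is the number of instances, $y_i$ is the true price, $\hat{y}_i$ is the predicted price, and $\bar{y}$ is the mean of the true prices.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE and MAE are in the same units as the target (here, USD), so lower is better. $R^2$ is unitless, and a value closer to 1 indicates a better fit.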

In [None]:
# root mean squared error
from sklearn.metrics import mean_squared_error
print('Root mean squared error for training set: %.3f' % np.sqrt(mean_squared_error(y_train, y_pred_train)))
print('Root mean squared error for test set: %.3f' % np.sqrt(mean_squared_error(y_test, y_pred_test)))

#### EXERCISE: Calculate regression performance metrics

In [None]:
# TODO: CALCULATE r squared
from sklearn.metrics import r2_score



In [None]:
# TODO: CALCULATE mean absolute error



### 2. Multivariate Linear Regression

In [None]:
# Now let's separate features and label
X_train = train_set.drop('price', axis=1)
y_train = train_set['price']
X_test = test_set.drop('price', axis=1)
y_test = test_set['price']

#### EXERCISE: Implement multivariate linear regression model

In [None]:
# TODO: CREATE regression object


# TODO: TRAIN the object using the method .fit()


# TODO: PREDICT labels for the train and test set using the method .predict()


# TODO: CALCULATE performance metrics



### 3. Decision Tree Regression  

Linear regression does not do a very good job of predicting the home price with the given features. Let's see if a different algorithm can do better. Remember that last week we built a decision tree classifier to predict whether a given Titanic passenger survived. The decision tree algorithm can also be used for regression problems. It splits the data much as it does in classification, but each leaf of the tree now represents a numerical value instead of a class label. I will let you work through the unfinished code below. 
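To make the idea of a numerical leaf concrete before you start, here is a minimal sketch on a made-up one-feature dataset, kept separate from the exercise. The arrays `X_toy` and `y_toy` are hypothetical. A tree limited to `max_depth=1` makes a single split, and each leaf predicts the mean target value of the training samples that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data: two clearly separated groups of homes
X_toy = np.array([[800], [900], [1000], [3000], [3200], [3400]])
y_toy = np.array([100.0, 120.0, 110.0, 400.0, 420.0, 410.0])

# A depth-1 tree makes a single split, giving two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Each prediction is the mean target of the training samples in that leaf
leaf_predictions = stump.predict(np.array([[850], [3100]]))
print(leaf_predictions)
print(y_toy[:3].mean(), y_toy[3:].mean())
```

Allowing a larger `max_depth` gives the tree more leaves, so it can follow the training data more closely, which is the complexity knob explored in the exercises that follow.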

#### EXERCISE: Finish the implementation of a decision tree regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

# TODO: CREATE regression object
 

# TODO: TRAIN the object using the method .fit()


# TODO: PREDICT labels for the train and test set using the method .predict()


# TODO: CALCULATE performance metrics



#### EXERCISE: Tune the complexity of the model and compare the model performance. What is the optimal complexity for a decision tree regression model in this case? 

In [None]:
# TODO


#### EXERCISE (OPTIONAL): Can you draw a complexity plot showing train and test performance as a function of model complexity?

In [None]:
# TODO
