# Lecture 3 Regression Analysis

In [1]:
# import necessary libraries and specify that graphs should be plotted inline
%matplotlib inline 
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

## 1. Linear Regression Analysis

We will use Scikit-Learn for regression analysis. Scikit-Learn is one of the best-known Python packages for machine learning and provides efficient implementations of a large number of common algorithms.

After the data are prepared, a regression analysis generally consists of the following steps: (1) load the data and select variables for analysis; (2) split the data; (3) define the model and feed the training dataset into it; (4) make predictions; and (5) evaluate performance. Step 3 is also known as "training the model"; it returns the parameters of interest and may take some time if the dataset is large.
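The five steps above can be sketched end to end on a small synthetic dataset. This is only an illustration: the column names `x1`, `x2`, `y` and the data-generating rule are made up for the sketch and are not part of the housing data used in this lecture.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# (1) Load data and select variables (synthetic stand-in for a real file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
demo["y"] = 3.0 * demo["x1"] - 2.0 * demo["x2"] + rng.normal(scale=0.1, size=200)
X, y = demo[["x1", "x2"]], demo["y"]

# (2) Split the data into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# (3) Define the model and train it on the training set
demo_model = LinearRegression().fit(X_tr, y_tr)

# (4) Predict for the test set
y_hat = demo_model.predict(X_te)

# (5) Evaluate performance on the test set
demo_mae = mean_absolute_error(y_te, y_hat)
print(demo_model.coef_, demo_model.intercept_, demo_mae)
```

Because the synthetic target is built from known slopes (3 and -2), the estimated coefficients should land close to those values, which makes the sketch easy to sanity-check.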

### 1.1(a) Data Description
We will use another housing dataset, house.csv. This is a toy dataset for prediction models. Our goal is to predict the **'TOTAL_VALUE'** of the houses. Other important variables are: 
- **TAX**: Tax bill amount of the property.
- **LOT_SQFT**: Total lot size in square feet.
- **GROSS_AREA**: Gross floor area.
- **FLOORS**: Number of floors.
- **ROOMS**: Number of rooms.


In [2]:
## Load the data, print the dimensions and variable names
house = pd.read_csv('house.csv')
print(house.shape) # get record and variable counts
print(house.columns) # get variable names

(5802, 12)
Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT', 'GROSS_AREA', 'LIVING_AREA', 'FLOORS',
       'ROOMS', 'BEDROOMS', 'FULL_BATH', 'HALF_BATH', 'KITCHEN', 'FIREPLACE'],
      dtype='object')


In [3]:
## show the first several rows of data.
house.head(10)

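|   | TOTAL_VALUE | TAX | LOT_SQFT | GROSS_AREA | LIVING_AREA | FLOORS | ROOMS | BEDROOMS | FULL_BATH | HALF_BATH | KITCHEN | FIREPLACE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 344.2 | 4330 | 9965 | 2436 | 1352 | 2.0 | 6 | 3 | 1 | 1 | 1 | 0 |
| 1 | 412.6 | 5190 | 6590 | 3108 | 1976 | 2.0 | 10 | 4 | 2 | 1 | 1 | 0 |
| 2 | 330.1 | 4152 | 7500 | 2294 | 1371 | 2.0 | 8 | 4 | 1 | 1 | 1 | 0 |
| 3 | 498.6 | 6272 | 13773 | 5032 | 2608 | 1.0 | 9 | 5 | 1 | 1 | 1 | 1 |
| 4 | 331.5 | 4170 | 5000 | 2370 | 1438 | 2.0 | 7 | 3 | 2 | 0 | 1 | 0 |
| 5 | 337.4 | 4244 | 5142 | 2124 | 1060 | 1.0 | 6 | 3 | 1 | 0 | 1 | 1 |
| 6 | 359.4 | 4521 | 5000 | 3220 | 1916 | 2.0 | 7 | 3 | 1 | 1 | 1 | 0 |
| 7 | 320.4 | 4030 | 10000 | 2208 | 1200 | 1.0 | 6 | 3 | 1 | 0 | 1 | 0 |
| 8 | 333.5 | 4195 | 6835 | 2582 | 1092 | 1.0 | 5 | 3 | 1 | 0 | 1 | 1 |
| 9 | 409.4 | 5150 | 5093 | 4818 | 2992 | 2.0 | 8 | 4 | 2 | 0 | 1 | 0 |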


### 1.1(b) Data Representation in Scikit-Learn
We need to prepare the dataset in a form that sklearn understands. Specifically, we need:
1. Attribute Matrix (X): a two-dimensional matrix, which can be a NumPy array or a Pandas DataFrame. Rows must be observations, and columns must be variables.
    - Note: you may have only one variable, but make sure X is still a 2D matrix!
2. Target Array (Y): a one-dimensional NumPy array or Pandas Series.
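The 2D-versus-1D distinction is easy to get wrong with a single predictor. A minimal sketch, using a three-row toy frame (`toy` is a hypothetical name; its values are copied from the first rows shown above):

```python
import pandas as pd

toy = pd.DataFrame({"TAX": [4330, 5190, 4152],
                    "TOTAL_VALUE": [344.2, 412.6, 330.1]})

X_2d = toy[["TAX"]]        # double brackets -> DataFrame, shape (3, 1): valid X
x_1d = toy["TAX"]          # single brackets -> Series, shape (3,): not a valid X
y_1d = toy["TOTAL_VALUE"]  # target as a Series, shape (3,): valid y

print(X_2d.shape, x_1d.shape, y_1d.shape)

# A 1D NumPy array can be turned into a single-column 2D matrix with reshape
X_from_array = x_1d.to_numpy().reshape(-1, 1)
print(X_from_array.shape)
```

Selecting with a list of column names (double brackets) always returns a DataFrame, so it keeps X two-dimensional even when the list has one entry.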

For now, consider the following variables for the regression (in practice, you need to use your domain knowledge and exploratory data analysis to choose them).

<center><b>'TAX', 'FLOORS', 'ROOMS'</b></center>

In [4]:
# Choose predictors to construct attribute matrix (X) and target (Y) accordingly.
var = ['TAX', 'FLOORS', 'ROOMS'] # define a list of variables. 
                                 # Spelling should be the same as house.columns
house_X = house[var]             # Define the X to be used (i.e., based on selected var)
house_y = house['TOTAL_VALUE']   # Define the Y

In [5]:
# Print dimensions
print(house_X.shape)
print(house_y.shape)

(5802, 3)
(5802,)


### 1.2 Split the Data

Recall that we train our model only on the training set. So we first need to split our data into two parts: a training set and a test set.

To split our data, we use the syntax:
**<center>sklearn.model_selection.train_test_split() </center>**
- **test_size**: If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
- **random_state**: This parameter is for result reproduction/replication. If int, random_state is the seed used by the random number generator.

**Output of data splitting function**
- If you have one input (e.g., X), then it will return two outputs with a specific order: X_train, X_test. X_test is defined based on "test_size". 

- If you have multiple inputs (e.g., X, y), then the outputs will be (ordered): X_train, X_test, y_train, y_test. 
- We use multiple assignment to collect the corresponding outputs.
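The output ordering and the role of `random_state` described above can be checked on a tiny array. This is a sketch on made-up data; `X_demo` and `y_demo` are hypothetical names, constructed so that each row of X can be matched back to its y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 observations, row i is [2i, 2i+1]
y_demo = np.arange(10)                  # y for row i is i

# One input -> two outputs: train part first, then test part
X_a, X_b = train_test_split(X_demo, test_size=0.3, random_state=42)

# Two inputs -> four outputs; rows of X and y are shuffled together
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))          # same seed, same test rows
print(np.array_equal(X_te[:, 0], 2 * y_te)) # X rows still match their y values
```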



**Practice:** 
- Randomly split the dataset into 70% training and 30% test. Set the random seed to 42.
- Show how many observations there are in the training set and the test set.

In [6]:
# Split the dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(house_X, house_y, test_size=0.3, random_state=42)

In [7]:
# show number of observations in the training set and the test set

X_train.shape[0], X_test.shape[0]

(4061, 1741)

### 1.3 Linear Regression Model

Now we need to fit the model to our training sample. First, we initialize the regression model. That is, we import the model class from Scikit-Learn and create a linear regression model instance.

Use syntax:

**<center>sklearn.linear_model.LinearRegression()</center>**


In [8]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()


The general workflow to apply ML algorithms using sklearn is the same. Specifically:
- We need to first **train** our model. That is, we plug in our corresponding data so as to get resulting parameters.
    - In sklearn, use .fit(INPUTS) to train the model. For supervised learning models, we need to let the model know both the X and the y. Thus, syntax would be **MODEL_NAME.fit(X_train, y_train)**

- After training, we can get some outputs (e.g., parameters) and use the trained model for some further analysis/calculations (e.g., prediction). 
    - In sklearn, use .predict(INPUT) to get predictions from a **trained** model. This time we are predicting y, so y should not be an input. The syntax would be **MODEL_NAME.predict(X)**

- Performance evaluation is a key step for supervised learning algorithms. 
    - In sklearn, use **MODEL_NAME.score(X, y)** to obtain a performance measure on the given inputs. For regression models such as LinearRegression, this measure is $R^2$.

Specific to linear regressions, we may also be interested in the parameter estimates. 
- To get the slope coefficients (one per column of X, in the same order), use **MODEL_NAME.coef_**
- To get the intercept, use **MODEL_NAME.intercept_**
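As a sketch of how these pieces fit together, the example below trains on noise-free toy data generated from a known line, so the recovered parameters can be checked by eye. The names `toy_X`, `toy_y`, `toy_model`, and the columns `a` and `b` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Noise-free toy data generated from y = 1 + 2*a - 3*b
toy_X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0],
                      "b": [1.0, 0.0, 2.0, 1.0, 3.0]})
toy_y = 1.0 + 2.0 * toy_X["a"] - 3.0 * toy_X["b"]

toy_model = LinearRegression().fit(toy_X, toy_y)

# coef_ follows the column order of X, so it can be labeled with the column names
coef_by_name = pd.Series(toy_model.coef_, index=toy_X.columns)
print(coef_by_name)
print(toy_model.intercept_)

# predict() is the same as X @ coef_ + intercept_
manual = toy_X.to_numpy() @ toy_model.coef_ + toy_model.intercept_
print(np.allclose(manual, toy_model.predict(toy_X)))

# score() returns R-squared; it is (numerically) 1 here because there is no noise
print(toy_model.score(toy_X, toy_y))
```

Labeling `coef_` with the column names avoids having to remember which position corresponds to which predictor.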

**Practice:**
- Train the model. Then obtain the coefficients (i.e., all parameter estimates, including the intercept) of the model. Specify which one is the intercept.

In [9]:
# Train the model using the training set

model.fit(X_train, y_train)


In [10]:
# Print out the coefficients
model.coef_, model.intercept_

(array([ 0.07949157,  0.00028905, -0.00028666]), 0.03971078352742552)

In [11]:
# y = 0.07949157 * x_'TAX' + 0.00028905 * x_'FLOORS' - 0.00028666 * x_'ROOMS' + 0.03971078352742552 (intercept)

### 1.4 Predictions
Once the model is trained, the main task is to evaluate it based on what it says about new data. In sklearn, we first obtain the predicted value, then calculate the corresponding performance measures. 

Recall that to obtain an unbiased performance evaluation, one that is not inflated by overfitting, we need to get predictions for the **test set** and compare the predicted values to their true values. You can also make predictions for the training set (by changing the input argument of predict() from X_test to X_train), or for any "new" sample, but these should not be used as performance measures of your model.
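To see why training-set performance is not a reliable measure, here is a deliberately extreme sketch on synthetic data: the model gets more predictors than training rows, so it can reproduce the training targets almost exactly while doing much worse on unseen rows. All names and sizes below are assumptions chosen for the illustration, not part of the housing example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_obs, n_noise = 40, 25
signal = rng.normal(size=n_obs)                     # the one useful predictor
X_noise = rng.normal(size=(n_obs, n_noise))         # predictors unrelated to y
X_over = np.column_stack([signal, X_noise])         # 26 predictors in total
y_over = 2.0 * signal + rng.normal(size=n_obs)      # target = signal + noise

# 20 training rows, 20 test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_over, y_over, test_size=0.5, random_state=42)

over_model = LinearRegression().fit(X_tr, y_tr)
r2_train = over_model.score(X_tr, y_tr)   # more predictors than rows: near-perfect fit
r2_test = over_model.score(X_te, y_te)    # the honest estimate, on unseen rows
print(r2_train, r2_test)
```

The gap between the two scores is the overfitting that a training-set evaluation would hide.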

**Practice:**
- Make predictions for the test set
- Predict for two new houses: one with tax = 4330, floors = 2, and rooms = 6, and one with tax = 5680, floors = 2, and rooms = 6

In [12]:
# Test set prediction
y_pred_test = model.predict(X_test)
y_pred_test, y_test

(array([306.31930765, 305.20642564, 239.22856522, ..., 297.41654061,
        582.3932537 , 383.58483159]),
 5070    306.3
 1103    305.2
 812     239.2
 1632    396.5
 1128    322.5
         ...  
 5648    356.8
 5126    407.4
 1338    297.4
 3885    582.4
 4269    383.6
 Name: TOTAL_VALUE, Length: 1741, dtype: float64)

In [13]:
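# New point prediction: one row per house, columns in the same order as var.
# Passing a DataFrame with the training column names avoids sklearn's feature-name warning.
new_houses = pd.DataFrame([[4330, 2, 6], [5680, 2, 6]], columns=var)
model.predict(new_houses)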



array([344.23707665, 451.55069914])

### 1.5 Performance Evaluation - Regression Models
We consider two measures for performance evaluation: $MAE$ and $MSE$. Let $\hat{y}_i$ denote the predicted value of $y_i$, and $\bar{y}$ denote the mean value of $y$.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
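A small worked example may help connect the two formulas to code. The four true and predicted values below are made up for the illustration; the manual results are then compared with the sklearn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.0, 7.0])   # made-up true values
y_hat = np.array([2.0, 5.0, 4.0, 8.0])    # made-up predictions

errors = y_true - y_hat                    # [ 1.,  0., -2., -1.]
mae_manual = np.mean(np.abs(errors))       # (1 + 0 + 2 + 1) / 4 = 1.0
mse_manual = np.mean(errors ** 2)          # (1 + 0 + 4 + 1) / 4 = 1.5

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(mse_manual, mean_squared_error(y_true, y_hat))
```

Note that the single error of size 2 contributes half of the MAE total but two thirds of the MSE total, which is why MSE penalizes large errors more heavily.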


**Practice**
- Compute the MAE for the test set. (Hint: to obtain absolute value, use np.abs())
- Compute the MSE for the test set.

In [14]:
# MAE
np.mean(np.abs(y_test - y_pred_test))

0.019733239242125705

In [15]:
# MAE - Alternative
from sklearn.metrics import mean_absolute_error as mae

mae(y_test, y_pred_test)

0.019733239242125705

In [16]:
# MSE
from sklearn.metrics import mean_squared_error as mse

mse_test = mse(y_test, y_pred_test)

Many regression analyses also use another measure, R-squared ($R^2$), to evaluate performance. Specifically:

<br><center>$R^2 = 1 - RSS/TSS $ </center>

where $RSS$ is the <b>residual sum of squares</b> (i.e., $n \cdot MSE$); and $TSS$ is the <b>total sum of squares</b>. Specifically, $TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$
    

Thus, R-squared shows how well the model captures the variance of the data. A higher R-squared value indicates better performance. Technically, we do not need to calculate R-squared manually. Instead, we can obtain this measure through the .score method.
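To make the definition concrete, the sketch below computes $R^2$ by hand on four made-up values and checks two reference points: always predicting the mean $\bar{y}$ gives $R^2 = 0$, and predictions worse than that baseline give a negative $R^2$. The arrays are hypothetical illustration data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.0, 7.0])   # made-up true values (mean 4.25)
y_hat = np.array([2.0, 5.0, 4.0, 8.0])    # made-up predictions

n = len(y_true)
rss = np.sum((y_true - y_hat) ** 2)              # 1 + 0 + 4 + 1 = 6.0
tss = np.sum((y_true - y_true.mean()) ** 2)      # 14.75
r2_manual = 1 - rss / tss

print(np.isclose(rss, n * mean_squared_error(y_true, y_hat)))  # RSS = n * MSE
print(r2_manual, r2_score(y_true, y_hat))

# Baseline: always predicting the mean makes RSS equal to TSS, so R^2 = 0
r2_baseline = r2_score(y_true, np.full(n, y_true.mean()))

# Predictions worse than the baseline push R^2 below zero
r2_bad = r2_score(y_true, y_true[::-1])
print(r2_baseline, r2_bad)
```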

In [17]:
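# Obtain R-squared for test set manually: R^2 = 1 - RSS/TSS
rss = np.sum((y_test - y_pred_test) ** 2)     # residual sum of squares
tss = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
1 - rss / tss
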
0.9999999499089725

In [18]:
# Obtain R-squared using sklearn 
from sklearn.metrics import r2_score

r2_score(y_test, y_pred_test)

0.9999999499089725

In [19]:
# Obtain R-squared using .score

model.score(X_test, y_test), model.score(X_train, y_train)

(0.9999999499089725, 0.9999999464980553)