# Regression Analysis

In [1]:
# import necessary libraries and specify that graphs should be plotted inline
%matplotlib inline 
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt


## 1. Linear Regression Analysis

We will use Scikit-Learn for regression analysis. Scikit-Learn is one of the best-known Python machine learning packages and provides efficient implementations of a large number of common algorithms.

After preparing the data, regression analysis generally consists of the following steps: (1) load the data and select the variables for analysis; (2) split the data; (3) define the model and feed the training set into it; (4) make predictions; and (5) evaluate performance. Step 3 is also known as "training the model". It returns the parameters of interest and may take some time if your dataset is large.
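The five steps can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the housing dataset: the arrays `X` and `y`, the chosen coefficients, and the name `demo_model` are all assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# (1) Load data and select variables: a synthetic stand-in for a real dataset
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))   # 200 observations, 2 predictors
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.1, size=200)

# (2) Split the data into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# (3) Define the model and train it on the training set only
demo_model = LinearRegression().fit(X_tr, y_tr)

# (4) Predict for the held-out test set
y_hat = demo_model.predict(X_te)

# (5) Evaluate performance on the test set (R-squared for regressors)
r2 = demo_model.score(X_te, y_te)
print(y_hat.shape, round(r2, 3))
```

Each step is covered in detail in the sections that follow.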

### 1.1(a) Data Description
We will use another housing dataset, house.csv. Our goal is to predict the **'TOTAL_VALUE'** of the houses. Other important variables are: 
- **TAX**: Tax bill amount of the property.
- **LOT_SQFT**: Total lot size in square feet.
- **GROSS_AREA**: Gross floor area.
- **FLOORS**: Number of floors.
- **ROOMS**: Number of rooms.


In [2]:
## Load the data, print the dimensions and variable names
house = pd.read_csv("C:\\Users\\raksh\\OneDrive\\Desktop\\Sem 3\\AML\\Class 3\\house.csv")   
# loading data, data is called "house"
print(house.shape)      # get record and variable counts 
print(house.columns)    # get variable names

(5802, 12)
Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT', 'GROSS_AREA', 'LIVING_AREA', 'FLOORS',
       'ROOMS', 'BEDROOMS', 'FULL_BATH', 'HALF_BATH', 'KITCHEN', 'FIREPLACE'],
      dtype='object')


In [3]:
## show the first several rows of data.
house.head()   

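| | TOTAL_VALUE | TAX | LOT_SQFT | GROSS_AREA | LIVING_AREA | FLOORS | ROOMS | BEDROOMS | FULL_BATH | HALF_BATH | KITCHEN | FIREPLACE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 344.2 | 4330 | 9965 | 2436 | 1352 | 2.0 | 6 | 3 | 1 | 1 | 1 | 0 |
| 1 | 412.6 | 5190 | 6590 | 3108 | 1976 | 2.0 | 10 | 4 | 2 | 1 | 1 | 0 |
| 2 | 330.1 | 4152 | 7500 | 2294 | 1371 | 2.0 | 8 | 4 | 1 | 1 | 1 | 0 |
| 3 | 498.6 | 6272 | 13773 | 5032 | 2608 | 1.0 | 9 | 5 | 1 | 1 | 1 | 1 |
| 4 | 331.5 | 4170 | 5000 | 2370 | 1438 | 2.0 | 7 | 3 | 2 | 0 | 1 | 0 |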


### 1.1(b) Data Representation in Scikit-Learn
We need to prepare the dataset in a form that sklearn understands. Specifically, we need:
1. Attribute Matrix (X): a two-dimensional matrix, which can be a NumPy array or a Pandas DataFrame. Rows must be observations, and columns must be variables.
    - Note: you may have only one variable, but make sure X is still a 2D matrix!
2. Target Array (Y): one-dimensional, a NumPy array or a Pandas Series.
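The 2D requirement for a single predictor is easy to get wrong, so here is a small sketch of the difference. The three-row frame `demo` is a hypothetical stand-in that borrows a few values from the table above.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the housing data
demo = pd.DataFrame({"TAX": [4330, 5190, 4152],
                     "TOTAL_VALUE": [344.2, 412.6, 330.1]})

X_2d = demo[["TAX"]]      # list of column names -> DataFrame with shape (3, 1)
x_1d = demo["TAX"]        # single column name   -> Series with shape (3,)
y = demo["TOTAL_VALUE"]   # the target stays one-dimensional

print(X_2d.shape, x_1d.shape, y.shape)

# NumPy alternative: reshape a 1-D array into a single-column 2-D array
X_np = x_1d.to_numpy().reshape(-1, 1)
print(X_np.shape)
```

Selecting with double brackets keeps the result two-dimensional, which is the form sklearn estimators expect for X.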

For now, consider the following variables for the regression (in practice, you need to use your domain knowledge and exploratory data analysis).

<center><b>'TAX', 'FLOORS', 'ROOMS'</b></center>

In [5]:
# Choose predictors to construct attribute matrix (X) and target (Y) accordingly.
var = ['TAX', 'FLOORS', 'ROOMS'] # define a list of variables. 
                                 # Spelling should be the same as house.columns
house_X = house[var]             # Define the X to be used (i.e., based on selected var)
house_y = house['TOTAL_VALUE']   # Define the Y

house_X.shape, house_y.shape

((5802, 3), (5802,))

In [6]:
# Print dimensions
print(house_X.shape)
print(house_y.shape)

(5802, 3)
(5802,)


### 1.2 Split the Data

Recall that we train our model only on the training set. So we first need to split our data into two parts: a training set and a test set.

To split our data, we use syntax:
**<center>sklearn.model_selection.train_test_split(Data Inputs, test_size, random_state) </center>**
- **test_size**: If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, it represents the absolute number of test samples.
- **random_state**: This parameter is for result reproduction/replication. If int, random_state is the seed used by the random number generator.

**Output of data splitting function**
- If you pass one input (e.g., X), it returns two outputs in a fixed order: X_train, X_test. The size of X_test is determined by "test_size".

- If you have multiple inputs (e.g., X, y), then the outputs will be (ordered): X_train, X_test, y_train, y_test. 
- We use multiple assignment to collect the corresponding outputs.
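The output ordering can be checked on a toy example. The arrays `X_demo` and `y_demo` below are assumptions built so that each row of X can be matched to its y value after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 observations; row i is [2*i, 2*i + 1]
y_demo = np.arange(10)                  # target i belongs to row i

# One input -> two outputs: train part first, then test part
X_a, X_b = train_test_split(X_demo, test_size=0.3, random_state=42)

# Two inputs -> four outputs, collected with multiple assignment
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Rows of X and entries of y are shuffled together, so they stay aligned
print(np.array_equal(X_tr[:, 0] // 2, y_tr))

# The same random_state reproduces the same split
print(np.array_equal(X_a, X_tr))
```

Fixing `random_state` is what makes the split, and therefore every downstream number, reproducible across runs.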



**Practice:** 
- Randomly split the dataset into 70% training and 30% test. Set the random seed to 42.
- Show how many observations there are in the training set and in the full set.

In [10]:
# Split the dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(house_X, house_y, test_size=0.3, random_state=42)

In [15]:
# Show the number of observations in the training set and in the full set

X_train.shape[0], house_X.shape[0]

(4061, 5802)

### 1.3 Linear Regression Model

Now we need to train our model. First, we initialize the regression model. That is, we import the model from Scikit-Learn and create the linear regression model instance.

Use syntax:

**<center>sklearn.linear_model.LinearRegression()</center>**


In [12]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()


The general workflow for applying ML algorithms is the same across sklearn models. Specifically:
- We need to first **train** our model. That is, we plug in our corresponding data so as to get resulting parameters.
    - In sklearn, use .fit(INPUTS) to train the model. For supervised learning models, we need to let the model know both the X and the y. Thus, syntax would be **MODEL_NAME.fit(X_train, y_train)**

- After training, we can get some outputs (e.g., parameters) and use the trained model for some further analysis/calculations (e.g., prediction). 
    - In sklearn, use .predict(INPUT) to get predictions from a **trained** model. This time we are predicting y, so y should not be an input. The syntax would be **MODEL_NAME.predict(X)**

- Performance evaluation is a key factor for supervised learning algorithms. 
    - In sklearn, use **MODEL_NAME.score(X, y)** to obtain the performance measure based on the inputs.

Specific to linear regression, we may also be interested in the parameter estimates.
- To get the slope coefficients (one per predictor, in the column order of X), use **MODEL_NAME.coef_**
- To get the intercept, use **MODEL_NAME.intercept_**
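To make the roles of these attributes concrete, the following sketch fits a model to noise-free synthetic data with known parameters. The data, `true_coef`, the intercept of 4.0, and the name `demo_model` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from known parameters
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)     # training: estimates slopes and intercept

print(demo_model.coef_)            # one slope per column of X, in column order
print(demo_model.intercept_)       # the intercept is stored separately

# For a linear regression, predict() amounts to X @ coef_ + intercept_
manual = X_demo @ demo_model.coef_ + demo_model.intercept_
print(np.allclose(manual, demo_model.predict(X_demo)))
```

Because the data contain no noise, the fitted values should match the parameters used to generate them up to floating-point error.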

**Practice:**
- Train the model. Then obtain the coefficients (i.e., all parameter estimates, including the intercept) of the model. Specify which one is the intercept.

In [16]:
# Train the model using the training set

model.fit(X_train, y_train)


LinearRegression()

In [18]:
# First element: slopes for TAX, FLOORS, ROOMS (in that order); second element: the intercept
model.coef_, model.intercept_

(array([ 0.07949157,  0.00028905, -0.00028666]), 0.03971078352793711)

In [20]:
# Print out the coefficients
# A Trick: set the precision for numpy arrays using "np.set_printoptions". Remember to load numpy.
np.set_printoptions(precision=4, suppress=True) # True means to print as fixed point notation

model.coef_, model.intercept_

(array([ 0.0795,  0.0003, -0.0003]), 0.03971078352793711)

In [21]:

# The print options apply only to NumPy arrays, so wrap the intercept (a plain float) in np.array
model.coef_, np.array(model.intercept_)

(array([ 0.0795,  0.0003, -0.0003]), array(0.0397))

### 1.4 Predictions
Once the model is trained, the main task is to evaluate it based on what it says about new data. In sklearn, we first obtain the predicted value, then calculate the corresponding performance measures. 

Recall that to obtain an unbiased performance evaluation, and to detect overfitting, we need to get predictions for the **test set** and compare the predicted values to their true values. You can also make predictions for the training set (by changing the input argument of predict() from X_test to X_train), or for any "new" sample, but these should not be used as performance measures of your model.

**Practice:**
- Make prediction for test set
- Predict for a new house with tax = 4330, floors = 2, and rooms = 6

In [25]:
# Test set prediction

y_pred_test = model.predict(X_test)


In [24]:
# New point prediction

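# The input must be 2-D (one row, three columns); naming the columns to match
# the training data avoids a feature-name warning in recent sklearn versions
X_new = pd.DataFrame([[4330, 2, 6]], columns=var)
model.predict(X_new)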



array([344.2371])

### 1.5 Performance Evaluation - Regression Models
We consider two measures for performance evaluation: MAE and MSE.

**Practice**
- Compute the MAE for the test set. (Hint: to obtain absolute value, use np.abs())
- Compute the MSE for the test set.

In [27]:
# MAE: take the absolute value of each error, then average

e = y_test - y_pred_test   # prediction errors on the test set
np.abs(e)                  # absolute values of the errors



5070    0.019308
1103    0.006426
812     0.028565
1632    0.037247
1128    0.035591
          ...   
5648    0.003542
5126    0.032590
1338    0.016541
3885    0.006746
4269    0.015168
Name: TOTAL_VALUE, Length: 1741, dtype: float64

In [28]:
MAE_test = np.mean(np.abs(e) )
MAE_test

0.01973323924212563

In [29]:
# MSE

e**2

5070    0.000373
1103    0.000041
812     0.000816
1632    0.001387
1128    0.001267
          ...   
5648    0.000013
5126    0.001062
1338    0.000274
3885    0.000046
4269    0.000230
Name: TOTAL_VALUE, Length: 1741, dtype: float64

In [31]:
MSE_test = np.mean(e**2)
MSE_test

0.0005217407559530123

In [32]:
# RMSE is the square root of MSE, which puts the error back on the same scale as y

MSE_test**0.5

0.02284164521117103
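As a cross-check on the manual calculations, the same measures are available in `sklearn.metrics`. The sketch below uses four made-up true and predicted values (`y_true`, `y_hat`) rather than the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

e = y_true - y_hat                    # errors: [0.5, 0.0, -1.5, -1.0]
mae_manual = np.mean(np.abs(e))       # (0.5 + 0 + 1.5 + 1) / 4 = 0.75
mse_manual = np.mean(e ** 2)          # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
rmse_manual = mse_manual ** 0.5       # same units as y

# Library equivalents; argument order is (true values, predictions)
mae_sk = mean_absolute_error(y_true, y_hat)
mse_sk = mean_squared_error(y_true, y_hat)

print(mae_manual, mae_sk)
print(mse_manual, mse_sk)
print(rmse_manual)
```

The manual and library results should agree; the library functions are mainly a convenience and a guard against arithmetic slips.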

Many regression analyses also use another measure, R-squared ($R^2$), to evaluate performance. Specifically:

<br><center>$R^2 = 1 - \frac{RSS}{TSS}$</center>

where RSS is the residual sum of squares (i.e., $n \times MSE$), $RSS = \sum(y_i - \hat{y}_i)^2$, and TSS is the total sum of squares, $TSS = \sum(y_i - \bar{y})^2$.

Thus, R-squared shows how much of the variance in the target the model captures. A higher R-squared value indicates better performance, with a maximum of 1. We do not need to calculate R-squared manually: for regression models, the .score method returns it.
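To connect the formula to the .score method, the following sketch computes $R^2$ by hand and compares it with sklearn's results. The synthetic data and the name `demo_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with some noise, so that R-squared is below 1
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(150, 2))
y_demo = 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(0, 1.0, size=150)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

rss = np.sum((y_demo - y_hat) ** 2)           # residual sum of squares
tss = np.sum((y_demo - y_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1 - rss / tss

print(np.isclose(r2_manual, demo_model.score(X_demo, y_demo)))
print(np.isclose(r2_manual, r2_score(y_demo, y_hat)))
```

Note that TSS uses the mean of whichever y values are passed in, so scoring the test set compares the model against the test-set mean.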

In [33]:
model.score(X_test, y_test) #test set

0.9999999499089725

In [34]:
model.score(X_train, y_train) #train set

0.9999999464980553