# Homework 1 - Linear Regression

## Dataset
The dataset you will be using is the "Bike Sharing" dataset.

There are two data files: "BikeSharing_training.csv" and "BikeSharing_Xtest.csv"<br/>
Both files have the following fields, except that cnt is not available in "BikeSharing_Xtest.csv".

Features:
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- mnth : month (1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)

Target:
- cnt: count of total rental bikes


The training dataset, "BikeSharing_training.csv", contains 300 rows and 12 columns: the 11 features and the target.<br/>
The test dataset, "BikeSharing_Xtest.csv", contains 200 rows and 11 columns: the features only.<br/>

Your goal is to predict the number of total rental bikes (cnt) based on the features.

In [219]:
import csv
import numpy as np
import pandas as pd

Load the training data and view the first 5 rows.

In [478]:
# Load the data


# Show the first 5 lines


## Data Exploration
Plot histograms of the features "temp", "atemp", "hum" and "windspeed" to understand the distributions of the continuous values.<br/>


In [459]:
### WRITE CODE TO OBTAIN AND DISPLAY HISTOGRAMS ###


##### Q1. What can you infer from the histograms? <br/>
Ans- 

Compute the correlation matrix to get an understanding of the correlation between cnt and the other features.<br/>
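As a rough illustration of what a correlation matrix looks like, here is a minimal sketch on a made-up toy DataFrame. The column names `x1`, `x2` and `y` and all values are hypothetical and are not taken from the bike-sharing files; `DataFrame.corr()` computes pairwise Pearson correlations by default.

```python
import pandas as pd

# Toy data (made-up values, for illustration only).
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [4.0, 1.0, 3.0, 2.0],
    "y":  [2.0, 4.1, 5.9, 8.2],
})

# Pairwise Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(3))

# The column of interest: correlation of each variable with the outcome.
print(corr["y"].round(3))
```

Reading off a single column of the matrix, as in the last line, is usually the quickest way to compare how strongly each feature is linearly related to the outcome.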



##### Answer the following questions:<br/>

##### Q2. Why is the diagonal made up of 1's in the correlation matrix?<br/>
Ans - 

##### Q3. Why is the matrix symmetric about the diagonal?<br/>
Ans - 

##### Q4. Looking at the correlation matrix, if you had to choose one predictor for a simple linear regression model with cnt as the outcome, which one would you choose and why? <br/>
Ans - 


### Standardization of features

Feature standardization rescales each feature in the data to have zero mean and unit variance. This method is widely used for normalization in many machine learning algorithms. The general method of calculation is to determine the mean and standard deviation of each feature, subtract the mean from each value of that feature, and then divide the result by the standard deviation.

$x' = \dfrac{x - \bar{x}}{\sigma}$

where $x$ is the original feature vector,
$\bar{x}$ is the mean of the feature vector and
$\sigma$ is its standard deviation.

This is also called Z-score Normalization. 
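The formula above can be sketched directly in NumPy and checked against scikit-learn. This is only an illustration on a made-up array (`X_toy` is hypothetical, not the homework data); note that `StandardScaler` uses the population standard deviation, which corresponds to `ddof=0` in NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two columns (made-up values).
X_toy = np.array([[0.2, 10.0],
                  [0.4, 20.0],
                  [0.6, 30.0],
                  [0.8, 40.0]])

# Manual z-score: subtract the column mean, divide by the column std.
# ddof=0 (population std) matches the convention used by StandardScaler.
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0, ddof=0)
X_manual = (X_toy - col_mean) / col_std

# The same transformation with scikit-learn.
scaler = StandardScaler()
X_sklearn = scaler.fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```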

Perform Z-score Normalization on "temp", "atemp", "hum" and "windspeed".


In [460]:
from sklearn.preprocessing import StandardScaler





##### Q5. What are the advantages and disadvantages of using Z-score Normalization?<br/>
Ans-

##### Q6. In this dataset, do you need to use the Z-score Normalization? Explain.<br/>
Ans- 

### One-Hot Encoding

"temp", "atemp", "hum" and "windspeed" are continuous, whereas the other features take discrete values; e.g. "mnth" can only take the integers from 1 to 12. Categorical features should be one-hot encoded before they are passed to the model. One-hot encoding converts each categorical variable into a set of binary indicator columns, one per category, so that the model does not treat the category codes as ordered numeric quantities.

Perform one-hot encoding on all the categorical features and print the shape of your encoded array
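To make the idea concrete, here is a minimal sketch of one-hot encoding on a single made-up column of season codes. The array `season_toy` is hypothetical; `OneHotEncoder.fit_transform` returns a sparse matrix by default, so `.toarray()` is used to view it as a dense array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with codes 1..4 (made-up values).
season_toy = np.array([[1], [2], [4], [2], [3]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(season_toy).toarray()

# One indicator column per distinct category found during fit.
print(encoder.categories_)
print(encoded.shape)
print(encoded)
```

Each row contains exactly one 1, in the column of its category, so a single column with four categories becomes four binary columns.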

In [461]:
from sklearn.preprocessing import OneHotEncoder


# Print the shape of your encoded X


##### Q7. What are the advantages and disadvantages of using One-hot encoding?<br/>
Ans-

## Multiple Linear Regression

In the big data era, it is highly unlikely that we are interested in the effect of a single variable on another. To simultaneously account for the effects of multiple variables, we use multiple regression (which accounts for the covariances between predictors).

While an algorithmic solution to multiple regression exists, it is easier to conceptualize in terms of linear algebra. Provided $X^TX$ is invertible, the optimal $\hat{\beta}$ vector that minimizes the residual sum of squares is:

$\hat{\beta} = (X^TX)^{-1}X^Ty $
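The closed-form expression can be checked numerically. The sketch below uses synthetic data (the names `X_toy`, `y_toy` and the coefficients are made up) and assumes the design matrix includes a leading column of ones for the intercept. It solves the linear system $(X^TX)\hat{\beta} = X^Ty$ rather than forming the inverse explicitly, which is numerically safer but mathematically equivalent.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 samples, 3 features, known coefficients plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 2.0 + X_toy @ np.array([1.0, -0.5, 3.0]) + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so that beta_hat[0] is the intercept.
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# Solve (X^T X) beta = X^T y instead of inverting X^T X explicitly.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_toy)

# Compare with scikit-learn, which fits the intercept separately.
model = LinearRegression().fit(X_toy, y_toy)
print(beta_hat)
print(model.intercept_, model.coef_)
```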


Perform multiple linear regression on the training dataset, where the outcome is "cnt".

In [462]:
from sklearn.linear_model import LinearRegression

In [463]:
# Building and fitting the Linear Regression model

# Evaluating the Linear Regression model by computing MSE on training set
from sklearn.metrics import mean_squared_error


###### Q8. Print the value of bhat

###### Q9. Is there a problem of multicollinearity? If so, explain what you can do about it.<br/>
Ans- 

### Goodness of fit

A model can always make predictions, but it is important to determine how good those predictions are.
How do we know that our model captures the data well? When evaluating model fit, a common metric is $R^2$, the proportion of the variance in the outcome that is explained by the model. The formula for $R^2$ is the following:

$R^2 = 1 - \dfrac{RSS}{TSS}$<br/>
where:<br/>
$RSS = \sum_i (y_i - \hat{y}_i)^2$<br/>
$TSS = \sum_i (y_i - \bar{y})^2$<br/>

$R^2$ is also one metric for comparing models against each other. It is intuitive to say that the model that explains more variation in the data is a better fit than one that explains less variation. 
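The definition of $R^2$ can be worked through by hand on a tiny example. The values of `y_true` and `y_hat` below are made up for illustration; the result is compared with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy observed values and predictions (made-up numbers).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

# Residual sum of squares: squared errors of the model's predictions.
rss = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: squared deviations from the mean of y.
tss = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - rss / tss
print(rss, tss, r2_manual)
print(r2_score(y_true, y_hat))
```

Here every prediction is off by 0.5, so RSS is 1.0, while TSS is 20.0, giving an $R^2$ of 0.95.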

Fill in the code for calculation of R2 score 

In [464]:
from sklearn.metrics import r2_score

#### R2 score for model with "temp" as predictor and "cnt" as outcome

In [465]:


# Print R2 score


#### R2 score for model with "temp", "hum" as predictors and "cnt" as outcome

In [466]:


# Print R2 score


#### R2 score for model with  "temp", "atemp", "hum" as predictors and "cnt" as outcome

In [467]:


# Print R2 score


You can see that $R^2$ keeps going up as we add features; on the training data it can never decrease when a predictor is added.

This is one drawback of using only $R^2$ to evaluate your model: adding predictors always appears to improve the model, even when the added predictors do not improve its predictions on new data.

That is to say, we are not necessarily interested in making a perfect prediction of our training data. If we were, we would always use all of the predictors available. Rather, we would like to predict our test data well. In this case, adding all the predictors may not be a good idea due to the trade-off between bias and variance. Thus, we are interested in the most predictive features, in the hope that we can create a simpler model that performs well in the future.

This is why we consider another metric, Adjusted R2.
The adjusted R-squared increases only if the new term improves the model more than would be expected by chance.


$\text{Adjusted } R^2 = 1 - \dfrac{(1-R^2)(n-1)}{n-k-1}$<br/>
where:<br/>
n = number of samples<br/>
k = number of features
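The formula translates into a one-line helper. This is a sketch only: the function name `adjusted_r2` is hypothetical, and the $R^2$ value of 0.5 is a made-up number chosen to show how the penalty grows with the number of features when $R^2$ itself stays fixed.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k features fitted on n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same (made-up) R^2, same sample size, increasing number of features:
# the adjusted score drops as k grows.
for k in (1, 2, 3):
    print(k, adjusted_r2(0.5, n=300, k=k))
```

A new feature therefore has to raise $R^2$ by more than this penalty before the adjusted score improves.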

Fill in the code for calculation of adjusted R2 score

#### Adjusted R2 score for model with "temp" as predictor and "cnt" as outcome

In [468]:


# Print Adjusted R2 score


#### Adjusted R2 score for model with "temp", "hum" as predictors and "cnt" as outcome

In [469]:


# Print Adjusted R2 score


#### Adjusted R2 score for model with  "temp", "atemp", "hum" as predictors and "cnt" as outcome

In [470]:


# Print Adjusted R2 score


### K-fold Cross-Validation

However, adjusted R2 is not enough to help us achieve the best model. A more robust method is k-fold cross-validation.

* Randomly split the dataset into K equal-sized subsets, or folds.
* Treat each fold in turn as the validation set: for the k-th fold, train on the other K-1 folds and test on the k-th fold only.
* The overall error is then the mean error over all K models.
* 5-fold and 10-fold cross-validation are the most common choices.
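The splitting step in the list above can be sketched with plain NumPy. This shows only how the row indices might be partitioned into folds, not the model fitting or error averaging; the helper name `make_folds` and its `seed` parameter are hypothetical, and `np.array_split` is used so that the folds differ in size by at most one row when the sample count is not divisible by K.

```python
import numpy as np

def make_folds(n_samples, k, seed=0):
    """Return a list of k index arrays that partition range(n_samples)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # random order of the row indices
    return np.array_split(indices, k)      # k nearly equal-sized chunks

folds = make_folds(n_samples=10, k=5)

for i, val_idx in enumerate(folds):
    # All rows that are not in the i-th fold form the training set.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(i, sorted(val_idx.tolist()), len(train_idx))
```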

Please implement 5-fold cross-validation yourself to find the best model in terms of mean squared error (MSE).

**Do not use sklearn.model_selection.cross_val_score or other built-in cross-validation functions**

In [479]:
# Design a function to implement 5-fold cross-validation.
# Input: training features X, training target y.
# Output: the average MSE over the 5 folds.

def cross_val_mse(X, y):
    # Write your code here
    
    
    return()

# Using the function above, find the combination of features with the lowest average MSE





In [474]:
# Print the best features 




### Test your model
Now, apply your best model to predict the target values from the test feature set "BikeSharing_Xtest.csv". We will grade this part based on your prediction error.

Hint: Be careful with standardization and one-hot encoding (if you use them); any transformation applied to the training set must be applied to the test set in the same way.

Hint 2: You may want to modify the previous steps so that the test set is transformed consistently with the training set.
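One common way to keep the two sets consistent is to fit a transformer on the training features only and then reuse the fitted object on the test features. The sketch below illustrates this with `StandardScaler` on made-up arrays (`X_train_toy` and `X_test_toy` are hypothetical); it is one possible approach, not the required one.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test features (made-up values).
X_train_toy = np.array([[0.1], [0.3], [0.5], [0.7]])
X_test_toy = np.array([[0.2], [0.9]])

# Learn the mean and standard deviation from the training set only.
scaler = StandardScaler().fit(X_train_toy)

# Apply the same training-set statistics to both sets.
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

The test set is scaled with the training mean and standard deviation, so its scaled values need not have zero mean or unit variance.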

In [477]:









# Output your prediction on test set as y_pred. It should be a 200 x 1 vector.
# y_pred =

In [476]:
#end