Created by: Sangwook Cheon

Date: Dec 23, 2018

This is a step-by-step guide to regression using scikit-learn, which I created for reference. I added some useful notes along the way to clarify things. I am excited to move on to more advanced concepts, such as deep learning, using frameworks like Keras and TensorFlow.
The content of this notebook is from the A-Z Datascience course, and I hope it will be useful to those who want to review the material covered, or to anyone who wants to learn the basics of regression.

# Content:

### 1. Simple Linear Regression
### 2. Multiple Linear Regression
### 3. Polynomial Regression
### 4. Support Vector Regression (SVR)
### 5. Decision Tree Regression
### 6. Random Forest Regression
### 7. R-squared/Adjusted R-squared
_______________________________________________________
_______

# Simple Linear Regression  

![i7](https://i.imgur.com/LEgEZqA.png)

In [None]:
#data preprocessing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('../input/salary-data/Salary_Data.csv')
x = data.iloc[:, :-1].values
y = data.iloc[:, 1].values

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 1/3, random_state = 42)

#No feature scaling is needed here: rescaling the feature does not change the predictions of ordinary least squares.

#fitting regressor
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)



In [None]:
#predicting the test set results
y_pred = regressor.predict(X_test) #vector of all predictions of the dependent variable
y_pred

In [None]:
#visualize the training set results
plt.scatter(X_train, Y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title("years vs salary")
plt.xlabel("number of years")
plt.ylabel("salary (dollars)")
plt.show()

In [None]:
#visualizing test set results
plt.scatter(X_test, Y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue') #this should not change as the regressor fitted to train set should be shown
plt.title("years vs salary")
plt.xlabel("number of years")
plt.ylabel("salary (dollars)")
plt.show()

# Multiple linear regression
![i8](https://i.imgur.com/1YjjMvT.png)

## Dummy variables
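Never include all dummy variable columns together with an intercept. Always omit one column (e.g. if there are 9 categories, include 8 columns), because the omitted column is fully determined by the others. Including all of them leads to the
--> Dummy variable trap (perfect multicollinearity), which prevents the model from estimating a unique coefficient for each column.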
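
As a small illustration of the rule above, the sketch below encodes a toy column with pandas. The data is hypothetical, and `pd.get_dummies(..., drop_first=True)` is one possible way to omit a column; it is not what the notebook itself uses later.

```python
import pandas as pd

# Hypothetical toy column; the notebook itself uses the 'State' column of 50_Startups.csv
states = pd.DataFrame({"State": ["New York", "California", "Florida", "California"]})

all_dummies = pd.get_dummies(states["State"])                    # one column per category
safe_dummies = pd.get_dummies(states["State"], drop_first=True)  # first category omitted

# Every row of the full encoding sums to 1, which duplicates the intercept column
row_sums = all_dummies.sum(axis=1)
print(all_dummies.shape[1], safe_dummies.shape[1])
```

Because the rows of the full encoding always sum to 1, any one dummy column can be computed from the others, which is why one of them can be dropped without losing information.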

## 5 methods of building models:

**1. All-in**

*     • Putting all the variables into the equation.
*     • Prepare the Backward Elimination

**2. Backward Elimination**

*     • Step 1: Select a significance level to stay in the model
*     • Step 2: Fit the full model with all possible predictors
*     • Step 3: Consider the predictor with the highest P-value. If P > Significance level, go to Step 4, otherwise go to FIN
*     • Step 4: Remove the predictor
*     • Step 5: Fit model without this variable. Go back to Step 3

**3. Forward Selection**

*     • Step 1: Select a significance level to enter the model
*     • Step 2: Fit all simple regression models y ~ xn. Select the one with the lowest P-value.
*     • Step 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
*     • Step 4: Consider the predictor with the lowest P-value. If P < SL, go to Step 3, otherwise go to FIN (keep the previous model)

**4. Bidirectional Elimination**

*     • Step 1: Select a significance level to enter and to stay in the model (e.g. SLENTER = 0.05, SLSTAY = 0.05)
*     • Step 2: Perform the next step of Forward Selection (new variables must have P < SLENTER to enter)
*     • Step 3: Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay)
*     • Step 4: Repeat Steps 2 and 3 until no new variables can enter and no old variables can exit. Then go to FIN.

**5. All Possible Models**

*     • Step 1: Select a criterion of goodness of fit
*     • Step 2: Construct all possible regression models: 2^n - 1 total combinations
*     • Step 3: Select the one with the best criterion value.
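
To make the "All Possible Models" method concrete, here is a minimal sketch with NumPy only. The synthetic data, the helper name `adjusted_r2`, and the choice of adjusted R-squared as the goodness-of-fit criterion are all assumptions made for illustration.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept on X and return the adjusted R-squared."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Synthetic data (an assumption): y depends on columns 0 and 2 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Enumerate every non-empty subset of the n = 3 predictors: 2^3 - 1 = 7 models
n_cols = X.shape[1]
subsets = [c for k in range(1, n_cols + 1) for c in combinations(range(n_cols), k)]
scores = {c: adjusted_r2(X[:, list(c)], y) for c in subsets}
best = max(scores, key=scores.get)
print(len(subsets), best)
```

The number of models doubles with every extra predictor, so this exhaustive approach is only practical for a small number of variables.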


In [None]:
data2 = pd.read_csv('../input/m-50-startups/50_Startups.csv')
x = data2.iloc[:, :-1]
y = data2.iloc[:, 4]

from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

xtemp = x.iloc[:, 3]
labelencoder = LabelEncoder()
xtemp = labelencoder.fit_transform(xtemp)
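xtemp = pd.DataFrame(to_categorical(xtemp)).add_prefix('State_') #string column names, because recent scikit-learn versions reject mixed int/str column names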

x = x.drop(['State'], axis = 1) #In pandas axis = 1 --> column
x = pd.concat([x, xtemp], axis = 1)

# x = x.iloc[:, :-1] would drop one dummy column to avoid the dummy variable trap. LinearRegression can still fit and predict with all of them, so it is skipped here.

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)



In [None]:
#Linear Regression

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

y_pred = regressor.predict(X_test)
y_pred, Y_test

print(x.shape[1])

In [None]:
#Find an optimal team of independent variables, so that each variable has significant impact on the prediction. 
# --> Backward elimination

import statsmodels.api as sm #the array-based OLS class lives in statsmodels.api, not in statsmodels.formula.api
x = np.array(x) #use numpy arrays instead of DataFrames for more useful functions. DataFrames are useful for preparing dataset
x = np.append(np.ones((x.shape[0], 1), dtype = 'int'), x, axis = 1) #(x.shape[0], 1).astype(int) does not work
#Above is done to add constant to the model, which is necessary for Ordinary Least Squares to work

In [None]:
x_opt = x[:, [0,1,2,3,4,5]] #needs to specify all the indexes, so that individual index is evaluated.
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary() #shows statistical summary

In [None]:
'''The significance level is set to 0.05. If a P-value is lower than this,
the variable is statistically significant. If it is higher, the variable is not
significant and is a candidate for removal. Removing the variable with the
highest P-value and refitting is called backward elimination.

In this case, as x4 has a P-value of 0.990, it needs to be removed'''

x_opt = x[:, [0,1,2,3,5]] #index 4 (x4) is dropped; the remaining indexes are listed explicitly
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary() #shows statistical summary


In [None]:



#repeat the step --> remove the insignificant variable, fit it, repeat it.

x_opt = x[:, [0,1,3,5]] #needs to specify all the indexes, so that individual index is evaluated.
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary() #shows statistical summary

In [None]:
x_opt = x[:, [0,1,3]] #needs to specify all the indexes, so that individual index is evaluated.
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary() #shows statistical summary

In [None]:
#To check which columns the selected indexes refer to
x_show = pd.DataFrame(x)
# x_show

#Therefore, only the variables at indexes 1 and 3 (plus the constant at index 0) are kept

# Polynomial regression
$$y = b_0 + b_1 x_1 + b_2 x_1^2 + \dots + b_n x_1^n$$
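
The equation is still linear in the coefficients, so polynomial regression is ordinary linear regression on expanded features. As a quick sketch with hypothetical toy inputs, `PolynomialFeatures` produces one column per power of x:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.array([[2.0], [3.0]])   # hypothetical toy inputs with a single feature
expanded = PolynomialFeatures(degree=2).fit_transform(x_demo)
print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

The columns are x^0 (the constant), x^1 and x^2, matching the terms b0, b1 and b2 in the equation.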

In [None]:
data3 = pd.read_csv('../input/polynomial-position-salary-data/Position_Salaries.csv')
x = data3.iloc[:, 1:2].values #1:2 is done instead of only 1, because independent variable should be a matrix.
y = data3.iloc[:, 2].values

# X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
# Use whole dataset to train

data3
plt.scatter(data3.iloc[:, 1], y)
plt.title('Level vs Salary')
plt.xlabel("Level")
plt.ylabel("Salary (dollars)")
plt.show()

In [None]:
from sklearn.preprocessing import PolynomialFeatures
#transforms the x matrix into a new matrix whose columns are 1, x, x^2, ... up to the chosen degree
poly_reg = PolynomialFeatures(degree = 2) #specify the degree -> the highest power of x
x_poly = poly_reg.fit_transform(x)
x_poly

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(x_poly, y)

y_pred = lin_reg.predict(poly_reg.transform(x)) #the input is transformed here instead of reusing x_poly, so that the same call works for any matrix input x

plt.figure(2)
plt.scatter(x, y)
plt.plot(x, y_pred)
plt.show()

In [None]:
#improving the model --> add degrees to make it more complex

poly_reg = PolynomialFeatures(degree = 4) #specify the degree -> how many terms
x_poly = poly_reg.fit_transform(x)
x_poly

lin_reg = LinearRegression()
lin_reg.fit(x_poly, y)

y_pred = lin_reg.predict(poly_reg.transform(x)) #the input is transformed here instead of reusing x_poly, so that the same call works for any matrix input x

#This is to get a smoother curve, by plotting more x values.
x_grid = np.arange(x.min(), x.max(), 0.1) #0.1 --> increment; x.min() and x.max() return scalars, unlike min(x) and max(x)
x_grid = x_grid.reshape(x_grid.shape[0], 1)

plt.figure(2)
plt.scatter(x, y)
plt.plot(x_grid, lin_reg.predict(poly_reg.transform(x_grid)))
plt.show()
plt.show()

# Support Vector Regression (SVR)
![i8](https://i.imgur.com/QDhMroy.png)

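In SVR, the objective is to keep the errors within a threshold (a margin of width epsilon around the prediction), and points inside that margin are not penalized. In linear regression, by contrast, the objective is to minimize the squared error between every prediction and the data.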
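
The difference between the two objectives can be sketched with the per-point error each one assigns. The function name `epsilon_insensitive` and the numbers are hypothetical; `epsilon=0.1` is chosen here because it is the default of scikit-learn's `SVR`.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon=0.1):
    """Error that is zero inside the epsilon margin and grows linearly outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.05, 2.5, 3.0])   # hypothetical predictions

svr_loss = epsilon_insensitive(y_true, y_hat)
squared_loss = (y_true - y_hat) ** 2
print(svr_loss)       # the first and last points lie inside the margin
print(squared_loss)   # the squared error penalizes every non-zero deviation
```

Only the second point, which misses by 0.5, contributes to the SVR-style error, whereas the squared error also charges the first point for its small deviation.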

In [None]:
data3 = pd.read_csv('../input/polynomial-position-salary-data/Position_Salaries.csv')
x = data3.iloc[:, 1:2].values #1:2 is done instead of only 1, because independent variable should be a matrix.
y = data3.iloc[:, 2].values
y = y.reshape(y.shape[0], 1)
x = x.reshape(x.shape[0], 1)

print(y.shape)
# X_train, X_test, Y_train, Y_test = train_test_split()

#svr does not have feature scaling built in
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
x = sc_x.fit_transform(x)
y = sc_y.fit_transform(y).ravel() #SVR expects a 1-D target, so flatten it after scaling

from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(x, y)

#input scaled value, then inverse scale the predicted value (reshaped to 2-D, as the scaler requires)
y_pred = sc_y.inverse_transform(regressor.predict(sc_x.transform(np.array([[6.5]]))).reshape(-1, 1))
y_pred

# Decision Tree Regression
![i9](https://i.imgur.com/JZkzrXt.png)

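Information Entropy --> the algorithm tries to find an optimal way to split the dataset into sections (each final section is called a leaf). For regression, scikit-learn's DecisionTreeRegressor by default picks the splits that most reduce the squared error.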

![i10](https://i.imgur.com/CdxakvJ.png)
Take the average of the training points in each leaf, and assign that value to any new point that falls in that leaf.
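
A minimal sketch of this averaging, using hypothetical toy data with two obvious groups and a tree limited to a single split:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a low group (1, 2, 3) and a high group (10, 11, 12)
x_demo = np.array([[1], [2], [3], [10], [11], [12]])
y_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# max_depth=1 allows a single split, so the tree has exactly two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x_demo, y_demo)
leaf_values = stump.predict(np.array([[2.5], [11.5]]))
print(leaf_values)   # each prediction is the mean of one group
```

Any input that lands in the low leaf gets the average of 1, 2 and 3, and any input in the high leaf gets the average of 10, 11 and 12, which is why a tree's prediction curve looks like a staircase.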

In [None]:
data3 = pd.read_csv('../input/polynomial-position-salary-data/Position_Salaries.csv')
x = data3.iloc[:, 1:2].values #1:2 is done instead of only 1, because independent variable should be a matrix.
y = data3.iloc[:, 2].values

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(x, y)

y_pred = regressor.predict(np.array([[6.5]]))

x_grid = np.arange(x.min(), x.max(), 0.01) #x.min() and x.max() return scalars, unlike min(x) and max(x)
x_grid = x_grid.reshape(len(x_grid), 1)
plt.figure(3)
plt.plot(x_grid, regressor.predict(x_grid))
plt.show()

#Notice how average is used to represent each interval.

# Random Forest Regression

* Step 1: Pick K data points at random from the training set
* Step 2: Build the Decision Tree associated with these K points
* Step 3: Choose the number of Decision Trees to build, and repeat Steps 1 and 2
* Step 4: For a new data point, make each Decision Tree output a prediction, and assign the average of these values as the final prediction.
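
The four steps above can be sketched by hand with individual decision trees. This is an illustration under assumptions, not how the notebook trains its forest: the toy data is hypothetical, K is set to the dataset size, and points are drawn with replacement.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)   # hypothetical levels 1..10
y_demo = x_demo.ravel() ** 2                            # hypothetical target

n_trees, k = 25, len(x_demo)   # Step 3: number of trees (K chosen as an assumption)
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(x_demo), size=k)          # Step 1: K random points, with replacement
    tree = DecisionTreeRegressor(random_state=0)
    trees.append(tree.fit(x_demo[idx], y_demo[idx]))    # Step 2: one tree per sample

# Step 4: average the predictions of all trees for a new point
new_point = np.array([[6.5]])
forest_pred = np.mean([t.predict(new_point)[0] for t in trees])
print(forest_pred)
```

Each tree sees a slightly different sample, so the individual predictions differ, and averaging them smooths out the staircase of a single tree.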

## Forest --> A team of trees

In [None]:
data3 = pd.read_csv('../input/polynomial-position-salary-data/Position_Salaries.csv')
x = data3.iloc[:, 1:2].values #1:2 is done instead of only 1, because independent variable should be a matrix.
y = data3.iloc[:, 2].values

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 5000, random_state = 0) #can tune the n_estimators; the default criterion is squared error, which older scikit-learn versions called 'mse'
regressor.fit(x, y)

y_pred = regressor.predict([[6.5]])

x_grid = np.arange(x.min(), x.max(), 0.01) #x.min() and x.max() return scalars, unlike min(x) and max(x)
x_grid = x_grid.reshape(len(x_grid), 1)
plt.plot(x_grid, regressor.predict(x_grid))
plt.show()
print(y_pred)

# R-Squared

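$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

$SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ --> Sum of squared residuals

$SS_{tot} = \sum_i (y_i - y_{avg})^2$ --> Total sum of squares (the squared error of the average line)

The best-fit line minimizes $SS_{res}$, which maximizes $R^2$. $R^2$ therefore measures how much better the fitted line is than the average line.

**Closer to 1, better. If smaller, worse**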
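
As a worked example of the definition, the sketch below computes the value by hand with NumPy. The targets and predictions are hypothetical.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])   # hypothetical predictions

ss_res = np.sum((y_true - y_hat) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the average line
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))   # 0.98
```

Here the residuals add up to 0.1 while the average line has a squared error of 5, so the model removes 98% of the error of the average line.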

![i11](https://i.imgur.com/4KGJmle.png)

# Adjusted R-Squared

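This accounts for the addition of non-meaningful regressors. R-squared never decreases when a variable is added, even a useless one, so Adjusted R-Squared penalizes the model for each added variable. This makes it a fairer metric for comparing models with different numbers of variables.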
![i12](https://i.imgur.com/bqZZu05.png)
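
A minimal sketch of the penalty, assuming the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of regressors. The function name and the numbers are hypothetical.

```python
def adjusted_r_squared(r_squared, n_samples, n_regressors):
    """Standard adjusted R-squared: penalizes each additional regressor."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_regressors - 1)

# Hypothetical numbers: the same R-squared reached with 2 or with 5 regressors
adj_small = adjusted_r_squared(0.95, n_samples=50, n_regressors=2)
adj_large = adjusted_r_squared(0.95, n_samples=50, n_regressors=5)
print(adj_small > adj_large)   # True
```

With an identical R-squared, the model that needed fewer regressors scores higher, so an extra variable only pays off if it raises R-squared by more than the penalty.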

## When doing backward elimination, check Adjusted R-Squared to see if removing a variable is beneficial to the model. If the Adjusted R-Squared grows, then removing it is a good idea.

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.698e+04 | 2689.933 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| x1 | 0.7966 | 0.041 | 19.266 | 0.000 | 0.713 | 0.880 |
| x2 | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |

When interpreting these coefficients, which are shown in the statsmodels summary, be careful about units.

The coefficient column of the table shows how much impact a variable has on the dependent variable **per unit**. If the variables have the same unit, then their coefficients can be compared directly, but if they do not, it is only valid to say "one has more impact than the other per unit".
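
To see why units matter, the sketch below fits the same synthetic data twice, once with the feature in dollars and once in thousands of dollars. The data and the helper name `slope` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
spend_dollars = rng.uniform(10_000, 100_000, size=50)   # hypothetical feature in dollars
profit = 0.8 * spend_dollars + rng.normal(scale=1_000, size=50)

def slope(feature, target):
    """Least-squares slope of target on feature (with an intercept)."""
    design = np.column_stack([np.ones(len(feature)), feature])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1]

per_dollar = slope(spend_dollars, profit)
per_thousand = slope(spend_dollars / 1_000, profit)     # same data, only the unit changed
print(per_thousand / per_dollar)   # approximately 1000
```

The relationship in the data is unchanged, yet the coefficient grows by a factor of about 1000, so a larger coefficient by itself does not mean a more important variable.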