# Regressions

In this notebook, I try to understand regression models.  
In particular, I want to answer a few questions: what is the math behind them, how are they used, and when should they be used?  

# 1. Importing libraries and loading datasets

In [None]:
import numpy as np
import pandas as pd

# Plot
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling
from sklearn.linear_model import LinearRegression

In [None]:
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv', index_col=0)
test_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv', index_col=0)

# 2. Explore data

In [None]:
train_data.tail()

In [None]:
train_data.describe()

In [None]:
print("Columns: \n{0} ".format(train_data.columns.tolist()))

# 3. Basic data check

In [None]:
missing = train_data.isna()
percent = (missing.sum()/missing.count()*100).sort_values(ascending=False)
missing_columns = percent[percent > 0].index.tolist() # Any
print('Columns which have missing values: \n{0}'.format(missing_columns))
#missing_columns = percent[percent > 10].index.tolist() # More than 10 percent
#print('Columns which have more than 10% missing values: \n{0}'.format(missing_columns))

In [None]:
duplicates = train_data.duplicated().sum()
print('Duplicates in train data: {0}'.format(duplicates))

# 4. Taking care of the missing data

It looks like I won't need all of those columns anyway, at least for my purposes (we will see). So there should be no harm in removing them for now.

In [None]:
train_data.drop(missing_columns, axis=1, inplace=True)
test_data.drop(missing_columns, axis=1, inplace=True)

In [None]:
train_data.head()

# 6. Regressions

Regression models are used to estimate the relationship between a dependent variable (target) and one or more independent variables (features).  
The most common one is the linear regression model. In its simplest form it uses a single independent variable and corresponds to the straight line that most closely fits the data according to a mathematical criterion (such as least squares).  
While linear regression models fit a straight line, logistic and other nonlinear regression models fit a curve.


Regression analysis and models are mainly used for two purposes.
* Prediction and forecasting.
* To understand relationships between the independent and dependent variables.


**References**  
https://en.wikipedia.org/wiki/Regression_analysis  
https://corporatefinanceinstitute.com/resources/knowledge/finance/regression-analysis/  

# 6.1 Simple Linear Regression

A simple linear regression model determines the relationship between a dependent variable and an independent variable.

It is expressed using the equation:  
$Y = \beta_0 + \beta_1X_1 + \epsilon$

where:  
$Y$ - Dependent variable  
$X_1$ - Independent variable  
$\beta_0$ - Intercept  
$\beta_1$ - Slope  
$\epsilon$ - Residual (error)  

**References**  
https://www.scribbr.com/statistics/simple-linear-regression/  
https://www.imsl.com/blog/what-is-regression-model  
https://online.stat.psu.edu/stat462/node/91/

In [None]:
X = train_data[['OverallQual']]
y = train_data['SalePrice']

regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)

In [None]:
plt.figure(figsize=(13, 6))
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title("Sale Price vs Overall Quality\n" +
          "Equation: Y = {0:.2f} + {1:.2f}X₁".format(regressor.intercept_, regressor.coef_[0]))
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()

## How does linear regression find the best fitting line?

The best fitting line is a straight line that represents the best approximation of the given data.  
The difference between the actual (observed) value and the predicted value for any data point is known as residual error.

In [None]:
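x_values = X['OverallQual']  # 1-D feature values for drawing the segments

fig, ax = plt.subplots(figsize=(13, 6))
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red', label="best fitting line")
# Short horizontal ticks mark the observed sale prices
ax.hlines(y, x_values - 0.2, x_values + 0.2, color='green')
# Vertical segments between observed and predicted values are the residuals
ax.vlines(x_values, y, y_pred, color='green', label="residuals")
plt.title("Sale Price vs Overall Quality")
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.legend()
plt.show()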

The best fitting line is the one which has the smallest possible residual errors in the overall sense.  
Regression analysis uses the “least squares method” to generate the best fitting line.  
It makes the sum of the squared prediction errors the smallest it can be.  

For the i-th point in the data set,  

$
\begin{align}
Y_i &= \beta_0 + \beta_1x_i \\
\epsilon_i &= y_i - Y_i \\
Q &= \sum_{i=1}^{n} (y_i - Y_i)^2 \\
  &= \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2
\end{align}
$

Setting the partial derivatives of $Q$ with respect to $\beta_0$ and $\beta_1$ to zero gives the least squares estimates:  

$
\begin{align}
\beta_0 &= \overline{y} - \beta_1\overline{x} \\ 
\beta_1 &= \frac{\sum_{i=1}^{n}{(x_i - \overline{x})(y_i - \overline{y})}}{\sum_{i=1}^{n}{(x_i - \overline{x})^2}}
\end{align}
$

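where:  
$x_i$ - Value of the independent variable  
$y_i$ - Actual value  
$Y_i$ - Predicted value  
$\overline{x}$, $\overline{y}$ - Sample means of $x_i$ and $y_i$  
$Q$ - Residual Sum of Squares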
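
To convince myself that the formulas above really give the best fitting line, here is a minimal sketch in plain NumPy. The data is synthetic and made up for illustration (it is not the house price data), and the helper name `rss` is my own. It computes $\beta_0$ and $\beta_1$ from the closed-form expressions, cross-checks them against `np.polyfit`, and shows that nudging the slope makes $Q$ larger.

```python
import numpy as np

# Synthetic data, assumed for illustration only (not the house price data)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

# Least squares estimates from the closed-form formulas
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean


def rss(b0, b1):
    """Residual sum of squares Q for a candidate line."""
    return np.sum((y - (b0 + b1 * x)) ** 2)


# Cross-check against NumPy's least squares polynomial fit of degree 1
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}")
print(f"np.polyfit: intercept = {intercept_np:.3f}, slope = {slope_np:.3f}")
print(f"Q at the estimates: {rss(beta_0, beta_1):.2f}")
print(f"Q with a slightly different slope: {rss(beta_0, beta_1 + 0.1):.2f}")
```

Any other intercept or slope should give a larger $Q$ than the estimates do, which is exactly what "least squares" means.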

**References**  
https://statisticsbyjim.com/glossary/ordinary-least-squares/  
https://www.numpyninja.com/post/what-is-line-of-best-fit-in-linear-regression  
https://medium.com/@rndayala/linear-regression-a00514bc45b0  
https://online.stat.psu.edu/stat501/lesson/1/1.2  
https://www.immagic.com/eLibrary/ARCHIVES/GENERAL/WIKIPEDI/W120529O.pdf  

## Assumptions of Simple Linear Regression

These assumptions are conditions that should be met before the model is used to make predictions.  
If they are violated, the results may be misleading.

Still, checking every assumption on every new dataset can be time-consuming. It is often easier to try the model (possibly several models) and compare their accuracy. When the dataset has linear relationships, a linear regression model will tend to score well compared with other models, which is a useful hint, although not proof, that the assumptions roughly hold. Even if the dataset doesn't have any linear relationships, a linear regression model can still be fitted; it will just perform poorly and score lower than the other models. When searching for the most appropriate model and for better predictions, checking the assumptions is still helpful.

Since I am trying to understand the math in this notebook, I will check all of them.

In [None]:
from statsmodels.formula.api import ols
fit = ols(formula='SalePrice~OverallQual', data=train_data).fit()
predictions = fit.predict()

### Linearity

There should be a linear relationship between the dependent variable and the independent variable. A straight line should be able to represent all points as well as possible.  
This assumption is easy to test with a scatter plot.

In [None]:
plt.figure(figsize=(13, 6))
plt.scatter(X, y, color='blue')
plt.plot(X, predictions, color='red')
plt.title("Sale Price vs Overall Quality")
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()

### Normality

The residuals must be normally distributed.  
Normality can be assessed by looking at the histogram of the residuals or by using the p-value from the [Kolmogorov-Smirnov test](https://www.statsmodels.org/stable/generated/statsmodels.stats.diagnostic.kstest_normal.html) for normality.

> If the p-value is lower than some threshold, e.g. 0.05, then we can reject the Null hypothesis that the sample comes from a normal distribution.
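
To see what the Kolmogorov-Smirnov statistic measures, here is a rough sketch that computes it by hand: the largest gap between the empirical CDF of the standardized sample and the standard normal CDF. The two samples are synthetic stand-ins for residuals, and `ks_distance_to_normal` is a name I made up. It only gives the statistic; the p-value (which `kstest_normal` also returns) needs a reference distribution that I don't reproduce here.

```python
import numpy as np
from math import erf, sqrt


def ks_distance_to_normal(sample):
    """Largest gap between the empirical CDF of the standardized sample
    and the standard normal CDF."""
    z = np.sort((sample - sample.mean()) / sample.std(ddof=1))
    n = len(z)
    normal_cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
    upper = np.arange(1, n + 1) / n - normal_cdf  # ECDF just after each point
    lower = normal_cdf - np.arange(0, n) / n      # ECDF just before each point
    return max(upper.max(), lower.max())


# Synthetic samples, assumed for illustration only
rng = np.random.default_rng(0)
normal_like = rng.normal(0, 1, size=2000)   # stands in for well-behaved residuals
skewed = rng.exponential(1.0, size=2000)    # stands in for skewed residuals

d_normal = ks_distance_to_normal(normal_like)
d_skewed = ks_distance_to_normal(skewed)
print(f"normal sample: D = {d_normal:.3f}")
print(f"skewed sample: D = {d_skewed:.3f}")
```

A larger distance means the sample looks less like a normal distribution, which is what pushes the p-value down.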

In [None]:
plt.figure(figsize=(13, 6))
sns.histplot(fit.resid, kde=True, color='blue')
plt.show()

In [None]:
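from statsmodels.stats.diagnostic import kstest_normal
labels = ['Kolmogorov-Smirnov statistic', 'p-value']
test = kstest_normal(fit.resid)
print(dict(zip(labels, test)))
# The null hypothesis is that the residuals are normally distributed,
# so a p-value below 0.05 means the assumption is violated
if test[1] < 0.05:
    print("Since p-value is lower than 0.05, normality is rejected. Hence, the model doesn't satisfy the assumption.")
else:
    print("Since p-value is not lower than 0.05, normality cannot be rejected. The assumption is satisfied.")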

### Independence

The residuals should be independent. There should be no correlation between the consecutive residuals.  
Autocorrelation is the correlation of a series with a lagged copy of itself, so it is mainly a concern when the observations have a natural order, such as time. It violates the assumption of independence.  
We will perform a [Durbin-Watson test](https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html) to determine if either positive or negative correlation is present.

> The test statistic is approximately equal to 2*(1-r) where r is the sample autocorrelation of the residuals.  
> Thus, for r == 0, indicating no serial correlation, the test statistic equals 2.  
> This statistic will always be between 0 and 4.  
> The closer to 0 the statistic, the more evidence for positive serial correlation.  
> The closer to 4, the more evidence for negative serial correlation.  
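
The quoted relationship is easy to check by hand. Below is a small sketch with two synthetic residual series (made up for illustration): one independent and one where each value keeps 90% of the previous one. The function names are my own; the statistic is the sum of squared successive differences divided by the sum of squares.

```python
import numpy as np


def durbin_watson_stat(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)


def lag1_autocorrelation(resid):
    """Sample autocorrelation r between consecutive residuals."""
    return np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2)


# Synthetic residuals, assumed for illustration only
rng = np.random.default_rng(0)
n = 5000
independent = rng.normal(size=n)  # no serial correlation

# AR(1) series: each value keeps 90% of the previous one plus fresh noise
correlated = np.zeros(n)
noise = rng.normal(size=n)
for t in range(1, n):
    correlated[t] = 0.9 * correlated[t - 1] + noise[t]

results = {}
for name, resid in [("independent", independent), ("correlated", correlated)]:
    dw = durbin_watson_stat(resid)
    r = lag1_autocorrelation(resid)
    results[name] = (dw, r)
    print(f"{name}: DW = {dw:.3f}, 2*(1-r) = {2 * (1 - r):.3f}")
```

The independent series should land near 2, and the positively correlated one much closer to 0.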

In [None]:
from statsmodels.stats.stattools import durbin_watson
test = durbin_watson(fit.resid)
print({'Durbin-Watson statistic': test})
print("Since the statistic is close to 2, this assumption is also satisfied.")

### Homoscedasticity

The residuals must have constant variance.  
Heteroscedasticity, the violation of homoscedasticity, occurs when the variance of the residuals is not constant.

It is possible to plot the residuals against the fitted values and see whether the variance appears uniform, or to use the [Breusch-Pagan test](https://www.statsmodels.org/devel/generated/statsmodels.stats.diagnostic.het_breuschpagan.html) to test for heteroscedasticity.

> Statistics provides two p-values, Lagrange Multiplier and F test (widely used and basically equivalent).  
> Heteroscedasticity is indicated if p-value < 0.05.  
> https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test
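
To understand where the Lagrange multiplier statistic comes from, here is a hand-rolled sketch of the studentized form of the test: regress the squared residuals on the predictor and take $n \cdot R^2$. The data is synthetic, `breusch_pagan_lm` is a name I made up, and the p-value shortcut with `erfc` only works because there is a single predictor (one degree of freedom). I am assuming this matches what `het_breuschpagan` reports by default, so treat it as an illustration rather than a replacement.

```python
import numpy as np
from math import erfc, sqrt


def breusch_pagan_lm(resid, x):
    """LM statistic n * R^2 from regressing squared residuals on x,
    with its chi-squared p-value for one degree of freedom."""
    n = len(resid)
    design = np.column_stack([np.ones(n), x])  # intercept + predictor
    target = resid ** 2
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    lm = max(n * (1.0 - ss_res / ss_tot), 0.0)  # guard against tiny negative rounding
    p_value = erfc(sqrt(lm / 2.0))              # chi-squared survival function, df = 1
    return lm, p_value


# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=2000)
noises = {
    "constant variance": rng.normal(0, 1.0, size=2000),      # same spread everywhere
    "growing variance": rng.normal(0, 1.0, size=2000) * x,   # spread grows with x
}

results = {}
for name, noise in noises.items():
    y = 3.0 + 2.0 * x + noise
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (intercept + slope * x)
    results[name] = breusch_pagan_lm(resid, x)
    print(f"{name}: LM = {results[name][0]:.2f}, p-value = {results[name][1]:.4g}")
```

When the spread of the residuals grows with the predictor, the squared residuals become predictable from it, $R^2$ rises, and the p-value drops.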

In [None]:
plt.figure(figsize=(13, 6))
plt.scatter(x=predictions, y=fit.resid, color='blue')
plt.title("Residuals vs Fitted")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()  

In [None]:
from statsmodels.stats.diagnostic import het_breuschpagan
labels = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
test = het_breuschpagan(fit.resid, fit.model.exog)
print(dict(zip(labels, test)))
print("Since p-value is lower than 0.05, heteroscedasticity is assumed. Hence, the model actually doesn't satisfy the assumption.")

**References**  
https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression4.html  
https://www.restore.ac.uk/srme/www/fac/soc/wie/research-new/srme/modules/mod2/6/index.html  
https://www.statology.org/linear-regression-assumptions/  
https://www.statsmodels.org/stable/stats.html

# 6.2 Multiple Linear Regression

Unlike simple linear regression models, multiple linear regression models use several independent variables that may affect the target variable.  

The model is expressed using the equation:  
$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \epsilon$

where:  
$Y$ – Dependent variable  
$X_1$, $X_2$, ..., $X_n$ – Independent variables  
$\beta_0$ – Intercept  
$\beta_1$, $\beta_2$, ..., $\beta_n$ – Slopes  
$\epsilon$ – Residual (error)  
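
As a quick sanity check of the equation, here is a minimal sketch that recovers known coefficients with plain NumPy least squares. The data is synthetic and noise-free on purpose (an assumption for illustration), so the fitted values should match the ones used to generate it.

```python
import numpy as np

# Synthetic, noise-free data, assumed for illustration only
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(1, 10, size=n)
x2 = rng.uniform(1, 10, size=n)
y = 5.0 + 2.0 * x1 - 3.0 * x2

# A column of ones lets the first coefficient act as the intercept
design = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print("intercept:", round(float(beta[0]), 3))
print("slopes:", np.round(beta[1:], 3))
```

With real data there is noise, so the estimates only approximate the underlying coefficients.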

**References**  
https://www.imsl.com/blog/what-is-regression-model  
https://corporatefinanceinstitute.com/resources/knowledge/finance/regression-analysis/  
https://towardsdatascience.com/simple-and-multiple-linear-regression-with-python-c9ab422ec29c  

In [None]:
X = train_data[['OverallQual', 'OverallCond']]
y = train_data['SalePrice']

regressor = LinearRegression()
regressor.fit(X, y);

In [None]:
OverallQual = X.values[:, 0]
OverallCond = X.values[:, 1]
SalePrice = y

x_surf, y_surf = np.meshgrid(
    np.linspace(OverallQual.min(), OverallQual.max(), 10),
    np.linspace(OverallCond.min(), OverallCond.max(), 10))
x_surf = x_surf.flatten()
y_surf = y_surf.flatten()
y_pred = regressor.predict(np.array([x_surf, y_surf]).T)

In [None]:
fig = plt.figure(figsize=(14, 14))
ax = plt.axes(projection='3d')
ax.scatter(OverallQual, OverallCond, SalePrice, color='blue')
ax.plot_trisurf(x_surf, y_surf, y_pred, color='red', alpha=0.5)
ax.set_xlabel('Overall Quality', fontsize=12)
ax.set_ylabel('Overall Condition', fontsize=12)
ax.set_zlabel('Sale Price', fontsize=12)
ax.tick_params(axis='both', labelsize=8)
ax.view_init(elev=15, azim=120)
plt.title("Sale Price vs Overall Quality and Overall Condition\n" +
          "Equation: Y = {0:.2f} + {1:.2f}X₁ + {2:.2f}X₂".format(regressor.intercept_, regressor.coef_[0], regressor.coef_[1]))
plt.show()

## Assumptions of Multiple Linear Regression

Multiple linear regression has five assumptions. Again, when they are violated, the results may be misleading.  

* Linearity: There should be a linear relationship between the dependent variable and each independent variable.
* Normality: The residuals must be normally distributed.
* Independence: The residuals should be independent.
* Homoscedasticity: The residuals must have constant variance.
* No Multicollinearity: None of the independent variables are highly correlated with each other.

It is possible to use the model and compare its accuracy with that of other models. When the multiple linear regression model has higher accuracy than the others, it suggests (but does not prove) that the dataset has some linear relationships and that the assumptions roughly hold. Still, to find the most appropriate model for the dataset, checking the assumptions is helpful.

I have already checked the first four assumptions (for simple linear regression), so I will only check the last one here.

**References**  
https://www.statology.org/multiple-linear-regression-assumptions/  
https://jeffmacaluso.github.io/post/LinearRegressionAssumptions/  

### No Multicollinearity

Independent variables shouldn't show multicollinearity, which occurs when they are highly correlated with each other.  
It is possible to test this assumption by plotting a heatmap of the correlations and examining the [Variance Inflation Factors (VIF)](https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html).
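
To see what a VIF actually is, here is a hand-rolled sketch: regress one feature on the others and compute $1 / (1 - R^2)$. The features are synthetic and the function name `vif` is my own, so this is an illustration of the idea rather than the statsmodels implementation.

```python
import numpy as np


def vif(features, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    n = features.shape[0]
    target = features[:, j]
    others = np.delete(features, j, axis=1)
    design = np.column_stack([np.ones(n), others])  # intercept + other features
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r_squared = 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r_squared)


# Synthetic features, assumed for illustration only
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)                  # unrelated to a
c = a + rng.normal(scale=0.3, size=1000)   # nearly a copy of a

unrelated = np.column_stack([a, b])
related = np.column_stack([a, c])

print(f"VIF of a next to an unrelated feature: {vif(unrelated, 0):.2f}")
print(f"VIF of a next to a near copy of itself: {vif(related, 0):.2f}")
```

A feature that the others can predict well gets a large VIF, while an unrelated feature stays close to 1.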

**References**  
https://corporatefinanceinstitute.com/resources/knowledge/other/multiple-linear-regression/  
https://stackoverflow.com/questions/42658379/variance-inflation-factor-in-python  
https://www.statology.org/how-to-calculate-vif-in-python/  

In [None]:
plt.figure(figsize = (10,8))
sns.heatmap(X.corr(), annot=True)
plt.title('Correlation of Variables')
plt.show()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_constant = add_constant(X)
VIF = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
print(dict(zip(X_constant.columns, VIF)))
print("Since VIF is lower than 5 for the independent variables `OverallQual` and `OverallCond`, the assumption is satisfied.")

# 6.3 Polynomial Regression

Like simple linear regression, polynomial regression determines the relationship between a dependent variable and an independent variable.  
Unlike simple linear regression, it models the relationship as an nth-degree polynomial, so it can fit a non-linear relationship between the independent and dependent variables. The model is still linear in the coefficients, which is why it can be fitted with ordinary linear regression.  

The model is expressed using the equation:  
$Y = \beta_0 + \beta_1X_1 + \beta_2X_1^2 + \dots + \beta_nX_1^n + \epsilon$

where:  
$Y$ – Dependent variable  
$X_1$ – Independent variable  
$\beta_0$ – Intercept  
$\beta_1$, $\beta_2$, ..., $\beta_n$ – Coefficients of the polynomial terms  
$\epsilon$ – Residual (error)  
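
One way to read the equation is as a linear regression on the powers of $X_1$. The sketch below builds those power columns by hand with `np.vander` and solves an ordinary least squares problem. The quadratic and its coefficients are made up for illustration and contain no noise, so the fit should recover them.

```python
import numpy as np

# Known quadratic without noise, assumed for illustration only
x = np.linspace(-2.0, 2.0, 30)
y = 1.0 - 2.0 * x + 0.5 * x ** 2

# Columns are x^0, x^1, x^2: the powers act as separate features
design = np.vander(x, N=3, increasing=True)
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print("coefficients for 1, x, x^2:", np.round(beta, 3))
```

The curve is non-linear in $x$, but each coefficient still multiplies a fixed column, so the fitting step is the same as for a straight line.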

**References**  
https://en.wikipedia.org/wiki/Polynomial_regression  
https://towardsdatascience.com/polynomial-regression-bbe8b9d97491  
https://www.w3schools.com/python/python_ml_polynomial_regression.asp  

## Comparison between Polynomial and Simple Linear regression

Polynomial regression helps fit a curve to non-linear data.  
To see the difference, I will build two models, one with simple linear regression and one with polynomial regression.  
Comparing the two plots makes it easier to see which model fits the data better.  
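
Besides looking at the plots, the two fits can be compared with a number. The sketch below uses synthetic, deliberately curved data (not the house price data) and compares the residual sum of squares of a degree-1 and a degree-4 fit with `np.polyfit`. The helper name `rss_for_degree` is my own.

```python
import numpy as np

# Synthetic curved data, assumed for illustration only
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 0.5 * x ** 3 - 4.0 * x ** 2 + 20.0 + rng.normal(0, 5.0, size=x.size)


def rss_for_degree(degree):
    """Residual sum of squares of a polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, deg=degree)
    return np.sum((y - np.polyval(coefs, x)) ** 2)


rss_linear = rss_for_degree(1)
rss_poly = rss_for_degree(4)
print(f"RSS, straight line:     {rss_linear:.1f}")
print(f"RSS, degree-4 polynomial: {rss_poly:.1f}")
```

A higher degree can never fit the training data worse than a lower one, so a smaller error here does not by itself mean better predictions on new data.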

In [None]:
X = train_data[['OverallQual']]
y = train_data['SalePrice']

### Simple Linear Regression

In [None]:
linear = LinearRegression()
linear.fit(X, y)
y_pred_linear = linear.predict(X)

In [None]:
plt.figure(figsize=(13, 6))
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred_linear, color='red')
plt.title("Sale Price vs Overall Quality (Linear Regression)" +
          "\nEquation: Y = {0:.2f} + {1:.2f}X₁".format(linear.intercept_, linear.coef_[0]))
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()

### Polynomial Regression

To generate polynomial features, I will use the [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) class provided by scikit-learn.

In [None]:
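from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

polynomial = LinearRegression()
polynomial.fit(X_poly, y)

# Evaluate the fitted curve on an evenly spaced grid of quality values
X_grid = np.linspace(X.min(), X.max(), 10)
# Reuse the already fitted transformer instead of refitting it on the grid
X_grid_poly = poly.transform(X_grid)
y_pred_polynomial = polynomial.predict(X_grid_poly)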

In [None]:
plt.figure(figsize=(13, 6))
plt.scatter(X, y, color='blue')
plt.plot(X_grid, y_pred_polynomial, color='red')
plt.title("Sale Price vs Overall Quality (Polynomial Regression)" +
          "\nEquation: Y = {0:.2f} + {1:.2f}X₁ + {2:.2f}X₁² + {3:.2f}X₁³ + {4:.2f}X₁⁴".format(
            polynomial.intercept_, polynomial.coef_[1], polynomial.coef_[2], polynomial.coef_[3], polynomial.coef_[4]))
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()

# 6.4 Support Vector Regression

In [None]:
X = train_data[['OverallQual']]
y = train_data['SalePrice'].values.reshape(-1, 1)

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)

In [None]:
from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X, y.ravel())

X_grid = np.linspace(X.min(), X.max(), 10).reshape(10, 1)
y_pred = regressor.predict(X_grid)

In [None]:
plt.figure(figsize=(13, 6))
plt.scatter(sc_X.inverse_transform(X), sc_y.inverse_transform(y), color='blue')
# inverse_transform expects a 2-D array, so reshape the 1-D predictions first
plt.plot(sc_X.inverse_transform(X_grid), sc_y.inverse_transform(y_pred.reshape(-1, 1)), color='red')
plt.title("Sale Price vs Overall Quality")
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()

# 6.5 Decision Tree Regression

In [None]:
X = train_data[['OverallQual']]
y = train_data['SalePrice']

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X, y)

X_grid = np.linspace(X.min(), X.max(), 10).reshape(10, 1)
y_pred = regressor.predict(X_grid)

In [None]:
plt.figure(figsize=(13, 6))
plt.plot(X_grid, y_pred, color='red')
plt.scatter(X, y, color='blue')
plt.title("Sale Price vs Overall Quality")
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()

# 6.6 Random Forest Regression

In [None]:
X = train_data[['OverallQual']]
y = train_data['SalePrice']

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10, random_state=1)
regressor.fit(X, y)

X_grid = np.linspace(X.min(), X.max(), 10).reshape(10, 1)
y_pred = regressor.predict(X_grid)

In [None]:
plt.figure(figsize=(13, 6))
plt.plot(X_grid, y_pred, color='red')
plt.scatter(X, y, color='blue')
plt.title("Sale Price vs Overall Quality")
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()

# WORK IN PROGRESS