# Regression

Regression methods predict a continuous value (the target) from other variables (the features). A regression problem has two kinds of variables: a dependent variable (the target, typically denoted by Y) and one or more independent variables (the features, typically denoted by X). A regression model expresses Y as a function of X. The key requirement is that the dependent variable Y is continuous rather than discrete. The independent variables, however, can be measured on either categorical or continuous scales.

There are two basic types of regression models:
<ul>
<li>Simple Regression: only one independent variable is used to estimate the dependent variable. It can be either linear or non-linear, depending on the relationship between the independent and dependent variables.</li>
    
<li>Multiple Regression: more than one independent variable is used in the model. Again, this regression can be linear or non-linear.</li>
</ul>

There are many regression algorithms, each best suited to particular kinds of problems:
<ul>
<li> Ordinal regression </li>
<li> Poisson regression </li>
<li> Fast forest quantile regression </li>
<li> Linear, Polynomial, Lasso, Stepwise, Ridge regression </li>
<li> Bayesian Linear Regression </li>
<li> Neural network regression </li>
<li> Decision forest regression </li>
<li> Boosted decision tree regression </li>
<li> KNN (K-nearest neighbors) </li>
</ul>

# Linear Regression

Linear regression approximates the relationship between two or more variables with a linear model. It fits a model $\hat{y} = \theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$ with coefficients (also called parameters) $\theta = (\theta_0,\theta_1, ..., \theta_n)$ chosen to minimize the residual sum of squares. Here the $x_i$'s are the independent feature variables and $\hat{y}$ is the prediction made by the model for given values of the $x_i$'s.

The mean of the squared residual errors ($y-\hat{y}$), called the mean squared error (MSE), measures how well the model fits the dataset:

\begin{equation*}
MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
\end{equation*}

where $y_i$ and $\hat{y}_i$ are the actual and predicted values for the $i$-th data point respectively, and $n$ is the number of data points. Our objective is to find the parameter values $\theta_i$ that minimize the MSE.

There are two ways to determine these error-minimizing $\theta_i$'s:
<ul>
<li> mathematical approach : for a simple linear regression with only one independent variable, the parameters have a closed-form solution
    \begin{equation*}
    \theta_1 = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}; \quad \theta_0 = \bar{y} - \theta_1\bar{x}
    \end{equation*}
    where $\bar{x}, \bar{y}$ represent the mean values of the $x_i$'s and $y_i$'s respectively.

</li>
<li> optimization approach : gradient descent, which iteratively updates the parameters in the direction that reduces the MSE. This approach scales better to very large datasets, where the mathematical approach has to work with large matrices and becomes inefficient.</li>
</ul>
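
To make the two approaches concrete, here is a minimal sketch that fits the same simple linear regression both ways. The data is synthetic, and the function names `fit_closed_form` and `fit_gradient_descent`, the learning rate and the iteration count are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np


def fit_closed_form(x, y):
    """Simple linear regression via the closed-form formulas for theta_1 and theta_0."""
    x_mean, y_mean = x.mean(), y.mean()
    theta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


def fit_gradient_descent(x, y, learning_rate=0.01, n_iters=5000):
    """Minimize the MSE by batch gradient descent (hyperparameters are assumptions)."""
    theta0, theta1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        residual = (theta0 + theta1 * x) - y          # y_hat - y for every point
        grad0 = (2.0 / n) * residual.sum()            # dMSE / dtheta_0
        grad1 = (2.0 / n) * (residual * x).sum()      # dMSE / dtheta_1
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Synthetic data: y = 3 + 2x plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=200)

theta_cf = fit_closed_form(x, y)
theta_gd = fit_gradient_descent(x, y)
print('closed form      : intercept=%.4f slope=%.4f' % theta_cf)
print('gradient descent : intercept=%.4f slope=%.4f' % theta_gd)
```

On a small problem like this both approaches should agree closely. Gradient descent only pays off when the dataset is too large for the direct solution to be practical.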

# Model Evaluation

The goal of regression is to build a model that accurately predicts an unseen case. To this end, we have to evaluate our model. When comparing candidate models, we want to choose the one that gives the most accurate results, so the question is how to estimate a model's accuracy. In other words, how much can we trust the model, built from the given dataset, when it predicts an unknown sample?

One solution is the train-test split, in which we split the dataset into mutually exclusive training and test sets. We use the training set to build the model, then use the model to make predictions on the test set and compare the predicted values with the actual values. This comparison indicates how accurate the model is. Once we have selected the model with the best accuracy, we train it on the whole dataset, as we do not want to lose any valuable data for training.
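
As a rough sketch of this workflow, the cell below uses scikit-learn's `train_test_split` on synthetic data. The data, the 80/20 ratio and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (an assumption for this sketch)
rng = np.random.default_rng(42)
X = rng.uniform(1, 8, size=(100, 1))
y = 100 + 40 * X[:, 0] + rng.normal(0, 10, size=100)

# Hold out 20% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Build the model on the training set only, then score it on unseen rows
model = LinearRegression().fit(X_train, y_train)
test_mse = np.mean((y_test - model.predict(X_test)) ** 2)
print('test MSE : %.2f' % test_mse)

# After model selection, refit on all the data so nothing is wasted
final_model = LinearRegression().fit(X, y)
```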

The issue with a train-test split is that the result depends heavily on which rows happen to fall in the training and test sets. This can be mitigated by another evaluation approach called K-fold cross validation. In K-fold cross validation, we split the whole dataset into K folds. We use K-1 folds for training the model and the remaining fold for testing and evaluating it. We repeat this process until each fold has been used once for testing, and finally average the evaluation scores over all the folds.
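
The K-fold procedure can be sketched with scikit-learn's `KFold` as follows. The synthetic data, the choice of K = 5 and the use of MSE as the per-fold score are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for a real dataset (an assumption for this sketch)
rng = np.random.default_rng(1)
X = rng.uniform(1, 8, size=(100, 1))
y = 100 + 40 * X[:, 0] + rng.normal(0, 10, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_idx, test_idx in kf.split(X):
    # Train on K-1 folds, evaluate on the one fold that was held out
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mse.append(np.mean((y[test_idx] - pred) ** 2))

# Average the per-fold scores to get the cross-validated estimate
cv_mse = np.mean(fold_mse)
print('MSE per fold :', np.round(fold_mse, 2))
print('mean CV MSE  : %.2f' % cv_mse)
```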

To estimate the accuracy of a regression model, we compare the actual values with the predicted values, and evaluation metrics quantify that comparison. Some model evaluation metrics that are used for regression are:
<ul>
    <li> Mean Absolute Error (MAE): It is the mean of the absolute values of the errors. This is the easiest of the metrics to understand since it is just the average size of the error.
        \begin{equation*}
        MAE = \frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|
        \end{equation*}
    </li>
    <li>Mean Squared Error (MSE): Mean squared error is the mean of the squared errors.
        \begin{equation*}
        MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2
        \end{equation*}
        It is often preferred to the mean absolute error when large errors matter most. Because each error is squared, the penalty grows quadratically with the size of the error, so large errors weigh far more than small ones.
    </li>
    <li>Root Mean Squared Error (RMSE): This is the square root of the Mean Squared Error.
        \begin{equation*}
        RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2}
        \end{equation*}
        This is one of the most popular of the evaluation metrics because RMSE is in the same units as the target variable, making it easy to interpret.
    </li>
    <li>Relative Absolute Error (RAE):
        \begin{equation*}
        RAE = \frac{\sum_{i=1}^n|y_i - \hat{y}_i|}{\sum_{i=1}^n|y_i - \bar{y}|}
        \end{equation*}
    </li>
    <li>Relative Squared Error (RSE):
        \begin{equation*}
        RSE = \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2}
        \end{equation*}
        R-squared ($R^2 = 1 - RSE$) is not an error but a popular metric for the accuracy of a model. It represents how close the data values are to the fitted regression line: the higher the R-squared, the better the model fits the data. The best possible score is 1.0, a model that always predicts $\bar{y}$ scores 0.0, and the score can be negative because a model can be arbitrarily worse than that.
    </li>
</ul>
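
As a quick check of the formulas above, the following sketch computes every metric with NumPy for a tiny hand-made example. The helper name `regression_metrics` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute the regression metrics defined above (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    baseline = y_true - y_true.mean()      # errors of always predicting the mean
    mse = np.mean(errors ** 2)
    rse = np.sum(errors ** 2) / np.sum(baseline ** 2)
    return {
        'MAE': np.mean(np.abs(errors)),
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'RAE': np.sum(np.abs(errors)) / np.sum(np.abs(baseline)),
        'RSE': rse,
        'R2': 1.0 - rse,
    }


metrics = regression_metrics([3, 5, 7, 9], [2.5, 5.5, 7.0, 10.0])
for name, value in metrics.items():
    print('%-4s : %.3f' % (name, value))
```

For these four points the errors are 0.5, -0.5, 0 and -1, which gives MAE = 0.5, MSE = 0.375, RAE = 0.25, RSE = 0.075 and R-squared = 0.925.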

# Simple Linear Regression

In simple linear regression, there are two variables: one dependent variable and one independent variable.

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

This fuel consumption dataset contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles.

In [None]:
df = pd.read_csv('FuelConsumption.csv')

In [None]:
df.head()

In [None]:
df.describe()                        # to get basic statistics about the data

Let's select some features for further exploration.

In [None]:
cdf = df[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB', 'CO2EMISSIONS']]

We can plot a histogram of each of these features:

In [None]:
viz = cdf[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB', 'CO2EMISSIONS']]
viz.hist()
plt.show()

Now, let's plot these features against the emissions to see how linear their relationship is:

In [None]:
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')
plt.xlabel('FUELCONSUMPTION_COMB')
plt.ylabel('CO2EMISSIONS')
plt.show()

In [None]:
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')
plt.xlabel('ENGINESIZE')
plt.ylabel('CO2EMISSIONS')
plt.show()

Let's split our dataset into train and test sets, with roughly 80% of the data for training and the remaining 20% for testing. We create a random boolean mask to select the rows, so the exact proportions vary slightly from run to run:

In [None]:
msk = np.random.rand(len(cdf)) < 0.8
train = cdf[msk]
test = cdf[~msk]

Let's use scikit-learn to model engine size vs CO2 emissions.

In [None]:
from sklearn import linear_model
lr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
lr.fit(train_x, train_y)
print(lr.intercept_, lr.coef_)

Let's plot the data along with the fitted line.

In [None]:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, lr.coef_[0][0]*train_x + lr.intercept_[0], '-r')
plt.xlabel('ENGINESIZE')
plt.ylabel('CO2EMISSIONS')
plt.show()

In [None]:
from sklearn.metrics import r2_score

test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_yhat = lr.predict(test_x)

print('mean absolute error : %.2f' % np.mean(np.absolute(test_yhat - test_y)))
print('mean squared error (MSE) : %.2f' % np.mean((test_yhat - test_y)**2))
print('R2-score : %.2f' % r2_score(test_y, test_yhat))   # r2_score expects (y_true, y_pred)