# Regression

We study two kinds of predictive model. The first, classification, is about predicting a label. We now investigate the second, regression, which is about predicting a quantity.

Predictive modeling is the problem of developing a model from a dataset in order to make predictions on new data for which we do not have the answer.

Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y).

Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

A continuous output variable is a real value, such as an integer or floating-point number. These are often quantities, such as amounts and sizes.

We chose house price prediction for several reasons, the main one being that suitable datasets are available.

We will use modeling algorithms that find the best possible mapping function given the time and resources available. In this notebook, the regression is solved using linear regression and a neural network.

## Contextualisation

The goal of simple regression is to explain a variable $Y$ using a variable $X$.

- The variable $Y$ is called the dependent variable; in our case it is the price of the house.

- The variables $X_j$ are called the independent variables; in our case they could be the neighborhood, the year built, whether the house has a pool, etc.

We rarely estimate the price of a home using only one parameter. We therefore describe each house by a set of $q$ parameters $X = (x_1, ..., x_q)$, where $x_j$ is the $j$-th parameter, for $j = 1, ..., q$.

As said in the introduction, we solve the regression with a linear regression algorithm and a neural network, although many other algorithms exist for this task.


In [4]:
# Import the libraries
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import keras
from keras.models import Sequential
from keras.layers import Dense
from scipy import stats
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from scipy.stats import norm
from scipy.stats import binned_statistic
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

ImportError: Keras requires TensorFlow 2.2 or higher. Install TensorFlow via `pip install tensorflow`

# Data analysis

The first step is to load the data and analyze it. The first thing we notice is that some cells are empty.

Missing data degrades our predictive models and reduces their accuracy, because it feeds wrong information to the algorithms and adds noise to the output graphs.

The solution is to clean the data: we have to eliminate every missing cell, which reduces our dataset but makes it more relevant.
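
As a minimal sketch of what such cleaning can look like, the toy table below (hypothetical values, not taken from the house price dataset) compares dropping every row that has a hole with first dropping the mostly empty columns. The 50% threshold is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'pool' is almost empty, 'area' has one hole
toy = pd.DataFrame({
    'area':  [120.0, 85.0, np.nan, 150.0],
    'pool':  [np.nan, np.nan, np.nan, 1.0],
    'price': [200.0, 150.0, 130.0, 260.0],
})

# Fraction of missing cells per column
missing_ratio = toy.isnull().mean()

# Option 1: drop every row with at least one missing cell
rows_dropped = toy.dropna()

# Option 2: first drop columns that are more than 50% empty, then drop rows
mostly_filled = toy.loc[:, missing_ratio <= 0.5]
cols_then_rows = mostly_filled.dropna()

print(len(rows_dropped), len(cols_then_rows))
```

Dropping the nearly empty column first keeps more rows, which is why it helps to look at the missing-value ratio per column before deciding what to remove.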

Before cleaning the data, we analyze the correlations between the variables to gain insight into them, which will improve the quality of our predictions.

In [None]:
train = pd.read_csv('dataset/regression/train.csv') 
train.head(15)
train['SalePrice'].describe()

In [None]:
train.describe()

In [None]:
train.info()

In [None]:
test = pd.read_csv('dataset/regression/test.csv')
test.head(15)


In [None]:
#check if the data set has any missing values. 
train.columns[train.isnull().any()]

In [None]:
#missing value counts in each of these columns
missing_value = train.isnull().sum()

missing_value.sort_values(inplace=True, ascending=False)
missing_value

In [None]:
#Convert into dataframe
missing_value_frame = missing_value.to_frame()

missing_value_frame.columns = ['count']

missing_value_frame.index.names = ['Parametre']

missing_value_frame['Parametre'] = missing_value_frame.index

#plot Missing values
plt.figure(figsize=(13, 5))
sns.set(style='whitegrid')
sns.barplot(x='Parametre', y='count', data=missing_value_frame)
plt.xticks(rotation = 90)
plt.show()



We can see that some parameters have a large number of null cells. To visualize this better, we plot the percentage of missing values per parameter.

In [None]:
percentage_missing_value = train.isnull().sum()/len(train)*100
percentage_missing_value= percentage_missing_value[percentage_missing_value>0]
percentage_missing_value.sort_values(inplace=True, ascending=False)
percentage_missing_value = percentage_missing_value.to_frame()

percentage_missing_value.columns = ['Percentage']
percentage_missing_value.index.names = ['Parametere']

percentage_missing_value['Parametre'] = percentage_missing_value.index

#plot Missing values
plt.figure(figsize=(13, 5))
sns.set(style='whitegrid')
sns.barplot(x='Parametre', y='Percentage', data=percentage_missing_value)
plt.xticks(rotation = 90)
plt.show()



As we can see, 7 parameters have more than 50% of their values missing, and 16 parameters have missing values. We have to make choices and eliminate some columns from our dataset. To help us decide, we analyze the correlation between our dependent variable, the sale price, and our independent variables, the other columns, by computing a correlation matrix. This statistical tool gives us a good view of which parameters to keep: we keep only the parameters whose correlation coefficient with the sale price is above 0.7. We also set aside every column whose type is not numeric, because it is unusable by our algorithm as is.
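
To make the selection rule concrete, here is a small sketch in plain Python. The column names and values are hypothetical, and `pearson` is a helper written for this illustration; it is not the function used in the cells below.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns
price = [100, 150, 200, 250, 300]
features = {
    'living_area': [50, 70, 95, 120, 140],   # grows with the price
    'month_sold':  [3, 11, 6, 1, 8],         # unrelated to the price
}

# Keep only the features whose absolute correlation with the price exceeds the threshold
threshold = 0.7
kept = [name for name, values in features.items()
        if abs(pearson(values, price)) > threshold]
print(kept)
```

The absolute value matters: a strongly negative correlation is as informative as a strongly positive one.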

In [None]:
# Correlations can only be computed on numeric columns
correlation = train.select_dtypes(include=[np.number]).corr()
correlation.sort_values(['SalePrice'], ascending=False, inplace=True)
correlation.SalePrice


In [None]:
data_number_only = train.select_dtypes(include=[np.number])
del data_number_only['Id']
correlation= data_number_only.corr()

# Take the absolute value of the correlation before comparing it with the threshold
top_core_parametre = correlation.index[abs(correlation['SalePrice']) > 0.3]
plt.subplots(figsize=(12, 8))
top_core_matrix = train[top_core_parametre].corr()
sns.heatmap(top_core_matrix, annot=True)
plt.show()


We will keep the variables that are most correlated with our dependent variable.

# Data cleaning

In [None]:


# PoolQC has a missing-value ratio above 99%, so fill it with 'None'
train['PoolQC'] = train['PoolQC'].fillna('None')

# Attributes with around 50% or more missing values are filled with 'None'
train['MiscFeature'] = train['MiscFeature'].fillna('None')
train['Alley'] = train['Alley'].fillna('None')
train['Fence'] = train['Fence'].fillna('None')
train['FireplaceQu'] = train['FireplaceQu'].fillna('None')

# Group by neighborhood and fill missing LotFrontage values with the neighborhood median
train['LotFrontage'] = train.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

# GarageType, GarageFinish, GarageQual and GarageCond: replace missing values with 'None'
for col in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    train[col] = train[col].fillna('None')

# GarageYrBlt, GarageArea and GarageCars: replace missing values with zero
for col in ['GarageYrBlt', 'GarageArea', 'GarageCars']:
    train[col] = train[col].fillna(0)

# BsmtFinType2, BsmtExposure, BsmtFinType1, BsmtCond, BsmtQual: replace missing values with 'None'
for col in ('BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtCond', 'BsmtQual'):
    train[col] = train[col].fillna('None')

# MasVnrArea: replace missing values with zero
train['MasVnrArea'] = train['MasVnrArea'].fillna(0)

# MasVnrType: replace missing values with 'None'
train['MasVnrType'] = train['MasVnrType'].fillna('None')

# Electrical: fill missing values with the mode (the most frequent value)
train['Electrical'] = train['Electrical'].fillna(train['Electrical'].mode()[0])

# Utilities is not needed
train = train.drop(['Utilities'], axis=1)





We convert the strings into numbers: the model is mathematical and cannot evaluate a string, which is why this step is necessary.
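
The idea can be sketched in plain Python. The helper `label_encode` below is hypothetical and only mimics the principle of a label encoder: each distinct string receives an integer, assigned in sorted order.

```python
def label_encode(values):
    """Map each distinct string to an integer, in sorted order of the strings."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Hypothetical quality column
quality = ['Gd', 'TA', 'Ex', 'TA', 'None']
codes, mapping = label_encode(quality)
print(mapping)
print(codes)
```

Note that the integers follow the alphabetical order of the labels, not any natural order of quality, so a model may read an ordering into the codes that the data does not contain.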

In [None]:
idx_meta = ['SalePrice','GrLivArea', 'MasVnrArea', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'OverallQual', 'Fireplaces', 'GarageCars']
train_meta = train[idx_meta].copy()
train_meta.head(n=5)
idx_meta


In [None]:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold', 'MSZoning', 'LandContour', 'LotConfig', 'Neighborhood',
        'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
        'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'Foundation', 'GarageType', 'MiscFeature', 
        'SaleType', 'SaleCondition', 'Electrical', 'Heating')


# Replace each category by an integer code, column by column
for c in cols:
    train[c] = LabelEncoder().fit_transform(train[c])



# Normalisation

Now that the data is prepared, we analyze the target variable and check whether its distribution fits a normal distribution.

In probability theory and statistics, a normal (or Gaussian, Gauss, or Laplace–Gauss) distribution is a type of continuous probability distribution for a real-valued random variable.

The normal distribution is one of the most important and commonly encountered probability distributions.
When you add together a large number of independent random variables, even if they come from different distributions, the sum tends to be normally distributed, provided mild conditions such as finite variance hold. In our case, the variables are the parameters that influence the price of the house.
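
This tendency can be checked with a small simulation. The sketch below sums uniform random numbers, a choice made only for illustration, and measures the share of sums that fall within one standard deviation of the mean; for a normal distribution that share is about 68%.

```python
import random
import statistics

random.seed(0)

def sum_of_uniforms(k):
    """Sum of k independent uniform(0, 1) draws."""
    return sum(random.random() for _ in range(k))

k = 30
samples = [sum_of_uniforms(k) for _ in range(5000)]

mean = statistics.fmean(samples)
std = statistics.pstdev(samples)

# Share of samples within one standard deviation of the mean
within_one_std = sum(abs(s - mean) <= std for s in samples) / len(samples)
print(round(mean, 2), round(std, 2), round(within_one_std, 2))
```

Each individual draw is flat, not bell-shaped, yet the sums cluster around their mean the way a normal distribution predicts.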

In [None]:
plt.subplots(figsize=(12,10))
sns.distplot(train['SalePrice'], fit=stats.norm)

# Get the parameters of the fitted normal distribution

(mu, sigma) = stats.norm.fit(train['SalePrice'])

# Label the plot with the fitted normal distribution (raw string keeps the LaTeX backslashes)

plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')

fig = plt.figure()
stats.probplot(train['SalePrice'], plot=plt)
plt.show()


In machine learning, it is normal to have errors from various sources, from data corruption to classification errors. While it is important to check that the assumption holds, it is not unreasonable to think that the combined effect of these errors is approximately normal.

Accepting the normal distribution also makes the math easier and faster to do, thanks to its simplicity, which is an important consideration when training a model. Here we can see that our distribution does not fit the normal one. To solve this, we apply a logarithmic transformation, $\log(1 + x)$, to the sale price.
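
To illustrate why a logarithm helps, the sketch below measures the skewness of a few hypothetical prices with a long right tail, before and after the transformation. The `skewness` helper and the price values are assumptions made for this example; a skewness close to zero indicates a more symmetric distribution.

```python
import math

def skewness(values):
    """Sample skewness: third central moment divided by the variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical prices with a long right tail
prices = [50_000, 80_000, 100_000, 120_000, 150_000, 200_000, 300_000, 755_000]
log_prices = [math.log1p(p) for p in prices]

skew_before = skewness(prices)
skew_after = skewness(log_prices)
print(round(skew_before, 2), round(skew_after, 2))
```

The logarithm compresses the large values more than the small ones, which pulls the long tail back toward the bulk of the data.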

In [None]:
# Apply log(1 + x) to the sale price
train['SalePrice'] = np.log1p(train['SalePrice'])


plt.subplots(figsize=(12,9))
sns.distplot(train['SalePrice'], fit=stats.norm)


(mu, sigma) = stats.norm.fit(train['SalePrice'])

# Label the plot with the fitted normal distribution

plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')

# Probability plot

fig = plt.figure()
stats.probplot(train['SalePrice'], plot=plt)
plt.show()

# Prediction

## Linear regression

Linear regression is perhaps one of the most well known and well-understood algorithms in statistics and machine learning. Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables.

In statistics, linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables.
The relationships are modeled using linear predictor functions $h(x)$. To perform regression, you must decide the way you are going to represent $h$. As an initial choice, let’s say you decide to approximate $y$ as a linear function of $x$ :


 $$ h_θ(x) = θ_0 + θ_1x$$

Here, the $θ_i$ are the parameters (the weights in our neural network) parameterizing the space of linear functions mapping from $X$ to $Y$. In simple words, these parameters are used to map $X$, the independent variable, accurately to $Y$, the predicted sale price. $θ_0$ is the intercept; by convention we introduce $x_0 = 1$ so that it can be treated like the other parameters.

![](other/h.jpg)

The main question is how to pick, or learn, the parameters $θ$. We cannot change the input instances in order to predict the prices; the $θ$ parameters are the only thing we can modify.

One reasonable method is to make $h(x)$ close to $y$, at least for the training examples we have. To formalize this, we define a function that measures, for each value of the $θ$'s, how close the $h(x^{(i)})$'s are to the corresponding $y^{(i)}$'s. Finding the $θ$'s then becomes the problem of minimizing this objective function.

We give a mathematical formulation of this function with a statistical tool called the mean squared error (MSE):

$$\text{MSE} = \frac{1}{n}\sum_{j=1}^{n} (y_{j}-\text{pred}_{j})^2$$

where $y_j$ is the actual value and $\text{pred}_j = h_θ(x_j)$ the predicted value for example $j$, out of $n$ examples.

It measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. MSE is a risk function, corresponding to the expected value of the squared error loss. In machine learning it is called a cost function.
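
As a worked example on hypothetical numbers, the sketch below evaluates this cost for two candidate pairs of parameters on three points. The function names `h` and `mse` are chosen for the illustration.

```python
def h(theta0, theta1, x):
    """Linear hypothesis: intercept theta0 plus slope theta1 times x."""
    return theta0 + theta1 * x

def mse(theta0, theta1, xs, ys):
    """Average squared difference between actual and predicted values."""
    n = len(xs)
    return sum((y - h(theta0, theta1, x)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical training examples
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 6.0]

cost_a = mse(1.0, 2.0, xs, ys)   # predictions 1, 3, 5
cost_b = mse(0.0, 1.0, xs, ys)   # predictions 0, 1, 2
print(cost_a, cost_b)
```

The first pair of parameters gives the lower cost, so it is the better fit of the two; training consists of searching for the parameters that make this number as small as possible.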

There are several ways to model the loss function, such as the mean absolute error, but we chose the mean squared error, which is one of the most widely used.

![](other/m.png)




Note that linear regression is often divided into two basic forms:

- Simple linear regression (SLR), which deals with just two variables: one independent and one dependent
- Multiple linear regression (MLR), which deals with more than two variables; this is our case

We now focus on the ways of estimating the parameters. This estimation is known as training the linear regression model, and there are many methods to do it.
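
One of these methods, for the simple case with a single input variable, is the closed-form least squares solution. The sketch below is an illustration on hypothetical data; `fit_simple_ols` is a name chosen for this example, not a library function.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_ols(xs, ys)
print(intercept, slope)
```

With several input variables the same principle applies, but the solution is written with matrices, which is what a library implementation handles for us.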

In [None]:
# Take the target variable into y
y = train['SalePrice']

# Remove SalePrice from the features
del train['SalePrice']

# Take the values as arrays X and y
X = train.values
y = y.values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

Here we use ordinary least squares linear regression.

In [None]:

model = linear_model.LinearRegression()

# Fit the model
model.fit(X_train, y_train)
yd_lm = model.predict(X_test)

print("Predicted value " + str(model.predict([X_test[142]])))
print("Real value " + str(y_test[142]))

# score() returns the coefficient of determination R^2, not a classification accuracy
print("R^2 score (%) --> ", model.score(X_test, y_test)*100)

# Pass x and y by keyword
sns.regplot(x=y_test, y=yd_lm)
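
The value returned by `model.score` for a regression model is the coefficient of determination $R^2$. As a sketch of what this number means, the hypothetical helper `r_squared` below computes it by hand on toy values.

```python
def r_squared(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and two sets of predictions
y_true = [1.0, 2.0, 3.0, 4.0]
good = r_squared(y_true, [1.1, 1.9, 3.2, 3.8])
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predicts the mean
print(round(good, 2), round(baseline, 2))
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so the score reads as the share of the variance in the target that the model explains.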

## Deep neural network

We now use another model for house price prediction: a deep neural network, trained with an optimization algorithm.


A deep neural network (DNN) is an artificial neural network with multiple layers, called hidden layers, between the input and output layers. The DNN finds the mathematical manipulation that turns the input into the output, whether the relationship is linear or non-linear.

DNNs are typically feed-forward networks, in which data flows from the input layer to the output layer without looping back.

### Feed-forward neural network

A feed-forward neural network consists of a number of simple processing units, often called perceptrons, organized in layers.

The structure of the connections between layers is very simple: each perceptron in a layer is connected to all the units in the previous layer.

These connections are not all equal: each may have a different strength, or weight. The weights on these connections encode the knowledge of the network.

Data enters at the inputs and passes through the network, layer by layer, until it arrives at the outputs. There is no feedback between layers; data flows one way only. This is why they are called feed-forward neural networks.
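
A forward pass through such a network can be sketched in a few lines of plain Python. The weights below are hypothetical, and the rectified linear activation applied to the hidden layer is an assumption made for this illustration.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each unit sees every input of the previous layer."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]

def relu(values):
    """Keep positive values and replace negative ones with zero."""
    return [max(0.0, v) for v in values]

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -1.0]
output_w = [[2.0, 1.0]]
output_b = [0.5]

x = [3.0, 1.0]
hidden = relu(dense(x, hidden_w, hidden_b))
output = dense(hidden, output_w, output_b)
print(hidden, output)
```

The data only ever moves from `x` to `hidden` to `output`; nothing computed in a later layer is fed back into an earlier one.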

![](other/c.jpg)

## Activation function

An artificial neuron, or perceptron, calculates a weighted sum of its inputs, adds a bias, and then decides whether or not the neuron activates. The activation function performs this decision.

 $$\sum_i (\text{weight}_i \times \text{input}_i) + \text{bias}$$

In a neural network, the activation function is responsible for transforming the summed weighted input of the node into the activation of the node, that is, its output for that input. Its job is to create the final patterns that lead to the right model.

The sigmoid and the hyperbolic tangent were long the most common activation functions, but they are hard to use in networks with many layers because of the vanishing gradient problem. The rectified linear activation function mitigates the vanishing gradient problem, often allowing models to learn faster and perform better.

The rectified linear activation is the default activation when developing multilayer perceptrons.

The sigmoid and the hyperbolic tangent are used to decide the output of a neuron in a yes-or-no fashion: the resulting value is always between 0 and 1 for the sigmoid, or between -1 and 1 for the hyperbolic tangent.
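
The vanishing gradient problem can be illustrated with a deliberately simplified calculation. During training, the gradient that reaches an early layer contains a product of the activation derivatives of the layers after it. The sketch below ignores the weights, which is an assumption made to keep the example short, and compares that product for the sigmoid and for the rectified linear activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

layers = 10
# Best case for the sigmoid: its derivative peaks at z = 0
sigmoid_chain = sigmoid_derivative(0.0) ** layers
# For a positive input, the rectified linear derivative is 1
relu_chain = relu_derivative(1.0) ** layers
print(sigmoid_chain, relu_chain)
```

Even in its best case the sigmoid shrinks the signal at every layer, while the rectified linear activation passes it through unchanged for positive inputs.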

![](other/1.svg)

The ReLU (rectified linear unit) is currently one of the most widely used activation functions and appears in most deep neural networks. The mathematical formulation of ReLU is $y = \max(0, x)$.

It is a piecewise linear function that outputs the input directly if it is positive and zero otherwise. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
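
The three activation functions discussed in this section can be compared on a few sample inputs. This is a small sketch in plain Python; the input values are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]
sigmoid_out = [sigmoid(x) for x in inputs]
tanh_out = [math.tanh(x) for x in inputs]
relu_out = [relu(x) for x in inputs]

for x, s, t, r in zip(inputs, sigmoid_out, tanh_out, relu_out):
    print(x, round(s, 3), round(t, 3), r)
```

The sigmoid and hyperbolic tangent outputs stay inside a bounded range, while ReLU is unbounded above, which suits the output scale of a regression problem in the hidden layers.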

## Fulfillment

Our neural network model goes through 2 main phases:

- The first phase propagates the information from the input layer to the deeper layers. This is ensured by the feed-forward algorithm, which tries to arrive at the right target.

- The second phase optimizes the cost function using an optimization algorithm, with the cost calculated by a sum-of-squares method that we discuss later.

## Stochastic gradient descent and Adam 

Earlier we talked about supervised learning. In this kind of learning we have two labeled datasets: the first is used to train the neural network, and the second, the test dataset, is used to verify how well the trained network predicts unseen data.

When training our neural network, we feed the training dataset through the network sample by sample, and for each sample we inspect the outcome. In particular, we check how much the outcome differs from what we expected (the label). The difference between what we expected and what we got is called the error or loss. The loss tells us how far we are from reality. This measure can then be used to adjust the network slightly so that it is less wrong the next time.

Several cost functions can be used. In our project we used the quadratic cost function, the sum of squared residuals (SSR):


$$\text{SSR} = \sum_{j} (y_{j}-\text{pred}_{j})^2$$


where $y_j$ is the observed value and $\text{pred}_j$ the predicted value for sample $j$. In statistics, the mean squared error is this sum divided by the number of samples: it measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. This loss function evaluates how well our line fits the data. Each term of the sum is a squared residual:


$$ \text{squared residual} = (\text{observed value} - \text{predicted value})^2 $$

The predicted sale price follows a line of equation

$$ \text{pred} = \text{intercept} + \text{slope} \times x $$

When we substitute the predicted value into the first equation, we can calculate two derivatives of the SSR:

$$ \frac{\partial\, \text{SSR}}{\partial\, \text{intercept}} = \sum_{j} -2\,(y_j - \text{pred}_j) \qquad \frac{\partial\, \text{SSR}}{\partial\, \text{slope}} = \sum_{j} -2\,x_j\,(y_j - \text{pred}_j) $$

The derivative with respect to the intercept gives a number that defines the step size of the intercept:

$$ \text{stepsizeintercept} = \left(\frac{\partial\, \text{SSR}}{\partial\, \text{intercept}}\right) \times \text{LearningRate} $$

We can then calculate the value of the new intercept:


$$\text{newintercept}=\text{oldintercept}-\text{stepsizeintercept}$$


The derivative with respect to the slope gives a number that defines the step size of the slope:

$$ \text{stepsizeslope} = \left(\frac{\partial\, \text{SSR}}{\partial\, \text{slope}}\right) \times \text{LearningRate} $$

We can then calculate the value of the new slope:

 $$\text{newslope}=\text{oldslope}-\text{stepsizeslope}$$
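
The update rules above can be sketched in plain Python, taking the derivatives of the sum of squared residuals with respect to the intercept and the slope. The data, the learning rate of 0.01 and the number of steps are hypothetical values chosen so that the example converges.

```python
def gradient_step(intercept, slope, xs, ys, learning_rate):
    """One gradient descent step on the sum of squared residuals."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    d_intercept = sum(-2.0 * r for r in residuals)
    d_slope = sum(-2.0 * x * r for x, r in zip(xs, residuals))
    # New value = old value - step size, where step size = derivative * learning rate
    new_intercept = intercept - d_intercept * learning_rate
    new_slope = slope - d_slope * learning_rate
    return new_intercept, new_slope

# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

first_intercept, first_slope = gradient_step(0.0, 0.0, xs, ys, learning_rate=0.01)

intercept, slope = 0.0, 0.0
for _ in range(2000):
    intercept, slope = gradient_step(intercept, slope, xs, ys, learning_rate=0.01)
print(round(intercept, 3), round(slope, 3))
```

Starting from an intercept and slope of zero, the repeated steps move both parameters toward the values that generated the data.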


We can tune the learning rate to take bigger or smaller steps. A common approach is to start with a relatively large learning rate while we are far from the minimum, then reduce it as we get closer, which gives a more accurate result.

By randomly picking one sample at each step and computing its squared residual, we progressively adjust our prediction line to fit the data as well as possible.

Compared with (batch) gradient descent, this technique computes far fewer terms at each step, since it uses one sample instead of the whole dataset. It is therefore more efficient for large amounts of data like ours.

The calculation becomes faster, but the descent fluctuates. The direction computed from a single sample is not globally stable; it can even be opposite to the true gradient descent direction.
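
A minimal sketch of the stochastic variant follows, on hypothetical data and with a learning rate chosen for the illustration. The only change from ordinary gradient descent is that each step uses a single randomly picked sample.

```python
import random

random.seed(0)

# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

intercept, slope = 0.0, 0.0
learning_rate = 0.05
for step in range(5000):
    # Pick one sample at random instead of summing over the whole dataset
    j = random.randrange(len(xs))
    residual = ys[j] - (intercept + slope * xs[j])
    # Derivatives of this sample's squared residual
    d_intercept = -2.0 * residual
    d_slope = -2.0 * xs[j] * residual
    intercept -= d_intercept * learning_rate
    slope -= d_slope * learning_rate
print(round(intercept, 3), round(slope, 3))
```

Because these toy points lie exactly on a line, the fluctuations die out as the fit improves; on noisy data the parameters would keep jittering around the minimum unless the learning rate is reduced.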

Stochastic gradient descent is available as a predefined optimizer in the Keras library, but for our model we opted for the Adam optimizer instead, which is also predefined.

The Adam algorithm is a variant of classical stochastic gradient descent.

Classical stochastic gradient descent maintains a single learning rate for all weight updates, and this learning rate does not change during training.

Instead, the Adam optimizer computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.

Adam is a replacement optimization algorithm for stochastic gradient descent. Other optimizers exist, but Adam combines the best properties of two of them, AdaGrad and RMSProp, to provide an optimization algorithm that can handle sparse gradients on noisy problems.

Adam is relatively easy to configure, and the default configuration parameters do well on most problems:

- alpha: the learning rate or step size.
- beta1: the exponential decay rate for the first-moment estimates.
- beta2: the exponential decay rate for the second-moment estimates.
- epsilon: a very small number that prevents any division by zero in the implementation.
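
To show where these four parameters enter, here is a sketch of the Adam update rule for a single parameter, written in plain Python. It is an illustration, not the Keras implementation; the objective function and the value `alpha=0.1` are assumptions chosen so that the example converges quickly.

```python
import math

def adam_minimize(grad, theta, steps, alpha=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Minimize a one-parameter function with the Adam update rule."""
    m = 0.0  # first-moment estimate (running mean of the gradients)
    v = 0.0  # second-moment estimate (running mean of the squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: both estimates start at zero
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta

# Hypothetical objective f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, steps=1000)
print(round(theta, 3))
```

The step is scaled by the ratio of the two moment estimates, so each parameter effectively gets its own step size, which is what distinguishes Adam from stochastic gradient descent with a single fixed learning rate.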



In [None]:
def baseline_nn_model(dims):
    model = Sequential()
    model.add(Dense(dims, input_dim=dims,kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

In [None]:
def larger_nn_model(dims):
    model = Sequential()
    model.add(Dense(dims, input_dim=dims,kernel_initializer='normal', activation='relu'))
    model.add(Dense(35, kernel_initializer='normal', activation='relu'))
    model.add(Dense(15, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
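
To get a feel for the size of these two networks, we can count their trainable parameters by hand. The sketch below assumes fully connected layers with one bias per unit, which matches the layer type used above; the helper names and the value `dims = 10` are hypothetical.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (one per input per unit) plus one bias per unit."""
    return n_inputs * n_units + n_units

def count_params(dims, hidden_sizes):
    """Total parameters: a first layer of `dims` units, the hidden layers, then 1 output."""
    sizes = [dims, dims] + list(hidden_sizes) + [1]
    return sum(dense_param_count(a, b) for a, b in zip(sizes, sizes[1:]))

dims = 10  # hypothetical number of input features
baseline_params = count_params(dims, [])
larger_params = count_params(dims, [35, 15])
print(baseline_params, larger_params)
```

The larger model has many more weights to fit, which gives it more capacity but also makes it easier to overfit a small dataset.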

In [None]:
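def use_keras_nn_model(nn_model, x, y, xx, yy, epoch, ylim):
    # Train the network on the training set
    nn_model.fit(x, y, epochs=epoch, verbose=0)

    # Predict on the test set; flatten the (n, 1) output to shape (n,)
    yy_predict = nn_model.predict(xx).flatten()

    # Plot the predicted values against the real values
    ax = sns.regplot(x=yy, y=yy_predict)
    ax.set(ylim=ylim)
    plt.show()

    # Return the root mean squared error on the test set
    return np.sqrt(np.mean((yy - yy_predict) ** 2))

use_keras_nn_model(baseline_nn_model(X_train.shape[1]), X_train, y_train, X_test, y_test, 700,  (-500, 500))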

In [None]:
rmse_largernn = use_keras_nn_model(larger_nn_model(X_train.shape[1]), X_train, y_train, X_test, y_test, 500, (-30, 30))