<img src="AV_Logo.png" style="width: 200px;height: 75px"/>

Table of Contents
--------------
* [What is Predictive Modelling?](#What-is-Predictive-Modelling?)
* [Building the first model](#Building-the-first-model)
* [How to find the best regression line?](#How-to-find-the-best-regression-line?)
* [Performance Evaluation Metrics in Regression](#Performance-Evaluation-Metrics-in-Regression)
* [Multivariate Regression](#Multivariate-Regression)
* [Hands-on practice problem](#Hands-on-practice-problem)
* [A few points of caution when applying Linear Regression](#A-few-points-of-caution-when-applying-Linear-Regression)

## What is Predictive Modelling?

Now that we are comfortable processing the data, it's time to see what we can do with it. You may remember the life cycle of data modelling that was discussed on [day 1](https://datahack.analyticsvidhya.com/s/832e7b4373ed46d98bfc849a8c607d1a).

<img src="lifecycle.png" style="width: 400px;height: 300px">

We have already seen problem definition, hypothesis generation, data collection and exploration. Let's now see what predictive modelling is. 

**Predictive Modelling** is the process of building a statistical model that estimates or predicts future behaviour from past data.

Let's take a simple example. A retail bank wants to understand the default behaviour of its credit card customers: it wants to predict the probability that each customer defaults within the next 3 months. What does it do?

* Define the problem, i.e. identify the customers who will default in the next 3 months
* Generate and define the hypotheses
* Collect past data
* Conduct univariate and bivariate analysis on the collected data
* Treat missing values and outliers
* Build a predictive model
* Deploy the model

Whenever we do predictive modelling, we have to decide which algorithm to apply to the problem. The following checks narrow down the choice:

* Check if there is a target / dependent variable in your data. If it is not present, you should use an **unsupervised learning** algorithm (for example, k-means).
* If the target variable is present, it is most likely a **supervised learning** problem.
* Even in supervised learning problems, you have to check whether the target variable is categorical or continuous, because the choice of algorithm depends on the type of target.
* If the target variable is continuous, your problem is a regression problem, so you should use an algorithm like linear regression.
* If the target variable is categorical, your problem is a classification problem, so you should use an algorithm like logistic regression.
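The checks above can be sketched as a small helper. This is only an illustration: the function name `suggest_problem_type` and the `max_classes` cut-off are assumptions made for this example, and a real decision also depends on what the target means in the business problem.

```python
import pandas as pd


def suggest_problem_type(target, max_classes=10):
    """Return a rough label for the learning problem implied by `target`.

    `target` is a pandas Series, or None when the data has no target column.
    `max_classes` is an arbitrary cut-off chosen for this sketch.
    """
    if target is None:
        return "unsupervised"
    # Text, category and boolean targets are treated as class labels
    if (not pd.api.types.is_numeric_dtype(target)
            or pd.api.types.is_bool_dtype(target)):
        return "classification"
    # Numeric targets with only a few distinct values are usually class codes
    if target.nunique() <= max_classes:
        return "classification"
    return "regression"


print(suggest_problem_type(None))
print(suggest_problem_type(pd.Series(["default", "no default", "default"])))
print(suggest_problem_type(pd.Series([i * 1.5 for i in range(100)])))
```

The three calls cover the three branches of the list: no target, a categorical target, and a continuous target.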

Let's look at supervised and unsupervised learning algorithms in more detail.

### Supervised Learning

A supervised learning algorithm works with a target / outcome variable (or dependent variable) that is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to the desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. As we saw previously, these algorithms can be divided into two main categories:

* Classification: we use the algorithm to predict a categorical outcome, for example whether the Democrats or the Republicans will win the election
* Regression: we use the algorithm to predict a numeric outcome, for example house prices

### Unsupervised Learning

In an unsupervised learning algorithm, we do not have any target or outcome variable to predict / estimate. It is typically used for clustering a population into groups, which is widely applied to segment customers for specific interventions. For example, we can group customers on the basis of what they like in order to identify their preferences.
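A minimal sketch of customer segmentation with k-means, using scikit-learn's `KMeans`. The customer data is invented for illustration (two features, annual spend and visits per month), and the choice of two clusters is an assumption of this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical customers: [annual spend, visits per month] in two separated groups
low_spenders = rng.normal(loc=[200, 2], scale=[20, 0.5], size=(50, 2))
high_spenders = rng.normal(loc=[800, 10], scale=[20, 0.5], size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# No target variable is supplied: the algorithm only sees the features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print(np.bincount(segments))    # number of customers in each segment
print(kmeans.cluster_centers_)  # the "typical" customer of each segment
```

In practice the features are usually scaled first, since k-means relies on distances and a large-valued feature such as spend would otherwise dominate.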

<img src="crux.png" style="width: 400px;height: 400px"/>

There is also a third type of learning called reinforcement learning, but for simplicity we will not be discussing it today. If you are curious, you can refer to [this article](https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/).

**Exercise**:

Q1. For each scenario below, identify what kind of problem it is: supervised (regression or classification) or unsupervised learning.

*Scenario 1:*

You have to predict how many CDs of a music album will be sold in the first three months after its launch.

*Scenario 2:*

Given a person’s credentials and background information, your system should assess whether a person should be eligible for a loan grant.

*Scenario 3:*

A marketing team wants to have a targeted marketing campaign based on the customer segment. 

## Building the first model

Let's take the simplest example of supervised learning. Suppose we try to predict the weight of a person using their height. Let the height be on the X axis and the weight on the Y axis.

Since we're trying to predict the weight, it is known as the **dependent** variable, while the height is known as the **independent** variable. We're assuming that weight depends on height. Also, you can see that our dependent variable is continuous, so it is a regression problem. We can draw a scatter plot of height and weight as below, and see that they are strongly correlated.

<img src="linear5.png" style="width: 300px;height: 200px">

The line that you see is the best fit line, which tries to summarise the points on the scatter plot. The simplest form of regression, with one dependent and one independent variable, is defined by the formula:

$Y = aX + b$

Above, you can see a black line passing through the data points. Notice that this line passes close to the points (190, 67) and (180, 65). Here's a question: what is the equation that describes this line? Your answer should have the form:

$Y = aX + b$

Now, what are the values of a and b?

Without going into the working, fitting the line to the data gives:

$a = 0.2811, b = 13.9$

Hence, our regression equation becomes: $Y = 0.2811X + 13.9$

Here, the slope is 0.2811 and the intercept is 13.9 (Y = 13.9 when X is 0).

This equation is known as the linear regression equation, where:

* Y is the target variable,
* X is the input variable,
* ‘a’ is known as the slope, and
* ‘b’ is known as the intercept.

This equation is used to estimate real values based on the input variable(s). Here, we establish a relationship between the independent and dependent variables by fitting a best line.

Now, you might think that in the above example there are many regression lines that could pass through the data points. So how do we choose the best fit line, i.e. the values of the coefficients a and b? You can read more about it [here](https://discuss.analyticsvidhya.com/t/importance-of-error-term-in-linear-equation/2428/2?u=jalfaizy).
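As a rough illustration of fitting such a line, the sketch below uses NumPy's `polyfit` with degree 1. The height and weight values are made up for this example and are not the data behind the plot, so the fitted slope and intercept will differ from the figures quoted above.

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg), invented for illustration
height = np.array([150, 160, 165, 170, 175, 180, 190], dtype=float)
weight = np.array([56, 59, 60, 62, 63, 65, 67], dtype=float)

# A degree-1 polynomial fit returns the slope first, then the intercept
a, b = np.polyfit(height, weight, deg=1)
print(f"slope a = {a:.4f}, intercept b = {b:.2f}")

# Use Y = aX + b to estimate the weight of a person who is 172 cm tall
estimated_weight = a * 172 + b
print(f"estimated weight at 172 cm: {estimated_weight:.1f} kg")
```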

## How to find the best regression line?

We discussed above that a regression line establishes a relationship between the independent and dependent variable(s). The line that explains this relationship best is called the best fit line.

In other words, the best fit line returns the most accurate value of Y for a given X, i.e. it gives the minimum difference between the actual and predicted values of Y (the lowest prediction error).

<img src="linear2.png" style="width: 300px;height: 200px">

Here are some ways to measure the total error:

* **Sum of all errors (∑error)** - With this method, positive and negative errors cancel each other out, which is not what we want. Hence, it is not the right method.
* **Sum of absolute values of all errors (∑|error|) and sum of squares of all errors (∑error^2)** - Both methods work well, but ∑error^2 penalizes large errors much more heavily than ∑|error| does.
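A small numeric sketch of the three error measures. The `actual` and `predicted` values are invented so that one prediction is off by 3 and the other three are off by 1.

```python
import numpy as np

actual = np.array([2.0, 5.0, 7.0, 9.0])
predicted = np.array([5.0, 4.0, 6.0, 8.0])
errors = actual - predicted           # [-3, 1, 1, 1]

sum_errors = errors.sum()             # positive and negative errors cancel
sum_abs = np.abs(errors).sum()        # every error counts in proportion to its size
sum_sq = (errors ** 2).sum()          # the single large error dominates

print(sum_errors, sum_abs, sum_sq)    # 0.0 6.0 12.0
```

The plain sum is 0 even though no prediction is correct. In the squared sum, the error of 3 contributes 9 of the total 12, whereas it contributes only 3 of the total 6 in the absolute sum.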

We generally prefer the sum of squared errors. The coefficients a and b are therefore chosen to minimize the sum of squared vertical distances between the data points and the regression line.
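For the single-variable line $Y = aX + b$, this minimization has a well-known closed-form solution. Writing $\bar{X}$ and $\bar{Y}$ for the means of the observed values (notation introduced here for illustration), the least-squares coefficients are:

$$a = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}, \qquad b = \bar{Y} - a\,\bar{X}$$

The second expression says that the best fit line always passes through the point of means $(\bar{X}, \bar{Y})$.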

## Performance Evaluation Metrics in Regression

To evaluate the performance of a regression line, we can look at its sum of squared errors (SSE): the lower, the better. It works well but has one drawback: SSE grows with the number of data points, so values computed on datasets of different sizes cannot be compared directly.

The other metric for evaluating linear regression is R-squared, which is the most common metric for judging the performance of regression models. R² measures how much of the change in the output variable (y) is explained by the change in the input variable (x).

**R-squared**:

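For a least-squares fit with an intercept, evaluated on the data it was trained on, R-squared is always between 0 and 1 (on new data it can even be negative when the model predicts worse than the mean):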

<img src="R2.png" style="width: 300px;height: 100px">

* 0 indicates that the model explains none of the variability in the response data around its mean.
* 1 indicates that the model explains all of the variability in the response data around its mean.

In general, the higher the R², the better the model fits the data.
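A short sketch of how R² can be computed by hand using the standard definition R² = 1 − SSE/SST, and checked against scikit-learn's `r2_score`. The `y_true` and `y_pred` values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 8.5])

sse = np.sum((y_true - y_pred) ** 2)         # variation left unexplained by the model
sst = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - sse / sst

print(r2_manual)                 # 0.95
print(r2_score(y_true, y_pred))  # same value from scikit-learn
```

Here SSE is 1.0 and SST is 20.0, so the model explains 95% of the variability around the mean.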

One disadvantage of R-squared is that it never decreases as predictors are added to the regression model. This increase is artificial when the predictors are not actually improving the model’s fit. To correct for this, we use “Adjusted R-squared”.

Adjusted R-squared is a modified version of R-squared that adjusts for the number of terms in the model. Like R-squared, it estimates the proportion of the variation in the dependent variable accounted for by the explanatory variables, but it also incorporates the model’s degrees of freedom. Adjusted R-squared decreases as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom, and it increases if the improvement in fit is worthwhile. It should always be used when comparing models with more than one predictor variable.
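A minimal sketch of the usual adjusted R² formula, 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The function name `adjusted_r2` and the sample sizes are assumptions made for this example.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Same R-squared, different numbers of predictors (50 observations assumed)
print(adjusted_r2(0.37, n_samples=50, n_predictors=2))
print(adjusted_r2(0.37, n_samples=50, n_predictors=10))
```

With the same R², the model that uses more predictors gets the lower adjusted value, which is exactly the penalty described above.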

<img src="R2ad.png" style="width: 300px;height: 200px">

## Multivariate Regression

Let’s now examine the process to deal with multiple independent variables related to a dependent variable.

Once you have identified which independent variables have a significant relationship with the dependent variable, you can use these significant variables together to make more powerful and accurate predictions. This technique is known as “Multi-variate Regression”.

Let’s take an example here to understand this concept further.

We know that a person’s compensation depends on their age: the older they get, the more they tend to earn compared with previous years. You build a simple regression model to explain this effect of age on a person’s compensation, and you obtain an R² of 27%. What does this mean?

In this example, an R² of 27% says that only 27% of the variance in compensation is explained by age. In other words, knowing a person’s age gives you about 27% of the information you would need to predict their compensation exactly.

Now, let’s add another variable, ‘time spent with the company’, to help determine the current compensation. With it, the R² value increases to 37%. How do we interpret this value now?

Notice that a person’s time with the company explains only an additional 10 percentage points of the variance in compensation. In other words, by adding this variable to our study, we improved our understanding of compensation from 27% to 37%.

Therefore, by using two variables rather than one, we improved our ability to make accurate predictions about a person’s salary.

Things get much more complicated when the independent variables are related to each other. This phenomenon is known as multicollinearity, and it is undesirable because it makes the individual coefficients unstable and hard to interpret. To detect it, check the Variance Inflation Factor (VIF) of each variable. As a rule of thumb, VIF < 2 indicates no meaningful multicollinearity (less strict cut-offs such as 5 or 10 are also commonly used). In case of a high VIF, look at the correlation table to find the highly correlated variables and keep only one of them.
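A sketch of how VIF can be computed from its definition: regress each predictor on all the others and take 1 / (1 − R²) of that regression. The helper name `vif` and the synthetic columns (`age`, `tenure`, `commute`) are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif(X):
    """Variance Inflation Factor for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R-squared of predicting column j from the remaining columns
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        factors.append(1.0 / (1.0 - r2))  # undefined if a column is perfectly collinear
    return np.array(factors)


rng = np.random.default_rng(0)
age = rng.uniform(25, 60, size=200)
tenure = 0.5 * age + rng.normal(0, 1, size=200)  # strongly tied to age
commute = rng.uniform(5, 60, size=200)           # unrelated to the other two

print(vif(np.column_stack([age, tenure, commute])).round(2))
```

The first two columns should show a high VIF because each can be predicted from the other, while the unrelated third column should stay close to 1.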

In a multiple regression model, the equation looks like this:

<img src="Linear4.png" style="width: 300px;height: 50px">

Here, b1, b2, b3, …, bk are the slopes for the independent variables X1, X2, X3, …, Xk, and "a" is the intercept.

*Note:* In general, a linear regression equation looks like $Y = b_1X_1 + b_2X_2 + \dots + a + e$, where e is the error term (noise in the dataset). For simplicity, we are not discussing it here.

Example: Net worth = a + b1 (Age) + b2 (Time with company)
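To make the equation concrete, the sketch below generates synthetic data from known coefficients and checks that scikit-learn's `LinearRegression` recovers them. The data, the "true" values (a = 10, b1 = 1.5, b2 = 2.0) and the variable names are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
age = rng.uniform(22, 60, size=300)
years_with_company = rng.uniform(0, 20, size=300)

# Hypothetical ground truth: a = 10, b1 = 1.5, b2 = 2.0, plus noise e
y = 10 + 1.5 * age + 2.0 * years_with_company + rng.normal(0, 1, size=300)

X = np.column_stack([age, years_with_company])
model = LinearRegression().fit(X, y)

print("slopes b1, b2:", model.coef_.round(2))
print("intercept a:", round(float(model.intercept_), 2))
```

The fitted slopes and intercept should land close to the values used to generate the data, with small differences caused by the noise term.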

-----------------------

## Hands-on practice problem

Now let's apply what we have learnt today to a practice problem, which we created specially for participants of DataHack hour.

To participate in the hackathon:

* Click on the [link provided](https://datahack.analyticsvidhya.com/contest/datahack-hour-bike-sharing/)
* Register for the contest.
* Then download the dataset from the website to work on the problem.
* Make sure you extract the files into the same folder as this Jupyter notebook.
* Your directory structure will look similar to this:

![img](Capture.png)

Now you can go forward and build your first linear regression model!

Let's look at a few basic steps you should always follow when participating in a hackathon:

* importing the important libraries
* loading the dataset
* exploring the dataset and visualizing it
* taking care of missing values and outliers
* separating the predictor and target variables
* creating a machine learning model
* training the model and getting its predictions
* finally, creating a submission file and submitting it

In [8]:
# import important libraries
import pandas as pd

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression

In [9]:
# load dataset
train_data = pd.read_csv('train.csv')
#test_data = pd.read_csv('test_uLBXQQR.csv')

In [10]:
train_data.head()

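| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |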


In [11]:
# separate predictor and target variables
X = train_data[['instant', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed']]
y = train_data[['cnt']]

#X_test = test_data[['instant', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       #'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed']]

In [12]:
# create machine learning model
lin = LinearRegression()

In [13]:
# train model
lin.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [14]:
# print the coefficients, i.e. one slope per predictor column
lin.coef_

array([[ -6.86366153e-03,   9.74255429e+00,   1.42690138e+02,
          7.37083015e+00,   6.90370950e+00,  -2.31765356e+01,
          1.50249988e+00,  -9.62561554e-01,  -7.46169147e+00,
          8.47724680e+01,   2.17417117e+02,  -1.59158992e+02,
          2.90376918e+01]])

In [15]:
# print intercept
lin.intercept_

array([-18.45206006])

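In [16]:
# load the test set and select the same predictor columns used for training
test_data = pd.read_csv('test_uLBXQQR.csv')
X_test = test_data[X.columns]

# get predictions
predictions = lin.predict(X_test)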

In [17]:
# create submission file (ravel() flattens the (n, 1) prediction array into one column)
submission = pd.DataFrame({'instant': test_data['instant'], 'cnt': predictions.ravel()})

submission.to_csv('submission.csv', index=False)

submission.head()

| | instant | cnt |
|---|---|---|
| 0 | 13036 | 302.732226 |
| 1 | 13037 | 322.607611 |
| 2 | 13038 | 342.562414 |
| 3 | 13039 | 378.024896 |
| 4 | 13040 | 375.520944 |


Now submit this as a solution on the DataHack platform. If you got a score of ~300, congratulations, you have just built your first linear regression model!
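The notebook imports `mean_squared_error` and `r2_score` but never calls them. One common way to use them is to hold back part of the training data and score the model on it before submitting. The sketch below shows the idea on synthetic data so that it runs on its own; the arrays `X_demo` and `y_demo` are stand-ins, and in the notebook you would pass `X` and `y` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and target (not the bike-sharing data)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(500, 3))
y_demo = 100 * X_demo[:, 0] - 50 * X_demo[:, 1] + rng.normal(0, 5, size=500)

# Keep 20% of the rows aside; the model never sees them during training
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_tr, y_tr)
val_pred = model.predict(X_val)

rmse = np.sqrt(mean_squared_error(y_val, val_pred))  # error in the target's own units
r2 = r2_score(y_val, val_pred)
print(f"validation RMSE = {rmse:.2f}, R^2 = {r2:.3f}")
```

A validation score gives a rough preview of how the model might do on unseen data, without spending a submission to find out.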

## A few points of caution when applying Linear Regression

Regression is a parametric approach. ‘Parametric’ means it makes assumptions about the data for the purpose of analysis. Because of this, regression is restrictive in nature: it fails to deliver good results on data sets that do not fulfil its assumptions. Therefore, for a successful regression analysis, it’s essential to validate these assumptions.

Let’s look at the important assumptions in regression analysis:

* There should be a linear and additive relationship between the dependent (response) variable and the independent (predictor) variable(s). A linear relationship means that the change in response Y due to a one-unit change in X is constant, regardless of the value of X. An additive relationship means that the effect of X on Y is independent of the other variables.
* There should be no correlation between the residual (error) terms. The presence of such correlation is known as autocorrelation.
* The independent variables should not be correlated with each other. The presence of such correlation is known as multicollinearity.
* The error terms must have constant variance. This property is known as homoskedasticity. The presence of non-constant variance is referred to as heteroskedasticity.
* The error terms must be normally distributed.
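A few of these assumptions can be checked roughly from the residuals of a fitted model. The sketch below uses synthetic data that satisfies the assumptions by construction. The three statistics are simple illustrative checks chosen for this example, not a substitute for proper diagnostic plots and tests.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a linear signal and independent, constant-variance noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=200)

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

# Autocorrelation: the Durbin-Watson statistic is close to 2 when
# consecutive residuals are uncorrelated
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Constant variance: the size of the residuals should not grow or shrink
# with the fitted value, so this correlation should be near 0
spread_corr = np.corrcoef(np.abs(residuals), fitted)[0, 1]

# Normality: the skewness of the residuals should be near 0 for symmetric errors
skewness = np.mean((residuals - residuals.mean()) ** 3) / residuals.std() ** 3

print(f"Durbin-Watson = {durbin_watson:.2f}")
print(f"corr(|residual|, fitted) = {spread_corr:.2f}")
print(f"residual skewness = {skewness:.2f}")
```

Values far from these reference points (2, 0 and 0) suggest that the corresponding assumption deserves a closer look.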

Validating these assumptions is a somewhat advanced topic, so we will not cover it here. You can refer to the article [Going Deeper into Regression Analysis with Assumptions, Plots & Solutions](https://www.analyticsvidhya.com/blog/2016/07/deeper-regression-analysis-assumptions-plots-solutions/) for more information.

That's all for today!
----------------
-------------------------------
<img src="AV_Datafest_logo.png" style="width: 200px;height: 200px"/>
[www.analyticsvidhya.com](www.analyticsvidhya.com)

DATAFEST 2017