In [1]:
import ipywidgets as widgets
from IPython.display import display, clear_output

In [2]:
# Read the image inside a context manager so the file handle is closed
with open("Figures/vertica.png", "rb") as file:
    image = file.read()

image_headline = widgets.Image(
                    value=image,
                    format='png',
                    #height='80%',
                    width='20%'
                )

vbox_headline = widgets.VBox([image_headline])
vbox_headline.layout.align_items = 'center'
display(vbox_headline)


In [3]:

# Read the image inside a context manager so the file handle is closed
with open("Figures/back.png", "rb") as file:
    image = file.read()

image_headline = widgets.Image(
                    value=image,
                    format='png',layout=widgets.Layout(width='30px', height='24px')
                )

text_0=link = widgets.HTML(
    value="<a href=http://localhost:8888/voila/render/Documents/Template/MAIN_v1.ipynb target='_blank'><font size='2' color='blue'>Go Back to Main Page</font></a>",
)
hbox_line2 = widgets.HBox([image_headline,text_0])
display(hbox_line2)


# Linear Regression
## Fundamentals

In [4]:
# Read the image inside a context manager so the file handle is closed
with open("Figures/time.png", "rb") as file:
    image = file.read()

image_headline = widgets.Image(
                    value=image,
                    format='png',layout=widgets.Layout(width='14px', height='19px')
                )

text_0=link = widgets.HTML(value="<font size='2'>15 mins</font>")
hbox_line2 = widgets.HBox([image_headline,text_0])
display(hbox_line2)


__Table of Contents__

- [Regression](#regression_cell)
- [Linear Regression](#linear_regression_cell)
- [Multiple Linear Regression](#multiple_linear_regression_cell)
- [Assumptions](#assumptions_cell)

Linear regression is one of the most fundamental and effective tools in data analysis. It is the bedrock for more advanced techniques, and it requires no prerequisite knowledge beyond a basic understanding of algebra.

This lesson goes over the key ideas of linear regression, the contexts where it should be used, and how to create and improve linear regression models.

### Covered in This Module
- The motivation and key ideas behind regression
- Linear and multiple linear regression
- When to use linear regression
- How to improve a linear regression model

***

### Regression

<a id='regression_cell'></a>

Regression is a data analysis tool that identifies relationships between a continuous, quantitative (numeric) dependent variable (e.g. the price of a house) and one or more independent variables (e.g. the size of the house, the number of rooms).

Regression is fundamentally different from classification: in classification, the dependent variable takes one of a set of categorical, discrete values (e.g. the type of house, like a condo or apartment). However, the independent variables for both regression and classification can take on any form (numeric or categorical).

The key idea behind regression is to use a set of data points to estimate some true underlying function:

$Y=\hat{f}(X)$

Where $X$ is the independent variable, $Y$ is the dependent variable, and $\hat{f}$ is the approximation of the true underlying function.

After using regression to identify the relationship between the independent and dependent variables, you can make inferences and predictions.

Making inferences is the act of drawing insights from the data. This could mean determining the strength of the relationship between the independent and dependent variables, or how a change in the independent variable could affect the dependent variable.

Prediction involves using the estimated relationship (identified by the regression and expressed as a function) and the independent variable to estimate changes to the dependent variable. For example, you could use regression to approximate the underlying cost function of houses with respect to their size, then use the function to predict the price of a house with a given size.
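As a minimal sketch of estimating $\hat{f}$ and then using it for inference and prediction, the example below fits a straight line to a small, made-up table of house sizes and prices with NumPy's `polyfit`. The data values and the names `sizes`, `prices`, and `f_hat` are illustrative assumptions and are not part of this lesson's dataset.

```python
import numpy as np

# Hypothetical data: house sizes (square meters) and prices (thousands)
sizes = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
prices = np.array([150.0, 200.0, 260.0, 300.0, 360.0])

# Estimate f_hat as a degree-1 polynomial (a straight line)
slope, intercept = np.polyfit(sizes, prices, deg=1)

def f_hat(size):
    """Approximate price for a given size using the fitted line."""
    return intercept + slope * size

# Inference: how much does the price change per extra square meter?
print(f"Estimated price change per square meter: {slope:.2f}")

# Prediction: estimate the price of a house that is not in the data
print(f"Predicted price for 100 square meters: {f_hat(100.0):.1f}")
```

The slope supports an inference (how strongly price responds to size), while calling `f_hat` on a new size is a prediction.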

***

### Linear Regression

<a id='linear_regression_cell'></a>

As the name suggests, linear regression is a type of regression that assumes that the underlying relationship between the independent and dependent variables is linear. This results in the following parametric equation, where $X$ is the independent variable, $Y$ is the dependent variable, and $\beta_{0}$ and $\beta_{1}$ are the intercept and slope:

 $Y\approx\ \beta_{0}+\beta_{1}X$

This parametric equation produces a line, and the goal of linear regression is to identify values for $\beta_{0}$ and $\beta_{1}$ that produce a line that best fits a given dataset.

For example, you could create a simple equation for the fuel consumption of a car based on its engine size:

 $\text{consumption} = \beta_{0} + \beta_{1} \times \text{engine size}$

The question here is: how do you get the values for $\beta_{0}$ and $\beta_{1}$?
        
As previously stated, $\beta_{0}$ and $\beta_{1}$ represent the intercept and slope, so any given combination of $\beta_{0}$ and $\beta_{1}$ defines a unique line through a dataset. The most common method for finding the line of best fit is the "least squares" method, where the goal is to minimize the sum of the squared errors between the values predicted by the line and the observations in the dataset.
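The least squares idea can be sketched in a few lines of NumPy. The data below is synthetic, and the helper name `sum_squared_error` and the two alternative candidate lines are assumptions made for illustration. The expressions for the slope and intercept are the standard closed-form least squares solution for a single independent variable.

```python
import numpy as np

# Synthetic data (an assumption for illustration): roughly y = 2 + 3x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=x.size)

def sum_squared_error(beta_0, beta_1):
    """Sum of squared vertical distances between the line and the data."""
    residuals = y - (beta_0 + beta_1 * x)
    return float(np.sum(residuals ** 2))

# Closed-form least squares estimates for a single independent variable
beta_1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0_hat = y.mean() - beta_1_hat * x.mean()

# Compare the fitted line against two arbitrary candidate lines
candidates = {
    "least squares": (beta_0_hat, beta_1_hat),
    "candidate A": (0.0, 3.5),
    "candidate B": (5.0, 2.0),
}
for name, (b0, b1) in candidates.items():
    print(f"{name}: beta_0={b0:.2f}, beta_1={b1:.2f}, SSE={sum_squared_error(b0, b1):.1f}")
```

No other choice of intercept and slope can produce a smaller sum of squared errors on this data than the least squares pair.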

For example, the graph below plots a dataset and contains three lines (that is, three unique combinations of $\beta_{0}$ and $\beta_{1}$). Of these three lines, the solid one highlighted in red is considered the line of best fit because it minimizes the squared error between the points on the line and the points in the dataset:

In [14]:
# Read the image inside a context manager so the file handle is closed
with open("Figures/Theory_Fig1.png", "rb") as file:
    image = file.read()

image= widgets.Image(
                    value=image,
                    format='png',
                    width='500px'
                )
display(image)

[Figure: a dataset with three candidate lines; the solid red line is the line of best fit]

<b>Finding the Line of Best Fit</b>

In [140]:
def create_multipleChoice_widget(options, correct_answer):
    # Work on a copy so the caller's list is not modified
    options = list(options)
    if correct_answer not in options:
        options.append(correct_answer)
    
    correct_answer_index = options.index(correct_answer)
    
    radio_options = [(words, i) for i, words in enumerate(options)]
    alternativ = widgets.RadioButtons(
        options = radio_options,
        description = '',
        disabled = False
    )
    
    description_out = widgets.Output()
#    with description_out:
#        print(description)
        
    feedback_out = widgets.Output()

    def check_selection(b):
        a = int(alternativ.value)
        if a==correct_answer_index:
            s = '\x1b[6;30;42m' + "Correct!" + '\x1b[0m' +"\n" #green color
        else:
            s = '\x1b[5;30;41m' + "Try again" + '\x1b[0m' +"\n" #red color
        with feedback_out:
            clear_output()
            print(s)
        return
    
    check = widgets.Button(description="submit")
    check.on_click(check_selection)
    
    
    return widgets.VBox([description_out, alternativ, check, feedback_out])

In [141]:
Q1 = create_multipleChoice_widget(['Positively','Negatively'],'Positively')

<div class="alert alert-block alert-success">
<b>Knowledge check:</b> In the above plot, how are X and Y correlated?
</div>

In [142]:
display(Q1)

VBox(children=(Output(), RadioButtons(options=(('Positively', 0), ('Negatively', 1)), value=0), Button(descrip…

***

### Multiple Linear Regression

<a id='multiple_linear_regression_cell'></a>

In most datasets, you will not have a single predictor (independent variable); rather, your output (dependent variable) will depend on several factors. For example, the mileage of a car is not determined solely by its model year; it also depends on other factors such as engine size and weight.

To account for multiple independent variables, multiple linear regression extends the parametric equation as follows:

 $Y\approx\ \beta_{0}+\beta_{1}X_{1}+...+\beta_{n}X_{n}$
 
Where $n$ is the number of independent variables. 
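A minimal sketch of fitting this equation, assuming synthetic data with two independent variables and known coefficients, is shown below. It uses NumPy's `np.linalg.lstsq` to solve the least squares problem; the variable names (`design`, `true_beta`, `beta_hat`) are illustrative assumptions.

```python
import numpy as np

# Synthetic data (an assumption): two independent variables and known coefficients
rng = np.random.default_rng(1)
n_samples = 200
X = rng.uniform(0.0, 10.0, size=(n_samples, 2))
true_beta = np.array([1.0, 2.0, -0.5])  # beta_0, beta_1, beta_2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0.0, 0.1, size=n_samples)

# Prepend a column of ones so that beta_0 is estimated as the intercept
design = np.column_stack([np.ones(n_samples), X])

# Solve the least squares problem: minimize ||design @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print("Estimated coefficients:", np.round(beta_hat, 2))
```

The column of ones plays the role of the constant term, so the first entry of `beta_hat` is the estimate of $\beta_{0}$ and the remaining entries are the estimates of $\beta_{1}$ and $\beta_{2}$.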

***

### Assumptions

<a id='assumptions_cell'></a>

Linear regression relies on the following assumptions:

- __Linear__:  As the name suggests, linear regression assumes that the relationship between the dependent and independent variables is linear. However, you can work around this limitation if you know that the underlying relationship is non-linear (e.g. quadratic); you can create a new variable ($X_{1}^2$) and add it to the equation:

  $Y\approx\ \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{1}^2$
 
  Depending on the underlying function, the original term $X_{1}$ can also be completely removed from the equation.
 
- __Independence__: Linear regression assumes that the independent variables are not related to each other (i.e. they are not correlated).
  
  You can check whether the independent variables are correlated with a correlation matrix. However, a correlation matrix only detects collinearity between pairs of variables, and a multicollinear relationship (where three or more variables are collinear) may still exist. In that case, you can compute the variance inflation factor to estimate the severity of the multicollinearity.
  
  After assessing the collinearity, you can drop some of the collinear variables to make the coefficient estimates of the linear regression more reliable.
 
- __Additive__: All independent variables are additive. That is, the effect of one independent variable on the dependent variable does not depend on the value of any other independent variable. If you find that your dataset contains non-additive independent variables, you can work around them by creating a new interaction variable that captures their combined effect on the dependent variable.
  
  For example, the weight of a car and its engine size are both independent variables that affect the car's mileage, the dependent variable. However, the weight of the car and its engine size are not additive because engine size necessarily has implications for the weight of the car. To address this, you can introduce an interaction variable $X_{1} X_{2}$. This results in the following equation:

  $Y\approx\ \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+ \beta_{3}X_{1}X_{2}$
 

- __Homoscedastic noise__: The variance of the errors (noise) must be constant throughout the dataset. The opposite is __heteroscedastic noise__, shown on the right side of the figure below: the data points are widely scattered where the predicted values are high and tightly clustered where the predicted values are low. With __homoscedastic noise__, by contrast, the errors have the same spread across all predictions.
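For the independence assumption in the list above, one way to compute the variance inflation factor (VIF) is to regress each independent variable on all the others and use $VIF_{j} = 1/(1-R_{j}^{2})$. The sketch below does this with plain NumPy on synthetic data; the function name `variance_inflation_factors` and the data are assumptions for illustration. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are observations)."""
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        # Regress column j on the remaining columns (with an intercept)
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic data (an assumption): x3 is nearly a linear combination of x1 and x2
rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
X = np.column_stack([x1, x2, x3])

print("Pairwise correlations:")
print(np.round(np.corrcoef(X, rowvar=False), 2))
print("VIF per variable:", np.round(variance_inflation_factors(X), 1))
```

In this construction no single pair of variables is perfectly correlated, yet the three variables together are almost linearly dependent, which is the situation the VIF is designed to expose.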



In [15]:
# Read the image inside a context manager so the file handle is closed
with open("Figures/Theory_Fig4.png", "rb") as file:
    image = file.read()

image= widgets.Image(
                    value=image,
                    format='png',
                    width='800px'
                )
display(image)

[Figure: homoscedastic noise compared with heteroscedastic noise (right)]

  In cases where a dataset contains __heteroscedastic noise__, you can mitigate the effect by changing the dependent variable to $\log(Y)$ or $\sqrt{Y}$. The figure below shows how taking the log of the output makes the variance of the errors more even.


In [17]:
# Read the image inside a context manager so the file handle is closed
with open("Figures/Theory_Fig7.png", "rb") as file:
    image = file.read()

image= widgets.Image(
                    value=image,
                    format='png',
                    width='800px'
                )
display(image)

[Figure: variance of the errors before and after taking the log of the output]
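The effect of the log transform can be sketched with synthetic data in which the noise is multiplicative, so its spread grows with the size of $Y$. The data-generating process and the helper `residual_spread_ratio`, which compares the residual standard deviation for the upper and lower halves of the fitted values, are assumptions chosen for illustration. This is a crude check, not a formal test for heteroscedasticity.

```python
import numpy as np

# Synthetic data (an assumption): multiplicative noise makes the spread of y grow with x
rng = np.random.default_rng(3)
x = np.linspace(1.0, 10.0, 400)
y = np.exp(0.5 + 0.3 * x) * np.exp(rng.normal(0.0, 0.2, size=x.size))

def residual_spread_ratio(x, target):
    """Fit a line, then compare residual spread for high vs. low fitted values."""
    slope, intercept = np.polyfit(x, target, deg=1)
    fitted = intercept + slope * x
    residuals = target - fitted
    upper = residuals[fitted > np.median(fitted)]
    lower = residuals[fitted <= np.median(fitted)]
    return upper.std() / lower.std()

raw_ratio = residual_spread_ratio(x, y)
log_ratio = residual_spread_ratio(x, np.log(y))
print(f"Spread ratio with Y:      {raw_ratio:.2f}")
print(f"Spread ratio with log(Y): {log_ratio:.2f}")
```

A ratio close to 1 means the residuals have a similar spread for low and high predictions. A ratio well above 1 means the spread grows with the prediction, which is the heteroscedastic pattern described above.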

***

<font style="font-family:Calibri"> Author Name: Umar Farooq Ghumman
<br>
Author Contact: abs@verticapy.com</font>

### Resources

- [<font size='2'>Auto Data</font>](https://raw.githubusercontent.com/mail4umar/ISLR-python/master/Notebooks/Data/Auto.csv)

- [<font size='2'>An Introduction to Statistical Learning</font>](https://www.statlearning.com/)