# Regression for Learning Platform

Regression is a wide class of models and a powerful tool when it comes to modeling different problems. 

The **basic task** is to fit a function, in the simplest case a line, through data. This simple method serves well to illustrate basic concepts, relevant for more complex models. The resulting model as a generalization of the data is used for interpretation and prediction of continuous values.  

We will start with the simplest regression model, a **linear regression**. Linear regression is useful to:
- Check the existence of a linear relationship between two or more variables
- Quantify the strength of a given relationship
- Grant insights via interpretation of coefficients
- Make predictions for unknown values given a model and an input

Regression is widely used in practice because it is simple, interpretable, and relatively robust, provided that the data fulfill its basic assumptions. Below, you will find several example applications:

- Economics: Finding trends in time series
- Medicine: Investigating treatment effects

## Some Use Case to Begin With

First, we look at a short example to understand what linear regression can be used for.  
Let us look at a basic modeling problem of food quality.
Imagine an internet platform that has the following information: the average price (paid per person in Euro) and the corresponding average rating of food quality from 1 to 5 stars (5 is the best).

<img src="assets/img/food.png" style="float: left;width: 400px;">

What we definitely see here is that a higher price in general goes along with a higher rating, which is also intuitive. Now, given the data, we can quantify this relationship. At the end of this notebook, you should be able to answer the following questions more precisely with the help of linear regression. For example:

- How can we describe the given data with a simple mathematical formula (model) with two coefficients?
- How well does the model describe the data?
- If we have 30 Euros, what food quality can we approximately expect for this money?
- What can investing an additional 5 Euros bring with respect to food quality?

## Takeaways from This Notebook 

- Basic understanding of the regression principle
- Understanding of least squares estimation
- Performing a simple regression in Python with two packages
- Interpreting regression results via coefficients & fit statistics
- Performing polynomial regression
- Understanding the concept of bias-variance trade-off
- Performing a residual analysis

In the following, we will work through a more general example with some mathematical background, where we use only two variables, $X$ and $Y$. Feel free to think about what these could be in your context ;). We will work on a data set which we create by means of Python functions. Because we know the true parameters of the underlying function, this data will help us check later on how well our regression performs.

In [1]:
# For computations
import pandas as pd
import numpy as np

# For plotting
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

# Setting the plot to standard parameters
matplotlib.rcdefaults()
# Formatting the size of future plots
plt.rcParams["figure.figsize"] = (4,3)
sns.set(style="whitegrid")

# For random number generation
from numpy.random import seed
from numpy.random import rand

# For fitting of regression line 

# For univariate regression
from scipy import stats

# For multivariate regression
import statsmodels.api as sm
import statsmodels.formula.api as smf

We generate a small set of points by sampling the $x$ values from a uniform distribution, computing $y$ from a linear function, adding some random noise and plotting the result. The underlying line is the **true linear relationship** we wish to reconstruct by means of linear regression.

In [None]:
# Setting the seed ensures the reproducibility of the results

# The number in brackets is arbitrary and just has to stay the same between runs
seed(1)

# Simulate 20 points in the interval 0 to 10
number_points = 20

# The multiplier stretches the scale to the interval 0 to 10 but does not have any specific meaning
# This also holds for the other numbers; as an exercise, their influence can be tested by changing them
x = rand(number_points)*10

# Define equation for the line and adding some noise with mu=0, sigma=5
y = 3*x + 5 + np.random.normal(0, 5, size=number_points)

# Plot the result
plt.plot(x, y, '.', markersize=10)
plt.xlabel("Value X")
plt.ylabel("Value Y")
plt.grid(False)
plt.rcParams["figure.figsize"] = (4,3)
plt.show()

What we now want to do is to **fit a linear function** of the form given in the equation below. 

Here, $X$ is the input value and $Y$ is the output we wish to model. The model only approximates the measured data (hence the $\thickapprox$ sign), as a random noise component leads to a distance between the line and each data point. 

This is an example of the **univariate** case, where only one input variable is given. A case with more input variables is called **multivariate**, and the task generalizes to fitting a hyperplane.

$$ Y \thickapprox \beta_0 + \beta_1X$$
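The multivariate case mentioned above can be sketched with plain NumPy. This is only an illustration under assumed settings: the two input variables, the coefficients in `true_beta` and the noise level are made up for this example and do not appear elsewhere in this notebook.

```python
import numpy as np

# Hypothetical data: two input variables and a known plane plus noise
rng = np.random.default_rng(0)
n = 200
X_multi = rng.uniform(0, 10, size=(n, 2))
true_beta = np.array([5.0, 3.0, -2.0])  # assumed: intercept, beta_1, beta_2
y_multi = true_beta[0] + X_multi @ true_beta[1:] + rng.normal(0, 1, size=n)

# A column of ones lets the intercept be estimated together with the slopes
A = np.column_stack([np.ones(n), X_multi])

# Least squares solution of A @ beta = y_multi
beta_hat, *_ = np.linalg.lstsq(A, y_multi, rcond=None)
print("estimated coefficients:", beta_hat)
```

The estimated coefficients should lie close to the assumed true values, and the fitted object is now a plane instead of a line.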

## Optimization Criterion - Residual Sum of Squares

There are multiple possibilities to draw a line through the sample data. The following plot visualizes some of them, which at first glance might all seem plausible. Therefore, an important step is the definition of the optimization criterion (goal), which specifies the problem to be solved and thereby determines which line is chosen. 

<img src="assets/img/optcriterion.png" style="float: left;width: 400px;">

The general idea is to fit the line that is closest to the data. Mathematically, the idea is to minimize the sum of distances to the line, measured in the $Y$ direction (ordinary least squares regression). As the goal is to predict the $Y$ value, we measure the distance in its direction, which is visualized in the following figure. The distance of a single point to the line in this direction is called the **residual**.

<img src="assets/img/residual.png" style="float: left; width: 400px;">

As the second step, we take the squared residual. Squaring has several reasons:
1. Squaring removes the sign of the residual, so positive and negative residuals do not cancel out when summed and deviations in both directions contribute equally.
2. Strong deviations are penalized more heavily. 
3. Optimization of a quadratic function is easier than of an absolute value (due to simpler derivatives). 

In the following, the formula of the residual sum of squares is displayed. An individual residual $e_i$ of point $i$ is computed as the difference between the real value $y_i$ and the predicted value $\hat{y_i}$. Then the difference is squared and summed over all $n$ points.

$$RSS = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y_i})^2$$

The overall goal is to minimize the residual sum of squares, which means that the line lies closest to the data with respect to the predicted variable. The resulting optimization problem is solved for the parameters. 
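To make the criterion concrete, the following sketch computes the RSS for a few candidate lines. The helper `rss`, the candidate coefficients and the random generator are assumptions made for this illustration; the data follow the same recipe as above but use a different generator, so the numbers differ from the notebook cells.

```python
import numpy as np

def rss(beta_0, beta_1, x, y):
    """Residual sum of squares of the line beta_0 + beta_1 * x."""
    y_hat = beta_0 + beta_1 * x
    return np.sum((y - y_hat) ** 2)

# Data in the style of the example above (assumed generator and seed)
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 10, size=20)
y_demo = 3 * x_demo + 5 + rng.normal(0, 5, size=20)

# Hypothetical candidate lines as (intercept, slope)
candidates = {"flat": (20.0, 0.0), "too steep": (0.0, 5.0), "true line": (5.0, 3.0)}
for name, (b0, b1) in candidates.items():
    print(name, rss(b0, b1, x_demo, y_demo))

# The least squares line has the smallest RSS of all possible lines
b1_ls, b0_ls = np.polyfit(x_demo, y_demo, 1)
rss_ls = rss(b0_ls, b1_ls, x_demo, y_demo)
print("least squares", rss_ls)
```

Note that even the true line usually has a slightly larger RSS than the least squares line, because the fit adapts to the noise in the sample.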

There are several ways to minimize this function. The straightforward approach known from basic calculus is to take the first derivative of the function, set it to zero and then solve for the parameters. In the multivariate case, this means computing the [gradient](https://en.wikipedia.org/wiki/Gradient) with respect to the parameters. Later on we will see examples where this approach of analytically solving for the optimal parameters is no longer feasible. In such a case one would resort to numerical optimization algorithms, such as [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent). In either case, the task is to solve a function minimization problem, so the method is flexible with respect to how the minimum is found.
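As a rough sketch of the numerical route, the following code minimizes the RSS (divided by $n$ for convenient scaling) with plain gradient descent. The learning rate, the number of iterations and the start values are assumptions chosen for this small data set, not recommended defaults.

```python
import numpy as np

# Data in the style of the example above (assumed generator and seed)
rng = np.random.default_rng(1)
x_gd = rng.uniform(0, 10, size=20)
y_gd = 3 * x_gd + 5 + rng.normal(0, 5, size=20)
n = len(x_gd)

beta_0, beta_1 = 0.0, 0.0   # assumed start values
learning_rate = 0.005       # assumed step size, small enough for x in [0, 10]

for _ in range(20000):
    residuals = y_gd - (beta_0 + beta_1 * x_gd)
    # Partial derivatives of RSS / n with respect to beta_0 and beta_1
    grad_0 = -2 * np.sum(residuals) / n
    grad_1 = -2 * np.sum(residuals * x_gd) / n
    # Step against the gradient
    beta_0 -= learning_rate * grad_0
    beta_1 -= learning_rate * grad_1

print("gradient descent:", beta_0, beta_1)
print("np.polyfit      :", np.polyfit(x_gd, y_gd, 1)[::-1])
```

For linear regression the analytical solution is faster and exact; the iterative version is shown only because the same loop carries over to models without a closed-form solution.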

## Derivation of Coefficients $\beta_0$ and $\beta_1$

We derive the coefficients as follows:
1. Replace the estimate by $\hat{y_i}= \beta_1x_i+\beta_0$
2. Compute the partial derivatives, set them to zero and solve the resulting system of equations for parameters.

$$RSS= \sum_{i=1}^n (y_i - (\beta_1x_i+\beta_0))^2$$

$$ \frac{\partial RSS}{\partial \beta_0}= \sum_{i=1}^n 2(y_i- (\beta_0+ \beta_1x_i))(-1) \overset{!}{=} 0$$

$$ n \beta_0 = \sum_{i=1}^n (y_i - \beta_1x_i)$$

3. Divide both sides by the number of points $n$ and substitute $\frac{1}{n}\sum_{i=1}^n y_i$ by the mean $\bar{y}$, analogously for $\frac{1}{n}\sum_{i=1}^n x_i=\bar{x}$.

4. This leads to $\bar{y} = \beta_0 + \beta_1\bar{x}$, which means that the regression line passes through the point of means $(\bar{x}, \bar{y})$. 

5. Solve the latter for $\beta_0$ and use the reformulation $\beta_0 = \bar{y}-\beta_1\bar{x}$ in the next steps. 
    
$$ RSS= \sum_{i=1}^n (y_i - (\beta_1x_i+\beta_0))^2 $$

$$ RSS= \sum_{i=1}^n (y_i - (\beta_1x_i+\bar{y}-\beta_1\bar{x}))^2 $$

$$ RSS= \sum_{i=1}^n (y_i - \bar{y} - \beta_1(x_i-\bar{x}))^2 $$

$$ \frac{\partial RSS}{\partial \beta_1}= \sum_{i=1}^n 2(y_i - \bar{y} - \beta_1(x_i-\bar{x}))(-1)(x_i - \bar{x}) \overset{!}{=} 0$$

$$\beta_1 = \frac{ \sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n(x_i - \bar{x})^2}$$
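The two closed-form expressions can be checked numerically. The following sketch applies them to data generated in the style of the example above (the generator and seed are assumptions, so the values differ from the notebook cells) and compares the result with `np.polyfit`.

```python
import numpy as np

# Data in the style of the example above (assumed generator and seed)
rng = np.random.default_rng(1)
x_cf = rng.uniform(0, 10, size=20)
y_cf = 3 * x_cf + 5 + rng.normal(0, 5, size=20)

x_bar, y_bar = x_cf.mean(), y_cf.mean()

# Slope: sum of cross deviations divided by sum of squared deviations of x
beta_1 = np.sum((y_cf - y_bar) * (x_cf - x_bar)) / np.sum((x_cf - x_bar) ** 2)
# Intercept: follows from the line passing through the point of means
beta_0 = y_bar - beta_1 * x_bar

slope_np, intercept_np = np.polyfit(x_cf, y_cf, 1)
print("closed form:", beta_0, beta_1)
print("np.polyfit :", intercept_np, slope_np)
```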

The generic visualization of the univariate problem is below and shows the basic idea to find the minimum of the RSS function. In the figure, the values of $\beta_0$ and $\beta_1$ correspond to the minimum of the RSS function.

<img src="assets/img/rss.png" >

## Goodness of Fit - $R^2$

Typically, statistical libraries provide the $R^2$ statistic to evaluate the fit quality of a linear regression.
In order to derive the goodness of fit coefficient $R^2$ mathematically, we introduce the definition of the variance of the dependent variable $Y$ first. 

The **variance** measures the mean squared distance of the observations to the mean $\bar{y}$ of the variable $Y$. Predicting $\bar{y}$ for every point corresponds to a regression with only the intercept. 

The sum of these squared distances summarizes the total variation of the dependent variable and is called the total sum of squares (TSS); it equals $n$ times the variance of $Y$. The overall goal is to reduce this variation by introducing the model, e.g. a linear model:

$$TSS = \sum_{i=1}^n (y_i-\bar{y})^2 = n \cdot Var(Y)$$
where 
$$\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i$$

The following figure illustrates the result. 

In the left figure a stronger noise was added, therefore there is a higher **TSS** - more variance of $Y$ to explain. 

At the same time, the **RSS**, the residual sum of squares, is higher in the left figure - you can observe that the line is overall "further away" from the data. 

<img src="assets/img/tss.png" style=" width: 650px;">

$R^2$ measures the explained variance (TSS minus RSS in the numerator) normalized by the total variance of the dependent variable (TSS in the denominator): 

$$R^2 = \frac{TSS-RSS}{TSS}$$

As we can see from the formula, the value of $R^2$ ranges from 0 to 1 for a least squares fit with an intercept, since in this case $0 \le RSS \le TSS$. 

The optimal case $R^2=1$ occurs when all the points lie exactly on the regression line. In this case, the resulting error measured by the sum of squared residuals is $RSS=0$, leading to $R^2 = (TSS - 0)/TSS = 1$.

In the figure above, we see that $R^2$ is higher in the left plot, where about 91 % of the total variance of $Y$ is explained.
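The formula can be reproduced in a few lines. In this sketch, TSS and RSS are computed as plain sums of squares (a common factor such as $1/n$ would cancel in the ratio), and the data are generated with an assumed seed, so the value differs from the figure. For a univariate linear fit, $R^2$ also equals the squared correlation coefficient between $X$ and $Y$, which serves as a cross-check.

```python
import numpy as np

# Data in the style of the example above (assumed generator and seed)
rng = np.random.default_rng(1)
x_r2 = rng.uniform(0, 10, size=20)
y_r2 = 3 * x_r2 + 5 + rng.normal(0, 5, size=20)

# Least squares line and its predictions
slope_r2, intercept_r2 = np.polyfit(x_r2, y_r2, 1)
y_hat = intercept_r2 + slope_r2 * x_r2

tss = np.sum((y_r2 - y_r2.mean()) ** 2)   # total sum of squares
rss = np.sum((y_r2 - y_hat) ** 2)         # residual sum of squares
r_squared = (tss - rss) / tss

# Cross-check: squared Pearson correlation between x and y
r = np.corrcoef(x_r2, y_r2)[0, 1]
print("R^2 from sums of squares:", r_squared)
print("squared correlation     :", r ** 2)
```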

## Implementations in Python

Now that we have seen how the coefficient estimation works analytically, we will look at the implementation. There are several ways to implement linear regression in Python (eight of them are shown [here](https://medium.freecodecamp.org/data-science-with-python-8-ways-to-do-linear-regression-and-measure-their-speed-b5577d75f8b)). In the following, we will practice two of them. The first package, *scipy.stats*, implements univariate regression and returns the essential parameters as the result.

In [4]:
# The following line computes the relevant parameters of the regression
# The left side creates variables to receive the returned values
# The right side is given the input and output variables
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

Visualizing the result leads to the code and the figure below. It is always a good **plausibility check** to look at the result visually; a further check using the residuals will be discussed below.

In [None]:
# Plotting the result leads to the following
# Here the original data used for regression
plt.plot(x, y, 'o', label='Original data')
# Using the slope and the intercept estimates, the resulting linear function is plotted as follows
plt.plot(x, intercept + slope*(x), 'r', label='Fitted line')

# Configuring the plot
plt.title("Linear Regression")
plt.xlabel("Value X")
plt.ylabel("Value Y")
plt.legend()
plt.grid(False)
plt.show()

## Exemplary Interpretation of the Parameters

The interpretation of the parameters from above teaches us the following:

- **Slope**: the relationship between independent and dependent variable, where an increase of 1 in the independent variable $X$ corresponds to a change in the dependent variable $Y$ equal to the slope
- **Intercept**: the value of the dependent variable that the model predicts when the independent variable is 0
- **$R^2$ (R-Squared)**: shows how well the line explains the data in the range from 0 to 1 as goodness of fit
- **Standard error**: the estimated standard deviation of the slope estimate, i.e. how strongly the estimated slope would vary between repeated samples
- **P-value**: the probability of observing a slope at least as extreme as the estimated one if the true slope were zero; small values indicate a statistically significant relationship

The next code block prints the coefficients, which are shortly interpreted below.

In [None]:
print("slope:", slope)
print("intercept:", intercept)
print("r-squared:", r_value**2)
print("std_err:", std_err)
print("p-value:", p_value)

- The slope of 2.87 is very close to the original value of 3.00, so given only 20 points we are quite close to the true value despite noise with a standard deviation of 5.0 (see the data generation to find the respective parameters and change them to see the effect). 
- The intercept is even closer to the true parameter of 5.0. 
- The value of $R^2$ is 0.77 (the maximal possible value is 1), which says that 77 % of the variance in the Y direction is explained by the line. 
- The standard error of 0.36 is the estimated standard deviation of the slope estimate, as returned by `stats.linregress`. It is computed from the residual sum of squares, normalized by the number of observations reduced by the number of estimated parameters, and from the spread of the $X$ values. The true slope of 3.00 lies within one standard error of the estimate of 2.87. 
- The p-value is below the level of 0.001 (strong significance), which means that the data provide strong evidence against a slope of zero, i.e. for a linear relationship between $X$ and $Y$.
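To see where the standard error and the p-value come from, the following sketch recomputes both by hand and compares them with the output of `stats.linregress`. It assumes the usual textbook formulas for simple linear regression (residual variance with $n-2$ degrees of freedom and a two-sided t-test of the slope against zero); the data are generated with an assumed seed, so the numbers differ from the ones discussed above.

```python
import numpy as np
from scipy import stats

# Data in the style of the example above (assumed generator and seed)
rng = np.random.default_rng(1)
x_se = rng.uniform(0, 10, size=20)
y_se = 3 * x_se + 5 + rng.normal(0, 5, size=20)
n = len(x_se)

result = stats.linregress(x_se, y_se)
y_hat = result.intercept + result.slope * x_se

# Residual variance: RSS divided by (n - 2), as two parameters are estimated
rss = np.sum((y_se - y_hat) ** 2)
residual_variance = rss / (n - 2)

# Standard error of the slope
se_slope = np.sqrt(residual_variance / np.sum((x_se - x_se.mean()) ** 2))

# Two-sided t-test of the null hypothesis "slope equals zero"
t_stat = result.slope / se_slope
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print("standard error:", se_slope, "vs linregress:", result.stderr)
print("p-value       :", p_manual, "vs linregress:", result.pvalue)
```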

## Fit using Statsmodels

For fitting a multivariate regression and getting a more detailed result, the package *statsmodels* is more suitable. The exemplary code for the same regression problem is shown below.

In [None]:
# Creating a pandas data frame to be processed by the package
df = pd.DataFrame({ 'x': x, 'y': y})

# Creating the result object carrying all the information
# Specifying the equation to be estimated
results = smf.ols('y ~ x', data=df).fit()

# Printing the result
print(results.summary())

As the result, we see the same coefficients for the slope and intercept, and some additional information, which, however, we do not discuss here in detail. For details, please see the documentation and the referenced literature at the end of this notebook.

# Polynomial Regression

We can easily generalize the fit of a linear function to other functions, e.g. quadratic functions or polynomials of a higher degree.
The idea remains the same, namely fitting the function closest to the data with the residual sum of squares as optimization target, but the function now has the following form, where $d$ is the polynomial degree:

$$Y = \beta_dX^d + \dots + \beta_0 $$

We now try to fit a polynomial function to the same data:

In [None]:
# Fitting a polynomial of the tenth degree with least squares
z = np.polyfit(x, y, 10)

# poly1d creates objects for handling polynomials
# It encodes the coefficients, or the roots if the parameter r is set to True
z_equation = np.poly1d(z)

# The following code prints the estimated equation
print('Resulting Equation: \n',z_equation)

In [None]:
# Plotting routine

# Define a vector of size 100 in the interval from 0 to 9
xp = np.linspace(0, 9, 100)

# Plot the original points and the estimated line as a polynomial of the input space
plt.plot(x, y, 'o',  xp, z_equation(xp), 'r-')
plt.ylabel("Value Y")
plt.xlabel("Value X")
plt.title('Polynomial Function')
plt.grid(False)
plt.show()
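Although the fitted curve is no longer a line, the model is still linear in its coefficients, so the same least squares machinery applies. The following sketch makes this explicit by building the matrix of powers of $x$ by hand and solving the least squares problem directly. A degree of 3 is an assumption chosen to keep the example numerically well behaved, and the data use an assumed seed.

```python
import numpy as np

# Data in the style of the example above (assumed generator and seed)
rng = np.random.default_rng(1)
x_poly = rng.uniform(0, 10, size=20)
y_poly = 3 * x_poly + 5 + rng.normal(0, 5, size=20)

degree = 3  # assumed degree for this illustration

# Each column holds one power of x: x^3, x^2, x^1, x^0
V = np.vander(x_poly, degree + 1)

# Ordinary least squares on the matrix of powers
coef_lstsq, *_ = np.linalg.lstsq(V, y_poly, rcond=None)

# np.polyfit solves the same problem and returns the same coefficient order
coef_polyfit = np.polyfit(x_poly, y_poly, degree)
print("lstsq  :", coef_lstsq)
print("polyfit:", coef_polyfit)
```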

## Bias-Variance Trade-Off

As shown above, there are multiple possibilities for fitting a model of varying complexity. The complexity of a model can be quantified by many metrics; one of the simplest, and usually a good choice, is the number of parameters a model uses. With respect to complexity, linear regression is then the simplest model considered here. The higher the degree of the polynomial, the higher the resulting complexity. Therefore, the question of the right model complexity arises. The answer is provided by the so-called **bias-variance trade-off**.

In this notebook, we discuss only the general idea of this concept. For more detailed information please refer to the book [Introduction to Statistical Learning](https://inside-docupedia.bosch.com/confluence/display/EAIAB/Introduction+to+Statistical+Learning), which explains this topic very well.

The following figure illustrates the trade-off. The **bias** describes the loss of information due to the lacking complexity of the model. An example would be fitting a line through quadratic data. So the simpler the model, the higher the bias.

The second component is the **variance**, which rises with rising complexity. Variance in this case refers to the variance of the estimated parameters, meaning that given a small change in the data, a strong change in the parameter estimate arises. As a result, every point in the data influences the fitted function. Consequently, the model does not generalize well, as can be seen in the right figure:

<img src="assets/img/bv.png">

Both components trade off against each other: reducing one of them typically leads to an increase of the other, which is shown in the next figure. The total error of a model is then the sum of both components plus an irreducible random error which is not discussed here. Overall, it is possible to find an optimum for each data set with respect to model complexity, which is in our case the degree of the polynomial. A more general description is the number of model parameters.

<img src="assets/img/to.png">
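The trade-off can be made tangible by comparing the error on the data used for fitting with the error on fresh data from the same source. The following is only a sketch: the quadratic data-generating function, the sample sizes and the degrees are assumptions chosen for illustration, and `sample` is a helper defined here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n points from an assumed quadratic relationship with noise."""
    x = rng.uniform(-3, 3, size=n)
    y = x**2 + 2 * x + rng.normal(0, 1, size=n)
    return x, y

x_train, y_train = sample(30)     # small sample used for fitting
x_test, y_test = sample(1000)     # fresh data to estimate generalization

errors = {}
for degree in (1, 2, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    errors[degree] = (mse_train, mse_test)
    print(f"degree {degree:2d}: train MSE {mse_train:.2f}, test MSE {mse_test:.2f}")
```

The training error can only decrease with growing degree, because each higher-degree model contains the lower-degree ones. The test error is large for the line (high bias) and typically rises again for very high degrees (high variance).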

## Residual Analysis - Checking the Linearity Assumption

Fitting a linear function might not be the best idea. As the following example will show, sometimes we need another solution. Now, we will generate some data using a quadratic function and see how we can assess the modeling result, not only for linear regression. Namely, we will do it by means of the residual analysis, which is a part of the general approach for **regression diagnostics**. 

Let us generate some data first:

In [None]:
# Setting the seed for reproducibility
np.random.seed(0)
# Generate 100 equally spaced points between -3 and 3 as input variable
x = np.linspace(-3, 3, 100)
# Generate random noise vector of the same length
nse = np.random.normal(size=100)
# Calculate the y value for each element of the x vector
y = x**2 + 2*x + nse 

plt.scatter(x, y) 
plt.ylabel("Value Y")
plt.xlabel("Value X")
plt.rcParams["figure.figsize"] = (4,3)
plt.grid(False)
plt.show()

Next, we fit a linear function through the data as done before, and visualize the result:

In [None]:
# Getting the resulting parameters
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

# Plotting the result leads to the following
plt.plot(x, y, 'o', label='Original data')
plt.plot(x, intercept + slope*(x), 'r', label='Fitted line')
plt.title("Linear Regression")
plt.xlabel("Value X")
plt.ylabel("Value Y")
plt.legend()
plt.grid(False)
plt.rcParams["figure.figsize"] = (4,3)
plt.show()

If we look at the plot, we immediately see that a line is a misfit in this case. Nevertheless, by evaluating the $R^2$ of the model, we get a result of 0.52, meaning that even a suboptimal model explains more than 50% of the variance of the output variable. Therefore, a high $R^2$ does not necessarily mean that the model is optimal for the data. So the question is how else we can assess the modeling result.

In [None]:
# Displaying the resulting estimates
print("intercept:", intercept)
print("slope:", slope)
print("r-squared:", r_value**2)

As seen in the plot above, fitting a line is obviously not optimal. Besides visual analysis, which immediately shows the resulting problem, we will get to know a very simple but a very powerful idea from **residual analysis**. The core idea is to plot the residuals, i.e. the distance to the regression line, but unsquared as it keeps the direction of the error.

The residuals are plotted against the predicted value of the output variable $Y$. Therefore, this approach can be used for multivariate cases, where visual plot assessment per dimension gets more difficult. There is a very simple rule: As long as there is a pattern in the residuals, something is wrong. If a person recognizes a system behind the errors, there should be a way to explain it with a mathematical model. The question is then how to "explain" these systematics, which a person sees in the residual analysis, to the model with the goal to incorporate this information. 

The package *seaborn* delivers a very simple utility to visualize the residuals. It takes the predicted value as the variable for the x-axis and the true value of the output variable as the y variable, regresses the latter on the former and plots the resulting residuals. As a result, we see a clear pattern in the sign (+/-) of the residuals. In the center, we see a systematic overestimation (negative residuals). At the sides, we see a systematic underestimation (positive residuals). The green line represents a trend line, obtained by a locally weighted regression (LOWESS), which fits a regression on a local neighborhood of each point.

In [None]:
# x and y are passed as keywords, which also works in recent seaborn versions
sns.residplot(x=intercept + slope*x, y=y, lowess=True, color="g")
plt.rcParams["figure.figsize"] = (4,3)
plt.title("Residual Analysis")
plt.xlabel("Predicted Value Y")
plt.ylabel("Residuals")
plt.show()
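The pattern visible in the plot can also be checked numerically. The following sketch computes the unsquared residuals of a linear fit by hand and averages them in the center and at the sides of the input range. The region boundaries (`|x| < 1` and `|x| > 2`) and the random generator are assumptions for this illustration, so the values differ slightly from the notebook cells.

```python
import numpy as np

# Quadratic data in the style of the example above (assumed generator and seed)
rng = np.random.default_rng(0)
x_res = np.linspace(-3, 3, 100)
y_res = x_res**2 + 2 * x_res + rng.normal(size=100)

# Linear fit and unsquared residuals (observed minus predicted)
slope_res, intercept_res = np.polyfit(x_res, y_res, 1)
residuals = y_res - (intercept_res + slope_res * x_res)

center = np.abs(x_res) < 1   # assumed definition of "center"
sides = np.abs(x_res) > 2    # assumed definition of "sides"

print("mean residual overall  :", residuals.mean())
print("mean residual in center:", residuals[center].mean())
print("mean residual at sides :", residuals[sides].mean())
```

The overall mean of the residuals is zero by construction of the least squares fit, so a pattern only becomes visible when the residuals are viewed by region: negative in the center (overestimation) and positive at the sides (underestimation).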

As the next step, we will raise the complexity of the model to reduce the bias (underfit of a too simple model). Therefore, we will fit a quadratic function.

In [None]:
# The following code fits a quadratic function
z = np.polyfit(x, y, 2)

z_equation = np.poly1d(z)
# Printing the equation
print('Resulting Equation: \n',z_equation)

# Defining the x-values for plotting the function line
xp = np.linspace(-3, 3, 100)

# Plotting the result leads to the following
plt.plot(x, y, 'o', label='Original data')
plt.plot( xp, z_equation(xp), 'r', label='Fitted line')
plt.title("Polynomial Regression")
plt.xlabel("Value X")
plt.ylabel("Value Y")
plt.legend()
plt.grid(False)
plt.rcParams["figure.figsize"] = (4,3)
plt.show()

In the plot we see that the fit is much better, even visually. In the next code cell, we reevaluate the residuals.

In [None]:
# By setting the parameter "order", the function automatically fits a second 
# degree polynomial and computes the corresponding residuals
# x and y are passed as keywords, which also works in recent seaborn versions
sns.residplot(x=intercept + slope*x, y=y, lowess=True, color="g", order=2)
plt.rcParams["figure.figsize"] = (4,3)
plt.title("Residual Analysis")
plt.xlabel("Predicted Value Y")
plt.ylabel("Residuals")
plt.show()

Now, the residual plot shows no pattern, which is the perfect picture of the residuals. The residuals are distributed randomly and the trend line does not expose any pattern.
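As a numeric complement to the plot, the following sketch compares the linear and the quadratic fit on data generated in the same style. The helper `r_squared`, the region boundaries and the random generator are assumptions for this illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variation of y explained by the predictions y_hat."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

# Quadratic data in the style of the example above (assumed generator and seed)
rng = np.random.default_rng(0)
x_cmp = np.linspace(-3, 3, 100)
y_cmp = x_cmp**2 + 2 * x_cmp + rng.normal(size=100)

center = np.abs(x_cmp) < 1   # assumed definition of "center"
r2 = {}
center_mean = {}
for degree in (1, 2):
    coefs = np.polyfit(x_cmp, y_cmp, degree)
    y_hat = np.polyval(coefs, x_cmp)
    r2[degree] = r_squared(y_cmp, y_hat)
    center_mean[degree] = (y_cmp - y_hat)[center].mean()
    print(f"degree {degree}: R^2 = {r2[degree]:.2f}, "
          f"mean residual in center = {center_mean[degree]:.2f}")
```

With the quadratic model, $R^2$ rises and the systematic offset of the residuals in the center largely disappears, which matches the visual impression.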

## Questions You Should Be Able to Answer Now

After you have carefully read and understood the contents of this notebook, you should be able to answer the following questions.

- Which optimization criterion does the least squares approach have? 
- Name some of the packages and functions that implement linear regression in Python.
- What is the interpretation of the following parameters?
    - Slope
    - Intercept 
    - R-Squared
- What is the difference between the residual sum of squares (RSS) and the total sum of squares (TSS)?
- What does the bias-variance trade-off state? How would you explain bias and variance with respect to model complexity?
- What is the main rule when performing residual analysis?

# Further Topics and Reading

This notebook should serve as an initial read for understanding the principle and basic application of the linear and polynomial regression in Python. The following list of related topics and keywords is a good way to proceed.

- Basic assumptions of linear regression
- Adjusted R-Squared
- Confidence and prediction intervals
- Regression diagnostics
- Feature selection
- Ridge and lasso regression
- Bayesian linear regression
- Logistic regression

As a general overview, we recommend the following literature:

- [Introduction to Statistical Learning](https://inside-docupedia.bosch.com/confluence/display/EAIAB/Introduction+to+Statistical+Learning)
- [Further resources on regression](https://inside-docupedia.bosch.com/confluence/display/EAIAB/Regression)

# Task for You to Perform on Your Own

We would like to provide you with a task to perform on your own in order to apply the knowledge presented before.
The data set from the [UCI repository](http://mlr.cs.umass.edu/ml/datasets.html) summarizes some parameters of different car types. The data set and the list of attributes can be found [here](http://mlr.cs.umass.edu/ml/datasets/Auto+MPG).
In the following, we will propose some tasks and reference some commands to perform these.

## Loading and Analyzing the Data

The following commands will load the data and perform an initial analysis.

In [17]:
# Loading the car data
# To read the file and to specify the delimiter parameter we do the following
df1 = pd.read_csv('input/car_data.csv', delimiter=",")

We propose to get to know the data with the listed commands.
To help you learn, we provide only the link to the respective function documentation. If you have trouble, you can do one of the following: 

- Write a post in our community [here](https://connect.bosch.com/communities/service/html/communitystart?communityUuid=210bdd9f-74c5-4831-a5b5-618ad489f103)
- The less preferred one: consult the solution code [here](assets/solution/regression_intro_solution.ipynb)

In [18]:
# Getting the dimensionality of the data with shape
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shape.html

# Your code here

In [None]:
# Getting first rows of the data 
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

# Your code here

In [None]:
# Getting descriptive statistics of the continuous variables
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html

# Your code here

In [None]:
# Checking if the data types are plausible
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html

# Your code here

Column "horsepower" has the type object, which is a sign that it contains mixed data types (e.g. placeholders for missing values). We will try casting it to numeric, as the values are obviously numbers (see *head* command and its result).

In [None]:
# Casting horsepower to numeric with the following function
# Use downcast and errors parameters with appropriate values
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html

# Your code here

In [None]:
# Check the result with an appropriate command

# Your code here

In [None]:
# Filter for missing values with dropna
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

# Your code here

In [None]:
# Check the result with an appropriate command
# How many rows had a missing value in horse power?

# Your code here

In [None]:
# For frequent numerical or categorical variables the function crosstab can be used, e.g. for cylinders
# https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html

# Your code here

In [None]:
# Getting a Box plot of a continuous variable, e.g. for the weight
# https://matplotlib.org/gallery/pyplots/boxplot_demo_pyplot.html#sphx-glr-gallery-pyplots-boxplot-demo-pyplot-py

# Your code here

In [None]:
# Getting a scatter plot of a variable pair, e.g. for the weight and miles per gallon (mpg)
# https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html

# Your code here

In [None]:
# Getting a pair plot for each variable pair or for the whole data set
# https://seaborn.pydata.org/generated/seaborn.pairplot.html

# Your code here

In [None]:
# Performing a regression using the stats-syntax
# Print and interpret the estimates
# Use weight as input variable and miles per gallon (mpg) as output variable
# See the example above or the documentation example at
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

# Your code here

In [None]:
# Plot the results using the example above or
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

# Your code here

In [None]:
# Performing a regression using the statsmodels-syntax
# Use weight as input variable and miles per gallon (mpg) as output variable
# See the example above or the documentation example (alternative syntax) at
# https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html

# Your code here

In [None]:
# Perform a residual analysis using the example above or 
# https://seaborn.pydata.org/examples/residplot.html

# Your code here

In [None]:
# Fit a polynomial model for the same variables using the example above or
# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.polyfit.html

# Your code here

In [None]:
# Perform a residual analysis using the example above or 
# https://seaborn.pydata.org/examples/residplot.html
# Use the order parameter to adjust the polynomial degree

# Your code here

Parts of this notebook are taken from the basic introductory part of our classroom training [Basics of Machine Learning](https://inside-docupedia.bosch.com/confluence/display/EAIAB/Basics+of+Machine+Learning). If you are interested in learning more, please visit the training website and contact us for further information using the envelope button at the top of your screen.

# Feel Free to Consult our [Training Bar](https://inside-docupedia.bosch.com/confluence/display/EAIAB/Training+Offers) for Further Topics and Sources

**Attention regarding external links**: You are forwarded to an external provider who is not related to us. Upon clicking on the link, we have no influence on the collecting, processing and use of personal data possibly transmitted by clicking on the link to the third party (such as the IP address or the URL of the site on which the link is located) as the conduct of third parties is naturally beyond our control. We do not assume responsibility for the processing of personal data by third parties