# 1D-Regression Analysis


#### Paul Vessel??


## Introduction

In this notebook we learn how to apply regression to simplified, synthetic data. 
Regression finds a relationship between one or more independent variables and one dependent variable. Here we focus on a single dependent variable. The result is a regression model, which can afterwards be used to predict new continuous numerical data. [More](https://en.wikipedia.org/wiki/Regression_analysis)


<div class ="alert alert-info">
    The term "regression" is used when predicting a numerical, continuous variable, while "classification" is the term for predicting a discrete variable. Although 'Logistic Regression' includes the term regression, it is actually an algorithm for classification.
</div>

Possible applications are:
- Gutenberg-Richter-Law
- Wadati-Diagram

This notebook is needed for the following lectures:
- Inversion
- ....



## Table of Contents
- [Linear Regression](#Linear_Regression)
- [Split in Train and Test data](#Split)
- [Extrapolation](#Extrapolation)
- [Polynomial Regression](#Polynomial_Regression)
- [Summary](#Summary)



### Learning a model, how?
In short:
A model is learned by minimizing an error function, e.g. the RMS (root-mean-square) error, between the forward prediction of the model and the dependent variable of the data. This is also known as a least-squares fit. The model with the lowest misfit is defined as the best model. Find out more: [Wiki](https://en.wikipedia.org/wiki/Least_squares) and/or [More](https://www.mathsisfun.com/data/least-squares-regression.html) 
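
For a straight line, the least-squares minimum can be written in closed form. The following is a minimal sketch of that idea, assuming a tiny hand-made data set (the values are illustrative and not part of the notebook's data):

```python
import numpy as num

# Small synthetic data set (values chosen for illustration only)
x = num.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = num.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Closed-form ordinary least-squares solution for y = m * x + n
x_mean, y_mean = x.mean(), y.mean()
m = num.sum((x - x_mean) * (y - y_mean)) / num.sum((x - x_mean) ** 2)
n = y_mean - m * x_mean

# RMS misfit between the model prediction and the data
rms = num.sqrt(num.mean((y - (m * x + n)) ** 2))

print('Slope    :', m)
print('Intercept:', n)
print('RMS error:', rms)
```

Any other choice of slope and intercept would give a larger RMS error on these five points.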



<a id='Linear_Regression'></a> 
# Linear Regression

Linear model:

x * m + n = y 

The parameters m and n are learned from the data.

We use the [LinearRegression()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) model from scikit-learn, which uses ordinary least squares as its error function. This method assumes that only the y-data carry an uncertainty. If both the x- and y-data are uncertain, orthogonal regression should be used instead.

[More](https://machinelearningmastery.com/linear-regression-for-machine-learning/)
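
Orthogonal regression is not part of LinearRegression(). As a rough sketch of the idea, the line can be obtained with plain numpy from a singular value decomposition of the centred data. This assumes the same noise level on both axes; the data, the noise levels and the variable names below are illustrative assumptions:

```python
import numpy as num

# Synthetic data with noise on BOTH axes (noise levels are illustrative assumptions)
num.random.seed(0)
x_true = num.linspace(-100, 100, 200)
x_obs = x_true + num.random.normal(0, 5, len(x_true))
y_obs = 2 * x_true + 10 + num.random.normal(0, 5, len(x_true))

# Centre the data and find the direction of largest variance via SVD
points = num.column_stack([x_obs - x_obs.mean(), y_obs - y_obs.mean()])
_, _, vt = num.linalg.svd(points, full_matrices=False)
direction = vt[0]

# The orthogonal-regression line follows that direction through the centroid
slope_tls = direction[1] / direction[0]
intercept_tls = y_obs.mean() - slope_tls * x_obs.mean()

# Ordinary least squares for comparison
slope_ols, intercept_ols = num.polyfit(x_obs, y_obs, 1)

print('Orthogonal slope/intercept:', slope_tls, intercept_tls)
print('Ordinary   slope/intercept:', slope_ols, intercept_ols)
```

If the uncertainties on x and y differ, the axes would have to be scaled accordingly before this step.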

In [None]:
# Load packages
import numpy as num
import matplotlib.pyplot as plt

First, we create synthetic data from a functional relation between x and y and add noise to it.

In [None]:
# Creating data
xdata = num.linspace(-100, 100, 200)

# Setting random seed to create same noise
num.random.seed(0)

# Creating gaussian noise 
noise = num.random.normal(0, 50, len(xdata))

# x-y-Function
ydata = 5 * xdata + noise

# Plotting data
plt.figure()
plt.scatter(xdata, ydata)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

Next, we load the regression model from sklearn.

In [None]:
# Importing Linear Regression Model from scikit-learn
from sklearn.linear_model import LinearRegression

# Setting the model
model = LinearRegression()

Before fitting, we need to reshape our input data into the form the model expects. Afterwards, we give the model our input (x) and output (y) data. Once the model is fitted, it can be used to calculate a quality score and the regression parameters.
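
scikit-learn models expect the input as a 2D array with one row per sample and one column per feature. A short sketch of what the reshape does, using a small illustrative array:

```python
import numpy as num

x = num.linspace(-100, 100, 5)
print(x.shape)            # (5,)

# -1 lets numpy infer the number of rows; 1 is the number of features
x_2d = x.reshape((-1, 1))
print(x_2d.shape)         # (5, 1)
```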

In [None]:
# Reshaping xdata to fit into model requirements 
xdata = xdata.reshape((-1, 1))

# Fitting the model to the data
model.fit(xdata, ydata)

# Retrieving the quality score and coefficients
r_sq = model.score(xdata, ydata)
print('Train-R² :', r_sq)
print('Intercept:', model.intercept_)
print('Slope    :', model.coef_)

The R²-score gives an idea of how well the model fits the data. A value of 1 is a perfect fit. 
The intercept and slope values should be close to the values of the function you defined for creating the synthetic data.
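
The R²-score compares the squared residuals of the model with the spread of the data around its mean. The following sketch recomputes it by hand and compares it with model.score(); the variable names are illustrative:

```python
import numpy as num
from sklearn.linear_model import LinearRegression

# Same kind of synthetic data as above
num.random.seed(0)
x = num.linspace(-100, 100, 200)
y = 5 * x + num.random.normal(0, 50, len(x))

model = LinearRegression().fit(x.reshape((-1, 1)), y)
y_model = model.predict(x.reshape((-1, 1)))

# R² = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = num.sum((y - y_model) ** 2)
ss_tot = num.sum((y - y.mean()) ** 2)
r_sq_manual = 1 - ss_res / ss_tot

print('Manual R² :', r_sq_manual)
print('sklearn R²:', model.score(x.reshape((-1, 1)), y))
```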

Finally we can also have a look at the model directly - the regression line. Visually speaking, a good model is obtained when the data points are close to the line.

In [None]:
# Creating a regression line and plotting it
## There are two ways to do the prediction. One is to calculate it manually with the known relation. 
## sklearn also provides a .predict() method, which is handy, especially for more complex models.
yregr = model.coef_ * xdata + model.intercept_
#yregr = model.predict(xdata)

plt.figure()
plt.scatter(xdata, ydata, label='Data')
plt.plot(xdata, yregr, color='red', label='Regression')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

### Tasks
- create different linear functions
- play with different data densities and see how the linear regression behaves



<a id='Split'></a> 
# Split in train and test datasets

Good practice in the machine learning community is to split the data into at least a training set and a test set. The model is learned from the training data, while the test data is used afterwards to check how well the learned model performs on 'new', unseen data. If the score/fit on the test data is sufficient, the model is said to generalize well.

There are multiple ways of splitting the data:
- randomly select data points
- with a specific scheme: e.g. time based (e.g. last/first) points or points with a specific characteristic 
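
Both schemes boil down to choosing index sets. A minimal sketch with plain numpy, assuming a hypothetical test-set size of 6 out of 20 points:

```python
import numpy as num

x = num.linspace(-100, 100, 20)
n_test = 6  # hypothetical test-set size

# Scheme 1: random selection via shuffled indices
rng = num.random.default_rng(42)
shuffled = rng.permutation(len(x))
test_idx_random = num.sort(shuffled[:n_test])
train_idx_random = num.sort(shuffled[n_test:])

# Scheme 2: position based, the last points form the test set
train_idx_last = num.arange(len(x) - n_test)
test_idx_last = num.arange(len(x) - n_test, len(x))

print('Random test x     :', x[test_idx_random])
print('Last-points test x:', x[test_idx_last])
```

In the random scheme the test points lie between the training points, while in the position-based scheme they all lie beyond the training range.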

First we test the random selection. sklearn provides a helper function that creates such subsets of our data.

In [None]:
# Creating data
xdata = num.linspace(-100, 100, 20)
num.random.seed(0)
noise = num.random.normal(0, 50, len(xdata))

# x-y-Function
ydata = 5 * xdata + noise

# Split data randomly into 67% train and 33% test set
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(xdata, ydata, test_size=0.33, random_state=42)

# Reshaping
xtrain = xtrain.reshape((-1, 1))
xtest = xtest.reshape((-1, 1))

from sklearn.linear_model import LinearRegression

# Fitting
model = LinearRegression()
model.fit(xtrain, ytrain)
print('Train-Intercept:', model.intercept_)
print('Train-Slope    :', model.coef_)

# Predicting train data
yregr = model.predict(xtrain)
train_rsq = model.score(xtrain, ytrain)
print('Train-R²       :', train_rsq)

# Predicting test data
ypred = model.predict(xtest)
test_rsq = model.score(xtest, ytest)
print('Test-R²        :', test_rsq)

plt.figure()
plt.scatter(xtrain, ytrain, label='Train')
plt.scatter(xtest, ytest, label='Test')
plt.plot(xtrain, yregr, color='red', label='Regression')
plt.plot(xtest, ypred, color='red', linestyle=':', label='Prediction test')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

We see that the regression line is close to both the train and the test data points, and the R² values are close to 1.

### Tasks
- play around with the data density
- change the train-test split sizes   

<a id='Extrapolation'></a> 
# Extrapolation
Extrapolation means predicting values that lie outside the range of the training set. Usually this is not a good idea, as the model was not trained for this range. But there are cases where extrapolation is the main goal of the model, for example predicting future data in weather forecasting.

If we now change the split method to first/last, we simulate this extrapolation.

In [None]:
# Creating data
xdata = num.linspace(-100, 100, 20)
num.random.seed(0)
noise = num.random.normal(0, 50, len(xdata))
ydata = 5 * xdata + noise

## Splitting by indices
## Selecting the first half of the data for training and the second half for testing
index = int(len(xdata)/2)
xtrain = xdata[:index]
ytrain = ydata[:index]

xtest = xdata[index:]
ytest = ydata[index:]

xtrain = xtrain.reshape((-1, 1))
xtest = xtest.reshape((-1, 1))


# Fitting
model = LinearRegression()
model.fit(xtrain, ytrain)

# Predicting
yregr = model.predict(xtrain)
ypred = model.predict(xtest)
train_rsq = model.score(xtrain, ytrain)
test_rsq = model.score(xtest, ytest)
print('Train-Intercept:', model.intercept_)
print('Train-Slope    :', model.coef_)
print('Train-R²       :', train_rsq)
print('Test-R²        :', test_rsq)

# Plotting
plt.figure()
plt.scatter(xtrain, ytrain, label='Train')
plt.scatter(xtest, ytest, label='Test')
plt.plot(xtrain, yregr, color='red', label='Regression')
plt.plot(xtest, ypred, color='red', linestyle=':', label='Prediction outside of training')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

We now see that the regression line learned on the training data does not fit the test data as well as before.  
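
One reason is that a small error in the learned slope grows with the distance from the training data. The following sketch repeats the fit for many noise realisations and compares the spread of the prediction inside and outside the training range. The number of repetitions and the two evaluation points are illustrative assumptions:

```python
import numpy as num

x = num.linspace(-100, 100, 20)
x_train = x[:10]            # training range: -100 ... about -5
x_inside = x_train.mean()   # centre of the training range
x_outside = 100.0           # far outside the training range

rng = num.random.default_rng(0)
pred_inside, pred_outside = [], []
for _ in range(500):
    # New noise realisation for every repetition
    y_train = 5 * x_train + rng.normal(0, 50, len(x_train))
    m, n = num.polyfit(x_train, y_train, 1)
    pred_inside.append(m * x_inside + n)
    pred_outside.append(m * x_outside + n)

print('Spread of prediction inside :', num.std(pred_inside))
print('Spread of prediction outside:', num.std(pred_outside))
```

The spread outside the training range should come out several times larger, even though the model form (a straight line) is correct.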

### Tasks
- play around with the data density
- change the train-test split sizes   


<a id='Polynomial_Regression'></a> 
# Polynomial Regression
Polynomial regression can be treated as a special case of multiple linear regression.

Often the relation between the two variables is not linear but polynomial, e.g. quadratic or cubic.

For that general case the linear formula:

x * m + n = y 

is expanded to include higher order terms (here 3rd order):

a * x^3 + b * x^2 + c * x + d = y

All the parameters a, b, c, d, ... will be trained.
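
The model is still linear in its parameters, so it can be solved with linear least squares once the powers of x are treated as separate input columns. A minimal sketch with noise-free data and illustrative coefficient values:

```python
import numpy as num

# Noise-free cubic data with known coefficients (illustrative values)
x = num.linspace(-2, 2, 9)
a, b, c, d = 0.5, -1.0, 2.0, 3.0
y = a * x**3 + b * x**2 + c * x + d

# Design matrix with the columns x^3, x^2, x, 1
design = num.vander(x, 4)

# Linear least squares recovers the coefficients
coeffs, *_ = num.linalg.lstsq(design, y, rcond=None)
print('Recovered a, b, c, d:', coeffs)
```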


First we look at an example where the linear regression fails:

In [None]:
# Creating data
xdata = num.linspace(0, 100, 20)
num.random.seed(0)
noise = num.random.normal(0, 200, len(xdata))
ydata = xdata**2 + noise

## Splitting randomly
# xtrain, xtest, ytrain, ytest = train_test_split(xdata, ydata, test_size=0.33, random_state=42)

## Splitting by indices
index = int(len(xdata)/2)
xtrain = xdata[:index]
ytrain = ydata[:index]

xtest = xdata[index:]
ytest = ydata[index:]

# Reshaping
xtrain = xtrain.reshape((-1, 1))
xtest = xtest.reshape((-1, 1))


# Fitting
model = LinearRegression()
model.fit(xtrain, ytrain)

# Predicting
yregr = model.predict(xtrain)
ypred = model.predict(xtest)
train_rsq = model.score(xtrain, ytrain)
test_rsq = model.score(xtest, ytest)
print('Train-Intercept:', model.intercept_)
print('Train-Slope    :', model.coef_)
print('Train-R²       :', train_rsq)
print('Test-R²        :', test_rsq)

# Plotting
plt.figure()
plt.scatter(xtrain, ytrain, label='Train')
plt.scatter(xtest, ytest, label='Test')
plt.plot(xtrain, yregr, color='red', label='Regression')
plt.plot(xtest, ypred, color='red', linestyle=':', label='Prediction test')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

The training data may be fitted reasonably well, but the test data shows completely different behaviour.

To fit this data, additional parameters are needed, e.g. polynomial regression. Without going into too much detail: sklearn has no dedicated polynomial model, but two components can be combined to build one. First a polynomial feature transformer (PolynomialFeatures), followed by the known linear regression model. 

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Combining the linear model with a polynomial feature transformer in sklearn
model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('linear', LinearRegression(fit_intercept=False))])

# Fitting
model.fit(xtrain, ytrain)

# Predicting
yregr = model.predict(xtrain)
ypred = model.predict(xtest)
train_rsq = model.score(xtrain, ytrain)
test_rsq = model.score(xtest, ytest)
print('Train-R²       :', train_rsq)
print('Test-R²        :', test_rsq)

# Plotting
plt.figure()
plt.scatter(xtrain, ytrain, label='Train')
plt.scatter(xtest, ytest, label='Test')
plt.plot(xtrain, yregr, color='red', label='Regression')
plt.plot(xtest, ypred, color='red', linestyle=':', label='Prediction test')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

Now the model also fits the test data well.
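
To see what the pipeline does internally, the sketch below prints the expanded features for two sample values and reads the learned coefficients from the linear step. The noise-free data and its coefficients are illustrative assumptions:

```python
import numpy as num
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# What the feature expansion produces for two sample values
features = PolynomialFeatures(degree=2).fit_transform(num.array([[2.0], [3.0]]))
print(features)   # columns: 1, x, x^2

# Noise-free quadratic data (illustrative) to check the learned coefficients
x = num.linspace(0, 10, 20).reshape((-1, 1))
y = 3 + 2 * x.ravel() + x.ravel() ** 2

model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(x, y)

# Coefficients in the order of the feature columns: 1, x, x^2
print(model.named_steps['linear'].coef_)
```

Because the expanded features already contain a constant column of ones, the intercept is learned as the first coefficient, which is why fit_intercept=False is used.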

### With numpy polynomial

NumPy can also be used for the regression. The syntax differs slightly, e.g. no reshaping is needed.

In [None]:
# Creating data
xdata = num.linspace(0, 100, 20)
num.random.seed(0)
noise = num.random.normal(0, 50, len(xdata))
ydata = xdata**2 + noise
#ydata = xdata * 1 + noise
#ydata = num.sin(xdata * 2 * num.pi * 0.01) + xdata * 0.01


## Splitting randomly
# xtrain, xtest, ytrain, ytest = train_test_split(xdata, ydata, test_size=0.33, random_state=42, shuffle=True)

## Splitting by indices
index = int(len(xdata)/2)
xtrain = xdata[:index]
ytrain = ydata[:index]

xtest = xdata[index:]
ytest = ydata[index:]


# Fitting
order = 2
coeff, residuals, rank, singular_values, rcond = num.polyfit(xtrain, ytrain, order, full=True)
print('Coeffs (highest order first):', coeff)
print('Sum of squared residuals    :', residuals)

model = num.poly1d(coeff)
yregr = model(xtrain)
ypred = model(xtest)

# Plotting
plt.figure()
plt.scatter(xtrain, ytrain, label='Train')
plt.scatter(xtest, ytest, label='Test')
plt.plot(xtrain, yregr, color='red', label='Regression')
plt.plot(xtest, ypred, color='red', linestyle=':', label='Prediction outside of training')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
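
numpy also offers a newer polynomial interface in numpy.polynomial. As a sketch, the same kind of fit with Polynomial.fit looks like this; note that the coefficient order is reversed compared to polyfit:

```python
import numpy as num
from numpy.polynomial import Polynomial

x = num.linspace(0, 100, 20)
num.random.seed(0)
y = x**2 + num.random.normal(0, 50, len(x))

# Fit a 2nd-order polynomial; fit() works on a scaled domain internally
poly = Polynomial.fit(x, y, deg=2)

# convert() maps back to the original x; coefficients are ordered from low to high degree
coef_low_to_high = poly.convert().coef

# polyfit returns the highest degree first
coef_high_to_low = num.polyfit(x, y, 2)

print('Polynomial.fit:', coef_low_to_high)
print('polyfit       :', coef_high_to_low)
```

The fitted object can be called directly, e.g. poly(50.0), to evaluate the model.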


### Tasks
- vary the number of data points
- try different polynomial orders
- try different functions
- provoke overfitting
- provoke underfitting
- predict outside of the training range


<a id='Summary'></a> 
# Summary

We have learned
- the basic ideas behind regression for linear and polynomial data
- that regression investigates the relationship between (at least) two variables
- that learned models can be used to predict new data or simply to represent existing data
- that it is better to split the data into at least a training and a test set before training
- that extrapolation should be avoided, unless the goal of the model is to predict future or extreme data