# Basic linear regression

In this example we generate our own data using a linear function.
Then we use `LinearRegression` from `sklearn` to create a model that predicts the result.
For more details see: [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [None]:
!pip install numpy pandas scikit-learn matplotlib joblib

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from joblib import dump
from sklearn.model_selection import train_test_split

### Generate the data

You can set any size here and any noise level. This function returns a Pandas DataFrame.

In [None]:
def generate_data_with_noise(size, noise_level):
    # x is 0, 1, ..., size-1
    x = np.arange(size)
    # Uniform noise in [-noise_level/2, noise_level/2)
    noise = noise_level * (np.random.rand(size) - 0.5)
    # The underlying linear function is y = x
    y = x + noise
    return pd.DataFrame({"x": x, "y": y})


In [None]:
size = 10
df = generate_data_with_noise(size, 1)

In [None]:
df

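Fix the seed so we can repeat the same example with the same random data. The seed only affects random numbers drawn after this call, so run this cell before generating the data.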

In [None]:
np.random.seed(42)
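
As a small illustration of what fixing the seed does, the sketch below seeds NumPy's global generator twice with the same value and draws from it each time. The variable names `first` and `second` are only for this illustration.

```python
import numpy as np

# Seeding the global generator makes np.random.rand reproducible.
np.random.seed(42)
first = np.random.rand(3)

# Resetting to the same seed restarts the same sequence.
np.random.seed(42)
second = np.random.rand(3)

print(np.array_equal(first, second))
```

Both draws contain the same numbers, which is why the data generated after seeding is the same on every run.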

First we'll use the data as created by the linear function; later we'll add a bit of noise to see how that impacts the results.

### Show the data

Let's plot the data to get a feeling for what it looks like.

The "features" or the "independent variables" are usually stored in a variable called `X` (capital letter).
In most cases there are many features and thus `X` is a matrix, but in our simplified case we only have one feature.
scikit-learn still expects a two-dimensional input, so before training we select the column with double brackets (`df[["x"]]`) to get a single-column DataFrame instead of a Series.
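
The difference between the two ways of selecting the column can be seen from the shapes. This is a minimal sketch on a tiny stand-in DataFrame; the names `series_x` and `frame_x` are only for this illustration.

```python
import numpy as np
import pandas as pd

# A small stand-in for the generated data.
df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) + 0.1})

series_x = df["x"]     # single brackets: one-dimensional Series
frame_x = df[["x"]]    # double brackets: two-dimensional DataFrame

print(series_x.shape)  # (5,)
print(frame_x.shape)   # (5, 1)
```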

In [None]:
X = df["x"]
X

We usually focus on a single "result" or "dependent variable" that can be represented as a vector. The values are usually stored in a variable called `y`.

In [None]:
y = df["y"]
y

Use `df.plot` to show the two data series.

In [None]:
df.plot();

In [None]:
df.plot.scatter(x='x', y='y', c='Blue');

Using Matplotlib directly, it is easier to add a line. Here we add the `y = x` line that was used to generate the data.

In [None]:
plt.scatter(df["x"], df["y"], s=20);
plt.plot([0, size], [0, size], color="red");

In [None]:
X = df[["x"]]
X

In [None]:
y = df["y"]
y

### Train the model

We are looking for a linear function `y = ax + b` for which the errors are smallest. The error is the distance between the calculated `y` value and the actual `y` value. There are many ways to measure and aggregate this error; the chosen measure is usually called the "cost function".

For example, the error can be the absolute distance between the calculated value and the actual value, but there are more complex cost functions. `LinearRegression` uses ordinary least squares, which minimizes the sum of the squared distances.
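
To make the idea of a cost function concrete, here is a minimal sketch that compares two candidate lines on a few hand-written points using the mean absolute error and the mean squared error. The data values and the helper names `line`, `mae` and `mse` are made up for this illustration and are not part of the notebook.

```python
import numpy as np

# A few hand-written points that lie close to y = x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y_actual = np.array([0.1, 0.9, 2.2, 2.8])

def line(a, b, x):
    # Candidate linear function y = a*x + b
    return a * x + b

def mae(y_true, y_pred):
    # Mean of the absolute distances
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # Mean of the squared distances
    return np.mean((y_true - y_pred) ** 2)

good = line(1.0, 0.0, x)  # candidate close to the data
bad = line(2.0, 0.0, x)   # candidate that is too steep

print("good:", mae(y_actual, good), mse(y_actual, good))
print("bad: ", mae(y_actual, bad), mse(y_actual, bad))
```

Whichever measure is used, the line that follows the data more closely gets the lower cost. Training amounts to finding the `a` and `b` with the lowest cost.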

First we need to decide which algorithm to use. This is usually an "educated guess" about which algorithm might best describe the relationship between our values.

In this example we know we need "Linear Regression".

We will create an object to hold our model after splitting the data into training and test sets.

### Split the dataset with `train_test_split`

* [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
* [Stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling)
* [RealPython - Split Your Dataset With scikit-learn's train_test_split()](https://realpython.com/train-test-split-python-data/)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=4)

In [None]:
print(len(y_train), len(y_test))
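
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below sets `test_size` explicitly instead, on a small stand-in dataset that is not the one generated above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with 10 rows.
X = pd.DataFrame({"x": np.arange(10)})
y = X["x"] + 0.5

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

print(len(x_train), len(x_test))
```

The training and test sets never share a row, so the test score shows how the model behaves on data it has not seen.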

In [None]:
model = LinearRegression()

In [None]:
print(model)

Then we "train" the model by calling the `fit` method.

In [None]:
model.fit(x_train, y_train)

### Evaluate the model

`intercept_` is where the line crosses the `y` axis. (It is `b` in the above expression.)
This is approximately 0 in our case.

In [None]:
model.intercept_

`coef_` is how steep the line is. (It is `a` in the above expression.) It is an array with one value per feature, so here it holds a single value, close to 1.

In [None]:
model.coef_
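
To see how `coef_` and `intercept_` relate to `y = ax + b`, the following sketch fits a model on noise-free stand-in data with a known slope and intercept. It then compares a hand-computed value with the result of `predict`. The data and the names `a`, `b`, `manual` and `predicted` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Noise-free stand-in data on the line y = 2x + 1.
X = pd.DataFrame({"x": np.arange(10)})
y = 2.0 * X["x"] + 1.0

model = LinearRegression().fit(X, y)

a = model.coef_[0]    # slope for the single feature
b = model.intercept_  # where the line crosses the y axis

manual = a * 10 + b   # apply y = ax + b by hand
predicted = model.predict(pd.DataFrame({"x": [10]}))[0]

print(a, b, manual, predicted)
```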

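The `score` method returns the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) (R²). The best possible value is 1.0, which means the model reproduces the data exactly.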

In [None]:
print('train coefficient of determination:', model.score(x_train, y_train))
print('test coefficient of determination:', model.score(x_test, y_test))
print('coefficient of determination:', model.score(X, y))
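
The coefficient of determination can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This sketch uses its own small noisy dataset and a local random generator, both made up for this illustration, and checks the result against `model.score`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in noisy data around y = x, using a local generator.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x": np.arange(20)})
y = X["x"] + rng.random(20) - 0.5

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```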

Once we have a model, we can use the `predict` method to calculate the expected value for a given input.

In [None]:
#model.predict([[10]]);
model.predict(pd.DataFrame({'x': [10]}))

We can also show the fitted line on a graph, together with the data points.

In [None]:
x1, x2 = min(df["x"]), max(df["x"]) # 0, size-1
y1, y2 = model.predict(pd.DataFrame({'x': [x1, x2]}))
plt.plot([x1, x2], [y1, y2], color="red");
plt.scatter(df["x"], df["y"]);


### Save the model

Save the model to a file using some kind of serialization. `joblib` is frequently used for this.

In [None]:
dump(model, 'linear.joblib')

Run the script `basic_linear_regression_predict.py` on the command line to see how to use the model.
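
The contents of that script are not shown here. As a rough sketch of how a saved model might be loaded and used, the example below trains a tiny stand-in model, saves it with `dump`, and restores it with `joblib.load`. The temporary file path and the variable names are assumptions chosen so the sketch does not overwrite `linear.joblib`.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Train a tiny stand-in model on the noise-free line y = x.
X = pd.DataFrame({"x": np.arange(10)})
y = X["x"].astype(float)
model = LinearRegression().fit(X, y)

# Save to a temporary location (hypothetical path, only for this sketch).
path = os.path.join(tempfile.mkdtemp(), "linear_sketch.joblib")
dump(model, path)

# Load the model back and use it like the original object.
restored = load(path)
prediction = restored.predict(pd.DataFrame({"x": [10]}))
print(prediction)
```

Only load joblib files from sources you trust, since loading can execute arbitrary code.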