# Regression
In this section, we will discuss the basics of fitting a linear model, using the diabetes dataset that ships with `sklearn` as an example.

### Diabetes dataset 

Ten baseline features (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
```
  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:
```

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a
**two-dimensional array or matrix**. 
The size of the array is expected to be `n_samples x n_features`.

- **n_samples:**   The number of samples: each sample is an item to process (e.g. classify).
  A sample can be a document, a picture, a sound, a video, an astronomical object,
  a row in database or CSV file,
  or whatever you can describe with a fixed set of quantitative traits.
- **n_features:**  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner.  Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.<br><br>

The number of features must be fixed in advance. However, the feature space can be very high-dimensional
(e.g. millions of features), with most features being zero for a given sample.

If there are labels or targets, they need to be stored in **one-dimensional arrays or lists**.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn import datasets

In [None]:
dataset = datasets.load_diabetes() # load data

In [None]:
dataset.data.shape # feature matrix shape

In [None]:
dataset.target.shape # target vector shape
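The two shapes above can be tied back to the attribute list and to the `n_samples x n_features` layout. The following is a minimal sketch, assuming the object returned by `load_diabetes()` exposes a `feature_names` attribute, as it does in current scikit-learn releases:

```python
from sklearn import datasets

dataset = datasets.load_diabetes()

# Names of the feature columns, in column order
print(dataset.feature_names)

# Each row of `data` is one patient; each column is one feature
n_samples, n_features = dataset.data.shape
print(n_samples, n_features)

# The target is one-dimensional, with one value per patient
print(dataset.target.ndim, dataset.target.shape[0] == n_samples)
```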

## Linear Regression Model

Linear Regression assumes the following model: 
 
 $y = X\beta + c + \epsilon$
 
 $X$: data (feature matrix) <br />
 $\beta$: coefficients <br />
 $c$: intercept <br />
 $\epsilon$: error that cannot be explained by the model <br />
 $y$: target <br />
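To make the model concrete, the sketch below generates synthetic data from $y = X\beta + c + \epsilon$ and recovers $\beta$ and $c$ by ordinary least squares, the criterion that `LinearRegression` minimizes. The values of `beta_true` and `c_true`, the sample size, and the noise level are hypothetical and chosen only for illustration; `np.linalg.lstsq` stands in here for the scikit-learn estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth, chosen only for illustration
beta_true = np.array([2.0, -1.0, 0.5])
c_true = 4.0

X = rng.normal(size=(200, 3))            # 200 samples, 3 features
eps = rng.normal(scale=0.1, size=200)    # small random error
y = X @ beta_true + c_true + eps

# Append a column of ones so the intercept is estimated with the coefficients
X1 = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta_hat, c_hat = solution[:-1], solution[-1]

print(beta_hat, c_hat)
```

Because the error term is small relative to the signal, the estimates should land close to the chosen values, but they will not match exactly.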

Now that we have our features and target, the next step is to split the data into training and test sets. We'll do this with scikit-learn's built-in `train_test_split()` function. The script below assigns 80% of the data to the training set and 20% to the test set; the `test_size` parameter specifies the proportion of the test set:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)
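By default the split is shuffled randomly, so the rows in each set change from run to run. As a hedged sketch, passing `random_state` (the seed value `0` is an arbitrary choice) makes the split reproducible, and printing the shapes confirms the proportions:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_diabetes()

# random_state fixes the shuffle so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0
)

# Rows are divided between the two sets; the number of columns is unchanged
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```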

### Train algorithm (Scikit-learn 4-Step Modeling Pattern)

We have split our data into training and testing sets, and now is finally the time to train our algorithm.

**Step 1.** Import the model you want to use.

In scikit-learn, all machine learning models are implemented as Python classes.

In [None]:
from sklearn.linear_model import LinearRegression

**Step 2.** Make an instance of the Model and define parameters (optional)

In [None]:
# all parameters set to their defaults
model = LinearRegression()

**Step 3.** Train the model on the data, storing the information learned from the data.

The model learns the relationship between the features (`X_train`) and the targets (`y_train`).

In [None]:
model.fit(X_train, y_train)

With scikit-learn it is straightforward to implement linear regression models: all you need to do is import the `LinearRegression` class, instantiate it, and call the `fit()` method with the training data.

In the theory section we said that a linear regression model finds the best values for the intercept and the slope, which results in a line that best fits the data. With several features there is one slope (coefficient) per feature, so the fit is a hyperplane instead of a line. To see the intercept and coefficients calculated by the linear regression algorithm for our dataset, execute the following code:

#### model coefficients $\beta$
Estimated coefficients for the linear regression problem. This is a 1D array of length n_features.

In [None]:
model.coef_

#### model intercept
Independent term c in the linear model.

In [None]:
model.intercept_
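These two attributes are all the fitted model needs in order to produce a prediction. The sketch below applies $y = X\beta + c$ by hand and compares the result with the `predict()` method used in Step 4. It refits the model so that the block is self-contained, and `random_state=0` is an arbitrary choice made for reproducibility.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dataset = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Apply y = X beta + c by hand using the fitted attributes
manual = X_test @ model.coef_ + model.intercept_

# The manual result should agree with predict() up to floating-point error
matches = np.allclose(manual, model.predict(X_test))
print(matches)
```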

**Step 4.** Predict targets for new data

Now that we have trained our algorithm, it's time to make some predictions. To do so, we will use our test data and see how accurately the model predicts disease progression. The model uses the information it learned during training. To make predictions on the test data, execute the following script:

In [None]:
model.predict(X_test) # predict on unseen data

`model.predict(X_test)` returns a NumPy array containing one predicted value for each row of `X_test`.
To see how the predictions differ from the actual values, you can plot one against the other:

In [None]:
# plot predicted values against actual values
y_pred = model.predict(X_test)
plt.plot(y_test, y_pred, '.')

# plot the line y = x; with perfect predictions all dots would fall on this line
x = np.linspace(0, 330, 100)
y = x
plt.plot(x, y)
plt.xlabel('actual')
plt.ylabel('predicted')
plt.show()
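The scatter plot gives a visual impression of the fit; a numeric summary can complement it. The following is a minimal sketch, assuming the mean squared error and the coefficient of determination $R^2$ from `sklearn.metrics` are acceptable measures here. The model is refit so the block runs on its own, and `random_state=0` is an arbitrary choice.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

dataset = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Mean squared error: average squared gap between actual and predicted values
mse = mean_squared_error(y_test, y_pred)
# Root of the MSE, expressed in the same units as the target
rmse = mse ** 0.5
# R^2: 1.0 is a perfect fit, 0.0 is no better than predicting the mean
r2 = r2_score(y_test, y_pred)

print(mse, rmse, r2)
```

For regressors, `model.score(X_test, y_test)` returns the same $R^2$ value without computing `y_pred` explicitly.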