# Modelling and statistics used

Throughout this week you'll be using various models and metrics available in both keras and sklearn.

To get used to some of the most common ones we'll be using, have a look through the following:
1. Training, validation and test sets
2. Cross Validation
3. Model selection, fitting and predicting
4. Mean squared error, R-squared

    
## Data 

To get you to grips with this, we'll use a dataset readily available in [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes).

In [None]:
from sklearn import datasets

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

Try and make sense of the data:

1. What is diabetes["data"] referring to?
2. What is the diabetes["target"]?
3. What are the different features?

In [None]:
#work here 


To make this much easier to visualise, we can put these into a pandas dataframe.

In [None]:
import pandas as pd

In [None]:
data = pd.DataFrame(diabetes["data"]) #turn array into a pandas dataframe
data.columns = diabetes["feature_names"] #save the column names
data["Y"] = diabetes["target"] #save the target variable
data.head()

Why are there negative values? It may help to see the DESCR from diabetes. 

How many entries are there?

In [None]:
entries = None #replace None with the correct code
print("There are %i entries." %entries) 

## Training and test sets

When modelling, we need to split our data into sections. 

* A <b>training</b> set, on which the model is trained.
* A <b>validation</b> set, against which the model can be optimised. Once this is done, the validation set can be reincorporated into the training set.
* A <b>test</b> set, completely excluded from training and optimisation. The performance of the model on this set gives an indication of real-world performance.
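
One way to obtain all three sets is to call train_test_split twice. The following is only a sketch on made-up toy data; the 60/20/20 proportions and the variable names are assumptions chosen for illustration, not part of this exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (illustrative only)
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then hold out 25% of the remainder (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Splitting off the test set first means it never influences any later choice about the model.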

sklearn has a nice function available to allow us to [split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) our data into train and test sets.



In [None]:
from sklearn.model_selection import train_test_split

Using help() or the documentation page linked above, split the 'data' dataset into 2/3 training and 1/3 test.

This split is performed randomly, so unless you set a seed, it will be different each time.
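
As a small illustration of seeding, here is a sketch on a made-up array (the array, the 25% test size and the seed value 42 are arbitrary assumptions). Passing the same random_state twice should give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

values = np.arange(12)  # toy data, illustrative only

# The same random_state reproduces the same shuffle, and so the same split
train_a, test_a = train_test_split(values, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(values, test_size=0.25, random_state=42)

print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
print(len(train_a), len(test_a))
```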

These are saved to two new variables, which we'll call training and test.

In [None]:
training, test = train_test_split(None) #replace None with the appropriate arguments.

In [None]:
print("Training size = %i" %len(training))
print("Test size = %i" %len(test))

## Modelling

Let's have a go at creating a model for diabetes using just the 'bmi' feature.

To start simply, we'll go with an ordinary least squares (OLS) [model](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression).

Perform the following:

    from sklearn.linear_model import LinearRegression
    model = LinearRegression()

In [None]:
#work here
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [None]:
x = None #replace None with the correct code
y = None #replace None with the correct code

Note, if we wanted to use multiple features, we can just supply them as 'x':
    
    x = training[["age","sex","bmi"]]

This would be <b>multiple linear regression</b>.

We can train a model using:

    model.fit(x,y)

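But this will raise an error. sklearn expects x to be two-dimensional (one row per sample, one column per feature), whereas a single feature selected on its own is one-dimensional, so it cannot tell whether it is looking at many samples of one feature or one sample with many features (give it a go and look at the error, then look at 'x'). If we were performing multiple linear regression, x would already be two-dimensional and this problem would not arise.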
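
To see the difference in shape, here is a minimal sketch using a made-up three-row frame standing in for 'training' (the values and the name `toy` are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({"age": [0.1, 0.2, 0.3], "bmi": [0.05, -0.02, 0.01]})

single = toy["bmi"].values       # one feature on its own: 1-D array
column = single.reshape(-1, 1)   # reshaped to 2-D: 3 samples, 1 feature
several = toy[["age", "bmi"]]    # several features: already 2-D

print(single.shape, column.shape, several.shape)
```

The -1 in reshape(-1, 1) asks NumPy to work out the number of rows itself, so the same call works for any number of samples.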

In [None]:
model.fit(x,y)

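We therefore need to reshape x into a single column, using .reshape(-1,1). This is a NumPy array method, so if x is a pandas Series, take its .values first.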

In [None]:
model.fit(x.reshape(-1,1),y)

As linear regression is the process of fitting a straight line: $$ Y = mX + C $$

We can return the 'm' (slope, or coefficient) and 'c' (intercept, or bias) from the trained model.

    model.coef_
    model.intercept_
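
As a quick check of what these attributes hold, the sketch below fits a model to made-up, noise-free points lying exactly on Y = 2X + 1 (the data and the name `toy_model` are assumptions for illustration). The fitted values should come out very close to 2 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data lying exactly on Y = 2X + 1
X = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
Y = 2 * X.ravel() + 1

toy_model = LinearRegression().fit(X, Y)

slope = toy_model.coef_[0]         # 'm': coef_ holds one value per feature
intercept = toy_model.intercept_   # 'c': a single number
print("Y = %0.3fX + %0.3f" % (slope, intercept))
```

Note that coef_ is an array even with one feature, which is why it is indexed with [0] before formatting.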

In [None]:
# coef_ is the slope (one value per feature), intercept_ is the constant term
print("Y = %0.3fX + %0.3f" %(model.coef_[0], model.intercept_))

Let's have a look at how well our model has been trained.

We can compare its performance when looking at data it's already seen to see the 'training error'.

As it is a regression problem, let's look at [mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).

This is the average squared difference between the predicted and actual values across all datapoints, and is a common performance indicator for regression models.

    from sklearn.metrics import mean_squared_error

To make a prediction, we can just use:

    model.predict(x)
    
Don't forget to save the predictions to a new variable!

In [None]:
training_predictions = None #replace None with the correct code

In [None]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y, training_predictions)

This looks pretty large. Let's have a look at it in a plot, using matplotlib.pyplot.

    import matplotlib.pyplot as plt
    
    plt.scatter(x,y) #training data
    plt.plot(x,predictions, color="green") 
  
plt.plot(x,y) draws the points joined by a line. What would the X and Y arguments be here if we want to look at the predictions?

You can add labels, change axes, add headers and change colors. 

    plt.scatter(x,y, color="red")
    plt.xlabel("")
    plt.ylabel("")
    plt.title("")
    plt.ylim(0,500)
    plt.xlim(0,500)
    
The plot can also be saved:
    
    plt.savefig("path/filename.png", dpi=300)

At the end of our plot code, we give the command:

    plt.show()
    
    
If this is given before savefig, a blank plot will be saved.
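
A minimal sketch of that ordering, using made-up points and a hypothetical temporary output path. The "Agg" backend is an assumption so the example runs without a display; in a notebook you would not normally need it.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt

# Made-up points, purely for illustration
plt.scatter([1, 2, 3], [2, 4, 6], color="red")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Toy scatter plot")

# Hypothetical output location; any writable path works
out_path = os.path.join(tempfile.mkdtemp(), "toy_plot.png")
plt.savefig(out_path, dpi=300)  # save first...
plt.show()                      # ...then show
```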

Feel free to add anything you wish to the plot.

In [None]:
#work here
import matplotlib.pyplot as plt
plt.scatter(x,y)
plt.plot(x,None, linestyle="dashed", color="green") #replace None with the correct code
plt.show()

This should look familiar from lessons where you have plotted lines of best fit!

## Cross validation

You will have noticed we haven't excluded a validation set yet. This can be done, but is generally more useful when we have a larger dataset. As we only have 294 datapoints to play with for training, we'll perform cross validation.

* Hold out validation - removal of a single dataset for validation
* Cross validation - In each of N folds, a subset of data is held out for validation and performance is measured. After all N folds, the validation performance is the average of each of the folds.
    * N can be any integer value.
    
For example, in 10-fold cross validation, the algorithm is as follows:
1. Take 90% of the data as training, exclude 10% as a validation set. 
2. Train model on the 90% training, test model performance on 10% validation set, record findings.
3. Return the validation data to the pool, then take the next 90% and 10%, ensuring each datapoint appears in a validation set exactly once. Repeat 2 and 3 until all data has been used for validation.
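
The steps above can be sketched by hand with sklearn's KFold splitter. This is only an illustration on made-up data (a noisy straight line); the variable names are assumptions, and in practice the built-in helper does this loop for you.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Made-up data: a noisy straight line with 50 samples
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 3 * X.ravel() + rng.normal(scale=0.1, size=50)

fold_scores = []
times_validated = np.zeros(len(X), dtype=int)

for train_idx, val_idx in KFold(n_splits=10).split(X):
    fold_model = LinearRegression().fit(X[train_idx], y[train_idx])  # train on 90%
    predictions = fold_model.predict(X[val_idx])                     # predict the held-out 10%
    fold_scores.append(mean_squared_error(y[val_idx], predictions))  # record findings
    times_validated[val_idx] += 1

print("Mean validation MSE: %0.4f" % np.mean(fold_scores))
print("Each point validated once:", bool(np.all(times_validated == 1)))
```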

Once again, sklearn has a nice [function](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) for us to use rather than coding this ourselves!

For regression problems in particular, it is common to use mean squared error (MSE), and the $R^2$.

$$ MSE = \frac{1}{n}\sum{(Y_{true(i)} - Y_{pred(i)})^2} $$

$$ R^2 = 1 - \frac{\sum{(Y_{true(i)} - Y_{pred(i)})^2}}{\sum{(Y_{true(i)} - Y_{mean})^2}} $$

Where $Y_{true(i)}$ and $Y_{pred(i)}$ are the actual and predicted Y values of the ith sample, respectively, and $Y_{mean}$ is the mean of the actual Y values.

Loosely, MSE is a reflection of the error of the model, so the smaller the score, the lower the error.
$R^2$ is the coefficient of determination, providing a measure of how well future samples are likely to be predicted by the model. The best score is 1 - all are predicted with perfect accuracy.
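
To make the two metrics concrete, the sketch below computes them by hand on four made-up values and compares the results with sklearn's functions. Here $R^2$ is taken as one minus the sum of squared errors divided by the sum of squared deviations of the true values from their mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residual_ss = np.sum((y_true - y_pred) ** 2)        # sum of squared errors
total_ss = np.sum((y_true - np.mean(y_true)) ** 2)  # squared spread around the mean

mse_manual = residual_ss / len(y_true)
r2_manual = 1 - residual_ss / total_ss

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Both pairs of numbers should agree, which is a handy sanity check when you are unsure what a metric is measuring.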

These can be called using functions in sklearn:

    from sklearn.metrics import r2_score, mean_squared_error
    
Or, using the 'scoring' argument in cross_validate, in which case the key word is needed: 

    "r2","neg_mean_squared_error"
    
If you are interested, you can look at other metrics available for cross validation [here](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter), but we won't be looking at other metrics for the project.
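
Both scoring keywords can also be requested in a single call by passing them as a list. The sketch below uses made-up data, and the name `toy_results` is an assumption for illustration. Each metric then gets its own key in the returned dictionary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Made-up data: a noisy straight line with 60 samples
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 1))
y = 2 * X.ravel() + rng.normal(scale=0.1, size=60)

toy_results = cross_validate(LinearRegression(), X, y, cv=3,
                             scoring=["r2", "neg_mean_squared_error"])

# With several metrics, the keys are "test_" followed by the scoring keyword
print(sorted(toy_results.keys()))
```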

In [None]:
from sklearn.model_selection import cross_validate

#we'll reinitialize the model to make sure it isn't trained yet!

model = LinearRegression()

When using cross validation, we do not need to declare the 'fit' method, as it occurs behind the scenes.

However, when we fully train the model, we will need to call it.

In [None]:
N = 3 #3 fold cross validation
results = cross_validate(model, x.reshape(-1,1), y, cv=N, scoring="neg_mean_squared_error", return_train_score=True)

As cross_validate is often used by other, larger functions (for example grid search) that treat a higher score as better, the mean squared error is returned as a negative number so that parameters are ranked correctly.

The actual MSE is just the absolute value.
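
To see why the sign matters for ranking, here is a sketch comparing two candidate models on made-up data. DummyRegressor (which always predicts the mean) is brought in only as an illustrative baseline and is not part of this exercise. Picking the highest negative score picks the model with the smallest MSE.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Made-up data: a noisy straight line with 60 samples
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(60, 1))
y = 2 * X.ravel() + rng.normal(scale=0.1, size=60)

candidates = {"linear": LinearRegression(),
              "mean_only": DummyRegressor(strategy="mean")}

mean_scores = {}
for name, candidate in candidates.items():
    cv_results = cross_validate(candidate, X, y, cv=3, scoring="neg_mean_squared_error")
    mean_scores[name] = np.mean(cv_results["test_score"])  # negative numbers

best = max(mean_scores, key=mean_scores.get)  # highest score = smallest MSE
print(best, -mean_scores[best])               # flip the sign to report the MSE
```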

In [None]:
#see what is returned
results

Now, we need to work out the average of the test scores (the scores on the held-out validation folds) to give a robust estimate of our model's performance.

We can use numpy's functions to find the absolute values and the mean.

    import numpy as np
    np.abs()
    np.mean()

In [None]:
import numpy as np
validation_score = np.mean(np.abs(results["test_score"]))
print(validation_score)

How does this averaged score compare to the training score you found above? 

Which is better? Which would you believe more and why?

You could go back and see how the training score changes each time you run train_test_split without setting a seed.
Does your CV result change much in comparison?

## Testing a model

Finally, now we have a predicted performance from validation, we can use the test set.

Recall from above that for testing, the model should be trained on as much data as possible, so we make sure to use the entire training dataset without cross validation for this.

In [None]:
model = LinearRegression()

model.fit(training["bmi"].values.reshape(-1,1), training["Y"].values)

In [None]:
test_predicted = model.predict(test["bmi"].values.reshape(-1,1))

Now plot the results as you had above, but for the test data, and find out the mean squared error for the predictions! 

How does it compare to your training set?
What is the $R^2$ for the test set?
Is this a good model?

Remember, to get MSE: 

    from sklearn.metrics import mean_squared_error
    mean_squared_error(y_true, y_pred)
    
(It is similar for r2_score).

In [None]:
# work here 
