## Machine learning with linear regression

These are the general steps we need to take to apply a machine learning algorithm. We'll look at how they apply specifically to linear regression.

![ML_steps](https://www.houseofbots.com/images/news/11493/cover.png)

In [None]:
#Import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Get data

Let's practice linear regression with a dataset that I found [here](https://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/slr/frames/frame.html). We are comparing the chirps/sec of the striped ground cricket with the surrounding temperature (&deg;F).

In [None]:
#Read in excel file 'cricketchirps_temp.xlsx'


We are going to save each column as a NumPy array so it's easier to plot and run functions on. Name the variables as shown below.

In [None]:
#Just need to add a .values to the end of your columns
#For example: SST_array = df['SST'].values

# data_chirp = ?

# data_temp = ?


Check that the data type is a numpy array
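
Here is one way this check could look, sketched on a small made-up table. The names `toy_df` and `chirps` and the values are hypothetical, not the cricket dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real spreadsheet
toy_df = pd.DataFrame({'chirps': [14.2, 15.5, 16.0]})

toy_column = toy_df['chirps']         # a pandas Series
toy_array = toy_df['chirps'].values   # the underlying NumPy array

print(type(toy_column))
print(type(toy_array))

# isinstance gives a simple True/False answer
print(isinstance(toy_array, np.ndarray))
```

You can run the same `type()` or `isinstance()` check on data_chirp and data_temp.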

Make a scatter plot of your data and make sure you can see a linear relationship between chirps/sec and temperature. There's no point putting in all this effort when your data can't be described with a linear model.

### Clean data

#### Conversions

There isn't much to do with this dataset but let's convert the temperature to degree Celsius for practice.

Note: (&deg;F − 32) × 5/9 = &deg;C
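
As a small sketch of the formula above, here it is applied to a made-up array (`toy_temp_f` is a hypothetical name with hypothetical values). NumPy applies the arithmetic to every element at once, so no loop is needed.

```python
import numpy as np

# Hypothetical temperatures in degrees Fahrenheit
toy_temp_f = np.array([32.0, 68.0, 212.0])

# (F - 32) * 5/9, applied element by element
toy_temp_c = (toy_temp_f - 32) * 5 / 9

print(toy_temp_c)  # freezing, room temperature, boiling: 0, 20, 100
```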

#### Reshape data

The scikit-learn linear regression function expects its input data to be two-dimensional. We can use the numpy.reshape() function to reshape our data.

* numpy.reshape(**array**,**new shape**)

It can also be used as a method.

* **numpy_array**.reshape(**new shape**)

Remember: the shape of an array is saved as a tuple. So for a 2D array, shape = **(**no of rows, no of columns**)**
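
Here is a quick illustration on a made-up array (`toy_values` is hypothetical). The `-1` shorthand in the last step is an extra that this notebook does not rely on: it asks NumPy to work out that dimension from the length of the data.

```python
import numpy as np

toy_values = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical 1D data
print(toy_values.shape)

# One column, as many rows as there are values
toy_column = toy_values.reshape((len(toy_values), 1))
print(toy_column.shape)

# Same result, letting NumPy infer the number of rows
toy_column_auto = toy_values.reshape(-1, 1)
print(toy_column_auto.shape)
```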

In [None]:
#This is 1D
print(data_chirp.shape)

#Reshaping chirp data
no_rows = len(data_chirp)
data_chirp_reshaped = data_chirp.reshape((no_rows, 1))
print(data_chirp_reshaped.shape)

#Try this for temperature data


#### Split data

Now we need to separate out our data into:
* training data: data you'll use to create your linear regression model
* testing data: data you'll use to test your model

You can do this manually by splitting your data into half. Try it!

OR

We can randomize the sampling using the train_test_split function from the sklearn.model_selection library.

Function description:

**data_train**, **data_test** = train_test_split(**data**,**test_size**,**train_size**,**random_state = 0**)

Inputs:
* **data**: data arrays or columns to split
* **test_size**: proportion of data or absolute number of values to sample for test data
* **train_size**: proportion of data or absolute number of values to sample for train data

Outputs:
* **data_train**: training data as array
* **data_test**: testing data as array

Let's split the data in half again, but this time we're telling Python to choose our test and train data at random.
Since we want 50% of the samples to be for training, our train_size = 0.5 and accordingly our test_size = 0.5.

I have an additional input random_state = 0 which allows you to get the same results each time you run the randomizer. This is good for testing but we don't really need to do this when we analyze our coral data.
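
To see what random_state does, here is a minimal sketch on made-up data (`toy_data` and the other `toy_` names are hypothetical). Splitting twice with the same random_state should give the same split both times.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples in a single column
toy_data = np.arange(10).reshape(-1, 1)

# Two splits with the same random_state
toy_train_a, toy_test_a = train_test_split(toy_data, train_size=0.5, random_state=0)
toy_train_b, toy_test_b = train_test_split(toy_data, train_size=0.5, random_state=0)

print(toy_train_a.shape, toy_test_a.shape)
print(np.array_equal(toy_train_a, toy_train_b))
```

If you leave random_state out, each run is free to pick a different set of rows.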

In [None]:
from sklearn.model_selection import train_test_split 

train_chirp, test_chirp, train_temp, test_temp = train_test_split(data_chirp_reshaped,data_temp_reshaped, train_size = 0.5, random_state = 0)

I've plotted both datasets so you can see how the data was split. The training data is in black and the test data is in blue.

In [1]:
plt.figure()
plt.scatter(train_chirp,train_temp,color = 'k')
plt.scatter(test_chirp,test_temp,color = 'b')
plt.xlabel('chirps/sec')
plt.ylabel('temperature')
plt.show()


### Create your practice data

Now try splitting the data with some other proportion of test and training data. Give these different variable names.

### Train model

Now we need to train our model. Again, we import functions from the sklearn library.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
#We need to tell Python that we are running a linear regression model
model = LinearRegression()

If you were doing this calculation by hand, the next step would be to find the equation for the line that best fits the data. 

Recall the equation for a line is `y = mx + C`.

Linear regression gives you the slope m and the intercept C.
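
If you are curious what the "by hand" calculation looks like, here is a sketch using the standard least-squares formulas for a single input variable. The data and the `toy_`, `m_hand` and `C_hand` names are made up for illustration. The points lie exactly on a line, so the answer is easy to check.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x + 1
toy_x = np.array([1.0, 2.0, 3.0, 4.0])
toy_y = np.array([3.0, 5.0, 7.0, 9.0])

x_mean = toy_x.mean()
y_mean = toy_y.mean()

# Slope: how x and y vary together, divided by how much x varies
m_hand = np.sum((toy_x - x_mean) * (toy_y - y_mean)) / np.sum((toy_x - x_mean) ** 2)

# Intercept: the best-fit line passes through the point (x_mean, y_mean)
C_hand = y_mean - m_hand * x_mean

print(m_hand, C_hand)  # slope 2 and intercept 1 for this toy data
```

Since LinearRegression is an ordinary least squares fit, it should give the same line on the same data, up to rounding.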

In [None]:
#Find the model that best *fits* our training data
model.fit(train_chirp, train_temp)

#Find coefficients
m = model.coef_
C = model.intercept_

print(m, C)

What do m and C mean for our experiment?

Now that you have a relationship between chirps and temperature, you can predict the temperature for any value of chirps/sec. 

Find the temperature corresponding to 18 chirps/sec.
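
As a hint, here is the general pattern on made-up data (the `toy_` names and values are hypothetical, not the cricket data). Note that the new value is passed to predict as a 2D array, just like the data used for fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on y = 3x + 10
toy_x = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
toy_y = 3 * toy_x.ravel() + 10

toy_model = LinearRegression()
toy_model.fit(toy_x, toy_y)

# Predict y for a single new x value (one row, one column)
toy_new_x = np.array([[5.0]])
toy_pred = toy_model.predict(toy_new_x)
print(toy_pred)

# The same number from y = mx + C
toy_by_hand = toy_model.coef_[0] * 5.0 + toy_model.intercept_
print(toy_by_hand)
```

Both lines should print a value of about 25 for this toy data.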

In [None]:
#Calculate the predicted temperature values for the training chirp data. Name the variable as shown below.

#pred_train_temp = ?


In [None]:
#Plot your measured training and your predicted training temperatures against the chirp/sec data


### Test model

Now we need to see how this model does with our test data. Let's find the predicted temperature values for the test chirp data using our linear regression model

In [None]:
#Calculate the predicted temperature values for the test chirp data. Name the variable as shown below.

#pred_test_temp = ?


In [None]:
#Plot this out as before


### Practice!

Test and train your own linear regression model using the data you split. 

### How did our model do?

We've talked about using the sum of the squared residuals to determine how good a fit was. 

**Refresher**

Residuals = predicted y - measured y

![residuals](https://internal.ncl.ac.uk/ask/numeracy-maths-statistics/images/Residuals.png)

Squaring each residual and taking the average of those squares gives the mean squared error.
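
Here is the calculation spelled out step by step on three made-up points (the `toy_` names and values are hypothetical), in case you want to see what happens behind the scenes.

```python
import numpy as np

# Hypothetical measured and predicted values
toy_measured = np.array([70.0, 75.0, 80.0])
toy_predicted = np.array([72.0, 74.0, 77.0])

toy_residuals = toy_predicted - toy_measured   # 2, -1, -3
toy_squared = toy_residuals ** 2               # 4, 1, 9
toy_mse = toy_squared.mean()                   # (4 + 1 + 9) / 3

print(toy_mse)
```

Squaring keeps positive and negative residuals from cancelling each other out.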

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
#Calculate the mean square error. This is like the score for your model
mean_sq_err = mean_squared_error(test_temp,pred_test_temp)
mean_sq_err 

What does this actually indicate? An easier way to interpret this result is to take the square root of the mean squared error. This is the root mean square error. 

In [None]:
root_mean_sq_err = np.sqrt(mean_sq_err)
root_mean_sq_err

This means the predicted temperatures are typically off from the measured temperatures by about 5.47 &deg;F.

### Practice!

Compare this "score" to your own calculations.

### Challenge

What test and training size would give you the lowest root mean square error?