<a href="https://colab.research.google.com/github/big-data-analytics-physics/handsonml_ch1/blob/master/ch1_scikit_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple fitting using scikit-learn
What is scikit-learn?  According to Wikipedia: Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

The best way to see its usefulness is to... use it!   So let's follow the book and use scikit-learn to perform a simple linear regression.   We will try to use 'GDP' from our dataset to predict 'Life satisfaction'.   Even though we have already read this data and plotted it, we will do it all over again and cover the steps one by one.

First, pull in all of the packages we need.   I prefer to do all of my imports at the top of my Python modules, but that is not required.

In [0]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
import plotly.offline as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go




---

Define the method for enabling plotly, as we did before.

In [0]:
def enable_plotly_in_cell():
  # Load require.js into this cell's output so that plotly can render there
  from IPython.display import display, HTML
  from plotly.offline import init_notebook_mode
  display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)

Load the data

In [0]:
# Load the data
url = "https://raw.githubusercontent.com/big-data-analytics-physics/data/master/ch1/gdp_oecd_data_byCountry.csv"
gdp_data=pd.read_csv(url)
#
# Add the info column again
gdp_data['Info'] = "Country:"+gdp_data['Country']+"<br>GDP:"+gdp_data['GDP'].astype(str)+"<br>Employment rate:"+gdp_data['Employment rate'].astype(str)+"<br> Homicide rate:"+gdp_data['Homicide rate'].astype(str)

#
# Some printing
for i in range(len(gdp_data)):
    print("Country: ",gdp_data.iloc[i]['Country'],"; GDP: ",round(gdp_data.iloc[i]['GDP'],2), "; Life satisfaction: ",gdp_data.iloc[i]['Life satisfaction'])

## Sorting a dataframe
Note that there is no obvious order to the data above.  For some of the results below, it will be helpful to have our dataframe *sorted* by GDP.   How do we find out how to do this?   Google!   Type "pandas dataframe sort by column" into Google.   The first result:
[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) 

In [0]:
gdp_data.sort_values(by=['GDP'],inplace=True)   # inplace means we don't have to put the results into a new dataframe
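
To see what `inplace` changes, here is a minimal sketch on a tiny made-up dataframe (the country names and GDP values are invented for illustration and are not from our dataset). Without `inplace=True`, `sort_values` returns a new sorted dataframe and leaves the original untouched.

```python
import pandas as pd

# Toy values, made up for illustration (not the OECD data)
toy = pd.DataFrame({"Country": ["A", "B", "C"], "GDP": [30.0, 10.0, 20.0]})

# Without inplace, sort_values returns a new sorted frame and leaves `toy` alone
toy_sorted = toy.sort_values(by="GDP")
print(toy_sorted["Country"].tolist())   # sorted order
print(toy["Country"].tolist())          # original order, unchanged

# The original row labels travel with the rows; reset_index renumbers them 0, 1, 2
toy_sorted = toy_sorted.reset_index(drop=True)
print(toy_sorted.index.tolist())
```

Either style works; the notebook uses `inplace=True` so that `gdp_data` itself stays sorted for the later cells.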

Loop over each element and print out again.   Is it sorted?

In [0]:
for i in range(len(gdp_data)):
    print("Country: ",gdp_data.iloc[i]['Country'],"; GDP: ",round(gdp_data.iloc[i]['GDP'],2), "; Life satisfaction: ",gdp_data.iloc[i]['Life satisfaction'])

Now let's do a simple plot.

In [0]:
enable_plotly_in_cell()

data1 = go.Scatter(
    x=gdp_data['GDP'],
    y=gdp_data['Life satisfaction'],
    mode='markers',
    text=gdp_data['Info']
)
layout = go.Layout(
    yaxis=dict(
        range=[0, 10]
    )
)

iplot(dict(data=[data1],layout=layout))


## Removing data from the training sample
If we compare this plot to the one in the book, we see that there are some points in our plot that are not in the book's plot: Czech Republic, Chile, Mexico, and Brazil.   Let's remove these:

In [0]:
enable_plotly_in_cell()
gdp_data_subset = gdp_data[~gdp_data['Country'].isin(['Czech Republic','Chile','Mexico', 'Brazil'])]
gdp_data_notfit = gdp_data[gdp_data['Country'].isin(['Czech Republic','Chile','Mexico', 'Brazil'])]

data1 = go.Scatter(
    x=gdp_data_subset['GDP'],
    y=gdp_data_subset['Life satisfaction'],
    mode='markers',
    text=gdp_data_subset['Info']
)
layout = go.Layout(
    yaxis=dict(
        range=[0, 10]
    )
)

iplot(dict(data=[data1],layout=layout))
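
The filtering above relies on `isin` plus the `~` operator. As a small sketch of how the two fit together, using a made-up four-row dataframe (the names `toy`, `excluded`, and `mask` are just for this illustration):

```python
import pandas as pd

# Toy values, made up for illustration (not the OECD data)
toy = pd.DataFrame({"Country": ["A", "B", "C", "D"], "GDP": [1.0, 2.0, 3.0, 4.0]})
excluded = ["B", "D"]

mask = toy["Country"].isin(excluded)   # boolean Series: True where Country is in the list
kept = toy[~mask]                      # ~ flips the mask, so we keep the other rows
dropped = toy[mask]                    # the rows we took out

print(kept["Country"].tolist())
print(dropped["Country"].tolist())
```

Every row ends up in exactly one of the two dataframes, which is why `gdp_data_subset` and `gdp_data_notfit` together still cover all of `gdp_data`.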


We can also display the data we removed on the same plot, by adding another "data" object to the plot:

In [0]:
enable_plotly_in_cell()
gdp_data_subset = gdp_data[~gdp_data['Country'].isin(['Czech Republic','Chile','Mexico', 'Brazil'])]
gdp_data_notfit = gdp_data[gdp_data['Country'].isin(['Czech Republic','Chile','Mexico', 'Brazil'])]

data1 = go.Scatter(
    x=gdp_data_subset['GDP'],
    y=gdp_data_subset['Life satisfaction'],
    mode='markers',
    name="Data to fit",
    text=gdp_data_subset['Info']
)

data2 = go.Scatter(
    x=gdp_data_notfit['GDP'],
    y=gdp_data_notfit['Life satisfaction'],
    mode='markers',
    name="Data not fit",
    text=gdp_data_notfit['Info']
)

layout = go.Layout(
    yaxis=dict(
        range=[0, 10]
    )
)

iplot(dict(data=[data1,data2],layout=layout))


## Choosing the model to fit our data
Now let's fit the data to a straight line.   The appropriate model for this is called LinearRegression.   To see how this works, check out: [Scikit-learn Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

First we need to get the model:

In [0]:
model = sklearn.linear_model.LinearRegression()

Now we need to call the fit method.   The fit method takes the input variable (in this case GDP) and the expected output variable (in this case 'Life satisfaction').   We will train it with our gdp_data_subset data.   We can then test the model by looking at how well it predicts *unseen* data: this is simply data for which we know the result, but which the model was not trained on.

The arguments of fit look like this:  model.fit(X,y), where X and y are each *numpy* arrays.   Converting a column of a pandas dataframe to a numpy array is pretty easy: just add ".values" to a single pandas column, like this:

In [0]:
X = gdp_data_subset['GDP'].values
print("The GDP column as pandas:\n",gdp_data_subset['GDP'])
print("The GDP column as a numpy array:\n",type(X),X.shape,X)
y = gdp_data_subset['Life satisfaction'].values
print("y",type(y),y.shape,y)
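
As a quick check of what ".values" gives us, here is a sketch on a made-up one-column dataframe. It also shows `to_numpy()`, which, as far as I know, is the spelling that the more recent pandas documentation recommends; both return the same kind of array here.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration (not the OECD data)
toy = pd.DataFrame({"GDP": [1.0, 2.0, 3.0]})

arr_values = toy["GDP"].values        # what the notebook uses
arr_to_numpy = toy["GDP"].to_numpy()  # equivalent method call

print(type(arr_values), arr_values.shape)
print(np.array_equal(arr_values, arr_to_numpy))
```

Note the shape: a single column becomes a one-dimensional array, which matters for the next step.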

NOTE: The next line will fail!

In [0]:
model.fit(X,y)

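The above failed because the sklearn fit method expects X to be a 2D array of shape (32,1) (one row per sample, one column per feature), but ours has shape (32,).   Strictly speaking only X has to be 2D, and y could stay 1D, but we reshape both so that they have matching shapes.   To fix this, we simply reshape: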

In [0]:
X = X.reshape(len(X), 1)
y = y.reshape(len(y), 1)

In [0]:
model.fit(X,y)
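
Here is a self-contained sketch of the same failure and fix on toy numbers (invented values lying exactly on y = 2x + 1). It uses `reshape(-1, 1)`, where `-1` asks numpy to work out the number of rows itself, so we do not have to pass `len(X)`. It also illustrates that, as far as I can tell, only X needs to be 2D.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy values lying exactly on y = 2x + 1 (made up for illustration)
x_flat = np.array([1.0, 2.0, 3.0, 4.0])
y_flat = 2.0 * x_flat + 1.0

X_col = x_flat.reshape(-1, 1)   # -1 lets numpy infer the number of rows
print(x_flat.shape, X_col.shape)

failed = False
try:
    LinearRegression().fit(x_flat, y_flat)   # 1D X is rejected
except ValueError as err:
    failed = True
    print("fit failed with", type(err).__name__)

toy_model = LinearRegression().fit(X_col, y_flat)   # 2D X with 1D y is accepted
print(toy_model.predict(X_col).shape)
```

With a 1D y, the predictions also come back 1D, so whether to reshape y is mostly a matter of which output shape you prefer.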

As an aside, we could get around the **reshape** problem we had above with a little trick.  Instead of passing the numpy **single dimension** arrays X and y, we pass a true dataframe (which is basically a 2D array) by using double brackets [[...]] like this:

model.fit(gdp_data_subset[['GDP']],gdp_data_subset[['Life satisfaction']])

Try this yourself.
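
If you want to see why the trick works, this small sketch compares single and double brackets on a made-up dataframe (the numbers are invented for illustration):

```python
import pandas as pd

# Toy values, made up for illustration (not the OECD data)
toy = pd.DataFrame({"GDP": [1.0, 2.0, 3.0], "Life satisfaction": [5.0, 6.0, 7.0]})

single = toy["GDP"]     # single brackets -> pandas Series (1D)
double = toy[["GDP"]]   # double brackets -> pandas DataFrame (2D, one column)

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

The double-bracket version already has the (rows, 1) shape that fit wants for X, so no reshape is needed.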

## Getting the results of the fit, and using the fit to predict new results
The *attributes* of the LinearRegression model are coef_ and intercept_, which are just the slope and intercept of the fitted line:

In [0]:
print("Fit results: slope=",model.coef_," and intercept=",model.intercept_)
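
To convince ourselves that coef_ and intercept_ really are the slope and intercept, here is a sketch on toy points that lie exactly on y = 0.5x + 4 (invented values, not our GDP data). It rebuilds the predictions by hand and compares them with what predict returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy points lying exactly on y = 0.5x + 4 (made up for illustration)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 0.5 * X_toy[:, 0] + 4.0

toy_model = LinearRegression().fit(X_toy, y_toy)
slope = toy_model.coef_[0]        # one coefficient per input column
intercept = toy_model.intercept_

# Rebuild the predictions by hand: y = slope * x + intercept
manual = slope * X_toy[:, 0] + intercept
print(slope, intercept)
print(np.allclose(manual, toy_model.predict(X_toy)))
```

One difference from the notebook: here y is 1D, so coef_ has shape (1,) and intercept_ is a single number. With the 2D y used above, they come back wrapped in one extra dimension, which is why the printed values appear inside brackets.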

We would like to be able to draw a fit curve on top of our data.   To do this, we use the predict method:

In [0]:
ypred = model.predict(X)   # This puts an array of predictions for each X in ypred
print("Type returned from predict:",type(ypred),"; shape: ",ypred.shape)
ypred = ypred.reshape(32)
print("Predictions",ypred,gdp_data_subset.GDP.values)
#
#  Look up how zip works if it is not clear to you!
for xval,yobs,yp in zip(X,y,ypred):
    print("X",xval,"; y observed=",yobs,"; y predicted",yp)
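
In case zip is new to you, here is a minimal sketch with three short made-up lists. zip walks through the lists in step, handing back one element from each on every pass through the loop.

```python
# Toy values, made up for illustration
xs = [1.0, 2.0, 3.0]
observed = [5.1, 5.9, 7.2]
predicted = [5.0, 6.0, 7.0]

residuals = []
for xval, yobs, yp in zip(xs, observed, predicted):
    # On each pass, xval, yobs and yp come from the same position in the three lists
    residuals.append(round(yobs - yp, 2))
    print("X", xval, "; y observed=", yobs, "; y predicted", yp)

print(residuals)
```

If the lists have different lengths, zip stops at the shortest one, so it is worth checking that the inputs line up.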

Let's plot: 

*   Our data we used to train the model (gdp_data_subset)
*   The additional data points we did not include in the fit (gdp_data_notfit)
*   The predicted results.   We will connect all of these so they make a straight line.



In [0]:
enable_plotly_in_cell()
print(ypred)
dataSubset = go.Scatter(
    x=gdp_data_subset['GDP'],
    y=gdp_data_subset['Life satisfaction'],
    mode='markers',
    name="fitted data",
    text=gdp_data_subset['Info']
)

dataNotfit = go.Scatter(
    x=gdp_data_notfit['GDP'],
    y=gdp_data_notfit['Life satisfaction'],
    mode='markers',
    name="not fit data",
    text=gdp_data_notfit['Info']
)

dataPred = go.Scatter(
    x=gdp_data_subset['GDP'],
    y=ypred,
    mode='lines',
    name="prediction",
    text=gdp_data_subset['Info']
)

layout = go.Layout(
    yaxis=dict(
        range=[0, 10]
    )
)

iplot(dict(data=[dataSubset,dataNotfit,dataPred],layout=layout))
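
Earlier we said that a model can be tested on *unseen* data. The sketch below shows one simple way to put a number on that, using invented toy values rather than our GDP data. The mean absolute error is my choice of measure here, not something taken from the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy numbers, invented for illustration
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])    # exactly y = 2x
X_unseen = np.array([[5.0], [6.0]])
y_unseen = np.array([10.5, 11.5])           # known answers the model never saw

toy_model = LinearRegression().fit(X_train, y_train)
y_guess = toy_model.predict(X_unseen)

# Average size of the miss on the unseen points
mean_abs_error = np.mean(np.abs(y_guess - y_unseen))
print(y_guess, mean_abs_error)
```

The same pattern could be tried with gdp_data_notfit as the unseen set, keeping in mind that four points is a very small sample.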