In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from math import sqrt
%matplotlib inline
np.set_printoptions(precision=3)
fig_width = 6.9
golden_mean = (sqrt(5)-1.0)/2.0    # Aesthetic ratio
fig_height = fig_width*golden_mean # height in inches

params = {
    'text.latex.preamble': r'\usepackage{gensymb}',  # a string, not a list, in recent Matplotlib
    'font.size': 10,
    'axes.labelsize': 10,  # font size for x and y labels
    'axes.titlesize': 12,
    'legend.fontsize': 8,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'text.usetex': True,
    'figure.figsize': [fig_width, fig_height],
    'font.family': 'serif',
}
rcParams.update(params)


# Introducing Scikit-Learn



## Data Representation in Scikit-Learn

The best way to think about data within Scikit-Learn is in terms of tables of data. Consider the Cities data set below.

In [None]:
data = pd.read_csv('Data/Cities.csv')
data.head()

Here each row of the data refers to a single observed city, and the number of rows is the total number of cities in the dataset. In general, we will refer to the rows of the matrix as samples, and the number of rows as $n_{samples}$.

Likewise, each column of the data refers to a particular quantitative piece of information that describes each sample. In general, we will refer to the columns of the matrix as features, and the number of columns as $n_{features}$.

### Features matrix

The features matrix is a two-dimensional numerical array with shape [$n_{samples}$, $n_{features}$]. By convention, it is often stored in a variable named $X$.

The samples (i.e., rows) always refer to the individual objects described by the dataset.

The features (i.e., columns) always refer to the distinct observations that describe each sample.

### Target array

The target array is one-dimensional, with length $n_{samples}$. It usually holds the quantity we want to predict from the data.
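
As a minimal sketch of this layout, the snippet below builds a features matrix and a target array from a small made-up table and checks their shapes. The column names and values are illustrative assumptions and are not taken from `Cities.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: three samples (rows), two features plus one target column.
toy = pd.DataFrame({
    "latitude": [10.0, 45.0, 60.0],
    "longitude": [5.0, 12.0, 25.0],
    "temperature": [27.0, 12.0, 4.0],
})

X = toy[["latitude", "longitude"]].to_numpy()  # features matrix, 2-D
y = toy["temperature"].to_numpy()              # target array, 1-D

n_samples, n_features = X.shape
print(X.shape, y.shape)
```

The first axis of `X` and the length of `y` must agree, since both count samples.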

## Scikit-Learn's Estimator API

Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.

The steps in using the Scikit-Learn estimator API are as follows:

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
   * Hyperparameters are parameters that are not directly learnt within estimators. In Scikit-Learn they are passed as arguments to the constructor of the estimator classes.
3. Arrange data into a features matrix and target vector following the discussion above.
4. Fit the model to your data by calling the **fit()** method of the model instance.
5. Apply the model to new data:
  * For supervised learning, often we predict labels for unknown data using the **predict()** method.
  * For unsupervised learning, we often transform or infer properties of the data using the **transform()** or **predict()** method.
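
The five steps above can be sketched end to end on synthetic data. This is only an illustration: the data are generated from an assumed exact relationship $y = 2x + 1$, and `LinearRegression` is used as one possible estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. choose a model class

# Synthetic data (an assumption for illustration): y = 2x + 1 with no noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0

model = LinearRegression(fit_intercept=True)  # 2. choose hyperparameters
X = x.reshape(-1, 1)                          # 3. features matrix [n_samples, n_features]
model.fit(X, y)                               # 4. fit the model to the data

X_new = np.array([[0.0], [5.0]])
y_new = model.predict(X_new)                  # 5. apply the model to new data
print(y_new)
```

Because the synthetic data are noise-free, the predictions should match $2x + 1$ up to floating-point error.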
  
Let us apply these steps in the following example.


### Supervised learning: Simple linear regression

In regression, we are interested in predicting a scalar-valued target, such as the price of a stock. By linear, we mean that the target must be predicted as a linear function of the inputs.

We will use the Cities dataset, and our goal is to predict a city's temperature from its latitude.

### 1. Choose a class of model

Import the linear regression class:

In [None]:
from sklearn.linear_model import LinearRegression

**Note:** More general linear regression models exist as well; see the [Scikit-Learn linear models documentation](http://scikit-learn.org/stable/modules/linear_model.html).

### 2. Choose model hyperparameters

This step involves instantiating the model with its associated hyperparameters. Here we keep the defaults.

In [None]:
model = LinearRegression()

### 3. Arrange data into a features matrix and target vector

Here our target variable y is the temperature column, and the features matrix is built from the latitude column.

In [None]:
y = data.temperature

In [None]:
# check the shape of target
y.shape

Our target variable y is already in the correct form: a 1-dimensional array

In [None]:
# feature matrix
X = data.latitude

In [None]:
# check shape of X
X.shape

X is also one-dimensional, so we need to reshape it into a matrix of size [n_samples, n_features].

In [None]:
# convert the 1-dimensional Series into a NumPy array with 2 axes
# (indexing a pandas Series with [:, np.newaxis] fails in recent pandas)
X = X.to_numpy()[:, np.newaxis]
X.shape
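
There are several equivalent ways to obtain a two-dimensional features matrix from a single column. The sketch below compares three of them on a hypothetical stand-in for the latitude column; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the latitude column.
lat = pd.Series([10.0, 45.0, 60.0], name="latitude")

a = lat.to_numpy()[:, np.newaxis]   # add a second axis to the NumPy array
b = lat.to_numpy().reshape(-1, 1)   # -1 lets NumPy infer the number of rows
c = lat.to_frame()                  # one-column DataFrame, also 2-D

print(a.shape, b.shape, c.shape)
```

All three have shape [n_samples, 1] and can be passed to `fit()`. Keeping a DataFrame preserves the column name, while the NumPy forms hold only the values.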

### 4. Fit the model to your data

Now it is time to apply our model to data. This can be done with the fit() method of the model:

In [None]:
# fit the model
model.fit(X,y)

This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore.

In [None]:
print('Weight coefficients: ', model.coef_)
print('y-axis intercept: ', model.intercept_)
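
To see what these two attributes mean, the sketch below fits a line to made-up data and rebuilds the predictions by hand as intercept plus coefficient times feature. The data and the names `X_demo`, `y_demo`, and `demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only: roughly linear with a little noise.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 60, size=(40, 1))
y_demo = 30.0 - 0.5 * X_demo[:, 0] + rng.normal(0, 1.0, size=40)

demo = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand from the learned attributes.
manual = demo.intercept_ + X_demo @ demo.coef_
print(np.allclose(manual, demo.predict(X_demo)))
```

`coef_` holds one weight per feature, so with a single feature it is an array of length one.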

### 5. Predict labels for unknown data
Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data.

For the sake of this example, our "new data" will be the last ten cities in the dataset, and we will ask what temperatures the model predicts for them. Note that these cities were also used to fit the model, so this illustrates the predict() call rather than a true out-of-sample test.

In [None]:
xval = data.tail(10).latitude
xval.shape

As before, we need to coerce these X values into a [n_samples, n_features] features matrix.

In [None]:
xval = xval.to_numpy()[:, np.newaxis]
xval

In [None]:
y_pred = model.predict(xval)

Let us visualize the actual values versus the predicted values.

In [None]:
y_actual = data.tail(10).temperature

In [None]:
plot_name = "ML_regression"
plt.scatter(xval, y_actual, label="data")
plt.scatter(xval, y_pred, label="prediction")
plt.plot(xval,y_pred, c='red',label='fit')
plt.legend(loc='best')
plt.title('Model Visualization')
plt.xlabel('City Latitude')
plt.ylabel('City Temperature')
plt.savefig('image/%s.pdf' %(plot_name), format='pdf')

## Supervised learning example: Player classification

Our question will be this: given a model trained on a portion of the player data, how well can we predict a player's position from minutes, shots, passes, tackles, and saves?

For this task, we will use an extremely simple logistic regression.

We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set. 

This could be done by hand, but it is more convenient to use the **train_test_split** utility function.

### Prepare Data

In [None]:
data = pd.read_csv('Data/Players.csv')
data.head()

In [None]:
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']

In [None]:
X = data[features]
y = data.position

Split the dataset into two sets, using 75% of the instances for training our classifier and the remaining 25% for evaluating it.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
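
As a quick check of what the split produces, the sketch below applies `train_test_split` to hypothetical data with 100 samples. The arrays and label names are made up and are not the Players data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 5 features, 4 made-up class labels.
X_toy = np.arange(500).reshape(100, 5)
y_toy = np.repeat(["A", "B", "C", "D"], 25)

Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(Xtr.shape, Xte.shape)

# The same random_state gives the same split again.
Xtr2, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
print(np.array_equal(Xtr, Xtr2))
```

Fixing `random_state` makes the shuffle reproducible, which helps when comparing models on the same held-out set.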

With the data arranged, we can follow our recipe to predict the labels:

In [None]:
from sklearn.linear_model import LogisticRegression   # 1. choose model class
model = LogisticRegression()               # 2. instantiate model
model.fit(X_train, y_train)                           # 3. fit model to data
y_pred = model.predict(X_test)                        # 4. predict on new data

Finally, we can use the accuracy_score utility to see the fraction of predicted labels that match their true value:

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Thus our model achieves a test accuracy of about $61\%$.
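
To make the metric concrete, accuracy is simply the fraction of predicted labels that equal the true labels. The sketch below checks this by hand on made-up labels; the label values are illustrative assumptions, not the Players data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration; not the Players data.
y_true_toy = np.array(["defender", "forward", "goalkeeper", "midfielder", "forward"])
y_pred_toy = np.array(["defender", "midfielder", "goalkeeper", "midfielder", "forward"])

by_hand = np.mean(y_true_toy == y_pred_toy)  # fraction of matching labels
print(by_hand, accuracy_score(y_true_toy, y_pred_toy))
```

Four of the five toy predictions match, so both computations give the same fraction.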

# <font color="red">Exercise: World Cup Data</font>

1. From the players data, compute and plot a linear regression for minutes played (x-axis) versus passes made (y-axis).
2. Use linear regression to build an interactive number-of-passes predictor, using players from Greece, USA, and Portugal as the training data.