<h1>ITNPDB2 Representing and Manipulating Data</h1>
<h3>University of Stirling<br>Dr. Saemundur Haraldsson</h3>

<h1>SciKit-Learn: Overview</h1>
<h2>Machine Learning <a href="https://scikit-learn.org/stable/">(documentation)</a></h2>
<h4>        
    <ul>
        <li>Preprocessing</li>
        <li>Model selection</li>
        <li>Classification</li>
        <li>Regression</li>
        <li>Clustering</li>
        <li>Dimensionality reduction</li>
        </ul>
</h4>

In [None]:
# First we need to import some base modules we'll need in this lecture
import numpy
# We'll need to generate some data to play with
from scipy import stats
# We'll also need predefined data from SciKit-Learn
from sklearn import datasets
# A helper we'll use later to sort the generated data
from operator import itemgetter

import matplotlib.pyplot as plt
numpy.random.seed(seed=1234) # So that the lecture becomes deterministic

<h3>Preprocessing</h3>
<h4>Most real-world data cannot be used as is.<br>
    We'll need to "clean" the data by
<ul>
    <li>Standardising or normalising</li>
    <li>Dealing with outliers</li>
    <li>Encoding categorical features</li>
</ul>
</h4>
<h3>SciKit-Learn provides utility functions and classes to do this for us</h3>

In [None]:
from sklearn import preprocessing
# First make some data
X_train = numpy.array([stats.alpha.rvs(7,size=10),
                       stats.anglit.rvs(size=10),
                       stats.geom.rvs(.8,size=10)
                      ]).transpose()
X_test = numpy.array([stats.alpha.rvs(7,size=5),
                       stats.anglit.rvs(size=5),
                       stats.geom.rvs(.8,size=5)
                      ]).transpose()

print(X_train)

# Now we can scale each column to zero mean and unit variance

X_scaled = preprocessing.scale(X_train)
print(X_scaled)

<h3> See the difference</h3>

In [None]:
print('Mean changes from ',X_train.mean(axis=0), ' --> ', X_scaled.mean(axis=0))
print('Standard deviation changes from ',X_train.std(axis=0), ' --> ',X_scaled.std(axis=0))
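
To make the scaling step concrete, here is a minimal sketch of the same standardisation done by hand. The array `X_demo` is made-up data used only for illustration; the point is that subtracting the column mean and dividing by the column standard deviation should match what `preprocessing.scale` returns.

```python
import numpy as np
from sklearn import preprocessing

# Made-up data for illustration: 10 rows, 3 columns
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

# Standardise by hand: subtract the column mean, divide by the column std
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# preprocessing.scale should give the same values
X_sklearn = preprocessing.scale(X_demo)
print(np.allclose(X_manual, X_sklearn))
```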

<h3> We should keep track of our scalers so that we can treat the test set the same way as the training set</h3>
<h4> For that we can use the class StandardScaler<br>
There are others as well:
    <ul>
        <li>Normalizer -- Scaling to unit norm</li>
        <li>MinMaxScaler -- Squeeze the data between min-max values</li>
        <li>MaxAbsScaler -- Scale each feature by its maximum absolute value, so the values end up between -1 and 1</li>
        <li>RobustScaler -- For when you have many outliers</li>
        <li>QuantileTransformer -- Transforming to a uniform distribution</li>
        <li>PowerTransformer -- Map from "any" distribution to Gaussian</li>
    </ul>

</h4>

In [None]:
scaler = preprocessing.StandardScaler().fit(X_train)
# The scaler keeps track of the original mean and standard deviation (scale)
print('mean: ', scaler.mean_,' and scale: ',scaler.scale_)
# and we can use the scaler instance to scale the test data in the same way
print(X_test)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
print('Mean changes from ',X_test.mean(axis=0), ' --> ', X_test_scaled.mean(axis=0))
print('Standard deviation changes from ',X_test.std(axis=0), ' --> ',X_test_scaled.std(axis=0))
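
The list of alternative scalers mentions RobustScaler for data with many outliers. The sketch below, on a made-up single feature with one extreme value, hints at why: StandardScaler centres on the mean, which the outlier drags along, while RobustScaler centres on the median and scales by the interquartile range.

```python
import numpy as np
from sklearn import preprocessing

# Made-up single feature with one extreme outlier
X_out = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

standard = preprocessing.StandardScaler().fit(X_out)
robust = preprocessing.RobustScaler().fit(X_out)

# The mean and std are dominated by the outlier
print('StandardScaler centre and scale:', standard.mean_, standard.scale_)
# The median and interquartile range barely notice it
print('RobustScaler centre and scale:  ', robust.center_, robust.scale_)
print(robust.transform(X_out).ravel())
```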

<h3>Now let's try squeezing the data between a given min-max value</h3>
<h4>We know that the mean changes but what do you think happens to the variance?</h4>

In [None]:
# First just the standard
mm_scaler = preprocessing.MinMaxScaler()
X_scaled = mm_scaler.fit_transform(X_train)
print(X_scaled)
print('The standard deviation was: ',X_train.std(axis=0))
print('                but is now: ',X_scaled.std(axis=0))

# Now we force it between 1 and 3
mm_scaler = preprocessing.MinMaxScaler(feature_range=(1,3))
X_scaled = mm_scaler.fit_transform(X_train)
print(X_scaled)
print('The standard deviation was: ',X_train.std(axis=0))
print('                but is now: ',X_scaled.std(axis=0))
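
As a rough answer to the question about the spread, here is a sketch of min-max scaling done by hand on made-up data (`X_demo`, `low` and `high` are illustrative names). Since the transformation is a shift plus a multiplication, the standard deviation of each column should simply be multiplied by the same factor.

```python
import numpy as np
from sklearn import preprocessing

# Made-up data and target range for illustration
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(10, 3))
low, high = 1, 3

col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Map each column to [0, 1], then stretch and shift to [low, high]
X_manual = (X_demo - col_min) / (col_max - col_min) * (high - low) + low

X_sklearn = preprocessing.MinMaxScaler(feature_range=(low, high)).fit_transform(X_demo)
print(np.allclose(X_manual, X_sklearn))

# The standard deviation is multiplied by the same per-column factor
factor = (high - low) / (col_max - col_min)
print(np.allclose(X_sklearn.std(axis=0), X_demo.std(axis=0) * factor))
```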

<h3>What if we have categorical data</h3>
<h4>e.g. post codes, education, gender identity, etc.</h4>
<ul>
    <li><h4>OrdinalEncoder --- <i>Changes the categories to integers</i></h4></li>
    <li><h4>OneHotEncoder --- <i>Creates a binary (dummy) feature for each category</i></h4></li>
</ul>
<p>Note that Pandas provides a more convenient version of this functionality for DataFrames</p>

In [None]:
# The data
X_train = [['male', 'from US', 'uses Safari'], 
               ['female', 'from Europe', 'uses Firefox'],
               ['female', 'from Africa', 'uses Opera'],
               ['male', 'from Europe', 'uses Opera']
              ]
X_test = [['female', 'from Africa', 'uses Safari'], 
          ['male', 'from US', 'uses Opera']
         ]

## OrdinalEncoder

In [None]:
enc_ord = preprocessing.OrdinalEncoder()
X_ordinal = enc_ord.fit(X_train).transform(X_train)
print(X_ordinal)
print(enc_ord.transform(X_test))

## Create more features

In [None]:
enc_dum = preprocessing.OneHotEncoder()
X_dummies = enc_dum.fit(X_train).transform(X_train).toarray()
print(X_dummies)
print(enc_dum.transform(X_test).toarray())
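
For comparison with the two encoders, the Pandas route mentioned earlier might look like the sketch below. The column names `gender`, `origin` and `browser` are assumptions added for illustration; the rows are the same training data as above.

```python
import pandas as pd

# Same rows as the training data; the column names are made up for illustration
df = pd.DataFrame(
    [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox'],
     ['female', 'from Africa', 'uses Opera'],
     ['male', 'from Europe', 'uses Opera']],
    columns=['gender', 'origin', 'browser'],
)

# One dummy column per category, similar in spirit to OneHotEncoder
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())
print(dummies.shape)

# Integer codes for one column, similar in spirit to OrdinalEncoder
codes = df['browser'].astype('category').cat.codes
print(codes.tolist())
```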

## What if there are more categories in the test set?

In [None]:
X_test_alt = [['female', 'from Asia', 'uses Safari'], 
              ['male', 'from US', 'uses Chrome']
             ]

# Both of these raise a ValueError because of the unseen categories.
# Uncomment them one at a time to see the error :)
#enc_ord.transform(X_test_alt)
#enc_dum.transform(X_test_alt)

enc_dum = preprocessing.OneHotEncoder(handle_unknown='ignore')
enc_dum.fit(X_train)
print(enc_dum.transform(X_test_alt).toarray())
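
The cell above only handles unseen categories for the one-hot encoder. A possible equivalent for OrdinalEncoder is sketched below. It assumes scikit-learn 0.24 or newer, where `handle_unknown='use_encoded_value'` and `unknown_value` are available; the value -1 is an arbitrary choice for "unknown".

```python
from sklearn import preprocessing

X_cat_train = [['male', 'from US', 'uses Safari'],
               ['female', 'from Europe', 'uses Firefox'],
               ['female', 'from Africa', 'uses Opera'],
               ['male', 'from Europe', 'uses Opera']]
X_cat_new = [['female', 'from Asia', 'uses Safari'],
             ['male', 'from US', 'uses Chrome']]

# Unseen categories are mapped to -1 instead of raising an error
enc = preprocessing.OrdinalEncoder(handle_unknown='use_encoded_value',
                                   unknown_value=-1)
enc.fit(X_cat_train)
encoded = enc.transform(X_cat_new)
print(encoded)
```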

<h3>Classification and regression</h3>
<h4>Different sides of the same coin<br>
What type is the variable you are trying to predict? (i.e. the "Y" value)
</h4>
<ul>
    <li><h4>Categorical --- <i>Classification</i></h4></li>
    <li><h4>Numerical --- </h4></li>
    <ul>
        <li>Discrete and finite -- <i>Classification</i> (only a few finite values possible)</li>
        <li>Discrete and infinite -- <i>Regression</i> (infinite or very large set of values possible)</li>
        <li>Continuous -- <i>Regression</i> (infinite or very large set of values possible)</li>
    </ul>
</ul>
<h4>Most models implemented in SciKit-Learn can be used for both classification and regression</h4>
<p>Fun fact: the link for classification and the link for regression on scikit-learn's website return the same page</p>
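
As a small illustration of the "same coin" idea, the sketch below fits a decision tree classifier and a decision tree regressor to the same made-up feature. The choice of decision trees and the two targets are assumptions for illustration; the same pattern applies to other model families that come in a classifier and a regressor variant.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up one-dimensional feature
rng = np.random.default_rng(2)
X_demo = rng.uniform(-3, 3, size=(100, 1))

# Categorical target (two classes) and continuous target from the same feature
y_class = (X_demo[:, 0] > 0).astype(int)
y_reg = np.sin(X_demo[:, 0])

clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_class)
reg_demo = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_reg)

X_new = np.array([[-2.0], [0.5], [2.5]])
print(clf_demo.predict(X_new))  # class labels
print(reg_demo.predict(X_new))  # real-valued predictions
```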

<h3>Clustering</h3>
<h4>Can be considered a more general form of classification:<br>
you are trying to learn the categories or labels automatically, without being given them.
</h4>
<h4>Unsupervised learning</h4>
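
A minimal sketch of clustering, assuming k-means as the algorithm (one of several that scikit-learn offers) and two made-up, well-separated blobs as data. Note that no labels are passed to the model; it has to find the groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up blobs of points, centred at (0, 0) and (5, 5)
rng = np.random.default_rng(3)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(30, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(30, 2))
X_blobs = np.vstack([blob_a, blob_b])

# We only tell the model how many clusters to look for, not which point belongs where
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_blobs)
print(labels)
print(kmeans.cluster_centers_)
```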

<h3>Linear models</h3>
<h4>Mostly for regression and when the target is expected to be a linear combination of the features</h4>

In [None]:
from sklearn import linear_model
# Let's make some data
X_train = stats.norm.rvs(size=50).reshape(-1,1)
Y_train = stats.norminvgauss.rvs(1, 0.5,size=50)
X_test = stats.norm.rvs(size=10).reshape(-1,1)

# Fit to a model -- try different linear models
#regr = linear_model.LinearRegression()
regr = linear_model.Ridge(alpha=.5)
regr.fit(X_train, Y_train)


# Plot the model
Y_pred = regr.predict(X_test)

fig = plt.figure(figsize=(12,9))
ax = fig.add_subplot(111)
ax.scatter(X_train,Y_train)
ax.plot(X_test,Y_pred,'r',label=u'Linear model')
plt.legend()
plt.show()
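
To see what a fitted linear model has actually learned, we can inspect its coefficients. The sketch below uses made-up data with a known slope of 2 and intercept of 1 (an assumption for illustration, unlike the unrelated X and Y above), and compares ordinary least squares with a strongly regularised Ridge model.

```python
import numpy as np
from sklearn import linear_model

# Made-up data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(4)
X_lin = rng.normal(size=(50, 1))
y_lin = 2.0 * X_lin[:, 0] + 1.0 + rng.normal(scale=0.1, size=50)

ols = linear_model.LinearRegression().fit(X_lin, y_lin)
# A large alpha shrinks the slope towards zero
ridge = linear_model.Ridge(alpha=50.0).fit(X_lin, y_lin)

print('OLS slope and intercept:  ', ols.coef_, ols.intercept_)
print('Ridge slope and intercept:', ridge.coef_, ridge.intercept_)
```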

<h3>Support Vector Machines</h3>
<ul>
    <li><h4>For regression and classification (also outlier detection)</h4></li>
    <li><h4>Effective on high dimensional data</h4></li>
    <li><h4>No direct probability estimates</h4></li>
</ul>

In [None]:
from sklearn import svm
# For classification we need a categorical target
X_train = numpy.array([stats.norm.rvs(size=50),
                       stats.norminvgauss.rvs(1, 0.5,size=50)
                      ]).transpose()
X_train = numpy.array(sorted(X_train,key=itemgetter(0,1)))
Y_train = numpy.array([1]*int(len(X_train)/2)+[0]*int(len(X_train)/2))
print(Y_train)
clf = svm.SVC(gamma='scale')
clf.fit(X_train,Y_train)
fig = plt.figure(figsize=(12,9))
ax = fig.add_subplot(111)
ax.plot(X_train[Y_train==1,0],X_train[Y_train==1,1],'r*',markersize=12)
ax.plot(X_train[Y_train==0,0],X_train[Y_train==0,1],'go',markersize=12)
plt.show()

<ul>
    <li><h4>When we've fitted the model we can use it to predict</h4></li>
</ul>

In [None]:
X_test =  numpy.array([stats.norm.rvs(size=10),
                       stats.norminvgauss.rvs(1, 0.5,size=10)
                      ]).transpose()
Y_pred = clf.predict(X_test)
fig = plt.figure(figsize=(12,9))
ax = fig.add_subplot(111)
ax.plot(X_train[Y_train==1,0],X_train[Y_train==1,1],'ro',markersize=6,alpha=.3)
ax.plot(X_train[Y_train==0,0],X_train[Y_train==0,1],'go',markersize=6,alpha=.3)
ax.plot(X_test[Y_pred==1,0],X_test[Y_pred==1,1],'r*',markersize=13)
ax.plot(X_test[Y_pred==0,0],X_test[Y_pred==0,1],'g*',markersize=13)
plt.show()
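
On the point about probability estimates: what an SVC gives us directly is a signed score relative to the decision boundary, not a probability. The sketch below, on made-up data with two separated groups, shows `decision_function`; the sign indicates the predicted side and the magnitude gives a rough sense of how far from the boundary a point lies.

```python
import numpy as np
from sklearn import svm

# Made-up data: two groups centred at (-2, -2) and (2, 2)
rng = np.random.default_rng(5)
X_svm = np.vstack([rng.normal(-2, 1, size=(40, 2)),
                   rng.normal(2, 1, size=(40, 2))])
y_svm = np.array([0] * 40 + [1] * 40)

clf_svm = svm.SVC(gamma='scale').fit(X_svm, y_svm)

X_new = np.array([[-2.0, -2.0], [2.0, 2.0]])
# Negative scores fall on the class-0 side, positive scores on the class-1 side
scores = clf_svm.decision_function(X_new)
print(scores)
print(clf_svm.predict(X_new))
```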

<h3>Model selection</h3>
<h4>Not all models are equal, and neither are datasets<br>
We compare models, validate them, and tune them to the data (choose parameter values)<br>
    This involves:
</h4>
<ul>
    <li><h4>Evaluate the performance of models</h4> So that we can compare models</li>
    <li><h4>Parameter tuning</h4> So that the model/method/algorithm is the best version we can get</li>
    <li><h4>Cross-validation</h4> To avoid overfitting</li>
</ul>

### Evaluating the performance of a model is easy with Scikit-learn
- Model classes provide the __score__ method for this
- We can also use all kinds of metrics provided
 - Mean Squared Error
 - F-measure
 - Area under ROC curve
 - etc.
- There's even a function that prints out a pretty table with a number of metrics
- All you need is a test sample of the data
 - For obvious reasons you don't want to test on the data you trained on
 - Does anyone know these reasons?
- Let's test this with the digits dataset and use the train_test_split utility function

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn import datasets

# Loading the Digits dataset and flattening the data to be able to use it
digits = datasets.load_digits()
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

- One simple function to split it -- let's go for a 60-40 training-testing split

In [None]:
# Split the dataset: 60% for training, 40% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

- Fit the model to the training sample

In [None]:
# Of course we need to fit a model
clf = RandomForestClassifier(n_estimators=20)
clf.fit(X_train, y_train)

- and then evaluate on the testing sample

In [None]:
print(clf.score(X_test, y_test))
y_pred = clf.predict(X_test)
print(f1_score(y_test,y_pred,average='macro'))
print(classification_report(y_test, y_pred))
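
The area under the ROC curve from the list of metrics can be computed in a similar way. This is a sketch under a few assumptions: it reloads the digits data so that it runs on its own, fixes `random_state` for repeatability, and uses the one-vs-rest strategy (`multi_class='ovr'`) because the digits problem has ten classes. Unlike the F-measure, this metric needs predicted probabilities rather than predicted labels.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_d, y_d = datasets.load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.4, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# One column of probabilities per class
proba = model.predict_proba(X_te)
auc = roc_auc_score(y_te, proba, multi_class='ovr')
print(auc)
```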

### Parameter tuning
- Now that we know how to evaluate the performance of our models we might want to adjust them
 - find the ''optimal'' parameters 
- We do that by repeatedly fitting our models with different sets of parameters until we are satisfied with the performance
 - We could do that with a __for__ or a __while__ loop 
 - or we could use Scikit-learn's utility which we have to tell
   - what parameters we want to adjust
   - what the boundaries or possible values of the parameters are
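
The manual loop could look roughly like the sketch below. The candidate values for `max_depth` and the variable names are assumptions for illustration, and the data is reloaded so the cell runs on its own. Every candidate is fitted on the training part and scored on a held-out part, and we keep the best one.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_d, y_d = datasets.load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X_d, y_d, test_size=0.4, random_state=0)

best_score, best_depth = -1.0, None
# Hypothetical candidate values; None means the trees grow without a depth limit
for depth in [2, 4, 8, None]:
    model = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    score = model.score(X_val, y_val)
    print(depth, round(score, 3))
    if score > best_score:
        best_score, best_depth = score, depth

print('best max_depth:', best_depth)
```

This gets tedious with more than one parameter, which is where the search utilities come in.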

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"max_depth": [3, None],
              "max_features": stats.randint(1, 11),
              "min_samples_split": stats.randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)
# Fit on the training sample only, so that the test sample stays unseen
random_search.fit(X_train, y_train)

print(random_search.best_params_)
y_pred = random_search.predict(X_test)
print(classification_report(y_test, y_pred))

### Cross-validation
- What we actually did in the last cell was cross-validation
- We split the data into training and testing
 - So that we can test the fitted model on an unseen sample of the data
 - We could do it with functions from random sampling but Scikit-learn has utility functions to do this for us
- It's good practice to split the data three ways (training, validation, and testing)
 - Anyone know why?
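
One way to sketch this is with `cross_val_score`, which does the repeated splitting and scoring for us. The 80-20 split and the variable names are assumptions for illustration. A final test set is put aside first, the cross-validation folds play the role of the validation data, and the final test set is only touched once at the very end.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X_d, y_d = datasets.load_digits(return_X_y=True)

# Put a final test set aside before doing anything else
X_rest, X_final, y_rest, y_final = train_test_split(X_d, y_d, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)

# 5-fold cross-validation: each fold is used once for validation
cv_scores = cross_val_score(model, X_rest, y_rest, cv=5)
print(cv_scores, cv_scores.mean())

# Only at the end: fit on everything except the final test set and score once
model.fit(X_rest, y_rest)
final_score = model.score(X_final, y_final)
print(final_score)
```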


<h3>Dimensionality reduction</h3>
<h4>Find the features that are least helpful in explaining the variance of the target variable and remove them<br>
    Why do we need to remove features or dimensions?</h4>
<ul>
    <li><h4>The curse of dimensionality <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">(Wiki)</a></h4></li>
    <li><h4>Increase efficiency and speed</h4></li>
</ul>
<h3>We won't go into details about how it works</h3>

In [None]:
from sklearn.decomposition import PCA

diabetes = datasets.load_diabetes()
X = diabetes.data
print(X.shape)
pca = PCA(n_components=6) # Change this to see what happens to the dimensions
X_pca = pca.fit_transform(X)
print(pca.get_covariance())
print(X_pca.shape)
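
When changing `n_components`, it helps to know how much of the variance each choice keeps. The sketch below fits a PCA with all components and looks at the cumulative `explained_variance_ratio_`. The 95% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_diab = datasets.load_diabetes().data

# Keep all components so we can see the full picture
pca_full = PCA().fit(X_diab)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative share reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed)
```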
