# scikit-learn
A notebook for learning scikit-learn and trying things out.<br>
Following this tutorial:

In [None]:
#from IPython.display import HTML
#tutorialPage = HTML('https://www.datacamp.com/community/tutorials/machine-learning-python')
#tutorialPage

Running the tutorial in this link:
[Tutorial](https://www.datacamp.com/community/tutorials/machine-learning-python)<br>
Load a practice dataset bundled with scikit-learn (the original data comes from ics.uci.edu) ...

In [None]:
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
# print(digits)

In [None]:
#import pandas as pd

# Load the training data as a Pandas DataFrame
#digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra", header=None)
# print(digits)
#digits_test = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes", header=None)


In [None]:
# Glance at the data organization
print('keys')
print(digits.keys())
print('/***/')
print('target')
print(digits.target)
print(digits.target_names)
print('description')
print(digits.DESCR)

#### Looking into the Data

In [None]:
digits_data = digits.data

print('Data Shape')
print(digits_data.shape)

print('Possible Targets')
digits_target = digits.target
print(digits_target)

print('Number of Unique Targets')
num_digits = len(np.unique(digits_target))
print(num_digits)

So we have ***1797 samples*** of ***64 features***, and there are ***10 classes*** for classification.<br>
**The Images**<br>
We also have 1797 images of 8x8 pixels. We'll reshape them to check that they match the training data.

In [None]:
digits.images.shape

In [None]:
reshaped_images = digits.images.reshape((1797,64))
# Double quotes are needed here: 'Let''s' would silently concatenate to "Lets"
print("Let's compare the training data with the reshaped images and check for full consistency:")
print('Fully consistent' if np.all(reshaped_images == digits_data) else 'Inconsistency detected between images and data')

### Drawing the images for some intuition

In [None]:
import matplotlib.pyplot as plt
# Image of 7x7 inches
fig = plt.figure(figsize=(7,7))

# Define subplots
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.075, wspace=0.075)

# Draw each digit into a separate subplot
for i in range(64):
    # Init the subplots: add a subplot to the 8x8 grid at position i+1
    ax = fig.add_subplot(8,8,i+1, xticks=[], yticks=[])
    # Display the i-th image
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label the image with the target value (the digit that should be drawn)
    ax.text(0, 7, str(digits_target[i]))
    
plt.show()

... and, out of curiosity, another way to show them by merging images and labels

In [None]:
# Join images and labels into a single list
images_and_labels = list(zip(digits.images, digits_target))

# Show the first 8 images (the first row of the grid above)
for ix, (image, lbl) in enumerate(images_and_labels[:8]):
    # Build a 2x4 grid of subplots. Subplot index is 1-based.
    plt.subplot(2, 4, ix + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r,interpolation='nearest')
    plt.title('Training: ' + str(lbl))
    
plt.show()

### Principal Component Analysis (PCA)
When there are too many features, and thus too many dimensions (**64** in our example), the data cannot be plotted directly.<br>
With so many features the samples can also end up so far apart that it is hard to see what samples sharing the same label have in common.<br>
To visualize data with a high number of dimensions we can use one of the ***Dimensionality Reduction*** techniques. Here we'll use *PCA*: **Principal Component Analysis**.<br>
The idea in PCA is to find linear combinations of the original variables that retain most of the information. These new variables, the *principal components*, can replace the original ones.
#### The Method
A linear transformation method that yields the directions (principal components) that maximize the variance of the data.<br>
**Variance:** How far a set of data points is spread out.<br>
For more details you can read [Introduction to PCA](http://www.lauradhamilton.com/introduction-to-principal-component-analysis-pca).
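To make the "directions of maximum variance" idea concrete, here is a minimal sketch that computes the first two principal directions by hand with an SVD of the centered data and compares the resulting variances with scikit-learn's `PCA`. The variable names (`X_centered`, `manual_variance`, `sk_pca`) are made up for this illustration and are not part of the tutorial.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data

# PCA works on centered data: subtract the mean of every feature
X_centered = X - X.mean(axis=0)

# The rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project the samples onto the first two directions
projected = X_centered @ Vt[:2].T

# Variance of the data along each of the two directions (n-1 denominator)
manual_variance = projected.var(axis=0, ddof=1)

# Same thing through scikit-learn
sk_pca = PCA(n_components=2, svd_solver='full').fit(X)

print('Variance along manual directions :', manual_variance)
print('PCA explained_variance_          :', sk_pca.explained_variance_)
print('Share of total variance retained :', sk_pca.explained_variance_ratio_.sum())
```

The last line shows how much of the total variance survives the reduction to two dimensions, which is worth keeping in mind when reading the scatter plots.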

In [None]:
from sklearn.decomposition import PCA
# Create a randomized PCA model that takes 2 components
randomized_pca = PCA(svd_solver='randomized',n_components=2)
# Fit and transform the data to the PCA model
reduced_data_rpca = randomized_pca.fit_transform(digits_data)

For comparison, let's also generate a *full PCA* rather than the randomized one (it runs an exact full SVD, calling the standard LAPACK solver via scipy.linalg.svd, and selects the components by postprocessing).

In [None]:
# Create a full PCA model that takes 2 components
pca = PCA(svd_solver='full', n_components=2)
# now fit and transform the data with the full PCA
reduced_data_pca = pca.fit_transform(digits_data)

# let's see
print('Randomized PCA data shape:' + str(reduced_data_rpca.shape))
print('Full PCA data shape:' + str(reduced_data_pca.shape))

Having 2D data, we can plot it!<br>
Let's first see whether the randomized PCA reveals the distribution of the different labels, and whether we can clearly separate the instances from each other.

In [None]:
def plot_pca(pca_data, titleStr):
    colors = ['black', 'blue', 'green', 'yellow', 'red', 'purple', 'lime', 'cyan', 'orange', 'gray']
    markers = ['$0$', '$1$', '$2$', '$3$', '$4$','$5$', '$6$', '$7$', '$8$', '$9$']

    fig = plt.figure(figsize=(10,10))
    for dtix in range(len(colors)):
        x = pca_data[:, 0][digits_target == dtix]
        y = pca_data[:, 1][digits_target == dtix]
        plt.scatter(x,y, c=colors[dtix], marker=markers[dtix])

    plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.title(titleStr)
    plt.show()

In [None]:
plot_pca(reduced_data_rpca, 'Random PCA Scatter Plot')

Let's compare it with the Full PCA

In [None]:
plot_pca(reduced_data_pca, 'Full PCA Scatter Plot')

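Full vs Randomized ... <br>
... practically the same!<br>
This suggests that, for this dataset, the randomized solver approximates the first two components well. In such a case the randomized solver is a reasonable choice, since it is usually quicker on large datasets.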
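Rather than comparing the two scatter plots by eye, the agreement can be checked numerically. The sketch below recomputes both projections and correlates them component by component. The `random_state=0` argument is an assumption added here only to make the randomized run repeatable, and the absolute value is taken because a principal component is only defined up to its sign.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data

# Exact solver
full_2d = PCA(n_components=2, svd_solver='full').fit_transform(X)

# Approximate solver (random_state added for repeatability, not in the notebook)
rand_2d = PCA(n_components=2, svd_solver='randomized', random_state=0).fit_transform(X)

# Correlate each component of one projection with the same component of the other
correlations = [
    abs(np.corrcoef(full_2d[:, i], rand_2d[:, i])[0, 1])
    for i in range(2)
]
print('Absolute correlation per component:', correlations)
```

Values very close to 1 mean the two solvers produce essentially the same 2D embedding for this data.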

## Choosing the correct algorithm
Once we understand our data, we need to use the best algorithm for our problem.<br>
Use this map to think about what would be the best.

In [None]:
from IPython import display
display.Image(url='http://scikit-learn.org/stable/_static/ml_map.png')

## Preprocessing Data

Before modeling the data we need to prepare it first.<br>
### Data Normalization
Standardize the data.<br>
In the digits case we can make use of the `scale()` function.<br>
By scaling the data, you shift the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

In [None]:
from sklearn.preprocessing import scale

data = scale(digits.data)
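As a quick sanity check of what `scale()` does, the sketch below looks at the per-feature mean and standard deviation after scaling. One detail worth knowing: features that are constant in the raw data (for example pixels that are always blank) cannot be brought to unit variance, so they simply stay at zero. The names `col_means`, `col_stds` and `constant_cols` are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import scale

X = datasets.load_digits().data
X_scaled = scale(X)

# Per-feature statistics after scaling
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# Features with zero variance in the raw data
constant_cols = X.std(axis=0) == 0

print('Largest absolute mean after scaling:', np.abs(col_means).max())
print('Distinct standard deviations       :', np.unique(np.round(col_stds, 6)))
print('Number of constant features        :', constant_cols.sum())
```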

### Splitting data into Training and Testing Sets
The training and test sets are disjoint. A common choice is to take 2/3 of the original data set as the training set and the remaining 1/3 as the test set. Below we use `test_size=0.25`, so 3/4 of the samples go to training and 1/4 to testing.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into Train and Test sets
X_train, X_test, Y_train, Y_test, images_train, images_test = train_test_split(
    data, digits_target, digits.images, test_size=0.25, random_state=42)

The argument `random_state` has the value *42* assigned to it. With this argument, you can guarantee that your split will always be **the same**. That is particularly handy if you want reproducible results.<br>
Inspect the number of samples in the sets...

In [None]:
def separator_line():
    print(20*'____')

In [None]:
n_samples, n_features = X_train.shape
print('Number of training samples')
print(str(n_samples))
print('Number of training features')
print(str(n_features))
separator_line()
n_digits = len(np.unique(Y_train))
print('Number of Training labels (unique)')
print(str(n_digits))
print('Total number of Training labels')
print(str(len(Y_train)))
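The reproducibility claim about `random_state` can be verified directly. This small sketch splits the raw digits data three times: twice with the same seed and once with a different one (the seed `7` is an arbitrary value chosen for the illustration).

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Two splits with the same seed, one with a different seed
split_a = train_test_split(digits.data, digits.target, test_size=0.25, random_state=42)
split_b = train_test_split(digits.data, digits.target, test_size=0.25, random_state=42)
split_c = train_test_split(digits.data, digits.target, test_size=0.25, random_state=7)

# Compare the four returned arrays (X_train, X_test, y_train, y_test) pairwise
same_ab = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
same_ac = all(np.array_equal(a, c) for a, c in zip(split_a, split_c))

print('Same seed gives the same split     :', same_ab)
print('Different seed gives the same split:', same_ac)
print('Train / test sizes                 :', len(split_a[0]), '/', len(split_a[1]))
```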

### Clustering the data
Find the clusters of the training set. Use `KMeans()` from the `cluster` module to set up your model. Three arguments are passed to the constructor: `init`, `n_clusters` and `random_state`.

In [None]:
from sklearn import cluster

# Create the KMeans model (same random_state as when splitting)
clf = cluster.KMeans(init='k-means++', n_clusters=n_digits, random_state=42)

# Fit the model to the training data X_train
clf.fit(X_train)

The question is how to set the centers of the clusters. The number of centroids is set by `n_clusters`.<br>
The initial set of cluster centers can have a big effect on the clusters that are eventually found.
To deal with this effect, we usually try several initial sets in multiple runs and select the set of clusters with the minimum sum of squared errors (SSE). In other words, you want to minimize the distance of each point in the cluster to the mean or centroid of that cluster.<br>
By adding the `n_init` argument to `KMeans()`, you can determine how many different centroid initializations the algorithm will try. Note that more initializations **do not guarantee finding the best configuration** ...<br>
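To see why several initializations matter, here is a rough sketch of what `n_init` automates: run k-means a few times, each with a single initialization and a different seed, and keep the run with the lowest SSE (exposed by scikit-learn as `inertia_`). The seeds `0..4` and the names `inertias` and `best_seed` are arbitrary choices for this illustration.

```python
from sklearn import cluster, datasets
from sklearn.preprocessing import scale

X = scale(datasets.load_digits().data)

# One k-means run per seed, each with exactly one initialization
inertias = []
for seed in range(5):
    km = cluster.KMeans(init='k-means++', n_clusters=10, n_init=1, random_state=seed)
    km.fit(X)
    inertias.append(km.inertia_)

# Keep the run with the smallest sum of squared distances to the centroids
best_seed = min(range(5), key=lambda s: inertias[s])

print('Inertia per seed:', inertias)
print('Best seed       :', best_seed, 'with inertia', inertias[best_seed])
```

The spread between the inertias gives a feel for how much the result depends on the starting centroids. The best of five runs is still only the best of those five, not necessarily the global optimum.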

#### Visualizing the clusters

In [None]:
# fig = plt.figure(8,3)
fig = plt.figure(figsize=(8,3))
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

for imix in range(10):
    # Initialize subplots grid of 2x5, ONE based
    ax = fig.add_subplot(2,5,imix + 1)
    
    ax.imshow(clf.cluster_centers_[imix].reshape((8,8)), cmap=plt.cm.binary)
    plt.axis('off')

plt.show()

### Prediction
Predict the labels of the test data

In [None]:
Y_pred=clf.predict(X_test)
num2print = 25
print(Y_pred[:num2print])
print(Y_test[:num2print])
separator_line()
print('Cluster Centers Shape:')
clf.cluster_centers_.shape

Visualizing predicted labels

In [None]:
from sklearn.manifold import Isomap

# Create an Isomap and fit it to the training data
X_iso = Isomap(n_neighbors=n_digits).fit_transform(X_train)

# Compute cluster centers and predict the cluster index for each sample
clusters = clf.fit_predict(X_train)

# Create a plot with subplots in a grid of 1x2
fig, ax = plt.subplots(1,2, figsize=(10, 5))
fig.subplots_adjust(top=0.85)

# Add the scatter plots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_iso[:,0], X_iso[:,1], c=Y_train)
ax[1].set_title('Actual Training Labels')

plt.show()


Now try the same using the reduced data

In [None]:
X_pca = pca.fit_transform(X_train)

clusters_pca = clf.fit_predict(X_pca) 
clusters = clf.fit_predict(X_train) 
print('Clusters Shape: ' + str(clusters.shape))

fig, ax = plt.subplots(1,3, figsize=(15,5))

fig.suptitle('Predicted Versus Training PCA Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters_pca)
ax[1].set_title('Predicted PCA Training Labels')
ax[2].scatter(X_pca[:, 0], X_pca[:, 1], c=Y_train)
ax[2].set_title('Actual Training Labels')

plt.show()

### Evaluating the Clustered Model
Let's evaluate the performance (which looks pretty lame) of our model.<br>
#### The Confusion Matrix

In [None]:
from sklearn import metrics

print(metrics.confusion_matrix(Y_test, Y_pred))

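... crap. Only 5 landed on the diagonal in any number (41 times).<br>
Keep in mind that k-means cluster indices are arbitrary: cluster 3 need not correspond to the digit 3, so the diagonal of this matrix is not a fair measure on its own.<br>
Let's look into some *cluster quality metrics*.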

In [None]:
from sklearn.metrics import homogeneity_score, completeness_score,\
v_measure_score,adjusted_rand_score, adjusted_mutual_info_score,silhouette_score

print('inertia\thomo\tcomple\tv-meas\tARI\tAMI\tsilhouette')
print('%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f' %(clf.inertia_,
                                                homogeneity_score(Y_test, Y_pred),
                                                completeness_score(Y_test, Y_pred),
                                                v_measure_score(Y_test, Y_pred),
                                                adjusted_rand_score(Y_test, Y_pred),
                                                adjusted_mutual_info_score(Y_test, Y_pred),
                                                silhouette_score(X_test, Y_pred, metric='euclidean')))


* The **homogeneity** score tells you to what extent all of the clusters contain only data points which are members of a single class.
* The **completeness** score measures the extent to which all of the data points that are members of a given class are also elements of the same cluster.
* The **V-measure** score is the harmonic mean of homogeneity and completeness.
* The **Adjusted Rand Index (ARI)** measures the similarity between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in the predicted and true clusterings.
* The **Adjusted Mutual Information (AMI)** score is used to compare clusterings. It measures the agreement between the two assignments, adjusted for chance groupings, and takes a maximum value of 1 when the clusterings are equivalent.
* The **silhouette** score measures how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1, where a higher value indicates that the object is better matched to its own cluster and worse matched to neighboring clusters. If many points have a high value, the clustering configuration is good.
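A tiny hand-made example may help build intuition for these scores. In the sketch below, `swapped_ids` groups the samples perfectly but uses the "wrong" cluster numbers, while `split_ids` breaks every class into two pure clusters. All three label lists are invented for the illustration.

```python
from sklearn.metrics import (accuracy_score, adjusted_rand_score,
                             completeness_score, homogeneity_score,
                             v_measure_score)

true_labels = [0, 0, 0, 1, 1, 1]
swapped_ids = [1, 1, 1, 0, 0, 0]   # perfect grouping, cluster numbers swapped
split_ids = [0, 0, 1, 2, 2, 3]     # every class split into two pure clusters

for name, ids in [('swapped', swapped_ids), ('split', split_ids)]:
    print(name,
          'accuracy=%.2f' % accuracy_score(true_labels, ids),
          'homo=%.2f' % homogeneity_score(true_labels, ids),
          'comple=%.2f' % completeness_score(true_labels, ids),
          'v-meas=%.2f' % v_measure_score(true_labels, ids),
          'ARI=%.2f' % adjusted_rand_score(true_labels, ids))
```

The swapped clustering gets a plain accuracy of zero even though the clustering metrics rate it as perfect, because those metrics ignore how the clusters are numbered. The split clustering stays fully homogeneous but loses completeness.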


... crap indeed. For example:
The *silhouette* score is close to 0, which indicates that many samples are on or very close to the decision boundary between two neighboring clusters. This could mean that samples have been assigned to the wrong cluster.<br>
Also the *ARI* measure seems to indicate that not all data points in a given cluster are similar, and the *completeness* score tells you that there are definitely data points that weren't put in the right cluster.
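Because k-means numbers its clusters arbitrarily, one way to get a more readable confusion matrix is to first map every cluster to the digit that occurs most often among its training samples. The following is a minimal sketch of that idea, not part of the tutorial being followed. The names `cluster_to_digit` and `mapped_pred` are hypothetical, and `n_init=10` is set explicitly as an assumption.

```python
import numpy as np
from sklearn import cluster, datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    scale(digits.data), digits.target, test_size=0.25, random_state=42)

kmeans = cluster.KMeans(init='k-means++', n_clusters=10, n_init=10, random_state=42)
kmeans.fit(X_train)

# For every cluster, find the most frequent true digit among its training members
cluster_to_digit = np.zeros(10, dtype=int)
for c in range(10):
    members = y_train[kmeans.labels_ == c]
    cluster_to_digit[c] = np.bincount(members, minlength=10).argmax()

# Translate predicted cluster indices on the test set into digit labels
mapped_pred = cluster_to_digit[kmeans.predict(X_test)]

print(metrics.confusion_matrix(y_test, mapped_pred))
print('Accuracy after mapping:', metrics.accuracy_score(y_test, mapped_pred))
```

Note that the mapping uses the training labels, so this is no longer purely unsupervised. It only makes the clustering result easier to read.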


So we need a better predictor ...
### Support Vector Machines!
For KMeans we didn't need the labels. We just tried to gather samples together into clusters.<br>
Now let's use the targets together with the training data samples.

In [None]:
from sklearn import svm

svc_model_linear = svm.SVC(gamma=0.001, C=100, kernel='linear')  # note: gamma is ignored by the linear kernel

# Fit the data to the SVC model
fit_result_linear = svc_model_linear.fit(X_train, Y_train)

#### Model Estimation
We need a way to find good values for the parameters `gamma`, `C` and `kernel`.
Good values can be found automatically by using tools such as grid search and cross-validation.<br>
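Cross-validation on its own can be sketched as follows: the data is cut into 5 folds, and the model is trained 5 times, each time scored on the fold that was left out. The parameters `kernel='linear'`, `C=1` and `cv=5` are arbitrary choices for this illustration. For simplicity the data is scaled once up front, which a stricter setup would do inside each fold.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import scale

digits = datasets.load_digits()
X = scale(digits.data)

# One accuracy value per held-out fold
scores = cross_val_score(svm.SVC(kernel='linear', C=1), X, digits.target, cv=5)

print('Accuracy per fold:', scores)
print('Mean accuracy    :', scores.mean())
```

A grid search repeats this procedure for every parameter combination and keeps the combination with the best mean score.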
**Grid Search** example

In [None]:
from sklearn.model_selection import GridSearchCV

# define possibilities
parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Create a classifier with the options above
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# Train 
clf.fit(X_train, Y_train)

# Let's see ...
print('Best score for training data:', clf.best_score_)
print("Best 'C':", clf.best_estimator_.C)
print('Best kernel:', clf.best_estimator_.kernel)
print('Best gamma:', clf.best_estimator_.gamma)

We have a winner!<br>
kernel - rbf<br>
gamma - 0.001<br>
C - 1000<br>
Let's rebuild the SVM with these parameters and compare:

In [None]:
# Apply the classifier to the test data and check the accuracy score
accu = clf.score(X_test, Y_test)
print('Accuracy of grid search result: ', str(accu))

# Train and score a new classifier with the grid search parameters
svc_model = svm.SVC(gamma=clf.best_estimator_.gamma, C=clf.best_estimator_.C, kernel=clf.best_estimator_.kernel)
fitResult = svc_model.fit(X_train, Y_train)
accu = fitResult.score(X_test, Y_test)
print('Accuracy of SVM prediction:', accu)
accu = fit_result_linear.score(X_test, Y_test)
print('Accuracy of Linear SVM prediction:', accu)
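The grid search score and the score of the rebuilt SVM are expected to agree, because `GridSearchCV` (with its default `refit=True`) retrains the best parameter combination on the whole training set and keeps it as `best_estimator_`. Here is a small sketch under that assumption, using a reduced grid (`small_grid`, a made-up name) to keep it quick.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    scale(digits.data), digits.target, test_size=0.25, random_state=42)

# Reduced grid, for illustration only
small_grid = {'C': [1, 10], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}

grid = GridSearchCV(svm.SVC(), param_grid=small_grid)
grid.fit(X_train, y_train)

# Scoring through the search object delegates to the refitted best estimator
score_via_grid = grid.score(X_test, y_test)
score_via_best = grid.best_estimator_.score(X_test, y_test)

print('Best parameters:', grid.best_params_)
print('Score via grid :', score_via_grid)
print('Score via best :', score_via_best)
```

So in practice `best_estimator_` can be used directly for predictions, without training a second model by hand.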

#### Prediction
Now let's check a few predictions (here from the linear model) to see how well we are doing.

In [None]:
Y_predict = svc_model_linear.predict(X_test)

print(Y_predict[:num2print])
print(Y_test[:num2print])

Let's visualize the predictions

In [None]:
images_and_predictions = list(zip(images_test, Y_predict))
numOfDrawings = 8

fig = plt.figure(figsize=(2*numOfDrawings,2))
for imix, (image, prediction) in enumerate(images_and_predictions[:numOfDrawings]):
    plt.subplot(1, numOfDrawings, imix+1)
    plt.axis('off')
    
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Predicted: ' + str(prediction))
    
plt.show()

And how did our model perform?

In [None]:
print('Classification report of y_test and predicted')
print(metrics.classification_report(Y_test, Y_predict))
separator_line()
print('Confusion Matrix')
print(metrics.confusion_matrix(Y_test, Y_predict))

And finally, the isomap

In [None]:
predicted = svc_model_linear.predict(X_train)
fig, ax = plt.subplots(1,2,figsize=(10,5))
fig.subplots_adjust(top=0.85)

ax[0].scatter(X_iso[:,0], X_iso[:,1],c=predicted)
ax[0].set_title('Predicted Labels')
ax[1].scatter(X_iso[:,0], X_iso[:,1],c=Y_train)
ax[1].set_title('Actual Labels')

fig.suptitle('Predicted vs Actual Labels', fontsize=14, fontweight='bold')

plt.show()