**scikit-learn User Guide**

http://scikit-learn.org/stable/user_guide.html

http://scikit-learn.org/stable/modules/classes.html

# Supervised Learning

In this session, we will go through an example of supervised learning using the Iris data set. This data set contains sepal and petal measurements (length and width) for three species of iris (Setosa, Versicolour, and Virginica), stored in a 150x4 `numpy.ndarray`.

The columns of the array represent the four measured attributes: Sepal Length, Sepal Width, Petal Length and Petal Width.

Each row of the array is one observation, that is, the four measurements of a single flower.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn import datasets

In [None]:
iris = datasets.load_iris()

We would like to know what kind of object `iris` is. We can first find out its type with the built-in function `type()`.

In [None]:
type(iris)

`Bunch` is a dictionary subclass that supports attribute-style access, which means its keys can also be read as object attributes (`iris.data` returns the same thing as `iris['data']`). Now we can check which keys the data set contains.
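
As a minimal sketch of what attribute-style access means, the snippet below checks that the attribute lookup and the key lookup return the same object. The variable names `is_dict`, `same_data` and `same_target` are only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Bunch subclasses dict, so every dictionary operation still works
is_dict = isinstance(iris, dict)

# Attribute access and key access return the very same array object
same_data = iris.data is iris['data']
same_target = iris.target is iris['target']

print(is_dict, same_data, same_target)
```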

In [None]:
iris.keys()

In [None]:
iris.target_names  # access iris data like an object attribute

In [None]:
print(iris.DESCR)

In [None]:
iris['target'] # another way to access iris data like a dictionary

In [None]:
iris['data']

In [None]:
iris.data.shape

In [None]:
type(iris.data)

**Exploratory data analysis**

In [None]:
X=iris.data
y=iris.target

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 

In [None]:
df = pd.DataFrame(X,columns=iris.feature_names)

In [None]:
print(iris.feature_names)

In [None]:
print(df.head())
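
The point about columns of different types can be illustrated by adding a text column next to the numeric measurements. This is only a sketch: the name `df_labeled` and the `species` column are assumptions made for the example, not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df_labeled = pd.DataFrame(iris.data, columns=iris.feature_names)

# target_names is a NumPy array, so indexing it with the integer
# targets yields one species name per row
df_labeled['species'] = iris.target_names[iris.target]

# Four float columns plus one text column
print(df_labeled.dtypes)

# Mean of each measurement per species
print(df_labeled.groupby('species').mean())
```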

**Visual exploratory data analysis**

In [None]:
# pd.tools.plotting was removed from pandas; scatter_matrix now lives in pd.plotting
_fig = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D')

In [None]:
plt.show()

## k-NN (k-nearest neighbors) Classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
# (the features and targets are NumPy arrays)
knn.fit(iris['data'], iris['target']) 
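
To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of a k-NN prediction: find the k training points closest to a query and take a majority vote over their labels. The helper `knn_predict_one` and the query measurements are hypothetical, and the sketch ignores details such as tie-breaking and the faster neighbor search that scikit-learn can use.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_all, y_all = iris.data, iris.target


def knn_predict_one(X_train, y_train, x, k=6):
    """Hypothetical helper: majority vote among the k closest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))


# Made-up measurements that resemble a setosa flower
x_query = np.array([5.0, 3.5, 1.4, 0.2])

manual = knn_predict_one(X_all, y_all, x_query, k=6)
sk = KNeighborsClassifier(n_neighbors=6).fit(X_all, y_all).predict(x_query.reshape(1, -1))[0]

print(iris.target_names[manual], iris.target_names[sk])
```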

In [None]:
iris['data'].shape

In [None]:
iris['target'].shape

Now let us predict unlabeled data.

In [None]:
prediction = knn.predict(X_new) #X_new is not defined

**Exercise 1:**

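Create X_new to finish the prediction. The scikit-learn API will accept X_new as long as it has the right shape: a 2-D array with one row per observation and the same four feature columns used for training. To score the predictions, X_new must also have as many rows as y_new.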

In [None]:
X_new =  ____
y_new = iris['target'][20:100]

prediction = knn.predict(X_new)
knn.score(X_new,y_new)
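
The shape requirement is easiest to see with a single observation. The sketch below uses made-up measurements and the illustrative names `knn_demo`, `flower` and `batch`. It assumes a recent scikit-learn, where a 1-D input is rejected with a `ValueError`.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn_demo = KNeighborsClassifier(n_neighbors=6).fit(iris.data, iris.target)

# One flower as a 1-D array of shape (4,): not accepted by predict()
flower = np.array([5.9, 3.0, 5.1, 1.8])
try:
    knn_demo.predict(flower)
    rejected_1d = False
except ValueError:
    rejected_1d = True

# Reshape to (1, 4): one row (observation), four columns (features)
batch = flower.reshape(1, -1)
pred = knn_demo.predict(batch)

print(batch.shape, pred.shape, rejected_1d)
```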

## Measuring model performance

In practice, we do not train a model on the entire data set. Rather, we split the data into separate portions for training and for validation, so that performance is measured on observations the model has not seen.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=21, stratify=y)
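
The effect of `stratify=y` can be checked by counting the classes on each side of the split. This is a small sketch using the same arguments as the cell above. The names `X_tr`, `X_te`, `y_tr`, `y_te`, `train_counts` and `test_counts` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target
)

# Number of samples per class in each portion
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)

print("train:", train_counts)
print("test: ", test_counts)
```

Because the Iris classes are balanced, a stratified split keeps them balanced in both portions.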

In [None]:
knn = KNeighborsClassifier(n_neighbors=8)

In [None]:
knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
print("Test set predictions:\n {}".format(y_pred))

In [None]:
print("Test set actual values: \n {}".format(y_test))

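scikit-learn classifiers provide a built-in `.score()` method, which for a classifier returns the mean accuracy on the given test data and labels.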

In [None]:
knn.score(X_test, y_test)
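
As a sketch of what `.score()` computes for a classifier, the snippet below compares it with the fraction of correct predictions computed by hand and with `sklearn.metrics.accuracy_score`. The variable names are illustrative, and the split repeats the arguments used earlier.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target
)

model = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

built_in = model.score(X_te, y_te)       # mean accuracy
by_hand = np.mean(y_hat == y_te)         # fraction of correct predictions
from_metrics = accuracy_score(y_te, y_hat)

print(built_in, by_hand, from_metrics)
```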

**Exercise 2:**

Data set
scikit-learn includes a small handwritten digits dataset with 10 classes, the digits 0 through 9. It is often described as a reduced version of MNIST, although it actually comes from the UCI optical recognition of handwritten digits data. This is the dataset we will use in this exercise.
Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black.

Instructions
* Import datasets from sklearn and matplotlib.pyplot as plt.
* Load the digits dataset using the .load_digits() method on datasets.
* Print the keys and DESCR of digits.
* Print the shape of the images and data keys using the . notation.
* Display the image at index 1010 using plt.imshow(). This has been done for you.

In [None]:
# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = ____

# Print the keys and DESCR of the dataset
print(____)
print(____)

# Print the shape of the images and data keys
print(____.____.shape)
print(____.____.shape)

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

In [None]:
print(digits.keys())

In [None]:
print(digits.images[3])

In [None]:
print(digits.data[3])

In [None]:
print(digits.images.shape)

In [None]:
print(digits.data.shape)
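
The two shapes printed above are related: each row of `data` is the corresponding 8x8 entry of `images` flattened into 64 values. A minimal sketch of that check follows, using the illustrative names `digits_demo`, `flattened` and `same`.

```python
import numpy as np
from sklearn import datasets

digits_demo = datasets.load_digits()

# Flatten every 8x8 image into a row of 64 pixel values
flattened = digits_demo.images.reshape(len(digits_demo.images), -1)
same = np.array_equal(flattened, digits_demo.data)

print(digits_demo.images.shape, digits_demo.data.shape, same)

# Pixel intensities stay within the 0 to 16 range
print(digits_demo.data.min(), digits_demo.data.max())
```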

**Exercise 3:**

Train/Test Split + Fit/Predict/Accuracy

After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the .score() method.

Instructions
* Import KNeighborsClassifier from sklearn.neighbors and train_test_split from sklearn.model_selection.
* Create an array for the features using digits.data and an array for the target using digits.target.
* Create stratified training and test sets using 0.25 for the size of the test set. Use a random state of 30. Stratify the split according to the labels so that they are distributed in the training and test sets as they are in the original dataset.
* Create a k-NN classifier with 7 neighbors and fit it to the training data.
* Compute and print the accuracy of the classifier's predictions using the .score() method.

In [None]:
# Import necessary modules
____
____

# Create feature and target arrays
X = ____
y = ____

# Split into training and test set
X_train, X_test, y_train, y_test = ____(____, ____, test_size = ____, random_state=____, stratify=____)

# Create a k-NN classifier with 7 neighbors: knn
knn = ____

# Fit the classifier to the training data
____

# Print the accuracy
print(knn.score(____, ____))


**Exercise 4:**

By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

The training and testing sets are available to you in the workspace as X_train, X_test, y_train, y_test. In addition, KNeighborsClassifier has been imported from sklearn.neighbors.

Instructions
* Inside the for loop:
  * Set up a k-NN classifier with the number of neighbors equal to k.
  * Fit the classifier with k neighbors to the training data.
  * Compute accuracy scores on the training set and the test set separately using the .score() method, and assign the results to the train_accuracy and test_accuracy arrays respectively.



In [None]:
# Set up arrays to store train and test accuracies
neighbors = np.arange(1, 30)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Set up a k-NN classifier with k neighbors: knn
    knn = ____

    # Fit the classifier to the training data
    ____

    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(____, ____)

    # Compute accuracy on the testing set
    test_accuracy[i] = ____(____, ____)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
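
For comparison, the same idea can be sketched on the Iris data from earlier, without giving away the digits exercise. A very small k typically fits the training set almost perfectly (a more complex model, prone to overfitting), while a very large k averages over so many neighbors that both scores can drop (underfitting). The values of k and the variable names below are arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target
)

# Map each k to its (training accuracy, test accuracy) pair
results = {}
for k in (1, 8, 50):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for k, (train_acc, test_acc) in results.items():
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```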