# Hello scikit-learn!

In the hallowed tradition started by K & R in their book *The C Programming Language*, what follows is a "Hello World" program for [scikit-learn](http://scikit-learn.org/stable/), the well-known machine learning package for Python.

One of the strengths of the scikit-learn package is the consistent interface that it provides to users. The steps in using a model in scikit-learn are as follows:

1. Import the class to use
2. Instantiate the class, optionally tuning hyperparameters
3. Fit the model using the training dataset
4. Make predictions using the fitted model

In return for providing this consistent interface, scikit-learn expects the following:

1. Separate objects for features and responses
2. Numeric objects for features and responses
3. Features and responses in [NumPy](http://www.numpy.org/) arrays
4. Specific shapes for features and responses
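
These expectations can be made concrete with a small NumPy sketch. The values below are made up for illustration; only the shapes matter: a two-dimensional feature array with one row per sample and one column per feature, and a separate one-dimensional response array with one entry per sample.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 numeric features each
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.7, 3.1]])

# One numeric response per sample, kept in a separate object
y = np.array([0, 0, 1])

print(X.shape)  # (3, 2), i.e. (n_samples, n_features)
print(y.shape)  # (3,), i.e. (n_samples,)

# The number of rows in X must match the number of entries in y
assert X.shape[0] == y.shape[0]
```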

Time to see all this in action using the famous [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which is included in the scikit-learn package. Features in the dataset will be used to train a k-nearest neighbour classifier from the scikit-learn library. The model will then be used to predict the species of previously unseen iris flowers.

In [131]:
# import the method to load iris dataset
from sklearn.datasets import load_iris

# load the iris data into a variable named iris
iris = load_iris()

# check the type of iris. It turns out that scikit-learn stores the iris dataset
# in a container type called "Bunch"
type(iris)

sklearn.datasets.base.Bunch

Features and feature names are in the attributes *data* and *feature_names*. Similarly, targets and target names are in the attributes *target* and *target_names*. Let's explore these and other attributes.

In [132]:
# print the feature names
print(iris.feature_names)


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [133]:
# print the first 5 lines of the features
print(iris.data[:5])

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]


In [134]:
# print the last 5 lines of the features
print(iris.data[-5:])

[[ 6.7  3.   5.2  2.3]
 [ 6.3  2.5  5.   1.9]
 [ 6.5  3.   5.2  2. ]
 [ 6.2  3.4  5.4  2.3]
 [ 5.9  3.   5.1  1.8]]


In [135]:
# print the shape of the features, i.e. number of rows and columns
print(iris.data.shape)

(150, 4)


The above output shows that there are 150 samples of iris flowers, each described by 4 features.

In [137]:
# finally, let's check the data type of the features. 
# scikit-learn expects features to be in NumPy arrays
type(iris.data)

numpy.ndarray

In [138]:
# let's do the same for the targets, starting with the target names
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [139]:
# print the first 5 targets
print(iris.target[:5])

[0 0 0 0 0]


The targets, which are the species of the iris flowers, are encoded as integers. The first 5 samples all belong to the same species, *setosa*, which is encoded as 0.
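
Because each integer code is an index into *target_names*, decoding is a simple lookup. A minimal sketch, with the names and codes copied by hand from the outputs above rather than loaded from scikit-learn:

```python
# Species names in the order printed by iris.target_names above
target_names = ['setosa', 'versicolor', 'virginica']

# The first 5 target codes, copied from the output above
first_five = [0, 0, 0, 0, 0]

# Each integer code is an index into target_names
decoded = [target_names[code] for code in first_five]
print(decoded)  # ['setosa', 'setosa', 'setosa', 'setosa', 'setosa']
```

With the loaded dataset, NumPy integer-array indexing such as `iris.target_names[iris.target[:5]]` should perform the same lookup in one step.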

In [140]:
# print the last 5 targets
print(iris.target[-5:])

[2 2 2 2 2]


In [141]:
# check the shape of the target dataset
print(iris.target.shape)

(150,)


The above output shows that the target is a one-dimensional array with 150 entries, one per sample.

In [142]:
# check the data type of target. scikit-learn expects NumPy arrays
print(type(iris.target))

<class 'numpy.ndarray'>


The next step is to split the iris dataset into training and test datasets. A common rule of thumb for this split is 80/20: 80% of the dataset goes into the training dataset and the remaining 20% into the test dataset.

It is easy enough to slice the iris dataset into training and test datasets by hand, but scikit-learn provides a function, *train_test_split()*, for this purpose. Among other things, this function has an argument for the split size, *test_size*. Because the samples are randomly assigned to the training and test datasets, the function also provides an argument, *random_state*, that can be set to reproduce the same split later.
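
To make the role of the seed concrete, here is a minimal pure-Python sketch of a seeded random split. The helper `simple_split` is hypothetical and is not how scikit-learn implements *train_test_split()*; it only illustrates that shuffling with a fixed seed yields the same partition every time.

```python
import random

def simple_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) after a seeded shuffle.

    Illustrative helper only; not scikit-learn's implementation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same order
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(150, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))  # 105 45

# Repeating the call with the same seed reproduces the split
assert simple_split(150, 0.3, seed=0) == (train_idx, test_idx)
```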

The following code performs a 70/30 split of the iris dataset into training and test datasets.

In [143]:
# import train_test_split()
from sklearn.model_selection import train_test_split

# hold out 30% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

In [144]:
print(len(X_train), len(X_test), len(y_train), len(y_test))

105 45 105 45


In [145]:
print(iris.data.shape, iris.target.shape)

(150, 4) (150,)


Now that the iris dataset has been split into training and test sets, it is time to perform the four steps involved in using scikit-learn models as mentioned above.

In [146]:
# import the k-nearest neighbour classifier and instantiate it with the number of neighbors set to 6
# (an example of setting a hyperparameter), fit the model with the training dataset, and then use
# the model to predict the species of the iris flowers in the test dataset

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 6)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
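
For intuition about what a k-nearest neighbour classifier does at prediction time, here is a minimal pure-Python sketch. It assumes Euclidean distance and a simple majority vote; the helper `knn_predict` and the sample points are made up for illustration and are not scikit-learn's code.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict a label for query by majority vote among its k nearest points.

    Illustrative helper using Euclidean distance; not scikit-learn's code.
    """
    # Pair each training label with its distance to the query point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest training points
    nearest = [label for _, label in sorted(distances)[:k]]
    # The most common label among the neighbours wins
    return Counter(nearest).most_common(1)[0][0]

# Made-up 2-feature samples: class 0 near (1, 1), class 1 near (5, 5)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(1.1, 1.0), k=3))  # 0
print(knn_predict(points, labels, query=(5.1, 5.0), k=3))  # 1
```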

The predicted values for the iris flowers in the test dataset are saved in the variable *y_pred*. How accurately did the knn classifier predict the species of the test dataset? To answer this question, scikit-learn provides some utility functions.

In [147]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))

0.977777777778


The model correctly predicted the species of about 98% of the test samples (44 of 45). Not bad for a simple model like knn.
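
The accuracy score is the fraction of predictions that match the true labels. A minimal sketch of that calculation, using a hypothetical helper and made-up label lists rather than the real *y_test* and *y_pred*:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels: 45 samples with exactly one wrong prediction
y_true_demo = [0] * 15 + [1] * 15 + [2] * 15
y_pred_demo = list(y_true_demo)
y_pred_demo[20] = 2  # change one prediction from class 1 to class 2

print(accuracy(y_true_demo, y_pred_demo))  # 44 correct out of 45
```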