## Exploring a well-known dataset for machine learning

scikit-learn ships with several ready-to-use sample datasets. For simplicity, the following example loads the Iris dataset directly from scikit-learn.

In [3]:
from sklearn.datasets import load_iris
iris = load_iris()

In [4]:
# The feature (column) names; the response follows in the next cells
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [6]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [7]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [8]:
# The object types of the feature matrix and the response array
type(iris.data)

numpy.ndarray

In [9]:
type(iris.target)

numpy.ndarray

In [10]:
# The shapes of samples and features
iris.data.shape

(150, 4)

In [11]:
iris.target.shape

(150,)

## Training models and classification
After loading the dataset and verifying that it meets the requirements for working with
scikit-learn, it is time to use it to train a model and classify a new observation.
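
The checks performed in the cells above can be gathered into a few assertions. This is a minimal sketch, assuming the requirements in question are the usual ones: features and response stored as separate numeric NumPy arrays, a two-dimensional feature matrix, and one response value per sample. The variable names `n_samples` and `n_features` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Features and response are separate NumPy arrays
assert isinstance(X, np.ndarray) and isinstance(y, np.ndarray)
# Both hold numeric values
assert np.issubdtype(X.dtype, np.number) and np.issubdtype(y.dtype, np.number)
# X is 2-D (samples x features), y is 1-D
assert X.ndim == 2 and y.ndim == 1
# There is exactly one response value per sample
assert X.shape[0] == y.shape[0]

n_samples, n_features = X.shape
print(n_samples, n_features)  # 150 4, matching the shapes shown earlier
```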

With LinearSVC (a linear Support Vector Classifier), the classes are separated by hyperplanes.
A Support Vector Machine is the model that chooses each hyperplane so as to maximize the margin, that is, the distance between the hyperplane and the closest training samples on either side. The response is given by
the class on whose side of the boundary a new observation falls.
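
To make the hyperplane description concrete, the sketch below inspects a fitted LinearSVC. With three classes and the one-vs-rest scheme (`multi_class='ovr'`), the model holds one hyperplane per class, and the predicted class is the one with the largest decision score. The values `max_iter=10000` and `random_state=0` are assumptions made here to give the solver room to converge and to keep runs repeatable; they are not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

iris = load_iris()
# max_iter and random_state are illustrative choices, not the original settings
svc = LinearSVC(max_iter=10000, random_state=0).fit(iris.data, iris.target)

# One weight vector (hyperplane normal) per class, one weight per feature
print(svc.coef_.shape)       # (3, 4)
print(svc.intercept_.shape)  # (3,)

sample = np.array([[6.3, 3.3, 6.0, 2.5]])
# Signed score of the sample with respect to each class's hyperplane
scores = svc.decision_function(sample)  # shape (1, 3)
# The predicted class is the one with the largest score
predicted = svc.classes_[np.argmax(scores, axis=1)]
print(predicted, svc.predict(sample))
```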

With the K-nearest neighbors model, a prediction is made for a new observation
by searching through the entire training dataset for the “K” most similar
observations (neighbors) based on the distance between them and the new sample.
The response is then given by the class with the highest number of neighbor
occurrences.
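
The procedure just described can be sketched in a few lines of NumPy. This is an illustration of the idea on a made-up toy dataset, not scikit-learn's implementation: the function name `knn_predict` and the arrays `X_toy` and `y_toy` are hypothetical, Euclidean distance is assumed, and ties are resolved by whichever label `Counter` returns first.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent class label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters (hypothetical values)
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.0, 4.9]), k=3))  # 1
```

Note that every prediction scans the whole training set, which is why the cost of this model grows with the number of stored samples.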

In [12]:
# Import LinearSVC class
from sklearn.svm import LinearSVC
# Import KNeighborsClassifier class
from sklearn.neighbors import KNeighborsClassifier
# Assign to variables for more convenient handling
X = iris.data
y = iris.target

In [14]:
# Create an instance of the LinearSVC classifier
clf = LinearSVC()
# Train the model
clf.fit(X, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [15]:
# Get the mean accuracy of the LinearSVC classifier on the training data
clf.score(X, y)

0.9666666666666667

In [16]:
# Predict the response given a new observation
clf.predict([[ 6.3, 3.3, 6.0, 2.5]])

array([2])

In [17]:
# Create an instance of KNeighborsClassifier
# The default number of K neighbors is 5.
# This can be changed by passing n_neighbors=k as argument
knnDefault = KNeighborsClassifier() # K = 5
# Train the model
knnDefault.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [18]:
# Get the mean accuracy of KNeighborsClassifier with K = 5 on the training data
knnDefault.score(X, y)

0.9666666666666667

In [19]:
# Predict the response given a new observation
knnDefault.predict([[ 6.3, 3.3, 6.0, 2.5]])

array([2])

In [20]:
# Let's try a different number of neighbors
knnBest = KNeighborsClassifier(n_neighbors=10) # K = 10
# Train the model
knnBest.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [22]:
# Get the mean accuracy of KNeighborsClassifier with K = 10 on the training data
knnBest.score(X, y)

0.98

In [23]:
# Predict the response given a new observation
knnBest.predict([[ 6.3, 3.3, 6.0, 2.5]])

array([2])

In [24]:
# Let's try a different number of neighbors
knnWorst = KNeighborsClassifier(n_neighbors=100) # K = 100
# Train the model
knnWorst.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=100, p=2,
           weights='uniform')

In [25]:
# Get the mean accuracy of KNeighborsClassifier with K = 100 on the training data
knnWorst.score(X, y)

0.66

In [26]:
# Predict the response given a new observation
knnWorst.predict([[ 6.3, 3.3, 6.0, 2.5]])

array([1])
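
The three K values tried above can also be compared in a single loop. The sketch below is one possible way to do it; the dictionary name `scores` is chosen here for illustration. It also hints at why K = 100 performs poorly: each class has only 50 samples, so at least half of any 100 neighbors must come from other classes.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

scores = {}
for k in (5, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # score() returns the mean accuracy, here on the same data used for fitting
    scores[k] = knn.score(X, y)

for k, acc in scores.items():
    print(f"K = {k:3d}: training accuracy = {acc:.2f}")
```

Because the accuracy is measured on the data the model was fitted on, these numbers are useful for comparing K values against each other but likely overstate performance on unseen samples.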

In [28]:
import pandas as pd
# Build a DataFrame from the feature matrix, using the feature names as columns
data = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
data.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
