<div style="color:red;background-color:black">
Diamond Light Source

<h1 style="color:red;background-color:antiquewhite"> Python Fundamentals: SciKitLearn</h1>  

©2000-20 Chris Seddon 
</div>

## 1
Execute the following cell to activate styling for this tutorial.

In [None]:
from IPython.display import HTML
HTML(f"<style>{open('my.css').read()}</style>")

## 2
The iris flower is an excellent starting point for studying machine learning.  There are 3 common species of iris: setosa, versicolor and virginica.  What is interesting about these species is that they are easy to classify if we measure their petal and sepal width and length.  Each species clusters around a region in 4-dimensional space (the dimensions being petal width, petal length, sepal width and sepal length).

Let's start with a picture:

In [None]:
from PIL import Image

image = Image.open("images/iris.png")
image

## 3
Let's examine our dataset with Pandas:

In [None]:
import pandas as pd
pd.set_option('display.width', None)        # None lets pandas auto-detect the display width


# load iris data set
iris_df = pd.read_csv("data/iris.csv")
print(iris_df)
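
Printing the whole frame shows the raw rows; a per-species summary can make the clustering easier to see. The sketch below is only an illustration. It assumes the CSV uses the column names sepal_length, sepal_width, petal_length, petal_width and species, and the six inline rows are a hand-typed sample standing in for data/iris.csv so that the cell runs on its own.

```python
import pandas as pd

# Hand-typed sample standing in for data/iris.csv (column names are assumed)
sample_df = pd.DataFrame(
    [
        [5.1, 3.5, 1.4, 0.2, "setosa"],
        [4.9, 3.0, 1.4, 0.2, "setosa"],
        [7.0, 3.2, 4.7, 1.4, "versicolor"],
        [6.4, 3.2, 4.5, 1.5, "versicolor"],
        [6.3, 3.3, 6.0, 2.5, "virginica"],
        [5.8, 2.7, 5.1, 1.9, "virginica"],
    ],
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"],
)

# number of rows recorded for each species
counts = sample_df["species"].value_counts()

# mean of every measurement, one row per species
species_means = sample_df.groupby("species").mean()

print(counts)
print(species_means)
```

With the real file, replacing `sample_df` by `iris_df` should give the same kind of table over all rows.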

## 4
It will be helpful to add a new column to our dataframe where we use integers for the species.  We'll call this column 'target':

In [None]:
import pandas as pd
pd.set_option('display.width', None)        # None lets pandas auto-detect the display width


# load iris data set
iris_df = pd.read_csv("data/iris.csv")
target_names = {'setosa':0, 'versicolor':1, 'virginica':2}
iris_df["target"] = iris_df["species"].map(target_names)   # species name -> integer code
print(iris_df.sample(10))

## 5
With machine learning, we use the 4-dimensional space of (sepal_length, sepal_width, petal_length and petal_width) to classify the irises.  To get a feel for this, we would like to plot every iris in that 4-dimensional space, but that's impossible.  What we can do instead is leave out one measurement at a time, which gives four 3-dimensional views of the 4-dimensional space:

In [None]:
%matplotlib inline
from sklearn.datasets import load_iris
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import pandas as pd

colors = ["red", "green", "blue", "black", "black"]
markers = ["o", "o", "o", "D", "D"]
sizes = [10, 10, 10, 100, 100]
labels = ['sepal length', 'sepal width', 'petal length', 'petal width']
target_names = {'setosa':0, 'versicolor':1, 'virginica':2}

iris_df = pd.read_csv("data/iris.csv")
iris_df["target"] = iris_df["species"].map(target_names)   # species name -> integer code
iris_df.drop(["species"], axis = 1, inplace = True)

# plot
figure = plt.figure(figsize=(16, 12))
        
def scatter(subplot, i, j, k):
    def doit(row):
        n = int(row[4])          
        ax.scatter(row[i], row[j], row[k],   
                   c=colors[n], 
                   marker=markers[n],
                   s=sizes[n])

    ax = figure.add_subplot(subplot, projection='3d')
    ax.set_xlabel(labels[i])
    ax.set_ylabel(labels[j])
    ax.set_zlabel(labels[k])
    iris_df.apply(doit, axis=1, raw=True)

scatter(221, 0, 1, 2)
scatter(222, 0, 1, 3)
scatter(223, 0, 2, 3)
scatter(224, 1, 2, 3)
plt.show()
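
The four `scatter` calls above are exactly the ways of choosing 3 measurements out of 4. A small check of that claim, using only the standard library and the same `labels` list:

```python
from itertools import combinations

labels = ['sepal length', 'sepal width', 'petal length', 'petal width']

# every way of picking 3 of the 4 column indices, in sorted order
triples = list(combinations(range(len(labels)), 3))

for i, j, k in triples:
    print((i, j, k), "->", labels[i], "/", labels[j], "/", labels[k])
```

Each printed triple corresponds to one subplot: the three indices are the columns used for the x, y and z axes.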

## 6
Now we take our dataframe and use it to train our model.  There are several different models we can use.  We will be using:<pre>KNeighborsClassifier
LogisticRegression</pre>

Training the model involves separating the 4 "parameters" (sepal_length, sepal_width, petal_length and petal_width) from the "target" column in our dataframe and passing both pieces, a dataframe of parameters and a series of targets, to an estimator.  In our case, the estimator will have 150 rows to work with.

Next we introduce 3 new irises that haven't been classified yet.  Once a model has been trained, it can predict which species it thinks each new iris belongs to.

Note the use of "iloc" to extract columns from the dataframe.

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

target_names = {'setosa':0, 'versicolor':1, 'virginica':2}

iris_df = pd.read_csv("data/iris.csv")
iris_df["target"] = iris_df["species"].map(target_names)   # species name -> integer code
iris_df.drop(["species"], axis = 1, inplace = True)

# split into a parameters dataframe and a target series to pass to the estimator
parameters = iris_df.iloc[:,[0,1,2,3]]
target = iris_df.iloc[:,4]

# define three unclassified irises
iris1 = [4.1, 3.1, 1.8, 0.5]
iris2 = [6.9, 3.5, 3.5, 2.5]
iris3 = [6.7, 2.0, 5.0, 1.6] 

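# wrap the new irises in a dataframe with the training column names,
# which avoids scikit-learn's feature-name warning at predict time
new_irises = pd.DataFrame([iris1, iris2, iris3], columns=parameters.columns)

def predict(estimator, message):
    estimator.fit(parameters, target)                # train
    print(message, estimator.predict(new_irises))    # predict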

# predict with different estimators and parameters
predict(KNeighborsClassifier(n_neighbors=1), "KNeighbors(K=1):")
predict(KNeighborsClassifier(n_neighbors=3), "KNeighbors(K=3):")
predict(KNeighborsClassifier(n_neighbors=5), "KNeighbors(K=5):")
predict(LogisticRegression(solver='lbfgs', max_iter=150), "LogisticRegression:")
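
The estimators report integer codes rather than species names. A minimal sketch for turning codes back into names, assuming the same `target_names` mapping; the `predicted` list is a made-up example for illustration, not real model output:

```python
target_names = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# invert the mapping: integer code -> species name
species_by_code = {code: name for name, code in target_names.items()}

# hypothetical predictions, for illustration only
predicted = [0, 2, 1]
decoded = [species_by_code[int(code)] for code in predicted]
print(decoded)
```

The `int(...)` conversion is there because an estimator returns NumPy integers, which the plain dictionary lookup would otherwise receive directly.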

## 7
As you can see, there is some disagreement over what species the third iris belongs to.  
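
One reason the K-nearest-neighbours results can differ from each other is that the vote depends on how many neighbours are consulted. The following is a minimal from-scratch sketch of that idea, not scikit-learn's implementation. The helper `knn_predict` and the tiny 2-dimensional `points` list are hypothetical, chosen so that K=1 and K=3 disagree:

```python
from collections import Counter
from math import dist

# hypothetical labelled points: (coordinates, class label)
points = [
    ((1.0, 1.0), 1),
    ((3.0, 3.0), 2),
    ((3.2, 3.1), 2),
    ((0.0, 0.0), 1),
]

def knn_predict(query, k):
    """Return the majority label among the k points closest to query."""
    nearest = sorted(points, key=lambda p: dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

query = (1.9, 1.9)
print("K=1:", knn_predict(query, 1))
print("K=3:", knn_predict(query, 3))
```

The single closest point has label 1, but two of the three closest points have label 2, so the answer changes with K. A new iris that lies between two species clusters can behave the same way.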

Let's look at our 4 plots again, this time with the 3 new irises plotted.  They are drawn as large diamonds colored black, grey and silver:

In [None]:
%matplotlib inline
from sklearn.datasets import load_iris
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import pandas as pd

colors = ["red", "green", "blue", "black", "grey", "silver"]
markers = ["o", "o", "o", "D", "D", "D"]
sizes = [10, 10, 10, 100, 100, 100]
labels = ['sepal length', 'sepal width', 'petal length', 'petal width']
target_names = {'setosa':0, 'versicolor':1, 'virginica':2}

iris_df = pd.read_csv("data/iris.csv")
iris_df["target"] = iris_df["species"].map(target_names)   # species name -> integer code
iris_df.drop(["species"], axis = 1, inplace = True)
# the same three irises passed to the estimators; the last value indexes colors/markers/sizes
iris1 = [4.1, 3.1, 1.8, 0.5, 3]
iris2 = [6.9, 3.5, 3.5, 2.5, 4]
iris3 = [6.7, 2.0, 5.0, 1.6, 5]
df = pd.DataFrame([iris1, iris2, iris3], columns = iris_df.columns) 
iris_df = pd.concat([iris_df, df], ignore_index=True)   # DataFrame.append was removed in pandas 2.0

# plot
figure = plt.figure(figsize=(16, 12))
def scatter(subplot, i, j, k):
    def doit(row):
        n = int(row[4])          
        ax.scatter(row[i], row[j], row[k],   
                   c=colors[n], 
                   marker=markers[n],
                   s=sizes[n])

    ax = figure.add_subplot(subplot, projection='3d')
    ax.set_xlabel(labels[i])
    ax.set_ylabel(labels[j])
    ax.set_zlabel(labels[k])
    iris_df.apply(doit, axis=1, raw=True)

scatter(221, 0, 1, 2)
scatter(222, 0, 1, 3)
scatter(223, 0, 2, 3)
scatter(224, 1, 2, 3)
plt.show()