### Mounting the Drive (For Colab Users)

Uncomment and run the following code to mount your Google Drive if you are working in Google Colab.

In [59]:
'''
from google.colab import drive
drive.mount('/content/drive')
'''

"\nfrom google.colab import drive\ndrive.mount('/content/drive')\n"

List the contents of the current working directory to confirm that the dataset is there.



In [60]:
import os 
os.listdir()

['iris.data',
 'ML_Workshop_1_First_Look_at_ML.ipynb',
 'Week 2 - ML Cycle and how it learns.pdf']

### Read in the dataset

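- pandas can read a delimited text file straight into a DataFrame. Let's try read_csv() and inspect the result with head(). The file has no header row, so we pass header=None and assign the column names ourselves.

- If you use Colab, you can find the file under your mounted drive, for example at '/content/drive/MyDrive/Colab_Notebooks/iris.csv'. Use whichever file name your copy has; the cell below reads 'iris.data'.
- If you use VSCode, use the path of the directory where you saved the file.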
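One way to avoid editing the path by hand is to check the Colab location first and fall back to the local file name. This is only a minimal sketch: the helper name resolve_dataset_path and its default arguments are assumptions for illustration, not part of the workshop code.

```python
import os

# Hypothetical helper: the name and default arguments are illustrative only.
def resolve_dataset_path(filename="iris.data",
                         colab_dir="/content/drive/MyDrive/Colab_Notebooks"):
    """Return the Colab path if the file exists there, else the bare file name."""
    colab_path = os.path.join(colab_dir, filename)
    if os.path.exists(colab_path):
        return colab_path
    # Fall back to a path relative to the current working directory.
    return filename

path = resolve_dataset_path()
print(path)
```

The returned string can then be passed to pd.read_csv() in place of a hard-coded path.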

In [61]:
import pandas as pd
import numpy as np
data = pd.read_csv('iris.data',header=None)
data.columns = ['sepal length','sepal width','petal length','petal width','class']
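read_csv() can also take the column names directly through its names parameter, which does the same job as assigning data.columns afterwards. The sketch below inlines the first three rows of the dataset so that it runs without the file; the variable names raw, column_names, and sample are illustrative.

```python
import io
import pandas as pd

# First three rows of the dataset, inlined so the example runs without iris.data.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

column_names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

# header=None tells pandas the data has no header row;
# names= supplies the column labels in the same call.
sample = pd.read_csv(raw, header=None, names=column_names)
print(sample.dtypes)
```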


In [62]:
data.head()

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


### Understanding the dataset

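Let's visualize the dataset with seaborn. pairplot() draws a scatter plot for every pair of features, with each feature's distribution on the diagonal. Coloring the points by class shows how well the three species separate.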


In [63]:
import seaborn as sns

sns.pairplot(data,hue='class')

<seaborn.axisgrid.PairGrid at 0x23c56036250>

### Preprocessing

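- Let's split the dataset into a training set and a test set. test_size=0.4 holds out 40% of the rows for testing, and random_state fixes the shuffle so that the split is reproducible. Try changing random_state to see a different split.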

In [64]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.4,random_state = 42)
train.head()


| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 123 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |
| 24 | 4.8 | 3.4 | 1.9 | 0.2 | Iris-setosa |
| 25 | 5.0 | 3.0 | 1.6 | 0.2 | Iris-setosa |
| 23 | 5.1 | 3.3 | 1.7 | 0.5 | Iris-setosa |
| 94 | 5.6 | 2.7 | 4.2 | 1.3 | Iris-versicolor |
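To see what random_state does, the sketch below splits a stand-in list of 150 row indices (the size of this dataset) several times. It assumes only that scikit-learn is installed; the list rows and the other variable names are illustrative.

```python
from sklearn.model_selection import train_test_split

rows = list(range(150))  # stand-in for the 150 row indices of the dataset

# Same random_state: the shuffle is identical, so the two splits match exactly.
train_a, test_a = train_test_split(rows, test_size=0.4, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.4, random_state=42)
print(train_a == train_b and test_a == test_b)

# A different random_state generally produces a different split.
train_c, test_c = train_test_split(rows, test_size=0.4, random_state=7)
print(test_a[:5], test_c[:5])

# test_size=0.4 holds out 40% of the rows for testing.
print(len(train_a), len(test_a))
```

Fixing random_state matters when comparing models, because every model is then evaluated on the same held-out rows.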


### The Model
- It is time to build the model. For this example, we will use a decision tree. The first four columns are the features and the last column is the class label.

In [65]:
x_train = train.iloc[:,0:4]
y_train = train.iloc[:,4]

x_test = test.iloc[:,0:4]
y_test = test.iloc[:,4]

In [66]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
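A fitted decision tree can be printed as plain-text rules with sklearn.tree.export_text, which helps show what the model has learned. The sketch below is self-contained: it loads scikit-learn's bundled copy of the Iris data rather than iris.data, so the class labels are the integers 0 to 2 and the feature names carry a "(cm)" suffix. The random_state=0 passed to the classifier is an assumption added here so that the tree is the same on every run.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of the Iris data (an assumption; the notebook reads iris.data instead).
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=42
)

# random_state here only makes the fitted tree reproducible.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)

# Print the learned splits as indented if/else rules.
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each line of the printout is a threshold test on one feature, and each leaf ends with the predicted class.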

### The Result
- Let's make predictions with the trained model and analyze the results with a classification report and a confusion matrix.

In [67]:
y_pred = clf.predict(x_test)

In [68]:
print(y_test)
print(y_pred)

73     Iris-versicolor
18         Iris-setosa
118     Iris-virginica
78     Iris-versicolor
76     Iris-versicolor
31         Iris-setosa
64     Iris-versicolor
141     Iris-virginica
68     Iris-versicolor
82     Iris-versicolor
110     Iris-virginica
12         Iris-setosa
36         Iris-setosa
9          Iris-setosa
19         Iris-setosa
56     Iris-versicolor
104     Iris-virginica
69     Iris-versicolor
55     Iris-versicolor
132     Iris-virginica
29         Iris-setosa
127     Iris-virginica
26         Iris-setosa
128     Iris-virginica
131     Iris-virginica
145     Iris-virginica
108     Iris-virginica
143     Iris-virginica
45         Iris-setosa
30         Iris-setosa
22         Iris-setosa
15         Iris-setosa
65     Iris-versicolor
11         Iris-setosa
42         Iris-setosa
146     Iris-virginica
51     Iris-versicolor
27         Iris-setosa
4          Iris-setosa
32         Iris-setosa
142     Iris-virginica
85     Iris-versicolor
86     Iris-versicolor
16         

In [70]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        23
Iris-versicolor       0.95      0.95      0.95        19
 Iris-virginica       0.94      0.94      0.94        18

       accuracy                           0.97        60
      macro avg       0.96      0.96      0.96        60
   weighted avg       0.97      0.97      0.97        60

[[23  0  0]
 [ 0 18  1]
 [ 0  1 17]]
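
The numbers in the report can be recomputed from the confusion matrix by hand. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so recall divides each diagonal entry by its row sum and precision divides it by its column sum. The sketch below uses the matrix printed above; the variable names are illustrative.

```python
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
cm = [[23, 0, 0],
      [0, 18, 1],
      [0, 1, 17]]

n = len(cm)
# Row sums: how many test samples truly belong to each class (the "support" column).
row_sums = [sum(cm[i]) for i in range(n)]
# Column sums: how many test samples were predicted as each class.
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]

precision = [cm[k][k] / col_sums[k] for k in range(n)]
recall = [cm[k][k] / row_sums[k] for k in range(n)]
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_sums)

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>15}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Here the only errors are one versicolor flower predicted as virginica and one virginica flower predicted as versicolor, which is why setosa scores 1.00 on both measures.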
