# Tabular datasets and classification

Now let's set images aside for a while and talk about a simpler kind of dataset: tabular data. As the name suggests, these datasets are organized as tables. The rows of such a table are individuals (objects, persons, animals, plants, phenomena), anything you investigate. The columns are the characteristics of these individuals that you observe or measure.

For example, if you have a dataset of all students in your class, then every row corresponds to a particular student and every column to one of their characteristics: height, weight, age, IQ or any other measurable or observable property.

In this part of the course we will learn:

1. How to create such datasets and load them from files.
2. How to analyze and visualize such datasets.
3. How to make a model which will assign the individuals into groups — classify them.
4. How to assess the results of such classification.

As usual, let's start with the basics.

## Data frames


In Python, tabular datasets are represented as *data frames*. A data frame is very similar to an Excel table. It consists of one or several columns, and every column contains one specific type of information, most often numbers or text.

If you want to work with data frames you need a special library, [Pandas](https://pandas.pydata.org). Let's install it first. Run the code below, then add `#` in front of the code so that it does not run again afterwards.

In [None]:
! pip install pandas

Now let's create a data frame with the characteristics of four people. We create values for each characteristic as a list (they will form columns of the future data frame).

In [None]:
# create columns with characteristics of the people as lists
Names = ["John", "Jane", "David", "Emily"]
Age = [18, 25, 23, 28]
Height = [1.60, 1.75, 1.65, 1.70]
Weight = [62, 78, 59, 73]

Now we can combine the lists together, give them a name and create a data frame:

In [None]:
# load Pandas library and give it a short name "pd"
import pandas as pd

# combine columns to dictionary, create a data frame and show
data = pd.DataFrame({'Name': Names, "Age": Age, "Height": Height, "Weight": Weight})
data

As you can see, we also used a specific data structure to combine the columns: a *dictionary*, which looks as follows: `{"Name1": values1, "Name2": values2}`. This is Python's way to combine different elements so that every element has a name (a key, an ID).

Data frames can also be saved to files (or loaded from files). The most suitable format for data frames is the Comma Separated Values (CSV) file. CSV files are plain text files where the values in every row are separated by commas. Pandas can also work with Excel files if needed.
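
To make the idea of keys more concrete, here is a minimal sketch of how a dictionary behaves on its own, before Pandas is involved. The variable `person` and its values are made up for illustration.

```python
# a dictionary maps keys (names) to values
person = {"Name": "John", "Age": 18}

# a value is looked up by its key
print(person["Name"])

# a new key/value pair can be added at any time
person["Height"] = 1.60

# keys() lists all names stored in the dictionary
print(list(person.keys()))
```

When such a dictionary holds lists of equal length as its values, Pandas can turn every key into a column name and every list into a column.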

Let's save our data frame to such a file:

In [None]:
data.to_csv("people.csv")

If you run the code above, you will get a new file, `people.csv`, which you can see in the left panel of your VSCode. Click on it to open and you will see the following:

```
,Name,Age,Height,Weight
0,John,18,1.6,62
1,Jane,25,1.75,78
2,David,23,1.65,59
3,Emily,28,1.7,73
```

This is exactly how the data values are stored inside the file. As you can see, Pandas adds an additional column with row indices in front of the data columns. The rest corresponds to the data values we created earlier. You can tell Pandas not to add the index column by specifying an additional parameter:


In [None]:
data.to_csv("people.csv", index=False)

If you look inside the file now, you will see the following:

```
Name,Age,Height,Weight
John,18,1.6,62
Jane,25,1.75,78
David,23,1.65,59
Emily,28,1.7,73
```

You can load data from the CSV file as follows:

In [None]:
new_data = pd.read_csv("people.csv")
new_data
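
If a CSV file was saved together with the index column, that column comes back as ordinary data when the file is loaded. The sketch below shows one way to handle this. It uses `io.StringIO` as a stand-in for a file on disk, and the CSV text is a shortened, made-up version of `people.csv`.

```python
import io
import pandas as pd

# the same kind of text that to_csv() writes when the index is kept
csv_with_index = ",Name,Age\n0,John,18\n1,Jane,25\n"

# read_csv accepts a file-like object, so StringIO stands in for a file here
plain = pd.read_csv(io.StringIO(csv_with_index))
print(list(plain.columns))

# index_col=0 tells Pandas to use the first column as the row index
indexed = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
print(list(indexed.columns))
```

In the first case the unnamed index column shows up as an extra data column, in the second case only `Name` and `Age` remain.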

### Exercise 1

Use the internet to collect the following information about all EU countries:
* Name of the country
* Population
* Area (in square kilometers)
* Year they became a part of the EU
* Whether they use the Euro (just "yes" or "no")

Enter this information into Excel or other spreadsheet software and save the table as a CSV file (comma separated). Then load the data from the CSV file into Python and show it on the screen.

In [None]:
## put your code here

### Iris dataset

The [Iris Flower](https://en.wikipedia.org/wiki/Iris_flower_data_set) is a well-known dataset that is frequently used for introductory purposes in data science and machine learning fields due to its simplicity and versatility. 

This dataset consists of measurements of 150 flowers of three Iris species: *Setosa*, *Versicolor* and *Virginica*. The measurements include sepal length, sepal width, petal length, and petal width. This is what the flowers look like:

<img src="illustrations/Setosa-Versicolor-Virginica-Images.png" style="max-width:800px;"/>

The term "petal" refers to the inner, and "sepal" refers to the outer portion of the flower. Both the sepal and the petal lengths and widths are measured in centimetres. 

<img src="illustrations/Petal-Sepal-Length-Width.jpg" style="width:300px; height:300px;"/>

We will use this dataset for all examples in this part of the course. Let's load the dataset from file `Iris.csv` (already provided).

In [None]:
d = pd.read_csv("Iris.csv")
d

As you can see, the dataset contains 150 rows, one for each individual flower. The columns include the ID (a unique number of each flower starting from 1), the four measurements, and the name of the species. This means that there is one column with integer (whole) numbers, four columns with floating point numbers and one column with text labels.

### Take subsets of rows
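
You can ask Pandas directly which type of values each column holds. The sketch below uses a tiny made-up table (`toy`) with the same kinds of columns as the Iris data, so it runs without `Iris.csv`.

```python
import pandas as pd

# three made-up rows with the same kinds of columns as the Iris table
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLength": [5.1, 4.9, 4.7],
    "Species": ["setosa", "setosa", "setosa"],
})

# dtypes reports the type of values stored in each column
print(toy.dtypes)

# shape is a (number of rows, number of columns) pair
print(toy.shape)
```

Integer and floating point columns are reported as `int64` and `float64` on most systems. How text columns are reported depends on the Pandas version.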

You can create different subsets of the data frame rows. For example, to get only the first five or only the last five rows of the data frame, use methods `head` and `tail`:


In [None]:
d.head()

In [None]:
d.tail()

In [None]:
td = d.tail()
td.shape

You can also specify a row position and get the values for this row:

In [None]:
# get values of the row at position 5 (the sixth row, counting from 0) and show it
r5 = d.iloc[5]
r5

Here `.iloc` stands for "integer location", which means that a row is selected by its position. You can also use it to get a subset with several rows, similar to how we did it for NumPy arrays (remember that indices in Python always start with 0):

In [None]:
# get rows from 11 to 15
sd = d.iloc[10:15]
sd

It is also possible to take non-consecutive rows, for example every second or every fifth row. To do this, add a third number to the index range, which defines the step between the numbers of the sequence (we did this in the previous class to reverse the order of rows and columns):

In [None]:
# get every fifth row starting from row number 5
# remember that in Python indices start from 0 therefore
# we have 4 here instead of 5 at the beginning
sd = d.iloc[4:150:5]
sd

You do not need to specify the total number of rows (150). You can leave this place empty, as with NumPy arrays, and Pandas will understand that it should proceed to the last row:

In [None]:
sd = d.iloc[4::5]
sd
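
The slicing rules are easier to check on a small table where the expected answer is obvious. The sketch below uses a made-up data frame `toy` with 20 rows whose `Id` values run from 1 to 20.

```python
import pandas as pd

# a made-up table with 20 rows and Id values 1..20
toy = pd.DataFrame({"Id": list(range(1, 21))})

# start at position 4 (the fifth row), go to the end, take every fifth row
every_fifth = toy.iloc[4::5]
print(every_fifth["Id"].tolist())

# a negative step walks backwards, so this reverses the row order
reversed_rows = toy.iloc[::-1]
print(reversed_rows["Id"].tolist()[:3])
```

The first subset contains the rows with `Id` 5, 10, 15 and 20, and the reversed table starts with 20, 19, 18.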

### Take subsets of columns

You can also take values from a particular column just by specifying its name:

In [None]:
d["Species"]

Alternatively, you can use `.iloc` with two positions, the indices of rows and columns, as you did with 2D NumPy arrays. If you want to subset only columns and keep all rows, use the `:` symbol for the rows. For example, in the code chunk below we take only the columns with measurements and then show the first 5 rows of the subset:

In [None]:
# take all rows and columns from 2 to 5 (from 1 to 4 if we count from 0)
X = d.iloc[:, 1:5]
X.head()
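
Several columns can also be selected by name instead of by position, which keeps working if the column order changes. This is a small sketch on a made-up two-row table `toy`. The column names follow the Iris file used above.

```python
import pandas as pd

# two made-up rows with some of the Iris columns
toy = pd.DataFrame({
    "Id": [1, 2],
    "PetalLength": [1.4, 4.7],
    "PetalWidth": [0.2, 1.4],
    "Species": ["setosa", "versicolor"],
})

# a list of names inside the brackets selects several columns at once
petals = toy[["PetalLength", "PetalWidth"]]
print(petals)

# .loc does the same with labels: all rows, the listed columns
same = toy.loc[:, ["PetalLength", "PetalWidth"]]
print(petals.equals(same))
```

Note the double brackets in the first variant: the outer pair is the selection itself, the inner pair is a Python list of column names.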

### Creating subsets using logical expressions

Another way to make the row subsets is to use logical expressions. For example, here is how to get only rows which correspond to *Setosa* species:

In [None]:
d["Species"] == "setosa"

In [None]:
setosa = d[d["Species"] == "setosa"]
setosa

Note that when we compare a column value with `"setosa"` we use a double equals sign, `==`. This tells Python that you want a comparison and not an assignment (for which we use a single `=`).

And here is an example where we create a subset with only Setosa flowers but whose *PetalWidth* is above 0.2:

In [None]:
(d["Species"] == "setosa") & (d["PetalWidth"] > 0.2)

In [None]:
setosa2 = d[(d["Species"] == "setosa") & (d["PetalWidth"] > 0.2)]
setosa2

As you can see, in this case we combine the results of two comparisons. 

In the first one we compare the values of column `"Species"` with `"setosa"`, and in the second one we compare the values of column `"PetalWidth"` with the value `0.2` using the greater-than operator. Each comparison is placed inside parentheses and there is an ampersand, `&`, in between.

This symbol is a synonym for "and". And hence this expression tells Pandas: *select rows where value for column "Species" is "setosa" **and** value for column "PetalWidth" is larger than 0.2*. And this is exactly what we got inside the new data frame, `setosa2`.
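
Besides `&` there are two more operators that are often useful in such expressions: `|` for "or" and `~` for "not". The sketch below shows them on a made-up four-row table `toy`. The threshold of 2.0 is chosen only for illustration.

```python
import pandas as pd

# four made-up flowers
toy = pd.DataFrame({
    "Species": ["setosa", "setosa", "versicolor", "virginica"],
    "PetalWidth": [0.2, 0.4, 1.3, 2.1],
})

# store the results of the two comparisons in separate variables
is_setosa = toy["Species"] == "setosa"
is_wide = toy["PetalWidth"] > 2.0

# | means "or": rows that are setosa or have wide petals (or both)
either = toy[is_setosa | is_wide]

# ~ means "not": rows that are not setosa
not_setosa = toy[~is_setosa]

print(len(either), len(not_setosa))
```

Saving each comparison under its own name, as done here, also makes long conditions easier to read.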


### Exercise 2

Load data about EU countries from the CSV file you created earlier. Create and show the following subsets:

* Countries which became part of EU after 2000.
* Countries which became part of EU after 2000 and do not use Euro.
* Countries which have an area of more than 200,000 sq. km.
* Countries which have a population density of less than 50 people per sq. km.


In [None]:
## write your code here

### Visualisation of data values

>*Note to teacher:* start by explaining how to make simple plots using manually entered data values. First as a list (e.g. height and weight of people), show a scatter and a bar plot using this example. Then show how to make scatter, line and bar plots for manually entered points of a parabola. Then show how to generate these values using the NumPy function `linspace()` and discuss pros and cons of NumPy arrays vs simple Python lists.


You can visualize data values using different plots. We will reuse the `matplotlib` library for that. For example, the code below shows how large the *Sepal Length* values are for different flowers using a bar plot: every flower is represented by a bar, and the height of the bar corresponds to the sepal length of this flower.

We will color the bars according to the species. To do this we draw three separate series of bars, one for each species.

In [None]:
# save values from column Species into a separate variable
species = d["Species"]

# create subsets
se = d[species == "setosa"]
ve = d[species == "versicolor"]
vi = d[species == "virginica"]

# show size
(d.shape, se.shape)

In [None]:
# load plotting engine from matplotlib library and give it a short name "plt"
import matplotlib.pyplot as plt

# make a plot figure of size 14 x 5
plt.figure(figsize = (14, 5))

# show barplot for each species. The location of each bar is defined by ID column
# (the ID values are unique) and height of the bar is defined by value from column SepalLength
# every bar series has its color and its label
plt.bar(se["Id"], se["SepalLength"], color="red", label="setosa")
plt.bar(ve["Id"], ve["SepalLength"], color="green", label="versicolor")
plt.bar(vi["Id"], vi["SepalLength"], color="blue", label="virginica")

# show legend to match colors and labels
plt.legend()

# add labels
plt.ylabel("Sepal length")
plt.xlabel("Flowers")
plt.title("Iris dataset")

What if we want to make such a plot for each of the four measured variables? That would require a lot of copy-paste with many manual changes. Let's simplify this and make a dedicated function which shows a bar plot for any column whose name is provided by the user.

In this function we will also simplify the code written above by using loops:

In [None]:
# get unique value from column species
species = d["Species"]
species.unique()

In [None]:
# loop over the unique values
for s in species.unique():
    print(s)

In [None]:
def iris_barplot(d, colname = "SepalLength"):
    """ shows bar plot for values from column specified by parameter 'colname' """

    # make a dictionary with pre-defined colors for each species
    colors = {"setosa": "red", "virginica": "blue", "versicolor": "green"}

    # get species values to separate variable
    species = d["Species"]

    # make a loop over unique set of the species values
    for s in species.unique():
        # create a subset
        ds = d[species == s]
        # show a plot for the subset
        plt.bar(ds["Id"], ds[colname], color=colors[s], label=s)

    # add legend, labels and title
    plt.legend()
    plt.xlabel("Flowers")
    plt.ylabel(colname)
    plt.title(colname)

And now we can reuse this function to make all four plots together.

In [None]:
plt.figure(figsize=(20, 15))

plt.subplot(2, 2, 1)
iris_barplot(d, "SepalLength")
plt.subplot(2, 2, 2)
iris_barplot(d, "SepalWidth")
plt.subplot(2, 2, 3)
iris_barplot(d, "PetalLength")
plt.subplot(2, 2, 4)
iris_barplot(d, "PetalWidth")

Or in an even more efficient way:

In [None]:
# get column names for specific columns
colnames = d.columns[1:5]
colnames

In [None]:
# make the barplots by using loop over the column names
plt.figure(figsize=(20, 15))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    iris_barplot(d, colnames[i])

Looks very good, but, what is most important, it shows us that Petal measurements (the last two plots) can be used to separate flowers of different species. They clearly have different petal sizes. We will use this knowledge later to build a classification model. 

Meanwhile let's learn how to make another plot — scatter plot. Let's start with a simple example to give an idea:

In [None]:
plt.scatter(d["PetalLength"], d["PetalWidth"])

In [None]:
plt.scatter(d["PetalLength"], d["PetalWidth"], marker="s", edgecolor="blue", color="yellow")

And here is a long and verbose example, where we reuse the subsets we already created above:

In [None]:
# make a plot figure of size 6 x 6
plt.figure(figsize = (6, 6))

# show scatter plot which takes values from two columns, Petal Width and Petal Length
# and then every row is shown as a point. The x-coordinate of the point corresponds to its
# Petal Width value and the y-coordinate corresponds to the value of Petal Length.
plt.scatter(se["PetalWidth"], se["PetalLength"], color="red", label="setosa")
plt.scatter(ve["PetalWidth"], ve["PetalLength"], color="green", label="versicolor")
plt.scatter(vi["PetalWidth"], vi["PetalLength"], color="blue", label="virginica")

# add axis labels
plt.xlabel("Petal Width")
plt.ylabel("Petal Length")

# show legend to match colors and labels
plt.legend()

# add a grid
plt.grid()

As you can see, the code is very similar to the code we used to make a barplot. And this means that we can simplify it and make a function:

In [None]:
def iris_scatter(d, x = "PetalLength", y = "PetalWidth", marker = "x"):

    # make a dictionary with colors for each species
    colors = {"setosa": "red", "virginica": "blue", "versicolor": "green"}

    # get species values to separate list
    species = d["Species"]

    # make a loop over unique set of species values
    for s in species.unique():
        # create a subset
        ds = d[species == s]
        #show a plot for the subset
        plt.scatter(ds[x], ds[y], color=colors[s], label=s, marker=marker)

    # add legend, labels and title
    plt.legend()
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title("Iris dataset")
    plt.grid(color = "lightgray", linestyle = ":")

And reuse it:

In [None]:
plt.figure(figsize=(12, 11))

plt.subplot(2, 2, 1)
iris_scatter(d, x = "SepalLength", y = "SepalWidth")
plt.subplot(2, 2, 2)
iris_scatter(d, x = "SepalLength", y = "PetalLength")
plt.subplot(2, 2, 3)
iris_scatter(d, x = "PetalLength", y = "PetalWidth")
plt.subplot(2, 2, 4)
iris_scatter(d, x = "PetalWidth", y = "SepalWidth")

Or like this:

In [None]:
colnames = d.columns[1:5]
colnames

In [None]:
n = 1
plt.figure(figsize=(25, 25))
for i in range(4):
    for j in range(4):
        plt.subplot(4, 4, n)
        iris_scatter(d, colnames[i], colnames[j])
        n = n + 1

The figure gives an overview of all combinations of the characteristics and shows which ones are best suited for separating the flowers of different species. The next part of the course explains how to make such a separation. Meanwhile, let's do some exercises.


### Exercise 3

Write code that shows a bar plot for the area of each EU country. Show the countries which became a part of the EU before 2000 in blue, and the countries which became a part of the EU in 2000 or later in green.

In [None]:
# write your code here

Make a scatter plot which shows area of the countries as x-axis and population as y-axis. Do you see any trend? Why?

In [None]:
# write your code here


## Classification


*Classification* is a process of arranging individuals into groups based on similarities and dissimilarities of their characteristics or their combinations. Usually the classes are pre-defined, so we know the number of the classes and their names/labels, like in case of the Iris dataset. 

In order to create classification rules, or a classification model, we need individuals whose class membership is known, so that we can learn the rules (and hence create the classification model based on these rules) by exploring these individuals. Then, when we have a new individual whose class is unknown, the classification model compares its characteristics to what it knows about the classes and decides which class it belongs to, if any.

Classification is a part of a more general discipline, *machine learning*. The idea of machine learning is to let a computer (a program, algorithm or model) learn what makes individuals from the same class similar and what makes individuals from different classes different, and then use this knowledge to classify new individuals whose class is unknown.

For example, you can create a model to distinguish between plastic and glass bottles and then use this model in a sorting machine. Or you can create a model which will recognize faces of your family members and then use it to lock/unlock the entrance door in your house. 



### Training and test sets

The process of developing such a model is usually called *training* as we *train* (*teach*, *supervise*) the model (while it *learns* from our training). In order to implement the *training* process you need a set of individuals whose classes are known, a *training set*, so that the algorithm can learn about the similarities and dissimilarities from it.

In order to check how well the trained model works, we can apply it to another set of individuals with known classes, a *test set*. It is important that, although the individuals in the training set and the test set are taken from the same population, the two sets do not share any individuals. So we train the model using one set of individuals and test it using another, independent set.

Let's create the two sets for the Iris data. Let's take every fifth row out and use it for testing (so we will have 30 flowers in the test set, 10 for each species). And the rest — for training. Here is how to do it:


In [None]:
# generate vectors for training and test sets
train_ind = d["Id"] % 5 != 0
test_ind = d["Id"] % 5 == 0

(test_ind[0:10], train_ind[0:10])

As you can see, we got two columns filled with values `True` and `False` as the result. 

How does it work? The operator `%` computes the remainder of a division. So `x % 5` means "compute the remainder of dividing x by 5". For example, if `x = 7`, `x % 5` will be `2`. Check this in the following block of code:

In [None]:
# check how % works - try to change x and see what happens when you run the code
x = 7
x % 5

But when x is equal to 5, 10, 15, or 55, the remainder will be equal to 0. And this is exactly what we use to create our indices. Because column `Id` contains unique row numbers, which start from 1, we can compute the remainder for every value of this column and compare it with 0. All rows where this condition is true go to the test set. All rows where it is false go to the training set.

Let's create the sets:

In [None]:
# create subsets; because "train_ind" and "test_ind" consist of boolean values (not numbers)
# we use "loc" (location) instead of "iloc" (location by position) like we did before.
d_train = d.loc[train_ind]
d_test = d.loc[test_ind]

# show size of each subset
(d_train.shape, d_test.shape)

As you can see, we have 120 rows in the training set (40 for each species) and 30 (10 for each species) in the test set. Let's look at the test set:

In [None]:
d_test

And it looks like we expected, with 10 individual flower measurements for each species.
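
Instead of inspecting the table by eye, the class balance of both sets can be counted with `value_counts()`. The sketch below is a stand-in that rebuilds only the `Id` and `Species` columns. It assumes the file lists 50 flowers per species in consecutive blocks, and all `toy` names are introduced for illustration.

```python
import pandas as pd

# a stand-in for the Iris table: only Id and Species,
# assuming 50 consecutive rows per species
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
toy = pd.DataFrame({"Id": list(range(1, 151)), "Species": labels})

# same split rule as in the text: every fifth Id goes to the test set
toy_test = toy[toy["Id"] % 5 == 0]
toy_train = toy[toy["Id"] % 5 != 0]

# value_counts() counts how many rows carry each label
print(toy_test["Species"].value_counts())
print(toy_train["Species"].value_counts())
```

Under this assumption the counts are 10 per species in the test set and 40 per species in the training set. The same two `value_counts()` calls can be applied to `d_test` and `d_train`.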


### Binary classification

Now we are ready to train a classification model. But what is the model? Which algorithm should we use for training and classification? What should the algorithm give us as a result?

The simplest classification model is a *binary classifier*, which gives a binary answer — either `True` (if the model recognizes that a sample belongs to a particular class — a *member*) or `False` (if the model rejects the sample as being from other classes — a *stranger*). The class of interest in this case is called a *target class*.

Let's create a binary classifier for the class *virginica*; the flowers of this class are shown in blue on the plots. However, instead of using fancy algorithms and letting the computer learn the classification rule from the data, let's define this rule manually, so we skip the machine learning part, or rather take it from the machine and do it ourselves.

But let's make our decision based on the training set only, as it would be in a real training process. Let's look at the plots for the training set:

In [None]:
plt.figure(figsize=(20, 15))

plt.subplot(2, 2, 1)
iris_barplot(d_train, "SepalLength")
plt.subplot(2, 2, 2)
iris_barplot(d_train, "SepalWidth")
plt.subplot(2, 2, 3)
iris_barplot(d_train, "PetalLength")
plt.subplot(2, 2, 4)
iris_barplot(d_train, "PetalWidth")

It looks like the best way to *discriminate* the *Virginica* flowers from the other two species is to define a threshold for Petal Width — if the value is above 1.7, then the corresponding flower should be classified as a member of the class *virginica*.

Here is the bar plot with the decision boundary shown as horizontal dashed line:

In [None]:
plt.figure(figsize=(20, 7))

iris_barplot(d_train, "PetalWidth")
plt.plot(plt.xlim(), [1.7, 1.7], color="black", linestyle="--")


As you can see, not all flowers will be classified correctly in this case, but we will discuss this later. For now, let's implement the decision rule for one flower as a small Python function:


In [None]:
def is_virginica(flower, threshold = 1.7):
    return flower["PetalWidth"] > threshold

In [None]:
# test how it works for different rows
r = d_train.iloc[101]
res = is_virginica(r)
res

In [None]:
# and here how we can turn the True/False values to labels
("virginica" if is_virginica(r) else "non-virginica")

Now let's write another function which applies `is_virginica()` to each row of a data frame and returns results of classification as a list of class labels. It will assign label `"virginica"` if the result of classification is `True` and `"non-virginica"` if it is the opposite.

In [None]:
def df_isvirginica(d, threshold = 1.7):

    predictions = []

    # we use the iterrows() method of data frames, which lets us
    # take every row as a separate object (in this case "flower")
    for index, flower in d.iterrows():
        class_label = "virginica" if is_virginica(flower, threshold) else "non-virginica"
        predictions.append(class_label)

    return predictions
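
The loop above makes every step visible, which is useful for learning. For comparison, here is a sketch of the same rule applied to a whole column at once with NumPy's `np.where`. The table `toy` and its values are made up, and this is an alternative, not the approach used in the rest of this section.

```python
import numpy as np
import pandas as pd

# four made-up petal widths
toy = pd.DataFrame({"PetalWidth": [0.2, 1.5, 1.8, 2.3]})
threshold = 1.7

# compare the whole column at once, then pick a label per row:
# np.where(condition, value_if_true, value_if_false)
labels = np.where(toy["PetalWidth"] > threshold, "virginica", "non-virginica")
print(labels.tolist())
```

For a table with 150 rows the difference in speed is negligible, so the rest of this section keeps the loop-based `df_isvirginica()`.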

And apply it to the test set:

In [None]:
pred = df_isvirginica(d_test)
pred

It works! Let's compare the real and the predicted classes:

In [None]:
# get column with reference class labels as a separate variable
ref = d_test["Species"]

# combine the predicted and the reference labels into a new data frame
res = pd.DataFrame({"Reference": ref, "Prediction": pred})
res

As you can see, on the one hand, the model rejects all non-members in the test set: the 10 flowers of the *Setosa* species and the 10 flowers of the *Versicolor* species were correctly rejected as non-members (the classifier answered `False`, shown as `"non-virginica"`).

But, at the same time three flowers from the target class were rejected incorrectly. Is this a good result? Should we change the threshold a little? How to assess the quality of classification and improve it? Let's talk about this in the next part.

### Classification quality

In order to assess the classification quality we need to count how many outcomes are correct and how many are wrong. However, there are two different groups of samples that the model can be wrong or correct about: the *strangers* (objects which indeed do not belong to the target class) and the *members* (the ones that belong to this class).

So we need to count the outcomes separately. This leads us to four numbers:

**Correct answers**
* TP (*true positives*): number of members correctly accepted by the model (got `True`).
* TN (*true negatives*): number of strangers correctly rejected by the model (got `False`).

**Wrong answers (classification errors)**
* FP (*false positives*): number of strangers incorrectly accepted as members (got `True`).
* FN (*false negatives*): number of members incorrectly rejected as strangers (got `False`).

This is how to count them for our example:

In [None]:
# define the target class label
target_class = "virginica"

# get reference and predictions as separate variables to make the code shorter
ref = res["Reference"]
pred = res["Prediction"]

# correct decisions — the value for reference class label and predicted class label
# are in agreement
TP = sum((ref == target_class) & (pred == target_class))
TN = sum((ref != target_class) & (pred != target_class))

# wrong decisions — the value for reference class label and predicted class label
# contradict
FN = sum((ref == target_class) & (pred != target_class))
FP = sum((ref != target_class) & (pred == target_class))

print(TP, TN, FN, FP)


As you can see, the numbers match our manual observations: all 20 strangers are correctly rejected (TN = 20, FP = 0), and only 7 members out of 10 were correctly accepted (TP = 7, FN = 3).
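
The four counts are often arranged as a small two-by-two table. One way to get such a table is `pd.crosstab`, sketched below on six made-up flowers. The frame `toy_res` only mimics the layout of `res`, and its labels are invented.

```python
import pandas as pd

# made-up reference and predicted labels for six flowers
toy_res = pd.DataFrame({
    "Reference": ["virginica", "virginica", "virginica",
                  "setosa", "versicolor", "versicolor"],
    "Prediction": ["virginica", "virginica", "non-virginica",
                   "non-virginica", "non-virginica", "virginica"],
})

# reduce the reference labels to the same two groups the model predicts
is_member = toy_res["Reference"] == "virginica"
ref_binary = is_member.map({True: "virginica", False: "non-virginica"})

# rows: what the flower really is, columns: what the model answered
table = pd.crosstab(ref_binary, toy_res["Prediction"])
print(table)
```

In this table the diagonal cells hold the correct answers (TN and TP) and the off-diagonal cells hold the errors (FP and FN).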

Now we can compute two statistics which represent the same numbers as fractions (or percentages), so we get measures which do not depend on the size of the dataset and hence are easier to interpret.

For example, the fraction of correctly recognized members is called *sensitivity* and is computed as the ratio of correctly recognized members to the total number of members:

$Sensitivity = TP / (TP + FN)$

The second statistic is *specificity*; it shows the fraction of strangers that were correctly rejected by the model:

$Specificity = TN / (TN + FP)$

Let's compute them for our data:

In [None]:
sens = TP / (TP + FN)
spec = TN / (TN + FP)

(sens, spec)

Again it matches our observations with 70% of correctly recognized members (7 out of 10) and 100% correctly rejected strangers.

Finally we can also compute a third statistic, *accuracy*, which will tell how well the model works overall. It simply computes the percent of all correct answers:

$Accuracy = (TP + TN) / (TP + TN + FP + FN)$

And here it is:

In [None]:
acc = (TP + TN) / (TP + TN + FP + FN)
acc

If the numbers of strangers and members in the test set are equal, then the accuracy is just the average of sensitivity and specificity.
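
The relation between accuracy and the average of the two other statistics can be checked numerically. The sketch below uses a small helper function `rates()`, introduced here only for illustration. The first set of counts comes from the example above, the second one is a made-up balanced case.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) for the four counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# counts from the example above: 10 members and 20 strangers
se, sp, ac = rates(tp=7, fn=3, tn=20, fp=0)
print(ac, (se + sp) / 2)

# made-up balanced case: 10 members and 10 strangers
se_b, sp_b, ac_b = rates(tp=7, fn=3, tn=9, fp=1)
print(ac_b, (se_b + sp_b) / 2)
```

In our test set the groups are not equal (10 members against 20 strangers), so the accuracy (about 0.90) differs from the plain average (about 0.85). In the balanced case both numbers agree.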

### Exercise

Imagine that we trained a model to distinguish red apples from the others. Below you see the result of applying the model to the test set.

<img src="illustrations/apples-classification.png" style="width: 700px;">

Compute the numbers of TP, TN, FP and FN and use them to compute the sensitivity, specificity and accuracy of the classification results. Do it manually, without programming, and report the results.

### Classification quality (continued)

Let's combine all the code into a Python function which computes all statistics based on a data frame with classification results. It assumes that the data frame has two columns: `Reference` with the reference class labels and `Prediction` with the predicted labels.

The function will work for any target class provided as an argument:

In [None]:
def class_stat(res, target_class):

    ref = res["Reference"]
    pred = res["Prediction"]

    TP = sum((ref == target_class) & (pred == target_class))
    TN = sum((ref != target_class) & (pred != target_class))
    FP = sum((ref != target_class) & (pred == target_class))
    FN = sum((ref == target_class) & (pred != target_class))

    sens = TP / (TP + FN)
    spec = TN / (TN + FP)
    acc = (TP + TN) / (TP + TN + FP + FN)

    # return all statistics in form of dictionary
    return {
        "target": target_class,
        "TP": TP,
        "TN": TN,
        "FP": FP,
        "FN": FN,
        "sens": sens,
        "spec": spec,
        "acc": acc,
    }

Let's see if it works for our example. Here are the statistics for the test set:

In [None]:
test_stat = class_stat(res, "virginica")
test_stat

Now let's see how different threshold values influence the classification quality. The code below is similar to what we used before, just written in a more compact way. Try different threshold values and see how well they perform for the test set:

In [None]:
threshold = 1.70

pred = df_isvirginica(d_test, threshold)
ref = d_test["Species"]
res = pd.DataFrame({"Reference": ref, "Prediction": pred})
test_stat = class_stat(res, "virginica")
test_stat

Now try the same for the training set:

In [None]:
threshold = 1.70

pred = df_isvirginica(d_train, threshold)
ref = d_train["Species"]
res = pd.DataFrame({"Reference": ref, "Prediction": pred})
train_stat = class_stat(res, "virginica")
train_stat

### Receiver operating characteristic (ROC)

Finally let's do the following. Let's try to change the threshold and see how it influences the sensitivity, specificity and accuracy. Let's start with 1.0 and then try all threshold values up to 2.0 with a step of 0.1 so we will have 11 results in total: 

In [None]:
# we need numpy to make a sequence of threshold values
import numpy as np

# generate a sequence of 11 numbers between 1.0 and 2.0
thresholds = np.linspace(1.0, 2.0, 11)
thresholds

In [None]:
# prepare empty lists to save main statistics for each threshold
sens = []
spec = []
acc = []

# save reference values into separate variable
ref = d_test["Species"]

# apply different threshold values and save the results to the lists
for t in thresholds:
    pred = df_isvirginica(d_test, threshold = t)
    res = pd.DataFrame({"Reference": ref, "Prediction": pred})
    stat = class_stat(res, "virginica")

    spec.append(stat["spec"])
    sens.append(stat["sens"])
    acc.append(stat["acc"])

# show the result
(spec, sens, acc)

Let's visualize the results:

In [None]:
plt.plot(thresholds, sens, marker="o", label = "sens")
plt.plot(thresholds, spec, marker="x", label = "spec")
plt.plot(thresholds, acc, marker="+", label = "acc")
plt.xlabel("Thresholds")
plt.legend()
plt.grid()

Now you can make an educated decision. For example, a threshold of 1.4 gives all three statistics equal to 0.90. A smaller threshold, 1.3, gives perfect sensitivity but decreases the specificity down to 0.85. The larger threshold, at 1.5, gives perfect specificity but decreases the sensitivity down to 0.80.

There is also another way to show these results — make a line plot where sensitivity depends on (1 - specificity):

In [None]:
# we convert list to NumPy array in order to make
# the arithmetic operation (1 - spec) easier
spec = np.array(spec)

# show the plot
plt.plot(1 - spec, sens, marker = "o")

# make plot look nicer
plt.xlim((-0.1, 1.1))
plt.ylim((-0.1, 1.1))
plt.grid(color = "lightgray")
plt.xlabel("1 - specificity")
plt.ylabel("sensitivity")

This plot is called a *receiver operating characteristic* (ROC) plot. The closer the curve is to the top left corner (high sensitivity at a low value of 1 - specificity), the better the model. However, to make this plot complete we need a wider range of threshold values, so that both statistics go from 0 to 1. Try to implement this.

### Multiclass classification

Now let's make a classification model which will provide one of the class labels as a response, so it will discriminate all flowers among the three species. This approach is known as *multiclass classification*.

First of all let's look at the two scatter plots below:

In [None]:
plt.figure(figsize=(20, 8))

plt.subplot(1, 2, 1)
iris_scatter(d_train, x = "SepalLength", y = "SepalWidth")
plt.subplot(1, 2, 2)
iris_scatter(d_train, x = "PetalLength", y = "PetalWidth")

Apparently, we can use the petal measurements for discrimination. Setosa can be clearly separated using a threshold along Petal Length, for example 2.5. We already have a solution for *virginica* (we will use the same Petal Width threshold of 1.7). And if neither of the two conditions is satisfied, the flower is recognized as versicolor.

Schematically, it can be shown as the following flowchart:

<img src="illustrations/tree.png" style="width:600px">

Such models based on a set of nested thresholds are called [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree). 

Let's implement it as the following function for a single row from the data frame:

In [None]:
def flower_classifier(flower):
    if flower["PetalLength"] < 2.5:
        return "setosa"
    elif flower["PetalWidth"] > 1.7:
        return "virginica"
    else:
        return "versicolor"

And now let's make a function which will apply this classifier to all rows.

In [None]:
def df_classifier(d):
    predictions = []
    for index, flower in d.iterrows():
        predictions.append(flower_classifier(flower))
    return predictions
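
Pandas can also run such a row-wise function without an explicit loop, using `DataFrame.apply` with `axis=1`. The sketch below repeats the rule under the name `toy_classifier` so that it runs on its own. The table `toy` holds three made-up flowers.

```python
import pandas as pd

def toy_classifier(flower):
    # same rule as flower_classifier above, repeated so the sketch is self-contained
    if flower["PetalLength"] < 2.5:
        return "setosa"
    elif flower["PetalWidth"] > 1.7:
        return "virginica"
    else:
        return "versicolor"

# three made-up flowers, one per expected class
toy = pd.DataFrame({
    "PetalLength": [1.4, 4.5, 5.8],
    "PetalWidth": [0.2, 1.5, 2.2],
})

# axis=1 passes each row to the function; the result is a Series of labels
toy_pred = toy.apply(toy_classifier, axis=1)
print(toy_pred.tolist())
```

The result is a Pandas Series instead of a list. The examples that follow keep using the loop-based `df_classifier()`.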

And test it:

In [None]:
ref = d_test["Species"]
pred = df_classifier(d_test)
res = pd.DataFrame({"Reference": ref, "Prediction": pred})
res

And it works! Now we just need to calculate the classification quality statistics for each of the classes:

In [None]:
stat = []
for class_label in ref.unique():
    stat.append(class_stat(res, class_label))
stat
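
A list of dictionaries is hard to read on the screen. Because all dictionaries share the same keys, such a list can be turned into a data frame. The sketch below uses made-up numbers in `toy_stat` purely to show the layout. They are not the results for the Iris test set.

```python
import pandas as pd

# made-up statistics with some of the keys returned by class_stat
toy_stat = [
    {"target": "class A", "sens": 1.0, "spec": 1.0, "acc": 1.0},
    {"target": "class B", "sens": 0.5, "spec": 0.75, "acc": 0.7},
]

# a list of dictionaries becomes a table:
# one row per dictionary, one column per key
stat_table = pd.DataFrame(toy_stat)
print(stat_table)
```

The same call, `pd.DataFrame(stat)`, should work for the list computed above, giving one row per species.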

That is it. In the next class we will talk about how to create a model which defines the classification rule by itself, by learning from the data.

### Exercise

Implement a decision tree which uses three conditions instead of two. Make a drawing first, then implement it in Python and test it. Discuss the pros and cons of this solution.