# Small project

We have now seen a large part of Python and some of the tools available for it. There is much more to learn, but we know enough by now to do a small project together.

In this course you have been learning how to use an Arduino to collect sensor data, and you have distributed this data with the help of Raspberry Pis. It is now time to look at the data and analyze it. We will be using the same tools that we have looked at this week. There will be some minimal data science, just so that we have some new results to look at. Do not let this alarm you: you do not need to understand it, you just need to pay attention to how we visualize it.

<img src="img/sensors.jpg" style="width:500px">

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
from sklearn.decomposition import PCA

%matplotlib inline

You have received a data set from your project where you have taken readings from three different types of objects and summarized them in a data file located at "data/iris.data". Let's start by loading this file using Pandas. The data file is in CSV (comma separated values) format, which we can easily load with `pd.read_csv`. It returns a new `DataFrame` with the data.

In [None]:
data = pd.read_csv("data/iris.data")
data.shape
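To see what `pd.read_csv` does without depending on the file on disk, here is a minimal sketch that parses a few made-up rows from a string. The sepal column names are assumptions for illustration (only `petal-width`, `petal-length` and `class` are used later in this notebook), and the three rows are sample values, not your readings. It also shows how to load a file that has no header row, in case your data file is laid out that way.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma separated layout.
csv_text = """sepal-length,sepal-width,petal-length,petal-width,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# With a header row, read_csv takes the column names from the first line.
with_header = pd.read_csv(io.StringIO(csv_text))

# Without a header row, pass header=None and supply the names yourself.
body_only = csv_text.split("\n", 1)[1]
without_header = pd.read_csv(
    io.StringIO(body_only), header=None, names=list(with_header.columns)
)

print(with_header.shape)           # (number of rows, number of columns)
print(list(without_header.columns))
```

If `data.head()` shows measurement values where you expect column names, the file probably has no header row and the second form is the one to use.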

Using `DataFrame.head` we can inspect the first rows of the dataset to get an idea of what is in it.

In [None]:
data.head()

`DataFrame.describe` can give us some more detailed information about the columns:

In [None]:
data.describe()

This information can be nice to visualize. `DataFrame`s have the convenient `hist` method, which plots a histogram of each numeric column for us.

In [None]:
data.hist()
plt.show()

We can also group the information in a `DataFrame`. In this case it would be convenient to group by class and see a summary of the columns per class. For this we can use the `DataFrame.groupby` method.

In [None]:
data.groupby("class").size()

In [None]:
data.groupby("class").describe(percentiles=[])

In [None]:
data.groupby("class").hist()
plt.show()
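The grouping calls above can be easier to follow on a tiny table. The sketch below builds a made-up `DataFrame` (the values and the class labels `a`, `b`, `c` are invented for illustration) and shows that `groupby` splits the rows by the value in one column and then applies a summary to each group separately.

```python
import pandas as pd

# Made-up table: one measurement column and one label column.
toy = pd.DataFrame({
    "petal-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "class": ["a", "a", "b", "b", "c", "c"],
})

# Number of rows in each group.
sizes = toy.groupby("class").size()

# Mean of one column, computed separately for each group.
means = toy.groupby("class")["petal-length"].mean()

print(sizes)
print(means)
```

Both results are indexed by the group label, so `means["b"]` gives the mean for class `b`.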

We might also want to visualize the whole dataset, for example with a scatter plot.

The dataset has four measurement dimensions (columns), which is one more than we can show in a 3D plot, so we need to reduce the number of dimensions. Here we will use a technique called *PCA* (principal component analysis) to go from four dimensions to three. It calculates new axes along which the data has the highest variance, and then we plot the dataset over them.

In [None]:
# use all four measurement columns as input to the PCA
data_reduced = PCA(n_components=3).fit_transform(data.iloc[:, :4])

classes = data.replace({"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}).iloc[:, 4]
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.set_title("3D plot of data samples in the three most important dimensions")

p = ax.scatter(data_reduced[:, 0], data_reduced[:, 1], data_reduced[:, 2], c=classes)

plt.show()
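If you are curious what PCA does behind the scenes, the following is a minimal NumPy sketch of the idea, not scikit-learn's actual implementation. It uses randomly generated data (an assumption for illustration, not the sensor data), centers it, and uses a singular value decomposition to find the directions of highest variance. The variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 100 samples, 4 columns with very different spread.
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

# Step 1: center each column around zero.
Xc = X - X.mean(axis=0)

# Step 2: the rows of Vt are the principal directions,
# ordered from highest to lowest variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of the total variance that each direction explains.
explained_variance = S**2 / (len(X) - 1)
ratio = explained_variance / explained_variance.sum()

# Step 3: project the data onto the first three directions.
reduced = Xc @ Vt[:3].T

print(reduced.shape)
print(ratio)
```

The `ratio` array tells you how much information is lost by dropping a dimension: if the last entry is close to zero, three dimensions describe the data almost as well as four.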

If I had to guess the two most significant dimensions, I would choose petal width and petal length. Let's make a 2D plot of the samples over these two dimensions.

In [None]:
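fig, ax = plt.subplots()
ax.scatter(data["petal-width"], data["petal-length"], c=classes)
# label the axes so it is clear which measurement is which
ax.set_xlabel("petal width")
ax.set_ylabel("petal length")
ax.grid(True)
plt.show()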

## Bonus: predict class of the samples

Let's do a small bonus step and try to create a model that can predict the class from the values of the sepal and petal dimensions. For this part we will be using [scikit-learn](https://scikit-learn.org/stable/index.html), which is a great machine learning library.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

The first step in creating a model is to separate your data into input and output. In our case the input is sepal length, sepal width, petal length and petal width. Our output is the class.

In [None]:
X = data.iloc[:, :4] # INPUT: four first columns
y = classes          # OUTPUT: the converted version of name to an int

Then we need to split our data into a set to fit the model to, and a set that we later use to test on. We will make the test set 30% of our original data. The `sklearn.model_selection` module has a very handy function, `train_test_split`, for this.

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=.3)
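The idea behind the split can be sketched with plain NumPy: shuffle the row indices and cut the shuffled list in two. This is only an illustration of the principle, not how `train_test_split` is implemented, and the sample count of 150 is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 150   # assumed number of rows, for illustration only
test_size = 0.3   # fraction of rows held back for testing

# Shuffle the row indices so both sets get a random mix of samples.
indices = rng.permutation(n_samples)

# The first part becomes the test set, the rest the training set.
n_test = int(round(n_samples * test_size))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))
```

Because the shuffle is random, every run of `train_test_split` gives a different split unless you fix it, for example by passing its `random_state` parameter.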

Then we create the model that we want to use to classify the values it receives. There are many different algorithms for classification problems. Here we have chosen to use *Random Forest*.

In [None]:
rf = RandomForestClassifier(n_estimators=10) # create a new classifier

Then we fit our model to the training set.

In [None]:
rf.fit(Xtrain, ytrain)

Let's then check how well it performs on the test set.

In [None]:
rf.score(Xtest, ytest) # gives us the accuracy of our model

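When we ran this we got almost 98%, not bad! Your number will probably differ a little, since both the train/test split and the random forest involve randomness.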
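For a classifier, the accuracy returned by `score` is the fraction of test samples whose predicted class matches the true class. A minimal sketch of that calculation, using made-up labels rather than the output of our model:

```python
import numpy as np

# Made-up true classes and predictions for eight samples.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 2, 1])

# Comparing element by element gives True/False per sample;
# the mean of that array is the fraction of correct predictions.
accuracy = np.mean(y_true == y_pred)

print(accuracy)
```

Here seven of the eight predictions are correct, so the accuracy is 7/8.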

We can also get the importance of each feature from the classifier.

In [None]:
rf.feature_importances_

And we can visualize it with the following. 

(There is some magic here and there, but try to see if you understand it.)

In [None]:
# sort the feature importances.
# np.argsort gives the order of the indices for the values sorted in ascending order.
# the last [::-1] reverses the order so we get descending order.
indices = np.argsort(rf.feature_importances_)[::-1]

# calculate the standard deviation for each importance
std = np.std([tree.feature_importances_ for tree in rf.estimators_],
             axis=0)

# make bar plots of the feature importances and show the standard deviation as well
fig, ax = plt.subplots()
ax.set_title("Feature importances")
ax.bar(range(4), rf.feature_importances_[indices],
       color="r", yerr=std[indices], align="center")
ax.set_xticks(range(4))
ax.set_xticklabels(indices)
ax.set_xlim((-1, 4))

plt.show()

As we can see, the petal length and the petal width are by far the most important features for telling which class a sample belongs to.
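The `np.argsort(...)[::-1]` step in the plotting cell can be tried on its own with a small array. The importance values below are invented for illustration and are not the output of our classifier.

```python
import numpy as np

# Made-up importances for four features (indices 0 to 3).
importances = np.array([0.10, 0.02, 0.45, 0.43])

# argsort returns the indices that would sort the array in ascending order.
ascending = np.argsort(importances)

# Reversing with [::-1] gives the indices in descending order instead.
descending = ascending[::-1]

print(descending)               # indices of the features, most important first
print(importances[descending])  # the importances in descending order
```

Indexing with `descending` reorders any array of the same length in the same way, which is how the importances and their standard deviations stay aligned in the bar plot.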