# Solving the lizard problem

By: Jacobo Fernández-Vargas and Luca Citi

We are now going to solve, from beginning to end, the problem that was left as an exercise in the second lab. This will allow us to go through all the steps involved in creating and testing a machine learning model. Note that this dataset was not intended for classification, so we may not get good results, or the problem may turn out to be trivial. Regardless, this example serves its purpose of showing how to build a pipeline from start to end. We will also point out some common errors.

## The data

The data comes from a paper [1] that aimed at finding differences in the behaviour of lizards depending on the colour of their abdomen (yellow, white or orange). For example, the authors wanted to know whether lizards of a specific colour have bigger territories. Each sample is one observation of a specific lizard.
The authors of the paper could not find any differences.
We are going to use the dataset in a different way (i.e. predicting the colour of a lizard observed at a specific location at a given time).

[1] https://onlinelibrary.wiley.com/doi/10.1002/ece3.6659

## Loading the data

Since we know that the data is not ready to use in numpy, we use pandas to load the data and preprocess it.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('lizards.csv')
data

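| | ID_e | E | N | Sex | Date | Hour | Cell | Label |
|---|---|---|---|---|---|---|---|---|
| 0 | e1f1 | 5.583066 | 1.175189 | f | 05/6/2018 | 12:27 | e1 | o |
| 1 | e1f1 | 5.325020 | 0.962680 | f | 21/6/2018 | 10:20 | e1 | o |
| 2 | e1f1 | 3.412439 | 0.947501 | f | 21/6/2018 | 14:25 | e1 | o |
| 3 | e1f1 | 3.761561 | 2.010046 | f | 21/6/2018 | 12:30 | e1 | o |
| 4 | e1f1 | NaN | 1.144831 | f | 15/6/2018 | 12:06 | e1 | o |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7184 | e9m9 | 1.103858 | NaN | m | 25/5/2018 | 15:40 | e9 | y |
| 7185 | e9m9 | 3.699470 | NaN | m | 25/5/2018 | 17:21 | e9 | y |
| 7186 | e9m9 | 1.766447 | NaN | m | 24/5/2018 | 16:26 | e9 | y |
| 7187 | e9m9 | 0.497900 | 1.169597 | m | 23/5/2018 | 12:30 | e9 | y |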
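
A quick way to see why the data cannot go straight into numpy is to list the non-numeric columns and count the empty fields. The sketch below does this on three rows copied from the table above, loaded from an in-memory string instead of `lizards.csv`. The names `sample_csv`, `sample`, `text_columns` and `missing_per_column` are illustrative only.

```python
import io
import pandas as pd

# Three rows copied from the table above (illustrative stand-in for 'lizards.csv').
sample_csv = io.StringIO(
    "ID_e,E,N,Sex,Date,Hour,Cell,Label\n"
    "e1f1,5.583066,1.175189,f,05/6/2018,12:27,e1,o\n"
    "e1f1,,1.144831,f,15/6/2018,12:06,e1,o\n"
    "e9m9,1.103858,,m,25/5/2018,15:40,e9,y\n"
)
sample = pd.read_csv(sample_csv)

# Columns holding text need to be converted before they can be used as numbers.
text_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

# Empty fields are read as NaN; count them per column.
missing_per_column = sample.isna().sum()

print(text_columns)
print(missing_per_column)
```

On this small sample, every column except `E` and `N` is read as text, and `E` and `N` each contain one missing value.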


The first thing that we are going to do is to merge the Date and Hour columns. For this, we will concatenate the strings that contain the date and time and then transform them into a datetime object. We will then transform the result into a number (the number of days since a reference time point). Finally, we will remove the extra column.

In [3]:
data['Date'] = pd.to_datetime(data['Date'] + ' ' + data['Hour'])
data = data.drop('Hour', axis=1)
time0 = pd.Timestamp('2018-01-01 00:00:00')
data['Date'] = (data['Date'] - time0) / pd.Timedelta(1, 'day')
data.head()

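| | ID_e | E | N | Sex | Date | Cell | Label |
|---|---|---|---|---|---|---|---|
| 0 | e1f1 | 5.583066 | 1.175189 | f | 125.51875 | e1 | o |
| 1 | e1f1 | 5.32502 | 0.96268 | f | 171.430556 | e1 | o |
| 2 | e1f1 | 3.412439 | 0.947501 | f | 171.600694 | e1 | o |
| 3 | e1f1 | 3.761561 | 2.010046 | f | 171.520833 | e1 | o |
| 4 | e1f1 | NaN | 1.144831 | f | 165.504167 | e1 | o |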
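
A common error at this step is to let pandas guess the date format. The dates in this file appear to be written day/month/year (21/6/2018 can only be 21 June), yet in the output above the first row (05/6/2018) ends up at about day 125.5, which corresponds to 6 May rather than 5 June. This suggests that the ambiguous date was read month-first. A minimal sketch of stating the format explicitly, assuming every row follows day/month/year with a 24-hour time, is shown below. The names `raw`, `parsed` and `days` are illustrative.

```python
import pandas as pd

# Two date/time strings built from rows of the table above.
raw = pd.Series(['05/6/2018 12:27', '21/6/2018 10:20'])

# Assumption: all dates are day/month/year, so we say so explicitly
# instead of letting pandas infer the format.
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

# Same reference point as in the cell above.
time0 = pd.Timestamp('2018-01-01 00:00:00')
days = (parsed - time0) / pd.Timedelta(days=1)

print(days.tolist())
```

Under that assumption the first row becomes about 155.52 (5 June) instead of 125.52, while the unambiguous second row stays at about 171.43.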


We will now transform all categorical values into numerical ones. For the sex we will just use 0 and 1. For the column 'Cell', we could either use the number in the cell name or convert the field using a one-hot encoder. It is hard to decide which option is best without additional details about the data collection. For simplicity, we will just use the cell number.

In [4]:
cleanup = {"Sex": {"f": 0, "m": 1},
           "Cell": {"e1": 1, "e2": 2, "e3": 3, "e4": 4, "e5": 5,
                    "e6": 6, "e7": 7, "e8": 8, "e9": 9, "e10": 10}
           }
data.replace(cleanup, inplace=True)
data.head()

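| | ID_e | E | N | Sex | Date | Cell | Label |
|---|---|---|---|---|---|---|---|
| 0 | e1f1 | 5.583066 | 1.175189 | 0 | 125.51875 | 1 | o |
| 1 | e1f1 | 5.32502 | 0.96268 | 0 | 171.430556 | 1 | o |
| 2 | e1f1 | 3.412439 | 0.947501 | 0 | 171.600694 | 1 | o |
| 3 | e1f1 | 3.761561 | 2.010046 | 0 | 171.520833 | 1 | o |
| 4 | e1f1 | NaN | 1.144831 | 0 | 165.504167 | 1 | o |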
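
For comparison, the one-hot alternative for 'Cell' mentioned above could look roughly like the sketch below. It is not the option used in this notebook. It runs on a small illustrative frame (`cells`) and uses `pd.get_dummies` to create one 0/1 column per cell.

```python
import pandas as pd

# Illustrative frame with a few cell names in the same style as the data.
cells = pd.DataFrame({'Cell': ['e1', 'e9', 'e10', 'e1']})

# One binary column per distinct cell; each row has exactly one 1.
one_hot = pd.get_dummies(cells, columns=['Cell'], dtype=int)

print(one_hot)
```

Unlike the cell number, this encoding does not suggest that cell 10 is "larger" than cell 1, at the cost of adding one column per cell.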


If we look at the data, we can see that the ID is just a concatenation of the cell, the sex and the lizard number, so each ID identifies one individual lizard. Since a lizard has a single colour, there is a perfect mapping between the ID and the label, and a classifier that is given the ID could simply memorise it. However, instead of removing it right away, we are going to use it to fill the missing values in E and N. We will first transform the ID into a numeric value using the pandas function `factorize`.

In [5]:
data['ID_e'] = pd.factorize(data.ID_e)[0]
data

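| | ID_e | E | N | Sex | Date | Cell | Label |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 5.583066 | 1.175189 | 0 | 125.518750 | 1 | o |
| 1 | 0 | 5.325020 | 0.962680 | 0 | 171.430556 | 1 | o |
| 2 | 0 | 3.412439 | 0.947501 | 0 | 171.600694 | 1 | o |
| 3 | 0 | 3.761561 | 2.010046 | 0 | 171.520833 | 1 | o |
| 4 | 0 | NaN | 1.144831 | 0 | 165.504167 | 1 | o |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7184 | 179 | 1.103858 | NaN | 1 | 144.652778 | 9 | y |
| 7185 | 179 | 3.699470 | NaN | 1 | 144.722917 | 9 | y |
| 7186 | 179 | 1.766447 | NaN | 1 | 143.684722 | 9 | y |
| 7187 | 179 | 0.497900 | 1.169597 | 1 | 142.520833 | 9 | y |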
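
To make the two points above concrete, the sketch below shows what `pd.factorize` returns and one possible way to check that every ID maps to a single label. It uses a small illustrative frame (`toy`) rather than the real data; the names `labels_per_id` and `id_determines_label` are illustrative as well.

```python
import pandas as pd

# Illustrative observations: two lizards, several sightings each.
toy = pd.DataFrame({
    'ID_e':  ['e1f1', 'e1f1', 'e9m9', 'e1f1', 'e9m9'],
    'Label': ['o',    'o',    'y',    'o',    'y'],
})

# factorize returns integer codes (in order of first appearance)
# together with the unique values those codes refer to.
codes, uniques = pd.factorize(toy['ID_e'])

# Count the distinct labels seen for each ID; a perfect mapping
# means every ID has exactly one.
labels_per_id = toy.groupby('ID_e')['Label'].nunique()
id_determines_label = bool((labels_per_id == 1).all())

print(codes, list(uniques), id_determines_label)
```

If such a check holds on the full dataset, keeping the ID as an input feature would let a model recover the label without learning anything about location or time.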


Finally, we need to encode the label. We could use a one-hot encoder for this, as the label seems to be a categorical value with no ordinal relationship. However, most classifiers expect a single output column and handle multiclass problems internally, for example with a one-vs-all approach, so using a one-hot encoder is not strictly required. A notable exception is artificial neural networks: if we wanted to use an ANN, we would need to change the following lines to use a one-hot encoder.

In [6]:
data['Label'] = pd.factorize(data.Label)[0]
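
For the neural-network case mentioned above, a one-hot version of the label could be built from the same integer codes. The sketch below is one possible way to do it and is not used in the rest of this notebook. It runs on an illustrative series (`labels`) and assumes that white lizards are coded as `'w'`, which does not appear in the rows shown above.

```python
import numpy as np
import pandas as pd

# Illustrative labels; 'w' for white is an assumption.
labels = pd.Series(['o', 'o', 'y', 'w', 'y'])

# Integer codes in order of first appearance, as in the cell above.
codes, classes = pd.factorize(labels)

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the codes gives one row per sample.
one_hot = np.eye(len(classes), dtype=int)[codes]

print(list(classes))
print(one_hot)
```

Each row of `one_hot` contains a single 1 in the column of its class, which is the target format many neural-network losses expect.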

Now we are ready to transfer the data to numpy arrays.

In [7]:
import numpy as np
x = data.loc[:, data.columns != 'Label'].to_numpy()
y = data.loc[:, 'Label'].to_numpy()

We now check for missing values.

In [8]:
np.sum(np.isnan(x), axis=0)

array([  0, 387, 390,   0,   0,   0])
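
Only the E and N columns contain missing values. One possible way to use the lizard ID to fill them is to replace each missing coordinate with the mean of the other observations of the same lizard. The sketch below illustrates the idea on a small made-up frame (`toy`). It assumes that a per-lizard mean is an acceptable fill value and is not necessarily the approach the authors have in mind.

```python
import numpy as np
import pandas as pd

# Made-up observations for two lizards, with some coordinates missing.
toy = pd.DataFrame({
    'ID_e': [0, 0, 0, 1, 1],
    'E': [5.0, np.nan, 3.0, 1.0, np.nan],
    'N': [1.0, 2.0, np.nan, np.nan, 4.0],
})

# Mean of E and N per lizard, broadcast back to one value per row
# (NaN entries are ignored when computing the mean).
per_lizard_mean = toy.groupby('ID_e')[['E', 'N']].transform('mean')

# Fill only the missing entries; observed values are left untouched.
filled = toy.fillna(per_lizard_mean)

print(filled)
```

A lizard with no observed value at all for a coordinate would still be left with NaN under this scheme, so the missing-value check would need to be repeated afterwards.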