In [7]:
# Cell to hide: sets a display option and imports the helper functions

import pandas as pd
pd.set_option('display.max_columns', 27)

%run 1-functions.ipynb

## Predict the width of a ship <a class="anchor" id="width"></a>

Your customer would like a model that can predict the width of a ship from its length. For this first task, the model takes only the length attribute as input and predicts the width. You can use the static dataset for this task.

In [4]:
import pandas as pd

static_data = pd.read_csv('./static_data.csv')

We can put both attribute names in two variables: ``x`` (containing a list of the predictive variables, here only length) and ``y`` (containing a list of the predicted variable).

In [5]:
# Prediction of Width from Length

x = ['Length']
y = ['Width']

As the prediction of the width of a ship is a regression problem, we use the function ``knn_regression()`` __TODO: add link to functions page when it is made__ to make the prediction. Then we can calculate the mean absolute error (MAE) with the function [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) from the ``sklearn`` library.

In [8]:
from sklearn.metrics import mean_absolute_error

pred1, ytest1 = knn_regression(static_data, x, y)

print('MAE with all data: ' + str(mean_absolute_error(pred1, ytest1)))

MAE with all data: 2.73451052631579
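
The helper ``knn_regression()`` is defined in another notebook and is not shown here. To give an idea of what such a helper could look like, here is a minimal sketch built on ``KNeighborsRegressor`` from ``sklearn``. The name ``knn_regression_sketch``, its parameters, the train/test split, and the replacement of missing values with 0 are all assumptions made for illustration; the real function may work differently. The toy data is made up.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def knn_regression_sketch(df, x, y, n_neighbors=5, test_size=0.2, seed=0):
    """Hypothetical stand-in for knn_regression(); not the notebook's actual code."""
    # Assumption: missing values are replaced with 0 so that the model can be fitted.
    data = df[x + y].fillna(0)
    x_train, x_test, y_train, y_test = train_test_split(
        data[x], data[y], test_size=test_size, random_state=seed
    )
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(x_train, y_train)
    # Same return order as in the notebook: predictions first, then true values.
    return model.predict(x_test), y_test


# Toy data (made up for illustration): the width is 15% of the length.
toy = pd.DataFrame({"Length": [50, 60, 80, 100, 120, 150, 180, 200, 250, 300]})
toy["Width"] = toy["Length"] * 0.15

pred, ytest = knn_regression_sketch(toy, ["Length"], ["Width"], n_neighbors=2)
print("MAE on toy data: " + str(mean_absolute_error(ytest, pred)))
```

Each prediction is the average width of the ``n_neighbors`` training ships whose length is closest to the length of the test ship.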


In the previous part, we already identified that the length and width attributes contain some missing values. We can try to make a prediction on a selected part of the dataset, that doesn't contain missing values. For that, we use the method [dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) on the dataframe. Then, we make a new prediction on the selected dataset and print the MAE.

In [9]:
static_selected = static_data.dropna()
pred2, ytest2 = knn_regression(static_selected, x, y)
print('MAE without NaN: ' + str(mean_absolute_error(pred2, ytest2)))

MAE without NaN: 2.0717238095238097
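
Note that ``dropna()`` without arguments removes a row as soon as any of its columns is missing, including columns that the model does not use. The sketch below, on made-up toy data, compares this with ``dropna(subset=...)``, which only checks the columns listed in ``x`` and ``y``. Which behaviour is preferable depends on the question being asked; this is an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Toy data (made up): VesselName is unrelated to the prediction.
toy = pd.DataFrame({
    "Length": [100.0, 120.0, np.nan, 80.0],
    "Width": [15.0, np.nan, 12.0, 11.0],
    "VesselName": ["A", "B", "C", None],
})

x = ["Length"]
y = ["Width"]

# Number of missing values per column.
print(toy.isna().sum())

# dropna() without arguments removes a row if ANY column is missing.
all_columns = toy.dropna()

# dropna(subset=...) only looks at the columns used by the model.
model_columns = toy.dropna(subset=x + y)

print(len(all_columns), len(model_columns))
```

Here the last row is complete for ``Length`` and ``Width`` but is still removed by the plain ``dropna()`` because its name is missing.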


The error dropped, which means that the model performed better once the rows with missing data were removed. Before drawing any conclusion, we need to analyze in detail the reason for this drop in error, and consider whether it is really what we want.

__TODO: explanation: depends how missing values are treated in prediction > if length is here but width is missing, what does the MAE return? How to compare prediction with NaN? A length of 0 might lower the error.__
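
One way to explore this question is to score the same predictions twice. ``mean_absolute_error`` from ``sklearn`` raises an error when its input contains NaN, so a numeric MAE "with all data" suggests that the missing values were replaced somewhere before scoring. The sketch below assumes that they were replaced with 0 (an assumption, not something the notebook confirms) and computes the MAE by hand on made-up values.

```python
import numpy as np
import pandas as pd

# Toy targets and predictions (made up); one true width is missing.
y_true = pd.Series([15.0, 18.0, np.nan, 11.0])
y_pred = pd.Series([14.0, 19.0, 13.0, 12.0])


def mae(true, pred):
    # Mean absolute error computed by hand.
    return float(np.mean(np.abs(np.asarray(true) - np.asarray(pred))))


# Case 1: the missing target is replaced with 0 before scoring.
mae_filled = mae(y_true.fillna(0), y_pred)

# Case 2: the row with the missing target is left out.
mask = y_true.notna()
mae_dropped = mae(y_true[mask], y_pred[mask])

print(mae_filled, mae_dropped)  # prints 4.0 1.0
```

A single zero-filled target accounts for most of the error in the first case, even though the prediction of 13 may be perfectly reasonable for that ship.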

## Predict the mean speed from the type of vessel <a class="anchor" id="meansog"></a>

Your customer would now like to predict the mean speed over ground (SOG) of a ship from its type. This can be useful to predict when a particular ship will arrive at a lock, for example. The static dataset is used again in this case.

Solve these two widgets to determine the values of the ``x`` and ``y`` variables and the appropriate function to use (if the problem is regression or classification).

<iframe src="https://h5p.org/h5p/embed/753577" width="694" height="600" frameborder="0" allowfullscreen="allowfullscreen"></iframe><script src="https://h5p.org/sites/all/modules/h5p/library/js/h5p-resizer.js" charset="UTF-8"></script>

<iframe src="https://h5p.org/h5p/embed/753637" width="694" height="300" frameborder="0" allowfullscreen="allowfullscreen"></iframe><script src="https://h5p.org/sites/all/modules/h5p/library/js/h5p-resizer.js" charset="UTF-8"></script>

Start by filling the ``x`` and ``y`` variables with the appropriate attributes.

In [None]:
# Prediction of MeanSOG from VesselType
x = []
y = []

Make the predictions with the right prediction function (replace the ``xxx`` with the appropriate name: ``classification`` or ``regression``).

In [None]:
static_selected = static_data.dropna()

pred1, ytest1 = knn_xxx(static_data, x, y)
pred2, ytest2 = knn_xxx(static_selected, x, y)

print('MAE with all data: ' + str(mean_absolute_error(pred1, ytest1)))
print('MAE without NaN: ' + str(mean_absolute_error(pred2, ytest2)))

In [10]:
# For beginner version: hide cell

import numpy as np
import ipywidgets as widgets
from ipywidgets import interact
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score

functions = ['not chosen', 'classification', 'regression']
attributes = static_data.columns

def make_prediction(x, y, function):
    static_selected = static_data.dropna()
    
    if function == 'regression' and static_data[y].dtype.name in ['object', 'int64', 'float64']:
        pred1, ytest1 = knn_regression(static_data, [x], [y])
        pred2, ytest2 = knn_regression(static_selected, [x], [y])
        
        print('MAE with all data: ' + str(mean_absolute_error(pred1, ytest1)))
        print('MAE without NaN: ' + str(mean_absolute_error(pred2, ytest2)))
    
    elif function == 'classification' and static_data[y].dtype.name == 'category':
        pred1, ytest1 = knn_classification(static_data, [x], [y])
        pred2, ytest2 = knn_classification(static_selected, [x], [y])
        
        print('Accuracy with all data: ' + str(accuracy_score(pred1, ytest1)))
        print('Accuracy without NaN: ' + str(accuracy_score(pred2, ytest2)))

interact(make_prediction,
         x = widgets.Dropdown(
            options = attributes,
            value = attributes[0],
            description = 'x = ',
            disabled = False,),
         y = widgets.Dropdown(
            options = attributes,
            value = attributes[0],
            description = 'y = ',
            disabled = False,),
         function = widgets.Dropdown(
            options = functions,
            value = functions[0],
            description = 'Type: ',
            disabled = False,))

interactive(children=(Dropdown(description='x = ', options=('TripID', 'MMSI', 'MeanSOG', 'VesselName', 'IMO', …

<function __main__.make_prediction(x, y, function)>

This time, we see that the performance decreased once the rows with missing data were removed.

__TODO: explanation: MeanSOG has a lot of missing values that are replaced with 0. Check the missing values of VesselType.__
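
A quick way to check both columns is to count, for each of them, the missing values and the values equal to 0, since a 0 may be a placeholder for a missing speed. The sketch below does this on made-up toy data; on the real dataset, ``toy`` would be replaced with ``static_data[['VesselType', 'MeanSOG']]``.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "VesselType": [70.0, 70.0, np.nan, 80.0, 80.0],
    "MeanSOG": [9.5, np.nan, 0.0, np.nan, 11.0],
})

# One row per column: how many values are missing, how many are exactly 0.
report = pd.DataFrame({
    "missing": toy.isna().sum(),
    "zeros": (toy == 0).sum(),
})
print(report)
```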

## How to deal with missing values

__TODO__
+ recover by guessing or predicting
+ set up a neutral / constant value
+ drop the columns or rows with missing values
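
The three options above can be sketched with ``pandas`` as follows. The toy data is made up, and filling a gap with the median of the same vessel type is only one possible way of guessing a value; it is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "VesselType": ["cargo", "cargo", "tug", "tug", "tug"],
    "Length": [200.0, 180.0, 30.0, np.nan, 28.0],
    "Width": [30.0, np.nan, 9.0, 8.0, np.nan],
})

# 1. Recover by guessing: fill each gap with the median of the same vessel type.
guessed = toy.copy()
for col in ["Length", "Width"]:
    group_median = guessed.groupby("VesselType")[col].transform("median")
    guessed[col] = guessed[col].fillna(group_median)

# 2. Set a neutral / constant value (here 0) for the numeric columns.
constant = toy.fillna({"Length": 0, "Width": 0})

# 3. Drop the rows, or the columns, that contain missing values.
dropped_rows = toy.dropna()
dropped_cols = toy.dropna(axis=1)

print(guessed)
print(constant)
print(dropped_rows)
print(dropped_cols)
```

Guessing keeps all rows but adds values that were never measured, a constant such as 0 is simple but can distort the error, and dropping is the safest for the remaining data but can remove a large part of the dataset.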