### Import libraries

In [None]:
%matplotlib inline

# importing pandas and numpy
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import os

### Import log file

In [None]:
# load logfile into a Pandas dataframe
# this snippet is provided
df = pd.read_csv(os.path.join('..', 'data', 'datasets', 'log.csv'),
                   index_col='datetime', parse_dates=True).drop(['Unnamed: 0'], axis=1)

df.info()

* angle_of_attack: wind direction relative to the boat
> * A positive angle of attack means the wind is blowing onto the right (starboard) side of the boat
> * A negative angle of attack means the wind is blowing onto the left (port) side of the boat
* boat_angle: compass direction in which the boat is going (North==0/360, East==90, South==180, West==270)
* boat_heel: heeling angle in degrees (rotation around the longitudinal axis).
* boat_speed: speed in knots (5 knots is 9.26 km per hr)
* course_error: difference between boat_angle and target_angle
* rudder_angle: position of the rudder relative to centerline of the boat
* target_angle: compass direction in which you want to go
* wind_direction: direction from where the wind is coming
* wind_speed: wind speed in knots
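
The knots-to-km/h figure quoted for boat_speed can be checked with a one-line conversion. This is a minimal sketch; the helper name `knots_to_kmh` is made up for illustration and is not part of the notebook.

```python
# 1 knot is one nautical mile (1.852 km) per hour.
KMH_PER_KNOT = 1.852


def knots_to_kmh(knots):
    """Convert a speed in knots to kilometres per hour (hypothetical helper)."""
    return knots * KMH_PER_KNOT


speed_kmh = knots_to_kmh(5)
print(round(speed_kmh, 2))  # 9.26
```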

### Look at a sample of the dataset

In [None]:
# Print the first 5 rows of the dataframe
# YOUR CODE HERE
df.head()

In [None]:
# Print the last 5 rows of the dataframe
# YOUR CODE HERE
df.tail()

In [None]:
# If you want you can look at more rows or try different slices
# YOUR CODE HERE

### Plotting the columns
First we will have a visual look at the data.

In [None]:
# Put the columns in a list named 'columns'
# YOUR CODE HERE
columns = list(df)

In [None]:
# Plot the data (this can take a few seconds)
# This snippet is provided
for column in columns:
    _, ax = plt.subplots(figsize=(20, 10))
    ax.set_title(column)
    df[column].plot(ax=ax)
    plt.grid(True)

### Zoom in on column rudder_angle (and clip its maximum value)
Our AI Captain would like to control the rudder angle of the boat. Let's zoom in on this column in the dataset.

In [None]:
# this snippet is provided
_, ax = plt.subplots(figsize=(20, 10))
df['rudder_angle'].plot(ax=ax)
plt.grid(True)

In [None]:
# Select an interval of 1000 rows (these are 1000 datapoints) and put it in a new dataframe called 'selection_1k'
# YOUR CODE HERE
selection_1k = df.iloc[0:1000]

In [None]:
# plot the rudder_angle in your selection
# YOUR CODE HERE
_, ax = plt.subplots(figsize=(20, 10))
selection_1k['rudder_angle'].plot(ax=ax)
plt.grid(True)

In [None]:
# Try an interval of 10,000 rows now and plot the rudder_angle on this interval
# YOUR CODE HERE
selection_10k = df.iloc[0:10**4]
_, ax = plt.subplots(figsize=(20, 10))
selection_10k['rudder_angle'].plot(ax=ax)
plt.grid(True)

In [None]:
# Find the maximum and minimum values for rudder_angle in the entire dataset
# YOUR CODE HERE
df.rudder_angle.max(), df.rudder_angle.min()

In [None]:
# Clip rudder_angle to [-20, 20]; i.e. set angles > 20 to 20 & angles < -20 to -20
# YOUR CODE HERE
df["rudder_angle"] = df["rudder_angle"].mask(df["rudder_angle"] > 20, 20)
df["rudder_angle"] = df["rudder_angle"].mask(df["rudder_angle"] < -20, -20)
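
The two `mask` calls above can also be written as a single call to `Series.clip`. The sketch below compares both on a small made-up series (the values are illustrative, not taken from log.csv) and should give the same result either way.

```python
import pandas as pd

# Toy rudder angles in degrees (made-up values for illustration)
angles = pd.Series([-35.0, -20.0, -3.5, 0.0, 12.0, 20.0, 27.5])

# Two-step version with mask, mirroring the cell above
masked = angles.mask(angles > 20, 20).mask(angles < -20, -20)

# One-step version with clip
clipped = angles.clip(lower=-20, upper=20)

print(clipped.tolist())        # [-20.0, -20.0, -3.5, 0.0, 12.0, 20.0, 20.0]
print(masked.equals(clipped))  # True
```

Values already inside [-20, 20] are left untouched by both versions.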

In [None]:
# Look at the result of what you just did
_, ax = plt.subplots(figsize=(20, 10))
df[(df.index.hour == 9) & (df.index.minute == 21)]['rudder_angle'].plot(ax=ax)
plt.grid(True)

### Boat_speed (noisy signal)

In [None]:
# Plot boat_speed of your selection_10k
# YOUR CODE HERE
_, ax = plt.subplots(figsize=(20, 10))
selection_10k['boat_speed'].plot(ax=ax)
plt.grid(True)

In [None]:
# Plot boat_speed of your selection_10k again, this time with a rolling window calculation with rolling(20)
# YOUR CODE HERE
_, ax = plt.subplots(figsize=(20, 10))
selection_10k['boat_speed'].rolling(20).mean().plot(ax=ax)
plt.grid(True)
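
To see what `rolling(20).mean()` does, here is a minimal sketch on a synthetic noisy signal (the signal and its parameters are made up, not taken from the log). Each output value is the mean of the current sample and the 19 samples before it, so the first 19 values are NaN because the window is not yet full.

```python
import numpy as np
import pandas as pd

# Synthetic "boat speed": a slow oscillation plus random noise (made-up data)
rng = np.random.default_rng(seed=0)
t = np.arange(200)
true_speed = 5 + 0.5 * np.sin(t / 30)
noisy = pd.Series(true_speed + rng.normal(0, 0.4, size=t.size))

# Mean over a sliding window of 20 samples
smooth = noisy.rolling(20).mean()

# The first 19 entries have fewer than 20 samples available
n_missing = int(smooth.isna().sum())
print(n_missing)  # 19

# The first valid entry equals the plain mean of the first 20 samples
first_valid = smooth.iloc[19]
print(np.isclose(first_valid, noisy.iloc[:20].mean()))  # True
```

A larger window smooths more but also makes the curve lag further behind the raw signal.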

### Removing outliers from wind_speed

In [None]:
# Maybe you noticed that the graph of wind_speed looked quite messy
# Plot the wind_speed for your selection_10k
# YOUR CODE HERE
_, ax = plt.subplots(figsize=(20, 10))
selection_10k['wind_speed'].plot(ax=ax)
plt.grid(True)

In [None]:
# The big wind_speed values are not realistic. We will define wind_speed values above 35 knots as outliers
# Replace outliers in wind_speed with np.nan (use mask)
# df["wind_speed"] = YOUR CODE HERE
df["wind_speed"] = df["wind_speed"].mask(df["wind_speed"] > 35, np.nan)

In [None]:
_, ax = plt.subplots(figsize=(20, 10))
df.iloc[0:1000]['wind_speed'].plot(ax=ax)
plt.grid(True)

### Removing NA's

In [None]:
# Find the columns that contain NA's
# YOUR CODE HERE
df.isna().sum()

In [None]:
# Replace the NA's in each column with the last valid observation
# Can you think of a reason why this makes more sense here than replacing them with the mean value?
# ffill() propagates the last valid observation forward; assign the result back to the column
df['boat_heel'] = df['boat_heel'].ffill()
df['target_angle'] = df['target_angle'].ffill()
df['wind_speed'] = df['wind_speed'].ffill()
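
One way to think about the question above is to compare both strategies on a small made-up series (illustrative values, listed in time order). Forward fill keeps the gap close to the neighbouring measurements, while the mean is pulled towards values recorded at other moments.

```python
import numpy as np
import pandas as pd

# Toy wind speeds in knots with a gap of two samples (made-up values)
wind = pd.Series([10.0, 10.5, np.nan, np.nan, 11.0, 18.0])

# Repeat the last valid observation
ffilled = wind.ffill()

# Fill with the mean of all valid values in the series
mean_filled = wind.fillna(wind.mean())

print(ffilled.tolist())      # [10.0, 10.5, 10.5, 10.5, 11.0, 18.0]
print(mean_filled.tolist())  # [10.0, 10.5, 12.375, 12.375, 11.0, 18.0]
```

In this toy case the mean-filled gap jumps to 12.375 knots between readings of 10.5 and 11.0, which a slowly changing sensor signal is unlikely to do.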

In [None]:
df.isna().sum().sum()
# This should show 0

### Creating a new feature
You can create your own features based on the existing columns.
In sailing, VMG is used a lot.
VMG stands for Velocity Made Good and is defined as the velocity component in the direction in which you want to go.
Maybe this feature will help our machine learning algorithm.
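
As a quick sanity check of this definition, the sketch below assumes VMG = boat_speed * cos(course_error), with course_error in degrees. The helper name `velocity_made_good` and the sample speeds are made up for illustration.

```python
import numpy as np


def velocity_made_good(boat_speed, course_error_deg):
    """Speed component along the target direction (hypothetical helper)."""
    return boat_speed * np.cos(np.deg2rad(course_error_deg))


# Sailing exactly on target: all of the speed counts
on_target = float(velocity_made_good(6.0, 0.0))
# 60 degrees off target: only half of the speed counts
off_60 = float(velocity_made_good(6.0, 60.0))
# More than 90 degrees off target: the boat moves away from where it wants to go
off_120 = float(velocity_made_good(6.0, 120.0))

print(round(on_target, 2))  # 6.0
print(round(off_60, 2))     # 3.0
print(round(off_120, 2))    # -3.0
```

VMG never exceeds boat_speed in magnitude, and it turns negative once the course error is larger than 90 degrees.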

In [None]:
df['VMG'] = df.boat_speed*np.cos(np.deg2rad(df.course_error))

df = df[['wind_speed', 'wind_direction',
         'angle_of_attack', 'boat_heel',
         'boat_speed', 'VMG',
         'target_angle', 'boat_angle', 'course_error',
         'rudder_angle']]

In [None]:
_, ax = plt.subplots(figsize=(20, 10))
df.iloc[0:1000].boat_speed.plot(ax=ax)
df.iloc[0:1000].VMG.plot(ax=ax)
plt.legend()

### Plotting correlation matrix

In [None]:
# Plot the correlation matrix of the dataframe
# YOUR CODE HERE
corr = df.corr()
corr.style.background_gradient()

In [None]:
_, ax = plt.subplots(figsize=(20, 10))
df.iloc[0:1000].plot(ax=ax)
plt.legend()

Notice that the range of the values can differ quite a lot per column.

### What did you learn?
In this notebook you have:
* Clipped the values of rudder_angle to [-20, 20]
* Used a rolling mean to visualize the boat speed signal
* Replaced outliers in wind_speed
* Replaced NA's in boat_heel, target_angle and wind_speed
* Added a new feature
* Looked at the correlation matrix

In [None]:
# Save df to pickle
df.to_pickle('data_clean.pkl')