# TDT 4173 - Machine Learning Project - Long Notebook

### Kaggle team name: 121 - Team Sondre Skjerven


#### Participants
Name: Hedda Flemmen Holum:) \
Student ID: 544531


Name: Sondre Skjerven \
Student ID: 564376


Name: Tallak Ravn \
Student ID: 544531

| Criteria                              | How we covered it                                                                                                   | Lookup keyword |
|---------------------------------------|-----------------------------------------------------------------------------------------------------------------|------------------------|
| EDA: Domain Knowledge                 | See below.                                                                                                      | [Domain]               |
| EDA: Check intuitive data             | Explain time series and how data must be understood.                                                         | [Intuitive]            |
| EDA: Understand data generation       | See below.                                                                                                      | [Generation]           |
| EDA: Explore individual features      | We made feature distribution plots                                                                              | [Individual]           |
| EDA: Explore feature pairs & groups   | Made plots for correlation between pv_measurement and the rest of the features. Also plotted correlation matrix with all features. | [Pairs & Groups]       |
| EDA: Clean up features                | Removed missing values, consecutive non-zeros, consecutive zeros over length 24                                 | [Cleaning]             |
| Multiple types of predictors          | We have tried Pycaret, pure catboost and Autogluon as predictors. Both Pycaret and Autogluon are AutoML models which try several different base models. | [Multiple predictors]  |
| Feature engineering                   | Made cosine and sine transformations of the hour, week and month columns.                                      | [Feature engineering]  |
| Model interpretation                  | Plotted feature importance for the Pycaret model to see what features have the most explainability. Plotted predictions for all models.            | [Model interpretation] |


# Reproduction of results

The notebooks assume that the data folder is in the same folder as the notebook. 

Folder structure:

short_notebook1.ipynb \
short_notebook2.ipynb\
long_notebook_ML.ipynb\
data (folder)


In [None]:
# Uncomment and run the following to install the required packages
# (the kernel may need a restart after installing nbformat)
# !pip install 'nbformat>=4.2.0'
# !pip install wheel
# !pip install -r requirements.txt

# Domain Knowledge / Data Generation / Data Intuition

### AIS Vessel Trajectory Prediction [Domain] Knowledge

Hedda, fix:

Solar power is generated by photovoltaic (PV) cells absorbing the energy of photons emitted by the sun. At the most basic level, the output of solar panels (the amount of energy they produce) therefore depends on how much sunlight they are exposed to (DOE). Panels consequently cannot produce energy when the sun is down, i.e. at night. Still, there is more to it. Photons, which can be thought of as solar rays, must strike the panels at a certain minimum angle to be absorbed rather than reflected (OSTI). Even though interaction with particles in the atmosphere distorts the direction of photons, their direction is predominantly given by the sun's angle relative to the surface. Therefore, in the early morning and late afternoon, one can expect fewer photons to strike the panels at a sufficiently steep angle than around the middle of the day, when the sun is at its highest. This results in a gradual increase and decrease in generated PV energy on either side of midday.

Beyond this, other factors affect the panels' output. These include anything that blocks sunrays from striking the panels even when the sun is up: fog, clouds or general air humidity, which decrease the direct radiation, but also things that cover the panels themselves, such as snow. Additionally, the temperature of the panels affects their efficiency, as heat increases electrical resistance and therefore lowers output. For this reason, surface temperature might be an important factor, as warmer weather raises the temperature of the panels. Wind should also be taken into consideration, as higher wind speeds should have a cooling effect. A non-external factor worth considering is aging: panels generally perform worse over time, with performance typically decreasing by 0.5% each year (NRL).

OSTI: https://www.osti.gov/servlets/purl/1617300, retrieved 09.11.2023

DOE: https://www.energy.gov/eere/solar/how-does-solar-work, retrieved 09.11.2023

NRL: https://www.nrel.gov/state-local-tribal/blog/posts/stat-faqs-part2-lifetime-of-pv-panels.html, retrieved 09.11.2023




### Data generation and structure

The main dataset in the project is the AIS data ("Automatic Identification System"). An international collection of vessels is equipped with a system that allows real-time tracking of a number of variables. The dataset provides real-time information about the individual vessels, including their speed and course. This kind of data is critical for well-functioning and safe international maritime traffic. Examples of specific use cases are navigation and collision avoidance. As emphasized in the documentation and the project handout, the data might include inaccuracies due to signal errors or human error. This will be addressed later.

It's also worth giving some introductory information about the data structure. A single row/record in the AIS dataset represents a single observation of a specific vessel at a specific time, including variables such as the observed speed and course. Filtering on a single vessel (by its uniquely identifying 'vesselId') and sorting by time yields the overall observed trajectory for that vessel (as illustrated by the handout 'Visualize_last_vessel_pos.ipynb'). A row is uniquely identified by its timestamp ('time') combined with the vessel ('vesselId').
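The filtering described above can be sketched as follows. This is a minimal illustration on a tiny hand-made table, not the handout's code: the values are invented, and `vessel_trajectory` is a hypothetical helper name. Only the column names `time`, `vesselId`, `latitude` and `longitude` come from the dataset.

```python
import pandas as pd

# Tiny stand-in for the AIS table (values are illustrative only).
ais = pd.DataFrame({
    "time": ["2024-01-01 00:10:00", "2024-01-01 00:00:00",
             "2024-01-01 00:05:00", "2024-01-01 00:20:00"],
    "vesselId": ["A", "A", "B", "A"],
    "latitude": [10.2, 10.0, 55.0, 10.4],
    "longitude": [20.2, 20.0, 5.0, 20.4],
})
ais["time"] = pd.to_datetime(ais["time"])


def vessel_trajectory(df, vessel_id):
    """Return all observations of one vessel, ordered by time."""
    return (
        df[df["vesselId"] == vessel_id]
        .sort_values("time")
        .reset_index(drop=True)
    )


trajectory_a = vessel_trajectory(ais, "A")

# ('time', 'vesselId') should identify a row uniquely, so no duplicates are expected.
has_duplicates = ais.duplicated(subset=["time", "vesselId"]).any()
```

Running the same duplicate check on the full training set is a cheap way to confirm the uniqueness claim before relying on it.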

The handout also includes a vessel dataset, with information regarding the shipping line, volume, type and dimensions among others. This dataset can be linked to AIS as dimensional data joined on the vesselIds. Furthermore, there is also a port dataset, with information regarding the ports with corresponding position, name and countries. This set can also be combined with the AIS data using the 'portId' column.
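A minimal sketch of how these joins might look in pandas. The join keys `vesselId` and `portId` are from the datasets; the columns `length` and `port_name` and all values are hypothetical placeholders. Left joins are used so that AIS rows without a matching port (for example rows with a missing `portId`) are kept rather than dropped.

```python
import pandas as pd

ais = pd.DataFrame({
    "vesselId": ["A", "B", "A"],
    "portId": ["P1", "P2", None],  # the last row has no port set
    "sog": [0.7, 11.0, 3.2],
})
# Hypothetical dimension tables: only the key columns match the real data.
vessels = pd.DataFrame({"vesselId": ["A", "B"], "length": [199.0, 120.0]})
ports = pd.DataFrame({"portId": ["P1", "P2"], "port_name": ["Port One", "Port Two"]})

# Left joins keep every AIS observation, matched or not.
enriched = (
    ais.merge(vessels, on="vesselId", how="left")
       .merge(ports, on="portId", how="left")
)
```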


### Data intuition

It's worth dedicating some attention to understanding the data and building a basic intuition for the relation between the training and the test set. To do this, we will use snippets from the sets. The train set is represented by its first 50 000 rows:

In [2]:
import pandas as pd

train_set = pd.read_csv('first_50000_rows.csv', sep='|')
train_set.head(5)

Unnamed: 0,time,cog,sog,rot,heading,navstat,etaRaw,latitude,longitude,vesselId,portId
0,2024-01-01 00:00:25,284.0,0.7,0,88,0,01-09 23:00,-34.7437,-57.8513,61e9f3a8b937134a3c4bfdf7,61d371c43aeaecc07011a37f
1,2024-01-01 00:00:36,109.6,0.0,-6,347,1,12-29 20:00,8.8944,-79.47939,61e9f3d4b937134a3c4bff1f,634c4de270937fc01c3a7689
2,2024-01-01 00:01:45,111.0,11.0,0,112,0,01-02 09:00,39.19065,-76.47567,61e9f436b937134a3c4c0131,61d3847bb7b7526e1adf3d19
3,2024-01-01 00:03:11,96.4,0.0,0,142,1,12-31 20:00,-34.41189,151.02067,61e9f3b4b937134a3c4bfe77,61d36f770a1807568ff9a126
4,2024-01-01 00:03:51,214.0,19.7,0,215,0,01-25 12:00,35.88379,-5.91636,61e9f41bb937134a3c4c0087,634c4de270937fc01c3a74f3


In [3]:
test_set = pd.read_csv('ais_test.csv')
test_set.head(5)

Unnamed: 0,ID,vesselId,time,scaling_factor
0,0,61e9f3aeb937134a3c4bfe3d,2024-05-08 00:03:16,0.3
1,1,61e9f473b937134a3c4c02df,2024-05-08 00:06:17,0.3
2,2,61e9f469b937134a3c4c029b,2024-05-08 00:10:02,0.3
3,3,61e9f45bb937134a3c4c0221,2024-05-08 00:10:34,0.3
4,4,61e9f38eb937134a3c4bfd8d,2024-05-08 00:12:27,0.3


The main thing to note here is that the training set includes a number of features that are not present in the test set. As an immediate conclusion, we can say that we will have to find a work-around for this. The core idea behind machine learning regression is to use a target variable's relationship with other variables to make predictions about the target variable when we don't have access to the target variable itself. A model trained on data with columns A, B, C, D and E might be able to say something about the probable values of E when only provided data with columns A, B, C and D if there are indeed underlying relationships between E and any of the remaining columns.

Quickly looking at the datasets, we see that the only variables in common between the train and the test set are the "vesselId" and the "time" columns. Without any modification, and using our previous illustration, a simple regression using these datasets directly would therefore require that the model should predict positions ***purely*** based on a vesselId and a time, without any other information. This surely appears to be a hard task. Adding to that the fact that all the test records appear at a later stage in time than all the train records, we can conclude that we really have to structure things in a different manner.

We do, however, have more information relevant to the test set. Even though we can't create columns such as 'rot', 'cog' and 'sog' from thin air, we can leverage the fact that we have access to historical data. A core idea to get comfortable with as we move along is that the data records should be viewed in relation ***to each other*** as time series, not just by themselves.

Without diving into the technicals quite yet, we will reveal that looking at the ***earlier positions*** of the vessel will be a key way to address this data complexity issue. We can for example, at all points in time, know with certainty what was the ***last observed*** position of a vessel.
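One possible way to attach the last observed position to each test row is `pandas.merge_asof`, sketched below on invented data. This is an illustration under assumptions, not necessarily the approach used later in the notebook: the `last_*` column names and `hours_since_last` are hypothetical, and a vessel with no earlier observation would simply get missing values.

```python
import pandas as pd

train = pd.DataFrame({
    "time": pd.to_datetime(["2024-05-01 10:00", "2024-05-02 12:00", "2024-05-03 08:00"]),
    "vesselId": ["A", "B", "A"],
    "latitude": [10.0, 55.0, 10.5],
    "longitude": [20.0, 5.0, 20.5],
})
test = pd.DataFrame({
    "ID": [0, 1],
    "vesselId": ["B", "A"],
    "time": pd.to_datetime(["2024-05-08 00:03", "2024-05-08 00:06"]),
})

# Rename so the historical columns stay distinguishable after the join.
last_known = train.rename(columns={
    "time": "last_time",
    "latitude": "last_latitude",
    "longitude": "last_longitude",
}).sort_values("last_time")

# For each test row, pick the most recent earlier observation of the same vessel.
features = pd.merge_asof(
    test.sort_values("time"),
    last_known,
    left_on="time",
    right_on="last_time",
    by="vesselId",
    direction="backward",
)
features["hours_since_last"] = (
    (features["time"] - features["last_time"]).dt.total_seconds() / 3600
)
```

The elapsed time since the last observation is kept as a feature because the last known position is presumably less informative the older it is.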

# Exploring the data / EDA

### Initial look at the columns

To wrap our heads around the data we are dealing with, we explore our dataset a bit. For now, we will be dealing exclusively with the training set, as the test set contains only timestamps and vesselIds.

In [None]:
train_set.info()

# TODO: argue that portId can be removed because of all the missing values
# TODO: can etaRaw be removed as well?

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   time       50000 non-null  object 
 1   cog        50000 non-null  float64
 2   sog        50000 non-null  float64
 3   rot        50000 non-null  int64  
 4   heading    50000 non-null  int64  
 5   navstat    50000 non-null  int64  
 6   etaRaw     50000 non-null  object 
 7   latitude   50000 non-null  float64
 8   longitude  50000 non-null  float64
 9   vesselId   50000 non-null  object 
 10  portId     49797 non-null  object 
dtypes: float64(4), int64(3), object(4)
memory usage: 4.2+ MB


As we see, the set has very few missing data points: only portId has any, with 49797 of 50000 values present. This is lucky for us, as missing values require handling, and we as data analysts are responsible for doing that in an appropriate way. The data might, however, still be bad. We will dive deeper into that aspect later. For now, we can conclude that no columns need to be excluded purely because of too many missing values.
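The same conclusion can be read off as a share of missing values per column, which is easier to compare across columns than raw counts. A small sketch on invented data (the real call would be on `train_set`):

```python
import pandas as pd

# Stand-in frame; in the notebook this would be train_set.
df = pd.DataFrame({
    "sog": [0.7, 0.0, 11.0, 0.0],
    "portId": ["P1", None, "P2", "P1"],
})

# Fraction of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)
```

For the 50 000-row sample above, the corresponding share for portId is 203 / 50000, or roughly 0.4%.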

Moving on, we would like to see the different values and ranges appearing in the columns:

In [12]:
for column in train_set.columns:
    print(f"{column} - Unique values count: {train_set[column].nunique()}")

time - Unique values count: 31859
cog - Unique values count: 3597
sog - Unique values count: 243
rot - Unique values count: 73
heading - Unique values count: 361
navstat - Unique values count: 7
etaRaw - Unique values count: 621
latitude - Unique values count: 25780
longitude - Unique values count: 26269
vesselId - Unique values count: 435
portId - Unique values count: 380


As we see, most of the columns take a wide range of values. This makes sense given the descriptions in the 'Dataset definitions and explanations' file in the handout. In fact, most columns are continuous. This means that the set of possible values for the column is infinitely large (at least in theory), even though there may be restrictions on the minimum and maximum values. An example is longitude and latitude, which are recorded as float values (decimals). There are infinitely many decimal numbers between any two distinct decimal numbers, so the range of possible longitudes and latitudes contains an unlimited number of values. To be fair, the devices recording the longitudes and latitudes (and other variables) are probably subject to physical constraints and can only track position to a certain number of decimals. So, in practice, there may in fact be a finite number of possible values for these. Nevertheless, we will treat the sets of possible values as infinitely large.

The only columns that are truly categorical are navstat, vesselId and portId. 

- ***Navstat***: Indicating one of 16 different 'navigational statuses' (documentation). These are discrete, mutually exclusive states implemented through integer values from 0 to 15. There are only 16 possible values (and possibly fewer in practice).

- ***vesselId***: Indicating which of the 711 possible vessels is observed in the record. These are also discrete, mutually exclusive IDs referring to vessels present in the 'vessels.csv' file.

- ***portId***: Similar to the vesselId: indicating which of the 1329 possible ports is set by the captain as either destination or origin port. Can only be an existing port.
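A short sketch of how these columns could be marked as categorical in pandas. Declaring all 16 navstat codes up front (rather than inferring them from the data) is a design choice made here for illustration, so that codes absent from a sample are still known categories; the data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "navstat": [0, 1, 0, 5],
    "vesselId": ["A", "B", "A", "C"],
})

# navstat has 16 possible codes (0-15), even if only some are observed.
navstat_dtype = pd.CategoricalDtype(categories=list(range(16)))
df["navstat"] = df["navstat"].astype(navstat_dtype)

# For IDs, the categories are simply inferred from the values present.
df["vesselId"] = df["vesselId"].astype("category")

n_possible = len(df["navstat"].cat.categories)
n_observed = df["navstat"].nunique()
```

This mirrors the training sample above, where only 7 of the 16 possible navstat values actually occur.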

### Distribution of values
Hedda, make histograms that make sense, with comments

# Data reading & preprocessing

## Preprocessing and cleaning

Before training a regression model on the data set, it is essential that we convert the data to a format that the model can actually work with. This includes encoding some of the columns and converting data types.
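As a small illustration of such conversions, the sketch below parses the `time` column into datetimes and turns `etaRaw` into a timestamp. Judging by the sample rows, `etaRaw` has the form `MM-DD HH:MM` with no year, so the year `2024` is an assumption taken from the sample's `time` values, and `parse_eta` is a hypothetical helper that is not part of the handout.

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2024-01-01 00:00:25", "2024-01-01 00:00:36"],
    "etaRaw": ["01-09 23:00", "00-00 24:00"],  # second value is deliberately invalid
})

# Timestamps are read as strings (dtype object) and must be parsed.
df["time"] = pd.to_datetime(df["time"])


def parse_eta(eta_raw, year=2024):
    """Parse 'MM-DD HH:MM' strings; unparseable values become NaT.

    The year is an assumption, since etaRaw itself carries none.
    """
    return pd.to_datetime(
        str(year) + "-" + eta_raw,
        format="%Y-%m-%d %H:%M",
        errors="coerce",
    )


df["eta"] = parse_eta(df["etaRaw"])
```

A fixed year is only a rough approximation: an ETA of `12-29` reported on 1 January more likely refers to the previous year, which a real pipeline would have to handle.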

## Preprocess pipelines

Example:

For the preprocessing, we have used the following functions:

1. Aggregate the data from 15-minute intervals to hourly
2. Join the features with the target variable
3. Remove consecutive non-zero values in pv_measurement
4. Remove runs of consecutive zero values longer than 24 from pv_measurement
5. Remove missing values in the pv_measurement column, remove some of the columns with a lot of missing values
6. Add new features (cosine and sine transformations of hour, months and week)
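The last step can be sketched as below. The idea of a sine/cosine transformation is to place a periodic quantity on a circle, so that hour 23 ends up close to hour 0 instead of far away. The helper name `add_cyclical_features` and the output column names are hypothetical, and using 52 as the period for ISO weeks is an approximation (some years have 53).

```python
import numpy as np
import pandas as pd


def add_cyclical_features(df, time_col="time"):
    """Add sine/cosine encodings of hour, ISO week and month."""
    out = df.copy()
    t = pd.to_datetime(out[time_col])
    parts = {
        "hour": (t.dt.hour, 24),
        "week": (t.dt.isocalendar().week.astype(int), 52),
        "month": (t.dt.month, 12),
    }
    for name, (values, period) in parts.items():
        angle = 2 * np.pi * values / period
        out[f"{name}_sin"] = np.sin(angle)
        out[f"{name}_cos"] = np.cos(angle)
    return out


demo = pd.DataFrame({
    "time": ["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 23:00"],
})
encoded = add_cyclical_features(demo)
```

Both the sine and the cosine are needed: with the sine alone, 03:00 and 09:00 would map to the same value.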

In [None]:
# Preprocess training data

# Exploratory Data Analysis

## Check if data has intuitive patterns

### Features with high positive, negative or low correlation with the target variable

In [None]:
# Make feature importance plot here

### Correlation heatmaps for all features [Pairs & Groups]

In [None]:
# Make heatmaps, one with "features with high correlation",
# and one with "features with low correlation"

## Exploring differences between observed and estimated data

### Time series plot of estimated and observed data

Example conclusion: From the plots below, it looks like there is little difference between the observed and estimated data.

### Distribution plot of estimated and observed data

Example of another observation:

However, if we compare the distribution of the data from observed and estimated, we see that there are some differences in the values that the estimated take on (especially in t_1000 and sun_elevation). 

The difference in distribution for the sun_elevation feature can be explained by the fact that the estimated data spans October to May, meaning that the winter months are overrepresented. This skews the distribution of sun_elevation compared with the observed data, which spans multiple years. Since sun_elevation is lower during the winter months, the distribution for estimated sun_elevation consists of fewer values of high elevation (>20). The same logic applies to t_1000hPa:K, the temperature in Kelvin at 1000 hPa pressure.

We believe these differences are negligible.

# Model testing [Multiple predictors]

Example:

We have tested three types of models: Pycaret, Autogluon and Catboost (and variations/ensembles/stackings of these).

In our comparative analysis, each model showed both strengths and weaknesses, with variations in performance metrics across different datasets. Pycaret showcased versatility in automating the ML workflow, Autogluon excelled in time efficiency and ease of use, while Catboost demonstrated superior handling of categorical data and complex relationships. The incorporation of ensemble and stacking techniques further enhanced predictive capabilities, underscoring the potential of hybrid approaches in tackling diverse data challenges.

## Pycaret

Example:

The group only tried one Pycaret pipeline before changing to Autogluon + Catboost.

We tried various hyperparameters, including training time, number of folds, with/without eval set, eval size, none of which ended up outperforming Autogluon or Catboost.

## Autogluon

Example:

For Autogluon, we tried training one model for each location, as well as one model that predicted on all three locations. The first method yielded the best results, both methods are listed below. 

In addition, we tried a lot of different hyperparameters for Autogluon, experimenting and seeing which combinations gave the best metrics, both locally and on Kaggle. 

### Autogluon V1

Example: This model trains one model for each location.

### Autogluon V2

Example: 3 Autogluon models with tuning data, no new features, zero pv values during daytime removed, eval dataset randomly sampled from the full dataset

### Autogluon V3

Example: 3 Autogluon models without tuning data, no new features, zero pv values during daytime removed

### Autogluon V4

Example: Experiment with longer training times, Autogluon stacking and ensembling.

### Autogluon V5

Example: Our best Autogluon model so far.

## Catboost

Example: Our first Catboost model was very simple. We later started experimenting with training multiple Catboost models with randomized parameters and randomized train/val splits, which proved to give decent results.

### Catboost V1

Example: Catboost with one model for all locations

### Catboost V2

Example: Catboost with one model for each location

### Catboost V3

Example: Catboost ensemble model, training 20 Catboost models with random train/val splits (to increase variance before averaging).

### Catboost V4

Example: Training 20 Catboost models with random train/val splits and random parameters (within given ranges)

### Catboost V5

Example: Our best Catboost model. Ensemble model.

## Average models

Example: In addition to Catboost and Autogluon, we found that combining them with a weighted average produced good scores.

Since our catboost models performed a bit better than the autogluon, we decided to weight the catboost models a bit higher.

The average models are the ones that performed the best on kaggle.

The submissions were produced by separately running the Autogluon and catboost models, then averaging the final test predictions.
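The averaging step itself is simple and can be sketched as follows. The weight of 0.6 is a hypothetical placeholder (the text above only states that Catboost was weighted somewhat higher), and `weighted_average` is an illustrative helper name.

```python
import numpy as np


def weighted_average(pred_autogluon, pred_catboost, catboost_weight=0.6):
    """Blend two prediction vectors; catboost_weight must lie in [0, 1]."""
    if not 0.0 <= catboost_weight <= 1.0:
        raise ValueError("catboost_weight must be between 0 and 1")
    a = np.asarray(pred_autogluon, dtype=float)
    c = np.asarray(pred_catboost, dtype=float)
    if a.shape != c.shape:
        raise ValueError("prediction arrays must have the same shape")
    return (1.0 - catboost_weight) * a + catboost_weight * c


# Equal weights reduce to a plain mean of the two predictions.
blended = weighted_average([1.0, 2.0], [3.0, 4.0], catboost_weight=0.5)
```

Both prediction vectors must be in the same row order (for example sorted by the test set's ID) before blending, since the average is taken element by element.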

### Model 1: Autogluon

#### Autogluon

#### Train one model for each location

#### Plotting

#### Submission

### Model 2: Catboost

#### Import libraries

### Ensemble: Weighted average of Model 1 and Model 2