## Explanation of the approach

Since the model must perform well on the **test set**, it should be trained on data that looks like the **test set**. In the test set, the longest time since a vessel was last observed (its last record in train) is 5 days. In other words, every vessel in the test set has been seen within the last 5 days.
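One way to check the 5-day claim is to compare each test timestamp with the vessel's last timestamp in train. The sketch below uses toy data; `train_df`, `test_df` and the assumption that the test file has `vesselId` and `time` columns are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for ais_train and ais_test (the test columns are an assumption)
train_df = pd.DataFrame({
    "vesselId": ["a", "a", "b"],
    "time": pd.to_datetime(["2024-01-01 00:00", "2024-01-03 12:00", "2024-01-02 06:00"]),
})
test_df = pd.DataFrame({
    "vesselId": ["a", "a", "b"],
    "time": pd.to_datetime(["2024-01-04 00:00", "2024-01-08 00:00", "2024-01-05 06:00"]),
})

# Last training timestamp per vessel
last_seen = train_df.groupby("vesselId")["time"].max()

# Gap between each test record and the vessel's last training record
gap = test_df["time"] - test_df["vesselId"].map(last_seen)

max_gap = gap.max()
print(max_gap)
```

On the real files, `max_gap` should come out at or below 5 days if the property described above holds.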

To make the model good at handling data that looks like the test set, we exploit this property: we can say something about how long it has been since a vessel was last seen. In practice, we split the train set into "chunks": for a given vessel, the records are divided into "5-day chunks", i.e. all records from the first five days of data for that vessel belong to the same chunk. The next five days belong to the next chunk, and so on. This applies to all vessels.
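As a minimal sketch of the chunking idea, a chunk number can be computed from the time elapsed since each vessel's first record. This follows the fixed five-day grid described above; the loop further down in the notebook instead restarts a window at the first record after 120 hours, so the boundaries can differ when a vessel has gaps. The `chunk` column name and the toy data are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "vesselId": ["a", "a", "a", "a", "b", "b"],
    "time": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-04 00:00", "2024-01-06 00:00", "2024-01-12 00:00",
        "2024-01-02 00:00", "2024-01-03 00:00",
    ]),
}).sort_values(["vesselId", "time"]).reset_index(drop=True)

window = pd.Timedelta(days=5)

# Time elapsed since each vessel's first record
elapsed = df["time"] - df.groupby("vesselId")["time"].transform("min")

# Integer chunk number: 0 for the first five days, 1 for the next five, and so on
df["chunk"] = elapsed // window
print(df)
```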

In practice, this is done by creating a new column called 'window-start', which marks when the current window started.


Next, the column 'time since window start' is added (named 'time-since-start' in the code), which gives how much time has passed since the start of the current window. That is, 'time since window start' resets to zero once it exceeds 5 days (120 hours), which means a new window starts. This column equals the difference between 'time' and 'window-start'.
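The reset behaviour can be illustrated on a handful of timestamps. The helper `assign_window_starts` below is a hypothetical name and the data is made up; it is only a sketch of the rule "open a new window once 120 hours have passed since the current window start".

```python
import pandas as pd

def assign_window_starts(times: pd.Series, window: pd.Timedelta) -> pd.Series:
    """Return the window start for each timestamp of one vessel (sorted by time)."""
    starts = []
    current_start = None
    for t in times:
        # Open a new window on the first record, or once the current one has run out
        if current_start is None or t >= current_start + window:
            current_start = t
        starts.append(current_start)
    return pd.Series(starts, index=times.index)

df = pd.DataFrame({
    "vesselId": ["a"] * 4,
    "time": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-04 00:00", "2024-01-07 00:00", "2024-01-08 00:00",
    ]),
})

window = pd.Timedelta(hours=120)
df["window-start"] = df.groupby("vesselId")["time"].transform(
    lambda s: assign_window_starts(s, window)
)
df["time-since-start"] = df["time"] - df["window-start"]
print(df)
```

The third record arrives six days after the first, so it opens a new window and its 'time-since-start' is zero again.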

Next, the column 'last-known-time' is added. It points back to the timestamp of the **last value** in the **previous** window, i.e. the timestamp of the **previous observed position**. 'Time-since-last-known' is also added (named 'time-since-last-known-time' in the code): the difference between the record's own timestamp and the timestamp of the closing value in the previous window/chunk.

Next, lag columns such as 'last-known-lat' are added (and likewise for longitude, etc.). This column points to **the last** latitude value in the **previous** window. In other words, all records in the same chunk/window point back to the same value, namely the **closing value** of the previous window, just like 'last-known-time'.
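The lag columns could be built roughly as follows. This is a sketch under assumptions: it expects a 'window-start' column to exist already, rows sorted by vessel and time, and the column name 'last-known-long' is illustrative. It is not the code used in the cells below.

```python
import pandas as pd

df = pd.DataFrame({
    "vesselId": ["a", "a", "a", "a"],
    "time": pd.to_datetime(["2024-01-01", "2024-01-04", "2024-01-07", "2024-01-08"]),
    "window-start": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-07", "2024-01-07"]),
    "latitude": [10.0, 11.0, 12.0, 13.0],
    "longitude": [20.0, 21.0, 22.0, 23.0],
})

keys = ["vesselId", "window-start"]

# Closing record of every (vessel, window) pair
closing = df.groupby(keys).tail(1)[keys + ["time", "latitude", "longitude"]]
closing = closing.rename(columns={
    "time": "last-known-time",
    "latitude": "last-known-lat",
    "longitude": "last-known-long",
})

# Shift by one window within each vessel, so a window sees the previous window's closing record
lag_cols = ["last-known-time", "last-known-lat", "last-known-long"]
closing[lag_cols] = closing.groupby("vesselId")[lag_cols].shift(1)

# Broadcast the lagged values to every record in the window
df = df.merge(closing, on=keys, how="left")
df["time-since-last-known-time"] = df["time"] - df["last-known-time"]
print(df)
```

Records in a vessel's first window have no previous window, so their lag columns stay empty (NaN/NaT).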



In [15]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

In [16]:
ais_test = pd.read_csv('../ais_test.csv')
ais_train = pd.read_csv('../first_50000_rows.csv', sep='|')

In [17]:
ais_train

Unnamed: 0,time,cog,sog,rot,heading,navstat,etaRaw,latitude,longitude,vesselId,portId
0,2024-01-01 00:00:25,284.0,0.7,0,88,0,01-09 23:00,-34.74370,-57.85130,61e9f3a8b937134a3c4bfdf7,61d371c43aeaecc07011a37f
1,2024-01-01 00:00:36,109.6,0.0,-6,347,1,12-29 20:00,8.89440,-79.47939,61e9f3d4b937134a3c4bff1f,634c4de270937fc01c3a7689
2,2024-01-01 00:01:45,111.0,11.0,0,112,0,01-02 09:00,39.19065,-76.47567,61e9f436b937134a3c4c0131,61d3847bb7b7526e1adf3d19
3,2024-01-01 00:03:11,96.4,0.0,0,142,1,12-31 20:00,-34.41189,151.02067,61e9f3b4b937134a3c4bfe77,61d36f770a1807568ff9a126
4,2024-01-01 00:03:51,214.0,19.7,0,215,0,01-25 12:00,35.88379,-5.91636,61e9f41bb937134a3c4c0087,634c4de270937fc01c3a74f3
...,...,...,...,...,...,...,...,...,...,...,...
49995,2024-01-05 07:36:12,5.5,6.1,0,0,0,01-04 20:00,-24.03371,-46.34749,61e9f410b937134a3c4c0049,61d36fdf0a1807568ff9a1b0
49996,2024-01-05 07:36:13,94.9,0.0,0,31,5,01-02 18:00,39.64086,-0.22345,61e9f468b937134a3c4c0289,61d37fb629b60f6113c89e99
49997,2024-01-05 07:36:16,221.9,17.8,0,223,0,01-11 22:00,34.46860,138.21546,61e9f3aab937134a3c4bfe0f,61d37a221366c3998241d928
49998,2024-01-05 07:36:16,324.1,0.0,0,268,5,01-03 20:30,53.33446,7.16211,61e9f397b937134a3c4bfdaf,61d375e893c6feb83e5eb3e4


In [18]:
# Convert 'time' column to datetime format for time-based calculations
ais_train['time'] = pd.to_datetime(ais_train['time'])

# Sort by vesselId and time to ensure correct ordering
ais_train = ais_train.sort_values(by=['vesselId', 'time']).reset_index(drop=True)

# Set window duration to 5 days (120 hours)
window_duration = pd.Timedelta(hours=120)

# Initialize lists to store the new columns' values
window_starts = []
time_since_starts = []
last_known_times = []
time_since_last_known_time = []

# Initialize dictionaries to keep track of per-vessel state
vessel_window_start = {}         # Start time of the current window
vessel_previous_window_end = {}  # Timestamp of the last record in the previous window
vessel_last_seen = {}            # Timestamp of the most recent record processed

# Process each row to compute values for the new columns
for idx, row in ais_train.iterrows():
    vessel_id = row['vesselId']
    current_time = row['time']

    # Set or update the window start for the vessel
    if vessel_id in vessel_window_start:
        # Check if current time exceeds the 5-day window from the last window start
        if current_time >= vessel_window_start[vessel_id] + window_duration:
            # The previous window closes at the vessel's last observed record,
            # and a new window starts at the current record
            vessel_previous_window_end[vessel_id] = vessel_last_seen[vessel_id]
            vessel_window_start[vessel_id] = current_time
    else:
        # Initialize the first window start for this vessel
        vessel_window_start[vessel_id] = current_time
        vessel_previous_window_end[vessel_id] = pd.NaT  # No previous window initially

    # Calculate time since the window start
    time_since_start = current_time - vessel_window_start[vessel_id]

    # The last known time is the last record of the previous window
    last_known = vessel_previous_window_end[vessel_id]
    if pd.notnull(last_known):
        time_since_last = current_time - last_known
    else:
        time_since_last = pd.NaT  # First window: nothing to point back to

    # Append computed values to the lists
    window_starts.append(vessel_window_start[vessel_id])
    time_since_starts.append(time_since_start)
    last_known_times.append(last_known)
    time_since_last_known_time.append(time_since_last)

    # Remember this record as the vessel's most recent observation
    vessel_last_seen[vessel_id] = current_time

# Create new columns in the dataframe
ais_train['window-start'] = window_starts
ais_train['time-since-start'] = time_since_starts
ais_train['last-known-time'] = last_known_times
ais_train['time-since-last-known-time'] = time_since_last_known_time

In [19]:
ais_train

Unnamed: 0,time,cog,sog,rot,heading,navstat,etaRaw,latitude,longitude,vesselId,portId,window-start,time-since-start,last-known-time,time-since-last-known-time
0,2024-01-01 00:14:36,348.0,0.0,0,333,5,12-29 21:00,51.30883,3.23027,61e9f38eb937134a3c4bfd8d,61d36f9a0a1807568ff9a156,2024-01-01 00:14:36,0 days 00:00:00,NaT,NaT
1,2024-01-01 00:35:36,8.0,0.0,0,333,5,12-29 21:00,51.30882,3.23025,61e9f38eb937134a3c4bfd8d,61d36f9a0a1807568ff9a156,2024-01-01 00:14:36,0 days 00:21:00,NaT,NaT
2,2024-01-01 00:56:34,20.0,0.0,0,333,5,12-29 21:00,51.30882,3.23027,61e9f38eb937134a3c4bfd8d,61d36f9a0a1807568ff9a156,2024-01-01 00:14:36,0 days 00:41:58,NaT,NaT
3,2024-01-01 01:17:35,6.0,0.0,0,334,5,12-29 21:00,51.30880,3.23023,61e9f38eb937134a3c4bfd8d,61d36f9a0a1807568ff9a156,2024-01-01 00:14:36,0 days 01:02:59,NaT,NaT
4,2024-01-01 01:35:36,353.0,0.0,0,334,5,12-29 21:00,51.30882,3.23030,61e9f38eb937134a3c4bfd8d,61d36f9a0a1807568ff9a156,2024-01-01 00:14:36,0 days 01:21:00,NaT,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2024-01-05 05:56:19,0.0,0.0,-2,135,5,01-05 05:00,53.86176,8.72497,clh6aqawa0007gh0z9h6zi9bo,61d375e793c6feb83e5eb3e3,2024-01-01 22:15:13,3 days 07:41:06,NaT,NaT
49996,2024-01-05 06:17:19,0.0,0.0,3,135,5,01-05 05:00,53.86175,8.72496,clh6aqawa0007gh0z9h6zi9bo,61d375e793c6feb83e5eb3e3,2024-01-01 22:15:13,3 days 08:02:06,NaT,NaT
49997,2024-01-05 06:35:19,0.0,0.0,-2,135,5,01-05 05:00,53.86178,8.72498,clh6aqawa0007gh0z9h6zi9bo,61d375e793c6feb83e5eb3e3,2024-01-01 22:15:13,3 days 08:20:06,NaT,NaT
49998,2024-01-05 06:56:19,0.0,0.0,2,134,5,01-05 05:00,53.86178,8.72498,clh6aqawa0007gh0z9h6zi9bo,61d375e793c6feb83e5eb3e3,2024-01-01 22:15:13,3 days 08:41:06,NaT,NaT


In [20]:
ais_train.to_csv('kodd.csv')