## Imports & Notebook Setup

We will import the necessary packages utilized for EDA/preprocessing in this notebook:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

- `pandas` will help us load the CSV, handle missing values, etc.
- `numpy` will help us with numerical operations, arrays, and other data handling tools.
- `matplotlib` will help us with base plotting.
- `seaborn` provides nicer visualizations and better defaults for base plotting.
- `sklearn.model_selection`'s `train_test_split` will help us create train/test sets.
- `sklearn.preprocessing`'s `StandardScaler` and `OneHotEncoder` will help us scale numeric features and encode categorical variables, respectively.
- `sklearn.compose`'s `ColumnTransformer` will help us apply different transforms to numeric and categorical columns in one go.

## Initial EDA

We import the data from `ais_data.csv` to understand what we are working with in terms of the dataset's structure, quality, and the raw patterns that exist. This will help us determine what we need to keep and drop, which will assist us in creating a schematic for our preprocessing tasks.

### Loading & Inspecting Dataset Structure

Understand the shape, feature types, and identify ID-like columns.

In [None]:
ais_data = pd.read_csv("../data/ais_data.csv")

# Peek at structure
ais_data.head()

Unnamed: 0.1,Unnamed: 0,mmsi,navigationalstatus,sog,cog,heading,shiptype,width,length,draught
0,0,219019621,Unknown value,0.0,86.0,86.0,Fishing,4.0,9.0,
1,1,265628170,Unknown value,0.0,334.5,,Port tender,8.0,27.0,
2,2,219005719,Unknown value,0.0,208.7,,Fishing,4.0,11.0,
3,3,219028066,Unknown value,0.0,,,Pleasure,3.0,12.0,
4,4,212584000,Moored,0.0,153.0,106.0,Cargo,13.0,99.0,6.3


In [None]:
# Data types + missing counts
ais_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358351 entries, 0 to 358350
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Unnamed: 0          358351 non-null  int64  
 1   mmsi                358351 non-null  int64  
 2   navigationalstatus  358351 non-null  object 
 3   sog                 357893 non-null  float64
 4   cog                 355182 non-null  float64
 5   heading             337737 non-null  float64
 6   shiptype            358351 non-null  object 
 7   width               354640 non-null  float64
 8   length              354608 non-null  float64
 9   draught             332808 non-null  float64
dtypes: float64(6), int64(2), object(2)
memory usage: 27.3+ MB


In [None]:
# Numerical summary
ais_data.describe()

Unnamed: 0.1,Unnamed: 0,mmsi,sog,cog,heading,width,length,draught
count,358351.0,358351.0,357893.0,355182.0,337737.0,354640.0,354608.0,332808.0
mean,186757.775285,293967800.0,12.122554,189.064529,190.076829,19.947854,124.971549,6.571402
std,112181.60187,121386600.0,9.355851,107.588825,107.107604,10.808627,71.268183,2.934392
min,0.0,9112856.0,0.0,0.0,0.0,1.0,2.0,0.4
25%,89587.5,219578000.0,9.2,116.3,120.0,12.0,83.0,4.6
50%,179947.0,248659000.0,11.3,168.7,170.0,17.0,115.0,6.1
75%,283503.5,304665000.0,13.3,300.175,303.0,28.0,181.0,7.9
max,387581.0,992195000.0,214.0,359.9,507.0,78.0,690.0,25.5


In [19]:
# List of ALL column names
all_columns = list(ais_data.columns)
print("All columns:", all_columns)

# Target column identification
target_column = "navigationalstatus"
print("Target column:", target_column)

# ID fields (case insensitive/loose matching)
id_fields = []
possible_id_fields = ["mmsi", "imo", "callsign", "shipname"]
for col in ais_data.columns:
    if col.lower() in possible_id_fields:
        id_fields.append(col)
print("ID fields:", id_fields)

# Static features
static_features = []
for col in ais_data.columns:
    if col.lower() in ["width", "length", "draught", "shiptype"]:
        static_features.append(col)
print("Static features:", static_features)

# Dynamic features
dynamic_features = []
for col in ais_data.columns:
    if col.lower() in ["sog", "cog", "heading", "rot"]:
        dynamic_features.append(col)
print("Dynamic features:", dynamic_features)

All columns: ['Unnamed: 0', 'mmsi', 'navigationalstatus', 'sog', 'cog', 'heading', 'shiptype', 'width', 'length', 'draught']
Target column: navigationalstatus
ID fields: ['mmsi']
Static features: ['shiptype', 'width', 'length', 'draught']
Dynamic features: ['sog', 'cog', 'heading']


### Data Quality Checks 

Cleanliness assessment before deciding what to keep.

In [21]:
# Count missing values per column
ais_data.isna().sum()

Unnamed: 0                0
mmsi                      0
navigationalstatus        0
sog                     458
cog                    3169
heading               20614
shiptype                  0
width                  3711
length                 3743
draught               25543
dtype: int64

In [28]:
# Check for duplicate rows
dup_count = ais_data.duplicated().sum()
print(f"Duplicate rows found: {dup_count}")
if dup_count > 0:
    print("Duplicated rows:")
    print(ais_data[ais_data.duplicated()])

Duplicate rows found: 0


In [None]:
# Check for unrealistic Speeed-over-Ground (SOG) values: SOG < 0 or SOG > 60 knots.
unreal_sog = ais_data[(ais_data["sog"] < 0) | (ais_data["sog"] > 60)]
print("Unrealistic SOG rows:", len(unreal_sog))
unreal_sog[["mmsi", "navigationalstatus", "sog"]].head()

Unrealistic SOG rows: 2888


Unnamed: 0,mmsi,navigationalstatus,sog
1518,111219513,Under way using engine,69.2
1527,111219513,Under way using engine,102.2
1537,111219513,Under way using engine,102.2
1540,111219513,Under way using engine,102.2
1544,111219513,Under way using engine,102.2


In [33]:
# Check for unrealistic compass direciton/heading outside values: [0, 360] range in degrees.
unreal_heading = ais_data[ais_data["heading"] < 0] | ais_data[ais_data["heading"] > 360]
print("Unrealistic heading rows:", len(unreal_heading))
unreal_heading[["mmsi", "navigationalstatus", "heading"]].head()

Unrealistic heading rows: 4


Unnamed: 0,mmsi,navigationalstatus,heading
145692,False,False,False
157056,False,False,False
257390,False,False,False
354713,False,False,False


In [40]:
# Check for unrealistic draught (How deep the ship is submerged in the water in meters) values: Draught <= 0 m. 
print("Nonpositive draught rows:", len(unreal_draught))
unreal_draught[["mmsi", "shiptype", "length", "width", "draught"]].head()

Nonpositive draught rows: 0


Unnamed: 0,mmsi,shiptype,length,width,draught


In [42]:
# Check for unrealistic length/widths: Length <= 0 or Width <= 0 or Length > 400 or Width > 60.
unreal_size = ais_data[
    (ais_data["length"] <= 0) | (ais_data["width"] <= 0) |
    (ais_data["length"] > 400) | (ais_data["width"] > 60)
]
print("Unrealistic length/width rows:", len(unreal_size))
unreal_size[["mmsi", "shiptype", "length", "width"]].head()

Unrealistic length/width rows: 56


Unnamed: 0,mmsi,shiptype,length,width
76551,211422510,Tanker,690.0,28.0
124552,219013964,Reserved,295.0,78.0
124636,219013964,Reserved,295.0,78.0
124777,219013964,Reserved,295.0,78.0
124879,219013964,Reserved,295.0,78.0


In [None]:
# Speed-over-Ground (SOG) validity check
ais_data["sog"].describe()

count    357893.000000
mean         12.122554
std           9.355851
min           0.000000
25%           9.200000
50%          11.300000
75%          13.300000
max         214.000000
Name: sog, dtype: float64

In [None]:
# Compass direction/heading validity check
ais_data["heading"].describe()

count    337737.000000
mean        190.076829
std         107.107604
min           0.000000
25%         120.000000
50%         170.000000
75%         303.000000
max         507.000000
Name: heading, dtype: float64

In [None]:
# Draught validity check
ais_data["draught"].describe()

count    332808.000000
mean          6.571402
std           2.934392
min           0.400000
25%           4.600000
50%           6.100000
75%           7.900000
max          25.500000
Name: draught, dtype: float64

In [None]:
# Length validity check
ais_data["length"].describe()


count    354608.000000
mean        124.971549
std          71.268183
min           2.000000
25%          83.000000
50%         115.000000
75%         181.000000
max         690.000000
Name: length, dtype: float64

In [None]:
# Width validity check
ais_data["width"].describe()

count    354640.000000
mean         19.947854
std          10.808627
min           1.000000
25%          12.000000
50%          17.000000
75%          28.000000
max          78.000000
Name: width, dtype: float64

## Data Preprocessing 

## Post-preprocessing EDA