# SET UP

This notebook sets up the project environment and directory structure for the **SmartPort Delay Risk Scoring** project. 

It defines the project paths, imports the required libraries, and loads the initial dataset of container ship movements.

It also creates the validation, working, and sample datasets that will be used in subsequent notebooks for data quality checks, feature engineering, modeling, and production code preparation. The goal is to establish a clear, organized foundation before those steps begin.

## PROJECT ENVIRONMENT

### Create and activate environment
conda create -n smartport python=3.10
conda activate smartport

### Install dependencies
pip install -r requirements.txt

### Run Streamlit app
streamlit run app.py

## IMPORT LIBRARIES

In [1]:
import os
import numpy as np
import pandas as pd

# Autocomplete
%config IPCompleter.greedy=True

## DIRECTORY

In [2]:
root = '/Users/rober/'

### Project name

In [3]:
dir_name = 'smartport-delay-risk-scoring'

### Create the directory and project structure

In [4]:
path = root + dir_name

In [5]:
try:
    os.mkdir(path)
    os.mkdir(path + '/01_Documents')
    os.mkdir(path + '/02_Data')
    os.mkdir(path + '/02_Data/01_Raw')
    os.mkdir(path + '/02_Data/02_Validation')
    os.mkdir(path + '/02_Data/03_Working')
    os.mkdir(path + '/02_Data/04_Caches')
    os.mkdir(path + '/03_Notebooks')
    os.mkdir(path + '/03_Notebooks/01_Functions')
    os.mkdir(path + '/03_Notebooks/02_Development')
    os.mkdir(path + '/03_Notebooks/03_System')
    os.mkdir(path + '/04_Models')
    os.mkdir(path + '/05_Outputs')
    os.mkdir(path + '/09_Other')
    
except OSError:
    print("The directory %s has NOT been created" % path)
else:
    print("The directory %s has been successfully created" % path)

The directory /Users/rober/smartport-delay-risk-scoring has NOT been created
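One likely reason for the message above is that the project directory already existed: `os.mkdir` raises `FileExistsError` (a subclass of `OSError`) on the first call, so the remaining subdirectories are skipped. The sketch below is an alternative that can be re-run safely, assuming the same folder layout. The helper name `create_project_structure` and the temporary demo root are illustrative and not part of the original notebook.

```python
import tempfile
from pathlib import Path

# Same layout as the cell above; parent folders are created implicitly.
SUBDIRS = [
    "01_Documents",
    "02_Data/01_Raw",
    "02_Data/02_Validation",
    "02_Data/03_Working",
    "02_Data/04_Caches",
    "03_Notebooks/01_Functions",
    "03_Notebooks/02_Development",
    "03_Notebooks/03_System",
    "04_Models",
    "05_Outputs",
    "09_Other",
]


def create_project_structure(project_path):
    """Create the project folders, skipping any that already exist (hypothetical helper)."""
    project_path = Path(project_path)
    for sub in SUBDIRS:
        # parents=True builds intermediate folders; exist_ok=True makes re-runs harmless.
        (project_path / sub).mkdir(parents=True, exist_ok=True)
    return project_path


# Demo in a temporary location so the real project folder is not touched.
demo_root = Path(tempfile.mkdtemp())
demo_path = create_project_structure(demo_root / "smartport-delay-risk-scoring")
create_project_structure(demo_path)  # second call does not raise
```

In the notebook itself, the call would take `path` instead of the temporary demo path.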


## IMPORT DATA

In [6]:
file_name_data = 'tracking_db.csv' 

In [7]:
path_data_raw = path + '/02_Data/01_Raw/' + file_name_data 

df = pd.read_csv(path_data_raw, low_memory=False)
df

Unnamed: 0,id,updated,ship,imo,lat,long,sog,cog,hdg,depPort,etdSchedule,etd,atd,arrPort,etaSchedule,eta,ata
0,4136,05/04/2018 19:18,Megastar,9773064,60.1469,24.9135,6.3,219,216,FIHEL,05/04/2018 19:30,07/04/2018 15:29,2018-04-05 19:18:20,EETLL,05/04/2018 21:30,04/05/2018 21:25,04/05/2018 21:23
1,4137,05/04/2018 19:19,Megastar,9773064,60.1445,24.9100,11.6,217,217,FIHEL,05/04/2018 19:30,07/04/2018 15:29,2018-04-05 19:18:20,EETLL,05/04/2018 21:30,04/05/2018 21:25,04/05/2018 21:29
2,4138,05/04/2018 19:20,Megastar,9773064,60.1412,24.9061,14.2,198,199,FIHEL,05/04/2018 19:30,07/04/2018 15:29,2018-04-05 19:18:20,EETLL,05/04/2018 21:30,04/05/2018 21:25,04/05/2018 21:30
3,4139,05/04/2018 19:21,Star,9364722,59.4462,24.7726,3.7,17,159,EETLL,05/04/2018 19:30,07/04/2018 15:25,2018-04-05 19:21:17,FIHEL,05/04/2018 21:30,04/05/2018 21:26,04/05/2018 21:46
4,4140,05/04/2018 19:22,Megastar,9773064,60.1344,24.9056,15.9,179,179,FIHEL,05/04/2018 19:30,07/04/2018 15:29,2018-04-05 19:18:20,EETLL,05/04/2018 21:30,04/05/2018 21:25,04/05/2018 21:32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
823667,912179,15/03/2019 07:36,Finlandia,9214379,59.9242,24.8683,20.7,196,196,FIHEL,15/03/2019 07:00,15/03/2019 07:00,2019-03-15 06:51:54,EETLL,15/03/2019 09:15,,
823668,912180,15/03/2019 07:37,Finlandia,9214379,59.9198,24.8657,20.8,197,195,FIHEL,15/03/2019 07:00,15/03/2019 07:00,2019-03-15 06:51:54,EETLL,15/03/2019 09:15,,
823669,912181,15/03/2019 07:38,Finlandia,9214379,59.9148,24.8627,20.7,196,195,FIHEL,15/03/2019 07:00,15/03/2019 07:00,2019-03-15 06:51:54,EETLL,15/03/2019 09:15,,
823670,912182,15/03/2019 07:39,Finlandia,9214379,59.9082,24.8587,20.7,196,195,FIHEL,15/03/2019 07:00,15/03/2019 07:00,2019-03-15 06:51:54,EETLL,15/03/2019 09:15,,
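The `Unnamed: 0` column in the output above usually appears when a CSV was written with its row index and then read back without declaring it. The sketch below illustrates this on a tiny in-memory CSV (the text in `csv_text` is a made-up fragment that mimics the file layout, not the real file), assuming the first column of `tracking_db.csv` is indeed a saved index.

```python
import io

import pandas as pd

# Toy CSV whose first column has an empty header, as a saved index would.
csv_text = ",id,ship\n0,4136,Megastar\n1,4137,Megastar\n"

# Default read: the unnamed first column becomes a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

If that assumption holds for the real file, passing `index_col=0` to the `pd.read_csv` call above would have the same effect.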


The dataset contains tracking records similar to AIS (Automatic Identification System) data for container ships travelling between ports.

**Each row** represents a single **position update** for a ship, including timing, location, movement data and scheduled / actual departure and arrival information.



| **Field**       | **Description**                                                             |
| --------------- | --------------------------------------------------------------------------- |
| **id**          | Unique identifier for each data record                                      |
| **updated**     | Timestamp indicating when the record was last updated                       |
| **ship**        | Vessel name                                                                 |
| **imo**         | International Maritime Organization (IMO) number — unique vessel identifier |
| **lat**         | Latitude coordinate of vessel position                                      |
| **long**        | Longitude coordinate of vessel position                                     |
| **sog**         | Speed Over Ground (knots) — vessel speed relative to Earth                  |
| **cog**         | Course Over Ground (degrees) — actual direction of vessel movement          |
| **hdg**         | Heading (degrees) — direction the vessel’s bow is pointing                  |
| **depPort**     | Departure port code                                                         |
| **etdSchedule** | Scheduled Estimated Time of Departure                                       |
| **etd**         | Updated Estimated Time of Departure (may differ from schedule)              |
| **atd**         | Actual Time of Departure                                                    |
| **arrPort**     | Arrival port code                                                           |
| **etaSchedule** | Scheduled Estimated Time of Arrival                                         |
| **eta**         | Updated Estimated Time of Arrival (may differ from schedule)                |
| **ata**         | Actual Time of Arrival                                                      |




These fields will be used to build voyage-level features and to derive the target variable indicating whether a vessel experienced a relevant arrival delay.
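As a rough illustration of how such a target could be derived, the sketch below computes an arrival delay in minutes from `etaSchedule` and `ata` and flags it against a threshold. Everything specific here is an assumption: the rows are toy values (not taken from the dataset), the `%d/%m/%Y %H:%M` format must be verified per column in the real data, and the 15-minute threshold and the column names `arrival_delay_min` and `is_delayed` are hypothetical.

```python
import pandas as pd

# Toy rows, NOT real records; the last one has no actual arrival yet.
toy = pd.DataFrame({
    "etaSchedule": ["05/04/2018 21:30", "05/04/2018 21:30", "15/03/2019 09:15"],
    "ata": ["05/04/2018 21:23", "05/04/2018 21:46", None],
})

DATE_FORMAT = "%d/%m/%Y %H:%M"  # assumption: day-first timestamps
DELAY_THRESHOLD_MIN = 15        # hypothetical cut-off for a "relevant" delay

# Parse both columns; unparseable or missing values become NaT.
eta_schedule = pd.to_datetime(toy["etaSchedule"], format=DATE_FORMAT, errors="coerce")
ata = pd.to_datetime(toy["ata"], format=DATE_FORMAT, errors="coerce")

# Positive values mean the vessel arrived later than scheduled.
toy["arrival_delay_min"] = (ata - eta_schedule).dt.total_seconds() / 60

# Leave the flag empty where the delay is unknown instead of defaulting to False.
delay = toy["arrival_delay_min"]
toy["is_delayed"] = (delay > DELAY_THRESHOLD_MIN).mask(delay.isna())

print(toy)
```

Keeping the flag missing for voyages without an `ata` avoids silently labelling unfinished voyages as on time.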


### Rename columns

In [8]:
df = df.rename(columns={
    "id":"record_id",
    "updated":"updated_ts",
    "ship":"ship_name",
    "imo":"imo",
    "lat":"lat",
    "long":"lon",
    "sog":"sog",
    "cog":"cog",
    "hdg":"hdg",
    "depPort":"dep_port",
    "etdSchedule":"etd_schedule",
    "etd":"etd",
    "atd":"atd",
    "arrPort":"arr_port",
    "etaSchedule":"eta_schedule",
    "eta":"eta",
    "ata":"ata",
})

df

Unnamed: 0,record_id,updated_ts,ship_name,imo,lat,lon,sog,cog,hdg,dep_port,etd_schedule,etd,atd,arr_port,eta_schedule,eta,ata
0,4136,05/04/2018 19:18,Megastar,9773064,60.1469,24.9135,6.3,219,216,FIHEL,05/04/2018 19:30,07/04/2018 15:29,2018-04-05 19:18:20,EETLL,05/04/2018 21:30,04/05/2018 21:25,04/05/2018 21:23
1,4137,05/04/2018 19:19,Megastar,9773064,60.1445,24.9100,11.6,217,217,FIHEL,05/04/2018 19:30,07/04/2018 15:29,2018-04-05 19:18:20,EETLL,05/04/2018 21:30,04/05/2018 21:25,04/05/2018 21:29
2,4138,05/04/2018 19:20,Megastar,9773064,60.1412,24.9061,14.2,198,199,FIHEL,05/04/2018 19:30,07/04/2018 15:29,2018-04-05 19:18:20,EETLL,05/04/2018 21:30,04/05/2018 21:25,04/05/2018 21:30
3,4139,05/04/2018 19:21,Star,9364722,59.4462,24.7726,3.7,17,159,EETLL,05/04/2018 19:30,07/04/2018 15:25,2018-04-05 19:21:17,FIHEL,05/04/2018 21:30,04/05/2018 21:26,04/05/2018 21:46
4,4140,05/04/2018 19:22,Megastar,9773064,60.1344,24.9056,15.9,179,179,FIHEL,05/04/2018 19:30,07/04/2018 15:29,2018-04-05 19:18:20,EETLL,05/04/2018 21:30,04/05/2018 21:25,04/05/2018 21:32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
823667,912179,15/03/2019 07:36,Finlandia,9214379,59.9242,24.8683,20.7,196,196,FIHEL,15/03/2019 07:00,15/03/2019 07:00,2019-03-15 06:51:54,EETLL,15/03/2019 09:15,,
823668,912180,15/03/2019 07:37,Finlandia,9214379,59.9198,24.8657,20.8,197,195,FIHEL,15/03/2019 07:00,15/03/2019 07:00,2019-03-15 06:51:54,EETLL,15/03/2019 09:15,,
823669,912181,15/03/2019 07:38,Finlandia,9214379,59.9148,24.8627,20.7,196,195,FIHEL,15/03/2019 07:00,15/03/2019 07:00,2019-03-15 06:51:54,EETLL,15/03/2019 09:15,,
823670,912182,15/03/2019 07:39,Finlandia,9214379,59.9082,24.8587,20.7,196,195,FIHEL,15/03/2019 07:00,15/03/2019 07:00,2019-03-15 06:51:54,EETLL,15/03/2019 09:15,,


### Create validation dataset

In [9]:
val = df.sample(frac=0.3, random_state=42)

validation_file = 'validation.csv'
validation_path = path + '/02_Data/02_Validation/' + validation_file

val.to_csv(validation_path, index=False)

### Create working dataset

In [10]:
work = df.loc[~df.index.isin(val.index)]

work_file = 'work.csv'
work_path = path + '/02_Data/03_Working/' + work_file

work.to_csv(work_path, index=False)
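The complement selection above relies on `df` having a unique index, so that `isin` excludes exactly the sampled rows. A minimal sanity check of the same pattern on a toy frame (the names `toy`, `toy_val`, and `toy_work` are illustrative only) could look like this:

```python
import pandas as pd

# Toy frame standing in for df; it has a default, unique RangeIndex.
toy = pd.DataFrame({"record_id": range(10)})

# Same split pattern as in the notebook: 30% validation, the rest working.
toy_val = toy.sample(frac=0.3, random_state=42)
toy_work = toy.loc[~toy.index.isin(toy_val.index)]

# The two parts should not overlap and should cover every row exactly once.
assert toy.index.is_unique
assert toy_val.index.intersection(toy_work.index).empty
assert len(toy_val) + len(toy_work) == len(toy)
```

Running the same three assertions on `df`, `val`, and `work` before saving would confirm that no record ends up in both files.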

### Create a sample

In [11]:
sample = work.sample(n=20000, random_state=42)

sample_file = 'sample.csv'
sample_path = path + '/02_Data/03_Working/' + sample_file

sample.to_csv(sample_path, index=False)