Introduction
This notebook is the first step towards creating an algorithm that predicts the probability of train cancellation. Here I analyze an open-source dataset about train services, performing data preprocessing and data cleaning based on the constraints and requirements described below.

Constraints
The features used to train the model must meet at least one of the following criteria:
a) The information stored in the feature is part of the user's input
Example: The user's input is a city of destination, a city of departure and a departure time for their future trip.
b) The information stored in the feature can be retrieved or calculated from the user's input.
Example: The user (normally) doesn't know about maintenance works, but it is possible to search for any maintenance on the specified route, using the time and the stations that the user provided.

(Functional) Requirements

The model has to estimate the total delay that the user might face, including arrival delay
The model has to estimate the probability that the train is cancelled
The estimates should be given as percentages
The model has to predict the delay probability for every 5-minute batch (example: 10% chance of a 5-minute delay, 15% chance of a 10-minute delay)
The model must be able to solve a multi-label classification problem, as there will be multiple label columns in y_test
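
The 5-minute batches from the requirements can be sketched as a binning step. This is only an illustration under assumptions that are not in the dataset description: the delays are non-negative whole minutes, the bucket edges stop at 30 minutes, and the bucket labels are made up for readability.

```python
import pandas as pd

# Hypothetical delays in minutes; real values would come from the delay columns.
delays = pd.Series([0, 3, 5, 7, 12, 31])

# Assumed bucket edges: "on time" (0), 5-minute steps up to 30, then an open-ended bucket.
edges = [-1, 0, 5, 10, 15, 20, 25, 30, float("inf")]
labels = ["0", "1-5", "6-10", "11-15", "16-20", "21-25", "26-30", "30+"]

# pd.cut uses right-closed intervals by default, so a 5-minute delay lands in "1-5".
buckets = pd.cut(delays, bins=edges, labels=labels)

# Share of observations per bucket, expressed as a percentage.
bucket_share = buckets.value_counts(normalize=True, sort=False) * 100
print(bucket_share.round(1))
```

Each bucket could then serve as one class of the delay label, and the observed shares give a simple baseline to compare the model's percentages against.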

At this point you should have all of the necessary information to understand the logic behind my next steps. Let's start the data preprocessing!


In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
from datetime import datetime
from datetime import date
from pathlib import Path  

2023-11-23 14:44:20.866668: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-23 14:44:21.569127: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


2023-11-23 14:44:24.924667: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.


Brief dataset overview
The dataset that I found contains all of the necessary information about train journeys in the Netherlands, including station code, station name, train id, train type and so on. However, it is clear at first glance that there are too many features, so I will certainly have to get rid of some of them. For example, the feature called Stop:RDT-ID is just an auto-incremented identifier and would have little to no impact on the model's predictions.
Moreover, although there are some NaN values that 'beg' to be cleaned, data cleaning deserves more effort than just typing df.dropna()
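
Both observations can be checked with a few lines of pandas before deciding what to drop. The sketch below runs on a made-up toy frame: the column names mirror the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame; the column names mirror the dataset, the values are made up.
df = pd.DataFrame({
    "Stop:RDT-ID": [101, 102, 103, 104],
    "Stop:Station code": ["UT", "UT", None, "ASD"],
    "Stop:Arrival delay": [0.0, None, 2.0, None],
})

# A column whose values are all distinct behaves like a row identifier.
is_identifier = df.nunique(dropna=False) == len(df)

# Percentage of missing values per column, largest first.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)

print(is_identifier)
print(missing_pct)
```

A column flagged as an identifier is a candidate for removal, and the missing-value percentages show which columns need a rule of their own rather than a blanket df.dropna().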

You can see the dataset overview below:

In [None]:
basic_info = pd.read_csv('services-2022.csv')
basic_info.head()
#disruptions = pd.read_csv('disruptions-2022.csv')

First steps:

a) Rename columns in the dataset. Currently the column names are long and complicated, which makes data preprocessing much harder, especially when referring to specific columns. So I've renamed the columns to make them clearer and more concise

b) Data cleaning. Dropping features
Elaboration on dropping features:
Train id, Stop id: these features just identify the entry and do not point to any potential correlations
Maximum delay, Arrival cancelled, Completely cancelled, Partly cancelled: Though these features could be of use, they do not satisfy the constraints, because the user can't possibly know in advance whether a train's arrival will be cancelled. It would be possible to make the model predict these features by putting them into y_test, but this would only make everything more complicated, so it is much easier just to remove these 4 features from the dataset.

c) Data cleaning. Dealing with NaN values. 
After performing a short analysis, I've decided to deal with null values in this way:
For NaN values in the label columns (Departure delay, Departure cancelled): Drop the entire entry
For NaN values in columns of type float or int (Arrival delay): Change the value from NaN to 0
For any NaN value of type "string" or "date" (Departure time): Drop the entire entry
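
The three rules above can also be written generically by column type instead of by column name. This is a minimal sketch, not the cleaning function used in this notebook: `clean_nans` and its `label_columns` parameter are hypothetical names, and the toy values are made up.

```python
import pandas as pd

def clean_nans(df, label_columns):
    """Apply the three NaN rules; `label_columns` is a hypothetical parameter."""
    # Rule 1: rows with a missing label are dropped.
    df = df.dropna(subset=label_columns).copy()
    # Rule 2: missing values in numeric columns become 0.
    numeric_columns = df.select_dtypes(include="number").columns
    df[numeric_columns] = df[numeric_columns].fillna(0)
    # Rule 3: after rule 2 only non-numeric (string or date) columns can still
    # hold NaN, so dropping the remaining incomplete rows covers the third rule.
    return df.dropna()

toy = pd.DataFrame({
    "Departure delay": [1.0, None, 0.0, 4.0],             # label
    "Arrival delay": [None, 2.0, 3.0, 1.0],               # numeric column
    "Departure time": ["08:00", "09:00", None, "10:00"],  # string column
})
cleaned = clean_nans(toy, label_columns=["Departure delay"])
print(cleaned)
```

The function works on a copy, so the frame passed in keeps its NaN values and the cell can be re-run safely.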

In [None]:
column_list = basic_info.columns.tolist()

# Print the list of column names
print(column_list)

In [None]:

def data_cleaning_main_dataset(dataset):
    # Drop identifier columns and columns the user cannot know in advance
    dataset = dataset.drop(columns=[
        "Partly cancelled", "Maximum delay", "Stop id",
        "Train id", "Arrival cancelled", "Completely cancelled",
    ])
    # Drop rows that miss a departure label or the departure time
    dataset = dataset.dropna(
        subset=["Departure delay", "Departure cancelled", "Departure time"]
    ).copy()
    # A missing arrival delay is replaced with 0
    dataset["Arrival delay"] = dataset["Arrival delay"].fillna(0)
    return dataset

def rename_columns_main_dataset(dataset):
    # Map the original column names to shorter ones
    new_columns = {
        'Service:RDT-ID': 'Train id',
        'Service:Date': 'Date',
        'Service:Type': 'Train type',
        'Service:Company': 'Railroad company',
        'Service:Train number': 'Train number',
        'Service:Completely cancelled': 'Completely cancelled',
        'Service:Partly cancelled': 'Partly cancelled',
        'Service:Maximum delay': 'Maximum delay',
        'Stop:RDT-ID': 'Stop id',
        'Stop:Station code': 'Station code',
        'Stop:Station name': 'Station name',
        'Stop:Arrival time': 'Arrival time',
        'Stop:Arrival delay': 'Arrival delay',
        'Stop:Arrival cancelled': 'Arrival cancelled',
        'Stop:Departure time': 'Departure time',
        'Stop:Departure delay': 'Departure delay',
        'Stop:Departure cancelled': 'Departure cancelled',
    }

    # Rename the columns
    return dataset.rename(columns=new_columns)
        
        

Next step: Enriching the dataset with new features

I've decided to add two more features derived from the "Date" column. These two features are boolean and are named "Is_weekend" and "Is_holiday".
The reason for adding these features is to extract as much information from the user's input as possible, as the information received from the user is quite scarce: only 5 features. Moreover, dates contain more information than just plain numbers. For example, according to open-source statistics, the total number of daily disruptions on weekends is almost 20% lower than on weekdays.

In [None]:

def add_new_weekday_feature(dataset):
    # weekday is 0 for Monday, so 5 and 6 are Saturday and Sunday
    dates = pd.to_datetime(dataset['Date'], format='%Y-%m-%d')
    dataset['Is_weekend'] = dates.dt.weekday >= 5
    return dataset
    
def extract_holidays(dataset):
    list_with_holidays = [
    '2022-01-01',   # New Year's Day
    '2022-04-15',   # Good Friday
    '2022-04-17',   # Easter Sunday
    '2022-04-18',   # Easter Monday
    '2022-04-27',   # King's Day
    '2022-05-04',   # Remembrance Day
    '2022-05-05',   # Liberation Day
    '2022-05-26',   # Ascension Day
    '2022-06-05',   # Whit Sunday
    '2022-06-06',   # Whit Monday
    '2022-12-25',   # Christmas Day
    '2022-12-26',   # Second Christmas Day
    ]

    # Convert the 'Date' column to strings in the same format
    dataset['Date'] = pd.to_datetime(dataset['Date']).dt.strftime('%Y-%m-%d')
    # Create the 'Is_holiday' column based on the 'Date' column
    dataset['Is_holiday'] = dataset['Date'].isin(list_with_holidays)
    return dataset
    
def data_preprocessing_main_dataset(dataset):
    dataset = add_new_weekday_feature(dataset)
    dataset = extract_holidays(dataset)
    return dataset
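
One way to check whether a date-derived flag such as Is_weekend carries any signal is to compare the labels across its two values. The sketch below uses made-up rows, so its numbers say nothing about the real dataset. The column names follow the renamed dataset, and the aggregate names `cancellation_rate` and `mean_delay` are chosen only for this example.

```python
import pandas as pd

# Made-up rows; in the notebook these columns would come from the cleaned dataset.
toy = pd.DataFrame({
    "Is_weekend": [False, False, False, False, True, True],
    "Departure cancelled": [True, False, False, True, False, False],
    "Departure delay": [4, 0, 2, 6, 1, 1],
})

# The mean of a boolean column is the share of True values, i.e. the cancellation rate.
summary = toy.groupby("Is_weekend").agg(
    cancellation_rate=("Departure cancelled", "mean"),
    mean_delay=("Departure delay", "mean"),
)
print(summary)
```

If the two rows of the summary barely differ on the real data, the flag is unlikely to help the model much. The same comparison works for Is_holiday.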


In [None]:
#disruptions.head()

In [None]:
#large_disruptions = disruptions[disruptions['duration_minutes'] > 3000]
#large_disruptions.head()

In [None]:
sorted_dataset = basic_info[basic_info['Stop:Arrival cancelled'] == True]
sorted_dataset.head(50)

In [None]:
print("Total rows before cleaning (main dataset): " + str(len(basic_info)))
#print("Total rows before cleaning (disruptions): " + str(len(disruptions)))

In [None]:
basic_info = rename_columns_main_dataset(basic_info)
basic_info = data_cleaning_main_dataset(basic_info)
basic_info = data_preprocessing_main_dataset(basic_info)
basic_info.head()

Advanced data cleaning: Rare trains and small companies.
After observing the number and distribution of unique values in the "Train type" and "Railroad company" columns, I found that both features contain many rare values, each of which accounts for around 0.01-0.05% of the dataset (see bar charts below for more information).
I've removed all entries whose 'Train type' or 'Railroad company' value makes up less than 1% of the dataset. I believe that this confuses the model much less, making the predictions more stable and precise.
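
The 1% rule can also be expressed as a threshold on relative frequencies instead of a hand-written list of values. This is a minimal sketch under that assumption: `drop_rare_categories` and `min_share` are hypothetical names, and the toy counts are invented.

```python
import pandas as pd

def drop_rare_categories(df, column, min_share=0.01):
    """Keep only rows whose value in `column` covers at least `min_share` of the rows."""
    # Relative frequency of each distinct value in the column.
    shares = df[column].value_counts(normalize=True)
    frequent_values = shares[shares >= min_share].index
    return df[df[column].isin(frequent_values)]

# Toy data: 'Stoomtrein' covers 0.5% of the rows and falls below the 1% threshold.
toy = pd.DataFrame({"Train type": ["Intercity"] * 120 + ["Sprinter"] * 79 + ["Stoomtrein"]})
filtered = drop_rare_categories(toy, "Train type")
print(filtered["Train type"].value_counts())
```

Spelling variants of the same company (for example 'ns' and 'NS') would have to be merged before applying such a threshold, otherwise each variant is counted separately and may be dropped as rare.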

In [None]:
def plot_bar_chart(label):
    grouped_data = basic_info.groupby(label).size()
    
    # Plot the bar chart
    grouped_data.plot(kind='bar', color='navajowhite', edgecolor='black')
    
    # Add labels and title
    plt.xlabel(label)
    plt.ylabel('Count')
    plt.title('Distribution of ' + label)
    
    # Add percentages on top of each bar
    total_count = len(basic_info)  # Total number of entries in the DataFrame
    for i, value in enumerate(grouped_data):
        percentage = (value / total_count) * 100
        plt.text(i, value + 0.1, f'{percentage:.2f}%', ha='center', va='bottom', fontsize=8)
    
    # Show the plot
    plt.show()

In [None]:
print("Total rows after cleaning: " + str(len(basic_info))) 

In [None]:
plot_bar_chart('Railroad company')

In [None]:
plot_bar_chart('Train type')

In [None]:
print(basic_info.groupby('Train type').size())

In [None]:
print(basic_info["Railroad company"].unique())

In [None]:
values_to_remove = ['NS Int', 'Eurobahn', 'Breng', 'DB', 'EB', 'NMBS', 'VIAS', 'ZLSM', 'keo', '.',
                   'ABRN', 'Keolis', 'NSI', 'Railexpert', 'TCS', 'connexxi', 'db', 'nmbs', 'noord']
basic_info['Railroad company'] = basic_info['Railroad company'].replace(values_to_remove, pd.NA)

# Drop rows with NaN values in 'Railroad company' column
basic_info = basic_info.dropna(subset=['Railroad company'])
basic_info['Railroad company'] = basic_info['Railroad company'].replace('ns', 'NS')
basic_info['Railroad company'] = basic_info['Railroad company'].replace('BN', 'Blauwnet')
basic_info['Railroad company'] = basic_info['Railroad company'].replace('Rnet', 'R-net')
plot_bar_chart('Railroad company')

In [None]:
values_to_remove = ['Bus', 'Eurostar', 'Extra trein', 'ICE International', 'Int. Trein', 'Nachttrein', 'Nightjet',
                    'Snelbus i.p.v. trein', 'Stoomtrein', 'Thalys', 'Speciale Trein', 'Stopbus i.p.v. trein', 
                    'Metro i.p.v. trein', 'Dinnner Train', 'Alpen Express']
basic_info['Train type'] = basic_info['Train type'].replace(values_to_remove, pd.NA)
# Drop rows with NaN values in 'Train type' column
basic_info = basic_info.dropna(subset=['Train type'])
plot_bar_chart('Train type')

Below: Plotting features that still have null values in them. Removing null values from those features

In [None]:
def plot_nas(df: pd.DataFrame):
    if df.isnull().sum().sum() != 0:
        na_df = (df.isnull().sum() / len(df)) * 100      
        na_df = na_df.drop(na_df[na_df == 0].index).sort_values(ascending=False)
        missing_data = pd.DataFrame({'Missing Ratio %' :na_df})
        missing_data.plot(kind = "barh")
        plt.show()
    else:
        print('No NAs found')
plot_nas(basic_info)

In [None]:
basic_info = basic_info.dropna(axis=0, subset=['Station code'])

In [None]:
print("Total rows after cleaning: " + str(len(basic_info))) 

In [None]:
basic_info.head(50)

In [None]:
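filepath = Path('preprocessed_data/main_preprocessed_dataset.csv')
# Create the output directory if it does not exist yet
filepath.parent.mkdir(parents=True, exist_ok=True)
basic_info.to_csv(filepath)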