# 1. Setup & Bus Data Load

Import required modules and packages.

In [None]:
# import os so that environment variables can be accessed (for database password, etc.)
import os

# import mysql connector so that data can be pulled into the notebook directly from the database
import mysql.connector

# import pandas and numpy for data analysis
import pandas as pd
import numpy as np

# import transform_data function for transforming data into segment format
from transform_data import transform_data

# import convert_timestamp for various timestamp conversion functions
import convert_timestamp

Set the max number of columns & rows to display.

In [None]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 500)

Load data for a bus route from file.

In [None]:
df = pd.read_csv('./data/route_15A.csv', sep=";", na_values=['\\N'])
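
The `na_values=['\\N']` argument matters because `\N` is how a MySQL text export writes NULL. A minimal sketch on a hypothetical three-column sample (the values are made up for illustration) shows those markers arriving as missing values:

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as the route file:
# ';' as the separator and '\N' marking a NULL value.
sample = "tripid;stoppointid;passengers\n101;4001;\\N\n102;\\N;\\N\n"

df_sample = pd.read_csv(io.StringIO(sample), sep=";", na_values=["\\N"])

# Count the missing values per column after parsing.
null_counts = df_sample.isnull().sum()
print(null_counts)
```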

Perform a check to see how many rows and columns are in the file.

In [None]:
rows = df.shape[0]
cols = df.shape[1]
print()
print("Before any data cleaning, the CSV file contains", rows, "rows and", cols, "columns.")
print()

Print the first five lines of the dataframe.

In [None]:
df.head(5)

# 2. Initial Checks on the Bus Data

## 2.1 Check for Duplicate Rows & Columns

In [None]:
print()
print('Duplicate rows:', df.duplicated().sum())
print('Duplicate columns:', df.columns.size - df.columns.unique().size)

There are no duplicate rows or columns in the bus data.
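
To show what a non-zero result would look like, here is a small sketch on a hypothetical toy frame (not the bus data) in which one row is deliberately repeated:

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first.
toy = pd.DataFrame({"tripid": [1, 2, 1], "stoppointid": [10, 20, 10]})

# duplicated() flags every repeat of an earlier row (or column label).
duplicate_rows = int(toy.duplicated().sum())
duplicate_columns = int(toy.columns.duplicated().sum())

print("Duplicate rows:", duplicate_rows)
print("Duplicate columns:", duplicate_columns)
```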

## 2.2 Check for Null/Empty Features

In [None]:
df.describe().T

Features with a count of zero can be dropped as they contain no useful information:

In [None]:
df = df.drop(columns=['tenderlot', 'suppressed_trip', 'justificationid_trip', 'passengers', 'passengersin', 'passengersout', \
                      'distance_leavetimes', 'note_leavetimes', 'note_vehicle'])
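
As an alternative to reading the names off the `describe()` output, the all-null columns could be found programmatically. This is a sketch on a hypothetical toy frame; the column names are borrowed from the bus data only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two columns that hold no data at all.
toy = pd.DataFrame({
    "tripid": [1, 2, 3],
    "passengers": [np.nan, np.nan, np.nan],
    "note_vehicle": [None, None, None],
})

# A column is empty when every one of its values is null.
empty_columns = toy.columns[toy.isnull().all()].tolist()
toy_clean = toy.drop(columns=empty_columns)

print(empty_columns)
```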

## 2.3 Assign Features as Continuous or Categorical

First check the data types of all columns after the file import.

In [None]:
df.dtypes

Assign categorical and continuous features, and update the type of all categorical features to 'category'.

In [None]:
# Select columns containing categorical data
categorical_columns = df[['datasource', 'dayofservice', 'tripid', 'lineid', 'routeid', 'direction', 'basin', \
                         'lastupdate_trip', 'note_trip', 'progrnumber', 'stoppointid', \
                          'suppressed_leavetimes', 'lastupdate_leavetimes']].columns

# Convert data type to 'Category' for these columns
for column in categorical_columns:
    df[column] = df[column].astype('category')

In [None]:
# Select columns containing continuous data 
# This is done by selecting columns with a numeric type - float64 or int64
continuous_columns = df.select_dtypes(['float64', 'int64']).columns

## 2.4 Check for Constant Categorical Features

In [None]:
# Print details for the categorical columns
df[categorical_columns].describe().T

Features with only one unique value are constant and can be dropped. <br> **lineid** is constant for this subset of data but not for the full data set, so it will not be dropped at this stage.

In [None]:
df = df.drop(columns=['datasource', 'basin'])
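
Constant features can also be listed directly with `nunique()`. The sketch below uses a hypothetical toy frame whose values are invented for illustration:

```python
import pandas as pd

# Hypothetical toy frame: two columns never change value.
toy = pd.DataFrame({
    "datasource": ["DB", "DB", "DB"],
    "direction": [1, 2, 1],
    "basin": ["X", "X", "X"],
})

# A feature is constant when it has exactly one distinct value.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()

print(constant_columns)
```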

## 2.5 Check for Constant Continuous Features

In [None]:
# Print details for the continuous columns
df[continuous_columns].describe().T

There are no constant continuous features so nothing needs to be dropped.

# 3. Further Analysis of Features

There are a number of features that can be dropped because they fall into at least one of the following categories:
- Features that don't provide much information
- Features that we won't be able to provide information on to the model

These features can be dropped:

In [None]:
df = df.drop(columns=['lastupdate_trip', 'note_trip', 'suppressed_leavetimes', 'justificationid_leavetimes', \
                      'lastupdate_leavetimes','vehicleid', 'distance_vehicle', 'minutes_vehicle'])

# 4. Initial Checks for Missing Data

## 4.1 Categorical Features

Select the categorical features and print details:

In [None]:
# Select columns containing categorical data
categorical_columns = df[['dayofservice', 'tripid', 'lineid', 'routeid', 'direction', 'progrnumber', 'stoppointid']].columns

# Print details for the categorical columns
df[categorical_columns].describe().T

There is a full count for all categorical features.

## 4.2 Continuous Features

Select the continuous features and print details:

In [None]:
# Select columns containing continuous data 
# This is done by selecting columns with a numeric type - float64 or int64
continuous_columns = df.select_dtypes(['float64', 'int64']).columns

# Print details for the continuous columns
df[continuous_columns].describe().T

Some rows are missing data for **actualtime_arr_trip** and **actualtime_dep_trip**. This will be reviewed if these features are used in the future; currently they are not carried across when the data is transformed.

# 5. Transform the Bus Data

Bus data must be transformed so that each row of data holds information on one journey segment. This is done by calling the *transform_data* function:

In [None]:
df_transformed = transform_data(df)

In [None]:
df_transformed
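
The implementation of *transform_data* is not shown in this notebook. As a rough illustration of the idea only, the sketch below pairs each stop with the next stop of the same trip using `groupby` and `shift`. The input column name `actualtime_arr_stop` and all values are assumptions, and the real function may work differently:

```python
import pandas as pd

# Hypothetical stop-level data: one row per stop visited on a trip.
stops = pd.DataFrame({
    "tripid": [1, 1, 1, 2, 2],
    "progrnumber": [1, 2, 3, 1, 2],
    "stoppointid": [100, 101, 102, 100, 101],
    "actualtime_arr_stop": [36000, 36120, 36300, 40000, 40090],
})

# Order the stops within each trip, then look one row ahead per trip.
stops = stops.sort_values(["tripid", "progrnumber"])
grouped = stops.groupby("tripid")

segments = pd.DataFrame({
    "tripid": stops["tripid"],
    "stoppointid_first": stops["stoppointid"],
    "stoppointid_next": grouped["stoppointid"].shift(-1),
    "actualtime_arr_stop_first": stops["actualtime_arr_stop"],
    "actualtime_arr_stop_next": grouped["actualtime_arr_stop"].shift(-1),
})

# The last stop of each trip has no following stop, so it starts no segment.
segments = segments.dropna(subset=["stoppointid_next"]).reset_index(drop=True)
print(segments)
```

Shifting within each `tripid` group keeps the last stop of one trip from being paired with the first stop of the next.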

## 5.1 Check for Missing Data

First re-assign the transformed data as continuous or categorical:

In [None]:
# Select columns containing categorical data
categorical_columns = df_transformed[['dayofservice', 'tripid', 'lineid', 'routeid', 'direction',  \
                         'progrnumber_first', 'stoppointid_first', \
                          'progrnumber_next', 'stoppointid_next']].columns

# Convert data type to 'Category' for these columns
for column in categorical_columns:
    df_transformed[column] = df_transformed[column].astype('category')

In [None]:
# Select columns containing continuous data 
# This is done by selecting columns with a numeric type - float64 or int64
continuous_columns = df_transformed.select_dtypes(['float64', 'int64']).columns

Then check for missing data:

In [None]:
# Print details for the categorical columns
df_transformed[categorical_columns].describe().T

In [None]:
# Print details for the continuous columns
df_transformed[continuous_columns].describe().T

Some rows are missing data. Because the number of affected rows is quite low, and because imputation would be difficult, these rows will be dropped.

## 5.2 Drop Rows with Missing Data

Drop rows where *stoppointid_first* or *stoppointid_next* is null:

In [None]:
df_transformed = df_transformed[pd.notnull(df_transformed['stoppointid_first'])]

In [None]:
df_transformed = df_transformed[pd.notnull(df_transformed['stoppointid_next'])]
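
The same filtering can be written as a single `dropna` call, which also makes it easy to report how many rows were removed. This is a sketch on a hypothetical toy frame, not the transformed bus data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy segments with one null in each stop column.
toy = pd.DataFrame({
    "stoppointid_first": [100, np.nan, 102, 103],
    "stoppointid_next": [101, 102, np.nan, 104],
})

rows_before = len(toy)

# Drop a row if either of the listed columns is null.
toy_clean = toy.dropna(subset=["stoppointid_first", "stoppointid_next"])
rows_dropped = rows_before - len(toy_clean)

print("Rows dropped:", rows_dropped)
```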

# 6. Import Weather Data

Weather data is loaded from the database:

In [None]:
# open connection
connection = mysql.connector.connect(host=os.environ['DBHOST'], user=os.environ['DBUSER'], \
        password=os.environ['DBPASS'], db='db_raw_data')

# SQL query 
sql = "SELECT * FROM weather_data \
WHERE record_date BETWEEN CAST('2018-01-01' AS DATE) AND CAST('2019-01-01' AS DATE);"

# load into dataframe
df_weather = pd.read_sql(sql, connection)

## 6.1 Check for Duplicate Rows & Columns

In [None]:
print()
print('Duplicate rows:', df_weather.duplicated()[df_weather.duplicated() == True].shape[0])
print('Duplicate columns:',df_weather.columns.size - df_weather.columns.unique().size)

There are no duplicate rows or columns so nothing needs to be dropped here.

## 6.2 Assign Features as Continuous or Categorical

First check the data types of all columns after loading from the database:

In [None]:
df_weather.dtypes

Assign categorical and continuous features:

In [None]:
# Select columns containing categorical data
categorical_columns = df_weather[['record_date', 'irain', 'itemp', 'iwb']].columns

# Convert data type to 'Category' for these columns
for column in categorical_columns:
    df_weather[column] = df_weather[column].astype('category')

In [None]:
# Select columns containing continuous data 
# This is done by selecting columns with a numeric type - float64 or int64
continuous_columns = df_weather.select_dtypes(['float64', 'int64']).columns

## 6.3 Check for Missing Data, Constant Features, etc.

In [None]:
# Print details for the categorical columns
df_weather[categorical_columns].describe().T

**itemp** and **iwb** are constant columns so can be dropped.

In [None]:
# Print details for the continuous columns
df_weather[continuous_columns].describe().T

Investigate rows with missing data for rain:

In [None]:
# select all rows where irain is not 0
df_weather.loc[df_weather['irain'] != 0]

There are only two rows where irain is not zero, and these rows correspond to missing values for rain.

In [None]:
# select other rows around the missing values
df_weather[6220:6240]

Given that there is no rain for the rest of the day, and given the high (for Ireland) temperature on the day, I think it's safe to replace the missing rain values with 0.

I will then drop the feature **irain** as it provides no useful information.

## 6.4 Replace Missing Weather Data

In [None]:
# replace rain with 0 where irain is not 0
df_weather.loc[df_weather['irain'] != 0, 'rain'] = 0

In [None]:
# check that values are updated
df_weather.loc[df_weather['irain'] != 0]

## 6.5 Drop Constant Weather Features

In [None]:
df_weather = df_weather.drop(columns=['irain', 'itemp', 'iwb'])

# 7. Combine Bus and Weather Data

## 7.1 Split Timestamp for Weather Data

To merge the data, timestamps must be split into month, day and hour.

New features are added using lambda functions:

In [None]:
df_weather['month'] = df_weather.apply (lambda row: convert_timestamp.timestamp_to_month_weather(row['record_date']), axis=1)

In [None]:
df_weather['day'] = df_weather.apply (lambda row: convert_timestamp.timestamp_to_day_weather(row['record_date']), axis=1)

In [None]:
df_weather['hour'] = df_weather.apply (lambda row: convert_timestamp.timestamp_to_hour_weather(row['record_date']), axis=1)
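
The *convert_timestamp* helpers are not shown here. If `record_date` holds datetime values, the same split could probably be done without a row-wise `apply`, using the pandas `.dt` accessor. A sketch on hypothetical timestamps:

```python
import pandas as pd

# Hypothetical hourly weather timestamps.
weather = pd.DataFrame({
    "record_date": pd.to_datetime(["2018-06-27 14:00:00", "2018-12-25 09:00:00"]),
})

# The .dt accessor extracts each component for the whole column at once.
weather["month"] = weather["record_date"].dt.month
weather["day"] = weather["record_date"].dt.day
weather["hour"] = weather["record_date"].dt.hour

print(weather)
```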

New features are updated to be categorical:

In [None]:
# Select columns containing categorical data
categorical_columns = df_weather[['record_date', 'month', 'day', 'hour']].columns

# Convert data type to 'Category' for these columns
for column in categorical_columns:
    df_weather[column] = df_weather[column].astype('category')

In [None]:
df_weather

## 7.2 Split Timestamp for Bus Data

New features are added using lambda functions:

In [None]:
df_transformed['month'] = df_transformed.apply (lambda row: convert_timestamp.timestamp_to_month_bus(row['dayofservice'], \
                                                                                   row['actualtime_arr_stop_first']), axis=1)

In [None]:
df_transformed['day'] = df_transformed.apply (lambda row: convert_timestamp.timestamp_to_day_bus(row['dayofservice'], \
                                                                               row['actualtime_arr_stop_first']), axis=1)

In [None]:
df_transformed['hour'] = df_transformed.apply (lambda row: convert_timestamp.timestamp_to_hour_bus(\
                                                                                row['actualtime_arr_stop_first']), axis=1)

In [None]:
df_transformed
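
The month and day helpers take both `dayofservice` and the arrival time, which suggests that an arrival can fall on the calendar day after the service day. The sketch below shows that case, assuming that `actualtime_arr_stop_first` is stored as seconds after midnight of the service day and that `dayofservice` is a `YYYY-MM-DD HH:MM:SS` string. Both are assumptions about the data, and `split_bus_timestamp` is a hypothetical helper, not part of *convert_timestamp*:

```python
from datetime import datetime, timedelta


def split_bus_timestamp(dayofservice, seconds_after_midnight):
    """Return (month, day, hour) for an arrival time on a given service day."""
    service_day = datetime.strptime(dayofservice, "%Y-%m-%d %H:%M:%S")
    # Adding the offset rolls over into the next day (and month) when needed.
    moment = service_day + timedelta(seconds=int(seconds_after_midnight))
    return moment.month, moment.day, moment.hour


# 87300 seconds is 24 h 15 min, i.e. 00:15 on the following calendar day.
late_arrival = split_bus_timestamp("2018-01-31 00:00:00", 87300)
afternoon_arrival = split_bus_timestamp("2018-06-27 00:00:00", 52200)
print(late_arrival, afternoon_arrival)
```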

## 7.3 Merge the Dataframes

In [None]:
df_merged = pd.merge(df_transformed, df_weather, how='left', on=['month', 'day', 'hour'])

Check that there are no rows missing weather data:

In [None]:
df_merged[df_merged.rain.isnull()]
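
Two optional `pd.merge` arguments can make this check more explicit: `validate` raises an error if the weather keys are not unique, and `indicator` adds a `_merge` column that marks rows with no match. A sketch on hypothetical toy frames:

```python
import pandas as pd

# Hypothetical bus segments and hourly weather records.
bus = pd.DataFrame({
    "tripid": [1, 2, 3],
    "month": [6, 6, 6],
    "day": [27, 27, 28],
    "hour": [14, 15, 9],
})
weather = pd.DataFrame({
    "month": [6, 6],
    "day": [27, 27],
    "hour": [14, 15],
    "rain": [0.0, 0.4],
})

merged = pd.merge(
    bus, weather, how="left", on=["month", "day", "hour"],
    validate="many_to_one",  # fail if a weather hour appears more than once
    indicator=True,          # add a _merge column: 'both' or 'left_only'
)

# Rows marked 'left_only' found no weather record.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

Because the merge key has no year component, this approach relies on the data covering a single year.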

Update the data types for the new dataframe:

In [None]:
# Select columns containing categorical data
categorical_columns = df_merged[['dayofservice', 'tripid', 'lineid', 'routeid', 'direction',  \
                         'progrnumber_first', 'stoppointid_first', 'progrnumber_next', 'stoppointid_next',\
                          'month', 'day', 'hour', 'record_date']].columns

# Convert data type to 'Category' for these columns
for column in categorical_columns:
    df_merged[column] = df_merged[column].astype('category')

In [None]:
# Select columns containing continuous data 
# This is done by selecting columns with a numeric type - float64 or int64
continuous_columns = df_merged.select_dtypes(['float64', 'int64']).columns

In [None]:
df_merged

# 8. Create New Features

In [None]:
import importlib
importlib.reload(convert_timestamp)

## 8.1 Day of Week Feature

The new feature is added using a lambda function:

In [None]:
df_merged['day_of_week'] = df_merged.apply (lambda row: convert_timestamp.timestamp_to_day_of_week(row['dayofservice']), axis=1)

## 8.2 Weekend/Weekday Feature

The new feature is added using a lambda function:

In [None]:
df_merged['weekday'] = df_merged.apply (lambda row: convert_timestamp.timestamp_to_weekday_weekend(row['dayofservice']), axis=1)
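
The two helpers used above live in *convert_timestamp* and are not shown. A minimal sketch of how such features could be derived with the standard library, assuming `dayofservice` is a `YYYY-MM-DD HH:MM:SS` string (the function names and the 1/0 encoding are hypothetical):

```python
from datetime import datetime


def day_of_week(dayofservice):
    """Return 0 for Monday through 6 for Sunday."""
    return datetime.strptime(dayofservice, "%Y-%m-%d %H:%M:%S").weekday()


def is_weekday(dayofservice):
    """Return 1 for Monday to Friday and 0 for Saturday or Sunday."""
    return 1 if day_of_week(dayofservice) < 5 else 0


monday = "2018-01-01 00:00:00"    # 1 January 2018 was a Monday
saturday = "2018-01-06 00:00:00"
print(day_of_week(monday), is_weekday(monday))
print(day_of_week(saturday), is_weekday(saturday))
```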

## 8.3 Bank Holiday Feature

Make list of bank holidays for 2018 (based on https://www.officeholidays.com/countries/ireland/2018):

In [None]:
holidays = ['2018-01-01 00:00:00', '2018-03-19 00:00:00', '2018-04-02 00:00:00', '2018-05-07 00:00:00', '2018-06-04 00:00:00',\
           '2018-08-06 00:00:00', '2018-10-29 00:00:00', '2018-12-25 00:00:00', '2018-12-26 00:00:00']

The new feature is added using a lambda function:

In [None]:
df_merged['bank_holiday'] = df_merged.apply (lambda row: convert_timestamp.timestamp_to_bank_holiday(row['dayofservice'], \
                                                                                                      holidays), axis=1)
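
If `dayofservice` is stored in the same string format as the holiday list, the flag could also be computed for the whole column at once with `isin`. This is a sketch under that assumption, using a shortened holiday list and hypothetical rows:

```python
import pandas as pd

# Shortened holiday list in the same format as above.
holidays = ["2018-12-25 00:00:00", "2018-12-26 00:00:00"]

# Hypothetical service days.
toy = pd.DataFrame({
    "dayofservice": ["2018-12-24 00:00:00", "2018-12-25 00:00:00"],
})

# isin() checks membership for every row; cast the booleans to 1/0.
toy["bank_holiday"] = toy["dayofservice"].isin(holidays).astype(int)
print(toy)
```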

In [None]:
df_merged

# Appendix 1: Data Quality Plan - Bus Data (Before Transformation)

| Feature | Data Quality Issue | Handling Strategy |
|-------------------------|----------------------|------------------------------|
| tenderlot       | All rows are null | Drop feature |
| suppressed_trip | All rows are null | Drop feature |
| justificationid_trip | All rows are null | Drop feature |
| passengers | All rows are null | Drop feature | 
| passengersin | All rows are null | Drop feature |
| passengersout | All rows are null | Drop feature |
| distance_leavetimes | All rows are null | Drop feature |
| note_leavetimes | All rows are null | Drop feature |
| note_vehicle | All rows are null | Drop feature |
| datasource | Constant feature | Drop feature |
| lineid | Constant feature | This is constant because we just have data for one route loaded. At some point we may process more than one route together so will keep feature for now. May not be needed to train the model. |
| basin | Constant feature | Drop feature |
| lastupdate_trip | Cannot be used to train model as we won't be able to provide this information | Drop feature |
| note_trip | Cannot be used to train model as we won't be able to provide this information | Drop feature |
| suppressed_leavetimes | Cannot be used to train model as we won't be able to provide this information | Drop feature |
| justificationid_leavetimes | Cannot be used to train model as we won't be able to provide this information | Drop feature |
| lastupdate_leavetimes | Cannot be used to train model as we won't be able to provide this information | Drop feature |
| vehicleid | Cannot be used to train model as we won't be able to provide this information | Drop feature |
| distance_vehicle | Cannot be used to train model as we won't be able to provide this information | Drop feature |
| minutes_vehicle | Cannot be used to train model as we won't be able to provide this information | Drop feature |
| actualtime_arr_trip | Missing values < 1% | Ignore for now as this feature is not brought across when data is transformed. |
| actualtime_dep_trip | Missing values < 3% | Ignore for now as this feature is not brought across when data is transformed. |

# Appendix 2: Data Quality Plan - Bus Data (After Transformation)

| Feature | Data Quality Issue | Handling Strategy |
|-------------------------|----------------------|------------------------------|
| stoppointid_first | Missing values ~ 1% | Drop affected rows |
| actualtime_arr_stop_first | Missing values ~ 1% | Drop affected rows |
| stoppointid_next | Missing values ~ 1% | Drop affected rows |
| actualtime_arr_stop_next | Missing values ~ 1% | Drop affected rows |

# Appendix 3: Data Quality Plan - Weather Data

| Feature | Data Quality Issue | Handling Strategy |
|-------------------------|----------------------|------------------------------|
| itemp | Constant feature | Drop feature |
| iwb | Constant feature | Drop feature |
| rain | Missing data - 2 rows | Imputation - replace with 0 after looking at data for other timestamps on the same day |
| irain | Seems to be a missing data indicator | Drop feature as only two rows have missing data, and imputation is performed for these rows. |

# Appendix 4: Tests for *transform_data*

In [None]:
df_test1 = df.loc[5:100]
df_test1 = df_test1.reset_index(drop=True)
df_test1

In [None]:
df_transformed1 = transform_data(df_test1)
df_transformed1

In [None]:
pieces = [df[:35], df[42:100]]
df_test2 = pd.concat(pieces)
df_test2 = df_test2.reset_index(drop=True)
df_test2

In [None]:
df_transformed2 = transform_data(df_test2)
df_transformed2

In [None]:
pieces = [df[:5], df[10:100]]
df_test3 = pd.concat(pieces)
df_test3 = df_test3.reset_index(drop=True)
df_test3

In [None]:
df_transformed3 = transform_data(df_test3)
df_transformed3

In [None]:
pieces = [df[:5], df[8:10], df[14:50]]
df_test4 = pd.concat(pieces)
df_test4 = df_test4.reset_index(drop=True)
df_test4

In [None]:
df_transformed4 = transform_data(df_test4)
df_transformed4