In [31]:
import pandas as pd

# Loading the dataset for cleaning

In [32]:
data = pd.read_csv('/kaggle/input/original-dataset/ar41_for_ulb.csv', delimiter=';')

## Null Value Analysis
We observed that the columns `RS_E_InAirTemp_PC2`, `RS_E_OilPress_PC2`, `RS_E_RPM_PC2`, `RS_E_WatTemp_PC2`, and `RS_T_OilTemp_PC2` each have 12,726 missing values. With 17.6 million rows in total, each of these columns is missing in only about 0.07% of rows. We therefore removed the affected rows to obtain a cleaner and more consistent dataset.

In [33]:
null_values = data.isnull().sum()
null_values

Unnamed: 0                0
mapped_veh_id             0
timestamps_UTC            0
lat                       0
lon                       0
RS_E_InAirTemp_PC1        0
RS_E_InAirTemp_PC2    12726
RS_E_OilPress_PC1         0
RS_E_OilPress_PC2     12726
RS_E_RPM_PC1              0
RS_E_RPM_PC2          12726
RS_E_WatTemp_PC1          0
RS_E_WatTemp_PC2      12726
RS_T_OilTemp_PC1          0
RS_T_OilTemp_PC2      12726
dtype: int64

In [34]:
data.dropna(inplace=True)
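
To make the cost of this removal explicit, the share of affected rows can be reported at the same time as they are dropped. The following is a minimal sketch on a toy frame, not the notebook's actual code: the helper name `drop_incomplete_rows` and the sample values are hypothetical, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def drop_incomplete_rows(df, columns):
    """Drop rows with a missing value in any of `columns`.

    Returns the cleaned frame and the fraction of rows that were dropped.
    (Hypothetical helper, for illustration only.)
    """
    missing_any = df[columns].isna().any(axis=1)
    share_dropped = missing_any.mean()
    return df.loc[~missing_any], share_dropped


# Toy data with made-up sensor values; one of four rows lacks a PC2 reading.
toy = pd.DataFrame({
    "RS_E_RPM_PC1": [800.0, 810.0, 795.0, 805.0],
    "RS_E_RPM_PC2": [790.0, np.nan, 801.0, 799.0],
})

cleaned, share = drop_incomplete_rows(toy, ["RS_E_RPM_PC2"])
print(f"Dropped {share:.2%} of rows, {len(cleaned)} remain")
```

Restricting the check to the `_PC2` columns would give the same result as `dropna()` here, since the null counts above show no other column has missing values.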

## Checking the Number of Duplicate Rows
The dataset does not contain any fully duplicated rows.

In [52]:
duplicates = data.duplicated()
num_duplicates = duplicates.sum()
print(f"Number of duplicate rows: {num_duplicates}")

Number of duplicate rows: 0
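
One caveat: `duplicated()` compares every column, including `Unnamed: 0`. If that column is a saved row index (an assumption based on its name, not something verified here), its values are unique and no two rows can ever match in full. A hedged sketch of a check that ignores such a column is shown below; the helper `count_duplicates` and the toy values are hypothetical.

```python
import pandas as pd


def count_duplicates(df, ignore=("Unnamed: 0",)):
    """Count duplicate rows while ignoring the columns listed in `ignore`.

    (Hypothetical helper; `ignore` defaults to the presumed index column.)
    """
    cols = [c for c in df.columns if c not in ignore]
    return int(df.duplicated(subset=cols).sum())


# Toy data with made-up values: rows 0 and 1 differ only in the index column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "mapped_veh_id": [181, 181, 102],
    "timestamps_UTC": ["2023-01-23 07:25:08", "2023-01-23 07:25:08", "2023-01-23 07:25:16"],
    "lat": [51.01, 51.01, 50.42],
})

dups_all_columns = int(toy.duplicated().sum())
dups_without_index = count_duplicates(toy)
print(dups_all_columns, dups_without_index)
```

On the toy frame the full-row check finds nothing, while the check without the index column finds one repeated reading.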


## Cleaning the Time Interval

In our data cleaning process, we focus on ensuring that the data falls within the specified time range of the project. To achieve this, we implement the following steps:

1. **Timestamp Conversion**: The `timestamps_UTC` column is converted to a datetime format. 

2. **Defining the Time Range**: We define the time range of our project to be from January 1, 2023, to September 30, 2023. This period marks the temporal extent of our analysis.
   - `start_date`: January 1, 2023
   - `end_date`: September 30, 2023

3. **Filtering Data**: Using the defined time range, we first count the rows whose timestamps fall outside this interval, then keep only the rows inside it. This step ensures that our analysis is confined to the relevant time period.

In [43]:
data['timestamps_UTC'] = pd.to_datetime(data['timestamps_UTC'])

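start_date = '2023-01-01'
end_date = '2023-09-30'

# Note: pd.to_datetime(end_date) is 2023-09-30 00:00:00, so any readings taken
# later on September 30 are also counted as outside the interval here.
outside_interval = data[(data['timestamps_UTC'] < pd.to_datetime(start_date)) | 
                        (data['timestamps_UTC'] > pd.to_datetime(end_date))]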

num_rows_outside_interval = outside_interval.shape[0]

print(f"Number of rows outside the given time interval: {num_rows_outside_interval}")

Number of rows outside the given time interval: 34


In [44]:
# Keep rows from January through September 2023 (all of September 30 included)
filtered_data = data[(data['timestamps_UTC'].dt.year == 2023) & (data['timestamps_UTC'].dt.month <= 9)]
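
The count and the filter in this section use slightly different upper bounds: a comparison against `pd.to_datetime('2023-09-30')` stops at midnight at the start of September 30, while the year/month filter keeps the whole day. One way to make both agree is a single half-open interval mask, sketched here on toy timestamps. The helper `within_period` and its `end_exclusive` parameter are hypothetical, and the sketch assumes timezone-naive timestamps.

```python
import pandas as pd


def within_period(timestamps, start="2023-01-01", end_exclusive="2023-10-01"):
    """Boolean mask for start <= t < end_exclusive.

    Using the first instant after the period as an exclusive bound
    keeps every reading taken on the last day.
    """
    start_ts = pd.Timestamp(start)
    end_ts = pd.Timestamp(end_exclusive)
    return (timestamps >= start_ts) & (timestamps < end_ts)


# Toy timestamps around both boundaries (made-up values).
toy = pd.DataFrame({"timestamps_UTC": pd.to_datetime([
    "2022-12-31 23:59:59",
    "2023-01-01 00:00:00",
    "2023-09-30 15:30:00",
    "2023-10-01 00:00:00",
])})

mask = within_period(toy["timestamps_UTC"])
kept = toy[mask]
num_outside = int((~mask).sum())
print(f"Kept {len(kept)} rows, {num_outside} outside the period")
```

The same mask then serves both purposes: `(~mask).sum()` gives the number of rows outside the period, and `toy[mask]` is the filtered data.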

## Geospatial Data Filtering

In this step of our data cleaning process, we aim to remove specific geospatial data points that fall outside the geographical boundaries of Belgium.

### Identified Coordinates for Removal
Using PostGIS, we identified the following latitude and longitude pairs as lying outside the scope of our project.

In [47]:
coordinates_to_remove = [
    (52.8570588, 4.4855411),
    (50.330024, 0.1750491),
    (50.3979063, 0.2067928),
    (49.3835907, 3.7002351),
    (49.7800844, 3.7347805)
]

# Drop every row whose (lat, lon) exactly equals one of the listed pairs
for lat, lon in coordinates_to_remove:
    filtered_data = filtered_data[~((filtered_data['lat'] == lat) & (filtered_data['lon'] == lon))]
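
The loop above rescans the frame once per coordinate pair. As an alternative, all pairs can be matched in a single pass by treating `(lat, lon)` as a two-level key. This is a sketch under the same assumption as the loop, namely that the stored floats equal the listed values exactly; the helper `drop_coordinate_pairs` and the toy rows are hypothetical.

```python
import pandas as pd


def drop_coordinate_pairs(df, pairs, lat_col="lat", lon_col="lon"):
    """Return `df` without the rows whose (lat, lon) is in `pairs`.

    Matching is exact float equality, as in the loop-based version.
    """
    keys = pd.MultiIndex.from_frame(df[[lat_col, lon_col]])
    to_drop = keys.isin(pairs)  # boolean array, one entry per row
    return df[~to_drop]


# Toy rows (made-up except for the one listed pair); only row "b" matches
# both the latitude and the longitude of a pair to remove.
toy = pd.DataFrame({
    "lat": [50.85, 52.8570588, 50.85],
    "lon": [4.35, 4.4855411, 4.4855411],
    "label": ["a", "b", "c"],
})

result = drop_coordinate_pairs(toy, [(52.8570588, 4.4855411)])
print(result["label"].tolist())
```

Row "c" shares only its longitude with the listed pair and is kept, which matches the behavior of the pairwise `&` condition in the loop.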

In [48]:
filtered_data.to_csv("/kaggle/working/cleaned_data.csv", index=False)