# Data Imputation
Based on the EDA and feature selection, some features contain 'Not applicable' or 'Unknown' values, which may hurt model performance. In this notebook, I impute those values, guided by the EDA and feature selection results. The imputation methods include:
1. KMeans clustering
2. Rule-based imputation
3. Reverse geocoding (Additional)

Features imputed: **ATMOSPH_COND**, **LIGHT_CONDITION**, **SURFACE_COND**, **SPEED_ZONE**, **ROAD_NAME (Additional)**

In [4]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd
import warnings

# Suppress all UserWarnings
warnings.filterwarnings("ignore", category=UserWarning)
# Load all the datasets from the Dataset folder
def load_dataset(file_name):
    """
    Load a dataset from the 'Dataset' folder.
    
    Parameters:
    file_name (str): The name of the file to load (e.g., 'data.csv').
    
    Returns:
    pd.DataFrame: The loaded dataset as a pandas DataFrame.
    """
    return pd.read_csv(f'Dataset/{file_name}', low_memory=False)

file_list = [
    'ACCIDENT.csv',
    'ACCIDENT_CHAINAGE.csv',
    'ACCIDENT_EVENT.csv',
    'ACCIDENT_LOCATION.csv',
    'ATMOSPHERIC_COND.csv',
    'NODE.csv',
    'NODE_ID_COMPLEX_INT_ID.csv',
    'PERSON.csv',
    'ROAD_SURFACE_COND.csv',
    'SUBDCA.csv',
    'VEHICLE.csv'
]
# Load all datasets into a dictionary keyed by file name without the extension
datasets = {file_name.split('.')[0]: load_dataset(file_name) for file_name in file_list}
# Generate a summary table for a categorical column with mapped descriptions and value counts.
def description_summary(df, column_name, mapping_dict):
    mapping_df = pd.DataFrame(list(mapping_dict.items()), columns=[column_name, "Description"])
    counts_df = df[column_name].value_counts().reset_index()
    counts_df.columns = [column_name, "Count"]
    # Merge with descriptions
    summary_df = pd.merge(mapping_df, counts_df, on=column_name, how="left").fillna(0)
    display(summary_df)
    return summary_df

## Atmospheric Condition Imputation
Some accidents have multiple weather conditions recorded, and some have an unknown weather condition. To keep one consistent value per accident, I flag accidents with multiple conditions using a new code (10) and impute the unknown values (code 9) based on the accident time and location.

Original atmospheric condition codes: 1 Clear, 2 Raining, 3 Snowing, 4 Fog, 5 Smoke, 6 Dust, 7 Strong winds, 9 Not known

Join table sequence: ACCIDENT -> NODE -> ATMOSPHERIC_COND


In [5]:
atmospheric_mapping_impute = {
    1: "Clear",
    2: "Raining",
    3: "Snowing",
    4: "Fog",
    5: "Smoke",
    6: "Dust",
    7: "Strong winds",
    9: "Not known",
    10: "Multiple Weather Conditions"
}
atmospheric_mapping = {
    1: "Clear",
    2: "Raining",
    3: "Snowing",
    4: "Fog",
    5: "Smoke",
    6: "Dust",
    7: "Strong winds",
    9: "Not known"
}
atmospheric_condition = datasets['ATMOSPHERIC_COND'].copy()
# Keep one row per ACCIDENT_NO (the first recorded condition)
atmospheric_condition = atmospheric_condition.drop_duplicates(subset=['ACCIDENT_NO'], keep='first')
# Check the distribution of atmospheric condition
print("Original Atmospheric Condition Distribution:")
atmospheric_condition_summary = description_summary(atmospheric_condition, "ATMOSPH_COND", atmospheric_mapping)

Original Atmospheric Condition Distribution:


Unnamed: 0,ATMOSPH_COND,Description,Count
0,1,Clear,164227
1,2,Raining,22018
2,3,Snowing,48
3,4,Fog,1535
4,5,Smoke,253
5,6,Dust,396
6,7,Strong winds,392
7,9,Not known,14839


### Step 1: Load the data, and assign "10" for multiple weather conditions

In [6]:
# Load the data
acc = datasets['ACCIDENT'].copy()
# Parse the accident hour and month (ACCIDENTDATE is day-first, e.g. 13/01/2006)
acc['ACCIDENT_TIME_HOUR'] = pd.to_datetime(acc['ACCIDENTTIME']).dt.hour
acc['ACCIDENT_TIME_MONTH'] = pd.to_datetime(acc['ACCIDENTDATE'], dayfirst=True).dt.month
# Find accidents whose retained row has ATMOSPH_COND_SEQ > 1.
# Note: duplicates were already dropped above, so an accident whose first row
# has ATMOSPH_COND_SEQ == 1 is not flagged here even if it had further rows.
multi_weather_ids = atmospheric_condition[atmospheric_condition['ATMOSPH_COND_SEQ'] > 1]['ACCIDENT_NO'].unique()
# Assign 10 for multiple weather conditions
atmospheric_condition.loc[atmospheric_condition["ACCIDENT_NO"].isin(multi_weather_ids), "ATMOSPH_COND"] = 10
atmospheric_condition.loc[atmospheric_condition["ACCIDENT_NO"].isin(multi_weather_ids), "Atmosph Cond Desc"] = 'Multiple Weather Conditions'

atmospheric_condition.shape

(203708, 4)
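
One way to flag multi-condition accidents independently of row order is to count rows per accident on the full table before dropping duplicates. The following is a minimal sketch on toy data, not the notebook's own procedure; the `N_CONDITIONS` helper column and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for ATMOSPHERIC_COND: accident "B" has two recorded conditions.
raw = pd.DataFrame({
    "ACCIDENT_NO": ["A", "B", "B", "C"],
    "ATMOSPH_COND": [1, 2, 7, 9],
    "ATMOSPH_COND_SEQ": [1, 1, 2, 1],
})

# Count rows per accident on the full table, before any de-duplication.
# N_CONDITIONS is a hypothetical helper column, not part of the dataset.
rows_per_accident = raw.groupby("ACCIDENT_NO")["ATMOSPH_COND_SEQ"].transform("size")

# Keep one row per accident, then overwrite the code where several rows existed.
flagged = (
    raw.assign(N_CONDITIONS=rows_per_accident)
       .drop_duplicates(subset=["ACCIDENT_NO"], keep="first")
       .copy()
)
flagged.loc[flagged["N_CONDITIONS"] > 1, "ATMOSPH_COND"] = 10

print(flagged[["ACCIDENT_NO", "ATMOSPH_COND"]])
```

With this ordering, accident "B" is flagged even though its retained row has `ATMOSPH_COND_SEQ == 1`.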

### Step 2: Join the atmospheric condition with the ACCIDENT and NODE tables on ACCIDENT_NO

In [7]:
# Join with accident features
node_by_acc = datasets["NODE"].drop_duplicates(subset=["ACCIDENT_NO"]).copy()
acc_node = acc.merge(
    node_by_acc[["ACCIDENT_NO","POSTCODE_NO"]],
    on="ACCIDENT_NO", how="left"
)
acc_node.shape

(203708, 31)

In [8]:
atmospheric_condition_full = atmospheric_condition.merge(
    acc_node[["ACCIDENT_NO","ACCIDENT_TIME_MONTH","ACCIDENT_TIME_HOUR","POSTCODE_NO"]],
    on="ACCIDENT_NO", how="left"
)
atmospheric_condition_full.shape

(203708, 7)
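
The shape checks above confirm that the joins keep one row per accident. A small sketch of how pandas can enforce this directly, assuming toy frames in place of the real tables: `validate` raises an error if a key is duplicated, and `indicator` shows which rows found no match.

```python
import pandas as pd

# Toy frames standing in for the atmospheric and accident/node tables.
left = pd.DataFrame({"ACCIDENT_NO": ["A", "B", "C"], "ATMOSPH_COND": [1, 9, 2]})
right = pd.DataFrame({"ACCIDENT_NO": ["A", "B"], "POSTCODE_NO": [3000, 3825]})

# validate="one_to_one" raises pandas.errors.MergeError if either side repeats a key.
# indicator=True adds a _merge column marking rows that found no match on the right.
joined = left.merge(
    right, on="ACCIDENT_NO", how="left", validate="one_to_one", indicator=True
)

# A left join with unique keys must preserve the number of rows.
assert len(joined) == len(left)

unmatched = int((joined["_merge"] == "left_only").sum())
print(f"Rows without a POSTCODE_NO match: {unmatched}")
```

Unmatched rows end up with a missing `POSTCODE_NO`, which matters in Step 3 because the postcode is used as a clustering feature.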

### Step 3: Impute missing atmospheric condition using KMeans clustering based on ACCIDENT_TIME_MONTH, ACCIDENT_TIME_HOUR and POSTCODE_NO

In [9]:
features_num = ["ACCIDENT_TIME_MONTH", "ACCIDENT_TIME_HOUR"]
features_cat = ["POSTCODE_NO"] 

for col in features_cat:
    atmospheric_condition_full[col] = atmospheric_condition_full[col].astype("category")
# Mask the unknown and multiple weather conditions
mask_unknown = (atmospheric_condition_full["ATMOSPH_COND"] == 9)
mask_multi   = (atmospheric_condition_full["ATMOSPH_COND"] == 10)

known = atmospheric_condition_full[~mask_unknown & ~mask_multi].copy()
unknown = atmospheric_condition_full[mask_unknown].copy()
multi   = atmospheric_condition_full[mask_multi].copy()

# Build a preprocessing + KMeans pipeline
pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), features_num),
        ("cat", OneHotEncoder(handle_unknown="ignore"), features_cat),
    ]
)
# KMeans with 7 clusters, chosen to match the 7 known weather codes (1-7).
# KMeans is unsupervised: clusters are formed from month, hour and postcode only,
# so each cluster is mapped to a weather code afterwards by majority vote.
kmeans = KMeans(n_clusters=7, random_state=42, n_init=10)
pipe = Pipeline(steps=[("prep", pre), ("km", kmeans)])

# Fit on KNOWN only (1-7 weather conditions)
X_known = known[features_num + features_cat]
pipe.fit(X_known)

#  Get clusters for known and map each cluster to its majority weather code
known["cluster"] = pipe.named_steps["km"].labels_
cluster_weather_map = (
    known.groupby("cluster")["ATMOSPH_COND"]
    .agg(lambda s: s.mode().iloc[0])
    .to_dict()
)

# Predict clusters for 9(Not known) and impute from cluster majority
X_unknown = unknown[features_num + features_cat]
unknown["cluster"] = pipe.predict(X_unknown)
unknown["ATMOSPH_COND"] = unknown["cluster"].map(cluster_weather_map).astype(int)

# Recombine the datasets
atmospheric_condition_full_imputed = pd.concat([known, unknown, multi], ignore_index=True)
# Check the count of each class in the imputed atmospheric condition
print("Atmospheric Condition Imputation Summary:")
atmospheric_condition_summary = description_summary(atmospheric_condition_full_imputed, "ATMOSPH_COND", atmospheric_mapping_impute)

Atmospheric Condition Imputation Summary:


Unnamed: 0,ATMOSPH_COND,Description,Count
0,1,Clear,179066.0
1,2,Raining,21686.0
2,3,Snowing,38.0
3,4,Fog,1532.0
4,5,Smoke,237.0
5,6,Dust,396.0
6,7,Strong winds,392.0
7,9,Not known,0.0
8,10,Multiple Weather Conditions,361.0
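
In the output above, the Clear count grows by exactly the 14,839 previously unknown rows, which suggests that every cluster's majority class was Clear. A quick way to see how much the clusters actually separate the weather codes is to compute cluster purity on the known rows. The sketch below uses toy data with a single `hour` feature; the frame, the `label` column, and the two-cluster setup are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data: two well-separated groups on one feature, with known labels.
known = pd.DataFrame({
    "hour": [1.0, 2.0, 3.0, 20.0, 21.0, 22.0],
    "label": [1, 1, 2, 2, 2, 2],
})
unknown = pd.DataFrame({"hour": [2.5, 21.5]})

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(known[["hour"]])
known["cluster"] = km.labels_

# Majority label per cluster (ties resolved by taking the smallest label).
majority = known.groupby("cluster")["label"].agg(lambda s: s.mode().iloc[0])

# Purity: share of known rows whose own label equals their cluster's majority label.
purity = (known["label"] == known["cluster"].map(majority)).mean()

# Impute unknown rows from the majority label of their predicted cluster.
unknown["cluster"] = km.predict(unknown[["hour"]])
unknown["label"] = unknown["cluster"].map(majority)

print(f"purity = {purity:.3f}")
print(unknown)
```

If every cluster maps to the same majority class, the procedure behaves like global mode imputation, and the purity simply equals the share of that class among the known rows.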


In [11]:
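# Check the distribution before KMeans imputation. Code 10 has already been assigned
# and is absent from atmospheric_mapping, so those rows are not listed here.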
print("Original Atmospheric Condition Distribution:")
atmospheric_condition_summary = description_summary(atmospheric_condition, "ATMOSPH_COND", atmospheric_mapping)

Original Atmospheric Condition Distribution:


Unnamed: 0,ATMOSPH_COND,Description,Count
0,1,Clear,164227
1,2,Raining,21686
2,3,Snowing,38
3,4,Fog,1532
4,5,Smoke,237
5,6,Dust,396
6,7,Strong winds,392
7,9,Not known,14839


In [11]:
# ---- SAVE THE RESULT ---- Uncomment this cell to save.
# Keep only ACCIDENT_NO, ATMOSPH_COND, Atmosph Cond Desc
# atmospheric_condition_final = atmospheric_condition_full_imputed[["ACCIDENT_NO", "ATMOSPH_COND", "Atmosph Cond Desc"]].copy()
# Save the imputed atmospheric condition to a csv file
# atmospheric_condition_final.to_csv("Clean-data/ATMOSPHERIC_COND_IMPUTED.csv", index=False)

## Lighting Condition Imputation
Some accidents have an unknown lighting condition. To keep the feature consistent, I impute the lighting condition from the hour of the accident.

Original light condition codes: 1 Daylight, 2 Dark - Street Lights On, 3 Dark - Street Lights Off, 4 Dark - No Street Lights, 5 Dusk/Dawn, 6 Dark - Unknown Lighting, 7 Daylight - Unknown Lighting, 8 Daylight - Street Lights On, 9 Unknown


In [12]:
acc = datasets['ACCIDENT'].copy()
# Check the distribution of LIGHT_CONDITION
print("Original Light Condition Distribution:")
lighting_mapping = {
    1: "Daylight",
    2: "Dark - Street Lights On",
    3: "Dark - Street Lights Off",
    4: "Dark - No Street Lights",
    5: "Dusk/Dawn",
    6: "Dark - Unknown Lighting",
    7: "Daylight - Unknown Lighting",
    8: "Daylight - Street Lights On",
    9: "Unknown"
}
lighting_summary = description_summary(acc, "LIGHT_CONDITION", lighting_mapping)

Original Light Condition Distribution:


Unnamed: 0,LIGHT_CONDITION,Description,Count
0,1,Daylight,136360.0
1,2,Dark - Street Lights On,16485.0
2,3,Dark - Street Lights Off,33105.0
3,4,Dark - No Street Lights,516.0
4,5,Dusk/Dawn,11490.0
5,6,Dark - Unknown Lighting,2063.0
6,7,Daylight - Unknown Lighting,0.0
7,8,Daylight - Street Lights On,0.0
8,9,Unknown,3689.0


### Impute LIGHT_CONDITION based on ACCIDENT_TIME
If the accident hour is from 6 up to (but not including) 18, set LIGHT_CONDITION to 1 (Daylight); otherwise (hours 18 through 5), set it to 6 (Dark - Unknown Lighting). Only rows with LIGHT_CONDITION 9 (Unknown) are changed.

In [13]:
acc['ACCIDENT_TIME_HOUR'] = pd.to_datetime(acc['ACCIDENTTIME']).dt.hour
# Mask for Unknown light conditions (9)
mask_unknown = acc['LIGHT_CONDITION'] == 9
# if the accident time hour is between 6 and 18, set LIGHT_CONDITION to 1 (Daylight)
acc.loc[mask_unknown & (acc['ACCIDENT_TIME_HOUR'] >= 6) & (acc['ACCIDENT_TIME_HOUR'] < 18), 'LIGHT_CONDITION'] = 1
# if the accident time hour is between 18 and 6, set LIGHT_CONDITION to 6 (Dark - Unknown Lighting)
acc.loc[mask_unknown & ((acc['ACCIDENT_TIME_HOUR'] < 6) | (acc['ACCIDENT_TIME_HOUR'] >= 18)), 'LIGHT_CONDITION'] = 6
# Check result
print("Imputed Light Condition Distribution:")
lighting_summary_imputed = description_summary(acc, "LIGHT_CONDITION", lighting_mapping)

Imputed Light Condition Distribution:


Unnamed: 0,LIGHT_CONDITION,Description,Count
0,1,Daylight,138997.0
1,2,Dark - Street Lights On,16485.0
2,3,Dark - Street Lights Off,33105.0
3,4,Dark - No Street Lights,516.0
4,5,Dusk/Dawn,11490.0
5,6,Dark - Unknown Lighting,3115.0
6,7,Daylight - Unknown Lighting,0.0
7,8,Daylight - Street Lights On,0.0
8,9,Unknown,0.0
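
The hour-based rule above can also be written as a small reusable function, which makes the boundary hours easy to test. This is a sketch under stated assumptions: the function name `impute_light_condition` and its `day_start` and `day_end` parameters are hypothetical, and times are assumed to follow the `HH:MM:SS` format seen in the sample rows.

```python
import numpy as np
import pandas as pd

def impute_light_condition(df, day_start=6, day_end=18):
    """Replace LIGHT_CONDITION 9 with 1 (day) or 6 (dark) based on the hour."""
    out = df.copy()
    hour = pd.to_datetime(out["ACCIDENTTIME"], format="%H:%M:%S").dt.hour
    unknown = out["LIGHT_CONDITION"] == 9
    # Daytime covers day_start <= hour < day_end (between() is inclusive on both ends).
    is_day = hour.between(day_start, day_end - 1)
    out.loc[unknown, "LIGHT_CONDITION"] = np.where(is_day[unknown], 1, 6)
    return out

toy = pd.DataFrame({
    "ACCIDENTTIME": ["05:59:00", "06:00:00", "17:59:00", "18:00:00", "12:00:00"],
    "LIGHT_CONDITION": [9, 9, 9, 9, 3],
})
result = impute_light_condition(toy)
print(result)
```

The last row keeps its recorded value because only code 9 is overwritten. A fixed 6 to 18 window ignores seasonal changes in daylight, so the boundaries are best treated as an approximation.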


In [14]:
acc.head()

Unnamed: 0,ACCIDENT_NO,ACCIDENTDATE,ACCIDENTTIME,ACCIDENT_TYPE,Accident Type Desc,DAY_OF_WEEK,Day Week Description,DCA_CODE,DCA Description,DIRECTORY,...,NO_PERSONS_INJ_2,NO_PERSONS_INJ_3,NO_PERSONS_KILLED,NO_PERSONS_NOT_INJ,POLICE_ATTEND,ROAD_GEOMETRY,Road Geometry Desc,SEVERITY,SPEED_ZONE,ACCIDENT_TIME_HOUR
0,T20060000010,13/01/2006,12:42:00,1,Collision with vehicle,6,Friday,113,RIGHT NEAR (INTERSECTIONS ONLY),MEL,...,0,1,0,5,1,1,Cross intersection,3,60,12
1,T20060000018,13/01/2006,19:10:00,1,Collision with vehicle,6,Friday,113,RIGHT NEAR (INTERSECTIONS ONLY),MEL,...,0,1,0,3,1,2,T intersection,3,70,19
2,T20060000022,14/01/2006,12:10:00,7,Fall from or in moving vehicle,7,Saturday,190,FELL IN/FROM VEHICLE,MEL,...,1,0,0,1,1,5,Not at intersection,2,100,12
3,T20060000023,14/01/2006,11:49:00,1,Collision with vehicle,7,Saturday,130,REAR END(VEHICLES IN SAME LANE),MEL,...,1,0,0,1,1,2,T intersection,2,80,11
4,T20060000026,14/01/2006,10:45:00,1,Collision with vehicle,7,Saturday,121,RIGHT THROUGH,MEL,...,0,3,0,0,1,5,Not at intersection,3,50,10


In [20]:
# ---- SAVE THE RESULT ---- Uncomment this cell to save.
# Keep only ACCIDENT_NO, LIGHT_CONDITION, Light Condition Desc
# acc_light_condition = acc[['ACCIDENT_NO', 'LIGHT_CONDITION', 'Light Condition Desc']].copy()
# Save the imputed light condition to a csv file
# acc_light_condition.to_csv("Clean-data/LIGHT_CONDITION_IMPUTED.csv", index=False)

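## Speed Zone Imputation
Unknown speed zones (999) are filled with the most frequent speed zone for the accident's ROAD_TYPE, falling back to the overall most frequent speed zone when ROAD_TYPE is missing.

Speed zone codes: 040 40 km/hr, 050 50 km/hr, 060 60 km/hr, 075 75 km/hr, 080 80 km/hr, 090 90 km/hr, 100 100 km/hr, 110 110 km/hr, 777 Other speed limit, 888 Camping grounds, off road, 999 Not known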

In [15]:
acc = datasets['ACCIDENT'].copy()
# Check the distribution of SPEED_ZONE
print("Original Speed Zone Distribution:")
speed_zone_mapping = {
    40: "40 km/hr",
    50: "50 km/hr",
    60: "60 km/hr",
    75: "75 km/hr",
    80: "80 km/hr",
    90: "90 km/hr",
    100: "100 km/hr",
    110: "110 km/hr",
    777: "Other speed limit",
    888: "Camping grounds, off road",
    999: "Not known"
}
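# Check the speed zone distribution.
# Only codes listed in speed_zone_mapping appear in the summary; other values
# present in the data (e.g. 70) are not shown.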
speed_zone_summary = description_summary(acc, "SPEED_ZONE", speed_zone_mapping)

Original Speed Zone Distribution:


Unnamed: 0,SPEED_ZONE,Description,Count
0,40,40 km/hr,8937
1,50,50 km/hr,36149
2,60,60 km/hr,69133
3,75,75 km/hr,62
4,80,80 km/hr,27794
5,90,90 km/hr,940
6,100,100 km/hr,31240
7,110,110 km/hr,2151
8,777,Other speed limit,249
9,888,"Camping grounds, off road",930


In [16]:
acc = datasets['ACCIDENT'].copy()
acc_location = datasets['ACCIDENT_LOCATION'].copy()
acc = acc.merge(acc_location[['ACCIDENT_NO', 'ROAD_TYPE']], on='ACCIDENT_NO', how='left')
# Treat 999 as missing
acc['SPEED_ZONE'] = acc['SPEED_ZONE'].replace(999, np.nan)

# Rule-based imputation:
# Fill NaNs with the most frequent SPEED_ZONE within each ROAD_TYPE.
# If ROAD_TYPE is missing for some rows, fall back to global mode.

# Find the most frequent SPEED_ZONE for each ROAD_TYPE
def group_mode(s):
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan
mode_by_group = acc.groupby('ROAD_TYPE')['SPEED_ZONE'].transform(group_mode)
acc['SPEED_ZONE'] = acc['SPEED_ZONE'].fillna(mode_by_group)

# Fall back to the global mode for rows that are still missing
# (ROAD_TYPE is missing, or its group has no known SPEED_ZONE)
global_mode = acc['SPEED_ZONE'].mode()
acc['SPEED_ZONE'] = acc['SPEED_ZONE'].fillna(global_mode.iloc[0])
speed_zone_summary = description_summary(acc, "SPEED_ZONE", speed_zone_mapping)

Unnamed: 0,SPEED_ZONE,Description,Count
0,40,40 km/hr,8941.0
1,50,50 km/hr,36987.0
2,60,60 km/hr,77609.0
3,75,75 km/hr,62.0
4,80,80 km/hr,28464.0
5,90,90 km/hr,940.0
6,100,100 km/hr,31954.0
7,110,110 km/hr,2151.0
8,777,Other speed limit,249.0
9,888,"Camping grounds, off road",933.0
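
To make the two-stage fill easier to audit, the group mode can be computed once per ROAD_TYPE and mapped back, which also lets you count how many rows were filled at each stage. This is a sketch on toy data using `agg` plus `map` as an alternative to `transform`; the counters `n_by_type` and `n_global` are illustrative names, and the global mode here is taken from the observed values only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ROAD_TYPE": ["STREET", "STREET", "STREET", "HIGHWAY", "HIGHWAY", None],
    "SPEED_ZONE": [50, 50, 999, 100, 999, 999],
})

# Treat 999 as missing.
speed = toy["SPEED_ZONE"].replace(999, np.nan)

def group_mode(s):
    """Most frequent non-missing value, or NaN if the group has none."""
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan

# Stage 1: one mode per ROAD_TYPE, mapped back onto the rows.
mode_per_type = speed.groupby(toy["ROAD_TYPE"]).agg(group_mode)
by_type = toy["ROAD_TYPE"].map(mode_per_type)
after_type = speed.fillna(by_type)

# Stage 2: global mode of the observed values for whatever is still missing.
filled = after_type.fillna(speed.mode().iloc[0])

n_by_type = int(speed.isna().sum() - after_type.isna().sum())
n_global = int(after_type.isna().sum())
print(filled.tolist(), n_by_type, n_global)
```

Reporting the two counts alongside the summary table shows how much of the imputation relies on the road type and how much on the global fallback.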


In [17]:
acc.head()

Unnamed: 0,ACCIDENT_NO,ACCIDENTDATE,ACCIDENTTIME,ACCIDENT_TYPE,Accident Type Desc,DAY_OF_WEEK,Day Week Description,DCA_CODE,DCA Description,DIRECTORY,...,NO_PERSONS_INJ_2,NO_PERSONS_INJ_3,NO_PERSONS_KILLED,NO_PERSONS_NOT_INJ,POLICE_ATTEND,ROAD_GEOMETRY,Road Geometry Desc,SEVERITY,SPEED_ZONE,ROAD_TYPE
0,T20060000010,13/01/2006,12:42:00,1,Collision with vehicle,6,Friday,113,RIGHT NEAR (INTERSECTIONS ONLY),MEL,...,0,1,0,5,1,1,Cross intersection,3,60.0,STREET
1,T20060000018,13/01/2006,19:10:00,1,Collision with vehicle,6,Friday,113,RIGHT NEAR (INTERSECTIONS ONLY),MEL,...,0,1,0,3,1,2,T intersection,3,70.0,ROAD
2,T20060000022,14/01/2006,12:10:00,7,Fall from or in moving vehicle,7,Saturday,190,FELL IN/FROM VEHICLE,MEL,...,1,0,0,1,1,5,Not at intersection,2,100.0,ROAD
3,T20060000023,14/01/2006,11:49:00,1,Collision with vehicle,7,Saturday,130,REAR END(VEHICLES IN SAME LANE),MEL,...,1,0,0,1,1,2,T intersection,2,80.0,ROAD
4,T20060000026,14/01/2006,10:45:00,1,Collision with vehicle,7,Saturday,121,RIGHT THROUGH,MEL,...,0,3,0,0,1,5,Not at intersection,3,50.0,AVENUE


In [30]:
# ---- SAVE THE RESULT ---- Uncomment this cell to save.
# Keep only ACCIDENT_NO, SPEED_ZONE
# acc_speed_zone = acc[['ACCIDENT_NO', 'SPEED_ZONE']].copy()
# Save the imputed speed zone to a csv file
# acc_speed_zone.to_csv("Clean-data/SPEED_ZONE_IMPUTED.csv", index=False)

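## ROAD_SURFACE_COND Imputation
Accidents with multiple surface conditions are flagged with a new code (10), and unknown values (9) are filled with the most frequent surface condition for the accident's ROAD_TYPE, falling back to the global mode.

Road surface condition codes: 1 Dry, 2 Wet, 3 Muddy, 4 Snowy, 5 Icy, 9 Unknown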

In [18]:
road_surface_cond_mapping = {
    1: "Dry",
    2: "Wet",
    3: "Muddy",
    4: "Snowy",
    5: "Icy",
    9: "Unknown"
}
road_surface_cond_mapping_impute = {
    1: "Dry",
    2: "Wet",
    3: "Muddy",
    4: "Snowy",
    5: "Icy",
    9: "Unknown",
    10: "Multiple Surface Conditions"
}
road_surface = datasets['ROAD_SURFACE_COND'].copy()
road_surface = road_surface.drop_duplicates(subset=['ACCIDENT_NO'], keep='first')
# Check the distribution of ROAD_SURFACE_COND
print("Original Road Surface Condition Distribution:")
road_surface_summary = description_summary(road_surface, "SURFACE_COND", road_surface_cond_mapping)

Original Road Surface Condition Distribution:


Unnamed: 0,SURFACE_COND,Description,Count
0,1,Dry,160269
1,2,Wet,32162
2,3,Muddy,439
3,4,Snowy,83
4,5,Icy,339
5,9,Unknown,10416


In [19]:
# Find accidents whose retained row has SURFACE_COND_SEQ > 1
multi_surface_ids = road_surface[road_surface['SURFACE_COND_SEQ'] > 1]['ACCIDENT_NO'].unique()
# Assign 10 for multiple surface conditions
road_surface.loc[road_surface["ACCIDENT_NO"].isin(multi_surface_ids), "SURFACE_COND"] = 10
# Update the existing description column (the one merged later as 'Surface Cond Desc')
road_surface.loc[road_surface["ACCIDENT_NO"].isin(multi_surface_ids), "Surface Cond Desc"] = 'Multiple Surface Conditions'
road_surface_summary = description_summary(road_surface, "SURFACE_COND", road_surface_cond_mapping_impute)

Unnamed: 0,SURFACE_COND,Description,Count
0,1,Dry,160213
1,2,Wet,32154
2,3,Muddy,439
3,4,Snowy,81
4,5,Icy,339
5,9,Unknown,10416
6,10,Multiple Surface Conditions,66


In [20]:
# Merge ACCIDENT, ACCIDENT_LOCATION, ROAD_SURFACE
acc = datasets['ACCIDENT'].copy()
acc_location = datasets['ACCIDENT_LOCATION'].copy()
acc = acc.merge(acc_location[['ACCIDENT_NO', 'ROAD_TYPE']], on='ACCIDENT_NO', how='left')
acc = acc.merge(road_surface[['ACCIDENT_NO', 'SURFACE_COND', 'Surface Cond Desc']], on='ACCIDENT_NO', how='left')

# Treat 9 = Unknown as NaN
acc['SURFACE_COND'] = acc['SURFACE_COND'].replace(9, np.nan)

# Rule-based imputation:
# Fill NaNs with the most frequent SURFACE_COND within each ROAD_TYPE.
# If ROAD_TYPE is missing for some rows, fall back to global mode.
def group_mode(s):
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan

mode_by_road = acc.groupby('ROAD_TYPE')['SURFACE_COND'].transform(group_mode)
acc['SURFACE_COND'] = acc['SURFACE_COND'].fillna(mode_by_road)

# Global mode
global_mode = acc['SURFACE_COND'].mode()
if not global_mode.empty:
    acc['SURFACE_COND'] = acc['SURFACE_COND'].fillna(global_mode.iloc[0])

In [21]:
acc.shape

(203708, 31)

In [22]:
road_surface_summary = description_summary(acc, "SURFACE_COND", road_surface_cond_mapping_impute)

Unnamed: 0,SURFACE_COND,Description,Count
0,1,Dry,170629.0
1,2,Wet,32154.0
2,3,Muddy,439.0
3,4,Snowy,81.0
4,5,Icy,339.0
5,9,Unknown,0.0
6,10,Multiple Surface Conditions,66.0
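
The imputation changes SURFACE_COND but leaves the text in 'Surface Cond Desc' as it was, so imputed rows can still carry the old "Unknown" description. A minimal sketch of re-deriving the description from the code, assuming a toy frame; the `surface_labels` dictionary is a hypothetical copy of the mapping used in this section.

```python
import pandas as pd

# Hypothetical copy of the code-to-description mapping used in this section.
surface_labels = {
    1: "Dry", 2: "Wet", 3: "Muddy", 4: "Snowy", 5: "Icy",
    9: "Unknown", 10: "Multiple Surface Conditions",
}

# Toy frame after imputation: the codes are floats and two descriptions are stale.
toy = pd.DataFrame({
    "SURFACE_COND": [1.0, 1.0, 2.0, 10.0],
    "Surface Cond Desc": ["Dry", "Unknown", "Wet", "Wet"],
})

# Restore integer codes (safe once no NaN remains), then rebuild the description.
toy["SURFACE_COND"] = toy["SURFACE_COND"].astype(int)
toy["Surface Cond Desc"] = toy["SURFACE_COND"].map(surface_labels)
print(toy)
```

The same pattern would apply to the other code and description pairs in this notebook, such as LIGHT_CONDITION and its description column.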


In [44]:
# ---- SAVE THE RESULT ---- Uncomment this cell to save.
# Keep only ACCIDENT_NO, SURFACE_COND, Surface Cond Desc
# acc_road_surface = acc[['ACCIDENT_NO', 'SURFACE_COND', 'Surface Cond Desc']].copy()
# Save the imputed road surface condition to a csv file
# acc_road_surface.to_csv("Clean-data/ROAD_SURFACE_COND_IMPUTED.csv", index=False)

## (Additional, Run with Caution) Impute ROAD_NAME in ACCIDENT_LOCATION
Use the Lat and Long columns, available after joining NODE, to reverse-geocode the missing ROAD_NAME values in ACCIDENT_LOCATION.

Join table sequence: ACCIDENT_LOCATION -> NODE (on ACCIDENT_NO)


In [23]:
acc_location = datasets['ACCIDENT_LOCATION'].copy()
node_by_acc = datasets["NODE"].drop_duplicates(subset=["ACCIDENT_NO"]).copy()
# Join on ACCIDENT_NO
acc_loc_enriched = acc_location.merge(node_by_acc, on="ACCIDENT_NO", how="left")
# find the rows with missing ROAD_NAME
missing_road_name = acc_loc_enriched[acc_loc_enriched["ROAD_NAME"].isna()]
missing_road_name

Unnamed: 0,ACCIDENT_NO,NODE_ID_x,ROAD_ROUTE_1,ROAD_NAME,ROAD_TYPE,ROAD_NAME_INT,ROAD_TYPE_INT,DISTANCE_LOCATION,DIRECTION_LOCATION,NEAREST_KM_POST,...,NODE_TYPE,VICGRID94_X,VICGRID94_Y,LGA_NAME,LGA_NAME_ALL,REGION_NAME,DEG_URBAN_NAME,Lat,Long,POSTCODE_NO
41,T20060000265,-10,,,,,,,,,...,,,,,,,,,,
239,T20060001147,205508,-1.0,,,,,0.0,NK,,...,O,2601323.287,2388098.254,BAW BAW,BAW BAW,EASTERN REGION,RURAL_VICTORIA,-38.002811,146.153669,3825.0
294,T20060001419,-1,,,,,,,,,...,,,,,,,,,,
297,T20060001426,-3,,,,,,,,,...,,,,,,,,,,
300,T20060001431,-3,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
197282,T20190026485,632362,-1.0,,,,,0.0,NK,,...,O,2495556.750,2413212.396,MELBOURNE,MELBOURNE,METROPOLITAN NORTH WEST REGION,MELB_URBAN,-37.782060,144.949558,3052.0
197549,T20200000493,661855,-1.0,,,,,0.0,NK,,...,O,2509060.792,2492594.510,MITCHELL,MITCHELL,NORTHERN REGION,RURAL_VICTORIA,-37.066696,145.101898,3659.0
200126,T20200006060,638277,-1.0,,,,,0.0,NK,,...,O,2289931.606,2481504.911,NORTHERN GRAMPIANS,NORTHERN GRAMPIANS,WESTERN REGION,RURAL_VICTORIA,-37.143162,142.634955,3381.0
201595,T20200009524,653055,-1.0,,,,,0.0,NK,,...,O,2432214.089,2533131.034,BENDIGO,BENDIGO,NORTHERN REGION,RURAL_VICTORIA,-36.698976,144.241328,3556.0


In [65]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import re

mask_missing = acc_loc_enriched["ROAD_NAME"].isna() & acc_loc_enriched["Lat"].notna() & acc_loc_enriched["Long"].notna()
# Reverse-geocode only the remaining nulls. Run with caution and respect the API rate limit.
geolocator = Nominatim(user_agent="vic_crash_geocoder")
geocode_rev = RateLimiter(geolocator.reverse, min_delay_seconds=1.1)
# Look up the road name from lat, lon
def reverse_to_road(lat, lon):
    try:
        loc = geocode_rev((lat, lon), language="en")
        if loc and "road" in (loc.raw.get("address") or {}):
            print(loc.raw["address"]["road"])
            return loc.raw["address"]["road"]
    except Exception:
        return None
    return None

acc_loc_enriched.loc[mask_missing, "ROAD_NAME"] = acc_loc_enriched.loc[mask_missing, ["Lat","Long"]] \
    .apply(lambda s: reverse_to_road(s["Lat"], s["Long"]), axis=1)

# Clean: drop street type suffix & UPPERCASE
STREET_TYPES = r"(road|rd|street|st|avenue|ave|drive|dr|track|trl|court|ct|place|pl|close|cl|highway|hwy|boulevard|blvd|lane|ln|terrace|ter|crescent|cres|way)"
PATTERN_END = re.compile(rf"\b{STREET_TYPES}\b\.?$", flags=re.IGNORECASE)
def clean_road_base(name):
    if not isinstance(name, str):
        return name
    cleaned = PATTERN_END.sub("", name).strip()
    return cleaned.upper()

acc_loc_enriched["ROAD_NAME"] = acc_loc_enriched["ROAD_NAME"].apply(clean_road_base)
# Fallback: if lat/long are missing or no road was found, fill with UNKNOWN
acc_loc_enriched["ROAD_NAME"] = acc_loc_enriched["ROAD_NAME"].fillna("UNKNOWN")
acc_loc_enriched.to_csv("ACCIDENT_LOCATION_with_NODE_ROADNAME_CLEAN.csv", index=False)

Cusack Road
Burgess Track
Overland Drive
Fox Island Road
Swallowfield Road
Foreshore Road
Hardings Lane
Chandler Avenue
Sayers Track
Hume Freeway
Dartmouth Road
Wellington Road
Wet Gully Track
Mary Street
Koonwarra Pound Creek Road
Jackson Road 4WD
Channel Track
Thompson Road
Olsen Road
Woolshed Gully Drive
Bungaree - Wallace Road
Heathcote - Nagambie Road
Cherry Tree Spur Track
Muirs Lane
Former Minnidale Road South
Hill Plantation Link Track
Cambarville Road
Stawell Road
Harnett Street
Mansfield - Woods Point Road
Hume Freeway
Bogong High Plains Road
Howes Creek-Goughs Bay Road
Strzelecki Highway
Mayford Track
Pennyroyal East Link
Summit Road
Punt Track
Suckling Road
Hume Freeway On Ramp
Old Yarck Road
Boundary 14 Track
Murray Valley Highway Offramp
Barry Road
Maroondah Highway
Cowwarr Walhalla Road
Drummond Street South
Murray Valley Highway
Great Alpine Road
Deans Marsh - Lorne Road
Great Ocean Road
Dorchap Range Track
Murray River Road
Corn Hill Track
Ross Road
Peregrine Track
MT 
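
Because each lookup is rate-limited to roughly one per second, repeated coordinates are worth caching. The sketch below shows the idea with a stand-in function instead of a real geocoder, so it runs offline; `fake_reverse`, `cached_road`, `road_for`, and the rounding precision are all assumptions for illustration.

```python
from functools import lru_cache

calls = []

def fake_reverse(lat, lon):
    """Stand-in for a rate-limited geocoder call; records each invocation."""
    calls.append((lat, lon))
    return "Hume Freeway"

@lru_cache(maxsize=None)
def cached_road(lat_rounded, lon_rounded):
    # Only reached the first time a given rounded coordinate pair is seen.
    return fake_reverse(lat_rounded, lon_rounded)

def road_for(lat, lon, ndigits=4):
    # Round so that near-identical points share one cache entry.
    return cached_road(round(lat, ndigits), round(lon, ndigits))

coords = [
    (-37.782060, 144.949558),
    (-37.782061, 144.949559),  # rounds to the same pair as the first point
    (-36.698976, 144.241328),
]
roads = [road_for(lat, lon) for lat, lon in coords]
print(len(calls), roads)
```

Four decimal places of a degree is about 11 m of latitude, so points that share a cache entry could in principle sit on different roads; a finer precision trades fewer cache hits for fewer such collisions.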

In [69]:
miss = acc_loc_enriched[acc_loc_enriched["ROAD_NAME"]== "UNKNOWN"].copy()
miss

Unnamed: 0,ACCIDENT_NO,NODE_ID_x,ROAD_ROUTE_1,ROAD_NAME,ROAD_TYPE,ROAD_NAME_INT,ROAD_TYPE_INT,DISTANCE_LOCATION,DIRECTION_LOCATION,NEAREST_KM_POST,...,NODE_TYPE,VICGRID94_X,VICGRID94_Y,LGA_NAME,LGA_NAME_ALL,REGION_NAME,DEG_URBAN_NAME,Lat,Long,POSTCODE_NO
41,T20060000265,-10,,UNKNOWN,,,,,,,...,,,,,,,,,,
239,T20060001147,205508,-1.0,UNKNOWN,,,,0.0,NK,,...,O,2601323.287,2388098.254,BAW BAW,BAW BAW,EASTERN REGION,RURAL_VICTORIA,-38.002811,146.153669,3825.0
294,T20060001419,-1,,UNKNOWN,,,,,,,...,,,,,,,,,,
297,T20060001426,-3,,UNKNOWN,,,,,,,...,,,,,,,,,,
300,T20060001431,-3,,UNKNOWN,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186611,T20190006335,-1,,UNKNOWN,,,,,,,...,,,,,,,,,,
188661,T20190010303,-1,,UNKNOWN,,,,,,,...,,,,,,,,,,
188820,T20190010646,-1,,UNKNOWN,,,,,,,...,,,,,,,,,,
192045,T20190016778,-1,,UNKNOWN,,,,,,,...,,,,,,,,,,
