This project is a classification case study on traffic accident data. It covers data cleaning, feature engineering, and the training and hyperparameter tuning of four models (Baseline MLP, Tuned MLP, Bagging MLP, Voting Ensemble), which are then compared and ranked on accuracy, precision, recall, F1-score, and AUC.
- Prepare data (remove duplicates, clean the data, handle missing values, fix data types).
- Encode categorical variables: Label Encoding for ordered categories and One-Hot Encoding for unordered categories.
- Split the data into training and test sets (80%/20%) so that later steps cannot leak information from the test set.
- Perform feature selection on the training set, assess the importance of different features using a feature importance plot, then apply the same selected features to the test set.
- Perform feature scaling on the numerical features: fit the scaler on the training set, then transform the test set using the same parameters to avoid data leakage.
- Train the model using the processed training data.
- Fine-tune the model using RandomizedSearchCV.
- Explore two ensemble methods: Voting Classifier and Stacking Classifier.
- Evaluate the models using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
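
The split, scaling, training, and tuning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic (`make_classification` stands in for the cleaned, encoded accident data), `StandardScaler` is an assumed choice of scaler, the stratified split and the binary target are assumptions, and the search space in `param_distributions` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared accident data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 80%/20% split done first, so test rows never influence any fitted step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Inside a Pipeline the scaler is fit only on the data passed to fit()
# (the training folds during the search), then reused to transform the rest.
pipe = Pipeline([
    ("scale", StandardScaler()),  # assumed scaler
    ("mlp", MLPClassifier(max_iter=300, random_state=0)),
])

# Hypothetical search space; the project's actual ranges are not stated.
param_distributions = {
    "mlp__hidden_layer_sizes": [(16,), (32,), (32, 16)],
    "mlp__alpha": [1e-4, 1e-3, 1e-2],
    "mlp__learning_rate_init": [1e-3, 1e-2],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=4,        # number of sampled configurations
    cv=3,
    scoring="f1",
    random_state=0,
)
search.fit(X_train, y_train)

# score() applies the search's scoring metric (F1 here) to the held-out test set.
test_f1 = search.score(X_test, y_test)
print(search.best_params_, round(test_f1, 3))
```

Wrapping the scaler and the classifier in one pipeline means cross-validation inside the search also refits the scaler per fold, which keeps the tuning step free of leakage as well.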
Final 24 features selected:
- Police_Force_label
- Casualties_Per_Vehicle
- Number_of_Vehicles
- Day_Cos
- Day_Sin
- Hour_Cos
- Hour_Sin
- Month_Sin
- Speed_Category_Low_on
- Day_of_Week_label
- Geo_Cluster
- Vehicle_Type_label
- Road_Type_Single carriageway_on
- Month_Cos
- Light_Conditions_Daylight_on
- Weather_Conditions_Fine no high winds_on
- Year
- Junction_Control_Give way or uncontrolled_on
- Junction_Control_Data missing or out of range_on
- Carriageway_Hazards_IsNone_on
- Road_Surface_Conditions_Wet or damp_on
- Junction_Detail_Not at junction or within 20 metres_on
- Junction_Detail_Roundabout_on
- Light_Conditions_Darkness - no lighting_on
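
Feature names such as Hour_Sin/Hour_Cos, Day_Sin/Day_Cos, and Month_Sin/Month_Cos suggest a sine/cosine encoding of cyclical time values. The exact formulas used in the project are not stated, so the sketch below is an assumption: `cyclical_encode` is a hypothetical helper, and the period of 24 for hours is an assumed choice.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map a cyclical value (e.g. hour of day) onto the unit circle."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Hypothetical example values: hours of the day.
hours = np.array([0, 6, 12, 23])
hour_sin, hour_cos = cyclical_encode(hours, period=24)

# Euclidean distance between two encoded hours.
def encoded_distance(i, j):
    return np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])

# Hour 23 ends up close to hour 0, unlike in the raw integer encoding.
print(encoded_distance(3, 0), encoded_distance(2, 0))
```

With this kind of encoding, values at the two ends of a cycle (23:00 and 00:00, December and January) become neighbours in feature space, which a raw integer column would not express.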
The best-performing model is the Baseline MLP (an MLPClassifier).
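
A comparison of this kind could be computed along the following lines. This is a hedged sketch on synthetic data with only two of the four models, so its numbers say nothing about the project's actual results. The `make_mlp` helper, the hidden layer size, the soft-voting setup built from two differently seeded MLPs, and the choice to rank by F1 are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data as a stand-in for the accident dataset (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def make_mlp(seed):
    # Hypothetical helper: scaler + small MLP with a given random seed.
    return make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=seed),
    )

models = {
    "Baseline MLP": make_mlp(0),
    "Voting Ensemble": VotingClassifier(
        estimators=[("mlp_a", make_mlp(1)), ("mlp_b", make_mlp(2))],
        voting="soft",  # average predicted probabilities
    ),
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probability
    results.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
    })

# Rank models by F1, best first (the ranking metric is an assumption).
results.sort(key=lambda row: row["f1"], reverse=True)
for row in results:
    print(row)
```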