## Preprocessing with PySpark on AWS EMR

The initial phase of this project involved preprocessing the US car accident dataset, which contains approximately 7.7 million records spanning February 2016 to March 2023 across 49 states. Given the dataset’s size, we ran the processing with PySpark on an AWS EMR (Elastic MapReduce) cluster. The preprocessing steps clean and transform the raw data into a format suitable for subsequent analysis and modeling, specifically for predicting accident duration and its impact on traffic flow.

The preprocessing workflow consisted of several steps. First, we loaded the dataset from a CSV file stored in an S3 bucket and converted the column names to lowercase for consistency. Unnecessary columns, such as `id`, `end_lat`, `end_lng`, and `wind_chill(f)`, were removed. Next, we filled missing values in critical columns like `precipitation(in)` and `wind_speed(mph)` with monthly averages, with the month taken from the accident start time. This way, gaps in weather-related data were filled with seasonally plausible values rather than left as nulls. Rows that still contained a null in any column were then dropped.
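
The month-based imputation can be illustrated in plain Python. This is a minimal sketch of the logic only, not the PySpark job itself: the helper name `fill_with_monthly_mean` and the sample rows are hypothetical, and it assumes that "monthly average" means the mean over all records that share the same calendar month.

```python
from collections import defaultdict
from datetime import datetime


def fill_with_monthly_mean(rows, column, time_column="start_time"):
    """Replace None in `column` with the mean of that column for the same calendar month."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate per-month sums and counts over the observed values.
    for row in rows:
        value = row[column]
        if value is not None:
            month = row[time_column].month
            sums[month] += value
            counts[month] += 1

    # Second pass: fill gaps where a monthly mean is available.
    filled = []
    for row in rows:
        new_row = dict(row)
        month = row[time_column].month
        if new_row[column] is None and counts[month] > 0:
            new_row[column] = sums[month] / counts[month]
        filled.append(new_row)
    return filled


# Hypothetical sample rows, for illustration only.
rows = [
    {"start_time": datetime(2020, 1, 5, 8, 0), "wind_speed(mph)": 10.0},
    {"start_time": datetime(2021, 1, 9, 17, 30), "wind_speed(mph)": 6.0},
    {"start_time": datetime(2020, 1, 20, 12, 0), "wind_speed(mph)": None},
    {"start_time": datetime(2020, 7, 1, 9, 0), "wind_speed(mph)": None},
]
filled = fill_with_monthly_mean(rows, "wind_speed(mph)")
```

In this sketch the July row stays null because no value was observed for that month; such rows would then fall under the null-dropping step described above.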

To prepare the dataset for time-based analysis, we processed the `start_time` and `end_time` columns by converting them to timestamps and calculating the accident duration in minutes. Additionally, we extracted temporal features—start hour, day, month, and year—from the `start_time` to enrich the dataset with time-related insights. These features are essential for understanding patterns in accident duration and traffic impact.
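
As a rough illustration of this step, the sketch below derives the duration and the temporal features for a single record using only the standard library. The timestamp format, the helper name `add_time_features`, and the choice of day-of-week numbering for `start_day` are assumptions made for the example; the actual job performed the equivalent operations on PySpark columns.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed input format


def add_time_features(record):
    """Return a copy of `record` with duration and start-time features added."""
    start = datetime.strptime(record["start_time"], TIME_FORMAT)
    end = datetime.strptime(record["end_time"], TIME_FORMAT)
    out = dict(record)
    out["accident_duration(min)"] = (end - start).total_seconds() / 60.0
    out["start_hour"] = start.hour
    out["start_day"] = start.isoweekday()  # assumption: 1 = Monday ... 7 = Sunday
    out["start_month"] = start.month
    out["start_year"] = start.year
    return out


# Hypothetical record, for illustration only.
record = {"start_time": "2016-02-08 05:46:00", "end_time": "2016-02-08 11:00:00"}
features = add_time_features(record)
```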

Outlier detection and removal were also critical steps to ensure the dataset’s reliability. We focused on key numerical columns—`distance(mi)`, `temperature(f)`, `pressure(in)`, `visibility(mi)`, `wind_speed(mph)`, `precipitation(in)`, and the newly computed `accident_duration(min)`—and computed whiskers (lower and upper bounds) using the interquartile range (IQR) method with approximate quantiles. Records falling outside these bounds were filtered out to eliminate extreme values that could skew the analysis.
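
The whisker computation can be sketched as follows. This is an illustrative plain-Python version with exact quantiles; the multiplier `k=1.5` is the conventional Tukey value and is an assumption here, as are the helper names and sample values. In PySpark, approximate quantiles are available through `DataFrame.approxQuantile`.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) whiskers: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only rows whose `column` value lies within the whiskers."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


# Hypothetical durations with one extreme value.
rows = [{"accident_duration(min)": v} for v in [10, 12, 13, 15, 16, 18, 20, 200]]
kept = drop_outliers(rows, "accident_duration(min)")
```

Applying the same function to each numerical column in turn mirrors the per-column filtering described above.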

Finally, the preprocessed dataset was saved as a Parquet file in an S3 bucket, coalesced into a single partition for simplicity. The process was logged throughout, with the log file uploaded alongside the output data for transparency and debugging purposes. This preprocessing effort reduced noise, enhanced data quality, and set the stage for accurate modeling of accident duration.


## Text Processing with Doc2Vec

Following the initial preprocessing with PySpark, the next step involved handling the textual data within the dataset, specifically the `description` column containing details about each accident. Given the large volume of data—approximately 7.7 million records—using advanced embedding models like OpenAI’s `text-embedding-ada-002` was deemed impractical due to computational and cost constraints. Instead, we adopted a traditional yet effective vectorization approach using `Doc2Vec`, a model from the Gensim library, to transform the text into numerical representations suitable for downstream modeling.

The text processing workflow began by loading the preprocessed dataset from a Parquet file generated by the PySpark job on AWS EMR. We first normalized the categorical features to ensure consistency across the dataset. The `visibility(mi)` column was dropped as it was deemed non-essential for this phase. Integer-based columns—such as `severity`, `start_hour`, `start_day`, `start_month`, and `start_year`—were converted to a categorical type, while object and boolean columns were similarly transformed. All categorical values were then standardized by converting them to lowercase and stripping whitespace.

The core of the text processing focused on the `description` column. We tokenized the text by converting it to lowercase, removing punctuation, and splitting it into individual words. Non-alphabetical tokens were filtered out, stopwords (common words like "the" or "and") were removed, and the remaining words were lemmatized, that is, reduced to their dictionary form (e.g., "running" to "run"). This resulted in a clean list of tokens for each accident description, ready for vectorization.
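
A minimal sketch of the tokenization logic is shown below. The stopword set is a tiny hypothetical stand-in for a full list, the sample sentence is invented, and lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a complete one.
STOPWORDS = {"the", "and", "on", "at", "to", "a", "of", "in"}


def tokenize(description):
    """Lowercase, strip punctuation, split, and drop non-alphabetic tokens and stopwords."""
    lowered = description.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)  # punctuation becomes whitespace
    words = no_punct.split()
    return [w for w in words if w.isalpha() and w not in STOPWORDS]


tokens = tokenize("Accident on the ramp to Exit 12. Two lanes blocked, expect delays!")
```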

To convert these tokens into numerical embeddings, we trained a `Doc2Vec` model with a vector size of 100, a window size of 5, and 20 epochs, using the distributed memory (DM) approach. The model was trained on tagged documents, where each tokenized description was paired with a unique identifier. Once trained, the model generated 100-dimensional vectors capturing the semantic meaning of each description. To reduce dimensionality and improve computational efficiency, we applied Principal Component Analysis (PCA) to these vectors, retaining the top three components. These PCA-derived features—labeled `description_pca1`, `description_pca2`, and `description_pca3`—were added to the dataset, while the original `description` and `tokens` columns were discarded.

The resulting dataset, now enriched with text embeddings, was saved as a Parquet file for use in subsequent modeling steps. This approach effectively transformed unstructured text into a structured format, enabling the inclusion of accident descriptions in the prediction of traffic flow impact. The process was logged throughout, with logs stored alongside the output for traceability.



## Feature Engineering

After the initial preprocessing and text processing phases, an additional round of feature engineering was needed to refine the dataset for modeling. This step transformed, grouped, and reduced the features in preparation for forecasting accident duration and its impact on traffic flow.

The feature engineering process began by loading the dataset from the Parquet file generated in the text processing step. We first streamlined the dataset by dropping a set of columns deemed redundant or irrelevant for the analysis. These included metadata like `weather_timestamp`, `airport_code`, `country`, and `source`, as well as location-specific details such as `street`, `city`, `county`, and `zipcode`. Additionally, binary road condition flags—such as `amenity`, `bump`, `crossing`, and `traffic_signal`—and twilight-related columns were removed to reduce dimensionality. Outliers in the `accident_duration(min)` column were also filtered out to ensure the target variable remained robust.

To capture the periodic nature of temporal and spatial features, we applied cyclic encoding using sine and cosine transformations. The `start_lng` (longitude) was encoded with a period of 360 degrees, while `start_hour` (24-hour cycle), `start_day` (7-day week), and `start_month` (12-month year) were similarly transformed. The original columns were then dropped, leaving pairs of sine and cosine features. This representation places values at the two ends of a cycle (for example, hour 23 and hour 0) next to each other, which a plain integer encoding does not. For the `wind_direction` column, textual values (e.g., "north", "east", "calm") were mapped to corresponding angles (in degrees), cyclically encoded with a 360-degree period, and then replaced with their sine and cosine components.
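
The encoding itself is a short formula: a value `v` with period `P` maps to `(sin(2πv/P), cos(2πv/P))`. The sketch below illustrates it; the helper names and the wind-direction angle table are hypothetical and cover only a few compass points.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic value to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def cyclic_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    sa, ca = cyclic_encode(a, period)
    sb, cb = cyclic_encode(b, period)
    return math.hypot(sa - sb, ca - cb)


# Hypothetical mapping from wind direction labels to degrees.
WIND_ANGLES = {"north": 0.0, "east": 90.0, "south": 180.0, "west": 270.0}

hour_sin, hour_cos = cyclic_encode(6, 24)                     # 06:00 on a 24-hour cycle
east_sin, east_cos = cyclic_encode(WIND_ANGLES["east"], 360)  # wind from the east
```

With this encoding, `cyclic_distance(23, 0, 24)` is much smaller than `cyclic_distance(12, 0, 24)`, matching the intuition that 23:00 is closer to midnight than noon is.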

Next, we grouped categorical features to simplify the dataset and enhance interpretability. The `state` column, representing the 49 states in the dataset, was categorized into `urban`, `rural`, or `unknown` based on a predefined list of urban and rural states. This reduced the granularity while preserving meaningful distinctions in traffic and accident patterns. Similarly, the `weather_condition` column, which contained a wide range of detailed weather descriptions, was mapped into four broader groups: `clear`, `cloudy`, `precipitation`, and `obscured`. Unmapped conditions were labeled as `unknown`. These groupings reduced noise and consolidated related conditions into more manageable categories.
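
The grouping amounts to a lookup with an `unknown` fallback, which might look like the sketch below. The state sets and the weather mapping are invented placeholders, since the actual predefined lists are not reproduced here; only the output group names come from the text above.

```python
# Hypothetical, incomplete lookup tables (values are lowercase, as in the dataset).
URBAN_STATES = {"ca", "ny", "nj"}
RURAL_STATES = {"mt", "wy", "sd"}

WEATHER_GROUPS = {
    "clear": "clear",
    "fair": "clear",
    "overcast": "cloudy",
    "mostly cloudy": "cloudy",
    "light rain": "precipitation",
    "snow": "precipitation",
    "fog": "obscured",
    "haze": "obscured",
}


def group_state(state):
    """Return 'urban', 'rural', or 'unknown' for a state code."""
    key = state.strip().lower()
    if key in URBAN_STATES:
        return "urban"
    if key in RURAL_STATES:
        return "rural"
    return "unknown"


def group_weather(condition):
    """Return the broad weather group, or 'unknown' if the condition is unmapped."""
    return WEATHER_GROUPS.get(condition.strip().lower(), "unknown")
```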

The final dataset, now transformed and enriched, was saved as a Parquet file. This process resulted in a cleaner, more focused set of features, with reduced dimensionality and improved representation of temporal, spatial, and environmental factors. The feature engineering phase ensured the data was well-prepared for modeling, balancing complexity and predictive potential. Logs were maintained throughout the process and stored alongside the output for reference.


## Modeling

With the dataset fully preprocessed and enriched through feature engineering, the next phase focused on training predictive models to forecast accident duration and assess its impact on traffic flow. To achieve this, we developed training pipelines for five distinct regression models: Linear Regression, Random Forest, XGBoost, CatBoost, and LightGBM. These pipelines were designed not only to train the models but also to optimize their hyperparameters, evaluate performance, and save the results—including metrics and visualizations—for further analysis.

The modeling process began by loading the final preprocessed dataset from a Parquet file. To accommodate potential computational constraints, we allowed for sampling a fraction of the data (defaulting to 100%), ensuring flexibility in experimentation. The dataset was split into features (`X`) and the target variable (`accident_duration(min)`), with an 80-20 train-test split applied to reserve 20% of the data for testing. Numerical features were standardized using a `StandardScaler`, while categorical features—such as `state_group` and `weather_group`—were one-hot encoded (dropping the first category to avoid multicollinearity). This preprocessing was integrated into a `ColumnTransformer` within each pipeline to ensure consistent data preparation.
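
To make the two transformations concrete, the sketch below reimplements them in plain Python: standardization with statistics fitted on the training values only, and one-hot encoding with the first category dropped. It is an illustration of what `StandardScaler` and a drop-first one-hot encoder compute, not the pipeline code; the helper names and sample values are hypothetical.

```python
import statistics


def fit_standardizer(train_values):
    """Return (mean, std) from training data; population std, as StandardScaler uses."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for constant columns
    return mean, std


def standardize(value, mean, std):
    return (value - mean) / std


def one_hot_drop_first(value, categories):
    """Encode `value` against sorted `categories`, omitting the first (reference) level."""
    return [1.0 if value == c else 0.0 for c in categories[1:]]


# Hypothetical training values and category levels.
mean, std = fit_standardizer([50.0, 60.0, 70.0])
scaled = standardize(70.0, mean, std)

weather_levels = sorted(["clear", "cloudy", "obscured", "precipitation"])
encoded_cloudy = one_hot_drop_first("cloudy", weather_levels)
encoded_clear = one_hot_drop_first("clear", weather_levels)  # reference level: all zeros
```

Fitting the scaler on the training split and reusing its mean and standard deviation for the test split is what keeps test-set information out of the preprocessing.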

For each model, we used `RandomizedSearchCV` to tune hyperparameters efficiently, performing a randomized search over predefined parameter distributions with 20 iterations and 5-fold cross-validation. The scoring metric was negative root mean squared error (RMSE): because scikit-learn maximizes scores, the RMSE is negated, and the search selects the configuration with the lowest cross-validated RMSE. The models and their tuned parameters were as follows:
- **Linear Regression**: No hyperparameters were tuned, serving as a baseline.
- **Random Forest**: Tuned `n_estimators` (100–500), `max_depth` (10–30 or None), and `min_samples_split` (2–10).
- **XGBoost**: Tuned `n_estimators` (100–300), `learning_rate` (0.01–0.1), and `max_depth` (4–10).
- **CatBoost**: Tuned `iterations` (100–500), `learning_rate` (0.01–0.1), and `depth` (4–10).
- **LightGBM**: Tuned `n_estimators` (100–300), `learning_rate` (0.01–0.1), and `max_depth` (6–12).
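
To illustrate what a randomized search does with ranges like these, the sketch below draws 20 candidate configurations from the XGBoost ranges listed above. It is a plain-Python illustration of the sampling idea rather than the `RandomizedSearchCV` call itself, and the uniform distributions, seed, and helper names are assumptions.

```python
import random

# Search space for XGBoost, using the ranges from the list above.
# The choice of uniform sampling is an assumption for this sketch.
XGB_SPACE = {
    "n_estimators": lambda rng: rng.randint(100, 300),
    "learning_rate": lambda rng: rng.uniform(0.01, 0.1),
    "max_depth": lambda rng: rng.randint(4, 10),
}


def sample_candidates(space, n_iter, seed=0):
    """Draw `n_iter` random configurations from the search space."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in space.items()} for _ in range(n_iter)]


def best_candidate(candidates, neg_rmse_scores):
    """Pick the candidate with the highest score (i.e. the lowest RMSE)."""
    best_index = max(range(len(candidates)), key=lambda i: neg_rmse_scores[i])
    return candidates[best_index]


candidates = sample_candidates(XGB_SPACE, n_iter=20)
```

In the real search, each candidate would be scored by its mean negative RMSE across the five cross-validation folds before the best one is selected.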

Each pipeline was executed individually, leveraging parallel processing to optimize computational efficiency. After training, the best-performing model (based on cross-validation) was evaluated on both the training and test sets. Key metrics included adjusted R² (accounting for the number of features) and RMSE, providing insights into model fit and prediction accuracy. These metrics were logged for each model, along with the fraction of data used, to facilitate comparison.
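
For reference, adjusted R² is computed as `1 - (1 - R²) * (n - 1) / (n - p - 1)`, where `n` is the number of samples and `p` the number of features. The sketch below implements both metrics in plain Python; the function names and sample values are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(y_true, y_pred, n_features):
    """R² penalized for the number of features."""
    n = len(y_true)
    return 1.0 - (1.0 - r2(y_true, y_pred)) * (n - 1) / (n - n_features - 1)


# Hypothetical durations (minutes) and predictions.
y_true = [10.0, 20.0, 30.0, 40.0, 50.0]
y_pred = [12.0, 18.0, 33.0, 37.0, 50.0]
```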

Beyond performance metrics, we analyzed feature importance and residuals to deepen our understanding of each model’s behavior. For Linear Regression, feature importance was derived from coefficient magnitudes and accompanying p-values (calculated via `statsmodels`), while for tree-based models (Random Forest, XGBoost, CatBoost, LightGBM), it was based on built-in feature importance scores. These results were saved as Excel files. Residual analysis included histograms, scatter plots against predicted values, and Q-Q plots to assess normality; the Durbin-Watson statistic was computed to check the residuals for autocorrelation. These visualizations were saved as PNG files for later review.
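
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A plain-Python sketch is given below for illustration; `statsmodels` also provides a `durbin_watson` function in `statsmodels.stats.stattools`, and the sample residuals here are invented.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences.
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # strong negative autocorrelation
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # strong positive autocorrelation
```

The statistic ranges from 0 to 4: values near 2 suggest no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.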

The trained pipelines, optimized with their best hyperparameters, were serialized and saved as `.pkl` files, named according to the model type and data fraction (e.g., `RandomForest-frac-1.0.pkl`). This ensured reproducibility and easy access for future predictions or analysis. Logs captured the entire process, including preparation, training, and evaluation steps, and were stored alongside the outputs.
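
A minimal sketch of the naming and serialization scheme is shown below. The helper names are hypothetical, and a plain dictionary stands in for the fitted pipeline so that the example runs without any modeling libraries.

```python
import pickle
import tempfile
from pathlib import Path


def model_filename(model_name, frac):
    """Build a file name such as 'RandomForest-frac-1.0.pkl'."""
    return f"{model_name}-frac-{frac}.pkl"


def save_pipeline(pipeline, directory, model_name, frac):
    """Pickle `pipeline` into `directory` and return the file path."""
    path = Path(directory) / model_filename(model_name, frac)
    with open(path, "wb") as fh:
        pickle.dump(pipeline, fh)
    return path


def load_pipeline(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)


# A dict stands in for a fitted pipeline in this illustration.
stand_in = {"model": "RandomForest", "best_params": {"n_estimators": 300}}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_pipeline(stand_in, tmp_dir, "RandomForest", 1.0)
    saved_name = saved_path.name
    restored = load_pipeline(saved_path)
```

Pickle files should only be loaded from trusted sources, and loading generally requires the same library versions that were used when saving.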

This modeling phase provided a robust framework for comparing multiple regression approaches, balancing simplicity (Linear Regression) with complexity (ensemble methods), and setting the stage for a detailed performance analysis in the next section.
