# Univariate EDA: Basic Analysis of Each Variable

Each variable is examined one at a time to get a basic understanding of its distribution, typical values, and outliers. The data comes from the cleaned-data notebook.


# 2.1 Setup and Data Loading

In [None]:
# Setup and cleaned data loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

cleaned_df = pd.read_pickle("../data/cleaned/crash_2018_cleaned.pkl")

print(f"Crash Data shape: {cleaned_df.shape}")
print("\nFirst few rows:")
cleaned_df.head()  # first few rows of the crash data

In [None]:
cleaned_df.info()  # Basic info about variables

# 2.2 Identification and Grouping of the Variable Types

The variables of interest are grouped into continuous and categorical types for the analysis.
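
As a quick sanity check on this grouping, the sketch below confirms that every column listed as continuous is stored with a numeric dtype and that no listed column is missing from the frame. The miniature `demo_df`, its values, and the `*_demo` lists are hypothetical stand-ins, not the crash data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical miniature frame; values are made up for illustration
demo_df = pd.DataFrame(
    {
        "AADT": [12000, 45000, 8000],
        "Driver Age": [19, 42, 67],
        "Area Type": ["Urban", "Rural", "Urban"],
    }
)
continuous_demo = ["AADT", "Driver Age"]
categorical_demo = ["Area Type"]

# Columns named in either list but absent from the frame
missing = [c for c in continuous_demo + categorical_demo if c not in demo_df.columns]

# Continuous columns that are not numeric (e.g. numbers stored as strings)
not_numeric = [
    c for c in continuous_demo
    if c in demo_df.columns and not is_numeric_dtype(demo_df[c])
]

print("Missing columns:", missing)
print("Continuous but non-numeric:", not_numeric)
```

In the notebook itself, the same check would presumably run against `cleaned_df` with the `continuous_vars` and `categorical_vars` lists defined in the cells that follow.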


## 2.2.1 Continuous Variables

In [None]:
# Continuous variables
continuous_vars = [
    "AADT",
    "Number of Lanes Num",
    "Number of Vehicles Num",
    "Speed Limit Num",
    "Impact Speed Num",
    "Driver Age",
    "Driver BAC",
]


# Distribution (histogram) of the continuous variables
for col in continuous_vars:
    plt.subplots(figsize=(10, 4))
    sns.histplot(cleaned_df[col], bins=20, alpha=1.0)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()

**Observations** 

1. Annual Average Daily Traffic (AADT) is right-skewed, with values concentrated on the left: most AADT values fall between 5,000 and 60,000, which is typical for roadway data.

2. Most crashes in Alabama in 2018 occurred on roadways with 2 or 4 lanes.

3. Impact speed is right-skewed: most crashes occurred at impact speeds between 5 mph and 45 mph.

4. The driver age distribution shows that young drivers (around 20 years old) are the most frequently involved in crashes.

5. For suspected DUI cases, driver Blood Alcohol Concentration (BAC) tested after the crash was most concentrated between 0.1% and 0.2%. Most tested drivers were intoxicated to some degree. The typical legal limit is 0.08%.
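
The skewness noted above is read off the histograms by eye; it can also be quantified. The sketch below is a minimal illustration on synthetic data (the `demo_df` columns and their values are hypothetical, not taken from the crash dataset). It uses pandas' `skew()` and the gap between mean and median as two simple indicators of a right tail.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the crash dataset)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "right_skewed": rng.lognormal(mean=9.5, sigma=0.8, size=5000),
        "symmetric": rng.normal(loc=40, scale=10, size=5000),
    }
)

# Sample skewness: clearly positive means a longer right tail,
# close to zero means a roughly symmetric distribution
skewness = demo_df.skew()

# In a right-skewed variable the mean sits above the median
mean_minus_median = demo_df.mean() - demo_df.median()

print(
    pd.DataFrame(
        {"skewness": skewness, "mean - median": mean_minus_median}
    ).round(2)
)
```

On the notebook's data, the equivalent call would presumably be `cleaned_df[continuous_vars].skew()`.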

In [None]:
# Summary statistics
print("Continuous Variables Summary Statistics:")
cleaned_df[continuous_vars].describe().round(2)

**Determining Quartiles and Outliers**

In [None]:
# Box plot of continuous variables
for col in continuous_vars:
    plt.figure(figsize=(10, 6))

    # Overlay jittered points
    sns.stripplot(
        y=cleaned_df[col],
        color="skyblue",
        alpha=0.5,
        size=5,
        jitter=True,
    )
    # Transparent boxplot on top (outlier fliers are still drawn by default)
    sns.boxplot(
        y=cleaned_df[col],
        width=0.3,
        showcaps=True,
        boxprops=dict(facecolor="none", edgecolor="black", linewidth=1.5),
        whiskerprops=dict(linewidth=1.5),
        capprops=dict(linewidth=1.5),
        medianprops=dict(color="red", linewidth=2),
    )

    plt.grid(False)
    plt.title(f"{col} — Distribution with Outliers", fontsize=14)
    plt.tight_layout()
    plt.show()

These plots look messy because there are too many points and some ranges are highly skewed (AADT in particular). To fix this, the strip plots below use a random sample of up to 2,000 points per variable, and AADT is shown on a log scale.

In [None]:
for col in continuous_vars:
    plt.figure(figsize=(8, 5))

    # drop NaNs first
    series = cleaned_df[col].dropna()
    # Sample 2000 random non-NaN rows (or all rows if fewer than 2000 exist)
    n_sample = min(2000, len(series))

    sample = series.sample(n=n_sample, random_state=0)

    sns.stripplot(
        x=sample,
        color="skyblue",
        alpha=0.3,
        size=3,
        jitter=0.2,
    )

    sns.boxplot(
        x=cleaned_df[col],
        width=0.3,
        showcaps=True,
        boxprops=dict(facecolor="none", edgecolor="black", linewidth=1.5),
        whiskerprops=dict(linewidth=1.5),
        capprops=dict(linewidth=1.5),
        medianprops=dict(color="red", linewidth=2),
    )

    if col == "AADT":
        plt.xscale("log")

    plt.title(f"{col} — Distribution with Outliers")
    plt.tight_layout()
    plt.show()
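
The box plots above flag as outliers any points lying more than 1.5 × IQR beyond the quartiles, which is the default whisker rule in seaborn and matplotlib. The same rule can be tabulated numerically. The sketch below is one possible way to do that; the helper name `iqr_outlier_summary` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_outlier_summary(series: pd.Series, k: float = 1.5) -> dict:
    """Quartiles, Tukey fences and outlier count for one numeric column."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Points outside the fences are what a box plot draws as fliers
    n_outliers = int(((s < lower) | (s > upper)).sum())
    return {
        "Q1": q1,
        "Q3": q3,
        "IQR": iqr,
        "lower_fence": lower,
        "upper_fence": upper,
        "n_outliers": n_outliers,
    }


# Hypothetical demo values with one obvious outlier and one missing value
demo = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100, np.nan])
result = iqr_outlier_summary(demo)
print(result)
```

Applied to the notebook's data, a table for all variables could be built with something like `pd.DataFrame({c: iqr_outlier_summary(cleaned_df[c]) for c in continuous_vars}).T`.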


### **Interpretations**

## 2.2.2 Categorical Variables

In [None]:
# Categorical variables
categorical_vars = [
    "Crash Severity",
    "Weekend",
    "Time of Day",
    "County",
    "Area Type",
    "Functional Class Recode",
    "Vehicle Type Recode",
    "Driver Gender Recode",
    "Driver License Validity",
    "BAC Available",
    "Crash Manner Recode",
    "Visibility Obstruction Recode",
    "Lighting Conditions Recode",
]


# Categorical variables distribution
for col in categorical_vars:
    print(f"\n\n--> Distribution of {col}\n")
    plt.figure(
        figsize=(10, 4),
    )
    cleaned_df[col].value_counts().plot(kind="bar")
    plt.title(f"{col} Counts")
    plt.xlabel(col)
    plt.ylabel("Counts (log scale)")
    plt.yscale("log")
    plt.tight_layout()
    plt.show()

    # Show counts and percentages, including missing values
    counts = cleaned_df[col].value_counts(dropna=False)
    perc = cleaned_df[col].value_counts(normalize=True, dropna=False) * 100
    summary = pd.DataFrame({"Count": counts, "Percent": perc.round(1)})
    
    print(f"\nCounts and percentages for {col}:")
    print(summary, "\n")

**Observations**

- Crash Severity: Most crashes are property damage only (PDO, 79%), with only 0.5% fatal and 3.3% serious.

- Weekend: Crashes happen more on weekdays (78%) than weekends (22%).

- Time of Day: About 73% of crashes occur in daytime and 27% at nighttime.

- County: Crashes are highly concentrated in a few counties, led by Jefferson (21.5%) and Mobile (10.1%).

- Area Type: Urban areas account for about 77% of crashes; rural areas about 23%.

- Functional Class Recode: Crashes are most common on Principal Arterials (27%) and Minor Arterials (18%); 15% have missing functional class.

- Vehicle Type Recode: Passenger cars dominate (49%), with pickups (17%) and “Other” (22%) also substantial; motorcycles are under 1%.

- Driver Gender Recode: About half of crashes involve male drivers (50%), 39% female, and 12% missing gender.

- Driver License Validity: Most involved drivers have valid licenses (78%), but about 13% do not and 9% are missing.

- BAC Available: BAC information is rarely available—only 1.2% of cases.

- Crash Manner Recode: Rear-end (37%) and sideswipe/angle (36%) crashes dominate; head-on is only 4.5%.

- Visibility Obstruction Recode: Visibility obstructions are coded in about 8% of crashes; most (88%) report no obstruction.

- Lighting Conditions Recode: Most crashes occur in daylight (72%); dark or night‑related conditions together account for around one‑quarter.
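
For variables with many levels, such as County, concentration is easier to judge from a cumulative share than from individual percentages. The sketch below adds a cumulative-percent column to the count table. The category labels and counts are hypothetical placeholders, not values from the crash data.

```python
import pandas as pd

# Hypothetical category labels and counts, for illustration only
demo = pd.Series(
    ["A"] * 50 + ["B"] * 25 + ["C"] * 15 + ["D"] * 7 + ["E"] * 3,
    name="category",
)

# value_counts() sorts from most to least frequent
counts = demo.value_counts()
share = counts / counts.sum() * 100

summary = pd.DataFrame(
    {
        "Count": counts,
        "Percent": share.round(1),
        # Running total: share covered by this level and all larger ones
        "Cumulative %": share.cumsum().round(1),
    }
)
print(summary)
```

In the notebook, replacing `demo` with `cleaned_df["County"]` would presumably show how many counties are needed to cover a given share of crashes.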