# Univariate EDA: Basic Analysis of Each Variable

Each variable is examined one at a time to get a basic understanding of its distribution, typical values, and outliers. The data comes from the cleaned data notebook.


# 2.1 Setup and Data Loading

In [None]:
# Setup and cleaned data loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

cleaned_df = pd.read_pickle("../data/cleaned/crash_2018_cleaned.pkl")

print(f"Crash Data shape: {cleaned_df.shape}")
print("\nFirst few rows:")
cleaned_df.head()  # first few rows of the crash data

In [None]:
cleaned_df.info()  # Basic info about variables

# 2.2 Identification and Grouping of the Variable Types

The variables of interest are grouped into continuous and categorical types for the analysis.
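As a quick sanity check on the grouping, the stored dtypes can be compared against the hand-picked lists. The sketch below is only illustrative: `split_by_dtype` and `toy_df` are hypothetical names, and the toy values are made up rather than taken from the crash data.

```python
import pandas as pd


def split_by_dtype(df):
    """Split column names into numeric and non-numeric groups based on dtype."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    other_cols = df.select_dtypes(exclude="number").columns.tolist()
    return numeric_cols, other_cols


# Hypothetical toy frame; in the notebook this would be cleaned_df.
toy_df = pd.DataFrame(
    {
        "AADT": [12000, 45000, 8000],
        "Driver Age": [23, 41, 67],
        "Crash Severity": ["PDO", "Injury", "PDO"],
    }
)

numeric_cols, other_cols = split_by_dtype(toy_df)
print("Numeric:", numeric_cols)
print("Non-numeric:", other_cols)
```

A numeric dtype does not by itself make a variable continuous (a lane count is numeric but discrete), so the manual lists remain the final word.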


## 2.2.1 Continuous Variables

In [None]:
# Continuous variables
continuous_vars = [
    "AADT",
    "Number of Lanes Num",
    "Number of Vehicles Num",
    "Speed Limit Num",
    "Impact Speed Num",
    "Driver Age",
    "Driver BAC",
]


# Distribution (histogram) of the continuous variables
for col in continuous_vars:
    plt.subplots(figsize=(10, 4))
    sns.histplot(cleaned_df[col], bins=20, alpha=1.0)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()

**Observations** 

1. Annual Average Daily Traffic (AADT) is right-skewed, with values concentrated on the left: most roadways carry between 5,000 and 60,000 vehicles per day, which is typical for roadway data.

2. Most crashes in Alabama in 2018 occurred on roadways with 2 or 4 lanes.

3. Impact speed is right-skewed: most crashes occurred at impact speeds between 5 mph and 45 mph.

4. The driver age distribution shows that most crashes involved drivers aged 20 to 50 years.

5. Driver Blood Alcohol Concentration (BAC), tested after the crash, was most concentrated between 0.1% and 0.2%. Most tested drivers were therefore intoxicated to some degree; the typical legal limit is 0.08%.
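The skew described in the observations above can also be checked numerically. This is a minimal sketch assuming pandas' sample skewness (`DataFrame.skew`) is an acceptable measure; `skewness_table` and the toy values are hypothetical and not part of the original analysis.

```python
import pandas as pd


def skewness_table(df, columns):
    """Return sample skewness per column, largest first.

    Positive values suggest a right-skewed distribution,
    values near zero suggest rough symmetry.
    """
    return df[columns].skew().sort_values(ascending=False)


# Hypothetical toy data: one long right tail, one symmetric column.
toy_df = pd.DataFrame(
    {
        "AADT": [5000, 6000, 7000, 8000, 60000],
        "Driver Age": [20, 30, 40, 50, 60],
    }
)

skew = skewness_table(toy_df, ["AADT", "Driver Age"])
print(skew)
```

In the notebook the same call would be `skewness_table(cleaned_df, continuous_vars)`.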

In [None]:
# Summary statistics
print("Continuous Variables Summary Statistics:")
cleaned_df[continuous_vars].describe().round(2)

**Determining Quartiles and Outliers**

In [None]:
# Box plot of continuous variables
for col in continuous_vars:
    plt.figure(figsize=(10, 6))

    # Overlay jittered points
    sns.stripplot(
        y=cleaned_df[col],
        color="skyblue",
        alpha=0.5,
        size=5,
        jitter=True,
    )
    # Boxplot (no outlier dots; the strip plot already shows every point)
    sns.boxplot(
        y=cleaned_df[col],
        width=0.3,
        showcaps=True,
        showfliers=False,
        boxprops=dict(facecolor="none", edgecolor="black", linewidth=1.5),
        whiskerprops=dict(linewidth=1.5),
        capprops=dict(linewidth=1.5),
        medianprops=dict(color="red", linewidth=2),
    )

    plt.grid(False)
    plt.title(f"{col}: Distribution with Outliers", fontsize=14)
    plt.tight_layout()
    plt.show()
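The box plots can be backed by numbers using the 1.5 × IQR rule, which is also the default whisker rule for box plots. The sketch below is one possible way to tabulate quartiles, fences, and outlier counts; `iqr_outlier_summary` and the toy data are hypothetical names and values.

```python
import pandas as pd


def iqr_outlier_summary(df, columns, k=1.5):
    """Tabulate quartiles, IQR fences, and outlier counts for each column."""
    rows = []
    for col in columns:
        values = df[col].dropna()
        q1 = values.quantile(0.25)
        q3 = values.quantile(0.75)
        iqr = q3 - q1
        lower = q1 - k * iqr
        upper = q3 + k * iqr
        # Count observations outside the fences.
        n_outliers = int(((values < lower) | (values > upper)).sum())
        rows.append(
            {
                "variable": col,
                "Q1": q1,
                "Q3": q3,
                "lower_fence": lower,
                "upper_fence": upper,
                "n_outliers": n_outliers,
            }
        )
    return pd.DataFrame(rows).set_index("variable")


# Hypothetical toy data with one extreme value.
toy_df = pd.DataFrame({"Impact Speed Num": [5, 10, 15, 20, 25, 30, 35, 120]})

summary = iqr_outlier_summary(toy_df, ["Impact Speed Num"])
print(summary)
```

In the notebook this would be called as `iqr_outlier_summary(cleaned_df, continuous_vars)`. Points flagged this way are candidates for review, not necessarily errors.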

**Interpretations**

## 2.2.2 Categorical Variables

In [None]:
# Categorical variables
categorical_vars = [
    "Crash Severity",
    "Weekend",
    "Time of Day",
    "County",
    "Area Type",
    "Functional Class Recode",
    "Vehicle Type Recode",
    "Driver Gender Recode",
    "Driver License Validity",
    "BAC Available",
    "Crash Manner Recode",
    "Visibility Obstruction Recode",
    "Lighting Conditions Recode",
]


# Categorical variables distribution
for col in categorical_vars:
    plt.figure(
        figsize=(10, 4),
    )
    cleaned_df[col].value_counts().plot(kind="bar")
    plt.title(f"{col} Counts")
    plt.xlabel(col)
    plt.ylabel("Counts (log scale)")
    plt.yscale("log")
    plt.tight_layout()
    plt.show()

    print(f"Counts for {col}:")
    print(cleaned_df[col].value_counts(), "\n")

**Observations**

1. Most crashes in 2018 resulted in Property Damage Only (PDO), with 126,719 crashes, compared with 878 fatal crashes.

2. The majority of crashes occurred on weekdays.

3. Rear-end and sideswipe/angle collisions accounted for most crashes.
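Raw counts are easier to compare across variables when paired with percentage shares. The following is a small sketch of that idea; `category_shares` is a hypothetical helper and the toy series is invented for illustration.

```python
import pandas as pd


def category_shares(series):
    """Return counts and percentage shares for each category in a series."""
    counts = series.value_counts(dropna=False)
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# Hypothetical toy data; in the notebook this would be cleaned_df["Crash Severity"].
toy_severity = pd.Series(
    ["PDO"] * 8 + ["Injury"] + ["Fatal"], name="Crash Severity"
)

shares = category_shares(toy_severity)
print(shares)
```

Passing `dropna=False` keeps missing values as their own row, so the percentages are shares of all records rather than only the non-missing ones.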