# Multivariate EDA: Exploration of Relationships Among Variables

This section explores how several variables relate to and interact with one another simultaneously. It involves computing and interpreting a correlation matrix to understand linear relationships among the numerical variables, creating a heatmap to visualize patterns, and producing pairwise relationship plots to observe trends, distributions, and interactions across all selected variables.


# 4.1 Setup and Data Loading

In [None]:
# Setup and cleaned data loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

cleaned_df = pd.read_pickle("../data/cleaned/crash_2018_cleaned.pkl")

# 4.2 Correlation Matrix for all Numeric Variables

In [None]:
# Calculate the pairwise (Pearson) correlations among the numeric variables
numeric_vars = [
    "AADT",
    "Number of Lanes Num",
    "Number of Vehicles Num",
    "Speed Limit Num",
    "Impact Speed Num",
    "Driver Age",
    "Driver BAC",
]
correlation_matrix = cleaned_df[numeric_vars].corr()

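# Visualize the correlation matrix as a heatmap.
# vmin/vmax pin the color scale to [-1, 1] so that zero correlation sits at the
# neutral midpoint of the diverging "coolwarm" palette.
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, linewidths=0.5, square=True, cbar=True)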
plt.title("Correlation Matrix of Numerical Variables")
plt.grid(False)
plt.show()
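
One detail worth keeping in mind when reading the matrix: `DataFrame.corr()` excludes missing values pair by pair, so each coefficient can be based on a different number of rows. The sketch below counts the rows available for each pair. The helper name `pairwise_counts` and the toy data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def pairwise_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count the rows in which both variables of each pair are non-missing."""
    # 1 where a value is present, 0 where it is missing.
    present = df.notna().astype(int)
    # The matrix product sums, for each pair of columns, the rows where both are present.
    return present.T @ present


# Toy data (hypothetical) with a few missing values.
toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, np.nan, 4.0],
        "y": [2.0, np.nan, 6.0, 8.0],
        "z": [1.0, 1.5, 2.0, 2.5],
    }
)
counts = pairwise_counts(toy)
print(counts)
```

In this notebook the same helper could be applied as `pairwise_counts(cleaned_df[numeric_vars])`; pairs with small counts deserve more cautious interpretation.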

**Observations and Interpretation**

1. Strong correlations: No pair of variables shows a strong correlation, either positive or negative.

2. Moderate correlations:
   - A positive correlation between speed limit and impact speed (0.56). Crashes on roadways with higher speed limits tend to have higher impact speeds.
   - A positive correlation between AADT and number of lanes (0.53). Roadways with more lanes tend to carry more traffic.

3. Weak correlations:
   - A negative correlation between number of vehicles and impact speed (-0.38). Impact speed tends to decrease as the number of vehicles involved in a crash increases.
   - A very weak negative correlation between driver age and impact speed (-0.12).

4. Most of the remaining correlations are negligible, indicating little or no linear relationship between those variables.
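
The grouping above can also be produced programmatically, which helps when the list of variables grows. The following is a minimal sketch: the helper names `strength_label` and `rank_correlations` are hypothetical, the cut-offs (0.7, 0.4, 0.1) are illustrative assumptions rather than a standard, and the toy matrix uses made-up variable names.

```python
from itertools import combinations

import pandas as pd


def strength_label(r: float) -> str:
    """Label a correlation by magnitude using illustrative (assumed) cut-offs."""
    if pd.isna(r):
        return "undefined"
    a = abs(r)
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.1:
        return "weak"
    return "negligible"


def rank_correlations(corr: pd.DataFrame) -> pd.DataFrame:
    """List each unique variable pair once, sorted by absolute correlation."""
    # combinations() yields every unordered pair exactly once and skips the diagonal.
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["var_1", "var_2", "r"])
    pairs["strength"] = pairs["r"].map(strength_label)
    # Sort by magnitude so strong negative correlations rank alongside positive ones.
    pairs = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)
    return pairs.reset_index(drop=True)


# Toy correlation matrix (hypothetical values and names).
toy_corr = pd.DataFrame(
    [[1.0, 0.56, -0.05], [0.56, 1.0, -0.38], [-0.05, -0.38, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)
ranked = rank_correlations(toy_corr)
print(ranked)
```

In this notebook, `rank_correlations(correlation_matrix)` would give the same kind of table for the seven numeric variables; the cut-offs should be adjusted to whatever convention the analysis adopts.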

# 4.3 Pairplots to Visualize the Linear Relationships

In [None]:
# Pairplots of the numeric variables
sns.pairplot(cleaned_df[numeric_vars])
plt.suptitle("Pairplot of Numerical Variables", y=1.02)
plt.tight_layout()
plt.show()