# Car Crashes EDA Notebook

Author: Tammy Mims

Module 6 EDA Notebook Project

Date: 2026-02-11

Repository: https://github.com/tmims71-ctrl/datafun-06-eda

Purpose: Create a custom exploratory data analysis (EDA) project using GitHub, Jupyter, pandas, Seaborn, and other popular data analytics tools.


## Dataset Information
- Dataset: Car Crashes (Seaborn dataset)
- Description: U.S. state-level car crash statistics, including total crashes and factors such as speeding and alcohol.
- Source: Seaborn built-in datasets
- Access: Exported to data/car_crashes.csv

## Section 0. Intro to Jupyter Notebooks
- Run a cell with Ctrl+Enter (Cmd+Enter on Mac).
- Change cell type (Markdown or Code) in the notebook toolbar.
- Reorder cells by dragging them.
- Select the .venv kernel when working in VS Code.

## Section 0b. Intro to EDA
EDA helps you understand the structure, quality, and limits of a dataset before modeling or reporting.
Goals include checking scale, missing data, and relationships between variables.

## Section 1. Project Setup and Imports
All imports and configuration appear once, at the top of the notebook.

In [None]:
# Imports at the top of the file
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

print("Imports complete.")

## Section 2. Load the Data
Load the dataset and preview the first few rows.

In [None]:
car_df = pd.read_csv("../data/car_crashes.csv")
car_df.head()

## Section 3. Inspect Data Shape and Structure
Check rows/columns, data types, and column names.

In [None]:
shape = car_df.shape
print(f"The car_crashes dataset has {shape[0]} rows and {shape[1]} columns.")

In [None]:
car_df.info()

In [None]:
print("Column names:")
print(list(car_df.columns))

## Section 4. Data Quality Checks
Check missing values and duplicate rows.

In [None]:
print("Missing values per column:")
print(car_df.isnull().sum())

In [None]:
num_duplicates = car_df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")
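
As a minimal sketch of what these two checks report, the snippet below runs them on a small made-up frame. The `demo` data and its `S1`-style labels are hypothetical, not taken from the car crashes file. `duplicated(keep=False)` flags every copy of a repeated row, which makes the duplicates easy to inspect rather than just count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one repeated row.
demo = pd.DataFrame(
    {
        "abbrev": ["S1", "S2", "S2", "S3"],
        "total": [10.0, 12.5, 12.5, np.nan],
    }
)

# Count of missing values in each column.
missing_counts = demo.isnull().sum()

# Share of missing values per column, as a percentage.
missing_pct = demo.isnull().mean() * 100

# keep=False marks every member of a duplicate group, not only the later copies.
duplicate_rows = demo[demo.duplicated(keep=False)]

print(missing_counts)
print(missing_pct.round(1))
print(duplicate_rows)
```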

## Section 5. Create Clean View
Create a clean view by dropping rows with missing values (if any).

In [None]:
car_clean = car_df.dropna()
print(f"Original dataset: {len(car_df)} rows")
print(f"Clean dataset: {len(car_clean)} rows")
print(f"Rows removed: {len(car_df) - len(car_clean)}")
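
Called with no arguments, `dropna()` removes a row if any column is missing and returns a new DataFrame, leaving the original unchanged. A minimal sketch on hypothetical toy data (not the real dataset) shows this, along with the `subset` parameter for cases where only some columns matter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different columns.
demo = pd.DataFrame(
    {
        "total": [10.0, np.nan, 14.0, 20.0],
        "alcohol": [3.0, 4.0, np.nan, 6.0],
    }
)

# Default behavior: drop a row if ANY column is missing.
clean_any = demo.dropna()

# Only require the column under analysis to be present.
clean_total = demo.dropna(subset=["total"])

print(f"Original: {len(demo)} rows")
print(f"dropna(): {len(clean_any)} rows")
print(f"dropna(subset=['total']): {len(clean_total)} rows")
```

If the dataset turns out to have no missing values, the clean view simply has the same rows as the original.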

## Section 6. Descriptive Statistics
Summary statistics for numeric columns.

In [None]:
car_clean.describe()

In [None]:
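# Note: np.std defaults to ddof=0 (population std), so it differs slightly
# from the sample std (ddof=1) that describe() reports.
mean_total = np.mean(car_clean["total"])
std_total = np.std(car_clean["total"])
min_total = np.min(car_clean["total"])
max_total = np.max(car_clean["total"])
range_total = np.ptp(car_clean["total"])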

print("Total Crash Statistics (using numpy):")
print(f"  Mean: {mean_total:.2f}")
print(f"  Std Dev: {std_total:.2f}")
print(f"  Min: {min_total:.2f}")
print(f"  Max: {max_total:.2f}")
print(f"  Range: {range_total:.2f}")
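
One detail worth checking when comparing these numbers with `describe()`: `np.std` uses `ddof=0` (population standard deviation) by default, while pandas `describe()` and `Series.std()` use `ddof=1` (sample standard deviation). A small sketch with made-up values (hypothetical, not from the dataset) shows the gap and one way to align the two:

```python
import numpy as np
import pandas as pd

# Hypothetical values used only to illustrate the ddof difference.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values.to_numpy())       # ddof=0, the NumPy default
sample_std = values.std()                        # ddof=1, the pandas default
matched_std = np.std(values.to_numpy(), ddof=1)  # NumPy aligned with pandas

print(f"NumPy default (ddof=0):  {population_std:.4f}")
print(f"pandas default (ddof=1): {sample_std:.4f}")
print(f"NumPy with ddof=1:       {matched_std:.4f}")
```

For a table with only a few dozen rows the difference is small but visible, so it helps to state which version a reported value uses.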

## Section 7. Correlation Matrix
Explore relationships between numeric variables.

In [None]:
numeric_cols = car_clean.select_dtypes(include="number")
correlation_matrix = numeric_cols.corr()
correlation_matrix
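
`DataFrame.corr()` computes Pearson correlation by default, which measures linear association. As a hedged illustration on made-up data (the `demo` frame is hypothetical), a monotonic but non-linear relationship scores below 1 under Pearson and 1 under Spearman rank correlation, which is available through `method="spearman"`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: x_cubed always rises with x, but not in a straight line.
x = np.arange(1, 11, dtype=float)
demo = pd.DataFrame({"x": x, "x_cubed": x ** 3})

# Pearson (default) measures linear association.
pearson = demo.corr().loc["x", "x_cubed"]

# Spearman correlates the ranks, so any strictly increasing relationship scores 1.
spearman = demo.corr(method="spearman").loc["x", "x_cubed"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Comparing both matrices can hint at relationships that are consistent in direction but not linear.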

In [None]:
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Matrix Heatmap")
plt.show()

## Section 8. Make Plots
Visualize patterns with scatter plots and histograms.

In [None]:
sns.scatterplot(data=car_clean, x="speeding", y="total")
plt.title("Speeding vs Total Crashes")
plt.xlabel("Speeding")
plt.ylabel("Total")
plt.show()

In [None]:
sns.histplot(data=car_clean, x="alcohol", bins=15, kde=True)
plt.title("Alcohol-Related Crash Distribution")
plt.xlabel("Alcohol")
plt.ylabel("Count")
plt.show()

## Section 8b. Additional Charts
Pairplot, bar chart, and boxplot for quick distribution comparisons.

In [None]:
# Pairplot of numeric columns
numeric_only = car_clean.select_dtypes(include="number")
sns.pairplot(numeric_only, corner=True)
plt.show()

In [None]:
# Bar chart of total crashes by state abbreviation (top 15)
top_total = car_clean.sort_values("total", ascending=False).head(15)
sns.barplot(data=top_total, x="abbrev", y="total")
plt.title("Top 15 States by Total Crashes")
plt.xlabel("State")
plt.ylabel("Total")
plt.show()
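
The sort-then-`head` pattern used for the bar chart can also be written with `DataFrame.nlargest`. A minimal sketch on hypothetical toy data (the `S1`-style labels and values are made up) suggests the two select the same rows when there are no ties:

```python
import pandas as pd

# Hypothetical toy data with no tied totals.
demo = pd.DataFrame(
    {
        "abbrev": ["S1", "S2", "S3", "S4", "S5"],
        "total": [12.0, 23.9, 18.1, 5.9, 21.4],
    }
)

# Pattern used in the notebook: sort descending, then take the first rows.
top_sorted = demo.sort_values("total", ascending=False).head(3)

# Alternative: ask directly for the rows with the largest values.
top_nlargest = demo.nlargest(3, "total")

print(top_sorted)
print(top_nlargest)
```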

In [None]:
# Boxplot of numeric columns
melted = numeric_only.melt(var_name="metric", value_name="value")
plt.figure(figsize=(10, 5))
sns.boxplot(data=melted, x="metric", y="value")
plt.title("Distribution of Numeric Metrics")
plt.xticks(rotation=45)
plt.show()
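
Because the melted frame puts every metric on one y-axis, a column with much larger values can compress the other boxes. One option, sketched here on hypothetical toy data rather than the real file, is to z-score each column before melting so the boxes share a comparable scale. The column names `small_metric` and `large_metric` are placeholders.

```python
import pandas as pd

# Hypothetical columns on very different scales.
demo = pd.DataFrame(
    {
        "small_metric": [4.0, 5.0, 6.0, 7.0, 8.0],
        "large_metric": [700.0, 800.0, 900.0, 1000.0, 1100.0],
    }
)

# Z-score: subtract each column's mean and divide by its sample std.
standardized = (demo - demo.mean()) / demo.std()

# Long format, ready for a single boxplot call.
melted_std = standardized.melt(var_name="metric", value_name="z_score")

print(melted_std.groupby("metric")["z_score"].agg(["mean", "std"]).round(2))
```

The result can be passed to `sns.boxplot(data=melted_std, x="metric", y="z_score")` in place of `melted`. The trade-off is that the y-axis no longer shows original units, so the unscaled plot may still be worth keeping alongside it.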

## Section 9. Reminder
Before saving and pushing, run all cells to capture outputs.