## Exploratory Data Analysis
#### First step in any data science or machine learning project

Make sure your project folder contains a folder named "data", and that the data folder sits in the same folder as this notebook.  
Make sure your virtual environment is activated.  

python3 -m venv venv  
source venv/bin/activate   # macOS/Linux  
venv\Scripts\activate    # Windows  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Read in data
print("Loading US Accidents dataset...")
file_path = 'data/US_Accidents_March23.csv'
us_accidents = pd.read_csv(file_path)
print(f"Sample size: {len(us_accidents)}")

In [None]:
# Check dataset size
print(us_accidents.shape)
# Get column types 
print(us_accidents.info())
# View first few rows
print(us_accidents.head())
# Check for missing values
print(us_accidents.isna().sum())

In [None]:
# Group Practice
# TODO: 1. Find the last 5 rows

# TODO: 2. Find the mean value of "Distance(mi)"

# TODO: 3. Find the number of unique values in "Weather_Condition"

# TODO: 4. Find total number of accidents in MI

# TODO: 5. Choose one column that interests you, find some patterns/interesting facts, and share

## Basic Data Sampling
#### Sampling allows us to work with a smaller, representative subset while preserving key distributions

See this link for more data sampling techniques:  
https://www.qualtrics.com/experience-management/research/sampling-methods/

In [None]:
# Sampling goal: extract 50,000 rows whose Severity distribution stays close to that of the full dataset.
# Simple random sampling with DataFrame.sample() does not guarantee this, so we check the proportions afterwards.
us_accidents_sample = us_accidents.sample(n=50000, random_state=42) # random_state=42 ensures sample remains the same every time.
print(us_accidents_sample["Severity"].value_counts(normalize=True))


In [None]:
# TODO: Find the distribution of Severity of us_accidents
# Is the distribution of Severity of us_accidents_sample similar to that of us_accidents?
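
One way to make the comparison above concrete is to put both sets of proportions side by side. The following is a minimal sketch on synthetic data, not the real dataset: the frame `full`, its size, and the class probabilities are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for us_accidents (hypothetical sizes and class probabilities)
rng = np.random.default_rng(0)
full = pd.DataFrame(
    {"Severity": rng.choice([1, 2, 3, 4], size=10_000, p=[0.01, 0.80, 0.17, 0.02])}
)

# Simple random sample, as in the cell above
sample = full.sample(n=1_000, random_state=42)

# Align the two proportion tables on the Severity value
comparison = pd.DataFrame(
    {
        "full": full["Severity"].value_counts(normalize=True),
        "sample": sample["Severity"].value_counts(normalize=True),
    }
).fillna(0.0)
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()
print(comparison.sort_index())
```

Small values in `abs_diff` suggest the sample tracks the original distribution; rare classes are where the gap tends to show up first.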

In [None]:
# Random sampling may not preserve class proportions.
# TODO: Try stratified sampling to preserve class proportions with sklearn 
# and compare the distribution with the original dataset
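
As a starting point for the stratified sampling exercise, `train_test_split` from scikit-learn accepts a `stratify` argument that keeps class proportions nearly identical in the split. This sketch uses a synthetic frame (the name `full` and its class probabilities are assumptions) and keeps only the "train" part as the sample.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for us_accidents (hypothetical class probabilities)
rng = np.random.default_rng(0)
full = pd.DataFrame(
    {
        "Severity": rng.choice([1, 2, 3, 4], size=10_000, p=[0.01, 0.80, 0.17, 0.02]),
        "Distance(mi)": rng.exponential(scale=0.5, size=10_000),
    }
)

# Keep 1,000 rows, stratified on Severity; the remainder is discarded here
stratified_sample, _ = train_test_split(
    full, train_size=1_000, stratify=full["Severity"], random_state=42
)

full_dist = full["Severity"].value_counts(normalize=True).sort_index()
strat_dist = stratified_sample["Severity"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_dist, "stratified": strat_dist}))
```

On the real data the same call would take `us_accidents` and `train_size=50000`; whether that is the best approach for this dataset is left to the exercise.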

In [None]:
# More practice: try other sampling methods and compare the distributions of other features

## Data Cleaning & Preprocessing
#### Fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset
Ways to handle missing values: https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/#h-list-of-methods-to-handle-missing-values-in-a-dataset  
Handling missing values and outliers: https://medium.com/gen-ai-adventures/handling-missing-values-and-outliers-in-data-analysis-c1ffc2dd5051  
Normalization methods: https://medium.com/@mkc940/different-normalization-methods-a1be71fe9f1  


In [None]:
# Number of null values for each column in descending order
print(us_accidents_sample.isna().sum().sort_values(ascending=False))

In [None]:
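# Fill missing End_Lat and End_Lng with Start_Lat and Start_Lng
# Assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions warn about (chained assignment)
us_accidents_sample["End_Lat"] = us_accidents_sample["End_Lat"].fillna(us_accidents_sample["Start_Lat"])
us_accidents_sample["End_Lng"] = us_accidents_sample["End_Lng"].fillna(us_accidents_sample["Start_Lng"])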
# Drop irrelevant columns (e.g., ID if not useful for prediction).
us_accidents_sample.drop(columns=["ID"], inplace=True)

In [None]:
# TODO: Fill missing values for "Wind_Speed(mph)" and "Visibility(mi)" with the mean of their respective columns

In [None]:

# TODO: Process all columns with missing values with appropriate methods (e.g., mean, median, mode, or drop etc.)
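
The imputation exercises above can be approached column by column. Here is a small sketch on a toy frame (all values are made up): the mean for one numeric column, the median for another, and the mode for a categorical column. Which statistic suits which real column is a judgment call and is not decided here.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; column names follow the dataset
toy = pd.DataFrame(
    {
        "Wind_Speed(mph)": [5.0, np.nan, 10.0, 15.0],
        "Visibility(mi)": [10.0, 10.0, np.nan, 2.0],
        "Weather_Condition": ["Clear", None, "Rain", "Clear"],
    }
)

# Numeric columns: fill with the mean or the median (the median is less sensitive to outliers)
toy["Wind_Speed(mph)"] = toy["Wind_Speed(mph)"].fillna(toy["Wind_Speed(mph)"].mean())
toy["Visibility(mi)"] = toy["Visibility(mi)"].fillna(toy["Visibility(mi)"].median())

# Categorical column: fill with the most frequent value
most_common = toy["Weather_Condition"].mode()[0]
toy["Weather_Condition"] = toy["Weather_Condition"].fillna(most_common)

print(toy)
```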

In [None]:
# TODO Challenge: Apply KNN imputation for missing values of "Start_Lat" and "Start_Lng"
from sklearn.impute import KNNImputer
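
For the KNN challenge, `KNNImputer` fills a missing entry with the average of that feature over the nearest rows, where distance is computed on the features that are present. The sketch below uses a four-row toy frame with invented coordinates and `n_neighbors=2`; both are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy coordinates (hypothetical); row 2 is missing its latitude
toy = pd.DataFrame(
    {
        "Start_Lat": [42.0, 42.2, np.nan, 35.0],
        "Start_Lng": [-83.0, -83.2, -83.1, -90.0],
    }
)

# fit_transform returns a NumPy array, so rebuild the DataFrame afterwards
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
print(imputed)
```

Row 2 is closest in longitude to rows 0 and 1, so its latitude is filled with the average of their latitudes rather than being pulled toward the distant row 3.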

## Data Preprocessing
#### Preparing and cleaning the dataset to make it more suitable for machine learning algorithms


In [None]:
# Encode the categorical column "Sunrise_Sunset"; first inspect its unique values
print(us_accidents_sample["Sunrise_Sunset"].unique())

In [None]:
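# Binary (label) encoding: Day -> 0, Night -> 1; values not in the mapping (including NaN) become NaN
us_accidents_sample["Sunrise_Sunset"] = us_accidents_sample["Sunrise_Sunset"].map({"Day": 0, "Night": 1})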
print(us_accidents_sample["Sunrise_Sunset"].unique())

In [None]:
# TODO: Choose another categorical column and apply encoding
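
For a column with more than two categories, one-hot encoding is a common choice. This is a minimal sketch using `pd.get_dummies` on a toy frame; the category values shown are invented and the real "Weather_Condition" column has many more.

```python
import pandas as pd

# Toy data with hypothetical category values
toy = pd.DataFrame({"Weather_Condition": ["Clear", "Rain", "Clear", "Snow"]})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["Weather_Condition"], dtype=int)
print(encoded)
```

With many distinct categories this produces many columns, so grouping rare categories first may be worth considering.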

In [None]:
# Convert 'Start_Time' to datetime and extract the year
# Some timestamps carry fractional seconds; strip them so every row parses with the same format
start_time = us_accidents_sample['Start_Time'].str.split('.').str[0]
us_accidents_sample['Start_Time'] = pd.to_datetime(start_time, errors='coerce')
us_accidents_sample['Year'] = us_accidents_sample['Start_Time'].dt.year
us_accidents_sample.info()

In [None]:
# Simple plot of accidents by year
plt.figure(figsize=(10, 6))
us_accidents_sample['Year'].value_counts().sort_index().plot(kind='bar')
plt.title('Accidents by Year')
plt.xlabel('Year')
plt.ylabel('Number of Accidents')
plt.show()

In [None]:
# Does your plot show any interesting patterns or trends?
# Does your sampled data show similar patterns/trends as the original dataset?

In [None]:
# TODO: 1. Extract hour of the day from "Start_Time"
# TODO: 2. Create new feature: duration = (End_Time - Start_Time)
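
The two time-based features above can be sketched on a toy frame as follows. The timestamps are invented, and the new column names `Hour` and `Duration_min` are hypothetical choices; the duration is expressed in minutes here, which is one option among several.

```python
import pandas as pd

# Toy timestamps (hypothetical); one row carries fractional seconds like the raw data
toy = pd.DataFrame(
    {
        "Start_Time": ["2021-03-01 08:15:00", "2021-03-01 17:45:30.000000000"],
        "End_Time": ["2021-03-01 09:00:00", "2021-03-01 18:15:30.000000000"],
    }
)

# Strip fractional seconds, then parse both columns
for col in ["Start_Time", "End_Time"]:
    toy[col] = pd.to_datetime(toy[col].str.split(".").str[0], errors="coerce")

# Hour of day from the start time
toy["Hour"] = toy["Start_Time"].dt.hour

# Duration as a number of minutes (subtracting datetimes yields a timedelta)
toy["Duration_min"] = (toy["End_Time"] - toy["Start_Time"]).dt.total_seconds() / 60
print(toy)
```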

In [33]:
# TODO: 3. Try Min-Max Normalization on "Distance(mi)" 
from sklearn.preprocessing import MinMaxScaler

# TODO: 4. Try Standardization on "Precipitation(in)"
from sklearn.preprocessing import StandardScaler

Do NOT normalize or standardize categorical features!
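
For the two scaling exercises, both scikit-learn scalers expect a 2-D input, which is why the column is selected with double brackets. The sketch below runs on a toy frame with invented values; the output column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric data (hypothetical values)
toy = pd.DataFrame(
    {
        "Distance(mi)": [0.0, 0.5, 1.0, 2.0],
        "Precipitation(in)": [0.0, 0.1, 0.2, 0.3],
    }
)

# Min-max normalization: rescales values to the range [0, 1]
toy["Distance_minmax"] = MinMaxScaler().fit_transform(toy[["Distance(mi)"]]).ravel()

# Standardization: subtracts the mean and divides by the (population) standard deviation
toy["Precipitation_std"] = StandardScaler().fit_transform(toy[["Precipitation(in)"]]).ravel()
print(toy)
```

In a modeling workflow the scaler is usually fit on the training split only and then applied to the test split, so that test data does not leak into the scaling parameters.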