### Importing Libraries and Establishing Database Connection

In this section, we import the libraries needed for the analysis and prepare the database connection. We append the project's src directory to sys.path and then import the custom functions establish_connection and close_connection from the db_conexion module.

In [5]:
import pandas as pd
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..', 'src')))
from db_conexion import establish_connection, close_connection


Successful database connection


Next we define the variables used to extract the data from the database. The conn variable holds the open database connection, cursor lets us iterate over the results of a query, and query holds the SQL statement that selects the entire us_accidents table.

We call the establish_connection function to obtain conn and cursor, run the query, and load the result into a pandas DataFrame for the analysis and visualization that follow.

In [6]:
# Establish a connection and create a cursor
conn, cursor = establish_connection()  # Function to establish the database connection

# SQL query to select all data from the 'us_accidents' table
query = "SELECT * FROM us_accidents"

# Read the data into a pandas DataFrame
df = pd.read_sql_query(query, conn)


Successful database connection
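The db_conexion module itself is not shown in this notebook. To illustrate the roles of conn, cursor and query, here is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database. The two helper functions below are hypothetical stand-ins, not the project's actual implementation, and the table rows are made up.

```python
import sqlite3


def establish_connection(db_path=":memory:"):
    """Hypothetical stand-in: open a SQLite database and return (conn, cursor)."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    print("Successful database connection")
    return conn, cursor


def close_connection(conn, cursor):
    """Hypothetical stand-in: release the cursor, then the connection."""
    cursor.close()
    conn.close()


conn, cursor = establish_connection()

# Build a tiny stand-in for the us_accidents table (made-up rows).
cursor.execute("CREATE TABLE us_accidents (id TEXT, severity REAL, city TEXT)")
cursor.executemany(
    "INSERT INTO us_accidents VALUES (?, ?, ?)",
    [("A-1", 2.0, "Zachary"), ("A-2", 3.0, "Sterling")],
)
conn.commit()

# The cursor executes the query and exposes the result set.
query = "SELECT * FROM us_accidents"
cursor.execute(query)
column_names = [col[0] for col in cursor.description]  # column names of the result
rows = cursor.fetchall()                               # all rows as tuples

close_connection(conn, cursor)

print(column_names)
print(len(rows), "rows fetched")
```

In the notebook, pd.read_sql_query(query, conn) performs the execute-and-fetch step and builds the DataFrame in a single call, so the cursor is not used directly.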


Here we preview the first rows of the DataFrame to confirm that the connection worked and the data loaded correctly.

In [8]:
import pandas as pd

# Configure pandas to display more rows and columns
pd.set_option('display.max_rows', 100)  # Show up to 100 rows
pd.set_option('display.max_columns', None)  # Show all columns without truncation
pd.set_option('display.width', None)  # Automatically adjust display width to fit the content

# Display the first 5 rows in a tabular format
df.head(5)


Unnamed: 0,id,source,severity,start_time,end_time,start_lat,start_lng,end_lat,end_lng,distance_mi,description,street,city,county,state,zipcode,country,timezone,airport_code,weather_timestamp,temperature_f,wind_chill_f,humidity_percent,pressure_in,visibility_mi,wind_direction,wind_speed_mph,precipitation_in,weather_condition,amenity,bump,crossing,give_way,junction,no_exit,railway,roundabout,station,stop,traffic_calming,traffic_signal,turning_loop,sunrise_sunset,civil_twilight,nautical_twilight,astronomical_twilight
0,A-2047758,Source2,2.0,2019-06-12 10:10:56,2019-06-12 10:55:58,30.641211,-91.153481,,,0.0,Accident on LA-19 Baker-Zachary Hwy at Lower Z...,Highway 19,Zachary,East Baton Rouge,LA,70791-4610,US,US/Central,KBTR,2019-06-12 09:53:00,77.0,77.0,62.0,29.92,10.0,NW,5.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day,Day,Day,Day
1,A-4694324,Source1,2.0,2022-12-03 23:37:14,2022-12-04 01:56:53,38.990562,-77.39907,38.990037,-77.398282,0.056,Incident on FOREST RIDGE DR near PEPPERIDGE PL...,Forest Ridge Dr,Sterling,Loudoun,VA,20164-2813,US,US/Eastern,KIAD,2022-12-03 23:52:00,45.0,43.0,48.0,29.91,10.0,W,5.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Night,Night,Night
2,A-5006183,Source1,2.0,2022-08-20 13:13:00,2022-08-20 15:22:45,34.661189,-120.492822,34.661189,-120.492442,0.022,Accident on W Central Ave from Floradale Ave t...,Floradale Ave,Lompoc,Santa Barbara,CA,93436,US,US/Pacific,KLPC,2022-08-20 12:56:00,68.0,68.0,73.0,29.79,10.0,W,13.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day,Day,Day,Day
3,A-4237356,Source1,2.0,2022-02-21 17:43:04,2022-02-21 19:43:23,43.680592,-92.993317,43.680574,-92.972223,1.054,Incident on I-90 EB near REST AREA Drive with ...,14th St NW,Austin,Mower,MN,55912,US,US/Central,KAUM,2022-02-21 17:35:00,27.0,15.0,86.0,28.49,10.0,ENE,15.0,0.0,Wintry Mix,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,Day,Day,Day
4,A-6690583,Source1,2.0,2020-12-04 01:46:00,2020-12-04 04:13:09,35.395484,-118.985176,35.395476,-118.985995,0.046,RP ADV THEY LOCATED SUSP VEH OF 20002 - 726 CR...,River Blvd,Bakersfield,Kern,CA,93305-2649,US,US/Pacific,KBFL,2020-12-04 01:54:00,42.0,42.0,34.0,29.77,10.0,CALM,0.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Night,Night,Night
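The pd.set_option calls above change the display settings for the rest of the session. If the wider display is only needed for a single preview, pd.option_context may be a convenient alternative, because it restores the previous settings when the block ends. A minimal sketch, using a made-up 30-column frame in place of the accident data:

```python
import pandas as pd

# Made-up wide frame standing in for the accident data.
wide = pd.DataFrame({f"col_{i}": [i] for i in range(30)})

default_max_columns = pd.get_option("display.max_columns")

# The options apply only inside the with-block.
with pd.option_context("display.max_columns", None, "display.width", None):
    inside_max_columns = pd.get_option("display.max_columns")
    print(wide.head())

# Outside the block, the previous setting is back in effect.
after_max_columns = pd.get_option("display.max_columns")
```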


Next we clean the data.

1. First, we drop the columns we judged unnecessary: those that carry no relevant information, are mostly null, or duplicate information already available in other columns.

2. For the meteorological variables, such as temperature or wind speed, we impute missing values from the existing values in each column: the mean for numerical columns, the mode (the most frequent value) for categorical columns, and the previous value for the weather timestamp.

3. Finally, we drop any rows that still contain null values, then count the null values and empty strings remaining in each column to confirm that the dataset is cleaner and more readable.
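The three steps above can be sketched on a small made-up frame before applying them to the full table. The column names mirror the dataset, but the values are illustrative only. The sketch assigns each filled column back to the frame and uses .ffill(), since newer pandas versions deprecate fillna(method='ffill').

```python
import numpy as np
import pandas as pd

# Made-up toy data with one gap per column.
toy = pd.DataFrame({
    "temperature_f": [70.0, np.nan, 50.0],
    "wind_direction": ["NW", "NW", None],
    "weather_timestamp": ["2019-06-12 09:53:00", None, "2019-06-12 11:53:00"],
    "city": ["Zachary", "Sterling", None],
})

# Numerical column: fill with the column mean.
toy["temperature_f"] = toy["temperature_f"].fillna(toy["temperature_f"].mean())

# Categorical column: fill with the mode (most frequent value).
toy["wind_direction"] = toy["wind_direction"].fillna(toy["wind_direction"].mode()[0])

# Timestamp column: carry the previous value forward.
toy["weather_timestamp"] = toy["weather_timestamp"].ffill()

# Drop rows that still have a missing value (here, the row without a city).
toy = toy.dropna()

# Count what is left: NaN values and empty strings per column.
nan_counts = toy.isna().sum()
empty_counts = (toy == "").sum()
summary = pd.DataFrame({
    "NaN Count": nan_counts,
    "Empty String Count": empty_counts,
    "Total Missing": nan_counts + empty_counts,
})
print(summary)
```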

In [10]:
# Columns to drop
columns_to_drop = ['id', 'source', 'country', 'description', 'end_lat', 'end_lng', 
                   'civil_twilight', 'nautical_twilight', 'astronomical_twilight']

# Drop the specified columns
df_cleaned = df.drop(columns=columns_to_drop)

# Impute missing values in numerical columns with the mean
df_cleaned['temperature_f'].fillna(df_cleaned['temperature_f'].mean(), inplace=True)

# Impute missing values in categorical columns with the mode (most frequent value)
df_cleaned['weather_condition'].fillna(df_cleaned['weather_condition'].mode()[0], inplace=True)

# Impute missing values in multiple numerical columns with the mean
num_cols = ['wind_chill_f', 'humidity_percent', 'pressure_in', 'visibility_mi', 'wind_speed_mph', 'precipitation_in']
df_cleaned[num_cols] = df_cleaned[num_cols].apply(lambda col: col.fillna(col.mean()))

# Impute the 'wind_direction' column with the most frequent value (mode)
df_cleaned['wind_direction'] = df_cleaned['wind_direction'].fillna(df_cleaned['wind_direction'].mode()[0])

# Impute 'weather_timestamp' with the previous value (forward fill) for missing timestamps
df_cleaned['weather_timestamp'] = df_cleaned['weather_timestamp'].fillna(method='ffill')

# Remove rows containing any remaining missing values
df_cleaned.dropna(inplace=True)

# Count missing (NaN) values in each column
nan_counts = df_cleaned.isna().sum()

# Count empty strings ('') in each column
empty_counts = (df_cleaned == '').sum()

# Combine the counts into a single DataFrame for better visualization
null_summary = pd.DataFrame({
    'NaN Count': nan_counts,
    'Empty String Count': empty_counts,
    'Total Missing': nan_counts + empty_counts
})

# Display the summary of missing values
print(null_summary)

# Configure pandas to display more rows and columns if necessary
pd.set_option('display.max_rows', 10000)  # Show up to 10,000 rows
pd.set_option('display.max_columns', None)  # Display all columns

# Show the first 5 rows of the cleaned DataFrame in a tabular format
df_cleaned.head(5)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_cleaned['temperature_f'].fillna(df_cleaned['temperature_f'].mean(), inplace=True)
  df_cleaned['weather_condition'].fillna(df_cleaned['weather_condition'].mode()[0], inplace=True)
  df_cleaned['weather_timestamp'] = df_cleaned['wea

                   NaN Count  Empty String Count  Total Missing
severity                   0                   0              0
start_time                 0                   0              0
end_time                   0                   0              0
start_lat                  0                   0              0
start_lng                  0                   0              0
distance_mi                0                   0              0
street                     0                   0              0
city                       0                   0              0
county                     0                   0              0
state                      0                   0              0
zipcode                    0                   0              0
timezone                   0                   0              0
airport_code               0                   0              0
weather_timestamp          0                   0              0
temperature_f              0            

Unnamed: 0,severity,start_time,end_time,start_lat,start_lng,distance_mi,street,city,county,state,zipcode,timezone,airport_code,weather_timestamp,temperature_f,wind_chill_f,humidity_percent,pressure_in,visibility_mi,wind_direction,wind_speed_mph,precipitation_in,weather_condition,amenity,bump,crossing,give_way,junction,no_exit,railway,roundabout,station,stop,traffic_calming,traffic_signal,turning_loop,sunrise_sunset
0,2.0,2019-06-12 10:10:56,2019-06-12 10:55:58,30.641211,-91.153481,0.0,Highway 19,Zachary,East Baton Rouge,LA,70791-4610,US/Central,KBTR,2019-06-12 09:53:00,77.0,77.0,62.0,29.92,10.0,NW,5.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day
1,2.0,2022-12-03 23:37:14,2022-12-04 01:56:53,38.990562,-77.39907,0.056,Forest Ridge Dr,Sterling,Loudoun,VA,20164-2813,US/Eastern,KIAD,2022-12-03 23:52:00,45.0,43.0,48.0,29.91,10.0,W,5.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
2,2.0,2022-08-20 13:13:00,2022-08-20 15:22:45,34.661189,-120.492822,0.022,Floradale Ave,Lompoc,Santa Barbara,CA,93436,US/Pacific,KLPC,2022-08-20 12:56:00,68.0,68.0,73.0,29.79,10.0,W,13.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day
3,2.0,2022-02-21 17:43:04,2022-02-21 19:43:23,43.680592,-92.993317,1.054,14th St NW,Austin,Mower,MN,55912,US/Central,KAUM,2022-02-21 17:35:00,27.0,15.0,86.0,28.49,10.0,ENE,15.0,0.0,Wintry Mix,False,False,False,False,False,False,False,False,False,False,False,False,False,Day
4,2.0,2020-12-04 01:46:00,2020-12-04 04:13:09,35.395484,-118.985176,0.046,River Blvd,Bakersfield,Kern,CA,93305-2649,US/Pacific,KBFL,2020-12-04 01:54:00,42.0,42.0,34.0,29.77,10.0,CALM,0.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
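The warning text in the output above comes from calls of the form df_cleaned[col].fillna(value, inplace=True), which pandas reports as chained in-place assignment. As the message itself suggests, one way to avoid it is to call fillna on the DataFrame with a column-to-value mapping. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the query result):

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value per column.
toy = pd.DataFrame({
    "temperature_f": [77.0, np.nan, 45.0],
    "weather_condition": ["Fair", None, "Fair"],
})

# One fill value per column: mean for the numerical column, mode for the categorical one.
fill_values = {
    "temperature_f": toy["temperature_f"].mean(),
    "weather_condition": toy["weather_condition"].mode()[0],
}

# fillna on the DataFrame returns a new frame; no chained in-place call is involved.
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```

Assigning the result to a new name also leaves the original frame untouched, which makes it easy to compare the data before and after imputation.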
