### Importing Libraries and Establishing Database Connection

In this section, we import the libraries needed for the analysis and prepare the database connection: we add the project directory to the module search path and import the custom function establecer_conexion from the src.db_conexion module.

In [None]:
import pandas as pd
import sys
import os
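# Add the project root (the parent directory) to the search path so that the
# 'src' package can be imported; assumes this notebook lives one level below it
sys.path.append(os.path.abspath('..'))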
from src.db_conexion import establecer_conexion

Conexion exitosa a la base de datos
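
The src.db_conexion module is not shown in this notebook. As a rough illustration of the shape such a function might have, here is a minimal sketch that returns a (conn, cursor) pair and prints the same success message. The use of SQLite, the `db_path` parameter, and the `demo_` names are assumptions made only to keep the example self-contained; the real function may target a different database and read its credentials from elsewhere.

```python
import sqlite3


def establecer_conexion(db_path=":memory:"):
    """Hypothetical stand-in for src.db_conexion.establecer_conexion.

    Opens a connection and returns it together with a cursor, mirroring
    the (conn, cursor) pair used in this notebook. SQLite is an assumption
    made only to keep the sketch runnable without a database server.
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    print("Conexion exitosa a la base de datos")
    return conn, cursor


# Quick check that the returned pair is usable
demo_conn, demo_cursor = establecer_conexion()
demo_cursor.execute("SELECT 1")
result = demo_cursor.fetchone()
demo_conn.close()
```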


Next, we define the variables needed to extract the data. The establecer_conexion function returns conn, the connection to the database, and cursor, which lets us iterate over query results. We then write an SQL query that selects the entire us_accidents table and load the result into a pandas DataFrame for analysis and visualization.

In [3]:
# Establish a connection and create a cursor
conn, cursor = establecer_conexion()  # Function to establish the database connection

# SQL query to select all data from the 'us_accidents' table
query = "SELECT * FROM us_accidents"

# Read the data into a pandas DataFrame
df = pd.read_sql_query(query, conn)


Conexion exitosa a la base de datos


  df = pd.read_sql_query(query, conn)
Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x7cb33367fd90>>
Traceback (most recent call last):
  File "/home/willyb/Documentos/Accidents_usa/venv/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 775, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(
KeyboardInterrupt: 
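
The KeyboardInterrupt above suggests the cell was stopped by hand, possibly because selecting the whole table takes a long time. If that is the case, one option is to read the result in chunks with the `chunksize` parameter of `pd.read_sql_query`. The sketch below uses a toy in-memory SQLite table as a stand-in for us_accidents; the table contents, the chunk size, and the `demo_` names are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# Toy in-memory table standing in for us_accidents (hypothetical data)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE us_accidents (id INTEGER, severity INTEGER)")
demo_conn.executemany(
    "INSERT INTO us_accidents VALUES (?, ?)",
    [(i, i % 4 + 1) for i in range(10)],
)
demo_conn.commit()

# With chunksize set, read_sql_query returns an iterator of DataFrames
# holding at most 4 rows each instead of a single large DataFrame
chunks = pd.read_sql_query("SELECT * FROM us_accidents", demo_conn, chunksize=4)

# Concatenate the chunks while the connection is still open
df_demo = pd.concat(chunks, ignore_index=True)
demo_conn.close()
```

Each chunk could also be filtered or aggregated before concatenation, which keeps peak memory lower than loading every row at once.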


Here we preview the first rows of the DataFrame to verify that the connection worked and the data was loaded correctly.

In [2]:
import pandas as pd

# Configure pandas to display more rows and columns
pd.set_option('display.max_rows', 100)  # Show up to 100 rows
pd.set_option('display.max_columns', None)  # Show all columns without truncation
pd.set_option('display.width', None)  # Automatically adjust display width to fit the content

# Display the first 20 rows in a tabular format
df.head(20)


NameError: name 'df' is not defined

Then we proceed with all the cleaning.

1. First, we dropped the columns we considered irrelevant: those that carried no useful information, were mostly null, or duplicated information already available in other columns.

2. For meteorological variables, such as temperature or wind speed, we imputed missing values from the existing values in each column: the mean for numerical columns, the mode (most frequent value) for categorical columns, and the previous value (forward fill) for timestamps.

3. Finally, we removed the rows that still contained null values, then counted the NaN values and empty strings remaining in each column to confirm that the dataset is cleaner and easier to read.
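
The "mostly null" criterion in step 1 can be checked numerically before deciding what to drop. The following is a minimal sketch on toy data; the values, the 50% threshold, and the `df_demo` name are assumptions for illustration, not figures from the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one mostly-empty column
df_demo = pd.DataFrame({
    "severity": [1, 2, 3, 4],
    "end_lat": [np.nan, np.nan, np.nan, 40.1],
    "temperature_f": [70.0, np.nan, 65.0, 60.0],
})

# Fraction of missing values per column, highest first
null_fraction = df_demo.isna().mean().sort_values(ascending=False)

# Columns above an assumed 50% threshold are candidates for dropping
mostly_null = null_fraction[null_fraction > 0.5].index.tolist()
print(mostly_null)
```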

In [None]:
# Columns to drop
columns_to_drop = ['id', 'source', 'country', 'description', 'end_lat', 'end_lng', 
                   'civil_twilight', 'nautical_twilight', 'astronomical_twilight']

# Drop the specified columns
df_cleaned = df.drop(columns=columns_to_drop)

# Impute missing values in numerical columns with the mean
# (assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated and may not modify df_cleaned)
df_cleaned['temperature_f'] = df_cleaned['temperature_f'].fillna(df_cleaned['temperature_f'].mean())

# Impute missing values in categorical columns with the mode (most frequent value)
df_cleaned['weather_condition'] = df_cleaned['weather_condition'].fillna(df_cleaned['weather_condition'].mode()[0])

# Impute missing values in multiple numerical columns with the mean
num_cols = ['wind_chill_f', 'humidity_percent', 'pressure_in', 'visibility_mi', 'wind_speed_mph', 'precipitation_in']
df_cleaned[num_cols] = df_cleaned[num_cols].apply(lambda col: col.fillna(col.mean()))

# Impute the 'wind_direction' column with the most frequent value (mode)
df_cleaned['wind_direction'] = df_cleaned['wind_direction'].fillna(df_cleaned['wind_direction'].mode()[0])

# Impute 'weather_timestamp' with the previous value (forward fill) for missing timestamps
# (ffill() replaces the deprecated fillna(method='ffill'))
df_cleaned['weather_timestamp'] = df_cleaned['weather_timestamp'].ffill()

# Remove rows containing any remaining missing values
df_cleaned.dropna(inplace=True)

# Count missing (NaN) values in each column
nan_counts = df_cleaned.isna().sum()

# Count empty strings ('') in each column
empty_counts = (df_cleaned == '').sum()

# Combine the counts into a single DataFrame for better visualization
null_summary = pd.DataFrame({
    'NaN Count': nan_counts,
    'Empty String Count': empty_counts,
    'Total Missing': nan_counts + empty_counts
})

# Display the summary of missing values
print(null_summary)

# Configure pandas to display more rows and columns if necessary
pd.set_option('display.max_rows', 10000)  # Show up to 10,000 rows
pd.set_option('display.max_columns', None)  # Display all columns

# Show the first 100 rows of the cleaned DataFrame in a tabular format
df_cleaned.head(100)
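
Note that `dropna()` only removes NaN values, so any empty strings reported in the summary would still be present in the data. One possible way to handle them, sketched below on toy data, is to convert empty strings to NaN before dropping rows. The values and the `df_demo` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mixing NaN/None and empty strings
df_demo = pd.DataFrame({
    "wind_direction": ["N", "", "SW", None],
    "temperature_f": [70.0, 65.0, np.nan, 60.0],
})

# Treat empty strings as missing so that dropna() handles both cases
df_demo = df_demo.replace("", np.nan)

# Keep only the rows with no missing value in any column
df_demo = df_demo.dropna().reset_index(drop=True)
```

Whether empty strings should count as missing depends on the column, so this conversion is best applied only where an empty value carries no meaning.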
