# Uploading Ohio Crash Statistics to a DuckDB Database

## Background

DuckDB is a minimalist, in-process database that may suit those who do not want the complexity of a cloud database such as Snowflake and do not need to share the database with many users. It is designed for analytical ("OLAP") queries. If your workload is transactional ("OLTP"), with frequent small inserts, updates, and deletes, DuckDB is not the best database to use.

**NOTE:** The zip files are saved in the `data` folder, which is NOT checked into GitLab.

#### Library Imports

In [1]:
import duckdb
import pandas as pd
import zipfile
pd.options.display.max_columns=250
pd.options.display.max_rows=1000

#### Unzip the zip file for the specified month located in the `data` directory

In [4]:
MONTH = '05'

In [5]:
with zipfile.ZipFile(f"data/OH_2022-{MONTH}.zip", "r") as zip_ref:
    # extract all the contents of the zip file to the specified directory
    zip_ref.extractall(f"data/OH_2022-{MONTH}")
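The extraction step above can be wrapped in a small helper that fails early when the archive is missing and reports what was extracted. This is a minimal sketch, not part of the original notebook: the function name `extract_month` and its `data_dir` parameter are hypothetical, and it assumes the archives follow the `OH_2022-<month>.zip` naming used above.

```python
import zipfile
from pathlib import Path


def extract_month(month: str, data_dir: str = "data") -> list:
    """Extract <data_dir>/OH_2022-<month>.zip into a sibling folder.

    Hypothetical helper: returns the names of the archive members.
    """
    archive = Path(data_dir) / f"OH_2022-{month}.zip"
    if not archive.is_file():
        raise FileNotFoundError(f"Missing archive: {archive}")

    # data/OH_2022-05.zip -> data/OH_2022-05
    target = archive.with_suffix("")
    with zipfile.ZipFile(archive, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(target)
    return names
```

Calling `extract_month('05')` would then replace the cell above, and the returned list makes it easy to confirm that `CrashStatistics.csv` is actually in the archive before reading it.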

#### Data month to upload

In [None]:
MONTH = '07'

#### Upload the cleaned data as a DuckDB table

In [None]:
with duckdb.connect(database='data/veh_crash_stats.duckdb', read_only=False) as con:
    df = pd.read_csv(f'data/OH_2022-{MONTH}/CrashStatistics.csv')

    # Parse the timestamp columns; values that do not match the format become NaT
    datetime_cols = ['CrashDateTime', 'CrashReportedDateTime', 'DispatchedDateTime',
                     'ArrivedDateTime', 'SceneClearedDateTime']
    for col in datetime_cols:
        df[col] = pd.to_datetime(df[col], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')

    # Remove stray tab characters (regex=False treats "\t" as a literal string)
    for col in ['LocalReportNumber', 'ReportingAgencyNCIC']:
        df[col] = df[col].str.replace("\t", "", regex=False)

    # Create the table from the DataFrame; this fails if crash_statistics already exists
    con.execute('CREATE TABLE crash_statistics AS SELECT * FROM df')
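Because `errors='coerce'` silently turns unparseable timestamps into `NaT`, it can help to check how many values were lost during cleaning. The sketch below applies the same date format and tab removal to a tiny, made-up DataFrame; the sample values are hypothetical and only two of the real CSV columns are shown.

```python
import pandas as pd

# Hypothetical sample rows; the real CSV has many more columns
sample = pd.DataFrame({
    "CrashDateTime": ["05/14/2022 03:27:00 PM", "not a date", None],
    "LocalReportNumber": ["\t22-0001", "22-0002\t", "22-0003"],
})

# Same 12-hour format string as the notebook; bad or missing values become NaT
sample["CrashDateTime"] = pd.to_datetime(
    sample["CrashDateTime"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# Strip literal tab characters from the identifier column
sample["LocalReportNumber"] = sample["LocalReportNumber"].str.replace(
    "\t", "", regex=False
)

# Count values that failed to parse (or were missing) so the loss is visible
unparsed = int(sample["CrashDateTime"].isna().sum())
print(sample)
print(f"{unparsed} CrashDateTime value(s) are NaT")
```

Running the same `isna().sum()` check on each timestamp column of the real data before loading would show whether the format string matches the file.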

#### Append new data to an existing table

In [6]:
MONTH = '05'

In [7]:
with duckdb.connect(database='data/veh_crash_stats.duckdb', read_only=False) as con:
    df = pd.read_csv(f'data/OH_2022-{MONTH}/CrashStatistics.csv')

    # Parse the timestamp columns; values that do not match the format become NaT
    datetime_cols = ['CrashDateTime', 'CrashReportedDateTime', 'DispatchedDateTime',
                     'ArrivedDateTime', 'SceneClearedDateTime']
    for col in datetime_cols:
        df[col] = pd.to_datetime(df[col], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')

    # Remove stray tab characters (regex=False treats "\t" as a literal string)
    for col in ['LocalReportNumber', 'ReportingAgencyNCIC']:
        df[col] = df[col].str.replace("\t", "", regex=False)

    # Append the rows to the existing table; crash_statistics must already exist
    con.append("crash_statistics", df)