Skip to content

This is an academic project for the course Database Foundations. A python notebook created to generate INSERT SQL quries for each tables in the DB Schema. the python notebook does many of the data processing functions like duplicate handling and splitting the columns and generating datasets(.csv) for individual tables.

License

Notifications You must be signed in to change notification settings

rajadevineni/MySQL_Mongo_DB_for_COVID_Data

Repository files navigation

SQL-Mongo Project – Populations at Risk for COVID19

This project was undertaken as academic study to understand and apply the concepts of BUAN 6320: DataBase Foundations for Business Analytics. The data set chosen by the course instructor and the analysis done as part of this project is not representative of the overall statistics.

About:

The 3,142 counties of the United States span a diverse range of social, economic, health, and weather conditions. Because of the COVID19 pandemic, over 2,400 of these counties have already experienced some COVID19 cases.

Combining county-level data on health, socioeconomics, and weather can help us address identify which populations are at risk for COVID19 and help prepare high-risk communities.

Temperature and humidity may affect the transmissibility of COVID19, but in the United States, warmer regions also tend to have markedly different socioeconomic and health demographics. As such, it's important to be able to control for factors like obesity, diabetes, access to healthcare, and poverty rates, since these factors themselves likely play a role in COVID19 transmission and fatality rates.

The dataset has 3 CSV files and total of 415 columns:

  1. US_counties_COVID19_health_weather_data.csv
  2. us_county_sociohealth_data.csv
  3. us_county_geometry.csv

The dataset provides all of this information, formatted, cleaned, and ready for analysis. Most columns have little or no missing data. A small number have larger amounts of missing data; see the kernel that generated this dataset for more details:(https://www.kaggle.com/johnjdavisiv/us-counties-weather-health-covid19-data)

Thanks to John Davis for the dataset!

Note:

The project is in progress and is subject to change technically. All the documentation will be updated as the project progress.

DataBase Creation:

Schema Design:

415 columns over the 3 source files are divided into 6 tables mentioned below.

Tables:

  1. county_details
  2. State_lockdown_details
  3. fips_daily_cases
  4. daily_weather_details
  5. county_socio_health
  6. station_details

Entity relations:

Schema Design

INSERT_SQL queries Generation:

sql_insert_generator.ipynb python notebook takes input of 3 CSV files and generates INSERT SQL quries for each tables designed in DB Schema. the python notebook does many of the data processing functions like duplicate handling and splitting the columns and generating datasets for individual tables.

Data Loading:

Due to the foreign key dependencies between the tables, run the SQL files in the below order only. You can find the SQL queries here

  1. state_lockdown_details.sql
  2. county_details.sql
  3. station_details.sql
  4. fips_daily_cases.sql
  5. daily_weather_details.sql
  6. county_socio_health.sql

Validation:

  • After Creating the DB and loading the data, "Validation_Quries.sql" can be used to check if you have loaded the complete data as expected.
  • Each query has the count commented below to check the data count after loading into the table.

About

This is an academic project for the course Database Foundations. A python notebook created to generate INSERT SQL quries for each tables in the DB Schema. the python notebook does many of the data processing functions like duplicate handling and splitting the columns and generating datasets(.csv) for individual tables.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published