# Cases Exploration and Error Finding 
The case data is sourced from the [CDC](https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36),
and this notebook explores it to get a general sense of what the dataset contains.

__It has 2 main goals__
- Find out whether any dates are missing for each state, across that state's full date range.
- Confirm that the start and end dates are the same for every state.
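
Both checks can be tried on a toy frame before running them on the full file. This is only a minimal sketch: the data and the helper `find_missing_dates` are made up for illustration, and only the column names `submission_date` and `state` match the CDC file.

```python
import pandas as pd

# Toy data (hypothetical): state "AA" has no row for 2021-01-03, "BB" is complete.
toy_df = pd.DataFrame({
    "submission_date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-04",
         "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
    ),
    "state": ["AA", "AA", "AA", "BB", "BB", "BB", "BB"],
})

def find_missing_dates(df, state):
    """Return the days between a state's first and last row that have no row."""
    dates = pd.DatetimeIndex(df.loc[df["state"] == state, "submission_date"])
    full_range = pd.date_range(start=dates.min(), end=dates.max(), freq="D")
    return full_range.difference(dates)

# Goal 1: missing dates per state.
missing_by_state = {s: find_missing_dates(toy_df, s) for s in toy_df["state"].unique()}

# Goal 2: every state should share one start date and one end date.
bounds = toy_df.groupby("state")["submission_date"].agg(["min", "max"])
same_start = bounds["min"].nunique() == 1
same_end = bounds["max"].nunique() == 1
```

Note that the first check only finds gaps inside a state's own range. A state that starts late or stops early would show no gaps, which is why the second check is needed as well.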



<br/>
<br/>
<br/>
<br/>

_This notebook is a slice of life type thing and will probably save me a lot of time in the future_



In [73]:
import pandas as pd
import json
from collections import Counter

In [74]:
case_df = pd.read_csv('../dataset/case_1121_cdc.csv', parse_dates=['submission_date'])

## Making a report of the date ranges and missing dates for COVID-19 cases

In [75]:
case_df.head()

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
0,2021-02-12,UT,359641,359641.0,0.0,1060,0.0,1785,1729.0,56.0,11,2.0,02/13/2021 02:50:08 PM,Agree,Agree
1,2021-03-01,CO,438745,411869.0,26876.0,677,60.0,5952,5218.0,734.0,1,0.0,03/01/2021 12:00:00 AM,Agree,Agree
2,2020-08-22,AR,56199,,,547,0.0,674,,,11,0.0,08/23/2020 02:15:28 PM,Not agree,Not agree
3,2020-08-12,AS,0,,,0,0.0,0,,,0,0.0,08/13/2020 02:12:28 PM,,
4,2020-06-05,HI,661,,,8,0.0,17,,,0,0.0,06/06/2020 10:31:37 AM,Not agree,Not agree


In [76]:
# unique states present in the dataset 
states = case_df['state'].unique()

In [77]:
print(f'Number of States in the CDC Case Dataset = {len(states)}')

Number of States in the CDC Case Dataset = 60


In [78]:
# This code block checks which states have any missing dates and then creates a report of it as a JSON file to be further inspected if needed

STATE_REPORT = {}
STATE_REPORT['state'] = []

MIN_DATES = [] 
MAX_DATES = []

for target_state in states: 
    # Set one state as the target for report generation 
    target_df = case_df[case_df['state']==target_state]
    target_df = target_df.set_index('submission_date')
    # Use this state's own first and last dates (YYYY-MM-DD), not another frame's
    min_date = str(target_df.index.min()).split(' ')[0]
    max_date = str(target_df.index.max()).split(' ')[0]
    
    missing_dates = pd.date_range(start=min_date, end=max_date).difference(target_df.index)
    missing_dates_count = len(missing_dates)
    
    MIN_DATES.append(min_date)
    MAX_DATES.append(max_date)
    
    if missing_dates_count > 0 :
        print(f'Missing Dates in the state of {target_state}')
    
    STATE_REPORT['state'].append({
        'State Name': f'{target_state}',
        'Start Date': f'{min_date}',
        'End Date': f'{max_date}',
        'Missing Date Count': f'{missing_dates_count}',
        # Store plain YYYY-MM-DD strings so the JSON stays readable
        'Missing Dates': [str(d.date()) for d in missing_dates]
    })

# Saving the report in the output folder 
with open('../outputs/STATE CASE CDC MISSING DATA REPORT.txt', 'w') as outfile:
    json.dump(STATE_REPORT, outfile)
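
To inspect the report later, it can be loaded back with `json.load` and filtered. The sketch below is not part of the original notebook: it writes a small hypothetical report with the same keys to a temporary file, so the sample entries and the file name `report.json` are assumptions.

```python
import json
import os
import tempfile

# Hypothetical report using the same keys as STATE_REPORT.
sample_report = {"state": [
    {"State Name": "AA", "Start Date": "2021-01-01", "End Date": "2021-01-04",
     "Missing Date Count": "1", "Missing Dates": ["2021-01-03"]},
    {"State Name": "BB", "Start Date": "2021-01-01", "End Date": "2021-01-04",
     "Missing Date Count": "0", "Missing Dates": []},
]}

with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = os.path.join(tmp_dir, "report.json")
    with open(report_path, "w") as outfile:
        json.dump(sample_report, outfile)
    with open(report_path) as infile:
        loaded = json.load(infile)

# The count is stored as a string in the report, so convert before comparing.
states_with_gaps = [
    entry["State Name"]
    for entry in loaded["state"]
    if int(entry["Missing Date Count"]) > 0
]
print(states_with_gaps)
```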
    
    
    

In [79]:
# This code block checks if all the states have the same start and end date 
if len(Counter(MAX_DATES)) == 1:
    print('All states have the same end date ')
    print('SUCCESS')

if len(Counter(MIN_DATES)) == 1:
    print('All states have the same start date ')
    print('SUCCESS')
    

All states have the same end date 
SUCCESS
All states have the same start date 
SUCCESS
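
The check above prints nothing when the dates differ. A possible extension, sketched here with made-up inputs shaped like `states` and `MIN_DATES`, names the states that deviate from the most common date. The helper `report_outliers` is hypothetical and not part of the original notebook.

```python
from collections import Counter

def report_outliers(states, dates, label):
    """Return {state: date} for states whose date differs from the most common one."""
    most_common_date, _ = Counter(dates).most_common(1)[0]
    outliers = {s: d for s, d in zip(states, dates) if d != most_common_date}
    if outliers:
        print(f'{len(outliers)} state(s) differ from the common {label} date '
              f'{most_common_date}: {outliers}')
    else:
        print(f'All states have the same {label} date')
    return outliers

# Hypothetical inputs: "CC" starts three days later than the others.
demo_states = ["AA", "BB", "CC"]
demo_min_dates = ["2020-01-22", "2020-01-22", "2020-01-25"]
outliers = report_outliers(demo_states, demo_min_dates, "start")
```

In the notebook this would be called once with `MIN_DATES` and once with `MAX_DATES`, since both lists are filled in the same order as `states`.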
