# Missing US data

As mentioned in the thread https://www.kaggle.com/c/covid19-global-forecasting-week-1/discussion/137564, US data is missing before 2020-03-10. As @Thomas Brandon pointed out, this is because cases were reported by county rather than by state before that date. The data still exists, however, and it is fairly simple to convert it into state-level data, since every county entry has the two-letter state abbreviation appended.

Below I show how this can be done, using the state of New York as an example. I strongly urge Kaggle to incorporate these data into the training data for the competition!

In [None]:
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
train = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-1/train.csv')
org = pd.read_csv('/kaggle/input/gthubdata-new/time_series_19-covid-Confirmed.csv')

The missing data for New York can be seen below:

In [None]:
train[train['Province/State']=='New York'].iloc[39:55]

The data at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series is organized slightly differently:

In [None]:
org.columns

It incorporates both states and counties:

In [None]:
us = org[org['Country/Region']=='US']
us['Province/State'].unique()
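
One way to tell the two kinds of rows apart is to look for the trailing two-letter abbreviation. The following is only a minimal sketch on a made-up toy frame; the column name `state_abbr` and the regular expression are my own assumptions, not part of the original data, and rows with extra text after the abbreviation would not match this anchored pattern.

```python
import pandas as pd

# Toy frame in the same layout as the 'Province/State' column (rows are illustrative).
toy = pd.DataFrame({
    'Province/State': ['New York', 'Westchester County, NY',
                       'Nassau County, NY', 'King County, WA'],
    'Country/Region': ['US'] * 4,
})

# Capture a trailing ", XX" two-letter code; rows without one yield NaN.
# 'state_abbr' is a hypothetical helper column.
toy['state_abbr'] = toy['Province/State'].str.extract(
    r',\s*([A-Z]{2})\s*$', expand=False)

county_rows = toy[toy['state_abbr'].notna()]  # county-level entries
state_rows = toy[toy['state_abbr'].isna()]    # state-level entries
```

On the real file the same `str.extract` call could be applied to `us['Province/State']`.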

Here we reconstruct the missing data for New York by adding the county-level rows to the state-level row:

In [None]:
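days = us.columns[4:]  # date columns follow Province/State, Country/Region, Lat, Long
ny_state = us[us['Province/State'] == 'New York']
# County rows carry the state abbreviation after a comma, e.g. "..., NY"
ny_counties = us[us['Province/State'].str.contains(', NY', regex=False, na=False)]
# Daily total = state-level row + sum of all county-level rows
(ny_counties[days].sum() + ny_state[days].sum())[days[39:55]]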

Compare this with the data contained in the training set (as of 22 March):

In [None]:
train[train['Province/State']=='New York'].iloc[39:55]
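
To compare the two sources row by row, it may help to reshape the wide, one-column-per-day layout into the long, one-row-per-day layout of the training set. This is only a sketch on made-up numbers; the target column names `Date` and `ConfirmedCases` are my assumption about the layout of train.csv and should be checked against `train.columns`.

```python
import pandas as pd

# Toy wide frame with one row per state; the counts are made up.
wide = pd.DataFrame(
    {'3/8/20': [15], '3/9/20': [19], '3/10/20': [173]},
    index=pd.Index(['New York'], name='Province/State'),
)

# One row per (state, day) instead of one column per day.
long_df = wide.reset_index().melt(
    id_vars='Province/State', var_name='Date', value_name='ConfirmedCases')

# Parse the m/d/yy column labels into real dates so the frames can be joined.
long_df['Date'] = pd.to_datetime(long_df['Date'], format='%m/%d/%y')
long_df = long_df.sort_values(['Province/State', 'Date']).reset_index(drop=True)
```

The resulting frame could then be merged with the training data on state and date to spot rows where the two disagree.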

Just in case, below is a dictionary mapping state abbreviations to state names, taken from http://code.activestate.com/recipes/577305-python-dictionary-of-us-states-and-territories/

In [None]:
states = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}
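
With such a dictionary, the New York calculation above can be repeated for every state at once. The following is a minimal sketch on a toy frame with made-up counts that mimic the switch from county to state reporting; `abbr_to_state` is a small stand-in for the `states` dictionary, and the regular expression is my own assumption about how county names are formatted.

```python
import pandas as pd

# Small stand-in for the full `states` dictionary.
abbr_to_state = {'NY': 'New York', 'WA': 'Washington'}

# Toy wide-format frame; all counts are made up.
toy = pd.DataFrame({
    'Province/State': ['New York', 'Westchester County, NY', 'Nassau County, NY',
                       'Washington', 'King County, WA'],
    'Country/Region': ['US'] * 5,
    'Lat': [0.0] * 5,
    'Long': [0.0] * 5,
    '3/8/20': [0, 10, 5, 0, 70],
    '3/9/20': [0, 12, 7, 0, 80],
    '3/10/20': [173, 0, 0, 267, 0],
})
days = toy.columns[4:]

# Map every row to a full state name: county rows through their ", XX"
# suffix, state rows through their own name.
abbr = toy['Province/State'].str.extract(r',\s*([A-Z]{2})\s*$', expand=False)
state_name = abbr.map(abbr_to_state).fillna(toy['Province/State'])

# Sum county-level and state-level rows per state and per day.
by_state = toy[days].groupby(state_name).sum()
```

Rows whose abbreviation is not in the dictionary keep their original name, so they end up as their own group rather than being silently dropped.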