# Loading NJ COVID-19 Data Into JSON and CSV

This Python notebook contains the code used to scrape articles from [NJ.com](https://www.nj.com/) to get the number of COVID-19 cases per municipality in New Jersey.

As of 4/7, some counties have not been reporting the number of cases: Mercer County and Atlantic County.

I use `urllib` for loading the webpage, `bs4` and `re` for scraping, `datetime` for getting the current date, and `json` and `csv` for writing the data to the output files.

In [1]:
from urllib import request, error, parse
from bs4 import BeautifulSoup
from datetime import date
import json
import csv
import re

Getting the current date and formatting it to build the [NJ.com](https://www.nj.com/) URL where the data is published.

In [2]:
rn = date.today()
month = rn.strftime("%B").lower()
day = rn.day
year = rn.year

In [3]:
current_date = "{}-{}-{}".format(month,day,year)
month_number = rn.strftime("%m")  # zero-padded month number, e.g. '04'

In [4]:
nj_dot_com_link = 'https://www.nj.com/coronavirus/2020/{0}/where-is-the-coronavirus-in-\
nj-latest-map-update-on-county-by-county-cases-{1}.html'.format(month_number, current_date)
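
As a worked example of the formatting above, here is a minimal sketch that runs the same steps for a fixed date instead of `date.today()`. The date April 7, 2020 is chosen only for illustration, and the sketch assumes an English locale, since `%B` produces a locale-dependent month name.

```python
from datetime import date

# Fixed example date (the notebook itself uses date.today()).
rn = date(2020, 4, 7)

month = rn.strftime("%B").lower()   # full month name, lowercased
month_number = rn.strftime("%m")    # zero-padded month number
current_date = "{}-{}-{}".format(month, rn.day, rn.year)

# Adjacent string literals are concatenated, so the URL contains no whitespace.
nj_dot_com_link = (
    'https://www.nj.com/coronavirus/2020/{0}/where-is-the-coronavirus-in-'
    'nj-latest-map-update-on-county-by-county-cases-{1}.html'
).format(month_number, current_date)

print(current_date)
print(nj_dot_com_link)
```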

Loading the HTML file locally for scraping and parsing.

In [5]:
response = request.urlopen(nj_dot_com_link)
covid19_html = response.read()
soup = BeautifulSoup(covid19_html,'lxml')
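
The request above raises an exception if no article exists for the current date or the site is unreachable. A minimal sketch of one way to handle that with `urllib.error` follows. The `fetch_html` helper and its `opener` parameter are hypothetical names introduced here so the example can run without network access; they are not part of the notebook.

```python
from urllib import request, error

def fetch_html(url, opener=request.urlopen):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with opener(url) as response:
            return response.read()
    except error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error {} for {}'.format(exc.code, url))
    except error.URLError as exc:
        print('Could not reach {}: {}'.format(url, exc.reason))
    return None

# Stand-in opener that simulates a missing article (hypothetical, for illustration).
def _always_404(url):
    raise error.HTTPError(url, 404, 'Not Found', None, None)

result = fetch_html('https://www.nj.com/does-not-exist', opener=_always_404)
```

With a helper like this, the scraping cells could be skipped when the result is `None` rather than failing partway through.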

Getting the main part of the article where the data lies within the HTML file.

In [6]:
main = soup.find_all('main') # main part of article
article = soup.find_all('article') # article component

Loading and cleaning the data (in string format) for matching. A row that does not yield exactly one match is printed and skipped.

In [7]:
data = []
for row in article[0].find_all('p'):
    delims = r'.+:\s*\d+'  # raw string; ':' needs no escaping
    statistic = row.getText().lower()
    
    if ':' in statistic and '•' in statistic:
        cleaned_row = re.findall(delims,statistic)
        if len(cleaned_row) != 1:
            print(statistic)
            continue
        data.append(cleaned_row[0][1:].strip())
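
To illustrate what the pattern keeps and rejects, here is a small sketch that applies the same cleaning logic to a few made-up rows. The sample strings and the `clean_row` helper are hypothetical; the real article text may differ.

```python
import re

ROW_PATTERN = r'.+:\s*\d+'

def clean_row(text):
    """Return 'town: count' for a bulleted row, or None if it is rejected."""
    statistic = text.lower()
    if ':' not in statistic or '•' not in statistic:
        return None
    matches = re.findall(ROW_PATTERN, statistic)
    if len(matches) != 1:
        return None
    # Drop the leading bullet character, then surrounding whitespace.
    return matches[0][1:].strip()

# Hypothetical rows, not taken from any article.
samples = [
    '• Example Town: 12',         # well-formed row
    '• Sample Borough: 3 cases',  # trailing text after the count is dropped
    '• Note: totals pending',     # bullet and colon, but no number
    'Statewide total: 15',        # no bullet, so it is ignored
    '• Big City: 1,040',          # thousands separator truncates the count
]
cleaned = [clean_row(s) for s in samples]
print(cleaned)
```

The last sample shows a limitation worth checking against the source: `\d+` stops at a comma, so a count written as `1,040` would be read as `1`.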

Loading all of the NJ municipality names from `nj_municipals.txt`. Loading dictionaries for handling errors such as misspellings and shorthands for municipalities.

In [8]:
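# Use a context manager so the file is closed after reading.
with open('nj_municipals.txt') as municipals_file:
    nj_municipals = set(line.strip() for line in municipals_file)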

In [9]:
substring_errors = {
    'pemberton boro' : 'pemberton',
    ' borough' : '',
    'parsippany' : 'parsippany-troy hills',
    ' city' : '',
    'south orange' : 'south orange village',
    'oldsman' : 'oldmans',
    'bryram' : 'byram',
    'wantgage' : 'wantage'
}

fullstring_errors = {
    'orange' : 'city of orange',
    'clinton town' : 'clinton township',
    'pine hil' : 'pine hill',
    'peapack-gladstone' : 'peapack and gladstone',
    'pepack-gladstone' : 'peapack and gladstone',
    'hadonfield' : 'haddonfield',
    'tewsbury' : 'tewksbury',
    'boonton town' : 'boonton township',
    'peuannock' : 'pequannock',
    'hardick' : 'hardwick',
    'gutenberg' : 'guttenberg',
    'rivervale' : 'river vale',
    'highstown' : 'hightstown'
}

Loading and storing the data into a dictionary `nj_covid_19_data` after cleaning the input to match the correct municipality names retrieved from `nj_municipals`. Prints an error message if a municipality is not found.

In [10]:
nj_covid_19_data = {}

for row in data:
    infected_township = re.split(':',row)
    town = infected_township[0].strip()
    num_infected = int(infected_township[1].strip())
    
    # `wrong`/`right` avoid shadowing the `error` module imported from urllib.
    for wrong, right in substring_errors.items():
        town = town.replace(wrong, right)
        
    for wrong, right in fullstring_errors.items():
        if town.strip() == wrong:
            town = right
        
    town = town.strip()
    township = town + ' township'
    city = town + ' city'
    
    if town in nj_municipals:
        nj_covid_19_data[town] = num_infected
        
    elif township in nj_municipals:
        nj_covid_19_data[township] = num_infected
        
    elif city in nj_municipals:
        nj_covid_19_data[city] = num_infected
    else: 
        print('ERROR TOWN NOT FOUND IN NJ MUNICIPALS: {}'.format(town))
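
The matching steps above can be sketched as a standalone function, which makes the fallback order (exact name, then ` township`, then ` city`) easier to test. The `normalize_town` helper, the reduced fix dictionaries, and the four-name municipality set below are hypothetical stand-ins for illustration; the dictionary lookup is a compact variant of the notebook's full-string loop, not the author's exact code.

```python
def normalize_town(town, municipals, substring_fixes, fullstring_fixes):
    """Return the matching municipality name, or None if nothing matches."""
    for wrong, right in substring_fixes.items():
        town = town.replace(wrong, right)
    town = fullstring_fixes.get(town.strip(), town).strip()
    # Try the bare name first, then the two suffixed forms.
    for candidate in (town, town + ' township', town + ' city'):
        if candidate in municipals:
            return candidate
    return None

# Hypothetical subset; the real names come from nj_municipals.txt.
municipals = {'city of orange', 'byram township', 'jersey city', 'haddonfield'}
substring_fixes = {' city': '', 'bryram': 'byram'}
fullstring_fixes = {'orange': 'city of orange'}

results = [
    normalize_town(name, municipals, substring_fixes, fullstring_fixes)
    for name in ['bryram', 'orange', 'jersey city', 'atlantis']
]
print(results)
```

Note how `jersey city` first loses its ` city` suffix through the substring fix and is then recovered by the ` city` fallback, which is why that fallback is needed.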

Writing the data to a JSON file, `nj_covid19_today.json`, containing the number of cases per municipality in New Jersey for the current date.

In [11]:
json_data = {
    'last fetched' : current_date,
    'data' : nj_covid_19_data
}

In [12]:
with open('../nj_covid19_today.json', 'w') as datafile:
    json.dump(json_data,datafile)
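
For reference, a file with this structure can be read back with `json.load`. The sketch below writes a made-up record to a temporary directory and loads it again; the town name and count are placeholders, and the `indent` and `sort_keys` arguments are optional additions that make the file easier to read and diff.

```python
import json
import os
import tempfile

# Placeholder record with the same shape as json_data in the notebook.
json_data = {
    'last fetched': 'april-7-2020',
    'data': {'example town': 12},
}

# Write to a temporary directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'nj_covid19_today.json')
with open(path, 'w') as datafile:
    json.dump(json_data, datafile, indent=2, sort_keys=True)

with open(path) as datafile:
    loaded = json.load(datafile)

print(loaded['last fetched'], loaded['data'])
```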

Below are helper functions: one builds the article URL for a previous date, and the other appends the current data to `total_nj_covid19.csv`.

In [13]:
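def get_covid_data_for_specific_date(month, day, month_num='04'):
    # Build the NJ.com article URL for a 2020 date, e.g. ('april', 7, '04').
    # `article_date` avoids shadowing the imported `date` class.
    article_date = "{}-{}-{}".format(month, day, 2020)

    # Adjacent string literals are joined, so the indentation of the second
    # line no longer ends up inside the URL.
    nj_dot_com_link = ('https://www.nj.com/coronavirus/2020/{0}/where-is-the-coronavirus-in-'
                       'nj-latest-map-update-on-county-by-county-cases-{1}.html').format(month_num, article_date)
    return nj_dot_com_link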

In [14]:
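def put_covid_data_in_csv():
    # Append one row per municipality: date, municipality, case count.
    # newline='' lets the csv module control line endings.
    with open('../total_nj_covid19.csv', 'a', newline='') as fulldf:
        writer = csv.writer(fulldf)  # create the writer once, not per row
        for mun in nj_covid_19_data:
            writer.writerow([current_date, mun, nj_covid_19_data[mun]])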

In [15]:
# get_covid_data_for_specific_date('april',7)
put_covid_data_in_csv()
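
Because each run appends rows of the form `date, municipality, count` with no header, the accumulated CSV can be grouped by municipality to get a simple case history. The sketch below assumes that three-column layout and uses made-up in-memory rows instead of the real `total_nj_covid19.csv`.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the same column order the notebook writes.
csv_text = (
    'april-6-2020,example town,10\n'
    'april-7-2020,example town,12\n'
    'april-7-2020,sample borough,3\n'
)

# Map each municipality to a list of (date, count) pairs in file order.
history = defaultdict(list)
for fetched, municipality, cases in csv.reader(io.StringIO(csv_text)):
    history[municipality].append((fetched, int(cases)))

print(history['example town'])
```

To read the real file, `io.StringIO(csv_text)` would be replaced by a file opened with `newline=''`.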