Cleaning smoke coverage data
* merging with census keys to get state and county names
* creating smoke_score column: sum of light, medium, heavy (scale 1-3)
* downcasting dtypes to save memory
* resolution: county level --> dropping census tracts and block groups

### Data description and source info
**Data title:** Time Series of Potential US Wildland Fire Smoke Exposures

**Date Range of observations:** June 2010 - December 2019 in the source data, daily records (this notebook keeps records through December 2016)  
**Geo Range of observations:** All U.S. down to census block group (much smaller than county).  

**Data summary:** "this is a data set of U.S. population and wildland fire smoke spatial and temporal coincidence" from June 1, 2010 to Dec 17, 2019. It combines NOAA's HMS-Smoke satellite system with census population data to estimate "potential exposure to light, medium, and heavy categories of wildfire smoke".  
  
From the dataset description, potential research questions (regarding health): "The data represents a modest advancement of NOAA's HMS-Smoke product, with the aims of spurring additional work on the impacts of wildfire smoke on the health of US Populations. Namely, these should include tracking potential wildfire smoke exposures to identify areas and times most heavily impacted by smoke, adding potential smoke exposures to population characteristics describing the social determinants of health in order to better distribute resources and contextualize public health messages and interventions, and combining information specific to wildfire smoke with other air pollution data to better isolate and understand the contribution of wildfires to poor health. (2020-02-24)"  
  
---
**source:** [cleaned/combined data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CTWGWE). That page links to the original datasets sourced from NOAA and Census Centers, and to Vargo's R script on GitHub that cleaned them.

---
**citation:**  
Vargo, Jason, 2020, "Time Series of Potential US Wildland Fire Smoke Exposures", https://doi.org/10.7910/DVN/CTWGWE, Harvard Dataverse, V1

In [1]:
import numpy as np
import pandas as pd

In [2]:
smoke = pd.read_csv('../scratch_data/HMS_USpopBG_2010_2019.csv')  # update with your file path
census = pd.read_csv('../scratch_data/census_merge_w_HMS.csv')  # update with your file path

In [3]:
# right join keeps every smoke row; with no `on=` given, pandas joins on all shared column names
smoke_all_counties = pd.merge(census, smoke, how='right')
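
The merge relies on pandas' default of joining on every column name the two frames share. As a minimal sketch on toy data, the join key can be named explicitly and the result checked for smoke rows with no census match. The `GEOID` key and all values here are assumptions for illustration; the real key columns are not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for the census and smoke tables (GEOID is a hypothetical key)
census_toy = pd.DataFrame({
    "GEOID": [1, 2],
    "STATE": ["CA", "OR"],
    "COUNTY": ["Alameda", "Lane"],
})
smoke_toy = pd.DataFrame({
    "GEOID": [1, 1, 2, 3],
    "date": [20100601, 20100602, 20100601, 20100601],
    "light": [1.0, None, 1.0, 1.0],
})

merged_toy = pd.merge(
    census_toy,
    smoke_toy,
    how="right",             # keep every smoke row
    on="GEOID",              # explicit key instead of "all shared columns"
    validate="one_to_many",  # raise if a census key appears more than once
    indicator=True,          # adds a _merge column describing each row's origin
)

# Smoke rows whose key was not found in the census table
unmatched = int((merged_toy["_merge"] == "right_only").sum())
print(unmatched)
```

A nonzero `unmatched` count would mean some smoke rows have no state or county name after the merge.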

##### set States of interest and date range here:

In [4]:
west_states = ['CA', 'OR', 'WA', 'AZ', 'NV', 'UT', 'ID']
smoke_west_counties = smoke_all_counties.loc[smoke_all_counties['STATE'].isin(west_states)]
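# date is compared as a YYYYMMDD integer; keep 2010 through 2016
# .copy() gives an independent frame so the later in-place edits do not trigger SettingWithCopyWarning
smoke_west_counties = smoke_west_counties.loc[(smoke_west_counties['date'] > 20091231) & (smoke_west_counties['date'] <= 20161231)].copy()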
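
The date filter works on raw integers because YYYYMMDD values sort in chronological order. A small sketch on toy data, assuming the `date` column holds such integers, shows that the integer comparison selects the same rows as a filter on parsed datetimes:

```python
import pandas as pd

toy = pd.DataFrame({"date": [20091231, 20100601, 20161231, 20170101]})

# Integer comparison, as in the cell above
int_mask = (toy["date"] > 20091231) & (toy["date"] <= 20161231)

# Same window expressed on parsed datetimes (both bounds inclusive)
parsed = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")
dt_mask = parsed.between(pd.Timestamp("2010-01-01"), pd.Timestamp("2016-12-31"))

kept = toy.loc[int_mask, "date"].tolist()
print(kept)
```

Parsing to datetimes costs memory on a frame this large, which is one reason to stay with integers here.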

In [5]:
# NaNs appear to mean "no smoke", so fill them with 0, then sum the three classes to get smoke_score
smoke_west_counties.fillna(0, inplace=True)
smoke_west_counties['smoke_score'] = smoke_west_counties['light'] + smoke_west_counties['medium'] + smoke_west_counties['heavy']
smoke_west_counties.drop(columns=['light','medium','heavy'], inplace=True)
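
The fill step matters because `+` propagates NaN: a row with any missing class would get a missing score. A minimal sketch on toy values (the 0/1 values are assumptions, not taken from the real file):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "light": [1.0, np.nan, 1.0],
    "medium": [np.nan, np.nan, 1.0],
    "heavy": [np.nan, np.nan, 1.0],
})
smoke_cols = ["light", "medium", "heavy"]

# Without filling, any NaN in a row makes the whole sum NaN
naive = toy["light"] + toy["medium"] + toy["heavy"]

# Filling only the smoke columns leaves other columns untouched
toy["smoke_score"] = toy[smoke_cols].fillna(0).sum(axis=1)
print(toy["smoke_score"].tolist())
```

Restricting `fillna` to the three smoke columns is a variation on the cell above, which fills every column in the frame.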

In [6]:
# downcast numeric columns to the smallest unsigned integer dtype to reduce file size
downcast = ['STATEFP', 'COUNTYFP', 'BLKGRPCE', 'TRACTCE', 'POPULATION', 'date', 'smoke_score']
for col in downcast:
    smoke_west_counties[col] = pd.to_numeric(smoke_west_counties[col], downcast='unsigned')
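
To see what the downcast saves, memory can be measured before and after with `memory_usage`. This is a sketch on a toy frame with made-up values; the savings on the real data depend on its value ranges.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "STATEFP": np.array([6, 41, 53], dtype="int64"),
    "smoke_score": np.array([0.0, 1.0, 3.0]),
})

before = toy.memory_usage(deep=True).sum()
for col in toy.columns:
    # picks the smallest unsigned dtype that can hold the column's values
    toy[col] = pd.to_numeric(toy[col], downcast="unsigned")
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(before, after)
```

The unsigned downcast is only applied when a column has no negative values; otherwise pandas leaves the dtype as it was.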

In [7]:
# still too big. drop TRACTCE and BLKGRPCE
smoke_west_counties.drop(['TRACTCE', 'BLKGRPCE'], axis=1, inplace=True)

In [8]:
# TRACTCE/BLKGRPCE were more granular than county
# with those dropped, rows that are identical at county level become duplicates. drop them
smoke_west_counties.drop_duplicates(inplace=True)
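
Note that `drop_duplicates` only removes rows that match in every remaining column, so rows for the same county and date that differ in `POPULATION` or `smoke_score` are all kept. The toy sketch below contrasts this with a `groupby` aggregation. The aggregation rules (summing population, taking the maximum score) are assumptions for illustration, not the method used in this notebook.

```python
import pandas as pd

# Three block-group rows for one county and day, after the tract/block columns are dropped
toy = pd.DataFrame({
    "COUNTYFP": [1, 1, 1],
    "date": [20100601, 20100601, 20100601],
    "POPULATION": [500, 500, 800],
    "smoke_score": [1, 1, 1],
})

# Only the exact repeat (the second 500 row) is removed
deduped = toy.drop_duplicates()

# Hypothetical alternative: one row per county and date
county = toy.groupby(["COUNTYFP", "date"], as_index=False).agg(
    POPULATION=("POPULATION", "sum"),
    smoke_score=("smoke_score", "max"),
)
print(len(deduped), len(county))
```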

In [9]:
smoke_west_counties.reset_index(drop=True, inplace=True)

In [10]:
# # GitHub file size limit is 100 MB unless using Git LFS
# # update with your file path
# smoke_west_counties.to_csv('../scratch_data/smoke_west_counties_2010_2016.csv', index=False)
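
If the output is close to the size limit, one option is to check the written file size and compare it with a gzip-compressed copy. The sketch below uses a toy frame and a temporary directory; the file names are placeholders.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"date": [20100601] * 1000, "smoke_score": [1] * 1000})

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "toy.csv")
    zipped = os.path.join(tmp, "toy.csv.gz")

    toy.to_csv(plain, index=False)
    toy.to_csv(zipped, index=False, compression="gzip")

    # sizes in megabytes
    plain_mb = os.path.getsize(plain) / 1e6
    zipped_mb = os.path.getsize(zipped) / 1e6

print(plain_mb, zipped_mb)
```

`pd.read_csv` reads a `.csv.gz` file directly, so downstream notebooks would only need the new path.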