# COGS 108 - Data Checkpoint

# Names

- Tyler Le
- Aditya Tomar
- William Lynch
- Michael Mao
- Natalie Quach

<a id='research_question'></a>
# Research Question

Is there a positive correlation between the cost of living and the impact of natural disasters in terms of injuries, casualties, and property damage per capita? In which state does the impact of natural disasters affect cost of living the most?
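
To make "positive correlation" concrete, the sketch below shows how a Pearson correlation coefficient could be computed once a state-level cost-of-living measure is available. The column names `cost_of_living_index` and `damage_per_capita` and all values are hypothetical placeholders, not real data; this is one possible way to quantify the relationship, not a fixed part of our analysis plan.

```python
import pandas as pd

# Hypothetical toy values, NOT real data: one row per (made-up) state.
toy = pd.DataFrame({
    "state": ["AA", "BB", "CC", "DD"],
    "cost_of_living_index": [90.0, 100.0, 110.0, 120.0],
    "damage_per_capita": [10.0, 20.0, 30.0, 40.0],
})

# Pearson correlation between the two columns; a value above 0
# indicates that the two measures tend to rise together.
r = toy["cost_of_living_index"].corr(toy["damage_per_capita"])
print(round(r, 3))
```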

# Dataset(s)



<!-- **1. Fatalities and Injuries.**
- Dataset Name: 2020 Summary of Hazardous Weather Fatalities, Injuries, and Damage Costs by State
- Link to the dataset: https://www.weather.gov/hazstat/
- Number of observations: 67 (per year). Each year is a separate dataset.

This is a dataset describing the number of fatalities, injuries, property damage costs, and total damage costs due to Hazardous weather for the year of 2015.

This dataset is for a single year, we plan to gather information for years from XXXX-XXXX and will merge all of them into one. -->


**1. Frequency of Disasters By State**
- Dataset Name: Billion-Dollar Disasters By Year (CPI-Adjusted)
- Link to the dataset: https://www.ncdc.noaa.gov/billions/state-freq-data.csv
- Number of observations: 2228 
- Number of features: 9

This dataset contains, for each state and each year from 1980 to 2021, the count of natural disasters that cost more than 1 billion dollars (CPI-adjusted), broken down by disaster type.

**2. Types of Disaster By State**
- Dataset Name: Disaster Declarations Summaries
- Link to the dataset: https://www.fema.gov/api/open/v1/DisasterDeclarationsSummaries.csv
- Number of observations: 62772
- Number of features: 5

This dataset contains all federally declared natural disasters from 1953 to 2022 (by year, state, and type), along with the declared recovery programs.


**3. National Risk Index (NRI)**
- Dataset Name: National Risk Index (NRI)
- Link to the dataset: https://hazards.fema.gov/nri/data-resources#csvDownload
- Number of observations: 3142
- Number of features: 9

This FEMA dataset identifies the counties and states most at risk from 18 natural hazards. It includes data on expected annual losses from natural hazards, social vulnerability, and community resilience.

**4. States with Coastline**
- Dataset Name: States with Coastline
- Link to the dataset: https://worldpopulationreview.com/state-rankings/coastal-states
- Number of observations: 50
- Number of features: 2 (the state name and the coast associated with that state)

This dataset indicates whether each state in the United States has a coastline. Each observation pairs a state with its associated coast, or `None` if the state has no coast.
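
The tables identify states differently: the frequency table uses abbreviations (e.g. `AL`), the coastline table uses full names (e.g. `Alabama`), and the NRI table carries both (`STATE` and `STATEABBRV`). A minimal sketch of how the NRI columns could serve as a lookup to combine the tables is shown below. The small frames are toy stand-ins for the real tables, and this join strategy is an assumption, not a finalized plan.

```python
import pandas as pd

# Toy stand-ins for the real tables (illustrative rows only).
freq = pd.DataFrame({"year": [1980, 1980], "state": ["AL", "AZ"], "drought": [1, 0]})
coast = pd.DataFrame({"State": ["Alabama", "Arizona"], "coast": ["Gulf Coast", "None"]})
nri = pd.DataFrame({"STATE": ["Alabama", "Alabama", "Arizona"],
                    "STATEABBRV": ["AL", "AL", "AZ"]})

# Build a full-name -> abbreviation lookup from the NRI columns
# (the NRI table has one row per county, so drop the repeats).
lookup = nri[["STATE", "STATEABBRV"]].drop_duplicates()

# Attach abbreviations to the coastline table, then join onto the frequency table.
coast_abbr = coast.merge(lookup, left_on="State", right_on="STATE", how="left")
merged = freq.merge(coast_abbr[["STATEABBRV", "coast"]],
                    left_on="state", right_on="STATEABBRV", how="left")
merged = merged.drop(columns="STATEABBRV")
print(merged)
```

Using `how="left"` keeps every row of the frequency table even when a state has no match, so unmatched states show up as missing values instead of silently disappearing.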

# Setup

In [88]:
import pandas as pd

# Data Cleaning

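Each table is loaded with pandas and checked for missing values; Tables #2 and #3 are also reduced to the columns we need. For Table #2 we keep only county-level rows and reduce the declaration date to a year. For Table #4 we binarize the coast column (0 = no coastline, 1 = coastline).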

## Clean Table #1 (Frequency of Disaster By State)

In [89]:
freq_df = pd.read_csv('datasets/natural_disaster_frequencies.csv')
# make sure there are no NaNs
assert freq_df.isna().sum().sum() == 0

freq_df

Unnamed: 0,year,state,drought,flooding,freeze,severe storm,tropical cyclone,wildfire,winter storm
0,1980,AK,0,0,0,0,0,0,0
1,1980,AL,1,0,0,0,0,0,0
2,1980,AR,1,1,0,0,0,0,0
3,1980,AZ,0,0,0,0,0,0,0
4,1980,CA,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
2221,2021,VT,0,0,0,0,0,0,0
2222,2021,WA,1,0,0,0,0,1,1
2223,2021,WI,0,0,0,4,0,0,0
2224,2021,WV,0,0,0,0,1,0,0
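
Since the table stores one count per disaster type, a per-state total may be useful later. The sketch below sums the type columns row-wise and then aggregates by state. It uses a toy frame with only two of the type columns, and the names `freq_toy`, `total`, and `totals_by_state` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy frame with the same layout as freq_df (only two type columns shown).
freq_toy = pd.DataFrame({
    "year": [1980, 1980, 1981],
    "state": ["AL", "AR", "AL"],
    "drought": [1, 1, 0],
    "flooding": [0, 1, 2],
})

# Every column other than year/state is a disaster-type count.
type_cols = [c for c in freq_toy.columns if c not in ("year", "state")]

# Row-wise total per (year, state), then summed over all years per state.
freq_toy["total"] = freq_toy[type_cols].sum(axis=1)
totals_by_state = freq_toy.groupby("state")["total"].sum()
print(totals_by_state)
```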


## Clean Table #2 (Types of Disaster By State)

In [105]:
disaster_type_df = pd.read_csv('datasets/DisasterDeclarationsSummaries.csv')

# select a subset of the columns
wanted_columns = ['state', 'declarationDate','incidentType','declarationTitle','designatedArea']

# rename the columns
disaster_type_df = disaster_type_df[wanted_columns].rename(columns={"declarationDate":"year", "designatedArea": "county", "incidentType":"disaster_type", "declarationTitle":"disaster_declaration"})

# check for no NaNs
assert disaster_type_df.isna().sum().sum() == 0

# filter dataset to only include counties; regex=False matches the literal text "(County)"
disaster_type_df = disaster_type_df[disaster_type_df['county'].str.contains("(County)", regex=False)].copy()

# remove "(County)" from all rows and trim the leftover whitespace
disaster_type_df['county'] = disaster_type_df['county'].str.replace("(County)", "", regex=False).str.strip()

# remove extra noise from year column
disaster_type_df['year'] = disaster_type_df['year'].str.split('T').str[0]

# strip year column to only include year
disaster_type_df["year"] = pd.to_datetime(disaster_type_df["year"]).dt.year

# sort by year
disaster_type_df = disaster_type_df.sort_values('year').reset_index(drop=True)

disaster_type_df


Unnamed: 0,state,year,disaster_type,disaster_declaration,county
0,IN,1959,Flood,FLOOD,Clay
1,WA,1964,Flood,HEAVY RAINS & FLOODING,Wahkiakum
2,WA,1964,Flood,HEAVY RAINS & FLOODING,Skamania
3,WA,1964,Flood,HEAVY RAINS & FLOODING,Pierce
4,WA,1964,Flood,HEAVY RAINS & FLOODING,Pacific
...,...,...,...,...,...
55258,MO,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin
55259,TN,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer
55260,WA,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Skagit
55261,TN,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson
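
The county filter is meant to match the literal text `(County)`. Because `str.contains` interprets its pattern as a regular expression by default, the parentheses would act as a capture group and the pattern would match any value containing `County`. The sketch below shows a literal match with `regex=False` on made-up area names (the third value is hypothetical and chosen only to show the difference).

```python
import pandas as pd

# Made-up designated-area values for illustration.
areas = pd.Series(["Clay (County)", "Statewide", "County Line (Township)"])

# regex=False treats "(County)" as literal text, parentheses included,
# so "County Line (Township)" is not selected.
is_county = areas.str.contains("(County)", regex=False)

# Drop the literal suffix and any whitespace left behind.
names = areas[is_county].str.replace("(County)", "", regex=False).str.strip()
print(names.tolist())
```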


## Clean Table #3 (NRI)

In [91]:
# See the NRI Data Dictionary in datasets/NRI_Table_Counties for what the columns mean.
# EAL = "Expected Annual Loss": quantifies the anticipated economic damage resulting from natural hazards each year.
# 1-100 scale

df_nri = pd.read_csv('datasets/NRI_Table_Counties/NRI_Table_Counties.csv')

# select a subset of the columns
wanted_cols = ['STATE','STATEABBRV','COUNTY','POPULATION','AREA','RISK_SCORE','RISK_RATNG','EAL_SCORE','EAL_RATNG']
df_nri = df_nri[wanted_cols]

# make sure there are no NaNs
assert df_nri.isna().sum().sum() == 0


df_nri

Unnamed: 0,STATE,STATEABBRV,COUNTY,POPULATION,AREA,RISK_SCORE,RISK_RATNG,EAL_SCORE,EAL_RATNG
0,Kentucky,KY,Johnson,23356,261.958144,9.281419,Relatively Low,11.921944,Relatively Low
1,Kentucky,KY,Kenton,159720,160.213975,10.449057,Relatively Low,16.837131,Relatively Moderate
2,Kentucky,KY,Knott,16346,351.517978,10.068395,Relatively Low,10.945913,Relatively Low
3,Kentucky,KY,Knox,31883,386.298435,11.858245,Relatively Low,11.983719,Relatively Low
4,Kentucky,KY,Larue,14193,261.539564,4.610900,Very Low,7.028611,Very Low
...,...,...,...,...,...,...,...,...,...
3137,Wyoming,WY,Sweetwater,43806,10426.919825,2.572885,Very Low,7.174819,Very Low
3138,Wyoming,WY,Teton,21294,3996.855337,6.513988,Very Low,15.004361,Relatively Low
3139,Wyoming,WY,Uinta,21118,2081.651840,4.458528,Very Low,8.437977,Very Low
3140,Wyoming,WY,Washakie,8533,2238.665800,5.182977,Very Low,5.592578,Very Low
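
The NRI table is at the county level, while our research question is at the state level. One possible way to roll counties up to states is a population-weighted mean of `EAL_SCORE`, sketched below on toy values. Whether averaging these scores is appropriate is an assumption we have not validated, and the names `nri_toy`, `weighted`, and `state_eal` are illustrative.

```python
import pandas as pd

# Toy county rows (values are illustrative, not taken from the NRI table).
nri_toy = pd.DataFrame({
    "STATEABBRV": ["KY", "KY", "WY"],
    "POPULATION": [100, 300, 50],
    "EAL_SCORE": [10.0, 20.0, 8.0],
})

# Weight each county's score by its population.
nri_toy["weighted"] = nri_toy["EAL_SCORE"] * nri_toy["POPULATION"]

# Sum weights and populations per state, then divide to get the weighted mean.
grouped = nri_toy.groupby("STATEABBRV")[["weighted", "POPULATION"]].sum()
state_eal = grouped["weighted"] / grouped["POPULATION"]
print(state_eal)
```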


## Clean Table #4 (States with Coastline)

In [92]:
coastline_df = pd.read_csv('datasets/states_with_coastline.csv')
coastline_df

Unnamed: 0,State,coast
0,Alabama,Gulf Coast
1,Alaska,Pacific Ocean
2,Arizona,
3,Arkansas,
4,California,Pacific Ocean
5,Colorado,
6,Connecticut,Atlantic Ocean
7,Delaware,Atlantic Ocean
8,Florida,Atlantic Ocean/Gulf Coast
9,Georgia,Atlantic Ocean


In [93]:
coastline_df.dtypes

State    object
coast    object
dtype: object

In [94]:
coastline_df['coast'].value_counts()

None                                20
Atlantic Ocean                      12
Great Lakes Coast                    7
Pacific Ocean                        5
Gulf Coast                           4
Atlantic Ocean/Gulf Coast            1
Atlantic Ocean/Great Lakes Coast     1
Name: coast, dtype: int64

In [95]:
# check for missing values
coastline_df.isna().sum()

State    0
coast    0
dtype: int64

In [96]:
# Binarize "coast" column. 0 = no coastline, 1 = has coastline
def clean_coast(coast_val):
    # states without a coast are marked with the string "None"
    if "none" in coast_val.lower():
        return 0
    return 1

In [97]:
# test function from above
assert clean_coast('None') == 0
assert clean_coast('Atlantic Ocean') == 1

In [100]:
coastline_df['coast'] = coastline_df['coast'].apply(clean_coast)

In [103]:
coastline_df

Unnamed: 0,State,coast
0,Alabama,1
1,Alaska,1
2,Arizona,0
3,Arkansas,0
4,California,1
5,Colorado,0
6,Connecticut,1
7,Delaware,1
8,Florida,1
9,Georgia,1
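
The same binarization can also be written as a single vectorized comparison instead of a row-wise `apply`. The sketch below is an alternative under the assumption that landlocked states are marked with exactly the string `None` (ignoring case and surrounding whitespace); `coast_toy` is a toy frame, not the real table.

```python
import pandas as pd

# Toy frame with the same two columns as coastline_df.
coast_toy = pd.DataFrame({
    "State": ["Alabama", "Arizona", "Florida"],
    "coast": ["Gulf Coast", "None", "Atlantic Ocean/Gulf Coast"],
})

# True where the value is anything other than "none", then cast to 0/1.
normalized = coast_toy["coast"].str.strip().str.lower()
coast_toy["coast"] = (normalized != "none").astype(int)
print(coast_toy)
```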
