# COGS 108 - Data Checkpoint

# Names

- Tyler Le
- Aditya Tomar
- William Lynch
- Michael Mao
- Natalie Quach

<a id='research_question'></a>
# Research Question

*Fill in your research question here*

# Dataset(s)


This dataset contains all federally declared natural disasters from 1953 to 2022 (by year, state, county, and type), along with the recovery programs declared for them. Each observation records the type of natural disaster, when it occurred, and the state and county it occurred in.


**3. National Risk Index (NRI)**
- Dataset Name: National Risk Index per County (NRI_Table_Counties.csv)
- Link to the dataset: https://hazards.fema.gov/nri/data-resources#csvDownload
- Number of observations: 3142
- Number of features: 365. Contains a combination of numerical and string data types.

A FEMA dataset that identifies the counties and states most at risk from 18 natural hazards. It includes data on expected annual losses from natural hazards, social vulnerability, and community resilience.

**4. States with Coastline**
- Dataset Name: States with Coastline (states_with_coastline.csv)
- Link to the dataset: https://worldpopulationreview.com/state-rankings/coastal-states
- Number of observations: 50
- Number of features: 2. Contains string data types.

This dataset records whether each state in the United States has a coast. Each observation contains a state and its associated coast, if any.


**5. Cost of Living**
- Dataset Name: Cost of Living (cost_of_living.csv)
- Link to the dataset: https://advisorsmith.com/wp-content/uploads/2021/02/advisorsmith_cost_of_living_index.csv
- Number of observations: 510
- Number of features: 3. Contains numerical and string data types.

Each observation in this dataset contains a city, the state it is in, and its Cost of Living Index. The Cost of Living Index measures the cost of expenses such as food and energy.

**6. Climate**
- Dataset Name: Average Climate by County (Average_Climate_By_County.csv)
- Link to the dataset: https://www.ncdc.noaa.gov/cag/county/mapping
- Number of observations: 3137
- Number of features: 3. Contains a combination of numerical and string types.

This dataset contains the mean temperature over a 5-year span from 2017 to 2022 for all counties in the USA except those in Hawaii. Temperature is measured in degrees Fahrenheit. We include it so that we can compare the correlation between natural disasters and cost of living with the correlation between climate and cost of living, because climate is a potential confounding variable that affects cost of living.



# Setup

In [3]:
import pandas as pd
import numpy as np
# geopy is a third-party package; install it (e.g. `pip install geopy`) before running this cell
from geopy.geocoders import Nominatim

# Data Cleaning

**Table #1 (Frequency of Disaster By State):** This dataset was fairly clean: there were no missing values, and each observation describes the natural disaster events a state had in a given year. We replaced the spaces in the column names with underscores. Because each natural disaster column already contains the count of disasters of that type, little further cleaning was needed for this dataset.

**Table #2 (Types of Disasters By State/County):** This dataset was fairly clean. We first extracted the relevant columns, which were "state", "declarationType", "incidentType", "declarationTitle", and "declarationArea". These are the relevant variables because we want each state, the type of natural disaster, and whether it occurred at the county level. We filtered the dataset to contain only natural disasters that occurred at the county level and standardized the county column. We kept the year of each event rather than the exact month and day, since in our future EDA we would like to explore natural disaster frequencies by decade. To make future analyses more convenient, we renamed some of the columns. We also checked for missing values and found none.

**Table #3 (National Risk Index):** This dataset was fairly clean. We extracted the relevant columns, such as county, population size, National Risk Index score, and expected annual loss. These variables are important because we would like to compare counties per capita. We also lowercased all the column names and replaced spaces with underscores for consistency across all dataframes.
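A minimal sketch of the column selection and renaming described for Table #3, using a toy frame in place of `NRI_Table_Counties.csv`. The column names (`COUNTY`, `STATE`, `POPULATION`, `RISK_SCORE`, `EAL_VALT`), the values, and the derived `eal_per_capita` column are assumptions for illustration and should be checked against the actual file.

```python
import pandas as pd

# Toy stand-in for the NRI table; column names and values are illustrative assumptions.
nri_raw = pd.DataFrame({
    'COUNTY': ['Autauga', 'Baldwin'],
    'STATE': ['Alabama', 'Alabama'],
    'POPULATION': [50000, 200000],
    'RISK_SCORE': [10.0, 25.0],
    'EAL_VALT': [1_000_000.0, 8_000_000.0],
    'Some Other Column': [1, 2],
})

# keep only the columns needed for the analysis
relevant = ['COUNTY', 'STATE', 'POPULATION', 'RISK_SCORE', 'EAL_VALT']
nri_df = nri_raw[relevant].copy()

# lowercase the column names and replace spaces with underscores
nri_df.columns = nri_df.columns.str.lower().str.replace(' ', '_')

# hypothetical derived column: expected annual loss per resident
nri_df['eal_per_capita'] = nri_df['eal_valt'] / nri_df['population']
```

Dividing by population is one possible way to put counties on a per-capita footing; the actual normalization may differ.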

**Table #4 (States with Coastline):** This dataset was fairly clean. Originally, each observation in this dataset contained a state and its associated coast. If the state did not have a coast associated, it had a value "None". To aid with future analyses, such as fitting multiple linear regression models later on, we changed the "coast" column to be binary where 0 means that a state does not have a coast associated with it and 1 means that a state does have a coast associated with it.
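The binary encoding for Table #4 could look roughly like the sketch below. The column names `state` and `coast` and the rows are assumptions for illustration. Some pandas versions parse the literal text "None" as a missing value when reading a CSV, so the sketch treats both the string and a missing value as "no coast".

```python
import pandas as pd

# Toy stand-in for states_with_coastline.csv; names and rows are assumptions.
coast_df = pd.DataFrame({
    'state': ['California', 'Nevada', 'Utah', 'Florida'],
    'coast': ['Pacific Ocean', 'None', None, 'Atlantic Ocean'],
})

# a state has a coast if the value is present and is not the string "None"
has_coast = coast_df['coast'].notna() & (coast_df['coast'] != 'None')

# 1 means the state has a coast, 0 means it does not
coast_df['coast'] = has_coast.astype(int)
```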

**Table #6 (Average Climate by County):** This dataset was very clean. We only needed to remove the ID numbers after the state abbreviations and rename the columns "Location ID" to "State", "Location" to "County", and "Value" to "Temperature (F)".



## Clean Table #1 (Frequency of Disaster By State)

In [None]:
freq_df = pd.read_csv('datasets/natural_disaster_frequencies.csv')

# replace space with underscores in column names
freq_df.columns = freq_df.columns.str.replace(' ', '_')

# check that there are no missing values
assert freq_df.isna().sum().sum() == 0

freq_df

## Clean Table #2 (Types of Disaster By State/County)

In [None]:
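def standardize_county(str_in):
    """Return the county name without '(County)', or None for non-county areas."""
    try:
        if '(County)' in str_in:
            # drop the suffix and the whitespace it leaves behind
            output = str_in.replace('(County)', '').strip()
        else:
            output = None
    except TypeError:
        # non-string input such as NaN
        output = None

    return output


def standardize_year(str_in):
    """Return the year of a timestamp string, or None if it cannot be parsed."""
    try:
        # keep only the date part before the 'T' separator, then parse it
        date_part = str_in.split('T')[0]
        output = pd.to_datetime(date_part).year
    except (AttributeError, TypeError, ValueError):
        output = None

    return output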
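
The two helpers above work on one value at a time. As an illustration, the same steps (keep county-level rows, strip the `(County)` suffix, reduce the date to a year) can also be sketched with vectorized pandas string methods. The column names `designatedArea` and `declarationDate`, the rows, and the `decade` column are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

# Toy rows; column names and values are illustrative assumptions.
disasters = pd.DataFrame({
    'state': ['TX', 'TX', 'CA'],
    'incidentType': ['Flood', 'Hurricane', 'Fire'],
    'designatedArea': ['Taylor (County)', 'Statewide', 'Butte (County)'],
    'declarationDate': ['1953-05-02T00:00:00.000Z',
                        '2008-09-13T00:00:00.000Z',
                        '2018-11-12T00:00:00.000Z'],
})

# keep only rows declared at the county level
is_county = disasters['designatedArea'].str.contains('(County)', regex=False, na=False)
county_df = disasters[is_county].copy()

# drop the "(County)" suffix and the surrounding whitespace
county_df['county'] = (
    county_df['designatedArea'].str.replace('(County)', '', regex=False).str.strip()
)

# take the date part before 'T' and keep only the year
date_part = county_df['declarationDate'].str.split('T').str[0]
county_df['year'] = pd.to_datetime(date_part).dt.year

# decade bucket for exploring frequencies by decade
county_df['decade'] = county_df['year'] // 10 * 10
```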

In [73]:
df['County'] = new_counties
county = df.pop('County')
df.insert(1,'County',county)
df.County.str.count("County").sum()

444
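
The count above is lower than the 510 observations listed for this dataset, which suggests that some `County` values are not county names. A minimal sketch for listing those rows so they can be checked by hand follows. The toy frame reuses three rows from the displayed table, and the suffix check is an assumption that would also flag legitimate names that do not end in "County" (for example parishes or boroughs).

```python
import pandas as pd

# Toy frame built from three rows of the displayed cost-of-living table.
col_df = pd.DataFrame({
    'City': ['Abilene', 'Wheeling', 'Victoria'],
    'County': ['Taylor County', '1528', 'Texas'],
    'State': ['TEXAS', 'WEST VIRGINIA', 'TEXAS'],
})

# flag rows whose County value does not end in "County" (missing values count as suspect)
looks_like_county = col_df['County'].str.endswith('County', na=False)
suspect = col_df[~looks_like_county]

# cities whose county lookup should be reviewed manually
suspect_cities = suspect['City'].tolist()
```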

In [71]:
df

| Unnamed: 0 | City | County | State | Cost of Living Index |
|---|---|---|---|---|
| 0 | Abilene | Taylor County | TEXAS | 89.1 |
| 1 | Adrian | Lenawee County | MICHIGAN | 90.5 |
| 2 | Akron | Summit County | OHIO | 89.4 |
| 3 | Alamogordo | Otero County | NEW MEXICO | 85.8 |
| 4 | Albany | Dougherty County | GEORGIA | 87.3 |
| ... | ... | ... | ... | ... |
| 505 | Wheeling | 1528 | WEST VIRGINIA | 84.1 |
| 506 | New London | New London County | CONNECTICUT | 105.9 |
| 507 | Daphne | Baldwin County | ALABAMA | 96.6 |
| 508 | Victoria | Texas | TEXAS | 89.5 |


In [74]:
# Mean temperature for each county in the USA except those in Hawaii, measured over a 5-year span from 2017 to 2022.
climate_df = pd.read_csv('datasets/Average_Climate_By_County.csv')
# keep only the two-letter state abbreviation at the start of each Location ID
climate_df['Location ID'] = climate_df['Location ID'].str[:2]
climate_df = climate_df.rename(columns={'Location ID': 'State', 'Location': 'County', 'Value': 'Temperature (F)'})
climate_df

Location(Victoria County, Texas, United States, (28.8026443, -96.9766308, 0.0))