# COGS 108 - Final Project (change this to your project's title)

## Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be removed from any project whose group includes them.)

* [  ] YES - make available
* [  ] NO - keep private

# Overview

*Fill in your overview here*

# Names

- Ant Man
- Hulk
- Iron Man
- Thor
- Wasp

<a id='research_question'></a>
# Research Question

*Fill in your research question here*

<a id='background'></a>

## Background & Prior Work

*Fill in your background and prior work here* 

References (include links):
- 1)
- 2)

# Hypothesis


*Fill in your hypotheses here*

# Dataset(s)

*Fill in your dataset information here*

(Copy this information for each dataset)
- Dataset Name: 
- Link to the dataset:
- Number of observations:

1-2 sentences describing each dataset. 

If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.

# Setup

In [49]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
import os
from pathlib import Path

# converting city to county
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm import tqdm # progress bar for .apply()

# for choropleth
import plotly.express as px
import plotly.graph_objects as go
# used for choropleth
from urllib.request import urlopen
import json

# filter extra noise from warnings
import warnings
warnings.filterwarnings('ignore')

# Statsmodels & patsy
import patsy
import statsmodels.api as sm
from scipy.stats import pearsonr
from scipy.stats import boxcox

# Make plots just slightly bigger for displaying well in notebook
plt.rcParams['figure.figsize'] = (10, 5)

# Displaying figures as image
from IPython.display import Image

# used to convert state/county to fips
import addfips

%config InlineBackend.figure_format ='retina'

# Data Cleaning

Describe your data cleaning steps here.

In [59]:
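def standardize_county(str_in):
    """Normalize a county-like area name; return None for anything else."""
    # non-string input (e.g. NaN) cannot be a county name
    if not isinstance(str_in, str):
        return None

    # parenthesized labels: "Name (County)" -> "Name County"
    if '(County)' in str_in:
        output = str_in.replace('(County)', 'County')
    elif '(Borough)' in str_in:
        output = str_in.replace('(Borough)', 'Borough')
    elif '(Census Subarea)' in str_in:
        output = str_in.replace('(Census Subarea)', '')
    # statewide declarations are not tied to a single county
    elif 'Statewide' in str_in:
        output = None
    elif '(Parish)' in str_in:
        output = str_in.replace('(Parish)', 'Parish')
    # names that already carry a plain label are kept as-is
    elif 'County' in str_in:
        output = str_in
    elif 'Borough' in str_in:
        output = str_in
    elif 'Census Area' in str_in:
        output = str_in.replace('Census Area', '')
    elif 'Parish' in str_in:
        output = str_in
    # anything else (e.g. other designated areas) is discarded
    else:
        output = None

    return output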


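def standardize_year(str_in):
    """Extract the year from a timestamp string of the form 'YYYY-MM-DDT...'."""
    try:
        # keep only the date part before the 'T' separator, then parse it
        date_part = str_in.split('T')[0]
        output = pd.to_datetime(date_part).year
    except (AttributeError, TypeError, ValueError):
        # non-string or unparseable input
        output = None

    return output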
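
A quick way to spot-check the county clean-up is to run a few representative names through it. The sketch below is a self-contained, regex-based variant written only for illustration: `normalize_area_name` is a hypothetical helper, not the function used in this notebook, and unlike `standardize_county` it also trims surrounding whitespace.

```python
import re

# Hypothetical, illustrative variant of the county clean-up above.
LABELS = r"County|Borough|Parish|Census Subarea|Census Area"

def normalize_area_name(name):
    """Return a cleaned area name, or None when the entry is not county-like."""
    if not isinstance(name, str) or "Statewide" in name:
        return None
    if not re.search(LABELS, name):
        return None
    # "Clay (County)" -> "Clay County"
    cleaned = re.sub(r"\((County|Borough|Parish)\)", r"\1", name)
    # census labels are dropped entirely
    cleaned = re.sub(r"\(Census Subarea\)|Census Area", "", cleaned)
    return cleaned.strip()

for raw in ["Clay (County)", "Orleans (Parish)", "Statewide", None]:
    print(repr(raw), "->", repr(normalize_area_name(raw)))
```

Comparing such a table of inputs and outputs against the real function is a cheap way to catch a label that falls through to `None` unintentionally.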

In [60]:
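hp_df = pd.read_csv('NewHousingPrices2017-2021.csv').drop(columns=['Unnamed: 0'])

# split "County, State" into separate columns and normalize the county name
hp_df[['county', 'state']] = hp_df['County & State'].str.split(',', expand=True)
hp_df['county'] = hp_df['county'].apply(standardize_county)

# average the numeric columns from position 3 onward, row by row;
# numeric_only=True skips the text columns added above
hp_df['Average_HP'] = hp_df.iloc[:, 3:].mean(axis=1, numeric_only=True)

# keep only the columns used later
new_cols = ['county', 'state', 'FIPS', 'Average_HP']
hp_df = hp_df.reindex(columns=new_cols)
hp_df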

Unnamed: 0,county,state,FIPS,Average_HP
0,Autauga County,Alabama,1001,156341.75
1,Baldwin County,Alabama,1003,222907.75
2,Barbour County,Alabama,1005,96513.75
3,Bibb County,Alabama,1007,103153.25
4,Blount County,Alabama,1009,133840.75
...,...,...,...,...
3112,Teton County,Wyoming,56039,867572.50
3113,Uinta County,Wyoming,56041,185513.50
3114,Washakie County,Wyoming,56043,173410.00
3115,Weston County,Wyoming,56045,179362.50


In [61]:
disaster_type_df = pd.read_csv('datasets/DisasterDeclarationsSummaries.csv')

# select a subset of the columns
wanted_columns = ['state', 'declarationDate','incidentType','declarationTitle','designatedArea']

# rename the columns
disaster_type_df = disaster_type_df[wanted_columns].rename(columns={"declarationDate":"year", "designatedArea": "county", "incidentType":"disaster_type", "declarationTitle":"disaster_declaration"})

# Set "Statewide" entries to None and normalize labels such as "(County)" to "County"
disaster_type_df['county'] = disaster_type_df['county'].apply(standardize_county)

# keep only rows with a recognized county
disaster_type_df = disaster_type_df[~disaster_type_df['county'].isnull()]

# reduce the declaration timestamp to its year
disaster_type_df['year'] = disaster_type_df['year'].apply(standardize_year)

# sort by year
disaster_type_df = disaster_type_df.sort_values('year').reset_index(drop=True)

# check that no NaNs remain
assert disaster_type_df.isna().sum().sum() == 0

disaster_type_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county
0,IN,1959,Flood,FLOOD,Clay County
1,WA,1964,Flood,HEAVY RAINS & FLOODING,Wahkiakum County
2,WA,1964,Flood,HEAVY RAINS & FLOODING,Skamania County
3,WA,1964,Flood,HEAVY RAINS & FLOODING,Pierce County
4,WA,1964,Flood,HEAVY RAINS & FLOODING,Pacific County
...,...,...,...,...,...
57867,MO,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Bollinger County
57868,MO,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin County
57869,WA,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom County
57870,TN,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson County
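
The year extraction above goes row by row through `.apply`. As a hedged alternative, the same result can usually be obtained in one vectorized call. The sketch below assumes the timestamps are ISO-style strings; the sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative values only; the real column holds the declaration timestamps.
raw = pd.Series(["1959-01-29T00:00:00.000Z", "not a date", None])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# A nullable integer dtype keeps missing years as <NA> rather than floats.
years = parsed.dt.year.astype("Int64")
print(years.tolist())
```

Rows whose year comes back missing can then be dropped or inspected explicitly, instead of relying on a `try`/`except` inside the helper.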


In [62]:
state_abv = {'AK': 'Alaska','AL': 'Alabama','AR': 'Arkansas','AZ': 'Arizona','CA': 'California','CO': 'Colorado',
             'CT': 'Connecticut','DE': 'Delaware','FL': 'Florida','GA': 'Georgia','HI': 'Hawaii','IA': 'Iowa',
             'ID': 'Idaho','IL': 'Illinois','IN': 'Indiana','KS': 'Kansas','KY': 'Kentucky','LA': 'Louisiana',
             'MA': 'Massachusetts','MD': 'Maryland','ME': 'Maine','MI': 'Michigan','MN': 'Minnesota',
             'MO': 'Missouri','MS': 'Mississippi','MT': 'Montana','NC': 'North Carolina','ND': 'North Dakota',
             'NE': 'Nebraska','NH': 'New Hampshire','NJ': 'New Jersey','NM': 'New Mexico','NV': 'Nevada',
             'NY': 'New York','OH': 'Ohio','OK': 'Oklahoma','OR': 'Oregon','PA': 'Pennsylvania',
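             'RI': 'Rhode Island','SC': 'South Carolina','SD': 'South Dakota','TN': 'Tennessee',
             'TX': 'Texas','UT': 'Utah','VA': 'Virginia','VT': 'Vermont','WA': 'Washington',
             'WI': 'Wisconsin','WV': 'West Virginia','WY': 'Wyoming'
            }
# expand abbreviations to full state names; codes missing from the dict become NaN
disaster_type_df['state'] = disaster_type_df['state'].map(state_abv)

# exclude incident types outside the scope of the analysis
values = ['Biological', 'Human Cause', 'Fishing Losses', 'Terrorist', 'Other','Dam/Levee Break', 'Toxic Substances']
disaster_type_df = disaster_type_df[~(disaster_type_df['disaster_type'].isin(values))]
# remaining incident types (only displayed when run as the last line of a cell)
disaster_type_df['disaster_type'].unique()

# keep declarations from 2017 onward
dfrq_df = disaster_type_df[disaster_type_df['year'] >= 2017].reset_index(drop=True)

# drop rows with missing values (e.g. states not present in state_abv)
dfrq_df = dfrq_df.dropna()
dfrq_df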

Unnamed: 0,state,year,disaster_type,disaster_declaration,county
0,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee County
1,Georgia,2017,Hurricane,HURRICANE IRMA,Pickens County
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe County
3,Georgia,2017,Hurricane,HURRICANE IRMA,Peach County
4,Georgia,2017,Hurricane,HURRICANE IRMA,Miller County
...,...,...,...,...,...
7603,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Bollinger County
7604,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin County
7605,Washington,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom County
7606,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson County


In [63]:
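af = addfips.AddFIPS()

def find_fips(x):
    """Look up the county FIPS code for a row; None when there is no match."""
    county = x['county'].strip()
    state = x['state'].strip()
    fips = af.get_county_fips(county, state=state)
    # keep misses as None so isnull() can find them later
    # (str(None) would store the text 'None' instead)
    return None if fips is None else str(fips)

tqdm.pandas()
dfrq_df['fips'] = dfrq_df.progress_apply(find_fips, axis=1)
dfrq_df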

100%|██████████████████████████████████████| 7584/7584 [00:00<00:00, 56189.49it/s]


Unnamed: 0,state,year,disaster_type,disaster_declaration,county,fips
0,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee County,13219
1,Georgia,2017,Hurricane,HURRICANE IRMA,Pickens County,13227
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe County,13221
3,Georgia,2017,Hurricane,HURRICANE IRMA,Peach County,13225
4,Georgia,2017,Hurricane,HURRICANE IRMA,Miller County,13201
...,...,...,...,...,...,...
7603,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Bollinger County,29017
7604,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin County,29069
7605,Washington,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom County,53073
7606,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson County,47077
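
With a FIPS code on both tables, the two datasets can be combined on that key. The sketch below is a minimal illustration on toy frames, not the project's actual merge: it assumes the housing table stores `FIPS` as an integer (as in the output above, e.g. `1001`) and that the disaster table stores zero-padded five-character strings. The disaster rows and the `n_disasters` column name are hypothetical.

```python
import pandas as pd

# Toy stand-ins for hp_df and dfrq_df; the disaster rows are made up.
prices = pd.DataFrame({"FIPS": [1001, 1003],
                       "Average_HP": [156341.75, 222907.75]})
disasters = pd.DataFrame({
    "fips": ["01003", "01001", "01003"],
    "disaster_type": ["Hurricane", "Tornado", "Flood"],
})

# Integer FIPS codes lose their leading zero, so pad back to five characters.
prices["fips"] = prices["FIPS"].astype(str).str.zfill(5)

# Count declarations per county, then attach the average housing price.
counts = disasters.groupby("fips").size().rename("n_disasters").reset_index()
merged = counts.merge(prices[["fips", "Average_HP"]], on="fips", how="inner")
print(merged)
```

An inner join silently drops counties that appear in only one table, so comparing `len(merged)` with the number of distinct codes on each side is a useful check before any analysis.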


In [None]:
dfrq_df['county_state'] = dfrq_df['county'] + ", " + dfrq_df['state']
geolocator = Nominatim(user_agent='find_long_lat')

def standardize_long(location):
    # geocode() returns None when nothing is found, so fall back to None
    return getattr(location, 'longitude', None)

def standardize_lat(location):
    return getattr(location, 'latitude', None)

# Nominatim's usage policy allows at most one request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
tqdm.pandas()
try:
    dfrq_df['geocode'] = dfrq_df['county_state'].progress_apply(geocode)
except Exception as err:
    # report the failure instead of silently swallowing it
    print(f'geocoding stopped early: {err}')

 12%|████▋                                   | 890/7584 [14:51<1:51:37,  1.00s/it]
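
At one request per second, geocoding all 7,584 rows takes roughly two hours, yet many rows repeat the same county. One possible speed-up, sketched here under assumptions, is to geocode each distinct `county_state` once and map the results back. `geocode_unique` and `fake_geocode` are hypothetical names; the stub stands in for the rate-limited `geocode` callable so the example runs offline.

```python
import pandas as pd

def geocode_unique(names, geocode_fn):
    """Call geocode_fn once per distinct non-null name and map results back."""
    lookup = {name: geocode_fn(name) for name in names.dropna().unique()}
    return names.map(lookup)

# Stub geocoder that records its calls; a real run would pass `geocode` instead.
calls = []
def fake_geocode(name):
    calls.append(name)
    return ("location for", name)  # placeholder, not real coordinates

names = pd.Series(["Oconee County, Georgia",
                   "Peach County, Georgia",
                   "Oconee County, Georgia"])
result = geocode_unique(names, fake_geocode)
print(len(calls), "lookups for", len(names), "rows")
```

In the notebook this would be used as `dfrq_df['geocode'] = geocode_unique(dfrq_df['county_state'], geocode)`, with the caveat that the progress bar from `progress_apply` is lost.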

In [None]:
dfrq_df['long'] = dfrq_df['geocode'].apply(standardize_long)
dfrq_df['lat'] = dfrq_df['geocode'].apply(standardize_lat)

In [None]:
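# rows where the FIPS lookup or geocoding left a missing value;
# .copy() avoids modifying a view of dfrq_df
null_df = dfrq_df[dfrq_df.isnull().any(axis=1)].copy()

def standardize_null_county(str_in):
    # entries mentioning an MSA: keep only the text before the first '('
    if isinstance(str_in, str) and 'MSA' in str_in:
        return str_in.split('(')[0]
    # leave every other name unchanged instead of returning None
    return str_in

# retry the FIPS lookup and geocoding with the cleaned names
null_df['county'] = null_df['county'].apply(standardize_null_county)
null_df['county_state'] = null_df['county'] + ", " + null_df['state']
null_df['fips'] = null_df.progress_apply(find_fips, axis=1)
try:
    null_df['geocode'] = null_df['county_state'].progress_apply(geocode)
except Exception as err:
    print(f'geocoding stopped early: {err}')
# make a dataframe from this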

In [None]:
null_df['long'] = null_df['geocode'].apply(standardize_long)
null_df['lat'] = null_df['geocode'].apply(standardize_lat)
null_df

In [None]:
frames = [dfrq_df[~dfrq_df.isnull().any(axis=1)],null_df]
dfrq_df = pd.concat(frames)
dfrq_df

# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [None]:
bm = pd.read_csv('datasets/disaster_freq_bubblemap.csv')
fig = px.scatter_geo(bm,lat='lat',lon='long',hover_name='county_state',animation_frame='year',
                     projection='albers usa',hover_data=['disaster_type'],color='disaster_type')
fig.show()

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*

# Team Contributions

*Specify who in your group worked on which parts of the project.*