# COGS 108 - Final Project (change this to your project's title)

## Permissions

Place an `X` in the appropriate bracket below to specify whether you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [  ] YES - make available
* [  ] NO - keep private

# Overview

*Fill in your overview here*

# Names

- Ant Man
- Hulk
- Iron Man
- Thor
- Wasp

<a id='research_question'></a>
# Research Question

*Fill in your research question here*

<a id='background'></a>

## Background & Prior Work

*Fill in your background and prior work here* 

References (include links):
- 1)
- 2)

# Hypothesis


*Fill in your hypotheses here*

# Dataset(s)

*Fill in your dataset information here*

(Copy this information for each dataset)
- Dataset Name: 
- Link to the dataset:
- Number of observations:

1-2 sentences describing each dataset. 

If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.

# Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
import os
from pathlib import Path

# geocoding county names to latitude/longitude
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm import tqdm # progress bar for .apply()

# for choropleth
import plotly.express as px
import plotly.graph_objects as go
# used for choropleth
from urllib.request import urlopen
import json

# filter extra noise from warnings
import warnings
warnings.filterwarnings('ignore')

# Statsmodels & patsy
import patsy
import statsmodels.api as sm
from scipy.stats import pearsonr
from scipy.stats import boxcox

# Make plots just slightly bigger for displaying well in notebook
plt.rcParams['figure.figsize'] = (10, 5)

# Displaying figures as image
from IPython.display import Image

# used to convert state/county to fips
import addfips

%config InlineBackend.figure_format ='retina'

# Data Cleaning

Describe your data cleaning steps here.

In [79]:
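# Load county-level housing prices and drop the saved index column
hp_df = pd.read_csv('NewHousingPrices2017-2021.csv').drop(columns=['Unnamed: 0'])

# Split "County, State" into two columns and trim the space left after the comma
hp_df[['county', 'state']] = hp_df['County & State'].str.split(',', expand=True)
hp_df['state'] = hp_df['state'].str.strip()
hp_df['county'] = hp_df['county'].apply(standardize_county)  # helper defined in the next cell

# Average the price columns (position 3 onward); numeric_only skips the text columns added above
hp_df['Average_HP'] = hp_df.iloc[:,3:].mean(axis=1, numeric_only=True)
new_cols = ['county','state','FIPS','Average_HP']
hp_df = hp_df.reindex(columns=new_cols)
hp_df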

Unnamed: 0,county,state,FIPS,Average_HP
0,Autauga County,Alabama,1001,156341.75
1,Baldwin County,Alabama,1003,222907.75
2,Barbour County,Alabama,1005,96513.75
3,Bibb County,Alabama,1007,103153.25
4,Blount County,Alabama,1009,133840.75
...,...,...,...,...
3112,Teton County,Wyoming,56039,867572.50
3113,Uinta County,Wyoming,56041,185513.50
3114,Washakie County,Wyoming,56043,173410.00
3115,Weston County,Wyoming,56045,179362.50
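
The averaging step above selects price columns by position. A minimal sketch of the same idea with the columns named explicitly follows; the miniature table and its `2017`/`2018` column names are assumptions for illustration, not the layout of the real file.

```python
import pandas as pd

# Hypothetical stand-in for the housing-price file: one price column per year.
prices = pd.DataFrame({
    "County & State": ["Autauga County, Alabama", "Teton County, Wyoming"],
    "FIPS": [1001, 56039],
    "2017": [100.0, 800.0],
    "2018": [200.0, 900.0],
})

# Split the combined label and strip the space left after the comma.
prices[["county", "state"]] = prices["County & State"].str.split(",", expand=True)
prices["state"] = prices["state"].str.strip()

# Average only the explicitly named year columns, so FIPS and text columns
# can never leak into the mean.
year_cols = ["2017", "2018"]
prices["Average_HP"] = prices[year_cols].mean(axis=1)
```

Naming the columns keeps the result stable if the column order of the source file changes.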


In [78]:
def standardize_county(str_in):
    """Normalize a designated-area label; non-string input (e.g. NaN) returns None."""
    try:
        if '(County)' in str_in:
            output = str_in.replace('(County)','County')
        elif '(Borough)' in str_in:
            output = str_in.replace('(Borough)','Borough')
        elif '(Census Subarea)' in str_in:
            output = str_in.replace('(Census Subarea)','')
        elif '(Parish)' in str_in:
            output = str_in.replace('(Parish)','Parish')
        elif 'County' in str_in or 'Borough' in str_in:
            # already in the target form
            output = str_in
        elif 'Census Area' in str_in:
            output = str_in.replace('Census Area','')
        else:
            # includes plain 'Parish' names and labels such as 'Statewide'
            output = str_in
    except TypeError:
        # `in` raises TypeError for non-strings such as NaN
        output = None

    return output


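def standardize_year(str_in):
    """Return the year of a timestamp such as 'YYYY-MM-DDTHH:MM:SS', or None."""
    try:
        # keep only the date part before 'T', then parse it
        date_part = str_in.split('T')[0]
        output = pd.to_datetime(date_part).year
    except (AttributeError, TypeError, ValueError):
        output = None
        
    return output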
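
As an alternative to applying `standardize_year` row by row, the year can be extracted for a whole column at once. This is a sketch only: the sample values assume ISO-style timestamps, which the split on `'T'` suggests but the source does not show.

```python
import pandas as pd

# Hypothetical sample of declaration dates, including two unusable entries.
dates = pd.Series(["1959-01-29T00:00:00.000Z", "not a date", None])

# errors="coerce" turns anything unparseable into NaT instead of raising,
# so .dt.year yields NaN for those rows.
years = pd.to_datetime(dates, errors="coerce", utc=True).dt.year
```

Rows that fail to parse stay missing, so they can still be caught by a later null check.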

In [4]:
disaster_type_df = pd.read_csv('datasets/DisasterDeclarationsSummaries.csv')

# select a subset of the columns
wanted_columns = ['state', 'declarationDate','incidentType','declarationTitle','designatedArea']

# rename the columns
disaster_type_df = disaster_type_df[wanted_columns].rename(columns={"declarationDate":"year", "designatedArea": "county", "incidentType":"disaster_type", "declarationTitle":"disaster_declaration"})

# Normalize county labels (e.g. "(County)" -> "County"); missing values become None
disaster_type_df['county'] = disaster_type_df['county'].apply(standardize_county)

# filter dataset to only include non-null 
disaster_type_df = disaster_type_df[~disaster_type_df['county'].isnull()]

# strip year column to only include year
disaster_type_df['year'] = disaster_type_df['year'].apply(standardize_year)

# sort by year
disaster_type_df = disaster_type_df.sort_values('year').reset_index(drop = True)

# check for no NaNs
assert(disaster_type_df.isna().sum().sum() == 0)

disaster_type_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county
0,IN,1959,Flood,FLOOD,Clay County
1,WA,1964,Flood,HEAVY RAINS & FLOODING,Wahkiakum County
2,WA,1964,Flood,HEAVY RAINS & FLOODING,Skamania County
3,WA,1964,Flood,HEAVY RAINS & FLOODING,Pierce County
4,WA,1964,Flood,HEAVY RAINS & FLOODING,Pacific County
...,...,...,...,...,...
57602,MO,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin County
57603,TN,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer County
57604,WA,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom County
57605,TN,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson County


In [5]:
state_abv = {'AK': 'Alaska','AL': 'Alabama','AR': 'Arkansas','AZ': 'Arizona','CA': 'California','CO': 'Colorado',
             'CT': 'Connecticut','DE': 'Delaware','FL': 'Florida','GA': 'Georgia','HI': 'Hawaii','IA': 'Iowa',
             'ID': 'Idaho','IL': 'Illinois','IN': 'Indiana','KS': 'Kansas','KY': 'Kentucky','LA': 'Louisiana',
             'MA': 'Massachusetts','MD': 'Maryland','ME': 'Maine','MI': 'Michigan','MN': 'Minnesota',
             'MO': 'Missouri','MS': 'Mississippi','MT': 'Montana','NC': 'North Carolina','ND': 'North Dakota',
             'NE': 'Nebraska','NH': 'New Hampshire','NJ': 'New Jersey','NM': 'New Mexico','NV': 'Nevada',
             'NY': 'New York','OH': 'Ohio','OK': 'Oklahoma','OR': 'Oregon','PA': 'Pennsylvania',
            'RI': 'Rhode Island','SC': 'South Carolina','SD': 'South Dakota','TN': 'Tennessee',
             'TX': 'Texas','UT': 'Utah','VA': 'Virginia','VT': 'Vermont','WA': 'Washington',
             'WI': 'Wisconsin','WV': 'West Virginia','WY': 'Wyoming'
                    }
disaster_type_df['state'] = disaster_type_df['state'].map(state_abv)
values = ['Biological', 'Human Cause', 'Fishing Losses', 'Terrorist', 'Other','Dam/Levee Break', 'Toxic Substances']
disaster_type_df = disaster_type_df[~(disaster_type_df['disaster_type'].isin(values))]
disaster_type_df['disaster_type'].unique()

dfrq_df = disaster_type_df[disaster_type_df['year'] >= 2017].reset_index(drop=True)

dfrq_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county
0,Georgia,2017,Hurricane,HURRICANE IRMA,Laurens County
1,Georgia,2017,Hurricane,HURRICANE IRMA,Long County
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee County
3,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe County
4,Georgia,2017,Hurricane,HURRICANE IRMA,Peach County
...,...,...,...,...,...
7579,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin County
7580,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer County
7581,Washington,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom County
7582,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson County


In [6]:
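# Attach declarations since 2017 to every county that has housing-price data
df = pd.merge(hp_df,dfrq_df, how='left',on=['state','county'])
# NOTE: fillna with a Series aligns on row index, not on county, so unmatched
# counties receive the disaster type found at the same row position in disaster_type_df
df['disaster_type'] = df['disaster_type'].fillna(disaster_type_df['disaster_type'])
new_cols = ['county','state','FIPS','Average_HP','disaster_type']
df = df.reindex(columns=new_cols)
df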

Unnamed: 0,county,state,FIPS,Average_HP,disaster_type
0,Autauga,Alabama,1001,156341.75,Flood
1,Baldwin,Alabama,1003,222907.75,Flood
2,Barbour,Alabama,1005,96513.75,Flood
3,Bibb,Alabama,1007,103153.25,Flood
4,Blount,Alabama,1009,133840.75,Flood
...,...,...,...,...,...
3112,Teton,Wyoming,56039,867572.50,Severe Storm(s)
3113,Uinta,Wyoming,56041,185513.50,Severe Storm(s)
3114,Washakie,Wyoming,56043,173410.00,Flood
3115,Weston,Wyoming,56045,179362.50,Severe Storm(s)
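
The `fillna` call in this cell passes a Series, which pandas aligns by row label rather than by county. The sketch below, built on made-up two-row tables, shows what that alignment does and how a scalar fill differs; the label `"No declaration"` is an assumption, not a value from the dataset.

```python
import pandas as pd

# Hypothetical left-merge result: the second county matched no declaration.
merged = pd.DataFrame({
    "county": ["Autauga County", "Bibb County"],
    "disaster_type": ["Flood", None],
})

# An unrelated table that happens to share row labels 0 and 1.
other = pd.DataFrame({"disaster_type": ["Hurricane", "Tornado"]})

# Passing a Series fills by row label, so row 1 takes whatever sits at
# label 1 in `other`, regardless of which county that row describes.
by_label = merged["disaster_type"].fillna(other["disaster_type"])

# Passing a scalar marks the gap explicitly instead.
explicit = merged["disaster_type"].fillna("No declaration")
```

Here `by_label` assigns "Tornado" to Bibb County only because of row position, which is worth keeping in mind when reading the merged table.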


In [9]:
fips_df = pd.read_csv('datasets/county_fips_lat_long.csv', encoding = 'latin1')
fips_df = fips_df.rename(str.lower,axis='columns')
fips_df['state'] = fips_df['state'].map(state_abv)
fips_df = fips_df[1:]
fips_df = fips_df[['state','fips','county [2]','latitude','longitude']]
fips_df = fips_df.rename(columns={'county [2]':'county'})
fips_df

Unnamed: 0,state,fips,county,latitude,longitude
1,Alabama,01001,Autauga,+32.536382°,86.644490°
2,Alabama,01003,Baldwin,+30.659218°,87.746067°
3,Alabama,01005,Barbour,+31.870670°,85.405456°
4,Alabama,01007,Bibb,+33.015893°,87.127148°
5,Alabama,01009,Blount,+33.977448°,86.567246°
...,...,...,...,...,...
3139,Wyoming,56037,Sweetwater,+41.660339°,108.875676°
3140,Wyoming,56039,Teton,+44.049321°,110.588102°
3141,Wyoming,56041,Uinta,+41.284726°,110.558947°
3142,Wyoming,56043,Washakie,+43.878831°,107.669052°


In [10]:
merged_df = pd.merge(dfrq_df,fips_df, on=['state','county'],how='left')
merged_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county,fips,latitude,longitude
0,Georgia,2017,Hurricane,HURRICANE IRMA,Laurens County,,,
1,Georgia,2017,Hurricane,HURRICANE IRMA,Long County,,,
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee County,,,
3,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe County,,,
4,Georgia,2017,Hurricane,HURRICANE IRMA,Peach County,,,
...,...,...,...,...,...,...,...,...
7579,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin County,,,
7580,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer County,,,
7581,Washington,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom County,,,
7582,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson County,,,
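
The merge above leaves `fips`, `latitude`, and `longitude` empty, likely because one table stores "Laurens County" and the other "Laurens". A minimal sketch of checking and repairing such a key mismatch follows; the two small tables and the `county_key` column are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables being joined.
disasters = pd.DataFrame({
    "state": ["Georgia", "Georgia"],
    "county": ["Laurens County", "Long County"],
})
fips = pd.DataFrame({
    "state": ["Georgia", "Georgia"],
    "county": ["Laurens", "Long"],
    "fips": ["13175", "13183"],
})

# Joining on the raw names matches nothing: "Laurens County" != "Laurens".
raw = disasters.merge(fips, on=["state", "county"], how="left", indicator=True)

# Build a shared key with the trailing " County" removed, then join on that.
disasters["county_key"] = disasters["county"].str.replace(r"\s+County$", "", regex=True)
fixed = disasters.merge(
    fips.rename(columns={"county": "county_key"}),
    on=["state", "county_key"],
    how="left",
    indicator=True,
)
```

The `_merge` column added by `indicator=True` reports `left_only` for rows that found no partner, which makes a silent all-NaN join easy to spot.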


In [11]:
af = addfips.AddFIPS()
def find_fips(x):
    county = x['county'].strip()
    state = x['state'].strip()
    fips = af.get_county_fips(county,state=state)
    return str(fips)

tqdm.pandas()
dfrq_df['fips'] = dfrq_df.progress_apply(find_fips,axis=1)

dfrq_df


100%|██████████████████████████████████| 7584/7584 [00:00<00:00, 51686.45it/s]


Unnamed: 0,state,year,disaster_type,disaster_declaration,county,fips
0,Georgia,2017,Hurricane,HURRICANE IRMA,Laurens County,13175
1,Georgia,2017,Hurricane,HURRICANE IRMA,Long County,13183
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee County,13219
3,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe County,13221
4,Georgia,2017,Hurricane,HURRICANE IRMA,Peach County,13225
...,...,...,...,...,...,...
7579,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin County,29069
7580,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer County,47045
7581,Washington,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom County,53073
7582,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson County,47077
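
FIPS codes appear in this notebook both as integers (`1001`) and as strings (`09011`), and `find_fips` returns the string `"None"` when no code is found. One possible way to bring them to a single form is sketched below; `normalize_fips` is a hypothetical helper, not part of `addfips`.

```python
def normalize_fips(value):
    """Return a county FIPS code as a 5-character, zero-padded string, else None."""
    if value is None:
        return None
    text = str(value).strip()
    # Codes that passed through a float column may carry a trailing ".0".
    if text.endswith(".0"):
        text = text[:-2]
    # Rejects "None", "nan", and anything else that is not purely digits.
    if not text.isdigit():
        return None
    return text.zfill(5)
```

For example, `normalize_fips(1001)` gives `"01001"`, so integer and string codes compare equal after conversion.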


In [12]:
dfrq_df['county_state'] = dfrq_df['county']+", " +dfrq_df['state']
geolocator = Nominatim(user_agent='find_long_lat')
def standardize_long(location):
    try:
        output = location.longitude
    except:
        output = None
    return output
def standardize_lat(location):
    try:
        output = location.latitude
    except:
        output = None
    return output
geocode = RateLimiter(geolocator.geocode,min_delay_seconds=1)
tqdm.pandas()
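try:
    dfrq_df['geocode'] = dfrq_df['county_state'].progress_apply(geocode)
except Exception as err:
    # report the failure instead of silently leaving the column unset
    print(f'geocoding stopped early: {err}')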

 62%|█████████████████████▊             | 4732/7584 [1:42:35<47:45,  1.00s/it]RateLimiter caught an error, retrying (0/2 tries). Called with (*('Salt Lake County, Utah',), **{}).
Traceback (most recent call last):
  File "/Users/atomar/opt/anaconda3/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/Users/atomar/opt/anaconda3/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/Users/atomar/opt/anaconda3/lib/python3.9/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/atomar/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/Users/atomar/opt/anaconda3/lib/python3.9/site-packages/ur

100%|███████████████████████████████████| 7584/7584 [2:30:45<00:00,  1.19s/it]
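
The run above issues one request per row (7,584 in total), although many rows repeat the same county. A sketch of looking up each distinct label once and mapping the result back is shown here; `geocode_unique` and `fake_geocode` are hypothetical names, and the stub stands in for the rate-limited geocoder so the example runs offline.

```python
import pandas as pd

def geocode_unique(labels, geocode_fn):
    """Call geocode_fn once per distinct label and map results back to every row."""
    cache = {}
    for label in labels.dropna().unique():
        try:
            cache[label] = geocode_fn(label)
        except Exception:
            # Keep going after a failed lookup; that label maps to None.
            cache[label] = None
    return labels.map(cache.get)

# Hypothetical stand-in for the real geocoder; it records each call it receives.
calls = []
def fake_geocode(label):
    calls.append(label)
    return (0.0, 0.0)

labels = pd.Series(["Essex County, Massachusetts",
                    "York County, Maine",
                    "Essex County, Massachusetts"])
result = geocode_unique(labels, fake_geocode)
```

With the real geocoder passed as `geocode_fn`, the number of requests would drop to the number of distinct `county_state` values.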


In [81]:
dfrq_df.to_csv('datasets/disaster_freq_bubblemap.csv')

In [14]:
dfrq_df['long'] = dfrq_df['geocode'].apply(standardize_long)
dfrq_df['lat'] = dfrq_df['geocode'].apply(standardize_lat)

In [28]:
# rows still missing a value after geocoding; copy to avoid modifying a view
null_df = dfrq_df[dfrq_df.isnull().any(axis=1)].copy()
def standardize_null_county(str_in):
    # drop the parenthesised MSA note; leave every other name unchanged
    if 'MSA' in str_in:
        return str_in.split('(')[0]
    return str_in
null_df['county'] = null_df['county'].apply(standardize_null_county)
null_df['county_state'] = null_df['county']+", " +null_df['state']
null_df['fips'] = null_df.progress_apply(find_fips,axis=1)
try:
    null_df['geocode'] = null_df['county_state'].progress_apply(geocode)
except Exception as err:
    print(f'geocoding stopped early: {err}')

100%|██████████████████████████████████████| 41/41 [00:00<00:00, 10513.33it/s]
100%|█████████████████████████████████████████| 41/41 [00:40<00:00,  1.02it/s]


Unnamed: 0,state,year,disaster_type,disaster_declaration,county,fips,county_state,geocode,long,lat
1516,Connecticut,2018,Tornado,"SEVERE STORMS, TORNADOES, AND STRAIGHT-LINE WINDS",New Haven County,9009,"New Haven County, Connecticut","(New Haven County, Connecticut, United States,...",,
1544,Connecticut,2018,Tornado,"SEVERE STORMS, TORNADOES, AND STRAIGHT-LINE WINDS",Fairfield County,9001,"Fairfield County, Connecticut","(Fairfield County, Connecticut, United States,...",,
1710,Connecticut,2018,Severe Storm(s),SEVERE STORMS AND FLOODING,New London County,9011,"New London County, Connecticut","(New London County, Connecticut, United States...",,
1930,Massachusetts,2018,Severe Storm(s),SEVERE WINTER STORM AND FLOODING,Essex County,25009,"Essex County, Massachusetts","(Essex County, Massachusetts, United States, (...",,
1938,Maine,2018,Coastal Storm,SEVERE STORM AND FLOODING,York County,23031,"York County, Maine","(York County, Maine, United States, (43.422930...",,
1964,Massachusetts,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Essex County,25009,"Essex County, Massachusetts","(Essex County, Massachusetts, United States, (...",,
1979,Massachusetts,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Norfolk County,25021,"Norfolk County, Massachusetts","(Norfolk County, Massachusetts, United States,...",,
1980,Massachusetts,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Worcester County,25027,"Worcester County , Massachusetts","(Worcester County, Massachusetts, United State...",,
1990,Massachusetts,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Middlesex County,25017,"Middlesex County, Massachusetts","(Middlesex County, Massachusetts, United State...",,
1997,New Hampshire,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Rockingham County,33015,"Rockingham County, New Hampshire","(Rockingham County, New Hampshire, United Stat...",,


In [29]:
null_df['long'] = null_df['geocode'].apply(standardize_long)
null_df['lat'] = null_df['geocode'].apply(standardize_lat)
null_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county,fips,county_state,geocode,long,lat
1516,Connecticut,2018,Tornado,"SEVERE STORMS, TORNADOES, AND STRAIGHT-LINE WINDS",New Haven County,9009,"New Haven County, Connecticut","(New Haven County, Connecticut, United States,...",-72.933028,41.4082
1544,Connecticut,2018,Tornado,"SEVERE STORMS, TORNADOES, AND STRAIGHT-LINE WINDS",Fairfield County,9001,"Fairfield County, Connecticut","(Fairfield County, Connecticut, United States,...",-73.37486,41.294307
1710,Connecticut,2018,Severe Storm(s),SEVERE STORMS AND FLOODING,New London County,9011,"New London County, Connecticut","(New London County, Connecticut, United States...",-72.123767,41.491501
1930,Massachusetts,2018,Severe Storm(s),SEVERE WINTER STORM AND FLOODING,Essex County,25009,"Essex County, Massachusetts","(Essex County, Massachusetts, United States, (...",-70.948678,42.676297
1938,Maine,2018,Coastal Storm,SEVERE STORM AND FLOODING,York County,23031,"York County, Maine","(York County, Maine, United States, (43.422930...",-70.654664,43.42293
1964,Massachusetts,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Essex County,25009,"Essex County, Massachusetts","(Essex County, Massachusetts, United States, (...",-70.948678,42.676297
1979,Massachusetts,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Norfolk County,25021,"Norfolk County, Massachusetts","(Norfolk County, Massachusetts, United States,...",-71.182801,42.153861
1980,Massachusetts,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Worcester County,25027,"Worcester County , Massachusetts","(Worcester County, Massachusetts, United State...",-71.867724,42.370578
1990,Massachusetts,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Middlesex County,25017,"Middlesex County, Massachusetts","(Middlesex County, Massachusetts, United State...",-71.396826,42.485452
1997,New Hampshire,2018,Snow,SEVERE WINTER STORM AND SNOWSTORM,Rockingham County,33015,"Rockingham County, New Hampshire","(Rockingham County, New Hampshire, United Stat...",-71.100314,42.996668


In [32]:
frames = [dfrq_df[~dfrq_df.isnull().any(axis=1)],null_df]
dfrq_df = pd.concat(frames)
dfrq_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county,fips,county_state,geocode,long,lat
0,Georgia,2017,Hurricane,HURRICANE IRMA,Laurens County,13175,"Laurens County, Georgia","(Laurens County, Georgia, United States, (32.4...",-82.938894,32.423997
1,Georgia,2017,Hurricane,HURRICANE IRMA,Long County,13183,"Long County, Georgia","(Long County, Georgia, 31316, United States, (...",-81.753725,31.770490
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee County,13219,"Oconee County, Georgia","(Oconee County, Georgia, United States, (33.82...",-83.427592,33.829579
3,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe County,13221,"Oglethorpe County, Georgia","(Oglethorpe County, Georgia, United States, (3...",-83.104938,33.870268
4,Georgia,2017,Hurricane,HURRICANE IRMA,Peach County,13225,"Peach County, Georgia","(Peach County, Georgia, United States, (32.554...",-83.835073,32.554346
...,...,...,...,...,...,...,...,...,...,...
6624,Connecticut,2021,Hurricane,TROPICAL STORM ISAIAS,New London County,09011,"New London County, Connecticut","(New London County, Connecticut, United States...",-72.123767,41.491501
6637,Connecticut,2021,Hurricane,TROPICAL STORM ISAIAS,Litchfield County,09005,"Litchfield County, Connecticut","(Litchfield County, Connecticut, United States...",-73.254305,41.767249
6646,Connecticut,2021,Hurricane,TROPICAL STORM ISAIAS,New Haven County,09009,"New Haven County, Connecticut","(New Haven County, Connecticut, United States,...",-72.933028,41.408200
7360,Connecticut,2021,Hurricane,TROPICAL STORM ISAIAS,Hartford County,09003,"Hartford County, Connecticut","(Hartford County, Connecticut, United States, ...",-72.722324,41.792348


# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [42]:
# Animated bubble map: one point per declaration, one frame per year
fig = px.scatter_geo(dfrq_df,lat='lat',lon='long',hover_name='county_state',animation_frame='year',
                     projection='albers usa',hover_data=['disaster_type'],color='disaster_type')
fig.show()

In [80]:
print(hp_df[['county','state']].to_string())

                            county                   state
0                   Autauga County                 Alabama
1                   Baldwin County                 Alabama
2                   Barbour County                 Alabama
3                      Bibb County                 Alabama
4                    Blount County                 Alabama
5                   Bullock County                 Alabama
6                    Butler County                 Alabama
7                   Calhoun County                 Alabama
8                  Chambers County                 Alabama
9                  Cherokee County                 Alabama
10                  Chilton County                 Alabama
11                  Choctaw County                 Alabama
12                   Clarke County                 Alabama
13                     Clay County                 Alabama
14                 Cleburne County                 Alabama
15                   Coffee County                 Alaba

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*

# Team Contributions

*Specify who in your group worked on which parts of the project.*