# COGS 108 - Final Project (change this to your project's title)

## Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [  ] YES - make available
* [  ] NO - keep private

# Overview

*Fill in your overview here*

# Names

- Ant Man
- Hulk
- Iron Man
- Thor
- Wasp

<a id='research_question'></a>
# Research Question

*Fill in your research question here*

<a id='background'></a>

## Background & Prior Work

*Fill in your background and prior work here* 

References (include links):
- 1)
- 2)

# Hypothesis


*Fill in your hypotheses here*

# Dataset(s)

*Fill in your dataset information here*

(Copy this information for each dataset)
- Dataset Name: 
- Link to the dataset:
- Number of observations:

1-2 sentences describing each dataset. 

If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.

# Setup

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
import os
from pathlib import Path

# converting city to county
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm import tqdm # progress bar for .apply()

# for choropleth
import plotly.express as px
from urllib.request import urlopen
import json

# filter extra noise from warnings
import warnings
warnings.filterwarnings('ignore')

# Statsmodels & patsy
import patsy
import statsmodels.api as sm
from scipy.stats import pearsonr
from scipy.stats import boxcox

# Make plots just slightly bigger for displaying well in notebook
plt.rcParams['figure.figsize'] = (10, 5)

# Displaying figures as image
from IPython.display import Image

# used to convert state/county to fips
import addfips

%config InlineBackend.figure_format ='retina'

# Data Cleaning

Describe your data cleaning steps here.

In [3]:
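hp_df = pd.read_csv('NewHousingPrices2017-2021.csv').drop(columns=['Unnamed: 0'])
# split the combined 'County & State' field on 'County,' into separate columns
hp_df[['county', 'state']] = hp_df['County & State'].str.split('County,', expand=True)
# row-wise mean over the columns from position 3 onward;
# numeric_only skips the text columns that were just added
hp_df['Average_HP'] = hp_df.iloc[:,3:].mean(axis=1, numeric_only=True)
new_cols = ['county','state','FIPS','Average_HP']
hp_df = hp_df.reindex(columns=new_cols)
hp_df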

Unnamed: 0,county,state,FIPS,Average_HP
0,Autauga,Alabama,1001,156341.75
1,Baldwin,Alabama,1003,222907.75
2,Barbour,Alabama,1005,96513.75
3,Bibb,Alabama,1007,103153.25
4,Blount,Alabama,1009,133840.75
...,...,...,...,...
3112,Teton,Wyoming,56039,867572.50
3113,Uinta,Wyoming,56041,185513.50
3114,Washakie,Wyoming,56043,173410.00
3115,Weston,Wyoming,56045,179362.50


In [4]:
def standardize_county(str_in):
    """Remove the area-type suffix from a designated area name.

    Returns None for non-string input and for areas without a recognized
    suffix (such as "Statewide"). Surrounding whitespace is left untouched.
    """
    if not isinstance(str_in, str):
        return None
    for suffix in ('(County)', '(Borough)', '(Census Subarea)', '(Parish)'):
        if suffix in str_in:
            return str_in.replace(suffix, '')
    return None


def standardize_year(str_in):
    try:
        # keep only the date part before the 'T' separator, then extract the year
        date_part = str_in.split('T')[0]
        output = pd.to_datetime(date_part).year
    except Exception:
        # non-string or unparseable input
        output = None

    return output
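
As an illustration of what the suffix handling is meant to achieve, the sketch below strips any trailing parenthesised designation with a regular expression and also trims the leftover whitespace. It is an alternative sketch, not the function used in this notebook: `strip_designation` is a hypothetical name, and treating every trailing `(...)` group as a designation is an assumption that would also match suffixes the function above does not list.

```python
import re

# Matches optional whitespace followed by a trailing "(...)" group.
_DESIGNATION = re.compile(r"\s*\([^)]*\)\s*$")

def strip_designation(area):
    """Return the area name without a trailing '(...)' suffix, or None.

    Hypothetical helper: non-string input and names without a suffix
    (for example 'Statewide') map to None.
    """
    if not isinstance(area, str):
        return None
    if not _DESIGNATION.search(area):
        return None
    return _DESIGNATION.sub("", area).strip()

print(strip_designation("Clay (County)"))    # Clay
print(strip_designation("Acadia (Parish)"))  # Acadia
print(strip_designation("Statewide"))        # None
```

Trimming the whitespace at this stage means the cleaned names can be compared directly with county names from other tables.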

In [5]:
disaster_type_df = pd.read_csv('datasets/DisasterDeclarationsSummaries.csv')

# select a subset of the columns
wanted_columns = ['state', 'declarationDate','incidentType','declarationTitle','designatedArea']

# rename the columns
disaster_type_df = disaster_type_df[wanted_columns].rename(columns={"declarationDate":"year", "designatedArea": "county", "incidentType":"disaster_type", "declarationTitle":"disaster_declaration"})

# Set "Statewide" to None and strip "(County)" from all counties
disaster_type_df['county'] = disaster_type_df['county'].apply(standardize_county)

# filter dataset to only include rows with a non-null county
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
disaster_type_df = disaster_type_df[disaster_type_df['county'].notna()].copy()

# strip year column to only include year
disaster_type_df['year'] = disaster_type_df['year'].apply(standardize_year)

# sort by year
disaster_type_df = disaster_type_df.sort_values('year').reset_index(drop = True)

# check that no NaNs remain
assert disaster_type_df.isna().sum().sum() == 0

disaster_type_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county
0,IN,1959,Flood,FLOOD,Clay
1,WA,1964,Flood,HEAVY RAINS & FLOODING,Wahkiakum
2,WA,1964,Flood,HEAVY RAINS & FLOODING,Skamania
3,WA,1964,Flood,HEAVY RAINS & FLOODING,Pierce
4,WA,1964,Flood,HEAVY RAINS & FLOODING,Pacific
...,...,...,...,...,...
57602,MO,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin
57603,TN,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer
57604,WA,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom
57605,TN,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson


In [6]:
state_abv = {'AK': 'Alaska','AL': 'Alabama','AR': 'Arkansas','AZ': 'Arizona','CA': 'California','CO': 'Colorado',
             'CT': 'Connecticut','DE': 'Delaware','FL': 'Florida','GA': 'Georgia','HI': 'Hawaii','IA': 'Iowa',
             'ID': 'Idaho','IL': 'Illinois','IN': 'Indiana','KS': 'Kansas','KY': 'Kentucky','LA': 'Louisiana',
             'MA': 'Massachusetts','MD': 'Maryland','ME': 'Maine','MI': 'Michigan','MN': 'Minnesota',
             'MO': 'Missouri','MS': 'Mississippi','MT': 'Montana','NC': 'North Carolina','ND': 'North Dakota',
             'NE': 'Nebraska','NH': 'New Hampshire','NJ': 'New Jersey','NM': 'New Mexico','NV': 'Nevada',
             'NY': 'New York','OH': 'Ohio','OK': 'Oklahoma','OR': 'Oregon','PA': 'Pennsylvania',
             'RI': 'Rhode Island','SC': 'South Carolina','SD': 'South Dakota','TN': 'Tennessee',
             'TX': 'Texas','UT': 'Utah','VA': 'Virginia','VT': 'Vermont','WA': 'Washington',
             'WI': 'Wisconsin','WV': 'West Virginia','WY': 'Wyoming'
            }
disaster_type_df['state'] = disaster_type_df['state'].map(state_abv)
dfrq_df = disaster_type_df[disaster_type_df['year'] >= 2017].reset_index(drop=True)
dfrq_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county
0,Georgia,2017,Hurricane,HURRICANE IRMA,Laurens
1,Georgia,2017,Hurricane,HURRICANE IRMA,Long
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee
3,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe
4,Georgia,2017,Hurricane,HURRICANE IRMA,Peach
...,...,...,...,...,...
13769,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin
13770,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer
13771,Washington,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom
13772,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson


In [7]:
df = pd.merge(hp_df,dfrq_df, how='left',on=['state','county'])
# NOTE: fillna with a Series aligns on the row index, not on county/state, so
# unmatched rows are filled from whichever declaration shares their index label
df['disaster_type'] = df['disaster_type'].fillna(disaster_type_df['disaster_type'])
new_cols = ['county','state','FIPS','Average_HP','disaster_type']
df = df.reindex(columns=new_cols)
df

Unnamed: 0,county,state,FIPS,Average_HP,disaster_type
0,Autauga,Alabama,1001,156341.75,Flood
1,Baldwin,Alabama,1003,222907.75,Flood
2,Barbour,Alabama,1005,96513.75,Flood
3,Bibb,Alabama,1007,103153.25,Flood
4,Blount,Alabama,1009,133840.75,Flood
...,...,...,...,...,...
3112,Teton,Wyoming,56039,867572.50,Severe Storm(s)
3113,Uinta,Wyoming,56041,185513.50,Severe Storm(s)
3114,Washakie,Wyoming,56043,173410.00,Flood
3115,Weston,Wyoming,56045,179362.50,Severe Storm(s)
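
One thing to keep in mind when reading the table above: `Series.fillna` with another Series aligns on the row index, not on county or state. The toy example below shows a missing entry being filled from whatever row happens to share its index label. All names and values in it are made up for illustration.

```python
import pandas as pd

# Left table: one row per county, disaster type missing where the join found nothing.
left = pd.DataFrame({
    "county": ["A", "B", "C"],
    "disaster_type": [None, "Fire", None],
})

# Unrelated table that merely shares the index labels 0, 1, 2.
other = pd.Series(["Flood", "Tornado", "Hurricane"], name="disaster_type")

# fillna matches on index labels: row 0 gets other[0], row 2 gets other[2].
filled = left["disaster_type"].fillna(other)
print(filled.tolist())  # ['Flood', 'Fire', 'Hurricane']
```

If the intent is to mark counties without a matching declaration, filling with a constant label (or leaving the value missing) keeps that information visible.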


In [92]:
fips_df = pd.read_csv('datasets/county_fips_lat_long.csv', encoding = 'latin1')
fips_df = fips_df.rename(str.lower,axis='columns')
fips_df['state'] = fips_df['state'].map(state_abv)
fips_df = fips_df[1:]
fips_df = fips_df[['state','fips','county [2]','latitude','longitude']]
fips_df = fips_df.rename(columns={'county [2]':'county'})
fips_df

Unnamed: 0,state,fips,county,latitude,longitude
1,Alabama,01001,Autauga,+32.536382°,86.644490°
2,Alabama,01003,Baldwin,+30.659218°,87.746067°
3,Alabama,01005,Barbour,+31.870670°,85.405456°
4,Alabama,01007,Bibb,+33.015893°,87.127148°
5,Alabama,01009,Blount,+33.977448°,86.567246°
...,...,...,...,...,...
3139,Wyoming,56037,Sweetwater,+41.660339°,108.875676°
3140,Wyoming,56039,Teton,+44.049321°,110.588102°
3141,Wyoming,56041,Uinta,+41.284726°,110.558947°
3142,Wyoming,56043,Washakie,+43.878831°,107.669052°
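
The latitude and longitude columns above are strings such as `+32.536382°`, so they need converting before any numeric use. A minimal sketch of that conversion follows. `parse_degrees` is a hypothetical helper, and it deliberately does not guess a sign for values printed without one (as the longitudes above are).

```python
import re

# Optional sign (ASCII minus, Unicode minus or en dash), digits, optional decimals.
_DEGREES = re.compile(r"^\s*([+\-\u2212\u2013]?)\s*(\d+(?:\.\d+)?)\s*°?\s*$")

def parse_degrees(text):
    """Convert a string like '+32.536382°' to a float, or return None."""
    if not isinstance(text, str):
        return None
    match = _DEGREES.match(text)
    if match is None:
        return None
    sign, digits = match.groups()
    value = float(digits)
    return -value if sign in ("-", "\u2212", "\u2013") else value

print(parse_degrees("+32.536382°"))  # 32.536382
print(parse_degrees("86.644490°"))   # 86.64449
```

Applied to the table, this would look like `fips_df['latitude'].map(parse_degrees)`. Whether unsigned longitudes should be negated is a separate decision that depends on how the source file encodes west longitudes.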


In [91]:
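# `fips` is not defined anywhere in this notebook; fips_df restricted to the
# key and FIPS columns is assumed here
merged_df = pd.merge(dfrq_df, fips_df[['state','county','fips']], on=['state','county'], how='left')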
merged_df

Unnamed: 0,state,year,disaster_type,disaster_declaration,county,fips
0,Georgia,2017,Hurricane,HURRICANE IRMA,Laurens,
1,Georgia,2017,Hurricane,HURRICANE IRMA,Long,
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee,
3,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe,
4,Georgia,2017,Hurricane,HURRICANE IRMA,Peach,
...,...,...,...,...,...,...
13769,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin,
13770,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer,
13771,Washington,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom,
13772,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson,
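
An all-empty `fips` column after a left merge usually means that no key pairs matched. A quick way to check is pandas' `indicator=True` option, sketched here on toy frames with a deliberate trailing space in one key. The frames are illustrative assumptions, not the project data.

```python
import pandas as pd

declarations = pd.DataFrame({
    "state": ["Georgia", "Georgia"],
    "county": ["Laurens ", "Long "],   # trailing spaces in the join key
})
lookup = pd.DataFrame({
    "state": ["Georgia", "Georgia"],
    "county": ["Laurens", "Long"],
    "fips": ["13175", "13183"],
})

# indicator=True adds a _merge column: 'both' for matches, 'left_only' otherwise.
raw = declarations.merge(lookup, on=["state", "county"], how="left", indicator=True)
print(raw["_merge"].tolist())   # ['left_only', 'left_only']

# Normalising the key before merging lets the rows match.
cleaned = declarations.assign(county=declarations["county"].str.strip())
fixed = cleaned.merge(lookup, on=["state", "county"], how="left", indicator=True)
print(fixed["fips"].tolist())   # ['13175', '13183']
```

Counting the `_merge` values (for example with `value_counts()`) gives a one-line summary of how many rows failed to match.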


In [133]:
af = addfips.AddFIPS()
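def find_fips(x):
    """Look up the county FIPS code for one row of the dataframe."""
    county = x['county'].strip()
    state = x['state'].strip()
    # return the lookup result as-is so a failed lookup stays None
    # instead of becoming the string 'None'
    return af.get_county_fips(county,state=state)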
# def findLong(x):
#     county = x['county'].strip()
#     state = x['state'].strip()
#     county_state = county + ' County, ' + state
#     geolocator = Nominatim(user_agent='find_long')
#     loc = geolocator.geocode(county_state)
#     return loc.longitude
# def findLat(x):
#     county = x['county'].strip()
#     state = x['state'].strip()
#     county_state = county + ' County, ' + state
#     geolocator = Nominatim(user_agent='find_long')
#     loc = geolocator.geocode(county_state)
#     return loc.latitude
tqdm.pandas()
dfrq_df['fips'] = dfrq_df.progress_apply(find_fips,axis=1)
# dfrq_df['long'] = dfrq_df.progress_apply(findLong,axis=1)
# dfrq_df['lat'] = dfrq_df.progress_apply(findLat,axis=1)
# long = []
# lat = []
# for i in range(len(dfrq_df)):
#     if findGeocode(dfrq_df.loc[i,'county'],dfrq_df.loc[i,'state']) != None:
#         loc = findGeocode(dfrq_df.loc[i,'county'],dfrq_df.loc[i,'state'])
#         lat.append(loc.latitude)
#         long.append(loc.longitude)
#         print(i)
#     else:
#         lat.append(np.nan)
#         long.append(np.nan)
# dfrq_df['long'] = long
# dfrq_df['lat'] = lat
dfrq_df


100%|██████████████████████████████████| 13774/13774 [00:00<00:00, 51961.86it/s]


Unnamed: 0,state,year,disaster_type,disaster_declaration,county,fips
0,Georgia,2017,Hurricane,HURRICANE IRMA,Laurens,13175
1,Georgia,2017,Hurricane,HURRICANE IRMA,Long,13183
2,Georgia,2017,Hurricane,HURRICANE IRMA,Oconee,13219
3,Georgia,2017,Hurricane,HURRICANE IRMA,Oglethorpe,13221
4,Georgia,2017,Hurricane,HURRICANE IRMA,Peach,13225
...,...,...,...,...,...,...
13769,Missouri,2022,Severe Storm(s),"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dunklin,29069
13770,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Dyer,47045
13771,Washington,2022,Flood,"SEVERE STORMS, STRAIGHT-LINE WINDS, FLOODING, ...",Whatcom,53073
13772,Tennessee,2022,Tornado,"SEVERE STORMS, STRAIGHT-LINE WINDS, AND TORNADOES",Henderson,47077


In [150]:
dfrq_df['county_state'] = dfrq_df['county']+", " +dfrq_df['state']
geolocator = Nominatim(user_agent='find_long_lat')
def standardize_long(location):
    # geocode returns None when nothing is found, so fall back to None
    return getattr(location, 'longitude', None)
def standardize_lat(location):
    return getattr(location, 'latitude', None)
# wait at least one second between requests to the geocoding service
geocode = RateLimiter(geolocator.geocode,min_delay_seconds=1)
tqdm.pandas()
try:
    dfrq_df['geocode'] = dfrq_df['county_state'].progress_apply(geocode)
except Exception as err:
    # report the failure instead of silently discarding it
    print(f'geocoding stopped early: {err}')

100%|███████████████████████████████████| 13774/13774 [4:30:36<00:00,  1.18s/it]
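
The run above issued one rate-limited request per row (13,774 rows at a minimum of one second each). If county/state pairs repeat across declarations, looking up each distinct string once and mapping the results back would cut the number of requests. The sketch below shows the idea with a stand-in lookup function. `fake_geocode`, `geocode_unique` and the sample strings are hypothetical placeholders, not part of geopy.

```python
calls = []

def fake_geocode(query):
    """Stand-in for a slow, rate-limited geocoder (hypothetical)."""
    calls.append(query)
    return (len(query), -len(query))  # dummy (lat, lon) pair

def geocode_unique(queries, lookup):
    """Call `lookup` once per distinct query and map results back in order."""
    cache = {}
    for query in queries:
        if query not in cache:
            cache[query] = lookup(query)
    return [cache[query] for query in queries]

queries = ["Laurens, Georgia", "Long, Georgia", "Laurens, Georgia"]
results = geocode_unique(queries, fake_geocode)
print(len(results), len(calls))  # 3 2
```

With the notebook's objects, the same idea would be to geocode the values of `dfrq_df['county_state'].unique()` and then use `Series.map` with the resulting dictionary.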


In [185]:
geolocator.geocode('Acadia Parish, Louisiana')

Location(Acadia Parish, Louisiana, United States, (30.2740735, -92.3957036, 0.0))

In [186]:
disaster_type_df['county'][0]

'Clay '

# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [9]:
import plotly.graph_objects as go


{'Laurens ': 'Georgia',
 'Long ': 'Georgia',
 'Oconee ': 'Georgia',
 'Oglethorpe ': 'Georgia',
 'Peach ': 'Georgia',
 'Miller ': 'Missouri',
 'Pickens ': 'Georgia',
 'Schley ': 'Georgia',
 'Tift ': 'Georgia',
 'Tattnall ': 'Georgia',
 'Toombs ': 'Georgia',
 'Marion ': 'Kentucky',
 'Gwinnett ': 'Georgia',
 'Hall ': 'Texas',
 'Montgomery ': 'Kentucky',
 'Macon ': 'Missouri',
 'Seminole ': 'Oklahoma',
 'Jeff Davis ': 'Texas',
 'Heard ': 'Georgia',
 'Dade ': 'Missouri',
 'Telfair ': 'Georgia',
 'Screven ': 'Georgia',
 'Spalding ': 'Georgia',
 'Butts ': 'Georgia',
 'Crawford ': 'Iowa',
 'Hart ': 'Kentucky',
 'Jasper ': 'Texas',
 'Pulaski ': 'Kentucky',
 'Stephens ': 'Oklahoma',
 'Fulton ': 'Pennsylvania',
 'Hancock ': 'Mississippi',
 'Johnson ': 'Kentucky',
 'Mitchell ': 'Texas',
 'Taliaferro ': 'Georgia',
 'Walker ': 'Texas',
 'Burke ': 'North Carolina',
 'Clarke ': 'Mississippi',
 'Echols ': 'Georgia',
 'Towns ': 'Georgia',
 'Columbia ': 'Washington',
 'Emanuel ': 'Georgia',
 'Pike ': 'Ke

In [41]:
rows = [{'county': 'Georgetown - Quitman ', 'state': 'Georgia'},{'county': 'Laurens', 'state': 'Georgia'}]
for row in rows:
    print(af.add_county_fips(row,'county', 'state'))

{'county': 'Georgetown - Quitman ', 'state': 'Georgia', 'fips': None}
{'county': 'Laurens', 'state': 'Georgia', 'fips': '13175'}


In [36]:
# one {'county': ..., 'state': ...} dict per row of dfrq_df
rows = dfrq_df[['county','state']].to_dict('records')

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*

# Team Contributions

*Specify who in your group worked on which parts of the project.*