In [570]:
import pandas as pd
import json
import altair as alt
import numpy as np
from vega_datasets import data as datasets
from scipy.optimize import curve_fit

# Billion Dollar Weather and Climate Disasters

#### Motivation

It comes as no surprise that climate-related disasters have been on the rise throughout the world for the past few decades. In the last few years in particular, record-shattering disasters seem to have become troublingly commonplace. The disasters of 2020 were historic not just for their intensity, but also for their diversity: in the US, California's forests were engulfed in one of the worst wildfire seasons on record while 30 named storms brewed in the Atlantic.

#### Billion-Dollar Disasters
In 2020, the US witnessed 22 distinct climate disaster events costing upwards of 1 billion dollars each, causing cumulative economic damage of around 95 billion dollars. While 2020 was a particularly devastating year, such disasters have been steadily on the rise over the past several years. The National Oceanic and Atmospheric Administration (NOAA) has been recording such 'billion-dollar disasters' for at least the past four decades. Over the years, these billion-dollar disasters have begun accounting for larger and larger proportions of the total damages from climate-related events in the US. In fact, these events account for more than 80% of the damage from all recorded US weather and climate events.

#### The Dataset
In this report, we dig into NOAA's [billion-dollar disaster dataset](https://www.ncdc.noaa.gov/billions/overview), which records the total number of annual billion-dollar disasters from 1980-2020, broken down into seven categories: _droughts, flooding, freeze, severe storms, tropical cyclones, wildfires, and winter storms_. Besides raw counts, the dataset also provides estimates of the total number of deaths and the economic cost (in billions of dollars) attributed to each disaster. NOAA provides this information aggregated over the entire US, as well as across various climate regions.

All the data used in this report was obtained from the [Billion-Dollar Weather and Climate Disasters: Time Series](https://www.ncdc.noaa.gov/billions/time-series) page, followed by some manual data cleaning. Besides that, we rely on Altair for visualization, scipy for some analysis, and this [github gist](https://gist.github.com/rogerallen/1583593) for a useful dictionary to convert between US state names and abbreviations. We'll delve into further details of the dataset as we encounter them.

#### The Questions
We will attempt to answer the following questions, primarily through visualizations:
1. Have billion-dollar climate disasters been occurring more frequently?
2. Have certain types of disasters been occurring more frequently than others?
3. Do certain types of disasters cause more damage (economic and loss of life) than others?
4. Are there any geographic patterns in the changes in frequency and intensity of these disasters? Have certain parts of the US been hit harder than others?



# Load Data

In [571]:
# Load USA level data
usa_data = pd.read_csv('data/usa_data.csv')

# Load state level data
data = pd.read_csv('data/full_data.csv')

# Load regions dictionary
with open('data/regions.json', 'r') as j:
    regions_dict = json.loads(j.read())
    
# Load state abbreviations dictionary
with open('data/us_state_abbrv.json', 'r') as j:
    state_abbrv_dict = json.loads(j.read())

## Billion-Dollar Disasters in the US Over the Years

Let's start by looking at the trends in the US in general. We'll plot the total number of billion-dollar disasters each year across the entire United States. We'll also fit a curve to the data and extrapolate it over the next few years to get a sense of what we can expect.

In [572]:
data = usa_data
disaster = 'all-disasters'
metric = 'count'

# Select out rows from dataset for the disaster
all_disaster_data = data.loc[data['disaster'] == disaster, :].reset_index(drop=True)

# Get metric data as our targets
regression_data = data.loc[data['disaster'] == disaster,metric].reset_index(drop=True)

# Define a fit function for a given exponent n
def fit_func(n): 
    return lambda x,a,b: a + b*x**n

# Set the list of exponents to try
exponents = {'x^2': 2,'x^3': 3,'x^4': 4,'x^5': 5}
colors = ['red','blue','green','orange']
num_future_yrs = 15

preds = np.zeros((len(exponents),len(regression_data) + num_future_yrs-1))
rmse_errs = np.zeros(len(exponents))

for idx,exp_name in enumerate(exponents.keys()):
    fn = fit_func(exponents[exp_name])
    params, cov = curve_fit(fn, np.arange(1,len(regression_data)+1), regression_data)
    preds[idx,:] = fn(np.arange(1,len(regression_data)+num_future_yrs), params[0], params[1])
    rmse_errs[idx] = np.sqrt(np.mean((preds[idx, 0:len(regression_data)] - regression_data)**2))

# Collect the predictions    
pred_df = pd.DataFrame(preds).T
pred_df.columns = exponents.keys()
pred_df['year'] = np.arange(1980,2020+num_future_yrs)
pred_df = pred_df.reset_index(drop=True).melt('year') # Turn columns to rows


# Plot the raw data for disasters as points
chart = alt.Chart(all_disaster_data).mark_point().encode(
    alt.X('year:N', axis=alt.Axis(title='Years', values=np.arange(1980,2020+num_future_yrs,5), labelAngle=0)),
    alt.Y('{}'.format(metric), axis=alt.Axis(title='Number of Disasters'))
).properties(
    width=850,
    height=450
)

# Plot the fitted curves
fit_charts = alt.Chart(pred_df).mark_line().encode(
    alt.X('year:N'),
    alt.Y('value:Q'),
    color='variable:N',
    opacity = alt.condition(alt.datum.variable == 'x^3', alt.value(1.0), alt.value(0.3))
)

# Add label texts for fit curves
annotations = pd.DataFrame({'x': [2034]*4, 'y': [73,55,40,30], 'text': ['y=x^5', 
                                                                        'y=x^4',
                                                                        'y=x^3',
                                                                        'y=x^2']})
fit_label_chart = alt.Chart(annotations).mark_text(size=10).encode(
    x = alt.X('x:N'),
    y = alt.Y('y:Q'),
    text = 'text',
)

# Shade the Observed and Predicted Regions
cutoffs = pd.DataFrame({
    'start': [1979,2021],
    'stop': [2021,2021+num_future_yrs]
})
shade_chart = alt.Chart(
    cutoffs.reset_index()
).mark_rect(
    opacity=0.2
).encode(
    x='start:N',
    x2='stop:N',
    y=alt.value(0),
    y2=alt.value(450),
    color=alt.Color('index:N', legend=None)
)

# Add text for 'observed' and 'predicted'
annotations = pd.DataFrame({'x': [1984,2025], 'y': [70,70], 'text': ['Observed', 'Predicted']})
text_chart = alt.Chart(annotations).mark_text(size=20).encode(
    x = alt.X('x:N'),
    y = alt.Y('y:Q'),
    text = 'text',
)

# Extra comment annotations
annotations = pd.DataFrame({'x': [2020], 'y': [25], 'text': ['Record number of 22 disasters in 2020!']})
comments_chart = alt.Chart(annotations).mark_text(size=12).encode(
    x = alt.X('x:N'),
    y = alt.Y('y:Q'),
    text = 'text',
)


chart + fit_charts + shade_chart + text_chart + fit_label_chart + comments_chart

We see that there is clearly an increase in the number of annual billion-dollar disasters over the years. In 2020, we saw a record number of 22 billion-dollar disasters.

We've also drawn up some projections based on different polynomial fits to the data. Among the options tried, the cubic polynomial fit the data best. If this trend were to continue, we'd need to brace ourselves for around 35 billion-dollar disasters a year within the projected period.
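
Because the model `a + b*x**n` is linear in `a` and `b` for a fixed exponent `n`, the same kind of fit can be reproduced with ordinary least squares. The following is a minimal sketch on synthetic, noise-free data; the values and the helper `fit_power_model` are made up for illustration and are not part of the notebook. It compares exponents by root-mean-square error.

```python
import numpy as np

def fit_power_model(x, y, n):
    """Least-squares fit of y ~ a + b * x**n; returns (a, b, rmse)."""
    xn = x.astype(float) ** n
    design = np.column_stack([np.ones_like(xn), xn])
    (a, b), *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - (a + b * xn)
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    return float(a), float(b), rmse

# Synthetic counts that follow an exact cubic trend (illustrative values only)
x = np.arange(1, 42)            # 41 "years", indexed from 1
y = 2.0 + 3e-4 * x ** 3

# Fit each candidate exponent and keep the one with the smallest RMSE
fits = {n: fit_power_model(x, y, n) for n in (2, 3, 4, 5)}
best_n = min(fits, key=lambda n: fits[n][2])
```

On real counts the errors will likely be much closer together, so the ranking of exponents is better read as a rough guide than as evidence for any particular growth law.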

Let us now see how these disasters break down by category.

In [573]:
chart = alt.Chart(usa_data).mark_bar().transform_filter(
    alt.datum.disaster != 'all-disasters'
).encode(
    x = alt.X('year:N', axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    y = alt.Y('count:Q'),
    color = alt.Color('disaster:N'),
    order= alt.Order(
      'count',
      sort='ascending'
    )
)


# Shade the Observed and Predicted Regions
cutoffs = pd.DataFrame({
    'start': [1980,2007],
    'stop': [2007, 2021]
})
shade_chart = alt.Chart(
    cutoffs.reset_index()
).mark_rect(
    opacity=0.2
).encode(
    x='start:N',
    x2='stop:N',
    y=alt.value(0),
    y2=alt.value(50),
    color=alt.Color('index:N', legend=None)
)

# Add comment annotations
annotations = pd.DataFrame({'x': [2010, 2007], 'y': [20, 21], 'text': ['Severe storms begin driving the increase in annual disasters',
                                                                      '2007']})
comments_chart = alt.Chart(annotations).mark_text(size=12).encode(
    x = alt.X('x:N'),
    y = alt.Y('y:Q'),
    text = 'text',
)

chart 
# + comments_chart + shade_chart

In the above plot, we have sorted each of the stacks by the number of times that disaster occurred in the given year, i.e. the most commonly occurring disaster is at the top of the stack.

In the last two decades, the bulk of the billion-dollar disasters have been severe storms. The annual number of severe storms has increased substantially in that time, starting sometime around 2007. In fact, it appears that the increase in billion-dollar disasters is being driven mostly by severe storms.
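
One way to back up this reading with numbers is to compute each disaster type's share of the yearly total. The sketch below uses a small made-up frame with the same `year`, `disaster`, and `count` columns as `usa_data`; the disaster labels and counts are hypothetical.

```python
import pandas as pd

# Made-up counts for two years (not taken from the NOAA data)
toy = pd.DataFrame({
    'year':     [2006] * 3 + [2016] * 3,
    'disaster': ['severe-storm', 'flooding', 'drought'] * 2,
    'count':    [2, 1, 1, 8, 1, 1],
})

# Drop the aggregate rows, then divide each count by that year's total
toy = toy.loc[toy['disaster'] != 'all-disasters'].copy()
toy['share'] = toy['count'] / toy.groupby('year')['count'].transform('sum')

# Share of severe storms per year
storm_share = toy.loc[toy['disaster'] == 'severe-storm'].set_index('year')['share']
```

A rising `storm_share` over the years would be consistent with severe storms driving the overall increase.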

The above plot clearly shows how severe storms have dominated the billion-dollar disasters in the past decade. However, it is a bit difficult to get a sense of how the frequency of different disasters has been changing over time. This is easier to see in the following plot.

In [574]:
top_chart = alt.Chart(usa_data).mark_circle().transform_filter(
    alt.datum.disaster != 'all-disasters'
).encode(
    x = alt.X('year:N', axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    y = alt.Y('disaster:N',  axis=alt.Axis(title=None)),
    size = 'count'
).properties(
    width = 800,
    height=400
)

bottom_chart = alt.Chart(usa_data).mark_circle().transform_filter(
    alt.datum.disaster == 'all-disasters'
).encode(
    x = alt.X('year:N', axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    y = alt.Y('disaster:N', axis=alt.Axis(title=None)),
    size = 'count'
).properties(
    width = 800
)

alt.vconcat(top_chart, bottom_chart)

We now see some interesting patterns emerging, other than how severe-storms have been increasing in frequency. 

First, we see that wildfires have become a more regular occurrence in the past two decades. Prior to 2000, it was common to have many years without any billion-dollar wildfires; there were none from 1980 to 1990. Since 2000, there have been fewer spells of years without any wildfires at all.

Freezes and winter storms seem to have become less frequent in the last two decades. However, floods and droughts, disasters at two opposite extremes, both seem to be occurring more frequently in the last decade.


This figure suggests that the last two decades have seen some marked changes. To see how they compare with the two earlier decades, let's plot the same frequency data as a difference from the 1980-1999 average, which we use as a baseline.

In [575]:
cutoff_year = 1999

# Get baseline averages over 1980-1999 (cutoff year included)
baseline_data = usa_data.loc[usa_data['year'] <= cutoff_year, ['count','disaster']].reset_index(drop=True)
baseline_data = baseline_data.groupby(['disaster']).mean().reset_index()
baseline_data = baseline_data.rename(columns={'count':'avg_count'})

# Add average info and diff from avg to our data
usa_data_avg = usa_data.merge(baseline_data, left_on='disaster',right_on='disaster')
usa_data_avg['avg_diff'] = usa_data_avg['count'] - usa_data_avg['avg_count']

alt.Chart(usa_data_avg).mark_bar().encode(
    x = alt.X('year:N', axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    y = alt.Y('avg_diff:Q', axis=alt.Axis(title='Difference from 1980-1999 Average')),
    color=alt.condition(
        alt.datum.avg_diff > 0,
        alt.value("red"),  # The positive color
        alt.value("green")  # The negative color
)).transform_filter(
    alt.datum.year > cutoff_year
).facet(
    facet = 'disaster:N',
    columns=2
)

We see that most disasters have become more frequent in comparison to the 1980-1999 average. Two notable exceptions are ``winter-storms`` and ``freezes``.
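
The baseline comparison can be checked by hand on a toy frame. This is a sketch with made-up counts, assuming the same `year`, `disaster`, and `count` columns as `usa_data`; it includes the cutoff year itself in the baseline so that the baseline ends in 1999.

```python
import pandas as pd

# Made-up counts for two disaster types around the cutoff
toy = pd.DataFrame({
    'year':     [1998, 1999, 2000, 2001] * 2,
    'disaster': ['flooding'] * 4 + ['freeze'] * 4,
    'count':    [1, 3, 4, 2, 2, 0, 0, 1],
})
cutoff_year = 1999

# Per-disaster average over the baseline period (up to and including the cutoff)
baseline = (toy.loc[toy['year'] <= cutoff_year]
               .groupby('disaster')['count'].mean()
               .rename('avg_count'))

# Attach the baseline to every row and take the difference
toy = toy.merge(baseline, left_on='disaster', right_index=True)
toy['avg_diff'] = toy['count'] - toy['avg_count']

# Only the years after the cutoff are compared against the baseline
recent = toy.loc[toy['year'] > cutoff_year, ['year', 'disaster', 'avg_diff']]
```

Here flooding has a baseline of 2.0 and freeze a baseline of 1.0, so a positive `avg_diff` marks a year above the baseline and a negative one a year below it.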

#### Deaths due to Disasters

Let us now look at the economic cost of these disasters and the deaths they caused.

In [576]:
width = 800

# Death Charts
chart_death = alt.Chart(usa_data).mark_bar().encode(
    x = alt.X('year:N', axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    y = alt.Y('deaths'),
    color='disaster',
    order =  alt.Order(
      'deaths:N',
      sort='ascending'
    )
).transform_filter(
    alt.datum.disaster != 'all-disasters'
).properties(
    width = width
)

# Add comment annotations
# Extra comment annotations
annotations = pd.DataFrame({'x': [1990, 2011], 
                            'y': [1800, 1000], 
                            'text': ['1980-2000: Droughts caused the most fatalities',
                                     '2000-2020: Droughts not as life-threatening']})
comments_chart_death = alt.Chart(annotations).mark_text(size=12).encode(
    x = alt.X('x:N'),
    y = alt.Y('y:Q'),
    text = 'text',
)

# Money Charts
chart_money = alt.Chart(usa_data).mark_bar().encode(
    x = alt.X('year:N', axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    y = alt.Y('cost'),
    color='disaster',
    order =  alt.Order(
      'cost:N',
      sort='ascending'
    )
).transform_filter(
    alt.datum.disaster != 'all-disasters'
).properties(
    width = width
)

# Extra comment annotations
annotations = pd.DataFrame({'x': [2005, 2012, 2017], 
                            'y': [250, 150, 340], 
                            'text': ['Hurricane Katrina?', 'Hurricane Sandy?', 'Hurricane Harvey?']})
comments_chart_money = alt.Chart(annotations).mark_text(size=12).encode(
    x = alt.X('x:N'),
    y = alt.Y('y:Q'),
    text = 'text',
)


# Timeline Charts
timeline_df = pd.DataFrame()
timeline_df['year'] = usa_data['year'].unique()
events_dict = {1980: 'Historic Heatwaves',
              2005: 'Hurricane Katrina',
              2012: 'Hurricane Sandy',
              2017: 'Hurricane Harvey'}
timeline_df['events'] = timeline_df['year'].map(events_dict)
timeline_df['events_flag'] = 1 - timeline_df['events'].isna()


# Create timeline chart
timeline_chart = alt.Chart(timeline_df).mark_point().encode(
    x = alt.X('year:O', axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    size = alt.Size('events_flag', legend=None)
).properties(
    width = width,
    height = 50
)
# Extra comment annotations
annotations = pd.DataFrame({'x': list(events_dict.keys()), 
                            'y': [0.1]*len(events_dict.keys()), 
                            'text': [events_dict[x] for x in events_dict.keys()]})
comments_chart_timeline = alt.Chart(annotations).mark_text(size=12).encode(
    x = alt.X('x:N'),
    y = alt.Y('y:Q', axis=alt.Axis(labels=False, title=None)),
    text = 'text',
)

In [577]:
alt.vconcat(timeline_chart + comments_chart_timeline,
            alt.vconcat(chart_money,
           chart_death + comments_chart_death))

We notice two main things here.
1) Fortunately, there does not seem to be a sustained increase in the number of deaths over the years, i.e. the billion-dollar disasters have not been becoming more and more deadly.
2) A few devastating events account for most of the deaths. From 1980-2000, the deadliest of these events tended to be droughts. In the last two decades, we do not see as much loss of life due to droughts, and there seem to have been fewer deaths overall. The main killer in the past two decades has instead been tropical cyclones.

Comparing the two graphs above, we see that both economic costs and deaths spike at some notable events. Let's see if these variables are actually correlated.

In [578]:
full_plot = alt.Chart(usa_data).mark_point(size=100).encode(
    alt.X('deaths:Q'),
    alt.Y('cost:Q'),
    alt.Color('disaster:N')
).transform_filter(
    alt.datum.disaster != 'all-disasters'
).properties(
    width = 200,
    height=500
)

In [579]:
red_plot1 = alt.Chart(usa_data).mark_point(size=100).encode(
    alt.X('deaths:Q'),
    alt.Y('cost:Q'),
    alt.Color('disaster:N')
).transform_filter(
    alt.datum.disaster != 'all-disasters'
).transform_filter(
    alt.datum.cost < 200
).properties(
    width = 200,
    height=500
)

red_plot2 = alt.Chart(usa_data).mark_point(size=100).encode(
    alt.X('deaths:Q'),
    alt.Y('cost:Q'),
    alt.Color('disaster:N')
).transform_filter(
    alt.datum.disaster != 'all-disasters'
).transform_filter(
    alt.datum.cost < 30
).transform_filter(
    alt.datum.deaths < 400
).properties(
    width = 200,
    height=500
)

In [580]:
alt.hconcat(full_plot, red_plot1, red_plot2)

In the first plot, we see that some of the most devastating disasters in the US, in terms of both economic and human loss, have been tropical cyclones.

In the second plot, we see that outside of those two outliers from earlier, most other tropical cyclones have led to relatively low loss of human life, but they are still the most economically damaging disasters. We also see that, over the years, droughts have caused the most loss of human life. Their economic cost varies widely: it can be quite high, but it can also be low while still coinciding with a large loss of life.

In the third plot, we don't really see a trend between deaths and cost. We do see that most of these billion-dollar disasters seem to have a bigger impact in terms of economic cost than deaths. For example, there are many disasters that have zero deaths but still lead to damages on the order of 12-14 billion dollars.
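
The visual impression can be complemented with a correlation coefficient. The following is a minimal sketch on made-up values, assuming the same `disaster`, `deaths`, and `cost` columns as `usa_data`. It computes the Pearson correlation and a rank (Spearman-style) correlation, the latter obtained by correlating the ranks so that no extra library is needed.

```python
import pandas as pd

# Made-up values; cost grows as the square of deaths for the non-aggregate rows
toy = pd.DataFrame({
    'disaster': ['all-disasters', 'drought', 'flooding', 'freeze', 'wildfire', 'winter-storm'],
    'deaths':   [15, 1, 2, 3, 4, 5],
    'cost':     [55.0, 1.0, 4.0, 9.0, 16.0, 25.0],
})

# Exclude the aggregate rows so they are not double counted
events = toy.loc[toy['disaster'] != 'all-disasters']

# Linear (Pearson) correlation
pearson = events['deaths'].corr(events['cost'])

# Rank correlation: Pearson correlation of the ranks
spearman = events['deaths'].rank().corr(events['cost'].rank())
```

Because a few extreme events can dominate the Pearson coefficient, the rank-based version is often the more robust of the two for data with outliers like the ones seen in the first plot.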

### Splitting Up By Regions

Thus far, we've only looked at aggregate values over the entire US. Now, let's drill down into some region-level metrics.

In [581]:
# Load state level data
data = pd.read_csv('data/full_data.csv')

# Load regions dictionary
with open('data/regions.json', 'r') as j:
    regions_dict = json.loads(j.read())
    
# Load state abbreviations dictionary
with open('data/us_state_abbrv.json', 'r') as j:
    state_abbrv_dict = json.loads(j.read())

Let's start off by taking a peek at our data.

In [582]:
data.head()

Unnamed: 0,cost,costRank,count,countRank,deaths,deathsRank,disaster,region,year
0,5.0,3.0,1.0,1.0,1260.0,1.0,drought,CCR,1980
1,0.0,42.0,0.0,42.0,0.0,42.0,drought,CCR,1981
2,0.0,42.0,0.0,42.0,0.0,42.0,drought,CCR,1982
3,1.6,8.0,1.0,1.0,0.0,42.0,drought,CCR,1983
4,0.0,42.0,0.0,42.0,0.0,42.0,drought,CCR,1984


The dataset is structured similarly to the dataset for the full US, but now the cost and deaths metrics are split up by region. Let's see what regions are present in the ``region`` column.

We also have a dictionary that gives us the full name behind each of these regional acronyms, along with the names of all the states included in a given region. Let's add this information to our dataset.

In [583]:
# Convert regions_dict into a dataframe
regions = pd.DataFrame.from_dict(regions_dict).transpose().reset_index()
regions.rename(columns={'index':'region'}, inplace=True)
regions = regions.explode('states')
regions.reset_index(drop=True, inplace=True)
regions['abbrv'] = regions['states'].map(state_abbrv_dict)
regions.head()

Unnamed: 0,region,name,states,abbrv
0,CCR,Central Climate Region,Illinois,IL
1,CCR,Central Climate Region,Indiana,IN
2,CCR,Central Climate Region,Kentucky,KY
3,CCR,Central Climate Region,Missouri,MO
4,CCR,Central Climate Region,Ohio,H
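
To make the reshaping step concrete, here is a small sketch on a hypothetical two-region dictionary. The region codes, names, and the assumed shape of `regions.json` (a `name` plus a list of `states` per region) are illustrative guesses based on the output above.

```python
import pandas as pd

# Hypothetical stand-ins for regions.json and us_state_abbrv.json
regions_dict = {
    'AAA': {'name': 'Region A', 'states': ['Ohio', 'Indiana']},
    'BBB': {'name': 'Region B', 'states': ['Texas']},
}
state_abbrv = {'Ohio': 'OH', 'Indiana': 'IN', 'Texas': 'TX'}

# One row per region, then one row per (region, state) pair via explode
regions = (pd.DataFrame.from_dict(regions_dict, orient='index')
             .rename_axis('region').reset_index()
             .explode('states').reset_index(drop=True))

# Look up the abbreviation for each state name
regions['abbrv'] = regions['states'].map(state_abbrv)
```

Using `orient='index'` plays the same role as transposing the frame, and `explode` turns each list of states into separate rows.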


Ok. Let's see what regions we have.

In [584]:
# Get a list of all region names for later
all_region_names = regions['name'].unique()

regions.drop_duplicates('region')[['region','name']].reset_index(drop=True)

Unnamed: 0,region,name
0,CCR,Central Climate Region
1,ENCCR,East North Central Climate Region
2,NECR,Northeast Climate Region
3,NWCR,Northwest Climate Region
4,SCR,South Climate Region
5,SECR,Southeast Climate Region
6,SWCR,Southwest Climate Region
7,WCR,West Climate Region
8,WNCCR,West North Climate Region
9,GCS,Gulf Coast States


Let's start off by checking which states are contained in which regions by plotting the various regions on a map. To do this, we first need the ``states`` dataset from ``vega-datasets``, which has an ID associated with each state. Our dataset has state names, so we'll need to map these to IDs. Once our dataset has the same state IDs as the ``states`` dataset, we can use lookups to make our cartographic plots.

In [585]:
# Pull geographic info on all states from vega datasets
states = alt.topo_feature(datasets.us_10m.url, 'states')
# Pull state IDs from another vega dataset so we can use to do lookups
state_ids = datasets.population_engineers_hurricanes()
# Match ids to state names in our dataset
state_ids = state_ids.loc[:,['state','id']]
state_ids = dict(zip(state_ids.state, state_ids.id))
regions['id'] = regions['states'].map(state_ids)
regions.head()

Unnamed: 0,region,name,states,abbrv,id
0,CCR,Central Climate Region,Illinois,IL,17
1,CCR,Central Climate Region,Indiana,IN,18
2,CCR,Central Climate Region,Kentucky,KY,21
3,CCR,Central Climate Region,Missouri,MO,29
4,CCR,Central Climate Region,Ohio,H,39
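
A name-to-ID mapping silently produces missing values for any name it does not know, so it can be worth listing the unmatched names before plotting. This is a sketch on hypothetical frames; the small `state_ids` table here stands in for the one pulled from `vega-datasets`.

```python
import pandas as pd

# Hypothetical lookup table of state names and numeric IDs
state_ids = pd.DataFrame({
    'state': ['Illinois', 'Indiana', 'Ohio'],
    'id':    [17, 18, 39],
})

# Hypothetical regions frame containing one name the lookup does not cover
regions = pd.DataFrame({'states': ['Illinois', 'Ohio', 'Puerto Rico']})

# Map names to IDs; names missing from the lookup become NaN
name_to_id = dict(zip(state_ids['state'], state_ids['id']))
regions['id'] = regions['states'].map(name_to_id)

# Collect the names that failed to match
unmatched = regions.loc[regions['id'].isna(), 'states'].tolist()
```

An empty `unmatched` list would suggest that every state in the regions table can be joined to the map geometry.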


In [586]:
def chloropleth_map(regions_df, projection='albers'):
    return alt.Chart(states).mark_geoshape(stroke='black').encode(
        alt.Color('name:N', scale=alt.Scale(scheme='tableau20'))
    ).transform_lookup(
        lookup='id',
        from_=alt.LookupData(regions_df, 'id', list(regions_df.columns))
    ).properties(
        width=800,
        height=350
    ).project(
        type=projection
    )

chloropleth_map(regions)

Two things to note here:
First, our dataset only seems to cover the contiguous US as Hawaii, Alaska, and Puerto Rico show up as ``null``.

Second, we saw earlier that there were 13 regions in our dataset, but we only see 11 unique labels here. One possible reason is that some of the regions overlap with one another. For example, we don't see ``East North Central Climate Region`` or ``West North Climate Region``. Let's check whether any regions overlap with one another.

In [587]:
regions.head()

Unnamed: 0,region,name,states,abbrv,id
0,CCR,Central Climate Region,Illinois,IL,17
1,CCR,Central Climate Region,Indiana,IN,18
2,CCR,Central Climate Region,Kentucky,KY,21
3,CCR,Central Climate Region,Missouri,MO,29
4,CCR,Central Climate Region,Ohio,H,39


In [588]:
region_membership = regions.groupby('states')['region'].count().reset_index()
region_membership = region_membership.rename(columns={'region':'region_membership'})
regions_membership = regions.merge(region_membership, on ='states')

heatmap = alt.Chart(regions_membership).mark_rect().encode(
    alt.Y('states:O'),
    alt.X('region:N'),
    alt.Color('name:N', scale=alt.Scale(scheme='tableau20'))
).properties(
    width=400,
    height=700
)

membership_plot = alt.Chart(regions_membership).mark_text().encode(
    y = alt.Y('states:O'),
    text = alt.Text('region_membership:Q'),
    color = alt.Color('region_membership:Q', legend=None),
).properties(
    height=700
)


heatmap | membership_plot 

Aha! Just as we suspected, many states are included in multiple regions. For example, Alabama is part of the Gulf Coast States (GCS) and the Southeast Climate Region (SECR). Texas is included in four different regions!
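
The same overlap check can be done without a plot by counting the distinct regions each state belongs to. The sketch below uses a made-up membership table with the same `region` and `states` columns as `regions`.

```python
import pandas as pd

# Made-up (region, state) pairs; Alabama and Texas each appear in two regions
regions = pd.DataFrame({
    'region': ['GCS', 'SECR', 'GCS', 'SCR', 'SECR'],
    'states': ['Alabama', 'Alabama', 'Texas', 'Texas', 'Georgia'],
})

# Number of distinct regions per state
membership = regions.groupby('states')['region'].nunique()

# States that belong to more than one region
overlapping = membership[membership > 1].sort_values(ascending=False)
```

Any state that shows up in `overlapping` would be colored by only one of its regions on the map, which is one plausible reason for the missing labels.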

After staring at the region names for a bit, we notice that most of these names end with "Climate Region", but there are a few that don't (e.g. "Great Lakes States", "Tornado Alley"). Let's see what happens if we remove all of these non-climate-region groupings.

In [589]:
region_names_to_drop = ['Great Lakes States', 'Gulf Coast States', 'Southern Plains', 'Tornado Alley']
reduced_region_names = [x for x in all_region_names if x not in region_names_to_drop]

In [590]:
region_membership = regions.loc[regions['name'].isin(reduced_region_names),:]
region_membership = region_membership.groupby('states')['region'].count().reset_index()
region_membership = region_membership.rename(columns={'region':'region_membership'})
regions_membership = regions.loc[regions['name'].isin(reduced_region_names),:].merge(region_membership, on ='states')


heatmap = alt.Chart(regions_membership).mark_rect().encode(
    alt.Y('states:O'),
    alt.X('region:N'),
    alt.Color('name:N', scale=alt.Scale(scheme='tableau20'))
).properties(
    width=400,
    height=700
)

membership_plot = alt.Chart(regions_membership).mark_text().encode(
    y = alt.Y('states:O'),
    text = alt.Text('region_membership:Q'),
    color = alt.Color('region_membership:Q', legend=None),
).properties(
    height=700
)


heatmap | membership_plot 

Great! Now we see that each state is included in only one region. There is still one more question: did we end up losing any states when we dropped those extra regions? Let's make a choropleth plot and see if we miss any states.

In [614]:
regions_chloro = chloropleth_map(regions.loc[regions['name'].isin(reduced_region_names)], projection='albersUsa')
regions_chloro

That seems to have done the trick. We now have all nine of the climate regions represented in our map, and every state is colored on this choropleth. This tells us that our nine region names have full coverage: we didn't miss any states by removing those extra regions.

Now, let's just remove those extra region names from our main regions dataframe.
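
The coverage check can also be expressed as a set difference, which avoids having to inspect the map by eye. This is a sketch on a made-up table; the region names mirror the ones above, but the rows are illustrative.

```python
import pandas as pd

# Made-up membership table with one overlapping, non-climate grouping
regions = pd.DataFrame({
    'name':   ['Southeast Climate Region', 'Gulf Coast States',
               'South Climate Region', 'Gulf Coast States'],
    'states': ['Alabama', 'Alabama', 'Texas', 'Texas'],
})
to_drop = ['Gulf Coast States']

# Keep only the climate regions
kept = regions.loc[~regions['name'].isin(to_drop)]

# States that appeared before filtering but not after
lost_states = set(regions['states']) - set(kept['states'])

# Each remaining state should belong to exactly one region
per_state = kept.groupby('states')['name'].nunique()
```

An empty `lost_states` together with `per_state` being 1 everywhere corresponds to the two properties read off the plots: full coverage and no overlap.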

In [592]:
regions = regions.loc[regions['name'].isin(reduced_region_names),:]

In [604]:
regions.loc[:,['region','name']].head()

Unnamed: 0,region,name
0,CCR,Central Climate Region
1,CCR,Central Climate Region
2,CCR,Central Climate Region
3,CCR,Central Climate Region
4,CCR,Central Climate Region


In [608]:
regions_data = data.merge(regions.loc[:,['region','name']].drop_duplicates(), on='region', how='left')
regions_data = regions_data.loc[regions_data['name'].isin(reduced_region_names),:]

Let's start off by visualizing the total number of disasters in each category across the various regions in our dataset.

In [657]:
bottom_chart = alt.Chart(regions_data).mark_circle().encode(
    alt.Y('disaster:N'),
    alt.X('region:N'),
    alt.Size('sum(count)')
).properties(
    width=450,
    height=300
).transform_filter(
    alt.datum.disaster != 'all-disasters'
)

top_chart = alt.Chart(regions_data).mark_circle().encode(
    alt.Y('disaster:N'),
    alt.X('region:N'),
    alt.Size('sum(count):Q')
).properties(
    width=450,
    height=50
).transform_filter(
    alt.datum.disaster == 'all-disasters'
)


alt.vconcat(top_chart, bottom_chart)

In the top plot, we get a sense of which regions have been hit the most by these billion-dollar disasters. Namely, CCR, SCR, and SECR are at the top of the list.
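
The ranking behind the top plot can be reproduced with a grouped sum. The sketch below uses made-up counts and assumes the same `region`, `disaster`, and `count` columns as `regions_data`.

```python
import pandas as pd

# Made-up counts; the 'all-disasters' rows hold the per-region totals
toy = pd.DataFrame({
    'region':   ['CCR', 'CCR', 'SCR', 'SCR', 'WCR', 'WCR'],
    'disaster': ['all-disasters', 'drought'] * 3,
    'count':    [5, 2, 7, 3, 1, 1],
})

# Sum the aggregate rows per region and sort from most to least affected
totals = (toy.loc[toy['disaster'] == 'all-disasters']
             .groupby('region')['count'].sum()
             .sort_values(ascending=False))
```

Reading `totals` from the top gives the regions in the same order as the largest circles in the plot.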

In [703]:
df = regions.copy()
# Use .loc with a boolean mask to avoid chained assignment
high_freq = df['region'].isin(['CCR','SCR','SECR'])
df.loc[~high_freq, 'name'] = 'Other'
df.loc[high_freq, 'name'] = 'High Frequency Disaster Regions'


alt.Chart(states).mark_geoshape(stroke='black').encode(
        alt.Color('name:N', scale=alt.Scale(scheme='tableau20'))
    ).transform_lookup(
        lookup='id',
        from_=alt.LookupData(df, 'id', list(regions.columns))
    ).properties(
        width=800,
        height=350
    ).project(
        type='albersUsa'
    )



So, we see that most of the billion-dollar disasters in the US have been concentrated in the central, southern, and south-eastern parts of the country.

Let us now take a look at disasters by year in the different regions.

In [704]:
# Plotting the total number of disasters across the regions
alt.Chart(regions_data).mark_circle().transform_filter(
        alt.datum.disaster != 'all-disasters'
    ).encode(
    x = alt.X('year:O',axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    size = 'sum(count):Q',
    y ='name:N',
    color='region:N'
).properties(
    width=700,
    height=500
)

We notice here that up until 2000, the western regions of the US did not see billion-dollar disasters as frequently, but they've become more frequent there in the past two decades. As we saw before, the central, southern, and south-eastern regions have been seeing the largest number of disasters, and they're happening more frequently too.
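
A before-and-after comparison per region is one way to quantify this. The following sketch uses made-up counts, assumes the same `region`, `year`, and `count` columns as `regions_data`, and takes the year 2000 as the split point; it assumes the aggregate rows have already been filtered out.

```python
import numpy as np
import pandas as pd

# Made-up yearly counts for two regions
toy = pd.DataFrame({
    'region': ['WCR'] * 4 + ['SCR'] * 4,
    'year':   [1990, 1995, 2005, 2015] * 2,
    'count':  [0, 1, 2, 3, 2, 2, 4, 6],
})

# Label each row with its period
toy['period'] = np.where(toy['year'] < 2000, '1980-1999', '2000-2020')

# Average annual count per region and period, one column per period
avg = toy.groupby(['region', 'period'])['count'].mean().unstack('period')
avg['change'] = avg['2000-2020'] - avg['1980-1999']
```

A positive `change` for a region indicates more disasters per year in the later period than in the earlier one.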

In [705]:
width = 800

chart_death = alt.Chart(regions_data).mark_bar().transform_filter(
        alt.datum.disaster == 'drought'
    ).encode(
    alt.X('year:O',axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    alt.Y('region'),
    alt.Color('sum(deaths)')
).properties(
    width=width,
    height=450
)

chart_money = alt.Chart(regions_data).mark_bar().transform_filter(
        alt.datum.disaster == 'drought'
    ).encode(
    alt.X('year:O',axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    alt.Y('region'),
    alt.Color('sum(cost)')
).properties(
    width=width,
    height=450
)

# Timeline Charts
timeline_df = pd.DataFrame()
timeline_df['year'] = usa_data['year'].unique()
events_dict = {1980: 'Historic Heatwaves',
              2005: 'Hurricane Katrina',
              2012: 'Hurricane Sandy',
              2017: 'Hurricane Harvey'}
timeline_df['events'] = timeline_df['year'].map(events_dict)
timeline_df['events_flag'] = 1 - timeline_df['events'].isna()


# Create timeline chart
timeline_chart = alt.Chart(timeline_df).mark_point().encode(
    x = alt.X('year:O', axis=alt.Axis(title='Years', values=np.arange(1980,2030,5), labelAngle=0)),
    size = alt.Size('events_flag', legend=None)
).properties(
    width = width,
    height = 50
)
# Extra comment annotations
annotations = pd.DataFrame({'x': list(events_dict.keys()), 
                            'y': [0.1]*len(events_dict.keys()), 
                            'text': [events_dict[x] for x in events_dict.keys()]})
comments_chart_timeline = alt.Chart(annotations).mark_text(size=12).encode(
    x = alt.X('x:N'),
    y = alt.Y('y:Q', axis=alt.Axis(labels=False, title=None)),
    text = 'text',
).properties(
    width = width
)

In [706]:
alt.vconcat(timeline_chart, chart_money, chart_death).resolve_scale(
    color='independent'
)