# Check Your Tires ... And Your Shopping Basket?

*“It’s difficult to imagine the power that you’re going to have when so many different sorts of data are available.”*  
Tim Berners-Lee

Datasets used in this kernel,

- [US Traffic Fatality Records](https://www.kaggle.com/usdot/nhtsa-traffic-fatalities)
- [Chronic Disease Indicators](https://www.kaggle.com/cdc/chronic-disease)
- [2016 US Election](https://www.kaggle.com/benhamner/2016-us-election)
- [Vegetarian & Vegan Restaurants](https://www.kaggle.com/datafiniti/vegetarian-vegan-restaurants)

### Contents

- [Introduction](#Introduction)
- [Data overview](#Dataoverview)
- [BigQuery](#BigQuery)
- [Multiple Datasets](#MultipleDatasets)
- [Chronic Disease Indicators](#ChronicDiseaseIndicators)
- [2016 US Election](#2016USElection)
- [Vegetarian & Vegan Restaurants](#Vegetarian&VeganRestaurants)
- [State Populations](#StatePopulations)
- [Closing Thoughts](#ClosingThoughts)

<a id='Introduction'></a>

### Introduction

This kernel explores the **'US Traffic Fatality Records - Fatal car crashes for 2015-2016' dataset** from the US Department of Transportation.

This dataset is so huge and varied (it is one of the first of Kaggle's BigQuery datasets) that it's hard to know where to begin. The data isn't just the where and when of traffic accidents. It also contains data on the passengers, drivers, visibility at the time, damage done to the vehicle, etc.

In this kernel, some exploration and analysis of this core dataset is done first. The kernel then uses one of Kaggle's newer features, **the ability to merge different datasets**. This aims to begin an exploration of how an already rich and detailed dataset can be further embellished, in order to try and see these fatality statistics from some alternative perspectives.

Before loading any data, I've included descriptions of the individual files below, to make the exploration easier.


<a id='Dataoverview'></a>

### Data overview

- **accidents_2015** - This data file contains information about crash characteristics and environmental conditions at the time of the crash. There is one record per crash.
- **cevent** - This data file contains information for all of the qualifying events (i.e., both harmful and non-harmful involving in-transport motor vehicles) which occurred in the crash. It details the chronological sequence of events resulting from an unstabilized situation that constitutes a motor vehicle traffic crash. There is one record per event. Included in each record is a description of the event or object contacted (e.g., ran off road-right, crossed center line, guardrail, parked motor vehicle), the vehicles involved, and the vehicles’ area of impact.
- **damage** - This data file contains information about all of the areas on this vehicle that were damaged in the crash. There is one record per damaged area.
- **distract** - This data file contains information about driver distractions. There is at least one record per in-transport motor vehicle. Each distraction is a separate record.
- **drimpair** - This data file contains information about physical impairments of drivers of motor vehicles. There is one record per impairment and there is at least one record for each driver of an in-transport motor vehicle.
- **factor** - This data file contains information about vehicle circumstances which may have contributed to the crash. There is at least one record per in-transport motor vehicle. Each factor is a separate record.
- **maneuver** - This data file contains information about actions taken by the driver to avoid something or someone in the road. There is at least one record per in-transport motor vehicle. Each maneuver is a separate record.
- **nmcrash** - This data file contains information about any contributing circumstances or improper actions of people who are not occupants of motor vehicles (e.g., pedestrians and bicyclists) noted on the PAR. There is one record per action and there is at least one record for each person who is not an occupant of a motor vehicle.
- **nmimpair** - This data file contains information about physical impairments of people who are not occupants of motor vehicles. There is one record per impairment and there is at least one record for each person who is not an occupant of a motor vehicle.
- **nmprior** - This data file contains information about the actions of people who are not occupants of motor vehicles (e.g., pedestrians and bicyclists) at the time of their involvement in the crash. There is one record per action and there is at least one record for each person who is not an occupant of a motor vehicle.
- **parkwork** - This data file contains information about parked and working vehicles that were involved in FARS crashes. A parked vehicle is a motor vehicle which is stopped off the roadway. A working vehicle is used to indicate that this is a motor vehicle that was in the act of performing highway construction, maintenance or utility work related to the trafficway when it became involved in the crash. Data users are strongly advised to consult the annual FARS/NASS GES Coding and Validation Manuals for a detailed description. There is one record per parked/working vehicle.
- **pdtype** - This data file contains information about crashes between motor vehicles and pedestrians, people on personal conveyances and bicyclists. Data from the crash are entered into the Pedestrian and Bicycle Crash Analysis Tool (PBCAT). The output fields from PBCAT, including the pre-crash actions of the parties involved (crash type), are included in this data set. There is one record for each pedestrian, bicyclist or person on a personal conveyance.
- **persons** - This data file contains information describing all persons involved in the crash including motorists (i.e., drivers and passengers of in-transport motor vehicles) and non-motorists (e.g., pedestrians and pedalcyclists). It provides information such as age, sex, vehicle occupant restraint use, and injury severity. There is one record per person.
- **safetyeq** - This data file contains information about safety equipment used by people who are not occupants of motor vehicles. There is one record per equipment item, and there is at least one record for each person who is not an occupant of a motor vehicle.
- **vehicle** - This data file contains information describing the in-transport motor vehicles and the drivers of in-transport motor vehicles who are involved in the crash. There is one record per in-transport motor vehicle. Parked and working vehicle information is in the Parkwork data file.
- **vevent** - This data file contains the sequence of events for each in-transport motor vehicle involved in the crash. This data file has the same data elements as the Cevent data file. In addition, this data file has a data element that records the sequential event number for each vehicle (VEVENTNUM). There is one record for each event for each in-transport motor vehicle.
- **vindecode** - This data file contains vehicle descriptors for all vehicles, mainly passenger vehicles, trucks and motorcycles, based on the vehicle’s VIN which is decoded using the VINtelligence program. There is one record per vehicle.
- **violatn** - This data file contains information about violations which were charged to drivers. There is at least one record per in-transport motor vehicle. Each violation is a separate record.
- **vision** - This data file contains information about circumstances which may have obscured the driver’s vision. There is at least one record per in-transport motor vehicle. Each obstruction is a separate record.
- **vsoe** - This data file contains the sequence of events for each in-transport motor vehicle involved in the crash. This data file has a subset of the data elements contained in the Vevent data file (it is a simplified Vevent data file). There is one record for each event for each in-transport motor vehicle.


<a id='BigQuery'></a>

### BigQuery

This mass of data doesn't lend itself particularly well to the usual format of CSV files. Instead, this kernel uses a very recent addition to Kaggle, **Google's BigQuery**. Google describe this technology as *"A fast, highly scalable, cost-effective and fully-managed enterprise data warehouse for analytics at any scale"*.

In short, it lets you store and query big datasets, without the headache of establishing in-house big-data hardware. It runs standard SQL queries, providing a quick start for anyone with SQL knowledge.

Let's get started! ...

In [None]:
import numpy as np
import pandas as pd
from google.cloud import bigquery #For BigQuery
from bq_helper import BigQueryHelper #For BigQuery

Load the traffic fatalities data,

In [None]:
us_traffic = BigQueryHelper("bigquery-public-data", "nhtsa_traffic_fatalities")

Take a peek at the data,

In [None]:
us_traffic.head("accident_2015")

So, we have all sorts here, including state, vehicle, driver, location and time information. Note that the first column is **'state_number'**. That will come in handy later.

Next, let's try a query. Below, I'm grabbing the location data, the number of fatalities, and the timestamp. When I ran this initially, it brought back a vast amount of data. I also noticed some crazy location values, which are perhaps data-entry errors. For these two reasons, I'm limiting the longitude to a range that roughly covers the mainland US, and selecting only accidents from December with at least one drunk driver. The data is only from 2016. Note that these choices aren't for any specific analysis reasons, but they allow me to take a look at an interesting subset and to get started,

In [None]:
accidents_query = """SELECT longitude, latitude, number_of_fatalities, timestamp_of_crash
                     FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
                     WHERE number_of_drunk_drivers > 0
                     AND longitude < 0
                     AND longitude > -140
                     AND month_of_crash = 12 """ 

Let's convert to a **pandas dataframe** to make life easier,

In [None]:
accidents_latlong = us_traffic.query_to_pandas(accidents_query)
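
The query above bounds longitude only. If a tighter geographic filter is wanted, both coordinates can be bounded once the results are in pandas. The following is a minimal sketch on made-up rows; the bounding-box limits are rough assumptions for the contiguous US, not values taken from the dataset documentation.

```python
import pandas as pd

# Hypothetical sample rows; the column names mirror the query above.
crashes = pd.DataFrame({
    "longitude": [-97.7, -157.8, 777.7, -80.2],
    "latitude": [30.3, 21.3, 77.7, 25.8],
    "number_of_fatalities": [1, 2, 1, 1],
})

# Rough bounding box for the contiguous US (assumed limits).
LON_MIN, LON_MAX = -125.0, -66.0
LAT_MIN, LAT_MAX = 24.0, 50.0

# Keep only rows whose longitude AND latitude fall inside the box.
in_box = (
    crashes["longitude"].between(LON_MIN, LON_MAX)
    & crashes["latitude"].between(LAT_MIN, LAT_MAX)
)
mainland = crashes[in_box].reset_index(drop=True)
print(mainland)
```

Filtering in pandas keeps the SQL simple, at the cost of downloading rows that are then discarded; the same bounds could equally be added to the `WHERE` clause.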

Next, let's take a look at this data with a **Plotly** interactive plot. This shows the locations of the accidents, with the size of each point scaled to the number of fatalities and the hover-over text the date and time of the accident. Note that I'm altering the format of the date and time first, in order to drop the unwanted seconds and milliseconds components,

In [None]:
import datetime
accidents_latlong['timestamp_of_crash'] = accidents_latlong['timestamp_of_crash'].apply(lambda x: x.strftime("%Y-%m-%d %H:%M"))
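
As an aside, the same formatting can be done without `apply` by using the vectorised `.dt.strftime` accessor. This is a small sketch on hypothetical timestamps, assuming the column already has a datetime dtype.

```python
import pandas as pd

# Hypothetical timestamps standing in for the timestamp_of_crash column.
sample = pd.DataFrame({
    "timestamp_of_crash": pd.to_datetime(
        ["2016-12-01 23:15:42.120", "2016-12-24 02:05:00.000"]
    )
})

# Format to minute precision, dropping seconds and milliseconds.
sample["timestamp_label"] = sample["timestamp_of_crash"].dt.strftime("%Y-%m-%d %H:%M")
print(sample["timestamp_label"].tolist())
```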

In [None]:
#Ref: https://plot.ly/python/scatter-plots-on-maps/
# plotly.plotly moved to the separate chart_studio package in plotly 4 and isn't needed for offline plots
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()

data = [ dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = accidents_latlong['longitude'],
        lat = accidents_latlong['latitude'],
        text = accidents_latlong['timestamp_of_crash'],
        mode = "markers",
        marker = dict(
            size = accidents_latlong['number_of_fatalities']*10,
            opacity = 0.8,
        ))]

layout = dict(
        title = 'US Fatalities by Location (December 2016, Drunk Drivers)',
        colorbar = True,
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        ),
    )

fig = dict(data=data, layout=layout)
iplot(fig, validate=False, filename='fatalities')

Next, let's have a look at how many fatal accidents occurred in each month,

In [None]:
us_traffic_crashes_by_month = us_traffic.query_to_pandas_safe("""
    SELECT month_of_crash, count(month_of_crash) AS months_totals
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
    GROUP BY month_of_crash
    ORDER BY months_totals DESC
""")
us_traffic_crashes_by_month

It looks like **October** has the most fatal accidents.

In such accidents, I wonder what the distribution of fatalities looks like...

In [None]:
us_traffic_fatalities = us_traffic.query_to_pandas_safe("""
    SELECT number_of_fatalities
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
""")

In [None]:
x = us_traffic_fatalities['number_of_fatalities']
data = [go.Histogram(x=x)]

layout = go.Layout(
    title='Number of Fatalities Per Incident',
    yaxis=dict(
        title='Count'
    ),
    xaxis=dict(
        title='Number of Fatalities'
    ),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, validate=False, filename='fatalities')

Mostly single fatalities. What about the same query, but summing the number of fatalities (as opposed to counting the accidents that involved fatalities)? Given the histogram above, I expect these numbers to be similar to the previous table,

In [None]:
us_traffic_fatality_by_month = us_traffic.query_to_pandas_safe("""
    SELECT month_of_crash, sum(number_of_fatalities) AS months_fatalities_totals
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
    GROUP BY month_of_crash
    ORDER BY months_fatalities_totals DESC
""")
us_traffic_fatality_by_month

Again, October is the highest. The number of fatality-related accidents was 3249, and the number of fatalities was 3526.
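
To make the difference between the two queries concrete, here is a small sketch on made-up rows: counting rows gives the number of fatal accidents, while summing `number_of_fatalities` gives the number of deaths. The last lines apply the same ratio to the October figures quoted above.

```python
import pandas as pd

# Hypothetical crash records: one row per fatal accident.
toy = pd.DataFrame({
    "month_of_crash": [10, 10, 10, 11, 11],
    "number_of_fatalities": [1, 2, 1, 1, 3],
})

by_month = toy.groupby("month_of_crash")["number_of_fatalities"].agg(
    accidents="count",   # number of rows, i.e. fatal accidents
    fatalities="sum",    # total deaths across those accidents
)
by_month["fatalities_per_accident"] = by_month["fatalities"] / by_month["accidents"]
print(by_month)

# Ratio for the October figures reported above.
october_ratio = 3526 / 3249
print(round(october_ratio, 3))
```

A ratio only slightly above 1 is consistent with the histogram: most fatal accidents involve a single death.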

What about fatalities by state?

In [None]:
us_traffic_fatality_by_state = us_traffic.query_to_pandas_safe("""
    SELECT state_name, 
    sum(number_of_fatalities) AS fatality_total
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
    GROUP BY state_name
    ORDER BY fatality_total DESC
""")
us_traffic_fatality_by_state

As you can see, **Texas** is the highest.

What about some of the other tables? The **cevent** table tells us about the **events surrounding the accident**. Let's have a quick look at the data, and then see how the different events stack up in terms of frequency,

In [None]:
us_traffic.head("cevent_2016")

In [None]:
us_traffic_events = us_traffic.query_to_pandas_safe("""
    SELECT sequence_of_events_name, 
    count(sequence_of_events_name) AS events_counts
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.cevent_2016`
    GROUP BY sequence_of_events_name
    ORDER BY events_counts DESC
    LIMIT 10
""")
us_traffic_events

Not overly surprising that **'Motor Vehicle in Transport'** is high. It looks like **running off the road to the right** is next, with a terrifying number of **overturns**.

What about **'contributing circumstances'**? Note I'm using a slightly different technique here: pulling the raw column into pandas and doing the counting there, rather than in SQL,

In [None]:
us_traffic_factors = us_traffic.query_to_pandas_safe("""
    SELECT contributing_circumstances_motor_vehicle_name
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.factor_2016`
""")
us_traffic_factors_counts = us_traffic_factors.groupby(['contributing_circumstances_motor_vehicle_name']).agg('contributing_circumstances_motor_vehicle_name').count().sort_values(ascending = False)

In [None]:
us_traffic_factors_counts

**'Tires'** are a major culprit.
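
For counting a single column in pandas, `value_counts` is a shorter route to the same table. The sketch below uses made-up category labels (they are placeholders, not the dataset's actual values) and checks that it agrees with a groupby-based count.

```python
import pandas as pd

col = "contributing_circumstances_motor_vehicle_name"

# Hypothetical rows; the labels are illustrative only.
factors = pd.DataFrame({
    col: ["None", "Tires", "None", "Brake System", "Tires", "None"]
})

# value_counts sorts in descending order of frequency by default.
counts = factors[col].value_counts()
print(counts)

# Equivalent groupby formulation, for comparison.
counts_groupby = factors.groupby(col)[col].count().sort_values(ascending=False)
```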

And what about **obstructions to vision**?

In [None]:
us_traffic_vision = us_traffic.query_to_pandas_safe("""
    SELECT drivers_vision_obscured_by_name
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.vision_2016`
""")
us_traffic_vision.groupby(['drivers_vision_obscured_by_name']).agg('drivers_vision_obscured_by_name').count().sort_values(ascending = False)

**'Rain, Snow, Fog, Smoke, Sand, Dust'** is top of the list.

Looking at single tables is straightforward. What about data from **multiple tables**? This requires us to merge the tables via the SQL query. The code below merges the vision and accident tables by their state_number columns. I'm again limiting the query in a few ways in order to prevent the query taking too long,

In [None]:
us_traffic_vision_by_month = us_traffic.query_to_pandas_safe("""
    SELECT a.drivers_vision_obscured_by_name, a.state_number AS vision_state, b.state_number, b.state_name, b.month_of_crash
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.vision_2016` a
    JOIN `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016` b
    ON a.state_number = b.state_number
    WHERE b.month_of_crash = 12
    AND number_of_drunk_drivers > 0
    AND a.drivers_vision_obscured_by_name != 'No Obstruction Noted'
    AND a.drivers_vision_obscured_by_name != 'Unknown'
""")

In [None]:
us_traffic_vision_by_month.groupby(['state_name', 'drivers_vision_obscured_by_name']).agg('state_name').count().sort_values(ascending = False).head(15)

The slightly cryptic **'Other Visual Obstruction'** is clearly causing problems in **Florida**.
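
One thing to keep in mind when reading these counts: a join on `state_number` alone pairs every vision record with every accident in the same state, so the totals grow with the number of accidents in a state. The sketch below illustrates the effect on tiny made-up tables. It assumes a per-crash identifier column called `consecutive_number`; that name is an assumption here and should be checked against the real schema.

```python
import pandas as pd

# Hypothetical miniature tables. "consecutive_number" is assumed to identify
# a single crash; verify against the actual table schema before relying on it.
accident = pd.DataFrame({
    "state_number": [12, 12, 48],
    "consecutive_number": [120001, 120002, 480001],
    "month_of_crash": [12, 12, 12],
})
vision = pd.DataFrame({
    "state_number": [12, 48],
    "consecutive_number": [120001, 480001],
    "drivers_vision_obscured_by_name": [
        "Other Visual Obstruction",
        "Rain, Snow, Fog, Smoke, Sand, Dust",
    ],
})

# State-level join: each vision row matches every accident in that state.
by_state = vision.merge(accident, on="state_number", suffixes=("_vision", "_accident"))

# Crash-level join: each vision row matches only its own accident.
by_crash = vision.merge(accident, on=["state_number", "consecutive_number"])

print(len(by_state), len(by_crash))
```

Here the state-level join yields three rows from two vision records, while the crash-level join yields two.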

The possibilities for further analysis of this data are massive, but hopefully that gives a flavour of what's possible. Next, let's see how this data looks alongside other datasets.

<a id='MultipleDatasets'></a>

### Multiple Datasets

Last year, [Kaggle began to allow kernels to access and query multiple datasets](https://www.kaggle.com/product-feedback/32423). This fantastic new feature allows us to fully unlock the potential of open datasets, by mixing and merging data to uncover novel insights.

I've chosen 3 different datasets for this, which offer some serious, as well as some more left-field, opportunities to explore this superset of US-related data.

<a id='ChronicDiseaseIndicators'></a>

### Chronic Disease Indicators

This data by the [Centers for Disease Control and Prevention](https://www.cdc.gov/) contains state information on 124 chronic disease indicators. Let's take a look,

In [None]:
state_chronic = pd.read_csv('../input/chronic-disease/U.S._Chronic_Disease_Indicators.csv')

In [None]:
state_chronic.head(5)

Let's see what different health topics are available,

In [None]:
state_chronic_topics = state_chronic.groupby(['Topic']).agg('Topic').count().sort_values(ascending = False)

In [None]:
state_chronic_topics.head(20)

Of relevance to the fatalities data above may be the chronic illness data related to **alcohol**. Let's limit the data to that topic,

In [None]:
state_chronic_alcohol = state_chronic[state_chronic['Topic'] == 'Alcohol'].groupby(['Question']).agg('Question').count().sort_values(ascending = False)

In [None]:
state_chronic_alcohol.head(20)

Lots of different aspects here. Let's take a look at **'Alcohol use among youth'**,

In [None]:
state_chronic_2015 = state_chronic[(state_chronic['YearStart'] == 2015) & (state_chronic['Question'] == 'Alcohol use among youth')]

In [None]:
state_chronic_2015.head(5)

In [None]:
state_chronic_2015_value = state_chronic_2015.groupby(['LocationDesc','DataValue']).mean(numeric_only=True) #numeric_only avoids errors on text columns in newer pandas

In [None]:
state_chronic_2015_value.sort_index(level='DataValue', ascending = False) #sortlevel was removed in pandas 1.0

So, **Arizona** has the highest percentage of alcohol use among youth, and the **District of Columbia** has the least. Now for the really interesting bit. Let's merge this data with the accident-fatality data,

In [None]:
merged = pd.merge(us_traffic_fatality_by_state, state_chronic_2015, left_on='state_name', right_on='LocationDesc')

In [None]:
merged.head()
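
A default `pd.merge` is an inner join, so any state whose name is spelled differently in the two tables is silently dropped. One way to check for that is an outer merge with `indicator=True`. The sketch below uses made-up rows with a deliberately mismatched spelling; it does not claim that the real datasets disagree in this way.

```python
import pandas as pd

# Hypothetical rows; the mismatched spelling is invented for illustration.
fatalities = pd.DataFrame({
    "state_name": ["Texas", "Florida", "Dist of Columbia"],
    "fatality_total": [3, 2, 1],
})
indicator = pd.DataFrame({
    "LocationDesc": ["Texas", "Florida", "District of Columbia"],
    "DataValue": [30.0, 28.0, 20.0],
})

# Outer merge keeps unmatched rows from both sides; _merge records the origin.
checked = pd.merge(
    fatalities, indicator,
    left_on="state_name", right_on="LocationDesc",
    how="outer", indicator=True,
)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["state_name", "LocationDesc", "_merge"]])
```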

Let's tidy that up by grouping by location and youth alcohol amount,

In [None]:
fatalities_and_alcohol = merged.groupby(['LocationDesc','DataValue']).mean(numeric_only=True) #numeric_only avoids errors on text columns in newer pandas

In [None]:
fatalities_and_alcohol = fatalities_and_alcohol.reset_index() #Tidy up the column headers

In [None]:
fatalities_and_alcohol.head()

In [None]:
from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(fatalities_and_alcohol['DataValue'],fatalities_and_alcohol['fatality_total'])
line = slope*fatalities_and_alcohol['DataValue']+intercept

trace1 = go.Scatter(
    x = fatalities_and_alcohol['DataValue'],
    y = fatalities_and_alcohol['fatality_total'],
    text = fatalities_and_alcohol['LocationDesc'],
    mode = "markers")

trace2 = go.Scatter(
    x=fatalities_and_alcohol['DataValue'],
    y=line,
    mode='lines',
    hoverinfo='none',
    marker=dict(color='red'),
    name='Fit'
    )

layout = go.Layout(
    title = 'Alcohol Use Among Youth vs Number of Fatalities by State',
    xaxis=dict(title = 'Alcohol Use Among Youth (%)'),
    yaxis=dict(title = 'Number of Fatalities'),
    showlegend=False
)

data = [trace1, trace2]
fig = go.Figure(data=data, layout=layout)
iplot(fig, validate=False, filename='alcohol')

Interesting, but perhaps unsurprising. It looks like there is some relationship between youth alcohol intake and fatalities by state.
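
The `linregress` call above already returns `r_value` and `p_value`, which could be printed to put a number on "some relationship". As a dependency-light illustration of the same idea, the sketch below fits a line and computes the correlation with NumPy on made-up points (NumPy is swapped in here for SciPy, and the numbers are not from the datasets).

```python
import numpy as np

# Hypothetical (x, y) pairs standing in for youth alcohol use (%) and fatalities.
x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y = np.array([110.0, 150.0, 140.0, 190.0, 210.0])

# Least-squares straight line: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)

# Pearson correlation coefficient and the share of variance explained.
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r^2={r_squared:.2f}")
```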

<a id='2016USElection'></a>

### 2016 US Election

This dataset details how Americans voted in the 2016 presidential primaries by state. Let's take a look,

In [None]:
election = pd.read_csv('../input/2016-us-election/primary_results.csv')

In [None]:
election.head()

How about seeing the breakdown by state and candidate,

In [None]:
results_by_state_candidate = election.groupby(['state', 'candidate']).agg('votes').sum()

In [None]:
results_by_state_candidate = results_by_state_candidate.to_frame()
results_by_state_candidate = results_by_state_candidate.reset_index()

In [None]:
results_by_state_candidate.head(10)

A quick check of the numbers ([see here](https://www.nytimes.com/elections/2016/results/primaries/alabama)) looks correct.

And the highest in each state?

In [None]:
results_by_state_candidate_top = results_by_state_candidate.groupby(['state']).agg('votes').idxmax()

In [None]:
results_by_state_candidate_top = results_by_state_candidate_top.to_frame()
votes = results_by_state_candidate_top.reset_index()

In [None]:
results_by_state_candidate_top = results_by_state_candidate.iloc[results_by_state_candidate_top['votes'].tolist(),:]

In [None]:
results_by_state_candidate_top.head()
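
The "top row per group" step can also be written more compactly by passing the `idxmax` result straight to `.loc`. A minimal sketch on made-up vote counts:

```python
import pandas as pd

# Hypothetical results: one row per (state, candidate).
results = pd.DataFrame({
    "state": ["A", "A", "B", "B", "B"],
    "candidate": ["X", "Y", "X", "Y", "Z"],
    "votes": [100, 250, 300, 120, 90],
})

# idxmax returns, for each state, the index label of the row with the most votes.
top_idx = results.groupby("state")["votes"].idxmax()

# .loc selects those rows by label.
top = results.loc[top_idx].reset_index(drop=True)
print(top)
```

Using `.loc` rather than `.iloc` keeps the lookup correct even if the frame's index is not a plain 0..n-1 range.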

Another quick check can be done [here](https://www.nytimes.com/elections/2016/results/primaries/alaska).

Now, let's merge with the fatalities data,

In [None]:
fatalities_election = pd.merge(us_traffic_fatality_by_state, results_by_state_candidate_top, left_on='state_name', right_on='state')

In [None]:
fatalities_election.head(10)

In [None]:
y0 = fatalities_election['fatality_total'][fatalities_election['candidate'] == 'Hillary Clinton']
y1 = fatalities_election['fatality_total'][fatalities_election['candidate'] == 'Donald Trump']
y2 = fatalities_election['fatality_total'][fatalities_election['candidate'] == 'Bernie Sanders']
y3 = fatalities_election['fatality_total'][fatalities_election['candidate'] == 'Ted Cruz']
y4 = fatalities_election['fatality_total'][fatalities_election['candidate'] == 'John Kasich']

trace0 = go.Box(
    name = 'Hillary Clinton',
    y=y0
)
trace1 = go.Box(
    name = 'Donald Trump',
    y=y1
)
trace2 = go.Box(
    name = 'Bernie Sanders',
    y=y2
)
trace3 = go.Box(
    name = 'Ted Cruz',
    y=y3
)
trace4 = go.Box(
    name = 'John Kasich',
    y=y4
)

layout = go.Layout(
    title = 'Fatalities by Leading Primary Candidate',
    xaxis=dict(title = 'Candidate With the Most Primary Votes'),
    yaxis=dict(title = 'Number of Fatalities'),
    showlegend=False
)

data = [trace0, trace1, trace2, trace3, trace4]
fig = go.Figure(data=data, layout=layout)
iplot(fig, validate=False, filename='votes')

So, it looks like overall, there are slightly more fatalities in **Hillary Clinton**-leaning states.

<a id='Vegetarian&VeganRestaurants'></a>

### Vegetarian & Vegan Restaurants

OK, so this one is a bit nonsensical, but it does show just what you can do in the way of merging datasets. The data contains information on vegetarian and vegan food outlets in the US. Let's have a quick look,

In [None]:
vege = pd.read_csv('../input/vegetarian-vegan-restaurants/vegetarian_restaurants_US_datafiniti.csv')

Let's remove any rows with missing cuisine data and stick to the vegan outlets,

In [None]:
vege = vege[vege.cuisines.notnull()]

In [None]:
vegan = vege[vege['cuisines'].str.contains("Vegan")]

In [None]:
vegan.head()

Now, let's get vegan food-outlet numbers by state (or 'province' in this data),

In [None]:
vegan_count_by_state = vegan.groupby(['province']).agg('address').count()
vegan_count_by_state = vegan_count_by_state.to_frame()
vegan_count_by_state = vegan_count_by_state.reset_index()
vegan_count_by_state.columns = ['province', 'count']

In [None]:
vegan_count_by_state.head()

OK, so that's working. Next I need to get state names,

In [None]:
#Ref: https://gist.github.com/rogerallen/1583593

us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

Let's merge the state data with the vegan data, and then merge the vegan data with the fatalities data,

In [None]:
us_state_abbrev_df = pd.DataFrame(list(us_state_abbrev.items()), columns=['state', 'abb'])

In [None]:
vegan_with_states = pd.merge(vegan_count_by_state, us_state_abbrev_df, left_on='province', right_on='abb')

In [None]:
fatalities_vegan = pd.merge(us_traffic_fatality_by_state, vegan_with_states, left_on='state_name', right_on='state')

In [None]:
fatalities_vegan.head()
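
Because the merges above are inner joins, any `province` value that is not a two-letter state abbreviation drops out without warning. One way to see what would be lost is to map the abbreviations with a dictionary and inspect the rows that fail to map. A small sketch on made-up rows (the `province` values are invented):

```python
import pandas as pd

# Small hypothetical abbreviation lookup (abbreviation -> full name).
abbrev_to_name = {"TX": "Texas", "FL": "Florida"}

# Hypothetical outlet counts, including values that are not state abbreviations.
counts = pd.DataFrame({
    "province": ["TX", "FL", "Texas", "ON"],
    "count": [5, 7, 1, 2],
})

# map leaves NaN wherever the key is missing from the dictionary.
counts["state"] = counts["province"].map(abbrev_to_name)
unmapped = counts[counts["state"].isna()]
print(unmapped)
```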

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(fatalities_vegan['count'],fatalities_vegan['fatality_total'])
line = slope*fatalities_vegan['count']+intercept

trace1 = go.Scatter(
    x = fatalities_vegan['count'],
    y = fatalities_vegan['fatality_total'],
    text = fatalities_vegan['state_name'],
    mode = "markers")

trace2 = go.Scatter(
    x=fatalities_vegan['count'],
    y=line,
    mode='lines',
    hoverinfo='none',
    marker=dict(color='red'),
    name='Fit'
    )

layout = go.Layout(
    title = 'Number of Vegan Food Outlets vs Number of Fatalities by State',
    xaxis=dict(title = 'Number of Vegan Food Outlets'),
    yaxis=dict(title = 'Number of Fatalities'),
    showlegend=False
)

data = [trace1, trace2]
fig = go.Figure(data=data, layout=layout)
iplot(fig, validate=False, filename='vegan')

I can see the headlines now ... ***"Does a lack of animal protein hinder your driving ability?"***, and ***"Should insurance premiums be higher for those who drink almond milk?"***. Of course, what's almost certainly happening here is that the number of vegan food outlets is simply a proxy for state population (the same may be true of the voting data previously).

Please note that I'm not trying to make light of the serious nature of the fatalities data. But in this era of **fake news**, the above plot is probably not beyond the realms of some media outlets.

<a id='StatePopulations'></a>

### State Populations

On the subject of state population, the 2016 election results dataset actually includes state population data (albeit the most recent being from 2014). We can use this to normalise any of the above data, if we are interested in per capita questions. For example, going back to the core fatalities data, we had a table of the number of deaths per state. Let's adapt that for per capita and per 100,000,

In [None]:
pop = pd.read_csv('../input/2016-us-election/county_facts.csv')

Let's sum by state, and then do a quick overall sum to make sure the total makes sense (population of the US was around 318 million in 2014 according to a quick Google search),

In [None]:
pop_state = pop.groupby(['state_abbreviation']).agg('PST045214').sum()
pop_state = pop_state.to_frame()
pop_state = pop_state.reset_index()
pop_state['PST045214'].sum()

OK! Now let's merge and create the new columns,

In [None]:
pop_state_name = pd.merge(us_state_abbrev_df, pop_state, left_on='abb', right_on='state_abbreviation')

In [None]:
fatalities_with_pop = pd.merge(us_traffic_fatality_by_state, pop_state_name, left_on='state_name', right_on='state')

In [None]:
fatalities_with_pop = fatalities_with_pop.drop(['abb', 'state', 'state_abbreviation'], axis=1)

In [None]:
fatalities_with_pop['Fatalities Per Capita'] = fatalities_with_pop['fatality_total'] / fatalities_with_pop['PST045214']

In [None]:
fatalities_with_pop['Fatalities Per 100k'] = fatalities_with_pop['Fatalities Per Capita'] * 100000

Finally, let's order by fatalities per 100,000,

In [None]:
fatalities_with_pop.sort_values('Fatalities Per 100k', ascending=False)

It looks like **Mississippi** was the worst (keeping in mind this is 2016 fatality data and 2014 population data). According to [this article](https://www.huffingtonpost.com/entry/most-dangerous-states-to-drive_us_5736185de4b060aa781a3eec), Mississippi was the 2nd highest in 2016, with Wyoming taking the top spot.
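
To show why the normalisation matters, here is a worked example on two made-up states: the larger one has far more deaths in total, but the smaller one has the higher rate per 100,000 people.

```python
import pandas as pd

# Hypothetical states with invented totals and populations.
toy = pd.DataFrame({
    "state_name": ["Small", "Large"],
    "fatality_total": [150, 3000],
    "population": [600_000, 25_000_000],
})

# Rate per 100,000 residents = deaths / population * 100,000.
toy["fatalities_per_100k"] = toy["fatality_total"] / toy["population"] * 100_000

ranked = toy.sort_values("fatalities_per_100k", ascending=False).reset_index(drop=True)
print(ranked)
```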

<a id='ClosingThoughts'></a>

### Closing Thoughts

We've seen how to query Kaggle's BigQuery datasets, and how to merge that data with other Kaggle datasets. This initial foray into the data has revealed some real factors (rain, snow, poor tires, alcohol) and some not-so-real factors (vegan Democrats) when it comes to traffic fatalities.

Keep an eye out in the future for more US-related Kaggle datasets to take this sort of analysis further.