# DCFemTech Hack for Good: Gun Violence in the United States

#### What is DCFemTech?
DCFemTech is a coalition of women leaders aimed at amplifying the efforts of women in tech organizations, sharing resources, and bringing leaders together to close the gender gap.

#### What is Hack for Good?
Hack for Good is a 2-day event bringing folks together to work on projects that support our community. Progress on projects will be presented in a science fair format at the conclusion of the event. This is a different kind of hackathon. No competition, no award money; just bringing people together to learn, make connections, and build something awesome!

### About the Challenge and the Dataset

##### BACKGROUND 
Gun violence in the United States has been steadily rising in the last decade. A recent study published in the American Journal of Medicine found that Americans are roughly 25 times more likely to be killed in a gun homicide than people in other high-income countries.

##### DATASET 
The dataset chosen for this challenge is a comprehensive record of over 260,000 U.S. gun violence incidents between January 2013 and March 2018. 

* https://www.kaggle.com/jameslko/gun-violence-data

* https://github.com/spacecadetjo/GunViolenceInAmerica/tree/master/Data

##### THE PROBLEM 
With the most complete record of gun violence in America at our fingertips, what information can we glean about this epidemic? Using a combination of geospatial and time series analysis, can we draw meaningful conclusions that can inform public health and government policies? Working together with facilitators from Booz Allen Hamilton's Data Science team, this notebook presents the findings of new data scientists of various skill levels on the gun violence epidemic.

## Getting Started

First we need to import the packages we'll be using during the exercise. This has already been done for you: just run the cell to import them. Remember, if you're using Anaconda, you'll need to create a new kernel (follow the instructions in the participant's guide).

In [None]:
from IPython.display import display

import pandas as pd
import numpy as np
import datetime as dt

import plotly
# plotly.plotly moved to the separate chart_studio package in plotly 4; it is not needed for offline plots
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()

import matplotlib.pyplot as plt
import seaborn as sns

import folium
from folium.plugins import MarkerCluster

Now that the packages are imported, you'll need to read in the data from the CSVs. If you're using Anaconda, you'll need to create a folder called Data and point pandas at that folder. If you get stuck, ask a facilitator for help.

Once you've read in the data, use the `.head()` method to inspect the beginning of the data. Use the `.columns` attribute (no parentheses) to get a list of all the column names. Use `.describe()`, `.info()`, and `.unique()` (on a single column) to do more basic exploration before we begin cleaning the data.

##### Remember: you can add new cells by going to insert >> insert cell below. Cell types can be changed under the "cell" menu.

In [None]:
# Replace filename1 with either a relative or absolute path to the gun violence CSV
guns = pd.read_csv(filename1)
# Replace filename2 with either a relative or absolute path to the victims CSV
victims = pd.read_csv(filename2)
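
The inspection methods mentioned above can be tried on a small stand-in frame before running them on the full dataset. This is only a sketch: the `toy` frame and its values are invented, and `n_killed` is an assumed column name that may not match your CSV.

```python
import pandas as pd

# Tiny stand-in for the incidents table; the values are invented for illustration.
toy = pd.DataFrame({
    "state": ["Ohio", "Texas", "Ohio", None],
    "n_killed": [0, 2, 1, 0],  # assumed column name
})

print(toy.head())             # first rows of the frame
print(toy.columns.tolist())   # .columns is an attribute, so no parentheses
print(toy.describe())         # summary statistics for the numeric columns
print(toy["state"].unique())  # .unique() is called on a single column (a Series)
toy.info()                    # dtypes and non-null counts, printed directly
```

Swapping `toy` for `guns` or `victims` should give the same kinds of summaries for the real data.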

## Cleaning the data
It's exceedingly rare that a clean dataset will be available. There are advanced techniques for cleaning and normalizing messy datasets, but the primary goal should be to tidy the data. The three principles of tidy data are as follows:

* Each column represents a variable.
* Each row represents an observation.
* Similar data grouped together is a dataset.

The gun-violence.csv file is not a tidy dataset in that each incident has the victims grouped together on the line. This is fine if you want to study the incidents at a macro level, but what if we want to know more about the victims?

The task of parsing out each individual victim has been done for you ahead of time. The file victims.csv has the participant data already parsed out. You can merge it back to the original dataset by doing a .merge() on the case_id.
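
A minimal sketch of that merge, using invented rows in place of the two CSVs. Only `case_id` and `participant_age` come from this notebook; the other names and the choice of a left join are assumptions you may want to change.

```python
import pandas as pd

# Invented rows standing in for the incidents and victims files.
incidents = pd.DataFrame({
    "case_id": [1, 2, 3],
    "state": ["Ohio", "Texas", "Maine"],
})
people = pd.DataFrame({
    "case_id": [1, 1, 2],
    "participant_age": [34.0, 19.0, 52.0],
})

# One row per participant, with the incident columns repeated for each.
# how="left" keeps incidents with no parsed participants (their age is NaN).
merged = incidents.merge(people, on="case_id", how="left")
print(merged)
```

With the real data the equivalent call would probably look like `guns.merge(victims, on="case_id", how="left")`, assuming both files carry that column.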

### Outliers and Nulls
Start by checking the numerical data, such as the victim's age. Is any data missing? Are there any outliers? Shown here we see that the oldest person is 311 years old. Does that make sense? What other areas can you check to see where data is missing or not statistically representative? One technique is to remove outlying observations that contain spurious or erroneous data. Another is to dig deeper into that data point and check it against the source. For example, open the incident_url for the 311-year-old victim: is this a typo, or does the incident record from the Gun Violence Archive support it? If the error comes from the Gun Violence Archive, are there any other errors? Should the whole incident be removed, or just that one individual participant?

Work your way through cleaning data using techniques covered in the slides. If in doubt, ask a facilitator for help.

In [None]:
#who is the oldest victim?
victims['participant_age'].max()

In [None]:
# Show number of nulls per column
null_sum = guns.isnull().sum()
print(null_sum)

# Plot number of nulls per column 
null_sum.plot(kind='bar')
plt.title('Number of null values per column')
plt.show()

Use the cheat sheets to help remove the outlying information. Remember, discussion is key here. When does information become an outlier? Also, consider creating new dataframes containing only information you want to explore further. We'll do more of that in the afternoon of day one.
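
One possible way to separate implausible ages from the rest is sketched below. The ages are invented, and the bounds `MIN_AGE` and `MAX_AGE` are assumptions chosen for illustration, not values from the dataset or the slides.

```python
import pandas as pd

# Invented ages, including a missing value and an implausible one.
ages = pd.DataFrame({"participant_age": [25.0, 40.0, None, 311.0, 17.0]})

# Assumed plausibility bounds; agree on these with your team before using them.
MIN_AGE, MAX_AGE = 0, 110

# between() is False for missing values, so NaN rows are not marked plausible.
plausible = ages["participant_age"].between(MIN_AGE, MAX_AGE)

# Keep the suspicious rows in their own frame so they can be checked against the source.
flagged = ages[ages["participant_age"].notna() & ~plausible]
cleaned = ages[plausible].copy()

print(flagged)
print(cleaned["participant_age"].max())
```

Keeping `flagged` around, rather than dropping rows silently, makes it easier to discuss whether each one is a typo or a real record.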

## Exploratory Data Analysis (EDA)

By now you've already done some EDA work to prep the data and clean it, but let's dive deeper into the data to see what we can find. As you've looked at the data, perhaps some questions have come to mind while you worked. If not, here are some simple questions you can work to find the answer for.

* What are the mean and median ages of victims?
* Compare the suspects/victim ratio by gender.
* Plot shooting incidents over the last year.
* What state had the most shootings?
* Is there a correlation between guns used and number of victims?
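
A few of these questions can be approached with one-liners. The sketch below uses invented rows, and the column names `n_guns_involved`, `n_killed`, and `n_injured` are assumptions about the schema that you should check against `guns.columns`.

```python
import pandas as pd

# Invented incidents; column names are assumptions about the dataset's schema.
toy = pd.DataFrame({
    "state": ["Ohio", "Texas", "Ohio", "Maine", "Texas", "Ohio"],
    "n_guns_involved": [1, 2, 1, 1, 3, 2],
    "n_killed": [0, 1, 0, 1, 2, 1],
    "n_injured": [1, 1, 2, 0, 2, 1],
})
ages = pd.Series([19, 23, 31, 45, 62], name="participant_age")

# Mean and median age.
print(ages.mean(), ages.median())

# State with the most incidents.
top_state = toy["state"].value_counts().idxmax()
print(top_state)

# Pearson correlation between guns involved and total victims per incident.
toy["n_victims"] = toy["n_killed"] + toy["n_injured"]
r = toy["n_guns_involved"].corr(toy["n_victims"])
print(r)
```

A correlation on six invented rows means nothing by itself; the point is only the shape of the calculation.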

### Bringing in New Data

As you may have noticed, just using gun violence data alone doesn't give a complete picture. Consider what types of data you might need to answer more advanced data analysis questions. Some examples include:

* Census data
* Location information for night clubs, hospitals, gun stores, etc.
* Gun laws by state
* Calendar data
* Intimate partner violence datasets

As you explore the data, you may come up with more questions you wish to answer. We'll cover more advanced techniques later, so keep those questions in mind.
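
As one illustration of bringing in outside data, the sketch below joins a population table to incident counts to get a per-capita rate. Every number and state name here is an invented placeholder; real figures would have to come from a source such as the census.

```python
import pandas as pd

# Invented incident counts and populations for placeholder states.
incident_counts = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "incidents": [5000, 1200, 300],
})
population = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "population": [10_000_000, 1_000_000, 600_000],
})

# Join the two tables on the state name, then compute incidents per 100,000 people.
rates = incident_counts.merge(population, on="state", how="left")
rates["per_100k"] = rates["incidents"] / rates["population"] * 100_000

# Sorting by rate can give a different ranking than sorting by raw counts.
print(rates.sort_values("per_100k", ascending=False))
```

In this toy example the state with the most incidents is not the one with the highest rate, which is the kind of difference per-capita figures can reveal.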



In [None]:
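# Show number of incidents per state, largest first
# (size() counts rows per state; count() would instead report non-null values for every column)
guns.groupby('state').size().sort_values(ascending=False)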

There are two major ways we can explore this dataset: time series analysis and geospatial analysis. Both can be combined to really drill down and isolate problematic areas. We'll begin by doing time series analysis and then move on to plotting geospatially. Finally, you'll have time to draw your own conclusions through exploration and visualization.

## Time Series Analysis

As you continue to explore the data, consider questions about the past, present, and future. 

* When do most shootings take place? Monthly trends? Yearly trends?
* What are the trends in shootings?
* What day of the week do most shootings occur?

Consider what kinds of actions could be taken to reduce deaths based on your time series predictions. What conclusions can we begin to draw? For example, do shootings fall on particular days of the week? Are there any clear outliers in the number of shootings? What happened on those dates? Does the number of incidents tell us something different from how deadly a single incident is? What happens if you plot deaths over time versus incidents over time?

All of these questions can be answered using time series analysis. As trends are discovered, you can begin to make predictions. For example, notice that July 4th is a very high-incident day in America. Why? Is it because it's Independence Day? Are there any other reasons? What kind of datasets would you need to do that type of analysis, and can you find them?
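
To compare incidents with deaths over time and to spot unusually busy dates, something like the following could be a starting point. The rows are invented and `n_killed` is an assumed column name.

```python
import pandas as pd

# Invented incident dates and death counts; n_killed is an assumed column name.
toy = pd.DataFrame({
    "date": ["2017-07-04", "2017-07-04", "2017-07-04", "2017-07-15",
             "2017-08-01", "2017-08-20"],
    "n_killed": [0, 1, 0, 2, 0, 1],
})
toy["date"] = pd.to_datetime(toy["date"])

# Incidents and deaths per calendar month, side by side.
monthly = toy.groupby(toy["date"].dt.to_period("M")).agg(
    incidents=("date", "size"),
    deaths=("n_killed", "sum"),
)
print(monthly)

# Dates with the most incidents, as candidates for a closer look.
per_day = toy["date"].value_counts()
print(per_day.head(3))
```

Calling `monthly.plot()` on the real data would draw both series on one chart, which makes it easier to see whether they move together.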


In [None]:
#one of the first things we need to do is convert the string objects to datetime objects. Thankfully, Pandas and Datetime
#work well together

# Convert date to pandas datetime object 
guns['date'] = pd.to_datetime(guns['date'])

In [None]:
# What day of the week do most incidents happen?
# Note: pandas numbers weekdays from Monday=0 through Sunday=6
guns['weekday'] = guns['date'].dt.weekday

# To do this we'll need to make a new dataframe, so we'll make two series and concat them together.
# The labels must follow the Monday-first numbering so they line up with the counts.
days = pd.Series(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'], name="day_of_week")
incidents = guns['weekday'].value_counts().sort_index().rename("incidents")
inc = pd.concat([days, incidents], axis=1)
# Print the dataframe to make sure it works
print(inc)

# Now we can plot the data
ax = sns.barplot(x="day_of_week", y="incidents", data=inc)

Now it's your turn. Use the cheat sheets, your facilitators, and help from the other teams to answer questions about when incidents occur. Consider merging the victim data in. When are shootings deadliest? When are they least deadly?
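
One way to start on the "deadliest" question is to compare deaths per incident across weekdays, as sketched here with invented rows. `n_killed` is an assumed column name, and deaths per incident is only one of several reasonable definitions of "deadliest".

```python
import pandas as pd

# Invented incidents: two on a Monday, two on a Saturday, one on a Sunday.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-06",
                            "2018-01-06", "2018-01-07"]),
    "n_killed": [0, 1, 2, 1, 0],  # assumed column name
})

# day_name() avoids hand-maintained label lists (weekday 0 is Monday in pandas).
toy["day_name"] = toy["date"].dt.day_name()

# Count incidents, total deaths, and average deaths per incident for each day.
by_day = toy.groupby("day_name")["n_killed"].agg(["count", "sum", "mean"])
by_day = by_day.rename(columns={"count": "incidents", "sum": "deaths",
                                "mean": "deaths_per_incident"})
print(by_day.sort_values("deaths_per_incident", ascending=False))
```

The same grouping works with `dt.month` or `dt.hour` if you want to look at other time scales.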

## Geospatial Analysis
Just as we can find trends based on time, we can look at the locations where incidents take place. We have location information in the form of states and counties or cities, but also in the form of Latitude and Longitude. We'll walk through using Plotly to visualize state information, and then Folium to drill down on latitude and longitude to discover how trends in shootings map out to regional areas in finer detail. As you work, consider what questions you'd like to answer.

* Where do most shootings take place?
* What is in the vicinity of the places where shootings take place? What other data can you bring in to get answers?
* What's the deadliest county? What's the least deadly?
* What are the characteristics of victims by state?
* What are the characteristics of shooters by state?

Consider how areas differ geospatially. If you were trying to dispel myths about gun violence, what kind of geospatial data would you need? How can you communicate and inform policy makers and the public about regions where gun violence occurs?


In [None]:
# For each state, describe() shows the most frequent city or county (top) and how often it appears (freq)
guns.groupby('state')["city_or_county"].describe()

### Using Plotly to Graph State Data

This is a fairly simple way to plot geospatial data, and if Folium proves to be difficult to use, Plotly can be used to do simple geospatial analysis.

In [None]:
# State Wise Number of Gun Violence Incidents

states_df = guns['state'].value_counts()

#create a new dataframe and begin adding data to it.
statesdf = pd.DataFrame()
statesdf['state'] = states_df.index
statesdf['counts'] = states_df.values

#plot parameters
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

state_to_code = {'District of Columbia': 'DC', 'Mississippi': 'MS', 'Oklahoma': 'OK', 'Delaware': 'DE',
                 'Minnesota': 'MN', 'Illinois': 'IL', 'Arkansas': 'AR', 'New Mexico': 'NM', 'Indiana': 'IN', 
                 'Maryland': 'MD', 'Louisiana': 'LA', 'Idaho': 'ID', 'Wyoming': 'WY', 'Tennessee': 'TN', 'Arizona':'AZ',
                 'Iowa': 'IA', 'Michigan': 'MI', 'Kansas': 'KS', 'Utah': 'UT', 'Virginia': 'VA', 'Oregon': 'OR', 
                 'Connecticut': 'CT', 'Montana': 'MT', 'California': 'CA', 'Massachusetts': 'MA', 'West Virginia': 'WV', 
                 'South Carolina': 'SC', 'New Hampshire': 'NH', 'Wisconsin': 'WI', 'Vermont': 'VT', 'Georgia': 'GA', 
                 'North Dakota': 'ND', 'Pennsylvania': 'PA', 'Florida': 'FL', 'Alaska': 'AK', 'Kentucky': 'KY', 'Hawaii': 'HI', 
                 'Nebraska': 'NE', 'Missouri': 'MO', 'Ohio': 'OH', 'Alabama': 'AL', 'Rhode Island': 'RI', 'South Dakota': 'SD', 
                 'Colorado': 'CO', 'New Jersey': 'NJ', 'Washington': 'WA', 'North Carolina': 'NC', 
                 'New York': 'NY', 'Texas': 'TX', 'Nevada': 'NV', 'Maine': 'ME'}
statesdf['state_code'] = statesdf['state'].apply(lambda x : state_to_code[x])

#data to be passed in by plotly
data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = statesdf['state_code'],
        z = statesdf['counts'],
        locationmode = 'USA-states',
        text = statesdf['state'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Gun Violence Incidents")
        ) ]

layout = dict(
        title = 'State wise number of Gun Violence Incidents',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),)
             )
# and plot time!
fig = dict(data=data, layout=layout)
iplot(fig, filename='d3-choropleth-map.html')

### Using Folium to See Individual Incidents

If you've installed Folium, you can use it to create pointers on a map for each incident, as well as cluster data.

In [None]:
# Parse data set for plotting
df_gun_map = guns[['date','state','latitude', 'longitude']].copy()
# Drop any rows with NaN values (recent pandas versions reject passing both how and thresh)
df_gun_map = df_gun_map.dropna(axis=0, how='any')

In [None]:
# Isolate a single state -- Folium can only handle a couple hundred markers at a time, so find ways to segregate data.
state = 'Rhode Island'
df_gun_map_RI = df_gun_map[df_gun_map["state"] == state].copy()
#get rid of the date and state columns
df_gun_map_RI.drop(['date','state'], axis=1, inplace=True)

# Convert dataframe of coordinates to list for Folium
locationlist = df_gun_map_RI.values.tolist()
# Build base map centered on the first incident and show it
gun_map = folium.Map(location=locationlist[0], zoom_start=7)
gun_map

In [None]:
# Add a marker for each incident location
for location in locationlist:
    folium.Marker(location).add_to(gun_map)
gun_map

In [None]:
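# Start from a fresh base map so the individual markers added above are not drawn a second time
gun_map = folium.Map(location=locationlist[0], zoom_start=7)

# Initialize clusters
marker_cluster = MarkerCluster().add_to(gun_map)

# Display locations in clusters
for location in locationlist:
    folium.Marker(location).add_to(marker_cluster)
gun_map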

Now that you have the basics of folium down, how can you use both Geospatial and Time Series analysis to drill down and find trends?
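
One simple way to combine the two views is a table of incident counts by state and year, sketched below on invented rows. It uses only the `date` and `state` columns that appear earlier in this notebook.

```python
import pandas as pd

# Invented incidents in two states across two years.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-01", "2016-09-10", "2017-01-05",
                            "2017-06-30", "2017-11-11"]),
    "state": ["Ohio", "Maine", "Ohio", "Ohio", "Maine"],
})
toy["year"] = toy["date"].dt.year

# Rows are states, columns are years, cells are incident counts.
by_state_year = pd.crosstab(toy["state"], toy["year"])
print(by_state_year)

# Year-over-year change per state highlights where counts are rising.
print(by_state_year.diff(axis=1))
```

A similar idea applies to the maps: filtering the rows to a date range before building the list of coordinates would show how the clusters shift over time.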

## Your Turn:

Now that you've walked through the basics of time series and geospatial analysis, spend time exploring the data and doing your own analysis. Consider what questions you'd like answered and begin answering them.

## Bonus: Machine Learning and Predictive Analysis

Where will the next shooting happen? When should hospitals be staffed for shootings? When should police and emergency responders be on alert for incidents? Who needs intervention to minimize gun violence? How will these trends continue into the future?

All these are questions that plague people working to save lives and end the gun violence epidemic. Machine learning is a technique that uses the past to guess at what the future will hold. We can apply it here in a variety of ways. There are two main types of machine learning: supervised and unsupervised. We'll work primarily with supervised machine learning for this dataset. Beyond that, the main tasks for the machines include categorizing data based on its features and predicting future values.

![Python's ML package, scikit-learn, has its own cheat sheet](http://scikit-learn.org/stable/_static/ml_map.png)
http://scikit-learn.org/stable/_static/ml_map.png

You may have seen while doing your exploratory analysis that some trends are clearer than others. There's always a fuzzy area, and machine learning works to handle that ambiguity in ways that humans are not so great at. Of course, there are plenty of legal and ethical questions surrounding the use of machine learning: input biases lead to output biases, and creating fairness in artificial intelligence is a hot topic. For now, we'll use machines to see if we can draw any new insights from our data.
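
Before fitting any model, it can help to have a simple baseline to beat. The sketch below is not a machine learning model: it is a seasonal-naive forecast that predicts each month's count from the same month one year earlier, run on invented monthly counts. Any model you train later could be compared against its error.

```python
import pandas as pd

# Invented monthly incident counts for two years.
months = pd.period_range("2016-01", "2017-12", freq="M")
counts = pd.Series(
    [100, 90, 110, 120, 130, 150, 170, 160, 140, 120, 110, 105,
     104, 95, 118, 125, 138, 155, 180, 165, 150, 126, 112, 110],
    index=months, name="incidents",
)

# Seasonal-naive forecast: predict each month with the value from 12 months earlier.
forecast = counts.shift(12)

# Score only the months that have a forecast (the second year).
errors = (counts - forecast).dropna().abs()
mae = errors.mean()
print(f"Mean absolute error of the baseline: {mae:.1f}")
```

If a supervised model cannot do better than this on held-out months, the extra complexity is probably not buying much.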

## Write Up

What did you find during this exercise? What insights would you like to share with the world from doing this exploration? Here is your chance to make conclusions and present your findings. When you're finished, you may submit your kernel/notebook to Kaggle.