# Introduction
This notebook is a basic exploratory data analysis of the death penalty dataset, organized around a few questions that interested me.

Dataset accessed at:
https://www.kaggle.com/usdpic/execution-database

The dataset is a record of every execution in the United States since the Supreme Court reinstated the death penalty in 1976.

Per the dataset description on Kaggle:
"The information in this database was obtained from news reports, the Department of Corrections in each state, and the NAACP Legal Defense Fund."

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
data = pd.read_csv("../input/database.csv")

# Initial Exploration

In [None]:
data.head()

In [None]:
print(data.shape)
print(data.dtypes)

An initial look shows that the dataset has 1442 records with 17 features. Only two of the features are numeric: Age and Victim Count.

In [None]:
print(data.isnull().sum())

Almost all of the data is present: only a few features are missing values, and only for a small number of records. If needed, we can drop these records before building a model; however, since the focus here is exploratory, we will leave them in the dataset.
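If a later modeling step does need complete rows, one way to handle it is sketched below on a small made-up frame. The values are illustrative assumptions, not rows from the dataset. The sketch lists only the columns that have gaps and then drops incomplete rows into a separate frame, so the original is left untouched.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values here are made up for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [34, np.nan, 51, 42],
    'Victim Count': [1, 2, np.nan, 1],
})

# Show only the columns that actually contain missing values.
missing = toy.isnull().sum()
print(missing[missing > 0])

# Drop incomplete rows into a separate frame, leaving the original untouched.
toy_complete = toy.dropna()
print(toy.shape, toy_complete.shape)
```

Keeping the cleaned rows in a separate variable means the exploratory plots can still use every record.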

# Analysis Questions
Given my initial knowledge of the dataset, a few basic questions from the debates surrounding the death penalty seem worth examining:

1) How has the number of people executed since reinstatement in 1976 trended over time?

2) How does the application of the death penalty relate to race?

3) Are there any particular states that seem to execute more people than others?


# 1) Executions Over Time Since Reinstatement

In [None]:
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
executions_by_year = data.groupby(['Year'])['Name'].count()
ax = executions_by_year.plot.bar(figsize=(8,6), title="Total Number of Executions in US by Year Since 1976")
label = plt.ylabel("Total # of Executions Since 1976")
plt.show()
# Label-based lookup of the count for 1999
print("Peak Execution Number in 1999: %d" % executions_by_year.loc[1999])

The number of executions has been decreasing since a peak in 1999. A peak that late is perhaps later than one might expect. One could speculate that it reflects the length of the appeals process, the time it took states and juries to begin applying the penalty, an expansion in the number of special circumstances, or an actual increase in violent crime. Further investigation would be needed to draw any firm conclusions.
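The peak year can also be read from the series itself instead of being typed in by hand. The following is a minimal sketch on made-up yearly counts (`toy_counts` and its values are hypothetical); the same two calls should apply to `executions_by_year`.

```python
import pandas as pd

# Made-up yearly counts, used only to illustrate the lookup.
toy_counts = pd.Series({1997: 5, 1998: 7, 1999: 9, 2000: 8}, name='Executions')

peak_year = toy_counts.idxmax()   # index label of the largest value
peak_count = toy_counts.max()     # the largest value itself
print("Peak year: %d (%d executions)" % (peak_year, peak_count))
```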

# 2) The Death Penalty and Race

In [None]:
executions_by_race = data.groupby(['Race'], as_index=False)['Name'].count()
executions_by_race.rename(columns={'Name':'Total Executed'}, inplace=True)
ax = executions_by_race.plot.bar(x='Race', y='Total Executed', title='Number of Executions Since 1976 \nBy Race of Person Executed')
label = plt.ylabel("Total # of Executions Since 1976")
plt.show()

In terms of raw totals, more white people have been executed than people of any other race since 1976. However, when these numbers are scaled by the size of each race's population, the data paints a different picture.

As a basic comparison, I gathered population numbers by race from the 2010 census, accessed at http://www.infoplease.com/ipa/A0762156.html. I then divided the total number of executions since 1976 by the 2010 census population for each race and plotted the number of executions per million people of that race.


In [None]:
census_2010_numbers = {'Asian':14465124,'Black':37685848,'Latino':50477594, 'Native American':2247098, 'Other':604265, 'White':196817552}
# Executions per million people of each race, using the 2010 census counts
executions_by_race['Total Executed/Million Race Pop.'] = executions_by_race['Total Executed'] * 1000000.0 / executions_by_race['Race'].map(census_2010_numbers)
ax2 = executions_by_race.plot.bar(x='Race', y='Total Executed/Million Race Pop.', title='Number of Executions Since 1976 \nBy Race of Person Executed Per Million Race Population')
label = plt.ylabel("Total # of Executions Since 1976 Per Million Race Pop.")
plt.show()

The second graph suggests there may be an imbalance in the application of the death penalty against Black people relative to population. However, to take this analysis further, one would also want to compare against offender populations broken down by race. In a brief search, I did not find an offender dataset that uses the same race categories as the current dataset. A dataset with aligned race categories would be needed for further exploration of the race data.

# 3) States and Executions

In [None]:
executions_by_state = data.groupby(['State'])['Name'].count()
ax = executions_by_state.plot.bar(x='State', figsize=(8,6), title="Total # of Executions Since 1976")
txt = plt.ylabel("Total # of Executions Since 1976")

The results above make it clear that, in pure volume, Texas has been the biggest user of the death penalty of any state since its reinstatement.

However, Texas also has a relatively large population. To dig a little deeper, I converted the counts to executions per million state residents, based on the 2010 census. Census data was accessed at http://www.indexmundi.com/facts/united-states/quick-facts/all-states/population-2010#table and converted to a Python dictionary.

In [None]:
census_data_state = {'WA': 6724540.0, 'DE': 897934.0, 'FE': np.nan, 'FL': 18801310.0, 'WY': 563626.0, 'NM': 2059179.0, 'TX': 25145561.0, 'LA': 4533372.0, 'NC': 9535483.0, 'NE': 1826341.0, 'TN': 6346105.0, 'PA': 12702379.0, 'NV': 2700551.0, 'VA': 8001024.0, 'CO': 5029196.0, 'CA': 37253956.0, 'AL': 4779736.0, 'AR': 2915918.0, 'IL': 12830632.0, 'GA': 9687653.0, 'IN': 6483802.0, 'AZ': 6392017.0, 'ID': 1567582.0, 'CT': 3574097.0, 'MD': 5773552.0, 'OK': 3751351.0, 'OH': 11536504.0, 'UT': 2763885.0, 'MO': 5988927.0, 'MT': 989415.0, 'MS': 2967297.0, 'SC': 4625364.0, 'KY': 4339367.0, 'OR': 3831074.0, 'SD': 814180.0}
df = executions_by_state.to_frame()
# Populations as a Series indexed by state code; 'FE' has no population (NaN), so its rate is NaN
population = pd.Series(census_data_state)
# Division and assignment both align on the state index
df['Executions/1000000 state population'] = df['Name'] * 1000000 / population
ax = df['Executions/1000000 state population'].plot.bar(title="Executions Since 1976 Per Million State Population", figsize=(8,6))
label = plt.ylabel('Executions/1000000 state population')

Texas is still one of the most frequent users of the death penalty after adjusting for state population. On that basis, however, Oklahoma is the largest user.
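Sorting the per-capita values makes this kind of ranking easier to read than scanning the bar chart. Below is a minimal sketch with made-up counts and populations for three hypothetical state codes (`AA`, `BB`, `CC`); none of the numbers come from the dataset or the census.

```python
import pandas as pd

# Made-up execution counts and populations for three hypothetical states.
toy_exec = pd.Series({'AA': 50, 'BB': 10, 'CC': 5})
toy_pop = pd.Series({'AA': 25000000.0, 'BB': 2000000.0, 'CC': 5000000.0})

# Executions per million residents; the division aligns on the state index.
per_million = toy_exec / toy_pop * 1000000

# Highest per-capita rate first.
ranked = per_million.sort_values(ascending=False)
print(ranked)
```

In this toy example the state with the most executions in absolute terms (`AA`) is not the one with the highest rate (`BB`), which is the same kind of reordering seen between the two state charts.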

# Summary and Notes for Further Exploration

This is only an initial pass to get a feel for the dataset and some of the trends and areas of exploration that it could offer. The death penalty is an extremely complex topic, and much deeper analysis could be conducted. The normalizing data that I used was simply the most easily accessible census data. More targeted normalization with appropriate historical data for each year could be very helpful. Also, as noted above, some of the analysis concerning the death penalty and race could benefit from a crime dataset that aligns with the feature values in this dataset.

There are a few additional areas that I would be interested in exploring at a later date:

1) Does the sex of the victim appear to correlate with application of the death penalty? What about race?

2) What methods of execution are most commonly used? How has that changed over time?
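As a starting point for the second question, a cross-tabulation of method by year would give both views at once. The sketch below uses made-up records; the `Method` column name and its values are assumptions and should be checked against `data.dtypes` before running this on the real data.

```python
import pandas as pd

# Made-up records; the 'Method' column name and its values are assumptions.
toy = pd.DataFrame({
    'Year': [1980, 1980, 1990, 1990, 1990],
    'Method': ['Electrocution', 'Electrocution', 'Lethal Injection',
               'Lethal Injection', 'Electrocution'],
})

# Overall frequency of each method.
method_counts = toy['Method'].value_counts()
print(method_counts)

# Rows are years, columns are methods, cells are counts.
method_by_year = pd.crosstab(toy['Year'], toy['Method'])
print(method_by_year)
```

The resulting table could be passed to `plot.bar(stacked=True)` to show how the mix of methods shifts from year to year.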