In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn import cluster


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

df = pd.read_csv('../input/Mass Shootings Dataset.csv',encoding = "ISO-8859-1",parse_dates=['Date'])
print(df.shape) 

# Any results you write to the current directory are saved as output.

**Notes on issues with the data quality:**
It is not my intention to criticise or embarrass anyone but it is important to point out problems with the data. 

I have noticed multiple discrepancies:
There are duplicate entries: for example, S# = 227 and S# = 228 are the same incident, as are 301 and 302, plus many others. However, duplicates can have different values for the same fields. For example, S# 176 and 177 are duplicates, since they refer to the same event, but in the first row the number of people killed is put at 12, injured at 8 and total at 20, whereas for S# = 177 the numbers are 13, 3 and 15. 

Latitude and Longitude values are also different for duplicate incidents. 

I would also question the number of incidents in the dataset. There are 398 rows, including duplicates. The site http://www.shootingtracker.com/ lists 383 mass shootings in the USA in 2016 alone. I think this dataset may give a false impression about the scale of the mass shooting problem in the USA.

Eliminating the duplicates is tricky: the latitudes and longitudes are often different for rows that describe the same incident, so I can't use those columns to identify duplicates by exact match. Instead I will assume that the number of occasions when there was more than one mass shooting on a given date in this dataset is small. I will then eliminate duplicates based on the Date column alone and hope that I don't lose too much good data. A more sophisticated approach could also look at the latitude and longitude values: if two rows share a date and their coordinates are within some error band (say ±1%), then they are treated as duplicates. 
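The more sophisticated approach mentioned above could be sketched roughly as follows. This is only an illustration on made-up rows, not something I have run against the real dataset. The helper name `drop_near_duplicates` and the tolerance of 0.5 degrees are my own assumptions; I use an absolute tolerance in degrees rather than a percentage, because 1% of a longitude near -120 is already more than a degree.

```python
import pandas as pd

def drop_near_duplicates(frame, tol_deg=0.5):
    """Hypothetical helper: within each date, keep a row only if no
    already-kept row on that date lies within tol_deg degrees of it.

    Assumes a unique index and no missing dates. Rows with missing
    coordinates never compare as 'close', so they are always kept.
    """
    keep = []
    for _, same_day in frame.groupby('Date', sort=False):
        kept_coords = []
        for idx, row in same_day.iterrows():
            lat, lon = row['Latitude'], row['Longitude']
            is_dup = any(
                abs(lat - k_lat) <= tol_deg and abs(lon - k_lon) <= tol_deg
                for k_lat, k_lon in kept_coords
            )
            if not is_dup:
                keep.append(idx)
                kept_coords.append((lat, lon))
    # Boolean mask preserves the original row order.
    return frame[frame.index.isin(keep)]

# Made-up rows: 0 and 1 are the same incident with slightly different
# coordinates, 2 is a different place on the same day, 3 is another day.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-01', '2016-02-01']),
    'Latitude': [47.60, 47.61, 34.05, 47.60],
    'Longitude': [-122.33, -122.30, -118.24, -122.33],
})
deduped = drop_near_duplicates(toy)
print(len(toy), len(deduped))
```

Unlike dropping on Date alone, this keeps two genuinely different incidents that happen on the same day, at the cost of a slower row-by-row loop.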

In [None]:
print(df.duplicated(subset='Date').sum())
print(len(df))
df = df.drop_duplicates(subset='Date')
print(len(df))

The number of rows lost is worryingly high, but I will carry on with the de-duplicated data.

Has the number of incidents increased or decreased?

In [None]:
df = df.sort_values('Date', ascending=True)
df.plot(x='Date',y='Fatalities',style='o',alpha=0.4,legend=False)
plt.xticks(rotation='horizontal')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Fatalities', fontsize=12)
plt.show()

The number of attacks seems to have increased, and the outliers suggest the number of fatalities in the worst attacks is increasing. We can further investigate the frequency of shootings by looking at yearly value counts.

In [None]:
df['Year'] = df['Date'].dt.year
counts = df['Year'].value_counts().sort_index()  # sort by year so the bars are in chronological order
counts.plot(kind='bar')
plt.show()

2015 and 2016 were particularly bad years. But it is not the case that the frequency of shootings is constantly increasing: between 2000 and 2005 the frequency of shootings decreased.

Has the ratio of fatalities to total victims changed over the years? Perhaps the type of weapon or calibre of ammunition has changed and people are more or less likely to survive. I will add a new column to the dataset called Ratio, where Ratio = Fatalities/(Fatalities + Injured). I am not using the Total victims column since it does not count the shooter.

In [None]:
df['Ratio'] = df['Fatalities']/(df['Fatalities'] + df['Injured'])
df.plot(x='Year',y='Ratio',style='o',alpha=0.4,legend=False)
plt.show()

If the trend were towards fewer survivors, I would expect to see more points up around 1.0; conversely, if fatalities as a share of total casualties were decreasing, I would expect to see more data points near zero. I don't see any particular trend.
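One way to back up the visual impression would be to average the ratio per year and look at how that average moves. A minimal sketch, using made-up numbers that are purely illustrative and not taken from the dataset:

```python
import pandas as pd

# Made-up incidents: two per year for three years.
toy = pd.DataFrame({
    'Year': [1990, 1990, 2000, 2000, 2010, 2010],
    'Fatalities': [4, 6, 3, 9, 5, 2],
    'Injured': [4, 2, 3, 3, 5, 6],
})

# Same definition as the Ratio column above.
# A row with no fatalities and no injured gives 0/0 = NaN,
# which mean() skips by default.
toy['Ratio'] = toy['Fatalities'] / (toy['Fatalities'] + toy['Injured'])

# Average share of victims who died, per year.
yearly_mean = toy.groupby('Year')['Ratio'].mean()
print(yearly_mean)
```

On the real data the same `groupby` could be plotted as a line; a flat line would be consistent with the lack of trend in the scatter plot.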

We can also plot the data by latitude and longitude, which are available for most rows. The number of nulls in each column:

In [None]:
print(df.isnull().sum())

The total number of rows is:

In [None]:
print(len(df))

So the 20 rows with missing latitude and longitude data are around 7% of the total. I will drop these rows.

In [None]:
df = df.dropna(subset=['Latitude', 'Longitude'])  # drop only the rows that are missing coordinates
print(len(df))

We can plot the remaining rows by latitude and longitude, and also use shade to indicate the approximate year:

In [None]:
def group_years(year):
    """Map a year to a period label, 'A' (oldest) to 'H' (most recent)."""
    if 1966 <= year <= 1978:
        group = 'A'
    elif 1979 <= year <= 1984:
        group = 'B'
    elif 1985 <= year <= 1989:
        group = 'C'
    elif 1990 <= year <= 1994:
        group = 'D'
    elif 1995 <= year <= 1999:
        group = 'E'
    elif 2000 <= year <= 2004:
        group = 'F'
    elif 2005 <= year <= 2010:
        group = 'G'
    elif 2011 <= year <= 2017:
        group = 'H'
    else:
        group = ''
    return group

# Drop the two Hawaii rows and the one Alaska row; it keeps the plot more compact.
# .copy() makes sure the new column below is added to an independent frame.
df = df[~df['S#'].isin([315, 291, 292])].copy()

df['Group'] = df['Year'].apply(group_years)

# Greyscale palette: darker shades are older periods.
colors = {'A':'#080707','B':'#282626','C':'#3d3939','D':'#686666','E':'#797777','F':'#a9a9a3','G':'#bebfc1','H':'#d2d2d2'}
fig, ax = plt.subplots()
ax.scatter(df['Longitude'], df['Latitude'], c=df['Group'].map(colors), alpha=0.6)
plt.show()

Darker points are older. There do appear to be geographic clusters and possibly clusters in time as well. For example, the upper left of the graph shows a cluster of about four more recent incidents, roughly in the Seattle area. Further south in CA there is another cluster around Los Angeles, but this cluster shows older activity as well as more recent incidents. There is also a string of incidents on the East Coast south of New York, all of which are more recent events. We can use a clustering algorithm to investigate the geographical clusters further:

In [None]:
k=15
f1 = df['Longitude'].values
f2 = df['Latitude'].values

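X = np.column_stack((f1, f2))  # plain (n_samples, 2) array; np.matrix is discouraged and not accepted by recent scikit-learn releases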
kmeans = cluster.KMeans(n_clusters=k).fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

for i in range(k):
    # select only data observations with cluster label == i
    ds = X[np.where(labels==i)]
    # plot the data observations
    plt.plot(ds[:,0],ds[:,1],'o')
    # plot the centroids
    lines = plt.plot(centroids[i,0],centroids[i,1],'kx')
    # make the centroid x's bigger
    plt.setp(lines,ms=15.0)
    plt.setp(lines,mew=2.0)
plt.show()

The West Coast seems to have three distinct clusters corresponding to the Los Angeles, San Francisco and Seattle areas. The East Coast does not display such tight clusters. Is this perhaps due to differences in population density between the two coasts? Populations may be more clustered on the West Coast, and mass shootings may be more common in areas of higher population density, but both points are speculation on my part.
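How the clusters look depends on the value chosen for k. One common sanity check is the elbow method: fit k-means for a range of k and see where the within-cluster sum of squares (`inertia_`) stops dropping sharply. The sketch below uses three made-up blobs of points rather than the shooting data, and the blob centres, the spread of 0.5 degrees and the fixed `random_state` are all assumptions chosen only to make the example repeatable.

```python
import numpy as np
from sklearn import cluster

rng = np.random.default_rng(0)

# Three made-up, well-separated blobs of (longitude, latitude) points.
centres = np.array([[-122.0, 47.0], [-118.0, 34.0], [-74.0, 40.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Fit k-means for k = 1..6 and record the within-cluster sum of squares.
inertias = []
for k in range(1, 7):
    model = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

With blobs like these the inertia should fall steeply up to k=3 and only slowly afterwards. On the real coordinates the elbow is likely to be much less clear, so this is a guide rather than a rule.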

In summary:
The frequency of attacks has increased in recent years; 2015 and 2016 were particularly bad years. However, the overall trend is not simply upwards: there have been more peaceful periods, for example between 2000 and 2005. While this year (2017) has seen the worst attack ever in terms of total casualties, there have been fewer incidents in total. Unfortunately the outliers, the attacks that see large numbers of fatalities, are getting worse. In the 60s and 70s the worst attacks saw fewer than 20 fatalities, whereas two recent attacks each saw a total around or above 50. I do not see any evidence to suggest that people are more or less likely to survive this kind of incident now compared to the past, but I imagine the weapons available to the shooters will have changed over the decades. Certainly the very high number of casualties in the recent shooting in Las Vegas was partly due to the kind of weapon(s) used. Geographically, incidents seem to cluster around larger urban centers; this is especially true on the West Coast. There is a cluster of incidents in WA, many of which are within the last 5 or 6 years; there are not many older events in WA.