**THIS NOTEBOOK USES CITY OF EDMONTON CRIME DATA AND TAKES A QUICK GLANCE TO SEE IF THERE ARE INTERESTING TRENDS OR ANOMALIES**
  

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas as pd
Crime= pd.read_csv("../input/edmonton-crime-and-population-datasets/EPS_Neighbourhood_Criminal_Incidents.csv")
Pop = pd.read_csv("../input/edmonton-crime-and-population-datasets/Edmonton_Population_History.csv")

Let's see what our dataset looks like.

In [None]:
Crime.head()



The crime dataset is a collection of Edmonton neighbourhoods with the number of crimes occurring in each year, quarter and month. The first question that comes to mind is how this data changes over time, which would be useful for assessing the effect of crime reduction interventions. Another question is which areas have the most and the least crime. Putting the data on a per-capita basis would make comparisons of change over time even more valid.

In [None]:
Pop = Pop.sort_values(by='Year')

Pop.tail()


Unfortunately, the population data is not broken down by neighbourhood. Although we could convert the crime counts to per-capita rates for the city of Edmonton as a whole, this could lead to incorrect conclusions if the neighbourhood populations are increasing or decreasing unequally. Presumably the city has some estimate of neighbourhood population distribution, but it has not been shared. Additionally, the population data is discontinuous, so using it would require some kind of regression to estimate the missing years. I will not make any further use of this dataset and will compare crimes on an absolute rather than a per-capita basis.
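For reference, here is a minimal sketch of what filling the missing years by regression could look like. The figures are made up for illustration (they are not the Edmonton numbers), and the column names `Year` and `Population` are assumptions about the layout of the population file.

```python
import numpy as np
import pandas as pd

# Illustrative, made-up census figures (NOT the real Edmonton numbers);
# the column names 'Year' and 'Population' are assumptions as well.
pop_example = pd.DataFrame({
    'Year': [2009, 2012, 2014, 2016],
    'Population': [100_000, 106_000, 110_000, 114_000],
})

# Fit a straight line (degree-1 polynomial) to the years that were surveyed.
slope, intercept = np.polyfit(pop_example['Year'], pop_example['Population'], deg=1)

# Estimate the population for every year in the range, including the gaps.
all_years = np.arange(2009, 2017)
pop_estimate = pd.Series(slope * all_years + intercept,
                         index=all_years, name='Population estimate')
print(pop_estimate.round())
```

A straight line is only a reasonable choice if growth is roughly steady between surveys. City-wide estimates like these would still not solve the neighbourhood-level problem described above.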

In [None]:
Crime = Crime.rename(columns= {'Neighbourhood Description (Occurrence)':'Hood', 'UCR Violation Type Group (Incident)':'Crime',
                               'Incident Reported Year': 'Year', 'Incident Reported Quarter':'Quarter', 'Incident Reported Month':'Month',})

crimes = list(set(Crime['Crime']))
years = sorted(list(set(Crime['Year'])))
months = sorted(list(set(Crime['Month'])))
quarters = sorted(list(set(Crime['Quarter'])))
NHood = list(set(Crime['Hood']))

To make things easier later, I rename the columns to shorter names and build lists of the unique values contained in the dataset. I can then refer to values by list index instead of typing them out if I need to. Printing the lists also lets us examine the data.

In [None]:
print(crimes)
print(years)
print(months)
print(quarters)
print(NHood)

We can see the data spans 2009 to 2018. The population data states that the population increased by roughly 10,000 over part of that period (2010 to 2016). That doesn't seem like much for a large population, so hopefully it would not have a significant effect on the rates of total crime. Let's take a look at a neighbourhood.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20,7))
# Combine both conditions into one boolean mask instead of chaining two selections
mask = (Crime['Crime']=='Robbery') & (Crime['Hood']=='RIDEAU PARK')
Crime[mask].groupby(['Crime','Year','Hood']).agg({'# Incidents':np.sum}).sort_values(by=['Year','Hood']).unstack().plot(ax=ax)


Just for fun, let's make one more.

In [None]:
fig, ax = plt.subplots(figsize=(20,7))
# Combine both conditions with & rather than chaining two boolean selections
Crime[(Crime['Crime']=='Theft Over $5000') & (Crime['Hood']=='KILLARNEY')].groupby(['Crime','Year','Hood']).agg({'# Incidents':np.sum}).sort_values(by=['Year','Hood']).unstack().plot(ax=ax)

These are interesting, but it's tough to assess interventions when the numbers of crimes are so small. Also, it's unlikely the city is targeting crime reduction in low-crime areas. So let's find the neighbourhoods with the most crimes, which were presumably targeted by local government for crime reduction.

In [None]:
HighCrime =pd.DataFrame()
HighCrime['Total Crime'] = Crime.groupby('Hood')['# Incidents'].sum()
HighCrime = HighCrime[HighCrime['Total Crime']>HighCrime['Total Crime'].mean()].sort_values(by= 'Total Crime')
print(HighCrime)

These are the neighbourhoods whose total number of incidents is above the mean. Downtown has the highest number of incidents. Graphing this neighbourhood shows the following trend.
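Note that filtering on the mean does not necessarily keep half of the neighbourhoods. When a few areas have very high totals, the mean is pulled upward and fewer areas pass the filter; a threshold at the median is what splits the data in half. A small sketch with made-up totals for five hypothetical neighbourhoods shows the difference:

```python
import pandas as pd

# Made-up incident totals for five hypothetical neighbourhoods.
totals = pd.Series({'A': 10, 'B': 20, 'C': 30, 'D': 40, 'E': 900}, name='Total Crime')

above_mean = totals[totals > totals.mean()]      # mean is 200, so only 'E' qualifies
above_median = totals[totals > totals.median()]  # median is 30, so 'D' and 'E' qualify

print(above_mean)
print(above_median)
```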

In [None]:
fig, ax = plt.subplots(figsize=(20,7))

Crime[Crime['Hood']=='DOWNTOWN'].groupby(['Year']).agg({'# Incidents':np.sum}).sort_values(by=['Year']).unstack().plot(ax=ax)
plt.show()

What is the most frequently reported crime? Let's see...

In [None]:
 
print(Crime[Crime['Hood']=='DOWNTOWN'].groupby(['Crime']).agg({'# Incidents':np.sum}))
 

Looks like assaults and theft are pretty high. Let's see what the worst month is for assault.

In [None]:
# Downtown assaults, totalled by month across all years
print(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Assault')].groupby(['Month']).agg({'# Incidents':np.sum}))

Not much of a difference over the time period. December does seem a little lower. Let's look at break and enter.

In [None]:
print(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Break and Enter')].groupby(['Month']).agg({'# Incidents':np.sum}))

No real differences in the B&E distribution by month over the period covered by the data. Let's look at the distribution in each year.

In [None]:
for year in years:
    # One combined mask per year: downtown, break and enter, and the current year
    mask = (Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Break and Enter') & (Crime['Year']==year)
    print(Crime[mask].groupby(['Year','Month']).agg({'# Incidents':np.sum}))

  

The data shows a trend of increasing B&Es, but no monthly pattern immediately stands out. Let's plot this to see what it looks like.

In [None]:
fig, ax = plt.subplots(figsize=(20,7))

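# Unstacking moves Month into the columns, so each month is drawn as its own line over the years
Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Break and Enter')].groupby(['Year','Month']).agg({'# Incidents':np.sum}).sort_values(by=['Year']).unstack().plot(ax=ax)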
plt.show()

Looks like a slight uptrend since 2013-2014. Unfortunately, the default color cycle doesn't have enough colors to give a unique one to each month. Regardless, there doesn't seem to be anything here worth a more in-depth look. Let's look at vehicle thefts.
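If distinct colors per month were needed, one option is to pass a colormap with more entries than the default cycle. This is a sketch on made-up counts rather than the crime data; the `monthly` table is a hypothetical stand-in with years as rows and months as columns.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up counts: rows are years, columns are the twelve months.
rng = np.random.default_rng(0)
monthly = pd.DataFrame(rng.integers(5, 30, size=(10, 12)),
                       index=range(2009, 2019), columns=range(1, 13))

fig, ax = plt.subplots(figsize=(20, 7))
# 'tab20' has 20 distinct colors, enough for one per month.
monthly.plot(ax=ax, colormap='tab20')

# Collect the color of each plotted line to check that none repeats.
line_colours = [line.get_color() for line in ax.get_lines()]
print(len(set(map(str, line_colours))))
```

With this many lines the plot can still be hard to read, so a heatmap of year by month might be a better fit for spotting seasonal patterns.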

In [None]:
print(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Theft From Vehicle')].groupby(['Month']).agg({'# Incidents':np.sum}))

Seems like a pretty uniform distribution by month. Let's look at it by year.

In [None]:
print(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Theft From Vehicle')].groupby(['Year']).agg({'# Incidents':np.sum}))

Quite a difference in the numbers between 2009 and 2013, so thefts from vehicles are clearly changing over the years, but no month is more prone than any other.

Now let's look at robbery.

In [None]:
print(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Robbery')].groupby(['Year']).agg({'# Incidents':np.sum}))

Plotting robberies by year gives...

In [None]:
fig, ax = plt.subplots(figsize=(20,7))

Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Robbery')].groupby(['Year']).agg({'# Incidents':np.sum}).sort_values(by=['Year']).unstack().plot(ax=ax)
plt.show()

Any monthly trends in robberies?

In [None]:
print(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Robbery')].groupby(['Month']).agg({'# Incidents':np.sum}))

Robberies seem to rise over the summer and fall over the winter. Maybe because more people are outside in the summer? Let's plot this.

In [None]:
fig, ax = plt.subplots(figsize=(20,7)) 

Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Robbery')].groupby(['Month']).agg({'# Incidents':np.sum}).sort_values(by=['Month']).unstack().plot(ax=ax)
plt.show()

In [None]:
print(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Sexual Assaults')].groupby(['Month']).agg({'# Incidents':np.sum}))

July is clearly the worst but is it statistically significant? 

In [None]:
import statsmodels.stats.api as sms

# 95% confidence interval for the mean of the twelve monthly totals
sms.DescrStatsW(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Sexual Assaults')].groupby(['Month']).agg({'# Incidents':np.sum})).tconfint_mean()

In [None]:
# July only, broken down by year
print(Crime[(Crime['Hood']=='DOWNTOWN') & (Crime['Crime']=='Sexual Assaults') & (Crime['Month']==7)].groupby(['Year']).agg({'# Incidents':np.sum}))

So there is no seasonal trend in sexual assaults; rather, an outlier in July 2011 makes that month stand out when compared to other months. Let's leave downtown and try to figure out which areas had the greatest increase and decrease in overall crime from 2009 to 2018.
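A single unusual year can be flagged more systematically than by eye. The sketch below uses a modified z-score based on the median absolute deviation (MAD), which is a swapped-in rule of thumb and not something used elsewhere in this notebook. The counts are made up for illustration, with one large value placed in 2011 to mimic the situation above.

```python
import pandas as pd

# Made-up July counts per year for one crime type (illustrative only).
july = pd.Series([4, 5, 21, 3, 4, 6, 5, 4, 5, 6],
                 index=range(2009, 2019), name='# Incidents')

# Modified z-score: distance from the median, scaled by the MAD.
# Caveat: if more than half the values are identical the MAD is 0 and this breaks down.
median = july.median()
mad = (july - median).abs().median()
modified_z = 0.6745 * (july - median) / mad

# 3.5 is a commonly quoted cut-off for the modified z-score.
outliers = july[modified_z.abs() > 3.5]
print(outliers)

# How much the July total changes once the flagged year is left out.
print(july.sum(), july.drop(outliers.index).sum())
```

Comparing the monthly totals with and without the flagged year gives a quick sense of whether the month stands out on its own or only because of one year.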

In [None]:
Start = Crime[Crime['Year']==2009].groupby('Hood')['# Incidents'].sum() 
print (Start)
End = Crime[Crime['Year']==2018].groupby('Hood')['# Incidents'].sum() 
print(End)

Hmm, the lengths of the two series aren't equal, indicating that data is absent in 2009 for 10 areas present in 2018. Looking at the printed data, it is apparent that 'Albany' is a new area in 2018 that is not present in 2009.
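When two series with different index labels are subtracted, pandas aligns them by label and returns NaN wherever a label is missing on either side, so those neighbourhoods silently drop out of any max/min comparison. A minimal sketch with made-up totals follows; treating a missing year as zero incidents is an assumption that fits a newly created neighbourhood but not missing records.

```python
import pandas as pd

# Made-up totals; 'NEW AREA' has no incidents recorded in the first year.
start = pd.Series({'A': 50, 'B': 20}, name='# Incidents')
end = pd.Series({'A': 30, 'B': 25, 'NEW AREA': 10}, name='# Incidents')

plain = start - end                      # 'NEW AREA' becomes NaN
filled = start.sub(end, fill_value=0)    # assumption: a missing year means zero incidents
print(plain)
print(filled.sort_values())

# Neighbourhoods present in only one of the two years.
only_one_year = start.index.symmetric_difference(end.index).tolist()
print(only_one_year)
```

As with start minus end in general, positive values are decreases and negative values are increases.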

In [None]:

# Positive values are decreases from 2009 to 2018; negative values are increases
dif  = (Start- End).to_frame() 
dif = dif.reset_index()
print(dif.loc[dif['# Incidents']==dif['# Incidents'].max()])
print(dif.loc[dif['# Incidents']==dif['# Incidents'].min()])
 
 

So 'SUMMERLEA' had the greatest decrease in crime and 'WALKER' had the greatest increase. Let's look at SUMMERLEA.

In [None]:
fig, ax = plt.subplots(figsize=(20,7))

Crime[Crime['Hood']=='SUMMERLEA'].groupby(['Year']).agg({'# Incidents':np.sum}).sort_values(by=['Year']).unstack().plot(ax=ax)
plt.show()

That looks pretty impressive. Let's see what crimes decreased over the period.

In [None]:
# Incidents by crime type in the first and last year of the data
print(Crime[(Crime['Hood']=='SUMMERLEA') & (Crime['Year']==2009)].groupby('Crime')['# Incidents'].sum())
print(Crime[(Crime['Hood']=='SUMMERLEA') & (Crime['Year']==2018)].groupby('Crime')['# Incidents'].sum())
 

Decreases were pretty remarkable for every crime except sexual assaults. Notable improvements were seen in property crimes (B&E, robbery, theft from and of vehicles, etc.) and also in violence (assault). Wikipedia shows this district includes West Edmonton Mall, so property crimes would be expected to be high in the area. Whatever intervention (if any) was taken appears to have been effective. Let's look at the area with the greatest increase.

In [None]:
fig, ax = plt.subplots(figsize=(20,7))

Crime[Crime['Hood']=='WALKER'].groupby(['Year']).agg({'# Incidents':np.sum}).sort_values(by=['Year']).unstack().plot(ax=ax)
plt.show()

This area is certainly experiencing more crime. Let's take a deeper dive.

In [None]:
# Incidents by crime type in the first and last year of the data
print(Crime[(Crime['Hood']=='WALKER') & (Crime['Year']==2009)].groupby('Crime')['# Incidents'].sum())
print(Crime[(Crime['Hood']=='WALKER') & (Crime['Year']==2018)].groupby('Crime')['# Incidents'].sum())

The increase appears to be across the board. The only thing I could do now is to see if the proportion of crime is consistent with other similar areas but I don't know enough about the areas to guess at what might be a fair comparison. I suppose that opens the door to examining the proportion of crimes in each neighbourhood and finding those that are unusually high in one type of crime but not others.
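As a starting point for that idea, here is a sketch of how the share of each crime type per neighbourhood could be computed and compared. It runs on a made-up table that reuses the renamed columns (`Hood`, `Crime`, `# Incidents`); the neighbourhood names and numbers are hypothetical, and comparing against the unweighted average mix is just one possible baseline.

```python
import pandas as pd

# Made-up rows using the notebook's renamed columns ('Hood', 'Crime', '# Incidents').
toy = pd.DataFrame({
    'Hood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Crime': ['Assault', 'Robbery', 'Break and Enter'] * 2,
    '# Incidents': [10, 10, 20, 5, 40, 5],
})

# Total incidents per neighbourhood (rows) and crime type (columns).
counts = toy.pivot_table(index='Hood', columns='Crime', values='# Incidents',
                         aggfunc='sum', fill_value=0)

# Share of each crime type within a neighbourhood (each row sums to 1).
shares = counts.div(counts.sum(axis=1), axis=0)

# Difference from the average mix across neighbourhoods; large positive values stand out.
excess = shares - shares.mean()
print(shares.round(2))
print(excess.stack().sort_values(ascending=False).head(3))
```

On the real data, neighbourhoods with very few incidents would produce noisy shares, so a minimum total count would probably be needed before ranking.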

**Conclusion**

At this point I'm going to end this data dive for the time being. There is still much work that can be done on this dataset and perhaps in the future I will return to it or someone else will continue the analysis. If you read to the end please provide comments as to what more you think could be done with this data set or if you noticed any errors in my code I would appreciate the feedback. Thanks for reading. 