# Introduction

The goal of this analysis is to check whether Google searches for crime in Vancouver (British Columbia) correlate with the total number of crimes in Vancouver. The assumption is that search volume reflects both what is going on in the real world and people's sentiment. *So does it?*


<br>
# Importing the Data Analysis and Visualization packages
---

In [None]:
# Import data manipulation packages
import numpy as np
import pandas as pd

# Import data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Processing and Transforming the data
---
### Importing the Google Trends Data

In [None]:
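# Importing the data; parse_dates=True reads the 'Month' index as dates
# (instead of plain strings) so it can later be aligned with the crime data
googletrend = pd.read_csv('../input/googletrend.csv', index_col='Month', parse_dates=True)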

In [None]:
# Taking a look at the first entries
googletrend.head()

In [None]:
# Checking index and data types
googletrend.info()

* We can see that the Google Trends data has 162 entries, which are monthly values from 2004-01 to 2017-06.
* The search index column shows the popularity of the search. A value of 100 is the peak of popularity. Other values are relative to the peak.

<br>
### Importing the Crime data

In [None]:
# Importing CSV file
crimes = pd.read_csv('../input/crime.csv')
crimes.head()

In [None]:
# Creating a date column from the date parts
crimes['DATE'] = pd.to_datetime({'year':crimes['YEAR'], 'month':crimes['MONTH'], 'day':crimes['DAY']})

# Change the index to the column 'DATE'
crimes.index = pd.DatetimeIndex(crimes['DATE'])

In [None]:
# The crime data starts in 2003, but our Google data starts in 2004 and ends in 2017-06.
# Let's drop 2003 and everything from 2017-07 onward from the crime data.
crimes = crimes[(crimes['DATE'] > '2003-12-31') & (crimes['DATE'] < '2017-07-01')]

# The crime data lists all individual crimes. 
# We need to group it by month to compare it to the Google trend.
crimes_month = pd.DataFrame(crimes.resample('M').size()) 

In [None]:
crimes_month.info()

Now the __crimes_month__ data has the same shape as the Google Trends data: 162 monthly entries covering the same period.
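To make the grouping step concrete, here is a minimal sketch on a made-up incident table (the dates and the `TYPE` values are hypothetical, not taken from the crime file). It uses the `'MS'` (month start) alias only so that the sketch runs the same way across pandas versions; the notebook itself uses `'M'`, which labels each bin by month end. The counts are identical either way.

```python
import pandas as pd

# Hypothetical incidents: one row per crime, indexed by date.
dates = pd.to_datetime(['2004-01-03', '2004-01-17', '2004-02-09',
                        '2004-04-01', '2004-04-30'])
toy = pd.DataFrame({'TYPE': ['Theft', 'Mischief', 'Theft', 'Theft', 'Mischief']},
                   index=dates)

# Count rows per calendar month. Months with no rows (March here)
# still appear in the result, with a count of 0.
toy_month = toy.resample('MS').size()
print(toy_month)
```

Because empty months are kept, the monthly series has one entry for every month in the range, which is what lets it line up with a monthly search index.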

In [None]:
# Just renaming the column...
crimes_month.columns = ['Total']

# Taking a look at the data
crimes_month.head()

The Total column is the total number of crimes per month. To make it comparable to the Google Trends data, let's build a "crime index", in which the month with the most crimes has a value of 100 and every other month is expressed relative to it.

In [None]:
# Dividing the total number of crimes by the maximum value and scaling to 100.
# Note that astype(int) drops the decimals (truncation) rather than rounding.
crimes_month['Crime Index'] = (crimes_month['Total'] / crimes_month['Total'].max() * 100).astype(int)
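As a small worked example of this scaling, the sketch below applies it to four made-up monthly totals. It also shows that casting with `astype(int)` truncates, so a value such as 99.75 becomes 99; adding `.round()` first is one possible alternative if nearest-integer values are preferred. The numbers are hypothetical.

```python
import pandas as pd

# Hypothetical monthly totals; the peak month is 400.
totals = pd.Series([300, 400, 399, 100], name='Total')

scaled = totals / totals.max() * 100          # 75.0, 100.0, 99.75, 25.0

index_truncated = scaled.astype(int)          # what the cell above does
index_rounded = scaled.round().astype(int)    # alternative: round first

print(pd.DataFrame({'truncated': index_truncated, 'rounded': index_rounded}))
```

The two versions differ by at most one index point, so the choice is unlikely to matter much for the comparison that follows.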

Now let's join the two data frames.

In [None]:
crime_trend = pd.concat([crimes_month['Crime Index'], googletrend], axis=1)
crime_trend.head()

Now we have our data set called __crime_trend__.
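One thing worth checking at this point: `pd.concat(..., axis=1)` aligns rows by index label, so both frames need to label a given month the same way. Whether that is already true depends on how the `Month` column is stored in the CSV, which is not shown here. The sketch below uses made-up values to illustrate the failure mode (month-end versus month-start labels) and one possible fix, converting both indexes to monthly periods.

```python
import pandas as pd

# Hypothetical series: same three months, labelled differently.
crime = pd.Series([80, 100, 90], name='Crime Index',
                  index=pd.to_datetime(['2004-01-31', '2004-02-29', '2004-03-31']))
search = pd.Series([55, 60, 70], name='Search Index',
                   index=pd.to_datetime(['2004-01-01', '2004-02-01', '2004-03-01']))

# Naive join: no labels match, so we get 6 rows, each half empty.
naive = pd.concat([crime, search], axis=1)

# Convert both indexes to monthly periods so that '2004-01-31' and
# '2004-01-01' both become the label 2004-01.
crime_p = crime.copy()
crime_p.index = crime.index.to_period('M')
search_p = search.copy()
search_p.index = search.index.to_period('M')

aligned = pd.concat([crime_p, search_p], axis=1)
print(aligned)
```

A quick `crime_trend.isna().sum()` after the join is a cheap way to confirm that the two columns actually lined up.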

<br>
# Analyzing Correlation
---
Let's start with a plot of crime index and Google trends.

In [None]:
crime_trend.plot(figsize=(12,6), linewidth=3)
plt.title('Crime Index and Google Trends', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});

## Using a 6-Month Moving Average

In [None]:
# Now let's use a 6-month window
crime_trend_rolling6 = crime_trend.rolling(window=6).mean().dropna()
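For readers unfamiliar with `rolling`, here is a minimal sketch on a made-up series. With `window=6`, each value is the mean of the current entry and the five before it, so the first five positions have no complete window and come out as NaN; `dropna()` then removes them.

```python
import pandas as pd

# Hypothetical monthly values.
s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

rolled = s.rolling(window=6).mean()   # first 5 entries are NaN
trimmed = rolled.dropna()             # keeps only complete windows

print(rolled.tolist())
print(trimmed.tolist())
```

One consequence for this analysis is that the smoothed data set is five months shorter than the original at its start.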

In [None]:
# Plot
crime_trend_rolling6.plot(figsize=(8,4), linewidth=3)
plt.title('Crime Index and Google Trends - Moving Average', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});

This is interesting. After 2010 there appears to be a *lag* between the crime index and the search index: when crime increases, it takes a while for searches to follow. Searches start increasing around the time the crime index reaches a local peak.

Now let's redo this plot with a *shift* in the search index.

In [None]:
# .shift(-5) moves the search index 5 months earlier, so each month's
# crime index is paired with the search index observed 5 months later
crime_trend_rolling6_shifted = (pd.concat([crime_trend_rolling6['Crime Index'],
                                             crime_trend_rolling6['Search Index']
                                             .shift(-5)], axis=1))

crime_trend_rolling6_shifted.columns = ['Crime Index','Search Index (shifted)']

# Let's focus on 2010 onward
crime_trend_rolling6_shifted = crime_trend_rolling6_shifted[crime_trend_rolling6_shifted.index >=
                                                            '2010-01-01']
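The direction of a negative shift is easy to get backwards, so here is a small sketch on made-up values. `shift(-2)` keeps the index as it is and moves the data two positions toward the start: the value originally at March ends up at January, and the last two rows become NaN.

```python
import pandas as pd

# Hypothetical monthly search values.
idx = pd.date_range('2010-01-01', periods=6, freq='MS')
search = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

shifted = search.shift(-2)

# Side by side: each row of 'shifted' holds the value from 2 months later.
print(pd.DataFrame({'original': search, 'shifted(-2)': shifted}))
```

Applied to the search index, this pairs each month's crime value with a later search value, which matches the idea that searches trail crime.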

In [None]:
# Plot
crime_trend_rolling6_shifted.plot(figsize=(8,4), linewidth=3)
plt.title('Crime Index and Google Trends (Shifted) - Moving Average', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});

In [None]:
# Let's check the correlation
crime_trend_rolling6_shifted.corr()
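The 5-month shift used above was chosen by looking at the plot. As a possible cross-check (this is not part of the original analysis), one could compute the correlation for a range of shifts and see where it peaks. The sketch below does this on synthetic data; `lagged_corr` is a hypothetical helper name, and `y` is constructed to trail `x` by exactly 3 steps so the expected answer is known.

```python
import numpy as np
import pandas as pd

def lagged_corr(x, y, max_lag):
    """Pearson correlation of x with y.shift(-k), for k = 0..max_lag."""
    return pd.Series({k: x.corr(y.shift(-k)) for k in range(max_lag + 1)})

# Synthetic data: y is x delayed by 3 steps.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = x.shift(3)

corrs = lagged_corr(x, y, max_lag=8)
best_lag = corrs.idxmax()

print(corrs.round(3))
print('best lag:', best_lag)
```

On the real data, the same helper could be called with the crime index as `x` and the search index as `y`; a peak near 5 would support the shift picked by eye.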

<br>
# Conclusion

From 2010 onward, there is a very high correlation between the 6-month moving average of Google searches for crime, shifted by five months, and the 6-month moving average of the crime index in Vancouver.

Reflection:
* It would be interesting to check whether the same happens in other cities for which we have crime data. Could we then use Google Trends to tell whether a city is experiencing more or less crime, especially for cities that do not publish crime statistics?