 # Thinkful Data Science Bootcamp Intro Unit Capstone #

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from scipy.stats import ttest_ind

### Read in the Dataset ###

In [4]:
baltimore = pd.read_csv('../input/BPD_Part_1_Victim_Based_Crime_Data.csv')

### Initial Look at the Dataset ###

##### Overview of the Data #####

This is the Baltimore Crime Dataset from Kaggle Datasets.  The data spans the years 2012 to 2017.  Let's take a look at the columns that make up the dataset.

In [5]:
baltimore.info()

##### Nulls in the Dataset #####

Let's take a look at the number of nulls in the dataset.  The only columns that don't have nulls are 'CrimeDate', 'CrimeTime', 'CrimeCode', 'Description', and 'Total Incidents'.  The remaining columns contain a lot of nulls.  For the purpose of this Capstone, we will focus only on analyzing the data that do not have nulls.

In [6]:
baltimore.isnull().sum()
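To put the null counts in context, it can help to look at the share of missing values per column and to drop rows only for the columns a given question needs. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset); `toy`, `null_share`, and `district_rows` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crime data (values are made up for illustration).
toy = pd.DataFrame({
    "CrimeDate": ["01/01/2012", "01/02/2012", "01/03/2012", "01/04/2012"],
    "Weapon": [np.nan, "KNIFE", np.nan, np.nan],
    "District": ["WESTERN", np.nan, "NORTHEASTERN", "WESTERN"],
})

# Fraction of missing values per column, largest first.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)

# Keep only the rows that have a value in the column needed for the question at hand.
district_rows = toy.dropna(subset=["District"])
print(len(district_rows))
```

Dropping rows per question, rather than dropping every row with any null, keeps as many observations as possible for each analysis.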

##### Sample Data #####

Now, let's take a look at some sample data.  We will do this by looking at the top 5 rows and the bottom 5 rows of the data.

In [7]:
baltimore.head()

In [8]:
baltimore.tail()

##### CrimeDate, CrimeTime, and CrimeCode Column #####

Looking at the "CrimeDate" column, the format is mm/dd/yyyy.  We see that the most recent crime instance is dated 09/02/2017 and the oldest instance of crime is dated 01/01/2012.  Looking at the "CrimeTime" column, we see that the time is coded in 24-hour format (hh:mm:ss), but it appears to be detailed only down to the minute level.  The "CrimeCode" column is the FBI Uniform Crime Reporting (UCR) code for the specific crime.  The details of what each code identifies can be found in the FBI Uniform Crime Reporting Handbook at https://www2.fbi.gov/ucr/handbook/ucrhandbook04.pdf.
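Since the dates follow the mm/dd/yyyy pattern, they can be parsed into real datetime values with `pd.to_datetime` and an explicit format string. This is a minimal sketch on three sample strings (two of them are the earliest and latest dates quoted above, the third is made up); `dates`, `parsed`, and `years` are hypothetical names.

```python
import pandas as pd

# Sample date strings in the dataset's mm/dd/yyyy format.
dates = pd.Series(["09/02/2017", "01/01/2012", "12/31/2014"])

# Parse with an explicit format so day and month cannot be swapped.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# The .dt accessor exposes components such as the year.
years = parsed.dt.year
print(parsed.min(), parsed.max())
print(years.tolist())
```

Parsed dates make range checks and year or month extraction less error-prone than slicing strings by position.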

#### Crime Code Example from UCR Handbook ####

The crime code for the fourth observation is 4C.  Let us see what the UCR handbook says about the specific crime code. 

In [9]:
baltimore.iloc[3]['CrimeCode']

In [7]:
img = Image.open('4c.png')
img

##### Location and Description Column #####

After the 'CrimeCode' column we have the 'Location' column, which holds the address where the crime occurred.  After the 'Location' column comes the 'Description' column, which describes the general category of the crime.  Here are the general categories of crime:

In [10]:
baltimore['Description'].unique()

##### Inside/Outside Column #####

After the 'Description' column we have the 'Inside/Outside' column, which tells whether the crime was committed inside or outside.  Let's take a look at the breakdown of crimes by inside or outside:

In [11]:
baltimore['Inside/Outside'].value_counts()

It looks like the column is inconsistent, using either just the letter or the entire word to describe whether the crime was inside or outside.  Let's add the 'I' instances to the 'Inside' instances and the 'O' instances to the 'Outside' instances.  The column name 'Inside/Outside' has a slash in it, which makes some code awkward to write.  So, let's create a new column without a slash called 'InsideOutside' and then total the Inside and Outside instances.

In [12]:
#Create new column
baltimore['InsideOutside'] = baltimore['Inside/Outside']

#Count Inside instances (full word and single-letter code)
inside1 = int((baltimore['InsideOutside'] == 'Inside').sum())
inside2 = int((baltimore['InsideOutside'] == 'I').sum())

#Count Outside instances (full word and single-letter code)
outside1 = int((baltimore['InsideOutside'] == 'Outside').sum())
outside2 = int((baltimore['InsideOutside'] == 'O').sum())

print("There are {} inside instances".format(inside1 + inside2))
print("There are {} outside instances".format(outside1 + outside2))

There seems to be a roughly equal number of inside and outside crime instances.
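An alternative to adding the counts by hand is to relabel the single-letter codes so that the column itself becomes consistent. This is a minimal sketch on a made-up series, assuming 'I' and 'O' are the only abbreviations present; `raw`, `normalized`, and `counts` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up values mixing full words, single-letter codes, and a missing entry.
raw = pd.Series(["I", "Inside", "O", "Outside", "O", np.nan])

# Map the single-letter codes onto the full words; other values are left as they are.
normalized = raw.replace({"I": "Inside", "O": "Outside"})

# value_counts ignores missing values by default.
counts = normalized.value_counts()
print(counts)
```

With the labels unified, later groupings and plots on this column need no special handling.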

##### Weapon Column #####

The next column after 'Inside/Outside' is the 'Weapon' column.  The number of weapon categories is small: there are only knife, firearm, hands, and other.

In [13]:
baltimore['Weapon'].unique()

##### Post Column #####

The next column after 'Weapon' is the 'Post' column.  The 'Post' column refers to the number of the nearest Baltimore Police Station where that crime was committed.  In this dataset, there are 179 unique posts.  The post gives an idea of which district the crime was committed in, because posts in the same district share the hundreds digit of the post number.

In [14]:
print(baltimore['Post'].unique())
print(len(baltimore['Post'].unique()))

For the fourth observation in the Baltimore Crime dataset the Post number is 934.  This post is in the southern district and as shown in the image, all Police posts in the southern district start with the number '9'.  

In [15]:
baltimore.iloc[3]['Post']

In [30]:
img1 = Image.open('934.png')
img1
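Because the hundreds digit is shared within a district, it can be pulled out of the post number with integer division. This is a minimal sketch on made-up post numbers, assuming posts are stored as floats with missing values; only the mapping of 9 to the southern district is confirmed above, and `posts` and `district_digit` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up post numbers, including a missing one (only 934 comes from the text).
posts = pd.Series([934.0, 913.0, 111.0, np.nan, 425.0])

# Floor-divide by 100 to get the leading digit; the nullable Int64 dtype keeps the missing value.
district_digit = (posts // 100).astype("Int64")
print(district_digit.tolist())
```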

##### District Column #####

The next column after 'Post' is the 'District' column.  There are nine unique districts in Baltimore.  

In [16]:
baltimore['District'].unique()

In [32]:
img2 = Image.open('Baltimore District.png')
img2

##### Neighborhood #####

The next column after 'District' is the 'Neighborhood' column.  There are 279 neighborhoods in the Baltimore Crime Dataset. 

In [17]:
print(len(baltimore['Neighborhood'].unique()))

In [41]:
img3 = Image.open('Baltimore Neighborhoods.png')
img3

##### Longitude, Latitude, and Location1 column #####

The next three columns, 'Longitude', 'Latitude', and 'Location1', describe the precise location where the crime occurred.  The 'Location1' column holds the longitude and latitude as a pair.

##### Premise Column #####

The next column is 'Premise', which describes the type of premises where the crime took place.  There are 124 different premise locations where crimes have occurred.

In [18]:
baltimore['Premise'].unique()

In [19]:
print(len(baltimore['Premise'].unique()))
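One detail worth keeping in mind with counts like this: `unique()` includes the missing value as its own entry, while `nunique()` leaves it out by default, so the two can differ by one on a column that contains nulls. A minimal sketch on made-up premise labels (`premise`, `with_nan`, and `without_nan` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up premise labels with missing entries.
premise = pd.Series(["STREET", "PARKING LOT", np.nan, "STREET", np.nan])

# unique() returns NaN as one of the distinct values.
with_nan = len(premise.unique())

# nunique() skips NaN unless dropna=False is passed.
without_nan = premise.nunique()
print(with_nan, without_nan)
```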

##### Total Incidents Column #####

The last column, 'Total Incidents', does not reveal much.  It only tells us that each row records a single occurrence of a crime.

In [20]:
baltimore['Total Incidents'].unique()

## Analysis Questions ##

##### 1.) What is the trend of the Baltimore Crime Data from 2012-2017? #####

##### 2.) What is the most dangerous hour (The hour where most crimes occur)? #####

##### 3.) Which District has the most crime? #####

##### 4.) Which Crime Category has the most occurrences? #####

##### 5.) Amongst the neighborhoods, what is the spread of the most frequently occurring crime category? #####

##### 6.) Is there a significant difference in larceny counts between the Northeastern and Western neighborhoods? #####

### Baltimore Crime Trend from 2012-2017 ###

In [21]:
# Create a new column that has the year in which the crime occurred.
# CrimeDate is formatted mm/dd/yyyy, so the year starts at character 6.

baltimore['CrimeYear'] = baltimore['CrimeDate'].str[6:].astype(int)

In [22]:
# Create a dataframe that has the number of crime occurrences by year from 2012-2017

crime_year = baltimore.CrimeYear.value_counts()
crime_yearindex = crime_year.sort_index(axis=0, ascending=True)
print(crime_yearindex)


In [23]:
# Line plot of crime data from 2012-2017

f, ax = plt.subplots(1)
xdata = crime_yearindex.index
ydata = crime_yearindex
ax.plot(xdata, ydata)
ax.set_ylim(bottom=0, top=60000)
plt.xlabel('Year')
plt.ylabel('Number of Crimes')
plt.title('Baltimore Crimes from 2012-2017')
plt.show()

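There appears to be an overall downward trend.  The year 2012 was the most dangerous; there was a slight dip in 2014 and then the count went back up a little.  From 2015 to 2016 the crime count stabilized, and then it took a noticeable dip in 2017.  Because the most recent record is dated 09/02/2017, the 2017 count covers only part of the year, so that dip should not be read as a full-year decline.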
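Since the data ends on 09/02/2017, one way to compare years like for like is to count only the incidents that fall on or before September 2 within each year. This is a minimal sketch on made-up dates, not the author's method; `dates`, `in_window`, and `like_for_like` are hypothetical names.

```python
import pandas as pd

# Made-up incident dates spread over two years.
dates = pd.to_datetime(pd.Series([
    "03/15/2016", "10/20/2016", "12/01/2016",
    "02/10/2017", "08/30/2017",
]), format="%m/%d/%Y")

# True for incidents dated January 1 through September 2 of their own year.
in_window = (dates.dt.month < 9) | ((dates.dt.month == 9) & (dates.dt.day <= 2))

# Yearly counts restricted to that window, ordered by year.
like_for_like = dates[in_window].dt.year.value_counts().sort_index()
print(like_for_like)
```

Comparing the same calendar window in every year removes the effect of the incomplete final year from the trend.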

### The Most Dangerous Hour in Baltimore ###

In [24]:
# Create a new column that has the hour at which the crime occurred.
# CrimeTime is formatted hh:mm:ss, so the hour is the first two characters.

baltimore['CrimeHour'] = baltimore['CrimeTime'].str[:2].astype(int)

In [25]:
# Create a dataframe with the crime occurrences by hour 

crime_hour = baltimore.CrimeHour.value_counts()
crime_hourindex = crime_hour.sort_index(axis=0, ascending=True)
print(crime_hourindex)

In [26]:
# Exactly one crime is recorded at hour 24. In 24-hour format, midnight can be written as either 24:00
# or 00:00, so we will change that observation from 24 to 0

midnight_rows = baltimore['CrimeHour'] == 24
print(baltimore[midnight_rows])
baltimore.loc[midnight_rows, 'CrimeHour'] = 0
print(baltimore[midnight_rows])

In [27]:
# Incorporate the change of the observation

crime_hour = baltimore.CrimeHour.value_counts()
crime_hourindex = crime_hour.sort_index(axis=0, ascending=True)

In [28]:
# Create line plot that shows the crime occurrence by hour

f, ax = plt.subplots(1)
xdata = crime_hourindex.index
ydata = crime_hourindex
ax.plot(xdata, ydata)
ax.set_ylim(bottom=0, top=17000)
ax.set_xlim(left=0, right=24)
plt.xlabel('Hour')
plt.ylabel('Number of Crimes')
plt.title('Number of Crimes by Hour')
plt.show()

The most dangerous hour is 18:00, which translates to 6pm.  The least dangerous hour is 05:00, which translates to 5am.  The trend that the line plot suggests makes sense.  I predicted that the majority of crime would be committed in the evening to late evening and that there would not be as much crime in the early morning.

In [29]:
# Create a pivot table to identify whether certain hours had more occurrences of specific crime categories

baltimore.pivot_table(index='Description',
               columns='CrimeHour',
               values='CrimeTime',
               aggfunc= 'count')
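A table of category by hour can be summarized further by asking at which hour each category peaks. The following is a minimal sketch using `pd.crosstab` on made-up rows (the counts are illustrative only); `toy`, `by_hour`, and `peak_hour` are hypothetical names.

```python
import pandas as pd

# Made-up incidents: a category and the hour at which each occurred.
toy = pd.DataFrame({
    "Description": ["LARCENY", "LARCENY", "LARCENY", "BURGLARY", "BURGLARY", "ARSON"],
    "CrimeHour": [18, 18, 12, 2, 2, 23],
})

# Count incidents per category and hour; empty combinations become 0 rather than NaN.
by_hour = pd.crosstab(toy["Description"], toy["CrimeHour"])

# For each category (row), the hour (column label) with the highest count.
peak_hour = by_hour.idxmax(axis=1)
print(peak_hour)
```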

### The Most Dangerous District ###

In [30]:
#Create a series that has the number of crime occurrences by district

districtcount = baltimore.District.value_counts()
districtcount

In [31]:
#Create bar graph of number of crimes by district

# One color per bar, cycled; a list avoids the string being read as a single color name
my_colors = list('rgbkymc')
districtcount.plot(kind='bar',
                color=my_colors,
                title='Number of Crimes Committed by District')

Measured by the number of recorded crimes, the most dangerous district is the Northeastern district and the least dangerous is the Western district.

### The Crime Category with the Highest Occurrence ###

In [32]:
#Create a series that has the occurrence of crimes by category

crimecount = baltimore.Description.value_counts()
crimecount

In [33]:
#Create bar graph of number of crimes by category

# One color per bar, cycled; a list avoids the string being read as a single color name
my_colors = list('rgbkymc')
crimecount.plot(kind='bar',
                color=my_colors,
                title='Crimes Committed by Category')

Larceny is the most common crime and Arson is the least common crime. 

### Distribution of Larceny Among the Neighborhoods ###

In [34]:
# Create a list of unique neighborhoods in Baltimore and then create an empty list which will be appended with larceny
# counts by neighborhood.

neighborhood_list = baltimore['Neighborhood'].unique()
larceny_list = []

# Iterate through unique neighborhood list and then append the count of larceny for that neighborhood to the empty 
# larceny list.

for neighborhood in neighborhood_list:
    x = baltimore[(baltimore['Neighborhood'] == neighborhood) & (baltimore['Description'] == 'LARCENY')]
    larceny_list.append(len(x))

# Create a pandas dataframe with the Larceny counts and sort the values with the highest count at the top 
    
neighborhood_larceny = np.array(larceny_list)
neighborhood_larceny = pd.DataFrame(neighborhood_larceny)
neighborhood_larceny.columns = ['Larceny Counts']
neighborhood_larceny.index = neighborhood_list
neighborhood_larceny = neighborhood_larceny.sort_values(['Larceny Counts'], ascending = False)
print(neighborhood_larceny)
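The same per-neighborhood counts can also be obtained without an explicit loop by filtering to larceny rows and counting neighborhoods. This is a minimal sketch on a made-up frame (the neighborhood names are used only as labels and the counts are invented); `toy`, `larceny`, and `counts` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up incidents; one row has no neighborhood recorded.
toy = pd.DataFrame({
    "Neighborhood": ["Downtown", "Downtown", "Canton", "Canton", np.nan, "Downtown", "Hampden"],
    "Description": ["LARCENY", "LARCENY", "LARCENY", "ARSON", "LARCENY", "ROBBERY", "ARSON"],
})

# Keep larceny rows, then count how often each neighborhood appears.
larceny = toy[toy["Description"] == "LARCENY"]
counts = (
    larceny["Neighborhood"].value_counts()
    # Add neighborhoods with no larceny at all as explicit zeros.
    .reindex(toy["Neighborhood"].dropna().unique(), fill_value=0)
    .sort_values(ascending=False)
    .rename("Larceny Counts")
)
print(counts)
```

The `reindex` step matters for the distribution: without it, neighborhoods with zero larcenies would silently drop out of the histogram and the summary statistics.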

In [35]:
# Plot a histogram of the larceny counts
plt.hist(neighborhood_larceny['Larceny Counts'], bins=20, color='c')

# Add a vertical line at the mean.
plt.axvline(neighborhood_larceny['Larceny Counts'].mean(), color='b', linestyle='solid', linewidth=2)

# Add a vertical line at one standard deviation above the mean.
plt.axvline(neighborhood_larceny['Larceny Counts'].mean() + neighborhood_larceny['Larceny Counts'].std(), color='b', linestyle='dashed', linewidth=2)

# Add a vertical line at one standard deviation below the mean.
plt.axvline(neighborhood_larceny['Larceny Counts'].mean() - neighborhood_larceny['Larceny Counts'].std(), color='b', linestyle='dashed', linewidth=2) 

plt.title('Histogram of Larceny Counts by Neighborhood')

# Print the histogram.
plt.show()

In [36]:
neighborhood_larceny['Larceny Counts'].median()

In [37]:
neighborhood_larceny['Larceny Counts'].mean()

In [38]:
neighborhood_larceny['Larceny Counts'].std()

Looking at the histogram of larceny counts by neighborhood, it is very clear that the distribution is heavily skewed to the right.  This skewness is consistent with the mean larceny count (214.99) being greater than the median larceny count (119).  The distribution also has a noticeable spread, indicated by the standard deviation of 290.95, which is larger than the mean.
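The comparison of mean and median can be complemented with the sample skewness, which pandas exposes as `Series.skew()`; a positive value indicates a longer right tail. A minimal sketch on made-up counts with one large outlier (`toy_counts` is a hypothetical name):

```python
import pandas as pd

# Made-up counts: mostly small values plus one very large one.
toy_counts = pd.Series([10, 20, 30, 40, 500])

mean = toy_counts.mean()
median = toy_counts.median()
skew = toy_counts.skew()

# A mean above the median together with positive skewness points to a right-skewed distribution.
print(mean, median, skew)
```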

### Difference between Larceny Counts in the Northeastern and Western District neighborhoods ###

In [39]:
# Get unique list of neighborhoods in the Northeastern district and create an empty list to hold larceny counts

ne_neighborhoods = baltimore[baltimore['District'] == 'NORTHEASTERN']
ne_neighborhoodlist = ne_neighborhoods['Neighborhood'].unique()

larceny_count1 = []

# Iterate through Northeastern neighborhood list and append the larceny counts to list

for neighborhood in ne_neighborhoodlist:
    x = ne_neighborhoods[(ne_neighborhoods['Neighborhood'] == neighborhood) & (ne_neighborhoods['Description'] == 'LARCENY')]
    larceny_count1.append(len(x))

# Create a pandas dataframe with the Larceny counts and sort the values with the highest count at the top 
    
ne_larceny = np.array(larceny_count1)
ne_larceny = pd.DataFrame(ne_larceny)
ne_larceny.columns = ['Larceny Counts']
ne_larceny.index = ne_neighborhoodlist
ne_larceny = ne_larceny.sort_values(['Larceny Counts'], ascending = False)
print(ne_larceny)



In [40]:
# Remove the NaN indexed observation

ne_larceny = ne_larceny.loc[ne_larceny.index.dropna()]
print(ne_larceny)

In [41]:
# Get unique list of neighborhoods in the Western district and create an empty list to hold larceny counts

w_neighborhoods = baltimore[baltimore['District'] == 'WESTERN']
w_neighborhoodlist = w_neighborhoods['Neighborhood'].unique()

larceny_count2 = []

# Iterate through Western neighborhood list and append the larceny counts to list

for neighborhood in w_neighborhoodlist:
    x = w_neighborhoods[(w_neighborhoods['Neighborhood'] == neighborhood) & (w_neighborhoods['Description'] == 'LARCENY')]
    larceny_count2.append(len(x))

# Create a pandas dataframe with the Larceny counts and sort the values with the highest count at the top 
    
w_larceny = np.array(larceny_count2)
w_larceny = pd.DataFrame(w_larceny)
w_larceny.columns = ['Larceny Counts']
w_larceny.index = w_neighborhoodlist
w_larceny = w_larceny.sort_values(['Larceny Counts'], ascending = False)
print(w_larceny)

In [42]:
#Drop the NaN indexed observation

w_larceny = w_larceny.loc[w_larceny.index.dropna()]
print(w_larceny)

##### T-statistic and p-value between Northeastern and Western neighborhoods #####

In [43]:
# T-test to determine whether there is a difference between the means of the Northeastern and Western neighborhood counts of
# larceny

# Manual t-statistic: difference in means divided by its standard error (unequal variances)
diff = ne_larceny['Larceny Counts'].mean() - w_larceny['Larceny Counts'].mean()
size = np.array([len(ne_larceny), len(w_larceny)])
sd = np.array([ne_larceny['Larceny Counts'].std(), w_larceny['Larceny Counts'].std()])
diff_se = (sum(sd ** 2 / size)) ** 0.5
t_val = diff / diff_se
print(t_val)

# The same test from scipy, which also reports the p-value
print(ttest_ind(ne_larceny['Larceny Counts'], w_larceny['Larceny Counts'], equal_var=False))

According to the t-test, the difference between the mean larceny counts of the Northeastern and Western neighborhoods is not statistically significant.  The p-value is 0.46, which is greater than 0.05, so we fail to reject the null hypothesis that the two means are equal.
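The manual calculation and `ttest_ind(..., equal_var=False)` are two routes to the same unequal-variance (Welch) t-statistic. The following is a minimal sketch on two made-up samples that checks they agree; `a`, `b`, `se`, and `t_manual` are hypothetical names and the numbers are not from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Two made-up samples of per-neighborhood counts.
a = np.array([120.0, 340.0, 95.0, 410.0, 230.0])
b = np.array([150.0, 180.0, 90.0, 260.0])

# Standard error of the difference in means, using sample variances (ddof=1).
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se

# scipy's unequal-variance t-test on the same samples.
result = ttest_ind(a, b, equal_var=False)
print(t_manual, result.statistic, result.pvalue)
```

Given how right-skewed the larceny counts are, a rank-based test could be considered as a robustness check, though that goes beyond what this analysis does.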

### Proposition for Further Research ###

##### 1.) I propose further research that uses Python mapping packages to plot the data by the Longitude and Latitude datapoints and visualize where specific crimes occur, which can help reveal crime cluster locations #####

##### 2.) I would like to expand on my pivot table of Crime Description and Crime Hour by using Python mapping packages to create a visualization that iterates through the hours and shows where certain crimes pop up #####

##### 3.) Lastly, I would like to try using a neural network to predict when and where a crime will occur #####