# Analysis of Highest Crime Area in London

With this large dataset I want to do some data cleaning and also observe policing and crime trends in the high-crime areas of London. I would like to do this by looking at crime and searches in different parts of the city and at the outcomes of these interactions. It may also be important to know the demographics of the people being searched and to look for other trends the dataset can provide.

# Table of Contents

* **[Cleaning Street Data](#Cleaning-Street-Data)**
* **[Cleaning Search Data](#Cleaning-Search-Data)**
* **[Cleaning Outcomes Data](#Cleaning-Outcomes-Data)**
* **[Joining the Datasets](#Joining-the-Datasets)**
* **[Analysis of Westminster Crime](#Analysis-of-Westminster-Crime)**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Load in the datasets
street = pd.read_csv("/kaggle/input/london-police-records/london-street.csv")
search = pd.read_csv("/kaggle/input/london-police-records/london-stop-and-search.csv")
outcomes = pd.read_csv("/kaggle/input/london-police-records/london-outcomes.csv")

# Cleaning Street Data

In [None]:
#Take a look at the dataset
street.head()

In [None]:
street.info()

In [None]:
#Find proportion of missing data in street dataset
street.isnull().sum()/len(street)

In [None]:
#Drop Context Column from dataset
new_street = street.drop(columns=['Context'])

In [None]:
#Checking the search dataset to see if we need to keep the Crime ID column
search.head()

In [None]:
#Checking the outcomes dataset to see if there are any matches for the Crime ID column
outcomes.info() #It looks like we might be able to join the datasets on Crime ID later, so we will leave it alone for now

In [None]:
#Quick look at Longitude
new_street.Longitude.describe()

Given the context of this data, an entry is not valuable to us without full information on the location of the event, so we are going to drop all rows with missing values.

In [None]:
#Make new dataframe for dropped nan dataset
new_street1 = new_street.dropna()

In [None]:
#Do we have any missing values?
new_street1.isnull().sum()
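The reasoning above is about missing location information specifically, while `dropna()` with no arguments removes a row if any column is missing. A minimal sketch of the difference, using a made-up `toy_street` frame (the values and the `location_cols` list are assumptions for illustration, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the street data; all values are invented for illustration.
toy_street = pd.DataFrame({
    "Crime ID": ["a1", None, "c3", "d4"],
    "Longitude": [-0.14, -0.13, np.nan, -0.12],
    "Latitude": [51.51, 51.52, np.nan, 51.50],
    "Location": ["On or near X", "On or near Y", None, "On or near Z"],
})

# Drop only the rows that are missing location information.
location_cols = ["Longitude", "Latitude", "Location"]
toy_located = toy_street.dropna(subset=location_cols)

# Drop rows that have a missing value in any column (what dropna() does by default).
toy_complete = toy_street.dropna()

print(len(toy_street), len(toy_located), len(toy_complete))
```

Whether to use `subset` depends on whether rows with a missing Crime ID or outcome are still wanted later; the notebook keeps only fully complete rows.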

In [None]:
#Now let's take a look at the dataset
new_street1.head()

In [None]:
#Want to change Month column to string so I can slice the year
new_street1['Month'] = new_street1['Month'].astype(str)

In [None]:
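#Make new columns for year and month separately
#Note: if Month is formatted as YYYY-MM, str[6:] keeps only the final digit (so '02' and '12' both become '2')
new_street1['Year'] = new_street1.Month.str[0:4]
new_street1['month'] = new_street1.Month.str[6:]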

In [None]:
#Year and month column have now been created
new_street1.head()
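As an alternative to slicing strings by position, the year and month can be pulled out by parsing the column as a date. This is only a sketch on a made-up `toy` frame, and it assumes the `Month` values look like `YYYY-MM`, which the notebook does not state explicitly:

```python
import pandas as pd

# Toy Month values in the assumed YYYY-MM format.
toy = pd.DataFrame({"Month": ["2014-06", "2015-12", "2016-02"]})

# Parse once, then read the parts off the datetime accessor.
parsed = pd.to_datetime(toy["Month"], format="%Y-%m")
toy["Year"] = parsed.dt.year
toy["month"] = parsed.dt.month

# For comparison: position-based slicing from index 6 keeps only the last character.
last_char = toy["Month"].str[6:]

print(toy)
print(last_char.tolist())
```

Parsing gives integer columns rather than strings, so any later merge on `Year` or `month` would need both sides prepared the same way.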

In [None]:
#Import package for plotting
import seaborn as sns

In [None]:
#Count plot to show frequency of unique values in Reported by column
sns.countplot(x='Reported by', data=new_street1)

The count plot shows interesting data. It can be seen that only two sources report London crime. What I am interested in is whether 'Reported by' and 'Falls within' give very similar results.

In [None]:
#Count plot to show frequency of unique values in Falls within column
sns.countplot(x='Falls within', data=new_street1)

We have a very similar graph, so I want to check the actual value counts for each column.

In [None]:
#Value count for Reported by
new_street1['Reported by'].value_counts()

In [None]:
#Value count for Falls within
new_street1['Falls within'].value_counts()

The results show that there is not much value in having two separate columns when they contain the same data. It is possible that the counts match by chance while individual entries differ, but the odds of that are very low given the counts match exactly across more than 232,000 entries.
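Matching value counts suggest the two columns are the same, but a row-by-row comparison settles it directly. A small sketch on invented data (`toy_same` and `toy_diff` are hypothetical frames, built so that the second one has identical counts but different rows):

```python
import pandas as pd

# Two toy frames: in the first the columns agree on every row,
# in the second the value counts match but the rows do not.
toy_same = pd.DataFrame({
    "Reported by": ["Force A", "Force B", "Force A"],
    "Falls within": ["Force A", "Force B", "Force A"],
})
toy_diff = pd.DataFrame({
    "Reported by": ["Force A", "Force B"],
    "Falls within": ["Force B", "Force A"],
})

# Element-wise comparison, then check that every row agrees.
same_everywhere = (toy_same["Reported by"] == toy_same["Falls within"]).all()
diff_everywhere = (toy_diff["Reported by"] == toy_diff["Falls within"]).all()

print(same_everywhere, diff_everywhere)
```

Applied to the notebook's data, the same comparison would be `(new_street1['Reported by'] == new_street1['Falls within']).all()`.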

In [None]:
#Quick sketch of Latitude and Longitude columns for entries
sns.scatterplot(x="Latitude", y="Longitude", data=new_street1)

In [None]:
#Find value_count of Location column
new_street1['Location'].value_counts()

The Location column is very interesting. Across our 230,000+ entries, over 36,000 distinct locations were recorded. Each entry can have only one location, so it will be interesting to see how this column fits into the analysis later.

In [None]:
#LSOA name and LSOA code have same information
new_street1['LSOA name'].value_counts()

Like the Location column, this might not give us a lot of information. However, we may be able to get more out of it by making a new column that removes the code identifier at the end of the LSOA name.

In [None]:
#Want to change LSOA name column to string so I can slice the end of it off
new_street1['LSOA name'] = new_street1['LSOA name'].astype(str)

In [None]:
new_street1.head()

In [None]:
new_street1['LSOA_Region'] = new_street1['LSOA name'].str[:-4]

In [None]:
new_street1.head()

In [None]:
#Now we will look at new LSOA region
new_street1['LSOA_Region'].value_counts()

We can see that 312 different regions have entries in the dataset.
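Slicing off the last four characters works when every code has the same length, but it also leaves a trailing space on the region name. A hedged alternative is to split on the final space instead. The LSOA names below are made up in the style the notebook implies (a region name followed by a four-character code):

```python
import pandas as pd

# Toy LSOA names: region name, a space, then a short code (assumed layout).
toy = pd.DataFrame({
    "LSOA name": ["Westminster 018A", "Kensington and Chelsea 021B", "Camden 001C"],
})

# Fixed-width slice, as in the notebook: the trailing space stays.
toy["sliced"] = toy["LSOA name"].str[:-4]

# Split once from the right and keep everything before the last space.
toy["LSOA_Region"] = toy["LSOA name"].str.rsplit(" ", n=1).str[0]

print(toy[["sliced", "LSOA_Region"]])
```

Note that with the split version a later filter would compare against `'Westminster'` rather than `'Westminster '`.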

In [None]:
#Frequency of unique values in Crime type column
new_street1['Crime type'].value_counts()

We have 13 different types of crime that were reported in the dataset.

In [None]:
#Frequency of unique values in Last outcome category column
new_street1['Last outcome category'].value_counts()

We have quite a few outcome category results. Not sure what I want done to this column yet. May come back to it later in the data cleaning process.

# Cleaning Search Data

In [None]:
#Moving on to cleaning the search dataset
search.head()

In [None]:
#Basic info of search dataset
search.info()

In [None]:
#Find proportion of missing data in search dataset
search.isnull().sum()/len(search)

In [None]:
#Drop Columns with Over 70% of Missing Values
new_search = search.drop(columns=['Policing operation', 'Outcome linked to object of search', 'Removal of more than just outer clothing'])

In [None]:
#Check out new dataset
new_search.info()

Since the remaining columns, apart from Latitude and Longitude, are all of type object, we are not going to fill in their missing values. We especially do not want to manipulate the data since these are not factor variables.

In [None]:
#Have dataset with no null values
new_search1 = new_search.dropna()

In [None]:
#Verify we have no null values in new dataset
new_search1.isnull().sum()

In [None]:
#See what the new dataset looks like
new_search1.head()

Now I want to go through each column and do a little more research.

In [None]:
#List the unique values in the Type column
new_search1.Type.unique()

In [None]:
#Count plot to show frequency of unique values in Type column
sns.countplot(x='Type', data=new_search1)

From the graph it looks like searches were almost always of a person.

In [None]:
#Value count for Type
new_search1['Type'].value_counts()

The Date column is in an interesting format. As with the street data, I want to pull parts of the column out into new columns: the month, year, and hour.

In [None]:
#Want to change Date column to string so I can slice
new_search1['Date'] = new_search1['Date'].astype(str)

In [None]:
#Make new columns for year, month and hours separately
new_search1['Year'] = new_search1.Date.str[0:4]
new_search1['month'] = new_search1.Date.str[5:7]
new_search1['Hour'] = new_search1.Date.str[11:13]

In [None]:
#Look at new columns
new_search1.head()

In [None]:
#Verify no errors in column creation
new_search1.Year.value_counts()

In [None]:
new_search1.month.value_counts()

In [None]:
new_search1.Hour.value_counts()

This information will be useful for seeing trends for not only dates, but now times as well.
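The same year, month, and hour columns can also be derived by parsing the timestamps rather than slicing them. A minimal sketch, assuming the `Date` values are ISO 8601 strings with a UTC offset (the slice positions used above are consistent with that layout, but the format is an assumption here, and the `toy` values are invented):

```python
import pandas as pd

# Toy timestamps in the assumed ISO 8601 layout.
toy = pd.DataFrame({
    "Date": ["2015-03-02T16:42:00+00:00", "2016-11-20T09:05:00+00:00"],
})

# Parse to timezone-aware datetimes, then read off the parts.
parsed = pd.to_datetime(toy["Date"], utc=True)
toy["Year"] = parsed.dt.year
toy["month"] = parsed.dt.month
toy["Hour"] = parsed.dt.hour

print(toy)
```

Integer columns sort naturally in plots (hour 9 before hour 16), which string columns only do when they are zero-padded.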

In [None]:
#Take a look at Age range column
new_search1['Age range'].value_counts()

In [None]:
#Do countplot of Age range for faster understanding
sns.countplot(x='Age range', data=new_search1)

From the data it looks like most of the searches in London were of teens and young adults. It also looks like there might be one or two searches of people under the age of 10.

In [None]:
#Countplot for ethnicity
sns.countplot(x="Officer-defined ethnicity", data=new_search1)

The data shows that a large majority of those searched had an officer-defined ethnicity of either Black or White.

In [None]:
#Check out object of search column
new_search1['Object of search'].value_counts()

It looks like a large majority of searches were for either drugs or articles for use in criminal damage.

In [None]:
#Check out Outcome
new_search1['Outcome'].value_counts()

# Cleaning Outcomes Data

In [None]:
#Overview of Outcomes
outcomes.head()

This dataset looks very similar to the street dataset, so similar procedures will be used to clean it.

In [None]:
outcomes.isnull().sum()

In [None]:
#Make new dataframe for dropped nan dataset
new_outcome = outcomes.dropna()

In [None]:
#Want to change Month column to string so I can slice the year
new_outcome['Month'] = new_outcome['Month'].astype(str)

In [None]:
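#Make new columns for year and month separately
#Note: if Month is formatted as YYYY-MM, str[6:] keeps only the final digit (so '02' and '12' both become '2')
new_outcome['Year'] = new_outcome.Month.str[0:4]
new_outcome['month'] = new_outcome.Month.str[6:]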

In [None]:
new_outcome.head()

In [None]:
#Want to change LSOA name column to string so I can slice the end of it off
new_outcome['LSOA name'] = new_outcome['LSOA name'].astype(str)

In [None]:
#Make LSOA Regions
new_outcome['LSOA_Region'] = new_outcome['LSOA name'].str[:-4]

In [None]:
#Check the work of last output
new_outcome.head()

Although both the street and outcome datasets have a Location column, some of the locations in the outcome dataset are in all caps. So that matching locations are not missed when we merge the datasets, I want to lowercase the values in both Location columns.

In [None]:
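#Make the values in each Location column lowercase (assign back, since str.lower() returns a new Series)
new_street1['Location'] = new_street1['Location'].str.lower()
new_outcome['Location'] = new_outcome['Location'].str.lower()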

Now that the data is cleaned the way we want it, we can join the datasets and then analyze them.

# Joining the Datasets

In [None]:
#Get column names of new_street1
new_street1.info()

In [None]:
#Get column names of new_outcome
new_outcome.info()

In [None]:
#Want a dataset that has exact info between new_street1 and new_outcome so merge on all similar columns
street_outcome = pd.merge(new_street1, new_outcome, on=['Crime ID', 'Month', 'Reported by', 'Falls within', 'Longitude', 'Latitude', 'Location', 'LSOA code', 'LSOA name', 'Year', 'month', 'LSOA_Region'])

In [None]:
#Check out the new dataset
street_outcome.head()

In [None]:
#Check the columns and null values of our data
street_outcome.info()

Now we have a dataset that combines the new_street1 and new_outcome datasets.

In [None]:
#Look at new_search1 data
new_search1.head()

I'm going to look to see if there are some entries that match up between street_outcome and new_search1.

In [None]:
#Merge street_outcome and new_search1
all_data = pd.merge(street_outcome, new_search1, on=['Latitude', 'Longitude','Year', 'month'])

In [None]:
#Did we get any results?
all_data.head()

Seeing that the Longitude, Latitude, Year, and month values do not match up, we will keep these two datasets separate for the sake of easier analysis.
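When a merge comes back empty or small, it helps to see how many rows matched and how many were left over on each side. One way to do that is an outer merge with pandas' `indicator=True`. The frames below (`left`, `right`) are invented stand-ins, not the notebook's data:

```python
import pandas as pd

# Toy frames sharing Latitude/Longitude key columns; values are made up.
left = pd.DataFrame({
    "Latitude": [51.51, 51.52],
    "Longitude": [-0.14, -0.13],
    "Crime type": ["Shoplifting", "Drugs"],
})
right = pd.DataFrame({
    "Latitude": [51.51, 51.60],
    "Longitude": [-0.14, -0.10],
    "Outcome": ["Nothing found", "Arrest"],
})

# An outer merge keeps unmatched rows; the _merge column says where each row came from.
check = left.merge(right, on=["Latitude", "Longitude"], how="outer", indicator=True)
counts = check["_merge"].value_counts()

print(counts)
```

Rows labelled `left_only` or `right_only` are the ones an inner merge would silently drop, which makes it easier to tell whether the keys truly do not overlap or are just formatted differently (for example `'6'` versus `'06'` for a month).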

For analysis we now have the cleaned datasets of street_outcome and new_search1.

# Analysis of Westminster Crime

Now that the data is clean, we can look for trends and work on answering the questions asked at the beginning. Again, the main piece we are after is a thorough analysis of the high-crime areas in London. We also want to look at the demographics of the people being searched, as well as any other trends the data can provide. One thing to figure out is how to go about analyzing the different areas of London. Knowing from earlier that there are 343 different areas in this dataset, we definitely do not want to work through every single one. Although there were 343 areas earlier, we might have fewer regions now that the data has been joined. It might be more beneficial to set filters within plots.

In [None]:
#Need to figure out how many observations are related to each LSOA_Region
lsoa_pivot = street_outcome.pivot_table(index=['LSOA_Region'], aggfunc='size').sort_values(ascending=False)
lsoa_pivot.describe()

In [None]:
#What areas are in top 5 for London Crime?
lsoa_pivot.head(5)

Now we know that Westminster has the highest amount of crime within this dataset. This is very interesting. Westminster is where Buckingham Palace is located and is a major tourist area. We do not have the information in this dataset, but it raises the question of what percentage of the crime is committed by people who live in London and what percentage by people who are just visiting the city.

Researching Westminster a little further, an article on MyLondon News suggests that the outside perception of the area differs from its reality.

https://www.mylondon.news/news/zone-1-news/shocking-extremes-wealth-poverty-westminster-17125539

From the article it seems that there is a growing gap specifically in this area between the wealthy and the poor and so there seems to be more going on in this area than just labelling it as a tourist destination.

In [None]:
#Making new dataset with just Westminster data
westminster_crime = street_outcome[street_outcome['LSOA_Region'] == 'Westminster ']
westminster_crime.head()

In [None]:
#Look at the Westminster dataset using info
westminster_crime.info()

In [None]:
#Making count plot to see what types of crime are the most prevalent in Westminster
x = sns.countplot(x='Crime type', data=westminster_crime, order=westminster_crime['Crime type'].value_counts().iloc[:5].index)
x.set_xticklabels(x.get_xticklabels(), rotation=30)

In [None]:
#Want to know the actual numbers for crime type in Westminster
westminster_crime['Crime type'].value_counts()

Although we could do a little more work to shorten the crime type labels, the graph above gives us an idea of what types of crime occur in the Westminster area. Shoplifting accounts for 3,068 of the 13,015 crime incidents in Westminster, or about 24% of the crime in the area. How does Westminster's share of shoplifting compare with the share for London overall?

In [None]:
#Count plot for overall crime types in London
a = sns.countplot(x='Crime type', data=street_outcome, order=street_outcome['Crime type'].value_counts().iloc[:5].index)
a.set_xticklabels(a.get_xticklabels(), rotation=30)

In [None]:
#Specific numbers for all of London crime types
street_outcome['Crime type'].value_counts()

From the street_outcome dataset, there are 44,097 shoplifting incidents out of 175,405 total incidents, so about 25% of the crimes in all of London are shoplifting. Westminster's share is therefore in line with the city as a whole, with no spike or dip.
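The percentages quoted above can be reproduced from the stated counts, and `value_counts(normalize=True)` gives the same kind of share directly from a column. A short sketch (the `toy_types` series is invented; the four counts are the ones reported in the text):

```python
import pandas as pd

# Shares computed from the counts quoted in the text.
westminster_share = 3068 / 13015
london_share = 44097 / 175405
print(round(westminster_share, 3), round(london_share, 3))

# On a column, normalize=True returns proportions instead of raw counts.
toy_types = pd.Series(
    ["Shoplifting", "Shoplifting", "Shoplifting", "Drugs"], name="Crime type"
)
toy_shares = toy_types.value_counts(normalize=True)
print(toy_shares)
```

On the notebook's frames this would be `westminster_crime['Crime type'].value_counts(normalize=True)` and the same call on `street_outcome`.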

In [None]:
#Making new dataframe for use looking at top 5 crime types in Westminster
crime_type_for_filter = ['Shoplifting', 'Violence and sexual offences', 'Other theft', 'Drugs', 'Theft from the person']
westminster_crime_filter = westminster_crime[westminster_crime['Crime type'].isin(crime_type_for_filter)]

In [None]:
#Look at the general area where crime took place
from matplotlib import pyplot

fig, ax = pyplot.subplots(figsize=(11.7, 8.27))
sns.scatterplot(x='Latitude', y='Longitude', hue='Crime type', data=westminster_crime_filter, ax=ax)

The crimes are centered around latitude 51.51, longitude -0.14, and there do not seem to be any strong spatial trends, apart from shoplifting appearing more often in the north part of Westminster.

In [None]:
westminster_crime['Last outcome category'].value_counts()

For most of the crime in Westminster the investigation was completed and no suspect was found. More specifically, 50% of the investigations in this dataset concluded with no suspect identified. Since shoplifting is the most common crime type in the area, it would be interesting to see how these incidents ended. How many of the shoplifting incidents resulted in no suspect being found?

In [None]:
#Make subset of the data only involving shoplifting incidents in Westminster
shoplift_for_filter = ['Shoplifting']
westminster_shoplift_filter = westminster_crime_filter[westminster_crime_filter['Crime type'].isin(shoplift_for_filter)]

In [None]:
#Getting overall count of outcomes for shoplifting in Westminster
westminster_shoplift_filter['Last outcome category'].value_counts()

From the above, shoplifting accounts for only 11% of the 'no suspect identified' outcomes in Westminster. Which crime type has the most cases with no suspect identified?

In [None]:
#Make subset of data only looking at incidents where no suspect was identified
no_suspect_for_filter = ['Investigation complete; no suspect identified']
westminster_no_suspect_filter = westminster_crime[westminster_crime['Last outcome category'].isin(no_suspect_for_filter)]

In [None]:
#Get sum values of all crime type where no suspect was found
westminster_no_suspect_filter['Crime type'].value_counts()

Other theft is a broad category, but it makes up 29% of the crimes in which no suspect was identified. Many of the other top crime types with no suspect identified also involve theft of some kind. The one that stands out is Violence and sexual offences, the third-highest crime type in the area where no suspect was identified.
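Shares like "what fraction of no-suspect outcomes each crime type accounts for" can be read off a single normalized cross-tabulation instead of filtering one category at a time. A sketch on invented rows (the second outcome label and all counts are made up for illustration):

```python
import pandas as pd

no_suspect = "Investigation complete; no suspect identified"
other_outcome = "Some other outcome"  # placeholder label, not from the dataset

# Toy crime/outcome pairs.
toy = pd.DataFrame({
    "Crime type": ["Shoplifting", "Other theft", "Other theft", "Shoplifting", "Drugs"],
    "Last outcome category": [no_suspect, no_suspect, no_suspect, other_outcome, other_outcome],
})

# normalize="columns" makes each outcome column sum to 1,
# so each cell is that crime type's share of the outcome.
table = pd.crosstab(
    toy["Crime type"], toy["Last outcome category"], normalize="columns"
)

print(table[no_suspect])
```

Using `normalize="index"` instead would answer the reverse question: for each crime type, what share of its incidents ended with no suspect identified.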

## Demographics of Suspects and Final Analysis

In [None]:
#Useful dataset for looking at Westminster crime demographics
westminster_demo = pd.merge(westminster_crime, new_search1, on=['Longitude', 'Latitude'])
westminster_demo.info()

Because westminster_crime only contains Westminster data, merging it with new_search1 on Latitude and Longitude is an easy and effective way to narrow the demographic dataset down to just Westminster.
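One thing to keep in mind with a merge on coordinates alone is that it pairs every crime at a location with every search at that location, so counts taken from the merged frame can be larger than either input. A small sketch with invented `crimes` and `searches` frames:

```python
import pandas as pd

# Two crimes and three searches that all share the same coordinates (toy values).
crimes = pd.DataFrame({
    "Latitude": [51.51, 51.51],
    "Longitude": [-0.14, -0.14],
    "Crime type": ["Shoplifting", "Drugs"],
})
searches = pd.DataFrame({
    "Latitude": [51.51, 51.51, 51.51],
    "Longitude": [-0.14, -0.14, -0.14],
    "Gender": ["Male", "Male", "Female"],
})

# Inner merge on coordinates: every crime row pairs with every search row.
merged = crimes.merge(searches, on=["Latitude", "Longitude"])

print(len(crimes), len(searches), len(merged))
print(merged["Gender"].value_counts())
```

Each merged row says a crime and a search were recorded at the same coordinates; it does not by itself link the person searched to that crime, so proportions from the merged frame describe matched records rather than offenders.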

In [None]:
#Start analysis of Gender
westminster_demo.Gender.value_counts()

The first column to look at is the Gender column. In the matched Westminster records, 89% of the people searched were male.

In [None]:
#Analysis of Officer-defined ethnicity
westminster_demo['Officer-defined ethnicity'].value_counts()

We also now know that in about 50% of the matched Westminster records the officer-defined ethnicity of the person searched was White.

In [None]:
#Make subset of data only looking at records where the officer-defined ethnicity is White
white_for_filter = ['White']
westminster_white_filter = westminster_demo[westminster_demo['Officer-defined ethnicity'].isin(white_for_filter)]
westminster_white_filter['Crime type'].value_counts()

For the records where the officer-defined ethnicity is White, the crime types fall in line with the ones that were most prevalent in the area overall.

In [None]:
#Count of Age demographics for Westminster crime
westminster_demo['Age range'].value_counts()

A surprising part of the data is that the second-largest age group in these records is people over the age of 34.

In [None]:
#Check and see how our age demographic compares with shoplifting
westminster_shoplift_demo = westminster_demo[westminster_demo['Crime type'].isin(shoplift_for_filter)]
westminster_shoplift_demo['Age range'].value_counts()

Our age range looks the same for shoplifting as it did for our overall age range for crime.

In [None]:
#Countplot of Year
sns.countplot(x="Year_x", data=westminster_demo, order = westminster_demo['Year_x'].value_counts().index)

Looking at the Westminster data, 2016 has the highest count, and there is no clear upward or downward trend, since 2014 and 2017 have the lowest counts of all the years.

In [None]:
#Look at month crimes were committed in Westminster
sns.countplot(x="month_x", data=westminster_demo, order = westminster_demo['month_x'].value_counts().index)

From this count we can see that the earlier part of the year had higher amounts of crime than any other part of the year.

This dataset has a lot of useful information, and the findings for Westminster have been interesting. There are many more trends that could be researched, and the dataset has been useful for practicing data cleaning techniques and in-depth analysis.

**Resources**

https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-a-certain-column-is-nan
https://stackoverflow.com/questions/31460146/plotting-value-counts-in-seaborn-barplot
https://datatofish.com/count-duplicates-pandas/
https://stackoverflow.com/questions/32891211/limit-the-number-of-groups-shown-in-seaborn-countplot
https://stackoverflow.com/questions/62025957/filter-data-and-modifying-labels-in-seaborn-boxplot-graphs
https://stackoverflow.com/questions/31594549/how-do-i-change-the-figure-size-for-a-seaborn-plot