### **Exploratory Data Analysis of Street Crime in Camden**

#### This notebook will conduct basic Exploratory Data Analysis of street crime in Camden. Analysis will be conducted as follows:

1. Setting up notebook and environment
2. Preparing data
3. Overview of trends
4. METHOD 1
5. METHOD 2
6. METHOD 3
7. CONCLUSION 

***

# 1. Set up notebook

In [148]:
# Uncomment and run the below if you are not using CASA environment 

# pip install

In [149]:
# Import the packages needed for the analysis

import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import os
import urllib.request
import zipfile
from shapely.geometry import Point
from matplotlib_scalebar.scalebar import ScaleBar


In [150]:
# Look at working dir

print("The working directory is " + os.getcwd())

The working directory is /home/jovyan/work/OneDrive/Job/Camden/EDA-Camden-Crime


In [151]:
# Create a folder for data to be downloaded into

if not os.path.isdir("data"):
    print("Creating 'data' directory...")
    os.mkdir("data")

In [152]:
# Create a folder to save figures in 

if not os.path.isdir("figures"):
    print("Creating 'figures' directory...")
    os.mkdir("figures")

In [153]:
# Define some functions to make investigating data easier

# Check if merges have been successful

def success(olddataframe, newdataframe):
    print(f"There are {newdataframe.isnull().sum().sum()} NaN values in the data frame")
    print(f"These NaN values are located in columns: {newdataframe.columns[newdataframe.isnull().any()].tolist()}")
    if len(olddataframe) == len(newdataframe):
        print("Success! The lengths are the same")
    else:
        print("Something is wrong!")
        diff = len(newdataframe) - len(olddataframe)
        # A negative diff means rows were lost; a positive diff means rows were added
        print(f"The new data frame has {abs(diff)} {'more' if diff > 0 else 'fewer'} rows than the old one")
        
# Check info of a df

def infodf(dataframe):
    print(f"There are {len(dataframe)} rows in the dataframe")
    print(f"There are {dataframe.shape[1]} columns in the data frame")
    print(f"The columns of the dataframe are: {dataframe.columns}")
    print(f"The data types of each column are: {dataframe.dtypes}")
    print(f"There are {dataframe.isnull().sum().sum()} NaN values in the data frame")
    print(f"The NaN sums for each column are: {dataframe.isna().sum()}")
    print(f"These NaN values are located in columns: {dataframe.columns[dataframe.isnull().any()].tolist()}")
    
# To check projections of spatial data frames

def projection(df1, df2):
    if df1.geometry.crs==df2.geometry.crs:
        print("The projections are the same")
    else:
        print("Need to change the projections")
        print("Projection of first data frame is " + str(df1.crs.name))
        print("Projection of second data frame is " + str(df2.crs.name))
    

***
# 2. Data preparation
## 2.1 Read in data

#### The data is read directly from the GitHub repository URL. If you have cloned the repo, you can replace the URLs with the local file paths instead.

In [154]:
# Read in csv data

crime = pd.read_csv("https://raw.githubusercontent.com/rubyimogenjohnson/EDA-Camden-Crime/main/Data/street_crime_jan20_jan23.csv")
population = pd.read_csv("https://raw.githubusercontent.com/rubyimogenjohnson/EDA-Camden-Crime/main/Data/ward_pop_stats.csv")
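
The choice between reading from GitHub and reading from a cloned copy could be automated. The following is a minimal sketch, not part of the original notebook: `data_source` and `REPO_RAW_URL` are hypothetical names, and the local folder name `Data` is an assumption based on the folder that appears in the URLs above.

```python
from pathlib import Path

# Base URL taken from the read_csv calls in this notebook
REPO_RAW_URL = "https://raw.githubusercontent.com/rubyimogenjohnson/EDA-Camden-Crime/main/Data"

def data_source(file_name, local_dir="Data"):
    """Return the local path if the file exists there, otherwise the GitHub raw URL."""
    local_path = Path(local_dir) / file_name
    if local_path.is_file():
        return str(local_path)
    return f"{REPO_RAW_URL}/{file_name}"

# Prints a local path when run inside a clone that contains the file, otherwise the URL
print(data_source("street_crime_jan20_jan23.csv"))
```

With this helper, a call such as `pd.read_csv(data_source("ward_pop_stats.csv"))` would work in both situations, because `pd.read_csv` accepts either a path or a URL.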



In [155]:
# Read in zipped spatial data, and unzip

url = "https://github.com/rubyimogenjohnson/EDA-Camden-Crime/blob/main/Data/Wards_(May_2022)_UK_BFC_V2.zip?raw=true"
file_name = "Wards_(May_2022)_UK_BFC_V2.zip"
dir_name = "spatial data"

# Create the target directory if it doesn't exist
if not os.path.exists(dir_name):
    os.makedirs(dir_name)

# Download the file from the URL into the target directory
urllib.request.urlretrieve(url, os.path.join(dir_name, file_name))

# Extract the contents of the ZIP file into the target directory
with zipfile.ZipFile(os.path.join(dir_name, file_name), "r") as zip_ref:
    zip_ref.extractall(dir_name)
    
# Delete downloaded zipped folder
os.remove(os.path.join(dir_name, file_name))


In [156]:
# Read in spatial data

camden_wards = gpd.read_file(os.path.join("spatial data", "Wards_(May_2022)_UK_BFC_V2", "WD_MAY_2022_UK_BFC_V2.shp"))

In [157]:
# Download London spatial data

url = "https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip"
file_name = "statistical-gis-boundaries-london.zip"
dir_name = "spatial data"

# Create the target directory if it doesn't exist
if not os.path.exists(dir_name):
    os.makedirs(dir_name)

# Download the file from the URL into the target directory
urllib.request.urlretrieve(url, os.path.join(dir_name, file_name))

# Extract the contents of the ZIP file into the target directory
with zipfile.ZipFile(os.path.join(dir_name, file_name), "r") as zip_ref:
    zip_ref.extractall(dir_name)
    
# Delete downloaded zipped folder
os.remove(os.path.join(dir_name, file_name))

In [158]:
# Read in London data

london_wards = gpd.read_file(os.path.join("spatial data", "statistical-gis-boundaries-london", "ESRI", "London_Ward.shp"))
london_bor = gpd.read_file(os.path.join("spatial data", "statistical-gis-boundaries-london", "ESRI", "London_Borough_Excluding_MHW.shp"))
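
The two download cells above repeat the same steps (create the folder, download, extract, delete the archive). A possible way to avoid the repetition is a small helper, sketched below. `download_and_extract` is a hypothetical name that does not appear in the original notebook; it only uses standard-library calls.

```python
import os
import urllib.request
import zipfile

def download_and_extract(url, file_name, dir_name="spatial data"):
    """Download a ZIP archive into dir_name, extract it there, then delete the archive."""
    # exist_ok=True makes a separate "does the folder exist?" check unnecessary
    os.makedirs(dir_name, exist_ok=True)
    zip_path = os.path.join(dir_name, file_name)

    # Download the archive to disk
    urllib.request.urlretrieve(url, zip_path)

    # Extract the contents next to the archive
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dir_name)

    # Remove the archive once its contents are extracted
    os.remove(zip_path)
    return dir_name
```

Each download cell would then reduce to a single call, for example `download_and_extract(url, file_name, dir_name)` with the variables already defined in those cells.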

## 2.2 Clean data

### 2.2.1 Spatial data

In [159]:
# Get some info about shapefile

camden_wards.crs

<Projected CRS: EPSG:27700>
Name: OSGB 1936 / British National Grid
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: United Kingdom (UK) - offshore to boundary of UKCS within 49°45'N to 61°N and 9°W to 2°E; onshore Great Britain (England, Wales and Scotland). Isle of Man onshore.
- bounds: (-9.0, 49.75, 2.01, 61.01)
Coordinate Operation:
- name: British National Grid
- method: Transverse Mercator
Datum: OSGB 1936
- Ellipsoid: Airy 1830
- Prime Meridian: Greenwich

In [160]:
london_wards.crs

<Projected CRS: PROJCS["OSGB 1936 / British National Grid",GEOGCS[ ...>
Name: OSGB 1936 / British National Grid
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: unnamed
- method: Transverse Mercator
Datum: OSGB 1936
- Ellipsoid: Airy 1830
- Prime Meridian: Greenwich

In [161]:
# Get the outline of Camden from the London boroughs

camden_outline = london_bor.loc[london_bor["NAME"]=="Camden"]

In [162]:
# Just make a map looking at Camden wards
import folium

# Create a Map object centered at a specific latitude and longitude
m = folium.Map(location=[51.536388, -0.140556], zoom_start=13)

# Add in london outline
folium.GeoJson(london_bor, style_function = lambda x: {'fillColor': 'None', 
                                         'color': '#2DC4B2', 
                                         'weight': 1, 
                                         'fillOpacity': 0.5}).add_to(m)

# Add in camden outline
folium.GeoJson(camden_outline, style_function = lambda x: {'fillColor': 'White', 
                                         'color': '#2DC4B2', 
                                         'weight': 5, 
                                         'fillOpacity': 0.5}).add_to(m)


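# Add in camden wards, with a popup showing the ward name when a ward is clicked
# (a 'popup' key inside the style dictionary has no effect, so it is passed to GeoJson instead)
folium.GeoJson(camden_wards, style_function = lambda feature: {'fillColor': 'White', 
                                         'color': '#2DC4B2', 
                                         'weight': 2, 
                                         'fillOpacity': 0.5},
               popup=folium.GeoJsonPopup(fields=['WD22NM'])).add_to(m)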

# Display the map
m

### 2.2.2 Crime data

In [163]:
# Get some info about the crime data

infodf(crime)

There are 136066 rows in the dataframe
There are 14 columns in the data frame
The columns of the dataframe are: Index(['id', 'crime_category', 'street_id', 'street_name', 'service',
       'location_subtype', 'month', 'year', 'easting', 'northing', 'longitude',
       'latitude', 'ward_code', 'ward_name'],
      dtype='object')
The data types of each column are: id                    int64
crime_category       object
street_id             int64
street_name          object
service              object
location_subtype     object
month                 int64
year                  int64
easting             float64
northing            float64
longitude           float64
latitude            float64
ward_code            object
ward_name            object
dtype: object
There are 133741 NaN values in the data frame
The NaN sums for each column are: id                       0
crime_category           0
street_id                0
street_name              0
service                  0
location_subty

### While there are NaN values for ward_code and ward_name, we still have the latitude and longitude of each crime, so we can work out which ward each one falls in.

In [164]:
# Drop location_subtype as it is NaN for most rows

crime = crime.drop(["location_subtype"], axis=1)

In [165]:
# Convert the longitude/latitude columns into point geometries

crime_gdf = gpd.GeoDataFrame(crime, geometry=gpd.points_from_xy(crime.longitude, crime.latitude), crs="EPSG:4326")

In [166]:
# check projection is the same

projection(crime_gdf, camden_wards)

Need to change the projections
Projection of first data frame is WGS 84
Projection of second data frame is OSGB 1936 / British National Grid


In [167]:
# change projection

crime_gdf = crime_gdf.to_crs(camden_wards.crs)

In [168]:
# find out which crimes are in which wards

# `predicate` replaces the deprecated `op` argument (geopandas 0.10 and later)
crimes_in_wards = gpd.sjoin(crime_gdf, camden_wards, predicate='within')
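
To make the spatial join easier to follow, here is a small sketch on made-up data. The two square "wards" and the three points are hypothetical and only illustrate the mechanics; `predicate` assumes geopandas 0.10 or later. Because `sjoin` performs an inner join by default, points that fall in no polygon are dropped, which may explain why the joined data has fewer rows than the raw crime data.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Two hypothetical square wards in British National Grid coordinates (metres)
toy_wards = gpd.GeoDataFrame(
    {"WD22NM": ["Ward A", "Ward B"]},
    geometry=[box(0, 0, 100, 100), box(100, 0, 200, 100)],
    crs="EPSG:27700",
)

# Three hypothetical crime points; the last one lies outside both wards
toy_crime = pd.DataFrame(
    {"id": [1, 2, 3], "easting": [50, 150, 500], "northing": [50, 50, 500]}
)
toy_crime_gdf = gpd.GeoDataFrame(
    toy_crime,
    geometry=gpd.points_from_xy(toy_crime.easting, toy_crime.northing),
    crs="EPSG:27700",
)

# Inner join (the default): each point gets the attributes of the ward it lies within
toy_joined = gpd.sjoin(toy_crime_gdf, toy_wards, predicate="within")
print(toy_joined[["id", "WD22NM"]])
print(f"{len(toy_crime_gdf) - len(toy_joined)} point(s) fell outside every ward")
```

Passing `how="left"` instead would keep unmatched points, with NaN in the ward columns, which can be useful for inspecting what was lost.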

In [169]:
infodf(crimes_in_wards)

There are 135024 rows in the dataframe
There are 28 columns in the data frame
The columns of the dataframe are: Index(['id', 'crime_category', 'street_id', 'street_name', 'service', 'month',
       'year', 'easting', 'northing', 'longitude', 'latitude', 'ward_code',
       'ward_name', 'geometry', 'index_right', 'OBJECTID', 'WD22CD', 'WD22NM',
       'WD22NMW', 'LAD22CD', 'LAD22NM', 'BNG_E', 'BNG_N', 'LONG', 'LAT',
       'GlobalID', 'SHAPE_Leng', 'SHAPE_Area'],
      dtype='object')
The data types of each column are: id                   int64
crime_category      object
street_id            int64
street_name         object
service             object
month                int64
year                 int64
easting            float64
northing           float64
longitude          float64
latitude           float64
ward_code           object
ward_name           object
geometry          geometry
index_right          int64
OBJECTID             int64
WD22CD              object
WD22NM           

In [170]:
# Drop the WD22NMW column as it is all NaN
crimes_in_wards = crimes_in_wards.drop(["WD22NMW"], axis=1)

In [171]:
crimes_in_wards.columns

Index(['id', 'crime_category', 'street_id', 'street_name', 'service', 'month',
       'year', 'easting', 'northing', 'longitude', 'latitude', 'ward_code',
       'ward_name', 'geometry', 'index_right', 'OBJECTID', 'WD22CD', 'WD22NM',
       'LAD22CD', 'LAD22NM', 'BNG_E', 'BNG_N', 'LONG', 'LAT', 'GlobalID',
       'SHAPE_Leng', 'SHAPE_Area'],
      dtype='object')

In [172]:
# drop all unneeded data

crimes_in_wards = crimes_in_wards[['id', 'crime_category', 'month','year', 'longitude', 'latitude', 'geometry', 'index_right', 'OBJECTID', 'WD22CD', 'WD22NM', 'SHAPE_Leng', 'SHAPE_Area']]

In [173]:
infodf(crimes_in_wards)

There are 135024 rows in the dataframe
There are 13 columns in the data frame
The columns of the dataframe are: Index(['id', 'crime_category', 'month', 'year', 'longitude', 'latitude',
       'geometry', 'index_right', 'OBJECTID', 'WD22CD', 'WD22NM', 'SHAPE_Leng',
       'SHAPE_Area'],
      dtype='object')
The data types of each column are: id                   int64
crime_category      object
month                int64
year                 int64
longitude          float64
latitude           float64
geometry          geometry
index_right          int64
OBJECTID             int64
WD22CD              object
WD22NM              object
SHAPE_Leng         float64
SHAPE_Area         float64
dtype: object
There are 0 NaN values in the data frame
The NaN sums for each column are: id                0
crime_category    0
month             0
year              0
longitude         0
latitude          0
geometry          0
index_right       0
OBJECTID          0
WD22CD            0
WD22NM          

### 2.2.3 Population data

In [174]:
# Get some info about the population data

infodf(population)

There are 20 rows in the dataframe
There are 14 columns in the data frame
The columns of the dataframe are: Index(['ward_code', 'ward_name', 'la_code', 'la_name', 'ward_area',
       'usual_residents', 'usual_residents_aged_16+',
       'usual_residents_aged_16+_in_employment', 'households',
       'housolds_deprived_in_0_dimensions', 'housolds_deprived_in_1_dimension',
       'housolds_deprived_in_2_dimensions',
       'housolds_deprived_in_3_dimensions',
       'housolds_deprived_in_4_dimensions'],
      dtype='object')
The data types of each column are: ward_code                                  object
ward_name                                  object
la_code                                    object
la_name                                    object
ward_area                                 float64
usual_residents                             int64
usual_residents_aged_16+                    int64
usual_residents_aged_16+_in_employment      int64
households                           

### Looks pretty clean already - undecided what elements we want to explore yet, so will keep all the columns. 

## 2.3 Aggregate and Merge data

### We now have clean point data for each crime, but individual points cannot be compared directly with the ward-level population statistics. To solve this, we will aggregate each crime type to ward level.

### 2.3.1 Aggregate

In [175]:
# Count crimes per crime type and ward; size() counts the rows in each group
crime_counts = crimes_in_wards.groupby(["crime_category", "WD22CD"], as_index=False).size().rename(columns={"size": "count"})
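
The aggregate-and-merge step can be sketched on made-up data as follows. This is an illustration under assumptions, not the notebook's final method: the toy frames, ward codes `W1` and `W2`, and resident counts are hypothetical, while the column names `crime_category`, `WD22CD`, `ward_code` and `usual_residents` are taken from the data frames shown earlier.

```python
import pandas as pd

# Hypothetical cleaned crime points (one row per crime)
toy_crimes = pd.DataFrame({
    "crime_category": ["burglary", "burglary", "robbery", "burglary"],
    "WD22CD": ["W1", "W1", "W1", "W2"],
})

# Hypothetical ward-level population table (one row per ward)
toy_population = pd.DataFrame({
    "ward_code": ["W1", "W2"],
    "usual_residents": [12000, 9000],
})

# One row per (crime type, ward) with the number of crimes in that group
toy_counts = (
    toy_crimes.groupby(["crime_category", "WD22CD"], as_index=False)
    .size()
    .rename(columns={"size": "count"})
)

# Attach the ward statistics; validate raises an error if a ward code
# appears more than once in the population table
toy_merged = toy_counts.merge(
    toy_population,
    left_on="WD22CD",
    right_on="ward_code",
    how="left",
    validate="many_to_one",
)
print(toy_merged)
```

A left join keeps every aggregated row, so comparing `len(toy_counts)` with `len(toy_merged)` and checking for NaN values afterwards is a quick way to confirm that the merge behaved as expected.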

# 3. Point Pattern Analysis

As we now have cleaned point data, we can explore the locations of crimes. This will be done by:
    
1. Splitting data temporally by entire year, and then winter and summer.
2. Splitting data by crime type.
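
As an illustration of the second split, the sketch below separates a made-up table into one data frame per crime type. The toy rows and the name `by_type` are hypothetical; only the `crime_category` column name comes from the crime data.

```python
import pandas as pd

# Hypothetical crime records
toy_points = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "crime_category": ["burglary", "robbery", "burglary", "drugs"],
})

# One DataFrame per crime type, keyed by the category name
by_type = {
    category: frame
    for category, frame in toy_points.groupby("crime_category")
}

# Number of records in each subset
print({category: len(frame) for category, frame in by_type.items()})
```

The same pattern should work on a GeoDataFrame, since `groupby` is inherited from pandas, so each subset could then be mapped or analysed separately.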

In [176]:
crime_gdf

In [None]:
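# Split temporally
# Assumption: winter = Dec-Feb and summer = Jun-Aug (meteorological seasons); adjust if a different definition is intended

crimes_year = crime_gdf
crimes_winter = crime_gdf[crime_gdf["month"].isin([12, 1, 2])]
crimes_summer = crime_gdf[crime_gdf["month"].isin([6, 7, 8])]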
