<font size="20">Geospatial Analysis & Visualization w/ Python</font>

# Step 0) Setup
* To get started, we need to import all our packages.

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import scipy
import matplotlib.pyplot as plt
%matplotlib notebook

# Step 1) Importing our data
* We'll load the Police Killing Data as a "DataFrame" using pandas


* Then we'll convert it into a "GeoDataFrame" using Geopandas
    * To do this, we must assign the "geometry".  In this case it's point data, and the coordinates are in lat/long
    
    
* Then we need to assign a Coordinate Reference System (CRS) manually
    * EPSG is a standardized code that is used to represent CRSs.
    * 'epsg:4326' refers to the WGS 1984 datum, which our latitude/longitude data is based on.
        * This is a CRS that is widely used by web-based platforms like Google Maps and Mapbox
        * The original dataset only had addresses, not coordinates, so we used a web service (Mapbox) to generate the coordinates of our addresses
        
        
* Once we have the data loaded, calling .head() will give us a "preview" of our dataset

In [None]:
# We import the Police Killings file, and set the incident ID as the index
police_Killings_Tabular = pd.read_csv('Data/PoliceKillings.csv',
                                      parse_dates=['date'],
                                      index_col=['id_incident']
                                     )

# We can then convert the pandas DataFrame into a geopandas "GeoDataFrame"
police_Killings = gpd.GeoDataFrame(police_Killings_Tabular,
    geometry=gpd.points_from_xy(police_Killings_Tabular.longitude,
                                police_Killings_Tabular.latitude
                               )
                                  )

# Now we can assign a CRS (EPSG:4326 is WGS 1984 latitude/longitude)
police_Killings = police_Killings.set_crs('epsg:4326')

# Let's sort the incidents by date and then take a quick look.
police_Killings=police_Killings.sort_values(by='date')
police_Killings.head()
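
The same pattern can be tried on a tiny table before running it on the real file. This is only a sketch: the two records and their coordinates are hypothetical, and it passes the CRS directly to the `GeoDataFrame` constructor instead of assigning it afterwards.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical records; the coordinates are illustrative, not taken from the dataset
toy = pd.DataFrame({
    'id_incident': [1, 2],
    'longitude': [-123.12, -79.38],
    'latitude': [49.28, 43.65],
}).set_index('id_incident')

# points_from_xy expects x (longitude) first, then y (latitude)
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy.longitude, toy.latitude),
    crs='EPSG:4326',
)

print(toy_gdf.crs)
print(toy_gdf.head())
```

Swapping the longitude and latitude arguments is an easy mistake to make, and it puts every point in the wrong place without raising an error.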

### Now we'll load some data from the 2016 Census

* We have a tabular dataset of population data.  We'll load that using pandas

In [None]:
# We'll import the tabular census data with pandas
Census_Tabular = pd.read_csv('Data/Census.csv',index_col=['PRUID'])
Census_Tabular.head()

* We also have a provincial boundary shapefile that we can load with geopandas.
    * Shapefiles are used to store geographic data.  They already have projections and coordinates associated with them.
    * Geopandas has similar functionality to pandas.  But the read_file() method has fewer options, so we have to set the index manually.

In [None]:
# We'll import provincial boundaries using geopandas
Provincial_Boundaries = gpd.read_file('Data/Provincial_Boundaries.shp').set_index('PRUID')
Provincial_Boundaries.head()
# Provincial_Boundaries = Provincial_Boundaries.drop(['PRENAME','PRFNAME','PREABBR','PRFABBR','AREA_LCC','AREA_AEA','Area_Merc'],axis=1)
# Provincial_Boundaries.geometry = Provincial_Boundaries.simplify(100)
# Provincial_Boundaries.to_file('Data/Provincial_Boundaries.shp')
# Provincial_Boundaries.index

# Step 2) Joining our census data

* This will let us map the disparity by province and do a more detailed analysis

* PRUID is a "unique identifier" that represents the provinces.

    * Since both have the PRUID set as the index, we don't need to specify a join key.

In [None]:
Test_Join = Provincial_Boundaries.join(Census_Tabular)
Test_Join.head()

### But our join fails :(

* ## Notice the NaN values

* Wonder why?
    * Let's look at the index for both files.  Maybe we have a datatype mismatch?

In [None]:
print(Provincial_Boundaries.index.dtype)
print(Census_Tabular.index.dtype)

### Sure enough!  The Provincial_Boundaries index is an "object", not an integer.

* We can fix that easily and then do the join!
    * We just need to change the datatype of the Provincial_Boundaries layer.

### How could we do this?
* Hint: The answer is in the cell above!!

In [None]:
dtype = 
Provincial_Boundaries.index = Provincial_Boundaries.index.astype(dtype)
Provincial_Boundaries = Provincial_Boundaries.join(Census_Tabular)
Provincial_Boundaries.head()
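
To see why the dtype matters, here is a minimal sketch on two made-up tables (the names `boundaries` and `census` and all of their values are hypothetical). The string keys never match the integer keys, so the first join returns only NaN; casting the index fixes it.

```python
import pandas as pd

# Hypothetical stand-ins for the two layers; names and values are made up
boundaries = pd.DataFrame({'name': ['A', 'B']},
                          index=pd.Index(['10', '11'], name='PRUID'))
census = pd.DataFrame({'Total': [100, 200]},
                      index=pd.Index([10, 11], name='PRUID'))

# String keys never equal integer keys, so every row comes back as NaN
failed = boundaries.join(census)
print(failed['Total'].isna().all())

# Casting the index to the census index's dtype lets the keys match
boundaries.index = boundaries.index.astype(census.index.dtype)
fixed = boundaries.join(census)
print(fixed)
```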

# Step 3) Exploring the data

### First let's make a quick map.

* Our layers need to be in the same coordinate system to match up properly on a map!

* We can re-project the police_Killings layer using the .to_crs() function to set the CRS to that of the provinces
    * The provinces layer uses the Canada Lambert Conformal Conic projection (LCC).  This is the standard projection used by Statistics Canada and is ideally suited for displaying the whole country.
        
        
* Once both datasets are in the same coordinate system, we can make a map!


* First we must define a plot, using the matplotlib.pyplot package.  We imported this earlier as "plt"
    * We use the plt.subplots() to create a figure, and we can define how big we want it to be
    
    
* Geopandas can then use the .plot() function to create a map using matplotlib.
    * We simply tell it what axis to draw the plot on with ax=axes
    * Then set a few other parameters:
        * We just want the provinces as a grey background, so we can set the color
        * We want to classify killings by race, so we can set race as the column.  Then we can add a legend to aid interpretation of the data

In [None]:
# We can use .to_crs() to create a police killings layer with the same projection as the provinces layer.
police_Killings = police_Killings.to_crs(Provincial_Boundaries.crs)

# Now, we can create a figure using matplotlib (plt), first we define the figure and the size
fig,axes=plt.subplots(
    figsize=(6,6)
)

# Now we can add the provinces using the .plot() function.  We set the plotting axes and shade them by total population in grey
cb = Provincial_Boundaries.plot(
    ax=axes,
#     alpha=.5,
    column='Total',
    cmap = 'Greys',
    edgecolor='grey',
    legend=True,
)

# Then we add the re-projected police_Killings layer.  We'll set the column to 'race', so we can display by race,
# give the point markers a few more parameters, and add them to a legend
police_Killings.plot(
    ax=axes,
    column='race',
    edgecolor='k',
    markersize=15,
    legend=True,
    legend_kwds={'loc': 'upper right','fontsize':8}
)

### And now you've made your first map with python!

* But it's an ugly map :(
    * It doesn't look great.  This is just the quick and dirty way to look at data
    * To make things more presentable, we'll have to be more explicit in setting up our map.  But that's a task for later.


### For now, let's move on and look at the dataset in more detail.

* Pandas & Geopandas have some nice features to quickly summarize our dataset.



* We can use .count() to get the total number of incidents.
    * Calling .count() as is will give us a list of all the columns, and a count for each.  We can see most columns are "full", but in the "geocoding_Notes" column, we can see that 4 points don't have coordinates associated with their address.  This suggests there was an error in the data entry process.  We don't need to worry about this though.

In [None]:
police_Killings.count()

* We can select the ['age'] column and then call .mean(), .min(), etc. to get some vital statistics on the age of victims.

In [None]:
# Selecting the column first avoids aggregating the non-numeric columns (including geometry)
print('Age Distribution of Victims')
print()
print('Mean:                ',
      police_Killings['age'].mean()
     )
print()
print('Standard Deviation:  ',
      police_Killings['age'].std()
     )
print()
print('Youngest:            ',
      police_Killings['age'].min()
     )
print()
print('Oldest:              ',
      police_Killings['age'].max()
     )

### We can resample our data to look for trends
* The date column is a special type of data that allows us to resample our data by year, month, etc
* The dataset has to be in order by date for this to work (we did this already).

In [None]:
Resampled = police_Killings.set_index('date').resample('Y').count()


plt.figure()
plt.bar(
    Resampled.index.year,
    Resampled['id_victim'],
    edgecolor='black',
    facecolor='#FF0000'
)
plt.title('Police Killings per Year in Canada')
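
As a cross-check on the resampled counts, the same yearly totals can be sketched with a plain groupby on the year. This is an alternative to `.resample()`, not the method used above, and the dates below are hypothetical.

```python
import pandas as pd

# Hypothetical incident dates, for illustration only
toy = pd.DataFrame({
    'date': pd.to_datetime(['2000-03-01', '2000-11-15', '2002-06-30']),
    'id_victim': [1, 2, 3],
})

# Group on the calendar year pulled from the datetime column
per_year = toy.groupby(toy['date'].dt.year)['id_victim'].count()

# Unlike resample, groupby skips empty years, so reindex to restore them as zeros
full_range = range(int(per_year.index.min()), int(per_year.index.max()) + 1)
per_year = per_year.reindex(full_range, fill_value=0)
print(per_year)
```

The reindex step matters for a bar chart: without it, a year with no incidents would simply be missing from the axis instead of showing as zero.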

### We can group our data to look for patterns too.

* The .groupby() function can accept one or multiple parameters to group our dataset by.
    * This allows us to create complex queries if we want.
* We have to follow up with .count(), .mean(), etc.
    * This tells it "how" to aggregate

In [None]:
fig,ax = plt.subplots(figsize=(9,6))


Armed = police_Killings.groupby(['armed_type']).count()
ax.pie(
    Armed['id_victim'],
    labels=Armed.index,
    textprops={'fontsize': 8},
    autopct='%1.1f%%',
    wedgeprops={"edgecolor":"k",'linewidth': 1, 'linestyle': 'dashed'}
)
ax.set_title('Police Killings: Was the Victim Armed?')
plt.tight_layout()

In [None]:
police_Killings.groupby(['gender','mentral_distress_disorder']).count()

### We're interested in a specific question.  What's the distribution of police killings by race?


In [None]:
police_Killings.groupby(['race']).count()['date'].sort_values()
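
One detail worth knowing when following .groupby() with an aggregation: .count() counts non-missing values in each column, while .size() counts rows. A small sketch on made-up records (all values are hypothetical) shows the difference.

```python
import pandas as pd

# Hypothetical records; one age is deliberately missing
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Female', 'Male'],
    'armed_type': ['Unarmed', 'Knife', 'Unarmed', 'Unarmed'],
    'age': [30, None, 41, 25],
})

# .size() counts rows per group, regardless of missing values
rows_per_group = toy.groupby(['gender', 'armed_type']).size()

# .count() counts only the non-missing ages in each group
ages_recorded = toy.groupby(['gender', 'armed_type'])['age'].count()

# Other aggregations work the same way, e.g. the mean age by gender
mean_age = toy.groupby('gender')['age'].mean()

print(rows_per_group)
print(ages_recorded)
print(mean_age)
```

This is why the cells above pick a column such as 'id_victim' or 'date' after .count(): a column with no gaps gives the true number of incidents per group.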

# Step 4) Normalizing our Data

* However, the racial demographics of Canada aren't evenly split!

* We need to normalize our data by population statistics.

* Let's look at our census data again


In [None]:
Provincial_Boundaries[Census_Tabular.columns]

### The first row contains the total values for the whole country.  We can use this to calculate a police killing rate.

* But the Canadian Census' racial categories don't match up perfectly with the police violence dataset's racial categories
* How can we work around this?
    * We have the largest three groups in the police killing set: White, Indigenous, and Black.  So we can work with them as is
    * The other races make up a small portion of total killings, and we can't be entirely sure how the CBC defined their groupings.  So, let's add a new category: "Other Minorities"
    
* We'll do this for both the provincial boundaries and the police_Killings
    * For the police killings, we'll leave the unknown records alone

In [None]:
Other_Minorities=['South Asian', 'Chinese', 'Filipino','Latin American',
 'Arab', 'Southeast Asian', 'West Asian', 'Korean',
'Japansese', 'Visible minority, n.i.e', 'Mixed']
Provincial_Boundaries['Other Minorities']=Provincial_Boundaries[Other_Minorities].sum(axis=1)

Other_Minorities=['Latin American', 'Arab', 'Other', 'South Asian', 'Asian']
police_Killings['race'] = police_Killings['race'].replace(to_replace=Other_Minorities,value='Other Minorities')
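
The two operations in the cell above (summing several census columns into one, and relabelling several categories as one) can be sketched on toy data. The column names and values here are hypothetical and much shorter than the real lists.

```python
import pandas as pd

# Census side: add several population columns together, row by row
census = pd.DataFrame({'Chinese': [10, 20], 'Korean': [1, 2], 'Black': [5, 5]})
census['Other Minorities'] = census[['Chinese', 'Korean']].sum(axis=1)

# Incident side: map several labels onto one, leaving everything else untouched
race = pd.Series(['Caucasian', 'Arab', 'Asian', 'Unknown', 'Black'])
grouped = race.replace(to_replace=['Arab', 'Asian'], value='Other Minorities')

print(census)
print(grouped)
```

Note that 'Unknown' passes through unchanged, which matches the decision to leave the unknown records alone.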


### From here, we can calculate the police killing rate.

* Dividing the total number of killings by the population gives us ...

In [None]:
Races = ['Indigenous','Black','Caucasian','Other Minorities']
Race_Breakdown = police_Killings.groupby(['race']).count()['id_victim']
Can_Pop = Provincial_Boundaries[Races].sum()

Racial_Rates = Race_Breakdown.T[Races]/Can_Pop
Racial_Rates['Average']=Race_Breakdown.T[Races].sum()/Can_Pop.sum()
print(Racial_Rates)
# police_Killings.groupby(['race']).count()['date'].sort_values()

### This number isn't that meaningful though.  It represents the number of killings "per person" over the whole study period.

* Let's convert the rate to a more meaningful unit: Killings / Million Residents / Year

* The date record is a "date" object.
* It has some added functionality, like being able to query the year, month, or day

### How might we use this to calculate our police killing rate?

In [None]:
First_Year = police_Killings['date'].min().year
Last_Year = police_Killings['date'].max().year
print(First_Year,Last_Year)

In [None]:

Scale = 
Duration = 
rate_Conversion = Scale / Duration

fig, ax = plt.subplots(figsize = (7,6))
ax.barh(
    Racial_Rates.index,
    Racial_Rates.values * rate_Conversion,
    facecolor='#FF0000',
    edgecolor='black',
    linewidth=1
)
ax.set_title('Police Killings by Race in Canada')
ax.set_xlabel('Killings/Million Residents/Year')
plt.tight_layout()
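
The unit conversion itself is plain arithmetic, and a worked example with round numbers may help. Everything below is hypothetical: the counts are invented, and counting both endpoint years in the duration is an assumption, not something the dataset dictates.

```python
# Hypothetical inputs chosen so the arithmetic is easy to follow
killings = 90
population = 10_000_000
first_year, last_year = 2000, 2017

# Killings "per person" over the whole study period
per_person = killings / population

scale = 1_000_000                       # express the rate per million residents
duration = last_year - first_year + 1   # assumption: both endpoint years are counted

# Killings per million residents per year
rate = per_person * scale / duration
print(rate)
```

With these numbers, 90 killings among 10 million residents is 9 per million over the period, or about 0.5 per million per year across 18 years.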

# The police killing rates are 5x higher for Indigenous people and 4x higher for Black people than they are for White people.

* This is an abhorrent example of systemic racism in Canadian policing.


### Now we want to normalize by provincial demographics.

* We have a few more steps to go through first.
    * The police killings and census data use different abbreviations.  To join our dataset with the census data, we'll need to assign a new abbreviation
    * We'll use a dictionary to do this
    
    
* Then we can summarize the killings by province and join the result to the Provincial_Boundaries layer


* Note: Prince Edward Island doesn't have any recorded killings.

In [None]:

race_by_Province = police_Killings.groupby(['prov','race']).count()
race_by_Province = race_by_Province['date'].unstack()
race_by_Province['Total'] = race_by_Province.sum(axis=1)


for col in Races:
    Provincial_Boundaries = Provincial_Boundaries.join(race_by_Province[col],on='prov',rsuffix='_Killings')

for col in ['Unknown','Total']:
    Provincial_Boundaries = Provincial_Boundaries.join(race_by_Province[col],on='prov',rsuffix='_Killings')
Provincial_Boundaries

# Some provinces/groups don't have any records.  Those are given NaN values, and need to be replaced with zeros
Provincial_Boundaries[[x+'_Killings' for x in Races]]=Provincial_Boundaries[[x+'_Killings' for x in Races]].fillna(0)
Provincial_Boundaries['Total_Killings']=Provincial_Boundaries['Total_Killings'].fillna(0)
# 'Unknown' has no census counterpart, so the join leaves its name unsuffixed; assign the result so the fill sticks
Provincial_Boundaries['Unknown']=Provincial_Boundaries['Unknown'].fillna(0)

Provincial_Boundaries.head()
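
The groupby, unstack, join, and fill sequence is easier to follow on a handful of rows. This sketch uses hypothetical provinces and counts, and it uses `.size()` with `fill_value=0` rather than `.count()` followed by `.fillna(0)`.

```python
import pandas as pd

# Hypothetical incidents
toy = pd.DataFrame({
    'prov': ['BC', 'BC', 'ON', 'ON', 'ON'],
    'race': ['Black', 'Caucasian', 'Caucasian', 'Caucasian', 'Indigenous'],
})

# One row per province, one column per race; absent pairs become 0 instead of NaN
by_prov = toy.groupby(['prov', 'race']).size().unstack(fill_value=0)
by_prov['Total'] = by_prov.sum(axis=1)

# Hypothetical census table that already has a 'Total' (population) column
census = pd.DataFrame({'prov': ['BC', 'ON', 'PE'], 'Total': [100, 200, 50]})

# The overlapping name gets the suffix, so the killings land in 'Total_Killings'
joined = census.join(by_prov['Total'], on='prov', rsuffix='_Killings')

# Provinces with no incidents have no match in the join and need an explicit zero
joined['Total_Killings'] = joined['Total_Killings'].fillna(0)
print(joined)
```

The suffix is only applied when a column name already exists on the left side, which is why a column without a census counterpart keeps its original name after the join.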

# Step 5) Calculate the police killing rate (PKR) on the provincial level
* Nunavut has a huge problem.  It's not a coincidence that the population is 75% Inuit.

In [None]:
Provincial_Boundaries['PKR']=(Provincial_Boundaries['Total_Killings']/Provincial_Boundaries['Total']*rate_Conversion).round(2)

Provincial_Boundaries.plot(column='PKR',legend=True,scheme='naturalbreaks')

# Step 6) Calculate a Police Killings Discrimination Index:

* For this, we'll compare the rates of police killings of black and indigenous people to white people

* We'll use the following equations:


\begin{align}
Wr & = \left(\frac{\text{White Killings}}{\text{White Population}}\right) \times \text{Rate Conversion}\\
BIr & = \left(\frac{\text{Black Killings} + \text{Indigenous Killings}}{\text{Black Population} + \text{Indigenous Population}}\right) \times \text{Rate Conversion}\\
PKDI & = BIr - Wr
\end{align}

* This will highlight the disparities in police killings
    * We'll classify the data using the following scheme:
    
        * "Minimal Bias": -0.48 to 0.48 - Within plus or minus the rate of killings of white people (0.483293).  Within this range, differences might be due to the presence or lack thereof of certain groups
        * "Moderate Bias": 0.48 to 0.77 - Greater than the white rate, less than the national average
        * "Severe Bias": 0.77 to 2.31 - Greater than the national average, less than the Indigenous rate
        * "Extreme Bias": 2.31 to 10 - Greater than the national Indigenous rate
    

In [None]:
Provincial_Boundaries['Wr']=Provincial_Boundaries['Caucasian_Killings']/Provincial_Boundaries['Caucasian']*rate_Conversion
Provincial_Boundaries['BIr']=(Provincial_Boundaries['Indigenous_Killings']+Provincial_Boundaries['Black_Killings'])/(Provincial_Boundaries['Indigenous']+Provincial_Boundaries['Black'])*rate_Conversion

Provincial_Boundaries['PKDI'] = Provincial_Boundaries['BIr'] - Provincial_Boundaries['Wr']

Provincial_Boundaries['PKDI']=Provincial_Boundaries['PKDI'].fillna(0)


bins = [-0.48,0.48,0.77,2.31,10.0]
labels = ['Minimal Bias','Moderate Bias','Severe Bias','Extreme Bias']
Provincial_Boundaries['PKDI_Classes']=(pd.cut(Provincial_Boundaries['PKDI'],bins=bins,labels=labels)).astype('str')

Provincial_Boundaries.round(2)

# print(Provincial_Boundaries[['prov','PKDI','PKDI_Classes']].sort_values(by='PKDI').round(2))
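
To check how `pd.cut` assigns the classes, here is a sketch that reuses the bin edges from the cell above on a few hypothetical PKDI values. Each bin includes its right edge but not its left, and anything outside the outermost edges gets no class at all.

```python
import pandas as pd

# Hypothetical PKDI values, one per class plus one below the lowest bin edge
pkdi = pd.Series([-1.0, 0.1, 0.6, 1.5, 5.0])

bins = [-0.48, 0.48, 0.77, 2.31, 10.0]
labels = ['Minimal Bias', 'Moderate Bias', 'Severe Bias', 'Extreme Bias']

# Values falling outside the bin edges are returned as NaN
classes = pd.cut(pkdi, bins=bins, labels=labels)
print(classes)
```

A province where the white rate exceeds the combined Black and Indigenous rate by more than 0.48 would fall below the first bin and come out unclassified, so it is worth checking the result for missing classes.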

### Let's map the patterns

In [None]:
Provincial_Boundaries.plot(column='PKDI_Classes',legend=True)

# Step 7) Create a detailed infographic on police violence in Canada

* Matplotlib allows us to be very specific in determining our layout with gridspec.


* We can create a large plot and define specifically what we want.


* We'll have two maps, showing the PKR and the PKDI on the left


* Then we'll add some smaller plots on the right showing the annual trend, national PKR by race, and some pie charts


* We can set our default font size for consistency

In [None]:
SMALL_SIZE = 8
MEDIUM_SIZE = 10
BIGGER_SIZE = 16

fig = plt.figure(figsize=(10,10))


plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=MEDIUM_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title



gs = fig.add_gridspec(100,100)

PKR_Map = fig.add_subplot(gs[0:45 , 0:50])
PKDI_Map = fig.add_subplot(gs[50:95, 0:50])

SourceStatement = fig.add_subplot(gs[95:, 0:50])

Annual_Trend = fig.add_subplot(gs[0:20, 65:])
PKR_national = fig.add_subplot(gs[28:48, 65:])
Pie_1 = fig.add_subplot(gs[58:78, 65:])
Pie_2 = fig.add_subplot(gs[80:100, 65:])


plt.suptitle('Police Killings in Canada (2000-2017)')
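
The gridspec slicing can be hard to picture, so here is a stripped-down sketch with three panels. The panel names and slice values are hypothetical, and the 'Agg' backend is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))

# A 100 x 100 grid lets us think of the slices as percentages of the figure
gs = fig.add_gridspec(100, 100)

# gs[rows, columns]: one tall panel on the left, two shorter panels on the right
left_panel = fig.add_subplot(gs[0:100, 0:50])
top_right = fig.add_subplot(gs[0:45, 60:100])
bottom_right = fig.add_subplot(gs[55:100, 60:100])

left_panel.set_title('Left')
top_right.set_title('Top right')
bottom_right.set_title('Bottom right')
```

Leaving unused rows or columns between slices (columns 50 to 60 here) is what creates the gaps between panels.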

### Now we can add things to the figure
* First let's do the maps

In [None]:
Provincial_Boundaries.plot(ax=PKR_Map,
                           column='PKR',
                           legend=True,
                           cmap = 'Reds',
                           edgecolor='black',
                           scheme='naturalbreaks')
PKR_Map.get_legend().set_title('PKR') 
PKR_Map.get_xaxis().set_visible(False)
PKR_Map.get_yaxis().set_visible(False)
PKR_Map.set_title('Police Killing Rate')


Provincial_Boundaries=Provincial_Boundaries.sort_values(by='PKDI')
Provincial_Boundaries.plot(ax=PKDI_Map,
                           column='PKDI',
                           legend=True,
                           edgecolor='black',
                           cmap = 'Reds',
                           scheme='naturalbreaks')

# PKDI_Map.get_legend().set_bbox_to_anchor((1,0.5))
PKDI_Map.get_legend().set_title('PKDI') 
PKDI_Map.get_xaxis().set_visible(False)
PKDI_Map.get_yaxis().set_visible(False)
PKDI_Map.set_title('Police Killing Discrimination Index')


Annual_Trend.bar(
    Resampled.index.year,
    Resampled['id_victim'],
    edgecolor='black',
    facecolor='#FF0000'
)
Annual_Trend.set_title('Killings per Year')
Annual_Trend.set_xticks([2000,2005,2010,2015])


PKR_national.barh(
    Racial_Rates.index,
    Racial_Rates.values * rate_Conversion,
    facecolor='#FF0000',
    edgecolor='black',
    linewidth=1
)
PKR_national.set_title('Killings by Race in Canada')
PKR_national.set_xlabel('Killings/Million Residents/Year')



Armed = police_Killings.groupby(['armed_type']).count()
Pie_1.pie(
    Armed['id_victim'],
    labels=Armed.index,
    textprops={'fontsize': 8},
    autopct='%1.1f%%',
    wedgeprops={"edgecolor":"k",'linewidth': 1, 'linestyle': 'dashed'}
)
Pie_1.set_title('Was the Victim Armed?')
# plt.tight_layout()


COD = police_Killings.groupby(['Charges']).count()
Pie_2.pie(
    COD['id_victim'],
    labels=COD.index,
    textprops={'fontsize': 8},
    autopct='%1.1f%%',
    wedgeprops={"edgecolor":"k",'linewidth': 1, 'linestyle': 'dashed'}
)
Pie_2.set_title('Were Officers Charged?')



DataSource='Created by June Skeeter\nPolice Killing data collected by the CBC\nDemographics data is from Stats Canada'

SourceStatement.set_axis_off()
SourceStatement.text(0, 0.5, 
                     DataSource,
                     horizontalalignment='left',
                     verticalalignment='center',
#                      transform=ax.transAxes
                    )

# plt.tight_layout()
plt.savefig('InfoGraphic.png')

# Save the data so we can use it in the future
* We're going to save it as a shapefile for use with geopandas or a desktop GIS
* We're also going to save it as a "GeoJSON" file.  This datatype is well suited for web mapping, which I cover in a different workshop!

In [None]:
Provincial_Boundaries.to_file('Data/Provincial_Police_Violence.shp')
Provincial_Boundaries = Provincial_Boundaries.to_crs('epsg:4326')
Provincial_Boundaries.to_file("Data/Provincial_Police_Violence.json", driver = "GeoJSON")

# police_Killings = police_Killings.to_crs('epsg:4326')
# Temp=police_Killings[['prov','race','armed_type','age','mentral_distress_disorder','geometry']]
# print(Temp.head())
# Temp.to_file("Data/PoliceKillings.json", driver = "GeoJSON")

# To Do

* Maybe Chi Square

* Update legend labels

* Make Infographic taller

* Add Wr and BIr
    * Add explanation of PKDI
    
* Update pie chart color scheme

* Add info on Police Departments