## Quantitative Methods 2: Data Science and Visualisation
## Workshop 6: Advanced Dataframe Operations

Today we will discuss and practise **merging** and **joining**, and perform some more advanced DataFrame operations.

## Downloading the Data
Let's grab the data we will need this week from our course website and save it into our data folder. If you've not already created a data folder then do so using the following command. 

Don't worry if it generates an error; that just means you already have a data folder.

In [None]:
!mkdir data

In [None]:
!mkdir data/wk6
!mkdir data/wk6/crimeData

!curl https://qm2.s3.eu-west-2.amazonaws.com/wk6/crimeData/2020-03-metropolitan-street.csv -o ./data/wk6/crimeData/2020-03-metropolitan-street.csv
!curl https://qm2.s3.eu-west-2.amazonaws.com/wk6/crimeData/2020-04-metropolitan-street.csv -o ./data/wk6/crimeData/2020-04-metropolitan-street.csv
!curl https://qm2.s3.eu-west-2.amazonaws.com/wk6/crimeData/2020-05-metropolitan-street.csv -o ./data/wk6/crimeData/2020-05-metropolitan-street.csv
!curl https://qm2.s3.eu-west-2.amazonaws.com/wk6/crimeData/2020-06-metropolitan-street.csv -o ./data/wk6/crimeData/2020-06-metropolitan-street.csv
!curl https://qm2.s3.eu-west-2.amazonaws.com/wk6/crimeData/2020-07-metropolitan-street.csv -o ./data/wk6/crimeData/2020-07-metropolitan-street.csv
!curl https://qm2.s3.eu-west-2.amazonaws.com/wk6/crimeData/2020-08-metropolitan-street.csv -o ./data/wk6/crimeData/2020-08-metropolitan-street.csv
!curl https://qm2.s3.eu-west-2.amazonaws.com/wk6/lsoa-data.csv -o ./data/wk6/lsoa-data.csv

`--------------------------------`

Let's remind ourselves of two common types of join: the inner join and the left outer join.

In [None]:
from IPython.display import Image
Image("http://danielhammocks.uk/teaching/BASC/wk6/inner_join.png")

In [None]:
Image("http://danielhammocks.uk/teaching/BASC/wk6/left_outer_join.png")
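The two diagrams above can be reproduced on a small scale with `pd.merge`. The following is a minimal sketch, assuming two toy tables (`left` and `right`, with made-up column names that are not part of the workshop data) that share a `key` column.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Inner join: keep only the keys that appear in both tables
inner = pd.merge(left, right, on="key", how="inner")

# Left outer join: keep every row of `left`;
# rows with no match in `right` get NaN in the right-hand columns
left_outer = pd.merge(left, right, on="key", how="left")

print(inner)
print(left_outer)
```

The inner join drops key `a` because it has no partner in `right`, whereas the left outer join keeps it and fills `right_val` with a missing value.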

We start with an example.

Let's create two DataFrames filled with random numbers.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random

In [None]:
df1 = pd.DataFrame(np.random.randn(5, 5))

In [None]:
df1

In [None]:
df2 = pd.DataFrame(np.random.randn(3, 5))

In [None]:
df2

Let's combine them by stacking their rows. `pd.concat` does that for us.

In [None]:
df3 = pd.concat([df1,df2])

In [None]:
df3

What is the problem with the dataframe above? 

In [None]:
# Enter answer here

`-----------------------------------------`   
  
The solution is to set `ignore_index` to `True`.

In [None]:
df3 = pd.concat([df1,df2], ignore_index=True)

In [None]:
df3

`ignore_index` is very useful when we want a new DataFrame that simply stacks the data from other DataFrames whose original index labels are otherwise unrelated.
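To see the difference concretely, here is a small sketch on two toy DataFrames (`a` and `b` are illustrative names, not part of the workshop data). It compares the index labels produced with and without `ignore_index=True`.

```python
import numpy as np
import pandas as pd

# Two toy DataFrames, each with its own default index starting at 0
a = pd.DataFrame(np.zeros((3, 2)))
b = pd.DataFrame(np.ones((2, 2)))

# Plain concat keeps the original labels, so 0 and 1 appear twice
stacked = pd.concat([a, b])
print(stacked.index.tolist())

# Selecting label 0 now returns two rows, one from each source
print(stacked.loc[0])

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())
```

Duplicate labels are not an error in pandas, but they make label-based selection ambiguous, which is why renumbering is usually the safer choice when the old index carries no meaning.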




---

Now we want to look at more complex merge operations, which take into account the data values.

We have six months of crime data (March to August 2020) from the Metropolitan Police. However, the data is not provided in a single file; it is split into one file per month. You should create a single DataFrame called `dataCrime` that combines the 6 files.

*Hint:* A quick Google search or a look on Stack Overflow will find you the answer!

In [None]:
#Enter your code here.

In [None]:
#Print the first 5 rows
dataCrime.head()

This dataset is not very informative in its current state because of the sheer amount of information it contains. Summarise the number of crimes committed under each category over the 6 months to provide a broad overview of crime in London.

Hint: You can quickly summarise data in a DataFrame column by using the `value_counts()` method.

In [None]:
#Enter Code Here

Note that the returned data is a `Series`. We can turn it into a DataFrame quite easily using `to_frame()`. Create a DataFrame of the summary data called `crimeCounts`.

In [None]:
#Enter Code Here

In [None]:
#View the dataframe
crimeCounts
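As an illustration of the two methods mentioned above, the sketch below applies `value_counts()` and `to_frame()` to a tiny made-up table (`toy` and its contents are invented here, not taken from the crime data).

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {"Crime type": ["Burglary", "Robbery", "Burglary", "Drugs", "Burglary"]}
)

# value_counts() returns a Series: the index holds the categories,
# the values hold how often each one occurs (most frequent first)
counts = toy["Crime type"].value_counts()
print(type(counts).__name__)

# to_frame() wraps that Series in a one-column DataFrame;
# `name` sets the column label
counts_df = counts.to_frame(name="Count")
print(counts_df)
```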

---
# Merging with Real Data - London Demographic Dataset

Socio-economic and demographic factors are often associated with criminality and rates of crime. Using the latest census data, let's explore some of these relationships.



In [None]:
#Specify the path to the demographic dataset
path = './data/wk6/lsoa-data.csv'
# Load the demographic dataset as dataLSOA
dataLSOA =  pd.read_csv(path, encoding = 'latin1')

#View the first 5 entries
dataLSOA.head()

Merge the `dataCrime` and `dataLSOA` DataFrames into a single DataFrame called `dataLondon`. You will need to find a column in each DataFrame that identifies the same areas.

In [None]:
#Enter your code here.

In [None]:
#View the combined dataset
dataLondon.head()
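When the key columns are named differently in the two tables, `pd.merge` accepts `left_on` and `right_on`. The sketch below uses invented tables and column names (`events`, `areas`, `area_code`, `code`, `population` are all hypothetical), and adds `indicator=True` as one way to check how many rows found a match.

```python
import pandas as pd

# Hypothetical toy tables; names and values are invented for illustration
events = pd.DataFrame({
    "area_code": ["E01", "E02", "E01", "E99"],
    "Crime type": ["Burglary", "Drugs", "Robbery", "Burglary"],
})
areas = pd.DataFrame({
    "code": ["E01", "E02", "E03"],
    "population": [1500, 1700, 1600],
})

# Left join: keep every event and attach the area attributes where the
# codes match. indicator=True adds a `_merge` column that records
# whether each row was found in both tables or only in the left one.
merged = pd.merge(
    events, areas,
    left_on="area_code", right_on="code",
    how="left", indicator=True,
)

print(merged)
print(merged["_merge"].value_counts())
```

Checking the `_merge` column is a quick way to spot keys that failed to match, for example because of inconsistent codes between the two sources.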

# Dataframe Subsets

There are now a lot of columns in our dataset. We can use the following command to get a list of the column names:

In [None]:
dataLondon.columns

You will see that there are 289 different columns now! Suppose we are only interested in the household composition data. We can use the `filter()` method combined with a regular expression to select all of the columns whose names begin with a specific phrase. Note that the match is case sensitive.

In [None]:
#Filter the dataset
dataHouseComposition = dataLondon.filter(regex=r'^Household Composition', axis=1)

#Preview the data
dataHouseComposition.head()
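The same pattern can be tried on a small made-up table to see exactly which columns a regular expression selects. The column names below are invented for illustration and only mimic the style of the real ones.

```python
import pandas as pd

# Hypothetical toy table; column names are invented for illustration
toy = pd.DataFrame({
    "Household Composition;Couple": [10, 12],
    "Household Composition;Lone parent": [3, 4],
    "Population;All ages": [1500, 1700],
})

# `^` anchors the pattern to the start of the column name;
# axis=1 applies the filter to column labels rather than row labels
subset = toy.filter(regex=r"^Household Composition", axis=1)
print(subset.columns.tolist())

# The match is case sensitive: a lower-case "composition" selects nothing
no_match = toy.filter(regex=r"^Household composition", axis=1)
print(no_match.shape[1])
```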


We can also filter the rows of the DataFrame on specific values to analyse crime in areas with specific demographics.

Let's look at the 5 most prevalent crimes in areas where the median household income is greater than £50,000 per year.


In [None]:
dataLondon[dataLondon['Household Income, 2011/12;Median Annual Household Income estimate (£)']>50000].value_counts('Crime type').head(5)
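The one-liner above packs several steps together. As a rough sketch on invented data (`toy` and the `income` column are hypothetical), the same idea can be split into a boolean mask, a row selection, and a count.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "income": [60000, 55000, 18000, 52000, 19000],
    "Crime type": ["Burglary", "Burglary", "Robbery", "Drugs", "Robbery"],
})

# Step 1: a boolean Series, True where the condition holds
mask = toy["income"] > 50000
print(mask.tolist())

# Step 2: keep only the rows where the mask is True, then count the
# crime types; head(5) keeps at most the five most frequent
top = toy.loc[mask, "Crime type"].value_counts().head(5)
print(top)
```

Naming the mask separately makes it easy to reuse or to combine with other conditions using `&` and `|`.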

Repeat this for the areas where the median household income is less than £20,000 per year.

In [None]:
#Enter your code here

Can we compare these directly? If not, why not, and what could be done to resolve this issue?

In [None]:
#Enter your answers to the questions here

If you think a correction is needed implement it here...

In [None]:
#Enter your code here

What differences do you notice between the two sets of results?

In [None]:
#We see that... 

# Plotting the Data


Let us first create a time-series plot in order to better understand how London's crime profile changed over the 6 months.

In [None]:
#Convert the month column to a date format
dataLondon['Month'] = pd.to_datetime(dataLondon['Month'])

#Use the groupby method to create a dataframe of the plotting data
dataPlot = dataLondon.groupby(['Month', 'Crime type']).size().reset_index(name="Count")

#View the dataframe
dataPlot.head()
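To make the `groupby` step less opaque, here is a minimal sketch of the same chain on a five-row made-up table (`toy` and its values are invented). It shows how `size()` counts the rows in each group and how `reset_index(name=...)` turns the result back into an ordinary DataFrame.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Month": ["2020-03", "2020-03", "2020-03", "2020-04", "2020-04"],
    "Crime type": ["Burglary", "Burglary", "Drugs", "Burglary", "Drugs"],
})

# Parse the year-month strings into datetimes (first day of each month)
toy["Month"] = pd.to_datetime(toy["Month"], format="%Y-%m")

# size() counts the rows in each (Month, Crime type) group and returns a
# Series with a two-level index; reset_index(name="Count") moves the
# index levels back into columns and names the counts column
counts = toy.groupby(["Month", "Crime type"]).size().reset_index(name="Count")
print(counts)
```

The result is in "long" format, with one row per month and crime type, which is the layout that `sns.lineplot` and `px.line` expect when a column is used for colour.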

In [None]:
#Seaborn: visualisation library
import seaborn as sns

#Create the plot
plt.figure()
sns.lineplot(data = dataPlot, x = 'Month', y = 'Count', hue = 'Crime type')

#Move the legend outside of the plot
plt.legend(loc='right', bbox_to_anchor=(1.6, 0.5), ncol=1, title = 'Crime Type')

#Don't forget to add a title and label your axes!
plt.ylabel('Frequency')
plt.title('Frequency of each Crime Type from March to August 2020 in London')

This creates a nice plot but it is difficult to distinguish the different categories. An interactive plot would help us overcome these difficulties...

Below we use the plotly library to create an interactive equivalent of the above graph. Have a play around with the hover and different features available on the top right of the plot. 

We will not cover plotly further, so if you are not sure what these commands are doing, check out the plotly documentation.

In [None]:
#Plotly Express: Visualisation Library (component of plotly)
import plotly.express as px

#Create the plot
fig = px.line(dataPlot, x = 'Month', y = 'Count',
              color = 'Crime type', line_group = 'Crime type',
              labels={"Count": "Frequency", "Crime type": "Crime Type"},
              title = 'Frequency of each Crime Type from March to August 2020 in London',
              hover_name = 'Crime type')

#Tidy up the legend
fig.for_each_trace(lambda t: t.update(name=t.name.replace("Crime Type=", "")))
fig.show()

What effect do you think COVID-19 had on the crime profile? Consider the individual crime types and how lockdown may have affected them. Do you need additional data to back up your claims? If so, what data?

In [None]:
#Enter your response here.

## Exercise:

Investigate the trend between a demographic variable and crime.

1. Choose a socio-demographic variable from the dataset and choose a crime variable that you would like to compare. This could be the total crime count in each LSOA, or the frequency of a specific crime type.
2. Manipulate the dataframes as necessary in order to aid plotting.
3. Plot absolute values, percent changes, means and medians. Compare the results and see whether anything can be concluded.
4. **Advanced:** Create a plot with different series.


In [None]:
#Enter your code here