# Description of analysis

I use the beach pollution data (1994-2021) to identify the UK beaches with a sufficient number of pre- and post-policy observations, taking 2015 as the year of the policy change: the year England began charging 5p for single-use plastic bags. Here are some important dates to remember:
* A five pence charge came into effect on single use carrier bags in England on 5 October 2015.

* Wales, Northern Ireland, and Scotland introduced a 5p levy on single use carrier bags in 2011, 2013, and 2014 respectively. The purpose of each single use carrier bag charge is to reduce the number of bags given out, increase their re-use and reduce litter.

Source: https://commonslibrary.parliament.uk/research-briefings/cbp-7241/


In [7]:
# Loading packages and data
import pandas as pd
import numpy as np

df = pd.read_excel("Beachwatch_AllData_1994-23.01.21inc Public Source Litter YChen.xlsx")

To include sufficient pre and post periods, I include only beach pollution data between January 1, 2010 and December 31, 2019. I exclude 2020 to avoid the pandemic years.

In [8]:
start_date = pd.to_datetime('2010-01-01')
end_date = pd.to_datetime('2019-12-31')

df['Date of Survey'] = pd.to_datetime(df['Date of Survey'])

df = df[(df["Date of Survey"]>=start_date) & (df["Date of Survey"]<=end_date)]

df = df.reset_index(drop = True)

After filtering the dataset for the 2010 - 2019 period, the resulting data set has 8,300 observations for a total of 1,803 beaches.

In [9]:
# number of beaches
len(np.unique(df.BeachID))

1803

In [10]:
# number of observations
len(df)

8300

A complete monthly panel for the above 1,803 beaches would have 216,360 observations [(10 x 12) x 1,803], but the current data contain only 8,300. A monthly panel would therefore be very sparse, so I use yearly data instead of monthly.
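
As a quick check of the arithmetic above, the size of a complete monthly panel and the share of it that is actually observed can be computed directly. This is only a sketch: the variable names are illustrative, and the two counts are copied from the cell outputs above.

```python
# Counts copied from the cell outputs above
n_beaches = 1803
n_observations = 8300

# A complete monthly panel has one row per beach per month over 2010-2019
n_years = 10
expected_monthly = n_years * 12 * n_beaches

# Share of the complete monthly panel that is actually observed
coverage = n_observations / expected_monthly

print(expected_monthly)   # 216360
print(f"{coverage:.1%}")  # 3.8%
```

With under 4% of beach-months observed, aggregating to a coarser frequency is a reasonable next step.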

### Converting to yearly data

The new dataset below, ndf, now contains beach pollution in yearly frequency. 

In [11]:
ndf = df.sort_values(by=['BeachID', 'Date of Survey'], ascending=[True, True]) # sorting
ndf = ndf.set_index('Date of Survey')
ndf = ndf.groupby('BeachID').resample('Y').sum() # converting to yearly totals (sum of all surveys in each year)
del ndf["BeachID"]
ndf = ndf.reset_index(level=1).reset_index() # resetting index
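
An alternative way to build a yearly table, shown here only as a sketch on made-up data, is to group on the calendar year directly. Resampling within each beach can also emit rows for unsurveyed years that fall between a beach's first and last survey, whereas grouping on the year yields rows only for years in which a survey took place. This may matter when years of observations are counted later. The `Total Litter` column and all values below are hypothetical and are not taken from the Beachwatch file.

```python
import pandas as pd

# Toy survey data (hypothetical values and column name "Total Litter")
toy = pd.DataFrame({
    "BeachID": [1, 1, 1, 2],
    "Date of Survey": pd.to_datetime(
        ["2010-03-01", "2010-09-15", "2013-06-20", "2016-05-05"]),
    "Total Litter": [10, 5, 7, 3],
})

# Calendar year of each survey
toy["Year"] = toy["Date of Survey"].dt.year

# One row per beach and per year in which at least one survey took place;
# litter counts are summed over all surveys in that year
yearly = toy.groupby(["BeachID", "Year"], as_index=False)["Total Litter"].sum()
print(yearly)
```

In this toy example beach 1 has rows for 2010 and 2013 only; no rows are created for 2011 and 2012.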

Now, taking October 2015 as the month of the policy change, each beach can have at most 5 years of pre-period observations (2010-2014) and at most 5 years of post-period observations (2015-2019). Because the yearly rows are dated at year end, 2015 falls in the post period. Below I subset the dataset to include only the beaches that have at least 3 years of observations in each of the pre and post periods (at least 6 years in total).

In [12]:
# Separating the dataset for pre and post period

intervention = pd.to_datetime('2015-10-05') # the date when England began the 5p charge
post_period = ndf[ndf['Date of Survey'] >= intervention]
pre_period = ndf[ndf['Date of Survey'] < intervention]

pre_observations_per_beach = pre_period.groupby('BeachID').size()
pre_observations_per_beach = pre_observations_per_beach.reset_index()
pre_observations_per_beach.columns = ['BeachID', '#of_Pre_observations']

post_observations_per_beach = post_period.groupby('BeachID').size()
post_observations_per_beach = post_observations_per_beach.reset_index()
post_observations_per_beach.columns = ['BeachID', '#of_Post_observations']

## Keeping only the beaches that have at least 3 years of pre data out of 5 years
pre_observations_per_beach = pre_observations_per_beach[pre_observations_per_beach['#of_Pre_observations']>=3]

## Keeping only the beaches that have at least 3 years of post data out of 5 years
post_observations_per_beach = post_observations_per_beach[post_observations_per_beach['#of_Post_observations']>=3]

The merged data set below contains the BeachIDs with at least 3 years of pre-period and 3 years of post-period observations.
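
The same pre/post eligibility rule can be written more compactly by flagging each yearly row with its period and counting rows per beach and period. The sketch below runs on a made-up yearly table with year-end dates; the beaches and the `period` column are hypothetical, and this is one possible formulation rather than the code used for the results in this notebook.

```python
import pandas as pd

intervention = pd.to_datetime("2015-10-05")

# Toy yearly table with year-end dates (hypothetical beaches)
toy = pd.DataFrame({
    "BeachID": [1] * 6 + [2] * 4,
    "Date of Survey": pd.to_datetime(
        ["2012-12-31", "2013-12-31", "2014-12-31",
         "2015-12-31", "2016-12-31", "2017-12-31",
         "2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31"]),
})

# Flag each yearly row as pre or post relative to the intervention date
is_post = toy["Date of Survey"] >= intervention
toy["period"] = is_post.map({False: "pre", True: "post"})

# Count yearly rows per beach and period (one column per period)
counts = toy.groupby(["BeachID", "period"]).size().unstack(fill_value=0)

# Keep beaches with at least 3 yearly rows in both periods
eligible = counts[(counts["pre"] >= 3) & (counts["post"] >= 3)].index.tolist()
print(eligible)  # [1]
```

Beach 2 is dropped because it has only one pre-period year, even though it has three post-period years.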

In [13]:
## Merging the two subsets above
merged = pd.merge(pre_observations_per_beach, post_observations_per_beach, on = 'BeachID')

## Total 359 beaches in this "merged" sample 

In [14]:
len(merged)

359

In [16]:
merged.to_csv('beach_sample.csv')

## Checking the number of beaches in England only (excluding Scotland, Wales, Northern Ireland, and Channel Islands)

In [18]:
# Loading the original beach pollution data
df = pd.read_excel("Beachwatch_AllData_1994-23.01.21inc Public Source Litter YChen.xlsx")

# Including only beaches that are sampled in the "merged" data set. These beaches have sufficient pre and post observations for litter collection
df = df[df['BeachID'].isin(merged['BeachID'])]

In [20]:
# Checking all the different regions of UK beaches
beach_regions = np.unique(df["Beach Region"]).tolist()
print(beach_regions)

['Channel Islands', 'North East England', 'North West England', 'Northern Ireland', 'Scotland', 'South East England', 'South West England', 'Wales']


In [21]:
# Making a list of the regions in England only
beach_regions = ["North East England", "North West England", "South East England", "South West England"]

In [22]:
# Filtering out beaches outside of England
df = df[df["Beach Region"].isin(beach_regions)]

# Total number of beaches in England that are included in the final sample
len(np.unique(df["BeachID"]))

225
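
A per-region breakdown can serve as a cross-check on the total above. The sketch below uses made-up rows. The region labels are those printed earlier, while the beaches and the names `england_regions`, `england`, and `per_region` are illustrative.

```python
import pandas as pd

# Toy survey rows (hypothetical beaches; region labels as printed above)
toy = pd.DataFrame({
    "BeachID": [1, 1, 2, 3, 4],
    "Beach Region": ["Wales", "Wales", "North East England",
                     "South West England", "South West England"],
})

england_regions = ["North East England", "North West England",
                   "South East England", "South West England"]

# Keep only the rows for beaches in English regions
england = toy[toy["Beach Region"].isin(england_regions)]

# Distinct beaches per English region, and overall
per_region = england.groupby("Beach Region")["BeachID"].nunique()
print(per_region)
print(england["BeachID"].nunique())  # 3
```

If each beach belongs to exactly one region, the per-region counts should add up to the overall count.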

## In total, 225 beaches, all in England, are included in the final sample.