This notebook assembles and cleans the labelled data.

The labelled data come from three sources:
1. The top 100 pages from a randomly initialised run of the original ranking method.
2. The pages ranked 101-200 by a randomly initialised run of the original ranking method.
3. 200 randomly selected pages.

Created sets:

1. `val` (validation) = top 100 pages
2. `train_top200` = pages ranked 101-200
3. `train_random` = 200 random pages
4. `train_all` = `train_top200` + `train_random`
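
As a rough illustration of how the three sources map onto the four sets, the sketch below uses tiny made-up frames in place of the real CSV files. The helper `standardise` and the toy rows are hypothetical; only the column names (`page path`, `pagePath`, `label`, `er_label`) are taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for the three labelled sources (hypothetical rows).
top100 = pd.DataFrame({"page path": ["/a", "/b"], "label": [0, 1]})
top200 = pd.DataFrame({"page path": ["/c", "/d"], "label": [1, 0]})
random_src = pd.DataFrame({"pagePath": ["/e", "/f"], "er_label": [0, 0]})


def standardise(df, path_col, label_col):
    """Keep only the path and label columns, under common names."""
    out = df[[path_col, label_col]].copy()
    return out.rename(columns={path_col: "page_path", label_col: "label"})


val = standardise(top100, "page path", "label")
train_top200 = standardise(top200, "page path", "label")
train_random = standardise(random_src, "pagePath", "er_label")

# train_all is simply the two training sources stacked on top of each other.
train_all = pd.concat([train_top200, train_random], ignore_index=True)
```

Here `ignore_index=True` is a design choice of the sketch: it gives `train_all` a fresh 0..n-1 index instead of repeating the index labels of the two inputs.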

In [1]:
import numpy as np
import pandas as pd

In [35]:
labelled_top100 = pd.read_csv('./data/labelled/pages_ranked_with_data_labelled.csv')
labelled_top200 = pd.read_csv('./data/labelled/pages_ranked_with_data_replication_tolabel2.csv')
labelled_random = pd.read_csv('./data/labelled/random_websites_to_label_final.csv')

labelled_top100.dropna(subset = ["label"], inplace = True)
labelled_top200.dropna(subset = ["label"], inplace = True)
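
A minimal sketch of what `dropna(subset=["label"])` does, using a made-up three-row frame: rows without a label are removed, and the surviving rows keep their original index labels. The counter `n_dropped` is a hypothetical addition that can help confirm how many unlabelled rows were discarded.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one unlabelled row.
df = pd.DataFrame({"page path": ["/a", "/b", "/c"],
                   "label": [0.0, np.nan, 1.0]})

n_before = len(df)
df = df.dropna(subset=["label"])  # drop rows whose label is missing
n_dropped = n_before - len(df)

# The index is not renumbered, so it now has a gap where the row was removed.
remaining_index = list(df.index)
```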

In [36]:
labelled_top200

Unnamed: 0,page path,label
0,/income-support,0
1,/search,0
2,/foreign-travel-advice,0
3,/get-state-pension,0
4,/browse/employing-people/recruiting-hiring,1
...,...,...
96,/apply-first-provisional-driving-licence,0
97,/business-coronavirus-support-finder,1
98,/order-copy-birth-death-marriage-certificate,0
99,/tax-codes,0


In [37]:
# Extract the page-path column from each source.
top100_pages = labelled_top100["page path"]
top200_pages = labelled_top200["page path"]
random_pages = labelled_random["pagePath"]

In [38]:
# `p in series` tests the Series index rather than its values,
# so use `isin` to compare the page paths themselves.
top100_random_overlap = top100_pages.isin(random_pages).sum()
top100_top200_overlap = top100_pages.isin(top200_pages).sum()
top200_random_overlap = top200_pages.isin(random_pages).sum()

print("Overlap of top100 & random", top100_random_overlap)
print("Overlap of top100 & top200", top100_top200_overlap)
print("Overlap of top200 & random", top200_random_overlap)

Overlap of top100 & random 0
Overlap of top200 & random 0
Overlap of top200 & random 0
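
When counting overlap between two pandas Series, it matters whether membership is tested against the values or against the index. The sketch below, on made-up paths, contrasts the two: the `in` operator on a Series looks at index labels, so it reports no overlap for string paths under a default integer index, while `isin` and set intersection compare the values.

```python
import pandas as pd

# Hypothetical page paths; two of them appear in both Series.
a = pd.Series(["/x", "/y", "/z"])
b = pd.Series(["/y", "/z", "/w"])

# `p in b` checks b's index labels (0, 1, 2), not its values.
index_based = sum(p in b for p in a)

# `isin` compares values, which is what an overlap check needs.
value_based = int(a.isin(b).sum())

# Set intersection also shows which paths are shared.
shared = set(a) & set(b)
```

With these inputs `index_based` is 0 even though two paths are shared, which is why a reported overlap of 0 from an `in`-based check on a Series says nothing about the actual overlap.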


In [39]:
df_val = labelled_top100.loc[:, ["page path", "label"]].copy()
df_val.rename(columns = {"page path": "page_path"}, inplace = True)

# Build the training set from the top 101-200 file, not the top 100 (validation) file.
df_train_top200 = labelled_top200.loc[:, ["page path", "label"]].copy()
df_train_top200.rename(columns = {"page path": "page_path"}, inplace = True)

df_train_random = labelled_random.loc[:, ["pagePath", "er_label"]].copy()
df_train_random.rename(columns = {"pagePath": "page_path", "er_label": "label"}, inplace = True)

In [42]:
df_train_all = pd.concat([df_train_top200, df_train_random])

In [44]:
df_train_all.to_csv('./data/labelled/train_all.csv')
df_train_top200.to_csv('./data/labelled/train_top200.csv')
df_train_random.to_csv('./data/labelled/train_random.csv')
df_val.to_csv('./data/labelled/val.csv')
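
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference using an in-memory buffer and a made-up frame; whether to pass `index=False` for the files above is a judgment call that depends on whether downstream code expects the extra column.

```python
import io

import pandas as pd

# Hypothetical two-row frame with the same columns as the saved sets.
df = pd.DataFrame({"page_path": ["/a", "/b"], "label": [0, 1]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
```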

# Updated version

In [2]:
labelled_top100 = pd.read_csv('./data/labelled/pages_ranked_with_data_labelled_v2.csv')
labelled_top200 = pd.read_csv('./data/labelled/pages_ranked_with_data_replication_tolabel2.csv')
labelled_random = pd.read_csv('./data/labelled/random_websites_to_label_final_v2.csv')

In [3]:
labelled_top100.dropna(subset = ["label"], inplace = True)
labelled_top200.dropna(subset = ["label"], inplace = True)

In [4]:
top100_pages = labelled_top100.loc[:,"page path"]
top200_pages = labelled_top200["page path"]
random_pages = labelled_random["pagePath"]

In [5]:
# `p in series` tests the Series index rather than its values,
# so use `isin` to compare the page paths themselves.
top100_random_overlap = top100_pages.isin(random_pages).sum()
top100_top200_overlap = top100_pages.isin(top200_pages).sum()
top200_random_overlap = top200_pages.isin(random_pages).sum()

print("Overlap of top100 & random", top100_random_overlap)
print("Overlap of top100 & top200", top100_top200_overlap)
print("Overlap of top200 & random", top200_random_overlap)

Overlap of top100 & random 0
Overlap of top200 & random 0
Overlap of top200 & random 0


In [7]:
df_val = labelled_top100.loc[:, ["page path", "label"]].copy()
df_val.rename(columns = {"page path": "page_path"}, inplace = True)

# Build the training set from the top 101-200 file, not the top 100 (validation) file.
df_train_top200 = labelled_top200.loc[:, ["page path", "label"]].copy()
df_train_top200.rename(columns = {"page path": "page_path"}, inplace = True)

df_train_random = labelled_random.loc[:, ["pagePath", "er_label"]].copy()
df_train_random.rename(columns = {"pagePath": "page_path", "er_label": "label"}, inplace = True)

In [8]:
df_train_all = pd.concat([df_train_top200, df_train_random])

In [9]:
df_train_all.to_csv('./data/labelled/train_all_v2.csv')
df_train_top200.to_csv('./data/labelled/train_top200_v2.csv')
df_train_random.to_csv('./data/labelled/train_random_v2.csv')
df_val.to_csv('./data/labelled/val_v2.csv')