# Comparative Analysis of Descriptive Metadata

### Newcastle University Special Collections and University of Edinburgh Archives

***

**Table of Contents**
* [Description Lengths](#description-lengths)
* [Gendered Language](#gendered-language)
* [Parts of Speech](#parts-of-speech)
  * [Adjectives](#adjectives)
  * [Adverbs](#adverbs)

***

In [108]:
import analysis_utils
import pandas as pd
import numpy as np
from pathlib import Path

Create variables to reference the files and data points from the `analysis_metadata_XXX.ipynb` notebooks.

In [39]:
dir_nusc = "data/analysis/"
f_descs_nusc = dir_nusc+"nusc_ead_descs_tokenized.csv"
files_nusc = [
    "nusc_ead_descs_stats.csv", 
    "nusc_ead_gendered_lower_token_counts_ead.csv", "nusc_ead_gendered_capitalized_token_counts.csv",
    "nusc_ead_descs_pos_tags.csv",
    "nusc_ead_nltk_adj_token_counts.csv", "nusc_ead_nltk_adj_token_lower_counts.csv",
    "nusc_ead_nltk_adv_token_counts.csv", "nusc_ead_nltk_adv_token_lower_counts.csv",
    "nusc_ead_nltk_adj_adv_counts.csv", "nusc_ead_nltk_adj_adv_stats.csv"
]

In [40]:
dir_uoe = "data/uoe/analysis/"
f_descs_uoe = "data/uoe/analysis/uoe_descs.csv"
files_uoe = [
    "uoe_ead_descs_oct2020_stats.csv", 
    "uoe_ead_gendered_lower_token_counts_oct2020.csv", "uoe_ead_gendered_capitalized_token_counts_oct2020.csv",
    "uoe_ead_descs_oct2020_pos_tags.csv", 
    "uoe_descs_oct2020_nltk_adj_token_counts.csv", "uoe_descs_oct2020_nltk_adj_token_lower_counts.csv",
    "uoe_descs_oct2020_nltk_adv_token_counts.csv", "uoe_descs_oct2020_nltk_adv_token_lower_counts.csv",
    "uoe_descs_oct2020_nltk_adj_adv_counts.csv", "uoe_oct2020_nltk_adj_adv_stats.csv"
]
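Because the paths are built by string concatenation, a typo in a file name only surfaces when `pd.read_csv` fails. A minimal sketch of a check that could run first, assuming a hypothetical helper `missing_files` (not part of `analysis_utils`) and a temporary directory standing in for the real data directories:

```python
from pathlib import Path
import tempfile


def missing_files(directory, filenames):
    """Return the names in `filenames` that do not exist inside `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


# Demonstration with a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "present.csv").write_text("a,b\n1,2\n")
    missing = missing_files(tmp, ["present.csv", "absent.csv"])

print(missing)
```

In this notebook the same helper could be called as `missing_files(dir_uoe, files_uoe)` before any file is read.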

In [4]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # display floating point numbers with 3 decimal places

In [109]:
comparison_dir = "data/nusc_vs_uoe/"
Path(comparison_dir).mkdir(parents=True, exist_ok=True)

## Description Lengths

In [13]:
df_len_nusc = pd.read_csv(dir_nusc + "nusc_ead_descs_stats.csv", index_col=0)
df_len_nusc

Unnamed: 0,sentences_per_description,tokens_per_description,tokens_per_sentence
mean,1.658,25.482,15.373
median,1.0,9.0,12.0
standard_deviation,2.396,63.481,14.191
minimum,1.0,0.0,0.0
maximum,91.0,2296.0,1563.0
total,124592.0,1915370.0,1915370.0


In [14]:
df_len_uoe = pd.read_csv(dir_uoe + "uoe_ead_descs_oct2020_stats.csv", index_col=0)
df_len_uoe

Unnamed: 0,sentences_per_description,tokens_per_description,tokens_per_sentence
mean,1.502,19.553,13.022
median,1.0,10.0,9.0
standard_deviation,5.377,97.627,14.975
minimum,1.0,1.0,1.0
maximum,742.0,14097.0,551.0
total,41671.0,542635.0,542635.0


Newcastle's data shows more tokens per description on average (25.482 vs. 19.553), which I hadn't expected, though the median is greater for Edinburgh (10 vs. 9).  There is also greater variation in description lengths in the Edinburgh data than in the Newcastle data (for tokens per description, Edinburgh's standard deviation is 97.627 vs. Newcastle's 63.481; for sentences per description, Edinburgh's is 5.377 vs. Newcastle's 2.396).  Edinburgh's maximum number of tokens in a single description is also far larger (14,097 vs. 2,296, over 6x as large), as is its maximum number of sentences (742 vs. 91).  The average tokens per sentence are similar (15.373 for Newcastle vs. 13.022 for Edinburgh).
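Since the two datasets have different means, one way to put the variation on a comparable scale is the coefficient of variation (standard deviation divided by mean). The sketch below is not part of the original analysis; it only reuses the tokens-per-description figures from the two tables above:

```python
# Summary statistics copied from the two tables above (tokens per description).
stats = {
    "nusc": {"mean": 25.482, "std": 63.481, "max": 2296},
    "uoe": {"mean": 19.553, "std": 97.627, "max": 14097},
}

# Coefficient of variation: standard deviation expressed in units of the mean.
cv = {name: s["std"] / s["mean"] for name, s in stats.items()}
max_ratio = stats["uoe"]["max"] / stats["nusc"]["max"]

for name, value in cv.items():
    print(f"{name}: coefficient of variation = {value:.2f}")
print(f"UoE max / NUSC max = {max_ratio:.2f}")
```

On these figures the coefficient of variation is about 2.5 for Newcastle and about 5.0 for Edinburgh, which supports the reading that Edinburgh's description lengths are more spread out.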

What if we remove all descriptions that are only 1 sentence long?  How many descriptions are we left with in the Newcastle and Edinburgh datasets?  What is the average length of the remaining descriptions?

In [35]:
df_descs_nusc = pd.read_csv(f_descs_nusc, index_col=0)
# The tokenized column will be read as a string, so convert it to a list of lists of strings
df_descs_nusc = analysis_utils.getColumnValuesAs2dList(df_descs_nusc, "tokenized", replace=True)
df_descs_nusc.head()

Unnamed: 0,description_id,eadid,rowid,field,doc,tokenized
0,0,17th C. Coll,17th C. Coll,bioghist,Formed in 1963 after an amalgamation of instit...,"[[Formed, in, 1963, after, an, amalgamation, o..."
1,1,17th C. Coll,17th C. Coll,scopecontent,The 17th Century Collection is a small but exp...,"[[The, 17th, Century, Collection, is, a, small..."
2,2,17th C. Coll,17th C. Coll,unittitle,17th Century Collection,"[[17th, Century, Collection]]"
3,3,18th C. Coll,18th C. Coll,bioghist,Formed in 1963 after an amalgamation of instit...,"[[Formed, in, 1963, after, an, amalgamation, o..."
4,4,18th C. Coll,18th C. Coll,scopecontent,The 18th Century Collection contains approxima...,"[[The, 18th, Century, Collection, contains, ap..."


In [36]:
tokenized_nusc = list(df_descs_nusc["tokenized"])
sents_per_desc_nusc = [len(desc) for desc in tokenized_nusc]
# tokens_per_desc_nusc = [sum([len(sent) for sent in desc]) for desc in tokenized_nusc]
print(sents_per_desc_nusc[0:5])
# print(tokens_per_desc_nusc[0:5])

[2, 3, 1, 2, 3]


In [37]:
assert len(sents_per_desc_nusc) == df_descs_nusc.shape[0], "There should be one sentence count for every description."

In [38]:
df_descs_uoe = pd.read_csv(f_descs_uoe, index_col=0)
df_descs_uoe = analysis_utils.getColumnValuesAsLists(df_descs_uoe, "sentence_id")
df_descs_uoe.head()

Unnamed: 0,description_id,sentence_id,token_id,token,pos
0,0,[0],[[2]],[['AA5']],[['NN']]
1,1,[1],"[[5, 6, 7, 8, 9, 10, 11, 12, 14]]","[['Papers', 'of', 'The', 'Very', 'Rev', 'Prof'...","[['NNS', 'IN', 'DT', 'NNP', 'NNP', 'NNP', 'NNP..."
2,2,[2],"[[17, 20, 21, 22, 24, 26, 28, 30, 31, 32, 33, ...","[['and', 'Sermons', 'and', 'addresses', '1948-...","[['CC', 'NNS', 'CC', 'NNS', 'JJ', 'NNS', 'JJ',..."
3,3,"[3, 4, 5, 6, 7, 8, 9, 10]","[[113, 114, 115, 116, 117, 118, 119, 120, 121,...","[['Professor', 'James', 'Aitken', 'White', 'wa...","[['NNP', 'NNP', 'NNP', 'NNP', 'VBD', 'DT', 'JJ..."
4,4,[11],[[310]],[['AA6']],[['NN']]
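The implementation of `analysis_utils.getColumnValuesAsLists` isn't shown here. Assuming the cells hold Python-literal strings such as `"[3, 4, 5]"` (as the unconverted `token` column in the preview suggests), one common way to do this kind of conversion is `ast.literal_eval`; the sketch below uses sample values from the preview rather than the real file:

```python
import ast

# Cell values as they might appear after pd.read_csv (strings, not lists).
raw_sentence_ids = ["[0]", "[1]", "[3, 4, 5, 6, 7, 8, 9, 10]"]

# ast.literal_eval evaluates Python literals only, so it cannot run arbitrary code.
parsed = [ast.literal_eval(value) for value in raw_sentence_ids]
sentence_counts = [len(ids) for ids in parsed]

print(parsed[2])
print(sentence_counts)
```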


In [39]:
sent_id_lists = list(df_descs_uoe.sentence_id)
sents_per_desc_uoe = [len(sent_ids) for sent_ids in sent_id_lists]
print(sents_per_desc_uoe[0:5])

[1, 1, 1, 8, 1]


In [40]:
long_sents_nusc = [s_count for s_count in sents_per_desc_nusc if s_count > 1]
print("Remaining NUSC descriptions:", len(long_sents_nusc), "or", str((len(long_sents_nusc)/len(sents_per_desc_nusc))*100)+"% of total")
long_sents_uoe = [s_count for s_count in sents_per_desc_uoe if s_count > 1]
print("Remaining UoE descriptions:", len(long_sents_uoe), "or", str(len(long_sents_uoe)/len(sents_per_desc_uoe)*100)+"% of total")

Remaining NUSC descriptions: 15422 or 20.51752810483603% of total
Remaining UoE descriptions: 5554 or 20.01297203805131% of total


About 20% of the descriptions in **both** datasets are 2 or more sentences long!
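The same filter is repeated below with higher thresholds. A sketch of how the repetition could be folded into one helper, assuming a hypothetical function `share_above` and using only the first five sentence counts printed for each dataset as toy input:

```python
def share_above(sentence_counts, threshold):
    """Return (count, percentage) of descriptions with more than `threshold` sentences."""
    remaining = sum(1 for count in sentence_counts if count > threshold)
    return remaining, remaining * 100 / len(sentence_counts)


# Toy input: the first five sentence counts from each dataset's preview above.
toy_counts = {"NUSC": [2, 3, 1, 2, 3], "UoE": [1, 1, 1, 8, 1]}

results = {}
for name, counts in toy_counts.items():
    for threshold in (1, 2, 3):
        results[(name, threshold)] = share_above(counts, threshold)
        remaining, pct = results[(name, threshold)]
        print(f"{name}, more than {threshold} sentence(s): {remaining} ({pct:.1f}%)")
```

With the full `sents_per_desc_nusc` and `sents_per_desc_uoe` lists in place of the toy input, the loop would produce the same figures as the individual cells.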

In [41]:
long_sents_nusc = [s_count for s_count in sents_per_desc_nusc if s_count > 2]
print("Remaining NUSC descriptions:", len(long_sents_nusc), "or", str((len(long_sents_nusc)/len(sents_per_desc_nusc))*100)+"% of total")
long_sents_uoe = [s_count for s_count in sents_per_desc_uoe if s_count > 2]
print("Remaining UoE descriptions:", len(long_sents_uoe), "or", str(len(long_sents_uoe)/len(sents_per_desc_uoe)*100)+"% of total")

Remaining NUSC descriptions: 6250 or 8.315040244794785% of total
Remaining UoE descriptions: 1899 or 6.8427500720668775% of total


In [42]:
long_sents_nusc = [s_count for s_count in sents_per_desc_nusc if s_count > 3]
print("Remaining NUSC descriptions:", len(long_sents_nusc), "or", str((len(long_sents_nusc)/len(sents_per_desc_nusc))*100)+"% of total")
long_sents_uoe = [s_count for s_count in sents_per_desc_uoe if s_count > 3]
print("Remaining UoE descriptions:", len(long_sents_uoe), "or", str(len(long_sents_uoe)/len(sents_per_desc_uoe)*100)+"% of total")

Remaining NUSC descriptions: 4132 or 5.497239406638728% of total
Remaining UoE descriptions: 982 or 3.538483712885558% of total
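The cells above count the remaining descriptions but leave the question of their average length open. A minimal sketch of that step, assuming data shaped like the `tokenized` column (a list of sentences per description, each sentence a list of tokens); the helper `mean_tokens` and the toy descriptions are made up for illustration:

```python
from statistics import mean

# Toy data in the same shape as the `tokenized` column:
# one list of sentences per description, each sentence a list of tokens.
toy_tokenized = [
    [["17th", "Century", "Collection"]],
    [["Formed", "in", "1963"], ["It", "holds", "rare", "books"]],
    [["Papers", "of", "a", "professor"], ["Sermons"], ["Addresses"]],
]


def mean_tokens(tokenized, min_sentences):
    """Mean tokens per description, keeping descriptions with at least `min_sentences` sentences."""
    lengths = [
        sum(len(sentence) for sentence in description)
        for description in tokenized
        if len(description) >= min_sentences
    ]
    return mean(lengths) if lengths else 0.0


print(mean_tokens(toy_tokenized, 1))
print(mean_tokens(toy_tokenized, 2))
```

Applied to `tokenized_nusc`, this would give the average length of the Newcastle descriptions that survive each threshold; the Edinburgh data would need its `token` column converted to nested lists first.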


## COME BACK TO: B/H only!

## Gendered Language

In [23]:
i = 1 # lowercased grammatically/lexically gendered tokens

In [24]:
df_nusc_gendered = pd.read_csv(dir_nusc + files_nusc[i], index_col=0)
print(df_nusc_gendered.shape)

(30, 1)


In [25]:
df_uoe_gendered = pd.read_csv(dir_uoe + files_uoe[i], index_col=0)
print(df_uoe_gendered.shape)

(30, 1)


In [26]:
total_tokens_nusc = 1915370
perc_nusc = (df_nusc_gendered["count"]/total_tokens_nusc)*100
df_nusc_gendered.insert(len(df_nusc_gendered.columns), "percentage", perc_nusc)
df_nusc_gendered

word,count,percentage
him,0,0.0
his,0,0.0
hers,0,0.0
her,0,0.0
she,0,0.0
he,0,0.0
granddaughter,3,0.0
niece,3,0.0
grandmother,6,0.0
grandson,7,0.0


In [27]:
df_nusc_gendered.describe()

Unnamed: 0,count,percentage
count,30.0,30.0
mean,133.633,0.007
std,212.657,0.011
min,0.0,0.0
25%,3.75,0.0
50%,47.5,0.002
75%,182.25,0.01
max,975.0,0.051


In [28]:
total_tokens_uoe = 542635
perc_uoe = (df_uoe_gendered["count"]/total_tokens_uoe)*100
df_uoe_gendered.insert(len(df_uoe_gendered.columns), "percentage", perc_uoe)
df_uoe_gendered

word,count,percentage
he,0,0.0
niece,0,0.0
his,0,0.0
hers,0,0.0
granddaughter,0,0.0
him,0,0.0
her,0,0.0
she,0,0.0
aunt,2,0.0
grandmother,4,0.001


In [29]:
df_uoe_gendered.describe()

Unnamed: 0,count,percentage
count,30.0,30.0
mean,77.5,0.014
std,146.203,0.027
min,0.0,0.0
25%,0.5,0.0
50%,23.0,0.004
75%,63.75,0.012
max,656.0,0.121


Grammatically and lexically gendered terms make up a small percentage of all tokens in both datasets, but the percentages are larger for Edinburgh's descriptions than for Newcastle's (a mean of 0.014% per term vs. 0.007%), even though Newcastle's raw counts are higher.
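To see the combined share of these terms rather than the per-term figures, the summed counts can be recovered from the `describe()` output, since mean × number of terms equals the total. This is a rough sketch: the means are rounded to three decimals, so the totals are approximate.

```python
# Values copied from the describe() outputs and token totals above.
n_terms = 30
mean_count = {"nusc": 133.633, "uoe": 77.5}
total_tokens = {"nusc": 1915370, "uoe": 542635}

share = {}
for name in mean_count:
    # mean * number of terms recovers the summed count (approximate: the mean is rounded).
    gendered_total = round(mean_count[name] * n_terms)
    share[name] = gendered_total / total_tokens[name] * 100
    print(f"{name}: {gendered_total} gendered tokens, {share[name]:.3f}% of all tokens")
```

On these figures the lowercased gendered terms account for roughly 0.2% of Newcastle's tokens and roughly 0.4% of Edinburgh's.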

In [30]:
i = 2 # capitalized grammatically/lexically gendered tokens

In [31]:
df_nusc_gendered = pd.read_csv(dir_nusc + files_nusc[i], index_col=0)
print(df_nusc_gendered.shape)

(25, 1)


In [32]:
df_uoe_gendered = pd.read_csv(dir_uoe + files_uoe[i], index_col=0)
print(df_uoe_gendered.shape)

(25, 1)


In [33]:
total_tokens_nusc = 1915370
perc_nusc = (df_nusc_gendered["count"]/total_tokens_nusc)*100
df_nusc_gendered.insert(len(df_nusc_gendered.columns), "percentage", perc_nusc)
df_nusc_gendered

word,count,percentage
Viscountess,1,0.0
gentlemen,2,0.0
gentleman,4,0.0
ladies,5,0.0
Duchess,5,0.0
Ms,5,0.0
Baroness,6,0.0
Countess,6,0.0
Count,8,0.0
Viscount,11,0.001


In [34]:
df_nusc_gendered.describe()

Unnamed: 0,count,percentage
count,25.0,25.0
mean,158.88,0.008
std,435.808,0.023
min,1.0,0.0
25%,6.0,0.0
50%,17.0,0.001
75%,127.0,0.007
max,2161.0,0.113


In [35]:
total_tokens_uoe = 542635
perc_uoe = (df_uoe_gendered["count"]/total_tokens_uoe)*100
df_uoe_gendered.insert(len(df_uoe_gendered.columns), "percentage", perc_uoe)
df_uoe_gendered

word,count,percentage
Viscountess,0,0.0
Baroness,1,0.0
gentleman,1,0.0
gentlemen,1,0.0
Ms,2,0.0
lady,2,0.0
ladies,3,0.001
Messrs,4,0.001
Dame,5,0.001
Count,6,0.001


In [36]:
df_uoe_gendered.describe()

Unnamed: 0,count,percentage
count,25.0,25.0
mean,63.76,0.012
std,94.665,0.017
min,0.0,0.0
25%,3.0,0.001
50%,15.0,0.003
75%,99.0,0.018
max,406.0,0.075


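As a percentage of all tokens, there's greater variation in Newcastle's use of gendered terms in the second (capitalized) set (standard deviation of 0.023 vs. 0.017), while Edinburgh's use varies more in the first (lowercased) set (0.027 vs. 0.011, more than double Newcastle's).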

## Parts of Speech

In [None]:
descriptor_files = [8, 9] # combined adjective/adverb counts and summary stats, respectively

In [50]:
df_nusc_descriptor_stats = pd.read_csv(dir_nusc + files_nusc[descriptor_files[1]], index_col=0)
df_nusc_descriptor_stats

Unnamed: 0,adj_by_desc,adv_by_desc
mean,4.515,1.34
median,1.0,0.0
minimum,0.0,0.0
maximum,101.0,45.0
total,127705.0,37915.0


In [57]:
print("Total adjectives as a percentage of all NUSC tokens:", str((df_nusc_descriptor_stats["adj_by_desc"]["total"]/total_tokens_nusc)*100)+"%")
print("Total adverbs as a percentage of all NUSC tokens:", str((df_nusc_descriptor_stats["adv_by_desc"]["total"]/total_tokens_nusc)*100)+"%")

Total adjectives as a percentage of all NUSC tokens: 6.66738019286091%
Total adverbs as a percentage of all NUSC tokens: 1.97951309668628%


In [51]:
df_uoe_descriptor_stats = pd.read_csv(dir_uoe + files_uoe[descriptor_files[1]], index_col=0)
df_uoe_descriptor_stats

Unnamed: 0,adj_by_desc,adv_by_desc
mean,2.625,0.467
median,2.0,0.0
minimum,0.0,0.0
maximum,1290.0,252.0
total,32432.0,5763.0


In [58]:
print("Total adjectives as a percentage of all UoE tokens:", str((df_uoe_descriptor_stats["adj_by_desc"]["total"]/total_tokens_uoe)*100)+"%")
print("Total adverbs as a percentage of all UoE tokens:", str((df_uoe_descriptor_stats["adv_by_desc"]["total"]/total_tokens_uoe)*100)+"%")

Total adjectives as a percentage of all UoE tokens: 5.976761543210445%
Total adverbs as a percentage of all UoE tokens: 1.0620398610484028%


The average number of adjectives and adverbs per description is higher for Newcastle than for Edinburgh (4.515 vs. 2.625 adjectives, about 1.7x; 1.34 vs. 0.467 adverbs, nearly 3x); however, the median number of adjectives in the Edinburgh dataset is double that of Newcastle (2 vs. 1).  Most noticeable are the differences in the *maximum* number of adjectives or adverbs that appear in a description: Edinburgh's is far larger than Newcastle's, with the maximum adjective count over 10x as large and the maximum adverb count over 5x as large.

As a percentage of all tokens per dataset, the difference in adjectives is small, with Newcastle's data having a slightly larger percentage (6.67% vs. 5.98%); for adverbs, Newcastle's percentage is nearly double Edinburgh's (1.98% vs. 1.06%).
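The ratios behind these comparisons can be checked directly from the two summary tables. The sketch below only restates that arithmetic; the dictionary names are arbitrary:

```python
# Per-description statistics copied from the two tables above.
nusc = {"adj_mean": 4.515, "adv_mean": 1.34, "adj_max": 101, "adv_max": 45}
uoe = {"adj_mean": 2.625, "adv_mean": 0.467, "adj_max": 1290, "adv_max": 252}

# Newcastle-to-Edinburgh ratio for the means, Edinburgh-to-Newcastle for the maxima.
mean_ratios = {k: nusc[k] / uoe[k] for k in ("adj_mean", "adv_mean")}
max_ratios = {k: uoe[k] / nusc[k] for k in ("adj_max", "adv_max")}

for key, value in {**mean_ratios, **max_ratios}.items():
    print(f"{key}: {value:.2f}x")
```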

### Adjectives

In [59]:
adjective_files = [4, 5]  # the second is with lowercased tokens

In [81]:
df_adj_nusc = pd.read_csv(dir_nusc + files_nusc[adjective_files[1]], index_col=0)
df_adj_nusc.insert(1, "percentage_of_all_tokens", (df_adj_nusc["count"]/total_tokens_nusc)*100)
print(df_adj_nusc.shape)
df_adj_nusc.head(30)

(6122, 2)


token_lower,count,percentage_of_all_tokens
mid-twentieth,5386,0.281
key,3686,0.192
historic,3650,0.191
influential,3610,0.188
notable,3099,0.162
significant,2602,0.136
academic,2356,0.123
successful,2349,0.123
polar,2312,0.121
public,2144,0.112


In [106]:
df_adj_nusc.describe()

Unnamed: 0,count,percentage_of_all_tokens
count,6122.0,6122.0
mean,20.864,0.001
std,178.966,0.009
min,1.0,0.0
25%,1.0,0.0
50%,1.0,0.0
75%,3.0,0.0
max,5386.0,0.281


In [82]:
df_adj_uoe = pd.read_csv(dir_uoe + files_uoe[adjective_files[1]], index_col=0)
df_adj_uoe.insert(1, "percentage_of_all_tokens", (df_adj_uoe["count"]/total_tokens_uoe)*100)
print(df_adj_uoe.shape)
df_adj_uoe.head(30)

(6889, 2)


token_lower,count,percentage_of_all_tokens
20th,2143,0.395
early,1569,0.289
various,478,0.088
late,468,0.086
scottish,415,0.076
other,384,0.071
british,367,0.068
first,307,0.057
early/mid,264,0.049
next,253,0.047


In [105]:
df_adj_uoe.describe()

Unnamed: 0,count,percentage_of_all_tokens
count,6889.0,6889.0
mean,4.708,0.001
std,36.273,0.007
min,1.0,0.0
25%,1.0,0.0
50%,1.0,0.0
75%,2.0,0.0
max,2143.0,0.395


Interesting to see that Newcastle's top adjectives include words that seem more likely to communicate a judgment (e.g., "notable," "significant," "key," "influential") than Edinburgh's top adjectives!

In [95]:
common_adjs = df_adj_nusc.merge(df_adj_uoe, on="token_lower", how="inner", sort=True, suffixes=("_nusc", "_uoe"))
common_adjs.head()

token_lower,count_nusc,percentage_of_all_tokens_nusc,count_uoe,percentage_of_all_tokens_uoe
1-10,4,0.0,3,0.001
1-2,7,0.0,3,0.001
1-22,1,0.0,2,0.0
1-3,14,0.001,4,0.001
1-4,15,0.001,3,0.001


Ignore numerical adjectives.

In [96]:
common_adjs = common_adjs.reset_index()
common_adjs_alpha = common_adjs.loc[common_adjs["token_lower"].str.isalpha()]
print(common_adjs.shape, common_adjs_alpha.shape)

(1651, 5) (1418, 5)
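One thing to keep in mind with `str.isalpha()` is that it rejects any token containing a non-letter, so hyphenated or slash-joined adjectives are filtered out along with the numeric ones. The sketch below illustrates this on tokens that appear in the adjective tables of this notebook; the regular expression is a hypothetical alternative, not what the notebook uses:

```python
import re

# Tokens taken from the adjective tables in this notebook.
tokens = ["key", "mid-twentieth", "widely-read", "early/mid", "20th", "1-10"]

# str.isalpha() is True only when every character is a letter.
alpha_only = [t for t in tokens if t.isalpha()]

# Hypothetical alternative: runs of lowercase letters joined by single hyphens or slashes.
WORD_LIKE = re.compile(r"[a-z]+(?:[-/][a-z]+)*")
word_like = [t for t in tokens if WORD_LIKE.fullmatch(t)]

print(alpha_only)
print(word_like)
```

Whether compound adjectives should count is a judgment call; the point is only that the current filter excludes them.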


Look at the top adjectives for Newcastle...

In [107]:
common_adjs_alpha = common_adjs_alpha.sort_values(by="count_nusc", ascending=False)
common_adjs_alpha.head(30)

Unnamed: 0,token_lower,count_nusc,percentage_of_all_tokens_nusc,count_uoe,percentage_of_all_tokens_uoe
855,key,3686,0.192,22,0.004
757,historic,3650,0.191,1,0.0
812,influential,3610,0.188,4,0.001
1041,notable,3099,0.162,10,0.002
1364,significant,2602,0.136,5,0.001
171,academic,2356,0.123,40,0.007
1436,successful,2349,0.123,25,0.005
1154,polar,2312,0.121,2,0.0
1223,public,2144,0.112,52,0.01
1437,such,2010,0.105,97,0.018


...and for Edinburgh.

In [110]:
common_adjs_alpha = common_adjs_alpha.sort_values(by="count_uoe", ascending=False)
common_adjs_alpha.head(30)

Unnamed: 0,token_lower,count_nusc,percentage_of_all_tokens_nusc,count_uoe,percentage_of_all_tokens_uoe
530,early,976,0.051,1569,0.289
1589,various,777,0.041,478,0.088
872,late,61,0.003,468,0.086
1321,scottish,65,0.003,415,0.076
1090,other,1787,0.093,384,0.071
308,british,1943,0.101,367,0.068
640,first,1311,0.068,307,0.057
1028,next,82,0.004,253,0.047
513,domestic,106,0.006,220,0.041
1346,several,1933,0.101,213,0.039


Export the DataFrame as a CSV file.

In [111]:
common_adjs_alpha.to_csv(comparison_dir+"common_adjs_sorted_nusc_uoe.csv")

Look at how many of Newcastle's and Edinburgh's top 50 most common adjectives overlap.

In [142]:
top = 50

In [143]:
df_adj_nusc_top = df_adj_nusc.head(top)
df_adj_uoe_top = df_adj_uoe.head(top)
df_adj_top = df_adj_nusc_top.merge(df_adj_uoe_top, on="token_lower", how="outer", sort=True, suffixes=("_nusc", "_uoe")).reset_index()
df_adj_top = df_adj_top.sort_values(by="token_lower")
df_adj_top

Unnamed: 0,token_lower,count_nusc,percentage_of_all_tokens_nusc,count_uoe,percentage_of_all_tokens_uoe
0,"""'new""",1801.000,0.094,,
1,1955-1958,770.000,0.040,,
2,19th,,,76.000,0.014
3,20th,,,2143.000,0.395
4,academic,2356.000,0.123,,
...,...,...,...,...,...
85,various,777.000,0.041,478.000,0.088
86,white,,,79.000,0.015
87,wide,790.000,0.041,,
88,widely-read,1795.000,0.094,,


In [145]:
df_adj_top_common = df_adj_top.loc[df_adj_top["count_nusc"].notna()]
df_adj_top_common = df_adj_top_common.loc[df_adj_top_common["count_uoe"].notna()]
print(f"Overlapping tokens in top {top} adjectives:", df_adj_top_common.shape[0])
df_adj_top_common

Overlapping tokens in top 50 adjectives: 10


Unnamed: 0,token_lower,count_nusc,percentage_of_all_tokens_nusc,count_uoe,percentage_of_all_tokens_uoe
12,british,1943.0,0.101,367.0,0.068
17,early,976.0,0.051,1569.0,0.289
20,first,1311.0,0.068,307.0,0.057
36,large,1200.0,0.063,149.0,0.027
42,many,1794.0,0.094,160.0,0.029
56,other,1787.0,0.093,384.0,0.071
71,scientific,816.0,0.043,78.0,0.014
74,several,1933.0,0.101,213.0,0.039
81,such,2010.0,0.105,97.0,0.018
85,various,777.0,0.041,478.0,0.088
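An alternative to the two `notna()` filters is the `indicator` parameter of `DataFrame.merge`, which adds a `_merge` column recording whether each row came from the left frame, the right frame, or both. A small sketch on toy frames, using a handful of counts from the tables above rather than the full top-50 lists:

```python
import pandas as pd

# Toy frames using a few counts from the tables above.
nusc = pd.DataFrame({"token_lower": ["key", "british", "such"], "count": [3686, 1943, 2010]})
uoe = pd.DataFrame({"token_lower": ["early", "british", "such"], "count": [1569, 367, 97]})

merged = nusc.merge(
    uoe, on="token_lower", how="outer", suffixes=("_nusc", "_uoe"), indicator=True
)

# `_merge` is "both" for tokens present in both frames.
overlap = sorted(merged.loc[merged["_merge"] == "both", "token_lower"])
print(overlap)
```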


In [146]:
df_adj_top = df_adj_top.fillna("NA") # replace NaN with the string 'NA' (for "Not Applicable")

Save the DataFrame of the most common adjectives from both datasets.

In [147]:
df_adj_top.to_csv(comparison_dir+f"top{top}_adjs_nusc_uoe.csv")

### Adverbs

In [112]:
adverb_files = [6, 7]     # the second is with lowercased tokens

In [116]:
df_adv_nusc = pd.read_csv(dir_nusc + files_nusc[adverb_files[1]], index_col=0)
df_adv_nusc.insert(1, "percentage_of_all_tokens", (df_adv_nusc["count"]/total_tokens_nusc)*100)
print(df_adv_nusc.shape)
df_adv_nusc.head(30)

(622, 2)


token_lower,count,percentage_of_all_tokens
also,6188,0.323
most,2661,0.139
more,1884,0.098
particularly,1868,0.098
primarily,1838,0.096
ever,1810,0.094
earlier,1804,0.094
later,1801,0.094
furthermore,1795,0.094
hugely,1795,0.094


In [117]:
df_adv_uoe = pd.read_csv(dir_uoe + files_uoe[adverb_files[1]], index_col=0)
df_adv_uoe.insert(1, "percentage_of_all_tokens", (df_adv_uoe["count"]/total_tokens_uoe)*100)
print(df_adv_uoe.shape)
df_adv_uoe.head(30)

(599, 2)


token_lower,count,percentage_of_all_tokens
also,799,0.147
not,425,0.078
early,406,0.075
then,306,0.056
as,146,0.027
well,143,0.026
together,130,0.024
later,123,0.023
first,114,0.021
possibly,112,0.021


There's lots of overlap in Edinburgh's and Newcastle's top adverbs.  They seem to be used to communicate a time estimate (e.g., "earlier," "later," "then") or an uncertainty or likelihood (e.g., "possibly").

In [118]:
common_advs = df_adv_nusc.merge(df_adv_uoe, on="token_lower", how="inner", sort=True, suffixes=("_nusc", "_uoe"))
common_advs.head()

token_lower,count_nusc,percentage_of_all_tokens_nusc,count_uoe,percentage_of_all_tokens_uoe
about,9,0.0,6,0.001
abroad,26,0.001,6,0.001
absolutely,1,0.0,1,0.0
accordingly,1,0.0,3,0.001
actively,9,0.0,2,0.0


Look at the common adverbs by Newcastle's most common...

In [119]:
common_advs = common_advs.sort_values(by="count_nusc", ascending=False)
common_advs.head(30)

token_lower,count_nusc,percentage_of_all_tokens_nusc,count_uoe,percentage_of_all_tokens_uoe
also,6188,0.323,799,0.147
most,2661,0.139,50,0.009
more,1884,0.098,46,0.008
particularly,1868,0.098,42,0.008
primarily,1838,0.096,12,0.002
ever,1810,0.094,14,0.003
earlier,1804,0.094,24,0.004
later,1801,0.094,123,0.023
furthermore,1795,0.094,1,0.0
hugely,1795,0.094,1,0.0


...and by Edinburgh's most common.

In [120]:
common_advs = common_advs.sort_values(by="count_uoe", ascending=False)
common_advs.head(30)

token_lower,count_nusc,percentage_of_all_tokens_nusc,count_uoe,percentage_of_all_tokens_uoe
also,6188,0.323,799,0.147
not,847,0.044,425,0.078
early,30,0.002,406,0.075
then,164,0.009,306,0.056
as,1303,0.068,146,0.027
well,1280,0.067,143,0.026
together,158,0.008,130,0.024
later,1801,0.094,123,0.023
first,157,0.008,114,0.021
possibly,123,0.006,112,0.021
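Because the datasets differ in size, raw counts are hard to compare side by side. One option, sketched here as an assumption rather than as part of the original analysis, is to divide each adverb's relative frequency in Newcastle's data by its relative frequency in Edinburgh's; values far from 1 mark adverbs that are much more typical of one dataset. The helper `rate_ratio` is hypothetical and the counts are copied from the tables above:

```python
# Token totals and a few adverb counts copied from the tables above.
TOTAL_NUSC, TOTAL_UOE = 1915370, 542635
counts = {  # token: (count_nusc, count_uoe)
    "also": (6188, 799),
    "early": (30, 406),
    "furthermore": (1795, 1),
    "possibly": (123, 112),
}


def rate_ratio(count_nusc, count_uoe):
    """Relative frequency in NUSC divided by relative frequency in UoE."""
    return (count_nusc / TOTAL_NUSC) / (count_uoe / TOTAL_UOE)


ratios = {token: rate_ratio(*pair) for token, pair in counts.items()}
for token, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{token}: {ratio:.2f}")
```

A ratio like this is undefined when the Edinburgh count is zero, so it only applies to the adverbs the two datasets share.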


## COME BACK TO: run analysis and comparison code on sample NUSC data that was manually reviewed!