# Review Aggregation of Annotated Data

* Check for file mismatches
* Check description IDs: they should be based on every description from every file uploaded to the brat rapid annotation tool, not only on descriptions that contain annotations.

In [1]:
import pandas as pd
import numpy as np

In [8]:
doc_clf_df = pd.read_csv("../doc_clf_data/desc_field_descid_label_eadid_exploded.csv")
doc_clf_df = doc_clf_df.drop(columns=["Unnamed: 0"])
print(doc_clf_df.shape)
doc_clf_df.head()

(1605, 8)


Unnamed: 0,eadid,file,desc_id,field,description,label,desc_start_offset,desc_end_offset
0,Coll-1320,Coll-1320_00400.txt,3247,Title,'Effect of an inhibitor of 3ß-hydroxysteroid d...,Masculine,5040,5281
1,Coll-1320,Coll-1320_00400.txt,3247,Title,'Effect of an inhibitor of 3ß-hydroxysteroid d...,Unknown,5040,5281
2,Coll-146,Coll-146_28000.txt,11317,Scope and Contents,"3 photographs : negative, col.Sent from: [Cap ...",Unknown,4731,4810
3,Coll-146,Coll-146_20500.txt,9233,Scope and Contents,"4 photographs : negative, col.. 1 stripSent fr...",Unknown,4387,4592
4,Coll-1130,Coll-1130_00100.txt,1465,Biographical / Historical,"A collection of copied letters, mainly from th...",Unknown,1262,1434


In [169]:
final_df = pd.read_csv("aggregated_final.csv", index_col=0)
final_df.head()

Unnamed: 0,agg_ann_id,file,offsets,text,label,category,associated_genders
12,0,Coll-1157_00100.ann,"(1407, 1415)",knighted,Gendered-Role,Linguistic,Unclear
22,1,Coll-1310_02300.ann,"(9625, 9635)",knighthood,Gendered-Role,Linguistic,Unclear
23,2,Coll-1281_00100.ann,"(2426, 2439)",Prince Regent,Gendered-Role,Linguistic,Unclear
24,3,Coll-1310_02700.ann,"(9993, 10003)",knighthood,Gendered-Role,Linguistic,Unclear
25,4,Coll-1310_02900.ann,"(7192, 7195)",Sir,Gendered-Role,Linguistic,Unclear


Check for duplicate annotations:

In [170]:
final_subdf = final_df.drop_duplicates()
assert final_subdf.shape[0] == final_df.shape[0], "Every row of annotation data should be unique"

Great!  There are no duplicates in the final aggregated data, `aggregated_final.csv`.
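If the assertion above ever fails, it helps to see the offending rows rather than just a row count. A minimal sketch, using a hypothetical toy frame (`toy`) in place of `final_df`:

```python
import pandas as pd

# Hypothetical toy data standing in for `final_df`
toy = pd.DataFrame({
    "file": ["a.ann", "a.ann", "b.ann"],
    "offsets": ["(1, 5)", "(1, 5)", "(2, 9)"],
    "label": ["Gendered-Role", "Gendered-Role", "Occupation"],
})

# keep=False flags every member of a duplicate group, not only the repeats
dupes = toy[toy.duplicated(keep=False)]
n_dupes = len(dupes)
print(dupes)
```

With `keep=False`, both copies of a duplicated row are returned, which makes it easier to decide which one to drop.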

***

Next review the description data files:

In [4]:
desc_to_review = ["../doc_clf_data/desc_field_descid_label_eadid.csv", "../data/analysis_data/descriptions_with_counts.csv"]

In [3]:
df = pd.read_csv(desc_to_review[0])
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

Unnamed: 0,eadid,file,desc_id,field,description,label,desc_start_offset,desc_end_offset
0,Coll-1320,Coll-1320_00400.txt,3247,Title,'Effect of an inhibitor of 3ß-hydroxysteroid d...,"{'Masculine', 'Unknown'}",5040.0,5281.0
1,Coll-146,Coll-146_28000.txt,11317,Scope and Contents,"3 photographs : negative, col.Sent from: [Cap ...",{'Unknown'},4731.0,4810.0
2,Coll-146,Coll-146_20500.txt,9233,Scope and Contents,"4 photographs : negative, col.. 1 stripSent fr...",{'Unknown'},4387.0,4592.0
3,Coll-1130,Coll-1130_00100.txt,1465,Biographical / Historical,"A collection of copied letters, mainly from th...",{'Unknown'},1262.0,1434.0
4,Coll-1143,Coll-1143_00100.txt,1493,Biographical / Historical,Alexander Herbert Main studied Law at Edinburg...,"{'Gendered-Pronoun', 'Stereotype', 'Masculine'...",1170.0,1559.0


In [181]:
annot_descids = list(df.desc_id)
annot_descs = list(df.description)
annot_descs = [desc.strip() for desc in annot_descs]
df["description"] = annot_descs
annot_descs_ids = dict(zip(annot_descids,annot_descs))

In [6]:
# descs = pd.read_csv("../crc_metadata/all_descriptions.csv", index_col=0)
# print(descs.shape)
# descs.head()

In [175]:
all_descids = list(descs.desc_id)
all_descs = list(descs.description)
all_descs = [desc.strip() for desc in all_descs]
descs["description"] = all_descs
all_descs_ids = dict(zip(all_descids,all_descs))

In [176]:
only_annot = [desc for desc in annot_descs if desc not in all_descs]
print(len(only_annot))

0


In [177]:
only_annotids = [did for did in annot_descs_ids if did not in all_descs_ids]
print(len(only_annotids))

0


There was a mismatch between the description IDs. It is now fixed, so both checks above report 0.
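The membership checks above can also be written as a set difference, which avoids scanning a list once per ID. A small sketch with hypothetical ID values (the variable names here are illustrative and not taken from the notebook):

```python
# Hypothetical IDs: one annotated ID (70381) is absent from the complete list
annot_ids = {3247, 11317, 9233, 70381}
all_ids = {3247, 11317, 9233, 1465}

# IDs that appear in the annotated data but not in the complete description list
only_in_annot = annot_ids - all_ids
print(sorted(only_in_annot))
```

An empty result means every annotated description ID is also present in the complete list.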

In [69]:
subdf = df.loc[df.desc_id.isin(only_annotids)]
assert subdf.shape[0] == len(only_annotids)
subdf.head()

Unnamed: 0,description,field,label,eadid,desc_id,file,desc_start_offset,desc_end_offset
0,John Baillie: posthumous,Title,{'Unknown'},BAI,70381,BAI_01000.txt,1290,1315
1,"Letters received from Henry Sloane Coffin, wit...",Scope and Contents,"{'Masculine', 'Unknown'}",BAI,47675,BAI_01300.txt,5853,5983
2,Family photographs consist of:photographs of f...,Scope and Contents,"{'Masculine', 'Unknown', 'Feminine'}",BAI,81505,BAI_01600.txt,5967,6202
3,"Correspondence and related items, including le...",Scope and Contents,{'Unknown'},BAI,33009,BAI_01900.txt,5297,5506
4,From 1927-1930 John Baillie was Professor of S...,Biographical / Historical,"{'Gendered-Pronoun', 'Unknown', 'Masculine', '...",BAI,43372,BAI_02200.txt,15180,15419


Compare `subdf`'s `description`, `field`, and `eadid` columns to `descs` to see whether they're all present in `descs`:

In [70]:
subdf = subdf.set_index(["description","field","eadid"])
subdf.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,label,desc_id,file,desc_start_offset,desc_end_offset
description,field,eadid,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
John Baillie: posthumous,Title,BAI,{'Unknown'},70381,BAI_01000.txt,1290,1315
"Letters received from Henry Sloane Coffin, with an enclosed letter from Hugh Martin and copies of John Baillie's reply to Coffin.",Scope and Contents,BAI,"{'Masculine', 'Unknown'}",47675,BAI_01300.txt,5853,5983
"Family photographs consist of:photographs of family members (John Baillie, snr., John Baillie, Florence Jewel Baillie and Ian Fowler Baillie)photographs and postcards of family-related locations (Gairloch, Edinburgh, Cupar and Bervie)",Scope and Contents,BAI,"{'Masculine', 'Unknown', 'Feminine'}",81505,BAI_01600.txt,5967,6202
"Correspondence and related items, including letters from Mona Anderson, Thomas Stearns Elliot and Hans-Heinrich Harms. Includes material relating to the World Council of Churches and other ecumenical matters.",Scope and Contents,BAI,{'Unknown'},33009,BAI_01900.txt,5297,5506
"From 1927-1930 John Baillie was Professor of Systematic Theology at Emmanuel College, University of Toronto. During this time his correspondants included his brother Donald Macpherson Baillie, Hugh Ross Mackintosh and Henry Sloane Coffin.",Biographical / Historical,BAI,"{'Gendered-Pronoun', 'Unknown', 'Masculine', '...",43372,BAI_02200.txt,15180,15419


In [71]:
joined = subdf.join(descs.set_index(["description", "field", "eadid"]), how="outer", lsuffix="subdf", rsuffix="descs")
joined.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,label,desc_idsubdf,file,desc_start_offset,desc_end_offset,desc_iddescs
description,field,eadid,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
,Title,Coll-1310,,,,,,11892
"""All That Was Left of Them"": The Thirteen Sole Survivors of the Second Grenadier Guards",Title,Coll-1434,,,,,,4733
"""Baron o' Buchlyvie"" (11410)",Title,Coll-1434,,,,,,4311
"""Baron's Pride""",Title,Coll-1434,,,,,,4569
"""Duke of Northumberland"" (1940)",Title,Coll-1434,,,,,,4603


In [72]:
joined.shape

(15887, 6)

In [85]:
joined_notnull = joined.loc[~joined.desc_idsubdf.isnull()]
print(joined_notnull.shape)
joined_null = joined.loc[joined.desc_idsubdf.isnull()]
print(joined_null.shape)

(4300, 6)
(11587, 6)
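The same split can be read from a merge instead of null checks. As a sketch (not what the notebook does), `DataFrame.merge` with `indicator=True` adds a `_merge` column that says which side each row came from. The two toy frames below are hypothetical:

```python
import pandas as pd

# Hypothetical stand-ins for `subdf` (annotated) and `descs` (complete)
annotated = pd.DataFrame({
    "description": ["desc A", "desc B"],
    "field": ["Title", "Title"],
    "eadid": ["BAI", "BAI"],
    "desc_id": [70381, 47675],
})
complete = pd.DataFrame({
    "description": ["desc A", "desc B", "desc C"],
    "field": ["Title", "Title", "Title"],
    "eadid": ["BAI", "BAI", "Coll-1"],
    "desc_id": [10, 11, 12],
})

merged = annotated.merge(
    complete,
    on=["description", "field", "eadid"],
    how="outer",
    suffixes=("_annot", "_all"),
    indicator=True,  # adds a `_merge` column: left_only / right_only / both
)

n_both = int((merged["_merge"] == "both").sum())
n_right_only = int((merged["_merge"] == "right_only").sum())
n_left_only = int((merged["_merge"] == "left_only").sum())
print(n_both, n_right_only, n_left_only)
```

Any `left_only` rows would be annotated descriptions with no match in the complete list.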


In [86]:
joined_notnull.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,label,desc_idsubdf,file,desc_start_offset,desc_end_offset,desc_iddescs
description,field,eadid,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
"'Effect of an inhibitor of 3ß-hydroxysteroid dehydrogenase on progesterone concentrations and embryo survival in sheep', C.J. Ashworth, I. Wilmut, A.J. Springbett and R. Webb, reprinted from the Journal of Endocrinology (1987), 112, 205-213",Title,Coll-1320,"{'Masculine', 'Unknown'}",19370.0,Coll-1320_00400.txt,5040.0,5281.0,3247
"3 photographs : negative, col.Sent from: [Cap Ferrat, France?]Koestler, Arthur",Scope and Contents,Coll-146,{'Unknown'},63842.0,Coll-146_28000.txt,4731.0,4810.0,11317
"4 photographs : negative, col.. 1 stripSent from: [Alpbach, Austria]; n.p.SchreiberhauslBuilding activity at the Schreiberhausl - Arthur Koestler driving - Arthur Koestler in a street cafeKoestler, Arthur",Scope and Contents,Coll-146,{'Unknown'},66091.0,Coll-146_20500.txt,4387.0,4592.0,9233
"A collection of copied letters, mainly from the seventeenth and eighteenth centuries, and probably written in the 1820s by G. E. Kinloch, and entitled Collecteana Scotica.",Biographical / Historical,Coll-1130,{'Unknown'},20367.0,Coll-1130_00100.txt,1262.0,1434.0,1465
"Alexander Herbert Main studied Law at Edinburgh University in the 1930s. At the time of his studies he lived in Haddington, East Lothian. During academic year 1933-34, his name appeared on Faculty of Law Class Merit Lists in the subjects of Conveyancing practice (5th equal), and Evidence and Procedure (26th equal). He was awarded his degree of B.L. (Bachelor of Law) on 21 October 1933.",Biographical / Historical,Coll-1143,"{'Gendered-Pronoun', 'Stereotype', 'Masculine'...",48400.0,Coll-1143_00100.txt,1170.0,1559.0,1493


In [87]:
joined_notnull.isnull().values.any()  # False so no nulls in any column - great!

False

Now we want to keep the `desc_id` column from the `descs` DataFrame, since that contains the complete list of descriptions (all those in all files uploaded to the brat rapid annotation tool).

In [88]:
joined_notnull = joined_notnull.drop(columns=["desc_idsubdf"])
joined_notnull = joined_notnull.rename({"desc_iddescs":"desc_id"}, axis=1)
joined_notnull = joined_notnull.reset_index()
joined_notnull.head()

Unnamed: 0,description,field,eadid,label,file,desc_start_offset,desc_end_offset,desc_id
0,'Effect of an inhibitor of 3ß-hydroxysteroid d...,Title,Coll-1320,"{'Masculine', 'Unknown'}",Coll-1320_00400.txt,5040.0,5281.0,3247
1,"3 photographs : negative, col.Sent from: [Cap ...",Scope and Contents,Coll-146,{'Unknown'},Coll-146_28000.txt,4731.0,4810.0,11317
2,"4 photographs : negative, col.. 1 stripSent fr...",Scope and Contents,Coll-146,{'Unknown'},Coll-146_20500.txt,4387.0,4592.0,9233
3,"A collection of copied letters, mainly from th...",Biographical / Historical,Coll-1130,{'Unknown'},Coll-1130_00100.txt,1262.0,1434.0,1465
4,Alexander Herbert Main studied Law at Edinburg...,Biographical / Historical,Coll-1143,"{'Gendered-Pronoun', 'Stereotype', 'Masculine'...",Coll-1143_00100.txt,1170.0,1559.0,1493


Add the remaining `df` rows (those whose description IDs were already in the `descs` DataFrame):

In [91]:
subdf2 = df.loc[~df.desc_id.isin(only_annotids)]
print(subdf2.shape)

(205, 8)


In [92]:
subdf2 = subdf2.set_index(["description","field","eadid","desc_id"])
subdf2.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,label,file,desc_start_offset,desc_end_offset
description,field,eadid,desc_id,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Archivist's NoteNone Grant Buttars 25 February 2003,Processing Information,BAI,4172,"{'Occupation', 'Unknown'}",BAI_02500.txt,1224,1277
"Contains:Poultry Research Centre reports and associated papers (1942-1964);correspondence (1934-1984);Poultry Research Centre Visitors Book (1947-1964);Visitors Book listing 'visitors entertained at the B.E.C.C Environment Unit, Bush Estate' (1963-1968);assortment of newspaper clippings, articles, correspondence, photographs and papers relating to events (1927-1957);typescript of Alan Greenwood's memorandum 'The Poultry Research Centre of the Agricultural Research Council 1947-1962: A Director's Story' (1968-1985).",Scope and Contents,Coll-1057,6697,"{'Occupation', 'Omission', 'Unknown'}",Coll-1057_00700.txt,11621,12142
"33 (37), Thomson to Ledermann",Title,Coll-1064,7122,"{'Omission', 'Unknown'}",Coll-1064_00100.txt,1259,1289
Robert Forester wrote the title page to his book of music in July 1818.,Biographical / Historical,Coll-1069,1566,"{'Gendered-Pronoun', 'Masculine'}",Coll-1069_00100.txt,207,279
"The photographs were the work of publisher and photographer Gordon Wright. He had worked on photographs and lay-out for the nationalist literary magazineCatalystand his first publication was a pamphlet by the Scottish poet, Willie Neill. He was also responsible forFour points of a Saltire, a book of poems by Sorley Maclean, George Campbell Hay, William Neill and Stuart McGregor, and a book of Liz Lochhead's poemsMemo for Spring. Wright was said to own the largest collection of photographs of Hugh MacDiarmid.",Biographical / Historical,Coll-1070,6493,"{'Masculine', 'Occupation', 'Gendered-Pronoun'...",Coll-1070_00100.txt,461,975


In [96]:
joined2 = subdf2.join(descs.set_index(["description", "field", "eadid", "desc_id"]), how="left", lsuffix="subdf2", rsuffix="descs")
assert joined2.shape[0] == subdf2.shape[0]
joined2.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,label,file,desc_start_offset,desc_end_offset
description,field,eadid,desc_id,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Archivist's NoteNone Grant Buttars 25 February 2003,Processing Information,BAI,4172,"{'Occupation', 'Unknown'}",BAI_02500.txt,1224,1277
"Contains:Poultry Research Centre reports and associated papers (1942-1964);correspondence (1934-1984);Poultry Research Centre Visitors Book (1947-1964);Visitors Book listing 'visitors entertained at the B.E.C.C Environment Unit, Bush Estate' (1963-1968);assortment of newspaper clippings, articles, correspondence, photographs and papers relating to events (1927-1957);typescript of Alan Greenwood's memorandum 'The Poultry Research Centre of the Agricultural Research Council 1947-1962: A Director's Story' (1968-1985).",Scope and Contents,Coll-1057,6697,"{'Occupation', 'Omission', 'Unknown'}",Coll-1057_00700.txt,11621,12142
"33 (37), Thomson to Ledermann",Title,Coll-1064,7122,"{'Omission', 'Unknown'}",Coll-1064_00100.txt,1259,1289
Robert Forester wrote the title page to his book of music in July 1818.,Biographical / Historical,Coll-1069,1566,"{'Gendered-Pronoun', 'Masculine'}",Coll-1069_00100.txt,207,279
"The photographs were the work of publisher and photographer Gordon Wright. He had worked on photographs and lay-out for the nationalist literary magazineCatalystand his first publication was a pamphlet by the Scottish poet, Willie Neill. He was also responsible forFour points of a Saltire, a book of poems by Sorley Maclean, George Campbell Hay, William Neill and Stuart McGregor, and a book of Liz Lochhead's poemsMemo for Spring. Wright was said to own the largest collection of photographs of Hugh MacDiarmid.",Biographical / Historical,Coll-1070,6493,"{'Masculine', 'Occupation', 'Gendered-Pronoun'...",Coll-1070_00100.txt,461,975


In [97]:
joined2.isnull().values.any()  # Great!

False

In [99]:
joined2 = joined2.reset_index()
joined2.head()

Unnamed: 0,description,field,eadid,desc_id,label,file,desc_start_offset,desc_end_offset
0,Archivist's NoteNone Grant Buttars 25 Februar...,Processing Information,BAI,4172,"{'Occupation', 'Unknown'}",BAI_02500.txt,1224,1277
1,Contains:Poultry Research Centre reports and a...,Scope and Contents,Coll-1057,6697,"{'Occupation', 'Omission', 'Unknown'}",Coll-1057_00700.txt,11621,12142
2,"33 (37), Thomson to Ledermann",Title,Coll-1064,7122,"{'Omission', 'Unknown'}",Coll-1064_00100.txt,1259,1289
3,Robert Forester wrote the title page to his bo...,Biographical / Historical,Coll-1069,1566,"{'Gendered-Pronoun', 'Masculine'}",Coll-1069_00100.txt,207,279
4,The photographs were the work of publisher and...,Biographical / Historical,Coll-1070,6493,"{'Masculine', 'Occupation', 'Gendered-Pronoun'...",Coll-1070_00100.txt,461,975


In [106]:
new_df = pd.concat([joined_notnull,joined2], axis=0, ignore_index=True, sort=True)
assert not new_df.isnull().values.any()

In [105]:
# Reorder the columns
new_df = new_df[["eadid","file","desc_id","field","description","label","desc_start_offset","desc_end_offset"]]
new_df.head()

Unnamed: 0,eadid,file,desc_id,field,description,label,desc_start_offset,desc_end_offset
0,Coll-1320,Coll-1320_00400.txt,3247,Title,'Effect of an inhibitor of 3ß-hydroxysteroid d...,"{'Masculine', 'Unknown'}",5040.0,5281.0
1,Coll-146,Coll-146_28000.txt,11317,Scope and Contents,"3 photographs : negative, col.Sent from: [Cap ...",{'Unknown'},4731.0,4810.0
2,Coll-146,Coll-146_20500.txt,9233,Scope and Contents,"4 photographs : negative, col.. 1 stripSent fr...",{'Unknown'},4387.0,4592.0
3,Coll-1130,Coll-1130_00100.txt,1465,Biographical / Historical,"A collection of copied letters, mainly from th...",{'Unknown'},1262.0,1434.0
4,Coll-1143,Coll-1143_00100.txt,1493,Biographical / Historical,Alexander Herbert Main studied Law at Edinburg...,"{'Gendered-Pronoun', 'Stereotype', 'Masculine'...",1170.0,1559.0


In [113]:
new_df = new_df.drop_duplicates()
new_df.shape

(983, 8)

In [114]:
df.shape

(1855, 8)

In [115]:
df.head()

Unnamed: 0,description,field,label,eadid,desc_id,file,desc_start_offset,desc_end_offset
0,John Baillie: posthumous,Title,{'Unknown'},BAI,70381,BAI_01000.txt,1290,1315
1,"Letters received from Henry Sloane Coffin, wit...",Scope and Contents,"{'Masculine', 'Unknown'}",BAI,47675,BAI_01300.txt,5853,5983
2,Family photographs consist of:photographs of f...,Scope and Contents,"{'Masculine', 'Unknown', 'Feminine'}",BAI,81505,BAI_01600.txt,5967,6202
3,"Correspondence and related items, including le...",Scope and Contents,{'Unknown'},BAI,33009,BAI_01900.txt,5297,5506
4,From 1927-1930 John Baillie was Professor of S...,Biographical / Historical,"{'Gendered-Pronoun', 'Unknown', 'Masculine', '...",BAI,43372,BAI_02200.txt,15180,15419


In [116]:
# Build a "file field description" key for every row of df
subrows_df = [f"{file} {field} {desc}" for file, field, desc in zip(df.file, df.field, df.description)]
print(subrows_df[0])

BAI_01000.txt Title John Baillie: posthumous


In [117]:
# Build the same "file field description" key for every row of new_df
subrows_newdf = [f"{file} {field} {desc}" for file, field, desc in zip(new_df.file, new_df.field, new_df.description)]
print(subrows_newdf[0])

Coll-1320_00400.txt Title 'Effect of an inhibitor of 3ß-hydroxysteroid dehydrogenase on progesterone concentrations and embryo survival in sheep', C.J. Ashworth, I. Wilmut, A.J. Springbett and R. Webb, reprinted from the Journal of Endocrinology (1987), 112, 205-213


In [118]:
missing_from_new = [s for s in subrows_df if s not in subrows_newdf]
print(len(missing_from_new))

0
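The comparison above joins three columns into one string. An alternative sketch keeps them as tuples in a set, so a space inside a field can never blur the boundary between columns and each lookup is fast. The helper `row_keys` and the toy frames are hypothetical:

```python
import pandas as pd

def row_keys(frame):
    """Return the set of (file, field, description) tuples in a frame."""
    return set(zip(frame["file"], frame["field"], frame["description"]))

# Hypothetical toy frames: `new` holds the same rows as `old`, in another order
old = pd.DataFrame({
    "file": ["f1.txt", "f2.txt"],
    "field": ["Title", "Scope and Contents"],
    "description": ["First", "Second"],
})
new = old.iloc[::-1].reset_index(drop=True)

# Rows present in `old` but absent from `new`
missing = row_keys(old) - row_keys(new)
print(len(missing))
```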


In some cases the same description had accidentally been assigned more than one identifier (`desc_id`), which made it look as though there were more unique descriptions than there actually were. I'll rename the old data file in case I need it for reference, then write a new data file under the old file's name:

In [131]:
# new_df.to_csv(desc_to_review[0])  # file: desc_to_review = "../doc_clf_data/desc_field_descid_label_eadid.csv"
new_df = pd.read_csv(desc_to_review[0], index_col=0)
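One way to surface descriptions that carry more than one `desc_id`, as described above, is to group on the columns that identify a description and count distinct IDs. This is a sketch on hypothetical toy data, not the procedure used in the notebook:

```python
import pandas as pd

# Hypothetical toy data: the first two rows are one description with two IDs
toy = pd.DataFrame({
    "file": ["f1.txt", "f1.txt", "f2.txt"],
    "field": ["Title", "Title", "Title"],
    "description": ["Same text", "Same text", "Other text"],
    "desc_id": [19370, 3247, 1465],
})

# Count distinct IDs per (file, field, description) combination
ids_per_desc = toy.groupby(["file", "field", "description"])["desc_id"].nunique()

# Keep only the combinations that have more than one ID
conflicts = ids_per_desc[ids_per_desc > 1]
print(conflicts)
```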

Make sure the other description data file's IDs align with those in `new_df`:

In [184]:
df = pd.read_csv(desc_to_review[1])  # second file listed in desc_to_review
df_descids = list(set(list(df.desc_id)))
print(len(df_descids))

88597


In [185]:
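# NOTE: this reads `df.desc_id` again rather than `new_df.desc_id`, so the check below compares `df` with itself
new_df_descids = list(set(list(df.desc_id)))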
print(len(new_df_descids))

88597


In [186]:
df_descids.sort()
new_df_descids.sort()
assert df_descids == new_df_descids

Looks good!

### Descriptions Data

In [8]:
descs_with_offsets = pd.read_csv("../crc_metadata/descs_with_offsets.csv", index_col=0)
descs_with_offsets.head()

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,0,Coll-227,Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh,29,79
1,1,Coll-227,Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...,100,610
2,2,Coll-227,Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...,638,2277
3,3,La,Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ...",7,117
4,4,La,Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [...",125,223


In [9]:
descs_with_offsets.shape

(88597, 7)

In [10]:
descs_with_offsets_dedup = descs_with_offsets.drop_duplicates()
descs_with_offsets_dedup.shape

(88597, 7)

In [12]:
descs = descs_with_offsets.drop(columns=["file", "desc_start_offset", "desc_end_offset"])
descs = descs.drop_duplicates()
descs.shape

(88597, 4)

No duplicate rows!

In [11]:
print(len(set(list(descs_with_offsets.desc_id))))

88597


No duplicate IDs!
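The same uniqueness check can be stated directly with `Series.is_unique`, and `duplicated(keep=False)` lists the repeated values if the check fails. A brief sketch on a hypothetical toy column:

```python
import pandas as pd

# Hypothetical toy IDs with one repeated value
toy = pd.DataFrame({"desc_id": [0, 1, 2, 2]})

ids_unique = bool(toy["desc_id"].is_unique)

# Distinct ID values that occur more than once
repeated = toy.loc[toy["desc_id"].duplicated(keep=False), "desc_id"].unique().tolist()
print(ids_unique, repeated)
```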

In [21]:
len(set(list(descs_with_offsets.file)))

3645

In [15]:
descs_with_offsets.loc[descs_with_offsets.file == "Coll-1036_00400.txt"]

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
68307,68307,Coll-1036,Scope and Contents,Coll-1036_00400.txt,Miscellaneous music.Several Marjory Kennedy-Fr...,137,1264
68308,68308,Coll-1036,Scope and Contents,Coll-1036_00400.txt,"Miscellaneous items, Part 1, 2, 3.'Burns as a...",1285,1361
68309,68309,Coll-1036,Scope and Contents,Coll-1036_00400.txt,"'Loose Leaf M.S.S. [manuscripts] of ""Book""'. S...",1584,1837
68310,68310,Coll-1036,Scope and Contents,Coll-1036_00400.txt,"'News Cuttings', note book. Softbound, charcoa...",1858,2033
68311,68311,Coll-1036,Scope and Contents,Coll-1036_00400.txt,"Various music collections, Part 1 2. Part 1: ...",2054,5254
68312,68312,Coll-1036,Scope and Contents,Coll-1036_00400.txt,"'Proofs M.S.S.[manuscripts], More Songs of [t...",5275,5552
68313,68313,Coll-1036,Scope and Contents,Coll-1036_00400.txt,"'Kennedy-Fraser MSS. [manuscripts], D. 18377 [...",5573,5941
68314,68314,Coll-1036,Scope and Contents,Coll-1036_00400.txt,'Tolmie Gesto'.,6004,6022
68315,68315,Coll-1036,Scope and Contents,Coll-1036_00400.txt,Proofs of A Life of Song by Marjory Kennedy-Fr...,6842,6902
68316,68316,Coll-1036,Scope and Contents,Coll-1036_00400.txt,Breton songs.,6923,6939


All present... so why were annotations within these descriptions not found when trying to assign description IDs to annotations in the TokenBIOTags notebook?

In [16]:
descs_with_offsets.loc[descs_with_offsets.file == "Coll-1234_00100.txt"]

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
80165,80165,Coll-1234,Title,Coll-1234_00100.txt,"Papers relating to James Campbell, of Carsphairn",30,79
80166,80166,Coll-1234,Scope and Contents,Coll-1234_00100.txt,The collection of papers includes:,100,139
80167,80167,Coll-1234,Biographical / Historical,Coll-1234_00100.txt,James Campbell was born at Carsphairn (Dumfrie...,1507,2663
80168,80168,Coll-1234,Processing Information,Coll-1234_00100.txt,"Compiled by Graeme D. Eddie, Edinburgh Univers...",2688,2767


Something is missing here: the gap between description offsets is too big (description 80166 ends at offset 139, but the next description doesn't start until 1507).

In [20]:
descs_with_offsets.loc[descs_with_offsets.desc_id == 80166].description

80166    The collection of papers includes:    
Name: description, dtype: object
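Gaps like the one noted above can be found systematically by comparing each description's end offset with the start offset of the next description in the same file. The sketch below reuses the four offset pairs shown for `Coll-1234_00100.txt`. The `threshold` value is an arbitrary assumption, not a figure from the data:

```python
import pandas as pd

# Offsets copied from the Coll-1234_00100.txt rows shown above
toy = pd.DataFrame({
    "file": ["Coll-1234_00100.txt"] * 4,
    "desc_id": [80165, 80166, 80167, 80168],
    "desc_start_offset": [30, 100, 1507, 2688],
    "desc_end_offset": [79, 139, 2663, 2767],
})

toy = toy.sort_values(["file", "desc_start_offset"])

# Start offset of the following description within the same file
next_start = toy.groupby("file")["desc_start_offset"].shift(-1)
toy["gap_to_next"] = next_start - toy["desc_end_offset"]

threshold = 100  # hypothetical cut-off for a suspicious gap
suspicious = toy.loc[toy["gap_to_next"] > threshold, "desc_id"].tolist()
print(suspicious)
```

The last description in each file has no successor, so its gap is `NaN` and it is never flagged.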