# Encyclopaedia Britannica 1

Our dataset is the original OCR of the Encyclopaedia Britannica, spanning eight editions released between 1768 and 1860. The OCR has not been cleaned up. We work with the plaintext version, which contains only the OCRed text, without positional or size information. The data is a series of `.txt` files covering the eight editions, so it is completely unstructured text.

Apart from additional content such as the cover page, end notes, and the preface/list of authors in the first volume of an edition, the encyclopaedia is structured as follows:
* Explanations/descriptions of entries, sorted alphabetically. An example entry: `ABSURD, an epithet for any thing that contradidls an apparent truth.` These range from a single sentence to several pages of explanation.
* Longer-form articles that cover broader fields such as `AGRICULTURE` or `ALGEBRA`; these are structured in parts and sections.
* Illustrated pages, which are not included in the plaintext version.
* Page headers, which are numbered and show where you are in the alphabet.

In [24]:
# mention artefacts & s/f spelling
# look thru encyclopaedia images
# comment on non-perfectness of cleanup

# describe my subsets' features: descriptive stats
# select headers and start looking for longest words, words with -isms and their counts, as well as lengths of their definitions
# find definitions with a WORD + NEXT WORD IN CAPS selector?
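
The s/f spelling issue noted above comes from the long s (ſ), which the OCR usually reads as `f` (for example `underftood` for "understood"). One possible way to handle it, sketched below, is to try f-to-s swaps on a word and keep a swap only if it produces a known word. `VOCAB`, `long_s_candidates` and `fix_long_s` are hypothetical names, and the tiny vocabulary is a stand-in for a real word list.

```python
# Hypothetical mini-vocabulary; a real run would load a full English word list.
VOCAB = {"understood", "obscurity", "arises", "subject", "confused",
         "of", "the", "difficulty", "from"}


def long_s_candidates(word):
    """Yield every spelling obtained by turning a non-empty subset of 'f' into 's'."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    for mask in range(1, 2 ** len(positions)):
        chars = list(word)
        for bit, pos in enumerate(positions):
            if mask & (1 << bit):
                chars[pos] = "s"
        yield "".join(chars)


def fix_long_s(word, vocab=VOCAB, max_f=8):
    """Return a vocabulary word reachable by f -> s swaps, else the word unchanged."""
    lower = word.lower()
    # Leave known words alone, and skip words with so many 'f's that the
    # number of candidates (2 ** count) would explode.
    if lower in vocab or lower.count("f") > max_f:
        return word
    for candidate in long_s_candidates(lower):
        if candidate in vocab:
            return candidate  # note: the corrected form is returned in lower case
    return word


print([fix_long_s(w) for w in "underftood arifes from the fubjedt".split()])
```

Words that carry other OCR errors as well, such as `fubjedt`, are left untouched by this sketch because no f-to-s swap turns them into a vocabulary word.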

In [6]:
import pandas as pd
import re

In [7]:
# reading in index file

inventory = pd.read_csv("encyclopaediaBritannica-inventory.csv", header=None)
inventory.columns = ['file','volume']
print("Number of text files: " + str(len(inventory)))

Number of text files: 195


In [18]:
print("Example of the OCRed text: \n")

with open('text/' + inventory.iloc[0].file, 'r', encoding="utf8") as f:
    content = f.read()
print(content[51000:52000])

Example of the OCRed text: 

hat is hard
to be underftood, whether the obfeurity arifes from
the difficulty of the fubjedt, or the confufed manner
of the writer.
ABSURD, an epithet for any thing that contradidls an
apparent truth.
ABSURDITY, the name of an abfurd addon or fenti-
ment.
ABSUS, in botany, the trivial name of a fpecies of the.;
caffia. i-'' Js”
ABSYNTHfUM. See Absinthium.
ABUAI, one of the Philippine ifles. See Philipptne.
ABUCCO, Abocco, or Aboochi, a. weight ufed in
the'kingdom of Pegu, equal to i cRteCcaiis ; two a-
buccoS make an agiro; and two agiri make half a bika,
which is equal to 2 lb ,5 oz. of the heavy weight of Ve¬
nice.
ABUKESO. SccAslani.
ABUNA, the title of the Archbiffiop or Metropolitan
of Abyffinia.
ABUNDANT numbers, fuch whofe aliquot parts ad¬
ded together exceed the number itfelf; as 20, the
aliquot parts of which are, 1, 2, 4, J, to, and make 22.
ABU SAN, an ifland on the coaft of Africa, in 35 35.
N lat dependent on the province of Caret, in the
kin
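
The sample above contains words split across line breaks with either `-` or `¬` (for example `fenti-` / `ment` and `Ve¬` / `nice`). A minimal sketch of rejoining them follows; `join_hyphenated_breaks` is a hypothetical helper, and it assumes that a hyphen directly before a newline always marks a split word, which will be wrong for genuine compounds that happen to break at the end of a line.

```python
import re


def join_hyphenated_breaks(text):
    """Join words split across a line break by '-' or the OCRed '¬' sign."""
    # A word character, then '-' or '¬', then a newline, then a word character:
    # drop the hyphen and the newline, keep the two characters around them.
    return re.sub(r"(\w)[-¬]\n(\w)", r"\1\2", text)


sample = (
    "ABSURDITY, the name of an abfurd addon or fenti-\n"
    "ment.\n"
    "of the heavy weight of Ve¬\n"
    "nice."
)
print(join_hyphenated_breaks(sample))
```

Because the newline is removed together with the hyphen, line lengths change; that should not matter for counting words or extracting definitions.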

In [19]:
first_ed = inventory[:6] # want to work on first & last edition, and remove pref/list of authors and end?
# do i want to remove longform articles, or not worry rn?
first_ed

Unnamed: 0,file,volume
0,144133901.txt,"Encyclopaedia Britannica; or, A dictionary of ..."
1,144133902.txt,"Encyclopaedia Britannica; or, A dictionary of ..."
2,144133903.txt,"Encyclopaedia Britannica; or, A dictionary of ..."
3,144850366.txt,"Encyclopaedia Britannica: or, A dictionary of ..."
4,144850367.txt,"Encyclopaedia Britannica: or, A dictionary of ..."
5,144850368.txt,"Encyclopaedia Britannica: or, A dictionary of ..."


In [155]:
for index, row in first_ed.iterrows():
    print("Reading: " + row['file'])

    with open('text/' + row['file'], 'r', encoding="utf8") as f:
        content = f.read()
    print("Length: " + str(len(content)))
    print("\n")

Reading: 144133901.txt
Length: 3876195


Reading: 144133902.txt
Length: 5518008


Reading: 144133903.txt
Length: 4829230


Reading: 144850366.txt
Length: 3914268


Reading: 144850367.txt
Length: 5538435


Reading: 144850368.txt
Length: 4747819
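
Character counts alone say little about a volume. As a rough sketch of the descriptive statistics planned earlier, the hypothetical helper `describe_text` below also reports lines, whitespace-separated tokens and distinct tokens; it uses only the standard library and is shown on a short entry rather than a full file.

```python
def describe_text(text):
    """Return simple size statistics for a piece of text."""
    tokens = text.split()  # whitespace tokenisation, punctuation stays attached
    return {
        "characters": len(text),
        "lines": len(text.splitlines()),
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "mean_token_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
    }


sample = "ABSURD, an epithet for any thing that contradidls an\napparent truth."
for name, value in describe_text(sample).items():
    print(name, value)
```

Calling it on each volume's `content` and collecting the dictionaries in a list would give one row per file, which could then be passed to `pd.DataFrame`.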




In [156]:
with open('text/144133901.txt', 'r', encoding="utf8") as f:
    content = f.read()

In [158]:
print(content[:2000])

i ! $* i $: iu^b '
n*s-f7^'v
L
i A
j J ^ /^^W/
; h:;^’
J
- }r-r£c9'&} "*— "
..^4-—>,
'I
■
.
,/.
■ -,... v V *•
C*?>7 y
<rw /U^v UJ~L ^ (txk^L j 1rvt*Xitj
$/i*4j/cJysx*£>Xb<. f^oLZ^^c^. % 'bvC JJ.
' }v*c CclU^K <77t .
yy*t4**2^t*-C{+r ^tXCe^vK &v»w
8/y: t^cCv-yt^yA. *-? ^v. •^GL* ftc*frt
* U^>. ‘
** a^yUf^yX ^
}tA£. yylrrCj? yu>t f\ ^^2!
ENCYCLOPEDIA BRITANNICA.
VOLUME the FIRST.
**■*
'
,T S :u -I >;j .1 M U a C V'
.
A
ARTS and SCIENCES,
COVI PILED UPON A NEW PLAN.
IN WHICH
The diferent Sciences and Arts are dioefted into
" O
diflinct Treatifes or Syitems;
AND
. The \irious Technic a lTerms, <&c. are explained as they occur
in the order of the Alphabet.
ILLUSTRATED WITH ONE HUNDRED AND SIXTY COPPERPLATES.
fry a Society of GENTLEMEN in Scotland.
IN THREE VOLUME S.
VOL. I.
EDINBURGH:
Printed for A. Bell and C. Macearquhar;
Aid fold by C o l i n M a cf a r q.u h a r, at his Printing-office, N.coifon-HreeL
\l965r.
SCO,-
PREFACE
UTILITY ought to be the principal intention of every publicatio
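
The first lines of the volume are mostly scan noise from the cover and flyleaves. One crude heuristic for filtering them, sketched here, is to drop lines in which letters and spaces make up only a small share of the characters. `letter_ratio`, `drop_noisy_lines` and the default thresholds are assumptions picked for illustration, not tuned values.

```python
def letter_ratio(line):
    """Share of characters in the stripped line that are letters or spaces."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    good = sum(ch.isalpha() or ch == " " for ch in stripped)
    return good / len(stripped)


def drop_noisy_lines(text, threshold=0.6, min_length=4):
    """Keep lines that are long enough and consist mostly of letters."""
    kept = [
        line
        for line in text.splitlines()
        if len(line.strip()) >= min_length and letter_ratio(line) >= threshold
    ]
    return "\n".join(kept)


sample = "C*?>7 y\nn*s-f7^'v\nENCYCLOPEDIA BRITANNICA.\nVOLUME the FIRST.\n**■*"
print(drop_noisy_lines(sample))
```

The length cut-off also removes short legitimate lines (such as `AND` on the title page, or the tail of a wrapped definition), so a filter like this is better suited to the front matter than to the body of the encyclopaedia.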

In [159]:
sample = content[:10000]

In [160]:
# Code adapted from https://programminghistorian.org/en/lessons/counting-frequencies

wordlist = sample.split()

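# Count each distinct word once, then look the count up for every token.
# (Calling wordlist.count(w) inside a loop is quadratic in the number of tokens.)
from collections import Counter

counts = Counter(wordlist)
wordfreq = [counts[w] for w in wordlist]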

print("String\n" + sample +"\n")
# print("List\n" + str(wordlist) + "\n")
# print("Frequencies\n" + str(wordfreq) + "\n")
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))
print("\n")

counts_frame = pd.DataFrame(list(zip(wordlist, wordfreq)), 
               columns =['wordlist', 'wordfreq']) 
print(len(counts_frame))
counts_frame.drop_duplicates(subset ="wordlist", keep ='first', inplace = True) 
print(len(counts_frame))

String
i ! $* i $: iu^b '
n*s-f7^'v
L
i A
j J ^ /^^W/
; h:;^’
J
- }r-r£c9'&} "*— "
..^4-—>,
'I
■
.
,/.
■ -,... v V *•
C*?>7 y
<rw /U^v UJ~L ^ (txk^L j 1rvt*Xitj
$/i*4j/cJysx*£>Xb<. f^oLZ^^c^. % 'bvC JJ.
' }v*c CclU^K <77t .
yy*t4**2^t*-C{+r ^tXCe^vK &v»w
8/y: t^cCv-yt^yA. *-? ^v. •^GL* ftc*frt
* U^>. ‘
** a^yUf^yX ^
}tA£. yylrrCj? yu>t f\ ^^2!
ENCYCLOPEDIA BRITANNICA.
VOLUME the FIRST.
**■*
'
,T S :u -I >;j .1 M U a C V'
.
A
ARTS and SCIENCES,
COVI PILED UPON A NEW PLAN.
IN WHICH
The diferent Sciences and Arts are dioefted into
" O
diflinct Treatifes or Syitems;
AND
. The \irious Technic a lTerms, <&c. are explained as they occur
in the order of the Alphabet.
ILLUSTRATED WITH ONE HUNDRED AND SIXTY COPPERPLATES.
fry a Society of GENTLEMEN in Scotland.
IN THREE VOLUME S.
VOL. I.
EDINBURGH:
Printed for A. Bell and C. Macearquhar;
Aid fold by C o l i n M a cf a r q.u h a r, at his Printing-office, N.coifon-HreeL
\l965r.
SCO,-
PREFACE
UTILITY ought to be the principal intention of every pub

In [161]:
counts_frame.loc[counts_frame['wordfreq'] > 3]

Unnamed: 0,wordlist,wordfreq
0,i,5
10,A,8
25,.,4
74,the,91
79,S,4
86,a,19
92,and,33
102,The,5
107,are,8
114,or,7


In [162]:
def replace_by(s, a, b):
    # Replace every match of the regular expression a in s by b, reporting the number of matches.
    print("Replacing \"" + a + "\" by \"" + b + "\", found " + str(len(re.findall(a, s))) + "...")
    new = re.sub(a, b, s)
    print("New length: " + str(len(new)) + "\n") # may be useful for checking punctuation/space removal
    return new

In [163]:
replace_by("This is a good doggg", "doggg", "cat")

Replacing "doggg" by "cat", found 1...
New length: 18



'This is a good cat'
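
`replace_by` scans the string twice, once with `re.findall` to count and once with `re.sub` to replace. The standard library's `re.subn` does both in one pass and returns the new string together with the number of substitutions. A small sketch, with `replace_and_count` as a hypothetical name:

```python
import re


def replace_and_count(text, pattern, replacement):
    """Replace every match of pattern and return (new_text, number_of_replacements)."""
    new_text, n_replaced = re.subn(pattern, replacement, text)
    return new_text, n_replaced


print(replace_and_count("This is a good doggg", "doggg", "cat"))
# ('This is a good cat', 1)
```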

In [198]:
# Using regex cleanup ideas from: https://sites.temple.edu/tudsc/2014/08/12/text-scrubbing-hacks-cleaning-your-ocred-text/

# TODO https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions

def clean_up(s):
    print("Initial length: " + str(len(s)))
    # (pattern, replacement) pairs, applied in order
    replacements = [
        ('tbe', 'the'),
        ('tiie', 'the'),
        ('liis', 'his'),
        ('bis', 'his'),  # NB: no word boundary, so "bis" inside a longer word is rewritten too
        ('■', ''),
        # multiple periods; NB: the leading "." matches any character except a newline,
        # so the character before the run is removed as well
        (r'.(\.\.+)', ''),
        # ('\\n', ' '),  # \n for some reason? or is this a format issue?
    ]
    for pattern, replacement in replacements:
        s = replace_by(s, pattern, replacement)
    print("Clean up done!")
    return s

In [200]:
clean_content = clean_up(content[:50000])
print(clean_content[27000:])

Initial length: 50000
Replacing "tbe" by "the", found 1...
New length: 50000

Replacing "tiie" by "the", found 0...
New length: 50000

Replacing "liis" by "his", found 0...
New length: 50000

Replacing "bis" by "his", found 2...
New length: 50000

Replacing "■" by "", found 5...
New length: 49995

Replacing ".(\.\.+)" by "", found 6...
New length: 49976

Clean up done!
 manner, which could not be atoned for
with money.
ABERRATION, in adronorrty, a fmall apparent mo¬
tion of the fixed dars, fird difeovered by Dr Bradley
> Ant
and Mr MoHineux, and found to be owing to the pro-
grefiive motion of light, and the earth’s annual mo¬
tion in its orbit. If a lucid objeft be fixed, and the
eye of the obferver moving along in any other direc¬
tion than that of a dreight line from the eye to the
objeifl, it is plain, that the objedt mud have an appa¬
rent motion, greater or lefs, according to the velocity
with which the eye is moved, and the didance of the
objefl from the eye. See Astronomy.
ABER
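
The patterns in `clean_up` match anywhere in the text, so `bis` is also rewritten inside longer words. Wrapping a pattern in `\b` word boundaries restricts it to whole words. The sketch below compares the two on an invented sentence; `fix_whole_word` is a hypothetical helper, not part of the notebook.

```python
import re


def fix_whole_word(text, wrong, right):
    """Replace `wrong` by `right` only where it stands as a whole word."""
    # re.escape guards against regex metacharacters in the misspelling.
    return re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)


text = "The Archbishop gave bis blessing to tbe people."  # invented example

print(re.sub("bis", "his", text))          # also damages "Archbishop"
print(fix_whole_word(text, "bis", "his"))  # only the standalone "bis" changes
```

The match counts reported by `clean_up` may differ once boundaries are added, since hits inside longer words are no longer included.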

In [71]:
len(re.findall('tbe', 'this is tbe best tbe'))

2

In [69]:
re.search('tbe', content)

<_sre.SRE_Match object; span=(8827, 8830), match='tbe'>

In [74]:
len(re.findall('tbe', content))

27

In [201]:
re.findall('[A-Z][A-Z]+', clean_content[27000:])

['ABERRATION',
 'ABERRATION',
 'ABERYSWITH',
 'ABESTA',
 'ABESTON',
 'ABETTOR',
 'ABEVACUATION',
 'ABEX',
 'ABEYANCE',
 'ABHEL',
 'ABIB',
 'ABIDING',
 'ABIES',
 'ABIGEAT',
 'AB',
 'GEATUS',
 'ABIG',
 'ES',
 'ABILITY',
 'ABINGDON',
 'AB',
 'INTESTATE',
 'ABISHERING',
 'ABIT',
 'ABJURATION',
 'ABLAC',
 'ABLACTATION',
 'ABLACQUEATION',
 'ABLATIVE',
 'ABLAY',
 'ABLECTI',
 'ABLEGMINA',
 'ABLET',
 'ABLUENTS',
 'ABLUTION',
 'ABO',
 'ABOARD',
 'ABOLITION',
 'ABOLLA',
 'ABOMASUS',
 'ABOMINATION',
 'ABORIGINES',
 'ABORTION',
 'ABORTIVE',
 'ABOY',
 'ABRA',
 'ABRACADABRA',
 'ABRAHAM',
 'ABRAHAMITES',
 'ABRAMIS',
 'ABRASA',
 'ABRASION',
 'ABRAUM',
 'ABRASAX',
 'ABRAX',
 'ABREAST',
 'ABRENUNCIATION',
 'ABRIDGEMENT',
 'ABROBANIA',
 'ABROCHMENT',
 'ABROGATION',
 'ABROLKOS',
 'ABRON',
 'ABRONO',
 'ABROTANOIDES',
 'ABROTANOIDES',
 'ABROTANUM',
 'ABRUPTION',
 'ABRUS',
 'ABRUZZO',
 'ABSCEDENTIA',
 'ABSCESS',
 'ABSCHARON',
 'ABSCISSE',
 'ABSCISSION',
 'ABSCONSA',
 'ABSENCE',
 'ABSINTHIATED',
 'ABSINTHIUM',

In [199]:
re.split('([A-Z][A-Z]+)', clean_content[10000:30000]) # TODO: add each word and its definition to a dataframe
# TODO: remove the preface/list of authors as well as the endnotes ("end of first volume"); a regex can check whether they are present

['e other in Dutch\nBrabant.\n',
 'AAHUS',
 ', a fmall town and diftrift in Weftphalia.\n',
 'AAM',
 ', a Dutch meafure for liquids, containing about\n63 lb. avoirdupoife.\nA ',
 'AM',
 ' A, a province in Barbary, very little known.\n',
 'AAR',
 ',, the name of two rivers, one in Weftphalia, and one\nin Switzerland, It is likewife the name of a fmall\nifland in the Baltic fea.\n',
 'AARSEO',
 ', a town inAfrica, fituated near the mouth of\nthe river Mina.\n',
 'AATTER',
 ", or At ter, a province of Arabia Felix, fi¬\ntuated on the Red-fea.—N. 'B. All other places which\nbegin with a double A, but more generally with.a\nAngle one, will be inferted according to the laft ortho¬\ngraphy.\n",
 'AB',
 ', the eleventh month of the civil year of the Hebrews.\nIt correfponds to part of our June and July, and con-\nfifts of 30 days. On the firfi of this month the Jews\ncommemorate the death of Aaron by a fall: they fall\nalfo on the ninth, becaufe on that day both the temple\nof Solomon and that