# Multilingual Brandes

This notebook was part of Quinn Dombrowski's "Digital Brandes" project, for the April 25-26, 2019 hackathon.

It takes plain text files with a naming structure like *da1_12.txt* for Danish volume 1, chapter 12 (2-letter [ISO 639-1 language code](https://en.wikipedia.org/wiki/ISO_639-1), volume number, chapter number). The tab-separated output can be copied into a text editor and saved as a .tsv file that contains the word count, language, chapter number, and any chapter title or header I could find. This file is the input to Tableau for the [word count visualization](https://public.tableau.com/profile/quinn.dombrowski#!/vizhome/MultilingualBrandesBrowser/Sheet1).
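
The naming convention can be checked mechanically. Here is a minimal sketch, assuming every file name is two lowercase letters, an optional run of volume digits, an underscore, chapter digits, and `.txt`. `FILENAME_PATTERN` and `parse_filename` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import re

# Assumed pattern: language code, optional volume digits, "_", chapter digits, ".txt"
FILENAME_PATTERN = re.compile(r"^([a-z]{2})(\d*)_(\d+)\.txt$")

def parse_filename(filename):
    """Return (language, volume, chapter) as strings, or None if the name does not fit."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.groups()

print(parse_filename("da1_12.txt"))  # ('da', '1', '12')
```

A name with no volume number still parses, with an empty string for the volume, which makes such files easy to spot.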

### 1. Import modules and specify source file directory

In [3]:
#os is used for things like changing directories and listing files
import os
#io is used for opening and writing files
import io
#itertools is used for some of the iterative code
from itertools import chain
#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob
#regex is used to parse the file names
import re

#This is the full path to the directory where you've stored the source texts
sourcefiledirectory = '/Users/qad/Documents/GitHub/multilingual-brandes/19cen_lit_vol1'


#Changing the directory to where you've stored the source texts, so you can open them in later code blocks
os.chdir(sourcefiledirectory)

### 2. List all the text files in the directory
This makes sure that you're in the right directory.

In [4]:
#If a file in the source directory ends in .txt, print the filename

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        print(filename)

en1_12.txt
en1_06.txt
es1_14.txt
es1_00.txt
ru1_05.txt
ru1_11.txt
yi1_11.txt
yi1_05.txt
yi1_04.txt
yi1_10.txt
ru1_10.txt
ru1_04.txt
es1_01.txt
es1_15.txt
en1_07.txt
en1_13.txt
en1_05.txt
en1_11.txt
es1_03.txt
ru1_12.txt
ru1_06.txt
yi1_06.txt
yi1_12.txt
yi1_13.txt
yi1_07.txt
ru1_07.txt
ru1_13.txt
es1_02.txt
en1_10.txt
en1_04.txt
en1_00.txt
en1_14.txt
es1_06.txt
es1_12.txt
ru1_17.txt
ru1_03.txt
yi1_03.txt
yi1_02.txt
ru1_02.txt
ru1_16.txt
es1_13.txt
es1_07.txt
en1_15.txt
en1_01.txt
yi_99.txt
en1_03.txt
es1_11.txt
es1_05.txt
ru1_00.txt
ru1_14.txt
yi1_14.txt
yi1_01.txt
ru1_15.txt
ru1_01.txt
es1_04.txt
es1_10.txt
en1_02.txt
de1_03.txt
pl1_07.txt
pl1_13.txt
pl1_12.txt
pl1_06.txt
de1_02.txt
de1_14.txt
de1_00.txt
da1_08.txt
pl1_10.txt
pl1_04.txt
pl1_05.txt
pl1_11.txt
da1_09.txt
de1_01.txt
de1_15.txt
de1_11.txt
de1_05.txt
pl1_15.txt
pl1_01.txt
pl1_00.txt
pl1_14.txt
de1_04.txt
de1_10.txt
de1_06.txt
de1_12.txt
pl1_02.txt
pl1_16.txt
pl1_03.txt
de1_13.txt
de1_07.txt
da1_02.txt
da1_03.txt
de1_09.txt

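Since `os.listdir` returns names in arbitrary order, the listing above is hard to scan. Here is a small sketch, assuming you want an alphabetical check of the directory. `sorted_text_files` is a hypothetical helper and not part of the original notebook.

```python
import os

def sorted_text_files(directory):
    # Keep only .txt names and sort them so each language's chapters appear together
    return sorted(name for name in os.listdir(directory) if name.endswith(".txt"))

# "." means the current working directory, which the notebook sets with os.chdir
for name in sorted_text_files("."):
    print(name)
```
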

### 3. Process text files
This code grabs the metadata from the file name (language, chapter), the title from the first line of each text file, and a word count for the rest of the file, then prints them as tab-separated values. From there, you can copy and paste the output into a plain text file and save it as a .tsv.
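
How the word count behaves depends on how the text is split. Here is a minimal sketch, assuming the goal is to count whitespace-separated tokens. Splitting on a single space leaves line breaks inside tokens and counts empty strings between doubled spaces, while `str.split()` with no argument splits on any run of whitespace. The function names are hypothetical, chosen for illustration.

```python
def count_on_single_spaces(text):
    # Splits only at " "; newlines stay inside tokens and doubled spaces yield empty tokens
    return len(text.split(" "))

def count_on_whitespace(text):
    # Splits on any run of spaces, tabs, or newlines and drops empty tokens
    return len(text.split())

sample = "one two\nthree four"
print(count_on_single_spaces(sample))  # 3
print(count_on_whitespace(sample))  # 4
```

The two counts can diverge on OCR text with hard line breaks, so it is worth deciding which one the visualization should use.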

(I was running short on time at the hackathon, and there was nothing to be gained by sorting out the Python to write the lines to a file directly when I could copy and paste.)
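
For anyone who would rather skip the copy-and-paste step, here is one possible sketch of writing the rows directly with Python's `csv` module. The column names in `HEADER` and the helper `write_tsv` are assumptions made for illustration. The sample row is taken from this notebook's output.

```python
import csv
import io

# Hypothetical column names; adjust to whatever the visualization expects
HEADER = ["words", "language", "chapter", "title"]

def write_tsv(rows, handle):
    # Tab-delimited writer with "\n" line endings, written to any open text handle
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)

# Demonstration with an in-memory buffer instead of a real file
buffer = io.StringIO()
write_tsv([(3889, "en", "12", "xii. new conception of the antique")], buffer)
print(buffer.getvalue())
```

To write a real file, open it with something like `open("brandes.tsv", "w", encoding="utf-8", newline="")` and pass that handle instead of the buffer. The file name is only an example. By default the `csv` writer wraps any title containing a double quote in quotes, which may or may not suit the downstream tool.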

In [8]:
#If a file in the source directory ends in .txt
for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        #Open it (assuming UTF-8 text); the with block closes the file when done
        with open(filename, 'r', encoding='utf-8') as f:
            #Read the first line, which holds the chapter title or header
            firstline = f.readline()
            #Read the rest of the file into a string
            text = f.read()
        #Lowercase the first line, to avoid weirdness in capitalization from OCR
        firstline = firstline.lower()
        #Grabs chapter number (the two digits right before .txt)
        numonly = re.search(r'[0-9][0-9](?=\.txt)', filename).group(0)
        #Grabs language value (the first two lowercase letters)
        lang = re.search('[a-z][a-z]', filename).group(0)
        #Counts words by splitting on single spaces (the title line is not counted)
        words = len(text.split(" "))
        #firstline keeps its trailing newline, which is what ends each row
        print(words, end="\t")
        print(lang, end="\t")
        print(numonly, end="\t")
        print(firstline, end="\t")

3889	en	12	xii. new conception of the antique
	2214	en	06	vi. nodier
	4030	es	14	barante
	2045	es	00	
	494	ru	05	новое душевное настроеніе
	3870	ru	11	борьба съ обществомъ -- дельфина г-жи сталь - условность национальныхъ предразсудковъ 
	7316	yi	11	.
	1066	yi	05	דער נײער זעלען-צושטאַנד.
	5392	yi	04	שאַטאָבריאַנ'ס "רענעי."
	5149	yi	10	די איטאַליענישע פּאָזיע
	1399	ru	10	адольфъ
	2115	ru	04	„рене* шатобріава
	3197	es	01	chateaubriand, atala
	1714	es	15	conclusión
	9724	en	07	vii. constant: »on religion«–»adolphe«
	6670	en	13	xiii. de l'allemagne
	4899	en	05	v. obermann
	6680	en	11	xi. attack upon national and protestant prejudices
	3439	es	03	werther
	2643	ru	12	г-жа сталь и вольтеръ
	3160	ru	06	„обсриаякъ* севаакура
	5913	yi	06	סעינאַנקור'ס „אַבערמאַן."
	4442	yi	12	נײער באַגריף פון דאָס אַנטיקע.
	7552	yi	13	מאַדאַם דע סטאַל'ס „איבער דײטשלאַנד."
	2365	yi	07	נאָדיע.
	3504	ru	07	адольфъ констана.
	1441	ru	13	итальавскаа поезіа
	2105	es	02	rousseau
	5065	en	10	x. »corinne«
	4961	en	04	iv. 