Collects scripts for processing texts for the C19 French Newspapers Scandal project at W&L.
Primary script for text analysis is main.py.
To begin, clone the repository and change into it.
$ git clone https://github.com/wludh/frenchnewspapers.git
$ cd frenchnewspapers
The scripts are written in Python 3. First, install Homebrew if you haven't already:
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Then use Homebrew to install Python 3:
$ brew install python3
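Then install the Python dependencies using pip. Note that scikit-learn is installed under the package name scikit-learn; the old sklearn alias on PyPI is deprecated.
$ pip3 install nltk
$ pip3 install python-dateutil
$ pip3 install treetaggerwrapper
$ pip3 install matplotlib
$ pip3 install scipy
$ pip3 install scikit-learn
$ pip3 install gensim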
For part-of-speech tagging, first install TreeTagger by following the directions here. You will want to install it inside the frenchnewspapers directory. This involves downloading four files from that link, but be careful not to unzip anything yourself; the installation script will do that for you. Create a folder called 'tagger' in the project root, switch into that folder, and then run the install commands, like so:
$ mkdir tagger
$ cd tagger
$ sh install-tagger.sh
$ echo 'Hello world!' | cmd/tree-tagger-french
reading parameters ...
tagging ...
Hello INT hello
world NOM <unknown>
! SENT !
finished.
If the last command prints results like these, the installation was successful and you should be set up.
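Each line of TreeTagger's output holds a word, its part-of-speech tag, and its lemma, separated by tabs. As a rough sketch of how that output could be read in Python (the `Tag` tuple and the `parse_tagger_output` helper are hypothetical names for illustration, not part of main.py):

```python
from collections import namedtuple

# Hypothetical container for one tagged token (not part of main.py).
Tag = namedtuple("Tag", ["word", "pos", "lemma"])


def parse_tagger_output(output):
    """Turn tab-separated TreeTagger lines into Tag tuples.

    Lines without exactly three tab-separated fields (for example status
    messages such as 'reading parameters ...') are skipped.
    """
    tags = []
    for line in output.splitlines():
        fields = line.strip().split("\t")
        if len(fields) == 3:
            tags.append(Tag(*fields))
    return tags


sample = "Hello\tINT\thello\nworld\tNOM\t<unknown>\n!\tSENT\t!\n"
for tag in parse_tagger_output(sample):
    print(tag.word, tag.pos, tag.lemma)
```

In practice the treetaggerwrapper package installed earlier handles this parsing for you; the sketch only shows what the three columns mean.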
Individual scripts can be created for a particular purpose by modifying the main() function in main.py. By default, it outputs a series of basic statistics about the corpus to the file results.txt when run as a program.
A more flexible way to interact with the corpus is by importing the main script in the Python interpreter. First, fire up your Python interpreter and import main.py as a module.
$ python3
>>> import main
>>> corpus = main.Corpus()
The third line here loads the corpus from a given directory. By default, it reads in from a folder called "clean_ocr". If you don't have such a folder, you will have to create one and populate it with plain text files available from our WLU Box folder.
Once read in, the script prepares your corpus as a list of individual texts (organized by date by default) that can be accessed like this:
>>> corpus.texts
[<main.IndexedText object at 0x10d87b550>,
 <main.IndexedText object at 0x10da060b8>,
 <main.IndexedText object at 0x10dbb11d0>,
 <main.IndexedText object at 0x10dc60dd8>,
 <main.IndexedText object at 0x10deef1d0>,
 <main.IndexedText object at 0x10e451710>,
 <main.IndexedText object at 0x10e680080>,
 <main.IndexedText object at 0x10e9cfa58>,
 <main.IndexedText object at 0x10eb57cf8>,
 <main.IndexedText object at 0x10d867898>,
 <main.IndexedText object at 0x10d9b84e0>,
 <main.IndexedText object at 0x10db7c5c0>,
 <main.IndexedText object at 0x10dbef048>,
 <main.IndexedText object at 0x10dd6a6a0>,
 <main.IndexedText object at 0x10e3a12e8>,
 <main.IndexedText object at 0x10e60ef98>,
 <main.IndexedText object at 0x10e88a978>,
 <main.IndexedText object at 0x10ea31978>,
 <main.IndexedText object at 0x10ec87e48>,
 <main.IndexedText object at 0x10d6a40f0>,
 <main.IndexedText object at 0x10d6a4630>,
 <main.IndexedText object at 0x10d8f4e80>,
 <main.IndexedText object at 0x10dabbcf8>,
 <main.IndexedText object at 0x10dcde278>,
 <main.IndexedText object at 0x10e1f2eb8>,
 <main.IndexedText object at 0x10e4ae940>,
 <main.IndexedText object at 0x10e721048>,
 <main.IndexedText object at 0x10ea1ab00>,
 <main.IndexedText object at 0x10ebccb38>]
You can then access any individual text by selecting it from the list:
>>> corpus.texts[0]
<main.IndexedText object at 0x10d87b550>
>>> corpus.texts[0].filename
'figaro_june_1_1908'
>>> corpus.texts[0].tokens
['assassinat', 'du', 'peintre', 'steinheil', 'et', 'de', 'sa', 'belle-mère', 'mme', 'veuve', 'japy',
 'mme', 'steinheil', 'échappe', 'a', 'la', 'mort', 'un', 'crime', 'épouvantable', ',', 'un', 'triple',
 'assassinat', ',', 'a', 'été', 'commis', 'à', 'paris', 'dans', 'la', 'nuit', 'de', 'samedi', 'à',
 'dimanche', '.', 'dans', 'la', 'série', 'des', 'meurtres', "qu'il", 'faut', 'enregistrer', 'chaque',
 'jour', ',', 'celui-là', 'prend', 'une', 'place', 'à', 'part', '.',...
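Filenames such as 'figaro_june_1_1908' appear to follow a journal_month_day_year pattern. Here is a minimal sketch of how a journal name and a date could be pulled out of such a name and used for sorting. The `parse_filename` helper and the `MONTHS` list are hypothetical, and main.py may well derive these values differently.

```python
from datetime import date

# English month names as they appear in filenames like 'figaro_june_1_1908'.
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def parse_filename(filename):
    """Split 'journal_month_day_year' into (journal, date).

    Assumes the naming pattern seen in the example filenames; this is an
    illustration, not the logic used in main.py.
    """
    journal, month, day, year = filename.rsplit("_", 3)
    return journal, date(int(year), MONTHS.index(month.lower()) + 1, int(day))


filenames = ["croix_november_26_1908", "figaro_june_1_1908", "croix_june_2_1908"]
# Sort chronologically by the date encoded in each filename.
for name in sorted(filenames, key=lambda n: parse_filename(n)[1]):
    print(name, parse_filename(name))
```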
I've baked in a variety of methods, some tied to the corpus itself:
- give name of the corpus directory
- give the list of stopwords currently being used.
- give the list of proper names used for querying by proper names
- give the list of all the texts
- sort the articles by date.
- output to a file called 'results.txt' a variety of stats about the texts in the corpus.
- orders the texts by publication and then each of these groupings by date.
- Actually two methods in one: csv_dump and single_token_by_date. The latter charts the usage of a single token across the corpus, and the former writes it to a CSV file for graphing in Excel.
- prints out all filenames in the corpus folder
- take the filename (note the quotation marks) and return the text object associated with it. So you could do something like the following to store a text as a variable and then manipulate it later:
>>> corpus.list_all_filenames
croix_june_2_1908
croix_november_14_1909
croix_november_26_1908
croix_november_27_1908
>>> my_text = corpus.find_by_filename('croix_november_26_1908')
>>> my_text.filename
'croix_november_26_1908'
>>> my_text.tokens
['l', "'", 'affaire', 'steinheil', 'les', 'découvertes', 'd', "'", 'hier', 'nous', 'ne', 'nous', 'trompions', 'pas', 'en', 'annonçant', 'que', 'la', 'perquisition', ...
- creates a new folder containing copies of the texts with all the tokens stemmed, in case we want to try topic modeling on those instead.
- performs Latent Semantic Indexing (LSI) for topic modeling on the corpus.
- performs Latent Dirichlet Allocation (LDA) for topic modeling on the corpus. This takes significantly longer to run than LSI.
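To make the token-by-date idea above more concrete, here is a hedged sketch of counting one token per text, ordered by date, and writing the result as CSV. The `Text` tuple, the `token_counts_by_date` and `write_csv` helpers, and the sample tokens are stand-ins I made up for illustration; they are not the csv_dump or single_token_by_date implementations in main.py.

```python
import csv
import io
from collections import namedtuple
from datetime import date

# Hypothetical stand-in for a text object with a date and a token list.
Text = namedtuple("Text", ["date", "tokens"])


def token_counts_by_date(texts, token):
    """Return (ISO date, count) rows for one token, ordered by date."""
    ordered = sorted(texts, key=lambda t: t.date)
    return [(t.date.isoformat(), t.tokens.count(token)) for t in ordered]


def write_csv(rows, handle):
    """Write a header row plus the data rows to an open file-like object."""
    writer = csv.writer(handle)
    writer.writerow(["date", "count"])
    writer.writerows(rows)


texts = [
    Text(date(1908, 11, 26), ["l", "'", "affaire", "steinheil"]),
    Text(date(1908, 6, 1), ["steinheil", "et", "mme", "steinheil"]),
]
# Swap in open("counts.csv", "w", newline="") to save a real file.
buffer = io.StringIO()
write_csv(token_counts_by_date(texts, "steinheil"), buffer)
print(buffer.getvalue())
```

The resulting two-column file opens directly in a spreadsheet program for graphing.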
Others are tied to the individual texts:
- give filename of text
- give unprocessed text (includes line breaks, etc.)
- give tokenized version of a text
- gives a list of tag objects according to TreeTagger, each of which contains word, pos, and lemma. This is used to create the tagged_tokens and stems attributes.
- gives a list of tuple pairs with (token, part of speech tag)
- gives a list of the text's tokens converted to stems
- give the length of a text (number of tokens)
- give a frequency distribution of the text (number of uses of each individual token)
- give date of a text. Can be further broken down with text.year, text.month, and text.day
- give the name of the journal that published the text.
- shows the distribution of punctuation marks over the course of a text.
- does the same as puncbysection_total, except for two punctuation marks at once
- produces a French stemmer for the text (not fully implemented yet)
- gives you the number of times that exact token occurs in the text. It will not catch plural forms or alternative verb conjugations.
- stems the token and then gets the number of times that stem occurs in the text. So it will catch plural vs singular and verb conjugations.
- produces a stemmed index for the text. (mostly deprecated, since we're using a different stemmer now.)
- produces a list of tokens in the text with stopwords excluded.
- produces a list of word pairs or bigrams for the text as a list of tuples. Ex. ['this', 'is', 'a', 'sentence'] becomes [('this', 'is'), ('is', 'a'), ('a', 'sentence')].
- Keep in mind that each tuple is a unit, denoted by the (). You can access them as a unit: text.bigrams.count(('mme', 'steinheil')) will give you the number of times the bigram 'mme steinheil' occurs in a text.
- Or you can break apart the tuple and manipulate each unit of the tuple separately. The following will loop over all bigrams by assigning variable names to first and second element in each tuple, check to see if the second word in the pair is 'steinheil' and, if so, print the pair to the console. It effectively gives a list of all the different words that Steinheil follows:
for first, second in my_text.bigrams:
    if second == 'steinheil':
        print(first + " " + second)
- produces a list of consecutive three-word phrases. Can be manipulated in the same way as text.bigrams.
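For anyone curious how bigram and trigram lists like these can be built from a token list, here is a small sketch. The `ngrams` helper is a hypothetical name and the token list is invented for the example; main.py may construct its lists differently.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return consecutive n-token tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["mme", "steinheil", "et", "mme", "steinheil", "échappe"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Count a bigram as a single unit.
print(bigrams.count(("mme", "steinheil")))

# Tally the words that immediately precede 'steinheil'.
preceding = Counter(first for first, second in bigrams if second == "steinheil")
print(preceding.most_common())
```

Using a Counter instead of printing each pair gives a ranked summary, which is easier to read on a long text.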
So to do a lot of the more complicated analyses, you will need to move back and forth between corpus methods and text methods:
>>> average = 0
>>> for text in corpus.texts:
...     average += text.length
...
>>> average / len(corpus.texts)
6106.3448275862065
Here I loop across the whole corpus, finding the length of each individual text. Adding up all those lengths, I then divide that number by the number of texts in the corpus. This gives me the average number of tokens per text in the corpus.
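The same back-and-forth pattern extends to grouped statistics, such as the average text length per journal. The sketch below uses a stand-in `Text` tuple with a hypothetical `publication` attribute and made-up lengths, so treat the names and numbers as assumptions rather than what main.py exposes.

```python
from collections import defaultdict, namedtuple
from statistics import mean

# Hypothetical stand-in exposing only the attributes used here.
Text = namedtuple("Text", ["publication", "length"])


def average_length_by_publication(texts):
    """Map each publication name to the mean token count of its texts."""
    lengths = defaultdict(list)
    for text in texts:
        lengths[text.publication].append(text.length)
    return {name: mean(values) for name, values in lengths.items()}


# Invented lengths, for illustration only.
texts = [Text("figaro", 6000), Text("figaro", 7000), Text("croix", 5000)]
print(average_length_by_publication(texts))
```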
You should have everything you need for the basic building blocks of text analysis here. Let me know if anything else comes up.
And a final note: be sure to re-sync your computer's copy of the repository with the master version here on GitHub when you get ready to work each day by running
$ git pull
To perform cluster analysis, type:
$ python3 cluster_process.py -t 'VER'
Here 'VER' selects verbs; you can substitute any other part-of-speech tag abbreviation.
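As an illustration of what selecting by tag can look like, the sketch below filters (token, tag) pairs by a tag prefix. It does not reproduce what cluster_process.py does internally; the `tokens_with_tag` helper is a hypothetical name, and the tagged pairs are invented in the style of TreeTagger's French tagset, where verb tags carry suffixes such as 'VER:pres'.

```python
from collections import Counter


def tokens_with_tag(tagged_tokens, tag_prefix):
    """Keep tokens whose part-of-speech tag starts with tag_prefix."""
    return [token for token, tag in tagged_tokens if tag.startswith(tag_prefix)]


# Invented (token, tag) pairs, for illustration only.
tagged = [
    ("mme", "NOM"),
    ("steinheil", "NAM"),
    ("échappe", "VER:pres"),
    ("à", "PRP"),
    ("la", "DET:ART"),
    ("mort", "NOM"),
]
verbs = Counter(tokens_with_tag(tagged, "VER"))
print(verbs)
```

Matching on a prefix rather than the full tag means 'VER' catches every verb form at once.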