# Analyzing Language and Texts

[David J. Thomas](mailto:dave.a.base@gmail.com), [thePortus.com](http://thePortus.com)<br />
Instructor of Ancient History and Digital Humanities,<br />
Department of History,<br />
[University of South Florida](https://github.com/usf-portal)

---

## This workbook will...

* Use the `dhelp` module to access the `cltk` and `nltk` modules
* Preprocess the text for analysis
* POS (Part of Speech) tag each text
* Perform word counts
* Analyze several other features of the charter texts

---

## 1) Import Module Dependencies

The cell below loads all the Python packages this workbook needs. You **must** run it before any other cell.

In [23]:
import csv
import collections
from IPython.display import clear_output
import nltk
from nltk.text import Text, TextCollection
import cltk
from cltk.corpus.utils.importer import CorpusImporter
from dhelp import LatinText

## First Time Only Setup

The following cell **must** be run the first time you use this workbook on a new computer. It uses the `nltk` and `cltk` modules to download the training corpora and other linguistic data the later cells rely on.

In [5]:
# install required nltk linguistic packages
nltk_packages = [
    'punkt', 'verbnet', 'wordnet', 'large_grammars', 'averaged_perceptron_tagger',
    'maxent_treebank_pos_tagger', 'maxent_ne_chunker', 'universal_tagset', 'words',
    'sample_grammars', 'book_grammars', 'perluniprops'
]
for nltk_package in nltk_packages:
    try:
        print('(NLTK) Attempting to import', nltk_package)
        nltk.download(nltk_package)
    except Exception as e:
        print('Error importing', nltk_package, 'skipping...')

# install latin linguistic packages
for corpus in CorpusImporter('latin').list_corpora:
    try:
        print('(CLTK) Attempting to import', corpus)
        CorpusImporter('latin').import_corpus(corpus)
    except Exception as e:
        print('Error importing', corpus, 'skipping...')

(NLTK) Attempting to import punkt
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/davidthomas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
(NLTK) Attempting to import verbnet
[nltk_data] Downloading package verbnet to
[nltk_data]     /Users/davidthomas/nltk_data...
[nltk_data]   Package verbnet is already up-to-date!
(NLTK) Attempting to import wordnet
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/davidthomas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
(NLTK) Attempting to import large_grammars
[nltk_data] Downloading package large_grammars to
[nltk_data]     /Users/davidthomas/nltk_data...
[nltk_data]   Package large_grammars is already up-to-date!
(NLTK) Attempting to import averaged_perceptron_tagger
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/davidthomas/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
(NLTK) At

## 2) Configure Models 

In [21]:
class Charter(collections.UserString):
    id = None
    description = None
    archive = None
    language = None
    scholarly_date_avg = None
    
    def __init__(self, csv_data_row):
        # call parent class init function, passing the charter text as the string data
        super().__init__(csv_data_row['text'])
        self.id = csv_data_row['id']
        self.description = csv_data_row['description']
        self.archive = csv_data_row['archive']
        self.language = csv_data_row['language']
        self.scholarly_date_avg = csv_data_row['scholarly_date_avg']
        self.data = csv_data_row['text']
    
    def __str__(self):
        return self.data
    
    @property
    def text_clean(self):
        """Basic text pre-processing, removes stopwords, extra spaces, and numbers, adds macrons"""
        stopwords = []
        with open('stopwords_latin.txt') as text_file:
            for line in text_file:
                stopword = line.strip()
                # skip blank lines; strip newlines so stopwords match tokens
                if stopword:
                    stopwords.append(stopword)
        remove_chars = ['.', ',', ';', ':', '+', '-']
        altered_text = self.data
        for remove_char in remove_chars:
            altered_text = altered_text.replace(remove_char, '')
        return LatinText(altered_text.lower()
            ).rm_lines(
            ).normalize(
            ).rm_stopwords(stopwords
            ).rm_spaces()
    
    @property
    def text_lemmatized(self):
        """Gets the clean form of the text then transforms all words to their lemmata."""
        return self.text_clean.lemmatize()
    
    @property
    def entities(self):
        """Scans text with cltk's entity recognition and returns a list."""
        return LatinText(self.data).entities()
    
    def longest_common_substring(self, other_string):
        """Returns the longest substring that this and another charter share."""
        return LatinText(self.data).longest_common_substring(other_string)
    
    def compare_minhash(self, other_string):
        """Compares the text minhash similarity of this and another charter."""
        return LatinText(self.data).compare_minhash(other_string)
        
    def word_count(self):
        """Gives a dictionary, each key is a word appearing and the value is the count."""
        return LatinText(str(self.text_lemmatized)).word_count()
    
    def word_count_raw(self):
        """Same as .word_count(), but does not lemmatize words before counting."""
        return LatinText(self.data).word_count()
    
    def clausulae_count(self):
        """Similar to word_count, but instead uses cltk to look for poetic clausulae in prose text."""
        return LatinText(self.data).clausulae()
    
    
class CharterCorpus(collections.UserList):
    """List of individual charter objects, provides methods to aid analysis of the documents."""
    
    def __init__(self, charters):
        # call parent class init function since we are overriding it
        super().__init__()
        # ensure that an iterable was passed
        try:
            iter(charters)
        except TypeError:
            raise TypeError('CharterCorpus must be populated with an iterable')
        # go through each item and manually append it to the internal list
        for charter in charters:
            self.data.append(charter)
            
    @classmethod
    def load(cls, filepath):
        """Loads a csv with charter data."""
        data_rows = []
        # read-only mode is enough; newline='' lets the csv module handle line endings
        with open(filepath, mode='r', newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                data_rows.append(Charter(row))
        return cls(data_rows)
            
    @property
    def charter_ids(self):
        """Returns a new list containing the ids of all charters in this list of charter objects."""
        id_list = []
        for charter_obj in self:
            id_list.append(charter_obj.id)
        return id_list
    
    def get_by_id(self, charter_id):
        """Returns a single instance of Charter."""
        # iterate each charter
        for charter in self:
            # if match, immediately return charter
            if charter.id == charter_id:
                return charter
        # if no match found, return None
        return None
    
    def get_by_ids(self, id_list):
        """Returns a new instance of CharterCorpus, populated only with charters matching the id_list."""
        filtered_charters = []
        # iterate each charter
        for charter_obj in self:
            # if the id matches, add the charter to the new list of objects
            if charter_obj.id in id_list:
                filtered_charters.append(charter_obj)
        # use self.__class__ to construct a new instance, rather than CharterCorpus(filtered_charters)
        return self.__class__(filtered_charters)
    
    def minhash_distances(self, print_updates=False):
        """Returns dict with ids as keys and vals are dicts with keys/vals of ids/dists to other charters. e.g...
        {'id 1': {'id 2': 0.5, 'id 3': 0.2}, 'id 2': {'id 1': 0.5, 'id 3': 0.8}, 'id 3': {'id 1': 0.2, 'id 2': 0.8}}
        """
        distance_dict = {}
        counter = 0
        # create empty dicts inside for each charter
        for charter_id in self.charter_ids:
            distance_dict[charter_id] = {}
        # start looping through each charter
        for charter in self:
            # if print_updates is set, clear the cell and print info for the new charter
            if print_updates:
                clear_output()
                print('Working on minhash distances for {} ({}/{}) '.format(charter.id, counter + 1, len(self)), end='')
            # start sublooping through other charters to compare against
            for other_charter in self:
                # skip to the next item if the charters are the same
                if charter.id == other_charter.id:
                    continue
                # compute the value with compare_minhash and store it in the dict
                distance_dict[charter.id][other_charter.id] = charter.compare_minhash(str(other_charter.data))
            counter += 1
        # if print_updates is set, print the finished message
        if print_updates:
            print(' Done!')
        return distance_dict


print('Models created successfully.')

Models created successfully.
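
The `text_clean` and `word_count` steps above depend on `dhelp` and a local stopword file. As a rough illustration of what they do, here is a minimal sketch that uses only the standard library. The function names, the sample sentence, and the three-word stopword set are hypothetical stand-ins, and the sketch does not lemmatize or normalize the way `LatinText` does.

```python
import collections

# Hypothetical stopword set for illustration only; the notebook reads its
# own list from stopwords_latin.txt.
SAMPLE_STOPWORDS = {'et', 'in', 'ad'}
# Same punctuation characters that Charter.text_clean removes
REMOVE_CHARS = ['.', ',', ';', ':', '+', '-']


def clean_text(raw_text, stopwords):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    text = raw_text.lower()
    for remove_char in REMOVE_CHARS:
        text = text.replace(remove_char, '')
    # split() with no arguments also collapses runs of whitespace
    return [word for word in text.split() if word not in stopwords]


def word_count(words):
    """Return a dict mapping each word to the number of times it appears."""
    return dict(collections.Counter(words))


sample = 'In nomine Domini, et in nomine regis.'
words = clean_text(sample, SAMPLE_STOPWORDS)
counts = word_count(words)
print(words)
print(counts)
```

Because no lemmatizer is involved, inflected forms such as `rex` and `regis` are counted separately here, which is the difference between `word_count_raw` and `word_count` in the class above.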


In [24]:
charter_corpus = CharterCorpus.load('../export/raw_charters.csv')

minhash_distances = charter_corpus.minhash_distances(print_updates=True)

charter_key_counter = 0
for charter_key in minhash_distances:
    if charter_key_counter > 5:
        break
    sub_charter_key_counter = 0
    for sub_charter_key in minhash_distances[charter_key]:
        if sub_charter_key_counter > 5:
            break
        print('{} -> {}: {}'.format(charter_key, sub_charter_key, minhash_distances[charter_key][sub_charter_key]))
        sub_charter_key_counter += 1
    charter_key_counter += 1
    

Working on minhash distances for S1442 (347/347)  Done!
S1 -> S2: 0.3012326656394453
S1 -> S3: 0.30495356037151705
S1 -> S4: 0.3002336448598131
S1 -> S7: 0.3141592920353982
S1 -> S8: 0.2926434923201294
S1 -> S9: 0.30641330166270786
S2 -> S1: 0.3012326656394453
S2 -> S3: 0.5206812652068127
S2 -> S4: 0.37763833428408444
S2 -> S7: 0.42310469314079424
S2 -> S8: 0.3224852071005917
S2 -> S9: 0.39107413010590014
S3 -> S1: 0.30495356037151705
S3 -> S2: 0.5206812652068127
S3 -> S4: 0.40069686411149824
S3 -> S7: 0.4117647058823529
S3 -> S8: 0.3029197080291971
S3 -> S9: 0.3640416047548291
S4 -> S1: 0.3002336448598131
S4 -> S2: 0.37763833428408444
S4 -> S3: 0.40069686411149824
S4 -> S7: 0.4298745724059293
S4 -> S8: 0.2909494725152693
S4 -> S9: 0.36945244956772333
S7 -> S1: 0.3141592920353982
S7 -> S2: 0.42310469314079424
S7 -> S3: 0.4117647058823529
S7 -> S4: 0.4298745724059293
S7 -> S8: 0.3699927166788055
S7 -> S9: 0.47593582887700536
S8 -> S1: 0.2926434923201294
S8 -> S2: 0.3224852071005917
S8 -
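
The scores above are symmetric (`S1 -> S2` equals `S2 -> S1`) and fall between 0 and 1, which is what a set-overlap measure would produce. The sketch below shows the general idea behind MinHash, assuming the comparison approximates Jaccard similarity over character shingles. It is not the `dhelp` or `cltk` implementation, and the shingle size, number of hash functions, and sample strings are all illustrative choices.

```python
import hashlib


def shingles(text, size=3):
    """Return the set of overlapping character n-grams in the text."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def jaccard(set_a, set_b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def hash_item(item, seed):
    """Hash a shingle together with a seed, giving one of many hash functions."""
    digest = hashlib.md5('{}:{}'.format(seed, item).encode('utf-8')).hexdigest()
    return int(digest, 16)


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the smallest hash value seen.

    Assumes items is non-empty; min() raises ValueError on an empty set.
    """
    return [min(hash_item(item, seed) for item in items) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree; estimates Jaccard."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


text_a = 'in nomine domini nostri'
text_b = 'in nomine regis nostri'
exact = jaccard(shingles(text_a), shingles(text_b))
estimate = minhash_similarity(
    minhash_signature(shingles(text_a)),
    minhash_signature(shingles(text_b)),
)
print('exact Jaccard: {:.3f}, MinHash estimate: {:.3f}'.format(exact, estimate))
```

The estimate gets closer to the exact value as `num_hashes` grows. The appeal of the signature is that it has a fixed length, so long texts can be compared without holding their full shingle sets in memory.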

In [None]:
cleaned_charters = []
lemmatized_charters = []

# iterate the charters loaded from the csv above and prepare each text
print('Preparing texts (lemmatizing and cleaning)...', end='')
counter = 0
for charter_obj in charter_corpus:
    counter += 1
    cleaned_charters.append(Text(charter_obj.text_clean.tokenize()))
    lemmatized_charters.append(Text(charter_obj.text_lemmatized.tokenize()))
    # print a progress dot every 10 charters
    if counter % 10 == 0:
        print('.', end='')
print('Done!')

# converts charters from list of texts into text collection
cleaned_charters = TextCollection(cleaned_charters)
lemmatized_charters = TextCollection(lemmatized_charters)

In [None]:
# dispersion_plot draws the figure itself, so there is nothing useful to print
lemmatized_charters.dispersion_plot(['iesu', 'christi', 'rex', 'deus'])

In [None]:
print('Iesu')
# idf is a logarithmic weight, not a percentage: higher values mean the word is rarer
print('Inverse Document Frequency: {}'.format(round(lemmatized_charters.idf('iesu'), 2)))

print('Concordance: First 10 appearances')
for concordance_appearance in lemmatized_charters.concordance_list('iesu')[0:10]:
    print(concordance_appearance.line)
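
Inverse document frequency is a logarithmic weight rather than a percentage: a word found in every document scores 0, and rarer words score higher. The sketch below shows the calculation in plain Python, assuming the natural-log form log(N / n) where N is the number of documents and n is the number containing the term, which is how I understand NLTK's `TextCollection.idf` to work. The tiny `documents` list is invented for illustration.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log(N / n), or 0.0 when the term appears in no document."""
    matches = sum(1 for document in documents if term in document)
    if matches == 0:
        return 0.0
    return math.log(len(documents) / matches)


# Hypothetical tokenized documents, not taken from the charter corpus
documents = [
    ['iesu', 'christi', 'rex'],
    ['rex', 'deus'],
    ['deus', 'rex', 'iesu'],
    ['rex'],
]

for term in ['rex', 'iesu', 'papa']:
    print(term, round(inverse_document_frequency(term, documents), 3))
```

Here `rex` appears in all four documents and so carries no weight, while `iesu` appears in two of four and scores log(2).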

In [None]:
print('Common Contexts')
print('\n---\nIesu:')
# common_contexts prints its results directly, so no extra print() is needed
lemmatized_charters.common_contexts(['iesu'])
print('\n---\nRex:')
lemmatized_charters.common_contexts(['rex'])

In [None]:
# plot the 20 most frequent words; plot() draws the figure itself
cleaned_charters.plot(20)

## MORE COMING SOON

For now, try out the network analysis module.