In [None]:
import json

with open ('sampledata.jsonl', 'r') as f:
    json_list = list(f)
    
print (len (json_list))

In [None]:
# let's focus on just the first article in the collection

article1 = json.loads(json_list[0])
print (article1.keys())


In [None]:
import json

# read the file again, this time parsing every line into a dictionary
articles = []
with open('sampledata.jsonl', 'r') as f:
    for line in f:
        articles.append(json.loads(line))

# optional: drop the bulky n-gram counts and save a smaller copy of the data
# for article in articles:
#     for key in ('unigramCount', 'bigramCount', 'trigramCount'):
#         article.pop(key, None)

# with open('newdata.jsonl', 'w') as data_file:
#     for article in articles:
#         data_file.write(json.dumps(article) + '\n')

In [None]:
#here's a glance forward, at how we will loop through all the files

for json_str in json_list:
    article = json.loads(json_str)
    print (article.keys())
    print ("Author: " + str(article['creator'][0]), " Title: " + str(article['title']) )
    #the zero after creator means 'just the first author in the list'
    for key in article.keys():
        print (key, article[key], "\n")
        

Note the error message: KeyError: 'creator'
This means that some articles have no author field at all. The same can happen with other fields, and a single missing key is enough to stop the whole loop. This kind of problem can really stump you. Here's a simple solution:

In [None]:
#handle errors without crashing. The full output is very long, so the loop stops after a few articles
counter = 0
for json_str in json_list:
    article = json.loads(json_str)
    print (article.keys())

    try:
        author = article['creator'][0]
    except (KeyError, IndexError, TypeError):  #field missing, empty, or not a list
        author = "No Author"
        
    print ("Author: " + str(author), " Title: " + str(article['title']) )

    
    for key in article.keys():
        print (key, article[key], "\n")
    
    counter +=1
    if counter > 10:
        break

You can use the same "try"/"except" pattern to fill in any category that is sometimes missing, so that your program always has stable data to work with. Other categories besides "author" are sometimes missing as well.
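
A minimal sketch of the same idea using the dictionary method `get` instead of "try"/"except". The helper name `first_author` and the two sample records are made up for illustration; only the field names `creator` and `title` come from the data above.

```python
# Hypothetical helper (not part of the original notebook): return the first
# author of an article, or a placeholder when the field is missing or empty.
def first_author(article, default="No Author"):
    creators = article.get("creator") or []   # .get returns None instead of raising KeyError
    return creators[0] if creators else default

# Two made-up records standing in for parsed lines of the JSONL file
sample_articles = [
    {"title": "A", "creator": ["Jane Doe", "John Roe"]},
    {"title": "B"},  # no 'creator' key at all
]

sample_authors = [first_author(a) for a in sample_articles]
print(sample_authors)
```

Whether you prefer this or "try"/"except" is mostly a matter of taste; `get` is shorter when the only thing that can go wrong is a missing key.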

In [None]:
# we usually want to load all our data into memory
# Let's start by making a list of the first authors of the first 10 articles

authors = []  #make a list of authors

for json_str in json_list[:10]:  #the slice [:10] takes just the first 10 lines
    article = json.loads(json_str)

    try:
        authors.append(article['creator'][0]) #for simplicity, just take the first author
    except (KeyError, IndexError, TypeError):
        authors.append("No Author")

print (authors)


In [None]:
# now we redo the program without the 10-article limit
# let's also get the titles, dates and text

authors = []  #make a list of authors
titles = [] #and titles
texts = [] #and fulltext
dates = []

for json_str in json_list:
    article = json.loads(json_str)

    try:
        dates.append(article['datePublished'])
    except:
        dates.append("No Date")
    try:
        authors.append(article['creator'][0]) #for simplicity, just take the first author
    except:
        na = "No Author"
        authors.append(na)
    
    try:
        titles.append(article['title'])
    except:
        t = "No Title"
        titles.append(t)
    
    try:
        texts.append(article['fullText'])
    except:
        t = "No Text Available"
        texts.append(t)

It took about ten seconds to read through 1500 articles and build four lists in memory. Now let's put these lists together into a sort of 'Excel spreadsheet' called a dataframe.

In [None]:
import pandas as pd

df = pd.DataFrame(list(zip(dates, authors, titles, texts)), 
                        columns = ['dates', 'authors', 'titles', 'texts'])
df.head()

Now you might be wondering what that 'list(zip(...))' thing is. zip takes our four lists and pairs up the items that sit at the same position, so each date is matched with the right author, title and text. list then collects those pairings into rows that the dataframe can read like a spreadsheet.
To learn more about this, just google 'python dataframe list zip' and you'll find lots of resources. Stack Exchange is usually the most useful.
You can also use Copilot, an AI-powered programming partner available in VS Code, which helps you find the right way to do things in your code. But those topics are too advanced for this crash course.
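
To see what zip does on its own, here is a small sketch without pandas. The three short lists are invented for illustration and stand in for the much longer lists built above.

```python
# Three short made-up lists, one per future column
sample_dates = ["1990", "2001", "2015"]
sample_authors = ["Doe", "Roe", "No Author"]
sample_titles = ["First", "Second", "Third"]

# zip pairs up the items that share a position; list() collects the pairs
rows = list(zip(sample_dates, sample_authors, sample_titles))
for row in rows:
    print(row)
```

Each tuple becomes one row of the dataframe. Note that zip stops at the end of the shortest list, so the lists need to be the same length or rows will be dropped without any warning.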

In [None]:
df.tail() #see the end of the dataframe as well

In [None]:
# now let's look at the text contained in the 'fulltext' field. 

print (df['texts'][0])  # I'm saying: show me the contents of the first row [0]
                        # of the column titled 'texts' ['texts']

In [None]:
print (df['texts'][1]) #show me the second row

In [None]:
#looks awful. Let's get rid of those ][ and \n characters at least
# first of all, what kind of variable is it anyway?

text = df['texts'][0]
type(text)

Ok, so the variable is actually saved as a list. Let's turn it into a string first of all.

In [None]:
text = str(text)
type(text)
print (text)

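So even though we turned it into a string, it still has those list characters at the beginning and end. That is because str() writes the list out exactly the way Python displays it, brackets and quotes included. Here's how we get rid of them: by 'slicing'.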

In [None]:
text = text[2:-2]
print (text)

Next, let's get rid of the newline character \n and other useless whitespace.

In [None]:
text = text.strip()
print (text)

This didn't do much though! strip() only removes whitespace at the very beginning and end of the string, so the \n markers in the middle are still there. Now to get rid of the remaining newline characters:

In [None]:
type(text)

In [None]:
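# this is what lots of online tutorials will tell you, but it doesn't work here: our \n are not real newlines
# but the two literal characters backslash and n (a side effect of calling str() on a list)
text2 = " ".join(text.split())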
print (text2)


In [None]:
text3 = text2.replace(r'\n', '') #the 'raw string literal' r'\n' matches the two characters backslash + n that are left in the text
print (text3)

In [None]:
# let's get rid of one more thing, \uf076

text4 = text3.replace(r'\uf076', '')
print (text4)

Ok, you can see that we will be in the weeds for a while cleaning our text. This is a whole sub-field of DH. You will need to get used to googling things like 'python remove newline from string', and to an arcane subject called regex (regular expressions).
For our purposes, our text is now clean enough to start working on it.

Please note! The quality of your output depends greatly on the quality of your input. Make sure you clean carefully.
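
Here is a sketch of an alternative cleaning route, assuming (as the type() check above suggested) that 'fullText' is a list of strings. Joining the list avoids the brackets and the literal \n that str() produces, and a single regular expression then collapses the remaining whitespace. The sample list and the helper name `clean_text` are invented for illustration.

```python
import re

def clean_text(pages):
    """Join a list of strings and normalise whitespace (hypothetical helper)."""
    joined = " ".join(pages)               # real newlines survive, no brackets or quotes
    joined = joined.replace("\uf076", "")  # drop the stray private-use character
    return re.sub(r"\s+", " ", joined).strip()  # any run of whitespace becomes one space

# Made-up stand-in for one article's 'fullText' field
sample_pages = ["First page\nof text. ", "\uf076 Second   page.\n"]

print(str(sample_pages))         # what str() gives: brackets, quotes, escaped characters
print(clean_text(sample_pages))  # joined and cleaned
```

Replacing whitespace with a single space, rather than with nothing, keeps words on either side of a line break from running together.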

In [None]:
#let's recreate our dataframe, with clean texts this time 

authors = []  #make a list of authors
titles = [] #and titles
texts = [] #and fulltext
dates = []

for json_str in json_list:
    article = json.loads(json_str)
    try:
        dates.append(article['datePublished'])
    except:
        dates.append("No Date")

    try:
        authors.append(article['creator'][0]) #for simplicity, just take the first author
    except:
        na = "No Author"
        authors.append(na)
    
    try:
        titles.append(article['title'])
    except:
        t = "No Title"
        titles.append(t)
    
    try:
        text = str(article['fullText']) #don't forget to make the list a string
        text = text[2:-2] #cut off the first and last two characters
        # text = text.strip() #get rid of extra whitespace
        text = text.replace(r'\n', '') #get rid of string literal \n
        text = text.replace(r'\uf076', '') #eliminate whatever that is
        texts.append(text) #stick the clean text into the list 
    except:
        t = "No Text Available"
        texts.append(t)

import pandas as pd

df = pd.DataFrame(list(zip(dates, authors, titles, texts)), 
                        columns = ['dates', 'authors', 'titles', 'texts'])

In [None]:
df.head()


Now what can we do with this?
Let's look for all the articles that have a keyword in their title, for instance 'Women'.
For simple applications this is easiest with our lists. In more complicated situations we will use the dataframe, but we pay a price in speed (looping over a dataframe is slower than looping over a list), and for some operations this will matter.

In [None]:
for t in titles:
    if "Women" in t:
        print (t) 

In [None]:
# we can also find the index value, and use that to recover the author and title
index_list = []
for index, t in enumerate(titles): # the variable named 't' is the element of the list of titles, 'index' is its position in the list
    if "Women" in t:
        print (index, "Author: ", authors[index], "Title: ", t)
        index_list.append(index)

At this point, we need to make a choice about which direction to go. I think one of the most exciting possibilities today is to use large language models like GPT-3 to automatically analyze our texts. But there are tons of other things we could do too! Some of them were presented by Prof. Kulic.

We're going to use a package called spaCy for language modeling.

First we need to install it: pip install spacy

Then we go to https://spacy.io/usage and download our model:

python -m spacy download en_core_web_md

In [None]:
import spacy

nlp = spacy.load("en_core_web_md")
# import en_core_web_md

doc = nlp("This is a sentence about red apples and green pears.")
print ([(w.text, w.pos_) for w in doc])

We can break up a sentence into all its parts of speech, automatically!  Now let's see what else we can do with the tokenization of the sentence. Lots of information here.

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.is_stop)

Let's take a look at the last article from the list above, index number 20

In [None]:
print (df.iloc[20, 3])  # get text from the dataframe: row 20, column 3 (the fourth column, 'texts', counting from zero)

#we have a problem with the quality of the text--some sentences are running together, and the keywords and abstract are 
#all thrown together in here. But let's continue anyway

In [None]:
doc = nlp(df.iloc[20, 3])

for sentence in doc.sents:  #let's see how good a job it can do separating sentences out of the box
    print (sentence, "///////////////")



In [None]:
# looks like it's doing ok. Let's proceed!
adjectives = []  #initialize a list

for sentence in doc.sents:
    for token in sentence:
        # print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
        if token.is_stop: #skip over stop words
            continue
        if token.pos_ == 'ADJ':
            print (token.text)

In [None]:
# we can get a better list if we use lemmas, and filter for unique entries

adjectives = []  #initialize a list

for sentence in doc.sents:
    for token in sentence:
        # print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
        if token.is_stop: #skip over stop words
            continue
        if token.pos_ == 'ADJ':
            adjectives.append(token.lemma_)
print (adjectives)  #print all the adjectives
# filtered_list = set(adjectives) #get just the unique values from the list
# print (filtered_list) 

In [None]:
# we can also count the duplicate elements, sort them in ascending order, and see which are most numerous

my_dict = {i:adjectives.count(i) for i in adjectives}
new_dict = dict(sorted(my_dict.items(), key=lambda item: item[1])) #some details here that we don't understand, we just believe...
print (new_dict)
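
The counting and sorting above can also be sketched with Counter from Python's standard library. The adjective list below is made up for illustration. As for the mysterious part of the sorted(...) call: each `item` is a (word, count) pair, and `item[1]` picks out the count, so the pairs are ordered by how often each word occurs.

```python
from collections import Counter

# A made-up list standing in for the lemmatised adjectives
sample_adjectives = ["colonial", "new", "colonial", "social", "new", "colonial"]

counts = Counter(sample_adjectives)     # tallies every element in one pass
print(counts.most_common(2))            # the two most frequent, highest first

# Same ascending order as the sorted(...) call above, via the same key function
ascending = sorted(counts.items(), key=lambda item: item[1])
print(ascending)
```

On long lists Counter is also much faster than calling `.count()` once per element.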

In [None]:
# now let's do that for all the articles that had "Women" in their titles
adjectives = []  #initialize a list

for index in index_list:
    doc = nlp(df.iloc[index, 3])
    for sentence in doc.sents:
        for token in sentence:
            # print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
            if token.is_stop: #skip over stop words
                continue
            if token.pos_ == 'ADJ':
                adjectives.append(token.lemma_)
                
my_dict = {i:adjectives.count(i) for i in adjectives}

new_dict = dict(sorted(my_dict.items(), key=lambda item: item[1])) #some details here that we don't understand, we just believe...
print (new_dict)

We can also do Named Entity Recognition, or NER, which finds kinds of information in the articles that aren't parts of speech but more complex concepts, such as people, places and organizations. For example, we can find all the named entities in a text:

In [None]:
from spacy import displacy

d = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

dd = nlp(d)
displacy.render(dd, style="ent")

In [None]:
# now let's do that with our entire article 20

d = nlp(df.iloc[20, 3])

displacy.render(d, style="ent")

In [None]:
#now we can extract all the places and people named in the article
#(the GPE label covers countries, but also cities and states)
countries = []
persons = []

for entity in d.ents:
    if entity.label_ == 'GPE':
        countries.append(entity.text)
    if entity.label_ == 'PERSON':
        persons.append(entity.text)
print ("Countries: ", countries, "\n")
print ("Persons: ", persons)


So the list isn't perfect; we can see several errors. But in DH the point is usually to be able to ingest a large amount of information, more than you could read, and do analysis on that quantity. We expect to get it wrong sometimes (but we must minimize our errors!)

In [None]:
# one of the exciting aspects of language models is vectorization. We translate a text into a vector of numbers
# (built from 'word vectors') which can then be manipulated in various ways. Here's an example.

doc0 = nlp("Hindus worship many gods.")
doc1 = nlp("Judaism introduced monotheism to the near East.")
doc2 = nlp("I like burgers and fries.")
doc3 = nlp("Mesopotamian religion is polytheistic.")

# Similarity of two documents
print(doc0, "<->", doc3, doc0.similarity(doc3))

print(doc1, "<->", doc3, doc1.similarity(doc3))

print(doc2, "<->", doc3, doc2.similarity(doc3))


The concept of similarity is complex. Do you think you could assign a percentage value to the similarity between any of the phrases we just saw? So let's take it with a grain of salt: the numbers coincide with what we would expect, but we shouldn't depend too much on just one number to characterize our texts.
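
For intuition about where such a number comes from: as far as I understand the spaCy documentation, the default similarity is the cosine of the angle between two document vectors, each one an average of its word vectors. Treat that as an assumption about the library rather than a guarantee. Here is a minimal sketch of cosine similarity on made-up three-dimensional vectors (real spaCy vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, not real word embeddings
religion_a = [0.9, 0.1, 0.0]
religion_b = [0.8, 0.2, 0.1]
food = [0.0, 0.1, 0.9]

print(round(cosine_similarity(religion_a, religion_b), 3))  # vectors pointing the same way: close to 1
print(round(cosine_similarity(religion_a, food), 3))        # vectors pointing different ways: close to 0
```

The score only measures whether two vectors point in the same direction, which is one more reason not to read it as a percentage of shared meaning.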

"Correlation remains among the most crucial concepts undergirding nearly every aspect of existing AI systems. According to Chun, correlation is not just a conceptual category, but it constitutes an everyday practice whereby people are lumped into “categories based on their being ‘like’ one another amplifying the effects of historical inequalities” [Chun 2021, 58]. These inequalities are in turn naturalized with data organization systems making it appear as though they are innate or sui generis categories which already preexist in the world. As Chun warns, “correlation contains within it the seeds of manipulation, segregation and misrepresentation” [Chun 2021, 59]. As a result of their reliance on correlation, social networks create “microidentities” by default which instrumentalize and weaponize individual differences. Data analytics consequently reimagines eugenics discourses within a big data future where correlations are not only assumed to be predictive of future outcomes, but surveillance is assumed to be a necessary component of every human institution and one which will allow humanity to improve nearly every component of daily life." http://www.digitalhumanities.org/dhq/vol/16/4/000656/000656.html

In [None]:
#let's look at the most common words in the corpus
from collections import Counter
import matplotlib.pyplot as plt

#make a long string comprising every text in the 'texts' column, df.iloc[:,3]
long_string = ' '.join(list(df.iloc[:,3].values))

print (len (long_string)) #87838911 characters! This is too long to parse in one go, so we'll split it into chunks

#split the long string into chunks of 1000000 characters
chunks = [long_string[i:i+1000000] for i in range(0, len(long_string), 1000000)]

#now we'll parse the first chunk and create a list of the most common words

doc = nlp(chunks[0])

# Create a list of word tokens
tokens = [token.text for token in doc]

# Create a list of word tokens after removing punctuation
punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
word_tokens = [word for word in tokens if word not in punctuations]

# Create a list of word tokens after removing stopwords
from spacy.lang.en.stop_words import STOP_WORDS
stopword_tokens = [word for word in word_tokens if not word in STOP_WORDS]

# Create a list of word tokens after lemmatization
lemmas = [token.lemma_ for token in doc]
lemmas = [lemma for lemma in lemmas if lemma not in punctuations]
lemmas = [lemma for lemma in lemmas if lemma not in STOP_WORDS]

# Create a frequency list of tokens
word_freq = Counter(lemmas)
common_words = word_freq.most_common(10)
print(common_words)


This takes a few seconds to process a single chunk of one million characters.

In [None]:
#let's make a word cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                min_font_size = 10).generate(long_string[:10000])  #shortening the string for speed

# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)


...word clouds: an absolutely unbearable and nearly useless object. Can we do anything more interesting?

In [None]:
from spacy import displacy

# we can do more advanced linguistic analysis with spaCy
piano_text = "Gus is learning to play piano"
piano_doc = nlp(piano_text)
for token in piano_doc:
    print(f"""TOKEN: {token.text}
=====
{token.tag_ = }
{token.head.text = }
{token.dep_ = }"""
)

displacy.render(piano_doc, style="dep", jupyter=True) #with jupyter=True the dependency graph of the sentence is drawn right here in the notebook


This is the basis of how I built a program to find the active and passive verbs related to divine figures in the Mesopotamian sources, in order to validate a claim about Mesopotamian gods being 'intransitive'. 

Making good-looking and informative visualizations is an art. It takes time and care, and often the best way to do it is to collaborate with an expert. I frequently work with data-visualization programmers, and pay a little to have a good visualization done quickly using my data backend. One of the best resources is D3, a visualization package you can learn about at https://d3js.org/ 