## Loading Davies Corpora

This notebook will explore the corpora in this folder.

In [2]:
import requests #for http requests
import pandas as pd#gives us DataFrames
import matplotlib.pyplot as plt #For graphics
import wordcloud #Makes word clouds
import numpy as np #For divergences/distances
import scipy #For divergences/distances
import seaborn as sns #makes our plots look nicer
import sklearn.manifold #For a manifold plot
import json #For API responses
import urllib.parse #For joining urls

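# comp-linguistics
import spacy
# Assumption: the notebook never shows where `nlp` is defined; any English spaCy pipeline should work here
nlp = spacy.load("en_core_web_sm")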

#Displays the graphs
import graphviz #You also need to install the command line graphviz

#These are from the standard library
import os.path
import zipfile
import subprocess
import io
import tempfile

import random

#This 'magic' command makes the plots work better
#in the notebook, don't use it outside of a notebook
%matplotlib inline

import re
import os
import sys
import lucem_illud_2020
import pickle

### Loading NOW raw data

The following method iterates through the zip files in the corpus folder and reads them without extracting to disk. It returns a dictionary that maps each text file inside the archives to a list of its raw lines (as bytes).

```corpus_name``` is a string which contains the directory of the corpus you need to use. 

In [12]:
corpus_name = "/Users/74068/Documents/Uchicago/courses/Content Analysis/corpus/NOW"

In [13]:
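def loadcorpus(corpus_name, corpus_style="text"):
    # Map each file inside the matching zip archives to its list of raw (bytes) lines
    texts_raw = {}
    for zip_name in os.listdir(corpus_name):
        if corpus_style in zip_name:
            print(zip_name)
            # Separate names for the archive and its members avoid shadowing the loop variable
            with zipfile.ZipFile(os.path.join(corpus_name, zip_name)) as zfile:
                for file in zfile.namelist():
                    texts_raw[file] = []
                    with zfile.open(file) as f:
                        for line in f:
                            texts_raw[file].append(line)
    return texts_raw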
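
To see the shape of the dictionary without touching the real corpus, here is a minimal sketch that builds a tiny zip archive in memory and reads it back the same way. The member names and contents are invented for illustration.

```python
import io
import zipfile

# Build a tiny zip archive in memory; the member names and contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("16-10-au.txt", "@@1 first article\n@@2 second article\n")
    archive.writestr("16-10-ca.txt", "@@3 third article\n")

# Read it back: one list of byte lines per member of the archive.
buffer.seek(0)
demo_raw = {}
with zipfile.ZipFile(buffer) as archive:
    for member in archive.namelist():
        with archive.open(member) as f:
            demo_raw[member] = list(f)

print(demo_raw)
```

Note that every line comes back as `bytes`, which is why the cleaning step later has to decode it.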

We will be using the NOW corpus for our purposes, but you can point ```corpus_name``` at the other corpora and try them out too.
You might have to make some adjustments in the cleaning for the other corpora; I have tried it for most of them and it works fine.

In [None]:
now_raw = loadcorpus(corpus_name)

text-16-11.zip
text-16-12.zip
text-17-01.zip
text-17-02.zip
text-17-03.zip
text-17-04.zip
text-17-05.zip
text-17-06.zip
text-17-07.zip
text-17-08.zip
text-17-09.zip
text-17-10.zip
text-17-11.zip
text-17-12.zip
text-18-01.zip
text-18-02.zip
text-18-03.zip
text-18-04.zip
text-18-05.zip
text-18-06.zip
text-18-07.zip
text-18-08.zip
text-18-09.zip
text-18-10.zip


In [None]:
# save the python dict to a pickle file
with open('now_raw.pkl', 'wb') as output:
    pickle.dump(now_raw, output)

In [None]:
# read python dict back from the file
with open('now_raw.pkl', 'rb') as pkl_file:
    now_raw = pickle.load(pkl_file)

In [None]:
with open('now_raw.txt','wb') as handle:
    pickle.dump(now_raw, handle)

In [19]:
with open('now_raw.txt','rb') as handle:
    now_raw = pickle.load(handle)

UnpicklingError: invalid load key, '{'.

In [10]:
# Note: this overwrites now_raw.txt with plain text, so pickle.load can no longer read that file
with open("now_raw.txt", "w") as f:
    f.write(str(now_raw))

In [None]:
with open("now.txt") as f:
    now_raw = json.load(f)
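
The `UnpicklingError: invalid load key, '{'` shown earlier is what `pickle.load` reports when the file starts with plain text such as `{` rather than pickle data, which is likely what happened once `str(now_raw)` was written to `now_raw.txt`. Below is a minimal sketch of two round trips that do work, using a made-up `demo_raw` dictionary and temporary files: pickle keeps the byte strings as they are, while JSON needs them decoded to `str` first.

```python
import json
import os
import pickle
import tempfile

# Hypothetical stand-in for now_raw: file name -> list of raw byte lines.
demo_raw = {"16-10-au.txt": [b"@@1 first line\n", b"@@2 second line\n"]}
tmp_dir = tempfile.mkdtemp()

# Pickle round trip: bytes survive unchanged; open the file in binary mode both ways.
pkl_path = os.path.join(tmp_dir, "demo_raw.pkl")
with open(pkl_path, "wb") as handle:
    pickle.dump(demo_raw, handle)
with open(pkl_path, "rb") as handle:
    restored = pickle.load(handle)

# JSON round trip: JSON has no bytes type, so decode each line before dumping.
decoded = {
    name: [line.decode("utf-8", errors="replace") for line in lines]
    for name, lines in demo_raw.items()
}
json_path = os.path.join(tmp_dir, "demo_raw.json")
with open(json_path, "w", encoding="utf-8") as handle:
    json.dump(decoded, handle)
with open(json_path, encoding="utf-8") as handle:
    restored_json = json.load(handle)
```

Keeping the pickle and any text dump in files with different names avoids one overwriting the other.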

In [None]:
len(now_raw)

In [15]:
now_raw.keys()

dict_keys(['16-10-au.txt', '16-10-bd.txt', '16-10-ca.txt', '16-10-gb.txt', '18-04-au.txt', '12-05-au.txt', '14-07-au.txt', '10-08-au.txt', '16-03-au.txt', '11-12-au.txt', '12-02-au.txt', '14-08-au.txt', '14-10-au.txt', '18-09-au.txt', '19-07-au.txt', '13-08-au.txt', '16-01-au.txt', '19-08-au.txt', '19-06-au.txt', '18-10-au.txt', 'text_18-01-AU.txt', '13-07-au.txt', '13-09-au.txt', '13-11-au.txt', 'text_17-06-AU.txt', '14-09-au.txt', '12-10-au.txt', '16-07-au.txt', '19-04-au.txt', '12-08-au.txt', 'text_18-02-AU.txt', '14-11-au.txt', '13-03-au.txt', '11-02-au.txt', '13-02-au.txt', '10-09-au.txt', 'text_17-01-AU.txt', '14-12-au.txt', '12-01-au.txt', 'text_16-12-AU.txt', '15-10-au.txt', '11-11-au.txt', '10-10-au.txt', '11-03-au.txt', '10-04-au.txt', '12-04-au.txt', '13-05-au.txt', '15-09-au.txt', '12-09-au.txt', 'text_17-09-AU.txt', '16-04-au.txt', '12-06-au.txt', '14-05-au.txt', '19-09-au.txt', 'text_17-02-AU.txt', '15-02-au.txt', '10-11-au.txt', '12-11-au.txt', '18-05-au.txt', '12-03-au.

In [14]:
now_raw['18-11-hk.txt']

KeyError: '18-11-hk.txt'

In [12]:
list(now_raw.keys())[1]

'16-10-bd.txt'
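
The `KeyError` above means that the requested file name is not among the loaded keys. A small sketch of checking before indexing, with a made-up `demo_raw` dictionary standing in for `now_raw`:

```python
# Hypothetical stand-in for now_raw with two of the keys listed above.
demo_raw = {
    "16-10-au.txt": [b"@@1 first line\n"],
    "16-10-bd.txt": [b"@@2 second line\n"],
}

wanted = "18-11-hk.txt"

# Fall back to the first available key when the wanted one is missing.
key = wanted if wanted in demo_raw else next(iter(demo_raw))

# List every key for a given country code to see what is actually available.
hk_keys = [name for name in demo_raw if "-hk" in name.lower()]

print(key, hk_keys)
```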

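It seems messy, but nothing we can't clean. This basic method fixes some of the formatting issues by rejoining contractions that the corpus splits with a space (for example, "is n't" becomes "isn't"). The error messages are commented out; uncomment them for debugging. Let us clean one of the raw text files.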

Note: we skip any text data which isn't utf-8 encoded here. I do this to keep things clean; you might want more data or anticipate special characters and not include that restriction.

In [5]:
def clean_raw_text(raw_texts):
    clean_texts = []
    for text in raw_texts:
        try:
            text = text.decode("utf-8")
            # rejoin contractions that the corpus splits off with a leading space
            clean_text = text.replace(" 'm", "'m").replace(" 'll", "'ll").replace(" 're", "'re").replace(" 's", "'s").replace(" n't", "n't").replace(" 've", "'ve").replace(" 'd", "'d")
            clean_texts.append(clean_text)
        except AttributeError:
            # print("ERROR CLEANING")
            # print(text)
            continue
        except UnicodeDecodeError:
            # print("Unicode Error, Skip")
            continue
    return clean_texts
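
As a quick illustration of what the replacements do, here is a minimal sketch on a single made-up line. The helper `rejoin_contractions` is hypothetical and only mirrors the replace chain in `clean_raw_text`.

```python
def rejoin_contractions(text):
    # Remove the space the corpus inserts before each contraction suffix.
    for suffix in ("'m", "'ll", "'re", "'s", "'ve", "'d"):
        text = text.replace(" " + suffix, suffix)
    return text.replace(" n't", "n't")

# Made-up sample line, stored as bytes like the lines read from the zip files.
sample = b"@@1 I 'm sure it is n't what they 'd expect , but it 's fine ."
cleaned = rejoin_contractions(sample.decode("utf-8"))
print(cleaned)

# Lines that are not valid utf-8 raise UnicodeDecodeError, which is why they get skipped.
try:
    b"\xff\xfe".decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```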

In [13]:
clean_11 = clean_raw_text(now_raw['18-11-hk.txt'])
clean_11[1]

KeyError: '18-11-hk.txt'

Nice. This is looking a lot cleaner. We can now run some of the lucem_illud text-cleaning methods that we discuss and model in week 4.

In [20]:
def word_tokenize(word_list, model=nlp, MAX_LEN=1500000):
    
    tokenized = []
    if type(word_list) == list and len(word_list) == 1:
        word_list = word_list[0]

    if type(word_list) == list:
        word_list = ' '.join([str(elem) for elem in word_list]) 
    # since we're only tokenizing, I remove RAM intensive operations and increase max text size

    model.max_length = MAX_LEN
    doc = model(word_list, disable=["parser", "tagger", "ner"])
    
    for token in doc:
        if not token.is_punct and len(token.text.strip()) > 0:
            tokenized.append(token.text)
    return tokenized

In [4]:
def normalizeTokens(word_list, extra_stop=[], model=nlp, lemma=True, MAX_LEN=1500000):
    # Collect the normalized tokens in a list
    normalized = []
    if type(word_list) == list and len(word_list) == 1:
        word_list = word_list[0]

    if type(word_list) == list:
        word_list = ' '.join([str(elem) for elem in word_list]) 

    # since we're only normalizing, I remove RAM intensive operations and increase max text size

    model.max_length = MAX_LEN
    doc = model(word_list.lower(), disable=["parser", "tagger", "ner"])

    if len(extra_stop) > 0:
        for stopword in extra_stop:
            lexeme = model.vocab[stopword]
            lexeme.is_stop = True

    # we check if we want lemmas or not earlier to avoid checking every time we loop
    if lemma:
        for w in doc:
            # if it's not a stop word, punctuation mark, or number, add it to our article
            if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and len(w.text.strip()) > 0:
                # we add the lemmatized version of the word
                normalized.append(str(w.lemma_))
    else:
        for w in doc:
            # if it's not a stop word, punctuation mark, or number, add it to our article
            if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and len(w.text.strip()) > 0:
                # we add the word as written (no lemmatization)
                normalized.append(str(w.text.strip()))

    return normalized

In [None]:
word_tokenize(clean_11[1])

In [None]:
normalizeTokens(clean_11[1])

Great! Now let us create a Pandas dataframe with file names, raw text, tokenized words, and so on.
The source information is stored in the zip file whose name contains "source" in the corpus folder. Similar information files are found for the other datasets too, in their respective folders.

In [None]:
def loadsouce(corpus_name, corpus_style="source"):
    # Collect every raw (bytes) line from the zip files whose name contains corpus_style
    source = []
    for zip_name in os.listdir(corpus_name):
        if corpus_style in zip_name:
            print(zip_name)
            with zipfile.ZipFile(os.path.join(corpus_name, zip_name)) as zfile:
                for file in zfile.namelist():
                    with zfile.open(file) as f:
                        for line in f:
                            source.append(line)
    return source

In [None]:
now_source=loadsouce(corpus_name)

In [None]:
now_source[0:20]

It looks dirty because each line is read as bytes, but we can certainly see the information in here. The text id is also present in the original raw text data, as the first "word". Look back at the normalized/tokenized words to confirm that. We're going to use this to create a dataframe with: text id, date, country, media outlet, link, and title.

It is advised that you run a similar check of the source file before you do other extraction.
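
One way to run such a check is to count how many tab-separated fields each source line has before unpacking it. This is a minimal sketch on invented lines; the values and the date format are made up, and only the seven-field layout follows the unpacking used further down.

```python
from collections import Counter

# Made-up source lines in bytes; real lines come from loadsouce().
demo_source = [
    b"1\t450\t16-10-01\tAU\tExample News\thttp://example.com/a\tFirst headline\n",
    b"2\t512\t16-10-02\tCA\tExample Post\thttp://example.com/b\tSecond headline\n",
    b"malformed line without tabs\n",
]

field_counts = Counter()
for raw_line in demo_source:
    try:
        fields = raw_line.decode("utf-8").rstrip("\n").split("\t")
    except UnicodeDecodeError:
        field_counts["undecodable"] += 1
        continue
    field_counts[len(fields)] += 1

# Lines that do not have seven fields would fail a seven-name unpacking.
print(field_counts)
```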

First, let us create a dictionary mapping each text id to its text. Each text will be mapped to a list of its normalized words.

In this example, I sample at most 20 lines from each file and stop once 48,000 texts have been collected. You can remove the cap or increase/decrease these numbers as inspired.
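
`random.sample` raises a `ValueError` when asked for more items than the list holds, so short files need a guard. A minimal sketch, where `sample_lines` is a hypothetical helper and the byte lines are made up:

```python
import random

def sample_lines(lines, k=20, seed=None):
    # Never ask for more lines than the file has.
    rng = random.Random(seed)
    lines = list(lines)
    return rng.sample(lines, min(k, len(lines)))

short_file = [b"@@1 a\n", b"@@2 b\n", b"@@3 c\n"]
long_file = [("@@%d text\n" % i).encode("utf-8") for i in range(100)]

short_sample = sample_lines(short_file, seed=0)
long_sample = sample_lines(long_file, seed=0)
print(len(short_sample), len(long_sample))
```

Passing a fixed `seed` makes the sample reproducible, which helps when comparing runs.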

In [None]:
now_texts={}
now_df = pd.DataFrame(columns=["filename", "Year", "Month", "Country", "id", "Tokenized Text", "Normalized Text", "Raw"])

In [None]:
rows = []
for files in now_raw:

    if len(now_texts) > 48000:
        break

    print(files)
    # nows = clean_raw_text(now_raw[files][1:])
    lines = list(now_raw[files])
    # sample 20 lines from long files, otherwise up to 5; min() avoids a ValueError on very short files
    k = 20 if len(lines) > 20 else min(5, len(lines))
    n = random.sample(lines, k)
    nows = clean_raw_text(n)
    for now in nows:
        txts_tokenized = word_tokenize(now)
        txts_nomalized = normalizeTokens(txts_tokenized)
        try:
            # the first normalized token is used as the id; an empty text raises IndexError and is skipped
            rows.append({'filename': files, 'id': txts_nomalized[0], 'Tokenized Text': txts_tokenized[4:], 'Normalized Text': txts_nomalized[4:], 'Raw': now})
            now_texts[txts_nomalized[0][2:]] = txts_nomalized[1:]
        except IndexError:
            continue

# DataFrame.append was removed in pandas 2.0, so add all rows at once
now_df = pd.concat([now_df, pd.DataFrame(rows)], ignore_index=True)

In [None]:
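# a dict cannot be sliced, so take the first five file names from a list of its keys
for files in list(now_raw)[0:5]:
    # decode the first three lines of each file before searching them
    for file in clean_raw_text(now_raw[files][0:3]):
        print("a" + file)
        if "technology" in file:
            print("b" + file)
            txts_tokenized = word_tokenize(file)
            txts_nomalized = normalizeTokens(file)
            try:
                row = {'filename': files, 'id': txts_nomalized[0], 'Tokenized Text': txts_tokenized[4:], 'Normalized Text': txts_nomalized[4:], 'Raw': file}
            except IndexError:
                continue
            now_df = pd.concat([now_df, pd.DataFrame([row])], ignore_index=True)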

In [None]:
len(now_raw['10-08-ke.txt'])

In [None]:
now_df

In [None]:
now_df.to_csv('now.csv')
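
The `Year`, `Month`, and `Country` columns declared for `now_df` are not filled by the loop above. One possible way to fill them is to parse the file name, since the keys listed earlier look like `16-10-au.txt` or `text_18-01-AU.txt`. This is a sketch under two assumptions: that every file name follows one of those two patterns, and that the two-digit year belongs to the 2000s. The helper `parse_now_filename` is hypothetical.

```python
import re

# Optional "text_" prefix, then year-month-country, e.g. "16-10-au.txt".
FILENAME_PATTERN = re.compile(r"^(?:text_)?(\d{2})-(\d{2})-([A-Za-z]{2})\.txt$")

def parse_now_filename(filename):
    # Return (year, month, country) or None when the name does not match.
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    year, month, country = match.groups()
    # Assumption: two-digit years refer to 20xx.
    return 2000 + int(year), int(month), country.upper()

print(parse_now_filename("16-10-au.txt"))
print(parse_now_filename("text_18-01-AU.txt"))
```

Names that do not match return `None`, so unexpected files can be inspected instead of being silently mislabeled.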

In [None]:
source_df = pd.DataFrame(columns=["Date", "Country", "Media", "Link", "Title"])

In [None]:
for now in now_source:
    print(now)
    try:
        tid, fileid, date, country, media, link, title = now.decode("utf-8").split("\t")
    except (UnicodeDecodeError, ValueError):
        # skip lines that are not utf-8 or do not have exactly seven tab-separated fields
        continue
    # index the dataframe by text id
    source_df.loc[tid.strip()] = [date.strip(), country.strip(), media.strip(), link.strip(), title.strip()]

        

In [None]:
source_df.head()

In [None]:
source_df.to_csv('source.csv')

This dataframe contains the date, country, media outlet, link, and title for each text id, while `now_df` holds the texts themselves: all sorts of analysis can be run with this information now.



You are encouraged to follow a similar process to load the other datasets.