# Task 4.1 - Process Tokens

### Steps

0. Preliminary Steps: Construct & Import Corpus
1. Process Tokens
2. Keyword Analysis
3. Latent Topic Analysis

# 0. Construct Corpus

# 0. Import Corpus

In [1]:
# pip install pandas

In [2]:
# os is part of the Python standard library; no installation is needed

In [3]:
import pandas as pd
import os

In [4]:
# corpus path
corpus_path = "corpus/"

In [5]:
# corpus index file name
corpus_index_filename = "corpus index.xlsx"

In [6]:
# import corpus index
corpus_index = pd.read_excel(corpus_index_filename, header=0)
corpus_index.shape

(33, 3)

In [7]:
corpus_index

Unnamed: 0,Filename,Date,Description
0,20130504,2013.05.04,“Harmony Lessons” Honoured at Tribeca Film Fes...
1,20140609,2014.06.09,Spiritual and Religious Harmony to be Developed
2,20141219,2014.12.19,National News in Brief
3,20150128,2015.01.28,Why Celebrate 550 Years of Kazakh Statehood?
4,20150630,2015.06.30,Regional Conference Against Violent Extremism ...
5,20150702,2015.07.02,Calls for Broader Cooperation Define Astana An...
6,20160407,2016.04.07,New EU Project Aims to Enhance Criminal Justic...
7,20160408,2016.04.08,100 Concrete Steps Programme has Yielded Posit...
8,20161103,2016.11.03,Kazakhstan Significantly Reduces Prison Popula...
9,20161115,2016.11.15,Prosecutor General Offers New Methods to Fight...


In [10]:
# grab file names in corpus/ (sorted, because os.listdir returns entries in arbitrary order)
corpus_filenames = sorted(os.listdir(corpus_path))
len(corpus_filenames)

33

In [11]:
corpus_filenames

['20130504.txt',
 '20140609.txt',
 '20141219.txt',
 '20150128.txt',
 '20150630.txt',
 '20150702.txt',
 '20160407.txt',
 '20160408.txt',
 '20161103.txt',
 '20161115.txt',
 '20161210.txt',
 '20170603.txt',
 '20170903.txt',
 '20170916.txt',
 '20171005.txt',
 '20171030.txt',
 '20180130.txt',
 '20180217.txt',
 '20180316.txt',
 '20180719.txt',
 '20190605.txt',
 '20190606.txt',
 '20190614.txt',
 '20190904.txt',
 '20191022.txt',
 '20191104.txt',
 '20210305.txt',
 '20210323.txt',
 '20210615.txt',
 '20211223.txt',
 '20220411.txt',
 '20220715.txt',
 '20220720.txt']

In [12]:
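doc_ids = []
doc_text = []

for filename in corpus_filenames:
    # read the file and strip leading/trailing whitespace from every line
    with open(os.path.join(corpus_path, filename), encoding="utf-8") as doc_file:
        doc_lines = [line.strip() for line in doc_file]

    # NOTE: joining with "" fuses the last word of one line with the first word
    # of the next (e.g. "GeneralPublic" in doc_text[10]); " ".join(doc_lines)
    # would keep them apart, but would change the token counts shown below
    doc_lines = "".join(doc_lines)

    # document id = file name without its extension
    doc_id = filename.split(".")[0]

    doc_ids.append(doc_id)
    doc_text.append(doc_lines)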
    
len(doc_ids)

33

In [13]:
doc_ids

['20130504',
 '20140609',
 '20141219',
 '20150128',
 '20150630',
 '20150702',
 '20160407',
 '20160408',
 '20161103',
 '20161115',
 '20161210',
 '20170603',
 '20170903',
 '20170916',
 '20171005',
 '20171030',
 '20180130',
 '20180217',
 '20180316',
 '20180719',
 '20190605',
 '20190606',
 '20190614',
 '20190904',
 '20191022',
 '20191104',
 '20210305',
 '20210323',
 '20210615',
 '20211223',
 '20220411',
 '20220715',
 '20220720']
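
The corpus index and the file names are both loaded above, but they are not compared. A minimal sketch of such a check, assuming the `Filename` column of the index holds the same IDs as the file names without `.txt`. The short lists below are hypothetical stand-ins for `corpus_index["Filename"]` and `corpus_filenames`.

```python
# Stand-ins for corpus_index["Filename"] and corpus_filenames (hypothetical values)
index_ids = [20130504, 20140609, 20141219]   # pandas may read these IDs as integers
filenames = ["20130504.txt", "20140609.txt", "20150128.txt"]

# Normalise both sides to strings without the file extension
index_id_set = {str(i) for i in index_ids}
file_id_set = {name.rsplit(".", 1)[0] for name in filenames}

missing_files = sorted(index_id_set - file_id_set)     # indexed, but no file on disk
unindexed_files = sorted(file_id_set - index_id_set)   # on disk, but not in the index

print("missing files:", missing_files)
print("unindexed files:", unindexed_files)
```

Both lists should be empty when the index and the folder agree.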

In [14]:
doc_text[10]

'Public Safety Has Steadily Improved since Independence, Says Prosecutor GeneralPublic safety has steadily improved since independence, Prosecutor General Zhakip Asanov told a Dec. 9 roundtable of the Commission on Human Rights under the President of Kazakhstan.Murders have decreased three and a half times, violent crime three times and robberies five times since independence, he said. The nation’s prison population has also dropped from 105,000 in 1991 to 36,000 today and the country has closed eight prisons. The changes have improved the country’s position on the world prison ranking by 59 places to 62nd.Kazakhstan also ranks 75th on the 2016 Global Peace Index, the highest among Commonwealth of Independent States (CIS). The index gauges global peace based on a country’s level of societal safety and security, its extent of domestic and international conflict and degree of militarisation.The Prosecutor General also said 70 percent of torture cases are investigated and 146 government e
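
The text above contains "GeneralPublic": the headline and the first sentence were separate lines in the file and were joined without a separator. A small sketch of the difference between the two join styles; the two sample lines are shortened from the output above for illustration.

```python
# Two stripped lines, as produced by [line.strip() for line in ...]
doc_lines = ["Says Prosecutor General", "Public safety has steadily improved."]

fused = "".join(doc_lines)     # no separator: words at line boundaries run together
spaced = " ".join(doc_lines)   # a space keeps a word boundary between lines

print(fused)
print(spaced)
```

Joining with a space would likely give cleaner tokens later on, at the cost of different token counts from the ones recorded in this notebook.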

# 1. Process Tokens

Steps for processing text into tokens:
- Step 1 - tokenize
- Step 2 - lower case
- Step 3 - lemmatize
- Step 4 - remove stop words

Additional processing steps not required in this task:
- Step 5 - tagging parts-of-speech
- Step 6 - chunking

NLTK references

- API - https://www.nltk.org/
- Tutorial - https://realpython.com/nltk-nlp-python/
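
The order of Steps 1 to 4 can be sketched as one small function. This is only an illustration of the sequence: the regex tokenizer, the lemma lookup and the stop word set are toy stand-ins (assumptions), not NLTK, which the cells below use for the real processing.

```python
import re

# Toy stand-ins (assumptions, not NLTK): a tiny lemma lookup and stop word set
TOY_LEMMAS = {"lessons": "lesson", "films": "film"}
TOY_STOP_WORDS = {"the", "at", "were", "and"}

def process(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)          # Step 1 - tokenize
    tokens = [t.casefold() for t in tokens]            # Step 2 - lower case
    tokens = [TOY_LEMMAS.get(t, t) for t in tokens]    # Step 3 - lemmatize
    return [t for t in tokens                          # Step 4 - remove stop words
            if t not in TOY_STOP_WORDS and len(t) > 1]

result = process("The Harmony Lessons films were honoured at Tribeca.")
print(result)
```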

In [15]:
# install nltk - run in command prompt (PC) or terminal (Mac)
### python -m pip install nltk==3.6
### python -m pip install numpy matplotlib
### python -m pip install pytest

In [16]:
# pip install -U nltk

In [17]:
# import nltk.data

In [18]:
import nltk

### Step 1.1 - tokenize

In [19]:
from nltk.tokenize import sent_tokenize, word_tokenize
import io

In [20]:
example_string = "Muad'Dib learned rapidly because his first training was in how to learn. And the first lesson of all was the basic trust that he could learn. It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult"

In [21]:
example_string

"Muad'Dib learned rapidly because his first training was in how to learn. And the first lesson of all was the basic trust that he could learn. It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult"

In [22]:
sent_tokenize(example_string)

["Muad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult"]

In [23]:
word_tokenize(example_string)

["Muad'Dib",
 'learned',
 'rapidly',
 'because',
 'his',
 'first',
 'training',
 'was',
 'in',
 'how',
 'to',
 'learn',
 '.',
 'And',
 'the',
 'first',
 'lesson',
 'of',
 'all',
 'was',
 'the',
 'basic',
 'trust',
 'that',
 'he',
 'could',
 'learn',
 '.',
 'It',
 "'s",
 'shocking',
 'to',
 'find',
 'how',
 'many',
 'people',
 'do',
 'not',
 'believe',
 'they',
 'can',
 'learn',
 ',',
 'and',
 'how',
 'many',
 'more',
 'believe',
 'learning',
 'to',
 'be',
 'difficult']

In [24]:
doc_text[0]

'“Harmony Lessons” Honoured at Tribeca Film Festival“Harmony Lessons” continues its triumphal tour around the globe, winning the Special Jury Diploma for best directorial debut at the Tribeca Film Festival, which was held from April 17-29 in New York City.Emir Baygazin (l) was warmly welcomed at the Tribeca festival by viewers and movie celebrities including Robert De Niro.Before the opening of the festival, The Village Voice included the film in its list of the ten festival must-sees. In total, 89 films were shown during the Tribeca Festival this year. A review on the American website Twitch calls “Harmony Lessons” “deceptively humorous,” comments on the director’s ability to get remarkable performances from his nonprofessional actors, and declares the director, Emir Baygazin, one to watch.“We are very pleased that the American premiere will take place at the Tribeca Film Festival in New York, which is by far the most prestigious in America. In this year’s jury are such well-known fil

In [25]:
doc_tokens = []

for doc in doc_text:
    doc_tokens.append(word_tokenize(doc))
    
len(doc_tokens)

33

In [26]:
len(doc_tokens[0])

786

In [27]:
doc_tokens[0][0:20]

['“',
 'Harmony',
 'Lessons',
 '”',
 'Honoured',
 'at',
 'Tribeca',
 'Film',
 'Festival',
 '“',
 'Harmony',
 'Lessons',
 '”',
 'continues',
 'its',
 'triumphal',
 'tour',
 'around',
 'the',
 'globe']

### Step 1.2 - lower case

In [28]:
# for every list of tokens in doc_tokens, convert all tokens to lowercase
doc_tokens_lower = [[token.casefold() for token in doc] for doc in doc_tokens]

In [29]:
doc_tokens_lower[0][0:20]

['“',
 'harmony',
 'lessons',
 '”',
 'honoured',
 'at',
 'tribeca',
 'film',
 'festival',
 '“',
 'harmony',
 'lessons',
 '”',
 'continues',
 'its',
 'triumphal',
 'tour',
 'around',
 'the',
 'globe']
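
The cell above uses `casefold()` rather than `lower()`. For plain ASCII text the two agree; `casefold()` is the more aggressive variant intended for caseless matching. A short illustration with a word that is not from the corpus:

```python
word = "Straße"

lowered = word.lower()      # keeps the German sharp s
folded = word.casefold()    # folds the sharp s to "ss"

print(lowered, folded)
print("Harmony".lower() == "Harmony".casefold())
```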

### Step 1.3 - lemmatize tokens

In [30]:
import nltk
from nltk.corpus import wordnet
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\seoul\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [31]:
lmtzr = nltk.WordNetLemmatizer()

In [32]:
# short example
word = "Here are words and cars"
tokens = nltk.word_tokenize(word)
tokens

['Here', 'are', 'words', 'and', 'cars']

In [33]:
token_lemma = [lmtzr.lemmatize(token) for token in tokens]
token_lemma

['Here', 'are', 'word', 'and', 'car']

In [34]:
# number of documents
len(doc_tokens_lower)

33

In [35]:
# for each doc in doc_tokens_lower, lemmatize all tokens in that doc
doc_tokens_lemma = [[lmtzr.lemmatize(token) for token in doc] for doc in doc_tokens_lower]

In [36]:
# first few tokens in first doc, before lemmatizing tokens
doc_tokens_lower[0][0:10]

['“',
 'harmony',
 'lessons',
 '”',
 'honoured',
 'at',
 'tribeca',
 'film',
 'festival',
 '“']

In [37]:
# first few tokens in first doc, after lemmatizing tokens
doc_tokens_lemma[0][0:10]

['“',
 'harmony',
 'lesson',
 '”',
 'honoured',
 'at',
 'tribeca',
 'film',
 'festival',
 '“']

### Step 1.4 - remove stop words

In [38]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\seoul\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [39]:
stop_words = list(set(stopwords.words("english")))

In [40]:
stop_words[0:20]

['itself',
 's',
 'herself',
 'all',
 'couldn',
 'between',
 'y',
 "needn't",
 'themselves',
 "that'll",
 'for',
 'haven',
 "mightn't",
 'didn',
 'very',
 'their',
 'those',
 'more',
 'with',
 'after']

In [41]:
# example of removing stop words from a quote
worf_quote = "Sir, I protest. I am not a merry man!"

In [42]:
quote_tokens  = word_tokenize(worf_quote)

In [43]:
quote_tokens

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

In [44]:
quote_tokens_no_stopwords = []

for token in quote_tokens:
    if token.casefold() not in stop_words:
        quote_tokens_no_stopwords.append(token)

In [45]:
quote_tokens_no_stopwords

['Sir', ',', 'protest', '.', 'merry', 'man', '!']

In [46]:
# load extra stop words from a text file, one per line
filename_extra_stopwords = "extra stopwords.txt"
with open(filename_extra_stopwords) as file_extra_stopwords:
    extra_stopwords = file_extra_stopwords.readlines()
extra_stopwords

[',\n',
 '.\n',
 '?\n',
 '!\n',
 '[\n',
 ']\n',
 "'\n",
 ':\n',
 ';\n',
 '-\n',
 'ham\n',
 'v\n',
 'th\n',
 'qu\n']

In [47]:
extra_stopwords = [word.strip() for word in extra_stopwords]
extra_stopwords

[',', '.', '?', '!', '[', ']', "'", ':', ';', '-', 'ham', 'v', 'th', 'qu']

In [48]:
len(stop_words)

179

In [49]:
stop_words += extra_stopwords
stop_words = list(set(stop_words))
len(stop_words)

193

In [50]:
doc_tokens_processed = []

for doc in doc_tokens_lemma:
    
    temp_token_list = []
    
    for token in doc:
        
        if (token not in stop_words) and (len(token) > 1):
            
            temp_token_list.append(token)
    
    doc_tokens_processed.append(temp_token_list)
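
The loop above can also be written as a nested list comprehension, with the stop words held in a set so that each lookup does not scan a list. A sketch under that assumption; the function name and the toy inputs are hypothetical and stand in for `doc_tokens_lemma` and `stop_words`.

```python
def remove_stop_words(docs, stop_words):
    """Drop stop words and single-character tokens from each token list."""
    stop_set = frozenset(stop_words)  # set lookup instead of scanning a list
    return [[tok for tok in doc if tok not in stop_set and len(tok) > 1]
            for doc in docs]

# Hypothetical toy input standing in for doc_tokens_lemma and stop_words
toy_docs = [["“", "harmony", "lesson", "”", "at", "tribeca"], ["the", "film", "."]]
toy_stop_words = ["at", "the", "."]

filtered = remove_stop_words(toy_docs, toy_stop_words)
print(filtered)
```

The curly quotes are not in the stop word list; they are removed by the `len(tok) > 1` condition, as in the loop above.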

In [51]:
len(doc_tokens_processed)

33

In [52]:
len(doc_tokens_lemma[0][0])

1

In [53]:
# first few tokens in first doc, before removing stopwords
doc_tokens_lemma[0][0:10]

['“',
 'harmony',
 'lesson',
 '”',
 'honoured',
 'at',
 'tribeca',
 'film',
 'festival',
 '“']

In [54]:
# first few tokens in first doc, after removing stopwords
doc_tokens_processed[0][0:10]

['harmony',
 'lesson',
 'honoured',
 'tribeca',
 'film',
 'festival',
 'harmony',
 'lesson',
 'continues',
 'triumphal']

In [55]:
# number of tokens in first doc, before removing stopwords
len(doc_tokens_lemma[0])

786

In [56]:
# number of tokens in first doc, after removing stopwords
len(doc_tokens_processed[0])

389

In [57]:
doc_tokens_processed[0]

['harmony',
 'lesson',
 'honoured',
 'tribeca',
 'film',
 'festival',
 'harmony',
 'lesson',
 'continues',
 'triumphal',
 'tour',
 'around',
 'globe',
 'winning',
 'special',
 'jury',
 'diploma',
 'best',
 'directorial',
 'debut',
 'tribeca',
 'film',
 'festival',
 'wa',
 'held',
 'april',
 '17-29',
 'new',
 'york',
 'city.emir',
 'baygazin',
 'wa',
 'warmly',
 'welcomed',
 'tribeca',
 'festival',
 'viewer',
 'movie',
 'celebrity',
 'including',
 'robert',
 'de',
 'niro.before',
 'opening',
 'festival',
 'village',
 'voice',
 'included',
 'film',
 'list',
 'ten',
 'festival',
 'must-sees',
 'total',
 '89',
 'film',
 'shown',
 'tribeca',
 'festival',
 'year',
 'review',
 'american',
 'website',
 'twitch',
 'call',
 'harmony',
 'lesson',
 'deceptively',
 'humorous',
 'comment',
 'director',
 'ability',
 'get',
 'remarkable',
 'performance',
 'nonprofessional',
 'actor',
 'declares',
 'director',
 'emir',
 'baygazin',
 'one',
 'watch.',
 'pleased',
 'american',
 'premiere',
 'take',
 'pla

In [58]:
# add stopwords to "extra stopwords.txt" and run Step 1.4 again
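
One way to find candidates for the extra stop word list is to count tokens across all documents and review the most frequent ones by hand. A minimal sketch with `collections.Counter`; the toy lists are hypothetical and stand in for `doc_tokens_processed`.

```python
from collections import Counter

# Hypothetical toy input standing in for doc_tokens_processed
toy_docs = [["film", "festival", "wa", "film"], ["prison", "wa", "reform"]]

# Count every token across all documents
token_counts = Counter(tok for doc in toy_docs for tok in doc)

# The most frequent tokens are candidates to review by hand
top_tokens = token_counts.most_common(2)
print(top_tokens)
```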

In [59]:
import pickle

In [60]:
# pickle doc_ids, doc_text and doc_tokens_processed together as one list
stuff_to_pickle = [doc_ids, doc_text, doc_tokens_processed]
with open("doc_tokens_processed.pkl", "wb") as fileout:
    pickle.dump(stuff_to_pickle, fileout)
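
A later notebook can load the pickle and unpack the three lists in the order they were stored. A round-trip sketch, using a temporary directory and hypothetical toy data so that it does not touch the real file:

```python
import os
import pickle
import tempfile

# Hypothetical toy data standing in for doc_ids, doc_text and doc_tokens_processed
stuff_to_pickle = [["20130504"], ["Harmony Lessons ..."], [["harmony", "lesson"]]]

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "doc_tokens_processed.pkl")

    with open(pkl_path, "wb") as fileout:
        pickle.dump(stuff_to_pickle, fileout)

    # Unpack in the same order the list was built
    with open(pkl_path, "rb") as filein:
        doc_ids, doc_text, doc_tokens_processed = pickle.load(filein)

print(doc_ids, doc_tokens_processed)
```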

# 2. Keyword Analysis

In [None]:
# tf-idf
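
The cell above is still a placeholder. As a rough sketch of what tf-idf computes, assuming one common unsmoothed definition (term frequency = count / document length, inverse document frequency = ln(N / document frequency)): the toy lists are hypothetical and stand in for `doc_tokens_processed`. Libraries such as scikit-learn use smoothed and normalised variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy input standing in for doc_tokens_processed
toy_docs = [["film", "festival", "film"], ["prison", "reform"], ["film", "prison"]]

n_docs = len(toy_docs)
# document frequency: number of documents that contain each token
doc_freq = Counter(tok for doc in toy_docs for tok in set(doc))

def tf_idf(doc):
    """Return {token: tf * idf} for one document (unsmoothed variant)."""
    counts = Counter(doc)
    return {tok: (n / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, n in counts.items()}

scores = tf_idf(toy_docs[0])
top_term = max(scores, key=scores.get)
print(top_term, round(scores[top_term], 3))
```

In the first toy document "festival" scores higher than "film" even though "film" occurs twice, because "film" also appears in another document.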

# 3. Latent Topic Analysis