# Task 4.1 - Process Tokens

### Steps

0. Preliminary Steps: Construct & Import Corpus
1. Process Tokens
2. Keyword Analysis
3. Latent Topic Analysis

# 0. Construct Corpus

# 0. Import Corpus

In [10]:
import pandas as pd
import os

In [6]:
# corpus path
corpus_path = "corpus/"

In [5]:
# corpus index file name
corpus_index_filename = "corpus index.xlsx"

In [8]:
# import corpus index
corpus_index = pd.read_excel(corpus_index_filename, header=0)
corpus_index.shape

(3, 3)

In [9]:
corpus_index[0:2]

| | Filename | Date | Description |
|---|---|---|---|
| 0 | 20220720 | 2022.07.20 | Public Officials Visit Remand Prisons to Culti... |
| 1 | 20220715 | 2022.07.15 | Kazakh National Security Committee Celebrates ... |


In [11]:
# grab file names in corpus (sorted, because os.listdir returns them in arbitrary order)
corpus_filenames = sorted(os.listdir(corpus_path))
corpus_filenames

['20220411.txt', '20220715.txt', '20220720.txt']

In [141]:
doc_ids = []
doc_text = []

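for filename in corpus_filenames:
    # read the file and strip surrounding whitespace from each line
    with open(corpus_path+filename,encoding="utf-8") as doc_file:
        doc_lines = [line.strip() for line in doc_file]
    # note: joining with "" fuses the last word of one line with the first
    # word of the next (e.g. 'KazakhstanThe'); " ".join(doc_lines) avoids this
    doc_lines = "".join(doc_lines)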
    
    doc_id = filename.split(".")[0]
    
    doc_ids.append(doc_id)
    doc_text.append(doc_lines)
    
len(doc_ids)

3

In [142]:
doc_ids

['20220411', '20220715', '20220720']
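
Since `doc_ids` comes from the folder listing while `corpus_index` comes from the spreadsheet, it may be worth confirming that the two agree before going further. A minimal sketch, assuming the `Filename` column holds the file name without its extension; `compare_ids` is a hypothetical helper and `index_ids` is a hard-coded stand-in for that column:

```python
def compare_ids(index_ids, doc_ids):
    """Return (missing_on_disk, missing_in_index) as sorted lists of ids."""
    index_set = {str(i) for i in index_ids}   # spreadsheet values may load as ints
    doc_set = set(doc_ids)
    return sorted(index_set - doc_set), sorted(doc_set - index_set)

# stand-in for corpus_index["Filename"]; the third value is assumed,
# since only the first two rows of the index are displayed above
index_ids = [20220720, 20220715, 20220411]
doc_ids_example = ["20220411", "20220715", "20220720"]

missing_on_disk, missing_in_index = compare_ids(index_ids, doc_ids_example)
print(missing_on_disk, missing_in_index)
```

Two empty lists mean every indexed document has a file and every file has an index row.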

In [143]:
doc_text[0]

'US Ready to Partner With Kazakh Government on Implementation of Political Reforms, Senior Diplomat Says on Visit to KazakhstanThe United States supports President Tokayev’s political reforms agenda, said Uzra Zeya, Under-Secretary for Civilian Security, Democracy, and Human Rights, at the press meeting at the Kazakh Ministry of Foreign Affairs on April 11.Uzra Zeya meeting with Minister of Foreign Affairs – Mukhtar Tileuberdi. Photo credit: gov.kzDuring her visit to Kazakhstan, the sides discussed a wide range of issues, including the aftermath of the January unrest, combating human trafficking, empowerment of women, the rights of people with disabilities, and cooperation in security, law enforcement and anti-corruption.Zeya said that the U.S. government welcomed  President Kassym-Jomart Tokayev’s political reforms that he announced during his address to the nation on March 16, including those designed to strengthen the legislative branch of authority and political parties, and enhanc

# 1. Process Tokens

Steps for processing text into tokens:
- Step 1 - tokenize
- Step 2 - lower case
- Step 3 - lemmatize
- Step 4 - remove stop words

Additional processing steps (not required in this task):
- Step 5 - tagging parts-of-speech
- Step 6 - chunking

NLTK references

- API - https://www.nltk.org/
- Tutorial - https://realpython.com/nltk-nlp-python/

In [39]:
# install nltk - run in command prompt (PC) or terminal (Mac)
### python -m pip install nltk==3.6
### python -m pip install numpy matplotlib
### python -m pip install pytest

In [40]:
# pip install -U nltk

In [41]:
# import nltk.data

In [42]:
import nltk

### Step 1.1 - tokenize

In [43]:
# both tokenizers need the Punkt models; run nltk.download("punkt") once if they are missing
from nltk.tokenize import sent_tokenize, word_tokenize

In [44]:
example_string = "Muad'Dib learned rapidly because his first training was in how to learn. And the first lesson of all was the basic trust that he could learn. It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult"

In [45]:
example_string

"Muad'Dib learned rapidly because his first training was in how to learn. And the first lesson of all was the basic trust that he could learn. It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult"

In [46]:
sent_tokenize(example_string)

["Muad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult"]

In [47]:
word_tokenize(example_string)

["Muad'Dib",
 'learned',
 'rapidly',
 'because',
 'his',
 'first',
 'training',
 'was',
 'in',
 'how',
 'to',
 'learn',
 '.',
 'And',
 'the',
 'first',
 'lesson',
 'of',
 'all',
 'was',
 'the',
 'basic',
 'trust',
 'that',
 'he',
 'could',
 'learn',
 '.',
 'It',
 "'s",
 'shocking',
 'to',
 'find',
 'how',
 'many',
 'people',
 'do',
 'not',
 'believe',
 'they',
 'can',
 'learn',
 ',',
 'and',
 'how',
 'many',
 'more',
 'believe',
 'learning',
 'to',
 'be',
 'difficult']

In [144]:
doc_text[0]

'US Ready to Partner With Kazakh Government on Implementation of Political Reforms, Senior Diplomat Says on Visit to KazakhstanThe United States supports President Tokayev’s political reforms agenda, said Uzra Zeya, Under-Secretary for Civilian Security, Democracy, and Human Rights, at the press meeting at the Kazakh Ministry of Foreign Affairs on April 11.Uzra Zeya meeting with Minister of Foreign Affairs – Mukhtar Tileuberdi. Photo credit: gov.kzDuring her visit to Kazakhstan, the sides discussed a wide range of issues, including the aftermath of the January unrest, combating human trafficking, empowerment of women, the rights of people with disabilities, and cooperation in security, law enforcement and anti-corruption.Zeya said that the U.S. government welcomed  President Kassym-Jomart Tokayev’s political reforms that he announced during his address to the nation on March 16, including those designed to strengthen the legislative branch of authority and political parties, and enhanc

In [145]:
doc_tokens = []

for doc in doc_text:
    doc_tokens.append(word_tokenize(doc))
    
len(doc_tokens)

3

In [146]:
len(doc_tokens[0])

463

In [147]:
doc_tokens[0][0:20]

['US',
 'Ready',
 'to',
 'Partner',
 'With',
 'Kazakh',
 'Government',
 'on',
 'Implementation',
 'of',
 'Political',
 'Reforms',
 ',',
 'Senior',
 'Diplomat',
 'Says',
 'on',
 'Visit',
 'to',
 'KazakhstanThe']
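
The last token above, `'KazakhstanThe'`, is the end of the headline fused with the first word of the body: the import loop strips each line and then joins the lines with an empty string. A minimal sketch of the difference, using two made-up lines rather than the corpus files:

```python
lines = ["Visit to Kazakhstan\n", "The United States supports reforms.\n"]
stripped = [line.strip() for line in lines]

fused = "".join(stripped)      # what the import loop does
spaced = " ".join(stripped)    # keeps a word boundary between lines

print(fused.split()[2:4])
print(spaced.split()[2:4])
```

Switching the loop to `" ".join(...)` would change the token counts reported in this notebook, so the later cells would need to be re-run.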

### Step 1.2 - lower case

In [148]:
# for every list of tokens in doc_tokens, convert all tokens to lowercase
doc_tokens_lower = [[token.casefold() for token in doc] for doc in doc_tokens]

In [149]:
doc_tokens_lower[0][0:20]

['us',
 'ready',
 'to',
 'partner',
 'with',
 'kazakh',
 'government',
 'on',
 'implementation',
 'of',
 'political',
 'reforms',
 ',',
 'senior',
 'diplomat',
 'says',
 'on',
 'visit',
 'to',
 'kazakhstanthe']
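
`str.casefold()` is used here rather than `str.lower()`. For English text the two usually give the same result; `casefold()` is the more aggressive of the two and is intended for caseless matching. A small illustration with a German word that is not from the corpus:

```python
word = "Straße"
print(word.lower())      # the ß is kept
print(word.casefold())   # the ß is folded to "ss"
print("Kazakh".lower() == "Kazakh".casefold())
```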

### Step 1.3 - lemmatize tokens

In [150]:
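import nltk
from nltk.corpus import wordnet
# WordNetLemmatizer also needs the 'wordnet' package; run nltk.download('wordnet') once if it is missing
nltk.download('omw-1.4')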

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\seoul\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [151]:
lmtzr = nltk.WordNetLemmatizer()

In [152]:
# short example
word = "Here are words and cars"
tokens = nltk.word_tokenize(word)
tokens

['Here', 'are', 'words', 'and', 'cars']

In [153]:
token_lemma = [lmtzr.lemmatize(token) for token in tokens]
token_lemma

['Here', 'are', 'word', 'and', 'car']

In [154]:
# number of documents
len(doc_tokens_lower)

3

In [155]:
# for each doc in doc_tokens_lower, lemmatize all tokens in that doc
doc_tokens_lemma = [[lmtzr.lemmatize(token) for token in doc] for doc in doc_tokens_lower]

In [156]:
# first few tokens in first doc, before lemmatizing tokens
doc_tokens_lower[0][0:10]

['us',
 'ready',
 'to',
 'partner',
 'with',
 'kazakh',
 'government',
 'on',
 'implementation',
 'of']

In [157]:
# first few tokens in first doc, after lemmatizing tokens
doc_tokens_lemma[0][0:10]

['u',
 'ready',
 'to',
 'partner',
 'with',
 'kazakh',
 'government',
 'on',
 'implementation',
 'of']
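
In the output above, `'us'` has become `'u'`: `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is given, so it appears to read `'us'` as a plural. One way to guard against this is to skip lemmatization for a short list of protected tokens. A minimal sketch; `lemmatize_except` and `protected` are hypothetical names, and `toy_lemmatize` is only a stand-in so the example runs without NLTK data (in the notebook, `lmtzr.lemmatize` would be passed instead):

```python
def lemmatize_except(tokens, lemmatize, protected):
    """Lemmatize tokens, leaving any token in `protected` unchanged."""
    return [tok if tok in protected else lemmatize(tok) for tok in tokens]

def toy_lemmatize(token):
    # crude stand-in for a real lemmatizer: strip one trailing "s"
    return token[:-1] if token.endswith("s") and len(token) > 1 else token

tokens = ["us", "ready", "reforms"]
print(lemmatize_except(tokens, toy_lemmatize, protected={"us"}))
```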

### Step 1.4 - remove stop words

In [158]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\seoul\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [159]:
stop_words = list(set(stopwords.words("english")))

In [160]:
stop_words[0:20]

['is',
 "wouldn't",
 'out',
 'about',
 'hasn',
 'by',
 'further',
 'myself',
 "wasn't",
 "you'll",
 'should',
 'm',
 'from',
 "you're",
 'yourselves',
 'ma',
 "won't",
 'some',
 "don't",
 'needn']

In [161]:
# example of removing stop words from a quote
worf_quote = "Sir, I protest. I am not a merry man!"

In [162]:
quote_tokens  = word_tokenize(worf_quote)

In [163]:
quote_tokens

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

In [164]:
quote_tokens_no_stopwords = []

for token in quote_tokens:
    if token.casefold() not in stop_words:
        quote_tokens_no_stopwords.append(token)

In [165]:
quote_tokens_no_stopwords

['Sir', ',', 'protest', '.', 'merry', 'man', '!']

In [166]:
# load extra stop words (one per line) to append to the NLTK list
filename_extra_stopwords = "extra stopwords.txt"
with open(filename_extra_stopwords, encoding="utf-8") as file_extra_stopwords:
    extra_stopwords = file_extra_stopwords.readlines()
extra_stopwords

[',\n',
 '.\n',
 '?\n',
 '!\n',
 '[\n',
 ']\n',
 "'\n",
 ':\n',
 ';\n',
 '-\n',
 'ham\n',
 'v\n',
 'th\n',
 'qu\n']

In [167]:
extra_stopwords = [word.strip() for word in extra_stopwords]
extra_stopwords

[',', '.', '?', '!', '[', ']', "'", ':', ';', '-', 'ham', 'v', 'th', 'qu']
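
Reading and stripping can also be done in one step that skips blank lines. A minimal sketch, assuming one stop word per line; `load_stopwords` is a hypothetical helper and the temporary file stands in for `extra stopwords.txt`:

```python
import os
import tempfile

def load_stopwords(path):
    """Read one stop word per line, dropping whitespace and empty lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# write a small stand-in file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(",\n.\n\nham\n")
    tmp_path = tmp.name

extra = load_stopwords(tmp_path)
os.remove(tmp_path)
print(extra)
```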

In [168]:
len(stop_words)

179

In [169]:
stop_words += extra_stopwords
stop_words = list(set(stop_words))
len(stop_words)

193

In [187]:
doc_tokens_processed = []

for doc in doc_tokens_lemma:
    
    temp_token_list = []
    
    for token in doc:
        
        if (token not in stop_words) and (len(token) > 1):
            
            temp_token_list.append(token)
    
    doc_tokens_processed.append(temp_token_list)

In [188]:
len(doc_tokens_processed)

3

In [189]:
len(doc_tokens_lemma[0][0])

1

In [190]:
# first few tokens in first doc, before removing stopwords
doc_tokens_lemma[0][0:10]

['u',
 'ready',
 'to',
 'partner',
 'with',
 'kazakh',
 'government',
 'on',
 'implementation',
 'of']

In [191]:
# first few tokens in first doc, after removing stopwords
doc_tokens_processed[0][0:10]

['ready',
 'partner',
 'kazakh',
 'government',
 'implementation',
 'political',
 'reform',
 'senior',
 'diplomat',
 'say']

In [192]:
# number of tokens in first doc, before removing stopwords
len(doc_tokens_lemma[0])

463

In [193]:
# number of tokens in first doc, after removing stopwords
len(doc_tokens_processed[0])

245
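
`stop_words` is kept as a list, so each `token not in stop_words` test scans the whole list. For larger corpora, a set gives the same result with constant-time lookups. A minimal sketch; `remove_stopwords` is a hypothetical helper and the short stop-word list is made up for the example:

```python
def remove_stopwords(tokens, stop_words, min_len=2):
    """Drop stop words and tokens shorter than min_len characters."""
    stop_set = frozenset(stop_words)          # constant-time membership tests
    return [t for t in tokens if t not in stop_set and len(t) >= min_len]

tokens = ["u", "ready", "to", "partner", "with", "kazakh", "government", ","]
print(remove_stopwords(tokens, ["to", "with", ","]))
```

The `min_len=2` default mirrors the `len(token) > 1` condition used in the loop above.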

In [194]:
doc_tokens_processed[0]

['ready',
 'partner',
 'kazakh',
 'government',
 'implementation',
 'political',
 'reform',
 'senior',
 'diplomat',
 'say',
 'visit',
 'kazakhstanthe',
 'united',
 'state',
 'support',
 'president',
 'tokayev',
 'political',
 'reform',
 'agenda',
 'said',
 'uzra',
 'zeya',
 'under-secretary',
 'civilian',
 'security',
 'democracy',
 'human',
 'right',
 'press',
 'meeting',
 'kazakh',
 'ministry',
 'foreign',
 'affair',
 'april',
 '11.uzra',
 'zeya',
 'meeting',
 'minister',
 'foreign',
 'affair',
 'mukhtar',
 'tileuberdi',
 'photo',
 'credit',
 'gov.kzduring',
 'visit',
 'kazakhstan',
 'side',
 'discussed',
 'wide',
 'range',
 'issue',
 'including',
 'aftermath',
 'january',
 'unrest',
 'combating',
 'human',
 'trafficking',
 'empowerment',
 'woman',
 'right',
 'people',
 'disability',
 'cooperation',
 'security',
 'law',
 'enforcement',
 'anti-corruption.zeya',
 'said',
 'u.s.',
 'government',
 'welcomed',
 'president',
 'kassym-jomart',
 'tokayev',
 'political',
 'reform',
 'announce

In [195]:
# to drop more tokens, add them to "extra stopwords.txt" and run Step 1.4 again

In [196]:
import pickle

In [199]:
# pickle doc_ids, doc_text and doc_tokens_processed into one file, in that order
with open("doc_tokens_processed.pkl",'wb') as fileout:
    pickle.dump(doc_ids,fileout)
    pickle.dump(doc_text,fileout)
    pickle.dump(doc_tokens_processed,fileout)
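
Because three objects are dumped into the same file one after another, they have to be read back with three `pickle.load` calls in the same order. A minimal sketch, using an in-memory buffer and made-up values in place of `doc_tokens_processed.pkl`:

```python
import io
import pickle

buffer = io.BytesIO()
pickle.dump(["20220411"], buffer)              # doc ids
pickle.dump(["some text"], buffer)             # doc text
pickle.dump([["some", "text"]], buffer)        # processed tokens

buffer.seek(0)                                 # rewind, as if reopening the file with 'rb'
ids = pickle.load(buffer)
text = pickle.load(buffer)
tokens = pickle.load(buffer)
print(ids, text, tokens)
```

With the real file, the same three loads would run inside `with open("doc_tokens_processed.pkl", "rb") as filein:`.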

# 2. Keyword Analysis

# 3. Latent Topic Analysis