# Tutorial

This is less a tutorial than the IPython notebook I used to learn how to pre-process the data and train the models used in the app. It's a work in progress, as I intend to comment on and explain each step.

Here are some links I found very useful while developing this notebook and creating this project:
- https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
- https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
- https://radimrehurek.com/gensim/tut1.html
- https://radimrehurek.com/gensim/tut2.html
- https://radimrehurek.com/gensim/tut3.html
- https://www.textrazor.com/tutorials#tagging
- https://canvasapi.readthedocs.io/en/latest/

In [167]:
import logging, gensim, os, textract
import itertools as it
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Step 1: Save all sentences from a given assignment (all students) into a file
### Start with Unigrams

In [224]:
assignment = "Assignment 5"
save_files = True

assignment_code = assignment.lower().replace(' ', '_')
path = os.path.join(os.getcwd(), "pdfs")

In [225]:
unigram_sentences_filepath = os.path.join(path, 'unigram_sentences_all_' + assignment_code + '.txt')

if save_files:
    with open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for file in os.listdir(path):
            # Only process the files that belong to the chosen assignment
            if assignment in str(file):
                text = textract.process(os.path.join(path, file)).decode("utf-8")
                for sent in gensim.summarization.textcleaner.get_sentences(text):
                    # Default filters: lowercase, strip punctuation and numbers,
                    # remove stopwords and short tokens, then stem
                    processed = preprocess_string(sent)
                    if len(processed) > 0:
                        f.write(" ".join(processed) + "\n")

Checking if it worked:

In [226]:
unigram_sentences = gensim.models.word2vec.LineSentence(unigram_sentences_filepath)
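
`LineSentence` expects exactly the layout written in Step 1: one sentence per line, tokens separated by whitespace. The sketch below is a plain-Python stand-in for that behavior, handy for eyeballing a file without loading gensim. It is not gensim's implementation, and `iter_line_sentences` is a hypothetical helper name.

```python
import itertools as it
import os
import tempfile


def iter_line_sentences(filepath):
    """Yield one list of tokens per non-empty line (hypothetical stand-in for LineSentence)."""
    with open(filepath, encoding="utf_8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens


# Write a tiny sample in the same one-sentence-per-line layout.
sample_lines = ["cut deal cop turn blind ey", "problem", "econom organ crime"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf_8") as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

# Read back only the first two sentences, as the check in the next cell does with islice.
sample_sentences = list(it.islice(iter_line_sentences(sample_path), 0, 2))
os.remove(sample_path)
print(sample_sentences)
```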

In [227]:
for unigram_sentence in it.islice(unigram_sentences, 230, 240):
    print(u' '.join(unigram_sentence))
    print(u'')

cut deal cop turn blind ey

problem

individu dirti cop deepli entrench cultur

depart

violenc ensu disput meet strong drug

demand larg bribe allow violenc continu

kumar vimal

skaperda stergio

econom organ crime

crimin law econom februari



## Step 2: Use the Unigrams to create Bigrams model, and save all sentences to file

In [228]:
bigram_model_filepath = os.path.join(path, 'bigram_model_all_' + assignment_code)
bigram_model = gensim.models.Phrases(unigram_sentences)
if save_files:
    bigram_model.save(bigram_model_filepath)

2018-10-06 17:50:46,355 : INFO : collecting all words and their counts
2018-10-06 17:50:46,358 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-10-06 17:50:46,495 : INFO : PROGRESS: at sentence #10000, processed 45341 words and 31928 word types
2018-10-06 17:50:46,595 : INFO : PROGRESS: at sentence #20000, processed 88508 words and 56809 word types
2018-10-06 17:50:46,692 : INFO : PROGRESS: at sentence #30000, processed 132793 words and 79786 word types
2018-10-06 17:50:46,795 : INFO : PROGRESS: at sentence #40000, processed 177136 words and 101457 word types
2018-10-06 17:50:46,894 : INFO : PROGRESS: at sentence #50000, processed 219625 words and 121591 word types
2018-10-06 17:50:46,996 : INFO : PROGRESS: at sentence #60000, processed 262268 words and 141521 word types
2018-10-06 17:50:47,084 : INFO : PROGRESS: at sentence #70000, processed 306621 words and 159731 word types
2018-10-06 17:50:47,091 : INFO : collected 161262 word types from a corpus of 309968
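
For intuition about what `Phrases` is counting here: with its default scorer, two adjacent tokens are joined with `_` when `(count(a, b) - min_count) / (count(a) * count(b)) * vocabulary_size` exceeds a threshold. The sketch below reimplements that formula in plain Python on a toy corpus. The defaults assumed (`min_count=5`, `threshold=10.0`) and the choice to count both single tokens and pairs in the vocabulary size are from my reading of gensim, and the toy sentences are made up, so treat this as an illustration rather than gensim's actual code.

```python
from collections import Counter


def score_pairs(sentences, min_count=5, threshold=10.0):
    """Return {(a, b): score} for adjacent token pairs whose score exceeds threshold."""
    unigram_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))

    # Assumption: the vocabulary size counts single tokens and pairs together.
    vocab_size = len(unigram_counts) + len(pair_counts)

    accepted = {}
    for (a, b), n_ab in pair_counts.items():
        score = (n_ab - min_count) / (unigram_counts[a] * unigram_counts[b]) * vocab_size
        if score > threshold:
            accepted[(a, b)] = score
    return accepted


# Made-up corpus: "senior citizen" always occurs together,
# while "benefit" is followed by a different word every time.
toy_sentences = [["senior", "citizen"]] * 10
toy_sentences += [["benefit", "filler%d" % i] for i in range(300)]

accepted_pairs = score_pairs(toy_sentences)
print(accepted_pairs)
```

Only the pair that co-occurs consistently survives; pairs seen fewer than `min_count` times get a negative score and are never joined.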

In [229]:
bigram_sentences_filepath = os.path.join(path, 'bigram_sentences_all_' + assignment_code + '.txt')

if save_files:
    with open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = " ".join(bigram_model[unigram_sentence])
            f.write(bigram_sentence + "\n")



Checking if it worked:

In [230]:
bigram_sentences = gensim.models.word2vec.LineSentence(bigram_sentences_filepath)

for bigram_sentence in it.islice(bigram_sentences, 5000, 5010):
    print(u' '.join(bigram_sentence))
    print(u'')

paper discuss piano plai practic help benefit senior_citizen

benefit

outlin overal sens physic mental includ lessen stress

pain medic usag slow ag relat cognit declin feel pleasur

enjoy pride sens accomplish learn new skill creation

mainten social connect mean creativ self express construct

ident time life sens ident flux

“educ gamif

game_base learn compar study”

paper discuss differ game element motiv user long continu game



## Step 3: Use the Bigrams to create Trigrams model, and save all sentences to file

In [231]:
trigram_model_filepath = os.path.join(path, 'trigram_model_all_' + assignment_code)
trigram_model = gensim.models.Phrases(bigram_sentences)
if save_files:
    trigram_model.save(trigram_model_filepath)

2018-10-06 17:50:48,939 : INFO : collecting all words and their counts
2018-10-06 17:50:48,941 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-10-06 17:50:49,056 : INFO : PROGRESS: at sentence #10000, processed 42578 words and 32354 word types
2018-10-06 17:50:49,140 : INFO : PROGRESS: at sentence #20000, processed 82887 words and 57898 word types
2018-10-06 17:50:49,223 : INFO : PROGRESS: at sentence #30000, processed 124477 words and 81675 word types
2018-10-06 17:50:49,307 : INFO : PROGRESS: at sentence #40000, processed 166229 words and 104210 word types
2018-10-06 17:50:49,389 : INFO : PROGRESS: at sentence #50000, processed 206262 words and 125109 word types
2018-10-06 17:50:49,470 : INFO : PROGRESS: at sentence #60000, processed 246579 words and 145792 word types
2018-10-06 17:50:49,556 : INFO : PROGRESS: at sentence #70000, processed 287723 words and 165004 word types
2018-10-06 17:50:49,562 : INFO : collected 166604 word types from a corpus of 290857

In [232]:
trigram_sentences_filepath = os.path.join(path, 'trigram_sentences_all_' + assignment_code + '.txt')

if save_files:
    with open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = " ".join(trigram_model[bigram_sentence])
            f.write(trigram_sentence + "\n")



Checking if it worked:

In [233]:
trigram_sentences = gensim.models.word2vec.LineSentence(trigram_sentences_filepath)

for trigram_sentence in it.islice(trigram_sentences, 5000, 5010):
    print(u' '.join(trigram_sentence))
    print(u'')

paper discuss piano plai practic help benefit senior_citizen

benefit

outlin overal sens physic mental includ lessen stress

pain medic usag slow ag relat cognit declin feel pleasur

enjoy pride sens accomplish learn new skill creation

mainten social connect mean creativ self express construct

ident time life sens ident flux

“educ gamif

game_base_learn compar study”

paper discuss differ game element motiv user long continu game
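
Comparing this output with the bigram check in Step 2 shows how trigrams emerge: the second `Phrases` pass sees `game_base` as a single token, so it can join it with `learn`. The toy sketch below applies a greedy left-to-right merge twice. `merge_pairs` and the hand-picked pair sets are hypothetical stand-ins for the trained `Phrases` models.

```python
def merge_pairs(tokens, accepted_pairs, delimiter="_"):
    """Greedily join adjacent tokens whose pair is in accepted_pairs."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in accepted_pairs:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # skip the token that was just absorbed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


unigram_sentence = ["game", "base", "learn", "compar"]
first_pass_pairs = {("game", "base")}         # assumed to pass the threshold in pass 1
second_pass_pairs = {("game_base", "learn")}  # assumed to pass the threshold in pass 2

bigram_sentence = merge_pairs(unigram_sentence, first_pass_pairs)
trigram_sentence = merge_pairs(bigram_sentence, second_pass_pairs)
print(bigram_sentence)
print(trigram_sentence)
```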



## Step 4: Use the Trigram model to transform all documents, and save each document in one line of a big file containing all transformed documents

In [234]:
trigram_assignments_filepath = os.path.join(path, 'trigram_transformed_assignments_all_' + assignment_code + '.txt')

if save_files:
    with open(trigram_assignments_filepath, 'w', encoding='utf_8') as f:
        for file in os.listdir(path):
            if assignment in str(file):
                text = textract.process(os.path.join(path, file)).decode("utf-8")
                unigram_assignment = preprocess_string(text)
                bigram_assignment = bigram_model[unigram_assignment]
                trigram_assignment = trigram_model[bigram_assignment]
                f.write(" ".join(trigram_assignment) + "\n") 

## Step 5: Checking the work!

In [235]:
from random import randrange

file_list = []
for file in os.listdir(path):
    if assignment in str(file):
        file_list.append(file)

random_index = randrange(len(file_list))
random_filename = file_list[random_index]

In [236]:
# Read with the same encoding the file was written with
with open(trigram_assignments_filepath, encoding='utf_8') as f:
    trigram_assignments_list = f.readlines()

trigram_assignments_list = [x.strip() for x in trigram_assignments_list]

In [237]:
original = textract.process(os.path.join(path, random_filename)).decode("utf-8")
print(original)

Assignment #5, CS6460 ET, Fall 2018, Chris Hockett (chockett3)

Assignment: Collecting Your Sources
Introduction
For my project, I intend to create a tool which will visualize the process of lexical analysis. If I am
exceptionally proficient, I will tackle a tool which will visualize the process of grammatical analysis too
(to be clear, this last piece is a stretch goal). A tool set like this stimulates learning for those who most
apply their spatial intelligence during study. I happen to be a learner who fits into this category‐ once I
can see how something works (i.e. its behavior) I almost immediately understand what I’m studying.
The project will require research in multiple areas:
1) Lexical Analysis;
2) Graphical Visualization;
3) Spatial Intelligence and its associated best practice for teaching these learners.
With that in mind, here is my annotated bibliography, subdivided into the relevant categories.

Lexical Analysis
van Engelen, Robert (6/27/2017) Constructing Fast Lexical

In [238]:
processed = trigram_assignments_list[random_index]
print(processed)

assign_fall_chri_hockett chockett assign_collect_sourc introduct project intend creat tool visual process lexic_analysi exception profici tackl tool visual process grammat analysi clear piec stretch goal tool set like stimul learn appli spatial_intellig studi happen learner fit category‐ work behavior immedi understand i’m studi project requir research multipl area lexic_analysi graphic visual spatial_intellig associ best_practic teach learner mind annot_bibliographi subdivid relev categori lexic_analysi van engelen robert construct fast lexic_analyz flex retriev genivia websit http_www genivia com websit outlin lexer reflect tool closest imagin graph lexic_analysi state machin address visual mind‐ visual actual process input stream websit give import inform tool drill deepli actual mechan lexic_analysi inner work lexer valu project proof fact idea realiz brouwer gellerich ploedered myth fact effici implement finit automata lexic_analysi koskimi ed compil construct lectur_note_scienc v

Working perfectly! Now, let's train Doc2Vec!

# Training Doc2Vec

In [239]:
file_list = []
for file in os.listdir(path):
    if assignment in str(file):
        file_list.append(file)

def get_student_name(file_index):
    return file_list[file_index].split(" -")[0]
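
`get_student_name` relies on the file naming pattern `<student> - <assignment>.pdf`, and on line `i` of the transformed file matching entry `i` of `file_list`. The sketch below shows the name extraction on made-up file names. It also sorts the list, which is a suggestion rather than something this notebook does: `os.listdir` returns entries in arbitrary order, so sorting (in Step 4 as well as here) would make the index-to-line mapping reproducible.

```python
# Hypothetical file names following the "<student> - <assignment>.pdf" pattern.
example_files = [
    "Jane Doe - Assignment 5.pdf",
    "Alex Smith - Assignment 5.pdf",
    "Jane Doe - Assignment 4.pdf",
]
example_assignment = "Assignment 5"

# Keep only the chosen assignment and fix the order by sorting.
example_file_list = sorted(f for f in example_files if example_assignment in f)


def example_student_name(file_index):
    """Return the part of the file name before ' -', i.e. the student name."""
    return example_file_list[file_index].split(" -")[0]


example_names = [example_student_name(i) for i in range(len(example_file_list))]
print(example_names)
```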

In [240]:
def read_corpus(fname, tokens_only=False):
    with open(fname, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [get_student_name(i)])

In [241]:
train_corpus = list(read_corpus(trigram_assignments_filepath))
test_corpus = list(read_corpus(trigram_assignments_filepath, tokens_only=True))
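
One thing to watch for in `read_corpus`: as far as I know, `gensim.utils.simple_preprocess` keeps only tokens between 2 and 15 characters long by default (`min_len=2`, `max_len=15`). If that is right, some of the longer phrase tokens built in Steps 2 and 3 never reach the model. The sketch below applies that assumed length window to tokens copied from the processed document printed in Step 5. It is a plain-Python approximation of the filter, not a call to gensim.

```python
# Tokens copied from the processed document printed in Step 5.
phrase_tokens = [
    "assign_fall_chri_hockett",
    "chockett",
    "assign_collect_sourc",
    "lexic_analysi",
    "spatial_intellig",
    "best_practic",
    "annot_bibliographi",
    "lectur_note_scienc",
]

# Assumed defaults of simple_preprocess; check them against your gensim version.
MIN_LEN, MAX_LEN = 2, 15

kept_tokens = [t for t in phrase_tokens if MIN_LEN <= len(t) <= MAX_LEN]
dropped_tokens = [t for t in phrase_tokens if not MIN_LEN <= len(t) <= MAX_LEN]

print("kept:   ", kept_tokens)
print("dropped:", dropped_tokens)
```

Since each line of the file is already pre-processed, splitting on whitespace with `line.split()` would be one way to keep every token.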

In [242]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=300, min_count=5, epochs=100)

In [243]:
model.build_vocab(train_corpus)

2018-10-06 17:51:43,325 : INFO : collecting all words and their counts
2018-10-06 17:51:43,327 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-10-06 17:51:43,392 : INFO : collected 18055 word types and 184 unique tags from a corpus of 184 examples and 280049 words
2018-10-06 17:51:43,393 : INFO : Loading a fresh vocabulary
2018-10-06 17:51:43,456 : INFO : effective_min_count=5 retains 4736 unique words (26% of original 18055, drops 13319)
2018-10-06 17:51:43,456 : INFO : effective_min_count=5 leaves 259371 word corpus (92% of original 280049, drops 20678)
2018-10-06 17:51:43,479 : INFO : deleting the raw counts dictionary of 18055 items
2018-10-06 17:51:43,482 : INFO : sample=0.001 downsamples 48 most-common words
2018-10-06 17:51:43,485 : INFO : downsampling leaves estimated 235262 word corpus (90.7% of prior 259371)
2018-10-06 17:51:43,505 : INFO : estimated required memory for 4736 words and 300 dimensions: 13992000 bytes
2018-10-06 17:51:43,507 

In [244]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

2018-10-06 17:51:44,896 : INFO : training model with 3 workers on 4736 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2018-10-06 17:51:45,204 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:51:45,207 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:51:45,212 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:51:45,213 : INFO : EPOCH - 1 : training on 280049 raw words (235273 effective words) took 0.3s, 752419 effective words/s
2018-10-06 17:51:45,472 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:51:45,475 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:51:45,489 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:51:45,489 : INFO : EPOCH - 2 : training on 280049 raw words (235303 effective words) took 0.3s, 855507 effective words/s
2018-10-06 17:51:45,791 : INFO : wo

2018-10-06 17:51:51,649 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:51:51,651 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:51:51,662 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:51:51,663 : INFO : EPOCH - 21 : training on 280049 raw words (235600 effective words) took 0.3s, 873778 effective words/s
2018-10-06 17:51:51,927 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:51:51,946 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:51:51,949 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:51:51,949 : INFO : EPOCH - 22 : training on 280049 raw words (235443 effective words) took 0.3s, 828805 effective words/s
2018-10-06 17:51:52,219 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:51:52,226 : INFO : worker thread finished; awaiting finish of 1 more threads


2018-10-06 17:51:57,658 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:51:57,672 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:51:57,673 : INFO : EPOCH - 41 : training on 280049 raw words (235596 effective words) took 0.3s, 722080 effective words/s
2018-10-06 17:51:57,951 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:51:57,962 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:51:57,965 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:51:57,966 : INFO : EPOCH - 42 : training on 280049 raw words (235481 effective words) took 0.3s, 810750 effective words/s
2018-10-06 17:51:58,237 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:51:58,239 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:51:58,241 : INFO : worker thread finished; awaiting finish of 0 more threads


2018-10-06 17:52:03,568 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:52:03,568 : INFO : EPOCH - 61 : training on 280049 raw words (235520 effective words) took 0.3s, 914488 effective words/s
2018-10-06 17:52:03,812 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:52:03,819 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:52:03,821 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:52:03,823 : INFO : EPOCH - 62 : training on 280049 raw words (235372 effective words) took 0.3s, 930516 effective words/s
2018-10-06 17:52:04,127 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:52:04,131 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:52:04,144 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:52:04,144 : INFO : EPOCH - 63 : training on 280049 raw words (235318 effectiv

2018-10-06 17:52:09,311 : INFO : EPOCH - 81 : training on 280049 raw words (235475 effective words) took 0.2s, 961143 effective words/s
2018-10-06 17:52:09,561 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:52:09,566 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:52:09,571 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:52:09,572 : INFO : EPOCH - 82 : training on 280049 raw words (235481 effective words) took 0.3s, 910550 effective words/s
2018-10-06 17:52:09,804 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-06 17:52:09,806 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-06 17:52:09,809 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-06 17:52:09,809 : INFO : EPOCH - 83 : training on 280049 raw words (235408 effective words) took 0.2s, 1001668 effective words/s
2018-10-06 17:52:10,050 : INFO : worker threa

CPU times: user 1min 12s, sys: 844 ms, total: 1min 13s
Wall time: 29.2 s


In [245]:
doc2vec_model_filepath = os.path.join(path, 'doc2vec_model_' + assignment_code)
if save_files:
    model.save(doc2vec_model_filepath)

2018-10-06 17:52:27,602 : INFO : saving Doc2Vec object under /Users/carlossouza/Dropbox/playground/gensim/pdfs/doc2vec_model_assignment_5, separately None
2018-10-06 17:52:27,750 : INFO : saved /Users/carlossouza/Dropbox/playground/gensim/pdfs/doc2vec_model_assignment_5


# Checking the result!

In [246]:
test_student = "Carlos Souza"
test_student_index = -1

In [247]:
similar_doc = model.docvecs.most_similar(test_student) 
print("Most similar documents of " + test_student + " for " + assignment)
for item in similar_doc:
    print(str(item[1])  + " \t" + str(item[0]))

2018-10-06 17:52:33,255 : INFO : precomputing L2-norms of doc weight vectors


Most similar documents of Carlos Souza for Assignment 5
0.3223743438720703 	Abdiel Sanchez
0.2908666133880615 	Steven Gray
0.28265196084976196 	Alaa Shafaee
0.2778726816177368 	Christine Mcmanus
0.25868552923202515 	Navnit Belur
0.2560320496559143 	Sayed Ali
0.2442658394575119 	Bhaskar Majji
0.23750287294387817 	Aravind Rajamani
0.23571227490901947 	Zeya Aung
0.2298918217420578 	Terrence Chida
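
The scores above are cosine similarities between document vectors: 1 means the vectors point in the same direction, 0 means they are orthogonal. A minimal plain-Python sketch of that ranking, using made-up 2-dimensional vectors and hypothetical tags in place of the 300-dimensional vectors learned by the model:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_most_similar(target, vectors, topn=10):
    """Return up to topn (tag, similarity) pairs, most similar first, excluding the target."""
    scores = [
        (tag, cosine_similarity(vectors[target], vec))
        for tag, vec in vectors.items()
        if tag != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Made-up vectors for three hypothetical documents.
toy_vectors = {
    "student_a": [1.0, 0.0],
    "student_b": [0.9, 0.1],
    "student_c": [0.0, 1.0],
}

toy_ranking = rank_most_similar("student_a", toy_vectors)
for tag, similarity in toy_ranking:
    print(str(similarity) + " \t" + tag)
```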



# Verifying Topics with TextRazor

In [248]:
import textrazor

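# Read the API key from the environment instead of hard-coding it in the notebook
# (the variable name TEXTRAZOR_API_KEY is an assumption; use whatever name you export)
textrazor.api_key = os.environ["TEXTRAZOR_API_KEY"]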
client = textrazor.TextRazor(extractors=["entities", "topics"])

In [249]:
def print_top_entities(student, limit=15):
    # Prints up to `limit` distinct entities as: id, relevance score, confidence score
    text = textract.process(os.path.join(path, student + " - " + assignment + ".pdf")).decode("utf-8")
    response = client.analyze(text)
    entities = list(response.entities())
    # Most relevant entities first
    entities.sort(key=lambda x: x.relevance_score, reverse=True)
    seen = set()
    index = 0
    for entity in entities:
        # The same entity can be matched several times in the text; print it only once
        if entity.id not in seen:
            print(entity.id, entity.relevance_score, entity.confidence_score)
            seen.add(entity.id)
            index += 1
            if index >= limit:
                break

In [250]:
print_top_entities("Carlos Souza")

Implicit theories of intelligence 1 3.858
Grit (personality trait) 1 2.91
Praise 1 2.898
Practice (learning method) 1 1.722
Constructivism (philosophy of education) 1 13.52
Educational technology 0.9193 10.65
Carol Dweck 0.9027 2.579
Problem solving 0.8731 8.915
Internet bot 0.8699 3.106
Cognitive tutor 0.8641 5.617
Big Five personality traits 0.8641 2.311
Adaptive learning 0.862 2.434
Self-regulated learning 0.8504 6.654
Motivation 0.8363 5.049
Chatbot 0.8359 9.671


In [252]:
print_top_entities("Christine Mcmanus")

Social cognitive theory 1 1.827
Self-efficacy 1 11.25
Problem solving 1 9.216
Technology integration 1 1.396
National Council of Teachers of Mathematics 1 5.684
Self-confidence 0.9839 3.844
Albert Bandura 0.9664 1.417
Qualitative research 0.9602 4.475
Educational technology 0.9587 4.621
Carol Dweck 0.9412 4.922
Instructional scaffolding 0.9224 2.379
Motivation 0.9167 5.581
Numeracy 0.9046 7.64
Science, technology, engineering, and mathematics 0.8968 7.094
Gender role 0.8842 1.342


Further reading:
https://cs.stanford.edu/~quocle/paragraph_vector.pdf