# Tutorial: Using Doc2Vec to calculate similarity between documents

## Overview
Duration: 1min

This tutorial will show you how to use Doc2Vec to calculate similarity scores between pairs of documents. Doc2Vec is an NLP tool for representing documents as vectors and is a generalization of the Word2Vec method.

Before you start, follow the instructions in [README.md](https://github.com/ucals/bettertogether/blob/master/README.md) to get the PDF documents onto your local machine. This corpus, which will be located in the `pdfs/` folder, contains ~1,000 documents. They are essays written by OMSCS CS-6460 students in the Fall 2018 class.

### What you'll learn

- How to pre-process text so it is ready for NLP techniques, using the [Gensim library](https://radimrehurek.com/gensim/)
- How to train a Doc2Vec model, also using the [Gensim library](https://radimrehurek.com/gensim/)
- How to check the result by eye, using the [TextRazor](https://www.textrazor.com/) classification API to see whether the similarities from our trained model are plausible

### What you'll need

- The corpus: a collection of PDF files corresponding to Assignments 2, 3 and 4 from OMSCS CS-6460 students
- Some basic Python knowledge

### Steps:

1. Preprocess the text: 10min
2. Train Doc2Vec model: 5min
3. Test the result: 10min

## Step 1: Preprocess the text

The first step is to preprocess the text so it is ready for training the Doc2Vec model. We combine all documents for an assignment into a single file called `unigram_sentences_all_assignment_X.txt`, where X is 2, 3 or 4 depending on the assignment. In this file, each line is one sentence from the assignment documents. We extract the text of each document, split it into sentences, and pre-process each sentence with gensim's `preprocess_string` (which tokenizes the text into individual words, removes punctuation and stopwords, lowercases, stems, etc.), giving a list of words. Then we join each sentence's words into one space-separated string and write it to the file.
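To make the per-sentence cleaning concrete, here is a minimal standard-library sketch of the kind of work `preprocess_string` does. It is an approximation, not gensim's implementation: the stopword set `STOPWORDS` and the helper `simple_clean` are hypothetical names, and stemming (which gensim's default filters also apply, turning `learning` into `learn`) is left out.

```python
import re

# Hypothetical, tiny stopword list; gensim ships a much longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "by"}

def simple_clean(sentence, min_len=3):
    """Lowercase, drop punctuation and digits, then remove stopwords and short tokens."""
    lowered = sentence.lower()
    # Replace anything that is not an ASCII letter or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = letters_only.split()
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]

tokens = simple_clean("Canvas and T-Square are examples of LMS systems, used in 2018!")
print(" ".join(tokens))
```

Each cleaned sentence is then written as one space-separated line, which is the format the later steps read back from disk.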

In [1]:
# Basic setup
import logging, gensim, os, textract
import itertools as it
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
assignment = "Assignment 2"
save_files = True

assignment_code = assignment.lower().replace(' ', '_')
path = os.path.join(os.getcwd(), "pdfs")

In [3]:
unigram_sentences_filepath = os.path.join(path, 'unigram_sentences_all_' + assignment_code + '.txt')

if save_files:
    with open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for file in os.listdir(path):
            if assignment in str(file):
                text = textract.process(os.path.join(path, file)).decode("utf-8")
                # Split the document into sentences (gensim.summarization requires Gensim < 4.0)
                for sent in gensim.summarization.textcleaner.get_sentences(text):
                    processed = preprocess_string(sent)
                    if len(processed) > 0:
                        f.write(" ".join(processed) + "\n")

Gensim's LineSentence class provides a convenient iterator for working with other gensim components. It streams the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora:

In [4]:
unigram_sentences = gensim.models.word2vec.LineSentence(unigram_sentences_filepath)
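The streaming idea behind `LineSentence` can be sketched with the standard library alone. The class `StreamingSentences` below is a hypothetical stand-in, not gensim's code: it re-opens the file on every pass, so only one line is held in memory at a time and the corpus can be iterated repeatedly (which training over several epochs requires).

```python
import os
import tempfile

class StreamingSentences:
    """Re-iterable corpus: yields one token list per line, reading lazily from disk."""

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated many times.
        with open(self.filepath, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# Tiny demonstration file standing in for unigram_sentences_all_assignment_2.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("mobil learn fellow\nsocial media usag\n")
    demo_path = tmp.name

sentences = StreamingSentences(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)  # works again, unlike a one-shot generator
os.remove(demo_path)
```

A plain generator function would be exhausted after the first pass; wrapping the file in an object with `__iter__` is what makes repeated passes possible.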

Now, we need to apply *phrase modeling* to combine tokens that together represent meaningful multi-word concepts (bi-grams and tri-grams). After applying it, `new york` would become `new_york`; `new york times` would become `new_york_times`. 

First, let's create bi-grams from the uni-grams:

In [5]:
bigram_model_filepath = os.path.join(path, 'bigram_model_all_' + assignment_code)
bigram_model = gensim.models.Phrases(unigram_sentences)
if save_files:
    bigram_model.save(bigram_model_filepath)

2018-10-11 23:44:57,631 : INFO : collecting all words and their counts
2018-10-11 23:44:57,634 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-10-11 23:44:57,743 : INFO : PROGRESS: at sentence #10000, processed 45671 words and 31392 word types
2018-10-11 23:44:57,827 : INFO : collected 53046 word types from a corpus of 86396 words (unigram + bigrams) and 18830 sentences
2018-10-11 23:44:57,827 : INFO : using 53046 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2018-10-11 23:44:57,828 : INFO : saving Phrases object under /Users/carlossouza/Dropbox/2018/OMSCS/CS-6460-Education-Technology/bettertogether/pdfs/bigram_model_all_assignment_2, separately None
2018-10-11 23:44:57,904 : INFO : saved /Users/carlossouza/Dropbox/2018/OMSCS/CS-6460-Education-Technology/bettertogether/pdfs/bigram_model_all_assignment_2
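
The `min_count=5` and `threshold=10.0` values in the log above control which adjacent tokens get merged. As a rough sketch of the default scoring rule (the one described by Mikolov et al. and, to my understanding, used by gensim's default scorer), a pair is merged when `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))` exceeds the threshold. The function name `phrase_score` and all counts below are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score of the pair (a, b): high when the pair co-occurs more than chance suggests."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts for illustration only (not taken from the corpus).
vocab_size = 1000
frequent_pair = phrase_score(count_a=40, count_b=50, count_ab=30, vocab_size=vocab_size)
rare_pair = phrase_score(count_a=400, count_b=500, count_ab=6, vocab_size=vocab_size)

threshold = 10.0
print(frequent_pair, frequent_pair > threshold)  # pair would be merged
print(rare_pair, rare_pair > threshold)          # pair would be left as two tokens
```

Raising the threshold produces fewer, more reliable phrases; lowering it merges more aggressively.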


In [8]:
bigram_sentences_filepath = os.path.join(path, 'bigram_sentences_all_' + assignment_code + '.txt')

if save_files:
    with open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = " ".join(bigram_model[unigram_sentence])
            f.write(bigram_sentence + "\n")
            
bigram_sentences = gensim.models.word2vec.LineSentence(bigram_sentences_filepath)



We can check that it worked by examining a small slice of our file:

In [13]:
bigram_sentences = gensim.models.word2vec.LineSentence(bigram_sentences_filepath)

for bigram_sentence in it.islice(bigram_sentences, 3000, 3010):
    print(u' '.join(bigram_sentence))
    print(u'')

inform tent project cours onlin languag exam app

gener work field languag math_scienc

mobil_devic

mobil learn fellow text http blog acu edu adamscent fund mobilelearn fellow present research abilen christian univers

interest total unexpect result phone facebook usag

correl spiritu

excess social comput associ decreas religi wellb overal life

satisfact

person reloc origin home countri facebook social_media

wai mean keep relat old friend famili think



As you can see, the algorithm combined some tokens, forming bi-grams like `mobil_devic`, `social_media`, etc.

Now, let's use the bi-grams to create tri-grams:

In [9]:
trigram_model_filepath = os.path.join(path, 'trigram_model_all_' + assignment_code)
trigram_model = gensim.models.Phrases(bigram_sentences)
if save_files:
    trigram_model.save(trigram_model_filepath)

2018-10-11 23:46:33,069 : INFO : collecting all words and their counts
2018-10-11 23:46:33,071 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-10-11 23:46:33,173 : INFO : PROGRESS: at sentence #10000, processed 43031 words and 31843 word types
2018-10-11 23:46:33,248 : INFO : collected 54074 word types from a corpus of 81319 words (unigram + bigrams) and 18830 sentences
2018-10-11 23:46:33,248 : INFO : using 54074 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2018-10-11 23:46:33,249 : INFO : saving Phrases object under /Users/carlossouza/Dropbox/2018/OMSCS/CS-6460-Education-Technology/bettertogether/pdfs/trigram_model_all_assignment_2, separately None
2018-10-11 23:46:33,317 : INFO : saved /Users/carlossouza/Dropbox/2018/OMSCS/CS-6460-Education-Technology/bettertogether/pdfs/trigram_model_all_assignment_2


In [11]:
trigram_sentences_filepath = os.path.join(path, 'trigram_sentences_all_' + assignment_code + '.txt')

if save_files:
    with open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = " ".join(trigram_model[bigram_sentence])
            f.write(trigram_sentence + "\n")
            
trigram_sentences = gensim.models.word2vec.LineSentence(trigram_sentences_filepath)
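
Why does running the phrase model a second time yield tri-grams? The sketch below illustrates the idea with a hypothetical helper `merge_phrases` and hand-picked phrase sets (the real phrase models learn theirs from corpus statistics): the second pass sees `new_york` as a single token, so it can pair it with the next word.

```python
def merge_phrases(tokens, phrases, sep="_"):
    """Join adjacent token pairs that appear in `phrases`, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip the token that was just absorbed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical phrase sets; the real ones are learned, not hand-written.
first_pass_phrases = {("new", "york")}
second_pass_phrases = {("new_york", "times")}

sentence = ["read", "new", "york", "times", "daily"]
bigrams = merge_phrases(sentence, first_pass_phrases)
trigrams = merge_phrases(bigrams, second_pass_phrases)
print(bigrams)
print(trigrams)
```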

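Finally, we apply both phrase models to the full text of each assignment and write one document per line to `trigram_transformed_assignments_all_assignment_X.txt`. This is the file the Doc2Vec model will be trained on in Step 2:
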
In [25]:
trigram_assignments_filepath = os.path.join(path, 'trigram_transformed_assignments_all_' + assignment_code + '.txt')

if save_files:
    with open(trigram_assignments_filepath, 'w', encoding='utf_8') as f:
        for file in os.listdir(path):
            if assignment in str(file):
                text = textract.process(os.path.join(path, file)).decode("utf-8")
                unigram_assignment = preprocess_string(text)
                bigram_assignment = bigram_model[unigram_assignment]
                trigram_assignment = trigram_model[bigram_assignment]
                f.write(" ".join(trigram_assignment) + "\n") 

Let's examine a slice of our tri-gram file to check that it worked as well:

In [18]:
for trigram_sentence in it.islice(trigram_sentences, 1000, 1010):
    print(u' '.join(trigram_sentence))
    print(u'')

huhn

don’t topic

develop game believ game teach great design

implement challeng

simul_base_learn interest allow student experi environ

abl experi

provid wai experi

activ danger time student fly airplan

note learn theori “simul base_learn constructivist learn model

provid learner experi work usual simplifi simul world



As you can see, it worked: our algorithm created tri-grams like `simul_base_learn` (simulation-based learning).

### Comparing the original version with the preprocessed version of an assignment

Let's see exactly what we did by randomly selecting an assignment and comparing a slice (the first 2000 characters) of its original version with a slice of its preprocessed version:

In [26]:
from random import randrange

file_list = []
for file in os.listdir(path):
    if assignment in str(file):
        file_list.append(file)

random_index = randrange(len(file_list))
random_filename = file_list[random_index]

with open(trigram_assignments_filepath) as f:
    trigram_assignments_list = f.readlines()

trigram_assignments_list = [x.strip() for x in trigram_assignments_list]

In [27]:
original = textract.process(os.path.join(path, random_filename)).decode("utf-8")
print(original[:2000])

Tyler Roland
CS 6460
Assignment 2

Learning Management Systems (LMS) are a powerful educational tool that can be used by
numerous fields, from education to medicine to software and technology. They incorporate different
components to help facilitate learning that can be modified, scaled, and directed to different audiences,
all while using the same program. For example, schools can choose to use LMS systems to help their
students keep track of assignments for each class, communicate with their peers and professors,
monitor grades and academic progress, and access learning tools such as a school library or other
important documents. Canvas and T-Square are examples of LMS systems that are currently being used
by Georgia Tech. Software and technology professionals can include other types of LMS systems to help
train companies and users on how to use their software. There are many types of learning management
systems around today, and each has their advantages and disadvantages, but there

In [28]:
processed = trigram_assignments_list[random_index]
print(processed[:2000])

tyler roland assign learn_manag_system lm power educ tool numer field educ medicin softwar technolog incorpor differ compon help facilit learn modifi scale direct differ audienc program exampl school choos us lm system help student track assign class commun peer professor monitor grade academ progress access learn tool school librari import document canva squar exampl lm system current georgia_tech softwar technolog profession includ type lm system help train compani user us softwar type learn_manag_system todai advantag disadvantag gener improv ad date advanc technolog onlin learn quickli grow field school incorpor onlin learn includ student wouldn’t abl attend time travel constraint allow collabor individu world student omsc_program georgia_tech experi benefit onlin learn hand advantag onlin learn self pace learn cours content train content view review time dai increas access class unavail person understaf high labor cost higher qualiti learn differ tool adapt differ learn style link

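As you can see, the preprocessing worked as intended: the text is lowercased and stemmed, stopwords and punctuation are gone, and phrases such as `learn_manag_system` and `georgia_tech` have been merged into single tokens.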

## Step 2: Train Doc2Vec model

Before we train our model, let's define a function `read_corpus` that opens the train/test file (UTF-8 encoded), reads it line by line, pre-processes each line with gensim's `simple_preprocess` (which tokenizes the text into lowercase words and drops punctuation), and yields a list of words. To train the model, we also need to associate a tag with each line of the training corpus. In our case, the tag is the student's name, returned by the `get_student_name` function:

In [30]:
def read_corpus(fname, tokens_only=False):
    with open(fname, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [get_student_name(i)])
                
def get_student_name(file_index):
    return file_list[file_index].split(" -")[0]

In [31]:
train_corpus = list(read_corpus(trigram_assignments_filepath))
test_corpus = list(read_corpus(trigram_assignments_filepath, tokens_only=True))
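
The tagging step can be illustrated without gensim. In the sketch below, `TaggedDoc` is a hypothetical stand-in with the same two fields (`words`, `tags`) as gensim's `TaggedDocument`, and the file names and document lines are made up, following the `<name> - <assignment>.pdf` pattern used elsewhere in this tutorial. Note the assumption it relies on: line `i` of the corpus file must correspond to entry `i` of `file_list`.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def student_name(filename):
    """Extract the student name from '<name> - <assignment>.pdf'."""
    return filename.split(" -")[0]

# Hypothetical file names and document lines, in matching order.
file_list = ["Carlos Souza - Assignment 2.pdf", "Alaa Shafaee - Assignment 2.pdf"]
lines = ["intellig tutor system social_media", "intellig tutor system metacognit"]

corpus = [TaggedDoc(line.split(), [student_name(file_list[i])])
          for i, line in enumerate(lines)]
for doc in corpus:
    print(doc.tags, doc.words)
```

Because the pairing is purely positional, rebuilding `file_list` after files have been added to or removed from the folder would silently attach the wrong names to documents.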

Now, we'll instantiate a Doc2Vec model with a vector size of 300 dimensions that iterates over the training corpus 100 times. We set the minimum word count to 5 in order to discard words with very few occurrences. (Without a variety of representative examples, retaining such infrequent words can often make a model worse!) Typical iteration counts in published 'Paragraph Vectors' results, using tens of thousands to millions of docs, are 10-20. More iterations take more time and eventually reach a point of diminishing returns.

In [32]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=300, min_count=5, epochs=100)

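Now, let's build a vocabulary. Essentially, the vocabulary is a dictionary (accessible via `model.wv.vocab` in the Gensim 3.x release used here) of all the unique words extracted from the training corpus, along with their counts (e.g., `model.wv.vocab['penalty'].count` gives the count for the word `penalty`). In Gensim 4.0 and later, `model.wv.vocab` was replaced by `model.wv.key_to_index`, and counts are read with `model.wv.get_vecattr('penalty', 'count')`.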

In [33]:
model.build_vocab(train_corpus)

2018-10-12 00:11:42,369 : INFO : collecting all words and their counts
2018-10-12 00:11:42,372 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-10-12 00:11:42,392 : INFO : collected 7207 word types and 179 unique tags from a corpus of 179 examples and 77911 words
2018-10-12 00:11:42,394 : INFO : Loading a fresh vocabulary
2018-10-12 00:11:42,405 : INFO : effective_min_count=5 retains 2128 unique words (29% of original 7207, drops 5079)
2018-10-12 00:11:42,406 : INFO : effective_min_count=5 leaves 69681 word corpus (89% of original 77911, drops 8230)
2018-10-12 00:11:42,414 : INFO : deleting the raw counts dictionary of 7207 items
2018-10-12 00:11:42,416 : INFO : sample=0.001 downsamples 56 most-common words
2018-10-12 00:11:42,422 : INFO : downsampling leaves estimated 61600 word corpus (88.4% of prior 69681)
2018-10-12 00:11:42,434 : INFO : estimated required memory for 2128 words and 300 dimensions: 6421800 bytes
2018-10-12 00:11:42,435 : INFO : re
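
The effect of `min_count` reported in the log above (only part of the word types survive, but they cover most of the running text) can be reproduced in miniature with a `Counter`. This is a sketch with a hypothetical helper `prune_vocab` and a toy corpus, not gensim's vocabulary code.

```python
from collections import Counter

def prune_vocab(documents, min_count):
    """Count every token and keep only those seen at least `min_count` times."""
    counts = Counter(token for doc in documents for token in doc)
    kept = {token: n for token, n in counts.items() if n >= min_count}
    return counts, kept

# Hypothetical toy corpus, far smaller than the tutorial's.
docs = [
    ["learn", "onlin", "learn", "student"],
    ["student", "learn", "game"],
    ["learn", "student", "simul"],
]
counts, kept = prune_vocab(docs, min_count=3)
retained_share = len(kept) / len(counts)
print(kept, retained_share)
```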

Now, let's train our model. It should take ~10 seconds to run:

In [34]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

2018-10-12 00:12:52,812 : INFO : training model with 3 workers on 2128 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2018-10-12 00:12:52,909 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:52,924 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:52,928 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:52,930 : INFO : EPOCH - 1 : training on 77911 raw words (61881 effective words) took 0.1s, 618051 effective words/s
2018-10-12 00:12:53,027 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:53,028 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:53,034 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:53,035 : INFO : EPOCH - 2 : training on 77911 raw words (61737 effective words) took 0.1s, 624358 effective words/s
2018-10-12 00:12:53,117 : INFO : worker

2018-10-12 00:12:54,920 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:54,940 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:54,943 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:54,944 : INFO : EPOCH - 21 : training on 77911 raw words (61789 effective words) took 0.1s, 611655 effective words/s
2018-10-12 00:12:55,020 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:55,040 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:55,042 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:55,043 : INFO : EPOCH - 22 : training on 77911 raw words (61799 effective words) took 0.1s, 647797 effective words/s
2018-10-12 00:12:55,125 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:55,141 : INFO : worker thread finished; awaiting finish of 1 more threads
2018

2018-10-12 00:12:56,687 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:56,693 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:56,694 : INFO : EPOCH - 41 : training on 77911 raw words (61773 effective words) took 0.1s, 595553 effective words/s
2018-10-12 00:12:56,785 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:56,807 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:56,809 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:56,810 : INFO : EPOCH - 42 : training on 77911 raw words (61795 effective words) took 0.1s, 549440 effective words/s
2018-10-12 00:12:56,892 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:56,918 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:56,920 : INFO : worker thread finished; awaiting finish of 0 more threads
2018

2018-10-12 00:12:58,493 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:58,494 : INFO : EPOCH - 61 : training on 77911 raw words (61798 effective words) took 0.1s, 701348 effective words/s
2018-10-12 00:12:58,561 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:58,573 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:58,580 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:58,580 : INFO : EPOCH - 62 : training on 77911 raw words (61779 effective words) took 0.1s, 732284 effective words/s
2018-10-12 00:12:58,642 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:12:58,656 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:12:58,657 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:12:58,658 : INFO : EPOCH - 63 : training on 77911 raw words (61748 effective word

2018-10-12 00:13:00,140 : INFO : EPOCH - 81 : training on 77911 raw words (61729 effective words) took 0.1s, 729104 effective words/s
2018-10-12 00:13:00,205 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:13:00,218 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:13:00,222 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:13:00,223 : INFO : EPOCH - 82 : training on 77911 raw words (61794 effective words) took 0.1s, 757972 effective words/s
2018-10-12 00:13:00,287 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-12 00:13:00,306 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-12 00:13:00,309 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-12 00:13:00,310 : INFO : EPOCH - 83 : training on 77911 raw words (61826 effective words) took 0.1s, 750085 effective words/s
2018-10-12 00:13:00,375 : INFO : worker thread finis

CPU times: user 20.6 s, sys: 532 ms, total: 21.1 s
Wall time: 8.9 s


In [35]:
model_path = os.path.join(os.getcwd(), "models")
doc2vec_model_filepath = os.path.join(model_path, 'doc2vec_model_' + assignment_code)
if save_files:
    model.save(doc2vec_model_filepath)

2018-10-12 00:14:54,197 : INFO : saving Doc2Vec object under /Users/carlossouza/Dropbox/2018/OMSCS/CS-6460-Education-Technology/bettertogether/models/doc2vec_model_assignment_2, separately None
2018-10-12 00:14:54,254 : INFO : saved /Users/carlossouza/Dropbox/2018/OMSCS/CS-6460-Education-Technology/bettertogether/models/doc2vec_model_assignment_2


## Step 3: Test the result

That's it! Now, let's test the result by first looking at which students are most similar to **Carlos Souza**:

In [36]:
test_student = "Carlos Souza"

similar_doc = model.docvecs.most_similar(test_student) 
print("Most similar documents of " + test_student + " for " + assignment)
for item in similar_doc:
    print(str(item[1])  + " \t" + str(item[0]))

2018-10-12 00:17:03,079 : INFO : precomputing L2-norms of doc weight vectors


Most similar documents of Carlos Souza for Assignment 2
0.3829500675201416 	Alaa Shafaee
0.3700012266635895 	Hector Ortiz-Mena
0.3654715120792389 	Joshua Harris
0.3286665380001068 	Christopher Bischke
0.3188931345939636 	Hieu Nguyen
0.3181723356246948 	Zhiyu Zong
0.31330111622810364 	Dan Fujita
0.2916083335876465 	Rui Zhan
0.28721070289611816 	Mitchell Tufford
0.279945433139801 	Lokesh Pathak
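
The scores above are cosine similarities between document vectors. The sketch below shows the ranking logic with the standard library only; the helper names and the 3-dimensional vectors are hypothetical (the trained model uses 300 dimensions), so this illustrates the idea rather than gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, vectors, topn=10):
    """Rank every other tag by cosine similarity to `target`, highest first."""
    scores = [(tag, cosine_similarity(vectors[target], vec))
              for tag, vec in vectors.items() if tag != target]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical 3-dimensional vectors keyed by tag.
vectors = {
    "student_a": [1.0, 0.0, 0.0],
    "student_b": [1.0, 1.0, 0.0],
    "student_c": [0.0, 0.0, 1.0],
}
ranking = most_similar("student_a", vectors)
for tag, score in ranking:
    print(score, "\t", tag)
```

Cosine similarity ignores vector length and compares direction only, which is why scores fall between -1 and 1.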


Our model says that, based on **Assignment 2**, **Carlos Souza**'s most similar student is **Alaa Shafaee**. Let's check whether that is plausible by retrieving the main topics of Carlos Souza's Assignment 2 and Alaa Shafaee's Assignment 2 and comparing them. We will do this using an NLP API called TextRazor.

First, let's define a function `print_top_entities` to retrieve main topics of an assignment, and initialize TextRazor:

In [37]:
def print_top_entities(student, limit=15):
    text = textract.process(os.path.join(path, student + " - " + assignment + ".pdf")).decode("utf-8")
    response = client.analyze(text)
    entities = list(response.entities())
    entities.sort(key=lambda x: x.relevance_score, reverse=True)
    seen = set()
    index = 0
    for entity in entities:
        if entity.id not in seen:
            print(entity.id, entity.relevance_score, entity.confidence_score)
            seen.add(entity.id)
            index += 1
            if index >= limit:
                break

In [38]:
import textrazor

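# Read the API key from an environment variable instead of hard-coding it
# (TEXTRAZOR_API_KEY is an assumed variable name; set it to your own key)
textrazor.api_key = os.environ["TEXTRAZOR_API_KEY"]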
client = textrazor.TextRazor(extractors=["entities", "topics"])

Now, let's compare Carlos Souza's main topics and Alaa Shafaee's main topics in Assignment 2:

In [39]:
print_top_entities("Carlos Souza")

Intelligent tutoring system 1 8.865
Social media 1 9.509
Social networking service 1 1.756
Metacognition 1 14.24
Learning 0.9217 10.26
Educational technology 0.8096 16.99
Pedagogy 0.8096 5.95
Learning environment 0.8002 2.443
Problem solving 0.7723 6.813
Motivation 0.7709 4.478
Virtual community 0.7603 4.903
Carol Dweck 0.7437 4.027
Facebook 0.7412 15.26
Community 0.7169 1.289
Education 0.7148 5.371


In [40]:
print_top_entities("Alaa Shafaee")

Intelligent tutoring system 1 8.568
Metacognition 1 9.326
Learning theory (education) 0.8641 4.046
Cognitive tutor 0.8053 5.843
Educational technology 0.7945 4.005
Teaching method 0.7295 1.722
Project-based learning 0.7283 9.479
Problem-based learning 0.7254 12.69
Constructionism (learning theory) 0.659 3.184
Learning 0.6255 9.671
Problem solving 0.6051 5.814
Motivation 0.5837 5.01
Simulation 0.5769 5.537
Test (assessment) 0.5768 2.248
Pedagogy 0.5468 6.151


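As you can see, the result is plausible: 7 of the top 15 entities (Intelligent tutoring system, Metacognition, Learning, Educational technology, Pedagogy, Problem solving and Motivation) appear in both lists, so the two assignments cover closely related topics.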
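
Rather than comparing the two lists by eye, the overlap can be counted. The sketch below copies the entity names from the two outputs above and computes the shared entities and the Jaccard index; the variable names are my own.

```python
# Top-15 entities copied from the two TextRazor outputs above.
carlos = [
    "Intelligent tutoring system", "Social media", "Social networking service",
    "Metacognition", "Learning", "Educational technology", "Pedagogy",
    "Learning environment", "Problem solving", "Motivation", "Virtual community",
    "Carol Dweck", "Facebook", "Community", "Education",
]
alaa = [
    "Intelligent tutoring system", "Metacognition", "Learning theory (education)",
    "Cognitive tutor", "Educational technology", "Teaching method",
    "Project-based learning", "Problem-based learning",
    "Constructionism (learning theory)", "Learning", "Problem solving",
    "Motivation", "Simulation", "Test (assessment)", "Pedagogy",
]

alaa_set = set(alaa)
# Keep Carlos's ordering so the most relevant shared entities come first.
shared = [entity for entity in carlos if entity in alaa_set]
# Jaccard index: size of the intersection divided by size of the union.
jaccard = len(shared) / len(set(carlos) | alaa_set)

print(len(shared), "shared entities:", shared)
print("Jaccard index:", round(jaccard, 3))
```

Running the same comparison against a student with a low similarity score would give a useful baseline for how much overlap to expect by chance.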

### Next steps

- If you need support, drop me a note at [souza@gatech.edu](mailto:souza@gatech.edu); I will be glad to help.

### Further readings

- [Distributed Representations of Sentences and Documents](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
- [Doc2Vec Tutorial on the Lee Dataset](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb)
- [Modern NLP in Python](https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb)
- [Gensim - Topic Modelling for Humans - tutorials](https://radimrehurek.com/gensim/tutorial.html)
- [TextRazor API tutorials](https://www.textrazor.com/tutorials)
