## Python Recipes for Assignment 1

**Notes**:

The documentation for TextBlob, if you're interested, is [here](https://textblob.readthedocs.io/en/dev/).  
TF-IDF code adapted from Steven Loria: http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/  
Explanations for the code below are provided. Feel free to modify the code to try out ideas, but doing so is not required for assignment 1. 

In [None]:
'''
In this cell we load a directory of text files into memory using the TextBlob library.
TextBlob is a simpler interface to many of the Natural Language Toolkit (NLTK) programs.
We're using the variable 'blob' to refer to a TextBlob object, which contains our text.
It allows us to do useful things like blob.lower() to lowercase all words, or
blob.words to access the words as a list.

Here we assume your text files are in a directory called 'text_files'.
The code below iterates through each filename in the 'text_files' directory,
opens it, creates a TextBlob object, lowercases it, strips any leading or trailing
commas, and appends it to a list of texts, here called 'bloblist'. One by one we
accumulate a list of prepared texts.
'''

import os
from textblob import TextBlob

bloblist = []


path = 'text_files/'
for filename in os.listdir(path):
    with open(os.path.join(path, filename)) as f:
        text = f.read()
        blob = TextBlob(text)
        blob_lower = blob.lower()
        blob_strip = blob_lower.strip(',')
        bloblist.append(blob_strip)
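
One thing to watch for: os.listdir() returns every entry in the directory, including hidden files (such as .DS_Store on a Mac) and subdirectories, and it makes no promise about order. Here is a minimal sketch of one way to guard against that, assuming your texts all end in '.txt' and are UTF-8 encoded. Both are assumptions, and `read_text_files` is a hypothetical helper name, not part of TextBlob. The demonstration builds a throwaway directory so it runs anywhere:

```python
import os
import tempfile

def read_text_files(path, extension='.txt'):
    """Return (filename, text) pairs for files in path ending with extension, sorted by filename."""
    texts = []
    for filename in sorted(os.listdir(path)):
        full_path = os.path.join(path, filename)
        # skip subdirectories and anything that is not one of our text documents
        if not os.path.isfile(full_path) or not filename.endswith(extension):
            continue
        with open(full_path, encoding='utf-8') as f:
            texts.append((filename, f.read()))
    return texts

# Demonstration on a temporary directory (a stand-in for 'text_files/')
with tempfile.TemporaryDirectory() as demo_dir:
    for name, content in [('b.txt', 'Second text.'), ('a.txt', 'First text.'), ('.DS_Store', 'junk')]:
        with open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as f:
            f.write(content)
    loaded = read_text_files(demo_dir)

for filename, text in loaded:
    print(filename, '->', text)
```

Each `text` could then be passed to TextBlob(text) exactly as in the cell above.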

### Count number of words in each 'blob'

In [None]:
'''
This code iterates through each document (or 'blob') and uses the len() function to count how many words are in it.
Again we use the "accumulator pattern" to iteratively add the number of words in each document to a running total
variable called corpus_total_words.
'''

corpus_total_words = 0
for i, blob in enumerate(bloblist):
    corpus_total_words = corpus_total_words + len(blob.words)
    print("Text {}: {}".format(i + 1, len(blob.words)))
print("Total:", corpus_total_words)
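
If the accumulator pattern is new to you, it may help to see it next to Python's built-in sum(), which does the same job in one line. This is only a sketch on toy data: `toy_documents` is a made-up list of word lists standing in for the blob.words of each document.

```python
# Toy stand-ins for blob.words: each document is just a list of words
toy_documents = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat', 'down'],
    ['birds', 'fly'],
]

# Accumulator pattern: start at zero and add to the total on every pass through the loop
running_total = 0
for words in toy_documents:
    running_total = running_total + len(words)

# Equivalent one-liner: sum() adds up the lengths produced by the generator expression
total_with_sum = sum(len(words) for words in toy_documents)

print(running_total, total_with_sum)
```

Both approaches give the same total; the explicit loop is easier to extend when you also want to print something for each document.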

### Create frequency distributions

In [None]:
'''
Notice the accumulator pattern is used often. We loop through our data (bloblist), creating a new data
structure for each text (blob_frequency_dictionary). We then loop through the words of each text (blob.words)
and find the frequency of each word in the text - i.e. how many times it occurs in that text - and record it.
This blob_frequency_dictionary data looks something like: {'the': 38, 'to': 33, 'and': 28 ...}
By default no stopwords are applied, but you can uncomment a line below to apply the NLTK English stopword list.
This will filter out function words, leaving you with the 'content' or 'lexical' words that are likely to be
of interest for this assignment.
After this we sort by frequency, highest first, and print the 10 most frequent words in each text.
'''

from nltk.corpus import stopwords

sw = stopwords.words('english')

for i, blob in enumerate(bloblist):
    blob_frequency_dictionary = {}
    for word in blob.words:
        # count each distinct word only once
        if word not in blob_frequency_dictionary:
            # add stopword filtering here by uncommenting the next line, and indenting the line after it
            # if word.lower() not in sw:
            blob_frequency_dictionary[word] = blob.words.count(word)
    # sort once per document, after every word has been counted
    sorted_words = sorted(blob_frequency_dictionary.items(), key=lambda x: x[1], reverse=True)
    print('10 Most frequent words in document {}'.format(i + 1))
    for word, count in sorted_words[:10]:
        print("\tWord: {}, Count: {}".format(word, count))
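
As an aside, Python's standard library has a ready-made frequency dictionary called Counter. The sketch below shows the same idea (count words, skip stopwords, list the most frequent) on toy data. `toy_words` and `toy_stopwords` are invented for illustration; in the notebook you would use blob.words and the NLTK list `sw` instead.

```python
from collections import Counter

# Toy stand-in for blob.words
toy_words = ['the', 'phone', 'uses', 'bluetooth', 'and', 'the', 'speaker', 'uses', 'bluetooth', 'the']

# Hypothetical mini stoplist; the NLTK English list is much longer
toy_stopwords = {'the', 'and'}

# Counter tallies every word that passes the stopword filter
frequencies = Counter(word for word in toy_words if word not in toy_stopwords)

# most_common(n) returns the n highest (word, count) pairs, highest count first
top_two = frequencies.most_common(2)
print(top_two)
```

Counter makes a single pass over the words, so it stays fast on long documents.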


### Make some NLTK text objects and view concordance

In [None]:
'''
Note that your texts may not be in the same order as the files appear in your directory,
but you can view the list in the order it was loaded by running this cell.
'''

from nltk.text import Text

nltk_text_list = []

for blob in bloblist:
    nltk_text = Text(blob.words)
    nltk_text_list.append(nltk_text)

nltk_text_list

### NLTK Concordance

In [None]:
'''
You can select a text by changing the index 6 below. Indexes start at 0, so with the 14 texts in this example 0-13 will work.
You can also specify how many lines and how 'wide' you'd like the text snippet to be.
'''

nltk_text_list[6].concordance('bluetooth', lines=50, width=90)
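
To see what a concordance is doing under the hood, here is a simplified keyword-in-context sketch in plain Python. It is not NLTK's implementation: `keyword_in_context` is a hypothetical helper, the sentence is made up, and the window here is measured in words, whereas NLTK's width is measured in characters.

```python
def keyword_in_context(words, keyword, window=3):
    """Return one string per occurrence of keyword, with `window` words of context on each side."""
    lines = []
    for position, word in enumerate(words):
        if word == keyword:
            # max(0, ...) stops the left slice wrapping around at the start of the text
            left = ' '.join(words[max(0, position - window):position])
            right = ' '.join(words[position + 1:position + 1 + window])
            lines.append('{} [{}] {}'.format(left, keyword, right))
    return lines

# Toy stand-in for blob.words
toy_words = 'pair the headset over bluetooth then check that bluetooth is still on'.split()

kwic_lines = keyword_in_context(toy_words, 'bluetooth', window=2)
for line in kwic_lines:
    print(line)
```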

### Use NLTK's similar() and common_contexts() methods

In [None]:
'''
NLTK's similar() method finds words that appear in the same contexts as the given word, i.e. have the same words on either side.
'''

nltk_text_list[7].similar('work')

In [None]:
'''
NLTK's common_contexts() method works in the opposite direction to similar().
Give it two words that share a context (i.e. matching words on either side), and
it will show you what the context words are.
There can be multiple contexts, and this gets more interesting the larger our
texts are.
'''

nltk_text_list[7].common_contexts(['work', 'brain'])
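
The idea of a 'context' can be sketched in a few lines of plain Python. This is a simplified illustration only; NLTK's own implementation is more elaborate. `contexts_by_word` is a hypothetical helper and the sentence is invented so that 'work' and 'brain' share a context.

```python
from collections import defaultdict

def contexts_by_word(words):
    """Map each word to the set of (previous word, next word) pairs it appears between."""
    contexts = defaultdict(set)
    # the first and last words have no neighbour on one side, so they are skipped
    for position in range(1, len(words) - 1):
        contexts[words[position]].add((words[position - 1], words[position + 1]))
    return contexts

# Toy stand-in for blob.words
toy_words = 'the work is hard but the brain is ready and the work goes on'.split()
contexts = contexts_by_word(toy_words)

# Contexts shared by both words: the intersection of the two sets
shared = contexts['work'] & contexts['brain']
print(shared)
```

Here 'work' and 'brain' both occur between 'the' and 'is', so that pair is their common context.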

### Calculate Term Frequency - Inverse Document Frequency

Term Frequency - Inverse Document Frequency (TF-IDF) is a weighting that finds words that are characteristic of a document within a corpus. It finds words that appear quite frequently in a given document, but not in the other documents. 

Words that occur only once or twice in a single document and not in any other documents don't tell us a lot about the document - they may be just the whim of the writer. Similarly, words that appear a lot in all the documents don't tell us much about the differences between documents.

#### Definitions

For each word in the corpus:

**Term Frequency** (tf) = number of times the word occurs in a document, divided by the total number of words in that document

**Document Frequency** (df) = number of documents in the corpus containing the word

**Inverse Document Frequency** (idf) = logarithm of the number of documents divided by the document frequency for the word (the code below adds 1 to the document frequency)

So tf-idf for a word in a document is calculated by tf * idf
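
To make the definitions concrete, here is a small worked example on a toy corpus of four 'documents'. Everything in it is invented for illustration: the documents are plain lists of words, and the function names are hypothetical stand-ins. The idf adds 1 to the document frequency, as the notebook's idf function does.

```python
import math

# Toy corpus: each document is a plain list of words (a stand-in for blob.words)
toy_corpus = [
    ['bluetooth', 'speaker', 'bluetooth', 'pairing'],
    ['the', 'speaker', 'volume'],
    ['the', 'battery', 'life'],
    ['the', 'screen', 'size'],
]

def term_frequency(word, document):
    # occurrences of the word, as a proportion of the document's length
    return document.count(word) / len(document)

def document_frequency(word, corpus):
    # number of documents that contain the word at least once
    return sum(1 for document in corpus if word in document)

def inverse_document_frequency(word, corpus):
    # log of (number of documents / (1 + document frequency))
    return math.log(len(corpus) / (1 + document_frequency(word, corpus)))

def tf_idf(word, document, corpus):
    return term_frequency(word, document) * inverse_document_frequency(word, corpus)

bluetooth_score = tf_idf('bluetooth', toy_corpus[0], toy_corpus)  # 0.5 * log(4 / 2)
speaker_score = tf_idf('speaker', toy_corpus[0], toy_corpus)      # 0.25 * log(4 / 3)
the_score = tf_idf('the', toy_corpus[1], toy_corpus)              # (1 / 3) * log(4 / 4)

print(round(bluetooth_score, 5), round(speaker_score, 5), round(the_score, 5))
```

'bluetooth' scores highest because it is frequent in one document and absent from the rest, while 'the' scores 0 because log(4 / 4) is 0. With this version of idf, a word that appears in every document would get a slightly negative score.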

In [None]:
'''
Here are function definitions for tf, df (here called 'n_containing'), idf and tfidf.
'''

import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

In [None]:
'''
Here we loop through the list of documents called 'bloblist'.
Scores is a dictionary of key:value pairs. 
Each key is a word in the document and the value is its tfidf score. 
Results are sorted by the tfidf score with the largest value at the top.
Lastly we print the first 10 results for each document.
'''

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:10]:
        print("\t{}, TF-IDF: {}".format(word, round(score, 5)))
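
The loop above recounts the document frequency for every word, which means rescanning every document each time; on a larger corpus this can get slow. Below is a sketch of one possible speed-up: count all document frequencies once, up front, and reuse them. It runs on a made-up toy corpus of word lists, and `tfidf_scores` is a hypothetical helper rather than part of the adapted code.

```python
import math
from collections import Counter

# Toy corpus: each document is a plain list of words (a stand-in for blob.words)
toy_corpus = [
    ['bluetooth', 'speaker', 'bluetooth', 'pairing'],
    ['the', 'speaker', 'volume'],
    ['the', 'battery', 'life'],
    ['the', 'screen', 'size'],
]

# One pass over the corpus: set(document) makes each word count once per document
document_frequencies = Counter(word for document in toy_corpus for word in set(document))

def tfidf_scores(document, corpus_size, document_frequencies):
    """Return {word: tf-idf score} for one document, using precomputed document frequencies."""
    counts = Counter(document)
    return {
        word: (count / len(document)) * math.log(corpus_size / (1 + document_frequencies[word]))
        for word, count in counts.items()
    }

scores = tfidf_scores(toy_corpus[0], len(toy_corpus), document_frequencies)
top_word = max(scores, key=scores.get)
print(top_word, round(scores[top_word], 5))
```

The formula is the same as in the functions above (including the 1 added to the document frequency); only the order of the counting changes.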

### Extract Noun phrases

In [None]:
'''
Noun phrases are phrases that function as a noun (i.e. a 'thing' or concept), e.g. 'collaboration technology' in
document 1 below.
'''

for i, blob in enumerate(bloblist):
    print('Noun phrases in document {}'.format(i + 1))
    noun_phrases = sorted(blob.noun_phrases)
    print('========================================')
    print(noun_phrases)
    print('========================================')

### Getting a list of stopwords for use in AntConc

In [None]:
from nltk.corpus import stopwords
sw = stopwords.words('english')
with open('english-stopwords-nltk.txt', 'w') as f:
    for word in sw:
        f.write(word + '\n')

In AntConc, the stoplist can be applied via the Word List Tool Preferences dialog.

In [None]:
%cat english-stopwords-nltk.txt

### Loading output from AntConc into your Jupyter notebook

There is more than one way to do this. A good way is to save the results you want to keep from AntConc by going to File > Save Output to Text File. Put the resulting text file in the same directory as your Jupyter notebook (my example is called 'antconc_word_lst_results.txt'). Then use the following code.  

In [None]:
'''
Play around with the number 190 (a number of characters, not words, in this case)
until you get the desired result.
'''

with open('antconc_word_lst_results.txt') as f:
    results = f.read()
    print(results[:190])
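
If counting characters feels fiddly, an alternative is to show a fixed number of lines instead. The sketch below uses itertools.islice to read only the first few lines of a file. The file it reads is a throwaway stand-in with invented contents, not real AntConc output; in your notebook you would open your own results file instead.

```python
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # Write a small stand-in file so the example runs anywhere
    demo_path = os.path.join(demo_dir, 'demo_results.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('line one\nline two\nline three\nline four\n')

    # islice stops after 3 lines, so the rest of the file is never read
    with open(demo_path, encoding='utf-8') as f:
        first_lines = [line.rstrip('\n') for line in itertools.islice(f, 3)]

for line in first_lines:
    print(line)
```

Change the 3 to however many lines of results you want to display.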

When displaying results from the AntConc Concordance tool, you may need to adjust the Concordance Tool Preferences to limit the width and the columns displayed so that your results look nice.