## Python Recipes for Assignment 1

**Notes**:

The documentation for TextBlob, if you're interested, is [here](https://textblob.readthedocs.io/en/dev/).  
TF-IDF code adapted from Steven Loria: http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/  
Explanations for the code below are provided. Feel free to modify the code to try out ideas, but doing so is not required for assignment 1. 

In [4]:
'''
In this cell we load a directory of text files into memory using the TextBlob library.
TextBlob is a simpler interface to many of the Natural Language Toolkit (NLTK) programs.
We're using the variable 'blob' to refer to a TextBlob object, which contains our text.
It allows us to do useful things like blob.lower() to lowercase all words, or
blob.words to access the words as a list.

Here we assume your text files are in a directory called 'text_files'.
The code below iterates through each filename in the 'text_files' directory,
opens it, creates a TextBlob object, lowercases it, strips any leading and trailing commas,
and saves it to a list of texts, here called 'bloblist'. One by one we accumulate a list of prepared texts.
'''

import os
from textblob import TextBlob

bloblist = []

def check_file(filename):
    if not filename.endswith('.txt'):
        return None
    if filename.startswith('.'):
        return None
    return filename


path = 'text_files/'
for filename in os.listdir(path):
    if check_file(filename):
        with open(os.path.join(path, filename), 'r', encoding='utf-8', errors='ignore') as f:
            print(filename)
            text = f.read()
            blob = TextBlob(text)
            blob_lower = blob.lower()
            blob_strip = blob_lower.strip(',')
            bloblist.append(blob_strip)
        
print(len(bloblist))
        

1918ebc2ccb0.txt
2361705f299d.txt
38a6c08d4652.txt
3ebb50ba370e.txt
3f9d20f019a3.txt
6312bfd4c87a.txt
89b97cb1cca5.txt
8ebd88efcff6.txt
b054a3b4fb16.txt
bde914224749.txt
becf40fd76b9.txt
efca757f0313.txt
f611db9ba568.txt
fa5a1fd08da4.txt
14
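
Note that `strip(',')` only trims commas from the very start and end of a text, not from the middle. The sketch below shows the difference using a plain Python string, on the assumption that TextBlob's `strip()` behaves like `str.strip()`. The sample sentence is made up for illustration.

```python
# Hypothetical sample text, not taken from the corpus.
sample = ",Hello, world, again,"

# strip(',') only trims commas at the two ends of the string.
stripped = sample.strip(',')
print(stripped)   # Hello, world, again

# replace(',', '') removes every comma, wherever it occurs.
no_commas = sample.replace(',', '')
print(no_commas)  # Hello world again
```

Because `blob.words` already leaves out punctuation, the commas that remain inside the text should not affect the word counts in the cells that follow.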


### Count number of words in each 'blob'

In [6]:
'''
This code iterates through each document (or 'blob') and uses the len() function to count how many words are in it.
Again we use the "accumulator pattern" to iteratively add the number of words in each document to a running total
variable called corpus_total_words.
'''

corpus_total_words = 0
for i, blob in enumerate(bloblist):
    corpus_total_words = corpus_total_words + len(blob.words)
    print("Text {}".format(i + 1), ": ", len(blob.words))
    print('.' in blob)        # is there a full stop anywhere in the raw text?
    print('.' in blob.words)  # is '.' one of the tokens in the word list?
print("Total:", corpus_total_words)


# blob.words creates a list of the words, excluding punctuation

Text 1 :  1225
True
False
Text 2 :  951
True
False
Text 3 :  1589
True
False
Text 4 :  984
True
False
Text 5 :  1041
True
False
Text 6 :  2255
True
False
Text 7 :  1110
True
False
Text 8 :  460
True
False
Text 9 :  952
True
False
Text 10 :  917
True
False
Text 11 :  756
True
False
Text 12 :  809
True
False
Text 13 :  1237
True
False
Text 14 :  575
True
False
Total: 14861
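
The accumulator pattern can also be written with Python's built-in `sum()`. This is a minimal sketch using plain lists as stand-ins for `blob.words`; the `documents` list is invented for illustration and is not part of the corpus.

```python
# Stand-in data: each inner list plays the role of blob.words for one document.
documents = [
    ["lots", "of", "people"],
    ["in", "the", "future", "philosophers"],
    ["how", "to", "teach"],
]

# Accumulator pattern: start at zero and add each document's length in turn.
total = 0
for words in documents:
    total = total + len(words)

# Equivalent one-liner using sum() with a generator expression.
total_with_sum = sum(len(words) for words in documents)

print(total, total_with_sum)  # 10 10
```

Both versions give the same total; the explicit loop is easier to extend when you also want to print something for each document.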


### Create frequency distributions

In [9]:
'''
Notice the accumulator pattern is used often. We loop through our data (bloblist), creating a new data
structure for each text (blob_frequency_dictionary). We then loop through the words of each text (blob.words)
and find the frequency of that word in the text - ie how many times it occurs in that text - and record it.
Without stopword filtering, this blob_frequency_dictionary data looks something like: {'the': 38, 'to': 33, 'and': 28 ...}
The code below applies the NLTK English stopword list; remove the 'if word.lower() not in sw' check to keep stopwords.
Filtering removes function words, leaving you with the 'content' or 'lexical' words that are likely to be
of interest for this assignment.
After this we sort the words by frequency, highest first, and print the 10 most frequent.
'''

from nltk.corpus import stopwords

sw = stopwords.words('english')

for i, blob in enumerate(bloblist):
    blob_frequency_dictionary = {}
    for word in blob.words:
        if word not in blob_frequency_dictionary:
            # stopword filter: remove this check (and dedent the line after it) to keep stopwords
            if word.lower() not in sw:
                # count the word only the first time we meet it
                blob_frequency_dictionary[word] = blob.words.count(word)
    # sort once per document, most frequent first
    sorted_words = sorted(blob_frequency_dictionary.items(), key=lambda x: x[1], reverse=True)
    print('10 Most frequent words in document {}'.format(i + 1))
    for word in sorted_words[:10]:
        print("\tWord:", word)


10 Most frequent words in document 1
	Word: ('instagram', 30)
	Word: ('snapchat', 21)
	Word: ('stories', 16)
	Word: ('users', 12)
	Word: ('it’s', 10)
	Word: ('people', 8)
	Word: ('social', 8)
	Word: ('facebook', 7)
	Word: ('daily', 7)
	Word: ('”', 7)
10 Most frequent words in document 2
	Word: ('information', 9)
	Word: ('humans', 8)
	Word: ('consciousness', 7)
	Word: ('one', 5)
	Word: ('human', 5)
	Word: ('brain', 5)
	Word: ('something', 5)
	Word: ('able', 5)
	Word: ('google', 5)
	Word: ('already', 4)
10 Most frequent words in document 3
	Word: ('”', 25)
	Word: ('better', 17)
	Word: ('artificial', 15)
	Word: ('intelligence', 15)
	Word: ('students', 11)
	Word: ('ai', 10)
	Word: ('“what', 10)
	Word: ('computers', 10)
	Word: ('like', 9)
	Word: ('ask', 9)
10 Most frequent words in document 4
	Word: ('bluetooth', 48)
	Word: ('devices', 11)
	Word: ('also', 8)
	Word: ('it’s', 8)
	Word: ('5', 8)
	Word: ('things', 7)
	Word: ('wireless', 7)
	Word: ('one', 6)
	Word: ('master', 6)
	Word: ('applica
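
As an alternative to building the dictionary by hand, the standard library's `collections.Counter` can tally words in a single pass. This is a sketch only: `words` and `stop_words` are made-up stand-ins for `blob.words` and the NLTK stopword list.

```python
from collections import Counter

# Stand-ins for blob.words and for the NLTK stopword list (both hypothetical).
words = ["the", "app", "is", "the", "best", "app", "the", "users", "like"]
stop_words = {"the", "is"}

# Counter tallies every non-stopword in a single pass over the list.
frequencies = Counter(w for w in words if w not in stop_words)

# most_common(n) returns the n highest counts, largest first.
for word, count in frequencies.most_common(2):
    print("\tWord:", (word, count))
```

On long documents this should be noticeably faster than calling `blob.words.count(word)` for each word, since the list is only scanned once.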

### Make some NLTK text objects and view concordance

In [10]:
'''
Note your texts may not be in the order they appear as files, but you can view the list
by running this cell.
'''

from nltk.text import Text

nltk_text_list = []

for blob in bloblist:
    nltk_text = Text(blob.words)
    nltk_text_list.append(nltk_text)

nltk_text_list

[<Text: lots of people are saying that instagram just...>,
 <Text: in the future philosophers will be gods there...>,
 <Text: how to teach ai to kids science fiction...>,
 <Text: 6 things you didn’t know about bluetooth like...>,
 <Text: neural networks as the architecture of human work...>,
 <Text: the big business of organized religion or why...>,
 <Text: the power of not knowing by tobias van...>,
 <Text: elon musk’s boring company tunnels might be a...>,
 <Text: what mega chupa chups taught me about artificial...>,
 <Text: while i’m a big fan of cool new...>,
 <Text: should robots dance how anthropological methods can answer...>,
 <Text: chatbots as loyal friends to humans age of...>,
 <Text: nintendo’s surface first published on 4/29/17 on 5ish...>,
 <Text: machine learning just made it really easy to...>]

### NLTK Concordance

In [46]:
'''
You can select a text by changing the index 0 below - in this example 0-13 will work.
You can also specify how many lines and how 'wide' you'd like the text snippet to be.
'''

nltk_text_list[0].concordance('connect', lines=50, width=90)
nltk_text_list[0].concordance('begin', lines=50, width=90)

nltk_text_list[0].concordance('people', lines=50, width=90)
# lines: The number of lines to display (default=25)
# width: the number of characters displayed

Displaying 1 of 1 matches:
erent ways people use these platforms to connect with others snapchat is not just another
Displaying 2 of 2 matches:
ting the two apps as similar platforms to begin with is a fruitless exercise first it’s w
le would have ever joined the platform to begin with “instagram is still a place for user
Displaying 8 of 8 matches:
                                          people are saying that instagram just officially
ached 200 million daily active users daus people were quick to pounce on that significant 
g to be the same.” instagram stories many people say that instagram stories offer a more e
a more enjoyable experience because “more people see your story.” but they are thinking of
s allowing users to discover content from people they don’t follow but the snaps that are 
pchat users it’s unclear if many of these people would have ever joined the platform to be
ather than focusing on irrelevant metrics people should think critically about the differe
think critic
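
To get a feel for what a concordance does, here is a simplified keyword-in-context sketch in plain Python. It is not NLTK's implementation: the helper name `simple_concordance` is hypothetical, and its `window` parameter counts words, whereas NLTK's `width` counts characters. The tokens come from the 'connect' match shown above.

```python
def simple_concordance(tokens, target, window=3):
    """Return each occurrence of target with up to `window` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append("{} [{}] {}".format(left, token, right))
    return lines

tokens = "people use these platforms to connect with others".split()
for line in simple_concordance(tokens, "connect"):
    print(line)  # these platforms to [connect] with others
```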

### Use NLTK's similar() and common_contexts() methods

In [31]:
'''
NLTK's similar() method finds words that appear in the same context, ie have the same words on either side.
'''

nltk_text_list[0].similar('connect')

begin


In [51]:
'''
NLTK's common_contexts() method works in the opposite direction to similar().
Give it two words that share a context (ie matching words on either side), and
it will show you what the context words are. 
There can be multiple contexts, and this gets more interesting the larger our 
texts are.
'''

nltk_text_list[0].common_contexts(['connect', 'begin'])

to_with
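
The idea of a shared context can be sketched in a few lines of plain Python. This is a simplified illustration rather than NLTK's actual implementation: the helper `contexts` is hypothetical, and the token list is stitched together from two fragments of the concordance output above.

```python
def contexts(tokens, target):
    """Collect the (previous word, next word) pairs around each occurrence of target."""
    found = set()
    # Skip the first and last token, which have no neighbour on one side.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == target:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

tokens = ("platforms to connect with others "
          "similar platforms to begin with is").split()

# Contexts that both words share are the intersection of the two sets.
shared = contexts(tokens, "connect") & contexts(tokens, "begin")
for left, right in shared:
    print("{}_{}".format(left, right))  # to_with
```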


### Calculate Term Frequency - Inverse Document Frequency

Term Frequency - Inverse Document Frequency (TF-IDF) is a weighting that finds words that are characteristic of a document within a corpus. It finds words that appear quite frequently in a given document, but not in the other documents. 

Words that occur only once or twice in a single document and not in any other documents don't tell us a lot about the document - they may be just the whim of the writer. Similarly, words that appear a lot in all the documents don't tell us much about the differences between documents.

#### Definitions

For each word in the corpus:

**Term Frequency** (tf) = number of times the word occurs in a document, divided by the total number of words in that document

**Document Frequency** (df) = number of documents in the corpus containing the word

**Inverse Document Frequency** (idf) = logarithm of the number of documents divided by the document frequency for the word (the code below adds 1 to the document frequency before dividing)

So tf-idf for a word in a given document is calculated by tf * idf
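
A small worked example may help make these definitions concrete. The sketch below uses three tiny made-up documents (plain lists of words, not TextBlob objects) and hypothetical function names, and it adds 1 to the document frequency in the same way as the notebook's `idf` function.

```python
import math

# Three tiny made-up documents, already split into words.
docs = [
    ["bluetooth", "devices", "use", "bluetooth"],
    ["instagram", "users", "use", "stories"],
    ["students", "use", "computers"],
]

def term_frequency(word, doc):
    # Share of the document's words that are `word`.
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # Document frequency: how many documents contain the word.
    df = sum(1 for doc in docs if word in doc)
    # Add 1 to the document frequency before dividing.
    return math.log(len(docs) / (1 + df))

for word in ["bluetooth", "use"]:
    score = term_frequency(word, docs[0]) * inverse_document_frequency(word, docs)
    print(word, round(score, 5))
# bluetooth 0.20273
# use -0.07192
```

'bluetooth' appears twice in the first document and nowhere else, so it scores highly. 'use' appears in every document, and with the added 1 its idf is log(3/4), which is negative, so it is pushed to the bottom of the ranking.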

In [52]:
'''
Here are function definitions for tf, df (here called 'n_containing'), idf and tfidf.
'''

import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

In [53]:
'''
Here we loop through the list of documents called 'bloblist'.
Scores is a dictionary of key:value pairs. 
Each key is a word in the document and the value is its tfidf score. 
Results are sorted by the tfidf score with the largest value at the top.
Lastly we print the first 10 results for each document.
'''

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:10]:
        print("\t{}, TF-IDF: {}".format(word, round(score, 5)))

Top words in document 1
	instagram, TF-IDF: 0.04765
	snapchat, TF-IDF: 0.02148
	stories, TF-IDF: 0.01636
	users, TF-IDF: 0.0083
	social, TF-IDF: 0.00818
	platform, TF-IDF: 0.00755
	facebook, TF-IDF: 0.00716
	daily, TF-IDF: 0.00716
	app, TF-IDF: 0.00716
	snap, TF-IDF: 0.00635
Top words in document 2
	consciousness, TF-IDF: 0.00922
	likely, TF-IDF: 0.00648
	experts, TF-IDF: 0.00614
	complexity, TF-IDF: 0.00614
	enhance, TF-IDF: 0.00614
	thought, TF-IDF: 0.00614
	frictionless, TF-IDF: 0.00614
	instant, TF-IDF: 0.00614
	friction, TF-IDF: 0.00614
	collective, TF-IDF: 0.00614
Top words in document 3
	computers, TF-IDF: 0.01225
	students, TF-IDF: 0.01066
	“what, TF-IDF: 0.00788
	videos, TF-IDF: 0.00735
	ask, TF-IDF: 0.0071
	”, TF-IDF: 0.00695
	artificial, TF-IDF: 0.00654
	intelligence, TF-IDF: 0.00654
	curiosity, TF-IDF: 0.00612
	flight, TF-IDF: 0.00612
Top words in document 4
	bluetooth, TF-IDF: 0.09492
	wireless, TF-IDF: 0.01384
	devices, TF-IDF: 0.01151
	5, TF-IDF: 0.01019
	standard, TF-ID

### Extract Noun phrases

In [54]:
'''
Noun phrases are phrases that function as a noun (ie a 'thing' or concept), eg 'communication platform' in
document 1 below.
'''

for i, blob in enumerate(bloblist):
    print('Noun_phrases in document {}'.format(i + 1))
    noun_phrases = sorted(blob.noun_phrases)
    print('========================================')
    print(noun_phrases)
    print('========================================')

Noun_phrases in document 1
['100m instagram followers', '400m daus', 'achievement — it’s', 'active user', 'active users', 'active users', 'active users', 'active users”', 'advertisers’ “experimental bucket”', 'average snapchatter', 'average user', 'average video view times', 'beyonce blasts', 'broadcast model', 'broadcast platform', 'camera company', 'checks instagram stories', 'communication platform', 'company’s flagship app', 'consumer behavior', 'craven senior vice president', 'daily active users', 'daily basis.” instagram stories facebook won’t', 'dead fam', 'different set', 'different ways people', 'doesn’t stoke', 'early stage investor', 'enjoyable experience', 'equate apples', 'equate snapchat', 'essential destination', 'facebook messenger', 'facebook messenger', 'fertile niche', 'follower counts', 'high level', 'imaginary war', 'important part', 'instagram stories', 'instagram stories', 'instagram stories', 'instagram stories', 'instagram stories', 'instagram stories', 'instag
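
Because the same noun phrase can appear many times in the sorted list, it can be easier to read as counts. This is a minimal sketch with `collections.Counter`; the list is a shortened sample of phrases copied from the document 1 output, so the counts are illustrative and not the real totals.

```python
from collections import Counter

# A shortened sample of phrases from the document 1 output (not the full list).
noun_phrases = [
    "active users", "active users", "active users",
    "facebook messenger", "facebook messenger",
    "instagram stories", "instagram stories", "instagram stories",
    "broadcast model",
]

# Count how often each phrase occurs and show the most frequent ones.
phrase_counts = Counter(noun_phrases)
for phrase, count in phrase_counts.most_common(3):
    print("{}: {}".format(phrase, count))
```

In the notebook you could pass `blob.noun_phrases` to `Counter` in place of the sample list.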

### Getting a list of stopwords for use in AntConc

In [None]:
from nltk.corpus import stopwords
sw = stopwords.words('english')
with open('english-stopwords-nltk.txt', 'w') as f:
    for word in sw:
        f.write(word + '\n')

In AntConc, the stoplist can be applied via the Word List Tool Preferences dialog.

In [None]:
%cat english-stopwords-nltk.txt

### Loading output from AntConc into your Jupyter notebook

There is more than one way to do this. A good way is to save the results you want to keep from AntConc by going to File > Save Output to Text File. Put the resulting text file in the same directory as your Jupyter notebook (my example is called 'antconc_word_lst_results.txt'). Then use the following code.  

In [None]:
'''
Play around with the number 190 (number of characters, not words in this situation)
until you get the desired result.
'''

with open('antconc_word_lst_results.txt') as f:
    results = f.read()
    print(results[:190])
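
If you would rather limit the output by lines than by characters, one option is to split the text into lines first. This sketch uses a made-up string as a stand-in for the contents of the saved file; in the notebook, `results` would come from `f.read()` as above.

```python
# Hypothetical stand-in for the contents of a saved AntConc output file.
results = "line one\nline two\nline three\nline four\n"

# splitlines() breaks the text at line endings; the slice keeps the first 2 lines.
first_lines = results.splitlines()[:2]
print("\n".join(first_lines))
```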

When displaying results from the AntConc Concordance tool, you may need to adjust the Concordance Tool Preferences to limit the width and the columns displayed so that your results look nice.