## Requirements for the tutorial notebook

1. One or two complete examples with functional Python code and real data. 
2. Cover all the key content from the chapter. 
3. Do not simply reproduce the same code and explanations that you find in the text. Draw on knowledge from other parts of the course. 
4. Include a paragraph in the notebook that describes exactly what each member of your group contributed.
5. State whether or not your group lived up to the original agreements for collaborating.

## Motivation for the Topic

The code we have seen and worked with so far has given us the experience to gather data from webpages and store it. While this is useful in its own right, our main interest lies in the interaction between the internet and human communication on the web, and more specifically in learning to modify and control such interactions. In Data Challenge 2, we gathered 2-grams and 3-grams from the PLOS website and counted the most frequently occurring n-grams. In both cases, words like "of" and "the" occurred most frequently, which was not very useful information. We realized that merely knowing the top 2-grams and 3-grams was not sufficient; it would be more useful to separate the "context-appropriate" words from the rest of the n-grams. This was the motivation for choosing a chapter on reading and writing natural languages as the topic of our tutorial notebook.

In this tutorial, we are going to examine a speech by Trump. To do this, we first break the speech into 2-grams. NLTK is well suited to generating statistical information about word counts, word frequency, and word diversity in sections of text.

In [13]:
import re
import os
import string
import collections

from bs4 import BeautifulSoup
from urllib.request import urlopen
from collections import Counter
import nltk  # later cells call nltk.word_tokenize and nltk.ngrams
from nltk import word_tokenize
from nltk import Text

The first step would be to read in the speech that is saved in a text file `speech.txt`.

In [6]:
path = './speech.txt'
# A context manager closes the file once the text has been read
with open(path, mode='r') as file:
    speech = file.read()
speech[:1000]

"We're here today to discuss matters of vital importance to us all: America's security, prosperity, and standing in the world. I want to talk about where we've been, where we are now, and, finally, our strategy for where we are going in the years ahead.\n\nOver the past 11 months, I have travelled tens of thousands of miles to visit 13 countries. I have met with more than 100 world leaders. I have carried America's message to a grand hall in Saudi Arabia, a great square in Warsaw, to the General Assembly of the United Nations, and to the seat of democracy on the Korean Peninsula. Everywhere I travelled, it was my highest privilege and greatest honour to represent the American people.\n\nThroughout our history, the American people have always been the true source of American greatness. Our people have promoted our culture and promoted our values. Americans have fought and sacrificed on the battlefields all over the world. We have liberated captive nations, transformed former enemies int

Next, we need to clean the speech. From a small subset of the speech, we note the presence of the following:
* `\n` - newline characters need to be removed to prevent them from being read as part of an n-gram.
* numbers - for the purpose of this example, numbers won't provide any significant information, so we remove them as well.
* upper case - for consistency and to prevent double counting, `Apple` and `apple` need to be counted as the same word, so we convert everything to lower case.
* special symbols like `,` and `.` - again for clarity in our analysis, we remove punctuation.

Note: We wished to preserve the `'` inside words like `we're` but were unable to do so. A [Stack Overflow](https://stackoverflow.com/questions/24695092/how-to-not-remove-apostrophe-only-for-some-words-in-text-file-in-python) answer uses a fairly complicated regex that we could not adapt to keep only those apostrophes.

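In our `clean_speech` function, we first remove the newline characters and then the special characters using `re.sub`. The regex `'[^A-Za-z ]+'` matches every run of characters other than upper- and lower-case letters and spaces, and `re.sub` replaces each match with an empty string. Two side effects are visible in the output below: removing `\n` outright fuses words across paragraph breaks (`aheadover`), and removing digits leaves double spaces, which `split(' ')` turns into empty strings.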

In [20]:
def clean_speech(speech):
    re_newline = '\n+'
    re_specials = '[^A-Za-z ]+'
    remove_newlines = re.sub(re_newline, '', speech)
    remove_special = re.sub(re_specials, '', remove_newlines)
    
    return remove_special

cleaned_speech = clean_speech(speech).lower()
cleaned_speech_in_words = cleaned_speech.split(' ')
cleaned_speech[:1000]

'were here today to discuss matters of vital importance to us all americas security prosperity and standing in the world i want to talk about where weve been where we are now and finally our strategy for where we are going in the years aheadover the past  months i have travelled tens of thousands of miles to visit  countries i have met with more than  world leaders i have carried americas message to a grand hall in saudi arabia a great square in warsaw to the general assembly of the united nations and to the seat of democracy on the korean peninsula everywhere i travelled it was my highest privilege and greatest honour to represent the american peoplethroughout our history the american people have always been the true source of american greatness our people have promoted our culture and promoted our values americans have fought and sacrificed on the battlefields all over the world we have liberated captive nations transformed former enemies into the best of friends and lifted entire re
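
A minimal sketch of an alternative cleaner that avoids fused words, empty tokens, and the loss of apostrophes inside words. It is not the notebook's original function: the name `clean_speech_keep_apostrophes` and the sample string are hypothetical, and it assumes the text uses straight apostrophes (`'`).

```python
import re

def clean_speech_keep_apostrophes(speech):
    """Hypothetical variant of clean_speech, not the notebook's original function."""
    # Lowercase first so one character class covers all letters
    text = speech.lower()
    # Replace every run of characters other than letters, apostrophes,
    # and whitespace with a single space (instead of deleting it)
    text = re.sub(r"[^a-z'\s]+", " ", text)
    # Drop apostrophes that are not between two letters (e.g. quote marks)
    text = re.sub(r"(?<![a-z])'|'(?![a-z])", "", text)
    # split() with no argument splits on any whitespace run,
    # so newlines separate words and no empty tokens appear
    return text.split()

sample = "We're here.\n\nOver the past 11 months, I've met 'leaders'."
print(clean_speech_keep_apostrophes(sample))
```

Because this version returns a list of words directly, it could stand in for both `clean_speech` and the later `split(' ')` step; the counts it produces would differ slightly from the outputs shown in this notebook.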

Let us try to extract ngrams from our cleaned speech next.

In [21]:
def create_ngrams(speech, n):
    ngrams = {}
    
    for index in range(len(speech)-n+1):
        ngram = " ".join(speech[index:index+n])
        
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1
            
    return ngrams

create_ngrams(cleaned_speech_in_words, 2)

{'were here': 1,
 'here today': 1,
 'today to': 1,
 'to discuss': 1,
 'discuss matters': 1,
 'matters of': 1,
 'of vital': 1,
 'vital importance': 1,
 'importance to': 1,
 'to us': 2,
 'us all': 1,
 'all americas': 1,
 'americas security': 1,
 'security prosperity': 1,
 'prosperity and': 2,
 'and standing': 1,
 'standing in': 2,
 'in the': 29,
 'the world': 11,
 'world i': 1,
 'i want': 1,
 'want to': 1,
 'to talk': 2,
 'talk about': 1,
 'about where': 1,
 'where weve': 1,
 'weve been': 1,
 'been where': 1,
 'where we': 2,
 'we are': 20,
 'are now': 3,
 'now and': 1,
 'and finally': 2,
 'finally our': 2,
 'our strategy': 8,
 'strategy for': 1,
 'for where': 1,
 'are going': 1,
 'going in': 1,
 'the years': 1,
 'years aheadover': 1,
 'aheadover the': 1,
 'the past': 4,
 'past ': 1,
 ' months': 1,
 'months i': 1,
 'i have': 4,
 'have travelled': 1,
 'travelled tens': 1,
 'tens of': 2,
 'of thousands': 1,
 'thousands of': 2,
 'of miles': 1,
 'miles to': 1,
 'to visit': 1,
 'visit ': 1,
 '

`create_ngrams` is fairly straightforward: it accepts a list of words and the size `n` of the n-grams we want to create. For every index, we join the `n` adjacent words starting there into a single string and add it to a dict. The last bit is the classic histogram technique: we initialize the n-gram's counter to 1 if we haven't seen it before; otherwise we increment its counter by 1.
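
The same sliding-window count can be sketched more compactly with `collections.Counter` and `zip`. This is an alternative formulation rather than the notebook's code; `count_ngrams` and the sample sentence are made up for illustration.

```python
from collections import Counter

def count_ngrams(words, n):
    """Hypothetical helper: count the n-grams in a list of words."""
    # Zipping n shifted views of the list yields each window of n adjacent words
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)

words = "we are here and we are ready".split()
counts = count_ngrams(words, 2)
print(counts.most_common(1))  # [('we are', 2)]
```

A list of 7 words yields 7 - 2 + 1 = 6 bigrams, matching the `range(len(speech)-n+1)` bound used in `create_ngrams`.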

We can use Python's `OrderedDict` from the `collections` module ([Stack Overflow article](https://stackoverflow.com/questions/9001509/how-can-i-sort-a-dictionary-by-key#9001529) for reference) to sort the n-grams by count and see the most common 2-grams.

In [22]:
ngrams = create_ngrams(cleaned_speech_in_words, 2)
ordered_ngrams = collections.OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
ordered_ngrams

OrderedDict([('of the', 47),
             ('in the', 29),
             ('to the', 22),
             ('of our', 21),
             ('we are', 20),
             ('we have', 17),
             ('and the', 15),
             ('the people', 14),
             ('on the', 13),
             ('the world', 11),
             ('thank you', 11),
             ('we will', 11),
             ('for the', 11),
             ('and our', 10),
             ('our country', 10),
             ('and we', 9),
             ('calls for', 9),
             ('of a', 9),
             ('spirit of', 9),
             ('of liberty', 9),
             ('our strategy', 8),
             ('the united', 8),
             ('the american', 8),
             ('that is', 8),
             ('from the', 8),
             ('of my', 8),
             ('is the', 8),
             ('we must', 7),
             ('of their', 7),
             ('to a', 6),
             ('and to', 6),
             ('american people', 6),
             ('america is', 6),
 

As we can see from the results, 'the people' is probably the most meaningful of the high-count n-grams; the first few are not that helpful. We can remove these unwanted words using a list of common English words that separates 'interesting' from 'uninteresting' words in the context of text analysis. For our purposes, let us take the 100 most frequent words from the [Corpus of Contemporary American English](https://corpus.byu.edu/coca/) and filter them out of our word list before building the n-grams.

In [27]:
def common_words():
    return ["the", "be", "and", "of", "a", "in", "to", "have", "it",
            "i", "that", "for", "you", "he", "with", "on", "do", "say", "this",
            "they", "is", "an", "at", "but","we", "his", "from", "that", "not",
            "by", "she", "or", "as", "what", "go", "their","can", "who", "get",
            "if", "would", "her", "all", "my", "make", "about", "know", "will",
            "as", "up", "one", "time", "has", "been", "there", "year", "so",
            "think", "when", "which", "them", "some", "me", "people", "take",
            "out", "into", "just", "see", "him", "your", "come", "could", "now",
            "than", "like", "other", "how", "then", "its", "our", "two", "more",
            "these", "want", "way", "look", "first", "also", "new", "because",
            "day", "more", "use", "no", "man", "find", "here", "thing", "give",
            "many", "well"]

stop_words = set(common_words())  # build the lookup set once instead of once per word
filtered_speech_in_words = [word for word in cleaned_speech_in_words if word not in stop_words]
ngrams = create_ngrams(filtered_speech_in_words, 2)
ordered_ngrams = collections.OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
ordered_ngrams

OrderedDict([('united states', 6),
             ('thank thank', 5),
             ('national security', 5),
             ('spirit liberty', 4),
             ('american greatness', 3),
             (' ', 3),
             ('since election', 3),
             ('standing world', 2),
             ('where are', 2),
             ('finally strategy', 2),
             ('past ', 2),
             (' world', 2),
             ('failures past', 2),
             ('after another', 2),
             ('leaders washington', 2),
             ('men women', 2),
             ('fair share', 2),
             ('north korea', 2),
             ('american energy', 2),
             ('lost confidence', 2),
             ('america coming', 2),
             ('coming back', 2),
             ('serve citizens', 2),
             ('great job', 2),
             ('made clear', 2),
             ('action against', 2),
             ('taken care', 2),
             ('stock market', 2),
             ('alltime high', 2),
             (

Cool, 'american greatness' does seem like something Trump would say.
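
One caveat with filtering before building n-grams: removing words closes the gaps between the survivors, so pairs such as `thank thank` and `spirit liberty` presumably never occurred side by side in the speech. A hedged alternative is to build the n-grams first and then discard any n-gram that contains a common word. The sketch below uses a tiny made-up word list and sentence; `filter_ngrams` is a hypothetical name.

```python
from collections import Counter

def filter_ngrams(words, n, stop_words):
    """Hypothetical helper: count only n-grams made entirely of non-stop words."""
    counts = Counter()
    for index in range(len(words) - n + 1):
        window = words[index:index + n]
        # Keep the n-gram only if none of its words is a stop word
        if not any(word in stop_words for word in window):
            counts[" ".join(window)] += 1
    return counts

stop_words = {"you", "the", "of"}  # illustrative subset, not the full list
words = "thank you thank you the united states of america".split()
print(filter_ngrams(words, 2, stop_words))
```

With this ordering, every n-gram that survives is a pair of words that were truly adjacent in the original text, at the cost of dropping pairs that were separated only by a common word.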

Now that we have the cleaned text `cleaned_speech`, we can repeat the process with NLTK. The first step is to tokenize the text, which means splitting the speech into a list of individual words. 

`word_tokenize` is a function in the `nltk` package. It splits a string into word and punctuation tokens and returns a list of strings that is well suited to creating n-grams. Since our text has already been stripped of punctuation, the tokens here are simply words.

Finally, `filter` is used to go through all the strings and keep only those longer than one character. This is done because words like "i" and "a" are very common in English but add no value to this analysis. From the output below, we can see that `word_tokenize` generated a list of individual words from the speech.

In [123]:
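def tokenize(cleaned_speech):
    # word_tokenize is imported directly from nltk at the top of the notebook
    tokens = word_tokenize(cleaned_speech)
    # Keep only tokens longer than one character
    cleaned_list = list(filter(lambda token: len(token) > 1, tokens))
    return cleaned_list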

tokenized_list = tokenize(cleaned_speech)
tokenized_list[:10]

['were',
 'here',
 'today',
 'to',
 'discuss',
 'matters',
 'of',
 'vital',
 'importance',
 'to']

In [124]:
def create_ngrams(tokenized_list, n):
    # nltk.ngrams yields each n-gram as a tuple of tokens; Counter consumes the generator directly
    counts = Counter(nltk.ngrams(tokenized_list, n))
    # most_common() returns (ngram, count) pairs sorted by descending count
    return counts.most_common()
    
two_grams = create_ngrams(tokenized_list, 2)
two_grams[:10]


[(('of', 'the'), 47),
 (('in', 'the'), 29),
 (('to', 'the'), 22),
 (('of', 'our'), 21),
 (('we', 'are'), 20),
 (('we', 'have'), 17),
 (('and', 'the'), 15),
 (('the', 'people'), 14),
 (('on', 'the'), 13),
 (('the', 'world'), 11)]

## Lexicographical Analysis with NLTK