## Requirements for the tutorial notebook

1. One or two complete examples with functional Python code and real data. 
2. Cover all the key content from the chapter.
3. Do not simply reproduce the same code and explanations that you find in the text. Draw on knowledge from other parts of the course. 
4. Include a paragraph in the notebook that describes exactly what each member of your group contributed.
5. State whether or not your group lived up to the original agreements for collaborating.

## Motivation for the Topic

The code we have seen and worked with so far has given us the experience to gather data from webpages and store it. While this is useful in its own right, our main interests lie in the interactions between the internet and human forms of communication on the web and, more specifically, in learning to modify and control such interactions. In Data Challenge 2, we gathered 2-grams and 3-grams from the PLOS website and counted the most frequently occurring n-grams. In both cases, n-grams built from words like "of" and "the" occurred most frequently, which was not very useful information. We realized that merely knowing the top 2-grams and 3-grams was not sufficient. It would be more useful to be able to separate the useful or "context-appropriate" words from the list of n-grams. This was the motivation for choosing a chapter on reading and writing natural languages as our tutorial notebook.

In this tutorial, we are going to examine a speech by Trump. To do this, we first break the speech into 2-grams. NLTK is great for generating statistical information about word counts, word frequency, and word diversity in sections of text.

In [120]:
import re
import os
import string

from bs4 import BeautifulSoup
from urllib.request import urlopen
from collections import Counter
import nltk
from nltk import word_tokenize
from nltk import Text

The first step is to read in the speech, which is saved in the text file `speech.txt`.

In [121]:
path = './speech.txt'
# Use a context manager so the file is closed once it has been read
with open(path, mode='r') as file:
    speech = file.read()
speech[:1000]

"We're here today to discuss matters of vital importance to us all: America's security, prosperity, and standing in the world. I want to talk about where we've been, where we are now, and, finally, our strategy for where we are going in the years ahead.\n\nOver the past 11 months, I have travelled tens of thousands of miles to visit 13 countries. I have met with more than 100 world leaders. I have carried America's message to a grand hall in Saudi Arabia, a great square in Warsaw, to the General Assembly of the United Nations, and to the seat of democracy on the Korean Peninsula. Everywhere I travelled, it was my highest privilege and greatest honour to represent the American people.\n\nThroughout our history, the American people have always been the true source of American greatness. Our people have promoted our culture and promoted our values. Americans have fought and sacrificed on the battlefields all over the world. We have liberated captive nations, transformed former enemies int

Next, we need to clean the speech. From a small subset of the speech, we note the presence of the following:
* `\n` - newlines need to be removed so that they are not read as part of an n-gram.
* numbers - for the purpose of this example, numbers don't provide any significant information, so we remove them as well.
* upper case - for consistency and to prevent double counting, `Apple` and `apple` need to be counted as the same word, so we convert everything to lower case.
* special symbols like `,` and `.` - again for clarity in our analysis, we remove punctuation.

Note: We wished to preserve the `'` inside words like `we're` but were unable to do so. [Stack Overflow](https://stackoverflow.com/questions/24695092/how-to-not-remove-apostrophe-only-for-some-words-in-text-file-in-python) had a fairly complicated regex syntax for keeping such apostrophes that we could not get to work.

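In our `clean_speech` function we first remove the newline characters and then the special characters using `re.sub`. The regex `'[^A-Za-z ]+'` matches every run of characters other than upper and lower case letters and spaces, and `re.sub` replaces each match with an empty string. Because the newlines are also replaced with an empty string rather than a space, the last word of one paragraph is joined to the first word of the next (for example, `aheadover` in the output below).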

In [122]:
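def clean_speech(speech):
    re_newline = r'\n+'            # one or more newline characters
    re_specials = r'[^A-Za-z ]+'   # anything other than letters and spaces
    # Delete the newlines first, then every remaining non-letter, non-space character
    remove_newlines = re.sub(re_newline, '', speech)
    remove_special = re.sub(re_specials, '', remove_newlines)

    return remove_special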

cleaned_speech = clean_speech(speech).lower()
cleaned_speech[:1000]

'were here today to discuss matters of vital importance to us all americas security prosperity and standing in the world i want to talk about where weve been where we are now and finally our strategy for where we are going in the years aheadover the past  months i have travelled tens of thousands of miles to visit  countries i have met with more than  world leaders i have carried americas message to a grand hall in saudi arabia a great square in warsaw to the general assembly of the united nations and to the seat of democracy on the korean peninsula everywhere i travelled it was my highest privilege and greatest honour to represent the american peoplethroughout our history the american people have always been the true source of american greatness our people have promoted our culture and promoted our values americans have fought and sacrificed on the battlefields all over the world we have liberated captive nations transformed former enemies into the best of friends and lifted entire re
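
The two limitations noted above (lost apostrophes and words fused across paragraph breaks) could possibly be addressed with a slightly different cleaning function. The following is only a sketch of one approach and is not what produced the results in this notebook; the function name `clean_speech_keep_apostrophes` and the sample string are hypothetical. It replaces whitespace runs with a space, keeps apostrophes during the first pass, and then drops any apostrophe that does not sit between two letters.

```python
import re

def clean_speech_keep_apostrophes(text):
    # Hypothetical variant of clean_speech, shown for illustration only
    text = re.sub(r"\s+", " ", text)           # newlines and other whitespace -> one space
    text = re.sub(r"[^A-Za-z' ]+", "", text)   # keep only letters, apostrophes, and spaces
    # Remove apostrophes that are not surrounded by letters on both sides (e.g. quote marks)
    text = re.sub(r"(?<![A-Za-z])'|'(?![A-Za-z])", "", text)
    text = re.sub(r" +", " ", text).strip()    # collapse gaps left by removed numbers
    return text.lower()

sample = "We're here.\n\nOver 11 months, 'America's' plan"
print(clean_speech_keep_apostrophes(sample))
```

One caveat with this variant: NLTK's `word_tokenize` splits contractions such as `we're` into `we` and `'re`, so text cleaned this way would likely need a plain `str.split()` instead if the contractions are to stay whole.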

Now that we have a cleaned text `cleaned_speech`, we need to tokenize it. This involves splitting the speech into a list of individual words. 

`word_tokenize` is a function in the `nltk` package. It splits a string into word and punctuation tokens; since we have already stripped the punctuation, it returns a list of words that is well suited to creating n-grams.

Finally, `filter` is used to go through all the tokens and keep only those longer than one character. This is done because words like "I" and "a" are very common in English but add no value to this analysis. From the output below, we can see that `word_tokenize` generated a list of individual words from the speech.

In [123]:
def tokenize(cleaned_speech):
    # Split the cleaned text into individual word tokens
    tokens = word_tokenize(cleaned_speech)
    # Drop single-character tokens such as "i" and "a"
    cleaned_list = [token for token in tokens if len(token) > 1]
    return cleaned_list

tokenized_list = tokenize(cleaned_speech)
tokenized_list[:10]

['were',
 'here',
 'today',
 'to',
 'discuss',
 'matters',
 'of',
 'vital',
 'importance',
 'to']
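
The length filter can also be tried on its own, without NLTK. This is a minimal sketch on a made-up sample string; `str.split` stands in for `word_tokenize` here, which is an assumption that is only reasonable because the punctuation has already been stripped.

```python
# Hypothetical sample in the same form as cleaned_speech (lower case, no punctuation)
sample = "i have carried americas message to a grand hall"

tokens = sample.split()                                 # split on whitespace
kept = [token for token in tokens if len(token) > 1]    # drop one-character tokens

print(kept)
# ['have', 'carried', 'americas', 'message', 'to', 'grand', 'hall']
```

Note that the filter removes "i" and "a" but leaves other very common words such as "to" in place.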

In [124]:
def create_ngrams(tokenized_list, n):
    # Generate every run of n consecutive tokens as a tuple
    generated_grams = nltk.ngrams(tokenized_list, n)

    # Count each distinct n-gram and sort from most to least frequent
    counts = Counter(generated_grams)
    common_counts = counts.most_common()
    return common_counts
    
two_grams = create_ngrams(tokenized_list, 2)
two_grams[:10]


[(('of', 'the'), 47),
 (('in', 'the'), 29),
 (('to', 'the'), 22),
 (('of', 'our'), 21),
 (('we', 'are'), 20),
 (('we', 'have'), 17),
 (('and', 'the'), 15),
 (('the', 'people'), 14),
 (('on', 'the'), 13),
 (('the', 'world'), 11)]
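
To make it clearer what `create_ngrams` is counting, the same sliding window can be sketched in plain Python. This is an illustration under our own assumptions rather than NLTK's implementation; the helper name `sliding_ngrams` and the token list are hypothetical.

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    # Hypothetical helper: pair each token with the n - 1 tokens that follow it
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["we", "are", "here", "we", "are", "going"]

bigrams = sliding_ngrams(tokens, 2)
print(bigrams)
# [('we', 'are'), ('are', 'here'), ('here', 'we'), ('we', 'are'), ('are', 'going')]

# Same counting step as in create_ngrams
print(Counter(bigrams).most_common(1))
# [(('we', 'are'), 2)]
```

A list of `k` tokens yields `k - n + 1` n-grams, which is why the six tokens above produce five 2-grams.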

## Lexicographical Analysis with NLTK