# Advanced: Text Processing in Matrices

## Load Natural Language Toolkit for Parsing

In [1]:
! pip install nltk
import nltk

# Enter 'd' for Download, then 'punkt', and then 'q' for quit
nltk.download()


Collecting nltk
  Downloading nltk-3.2.2.tar.gz (1.2MB)
    100% |████████████████████████████████| 1.2MB 1.2MB/s eta 0:00:01
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/42/b5/27/718985cd9719e8a44a405d264d98214c7a607fb65f3a006f28
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.2.2
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> punkt
    Downloading package punkt to /home/jovyan/nltk_data...
      Unzipp

True

## Import text files into dictionary

As a "corpus" we fetched articles from Wikipedia on topics that were
trending on 2/18/2017.  Each topic has multiple interpretations.  We suspected
that some topics would "intersect" in interesting ways (e.g., Trump/Putin,
Cloud/Google, Cloud/Climate), while others are ambiguous on their own (e.g., there
are many types of Football).  See _Wikipedia.ipynb_ for the original download code.

For each of the following topics, we kept the top 10 matches returned by Wikipedia:

 * Pennsylvania
 * Trump
 * Apple
 * Google
 * Farm
 * Climate
 * Cloud
 * Football
 * Government
 * Putin

*docs* is a dictionary that maps each file name to the text of that file.

In [3]:
import os

docs = {}

for filename in os.listdir('text'):
    # The context manager closes each file once it has been read
    with open(os.path.join('text', filename)) as f:
        docs[filename] = f.read()
    print('Loaded', filename)

print("All files loaded")

Loaded Province of Pennsylvania.txt
Loaded Eric Trump.txt
Loaded Apple.txt
Loaded Putin khuilo!.txt
Loaded Cooking apple.txt
Loaded Apple TV.txt
Loaded Pennsylvania Historical and Museum Commission.txt
Loaded Football player.txt
Loaded Donald Trump.txt
Loaded Public image of Vladimir Putin.txt
Loaded Google Developers.txt
Loaded Alpine climate.txt
Loaded Desert climate.txt
Loaded Century Farm.txt
Loaded Apple Inc..txt
Loaded Animal Farm.txt
Loaded Google Books.txt
Loaded Google Account.txt
Loaded Oort cloud.txt
Loaded HP Cloud.txt
Loaded Farm Aid.txt
Loaded History of Pennsylvania.txt
Loaded E-government.txt
Loaded Trump University.txt
Loaded Outline of Pennsylvania.txt
Loaded Google Search.txt
Loaded Arrest of Vladimir Putin viral video.txt
Loaded AtGoogleTalks.txt
Loaded Cloud computing.txt
Loaded Government of Australia.txt
Loaded Government.txt
Loaded Family of Donald Trump.txt
Loaded Stratus cloud.txt
Loaded Brook Farm.txt
Loaded Google.txt
Loaded Wind farm.txt
Loaded Subarctic cl

## Other preliminaries to get you started

Use the function *has_letter* to filter out tokens that contain no letters at all, such as punctuation marks and bare numbers.

The set *stopwords* includes words to ignore.

In [4]:
import nltk
from nltk.stem.porter import PorterStemmer
import re
import numpy as np

def has_letter(x):
    """
    Returns True if the input (string) parameter has
    any ASCII letter (a-z or A-Z) in it, else returns False.
    """
    return re.search('[a-zA-Z]', x) is not None

# Stopwords are words we will ignore for search
# purposes, because they are too common to be useful
stopwords = set()

# The context manager closes the file after reading
with open('stopwords.txt') as stop_file:
    for line in stop_file:
        stopwords.add(line.strip())

# The NLTK parser breaks apostrophe-s into a separate "word"
# so we'll want to add it to the list... Though it's technically
# not a stop word in the traditional sense.
stopwords.add("'s")

# Use this as the maximum number of words we will index
MAX_WORDS = 18174

# Create the word stemmer
stemmer = PorterStemmer()

# Your Code Goes Here!
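
As a starting point, the sketch below shows one way the helpers above might be combined to turn a token list into index terms. It is only an illustration, not the required solution: the function name `normalize`, the sample tokens, and the sample stopword set are all hypothetical, and it assumes the entries in *stopwords* are lowercase. The `stem` parameter defaults to a do-nothing placeholder so the block runs without NLTK; in the notebook you would likely pass `stemmer.stem` and take the tokens from the NLTK tokenizer.

```python
import re

def has_letter(token):
    # Same idea as the notebook's helper: True if any ASCII letter is present.
    return re.search('[a-zA-Z]', token) is not None

def normalize(tokens, stopwords, stem=lambda w: w):
    """Lowercase, drop letterless tokens and stopwords, then stem what is left.

    `stem` is a placeholder (identity by default); this is an assumption made
    so the sketch is self-contained.
    """
    kept = []
    for token in tokens:
        word = token.lower()
        if has_letter(word) and word not in stopwords:
            kept.append(stem(word))
    return kept

# Hand-written tokens standing in for tokenizer output (hypothetical sample).
sample_tokens = ['The', 'cloud', "'s", 'height', 'is', '2,000', 'm', '.']
sample_stopwords = {'the', 'is', "'s"}

terms = normalize(sample_tokens, sample_stopwords)
print(terms)
```

Filtering before stemming keeps the stopword lookup simple, since a stemmed word may no longer match the form stored in the stopword file.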

Note that you may want to read more about TF*IDF scoring at:

* http://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html
* https://en.wikipedia.org/wiki/Tf%E2%80%93idf
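
To make the scoring concrete, here is a minimal sketch of one common TF*IDF variant: raw term counts multiplied by `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. Other weightings exist, and the assignment may expect a different one. The function name `tfidf_matrix` and the toy documents are hypothetical. Plain Python lists are used so the block runs on its own; in the notebook the same term-by-document layout would more naturally live in a NumPy array.

```python
import math
from collections import Counter

def tfidf_matrix(term_lists):
    """Return (vocab, weights): one row per term, one column per document."""
    vocab = sorted({term for terms in term_lists for term in terms})
    row = {term: i for i, term in enumerate(vocab)}
    n_docs = len(term_lists)

    # Term-frequency matrix: counts[i][j] = occurrences of term i in document j.
    counts = [[0] * n_docs for _ in vocab]
    for j, terms in enumerate(term_lists):
        for term, count in Counter(terms).items():
            counts[row[term]][j] = count

    # Scale each row by its inverse document frequency (natural log).
    weights = []
    for term_counts in counts:
        df = sum(1 for c in term_counts if c > 0)  # documents containing the term
        idf = math.log(n_docs / df)
        weights.append([c * idf for c in term_counts])
    return vocab, weights

# Toy, already-stemmed documents (hypothetical data, not the Wikipedia corpus).
toy_docs = [
    ['cloud', 'comput', 'googl'],
    ['cloud', 'climat', 'cloud'],
    ['googl', 'search'],
]
vocab, weights = tfidf_matrix(toy_docs)
for term, scores in zip(vocab, weights):
    print(term, [round(s, 3) for s in scores])
```

A term that appears in every document gets an IDF of `log(1) = 0`, so it contributes nothing to any score. This is the same intuition behind dropping stopwords.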