# Preprocessing

Like other data types, text data never comes clean. Moreover, most downstream methods only accept data structured in a particular way. Because of this, we always need to do some preprocessing before applying any computational text analysis technique, and text data calls for its own kinds of preprocessing. In this notebook, we will cover the core preprocessing methods in preparation for our next two weeks:

- Reading in files
- Tokenization
- Sentence segmentation
- Removing punctuation
- Stripping whitespace
- Text normalization
- Stop words
- Stemming/Lemmatizing
- POS tagging
- DTM/TF-IDF


## Reading in files

The first step is to read in the files containing the data. As we discussed last week, the most common file types for text data are: `.txt`, `.csv`, `.json`, `.html` and `.xml`.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

- What type of object is `raw`?
- How many characters are in `raw`?
- How can we get the first 1000 characters of `raw`?

In [5]:
import os
DATA_DIR = 'data'
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()
    
fname

'data\\pride-and-prejudice.txt'
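
The questions above can be answered with `type`, `len` and slicing. Here is a minimal sketch that uses a short sample string written to a temporary file in place of the novel (the sample text, the temporary file and the `sample`-prefixed names are assumptions for illustration):

```python
import os
import tempfile

# Stand-in for the novel: a short sample string written to a temporary file.
sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_fname = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_fname, 'w', encoding='utf-8') as f:
        f.write(sample)
    with open(sample_fname, encoding='utf-8') as f:
        raw_sample = f.read()

print(type(raw_sample))   # f.read() returns a str
print(len(raw_sample))    # number of characters in the string
print(raw_sample[:20])    # slicing returns the first 20 characters
```

The same three calls work on `raw`; only the slice length changes.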

#### Reading in `.csv`

Python has a built-in module called `csv` for reading in csv files.

- What type is `tweets`?
- How many entries are in `tweets`?
- Which entry is the header row?
- How can we get the text of the first tweet?
- How can we get a list of the texts of all tweets?

In [10]:
import csv
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = []
import codecs
# errors='ignore' silently drops bytes that are not valid UTF-8 instead of raising an error
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:
    reader = csv.reader(f)
    tweets = list(reader)  # a list of rows, each row a list of strings


['16-11-11',
 '13:33:35',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'text',
 '',
 '',
 '7.97E+17',
 'https://twitter.com/realDonaldTrump/status/797069763801387008',
 '141527',
 '28654',
 '',
 '']
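
To make the header row and column lookup concrete, here is a small sketch on made-up rows. `io.StringIO` stands in for the open file, and the two-column layout is an assumption; the real file has more columns.

```python
import csv
import io

# Made-up rows standing in for the real file; io.StringIO behaves like an open file.
sample_csv = io.StringIO(
    'Date,Tweet_Text\n'
    '16-11-11,First example tweet\n'
    '16-11-10,"Second example, with a comma"\n'
)
rows = list(csv.reader(sample_csv))

header = rows[0]                              # the first entry is the header row
text_idx = header.index('Tweet_Text')         # position of the text column
texts = [row[text_idx] for row in rows[1:]]   # skip the header, keep one column

print(type(rows), len(rows))
print(texts)
```

Note that `csv.reader` handles the quoted comma correctly, which a plain `line.split(',')` would not.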

#### Reading in `.csv` with `pandas`

`pandas` is a third-party library that makes working with tabular data much easier. This is the recommended way to read in a `.csv` file.

- How many tweets are there?
- What happened to the header row?

In [12]:
import os
import pandas as pd
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = pd.read_csv(fname) 

In [13]:
tweets.head(3)

| | Date | Time | Tweet_Text | Type | Media_Type | Hashtags | Tweet_Id | Tweet_Url | twt_favourites_IS_THIS_LIKE_QUESTION_MARK | Retweets | Unnamed: 10 | Unnamed: 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16-11-11 | 15:26:37 | Today we express our deepest gratitude to all ... | text | photo | ThankAVet | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 127213 | 41112 | | |
| 1 | 16-11-11 | 13:33:35 | Busy day planned in New York. Will soon be mak... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 141527 | 28654 | | |
| 2 | 16-11-11 | 11:14:20 | Love the fact that the small groups of protest... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 183729 | 50039 | | |


In [14]:
tweet_text = list(tweets['Tweet_Text'])
tweet_text[:4]

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!']

#### Reading in `.json` files

Python has built-in support for reading in `.json` files.

- How many questions are there in the dataset?
- What data type is each question?
- How can we access the question text of the first question?
- How can we get a list of the texts of all questions?

In [18]:
import json
fname = 'jeopardy.json'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    data = json.load(f)

In [19]:
data[:4]

[{'category': 'HISTORY',
  'air_date': '2004-12-31',
  'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
  'value': '$200',
  'answer': 'Copernicus',
  'round': 'Jeopardy!',
  'show_number': '4680'},
 {'category': "ESPN's TOP 10 ALL-TIME ATHLETES",
  'air_date': '2004-12-31',
  'question': "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'",
  'value': '$200',
  'answer': 'Jim Thorpe',
  'round': 'Jeopardy!',
  'show_number': '4680'},
 {'category': 'EVERYBODY TALKS ABOUT IT...',
  'air_date': '2004-12-31',
  'question': "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'",
  'value': '$200',
  'answer': 'Arizona',
  'round': 'Jeopardy!',
  'show_number': '4680'},
 {'category': 'THE COMPANY LINE',
  'air_date': '2004-12-31',
  'question': '\'In 1963, live on "The Art Linkletter Show", this company served its billionth burger\

In [24]:
df = pd.DataFrame(data)

df.head(4)

| | category | air_date | question | value | answer | round | show_number |
|---|---|---|---|---|---|---|---|
| 0 | HISTORY | 2004-12-31 | 'For the last 8 years of his life, Galileo was... | $200 | Copernicus | Jeopardy! | 4680 |
| 1 | ESPN's TOP 10 ALL-TIME ATHLETES | 2004-12-31 | 'No. 2: 1912 Olympian; football star at Carlis... | $200 | Jim Thorpe | Jeopardy! | 4680 |
| 2 | EVERYBODY TALKS ABOUT IT... | 2004-12-31 | 'The city of Yuma in this state has a record a... | $200 | Arizona | Jeopardy! | 4680 |
| 3 | THE COMPANY LINE | 2004-12-31 | 'In 1963, live on "The Art Linkletter Show", t... | $200 | McDonald\'s | Jeopardy! | 4680 |
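
The questions about the JSON data come down to indexing a list of dictionaries. Here is a minimal sketch on two made-up records shaped like the Jeopardy entries (the record contents and the names `sample_json`, `records` and `questions` are assumptions for illustration):

```python
import json

# Two made-up records in the same shape as the Jeopardy data.
sample_json = '''
[{"category": "HISTORY", "question": "First example clue", "answer": "Example A"},
 {"category": "SCIENCE", "question": "Second example clue", "answer": "Example B"}]
'''
records = json.loads(sample_json)   # json.load(f) does the same for an open file

print(len(records))                 # number of questions
print(type(records[0]))             # each question is a dict
print(records[0]['question'])       # text of the first question

questions = [rec['question'] for rec in records]  # texts of all questions
print(questions)
```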


#### Reading in `.html` files

The best way to read in `.html` files in Python is with the `BeautifulSoup` package.

In [25]:
from bs4 import BeautifulSoup
fname = 'time.html'
fname = os.path.join(DATA_DIR, fname)
import codecs
# errors='ignore' drops bytes that are not valid UTF-8;
# "html" lets BeautifulSoup pick an installed HTML parser
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:
    soup = BeautifulSoup(f, "html")

In [27]:
texts = soup.find_all(string=True)  # every text node in the page, scripts included
texts

['html',
 '\n',
 '\n',
 '\n',
 'Time - Wikipedia',
 '\n',
 'document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );',
 '\n',
 '(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Time","wgTitle":"Time","wgCurRevisionId":825866365,"wgRevisionId":825866365,"wgArticleId":30012,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with inconsistent citation formats","CS1 maint: Multiple names: authors list","Webarchive template wayback links","Wikipedia indefinitely semi-protected pages","Use dmy dates from November 2012","Articles containing potentially dated statements from May 2010","All articles containing potentially dated statements","Wikipedia articles needing clarification from February 2014","All articles with unsourced statements","Articl

#### Reading in `.xml` files

We read in `.xml` files using the `BeautifulSoup` package as well. We can think of `.xml` files as trees where each branch has a tag name. We can find all the branches with a certain name as follows:

In [28]:
fname = 'books.xml'
fname = os.path.join(DATA_DIR, fname)
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:
    # 'lxml' uses lxml's HTML parser; 'xml' would select its dedicated XML parser
    soup = BeautifulSoup(f, 'lxml')

In [29]:
descriptions = soup.find_all('description')
text = [x.get_text() for x in descriptions] ## list comprehension
text[:3]

['An in-depth look at creating applications \r\n      with XML.',
 'A former architect battles corporate zombies, \r\n      an evil sorceress, and her own childhood to become queen \r\n      of the world.',
 'After the collapse of a nanotechnology \r\n      society in England, the young survivors lay the \r\n      foundation for a new society.']
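
The standard library's `xml.etree.ElementTree` module offers another way to walk the same kind of tree. Here is a minimal sketch on a tiny made-up catalog (every tag name other than `description`, and all of the contents, are assumptions for illustration):

```python
from xml.etree import ElementTree as ET

# A tiny made-up catalog; the real books.xml is larger.
sample_xml = """<catalog>
  <book id="bk1"><title>Example One</title><description>First   description.</description></book>
  <book id="bk2"><title>Example Two</title><description>Second
      description.</description></book>
</catalog>"""

root = ET.fromstring(sample_xml)

# iter() visits every element with the given tag name, at any depth.
sample_descriptions = [el.text for el in root.iter('description')]

# Collapse the line breaks and indentation left over from the file layout.
sample_descriptions = [' '.join(d.split()) for d in sample_descriptions]
print(sample_descriptions)
```

The last step also shows one way to deal with the `\r\n` and indentation visible in the output above.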

#### Reading in multiple files

Often, our text data is split across multiple files in a folder. We want to be able to read them all into a single variable.

- What type is `austen`?
- What type is `fnames` after it is first assigned a value?
- What type is `fnames` after it is assigned a second value?
- How 

In [32]:
import glob
fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(fnames)
austen = ''
for fname in fnames:
    with codecs.open(fname, "r", encoding='utf-8-sig', errors='ignore') as f:
        text = f.read()
        austen += text
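
A variation on the loop above keeps each file's text as a separate list element and joins them at the end. This is a sketch only: the temporary directory and the two small files are created just so the example can run on its own.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small made-up files to read back in.
    for name, content in [('a.txt', 'First file.'), ('b.txt', 'Second file.')]:
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
            f.write(content)

    # glob returns matches in arbitrary order, so sort for reproducibility.
    sample_fnames = sorted(glob.glob(os.path.join(tmp_dir, '*.txt')))

    sample_texts = []
    for sample_fname in sample_fnames:
        with open(sample_fname, encoding='utf-8') as f:
            sample_texts.append(f.read())

# Joining once at the end keeps a separator between the files.
combined = '\n'.join(sample_texts)
print(combined)
```

Keeping the list around is handy when you later need to know which text came from which file.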

### Challenge 

Read in all the `.csv` files in the folder `amazon`. Extract out only the text column from THE FIRST TWO files and store them all in a list called `reviews`.

In [48]:
#Read in files from "amazon"
fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
fnames = glob.glob(fnames)

#Create output list
reviews = []

fnames_2 = fnames[:2]
cols = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

for x in fnames_2:
    file = pd.read_csv(x, names=cols)
    col_text = list(file['10'])
    reviews.append(col_text)

reviews[0][1]



'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

## Tokenization

Once we've read in the data, our next step is often to split it into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.
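
The token/type distinction is easy to check on a toy sentence. A minimal sketch (the sentence and the `sample_` names are made up for illustration):

```python
sentence = "the cat sat on the mat near the door"

sample_tokens = sentence.split()    # every occurrence counts as a token
sample_types = set(sample_tokens)   # each distinct word counts once

print(len(sample_tokens))           # number of tokens
print(len(sample_types))            # number of types
print(sample_tokens.count('the'))   # tokens of the type "the"
```

Here the sentence has 9 tokens but only 7 types, because the type "the" accounts for 3 tokens.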

#### Tokenizing by whitespace

- What problems do you notice with tokenizing by whitespace?
- What type is `text`?
- What type is `tokens`?
- What type is each element of `tokens`?

In [49]:
import os
fname = 'example1.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [50]:
text

"In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way. \n"

In [51]:
text.split()[:10] #Split by whitespaces

['In',
 'this',
 'little',
 'example,',
 "we're",
 'going',
 'to',
 'see',
 'some',
 'of']

#### Tokenizing with regular expressions

In [52]:
import re
word_pattern = r'\w+' #Regular expression, see regex101.com
tokens = re.findall(word_pattern, text)
tokens[:10]

['In', 'this', 'little', 'example', 'we', 're', 'going', 'to', 'see', 'some']
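
Notice that `\w+` splits "we're" into "we" and "re". One possible refinement, sketched here on a made-up sentence, allows an apostrophe inside a word (the pattern name `contraction_pattern` is hypothetical):

```python
import re

sample = "In this little example, we're going to see why don't and would've are tricky."

# \w+ followed by zero or more (apostrophe + \w+) groups keeps contractions whole.
# The group is non-capturing, so findall returns the full match.
contraction_pattern = r"\w+(?:'\w+)*"

contraction_tokens = re.findall(contraction_pattern, sample)
print(contraction_tokens[:6])
```

Whether "we're" should stay one token or become two depends on the downstream task, so treat this as one option rather than the right answer.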

#### Tokenizing with `nltk`

[Just a bunch of regular expressions under the hood](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py)

In [54]:
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
tokens[:10]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ruschenpohler\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


['In', 'this', 'little', 'example', ',', 'we', "'re", 'going', 'to', 'see']

### Challenge

A while ago you read a bunch of Jane Austen books into a variable called `austen`. Tokenize that using a method of your choice. Find all the unique word types (you might want the `set` function). Sort the resulting set object to create a vocabulary (you might want to use the `sorted` function).

In [61]:
#tokens = word_tokenize(austen)

word_pattern = r'\w+'
tokens = re.findall(word_pattern, austen)


vocabulary = sorted(set(tokens))
vocabulary[:1000]


['0',
 '000',
 '1',
 '10',
 '105',
 '10th',
 '11',
 '11th',
 '12',
 '121',
 '1212',
 '12th',
 '13',
 '1399',
 '13th',
 '14',
 '141',
 '14th',
 '15',
 '1500',
 '158',
 '1586',
 '15th',
 '16',
 '161',
 '16th',
 '17',
 '1760',
 '1784',
 '1785',
 '1787',
 '1789',
 '1790',
 '1791',
 '1792',
 '1797',
 '18',
 '1800',
 '1803',
 '1806',
 '1810',
 '1811',
 '1814',
 '1816',
 '1818',
 '1887',
 '18th',
 '19',
 '1994',
 '1997',
 '1998',
 '1ST',
 '1st',
 '2',
 '20',
 '200',
 '2001',
 '2008',
 '2010',
 '2012',
 '2015',
 '2016',
 '20th',
 '21',
 '22',
 '22nd',
 '23',
 '23rd',
 '24',
 '24th',
 '25',
 '25th',
 '26',
 '26th',
 '27',
 '27th',
 '28',
 '28th',
 '29',
 '29th',
 '2d',
 '2nd',
 '3',
 '30',
 '31',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '3d',
 '3rd',
 '4',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '4557',
 '46',
 '47',
 '48',
 '49',
 '4TH',
 '4th',
 '5',
 '50',
 '501',
 '50L',
 '55',
 '57',
 '596',
 '5th',
 '6',
 '60',
 '6221541',
 '64',
 '6th',
 '7',
 '7000L',
 '7th',
 '8',
 '

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences.

#### Sentence segmentation by splitting on punctuation

In [64]:
text.split('.')

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 " Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way",
 " \n\nWe can split text into sentences using punctuation, but unfortunately that's not always going to work",
 ' For example, if I wanted to tell you about Dr',
 ' Frankenstein, or Mrs',
 " Doubtfire, we'd be in trouble",
 ' What if I wanted to write about U',
 'C',
 ' Berkeley? When you think about it, URLs like www',
 'google',
 'com are troublesome too',
 ' How would we settle on a price of $10',
 '50? The main point is that these punctuation characters serve a variety of purposes in writing',
 ' Moreover, the functions they serve change depending on the domain (medical vs forum text) and language',
 '']

We could improve on this by using regular expressions, which let us split a string on any one of several characters.

In [65]:
sent_boundary_pattern = r'[.?!]'
re.split(sent_boundary_pattern, text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 ' Why is it so hard',
 " Punctuations, contractions (like don't, won't and would've) get in the way",
 " \n\nWe can split text into sentences using punctuation, but unfortunately that's not always going to work",
 ' For example, if I wanted to tell you about Dr',
 ' Frankenstein, or Mrs',
 " Doubtfire, we'd be in trouble",
 ' What if I wanted to write about U',
 'C',
 ' Berkeley',
 ' When you think about it, URLs like www',
 'google',
 'com are troublesome too',
 ' How would we settle on a price of $10',
 '50',
 ' The main point is that these punctuation characters serve a variety of purposes in writing',
 ' Moreover, the functions they serve change depending on the domain (medical vs forum text) and language',
 '']

### Challenge

The file `example2.txt` has more punctuation problems. Read it in and see what the problems are. Try your best to modify the code from above to work for as many cases as you can.

In [66]:
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
re.split(sent_boundary_pattern, text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 ' Why is it so hard',
 " Punctuations, contractions (like don't, won't and would've) get in the way",
 " \n\nWe can split text into sentences using punctuation, but unfortunately that's not always going to work",
 ' For example, if I wanted to tell you about Dr',
 ' Frankenstein, or Mrs',
 " Doubtfire, we'd be in trouble",
 ' What if I wanted to write about U',
 'C',
 ' Berkeley',
 ' When you think about it, URLs like www',
 'google',
 'com are troublesome too',
 ' How would we settle on a price of $10',
 '50',
 ' The main point is that these punctuation characters serve a variety of purposes in writing',
 ' Moreover, the functions they serve change depending on the domain (medical vs forum text) and language',
 '']
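
One way to attempt the challenge is to split only where sentence-final punctuation is followed by whitespace and a capital letter. This is a rough sketch on a made-up sample string; the pattern name `better_boundary_pattern` is hypothetical, and abbreviations like "Dr." still trip it up.

```python
import re

sample = ("Why is it so hard? URLs like www.google.com are troublesome. "
          "A price of $10.50 is fine. Dr. Frankenstein still breaks it.")

# Split on whitespace that follows ., ? or ! and precedes a capital letter.
# The lookbehind and lookahead keep the punctuation and the capital in the output.
better_boundary_pattern = r'(?<=[.?!])\s+(?=[A-Z])'

sample_sentences = re.split(better_boundary_pattern, sample)
for s in sample_sentences:
    print(repr(s))
```

The URL and the price survive because no whitespace follows their periods, but "Dr." is still cut off from "Frankenstein".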

#### Sentence segmentation by `nltk`

In [67]:
from nltk.tokenize import sent_tokenize
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
sent_tokenize(text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization.",
 "Tokenization may seem simple, but it's harder than it first appears.",
 'Why is it so hard?',
 "Punctuations, contractions (like don't, won't and would've) get in the way.",
 "We can split text into sentences using punctuation, but unfortunately that's not always going to work.",
 "For example, if I wanted to tell you about Dr. Frankenstein, or Mrs. Doubtfire, we'd be in trouble.",
 'What if I wanted to write about U.C.',
 'Berkeley?',
 'When you think about it, URLs like www.google.com are troublesome too.',
 'How would we settle on a price of $10.50?',
 'The main point is that these punctuation characters serve a variety of purposes in writing.',
 'Moreover, the functions they serve change depending on the domain (medical vs forum text) and language.']

## Removing punctuation

Sometimes (although admittedly less frequently than tokenizing and sentence segmentation), you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. Here's how we can do that.

- What type is `punctuation`?

In [68]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [69]:
no_punct = ''.join([ch for ch in text if ch not in punctuation])
no_punct

'In this little example were going to see some of the problems that regularly appear in tokenization Tokenization may seem simple but its harder than it first appears Why is it so hard Punctuations contractions like dont wont and wouldve get in the way \n\nWe can split text into sentences using punctuation but unfortunately thats not always going to work For example if I wanted to tell you about Dr Frankenstein or Mrs Doubtfire wed be in trouble What if I wanted to write about UC Berkeley When you think about it URLs like wwwgooglecom are troublesome too How would we settle on a price of 1050 The main point is that these punctuation characters serve a variety of purposes in writing Moreover the functions they serve change depending on the domain medical vs forum text and language'
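
An alternative to the character-by-character join is `str.translate`, which deletes every character listed in a translation table. A minimal sketch on a made-up sentence:

```python
from string import punctuation

sample = "Why is it so hard? Punctuations, contractions (like don't) get in the way."

# str.maketrans('', '', chars) builds a table that deletes every character in chars.
table = str.maketrans('', '', punctuation)

sample_no_punct = sample.translate(table)
print(sample_no_punct)
```

Both approaches give the same result; `translate` tends to be the faster choice on long strings.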

## Strip whitespace

This is an extremely common step. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [70]:
string = ' Hello! '
string.strip()

'Hello!'

In [71]:
fname = 'example3.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
print(text)



This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.






In [72]:
stripped_text = text.strip()
print(stripped_text)
#This method only strips whitespace before and after a string

This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.


In [73]:
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text.strip()
#With regular expressions, one can define the kinds of whitespace one wants to strip

'This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines. The Python method called "strip" only catches whitespace at the start and end of a string. But it won\'t catch it in the middle, for example, in this sentence. Once again, regular expressions will help us with this.'
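
For the common case of collapsing every run of whitespace to a single space, `split` followed by `join` gives the same result without a regular expression. A small sketch on a made-up string:

```python
import re

messy = "\n\n  help\t\tus    with this.  \n"

via_regex = re.sub(r'\s+', ' ', messy).strip()

# split() with no argument splits on runs of whitespace and drops empty strings.
via_split = ' '.join(messy.split())

print(repr(via_regex))
print(via_regex == via_split)
```

The regular expression remains the better tool when only some kinds of whitespace should be replaced, for example tabs but not newlines.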

## Text normalization

Text normalization means making our text fit some standard patterns. Lots of steps come under this wide umbrella, but the most common are:

- case folding
- removing URLs, digits, hashtags
- handling out-of-vocabulary (OOV) words by removing infrequent words

#### Case folding

Case folding means dealing with upper- and lower-case characters. This is usually done by making all characters lowercase.

In [74]:
fname = 'example4.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
text

'Upper and lower case characters can be annoying. Characters are the individual letters and numbers that we see on the page. Case folding is the generic term we use for dealing with upper and lower case characters. Lower case is often what people want. Title Case refers to a multi-word expression with the first character of every word in upper case. '

In [82]:
text.lower()

'upper and lower case characters can be annoying. characters are the individual letters and numbers that we see on the page. case folding is the generic term we use for dealing with upper and lower case characters. lower case is often what people want. title case refers to a multi-word expression with the first character of every word in upper case. '

### Challenge
The `lower` method we used above is a string method, that is, it works on strings. But what if you want to lowercase every word in a list (say you've already tokenized the text)? Take the list of tokens below and make each one lower case.

In [91]:
tokens = word_tokenize(text)
tokens_lc = []

for t in tokens:
    lc = t.lower()
    tokens_lc.append(lc)

tokens_lc[0]

'upper'

### Removing URLs, digits and hashtags

We rarely care about the exact URL used in a tweet, or the exact number. We could remove them completely (think about how we'd do that), but it's often informative to know that there is a URL or a digit in the text. So we want to replace individual URLs and digits with a symbol that preserves the fact that a URL or digit was there. It's standard to just use the strings "URL" and "DIGIT".

How do we do this? Once again, regular expressions save the day.

In [92]:
url_pattern = r'https?:\/\/.*[\r\n]*'
single_tweet = tweet_text[0]
single_tweet

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z'

In [94]:
re.sub(url_pattern, ' URL ', single_tweet)

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL '
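
One caveat: the `.*` in `url_pattern` is greedy, so the match runs from the start of the URL to the end of the line, and any words after the URL are swallowed too. Here is a sketch of a narrower alternative, assuming a URL ends at the first whitespace character (both pattern names and the sample tweet are made up for illustration):

```python
import re

greedy_url_pattern = r'https?:\/\/.*[\r\n]*'   # same pattern as url_pattern above
narrow_url_pattern = r'https?://\S+'           # assumption: a URL ends at whitespace

sample_tweet = 'Read this https://example.com/page before you vote!'

greedy_result = re.sub(greedy_url_pattern, ' URL ', sample_tweet)
narrow_result = re.sub(narrow_url_pattern, ' URL ', sample_tweet)

print(repr(greedy_result))
print(repr(narrow_result))
```

With the greedy pattern "before you vote!" disappears; with the narrow one it is kept. This matters most when several tweets are joined into one long string.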

Above we replaced the URL in a single tweet. Now we will replace all the URLs in all tweets in `tweet_text`.

In [95]:
url_pattern = r'https?:\/\/.*[\r\n]*'
URL_SIGN = ' URL '
list_of_url_less_tweets = []
## Using a for loop
for tweet in tweet_text:
    url_less_tweet = re.sub(url_pattern, URL_SIGN, tweet)
    list_of_url_less_tweets.append(url_less_tweet)
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment

In [96]:
## Alternative using list comprehension
list_of_url_less_tweets = [re.sub(url_pattern, URL_SIGN, tweet) for tweet in tweet_text]
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment

Now let's remove hashtags and digits.

In [97]:
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
digit_pattern = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged

In [98]:
no_hashtags = [re.sub(hashtag_pattern, ' HASHTAG ', tweet) for tweet in tweet_text]
no_hashtags

['Today we express our deepest gratitude to all those who have served in our armed forces. HASHTAG  https://t.co/wPk7QWpK8Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz2dhrXzo4',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm. HASHTAG  HASHTAG  https://t.co/HfuJeRZ

In [99]:
no_digit = [re.sub(digit_pattern, ' DIGIT ', tweet) for tweet in tweet_text]
no_digit

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk DIGIT QWpK DIGIT Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy  DIGIT st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz DIGIT dhrXzo DIGIT ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at  DIGIT : DIGIT p

#### OOV words

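OOV stands for "out of vocabulary". Sometimes it's best for us to remove infrequent words (sometimes not!). When we do remove infrequent words, it's often because a downstream method (like classification) is sensitive to rare words. Here we replace every word that occurs only once with the placeholder string "OOV".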

In [100]:
all_tweets = ' '.join(tweet_text)
clean = re.sub(url_pattern, ' URL ', all_tweets)
clean = re.sub(hashtag_pattern, ' HASHTAG ', clean)
clean = re.sub(digit_pattern, ' DIGIT ', clean)
tokens = word_tokenize(clean)
tokens = [token for token in tokens if token not in punctuation]
tokens[:20]

['Today',
 'we',
 'express',
 'our',
 'deepest',
 'gratitude',
 'to',
 'all',
 'those',
 'who',
 'have',
 'served',
 'in',
 'our',
 'armed',
 'forces',
 'HASHTAG',
 'URL',
 'HASHTAG',
 'HASHTAG']

We can count the frequency of each word type with the `Counter` class from Python's built-in `collections` module. Conceptually, it takes each word type (like those in the `vocabulary` we built above) and stores it in a special Python dictionary whose value is the number of times that type appears in the list. We can ask that dictionary for the most common words, or for the frequency of individual word types.

In [101]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

[('URL', 932),
 ('HASHTAG', 717),
 ('DIGIT', 258),
 ('the', 87),
 ('in', 76),
 ('to', 72),
 ('of', 61),
 ('you', 57),
 ('I', 56),
 ('is', 54)]

In [102]:
freq['Missouri']

3

In [103]:
OOV = 'OOV'
new_tokens = []
for token in tokens:
    if freq[token] == 1:
        new_tokens.append(OOV)
    else:
        new_tokens.append(token)

In [104]:
new_tokens[:20]

['OOV',
 'we',
 'OOV',
 'our',
 'OOV',
 'OOV',
 'to',
 'all',
 'those',
 'who',
 'have',
 'OOV',
 'in',
 'our',
 'OOV',
 'OOV',
 'HASHTAG',
 'URL',
 'HASHTAG',
 'HASHTAG']
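
The cutoff of exactly one occurrence is a choice, not a rule. Here is a sketch of the same idea with an adjustable threshold (the function `replace_rare` and its parameter `min_count` are hypothetical names, and the tokens are made up):

```python
from collections import Counter

def replace_rare(tokens, min_count=2, oov='OOV'):
    """Replace tokens seen fewer than min_count times with the oov symbol."""
    freq = Counter(tokens)
    return [tok if freq[tok] >= min_count else oov for tok in tokens]

sample_tokens = ['we', 'will', 'win', 'we', 'will', 'celebrate']

default_result = replace_rare(sample_tokens)               # drops words seen once
strict_result = replace_rare(sample_tokens, min_count=3)   # drops words seen < 3 times

print(default_result)
print(strict_result)
```

With the default threshold the result matches the loop above: only words that occur once are replaced.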

### Challenge 

Earlier, we read some Amazon reviews into a list called `reviews`. Each element of the list is a string, representing the text of a single review. Try to:
- Tokenize each review
- Strip all whitespace
- Make all characters lower case
- Replace any URLs and digits

Then find the most common 50 words.
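
One possible shape for a solution is sketched below on two made-up reviews rather than the real data. The helper `normalize`, the sample reviews and the simple `\w+` tokenizer are all assumptions; `word_tokenize` would work as well.

```python
import re
from collections import Counter

# Two made-up reviews standing in for the real data.
sample_reviews = [
    '  Great product, bought 2 cans. See https://example.com/item  ',
    'Great price! My dog ate 10 cans.',
]

def normalize(review):
    review = review.strip().lower()                     # strip whitespace, case fold
    review = re.sub(r'https?://\S+', ' URL ', review)   # replace URLs first
    review = re.sub(r'\d+', ' DIGIT ', review)          # then digits
    return re.findall(r'\w+', review)                   # simple regex tokenizer

sample_freq = Counter(tok for review in sample_reviews for tok in normalize(review))
print(sample_freq.most_common(3))
```

Lowercasing happens before the replacements so that the placeholders "URL" and "DIGIT" stay upper case, and URLs are replaced before digits so that digits inside a URL are not touched.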

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "am", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

- What other stop words do you think there are?

In [108]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop = stopwords.words('english')
stop

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ruschenpohler\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### Challenge 

Use the list `stop` of English stopwords to remove stopwords from our tokenized review above.

In [109]:
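# `tokens` is a flat list of strings, so iterate over it directly;
# indexing with tokens[1] would loop over the characters of one word
tokens_nostop = [token for token in tokens if token not in stop]
tokens_nostop[:10]

['Today',
 'express',
 'deepest',
 'gratitude',
 'served',
 'armed',
 'forces',
 'HASHTAG',
 'URL',
 'HASHTAG']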
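
Two details are worth keeping in mind when filtering stop words: the `nltk` list is all lower case, and membership tests are faster on a set than on a list. Here is a minimal sketch with a tiny hand-picked stop list (the list `sample_stop` and the tokens are made up for illustration; `nltk`'s list is much longer):

```python
# A tiny hand-picked stop list for illustration only.
sample_stop = {'we', 'our', 'to', 'all', 'the', 'in'}

sample_tokens = ['Today', 'we', 'express', 'our', 'gratitude', 'to', 'All', 'veterans']

# Compare the lower-cased token so that 'All' is caught as well;
# the original casing of the kept tokens is preserved.
kept = [tok for tok in sample_tokens if tok.lower() not in sample_stop]
print(kept)
```

With the real list, `set(stop)` gives the same speed-up.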

## Stemming/lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running) than about the fact that it's a third person present tense verb or a progressive participle. The difference is that a stemmer chops off affixes by rule and may return something that is not a word, while a lemmatizer returns a dictionary form.

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions applied one after another.
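
To make the "bunch of regular expressions" intuition concrete, here is a toy suffix stripper. It is only a sketch: the rules in `TOY_RULES` and the function `toy_stem` are invented for illustration and are not the actual Porter rule set, which has several ordered steps and extra conditions on the remaining stem.

```python
import re

# Ordered (pattern, replacement) rules; the first matching rule wins.
# These are toy rules for illustration, NOT the real Porter rules.
TOY_RULES = [
    (re.compile(r"sses$"), "ss"),  # caresses -> caress
    (re.compile(r"ies$"), "i"),    # ponies   -> poni
    (re.compile(r"ing$"), ""),     # walking  -> walk
    (re.compile(r"ed$"), ""),      # walked   -> walk
    (re.compile(r"s$"), ""),       # grows    -> grow
]

def toy_stem(word):
    """Lowercase `word` and strip one suffix using the first matching rule."""
    word = word.lower()
    for pattern, replacement in TOY_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

for w in ["grows", "caresses", "ponies", "walking", "running"]:
    print(w, "->", toy_stem(w))
```

The toy version turns "running" into "runn". The real Porter algorithm includes additional rules, for example for doubled consonants, which is why it returns "run".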

In [110]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [111]:
stemmer.stem('grows')

'grow'

In [112]:
stemmer.stem('running')

'run'

In [113]:
stemmer.stem('leaves')

'leav'

In [117]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
nltk.download('wordnet')
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ruschenpohler\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


In [118]:
print(snowballer_stemmer.stem('running'))
print(snowballer_stemmer.stem('leaves'))

run
leav


In [119]:
print(lemmatizer.lemmatize('leaves'))

leaf


### Challenge 

Use the Porter stemmer to stem each word in the tweet dataset after having removed stop words.

In [125]:
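# Tokenize each tweet; `tokens` becomes a list of token lists, one per tweet.
tokens = [word_tokenize(tweet) for tweet in tweet_text]
tokens_stem = []

# Stem every token of every tweet (note: stop words are not filtered out here).
for tweet_tokens in tokens:
    tweet_stem = [stemmer.stem(t) for t in tweet_tokens]
    tokens_stem.append(tweet_stem)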

tokens_stem

[['today',
  'we',
  'express',
  'our',
  'deepest',
  'gratitud',
  'to',
  'all',
  'those',
  'who',
  'have',
  'serv',
  'in',
  'our',
  'arm',
  'forc',
  '.',
  '#',
  'thankavet',
  'http',
  ':',
  '//t.co/wpk7qwpk8z'],
 ['busi',
  'day',
  'plan',
  'in',
  'new',
  'york',
  '.',
  'will',
  'soon',
  'be',
  'make',
  'some',
  'veri',
  'import',
  'decis',
  'on',
  'the',
  'peopl',
  'who',
  'will',
  'be',
  'run',
  'our',
  'govern',
  '!'],
 ['love',
  'the',
  'fact',
  'that',
  'the',
  'small',
  'group',
  'of',
  'protest',
  'last',
  'night',
  'have',
  'passion',
  'for',
  'our',
  'great',
  'countri',
  '.',
  'We',
  'will',
  'all',
  'come',
  'togeth',
  'and',
  'be',
  'proud',
  '!'],
 ['just',
  'had',
  'a',
  'veri',
  'open',
  'and',
  'success',
  'presidenti',
  'elect',
  '.',
  'now',
  'profession',
  'protest',
  ',',
  'incit',
  'by',
  'the',
  'media',
  ',',
  'are',
  'protest',
  '.',
  'veri',
  'unfair',
  '!'],
 ['A',
  'f

## POS tagging

POS tagging means assigning each token a part of speech (e.g. noun, verb, adjective, etc.). Again, there are many different [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK keeps its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand, because the tagger uses the surrounding words as context (although you are free to remove them afterwards).

In [128]:
from nltk import pos_tag
single_review = reviews[3]
single_review

[['Text',
  'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.',
  'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
  'This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.',
  'If you

In [None]:
tokens = word_tokenize(single_review)
tagged_review = pos_tag(tokens)
tagged_review
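
`pos_tag` returns a list of `(token, tag)` tuples, using Penn Treebank tags by default. The sketch below shows how a result in that shape can be summarized and filtered. The list `tagged_example` is written by hand for illustration and is not actual tagger output; the names `tag_counts` and `nouns` are likewise made up.

```python
from collections import Counter

# Hand-written example in the (token, tag) format that pos_tag returns.
# The Penn Treebank tags were assigned by hand for illustration.
tagged_example = [
    ("My", "PRP$"), ("Labrador", "NNP"), ("is", "VBZ"),
    ("finicky", "JJ"), ("and", "CC"), ("she", "PRP"),
    ("appreciates", "VBZ"), ("this", "DT"), ("product", "NN"), (".", "."),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_example)

# Keep only the nouns: the Penn Treebank noun tags all start with "NN".
nouns = [token for token, tag in tagged_example if tag.startswith("NN")]

print(tag_counts)
print(nouns)
```

Filtering on the tag prefix is a convenient way to keep a whole family of tags (here NN, NNS, NNP and NNPS) without listing each one.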

### Challenge 

Below I've read in the text of Austen's _Pride and Prejudice_ into a variable called `pride`. Preprocess using the following steps:

- Strip whitespace
- Replace all numbers with '0'
- Tokenize
- Tag each token with a POS tag

Make sure you know:
- What type is the result?
- What type is each element of the result?
- What type are the elements of the elements of the result?
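
As a starting point, the first three steps could be sketched with the standard library alone. This is one possible reading of the task, not a reference solution: it assumes "replace all numbers" means replacing each run of digits with a single '0', and it uses a rough regular-expression tokenizer as a stand-in for `word_tokenize`. The helper names `normalize_text` and `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

def normalize_text(text):
    """Strip surrounding whitespace and replace each run of digits with '0'."""
    text = text.strip()
    return re.sub(r"\d+", "0", text)

def simple_tokenize(text):
    """Rough stand-in for word_tokenize: words (keeping internal
    apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

# Hypothetical sample sentence, not taken from the novel.
sample = "  Chapter 12. Mr. Bennet's 5 daughters arrived in 1813.  \n"
cleaned = normalize_text(sample)
tokens_sample = simple_tokenize(cleaned)

print(cleaned)
print(tokens_sample)
```

In the notebook you would pass the text in `pride` through the same steps, tokenize with `word_tokenize`, and hand the resulting token list to `pos_tag` for the final step.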

## Things we didn't cover

- Named entity recognition
- Syntactic parsing
- Information extraction
- Removing markup from HTML
- Extracting numerical features
- spaCy