----
# Language Processing
----

Python is exceedingly useful for handling numbers and performing math. With a few lines of code, it can instantly solve complex equations, perform mathematical functions on thousands of rows of data, and generate intuitive visualizations.

Interpreting words, sentences, and **language**, however, is a more complicated task for Python. Language is inherently more challenging than math for a computer to understand because language is illogical, figurative, and contextual. The English language is full of irregularities and subtleties that are easy for an experienced human to understand but are much harder for a computer to make sense of. (More on this [here](https://www.youtube.com/watch?v=Q-B_ONJIEcE) and [here](http://www.byrdseed.com/ambiguous-sentences/).)

Despite these challenges, Python has many useful tools for **language processing**. Although these tools aren't as precise as Python's quantitative functions, ever-increasing processing power and the proliferation of **machine learning** are making them more effective and more widely adopted.

In fact, language-processing programs are all around us, embedded in many popular applications:
- Outlook and Gmail use language processing to identify spam and classify emails
- Amazon Alexa, Apple's Siri, and Google Home use language processing (among other techniques) to interpret voice commands
- Here in Regulatory Compliance, language processing is used to identify suspicious messages associated with financial transactions as well as for [entity resolution](http://localhost:8888/notebooks/Documents/Internal%20Projects/Python%20Training/Lesson%20Notebooks/6_Language_Processing.ipynb)

This lesson will cover various methods of processing text and language in Python, ranging from simple applications in base Python to more complicated machine learning applications.

---
## Text in Base Python
---

Before delving into complex algorithms or machine learning, we'll refresh ourselves on base Python's **string** functions and operators, which can be combined to write surprisingly powerful functions for language processing. 

To review these concepts, we'll start with a few exercises, using the [UN Speech](https://obamawhitehouse.archives.gov/the-press-office/2016/09/20/address-president-obama-71st-session-united-nations-general-assembly) in our data folder:

In [1]:
with open('UN_Speech.txt', 'r') as f:  # the with-block closes the file automatically
    speech = f.read()
speech

"Mr. President; Mr. Secretary General; fellow delegates; ladies and gentlemen:  As I address this hall as President for the final time, let me recount the progress that we’ve made these last eight years.\n\nFrom the depths of the greatest financial crisis of our time, we coordinated our response to avoid further catastrophe and return the global economy to growth.  We’ve taken away terrorist safe havens, strengthened the nonproliferation regime, resolved the Iranian nuclear issue through diplomacy.  We opened relations with Cuba, helped Colombia end Latin America’s longest war, and we welcome a democratically elected leader of Myanmar to this Assembly.  Our assistance is helping people feed themselves, care for the sick, power communities across Africa, and promote models of development rather than dependence.  And we have made international institutions like the World Bank and the International Monetary Fund more representative, while establishing a framework to protect our planet fro

### Exercise
Find the total number of paragraphs in the speech, and create a list of them.

**Hint**: Each paragraph of the speech is delimited by two line breaks (**`\n\n`**).

In [2]:
paragraphs = speech.split('\n\n')
print("There are " + str(len(paragraphs)) + " paragraphs.")

There are 65 paragraphs.


In [3]:
paragraphs[1]

'From the depths of the greatest financial crisis of our time, we coordinated our response to avoid further catastrophe and return the global economy to growth.  We’ve taken away terrorist safe havens, strengthened the nonproliferation regime, resolved the Iranian nuclear issue through diplomacy.  We opened relations with Cuba, helped Colombia end Latin America’s longest war, and we welcome a democratically elected leader of Myanmar to this Assembly.  Our assistance is helping people feed themselves, care for the sick, power communities across Africa, and promote models of development rather than dependence.  And we have made international institutions like the World Bank and the International Monetary Fund more representative, while establishing a framework to protect our planet from the ravages of climate change.'

Now complete the same exercise for each _sentence_ of the speech.

In [4]:
# Splitting on a period followed by two spaces sidesteps abbreviations like "Mr.", which would otherwise be mistaken for sentence ends
sentences = [sentence.strip() for sentence in speech.split(".  ")]
for sentence in sentences:
    print(sentence)

Mr. President; Mr. Secretary General; fellow delegates; ladies and gentlemen:  As I address this hall as President for the final time, let me recount the progress that we’ve made these last eight years.

From the depths of the greatest financial crisis of our time, we coordinated our response to avoid further catastrophe and return the global economy to growth
We’ve taken away terrorist safe havens, strengthened the nonproliferation regime, resolved the Iranian nuclear issue through diplomacy
We opened relations with Cuba, helped Colombia end Latin America’s longest war, and we welcome a democratically elected leader of Myanmar to this Assembly
Our assistance is helping people feed themselves, care for the sick, power communities across Africa, and promote models of development rather than dependence
And we have made international institutions like the World Bank and the International Monetary Fund more representative, while establishing a framework to protect our planet from the ravag

Now find the average length of a sentence in the speech, measured by number of characters.

In [5]:
total_char = 0
for sentence in sentences:
    total_char += len(sentence)
    
average_length = total_char / len(sentences)
average_length

191.8546511627907

### Exercise
Write code that finds the number of times each word occurs in the document and creates a dictionary of the results.

In [6]:
words =  [word.strip().lower() for word in speech.split(" ") if word != ""]
word_counts = dict()
for word in set(words):
    word_counts[word] = words.count(word)

word_counts

{'every': 5,
 'ultimately,': 2,
 'cultures': 1,
 'expose': 1,
 'communities.': 1,
 'security': 2,
 'who,': 1,
 'suppression': 1,
 'sect;': 1,
 'intolerance.': 1,
 'far': 3,
 'seen': 1,
 'met': 1,
 'some': 8,
 'recognize': 7,
 'fourth': 1,
 'depths': 1,
 'order.': 1,
 'long.': 1,
 'surprise': 1,
 'east.': 1,
 'narrowing': 1,
 'aims': 1,
 'president;': 1,
 'enhanced': 1,
 'thrive;': 1,
 'i': 45,
 'ill-equipped,': 1,
 'dominate': 1,
 'place,': 1,
 'believe': 23,
 'luther': 1,
 'extremism': 3,
 'soviet': 1,
 'hardship': 1,
 'their': 21,
 'point': 1,
 'oligarchs': 1,
 'happen': 1,
 'new,': 1,
 'small': 2,
 'muzzling': 1,
 'asks': 1,
 'if': 18,
 'calls': 1,
 'strong,': 1,
 'systems': 1,
 'cannot': 5,
 'destructive': 1,
 'somewhat': 1,
 'catastrophe': 1,
 'terrorist': 2,
 'democracy,': 1,
 'americans': 2,
 'know': 3,
 'value': 1,
 'view': 1,
 'retreat': 1,
 'smartphone': 1,
 'ingenuity': 1,
 'truism': 1,
 'think': 6,
 'among': 3,
 'today,': 2,
 'divisions,': 1,
 'when': 8,
 'democratic,': 1,


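_You may notice that the code gets tripped up by the_ **`\n`** _newline characters: because it splits only on spaces, the words on either side of a paragraph break stay glued together. Later on, we'll learn to sidestep issues like this using regular expressions._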
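
In the meantime, one possible workaround can be sketched with the standard library alone. Calling `str.split()` with no arguments splits on any run of whitespace (newlines included), and `collections.Counter` tallies the tokens in a single pass. The `sample` text below is a made-up stand-in for `speech`, and punctuation is still left attached to the words.

```python
from collections import Counter

# Hypothetical stand-in for the `speech` variable used above
sample = "We the people.\n\nWe believe in   progress, and we believe in peace."

# split() with no arguments splits on any run of whitespace, including newlines
tokens = [token.lower() for token in sample.split()]

# Counter tallies every token in one pass over the list
token_counts = Counter(tokens)

print(token_counts["we"])
print(token_counts.most_common(2))
```

Because `Counter` is a dictionary subclass, `token_counts` can be used anywhere the `word_counts` dictionary was.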

In [7]:
print("There are " + str(len(words)) + " total words...")
print("...and " + str(len(word_counts)) + " unique words.")

There are 5588 total words...
...and 1821 unique words.


Hopefully, these exercises have refreshed your memory on base Python. They also show how base Python can very quickly and systematically distill text into its core elements - paragraphs, sentences, words, and even characters.

This ability is at the heart of the more complicated functions this course will cover. One of them is computing the **similarity** of two pieces of text.

---
## QGrams and Jaccard Similarity
---
There are many creative, programmatic ways of computing the **similarity** of two pieces of text. Some include:
- [Approximate String Matching:](https://en.wikipedia.org/wiki/Approximate_string_matching) A measure of the number of simple operations it takes to make one string into another
- [Plagiarism Detection:](https://en.wikipedia.org/wiki/Plagiarism_detection) A more complicated, multi-layered technique for finding plagiarism (which you may remember from college)
- [The Tversky Index:](https://en.wikipedia.org/wiki/Tversky_index) A more complicated, tunable version of the method we'll use today

In this example, we'll use the [Jaccard Index](https://en.wikipedia.org/wiki/Jaccard_index) together with the [QGrams](https://en.wikipedia.org/wiki/N-gram) method to compare the names of different companies.

The **QGrams** method breaks a word or string into sequences of **Q** characters.

For example, if you broke the word **firetruck** into QGrams where Q = 4, you'd end up with these results:
> fire, iret, retr, etru, truc, ruck

### Exercise 
Write a function that takes a string and an integer **Q** as arguments and returns a list of its QGrams. Test it on the word "sesquipedalian".

In [8]:
def QGrams(word,q):  
    word = word.lower()
    return [word[i:i+q] for i in range(len(word)-q+1)]

In [9]:
word = "sesquipedalian"

QGrams(word, 4)

['sesq',
 'esqu',
 'squi',
 'quip',
 'uipe',
 'iped',
 'peda',
 'edal',
 'dali',
 'alia',
 'lian']

Once we've broken a word or sentence into QGrams, we can use those core elements to assess how similar that word is to other words. We'll do this using **Jaccard Similarity**.

The Jaccard Similarity (or Jaccard "Coefficient") compares the **union** of two sets to the **intersection** of the two sets.

Remember, the **union** describes the unique elements that appear in _either_ set:
![Union](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ee/Union_of_sets_A_and_B.svg/371px-Union_of_sets_A_and_B.svg.png)

In Python, we can calculate this using the **`.union()`** method or, in shorthand, with the **|** operator.

In [10]:
set_A = {'a','b','c','d'}
set_B = {'b','c','d','e'}
set_A | set_B

{'a', 'b', 'c', 'd', 'e'}

The  **intersection**, on the other hand, describes all the unique elements that appear in _both_ sets.

![Intersection](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Intersection_of_sets_A_and_B.svg/371px-Intersection_of_sets_A_and_B.svg.png)

In Python, we can calculate this using the **`.intersection()`** method or, in shorthand, with the **&** operator.

In [11]:
set_A & set_B

{'b', 'c', 'd'}

The **Jaccard Similarity** is simply the size of the intersection divided by the size of the union. 

![Jaccard index formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/eaef5aa86949f49e7dc6b9c8c3dd8b233332c9e7)

In Python, you'd compute it like this:

In [12]:
len(set_A & set_B) / len(set_A | set_B)

0.6

### Exercise
Write a function that computes the Jaccard Similarity between two sets.

In [13]:
def Jaccard(setA, setB):
    return len(setA & setB) / len(setA | setB)
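
One edge case is worth keeping in mind: if both sets are empty (for example, when both strings are shorter than Q and so produce no QGrams), the union is empty too and the division raises a `ZeroDivisionError`. A minimal sketch of a guarded variant follows. The name `jaccard_safe` and the choice to return 0.0 for two empty sets are assumptions made for illustration, not part of the lesson's code.

```python
def jaccard_safe(set_a, set_b):
    """Jaccard similarity that returns 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        # Assumption: treat two empty sets as having no measurable similarity
        return 0.0
    return len(set_a & set_b) / len(union)

# Same example sets as above: 3 shared elements out of 5 distinct ones
print(jaccard_safe({'a', 'b', 'c', 'd'}, {'b', 'c', 'd', 'e'}))
print(jaccard_safe(set(), set()))
```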

How does this help us analyze text or compute similarity?

By breaking two pieces of text up using QGrams, and then finding the Jaccard Coefficient between the two resulting sets, we get a useful measurement of similarity between the two pieces of text.

Take, for example, the words "banana" and "bonanza" - two similar-sounding words. Using our pre-written functions, let's split them into QGrams using Q = 2 and then find the Jaccard Similarity of the resulting sets.

In [14]:
banana_qg = set(QGrams("banana",2))
bonanza_qg = set(QGrams("bonanza",2))

print(banana_qg)
print(bonanza_qg)

{'na', 'ba', 'an'}
{'na', 'on', 'nz', 'an', 'bo', 'za'}


In [15]:
Jaccard(banana_qg, bonanza_qg)

0.2857142857142857

0.286 - not bad, but hard to make sense of without context. What if we used a different Q?

Let's write a function that takes two strings and an integer Q as inputs and returns the Jaccard similarity of the resulting QGram sets.

In [16]:
def Text_Comparison(string_A, string_B, Q):
    return(Jaccard(set(QGrams(string_A, Q)), set(QGrams(string_B, Q))))

In [17]:
Text_Comparison("word", 'other word', 2)

0.3333333333333333

Now let's test this function out on another two very similar words - "meritorious" and "meritricious". Let's also write a loop that shows us how the similarity rating changes as we change Q.

In [18]:
word_A = "meritorious"
word_B = "meritricious"

for q in range(1,6):
    print("Q:  ", str(q), "\t Similarity: \t{0:3f}".format(Text_Comparison(word_A, word_B, q)))

Q:   1 	 Similarity: 	0.888889
Q:   2 	 Similarity: 	0.583333
Q:   3 	 Similarity: 	0.357143
Q:   4 	 Similarity: 	0.214286
Q:   5 	 Similarity: 	0.071429


Makes sense, right? As Q gets larger, the word is broken into larger and larger strings, and the larger the string, the less likely it is to overlap with a string from another word. 

When we use Q = 1, the words are just split into their component letters:

In [19]:
print(QGrams(word_A, 1))
print(QGrams(word_B, 1))

['m', 'e', 'r', 'i', 't', 'o', 'r', 'i', 'o', 'u', 's']
['m', 'e', 'r', 'i', 't', 'r', 'i', 'c', 'i', 'o', 'u', 's']


As we can see, almost all (89%) of the distinct letters appear in both words.

But if we use Q = 4, the resulting QGrams are longer and less likely to overlap.

In [20]:
print(QGrams(word_A, 4))
print(QGrams(word_B, 4))

['meri', 'erit', 'rito', 'itor', 'tori', 'orio', 'riou', 'ious']
['meri', 'erit', 'ritr', 'itri', 'tric', 'rici', 'icio', 'ciou', 'ious']


These 4-letter QGrams are more distinctive, so it's no surprise that fewer of them (21%) are shared between the two words.
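
To see exactly which QGrams overlap at Q = 4, the two lists can be converted to sets and intersected. This is a small sketch; `qgrams` here is a lowercase-named copy of the function above, repeated only so the block runs on its own.

```python
def qgrams(word, q):
    # Same logic as the QGrams function above, repeated so this block is self-contained
    word = word.lower()
    return [word[i:i+q] for i in range(len(word) - q + 1)]

grams_a = set(qgrams("meritorious", 4))
grams_b = set(qgrams("meritricious", 4))

shared = grams_a & grams_b      # QGrams found in both words
combined = grams_a | grams_b    # QGrams found in either word

print(sorted(shared))
print(len(shared), "shared out of", len(combined), "distinct QGrams")
```

Only the QGrams at the very start and very end of the words survive in the intersection, which is why the score drops so quickly as Q grows.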

What are some practical applications for this method? Measurements of similarity are useful for **entity resolution** - linking entries in data that are not explicitly tied together, but belong to the same real-world _entity_. In our line of work, entity resolution is essential for matching **customer names**. Financial institutions often store customer data in disparate silos, so redundancy can become a problem. 

For example, the customer "John C. Smith" may have a credit card account and a mortgage account at a bank. If the two departments keep their data separately, how will the bank know that John C. Smith owns both accounts? What if he's listed as "John Smith" in one system, and  "Johnathan C. Smith" in the other? How can we differentiate him from all of the other John Smiths in the data?

There are many complex, cutting-edge ways to perform entity resolution, ranging from regression to deep learning. But one simple and surprisingly effective way to do this is by calculating the **similarity** between two data entries, using the methods we've learned so far. For example, if both the **name** and **address** fields in our two data silos are more than, say, 80% similar, we can reasonably infer that they belong to the same person in the real world.
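
The name-and-address rule described above could be sketched roughly as follows. Everything here is illustrative: the helper names (`qgram_set`, `similarity`, `likely_same_entity`), the sample records, and the 0.8 threshold with Q = 2 are assumptions, and a real threshold would need to be tuned against known matches in the actual data.

```python
def qgram_set(text, q):
    # Lowercase the text and collect its distinct QGrams
    text = text.lower()
    return {text[i:i+q] for i in range(len(text) - q + 1)}

def similarity(text_a, text_b, q=2):
    # Jaccard similarity of the two QGram sets (0.0 if both sets are empty)
    set_a, set_b = qgram_set(text_a, q), qgram_set(text_b, q)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def likely_same_entity(record_a, record_b, threshold=0.8, q=2):
    # Both fields must clear the threshold for the records to be linked
    return all(
        similarity(record_a[field], record_b[field], q) > threshold
        for field in ("name", "address")
    )

# Hypothetical records from two separate systems
card_record     = {"name": "John C. Smith", "address": "12 Main Street, Springfield"}
mortgage_record = {"name": "JOHN C. SMITH", "address": "12 Main Street Springfield"}
other_record    = {"name": "Joan Smythe",   "address": "98 Elm Avenue, Shelbyville"}

print(likely_same_entity(card_record, mortgage_record))
print(likely_same_entity(card_record, other_record))
```

Requiring every field to pass keeps false matches down, at the cost of missing records where one field was entered very differently.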

### Exercise
Write a function that takes a single text value, a list of other text values, and an integer Q as inputs, and returns a dictionary with...
1. The most similar value in the list, as assessed by the Jaccard coefficient of the QGram sets
2. The Jaccard similarity coefficient of the two values.

(Remember to leverage the functions we've already written!)

Test the function using the **restaurant names dataset** in the `/data` folder.

In [21]:
def Find_Closest_Match(text, comparison_list, Q):
    closest_jaccard = 0
    closest_match = None
    for comparison in comparison_list:
        # Compute each similarity once and reuse it
        similarity = Text_Comparison(text, comparison, Q)
        if similarity > closest_jaccard:
            closest_jaccard = similarity
            closest_match = comparison
    return {'Closest Match': closest_match, 'Jaccard': closest_jaccard}

In [22]:
with open('restaurant-names.txt', 'r') as f:  # the with-block closes the file automatically
    restaurant_names = f.read().split('\n')

In [23]:
Find_Closest_Match("McDonald's", restaurant_names, 2)

{'Closest Match': "MCDONALD'S", 'Jaccard': 1.0}

In [24]:
Find_Closest_Match("Burger Joint", restaurant_names, 2)

{'Closest Match': 'BURGER JOINT', 'Jaccard': 1.0}

In [25]:
Find_Closest_Match("Fancy French", restaurant_names, 2)

{'Closest Match': 'DIRTY FRENCH', 'Jaccard': 0.5}

In [26]:
Find_Closest_Match("Szechuan Sauce", restaurant_names, 2)

{'Closest Match': 'SZECHUAN PALACE', 'Jaccard': 0.5}

In [27]:
Find_Closest_Match("Mark Padison Eleven", restaurant_names, 2)

{'Closest Match': 'ELEVEN MADISON PARK', 'Jaccard': 0.8421052631578947}

As you can see, using QGrams and Jaccard Similarity we've developed a sort of primitive search engine - something more powerful and dynamic than a simple CTRL+F search. Similarity indexes like this are at the core of many more powerful entity resolution tools. Some of these tools are offered in custom [Python libraries](http://recordlinkage.readthedocs.io/en/latest/about.html), but more often they are developed on a case-by-case basis, unique to the particular data sets and circumstances. Similarity indexing is also useful in other contexts; see the appendix for a walkthrough of an engagement-specific use of similarity indexing.
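
A search engine usually returns several ranked results rather than a single best guess. One way this might be sketched is with `heapq.nlargest`, which keeps the k highest-scoring candidates. The function name `find_top_matches` and the short `sample_names` list (a few names taken from the outputs above) are assumptions for illustration, and the helpers are repeated so the block runs on its own.

```python
import heapq

def qgram_set(text, q):
    # Lowercase the text and collect its distinct QGrams
    text = text.lower()
    return {text[i:i+q] for i in range(len(text) - q + 1)}

def similarity(text_a, text_b, q):
    # Jaccard similarity of the two QGram sets (0.0 if both sets are empty)
    set_a, set_b = qgram_set(text_a, q), qgram_set(text_b, q)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def find_top_matches(text, candidates, q, k=3):
    # Score each candidate once, then keep the k highest (score, name) pairs
    scored = ((similarity(text, candidate, q), candidate) for candidate in candidates)
    return heapq.nlargest(k, scored)

# A handful of names from the outputs above, standing in for the full dataset
sample_names = ["ELEVEN MADISON PARK", "DIRTY FRENCH", "SZECHUAN PALACE",
                "BURGER JOINT", "MCDONALD'S"]

for score, name in find_top_matches("Mark Padison Eleven", sample_names, 2):
    print(round(score, 3), name)
```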

Another even more powerful tool is the **regular expression**, which is widely used not only for entity resolution but also for a myriad of other text-mining techniques.

---
## Regular Expressions
---

**Regular expressions** are an important and powerful tool for text-mining; they allow us to search for highly specific patterns within data and extract the text matching those patterns. Regular expressions, or **regex** for short, aren't specific to Python, either. In fact, they're available in virtually every programming language and in many operating system tools, and they're essential to the functioning of applications like search engines and word processors.

To work with regular expressions, import the **re library** - Python's built-in module for processing regular expressions.

In [28]:
import re

To visualize how regex works, we'll use some pre-written code that mimics the [grep functionality](https://www.gnu.org/software/grep/manual/grep.html) that is common to many operating systems. This functionality (literally) highlights the pattern-matching mechanism of regex.

This code is written specifically for Jupyter notebooks, so you normally wouldn't use code like this in your Python programs. We use it here only to help visualize the way regex works, so don't worry about its internals.

In [29]:
def grep(regex_expression, text_items):
    BACKGROUND_YELLOW = '\x1b[43m'  # ANSI escape code: yellow background
    COLOR_RESET = '\x1b[0m'         # ANSI escape code: reset formatting
    regex = re.compile(regex_expression)
    if isinstance(text_items, str):
        text_items = [text_items]   # treat a single string as a one-item list
    elif not isinstance(text_items, (list, set, tuple)):
        raise ValueError("Wrong data type.")
    for item in text_items:
        # print the item once per match, highlighting that match
        for m in regex.finditer(item):
            highlighted  = item[:m.start()] # the string before the regex match
            highlighted += BACKGROUND_YELLOW + item[m.start():m.end()] + COLOR_RESET
            highlighted += item[m.end():] # the string after the regex match
            print(highlighted)

Now we're ready to work with regular expressions. 

Regular expressions are like `CTRL+F` on steroids. They find specific words, phrases, or combinations of characters within text.

Let's use our **`grep()`** function to find all of the restaurant names that contain the word "pizza". (We'll use the **`restaurant_names`** variable from the previous section.)

_Remember to prefix regex strings with_ **`r`** _(which makes them raw strings), since we'll be using lots of backslashes and special characters._

In [30]:
grep(r"PIZZA", restaurant_names)

$1.25 [43mPIZZA[0m
10TH AVENUE [43mPIZZA[0m & CAFE
14TH STREET [43mPIZZA[0m BAGEL CAFE
18 EAST GUNHILL [43mPIZZA[0m
2 BROS [43mPIZZA[0m
286 [43mPIZZA[0m PLACE INC
3-J RESTAURANT AND [43mPIZZA[0m
42ND STREET [43mPIZZA[0m DINER
44TH STREET [43mPIZZA[0m
5 STAR CHEESE STEAK AND [43mPIZZA[0m
5 STAR CHEESESTEAK AND [43mPIZZA[0m
99 CENT BEST [43mPIZZA[0m 5AVE INC
99 CENT EXPRESS [43mPIZZA[0m
99 CENT FRESH [43mPIZZA[0m
99 CENT FRESH [43mPIZZA[0m VILLA CAFE
99 CENT [43mPIZZA[0m
99 CENTS BEST & FRESH [43mPIZZA[0m INC.
99 CENTS FRESH HOT [43mPIZZA[0m
99 CENTS FRESH SLICE [43mPIZZA[0m
99 CENTS MEGA [43mPIZZA[0m
99 FRESH [43mPIZZA[0m
99C FRESH [43mPIZZA[0m
99¢ FAMOUS [43mPIZZA[0m
99¢ HOT [43mPIZZA[0m
A & M [43mPIZZA[0m
A PLUS [43mPIZZA[0m OF NY INC
A&L [43mPIZZA[0m RESTAURANT
A-1 [43mPIZZA[0m SHOP
ABITINO'S [43mPIZZA[0m
ACAPELLA GOURMET [43mPIZZA[0m & RESTAURANT CORP
ACE [43mPIZZA[0m
ADRIENNE'S [43mPIZZA[0m BAR
AEGEA GYROS AND [43mP

UNIVERSITY [43mPIZZA[0m
US FRIED CHICKEN & [43mPIZZA[0m
US KENNEDY FRIED CHICKEN AND [43mPIZZA[0m
V.I. [43mPIZZA[0m
VALENTINO'S [43mPIZZA[0m
VALIANO [43mPIZZA[0m
VENEZIA RISTORANTE & [43mPIZZA[0m
VENICE [43mPIZZA[0m
VERONA [43mPIZZA[0m
VESUVIO [43mPIZZA[0m
VESUVIO RESTAURANT & [43mPIZZA[0m
VESUVIOS [43mPIZZA[0m
VIC'S [43mPIZZA[0m ON ESSEX
VICTORIA [43mPIZZA[0m
VICTORIO'S [43mPIZZA[0m PLUS
VIDALI'S [43mPIZZA[0m
VILLA GARDEN FAMOUS [43mPIZZA[0m
VILLA MIA [43mPIZZA[0m
VILLAGE MARIA [43mPIZZA[0m II
VILLAGE [43mPIZZA[0m
VINNY'S FAMOUS [43mPIZZA[0m
VINNY'S [43mPIZZA[0m
VITO'S ROMA [43mPIZZA[0m
WALDY'S WOOD FIRED [43mPIZZA[0m & PENNE
WANTED [43mPIZZA[0m
WEST 190 STREET [43mPIZZA[0m
WILBEL [43mPIZZA[0m
WILLIAMSBURG [43mPIZZA[0m
WOODSIDE [43mPIZZA[0m
WORLD [43mPIZZA[0m & FRIED CHICKEN
WORLD [43mPIZZA[0m CHAMPION
XOCHIL [43mPIZZA[0m
YANKEE JZ [43mPIZZA[0m
YANKEE'S SK [43mPIZZA[0m
YANKY'S [43mPIZZA[0m
YUMMY FRIED CHICKEN AND [

Just like CTRL+F, the program finds and highlights every instance of the word **`PIZZA`**.

But what if we want to search for something more nuanced? Or something that could change? Or something that doesn't necessarily follow the same exact pattern every time?

The power of regular expressions is that they can specify **patterns**, not just fixed characters. In regular expressions, **ordinary characters** (such as **`a`**, **`X`**, **`9`**, or the letters of **`PIZZA`** in the example above) simply match themselves exactly.

But other characters take on special meanings to find more nuanced patterns. The most important ones include:

| Character|(In Plain English)|Regex Purpose|
| :-------------| :-------------| :-------------|
| **.**  | Period | Matches _any_ single character _except_ the newline character **\n**|
| **\w** | Backslash Lowercase w   | Matches any "word" character: a letter or digit or underscore|
| **\W** | Backslash Uppercase W   | Matches any _non_-word character|
| **\b** | Backslash Lowercase b   | Matches the boundary between a word and non-word character|
| **\s** | Backslash Lowercase s   | Matches any single whitespace character (spaces, newlines, carriage returns, and tabs) |
| **\S** | Backslash Uppercase S   | Matches any non-whitespace character|
| **\t** | Backslash Lowercase t   | Matches tabs|
| **\n** | Backslash Lowercase n   | Matches newlines|
| **\r** | Backslash Lowercase r   | Matches carriage returns|
| **\d** | Backslash Lowercase d   | Matches a decimal digit (0 through 9) |
| **^** | Caret   | Matches the start of a string|
| **$**  | Dollar Sign | Matches the end of a string|
| **\**  | Backslash    | Inhibits the "specialness" of a character, just as it does in Python|
    
So let's say we wanted to find any restaurant names containing a number. With regular expressions, we don't have to specify which number, the way we would with CTRL+F. Instead, we can use a special character.

In [31]:
grep(r'\d',restaurant_names)

#[43m1[0m GARDEN CHINESE
#[43m1[0m ME. NICK'S
#[43m1[0m SABOR LATINO RESTAURANT
$[43m1[0m.25 PIZZA
$1.[43m2[0m5 PIZZA
$1.2[43m5[0m PIZZA
(PUBLIC FARE) [43m8[0m1ST STREET AND CENTRAL PARK WEST (DELACORTE THEATRE)
(PUBLIC FARE) 8[43m1[0mST STREET AND CENTRAL PARK WEST (DELACORTE THEATRE)
[43m0[0m02 MERCURY TACOS LLC
0[43m0[0m2 MERCURY TACOS LLC
00[43m2[0m MERCURY TACOS LLC
[43m1[0m 2 3 BURGER SHOT BEER
1 [43m2[0m 3 BURGER SHOT BEER
1 2 [43m3[0m BURGER SHOT BEER
[43m1[0m BANANA QUEEN
[43m1[0m BUEN SABOR
[43m1[0m DARBAR
[43m1[0m EAST 66TH STREET KITCHEN
1 EAST [43m6[0m6TH STREET KITCHEN
1 EAST 6[43m6[0mTH STREET KITCHEN
[43m1[0m OAK
[43m1[0m OR 8
1 OR [43m8[0m
[43m1[0m STOP PATTY SHOP
[43m1[0m.5 GALBI CORP
1.[43m5[0m GALBI CORP
[43m1[0m0 DEVOE
1[43m0[0m DEVOE
[43m1[0m0 POINTS KTV
1[43m0[0m POINTS KTV
[43m1[0m00 FUN
1[43m0[0m0 FUN
10[43m0[0m FUN
[43m1[0m00% PATACON CACHAPA YAROA
1[43m0[0m0% PATACON CACHAPA YAROA
10[43

72[43m9[0m BAR & GRILL
[43m7[0m2ND STREET BAGEL
7[43m2[0mND STREET BAGEL
[43m7[0m39 FRANKLIN BAR & LOUNGE
7[43m3[0m9 FRANKLIN BAR & LOUNGE
73[43m9[0m FRANKLIN BAR & LOUNGE
[43m7[0m6/BEERS OF BROOKLYN
7[43m6[0m/BEERS OF BROOKLYN
[43m7[0m65 FOOD MARKET
7[43m6[0m5 FOOD MARKET
76[43m5[0m FOOD MARKET
[43m7[0m73 LOUNGE
7[43m7[0m3 LOUNGE
77[43m3[0m LOUNGE
[43m7[0m77 THEATER BAR
7[43m7[0m7 THEATER BAR
77[43m7[0m THEATER BAR
[43m7[0mB BAR
[43m7[0mTH AVENUE DONUT SHOP
[43m7[0mTH FLOOR CAFE
[43m7[0mTH MANSION KTV
[43m8[0m DRAGON & PHOENIX CHINESE RESTAURANT
[43m8[0m PAET RIO
[43m8[0m0 RIVERSIDE CAFE
8[43m0[0m RIVERSIDE CAFE
[43m8[0m05 ZHENG YUAN BAO GOURMET
8[43m0[0m5 ZHENG YUAN BAO GOURMET
80[43m5[0m ZHENG YUAN BAO GOURMET
[43m8[0m09 GRILL & BAR RESTAURANT
8[43m0[0m9 GRILL & BAR RESTAURANT
80[43m9[0m GRILL & BAR RESTAURANT
[43m8[0m09 PERFECT DELI.
8[43m0[0m9 PERFECT DELI.
80[43m9[0m PERFECT DELI.
[43m8[0m090 TAI WANESE
8[43m0

STANDS [43m3[0m03 AND 301 PEPSI PORCH
STANDS 3[43m0[0m3 AND 301 PEPSI PORCH
STANDS 30[43m3[0m AND 301 PEPSI PORCH
STANDS 303 AND [43m3[0m01 PEPSI PORCH
STANDS 303 AND 3[43m0[0m1 PEPSI PORCH
STANDS 303 AND 30[43m1[0m PEPSI PORCH
STAR ON [43m1[0m8TH DINER CAFE
STAR ON 1[43m8[0mTH DINER CAFE
STARBUCKS # [43m1[0m4840
STARBUCKS # 1[43m4[0m840
STARBUCKS # 14[43m8[0m40
STARBUCKS # 148[43m4[0m0
STARBUCKS # 1484[43m0[0m
STARBUCKS (JFK TERMINAL [43m5[0m-POST SECURITY DEPARTURE)
STARBUCKS (STORE [43m1[0m6628)
STARBUCKS (STORE 1[43m6[0m628)
STARBUCKS (STORE 16[43m6[0m28)
STARBUCKS (STORE 166[43m2[0m8)
STARBUCKS (STORE 1662[43m8[0m)
STARBUCKS [43m2[0m2420
STARBUCKS 2[43m2[0m420
STARBUCKS 22[43m4[0m20
STARBUCKS 224[43m2[0m0
STARBUCKS 2242[43m0[0m
STARBUCKS COFFEE  #[43m1[0m6608
STARBUCKS COFFEE  #1[43m6[0m608
STARBUCKS COFFEE  #16[43m6[0m08
STARBUCKS COFFEE  #166[43m0[0m8
STARBUCKS COFFEE  #1660[43m8[0m
STARBUCKS COFFEE # [43m1[0m5440
STARB

Or what if we want to look for all the restaurants whose names _start_ with a number?

In [32]:
grep(r'^\d', restaurant_names)

[43m0[0m02 MERCURY TACOS LLC
[43m1[0m 2 3 BURGER SHOT BEER
[43m1[0m BANANA QUEEN
[43m1[0m BUEN SABOR
[43m1[0m DARBAR
[43m1[0m EAST 66TH STREET KITCHEN
[43m1[0m OAK
[43m1[0m OR 8
[43m1[0m STOP PATTY SHOP
[43m1[0m.5 GALBI CORP
[43m1[0m0 DEVOE
[43m1[0m0 POINTS KTV
[43m1[0m00 FUN
[43m1[0m00% PATACON CACHAPA YAROA
[43m1[0m00% SMOOTHIES & EMPANADAS
[43m1[0m001 NIGHTS
[43m1[0m001 NIGHTS CAFE
[43m1[0m005 CATERING
[43m1[0m01 CAFE
[43m1[0m01 DELI
[43m1[0m01 RESTAURANT AND BAR
[43m1[0m02 NOODLES TOWN RESTAURANT
[43m1[0m020 BAR
[43m1[0m028 BAR & RESTAURANT EL SALVADORENO 
[43m1[0m04-01 FOSTER AVENUE COFFEE SHOP(UPS)
[43m1[0m061 CATERING LLC
[43m1[0m07 WEST RESTAURANT
[43m1[0m08 FAST FOOD CORP
[43m1[0m08 LOUNGE - CLUB 108
[43m1[0m081 FULTON
[43m1[0m0TH AVENUE COOKSHOP
[43m1[0m0TH AVENUE PIZZA & CAFE
[43m1[0m1 STREET CAFE
[43m1[0m11 RESTAURANT
[43m1[0m174 FULTON CUISINE, HALAL FOOD
[43m1[0m2 CHAIRS
[43m1[0m2 CHAIRS CAFE
[43m1

Restaurant names that _end_ with a number?

In [33]:
grep(r'\d$', restaurant_names)

1 OR [43m8[0m
108 LOUNGE - CLUB 10[43m8[0m
3RD & [43m7[0m
83 1/[43m2[0m
AFGHAN KEBAB HOUSE #[43m1[0m
AFTER [43m8[0m
AJI 1[43m8[0m
ALFREDO 10[43m0[0m
ALICES TEA CUP CHAPTER [43m2[0m
AMC ORPHEUM [43m7[0m
AMC THEATRES EMPIRE 2[43m5[0m
AMC THEATRES FRESH MEADOWS [43m7[0m
AMC THEATRES MAGIC JOHNSON HARLEM [43m9[0m
AMICI 3[43m6[0m
AMOR BAKERY NO [43m2[0m
APT. 7[43m8[0m
AROME CAFE 3[43m2[0m
ASIAN TASTE 8[43m6[0m
BAMBOO 5[43m2[0m
BAR & GRILL 4[43m3[0m
BAR 1[43m3[0m
BAR 13[43m1[0m
BAR 24[43m5[0m
BAR 36[43m0[0m
BAR 48[43m3[0m
BAR 51[43m5[0m
BAR 71[43m8[0m
BARRIO 4[43m7[0m
BASSO5[43m6[0m
BIN 7[43m1[0m
BIN NO 22[43m0[0m
BIN NO [43m5[0m
BISTRO 3[43m3[0m
BISTRO 6[43m1[0m
BISTRO TEN-1[43m8[0m
BLACKTHORN 5[43m1[0m
BOONCHU #[43m2[0m
BRASSERIE 8 1/[43m2[0m
BROOKLYN CAFE [43m1[0m
BUCEO 9[43m5[0m
BUFFET 5[43m8[0m
BUNGALOW 3[43m1[0m
BUNGALOW1[43m8[0m
BUTTERFILED [43m8[0m
C-PAC/PULSE 4[43m8[0m
CABRINI 18[43m1[0m
CAFE

As you can see, regular expressions are much more versatile than other search functions. But we've still only scratched the surface. 
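
The special characters used so far can also be tried directly with the `re` module, without the `grep()` helper. The sketch below uses `re.search` and `re.findall` on a short, hypothetical list of names (all of which appear in the outputs above) to show the difference between matching a digit anywhere, at the start, and at the end of a string.

```python
import re

# Hypothetical sample drawn from names that appear in the outputs above
names = ["1 OR 8", "BAR 13", "7TH FLOOR CAFE", "VILLAGE PIZZA"]

for name in names:
    starts_with_digit = bool(re.search(r'^\d', name))  # ^ anchors the match to the start
    ends_with_digit = bool(re.search(r'\d$', name))    # $ anchors the match to the end
    digits = re.findall(r'\d', name)                   # every individual digit in the name
    print(name, starts_with_digit, ends_with_digit, digits)
```

`re.search` returns a match object (or `None` if nothing matches), while `re.findall` returns a list of every matching substring.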

Other special characters act as **quantifiers**, which tell the regular expression how many repetitions of a pattern to match. The table below lists these, along with the operators for alternation and character sets:

| Character|(In Plain English)|Regex Purpose|
| :-------------| :-------------| :-------------|
| **\***  | Asterisk | Matches 0 or more of the preceding pattern|
| **+**  | Plus Sign | Matches 1 or more of the preceding pattern|
| **?**  | Question Mark | Matches 0 or 1 of the preceding pattern|
| **{5}**  | Curly Braces, One Value | Matches exactly 5 of the preceding pattern|
| **{2,}**  | Curly Braces, Empty Comma | Matches 2 or more of the preceding pattern|
| **{1,3}**  | Curly Braces, Two Values| Matches between 1 and 3 of the preceding pattern|
| **&#124;**| Pipe | Matches one pattern _or_ another|
| **[  ]** | Square Brackets | Defines a **character set**, matching any _one_ character listed within the brackets| 
| **[1-7]** | Square Brackets, Range | Matches any single digit from 1 to 7| 
| **[a-j]** | Square Brackets, Range | Matches any single lowercase letter from a to j| 
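
A few of these can be tried out with a short sketch. The `names` list is a hypothetical sample (each name appears somewhere in the outputs above), and the four patterns were chosen only to illustrate `+`, `{3}`, the pipe, and character ranges.

```python
import re

# Hypothetical sample drawn from names that appear in the outputs above
names = ["100 FUN", "1ST STOP", "BAR 13", "PIG & WHISTLE", "7B BAR"]

patterns = {
    r'^\d+':        "one or more digits at the start",
    r'^\d{3}\s':    "exactly three digits at the start, then whitespace",
    r'BAR|STOP':    "either BAR or STOP",
    r'^[1-7][A-Z]': "a digit from 1 to 7, then an uppercase letter",
}

for pattern, description in patterns.items():
    # Keep only the names in which the pattern matches somewhere
    matches = [name for name in names if re.search(pattern, name)]
    print(description, "->", matches)
```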

So what if we wanted to find restaurant names that start with exactly three word characters followed by a space?

In [34]:
grep(r'^\w{3}\s', restaurant_names)

[43m002 [0mMERCURY TACOS LLC
[43m100 [0mFUN
[43m101 [0mCAFE
[43m101 [0mDELI
[43m101 [0mRESTAURANT AND BAR
[43m102 [0mNOODLES TOWN RESTAURANT
[43m107 [0mWEST RESTAURANT
[43m108 [0mFAST FOOD CORP
[43m108 [0mLOUNGE - CLUB 108
[43m111 [0mRESTAURANT
[43m120 [0mBAY CAFE
[43m121 [0mFULTON STREET
[43m123 [0mNIKKO
[43m124 [0mCOFFEE SHOP
[43m128 [0mDUMPLING HOUSE
[43m137 [0mBAR & GRILL
[43m149 [0mSTEAM FISH CORP.
[43m156 [0mTEX BAR AND LOUNGE
[43m162 [0mEB CORP BAKERY
[43m168 [0mASUKA SUSHI
[43m168 [0mBOWERY HOLDING LLC
[43m168 [0mHI TEA
[43m168 [0mTEA LLC
[43m169 [0mBAR
[43m181 [0mST CARIDAD RESTAURANT
[43m187 [0mYANG GARDEN
[43m188 [0mFAST FOOD INC
[43m19A [0mEMPIRE RESTAURANT
[43m1ST [0mAVENUE GOURMET
[43m1ST [0mBASE CONCESSION STAND
[43m1ST [0mMAMA RESTAURANT
[43m1ST [0mSTOP
[43m200 [0mFIFTH AVENUE RESTAURANT & SPORTS BAR
[43m200 [0mORCHARD BAR
[43m201 [0mBAR AND RESTAURANT
[43m203 [0mLENA INC
[43m211 [0mNEW TACO GRILL

[43mLAS [0mTAINAS BAR & RESTAURANT
[43mLAS [0mTAPAS
[43mLAS [0mTIAS BAKERY
[43mLCL [0mBAR AND KITCHEN
[43mLCZ [0mRESTAURANT
[43mLEA [0mWINE BAR
[43mLEE [0mCHINESE RESTAURANT
[43mLEE [0mCHUNG CAFE
[43mLEE [0mGARDEN CHINESE RESTAURANT
[43mLEE [0mGOOD TASTE KITCHEN
[43mLEE [0mLEE'S BAKED GOODS
[43mLEE [0mWANG RESTAURANT
[43mLEE [0mXING RESTAURANT
[43mLEO [0mCASA CALAMARI
[43mLES [0mAMIS
[43mLES [0mCAYES, INC.
[43mLES [0mCREPES
[43mLES [0mHALLES
[43mLES [0mJARDINS DE LA DUCHESSE LLC
[43mLEX [0mRESTAURANT
[43mLIA [0mCOFFEE SHOP AND DELI
[43mLIC [0mBAGELS
[43mLIC [0mLANDING BY COFFEED
[43mLIC [0mMARKET
[43mLIL [0mBITE'S CAFE
[43mLIL [0mFRANKIE'S PIZZA
[43mLIN [0mCHINA WOK
[43mLIN [0mGARDEN
[43mLIN [0mHOME CHINESE RESTAURA
[43mLIN [0mKEE HONG CHINESE RESTAURANT
[43mLIN [0mLONG XUAN RESTAURANT INC
[43mLIN [0mNEW PEOPLE RESTAURANT
[43mLIT [0mLOUNGE NYC
[43mLIU [0mBING YING
[43mLIU [0mGARDEN
[43mLIU [0mLIU SEAFOOD RESTAURANT

[43mPIA [0mPIZZERIA
[43mPIC [0mUP STIX
[43mPIE [0mCORPS
[43mPIE [0mFACE
[43mPIE [0mPIE PIZZA
[43mPIE [0mPIE Q CAFE
[43mPIG [0m& KHAO
[43mPIG [0m& WHISTLE
[43mPIG [0m& WHISTLE ON 3RD
[43mPIG [0m'N' WHISTLE
[43mPIG [0mGUY NYC
[43mPIG [0mHEAVEN
[43mPIL [0mPIL SPANISH TAPAS INC
[43mPIO [0mBAGEL
[43mPIO [0mHOT BAGELS
[43mPIO [0mPIO
[43mPIO [0mPIO BROOKLYN
[43mPIO [0mPIO EXPRESS
[43mPIO [0mPIO RIKO
[43mPIO [0mPIO TO GO
[43mPIT [0mSTOP BAR
[43mPNS [0mSOUL FOOD
[43mPOD [0mCAFE
[43mPOK [0mPOK NY
[43mPOK [0mPOK PHAT THAI
[43mPOM [0mPOM DINER
[43mPOP [0mBAR
[43mPOP [0mPUB
[43mPOP [0mYOGURT
[43mPRB [0m24-7
[43mPSC [0mCAFETERIA
[43mPYE [0mBOAT NOODLE
[43mQUE [0mRICO POLLO RESTAURANT & GRILL
[43mQUE [0mRICO TACO
[43mQUE [0mSABOR BAKERY CAFE
[43mQUE [0mSABROSURA RESTAURANT
[43mRAA [0mNYC LLC
[43mRAG [0mTOP
[43mRAI [0mRAI KEN
[43mRAJ [0mMAHAL INDIAN RESTAURANT
[43mRAN [0mTEA HOUSE
[43mRAW [0mJUICE CAFE
[43mRAW [0mORG

[43mTHE [0mHILLS RESTAURANT AND BAR
[43mTHE [0mHISTORIC OLD BERMUDA INN
[43mTHE [0mHIVE SPORTS BAR AND GRILL
[43mTHE [0mHOG PIT NEW YORK CITY
[43mTHE [0mHOP SHOP
[43mTHE [0mHORSE BOX
[43mTHE [0mHOUSE IN GRAMERCY PARK
[43mTHE [0mHOUSE OF BREWS
[43mTHE [0mHUMMUS & PITA
[43mTHE [0mHUMMUS & PITA CO.
[43mTHE [0mHUMMUS AND PITA CO.
[43mTHE [0mHYTES BAR
[43mTHE [0mICE BOX-RALPH'S FAMOUS ITALIAN ICES
[43mTHE [0mIMMIGRANT NYC
[43mTHE [0mIMMIGRANT TAP ROOM
[43mTHE [0mINKAN
[43mTHE [0mINTERNATIONAL CULINARY INSTITUTE
[43mTHE [0mIRISH AMERICAN
[43mTHE [0mIRISH EXIT
[43mTHE [0mIRISH PUB
[43mTHE [0mIRON HORSE
[43mTHE [0mISLAND
[43mTHE [0mISLANDS
[43mTHE [0mIZAKAYA
[43mTHE [0mJACKPOT CAFE
[43mTHE [0mJAG
[43mTHE [0mJAGUAR RESTAURANT
[43mTHE [0mJAKE WALK
[43mTHE [0mJAR BAR
[43mTHE [0mJEFFREY
[43mTHE [0mJOCKEY'S ROOM
[43mTHE [0mJOHN J O'CONNOR RESIDENCE
[43mTHE [0mJOHNSON'S
[43mTHE [0mJOINT ON MYRTLE
[43mTHE [0mJOLLY MONK
[43mTHE [0

Or restaurants whose names end with one or more digits, with the whole run of digits matched rather than just the last one?

In [35]:
grep(r'\d+$', restaurant_names)

1 OR [43m8[0m
108 LOUNGE - CLUB [43m108[0m
3RD & [43m7[0m
83 1/[43m2[0m
AFGHAN KEBAB HOUSE #[43m1[0m
AFTER [43m8[0m
AJI [43m18[0m
ALFREDO [43m100[0m
ALICES TEA CUP CHAPTER [43m2[0m
AMC ORPHEUM [43m7[0m
AMC THEATRES EMPIRE [43m25[0m
AMC THEATRES FRESH MEADOWS [43m7[0m
AMC THEATRES MAGIC JOHNSON HARLEM [43m9[0m
AMICI [43m36[0m
AMOR BAKERY NO [43m2[0m
APT. [43m78[0m
AROME CAFE [43m32[0m
ASIAN TASTE [43m86[0m
BAMBOO [43m52[0m
BAR & GRILL [43m43[0m
BAR [43m13[0m
BAR [43m131[0m
BAR [43m245[0m
BAR [43m360[0m
BAR [43m483[0m
BAR [43m515[0m
BAR [43m718[0m
BARRIO [43m47[0m
BASSO[43m56[0m
BIN [43m71[0m
BIN NO [43m220[0m
BIN NO [43m5[0m
BISTRO [43m33[0m
BISTRO [43m61[0m
BISTRO TEN-[43m18[0m
BLACKTHORN [43m51[0m
BOONCHU #[43m2[0m
BRASSERIE 8 1/[43m2[0m
BROOKLYN CAFE [43m1[0m
BUCEO [43m95[0m
BUFFET [43m58[0m
BUNGALOW [43m31[0m
BUNGALOW[43m18[0m
BUTTERFILED [43m8[0m
C-PAC/PULSE [43m48[0m
CABRINI [43m181[0m
CAFE

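Or restaurants whose names begin with _THE_ or contain _123_? Note that in the pattern below the `^` anchor applies only to `THE`, so `123` is matched anywhere in the name; to require either one at the start, group the alternatives as `^(THE|123)`.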

In [36]:
grep(r'^THE|123', restaurant_names)

[43m123[0m NIKKO
CAFE 1[43m123[0m1
CHIPOTLE MEXICAN GRILL #2[43m123[0m
COFFEE 1[43m123[0m8
STAND [43m123[0m
[43mTHE[0m 13TH STEP
[43mTHE[0m 3 LUIGIS
[43mTHE[0m 5 AND DIAMOND
[43mTHE[0m ABBEY
[43mTHE[0m ABBEY PUB
[43mTHE[0m AINSWORTH
[43mTHE[0m ALHAMBRA BALL ROOM
[43mTHE[0m ALLIE WAY SPORTS BAR
[43mTHE[0m AMBASSADOR GRILL AND LOUNGE
[43mTHE[0m AMERICANO HOTEL
[43mTHE[0m ANCHORED INN
[43mTHE[0m ARCH DINER
[43mTHE[0m ARCHIVE
[43mTHE[0m ASSEMBLY BAR
[43mTHE[0m ASTOR ROOM
[43mTHE[0m ASTORIA WORLD MANOR
[43mTHE[0m ATRIUM
[43mTHE[0m AURORA PIZZA CAFFE
[43mTHE[0m AUSTRALIAN
[43mTHE[0m AVE LUNCH BOX
[43mTHE[0m AVENUE
[43mTHE[0m BACK ROOM
[43mTHE[0m BAGEL BASKET
[43mTHE[0m BAGEL FACTORY
[43mTHE[0m BAGEL HOUSE
[43mTHE[0m BAGEL MARKET
[43mTHE[0m BAGEL STORE
[43mTHE[0m BAHCHE
[43mTHE[0m BAILEY
[43mTHE[0m BAKE SHOP BY WOOPS
[43mTHE[0m BAKE SHOPPE
[43mTHE[0m BANK OF NEW YORK
[43mTHE[0m BAO
[43mTHE[0m BAO SHOPPE
[43mTHE[0m

What about restaurants whose names start with the letter A, B, or C, or with a digit? We'd need a **character set**.

In [37]:
grep(r'^[ABC\d]', restaurant_names)

[43m0[0m02 MERCURY TACOS LLC
[43m1[0m 2 3 BURGER SHOT BEER
[43m1[0m BANANA QUEEN
[43m1[0m BUEN SABOR
[43m1[0m DARBAR
[43m1[0m EAST 66TH STREET KITCHEN
[43m1[0m OAK
[43m1[0m OR 8
[43m1[0m STOP PATTY SHOP
[43m1[0m.5 GALBI CORP
[43m1[0m0 DEVOE
[43m1[0m0 POINTS KTV
[43m1[0m00 FUN
[43m1[0m00% PATACON CACHAPA YAROA
[43m1[0m00% SMOOTHIES & EMPANADAS
[43m1[0m001 NIGHTS
[43m1[0m001 NIGHTS CAFE
[43m1[0m005 CATERING
[43m1[0m01 CAFE
[43m1[0m01 DELI
[43m1[0m01 RESTAURANT AND BAR
[43m1[0m02 NOODLES TOWN RESTAURANT
[43m1[0m020 BAR
[43m1[0m028 BAR & RESTAURANT EL SALVADORENO 
[43m1[0m04-01 FOSTER AVENUE COFFEE SHOP(UPS)
[43m1[0m061 CATERING LLC
[43m1[0m07 WEST RESTAURANT
[43m1[0m08 FAST FOOD CORP
[43m1[0m08 LOUNGE - CLUB 108
[43m1[0m081 FULTON
[43m1[0m0TH AVENUE COOKSHOP
[43m1[0m0TH AVENUE PIZZA & CAFE
[43m1[0m1 STREET CAFE
[43m1[0m11 RESTAURANT
[43m1[0m174 FULTON CUISINE, HALAL FOOD
[43m1[0m2 CHAIRS
[43m1[0m2 CHAIRS CAFE
[43m1

[43mA[0mMPLE HILLS CREAMERY
[43mA[0mMSTERDAM ALE HOUSE
[43mA[0mMSTERDAM BILLIARDS
[43mA[0mMSTERDAM BURGER CO.
[43mA[0mMSTERDAM GOURMET
[43mA[0mMSTERDAM RESTAURANT & TAPAS LOUNGE
[43mA[0mMSTERDAM SOCIAL
[43mA[0mMSTERDAM TAVERN
[43mA[0mMURA JAPANESE RESTAURANT
[43mA[0mMUSE WINE BAR
[43mA[0mMY & CATHY'S CHINESE RESTAURANT
[43mA[0mMY RUTH'S RESTAURANT
[43mA[0mMY'S BREAD
[43mA[0mMY'S CAFE AND BAKERY
[43mA[0mMY'S RESTAURANT
[43mA[0mN BEAL BOCHT CAFE
[43mA[0mN CHOI
[43mA[0mNA'S BAKERY & CAFE
[43mA[0mNA'S PASTRY SHOP CORP.
[43mA[0mNABLE BASIN SAILING
[43mA[0mNAIAH RESTAURANT
[43mA[0mNALOGUE
[43mA[0mNARKALI INDIAN FOOD
[43mA[0mNASSA TAVERNA
[43mA[0mNATOLIA MEDITERRANEAN CUISINE
[43mA[0mNATOLIAN GYRO RESTAURANT
[43mA[0mNCHOR
[43mA[0mNCHOR COFFEE
[43mA[0mNCHOR INN
[43mA[0mNCHOR WINEBAR LLC
[43mA[0mNDALUCIA BAR & LOUNGE
[43mA[0mNDAMAN THAI BISTRO
[43mA[0mNDANADA
[43mA[0mNDAZ
[43mA[0mNDAZ FIFTH AVENUE
[43mA[0mNDIAMO CAFE
[43mA

[43mB[0mASSO56
[43mB[0mASTA PASTA RESTAURANT
[43mB[0mASURERO
[43mB[0mATARD
[43mB[0mATEAUX NEW YORK
[43mB[0mATH BEACH DINER
[43mB[0mATI
[43mB[0mATTERSBY
[43mB[0mATTERY GARDENS RESTAURANT
[43mB[0mATTERY HARRIS
[43mB[0mATTISTA
[43mB[0mAVARIA BIERHOUSE
[43mB[0mAWARCHI INDIAN CUISINE
[43mB[0mAY BAGELS
[43mB[0mAY CLUB RESTAURANT
[43mB[0mAY HOUSE BISTRO
[43mB[0mAY HOUSE RICE & UDON STATION
[43mB[0mAY LEAF
[43mB[0mAY LEAF INDIAN FOOD
[43mB[0mAY PIZZERIA RESTAURANT
[43mB[0mAY POINT PIZZERIA
[43mB[0mAY RIDGE CAFE
[43mB[0mAY RIDGE DINER
[43mB[0mAY RIDGE MANOR CATERING
[43mB[0mAY STREET LUNCHEONETTE & SODA FOUNTAIN
[43mB[0mAY SUSHI
[43mB[0mAY TERRACE POOL & TENNIS CENTER
[43mB[0mAY TERRACE POOL CLUB
[43mB[0mAYARD'S ALEHOUSE
[43mB[0mAYBRIDGE SZECHUAN RESTAURANT
[43mB[0mAYHOUSE
[43mB[0mAYOU
[43mB[0mAYRIDGE PIZZA
[43mB[0mAYRIDGE SUSHI
[43mB[0mAYSIDE DINER
[43mB[0mAYSIDE MARINA SNACK BAR
[43mB[0mAZ BAGEL AND RESTAURANT
[43mB[0

[43mB[0mUFFALO BOSS
[43mB[0mUFFALO BOSS TWO
[43mB[0mUFFALO JO'S WINGS
[43mB[0mUFFALO WILD WINGS
[43mB[0mUFFALO WILD WINGS GRILL & BAR
[43mB[0mUFFALO WILD WINGS GRILL AND BAR
[43mB[0mUFFALO WILD WINGS,PEETS COOFEE &TEA, PANOPOLIS BAKERY & CAFE
[43mB[0mUFFET 58
[43mB[0mUGS
[43mB[0mUILD-A-BEAR WORKSHOP LOWER LEVEL
[43mB[0mUILDING ON BOND
[43mB[0mUKA
[43mB[0mUKHARA GRILL
[43mB[0mUKHARI RESTAURANT
[43mB[0mULL & BEAR WALDORF ASTORIA
[43mB[0mULL HEAD TAVERN
[43mB[0mULL MCCABES
[43mB[0mULL SHOTS
[43mB[0mULLPEN DELI TWIN DONUTS
[43mB[0mULLSEYE SPORTS PUB
[43mB[0mULLY'S DELI
[43mB[0mULOVA
[43mB[0mUM BUM BAR
[43mB[0mUMBLE & BUMBLE CAFE
[43mB[0mUMBLE AND BUMBLE
[43mB[0mUN-KER VIETNAMESE RESTAURANT
[43mB[0mUNCH OF BAGELS
[43mB[0mUNDU KHAN KABAB HOUSE
[43mB[0mUNGA'S DEN
[43mB[0mUNGALO
[43mB[0mUNGALOW 31
[43mB[0mUNGALOW BAR & RESTAURANT
[43mB[0mUNGALOW18
[43mB[0mUNNA CAFE
[43mB[0mUNNY DELI
[43mB[0mUNNY'S WEST INDIAN RESTAURANT
[

[43mC[0mAFFEE DEI FIORI RISTORANTE
[43mC[0mAFFEE EXPRESS
[43mC[0mAFFEINA ESPRESSO BAR
[43mC[0mAFFEINE FIX CAFE
[43mC[0mAFFINO
[43mC[0mAFIERO LUSSIER
[43mC[0mAFÃ‰ GUSTO
[43mC[0mAGEN
[43mC[0mAIN'S TAVERN
[43mC[0mAIRO GRILL & SEAFOOD
[43mC[0mAJA MUSICAL
[43mC[0mAJUN CAFE & GRILL
[43mC[0mAKE AMBIANCE
[43mC[0mAKE BOSS CAFE
[43mC[0mAKE HOUSE WIN
[43mC[0mAKE MIO BAKERY
[43mC[0mAKE SHOP
[43mC[0mAKE TIN
[43mC[0mAKES 'N SHAPES
[43mC[0mAKOR RESTAURANT
[43mC[0mALACA
[43mC[0mALAVERAS
[43mC[0mALEDONIA
[43mC[0mALEXICO
[43mC[0mALEXICO CARNE ASADA
[43mC[0mALI AJI CON SABOR RESTAURANT
[43mC[0mALIBELLA BAKERY
[43mC[0mALICO JACKS
[43mC[0mALICO JACKS CANTINA
[43mC[0mALIENTE  GRILL
[43mC[0mALIENTE CAB
[43mC[0mALIENTE CAB CO
[43mC[0mALIENTITO DELI, RESTAURANT & LOUNGE BAR
[43mC[0mALIFORNIA PIZZA KITCHEN
[43mC[0mALISTA SUPERFOODS
[43mC[0mALIXTO'S COFFEE SHOP
[43mC[0mALL IT A WRAP
[43mC[0mALLE DAO
[43mC[0mALLE OCHO
[43mC[0mALLIOPE


[43mC[0mITY CHOW CAFE V (EQUINOX)
[43mC[0mITY COFFEE
[43mC[0mITY COLLEGE CAFETERIA
[43mC[0mITY COLLEGE MARSHAK CAFE
[43mC[0mITY CRAB
[43mC[0mITY DINER
[43mC[0mITY ECLAIR
[43mC[0mITY GOURMET
[43mC[0mITY HALL RESTAURANT
[43mC[0mITY ISLAND CHINESE RESTAURANT
[43mC[0mITY ISLAND DELI
[43mC[0mITY ISLAND DINER /SNUG BAR
[43mC[0mITY ISLAND LOBSTER HOUSE
[43mC[0mITY ISLAND YACHT CLUB
[43mC[0mITY ISLAND YOGURT INC
[43mC[0mITY KITCHEN
[43mC[0mITY LINE PIZZA & PASTA
[43mC[0mITY LOBSTER & STEAK
[43mC[0mITY MARKET CAFE
[43mC[0mITY OF SAINTS COFFEE ROASTERS
[43mC[0mITY ONE CHINESE RESTAURANT
[43mC[0mITY PERK
[43mC[0mITY PIE
[43mC[0mITY PLACE GRILL
[43mC[0mITY RESTAURANT
[43mC[0mITY SANDWICH
[43mC[0mITY SLICE
[43mC[0mITY SNOOKER POOL HOUSE
[43mC[0mITY SWIGGERS
[43mC[0mITY TECH BOOKSTORE & CAFE
[43mC[0mITY TECH CAFE
[43mC[0mITY VIEW DINER
[43mC[0mITY VIEW RACQUET CLUB
[43mC[0mITY WINERY
[43mC[0mITY WINGS  CAFE
[43mC[0mITYRIB
[43mC

As you can see, regular expressions allow for a variety of creative, dynamic pattern-matching mechanisms. Let's try them out on a familiar piece of text.
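
To see a few of these building blocks side by side, here is a minimal sketch that uses the `re` module directly on a small, made-up list of names. The `sample_names` list and the `matching()` helper are illustrative assumptions, not part of the restaurant dataset or the `grep()` function above. It contrasts `^THE|123`, where the anchor applies only to `THE`, with `^(THE|123)`, where the group anchors both alternatives.

```python
import re

# Hypothetical sample data, loosely modeled on the restaurant names above
sample_names = ['THE ABBEY', '123 NIKKO', 'CAFE 11231', 'BAR 13', 'LEE GARDEN', 'AMY 7']

def matching(pattern, names):
    """Return the names in which the pattern is found anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

# Without a group, ^ binds only to THE, so 123 may appear anywhere
unanchored = matching(r'^THE|123', sample_names)

# With a group, both alternatives must appear at the start of the name
anchored = matching(r'^(THE|123)', sample_names)

# Quantifier: names ending in two or more digits
two_plus_digits = matching(r'\d{2,}$', sample_names)

# Character set: names starting with A, B, C, or a digit
abc_or_digit = matching(r'^[ABC\d]', sample_names)

print(unanchored)
print(anchored)
print(two_plus_digits)
print(abc_or_digit)
```

Running this shows that `CAFE 11231` is picked up by the unanchored pattern but not by the grouped one, which is the same effect visible in the restaurant output above.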

### Exercise
Use regular expressions to extract...
- All the capitalized words in the speech
- All the words preceding a period
- All the words that start with a vowel
- Every line of the speech
- Every sentence of the speech

To get a better sense of how the **`re`** library works, use the **`re.compile()`** function instead of the **`grep()`** function we've built.
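
As a quick orientation before the solutions, the sketch below shows two equivalent ways of using a compiled pattern. The sample sentence is made up for illustration.

```python
import re

text = 'We saw Paris in May. Then we left.'  # made-up sample text

# re.compile() turns the pattern string into a reusable pattern object
regex = re.compile(r'[A-Z]\w+')

# Calling the method on the compiled object...
from_method = regex.findall(text)

# ...is equivalent to passing the compiled object to the module-level function
from_function = re.findall(regex, text)

print(from_method)
print(from_method == from_function)
```

The solutions below use the second form, `re.findall(regex, text)`; either style should give the same list of matches.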

In [38]:
gettysburg = '''
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.
We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.
'''

#### Capitalized Words

In [39]:
regex = re.compile(r'[A-Z]\w+')
re.findall(regex, gettysburg)

['Four',
 'Liberty',
 'Now',
 'We',
 'We',
 'It',
 'But',
 'The',
 'The',
 'It',
 'It',
 'God']

#### Words Preceding Period

In [40]:
regex = re.compile(r'\w+\.')
re.findall(regex, gettysburg)

['equal.',
 'endure.',
 'war.',
 'live.',
 'this.',
 'ground.',
 'detract.',
 'here.',
 'advanced.',
 'earth.']

#### Words Starting with Vowel

In [41]:
# Inside a character set, commas are literal characters, so the vowels are listed without them
regex = re.compile(r'\s[aeiou]\w+')
re.findall(regex, gettysburg)

[' and',
 ' ago',
 ' our',
 ' on',
 ' in',
 ' and',
 ' all',
 ' are',
 ' equal',
 ' are',
 ' engaged',
 ' in',
 ' or',
 ' any',
 ' and',
 ' endure',
 ' are',
 ' on',
 ' of',
 ' of',
 ' as',
 ' is',
 ' altogether',
 ' and',
 ' in',
 ' and',
 ' it',
 ' above',
 ' our',
 ' add',
 ' or',
 ' it',
 ' is',
 ' us',
 ' unfinished',
 ' advanced',
 ' is',
 ' us',
 ' us',
 ' increased',
 ' of',
 ' in',
 ' under',
 ' of',
 ' and',
 ' of',
 ' earth']

#### Sentences

In [42]:
regex = re.compile(r'[^\n][\w\s,-]+\.')
re.findall(regex, gettysburg)

['Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.',
 'Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.',
 'We are met on a great battle-field of that war.',
 ' We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live.',
 ' It is altogether fitting and proper that we should do this.',
 'But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground.',
 ' The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract.',
 ' The world will little note, nor long remember what we say here, but it can never forget what they did here.',
 ' It is for us the living, rather, to be dedicated here to the unfinished w

#### Lines

In [43]:
regex = re.compile(r'.+')
re.findall(regex, gettysburg)

['Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.',
 'Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.',
 'We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.',
 'But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought

Regular expressions aren't just useful for _finding_ pattern matches within text; they're also useful for _extracting_ particular parts of a match. We can accomplish this with **capture groups**, which mark off sub-patterns within a regular expression. **Capture groups** are defined by wrapping part of the pattern in parentheses.

For example, what if we wanted to use regular expressions to split apart the user names and domain names in a handful of email addresses?

We can pretty easily write a regular expression that matches the email addresses:

In [44]:
# Read the file inside a with-block so it is closed automatically
with open('Emails.csv') as f:
    emails = f.read().split('\n')
grep(r'[\w\.]+@\w+\.[\w\.]+',emails)

[43msrylance0@dot.gov[0m
[43mbmacalees1@businessinsider.com[0m
[43mckemish2@bing.com[0m
[43msbentick3@usa.gov[0m
[43mmmckenna4@geocities.com[0m
[43madowtry5@tripadvisor.com[0m
[43mdhegges6@cmu.edu[0m
[43mcrace7@yellowpages.com[0m
[43mmsebyer8@bloglovin.com[0m
[43mahuckster9@eventbrite.com[0m
[43mwacotta@blogtalkradio.com[0m
[43mnemblemb@dmoz.org[0m
[43mdhuckfieldc@acquirethisname.com[0m
[43mjborleased@sohu.com[0m
[43mygalwaye@cnet.com[0m
[43mjseefeldtf@cisco.com[0m
[43mcstrondg@phpbb.com[0m
[43mwpirieh@twitpic.com[0m
[43mljoyneri@wordpress.com[0m
[43mzgreetj@craigslist.org[0m
[43mtblunnk@google.ru[0m
[43mkgerckensl@boston.com[0m
[43manayshem@seesaa.net[0m
[43mssilverlockn@google.com[0m
[43mtphillotto@columbia.edu[0m
[43mcbrasnerp@yolasite.com[0m
[43mlspurrittq@drupal.org[0m
[43miriddickr@elegantthemes.com[0m
[43mlchallaces@clickbank.net[0m
[43mnleguint@yellowbook.com[0m
[43mndankovu@dion.ne.jp[0m
[43mjwinspurv@lulu.com[0m


[43mrscanderetks@tinypic.com[0m
[43mdbaumertkt@moonfruit.com[0m
[43mkobernku@npr.org[0m
[43mfzettlerkv@gravatar.com[0m
[43mobeynekw@dot.gov[0m
[43mmdopsonkx@narod.ru[0m
[43mmjarmynky@comcast.net[0m
[43mmcrambiekz@npr.org[0m
[43mlgerinl0@tuttocitta.it[0m
[43mhboageyl1@51.la[0m
[43mmhollingsbyl2@statcounter.com[0m
[43mndargiel3@webnode.com[0m
[43mcebertzl4@facebook.com[0m
[43msdenisol5@gmpg.org[0m
[43mnnarbettl6@nba.com[0m
[43mbgreenwoodl7@msu.edu[0m
[43mbfetherbyl8@prnewswire.com[0m
[43mvgutersonl9@webeden.co.uk[0m
[43mmlamontla@cnn.com[0m
[43mevasnevlb@fotki.com[0m
[43mrshearalc@eepurl.com[0m
[43mknoonanld@rediff.com[0m
[43mjdobeyle@cornell.edu[0m
[43mcingreelf@sphinn.com[0m
[43mbstonmanlg@engadget.com[0m
[43mmawdelh@networksolutions.com[0m
[43mbtossellli@reuters.com[0m
[43mboverthrowlj@yolasite.com[0m
[43mdtuplinglk@virginia.edu[0m
[43mmgomarll@nba.com[0m
[43mgdarraghlm@free.fr[0m
[43mheneverln@tamu.edu[0m
[43mspeglerlo@

In [45]:
emails

['email',
 'srylance0@dot.gov',
 'bmacalees1@businessinsider.com',
 'ckemish2@bing.com',
 'sbentick3@usa.gov',
 'mmckenna4@geocities.com',
 'adowtry5@tripadvisor.com',
 'dhegges6@cmu.edu',
 'crace7@yellowpages.com',
 'msebyer8@bloglovin.com',
 'ahuckster9@eventbrite.com',
 'wacotta@blogtalkradio.com',
 'nemblemb@dmoz.org',
 'dhuckfieldc@acquirethisname.com',
 'jborleased@sohu.com',
 'ygalwaye@cnet.com',
 'jseefeldtf@cisco.com',
 'cstrondg@phpbb.com',
 'wpirieh@twitpic.com',
 'ljoyneri@wordpress.com',
 'zgreetj@craigslist.org',
 'tblunnk@google.ru',
 'kgerckensl@boston.com',
 'anayshem@seesaa.net',
 'ssilverlockn@google.com',
 'tphillotto@columbia.edu',
 'cbrasnerp@yolasite.com',
 'lspurrittq@drupal.org',
 'iriddickr@elegantthemes.com',
 'lchallaces@clickbank.net',
 'nleguint@yellowbook.com',
 'ndankovu@dion.ne.jp',
 'jwinspurv@lulu.com',
 'dbeeckxw@mozilla.org',
 'gvahlx@netvibes.com',
 'jhingy@github.com',
 'hdrynanz@amazon.com',
 'amaccaghan10@google.com',
 'mifill11@nature.com',
 'i

But decomposing them into their components is a different task. What if we wanted to decompose each email address into a user name, domain name, and domain suffix, like this:
> Max.Davish@ey.com

> **User Name:** Max.Davish

> **Domain Name:** ey

> **Domain Suffix:** com

To do this, we could use the same regular expression but implement three distinct **capture groups**. More specifically, we'd put parentheses around the parts of the regular expression that match each of the three components. It would look like this: 

In [46]:
grep(r'([\w\.]+)@(\w+)\.([\w\.]+)',emails)

[43msrylance0@dot.gov[0m
[43mbmacalees1@businessinsider.com[0m
[43mckemish2@bing.com[0m
[43msbentick3@usa.gov[0m
[43mmmckenna4@geocities.com[0m
[43madowtry5@tripadvisor.com[0m
[43mdhegges6@cmu.edu[0m
[43mcrace7@yellowpages.com[0m
[43mmsebyer8@bloglovin.com[0m
[43mahuckster9@eventbrite.com[0m
[43mwacotta@blogtalkradio.com[0m
[43mnemblemb@dmoz.org[0m
[43mdhuckfieldc@acquirethisname.com[0m
[43mjborleased@sohu.com[0m
[43mygalwaye@cnet.com[0m
[43mjseefeldtf@cisco.com[0m
[43mcstrondg@phpbb.com[0m
[43mwpirieh@twitpic.com[0m
[43mljoyneri@wordpress.com[0m
[43mzgreetj@craigslist.org[0m
[43mtblunnk@google.ru[0m
[43mkgerckensl@boston.com[0m
[43manayshem@seesaa.net[0m
[43mssilverlockn@google.com[0m
[43mtphillotto@columbia.edu[0m
[43mcbrasnerp@yolasite.com[0m
[43mlspurrittq@drupal.org[0m
[43miriddickr@elegantthemes.com[0m
[43mlchallaces@clickbank.net[0m
[43mnleguint@yellowbook.com[0m
[43mndankovu@dion.ne.jp[0m
[43mjwinspurv@lulu.com[0m


[43mdfrentzpy@home.pl[0m
[43mzandraultpz@goo.gl[0m
[43mttillyq0@storify.com[0m
[43mrpeelq1@upenn.edu[0m
[43mrgillogleyq2@hhs.gov[0m
[43mespadelliq3@people.com.cn[0m
[43mclongfutq4@bandcamp.com[0m
[43mdklimkoq5@cbc.ca[0m
[43mcseedmanq6@mapquest.com[0m
[43mkcrosherq7@naver.com[0m
[43mamccurleyq8@csmonitor.com[0m
[43mvhullinq9@photobucket.com[0m
[43mdacorsqa@multiply.com[0m
[43mhhamlettqb@artisteer.com[0m
[43mccruseqc@marriott.com[0m
[43mdfilipponeqd@examiner.com[0m
[43mcdecullipqe@woothemes.com[0m
[43mjpuseyqf@moonfruit.com[0m
[43mjplyqg@storify.com[0m
[43mnmacvaghqh@multiply.com[0m
[43mdbibbqi@dell.com[0m
[43mblynasqj@quantcast.com[0m
[43mdpanswickqk@upenn.edu[0m
[43mhaparkql@oaic.gov.au[0m
[43mltunniclisseqm@rediff.com[0m
[43mcheffordeqn@earthlink.net[0m
[43mccogginqo@phpbb.com[0m
[43magrattageqp@studiopress.com[0m
[43mghelmkeqq@google.com.br[0m
[43mcshaxbyqr@uol.com.br[0m
[43mcezzyqs@gnu.org[0m
[43mmminneyqt@un.org[0m
[4

The output of the grep function looks the same, but with the `re` library's **search()** function and the match object's **group()** method, we can extract each captured component individually. It works like this:

In [47]:
pattern = r'([\w\.]+)@(\w+)\.([\w\.]+)'
single_email = 'max.davish@ey.com'
matches = re.search(pattern, single_email)

print("Capture Group 1 -", matches.group(1))
print("Capture Group 2 -", matches.group(2))
print("Capture Group 3 -", matches.group(3))

Capture Group 1 - max.davish
Capture Group 2 - ey
Capture Group 3 - com
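
As an aside, a match object offers a few other ways to read capture groups. The sketch below uses the same pattern with named groups; the names `user`, `domain`, and `suffix` are labels chosen here for illustration and do not appear elsewhere in this lesson.

```python
import re

# Same three capture groups as above, each given a name with (?P<name>...)
pattern = re.compile(r'(?P<user>[\w\.]+)@(?P<domain>\w+)\.(?P<suffix>[\w\.]+)')

match = pattern.search('max.davish@ey.com')

all_groups = match.groups()    # tuple of every capture group, in order
by_name = match.groupdict()    # dict keyed by the group names
whole_match = match.group(0)   # group 0 is always the entire match

print(all_groups)
print(by_name['domain'])
print(whole_match)
```

Named groups can make longer patterns easier to read, since `by_name['domain']` says more than `match.group(2)`.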


Let's write a loop to extract these capture group matches and then store them as a dataframe to review the results. As always, it's a good idea to incorporate some error handling into the code, so that it makes note of any emails that don't follow our regex pattern but proceeds anyway.

In [48]:
emails_dicts = []
nonmatching_emails = []
pattern = r'([\w\.]+)@(\w+)\.([\w\.]+)'

for email in emails:
    matches = re.search(pattern, email)
    if matches:
        print('Found Match!', email)
        emails_dict = {
            'Email' : email,
            'User Name' : matches.group(1),
            'Domain' : matches.group(2),
            'Domain Suffix' : matches.group(3)
            }
        emails_dicts.append(emails_dict)
    else:
        nonmatching_emails.append(email)
        print("Not Found: ", email)

Not Found:  email
Found Match! srylance0@dot.gov
Found Match! bmacalees1@businessinsider.com
Found Match! ckemish2@bing.com
Found Match! sbentick3@usa.gov
Found Match! mmckenna4@geocities.com
Found Match! adowtry5@tripadvisor.com
Found Match! dhegges6@cmu.edu
Found Match! crace7@yellowpages.com
Found Match! msebyer8@bloglovin.com
Found Match! ahuckster9@eventbrite.com
Found Match! wacotta@blogtalkradio.com
Found Match! nemblemb@dmoz.org
Found Match! dhuckfieldc@acquirethisname.com
Found Match! jborleased@sohu.com
Found Match! ygalwaye@cnet.com
Found Match! jseefeldtf@cisco.com
Found Match! cstrondg@phpbb.com
Found Match! wpirieh@twitpic.com
Found Match! ljoyneri@wordpress.com
Found Match! zgreetj@craigslist.org
Found Match! tblunnk@google.ru
Found Match! kgerckensl@boston.com
Found Match! anayshem@seesaa.net
Found Match! ssilverlockn@google.com
Found Match! tphillotto@columbia.edu
Found Match! cbrasnerp@yolasite.com
Found Match! lspurrittq@drupal.org
Found Match! iriddickr@eleganttheme

Found Match! tlaste8l@vimeo.com
Found Match! dtomsu8m@pcworld.com
Found Match! mcockman8n@bloglines.com
Found Match! zpowrie8o@gizmodo.com
Not Found:  flantaph8p@shop-pro.jp
Found Match! jsline8q@home.pl
Found Match! twaplinton8r@cdc.gov
Found Match! asnooks8s@1und1.de
Found Match! jeyes8t@fotki.com
Found Match! psarle8u@yellowbook.com
Found Match! jquidenham8v@arstechnica.com
Found Match! hwiggin8w@cargocollective.com
Found Match! rrubee8x@ehow.com
Found Match! bstennett8y@cpanel.net
Found Match! nmattedi8z@chron.com
Found Match! ewhiff90@gov.uk
Found Match! mtinn91@about.com
Found Match! vbrandacci92@omniture.com
Found Match! aroyall93@reference.com
Found Match! grodden94@aboutads.info
Found Match! wwharrier95@techcrunch.com
Found Match! lforbes96@wired.com
Found Match! lrutty97@examiner.com
Found Match! ltiler98@purevolume.com
Found Match! cblyden99@pbs.org
Found Match! fvalasek9a@icq.com
Found Match! mmorton9b@buzzfeed.com
Found Match! cdevin9c@youtu.be
Not Found:  gweild9d@shop-pr

In [49]:
import pandas as pd
pd.DataFrame(emails_dicts).head(10)

Unnamed: 0,Domain,Domain Suffix,Email,User Name
0,dot,gov,srylance0@dot.gov,srylance0
1,businessinsider,com,bmacalees1@businessinsider.com,bmacalees1
2,bing,com,ckemish2@bing.com,ckemish2
3,usa,gov,sbentick3@usa.gov,sbentick3
4,geocities,com,mmckenna4@geocities.com,mmckenna4
5,tripadvisor,com,adowtry5@tripadvisor.com,adowtry5
6,cmu,edu,dhegges6@cmu.edu,dhegges6
7,yellowpages,com,crace7@yellowpages.com,crace7
8,bloglovin,com,msebyer8@bloglovin.com,msebyer8
9,eventbrite,com,ahuckster9@eventbrite.com,ahuckster9


These exercises only scratch the surface of what regular expressions can do. They are supported in virtually every programming language and show up in a great many programs that analyze text. For further information, consult the resources in the appendix, which detail even more of the pattern-matching capabilities of regular expressions.
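
As one small illustration of refining a pattern, recall that the loop's output above flagged `flantaph8p@shop-pro.jp` as not matching, because `\w` does not cover hyphens. The sketch below shows one possible adjustment that adds the hyphen to the relevant character sets. This is an assumption about which addresses we want to accept, not a complete email validator.

```python
import re

original = re.compile(r'([\w\.]+)@(\w+)\.([\w\.]+)')

# Assumed refinement: allow hyphens in the user name and the domain name
relaxed = re.compile(r'([\w.-]+)@([\w-]+)\.([\w.]+)')

address = 'flantaph8p@shop-pro.jp'

original_match = original.search(address)
relaxed_match = relaxed.search(address)

print(original_match)            # no match: \w+ stops at the hyphen
print(relaxed_match.groups())
```

A hyphen placed at the end of a character set, as in `[\w.-]`, is treated as a literal hyphen rather than as part of a range.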

---
## Machine Learning for Language
---
So far we've learned techniques for reorganizing text, computing similarity, and matching complex patterns. But these techniques fall short of actually _understanding_ language in a meaningful way. To make progress on this much harder problem, we turn to **machine learning**.

Unsurprisingly, machine learning is at the heart of many cutting-edge language processing techniques. In fact, many of the algorithms and concepts covered in the last lesson can also be applied to text, if we creatively reconstruct the data using the techniques we've learned. Both supervised and unsupervised algorithms like **logistic regression** and **K-Means** clustering can be applied to language and text in much the same way they're applied to numeric or categorical data. There are also algorithms such as **Naive Bayes**, which we'll cover in this lesson, that are particularly well suited to language-based machine learning, as well as cutting-edge algorithms like **neural networks**, which fall beyond the scope of this course but are integral to emerging fields like [voice technology](https://research.googleblog.com/2015/08/the-neural-networks-behind-google-voice.html).

To work with machine learning, we'll use **scikit-learn**, just as in the previous lesson, and for special language processing capabilities we'll also use the [**Natural Language Toolkit (NLTK)**](https://www.nltk.org/), a leading platform for working with language in Python. If you haven't installed them, do so now:

In [50]:
!pip install scikit-learn
!pip install nltk



In [51]:
import sklearn
import nltk

Before we delve into specific examples, we need to understand how textual, language-based data can be transformed into a format that machine learning algorithms can handle.

In the previous lesson, our empty scikit-learn models took in dataframes and/or matrices as arguments and output fitted models based on the data.

The **mtcars** dataset is a classic example:

In [52]:
# DataFrame.from_csv was removed from pandas; read_csv with index_col=0 is the equivalent call
pd.read_csv('mtcars.csv', index_col=0).head(10)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


A dataframe like this is an ideal input for a machine learning model: rows full of observations and columns full of numeric and categorical datapoints.

How can we organize language into a similar matrix that a computer can easily understand? We can use a method called [**tokenization**](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html), which splits pieces of text into component attributes. Once tokenized, every sentence, product review, email, or book (whatever unit of language we're using) becomes a _row_ in a dataframe, and each unique _word_ becomes a column.

Take this sentence, for example:

> It was the best of times, it was the worst of times.

Tokenized, this sentence becomes...

| |It| was | the | best | of | times | worst |
| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------|
| **Sentence #1** | 2 | 2 | 2 | 1 | 2 | 2 |  1| 

The sentence becomes a row in the dataframe, and each column corresponds to a unique word in the sentence. The datapoints are the **counts** of each word in the sentence (ignoring capitalization, so "It" and "it" share a column).

What if we added a second sentence to our data? We'd need to not only add another row, but more columns too. If we added this sentence to the data...
> Happy families are all alike; every unhappy family is unhappy in its own way. 

... suddenly the dataframe would expand and look like this:

| | It | was | the | best | of | times | worst | Happy | families | are | all | alike | every | unhappy | family | is | in | its | own | way |
| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------| :-------|
| **Sentence #1** | 2 | 2 | 2 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **Sentence #2** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |

(You can imagine that these dataframes grow enormously large as we add more sentences, emails, speeches, or whatever else we're cataloging.)
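
The word-count matrix described above can be sketched with nothing more than the standard library. This is a simplified illustration under a few assumptions: tokens are taken to be runs of letters, and everything is lowercased so that "It" and "it" share a column. The `tokenize()` helper and variable names are made up for this example. In practice, scikit-learn's `CountVectorizer` builds a comparable matrix.

```python
import re
from collections import Counter

sentences = [
    'It was the best of times, it was the worst of times.',
    'Happy families are all alike; every unhappy family is unhappy in its own way.',
]

def tokenize(text):
    """Lowercase the text and pull out runs of letters as tokens."""
    return re.findall(r'[a-z]+', text.lower())

# One Counter of word counts per sentence
counts = [Counter(tokenize(sentence)) for sentence in sentences]

# The column headers: every distinct word, in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

# One row per sentence, one column per word; a Counter returns 0 for missing words
matrix = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row lines up with a row of the table above, and the vocabulary list grows with every new sentence that introduces unseen words.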

Individual words constitute the **independent variables** of a language-based ML model, but they're incomplete without a **dependent variable**. In this example, perhaps we'd use the _author_ of each sentence as the dependent variable. This becomes another - but the most important - column in the dataframe:

| | **Author**|It| was | the | best | of | times | worst | Happy | families| are| all| alike| every| unhappy| family| is| in| its| own| way|
| :-------| :-------| :-------| :-------| :-------| :-------|
| **Sentence #1** | **Dickens** | 2 | 2 | 2 | 1 | 2 | 2 |  1|  0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| **Sentence #2** | **Tolstoy** | 0 | 0| 0 | 0 | 0 | 0 |  0|  1| 1| 1| 1| 1| 1| 2| 1| 1| 1| 1| 1| 1|

If we extended our example to include a lot more data, we could use a **supervised classification algorithm**, like the Decision Tree we learned about in the previous lesson, to build a model that could predict whether a sentence was written by Charles Dickens or Leo Tolstoy (or any number of other authors, for that matter).

But there's more to tokenization, which, like other aspects of machine learning, can be more of an art than a science. Here are some techniques that we can and should use during tokenization to make the most out of our data:

### Lemmatization
Consider the words "family" and "families" in the sentence above. Each word gets a separate column in the dataframe, but should that be so? Don't those words express the same idea? From a modeling perspective, does it really make sense to treat them differently?

Probably not, and that's where [**lemmatization**](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) comes in handy. Lemmatization is the process of transforming words into their base forms. So for example, a lemmatizer would treat plural and singular forms - "family" and "families" - the same. Similarly, different conjugations or tenses of the same verb - "is", "was", "were", "am", etc. - would be treated the same as well.
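To make the idea concrete, here is a toy illustration built on a small hand-written lookup table. The table and the `toy_lemmatize` name are hypothetical; a real lemmatizer relies on a full dictionary and morphological rules rather than a handful of entries.

```python
# Hypothetical hand-written table mapping inflected forms to their base forms.
LEMMA_TABLE = {
    "families": "family",
    "is": "be",
    "was": "be",
    "were": "be",
    "am": "be",
}

def toy_lemmatize(word):
    # Lowercase first so "Families" and "families" are looked up the same way.
    word = word.lower()
    # Words missing from the table are returned unchanged.
    return LEMMA_TABLE.get(word, word)

tokens = ["Families", "was", "family", "times"]
lemmas = [toy_lemmatize(token) for token in tokens]
print(lemmas)
```

After this step, "Families" and "family" would share a single column in the dataframe.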

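Lemmatization is accomplished through a complex algorithm, but fortunately we can easily use it through the NLTK library. (The WordNet lemmatizer relies on NLTK's `wordnet` corpus, which can be fetched with `nltk.download('wordnet')` if it is not already installed.)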

In [53]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize("families")

'family'

As you can see, if we'd used a lemmatizer from the start, we would have only had one column for "family" in our dataframe, rather than two for "family" and "families".

Let's try it with a few verbs, too:

In [54]:
be_verbs = ['was', 'am', 'were', 'are']

for be_verb in be_verbs:
    print(be_verb + ' ---> ' + wordnet_lemmatizer.lemmatize(be_verb))

was ---> wa
am ---> am
were ---> were
are ---> are


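This doesn't quite work as expected. These are all forms of the same verb, so ideally we'd want them all reduced to the same base form. Instead, three are left unchanged and "was" is mangled into "wa", because the lemmatizer treated it as a plural noun and stripped the final "s".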

The lemmatizer has failed because we didn't specify a **part of speech** argument. By default, the NLTK lemmatizer treats words as nouns, so it works well for words like "family" and "families", but not for irregular verbs like "to be". To fix this, we just need to add a **`pos`** argument to the **`lemmatize`** function.

In [55]:
be_verbs = ['was', 'am', 'were', 'are']

for be_verb in be_verbs:
    print(be_verb + ' ---> ' + wordnet_lemmatizer.lemmatize(be_verb, pos='v'))

was ---> be
am ---> be
were ---> be
are ---> be


Much better - but it still presents a problem. We can't manually specify the part of speech for every single word that we deal with. Ideally, we would want our lemmatizer to algorithmically determine the part of speech of each token word and then lemmatize accordingly. 

Fortunately, NLTK has a tool for this too, called a **part of speech tagger**. To use it, we first have to use the **`download()`** function to install the **averaged_perceptron_tagger**, a pretrained model that predicts the part of speech of English words.

In [56]:
import nltk

# The resource is named 'averaged_perceptron_tagger' (note the "d").
# Downloading requires network access; newer NLTK releases may ask for
# 'averaged_perceptron_tagger_eng' instead.
nltk.download('averaged_perceptron_tagger')

Now we can import the **`pos_tag`** function and use it to find the part of speech of words (or lists of words).

In [57]:
from nltk import pos_tag
#Note that the POS tagger takes a list as its argument, rather than a single word.
#This is because it normally receives the list of tokens from a tokenizer, as we'll show shortly.
pos_tag(['running'])

[('running', 'VBG')]

The function outputs a list of tuples, each pairing an original word with its part of speech.

In this case, it outputs **VBG** which means _Verb, gerund or present participle_. You can find a full list of the parts of speech that this function outputs [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). 

Using this output, we can write a function that _dynamically_ lemmatizes words based on their part of speech.

In [58]:
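def dynamic_lemmatize(word):
    #Tag the single word; pos_tag returns a list like [('running', 'VBG')]
    tagged = nltk.pos_tag([word])
    #The first letter of the tag, lowercased, matches WordNet's codes
    #for nouns ('n'), verbs ('v'), and adverbs ('r')
    pos_argument = tagged[0][1][0].lower()
    #Adjective tags start with 'J', but WordNet expects 'a'
    if pos_argument == 'j':
        pos_argument = 'a'
    #Other tags (e.g. determiners or prepositions) have no WordNet code
    #and may raise a KeyError, which the caller needs to handle
    return wordnet_lemmatizer.lemmatize(word, pos_argument)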

As we can see, the lemmatizer now works for all of our previous test cases, regardless of part of speech!

In [59]:
test_words = ['was', 'am', 'were', 'are', 'family', 'families', 'quickly', 'large', 'larger']

for test_word in test_words:
    print(test_word + ' ---> ' + dynamic_lemmatize(test_word))

was ---> be
am ---> be
were ---> be
are ---> be
family ---> family
families ---> family
quickly ---> quickly
large ---> large
larger ---> large
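One variation worth considering: `dynamic_lemmatize` passes the first letter of any tag straight to the lemmatizer, including letters that WordNet does not recognize. The sketch below maps Penn Treebank tags to the WordNet codes for nouns, verbs, adjectives, and adverbs, and falls back to noun for everything else. The function name `treebank_to_wordnet` and the noun fallback are assumptions for illustration, not part of NLTK.

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet part-of-speech code."""
    first_letter = tag[:1].upper()
    # Only the first letter is inspected, so particles tagged 'RP'
    # would also land on 'r' in this simplified version.
    mapping = {"J": "a", "V": "v", "N": "n", "R": "r"}
    # Fall back to noun, mirroring the lemmatizer's own default.
    return mapping.get(first_letter, "n")

for tag in ["VBG", "JJR", "NNS", "RB", "IN"]:
    print(tag, "--->", treebank_to_wordnet(tag))
```

With a mapping like this, words tagged as prepositions or determiners would be lemmatized as nouns instead of raising an error. Whether that is preferable to dropping them is a judgment call.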


### Stop Words
Another technique we can use to improve the quality of our data is to remove **stop words** from our tokens. [Stop words](https://en.wikipedia.org/wiki/Stop_words) are words that are so common that they add little or no value to language-processing programs and are therefore removed in pre-processing.

As you may have guessed, NLTK offers a list of these too:

In [60]:
from nltk.corpus import stopwords

#Downloading requires network access; skip this line if the corpus is already installed.
nltk.download('stopwords')
#The corpus file is named in lowercase, which matters on case-sensitive file systems.
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

As you can see, most are pronouns, prepositions, basic linking verbs, and other rudimentary words that are extremely common in language but don't tell us much about the sentiment or content of speech. As such, they should be removed from our machine learning models - since bad or irrelevant data only weakens the accuracy of our models.
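Filtering them out takes a single list comprehension. The sketch below uses a small hand-picked set rather than NLTK's full list, purely so that it runs without any downloads; the variable names are illustrative.

```python
# A small hand-picked subset for illustration; NLTK's list is much longer.
stop_words = {"it", "was", "the", "of"}

tokens = ["it", "was", "the", "best", "of", "times",
          "it", "was", "the", "worst", "of", "times"]

# Storing the stop words in a set makes each membership check fast.
content_tokens = [token for token in tokens if token not in stop_words]
print(content_tokens)
```

Only the words that carry meaning ("best", "times", "worst") survive the filter.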

We're now ready to write a primitive **tokenization** function. 

### Exercise 
Write a function that takes as an input a piece of text - whether sentences, essays, or entire books - and distills it into **lemmatized tokens** without **stop words**.

Again, tokenization is more of an art than a science, and there is not necessarily one _correct_ way to tokenize. Still, we recommend a few measures to improve the predictive capacity of your data:
- Standardize the _case_ of all tokens by making them all uppercase or lowercase. There is no meaningful difference between "Family" and "family".
- Although you could use the **`.split()`** method (or perhaps even a regular expression) from base Python to split a large string up into component words, the **`nltk.tokenize.word_tokenize()`** function is a handy shortcut.
- The `word_tokenize()` function keeps **punctuation** marks and other small, irrelevant characters as separate tokens, so your tokenizer should only accept tokens with a length of **three characters** or greater. This gets rid of punctuation as well as short prepositions or titles not caught by the stopwords list.

In [61]:
def tokenize(text):
    #Splitting the words
    tokens = nltk.tokenize.word_tokenize(text)
    #Standardizing case
    tokens = [token.lower() for token in tokens]
    #Removing stop words (the list is loaded once into a set for fast lookups)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    #Removing punctuation/small characters
    tokens = [token for token in tokens if len(token) > 2]
    #Lemmatizing - tokens whose part of speech WordNet doesn't recognize
    #raise a KeyError and are skipped
    lemmatized_tokens = []
    for token in tokens:
        try:
            lemmatized_tokens.append(dynamic_lemmatize(token))
        except KeyError:
            pass
    return lemmatized_tokens

In [62]:
tokenize('It was the best of times, it was the worst of times.')

['best', 'time', 'bad', 'time']

### Word Occurrences vs. Presence vs. Frequency
The final step in preparing our data is to consider which independent variables are best for our model. In the example above, we used word **occurrences** - the number of times each word appears in each sentence - as the independent variables.

| | **Author**|It| was | the | best | of | times | worst | Happy | families| are| all| alike| every| unhappy| family| is| in| its| own| way|
| :-------| :-------| :-------| :-------| :-------| :-------|
| **Sentence #1** | **Dickens** | 2 | 2 | 2 | 1 | 2 | 2 |  1|  0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| **Sentence #2** | **Tolstoy** | 0 | 0| 0 | 0 | 0 | 0 |  0|  1| 1| 1| 1| 1| 1| 2| 1| 1| 1| 1| 1| 1|

For smaller datasets, however, it might be more appropriate to simply use word **presence** as the independent variable. With this method, rather than a **count** of each word, each column simply contains `TRUE` if the sentence contains the word, or `FALSE` if it doesn't.

Using word presence, our dataframe would look like this:

| | **Author**|It| was | the | best | of | times | worst | Happy | families| are| all| alike| every| unhappy| family| is| in| its| own| way|
| :-------| :-------| :-------| :-------| :-------| :-------|
| **Sentence #1** | **Dickens** | `TRUE` | `TRUE` | `TRUE` | `TRUE` | `TRUE` | `TRUE` | `TRUE`|  `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`| `FALSE`|
| **Sentence #2** | **Tolstoy** | `FALSE` | `FALSE`| `FALSE` | `FALSE` | `FALSE` | `FALSE` | `FALSE`|  `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`| `TRUE`|

Word presence is a more appropriate measure for smaller pieces of text - such as product reviews - where words are less likely to appear multiple times. Occurrences, on the other hand, are more useful for longer-form text - such as speeches or news articles - where words are more likely to occur multiple times and where there is a meaningful difference between, say, using a word once or five times.
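Converting occurrences to presence is a one-line transformation. The sketch below works on a hand-typed excerpt of the table (six of the columns), with hypothetical variable names.

```python
# A few columns from the table above, typed in by hand for illustration.
vocabulary = ["it", "was", "the", "best", "happy", "unhappy"]
count_rows = {
    "Sentence #1": [2, 2, 2, 1, 0, 0],
    "Sentence #2": [0, 0, 0, 0, 1, 2],
}

# Presence only records whether a word appears, not how many times.
presence_rows = {
    name: [count > 0 for count in counts]
    for name, counts in count_rows.items()
}

for name, row in presence_rows.items():
    print(name, row)
```

Note that the "unhappy" column for the second sentence becomes `True`, discarding the fact that the word appeared twice.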

For datasets whose documents vary in length, we may want to use **frequency** as the independent variable, rather than occurrences or presence. **Frequency** is simply the number of occurrences of one word/token divided by the total number of words/tokens in the document. By using **frequency** instead of **occurrences**, we remove the algorithm's bias toward longer documents, which will naturally contain more occurrences of each word.
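The effect of document length can be seen in a small sketch. The two documents and the `term_frequencies` helper below are made up for illustration: the long document simply repeats the short one ten times.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return each token's share of the document's total token count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

short_doc = ["time", "fly"]
long_doc = ["time", "fly"] * 10  # the same wording repeated ten times

# Raw occurrences differ by a factor of ten...
print(Counter(short_doc)["time"], Counter(long_doc)["time"])

# ...but the frequencies are identical.
print(term_frequencies(short_doc)["time"], term_frequencies(long_doc)["time"])
```

With occurrences, the long document looks ten times more "about" time than the short one; with frequency, the two are treated alike.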

Computing frequency is a simple task for Python - in fact, we've already done it earlier in this lesson with the **UN Speech**.

In [63]:
word_counts

{'every': 5,
 'ultimately,': 2,
 'cultures': 1,
 'expose': 1,
 'communities.': 1,
 'security': 2,
 'who,': 1,
 'suppression': 1,
 'sect;': 1,
 'intolerance.': 1,
 'far': 3,
 'seen': 1,
 'met': 1,
 'some': 8,
 'recognize': 7,
 'fourth': 1,
 'depths': 1,
 'order.': 1,
 'long.': 1,
 'surprise': 1,
 'east.': 1,
 'narrowing': 1,
 'aims': 1,
 'president;': 1,
 'enhanced': 1,
 'thrive;': 1,
 'i': 45,
 'ill-equipped,': 1,
 'dominate': 1,
 'place,': 1,
 'believe': 23,
 'luther': 1,
 'extremism': 3,
 'soviet': 1,
 'hardship': 1,
 'their': 21,
 'point': 1,
 'oligarchs': 1,
 'happen': 1,
 'new,': 1,
 'small': 2,
 'muzzling': 1,
 'asks': 1,
 'if': 18,
 'calls': 1,
 'strong,': 1,
 'systems': 1,
 'cannot': 5,
 'destructive': 1,
 'somewhat': 1,
 'catastrophe': 1,
 'terrorist': 2,
 'democracy,': 1,
 'americans': 2,
 'know': 3,
 'value': 1,
 'view': 1,
 'retreat': 1,
 'smartphone': 1,
 'ingenuity': 1,
 'truism': 1,
 'think': 6,
 'among': 3,
 'today,': 2,
 'divisions,': 1,
 'when': 8,
 'democratic,': 1,


Because we used a basic function to compile this dictionary, there are a handful of extraneous entries, such as words with punctuation still attached.

Let's recycle the code from the first section to complete the same exercise using our new tokenize function, calculating word **frequency** rather than **occurrence**.

In [64]:
words = tokenize(speech)
word_counts = dict()
for word in set(words):
    word_counts[word] = words.count(word) / len(words)

word_counts

{'argument': 0.0003746721618583739,
 'liberty': 0.0007493443237167478,
 'demographic': 0.0003746721618583739,
 'online': 0.0003746721618583739,
 'entrench': 0.0003746721618583739,
 'promise': 0.0011240164855751218,
 'weigh': 0.0003746721618583739,
 'position': 0.0007493443237167478,
 'israel': 0.0007493443237167478,
 'caricature': 0.0003746721618583739,
 'correction': 0.0003746721618583739,
 'spent': 0.0003746721618583739,
 'nationalism': 0.0003746721618583739,
 'city': 0.0003746721618583739,
 'poland': 0.0003746721618583739,
 'allows': 0.0007493443237167478,
 'expose': 0.0003746721618583739,
 'theory': 0.0003746721618583739,
 'security': 0.0014986886474334957,
 'trust': 0.0003746721618583739,
 'poorer': 0.0014986886474334957,
 'afternoon': 0.0003746721618583739,
 'tell': 0.0003746721618583739,
 'building': 0.0003746721618583739,
 'suppression': 0.0003746721618583739,
 'asia': 0.0003746721618583739,
 'difficult': 0.0011240164855751218,
 'back': 0.0018733608092918695,
 'venture': 0.0003

If we converted this data into a dataframe, it would have quite a few columns:

In [65]:
len(word_counts)

1170

But dataframes with hundreds or thousands of columns are not uncommon in language processing, and they make surprisingly good training sets for machine learning algorithms.

Now that we're well-versed in data preparation, it's time to create some ML models. You'll find that, once the data is prepped, creating effective predictive models is surprisingly easy and follows the exact same steps as the examples in the previous lesson.

## News Classification
In this example, we'll use a sample dataset of **news articles** (the "20 newsgroups" collection of newsgroup posts) from scikit-learn to create a model that predicts the category of an article based on its text. We'll use the **Multinomial Naive Bayes (MNB)** algorithm for classification - an algorithm that is heavily used for text processing.

In [66]:
import sklearn
from sklearn import datasets
from sklearn.datasets import fetch_20newsgroups

First, let's gather the "training" subset and view the category names (stored under target_names). This way we can later take a subset of the training categories to train our model on for simplicity. 

In [67]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

from pprint import pprint
pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


When the data is loaded, it is exposed as dictionary keys or object attributes:

* `target_names` ---> the category names
* `data` ---> the text of each post
* `filenames` ---> the path of each source file

We can call each of these as needed:

In [68]:
newsgroups_train.target_names

newsgroups_train.data

newsgroups_train.filenames

array([ 'C:\\Users\\hb711gf\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.autos\\102994',
       'C:\\Users\\hb711gf\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.mac.hardware\\51861',
       'C:\\Users\\hb711gf\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.mac.hardware\\51879',
       ...,
       'C:\\Users\\hb711gf\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60695',
       'C:\\Users\\hb711gf\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38319',
       'C:\\Users\\hb711gf\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.motorcycles\\104440'],
      dtype='<U97')

### Selecting training subset
Now, to keep the example simple, we will select a subset of categories to train our model on.

Arguments:
* `shuffle` - mixes up the order of the training data so that the model doesn't encounter long runs of the same category; you don't want the model to pick up on patterns in the ordering
* `random_state` - seeds the shuffle so that the data comes out in the same order on every run, which is important for repeatability; the particular number doesn't matter
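The role of the seed can be illustrated with the standard library's `random` module. This is only an analogy: scikit-learn relies on NumPy's random number generator internally, and the `seeded_shuffle` helper and document names below are hypothetical.

```python
import random

documents = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]

def seeded_shuffle(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    shuffled = list(items)
    # A dedicated Random instance keeps the seed from affecting other code.
    random.Random(seed).shuffle(shuffled)
    return shuffled

first_run = seeded_shuffle(documents, seed=42)
second_run = seeded_shuffle(documents, seed=42)

# Both runs produce exactly the same ordering.
print(first_run == second_run)
```

Changing the seed would generally give a different ordering, but any fixed value makes the experiment repeatable.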

In [69]:
# Using 4 of the newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
# Load in the files 
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42) 

### Preprocessing Text Data with Tokenization
Machine learning algorithms need numerical input, so the words must first be converted into numbers, a step known as vectorization. This can be done with either the `sklearn` or the `nltk` (Natural Language Toolkit) package. In practice, NLTK is more commonly used for tokenization, but, for consistency, we will show you how to tokenize using sklearn tools.

The following code uses `fit_transform`, which deserves a brief explanation.

Scikit-Learn exposes a standard API for machine learning that has two primary interfaces: Transformer and Estimator. Both transformers and estimators expose a `fit` method for adapting internal parameters based on data. Transformers then expose a `transform` method to perform feature extraction or modify the data for machine learning, and estimators expose a `predict` method to generate predictions from feature vectors.

In simple terms:
* Fit - learns parameters from the training data. For a scaler, these are the mean (μ) and standard deviation (σ); for `CountVectorizer`, it is the vocabulary, i.e. which word belongs to which column
* Transform - applies the parameters learned from the training set to data, including newly introduced data
* `fit_transform` - performs both steps at once on the same data
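The fit/transform contract can be sketched with a tiny stand-in class. `TinyCountVectorizer` is hypothetical and far simpler than scikit-learn's `CountVectorizer`; it only aims to show that `fit` learns the vocabulary and `transform` reuses it.

```python
import re

class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer."""

    def fit(self, documents):
        # Learn the vocabulary: each distinct word gets a column index.
        words = sorted({w for doc in documents for w in self._split(doc)})
        self.vocabulary_ = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for word in self._split(doc):
                index = self.vocabulary_.get(word)
                if index is not None:  # words unseen during fit are ignored
                    row[index] += 1
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        # Convenience method: learn the vocabulary, then vectorize the same data.
        return self.fit(documents).transform(documents)

    @staticmethod
    def _split(text):
        return re.findall(r"[a-z]+", text.lower())

vectorizer = TinyCountVectorizer()
train_rows = vectorizer.fit_transform(["the cat sat", "the dog sat down"])
new_rows = vectorizer.transform(["the bird sat"])

print(vectorizer.vocabulary_)
print(train_rows)
print(new_rows)
```

The word "bird" was never seen during fitting, so it has no column and is dropped when the new document is transformed. This is why new data is passed through `transform` rather than `fit_transform`.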


In [70]:
# Tokenization - turning words into vectors 
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [71]:
# vocabulary_ is a dictionary mapping each word to its column index in the count matrix
count_vect.vocabulary_.get('algorithm')

4690

### Convert occurrences to frequencies and use transformer

Computing "term frequencies" (tf), the number of times a word occurs in a document divided by the total number of words in that document, corrects for the skew toward longer documents. It is also useful to downweight words that appear in many documents, since they do little to distinguish one document from another. This is done by multiplying the "tf" by an "inverse document frequency" (idf), which gives the tf-idf score.
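A worked example may help. The sketch below uses the textbook definitions, tf as count divided by document length and idf as the logarithm of (number of documents / number of documents containing the word), on three made-up documents. scikit-learn's `TfidfTransformer` uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three tiny, already-tokenized documents made up for illustration.
documents = [
    ["the", "graphics", "card"],
    ["the", "church", "service"],
    ["the", "graphics", "the", "driver"],
]

n_docs = len(documents)
# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in documents for word in set(doc))

def tf_idf(doc):
    """Score each word in doc by term frequency times inverse document frequency."""
    counts = Counter(doc)
    scores = {}
    for word, count in counts.items():
        tf = count / len(doc)
        idf = math.log(n_docs / doc_freq[word])
        scores[word] = tf * idf
    return scores

scores = tf_idf(documents[0])
for word, score in scores.items():
    print(word, round(score, 3))
```

"the" appears in every document, so its idf is log(1) = 0 and it contributes nothing. "card" appears in only one document and receives the highest score.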

In [72]:
from sklearn.feature_extraction.text import TfidfTransformer
# Fit estimator to data and transform count matrix to use tf*idf 
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

### Training the model with Multinomial Naive Bayes classifier
Since MultinomialNB is widely used for text classification, we proceed with it as the classifier.

In our example, we take MultinomialNB through both a "fit" and a "predict" phase. The `fit` method learns the model's parameters from the training vectors and their labels. The `predict` method then performs a classification on the test vector(s).

sklearn.naive_bayes.MultinomialNB().fit(X,y)

_Parameters:_
* X : Training vectors, where n_samples is the number of samples and n_features is the number of features.
* y : Target values.

sklearn.naive_bayes.MultinomialNB().predict(X)

_Parameters:_
* X : Test vectors to classify

In [73]:
from sklearn.naive_bayes import MultinomialNB
#Training
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

#Testing
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new) #Vectorize new docs using the vocabulary learned in training
X_new_tfidf = tfidf_transformer.transform(X_new_counts) #Reweight the counts using TF-IDF

predicted = clf.predict(X_new_tfidf) #Predict

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category])) #%r formats with repr(), %s with str()

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


********************************************************************************************************************************
_Additional Info on Naive Bayes vs. Multinomial Naive Bayes:_

> A Naive Bayes model assumes that each of the features it uses is conditionally independent of the others, given the class.

> Multinomial Naive Bayes additionally assumes that the features follow a multinomial distribution, which works well for data that can be turned into counts (such as word counts in text).

> "Naive Bayes classifier" is a general term for any classifier built on that conditional-independence assumption, while a Multinomial Naive Bayes classifier is a specific instance that models the features with a multinomial distribution (instead of, say, a Gaussian).
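For readers curious about what happens under the hood, here is a minimal plain-Python sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The function names, the tiny training set, and the labels are all made up for illustration; this is not scikit-learn's implementation, which works on sparse matrices and is far more efficient.

```python
import math
from collections import Counter, defaultdict

def train_mnb(documents, labels, alpha=1.0):
    """Fit a tiny multinomial Naive Bayes model on tokenized documents."""
    vocabulary = {word for doc in documents for word in doc}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        denominator = total_words + alpha * len(vocabulary)
        model[label] = {
            # Prior: the share of training documents belonging to this class.
            "log_prior": math.log(n_docs / len(documents)),
            # Smoothing keeps unseen word/class pairs from getting zero probability.
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denominator)
                for word in vocabulary
            },
        }
    return model

def predict_mnb(model, doc):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in doc:
            # Words outside the training vocabulary are skipped.
            if word in params["log_likelihood"]:
                score += params["log_likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)

train_docs = [
    ["god", "love", "faith"],
    ["church", "faith", "prayer"],
    ["opengl", "gpu", "render"],
    ["gpu", "pixel", "shader"],
]
train_labels = ["religion", "religion", "graphics", "graphics"]

model = train_mnb(train_docs, train_labels)
print(predict_mnb(model, ["god", "love"]))
print(predict_mnb(model, ["opengl", "gpu", "fast"]))
```

Summing log probabilities word by word is where the "naive" independence assumption shows up: each word contributes to the score on its own, regardless of the words around it.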

### Other Useful Resources
#### RegEx
- [RegexR](https://regexr.com/) - A useful tool for testing regular expressions without programming
- [DataQuest RegEx Guide](https://www.dataquest.io/blog/regular-expressions-data-scientists/)
- [Python Documentation](https://docs.python.org/2/library/re.html)
- [Google Developers Tutorial](https://developers.google.com/edu/python/regular-expressions)