<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 1. Goals of this Unit
*Natural Language Processing*

----
The goal of this unit is to introduce the field of natural language processing and provide an overview of common applications, techniques, and challenges.

<br/>After this unit, you will be able to:
- Understand what natural language processing is.
- Recognize common applications and challenges within natural language processing.
- Identify several natural language processing techniques and how they relate to each other.
- Try out a few natural language processing techniques using Python.

<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 2. Intro to NLP
*Natural Language Processing*

----
Look at the technologies around us:
- Spellcheck and autocorrect
- Auto-generated video captions
- Virtual assistants like Amazon’s Alexa
- Autocomplete
- Your news site’s suggested articles

<br/>What do they have in common?

<br/>All of these handy technologies exist because of *natural language processing!* Also known as NLP, the field is at the intersection of linguistics, artificial intelligence, and computer science. The goal? Enabling computers to interpret, analyze, and approximate the generation of human languages (like English or Spanish).

<br/>NLP got its start around 1950 with Alan Turing’s test for artificial intelligence, which evaluates whether a computer can use language to fool humans into believing it’s human.

<br/>But approximating human speech is only one of a wide range of applications for NLP! Applications from detecting spam emails or bias in tweets to improving accessibility for people with disabilities all rely heavily on natural language processing techniques.

<br/>NLP can be conducted in several programming languages. However, Python has some of the most extensive open-source NLP libraries, including the Natural Language Toolkit or *NLTK.* Because of this, you’ll be using Python to get your first taste of NLP.
<img src="Images/Natural_Language_Processing_Overview.webp" width="50%" height="50%">

<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 3. Text Preprocessing
*Natural Language Processing*

----
> "You never know what you have... until you clean your data."

Cleaning and preparation are crucial for many tasks, and NLP is no exception. *Text preprocessing* is usually the first step you’ll take when faced with an NLP task.

<br/>Without preprocessing, your computer interprets `"the"`, `"The"`, and `"<p>The"` as entirely different words. There is a LOT you can do here, depending on the formatting you need. Luckily, regular expressions (regex) and NLTK will handle most of it for you! Common tasks include:

<br/>**Noise removal** — stripping text of formatting (e.g., HTML tags).

<br/>**Tokenization** — breaking text into individual words.

<br/>**Normalization** — cleaning text data in any other way:
- *Stemming* is a blunt axe to chop off word prefixes and suffixes. “booing” and “booed” become “boo”, but “computer” may become “comput” and “are” would remain “are.”
- *Lemmatization* is a scalpel to bring words down to their root forms. For example, NLTK’s savvy lemmatizer knows “am” and “are” are related to “be.”
- Other common tasks include lowercasing, stopwords removal, spelling correction, etc.
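
<br/>To make these steps concrete, here is a minimal sketch that uses only Python's built-in `re` module. The function names, the tiny `STOP_WORDS` set, and the suffix rules in `crude_stem` are assumptions made up for illustration. They are not NLTK's implementations, and `crude_stem` is a toy suffix stripper, not the Porter algorithm.

```python
import re

# Hypothetical, deliberately tiny stop-word list (NLTK ships a much fuller one).
STOP_WORDS = {"the", "a", "an", "of", "and", "to"}

def remove_noise(text):
    """Noise removal: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenization: pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(tokens):
    """Normalization: lowercase every token, then drop stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]

def crude_stem(token):
    """Toy stemmer: chop one common suffix if at least 3 characters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

raw = "<p>The squids are jumping. The squids jumped!</p>"
tokens = tokenize(remove_noise(raw))
cleaned_tokens = normalize(tokens)
stems = [crude_stem(token) for token in cleaned_tokens]

print(tokens)
print(cleaned_tokens)
print(stems)
```

Like a real stemmer, the toy version turns "jumping" and "jumped" into "jump" but leaves "are" untouched, because it only looks at word endings and knows nothing about "be."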

<br/>*Exercise:*
1. We used NLTK’s PorterStemmer to normalize the text — run the code to see how it does. (It may take a few seconds for the code to run.)

In [14]:
# regex for removing punctuation!
import re
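# nltk preprocessing magic
import nltk
# first run only: nltk.download('punkt') and nltk.download('wordnet') fetch the
# tokenizer and WordNet data (newer NLTK releases may ask for 'punkt_tab' instead)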
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# part-of-speech helper used in step 3; uncomment once a part_of_speech module is available:
#from part_of_speech import get_part_of_speech

text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."

# raw string so that \W reaches the regex engine without an invalid-escape warning
cleaned = re.sub(r'\W+', ' ', text)
tokenized = word_tokenize(cleaned)

stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]

## -- CHANGE these -- ##
lemmatizer = None
lemmatized = []

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)

Stemmed text:
['so', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'i', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'i', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

Lemmatized text:
[]


2. In the output terminal you’ll see our program counts `"go"` and `"went"` as different words! Also, what’s up with `"mani"` and `"hardli"`? A lemmatizer will fix this. Let’s do it. Where `lemmatizer` is defined, replace `None` with `WordNetLemmatizer()`. Where we defined `lemmatized`, replace the empty list with a list comprehension that uses `lemmatizer` to `lemmatize()` each `token` in `tokenized`.

In [10]:
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)

Stemmed text:
['so', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'i', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'i', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

Lemmatized text:
['So', 'many', 'squid', 'are', 'jumping', 'out', 'of', 'suitcase', 'these', 'day', 'that', 'you', 'can', 'barely', 'go', 'anywhere', 'without', 'seeing', 'one', 'burst', 'forth', 'from', 'a', 'tightly', 'packed', 'valise', 'I', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angry', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minute', 'of', 'arriving', 'She', 'hardly', 'even', 'noticed']


3. Why are the lemmatized verbs like `"went"` still conjugated? By default `lemmatize()` treats every word as a noun. Give `lemmatize()` a second argument: `get_part_of_speech(token)`. This will tell our lemmatizer what part of speech the word is. Run your code again to see the result!

In [11]:
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)

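NameError: name 'get_part_of_speech' is not defined

This error occurs because the `from part_of_speech import get_part_of_speech` line in the first cell is commented out, so the helper never gets defined. Uncomment that import (with the `part_of_speech` module available) or define an equivalent helper, then run the cell again.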
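
If the `part_of_speech` module is not at hand, one possible stand-in for `get_part_of_speech` is sketched below. The original helper is not shown in this unit, so this is an assumption about how it might work: it asks WordNet for every synset of a word and returns the part-of-speech tag that appears most often. `most_common_pos` and `WORDNET_TAGS` are hypothetical names introduced here, and the sketch assumes the WordNet data has already been downloaded with `nltk.download('wordnet')`.

```python
from collections import Counter

# The four tags the WordNet lemmatizer understands: noun, verb, adjective, adverb.
WORDNET_TAGS = ("n", "v", "a", "r")

def most_common_pos(pos_tags, default="n"):
    """Return the most frequent WordNet tag in pos_tags, or default if none match."""
    counts = Counter({tag: 0 for tag in WORDNET_TAGS})
    for tag in pos_tags:
        # WordNet marks "satellite" adjectives with "s"; count them as adjectives.
        tag = "a" if tag == "s" else tag
        if tag in counts:
            counts[tag] += 1
    best_tag, best_count = counts.most_common(1)[0]
    return best_tag if best_count > 0 else default

def get_part_of_speech(word):
    """Guess a word's part of speech from the synsets WordNet lists for it."""
    # Imported lazily so the counting logic above works without NLTK installed.
    from nltk.corpus import wordnet
    return most_common_pos(synset.pos() for synset in wordnet.synsets(word))
```

With this defined in an earlier cell, `lemmatizer.lemmatize(token, get_part_of_speech(token))` has a function to call. A word WordNet does not know falls back to `"n"`, which matches the noun default of `lemmatize()`. Because it ignores sentence context, this guess can be wrong for words that serve as more than one part of speech.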