<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 1. Goals of this Unit
*Natural Language Processing*

----
The goal of this unit is to introduce the field of natural language processing and provide an overview of common applications, techniques, and challenges.

<br/>After this unit, you will be able to:
- Understand what natural language processing is.
- Describe common applications and challenges within natural language processing.
- Identify several natural language processing techniques and how they relate to each other.
- Try out a few natural language processing techniques using Python.

<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 2. Intro to NLP
*Natural Language Processing*

----
Look at the technologies around us:
- Spellcheck and autocorrect
- Auto-generated video captions
- Virtual assistants like Amazon’s Alexa
- Autocomplete
- Your news site’s suggested articles

<br/>What do they have in common?

<br/>All of these handy technologies exist because of *natural language processing!* Also known as NLP, the field is at the intersection of linguistics, artificial intelligence, and computer science. The goal? Enabling computers to interpret, analyze, and approximate the generation of human languages (like English or Spanish).

<br/>NLP got its start around 1950 with Alan Turing’s test for artificial intelligence, which evaluates whether a computer can use language to fool humans into believing it’s human.

<br/>But approximating human speech is only one of a wide range of applications for NLP! Applications from detecting spam emails or bias in tweets to improving accessibility for people with disabilities all rely heavily on natural language processing techniques.

<br/>NLP can be conducted in several programming languages. However, Python has some of the most extensive open-source NLP libraries, including the Natural Language Toolkit or *NLTK.* Because of this, you’ll be using Python to get your first taste of NLP.
<img src="Images/Natural_Language_Processing_Overview.webp" width="30%" height="30%">

<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 3. Text Preprocessing
*Natural Language Processing*

----
> "You never know what you have... until you clean your data."

Cleaning and preparation are crucial for many tasks, and NLP is no exception. *Text preprocessing* is usually the first step you’ll take when faced with an NLP task.

<br/>Without preprocessing, your computer interprets `"the"`, `"The"`, and `"<p>The"` as entirely different words. There is a LOT you can do here, depending on the formatting you need. Lucky for you, Regex and NLTK will do most of it for you! Common tasks include:

<br/>**Noise removal** — stripping text of formatting (e.g., HTML tags).

<br/>**Tokenization** — breaking text into individual words.

<br/>**Normalization** — cleaning text data in any other way:
- *Stemming* is a blunt axe to chop off word prefixes and suffixes. “booing” and “booed” become “boo”, but “computer” may become “comput” and “are” would remain “are.”
- *Lemmatization* is a scalpel to bring words down to their root forms. For example, NLTK’s savvy lemmatizer knows “am” and “are” are related to “be.”
- Other common tasks include lowercasing, stopwords removal, spelling correction, etc.
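
The first two tasks plus lowercasing can be sketched with nothing but Python's `re` module. This is a minimal illustration, not NLTK's tokenizer: the `preprocess` helper and its patterns are assumptions made for this example, and real HTML is better handled by a proper parser.

```python
import re

def preprocess(text):
    """Strip HTML-like tags, lowercase, and split into word tokens."""
    # Noise removal: replace anything that looks like an HTML tag with a space.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    # Normalization: lowercase so "The" and "the" count as the same word.
    lowered = without_tags.lower()
    # Tokenization: keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", lowered)

raw_samples = ["the", "The", "<p>The"]
tokens = [preprocess(sample) for sample in raw_samples]
print(tokens)
```

After this cleanup, all three samples reduce to the same single token, which is exactly what a word counter needs.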

<br/>*Exercise:*
1. We used NLTK’s PorterStemmer to normalize the text — see example below.

In [4]:
# regex for removing punctuation!
import re
# nltk preprocessing magic
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# grabbing a part of speech function:
#from part_of_speech import get_part_of_speech

text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."

cleaned = re.sub(r'\W+', ' ', text)
tokenized = word_tokenize(cleaned)

stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]

## -- CHANGE these -- ##
lemmatizer = None
lemmatized = []

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)

Stemmed text:
['so', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'i', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'i', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

Lemmatized text:
[]


2. In the output terminal you’ll see our program counts `"go"` and `"went"` as different words! Also, what’s up with `"mani"` and `"hardli"`? A lemmatizer will fix this. Let’s do it. Where `lemmatizer` is defined, replace `None` with `WordNetLemmatizer()`. Where we defined `lemmatized`, replace the empty list with a list comprehension that uses `lemmatizer` to `lemmatize()` each `token` in `tokenized`.

In [5]:
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)

Stemmed text:
['so', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'i', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'i', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

Lemmatized text:
['So', 'many', 'squid', 'are', 'jumping', 'out', 'of', 'suitcase', 'these', 'day', 'that', 'you', 'can', 'barely', 'go', 'anywhere', 'without', 'seeing', 'one', 'burst', 'forth', 'from', 'a', 'tightly', 'packed', 'valise', 'I', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angry', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minute', 'of', 'arriving', 'She', 'hardly', 'even', 'noticed']


3. Why are the lemmatized verbs like `"went"` still conjugated? By default `lemmatize()` treats every word as a noun. Give `lemmatize()` a second argument: `get_part_of_speech(token)`. This will tell our lemmatizer what part of speech the word is. Run your code again to see the result!

In [15]:
from nltk.tag import pos_tag
from nltk.corpus import wordnet

def get_part_of_speech(word):
    """Map a word's Penn Treebank POS tag to the WordNet constant lemmatize() accepts."""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)

Stemmed text:
['so', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'i', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'i', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

Lemmatized text:
['So', 'many', 'squid', 'be', 'jumping', 'out', 'of', 'suitcase', 'these', 'day', 'that', 'you', 'can', 'barely', 'go', 'anywhere', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightly', 'packed', 'valise', 'I', 'go', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angry', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minute', 'of', 'arrive', 'She', 'hardly', 'even', 'notice']


<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 4. Parsing Text
*Natural Language Processing*

----
You now have a preprocessed, clean list of words. Now what? It may be helpful to know how the words relate to each other and the underlying syntax (grammar). Parsing is an NLP process concerned with segmenting text based on syntax.

<br/>You probably do not want to be doing any parsing by hand, and NLTK has a few tricks up its sleeve to help you out:

<br/>*Part-of-speech tagging (POS tagging)* identifies parts of speech (verbs, nouns, adjectives, etc.). NLTK can do it faster (and maybe more accurately) than your grammar teacher.

<br/>*Named entity recognition (NER)* helps identify the proper nouns (e.g., “Natalia” or “Berlin”) in a text. This can be a clue as to the topic of the text and NLTK captures many for you.

<br/>*Dependency grammar trees* help you understand the relationship between the words in a sentence. It can be a tedious task for a human, so the Python library spaCy is at your service, even if it isn’t always perfect.

<br/>In English we leave a lot of ambiguity, so syntax can be tough, even for a computer program. Take a look at the following sentence:
<img src="Images/parsing_syntactic_ambiguity.gif" width="30%" height="30%">
Do I have the binoculars? Does the cow have binoculars? Does the tree have binoculars?

<br/>*Regex parsing,* using Python’s `re` library, allows for a bit more nuance. When coupled with POS tagging, you can identify specific phrase chunks. On its own, it can find you addresses, emails, and many other common patterns within large chunks of text.
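
As a rough sketch of regex parsing on its own, the snippet below pulls email addresses out of a sentence. The pattern and the sample addresses are assumptions for illustration; the pattern is deliberately simple and will not cover every valid address.

```python
import re

# A deliberately simple email pattern (an assumption for illustration);
# real-world address validation is considerably more involved.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

sample_text = "Write to squid.wrangler@example.com or valise-support@example.org today."
emails = EMAIL_PATTERN.findall(sample_text)
print(emails)
```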

In [16]:
import spacy
from nltk import Tree
dependency_parser = spacy.load('en_core_web_sm')
parsed_squids = dependency_parser("I saw a cow under a tree with binoculars")

# Assign a new value:
my_parsed_sentence = dependency_parser("The fox jumped over the lazy dog!")

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        parsed_child_nodes = [to_nltk_tree(child) for child in node.children]
        return Tree(node.orth_, parsed_child_nodes)
    else: return node.orth_

for sent in parsed_squids.sents: to_nltk_tree(sent.root).pretty_print()
for sent in my_parsed_sentence.sents: to_nltk_tree(sent.root).pretty_print()

        saw                     
  _______|___                    
 |          cow                 
 |    _______|____               
 |   |          under           
 |   |            |              
 |   |           tree           
 |   |        ____|_______       
 |   |       |           with   
 |   |       |            |      
 I   a       a        binoculars

    jumped              
  ____|_________         
 |    |        over     
 |    |         |        
 |   fox       dog      
 |    |      ___|____    
 !   The   the      lazy



<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 5. Language Models: Bag-of-Words
*Natural Language Processing*

----
How can we help a machine make sense of a bunch of word tokens? We can help computers make predictions about language by training a language model on a corpus (a bunch of example text).

<br/>*Language models* are probabilistic computer models of language. We build and use these models to figure out the likelihood that a given sound, letter, word, or phrase will be used. Once a model has been trained, it can be tested out on new texts.

<br/>One of the most common language models is the unigram model, a statistical language model commonly known as *bag-of-words*. As its name suggests, bag-of-words does not have much order to its chaos! What it does have is a tally of how many times each word appears. Consider the following text example:
<img src="Images/bag-of-words.gif" width="30%" height="30%">
Provided some initial preprocessing, bag-of-words would result in a mapping like:

In [17]:
{"the": 2, "squid": 1, "jump": 1, "out": 1, "of": 1, "suitcase": 1}

{'the': 2, 'squid': 1, 'jump': 1, 'out': 1, 'of': 1, 'suitcase': 1}

Now look at this sentence and mapping: “Why are your suitcases full of jumping squids?”

In [18]:
{"why": 1, "be": 1, "your": 1, "suitcase": 1, "full": 1, "of": 1, "jump": 1, "squid": 1}

{'why': 1,
 'be': 1,
 'your': 1,
 'suitcase': 1,
 'full': 1,
 'of': 1,
 'jump': 1,
 'squid': 1}

You can see how even with different word order and sentence structures, “jump,” “squid,” and “suitcase” are shared topics between the two examples. Bag-of-words can be an excellent way of looking at language when you want to make predictions concerning topic or sentiment of a text. When grammar and word order are irrelevant, this is probably a good model to use.
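
To make the comparison concrete, here is a small sketch that intersects the two mappings above. The hand-picked `stop_words` set is an assumption for this example; note that "of" is technically shared as well, but it says little about topic.

```python
first_bag = {"the": 2, "squid": 1, "jump": 1, "out": 1, "of": 1, "suitcase": 1}
second_bag = {"why": 1, "be": 1, "your": 1, "suitcase": 1,
              "full": 1, "of": 1, "jump": 1, "squid": 1}

# Words present in both bags, regardless of where they appeared in each sentence.
shared_words = set(first_bag) & set(second_bag)

# Filter function words with a tiny hand-picked stop list (an assumption).
stop_words = {"the", "of", "be", "your", "why", "out"}
shared_topics = sorted(shared_words - stop_words)
print(shared_topics)
```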

In [20]:
# importing regex and nltk
import re, nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# importing Counter to get word counts for bag of words
from collections import Counter

# Change text to another string:
text = "You will never reach your destination if you stop and throw stones at every dog that barks."

cleaned = re.sub(r'\W+', ' ', text).lower()
tokenized = word_tokenize(cleaned)

stop_words = stopwords.words('english')
filtered = [word for word in tokenized if word not in stop_words]

normalizer = WordNetLemmatizer()
normalized = [normalizer.lemmatize(token, get_part_of_speech(token)) for token in filtered]
# Comment out the print statement below
# print(normalized)

# Define bag_of_looking_glass_words & print:
bag_of_looking_glass_words = Counter(normalized)
print(bag_of_looking_glass_words)

Counter({'never': 1, 'reach': 1, 'destination': 1, 'stop': 1, 'throw': 1, 'stone': 1, 'every': 1, 'dog': 1, 'bark': 1})


<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 6. Language Models: N-Gram and NLM
*Natural Language Processing*

----
For parsing entire phrases or conducting language prediction, you will want to use a model that pays attention to each word’s neighbors. Unlike bag-of-words, the *n-gram* model considers a sequence of some number (*n*) of units and calculates the probability of each unit in a body of language given the preceding *n* − 1 units. Because of this, n-gram probabilities with larger *n* values can be impressive at language prediction.

<br/>Take a look at our revised squid example: “The squids jumped out of the suitcases. The squids were furious.”

<br/>A bigram model (where *n* is 2) might give us the following count frequencies:

In [21]:
{('', 'the'): 2, ('the', 'squids'): 2, ('squids', 'jumped'): 1, ('jumped', 'out'): 1, ('out', 'of'): 1, ('of', 'the'): 1, ('the', 'suitcases'): 1, ('suitcases', ''): 1, ('squids', 'were'): 1, ('were', 'furious'): 1, ('furious', ''): 1}

{('', 'the'): 2,
 ('the', 'squids'): 2,
 ('squids', 'jumped'): 1,
 ('jumped', 'out'): 1,
 ('out', 'of'): 1,
 ('of', 'the'): 1,
 ('the', 'suitcases'): 1,
 ('suitcases', ''): 1,
 ('squids', 'were'): 1,
 ('were', 'furious'): 1,
 ('furious', ''): 1}
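
One way to arrive at counts like these, sketched in plain Python rather than NLTK: split the text into sentences, pad each one with an empty string at both ends to mark its boundaries, and count adjacent pairs. The padding convention is an assumption chosen to match the mapping above.

```python
from collections import Counter
import re

text = "The squids jumped out of the suitcases. The squids were furious."

bigram_counts = Counter()
for sentence in text.lower().split("."):
    words = re.findall(r"\w+", sentence)
    if not words:
        continue  # skip the empty piece after the final period
    # Pad with empty strings to mark the sentence start and end.
    padded = [""] + words + [""]
    # Pair each token with the one that follows it.
    bigram_counts.update(zip(padded, padded[1:]))

print(bigram_counts)
```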

There are a couple of problems with the n-gram model:
1. How can your language model make sense of the sentence “The cat fell asleep in the mailbox” if it’s never seen the word “mailbox” before? During testing, your model will probably come across words that it never encountered in training (this issue also pertains to bag-of-words). A tactic known as *language smoothing* can help adjust probabilities for unknown words, but it isn’t always ideal.
2. For a model that more accurately predicts human language patterns, you want *n* (your sequence length) to be as large as possible. That way, you will have more natural-sounding language, right? Well, as the sequence length grows, the number of examples of each sequence within your training corpus shrinks. With too few examples, you won’t have enough data to make many predictions.
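
As a sketch of the first problem, the snippet below applies add-one (Laplace) smoothing, which is one common smoothing technique and an assumption here, since the text does not name a specific method. The toy training sentence is also hypothetical. Every word gets one extra count, and one vocabulary slot is reserved for unseen words, so "mailbox" ends up with a small but nonzero probability.

```python
from collections import Counter

training_tokens = "the cat fell asleep in the house".split()
counts = Counter(training_tokens)

# Reserve one extra vocabulary slot for any word never seen in training.
vocabulary_size = len(counts) + 1

def smoothed_probability(word):
    """Add-one (Laplace) estimate of a word's unigram probability."""
    return (counts[word] + 1) / (len(training_tokens) + vocabulary_size)

print(smoothed_probability("the"))      # seen twice in training
print(smoothed_probability("mailbox"))  # never seen in training
```

The trade-off is that probability mass is taken away from words the model has actually seen, which is part of why smoothing isn't always ideal.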

<br/>Enter *neural language models (NLMs)*! Much recent work within NLP has involved developing and training neural networks to approximate the approach our human brains take towards language. This deep learning approach allows computers a much more adaptive tack to processing human language. Common NLMs include LSTMs and transformer models.

<br/>*Exercise:*
1. If you run the code below, you’ll see the 10 most common bigrams, trigrams, and 4-grams in one of Churchill's quotes, computed with NLTK’s `ngrams` function. With `n` set to 1, the result would look like a bag of words, because it is one!
2. At what `n` does every sequence appear only once? In a text this short, that already happens at `n = 2`. This is where there may be too few examples of each sequence within your training corpus to make any helpful predictions.

In [23]:
import nltk, re
from nltk.tokenize import word_tokenize
# importing ngrams module from nltk
from nltk.util import ngrams
from collections import Counter

cleaned = re.sub(r'\W+', ' ', "You will never reach your destination if you stop and throw stones at every dog that barks.").lower()
tokenized = word_tokenize(cleaned)

# Change the n value to 2:
looking_glass_bigrams = ngrams(tokenized, 2)
looking_glass_bigrams_frequency = Counter(looking_glass_bigrams)

# Change the n value to 3:
looking_glass_trigrams = ngrams(tokenized, 3)
looking_glass_trigrams_frequency = Counter(looking_glass_trigrams)

# Change the n value to a number greater than 3:
looking_glass_ngrams = ngrams(tokenized, 4)
looking_glass_ngrams_frequency = Counter(looking_glass_ngrams)

print("Looking Glass Bigrams:")
print(looking_glass_bigrams_frequency.most_common(10))

print("\nLooking Glass Trigrams:")
print(looking_glass_trigrams_frequency.most_common(10))

print("\nLooking Glass n-grams:")
print(looking_glass_ngrams_frequency.most_common(10))

Looking Glass Bigrams:
[(('you', 'will'), 1), (('will', 'never'), 1), (('never', 'reach'), 1), (('reach', 'your'), 1), (('your', 'destination'), 1), (('destination', 'if'), 1), (('if', 'you'), 1), (('you', 'stop'), 1), (('stop', 'and'), 1), (('and', 'throw'), 1)]

Looking Glass Trigrams:
[(('you', 'will', 'never'), 1), (('will', 'never', 'reach'), 1), (('never', 'reach', 'your'), 1), (('reach', 'your', 'destination'), 1), (('your', 'destination', 'if'), 1), (('destination', 'if', 'you'), 1), (('if', 'you', 'stop'), 1), (('you', 'stop', 'and'), 1), (('stop', 'and', 'throw'), 1), (('and', 'throw', 'stones'), 1)]

Looking Glass n-grams:
[(('you', 'will', 'never', 'reach'), 1), (('will', 'never', 'reach', 'your'), 1), (('never', 'reach', 'your', 'destination'), 1), (('reach', 'your', 'destination', 'if'), 1), (('your', 'destination', 'if', 'you'), 1), (('destination', 'if', 'you', 'stop'), 1), (('if', 'you', 'stop', 'and'), 1), (('you', 'stop', 'and', 'throw'), 1), (('stop', 'and', 'throw'

<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 7. Topic Models
*Natural Language Processing*

----
We’ve touched on the idea of finding topics within a body of language. But what if the text is long and the topics aren’t obvious?

<br/>*Topic modeling* is an area of NLP dedicated to uncovering latent, or hidden, topics within a body of language.

<br/>A common technique is to deprioritize the most common words and prioritize less frequently used terms as topics in a process known as *term frequency-inverse document frequency (tf-idf)*. Say what?! This may sound counter-intuitive at first. Why would you want to give more priority to less-used words? Well, when you’re working with a lot of text, it makes a bit of sense if you don’t want your topics filled with words like “the” and “is.” The Python libraries `gensim` and `sklearn` have modules to handle tf-idf.
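
To see why common words get deprioritized, here is a minimal tf-idf sketch in plain Python. It assumes the textbook formula (term frequency times the log of total documents over documents containing the term); libraries such as `sklearn` use smoothed variants, so their numbers will differ. The three tiny documents are made up for this example.

```python
import math
from collections import Counter

documents = [
    "the squids jumped out of the suitcases".split(),
    "the squids were furious".split(),
    "the suitcases were full".split(),
]

def tf_idf(term, document, corpus):
    """Unsmoothed tf-idf of a term within one document of a corpus."""
    term_frequency = Counter(document)[term] / len(document)
    documents_with_term = sum(1 for doc in corpus if term in doc)
    if documents_with_term == 0:
        return 0.0
    inverse_document_frequency = math.log(len(corpus) / documents_with_term)
    return term_frequency * inverse_document_frequency

# Score every word of the second document against the whole corpus.
scores = {term: tf_idf(term, documents[1], documents) for term in documents[1]}
print(scores)
```

Because "the" appears in every document, its inverse document frequency is log(1) = 0 and it drops out, while "furious", which appears in only one document, scores highest.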

<br/>Whether you use your plain bag of words (which will give you term frequency) or run it through tf-idf, the next step in your topic modeling journey is often *latent Dirichlet allocation (LDA)*. LDA is a statistical model that takes your documents and determines which words keep popping up together in the same contexts (i.e., documents). The Python library `sklearn` includes an LDA implementation that can tackle this for us.

<br/>If you have any interest in visualizing your newly minted topics, *word2vec* is a great technique to have up your sleeve. word2vec can map out your topic model results spatially as vectors so that similarly used words are closer together. In the case of a language sample consisting of “The squids jumped out of the suitcases. The squids were furious. Why are your suitcases full of jumping squids?”, we might see that “suitcase”, “jump”, and “squid” were words used within similar contexts. This word-to-vector mapping is known as a *word embedding*.

<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 8. Text Similarity
*Natural Language Processing*

----
Most of us have a good autocorrect story. Our phone’s messenger quietly swaps one letter for another as we type and suddenly the meaning of our message has changed (to our horror or pleasure). Yet addressing *text similarity*, including spelling correction, is a major challenge within natural language processing.

<br/>Addressing word similarity and misspelling for spellcheck or autocorrect often involves considering the *Levenshtein distance* or minimal edit distance between two words. The distance is calculated through the minimum number of insertions, deletions, and substitutions that would need to occur for one word to become another. For example, turning “bees” into “beans” would require one substitution (“a” for “e”) and one insertion (“n”), so the Levenshtein distance would be two.
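
The calculation can be sketched with the standard dynamic-programming approach, shown here in plain Python. The `levenshtein` function is written for this example and is not NLTK's implementation; it keeps only one previous row of the edit table at a time.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions, and substitutions."""
    # previous[j] is the distance from the processed prefix of source
    # to the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (source_char != target_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(levenshtein("bees", "beans"))
```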

<br/>*Phonetic similarity* is also a major challenge within speech recognition. English-speaking humans can easily tell from context whether someone said “euthanasia” or “youth in Asia,” but it’s a far more challenging task for a machine! More advanced autocorrect and spelling correction technology additionally considers key distance on a keyboard and phonetic similarity (how much two words or phrases sound the same).

<br/>It’s also helpful to find out if texts are the same to guard against plagiarism, which we can identify through *lexical similarity* (the degree to which texts use the same vocabulary and phrases). Meanwhile, *semantic similarity* (the degree to which documents contain similar meaning or topics) is useful when you want to find (or recommend) an article or book similar to one you recently finished.
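
One simple way to put a number on lexical similarity is the Jaccard index: the share of distinct words two texts have in common. This is only one possible measure, chosen here as an assumption for illustration, and the whitespace tokenization is deliberately crude.

```python
def jaccard_similarity(text1, text2):
    """Shared distinct words divided by all distinct words across both texts."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    if not words1 and not words2:
        return 1.0  # treat two empty texts as identical
    return len(words1 & words2) / len(words1 | words2)

doc1 = "is this plagiarized"
doc2 = "maybe it's plagiarized"
similarity = jaccard_similarity(doc1, doc2)
print(similarity)
```

A score of 1.0 means the two texts use exactly the same vocabulary, and 0.0 means they share no words at all.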

<br/>*Exercise:*
<br/>Assign the variable `three_away_from_code` a word with a Levenshtein distance of 3 from “code”. Assign `two_away_from_chunk` a word with a Levenshtein distance of 2 from “chunk”.

In [2]:
import nltk
# NLTK has a built-in function
# to check Levenshtein distance:
from nltk.metrics import edit_distance

def print_levenshtein(string1, string2):
    print(f"The Levenshtein distance from '{string1}' to '{string2}' is {edit_distance(string1, string2)}!")

# Check the distance between
# any two words here!
print_levenshtein("fart", "target")

# Assign passing strings here:
three_away_from_code = "babe"

two_away_from_chunk = "dunk"

print_levenshtein("code", three_away_from_code)
print_levenshtein("chunk", two_away_from_chunk)

The Levenshtein distance from 'fart' to 'target' is 3!
The Levenshtein distance from 'code' to 'babe' is 3!
The Levenshtein distance from 'chunk' to 'dunk' is 2!


<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 9. Language Prediction & Text Generation
*Natural Language Processing*

----
How does your favorite search engine complete your search queries? How does your phone’s keyboard know what you want to type next? Language prediction is an application of NLP concerned with predicting text given preceding text. Autosuggest, autocomplete, and suggested replies are common forms of language prediction.

<br/>Your first step to language prediction is picking a language model. Bag of words alone is generally not a great model for language prediction; no matter what the preceding word was, you will just get one of the most commonly used words from your training corpus.

<br/>If you go the n-gram route, you will most likely rely on *Markov chains* to predict the statistical likelihood of each following word (or character) based on the training corpus. Markov chains are memory-less and make statistical predictions based entirely on the current n-gram on hand.

<br/>For example, let’s take a sentence beginning, “I ate so many grilled cheese”. Using a trigram model (where *n* is 3), a Markov chain would predict the following word as “sandwiches” based on the number of times the sequence “grilled cheese sandwiches” has appeared in the training data out of all the times “grilled cheese” has appeared in the training data.
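
The counting behind that prediction can be sketched as follows. The training sentences and the helper names `train_trigram_model` and `predict_next` are hypothetical, invented for this illustration.

```python
from collections import Counter, defaultdict

def train_trigram_model(tokens):
    """Map each two-word context to a count of the words that follow it."""
    followers = defaultdict(Counter)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        followers[(first, second)][third] += 1
    return followers

def predict_next(followers, context):
    """Return the most likely next word and its estimated probability."""
    counts = followers.get(tuple(context))
    if not counts:
        return None  # this context never appeared in training
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

training_tokens = (
    "i ate grilled cheese sandwiches . "
    "she made grilled cheese sandwiches . "
    "we burned grilled cheese toast ."
).split()

model = train_trigram_model(training_tokens)
prediction = predict_next(model, ["grilled", "cheese"])
print(prediction)
```

Here "grilled cheese" appears three times in the toy corpus and is followed by "sandwiches" twice, so the estimated probability is two out of three.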

<br/>A more advanced approach, using a neural language model, is the *long short-term memory (LSTM) model*. LSTMs use deep learning with a network of artificial “cells” that manage memory, making them better suited for text prediction than traditional neural networks.

<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 10. Advanced NLP Topics
*Natural Language Processing*

----
Believe it or not, you’ve just scratched the surface of natural language processing. There are a slew of advanced topics and applications of NLP, many of which rely on deep learning and neural networks.
- *Naive Bayes classifiers* are supervised machine learning algorithms that apply Bayes’ theorem, under the “naive” assumption that features are independent of one another, to make predictions and classifications. They are widely used for sentiment analysis (determining whether a given block of language expresses negative or positive feelings) and spam filtering.
- We’ve made enormous gains in *machine translation,* but even the most advanced translation software using neural networks and LSTM still has far to go in accurately translating between languages.
- Some of the most life-altering applications of NLP are focused on improving *language accessibility* for people with disabilities. Text-to-speech functionality and speech recognition have improved rapidly thanks to neural language models, making digital spaces far more accessible places.
- NLP can also be used to detect bias in writing and speech. Feel like a political candidate, book, or news source is biased but can’t put your finger on exactly how? Natural language processing can help you identify the language at issue.
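
As a taste of the first topic in the list above, here is a toy Naive Bayes sentiment classifier in plain Python. Everything in it is an assumption for illustration: the four training sentences are made up, the add-one smoothing is one common choice, and a real project would more likely reach for a library implementation.

```python
import math
from collections import Counter

# Hypothetical toy training data: (text, label) pairs.
training = [
    ("i love this happy squid", "positive"),
    ("what a great friendly squid", "positive"),
    ("i hate this angry squid", "negative"),
    ("what a terrible furious squid", "negative"),
]

word_counts = {"positive": Counter(), "negative": Counter()}
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}

def classify(text):
    """Pick the label with the highest log prior plus log likelihoods."""
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing out a label.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("a happy friendly squid"))
```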

<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 11. Challenges and Considerations
*Natural Language Processing*

----
As you’ve seen, there are a vast array of applications for NLP. However, as they say, “with great language processing comes great responsibility” (or something along those lines). When working with NLP, we have several important considerations to take into account:
- Different NLP tasks may be more or less difficult in different languages. Because so many NLP tools are built by and for English speakers, these tools may lag behind in processing other languages. The tools may also be programmed with cultural and linguistic biases specific to English speakers.
- What if your Amazon Alexa could only understand wealthy men from coastal areas of the United States? English itself is not a homogeneous body. English varies by person, by dialect, and by many sociolinguistic factors. When we build and train NLP tools, are we only building them for one type of English speaker?
- You can have the best intentions and still inadvertently program a bigoted tool. While NLP can limit bias, it can also propagate bias. As an NLP developer, it’s important to consider biases, both within your code and within the training corpus. A machine will learn the same biases you teach it, whether intentionally or unintentionally.
- As you become someone who builds tools with natural language processing, it’s vital to take into account your users’ privacy. There are many powerful NLP tools that come head-to-head with privacy concerns. Who is collecting your data? How much data is being collected and what do those companies plan to do with your data?

<img src="Images/atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 12. NLP Review
*Natural Language Processing*

----
Check out how much you’ve learned about natural language processing!
- Natural language processing combines computer science, linguistics, and artificial intelligence to enable computers to process human languages.
- NLTK is a Python library used for NLP.
- Text preprocessing is a stage of NLP focused on cleaning and preparing text for other NLP tasks.
- Parsing is an NLP technique concerned with breaking up text based on syntax.
- Language models are probabilistic machine models of language used for NLP comprehension tasks. Common models include bag-of-words, n-gram models, and neural language modeling.
- Topic modeling is the NLP process by which hidden topics are identified given a body of text.
- Text similarity is a facet of NLP concerned with semblance between instances of language.
- Language prediction is an application of NLP concerned with predicting language given preceding language.
- There are many social and ethical considerations to take into account when designing NLP tools.

<br/>You can build a lot of fun tools with NLP knowledge and a bit of Python. This is just the beginning.

<br/>Feel free to test out the plagiarism classifier we built below (does it work?) or use the space to play around with other NLP code you’ve encountered in this lesson!

In [4]:
import nltk
# Levenshtein distance:
from nltk.metrics import edit_distance

# an arbitrary plagiarism classifier:
def is_plagiarized(text1, text2):
    n = 7
    # Flag the pair when the edit distance is at most 1/n of the combined length
    threshold = (len(text1) + len(text2)) / n
    return edit_distance(text1.lower(), text2.lower()) <= threshold

doc1 = "is this plagiarized"
doc2 = "maybe it's plagiarized"

print(is_plagiarized(doc1, doc2))

False
