&copy; 2024 by Pearson Education, Inc. All Rights Reserved. The content in this notebook is based on the book [**Python for Programmers**](https://amzn.to/2VvdnxE).

In [None]:
%%html
<!-- CSS settings for this notebook -->
<style>
    h1 {color:#BB0000}
    h2 {color:purple}
    h3 {color:#0099ff}
    hr {    
        border: 0;
        height: 3px;
        background: #333;
        background-image: linear-gradient(to right, #ccc, black, #ccc);
    }
</style>

In [None]:
# enable high-res images in notebook 
%config InlineBackend.figure_format = 'retina'

In [None]:
%matplotlib inline

# 11. Natural Language Processing (NLP)

# 11.1 Introduction
* Natural language communication examples    
    * **Conversations** between people 
    * Reading/writing **text messages**
    * Learning a **foreign language**  
    * Using a **smartphone** to read menus in other languages
* NLP is performed on **text collections** (**corpora**, plural of **corpus**)
    * **Social media posts** (Tweets, Facebook posts, etc.)
    * Documents, books, news articles, movie reviews
    * And more

<hr style="height:2px; border:none; color:#000; background-color:#000;">

#### Machine Learning and Deep Learning Natural Language Applications
* **Sentiment analysis**
* **Speech synthesis** (text-to-speech)
* **Speech recognition** (speech-to-text)
* **Inter-language text-to-text and speech-to-speech translation**
* **Automatic closed captioning**
* **Bots answering natural language questions** 
* **Text summarization**
* **Text simplification**
* **Recommender systems** (“if you liked this movie, you might also like…”)
* **Classifying articles by categories**
* **Topic modeling**—finding the **topics** discussed in documents
* **Speech to sign language and vice versa**—to enable a conversation with a hearing-impaired person
* **Lip reader technology**—for people who can’t speak, convert lip movement to text or speech to enable conversation

<hr style="height:2px; border:none; color:#000; background-color:#000;">

# 11.2 [TextBlob](https://textblob.readthedocs.io/)
### Install **TextBlob**
* `conda install -c conda-forge textblob`
* Next, download the **NLTK corpora** required by TextBlob
> `ipython -m textblob.download_corpora`

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.1 Create a TextBlob—The Fundamental Class for NLP

In [None]:
from textblob import TextBlob

In [None]:
text = 'Yesterday was a beautiful day. Tomorrow looks like bad weather.'

In [None]:
blob = TextBlob(text)

In [None]:
blob

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.2 Tokenizing Text into Sentences and Words 

In [None]:
blob.sentences  # returns list of Sentence objects

In [None]:
blob.words  # returns a WordList (subclass of list) of Words; punctuation removed
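
To make the idea of tokenization concrete, here is a rough standard-library sketch. It is not how TextBlob works internally (TextBlob delegates tokenizing to NLTK), and the two regular expressions are simplifying assumptions that will mishandle abbreviations such as "Dr." and many other cases.

```python
import re

text = 'Yesterday was a beautiful day. Tomorrow looks like bad weather.'

# Naive sentence split (assumption): break after '.', '!' or '?' plus whitespace
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# Naive word split (assumption): runs of letters, digits and apostrophes,
# so punctuation is dropped, similar in spirit to blob.words
words = re.findall(r"[A-Za-z0-9']+", text)

print(sentences)
print(words)
```

For real text, prefer `blob.sentences` and `blob.words`, which handle far more edge cases than this sketch.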

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.3 Parts-of-Speech (POS) Tagging
* `nltk` [**parts-of-speech** tags](https://www.guru99.com/pos-tagging-chunking-nltk.html)

In [None]:
blob

In [None]:
blob.tags  # list of (word, part-of-speech-tag) tuples

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.4 Extracting Noun Phrases

In [None]:
blob

In [None]:
blob.noun_phrases  # WordList of Word objects 

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.5 Sentiment Analysis on `TextBlob`s and `Sentence`s
* **`polarity`** is the **sentiment** — from **`-1.0` (negative)** to **`1.0` (positive)** &mdash; **`0.0`** is **neutral**
* **`subjectivity`** &mdash; **0.0 (objective)** to **1.0 (subjective)**

In [None]:
blob

In [None]:
blob.sentiment  # Sentiment object positive/negative and objective/subjective

In [None]:
for sentence in blob.sentences:
    print(f'{sentence}\nsentiment: {sentence.sentiment}\n')
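
One common way to use a polarity value is to map it to a coarse label. The sketch below assumes a hypothetical helper named `describe` and an arbitrary threshold of `0.1`; neither is part of TextBlob. The `Sentiment` tuple here is a stand-in with made-up values so the example runs without TextBlob installed.

```python
from collections import namedtuple

# Stand-in for a (polarity, subjectivity) pair; the sample values are made up
Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])

def describe(sentiment, threshold=0.1):
    """Return a coarse label for a polarity in the range -1.0 to 1.0.

    The threshold is an illustrative assumption, not a TextBlob setting.
    """
    if sentiment.polarity > threshold:
        return 'positive'
    if sentiment.polarity < -threshold:
        return 'negative'
    return 'neutral'

for sample in (Sentiment(0.85, 1.0), Sentiment(-0.7, 0.67), Sentiment(0.0, 0.0)):
    print(f'{sample.polarity:+.2f} -> {describe(sample)}')
```

Because the helper only reads the `polarity` attribute, it should also accept `blob.sentiment` or `sentence.sentiment` directly.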

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.7 Language Detection and Translation (1 of 3)
* **Google Translate**, **Microsoft Bing Translator** and others can translate among scores of languages

**NOTE:** TextBlob's `translate` method is now deprecated. Instead, you can install the DeepL Python library: https://github.com/DeepLcom/deepl-python

> `pip install --upgrade deepl`

* You'll need an API key
* The free plan allows 500,000 characters/month
* To get a key:
> * Go to https://www.deepl.com/pro#developer
> * Click **Sign up for free**
> * Under **DeepL API Free** click **Sign up for free**
> * Specify an email/password and click **Continue**
> * Fill in the form and provide a credit card (required to prevent “fraudulent multiple registrations”), then click **Continue**
> * Read the terms and, if you agree, click **Sign up for free**
> * Click the **Account Management** link on the thank you page
> * Click the **Account** tab and scroll to **Authentication Key for DeepL API**
> * Copy your key then in the **ch11** folder create a **`keys.py`** file containing
>> `deepL_key = 'your key here'`
> * Be sure to replace the contents of the preceding string with your DeepL key


In [None]:
import keys

In [None]:
import deepl

In [None]:
translator = deepl.Translator(keys.deepL_key)

<!--**NOTE:** TextBlob translate method is now deprecated. Instead, you can install https://deep-translator.readthedocs.io/en/latest/

> `pip install -U deep_translator`


from deep_translator import GoogleTranslator

# dictionary of supported languages
GoogleTranslator().get_supported_languages(as_dict=True) 

import keys

from deep_translator import single_detection

**Sign up for a free account and get a free API key from: https://detectlanguage.com/documentation**-->

* [ISO-639-1 language codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
* [Google Translate’s supported languages](https://cloud.google.com/translate/docs/languages)

<!--
<p style="color:darkred; font-weight:bold;">NOTE: There is currently a known issue with the translation in TextBlob<br/>Google changed the parameters to their webservice. The developer is aware of it and someone has submitted a fix, but it has not yet been merged into the repository. This should be fixed in a future version. I was able to get the basics working by modifying TextBlob's translate.py file, replacing:</p>

> url = "http://translate.google.com/translate_a/t?client=webapp&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&otf=2&ssel=0&tsel=0&kc=1"

with 

>url = "http://translate.google.com/translate_a/t?client=te&format=html&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&otf=2&ssel=0&tsel=0&kc=1"

I also had to make a couple of other minor changes in the code snippets below.


This article lists several packages that support language translation: https://dev.to/kalebu/how-to-do-language-translation-in-python-1ic6


blob

blob.detect_language()  # uses Google Translate; 'en' means English

-->

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.7 Language Detection and Translation (2 of 3)

In [None]:
blob

In [None]:
# autodetect source language and translate to Spanish
spanish = translator.translate_text(blob.string, target_lang='es')

In [None]:
spanish.detected_source_lang

In [None]:
spanish.text

In [None]:
# autodetect source language and translate to Chinese
chinese = translator.translate_text(blob.string, target_lang='zh')

In [None]:
chinese.text

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.7 Language Detection and Translation (3 of 3)
* Notice **differences** in the **text translated back to English** from Spanish and Chinese 

In [None]:
blob

In [None]:
# autodetect source language and translate to English
result = translator.translate_text(spanish.text, target_lang='en-US')

In [None]:
result.detected_source_lang

In [None]:
result.text

In [None]:
# autodetect source language and translate to English
result = translator.translate_text(chinese.text, target_lang='en-US')

In [None]:
result.detected_source_lang

In [None]:
result.text
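
To spot round-trip differences programmatically rather than by eye, you could compare the original text with the back-translation using the standard library's `difflib`. In this sketch the `round_trip` string is a hypothetical stand-in, not actual DeepL output; in the notebook you would use `result.text` instead.

```python
import difflib

original = 'Yesterday was a beautiful day. Tomorrow looks like bad weather.'

# HYPOTHETICAL back-translation used only for illustration (not DeepL output)
round_trip = ('Yesterday was a beautiful day. '
              'Tomorrow looks like the weather will be bad.')

original_words = original.split()
round_trip_words = round_trip.split()

# Compare word by word instead of character by character
matcher = difflib.SequenceMatcher(None, original_words, round_trip_words)
print(f'similarity: {matcher.ratio():.2f}')

# Show only the word spans that differ between the two versions
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != 'equal':
        print(tag, original_words[i1:i2], '->', round_trip_words[j1:j2])
```

A ratio of `1.0` would mean the round trip reproduced the original word sequence exactly.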

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.9 Spell Checking and Correction (1 of 2)

In [None]:
from textblob import Word

In [None]:
word = Word('theyr')

In [None]:
%precision 2 

In [None]:
word.spellcheck()  # returns tuples of corrections and confidence values
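
TextBlob's documentation credits Peter Norvig's "How to Write a Spelling Corrector" as the basis for its spelling correction. The sketch below shows the general idea in a heavily simplified form: generate every string one edit away from the input, keep those found in a word-frequency table, and turn relative frequencies into confidence values. The tiny `frequencies` table and the function names are hypothetical, and TextBlob's actual implementation differs in its details.

```python
import string

# HYPOTHETICAL word frequencies; a real checker uses counts from a large corpus
frequencies = {'they': 50, 'their': 30, 'the': 120, 'there': 25, 'them': 20}

def edits1(word):
    """Return all strings one edit (delete, swap, replace, insert) from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def spellcheck(word):
    """Return (candidate, confidence) tuples, most likely first."""
    # Fall back to the word itself when no known candidate is one edit away
    candidates = [w for w in edits1(word) if w in frequencies] or [word]
    total = sum(frequencies.get(w, 1) for w in candidates)
    scored = [(w, frequencies.get(w, 1) / total) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(spellcheck('theyr'))
```

With this toy table, only `'they'` (one deletion) and `'their'` (one replacement) are a single edit away from `'theyr'`, so they are the only candidates returned.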

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.9 Spell Checking and Correction (2 of 2)
* `TextBlob`s, `Sentence`s and `Word`s all have a **`correct` method** 
* **Corrects spelling** using correctly spelled word with **highest confidence value**

In [None]:
word.correct()  # chooses word with the highest confidence value

In [None]:
sentence = TextBlob('Ths sentense has missplled wrds.')

In [None]:
sentence.correct() 

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.11 Word Frequencies via `word_counts` Dictionary in a `TextBlob` (1 of 2)
* **Project Gutenberg's [60,000+ free e-books](https://www.gutenberg.org)**
    * Great source of text corpora for analysis
    * Read their [Terms of Use](https://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use)
    * **Out of copyright** in the **United States** 
* We **downloaded** the **Plain Text UTF-8** version of [Shakespeare’s *Romeo and Juliet*](https://www.gutenberg.org/ebooks/1513) 
    * Saved as **`RomeoAndJuliet.txt`** 
    * **Note**: For analysis, we **removed** the **Project Gutenberg text** before and after the play in the file

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.11 Word Frequencies via `word_counts` Dictionary in a `TextBlob` (2 of 2)

In [None]:
from pathlib import Path

In [None]:
blob = TextBlob(Path('RomeoAndJuliet.txt').read_text())  # load Romeo and Juliet

* **Which word appears more often in the play&mdash;"Romeo" or "Juliet"?** 

In [None]:
blob.word_counts['juliet'] 

In [None]:
blob.word_counts['romeo']
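
The lookups above use lowercase keys because `word_counts` stores words in lowercase. As a rough standard-library analogue (an illustration, not TextBlob's implementation), `collections.Counter` can build a similar case-insensitive tally; the regular expression used to split words is a simplifying assumption.

```python
import re
from collections import Counter

sample = 'O Romeo, Romeo, wherefore art thou Romeo?'

# Lowercase first so 'Romeo' and 'romeo' share one key (assumed tokenizer)
words = re.findall(r"[a-z']+", sample.lower())
counts = Counter(words)

print(counts['romeo'])   # number of occurrences of 'romeo' in the sample
print(counts['juliet'])  # a missing word yields 0 rather than a KeyError
```

Like `word_counts`, a `Counter` returns `0` for words that never appear, so no existence check is needed before the lookup.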

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.13 Deleting Stop Words (1 of 2)
* Less significant words&mdash;like "a", "an", "the", pronouns, etc.&mdash;that are often removed before text analysis
* Returned by [NLTK **`stopwords` module's `words` function**](https://www.nltk.org/book/ch02.html)

In [None]:
import nltk

In [None]:
nltk.download('stopwords')  # must download before first use

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.2.13 Deleting Stop Words (2 of 2)

In [None]:
from nltk.corpus import stopwords

In [None]:
stops = stopwords.words('english')  # load the english list

In [None]:
%pprint

In [None]:
stops

In [None]:
blob = TextBlob('Today is a beautiful day.')

In [None]:
# keep anything that's not a stop word
[word for word in blob.words.lower() if word not in stops]  

[List comprehensions presentation in my **Python Fundamentals videos**](https://learning.oreilly.com/videos/python-fundamentals/9780135917411/9780135917411-PFLL_Lesson05_11)
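
The same filtering pattern can be tried without downloading anything. This sketch assumes a small, hand-picked stop list named `sample_stops` (NLTK's English list is much longer) and a naive way of stripping punctuation.

```python
# SMALL, HAND-PICKED stop list for illustration only (not NLTK's list)
sample_stops = {'a', 'an', 'the', 'is', 'are', 'was', 'of', 'to', 'and'}

sentence = 'Today is a beautiful day.'

# Naive normalization (assumption): split on whitespace, trim punctuation
words = [word.strip('.,!?').lower() for word in sentence.split()]

# Keep anything that's not a stop word
kept = [word for word in words if word not in sample_stops]
print(kept)
```

The stop list here is a `set`, which gives constant-time membership tests. For a long text, converting NLTK's list with `set(stopwords.words('english'))` may speed up the comprehension noticeably.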

<hr style="height:2px; border:none; color:#000; background-color:#000;">

# 11.3 Visualizing Word Frequencies with Bar Charts and Word Clouds (1 of 4)

In [None]:
blob = TextBlob(Path('RomeoAndJuliet.txt').read_text())  # load play

* Eliminate stopwords
* `item[0]` is the **word in each tuple** returned by `blob.word_counts.items()`

In [None]:
items = blob.word_counts.items()  # view of (word, frequency) tuples 

In [None]:
items = [item for item in items if item[0] not in stops]

In [None]:
items[:5]

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Sorting the Top 20 Words in Descending Order by Frequency (2 of 4)

In [None]:
from operator import itemgetter  # used to specify tuple element to sort by

In [None]:
sorted_items = sorted(items, key=itemgetter(1), reverse=True)  # descending

* **`key=itemgetter(1)`**&mdash;sort tuples by **frequency** (each tuple's element `1`)

In [None]:
top20 = sorted_items[0:20]

In [None]:
top20
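
As a small self-contained illustration of sorting tuples with `itemgetter`, the sketch below uses made-up `(word, count)` pairs rather than counts from the play.

```python
from operator import itemgetter

# MADE-UP (word, count) tuples for illustration
pairs = [('night', 3), ('love', 7), ('death', 5), ('lady', 7)]

# itemgetter(1) picks element 1 (the count) of each tuple as the sort key,
# just like lambda item: item[1]
by_count = sorted(pairs, key=itemgetter(1), reverse=True)

top3 = by_count[:3]  # slice off the highest-frequency entries
print(top3)
```

Python's sort is stable, even with `reverse=True`, so words with equal counts (`'love'` and `'lady'` here) keep their original relative order.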

In [None]:
%pprint

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Convert top20 to a `DataFrame` for Visualization (3 of 4)
* **pandas library** used frequently in later case studies 

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(top20, columns=['word', 'count'])  

In [None]:
df

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Visualizing the `DataFrame` (4 of 4)
* **`bar` method** of the `DataFrame`’s **`plot` property** creates and displays a **Matplotlib bar chart**

In [None]:
import matplotlib.pyplot as plt

In [None]:
axes = df.plot.bar(x='word', y='count')
plt.gcf().tight_layout()  # compress chart to ensure all components fit 

<hr style="height:2px; border:none; color:#000; background-color:#000;">

## 11.3.2 Visualizing the Top 200 Words in **Romeo and Juliet** as a Word Cloud (1 of 4)
* `conda install -c conda-forge wordcloud`
* Created by **Andreas Mueller**&mdash;author of [**"Introduction to Machine Learning with Python"**](https://amzn.to/2JTBKOp) and core developer of **scikit-learn machine-learning library**
    
### Loading the Text

In [None]:
text = Path('RomeoAndJuliet.txt').read_text()

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Loading the Mask Image that Specifies the Word Cloud’s Shape (2 of 4)
* [**`wordcloud` module’s**](https://github.com/amueller/word_cloud) **`WordCloud` class** uses **matplotlib** under the hood 
* Fills non-white areas of a **mask image** with text
* Load the mask using **`imageio` module's `imread` function** 

In [None]:
import imageio  # bundled with Anaconda

In [None]:
mask_image = imageio.v3.imread('mask_heart.png')  # returns NumPy array of image's data

[NumPy discussed in Lesson 7, Array-Oriented Programming, of my **Python Fundamentals LiveLessons** videos](https://learning.oreilly.com/videos/python-fundamentals/9780135917411/9780135917411-PFLL_Lesson07_00)

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Configuring the WordCloud Object (3 of 4)

In [None]:
from wordcloud import WordCloud   

In [None]:
wordcloud = WordCloud(
    colormap='prism', mask=mask_image, background_color='white')

* `WordCloud` assigns **random colors** from a **color map**
* [Matplotlib’s named color maps](https://matplotlib.org/examples/color/colormaps_reference.html)
* [`WordCloud`’s keyword arguments and their default values](http://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html)

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Generating the Word Cloud, Saving It and Displaying It (4 of 4)

In [None]:
wordcloud = wordcloud.generate(text)  

* Removes stop words
* Calculates the word frequencies
* Uses up to **200 words** by default
    * **`max_words` keyword argument** can specify any number
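
The three steps listed above can be approximated with the standard library. This is only a sketch under stated assumptions: the function name `top_relative_frequencies` is hypothetical, and `WordCloud`'s real processing does more (for example, it can also merge plurals and detect two-word phrases).

```python
from collections import Counter

def top_relative_frequencies(words, stop_words, max_words=200):
    """Drop stop words, count the rest, keep the max_words most frequent,
    and scale the counts so the most frequent word is 1.0."""
    counts = Counter(word for word in words if word not in stop_words)
    top = counts.most_common(max_words)
    if not top:
        return {}
    highest = top[0][1]  # count of the most frequent word
    return {word: count / highest for word, count in top}

# Made-up sample text for illustration
words = 'love is love and night is night and love is all'.split()
print(top_relative_frequencies(words, {'is', 'and'}, max_words=2))
```

Scaling relative to the most frequent word is one plausible way to decide how large each word should be drawn; lowering `max_words` simply trims the tail of rarer words.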

In [None]:
wordcloud = wordcloud.to_file('RomeoAndJulietHeart.png')

In [None]:
from IPython.display import Image
Image(filename='RomeoAndJulietHeart.png', width=400)

<hr style="height:2px; border:none; color:#000; background-color:#000;">

# 11.5 Named Entity Recognition with [**spaCy**](https://spacy.io/) (1 of 4)
* Attempts to **locate and categorize items** that can help **determine what a text is about**
    * **dates**, **times**, **quantities**, **places**, **people**, **things**, **organizations** and more 
* [spaCy Quickstart guide](https://spacy.io/usage/models#section-quickstart)
* `conda install -c conda-forge spacy`
* Download spaCy's **English (`en_core_web_sm`) model** for processing text  
>```
>ipython -m spacy download en_core_web_sm
>```

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Loading the Language Model with the `spacy` Module’s **`load` Function** (2 of 4)

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')  # loads the English model downloaded above

* spaCy docs recommend the **variable name `nlp`**.

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Creating a spaCy Doc (3 of 4)
* Use the **`nlp` object** to create a [**spaCy `Doc`** object](https://spacy.io/api/doc) representing the **document** to process. 

In [None]:
document = nlp('In 1994, Tim Berners-Lee founded the ' + 
    'World Wide Web Consortium which is devoted to ' +
    'developing web technologies')

<hr style="height:2px; border:none; color:#000; background-color:#000;">

### Getting the Named Entities Via a `Doc`’s **`ents` Property** (4 of 4)
* Returns a tuple of spaCy **`Span`** objects representing the **named entities** 
* [**`Span`** properties](https://spacy.io/api/span)
* Display **`text`** (the **entity's text**) and **`label_`** (the **kind of entity**)

In [None]:
for entity in document.ents:
    print(f'{entity.text}: {entity.label_}')

<hr style="height:2px; border:none; color:#000; background-color:#000;">

# 11.7 Other NLP Libraries and Tools 
[See this section on O'Reilly](https://learning.oreilly.com/library/view/Python+for+Programmers,+First+Edition/9780135231364/ch11.xhtml#ch11lev1sec7)

<!--
Additional mostly free and open source NLP libraries and APIs: 
* **Gensim**—**Similarity detection** and **topic modeling**
* **Google Cloud Natural Language API**—Cloud-based API for NLP tasks such as **named entity recognition**, **sentiment analysis**, **parts-of-speech analysis and visualization**, **determining content categories** and more
* **Microsoft Linguistic Analysis API**
* **Bing sentiment analysis**—**Microsoft’s Bing search engine** now uses **sentiment** in its **search results**
* **PyTorch NLP**—**Deep learning library** for **NLP**  
* **Stanford CoreNLP**—A **Java NLP library**, which also provides a **Python wrapper**. Includes **coreference resolution**, which finds all references to the same thing.
* **Apache OpenNLP**—Another **Java-based NLP library** for common tasks, including **coreference resolution**. **Python wrappers** are available.
* **PyNLPl** (pineapple)—**Python NLP library** 
* **SnowNLP**—**Python library** that simplifies **Chinese text processing**
* **KoNLPy**—**Korean language NLP**
* **`stop-words`**—**Python library** with **stop words for many languages**. We used NLTK’s stop words lists in this chapter. 
* **`TextRazor`**—A **paid cloud-based NLP API** that provides a **free tier**
-->

# 11.9 Natural Language Datasets 
[See this section on O'Reilly](https://learning.oreilly.com/library/view/python-for-programmers/9780135231364/ch11.xhtml#ch11lev1sec9)

<!--
* **Social media posts**&mdash;via APIs like the Twitter API we'll demonstrate next.
* **Wikipedia**—some or all of Wikipedia (`https://meta.wikimedia.org/wiki/Datasets`).
* **IMDB (Internet Movie Database)**—various **movie and TV datasets** are available.
* **UCI's text datasets**—many datasets, including the **Spambase** dataset.
* **Jeopardy! dataset**—200,000+ questions from the Jeopardy! TV show. A milestone in AI occurred in 2011 when IBM Watson famously beat two of the world’s best Jeopardy! players.
* [**Natural language processing datasets**](https://machinelearningmastery.com/datasets-natural-language-processing/)
* [**NLTK data**](https://www.nltk.org/data.html)
* **Sentiment labeled sentences data set** (from sources including **IMDB.com**, **amazon.com**, **yelp.com**) 
* [**Registry of Open Data on AWS**](https://registry.opendata.aws)—a searchable directory of **datasets hosted on Amazon Web Services**.
* [**Amazon Customer Reviews Dataset**](https://registry.opendata.aws/amazon-reviews/)—130+ million product reviews.
* and many more!-->

<hr style="height:2px; border:none; color:#000; background-color:#000;">

# More Info 
* See Lesson 11 in [**Python Fundamentals LiveLessons** here on O'Reilly Online Learning](https://learning.oreilly.com/videos/python-fundamentals/9780135917411)
* See Chapter 11 in [**Python for Programmers** on O'Reilly Online Learning](https://learning.oreilly.com/library/view/python-for-programmers/9780135231364/)
* See Chapter 12 in [**Intro Python for Computer Science and Data Science** on O'Reilly Online Learning](https://learning.oreilly.com/library/view/intro-to-python/9780135404799/)
* Interested in a print book? Check out:

| Python for Programmers<br>(640-page professional book) | Intro to Python for Computer<br>Science and Data Science<br>(880-page college textbook)
| :------ | :------
| <a href="https://amzn.to/2VvdnxE"><img alt="Python for Programmers cover" src="../images/PyFPCover.png" width="150" border="1"/></a> | <a href="https://amzn.to/2LiDCmt"><img alt="Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud" src="../images/IntroToPythonCover.png" width="159" border="1"></a>

>Please **do not** purchase both books&mdash;_Python for Programmers_ is a subset of _Intro to Python for Computer Science and Data Science_

&copy; Copyright 1992-2024 by Pearson Education, Inc. All Rights Reserved. The content in this notebook is based on the book [**Python for Programmers**](https://amzn.to/2VvdnxE).