# NATURAL LANGUAGE PROCESSING WITH TEXTACY & SPACY

__SpaCy__ is a high-performance NLP library that handles many common NLP tasks quickly and with little code. Let us explore another library built on top of __SpaCy__ called __TextaCy__.

## TEXTACY
+ Textacy is a Python library for performing higher-level natural language processing (NLP) tasks,
built on the high-performance Spacy library.
+ Textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text.
+ Uses
    + Text preprocessing
    + Keyword in Context
    + Topic modeling
    + Information Extraction
    + Keyterm extraction
    + Text and Readability statistics
    + Emotional valence analysis
    + Quotation attribution

### INSTALLATION
You can install textacy using `pip install textacy` or `conda install -c conda-forge textacy`.
NB: If you have trouble installing on Windows, use conda instead of pip.

### Downloading Dataset
You can use the following command to download the `capitol_words` dataset, which we will use in this tutorial.
`python -m textacy download capitol_words`

<!--### FOR LANGUAGE DETECTION
You can either use `pip install textacy[lang]` or `pip install cld2-cffi` to install the required language pack for textacy.

__NOTE__: All required the package, dependencies, and add-on packs are pre-installed for this tutorial.-->

## Getting Started

In [None]:
!pip install textacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textacy
  Downloading textacy-0.13.0-py3-none-any.whl (210 kB)
Collecting cytoolz>=0.10.1 (from textacy)
  Downloading cytoolz-0.12.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting floret~=0.10.0 (from textacy)
  Downloading floret-0.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (314 kB)
Collecting jellyfish>=0.8.0 (from textacy)
  Downloading jellyfish-0.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

In [None]:
# Loading Packages
import textacy

In [None]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, \
built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing \
— offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, \
and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, \
and more."

In [None]:
example

'Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more.'

## TEXT PREPROCESSING WITH TEXTACY
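The following functions can be used to preprocess your text data. In recent releases (such as the 0.13.0 installed above) they live in the `textacy.preprocessing` module:

+ `textacy.preprocessing.remove`: strips content such as
    + Punctuation
+ `textacy.preprocessing.replace`: substitutes a placeholder for
    + URLs
    + Phone numbers
    + Currency symbols
    + Emails
+ Lowercasing can be done with Python's built-in `str.lower()`.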


In [None]:
raw_text = """ The best programs, are the ones written when the programmer is supposed to be working on something else.\
Mike bought the book for $50 although in Paris it will cost $30 dollars. Don’t document the problem, \
fix it.This is from https://twitter.com/codewisdom?lang=en. """

In [None]:
# Replacing URLs with a placeholder
from textacy import preprocessing
processed_text = preprocessing.replace.urls(raw_text, repl='TWITTER')
processed_text

' The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for $50 although in Paris it will cost $30 dollars. Don’t document the problem, fix it.This is from TWITTER '

In [None]:
# Removing Punctuation

processed_text = preprocessing.remove.punctuation(processed_text)
processed_text

' The best programs  are the ones written when the programmer is supposed to be working on something else Mike bought the book for $50 although in Paris it will cost $30 dollars  Don t document the problem  fix it This is from TWITTER '

In [None]:
# Replacing Currency Symbols
processed_text = preprocessing.replace.currency_symbols(processed_text,repl='USD')
processed_text

' The best programs  are the ones written when the programmer is supposed to be working on something else Mike bought the book for USD50 although in Paris it will cost USD30 dollars  Don t document the problem  fix it This is from TWITTER '

Notice that we reassigned the variable `processed_text` in every cell above? That is because text preprocessing is usually a pipeline with multiple steps. Here are the steps we completed above, in order:
+ Replacing URLs
+ Removing punctuation
+ Replacing currency symbols

We use the variable to pass the text from each step to the next.
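The idea of passing text from one step to the next can be sketched without textacy at all. The following is a minimal, standard-library illustration, not textacy's implementation: the helper names (`replace_urls`, `remove_punctuation`, `replace_currency_symbols`, `make_pipeline`) and the regular expressions are assumptions chosen for this example and are much cruder than the library's own patterns.

```python
import re
from functools import reduce


def replace_urls(text, repl="_URL_"):
    # Very rough URL pattern: "http://" or "https://" followed by non-space characters
    return re.sub(r"https?://\S+", repl, text)


def remove_punctuation(text):
    # Replace anything that is not a word character, whitespace, or "$" with a space
    return re.sub(r"[^\w\s$]", " ", text)


def replace_currency_symbols(text, repl="USD"):
    # Only a handful of symbols are handled in this sketch
    return re.sub(r"[$€£]", repl, text)


def make_pipeline(*funcs):
    # Compose the steps so that the output of each one feeds the next
    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline


preprocess = make_pipeline(replace_urls, remove_punctuation, replace_currency_symbols)
sample = "It costs $50, see https://example.com/page?x=1 now."
cleaned = preprocess(sample)
print(cleaned)
```

Composing the steps into one callable gives the same result as reassigning a variable cell by cell, but it keeps the order of the steps in a single place.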

There are many more text preprocessing steps. [Here](https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html) is a good summary of these steps.


Refer to the [docs](https://textacy.readthedocs.io/en/0.10.1/api_reference/text_processing.html) for more details on textacy's preprocessing functions.

### READING A TEXT OR A DOCUMENT
+ `textacy.make_spacy_doc(your_text, lang=...)`
+ `textacy.io.read_text(your_file_path)`

Removing URLs or punctuation alone would not make textacy very interesting; the more advanced analyses all require the text to be converted into a structured `Doc` first.

TextaCy/SpaCy uses a `Doc` as the container for a processed text. [Here](https://textacy.readthedocs.io/en/0.10.1/api_reference/lang_doc_corpus.html) is the documentation of the `Doc` object.



In [None]:
import spacy.cli
spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
# With SpacyDoc
# Requires Language Pkg Model
# nlp = spacy.load("en_core_web_sm")
docx_textacy = textacy.make_spacy_doc(example, lang='en_core_web_sm')

In [None]:
docx_textacy

Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more.

We can look at the `type` of the `docx_textacy` object.

In [None]:
type(docx_textacy)

spacy.tokens.doc.Doc

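The following code reads a `Doc` from a local file. (`textacy.Doc` existed in older releases; with the textacy 0.13.0 installed above, use `textacy.make_spacy_doc` as in the cell above.)

1. Use the `.read()` method
`file_textacy = textacy.make_spacy_doc(open("example.txt").read(), lang="en_core_web_sm")`

2. Create a generator
`file_textacy2 = textacy.io.read_text('example.txt', lines=True)`

then:

`for text in file_textacy2:`

    `docx_file = textacy.make_spacy_doc(text, lang="en_core_web_sm")`

    `print(docx_file)`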

### Advanced Text Analytics

1. Named-Entity Recognition

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. [Source: [Wikipedia](https://en.wikipedia.org/wiki/Named-entity_recognition)]

TextaCy has a built-in method for NER.

In [None]:
# Using Textacy Named Entity Extraction
list(textacy.extract.entities(docx_textacy))

[Textacy, NLP, Spacy]

2. n-grams

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

TextaCy has a built-in method for n-grams.

Counting n-grams is an important way to quantify text data; with n = 1 (single words) it yields the familiar __Bag-of-Words__ representation.

In [None]:
# NGrams with Textacy
# NB SpaCy method would be to use noun Phrases
# Tri Grams

list(textacy.extract.ngrams(docx_textacy,3))

[library for performing,
 level natural language,
 natural language processing,
 performance Spacy library,
 focuses on tasks,
 availability of tokenized,
 emotional valence analysis]
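To make the definition concrete, here is a small library-free sketch of word n-grams built by sliding a window over a token list. The function name `word_ngrams` is hypothetical, and unlike the textacy call above this sketch does no filtering of stop words or punctuation.

```python
def word_ngrams(tokens, n):
    # Slide a window of size n over the tokens; each window is one n-gram
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "natural language processing with textacy".split()
trigrams = word_ngrams(tokens, 3)
for gram in trigrams:
    print(" ".join(gram))
```

A sequence of `k` tokens yields `k - n + 1` n-grams, so the five tokens here give three trigrams.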

3. text statistics
This usually includes computing basic counts and various readability statistics.

In [None]:
from textacy.text_stats import TextStats

In [None]:
ts = TextStats(docx_textacy)



In [None]:
# Number of words
ts.n_words

60

In [None]:
# Basic counts of unique words
ts.n_unique_words

51

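In [None]:
# readability scores
# `ts.flesch_kincaid_grade_level` raises an AttributeError on textacy 0.13.0; use the `readability` submodule
from textacy.text_stats import readability
readability.flesch_kincaid_grade_level(docx_textacy)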

Some of these basic counts and readability stats seem intimidating. Feel free to Google them to better understand them.
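As one example of what a readability statistic computes, the Flesch-Kincaid grade level is a linear combination of average sentence length and average syllables per word. The sketch below implements the standard published formula directly from raw counts. The syllable count is a made-up number for illustration; the word count matches the 60 reported above and the sentence count is the two sentences of the example text.

```python
def flesch_kincaid_grade_level(n_words, n_sents, n_syllables):
    # Standard formula: 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59


# n_syllables=120 is a hypothetical value, not measured from the example text
grade = flesch_kincaid_grade_level(n_words=60, n_sents=2, n_syllables=120)
print(round(grade, 2))
```

Longer sentences and longer words both push the score up, which is why dense technical prose tends to score at a high grade level.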

4. Dealing with a collection of documents (corpus)

Many NLP tasks require datasets comprised of a large number of texts, which are often stored on disk in one or multiple files. textacy makes it easy to efficiently stream text and (text, metadata) pairs from disk, regardless of the format or compression of the data.

In [None]:
import textacy.datasets  # note the import
ds = textacy.datasets.CapitolWords()
ds.download()
records = ds.records(speaker_name={"Hillary Clinton", "Barack Obama"})
next(records)

100%|██████████| 11.9M/11.9M [00:00<00:00, 62.1MB/s]


Record(text='I yield myself 15 minutes of the time controlled by the Democrats.', meta={'date': '2001-02-13', 'congress': 107, 'speaker_name': 'Hillary Clinton', 'speaker_party': 'D', 'title': 'MORNING BUSINESS', 'chamber': 'Senate'})
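Conceptually, filtering records by metadata is a lazy scan over (text, metadata) pairs. The sketch below mimics that pattern with the standard library only; the `Record` tuple, the `iter_records` function, and the sample data are all hypothetical stand-ins, not textacy's dataset code.

```python
from collections import namedtuple
from itertools import islice

Record = namedtuple("Record", ["text", "meta"])

# Hypothetical in-memory data standing in for records streamed from disk
DATA = [
    Record("First speech.", {"speaker_name": "Speaker A", "date": "2001-02-13"}),
    Record("Second speech.", {"speaker_name": "Speaker B", "date": "2001-02-14"}),
    Record("Third speech.", {"speaker_name": "Speaker A", "date": "2001-02-15"}),
]


def iter_records(data, speaker_name=None, limit=None):
    """Lazily yield records, optionally filtered by speaker and capped at `limit`."""
    # Accept either a single name or a set of names
    names = {speaker_name} if isinstance(speaker_name, str) else speaker_name
    matches = (r for r in data if names is None or r.meta["speaker_name"] in names)
    return islice(matches, limit)


for text, meta in iter_records(DATA, speaker_name="Speaker A", limit=1):
    print(meta["date"], text)
```

Because the result is an iterator, nothing is read or filtered until you loop over it or call `next()`, which is what keeps this approach cheap for large datasets.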

A `textacy.Corpus` is an ordered collection of spaCy `Doc`s, all processed by the same language pipeline. Let’s continue with the Capitol Words dataset and make a corpus from a stream of records. (Note: This may take a few minutes.)

In [None]:
cw = textacy.datasets.CapitolWords()
records = cw.records(limit=100)
spacy_lang = textacy.load_spacy_lang("en_core_web_sm", disable=("parser",))
corpus = textacy.Corpus(spacy_lang, data=records)
print(corpus)

Corpus(100 docs, 70562 tokens)


Say we only want the speeches from Bernie Sanders:

In [None]:
for text, meta in ds.records(speaker_name="Bernie Sanders", limit=3):
    print("\n{}, {}\n{}".format(meta["title"], meta["date"], text))


JOIN THE SENATE AND PASS A CONTINUING RESOLUTION, 1996-01-04
Mr. Speaker, 480,000 Federal employees are working without pay, a form of involuntary servitude; 280,000 Federal employees are not working, and they will be paid. Virtually all of these workers have mortgages to pay, children to feed, and financial obligations to meet.
Mr. Speaker, what is happening to these workers is immoral, is wrong, and must be rectified immediately. Newt Gingrich and the Republican leadership must not continue to hold the House and the American people hostage while they push their disastrous 7-year balanced budget plan. The gentleman from Georgia, Mr. Gingrich, and the Republican leadership must join Senator Dole and the entire Senate and pass a continuing resolution now, now to reopen Government.
Mr. Speaker, that is what the American people want, that is what they need, and that is what this body must do.

DISPOSING OF SENATE AMENDMENT TO H.R. 1643, EXTENSION OF MOST-FAVORED- NATION TREATMENT FOR BUL

`Corpus` is a list-like object that can be iterated over; each element is a spaCy `Doc` object. This means you can index, slice, and filter it to fit your specific use case.

We can slice it just as we slice any list in Python:

In [None]:
# any element
corpus[0]

Mr. Speaker, 480,000 Federal employees are working without pay, a form of involuntary servitude; 280,000 Federal employees are not working, and they will be paid. Virtually all of these workers have mortgages to pay, children to feed, and financial obligations to meet.
Mr. Speaker, what is happening to these workers is immoral, is wrong, and must be rectified immediately. Newt Gingrich and the Republican leadership must not continue to hold the House and the American people hostage while they push their disastrous 7-year balanced budget plan. The gentleman from Georgia, Mr. Gingrich, and the Republican leadership must join Senator Dole and the entire Senate and pass a continuing resolution now, now to reopen Government.
Mr. Speaker, that is what the American people want, that is what they need, and that is what this body must do.

In [None]:
# a sub-list
[doc for doc in corpus[:3]]

[Mr. Speaker, 480,000 Federal employees are working without pay, a form of involuntary servitude; 280,000 Federal employees are not working, and they will be paid. Virtually all of these workers have mortgages to pay, children to feed, and financial obligations to meet.
 Mr. Speaker, what is happening to these workers is immoral, is wrong, and must be rectified immediately. Newt Gingrich and the Republican leadership must not continue to hold the House and the American people hostage while they push their disastrous 7-year balanced budget plan. The gentleman from Georgia, Mr. Gingrich, and the Republican leadership must join Senator Dole and the entire Senate and pass a continuing resolution now, now to reopen Government.
 Mr. Speaker, that is what the American people want, that is what they need, and that is what this body must do.,
 Mr. Speaker, a relationship, to work and survive, has got to be honest and we have got to deal with each other in good faith. For a government to govern 

In [None]:
# You can delete elements from `corpus`
del corpus[:10]
corpus

<textacy.corpus.Corpus at 0x7f07934e2620>

We can also get basic statistics of the `corpus` object. (The sentence count is 0 here because the parser was disabled when the pipeline was loaded.)

In [None]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens

(90, 0, 67217)

In [None]:
# Word Counts - as a dictionary
counts = corpus.word_counts()
# keys are hash IDs of lemmas; passing by="lemma_" should give string keys instead

In [None]:
# we can get the top-5 frequent words from the `counts` dictionary
sorted(counts.items(), key=lambda x: x[1], reverse=True)[:5]

[(8004577259940138793, 243),
 (15275761157247972012, 240),
 (7593739049417968140, 238),
 (14889849580704678361, 210),
 (8021635565988888124, 209)]

We also introduce a new text metric called __term frequency - inverse document frequency__ (`tf-idf`).

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.[Source: Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

Term frequency (`tf`) in `tf-idf` is the word frequency we get from the dictionary above (`counts`). Now we need to calculate the `idf` part.

The inverse document frequency is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

\begin{equation*}
\mathrm{idf}(t, D) = \log\left(\frac{N_D}{N_t}\right)
\end{equation*}

Here, $N_D$ is the number of documents (`doc`) in the corpus $D$, and $N_t$ is the number of `doc`s in $D$ containing the term $t$.
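A tiny worked example may help before turning to the library. The sketch below applies the unsmoothed formula above to a made-up four-document corpus; the function name `idf_score` and the documents are hypothetical. Library implementations may differ slightly: the largest value in the textacy output further down, about 4.51 for 90 documents, is close to ln(91), which would be consistent with a smoothed variant of this ratio.

```python
import math


def idf_score(term, docs):
    # N_D: total number of documents; N_t: number of documents containing the term
    n_docs = len(docs)
    n_containing = sum(1 for doc in docs if term in doc)
    # Note: raises ZeroDivisionError for a term that appears in no document
    return math.log(n_docs / n_containing)


# Each document is represented as the set of its (lowercased) words
docs = [
    {"the", "bill", "passed"},
    {"the", "senate", "voted"},
    {"the", "bill", "failed"},
    {"the", "house", "adjourned"},
]

for term in ("the", "bill", "senate"):
    print(term, round(idf_score(term, docs), 3))
```

A word that appears in every document ("the") gets an idf of 0, while a word found in a single document ("senate") gets the highest value, log(4).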

Looks complicated, right? Fortunately we can use TextaCy's built-in methods to calculate `idf`.

In [None]:
idf = corpus.word_doc_counts(weighting='idf')

In [None]:
sorted(idf.items(), key=lambda x: x[1], reverse=True)[:5]

[(15689657794018624313, 4.51085950651685),
 (10480242113835854118, 4.51085950651685),
 (9211202706150281085, 4.51085950651685),
 (5971168095749524238, 4.51085950651685),
 (17101725512904321536, 4.51085950651685)]

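Now that we have __tf__ as the dict `counts` and __idf__ as the dict `idf`, we can compute the complete __tf-idf__ metric by multiplying the two values for each term.

In [None]:
# tf-idf is the product of term frequency and inverse document frequency
tf_idf = {k: counts[k] * idf[k] for k in idf.keys() & counts.keys()}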
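The effect of combining the two dictionaries is easiest to see with small numbers. In this sketch the counts and idf values are invented for illustration; tf-idf is taken as the product of the two, computed over the terms present in both dictionaries.

```python
# Hypothetical term counts (tf) and idf values for three terms
toy_counts = {"the": 120, "bill": 15, "senate": 6}
toy_idf = {"the": 0.0, "bill": 0.69, "senate": 1.39}

# Multiply tf by idf for every term that appears in both dictionaries
toy_tf_idf = {t: toy_counts[t] * toy_idf[t] for t in toy_counts.keys() & toy_idf.keys()}

# Rank terms from highest to lowest tf-idf
ranked = sorted(toy_tf_idf.items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranked:
    print(term, round(score, 2))
```

Although "the" is by far the most frequent term, its idf of 0 drops it to the bottom of the ranking, which is exactly the adjustment tf-idf is meant to make.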

### YOUR TURN HERE

Please print out the top-20 terms with the highest `tf-idf` values.

In [None]:
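#### Complete your code here
# Keep the 20 terms with the highest tf-idf; keys are hash IDs, so map them back to strings
store = sorted(tf_idf.items(), key=lambda x: x[1], reverse=True)[:20]
[(spacy_lang.vocab.strings[k], v) for k, v in store]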

Some of the above results make sense, such as 'president', 'bill', and 'act'. But pronouns (such as 'you' or 'I', which older spaCy versions lemmatize to '-PRON-') and tokens like "'s" are not informative. The same goes for words such as 'a', 'an', 'the', ...

These words are called __stop words__. In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search. [Source: Wikipedia](https://en.wikipedia.org/wiki/Stop_words).
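Stop-word removal itself is a simple membership test. The sketch below uses a deliberately tiny, hand-written stop list, which is an assumption made for this example; real tools ship much longer lists, and as noted above no two tools agree on the exact contents.

```python
# Hypothetical, deliberately tiny stop list for illustration only
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "to", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that "The" and "the" are treated alike
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "The Senate voted on the bill".split()
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

Storing the stop list as a set keeps each lookup constant-time, which matters once the list holds hundreds of words and the corpus holds millions of tokens.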

We can check whether a word (a.k.a. token) is a stop word by using the `is_stop` attribute of a spaCy `Token` object.


In [None]:
my_doc = corpus[0]

for token in my_doc:
  print(token, token.is_stop)

### YOUR TURN HERE

Remove all stop words from `my_doc`.

__HINT__: use an `if` statement (with the `is_stop` attribute) in the `for` loop.

In [None]:
#### Complete your code here
filtered_doc = []
for word in my_doc:
    if not word.is_stop:
        filtered_doc.append(word.text)


In [None]:
filtered_doc[:10]

['Mr.',
 'Speaker',
 ',',
 'unavoidably',
 'absent',
 'votes',
 'default',
 'legislation',
 '.',
 'present']

Sometimes we only care about __word tokens__, which means we need to filter out _numbers_, _punctuation_, and so forth.

SpaCy tokens provide an `is_alpha` attribute for that purpose.

In [None]:
for token in my_doc:
    print(token, token.is_alpha)

### YOUR TURN HERE

What if we want only the tokens that are both non-stop words and word tokens?

__HINT__: combine the two steps above.

In [None]:
#### Complete your code here
filtered_tokens = []
for token in my_doc:
    # keep only tokens that are alphabetic and not stop words
    if not token.is_stop and token.is_alpha:
        filtered_tokens.append(token)

# Join the filtered tokens back into a string
filtered_doc = " ".join([token.text for token in filtered_tokens])

In [None]:
filtered_doc

'Speaker unavoidably absent votes default legislation present voted nay motions table appeal ruling Chair regards resolutions offered Gephardt rollcall Jackson Lee rollcall voted nay ordering previous question House Resolution rollcall voted nay Con Res rollcall voted yea rollcall'

From the results, have you noticed that different forms of the same token may appear in the text? For instance, 'run', 'ran', and 'running' are all different forms of the root word 'run'. Counting them as different words may bias any subsequent model. Thus, it would be ideal to map the different forms of the same word to its root. This process is called __lemmatization__.

_Lemmatization_ usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the _lemma_. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source. [Source: Stanford NLP Group](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
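The contrast between the two approaches can be shown with toy versions of each. Both functions below are illustrative assumptions: `naive_stem` is a crude suffix stripper (far simpler than real stemmers such as Porter's), and `toy_lemmatize` is a lookup in a five-entry hand-written table rather than a real morphological analysis.

```python
def naive_stem(word):
    # Chop the first matching suffix, as long as a few characters remain
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# Hypothetical lookup table standing in for a real vocabulary
LEMMA_TABLE = {"ran": "run", "running": "run", "was": "be", "votes": "vote", "voted": "vote"}


def toy_lemmatize(word):
    # Return the dictionary form if known, otherwise the word unchanged
    return LEMMA_TABLE.get(word.lower(), word)


for word in ("running", "ran", "voted", "votes", "was"):
    print(word, naive_stem(word), toy_lemmatize(word))
```

The stemmer produces fragments such as "runn" and "vot" and cannot relate "ran" to "run", whereas the lookup returns valid dictionary forms, at the cost of needing a vocabulary.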

SpaCy provides a `lemma_` attribute on each `token` object for exactly this purpose.

In [None]:
for token in my_doc:
    print(token, token.lemma_)

Mr. Mr.
Speaker Speaker
, ,
I I
was be
unavoidably unavoidably
absent absent
during during
the the
votes vote
on on
default default
legislation legislation
. .
If if
I I
had have
been be
present present
, ,
I I
would would
have have
voted vote
" "
nay nay
" "
on on
the the
motions motion
to to
table table
the the
appeal appeal
of of
the the
ruling ruling
of of
the the
Chair Chair
with with
regards regard
to to
the the
resolutions resolution
offered offer
by by
Mr. Mr.
Gephardt Gephardt
( (
rollcall rollcall
No No
. .
26 26
) )
and and
Ms. Ms.
Jackson Jackson
- -
Lee Lee
( (
rollcall rollcall
No no
. .
27 27
) )
, ,
I I
would would
have have
voted vote
" "
nay nay
" "
on on
the the
ordering ordering
of of
the the
previous previous
question question
on on
House House
Resolution Resolution
355 355
( (
rollcall rollcall
No no
. .
28 28
) )
. .
I I
would would
have have
voted vote
" "
nay nay
" "
on on
H. H.
Con Con
. .
Res Res
. .
141 141
( (
rollcall rollcall
No no
. .
29 29
) )
. .
I I
w

# Exercise

In this exercise, you are going to complete the following tasks.
1. From the `corpus` variable, generate a new list named `jb_lst` that contains all speeches from `Joseph Biden`. (__HINT__: use code similar to how we filtered `ds` for 'Bernie Sanders'.)
2. Select the first 5 speeches (`doc`) from `jb_lst` and store them in a new list named `jb_selected`.
3. For each element in `jb_selected`, print out:
    a. Named Entities (`textacy.extract.entities()`)
    b. Text Statistics (`TextStats()`)
4. For each element in `jb_selected`, print out the top 20 words based on their tf-idf score.
5. For each element in `jb_selected`, print out the __lemma__ (`lemma_`) of each token that is not a stop word (`is_stop`) and is a word token (`is_alpha`).

In [None]:
#### Complete your code here


This is the end of part 1. We will resume on text analytics (NLP) after the break.