## Visualizing Text with spaCy and other NLP packages

[spaCy](https://spacy.io/) is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems.

spaCy is developed by [Explosion AI](https://explosion.ai/), which also makes [Prodigy](https://prodi.gy/), a scriptable annotation tool for creating training and evaluation data for machine learning.

This tutorial doesn't have time to cover everything in spaCy. Instead, I'll show you a few examples and introduce some helpful visualization tools from spaCy's universe.

If you're interested in learning more about spaCy, I recommend spaCy's [free online course](https://course.spacy.io/en) as well as its [spaCy 101 documentation](https://spacy.io/usage/spacy-101).

To begin, I'll assume you've created a virtual environment and installed the packages listed in the [README](/README.md).

In [2]:
import spacy

spacy.__version__

'3.4.3'

In [30]:
import spacy

# spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
text = "Joe Biden is the president of the United States."
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Joe Biden PERSON
the United States GPE


In [34]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

spaCy's built-in visualizer, displaCy, can highlight the named entities found in a document.

In [49]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)

displaCy can also draw the sentence's dependency parse.

In [50]:
displacy.render(doc, style="dep", jupyter=True)
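
The `jupyter=True` flag only makes sense inside a notebook. Outside one, `displacy.render` can return the markup as a string instead. The sketch below is only an illustration under a few assumptions: it uses a blank English pipeline (so no trained model is needed) and sets the two entity spans by hand to mirror the output shown earlier. The names `blank_nlp`, `demo_doc`, and `html` are made up for this example.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# A blank pipeline only tokenizes; it has no trained NER component.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Joe Biden is the president of the United States.")

# Assumption: entity spans are set by hand (token offsets) to mimic a trained model.
demo_doc.ents = [
    Span(demo_doc, 0, 2, label="PERSON"),  # "Joe Biden"
    Span(demo_doc, 6, 9, label="GPE"),     # "the United States"
]

# With jupyter=False, render() returns the HTML markup as a string.
html = displacy.render(demo_doc, style="ent", jupyter=False, page=True)
print(html[:200])
```

The returned string can be written to an `.html` file and opened in a browser, which is handy when running spaCy from a script rather than a notebook.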

For our dataset, we'll use Hugging Face's [`datasets`](https://huggingface.co/datasets) package, which provides many ready-to-use datasets.

In [4]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

Found cached dataset yelp_review_full (/Users/rhymenoceros/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
100%|████████████████████████████████████████████████| 2/2 [00:00<00:00, 12.38it/s]


In [5]:
dataset['train'][0]['text']

"dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."

This dataset is very large: the training split alone contains 650,000 reviews. For illustrative purposes, let's take only the first 1,000 reviews.

In [6]:
# Slice the rows first so only 1,000 reviews are read, then take the text column
texts = dataset['train'][:1000]['text']

## Topic modeling with BERTopic

Analyzing a new text dataset is challenging because it's hard to know which questions to ask. This is where exploratory analysis can help. A popular NLP technique for exploratory analysis is [topic modeling](https://en.wikipedia.org/wiki/Topic_model).

Topic modeling is typically set up as an unsupervised machine learning problem, where the goal is to find hidden (latent) patterns in a dataset that can be interpreted as topics.

We'll use a more modern approach to topic modeling, BERTopic, which finds topics by clustering document embeddings.

In [None]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(texts, show_progress_bar=False)

# Train BERTopic on the precomputed embeddings
topic_model = BERTopic().fit(texts, embeddings)

# Visualize the topics; this raises a ValueError if the model has not been fitted yet
topic_model.visualize_topics()

# Run the document visualization with the original embeddings
topic_model.visualize_documents(texts, embeddings=embeddings)

# Optional: reduce the embeddings to 2D once up front, which makes repeated plotting much faster
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(texts, reduced_embeddings=reduced_embeddings)

## whatlies

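A major advance in NLP over the last 10 years has been [word embedding models](https://ruder.io/word-embeddings-1/). These provide a way to quantify word meaning by representing each word as a vector of numbers, so that words used in similar contexts end up close together.

The [TensorBoard Embedding Projector](https://projector.tensorflow.org/) is one tool for exploring such embeddings interactively. Here we'll use whatlies, which plots embeddings along axes that we choose.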
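
As a rough illustration of what "quantifying meaning" looks like, the sketch below compares vectors with cosine similarity. The three-dimensional vectors in `toy_vectors` are invented for this example; real embedding models produce vectors with hundreds of dimensions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-d "embeddings", invented for this example.
toy_vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "kitten": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

sim_cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

With these made-up numbers, "cat" scores higher against "kitten" than against "car", which is the kind of pattern trained embeddings tend to show for related words.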

In [None]:
from whatlies import EmbeddingSet
from whatlies.language import SpacyLanguage

lang = SpacyLanguage("en_core_web_md")
words = ["cat", "dog", "fish", "kitten", "man", "woman",
         "king", "queen", "doctor", "nurse"]

emb = EmbeddingSet(*[lang[w] for w in words])
emb.plot_interactive(x_axis=emb["man"], y_axis=emb["woman"])
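
Passing embeddings as `x_axis` and `y_axis` places every word according to how strongly it points along "man" and "woman". The sketch below assumes each coordinate is a scalar projection onto the axis vector, which is how I understand whatlies to compute it; the vectors and the helper `scalar_projection` are hypothetical and only illustrate the arithmetic.

```python
import numpy as np


def scalar_projection(v: np.ndarray, axis: np.ndarray) -> float:
    """Length of v along `axis`, measured in units of the axis vector's own length."""
    return float(v @ axis / (axis @ axis))


# Hypothetical 3-d vectors invented for this example.
man = np.array([1.0, 0.0, 0.2])
woman = np.array([0.0, 1.0, 0.2])
king = np.array([0.9, 0.1, 0.8])

# Each word becomes one (x, y) point on the two chosen axes.
x = scalar_projection(king, man)
y = scalar_projection(king, woman)
print(f"king -> ({x:.2f}, {y:.2f})")
```

Under this assumption, the axis word itself always lands at 1.0 on its own axis, which gives the plot a useful reference point.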

## `bulk`

[`bulk`](https://github.com/koaning/bulk) is a quick developer tool for applying labels in bulk. Given a prepared dataset with 2D embeddings, it generates an interface that lets you quickly add bulk, albeit less precise, annotations.

In [24]:
import pandas as pd
from sklearn.pipeline import make_pipeline
from umap import UMAP

# pip install "embetter[text]"
from embetter.text import SentenceEncoder

# Build a sentence encoder pipeline with UMAP at the end.
text_emb_pipeline = make_pipeline(
  SentenceEncoder('all-MiniLM-L6-v2'),
  UMAP()
)

# Calculate embeddings 
X_tfm = text_emb_pipeline.fit_transform(texts)

# Write to disk. Note! Text column must be named "text"
df = pd.DataFrame({"text": texts})
df['x'] = X_tfm[:, 0]
df['y'] = X_tfm[:, 1]
df.to_csv("ready.csv")
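
Before launching the interface, it can help to confirm that the file has the columns the cell above writes. The check below is a small sketch: the helper `missing_bulk_columns` is hypothetical, and the assumption that `text`, `x`, and `y` are the required columns comes from the cell above, not from a documented schema. It runs on an in-memory stand-in rather than the real `ready.csv`.

```python
import io

import pandas as pd

# Assumption: these are the columns the bulk text interface expects.
REQUIRED_COLUMNS = {"text", "x", "y"}


def missing_bulk_columns(frame: pd.DataFrame) -> set:
    """Return the required column names that are absent from `frame`."""
    return REQUIRED_COLUMNS - set(frame.columns)


# Hypothetical stand-in for ready.csv: one text column plus 2-d coordinates.
toy = pd.DataFrame(
    {"text": ["good burger", "slow service"], "x": [0.1, 0.7], "y": [0.4, 0.2]}
)
buffer = io.StringIO()
toy.to_csv(buffer)  # like the cell above, this also writes the index column
buffer.seek(0)

loaded = pd.read_csv(buffer)
missing = missing_bulk_columns(loaded)
print("missing columns:", missing or "none")
```

To check the real file, replace the in-memory buffer with `pd.read_csv("ready.csv")`.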

In [26]:
!python -m bulk text ready.csv

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
About to serve `bulk` over at http://localhost:5006/.
^C

Aborted!
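
The `huggingface/tokenizers` warning in the output above is harmless, and the message itself suggests setting `TOKENIZERS_PARALLELISM` explicitly. One way to do that from Python, sketched here, is to set the environment variable in an earlier cell; choosing `"false"` is an assumption that favors avoiding the fork warning over tokenizer speed.

```python
import os

# Set this before using any library that relies on `tokenizers`,
# and before launching subprocesses such as `!python -m bulk ...`.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Child processes started afterwards inherit the variable from the notebook kernel.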


In [27]:
!python -m bulk text ready.csv --keywords "chinese,burger,meal"



If you're interested in learning more, check out Vincent Warmerdam's [Bulk Labeling and Prodigy video](https://www.youtube.com/embed/gDk7_f3ovIk) or the related video on [Bulk Labeling for Images](https://www.youtube.com/watch?v=DmH3JmX3w2I&feature=emb_rel_pause).