## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to address questions about gender and authorship or, perhaps, about key differences between novels set in urban vs. rural environments.

## Set up

In [1]:
%%capture
!pip install scattertext

In [2]:
import pandas as pd
import scattertext as st
from IPython.display import HTML  # public import path for the HTML display class

In [5]:
# load the Dickens data
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [6]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
6791,dickens,twist,"“It is,” replied Rose, “that you must endeavou...",companion object love world heart passion friend,old many proud other true warm most faithful,reply endeavour forget attach wound look think...
9892,dickens,copperfield,‘To degrade YOU?’ said Mr. Creakle. ‘My stars!...,star name arm cane chest knot brow eye favouri...,little visible proper,degrade say give leave ask fold make talk show...
16700,dickens,bleak,I took the liberty of saying that the room wou...,liberty room paper boy extent business way mee...,dear good overwhelmed public serious,take say want think put say know dare say obli...
1293,dickens,cities,"Young Jerry, who had only made a feint of undr...",feint undressing bed father cover darkness roo...,full ajar,make go follow follow follow follow concern ge...
19287,dickens,bleak,"He took me to the porch, which he had hitherto...",porch child name,dear,take avoid say pause go guess


In [6]:
# load the Eliot data
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [9]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
17676,eliot,scenes,\nMost people must have agreed with Mrs. Rayno...,people day sight form face rivulet aisle kneel...,Most pretty least slight girlish fair young wh...,agree move flow soften paint think look press ...
8624,eliot,deronda,All had assembled in the drawing-room before t...,drawing room couple time child nephew niece gi...,new various little own delightful full deep ri...,assemble appear pass notice ’s allow appear en...
11766,eliot,bede,\nMr. Irwine returned from Stoniton in a post-...,night word house bed o’clock morning bed,post - chaise first dead dead awake,return say enter find desire say come beg go see
552,eliot,middlemarch,"“No,” said Mary, curtly, with a little toss of...",toss head life,little pleasant,say think
9728,eliot,deronda,"To Gwendolen, who even in the freedom of her m...",freedom time glimpse heroism sublimity medium ...,maiden faint close legal human melancholy yell...,thrust hate head pass regret take act live dis...


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.

In [7]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [8]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [12]:
# keep only the 'author' and 'nouns' columns; choose columns with your research question in mind
nouns_df = df[['author', 'nouns']]

In [17]:
print(nouns_df.shape)
nouns_df.sample(10)

(20000, 2)


Unnamed: 0,author,nouns
17184,dickens,father judgment brother sister judgment
12106,dickens,dear aunt preparation night no
3755,eliot,boy man child girl trial mother year lustre ho...
17688,eliot,nayther sack waggon o mine o mind
10787,dickens,tribute evening servant manner deal noise pave...
16255,dickens,sir gallery
22138,dickens,explanation
12542,eliot,head organ edge mind way faculty blindness dam...
7178,dickens,street kennel glance character house appearanc...
817,eliot,scheme deal community descendant society year ...
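The sampling above draws different rows on every run, so the resulting plot can shift slightly each time. Below is a minimal sketch, on toy data rather than the real CSVs, of two optional refinements: passing `random_state` so the sample is repeatable, and calling `dropna` so rows with no extracted nouns are removed before building a corpus. Whether the real `nouns` column contains missing values is an assumption here, and the names `toy_dickens`, `toy_eliot`, and `SEED` are hypothetical.

```python
import pandas as pd

SEED = 42  # hypothetical seed; any fixed integer works

# toy stand-ins for dickens_df and eliot_df (not the real data)
toy_dickens = pd.DataFrame({
    "author": ["dickens"] * 4,
    "nouns": ["street house", None, "sir gallery", "porch child name"],
})
toy_eliot = pd.DataFrame({
    "author": ["eliot"] * 4,
    "nouns": ["toss head life", "boy man child", None, "scheme deal"],
})

# a fixed random_state returns the same rows on every run
dickens_sample = toy_dickens.sample(3, random_state=SEED)
eliot_sample = toy_eliot.sample(3, random_state=SEED)

# combine, keep only the two columns needed, and drop rows without nouns
toy_nouns_df = (
    pd.concat([dickens_sample, eliot_sample])[["author", "nouns"]]
    .dropna(subset=["nouns"])
)
print(toy_nouns_df.shape)
```

With the real data, the same idea would be `dickens_df.sample(10_000, random_state=SEED)`, which keeps the in-class demonstration consistent from one run to the next.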


## Build corpus and visualize

Now that our data is in the shape we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext). First we create a `Scattertext` corpus, then we transform that corpus into an HTML-based visualization, and finally we display that visualization within our notebook. Note: you can also save the visualization as an HTML file.

In [23]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [24]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [25]:
# display visualization in notebook
HTML(html)
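As a rough intuition for what the plot encodes, the sketch below counts how often each term occurs for each author, using only the standard library and a few toy rows. This is not how `scattertext` is implemented; it only illustrates the per-category term counts that a term's position in the plot reflects. The `rows` and `counts` names are hypothetical.

```python
from collections import Counter

# toy rows in the same shape as nouns_df: (author, space-separated nouns)
rows = [
    ("dickens", "street house street"),
    ("dickens", "sir gallery house"),
    ("eliot", "head mind way"),
    ("eliot", "mind house"),
]

# count how often each term appears for each author
counts = {"dickens": Counter(), "eliot": Counter()}
for author, nouns in rows:
    counts[author].update(nouns.split())

# list every term with its frequency for Dickens, then for Eliot
for term in sorted(set(counts["dickens"]) | set(counts["eliot"])):
    print(term, counts["dickens"][term], counts["eliot"][term])
```

A term that is frequent for one author and rare for the other (here, `street` or `mind`) is the kind of "distinguishing term" to look for near the corners of the plot, while a shared term such as `house` tells us less.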

In [26]:
# Note: You can save this visualization as an html file
file_name = 'example.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, make a comparison on the adjectives used by these authors using `Scattertext`.

In [9]:
# create samples
dickens_adj_sample = dickens_df.sample(10_000)
eliot_adj_sample = eliot_df.sample(10_000)

In [10]:
# combine DataFrames
df_adj = pd.concat([dickens_adj_sample, eliot_adj_sample])

In [11]:
print(df_adj.shape)

(20000, 6)


In [19]:
# drop all columns except 'author' and 'adjectives'
df_adj = df_adj[['author', 'adjectives']]  # select from the adjective sample, not the earlier `df`

In [20]:
# create a scattertext corpus
corpus_adj = st.CorpusFromPandas(df_adj, category_col='author', text_col='adjectives').build()

In [21]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus_adj,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [22]:
# display visualization in notebook
HTML(html)

In [23]:
# save the adjective visualization as an html file
file_name = 'example_adj.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)
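The noun and adjective comparisons repeat the same three preparation steps (sample, combine, select columns), so they could be wrapped in one small function. Below is a minimal sketch on toy data; `build_pos_frame` and its parameters are hypothetical names, not part of `pandas` or `scattertext`.

```python
import pandas as pd


def build_pos_frame(frames, column, n, seed=0):
    """Sample n rows from each frame, combine them, and keep 'author' plus one column."""
    samples = [frame.sample(n, random_state=seed) for frame in frames]
    combined = pd.concat(samples)
    return combined[["author", column]]


# toy stand-ins for dickens_df and eliot_df (not the real data)
toy_dickens = pd.DataFrame({
    "author": ["dickens"] * 3,
    "nouns": ["street house", "sir gallery", "porch child name"],
    "adjectives": ["old proud", "little visible", "dear"],
})
toy_eliot = pd.DataFrame({
    "author": ["eliot"] * 3,
    "nouns": ["toss head life", "boy man child", "scheme deal"],
    "adjectives": ["little pleasant", "new various", "pretty slight"],
})

# the same call works for 'nouns' or 'adjectives' by changing one argument
adj_df = build_pos_frame([toy_dickens, toy_eliot], "adjectives", n=2)
print(adj_df)
```

The returned frame can then be passed to `st.CorpusFromPandas(...)` exactly as in the cells above, with `text_col` set to the same column name. The `verbs` column in the original data could be compared the same way.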