
## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to address questions about gender and authorship or, perhaps, about key differences between novels set in urban and rural environments.

## Set up

In [None]:
%%capture
!pip install scattertext

In [None]:
import pandas as pd
import scattertext as st
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this use

In [None]:
# load the Dickens data
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [None]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


| | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 14301 | dickens | copperfield | We walked, that winter evening, in the fields ... | winter evening field calm air star tranquillity | frosty early | walk bless seem partake begin shine linger loo... |
| 24415 | dickens | pickwick | 'Don't you understand me?' said Mary, looking ... | face | fat | understand say look |
| 8259 | dickens | times | ‘Thou changest me from bad to good. Thou mak’... | good thee thee life ower muddle soul | chang bad fearfo alive | lose clear save |
| 22383 | dickens | pickwick | The ladies waved a choice selection of pocket-... | lady choice selection pocket handkerchief prop... | impetuous little sleek white faced perpetual g... | wave move take thrust represent wave renew bow... |
| 199 | dickens | carol | "I would gladly think otherwise if I could," s... | truth day morrow yesterday dowerless girl conf... | strong irresistible free very false full | think answer know _ learn know believe choose ... |


In [None]:
# load the Eliot data
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [None]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


| | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 11185 | eliot | bede | “I know what to do, never fear,” said Bartle, ... | news sauce dinner aye aye boy eye head piece f... | good good good | know fear say move get measure ’ve |
| 14146 | eliot | romola | He paused a moment, and his eyes sank as if he... | moment eye wave despondency _ man bosom knowle... | false beautiful gentle lonely little many poor | pause sink look say renew make love hate make ... |
| 1996 | eliot | middlemarch | Dorothea told him that she had seen Lydgate, a... | gist conversation voice increase knowledge con... | sure restless tacit lonely | tell see recite question feel wish know pass k... |
| 7487 | eliot | deronda | “He is delightful to ride. I should like to ha... | leap mamma channel minute gallop | delightful good wide | ride like frighten pass like take |
| 5165 | eliot | mill | But not even a direct argument from that typic... | argument female law disposition thought sight ... | direct typical mere able certain private hones... | go heighten freshen speak try make stand go ca... |
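
One thing the sanity checks above do not reveal is whether any rows have an empty part-of-speech column. The sketch below assumes (the source data does not confirm this) that some short passages yield no nouns or adjectives. In that case `pd.read_csv` reads the empty field as `NaN`, a float rather than a string, which text-processing tools may not accept. The three-row CSV is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV standing in for dickens.csv; the second row has no nouns.
toy_csv = io.StringIO(
    "author,text,nouns\n"
    "dickens,A cold night.,night\n"
    "dickens,Yes!,\n"
    "dickens,The old house stood.,house\n"
)
toy_df = pd.read_csv(toy_csv)

# Count missing values per column; an empty CSV field is read as NaN.
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Drop rows whose 'nouns' value is missing before handing the text to other tools.
clean_df = toy_df.dropna(subset=["nouns"])
print(clean_df.shape)
```

On the real data, the same check would be `dickens_df.isna().sum()` and `eliot_df.isna().sum()`.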


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.

In [None]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [None]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [None]:
# drop all columns except 'author' and 'nouns'
nouns_df = df[['author', 'nouns']]
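
Because `sample` draws rows at random, each run of the cells above produces a different sample and therefore a slightly different plot. Here is a minimal sketch of a reproducible version, assuming a fixed seed is acceptable for your purposes. The helper name `sample_and_combine`, the seed value, and the toy data are illustrative and not part of the original notebook.

```python
import pandas as pd

def sample_and_combine(frames, column, n, seed=42):
    """Sample up to n rows from each frame, stack them, keep 'author' and `column`."""
    samples = [
        # min() guards against asking for more rows than a frame contains
        frame.sample(n=min(n, len(frame)), random_state=seed)
        for frame in frames
    ]
    combined = pd.concat(samples, ignore_index=True)
    return combined[["author", column]]

# Toy stand-ins for dickens_df and eliot_df
toy_dickens = pd.DataFrame({"author": ["dickens"] * 4,
                            "nouns": ["fog", "street", "clerk", "inn"]})
toy_eliot = pd.DataFrame({"author": ["eliot"] * 3,
                          "nouns": ["mill", "river", "parish"]})

first = sample_and_combine([toy_dickens, toy_eliot], "nouns", n=3)
second = sample_and_combine([toy_dickens, toy_eliot], "nouns", n=3)
print(first.equals(second))  # the same seed selects the same rows
```

With the real data, this could be called as `sample_and_combine([dickens_df, eliot_df], 'nouns', n=10_000)`.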

## Build corpus and visualize

Now that our data is in the shape we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext). We first create a `Scattertext` corpus, then transform that corpus into an HTML-based visualization, and finally display that visualization within our notebook. Note: you can also save the visualization as an HTML file.

In [None]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [None]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [None]:
# display visualization in notebook
HTML(html)
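
To build intuition for what the plot encodes, it can help to compute per-author term frequencies by hand. The following is a simplified stand-in, not `Scattertext`'s own scoring: it counts each term per author in a made-up toy corpus and ranks terms by the difference in relative frequency. Terms strongly associated with one author end up at opposite ends of the ranking, much as they sit in opposite corners of the plot.

```python
from collections import Counter

# Made-up noun strings, in the same space-separated format as the 'nouns' column
toy_docs = [
    ("dickens", "street fog clerk street"),
    ("dickens", "clerk office street"),
    ("eliot", "mill river parish"),
    ("eliot", "river field parish river"),
]

# Tally how often each term appears for each author
counts = {"dickens": Counter(), "eliot": Counter()}
for author, nouns in toy_docs:
    counts[author].update(nouns.split())

totals = {author: sum(c.values()) for author, c in counts.items()}
vocabulary = set(counts["dickens"]) | set(counts["eliot"])

def rate_difference(term):
    """Relative frequency in Eliot minus relative frequency in Dickens."""
    eliot_rate = counts["eliot"][term] / totals["eliot"]
    dickens_rate = counts["dickens"][term] / totals["dickens"]
    return eliot_rate - dickens_rate

# Sort so that Eliot-leaning terms come first and Dickens-leaning terms last
ranked = sorted(vocabulary, key=rate_difference, reverse=True)
print("Most Eliot-leaning:", ranked[0])
print("Most Dickens-leaning:", ranked[-1])
```

Dividing by each author's total keeps the comparison fair when one author contributes more words than the other.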

In [None]:
# Note: You can save this visualization as an html file
file_name = 'example.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, make a comparison on the adjectives used by these authors using `Scattertext`.

In [None]:
# create samples
sample_size = 10000
dickens_sample_df = dickens_df.sample(n=sample_size)
eliot_sample_df = eliot_df.sample(n=sample_size)

In [None]:
# combine DataFrames
combined_df = pd.concat([dickens_sample_df, eliot_sample_df])

In [None]:
# drop all columns except 'author' and 'adjectives'
adjectives_df = combined_df[['author', 'adjectives']]

In [None]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(adjectives_df, category_col='author', text_col='adjectives').build()

In [None]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',
                                       category_name='Eliot',
                                       not_category_name='Dickens',
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [None]:
# display visualization in notebook
HTML(html)

Reflecting on this project comparing the adjectives used by Charles Dickens and George Eliot with `Scattertext`, I realize how much I've learned about the intersection of data science and literature. This task was an insightful journey into the world of text analysis, shedding light on both my technical skills and my understanding of literary styles.

The project began with the challenge of handling and preprocessing the datasets. Working with Python and pandas, I developed a practical understanding of data manipulation. Sampling data from the `dickens_df` and `eliot_df` DataFrames and concatenating them into a single DataFrame required precision. I learned that the way data is prepared significantly impacts the results and interpretations of any analysis.

Using the Scattertext library was a new and enlightening experience. I discovered the intricate process of creating a corpus from a DataFrame and transforming this into a visual plot. This part of the project was particularly demanding, requiring a careful balance between technical know-how and creative visualization. The resulting scatter plot was not just a visual representation but also a narrative tool, revealing distinct patterns in the authors' use of adjectives.

Analyzing the scatter plot, I noted specific trends in adjective usage by both authors. For instance, Eliot's use of adjectives like 'religious' and 'spiritual' contrasted with Dickens's preference for words like 'curious' and 'extraordinary.' This observation led me to ponder the thematic and stylistic differences between the two authors, rooted in their unique backgrounds and the societal contexts they wrote in. It was fascinating to see how these linguistic nuances could be captured and compared through data visualization.

The project also honed my critical thinking skills. Interpreting data is not just about understanding what is presented; it's about probing deeper into the 'why' and 'how.' This task made me consider the broader implications of word choice in literature and how it reflects an author's worldview.

In conclusion, this project was a comprehensive learning experience. It enhanced my skills in Python and pandas, introduced me to the practical applications of Scattertext, and deepened my appreciation for literature. The project was a vivid reminder of the power of data science in uncovering hidden patterns in text, offering a window into the minds of historical literary figures.