# Sentiment Analysis Research with the HathiTrust Digital Library

### Contents
1. Overview
2. Acquiring the Textual Data
3. Installing Dependencies
4. Fiction Example: _The Count of Monte Cristo_
   1. Full Text Analysis
   2. Extracted Features Analysis
   3. Emotional Valence Graph
5. Nonfiction Example: _The Origin of Species_
   1. Full Text Analysis
   2. Extracted Features Analysis
   3. Emotional Valence Graph
6. Exploring Large Language Models (LLMs)
7. Conclusion
8. Further Readings

### Overview
In this project, I used Python to run sentiment analysis on two volumes from the HathiTrust Digital Library and examine how emotional valence changes across each text. I analyzed one novel, _[The Count of Monte Cristo](https://hdl.handle.net/2027/mdp.39015062136661)_ by Alexandre Dumas, and one nonfiction work, _[The Origin of Species](https://hdl.handle.net/2027/hvd.hw39sc)_ by Charles Darwin. My goal was to visualize the change in emotional valence over the span of each book.

I utilized two forms of textual data for my analysis: full text (TXT) files downloaded directly from HathiTrust Digital Library, as well as Extracted Features (EF) obtained from [HathiTrust Research Center (HTRC) Analytics](https://analytics.hathitrust.org/). HTRC Analytics enables non-profit research and educational uses of materials in the HathiTrust collection, including those still under copyright. Specifically, the [Extracted Features](https://analytics.hathitrust.org/datasets) contain metadata about volumes and pages alongside part-of-speech-tagged tokens and token counts extracted from full texts.

With this textual data, I performed sentiment analysis using three tools: VADER, TextBlob, and AFINN. Each tool assigns sentiment scores to input texts which can be aggregated and visualized to show how emotional valence shifts across the span of pages in each book.


### Acquiring the Textual Data
To download the full-text files for entire volumes from HathiTrust, one needs to be affiliated with a HathiTrust [member institution](https://www.hathitrust.org/member-libraries/member-list/) and logged into the HathiTrust website using institutional credentials.

Full text files can be directly downloaded from each item's page. Extracted features must be accessed through [HTRC Analytics](https://analytics.hathitrust.org/), following this [EF download tutorial](https://htrc.atlassian.net/wiki/spaces/COM/pages/43288147/Downloading+Extracted+Features#DownloadingExtractedFeatures-EF1.5download).

For this project, I downloaded the following data files:

_The Count of Monte Cristo_: `mdp-39015062136661-1693964099.txt` (full text) and `mdp.39015062136661.json.bz2` (Extracted Features)

_The Origin of Species_: `hvd-hw39sc-1696432701.txt` (full text) and `hvd.hw39sc.json.bz2` (Extracted Features)

### Installing Dependencies
For this project, I selected three widely used sentiment analysis tools:

1. **[VADER](https://www.nltk.org/index.html)**: Included in NLTK (the Natural Language Toolkit), VADER is a lexicon- and rule-based tool designed for social media text and is adept at handling informal language, emojis, and slang.
2. **[TextBlob](https://textblob.readthedocs.io/en/dev/)**: A user-friendly library, TextBlob simplifies many common natural language processing (NLP) tasks, including sentiment analysis.
3. **[AFINN](https://pypi.org/project/afinn/)**: AFINN is a wordlist-based tool for sentiment analysis where each word in the list is rated for its sentiment strength.
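To make the wordlist idea concrete, here is a minimal sketch of how a wordlist-based scorer can sum per-word ratings. The `TOY_LEXICON` values and the `wordlist_score` helper are made up for illustration and are not taken from AFINN, which ships a much larger list and handles tokenization itself.

```python
# Hypothetical four-word lexicon; the real AFINN list rates words from -5 to +5
TOY_LEXICON = {"good": 3, "happy": 3, "bad": -3, "terrible": -3}


def wordlist_score(text, lexicon):
    """Sum the ratings of all known words; unknown words contribute 0."""
    # Naive whitespace tokenization, so punctuation attached to a word is not handled
    return sum(lexicon.get(word, 0) for word in text.lower().split())


print(wordlist_score("good good bad", TOY_LEXICON))
print(wordlist_score("A good day after a terrible night", TOY_LEXICON))
```

Because the score is a plain sum, longer texts with more rated words tend to produce larger absolute values.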

These tools can be installed in a Python environment using `pip` commands:
```
pip install nltk
pip install textblob
pip install afinn
```

To analyze Extracted Features, **[htrc-feature-reader](https://pypi.org/project/htrc-feature-reader/)** is needed. It is a tool designed specifically to work with Extracted Features from HTRC.
```
pip install htrc-feature-reader
```

Other modules needed for the project include `pandas`, which is used for data analysis, and `plotly`, which is used for plotting interactive graphs.
```
pip install pandas
pip install plotly==5.18.0
```


### Fiction Example: _The Count of Monte Cristo_
#### Full Text Analysis

With the texts downloaded and the required dependencies installed, I was ready to begin the sentiment analysis process. The first step was importing the necessary libraries and modules:

In [None]:
# Import libraries for data analysis and visualization
import re  # for regular expression operations
import pandas as pd
import plotly.graph_objects as go

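# Import the sentiment analysis tools
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from afinn import Afinn

# VADER needs its lexicon; this downloads it if it is not already present
nltk.download('vader_lexicon', quiet=True)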

The full-text file contained the entire text of _The Count of Monte Cristo_ as a single continuous document rather than one file per page. To analyze it page by page, I needed to structure it into a more manageable format: a DataFrame with two columns, one for page numbers and the other for the content of each page.

While looking at the TXT file, I noticed markers that indicate page breaks, formatted as `## p. (#1) #################################################`. I decided to use regular expressions (RegEx) to identify these markers and parse the text accordingly.
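As a minimal sketch of this idea, the snippet below splits a made-up sample into pages. The sample text, the `split_pages` helper, and the `## p. 1 (#2)` layout assumed for numbered pages are illustrative assumptions rather than excerpts from the actual file.

```python
import re

# Hypothetical sample; only markers that carry a printed page number are matched
SAMPLE = "\n".join([
    "## p. (#1) ####",
    "Title page",
    "## p. 1 (#2) ####",
    "On the 24th of February,",
    "the look-out signalled the ship.",
    "## p. 2 (#3) ####",
    "The vessel drew near.",
])

PAGE_MARKER = re.compile(r"## p\. (\d+)")


def split_pages(text):
    """Return a list of (page_number, content) tuples for numbered pages."""
    pages = []
    number, buffer = None, []
    for line in text.split("\n"):
        match = PAGE_MARKER.match(line)
        if match:
            if number is not None:
                pages.append((number, " ".join(buffer)))
            number, buffer = int(match.group(1)), []
        elif line.startswith("## p. "):
            continue  # marker without a printed page number: skip the marker line
        else:
            buffer.append(line)
    if number is not None:
        pages.append((number, " ".join(buffer)))  # keep the final page
    return pages


for number, content in split_pages(SAMPLE):
    print(number, content)
```

In this sketch, text that appears before the first numbered page is discarded, and the last page is stored after the loop because no later marker follows it.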


In [None]:
# Open the file and split the text into lines
with open('mdp-39015062136661-1693964099.txt', 'r', encoding='utf-8') as file:
    text = file.read()
lines = text.split('\n')

# Initialize lists for page numbers and content
page_numbers = []
page_content = []

current_page_number = None
current_page_content = []

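# Parse the text with RegEx
page_pattern = r'## p\. (\d+)'
for line in lines:
    if line.startswith("## p. "):
        # A new page marker: store the page collected so far
        if current_page_number is not None:
            page_numbers.append(current_page_number)
            page_content.append(" ".join(current_page_content))

        match = re.match(page_pattern, line)
        if match:
            current_page_number = match.group(1)
            current_page_content = []
    else:
        current_page_content.append(line)

# Store the final page, which has no following marker to trigger the append
if current_page_number is not None:
    page_numbers.append(current_page_number)
    page_content.append(" ".join(current_page_content))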

Next, I created a DataFrame named `dumas_full_text` to organize `page_numbers` and `page_content`.

In [None]:
dumas_full_text = pd.DataFrame({
    'page_number': [int(x) for x in page_numbers],
    'page_content': page_content,
})

Here is a preview of the DataFrame for pages 11 to 15:

In [None]:
dumas_full_text[10:15]

##### VADER

With the DataFrame ready, I proceeded with sentiment analysis using the VADER tool:

In [None]:
analyzer = SentimentIntensityAnalyzer()

dumas_full_text['vader_sentiment_score'] = 0.0
dumas_full_text['vader_sentiment'] = ""

# `row` avoids shadowing Python's built-in `tuple`
for row in dumas_full_text.itertuples():
    sentence = row.page_content

    sentiment_dictionary = analyzer.polarity_scores(sentence)
    compound = sentiment_dictionary['compound']

    dumas_full_text.at[row.Index, 'vader_sentiment_score'] = compound

    if compound >= 0.33:
        vader_sentiment = "Positive"
    elif compound <= -0.33:
        vader_sentiment = "Negative"
    else:
        vader_sentiment = "Neutral"

    dumas_full_text.at[row.Index, 'vader_sentiment'] = vader_sentiment

dumas_full_text[10:15]

##### TextBlob

The process for sentiment analysis with TextBlob is similar to that with VADER:

In [None]:
dumas_full_text['textblob_sentiment_score'] = 0.0
dumas_full_text['textblob_sentiment'] = ""

# `row` avoids shadowing Python's built-in `tuple`
for row in dumas_full_text.itertuples():
    sentence = row.page_content

    classifier = TextBlob(sentence)
    polarity = classifier.sentiment.polarity

    dumas_full_text.at[row.Index, 'textblob_sentiment_score'] = polarity

    if polarity >= 0.1:
        textblob_sentiment = "Positive"
    elif polarity <= -0.1:
        textblob_sentiment = "Negative"
    else:
        textblob_sentiment = "Neutral"

    dumas_full_text.at[row.Index, 'textblob_sentiment'] = textblob_sentiment

dumas_full_text[10:15]
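
Both loops apply the same three-way rule and differ only in the cut-off (0.33 for VADER, 0.1 for TextBlob). A minimal sketch of that rule as a reusable function follows; the `label_sentiment` helper and the constant names are hypothetical and do not appear in the notebook.

```python
def label_sentiment(score, threshold):
    """Return 'Positive', 'Negative', or 'Neutral' for a score in [-1, 1]."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


VADER_THRESHOLD = 0.33     # cut-off used for VADER's compound score above
TEXTBLOB_THRESHOLD = 0.1   # cut-off used for TextBlob's polarity above

# The same score can receive different labels under the two cut-offs
print(label_sentiment(0.25, VADER_THRESHOLD))     # Neutral
print(label_sentiment(0.25, TEXTBLOB_THRESHOLD))  # Positive
```

Because the cut-offs differ, the categorical labels from the two tools are not directly comparable, which is one reason the graphs later in this notebook use the numeric scores.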

##### AFINN

Lastly, I used AFINN to analyze the sentiment across the full text of _The Count of Monte Cristo_.

In [None]:
afinn = Afinn(language='en')

dumas_full_text['afinn_sentiment_score'] = 0.0

# `row` avoids shadowing Python's built-in `tuple`
for row in dumas_full_text.itertuples():
    sentence = row.page_content

    score = afinn.score(sentence)

    dumas_full_text.at[row.Index, 'afinn_sentiment_score'] = score

dumas_full_text[10:15]

One thing to note is that `afinn_sentiment_score` is on a different scale from `vader_sentiment_score` and `textblob_sentiment_score`. VADER's compound score and TextBlob's polarity both range from -1 to 1, whereas an AFINN score is the sum of the sentiment values of individual words, which produces much larger absolute values over a full page. I therefore rescaled the AFINN scores to the range of -1 to 1 using min-max normalization.
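The rescaling can be sketched on a handful of made-up scores. The `min_max_to_unit_range` helper and the `raw` values are hypothetical; the guard for identical values is an addition of this sketch.

```python
def min_max_to_unit_range(values):
    """Linearly rescale values so the minimum maps to -1 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


raw = [-10.0, 0.0, 5.0, 30.0]   # hypothetical per-page AFINN sums
scaled = min_max_to_unit_range(raw)
print(scaled)  # [-1.0, -0.5, -0.25, 1.0]
```

Note that a raw score of 0 maps to -0.5 in this example rather than to 0: min-max scaling pins the extremes to -1 and 1, so the position of the neutral point depends on the minimum and maximum of the data.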

In [None]:
# Normalize the AFINN sentiment scores
min_value = min(dumas_full_text['afinn_sentiment_score'])
max_value = max(dumas_full_text['afinn_sentiment_score'])

normalized_numbers = [(x - min_value) / (max_value - min_value) for x in dumas_full_text['afinn_sentiment_score']]

# Adjust the normalized numbers to the -1 to 1 range
afinn_normalized = [2 * x - 1 for x in normalized_numbers]

dumas_full_text['afinn_normalized'] = afinn_normalized

dumas_full_text[10:15]

#### Extracted Features Analysis
Moving on to analyzing the Extracted Features, I first imported `FeatureReader` from the `htrc_features` library.

In [None]:
from htrc_features import FeatureReader
import warnings
# The warnings are suppressed to avoid clutter in the output and do not affect the program's functionality
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
paths = ['mdp.39015062136661.json.bz2']
fr = FeatureReader(paths)
vol = next(fr.volumes())

Then, I created a DataFrame for Extracted Features named `dumas_ef`. I also grouped the tokens by page number, so that I could analyze the content on each page in a way similar to full text.

In [None]:
dumas_ef = vol.tokenlist(pos=False, case=False)\
        .reset_index().drop(['section'], axis=1)
dumas_ef.columns = ['Page Number', 'token', 'count']

In [None]:
# Group tokens by page number
grouped_tokens = dumas_ef.groupby('Page Number')

The next cell rebuilds each page's text from its tokens and scores it with all three tools: VADER, TextBlob, and AFINN.
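Because Extracted Features store each page as token counts rather than running text, the input for the tools has to be rebuilt by repeating each token as many times as it occurs. A minimal sketch with made-up counts follows; `rebuild_page_text` is a hypothetical helper, and the original word order cannot be recovered this way.

```python
def rebuild_page_text(token_counts):
    """Expand (token, count) pairs into a space-separated bag-of-words string."""
    words = []
    for token, count in token_counts:
        # Repeat the token as separate list items so the copies stay space-separated
        words.extend([token] * int(count))
    return " ".join(words)


page = [("happy", 2), ("sad", 1), ("ship", 3)]   # hypothetical token counts
print(rebuild_page_text(page))  # happy happy sad ship ship ship
```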

In [None]:
# Initialize lists to store sentiment analysis results
ef_vader_sentiment_score = []
ef_textblob_sentiment_score = []
ef_afinn_sentiment_score = []

# Perform sentiment analysis for each page
for name, group in grouped_tokens:
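    # Repeat each token `count` times as separate words ("word" * 2 would give "wordword")
    page_text = " ".join(" ".join([word] * int(count)) for word, count in zip(group['token'], group['count']))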

    # VADER Analysis
    sentiment_scores = analyzer.polarity_scores(page_text)
    ef_vader_sentiment_score.append(sentiment_scores['compound'])

    # TextBlob Analysis
    sentiment = TextBlob(page_text).sentiment
    ef_textblob_sentiment_score.append(sentiment.polarity)

    # AFINN Analysis
    sentiment_score = afinn.score(page_text)
    ef_afinn_sentiment_score.append(sentiment_score)

# Create a DataFrame with all sentiment analysis results
dumas_ef = pd.DataFrame({
    'page_number': [int(x) for x in grouped_tokens.groups.keys()],
    'ef_vader_sentiment_score': ef_vader_sentiment_score,
    'ef_textblob_sentiment_score': ef_textblob_sentiment_score,
    'ef_afinn_sentiment_score': ef_afinn_sentiment_score
})

dumas_ef[10:15]

Again, I needed to normalize the AFINN scores.

In [None]:
min_value = min(dumas_ef['ef_afinn_sentiment_score'])
max_value = max(dumas_ef['ef_afinn_sentiment_score'])

normalized_numbers = [(x - min_value) / (max_value - min_value) for x in dumas_ef['ef_afinn_sentiment_score']]
ef_afinn_normalized = [2 * x - 1 for x in normalized_numbers]

dumas_ef['ef_afinn_normalized'] = ef_afinn_normalized

dumas_ef[['ef_afinn_sentiment_score', 'ef_afinn_normalized']][10:15]

Another adjustment concerned the page numbers in the `dumas_ef` DataFrame. The original page numbers in `dumas_ef` were higher because they counted the pages of the book's front matter, such as the cover and copyright pages. By subtracting 16, the number of front-matter pages in this volume, I made the page numbers in `dumas_ef` match those in `dumas_full_text` so that the two could be compared meaningfully.
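The alignment can be sketched with a few made-up values. The `ef_scores` dictionary and the `align_to_printed_pages` helper are hypothetical, and dropping non-positive page numbers is a choice made in this sketch, not something the notebook code does.

```python
FRONT_MATTER_PAGES = 16  # number of front-matter pages stated for this volume

# Hypothetical scores keyed by EF page sequence number
ef_scores = {15: 0.0, 16: 0.1, 17: 0.4, 18: -0.2}


def align_to_printed_pages(scores_by_seq, front_matter_pages):
    """Shift sequence numbers to printed page numbers and drop front matter."""
    aligned = {}
    for seq, score in scores_by_seq.items():
        page = seq - front_matter_pages
        if page >= 1:  # pages at or below 0 belong to the front matter
            aligned[page] = score
    return aligned


aligned = align_to_printed_pages(ef_scores, FRONT_MATTER_PAGES)
print(aligned)  # {1: 0.4, 2: -0.2}
```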

In [None]:
FRONT_MATTER_PAGES = 16
dumas_ef['page_number'] = dumas_ef['page_number'].astype(int) - FRONT_MATTER_PAGES

In [None]:
dumas_ef[10:15]

#### Emotional Valence Graph

My goal was to visualize the overall emotional trend throughout the book, for which page-by-page detail is not necessary. I therefore smoothed the data with a rolling mean over a window of 20 pages.
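For intuition, here is a plain-Python sketch of what a trailing rolling mean computes when early windows are allowed to be shorter than the full window size. `trailing_mean` is a hypothetical helper written for illustration; the notebook itself uses the pandas `rolling` method.

```python
def trailing_mean(values, window):
    """Average each value with up to `window - 1` values that precede it."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)   # the first windows are shorter
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


print(trailing_mean([1.0, 3.0, 5.0, 7.0], window=2))  # [1.0, 2.0, 4.0, 6.0]
```

A larger window gives a smoother curve at the cost of hiding short-lived swings in sentiment.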

In [None]:
WINDOW_SIZE = 20

# Columns in dumas_full_text for rolling window operation
columns_full_text = ['vader_sentiment_score', 'textblob_sentiment_score', 'afinn_normalized']

for col in columns_full_text:
    dumas_full_text[col] = dumas_full_text[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()

# Columns in dumas_ef for rolling window operation
columns_ef = ['ef_vader_sentiment_score', 'ef_textblob_sentiment_score', 'ef_afinn_normalized']

for col in columns_ef:
    dumas_ef[col] = dumas_ef[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()

Finally, I used `plotly` to create an interactive graph of all the processed data. The viewer can toggle individual lines on and off, which makes the comparison more straightforward.

In [None]:
fig = go.Figure()

# Plotting for dumas_full_text DataFrame
sentiment_scores_full_text = {
    'vader_sentiment_score': 'Full Text VADER',
    'textblob_sentiment_score': 'Full Text TextBlob',
    'afinn_normalized': 'Full Text AFINN Normalized'
}

for column, label in sentiment_scores_full_text.items():
    fig.add_trace(go.Scatter(x=dumas_full_text['page_number'], y=dumas_full_text[column], mode='lines', name=label))

# Plotting for dumas_ef DataFrame
sentiment_scores_ef = {
    'ef_vader_sentiment_score': 'EF VADER',
    'ef_textblob_sentiment_score': 'EF TextBlob',
    'ef_afinn_normalized': 'EF AFINN Normalized'
}

for column, label in sentiment_scores_ef.items():
    fig.add_trace(go.Scatter(x=dumas_ef['page_number'], y=dumas_ef[column], mode='lines', name=label))

# Additional plot settings
fig.update_layout(
    title='Emotional Valence throughout "The Count of Monte Cristo"',
    xaxis_title='Page Number',
    yaxis_title='Sentiment Score',
    showlegend=True
)

fig.show()

### Nonfiction Example: _The Origin of Species_
#### Full Text Analysis
Moving on to the nonfiction example: _The Origin of Species_. The process was largely the same, with only a few minor differences.

In this book's full-text file, the page numbering resets after page 400, marking the start of Part 2. To ensure a continuous page count throughout the book, I implemented an offset that allows the numbering to continue past 400 instead of restarting.
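The offset logic can be sketched on a short list of printed page numbers. `make_continuous` is a hypothetical helper, and the sample numbers are made up; the sketch assumes that a restart is always signalled by a page numbered 1.

```python
def make_continuous(printed_numbers):
    """Renumber pages so that counting continues when the numbering restarts at 1."""
    continuous = []
    offset = 0
    previous = None
    for number in printed_numbers:
        if previous is not None and number == 1:
            # `previous` already includes earlier offsets, so assign instead of adding
            offset = previous
        current = number + offset
        continuous.append(current)
        previous = current
    return continuous


print(make_continuous([1, 2, 3, 1, 2, 1, 2]))  # [1, 2, 3, 4, 5, 6, 7]
```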

In [None]:
with open('hvd-hw39sc-1696432701.txt', 'r', encoding='utf-8') as file:
    text = file.read()
lines = text.split('\n')

page_numbers = []
page_content = []

current_page_number = None
current_page_content = []
offset = 0  # Initialize an offset

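page_pattern = r'## p\. (\d+)'
for line in lines:
    if line.startswith("## p. "):
        if current_page_number is not None:
            page_numbers.append(current_page_number)
            page_content.append(" ".join(current_page_content))

        match = re.match(page_pattern, line)
        if match:
            # Check for page number reset
            if int(match.group(1)) == 1 and current_page_number is not None:
                # current_page_number already includes any earlier offset,
                # so assign it rather than adding it to the offset again
                offset = current_page_number

            current_page_number = int(match.group(1)) + offset
            current_page_content = []
    else:
        current_page_content.append(line)

# Store the final page, which has no following marker to trigger the append
if current_page_number is not None:
    page_numbers.append(current_page_number)
    page_content.append(" ".join(current_page_content))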

As with the fiction example, I created a DataFrame for the full text named `darwin_full_text`.

In [None]:
darwin_full_text = pd.DataFrame({
    'page_number': [int(x) for x in page_numbers],
    'page_content': page_content,
})
darwin_full_text[10:15]

##### VADER

In [None]:
analyzer = SentimentIntensityAnalyzer()

darwin_full_text['vader_sentiment_score'] = 0.0
darwin_full_text['vader_sentiment'] = ""

# `row` avoids shadowing Python's built-in `tuple`
for row in darwin_full_text.itertuples():
    sentence = row.page_content

    sentiment_dictionary = analyzer.polarity_scores(sentence)
    compound = sentiment_dictionary['compound']

    darwin_full_text.at[row.Index, 'vader_sentiment_score'] = compound

    if compound >= 0.33:
        vader_sentiment = "Positive"
    elif compound <= -0.33:
        vader_sentiment = "Negative"
    else:
        vader_sentiment = "Neutral"

    darwin_full_text.at[row.Index, 'vader_sentiment'] = vader_sentiment

##### TextBlob

In [None]:
darwin_full_text['textblob_sentiment_score'] = 0.0
darwin_full_text['textblob_sentiment'] = ""

# `row` avoids shadowing Python's built-in `tuple`
for row in darwin_full_text.itertuples():
    sentence = row.page_content

    classifier = TextBlob(sentence)
    polarity = classifier.sentiment.polarity

    darwin_full_text.at[row.Index, 'textblob_sentiment_score'] = polarity

    if polarity >= 0.1:
        textblob_sentiment = "Positive"
    elif polarity <= -0.1:
        textblob_sentiment = "Negative"
    else:
        textblob_sentiment = "Neutral"

    darwin_full_text.at[row.Index, 'textblob_sentiment'] = textblob_sentiment

##### AFINN


In [None]:
afinn = Afinn(language='en')

darwin_full_text['afinn_sentiment_score'] = 0.0

# `row` avoids shadowing Python's built-in `tuple`
for row in darwin_full_text.itertuples():
    sentence = row.page_content

    score = afinn.score(sentence)

    darwin_full_text.at[row.Index, 'afinn_sentiment_score'] = score

# Normalize AFINN scores
min_value = min(darwin_full_text['afinn_sentiment_score'])
max_value = max(darwin_full_text['afinn_sentiment_score'])

normalized_numbers = [(x - min_value) / (max_value - min_value) for x in darwin_full_text['afinn_sentiment_score']]

afinn_normalized = [2 * x - 1 for x in normalized_numbers]

darwin_full_text['afinn_normalized'] = afinn_normalized

darwin_full_text[10:15]

#### Extracted Features Analysis

In [None]:
paths = ['hvd.hw39sc.json.bz2']
fr = FeatureReader(paths)
vol = next(fr.volumes())

In [None]:
darwin_ef = vol.tokenlist(pos=False, case=False) \
    .reset_index().drop(['section'], axis=1)
darwin_ef.columns = ['Page Number', 'token', 'count']
grouped_tokens = darwin_ef.groupby('Page Number')

In [None]:
ef_vader_sentiment_score = []
ef_textblob_sentiment_score = []
ef_afinn_sentiment_score = []

for name, group in grouped_tokens:
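    # Repeat each token `count` times as separate words ("word" * 2 would give "wordword")
    page_text = " ".join(" ".join([word] * int(count)) for word, count in zip(group['token'], group['count']))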

    # VADER Analysis
    sentiment_scores = analyzer.polarity_scores(page_text)
    ef_vader_sentiment_score.append(sentiment_scores['compound'])

    # TextBlob Analysis
    sentiment = TextBlob(page_text).sentiment
    ef_textblob_sentiment_score.append(sentiment.polarity)

    # AFINN Analysis
    sentiment_score = afinn.score(page_text)
    ef_afinn_sentiment_score.append(sentiment_score)

# Create a DataFrame with all sentiment analysis results
darwin_ef = pd.DataFrame({
    'page_number': [int(x) for x in grouped_tokens.groups.keys()],
    'ef_vader_sentiment_score': ef_vader_sentiment_score,
    'ef_textblob_sentiment_score': ef_textblob_sentiment_score,
    'ef_afinn_sentiment_score': ef_afinn_sentiment_score
})

In [None]:
# Normalize AFINN scores for EF
min_value = min(darwin_ef['ef_afinn_sentiment_score'])
max_value = max(darwin_ef['ef_afinn_sentiment_score'])

normalized_numbers = [(x - min_value) / (max_value - min_value) for x in darwin_ef['ef_afinn_sentiment_score']]

ef_afinn_normalized = [2 * x - 1 for x in normalized_numbers]

darwin_ef['ef_afinn_normalized'] = ef_afinn_normalized

darwin_ef[10:15]

In [None]:
# Subtract the number of pages in front matter from EF DataFrame
FRONT_MATTER_PAGES = 16
darwin_ef['page_number'] = darwin_ef['page_number'].astype(int) - FRONT_MATTER_PAGES

#### Emotional Valence Graph

In [None]:
# Smoothing the graph with rolling mean
WINDOW_SIZE = 20

columns_full_text = ['vader_sentiment_score', 'textblob_sentiment_score', 'afinn_normalized']

for col in columns_full_text:
    darwin_full_text[col] = darwin_full_text[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()

columns_ef = ['ef_vader_sentiment_score', 'ef_textblob_sentiment_score', 'ef_afinn_normalized']

for col in columns_ef:
    darwin_ef[col] = darwin_ef[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()

In [None]:
fig = go.Figure()

# Plotting for darwin_full_text DataFrame
sentiment_scores_full_text = {
    'vader_sentiment_score': 'Full Text VADER',
    'textblob_sentiment_score': 'Full Text TextBlob',
    'afinn_normalized': 'Full Text AFINN Normalized'
}

for column, label in sentiment_scores_full_text.items():
    fig.add_trace(go.Scatter(x=darwin_full_text['page_number'], y=darwin_full_text[column], mode='lines', name=label))

# Plotting for darwin_ef DataFrame
sentiment_scores_ef = {
    'ef_vader_sentiment_score': 'EF VADER',
    'ef_textblob_sentiment_score': 'EF TextBlob',
    'ef_afinn_normalized': 'EF AFINN Normalized'
}

for column, label in sentiment_scores_ef.items():
    fig.add_trace(go.Scatter(x=darwin_ef['page_number'], y=darwin_ef[column], mode='lines', name=label))

# Additional plot settings
fig.update_layout(
    title='Emotional Valence throughout "The Origin of Species"',
    xaxis_title='Page Number',
    yaxis_title='Sentiment Score',
    showlegend=True
)

fig.show()

### Exploring Large Language Models (LLMs)
During my project, I experimented with sentiment analysis using Large Language Models (LLMs) such as BERTweet and SiEBERT. However, the approach encountered several challenges. For the full text analysis, the length of content on each page often exceeded the token limit of these models. I attempted to segment the page content into individual sentences using spaCy, but occasionally, even these sentences were too lengthy. Truncating sentences to fit within the token limit compromised the accuracy of the analysis.
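A minimal sketch of one possible workaround for input limits (not the approach used in this project) is to split a long page into fixed-size chunks and score each chunk separately. The `chunk_words` helper is hypothetical, and it counts whitespace-separated words as a stand-in for tokens; real models count subword tokens, so actual limits would be reached sooner.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


print(chunk_words("the ship drew near the harbour", max_words=4))
# ['the ship drew near', 'the harbour']
```

Fixed-size chunks can still cut a sentence in half, so this trades the truncation problem for a context problem at the chunk boundaries.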

In the case of Extracted Features, the use of LLMs proved impractical. The token lists consist of isolated words without context, which negates the main advantage of LLMs: reading longer sentences or passages to understand the overall sentiment. Additionally, the computational demands of running LLMs exceeded the capabilities of my available hardware.

Given these limitations, I ultimately decided against including LLMs in the final iteration of my project.


### Conclusion

In analyzing the graphs for both the fiction and nonfiction examples, several key observations emerged:
- The disparity in sentiment scores between the full text and Extracted Features using the same tool was relatively minor. In contrast, there were more significant variations when comparing different tools analyzing the same input.
- TextBlob's sentiment scores generally hovered closer to neutral (0), while VADER's scores showed a greater deviation from neutrality.
- As for the overall sentiment direction, VADER typically produced the most positive scores. TextBlob's scores were moderately positive, while AFINN's normalized scores leaned towards the negative end of the scale.

These findings highlight the distinct characteristics and tendencies of each sentiment analysis tool when applied to both fiction and nonfiction works.
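
One way such comparisons could be quantified is the mean absolute difference between two score series over the pages they share. The sketch below uses made-up scores rather than results from this project, and `mean_abs_diff` is a hypothetical helper.

```python
def mean_abs_diff(scores_a, scores_b):
    """Mean absolute difference between two {page: score} series on shared pages."""
    shared = sorted(set(scores_a) & set(scores_b))
    if not shared:
        raise ValueError("the two series have no pages in common")
    return sum(abs(scores_a[p] - scores_b[p]) for p in shared) / len(shared)


full_text_scores = {1: 0.2, 2: 0.4, 3: -0.1}   # hypothetical values
ef_scores = {2: 0.3, 3: 0.1, 4: 0.5}           # hypothetical values

# Pages 2 and 3 are shared: (|0.4 - 0.3| + |-0.1 - 0.1|) / 2
print(round(mean_abs_diff(full_text_scores, ef_scores), 2))  # 0.15
```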

In closing, I would like to express my gratitude to Glen Layne-Worthey for his invaluable guidance and encouragement throughout this project. I am also deeply thankful to Ryan Dubnicek for sharing his technical expertise, which greatly aided this work.

### Further Readings
Bowers, Katherine and Quinn Dombrowski. "Katia and the Sentiment Snobs." The Data-Sitters Club. October 25, 2021. https://datasittersclub.github.io/site/dsc11.html.

Organisciak, Peter and Boris Capitanu. "Text Mining in Python through the HTRC Feature Reader." Programming Historian 5 (2016). https://doi.org/10.46430/phen0058.