# Notebook for Sentiment Analysis Using spaCy

Using spaCy with the spacytextblob extension (TextBlob for spaCy), we score the sentiment of each article and then summarize the overall sentiment of the articles in each year.

This notebook currently processes the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" to create a more consistent path), but will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.

In [None]:
!pip install xlsxwriter # for writing to multiple excel sheets

In [1]:
!pip install spacytextblob

In [2]:
!python -m textblob.download_corpora

In [None]:
!python -m spacy download en_core_web_md

In [None]:
import spacy
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Loading articles

In [14]:
from typing import Optional

class DatasetConfig:
    """Paths and column settings for one input dataset."""
    input_path: str
    output_path: str
    sheet_name: str
    usecols: Optional[list[str]]  # None means "read every column"
    id_col: str

    def __init__(self, input_path: str, output_path: str, sheet_name: str, usecols: Optional[list[str]], id_col: str):
        self.input_path = input_path
        self.output_path = output_path
        self.sheet_name = sheet_name
        self.usecols = usecols
        self.id_col = id_col
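
For comparison, the same kind of configuration holder can be written more compactly with the standard library's `dataclasses` module, which generates `__init__` from the annotated fields. This is only a sketch: the name `DatasetConfigSketch` and the example paths are hypothetical, chosen so that the class above is left untouched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetConfigSketch:
    """Hypothetical dataclass equivalent of DatasetConfig."""
    input_path: str
    output_path: str
    sheet_name: str
    usecols: Optional[list[str]]  # None is taken to mean "read every column"
    id_col: str


# Illustrative values only; these paths are placeholders, not real files.
example_config = DatasetConfigSketch(
    input_path="./data/example/input.xlsx",
    output_path="./data/example/output.xlsx",
    sheet_name="Working",
    usecols=None,
    id_col="ID",
)
print(example_config.id_col)
```

A dataclass also provides a readable `repr` and field-by-field equality, which can help when checking which config is active.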

In [48]:
fakespeak_config = DatasetConfig(
    # input_path="/content/drive/My Drive/fake_news_over_time/Fakespeak_ENG_modified.xlsx",  # Google Drive path when running in Colab
    input_path="./data/Fakespeak-ENG/Fakespeak-ENG modified.xlsx",
    output_path="./data/Fakespeak-ENG/Analysis_output/Fakespeak_sentiment_analysis.xlsx",
    sheet_name="Working",
    usecols=['ID', 'combinedLabel', 'originalTextType', 'originalBodyText', 'originalDateYear'],
    id_col="ID"
)

misinfotext_config = DatasetConfig(
    input_path="./data/MisInfoText/PolitiFact_original_modified.xlsx",
    output_path="./data/MisInfoText/Analysis_output/MisInfoText_sentiment_analysis.xlsx",
    sheet_name="Working",
    usecols=None,
    id_col="factcheckURL"
)

In [None]:
using_dataset = misinfotext_config

In [53]:
dataset_df = pd.read_excel(
    using_dataset.input_path, 
    sheet_name=using_dataset.sheet_name, 
    usecols=using_dataset.usecols
)

In [55]:
dataset_df.head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",2019
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,2019
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,2019
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,2019


## Analyzing article sentiment using spaCy textblob

[spaCy textblob](https://spacy.io/universe/project/spacy-textblob/)

[Quick References](https://github.com/SamEdwardes/spacytextblob?tab=readme-ov-file#quick-reference)

The two most relevant values returned by textblob are:
* polarity: a float in [-1.0, 1.0] where -1.0 is extremely negative and 1.0 is extremely positive
* subjectivity: a float in [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective

In [None]:
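# explicit import makes sure the "spacytextblob" pipeline factory is registered
from spacytextblob.spacytextblob import SpacyTextBlob
# load the spaCy model and append the TextBlob sentiment component
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('spacytextblob')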

<spacytextblob.spacytextblob.SpacyTextBlob at 0x200981b73e0>

In [56]:
# run the pipeline over every article; nlp.pipe processes the texts in batches
dataset_df["doc"] = list(nlp.pipe(dataset_df["originalBodyText"]))
dataset_df.head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear,doc
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ..."
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",2019,"(Chuck, Schumer, :, "", why, should, American, ..."
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,2019,"(Billions, of, dollars, are, sent, to, the, St..."
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,2019,"(If, 50, Billion, $, $, were, set, aside, to, ..."
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,2019,"(Huge@#CD, 9, news, ., \n, @ncsbe, \n , sent, ..."


In [57]:
sentiment_df = pd.DataFrame(data={
    "ID": dataset_df[using_dataset.id_col],
    "Polarity": dataset_df["doc"].apply(lambda doc: doc._.blob.polarity),
    "Subjectivity": dataset_df["doc"].apply(lambda doc: doc._.blob.subjectivity),
    "Year": dataset_df["originalDateYear"]
})
sentiment_df

Unnamed: 0,ID,Polarity,Subjectivity,Year
0,Politifact_FALSE_Social media_687276,0.127320,0.451136,2019
1,Politifact_FALSE_Social media_25111,0.155556,0.387654,2019
2,Politifact_FALSE_Social media_735424,-0.270833,0.366667,2019
3,Politifact_FALSE_Social media_594307,0.000000,1.000000,2019
4,Politifact_FALSE_Social media_839325,0.000000,0.066667,2019
...,...,...,...,...
2956,Politifact_Pants on Fire_Social media_876628,-0.001179,0.352417,2023
2957,Politifact_Pants on Fire_Social media_231170,0.000000,0.000000,2023
2958,Politifact_Pants on Fire_Social media_874359,0.087966,0.391290,2020
2959,Politifact_Pants on Fire_Social media_635418,0.400000,0.425000,2021


## Group the sentiments by year

In [58]:
# split sentiment_df into one dataframe per year (groups come out in sorted year order)
grouped_by_year = sentiment_df.groupby(by="Year")
sentiment_years_dfs = [year_df.copy() for _, year_df in grouped_by_year]
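
To make the per-year split concrete, here is a small sketch on made-up data. The dataframe `toy_df` and its values are invented for illustration; only the `groupby` mechanics carry over to the real data.

```python
import pandas as pd

# Made-up scores, only to show the mechanics of the split.
toy_df = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "Polarity": [0.5, -0.25, 0.0, 0.75],
    "Year": [2020, 2019, 2020, 2019],
})

# Iterating over a groupby yields (key, sub-dataframe) pairs,
# sorted by key by default.
toy_years = {year: year_df.copy() for year, year_df in toy_df.groupby("Year")}

print(list(toy_years))
print(toy_years[2019]["ID"].tolist())
```

Keeping the year as a dictionary key is one alternative to a plain list when a specific year needs to be looked up later.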

In [59]:
sentiment_years_dfs[0].head()

Unnamed: 0,ID,Polarity,Subjectivity,Year
0,Politifact_FALSE_Social media_687276,0.12732,0.451136,2019
1,Politifact_FALSE_Social media_25111,0.155556,0.387654,2019
2,Politifact_FALSE_Social media_735424,-0.270833,0.366667,2019
3,Politifact_FALSE_Social media_594307,0.0,1.0,2019
4,Politifact_FALSE_Social media_839325,0.0,0.066667,2019


## Create summary table
Next we create a summary table containing the following information for each year:
* pol_max: highest polarity
* pol_min: lowest polarity
* pol_avg: average polarity
* intensity_avg: average sentiment intensity, i.e. the mean of the absolute polarity values (the sign is ignored)
* subj_max: highest subjectivity
* subj_min: lowest subjectivity
* subj_avg: average subjectivity
* pos_article_count: total number of articles with polarity > 0
* neg_article_count: total number of articles with polarity < 0
* neu_article_count: total number of articles with polarity = 0
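
The difference between pol_avg and intensity_avg is easiest to see on a tiny example. The sketch below uses plain Python and made-up polarity scores for a single hypothetical year; it illustrates the definitions above and is not the pandas code used in the notebook.

```python
from statistics import mean

# Made-up polarity scores for a single hypothetical year.
polarities = [0.5, -0.5, 0.25, 0.0]

pol_avg = mean(polarities)                         # signed average
intensity_avg = mean([abs(p) for p in polarities]) # average magnitude
pos_article_count = sum(p > 0 for p in polarities)
neg_article_count = sum(p < 0 for p in polarities)
neu_article_count = sum(p == 0 for p in polarities)

print(pol_avg, intensity_avg)
print(pos_article_count, neg_article_count, neu_article_count)
```

Positive and negative scores cancel out in pol_avg, so it can sit near zero even when individual articles are strongly worded; intensity_avg does not cancel and therefore reflects how strong the sentiment is regardless of direction.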

In [60]:
summary_df = pd.DataFrame(
    data={
        "pol_max": [df["Polarity"].max() for df in sentiment_years_dfs],
        "pol_min": [df["Polarity"].min() for df in sentiment_years_dfs],
        "pol_avg": [df["Polarity"].mean() for df in sentiment_years_dfs],
        "intensity_avg": [df["Polarity"].abs().mean() for df in sentiment_years_dfs],
        "subj_max": [df["Subjectivity"].max() for df in sentiment_years_dfs],
        "subj_min": [df["Subjectivity"].min() for df in sentiment_years_dfs],
        "subj_avg": [df["Subjectivity"].mean() for df in sentiment_years_dfs],
        "pos_article_count": [df["Polarity"][df["Polarity"] > 0].count() for df in sentiment_years_dfs],
        "neg_article_count": [df["Polarity"][df["Polarity"] < 0].count() for df in sentiment_years_dfs],
        "neu_article_count": [df["Polarity"][df["Polarity"] == 0].count() for df in sentiment_years_dfs],
    },
    index=[df["Year"].iloc[0] for df in sentiment_years_dfs]
)
summary_df

Unnamed: 0,pol_max,pol_min,pol_avg,intensity_avg,subj_max,subj_min,subj_avg,pos_article_count,neg_article_count,neu_article_count
2019,0.7,-1.0,0.026148,0.145833,1.0,0.0,0.404988,138,84,61
2020,0.9375,-1.0,0.050932,0.134115,1.0,0.0,0.356599,382,201,190
2021,1.0,-0.9375,0.046474,0.132025,1.0,0.0,0.373022,366,195,145
2022,0.840625,-1.0,0.04671,0.125577,1.0,0.0,0.3593,246,123,113
2023,1.0,-1.0,0.022785,0.135252,1.0,0.0,0.363308,248,149,117
2024,0.8,-0.875,0.057826,0.141656,1.0,0.0,0.366642,101,51,51


## Write output to spreadsheet

In [61]:
# write each year's dataframe to its own worksheet, plus a summary sheet;
# the context manager saves and closes the workbook on exit
with pd.ExcelWriter(using_dataset.output_path, engine="xlsxwriter") as writer:
    for df in sentiment_years_dfs:
        year = str(df["Year"].iloc[0])
        df.to_excel(writer, sheet_name=year, index=False)

    summary_df.to_excel(writer, sheet_name="summary")
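
Writing the workbook will likely fail if the output folder does not exist yet. The sketch below shows one way to create the folder first, using `pathlib`. The temporary directory and file name are placeholders standing in for `using_dataset.output_path`, which is an assumption made so the example can run anywhere.

```python
import tempfile
from pathlib import Path

# Hypothetical output location inside a temporary directory,
# standing in for using_dataset.output_path.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "Analysis_output" / "example_sentiment_analysis.xlsx"

    # Create the parent folder (and any missing ancestors); do nothing if it exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    folder_created = output_path.parent.is_dir()

print(folder_created)
```

In the notebook, the equivalent call would be `Path(using_dataset.output_path).parent.mkdir(parents=True, exist_ok=True)`, placed just before the writer is created.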