# Notebook for Sentiment Analysis Using spaCy

Using spaCy with the spacytextblob extension (TextBlob for spaCy), we measure the overall sentiment of the articles in each year.

The notebook currently processes the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" for a more consistent path), but it will eventually be run on data from MisInfoText as well.

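From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, and originalDateYear. The Fakespeak configuration below additionally loads originalHeadline for the headline analysis at the end.

The main analysis processes text from the "originalBodyText" column.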

In [1]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
import pandas as pd
from dataset_config import BASE_FAKESPEAK_CONFIG, BASE_MISINFOTEXT_CONFIG
from helpers import get_groups, make_output_path, make_output_path_for_type

## Loading articles

In [2]:
fakespeak_config = BASE_FAKESPEAK_CONFIG | {
    "usecols": BASE_FAKESPEAK_CONFIG["usecols"] + ["originalHeadline"]
}

misinfotext_config = BASE_MISINFOTEXT_CONFIG

In [3]:
using_dataset = fakespeak_config

In [4]:
dataset_df = pd.read_excel(
    using_dataset["input_path"], 
    sheet_name=using_dataset["sheet_name"], 
    usecols=using_dataset["usecols"]
)

# Drop 2007 and 2008 because they contain too few articles
dataset_df = dataset_df[~dataset_df[using_dataset["year_col"]].isin([2007, 2008])]

dataset_df.head()

index,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,,2019
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",,2019
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,,2019
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,,2019
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,,2019


## Analyzing article sentiment using spaCy textblob

[spaCy textblob](https://spacy.io/universe/project/spacy-textblob/)

[Quick References](https://github.com/SamEdwardes/spacytextblob?tab=readme-ov-file#quick-reference)

The two most relevant values returned by textblob are:
* polarity: a float in [-1.0, 1.0] where -1.0 is extremely negative and 1.0 is extremely positive
* subjectivity: a float in [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective
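
To make the sign convention concrete, the following minimal sketch maps a polarity score to a coarse label using the same thresholds as the article counts in the summary table further down (greater than 0, less than 0, exactly 0). `label_polarity` is a hypothetical helper introduced only for illustration; it is not part of spaCy, TextBlob, or spacytextblob. The sample scores are copied from the polarity column shown later in this notebook.

```python
def label_polarity(polarity: float) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse label (hypothetical helper)."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError(f"polarity out of range: {polarity}")
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"  # exactly 0.0


# Three polarity values that appear in the notebook output below
labels = [label_polarity(p) for p in (0.12732, -0.270833, 0.0)]
print(labels)
```

Note that "neutral" here means a score of exactly 0.0. A tolerance band around zero would be a different design choice and is not what the summary table in this notebook does.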

In [5]:
# Load the spaCy pipeline and append the spacytextblob component
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('spacytextblob')


<spacytextblob.spacytextblob.SpacyTextBlob at 0x1130a237210>

In [6]:
dataset_df["doc"] = list(nlp.pipe(dataset_df[using_dataset["text_col"]]))
dataset_df["polarity"] = dataset_df["doc"].apply(lambda doc: doc._.blob.polarity)
dataset_df["subjectivity"] = dataset_df["doc"].apply(lambda doc: doc._.blob.subjectivity)

dataset_df.head()

index,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear,doc,polarity,subjectivity
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,,2019,"(Mexico, is, paying, for, the, Wall, through, ...",0.12732,0.451136
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",,2019,"(Chuck, Schumer, :, "", why, should, American, ...",0.155556,0.387654
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,,2019,"(Billions, of, dollars, are, sent, to, the, St...",-0.270833,0.366667
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,,2019,"(If, 50, Billion, $, $, were, set, aside, to, ...",0.0,1.0
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,,2019,"(Huge@#CD, 9, news, ., \n, @ncsbe, \n , sent, ...",0.0,0.066667


## Group the sentiments by year

In [7]:
years, years_dfs = get_groups(dataset_df, using_dataset["year_col"])
years_dfs[0].head()

index,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear,doc,polarity,subjectivity
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,,2019,"(Mexico, is, paying, for, the, Wall, through, ...",0.12732,0.451136
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",,2019,"(Chuck, Schumer, :, "", why, should, American, ...",0.155556,0.387654
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,,2019,"(Billions, of, dollars, are, sent, to, the, St...",-0.270833,0.366667
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,,2019,"(If, 50, Billion, $, $, were, set, aside, to, ...",0.0,1.0
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,,2019,"(Huge@#CD, 9, news, ., \n, @ncsbe, \n , sent, ...",0.0,0.066667
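
`get_groups` is imported from the local `helpers` module, and its implementation is not shown in this notebook. Judging only by how it is called (it returns the group keys together with a matching list of DataFrames), it might look roughly like the sketch below. The name `get_groups_sketch`, the use of `groupby`, and the sorted key order are all assumptions, and the toy data is made up.

```python
import pandas as pd


def get_groups_sketch(df: pd.DataFrame, col: str) -> tuple[list, list[pd.DataFrame]]:
    """Hypothetical stand-in for helpers.get_groups: split df by the values of col."""
    keys, frames = [], []
    # groupby with sort=True yields the groups in ascending key order
    for key, group in df.groupby(col, sort=True):
        keys.append(key)
        frames.append(group)
    return keys, frames


# Toy data, not taken from the Fakespeak dataset
toy = pd.DataFrame({"year": [2020, 2019, 2020], "polarity": [0.1, -0.2, 0.3]})
toy_years, toy_dfs = get_groups_sketch(toy, "year")
print(toy_years)
print([len(frame) for frame in toy_dfs])
```

Returning two parallel lists keeps the call sites simple: `zip(keys, frames)` pairs each key with its DataFrame, which is how the notebook consumes the real helper when writing one sheet per year.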


## Create summary table
Next we create a summary table containing the following information for each year:
* polarity_max: highest polarity
* polarity_min: lowest polarity
* polarity_avg: average polarity
* intensity_avg: average sentiment intensity, i.e. the mean of the absolute polarity values (sign ignored)
* subjectivity_max: highest subjectivity
* subjectivity_min: lowest subjectivity
* subjectivity_avg: average subjectivity
* positive_article_count: total number of articles with polarity > 0
* negative_article_count: total number of articles with polarity < 0
* neutral_article_count: total number of articles with polarity = 0
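
The difference between polarity_avg and intensity_avg is easiest to see on a small example. The sketch below uses made-up polarity scores and plain Python instead of pandas; the variable names mirror the summary columns but are otherwise only illustrative.

```python
from statistics import mean

# Made-up polarity scores for four articles
polarities = [0.5, -0.5, 0.0, 0.25]

# Signed average: positive and negative scores cancel each other out
polarity_avg = mean(polarities)

# Average magnitude: the sign is dropped before averaging
intensity_avg = mean(abs(p) for p in polarities)

# Counts by sign, matching the three *_article_count columns
positive_count = sum(p > 0 for p in polarities)
negative_count = sum(p < 0 for p in polarities)
neutral_count = sum(p == 0 for p in polarities)

print(polarity_avg, intensity_avg)
print(positive_count, negative_count, neutral_count)
```

A year with strongly positive and strongly negative articles can therefore have a polarity_avg close to zero while its intensity_avg stays high, which is why both columns are reported.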

In [8]:
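def get_summary_table(years: list[int], dfs: list[pd.DataFrame]) -> pd.DataFrame:
    """Build one summary row per year; years[i] labels the row computed from dfs[i]."""
    return pd.DataFrame(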
        data={
            "polarity_max": [df["polarity"].max() for df in dfs],
            "polarity_min": [df["polarity"].min() for df in dfs],
            "polarity_avg": [df["polarity"].mean() for df in dfs],
            "intensity_avg": [df["polarity"].abs().mean() for df in dfs],
            "subjectivity_max": [df["subjectivity"].max() for df in dfs],
            "subjectivity_min": [df["subjectivity"].min() for df in dfs],
            "subjectivity_avg": [df["subjectivity"].mean() for df in dfs],
            "positive_article_count": [df["polarity"][df["polarity"] > 0].count() for df in dfs],
            "negative_article_count": [df["polarity"][df["polarity"] < 0].count() for df in dfs],
            "neutral_article_count": [df["polarity"][df["polarity"] == 0].count() for df in dfs],
        },
        index=years
    )

In [9]:
summary_df = get_summary_table(years, years_dfs)
summary_df

year,polarity_max,polarity_min,polarity_avg,intensity_avg,subjectivity_max,subjectivity_min,subjectivity_avg,positive_article_count,negative_article_count,neutral_article_count
2019,0.7,-1.0,0.026148,0.145833,1.0,0.0,0.404988,138,84,61
2020,0.9375,-1.0,0.050932,0.134115,1.0,0.0,0.356599,382,201,190
2021,1.0,-0.9375,0.046474,0.132025,1.0,0.0,0.373022,366,195,145
2022,0.840625,-1.0,0.04671,0.125577,1.0,0.0,0.3593,246,123,113
2023,1.0,-1.0,0.022785,0.135252,1.0,0.0,0.363308,248,149,117
2024,0.8,-0.875,0.057826,0.141656,1.0,0.0,0.366642,101,51,51


In [10]:
types, types_dfs = get_groups(dataset_df, using_dataset["type_col"])
types_dfs[0].head()

index,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear,doc,polarity,subjectivity
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,Joe Biden Can’t Keep His Thoughts Straight,2019,"(Joe, Biden, has, a, message, for, the, public...",0.116049,0.259877
19,Politifact_FALSE_News and blog_605527,False,News and blog,Hollywood legend Tom Selleck has praised Donal...,Actor Tom Selleck: ‘I Would Say “F*ck You” To ...,2019,"(Hollywood, legend, Tom, Selleck, has, praised...",0.237263,0.614414
21,Politifact_FALSE_News and blog_868147,False,News and blog,"Hundreds of Congolese migrants, with who knows...",Border Patrol Surprise: Disease-Ridden Congole...,2019,"(Hundreds, of, Congolese, migrants, ,, with, w...",0.061036,0.310588
25,Politifact_FALSE_News and blog_944705,False,News and blog,David Steinberg released his latest report on ...,SHE’S TOAST: Latest Report by David Steinberg ...,2019,"(David, Steinberg, released, his, latest, repo...",0.054167,0.534722
40,Politifact_FALSE_News and blog_691427,False,News and blog,Nancy Pelosi is neck deep in Ukraine politics....,BREAKING EXCLUSIVE: Pelosi NECK DEEP in Ukrain...,2019,"(Nancy, Pelosi, is, neck, deep, in, Ukraine, p...",0.055,0.2425


## Write output to spreadsheet

In [11]:
def save_years(writer: pd.ExcelWriter, years: list[int], years_dfs: list[pd.DataFrame]) -> None:
    """Write one sheet per year (ID, polarity, subjectivity) plus a "Summary" sheet."""
    for year, df in zip(years, years_dfs):
        df.to_excel(
            writer,
            sheet_name=str(year),
            index=False,
            columns=[using_dataset["id_col"], "polarity", "subjectivity"]
        )

    get_summary_table(years, years_dfs).to_excel(writer, sheet_name="Summary")

In [12]:
output_path = make_output_path(using_dataset, "sentiment_analysis")

# The context manager closes (and saves) the workbook even if writing fails
with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
    save_years(writer, years, years_dfs)

In [13]:
# Use names that do not shadow the built-in `type` or overwrite the global years/years_dfs
for text_type, type_df in zip(types, types_dfs):
    type_years, type_years_dfs = get_groups(type_df, using_dataset["year_col"])

    output_path = make_output_path_for_type(using_dataset, text_type, "sentiment_analysis")

    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        save_years(writer, type_years, type_years_dfs)

## Repeat the analysis for headlines

In [14]:
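# Keep only rows that have a headline; .copy() avoids SettingWithCopyWarning when columns are reassigned below
dataset_df = dataset_df[dataset_df[using_dataset["headline_col"]].notna()].copy()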
dataset_df.head()

index,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear,doc,polarity,subjectivity
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,Joe Biden Can’t Keep His Thoughts Straight,2019,"(Joe, Biden, has, a, message, for, the, public...",0.116049,0.259877
19,Politifact_FALSE_News and blog_605527,False,News and blog,Hollywood legend Tom Selleck has praised Donal...,Actor Tom Selleck: ‘I Would Say “F*ck You” To ...,2019,"(Hollywood, legend, Tom, Selleck, has, praised...",0.237263,0.614414
21,Politifact_FALSE_News and blog_868147,False,News and blog,"Hundreds of Congolese migrants, with who knows...",Border Patrol Surprise: Disease-Ridden Congole...,2019,"(Hundreds, of, Congolese, migrants, ,, with, w...",0.061036,0.310588
25,Politifact_FALSE_News and blog_944705,False,News and blog,David Steinberg released his latest report on ...,SHE’S TOAST: Latest Report by David Steinberg ...,2019,"(David, Steinberg, released, his, latest, repo...",0.054167,0.534722
40,Politifact_FALSE_News and blog_691427,False,News and blog,Nancy Pelosi is neck deep in Ukraine politics....,BREAKING EXCLUSIVE: Pelosi NECK DEEP in Ukrain...,2019,"(Nancy, Pelosi, is, neck, deep, in, Ukraine, p...",0.055,0.2425


In [15]:
dataset_df["doc"] = list(nlp.pipe(dataset_df[using_dataset["headline_col"]]))
dataset_df["polarity"] = dataset_df["doc"].apply(lambda doc: doc._.blob.polarity)
dataset_df["subjectivity"] = dataset_df["doc"].apply(lambda doc: doc._.blob.subjectivity)

dataset_df.head()

index,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear,doc,polarity,subjectivity
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,Joe Biden Can’t Keep His Thoughts Straight,2019,"(Joe, Biden, Ca, n’t, Keep, His, Thoughts, Str...",0.2,0.4
19,Politifact_FALSE_News and blog_605527,False,News and blog,Hollywood legend Tom Selleck has praised Donal...,Actor Tom Selleck: ‘I Would Say “F*ck You” To ...,2019,"(Actor, Tom, Selleck, :, ‘, I, Would, Say, “, ...",0.2,0.1
21,Politifact_FALSE_News and blog_868147,False,News and blog,"Hundreds of Congolese migrants, with who knows...",Border Patrol Surprise: Disease-Ridden Congole...,2019,"(Border, Patrol, Surprise, :, Disease, -, Ridd...",0.359722,0.444444
25,Politifact_FALSE_News and blog_944705,False,News and blog,David Steinberg released his latest report on ...,SHE’S TOAST: Latest Report by David Steinberg ...,2019,"(SHE, ’S, TOAST, :, Latest, Report, by, David,...",0.5,0.9
40,Politifact_FALSE_News and blog_691427,False,News and blog,Nancy Pelosi is neck deep in Ukraine politics....,BREAKING EXCLUSIVE: Pelosi NECK DEEP in Ukrain...,2019,"(BREAKING, EXCLUSIVE, :, Pelosi, NECK, DEEP, i...",-0.033333,0.222222


In [16]:
years, years_dfs = get_groups(dataset_df, using_dataset["year_col"])
years_dfs[0].head()

index,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear,doc,polarity,subjectivity
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,Joe Biden Can’t Keep His Thoughts Straight,2019,"(Joe, Biden, Ca, n’t, Keep, His, Thoughts, Str...",0.2,0.4
19,Politifact_FALSE_News and blog_605527,False,News and blog,Hollywood legend Tom Selleck has praised Donal...,Actor Tom Selleck: ‘I Would Say “F*ck You” To ...,2019,"(Actor, Tom, Selleck, :, ‘, I, Would, Say, “, ...",0.2,0.1
21,Politifact_FALSE_News and blog_868147,False,News and blog,"Hundreds of Congolese migrants, with who knows...",Border Patrol Surprise: Disease-Ridden Congole...,2019,"(Border, Patrol, Surprise, :, Disease, -, Ridd...",0.359722,0.444444
25,Politifact_FALSE_News and blog_944705,False,News and blog,David Steinberg released his latest report on ...,SHE’S TOAST: Latest Report by David Steinberg ...,2019,"(SHE, ’S, TOAST, :, Latest, Report, by, David,...",0.5,0.9
40,Politifact_FALSE_News and blog_691427,False,News and blog,Nancy Pelosi is neck deep in Ukraine politics....,BREAKING EXCLUSIVE: Pelosi NECK DEEP in Ukrain...,2019,"(BREAKING, EXCLUSIVE, :, Pelosi, NECK, DEEP, i...",-0.033333,0.222222


In [17]:
summary_df = get_summary_table(years, years_dfs)
summary_df

year,polarity_max,polarity_min,polarity_avg,intensity_avg,subjectivity_max,subjectivity_min,subjectivity_avg,positive_article_count,negative_article_count,neutral_article_count
2019,0.5,-0.15,0.105456,0.132992,1.0,0.0,0.3361,11,4,8
2020,0.5,-0.8,-0.045323,0.109658,1.0,0.0,0.238005,17,37,66
2021,0.625,-1.0,0.000304,0.123265,1.0,0.0,0.279742,43,41,81
2022,0.5,-0.7,0.035902,0.110416,1.0,0.0,0.243935,25,15,40
2023,0.8,-1.0,0.008473,0.113216,1.0,0.0,0.271194,21,24,55
2024,0.25,-0.45,0.026395,0.095082,0.75,0.0,0.271272,10,3,8


In [18]:
types, types_dfs = get_groups(dataset_df, using_dataset["type_col"])
types_dfs[0].head()

index,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear,doc,polarity,subjectivity
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,Joe Biden Can’t Keep His Thoughts Straight,2019,"(Joe, Biden, Ca, n’t, Keep, His, Thoughts, Str...",0.2,0.4
19,Politifact_FALSE_News and blog_605527,False,News and blog,Hollywood legend Tom Selleck has praised Donal...,Actor Tom Selleck: ‘I Would Say “F*ck You” To ...,2019,"(Actor, Tom, Selleck, :, ‘, I, Would, Say, “, ...",0.2,0.1
21,Politifact_FALSE_News and blog_868147,False,News and blog,"Hundreds of Congolese migrants, with who knows...",Border Patrol Surprise: Disease-Ridden Congole...,2019,"(Border, Patrol, Surprise, :, Disease, -, Ridd...",0.359722,0.444444
25,Politifact_FALSE_News and blog_944705,False,News and blog,David Steinberg released his latest report on ...,SHE’S TOAST: Latest Report by David Steinberg ...,2019,"(SHE, ’S, TOAST, :, Latest, Report, by, David,...",0.5,0.9
40,Politifact_FALSE_News and blog_691427,False,News and blog,Nancy Pelosi is neck deep in Ukraine politics....,BREAKING EXCLUSIVE: Pelosi NECK DEEP in Ukrain...,2019,"(BREAKING, EXCLUSIVE, :, Pelosi, NECK, DEEP, i...",-0.033333,0.222222


In [19]:
output_path = make_output_path(using_dataset, "sentiment_analysis_headlines")

# The context manager closes (and saves) the workbook even if writing fails
with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
    save_years(writer, years, years_dfs)

In [20]:
# Use names that do not shadow the built-in `type` or overwrite the global years/years_dfs
for text_type, type_df in zip(types, types_dfs):
    type_years, type_years_dfs = get_groups(type_df, using_dataset["year_col"])

    output_path = make_output_path_for_type(using_dataset, text_type, "sentiment_analysis_headlines")

    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        save_years(writer, type_years, type_years_dfs)