# Notebook for Sentiment Analysis Using spaCy

Using spaCy with the spacytextblob extension (TextBlob for spaCy), we want to find the overall sentiment of the articles from each year.

This notebook currently processes the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" for a more consistent path), but it will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.

In [None]:
!pip install xlsxwriter # for writing to multiple excel sheets

In [None]:
!pip install spacytextblob

In [None]:
!python -m textblob.download_corpora

In [None]:
!python -m spacy download en_core_web_md

In [None]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Loading articles

In [None]:
# named input_path to avoid shadowing Python's built-in input()
input_path = '/content/drive/My Drive/fake_news_over_time/Fakespeak_ENG_modified.xlsx'

In [None]:
fakespeak_df = pd.read_excel(input_path, sheet_name="Working", usecols=['ID', 'combinedLabel', 'originalTextType', 'originalBodyText', 'originalDateYear'])

In [None]:
fakespeak_df.head()

| | ID | combinedLabel | originalTextType | originalBodyText | originalDateYear |
|---|---|---|---|---|---|
| 0 | Politifact_FALSE_Social media_687276 | False | Social media | Mexico is paying for the Wall through the new ... | 2019 |
| 1 | Politifact_FALSE_Social media_25111 | False | Social media | Chuck Schumer: "why should American citizens b... | 2019 |
| 2 | Politifact_FALSE_Social media_735424 | False | Social media | Billions of dollars are sent to the State of C... | 2019 |
| 3 | Politifact_FALSE_Social media_594307 | False | Social media | If 50 Billion $$ were set aside to go towards ... | 2019 |
| 4 | Politifact_FALSE_Social media_839325 | False | Social media | Huge@#CD 9 news. \n@ncsbe\n sent letter to eve... | 2019 |


## Analyzing article sentiment using spaCy textblob

[spaCy textblob](https://spacy.io/universe/project/spacy-textblob/)

[Quick References](https://github.com/SamEdwardes/spacytextblob?tab=readme-ov-file#quick-reference)

The two most relevant values returned by TextBlob are:
* polarity: a float in [-1.0, 1.0] where -1.0 is extremely negative and 1.0 is extremely positive
* subjectivity: a float in [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective
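
To make the polarity scale concrete, here is a minimal sketch of how a score maps to a sentiment class. The helper `label_polarity` is hypothetical (it is not part of spaCy or spacytextblob); it assumes the same sign-based cut-offs that the summary table in this notebook uses for its article counts.

```python
def label_polarity(polarity: float) -> str:
    """Classify a polarity score in [-1.0, 1.0] by its sign."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError(f"polarity must be in [-1.0, 1.0], got {polarity}")
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# example scores on the polarity scale
for score in (0.12732, -0.270833, 0.0):
    print(score, label_polarity(score))
```

Note that a score of exactly 0.0 often means TextBlob found no sentiment-bearing words in the text, rather than a deliberately balanced tone.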

In [None]:
# load the spaCy model and add the TextBlob sentiment component
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('spacytextblob')

# lists that collect one value per article
ids = []
polarities = []
subjectivities = []
years = []

In [None]:
# run each article through the pipeline and record its sentiment scores
for index, row in fakespeak_df.iterrows():
  article = nlp(row['originalBodyText'])
  year = row['originalDateYear']
  pol = article._.blob.polarity
  sub = article._.blob.subjectivity
  text_id = row['ID']

  ids.append(text_id)
  polarities.append(pol)
  subjectivities.append(sub)
  years.append(year)

In [None]:
# dictionary to map the lists
tags = {
    'ID': ids,
    'Polarity': polarities,
    'Subjectivity': subjectivities,
    'Year': years
}

# create a new dataframe containing the sentiment scores
sentiment_df = pd.DataFrame(tags)

In [None]:
sentiment_df.head()

| | ID | Polarity | Subjectivity | Year |
|---|---|---|---|---|
| 0 | Politifact_FALSE_Social media_687276 | 0.12732 | 0.451136 | 2019 |
| 1 | Politifact_FALSE_Social media_25111 | 0.155556 | 0.387654 | 2019 |
| 2 | Politifact_FALSE_Social media_735424 | -0.270833 | 0.366667 | 2019 |
| 3 | Politifact_FALSE_Social media_594307 | 0.0 | 1.0 | 2019 |
| 4 | Politifact_FALSE_Social media_839325 | 0.0 | 0.066667 | 2019 |


## Filter the sentiments by year

In [None]:
# create new dataframes holding just the articles from each year
sent_19_df = sentiment_df.loc[sentiment_df['Year'] == 2019]
sent_20_df = sentiment_df.loc[sentiment_df['Year'] == 2020]
sent_21_df = sentiment_df.loc[sentiment_df['Year'] == 2021]
sent_22_df = sentiment_df.loc[sentiment_df['Year'] == 2022]
sent_23_df = sentiment_df.loc[sentiment_df['Year'] == 2023]
sent_24_df = sentiment_df.loc[sentiment_df['Year'] == 2024]
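
As an alternative to one hand-written filter per year, the split could be derived from the data itself. The sketch below is an illustration under stated assumptions, not the method this notebook uses: `toy_df` and `frames_by_year` are hypothetical names, and the rows are made up. It relies on `DataFrame.groupby`, which yields `(key, sub-frame)` pairs when iterated.

```python
import pandas as pd

# made-up stand-in for sentiment_df
toy_df = pd.DataFrame({
    'ID': ['a', 'b', 'c'],
    'Polarity': [0.1, -0.2, 0.0],
    'Subjectivity': [0.4, 0.3, 0.0],
    'Year': [2019, 2020, 2019],
})

# one dataframe per year, keyed by the year value
frames_by_year = {year: group for year, group in toy_df.groupby('Year')}

print(sorted(frames_by_year))
print(len(frames_by_year[2019]))
```

A year with no articles simply has no key in the dictionary, so code written this way would not need to change when data from a new year is added.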

In [None]:
sent_19_df.head()

| | ID | Polarity | Subjectivity | Year |
|---|---|---|---|---|
| 0 | Politifact_FALSE_Social media_687276 | 0.12732 | 0.451136 | 2019 |
| 1 | Politifact_FALSE_Social media_25111 | 0.155556 | 0.387654 | 2019 |
| 2 | Politifact_FALSE_Social media_735424 | -0.270833 | 0.366667 | 2019 |
| 3 | Politifact_FALSE_Social media_594307 | 0.0 | 1.0 | 2019 |
| 4 | Politifact_FALSE_Social media_839325 | 0.0 | 0.066667 | 2019 |


## Create summary table
Next we create a summary table containing the following information for each year:
* pol_max: highest polarity
* pol_min: lowest polarity
* pol_avg: average polarity
* intensity_avg: average sentiment intensity, i.e. the mean of the absolute polarity values (strength of sentiment regardless of direction)
* subj_max: highest subjectivity
* subj_min: lowest subjectivity
* subj_avg: average subjectivity
* pos_article_count: total number of articles with polarity > 0
* neg_article_count: total number of articles with polarity < 0
* neu_article_count: total number of articles with polarity = 0
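
The difference between pol_avg and intensity_avg is easy to miss, so here is a small sketch on made-up polarity values (the numbers are illustrative and not taken from the dataset). It uses only the standard library, and `toy_polarities` is a hypothetical name.

```python
from statistics import mean

# made-up polarity scores for one hypothetical year
toy_polarities = [0.5, -0.5, 0.0, 0.25]

pol_avg = mean(toy_polarities)                         # signed average
intensity_avg = mean(abs(p) for p in toy_polarities)   # average of absolute values

# counts by sign of the polarity score
pos_article_count = sum(p > 0 for p in toy_polarities)
neg_article_count = sum(p < 0 for p in toy_polarities)
neu_article_count = sum(p == 0 for p in toy_polarities)

print(pol_avg, intensity_avg)
print(pos_article_count, neg_article_count, neu_article_count)
```

Positive and negative scores cancel each other in pol_avg but not in intensity_avg, so a year with strongly worded articles on both sides can have a pol_avg near zero and still show a high intensity_avg.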

In [None]:
# helper function to summarize one year's data
# year_df is the filtered dataframe for a single year
# returns a row (list) of aggregate statistics for that year
def summarize(year_df):
  row = []
  # calculate polarity data
  row.append(year_df['Polarity'].max())
  row.append(year_df['Polarity'].min())
  row.append(year_df['Polarity'].mean())
  row.append(year_df['Polarity'].abs().mean())

  # calculate subjectivity data
  row.append(year_df['Subjectivity'].max())
  row.append(year_df['Subjectivity'].min())
  row.append(year_df['Subjectivity'].mean())

  # calculate sentiment count
  row.append(year_df['Polarity'][year_df['Polarity'] > 0].count())
  row.append(year_df['Polarity'][year_df['Polarity'] < 0].count())
  row.append(year_df['Polarity'][year_df['Polarity'] == 0].count())

  return row

In [None]:
summary = []

summary.append(summarize(sent_19_df))
summary.append(summarize(sent_20_df))
summary.append(summarize(sent_21_df))
summary.append(summarize(sent_22_df))
summary.append(summarize(sent_23_df))
summary.append(summarize(sent_24_df))

summary_df = pd.DataFrame(summary)

# row and column headers
row_labels = ['2019', '2020', '2021', '2022', '2023', '2024']
col_labels = ['pol_max', 'pol_min', 'pol_avg', 'intensity_avg', 'subj_max', 'subj_min',
              'subj_avg', 'pos_article_count', 'neg_article_count', 'neu_article_count']

summary_df.index = row_labels
summary_df.columns = col_labels

In [None]:
summary_df

| | pol_max | pol_min | pol_avg | intensity_avg | subj_max | subj_min | subj_avg | pos_article_count | neg_article_count | neu_article_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | 0.7 | -1.0 | 0.026148 | 0.145833 | 1.0 | 0.0 | 0.404988 | 138 | 84 | 61 |
| 2020 | 0.9375 | -1.0 | 0.050932 | 0.134115 | 1.0 | 0.0 | 0.356599 | 382 | 201 | 190 |
| 2021 | 1.0 | -0.9375 | 0.046474 | 0.132025 | 1.0 | 0.0 | 0.373022 | 366 | 195 | 145 |
| 2022 | 0.840625 | -1.0 | 0.04671 | 0.125577 | 1.0 | 0.0 | 0.3593 | 246 | 123 | 113 |
| 2023 | 1.0 | -1.0 | 0.022785 | 0.135252 | 1.0 | 0.0 | 0.363308 | 248 | 149 | 117 |
| 2024 | 0.8 | -0.875 | 0.057826 | 0.141656 | 1.0 | 0.0 | 0.366642 | 101 | 51 | 51 |


## Write output to spreadsheet

In [None]:
output = '/content/drive/My Drive/fake_news_over_time/sentiment_analysis.xlsx'

In [None]:
# create an Excel writer to initialize a new workbook;
# the with-block saves and closes the output file on exit
with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
  # write dataframes to different worksheets
  sent_19_df.to_excel(writer, sheet_name="2019", index=False)
  sent_20_df.to_excel(writer, sheet_name="2020", index=False)
  sent_21_df.to_excel(writer, sheet_name="2021", index=False)
  sent_22_df.to_excel(writer, sheet_name="2022", index=False)
  sent_23_df.to_excel(writer, sheet_name="2023", index=False)
  sent_24_df.to_excel(writer, sheet_name="2024", index=False)
  summary_df.to_excel(writer, sheet_name="summary")