# Notebook for Sentiment Analysis Using spaCy

We use spaCy with the spacytextblob extension (TextBlob sentiment scores exposed through a spaCy pipeline component) to measure the overall sentiment of the articles in each year.

The notebook currently processes the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" for a more consistent path); it will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.

In [1]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
import pandas as pd
from helpers import load_data, get_groups

## Loading articles

In [2]:
dataset_df = load_data()
dataset_df.head()

Unnamed: 0,id,text,headline,text_type,year
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016
1,http://www.politifact.com/california/statement...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017
4,http://www.politifact.com/california/statement...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017


## Analyzing article sentiment using spaCy textblob

[spaCy textblob](https://spacy.io/universe/project/spacy-textblob/)

[Quick References](https://github.com/SamEdwardes/spacytextblob?tab=readme-ov-file#quick-reference)

The two most relevant values returned by TextBlob are:
* polarity: a float in [-1.0, 1.0] where -1.0 is extremely negative and 1.0 is extremely positive
* subjectivity: a float in [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective
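
As a small illustration of how these scores are read later in the notebook, the sketch below maps a polarity value to a label by its sign. The name `polarity_label` is hypothetical and not part of spacytextblob; it only mirrors the > 0 / < 0 / = 0 split used for the article counts in the summary table.

```python
def polarity_label(polarity: float) -> str:
    """Label a polarity score by its sign (hypothetical helper, not a library function)."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity is expected to lie in [-1.0, 1.0]")
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# Example scores: two article-body values from this notebook and an exact zero
for score in (0.026219, -0.020094, 0.0):
    print(score, polarity_label(score))
```

Note that a score only counts as neutral when it is exactly 0; a weakly positive article such as 0.026219 is still labelled positive.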

In [3]:
# Load the spaCy model and add the spacytextblob component to its pipeline
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('spacytextblob')


<spacytextblob.spacytextblob.SpacyTextBlob at 0x215b7781150>

In [4]:
dataset_df["doc"] = list(nlp.pipe(dataset_df["text"]))
dataset_df["polarity"] = dataset_df["doc"].apply(lambda doc: doc._.blob.polarity)
dataset_df["subjectivity"] = dataset_df["doc"].apply(lambda doc: doc._.blob.subjectivity)

dataset_df.head()

Unnamed: 0,id,text,headline,text_type,year,doc,polarity,subjectivity
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",0.026219,0.244984
1,http://www.politifact.com/california/statement...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016,"(Sacramento, ,, CA, -, United, States, Senator...",0.021154,0.41859
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017,"(We, should, anticipate, black, and, gray, mar...",-0.020094,0.394264
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017,"(As, a, ballot, initiative, calling, for, repe...",0.028671,0.420328
4,http://www.politifact.com/california/statement...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017,"(WASHINGTON, ,, DC, , The, House, of, Represe...",-0.04651,0.418955


## Filter the sentiments by year
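
The cell below relies on `get_groups` from the local `helpers` module, whose source is not shown in this notebook. Judging from how it is called, it seems to return the distinct values of a column together with one sub-DataFrame per value. The following is only a guess at that behaviour, written with pandas `groupby`; the name `get_groups_sketch` and the toy data are made up for illustration.

```python
import pandas as pd


def get_groups_sketch(df: pd.DataFrame, column: str) -> tuple[list, list[pd.DataFrame]]:
    """Split df into one sub-DataFrame per distinct value of column (keys sorted)."""
    keys, frames = [], []
    # groupby yields (key, sub-DataFrame) pairs; sort=True orders them by key
    for key, group in df.groupby(column, sort=True):
        keys.append(key)
        frames.append(group)
    return keys, frames


# Toy data standing in for the real dataset
toy_df = pd.DataFrame({"year": [2017, 2016, 2017], "polarity": [0.1, -0.2, 0.0]})
toy_years, toy_dfs = get_groups_sketch(toy_df, "year")
print(toy_years)
print(toy_dfs[0])
```

Each sub-DataFrame keeps the row index of the original frame, which would explain why the 2009 group shown below starts at index 433.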

In [5]:
years, years_dfs = get_groups(dataset_df, "year")
years_dfs[0].head()

Unnamed: 0,id,text,headline,text_type,year,doc,polarity,subjectivity
433,http://www.politifact.com/truth-o-meter/statem...,"Washington, D.C., Mar 25 - In response to sugg...",Bachmann Demands Truth: Will Obama Administrat...,Press release,2009,"(Washington, ,, D.C., ,, Mar, 25, -, In, respo...",-0.002273,0.178409
434,http://www.politifact.com/truth-o-meter/statem...,When most Americans talk about the need for he...,Taxpayer-Funded Abortion Is Not Health-Care Re...,News and blog,2009,"(When, most, Americans, talk, about, the, need...",0.090897,0.390104
435,http://www.politifact.com/truth-o-meter/statem...,A number of people in the news analysis busine...,One of these things is not like the other,News and blog,2009,"(A, number, of, people, in, the, news, analysi...",0.017545,0.364272
436,http://www.politifact.com/truth-o-meter/statem...,Yesterday President Obama responded to my stat...,,Social media,2009,"(Yesterday, President, Obama, responded, to, m...",0.124426,0.503137
437,http://www.politifact.com/truth-o-meter/statem...,Secretary of Defense Robert Gates is extremely...,"Military to Pledge Oath To Obama, Not Constitu...",News and blog,2009,"(Secretary, of, Defense, Robert, Gates, is, ex...",0.013445,0.387886


## Create summary table
Next we create a summary table containing the following information for each year:
* polarity_max: highest polarity
* polarity_min: lowest polarity
* polarity_avg: average polarity
* intensity_avg: average sentiment strength regardless of direction (i.e. the mean of the absolute polarity values)
* subjectivity_max: highest subjectivity
* subjectivity_min: lowest subjectivity
* subjectivity_avg: average subjectivity
* positive_article_count: total number of articles with polarity > 0
* negative_article_count: total number of articles with polarity < 0
* neutral_article_count: total number of articles with polarity = 0
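
To see why polarity_avg and intensity_avg are both reported, consider a tiny worked example with made-up scores (not taken from the dataset): positive and negative articles cancel each other in the signed average, but not in the average of absolute values.

```python
from statistics import mean

# Made-up polarity scores for three articles (illustration only)
polarities = [0.5, -0.5, 0.0]

polarity_avg = mean(polarities)                   # signed scores cancel out
intensity_avg = mean(abs(p) for p in polarities)  # direction is ignored

print(polarity_avg, round(intensity_avg, 4))
```

Here the signed average is 0 even though two of the three articles carry clear sentiment, while the intensity average is one third.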

In [6]:
def get_summary_table(years: list[int], dfs: list[pd.DataFrame]) -> pd.DataFrame:
    """Build one row of polarity/subjectivity statistics per year.

    `years` and `dfs` are parallel lists: dfs[i] holds the articles for years[i].
    """
    return pd.DataFrame(
        data={
            "polarity_max": [df["polarity"].max() for df in dfs],
            "polarity_min": [df["polarity"].min() for df in dfs],
            "polarity_avg": [df["polarity"].mean() for df in dfs],
            "intensity_avg": [df["polarity"].abs().mean() for df in dfs],
            "subjectivity_max": [df["subjectivity"].max() for df in dfs],
            "subjectivity_min": [df["subjectivity"].min() for df in dfs],
            "subjectivity_avg": [df["subjectivity"].mean() for df in dfs],
            "positive_article_count": [df["polarity"][df["polarity"] > 0].count() for df in dfs],
            "negative_article_count": [df["polarity"][df["polarity"] < 0].count() for df in dfs],
            "neutral_article_count": [df["polarity"][df["polarity"] == 0].count() for df in dfs],
        },
        index=years,
    )

In [7]:
summary_df = get_summary_table(years, years_dfs)
summary_df

year,polarity_max,polarity_min,polarity_avg,intensity_avg,subjectivity_max,subjectivity_min,subjectivity_avg,positive_article_count,negative_article_count,neutral_article_count
2009,0.237596,-0.044249,0.070177,0.07565,0.607064,0.178409,0.416238,15,2,0
2010,0.375714,-0.09498,0.095071,0.110616,0.608172,0.303333,0.442185,20,3,0
2011,0.255,-0.33,0.075608,0.103913,0.62807,0.28112,0.436895,37,7,0
2012,0.206056,-0.054745,0.079749,0.088951,0.594855,0.312187,0.463587,23,5,0
2013,0.25,-0.2625,0.061929,0.08999,0.6375,0.0,0.382729,47,10,4
2014,0.5,-0.0875,0.082586,0.093524,0.7,0.0,0.415251,24,8,2
2015,0.246599,-0.475,0.017919,0.084202,1.0,0.0,0.399886,28,9,5
2016,0.8,-0.7,0.082629,0.119154,1.0,0.0,0.425452,66,19,6
2017,0.7375,-1.0,0.045831,0.117952,1.0,0.0,0.415415,145,55,23
2018,0.54,-0.5,0.055117,0.118204,0.9,0.0,0.449506,84,35,3


In [8]:
types, types_dfs = get_groups(dataset_df, "text_type")
types_dfs[0].head()

Unnamed: 0,id,text,headline,text_type,year,doc,polarity,subjectivity
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",0.026219,0.244984
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017,"(We, should, anticipate, black, and, gray, mar...",-0.020094,0.394264
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017,"(As, a, ballot, initiative, calling, for, repe...",0.028671,0.420328
6,http://www.politifact.com/california/statement...,"Recently, a group of special interests threate...","Repeal Californias gas tax increase, says GOP ...",News and blog,2017,"(Recently, ,, a, group, of, special, interests...",-0.076841,0.534678
7,http://www.politifact.com/california/statement...,"COSTA MESA, Orange County It was a surreal vi...","The pro-Russia, pro-weed, pro-Assange GOP cong...",News and blog,2017,"(COSTA, MESA, ,, Orange, County, , It, was, a...",0.146143,0.455828


## Write output to spreadsheet

In [9]:
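def save_years(writer: pd.ExcelWriter, years: list[int], years_dfs: list[pd.DataFrame]) -> None:
    """Write one sheet per year plus a "Summary" sheet to an open Excel writer."""
    for year, df in zip(years, years_dfs):
        df.to_excel(
            writer,
            sheet_name=str(year),
            index=False,
            columns=["id", "polarity", "subjectivity", "year"],
        )

    # The summary keeps its index so that the year appears as the first column
    get_summary_table(years, years_dfs).to_excel(writer, sheet_name="Summary")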

In [10]:
# The context manager closes (and saves) the workbook once writing is done
with pd.ExcelWriter("./output/sentiment.xlsx", engine="xlsxwriter") as writer:
    save_years(writer, years, years_dfs)

In [11]:
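# One workbook per text type; assumes ./output/<type_str>/ already exists
for text_type, type_df in zip(types, types_dfs):
    # Separate names avoid shadowing the built-in `type` and overwriting `years`
    type_years, type_years_dfs = get_groups(type_df, "year")

    type_str = str(text_type).lower().replace(" ", "_")

    with pd.ExcelWriter(f"./output/{type_str}/sentiment_{type_str}.xlsx", engine="xlsxwriter") as writer:
        save_years(writer, type_years, type_years_dfs)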
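
The per-type workbooks are written into one sub-folder per text type, and the write will fail if that folder does not exist yet. A minimal sketch of building the path and creating the folder first is shown below; `sentiment_output_path` and its parameters are hypothetical names and not part of `helpers`. The example runs in a temporary directory so that it does not touch the real `./output` folder.

```python
import tempfile
from pathlib import Path


def sentiment_output_path(text_type: str, base_dir: str = "./output", suffix: str = "") -> Path:
    """Build the per-type workbook path and create its folder if needed (hypothetical helper)."""
    # Same normalisation as in the notebook: lower case, spaces replaced by underscores
    type_str = str(text_type).lower().replace(" ", "_")
    folder = Path(base_dir) / type_str
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"sentiment_{type_str}{suffix}.xlsx"


with tempfile.TemporaryDirectory() as tmp:
    path = sentiment_output_path("News and blog", base_dir=tmp, suffix="_headlines")
    example_name = path.name
    example_folder_exists = path.parent.is_dir()

print(example_name)           # sentiment_news_and_blog_headlines.xlsx
print(example_folder_exists)  # True
```

The returned path could then be passed to `pd.ExcelWriter` in place of the f-string used in the loop.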

## Repeat the analysis for headlines

In [12]:
# Get rows with existing headlines
dataset_headlines_df = dataset_df[dataset_df["headline"].notna()].copy()
dataset_headlines_df.head()

Unnamed: 0,id,text,headline,text_type,year,doc,polarity,subjectivity
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",0.026219,0.244984
1,http://www.politifact.com/california/statement...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016,"(Sacramento, ,, CA, -, United, States, Senator...",0.021154,0.41859
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017,"(We, should, anticipate, black, and, gray, mar...",-0.020094,0.394264
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017,"(As, a, ballot, initiative, calling, for, repe...",0.028671,0.420328
4,http://www.politifact.com/california/statement...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017,"(WASHINGTON, ,, DC, , The, House, of, Represe...",-0.04651,0.418955


In [13]:
dataset_headlines_df["doc"] = list(nlp.pipe(dataset_headlines_df["headline"]))
dataset_headlines_df["polarity"] = dataset_headlines_df["doc"].apply(lambda doc: doc._.blob.polarity)
dataset_headlines_df["subjectivity"] = dataset_headlines_df["doc"].apply(lambda doc: doc._.blob.subjectivity)

dataset_headlines_df.head()

Unnamed: 0,id,text,headline,text_type,year,doc,polarity,subjectivity
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Multiple, States, Have, Agreed, To, Implement...",0.0,0.0
1,http://www.politifact.com/california/statement...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016,"(U.S., Senator, Dianne, Feinstein, Opposes, Pr...",0.0,0.0
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017,"(Why, you, should, buy, a, locking, gasoline, ...",0.0,0.0
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017,"(California, Gas, -, Tax, -, Hike, Repeal, Cam...",0.0,0.0
4,http://www.politifact.com/california/statement...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017,"(Rep., Chu, Decries, "", Heartless, "", ACA, Rep...",0.0,0.0


In [14]:
years, years_dfs = get_groups(dataset_headlines_df, "year")
years_dfs[0].head()

Unnamed: 0,id,text,headline,text_type,year,doc,polarity,subjectivity
433,http://www.politifact.com/truth-o-meter/statem...,"Washington, D.C., Mar 25 - In response to sugg...",Bachmann Demands Truth: Will Obama Administrat...,Press release,2009,"(Bachmann, Demands, Truth, :, Will, Obama, Adm...",0.0,0.0
434,http://www.politifact.com/truth-o-meter/statem...,When most Americans talk about the need for he...,Taxpayer-Funded Abortion Is Not Health-Care Re...,News and blog,2009,"(Taxpayer, -, Funded, Abortion, Is, Not, Healt...",0.0,0.0
435,http://www.politifact.com/truth-o-meter/statem...,A number of people in the news analysis busine...,One of these things is not like the other,News and blog,2009,"(One, of, these, things, is, not, like, the, o...",-0.125,0.375
437,http://www.politifact.com/truth-o-meter/statem...,Secretary of Defense Robert Gates is extremely...,"Military to Pledge Oath To Obama, Not Constitu...",News and blog,2009,"(Military, to, Pledge, Oath, To, Obama, ,, Not...",-0.1,0.1
438,http://www.politifact.com/truth-o-meter/statem...,"Now that the so-called stimulus plan is law, w...",Two GOP Governors on the Stimulus,News and blog,2009,"(Two, GOP, Governors, on, the, Stimulus)",0.0,0.0


In [15]:
summary_df = get_summary_table(years, years_dfs)
summary_df

year,polarity_max,polarity_min,polarity_avg,intensity_avg,subjectivity_max,subjectivity_min,subjectivity_avg,positive_article_count,negative_article_count,neutral_article_count
2009,0.5,-0.5,0.03631,0.13244,1.0,0.0,0.240216,3,4,9
2010,0.5,-0.5,0.014296,0.114296,1.0,0.0,0.201187,5,3,14
2011,1.0,-0.8,0.051359,0.138291,1.0,0.0,0.257254,12,6,26
2012,0.7,-0.5,0.010652,0.110761,0.9,0.0,0.238033,7,5,16
2013,0.75,-0.65,0.037728,0.131846,1.0,0.0,0.211604,14,6,31
2014,0.2,-0.7,-0.01677,0.05751,1.0,0.0,0.17572,4,5,18
2015,0.7,-0.5,0.046667,0.121111,0.9,0.0,0.197963,8,6,16
2016,0.8,-0.6,0.018342,0.127081,1.0,0.0,0.293928,17,13,31
2017,0.6,-0.875,0.001976,0.093257,1.0,0.0,0.211646,36,29,94
2018,1.0,-1.0,0.006384,0.159683,1.0,0.0,0.268206,24,26,42
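
Headlines score exactly 0 far more often than article bodies do. One way to quantify this is the share of neutral items per year, computed from the three count columns. The sketch below uses the 2017 counts copied from the two summary tables in this notebook; `neutral_share` is a hypothetical helper name.

```python
def neutral_share(positive: int, negative: int, neutral: int) -> float:
    """Fraction of items whose polarity is exactly 0."""
    total = positive + negative + neutral
    if total == 0:
        raise ValueError("at least one article is required")
    return neutral / total


# 2017 counts (positive, negative, neutral) from the summary tables
bodies_2017 = neutral_share(145, 55, 23)
headlines_2017 = neutral_share(36, 29, 94)

print(f"{bodies_2017:.3f} {headlines_2017:.3f}")  # 0.103 0.591
```

For 2017, roughly 10% of article bodies are neutral compared with about 59% of headlines, so headline averages should be read with that in mind.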


In [16]:
types, types_dfs = get_groups(dataset_headlines_df, "text_type")
types_dfs[0].head()

Unnamed: 0,id,text,headline,text_type,year,doc,polarity,subjectivity
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Multiple, States, Have, Agreed, To, Implement...",0.0,0.0
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017,"(Why, you, should, buy, a, locking, gasoline, ...",0.0,0.0
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017,"(California, Gas, -, Tax, -, Hike, Repeal, Cam...",0.0,0.0
6,http://www.politifact.com/california/statement...,"Recently, a group of special interests threate...","Repeal Californias gas tax increase, says GOP ...",News and blog,2017,"(Repeal, Californias, gas, tax, increase, ,, s...",0.0,0.0
7,http://www.politifact.com/california/statement...,"COSTA MESA, Orange County It was a surreal vi...","The pro-Russia, pro-weed, pro-Assange GOP cong...",News and blog,2017,"(The, pro, -, Russia, ,, pro, -, weed, ,, pro,...",-0.388889,0.833333


In [17]:
# The context manager closes (and saves) the workbook once writing is done
with pd.ExcelWriter("./output/sentiment_headlines.xlsx", engine="xlsxwriter") as writer:
    save_years(writer, years, years_dfs)

In [18]:
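# One headline workbook per text type; assumes ./output/<type_str>/ already exists
for text_type, type_df in zip(types, types_dfs):
    # Separate names avoid shadowing the built-in `type` and overwriting `years`
    type_years, type_years_dfs = get_groups(type_df, "year")

    type_str = str(text_type).lower().replace(" ", "_")

    with pd.ExcelWriter(f"./output/{type_str}/sentiment_{type_str}_headlines.xlsx", engine="xlsxwriter") as writer:
        save_years(writer, type_years, type_years_dfs)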