# Notebook for Parts of Speech Analysis

We use spaCy to tag parts of speech and build relative frequency tables of those tags for each year.

The notebook currently processes the "Fakespeak-ENG modified.xlsx" file (my copy is renamed to "Fakespeak_ENG_modified.xlsx" for a more consistent path), and will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear.

The text we tag comes from the "originalBodyText" column.

In [51]:
import spacy
from spacy.tokens.doc import Doc
from spacy.tokens.token import Token
import pandas as pd
from dataset_config import BASE_FAKESPEAK_CONFIG, BASE_MISINFOTEXT_CONFIG
from helpers import get_groups, make_output_path, make_output_path_for_type

## Loading articles into a dataframe

In [None]:
fakespeak_config = BASE_FAKESPEAK_CONFIG
misinfotext_config = BASE_MISINFOTEXT_CONFIG
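
The contents of `dataset_config` are not shown in this notebook. As a rough guide to what the later cells expect, here is a hypothetical configuration built only from the keys the notebook reads (`input_path`, `sheet_name`, `usecols`, `text_col`, `year_col`, `type_col`). The directory in `input_path` and the `sheet_name` value are assumptions, and the real `BASE_FAKESPEAK_CONFIG` may contain other keys.

```python
# Hypothetical stand-in for BASE_FAKESPEAK_CONFIG; the real dict may differ.
EXAMPLE_FAKESPEAK_CONFIG = {
    # Assumed location of the renamed workbook (the directory is a guess).
    "input_path": "data/Fakespeak_ENG_modified.xlsx",
    # Assumed: 0 tells pd.read_excel to read the first sheet.
    "sheet_name": 0,
    # The five columns the notebook says it uses.
    "usecols": [
        "ID",
        "combinedLabel",
        "originalTextType",
        "originalBodyText",
        "originalDateYear",
    ],
    # Column names the later cells look up through the config.
    "text_col": "originalBodyText",
    "year_col": "originalDateYear",
    "type_col": "originalTextType",
}

# Every column referenced by name should also be loaded via usecols.
for key in ("text_col", "year_col", "type_col"):
    assert EXAMPLE_FAKESPEAK_CONFIG[key] in EXAMPLE_FAKESPEAK_CONFIG["usecols"]
```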

In [69]:
using_dataset = fakespeak_config

In [70]:
dataset_df = pd.read_excel(
    using_dataset["input_path"], 
    sheet_name=using_dataset["sheet_name"], 
    usecols=using_dataset["usecols"]
)
dataset_df.head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",2019
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,2019
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,2019
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,2019


## Tagging parts of speech using spaCy

Using the small English web model ("en_core_web_sm"), we tag parts of speech by running each article's body text through the spaCy pipeline, collecting the tokens of each resulting document into a list, and then exploding that list so every token gets its own row.

We end up with a dataframe of many rows, since each tagged token takes up one row. This is fine: we only need overall counts per year, so the boundaries between articles do not have to be preserved.
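
The one-row-per-token shape can be illustrated with pandas alone. In this minimal sketch, hand-written `(text, POS)` pairs stand in for spaCy `Token` objects, and the column names `ID` and `year` are placeholders rather than the dataset's real columns.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "ID": ["a1", "a2"],
    "year": [2019, 2020],
    # Hand-written (text, POS) pairs stand in for spaCy Token objects.
    "token": [
        [("Mexico", "PROPN"), ("is", "AUX"), ("paying", "VERB")],
        [("Joe", "PROPN"), ("Biden", "PROPN")],
    ],
})

# explode() turns each list element into its own row and repeats the
# other columns (and the original index label) for every element.
toy_pos_df = toy_df.explode("token")
toy_pos_df["POS"] = toy_pos_df["token"].apply(lambda pair: pair[1])

print(toy_pos_df[["ID", "year", "POS"]])
```

The repeated index labels (0, 0, 0, 1, 1) are the only remaining trace of which article a token came from, which is why the large output tables show the same index on consecutive rows.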

In [71]:
nlp = spacy.load("en_core_web_sm")

In [72]:
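def get_tokens(doc: Doc) -> list[Token]:
    # A Doc is iterable over its tokens, so list() collects them directly.
    return list(doc)

def get_pos(token: Token) -> str:
    # pos_ is the coarse-grained Universal POS tag (e.g. "NOUN", "VERB").
    return token.pos_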

In [73]:
dataset_df["doc"] = list(nlp.pipe(dataset_df[using_dataset["text_col"]]))

In [74]:
dataset_df["token"] = dataset_df["doc"].apply(get_tokens)
pos_df = dataset_df.explode("token")
pos_df["POS"] = pos_df["token"].apply(get_pos)
pos_df

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear,doc,token,POS
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",Mexico,PROPN
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",is,AUX
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",paying,VERB
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",for,ADP
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",the,DET
...,...,...,...,...,...,...,...,...
2960,Politifact_Pants on Fire_Social media_621529,Pants on Fire,Social media,ANYBODY ELSE FIND IT FUNNY THAT ISRAEL WAS ATT...,2023,"(ANYBODY, ELSE, FIND, IT, FUNNY, THAT, ISRAEL,...",ON,PROPN
2960,Politifact_Pants on Fire_Social media_621529,Pants on Fire,Social media,ANYBODY ELSE FIND IT FUNNY THAT ISRAEL WAS ATT...,2023,"(ANYBODY, ELSE, FIND, IT, FUNNY, THAT, ISRAEL,...",A,PRON
2960,Politifact_Pants on Fire_Social media_621529,Pants on Fire,Social media,ANYBODY ELSE FIND IT FUNNY THAT ISRAEL WAS ATT...,2023,"(ANYBODY, ELSE, FIND, IT, FUNNY, THAT, ISRAEL,...",MONTHLY,PROPN
2960,Politifact_Pants on Fire_Social media_621529,Pants on Fire,Social media,ANYBODY ELSE FIND IT FUNNY THAT ISRAEL WAS ATT...,2023,"(ANYBODY, ELSE, FIND, IT, FUNNY, THAT, ISRAEL,...",BASIS,PROPN


## Create relative frequency tables of parts of speech by year

### Frequency tables per year for saving

In [75]:
years, years_dfs = get_groups(pos_df, using_dataset["year_col"])
years_dfs[0].head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear,doc,token,POS
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",Mexico,PROPN
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",is,AUX
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",paying,VERB
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",for,ADP
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019,"(Mexico, is, paying, for, the, Wall, through, ...",the,DET
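
`get_groups` comes from the local `helpers` module, which is not shown here. Judging only from how it is called, it seems to return the distinct values of a column together with one sub-dataframe per value. The function below, `get_groups_sketch`, is a hypothetical reconstruction with that call shape, assuming the groups come back in sorted key order; the real helper may behave differently.

```python
import pandas as pd

def get_groups_sketch(df: pd.DataFrame, column: str):
    """Hypothetical stand-in for helpers.get_groups (an assumption)."""
    keys, frames = [], []
    # groupby with sort=True yields the groups in ascending key order.
    for key, group in df.groupby(column, sort=True):
        keys.append(key)
        frames.append(group)
    return keys, frames

toy_pos_df = pd.DataFrame({
    "year": [2020, 2019, 2020],
    "POS": ["NOUN", "VERB", "NOUN"],
})

toy_years, toy_years_dfs = get_groups_sketch(toy_pos_df, "year")
print(toy_years)                # the distinct years, ascending
print(len(toy_years_dfs[1]))    # number of token rows for the second year
```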


### Summary tables for easy glancing

In [76]:
def get_summary_counts_df(years: list[int], years_dfs: list[pd.DataFrame]):
    # One row per year, one column per POS tag, holding raw token counts.
    return pd.DataFrame(
        data=[df["POS"].value_counts() for df in years_dfs], 
        index=pd.Index(years, name="year")
    )

def get_summary_proportions_df(years: list[int], years_dfs: list[pd.DataFrame]):
    # Same layout as the counts table, but each row is normalized to sum to 1.
    return pd.DataFrame(
        data=[df["POS"].value_counts(normalize=True) for df in years_dfs], 
        index=pd.Index(years, name="year")
    )

In [77]:
summary_counts_df = get_summary_counts_df(years, years_dfs)
summary_counts_df

year,NOUN,VERB,PUNCT,PROPN,ADP,DET,PRON,ADJ,AUX,ADV,CCONJ,PART,SPACE,NUM,SCONJ,SYM,X,INTJ
2019,6641,4168,4099,3557,3551,2708,2502,2175,1915,1217,1085,1055,1020,744,670,133,60,52
2020,22970,14936,15684,15448,12701,10212,8740,7905,7367,4676,3505,3555,4237,3065,2669,658,259,148
2021,26692,16357,17555,14316,14242,11522,8797,9021,7557,5087,4082,3982,4261,3497,2766,453,218,126
2022,16320,9725,10639,8624,8722,6841,5234,5619,4374,3281,2392,2256,2585,2473,1683,518,212,75
2023,21830,13075,13904,12022,11285,9163,6552,7231,5565,3690,3139,2978,3906,2109,2090,576,225,78
2024,4542,3088,3272,3023,2480,2065,1774,1525,1329,920,660,699,771,456,466,200,91,37


In [78]:
summary_proportions_df = get_summary_proportions_df(years, years_dfs)
summary_proportions_df

year,NOUN,VERB,PUNCT,PROPN,ADP,DET,PRON,ADJ,AUX,ADV,CCONJ,PART,SPACE,NUM,SCONJ,SYM,X,INTJ
2019,0.177795,0.111587,0.10974,0.095229,0.095069,0.072499,0.066984,0.05823,0.051269,0.032582,0.029048,0.028245,0.027308,0.019919,0.017937,0.003561,0.001606,0.001392
2020,0.165567,0.107658,0.11305,0.111349,0.091549,0.073608,0.062998,0.056979,0.053101,0.033705,0.025264,0.025624,0.03054,0.022092,0.019238,0.004743,0.001867,0.001067
2021,0.177319,0.108662,0.11662,0.095103,0.094612,0.076542,0.05844,0.059928,0.050202,0.033794,0.027117,0.026453,0.028306,0.023231,0.018375,0.003009,0.001448,0.000837
2022,0.178218,0.106199,0.116181,0.094176,0.095246,0.074705,0.057157,0.061361,0.047765,0.035829,0.026121,0.024636,0.028229,0.027006,0.018379,0.005657,0.002315,0.000819
2023,0.182803,0.109489,0.116431,0.100672,0.0945,0.07673,0.054866,0.060552,0.046601,0.0309,0.026286,0.024938,0.032709,0.017661,0.017502,0.004823,0.001884,0.000653
2024,0.165779,0.112709,0.119425,0.110337,0.090518,0.07537,0.064749,0.055661,0.048507,0.033579,0.024089,0.025513,0.028141,0.016644,0.017009,0.0073,0.003321,0.00135
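
Every tag occurs in every year of this dataset, so the summary tables have no gaps. With a smaller dataset that may not hold: building a dataframe from per-year `value_counts()` series leaves `NaN` wherever a tag is absent in a given year. The sketch below uses made-up toy data to show the effect and one possible way to handle it (filling with zero); whether that is appropriate for MisInfoText is an open choice, not something the notebook prescribes.

```python
import pandas as pd

toy_years = [2019, 2020]
toy_dfs = [
    pd.DataFrame({"POS": ["NOUN", "NOUN", "VERB", "INTJ"]}),
    pd.DataFrame({"POS": ["NOUN", "VERB", "VERB"]}),  # no INTJ this year
]

# Same construction as the summary counts table: one Series per year as a row.
counts = pd.DataFrame(
    data=[df["POS"].value_counts() for df in toy_dfs],
    index=pd.Index(toy_years, name="year"),
)
print(counts)  # the 2020 INTJ cell is NaN

# One option: treat a missing tag as zero occurrences.
counts_filled = counts.fillna(0).astype(int)

# Row-wise proportions; each row sums to 1.
proportions = counts_filled.div(counts_filled.sum(axis=1), axis=0)
print(proportions)
```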


In [79]:
types, types_dfs = get_groups(pos_df, using_dataset["type_col"])
types_dfs[0].head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear,doc,token,POS
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,2019,"(Joe, Biden, has, a, message, for, the, public...",Joe,PROPN
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,2019,"(Joe, Biden, has, a, message, for, the, public...",Biden,PROPN
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,2019,"(Joe, Biden, has, a, message, for, the, public...",has,VERB
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,2019,"(Joe, Biden, has, a, message, for, the, public...",a,DET
16,Politifact_FALSE_News and blog_73653,False,News and blog,Joe Biden has a message for the public on his ...,2019,"(Joe, Biden, has, a, message, for, the, public...",message,NOUN


In [80]:
def get_pos_table_for_year(df: pd.DataFrame):
    counts = df["POS"].value_counts()

    pos_table = counts.to_frame()
    pos_table["proportion"] = counts / counts.sum()

    return pos_table
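
As a sanity check on the proportion column, the 2019 row of the summary counts table can be recomputed directly. The sketch below copies those 18 counts into a series and applies the same `counts / counts.sum()` step; the variable names are illustrative.

```python
import pandas as pd

# 2019 counts copied from the summary counts table.
counts_2019 = pd.Series({
    "NOUN": 6641, "VERB": 4168, "PUNCT": 4099, "PROPN": 3557, "ADP": 3551,
    "DET": 2708, "PRON": 2502, "ADJ": 2175, "AUX": 1915, "ADV": 1217,
    "CCONJ": 1085, "PART": 1055, "SPACE": 1020, "NUM": 744, "SCONJ": 670,
    "SYM": 133, "X": 60, "INTJ": 52,
})

pos_table = counts_2019.to_frame()
pos_table["proportion"] = counts_2019 / counts_2019.sum()

total_tokens = int(counts_2019.sum())
noun_share = round(float(pos_table.loc["NOUN", "proportion"]), 6)
print(total_tokens, noun_share)
```

The total comes to 37,352 tokens, and the NOUN share rounds to 0.177795, matching the 2019 entry in the summary proportions table.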

In [81]:
def save_years(writer: pd.ExcelWriter, years: list[int], years_dfs: list[pd.DataFrame]):
    for year, df in zip(years, years_dfs):
        pos_table_df = get_pos_table_for_year(df)
        pos_table_df.to_excel(
            writer,
            sheet_name=str(year)
        )
    
    get_summary_counts_df(years, years_dfs).to_excel(writer, sheet_name="counts")
    get_summary_proportions_df(years, years_dfs).to_excel(writer, sheet_name="proportions")

## Writing dataframes to an Excel spreadsheet

In [82]:
output_path = make_output_path(using_dataset, "POS_frequency")

# The context manager closes (and saves) the workbook even if a write raises.
with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
    save_years(writer, years, years_dfs)

In [83]:
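# text_type avoids shadowing the built-in type(); the loop-local names keep
# the dataset-wide years/years_dfs from being overwritten.
for text_type, df in zip(types, types_dfs):
    type_years, type_years_dfs = get_groups(df, using_dataset["year_col"])

    output_path = make_output_path_for_type(using_dataset, text_type, "POS_frequency")

    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        save_years(writer, type_years, type_years_dfs)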
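
For a quick look at the per-type, per-year proportions without writing any files, the same breakdown could also be computed as a single grouped operation. This is an alternative sketch on made-up toy data, not what the notebook does; the column names `text_type` and `year` are placeholders for the configured type and year columns.

```python
import pandas as pd

toy_pos_df = pd.DataFrame({
    "text_type": ["Social media"] * 4 + ["News and blog"] * 2 + ["Social media"],
    "year": [2019, 2019, 2019, 2019, 2019, 2019, 2020],
    "POS": ["NOUN", "NOUN", "VERB", "AUX", "PROPN", "VERB", "NOUN"],
})

# Proportions are normalized within each (text_type, year) group;
# unstack moves the POS tags into columns and fills absent tags with 0.
table = (
    toy_pos_df
    .groupby(["text_type", "year"])["POS"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(table)
```

Each row of `table` corresponds to one text type and year, which is the same information the per-type workbooks store in their "proportions" sheets.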