# Notebook for Parts of Speech Analysis

We use spaCy to tag parts of speech and build relative frequency tables of those tags for each year.

This notebook currently processes the "Fakespeak-ENG modified.xlsx" file (my copy is renamed to "Fakespeak_ENG_modified.xlsx" so the path is more consistent). It will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear.

The text we process comes from the "originalBodyText" column.

In [1]:
import spacy
from spacy.tokens.doc import Doc
from spacy.tokens.token import Token
import pandas as pd
from dataset_config import BASE_FAKESPEAK_CONFIG, BASE_MISINFOTEXT_CONFIG
from helpers import get_groups, make_output_path, make_output_path_for_type

## Loading articles into dataframes, separated by year

In [2]:
fakespeak_config = BASE_FAKESPEAK_CONFIG
misinfotext_config = BASE_MISINFOTEXT_CONFIG

In [3]:
using_dataset = fakespeak_config

In [4]:
dataset_df = pd.read_excel(
    using_dataset["input_path"], 
    sheet_name=using_dataset["sheet_name"], 
    usecols=using_dataset["usecols"]
)

# Drop 2007 and 2008 because those years contain very little data
dataset_df = dataset_df[~dataset_df[using_dataset["year_col"]].isin([2007, 2008])]

dataset_df.head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",2019
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,2019
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,2019
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,2019


## Tagging parts of speech using spaCy

Using spaCy's small English web model (en_core_web_sm), we tag parts of speech by running each article's body text through the pipeline, collecting the resulting tokens into one list per article, and then exploding that list so that each token gets its own row.

The resulting dataframe has many rows, because each tagged token takes up one row. This is fine: we only need overall counts per year, so the boundaries between articles do not need to be preserved.

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
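def get_tokens(doc: Doc) -> list[Token]:
    # Iterating over a Doc yields its tokens
    return list(doc)

def get_pos(token: Token) -> str:
    # Coarse-grained (Universal POS) tag, e.g. "NOUN" or "VERB"
    return token.pos_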

In [None]:
dataset_df["doc"] = list(nlp.pipe(dataset_df[using_dataset["text_col"]]))

In [None]:
dataset_df["token"] = dataset_df["doc"].apply(get_tokens)
pos_df = dataset_df.explode("token")
pos_df["POS"] = pos_df["token"].apply(get_pos)
pos_df
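
As a rough illustration of the one-row-per-token shape described above, here is a minimal sketch on toy data. The hand-written (token, tag) pairs stand in for spaCy tokens, and the names `toy_df` and `toy_pos_df` are made up for this example; only the explode step mirrors the cell above.

```python
import pandas as pd

# Toy stand-in for the notebook's data: two "articles" with pre-tagged tokens.
# In the notebook the tags come from spaCy; here they are written by hand.
toy_df = pd.DataFrame({
    "ID": ["a1", "a2"],
    "originalDateYear": [2019, 2020],
    "token": [
        [("Mexico", "PROPN"), ("pays", "VERB")],
        [("Taxes", "NOUN"), ("rose", "VERB"), ("sharply", "ADV")],
    ],
})

# explode() gives each list element its own row and repeats the other columns,
# so the article index (0, 1) appears once per token.
toy_pos_df = toy_df.explode("token")
toy_pos_df["POS"] = toy_pos_df["token"].apply(lambda pair: pair[1])

print(toy_pos_df[["ID", "originalDateYear", "POS"]])
```

Each row keeps its article's year, which is all the per-year counts below rely on.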

## Create relative frequency tables of parts of speech by year

### Frequency tables per year for saving

In [None]:
years, years_dfs = get_groups(pos_df, using_dataset["year_col"])
years_dfs[0].head()
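
`get_groups` is imported from the local `helpers` module, whose source is not shown in this notebook. Judging only from how it is called here, it appears to return the distinct values of a column together with one dataframe per value. The following is a hypothetical sketch of that behavior (the name `get_groups_sketch` and the toy data are assumptions), and the real helper may differ:

```python
import pandas as pd

def get_groups_sketch(df: pd.DataFrame, col: str):
    """Hypothetical stand-in for helpers.get_groups: split df by the values of col."""
    keys, frames = [], []
    # groupby sorts the group keys by default, so years come out in ascending order
    for key, group_df in df.groupby(col):
        keys.append(key)
        frames.append(group_df)
    return keys, frames

# Toy usage with made-up rows
toy = pd.DataFrame({
    "originalDateYear": [2020, 2019, 2019],
    "POS": ["NOUN", "VERB", "NOUN"],
})
toy_years, toy_years_dfs = get_groups_sketch(toy, "originalDateYear")
print(toy_years)
print([len(frame) for frame in toy_years_dfs])
```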

### Summary tables for easy glancing

In [None]:
def get_summary_counts_df(years: list[int], years_dfs: list[pd.DataFrame]):
    return pd.DataFrame(
        data=[df["POS"].value_counts() for df in years_dfs], 
        index=pd.Index(years, name="year")
    )

def get_summary_proportions_df(years: list[int], years_dfs: list[pd.DataFrame]):
    return pd.DataFrame(
        data=[df["POS"].value_counts(normalize=True) for df in years_dfs], 
        index=pd.Index(years, name="year")
    )

In [None]:
summary_counts_df = get_summary_counts_df(years, years_dfs)
summary_counts_df

In [None]:
summary_proportions_df = get_summary_proportions_df(years, years_dfs)
summary_proportions_df
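
One thing to keep in mind when reading these summary tables: a tag that never occurs in a given year has no entry in that year's `value_counts()` result, so it shows up as NaN rather than 0. The sketch below reproduces this on toy data, building the table the same way as `get_summary_proportions_df`. The toy frames and the optional `fillna(0)` step are assumptions added for illustration and are not part of the notebook.

```python
import pandas as pd

# Made-up per-year frames; 2020 contains no ADJ tokens
toy_years = [2019, 2020]
toy_years_dfs = [
    pd.DataFrame({"POS": ["NOUN", "NOUN", "VERB", "ADJ"]}),
    pd.DataFrame({"POS": ["NOUN", "VERB"]}),
]

# Same construction as get_summary_proportions_df: one row per year,
# one column per tag, each row holding that year's relative frequencies
toy_props = pd.DataFrame(
    data=[df["POS"].value_counts(normalize=True) for df in toy_years_dfs],
    index=pd.Index(toy_years, name="year"),
)

# Optional: treat "tag absent in this year" as a frequency of 0
toy_props_filled = toy_props.fillna(0)

print(toy_props)
print(toy_props_filled)
```

Whether to fill with 0 depends on how the tables are read downstream; the notebook itself leaves the NaN values in place.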

In [None]:
types, types_dfs = get_groups(pos_df, using_dataset["type_col"])
types_dfs[0].head()

In [None]:
def get_pos_table_for_year(df: pd.DataFrame):
    counts = df["POS"].value_counts()

    pos_table = counts.to_frame()
    pos_table["proportion"] = counts / counts.sum()

    return pos_table

In [None]:
def save_years(writer: pd.ExcelWriter, years: list[int], years_dfs: list[pd.DataFrame]):
    for year, df in zip(years, years_dfs):
        pos_table_df = get_pos_table_for_year(df)
        pos_table_df.to_excel(
            writer,
            sheet_name=str(year)
        )
    
    get_summary_counts_df(years, years_dfs).to_excel(writer, sheet_name="counts")
    get_summary_proportions_df(years, years_dfs).to_excel(writer, sheet_name="proportions")

## Writing dataframes to an Excel spreadsheet

In [None]:
output_path = make_output_path(using_dataset, "POS_frequency")

# The context manager saves and closes the workbook, even if save_years raises
with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
    save_years(writer, years, years_dfs)

In [None]:
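# text_type avoids shadowing the built-in type(); the type_* names keep the
# dataset-wide years/years_dfs from being overwritten inside the loop
for text_type, type_df in zip(types, types_dfs):
    type_years, type_years_dfs = get_groups(type_df, using_dataset["year_col"])

    output_path = make_output_path_for_type(using_dataset, text_type, "POS_frequency")

    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        save_years(writer, type_years, type_years_dfs)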