# Notebook for Parts of Speech Analysis

Using spaCy for parts of speech analysis, we want to create relative frequency tables for the parts of speech by year.

This notebook currently processes the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" for a more consistent path), but it will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, and originalDateYear.

We are processing text from the "originalBodyText" column.

In [1]:
import spacy
from spacy.tokens.doc import Doc
from spacy.tokens.token import Token
import pandas as pd
from helpers import load_data, get_groups

## Loading articles into dataframes, separated by year

In [2]:
dataset_df = load_data()
dataset_df.head()

Unnamed: 0,id,text,headline,text_type,year
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016
1,http://www.politifact.com/california/statement...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017
4,http://www.politifact.com/california/statement...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017


## Tagging parts of speech using spaCy

Using spaCy's small English web model (`en_core_web_sm`), we tag the parts of speech in the body text: each article's body text is passed to spaCy as a string, and the tokens of the resulting `Doc` are then collected into a list.

The resulting dataframe has many rows because each tagged token takes up one row. This is fine: we are looking at overall counts per year, so we don't need to preserve the boundaries between articles.
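The one-row-per-token layout can be illustrated without spaCy. The following is a minimal sketch under stated assumptions: the two-article dataframe is made up, and whitespace splitting stands in for the real tokenizer, so only the `explode` behaviour carries over to the notebook.

```python
import pandas as pd

# Hypothetical two-article dataset; the texts and years are made up for illustration.
articles = pd.DataFrame(
    {"text": ["Taxes will rise", "Voters approved it"], "year": [2016, 2017]}
)

# Stand-in for the spaCy step: whitespace splitting instead of a real tokenizer.
articles["token"] = articles["text"].str.split()

# explode() emits one row per list element and repeats the other columns,
# so every token keeps its article's year and its original index label.
tokens_df = articles.explode("token")

print(tokens_df[["year", "token"]])
```

Because the year is repeated on every token row, grouping by year later needs no reference back to the individual articles.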

In [3]:
nlp = spacy.load("en_core_web_sm")


In [4]:
def get_tokens(doc: Doc):
    return [token for token in doc]

def get_pos(token: Token):
    return token.pos_

In [5]:
dataset_df["doc"] = list(nlp.pipe(dataset_df["text"]))

In [6]:
dataset_df["token"] = dataset_df["doc"].apply(get_tokens)
pos_df = dataset_df.explode("token")
pos_df["POS"] = pos_df["token"].apply(get_pos)
pos_df

Unnamed: 0,id,text,headline,text_type,year,doc,token,POS
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",Residents,NOUN
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",of,ADP
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",multiple,ADJ
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",states,NOUN
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",will,AUX
...,...,...,...,...,...,...,...,...
2960,Politifact_Pants on Fire_Social media_621529,ANYBODY ELSE FIND IT FUNNY THAT ISRAEL WAS ATT...,,Social media,2023,"(ANYBODY, ELSE, FIND, IT, FUNNY, THAT, ISRAEL,...",ON,PROPN
2960,Politifact_Pants on Fire_Social media_621529,ANYBODY ELSE FIND IT FUNNY THAT ISRAEL WAS ATT...,,Social media,2023,"(ANYBODY, ELSE, FIND, IT, FUNNY, THAT, ISRAEL,...",A,PRON
2960,Politifact_Pants on Fire_Social media_621529,ANYBODY ELSE FIND IT FUNNY THAT ISRAEL WAS ATT...,,Social media,2023,"(ANYBODY, ELSE, FIND, IT, FUNNY, THAT, ISRAEL,...",MONTHLY,PROPN
2960,Politifact_Pants on Fire_Social media_621529,ANYBODY ELSE FIND IT FUNNY THAT ISRAEL WAS ATT...,,Social media,2023,"(ANYBODY, ELSE, FIND, IT, FUNNY, THAT, ISRAEL,...",BASIS,PROPN


## Create relative frequency tables of parts of speech by year

### Frequency tables per year for saving

In [7]:
years, years_dfs = get_groups(pos_df, "year")
years_dfs[0].head()

Unnamed: 0,id,text,headline,text_type,year,doc,token,POS
433,http://www.politifact.com/truth-o-meter/statem...,"Washington, D.C., Mar 25 - In response to sugg...",Bachmann Demands Truth: Will Obama Administrat...,Press release,2009,"(Washington, ,, D.C., ,, Mar, 25, -, In, respo...",Washington,PROPN
433,http://www.politifact.com/truth-o-meter/statem...,"Washington, D.C., Mar 25 - In response to sugg...",Bachmann Demands Truth: Will Obama Administrat...,Press release,2009,"(Washington, ,, D.C., ,, Mar, 25, -, In, respo...",",",PUNCT
433,http://www.politifact.com/truth-o-meter/statem...,"Washington, D.C., Mar 25 - In response to sugg...",Bachmann Demands Truth: Will Obama Administrat...,Press release,2009,"(Washington, ,, D.C., ,, Mar, 25, -, In, respo...",D.C.,PROPN
433,http://www.politifact.com/truth-o-meter/statem...,"Washington, D.C., Mar 25 - In response to sugg...",Bachmann Demands Truth: Will Obama Administrat...,Press release,2009,"(Washington, ,, D.C., ,, Mar, 25, -, In, respo...",",",PUNCT
433,http://www.politifact.com/truth-o-meter/statem...,"Washington, D.C., Mar 25 - In response to sugg...",Bachmann Demands Truth: Will Obama Administrat...,Press release,2009,"(Washington, ,, D.C., ,, Mar, 25, -, In, respo...",Mar,PROPN
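The `get_groups` helper is imported from `helpers` and its source is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it simply splits a dataframe by the values of one column and returns the keys alongside the sub-dataframes, a `groupby`-based version could look like this (`get_groups_sketch` and the demo data are hypothetical, not the actual implementation):

```python
import pandas as pd

def get_groups_sketch(df: pd.DataFrame, column: str):
    """Hypothetical stand-in for helpers.get_groups: split df by the values of column."""
    keys, frames = [], []
    # groupby sorts the group keys by default, so the earliest year comes first
    for key, group in df.groupby(column):
        keys.append(key)
        frames.append(group)
    return keys, frames

# Made-up rows, only to show the shape of the return value
demo = pd.DataFrame({"year": [2016, 2009, 2016], "POS": ["NOUN", "VERB", "ADP"]})
demo_years, demo_years_dfs = get_groups_sketch(demo, "year")

print(demo_years)
print(demo_years_dfs[0])
```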


### Summary tables for easy glancing

In [8]:
def get_summary_counts_df(years: list[int], years_dfs: list[pd.DataFrame]):
    return pd.DataFrame(
        data=[df["POS"].value_counts() for df in years_dfs], 
        index=pd.Index(years, name="year")
    )

def get_summary_proportions_df(years: list[int], years_dfs: list[pd.DataFrame]):
    return pd.DataFrame(
        data=[df["POS"].value_counts(normalize=True) for df in years_dfs], 
        index=pd.Index(years, name="year")
    )

In [9]:
summary_counts_df = get_summary_counts_df(years, years_dfs)
summary_counts_df

year,NOUN,PUNCT,VERB,ADP,DET,PROPN,ADJ,PRON,AUX,ADV,PART,CCONJ,SCONJ,NUM,SPACE,X,SYM,INTJ
2009,2475,1620,1425,1267,1097,1005,942,738,688,472,394,321,299,219,188,57,30,12
2010,2191,1279,1280,1156,957,1192,805,666,636,410,356,308,234,245,96,4,43,5
2011,4981,2668,2882,2580,2095,2466,1709,1541,1313,819,724,715,456,413,217,21,67,18
2012,3438,2192,2009,1808,1419,2108,1201,957,891,533,511,481,333,457,121,19,127,21
2013,7034,4079,4261,3630,3144,3065,2576,2620,2168,1384,1226,1161,768,606,346,30,90,24
2014,2884,2073,1833,1676,1288,2090,930,807,763,450,508,434,269,534,279,65,63,11
2015,4559,2874,2767,2535,2140,2210,1773,1768,1424,827,826,825,488,377,455,21,52,21
2016,6872,4365,4402,3922,3105,3555,2550,2534,1907,1387,1019,1079,810,722,421,44,129,28
2017,14282,8727,9359,8057,6721,8609,4819,5451,4348,2919,2271,2246,1658,1608,908,59,258,51
2018,8486,5267,5531,5151,3938,5482,2991,3288,2523,1881,1296,1268,968,844,480,27,88,46


In [10]:
summary_proportions_df = get_summary_proportions_df(years, years_dfs)
summary_proportions_df

year,NOUN,PUNCT,VERB,ADP,DET,PROPN,ADJ,PRON,AUX,ADV,PART,CCONJ,SCONJ,NUM,SPACE,X,SYM,INTJ
2009,0.186807,0.122273,0.107555,0.09563,0.082799,0.075855,0.0711,0.055702,0.051928,0.035625,0.029738,0.024228,0.022568,0.01653,0.01419,0.004302,0.002264,0.000906
2010,0.184692,0.107814,0.107899,0.097446,0.080671,0.10048,0.067858,0.056141,0.053612,0.034561,0.030009,0.025963,0.019725,0.020652,0.008092,0.000337,0.003625,0.000421
2011,0.193926,0.103874,0.112206,0.100448,0.081565,0.096009,0.066537,0.059996,0.051119,0.031886,0.028188,0.027837,0.017754,0.016079,0.008449,0.000818,0.002609,0.000701
2012,0.184581,0.117685,0.10786,0.097069,0.076184,0.113175,0.06448,0.05138,0.047836,0.028616,0.027435,0.025824,0.017878,0.024536,0.006496,0.00102,0.006818,0.001127
2013,0.184078,0.106747,0.111509,0.094996,0.082278,0.08021,0.067413,0.068565,0.056736,0.036219,0.032084,0.030383,0.020098,0.015859,0.009055,0.000785,0.002355,0.000628
2014,0.170077,0.12225,0.108097,0.098838,0.075957,0.123253,0.054845,0.047591,0.044996,0.026538,0.029958,0.025594,0.015864,0.031491,0.016453,0.003833,0.003715,0.000649
2015,0.175738,0.110786,0.106661,0.097718,0.082492,0.08519,0.068345,0.068152,0.054892,0.031879,0.03184,0.031802,0.018811,0.014532,0.017539,0.000809,0.002004,0.000809
2016,0.176881,0.112352,0.113305,0.10095,0.079921,0.091503,0.065635,0.065224,0.049085,0.0357,0.026228,0.027773,0.020849,0.018584,0.010836,0.001133,0.00332,0.000721
2017,0.173428,0.105973,0.113648,0.097837,0.081614,0.10454,0.058518,0.066192,0.052798,0.035446,0.027577,0.027274,0.020133,0.019526,0.011026,0.000716,0.003133,0.000619
2018,0.171244,0.106286,0.111613,0.103945,0.079467,0.110625,0.060357,0.066351,0.050913,0.037958,0.026153,0.025588,0.019534,0.017032,0.009686,0.000545,0.001776,0.000928
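As a quick sanity check on how the two summary tables relate, each proportion is the tag's count divided by the total number of tokens for that year, which is what `value_counts(normalize=True)` computes. The sketch below redoes that arithmetic in plain Python for the 2009 row, with the counts copied from the counts table; the variable names are only illustrative.

```python
# Counts copied from the 2009 row of the counts table.
counts_2009 = {
    "NOUN": 2475, "PUNCT": 1620, "VERB": 1425, "ADP": 1267, "DET": 1097,
    "PROPN": 1005, "ADJ": 942, "PRON": 738, "AUX": 688, "ADV": 472,
    "PART": 394, "CCONJ": 321, "SCONJ": 299, "NUM": 219, "SPACE": 188,
    "X": 57, "SYM": 30, "INTJ": 12,
}

# Total number of tagged tokens in 2009
total_2009 = sum(counts_2009.values())

# Relative frequency of each tag: count / total
proportions_2009 = {tag: n / total_2009 for tag, n in counts_2009.items()}

print(total_2009)                            # 13249
print(round(proportions_2009["NOUN"], 6))    # 0.186807
```

The rounded NOUN value agrees with the 2009 entry in the proportions table.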


In [11]:
types, types_dfs = get_groups(pos_df, "text_type")
types_dfs[0].head()

Unnamed: 0,id,text,headline,text_type,year,doc,token,POS
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",Residents,NOUN
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",of,ADP
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",multiple,ADJ
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",states,NOUN
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...",will,AUX


In [12]:
def get_pos_table_for_year(df: pd.DataFrame):
    counts = df["POS"].value_counts()

    pos_table = counts.to_frame()
    pos_table["proportion"] = counts / counts.sum()

    return pos_table

In [13]:
def save_years(writer: pd.ExcelWriter, years: list[int], years_dfs: list[pd.DataFrame]):
    for year, df in zip(years, years_dfs):
        pos_table_df = get_pos_table_for_year(df)
        pos_table_df.to_excel(
            writer,
            sheet_name=str(year)
        )
    
    get_summary_counts_df(years, years_dfs).to_excel(writer, sheet_name="counts")
    get_summary_proportions_df(years, years_dfs).to_excel(writer, sheet_name="proportions")

## Writing dataframes to Excel spreadsheets

In [14]:
# The context manager saves and closes the workbook even if save_years raises
with pd.ExcelWriter("./output/pos.xlsx", engine="xlsxwriter") as writer:
    save_years(writer, years, years_dfs)

In [15]:
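from pathlib import Path

# Write one workbook per text type, each with per-year sheets plus the two summary sheets.
# Distinct names avoid shadowing the built-in `type` and overwriting `years`/`years_dfs`.
for text_type, type_df in zip(types, types_dfs):
    type_years, type_years_dfs = get_groups(type_df, "year")

    type_str = str(text_type).lower().replace(" ", "_")

    # Create the per-type output directory if it does not exist yet
    out_dir = Path("./output") / type_str
    out_dir.mkdir(parents=True, exist_ok=True)

    with pd.ExcelWriter(out_dir / f"pos_{type_str}.xlsx", engine="xlsxwriter") as writer:
        save_years(writer, type_years, type_years_dfs)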