# Notebook for Finding Ngrams

Using scikit-learn, we want to find ngrams (contiguous sequences of n words) and identify the most frequently occurring ones in the articles for each year.

The notebook currently processes the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" for a more consistent path), but it will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalHeadline, originalDateYear

We are processing text from the "originalBodyText" column.
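
As a rough illustration of what an ngram count is, the sketch below counts bigrams with a sliding window over whitespace-separated tokens. It uses only the standard library rather than scikit-learn, so its tokenization is simpler than what the notebook uses; the `count_ngrams` helper and the sample sentence are hypothetical and exist only for this example.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count contiguous n-word sequences in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    # slide a window of width n across the token list
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# hypothetical sample text, not taken from the dataset
sample = "the wall the wall the border"
bigram_counts = count_ngrams(sample, 2)
print(bigram_counts.most_common(2))
```

A text shorter than n words simply yields an empty counter, because the window never fits.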

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load in fakespeak dataset

In [2]:
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetConfig:
    """Paths and read options for one dataset."""
    input_path: str
    output_path: str
    output_path_headlines: str
    sheet_name: str
    # columns to read; None reads every column in the sheet
    usecols: Optional[list[str]]

In [3]:
fakespeak_config = DatasetConfig(
    # Colab alternative: input_path="/content/drive/My Drive/fake_news_over_time/Fakespeak_ENG_modified.xlsx",
    input_path="./data/Fakespeak-ENG/Fakespeak-ENG modified.xlsx",
    output_path="./data/Fakespeak-ENG/Analysis_output/Fakespeak_ngrams.xlsx",
    output_path_headlines="./data/Fakespeak-ENG/Analysis_output/Fakespeak_ngrams_headlines.xlsx",
    sheet_name="Working",
    usecols=['ID', 'combinedLabel', 'originalTextType', 'originalBodyText', 'originalDateYear', 'originalHeadline']
)

misinfotext_config = DatasetConfig(
    input_path="./data/MisInfoText/PolitiFact_original_modified.xlsx",
    output_path="./data/MisInfoText/Analysis_output/MisInfoText_ngrams.xlsx",
    output_path_headlines="./data/MisInfoText/Analysis_output/MisInfoText_ngrams_headlines.xlsx",
    sheet_name="Working",
    usecols=None
)

In [16]:
using_dataset = fakespeak_config

In [18]:
dataset_df = pd.read_excel(
    using_dataset.input_path, 
    sheet_name=using_dataset.sheet_name, 
    usecols=using_dataset.usecols
)
dataset_df.head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,,2019
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",,2019
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,,2019
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,,2019
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,,2019


## Extracting ngrams

Here we use sklearn's CountVectorizer class to produce ngrams for n = 1 to 5. First, we split dataset_df by year, then find the ngrams for each year and create a new dataframe to hold them.

In [19]:
# helper function to find ngrams for articles from each year
def get_ngram_counts(df: pd.DataFrame, col: str):
  year = df["originalDateYear"].iloc[0]
  
  # initialize the vectorizer to extract ngrams with n from 1 to 5
  c_vec = CountVectorizer(ngram_range=(1, 5))

  # input to fit_transform must be an iterable of strings;
  # the result is a sparse document-by-ngram count matrix
  ngrams = c_vec.fit_transform(df[col].to_list())

  # vocabulary_ maps each ngram to its column index (only available after fitting)
  vocab = c_vec.vocabulary_

  # total count of each ngram across all documents for this year
  count_values = ngrams.toarray().sum(axis=0)

  # list to hold ngram rows that will be turned into a dataframe
  ngram_list = []

  # walk through the ngrams from most to least frequent
  for count, text in sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True):
    n = len(text.split())
    ngram_list.append([n, text, count, year])

  headers = ['n', 'ngram_text', 'ngram_count', 'year']
  ngram_df = pd.DataFrame(ngram_list, columns=headers)

  # sort by n, then by count (highest first) within each n
  ngram_df = ngram_df.sort_values(by=['n', 'ngram_count'], ascending=[True, False])

  return ngram_df


In [20]:
# get ngrams for each year
grouped_by_year = dataset_df.groupby(by="originalDateYear")
ngram_years_dfs = [get_ngram_counts(grouped_by_year.get_group(group), "originalBodyText") 
                   for group in grouped_by_year.groups]
ngram_years_dfs[0]

Unnamed: 0,n,ngram_text,ngram_count,year
0,1,the,1580,2019
1,1,to,1009,2019
2,1,and,852,2019
3,1,of,752,2019
4,1,in,580,2019
...,...,...,...,...
115070,5,000 according to regional data,1,2019
115074,5,000 000 that have been,1,2019
115078,5,00 to the trump campaign,1,2019
115082,5,00 ticket and you will,1,2019


In [21]:
# some rows have no headline (see the empty originalHeadline values in the preview above),
# and CountVectorizer cannot process missing values, so drop those rows first
headlines_df = dataset_df.dropna(subset=["originalHeadline"])
grouped_by_year_headlines = headlines_df.groupby(by="originalDateYear")
ngram_years_headlines_dfs = [get_ngram_counts(grouped_by_year_headlines.get_group(group), "originalHeadline") 
                             for group in grouped_by_year_headlines.groups]
ngram_years_headlines_dfs[0]


## Prepare dataframes to output to spreadsheet
Currently, the dataframes hold every ngram found for n = 1 to 5, including ones that appear only once, which isn't very helpful (for reference, the unfiltered 2019 dataframe contains 115,090 entries). To address this, we keep only the first 20 entries for each n from 2 to 5 (i.e. the 20 most frequent bigrams, then the 20 most frequent trigrams, etc. for each year).

The exception is unigrams: we keep 50 of them, since many of the most frequent ones are common function words and the results are more interesting when we look further down the list. For the same reason, we also drop the first 10 rows of each dataframe, which are the 10 most frequent unigrams because the dataframe is sorted by n and then by count (this number can also be adjusted).
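
The trimming described above can be sketched on a toy dataframe as follows. This is only an illustration under assumed values: the rows, the `limits` mapping and the single skipped row are all made up, and it uses `groupby` where the notebook's helper filters each n separately.

```python
import pandas as pd

# hypothetical ngram table, already sorted by n and then by descending count
toy = pd.DataFrame({
    "n": [1, 1, 1, 2, 2, 2],
    "ngram_text": ["the", "to", "wall", "the wall", "to the", "wall is"],
    "ngram_count": [9, 7, 4, 5, 3, 1],
})

# skip the single most frequent unigram (the first row after sorting)
trimmed = toy.iloc[1:]

# hypothetical per-n limits: keep 2 unigrams and 1 bigram
limits = {1: 2, 2: 1}
top = pd.concat([group.head(limits[n]) for n, group in trimmed.groupby("n")])
print(top)
```

Because the rows are sorted by n first, dropping rows from the top only ever removes unigrams as long as the number dropped is smaller than the number of unigrams.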

In [22]:
# helper function that cleans up the dataframes as outlined above
# df is the ngram dataframe, sorted by n and then by descending count
# num_mono is the number of entries to include for unigrams
# num_other is the number of entries to include for each of the other ngram sizes
# drop_index is the number of rows we want to drop from the top of the dataframe
def clean_ngram(df, num_mono, num_other, drop_index):
  # drop the first drop_index rows, i.e. the most common unigrams
  df = df.iloc[drop_index:]

  # take the most frequent entries for each value of n
  df1 = df[df['n'] == 1].head(num_mono)
  df2 = df[df['n'] == 2].head(num_other)
  df3 = df[df['n'] == 3].head(num_other)
  df4 = df[df['n'] == 4].head(num_other)
  df5 = df[df['n'] == 5].head(num_other)

  # concatenate the dataframes along the rows
  output_df = pd.concat([df1, df2, df3, df4, df5], axis=0)

  return output_df

In [23]:
num_entries_mono = 50
num_entries_other = 20
drop_index = 10

In [24]:
clean_ngram_years_dfs = [clean_ngram(df, num_entries_mono, num_entries_other, drop_index) 
                         for df in ngram_years_dfs]
clean_ngram_years_dfs[0]

Unnamed: 0,n,ngram_text,ngram_count,year
10,1,on,221,2019
11,1,we,217,2019
12,1,are,208,2019
13,1,you,196,2019
14,1,with,189,2019
...,...,...,...,...
2602,5,sen cruz said to the,3,2019
2749,5,or technique capable of causing,3,2019
2755,5,or incendiary device or technique,3,2019
2801,5,of any firearm explosive or,3,2019


In [25]:
clean_ngram_years_headlines_dfs = [clean_ngram(df, num_entries_mono, num_entries_other, drop_index) 
                                   for df in ngram_years_headlines_dfs]
clean_ngram_years_headlines_dfs[0]



The resulting dataframes for the other years are not shown here, to save space and keep the notebook readable.

## Write dataframes to Excel spreadsheet

In [None]:
!pip install xlsxwriter

In [26]:
# open an Excel writer for a new workbook; the file is saved when the block exits
with pd.ExcelWriter(using_dataset.output_path, engine="xlsxwriter") as writer:
    # write one sheet per year, leaving out the redundant "year" column
    for df in clean_ngram_years_dfs:
        year = str(df["year"].iloc[0])
        columns_to_save = [col for col in df.columns if col != "year"]
        df.to_excel(writer, sheet_name=year, columns=columns_to_save, index=False)

In [27]:
# open an Excel writer for the headlines workbook; the file is saved when the block exits
with pd.ExcelWriter(using_dataset.output_path_headlines, engine="xlsxwriter") as writer:
    # write one sheet per year, leaving out the redundant "year" column
    for df in clean_ngram_years_headlines_dfs:
        year = str(df["year"].iloc[0])
        columns_to_save = [col for col in df.columns if col != "year"]
        df.to_excel(writer, sheet_name=year, columns=columns_to_save, index=False)