# Notebook for Finding Ngrams

Using scikit-learn, we want to find the most frequent ngrams (sequences of n consecutive words) in the articles from each year.

This notebook currently processes the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" for a more consistent path), but it will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.
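
Since the vectorizer further down expects an iterable of strings, it can be worth checking the text column for empty cells before going on. The sketch below uses a made-up three-row dataframe (toy_df and its contents are assumptions for illustration, not rows from the Fakespeak sheet), and whether the real sheet has any missing body texts is not something this notebook has checked.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real sheet (hypothetical rows); the second row has no body text
toy_df = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "originalBodyText": ["Some claim about the wall", np.nan, "Another claim"],
    "originalDateYear": [2019, 2019, 2020],
})

# Count the missing texts, then keep only the rows that have one
n_missing = toy_df["originalBodyText"].isna().sum()
usable_df = toy_df.dropna(subset=["originalBodyText"])

# Cast to str so every element handed to the vectorizer is a string
texts = usable_df["originalBodyText"].astype(str).to_list()
print(n_missing, texts)
```

If n_missing comes back as 0 on the real data, this step can simply be skipped.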

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
import pandas as pd

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load in fakespeak dataset

In [4]:
# file path for the Fakespeak Excel sheet
# (named input_path so it doesn't shadow Python's built-in input())
input_path = '/content/drive/My Drive/fake_news_over_time/Fakespeak_ENG_modified.xlsx'

In [5]:
fakespeak_df = pd.read_excel(input_path, sheet_name="Working", usecols=['ID', 'combinedLabel', 'originalTextType', 'originalBodyText', 'originalDateYear'])

In [6]:
fakespeak_df.head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",2019
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,2019
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,2019
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,2019


## Extracting ngrams

Here we use sklearn's CountVectorizer class to count ngrams for n = 1 to 5. First, we filter fakespeak_df down to a single year, then count the ngrams in that year's articles and store them in a new dataframe.
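
To make the counting step concrete, here is a minimal sketch on a two-sentence toy corpus (the sentences and the toy_ names are made up for illustration). It uses ngram_range=(1, 2) to keep the output small. Note that CountVectorizer's default tokenizer lowercases the text and ignores single-character tokens such as "a", so an ngram can span a dropped word.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not from the Fakespeak data)
toy_docs = ["The wall is a big wall", "The wall is tall"]

# ngram_range=(1, 2) keeps unigrams and bigrams; the notebook uses (1, 5)
toy_vec = CountVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vec.fit_transform(toy_docs)  # rows = documents, columns = ngrams

# Sum each column to get the corpus-wide count of every ngram
toy_totals = np.asarray(toy_matrix.sum(axis=0)).ravel()

# vocabulary_ maps each ngram string to its column index
toy_counts = {ngram: int(toy_totals[col]) for ngram, col in toy_vec.vocabulary_.items()}

# Print the ngrams from most to least frequent
for ngram, count in sorted(toy_counts.items(), key=lambda item: -item[1]):
    print(count, ngram)
```

Here "is big" shows up as a bigram even though the original sentence reads "is a big", because "a" is dropped before the ngrams are formed.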

In [20]:
# helper function to find ngrams for articles from each year
def vectorize(df, year):
  # filter the dataframe
  year_df = df[df['originalDateYear'] == year]

  # initialize the vectorizer to count every ngram with n from 1 to 5
  c_vec = CountVectorizer(ngram_range=(1, 5))

  # input to fit_transform must be an iterable of strings
  ngrams = c_vec.fit_transform(year_df['originalBodyText'].to_list())

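  # vocabulary_ maps each ngram to its column index; it is only available after fitting
  vocab = c_vec.vocabulary_

  # total count of each ngram across all of the year's articles (column sums)
  count_values = ngrams.toarray().sum(axis=0)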

  # list to hold ngram rows that will be turned into a dataframe
  ngram_list = []

  for count, text in sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True):
    n = len(text.split())
    ngram_list.append([n, text, count])

  headers = ['n', 'ngram_text', 'ngram_count']
  ngram_df = pd.DataFrame(ngram_list, columns=headers)

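  # sort the dataframe by n, then by descending count
  # (the index keeps each ngram's overall frequency rank from the loop above)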
  ngram_df = ngram_df.sort_values(by=['n', 'ngram_count'], ascending=[True, False])

  return ngram_df
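
The toarray() call in vectorize builds a dense documents-by-ngrams array before summing, which can use a lot of memory when a year has many articles. As a possible alternative, the sketch below sums the sparse matrix directly and checks that both routes give the same totals. The two toy sentences and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not from the Fakespeak data)
docs = ["fake news spreads fast", "fake news is news"]

vec = CountVectorizer(ngram_range=(1, 2))
matrix = vec.fit_transform(docs)  # scipy sparse matrix, documents x ngrams

# Summing the sparse matrix directly avoids building the dense array
sparse_totals = np.asarray(matrix.sum(axis=0)).ravel()

# Dense route, as used in vectorize()
dense_totals = matrix.toarray().sum(axis=0)

print(np.array_equal(sparse_totals, dense_totals))
```

If memory ever becomes a problem on the larger years, count_values could be computed the sparse way without changing the rest of the function.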


In [42]:
# get ngrams for each year
ngram_19_df = vectorize(fakespeak_df, 2019)
ngram_20_df = vectorize(fakespeak_df, 2020)
ngram_21_df = vectorize(fakespeak_df, 2021)
ngram_22_df = vectorize(fakespeak_df, 2022)
ngram_23_df = vectorize(fakespeak_df, 2023)
ngram_24_df = vectorize(fakespeak_df, 2024)

## Prepare dataframes to output to spreadsheet
Currently, the dataframes hold every ngram found for n = 1 to 5, including ones that appear only once, which isn't very helpful (for reference, the unfiltered 2019 dataframe contains 115,090 entries). To address this, we keep only the 20 most frequent entries for each n from 2 to 5 (i.e. the top 20 bigrams, then the top 20 trigrams, etc. for each year).

The exception is monograms (n = 1), where we keep 50 entries: the most frequent single words tend to be common function words, so the results are more interesting when we look further down the list. For the same reason, we also drop the first 10 rows of each dataframe before selecting. Because the dataframes are sorted by n first, these rows are the 10 most frequent monograms. Both numbers can be adjusted.
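
As a small worked example of this selection, the sketch below applies the same idea to a seven-row toy table. The table contents and the helper name top_per_n are hypothetical; the helper only mirrors the logic described above, with a per-n limit passed as a dictionary.

```python
import pandas as pd

# Toy ngram table (made up), already sorted by n and then by descending count
toy_ngram_df = pd.DataFrame({
    "n": [1, 1, 1, 1, 2, 2, 2],
    "ngram_text": ["the", "to", "wall", "vote", "the wall", "to vote", "wall is"],
    "ngram_count": [90, 80, 12, 9, 7, 5, 2],
})

def top_per_n(df, limits, drop_index):
    """Hypothetical helper: skip the first drop_index rows, then keep limits[n] rows per n."""
    trimmed = df.iloc[drop_index:]
    parts = [trimmed[trimmed["n"] == n].head(k) for n, k in limits.items()]
    return pd.concat(parts, axis=0)

# Drop the 2 most frequent monograms, then keep 2 monograms and 2 bigrams
selected = top_per_n(toy_ngram_df, limits={1: 2, 2: 2}, drop_index=2)
print(selected)
```

Dropping two rows removes "the" and "to" but leaves the bigram list untouched, since the monograms sit at the top of the sorted table.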

In [76]:
# helper function that cleans up the dataframes as outlined above
# df is the ngram dataframe (sorted by n, then by descending count)
# num_mono is the number of entries to include for monograms
# num_other is the number of entries to include for each of the other ngram sizes
# drop_index is the number of rows to drop from the top of the dataframe
def clean_ngram(df, num_mono, num_other, drop_index):
  # drop the first drop_index rows, i.e. the most common monograms
  df = df.iloc[drop_index:]

  # select the top entries for each ngram size n
  df1 = df[df['n'] == 1].head(num_mono)
  df2 = df[df['n'] == 2].head(num_other)
  df3 = df[df['n'] == 3].head(num_other)
  df4 = df[df['n'] == 4].head(num_other)
  df5 = df[df['n'] == 5].head(num_other)

  # concatenate the dataframes along the rows
  output_df = pd.concat([df1, df2, df3, df4, df5], axis=0)

  return output_df

In [77]:
num_entries_mono = 50
num_entries_other = 20
drop_index = 10

clean_ngram_19_df = clean_ngram(ngram_19_df, num_entries_mono, num_entries_other, drop_index)
clean_ngram_20_df = clean_ngram(ngram_20_df, num_entries_mono, num_entries_other, drop_index)
clean_ngram_21_df = clean_ngram(ngram_21_df, num_entries_mono, num_entries_other, drop_index)
clean_ngram_22_df = clean_ngram(ngram_22_df, num_entries_mono, num_entries_other, drop_index)
clean_ngram_23_df = clean_ngram(ngram_23_df, num_entries_mono, num_entries_other, drop_index)
clean_ngram_24_df = clean_ngram(ngram_24_df, num_entries_mono, num_entries_other, drop_index)

In [79]:
clean_ngram_19_df.head()

Unnamed: 0,n,ngram_text,ngram_count
10,1,on,221
11,1,we,217
12,1,are,208
13,1,you,196
14,1,with,189


In [78]:
clean_ngram_20_df

Unnamed: 0,n,ngram_text,ngram_count
10,1,on,850
11,1,are,730
12,1,you,695
13,1,be,674
14,1,with,668
...,...,...,...
2435,5,live in nation that was,10
2458,5,is run by idiots if,10
2469,5,in nation that was founded,10
2677,5,related biological products advisory committee,9


We could show the resulting dataframes for the other years as well, but I've left them out to save space and keep the notebook readable.

## Write dataframes to Excel spreadsheet

In [None]:
!pip install xlsxwriter

In [80]:
# file path for output excel spreadsheet
output = '/content/drive/My Drive/fake_news_over_time/ngrams.xlsx'

In [81]:
# create an Excel writer for the new workbook
# (the with block saves and closes the output file on exit)
with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
  # write each year's dataframe to its own worksheet
  clean_ngram_19_df.to_excel(writer, sheet_name="2019", index=False)
  clean_ngram_20_df.to_excel(writer, sheet_name="2020", index=False)
  clean_ngram_21_df.to_excel(writer, sheet_name="2021", index=False)
  clean_ngram_22_df.to_excel(writer, sheet_name="2022", index=False)
  clean_ngram_23_df.to_excel(writer, sheet_name="2023", index=False)
  clean_ngram_24_df.to_excel(writer, sheet_name="2024", index=False)