# Notebook for Finding Ngrams

Using scikit-learn, we want to find the most commonly occurring ngrams (sequences of n consecutive words) in the articles for each year.

This notebook was written for the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" for a more consistent path), and is meant to run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
from dataset_config import BASE_FAKESPEAK_CONFIG, BASE_MISINFOTEXT_CONFIG
from helpers import get_groups, make_output_path, make_output_path_for_type

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Adam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Load in dataset

In [3]:
fakespeak_config = BASE_FAKESPEAK_CONFIG | {
    "headline_col": "originalHeadline",
    "usecols": BASE_FAKESPEAK_CONFIG["usecols"] + ["originalHeadline"]
}

misinfotext_config = BASE_MISINFOTEXT_CONFIG | {
    "headline_col": "originalHeadline",
}

In [4]:
using_dataset = misinfotext_config

In [5]:
dataset_df = pd.read_excel(
    using_dataset["input_path"], 
    sheet_name=using_dataset["sheet_name"], 
    usecols=using_dataset["usecols"]
)
dataset_df.head()

Unnamed: 0,factcheckURL,originalURL,originalBodyText,originalHeadline,originalTextType,originalDate,originalDateYear
0,http://www.politifact.com/arizona/statements/2...,https://associatedmediacoverage.com/three-stat...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016-05-06,2016
1,http://www.politifact.com/california/statement...,https://users.focalbeam.com/fs/distribution:wl...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016-07-12,2016
2,http://www.politifact.com/california/statement...,http://www.sacbee.com/opinion/op-ed/soapbox/ar...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017-08-04,2017
3,http://www.politifact.com/california/statement...,https://nocagastax.com/california-gas-tax-hike...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017-06-15,2017
4,http://www.politifact.com/california/statement...,https://chu.house.gov/media-center/press-relea...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017-05-04,2017


## Extracting ngrams

Here we use sklearn's CountVectorizer class to produce ngrams for n = 1 to 5. First, we split dataset_df by year, then find the ngrams for each year and create new dataframes to hold them.
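To make concrete what is being counted, here is a minimal plain-Python sketch of n-gram counting. It is an approximation for illustration, not CountVectorizer itself: it lowercases the text and keeps tokens of two or more word characters, which mirrors CountVectorizer's default settings. The names `count_ngrams`, `toy_docs` and `toy_counts` are made up for this example.

```python
import re
from collections import Counter

# Tokens are runs of 2+ word characters, as in CountVectorizer's default pattern
TOKEN_RE = re.compile(r"\b\w\w+\b")

def count_ngrams(docs, n_min=1, n_max=5):
    """Count every n-gram with n_min <= n <= n_max across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = TOKEN_RE.findall(doc.lower())
        for n in range(n_min, n_max + 1):
            # slide a window of n tokens over the document
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return counts

# hypothetical toy corpus, not taken from the dataset
toy_docs = ["The senator met the troops.", "The troops met a senator."]
toy_counts = count_ngrams(toy_docs, n_max=2)
print(toy_counts.most_common(3))
```

Note that n-grams never span two documents, and that the one-letter word "a" is dropped by the token pattern, so "met senator" is counted as a bigram in the second toy document.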

In [6]:
# helper function to find ngrams for articles from each year
def get_ngram_counts(df: pd.DataFrame, col: str):
  try:
    # initialize the vectorizer to count 1-grams through 5-grams
    c_vec = CountVectorizer(ngram_range=(1, 5))

    # input to fit_transform must be an iterable of strings
    ngrams = c_vec.fit_transform(df[col].to_list())

    # vocabulary_ maps each ngram to its column index; it only exists after fit_transform
    vocab = c_vec.vocabulary_

    count_values = ngrams.toarray().sum(axis=0)

    # list to hold ngram rows that will be turned into a dataframe
    ngram_list = []

    for count, text in sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True):
      n = len(text.split())
      ngram_list.append([n, text, count])

    headers = ['n', 'ngram_text', 'ngram_count']
    ngram_df = pd.DataFrame(ngram_list, columns=headers)

    # sort by n ascending, then by count descending within each n
    ngram_df = ngram_df.sort_values(by=['n', 'ngram_count'], ascending=[True, False])

    return ngram_df
  except Exception as e:
    print("Error getting ngram counts")
    print(e)

    # Return empty dataframe
    return pd.DataFrame({
      "n": [],
      "ngram_text": [],
      "ngram_count": [],
    })

## Prepare dataframes to output to spreadsheet
Currently, the dataframes hold every n-gram found, including ones composed entirely of stop words. We are not interested in those, so we only keep n-grams that contain at least one word that is not a stop word.

Furthermore, we only keep the first 20 entries for each n from 2 to 5 (i.e. the first 20 bigrams, then the first 20 trigrams, etc. for each year). The exception is unigrams, where we keep the first 50 entries: many of them are common words, so the results are more interesting with a broader selection. Because the dataframes are sorted by count in descending order, these are the most common n-grams, which are the ones we care about most.
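The filtering and truncation described above can be sketched on a toy dataframe as follows. This is an illustration only: `TOY_STOP_WORDS`, `has_content_word`, `top_ngrams_per_n` and the toy data are hypothetical, and the stop list is a tiny stand-in for NLTK's English list.

```python
import pandas as pd

# Tiny illustrative stop list (assumption); the notebook uses NLTK's English list
TOY_STOP_WORDS = {"the", "of", "in", "to"}

def has_content_word(ngram_text):
    """True if at least one token in the n-gram is not a stop word."""
    return any(tok not in TOY_STOP_WORDS for tok in ngram_text.split())

def top_ngrams_per_n(ngram_df, limits):
    """Drop all-stop-word n-grams, then keep the top `limits[n]` rows per n."""
    kept = ngram_df[ngram_df["ngram_text"].apply(has_content_word)]
    kept = kept.sort_values(["n", "ngram_count"], ascending=[True, False])
    parts = [kept[kept["n"] == n].head(k) for n, k in limits.items()]
    return pd.concat(parts, ignore_index=True)

# hypothetical counts, not taken from the dataset
toy_df = pd.DataFrame({
    "n": [1, 1, 1, 2, 2, 2],
    "ngram_text": ["the", "troops", "senator", "of the", "the troops", "troops surge"],
    "ngram_count": [9, 3, 2, 5, 4, 1],
})
toy_top = top_ngrams_per_n(toy_df, limits={1: 2, 2: 1})
print(toy_top)
```

Passing the limits as a dictionary keyed by n is a design choice for this sketch; it makes the "50 unigrams, 20 of everything else" rule a single argument such as `{1: 50, 2: 20, 3: 20, 4: 20, 5: 20}`.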

In [7]:
# a set gives constant-time membership checks in is_all_stop_words
stop_words = set(stopwords.words('english'))

In [8]:
def is_all_stop_words(text: str):
  tokens = word_tokenize(text)
  return all(token in stop_words for token in tokens)

# helper function that cleans up the dataframes as outlined above
# df is the ngram dataframe
# num_mono is the number of entries to keep for unigrams
# num_other is the number of entries to keep for each of the other ngram sizes
# drop_index is currently unused; nothing is dropped from the top of the dataframe
def clean_ngram(df: pd.DataFrame, num_mono=50, num_other=20, drop_index=10):
  # Get rid of n-grams that are composed entirely of stop words

  # Pass through the empty df
  if df.shape[0] == 0:
    return df

  df = df[~df["ngram_text"].apply(is_all_stop_words)]

  # keep the top rows for each n (rows are already sorted by count, descending)
  df1 = df[df['n'] == 1].head(num_mono)
  df2 = df[df['n'] == 2].head(num_other)
  df3 = df[df['n'] == 3].head(num_other)
  df4 = df[df['n'] == 4].head(num_other)
  df5 = df[df['n'] == 5].head(num_other)

  # concatenate the dataframes along the rows
  output_df = pd.concat([df1, df2, df3, df4, df5], axis=0)

  return output_df

In [9]:
def get_ngram_years_dfs(df: pd.DataFrame):
    years, years_dfs = get_groups(df, using_dataset["year_col"])
    # drop rows without a headline before counting headline ngrams
    headline_years_dfs = [year_df[~year_df[using_dataset["headline_col"]].isna()] for year_df in years_dfs]
    
    ngrams_text_years_dfs = [clean_ngram(get_ngram_counts(year_df, using_dataset["text_col"])) for year_df in years_dfs]
    ngrams_headline_years_dfs = [clean_ngram(get_ngram_counts(year_df, using_dataset["headline_col"])) for year_df in headline_years_dfs]
    
    return years, ngrams_text_years_dfs, ngrams_headline_years_dfs

In [10]:
years, ngrams_text_years_dfs, ngrams_headline_years_dfs = get_ngram_years_dfs(dataset_df)

In [11]:
ngrams_text_years_dfs[0].head()

Unnamed: 0,n,ngram_text,ngram_count
4,1,troops,3
6,1,senator,3
11,1,withdrawal,2
17,1,surge,2
20,1,new,2


In [12]:
ngrams_headline_years_dfs[0].head()

Unnamed: 0,n,ngram_text,ngram_count
4,1,statement,1
11,1,mccain,1
16,1,john,1
18,1,hillary,1
19,1,clinton,1


The dataframes for the other years can be inspected the same way; I've left them out here to save space and keep the notebook readable.

## Write dataframes to excel spreadsheet

In [13]:
output_path = make_output_path(using_dataset, "ngrams")

# one sheet per year; the context manager saves and closes the workbook on exit
with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
    for year, df in zip(years, ngrams_text_years_dfs):
        df.to_excel(writer, sheet_name=str(year), index=False)

In [14]:
output_path = make_output_path(using_dataset, "ngrams_headlines")

# one sheet per year; the context manager saves and closes the workbook on exit
with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
    for year, df in zip(years, ngrams_headline_years_dfs):
        df.to_excel(writer, sheet_name=str(year), index=False)

Now run the same analysis separately for each text type.

In [15]:
types, types_dfs = get_groups(dataset_df, using_dataset["type_col"])

In [16]:
# text_type / type_df avoid shadowing the built-in `type` and the inner loop variable
for text_type, type_df in zip(types, types_dfs):
    years, ngrams_text_years_dfs, ngrams_headline_years_dfs = get_ngram_years_dfs(type_df)

    output_path = make_output_path_for_type(using_dataset, text_type, "ngrams")

    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        for year, year_df in zip(years, ngrams_text_years_dfs):
            year_df.to_excel(writer, sheet_name=str(year), index=False)

    output_path = make_output_path_for_type(using_dataset, text_type, "ngrams_headlines")

    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        for year, year_df in zip(years, ngrams_headline_years_dfs):
            year_df.to_excel(writer, sheet_name=str(year), index=False)

Error getting ngram counts
empty vocabulary; perhaps the documents only contain stop words
Error getting ngram counts
empty vocabulary; perhaps the documents only contain stop words
Error getting ngram counts
empty vocabulary; perhaps the documents only contain stop words
Error getting ngram counts
empty vocabulary; perhaps the documents only contain stop words
Error getting ngram counts
empty vocabulary; perhaps the documents only contain stop words
Error getting ngram counts
empty vocabulary; perhaps the documents only contain stop words
