# Project 1: Language Modeling and Fake Review Classification

Johann Lee and Altria Wang

Outline: 
  - **Data Processing**
  - **Unsmoothed & Smoothed N-gram Models**
  - **Perplexity Calculation** 
  - **Naive Bayes Classifier** 

# Part 1: Preprocessing the Dataset
In this part, you are going to do a few things:
* Connect to the google drive where the data set is stored
* Load and read files
* Preprocess the text

------
**Please upload the dataset to each partner's individual Google Drive now.** We suggest using the same folder structure within Google Drive because the notebook is shared among you, so the code to load the data would have to be changed every time if folder structures are different. One folder structure might be: Google Drive/CS 4740/Project 1/Dataset/ or whatever works for you. See our code below for an example of how we load the data from Google Drive.

## 1.1 Connect to google drive

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## 1.2 Load and read files
First, let's install [NLTK](https://www.nltk.org/), a very widely package for NLP preprocessing (and other tasks) for Python.

In [None]:
!pip install -U nltk tqdm
!pip install transformers

Then we read and load data.

In [None]:
import os
import csv
import io
from nltk import word_tokenize, sent_tokenize
import nltk
from tqdm.notebook import tqdm
nltk.download('punkt')

real_review_train = []
real_review_validation = []
fake_review_train = []
fake_review_validation = []

def load_real_fake_dataset(filename):
    real = []
    fake = []
    with open(filename) as fp:
        csvreader = csv.reader(fp, delimiter="|")
        for txt, label in csvreader:
            label = int(label)
            if label:
                fake.append(txt)
            else:
                real.append(txt)
    
    return real, fake

real_review_train, fake_review_train = load_real_fake_dataset("P1_real_fake_review_train.txt")
real_review_validation, fake_review_validation = load_real_fake_dataset("P1_real_fake_review_val.txt")

test = open("P1_real_fake_review_test.txt")
csv_reader = csv.reader(test)


def tokenize_reviews(reviews):
    return [
        [
            word.lower() for sent in sent_tokenize(review)
            for word in word_tokenize(sent)
        ]
        for review in tqdm(reviews, leave=False)
    ]

textdata = (tokenize_reviews(test))
tokenized_real_review_training = tokenize_reviews(real_review_train)
tokenized_fake_review_training = tokenize_reviews(fake_review_train)
tokenized_real_review_validation = tokenize_reviews(real_review_validation)
tokenized_fake_review_validation = tokenize_reviews(fake_review_validation)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


0it [00:00, ?it/s]

  0%|          | 0/642 [00:00<?, ?it/s]

  0%|          | 0/638 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

Sanity checks for our real and fake training sets

In [None]:
tokenized_real_review_training[0]

In [None]:
tokenized_fake_review_training[0]

## 1.3 Data Preprocessing & Preparation

There's a well-known parable in machine learning that 80% of the work is all about data preparation, 10% is supporting infrastructure and 10% is actual modeling. If your "raw" dataset is not preprocessed and prepared in a way to maximize its value, then your model will be more like this: https://xkcd.com/1838/. For this project, modeling is the star of the show for learning purposes, but we still want you to pay attention to the preprocessing stage.

*We've already tokenized and lowercased* the raw data for you. We have not added a start of sentence token but feel free to do so (it is not neccessary). Here are a few extra things you might want to do:

- Think about edge cases. For example, you don't want to accidentally append a period to the last word of a sentence. 
- Watch out for apostrophes and other tricky things like quotations, they cause lots of edge cases. For example, "they're" can be all one token, or two tokens ("they", "'re") or even three tokens ("they", " ' ", "re"). 

Why did we lowercase all tokens? Because the computer will otherwise consider "The" and "the" as two separate words and this will cause problems.

Note that you may use existing
tools just for the purpose of preprocessing. 

Advice: don't get bugged down in the dozens of preprocessing packages and suggestions that you can find on Towards Data Science or Stack Overflow. Start with this [NLTK tutorial](https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/introduction/nochunks.html#:~:text=The%20Natural%20Language%20Toolkit%20(NLTK,tokenization%2C%20tagging%2C%20and%20parsing.) and that should be plenty.

In [None]:
from nltk import *
from nltk import pos_tag
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')


def tokenize_reviews_update(reviews):
    return [
        [
            word.lower() for sent in sent_tokenize(review)
            for word in word_tokenize(sent)
        ]
         for review in tqdm(reviews, leave=False)
    ]


tokenized_real_review_training = tokenize_reviews_update(real_review_train)
tokenized_fake_review_training = tokenize_reviews_update(fake_review_train)
tokenized_real_review_validation = tokenize_reviews_update(real_review_validation)
tokenized_fake_review_validation = tokenize_reviews_update(fake_review_validation)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


  0%|          | 0/642 [00:00<?, ?it/s]

  0%|          | 0/638 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:
# Lemmatization 
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_personalized(reviews):
  returning_list = []
  for review in tqdm(reviews, leave=False):
    posed = pos_tag(review)
    temp = []
    for (word, pos) in posed:
      lem = lemmatizer.lemmatize(word)
      temp.append(lem)
    returning_list.append(temp)
  return returning_list


lemmatized_real_review_training = lemmatize_personalized(tokenized_real_review_training)
lemmatized_fake_review_training = lemmatize_personalized(tokenized_fake_review_training)
lemmatized_real_review_validation = lemmatize_personalized(tokenized_real_review_validation)
lemmatized_fake_review_validation = lemmatize_personalized(tokenized_fake_review_validation)

# print(lemmatized_fake_review_training)

In [None]:
# Stop words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

stopword = set(stopwords.words('english'))

def remove_stopwords(reviews):
  data = []
  for review in tqdm(reviews, leave=False):
    cur_review = []
    for word in review: 
      if word not in stopword:
        cur_review.append(word)
    data.append(cur_review)
  return data

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
removed_fake_train = remove_stopwords(lemmatized_fake_review_training)
removed_real_train = remove_stopwords(lemmatized_real_review_training)
removed_fake_valid = remove_stopwords(lemmatized_fake_review_validation)
removed_real_valid = remove_stopwords(lemmatized_real_review_validation)

  0%|          | 0/638 [00:00<?, ?it/s]

  0%|          | 0/642 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

# Part 2: Compute Unsmoothed Language Models.




## 2.1 Unsmoothed Uni-gram Model.

In [None]:
"""
Function [unsmoothed_unigram] computes the probabilities for a unigram model
lst: a list of words in a sentence
Return: [data structure of your choice] that stores the result
"""
def unsmoothed_unigram(lst):
  unigram_probs = {}
  word_count = 0
  unigram_probs['<s>'] = 1
  for document in lst:
    unigram_probs['<s>'] += 1
    for word in document:
      word_count += 1
      if word in unigram_probs:
        unigram_probs[word] = unigram_probs[word] + 1
      else:
        unigram_probs[word] = 1
  for word in unigram_probs:
    unigram_probs[word] = unigram_probs[word] / word_count
  return unigram_probs

In [None]:
temp = unsmoothed_unigram(lemmatized_fake_review_training)
# print(temp)

## 2.2 Unsmoothed Bi-gram Model.
$p(w_n\mid w_{n-1})=\frac{C(w_{n-1}w_n)}{C(w_{n-1})}$ means we might want to store two things (count of $w_{n-1}$ and count of $w_{n-1}w_n$).

In [None]:
def uni_bi_counts(lst):
  word_counts = {}
  biword_counts = {}
  
  for document in lst:
    prev_word = "<s>"
    if prev_word in word_counts:
      word_counts[prev_word] = word_counts[prev_word] + 1
    else: word_counts[prev_word] = 1
    for word in document:
      #wn
      if word in word_counts:
        word_counts[word] = word_counts[word] + 1
      else:
        word_counts[word] = 1

      #wn-1 wn count
      if (prev_word, word) in biword_counts:
        biword_counts[(prev_word, word)] = biword_counts[(prev_word, word)] + 1
      else:
        biword_counts[(prev_word, word)] = 1

      prev_word = word
  return word_counts, biword_counts

def unsmoothed_bigram(word_counts, biword_counts):
  for (prev_word, word) in biword_counts:
    #turning counts into prob
    biword_counts[(prev_word, word)] = biword_counts[(prev_word, word)] / word_counts[prev_word]

  return biword_counts

In [None]:
uni, bi = uni_bi_counts(tokenized_fake_review_training)

In [None]:
uni, bi = uni_bi_counts([["the","best", "latte","in","ithaca"]])
print(uni)
print(bi)

{'<s>': 1, 'the': 1, 'best': 1, 'latte': 1, 'in': 1, 'ithaca': 1}
{('<s>', 'the'): 1, ('the', 'best'): 1, ('best', 'latte'): 1, ('latte', 'in'): 1, ('in', 'ithaca'): 1}


# Part 3: Smoothed Language Model
We handled unknown words by associating the token <unk> with a probability. This probability was determined by replacing the first instances of unique word types with the <unk> label, then calculating k-gram models as if <unk> was any other word. During the test phase, any unknown words would be calculated with this <unk> probability instead. 

We chose add-k smoothing with k = 1 as a starting point. Since for any integer of k(1=< k =<length of the review), k = 1 will minimize the perplexity.

## 3.1 Unknown Words Handling



In [None]:
def unknownHandled_unigram(lst):
  unigram_probs = {}
  word_count = 0
  unigram_probs['<s>'] = 0
  for document in lst:
    unigram_probs['<s>'] += 1
    for word in document:
      word_count += 1
      if word in unigram_probs:
        unigram_probs[word] = unigram_probs[word] + 1
      else:
        unigram_probs[word] = 1
        if '<unk>'in unigram_probs.keys():
          unigram_probs['<unk>'] += 1
        else:          
          unigram_probs['<unk>'] = 1
  for word in unigram_probs:
    unigram_probs[word] = unigram_probs[word] / word_count
  return unigram_probs

In [None]:
handledFakeUni = unknownHandled_unigram(removed_fake_train)
handledRealUni = unknownHandled_unigram(removed_real_train)

In [None]:
def unknownHandled_bigram(lst):
  word_counts = {}
  biword_counts = {}
  
  for document in lst:
    prev_word = "<s>"
    if prev_word in word_counts:
      word_counts[prev_word] = word_counts[prev_word] + 1
    else: word_counts[prev_word] = 1
    for word in document:
      gate = 0
      #wn
      if word in word_counts:
        word_counts[word] = word_counts[word] + 1
      else:
        word_counts[word] = 1
        if '<unk>'in word_counts.keys():
          word_counts['<unk>'] += 1
        else:          
          word_counts['<unk>'] = 1
          gate = 1


      #wn-1 wn count
      if (prev_word, word) in biword_counts:
        biword_counts[(prev_word, word)] = biword_counts[(prev_word, word)] + 1
      else:
        biword_counts[(prev_word, word)] = 1

        
        if ('<unk>', word) in biword_counts.keys():
          biword_counts[('<unk>', word)] += 1
        else:
          biword_counts[('<unk>', word)] = 1
        if (prev_word, '<unk>') in biword_counts.keys():
          biword_counts[(prev_word, '<unk>')] += 1
        else:  
          biword_counts[(prev_word, '<unk>')] = 1          
        if ('<unk>', '<unk>') in biword_counts.keys():
          biword_counts[('<unk>', '<unk>')] += 1
        else:  
          biword_counts[('<unk>', '<unk>')] = 1

      if gate == 0: 
        prev_word = word
      else: prev_word = '<unk>'

  return word_counts, biword_counts

def handled_bigram(word_counts, biword_counts):
  for (prev_word, word) in biword_counts:
    #turning counts into prob
    biword_counts[(prev_word, word)] = biword_counts[(prev_word, word)] / word_counts[prev_word]

  return biword_counts

In [None]:
uni_f, bi_f = unknownHandled_bigram(removed_fake_train)
uni_r, bi_r = unknownHandled_bigram(removed_real_train)

handledFakeBi = handled_bigram(uni_f, bi_f)
handledRealBi = handled_bigram(uni_r, bi_r)


In [None]:
# handledRealBi

## 3.2 Smoothing: Add-k

Try later if have time:
* Kneser-Ney
* Good-Turing

In [None]:
"""
dic: a dictionary of your unigrams. key: words, val: occurence
k: parameter k for smoothing
Return: a dictionary of results after smoothing
"""
def add_k_unigram(dic, k):
  v = len(dic)
  for word in dic:
    dic[word] = (dic[word] + k) / (dic[word] + k * v)
  return dic

In [None]:
uni_dic = unsmoothed_unigram(tokenized_fake_review_training)
# print(add_k_unigram(uni_dic, 2))

In [None]:
"""
uni_dic: a dictionary of your unigrams.
bi_dic: a dictionary of your bigrams.
k: parameter k for smoothing
Return: a dictionary of results after smoothing
"""
def add_k_bigram(uni_dic, bi_dic, k):
  v = len(uni_dic)
  for (prev_word, word) in bi_dic:
    bi_dic[(prev_word, word)] = (bi_dic[(prev_word, word)]+k) / (uni_dic[prev_word] + k * v)

  return bi_dic

In [None]:
uni, bi = uni_bi_counts(tokenized_fake_review_training)
# print(add_k_bigram(uni, bi, 2))

# Part 4: Perplexity
Perplexity defined as follows:
\begin{align*}
PP &= \left(\prod_i^N\frac{1}{P\left(W_i\mid W_{i-1}, ...W_{i-n+1}\right)}\right)^{\frac{1}{N}}\\
&=\exp \frac{1}{N}\sum_{i}^N-\log P\left(W_i\mid W_{i-1}, ...W_{i-n+1}\right)
\end{align*}
where $N$ is the total number of tokens in the test corpus and $P\left(W_i\mid W_{i-1}, ...W_{i-n+1}\right)$
is the n-gram probability of the model. Note that under the second definition above, perplexity is a function of the average (per-word) log probability and that lower perplexity means a better model.

## Task 1: Compute perplexity for smoothed unigram and smoothed bigram. 

In [None]:
# Compute perplexity for one smoothing method on unigram,
import numpy as np
def perplexity_uni(unigram):
  perplexity_score = 0
  for prob in unigram.values():
    perplexity_score += np.log(prob)
  perplexity_score = perplexity_score/len(unigram.values())
  return np.power(2, -1 * perplexity_score)

In [None]:
def choose_smallest_uni(review, n):
  smallest_perplexity_n = 0
  smallest_perplexity = 1000000
  while (n > 0):
    if perplexity_uni(add_k_unigram(review, n)) < smallest_perplexity:
      smallest_perplexity = perplexity_uni(add_k_unigram(review, n))
      smallest_perplexity_n = n
    n = n - 1
  return smallest_perplexity, smallest_perplexity_n

In [None]:
#Perplexity for lemmatized fake review
print(choose_smallest_uni((unsmoothed_unigram(lemmatized_fake_review_training)), len(unsmoothed_unigram(lemmatized_fake_review_training))))
print(choose_smallest_uni((unsmoothed_unigram(tokenized_fake_review_training)), len(unsmoothed_unigram(tokenized_fake_review_training))))

(387.69362772439143, 1)
(411.0630838173839, 1)


In [None]:
new2 = unsmoothed_unigram(removed_fake_train)
perplexity_uni((add_k_unigram(new2,1)))

381.72886284755924

In [None]:
# Smoothing method on bigram.
import numpy as np 
def perplexity_bi(bigram):
  perplexity_score = 0
  for prob in bigram.values():
    perplexity_score += np.log(prob)
  perplexity_score = perplexity_score/len(bigram.values())
  return np.power(2, -1 * perplexity_score)

In [None]:
uni = unknownHandled_unigram(tokenized_fake_review_training)
uni_inp, bi_inp = unknownHandled_bigram(tokenized_fake_review_training)
bi = handled_bigram(uni_inp, bi_inp)
perplexity_bi(add_k_bigram(uni, bi, 1))

366.3653018348878

In [None]:
uni = unknownHandled_unigram(lemmatized_fake_review_training)
uni_inp, bi_inp = unknownHandled_bigram(lemmatized_fake_review_training)
bi = handled_bigram(uni_inp, bi_inp)
print(perplexity_bi(add_k_bigram(uni, bi, 1)))

uni = unknownHandled_unigram(removed_fake_train)
uni_inp, bi_inp = unknownHandled_bigram(removed_fake_train)
bi = handled_bigram(uni_inp, bi_inp)
print(perplexity_bi(add_k_bigram(uni, bi, 1)))

347.269607141579
338.3709404074856


In [None]:
# removed_fake_train[0]

In [None]:
import pandas as pd
best_Fakebigram_model = add_k_bigram(handledFakeUni, handledFakeBi, 1)
best_Realbigram_model = add_k_bigram(handledRealUni, handledRealBi, 1)

df = pd.read_csv("P1_real_fake_review_test.txt", delimiter="\n")
df['Id Text']=df['Id Text'].str[3:-1]
df.head()
tt = df['Id Text'].to_list()
test_input = remove_stopwords(lemmatize_personalized(tokenize_reviews_update(tt)))

def use_bigram(bigramReal,bigramFake, lst):
  predictions = []
  for document in lst:
    prev_word = "<s>"
    real_prob = 1
    fake_prob = 1
    for word in document:
      if (prev_word, word) in bigramReal:
        real_prob *= bigramReal[prev_word, word]
      else:
        if ('<unk>', word) in bigramReal.keys():
          real_prob *= bigramReal['<unk>', word]
        elif (prev_word, '<unk>') in bigramReal.keys():
          real_prob *= bigramReal[(prev_word, '<unk>')]
        else:  
          real_prob *= bigramReal[('<unk>', '<unk>')]



      if (prev_word, word) in bigramFake:
        fake_prob *= bigramFake[prev_word, word]
        # print(bigram[prev_word, word])
      else:
        if ('<unk>', word) in bigramFake.keys():
          fake_prob *= bigramFake['<unk>', word]
        elif (prev_word, '<unk>') in bigramFake.keys():
          fake_prob *= bigramFake[(prev_word, '<unk>')]
        else:  
          fake_prob *= bigramFake[('<unk>', '<unk>')]
      prev_word = word
    pred = 0 if real_prob >= fake_prob else 1
    # P(real | w1n) > P(fake | w1n) iff P(w1n| real)* P(real)/P(w1n) > P(w1n | fake) * P(fake)/P(w1n)
    predictions.append(pred)
  return predictions
bi_pred = use_bigram(best_Realbigram_model, best_Fakebigram_model, test_input)

  0%|          | 0/160 [00:00<?, ?it/s]

  0%|          | 0/160 [00:00<?, ?it/s]

  0%|          | 0/160 [00:00<?, ?it/s]

In [None]:
output2 = pd.DataFrame()
output2['Id'] = pd.Series(range(len(bi_pred)))
output2['Prediction'] = pd.Series(bi_pred)
output2.to_csv('zw238_jcl354_bi1.csv', index=False)

# Part 6: Naive Bayes

For review *d* and its label *c* (either 0 or 1).
\begin{align*}
P(c|d)=\frac{P(d|c)P(c)}{P(d)}
\end{align*}
Likelihood: $P(d|c)$. In real/deception corpus, how likely *d* would appear.

Prior: $P(c)$. The probability of real/deceptive reviews in general.

Posterior: $P(c|d)$. Given *d*, how likely is it that it is real/deceptive.

Goal: $\underset{c\in \{0,1\}}{\operatorname{argmax}} P(c|d)$, which is equivalent to $\underset{c\in \{0,1\}}{\operatorname{argmax}} P(d|c)P(c)$.

The equivalence holds because $P(d)$ is the same for any $c$. Thus the denominator can be dropped.

Denote $d=\{x_1, x_2, ..., x_n\}$ where $x_i$'s are words in the reviews *d*. Unlike n-gram language modelling, we make the multinomial Naive Bayes independence assumption here, where we assume positions of words do not matter. Formally, 
\begin{align*}
&\underset{c\in \{0,1\}}{\operatorname{argmax}} P(d|c)P(c)\\
=&\underset{c\in \{0,1\}}{\operatorname{argmax}} P(x_1, ..., x_n|c)P(c)\\
=&\underset{c\in \{0,1\}}{\operatorname{argmax}} P(x_1|c)P(x_2|c)...P(x_n|c)
\end{align*}

We will collect the occurences of each word for the classification (Bag Of Words).

## 6.1 Implementation

In [None]:
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

counts = CountVectorizer()

def load_dataset(filename):
    reviews = []
    labels = []
    with open(filename) as fp:
        csvreader = csv.reader(fp, delimiter="|")
        for txt, label in csvreader:
            labels.append(int(label))
            reviews.append(txt)
    return reviews, labels


reviews_train = []
labels_train = []
reviews_train, labels_train = load_dataset("P1_real_fake_review_train.txt")
counts_train = counts.fit_transform(reviews_train)

reviews_test = []
reviews_test, labels_test = load_dataset("P1_real_fake_review_val.txt")
counts_test = counts.transform(reviews_test)


print(counts_test.shape[1] == counts_train.shape[1])

True


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def nothing(doc):
  return doc
counts2 = CountVectorizer(tokenizer = nothing, preprocessor = nothing)
        
def load_dataset_al(real_reviews, fake_reviews):
    reviews = []
    labels = []

    n = len(real_reviews)
    for review in real_reviews:
      reviews.append(review)
    while(n > 0):
      labels.append(1);
      n = n - 1;
   
    n2 = len(fake_reviews)
    for review2 in fake_reviews:
      reviews.append(review2)
    while(n2 > 0):
      labels.append(0);
      n2 = n2 - 1;
    return reviews, labels

In [None]:
review_al, labels_al = load_dataset_al(lemmatized_real_review_training, lemmatized_fake_review_training)
reviews_al2 = counts2.fit_transform(review_al)
test_al, labelss_al = load_dataset_al(lemmatized_real_review_validation, lemmatized_fake_review_validation)
test_al1 = counts2.transform(test_al)

In [None]:
nb2 = MultinomialNB(alpha=1) 
nb2.fit(reviews_al2, labels_al)
nb2

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

In [None]:
# Accuracy of lemmatization 
def get_accuracy(preds, labels):
  return ((preds == labels).sum()) / len(labels)

preds_test = nb2.predict(test_al1)
accuracy = get_accuracy(preds_test, labelss_al)

print(accuracy)

0.8875


In [None]:
train_al, labels_train_al = load_dataset_al(removed_real_train, removed_fake_train)
train_stop = counts2.fit_transform(train_al)
test_stop, labels_stop_al = load_dataset_al(removed_real_valid, removed_fake_valid)
test_stop_final = counts2.transform(test_stop)

In [None]:
nb_stop = MultinomialNB(alpha = 1)
nb_stop.fit(train_stop, labels_train_al)
nb_stop

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

In [None]:
def get_accuracy(preds, labels):
  return ((preds == labels).sum()) / len(labels)

stop_pred = nb_stop.predict(test_stop_final)
accuracy = get_accuracy(stop_pred, labels_stop_al)

print(accuracy)

0.875


In [None]:
nb = MultinomialNB(alpha=1) # 1 smoothing
nb.fit(counts_train, labels_train)
nb

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

In [None]:
def get_accuracy(preds, labels):
  return ((preds == labels).sum()) / len(labels)

preds_test = nb.predict(counts_test)
accuracy = get_accuracy(preds_test, labels_test)

print(accuracy)

0.90625


In [None]:
!pip install swifter

In [None]:
import numpy as np
import pandas as pd

import torch
import transformers
import swifter

# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#https://soumilshah1995.blogspot.com/2021/04/using-bert-with-scikit-learn-to-do-text.html
class BertTokenizer(object):

    def __init__(self, text=[]):
        self.text = text

        # For DistilBERT:
        self.model_class, self.tokenizer_class, self.pretrained_weights = (transformers.DistilBertModel, transformers.DistilBertTokenizer, 'distilbert-base-uncased')

        # Load pretrained model/tokenizer
        self.tokenizer = self.tokenizer_class.from_pretrained(self.pretrained_weights)
        self.model = self.model_class.from_pretrained(self.pretrained_weights)

    def get(self):
        # df = pd.DataFrame(data={"text":self.text})
        tokenized = df["text"].swifter.apply((lambda x: self.tokenizer.encode(x, add_special_tokens=True)))

        max_len = 0
        for i in tokenized.values:
            if len(i) > max_len:
                max_len = len(i)
        # max_len = 1000
        print(max_len)
        padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

        attention_mask = np.where(padded != 0, 1, 0)
        input_ids = torch.tensor(padded)
        attention_mask = torch.tensor(attention_mask)

        print('got here')
        with torch.no_grad(): last_hidden_states = self.model(input_ids, attention_mask=attention_mask)
        print("get here 2")
        features = last_hidden_states[0][:, 0, :].numpy()

        return features

In [None]:
import pandas as pd
df = pd.read_csv("P1_real_fake_review_train.txt", delimiter = "|", names=["text", "label"])
df["text"] = df['text'].str.slice(0,500)
df.head()

In [None]:
df['text'][0]

In [None]:
init = BertTokenizer(text=df)

In [None]:
features = init.get()

In [None]:
nbBERT = MultinomialNB(alpha=1) # 1 smoothing
nbBERT.fit(features, labels_train)
preds_test = nbBERT.predict(counts_test)
accuracy = get_accuracy(preds_test, labels_test)

print(accuracy)

## Submitting to Kaggle

In [None]:
import pandas as pd
df = pd.read_csv("P1_real_fake_review_test.txt", delimiter="\n")
df['Id Text']=df['Id Text'].str[3:-1]
df.head()
tt = df['Id Text'].to_list()
final = counts.transform(tt)
final

In [None]:
output = pd.DataFrame()
pred = nb.predict(final)
output['Id'] = pd.Series(range(len(pred)))
output['Prediction'] = pd.Series(pred)
output.to_csv('zw238_jcl354.csv', index=False)

In [None]:
%%capture
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc
!pip install pypandoc

In [None]:
%%capture
# the red text is a placeholder! Change it to your directory structure!
!cp 'My Drive/nlp_proj1/CS4740_FA21_p1_zw238_jcl354.ipynb' ./ 

In [None]:
# the red text is a placeholder! Change it to the name of this notebook!
!jupyter nbconvert  "CS4740_FA21_p1_zw238_jcl354.ipynb" --to PDF