<a href="https://colab.research.google.com/github/dasmiq/cs6200-hw3/blob/main/classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification

In class, we spent some time on text classification, including naive Bayes classifiers.  We focused on these models not only because they are simple to implement and fairly effective, but also because of their similarity to widely used bag-of-words retrieval models such as BM25 and query likelihood.

Your task is to write a naive Bayes text categorization system to predict whether movie reviews are positive or negative.  The data for this **sentiment analysis** task were first assembled and published in Bo Pang and Lillian Lee, &ldquo;A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts&rdquo;, _Proceedings of the Association for Computational Linguistics_, 2004.

## Loading the data

First we load the training, development, and test splits of this dataset.

In [None]:
iopub_data_rate_limit=1.0e10

In [None]:
import json
from urllib.request import urlopen
import numpy as np

In [None]:
# Read one JSON record per line
def read_jsonl(f):
  res = []
  for line in f:
    res.append(json.loads(line))
  return res

If you're working offline, you could modify this code to read from the copies of the data in the repository.
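A minimal sketch of the offline variant, assuming the data files have been saved locally with one JSON record per line. The helper name `read_jsonl_file` and the file name in the usage comment are hypothetical and not part of the assignment code.

```python
import json

def read_jsonl_file(path):
    """Read one JSON record per line from a local file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not raise an error.
        return [json.loads(line) for line in f if line.strip()]

# Assumed usage with a local copy of the repository file:
# train = read_jsonl_file("train.json")
```

Opening the file in a `with` block closes it even if a record fails to parse.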

In [None]:
train = read_jsonl(urlopen("https://github.com/dasmiq/cs6200-hw3/blob/main/train.json?raw=true"))
dev = read_jsonl(urlopen("https://github.com/dasmiq/cs6200-hw3/blob/main/dev.json?raw=true"))
test = read_jsonl(urlopen("https://github.com/dasmiq/cs6200-hw3/blob/main/test.json?raw=true"))

Each of these subsets of the data is a list of documents, and each document has a unique identifier (`id`) and text (`text`). The training and development documents, in addition, have been labeled with a `class`.

In [None]:
print(train[0])
print(dev[0])
print(test[0])  


{'id': '12178', 'class': 'neg', 'text': "the sequel to the fugitive ( 1993 ) , u . s marshals is an average thriller using it's association with the fugitive just so it can make a few extra bucks . \ntommy lee jones returns to his role as chief deputy samuel gerard , the grizzly cop who was after harrison ford in the fugitive . \nthis time , he's after fugitive mark sheridan ( snipes ) who the police think killed two fbi agents , but of course he's been set up , and when the police plane escort he ( and gerard ) are riding crashes , he makes a run for it , gerard not so hot on his tail . \nwhat follows is about 2 hours of action , brought to us by the director of executive decision ( 1995 ) , another film curiously involving a plane . \nwhen comparing this movie to the fugitive , the prequel is far superior . \nbut even on it's own , u . s marshals is a pretty lousy movie . \nwhile the original was reasonably intelligent , and had a fugitive to root for , the audience feels strangely d

## Collecting term statistics

The text has been pre-tokenized and lower-cased.  All you have to do to get the individual terms in a review is to split the text on whitespace, i.e., any sequence of spaces and/or newlines.
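As a small illustration of whitespace splitting, the sketch below counts terms with `collections.Counter`. The function name `term_counts` and the sample sentence are assumptions made for this example only.

```python
from collections import Counter

def term_counts(text):
    """Count the terms in a pre-tokenized, lower-cased review."""
    # str.split() with no argument splits on any run of whitespace,
    # including newlines, and never yields empty strings.
    return Counter(text.split())

counts = term_counts("a good movie . \na very good cast")
print(counts["good"])  # 2
print(counts["a"])     # 2
```

Note that `text.split(" ")` behaves differently: it splits on single spaces only, so newlines stay attached to tokens and repeated spaces produce empty strings.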

In [None]:
## TODO: Write a function to convert a document into a collection of terms and their counts.
## Convert the lists of documents in the training, development, and test sets into these collections of terms and counts.
train_ = [] 
dev_ = []
test_ = []

def create_count(new_list, doc_id, class_):
  # Build a dict holding the document id, its class (if labeled), and term counts.
  term_count = {'id': doc_id}
  if class_ != "None":
    term_count['class'] = class_
  for val in new_list:
    # Skip tokens that would collide with the metadata keys.
    if val != "class" and val != "id":
      term_count[val] = term_count.get(val, 0) + 1
  return term_count

def tokenizing(doc, doc_id, class_ = "None"):
  list_of_words_body_ = []
  for i,val in enumerate(doc):
    list_of_words_body_.append(val.strip()) #removes white spaces around words
  for i,val in enumerate(list_of_words_body_):
    if (val == ''):
      list_of_words_body_.remove(val)
    elif (val[0].isalpha() == False): #removes everything that does not start with a letter 
      list_of_words_body_.remove(val) 
  new_list = []
  for i,val in enumerate(list_of_words_body_):
      temp = val.split(" ")
      for j in temp:
        #below skips empty strings, numbers and hyperlinks.
        if j == "" or j.isnumeric() or j.startswith("http"):
          continue
        #removes things like commas, question marks and full stops etc from each word 
        if j[0].isalpha() == False:
          j = j[1:]
        elif j[-1].isalpha() == False:
          j = j[:-1]
        if len(j) != 0:
          new_list.append(j.lower()) #converts each word to lower case
  return create_count(new_list,doc_id,class_)

def preprocess(data, list_, type_):
  for i,val in enumerate(data):
    text = val['text'].split(" ")
    if type_ == "test":
      output = tokenizing(text, val['id'])
    else:
        output = tokenizing(text, val['id'], val['class'])
    list_.append(output)
  return list_

preprocess(train, train_, "train")
preprocess(dev, dev_,"dev")
preprocess(test, test_,"test")
# the output below contains the words and their counts for the training data,
# development data and test data.

[{'-': 1,
  'a': 24,
  'able': 1,
  'about': 3,
  'academy': 1,
  'accomplished': 1,
  'across': 2,
  'actor': 1,
  'actors': 6,
  'after': 1,
  'ago': 1,
  'all': 2,
  'all-star': 1,
  'also': 2,
  'although': 1,
  'always': 1,
  'ames': 2,
  'among': 1,
  'an': 5,
  'and': 24,
  'another': 2,
  'are': 4,
  'around': 1,
  'as': 3,
  'asks': 1,
  'at': 3,
  'atmosphere': 1,
  'attempting': 1,
  'audience': 1,
  'awards': 1,
  'babe': 1,
  'back': 1,
  'barrage': 1,
  'be': 1,
  'become': 1,
  'been': 2,
  'before': 2,
  'being': 1,
  'believable': 1,
  'benton': 1,
  'best': 1,
  'better': 3,
  'bit': 1,
  'blackmail': 1,
  'bloody': 1,
  'bring': 1,
  'buddies': 1,
  'buddy': 1,
  'bullets': 1,
  'but': 7,
  'by': 2,
  'can': 1,
  "can't": 1,
  'cancer': 1,
  'care': 1,
  'cares': 1,
  'cast': 1,
  'casual': 1,
  'catherine': 2,
  "catherine's": 1,
  'change': 1,
  'choices': 1,
  'choosing': 1,
  'clean': 1,
  'combining': 1,
  'come': 2,
  'comes': 1,
  'comic': 1,
  'complex': 1,
 

The statistics for individual documents will be useful in predicting the class of those documents, e.g., in the test set.

Now, you will collect the statistics used to estimate the parameters of a naive Bayes model.
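One possible way to accumulate these statistics is sketched below. It assumes each document is available as a `(label, Counter)` pair, which is a layout chosen for the illustration; the name `class_term_counts` and the toy data are likewise hypothetical.

```python
from collections import Counter, defaultdict

def class_term_counts(docs):
    """Sum per-document term counts within each class.

    docs: iterable of (label, Counter of term counts) pairs (assumed layout).
    """
    totals = defaultdict(Counter)
    for label, counts in docs:
        # Counter.update adds counts instead of overwriting them.
        totals[label].update(counts)
    return totals

toy = [("pos", Counter({"good": 2, "the": 3})),
       ("neg", Counter({"bad": 1, "the": 2})),
       ("pos", Counter({"good": 1}))]
totals = class_term_counts(toy)
print(totals["pos"]["good"])  # 3
print(totals["neg"]["the"])   # 2
```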

In [None]:
## TODO: Write a function to take a list of document statistics and produce a dictionary of term counts in each class.
## Your output will look something like this:

"""fakeData = {'pos': {'the': 10000, 'and': 800},
            'neg': {'the': 1001, 'and': 799}}"""

train_pos = []
train_neg = []
dev_pos = []
dev_neg = []

def term_per_class(listpos, listneg, data):
  # Split the documents by class and count the positive documents.
  count_pos = 0
  for doc in data:
    if doc['class'] == "neg":
      listneg.append(doc)
    else:
      listpos.append(doc)
      count_pos += 1

  # Sum the per-document term counts within each class.
  classes = {}
  for label, docs in (('pos', listpos), ('neg', listneg)):
    totals = {}
    for doc in docs:
      for term, count in doc.items():
        if term == 'id' or term == "class":
          continue
        totals[term] = totals.get(term, 0) + count
    classes[label] = totals
  return classes, count_pos

classes, count_pos = term_per_class(train_pos, train_neg, train_)
#term_per_class(dev_pos, dev_neg, dev_)
for k, v in classes.items():
    print(k, v)

#the output below is a dictionary of term counts in each class (pos and neg)
#for the training set



## Estimating naive Bayes parameters

As we discussed in class, you could use simple maximum-likelihood estimation for naive Bayes parameters, i.e., computing the relative frequency of a term given a class. The problem is that the relative frequency of words not seen in the training data will be zero, e.g., $p(\texttt{aardvark} | \texttt{pos}) = \frac{0}{\textrm{tokens in the positive training data}}$.

To avoid this problem, estimate the parameters with **add-1 (Laplace) smoothing**. In other words, add an additional count of 1 to each word type. Then, to make the probability distribution sum to 1, add a count of 1 for each vocabulary word to the denominator. For our `aardvark` example, we would now have, for vocabulary $V$ and $N_{\texttt{pos}}$ tokens in the positive training data, $p(\texttt{aardvark} | \texttt{pos}) = \frac{0 + 1}{N_{\texttt{pos}} + 1 \cdot |V|}$.
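The smoothing rule can be checked on a toy example. The sketch below assumes a single vocabulary shared by both classes and that every counted term belongs to it; `add_one_probs` and the toy counts are hypothetical names and values.

```python
def add_one_probs(term_counts, vocab):
    """Add-1 smoothed P(term | class) over a shared vocabulary.

    Assumes every term in term_counts is also in vocab.
    """
    n = sum(term_counts.values())   # tokens observed in this class
    denom = n + len(vocab)          # one extra count per vocabulary word
    return {t: (term_counts.get(t, 0) + 1) / denom for t in vocab}

pos_counts = {"good": 3, "the": 6}
vocab = {"good", "the", "bad", "aardvark"}
p = add_one_probs(pos_counts, vocab)
print(p["aardvark"])     # equals 1 / (9 + 4)
print(sum(p.values()))   # close to 1, up to floating-point rounding
```

Because the denominator uses the size of the whole vocabulary, the smoothed probabilities of each class sum to 1 over that vocabulary.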


In [None]:
## TODO: Write a function to compute the add-1 smoothed parameters for a naive Bayes model given the term statistics you computed above.
## Collect these parameters for the training set.

pos_words = [] #list containing all the words with "pos" label in training set
npos = 0 #total number of tokens in the positive class
v = 0 #number of distinct terms in the positive class
d = classes['pos']
for i in d.items():
  v += 1
  npos += i[1]
  pos_words.append(i[0])

neg_words = [] #list containing all the words with "neg" label in training set
nneg = 0 #total number of tokens in the negative class
vneg = 0 #number of distinct terms in the negative class
d = classes['neg']
for i in d.items():
  vneg += 1
  nneg += i[1]
  neg_words.append(i[0])


#applying Laplace smoothing to all the words (in +ve class and -ve class in the training set)
smoothed_probabilities = {}
def smoothing_train(class_, v, n):
  temp = {}
  d = classes[class_]
  for i in d.items():
    smooth = round((i[1]+1)/(n+v),10)
    temp[i[0]] = smooth
  smoothed_probabilities[class_] = temp
  return smoothed_probabilities 

smoothing_train('pos', v, npos) #contains probabilities for all words in the +ve class for the training set
smoothing_train('neg', vneg, nneg) #contains probabilities for all words in the -ve class for the training set

for k, x in smoothed_probabilities.items():
    print(k, x) 

#for words in the test set that do not occur in the training set:
def smoothing(class_):
  if class_ == 'pos':
    a = npos+v
    smooth = 1/a
  else:
    b = nneg+vneg
    smooth = 1/b
  return smooth



What terms are likely to be important for prediction?

In [None]:
## TODO: Print a list of the 25 terms with the highest log ratio of positive to negative weight.

#first let's collect all the terms in the training set (both positive and negative classes)
#we already collected these in the previous step: pos_words contains all the words in the
#positive class and neg_words contains all the words in the negative class
#now we take the unique words across both classes:
training_wrods = pos_words + neg_words
training_wrods = set(training_wrods) #convert to set to get unique words
training_wrods = list(training_wrods) #convert back to list

import math
log_rations_1 = []
pos = smoothed_probabilities['pos']
neg = smoothed_probabilities['neg']
for i in training_wrods:
  if i in pos:
    a = pos[i]
  else:
    a = smoothing('pos')
  if i in neg:
    b = neg[i]
  else:
    b = smoothing('neg')
  ratio = math.log(a/b)
  log_rations_1.append([i, ratio])

list_ = []
log_rations_1.sort(key = lambda x: x[1], reverse=True)
for i in range(25):
  list_.append(log_rations_1[i])
print(list_)

#output below is list of the 25 terms with the highest log ratio of positive to negative weight.


[['shrek', 3.9846124772676146], ['mulan', 3.553840495032552], ['gattaca', 3.5011866300646752], ['flynt', 3.340265868540365], ['ordell', 3.3242555211356777], ['guido', 3.3242555211356777], ['leila', 3.291466254380907], ['sweetback', 3.222472562031237], ['taran', 3.1861055259068847], ['homer', 3.1483658527429563], ['mallory', 3.1091435483789573], ['donkey', 3.025763338475551], ['rounders', 2.9813097589749287], ['argento', 2.9813097589749287], ['giles', 2.9813097589749287], ['coens', 2.9347904895794557], ["truman's", 2.9347904895794557], ['lebowski', 2.9107036092740626], ['fei-hong', 2.886001146272742], ['dolores', 2.834708759153728], ['farquaad', 2.834708759153728], ["mulan's", 2.7806393537170626], ['lumumba', 2.7806393537170626], ['sethe', 2.7806393537170626], ["flynt's", 2.7806393537170626]]


In [None]:
## TODO: Print a list of the 25 terms with the highest log ratio of negative to positive weight.

log_rations_1 = []
pos = smoothed_probabilities['pos']
neg = smoothed_probabilities['neg']
for i in training_wrods:
  if i in pos:
    a = pos[i]
  else:
    a = smoothing('pos')
  if i in neg:
    b = neg[i]
  else:
    b = smoothing('neg')
  ratio = math.log(b/a) 
  log_rations_1.append([i, ratio])

list_ = []
log_rations_1.sort(key = lambda x: x[1], reverse=True)
for i in range(25):
  list_.append(log_rations_1[i])
print(list_)

#Output below is a list of the 25 terms with the highest log ratio of negative to positive weight.


[['nbsp', 4.117064450713454], ['jolie', 3.543719436324799], ['seagal', 3.510929115223236], ['brenner', 3.2452257256258634], ['farrellys', 3.154254906838499], ['pokemon', 3.154254906838499], ['bruckheimer', 3.0541698997461597], ['silverman', 3.0541698997461597], ['psychlos', 3.0541698997461597], ['memphis', 3.0001046356521153], ['supergirl', 3.0001046356521153], ['babysitter', 3.0001046356521153], ['eszterhas', 3.0001046356521153], ['tango', 3.0001046356521153], ['psychlo', 3.0001046356521153], ['atrocious', 3.0001046356521153], ['mandingo', 2.9429453804925365], ['tomb', 2.9429453804925365], ['bilko', 2.9429453804925365], ['raider', 2.9429453804925365], ['sphere', 2.882319812190672], ['hush', 2.882319812190672], ['angelina', 2.8177802183685308], ['wrestlers', 2.8177802183685308], ['incoherent', 2.8177802183685308]]


Now, given the parameters you've estimated, you can make predictions about new documents.
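A common way to make these predictions without numerical underflow is to add log probabilities instead of multiplying probabilities. The sketch below is one such formulation under stated assumptions: the function name `predict`, its argument layout, and the toy parameters are all hypothetical, and each term's log probability is weighted by its count in the document.

```python
import math

def predict(doc_counts, priors, probs, unseen):
    """Return the class with the highest log posterior score.

    doc_counts: term -> count for one document
    priors:     class -> P(class)
    probs:      class -> {term -> smoothed P(term | class)}
    unseen:     class -> smoothed probability for terms missing from probs[class]
    """
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for term, count in doc_counts.items():
            # A term occurring `count` times contributes count * log P(term | c).
            score += count * math.log(probs[c].get(term, unseen[c]))
        scores[c] = score
    return max(scores, key=scores.get)

priors = {"pos": 0.5, "neg": 0.5}
probs = {"pos": {"good": 0.5, "bad": 0.1},
         "neg": {"good": 0.1, "bad": 0.5}}
unseen = {"pos": 0.05, "neg": 0.05}
print(predict({"good": 2, "bad": 1}, priors, probs, unseen))  # pos
```

Since the logarithm is monotonic, comparing summed log scores selects the same class as comparing the products of probabilities.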

In [None]:
## TODO: Compute the predictions of your model for each document in the development data.

from decimal import Decimal, getcontext
getcontext().prec = 50 #Decimal's wide exponent range keeps tiny products from rounding to 0.0; 50 significant digits are ample for comparing scores

p_pos = count_pos/len(train_) #probabilty of the +ve class
p_neg = 1-p_pos #probabilty of the negative class

#computing the likelihoods for each document in the development dataset
#using the naive Bayes formula.
#computing the likelihoods for both +ve and -ve classes
predictions_pos = {}
d = smoothed_probabilities['pos']
for i in dev_:
  d_score = Decimal(p_pos) #prior probability
  for j in i.items():
    if j[0] == 'id' or j[0] == "class":
      continue
    else:
      if j[0] in d:
        d_score = d_score*Decimal(d[j[0]])
      else:
        d_score = d_score*Decimal(smoothing('pos'))
  predictions_pos[i['id']] = d_score #assign likelihood score for +ve class to each document

predictions_neg = {}
d = smoothed_probabilities['neg']
for i in dev_:
  d_score = Decimal(p_neg) #prior probability
  for j in i.items():
    if j[0] == 'id' or j[0] == "class":
      continue
    else:
      if j[0] in d:
        d_score = d_score*Decimal(d[j[0]])
      else:
        d_score = d_score*Decimal(smoothing('neg'))
  predictions_neg[i['id']] = d_score #assign likelihood score for -ve class for every document

#print(predictions_pos) #contains the likelihood estimates for each document in the +ve class
#print(predictions_neg) #contains the likelihood estimates for each document in the -ve class

#comparing the +ve and -ve likelihood scores for each document and choosing
#the class which has higher probability for that document 
dev_predictions = {}
for i in dev_:
  pos_val = predictions_pos[i['id']]
  neg_val = predictions_neg[i['id']]
  if pos_val > neg_val:
    dev_predictions[i['id']] = 'pos'
  else:
    dev_predictions[i['id']] = 'neg'

## Compute the accuracy of these predictions.

# now compare these predictions against the true labels in the development set 
tp = 0 
fn = 0
fp = 0
tn = 0
for i in dev_:
  class_predicted = dev_predictions[i['id']] 
  if i['class'] == 'pos' and class_predicted == 'pos':
    tp+=1
  elif i['class'] == 'neg' and class_predicted == 'neg':
    tn+=1
  elif i['class'] == 'neg' and class_predicted == 'pos':
    fp+=1
  elif i['class'] == 'pos' and class_predicted == 'neg':
    fn+=1
  
accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
f1_score = (2*precision*recall)/(precision+recall)

print("Accuracy is: ", accuracy)
print("precision is: ", precision)
print("recall is: ", recall)
print("f1_score is: ", f1_score)

Accuracy is:  0.775
precision is:  0.8571428571428571
recall is:  0.66
f1_score is:  0.7457627118644068


In [None]:
## TODO: Compute the predictions of this model on each document in the test set.

#computing the likelihoods for each document in the test set
#using the naive Bayes formula.
#computing the likelihoods for both +ve and -ve classes
predictions_pos = {}
d = smoothed_probabilities['pos']
for i in test_:
  d_score = Decimal(p_pos) #prior probability, as for the development set
  for j in i.items():
    if j[0] == 'id' or j[0] == "class":
      continue
    else:
      if j[0] in d:
        d_score = d_score*Decimal(d[j[0]])
      else:
        d_score = d_score*Decimal(smoothing('pos'))
  predictions_pos[i['id']] = d_score #assign +ve score to each document

predictions_neg = {}
d = smoothed_probabilities['neg']
for i in test_:
  d_score = Decimal(p_neg) #prior probability, as for the development set
  for j in i.items():
    if j[0] == 'id' or j[0] == "class":
      continue
    else:
      if j[0] in d:
        d_score = d_score*Decimal(d[j[0]])
      else:
        d_score = d_score*Decimal(smoothing('neg'))
  predictions_neg[i['id']] = d_score #assign -ve score to every document

#print(predictions_pos)
#print(predictions_neg)

#comparing the +ve and -ve likelihood scores for each document and choosing
#the class which has higher probability for that document 
test_predictions = {}
for i in test_:
  pos_val = predictions_pos[i['id']]
  neg_val = predictions_neg[i['id']]
  if pos_val > neg_val:
    test_predictions[i['id']] = 'pos'
  else:
    test_predictions[i['id']] = 'neg'

#the output below is a dictionary where the key is the document ID and the value is the predicted class. 
print(test_predictions)

{'11471': 'neg', '21565': 'neg', '15824': 'pos', '24353': 'pos', '9816': 'neg', '23776': 'pos', '11934': 'neg', '26154': 'neg', '11920': 'neg', '18509': 'neg', '25663': 'neg', '10800': 'neg', '20929': 'neg', '13475': 'pos', '21821': 'pos', '9803': 'pos', '28742': 'pos', '28965': 'neg', '11316': 'neg', '9342': 'neg', '5578': 'neg', '12806': 'neg', '2029': 'neg', '16679': 'neg', '9168': 'pos', '10220': 'pos', '5626': 'neg', '24218': 'pos', '17822': 'pos', '8841': 'neg', '29715': 'neg', '24219': 'neg', '9960': 'neg', '13106': 'pos', '12350': 'neg', '18645': 'neg', '5964': 'neg', '5794': 'neg', '20084': 'neg', '10185': 'neg', '24977': 'neg', '12547': 'neg', '9811': 'neg', '17563': 'pos', '7394': 'neg', '22928': 'pos', '18450': 'pos', '29114': 'pos', '10583': 'pos', '11851': 'neg', '9813': 'neg', '5152': 'neg', '9973': 'neg', '12747': 'pos', '6895': 'pos', '15954': 'pos', '7208': 'pos', '10718': 'neg', '17705': 'pos', '9478': 'neg', '10724': 'neg', '8969': 'pos', '12443': 'neg', '23113': 'n

## Feature selection (CS6200 only)

As we've discussed, even in large corpora, many terms are rarely observed. To keep the features of our models from growing too large, we may want to perform **feature selection**. One popular method is **information gain**. We discussed this method in the context of decision trees: at each node, choose the feature that reduces entropy of classification the most.

Here, you will evaluate the merit of each term independently by the information gain it provides. What is the entropy of the distribution over classes after observing, e.g., the term `good`? Is the entropy of documents with and without `good` smaller than the entropy of documents with and without some other feature like `bad`?
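To make the comparison concrete, the sketch below computes information gain from document counts as the class entropy minus the weighted entropies of the documents with and without the term. This is one standard formulation, not necessarily the one intended for the assignment; `entropy`, `term_information_gain`, and the toy counts are hypothetical.

```python
import math

def entropy(counts):
    """Entropy in bits of a distribution given as a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def term_information_gain(n_pos_with, n_neg_with, n_pos, n_neg):
    """Information gain of a term, from document counts.

    n_pos_with / n_neg_with: positive / negative documents containing the term
    n_pos / n_neg:           total positive / negative documents
    """
    n = n_pos + n_neg
    n_with = n_pos_with + n_neg_with
    n_without = n - n_with
    h_class = entropy([n_pos, n_neg])
    h_with = entropy([n_pos_with, n_neg_with])
    h_without = entropy([n_pos - n_pos_with, n_neg - n_neg_with])
    return h_class - (n_with / n) * h_with - (n_without / n) * h_without

# A term found in every positive document and no negative one is maximally informative.
print(term_information_gain(10, 0, 10, 10))  # 1.0
# A term spread evenly across both classes tells us nothing.
print(term_information_gain(5, 5, 10, 10))   # 0.0
```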

In [None]:
## TODO: Given the term statistics you computed, s
#count_pos was previously computed. We use that to estimate P(pos)
#probability of negative class is 1-P(pos)
information_gain = {}

#using the formula in the book. 
#first let us compute the word probabilities P(word)
word_probabilties = {}
total_count = npos + nneg #total number of words in the +ve and -ve classes together. 
pos = classes['pos']
neg = classes['neg']
smooth_p = smoothed_probabilities['pos']
smooth_n = smoothed_probabilities['neg']
for word in training_wrods:
  #a word may occur in only one class, so default its count to 0 in the other
  word_count = pos.get(word, 0) + neg.get(word, 0)
  word_probabilties[word] = word_count/total_count


#using formula on page 362 of IR book https://ciir.cs.umass.edu/downloads/SEIRiP.pdf
a = -(p_pos*math.log(p_pos))-(p_neg*math.log(p_neg))
feature_size = len(training_wrods)
for word in training_wrods:
  #fall back to the smoothed unseen-word probability when a word is missing from a class,
  #so that no term reuses values left over from the previous word
  sp = smooth_p.get(word, smoothing('pos'))
  sn = smooth_n.get(word, smoothing('neg'))
  pw = word_probabilties[word]
  b = pw*(p_pos*sp)*math.log(p_pos*sp)
  d = (1-pw)*(p_pos*(1-sp))*math.log(p_pos*(1-sp))
  c = pw*(p_neg*sn)*math.log(p_neg*sn)
  e = (1-pw)*(p_neg*(1-sn))*math.log(p_neg*(1-sn))
  information_gain[word] = a + b + c + d + e

print(information_gain) #information gain scores for all the words in the training set




In [None]:
#now let's select K features from all the features in information_gain
#I will sort all the IG scores in decreasing order 
#and then choose the top 50% of the words as the new feature set as these
#are likely to be most significant

information_gain = {k: v for k, v in sorted(information_gain.items(), key=lambda item: item[1], reverse=True)}
information_gain_selected = {} #top k features
c = 0
for key,value in information_gain.items():
  information_gain_selected[key] = value
  c+=1
  if c == int(feature_size/2):
    break
print(information_gain_selected)




In [None]:
## TODO: Classify the development and test documents using these features. Compute the accuracy of the development set.

#classifying development set:
#score each document with the smoothed class-conditional probabilities,
#using only the terms kept by feature selection
predictions_pos = {}
d = smoothed_probabilities['pos']
for i in dev_:
  d_score = Decimal(p_pos) #prior probability
  for j in i.items():
    if j[0] == 'id' or j[0] == "class":
      continue
    if j[0] in information_gain_selected:
      if j[0] in d:
        d_score = d_score*Decimal(d[j[0]])
      else:
        d_score = d_score*Decimal(smoothing('pos'))
  predictions_pos[i['id']] = d_score #likelihood score for the +ve class over the selected features

predictions_neg = {}
d = smoothed_probabilities['neg']
for i in dev_:
  d_score = Decimal(p_neg) #prior probability
  for j in i.items():
    if j[0] == 'id' or j[0] == "class":
      continue
    if j[0] in information_gain_selected:
      if j[0] in d:
        d_score = d_score*Decimal(d[j[0]])
      else:
        d_score = d_score*Decimal(smoothing('neg'))
  predictions_neg[i['id']] = d_score #likelihood score for the -ve class over the selected features


#comparing the +ve and -ve likelihood scores for each document and choosing
#the class which has higher probability for that document 
dev_predictions = {}
for i in dev_:
  pos_val = predictions_pos[i['id']]
  neg_val = predictions_neg[i['id']]
  if pos_val > neg_val:
    dev_predictions[i['id']] = 'pos'
  else:
    dev_predictions[i['id']] = 'neg'

## Compute the accuracy of these predictions.

# now compare these predictions against the true labels in the development set 
tp = 0
fn = 0
fp = 0
tn = 0
for i in dev_:
  class_predicted = dev_predictions[i['id']] 
  if i['class'] == 'pos' and class_predicted == 'pos':
    tp+=1
  elif i['class'] == 'neg' and class_predicted == 'neg':
    tn+=1
  elif i['class'] == 'neg' and class_predicted == 'pos':
    fp+=1
  elif i['class'] == 'pos' and class_predicted == 'neg':
    fn+=1

accuracy = (tp+tn)/(tp+tn+fp+fn)
#guard against division by zero when a class is never predicted
precision = tp/(tp+fp) if (tp+fp) else 0.0
recall = tp/(tp+fn) if (tp+fn) else 0.0
f1_score = (2*precision*recall)/(precision+recall) if (precision+recall) else 0.0

print("Evaluation of the development set:")
print("Accuracy is: ", accuracy)
print("precision is: ", precision)
print("recall is: ", recall)
print("f1_score is: ", f1_score)

#test set
#score each document with the smoothed class-conditional probabilities,
#using only the terms kept by feature selection
predictions_pos_ = {}
d = smoothed_probabilities['pos']
for i in test_:
  d_score = Decimal(p_pos) #prior probability
  for j in i.items():
    if j[0] == 'id' or j[0] == "class":
      continue
    if j[0] in information_gain_selected:
      if j[0] in d:
        d_score = d_score*Decimal(d[j[0]])
      else:
        d_score = d_score*Decimal(smoothing('pos'))
  predictions_pos_[i['id']] = d_score #assign +ve score to each document

predictions_neg_ = {}
d = smoothed_probabilities['neg']
for i in test_:
  d_score = Decimal(p_neg) #prior probability
  for j in i.items():
    if j[0] == 'id' or j[0] == "class":
      continue
    if j[0] in information_gain_selected:
      if j[0] in d:
        d_score = d_score*Decimal(d[j[0]])
      else:
        d_score = d_score*Decimal(smoothing('neg'))
  predictions_neg_[i['id']] = d_score #assign -ve score to every document

#comparing the +ve and -ve likelihood scores for each document and choosing
#the class which has higher probability for that document 
test_predictions_ = {}
for i in test_:
  pos_val = predictions_pos_[i['id']]
  neg_val = predictions_neg_[i['id']]
  if pos_val > neg_val:
    test_predictions_[i['id']] = 'pos'
  else:
    test_predictions_[i['id']] = 'neg'

#the output below is a dictionary where the key is the document ID and the value is the predicted class. 
print(test_predictions_)
#above are the predicted classes for the test set using the features selected by information gain

Evaluation of the development set:
Accuracy is:  0.83
precision is:  0.932
recall is:  0.631
f1_score is:  0.7921
{'11471': 'neg', '21565': 'neg', '15824': 'pos', '24353': 'pos', '9816': 'neg', '23776': 'pos', '11934': 'neg', '26154': 'neg', '11920': 'neg', '18509': 'neg', '25663': 'neg', '10800': 'neg', '20929': 'neg', '13475': 'pos', '21821': 'pos', '9803': 'pos', '28742': 'pos', '28965': 'neg', '11316': 'neg', '9342': 'neg', '5578': 'neg', '12806': 'neg', '2029': 'neg', '16679': 'neg', '9168': 'pos', '10220': 'pos', '5626': 'neg', '24218': 'pos', '17822': 'pos', '8841': 'neg', '29715': 'neg', '24219': 'neg', '9960': 'neg', '13106': 'pos', '12350': 'neg', '18645': 'neg', '5964': 'neg', '5794': 'neg', '20084': 'neg', '10185': 'neg', '24977': 'neg', '12547': 'neg', '9811': 'neg', '17563': 'pos', '7394': 'neg', '22928': 'pos', '18450': 'pos', '29114': 'pos', '10583': 'pos', '11851': 'neg', '9813': 'neg', '5152': 'neg', '9973': 'neg', '12747': 'pos', '6895': 'pos', '15954': 'pos', '72