**Information Retrieval Programming Assignment #4**
<br>**Binary Text Classification**
<br>Build a binary classifier using the systematic review dataset
<br>- create feature vectors: bag-of-words, TF-IDF, 2-gram
<br>- run a classification algorithm: multinomial naive Bayes
<br>- report precision, recall and F1 scores
<br>- run experiments comparing title only vs. title, abstract and keywords


<br><br>**Author:** Helen Ting He; **Date:** Oct 31, 2021

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk import word_tokenize
import string #remove punctuation
import io
import re
import time
import langdetect #detect which language it is 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import feature_extraction

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# upload files
from google.colab import files
uploaded = files.upload()

Saving phase1.dev.shuf.tsv to phase1.dev.shuf (4).tsv
Saving phase1.test.shuf.tsv to phase1.test.shuf (4).tsv
Saving phase1.train.shuf.tsv to phase1.train.shuf (4).tsv


In [None]:
# read files
train_data = pd.read_csv(io.BytesIO(uploaded['phase1.train.shuf.tsv']),sep='\t',header=None)
dev_data = pd.read_csv(io.BytesIO(uploaded['phase1.dev.shuf.tsv']),sep='\t',header=None)
test_data = pd.read_csv(io.BytesIO(uploaded['phase1.test.shuf.tsv']),sep='\t',header=None)

In [None]:
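##################
# pre-processing
##################
# NOTE: the train/dev language filtering below is disabled (wrapped in a string
# literal), yet train_en and dev_en are used further down, so they must already
# exist in the session, e.g. from an earlier run of this block.
'''print('Num of paper in train data: ' + str(train_data.shape[0]))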

# only select the english paper
train_data['lang'] = train_data[2].apply(langdetect.detect)
train_en = train_data[train_data['lang'] == 'en']
print('Num of English paper in train data: ' + str(train_en.shape[0]))

print('Num of paper in dev data: ' + str(dev_data.shape[0]))
# only select the english paper
dev_data['lang'] = dev_data[2].apply(langdetect.detect)
dev_en = dev_data[dev_data['lang'] == 'en']
print('Num of English paper in dev data: ' + str(dev_en.shape[0]))'''

print('Num of paper in test data: ' + str(test_data.shape[0]))
# only select the english paper
test_data['lang'] = test_data[2].apply(langdetect.detect)
test_en = test_data[test_data['lang'] == 'en']
print('Num of English paper in test data: ' + str(test_en.shape[0]))

def pre_process(data):
  # Pre-process raw text
  # @input: pandas series of strings
  # @output: list of cleaned strings, one per input row
  # build the stop word set and the stemmer once, not once per line
  stopwords = set(nltk.corpus.stopwords.words("english"))
  ps = nltk.stem.porter.PorterStemmer()
  result = []
  for line in data:
    # lowercase, strip and split on whitespace
    tokens = line.lower().strip().split()
    # keep tokens made only of letters/underscores; tokens containing digits or punctuation are dropped whole
    clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
    # remove stop words
    non_stop = [t for t in clean_tokens if t not in stopwords]
    # stemming
    non_stop_ps = [ps.stem(word) for word in non_stop]
    # from list to string
    result.append(" ".join(non_stop_ps))
  return result

train_clean = pre_process(train_en[2])
dev_clean = pre_process(dev_en[2])
test_clean = pre_process(test_en[2])

Num of paper in test data: 4814
Num of English paper in test data: 4750
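
The token filter in `pre_process` keeps a token only when the whole token consists of letters (or underscores), so a word with punctuation attached is dropped rather than cleaned. The sketch below isolates that one step; the sample sentence and the helper name `keep_alpha_tokens` are made up for illustration.

```python
import re

# Same pattern as in pre_process: every character must be a word character that is not a digit.
ALPHA_ONLY = re.compile(r'[^\W\d]*$')

def keep_alpha_tokens(line):
    """Lowercase, split on whitespace, and keep tokens that pass the filter."""
    tokens = line.lower().strip().split()
    return [t for t in tokens if ALPHA_ONLY.match(t)]

sample = "Cancer, screening in 2020 trials."
print(keep_alpha_tokens(sample))  # ['screening', 'in']
```

If the intent is to keep `cancer` and `trials` as well, one option would be to strip punctuation before filtering (for example with `str.translate` and `string.punctuation`). That would change the vocabulary and therefore the scores reported in this notebook.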


In [None]:
#########################
# feature representation
########################
# bag of word
vectorize = CountVectorizer()
bag = vectorize.fit_transform(train_clean + dev_clean + test_clean)

train_bow = bag[0:len(train_clean)].toarray()
dev_bow = bag[len(train_clean):len(train_clean)+len(dev_clean)]
test_bow = bag[len(train_clean)+len(dev_clean):len(train_clean)+len(dev_clean)+len(test_clean)]
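
The cell above builds one vocabulary from train, dev and test together. A common alternative is to build the vocabulary from the training texts only and ignore unseen words elsewhere. The following is a minimal standard-library sketch of that idea, not the `CountVectorizer` implementation; the helper names and toy documents are hypothetical.

```python
from collections import Counter

def build_vocab(train_docs):
    """Map each distinct training token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_count_vector(doc, vocab):
    """Return a bag-of-words count vector; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(doc.split()).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec

train_docs = ["random trial trial", "cohort studi"]  # toy, already pre-processed
dev_doc = "trial cohort placebo"                     # 'placebo' is unseen in training

vocab = build_vocab(train_docs)
print(vocab)                                  # {'cohort': 0, 'random': 1, 'studi': 2, 'trial': 3}
print(to_count_vector(train_docs[0], vocab))  # [0, 1, 0, 2]
print(to_count_vector(dev_doc, vocab))        # [1, 0, 0, 1]
```

With scikit-learn the same split corresponds to calling `fit_transform` on the training texts and `transform` on the dev and test texts.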

## Baseline 
Use features from the title only. Make predictions on the dev partition and report precision, recall and F1 (showing the computation).
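
For reference, the three metrics can be computed from the counts of true positives (TP), false positives (FP) and false negatives (FN). With `pd.crosstab(index=y_true, columns=pred)` the rows are true labels and the columns are predictions, so TP is the cell (row 1, column 1), FP is (row -1, column 1) and FN is (row 1, column -1). The helper name and the counts below are hypothetical, and returning 0.0 for a zero denominator is a convention assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts.

    A zero denominator (e.g. no positive predictions) is reported as 0.0,
    which is a convention, not a mathematical identity.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts, not taken from the experiments in this notebook.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f, 4))  # 0.8 0.5 0.6154
```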

In [None]:
y_train = train_en[0]
X_train = train_bow
X_test = dev_bow

In [None]:
#########################
# Multinomial NB
########################
t0 = time.time()
clf = MultinomialNB()
clf.fit(X_train, y_train)
train_time = time.time() - t0
print("train time: %0.3fs" % train_time)

t0 = time.time()
pred = clf.predict(X_test)
test_time = time.time() - t0
print("test time:  %0.3fs" % test_time)

train time: 15.104s
test time:  0.001s
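
To make the classifier less of a black box, here is a minimal sketch of what a multinomial naive Bayes model with add-one (Laplace) smoothing computes. It is an illustration under simplifying assumptions (dense Python lists, toy data, hypothetical function names), not scikit-learn's implementation.

```python
import math

def train_multinomial_nb(X, y, alpha=1.0):
    """Fit class log-priors and smoothed per-feature log-likelihoods from count vectors."""
    log_prior, log_like = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature within class c, plus the smoothing term alpha.
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)
        log_like[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_like

def predict_multinomial_nb(x, log_prior, log_like):
    """Return the class with the highest log-posterior for one count vector."""
    scores = {
        c: log_prior[c] + sum(n * w for n, w in zip(x, log_like[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy count vectors over a 3-word vocabulary; labels follow the notebook's -1 / 1 coding.
X = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
y = [-1, -1, 1, 1]
model = train_multinomial_nb(X, y)
print(predict_multinomial_nb([1, 0, 0], *model))  # -1
print(predict_multinomial_nb([0, 1, 0], *model))  # 1
```

The score of a class is its log-prior plus the count-weighted sum of log word probabilities, so with a rare positive class the prior already favours the negative label before any word is seen.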


In [None]:
y_test = dev_en[0]
#########################
# Evaluation
########################
# contingency table: rows are true labels, columns are predicted labels
print(pd.crosstab(index = y_test, columns = pred))
# TP = 6 (true 1, predicted 1), FP = 52 (true -1, predicted 1), FN = 144 (true 1, predicted -1)
precision = 6/(6+52)
recall = 6/(6+144)
F1 = 2*recall*precision/(precision + recall)
print("precision of Multinomial Naive Bayes using bag-of-words representation is: " + str(precision))
print("recall of Multinomial Naive Bayes using bag-of-words representation is: " + str(recall))
print("F1 of Multinomial Naive Bayes using bag-of-words representation is: " + str(F1))

col_0    -1   1
0              
-1     4565  52
 1      144   6
precision of Multinomial Naive Bayes using bag-of-words representation is: 0.10344827586206896
recall of Multinomial Naive Bayes using bag-of-words representation is: 0.04
F1 of Multinomial Naive Bayes using bag-of-words representation is: 0.057692307692307696


## Experiment #1: Is longer better?
Use features not only from the title but also from the abstract and keywords fields.

In [None]:
# concatenate title, abstract and keyword
# replace NaN with empty string
train_en[9] = train_en[9].fillna('')
dev_en[9] = dev_en[9].fillna('')
test_en[9] = test_en[9].fillna('')
train_en[8] = train_en[8].fillna('')
dev_en[8] = dev_en[8].fillna('')
test_en[8] = test_en[8].fillna('')
train_en[2] = train_en[2].fillna('')
dev_en[2] = dev_en[2].fillna('')
test_en[2] = test_en[2].fillna('')

train_en = train_en.assign(concat = lambda train_en: train_en[2] + " " + train_en[8] + " " + train_en[9])
dev_en = dev_en.assign(concat = lambda dev_en: dev_en[2] + " " + dev_en[8] + " " + dev_en[9])
test_en = test_en.assign(concat = lambda test_en: test_en[2] + " " + test_en[8] + " " + test_en[9])

# pre-process
train_clean_long = pre_process(train_en['concat'])
dev_clean_long = pre_process(dev_en['concat'])
test_clean_long = pre_process(test_en['concat'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
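
The warnings above are pandas' `SettingWithCopyWarning`: `train_en`, `dev_en` and `test_en` were created by boolean filtering, and columns are then assigned on those filtered frames. One way to avoid the ambiguity is to take an explicit `.copy()` when filtering. The sketch below uses a made-up toy frame with integer column labels (as with `header=None`); the values are hypothetical.

```python
import pandas as pd

# Toy frame with integer column labels, mimicking header=None; values are made up.
data = pd.DataFrame({
    2: ["title a", "titre b", "title c"],
    8: ["abstract a", None, None],
    9: [None, "kw b", "kw c"],
    "lang": ["en", "fr", "en"],
})

# .copy() makes the filtered frame independent, so later column assignments are unambiguous.
data_en = data[data["lang"] == "en"].copy()

# Replace missing text with empty strings, then join title, abstract and keywords.
text_cols = [2, 8, 9]
data_en[text_cols] = data_en[text_cols].fillna("")
data_en["concat"] = data_en[2] + " " + data_en[8] + " " + data_en[9]
print(data_en["concat"].tolist())
```

The original `data` frame is left untouched, which also makes it safe to re-run the cell.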


In [None]:
# bag of word
vectorize = CountVectorizer()
bag_long = vectorize.fit_transform(train_clean_long + dev_clean_long + test_clean_long)

train_bow_long = bag_long[0:len(train_clean_long)].toarray()
dev_bow_long = bag_long[len(train_clean_long):len(train_clean_long)+len(dev_clean_long)]
test_bow_long = bag_long[len(train_clean_long)+len(dev_clean_long):len(train_clean_long)+len(dev_clean_long)+len(test_clean_long)]

In [None]:
print("train_bow.shape", train_bow.shape)
print("train_bow_long.shape", train_bow_long.shape)

train_bow.shape (21350, 11199)
train_bow_long.shape (21350, 34119)


In [None]:
y_train_long = train_en[0]
X_train_long = train_bow_long
X_test_long = dev_bow_long

# Multinomial NB
t0 = time.time()
clf_long = MultinomialNB()
clf_long.fit(X_train_long, y_train_long)
train_time = time.time() - t0
print("train time: %0.3fs" % train_time)

t0 = time.time()
pred_long = clf_long.predict(X_test_long)
test_time = time.time() - t0
print("test time:  %0.3fs" % test_time)

y_test_long = dev_en[0]

train time: 35.601s
test time:  0.004s


In [None]:
# Evaluation
# contingency table: rows are true labels, columns are predicted labels
print(pd.crosstab(index = y_test_long, columns = pred_long))
# TP = 65 (true 1, predicted 1), FP = 101 (true -1, predicted 1), FN = 85 (true 1, predicted -1)
precision = 65/(65+101)
recall = 65/(65+85)
F1 = 2*recall*precision/(precision + recall)
print("precision of Multinomial Naive Bayes using bag-of-words representation is: " + str(precision))
print("recall of Multinomial Naive Bayes using bag-of-words representation is: " + str(recall))
print("F1 of Multinomial Naive Bayes using bag-of-words representation is: " + str(F1))

col_0    -1    1
0               
-1     4516  101
 1       85   65
precision of Multinomial Naive Bayes using bag-of-words representation is: 0.39156626506024095
recall of Multinomial Naive Bayes using bag-of-words representation is: 0.43333333333333335
F1 of Multinomial Naive Bayes using bag-of-words representation is: 0.41139240506329117


## Experiment #2: TF-IDF
Experiment 1 represented the data as bag-of-words counts. Here the same text (title, abstract and keywords) is represented with TF-IDF weights instead.

In [None]:

# TF-IDF
vectorizer = feature_extraction.text.TfidfVectorizer(max_features = 10000)
bag_new = vectorizer.fit_transform(train_clean_long + dev_clean_long + test_clean_long).toarray()
train_new = bag_new[0:len(train_clean_long)]
dev_new= bag_new[len(train_clean_long):len(train_clean_long)+len(dev_clean_long)]
test_new = bag_new[len(train_clean_long)+len(dev_clean_long):len(train_clean_long)+len(dev_clean_long)+len(test_clean_long)]
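
As a reminder of what the TF-IDF weights look like, the sketch below computes them by hand using raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation, which is the variant scikit-learn documents for its default settings. It is a simplified illustration on toy documents with a hypothetical function name; it does not reproduce `max_features` or scikit-learn's tokenisation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) using raw term counts, smoothed idf and L2 normalisation."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

docs = ["trial trial cohort", "cohort studi"]  # toy, already pre-processed
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(v, 3) for v in vec])
```

Because every row is scaled to unit length, the feature values are fractions rather than counts. In a multinomial naive Bayes model this shrinks the evidence contributed by the words relative to the class prior, which may be one reason a classifier trained on such features leans towards the majority class.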


In [None]:
y_train_new = train_en[0]
X_train_new = train_new
X_test_new = dev_new
y_test_new = dev_en[0]

clf_new = MultinomialNB()
clf_new.fit(X_train_new, y_train_new)

pred_new = clf_new.predict(X_test_new)

In [None]:
# Evaluation
# contingency table: rows are true labels, columns are predicted labels
print(pd.crosstab(index = y_test_new, columns = pred_new))
# every document is predicted -1, so TP = 0, FP = 0 and FN = 150
precision = 0.0  # 0/0 is undefined (no positive predictions); reported as 0 by convention
recall = 0/(0+150)
# F1 is not computed because precision + recall = 0
#F1 = 2*precision*recall/(precision + recall)
print("precision of Multinomial Naive Bayes using TF-IDF representation is: " + str(precision))
print("recall of Multinomial Naive Bayes using TF-IDF representation is: " + str(recall))
#print("F1 of Multinomial Naive Bayes using TF-IDF representation is: " + str(F1))

col_0    -1
0          
-1     4621
 1      150
precision of Multinomial Naive Bayes using TF-IDF representation is: 0.0
recall of Multinomial Naive Bayes using TF-IDF representation is: 0.0


## Experiment #3: N-gram
Experiment 2 used TF-IDF to represent the data, but the classifier did not predict a single positive case. Here I try another feature representation on the title, abstract and keywords: n-grams (2-grams).

In [None]:
# 2-gram
vectorizer = feature_extraction.text.CountVectorizer(max_features = 20000,ngram_range=(2,2))
bag_2 = vectorizer.fit_transform(train_clean_long + dev_clean_long + test_clean_long).toarray()
train_2= bag_2[0:len(train_clean_long)]
dev_2= bag_2[len(train_clean_long):len(train_clean_long)+len(dev_clean_long)]
test_2 = bag_2[len(train_clean_long)+len(dev_clean_long):len(train_clean_long)+len(dev_clean_long)+len(test_clean_long)]
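
With `ngram_range=(2,2)` the features are pairs of adjacent tokens only; single words are not included (a range of `(1,2)` would keep both). The sketch below shows how such 2-grams are formed from an already pre-processed string. The helper name and the sample text are made up.

```python
from collections import Counter

def bigrams(doc):
    """Return adjacent token pairs joined by a space, e.g. 'random control'."""
    tokens = doc.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

doc = "random control trial random control"  # toy, already pre-processed
print(bigrams(doc))
# ['random control', 'control trial', 'trial random', 'random control']
print(Counter(bigrams(doc)).most_common(1))
# [('random control', 2)]
```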

In [None]:
y_train_2 = train_en[0]
X_train_2 = train_2
X_test_2 = dev_2
y_test_2 = dev_en[0]

clf_2 = MultinomialNB()
clf_2.fit(X_train_2, y_train_2)

pred_2 = clf_2.predict(X_test_2)

In [None]:
# Evaluation
# contingency table: rows are true labels, columns are predicted labels
print(pd.crosstab(index = y_test_2, columns = pred_2))
# TP = 102 (true 1, predicted 1), FP = 362 (true -1, predicted 1), FN = 48 (true 1, predicted -1)
precision = 102/(102+362)
recall = 102/(102+48)
F1 = 2*precision*recall/(precision + recall)
print("precision of Multinomial Naive Bayes using 2-gram representation is: " + str(precision))
print("recall of Multinomial Naive Bayes using 2-gram representation is: " + str(recall))
print("F1 of Multinomial Naive Bayes using 2-gram representation is: " + str(F1))

col_0    -1    1
0               
-1     4259  362
 1       48  102
precision of Multinomial Naive Bayes using 2-gram representation is: 0.21982758620689655
recall of Multinomial Naive Bayes using 2-gram representation is: 0.68
F1 of Multinomial Naive Bayes using 2-gram representation is: 0.3322475570032573


## Final Prediction
Of the configurations tried, bag-of-words features built from the title, abstract and keywords give our best F1 score on the dev partition (0.411). We used this combination to produce predictions for the test data.
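
The comparison can be reproduced from the contingency tables. F1 can be written as 2·TP / (2·TP + FP + FN), so it depends only on the true positives and the sum of the two off-diagonal cells. The run labels and the helper name below are hypothetical; the counts are read from the tables printed earlier.

```python
def f1_from_counts(tp, errors):
    """F1 = 2*TP / (2*TP + FP + FN); `errors` is FP + FN, the off-diagonal total."""
    return 2 * tp / (2 * tp + errors) if tp else 0.0

# (true positives, off-diagonal total) read from the contingency tables above.
runs = {
    "bow_title": (6, 52 + 144),
    "bow_title_abstract_keywords": (65, 101 + 85),
    "tfidf_title_abstract_keywords": (0, 150),
    "bigram_title_abstract_keywords": (102, 362 + 48),
}

scores = {name: f1_from_counts(*counts) for name, counts in runs.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best)
```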

In [None]:
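# test_en keeps the row labels of test_data after filtering, so give the predictions
# the same index; otherwise concat aligns by label and mismatches ids and predictions
pred_long_final = pd.Series(clf_long.predict(test_bow_long), index=test_en.index)
final_df = pd.concat([test_en[1], pred_long_final],axis=1)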

In [None]:
import csv
with open('output.tsv','wt') as out_file:
  writer = csv.writer(out_file, delimiter='\t', lineterminator='\n')
  for i in range(len(final_df)):
    writer.writerow(final_df.iloc[i])
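
An alternative sketch for writing the output pairs each document id with its prediction by position, using plain sequences, so no index alignment is involved. It assumes the identifier is what the notebook takes from column 1 of the test file; the function name, ids and labels below are made up, and an in-memory buffer stands in for `output.tsv`.

```python
import csv
import io

def write_predictions(ids, preds, out_file):
    """Write one `id<TAB>label` row per document, pairing ids and predictions by position."""
    if len(ids) != len(preds):
        raise ValueError("ids and preds must have the same length")
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    for doc_id, label in zip(ids, preds):
        writer.writerow([doc_id, label])

# Toy ids and labels; an in-memory buffer stands in for output.tsv.
buffer = io.StringIO()
write_predictions(["doc1", "doc2"], [1, -1], buffer)
print(buffer.getvalue())
```

In the notebook this could be called with `list(test_en[1])` and the array returned by `clf_long.predict(test_bow_long)`, inside a `with open('output.tsv', 'wt')` block.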