# Information Retrieval: Programming Assignment \#4

### Sheetal Parikh
EN.605.744.81<br>
November 1, 2021
***

In [1]:
#imports for notebook
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn import datasets, svm, metrics
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

import os 
import csv

# change the current directory 
# to specified directory 
os.chdir(r"C:\Users\Sheetal\Documents\Sheetal\datasets") 

#checking current directory
#print(os.getcwd() + "\n")

#direct path to files
filepath_test = '/Users/Sheetal/Documents/Sheetal/datasets/phase1.test.shuf.tsv'
filepath_dev = '/Users/Sheetal/Documents/Sheetal/datasets/phase1.dev.shuf.tsv'
filepath_train = '/Users/Sheetal/Documents/Sheetal/datasets/phase1.train.shuf.tsv'

### EDA and Reading in Files

In [2]:
#reading in files

train = pd.read_csv(filepath_train, sep='\t', header=None,
       names=["Assessment", "Docid", "Title", "Authors", "Journal", "Issn", "Year", "Language", "Abstract", "Keywords"])

dev = pd.read_csv(filepath_dev, sep='\t', header=None,
       names=["Assessment", "Docid", "Title", "Authors", "Journal", "Issn", "Year", "Language", "Abstract", "Keywords"])

test = pd.read_csv(filepath_test, sep='\t', header=None,
       names=["Assessment", "Docid", "Title", "Authors", "Journal", "Issn", "Year", "Language", "Abstract", "Keywords"])

In [3]:
# Sanity check for train file
print(f'N rows={len(train)}, M columns={len(train.columns)}')

#print first few rows to visualize the training dataset
train.head()

N rows=21662, M columns=10


| | Assessment | Docid | Title | Authors | Journal | Issn | Year | Language | Abstract | Keywords |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | hash:3f1ebe70-a242-3b43-843c-eef89284607a | Misoprostol for treating postpartum haemorrhag... | Hofmeyr, G. J.;Ferreira, S.;Nikodem, V. C.;Man... | BMC Pregnancy and Childbirth | 1471-2393 | 2004 | eng | Background: Postpartum haemorrhage remains an ... | South Africa;adult;article;blood transfusion;c... |
| 1 | -1 | hash:aa35378f-0460-37f1-b001-ac735e027333 | Vitamin A supplements and diarrheal and respir... | Fawzi, W. W.;Mbise, R.;Spiegelman, D.;Fataki, ... | J Pediatr | 0022-3476 | 2000 | eng | OBJECTIVE: To determine the effect of vitamin ... | Child, Preschool;Diarrhea/ epidemiology;Dietar... |
| 2 | -1 | hash:3ddd7e14-a607-3313-a74f-613c988206f3 | The efficacy and safety of a controlled releas... | Gathua, S. N.;Aluoch, J. A. | East Afr Med J | | 1990 | eng | The treatment of asthma in Africa is influence... | Adult;Albuterol/administration & dosage/advers... |
| 3 | -1 | hash:41e91fb1-6cfe-3347-aa26-4424e0afe11e | The state of the art of education for child su... | Mrisho, F. H. | BERC Bull | | 1987 | eng | PIP: Tanzania has both a high infant mortality... | Africa;Africa South of the Sahara;Africa, East... |
| 4 | -1 | hash:92d8601c-ef4f-39b1-b65c-ce1c7325ba6f | [The practicability of preceptorship in the cu... | Lin, C. C.;Lo, K. M.;Leu, C. S. | Kaohsiung J Med Sci | | 1996 | eng | Physicians who have graduated from traditional... | Adult;Aged;Curriculum;Education, Medical;Engli... |


In [4]:
#checking for missing values in training set
train.isnull().sum()

Assessment       0
Docid            0
Title            0
Authors        412
Journal        825
Issn          8240
Year           681
Language         1
Abstract         1
Keywords       750
dtype: int64

In [5]:
#checking for missing values in dev set
dev.isnull().sum()

Assessment       0
Docid            0
Title            0
Authors          0
Journal        168
Issn          1845
Year           139
Language         0
Abstract         0
Keywords       164
dtype: int64

In [6]:
#checking for missing values in test set
test.isnull().sum()

Assessment       0
Docid            0
Title            0
Authors          0
Journal        182
Issn          1825
Year           164
Language         0
Abstract         0
Keywords       142
dtype: int64

In [7]:
#summary of attributes
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21662 entries, 0 to 21661
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Assessment  21662 non-null  int64 
 1   Docid       21662 non-null  object
 2   Title       21662 non-null  object
 3   Authors     21250 non-null  object
 4   Journal     20837 non-null  object
 5   Issn        13422 non-null  object
 6   Year        20981 non-null  object
 7   Language    21661 non-null  object
 8   Abstract    21661 non-null  object
 9   Keywords    20912 non-null  object
dtypes: int64(1), object(9)
memory usage: 1.7+ MB


In [8]:
#checking for the number of languages that are present in the training dataset
train['Language'].value_counts()

eng    21625
fre       13
spa        9
por        8
ger        2
chi        2
afr        1
dut        1
Name: Language, dtype: int64

As can be seen above, all three files contain null values. This won't be an issue for the first experiment, since it uses only the Title column, which has no nulls in any file. However, the files have to be cleaned before combining features from the title, keyword, and abstract columns; of these three, the Keywords column has by far the most nulls (750 in the training set). Almost all documents are in English (21,625 of the 21,661 training rows that have a language value), so I think it is sufficient to remove English stop words. For the same reason, I don't believe Language would be a good feature, since it barely varies across documents. The Issn column is also a poor candidate because it is null in 8,240 of the 21,662 training rows.

### Formulas for Recall, Precision and F1-Score

The CountVectorizer from the scikit-learn library will be used to convert the text to numerical data: each document becomes a vector whose entries are the counts of each vocabulary word in that document.
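
As a minimal sketch of what `CountVectorizer` produces, the toy example below fits a vectorizer on two made-up titles (not rows from the dataset) with English stop words removed. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up titles (illustrative only, not taken from the dataset)
toy_titles = [
    "Malaria treatment in rural Kenya",
    "Treatment of malaria in children",
]

# Lowercases, tokenizes, and drops English stop words such as "in" and "of"
toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_titles)

# Vocabulary terms listed in column order
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)

# One row per title, one column per vocabulary term
print(toy_counts.toarray())
```

Each row of the matrix is the sparse feature vector for one title; the `(0, 10171)  1` style output later in the notebook is the same information printed in sparse form.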

In [9]:
#calculating recall of the predicted values
#Recall is the percentage of +1s in the Dev file that were correctly predicted to belong to the class; precision
#is the percentage of +1s in the output file (i.e., predicted positive) which are indeed correct according to the 
#Dev file labels.

#formula for recall - percentage of actual +1s that are correctly predicted as +1
def calc_recall(predictions, actual, wording = True):
    total_actual = 0
    correct_pred = 0
    
    #tallying number of actual positives and number of those predicted correctly
    for p, a in zip(predictions, actual):
        if a == 1:
            total_actual += 1
            
            if p == 1:
                correct_pred += 1
    #wording can be turned off if needed
    if wording:
        print(f'Recall: {correct_pred}/{total_actual} = {correct_pred/total_actual}')
    
    return correct_pred/total_actual

In [10]:
#formula for precision - percentage of +1s that are correctly predicted according to Dev file labels
def calc_precision(predictions, actual, wording = True):
    total_pred = 0
    correct_pred = 0
    
    #tallying number of total predictions and number of correct predictions
    for p, a in zip(predictions, actual):
        if p == 1:
            total_pred += 1
            
            if a == 1:
                correct_pred += 1
    #wording can be turned off if needed
    if wording:
        print(f'Precision: {correct_pred}/{total_pred} = {correct_pred/total_pred}')
    
    return correct_pred/total_pred

In [11]:
#formula for F1 Score:
def f1_score(predictions, actual):
    p = calc_precision(predictions, actual, wording = False)
    r = calc_recall(predictions, actual, wording = False)
    
    return 2*p*r/(p+r)

In [12]:
#function for printing all the results
def printResults(pred, data):
    calc_precision(pred, data)
    calc_recall(pred, data)
    f1 = f1_score(pred, data)
    print(f'F1-Score: {f1}')
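
The hand-written metrics above can be cross-checked against `sklearn.metrics`. The sketch below uses made-up labels (not the assignment data) and assumes +1 is the positive class, as in the notebook. Note that scikit-learn takes the true labels first, whereas the functions above take the predictions first.

```python
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score as sk_f1_score  # alias avoids clashing with the notebook's f1_score

# Made-up labels for illustration: 4 actual positives, 3 predicted positives, 2 in common
toy_actual = [1, 1, 1, -1, -1, -1, -1, 1]
toy_pred = [1, -1, 1, 1, -1, -1, -1, -1]

# scikit-learn's argument order is (y_true, y_pred)
p = precision_score(toy_actual, toy_pred, pos_label=1)  # 2 of 3 predicted positives are correct
r = recall_score(toy_actual, toy_pred, pos_label=1)     # 2 of 4 actual positives are found
f1 = sk_f1_score(toy_actual, toy_pred, pos_label=1)     # harmonic mean of p and r

print(p, r, f1)
```

If the two implementations disagree on the same inputs, the most likely cause is swapped argument order.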

### Naive Bayes Model - BernoulliNB

#### Baseline - Training Set

Below we can see the results of predicting the training set assessment values with a model trained on that same set, using only the title features.

In [13]:
#only extracting features from title from train file
train_title = train['Title']
train_title.head()

0    Misoprostol for treating postpartum haemorrhag...
1    Vitamin A supplements and diarrheal and respir...
2    The efficacy and safety of a controlled releas...
3    The state of the art of education for child su...
4    [The practicability of preceptorship in the cu...
Name: Title, dtype: object

In [14]:
#creating feature vectors of only the title column - removing stopwords
vectorizer = CountVectorizer(stop_words='english')
train_title_vectors = vectorizer.fit_transform(train_title)
print(train_title_vectors[0])

  (0, 10171)	1
  (0, 16052)	1
  (0, 12357)	1
  (0, 7072)	1
  (0, 13143)	1
  (0, 3919)	1
  (0, 16070)	1
  (0, 8591)	1


In [15]:
# Building the naive bayes model - using the training set
train_assessment = train['Assessment']
model = BernoulliNB(alpha=.01).fit(train_title_vectors, train_assessment)

In [16]:
#baseline training set
predictions = model.predict(train_title_vectors)
printResults(predictions, train_assessment)

Precision: 611/1099 = 0.5559599636032757
Recall: 611/695 = 0.879136690647482
F1-Score: 0.6811594202898551


#### Baseline - Dev File

Below we can see the results of using the training set to train the model to predict the dev assessment values.  We will only be using the title features

In [17]:
#only extracting features from title from dev file
dev_title = dev['Title']
dev_title.head()

0    Educational needs in patient care practices in...
1    Methods, equipment and techniques for rural he...
2    Limitations in verbal fluency following heavy ...
3    Attitude towards rape: a comparative study amo...
4    An evaluation of a training workshop for pharm...
Name: Title, dtype: object

In [18]:
#creating feature vectors
vectorizer = CountVectorizer(stop_words='english')
train_title_vectors = vectorizer.fit_transform(train_title)

#building model
model_nb = BernoulliNB(alpha=.01).fit(train_title_vectors, train_assessment)

In [19]:
#predicting assessment results
dev_title_vectors = vectorizer.transform(dev_title)
pred_nb = model_nb.predict(dev_title_vectors)
dev_assessment = dev['Assessment']

In [20]:
#viewing the feature vectors
print(dev_title_vectors[0])

  (0, 2888)	1
  (0, 5255)	1
  (0, 7488)	1
  (0, 8949)	1
  (0, 10661)	1
  (0, 11750)	1
  (0, 12411)	1


In [21]:
#printing results
printResults(pred_nb, dev_assessment)

Precision: 60/178 = 0.33707865168539325
Recall: 60/150 = 0.4
F1-Score: 0.3658536585365853


### Using Features from Title, Abstract and Keywords Fields

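Due to the high number of NA's in the Keywords field, I chose to drop the affected rows. Note that `dropna()` as called below removes rows with an NA in any column (including Issn, Year, and Journal), not just Keywords, which reduces the training set from 21,662 to 12,887 rows. A better way to handle the NA's may have been to compare the Abstracts of documents and copy keywords from the documents with the most similar Abstracts.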

In [22]:
#drop all NA's in training set
train_2 = train.dropna()

#checking if NAs have dropped
train_2.isnull().sum()

Assessment    0
Docid         0
Title         0
Authors       0
Journal       0
Issn          0
Year          0
Language      0
Abstract      0
Keywords      0
dtype: int64
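
Since `dropna()` with no arguments discards a row when any column is missing, many rows are lost only because of columns that are never used as features (for example Issn). The sketch below, on made-up rows, compares that behavior with two alternatives: restricting the check to the text columns via `subset`, or keeping every row and filling missing text with an empty string. This is an illustration of the pandas options, not what the notebook ran.

```python
import pandas as pd

# Made-up rows (illustrative only); None marks a missing value
toy = pd.DataFrame({
    "Title":    ["t0", "t1", "t2", "t3"],
    "Abstract": ["a0", "a1", "a2", None],
    "Keywords": ["k0", "k1", None, "k3"],
    "Issn":     [None, "1471-2393", None, "0022-3476"],
})

text_cols = ["Title", "Abstract", "Keywords"]

# 1) Drop a row if ANY column is missing (what plain dropna() does)
drop_any = toy.dropna()

# 2) Drop a row only if one of the text columns is missing
drop_text = toy.dropna(subset=text_cols)

# 3) Keep every row and treat missing text as an empty string
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("")

print(len(drop_any), len(drop_text), len(filled))
```

Option 3 would also allow predictions for every test document rather than only those without NAs.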

In [23]:
#creating list of all title, abstract and keyword features using train set without NAs
train_list = train_2['Title'] + train_2['Abstract'] + train_2['Keywords']
train_list

0        Misoprostol for treating postpartum haemorrhag...
1        Vitamin A supplements and diarrheal and respir...
5        Yad L'hakhlama (Reach to Recovery) in IsraelYa...
10       [Knowledge and treatment of malaria in rural S...
12       Emergency aortic stent grafting for traumatic ...
                               ...                        
21654    Prevalence of Chlamydia trachomatis infection ...
21657    155 vascular injuries: a retrospective study i...
21658    Clinicians' perceptions of the problem of anti...
21659    Histopathology services in a rural African hos...
21660    An audit of structure, process and outcome of ...
Length: 12887, dtype: object
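
One detail visible in the output above ("...in IsraelYa..."): concatenating the columns with `+` and no separator fuses the last word of one field with the first word of the next, so the vectorizer sees a single merged token. A minimal sketch of the difference, using a made-up row:

```python
import pandas as pd

# Made-up row for illustration
toy = pd.DataFrame({
    "Title":    ["Malaria in Kenya"],
    "Abstract": ["Background: a survey"],
    "Keywords": ["Africa;malaria"],
})

# No separator: "Kenya" and "Background" merge into one token
fused = toy["Title"] + toy["Abstract"] + toy["Keywords"]

# A single space keeps the fields apart at their boundaries
spaced = toy["Title"] + " " + toy["Abstract"] + " " + toy["Keywords"]

print(fused.iloc[0])
print(spaced.iloc[0])
```

Adding the separator would change the vocabulary, so the scores reported in this notebook would not be reproduced exactly.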

In [24]:
#drop all NA's in dev set
dev2 = dev.dropna()
dev_data = dev2['Title'] + dev2['Abstract'] + dev2['Keywords']
dev_data

0       Educational needs in patient care practices in...
1       Methods, equipment and techniques for rural he...
2       Limitations in verbal fluency following heavy ...
5       A pilot study of gender inequalities related t...
6       Optimal staffing for Acute Care of the Elderly...
                              ...                        
4843    Knowledge, attitude, the perceived risks of in...
4845    Effects on age on spinal cord lesion patients'...
4846    Malaria: knowledge and behaviour in an endemic...
4847    Coronary artery disease is associated with the...
4849    Dermatophytomycoses in children in rural Kenya...
Length: 2922, dtype: object

In [25]:
#creating feature vectors 
vectorizer = CountVectorizer(stop_words='english')
train_title_vectors = vectorizer.fit_transform(train_list)

#creating model
model_nb2 = BernoulliNB(alpha=.001).fit(train_title_vectors, train_2['Assessment'])

In [26]:
dev_title_vectors = vectorizer.transform(dev_data)
pred_nb2 = model_nb2.predict(dev_title_vectors)

printResults(pred_nb2, dev2['Assessment'])

Precision: 42/90 = 0.4666666666666667
Recall: 42/101 = 0.4158415841584158
F1-Score: 0.43979057591623033


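As expected, the naive bayes model using the title, abstract and keyword features achieved better precision, recall and F1-score than the title-only model (F1 of about 0.44 versus 0.37). The two runs are not strictly comparable, though, because the second is evaluated only on the 2,922 dev rows without NAs (101 positives instead of 150). For the SVM and Logistic Regression models below, I'll be using the title, abstract and keyword features.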

### SVM - Experiment 

In [27]:
#training feature vectors
vectorizer = CountVectorizer(stop_words='english')
train_vectors = vectorizer.fit_transform(train_list)

In [45]:
#building model
model_svc = svm.LinearSVC(max_iter = 4000).fit(train_vectors, train_2['Assessment'])

In [29]:
#predicting assessment results of dev file 
dev_vectors = vectorizer.transform(dev_data)
preds_svm = model_svc.predict(dev_vectors)

#view results
printResults(preds_svm, dev2['Assessment'])

Precision: 46/103 = 0.44660194174757284
Recall: 46/101 = 0.45544554455445546
F1-Score: 0.4509803921568628


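The SVM model achieved a higher recall (about 46% vs. 42%) and F1-score (0.451 vs. 0.440) than the Naive Bayes model on the same title, abstract, and keyword features with all NA rows removed, at a slightly lower precision (about 45% vs. 47%).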

### Logistic Regression - Experiment

In [30]:
#training set feature vectors
vectorizer = CountVectorizer(stop_words='english')
train_vectors = vectorizer.fit_transform(train_list)

In [31]:
#building model
model_lr = LogisticRegression(max_iter = 1000).fit(train_vectors, train_2['Assessment'])

In [32]:
#predicting assessment results of dev file 
dev_vectors = vectorizer.transform(dev_data)
preds_lr = model_lr.predict(dev_vectors)

#view results
printResults(preds_lr,dev2['Assessment'])

Precision: 44/87 = 0.5057471264367817
Recall: 44/101 = 0.43564356435643564
F1-Score: 0.4680851063829788


The logistic regression model produced the best precision (around 51%) and F1-Score (around 47%) of the three models; its recall of around 44% is slightly below the SVM's 46%. As with the SVM model, I used the title, abstract and keyword features from the training and dev datasets after removing all rows with NAs.

### Logistic Regression and Adding Journal Features - Experiment

In [33]:
#combining the title, abstract, keyword and journal text of the dev data
dev_data2 = dev2['Title'] + dev2['Abstract'] + dev2['Keywords'] + dev2['Journal']
dev_data2

0       Educational needs in patient care practices in...
1       Methods, equipment and techniques for rural he...
2       Limitations in verbal fluency following heavy ...
5       A pilot study of gender inequalities related t...
6       Optimal staffing for Acute Care of the Elderly...
                              ...                        
4843    Knowledge, attitude, the perceived risks of in...
4845    Effects on age on spinal cord lesion patients'...
4846    Malaria: knowledge and behaviour in an endemic...
4847    Coronary artery disease is associated with the...
4849    Dermatophytomycoses in children in rural Kenya...
Length: 2922, dtype: object

In [34]:
#creating feature vectors of the training data
vectorizer = CountVectorizer(stop_words='english')
train_title_vectors = vectorizer.fit_transform(train_list)

In [35]:
#building logistic regression model
model_lr2 = LogisticRegression(max_iter = 1000).fit(train_title_vectors, train_2['Assessment'])

In [36]:
#prediction results
dev_title_vectors = vectorizer.transform(dev_data2)
preds_lr2 = model_lr2.predict(dev_title_vectors)

#view results
printResults(preds_lr2,dev2['Assessment'])

Precision: 42/85 = 0.49411764705882355
Recall: 42/101 = 0.4158415841584158
F1-Score: 0.4516129032258065


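For this experiment, I chose to use the same logistic regression model but added the journal feature. This produced slightly lower results than the previous experiment, which used the logistic regression model with only the title, abstract, and keyword features. Note that in this run the Journal text was appended only to the dev documents; the model was still trained on `train_list`, which does not include Journal, so the comparison is not a like-for-like test of the feature.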
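
For an added field to act as a feature, it has to be present in the text used to fit the vectorizer as well as in the text being predicted. One way to keep the two in sync is a small helper applied to both frames. The sketch below uses made-up rows, and `combine_fields` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def combine_fields(df, fields):
    """Hypothetical helper: join the given text columns with spaces."""
    return df[fields].fillna("").apply(" ".join, axis=1)

# Made-up rows for illustration
toy_train = pd.DataFrame({
    "Title":   ["Malaria control", "Cardiac surgery outcomes"],
    "Journal": ["Lancet", "BMJ"],
})
toy_dev = pd.DataFrame({
    "Title":   ["Malaria vaccines"],
    "Journal": ["Lancet"],
})

# The same field list is used for fitting and for transforming
fields = ["Title", "Journal"]
toy_vectorizer = CountVectorizer(stop_words='english')
X_train = toy_vectorizer.fit_transform(combine_fields(toy_train, fields))
X_dev = toy_vectorizer.transform(combine_fields(toy_dev, fields))

# Journal tokens are now part of the learned vocabulary
print("lancet" in toy_vectorizer.vocabulary_)
print(X_train.shape, X_dev.shape)
```

Terms that appear only at transform time are silently ignored by `CountVectorizer`, which is why adding a field to the dev text alone has little effect.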

### Creating Test File of Best Model

Based on the results above, I will use the logistic regression model to create the test set predictions, as this model had the highest precision and F1-Score on the dev set (the SVM had slightly higher recall). I used the title, abstract, and keyword features, as this combination also produced the best results. As in the other experiments, I removed the rows with NAs from the test set, since they were causing errors.
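
To make the comparison explicit, the short sketch below recomputes precision, recall, and F1 from the counts printed in the experiments above. Only the four runs evaluated on the same 2,922-row dev subset are included; the title-only baseline is left out because it was scored on the full dev file. The dictionary keys are just labels chosen for this illustration.

```python
# (true positives, predicted positives, actual positives) as printed above
reported = {
    "BernoulliNB": (42, 90, 101),
    "LinearSVC": (46, 103, 101),
    "LogisticRegression": (44, 87, 101),
    "LogisticRegression + Journal": (42, 85, 101),
}

def scores(tp, n_pred, n_actual):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / n_pred
    recall = tp / n_actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

table = {name: scores(*counts) for name, counts in reported.items()}
for name, (p, r, f1) in table.items():
    print(f"{name:30s} P={p:.3f} R={r:.3f} F1={f1:.3f}")

# Which run leads on each metric
best_f1 = max(table, key=lambda name: table[name][2])
best_recall = max(table, key=lambda name: table[name][1])
print(best_f1, best_recall)
```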

In [37]:
#removing any NA's from the set
test2 = test.dropna()

#creating list of features for the test set
test_data = test2['Title'] + test2['Abstract'] + test2['Keywords']

In [38]:
#creating feature vectors for training set
vectorizer = CountVectorizer(stop_words='english')
train_title_vectors = vectorizer.fit_transform(train_list)

#creating model
model_lr3 = LogisticRegression(max_iter = 1000).fit(train_title_vectors, train_2['Assessment'])

In [39]:
#creating feature vectors of test set 
test_vectors = vectorizer.transform(test_data)

#test set predictions
preds_lr3 = model_lr3.predict(test_vectors)

In [40]:
#function to save predictions by docid in a comma-separated txt file 

def save_preds(_fn, _y_pred, _df):
    #csv is already imported at the top of the notebook
    #newline='' lets the csv writer control line endings itself
    with open(_fn, 'w', newline='') as fout:
        writer = csv.writer(fout, delimiter=',', lineterminator='\n')
        writer.writerow(['Docid', 'Assessment'])
        for y, docID in zip(_y_pred, _df['Docid']):
            writer.writerow([docID, y])

In [41]:
#printing the prediction results to a txt file
save_preds('SPARIKH6.txt', preds_lr3, test2)

In [42]:
#reading in txt file
pred_results = pd.read_csv('SPARIKH6.txt', sep=',')
pred_results.head()

| | Docid | Assessment |
|---|---|---|
| 0 | hash:518a02c1-75f6-3e00-8575-de6ee4ca0e32 | -1 |
| 1 | hash:54c7ea79-9c8a-38b3-95c7-a0def1df161d | -1 |
| 2 | hash:0ffaac73-5a77-3a7d-bd07-33e495cf577d | 1 |
| 3 | hash:6fb4a864-953e-3e60-9b86-778872647be5 | -1 |
| 4 | hash:7aece21e-e933-3702-984c-a3efb6cc8595 | -1 |


In [43]:
#print the assessment results
pred_results['Assessment'].value_counts()

-1    2858
 1      67
Name: Assessment, dtype: int64

As can be seen, the predicted values for the test file are also highly skewed toward -1, similar to the labels in the dev and training sets (only 695 of the 21,662 training documents are labeled +1). I believe this class imbalance is one of the reasons I wasn't able to produce an F1-Score above 47%. The model might perform better with a more even class distribution. There was also a lot of missing data: rather than simply removing the rows with NAs, it may have been beneficial to come up with a method to replace the NAs with an estimated text value.
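
One common way to compensate for this kind of imbalance is class weighting. The sketch below only computes the weights that scikit-learn's "balanced" heuristic, `n_samples / (n_classes * count_per_class)`, would assign given the training label counts reported earlier (695 positives out of 21,662). It was not part of the experiments above, and whether it would improve F1 on this data is untested.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts taken from the training-set results earlier in the notebook
n_pos = 695
n_neg = 21662 - n_pos

# Synthetic label vector with the same class proportions
y = np.array([-1] * n_neg + [1] * n_pos)
classes = np.array([-1, 1])

# "balanced" weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
print(weight_by_class)
```

Passing `class_weight='balanced'` to `LogisticRegression` or `LinearSVC` applies the same weighting during training, so errors on the rare +1 class cost more than errors on the -1 class.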
