# Assignment 2 - CT5120

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **November 25, 2022**.
- This is an individual assignment, you **must not** work with other students to complete this assessment.
- The assignment is worth $50$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

| Task | Marks for write-up | Marks for code | Total Marks |
| :--- | :----------------- | :------------- | :---------- |
| 1    |                  5 |              5 |          10 |
| 2    |                  - |             10 |          10 |
| 3    |                  5 |              5 |          10 |
| 4    |                  5 |              5 |          10 |
| 5    |                  5 |              5 |          10 |



---

This assignment involves tasks for feature engineering, training and evaluating a classifier for suggestion detection. You will work with the data from SemEval-2019 Task 9 subtask A to classify whether a piece of text contains a suggestion or not. 


Download train.csv, test_seen.csv and test_unseen.csv from the [Github](https://github.com/sharduls007/Assignment_2_CT5120) or uncomment the code cell below to get the data as a comma-separated values (CSV) file. The CSV file contains a header row followed by 5,440 rows in train.csv and 1,360 rows in test_seen.csv spread across 3 columns of data. Each row of data contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion while a label of $0$ indicates that the text does not contain a suggestion.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums
](https://aclanthology.org/S19-2151/).

We will be using test_seen.csv for benchmarking our model, hence it has label. On the other hand, test_unseen is used for [Kaggle](https://www.kaggle.com/competitions/nlp2022ct5120suggestionmining/overview) competition.


In [1]:
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/train.csv" > train.csv
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_seen.csv" > test.csv
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_unseen.csv" > test_unseen.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 92  670k   92  616k    0     0   568k      0  0:00:01  0:00:01 --:--:--  569k
100  670k  100  670k    0     0   561k      0  0:00:01  0:00:01 --:--:--  562k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  168k  100  168k    0     0   463k      0 --:--:-- --:--:-- --:--:--  464k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  208k  100  208k    0     0   363k      0 --:

In [2]:
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
# Read the CSV file.
train_df = pd.read_csv('train.csv', 
                 names=['id', 'text', 'label'], header=0)

test_df = pd.read_csv('test.csv', 
                 names=['id', 'text', 'label'], header=0)
train_df = shuffle(train_df)
test_df = shuffle(test_df)

# Store the data as a list of tuples where the first item is the text
# and the second item is the label.
train_texts, train_labels = train_df["text"].to_list(), train_df["label"].to_list() 
test_texts, test_labels = test_df["text"].to_list(), test_df["label"].to_list() 

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1360
assert len(train_texts) == len(train_labels) == 5440

---

## Task 1: Data Pre-processing (10 Marks)

Explain at least 3 steps that you will perform to preprocess the texts before training a classifier.



Edit this cell to write your answer below the line in no more than 300 words.

---

> Delete this line and your write answer here

---

In the code cell below, write an implementation of the steps you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

In [3]:
from nltk.corpus import stopwords
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer
import string
import re
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stopwords = set(stopwords.words("english"))
import gensim

punctuation = set(string.punctuation)
tokenizer = TreebankWordTokenizer()
detokenizer = TreebankWordDetokenizer()
stemmer = PorterStemmer()
lemma = WordNetLemmatizer()

remove_stopword = lambda tokens: [token for token in tokens if token not in stopwords]
remove_punctuation = lambda tokens: [token.lower() for token in tokens if token not in punctuation]
stem_word = lambda words : [stemmer.stem(word) for word in words]
lem_word = lambda words : [lemma.lemmatize(word) for word in words]


def preprocess(text):
    return gensim.utils.simple_preprocess(text)

for idx, data in enumerate(train_texts):
    train_texts[idx] = detokenizer.detokenize(preprocess(data))
    if train_texts[idx] == "":
        train_texts.pop(idx)
        train_labels.pop(idx)
    
for idx, data in enumerate(test_texts):
    test_texts[idx] = detokenizer.detokenize(preprocess(data))
    if test_texts[idx] == "":
        test_texts.pop(idx)
        test_labels.pop(idx)

In [4]:
train_texts =  [text.replace("_","") for text in train_texts]
test_texts =  [text.replace("_","") for text in test_texts]

In [5]:
words = []
for sentence in train_texts:
    for word in sentence.split(" "):
        words.append(word)

---

## Task 2: Feature Engineering (I) - TF-IDF as features (10 Marks)

In the lectures we have seen that raw counts of words and `tf-idf` scores can be useful features for a classification task. Complete the following code cell to create a suggestion detector which uses `tf-idf` scores as features for a Naïve Bayes classifier.

After applying your preprocessing steps, use the training data to train the classifier and make predictions on the test set. You **must not** use the test set for training.

If everything is implemented correctly, then you should see a single floating point value between 0 and 1 at the end which denotes the accuracy of the classifier.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Calculate tf-idf scores for the words in the training set.
# ... your code goes here

count_vec = CountVectorizer(ngram_range=(2,2))
X_train_counts = count_vec.fit_transform(train_texts)

tf = TfidfTransformer()
X_train_tfidf = tf.fit_transform(X_train_counts)

# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# ... your code goes here

nb = MultinomialNB()
nb.fit(X_train_tfidf.toarray(), train_labels)

# Predict on the test set.

X_test_counts = count_vec.transform(test_texts)
X_test_tfidf = tf.transform(X_test_counts)
predictions = nb.predict(X_test_tfidf.toarray())    # save your predictions on the test set into this list
prediction_prob_old =  nb.predict_log_proba(X_test_tfidf.toarray())

# ... your code goes here

#################### DO NOT EDIT BELOW THIS LINE #################


#################### DO NOT EDIT BELOW THIS LINE #################

def accuracy(labels, predictions):
  '''
  Calculate the accuracy score for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1 
  
  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

0.7784342688330872

---

## Task 3: Evaluation Metrics (10 marks)

Why is accuracy not the best measure for evaluating a classifier? Describe an evaluation metric which might work better than accuracy for a classification task such as suggestion detection.

Edit this cell to write your answer below the line in no more than 150 words.

---

> Delete this line and your write answer here

---

In the code cell below, write an implementation of the evaluation metric you defined above. Please write your own implementation from scratch.

In [7]:
def evaluate(labels, predictions):
  '''
  Calculate an evaluation score other than accuracy for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  # check that labels and predictions are of same length
  assert len(labels) == len(predictions)

  score = 0.0
  
  #################### EDIT BELOW THIS LINE #########################

  # your code goes here
  true_positive = 0
  true_negative = 0
  false_positive = 0
  false_negative = 0

  for label, prediction in zip(labels, predictions):
        if label == 1 and prediction == 1:
            true_positive += 1
        elif label == 0 and prediction == 0:
            true_negative += 1
        elif label == 0 and prediction == 1:
            false_positive += 1
        else:
            false_negative += 1

  accuracy_score = (true_positive + true_negative) / (true_positive + true_negative + false_positive  + false_negative)
  precision_pos_class = true_positive / (true_positive + false_positive)
  recall_pos_class = true_positive / (true_positive + false_negative)
  precision_neg_class = true_negative / (true_negative + false_negative)
  recall_neg_class = true_negative / (true_negative + true_positive)
  f1 = true_positive/(true_positive + 0.5*(false_positive + false_negative))

  score = """ True Postive: %s \t False Poitive: %s \n False Negative: %s \t True Negative: %s 
  \n\n Precision class 0: %s \t Precision class 1: %s
  \n Recall: %s \t Recall class 0: %s
  \n\n Accurarcy: %s
  \n F1 Score: %s"""% (true_positive, false_negative, false_positive, true_negative, precision_pos_class, precision_neg_class,
                       recall_pos_class, recall_neg_class, accuracy_score, f1)
  #################### EDIT ABOVE THIS LINE #########################

  return print(score)

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 2 using tf-idf features.
evaluate(test_labels, predictions)

 True Postive: 39 	 False Poitive: 294 
 False Negative: 6 	 True Negative: 1015 
  

 Precision class 0: 0.8666666666666667 	 Precision class 1: 0.7754010695187166
  
 Recall: 0.11711711711711711 	 Recall class 0: 0.9629981024667932
  

 Accurarcy: 0.7784342688330872
  
 F1 Score: 0.20634920634920634


---

## Task 4: Feature Engineering (II) - Other features (10 Marks)

Describe features other than those defined in Task 2 which might improve the performance of your suggestion detector. If these features require any additional pre-processing steps, then define those steps as well.


Edit this cell to write your answer below the line in no more than 500 words.

---

> Delete this line and your write answer here

---

In the code cell below, write an implementation of the features (and any additional pre-preprocessing steps) you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

After creating your features, use the training data to train a Naïve Bayes classifier and use the test set to evaluate its performance using the metric defined in Task 3. You **must not** use the test set for training.

To make sure that your code doesn't take too long to run or use too much memory, you can consider a time limit of 3 minutes and a memory limit of 12GB for this task.

In [8]:
train_data = pd.DataFrame(list(zip(train_texts, train_labels)), columns= ["data", "label"])
test_data = pd.DataFrame(list(zip(test_texts, test_labels)), columns= ["data", "label"])

neg_train_data = train_data['label'] == 0
pos_train_data = train_data['label'] == 1
pos_test_data = test_data['label'] == 1
neg_test_data = test_data['label'] == 0

from sklearn.utils import shuffle

pos_data = train_data.loc[pos_train_data]
neg_data = train_data.loc[neg_train_data].sample(n= 1334, random_state=100)

train_data = pd.concat([pos_data, neg_data])
train_data = shuffle(train_data)

pos_data = test_data.loc[pos_train_data]
neg_data = test_data.loc[neg_train_data]

test_data = pd.concat([pos_data, neg_data])
test_data = shuffle(train_data)

train_texts, train_labels = train_data["data"].to_list(), train_data["label"].to_list() 
test_texts, test_labels = train_data["data"].to_list(), train_data["label"].to_list() 

In [9]:
# Create your features.
# ... your code goes here
tf = CountVectorizer(ngram_range=(2,2))
Pos_tag_X_train = [detokenizer.detokenize([t for w,t in nltk.pos_tag(tokenizer.tokenize(text))])  for text in train_texts]
Pos_tag_X_test = [detokenizer.detokenize([t for w,t in nltk.pos_tag(tokenizer.tokenize(text))])  for text in test_texts]

New_X_train = tf.fit_transform(Pos_tag_X_train)
New_X_test = tf.transform(Pos_tag_X_test)

# Train a Naïve Bayes classifier using the features you defined.
# ... your code goes her
nb = MultinomialNB()
nb.fit(New_X_train.toarray(), train_labels)

# Evaluate on the test set.
predictions = nb.predict(New_X_test.toarray())
prediction_prob_new = nb.predict_log_proba(New_X_test.toarray())
# ... your code goes here

In [10]:
evaluate(test_labels, predictions)

 True Postive: 1073 	 False Poitive: 261 
 False Negative: 339 	 True Negative: 995 
  

 Precision class 0: 0.759915014164306 	 Precision class 1: 0.7921974522292994
  
 Recall: 0.8043478260869565 	 Recall class 0: 0.4811411992263056
  

 Accurarcy: 0.775112443778111
  
 F1 Score: 0.7815003641660597


In [11]:
len(predictions)

2668

In [12]:
predictions = []
for new, old in zip(prediction_prob_new, prediction_prob_old):
    if (new[0]+old[0])/2 > (new[1]+old[1])/2:
        predictions.append(0)
    else:
        predictions.append(1)

In [13]:
evaluate(predictions, test_labels)

AssertionError: 

In [None]:
len(prediction_prob_old)

---

## Task 5: Kaggle Competition (10 marks)

Head over to https://www.kaggle.com/t/1f90b74da0b7484da9647638e22d1068  
Use above classifier to predict the label for test_unseen.csv from competition page and upload the results to the leaderboard. The current baseline score is 0.36823. Make an improvement above the baseline. Please note that the evaluation metric for the competition is the f-score.

Read competition page for more details.



from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Preparing submission for Kaggle


StudentID = "22223696_Patill" # Please add your student id and lastname
test_unseen = pd.read_csv("test_unseen.csv", names=['id', 'text'], header=0)

test_texts = test_unseen['text']

for idx, data in enumerate(test_texts):
    test_texts[idx] = detokenizer.detokenize(preprocess(data))

test_texts =  [text.replace("_","") for text in test_texts]

tf_text_unseen = tf.transform(test_texts)
                          
predictions = nb.predict(tf_text_unseen.toarray())
        
# Here Id is unique identifier assigned to each test sample ranging from test_0 till test_1699
# Expected is a list of prediction made by your classifier
sub = {"Id": [f"test_{i}" for i in range(len(test_unseen))],
       "Expected": predictions}

sub_df = pd.DataFrame(sub)
# The code below will generate a StudentID.csv on your drive on the left hand side in the explorer
# Please upload the file as a submission on the competition page
# You can index your submission StudentID_Lastname_index.csv, where index is your number of submission
sub_df.to_csv(f"{StudentID}.csv", sep=",", header=1, index=None)

Mention the approach that you have chosen briefly, and what is the mean average f-score that you have achieved? Did it improve above the chosen baseline model (0.36823)? Why or why not?

Edit this cell to write your answer below the line in no more than 500 words.

---

> Delete this line and your write answer here

---