---

This notebook covers feature engineering, training and evaluating a classifier for suggestion detection. We work with the data from SemEval-2019 Task 9 Subtask A to classify whether a piece of text contains a suggestion.

Download train.csv, test_seen.csv and test_unseen.csv from the [GitHub repository](https://github.com/sharduls007/Assignment_2_CT5120) or run the code cell below to fetch the data as comma-separated values (CSV) files. Each file starts with a header row; train.csv then has 5,440 rows and test_seen.csv has 1,360 rows, each spread across 3 columns. Each row contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion, while a label of $0$ indicates that it does not.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums](https://aclanthology.org/S19-2151/).

We use test_seen.csv to benchmark our model, so it includes labels. test_unseen.csv, on the other hand, is unlabelled and is used for the [Kaggle](https://www.kaggle.com/competitions/nlp2022ct5120suggestionmining/overview) competition.


In [1]:
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/train.csv" > train.csv
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_seen.csv" > test_seen.csv
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_unseen.csv" > test_unseen.csv


In [2]:
import numpy as np
import pandas as pd

# Read the CSV file.
train_df = pd.read_csv('train.csv', 
                 names=['id', 'text', 'label'], header=0)

test_df = pd.read_csv('test_seen.csv', 
                 names=['id', 'text', 'label'], header=0)

# Store the texts and the labels as two parallel lists
# for each of the training and test sets.
train_texts, train_labels = train_df["text"].to_list(), train_df["label"].to_list() 
test_texts, test_labels = test_df["text"].to_list(), test_df["label"].to_list() 

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1360
assert len(train_texts) == len(train_labels) == 5440

---

## 1: Data Pre-processing 




---
> We preprocess the text by stripping URLs, removing punctuation, converting everything to lower case, and splitting each sentence into word tokens. Lower-casing keeps word counts consistent (for example, "Please" and "please" are counted as the same token); it is one of the simplest and most effective forms of text normalisation, applies to most text mining and NLP problems, and is especially helpful when the dataset is small, because fewer distinct tokens have to be learned from the same amount of data. Punctuation removal deletes characters and text fragments that may interfere with the analysis; it is one of the most important stages of text preprocessing.
Next, we tokenize the text. This splits each text into smaller units (tokens), which allows better generalisation of the relationship between the texts and the labels, and it determines the dataset's "vocabulary" (the set of unique tokens present in the data). Finally, we remove stop words. Deleting these very frequent, low-information terms lets us focus on the words that carry the important information.
---
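
To make the steps above concrete, here is a minimal sketch that applies them to a single string. The function name `preprocess_one`, the tiny `STOP_WORDS` set and the example sentence are illustrative assumptions; they stand in for the pandas-based function and NLTK's English stop-word list used in the notebook, and the character class here keeps all digits.

```python
import re

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "it", "at", "for"}

URL_PATTERN = re.compile(r"https?://\S+")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]")

def preprocess_one(text):
    """Strip URLs and punctuation, lower-case, tokenize, and drop stop words."""
    text = URL_PATTERN.sub(" ", text)        # remove links
    text = NON_ALNUM_PATTERN.sub(" ", text)  # replace punctuation with spaces
    tokens = text.lower().split()            # lower-case and split on whitespace
    return [tok for tok in tokens if tok not in STOP_WORDS]

example = "Please add a dark mode to the app, see https://example.com/ideas!"
print(preprocess_one(example))
```

For this example the result is `['please', 'add', 'dark', 'mode', 'app', 'see']`: the link, the punctuation and the stop words "a", "to" and "the" are gone.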

In [3]:
# Import libraries.
import re
from warnings import simplefilter

import nltk
from nltk.corpus import stopwords

# Ignore all future warnings.
simplefilter(action='ignore', category=FutureWarning)

train_text = train_df.text
test_text = test_df.text

# Fetch the stop-word list if it is not already available locally.
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Clean a pandas Series of raw texts and return a Series of processed strings."""

    # Removal of links.
    url = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    texts = text.str.replace(url, '', regex=True)

    # Removal of punctuation. Note: this class keeps letters and only the
    # digits 0 and 1 (equivalent to the class `1-10`); widen it to `0-9`
    # to keep all digits.
    newtext = texts.str.replace(r"[^a-zA-Z01]", " ", regex=True)

    # Conversion to lower case.
    text_lower = newtext.str.lower()

    # Tokenization on whitespace; this also discards any extra spaces.
    tokenized_text = text_lower.apply(lambda x: x.split())

    # Removal of stop words.
    tokenized_text = tokenized_text.apply(lambda x: [item for item in x if item not in stop_words])

    # Join the tokens back into a single string per text.
    return tokenized_text.apply(' '.join)

processed_texts = preprocess(train_text)   
train_df['processed_text'] = processed_texts

processed_texts = preprocess(test_text)  
test_df['processed_text'] = processed_texts

In [2]:
from platform import python_version
python_version()

'3.9.16'

---

## 2: Feature Engineering (I) - TF-IDF as features

The raw counts of words and `tf-idf` scores can be useful features for a classification task.
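
As a rough illustration of what a tf-idf score is, the sketch below computes it by hand for a toy corpus. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each document vector, which is what scikit-learn documents as its default behaviour. The helper name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, one L2-normalised tf-idf vector per document)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed variant, see the text above).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

docs = ["please add search", "search is slow", "please fix search"]
vocab, vectors = tfidf(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:>7}: {weight:.3f}")
```

In the first document, "add" receives the largest weight because it occurs in only one document, while "search", which occurs in every document, receives the smallest non-zero weight.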


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import GaussianNB

# Calculate tf-idf scores for the words in the training set.
tfIdfTransformer = TfidfTransformer(use_idf=True)
countVectorizer = CountVectorizer()
wordCount_train = countVectorizer.fit_transform(train_df.processed_text)
newTfIdf_train = tfIdfTransformer.fit_transform(wordCount_train)

wordCount_test = countVectorizer.transform(test_df.processed_text)
newTfIdf_test = tfIdfTransformer.transform(wordCount_test)

# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# GaussianNB requires dense input, so the sparse matrices are converted to arrays.
clf = GaussianNB()
clf.fit(newTfIdf_train.toarray(), train_labels)

# Predict on the test set.
predictions = clf.predict(newTfIdf_test.toarray())

def accuracy(labels, predictions):

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1 
  
  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

0.5544117647058824

---

## 3: Evaluation Metrics 




---

> Standard accuracy is the ratio of correct classifications to the total number of classifications made. It is a useful measure when every class has roughly the same number of samples, but it can be misleading on an imbalanced dataset: a classifier that always predicts the majority class scores a high accuracy without ever finding a positive example. A model can therefore have a higher accuracy than another and still perform worse on the class we care about.
Other evaluation metrics address this. Precision measures what fraction of the cases predicted to be positive really are positive. Recall measures what fraction of the real positive cases are correctly identified. Finally, the F1 score balances the two: it is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$.

---
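
The point about imbalanced data can be checked on a small made-up example. The sketch below is illustrative only: the helper names `precision_recall_f1` and `toy_accuracy`, the toy labels, and the convention of scoring an undefined ratio as 0.0 are all assumptions rather than part of the assignment.

```python
def precision_recall_f1(labels, predictions):
    """Return (precision, recall, F1) for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    # Assumed convention: an undefined ratio (zero denominator) counts as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def toy_accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)

# Toy data: 5 suggestions (label 1) among 100 texts.
labels = [1] * 5 + [0] * 95
always_negative = [0] * 100
# Finds 4 of the 5 suggestions but raises 10 false alarms.
imperfect = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

for name, preds in [("always_negative", always_negative), ("imperfect", imperfect)]:
    acc = toy_accuracy(labels, preds)
    _, _, f1 = precision_recall_f1(labels, preds)
    print(f"{name}: accuracy={acc:.2f}, F1={f1:.2f}")
```

The always-negative predictor reaches an accuracy of 0.95 with an F1 of 0.00, whereas the imperfect predictor has a lower accuracy of 0.89 but an F1 of about 0.42, because it actually finds most of the suggestions.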

In the code cell below, write an implementation of the evaluation metric you defined above. Please write your own implementation from scratch.

In [6]:
def evaluate(labels, predictions):
  """Return the F1 score of the positive class (label 1)."""

  # Check that labels and predictions are of the same length.
  assert len(labels) == len(predictions)

  # Count the confusion-matrix entries needed for the F1 score
  # (true negatives are not needed).
  tp = fp = fn = 0
  for label, prediction in zip(labels, predictions):
    if label == 1 and prediction == 1:
      tp += 1
    elif label == 0 and prediction == 1:
      fp += 1
    elif label == 1 and prediction == 0:
      fn += 1

  # Guard against division by zero when there are no positive
  # predictions or no positive labels.
  precision = tp / (tp + fp) if tp + fp > 0 else 0.0
  recall = tp / (tp + fn) if tp + fn > 0 else 0.0
  if precision + recall == 0:
    return 0.0

  return 2 * precision * recall / (precision + recall)

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 2 using tf-idf features.
evaluate(test_labels, predictions)

0.3964143426294821

---

## 4: Feature Engineering (II) - Other features



---

> We can add simple surface features that are cheap to compute. Examples are the number of words and the number of characters in each sentence; text length is frequently found to be a relevant signal in text datasets. We also count the number of distinct words, which shows whether a data point contains repeated words. Finally, we count the punctuation marks, using the original (unprocessed) text because preprocessing removes them.

---
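
As a small worked example of these features, the sketch below computes them for one sentence. The function name `surface_features` and the example strings are hypothetical, and for brevity it sums all punctuation marks into one total, whereas the notebook keeps a separate count per mark.

```python
import string

def surface_features(raw_text, processed_text):
    """Length-based features from the processed text, punctuation from the raw text."""
    words = processed_text.split()
    return {
        "char_count": len(processed_text),
        "word_count": len(words),
        "unique_count": len(set(words)),
        # Simplification: one total instead of one column per punctuation mark.
        "punct_total": sum(raw_text.count(mark) for mark in string.punctuation),
    }

raw = "Please, please add an undo button!"
processed = "please please add undo button"  # assumed output of preprocessing
print(surface_features(raw, processed))
```

Here the word count is 5 but the unique-word count is 4, which exposes the repeated "please", and the two punctuation marks are only visible in the raw text.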

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create your features.

# count number of characters 
def count_chars(text):
    return len(text)

# count number of words 
def count_words(text):
    return len(text.split())

# count number of punctuations
def count_punctuations(text):
    punctuations='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    d=dict()
    for i in punctuations:
        d[str(i)+' count']=text.count(i)
    return d

# count number of unique words 
def count_unique_words(text):
    return len(set(text.split()))
            
#applying the features to the train and test dataset
train_df['char_count'] = train_df["processed_text"].apply(lambda x:count_chars(x))
train_df['word_count'] = train_df["processed_text"].apply(lambda x:count_words(x))
train_df['punct_count'] = train_df["text"].apply(lambda x:count_punctuations(x))
train_df['unique_count'] = train_df["processed_text"].apply(lambda x:count_unique_words(x))

test_df['char_count'] = test_df["processed_text"].apply(lambda x:count_chars(x))
test_df['word_count'] = test_df["processed_text"].apply(lambda x:count_words(x))
test_df['punct_count'] = test_df["text"].apply(lambda x:count_punctuations(x))
test_df['unique_count'] = test_df["processed_text"].apply(lambda x:count_unique_words(x))

# Create separate columns for each punctuation mark in the train and test datasets.
df_punct= pd.DataFrame(list(train_df.punct_count))
test_punct= pd.DataFrame(list(test_df.punct_count))
train_df=pd.merge(train_df,df_punct,left_index=True, right_index=True)
test_df=pd.merge(test_df,test_punct,left_index=True, right_index=True)
train_df.drop(columns=['punct_count'],inplace=True)
test_df.drop(columns=['punct_count'],inplace=True)

vectorizer =  TfidfVectorizer()
train_tf_idf_features =  vectorizer.fit_transform(train_df['processed_text']).toarray()
test_tf_idf_features  =  vectorizer.transform(test_df['processed_text']).toarray()

train_tf_idf = pd.DataFrame(train_tf_idf_features) 
test_tf_idf = pd.DataFrame(test_tf_idf_features)

train_Y = train_df['label']
test_Y = test_df['label']

# Listing all features
features = ['char_count', 'word_count','unique_count','! count', '" count', '# count', '$ count',
       '% count', '& count', '\' count', '( count', ') count', '* count',
       '+ count', ', count', '- count', '. count', '/ count', ': count',
       '; count', '< count', '= count', '> count', '? count', '@ count',
       '[ count', '\\ count', '] count', '^ count', '_ count', '` count',
       '{ count', '| count', '} count', '~ count']
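# Finally, merge all features with the TF-IDF features above.
train = pd.merge(train_tf_idf,train_df[features],left_index=True, right_index=True)
test  = pd.merge(test_tf_idf,test_df[features],left_index=True, right_index=True)
# The TF-IDF columns have integer names; recent scikit-learn versions
# reject a mix of integer and string column names, so use strings throughout.
train.columns = train.columns.astype(str)
test.columns = test.columns.astype(str)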


# Train a Naïve Bayes classifier using the features you defined.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(train, train_Y)
predictions = clf.predict(test)

# Evaluate on the test set.
evaluate(test_Y, predictions)

0.471161657189277

---

## 5: Kaggle Competition 

Head over to the competition page at https://www.kaggle.com/t/1f90b74da0b7484da9647638e22d106 and use the classifier above to predict the labels for test_unseen.csv, then upload the results to the leaderboard. The current baseline score is 0.36823; the goal is to improve on it. Please note that the evaluation metric for the competition is the F-score.

In [6]:
# Preparing submission for Kaggle
StudentID = "22224253_Borole" # Please add your student id and lastname
test_unseen = pd.read_csv("test_unseen.csv", names=['id', 'text'], header=0)

unseen = test_unseen['text']
processed_texts = preprocess(unseen)  
test_unseen['processed_text'] = processed_texts
test_unseen['char_count'] = test_unseen["processed_text"].apply(lambda x:count_chars(x))
test_unseen['word_count'] = test_unseen["processed_text"].apply(lambda x:count_words(x))
test_unseen['punct_count'] = test_unseen["text"].apply(lambda x:count_punctuations(x))
test_unseen['unique_count'] = test_unseen["processed_text"].apply(lambda x:count_unique_words(x))

testunseen_punct= pd.DataFrame(list(test_unseen.punct_count))
test_unseen=pd.merge(test_unseen,testunseen_punct,left_index=True, right_index=True)
test_unseen.drop(columns=['punct_count'],inplace=True)

test_unseen_tfidf_features  =  vectorizer.transform(test_unseen['processed_text']).toarray()
test_unseen_tf_idf = pd.DataFrame(test_unseen_tfidf_features)

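t_unseen = pd.merge(test_unseen_tf_idf,test_unseen[features],left_index=True, right_index=True)
# Use string column names throughout; recent scikit-learn versions
# reject a mix of integer and string column names.
t_unseen.columns = t_unseen.columns.astype(str)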

pred = clf.predict(t_unseen)
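
The cell above stops at the predictions. One possible way to turn them into an uploadable file is sketched below with the standard `csv` module. The helper name `write_submission`, the column names `id` and `label`, and the file name are assumptions, since the submission format required by the competition is not shown here; check the competition page before using it.

```python
import csv

def write_submission(ids, predictions, path):
    """Write one (id, label) row per prediction, preceded by a header row."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "label"])  # hypothetical column names
        for row_id, label in zip(ids, predictions):
            writer.writerow([row_id, int(label)])

# Toy values standing in for test_unseen["id"] and pred.
write_submission([101, 102, 103], [1, 0, 1], "submission_example.csv")
```

In the notebook, the same call would presumably take `test_unseen["id"]` and `pred` as its first two arguments.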


---

> For the unseen test set, I preprocessed the original text with the user-defined `preprocess` function defined earlier and stored the result as a new column. From the processed text I then extracted the word count, character count and unique-word count, and from the original text the punctuation counts. I weighted the processed text with the TF-IDF vectorizer fitted on the training data, merged the result with these features, and predicted the labels with scikit-learn's MultinomialNB classifier. On the unseen test set, my mean F-score is 0.56529.
This is higher than the specified baseline score of 0.36823. The improvement could be attributed to the user-defined features as well as to the preprocessing of the textual data.

---