# NLP M3 - Tweet Sentiment Extraction

* Muskan Gupta - J016
* Harsh Gupta - J017

Approach:
We use a transformer model to extract the part of each tweet that supports its sentiment label. The data is converted to NumPy arrays so that each row can be iterated over and written to the JSON format expected by the QuestionAnsweringModel from Simple Transformers.

In [None]:
import numpy as np 
import pandas as pd 
import json


train_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
test_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')
sub_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/sample_submission.csv')

train = np.array(train_df)
test = np.array(test_df)

!mkdir -p data
use_cuda = True # whether to use GPU or not

In [None]:
train_df.head()

### Preprocessing

We formulate this task as an extractive question answering problem, in the style of SQuAD.
Given a question and a context, the model is trained to find the answer span in the context.

Therefore, we use the sentiment as the question, the text as the context, and selected_text as the answer.

* Question: sentiment
* Context: text
* Answer: selected_text

The QuestionAnsweringModel expects its input in a SQuAD-style JSON format. For each training row we take the tweet ID (line[0]), the tweet text (line[1]) as the context, the selected_text (line[2]) as the answer, and the sentiment (line[-1]) as the question, and write them out as JSON. The test set is converted the same way, except that the answer field holds a placeholder because selected_text is not available there.
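As a rough illustration of the record layout described above, the sketch below builds one entry for a made-up tweet. The helper name `make_qa_record` and the sample values are hypothetical. Unlike the notebook cell, it searches for the answer in the already lowercased text, which is a small variation and not the notebook's exact method.

```python
import json

def make_qa_record(text_id, text, selected_text, sentiment):
    """Build one SQuAD-style record (hypothetical helper for illustration)."""
    context = text.lower()
    answer = selected_text.lower()
    # Character offset of the first occurrence of the answer in the context.
    start = context.find(answer)
    return {
        "context": context,
        "qas": [{
            "question": sentiment,   # the sentiment label acts as the question
            "id": text_id,
            "is_impossible": False,
            "answers": [{"answer_start": start, "text": answer}],
        }],
    }

# Made-up example row, not taken from the competition data.
record = make_qa_record("abc123", "I really LOVE this song", "LOVE this", "positive")
print(json.dumps(record, indent=2))
```

The full training file is simply a JSON list of such records, one per tweet.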

In [None]:
def find_all(input_str, search_str):
    l1 = []
    length = len(input_str)
    index = 0
    while index < length:
        i = input_str.find(search_str, index)
        if i == -1:
            return l1
        l1.append(i)
        index = i + 1
    return l1

def do_qa_train(train):

    output = []
    for line in train:
        context = line[1]

        qas = []
        question = line[-1]
        qid = line[0]
        answers = []
        answer = line[2]
        # Skip rows with missing values (NaN is read as float, not str).
        if not all(isinstance(v, str) for v in (answer, context, question)):
            print(context, type(context))
            print(answer, type(answer))
            print(question, type(question))
            continue
        # Keep only the first occurrence of the answer in the context.
        answer_starts = find_all(context, answer)
        if answer_starts:
            answers.append({'answer_start': answer_starts[0], 'text': answer.lower()})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})

        output.append({'context': context.lower(), 'qas': qas})
        
    return output

qa_train = do_qa_train(train)

with open('data/train.json', 'w') as outfile:
    json.dump(qa_train, outfile)
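Before training, it may be worth checking that every stored `answer_start` really points at the answer text inside its context. The function below is a minimal sketch of such a check. The name `find_bad_offsets` is hypothetical, and the toy records stand in for `qa_train`.

```python
def find_bad_offsets(records):
    """Return ids of questions whose answers are missing or misaligned."""
    bad_ids = []
    for record in records:
        context = record["context"]
        for qa in record["qas"]:
            if not qa["answers"]:
                # No answer span was found for this question.
                bad_ids.append(qa["id"])
                continue
            for ans in qa["answers"]:
                start = ans["answer_start"]
                # The slice at the stored offset must equal the answer text.
                if context[start:start + len(ans["text"])] != ans["text"]:
                    bad_ids.append(qa["id"])
    return bad_ids

# Toy records (made up): the second one has a wrong offset on purpose.
toy_records = [
    {"context": "so happy today",
     "qas": [{"id": "a", "answers": [{"answer_start": 3, "text": "happy"}]}]},
    {"context": "so sad today",
     "qas": [{"id": "b", "answers": [{"answer_start": 0, "text": "sad"}]}]},
]
print(find_bad_offsets(toy_records))
```

In the notebook, the same check could be run as `find_bad_offsets(qa_train)`. An empty list would mean all offsets line up.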

In [None]:
def do_qa_test(test):
    output = []
    for line in test:
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        # Skip rows with missing values (NaN is read as float, not str).
        if type(context) != str or type(question) != str:
            print(context, type(context))
            print(question, type(question))
            continue
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        output.append({'context': context.lower(), 'qas': qas})
    return output

qa_test = do_qa_test(test)

with open('data/test.json', 'w') as outfile:
    json.dump(qa_test, outfile)

We use the QuestionAnsweringModel from Simple Transformers. The approach follows SQuAD (the Stanford Question Answering Dataset).

In [None]:
!pip install '/kaggle/input/simple-transformers-pypi/seqeval-0.0.12-py3-none-any.whl' -q
!pip install '/kaggle/input/simple-transformers-pypi/simpletransformers-0.22.1-py3-none-any.whl' -q

### Train model

We fine-tune the question answering model starting from the pretrained 'distilbert-base-uncased-distilled-squad' checkpoint, since it was the only one available as an offline dataset and could therefore be used without internet access.

Initially we train for only 3 epochs to get a baseline score for the first commit.

In [None]:
from simpletransformers.question_answering import QuestionAnsweringModel

MODEL_PATH = '/kaggle/input/transformers-pretrained-distilbert/distilbert-base-uncased-distilled-squad/'

model = QuestionAnsweringModel('distilbert', 
                               MODEL_PATH, 
                               args={'reprocess_input_data': True,
                                     'overwrite_output_dir': True,
                                     'learning_rate': 5e-5,
                                     'num_train_epochs': 3,
                                     'max_seq_length': 196,
                                     'doc_stride': 64,
                                     'fp16': False,
                                    },
                              use_cuda=use_cuda)

model.train_model('data/train.json')

### Submission

Predictions are saved to a .csv file in the format Kaggle requires for submissions.

In [None]:
predictions = model.predict(qa_test)
predictions_df = pd.DataFrame.from_dict(predictions)

sub_df['selected_text'] = predictions_df['answer']

sub_df.to_csv('submission.csv', index=False)

print("Submission file saved as submission.csv.")
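The cell above assigns answers to the submission by row position, which only works if no test row was skipped and the order is unchanged. A more defensive option is to match answers by tweet ID. The sketch below assumes each prediction is a dict with `id` and `answer` keys, as the `predictions_df['answer']` lookup suggests. The helper name `align_predictions` and the toy data are hypothetical.

```python
def align_predictions(ids, predictions, default=""):
    """Return one answer per id, in the order given by `ids`."""
    answer_by_id = {p["id"]: p["answer"] for p in predictions}
    # Ids without a prediction fall back to `default`.
    return [answer_by_id.get(i, default) for i in ids]

# Toy data (made up): predictions arrive out of order and one id is missing.
toy_ids = ["t1", "t2", "t3"]
toy_predictions = [
    {"id": "t3", "answer": "so sad"},
    {"id": "t1", "answer": "love it"},
]
aligned = align_predictions(toy_ids, toy_predictions)
print(aligned)
```

In the notebook this could be applied as `sub_df['selected_text'] = align_predictions(sub_df['textID'], predictions)`, assuming the sample submission has a `textID` column.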