### Natural Language Processing - M3
### Maaz Ansari (J002), Riddhi Mehta (J030), Husain Ghadiali (J056)

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np

### Reading train, test and sample submission data

In [None]:
train = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
test = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')
sample = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/sample_submission.csv')

### Viewing the data

In [None]:
train.head()

In [None]:
test.head()

In [None]:
sample.head()

In [None]:
train.shape, test.shape, sample.shape

###### In this notebook, we build and train a model to predict on the test data. We plan to use a question-answering model, which requires the data to be converted to JSON format.

It is important to note that this kernel does not allow internet access.
So, to use the QuestionAnsweringModel from the simpletransformers library, we first need to add the following datasets to our input folder:

* https://www.kaggle.com/jonathanbesomi/simple-transformers-pypi
* https://www.kaggle.com/jonathanbesomi/transformers-pretrained-distilbert

### Pre-processing of data

In [None]:
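import re  # needed by the helpers below; imported here so they work when called
def remove_URL(text):
    # Strip http(s) URLs and bare www. links
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)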

In [None]:
def remove_html(text):
    # Strip anything that looks like an HTML tag
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

In [None]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub(' ', text)
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

These are a few helper functions that we wrote to clean the data. However, after studying the kind of input that the question-answering model requires, we do not use them for pre-processing. They remain useful for cleaning noisy text and can be used when exploring other models for this problem statement.
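A minimal sketch of why this kind of cleaning conflicts with span extraction. It assumes a tiny hypothetical stop-word set in place of the NLTK list (so it runs without downloads) and an invented tweet; the helper name `clean_text_sketch` is also hypothetical. Once the text is cleaned, the selected phrase may no longer be a substring of it, so no answer start index can be computed.

```python
import re

# Hypothetical stand-in for the NLTK English stop-word list (assumption, kept tiny).
STOP_WORDS = {"i", "am", "so", "the", "a", "is"}
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")


def clean_text_sketch(text):
    """Lowercase, replace unwanted symbols with spaces, and drop stop words."""
    text = text.lower()
    text = BAD_SYMBOLS_RE.sub(" ", text)
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


# Invented example row: the selected phrase is an exact substring of the raw text.
raw_text = "I am so happy today!!!"
selected_text = "so happy"

cleaned = clean_text_sketch(raw_text)

found_in_raw = raw_text.find(selected_text)      # start index in the untouched text
found_in_cleaned = cleaned.find(selected_text)   # -1 means the phrase was not found

print(repr(cleaned), found_in_raw, found_in_cleaned)
```

In this invented case the stop word "so" is removed, so the phrase can be located in the raw text but not in the cleaned one.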

### Converting the data into array format

In [None]:
train = np.array(train)
test = np.array(test)

In [None]:
train[:3]

After converting the data to array format, we display the first three rows. The result is a 2-D NumPy array, i.e. an array of rows. Let's focus on the first row. Its values correspond to these columns, in order:
* textID
* text
* selected_text
* sentiment

In [None]:
test[:3]

Similarly, the test data has only these columns:
* textID
* text
* sentiment

##### This means that, for each tweet in the test data, we have to select the phrase from the text that expresses the given sentiment. To bring the data into the required format, we define the following three important functions.

### Search function

In [None]:
def search(input_string, search_string):
    start_index = []
    length = len(input_string)
    index = 0
    while index < length:
        i = input_string.find(search_string, index)
        if i == -1:
            return start_index
        start_index.append(i)
        index = i + 1
    return start_index

In [None]:
## For example, we have:

search("hello I am having a good day today, what about you?", "good day today")

# This returns a list of the zero-based character indices at which search_string starts.
# In this case, it is [20], i.e. the match starts at index 20.
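As a further illustration, the sketch below restates the same search logic in self-contained form and runs it on invented strings where the phrase occurs more than once, does not occur at all, or overlaps itself. The input strings are assumptions made up for the example.

```python
def search(input_string, search_string):
    """Return every start index of search_string in input_string (overlaps included)."""
    start_index = []
    index = 0
    while index < len(input_string):
        i = input_string.find(search_string, index)
        if i == -1:
            break
        start_index.append(i)
        index = i + 1  # step one character forward so overlapping matches are found
    return start_index


repeated = search("good morning, have a good day", "good")
missing = search("good morning", "night")
overlapping = search("aaaa", "aa")

print(repeated, missing, overlapping)
```

When a phrase appears several times, the list holds every start index; an empty list means the phrase is not present.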

### Converting train into json format

In [None]:
def convert_train_to_json(train_set): 
    
    outer_list = []
    
    for row in train_set:
        qid = row[0]        # textID column value
        context = row[1]    # text column value
        answer = row[2]     # selected_text column value
        question = row[-1]  # sentiment column value

                             # The sentiment acts as the "question": given the sentiment,
                             # the model predicts what the selected_text should be.
        inner_list = []              
        answers = []
        
        # Rows with missing values hold non-string entries (e.g. NaN), which would make
        # the string operations below raise an error, so we skip such rows with CONTINUE.
        
        if not isinstance(context, str) or not isinstance(answer, str) or not isinstance(question, str):
            continue
        answer_starts = search(context, answer)
        # Keep only the first occurrence of the answer in the context
        for answer_start in answer_starts:
            answers.append({'answer_start': answer_start, 'text': answer.lower()})
            break
        inner_list.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})

        outer_list.append({'context': context.lower(), 'qas': inner_list})
        
    return outer_list

In [None]:
train = convert_train_to_json(train)

In [None]:
len(train)

In [None]:
train[:3]

So, this is what the JSON format should look like.

Each element of the outer list has:
* context
* qas (the inner list)

Each element of the inner list has:
* question
* id
* is_impossible - which is always False
* answers

Each element of answers has:
* answer_start 
* text (the selected_text value)
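The structure described above can be sketched as a single record. The tweet, sentiment and id below are invented for illustration and are not taken from the dataset; the point is that `answer_start` lets the answer be sliced back out of the context.

```python
import json

# Invented example values (assumptions, not rows from train.csv).
context = "i am so happy today"
answer_text = "so happy"

record = {
    "context": context,
    "qas": [
        {
            "question": "positive",      # the sentiment plays the role of the question
            "id": "example-id-0",        # hypothetical textID
            "is_impossible": False,
            "answers": [
                {"answer_start": context.find(answer_text), "text": answer_text}
            ],
        }
    ],
}

# Slice the answer back out of the context using answer_start.
answer = record["qas"][0]["answers"][0]
start = answer["answer_start"]
recovered = record["context"][start:start + len(answer["text"])]

print(json.dumps(record, indent=2))
print(recovered)
```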

### Converting test into json format

In [None]:
def convert_test_to_json(test_set):
    
    outer_list = []
    
    for row in test_set:
        
        qid = row[0]
        context = row[1]
        question = row[-1]
        inner_list = []
                
        if type(context) != str or type(question) != str:
            continue
            
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'}) # Placeholder values; the true answer is unknown for test data
        inner_list.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        outer_list.append({'context': context.lower(), 'qas': inner_list})
    return outer_list

In [None]:
test = convert_test_to_json(test)

In [None]:
len(test)

In [None]:
test[:3]

##### The JSON format was already explained above for the train set. The same explanation applies to the test JSON, except that answers holds placeholder values.

### Dumping the json structure of train and test into .json files

In [None]:
import os
import json

os.makedirs('data', exist_ok = True)

with open('data/train.json', 'w') as f:
    json.dump(train, f)  # the with block closes the file automatically

with open('data/test.json', 'w') as f:
    json.dump(test, f)
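A quick way to check such a dump is to read the file back and compare it with the in-memory structure. The sketch below does this with one invented record and a temporary directory, so it does not touch the real `data` folder; the record contents are assumptions.

```python
import json
import os
import tempfile

# One invented record in the same nested format (assumption, not real data).
records = [
    {
        "context": "hypothetical tweet text",
        "qas": [
            {
                "question": "neutral",
                "id": "example-id-0",
                "is_impossible": False,
                "answers": [{"answer_start": 0, "text": "hypothetical"}],
            }
        ],
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.json")
    with open(path, "w") as f:
        json.dump(records, f)
    with open(path) as f:
        reloaded = json.load(f)

# The structure uses only JSON-native types, so it should survive unchanged.
round_trip_ok = reloaded == records
print(round_trip_ok)
```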

### Installing the required packages from the wheel files which we added using the "Add data" option in the Kaggle kernel, using the URLs mentioned at the start of this kernel.

In [None]:
!pip install '/kaggle/input/simple-transformers-pypi/seqeval-0.0.12-py3-none-any.whl' -q
!pip install '/kaggle/input/simple-transformers-pypi/simpletransformers-0.22.1-py3-none-any.whl' -q

In [None]:
from simpletransformers.question_answering import QuestionAnsweringModel

In [None]:
MODEL = '/kaggle/input/transformers-pretrained-distilbert/distilbert-base-uncased-distilled-squad/'

model = QuestionAnsweringModel('distilbert',  
                               MODEL,
                               args={'reprocess_input_data': True,
                                     'overwrite_output_dir': True,
                                     'learning_rate': 5e-5,
                                     'num_train_epochs': 2,
                                     'max_seq_length': 192,
                                     'doc_stride': 64,
                                     'fp16': False
                                    }, 
                               use_cuda=True
                              )

In [None]:
model.train_model('data/train.json')

### Predicting the selected_text of the test set using the weights of the above model

In [None]:
pred_df = model.predict(test)
pred_df = pd.DataFrame.from_dict(pred_df)

In [None]:
pred_df.head()

We observe that the columns are named after the dictionary keys. Hence, we copy the answer column into the selected_text column required by sample_submission.csv.

In [None]:
sample["selected_text"] = pred_df["answer"]
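The assignment above pairs rows by their row index, which is correct only if no test row was skipped during the JSON conversion. A more defensive sketch matches predictions to submission rows by id instead. It uses plain Python and invented values, and it assumes each prediction carries an `id` key next to `answer`; the empty-string fallback for missing ids is also an illustrative choice, not something the notebook does.

```python
# Invented predictions: the row with id "b2" is missing, as if it had been skipped.
predictions = [
    {"id": "a1", "answer": "so happy"},
    {"id": "c3", "answer": "sad"},
]

# Hypothetical textID values in submission order.
submission_ids = ["a1", "b2", "c3"]

# Build a lookup from id to predicted answer.
answer_by_id = {p["id"]: p["answer"] for p in predictions}

# Fill each submission row by id, falling back to an empty string when no prediction exists.
submission_rows = [
    {"textID": text_id, "selected_text": answer_by_id.get(text_id, "")}
    for text_id in submission_ids
]

for row in submission_rows:
    print(row)
```

With positional pairing, the answer for "c3" would have landed on row "b2" in this invented case; matching by id keeps each answer with its own tweet.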

In [None]:
sample.to_csv('submission.csv', index=False)

#### Finally, we save our submission file in the required format. Now we are ready to obtain a score!

In [None]:
print("Everything is successful! Good Luck for the score!")

# -----------THE END-----------