<a href="https://colab.research.google.com/github/salevizo/nbg/blob/master/NBG_Race_Passage_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NBG Race Passage Retrieval Baseline Model
This Colab notebook is an easy way to run our baseline model and get a clearer idea of how and where to start. The model serves as a reference for handling the inputs and outputs, so you can get started quickly without losing too much time on data manipulation issues. A large part of this code can be reused unchanged in the solution you submit.

## Philosophy behind the Baseline Model
We have created a supervised approach to this passage retrieval task that achieves decent results by applying a heuristic technique. The core idea is that legal texts and court decisions follow a strict writing format, so one can expect many structural similarities among documents that share the same topic. If someone had the answers to some questions asked against some documents, they could use them to learn a pattern: a general structure and some keywords that accompany each answer.

Our model uses known answers (Answers200.json) to questions (questions.csv) over documents (Documents200.json) to create its own augmented question, which replaces the original one. This augmented question consists of all the unique words found in the answers to that question across all the documents used for training.
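As a rough illustration of this augmentation step, the sketch below collects the unique words of several known answers into one keyword list per question. The helper names (`simple_tokens`, `augment_questions`), the toy English answers and the question id are assumptions made for the example; the notebook itself does this with `text_processing` and `create_unique_keywords_list` on the Greek data.

```python
import re

def simple_tokens(text):
    # hypothetical stand-in for the notebook's text_processing: lowercase, keep letters only
    return re.sub(r"[^a-z]", " ", text.lower()).split()

def augment_questions(answers_by_question):
    # map each question id to the unique words found in all of its known answers,
    # keeping the order of first appearance
    augmented = {}
    for question_id, answers in answers_by_question.items():
        tokens = []
        for answer in answers:
            tokens.extend(simple_tokens(answer))
        augmented[question_id] = list(dict.fromkeys(tokens))
    return augmented

# toy data: two known answers for one made-up question id
known_answers = {
    "CREDITOR_BANK": ["The creditor bank is Alpha.", "Creditor: the bank Beta."],
}
augmented_questions = augment_questions(known_answers)
print(augmented_questions["CREDITOR_BANK"])
```

The resulting word list plays the role of the question when sentence similarities are computed.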

The model iterates over all the documents in the test dataframe and tokenizes each document into sentences. Then, for every question in the questions.csv file, it calculates the question's vector (the mean of its word vectors) and iterates through the trained questions to find the one most similar to the original, by calculating the cosine similarity between them.

> *In the baseline model that we provide you with, this step isn't really necessary because all questions take part in the training. We have this mechanism of finding the most similar question from the training ones because in the future we will test the models you will submit against paraphrased questions. Keep that in mind when you develop your solution.*
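The matching of a (possibly paraphrased) question to its closest trained question can be sketched as follows. This is a minimal, dependency-free illustration: the two-dimensional `toy_vectors`, the question ids and the function names are invented for the example, whereas the notebook uses the provided word2vec vectors with `calc_mean_vector` and `cosine_similarity`.

```python
import math

def mean_vector(vectors, words):
    # average the vectors of the in-vocabulary words; None if no word is known
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(component) / len(known) for component in zip(*known)]

def cosine(v1, v2):
    # dot product divided by the product of the two vector norms
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def most_similar_question(vectors, query_words, trained_questions):
    # return the id of the trained question whose mean vector is closest to the query
    query_vec = mean_vector(vectors, query_words)
    best_id, best_sim = None, float("-inf")
    for question_id, words in trained_questions.items():
        candidate_vec = mean_vector(vectors, words)
        if query_vec is None or candidate_vec is None:
            continue
        sim = cosine(query_vec, candidate_vec)
        if sim > best_sim:
            best_id, best_sim = question_id, sim
    return best_id

# toy embeddings and trained questions (assumptions for illustration only)
toy_vectors = {"bank": [1.0, 0.0], "creditor": [0.9, 0.1],
               "family": [0.0, 1.0], "children": [0.1, 0.9]}
trained = {"BANK_QUESTION": ["creditor", "bank"], "FAMILY_QUESTION": ["family", "children"]}
match = most_similar_question(toy_vectors, ["which", "bank"], trained)
print(match)
```

Out-of-vocabulary words such as "which" are simply ignored, which is why a paraphrase can still land on the right trained question as long as it shares some informative vocabulary.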

Once the model has a question vector, it calculates the cosine similarity between that vector and the vector of every sentence in the document. The results are saved in a list of dictionaries, which we use to rank the sentences by similarity. Finally, we take the top 4 most similar sentences to create the passage that serves as the model's estimated passage. These passages are saved, along with the document id and the question id, in a dataframe that at the end of the execution becomes our estimated_answers.csv output file. This file, along with Answers200.json, is fed to the **calculate_answers_score** function of the **answers_evaluation** Python script that you can find in the GitHub project. The console prints the average F1 score over all questions, the F1 score per question, the Critical Success Index and the Overall Score.
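The ranking step described above can be sketched in a few lines. The function name `top_n_passage` and the toy sentences and scores are assumptions for illustration; in the notebook the scores come from the cosine similarity between the question vector and each sentence vector.

```python
def top_n_passage(sentences, similarities, top_n=4):
    # pair each sentence with its score, sort by score (highest first), keep the top n
    ranked = sorted(zip(sentences, similarities), key=lambda pair: pair[1], reverse=True)
    # slicing tolerates documents that have fewer than top_n sentences
    return " ".join(sentence for sentence, _ in ranked[:top_n])

sentences = ["First.", "Second.", "Third."]
scores = [0.2, 0.9, 0.5]
passage = top_n_passage(sentences, scores, top_n=2)
print(passage)
```

Note that the sentences are joined in order of similarity, not in the order they appear in the document.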


### Some of the model's limitations
*   This model doesn't consider whether a question has an answer or not. It answers all the questions, resulting in a worse CSI score.
*   This model doesn't have any 'smart' mechanism for deciding which sentences form a passage; it just takes the n sentences in the document that are most similar to the given question.
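One possible direction for addressing both limitations, shown only as a sketch and not as part of the baseline: abstain when even the best sentence is a weak match, and put the selected sentences back in document order. The function `build_passage`, the `min_similarity` threshold of 0.5 and the toy data are all assumptions; whether an empty passage is the right way to signal "no answer" depends on the evaluation script, which this sketch does not check.

```python
def build_passage(scored_sentences, top_n=4, min_similarity=0.5):
    # scored_sentences: list of (sentence_index, sentence, similarity) tuples
    ranked = sorted(scored_sentences, key=lambda item: item[2], reverse=True)
    # abstain when the best sentence is below the (hypothetical) threshold
    if not ranked or ranked[0][2] < min_similarity:
        return ""
    # keep the top n, then restore document order so the passage reads naturally
    selected = sorted(ranked[:top_n], key=lambda item: item[0])
    return " ".join(sentence for _, sentence, _ in selected)

scored = [(0, "Intro.", 0.10), (1, "The debt is settled.", 0.80), (2, "In 120 instalments.", 0.90)]
answer = build_passage(scored, top_n=2)
no_answer = build_passage([(0, "Intro.", 0.10)], top_n=2)
print(repr(answer), repr(no_answer))
```

A suitable threshold would have to be tuned on the annotated answers rather than fixed by hand.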

## Conclusion
Feel free to play around and tweak parameters to see immediate performance changes, or train your own vectors to use.
By providing this model we do not want to restrict your approach in any way. Think outside the box and show us different ways to achieve better results with respect to the desired output.

Find the code in the Github repo: https://github.com/myNBGcode/NBG-Race-passage-retrieval.git

In [0]:
#Clone NBG-Race-passage-retrieval Github project
from getpass import getpass
import urllib.parse  # importing urllib alone does not guarantee that urllib.parse is loaded

user = input('User name: ')
password = getpass('Password: ')
password = urllib.parse.quote(password) # your password is converted into url format
AUTHENTICATION = '{}:{}'.format(user, password)

!git clone https://$AUTHENTICATION@github.com/myNBGcode/NBG-Race-passage-retrieval.git
AUTHENTICATION = '' # removing authentication variable

#Get into project's directory and view its content
%cd NBG-Race-passage-retrieval
%ls

User name: ZisisFl
Password: ··········
Cloning into 'NBG-Race-passage-retrieval'...
remote: Enumerating objects: 19, done.
remote: Total 19 (delta 0), reused 0 (delta 0), pack-reused 19
Unpacking objects: 100% (19/19), done.
Checking out files: 100% (14/14), done.
/content/NBG-Race-passage-retrieval
answers_evaluation.py  passage_retrieval_supervised.py  ROUGE_N_algorithm.py
data/                  README.md


In [0]:
# import libraries
import gensim
import pandas as pd
import numpy as np
import os
import nltk
import re
import json
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json location is deprecated
from answers_evaluation import calculate_answers_score

### text_processing
This function processes text so that it is ready for word vector calculation. The same function was used as a preprocessing step when creating the word vectors we have provided you with. It relies on the `stop_words` list that is loaded in a later cell.

In [0]:

def text_processing(text):
    lower_case_text = text.lower()
    # remove non words
    clean = re.sub("[^Α-ΩΆΈΌΊΏΉΎα-ωάέόίώήύϊϋ]", " ", lower_case_text)
    split_to_tokens = clean.split()
    split_to_tokens = [w for w in split_to_tokens if w not in stop_words]
    return split_to_tokens

In [0]:
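def calc_mean_vector(word2vec_model, words):
    # remove out-of-vocabulary words
    # (.vocab is the gensim < 4.0 attribute; gensim 4.x renamed it to .key_to_index)
    words = [word for word in words if word in word2vec_model.vocab]
    if len(words) >= 1:
        return np.mean(word2vec_model[words], axis=0)
    # no known words: callers must handle a missing vector
    return None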

In [0]:
def cosine_similarity(vector1, vector2):
    return np.dot(vector1, vector2)/(np.linalg.norm(vector1)*np.linalg.norm(vector2))

In [0]:
def get_unique_words(text):
    # dict.fromkeys drops duplicates while keeping the order of first appearance
    return list(dict.fromkeys(text_processing(text)))

### create_unique_keywords_list
This function is used to augment the questions based on the answers given. It can also serve as a reference on how to manipulate the complex JSON file formats.
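To make the nested layout easier to picture, here is a small standard-library sketch that walks a toy record and concatenates the answer texts per label. The record shape (an `annotation` object holding `classificationResult` and `annotationResult`, each annotation with a `label` list and a `points` list) is inferred from the keys the function below reads and may not cover everything in the real files; the labels `Q1`/`Q2`, the class `CLASS_A` and the helper name are made up.

```python
import json

# toy record mimicking the fields the function reads; the real files may hold more keys
raw = json.loads("""
{"data": [
  {"annotation": {
     "classificationResult": [{"classes": ["CLASS_A"]}],
     "annotationResult": [
       {"label": ["Q1"], "points": [{"text": "first answer"}]},
       {"label": ["Q1", "Q2"], "points": [{"text": "second answer"}]}
     ]}}
]}
""")

def concat_answers_per_label(records):
    # join all annotated answer texts that share a label (question id),
    # putting the newest text first, as the notebook function does
    texts = {}
    for record in records:
        for annotation in record["annotation"]["annotationResult"]:
            text = annotation["points"][0]["text"]
            for label in annotation["label"]:
                texts[label] = text if label not in texts else text + " " + texts[label]
    return texts

answers_per_label = concat_answers_per_label(raw["data"])
print(answers_per_label)
```

The notebook reaches the same fields through `json_normalize`, which flattens the nested keys into dotted column names such as `annotation.annotationResult`.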

In [0]:
def create_unique_keywords_list(input_df):
    # init dataframe that will hold the concatenated corpus for each question
    concatenated_answers = pd.DataFrame(columns=['concatenated_answer'])

    for index, row in input_df.iterrows():
        # flatten the classification result to its first class label (plain text)
        # note: this writes to the notebook-level df, not to input_df, and the value is not used below
        df.loc[index, 'annotation.classificationResult'] = row['annotation.classificationResult'][0]['classes'][0]

        # generate a dataframe of concatenated answers on every different question found in the input_df
        normalized_df = json_normalize(row['annotation.annotationResult'])
        # distinct loop names so the outer index/row are not shadowed
        for _, annotation in normalized_df.iterrows():
            answer_text = annotation['points'][0]['text']
            for label in annotation['label']:
                # if this question category exists in the dataframe, prepend the new text to the old
                if label in concatenated_answers.index:
                    concatenated_answers.loc[label] = answer_text + ' ' + concatenated_answers.loc[label]
                # if this question category doesn't exist in the dataframe insert it
                else:
                    concatenated_answers.loc[label] = answer_text
    concatenated_answers['question_id'] = concatenated_answers.index

    # generate list of unique keywords from concatenated answers in order to be used as trained replacement questions
    list_of_question_keywords = []
    for index, row in concatenated_answers.iterrows():
        list_of_question_keywords.append(get_unique_words(row['concatenated_answer']))
    concatenated_answers['keywords_concat_answers'] = list_of_question_keywords

    return concatenated_answers


In [0]:
# load nltk greek tokenizer
nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/greek.pickle')


# load trained vectors
model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join('data', 'vectors', 'FEK_ABDG_100.bin'), binary=True)


# load greek stop words
stop_words = []
with open(os.path.join('data', 'stop_words.txt'), 'r', encoding='utf-8') as stop_words_file:
    for line in stop_words_file:
        # strip the newline; slicing with [:-1] would clip the last word if the file has no trailing newline
        stop_words.append(line.strip())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.




In [0]:
# create a dataframe of  questions
processed_questions = []
questions_df = pd.read_csv(os.path.join('data', 'Questions.csv'))
for index, row in questions_df.iterrows():
    # process the query in order to be ready for vector calculation
    query = text_processing(row['original_question'])
    processed_questions.append(query)
# append a new column to the questions dataframe that contains the queries processed
questions_df['processed_question'] = processed_questions

In [0]:
# init answers dataframe, in this file the model will write the estimated answers
answers_df = pd.DataFrame(columns=['document_id', 'question_id', 'estimated_passage'])

### Import Answers200.json and Documents200.json files
#### Answers200.json
It's time to import the file that contains the annotated answers we are going to use to augment our queries. These answers are held in the `train` variable.

#### Documents200.json
We also import the documents from which we will estimate the answer passages. We exclude the documents that were used for training, so the remaining documents make up the test set.

#### Important
Answers200.json and Documents200.json concern the same documents. This is why we drop the fraction of the dataframe used for training from the test dataframe. When you receive the Documents1000.json file you won't have to split into train and test, because Documents1000.json in its entirety will be the test set.
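The split logic can be illustrated without pandas. In this sketch the function `split_train_test`, the integer document ids and the fixed `seed` are assumptions for the example; the notebook performs the same idea with `DataFrame.sample(frac=0.25)` and `DataFrame.drop(train.index)`, which appears to rely on both files listing the documents in the same row order.

```python
import random

def split_train_test(document_ids, train_fraction=0.25, seed=0):
    # pick a random subset for training and keep the remaining ids for testing
    rng = random.Random(seed)
    train_size = round(len(document_ids) * train_fraction)
    train_ids = set(rng.sample(document_ids, train_size))
    test_ids = [doc_id for doc_id in document_ids if doc_id not in train_ids]
    return sorted(train_ids), test_ids

doc_ids = list(range(200))
train_ids, test_ids = split_train_test(doc_ids)
print(len(train_ids), len(test_ids))
```

With 200 documents and a 25% training fraction this leaves 150 documents for testing, matching the "out of 150" progress messages printed by the model loop.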

In [0]:
# read the json file with the annotated answers
with open(os.path.join('data', 'Answers200.json'), 'r', encoding='utf-8') as input_json:
    json_file = json.loads(input_json.read())

# normalize json file to dataframe
df = json_normalize(json_file['data'])

# use a random sample of the answers to create a train dataframe
train = df.sample(frac=0.25)

# read the json file that contains the documents
with open(os.path.join('data', 'Documents200.json'), 'r', encoding='utf-8') as input_json:
    json_file = json.loads(input_json.read())

test = json_normalize(json_file['data'])
# remove the documents that were used for training (this won't be necessary when you receive the Documents1000.json)
test = test.drop(train.index)

# CODE TO REPLACE FOR THE Documents1000.json file
# with open(os.path.join('data', 'Documents1000.json'), 'r', encoding='utf-8') as input_json:
#     json_file = json.loads(input_json.read())
#
# test = json_normalize(json_file['data'])
# # do not drop from test

### Question augmentation/training happens here

In [0]:
# augment questions from the known given answers
# we create a dataframe to replace every question included in the train dataframe with lists of unique words
# found in the answers of those questions
trained_questions_df = create_unique_keywords_list(train)

# merge the two dataframes to bring original question and processed question data to trained_questions_df
trained_questions_df = trained_questions_df.merge(questions_df, left_on='question_id', right_on='question_id')



### Here starts the actual model
This is the code block you will most likely need to change in order to get different results.

In [0]:
# init a counter
document_count = 0

# loop through documents
for index, document in test.iterrows():
    # tokenize to sentences the document
    raw_sentences = tokenizer.tokenize(document['content'])

    document_count = document_count + 1
    print('Processing document {} out of {}'.format(str(document_count), str(len(test.index))))

    # for every question
    for q_index, question in questions_df.iterrows():
        # if there is a trained version of an original question, use that instead;
        # this also applies to rephrased questions, so we can still leverage the trained ones

        # init question vector with the original question
        question_vector = calc_mean_vector(model, question['processed_question'])
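        # start below any possible cosine similarity so that a candidate is always selected,
        # even when every similarity is zero or negative
        best_sim = float('-inf')
        best_candidate = None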
        # for every question found in the trained ones
        for index2, proc_question in trained_questions_df.iterrows():
            # calculate the candidate similar question vector
            similar_question_vector = calc_mean_vector(model, proc_question['processed_question'])

            # calculate the cosine similarity of the two vectors
            similarity = cosine_similarity(question_vector, similar_question_vector)

            if similarity > best_sim:
                best_sim = similarity
                best_candidate = proc_question['keywords_concat_answers']

        # calculate the question vector using the best candidate / we assume here that a big enough training
        # set is used and all the question take part in the training
        question_vector = calc_mean_vector(model, best_candidate)

        # init list of results for the similarity of each doc's sentence with the question
        list_of_results = []
        # for every sentence in the document
        for i in range(len(raw_sentences)):
            # tokenize and process sentence
            tokenized_sentence = text_processing(raw_sentences[i])
            # calculate the mean vector of the sentence's tokens
            sentence_vector = calc_mean_vector(model, tokenized_sentence)

            # if a sentence's tokens exist in the vocabulary calculate cosine similarity
            if sentence_vector is not None:
                similarity = cosine_similarity(question_vector, sentence_vector)
                list_of_results.append(
                    {'sentence_index': i, 'original_sentence': raw_sentences[i], 'similarity': similarity})

        # sort by similarity the list of the results
        list_of_results = sorted(list_of_results, key=lambda k: k['similarity'], reverse=True)

        # create a passage from the top n ranked by similarity sentences
        estimated_passage = ''
        top_n = 4
        # slice instead of indexing so documents with fewer than top_n scored sentences don't raise IndexError
        for result in list_of_results[:top_n]:
            estimated_passage = estimated_passage + result['original_sentence'] + ' '

        answers_df = answers_df.append(
            {'document_id': document['id'], 'question_id': question['question_id'], 'estimated_passage': estimated_passage},
            ignore_index=True)

Processing document 1 out of 150
Processing document 2 out of 150
Processing document 3 out of 150
Processing document 4 out of 150
Processing document 5 out of 150
Processing document 6 out of 150
Processing document 7 out of 150
Processing document 8 out of 150
Processing document 9 out of 150
Processing document 10 out of 150
Processing document 11 out of 150
Processing document 12 out of 150
Processing document 13 out of 150
Processing document 14 out of 150
Processing document 15 out of 150
Processing document 16 out of 150
Processing document 17 out of 150
Processing document 18 out of 150
Processing document 19 out of 150
Processing document 20 out of 150
Processing document 21 out of 150
Processing document 22 out of 150
Processing document 23 out of 150
Processing document 24 out of 150
Processing document 25 out of 150
Processing document 26 out of 150
Processing document 27 out of 150
Processing document 28 out of 150
Processing document 29 out of 150
Processing document 30 

In [0]:
# save answers in a CSV file
answers_df.to_csv(os.path.join('data', 'estimated_answers.csv'), index=False, encoding='utf-8')
print('\nCSV file with answers saved in data directory')
print('---------------------------------------------\n')


CSV file with answers saved in data directory
---------------------------------------------



In [0]:
# calculate scores
calculate_answers_score('estimated_answers.csv', 'Answers200.json')



Average F1 Score (for questions with an answer): 0.27
Critical Success Index: 0.89
Overall Score: 0.40

Average F1 Score per question category
PISTWTRIA_TRAPEZA: 0.40
PERIOYSIAKH_KATASTASH: 0.37
OIKOGENEIAKH_KATATASTASH: 0.17
RYTHMISI_OFEILWN: 0.18
XARAKTHRISTIKA_EKPOIHSHS: 0.19


(0.27162506986324536, 0.8906666666666667, 0.3954333892239297)