# Coleridge - Huggingface Question Answering

This is not exactly what the competition metric asks for, but it is an interesting experiment nonetheless.

I've taken a pre-trained Huggingface Question Answering model and asked it to predict which dataset is referenced (as opposed to the exact text that mentions it).

In [None]:
import os
import glob
import numpy as np
import pandas as pd
import re
import simplejson
import torch
from joblib import Parallel, delayed
from typing import Dict, List
from transformers import pipeline

Thanks to [@Nobu](https://www.kaggle.com/enukuro) for figuring out how to get question answering working in offline mode with this dataset:
- https://www.kaggle.com/enukuro/huggingface-distilbertbasecaseddistilledsquad

In [None]:
### Online Mode
# from transformers import pipeline
# question_answering = pipeline("question-answering", device=device)  # cache_dir="/kaggle/working/transformers")


### Offline Mode

from transformers import QuestionAnsweringPipeline, DistilBertTokenizerFast, TFDistilBertForQuestionAnswering, DistilBertConfig

tokenizer = DistilBertTokenizerFast(
    vocab_file='../input/huggingface-distilbertbasecaseddistilledsquad/distilbert-base-cased-distilled-squad_vocab.txt', 
    tokenizer_file='../input/huggingface-distilbertbasecaseddistilledsquad/distilbert-base-cased-distilled-squad_tokenizer.json', 
    do_lower_case=False
)
config = DistilBertConfig.from_pretrained(
    '../input/huggingface-distilbertbasecaseddistilledsquad/distilbert-base-cased-distilled-squad_config.json'
)
model = TFDistilBertForQuestionAnswering.from_pretrained(
    '../input/huggingface-distilbertbasecaseddistilledsquad/distilbert-base-cased-distilled-squad.h5', 
    config=config
)
question_answering = QuestionAnsweringPipeline(
    model=model, 
    tokenizer=tokenizer,
    device=(0 if torch.cuda.is_available() else -1)
) 

# Example Usage

Example taken from: https://towardsdatascience.com/question-answering-with-pretrained-transformers-using-pytorch-c3e7a44b4012

In [None]:
context = """
Machine learning (ML) is the study of computer algorithms that improve automatically through experience. 
It is seen as a part of artificial intelligence.
Machine learning algorithms build a model based on sample data, known as "training data", 
in order to make predictions or decisions without being explicitly programmed to do so. 
Machine learning algorithms are used in a wide variety of applications, 
such as email filtering and computer vision, 
where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
"""
question = "What are machine learning models based on?"

result = question_answering(question=question, context=context)
print("Answer:", result['answer'])
print("Score: ", result['score'])

# Prepare Dataset

Code reused from: https://www.kaggle.com/jamesmcguigan/coleridge-string-literals/

In [None]:
train_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
train_df

In [None]:
%%time
def clean_text(text: str) -> str:               return re.sub('[^A-Za-z0-9]+', ' ', str(text).lower()).strip()
def clean_texts(texts: List[str]) -> List[str]: return [ clean_text(text) for text in texts ] 

def read_json(index: str, test_train) -> Dict:
    filename = f"../input/coleridgeinitiative-show-us-the-data/{test_train}/{index}.json"
    with open(filename) as f:
        json = simplejson.load(f)
    return json
        
def json2text(index: str, test_train) -> str:
    json  = read_json(index, test_train)
    texts = [
        row["section_title"] + " " + row["text"] 
        for row in json
    ]
    # texts = clean_texts(texts)
    text  = " ".join(texts)
    return text

def filename_to_index(filename):
    return re.sub(r"^.*/|\.[^.]+$", '', filename)  # raw string avoids an invalid-escape warning for \.

def glob_to_indices(globpath):
    return list(map(filename_to_index, glob.glob(globpath)))
       
# Inspired by: https://www.kaggle.com/hamditarek/merge-multiple-json-files-to-a-dataframe
def dataset_df(test_train="test"):
    indices = glob_to_indices(f"../input/coleridgeinitiative-show-us-the-data/{test_train}/*.json")    
    texts   = Parallel(-1)( 
        delayed(json2text)(index, test_train)
        for index in indices  
    )
    df = pd.DataFrame([
        { "id": index, "text": text }
        for index, text in zip(indices, texts)
    ])
    df.to_csv(f"{test_train}.json.csv", index=False)
    return df

train_data = dataset_df("train")
test_data  = dataset_df("test")

In [None]:
train_data

In [None]:
test_data
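
To make the preprocessing above concrete, here is a small, self-contained sketch of what `clean_text` and the section-joining step in `json2text` produce. The `sample_paper` list is a made-up stand-in for one of the competition JSON files (assumed to be a list of sections with `section_title` and `text` keys, as the code above reads them); it is not real data.

```python
import re

def clean_text(text: str) -> str:
    # Same normalisation as above: lowercase, collapse non-alphanumerics to single spaces
    return re.sub('[^A-Za-z0-9]+', ' ', str(text).lower()).strip()

# Hypothetical paper, shaped like the competition JSON files (illustration only)
sample_paper = [
    {"section_title": "Data",    "text": "We use the Alzheimer's Disease Neuroimaging Initiative (ADNI)."},
    {"section_title": "Results", "text": "Accuracy improved by 3%."},
]

# Same joining rule as json2text, but without reading from disk
raw_text = " ".join(row["section_title"] + " " + row["text"] for row in sample_paper)
cleaned  = clean_text(raw_text)

print(raw_text)
print(cleaned)
```

Note that `json2text` leaves the `clean_texts` call commented out, so the question answering model sees the raw, cased text rather than the cleaned version.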

# Question Answering

Let's try out a variety of different question formats.

In [None]:
train_df = pd.read_csv("../input/coleridgeinitiative-show-us-the-data/train.csv")

def answer_questions(question, df, count=0):
    submission_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/sample_submission.csv', index_col=0)
    for n, (_, row) in enumerate(df.iterrows()):
        context  = row["text"]
        result   = question_answering(question=question, context=context)
        expected = train_df[ train_df["Id"] == row["id"] ]  # ground-truth rows (empty for test papers)

        # Record the answer against this paper only, rather than overwriting the whole column
        if row["id"] in submission_df.index:
            submission_df.loc[row["id"], 'PredictionString'] = result['answer']

        print(f"{row['id']} | {result['score']:.3f}")
        if len(expected):
            print('answer:       ', result['answer'])
            print('dataset_label:', set(expected['dataset_label']))
            print('dataset_title:', set(expected['dataset_title']))
            print('pub_title:    ', set(expected['pub_title']) )
            print()
        if count and n + 1 >= count: break  # stop after `count` papers

    # Write the submission once, after all papers have been processed
    submission_df.to_csv("submission.csv")

In [None]:
answer_questions("Which was said about the study dataset?", train_data, 10)

In [None]:
answer_questions("Which dataset is referenced?", train_data, 10)

In [None]:
answer_questions("Which study dataset is referenced?", train_data, 10)

In [None]:
answer_questions("What is referenced?", train_data, 10)

In [None]:
answer_questions("What papers are referenced?", train_data, 10)

In [None]:
answer_questions("Which study is referenced?", train_data, 10)

In [None]:
answer_questions("Which study, program, data or database?", train_data, 10)

In [None]:
answer_questions("What did you say about the study?", train_data, 10)

In [None]:
answer_questions("Identify the mention of datasets?", train_data, 10)

In [None]:
answer_questions("What was said about the study, program, data or database?", train_data, 10)

In [None]:
answer_questions("What was said about the study program data?", train_data, 10)
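
The printed output above has to be compared by eye. One way to put a number on each question format would be to score the model's answer against the known `dataset_label` values with a word-level Jaccard similarity (the competition metric is described as a Jaccard-based F-score). The sketch below is an assumption about how such a comparison could look, not the official scoring code; `jaccard` and `best_label_score` are illustrative helper names that do not appear elsewhere in this notebook, and the answers and labels are made up.

```python
import re
from typing import Iterable

def clean_text(text: str) -> str:
    # Same normalisation as in the Prepare Dataset section
    return re.sub('[^A-Za-z0-9]+', ' ', str(text).lower()).strip()

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings after cleaning."""
    set_a, set_b = set(clean_text(a).split()), set(clean_text(b).split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def best_label_score(answer: str, labels: Iterable[str]) -> float:
    """Highest Jaccard similarity between the answer and any expected label."""
    return max((jaccard(answer, label) for label in labels), default=0.0)

# Made-up answers and labels, for illustration only
labels = {"ADNI", "Alzheimer's Disease Neuroimaging Initiative (ADNI)"}
print(best_label_score("the Alzheimer's Disease Neuroimaging Initiative (ADNI)", labels))
print(best_label_score("a wide variety of applications", labels))
```

Inside `answer_questions`, something like `best_label_score(result['answer'], expected['dataset_label'])` could be averaged over the sampled papers to rank the question formats, though that has not been tried here.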

# Submission

In [None]:
answer_questions("What was said about the study program data?", test_data)

# Unsolved Problems

- SOLVED: How do I get `os.environ["TRANSFORMERS_CACHE"]` to work in offline mode? (Thanks: [@Nobu](https://www.kaggle.com/enukuro))
- Does anybody have any advice on how these pretrained models could be fine-tuned on the competition dataset?
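
On the fine-tuning question, one possible starting point (a sketch, not a tested recipe) is to convert the competition labels into SQuAD-style extractive QA examples: each example pairs a question and a context window with the answer text and its character offset inside that context. The helper name `make_qa_examples`, the `window` size, and the sample text below are all assumptions made for illustration; only the idea of using `dataset_label` as the answer comes from `train.csv`.

```python
import re
from typing import Dict, List

def make_qa_examples(text: str, labels: List[str], question: str, window: int = 200) -> List[Dict]:
    """Build SQuAD-style examples by locating each label in the raw paper text."""
    examples = []
    for label in labels:
        for match in re.finditer(re.escape(label), text, flags=re.IGNORECASE):
            # Keep a window of characters around the mention as the context
            start   = max(0, match.start() - window)
            end     = min(len(text), match.end() + window)
            context = text[start:end]
            examples.append({
                "question": question,
                "context":  context,
                "answers":  {
                    "text":         [match.group(0)],
                    "answer_start": [match.start() - start],  # offset within the context
                },
            })
    return examples

# Made-up sentence, for illustration only
sample_text = "Participants were drawn from the National Education Longitudinal Study, a survey of students."
examples = make_qa_examples(
    sample_text, ["National Education Longitudinal Study"], "Which dataset is referenced?", window=20
)
for ex in examples:
    offset = ex["answers"]["answer_start"][0]
    answer = ex["answers"]["text"][0]
    # Sanity check: the offset should point at the answer inside the context
    print(repr(ex["context"]), offset, ex["context"][offset:offset + len(answer)] == answer)
```

Examples in this shape would still need to be tokenized and fed to a question answering training loop, which is not shown here.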

# Conclusion

This might not be exactly what the competition is asking for, but the results are interesting nonetheless.

If you learnt something from this notebook, or want to fork it, then please leave an upvote. Thank you.

# Further Reading


The original `String Literals` notebook:
- https://www.kaggle.com/jamesmcguigan/coleridge-string-literals/