<a href="https://colab.research.google.com/github/tubagokhan/DeepLearningNLPFoundations/blob/main/Huggingface_Pipeline_Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#Please execute this notebook in the course Python environment.
#This installs all the prerequisites for the course.

!pip install tensorflow
!pip install torch
!pip install keras
!pip install transformers
!pip install datasets
!pip install sentencepiece
!pip install evaluate
!pip install nltk
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (

In [2]:
import transformers

#Set to avoid warning messages.
transformers.logging.set_verbosity_error()

## Using a Question-Answering (Qu-An) Pipeline (open-domain Qu-An: general purpose; closed-domain Qu-An: a specific domain)

In [3]:
from transformers import pipeline

context="""
Earth is the third planet from the Sun and the only astronomical object 
known to harbor life. While large volumes of water can be found 
throughout the Solar System, only Earth sustains liquid surface water. 
About 71% of Earth's surface is made up of the ocean, dwarfing 
Earth's polar ice, lakes, and rivers. The remaining 29% of Earth's 
surface is land, consisting of continents and islands. 
Earth's surface layer is formed of several slowly moving tectonic plates, 
interacting to produce mountain ranges, volcanoes, and earthquakes. 
Earth's liquid outer core generates the magnetic field that shapes Earth's 
magnetosphere, deflecting destructive solar winds.
"""

quan_pipeline = pipeline("question-answering", 
                         model="deepset/minilm-uncased-squad2")

answer = quan_pipeline(question="How much of earth is land?",
                       context=context)
print(answer)



{'score': 0.9553403258323669, 'start': 327, 'end': 330, 'answer': '29%'}
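
In the result above, `start` (327) and `end` (330) are character offsets into `context`, with `end` exclusive, so slicing the context with them should reproduce `answer`. The following is a minimal sketch of that relationship. It assumes a shortened, hypothetical context string and a hand-built result dictionary (with a made-up score) rather than a real pipeline call:

```python
# Hypothetical, shortened context; not the one passed to the pipeline above.
short_context = "The remaining 29% of Earth's surface is land."

# Hand-built result in the same shape the pipeline prints.
# The score is a made-up value for illustration only.
start = short_context.find("29%")
result = {"score": 0.95,
          "start": start,
          "end": start + len("29%"),
          "answer": "29%"}

# Slicing the context with the offsets recovers the answer text.
span = short_context[result["start"]:result["end"]]
print(span)
```

The same slice applied to the full `context` and the real `answer` dictionary is a quick way to check that the offsets point where expected.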


In [4]:
print("\nAnother question :")
print(quan_pipeline(question="How are mountain ranges created?",
                    context=context))


Another question :
{'score': 0.2615410387516022, 'start': 399, 'end': 471, 'answer': "Earth's surface layer is formed of several slowly moving tectonic plates"}
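
The score here (about 0.26) is much lower than for the first question (about 0.96), and the returned span describes tectonic plates rather than stating directly how mountain ranges form. One possible way to act on this is to treat low-scoring answers as unreliable. The sketch below assumes a hypothetical helper named `filter_answer` and an arbitrary cutoff of 0.5; neither is part of `transformers`, and the two result dictionaries are copied by hand from the outputs above:

```python
def filter_answer(result, min_score=0.5):
    """Return the answer text if its score clears min_score, else None.

    min_score=0.5 is an arbitrary illustrative cutoff, not a recommended value.
    """
    if result["score"] >= min_score:
        return result["answer"]
    return None

# The two results printed above, copied by hand.
land = {"score": 0.9553403258323669, "start": 327, "end": 330,
        "answer": "29%"}
mountains = {"score": 0.2615410387516022, "start": 399, "end": 471,
             "answer": "Earth's surface layer is formed of several "
                       "slowly moving tectonic plates"}

print(filter_answer(land))       # high score: answer text is kept
print(filter_answer(mountains))  # low score: treated as unreliable
```

A suitable cutoff depends on the model and the data, so it would need to be tuned on held-out questions. The question-answering pipeline call also accepts a `top_k` argument, which can be used to inspect several candidate spans instead of only the best one.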


## Evaluating Qu-An Performance (SQuAD v2 metric)

In [5]:
from evaluate import load
squad_metric = load("squad_v2")

#Context and question are omitted because the metric does not need them.
#This example shows how the score depends on the match between the prediction
#and the correct answer.

correct_answer="Paris"

predicted_answers=["Paris",
                 "London",
                 "Paris is one of the best cities in the world"]

cum_predictions=[]
cum_references=[]
for i in range(len(predicted_answers)):
    
    #Use the input format for predictions
    predictions = [{'prediction_text':predicted_answers[i], 
                    'id': str(i),
                    'no_answer_probability': 0.}]
    cum_predictions.append(predictions[0])
    
    #Use the input format for references (answer_start is a dummy value here)
    references = [{'answers': {'answer_start': [1], 
                               'text': [correct_answer]}, 
                   'id': str(i)}]
    cum_references.append(references[0])

    results = squad_metric.compute(predictions=predictions,
                                   references=references)
    print("F1 is", results.get('f1'), 
          " for answer :", predicted_answers[i])
    
#Compute for cumulative Results
cum_results=squad_metric.compute(predictions=cum_predictions,
                                 references=cum_references)
print("\n Cumulative Results : \n",cum_results)

F1 is 100.0  for answer : Paris
F1 is 0.0  for answer : London
F1 is 22.22222222222222  for answer : Paris is one of the best cities in the world

 Cumulative Results : 
 {'exact': 33.333333333333336, 'f1': 40.74074074074074, 'total': 3, 'HasAns_exact': 33.333333333333336, 'HasAns_f1': 40.74074074074074, 'HasAns_total': 3, 'best_exact': 33.333333333333336, 'best_exact_thresh': 0.0, 'best_f1': 40.74074074074074, 'best_f1_thresh': 0.0}
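
The F1 of 22.22 for the long answer follows from token overlap: after normalization the prediction has 8 tokens, of which 1 ("paris") matches the reference, so precision is 1/8, recall is 1/1, and F1 = 2 * (1/8) * 1 / (1/8 + 1) = 2/9. The sketch below re-implements this calculation by hand. It assumes SQuAD-style normalization (lowercase, strip punctuation, drop the articles "a", "an", "the"); `normalize` and `token_f1` are illustrative names, not functions from the `evaluate` library, and the no-answer case of SQuAD v2 is not handled:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, drop articles, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, on a 0-100 scale."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 100.0 * 2 * precision * recall / (precision + recall)

answers = ["Paris",
           "London",
           "Paris is one of the best cities in the world"]
scores = [token_f1(a, "Paris") for a in answers]

print(scores)                     # per-answer F1
print(sum(scores) / len(scores))  # mean F1 over the three answers
```

The mean of the three per-answer scores corresponds to the cumulative `f1` of about 40.74 reported above, and the cumulative `exact` of 33.33 reflects that only one of the three predictions matches the reference exactly.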
