<a href="https://colab.research.google.com/github/acastellanos-ie/natural_language_processing/blob/master/qa_practice_dl/arabic_question_answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Colab Configuration

**Execute these steps to configure the Google Colab environment for this notebook. They are not required if you are running it locally and have configured your local environment as explained in the GitHub repository.**

The first step is to clone the repository to have access to all the data and files.

In [3]:
! git clone https://github.com/acastellanos-ie/natural_language_processing.git

Cloning into 'natural_language_processing'...
remote: Enumerating objects: 4733, done.
remote: Counting objects: 100% (4733/4733), done.
remote: Compressing objects: 100% (4540/4540), done.
remote: Total 4733 (delta 336), reused 4475 (delta 161), pack-reused 0
Receiving objects: 100% (4733/4733), 13.07 MiB | 18.30 MiB/s, done.
Resolving deltas: 100% (336/336), done.


Install the requirements:

In [4]:
! pip install -Uqqr natural_language_processing/arabic_requirements.txt --use-deprecated=legacy-resolver


Ensure that you have the GPU runtime activated:

![](https://miro.medium.com/max/3006/1*vOkqNhJNl1204kOhqq59zA.png)

Now you have everything you need to execute the code in Colab.

# Zero-shot Question Answering in Arabic

In this exercise, we will focus on Question Answering in Arabic. Moreover, we are going to go a step further by performing zero-shot Question Answering.

In the other exercise, we relied on a BERT model that had been fine-tuned to the task of question answering; i.e., the base BERT model is trained with a huge dataset for Masked Language Modeling and then is re-trained with an annotated QA dataset (a dataset containing pairs of questions and answers) like SQuAD.

In contrast, in this exercise we will rely on the zero-shot capabilities of these language models, which we already discussed when presenting GPT3. Instead of using task-dependent datasets to fine-tune a language model for specific tasks, we use the raw language model (i.e., the model trained only on the language modeling task). GPT3 has shown that, with enough training data, such a model can solve NLP tasks without being explicitly trained for them.

# Pre-trained Language Model

We will not use GPT3 for this exercise: it is not publicly available and it is too large to run in the Colab environment. Instead, we will use AraGPT2, an Arabic model based on the older and smaller GPT2, available in the AraBERT repository.

Therefore, the first thing to do is to download the repository.

In [1]:

!git clone https://github.com/aub-mind/arabert

Cloning into 'arabert'...
remote: Enumerating objects: 530, done.
remote: Counting objects: 100% (316/316), done.
remote: Compressing objects: 100% (228/228), done.
remote: Total 530 (delta 167), reused 226 (delta 82), pack-reused 214
Receiving objects: 100% (530/530), 4.86 MiB | 20.23 MiB/s, done.
Resolving deltas: 100% (290/290), done.


Now, as usual, we need to set up the pre-processor and the pipeline, which define the model and format the input data.


In [5]:
# textwrap enables formatting of long text
import textwrap

from transformers import pipeline, GPT2TokenizerFast
from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
from arabert.preprocess import ArabertPreprocessor

import torch
device = 0 if torch.cuda.is_available() else -1

#you can choose any aragpt2 model since they all have the same preprocessing
arabert_processor = ArabertPreprocessor(model_name="aragpt2-mega")

In [8]:
model_name = "aubmindlab/aragpt2-mega"  # the mega model needs a High-RAM Colab runtime

grover_gpt2_model = GPT2LMHeadModel.from_pretrained(model_name)
grover_gpt2_model.half()
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)

aragpt2_pipeline = pipeline("text-generation",model=grover_gpt2_model, tokenizer=tokenizer,device=device)

Some weights of the model checkpoint at aubmindlab/aragpt2-base were not used when initializing GPT2LMHeadModel: ['ln_f.bias', 'ln_f.weight']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at aubmindlab/aragpt2-base and are newly initialized: ['emb_norm.weight', 'emb_norm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Answering Questions


Everything is now ready to start asking questions and see how the model answers them.

As explained before, we will not fine-tune the model to the question answering task; neither will we rely on an annotated dataset with questions and answers. We expect the model to have learned how to answer questions automatically!

This is something rather ambitious. Let's see how the model behaves.

Ok, so let's start. Since the model is not fine-tuned for the QA task, we will use the pre-trained language model (GPT2) as it is.

This model has been trained for the language modeling task: given an input text (the prompt), generate the most likely continuation. GPT2 is an autoregressive model, so during training it learns to predict the next token from the tokens that precede it (unlike BERT's Masked Language Modeling, where the model fills in words hidden inside a sentence). Once properly trained, these language models can generate output text that is coherent with the input.
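
To make the idea of "extending a prompt with the most likely text" concrete, here is a toy sketch. It is not how GPT2 works internally: it assumes a tiny made-up corpus and a simple word-pair (bigram) count table instead of a neural network, and the names `train_bigram`, `generate` and `toy_corpus` are hypothetical. The generation loop, however, has the same shape: look at what has been written so far, pick a likely next word, append it, and repeat.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers


def generate(followers, prompt, max_new_words=5):
    """Greedily extend the prompt with the most frequent next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = followers.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in the corpus
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Made-up training data, only for illustration.
toy_corpus = [
    "the answer is forty two",
    "the answer is forty two",
    "the question is hard",
]
model = train_bigram(toy_corpus)
completion = generate(model, "the answer", max_new_words=3)
print(completion)
```

A real language model replaces the count table with a network that scores every token in its vocabulary given the whole preceding context, but the prompt-in, continuation-out interface is the same.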

For this specific situation, we are going to do the same. We will input some text (the query), and we expect the model to generate new text that happens to be the answer to the query.

In more detail, the prompt will contain the question that we want to answer. However, if we input only the question, the model could simply generate text that is somehow related to the query but is not the actual answer. To "help" the model, we are going to tell it explicitly that we want the answer to the question.

**Note:** This approach was presented in the original [GPT3 paper](https://arxiv.org/abs/2005.14165) and it is rather impressive IMHO. You are using a model that has not been trained for any specific task, and it performs well by just "telling" the model in natural language what you want to do. This very same model (with no further fine-tuning) could be used for any other task (e.g., classification) by just telling the model: classify the following text as positive or negative.
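
In practice, "telling the model what to do" is plain string construction. The sketch below shows one possible way to assemble such prompts; `build_prompt` and its `answer_prefix` argument are hypothetical helpers introduced here for illustration, and the English sentiment example is made up to mirror the classification case mentioned in the note.

```python
def build_prompt(instruction, question, answer_prefix=""):
    """Join an instruction, the input text and an optional answer prefix."""
    parts = [instruction.strip(), question.strip()]
    if answer_prefix:
        # Ending the prompt with the start of the answer nudges the model
        # to continue with the answer itself.
        parts.append(answer_prefix.strip())
    return " ".join(parts)


# Question answering: "Answer the following question:" followed by the question.
qa_prompt = build_prompt(
    "أجب عن السؤال التالي:",
    '"من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟"',
)

# Same helper, different task: only the instruction changes.
sentiment_prompt = build_prompt(
    "Classify the following text as positive or negative:",
    "I loved this film.",
    answer_prefix="Sentiment:",
)

print(qa_prompt)
print(sentiment_prompt)
```

The model itself is untouched between the two tasks; all of the task-specific information lives in the prompt.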






In [None]:
# Tell the model that we want it to answer the question that follows.
# The Arabic prompt reads: "Answer the following question:"
prompt = """
أجب عن السؤال التالي:
"""

In [None]:
# The question itself: "Who was the president of Nazi Germany in World War II?"
text = """
\"من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟\"
"""
text_tok = tokenizer.tokenize(arabert_processor.preprocess(text))
text_len = len(text_tok)
print(text_tok)
print(text_len)

['"', 'ĠÙħÙĨ', 'ĠÙĥØ§ÙĨ', 'ĠØ±Ø¦ÙĬØ³', 'ĠØ£ÙĦÙħØ§ÙĨÙĬØ§', 'ĠØ§ÙĦÙĨØ§Ø²ÙĬØ©', 'ĠÙģÙĬ', 'ĠØ§ÙĦØŃØ±Ø¨', 'ĠØ§ÙĦØ¹Ø§ÙĦÙħÙĬØ©', 'ĠØ§ÙĦØ«Ø§ÙĨÙĬØ©', 'ĠØŁ', 'Ġ"']
12
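
The tokens printed above look garbled, but they are not corrupted. GPT2-style tokenizers work on UTF-8 bytes and display each byte as a printable stand-in character (`Ġ` stands for a leading space). The sketch below rebuilds that byte-to-character table, following the scheme used by the original GPT-2 code, and reverses it for the first three tokens of the output. The helper names are hypothetical; with the real tokenizer, `tokenizer.convert_tokens_to_string(text_tok)` should give the same kind of result.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable-character table of GPT-2's byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are mapped to code points from 256 upwards.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: stand-in character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Turn one byte-level token back into readable text."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


# First three tokens copied from the output above.
tokens = ['"', 'ĠÙħÙĨ', 'ĠÙĥØ§ÙĨ']
readable = [decode_token(t) for t in tokens]
print(readable)
```

Each Arabic letter takes two bytes in UTF-8, which is why a short word such as "من" shows up as four stand-in characters plus the `Ġ` marker.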


The following cell joins the prompt and the question, then pre-processes and formats the result with the pre-processor defined above.

In [None]:
text_prep = arabert_processor.preprocess(prompt + " " + text)
input_len = len(tokenizer.tokenize(text_prep))
print(text_prep)
print(input_len)

أجب عن السؤال التالي : " من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟ "
18


The input query and prompt are ready and formatted. Now, let's use the model to answer the question.

In [None]:
print("Input Length: ", input_len)
gen_text = aragpt2_pipeline(
    text_prep,
    pad_token_id=0,             # 0 for AraGPT2
    do_sample=False,            # deterministic beam search, so top_k and top_p have no effect
    num_beams=5,
    max_length=input_len + 10,  # prompt tokens plus up to 10 new tokens
    top_k=10,
    top_p=0.95,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3,
    num_return_sequences=5,
)
gen_text

Input Length:  18


[{'generated_text': 'أجب عن السؤال التالي : " من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟ " الجواب هو أدولف هتلر ، الذي حكم ألمانيا بين'},
 {'generated_text': 'أجب عن السؤال التالي : " من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟ " الجواب هو أدولف هتلر ، الذي حكم ألمانيا لمدة'},
 {'generated_text': 'أجب عن السؤال التالي : " من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟ " الجواب هو أدولف هتلر ، الذي ولد في النمسا'},
 {'generated_text': 'أجب عن السؤال التالي : " من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟ " الجواب هو أدولف هتلر ، الذي حكم ألمانيا منذ'},
 {'generated_text': 'أجب عن السؤال التالي : " من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟ " إن الإجابة على هذا السؤال صعبة جدا ، لأن هتلر'}]

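Amazing!
Four of the five candidates answer the query directly with "الجواب هو أدولف هتلر" ("The answer is Adolf Hitler"); the fifth starts by saying that the question is very hard to answer, but it still mentions Hitler.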
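
Every candidate repeats the full prompt before the generated text, which makes the results harder to read. A minimal sketch for keeping only the continuation is shown below. `strip_prompt` is a hypothetical helper, and `sample_output` retypes two of the candidates above rather than calling the model. The text-generation pipeline also accepts a `return_full_text=False` argument that is meant to do this for you, although it has not been tried in this notebook.

```python
def strip_prompt(prompt, candidates):
    """Return only the text that the model added after the prompt."""
    continuations = []
    for candidate in candidates:
        text = candidate["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text.strip())
    return continuations


# The pre-processed prompt, as printed earlier in the notebook.
prompt_text = 'أجب عن السؤال التالي : " من كان رئيس ألمانيا النازية في الحرب العالمية الثانية ؟ "'

# Two of the five candidates shown above, in the pipeline's output format.
sample_output = [
    {"generated_text": prompt_text + " الجواب هو أدولف هتلر ، الذي حكم ألمانيا بين"},
    {"generated_text": prompt_text + " إن الإجابة على هذا السؤال صعبة جدا ، لأن هتلر"},
]

answers = strip_prompt(prompt_text, sample_output)
for answer in answers:
    print(answer)
```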

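Let's try a new one. In this case, we are going to make it a little more difficult for the model by asking for a specific date ("In which year was the city of العيون founded?"). This time the prompt also ends with the start of the answer, "الجواب هو في سنة" ("The answer is in the year"), so that the model only has to complete the year. Let's see how it goes.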

In [None]:
text_prep = arabert_processor.preprocess("أجب عن السؤال التالي : في أي سنة تأسست مدينة العيون ؟ الجواب هو في سنة")
input_len = len(tokenizer.tokenize(text_prep))
print(text_prep)
print(input_len)

أجب عن السؤال التالي : في أي سنة تأسست مدينة العيون ؟ الجواب هو في سنة
17


In [None]:
print("Input Length: ", input_len)
gen_text = aragpt2_pipeline(
    text_prep,
    pad_token_id=0,            # 0 for AraGPT2
    do_sample=False,           # deterministic beam search, so top_k and top_p have no effect
    num_beams=5,
    max_length=input_len + 4,  # prompt tokens plus up to 4 new tokens
    top_k=10,
    top_p=0.95,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3,
    num_return_sequences=5,
)
gen_text

Input Length:  17


[{'generated_text': 'أجب عن السؤال التالي : في أي سنة تأسست مدينة العيون ؟ الجواب هو في سنة 1909 �'},
 {'generated_text': 'أجب عن السؤال التالي : في أي سنة تأسست مدينة العيون ؟ الجواب هو في سنة 1906 �'},
 {'generated_text': 'أجب عن السؤال التالي : في أي سنة تأسست مدينة العيون ؟ الجواب هو في سنة 1902 �'},
 {'generated_text': 'أجب عن السؤال التالي : في أي سنة تأسست مدينة العيون ؟ الجواب هو في سنة 1904 �'},
 {'generated_text': 'أجب عن السؤال التالي : في أي سنة تأسست مدينة العيون ؟ الجواب هو في سنة 1903 �'}]

In this case, the candidates disagree with one another (they range from 1902 to 1909), and none of them seems to be correct.
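
When you try more questions of this kind, it helps to pull the year out of each candidate and check how much the beams agree. The sketch below does that with a regular expression. `extract_last_year` is a hypothetical helper, and the candidate strings are retyped from the output above (including the trailing replacement character) instead of being produced by the model.

```python
import re
from collections import Counter

# Four-digit years between 1000 and 2099.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def extract_last_year(text):
    """Return the last four-digit year found in the text, or None."""
    matches = YEAR_PATTERN.findall(text)
    return matches[-1] if matches else None


prefix = "أجب عن السؤال التالي : في أي سنة تأسست مدينة العيون ؟ الجواب هو في سنة"
candidates = [
    prefix + " " + year + " \ufffd"
    for year in ("1909", "1906", "1902", "1904", "1903")
]

years = [extract_last_year(c) for c in candidates]
year_counts = Counter(years)
print(years)
print("distinct answers:", len(year_counts))
```

Five beams giving five different years is a sign that the model is not confident in any of them. Agreement between beams would not prove that an answer is correct, but this level of disagreement is a good reason to distrust it.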

I recommend you try new questions or even different prompts to see if you can get more interesting answers.

Enjoy!
