
# spaCy meets Transformers: Fine-tune BERT, XLNet and GPT-2
Read blog: https://explosion.ai/blog/spacy-transformers

The choice between BERT, GPT-2, and XLNet depends on the specific task and requirements you have. Each model has its strengths and weaknesses, and the "best" model will vary depending on the context.

1. **BERT (Bidirectional Encoder Representations from Transformers)**:
BERT is a pre-trained model developed by Google. It excels in tasks that require a deep understanding of language, such as question answering, text classification, and named entity recognition. BERT's architecture is based on a bidirectional Transformer, which allows it to capture context from both left and right directions. BERT is known for its versatility and has achieved state-of-the-art results in various NLP benchmarks.

2. **GPT-2 (Generative Pre-trained Transformer 2)**:
GPT-2 is a generative language model developed by OpenAI. It is trained to predict the next token given the preceding text, which makes it well suited to text generation tasks such as storytelling and text completion. Its largest variant has 1.5 billion parameters, which helps it generate coherent and contextually relevant text. It has been praised for its creative abilities, but it reads text left to right only and may require more data and computational resources than BERT for fine-tuning.

3. **XLNet**:
XLNet is a pre-trained autoregressive model, built on Transformer-XL, that addresses some limitations of earlier models like BERT. It is trained with permutation language modeling: the training objective is computed over sampled permutations of the order in which tokens are predicted, so each token learns to use context from both sides without relying on BERT's [MASK] tokens. XLNet performs well on a wide range of tasks, including text classification, natural language inference, and sentiment analysis. It is known for its ability to handle long-range dependencies and has achieved competitive results in various benchmarks.

In short, BERT is a strong default for language understanding, GPT-2 for text generation, and XLNet for understanding tasks where BERT's masking scheme or limited context is a constraint. Available data and computational resources also matter, so evaluate the candidate models on your own use case before committing to one.
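
One way to picture the difference between the three models is to look at which positions each token may use as context. The sketch below is a deliberately simplified illustration in plain Python, not how any of the libraries implement attention; the function name `visibility_mask` and its `kind` values are made up for this example, and details such as BERT's [MASK] replacement and XLNet's two-stream attention are left out.

```python
def visibility_mask(n, kind, order=None):
    """Return an n x n 0/1 matrix; mask[i][j] == 1 means position i may use
    position j as context. Hypothetical helper for illustration only."""
    if kind == "bidirectional":
        # BERT-style: every position can see every other position.
        return [[1] * n for _ in range(n)]
    if kind == "causal":
        # GPT-2-style: a position sees itself and everything to its left.
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if kind == "permutation":
        # XLNet-style (simplified): a position sees the positions that come
        # no later than it in a sampled prediction order.
        if order is None or sorted(order) != list(range(n)):
            raise ValueError("order must be a permutation of range(n)")
        rank = {pos: step for step, pos in enumerate(order)}
        return [[1 if rank[j] <= rank[i] else 0 for j in range(n)]
                for i in range(n)]
    raise ValueError(f"unknown kind: {kind!r}")


for kind, order in [("bidirectional", None), ("causal", None),
                    ("permutation", [2, 0, 1])]:
    print(kind)
    for row in visibility_mask(3, kind, order):
        print("  ", row)
```

With the order `[2, 0, 1]`, position 2 is predicted first and sees only itself, while position 1 is predicted last and sees everything, which is how a left-to-right objective can still expose a token to context on both sides.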

# Import Libraries

In [4]:
!pip install spacy-transformers
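# The en_trf_* packages below come from the blog post and target spaCy v2 with
# spacy-transformers 0.x; they are not compatible with spacy-transformers 1.x.
#!python -m spacy download en_trf_bertbaseuncased_lg # bert-base-uncased
#!python -m spacy download de_trf_bertbasecased_lg # bert-base-german-cased
#!python -m spacy download en_trf_xlnetbasecased_lg # xlnet-base-cased
!python -m spacy download en_core_web_md # vectors model loaded later in this notebook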

Collecting spacy-transformers
  Downloading spacy_transformers-1.2.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (190 kB)
Collecting transformers<4.31.0,>=3.4.0 (from spacy-transformers)
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers)
  Downloading spacy_alignments-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

In [None]:
import pandas as pd
import spacy

# Load Dataset

### Real Data

In [2]:
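# Run the pandas import cell above first; otherwise pd is undefined (NameError).
# 'your_file.csv' is a placeholder for a CSV with 'Question' and 'Answer' columns.
data = pd.read_csv('your_file.csv')
questions = data['Question'].tolist()
answers = data['Answer'].tolist()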

NameError: ignored

### Sample Data

In [None]:
questions = [
    "What is the capital of India?",
    "What is the official language of India?",
    "Who is the Prime Minister of India?",
    "What is the currency of India?",
    "What is the population of India?",
    "What are the major religions in India?",
    # Add more questions...
]

answers = [
    "New Delhi",
    "Hindi",
    "Narendra Modi",
    "Indian Rupee",
    "1.3 billion",
    "Hinduism, Islam, Christianity, Sikhism, Buddhism, Jainism",
    # Add more answers...
]

data = pd.DataFrame({"Question": questions, "Answer": answers})
# data.to_csv("india_qa.csv", index=False)
questions = data['Question'].tolist()
answers = data['Answer'].tolist()

In [None]:
data.head()

# Process the data with spaCy
Next, process the questions and answers with spaCy. The pipeline tokenizes each text and runs part-of-speech tagging and dependency parsing. The cell below loads the pre-trained English model en_core_web_md, which, unlike en_core_web_sm, ships with the word vectors that the similarity step relies on.

In [None]:
nlp = spacy.load('en_core_web_md')

processed_questions = [nlp(q) for q in questions]
processed_answers = [nlp(a) for a in answers]
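
To see what the pipeline attaches to each token, a small inspection helper can be useful. The following is a minimal sketch: `token_rows` and `show_annotations` are hypothetical names introduced here, and the sketch assumes en_core_web_md is installed. It reads only the standard spaCy token attributes `text`, `pos_` and `dep_`.

```python
def token_rows(doc):
    """Return (text, part-of-speech tag, dependency label) for each token.

    Works on a spaCy Doc, or on any iterable of objects exposing the
    attributes .text, .pos_ and .dep_.
    """
    return [(tok.text, tok.pos_, tok.dep_) for tok in doc]


def show_annotations(text, model="en_core_web_md"):
    """Load a spaCy pipeline and print the annotations for one text."""
    # Imported lazily so token_rows can be used without loading a model.
    import spacy

    nlp = spacy.load(model)
    for row in token_rows(nlp(text)):
        print(row)
```

Calling `show_annotations("What is the capital of India?")` in a notebook cell prints one tuple per token; the exact tags depend on the model version.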

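# Implement the question answering logic
To answer a given question, find the most relevant entry in the dataset. One way is to compute the similarity between the user's question and each stored answer with spaCy's similarity method, then return the answer with the highest score. The second cell below redefines the function to compare the user's question with the stored questions instead, which tends to match better because a question resembles another question more than it resembles its answer. Both cells use the same function name, so the later definition is the one that runs.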

In [None]:
def get_most_similar_answer(question):
    question_tokens = nlp(question)
    max_similarity = -1
    most_similar_answer = None

    for answer, processed_answer in zip(answers, processed_answers):
        similarity = question_tokens.similarity(processed_answer)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar_answer = answer

    return most_similar_answer
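
By default, spaCy's similarity method returns the cosine similarity between two document vectors, and a document vector is the average of its token vectors. The sketch below reproduces that arithmetic in plain Python so the score is easier to interpret; the two-dimensional vectors are made-up toy values, not real embeddings, and the helper names are hypothetical.

```python
import math


def average_vector(token_vectors):
    """Average a non-empty list of equal-length token vectors."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy token vectors standing in for the words of two short texts.
text_a = average_vector([[0.9, 0.1], [0.7, 0.3]])
text_b = average_vector([[0.8, 0.2], [0.2, 0.8]])
print(cosine_similarity(text_a, text_b))
```

Because the score depends only on the direction of the averaged vectors, word order is ignored and short texts that share topical words tend to score high.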

In [None]:
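def get_most_similar_answer(user_ques):
    # Replaces the definition above: compare the user's question with the
    # stored questions and return the answer paired with the best match.
    user_ques_emb = nlp(user_ques)
    max_similarity = float("-inf")
    most_similar_answer = None

    # processed_questions holds the parsed Doc for each stored question,
    # so the stored questions are not re-parsed on every call.
    for data_ques_emb, a in zip(processed_questions, answers):
        similarity = user_ques_emb.similarity(data_ques_emb)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar_answer = a

    return most_similar_answer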
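
The function above always returns some answer, even when the user's question has nothing to do with the dataset. A possible variation, sketched below under stated assumptions, also returns the best score and declines to answer below a cutoff. The names `best_match` and `word_overlap` and the `min_score` parameter are hypothetical, and `word_overlap` is only a toy stand-in for spaCy's similarity so the example runs without a model.

```python
def best_match(query, qa_pairs, similarity, min_score=0.0):
    """Return (answer, score) for the stored question most similar to query.

    qa_pairs is a list of (question, answer) tuples and similarity is any
    function taking two strings and returning a number. If the best score
    is below min_score, the answer is None.
    """
    best_answer, best_score = None, float("-inf")
    for stored_question, answer in qa_pairs:
        score = similarity(query, stored_question)
        if score > best_score:
            best_answer, best_score = answer, score
    if best_score < min_score:
        return None, best_score
    return best_answer, best_score


def word_overlap(a, b):
    """Toy similarity: Jaccard overlap of lowercased whitespace-split words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


qa_pairs = [
    ("What is the capital of India?", "New Delhi"),
    ("What is the currency of India?", "Indian Rupee"),
]
print(best_match("capital of India?", qa_pairs, word_overlap))
print(best_match("weather tomorrow", qa_pairs, word_overlap, min_score=0.2))
```

To use it with the notebook's pipeline, pass `lambda q, s: nlp(q).similarity(nlp(s))` as the `similarity` argument; a suitable `min_score` would have to be chosen by inspecting scores on your own data.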

# Test the question answering system

In [None]:
question = "Can you tell me India's Capital"
answer = get_most_similar_answer(question)
print(answer)