<a href="https://colab.research.google.com/github/usha-1-bandi/nlp/blob/main/GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Create a manual dataset with some example text passages
dataset = [
    "The Indian Cricket Team represents India in international cricket and is governed by the Board of Control for Cricket in India (BCCI).",
    "Virat Kohli, one of India's most successful batsmen, has led the Indian Cricket Team to numerous victories across formats.",
    "The Indian Premier League (IPL) is a professional Twenty20 cricket league in India, featuring top Indian and international players.",
    "Sachin Tendulkar, known as the 'God of Cricket,' is one of the greatest cricketers to have played for the Indian team.",
    "The Indian Cricket Team won its first Cricket World Cup in 1983 under the captaincy of Kapil Dev, defeating the West Indies.",
    "Mahendra Singh Dhoni led India to victory in the 2007 T20 World Cup, the 2011 ODI World Cup, and the 2013 Champions Trophy.",
    "Rohit Sharma holds the record for the highest individual score in One Day Internationals (ODIs) with 264 runs.",
    "The Indian Cricket Team won the 2011 Cricket World Cup, ending a 28-year drought, with MS Dhoni hitting the winning six.",
    "The Border-Gavaskar Trophy is a Test cricket series played between India and Australia, named after Sunil Gavaskar and Allan Border.",
    "Anil Kumble, one of India's greatest bowlers, took all 10 wickets in a Test innings against Pakistan in 1999.",
    "India became the No.1 ranked Test team in the world for the first time in 2009 under the captaincy of MS Dhoni.",
    "India has produced some of the world's greatest all-rounders, including Kapil Dev, Ravindra Jadeja, and Hardik Pandya.",
    "The Indian Cricket Team is known for its fierce rivalry with Pakistan, especially in ICC tournaments.",
    "Rahul Dravid, also known as 'The Wall,' was one of India's most dependable Test batsmen and is the team's former head coach.",
    "India won the inaugural ICC T20 World Cup in 2007, defeating Pakistan in the final.",
    "Yuvraj Singh made history by hitting six sixes in an over against England’s Stuart Broad in the 2007 T20 World Cup.",
    "India has played Test cricket since 1932, making its debut against England at Lord’s.",
    "The Indian Women’s Cricket Team has made significant strides, reaching the finals of the 2017 ODI World Cup and 2020 T20 World Cup.",
    "Jasprit Bumrah is one of India's premier fast bowlers, known for his deadly yorkers and unorthodox bowling action.",
    "India finished runner-up in the 2023 ICC World Test Championship final, having won its place through consistent red-ball cricket.",
]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(dataset)
tfidf_vectors = tfidf_matrix.toarray()

# Load a pre-trained question-answering model from Hugging Face
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

def retrieve_and_answer(query, top_k=1):
    """
    Retrieve the passages most similar to the query and extract an answer
    from them with a pre-trained Hugging Face question-answering model.

    Args:
        query: The question asked by the user.
        top_k: Number of top-ranked passages to join into the context.

    Returns:
        The answer span the QA model extracts from the context.
    """
    # Convert the query to a TF-IDF vector
    query_vec = vectorizer.transform([query]).toarray()

    # Compute cosine similarity between the query vector and all document vectors
    similarities = cosine_similarity(query_vec, tfidf_vectors)
    print(similarities)

    # Get the indices of the top_k most similar passages, best match first
    most_similar_indices = np.argsort(similarities[0])[::-1][:top_k]
    print(most_similar_indices)

    # Retrieve the most similar passage(s)
    similar_passages = [dataset[i] for i in most_similar_indices]

    # Combine similar passages to form the context
    context = " ".join(similar_passages)
    print(context)

    # Answer the question using the QA model
    result = qa_pipeline(question=query, context=context)

    return result['answer']

# Example Query
query = "Who is greatest all-rounders of all time?"

# Retrieve and answer the question
answer = retrieve_and_answer(query, top_k=2)
print(answer)



Device set to use cpu


[[0.06620167 0.02753185 0.03641752 0.16639475 0.02739053 0.
  0.         0.         0.03152493 0.27212382 0.13594548 0.38347115
  0.04614341 0.0613326  0.         0.         0.         0.02511487
  0.06483323 0.        ]]
[11  9]
India has produced some of the world's greatest all-rounders, including Kapil Dev, Ravindra Jadeja, and Hardik Pandya. Anil Kumble, one of India's greatest bowlers, took all 10 wickets in a Test innings against Pakistan in 1999.
Anil Kumble
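
The retrieval step above relies on `TfidfVectorizer` and `cosine_similarity`. To make the scoring easier to inspect, here is a minimal standard-library sketch of the same idea. It assumes scikit-learn's documented defaults (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation). The helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`, `top_k_passages`) and the toy passages are hypothetical and not part of the notebook; scikit-learn remains the reference implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Raw term counts times IDF, L2-normalised; unknown terms are dropped.
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def top_k_passages(query, docs, k=1):
    # Score every passage against the query and return (index, score) pairs,
    # best match first.
    idf = fit_idf(docs)
    query_vec = tfidf_vector(query, idf)
    scores = [cosine(query_vec, tfidf_vector(doc, idf)) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy passages, for illustration only.
toy_docs = [
    "Kapil Dev captained India in 1983.",
    "Anil Kumble took all 10 wickets in a Test innings.",
    "Jasprit Bumrah bowls deadly yorkers.",
]
print(top_k_passages("Who took 10 wickets?", toy_docs, k=2))
```

Note that with `k=2` the second slot is filled even when its score is zero, so a minimum-similarity cutoff may be worth considering before passing passages to the QA model.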

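In the recorded run, the two retrieved passages were joined into one context and the model returned "Anil Kumble", a span from the second passage. One option worth trying is to query each passage separately and keep the answer with the highest `score`, a field the Hugging Face question-answering pipeline returns alongside `answer`. The sketch below is an assumption-laden illustration, not the notebook's method: `answer_best_passage` is a hypothetical helper, and `stub_qa` is a made-up stand-in scorer so the example runs without downloading a model. Whether this changes the answer for the query above has not been tested here.

```python
def answer_best_passage(question, passages, qa_fn):
    """Run qa_fn on each passage separately and keep the highest-scoring result.

    qa_fn is any callable accepting question= and context= keyword arguments
    and returning a dict with 'answer' and 'score' keys.
    """
    best = None
    for passage in passages:
        result = qa_fn(question=question, context=passage)
        if best is None or result["score"] > best["score"]:
            # Remember which passage produced the best answer.
            best = {**result, "passage": passage}
    return best


def stub_qa(question, context):
    # Hypothetical stand-in for a real QA model, for illustration only:
    # the score is the fraction of question words found in the context,
    # and the "answer" is simply the first two words of the context.
    q_words = set(question.lower().replace("?", "").split())
    c_words = set(context.lower().replace(",", "").replace(".", "").split())
    score = len(q_words & c_words) / len(q_words)
    return {"answer": " ".join(context.split()[:2]), "score": score}


# Toy passages, for illustration only.
passages = ["Kapil Dev is an all-rounder.", "Anil Kumble is a bowler."]
best = answer_best_passage("Who is an all-rounder?", passages, stub_qa)
print(best["answer"], best["score"])
```

Inside `retrieve_and_answer`, the equivalent call would be `answer_best_passage(query, similar_passages, qa_pipeline)`, since the pipeline is already invoked with `question=` and `context=` keyword arguments.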