# Custom Chatbot Project
This project uses the match results of the ongoing Indian Premier League (IPL) 2024 season, extracted from its Wikipedia page, as the data set. The goal of the custom chatbot is to build prompts that answer high-level questions, such as who won a given match in IPL 2024.

The OpenAI models have no information on IPL 2024, which is why the season in progress as of April-May 2024 was chosen.

At a high level, the technique used is Retrieval-Augmented Generation (RAG). RAG is a natural language processing (NLP) technique that combines a retrieval step with a generative model. It aims to improve the quality and relevance of generated responses by adding information from an external knowledge source to the prompt. For example, when a question is asked about a specific topic, RAG retrieves the most relevant passages from a knowledge base and passes them to the model, so that the response is grounded in those facts rather than only in the model's training data.
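
The retrieve-then-generate flow described above can be illustrated with a minimal, offline sketch. It is not the implementation used later in this notebook: retrieval here ranks facts by shared words instead of embeddings, and the names `retrieve`, `augment`, and `answer`, as well as the `generate` callable standing in for a language model, are hypothetical.

```python
import re
from typing import Callable, List

# Toy knowledge base, using three rows from the data set built later on.
FACTS = [
    "Match 1 Royal Challengers Bengaluru v Chennai Super Kings (H) Chennai Super Kings won by 6 wickets",
    "Match 2 Delhi Capitals v Punjab Kings (H) Punjab Kings won by 4 wickets",
    "Match 3 (H) Kolkata Knight Riders v Sunrisers Hyderabad Kolkata Knight Riders won by 4 runs",
]


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, facts: List[str], n: int = 1) -> List[str]:
    """Retrieval step: rank facts by word overlap (a stand-in for embedding similarity)."""
    question_words = tokenize(question)
    ranked = sorted(facts, key=lambda fact: len(question_words & tokenize(fact)), reverse=True)
    return ranked[:n]


def augment(question: str, context: List[str]) -> str:
    """Augmentation step: place the retrieved facts in front of the question."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question


def answer(question: str, facts: List[str], generate: Callable[[str], str]) -> str:
    """Generation step: hand the augmented prompt to a text generator (assumed callable)."""
    return generate(augment(question, retrieve(question, facts)))


question = "Who won Match 2 between Delhi Capitals and Punjab Kings?"
# An identity function replaces the model here, so the output is the augmented prompt itself.
print(answer(question, FACTS, generate=lambda prompt: prompt))
```

Swapping the word-overlap ranking for embedding distances and the identity function for a chat completion call gives the structure followed in the rest of this notebook.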

In [13]:
pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.28.1
    Uninstalling openai-1.28.1:
      Successfully uninstalled openai-1.28.1
Successfully installed openai-0.28.0


In [4]:
SOURCE_URL = 'https://en.wikipedia.org/wiki/2024_Indian_Premier_League'
HTML_PAGE_FILEPATH = './html_page.html'
CSV_FILEPATH_WITH_EMBEDDINGS = './wikipedia_with_embeddings.csv'


In [2]:
import requests
import pandas as pd
import openai
from bs4 import BeautifulSoup
from typing import List, Union, Dict
from scipy.spatial.distance import cosine
openai.api_key = "YOUR_API_KEY"  # placeholder: replace with your own key

**Data Wrangling**

To extract data from Wikipedia and structure it so that embeddings can be built from it, follow these steps. First, identify the Wikipedia page or pages to extract data from. Then fetch the HTML content of the page. Next, parse the HTML to extract the relevant information, such as paragraphs, tables, or lists. Finally, organize the extracted data into a structured format, here a DataFrame with a single 'text' column. This structured data is then used as input to the embedding model.
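
As a rough illustration of the parse-and-structure steps, the sketch below turns table cells into one text row per match. It uses only the standard library's `html.parser` as a stand-in for BeautifulSoup, and the `SAMPLE_HTML` markup, the `CellCollector` class, and `rows_from_html` are hypothetical simplifications; the real Wikipedia page is laid out differently.

```python
from html.parser import HTMLParser
from typing import List


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[str] = []
        self._in_cell = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            # Collapse internal whitespace before storing the cell text
            self.cells.append(" ".join("".join(self._buffer).split()))
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def rows_from_html(html: str, cells_per_row: int) -> List[str]:
    """Parse the HTML and join every `cells_per_row` cells into one text row."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = parser.cells
    return [" ".join(cells[i:i + cells_per_row]) for i in range(0, len(cells), cells_per_row)]


# Simplified, made-up markup; not the actual structure of the Wikipedia page.
SAMPLE_HTML = """
<table>
  <tr><td><b>Match 1</b></td><td>Royal Challengers Bengaluru v Chennai Super Kings (H)</td><td>Chennai Super Kings won by 6 wickets</td></tr>
  <tr><td><b>Match 2</b></td><td>Delhi Capitals v Punjab Kings (H)</td><td>Punjab Kings won by 4 wickets</td></tr>
</table>
"""

rows = rows_from_html(SAMPLE_HTML, cells_per_row=3)
for row in rows:
    print(row)
```

Each resulting string plays the role of one entry in the 'text' column, so a single embedding covers the match number, the teams, and the result together.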

In [5]:
# Helper function to fetch HTML page from a URL
def fetch_html_page(url: str) -> bytes:
    """
    Fetches HTML content from a given URL.

    Args:
        url (str): The URL of the webpage to fetch.

    Returns:
        bytes: The HTML content of the webpage.

    Raises:
        Exception: If there is a connection error.
    """
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
    }
    # A timeout prevents the request from hanging indefinitely
    response = requests.get(url, headers=headers, timeout=30)

    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f'Connection error: HTTP status {response.status_code}')

# Fetch first so that a failed request does not leave an empty file behind
html_page = fetch_html_page(SOURCE_URL)
with open(HTML_PAGE_FILEPATH, mode='wb') as html_file:
    html_file.write(html_page)

In [6]:
# Function to extract data from HTML
def extract_data_from_html(html_file_path: str) -> pd.DataFrame:
    """
    Extracts data from an HTML file.

    Args:
        html_file_path (str): The file path to the HTML file.

    Returns:
        pd.DataFrame: A DataFrame containing the extracted data.
    """
    # Parse the HTML file (Wikipedia pages are UTF-8 encoded)
    with open(html_file_path, encoding='utf-8') as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    # Locate the "Fixtures and results" heading. If this lookup fails, check
    # the page source: the id value may have changed on Wikipedia.
    root_dom_node = soup.find('span', {'id': 'Fixtures_and_results'})
    print(root_dom_node)
    if root_dom_node is None:
        raise ValueError("Heading with id 'Fixtures_and_results' not found in the HTML")
    data = []

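    # Walk every element after the heading; for each table cell, record the
    # text of the next <b> tag (cells with no following <b> are skipped)
    for node in root_dom_node.find_all_next():
        if node.name == 'td':
            bold = node.find_next('b')
            if bold is not None:
                data.append(bold.text.strip())
    # Join every 8 consecutive values into one text row, then build the DataFrame
    chunks = [" ".join(data[i:i+8]) for i in range(0, len(data), 8)]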
    df = pd.DataFrame(chunks, columns=['text'])
    return df

In [7]:
# Extracting data from HTML and displaying DataFrame
df = extract_data_from_html(HTML_PAGE_FILEPATH)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

<span class="mw-headline" id="Fixtures_and_results">Fixtures and results</span>


In [8]:
# Displaying DataFrame and its shape
print(df.head(20))
print(df.shape)

                                                                                                                                                                                                                                          text
0                            Match 1 Royal Challengers Bengaluru v Chennai Super Kings (H) Chennai Super Kings won by 6 wickets Chennai Super Kings won by 6 wickets Chennai Super Kings won by 6 wickets Chennai Super Kings won by 6 wickets
1                                                                            Match 2 Delhi Capitals v Punjab Kings (H) Punjab Kings won by 4 wickets Punjab Kings won by 4 wickets Punjab Kings won by 4 wickets Punjab Kings won by 4 wickets
2                                      Match 3 (H) Kolkata Knight Riders v Sunrisers Hyderabad Kolkata Knight Riders won by 4 runs Kolkata Knight Riders won by 4 runs Kolkata Knight Riders won by 4 runs Kolkata Knight Riders won by 4 runs
3                                           

In [9]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL = 'gpt-3.5-turbo'
# TODO: revisit the batch size; sending one row per request makes creating the embeddings slow.
BATCH_SIZE = 1

In [10]:
# Reset display options for pandas DataFrame
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')

In [24]:
def get_embeddings(prompt: Union[str, List[str]], embedding_model: str) -> List[List[float]]:
    """
    Retrieves embeddings from OpenAI API for the given prompt.

    Args:
        prompt (Union[str, List[str]]): Input prompt or list of prompts.
        embedding_model (str): Name of the embedding model requested by the caller.
            Not forwarded to the API; see the note in the function body.

    Returns:
        List[List[float]]: List of embeddings for the input prompt(s).
    """
    # NOTE: requests always use EMBEDDING_MODEL_NAME so that question and
    # document embeddings come from the same model; `embedding_model` is
    # accepted for interface compatibility only.
    # Only `input` and `model` are sent: max_tokens, temperature, n and stop
    # are completion parameters and have no meaning for embeddings.
    response = openai.Embedding.create(
        input=prompt if isinstance(prompt, list) else [prompt],
        model=EMBEDDING_MODEL_NAME,
    )
    return [row.embedding for row in response.data]

# Function to create embeddings for DataFrame
def create_embeddings(df: pd.DataFrame, embedding_model_name: str = EMBEDDING_MODEL_NAME, batch_size: int = BATCH_SIZE) -> List[List[float]]:
    """
    Creates embeddings for the text data in the DataFrame using the specified embedding model.

    Args:
        df (pd.DataFrame): DataFrame containing text data.
        embedding_model_name (str): Name of the embedding model to use.
        batch_size (int): Size of batches for processing.

    Returns:
        List[List[float]]: List of embeddings corresponding to the text data.
    """
    embeddings_output = []
    for idx in range(0, len(df), batch_size):
        batch = df.iloc[idx:idx+batch_size]['text'].tolist()
        embeddings = get_embeddings(batch, embedding_model_name)
        embeddings_output.extend(embeddings)
    return embeddings_output
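
To see why the batch size matters for speed, the following sketch counts how many requests a batched loop makes. It runs offline: `FakeEmbedder` is a hypothetical stand-in for the API call, and `batched` and `embed_all` are illustrative names, not part of the OpenAI library.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class FakeEmbedder:
    """Stand-in for an embedding API: counts calls and returns dummy vectors."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, batch: List[str]) -> List[List[float]]:
        self.calls += 1
        # One dummy 1-dimensional "embedding" per input text
        return [[float(len(text))] for text in batch]


def embed_all(texts: List[str],
              embed_batch: Callable[[List[str]], List[List[float]]],
              batch_size: int) -> List[List[float]]:
    """Embed all texts batch by batch, preserving the input order."""
    output: List[List[float]] = []
    for batch in batched(texts, batch_size):
        output.extend(embed_batch(batch))
    return output


texts = [f"Match {i}" for i in range(1, 11)]  # 10 illustrative rows

one_by_one = FakeEmbedder()
embed_all(texts, one_by_one, batch_size=1)

in_batches = FakeEmbedder()
vectors = embed_all(texts, in_batches, batch_size=4)

print(one_by_one.calls, in_batches.calls)  # 10 requests versus 3
```

Because each request carries network latency, fewer and larger batches usually finish sooner, as long as each batch stays within the input limits of the API.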

In [45]:
# Add embeddings to DataFrame and save to CSV
df['embedding'] = create_embeddings(df)
df.to_csv(CSV_FILEPATH_WITH_EMBEDDINGS, sep=',', index=False)

# Display DataFrame head
print(df.head(10))

                                                text  \
0  Match 1 Royal Challengers Bengaluru v Chennai ...   
1  Match 2 Delhi Capitals v Punjab Kings (H) Punj...   
2  Match 3 (H) Kolkata Knight Riders v Sunrisers ...   
3  Match 4 (H) Rajasthan Royals v Lucknow Super G...   
4  Match 5 (H) Gujarat Titans v Mumbai Indians Gu...   
5  Match 6 Punjab Kings v Royal Challengers Benga...   
6  Match 7 (H) Chennai Super Kings v Gujarat Tita...   
7  Match 8 (H) Sunrisers Hyderabad v Mumbai India...   
8  Match 9 (H) Rajasthan Royals v Delhi Capitals ...   
9  Match 10 (H) Royal Challengers Bengaluru v Kol...   

                                           embedding  
0  [0.01276527252048254, -0.0024205476511269808, ...  
1  [-0.00025063130306079984, 0.002982246922329068...  
2  [0.0038601511623710394, 0.0009172391728498042,...  
3  [-0.0006995435687713325, -0.009809255599975586...  
4  [-0.012661637738347054, -0.004087824374437332,...  
5  [0.0035090281162410975, -0.009090369567275047,... 

In [17]:
# Question embeddings must come from the same model as the stored document embeddings
EMBEDDING_MODEL = EMBEDDING_MODEL_NAME

**Custom Query Completion**

The following cell defines three functions that build prompts and one that sends them to the model.

build_simple_prompt constructs a plain prompt that contains only the question.

build_custom_context selects the rows of the data set whose embeddings are closest to the question embedding, and build_custom_prompt places those rows in the prompt as context.

handle_question takes a prompt and returns the model's response.
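
The context selection relies on cosine distance between the question embedding and each stored embedding. The sketch below shows that ranking step on made-up 2-dimensional vectors, using plain Python instead of `scipy`; `cosine_distance` and `closest_facts` are hypothetical helper names, and real embeddings have far more dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity, the same quantity scipy's `cosine` computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def closest_facts(question_embedding: Sequence[float],
                  facts: List[Tuple[str, Sequence[float]]],
                  n: int = 2) -> List[str]:
    """Return the texts of the n facts whose embeddings are closest to the question."""
    ranked = sorted(facts, key=lambda item: cosine_distance(item[1], question_embedding))
    return [text for text, _ in ranked[:n]]


# Made-up vectors purely for illustration
facts = [
    ("fact A", [1.0, 0.0]),
    ("fact B", [0.0, 1.0]),
    ("fact C", [0.9, 0.1]),
]
question_embedding = [1.0, 0.05]

print(closest_facts(question_embedding, facts, n=2))
```

A smaller distance means the two vectors point in a more similar direction, so sorting in ascending order puts the most relevant facts first.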

In [36]:
def build_simple_prompt(question: str) -> List[Dict[str, str]]:
    """
    Builds a simple prompt for asking a question.

    Args:
        question (str): The question to include in the prompt.

    Returns:
        List[Dict[str, str]]: A list containing a single message with the user role and the provided question.
    """
    return [
        {
            'role': 'user',
            'content': question
        }
    ]

#Provide a summary of the results for IPL 2024 matches, including which teams played each match and the outcomes.
#Specifically, focus on details such as the match number, competing teams, the winning team, and the margin of victory. \
#For instance, for Match 1, mention the teams involved, who hosted the game, and how the game was won

def build_custom_prompt(question: str, database_df: pd.DataFrame) -> List[Dict[str, str]]:
    """
    Builds a custom prompt including context for asking a question based on a database DataFrame.

    Args:
        question (str): The question to include in the prompt.
        database_df (pd.DataFrame): The DataFrame containing the database of facts.

    Returns:
        List[Dict[str, str]]: A list containing two messages: system message with context and user message with the question.
    """
    return [
        {
            'role': 'system',
            'content': """
            Answer the question based on the context provided below. If the question cannot be answered based on the provided context, say "I don't know the answer".
            The context contains facts from the match results of the ongoing 2024 IPL (Indian Premier League).
            Each fact starts with the match number, for instance Match 1, followed by the teams involved, with the host team marked (H), and how the game was won.
            Context:
                {}
            """.format('\n\n'.join(build_custom_context(question, database_df)))
        },
        {
            'role': 'user',
            'content': question
        }
    ]

def build_custom_context(question: str, database_df: pd.DataFrame, n: int = 5) -> List[str]:
    """
    Builds a custom context for a given question based on the closest facts from a database DataFrame.

    Args:
        question (str): The question for which the context is being built.
        database_df (pd.DataFrame): The DataFrame containing the database of facts.
        n (int): The number of closest facts to include in the context.

    Returns:
        List[str]: A list of closest facts to the question.
    """
    question_embedding = get_embeddings(question, EMBEDDING_MODEL)[0]
    df = database_df.copy()
    df["distances"] = df['embedding'].apply(lambda embedding: cosine(embedding, question_embedding))
    df.sort_values("distances", ascending=True, inplace=True)
    return df.iloc[:n]['text'].tolist()

def handle_question(prompt: List[Dict[str, str]], model_name: str = "gpt-4-turbo-preview") -> str:
    """
    Handles a question prompt by generating a response using the specified model.

    Args:
        prompt (List[Dict[str, str]]): The prompt messages to send to the model.
        model_name (str): The name of the chat completion model to use.
            Defaults to "gpt-4-turbo-preview".

    Returns:
        str: The response generated by the model.
    """
    response = openai.ChatCompletion.create(model=model_name, messages=prompt)
    return response.choices[0].message.content

**Custom Performance Demonstration**

The following cells demonstrate the performance of the custom query using 3 questions. Each question is answered twice: once with the simple prompt, which has no context, and once with the custom prompt, which includes context.

In [38]:
# Read the DataFrame from CSV file
df = pd.read_csv(CSV_FILEPATH_WITH_EMBEDDINGS)
# Convert embedding values from string to list of floats
df['embedding'] = df['embedding'].apply(lambda value: [float(dim) for dim in value.replace('[', '').replace(']', '').split(',')])
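
The conversion above is needed because a CSV file stores each embedding list as a string. The sketch below reproduces that round trip with the standard library only, assuming the cell holds a bracketed, comma-separated list of finite floats (the form Python's `str(list)` produces, which `json.loads` can also read). `serialize_embedding` and `parse_embedding` are hypothetical helper names, and the vector values are made up.

```python
import csv
import io
import json
from typing import List


def serialize_embedding(embedding: List[float]) -> str:
    """Store the vector as a JSON array so it survives the CSV round trip."""
    return json.dumps(embedding)


def parse_embedding(cell: str) -> List[float]:
    """Turn a stored cell such as '[0.1, -0.2]' back into a list of floats."""
    return [float(value) for value in json.loads(cell)]


# Write one row to an in-memory CSV, then read it back
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embedding"])
writer.writerow(["Match 1 Royal Challengers Bengaluru v Chennai Super Kings (H)",
                 serialize_embedding([0.0125, -0.0024, 0.5])])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored = parse_embedding(rows[0]["embedding"])

print(type(rows[0]["embedding"]).__name__, restored)
```

Parsing with `json.loads` also handles values in scientific notation, such as `1e-05`, without any manual string splitting.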

In [40]:
question_1 = 'Who won the Match 1 of Indian Premier League 2024?'
print('Answer without context : \n' , handle_question(build_simple_prompt(question_1)))
print('Answer with context : \n ', handle_question(build_custom_prompt(question_1, df)))

Answer without context : 
 I'm sorry, but I can't provide future events such as the winner of the first match of the Indian Premier League in 2024. My knowledge is current up to September 2023, and future IPL matches including outcomes are beyond my current information. Please check the latest sports news for the most current outcomes of sports events like the IPL.
Answer with context : 
  Chennai Super Kings won Match 1 of the Indian Premier League 2024.


In [42]:
question_2 = 'Who won the Match 40 of IPL?'
# Print answer without context
print('Answer without Context: \n', handle_question(build_simple_prompt(question_2)))

# Print answer with context
print('\nAnswer with Context: \n', handle_question(build_custom_prompt(question_2, df)))

Answer without Context: 
 I cannot provide real-time information, including the outcome of specific IPL matches such as Match 40, because my last update was in September 2023. For the most accurate and current information on IPL match results, please check the latest sports news on a reliable website or the official IPL platform.

Answer with Context: 
 Delhi Capitals won Match 40 of the IPL by 4 runs.


In [46]:
question_3 = 'Who won the Match 6 Punjab Kings v Royal Challengers Bengaluru of IPL?'
# Print answer without context
print('Answer without Context: \n', handle_question(build_simple_prompt(question_3)))

# Print answer with context
print('\nAnswer with Context: \n', handle_question(build_custom_prompt(question_3, df)))

Answer without Context: 
 As of my last update in September 2023, the specific outcome of Match 6 between Punjab Kings and Royal Challengers Bangalore in the Indian Premier League (IPL) for a given year was not provided. The IPL schedules and match results are updated annually, and outcomes can vary each year. For the most current and accurate information regarding match results, I'd recommend checking the official IPL website or reputable sports news websites.

Answer with Context: 
 Royal Challengers Bengaluru won by 4 wickets.


# **Conclusion**

Context Dependence: Without context, the model declined to answer all three questions because the 2024 season falls after its training cutoff. With the retrieved context, it named a winner each time, which highlights the importance of supplying relevant data in order to get precise answers.

Error Correction Capability: When given no context or inadequate context, the model's responses can be missing or incorrect. Once the necessary contextual information is supplied in the prompt, the same model is able to answer the question.

Data Retrieval Efficiency: Ranking the rows of the custom data set by cosine distance to the question embedding supplied the model with context from which it could answer each of the questions tested.

Significance of Detailed Databases: The accuracy of the responses depends on how detailed and specific the data set is, because the model can only answer from facts that are present in the retrieved context.