In [None]:
# Copyright 2024 Rittman Analytics ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Profit & Loss Report QA Chatbot using LLMs, LangChain and BigQuery Vector Store

This notebook implements a question-answering system that uses Large Language Models (LLMs), LangChain, and a vector store in BigQuery to analyze profit and loss (P&L) data. Users ask natural language questions about the P&L data and receive answers drawn either from pre-created analysis or from SQL queries run against BigQuery.

### System Architecture

The system consists of several key components:

1. **Data Source (BigQuery):** The P&L data resides in BigQuery, organized into tables like `pl_reports`, `profit_and_loss_report_account_group`, etc. Pre-computed analysis is stored in `pl_reports_vector_storage`.

2. **Vector Store (BigQuery):** A BigQueryVectorStore is used to store embeddings of pre-created P&L analysis texts. This allows for efficient similarity search to quickly find relevant context for user questions. The table `pl_reports_vector_storage` holds these embeddings. A separate vector store, `successful_qa_pairs`, is used to store successful question-answer pairs for learning and improving the system's performance.

3. **LLM (OpenAI's GPT-4):** The core intelligence is provided by OpenAI's GPT-4, acting as the question answering engine and providing natural language processing capabilities.

4. **LangChain:** LangChain orchestrates the interaction between the different components. It manages the agent, memory, chains, and toolkits.

5. **SQL Agent (LangChain):** A SQL agent queries BigQuery directly when the vector store does not contain sufficient information to answer a question.

6. **Embedding Model (Vertex AI):** `textembedding-gecko@latest` from Google Vertex AI creates vector embeddings of text data, enabling semantic search within the vector store.
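
To illustrate what the embedding-based similarity search in components 2 and 6 does conceptually, here is a minimal sketch that ranks texts by cosine similarity. The three-dimensional vectors, the document labels, and the helper names `cosine_similarity` and `top_k` are made up for illustration; in the notebook the embeddings come from the Vertex AI model and the search itself runs inside BigQuery.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=1):
    """Return the labels of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [label for label, _ in ranked[:k]]

# Toy "embeddings" (hypothetical values, not real model output)
toy_docs = [
    ("overheads trend analysis", [0.9, 0.1, 0.0]),
    ("recurring transactions", [0.1, 0.9, 0.1]),
    ("gross margin summary", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # stands in for an embedded question about overheads

best = top_k(toy_query, toy_docs, k=1)
print(best)
```

The document whose vector points in the most similar direction to the query vector is returned first, which is the same idea the vector store applies to real embeddings.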

### Question Answering Workflow

The process of answering a question follows these steps:

1. **Question Analysis:** The user enters a natural language question.

2. **Vector Store Query (First Attempt):** The system determines whether the question can be answered using the pre-created analysis stored in the `pl_reports_vector_storage` vector store. This is determined by a prompt sent to the LLM, evaluating if the question aligns with the types of analyses performed and the available time range. If deemed suitable, a similarity search is performed to retrieve the most relevant document(s).

3. **Pre-created Answer Summarization:** If relevant documents are found, their content is extracted, stripped of HTML tags, and summarized using the LLM to focus only on information relevant to the question. This avoids providing irrelevant information from the original analysis.

4. **SQL Query (Fallback):** If the vector store query does not yield satisfactory results or is deemed unsuitable, the system falls back to the LangChain SQL agent. In this path, helper functions and the LLM-driven agent together:
- Analyze the question to identify relevant financial terms and entities.
- Translate these terms into a suitable SQL `WHERE` clause to filter the data.
- Construct a SQL query to the appropriate BigQuery view (`profit_and_loss_report_account_group_xa`, `profit_and_loss_report_sub_categories_xa`, or `profit_and_loss_report_categories_xa` depending on the question context). The view selection is determined by analysis of the question.
- Execute the query in BigQuery and format the results into a readable answer.

5. **Answer Relevance Evaluation:** The LLM evaluates the generated answer (whether from the vector store or the SQL query) for relevance to the original question and returns a relevance score with a brief explanation.

6. **Feedback and Learning:** The user provides feedback on the answer. This feedback is used to improve the question for subsequent iterations, aiming to refine the answer. Successful question-answer pairs are stored in the `successful_qa_pairs` vector store, improving future responses.

7. **Iterative Refinement (Optional):** The system allows for iterative refinement based on user feedback. The LLM is used to reformulate the question in response to feedback, repeating the process for `max_iterations` (default 3).
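
The routing in steps 2 to 4 can be sketched as a small function that tries the pre-created analysis first and falls back to SQL otherwise. This is a simplified illustration, not the notebook's actual `ask_question`: the function name `answer_with_fallback` and its callable parameters are hypothetical stand-ins for the LLM, vector store, and SQL agent calls, so the control flow can be followed without any cloud services.

```python
from typing import Callable, List

def answer_with_fallback(
    question: str,
    should_use_vector_store: Callable[[str], bool],
    search_vector_store: Callable[[str], List[str]],
    summarize: Callable[[str, str], str],
    run_sql_agent: Callable[[str], str],
) -> str:
    """Try the pre-created analysis first; otherwise fall back to the SQL agent."""
    if should_use_vector_store(question):
        docs = search_vector_store(question)
        if docs:
            # A relevant document was found, so summarize it for this question
            return "Based on the pre-created analysis: " + summarize(question, docs[0])
    # Either the vector store was unsuitable or it returned nothing relevant
    return run_sql_agent(question)

# Stubbed usage: the vector store is consulted but returns no documents
answer = answer_with_fallback(
    "What were overheads in March 2024?",
    should_use_vector_store=lambda q: True,
    search_vector_store=lambda q: [],
    summarize=lambda q, text: text,
    run_sql_agent=lambda q: "answer from SQL agent",
)
print(answer)
```

Tracking the fallback through control flow rather than through the wording of an intermediate message keeps the routing independent of how status strings are phrased.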

### Key Functions

* **`ask_question(question, context="")`:** The main function for answering questions. It orchestrates the vector store search and SQL query processes.
* **`should_query_vector_store(question)`:** Determines if a vector store search is likely to provide a relevant answer.
* **`summarize_content(question, content)`:** Summarizes the retrieved vector store content to focus only on relevant information.
* **`find_matching_values(question, lookups)`:** Matches terms in the question against lookup tables of account groups, sub-categories, and categories.
* **`construct_filter_clause(matches)`:** Creates the SQL `WHERE` clause from the matched entities. Time-period filters are built separately by `construct_date_filter(time_periods)`.
* **`determine_view(matches)`:** Selects the appropriate BigQuery view based on the most granular level matched (account group, then sub-category, then category).
* **`extract_time_periods(question)`:** Extracts time periods from the question for filtering.
* **`get_similar_qa(question)`:** Retrieves similar Q&A pairs from the `successful_qa_pairs` vector store for context.
* **`ask_question_with_feedback_and_learning(question)`:** Handles user feedback and iterative refinement.
* **`store_successful_qa(question, answer)`:** Stores successful Q&A pairs in the `successful_qa_pairs` vector store.
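
As an illustration of how a time period mentioned in a question can become a SQL date filter, the sketch below turns a quarter reference such as "Q2 2024" into a `BETWEEN` clause. It is a standalone simplification using only the standard library; the helper names `quarter_bounds` and `quarter_filter` are hypothetical, and the `date_month` column name is taken from the notebook's queries.

```python
import calendar
import re
from datetime import date

def quarter_bounds(text):
    """Return (first_day, last_day) for the first 'Qn YYYY' found, or None."""
    match = re.search(r'\bQ([1-4])\s+(\d{4})\b', text)
    if not match:
        return None
    quarter, year = int(match.group(1)), int(match.group(2))
    start_month = (quarter - 1) * 3 + 1
    end_month = start_month + 2
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, end_month)[1]
    return date(year, start_month, 1), date(year, end_month, last_day)

def quarter_filter(text, column="date_month"):
    """Build a SQL BETWEEN clause covering the quarter mentioned in the text."""
    bounds = quarter_bounds(text)
    if bounds is None:
        return ""
    start, end = bounds
    return f"{column} BETWEEN DATE('{start.isoformat()}') AND DATE('{end.isoformat()}')"

print(quarter_filter("What was revenue in Q2 2024?"))
```

Because the bounds are parsed from a constrained regular expression rather than copied from free text, the resulting clause contains only digits and dashes in its date literals.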


### Setup and Dependencies

The notebook requires several Python packages, which are installed by the `pip install` cell that follows this section. Install them before running the rest of the code. You will also need to configure:

* A service account key file (`service_account_file`) with access to your BigQuery project.
* An OpenAI API key (`OPENAI_API_KEY`).
* The correct BigQuery project ID, dataset name, and table names.
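
One possible way to keep these settings out of the notebook itself is to read them from environment variables and fail early when any are missing. The sketch below is an assumption, not part of the original notebook: the `PL_*` variable names and the `build_settings` helper are hypothetical, while the SQLAlchemy URL format mirrors the one used in the configuration cell.

```python
import os

# Hypothetical variable names, chosen for this illustration only
REQUIRED = ("PL_SERVICE_ACCOUNT_FILE", "PL_PROJECT", "PL_DATASET", "OPENAI_API_KEY")

def build_settings(env=None):
    """Collect required settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    project = env["PL_PROJECT"]
    dataset = env["PL_DATASET"]
    key_file = env["PL_SERVICE_ACCOUNT_FILE"]
    return {
        "project": project,
        "dataset": dataset,
        "service_account_file": key_file,
        # Same URL shape as the notebook's sqlalchemy_url
        "sqlalchemy_url": f"bigquery://{project}/{dataset}?credentials_path={key_file}",
    }

# Example with an explicit mapping instead of the real environment
example = build_settings({
    "PL_SERVICE_ACCOUNT_FILE": "/path/to/key.json",
    "PL_PROJECT": "my-project",
    "PL_DATASET": "my_dataset",
    "OPENAI_API_KEY": "placeholder",
})
print(example["sqlalchemy_url"])
```

Validating up front gives a single clear error instead of a failure partway through client or agent initialization.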

In [None]:
pip install bs4 langchain langchain-community langchain-google-community langchain_google_vertexai langchain-openai openai chromadb tiktoken tabulate sqlalchemy sqlalchemy-bigquery google-cloud-bigquery

In [20]:
import os
from google.cloud import bigquery
from google.api_core import exceptions
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain, LLMChain
from langchain.prompts import PromptTemplate
import pandas as pd
from collections import defaultdict
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_community import BigQueryVectorStore
import re
import uuid
from datetime import datetime
from bs4 import BeautifulSoup
from dateutil.relativedelta import relativedelta
from langchain.docstore.document import Document
from dateutil import parser
import calendar



In [None]:
service_account_file = "/content/ra-development-d027d9a2dd60.json"
project = "ra-development"
dataset = "analytics_finance_demo"
location = "europe-west2"
sqlalchemy_url = f'bigquery://{project}/{dataset}?credentials_path={service_account_file}'
os.environ["OPENAI_API_KEY"] = ""

vector_store_content_description = """
The vector store contains pre-created analysis with the following information:
1. Key Metrics
   - Revenue, Overheads, Cost of Delivery, Gross & Net Profit, Retained Earnings net amounts
   - Account Sub-Category and Account Group net amounts
2. Key Metrics Calculation:
   - Month-over-month, year-to-date, and budget variance calculations
   - Gross margin percentages
3. Significant Transaction Identification:
   - Transactions exceeding a certain percentage threshold of the total account group amount
   - Cancelling transactions are identified and excluded
   - Context for each significant transaction (new or changed from previous month)
4. Overhead Trend Analysis:
   - Monthly growth rates for overhead categories over the last 6 months
   - Account groups with significant average monthly growth (> 10%)
5. Identification of New Repeating (Recurring) Transactions:
   - Transactions with the same description appearing consistently over the last 3 months

The analysis covers the last three months and the current year-to-date at summary level for category, subcategory, and account group levels.
"""

# Create a BigQuery client
client = bigquery.Client.from_service_account_json(service_account_file)

# Initialize embedding model
embedding_model = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest",
    project=project
)

# Initialize BigQueryVectorStore containing P&L report analysis texts
vector_store = BigQueryVectorStore(
    project_id=project,
    dataset_name=dataset,
    table_name="pl_reports_vector_storage",
    location=location,
    embedding=embedding_model,
)

def create_successful_qa_table():
    table_id = f"{project}.{dataset}.successful_qa_pairs"  # Correct table ID

    schema = [
        bigquery.SchemaField("content", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("embedding", "FLOAT64", mode="REPEATED"),
        bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    ]

    table = bigquery.Table(table_id, schema=schema)
    table.clustering_fields = ["id"]

    try:
        table = client.create_table(table)  # Use the client from outside
        print(f"Created table {table.project}.{table.dataset_id}.{table.table_id}")
    except exceptions.Conflict:
        print(f"Table {table_id} already exists.")
    except Exception as e:
        print(f"An error occurred while creating the table: {str(e)}")

# Initialize a new vector store for storing successful Q&A pairs
def initialize_qa_vector_store():
    create_successful_qa_table()  # Ensure the table exists with the correct schema
    return BigQueryVectorStore(
        project_id=project,
        dataset_name=dataset,
        table_name="successful_qa_pairs",
        location=location,
        embedding=embedding_model,
        text_column="content",
        embedding_column="embedding",
        id_column="id"
    )

qa_vector_store = initialize_qa_vector_store()

def load_vector_storage():
    query = f"""
    SELECT date_month as month, COALESCE(report_analysis, '') || COALESCE(invoice_analysis, '') || COALESCE(recurring_payments_analysis, '') as content
    FROM `{project}.{dataset}.pl_reports`
    """
    df = client.query(query).to_dataframe()

    for _, row in df.iterrows():
        month = row['month']
        text = row['content']
        metadata = {'month': month}  # Removed analysis_type as it's no longer needed
        vector_store.add_texts([text], metadatas=[metadata])

    print("Vector storage loaded successfully.")

def get_available_months():
    query = f"""
    SELECT DISTINCT month as month
    FROM `{project}.{dataset}.pl_reports_vector_storage`
    ORDER BY month DESC
    """
    df = client.query(query).to_dataframe()
    return df['month'].tolist()

# Determine the valid time range covered by the pre-created analysis
def get_valid_time_range(available_months):
    if not available_months:
        return None, None

    latest_month = max(available_months)
    earliest_month = min(available_months)

    # Calculate the start of the year for the latest month
    year_start = latest_month.replace(month=1, day=1)

    # The valid range includes two months before the earliest available month
    valid_start = (earliest_month - relativedelta(months=2)).replace(day=1)

    return valid_start, latest_month

def create_lookup_tables():
    query = f"""
    SELECT DISTINCT
        account_report_group,
        account_report_sub_category,
        account_category
    FROM `{project}.{dataset}.profit_and_loss_report_account_group`
    """
    df = client.query(query).to_dataframe()

    lookups = {
        'group': defaultdict(list),
        'sub_category': defaultdict(list),
        'category': defaultdict(list)
    }

    for _, row in df.iterrows():
        group = row['account_report_group']
        sub_category = row['account_report_sub_category']
        category = row['account_category']

        lookups['group'][group.lower()].append(group)
        lookups['sub_category'][sub_category.lower()].append(sub_category)
        lookups['category'][category.lower()].append(category)

    return lookups

lookups = create_lookup_tables()

memory = ConversationBufferMemory()
db = SQLDatabase.from_uri(sqlalchemy_url)
llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)
conversation_chain = ConversationChain(
    llm=llm,
    memory=memory
)
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True,
    top_k=1000,
)

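def find_matching_values(question, lookups):
    """Return lookup values whose lower-cased name appears in the question."""
    question_lower = question.lower()
    matches = {
        'group': set(),
        'sub_category': set(),
        'category': set()
    }

    for key in lookups:
        for name, originals in lookups[key].items():
            # Match whole phrases so multi-word names (e.g. "cost of delivery") are found,
            # which a word-by-word comparison would miss
            if re.search(rf"(?<!\w){re.escape(name)}(?!\w)", question_lower):
                matches[key].update(originals)

    return matches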

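def construct_filter_clause(matches):
    # Column names as selected in create_lookup_tables(); this assumes the *_xa views
    # expose the same columns as profit_and_loss_report_account_group
    columns = {
        'group': "account_report_group",
        'sub_category': "account_report_sub_category",
        'category': "account_category",
    }
    clauses = []
    for key, values in matches.items():
        if values:
            # Escape embedded single quotes so the SQL string literals stay valid
            quoted_values = ["'{0}'".format(v.replace("'", "\\'")) for v in sorted(values)]
            clause = "{0} IN ({1})".format(columns[key], ", ".join(quoted_values))
            clauses.append(clause)

    return " AND ".join(clauses) if clauses else ""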

def determine_view(matches):
    if matches['group']:
        return "profit_and_loss_report_account_group_xa"
    elif matches['sub_category']:
        return "profit_and_loss_report_sub_categories_xa"
    else:
        return "profit_and_loss_report_categories_xa"

def extract_time_periods(question):
    """Extract time periods mentioned in the question."""
    # Patterns for various time formats
    year_pattern = r'\b(20\d{2})\b'
    month_year_pattern = r'\b(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+(\d{4})\b'
    quarter_pattern = r'\b(Q[1-4])\s+(\d{4})\b'

    time_periods = []

    # Extract years
    years = re.findall(year_pattern, question)
    for year in years:
        time_periods.append(('year', int(year)))

    # Extract month-year combinations
    month_years = re.findall(month_year_pattern, question, re.IGNORECASE)
    for month, year in month_years:
        date = parser.parse(f"{month} {year}")
        time_periods.append(('month', date))

    # Extract quarters
    quarters = re.findall(quarter_pattern, question)
    for quarter, year in quarters:
        quarter_num = int(quarter[1])
        start_month = (quarter_num - 1) * 3 + 1
        time_periods.append(('quarter', parser.parse(f"{year}-{start_month:02d}-01")))

    return time_periods

def strip_html(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    return soup.get_text()

def summarize_content(question, content):
    prompt = PromptTemplate(
        input_variables=["question", "content"],
        template="""Given the following question and content, provide a concise summary of the content that is directly relevant to answering the question.
        Ignore any information that doesn't pertain to the question. All amounts should be stated in GBP (£).

        Question: {question}

        Content: {content}

        Relevant Summary:"""
    )

    chain = LLMChain(llm=llm, prompt=prompt)
    summary = chain.run(question=question, content=content)
    return summary.strip()

def should_query_vector_store(question):
    prompt = PromptTemplate(
        input_variables=["question", "content_description", "valid_time_range"],
        template="""Given the following question and description of the content in a vector store,
        determine if the vector store is likely to contain information that can answer the question.
        Consider the following:
        1. Does the question ask about any of the key metrics or analyses mentioned in the content description?
        2. Does the question fall within the time frame covered by the vector store? Valid time range: {valid_time_range}
        3. Is the level of detail requested (category, subcategory, account group) available in the vector store?

        Respond with 'Yes' if the vector store is likely to contain relevant information, or 'No' if it's unlikely or unclear.

        Question: {question}

        Vector Store Content Description:
        {content_description}

        Decision (Yes/No):
        Explanation:"""
    )

    chain = LLMChain(llm=llm, prompt=prompt)
    response = chain.run(question=question,
                         content_description=vector_store_content_description,
                         valid_time_range=f"{valid_time_range[0].strftime('%B %Y')} to {valid_time_range[1].strftime('%B %Y')}")


    # Extract the decision from the response
    decision_match = re.search(r'Decision \(Yes/No\):\s*(Yes|No)', response, re.IGNORECASE)
    if decision_match:
        decision = decision_match.group(1).lower()
        return decision == 'yes'
    else:
        # If no clear decision is found in the expected format, look for a 'Yes' at the beginning of the response
        if response.strip().lower().startswith('yes'):
            return True
        else:
            return False

def evaluate_answer_relevance(question, answer):
    prompt = PromptTemplate(
        input_variables=["question", "answer"],
        template="""Given the following question and answer, evaluate how well the answer addresses the question.
        Provide a relevance score as a percentage and a brief explanation.

        Question: {question}

        Answer: {answer}

        Relevance Score (0-100%):
        Explanation:"""
    )

    chain = LLMChain(llm=llm, prompt=prompt)
    response = chain.run(question=question, answer=answer)
    return response

def ask_question(question, context=""):
    time_periods = extract_time_periods(question)

    # Fall back to the SQL agent unless the pre-created analysis yields an answer
    use_sql = True

    if should_query_vector_store(question):
        month = time_periods[0][1] if time_periods and time_periods[0][0] in ['month', 'quarter'] else None

        if month is None:
            # No specific month requested, so search without a metadata filter
            relevant_docs = vector_store.similarity_search(question, k=1)
        else:
            filter_dict = {"month": month.strftime('%Y-%m-%d')}
            relevant_docs = vector_store.similarity_search(question, k=1, filter=filter_dict)

        if relevant_docs:
            pre_created_answer = relevant_docs[0].page_content
            stripped_answer = strip_html(pre_created_answer)
            summarized_answer = summarize_content(question, stripped_answer)
            answer = f"Based on the pre-created analysis: {summarized_answer}"
            use_sql = False

    if use_sql:
        matches = find_matching_values(question, lookups)
        filter_clause = construct_filter_clause(matches)
        view_name = determine_view(matches)

        date_filter = construct_date_filter(time_periods)

        if date_filter:
            if filter_clause: # Add AND only if there's an existing filter clause
                filter_clause = f"{filter_clause} AND {date_filter}"
            else:
                filter_clause = date_filter

        instruction = f"""You are a knowledgeable finance data analyst working for Rittman Analytics.
        Use the `{project}.{dataset}.{view_name}` view to answer this question.
        Use the following SQL filter clause in your query: {filter_clause}
        When calculating revenue or any other financial metrics, make sure to aggregate the values for the entire time period(s) mentioned in the question (which may be months, quarters, or years).
        If no specific time period is mentioned, provide an overview of all available data.
        If multiple time periods are mentioned, provide a comparison between them.
        Please construct and execute a SQL query to answer the question, making sure to include the filter clause.
        Do not include markdown-style triple backticks in the SQL you generate and try to use or validate.

        {context}

        Question is: {question}
        """

        answer = agent_executor.run(instruction)

    relevance_evaluation = evaluate_answer_relevance(question, answer)

    return f"{answer}\n\nRelevance Evaluation:\n{relevance_evaluation}"

def construct_date_filter(time_periods):
    """Construct a SQL date filter based on extracted time periods using AND for multiple periods."""
    filters = []
    for period_type, date in time_periods:
        if period_type == 'year':
            filters.append(f"EXTRACT(YEAR FROM date_month) = {date}")
        elif period_type == 'month':
            filters.append(f"EXTRACT(MONTH FROM date_month) = {date.month} AND EXTRACT(YEAR FROM date_month) = {date.year}")
        elif period_type == 'quarter':
            quarter_start = date
            # day=31 is clamped by relativedelta to the last day of the quarter's final month
            quarter_end = date + relativedelta(months=2, day=31)
            filters.append(f"date_month BETWEEN DATE('{quarter_start.strftime('%Y-%m-%d')}') AND DATE('{quarter_end.strftime('%Y-%m-%d')}')")

    return " AND ".join(filters) if filters else ""

def extract_dates_from_question(question):
    """Extract all years mentioned in the question."""
    year_pattern = r'\b(20\d{2})\b'
    return list(set(re.findall(year_pattern, question)))

def get_similar_qa(question: str, k: int = 3):
    """Retrieve similar Q&A pairs from the vector store."""
    print(f"Searching for similar Q&A pairs to: {question}")
    similar_qa = qa_vector_store.similarity_search(question, k=k)
    print(f"Retrieved {len(similar_qa)} similar Q&A pairs")

    converted_results = []
    for i, item in enumerate(similar_qa):
        if isinstance(item, Document):
            converted_results.append(item)
            print(f"Document {i+1} (already Document):")
            print(f"  Page content: {item.page_content[:100]}...")
            print(f"  Metadata: {item.metadata}")
        else:
            # Handle cases where the item might be a dict or have a different structure
            text = item.get('text', item.get('page_content', ''))
            metadata = {key: value for key, value in item.items() if key not in ['text', 'page_content']}
            doc = Document(page_content=text, metadata=metadata)
            converted_results.append(doc)
            print(f"Document {i+1} (converted to Document):")
            print(f"  Page content: {text[:100]}...")
            print(f"  Metadata: {metadata}")

    return converted_results

def format_similar_qa(similar_qa):
    """Format similar Q&A pairs for inclusion in the prompt."""
    formatted = "Similar Q&A pairs from past interactions:\n\n"
    for i, doc in enumerate(similar_qa, 1):
        formatted += f"{i}. {doc.page_content}\n\n"
    return formatted

def ask_question_with_feedback_and_learning(question: str, max_iterations: int = 3) -> str:
    iteration = 0
    while iteration < max_iterations:
        # Retrieve similar Q&A pairs
        similar_qa = get_similar_qa(question)

        # Prepare the context with similar Q&A pairs
        context = format_similar_qa(similar_qa)

        # Use the modified ask_question function with context
        answer = ask_question(question, context)
        print(f"\nAnswer (Iteration {iteration + 1}):\n{answer}")

        user_satisfied = input("\nDid this answer your question sufficiently? (yes/no): ").lower().strip()

        if user_satisfied == 'yes':
            # Store the successful Q&A pair
            store_successful_qa(question, answer)
            return answer

        feedback = input("Please provide feedback on how the answer could be improved: ")

        # Use LLM to analyze feedback and improve the question
        improve_prompt = PromptTemplate(
            input_variables=["original_question", "answer", "feedback", "context"],
            template="""Given the original question, the provided answer, user feedback, and similar Q&A pairs from past interactions,
            please suggest an improved version of the question that addresses the user's concerns.

            Original Question: {original_question}

            Provided Answer: {answer}

            User Feedback: {feedback}

            {context}

            Improved Question:"""
        )

        improve_chain = LLMChain(llm=llm, prompt=improve_prompt)
        improved_question = improve_chain.run(
            original_question=question,
            answer=answer,
            feedback=feedback,
            context=context
        )

        print(f"\nImproved question based on feedback: {improved_question}")

        question = improved_question  # Update the question for the next iteration
        iteration += 1

    return "I apologize, but I couldn't provide a satisfactory answer within the maximum number of iterations. Please try rephrasing your question or contact support for further assistance."


def store_successful_qa(question: str, answer: str):
    """Store a successful question-answer pair in the vector store."""
    qa_pair = f"Q: {question}\nA: {answer}"

    # Generate embedding for the QA pair
    embedded_vector = embedding_model.embed_query(qa_pair)

    # Prepare the row to be inserted
    row = {
        'content': qa_pair,
        'embedding': embedded_vector,
        'id': str(uuid.uuid4())  # Generate a unique ID
    }

    # Insert the row into BigQuery using the module-level client, which is
    # authenticated with the service account file
    table_id = f"{project}.{dataset}.successful_qa_pairs"
    errors = client.insert_rows_json(table_id, [row])

    if not errors:
        print("Successful Q&A pair stored in vector store.")
    else:
        print(f"Errors occurred while storing Q&A pair: {errors}")



def main(reload_vector_storage=False):
    global valid_time_range, vector_store_content_description

    if reload_vector_storage:
        load_vector_storage()

    available_months = get_available_months()
    valid_time_range = get_valid_time_range(available_months)

    # Update the vector_store_content_description with the actual time range
    vector_store_content_description += f"""
    The analysis covers the period from {valid_time_range[0].strftime('%B %Y')} to {valid_time_range[1].strftime('%B %Y')}.
    For each month in this range, the analysis includes data for that month, comparisons to the two previous months, and year-to-date figures.
    """

    print("Hi! Ask me a question about our company's profit and loss data")
    while True:
        question = input("\nYour question (or type 'QUIT' to exit): ")
        if question.lower() == 'quit':
            break

        final_answer = ask_question_with_feedback_and_learning(question)
        print(f"\nFinal Answer: {final_answer}")
        print("\n---")

if __name__ == "__main__":
    main(reload_vector_storage=False)  # Set to True to reload vector storage