# Cricket Data Semantic Search & Conversational Pipeline with Haystack

This notebook demonstrates a full Haystack pipeline for cricket player data, including:
- Data loading and preprocessing
- Vectorization with a local embedding model (`nomic-embed-text`)
- Semantic storage in a vector database (FAISS)
- A conversational UI for querying the data
- Conversation persistence in SQLite

---

## 1. Install Required Packages and Databases

The following commands install the dependencies and create the folder that will hold the FAISS index (`Databases/faiss`). To delete the database, remove that folder.

```bash
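# Install core dependencies. The quotes keep shells such as zsh from expanding the brackets;
# the `inference` extra provides the sentence-transformers backend used by the retriever.
pip install 'farm-haystack[faiss,inference]' openai streamlit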

# (Optional) If you want to use Weaviate or Qdrant instead of FAISS:
# pip install farm-haystack[weaviate]
# pip install farm-haystack[qdrant]

# FAISS is file-based and will store its index in the Databases folder
mkdir -p ../../../../Databases/faiss
```

To delete the FAISS database, run:
```bash
rm -rf ../../../../Databases/faiss
```


In [16]:
# 2. Import Libraries
import os
import sqlite3

import pandas as pd
import streamlit as st
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever, PromptNode, PromptTemplate
from haystack.pipelines import Pipeline

---

## 3. Load and Preprocess Cricket Data

We will load both the stats and personal information CSVs, merge them, and prepare the data for semantic vectorization.

In [17]:
# Load cricket player stats and personal info
data_dir = '../../../../Hobby/Cricket/data'
stats_path = os.path.join(data_dir, 'cricket_player_stats.csv')
personal_path = os.path.join(data_dir, 'cricket_player_personal.csv')

stats_df = pd.read_csv(stats_path)
personal_df = pd.read_csv(personal_path)

# Outer-merge on player_id and name so players present in only one file are kept, then fill missing values
df = pd.merge(stats_df, personal_df, on=['player_id', 'name'], how='outer')
df = df.fillna('N/A')

# Combine all info into a single text field for semantic search
def row_to_text(row):
    return f"Name: {row['name']}. Role: {row['role']}. Team: {row['team']}. Batting Avg: {row['batting_average']}. Bowling Avg: {row['bowling_average']}. DOB: {row['date_of_birth']}. Country: {row['country']}. City: {row['city']}."

df['text'] = df.apply(row_to_text, axis=1)
df.head()

| Unnamed: 0 | player_id | name | role | batting_average | bowling_average | team | date_of_birth | country | city | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Virat Kohli | Batsman | 57.7 | | India | 1988-11-05 | India | Delhi | Name: Virat Kohli. Role: Batsman. Team: India.... |
| 1 | 2 | Steve Smith | Batsman | 59.8 | | Australia | 1989-06-02 | Australia | Sydney | Name: Steve Smith. Role: Batsman. Team: Austra... |
| 2 | 3 | Kane Williamson | Batsman | 54.3 | | New Zealand | 1990-08-08 | New Zealand | Tauranga | Name: Kane Williamson. Role: Batsman. Team: Ne... |
| 3 | 4 | Joe Root | Batsman | 50.2 | | England | 1990-12-30 | England | Sheffield | Name: Joe Root. Role: Batsman. Team: England. ... |
| 4 | 5 | Babar Azam | Batsman | 48.6 | | Pakistan | 1994-10-15 | Pakistan | Lahore | Name: Babar Azam. Role: Batsman. Team: Pakista... |
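
Because `fillna('N/A')` replaces missing values with the literal string `N/A`, that string becomes part of the text that is later embedded. One possible variant, sketched here on plain dictionaries, leaves missing fields out of the text instead. The names `FIELD_LABELS`, `is_missing`, and `record_to_text` are hypothetical helpers and not part of the original notebook.

```python
# Hypothetical helper: build the searchable text while skipping missing fields.
FIELD_LABELS = {
    "name": "Name",
    "role": "Role",
    "team": "Team",
    "batting_average": "Batting Avg",
    "bowling_average": "Bowling Avg",
    "date_of_birth": "DOB",
    "country": "Country",
    "city": "City",
}


def is_missing(value):
    """Treat None, blank strings, the 'N/A' placeholder, and float NaN as missing."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only float unequal to itself
        return True
    return isinstance(value, str) and value.strip() in ("", "N/A")


def record_to_text(record):
    """Join the labelled fields that are present into one sentence-like string."""
    parts = [
        f"{label}: {record[key]}."
        for key, label in FIELD_LABELS.items()
        if not is_missing(record.get(key))
    ]
    return " ".join(parts)


example = {
    "name": "Virat Kohli",
    "role": "Batsman",
    "team": "India",
    "batting_average": 57.7,
    "bowling_average": "N/A",
    "date_of_birth": "1988-11-05",
    "country": "India",
    "city": "Delhi",
}
text = record_to_text(example)
print(text)
```

With a DataFrame, the same helper could be applied row by row, for example `df.apply(lambda row: record_to_text(row.to_dict()), axis=1)`.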


---

## 4. Vectorize Data with a Local Embedding Model

We use a local embedding model (`nomic-embed-text`) to convert each player's text into a semantic vector.

In [18]:
# Set up the local embedding model (e.g., nomic-embed-text)
# NOTE: with model_format="sentence_transformers", Haystack loads the model through the
# sentence-transformers library rather than through a running Ollama server, so this name
# may need to be a Hugging Face model ID or a local path that sentence-transformers can resolve.
ollama_embedding_model = "nomic-embed-text"
retriever = EmbeddingRetriever(
    embedding_model=ollama_embedding_model,
    model_format="sentence_transformers",  # Use sentence-transformers for local models
    use_gpu=True
)

# Prepare Haystack documents: the combined text is the content, identifiers go into metadata
documents = [
    {"content": row['text'], "meta": {"player_id": row['player_id'], "name": row['name']}}
    for _, row in df.iterrows()
]

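ImportError: Failed to import 'transformers.modeling_utils'. Run 'pip install farm-haystack[inference]'. Original error: cannot import name 'SequenceSummary' from 'transformers.modeling_utils' (/Users/mathewthomas/Documents/hobby_projects/AI_ML_Work/567COG_DS_AI_ML/.venv/lib/python3.13/site-packages/transformers/modeling_utils.py)

The recorded run of this cell failed with the import error above. The message suggests that the installed `transformers` release does not provide `SequenceSummary`, which this Haystack version tries to import; installing `farm-haystack[inference]` together with a compatible `transformers` version is the likely fix.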
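
To make the idea of a semantic vector concrete, the sketch below ranks two made-up vectors by cosine similarity to a query vector. The three-dimensional vectors and their labels are invented for illustration; real embeddings come from the model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "batsman profile": [0.9, 0.1, 0.0],
    "bowler profile": [0.1, 0.9, 0.0],
}
query_vector = [0.8, 0.2, 0.0]

# Rank the stored vectors from most to least similar to the query
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query_vector, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```

Texts with similar meaning should map to vectors that point in similar directions, which is why the retriever can find relevant players even when the query shares few exact words with the stored text.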

---

## 5. Set Up Backend Database for Vector Storage

We will use FAISS as the vector database, storing the index in the `Databases/faiss` folder.

In [None]:
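# Set up FAISS document store
# NOTE: this path climbs six levels, while the install step uses four; make sure both point to the same folder
faiss_dir = '../../../../../../Databases/faiss'
os.makedirs(faiss_dir, exist_ok=True)
faiss_index_path = os.path.join(faiss_dir, 'cricket_faiss_index')

# If a saved index exists, load it; otherwise create a new store.
# save() writes the index to faiss_index_path itself, so check that exact path (no '.faiss' suffix).
if os.path.exists(faiss_index_path):
    document_store = FAISSDocumentStore.load(faiss_index_path)
else:
    # embedding_dim must match the embedding model; nomic-embed-text produces 768-dimensional vectors
    document_store = FAISSDocumentStore(embedding_dim=768, faiss_index_factory_str="Flat")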
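
The `"Flat"` factory string requests an exhaustive index: every stored vector is compared with the query, which is exact and fast enough for a small dataset. The toy class below is a rough sketch of that behaviour and of why `embedding_dim` has to match the model's output size. `ToyFlatIndex` is a hypothetical stand-in rather than the FAISS API, and scoring by inner product is an assumption about the configured similarity.

```python
import heapq


class ToyFlatIndex:
    """Hypothetical stand-in for a flat (exhaustive) vector index."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = []

    def _check_dim(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, doc_id, vector):
        """Store a vector; vectors of the wrong size are rejected."""
        self._check_dim(vector)
        self.ids.append(doc_id)
        self.vectors.append(list(vector))

    def search(self, query, top_k=5):
        """Score every stored vector by inner product and return the best top_k."""
        self._check_dim(query)
        scored = (
            (sum(q * v for q, v in zip(query, vec)), doc_id)
            for doc_id, vec in zip(self.ids, self.vectors)
        )
        best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
        return [(doc_id, score) for score, doc_id in best]


index = ToyFlatIndex(dim=3)
index.add("player-1", [1.0, 0.0, 0.0])
index.add("player-2", [0.0, 1.0, 0.0])
hits = index.search([0.9, 0.1, 0.0], top_k=1)
print(hits)
```

Adding a vector whose length differs from `dim` raises a `ValueError` here; a mismatch between the embedding model and the store's configured dimension is the same class of problem.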

---

## 6. Store Vectorized Data in Database

We write the documents to the FAISS document store, compute their embeddings with the retriever, and save the index to disk so it can be reloaded later.

In [None]:
# Write documents and embeddings to FAISS
document_store.write_documents(documents)
document_store.update_embeddings(retriever)
document_store.save(faiss_index_path)

---

## 7. Build Haystack Pipeline for Semantic Search

We build a Haystack `Pipeline` that chains the embedding retriever with a `PromptNode` backed by a local LLM served through Ollama (`llama2` in this example).

In [None]:
# Set up Haystack pipeline with Ollama LLM (e.g., llama2, mistral, etc.)
ollama_llm = "llama2"  # Or another supported LLM in Ollama
ollama_api_url = "http://localhost:11434/v1/completions"  # Ollama listens on port 11434 by default

prompt_node = PromptNode(
    model_name_or_path=ollama_llm,
    # NOTE: not every Haystack release accepts `api_url` here; check the PromptNode signature for your version
    api_url=ollama_api_url,
    max_length=512,
    default_prompt_template=PromptTemplate(
        name="cricket-qa",
        prompt="Answer the user's question about cricket players using the provided context. Context: {join(documents)}. Question: {query}"
    ),
    model_kwargs={"temperature": 0.2},
    stop_words=["\n"]
)

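# The retriever was created before the document store existed, so attach the store before querying
retriever.document_store = document_store

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])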
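
Conceptually, the prompt step takes the retrieved documents, joins their text into one context string, and substitutes it into the template together with the question. The sketch below mimics that step in plain Python. `build_prompt` is a hypothetical helper, and joining with a single space is an assumption about how `join(documents)` behaves.

```python
def build_prompt(query, documents, max_docs=5):
    """Fill the cricket QA template with the text of the top retrieved documents."""
    context = " ".join(doc["content"] for doc in documents[:max_docs])
    return (
        "Answer the user's question about cricket players using the provided context. "
        f"Context: {context}. Question: {query}"
    )


# Stand-ins for retrieved documents, using the same "content" key as the notebook
retrieved = [
    {"content": "Name: Virat Kohli. Role: Batsman. Team: India."},
    {"content": "Name: Joe Root. Role: Batsman. Team: England."},
    {"content": "Name: Babar Azam. Role: Batsman. Team: Pakistan."},
]
prompt = build_prompt("Who plays for India?", retrieved, max_docs=2)
print(prompt)
```

Limiting the number of documents keeps the prompt within the model's context window, which is the same role `top_k` plays for the retriever.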

---

## 8. Create Conversational UI for Data Interaction

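We use Streamlit to create a simple chat interface for asking questions about the cricket data. Streamlit apps run as scripts, so save this cell's code (together with the setup from the earlier cells) to a file and launch it with `streamlit run`. Each question and answer is kept in the session state for display and written to SQLite for persistence.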

In [None]:
# Streamlit UI for conversation
st.title("Cricket Data Conversational QA")

# SQLite for conversation persistence
conv_db_path = '../../../../../../Databases/cricket_conversations.db'
conn = sqlite3.connect(conv_db_path)
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS conversations (id INTEGER PRIMARY KEY, user_input TEXT, response TEXT)''')
conn.commit()

# Chat interface
if 'history' not in st.session_state:
    st.session_state['history'] = []

user_input = st.text_input("Ask a question about cricket players:")
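if st.button("Ask") and user_input:
    # Haystack pipelines take the query and node parameters as keyword arguments
    result = pipeline.run(query=user_input, params={"Retriever": {"top_k": 5}})
    # Without an output parser, PromptNode returns plain strings under the "results" key
    answers = result.get("results", [])
    answer = answers[0] if answers else "No answer found."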
    st.session_state['history'].append((user_input, answer))
    # Persist to DB
    c.execute("INSERT INTO conversations (user_input, response) VALUES (?, ?)", (user_input, answer))
    conn.commit()

# Display conversation history
st.subheader("Conversation History")
for q, a in st.session_state['history']:
    st.markdown(f"**You:** {q}")
    st.markdown(f"**Bot:** {a}")

---

## 9. Persist Conversations Efficiently

Conversations are stored in a lightweight SQLite database (`Databases/cricket_conversations.db`). Each user question and bot response is saved for later retrieval and analysis.
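
As a rough sketch of how the stored conversations could be written and read back, the snippet below uses the same `conversations` table layout with an in-memory database, so it can be run without touching the real file. The function names `open_conversation_db`, `save_turn`, and `load_history` are hypothetical.

```python
import sqlite3


def open_conversation_db(path=":memory:"):
    """Open (or create) the conversation database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conversations ("
        "id INTEGER PRIMARY KEY, user_input TEXT, response TEXT)"
    )
    return conn


def save_turn(conn, user_input, response):
    """Insert one question/answer pair; the context manager commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO conversations (user_input, response) VALUES (?, ?)",
            (user_input, response),
        )


def load_history(conn, limit=50):
    """Return the most recent turns, oldest first."""
    rows = conn.execute(
        "SELECT user_input, response FROM conversations ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return rows[::-1]


conn = open_conversation_db()
save_turn(conn, "Who has the best batting average?", "Steve Smith, at 59.8.")
save_turn(conn, "Where was Joe Root born?", "Sheffield.")
history = load_history(conn)
print(history)
```

Passing the path of `cricket_conversations.db` instead of `:memory:` would read from the persisted file, for example to restore earlier turns when the app starts.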

---

## 10. Notes: Database Installation, Deletion, and Conversation Storage Details

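- **Database Installation:**
    - FAISS is installed via pip (as part of `farm-haystack[faiss]`) and stores its index in the `Databases/faiss` folder.
    - SQLite ships with Python's standard library (`sqlite3`), so it needs no separate installation; conversations are stored in `Databases/cricket_conversations.db`.
- **Database Deletion:**
    - To delete the FAISS index: `rm -rf ../../../../../../Databases/faiss`
    - To delete conversation history: `rm ../../../../../../Databases/cricket_conversations.db`
    - These relative paths match the code cells (six levels up), while the install step uses four levels; adjust them to your folder layout.
    - `FAISSDocumentStore` also keeps document text and metadata in a SQL database (by default `faiss_document_store.db` in the working directory), which should be removed together with the index.
- **Conversation Storage:**
    - Each user question and bot response is stored as one row in the `conversations` table (`id`, `user_input`, `response`).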
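
The deletion steps can also be scripted in Python, which avoids platform-specific shell commands. The sketch below is one way to do it; `reset_storage` is a hypothetical helper, and the demonstration runs against a temporary directory so that no real data is touched.

```python
import shutil
import tempfile
from pathlib import Path


def reset_storage(faiss_dir, conversations_db):
    """Delete the FAISS folder and the conversation database if they exist.

    Returns the list of paths that were actually removed.
    """
    removed = []
    faiss_path = Path(faiss_dir)
    if faiss_path.is_dir():
        shutil.rmtree(faiss_path)
        removed.append(faiss_path)
    db_path = Path(conversations_db)
    if db_path.is_file():
        db_path.unlink()
        removed.append(db_path)
    return removed


# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    demo_faiss = Path(tmp) / "faiss"
    demo_faiss.mkdir()
    (demo_faiss / "cricket_faiss_index").write_bytes(b"")
    demo_db = Path(tmp) / "cricket_conversations.db"
    demo_db.write_bytes(b"")

    first_pass = reset_storage(demo_faiss, demo_db)
    second_pass = reset_storage(demo_faiss, demo_db)  # nothing left to delete

print(len(first_pass), len(second_pass))
```

Because the helper checks for existence first, calling it when the files are already gone is harmless.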

---

You now have a full Haystack pipeline for semantic search and conversational QA over cricket data, with persistent and manageable storage.