**Building a Data Democratization Chatbot That Answers Data Science Questions in 100 Words or Fewer**

Install fuzzywuzzy, python-Levenshtein, and huggingface_hub

In [None]:
!pip install huggingface_hub fuzzywuzzy python-Levenshtein

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-Levenshtein
  Downloading python_levenshtein-0.27.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.27.1 (from python-Levenshtein)
  Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.27.1->python-Levenshtein)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Downloading python_levenshtein-0.27.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)


In [None]:
from fuzzywuzzy import fuzz
from huggingface_hub import InferenceClient
from google.colab import userdata

# Initialize the Hugging Face client with the HF_TOKEN secret stored in Colab
client = InferenceClient(api_key=userdata.get('HF_TOKEN'))  # Add your own token under Colab's Secrets panel

# Knowledge base as a JSON-like dictionary
knowledge_base = {
    "What is credit risk?": "Credit risk is when someone might not pay back a loan, so the bank could lose money. It’s like lending your toy and not getting it back. Banks use math to guess if someone is risky.",
    "What is data democratization?": "Data democratization means everyone can use data to make choices; it's like turning a complicated chemistry lab into a fun, safe kitchen where everyone can cook. It helps people understand numbers without being experts.",
    "What is an algorithm?": "An algorithm is like a recipe or step-by-step instructions for solving a problem! Just like following steps to bake cookies, computers follow algorithms to complete tasks."
}

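# Topic keywords. Every knowledge-base question must contain at least one of these,
# otherwise is_on_topic() rejects it before the lookup runs (hence "algorithm").
relevant_keywords = [
    "data", "science", "analytics", "risk", "credit", "model", "modeling",
    "geospatial", "data democratization", "predictive", "machine learning", "statistics",
    "lake", "llm", "large language model", "fraud", "forecasting", "financial",
    "algorithm"
]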


def is_on_topic(user_input):
    """Check if the input contains relevant keywords."""
    user_input_lower = user_input.lower()
    return any(keyword in user_input_lower for keyword in relevant_keywords)


def find_closest_question(user_input):
    """Find the closest matching question in the knowledge base."""
    best_match = None
    highest_score = 0
    for question in knowledge_base:
        score = fuzz.partial_ratio(user_input.lower(), question.lower())
        if score > highest_score and score > 85:  # Threshold for match
            highest_score = score
            best_match = question
    return best_match


def get_llm_answer(user_input):
    """Get answer from Hugging Face API with child-friendly prompt."""
    prompt = f"Explain {user_input} in 100 words or less, like you're talking to a 10-year-old, in the context of data science or analytics. Keep it simple and clear."
    try:
        response = client.chat_completion(
            messages=[{"role": "user", "content": prompt}],
            model="mistralai/Mixtral-8x7B-Instruct-v0.1",
            max_tokens=150,  # ~100 words
            temperature=0.7
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Sorry, I couldn't get an answer right now: {str(e)}"
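
Because `is_on_topic` uses plain substring matching, a keyword such as "lake" also matches inside "snowflakes", and "risk" matches inside "brisk". The following is a minimal sketch of a stricter whole-word check using the standard `re` module. The name `is_on_topic_strict` and the shortened `KEYWORDS` list are assumptions made for illustration; they are not part of the notebook.

```python
import re

# Hypothetical subset of the notebook's keyword list, for illustration only.
KEYWORDS = ["data", "risk", "lake", "llm", "machine learning"]

# One alternation pattern; \b anchors each keyword at word boundaries.
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_on_topic_strict(user_input: str) -> bool:
    """Return True only if a keyword appears as a whole word or phrase."""
    return _KEYWORD_RE.search(user_input) is not None


print(is_on_topic_strict("What is a data lake?"))      # True
print(is_on_topic_strict("Why do snowflakes melt?"))   # False: "lake" is only part of a word
print(is_on_topic_strict("Is a brisk walk healthy?"))  # False: "risk" is only part of a word
```

The trade-off is that whole-word matching no longer catches inflected forms such as "risks" or "datasets", so those would need to be listed explicitly.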




In [None]:
# Test with single input
user_input = "What is data science?"  # Try different questions
if not is_on_topic(user_input):
    print("Answer: This is out of my domain.")
else:
    closest_question = find_closest_question(user_input)
    if closest_question:
        print("Answer:", knowledge_base[closest_question])
    else:
        print("Answer:", get_llm_answer(user_input))

Answer: Data science is like being a detective with data! You use special tools and skills to find hidden treasures in information. By cleaning, organizing, and exploring data, you can help answer big questions and even make predictions about the future. It's like solving puzzles to tell stories and make the world a better place.


In [None]:
# Test with single input
user_input = "What is data lake?"  # Try different questions
if not is_on_topic(user_input):
    print("Answer: This is out of my domain.")
else:
    closest_question = find_closest_question(user_input)
    if closest_question:
        print("Answer:", knowledge_base[closest_question])
    else:
        print("Answer:", get_llm_answer(user_input))

Answer: A data lake is like a big, super-organized swimming pool for information. It stores all kinds of data, like numbers, words, and pictures, in one place. Just like how a lake is filled with water from many different sources, a data lake collects data from various places, like websites, apps, and machines. This way, it's easier for people to find, use, and learn from the data to make better decisions, just like how we can use water from a lake for many things, like drinking, swimming, and irrigating fields.
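
Note that `max_tokens=150` caps tokens rather than words, and the prompt only asks the model to stay under 100 words, so neither guarantees the limit. A possible post-processing step, sketched here under the assumption that trimming at the last complete sentence is acceptable, could enforce it. The helper `limit_words` is hypothetical and not part of the notebook.

```python
import re


def limit_words(text: str, max_words: int = 100) -> str:
    """Trim text to at most max_words, preferring to end on a full sentence."""
    words = text.split()
    if len(words) <= max_words:
        return text.strip()
    clipped = " ".join(words[:max_words])
    # Locate sentence-ending punctuation that is followed by a space or the end.
    endings = list(re.finditer(r"[.!?](?=\s|$)", clipped))
    if endings:
        return clipped[: endings[-1].end()]
    # No complete sentence fits: mark the cut explicitly.
    return clipped + "..."


sample = "One two three. Four five six seven. Eight nine ten."
print(limit_words(sample, max_words=5))   # One two three.
print(limit_words(sample, max_words=20))  # unchanged, already short enough
```

It could be applied to the return value of `get_llm_answer` before printing.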


In [None]:
# Test with single input
user_input = "What is LLM?"  # Try different questions
if not is_on_topic(user_input):
    print("Answer: This is out of my domain.")
else:
    closest_question = find_closest_question(user_input)
    if closest_question:
        print("Answer:", knowledge_base[closest_question])
    else:
        print("Answer:", get_llm_answer(user_input))

Answer: An LLM is like a master's degree in a specific area of law. In the context of data science or analytics, an LLM in Law and Technology or Intellectual Property could help you understand the legal rules around using data and creating new technology. This can be important for protecting your ideas and making sure you're following the law when working with data. It's like having a special toolbox for handling legal issues in data science!
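
The answer above reads "LLM" as the law degree rather than "large language model", because the prompt passes the bare acronym to the model. One possible mitigation, sketched here as an assumption rather than as part of the notebook, is to spell out known acronyms before building the prompt. The `ACRONYMS` glossary and `expand_acronyms` helper are hypothetical.

```python
import re

# Hypothetical glossary; extend it with the acronyms users actually type.
ACRONYMS = {
    "llm": "large language model (LLM)",
    "ml": "machine learning (ML)",
}

# Whole-word match so that "ml" does not fire inside words such as "html".
_ACRONYM_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ACRONYMS)) + r")\b",
    re.IGNORECASE,
)


def expand_acronyms(user_input: str) -> str:
    """Replace known whole-word acronyms with their spelled-out form."""
    return _ACRONYM_RE.sub(lambda m: ACRONYMS[m.group(0).lower()], user_input)


print(expand_acronyms("What is LLM?"))  # What is large language model (LLM)?
```

Calling `get_llm_answer(expand_acronyms(user_input))` would then give the model the intended meaning; whether that changes the answer depends on the model.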


In [None]:
# Test with single input
user_input = "What is Feline Cognitive Dysfunction? I'm suspecting my cat has it."  # Try different questions
if not is_on_topic(user_input):
    print("Answer: This is out of my domain.")
else:
    closest_question = find_closest_question(user_input)
    if closest_question:
        print("Answer:", knowledge_base[closest_question])
    else:
        print("Answer:", get_llm_answer(user_input))

Answer: This is out of my domain.
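
The four test cells above repeat the same routing logic: topic filter first, then knowledge-base lookup, then LLM fallback. A minimal sketch of that flow as a single function follows. The name `route_question` and the stand-in lambdas are assumptions so the example runs without fuzzywuzzy or an API token; in the notebook you would pass the real `is_on_topic`, `find_closest_question`, and `get_llm_answer`.

```python
from typing import Callable, Optional

OUT_OF_DOMAIN = "This is out of my domain."


def route_question(
    user_input: str,
    knowledge_base: dict,
    is_on_topic: Callable[[str], bool],
    find_closest_question: Callable[[str], Optional[str]],
    get_llm_answer: Callable[[str], str],
) -> str:
    """Topic filter first, then knowledge base, then LLM fallback."""
    if not is_on_topic(user_input):
        return OUT_OF_DOMAIN
    closest = find_closest_question(user_input)
    if closest:
        return knowledge_base[closest]
    return get_llm_answer(user_input)


# Stand-in callables (hypothetical) that mimic the notebook's helpers.
kb = {"What is an algorithm?": "An algorithm is like a recipe."}


def demo(question: str) -> str:
    return route_question(
        question,
        kb,
        is_on_topic=lambda s: "data" in s.lower() or "algorithm" in s.lower(),
        find_closest_question=lambda s: s if s in kb else None,
        get_llm_answer=lambda s: "(LLM fallback)",
    )


print(demo("What is an algorithm?"))  # An algorithm is like a recipe.
print(demo("What is data science?"))  # (LLM fallback)
print(demo("How old is my cat?"))     # This is out of my domain.
```

Each test cell then reduces to one call that prints the returned string.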


In [None]:
def chatbot():
    """Run the chatbot in the console."""
    print("Welcome to the Data Democratization Chatbot! Ask anything about data science or risk analytics (type 'exit' to quit).")
    while True:
        user_input = input("Your question: ").strip()
        if user_input.lower() == 'exit':
            print("Goodbye!")
            break
        closest_question = find_closest_question(user_input)
        if closest_question:
            # A knowledge-base hit is always in domain
            print("Answer:", knowledge_base[closest_question])
        elif not is_on_topic(user_input):
            # Same guard as the test cells: skip the LLM call for off-topic input
            print("Answer: This is out of my domain.")
        else:
            print("Answer:", get_llm_answer(user_input))

if __name__ == "__main__":
    chatbot()

Welcome to the Data Democratization Chatbot! Ask anything about data science or risk analytics (type 'exit' to quit).
Your question: Can you tell me what data lake is?
Answer: A data lake is like a big swimming pool for information. Just like how a lake collects water from many sources like rivers and rain, a data lake stores data from various places like websites, apps, and machines. This data can be of many types, such as numbers, text, pictures, and more. Once the data is in the lake, it can be cleaned, organized, and used to help adults make better decisions or find answers to important questions.
Your question: and what is structured and unstructured data?
Answer: Structured data is like data that fits neatly into a spreadsheet with rows and columns, like your name, age, and address. Unstructured data is like data that doesn't fit into a spreadsheet, like a picture, a video, or a bunch of words in a document. Both kinds of data are important and can tell us different things, but u