## Neural Search with txtai

With txtai we can vectorize our data and perform semantic queries on it. 

Note: txtai is not limited to text; it can also index audio, images, and video.

## Demo 

First I'll set up the tools needed, then use txtai to "embed" (i.e. vectorize) a collection of Coursera course sample data that I retrieved from Kaggle: https://www.kaggle.com/datasets/siddharthm1698/coursera-course-dataset?resource=download. Note that each course record is shallow and has no description field, but the data is still useful for a demo.

In [12]:
import pandas as pd
from txtai import Embeddings
from IPython.display import HTML, display  # IPython.core.display is deprecated for these names
import matplotlib.pyplot as plt




In [4]:
# Read the CSV file into a DataFrame
df = pd.read_csv("coursea_data.csv")

# Generate a combined text for each course
df['combined'] = df.apply(lambda x: f"{x['course_title']} {x['course_organization']} {x['course_Certificate_type']} {x['course_rating']} {x['course_difficulty']} {x['course_students_enrolled']}", axis=1)

# Initialize an embeddings object with hybrid search enabled
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "hybrid": True})

# Convert the DataFrame to a list of (id, text) for indexing, using the index as a unique ID
data = [(idx, row['combined']) for idx, row in df.iterrows()]

# Index the data
embeddings.index(data)

print("Indexing complete.")


Indexing complete.
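
The `hybrid` setting asks txtai to combine a sparse keyword index with the dense vector index. As a rough sketch of the idea only (this is not txtai's actual implementation), a hybrid score can be thought of as a weighted blend of the two scores. The helper name `blend_scores` and the equal 0.5 weighting are assumptions made for illustration.

```python
def blend_scores(dense, sparse, weight=0.5):
    """Blend a dense (semantic) score with a sparse (keyword) score.

    Hypothetical helper for illustration. Both scores are assumed to be
    normalized to the range [0, 1]; weight is the share given to dense.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * dense + (1.0 - weight) * sparse


# A course that matches semantically but shares few keywords with the query
print(round(blend_scores(dense=0.8, sparse=0.2), 4))
```

With equal weights, a strong semantic match is pulled down by a weak keyword match, which may partly explain why scores in the tables below are modest even for relevant courses.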


In [5]:
def retrieve_course_by_id(uid):
    course = df.loc[uid]
    return course.to_dict()

# Define the table function for displaying results
def table(category, query, rows):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 100%;
    }
    th, td {
        border: 1px solid #9e9e9e;
        padding: 10px;
        font: 15px Oswald;
    }
    </style>
    """

    html += f"<h3>[{category}] {query}</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead>"
    for score, text in rows:
        html += f"<tr><td>{score:.4f}</td><td>{text}</td></tr>"
    html += "</table>"

    display(HTML(html))
    
def table_with_details(category, query, results, df):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 100%;
    }
    th, td {
        border: 1px solid #9e9e9e;
        padding: 10px;
        font: 15px Oswald;
    }
    </style>
    """

    # Define table headers based on DataFrame's columns
    columns = ['Score', 'Course Title', 'Organization', 'Certificate Type', 'Rating', 'Difficulty', 'Students Enrolled']
    html += f"<h3>[{category}] {query}</h3><table><thead><tr>" + ''.join([f"<th>{col}</th>" for col in columns]) + "</tr></thead><tbody>"

    for uid, score in results:
        # Fetch the course by its unique ID
        course = df.loc[uid]
        # Create a row with course details
        row = f"""
        <tr>
            <td>{score:.4f}</td>
            <td>{course['course_title']}</td>
            <td>{course['course_organization']}</td>
            <td>{course['course_Certificate_type']}</td>
            <td>{course['course_rating']}</td>
            <td>{course['course_difficulty']}</td>
            <td>{course['course_students_enrolled']}</td>
        </tr>
        """
        html += row

    html += "</tbody></table>"
    display(HTML(html))
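
The helpers above interpolate course fields straight into the HTML string, so a title containing `&` or `<` could be rendered incorrectly. A minimal sketch of escaping each cell with the standard library follows; `render_row` is a hypothetical helper and not part of the notebook.

```python
from html import escape


def render_row(score, values):
    """Build one HTML table row, escaping every cell value."""
    cells = [f"<td>{score:.4f}</td>"]
    cells += [f"<td>{escape(str(value))}</td>" for value in values]
    return "<tr>" + "".join(cells) + "</tr>"


# The ampersand in this title is emitted as an HTML entity
print(render_row(0.2231, ["Social Policy for Social Services & Health Practitioners",
                          "Columbia University"]))
```

Importing `escape` by name avoids a clash with the local variable `html` that the table functions already use.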

## Direct Queries Demonstration

This section showcases the search system's ability to handle direct, topic-specific queries. These queries represent straightforward search intentions for educational content within the dataset.


In [8]:
# Direct, topic-specific queries
direct_queries = [
    "machine learning",
    "Google beginner",
    "professional certificate 4.5",
    "advanced 100,000",
    "Python easy"
]

# Perform search and display results for direct queries
for query in direct_queries:
    results = embeddings.search(query, 5)  # Retrieve top 5 results for each query
    detailed_rows = [(uid, score) for uid, score in results]
    table_with_details("Direct Query Results", query, detailed_rows, df)


[Direct Query Results] machine learning
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.5593,Machine Learning,Stanford University,COURSE,4.9,Mixed,3.2m
0.5547,Structuring Machine Learning Projects,deeplearning.ai,COURSE,4.8,Beginner,220k
0.3498,Deep Learning,deeplearning.ai,SPECIALIZATION,4.8,Intermediate,690k
0.347,"Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization",deeplearning.ai,COURSE,4.9,Beginner,270k
0.3372,Applied Machine Learning in Python,University of Michigan,COURSE,4.6,Intermediate,150k


[Direct Query Results] Google beginner
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.3197,Technical Support Fundamentals,Google,COURSE,4.8,Beginner,280k
0.2966,Operating Systems and You: Becoming a Power User,Google,COURSE,4.6,Beginner,76k
0.2941,The Bits and Bytes of Computer Networking,Google,COURSE,4.7,Beginner,130k
0.2858,System Administration and IT Infrastructure Services,Google,COURSE,4.7,Beginner,62k
0.2773,Introduction to Web Development,"University of California, Davis",COURSE,4.7,Beginner,76k


[Direct Query Results] professional certificate 4.5
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.5243,Google IT Support,Google,PROFESSIONAL CERTIFICATE,4.8,Beginner,350k
0.3155,Autodesk Certified Professional: AutoCAD for Design and Drafting Exam Prep,Autodesk,COURSE,4.7,Advanced,22k
0.31,Cloud Engineering with Google Cloud,Google Cloud,PROFESSIONAL CERTIFICATE,4.7,Intermediate,310k
0.306,Arizona State University TESOL,Arizona State University,PROFESSIONAL CERTIFICATE,4.9,Beginner,150k
0.306,Autodesk Certified Professional: Revit for Architectural Design Exam Prep,Autodesk,COURSE,4.7,Advanced,9.7k


[Direct Query Results] advanced 100,000
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.3476,Advanced Data Science with IBM,IBM,SPECIALIZATION,4.4,Advanced,320k
0.3307,Advanced Machine Learning,National Research University Higher School of Economics,SPECIALIZATION,4.5,Advanced,190k
0.1927,Achieving Personal and Professional Success,University of Pennsylvania,SPECIALIZATION,4.7,Beginner,110k
0.1918,IBM AI Engineering,IBM,PROFESSIONAL CERTIFICATE,4.6,Intermediate,140k
0.1877,Data Analysis and Interpretation,Wesleyan University,SPECIALIZATION,4.4,Beginner,100k


[Direct Query Results] Python easy
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.2034,Introduction to Scripting in Python,Rice University,SPECIALIZATION,4.7,Beginner,49k
0.1921,Aprende a programar con Python,Universidad Austral,SPECIALIZATION,4.2,Beginner,6.6k
0.1852,Using Python to Interact with the Operating System,Google,COURSE,4.7,Beginner,19k
0.1825,Data Visualization with Python,IBM,COURSE,4.6,Intermediate,66k
0.1801,Python for Everybody,University of Michigan,SPECIALIZATION,4.8,Beginner,1.5m


## Informal Queries Demonstration

This section highlights the system's versatility by handling informal, contextually rich queries. These demonstrate the semantic understanding and nuanced search capabilities.


In [9]:
# Informal, context-rich queries
informal_queries = [
    "I wanna be a security expert in cloud",
    "I don't know what to do but from what my friend has told me, nursing is pretty good",
    "how does one become a social worker from scratch?",
    "something to warm my tea",
    "I want to learn cloud security from scratch",
    "I want to help people in need",
    "i graduated in maths and need something more advanced"
]

# Perform search and display results for informal queries
for query in informal_queries:
    results = embeddings.search(query, 5)  # Retrieve top 5 results for each query
    detailed_rows = [(uid, score) for uid, score in results]
    table_with_details("Informal Query Results", query, detailed_rows, df)


[Informal Query Results] I wanna be a security expert in cloud
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.5918,Security in Google Cloud Platform,Google Cloud,SPECIALIZATION,4.7,Intermediate,300k
0.4338,Networking in Google Cloud,Google Cloud,SPECIALIZATION,4.7,Intermediate,290k
0.2738,Cloud Engineering with Google Cloud,Google Cloud,PROFESSIONAL CERTIFICATE,4.7,Intermediate,310k
0.2699,Palo Alto Networks Cybersecurity,Palo Alto Networks,SPECIALIZATION,4.6,Beginner,9.1k
0.2686,From Data to Insights with Google Cloud Platform,Google Cloud,SPECIALIZATION,4.6,Beginner,26k


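[Informal Query Results] I don't know what to do but from what my friend has told me, nursing is pretty good
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled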
0.2973,COVID-19: What You Need to Know (CME Eligible),Osmosis,COURSE,4.8,Beginner,32k
0.2949,What is Social?,Northwestern University,COURSE,4.6,Mixed,94k
0.2949,What is Data Science?,IBM,COURSE,4.7,Beginner,260k
0.2842,What is Compliance?,University of Pennsylvania,COURSE,4.8,Mixed,6.2k
0.27,The Science of Success: What Researchers Know that You Should Know,University of Michigan,COURSE,4.8,Beginner,59k


[Informal Query Results] how does one become a social worker from scratch?
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.2282,"The Modern World, Part One: Global History from 1760 to 1910",University of Virginia,COURSE,4.8,Beginner,130k
0.2236,Becoming a changemaker: Introduction to Social Innovation,University of Cape Town,COURSE,4.8,Beginner,61k
0.2231,Social Policy for Social Services & Health Practitioners,Columbia University,SPECIALIZATION,4.8,Beginner,4.2k
0.2183,Become a Journalist: Report the News!,Michigan State University,SPECIALIZATION,4.7,Beginner,28k
0.2141,How to Start Your Own Business,Michigan State University,SPECIALIZATION,4.1,Beginner,34k


[Informal Query Results] something to warm my tea
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.1753,Introduction to Thermodynamics: Transferring Energy from Here to There,University of Michigan,COURSE,4.8,Beginner,4.1k
0.1019,Emerging Technologies: From Smartphones to IoT to Big Data,Yonsei University,SPECIALIZATION,4.6,Beginner,17k
0.0997,Probability and Statistics: To p or not to p?,University of London,COURSE,4.6,Beginner,36k
0.0898,Food & Beverage Management,Università Bocconi,COURSE,4.8,Mixed,57k
0.0884,Introduction to Psychology,Yale University,COURSE,4.9,Beginner,270k


[Informal Query Results] I want to learn cloud security from scratch
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.3601,Introduction to Cyber Security,New York University,SPECIALIZATION,4.7,Beginner,32k
0.2439,From Data to Insights with Google Cloud Platform,Google Cloud,SPECIALIZATION,4.6,Beginner,26k
0.2154,Security in Google Cloud Platform,Google Cloud,SPECIALIZATION,4.7,Intermediate,300k
0.2144,Build a Modern Computer from First Principles: From Nand to Tetris (Project-Centered Course),Hebrew University of Jerusalem,COURSE,4.9,Mixed,95k
0.2111,How to Start Your Own Business,Michigan State University,SPECIALIZATION,4.1,Beginner,34k


[Informal Query Results] I want to help people in need
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.2149,COVID-19: What You Need to Know (CME Eligible),Osmosis,COURSE,4.8,Beginner,32k
0.1737,Think Again I: How to Understand Arguments,Duke University,COURSE,4.6,Beginner,200k
0.1572,The Manager's Toolkit: A Practical Guide to Managing People at Work,"Birkbeck, University of London",COURSE,4.6,Mixed,47k
0.1529,People Analytics,University of Pennsylvania,COURSE,4.5,Mixed,84k
0.1422,Leading People and Teams,University of Michigan,SPECIALIZATION,4.7,Beginner,220k


[Informal Query Results] i graduated in maths and need something more advanced
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.2809,Machine Learning and Reinforcement Learning in Finance,New York University,SPECIALIZATION,3.7,Intermediate,29k
0.2792,Introduction to Discrete Mathematics for Computer Science,National Research University Higher School of Economics,SPECIALIZATION,4.4,Beginner,75k
0.2745,Mathematics for Data Science,National Research University Higher School of Economics,SPECIALIZATION,4.5,Beginner,12k
0.2703,Accelerated Computer Science Fundamentals,University of Illinois at Urbana-Champaign,SPECIALIZATION,4.7,Intermediate,22k
0.2695,Mathematics for Machine Learning: PCA,Imperial College London,COURSE,4.0,Intermediate,33k


## Informal Queries with Location and Numeric Constraints

Recall that the data is shallow and lacks course descriptions, so there is only so much txtai can do with the embeddings. The queries below add location and enrollment constraints to see how far that goes.

In [10]:
informal_queries = [
    "I want to become a journalist and I live in Michigan",
    "I want to specialize in journalism and I live in USA",
    "I want to see the least common course in USA",
    "I want to see the least common course in USA at most 20k students"
]

for query in informal_queries:
    results = embeddings.search(query, 5)  # Retrieve top 5 results for each query
    detailed_rows = [(uid, score) for uid, score in results]
    table_with_details("Informal Query Results", query, detailed_rows, df)

[Informal Query Results] I want to become a journalist and I live in Michigan
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.655,Become a Journalist: Report the News!,Michigan State University,SPECIALIZATION,4.7,Beginner,28k
0.2573,Good with Words: Writing and Editing,University of Michigan,SPECIALIZATION,4.6,Beginner,3.8k
0.2158,How to Start Your Own Business,Michigan State University,SPECIALIZATION,4.1,Beginner,34k
0.2125,Write Your First Novel,Michigan State University,COURSE,4.4,Beginner,18k
0.211,Anatomy,University of Michigan,SPECIALIZATION,4.8,Beginner,30k


[Informal Query Results] I want to specialize in journalism and I live in USA
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.2862,Become a Journalist: Report the News!,Michigan State University,SPECIALIZATION,4.7,Beginner,28k
0.2531,Think Again I: How to Understand Arguments,Duke University,COURSE,4.6,Beginner,200k
0.2371,Financial Engineering and Risk Management Part I,Columbia University,COURSE,4.6,Mixed,100k
0.2363,English Composition I,Duke University,COURSE,4.6,Beginner,200k
0.2278,American Contract Law I,Yale University,COURSE,4.9,Beginner,18k


[Informal Query Results] I want to see the least common course in USA
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.2078,Leading Healthcare Quality and Safety,The George Washington University,COURSE,4.8,Beginner,16k
0.2008,International Law in Action: A Guide to the International Courts and Tribunals in The Hague,Universiteit Leiden,COURSE,4.8,Mixed,36k
0.194,Think Again I: How to Understand Arguments,Duke University,COURSE,4.6,Beginner,200k
0.1885,Cost and Economics in Pricing Strategy,University of Virginia,COURSE,4.8,Beginner,15k
0.1868,Stanford's Short Course on Breastfeeding,Stanford University,COURSE,4.7,Beginner,17k


[Informal Query Results] I want to see the least common course in USA at most 20k students
Score,Course Title,Organization,Certificate Type,Rating,Difficulty,Students Enrolled
0.2495,Aspectos básicos de la planificación y la gestión de proyectos,University of Virginia,COURSE,4.9,Beginner,10k
0.2486,Stanford's Short Course on Breastfeeding,Stanford University,COURSE,4.7,Beginner,17k
0.2475,Write Your First Novel,Michigan State University,COURSE,4.4,Beginner,18k
0.2364,Cost and Economics in Pricing Strategy,University of Virginia,COURSE,4.8,Beginner,15k
0.2362,Construction Scheduling,Columbia University,COURSE,4.8,Beginner,15k
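
The last query asks for "at most 20k students", but a similarity search is not guaranteed to respect a numeric comparison, so a constraint like this is probably better applied as a filter after the search. The sketch below parses enrollment strings in the format shown in the result tables (`28k`, `3.2m`). The helper names and the `enrolled` field name are hypothetical, chosen only for this example.

```python
def parse_enrolled(text):
    """Convert strings such as '28k' or '3.2m' into an integer count."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return round(float(text))


def at_most(courses, limit):
    """Keep courses whose 'enrolled' field is at most limit students."""
    return [c for c in courses if parse_enrolled(c["enrolled"]) <= limit]


# Titles and enrollment figures copied from the result tables
courses = [
    {"title": "Write Your First Novel", "enrolled": "18k"},
    {"title": "Become a Journalist: Report the News!", "enrolled": "28k"},
    {"title": "Machine Learning", "enrolled": "3.2m"},
]
print([c["title"] for c in at_most(courses, 20_000)])
```

The same parsed value could also be used to sort results by enrollment, which is closer to what "least common course" is asking for.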
