<a href="https://www.kaggle.com/code/ufownl/ai-assistant-for-data-science-beginners?scriptVersionId=186343184" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook is powered by **[lua-cgemma](https://github.com/ufownl/lua-cgemma)**, a library that provides Lua bindings for **[gemma.cpp](https://github.com/google/gemma.cpp)**; see [README.md](https://github.com/ufownl/lua-cgemma/blob/main/README.md) for more details. Because **[gemma.cpp](https://github.com/google/gemma.cpp)** uses the **[Google Highway Library](https://github.com/google/highway)** to take advantage of portable SIMD for CPU inference, it runs smoothly on a Kaggle kernel without any accelerator. The purpose of this notebook is to explore a way to leverage the power of **Gemma** using only CPUs.

*Note: the context prompts of this AI assistant are constructed based on the content of [this post](https://www.springboard.com/blog/data-science/learn-data-science-on-your-own/).*

# Requirements

Before starting, you should have installed:

- [CMake](https://cmake.org/)
- C++ compiler, supporting at least C++17
- [LuaJIT](https://luajit.org/); installing [OpenResty](https://openresty.org/), which bundles it, is recommended
- [websockets](https://websockets.readthedocs.io/), for interacting with the AI assistant in a Kaggle notebook
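
As a quick sanity check before or after the install cells, the sketch below reports which of the requirements above are still missing. The executable names are assumptions (`c++` for the compiler, `resty` for OpenResty's command-line tool, as used later in this notebook), and the helper `missing_requirements` is hypothetical, not part of lua-cgemma.

```python
import importlib.util
import shutil

def missing_requirements(tools, modules):
    """Return (tools not found on PATH, Python modules that cannot be imported)."""
    no_tool = [t for t in tools if shutil.which(t) is None]
    no_module = [m for m in modules if importlib.util.find_spec(m) is None]
    return no_tool, no_module

# Assumed executable names: `c++` for the compiler, `resty` for OpenResty's CLI.
tools, modules = missing_requirements(["cmake", "c++", "resty"], ["websockets"])
print("Missing tools:", tools or "none")
print("Missing Python modules:", modules or "none")
```

This only checks that the names resolve; it does not verify versions such as C++17 support.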

In [1]:
# Install OpenResty

!wget -O - https://openresty.org/package/pubkey.gpg | apt-key add -
!echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/openresty.list
!apt update && apt install -y openresty

--2024-07-01 08:39:22--  https://openresty.org/package/pubkey.gpg
Resolving openresty.org (openresty.org)... 18.138.237.72, 52.220.196.30
Connecting to openresty.org (openresty.org)|18.138.237.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1688 (1.6K) [text/plain]
Saving to: 'STDOUT'


2024-07-01 08:39:22 (124 MB/s) - written to stdout [1688/1688]

OK
deb http://openresty.org/package/ubuntu focal main
Get:1 http://openresty.org/package/ubuntu focal InRelease [4720 B]
Get:2 https://packages.cloud.google.com/apt gcsfuse-focal InRelease [1225 B]
Get:3 http://openresty.org/package/ubuntu focal/main amd64 Packages [29.9 kB]
Get:4 https://packages.cloud.google.com/apt cloud-sdk InRelease [1616 B]
Get:5 http://security.ubuntu.com/ubuntu focal-security InRelease [128 kB]
Hit:6 http://archive.ubuntu.com/ubuntu focal InRelease
Get:7 https://packages.cloud.google.com/apt gcsfuse-focal/main amd64 Packages [24.1 kB]
Get:8 http://archive.ubuntu.com/ub

In [2]:
# Install websockets

!pip install -q -U websockets

# Install lua-cgemma

In [3]:
import os
os.chdir("/opt")

**1st step:** Clone the source code from [GitHub](https://github.com/ufownl/lua-cgemma):

In [4]:
!git clone --branch main https://github.com/ufownl/lua-cgemma.git

Cloning into 'lua-cgemma'...
remote: Enumerating objects: 554, done.
remote: Counting objects: 100% (106/106), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 554 (delta 71), reused 79 (delta 56), pack-reused 448
Receiving objects: 100% (554/554), 124.55 KiB | 1.32 MiB/s, done.
Resolving deltas: 100% (351/351), done.


**2nd step:** Build and install:

In [5]:
!mkdir -p lua-cgemma/build
os.chdir("/opt/lua-cgemma/build")
!cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_CXX_FLAGS="-DGEMMA_TOPK=5" .. && make -j4 && make install

cmake: /opt/conda/lib/libcurl.so.4: no version information available (required by cmake)
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Test HWY_EMSCRIPTEN
-- Performing Test HWY_EMSCRIPTEN - Failed
-- Performing Test HWY_RISCV
-- Performing Test HWY_RISCV - Failed
-- Looking for sys/auxv.h
-- Looking for 

**3rd step:** Test whether the module is installed correctly:

In [6]:
!resty -e 'require("cgemma").info()'


    __
   / /_  ______ _      _________ ____  ____ ___  ____ ___  ____ _
  / / / / / __ `/_____/ ___/ __ `/ _ \/ __ `__ \/ __ `__ \/ __ `/
 / / /_/ / /_/ /_____/ /__/ /_/ /  __/ / / / / / / / / / / /_/ /
/_/\__,_/\__,_/      \___/\__, /\___/_/ /_/ /_/_/ /_/ /_/\__,_/
                         /____/

Date & Time              : 2024-07-01 08:48:24
Max Sequence Length      : 4096
Top-K                    : 5
Prefill Token Batch Size : 16
CPU                      :            Intel(R) Xeon(R) CPU @ 2.20GHz
Instruction Set          : AVX2 (256 bits)
Hardware Concurrency     : 4
Compiled Config          : opt



# Launch AI Assistant Service

In [7]:
import os
os.chdir("/opt")

**1st step:** Clone configuration and data files from [GitHub](https://github.com/ufownl/ai-assistant-for-data-science-beginners): 

In [8]:
!git clone https://github.com/ufownl/ai-assistant-for-data-science-beginners.git

Cloning into 'ai-assistant-for-data-science-beginners'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 48 (delta 19), reused 20 (delta 2), pack-reused 0
Unpacking objects: 100% (48/48), 93.85 KiB | 620.00 KiB/s, done.


**2nd step:** Link the weights and tokenizer files:

In [9]:
os.chdir("/opt/ai-assistant-for-data-science-beginners")
!ln -sf /kaggle/input/gemma-2/gemmacpp/2.0-9b-it-sfp/1/9b-it-sfp.sbs
!ln -sf /kaggle/input/gemma-2/gemmacpp/2.0-9b-it-sfp/1/tokenizer.spm

**3rd step:** Generate cache of context prompts:

In [10]:
!resty print_prompt.lua | resty /opt/lua-cgemma/tools/cache_prompt.lua --model 9b-pt --weights 9b-it-sfp.sbs
!ls -lh dump.bin

Loading model ...
Random seed of session: 3350186930
Reading prompt ...

Done! Session states of the prompt have been dumped to "dump.bin"
-rw-r--r-- 1 root root 1.6G Jul  1 09:15 dump.bin


*Note: `dump.bin` is the cache of context prompts; it can be saved somewhere and reused instead of being regenerated every time.*
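
One way to reuse a saved cache is sketched below: copy a stored `dump.bin` into the working directory if one exists, and fall back to the generation step otherwise. The saved location is a hypothetical path (for example a Kaggle dataset attached to the notebook), and the helper `restore_cache` is illustrative only. It seems safest to assume the cache is valid only for the same weights and context prompts it was generated from.

```python
import shutil
from pathlib import Path

def restore_cache(saved, target):
    """Copy a previously saved cache to `target`; return True if it was restored."""
    saved, target = Path(saved), Path(target)
    if not saved.is_file():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(saved, target)
    return True

# Hypothetical location of a saved copy; adjust to wherever the file was stored.
SAVED = "/kaggle/input/assistant-cache/dump.bin"
if not restore_cache(SAVED, "dump.bin"):
    print("No saved cache found; run the cache_prompt.lua step above.")
```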

**4th step:** Launch assistant service:

In [11]:
!openresty -p . -c assistant.conf

# Chat with the AI Assistant

Run this notebook in interactive mode to chat with the AI assistant yourself. Alternatively, follow the instructions in this notebook to install and launch the AI assistant service on your own computer, then open [http://localhost:8042](http://localhost:8042) in your browser to chat with it.
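
When running the service locally, it can help to confirm that something is listening on port 8042 before opening the browser or connecting a client. The sketch below is a generic TCP probe using only the standard library; the helper name `wait_for_port` and its defaults are assumptions, and a successful probe only shows that the port accepts connections, not that the model has finished loading.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait and retry.
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 8042)` could be called right after the launch step.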

In [12]:
# Define the prompter input
import os

class Prompter:
    __dummy_input = [
        "How to learn data science from scratch?",
        "What mathematical concepts should I learn?",
        "Tell me the most popular programming languages used in data science.",
        ""
    ]
    
    def __init__(self):
        self.__turn = 0
        self.__interactive = os.environ.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive"
        
    def __call__(self):
        if self.__interactive:
            self.__turn += 1
            return input("you:")
        else:
            dummy = self.__dummy_input[self.__turn]
            self.__turn += 1
            print("you: ", end="")
            print(dummy, flush=True)
            return dummy

In [13]:
import json
from websockets.sync.client import connect

history = []

# Connect to Server and Chat with AI Assistant
with connect("ws://localhost:8042/cgemma/session", open_timeout=None) as ws:
    you = Prompter()
    loading = "-\\|/"
    while True:
        text = None
        msg = json.loads(ws.recv())
        if "text" in msg:
            print(msg["role"], end=": ")
            print(msg["text"], flush=True)
            text = you()
        elif msg["pos"] < msg["prompt_size"]:
            print(msg["role"], end=": ")
            print(loading[msg["pos"] % len(loading)], end="\r", flush=True)
        else:
            if msg["pos"] == msg["prompt_size"]:
                print(msg["role"], end=": ")
                reply = ""
            if "token" in msg:
                print(msg["token"], end="", flush=True)
                reply += msg["token"]
            else:
                print("", flush=True)
                history.append({
                    "role": msg["role"],
                    "text": reply
                })
                text = you()
        if text is not None:
            if text:
                history.append({
                    "role": "user",
                    "text": text
                })
                ws.send(json.dumps(history[-1]))    
            else:
                break

system: New session started! Random seed of current session: 253235304
you: How to learn data science from scratch?
gemma: That's a great question! Learning data science from scratch can seem daunting, but it's absolutely achievable with the right approach. Here's a breakdown of how to get started:

**1. Build a Strong Foundation:**

* **Math:** Brush up on your algebra, calculus, and statistics. These are fundamental to understanding how algorithms work.
* **Programming:** Learn Python. It's the most popular language in data science due to its versatility and extensive libraries.

**2. Dive into Data Science Concepts:**

* **Data Wrangling:** Learn how to clean, transform, and prepare data for analysis.
* **Exploratory Data Analysis (EDA):** Discover patterns and insights within your data using visualization and summary statistics.
* **Machine Learning:** Understand the different types of algorithms (regression, classification, clustering, etc.) and how they work.

**3. Choose Your Le
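
The chat loop above distinguishes four kinds of server messages by the fields they carry. The sketch below restates that branching as a small function; the field names (`text`, `pos`, `prompt_size`, `token`) come from the loop itself, while the category labels and the interpretation of each branch are inferred from the client code rather than from any server documentation.

```python
def classify(msg):
    """Map one decoded server message to the branch the chat loop takes."""
    if "text" in msg:
        return "message"   # a complete text, such as the session banner
    if msg["pos"] < msg["prompt_size"]:
        return "prefill"   # the prompt is still being processed
    if "token" in msg:
        return "token"     # one streamed piece of the reply
    return "end"           # no token: the reply is finished

# Illustrative messages only; the values are made up.
print(classify({"role": "system", "text": "New session started!"}))
print(classify({"role": "gemma", "pos": 3, "prompt_size": 10}))
print(classify({"role": "gemma", "pos": 10, "prompt_size": 10, "token": "Hi"}))
print(classify({"role": "gemma", "pos": 12, "prompt_size": 10}))
```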

Render the chat history as formatted Markdown:

In [14]:
from IPython.display import display, Markdown

for msg in history:
    if msg["role"] == "user":
        display(Markdown("# Q: " + msg["text"]))
    else:
        display(Markdown(msg["text"]))

# Q: How to learn data science from scratch?

That's a great question! Learning data science from scratch can seem daunting, but it's absolutely achievable with the right approach. Here's a breakdown of how to get started:

**1. Build a Strong Foundation:**

* **Math:** Brush up on your algebra, calculus, and statistics. These are fundamental to understanding how algorithms work.
* **Programming:** Learn Python. It's the most popular language in data science due to its versatility and extensive libraries.

**2. Dive into Data Science Concepts:**

* **Data Wrangling:** Learn how to clean, transform, and prepare data for analysis.
* **Exploratory Data Analysis (EDA):** Discover patterns and insights within your data using visualization and summary statistics.
* **Machine Learning:** Understand the different types of algorithms (regression, classification, clustering, etc.) and how they work.

**3. Choose Your Learning Path:**

* **Online Courses:** Platforms like Coursera, edX, DataCamp, and Udacity offer structured data science courses.
* **Books:** There are many excellent books on data science, both for beginners and advanced learners.
* **Bootcamps:** Intensive, short-term programs that provide a focused learning experience.
* **University Programs:**  Consider a Master's degree if you want a deeper theoretical understanding.

**4. Practice, Practice, Practice!**

* **Work on Projects:** Apply your knowledge to real-world datasets. Kaggle is a great platform to find competitions and datasets.
* **Contribute to Open Source:** Gain experience by contributing to data science projects on GitHub.
* **Build a Portfolio:** Showcase your projects to potential employers.

**5. Network and Connect:**

* **Attend Meetups:** Connect with other data scientists in your area.
* **Join Online Communities:** Participate in forums and discussions.
* **Follow Industry Leaders:** Stay up-to-date on the latest trends.

**Remember:**

* **Be Patient:** Learning data science takes time and effort.
* **Don't Be Afraid to Ask for Help:** There are many resources available to help you along the way.
* **Stay Curious:** Data science is a constantly evolving field. Keep learning and exploring!


Good luck on your data science journey!

# Q: What mathematical concepts should I learn?

You're right to focus on math! It's the backbone of data science. Here's a breakdown of the key mathematical concepts you should learn:

**1. Fundamentals:**

* **Algebra:**  
    * Linear equations, matrices, solving systems of equations.
    * This helps with understanding how models work and manipulating data.
* **Calculus:**
    * Derivatives and integrals: Essential for understanding optimization algorithms used in machine learning.
    * Not always needed for all data science roles, but beneficial for deeper understanding.
* **Statistics:**
    * Descriptive statistics (mean, median, mode, variance, standard deviation)
    * Probability distributions (normal, binomial, etc.)
    * Hypothesis testing
    * This is core to analyzing data, drawing inferences, and building models.

**2. Linear Algebra:**

* **Vectors and matrices:**  Representing data and performing operations on it.
* **Eigenvalues and eigenvectors:**  Used in dimensionality reduction techniques like PCA.
* **Matrix operations:**  Essential for many machine learning algorithms.

**3. Probability:**

* **Probability distributions:** Understanding how data is likely to be distributed.
* **Conditional probability:**  Probability of an event given another event has occurred.
* **Bayes' Theorem:**  Foundation for Bayesian statistics, used in many machine learning models.

**4. Additional Concepts:**

* **Optimization:**  Finding the best solution to a problem (often used in model training).
* **Information Theory:**  Understanding how to measure information and uncertainty.
* **Discrete Mathematics:**  Logic, set theory, graph theory (helpful for certain algorithms).

**Resources:**

* **Khan Academy:** Offers free online courses covering many of these topics.
* **3Blue1Brown:**  Excellent YouTube channel with intuitive explanations of math concepts.
* **Coursera/edX:**  Online courses with more in-depth coverage.

**Remember:** You don't need to be a math whiz to become a data scientist. Focus on understanding the concepts relevant to data science, and practice applying them.

# Q: Tell me the most popular programming languages used in data science.

You're right to ask about programming languages! They're essential tools for data scientists. 

Here are the most popular ones, along with their strengths:

**1. Python:**

* **Why it's popular:**  Beginner-friendly syntax, vast ecosystem of libraries (NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch), versatility (web development, scripting, data analysis).
* **Strengths:**  Data manipulation, visualization, machine learning, deep learning.

**2. R:**

* **Why it's popular:**  Specifically designed for statistical computing and graphics, excellent for statistical modeling and data exploration.
* **Strengths:**  Statistical analysis, data visualization (ggplot2), creating reproducible research.

**3. SQL:**

* **Why it's popular:**  Essential for working with relational databases, querying and managing data.
* **Strengths:**  Data retrieval, data cleaning, database management.

**4. Scala:**

* **Why it's popular:**  Combines object-oriented and functional programming paradigms, used with big data frameworks like Spark.
* **Strengths:**  Scalability, performance, working with large datasets.

**5. Java:**

* **Why it's popular:**  Robust and versatile, used in enterprise-level applications and big data tools.
* **Strengths:**  Building scalable applications, working with Hadoop, Spark.

**Choosing a Language:**

* **Beginners:** Start with Python. Its simplicity and vast resources make it ideal for learning.
* **Statistical Focus:**  R is excellent for in-depth statistical analysis and visualization.
* **Database Work:**  SQL is essential for interacting with databases.
* **Big Data:**  Scala or Java are good choices for handling massive datasets.



Let me know if you have any other questions!

# Conclusions

The output above shows that the AI assistant can extract information from a given post and answer the user's questions in a multi-turn conversation. The same approach could be used for other tasks, such as an AI customer service that answers frequently asked questions. All you need to do is prepare an introductory document or just a FAQ list and set the role of the model in the prompt template.
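
As a rough illustration of the FAQ idea, the sketch below assembles a context prompt from a role description and a list of question/answer pairs. The function `build_context_prompt`, the instruction wording, and the sample FAQ are all hypothetical; the turn markers follow Gemma's instruction-tuned chat format, and the prompt template actually used by this assistant (produced by `print_prompt.lua`) is not shown in this notebook and may differ.

```python
def build_context_prompt(role, faq):
    """Assemble a context prompt from a role description and (question, answer) pairs."""
    lines = [role.strip(), "", "Answer using only the FAQ below.", ""]
    for question, answer in faq:
        lines.append(f"Q: {question.strip()}")
        lines.append(f"A: {answer.strip()}")
        lines.append("")
    body = "\n".join(lines).rstrip()
    # Gemma's instruction-tuned turn markers wrap the whole context as one user turn.
    return f"<start_of_turn>user\n{body}<end_of_turn>\n<start_of_turn>model\n"

# Made-up example data.
prompt = build_context_prompt(
    "You are a customer service assistant for an online store.",
    [("How do I reset my password?", "Use the 'Forgot password' link on the login page.")],
)
print(prompt)
```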

Because this approach relies only on context prompts, it cannot implement complex AI assistants. However, thanks to the excellent CPU inference speed of **[gemma.cpp](https://github.com/google/gemma.cpp)**, it runs smoothly on **CPU-only devices**. This keeps deployment and operating costs extremely low, which should help **Gemma**'s language processing and generation capabilities become more widely used.