<a href="https://www.kaggle.com/code/ufownl/ai-assistant-for-data-science-beginners?scriptVersionId=171849992" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook is powered by **[lua-cgemma](https://github.com/ufownl/lua-cgemma)**, a library that provides Lua bindings for **[gemma.cpp](https://github.com/google/gemma.cpp)**; see [README.md](https://github.com/ufownl/lua-cgemma/blob/main/README.md) for more details. Because **[gemma.cpp](https://github.com/google/gemma.cpp)** uses the **[Google Highway Library](https://github.com/google/highway)** for portable SIMD on the CPU, inference runs smoothly in a Kaggle kernel without any accelerator. The purpose of this notebook is to explore a way to leverage the power of **Gemma** using CPUs only.

*Note: the context prompts of this AI assistant are constructed based on the content of [this post](https://www.springboard.com/blog/data-science/learn-data-science-on-your-own/).*

# Requirements

Before starting, you should have installed:

- [CMake](https://cmake.org/)
- A C++ compiler that supports at least C++17
- [LuaJIT](https://luajit.org/); installing [OpenResty](https://openresty.org/), which bundles it, is recommended
- [websockets](https://websockets.readthedocs.io/), for interacting with the AI assistant in a Kaggle notebook
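
The command-line requirements above can be checked before starting. The following is a minimal sketch, assuming each tool is discoverable on `PATH`; the `REQUIRED_TOOLS` mapping and the executable names in it are illustrative guesses, not something the notebook defines.

```python
import shutil

# Hypothetical mapping from a requirement to executables that would satisfy it.
REQUIRED_TOOLS = {
    "CMake": ["cmake"],
    "C++ compiler": ["c++", "g++", "clang++"],
    "OpenResty": ["resty", "openresty"],
}

def find_missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the requirement names with no matching executable on PATH."""
    return [
        name
        for name, candidates in required.items()
        if not any(which(exe) for exe in candidates)
    ]

# An empty list means every listed tool was found.
print("Missing tools:", find_missing_tools())
```

This only confirms that an executable exists; it does not verify versions or C++17 support.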

In [1]:
# Install OpenResty

!wget -O - https://openresty.org/package/pubkey.gpg | apt-key add -
!echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/openresty.list
!apt update && apt install -y openresty

--2024-04-13 16:06:58--  https://openresty.org/package/pubkey.gpg
Resolving openresty.org (openresty.org)... 3.131.85.84, 2600:1f1c:9b2:8000:f183:c67e:2c64:855f
Connecting to openresty.org (openresty.org)|3.131.85.84|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1688 (1.6K) [text/plain]
Saving to: 'STDOUT'


2024-04-13 16:06:58 (132 MB/s) - written to stdout [1688/1688]

OK
deb http://openresty.org/package/ubuntu focal main
Get:1 http://openresty.org/package/ubuntu focal InRelease [4720 B]
Hit:2 https://packages.cloud.google.com/apt gcsfuse-focal InRelease
Get:3 https://packages.cloud.google.com/apt cloud-sdk InRelease [6361 B]
Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease
Get:5 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:6 http://openresty.org/package/ubuntu focal/main amd64 Packages [31.4 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Hit:8 http://archive.ubuntu.com/ubuntu

In [2]:
# Install websockets

!pip install -q -U websockets

# Install lua-cgemma

In [3]:
import os
os.chdir("/opt")

**1st step:** Clone the source code from [GitHub](https://github.com/ufownl/lua-cgemma):

In [4]:
!git clone --branch main https://github.com/ufownl/lua-cgemma.git

Cloning into 'lua-cgemma'...
remote: Enumerating objects: 410, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 410 (delta 67), reused 67 (delta 29), pack-reused 291
Receiving objects: 100% (410/410), 108.69 KiB | 4.73 MiB/s, done.
Resolving deltas: 100% (246/246), done.


**2nd step:** Build and install:

In [5]:
!mkdir -p lua-cgemma/build
os.chdir("/opt/lua-cgemma/build")
!cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_CXX_FLAGS="-DGEMMA_TOPK=5" .. && make -j4 && make install

cmake: /opt/conda/lib/libcurl.so.4: no version information available (required by cmake)
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Test HWY_EMSCRIPTEN
-- Performing Test HWY_EMSCRIPTEN - Failed
-- Performing Test HWY_RISCV
-- Performing Test HWY_RISCV - Failed
-- Looking for sys/auxv.h
-- Looking for 

**3rd step:** Test whether the module is installed correctly:

In [6]:
!resty -e 'require("cgemma").info()'


    __
   / /_  ______ _      _________ ____  ____ ___  ____ ___  ____ _
  / / / / / __ `/_____/ ___/ __ `/ _ \/ __ `__ \/ __ `__ \/ __ `/
 / / /_/ / /_/ /_____/ /__/ /_/ /  __/ / / / / / / / / / / /_/ /
/_/\__,_/\__,_/      \___/\__, /\___/_/ /_/ /_/_/ /_/ /_/\__,_/
                         /____/

Date & Time                                : 2024-04-13 16:11:54
Max Sequence Length                        : 4096
Top-K                                      : 5
Weight Type                                : SFP
Embedder Input Type                        : b16
Prefill Token Batch Size                   : 16
Hardware Concurrency                       : 4
Instruction Set                            : AVX2 (256 bits)
Compiled Config                            : opt



# Launch AI Assistant Service

In [7]:
import os
os.chdir("/opt")

**1st step:** Clone configuration and data files from [GitHub](https://github.com/ufownl/ai-assistant-for-data-science-beginners): 

In [8]:
!git clone https://github.com/ufownl/ai-assistant-for-data-science-beginners.git

Cloning into 'ai-assistant-for-data-science-beginners'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 39 (delta 13), reused 17 (delta 0), pack-reused 0
Unpacking objects: 100% (39/39), 80.64 KiB | 2.02 MiB/s, done.


**2nd step:** Link the weights and tokenizer files:

In [9]:
os.chdir("/opt/ai-assistant-for-data-science-beginners")
!ln -s -f /kaggle/input/gemma/gemmacpp/1.1-2b-it-sfp/1/2b-it-sfp.sbs
!ln -s -f /kaggle/input/gemma/gemmacpp/1.1-2b-it-sfp/1/tokenizer.spm
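
`ln -s -f` succeeds even when its target does not exist, so a missing model input would only surface later when the model is loaded. A small sketch of a sanity check, assuming the two link names created above; `unresolved_links` is a hypothetical helper, not part of lua-cgemma.

```python
import os

def unresolved_links(paths):
    """Return the paths that are missing or are symlinks to a missing target."""
    # os.path.exists follows symlinks, so it is False for a dangling link.
    return [p for p in paths if not os.path.exists(p)]

# An empty list means both files resolve from the current directory.
print("Unresolved:", unresolved_links(["2b-it-sfp.sbs", "tokenizer.spm"]))
```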

**3rd step:** Generate cache of context prompts:

In [10]:
!resty print_prompt.lua | resty /opt/lua-cgemma/examples/cache_prompt.lua
!ls -lh dump.bin

Loading model ...
Random seed of session: 1316900978
Reading prompt ...

Done! Session states of the prompt have been dumped to "dump.bin"
-rw-r--r-- 1 root root 86M Apr 13 16:22 dump.bin


*Note: `dump.bin` is the cache of the context prompts. It can be saved and reused so that it does not have to be regenerated on every run.*
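
One way to reuse the cache is to copy a saved `dump.bin` back into place and regenerate it only when no saved copy exists. This is a minimal sketch: `restore_cache` and the saved location are hypothetical, and it assumes the saved file was produced with the same weights and context prompts.

```python
import os
import shutil

def restore_cache(saved_path, target_path="dump.bin"):
    """Copy a previously saved cache into place.

    Returns True if the cache was restored, False if no saved copy exists.
    """
    if os.path.isfile(saved_path):
        shutil.copyfile(saved_path, target_path)
        return True
    return False
```

For example, `restore_cache("/path/to/saved/dump.bin")` returning `False` would signal that the cache-generation cell above still needs to run.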

**4th step:** Launch assistant service:

In [11]:
!openresty -p . -c assistant.conf
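
The service may need a moment before it accepts connections. A minimal sketch of a readiness check follows, assuming the service listens on port 8042 as used later in this notebook; `wait_for_port` is a hypothetical helper and only tests that the TCP port is open, not that the model has finished loading.

```python
import socket
import time

def wait_for_port(host="localhost", port=8042, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` before opening the WebSocket connection avoids a connection error when the chat cell runs immediately after launch.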

# Chat with the AI Assistant

Running this notebook in interactive mode lets you chat with the AI assistant yourself. Alternatively, follow the instructions in this notebook to install and launch the AI assistant service on your own computer, then open [http://localhost:8042](http://localhost:8042) in your browser to chat with it.

In [12]:
# Define the prompter input

class Prompter:
    """Return the next user message: typed input when interactive, scripted otherwise."""

    # Scripted questions for non-interactive runs; the empty string ends the chat.
    __dummy_input = [
        "How to learn data science from scratch?",
        "What mathematical concepts should I learn?",
        "Tell me the most popular programming languages used in data science.",
        ""
    ]

    def __init__(self):
        self.__turn = 0
        # Read the user's input only when the kernel run type is "Interactive".
        self.__interactive = os.environ.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive"

    def __call__(self):
        if self.__interactive:
            self.__turn += 1
            return input("you: ")
        else:
            dummy = self.__dummy_input[self.__turn]
            self.__turn += 1
            print("you: ", end="")
            print(dummy, flush=True)
            return dummy

In [13]:
import json
from websockets.sync.client import connect

history = []

# Connect to Server and Chat with AI Assistant
with connect("ws://localhost:8042/cgemma/session") as ws:
    you = Prompter()
    loading = "-\\|/"
    while True:
        text = None
        msg = json.loads(ws.recv())
        if "text" in msg:
            print(msg["role"], end=": ")
            print(msg["text"], flush=True)
            text = you()
        elif msg["pos"] < msg["prompt_size"]:
            print(msg["role"], end=": ")
            print(loading[msg["pos"] % len(loading)], end="\r", flush=True)
        else:
            if msg["pos"] == msg["prompt_size"]:
                print(msg["role"], end=": ")
                reply = ""
            if "token" in msg:
                print(msg["token"], end="", flush=True)
                reply += msg["token"]
            else:
                print("", flush=True)
                history.append({
                    "role": msg["role"],
                    "text": reply
                })
                text = you()
        # text is None while a reply is still streaming; an empty string ends the chat.
        if text is not None:
            if text:
                history.append({
                    "role": "user",
                    "text": text
                })
                ws.send(json.dumps(history[-1]))
            else:
                break

system: New session started! Random seed of current session: 4230004180
you: How to learn data science from scratch?
gemma: **Step 1: Build a Strong Foundation in Statistics and Math**

- Understand variance, correlations, conditional probabilities, and Bayes' theorem.
- Gain a solid understanding of statistics and probability through online resources or a introductory course.


**Step 2: Learn Programming with Python and R**

- Python and R are popular programming languages used in data science.
- Python for data wrangling and automation.
- R for translating statistical approaches into computer models.


**Step 3: Get Familiar with Databases**

- Understand SQL, a database query language for storing and retrieving data.
- Learn to use databases for storing and managing large datasets.


**Step 4: Learn Analysis Methods**

- Cluster analysis, regression, time series analysis, and cohort analysis are common data analysis techniques.
- Understand when to use each technique and its streng
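
The client loop above distinguishes four kinds of server messages. The sketch below restates that branching as a standalone function. The message shapes are inferred from the loop in this notebook rather than from the service's documentation, and `classify_message` and the sample values are hypothetical.

```python
def classify_message(msg):
    """Label a decoded server message the way the chat loop branches on it."""
    if "text" in msg:
        return "text"      # complete message, e.g. the system greeting
    if msg["pos"] < msg["prompt_size"]:
        return "prefill"   # the prompt is still being processed
    if "token" in msg:
        return "token"     # one streamed piece of the reply
    return "end"           # the reply is complete

# Hypothetical sample messages covering each branch.
samples = [
    {"role": "system", "text": "New session started!"},
    {"role": "gemma", "pos": 3, "prompt_size": 10},
    {"role": "gemma", "pos": 10, "prompt_size": 10, "token": "Hello"},
    {"role": "gemma", "pos": 12, "prompt_size": 10},
]
print([classify_message(m) for m in samples])
```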

Render the chat history as formatted Markdown:

In [14]:
from IPython.display import display, Markdown

for msg in history:
    if msg["role"] == "user":
        display(Markdown("# Q: " + msg["text"]))
    else:
        display(Markdown(msg["text"]))

# Q: How to learn data science from scratch?

**Step 1: Build a Strong Foundation in Statistics and Math**

- Understand variance, correlations, conditional probabilities, and Bayes' theorem.
- Gain a solid understanding of statistics and probability through online resources or a introductory course.


**Step 2: Learn Programming with Python and R**

- Python and R are popular programming languages used in data science.
- Python for data wrangling and automation.
- R for translating statistical approaches into computer models.


**Step 3: Get Familiar with Databases**

- Understand SQL, a database query language for storing and retrieving data.
- Learn to use databases for storing and managing large datasets.


**Step 4: Learn Analysis Methods**

- Cluster analysis, regression, time series analysis, and cohort analysis are common data analysis techniques.
- Understand when to use each technique and its strengths and weaknesses.


**Step 5: Practice and Repeat**

- Build beginner projects to solidify your understanding of data analysis concepts.
- Practice with real-world datasets to develop your analytical skills.


**Step 6: Learn Data Science Tools**

- Spark for batch processing.
- D3.js for data visualization.
- TensorFlow and PyTorch for deep learning and machine learning.


**Step 7: Work on Data Science Projects**

- Create projects related to your interests, such as sentiment analysis, recommendation systems, or data storytelling.
- Network with professionals in the industry and attend industry events.


**Step 8: Always Be Learning**

- Stay updated with industry trends and advancements.
- Attend conferences, read industry publications, and follow influential data scientists.


**Additional Tips:**

- Join online communities and forums to connect with other data scientists.
- Take online courses and tutorials to enhance your skills.
- Seek mentorship from experienced data scientists.

# Q: What mathematical concepts should I learn?

**Mathematical concepts that are fundamental to data science:**

**1. Statistics:**
- Variance
- Correlation
- Probability
- Conditional probabilities
- Bayes' theorem

**2. Probability:**
- Bayes' theorem
- Conditional probability
- Independence
- Bayes' rule

**3. Linear algebra:**
- Vectors and matrices
- Linear transformations
- Eigenvalues and eigenvectors

**4. Calculus:**
- Derivatives and integrals
- Optimization techniques

**5. Statistics of data:**
- Measures of central tendency (mean, median, mode)
- Measures of dispersion (range, standard deviation, variance)

**6. Optimization:**
- Linear programming
- Quadratic programming
- Nonlinear optimization techniques

**7. Numerical analysis:**
- Statistical software (R, Python)
- Mathematical software (Matlab, Mathematica)

**8. Machine learning:**
- Supervised learning
- Unsupervised learning
- Classification
- Regression


**Additional concepts that are helpful:**

- Data visualization
- Data wrangling
- Programming languages (Python, R)
- Big data processing
- Statistical modeling


**Tips for learning mathematical concepts:**

- Start with introductory resources
- Practice with hands-on exercises
- Use online tutorials and resources
- Take online courses and workshops
- Join online communities and forums

# Q: Tell me the most popular programming languages used in data science.

**Most popular programming languages used in data science:**

- **Python:**
    - Widely used for data analysis, machine learning, and web development.
    - Easy to learn and has many libraries and tools available.
- **R:**
    - Ideal for statistical analysis, data visualization, and Bayesian statistics.
    - Strong focus on data manipulation and statistical analysis.
- **Java:**
    - Popular for enterprise applications and data science in financial and insurance industries.
- **Scala:**
    - A high-performance language suitable for large-scale data processing and machine learning.
- **JavaScript:**
    - Essential for interactive dashboards and web applications.
- **SQL:**
    - Used for data manipulation and retrieval from relational databases.

# Conclusions

The output above shows that the AI assistant can extract information from a given post and answer the user's questions in a multi-turn conversation. The same approach could serve other tasks, such as an AI customer service agent that answers FAQs. All you need to do is prepare an introductory document, or just an FAQ list, and set the role of the model in the prompt template.
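
As an illustration of the FAQ case, a context prompt could be assembled from a role description and a list of question/answer pairs. This is only a sketch under stated assumptions: `build_context_prompt` and its wording are hypothetical, and the actual template used by this assistant lives in the repository's `print_prompt.lua`, which is not reproduced here.

```python
def build_context_prompt(role, faqs):
    """Build a plain-text context prompt from a role and (question, answer) pairs."""
    lines = [f"You are {role}. Answer questions using only the FAQ below.", ""]
    for question, answer in faqs:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")  # blank line between entries
    return "\n".join(lines).rstrip() + "\n"

# Hypothetical FAQ content for a customer service assistant.
prompt = build_context_prompt(
    "a customer service assistant",
    [("How do I reset my password?", "Use the reset link on the login page.")],
)
print(prompt)
```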

Because this approach relies only on context prompts, it cannot implement complex AI assistants. However, thanks to the fast CPU inference of **[gemma.cpp](https://github.com/google/gemma.cpp)**, it runs smoothly on **CPU-only devices**. This keeps deployment and operating costs very low, which should help **Gemma**'s language understanding and generation capabilities reach a wider range of applications.