<a href="https://www.kaggle.com/code/ufownl/ai-assistant-for-data-science-beginners?scriptVersionId=171828643" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook is powered by **[lua-cgemma](https://github.com/ufownl/lua-cgemma)**, a library that provides Lua bindings for **[gemma.cpp](https://github.com/google/gemma.cpp)**; see the [README.md](https://github.com/ufownl/lua-cgemma/blob/main/README.md) for more details. Because **[gemma.cpp](https://github.com/google/gemma.cpp)** uses the **[Google Highway Library](https://github.com/google/highway)** for portable SIMD during CPU inference, this notebook runs smoothly on a Kaggle kernel without any accelerator.

*Note: the context prompts of this AI assistant are constructed based on the content of [this post](https://www.springboard.com/blog/data-science/learn-data-science-on-your-own/).*

# Requirements

Before starting, you should have installed:

- [CMake](https://cmake.org/)
- C++ compiler, supporting at least C++17
- [LuaJIT](https://luajit.org/); installing [OpenResty](https://openresty.org/) directly is recommended
- [websockets](https://websockets.readthedocs.io/), for interacting with the AI assistant in a Kaggle Notebook
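
Before running the installation cells, it can help to see which of these prerequisites are already present. The sketch below is one possible check, not part of the original notebook: the helper name `check_prerequisites` is hypothetical, and the command names `cmake`, `c++`, and `resty` are assumptions about how the tools appear on `PATH`.

```python
import importlib.util
import shutil


def check_prerequisites(commands=("cmake", "c++", "resty"), modules=("websockets",)):
    """Report which commands and Python modules can be found (hypothetical helper)."""
    status = {}
    for cmd in commands:
        # shutil.which returns None when the executable is not on PATH.
        status[cmd] = shutil.which(cmd) is not None
    for mod in modules:
        # find_spec returns None when a top-level module cannot be imported.
        status[mod] = importlib.util.find_spec(mod) is not None
    return status


# Print a short report; the default names above are assumptions, adjust as needed.
for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Anything reported as missing is installed by the cells that follow (OpenResty and websockets) or needs to be provided by the system (CMake and a C++17 compiler).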

In [1]:
# Install OpenResty

!wget -O - https://openresty.org/package/pubkey.gpg | apt-key add -
!echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/openresty.list
!apt update && apt install -y openresty

--2024-04-13 13:19:55--  https://openresty.org/package/pubkey.gpg
Resolving openresty.org (openresty.org)... 3.125.51.27, 2a05:d01c:66b:e800:560c:c0ea:4e5:db36, 2a05:d014:808:3600:d7c8:720c:b584:cdeb
Connecting to openresty.org (openresty.org)|3.125.51.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1688 (1.6K) [text/plain]
Saving to: 'STDOUT'


2024-04-13 13:19:55 (158 MB/s) - written to stdout [1688/1688]

OK
deb http://openresty.org/package/ubuntu focal main
Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:2 https://packages.cloud.google.com/apt gcsfuse-focal InRelease [1225 B]
Get:3 https://packages.cloud.google.com/apt cloud-sdk InRelease [6361 B]
Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease
Get:5 http://openresty.org/package/ubuntu focal InRelease [4720 B]
Get:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:7 http://security.ubuntu.com/ubuntu focal-security/multiverse amd

In [2]:
# Install websockets

!pip install -q -U websockets

# Install lua-cgemma

In [3]:
import os
os.chdir("/opt")

**1st step:** Clone the source code from [GitHub](https://github.com/ufownl/lua-cgemma):

In [4]:
!git clone --branch main https://github.com/ufownl/lua-cgemma.git

Cloning into 'lua-cgemma'...
remote: Enumerating objects: 410, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 410 (delta 67), reused 67 (delta 29), pack-reused 291
Receiving objects: 100% (410/410), 108.69 KiB | 9.06 MiB/s, done.
Resolving deltas: 100% (246/246), done.


**2nd step:** Build and install:

In [5]:
!mkdir -p lua-cgemma/build
os.chdir("/opt/lua-cgemma/build")
!cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_CXX_FLAGS="-DGEMMA_TOPK=5" .. && make -j4 && make install

cmake: /opt/conda/lib/libcurl.so.4: no version information available (required by cmake)
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Test HWY_EMSCRIPTEN
-- Performing Test HWY_EMSCRIPTEN - Failed
-- Performing Test HWY_RISCV
-- Performing Test HWY_RISCV - Failed
-- Looking for sys/auxv.h
-- Looking for 

**3rd step:** Test whether the module is installed correctly:

In [6]:
!resty -e 'require("cgemma").info()'


    __
   / /_  ______ _      _________ ____  ____ ___  ____ ___  ____ _
  / / / / / __ `/_____/ ___/ __ `/ _ \/ __ `__ \/ __ `__ \/ __ `/
 / / /_/ / /_/ /_____/ /__/ /_/ /  __/ / / / / / / / / / / /_/ /
/_/\__,_/\__,_/      \___/\__, /\___/_/ /_/ /_/_/ /_/ /_/\__,_/
                         /____/

Date & Time                                : 2024-04-13 13:24:55
Max Sequence Length                        : 4096
Top-K                                      : 5
Weight Type                                : SFP
Embedder Input Type                        : b16
Prefill Token Batch Size                   : 16
Hardware Concurrency                       : 4
Instruction Set                            : AVX2 (256 bits)
Compiled Config                            : opt



# Launch AI Assistant Service

In [7]:
import os
os.chdir("/opt")

**1st step:** Clone configuration and data files from [GitHub](https://github.com/ufownl/ai-assistant-for-data-science-beginners): 

In [8]:
!git clone https://github.com/ufownl/ai-assistant-for-data-science-beginners.git

Cloning into 'ai-assistant-for-data-science-beginners'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 36 (delta 11), reused 17 (delta 0), pack-reused 0
Unpacking objects: 100% (36/36), 75.94 KiB | 2.71 MiB/s, done.


**2nd step:** Link the weights and tokenizer files:

In [9]:
os.chdir("/opt/ai-assistant-for-data-science-beginners")
!ln -s -f /kaggle/input/gemma/gemmacpp/1.1-2b-it-sfp/1/2b-it-sfp.sbs
!ln -s -f /kaggle/input/gemma/gemmacpp/1.1-2b-it-sfp/1/tokenizer.spm
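
Because `ln -s -f` succeeds even when its target does not exist, a dangling link would only surface later, when the model is loaded. The following is a minimal sketch of a check that could be run at this point; the helper name `missing_files` is hypothetical, and the two file names are taken from the commands above.

```python
import os


def missing_files(directory, names=("2b-it-sfp.sbs", "tokenizer.spm")):
    """Return the names in `directory` that do not resolve to a regular file."""
    # os.path.isfile follows symlinks, so a dangling link counts as missing.
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]
```

Calling `missing_files(".")` from the project directory should return an empty list when both links point at real files.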

**3rd step:** Generate cache of context prompts:

In [10]:
!resty print_prompt.lua | resty /opt/lua-cgemma/examples/cache_prompt.lua
!ls -lh dump.bin

Loading model ...
Random seed of session: 2842775313
Reading prompt ...

Done! Session states of the prompt have been dumped to "dump.bin"
-rw-r--r-- 1 root root 86M Apr 13 13:35 dump.bin


*Note: `dump.bin` is the cache of context prompts. It can be saved and reused, so it does not have to be regenerated every time.*
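
One way to reuse the cache is to regenerate it only when it is missing or older than the files it was built from. The sketch below assumes a simple modification-time comparison. The helper name `cache_is_fresh` is hypothetical, and which files count as sources (for example `print_prompt.lua` and the data files it reads) is an assumption left to the caller.

```python
import os


def cache_is_fresh(cache_path, source_paths):
    """Return True if the cache exists and is at least as new as every existing source."""
    if not os.path.isfile(cache_path):
        return False
    cache_mtime = os.path.getmtime(cache_path)
    # Sources that do not exist are ignored rather than treated as newer.
    return all(
        os.path.getmtime(p) <= cache_mtime
        for p in source_paths
        if os.path.exists(p)
    )
```

For example, the cache generation step could be skipped whenever `cache_is_fresh("dump.bin", ["print_prompt.lua"])` returns `True`. A cache produced with different weights is unlikely to be usable, so it seems safest to keep `dump.bin` alongside the model it was generated with.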

**4th step:** Launch assistant service:

In [11]:
!openresty -p . -c assistant.conf

# Chat with the AI Assistant

Running this notebook in interactive mode lets you chat with the AI assistant yourself. Alternatively, follow the steps in this notebook to install and launch the assistant service on your own computer, then open [http://localhost:8042](http://localhost:8042) in your browser to chat with it.

In [12]:
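# Supplies the user's input: typed text in interactive sessions,
# scripted questions otherwise. An empty string ends the chat.

import os

class Prompter: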
    __dummy_input = [
        "How to learn data science from scratch?",
        "What mathematical concepts should I learn?",
        "Tell me the most popular programming languages used in data science.",
        ""
    ]
    
    def __init__(self):
        self.__turn = 0
        self.__interactive = os.environ.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive"
        
    def __call__(self):
        if self.__interactive:
            self.__turn += 1
            return input("you:")
        else:
            dummy = self.__dummy_input[self.__turn]
            self.__turn += 1
            print("you: ", end="")
            print(dummy, flush=True)
            return dummy

In [13]:
import json
from websockets.sync.client import connect

history = []

# Connect to Server and Chat with AI Assistant
with connect("ws://localhost:8042/cgemma/session") as ws:
    you = Prompter()
    loading = "-\\|/"
    while True:
        text = None
        msg = json.loads(ws.recv())
        if "text" in msg:
            print(msg["role"], end=": ")
            print(msg["text"], flush=True)
            text = you()
        elif msg["pos"] < msg["prompt_size"]:
            print(msg["role"], end=": ")
            print(loading[msg["pos"] % len(loading)], end="\r", flush=True)
        else:
            if msg["pos"] == msg["prompt_size"]:
                print(msg["role"], end=": ")
                reply = ""
            if "token" in msg:
                print(msg["token"], end="", flush=True)
                reply += msg["token"]
            else:
                print("", flush=True)
                history.append({
                    "role": msg["role"],
                    "text": reply
                })
                text = you()
        if text is not None:
            if text:
                history.append({
                    "role": "user",
                    "text": text
                })
                ws.send(json.dumps(history[-1]))    
            else:
                break

system: New session started! Random seed of current session: 3317234081
you: How to learn data science from scratch?
gemma: **Step 1: Build a Foundation in Statistics and Math**

- Statistics and probability are foundational for understanding data science.
- Learn basic concepts like variance, correlations, conditional probabilities, and Bayes’ theorem.


**Step 2: Learn Programming with Python and R**

- Python and R are popular programming languages for data science.
- Python is better for data wrangling and large datasets.
- R is better for translating statistical approaches to computer models.


**Step 3: Get Familiar With Databases**

- Understand relational databases for data storage.
- Learn query languages like SQL.


**Step 4: Learn Analysis Methods**

- Learn different data analysis techniques like cluster analysis, regression, and time series analysis.


**Step 5: Practice and Repeat**

- Practice by working on beginner projects.
- Learn by doing and gaining a functional und
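
The client loop above distinguishes four kinds of server message by their fields. The sketch below restates that branching as a standalone function. The field names (`text`, `pos`, `prompt_size`, `token`) come from the loop itself, while the function name `classify_message` and the labels it returns are hypothetical. It is inferred only from the client code, not from any server-side specification.

```python
import json


def classify_message(raw):
    """Label a server message the same way the client loop branches on it."""
    msg = json.loads(raw)
    if "text" in msg:
        return "text"      # a complete message, such as the session greeting
    if msg["pos"] < msg["prompt_size"]:
        return "prefill"   # the server is still processing the prompt
    if "token" in msg:
        return "token"     # one streamed piece of the reply
    return "end"           # no token field: the reply is complete
```

Separating the classification from the printing logic makes it easier to reuse the same handling in a client that is not a notebook.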

Display the chat history as formatted Markdown:

In [14]:
from IPython.display import display, Markdown

for msg in history:
    if msg["role"] == "user":
        display(Markdown("# Q: " + msg["text"]))
    else:
        display(Markdown(msg["text"]))

# Q: How to learn data science from scratch?

**Step 1: Build a Foundation in Statistics and Math**

- Statistics and probability are foundational for understanding data science.
- Learn basic concepts like variance, correlations, conditional probabilities, and Bayes’ theorem.


**Step 2: Learn Programming with Python and R**

- Python and R are popular programming languages for data science.
- Python is better for data wrangling and large datasets.
- R is better for translating statistical approaches to computer models.


**Step 3: Get Familiar With Databases**

- Understand relational databases for data storage.
- Learn query languages like SQL.


**Step 4: Learn Analysis Methods**

- Learn different data analysis techniques like cluster analysis, regression, and time series analysis.


**Step 5: Practice and Repeat**

- Practice by working on beginner projects.
- Learn by doing and gaining a functional understanding of concepts.


**Step 6: Learn Data Science Tools**

- Learn popular tools like Spark and TensorFlow.


**Step 7: Work on Data Science Projects**

- Build projects related to your interests and company requirements.


**Step 8: Become a Data Storyteller**

- Communicate your findings in a clear and concise way.
- Use narratives, visualizations, and storytelling skills.


**Step 9: Network and Stay Updated**

- Network with industry professionals and stay informed of advancements.
- Follow influencers and read industry publications.


**Step 10: Never Stop Learning**

- Continue learning and expanding your knowledge in data science.


**Additional Tips:**

- Start with a strong foundation in statistics and math.
- Practice regularly.
- Learn by doing and gain a functional understanding of concepts.
- Network with industry professionals.
- Stay up-to-date with the latest trends and technologies.

# Q: What mathematical concepts should I learn?

**Mathematical Concepts for Data Science:**

**Basic Statistics:**

- Variance
- Correlation
- Standard deviation
- Mode
- Probability distribution
- Hypothesis testing
- Confidence intervals

**Probability and Statistics:**

- Bayes’ theorem
- Expected value
- Conditional probability
- Independence
- Bayes’ theorem for multiple variables

**Linear Algebra:**

- Vectors and matrices
- Linear transformations
- Eigenvalues and eigenvectors
- Singular value decomposition


**Calculus:**

- Derivatives and integrals
- Optimization
- Numerical analysis


**Programming Concepts:**

- Python syntax
- Data manipulation
- Statistical functions
- Data visualization
- Machine learning algorithms


**Additional Concepts:**

- Deep learning
- Machine learning algorithms
- Data wrangling
- Cloud computing
- Big data analytics


**Specific Concepts Relevant to Specific Data Science Fields:**

- **Finance:** Portfolio optimization, risk analysis
- **Marketing:** Customer segmentation, market research
- **Healthcare:** Patient data analysis, disease prediction
- **Retail:** Demand forecasting, customer churn analysis


**Tips:**

- Start with the basics and progress to more advanced concepts.
- Practice regularly to solidify your understanding.
- Connect mathematical concepts to real-world applications.
- Join industry communities to network and learn from others.

# Q: Tell me the most popular programming languages used in data science.

**Most Popular Programming Languages Used in Data Science:**

**1. Python:**
- Widely used for data analysis, machine learning, and web development.
- Easy to use and learn, with a large and active community.

**2. R:**
- Primarily for statistical analysis and data visualization.
- Provides statistical functions that are specifically tailored for data analysis.

**3. SQL:**
- The standard query language for relational databases, essential for working with data in a practical setting.

**4. Java:**
- Popular choice for large-scale data processing and enterprise-level applications.
- Widely used in banking, finance, and other industries.

**5. Spark:**
- A distributed processing framework used for big data processing and machine learning.
- Offers fast and scalable data processing, making it suitable for handling massive datasets.

**6. TensorFlow:**
- An open-source machine learning framework based on the Google TensorFlow library.
- Designed to be efficient and scalable for deploying machine learning models.

**7. RShiny:**
- A web-based data visualization framework.
- Allows users to create interactive dashboards and charts.

**8. D3.js:**
- A JavaScript library for creating interactive and custom-designed visualizations.

**9. Apache Spark:**
- A distributed data processing framework for big data analysis.
- Provides efficient and scalable processing across multiple nodes.

**10. NumPy:**
- A library for efficient numerical computing in Python.
- Essential for working with large datasets and performing advanced calculations.

# Conclusions

The output above shows that the AI assistant can extract information from a given post and answer the user's questions over a multi-turn conversation. The same approach could serve other tasks, such as an AI customer service agent that answers frequently asked questions. All you need to do is prepare an introductory document or an FAQ list and set the model's role in the prompt template.
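
As an illustration of the FAQ case, the sketch below assembles the text of a context prompt from a role description and a list of question/answer pairs. Everything here is an assumption made for illustration: the function name `build_faq_prompt`, the wording of the instruction sentence, and the `Q:`/`A:` layout are not taken from this project's prompt template, which would still need to wrap the resulting text.

```python
def build_faq_prompt(role, faq):
    """Build context-prompt text from a role and (question, answer) pairs."""
    # The instruction wording below is an assumption, not the project's template.
    lines = [f"You are {role}. Answer questions using only the FAQ below.", ""]
    for question, answer in faq:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")  # blank line between entries
    return "\n".join(lines).rstrip()


# Hypothetical example data.
prompt = build_faq_prompt(
    "a support agent for a bookshop",
    [("When do you open?", "At 9 am."), ("Do you ship abroad?", "Yes.")],
)
print(prompt)
```

The resulting text could then be cached once, in the same way as the context prompts in this notebook, so that each new session starts from the prepared FAQ.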

Because this approach relies only on context prompts, it cannot implement complex AI assistants. However, thanks to the fast CPU inference of **[gemma.cpp](https://github.com/google/gemma.cpp)**, it runs smoothly on **CPU-only devices**. This keeps deployment and operating costs very low, which should allow **Gemma**'s language understanding and generation capabilities to be used more and more widely.