<a href="https://www.kaggle.com/code/ufownl/ai-assistant-for-data-science-beginners?scriptVersionId=186321161" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook is powered by **[lua-cgemma](https://github.com/ufownl/lua-cgemma)**, a library that provides Lua bindings for **[gemma.cpp](https://github.com/google/gemma.cpp)**; see [README.md](https://github.com/ufownl/lua-cgemma/blob/main/README.md) for more details. Because **[gemma.cpp](https://github.com/google/gemma.cpp)** uses the **[Google Highway Library](https://github.com/google/highway)** for portable SIMD on CPUs, inference runs smoothly on a Kaggle kernel without any accelerator. The purpose of this notebook is to explore a way to leverage the power of **Gemma** using only CPUs.

*Note: the context prompts of this AI assistant are constructed based on the content of [this post](https://www.springboard.com/blog/data-science/learn-data-science-on-your-own/).*

# Requirements

Before starting, you should have installed:

- [CMake](https://cmake.org/)
- C++ compiler, supporting at least C++17
- [LuaJIT](https://luajit.org/); installing [OpenResty](https://openresty.org/) directly is recommended
- [websockets](https://websockets.readthedocs.io/), for interacting with the AI assistant in the Kaggle Notebook
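
As a quick sanity check, the sketch below looks for the command-line tools from the list above on `PATH`. The executable names (`cmake`, `c++`, `resty`) are assumptions about how these tools are installed on your system; adjust them as needed.

```python
import shutil

# Executable names are assumptions; adjust them to match your system.
REQUIRED_TOOLS = ["cmake", "c++", "resty"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]

missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```
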

In [1]:
# Install OpenResty

!wget -O - https://openresty.org/package/pubkey.gpg | apt-key add -
!echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/openresty.list
!apt update && apt install -y openresty

--2024-07-01 06:06:45--  https://openresty.org/package/pubkey.gpg
Resolving openresty.org (openresty.org)... 52.53.251.226, 52.11.111.175, 2600:1f1c:9b2:8000:f183:c67e:2c64:855f
Connecting to openresty.org (openresty.org)|52.53.251.226|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1688 (1.6K) [text/plain]
Saving to: 'STDOUT'


2024-07-01 06:06:45 (157 MB/s) - written to stdout [1688/1688]

OK
deb http://openresty.org/package/ubuntu focal main
Get:1 http://openresty.org/package/ubuntu focal InRelease [4720 B]
Get:2 https://packages.cloud.google.com/apt gcsfuse-focal InRelease [1225 B]
Get:3 https://packages.cloud.google.com/apt cloud-sdk InRelease [1616 B]
Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease
Get:5 http://security.ubuntu.com/ubuntu focal-security InRelease [128 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease [128 kB]
Get:7 https://packages.cloud.google.com/apt gcsfuse-focal/main amd64 Packages [24.1 kB]


In [2]:
# Install websockets

!pip install -q -U websockets

# Install lua-cgemma

In [3]:
import os
os.chdir("/opt")

**1st step:** Clone the source code from [GitHub](https://github.com/ufownl/lua-cgemma):

In [4]:
!git clone --branch main https://github.com/ufownl/lua-cgemma.git

Cloning into 'lua-cgemma'...
remote: Enumerating objects: 554, done.
remote: Counting objects: 100% (106/106), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 554 (delta 71), reused 79 (delta 56), pack-reused 448
Receiving objects: 100% (554/554), 124.55 KiB | 7.78 MiB/s, done.
Resolving deltas: 100% (351/351), done.


**2nd step:** Build and install:

In [5]:
!mkdir -p lua-cgemma/build
os.chdir("/opt/lua-cgemma/build")
!cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_CXX_FLAGS="-DGEMMA_TOPK=5" .. && make -j4 && make install

cmake: /opt/conda/lib/libcurl.so.4: no version information available (required by cmake)
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Test HWY_EMSCRIPTEN
-- Performing Test HWY_EMSCRIPTEN - Failed
-- Performing Test HWY_RISCV
-- Performing Test HWY_RISCV - Failed
-- Looking for sys/auxv.h
-- Looking for 

**3rd step:** Test whether the module is installed correctly:

In [6]:
!resty -e 'require("cgemma").info()'


    __
   / /_  ______ _      _________ ____  ____ ___  ____ ___  ____ _
  / / / / / __ `/_____/ ___/ __ `/ _ \/ __ `__ \/ __ `__ \/ __ `/
 / / /_/ / /_/ /_____/ /__/ /_/ /  __/ / / / / / / / / / / /_/ /
/_/\__,_/\__,_/      \___/\__, /\___/_/ /_/ /_/_/ /_/ /_/\__,_/
                         /____/

Date & Time              : 2024-07-01 06:16:23
Max Sequence Length      : 4096
Top-K                    : 5
Prefill Token Batch Size : 16
CPU                      :            Intel(R) Xeon(R) CPU @ 2.20GHz
Instruction Set          : AVX2 (256 bits)
Hardware Concurrency     : 4
Compiled Config          : opt



# Launch AI Assistant Service

In [7]:
import os
os.chdir("/opt")

**1st step:** Clone configuration and data files from [GitHub](https://github.com/ufownl/ai-assistant-for-data-science-beginners): 

In [8]:
!git clone https://github.com/ufownl/ai-assistant-for-data-science-beginners.git

Cloning into 'ai-assistant-for-data-science-beginners'...
remote: Enumerating objects: 45, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 45 (delta 17), reused 21 (delta 2), pack-reused 0
Unpacking objects: 100% (45/45), 85.17 KiB | 2.43 MiB/s, done.


**2nd step:** Link weights file:

In [9]:
os.chdir("/opt/ai-assistant-for-data-science-beginners")
!ln -sf /kaggle/input/gemma-2/gemmacpp/2.0-9b-it-sfp/1/9b-it-sfp.sbs
!ln -sf /kaggle/input/gemma-2/gemmacpp/2.0-9b-it-sfp/1/tokenizer.spm
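
`ln -sf` succeeds even when the target does not exist, so it may be worth confirming that both links resolve before moving on. The following is a minimal sketch; `broken_links` is a helper name introduced here for illustration, and the file names are the ones created by the commands above.

```python
import os

def broken_links(paths):
    """Return the paths that are missing or are symlinks whose target does not exist."""
    # os.path.exists follows symlinks, so a dangling link counts as missing.
    return [p for p in paths if not os.path.exists(p)]

# File names from the `ln -sf` commands above, relative to the current directory.
print(broken_links(["9b-it-sfp.sbs", "tokenizer.spm"]))
```

An empty list means both files are reachable from the current directory.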

**3rd step:** Generate cache of context prompts:

In [10]:
!resty print_prompt.lua | resty ../tools/cache_prompt.lua --model 9b-pt --weights 9b-it-sfp.sbs
!ls -lh dump.bin

Lua input file ../tools/cache_prompt.lua not found.
ls: cannot access 'dump.bin': No such file or directory


*Note: `dump.bin` is the cache of context prompts; it can be saved and reused instead of being regenerated on every run.*
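
One way to reuse a saved cache is sketched below. The saved location is hypothetical, and `restore_cache` is a helper name introduced for illustration. The sketch also assumes that a cache is only meaningful for the same weights and context prompts it was generated from.

```python
import os
import shutil

def restore_cache(saved_path, target_path="dump.bin"):
    """Copy a previously saved cache into place; return True if it was restored."""
    if os.path.isfile(saved_path):
        shutil.copyfile(saved_path, target_path)
        return True
    return False

# Hypothetical location of a cache saved from an earlier run.
SAVED_CACHE = "/kaggle/input/my-prompt-cache/dump.bin"
if not restore_cache(SAVED_CACHE):
    print("No saved cache found; generate dump.bin with the step above.")
```
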

**4th step:** Launch assistant service:

In [11]:
!openresty -p . -c assistant.conf

# Chat to AI Assistant

Running this notebook in interactive mode lets you chat with the AI assistant yourself. Alternatively, follow the instructions in this notebook to install and launch the AI assistant service on your own computer, then open [http://localhost:8042](http://localhost:8042) in your browser to chat with it.
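
The chat client in this section reads JSON messages from the WebSocket and branches on their fields. The sketch below restates that branching as a small function to make the message kinds explicit. The field names (`text`, `pos`, `prompt_size`, `token`) come from the client code in this notebook; the category labels are made up for illustration, and the server may send fields that this client does not use.

```python
def classify(msg):
    """Classify a decoded server message the same way the client loop does."""
    if "text" in msg:
        return "notice"   # a complete text message, e.g. the session greeting
    if msg["pos"] < msg["prompt_size"]:
        return "prefill"  # the prompt is still being processed
    if "token" in msg:
        return "token"    # one streamed piece of the reply
    return "end"          # the reply is complete

print(classify({"role": "gemma", "pos": 3, "prompt_size": 10}))
```
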

In [12]:
# Define the prompter input

class Prompter:
    __dummy_input = [
        "How to learn data science from scratch?",
        "What mathematical concepts should I learn?",
        "Tell me the most popular programming languages used in data science.",
        ""
    ]
    
    def __init__(self):
        self.__turn = 0
        self.__interactive = os.environ.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive"
        
    def __call__(self):
        if self.__interactive:
            self.__turn += 1
            return input("you:")
        else:
            dummy = self.__dummy_input[self.__turn]
            self.__turn += 1
            print("you: ", end="")
            print(dummy, flush=True)
            return dummy

In [13]:
import json
from websockets.sync.client import connect

history = []

# Connect to Server and Chat with AI Assistant
with connect("ws://localhost:8042/cgemma/session", open_timeout=None) as ws:
    you = Prompter()
    loading = "-\\|/"
    while True:
        text = None
        msg = json.loads(ws.recv())
        if "text" in msg:
            print(msg["role"], end=": ")
            print(msg["text"], flush=True)
            text = you()
        elif msg["pos"] < msg["prompt_size"]:
            print(msg["role"], end=": ")
            print(loading[msg["pos"] % len(loading)], end="\r", flush=True)
        else:
            if msg["pos"] == msg["prompt_size"]:
                print(msg["role"], end=": ")
                reply = ""
            if "token" in msg:
                print(msg["token"], end="", flush=True)
                reply += msg["token"]
            else:
                print("", flush=True)
                history.append({
                    "role": msg["role"],
                    "text": reply
                })
                text = you()
        if text is not None:
            if text:
                history.append({
                    "role": "user",
                    "text": text
                })
                ws.send(json.dumps(history[-1]))
            else:
                break

system: New session started! Random seed of current session: 3785561245
you: How to learn data science from scratch?
gemma: ##  Embarking on Your Data Science Journey: A Roadmap From Scratch

Learning data science is an exciting venture! It's a field brimming with opportunities and constantly evolving. Here's a structured roadmap to guide you from scratch:

**1. Build a Strong Foundation:**

* **Mathematics:**

    * **Linear Algebra:** Vectors, matrices, eigenvalues, eigenvectors (essential for machine learning algorithms).
    * **Calculus:** Derivatives, integrals (understanding optimization in machine learning).
    * **Probability & Statistics:** Distributions, hypothesis testing, confidence intervals (data interpretation and modeling).

* **Programming:**

    * **Python:** The dominant language in data science. Learn syntax, data structures, control flow, and libraries like NumPy, Pandas, Matplotlib.

**2. Dive into Data Science Fundamentals:**

* **Data Wrangling & Preprocessin

Formatted output chat history:

In [14]:
from IPython.display import display, Markdown

for msg in history:
    if msg["role"] == "user":
        display(Markdown("# Q: " + msg["text"]))
    else:
        display(Markdown(msg["text"]))

# Q: How to learn data science from scratch?

##  Embarking on Your Data Science Journey: A Roadmap From Scratch

Learning data science is an exciting venture! It's a field brimming with opportunities and constantly evolving. Here's a structured roadmap to guide you from scratch:

**1. Build a Strong Foundation:**

* **Mathematics:**

    * **Linear Algebra:** Vectors, matrices, eigenvalues, eigenvectors (essential for machine learning algorithms).
    * **Calculus:** Derivatives, integrals (understanding optimization in machine learning).
    * **Probability & Statistics:** Distributions, hypothesis testing, confidence intervals (data interpretation and modeling).

* **Programming:**

    * **Python:** The dominant language in data science. Learn syntax, data structures, control flow, and libraries like NumPy, Pandas, Matplotlib.

**2. Dive into Data Science Fundamentals:**

* **Data Wrangling & Preprocessing:** Cleaning, transforming, and preparing data for analysis (using Pandas).
* **Exploratory Data Analysis (EDA):**

    * Visualizing data with Matplotlib, Seaborn, or Plotly to uncover patterns and insights.
    * Summarizing data using descriptive statistics.

* **Machine Learning:**

    * **Supervised Learning:** Regression (predicting continuous values), classification (predicting categories). Algorithms: Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines, Naive Bayes.
    * **Unsupervised Learning:** Clustering (grouping similar data points), dimensionality reduction (reducing data complexity). Algorithms: K-Means, Principal Component Analysis (PCA).

* **Model Evaluation & Selection:**

    * Metrics like accuracy, precision, recall, F1-score for evaluating model performance.
    * Cross-validation, hyperparameter tuning for optimizing models.

**3. Expand Your Skillset:**

* **Deep Learning:** Neural networks for complex pattern recognition. Frameworks like TensorFlow or PyTorch.
* **Big Data Technologies:** Hadoop, Spark for handling massive datasets.
* **Cloud Computing:** AWS, Azure, GCP for scalable data processing and storage.
* **Data Visualization:** Creating compelling and informative visualizations with tools like Tableau or Power BI.

**4. Practice, Practice, Practice:**

* **Personal Projects:** Apply your knowledge to real-world problems.
* **Kaggle Competitions:** Participate in data science challenges to test your skills and learn from others.
* **Open-Source Contributions:** Contribute to data science libraries or projects.

**5. Stay Updated:**

* **Read Blogs & Research Papers:** Keep abreast of the latest advancements in the field.
* **Attend Conferences & Workshops:** Network with other data scientists and learn from experts.

**Resources:**

* **Online Courses:** Coursera, edX, Udacity, DataCamp
* **Books:** "Python for Data Analysis" by Wes McKinney, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron
* **Websites:** Towards Data Science, Analytics Vidhya, KDnuggets

**Remember:** Data science is a marathon, not a sprint. Be patient, persistent, and enjoy the learning process!

# Q: What mathematical concepts should I learn?

You're right to focus on math! It's the backbone of data science. Here's a breakdown of the essential mathematical concepts you should learn, categorized by their relevance to different data science tasks:

**1. Linear Algebra:**

* **Vectors & Matrices:**  Understanding how to represent data as vectors and matrices is fundamental. You'll use them for data manipulation, transformations, and representing linear relationships.
* **Matrix Operations:** Learn addition, subtraction, multiplication, and inversion of matrices. These are crucial for many machine learning algorithms.
* **Eigenvalues & Eigenvectors:**  These concepts are essential for dimensionality reduction techniques like Principal Component Analysis (PCA) and understanding the structure of data.

**2. Calculus:**

* **Derivatives:** Used in optimization algorithms (like gradient descent) that are at the heart of training machine learning models.
* **Integrals:**  Less directly used in core algorithms, but helpful for understanding probability distributions and Bayesian statistics.

**3. Probability & Statistics:**

* **Probability Distributions:** Understand common distributions like normal, binomial, and Poisson. This helps in modeling data and making predictions.
* **Hypothesis Testing:**  Learn to test assumptions about your data and draw statistically significant conclusions.
* **Confidence Intervals:**  Estimate the range within which a population parameter likely falls.
* **Statistical Inference:**  Make inferences about a population based on a sample of data.

**4. Additional Concepts:**

* **Set Theory:** Understanding sets and their operations is helpful for data manipulation and set-based algorithms.
* **Discrete Mathematics:**  Concepts like logic, graph theory, and combinatorics can be useful in specific data science domains.

**Learning Tips:**

* **Start with the basics:** Build a strong foundation in linear algebra and probability before diving into more advanced topics.
* **Focus on applications:** Learn math concepts in the context of data science problems. This will make the learning more meaningful and memorable.
* **Practice, practice, practice:** Work through exercises and real-world examples to solidify your understanding.
* **Don't be afraid to ask for help:** There are many online resources and communities where you can get support.


Remember, you don't need to be a math whiz to succeed in data science. But a solid understanding of these key concepts will significantly enhance your ability to analyze data, build models, and interpret results.

# Q: Tell me the most popular programming languages used in data science.

You're right to think about programming languages! They are your tools for working with data.  

Here's a rundown of the most popular languages in data science:

**1. Python:** 👑 **The King**

* **Why it's popular:**  Python's clean syntax, vast libraries (NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch), and strong community support make it the dominant language in data science.
* **Strengths:**  Beginner-friendly, versatile, excellent for data manipulation, analysis, visualization, and machine learning.

**2. R:** 📊 **The Statistician's Choice**

* **Why it's popular:** R was built specifically for statistical computing and graphics. It excels in data visualization and statistical modeling.
* **Strengths:** Powerful statistical analysis tools, extensive packages for statistical modeling and visualization, great for academic research.

**3. SQL:** 🗄️ **The Data Wrangler**

* **Why it's popular:**  SQL (Structured Query Language) is essential for querying and managing data stored in relational databases – a common data source.
* **Strengths:**  Efficiently retrieve, filter, and manipulate data within databases.

**4. Java & Scala:** 🚀 **The Big Data Players**

* **Why they're used:**  Java and Scala are often used in big data frameworks like Hadoop and Spark for processing massive datasets.
* **Strengths:**  Scalability, performance, and support for distributed computing.

**5. Julia:** 📈 **The Rising Star**

* **Why it's gaining traction:** Julia is a relatively new language designed for high-performance numerical computing. It's gaining popularity for its speed and ease of use.
* **Strengths:**  Combines the ease of Python with the performance of C/C++.

**Which Language Should You Learn?**

* **For beginners:** Python is the best starting point due to its versatility and beginner-friendly nature.
* **For statistical analysis:** R is a strong choice.
* **For working with databases:** Learn SQL.
* **For big data:**  Consider Java or Scala.


Remember, the best language for you depends on your specific interests and goals within data science.





# Conclusions

The output above shows that the AI assistant can extract information from a given post and answer the user's questions in a multi-turn conversation. The same approach could be used for other tasks, such as an AI customer service agent that answers FAQs. All you need to do is prepare an introductory document or a FAQ list and set the role of the model in the prompt template.
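
As a rough illustration of that idea, the sketch below assembles a context prompt from a role description and a FAQ list. The function name, the wording of the instruction, and the FAQ entries are all hypothetical. The turn markers follow Gemma's instruction-tuned chat format, and whether the prompt template in this project uses them in the same way is an assumption.

```python
def build_faq_prompt(role, faq):
    """Build a context prompt from a role description and (question, answer) pairs."""
    lines = [f"You are {role}. Answer questions using only the FAQ below.", ""]
    for question, answer in faq:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    body = "\n".join(lines)
    # Gemma's turn markers; their use in this project's template is an assumption.
    return f"<start_of_turn>user\n{body}<end_of_turn>\n<start_of_turn>model\n"

# Hypothetical FAQ entries for illustration.
faq = [
    ("What are your opening hours?", "9:00 to 17:00 on weekdays."),
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
]
print(build_faq_prompt("a customer service assistant", faq))
```
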

This approach cannot implement complex AI assistants because it relies only on context prompts. However, thanks to the CPU inference speed of **[gemma.cpp](https://github.com/google/gemma.cpp)**, it runs smoothly on **CPU-only devices**. This keeps deployment and operating costs very low, which should allow **Gemma**'s language processing and generation capabilities to be used more and more widely.