<a href="https://www.kaggle.com/code/ufownl/ai-assistant-for-data-science-beginners?scriptVersionId=202247947" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook is powered by **[lua-cgemma](https://github.com/ufownl/lua-cgemma)**, a library that provides Lua bindings for **[gemma.cpp](https://github.com/google/gemma.cpp)**; see its [README.md](https://github.com/ufownl/lua-cgemma/blob/main/README.md) for more details. Because **[gemma.cpp](https://github.com/google/gemma.cpp)** uses the **[Google Highway Library](https://github.com/google/highway)** for portable SIMD in CPU inference, it runs smoothly in the kernel without any accelerator. The purpose of this notebook is to explore a way to leverage the power of **Gemma** using only CPUs.

*Note: the context prompts of this AI assistant are constructed based on the content of [this post](https://www.springboard.com/blog/data-science/learn-data-science-on-your-own/).*

# Requirements

Before starting, you should have installed:

- [CMake](https://cmake.org/)
- C++ compiler, supporting at least C++17
- [LuaJIT](https://luajit.org/); installing [OpenResty](https://openresty.org/) directly is recommended
- [websockets](https://websockets.readthedocs.io/), for interacting with the AI assistant in a Kaggle Notebook
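
As a quick pre-flight check, the sketch below looks for the tools listed above on `PATH` and for the `websockets` package. The executable names are assumptions (your compiler or OpenResty binaries may be named differently), and the check does not verify C++17 support.

```python
import importlib.util
import shutil

# Executable names are assumptions; adjust them to match your system.
REQUIRED_TOOLS = {
    "CMake": ["cmake"],
    "C++ compiler": ["c++", "g++", "clang++"],
    "OpenResty": ["resty", "openresty"],
}

def find_tools(required=REQUIRED_TOOLS):
    """Map each requirement to the first matching executable on PATH, or None."""
    found = {}
    for name, candidates in required.items():
        found[name] = next(
            (path for path in map(shutil.which, candidates) if path), None
        )
    return found

for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")

# The Python client package is checked by import name rather than on PATH.
has_websockets = importlib.util.find_spec("websockets") is not None
print(f"websockets: {'installed' if has_websockets else 'NOT FOUND'}")
```

Anything reported as `NOT FOUND` is installed by the cells that follow, or can be installed with your system's package manager.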

In [1]:
# Install OpenResty

!conda install -y openssl=1.1.1w
!wget -O - https://openresty.org/package/pubkey.gpg | apt-key add -
!echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/openresty.list
!apt update && apt install -y openresty

Retrieving notices: ...working... done
Channels:
 - rapidsai
 - nvidia
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 24.3.0
    latest version: 24.9.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openssl=1.1.1w


The following packages will be downloaded:

    package             

In [2]:
# Install websockets

!pip install -q -U websockets

# Install lua-cgemma

In [3]:
import os
os.chdir("/opt")

**1st step:** Clone the source code from [GitHub](https://github.com/ufownl/lua-cgemma):

In [4]:
!git clone --branch main https://github.com/ufownl/lua-cgemma.git

Cloning into 'lua-cgemma'...
remote: Enumerating objects: 897, done.
remote: Counting objects: 100% (134/134), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 897 (delta 92), reused 94 (delta 64), pack-reused 763 (from 1)
Receiving objects: 100% (897/897), 183.16 KiB | 5.09 MiB/s, done.
Resolving deltas: 100% (605/605), done.


**2nd step:** Build and install:

In [5]:
!mkdir -p lua-cgemma/build
os.chdir("/opt/lua-cgemma/build")
!cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_CXX_FLAGS="-DGEMMA_TOPK=5" .. && make -j4 && make install

cmake: /opt/conda/lib/libcurl.so.4: no version information available (required by cmake)
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Test HWY_EMSCRIPTEN
-- Performing Test HWY_EMSCRIPTEN - Failed
-- Performing Test HWY_RISCV
-- Performing Test HWY_RISCV - Failed
-- Looking for sys/auxv.h
-- Looking for 

**3rd step:** Test whether the module is installed correctly:

In [6]:
!resty -e 'require("cgemma").info()'


    __
   / /_  ______ _      _________ ____  ____ ___  ____ ___  ____ _
  / / / / / __ `/_____/ ___/ __ `/ _ \/ __ `__ \/ __ `__ \/ __ `/
 / / /_/ / /_/ /_____/ /__/ /_/ /  __/ / / / / / / / / / / /_/ /
/_/\__,_/\__,_/      \___/\__, /\___/_/ /_/ /_/_/ /_/ /_/\__,_/
                         /____/

Date & Time              : 2024-10-20 10:06:59
Max Sequence Length      : 4096
Top-K                    : 5
CPU                      :            Intel(R) Xeon(R) CPU @ 2.20GHz
Instruction Set          : AVX2 (256 bits)
Hardware Concurrency     : 4
Compiled Config          : opt



# Launch AI Assistant Service

In [7]:
import os
os.chdir("/opt")

**1st step:** Clone configuration and data files from [GitHub](https://github.com/ufownl/ai-assistant-for-data-science-beginners): 

In [8]:
!git clone https://github.com/ufownl/ai-assistant-for-data-science-beginners.git

Cloning into 'ai-assistant-for-data-science-beginners'...
remote: Enumerating objects: 61, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 61 (delta 28), reused 26 (delta 5), pack-reused 0 (from 0)
Unpacking objects: 100% (61/61), 108.73 KiB | 1.94 MiB/s, done.


**2nd step:** Link weights file:

In [9]:
os.chdir("/opt/ai-assistant-for-data-science-beginners")
!ln -sf /kaggle/input/gemma-2/gemmacpp/2.0-2b-it-sfp/1/2.0-2b-it-sfp.sbs
!ln -sf /kaggle/input/gemma-2/gemmacpp/2.0-2b-it-sfp/1/tokenizer.spm

**3rd step:** Generate cache of context prompts:

In [10]:
!resty print_prompt.lua | resty /opt/lua-cgemma/tools/cache_prompt.lua --stats
!ls -lh dump.bin

Loading model ...
Random seed of session: 
Reading prompt ...

Done! Session states of the prompt have been dumped to "dump.bin"


Statistics:

  prefill_tokens = 2468
  time_to_first_token = 428.91805054521
  generate_duration = 0.61487515909903
  tokens_generated = 1
  prefill_tokens_per_second = 5.762272921725
  generate_tokens_per_second = 1.6263463976416
  prefill_duration = 428.30321186196
-rw-r--r-- 1 root root 502M Oct 20 10:14 dump.bin
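
The throughput figures in the statistics follow directly from the token counts and durations. The sketch below recomputes them from the values printed above as a sanity check, assuming the durations are in seconds (which the per-second rates imply). It also shows that prefilling the context prompt took roughly seven minutes on this kernel.

```python
# Values copied from the statistics printed above.
prefill_tokens = 2468
prefill_duration = 428.30321186196     # assumed to be seconds
tokens_generated = 1
generate_duration = 0.61487515909903   # assumed to be seconds

# Throughput is simply tokens divided by elapsed time.
prefill_tps = prefill_tokens / prefill_duration
generate_tps = tokens_generated / generate_duration

print(f"prefill:  {prefill_tps:.2f} tokens/s")
print(f"generate: {generate_tps:.2f} tokens/s")
print(f"prefill took {prefill_duration / 60:.1f} minutes")
```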


*Note: `dump.bin` is the cache of the context prompts. It can be saved and reused, so it does not have to be regenerated on every run.*
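
One way to keep the cache between runs is to copy it somewhere persistent. The sketch below is a minimal helper for that; `save_cache` is a hypothetical name, and the `/kaggle/working` destination in the usage comment is an assumption about where you want to keep the file.

```python
import shutil
from pathlib import Path

def save_cache(src, dst_dir):
    """Copy the prompt cache into dst_dir and return the path of the copy."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps, which helps tell stale caches apart.
    return Path(shutil.copy2(src, dst_dir / Path(src).name))

# Example (paths are assumptions):
# save_cache("dump.bin", "/kaggle/working/cache")
```

Copying the saved file back into the project directory before launching the service would then replace the prefill step, provided the model weights and context prompts have not changed.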

**4th step:** Launch assistant service:

In [11]:
!openresty -p . -c assistant.conf
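
Before connecting, it can help to confirm that the service is accepting connections. The sketch below polls a TCP port until it opens; `wait_for_port` is a hypothetical helper, and port 8042 is taken from the URLs used later in this notebook. A successful connection only shows that something is listening, not that the model has finished loading.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)

# Example: wait_for_port("localhost", 8042) before opening the WebSocket.
```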

# Chat with the AI Assistant

Running this notebook in interactive mode lets you chat with the AI assistant yourself. Alternatively, follow the steps in this notebook to install and launch the service on your own computer, then open [http://localhost:8042](http://localhost:8042) in your browser to chat with it.

In [12]:
# Prompter supplies the user's input: typed text in interactive mode, scripted questions otherwise

class Prompter:
    __dummy_input = [
        "How to learn data science from scratch?",
        "What mathematical concepts should I learn?",
        "Tell me the most popular programming languages used in data science.",
        ""
    ]
    
    def __init__(self):
        self.__turn = 0
        self.__interactive = os.environ.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive"
        
    def __call__(self):
        if self.__interactive:
            self.__turn += 1
            return input("you:")
        else:
            dummy = self.__dummy_input[self.__turn]
            self.__turn += 1
            print("you: ", end="")
            print(dummy, flush=True)
            return dummy

In [13]:
import json
from websockets.sync.client import connect

history = []

# Connect to Server and Chat with AI Assistant
with connect("ws://localhost:8042/cgemma/session", open_timeout=None) as ws:
    you = Prompter()
    loading = "-\\|/"
    while True:
        text = None
        msg = json.loads(ws.recv())
        if "text" in msg:
            print(msg["role"], end=": ")
            print(msg["text"], flush=True)
            text = you()
        elif msg["pos"] < msg["prompt_size"]:
            print(msg["role"], end=": ")
            print(loading[msg["pos"] % len(loading)], end="\r", flush=True)
        else:
            if msg["pos"] == msg["prompt_size"]:
                print(msg["role"], end=": ")
                reply = ""
            if "token" in msg:
                print(msg["token"], end="", flush=True)
                reply += msg["token"]
            else:
                print("", flush=True)
                history.append({
                    "role": msg["role"],
                    "text": reply
                })
                text = you()
        if text is not None:
            if text:
                history.append({
                    "role": "user",
                    "text": text
                })
                ws.send(json.dumps(history[-1]))    
            else:
                break

system: New chat session started!
you: How to learn data science from scratch?
gemma: You're on the journey to becoming a data scientist!  That's awesome.  Here's a breakdown of how to approach learning data science from scratch, along with resources to help you:

**1. Lay the Groundwork: Foundational Math & Statistics**

* **Why it's Crucial:**  Data science is deeply rooted in statistical thinking. You need a solid grasp of concepts like:
   * **Statistics:**  Mean, median, mode, standard deviation, probability, hypothesis testing.
   * **Regression:** Understanding how to find the relationship between variables.
   * **Correlation:**  How variables change together.
   * **Data Distributions:**  Normal, skewed, bimodal, etc. 
* **Resources:**
    * **Khan Academy:** [https://www.khanacademy.org/math](https://www.khanacademy.org/math) (Excellent free course on statistics)
    * **OpenIntro Statistics:** [https://openintro.com/](https://openintro.com/) (Another free, well-structured op
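
The chat loop distinguishes four kinds of server message by the fields they carry. The sketch below restates that dispatch as a small function; the field names (`text`, `pos`, `prompt_size`, `token`) come from the loop itself, while the category labels and sample values are illustrative and inferred from the client code, not taken from the server's documentation.

```python
def classify(msg):
    """Classify a decoded server message the same way the chat loop does."""
    if "text" in msg:
        return "notice"    # complete message, e.g. the session greeting
    if msg["pos"] < msg["prompt_size"]:
        return "prefill"   # the model is still reading the prompt
    if "token" in msg:
        return "token"     # one streamed piece of the reply
    return "end"           # reply finished; the client may send the next turn

# Hypothetical messages, for illustration only.
samples = [
    {"role": "system", "text": "New chat session started!"},
    {"role": "gemma", "pos": 3, "prompt_size": 10},
    {"role": "gemma", "pos": 10, "prompt_size": 10, "token": "You"},
    {"role": "gemma", "pos": 12, "prompt_size": 10},
]
for msg in samples:
    print(classify(msg), msg)
```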

Render the chat history as formatted Markdown:

In [14]:
from IPython.display import display, Markdown

for msg in history:
    if msg["role"] == "user":
        display(Markdown("# Q: " + msg["text"]))
    else:
        display(Markdown(msg["text"]))

# Q: How to learn data science from scratch?

You're on the journey to becoming a data scientist!  That's awesome.  Here's a breakdown of how to approach learning data science from scratch, along with resources to help you:

**1. Lay the Groundwork: Foundational Math & Statistics**

* **Why it's Crucial:**  Data science is deeply rooted in statistical thinking. You need a solid grasp of concepts like:
   * **Statistics:**  Mean, median, mode, standard deviation, probability, hypothesis testing.
   * **Regression:** Understanding how to find the relationship between variables.
   * **Correlation:**  How variables change together.
   * **Data Distributions:**  Normal, skewed, bimodal, etc. 
* **Resources:**
    * **Khan Academy:** [https://www.khanacademy.org/math](https://www.khanacademy.org/math) (Excellent free course on statistics)
    * **OpenIntro Statistics:** [https://openintro.com/](https://openintro.com/) (Another free, well-structured option)
    * **MIT OpenCourseWare:**  [https://ocw.mit.edu/](https://ocw.mit.edu/) (Look for intro statistics courses)

**2. Dive Deep: Programming for Data Science**
* **Why it Matters:** Data scientists write code to:
   * **Process Data:** Clean, transform, analyze messy data.
   * **Build Models:**  Develop machine learning algorithms.
   * **Visualize Insights:**  Make your findings compelling to others. 
* **Essential Languages:**
    * **Python:** The king of data science! It's versatile and powerful. ([https://www.python.org/](https://www.python.org/)). 
    * **R:**  Great for statistical analysis and data visualization. ([https://www.r-project.org/](https://www.r-project.org/))
* **Resources:**
    * **Codecademy:**  Interactive courses for Python and R. [https://www.codecademy.com/](https://www.codecademy.com/) (Free and paid options)
    * **DataCamp:**  Data-centric courses with a focus on Python and SQL. ([https://www.datacamp.com/](https://www.datacamp.com/))
    * **FreeCodeCamp:**  Free, comprehensive data science curriculum. ([https://www.freecode camp.org/](https://www.freecodecamp.org/))

**3.  Master the Tools: Essential Data Science Libraries**

* **Python:**
    * NumPy: Powerful numerical computing and array operations.  
    * Pandas: Data wrangling (loading, cleaning, transforming), data structures for working with data.
    * Matplotlib/Seaborn: For creating visualizations (charts, graphs, etc.)
* **R:**
    * dplyr, tidyr: For data manipulation in R.
    * ggplot2:  A versatile package for beautiful data visualizations (like Matplotlib, but for R)
* **Other:**
   * TensorFlow, PyTorch (for deep learning, neural networks)
   * Scikit-Learn: Machine learning models and algorithms

**4. Build a Solid Foundation: Core Data Science Concepts & Techniques**

* **Data Wrangling:**  
   * Data cleaning: Removing errors, inconsistencies, duplicates.
   * Data transformation: Converting data types, scaling data.
   * Data exploration: Understanding your data through summaries, visualizations.
* **Machine Learning:**  
   * Supervised learning:  Predictive models (classification, regression, etc.). 
   * Unsupervised learning:  Clustering, dimensionality reduction, etc.
* **Data Visualization:**
    * Communicating your findings effectively using charts, graphs, etc. 

**5. Project-Based Learning:  Apply What You've Learned**

* **Find Real-World Datasets:** Kaggle: [https://www.kaggle.com/datasets](https://www.kaggle.com/datasets) (Tons of datasets)
    * **Projects:** 
    * **Titanic Survival Prediction:**  Predict passenger survival based on factors from the data.
    * **Movie Recommendation System:**  Predict what movies you'll enjoy based on your past ratings.
    * **Stock Market Prediction:**  Try to predict stock prices using historical data (be aware of limitations).
* **GitHub:**   Create your portfolio of projects to showcase your skills.

**6. Continuous Learning**

* **Stay Updated:** The data science field is constantly evolving. Keep learning new techniques, libraries, and tools!
* **Follow Industry Blogs/Podcasts/News**:  Get insights from experts.

**Key Resources** 

* **Kaggle:** [https://www.kaggle.com/](https://www.kaggle.com/).  Great for competitions, datasets, and learning.
* **Towards Data Science:** [https://towardsdatascience.com/](https://towardsdatascience.com/) - Medium publication with articles about all things data science.
* **DataCamp:**  [https://www.datacamp.com/](https://www.datacamp.com/)
* **Analytics Vidhya:** [https://www.analyticsvidhya.com/](https://www.analyticsvidhya.com/)
* **Machine Learning Mastery:** [https://machinelearningmastery.com/](https://machinelearningmastery.com/)


Let me know if you have any other questions. Good luck on your data science journey!

# Q: What mathematical concepts should I learn?

Excellent question! You're right to focus on math as a foundational aspect of becoming a data scientist. Let's break down the essential math concepts you'll need:

**I. Core Concepts (These are the building blocks of many data science techniques)**

   * **Algebra:**
        * Solving equations
        * Linear Algebra: Matrices, vectors, eigenvalues, eigenvectors (crucial for machine learning algorithms)
   * **Calculus:**
        * Derivatives and limits (for understanding how data changes over time and with different conditions)
        * Integrals (for calculating probabilities and distributions of data)
   * **Statistics:**
        *  Mean, median, standard deviation, variance, and percentiles (basic measures of central tendency and data spread)
        * Probability:  Fundamental to statistical analysis, Bayesian inference, and understanding uncertainty.  
        * Hypothesis Testing: Testing your statistical findings.   
   * **Discrete Mathematics:**  (Helpful for understanding data structures and algorithms)
        * Sets, functions, graphs, counting principles (important for data manipulation, data organization, and algorithms)

**II. Concepts You'll Encounter Frequently in Data Science**

   * **Linear Regression:**  A core model that helps you predict a continuous outcome variable based on one or more predictors (like predicting house prices, stock prices, etc.)
     * You'll learn about coefficients, R-squared, and other metrics to evaluate model performance.  
   * **Logistic Regression:** Used to predict the probability of a binary outcome (e.g., will a customer buy a product or not)
      * Understanding probabilities is key.
   * **Probability Distributions (e.g., Gaussian/Normal, Binomial, Poisson) :**  Understanding the distribution of data helps you analyze it. 

**III.  Important Tips** 
   * **Don't Just Memorize; Understand:**  Math concepts are about understanding relationships, not just memorization. 
   * **Visualize:**  Use graphs and diagrams to understand concepts better, even if they're not perfect.   
   * **Practice:** Work through exercises, problems, and try to solve real-world data challenges to solidify your understanding.

Let me know if you'd like more details on any of these concepts or if you want to learn about specific mathematical tools or algorithms used in data science. Good luck with your mathematical learning journey! 


# Q: Tell me the most popular programming languages used in data science.

You're in the right spot for this!  Here's a breakdown of the most popular programming languages used in data science, and why they're so valuable:

**Top 2 Contenders:**

1. **Python:**  
    * **Pros:** 
        * **Ease of Learning:**  Python's syntax is relatively straightforward and beginner-friendly.
        * **Vast Data Science Libraries:**  A huge ecosystem of libraries (NumPy, Pandas, Matplotlib, scikit-learn, etc.) makes it a power powerhouse. 
        * **Community:**   A huge, supportive community makes it easy to find help and learn.
        * **General-Purpose:**  Python's versatility extends its use beyond just data science, making it a valuable skill.
    * **Cons:** 
        * **Performance:**  Python can be slower than other languages in some computationally-intense tasks.

2.  **R:**  
    * **Pros:**
        * **Statistical Power:** R shines in statistical analysis, especially with its built-in packages for statistical methods. 
        * **Visualization:** R's visualization capabilities are excellent, particularly for creating rich data graphics.
    * **Cons:**
        * **Steeper Learning Curve:** R is often considered more demanding than Python.  
        * **Smaller Ecosystem:**  R's data science libraries, while strong, aren't as vast as those of Python.

 **The Rest of the Pack:** While Python and R hold most of the data science spotlight, here are some other contenders:

   * **Java:**  (Often used for enterprise-scale data analysis and large-scale data processing)
   * **C/C++:**  (Used for performance-critical tasks like high-performance computing)
   * **Scala:** (Used for working with large data sets in Hadoop and Spark)
   * **Julia:** (Growing in popularity for its speed and interactive capabilities for data science)



**Choosing a Language:**

The best language for you depends on your:

1. **Your Experience:**  Python is beginner-friendly, while R can require a bit more effort to learn.
2. **Your Project:** The nature of the project will influence the choice (e.g., large datasets may require more specialized languages).
3. **Your Preferences:**  Choose a language you find most intuitive!


I hope this gives you a clear picture of the top languages used in data science. Let me know if you'd like to delve deeper into a specific language's use cases!

# Conclusions

The output above shows that the AI assistant can extract information from a given post and answer the user's questions in a multi-turn conversation. The same approach could serve other tasks, such as an AI customer service agent that answers FAQs. All you need to do is prepare an introductory document, or just a FAQ list, and set the model's role in the prompt template.
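
As a rough illustration of the FAQ use case, the sketch below assembles a plain-text context prompt from a role description and a list of question/answer pairs. The layout, the `build_context_prompt` helper, and the sample entries are all assumptions made for illustration; the prompt template actually used by this project lives in its repository and is not reproduced here.

```python
def build_context_prompt(role, faq):
    """Assemble a plain-text context prompt from a role description and Q&A pairs."""
    lines = [role.strip(), "", "Answer using only the FAQ below.", ""]
    for i, (question, answer) in enumerate(faq, start=1):
        lines.append(f"Q{i}: {question.strip()}")
        lines.append(f"A{i}: {answer.strip()}")
        lines.append("")
    # Collapse trailing blank lines into a single newline.
    return "\n".join(lines).rstrip() + "\n"

# Hypothetical FAQ entries, for illustration only.
faq = [
    ("What are your opening hours?", "We are open 9:00-17:00, Monday to Friday."),
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
]
print(build_context_prompt("You are a customer service assistant for Example Shop.", faq))
```

Text like this would take the place of the post content in the context prompt, and the resulting session state could then be cached once in the same way as `dump.bin`.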

Because it relies only on context prompts, this approach cannot implement complex AI assistants. However, thanks to the CPU inference speed of **[gemma.cpp](https://github.com/google/gemma.cpp)**, it runs smoothly on **CPU-only devices**. This keeps deployment and operating costs very low, which should help **Gemma**'s language processing and generation capabilities reach a wider range of applications.