<a href="https://www.kaggle.com/code/ufownl/ai-assistant-for-data-science-beginners?scriptVersionId=171773751" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook is powered by **[lua-cgemma](https://github.com/ufownl/lua-cgemma)**, a library that provides Lua bindings for **[gemma.cpp](https://github.com/google/gemma.cpp)**; see [README.md](https://github.com/ufownl/lua-cgemma/blob/main/README.md) for more details. Because **[gemma.cpp](https://github.com/google/gemma.cpp)** uses the **[Google Highway Library](https://github.com/google/highway)** to take advantage of portable SIMD for CPU inference, this notebook runs smoothly on a CPU-only kernel, without any accelerator.

*Note: the context prompts of this AI assistant are constructed based on the content of [this post](https://www.springboard.com/blog/data-science/learn-data-science-on-your-own/).*

# Requirements

Before starting, you should have installed:

- [CMake](https://cmake.org/)
- C++ compiler, supporting at least C++17
- [LuaJIT](https://luajit.org/); installing [OpenResty](https://openresty.org/) directly is recommended
- [websockets](https://websockets.readthedocs.io/), for interacting with the AI assistant in a Kaggle Notebook
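
The list above can be checked programmatically before running the remaining cells. The following is a minimal sketch under stated assumptions: the executable names `cmake`, `c++`, and `resty` are guesses at how the tools appear on `PATH` on your system, and `check_requirements` is a hypothetical helper, not part of lua-cgemma.

```python
import importlib.util
import shutil


def check_requirements(executables=("cmake", "c++", "resty"),
                       modules=("websockets",)):
    """Return {name: bool} telling whether each tool or module was found."""
    found = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        found[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module cannot be located.
        found[mod] = importlib.util.find_spec(mod) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A `MISSING` entry only means the name was not found where this sketch looks; the tool may still be installed under a different name or location.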

In [1]:
# Install OpenResty

!wget -O - https://openresty.org/package/pubkey.gpg | apt-key add -
!echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/openresty.list
!apt update && apt install -y openresty

--2024-04-13 05:32:11--  https://openresty.org/package/pubkey.gpg
Resolving openresty.org (openresty.org)... 3.131.85.84, 2600:1f1c:9b2:8000:f183:c67e:2c64:855f
Connecting to openresty.org (openresty.org)|3.131.85.84|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1688 (1.6K) [text/plain]
Saving to: 'STDOUT'


2024-04-13 05:32:11 (127 MB/s) - written to stdout [1688/1688]

OK
deb http://openresty.org/package/ubuntu focal main
Hit:1 https://packages.cloud.google.com/apt gcsfuse-focal InRelease
Get:2 https://packages.cloud.google.com/apt cloud-sdk InRelease [6361 B]
Get:3 http://openresty.org/package/ubuntu focal InRelease [4720 B]
Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease
Get:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:6 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:7 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Get:8 http://openresty.org/package/ubuntu focal/mai

In [2]:
# Install websockets

!pip install -q -U websockets

# Install lua-cgemma

In [3]:
import os
os.chdir("/opt")

**1st step:** Clone the source code from [GitHub](https://github.com/ufownl/lua-cgemma):

In [4]:
!git clone --branch main https://github.com/ufownl/lua-cgemma.git

Cloning into 'lua-cgemma'...
remote: Enumerating objects: 410, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 410 (delta 67), reused 67 (delta 29), pack-reused 291
Receiving objects: 100% (410/410), 108.69 KiB | 12.08 MiB/s, done.
Resolving deltas: 100% (246/246), done.


**2nd step:** Build and install:

In [5]:
!mkdir -p lua-cgemma/build
os.chdir("/opt/lua-cgemma/build")
!cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_CXX_FLAGS="-DGEMMA_TOPK=5" .. && make -j4 && make install

cmake: /opt/conda/lib/libcurl.so.4: no version information available (required by cmake)
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Test HWY_EMSCRIPTEN
-- Performing Test HWY_EMSCRIPTEN - Failed
-- Performing Test HWY_RISCV
-- Performing Test HWY_RISCV - Failed
-- Looking for sys/auxv.h
-- Looking for 

**3rd step:** Test whether the module is installed correctly:

In [6]:
!resty -e 'require("cgemma").info()'


    __
   / /_  ______ _      _________ ____  ____ ___  ____ ___  ____ _
  / / / / / __ `/_____/ ___/ __ `/ _ \/ __ `__ \/ __ `__ \/ __ `/
 / / /_/ / /_/ /_____/ /__/ /_/ /  __/ / / / / / / / / / / /_/ /
/_/\__,_/\__,_/      \___/\__, /\___/_/ /_/ /_/_/ /_/ /_/\__,_/
                         /____/

Date & Time                                : 2024-04-13 05:37:32
Max Sequence Length                        : 4096
Top-K                                      : 5
Weight Type                                : SFP
Embedder Input Type                        : b16
Prefill Token Batch Size                   : 16
Hardware Concurrency                       : 4
Instruction Set                            : AVX2 (256 bits)
Compiled Config                            : opt



# Launch AI Assistant Service

In [7]:
import os
os.chdir("/opt")

**1st step:** Clone configuration and data files from [GitHub](https://github.com/ufownl/ai-assistant-for-data-science-beginners): 

In [8]:
!git clone https://github.com/ufownl/ai-assistant-for-data-science-beginners.git

Cloning into 'ai-assistant-for-data-science-beginners'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 27 (delta 5), reused 17 (delta 0), pack-reused 0
Unpacking objects: 100% (27/27), 62.61 KiB | 2.85 MiB/s, done.


**2nd step:** Link the weights and tokenizer files:

In [9]:
os.chdir("/opt/ai-assistant-for-data-science-beginners")
!ln -s -f /kaggle/input/gemma/gemmacpp/1.1-2b-it-sfp/1/2b-it-sfp.sbs
!ln -s -f /kaggle/input/gemma/gemmacpp/1.1-2b-it-sfp/1/tokenizer.spm

**3rd step:** Generate prompt context:

In [10]:
!resty print_prompt.lua | resty /opt/lua-cgemma/examples/cache_prompt.lua
!ls -lh dump.bin

Loading model ...
Random seed of session: 3105299078
Reading prompt ...

Done! Session states of the prompt have been dumped to "dump.bin"
-rw-r--r-- 1 root root 86M Apr 13 05:48 dump.bin


*Note: `dump.bin` is the cached state of the context prompts. It can be saved somewhere and reused, so it does not have to be regenerated every time.*
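
One way to reuse the cached state across runs is sketched below. It assumes a persistent directory is available (the `store_dir` argument is hypothetical) and that the regeneration command is wrapped in a zero-argument callable. `restore_or_build` is an illustrative helper, not part of lua-cgemma, and it does not check whether a saved file still matches the current weights or prompt.

```python
import shutil
from pathlib import Path


def restore_or_build(cache_file, store_dir, build):
    """Reuse a saved copy of cache_file if one exists, else build and save it.

    `build` is a zero-argument callable that must create `cache_file`.
    Returns "restored" or "built".
    """
    cache_file = Path(cache_file)
    saved = Path(store_dir) / cache_file.name
    if saved.exists():
        # A previously saved copy is available: skip regeneration.
        shutil.copy2(saved, cache_file)
        return "restored"
    build()
    if not cache_file.exists():
        raise FileNotFoundError(f"build() did not create {cache_file}")
    # Keep a copy so the next run can restore it instead of rebuilding.
    saved.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(cache_file, saved)
    return "built"
```

In this notebook, `build` could run the command from the 3rd step, for example through `subprocess.run(..., shell=True, check=True)`.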

**4th step:** Launch assistant service:

In [11]:
!openresty -p . -c assistant.conf
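
The launch command may return before the service is ready to accept connections (an assumption about how the server starts in the background). A minimal sketch for waiting until it is reachable follows. The default port 8042 is taken from the URL used later in this notebook; `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host="localhost", port=8042, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` before opening the WebSocket connection avoids a connection error if the client cell runs immediately after the launch cell.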

# Chat with the AI Assistant

Running this notebook in interactive mode lets you chat with the AI assistant yourself. Alternatively, follow the instructions in this notebook to install and launch the AI assistant service on your own computer, then open [http://localhost:8042](http://localhost:8042) in your browser to chat with it.

In [12]:
# Define the prompter: reads user input in interactive mode, replays canned input otherwise

class Prompter:
    __dummy_input = [
        "Hi, Gemma!",
        "What mathematical concepts should I learn?",
        "Tell me the most popular programming languages used in data science.",
        ""
    ]
    
    def __init__(self):
        self.__turn = 0
        self.__interactive = os.environ.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive"
        
    def __call__(self):
        if self.__interactive:
            self.__turn += 1
            return input("you: ")
        else:
            dummy = self.__dummy_input[self.__turn]
            self.__turn += 1
            print("you: ", end="")
            print(dummy, flush=True)
            return dummy

In [13]:
import json
from websockets.sync.client import connect

history = []

# Connect to Server and Chat with AI Assistant
with connect("ws://localhost:8042/cgemma/session") as ws:
    you = Prompter()
    loading = "-\\|/"
    while True:
        text = None
        msg = json.loads(ws.recv())
        if "text" in msg:
            print(msg["role"], end=": ")
            print(msg["text"], flush=True)
            text = you()
        elif msg["pos"] < msg["prompt_size"]:
            print(msg["role"], end=": ")
            print(loading[msg["pos"] % len(loading)], end="\r", flush=True)
        else:
            if msg["pos"] == msg["prompt_size"]:
                print(msg["role"], end=": ")
                reply = ""
            if "token" in msg:
                print(msg["token"], end="", flush=True)
                reply += msg["token"]
            else:
                print("", flush=True)
                history.append({
                    "role": msg["role"],
                    "text": reply
                })
                text = you()
        if text is not None:
            if text:
                history.append({
                    "role": "user",
                    "text": text
                })
                ws.send(json.dumps(history[-1]))
            else:
                break

system: New session started! Random seed of current session: 782393255
you: Hi, Gemma!
gemma: **Congratulations on your insightful breakdown of the steps to learn data science!**

**Here are some additional points you might want to consider:**

**1. Specialization:**

* While Python and R are the most popular choices, there are other languages and frameworks available.
* Consider specializing in a specific area of data science that interests you the most, such as machine learning, data visualization, or data warehousing.

**2. Practical Skills:**

* Emphasize the importance of hands-on experience through beginner projects and real-world work.
* Highlight the value of learning data wrangling, ETL, and data management.

**3. Continuous Learning:**

* Mention the importance of staying updated with the latest advancements in data science.
* Discuss the need for lifelong learning and professional development to thrive in this rapidly evolving field.

**4. Networking:**

* Encourage beginner
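
The message shapes that the client loop above relies on can be summarized as a small classifier. This is a sketch inferred only from the fields the client code reads (`text`, `pos`, `prompt_size`, `token`); it is not a specification of the server protocol, and both `classify` and the sample frames are hypothetical.

```python
import json


def classify(raw):
    """Map one JSON message to (kind, payload), mirroring the client loop."""
    msg = json.loads(raw)
    if "text" in msg:
        # A complete text message, such as the session greeting.
        return "notice", msg["text"]
    if msg["pos"] < msg["prompt_size"]:
        # The prompt is still being processed; the loop shows a spinner.
        return "prefill", msg["pos"]
    if "token" in msg:
        # One piece of the streamed reply.
        return "token", msg["token"]
    # No token once the prompt is consumed: the reply is finished.
    return "end", None


if __name__ == "__main__":
    # Made-up frames for illustration only.
    frames = [
        '{"role": "system", "text": "New session started!"}',
        '{"role": "gemma", "pos": 3, "prompt_size": 8}',
        '{"role": "gemma", "pos": 8, "prompt_size": 8, "token": "Hello"}',
        '{"role": "gemma", "pos": 9, "prompt_size": 8}',
    ]
    for frame in frames:
        print(classify(frame))
```

Separating classification from printing keeps the streaming logic testable without a running server.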

Render the chat history as formatted Markdown:

In [14]:
from IPython.display import display, Markdown

for msg in history:
    if msg["role"] == "user":
        display(Markdown("# Q: " + msg["text"]))
    else:
        display(Markdown(msg["text"]))

# Q: Hi, Gemma!

**Congratulations on your insightful breakdown of the steps to learn data science!**

**Here are some additional points you might want to consider:**

**1. Specialization:**

* While Python and R are the most popular choices, there are other languages and frameworks available.
* Consider specializing in a specific area of data science that interests you the most, such as machine learning, data visualization, or data warehousing.

**2. Practical Skills:**

* Emphasize the importance of hands-on experience through beginner projects and real-world work.
* Highlight the value of learning data wrangling, ETL, and data management.

**3. Continuous Learning:**

* Mention the importance of staying updated with the latest advancements in data science.
* Discuss the need for lifelong learning and professional development to thrive in this rapidly evolving field.

**4. Networking:**

* Encourage beginners to actively seek connections with industry professionals through online communities, conferences, and workshops.
* Emphasize the benefits of networking for job opportunities, knowledge sharing, and industry insights.

**5. Personal Motivation:**

* Highlight the importance of personal motivation and passion in pursuing a career in data science.
* Discuss the challenges and rewards of this exciting field.

**6. Tailoring to Specific Goals:**

* Recognize that career goals and industry requirements may vary, so emphasize the flexibility and adaptability of a data science career.
* Discuss the importance of identifying the best path for each individual's situation.

**Additional Resources:**

* **Online Courses:** Coursera, Udemy, edX
* **Books:** “Hands-On Machine Learning” by Aurélien Géron, “Predictive Modeling” by Lars Buerger
* **Communities:** Data Science Stack Exchange, Kaggle, Meetup.com

**Overall:**

* Your guide is comprehensive and provides a solid foundation for beginners.
* It emphasizes the importance of practical skills, continuous learning, and networking.
* It also acknowledges the personal nature of the career and the need for flexibility.

# Q: What mathematical concepts should I learn?

## Mathematical Concepts for Data Science Beginners

**Basic Statistics:**

* Mean
* Median
* Mode
* Standard deviation
* Variance
* Correlation

**Probability:**

* Probability distribution
* Conditional probability
* Bayes' theorem

**Linear Algebra:**

* Vectors
* Matrices
* Eigenvalues and eigenvectors
* Principal component analysis (PCA)

**Statistical Inference:**

* Confidence intervals
* Hypothesis testing
* Regression analysis

**Optimization:**

* Gradient descent
* Linear programming

**Additional Concepts:**

* Time series analysis
* Data visualization
* Machine learning algorithms

**Essentials:**

- Understanding of basic statistics is essential for understanding data and building models.
- Familiarity with linear algebra concepts is helpful for understanding machine learning algorithms.
- Knowing how to visualize data helps communicate insights effectively.
- Familiarity with common optimization methods is important for data-driven decision making.

**Resources:**

- Statisticshowto.com
- Khan Academy Statistics
- Coursera Statistical Learning Specialization

## Additional Tips for Learning Mathematical Concepts

- Start with online resources like Khan Academy and Statisticshowto.com
- Use interactive software to visualize concepts
- Practice with real-world datasets
- Join online communities and forums to ask questions and collaborate with other learners

**Remember:** The specific mathematical concepts you need to learn will depend on your area of data science specialization.

# Q: Tell me the most popular programming languages used in data science.

**Most Popular Programming Languages in Data Science:**

**1. Python:**
- Widely used for data analysis, machine learning, and data visualization.
- Easy to learn and has extensive libraries.
- Suitable for both beginners and experienced programmers.

**2. R:**
- Popular statistical programming language.
- Focuses on statistical analysis and modeling.
- Powerful for data wrangling and data visualization.

**3. SQL:**
- The standard language for managing and retrieving data in relational databases.
- Essential for working with databases in data science.

**4. Java:**
- Widely used for enterprise-level data applications.
- Robust and scalable for handling large datasets.

**5. C++:**
- Powerful and widely used for high-performance data processing.
- Suitable for building complex data analysis and machine learning models.

**6. Scala:**
- Object-oriented programming language with a robust data processing ecosystem.
- Ideal for large-scale data pipelines and data warehousing systems.

**7. JavaScript:**
- Popular language used for web data analysis and visualization.
- Enables building interactive dashboards and data-driven web applications.

**8. TensorFlow:**
- Open-source library for building and deploying machine learning models.
- Popular for large-scale data analysis and image recognition tasks.

**9. Pandas:**
- Library for working with dataframes in Python.
- Easy to use and provides data wrangling capabilities.

**10. Spark:**
- Open-source distributed data processing framework.
- Suitable for handling large datasets and complex data analytics jobs.
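
The rendered history can also be collected into a single Markdown string, for example to save it to a file. A minimal sketch, assuming `history` has the same shape as in the chat cell (a list of dicts with `role` and `text` keys); `history_to_markdown` is a hypothetical helper.

```python
def history_to_markdown(history):
    """Join chat turns into one Markdown string; user turns become '# Q:' headings."""
    parts = []
    for msg in history:
        if msg["role"] == "user":
            parts.append("# Q: " + msg["text"])
        else:
            parts.append(msg["text"])
    # A blank line between turns keeps headings and paragraphs separate.
    return "\n\n".join(parts)
```

The result can be passed to `Markdown(...)` for display or written out with `pathlib.Path.write_text`.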