<a href="https://www.kaggle.com/code/ufownl/ai-assistant-for-data-science-beginners?scriptVersionId=192586860" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook is powered by **[lua-cgemma](https://github.com/ufownl/lua-cgemma)**, a library that provides Lua bindings for **[gemma.cpp](https://github.com/google/gemma.cpp)**; see [README.md](https://github.com/ufownl/lua-cgemma/blob/main/README.md) for more details. Because **[gemma.cpp](https://github.com/google/gemma.cpp)** uses the **[Google Highway Library](https://github.com/google/highway)** for portable SIMD on CPUs, inference runs smoothly on the kernel without any accelerator. The purpose of this notebook is to explore a way to leverage the power of **Gemma** using only CPUs.

*Note: the context prompts of this AI assistant are constructed based on the content of [this post](https://www.springboard.com/blog/data-science/learn-data-science-on-your-own/).*

# Requirements

Before starting, you should have installed:

- [CMake](https://cmake.org/)
- C++ compiler, supporting at least C++17
- [LuaJIT](https://luajit.org/); installing [OpenResty](https://openresty.org/) directly is recommended
- [websockets](https://websockets.readthedocs.io/), for interacting with the AI assistant in a Kaggle Notebook
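
As a quick sanity check of the requirements above, the following minimal sketch reports which command-line tools are not yet on `PATH`. The executable names (`cmake`, `c++`, `resty`) are assumptions about how these tools are installed on your system, and the helper `missing_tools` is a hypothetical name used only for illustration.

```python
import importlib.util
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Assumed executable names; adjust them to match your installation.
required = ["cmake", "c++", "resty"]
missing = missing_tools(required)
print("Missing tools:", ", ".join(missing) if missing else "none")

# The websockets requirement is a Python package, so check it by import name.
has_websockets = importlib.util.find_spec("websockets") is not None
print("websockets installed:", has_websockets)
```

On a fresh Kaggle kernel, `resty` is expected to be reported as missing until the OpenResty installation cell below has run.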

In [1]:
# Install OpenResty

!wget -O - https://openresty.org/package/pubkey.gpg | apt-key add -
!echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/openresty.list
!apt update && apt install -y openresty

--2024-08-14 04:41:31--  https://openresty.org/package/pubkey.gpg
Resolving openresty.org (openresty.org)... 3.131.85.84, 2600:1f1c:9b2:8000:f183:c67e:2c64:855f
Connecting to openresty.org (openresty.org)|3.131.85.84|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1688 (1.6K) [text/plain]
Saving to: 'STDOUT'


2024-08-14 04:41:31 (130 MB/s) - written to stdout [1688/1688]

OK
deb http://openresty.org/package/ubuntu focal main
Get:1 https://packages.cloud.google.com/apt gcsfuse-focal InRelease [1227 B]
Get:2 https://packages.cloud.google.com/apt cloud-sdk InRelease [1616 B]
Get:3 http://openresty.org/package/ubuntu focal InRelease [4720 B]
Get:4 http://security.ubuntu.com/ubuntu focal-security InRelease [128 kB]
Get:5 https://packages.cloud.google.com/apt gcsfuse-focal/main amd64 Packages [25.6 kB]
Get:6 https://packages.cloud.google.com/apt cloud-sdk/main all Packages [1501 kB]
Hit:7 http://archive.ubuntu.com/ubuntu focal InRelease
Get:8 htt

In [2]:
# Install websockets

!pip install -q -U websockets

# Install lua-cgemma

In [3]:
import os
os.chdir("/opt")

**1st step:** Clone the source code from [GitHub](https://github.com/ufownl/lua-cgemma):

In [4]:
!git clone --branch main https://github.com/ufownl/lua-cgemma.git

Cloning into 'lua-cgemma'...
remote: Enumerating objects: 779, done.
remote: Counting objects: 100% (331/331), done.
remote: Compressing objects: 100% (177/177), done.
remote: Total 779 (delta 229), reused 242 (delta 154), pack-reused 448 (from 1)
Receiving objects: 100% (779/779), 162.20 KiB | 1.67 MiB/s, done.
Resolving deltas: 100% (509/509), done.


**2nd step:** Build and install:

In [5]:
!mkdir -p lua-cgemma/build
os.chdir("/opt/lua-cgemma/build")
!cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_CXX_FLAGS="-DGEMMA_TOPK=5" .. && make -j4 && make install

cmake: /opt/conda/lib/libcurl.so.4: no version information available (required by cmake)
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Test HWY_EMSCRIPTEN
-- Performing Test HWY_EMSCRIPTEN - Failed
-- Performing Test HWY_RISCV
-- Performing Test HWY_RISCV - Failed
-- Looking for sys/auxv.h
-- Looking for 

**3rd step:** Test whether the module is installed correctly:

In [6]:
!resty -e 'require("cgemma").info()'


    __
   / /_  ______ _      _________ ____  ____ ___  ____ ___  ____ _
  / / / / / __ `/_____/ ___/ __ `/ _ \/ __ `__ \/ __ `__ \/ __ `/
 / / /_/ / /_/ /_____/ /__/ /_/ /  __/ / / / / / / / / / / /_/ /
/_/\__,_/\__,_/      \___/\__, /\___/_/ /_/ /_/_/ /_/ /_/\__,_/
                         /____/

Date & Time              : 2024-08-14 04:50:27
Max Sequence Length      : 4096
Top-K                    : 5
CPU                      :            Intel(R) Xeon(R) CPU @ 2.20GHz
Instruction Set          : AVX2 (256 bits)
Hardware Concurrency     : 4
Compiled Config          : opt



# Launch AI Assistant Service

In [7]:
import os
os.chdir("/opt")

**1st step:** Clone configuration and data files from [GitHub](https://github.com/ufownl/ai-assistant-for-data-science-beginners): 

In [8]:
!git clone https://github.com/ufownl/ai-assistant-for-data-science-beginners.git

Cloning into 'ai-assistant-for-data-science-beginners'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 55 (delta 24), reused 24 (delta 4), pack-reused 0 (from 0)
Unpacking objects: 100% (55/55), 99.79 KiB | 1.06 MiB/s, done.


**2nd step:** Link the weights and tokenizer files:

In [9]:
os.chdir("/opt/ai-assistant-for-data-science-beginners")
!ln -sf /kaggle/input/gemma-2/gemmacpp/2.0-2b-it-sfp/1/2.0-2b-it-sfp.sbs
!ln -sf /kaggle/input/gemma-2/gemmacpp/2.0-2b-it-sfp/1/tokenizer.spm
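
For readers who prefer to stay in Python, the linking step can be approximated as below. This is a minimal sketch rather than an exact equivalent of `ln -sf`: the helper `link_into` is a hypothetical name, and it only handles the case where the existing entry is a file or a symlink, not a directory.

```python
import os


def link_into(target, directory="."):
    """Create or replace a symlink named after `target` inside `directory`."""
    link_path = os.path.join(directory, os.path.basename(target))
    # lexists() is also true for dangling symlinks, which exists() would miss.
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(target, link_path)
    return link_path


# Example usage with the paths from the cell above (not executed here):
# link_into("/kaggle/input/gemma-2/gemmacpp/2.0-2b-it-sfp/1/2.0-2b-it-sfp.sbs")
# link_into("/kaggle/input/gemma-2/gemmacpp/2.0-2b-it-sfp/1/tokenizer.spm")
```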

**3rd step:** Generate cache of context prompts:

In [10]:
!resty print_prompt.lua | resty ../tools/cache_prompt.lua --pin_threads 1 --stats
!ls -lh dump.bin

Lua input file ../tools/cache_prompt.lua not found.
ls: cannot access 'dump.bin': No such file or directory


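*Note: `dump.bin` is the cache of context prompts. Once generated, it can be saved and reused instead of being regenerated on every run. In the run shown above it was not created, because `../tools/cache_prompt.lua` could not be found.*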
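
The idea of reusing the cache can be sketched as follows. The helper `ensure_cache` is a hypothetical name, and `build` stands for whatever callable regenerates the file in your setup (for example, a wrapper around the cache-generation command); neither is part of lua-cgemma.

```python
import os


def ensure_cache(cache_path, build):
    """Call `build()` only when `cache_path` is missing or empty.

    Returns True if the cache was rebuilt and False if it was reused.
    """
    if os.path.isfile(cache_path) and os.path.getsize(cache_path) > 0:
        return False
    build()
    # Fail loudly if the build step did not actually produce the file.
    if not os.path.isfile(cache_path):
        raise FileNotFoundError(cache_path)
    return True
```

A check like this makes a failed generation step visible immediately instead of at service start-up.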

**4th step:** Launch assistant service:

In [11]:
!openresty -p . -c assistant.conf

# Chat with the AI Assistant

Running this notebook in interactive mode lets you chat with the AI assistant yourself. Alternatively, follow the instructions in this notebook to install and launch the AI assistant service on your own computer, then open [http://localhost:8042](http://localhost:8042) in your browser to chat with it.
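
Before connecting, it can help to confirm that something is listening on port 8042. The sketch below polls the port until a TCP connection succeeds or a timeout expires; `wait_for_port` is a hypothetical helper, and the timeout values are arbitrary assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, else False."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (not executed here):
# if not wait_for_port("localhost", 8042):
#     print("The assistant service does not seem to be running.")
```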

In [12]:
# Define the prompter input

class Prompter:
    __dummy_input = [
        "How to learn data science from scratch?",
        "What mathematical concepts should I learn?",
        "Tell me the most popular programming languages used in data science.",
        ""
    ]
    
    def __init__(self):
        self.__turn = 0
        self.__interactive = os.environ.get("KAGGLE_KERNEL_RUN_TYPE") == "Interactive"
        
    def __call__(self):
        if self.__interactive:
            self.__turn += 1
            return input("you:")
        else:
            dummy = self.__dummy_input[self.__turn]
            self.__turn += 1
            print("you: ", end="")
            print(dummy, flush=True)
            return dummy

In [13]:
import json
from websockets.sync.client import connect

history = []

# Connect to Server and Chat with AI Assistant
with connect("ws://localhost:8042/cgemma/session", open_timeout=None) as ws:
    you = Prompter()
    loading = "-\\|/"
    while True:
        text = None
        msg = json.loads(ws.recv())
        if "text" in msg:
            print(msg["role"], end=": ")
            print(msg["text"], flush=True)
            text = you()
        elif msg["pos"] < msg["prompt_size"]:
            print(msg["role"], end=": ")
            print(loading[msg["pos"] % len(loading)], end="\r", flush=True)
        else:
            if msg["pos"] == msg["prompt_size"]:
                print(msg["role"], end=": ")
                reply = ""
            if "token" in msg:
                print(msg["token"], end="", flush=True)
                reply += msg["token"]
            else:
                print("", flush=True)
                history.append({
                    "role": msg["role"],
                    "text": reply
                })
                text = you()
        if text is not None:
            if text:
                history.append({
                    "role": "user",
                    "text": text
                })
                ws.send(json.dumps(history[-1]))    
            else:
                break

system: New chat session started!
you: How to learn data science from scratch?
gemma: Here's a comprehensive guide on how to start your data science journey from scratch, breaking it down into manageable steps and resources:

**1. Foundation - Build Your Base Knowledge**

* **Mathematics:**
    * **Linear Algebra:** Vectors, matrices, eigenvalues/eigenvectors (essential for understanding data representations and algorithms).  [Khan Academy](https://www.khanacademy.org/math/linear-algebra), [MIT OpenCourseware](https://ocw.mit.edu/courses/)
    * **Calculus:** Derivatives and integrals (crucial for understanding optimization algorithms). [Khan Academy](https://www.khanacademy.org/math/calculus-ab/calculus-for-business)
    * **Probability & Statistics:** Basic statistical concepts, distributions, hypothesis testing, regression. [Khan Academy](https://www.khanacademy.org/math/statistics-probability)
* **Programming:** Choose a language popular in data science:
    * **Python:** Widely us
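
The client loop above distinguishes four kinds of server messages. The sketch below isolates that branching into a small function so it can be read and tested on its own. The field meanings are inferred from the loop rather than from a documented protocol, and the names `classify_message` and `collect_reply` are hypothetical.

```python
def classify_message(msg):
    """Classify a decoded server message using the same checks as the loop."""
    if "text" in msg:
        return "message"      # a complete message, e.g. the system greeting
    if msg["pos"] < msg["prompt_size"]:
        return "processing"   # the prompt is still being consumed
    if "token" in msg:
        return "token"        # one streamed piece of the reply
    return "end"              # the reply is complete


def collect_reply(messages):
    """Join streamed tokens until the end-of-reply message is seen."""
    parts = []
    for msg in messages:
        kind = classify_message(msg)
        if kind == "token":
            parts.append(msg["token"])
        elif kind == "end":
            break
    return "".join(parts)
```

Keeping the classification separate from the printing logic makes it easier to reuse the same checks in a non-interactive client.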

Formatted chat history output:

In [14]:
from IPython.display import display, Markdown

for msg in history:
    if msg["role"] == "user":
        display(Markdown("# Q: " + msg["text"]))
    else:
        display(Markdown(msg["text"]))

# Q: How to learn data science from scratch?

Here's a comprehensive guide on how to start your data science journey from scratch, breaking it down into manageable steps and resources:

**1. Foundation - Build Your Base Knowledge**

* **Mathematics:**
    * **Linear Algebra:** Vectors, matrices, eigenvalues/eigenvectors (essential for understanding data representations and algorithms).  [Khan Academy](https://www.khanacademy.org/math/linear-algebra), [MIT OpenCourseware](https://ocw.mit.edu/courses/)
    * **Calculus:** Derivatives and integrals (crucial for understanding optimization algorithms). [Khan Academy](https://www.khanacademy.org/math/calculus-ab/calculus-for-business)
    * **Probability & Statistics:** Basic statistical concepts, distributions, hypothesis testing, regression. [Khan Academy](https://www.khanacademy.org/math/statistics-probability)
* **Programming:** Choose a language popular in data science:
    * **Python:** Widely used due to its extensive libraries. [Codecademy](https://www.codecademy.com/learn/learn-python), [DataCamp](https://www.datacamp.com/courses/intro-to-python-for-data-science), [FreeCodeCamp](https://www.freecodecamp.org/learn/scientific-computing-with-python/)
    * **R:**  Great for statistical analysis and visualization. [DataCamp](https://www.datacamp.com/courses/r-for-data-scientists). 
    * **SQL:** Essential for querying and managing data from databases.  [Codecademy](https://www.codecademy.com/learn/learn-sql), [DataCamp](https://www.datacamp.com/courses/sql-for-data-scientists)

**2. Core Concepts in Data Science**
* **Data Wrangling:**
    * **Cleaning:** Handling missing values, outliers, inconsistent formats.
    * **Transformation:**  Scaling, feature scaling, feature engineering.
    * **Data Visualization:** Creating insightful graphs and dashboards using libraries like Matplotlib and Seaborn in Python.
* **Machine Learning:**
    * **Supervised Learning:** Classification (predicting categories) and Regression (predicting continuous values). 
        * **Algorithms:** Linear regression, logistic regression, decision trees, random forests, Support Vector Machines (SVMs).
    * **Unsupervised Learning:**  Clustering (grouping similar data points) and dimensionality reduction.
        * **Algorithms:** K-Means clustering, Principal Component Analysis (PCA).
* **Deep Learning:** (Optional, for more advanced applications)
    * **Neural Networks:** Multi-layered structures designed to learn from complex data.
    * **Deep Learning Libraries:** TensorFlow, Keras (Python), PyTorch.

**3. Hands-On Learning and Projects**

* Start with beginner-friendly projects: 
   * **Titanic Dataset:** Predict passenger survival based on various features.
   * **House Price Prediction:** Analyze housing data to predict property values.
   * **Spam Detection:**  Classify emails as spam or not spam.
* **Kaggle competitions:** Participate in real-world datasets and challenges to gain experience and learn from others.  (https://www.kaggle.com/)
* **Personal Data Analysis:** Explore your own data (music, spending habits, etc.) to gain insights.

**4. Resources & Learning Paths**

* **Online Courses:**
    * **DataCamp:** Excellent for hands-on projects and interactive learning ([https://www.datacamp.com/](https://www.datacamp.com/)).
    * **Coursera:**  Offers a range of courses from universities like Stanford and Johns Hopkins ([https://www.coursera.org/](https://www.coursera.org/)).
    * **edX:** Another platform with university courses ([https://www.edx.org/](https://www.edx.org/)).
* **Books:**
    * *Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow* by Aurélien Géron (practical, beginner-friendly)
    * *Python for Data Analysis* by Wes McKinney (focus on Pandas for data manipulation)
    * *Data Science for Business* by Foster Provost and Andrew Bruce (focus on business applications)
* **YouTube Channels:**
    * **StatQuest with Josh Starmer:**  Clear and engaging explanations of statistical concepts and ML.
    * **Sentdex:**  Tutorials covering Python, machine learning, and deep learning.
    * **freeCodeCamp.org:**  Comprehensive videos on various programming and data science topics.
* **Communities and Forums:**
    * **Kaggle Forums:**  Connect with other data scientists and discuss challenges and solutions.  (https://www.kaggle.com/forums)
    * **Data Science Stack Exchange:** A Q&A platform for all things data science. (https://datascience.stackexchange.com/)


**Key Tips for Success:**

* **Consistency:** Dedicate a specific time frame each week to study and practice. 
* **Start Small:** Break down complex tasks into manageable steps.
* **Embrace Challenges:** Don't be afraid to tackle difficult problems.
* **Collaborate and Learn From Others:**  Join communities, ask questions, and share your work.
* **Keep Practicing:** The more you work with data, the better you'll become. 



Remember, data science is a journey of continuous learning and exploration. Enjoy the process, stay curious, and never stop learning!

# Q: What mathematical concepts should I learn?

Here's a breakdown of the essential mathematical concepts for aspiring data scientists, emphasizing those most relevant to data science:

**1. Linear Algebra** 

* **Vectors and Matrices:** The core building block of data representation.  Learn how to manipulate and operate on them.
* **Linear Equations and Systems:**  Essential for data analysis and modeling. You'll solve systems of equations to find optimal solutions.
* **Eigenvalues and Eigenvectors:**  Understand how to simplify data and reduce dimensionality using these concepts.
* **Inner Products & Orthogonality:** Key for finding optimal projections in data.

**2. Calculus**

* **Differentiation & Derivatives:**  Crucial for understanding how functions change. This helps in optimization algorithms (finding optimal parameters for machine learning models).
* **Integration:**  Used to calculate areas and sums.  Important for data analysis, probability, and statistical modeling.
* **Partial Derivatives:**  Essential for understanding how changes in one variable affect a function with multiple variables (important in gradient descent).

**3. Probability and Statistics**

* **Probability Theory:**  Learn about random variables, probability distributions, and conditional probability.  This forms the basis for understanding statistical models and analyzing data distributions.
* **Statistical Inference:**  Understand hypothesis testing, confidence intervals, and how to draw conclusions from data.
* **Regression & Linear Models:**  Use these to predict relationships between variables.
* **Central Tendency & Dispersion:**  Learn about measures like mean, median, standard deviation, and how these help analyze data distributions.
* **Correlation & Regression Analysis:** Understand and use these to find relationships between variables, a fundamental aspect of data analysis.

**4. Optimization** 

* **Gradient Descent:** A fundamental algorithm in machine learning for optimizing parameters (weights in neural networks) to minimize error.
* **Convex Optimization:**  Understanding this concept is important in many machine learning algorithms.

**5. Additional Helpful Concepts**

* **Discrete Probability and Combinatorics:**  For working with probability in a more discrete way, especially when dealing with data from different domains.

**Learning Resources**

* **Khan Academy** - [https://www.khanacademy.org/math](https://www.khanacademy.org/math): Offers free, interactive courses and exercises for all of these topics.
* **MIT OpenCourseWare** - [https://ocw.mit.edu/](https://ocw.mit.edu/): Access free lectures, assignments, and exams from various MIT courses.
* **Coursera:**  [https://www.coursera.org/](https://www.coursera.org/) - Offers a range of courses from universities and other organizations, including many focusing on mathematical foundations.


**Key Takeaways:**

* **Focus on the Essentials:** Don't try to learn everything at once. Prioritize the concepts relevant to your data science learning path.
* **Practice Makes Perfect:** Work through problems, code examples, and real-world projects to solidify your understanding.
* **Connect Concepts:**  Try to see how different mathematical concepts relate to each other. 



Remember, mastering math is a process.  Don't get discouraged, be patient, and celebrate small victories. Good luck! 




# Q: Tell me the most popular programming languages used in data science.

Here's a breakdown of the most popular languages used in data science, along with their strengths and why they're favored:

**1. Python**

* **Popularity:**  Dominates the data science landscape. 
* **Strengths:**
    * **Libraries & Tools:** A rich ecosystem of libraries like NumPy, pandas, Scikit-learn, matplotlib, pandas, and others, providing everything you need for data analysis, visualization, machine learning.
    * **Beginner-Friendly:**  Easy to learn and read, thanks to its clear syntax and extensive documentation.
    * **Community & Support:**  Vast online communities, forums, and resources.
* **Use Cases:**
    * Data Cleaning & Exploration
    * Exploratory Data Analysis
    * Machine Learning & Deep Learning
    * Data Visualization and Reporting
* **Why It's So Popular:** Its versatility and ease of use, coupled with the abundance of data science-related libraries, have made it the go-to language in the data science world.

**2. R**

* **Popularity:**  Very popular in academia and for statistical analysis.
* **Strengths:**
    * **Statistical Focus:**  Built specifically for statistical computing and data visualization. 
    * **Specialized Packages:**  A wide array of statistical packages for everything from time series analysis, Bayesian models, and more. 
    * **Data Manipulation:**  Excellent for data manipulation and analysis with powerful data structures. 
* **Use Cases:**
    * Statistical Modeling
    * Data Visualization (ggplot2 for highly customizable visualizations)
    * Research and Academia
    * Time series analysis
* **Why It's Popular:**  R excels in statistical modeling, data manipulation, and visualization, making it a powerful tool for data scientists who need strong statistical foundations.

**3. SQL**

* **Popularity:**  Essential for data management and querying.
* **Strengths:**
    * **Data Retrieval:**  Used to access, query, and manipulate data stored in relational databases (MySQL, PostgreSQL, SQLite).
    * **Structured Data:**  Ideal for working with structured data in databases. 
* **Use Cases:**
    * Data Acquisition and Cleaning
    * Data Analysis and Reporting
    * Database Administration
* **Why It's Important:**  A core skill for any data scientist who works with structured data.

**4. Julia**

* **Popularity:**  A rising language with growing traction in the field.
* **Strengths:** 
    * **Performance:**  Designed for high-performance computation (especially in scientific and engineering fields).
    * **Modern Language:**  Focuses on code clarity, readability, and efficiency.
    * **Growing Ecosystem:** A rapidly growing library ecosystem.
* **Use Cases:**
    * High-Performance Computing
    * Scientific Research
    * Data Analysis
* **Why It's Popular:**  Its performance advantage, combined with a modern syntax and growing community, makes it attractive for complex and computationally intensive data analysis.

**Important Notes:**

* **The Choice Depends on the Task:** The "best" language depends on the specific project or task.  
    * Python is great for most general data science work, while R might be the better choice for statistical analysis.  
    * SQL is a must-have for working directly with databases 
* **Learning One Language First:**  It's usually more efficient to start with one language and become proficient in it before expanding into multiple languages. 


Let me know if you'd like to dive deeper into any specific language or have other questions!
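
The same formatting can also produce a single Markdown string, for example to save the conversation to a file. This is a minimal sketch that assumes the history entries have the `role` and `text` keys used in this notebook; `history_to_markdown` is a hypothetical helper.

```python
def history_to_markdown(history):
    """Render chat history as Markdown, with user turns as '# Q:' headings."""
    sections = []
    for msg in history:
        if msg["role"] == "user":
            sections.append("# Q: " + msg["text"])
        else:
            sections.append(msg["text"])
    return "\n\n".join(sections)


# Example usage (not executed here):
# with open("chat.md", "w", encoding="utf-8") as f:
#     f.write(history_to_markdown(history))
```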

# Conclusions

The output above shows that the AI assistant can extract information from a given post and answer the user's questions in a multi-turn conversation. The same approach could also be used for other tasks, such as an AI customer service agent that answers FAQs. All you need to do is prepare an introductory document or just a FAQ list and set the role of the model in the prompt template.
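
To illustrate the FAQ idea, the sketch below assembles a context prompt from a role description and a list of question/answer pairs. The wording and layout of the template are assumptions made for illustration only; they are not the prompt format used by this notebook's repository, and `build_faq_prompt` is a hypothetical helper.

```python
def build_faq_prompt(role, faqs):
    """Build a plain-text context prompt from a role and (question, answer) pairs."""
    lines = [f"You are {role}. Answer questions using only the FAQ below.", ""]
    for question, answer in faqs:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Example with made-up FAQ entries:
prompt = build_faq_prompt(
    "a customer service assistant",
    [("How do I reset my password?", "Use the reset link on the login page.")],
)
print(prompt)
```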

This approach cannot implement complex AI assistants, because it relies only on context prompts. However, thanks to the excellent CPU inference speed of **[gemma.cpp](https://github.com/google/gemma.cpp)**, it can run smoothly on **CPU-only devices**. This keeps deployment and operating costs very low, which should help **Gemma**'s language processing and generation capabilities become more widely used.