<a href="https://colab.research.google.com/github/tfoesch/awesome-python/blob/master/ML_Intro_1_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example 1

If you want to save everything, click `File -> Save a copy in Drive`.

Now you have it in your drive and can always get back to it!

## Step -1: Download Dataset

We download the following dataset: https://github.com/eth-student-project-house/ws-ml-intro-1


In [None]:
# The ! at the beginning is specific to Jupyter notebooks. It tells the server to execute the line as
# shell code (instead of Python).
# wget: program to download
# unzip: program to unzip (-o tells to overwrite if files exist)
!wget https://github.com/eth-student-project-house/ws-ml-intro-1/archive/refs/heads/main.zip
!unzip -o /content/main.zip

On the left-hand side, you can click on the folder icon and browse all files in this runtime. When you double-click a file, you can see a preview.


## Step 0: Create Embeddings

Download a vector database: chromadb

In [None]:
!pip install chromadb
# It finishes with `Successfully installed backoff-2.2.1 chromadb-0.3.25...`
# You might see an error message just before.

1. Initialize the database and create what chromadb calls a collection. We tell it to use its own embedding model. (We could also use OpenAI embeddings.)

2. Then, load all data into this collection

In [None]:
import chromadb
import os

from chromadb.config import Settings
from chromadb.utils  import embedding_functions

client = chromadb.Client(Settings(
  chroma_db_impl="duckdb+parquet",
  persist_directory="/content/.chromadb" # Optional, defaults to .chromadb/ in the current directory
))

# embedding function
default_ef = embedding_functions.DefaultEmbeddingFunction()

collection = client.create_collection(name="ml-intro-1", embedding_function=default_ef)
print("created new collection")

dataDirectory = "/content/ws-ml-intro-1-main/data/base"

for filename in os.listdir(dataDirectory):
  if filename.endswith('.md'):
    # Open the file and read its contents
    filepath = os.path.join(dataDirectory, filename)
    with open(filepath, 'r', encoding='utf-8') as f:
      fileContent = f.read()

    # Add the document to the collection, using its path as a unique id
    print(f"adding document {filepath}")
    collection.add(documents=[fileContent], ids=[filepath])

Let's have a look at our database and its vectors.

In [None]:
collection.peek(limit=1)

## Step 1+2: Sample Questions

- Are there any green projects?
- Are there any green projects? If so, please summarize their similarities.
- Are there any green projects? If so, please summarize their similarities. Output it in form of a list
- What is the Makerspace?
- What is the Ideaspace?
- What is the Digital Makerspace?
- What is the SPH?

In [None]:
#### Step 1: Question
question = "Are there any green projects?" #@param {type:"string"}
nr_of_results = 2 #@param {type:"slider", min:1, max:10}
#### Step 2: Search Vector Database
results = collection.query(query_texts=[question], n_results=nr_of_results)
# A bare expression on the last line of a cell displays its value
results
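
Conceptually, `collection.query` embeds the question and returns the stored documents whose vectors lie closest to it. The toy sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. The vectors, the document names, and the helper `top_k` are made up for illustration, and chromadb's actual embedding model and distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored_vectors, k):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored_vectors,
        key=lambda doc_id: cosine_similarity(query_vector, stored_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings; real ones have hundreds of dimensions.
toy_vectors = {
    "solar-project.md": [0.9, 0.1, 0.0],
    "makerspace.md": [0.1, 0.9, 0.2],
    "recycling-project.md": [0.8, 0.2, 0.1],
}
toy_question = [1.0, 0.0, 0.0]  # stands in for the embedded question

print(top_k(toy_question, toy_vectors, k=2))
```

The two "green" toy documents point in roughly the same direction as the question vector, so they rank ahead of the unrelated one.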


## Step 3 - Use the Language Model (FINALLY)

Now we are ready to create the final prompt.

First install the libraries.

In [None]:
!pip install openai
!pip install tiktoken

Since we are using "Open"AI we need an API Key (for billing). This key is provided by SPH.

In [None]:
from getpass import getpass

import openai
import tiktoken

# Paste the key provided by SPH when prompted; avoid hard-coding secrets in a shared notebook.
OPENAI_API_KEY = getpass("OpenAI API key: ")
openai.api_key = OPENAI_API_KEY

We tell the model how to act. We use the "chat" function.

- We first give it a system prompt. This tells the Language Model how to act
- And finally, we construct *our* question and give it some context.
- Later, we might want to add an example, so that the assistant responds in the way we want it to.

Basically, you have a lot of freedom in how you create these prompts and examples.

In [None]:
### Now tell the language model how to act. This is an example. (It might be a rather dumb instruction given our context!)

# Behaviour
system = f"""\
You are a helpful assistant. 
"""

# Question + Context
### This is where we add our question ⬇⬇
### TODO: You might need to properly print the array
user2 = f"""\
CONTEXT: {results['documents'][0]}

QUESTION: {question}
"""

### This is the final prompt
messages=[
    {"role": "system", "content": system},
    {"role": "user", "content": user2},
]
messages
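
`results['documents'][0]` is a Python list holding the documents returned for the first (and only) query text, so interpolating it directly into the f-string puts the list's printed form (brackets, quotes, escaped newlines) into the prompt. One possible way to address the TODO is to join the documents into plain text first. This is a minimal sketch: `format_context`, the separator, and the `example_*` values are made up for illustration and only mimic the shape of a chromadb query result.

```python
def format_context(documents, separator="\n\n---\n\n"):
    """Join retrieved documents into one readable CONTEXT string."""
    return separator.join(doc.strip() for doc in documents)

# Hypothetical stand-in with the same shape as a chromadb query result:
# one inner list of documents per query text.
example_results = {
    "documents": [["# Project A\nSolar panels.\n", "# Project B\nRecycling.\n"]]
}
example_question = "Are there any green projects?"

example_prompt = f"""\
CONTEXT:
{format_context(example_results['documents'][0])}

QUESTION: {example_question}
"""
print(example_prompt)
```

In the notebook, the same helper could be applied to the real `results['documents'][0]` when building `user2`.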

In [None]:
### Step 3: Call to OpenAI
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0.6,
    max_tokens=800,
    top_p=1,
    frequency_penalty=1,
    presence_penalty=1
)

response.choices[0].message.content
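
To build some intuition for the `temperature` parameter in the call above, here is a minimal sketch of the textbook softmax with temperature. The scores in `toy_logits` are made up, and how OpenAI applies the parameter internally is an assumption here; the sketch only shows the general idea that a lower temperature concentrates probability on the top-scoring token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for temperature in (0.2, 0.6, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Note that this sketch divides by the temperature, so it does not accept a value of 0.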

Now let's see how the Language Model replies when we give it a more precise set of instructions:

In [None]:
# source: https://github.com/gannonh/chatgpt-pgvector
### Now tell the language model how to act. This is an example. (It might be a rather dumb instruction given our context!)

# Behaviour
system = f"""\
You are a helpful assistant. When given CONTEXT you answer questions using only that information, \
and you always format your output in markdown. You include code snippets if relevant. If you are unsure and the answer \
is not explicitly written in the CONTEXT provided, you say \
"Sorry, I don't know how to help with that."  If the CONTEXT includes \
source URLs include them under a SOURCES heading at the end of your response. Always include all of the relevant source urls \
from the CONTEXT, but never list a URL more than once (ignore trailing forward slashes when comparing for uniqueness). Never include URLs that are not in the CONTEXT sections. Never make up URLs.\
"""

# Example
user1 = f"""\
CONTEXT:
Next.js is a React framework for creating production-ready web applications. It provides a variety of methods for fetching data, a built-in router, and a Next.js Compiler for transforming and minifying JavaScript code. It also includes a built-in Image Component and Automatic Image Optimization for resizing, optimizing, and serving images in modern formats.
SOURCE: nextjs.org/docs/faq

QUESTION: 
what is nextjs? 
"""

assistant1 = f"""\
Next.js is a framework for building production-ready web applications using React. It offers various data fetching options, comes equipped with an integrated router, and features a Next.js compiler for transforming and minifying JavaScript. Additionally, it has an inbuilt Image Component and Automatic Image Optimization that helps resize, optimize, and deliver images in modern formats.
  
```js
function HomePage() {{
  return <div>Welcome to Next.js!</div>
}}

export default HomePage
```

SOURCES:
https://nextjs.org/docs/faq
"""

# Question + Context
### This is where we add our question ⬇⬇
### TODO: You might need to correct the array reference! (right now, only one result is taken into account)
user2 = f"""\
CONTEXT: {results['documents'][0]}

QUESTION: {question}
"""

### This is the final prompt
messages=[
    {"role": "system", "content": system},
    {"role": "user", "content": user1},
    {"role": "assistant", "content": assistant1},
    {"role": "user", "content": user2},
]
messages

In [None]:
### Step 3: Call to OpenAI
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0.6,
    max_tokens=2100,
    top_p=1,
    frequency_penalty=1,
    presence_penalty=1
)

response.choices[0].message.content

## Next steps: Play with it 🚀💭

- In Step 3: Think about the usage of `results['documents'][0]` for the CONTEXT.
- In Step 3: Use `gpt-4` instead of `gpt-3.5-turbo`.
- In Step 2: Use more results from the database, e.g. `nr_of_results=4`.
- In Step 3: Play with `temperature` and `max_tokens`.
- In Step 0: Look at the dataset. There is another dataset called `with_source`. It has the link to the document inside: `SOURCE: https:/....` When using this dataset, you can actually output the source of your document.
- In Step 3: Change the system prompt: `You are a very euphoric...`
- In Step 3: Change the example.
- Generally: Whenever you change something, compare the new answer with the old one. How could you do that reliably?
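
One possible way to approach the last point is to keep a fixed set of questions, store the answers from each configuration, and compare them side by side. The sketch below uses `difflib` from the standard library to compute a rough textual similarity. The helper names and the example answers are made up, and textual similarity says nothing about which answer is better, so reading the answers yourself remains necessary.

```python
import difflib

def answer_similarity(old_answer, new_answer):
    """Return a ratio in [0, 1]; 1.0 means the two texts are identical."""
    return difflib.SequenceMatcher(None, old_answer, new_answer).ratio()

def compare_runs(old_answers, new_answers):
    """Pair up answers per question and report how similar each pair is."""
    report = {}
    for question, old_answer in old_answers.items():
        new_answer = new_answers.get(question, "")
        report[question] = round(answer_similarity(old_answer, new_answer), 2)
    return report

# Hypothetical answers from two prompt variants.
run_a = {"What is the SPH?": "The SPH is the Student Project House."}
run_b = {"What is the SPH?": "SPH stands for Student Project House."}

print(compare_runs(run_a, run_b))
```

Because the model's output varies between calls when `temperature` is above 0, it may help to collect several answers per question before drawing conclusions.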


# Example 2 (for yourself)

There is another dataset, the project hub. Download it and add it to a new collection, or use your own dataset export. You can use the script as is, as long as all files are markdown and are in one folder.

`https://n.ethz.ch/~thfrei/download/ml-intro-1-sph-projecthub.zip`

In [None]:
!wget https://n.ethz.ch/~thfrei/download/ml-intro-1-sph-projecthub.zip
# unzip will extract into /content/notion
!unzip -o /content/ml-intro-1-sph-projecthub.zip
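
As a starting point for this exercise, the loading loop from Step 0 can be wrapped in a small reusable function. This is a sketch under assumptions: the helper name `load_markdown_folder` and the collection name `projecthub` are made up, and the commented usage assumes the `client` from Step 0 is still available.

```python
import os

def load_markdown_folder(folder):
    """Read every .md file in `folder`; return (documents, ids) as parallel lists."""
    documents, ids = [], []
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".md"):
            continue
        filepath = os.path.join(folder, filename)
        with open(filepath, "r", encoding="utf-8") as f:
            documents.append(f.read())
        ids.append(filepath)  # the file path doubles as a unique id
    return documents, ids

# Assumed usage with the `client` from Step 0 and the extracted /content/notion folder:
# documents, ids = load_markdown_folder("/content/notion")
# new_collection = client.create_collection(name="projecthub")
# new_collection.add(documents=documents, ids=ids)
```

Remember to run your queries against the new collection afterwards, not against the one created in Step 0.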


# Appendix / Snippets

### Format as Markdown

In [None]:
# This lets you format the output nicely!
from IPython.display import display, Markdown
display(Markdown(response.choices[0].message.content))

### Check token limit

You can also use the hosted version: https://platform.openai.com/tokenizer

In [None]:
### Check token limit.
def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
    # source: learn.microsoft.com + modification
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":  # if there's a name, the role is omitted
                num_tokens += -1  # role is always required and always 1 token   
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|> (modification)
    return num_tokens

print(f"{num_tokens_from_messages(messages)} number of tokens to be sent in our request")
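
If the count comes close to the model's limit, one option is to keep only as many retrieved documents as fit into a token budget. The sketch below is a minimal illustration: `fit_to_budget` and `rough_count` are made-up helpers, the budget of 8 is arbitrary, and the whitespace-based counter is only a crude stand-in so the example runs without a tokenizer. In the notebook you could pass a tiktoken-based counter instead, for example `lambda text: len(tiktoken.encoding_for_model("gpt-3.5-turbo").encode(text))`.

```python
def fit_to_budget(documents, max_tokens, count_tokens):
    """Keep documents in order until adding the next one would exceed max_tokens."""
    kept, used = [], 0
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > max_tokens:
            break
        kept.append(doc)
        used += cost
    return kept

def rough_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

toy_documents = [
    "solar panels on the roof",
    "a recycling workshop",
    "a very long unrelated text " * 10,
]
print(fit_to_budget(toy_documents, max_tokens=8, count_tokens=rough_count))
```

Since the query results are ordered by similarity, cutting from the end drops the least relevant documents first.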
