<a href="https://colab.research.google.com/github/raminass/tau-digital/blob/main/notebooks/bio_emb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Env Setup

In [None]:
!pip install --upgrade openai --quiet

In [4]:
OPENAI_API_KEY = "" # @param {type:"string"}
from openai import OpenAI
import pandas as pd
import re
import numpy as np
from bs4 import BeautifulSoup

client = OpenAI(api_key=OPENAI_API_KEY)

def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    result = re.sub(pattern, ' ', text)
    return result

def get_embedding(text, model="text-embedding-ada-002"):
    # Collapse newlines into spaces before sending the request
    text = text.replace("\n", " ")
    # Return None if the request fails so the row can be dropped later
    try:
        return client.embeddings.create(input=[text], model=model).data[0].embedding
    except Exception:
        return None

def html_to_text(html_code):
    # Parse HTML
    soup = BeautifulSoup(html_code, 'html.parser')

    # Extract text content
    text_content = soup.get_text(separator='\n', strip=True)
    return text_content


# Build Embedding

In [None]:
Raw_Data = "https://github.com/raminass/tau-digital/blob/main/data/bio_forum.csv?raw=true" # @param ["https://github.com/raminass/tau-digital/blob/main/data/bio_forum.csv?raw=true", "https://github.com/raminass/tau-digital/blob/main/data/calc_forum.csv?raw=true", "https://github.com/raminass/tau-digital/blob/main/data/ds_exam.csv?raw=true"]
data_name = Raw_Data.split('/')[-1].split('.')[0]
# read raw data
df = pd.read_csv(Raw_Data)
# clean the data: strip HTML and keep only the text
df['clear_text'] = df.message.apply(html_to_text)
# embed the cleaned text rather than the raw HTML
df['msg_embedding'] = df.clear_text.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
# remove rows whose embedding request failed
df = df[df['msg_embedding'].notna()]
# save the embeddings
df.to_csv(f'embedded_{data_name}.csv', index=False)

# TSNE

In [16]:
embedding_file = "https://raw.githubusercontent.com/raminass/tau-digital/main/data/embedded_bio_forum.csv" # @param ["https://raw.githubusercontent.com/raminass/tau-digital/main/data/embedded_bio_forum.csv"]
import pandas as pd
import numpy as np

df = pd.read_csv(embedding_file)
# remove rows with missing embeddings
df = df[df['msg_embedding'].notna()]

In [18]:
import ast

import numpy as np
import plotly.express as px
from sklearn.manifold import TSNE

# Embeddings are stored in the CSV as strings; parse them back into lists
features = np.array(df.msg_embedding.apply(ast.literal_eval).to_list())

tsne = TSNE(n_components=2, random_state=0)
projections = tsne.fit_transform(features)

df['x'] = projections[:,0]
df['y'] = projections[:,1]
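
Writing a list-valued column to CSV stores each list as its string form, which is why the embeddings have to be parsed before they can be stacked into an array. The sketch below illustrates this round trip with made-up two-dimensional vectors and the standard-library `csv` module standing in for pandas; `ast.literal_eval` is used because it only accepts Python literals, which makes it a safer choice than `eval` for data read from a file.

```python
import ast
import csv
import io

# Made-up rows standing in for (message, embedding) pairs
rows = [("first message", [0.1, -0.2]), ("second message", [0.3, 0.4])]

# Write the rows to an in-memory CSV; each list is stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["message", "msg_embedding"])
writer.writerows(rows)

# Read the CSV back: every field is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["msg_embedding"]

# literal_eval turns the string back into a list without executing code
parsed = [ast.literal_eval(row["msg_embedding"]) for row in loaded]
print(type(raw_value).__name__, "->", type(parsed[0]).__name__)
```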

## 1. Find the clusters using K-means

We show the simplest use of K-means. You can pick the number of clusters that fits your use case best.

In [19]:
from sklearn.cluster import KMeans

n_clusters = 4

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(features)
labels = kmeans.labels_
df["Cluster"] = labels.astype(str)
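
One possible way to pick the number of clusters is to compare silhouette scores across several candidate values. This is a minimal sketch, not the method used in this notebook: `demo_features` is synthetic data from `make_blobs` standing in for the embedding matrix, and the candidate range 2 to 7 is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedding matrix (assumption: 200 points, 8 dims)
demo_features, _ = make_blobs(n_samples=200, centers=4, n_features=8, random_state=0)

candidate_ks = range(2, 8)
scores = {}
for k in candidate_ks:
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    cluster_labels = model.fit_predict(demo_features)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(demo_features, cluster_labels)

# Keep the candidate with the highest score
best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```

To try this on the forum data, `demo_features` would be replaced by `features`. The score is only a heuristic, so the resulting clusters are still worth inspecting by eye.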



271 messages

Cleaning the messages before embedding improves results and saves money, because HTML markup adds billed tokens that carry no meaning.

example raw message:
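
The snippet below uses a made-up message (not taken from the forum data) to give a rough sense of how much markup cleaning can remove. It follows the same regex approach as `remove_html_tags` above and compares character counts, which are only a loose proxy for token counts.

```python
import re

# Hypothetical forum message, not taken from the dataset
raw_message = (
    '<p dir="ltr" style="text-align:left;">Why does <strong>hexokinase</strong> '
    'have a low K<sub>m</sub> for glucose?</p>'
)

# Same regex approach as remove_html_tags: replace each tag with a space
without_tags = re.sub(r"<.*?>", " ", raw_message)
# Collapse the runs of whitespace left behind by the removed tags
cleaned = " ".join(without_tags.split())

print(cleaned)
print(len(raw_message), "->", len(cleaned), "characters")
```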



In [20]:
fig = px.scatter(
    df, x='x', y='y',
    hover_name="clear_text", color="Cluster", title="Biochemistry"
)
fig.show()

In [None]:
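from openai import OpenAI
# OpenAI() reads the key from the OPENAI_API_KEY environment variable;
# pass api_key=... explicitly if that variable is not set
client = OpenAI()

# ref: https://cookbook.openai.com/examples/clustering
# Sample a few messages from each cluster and ask the model for a common theme.
rev_per_cluster = 5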

for i in range(n_clusters):
    print(f"Cluster {i} Theme:\n", end=" ")

    reviews = "\n".join(
        df[df.Cluster.astype(int) == i]
        .message.str.replace("Title: ", "")
        .str.replace("\n\nContent: ", ":  ")
        .sample(rev_per_cluster, random_state=41)
        .values
    )
    # https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are a friendly and helpful teaching assistant in a university-level biochemistry course for biology students. You help students with their questions and can use material from Nelson & Cox, Lehninger Principles of Biochemistry, 6th edition."},
                {"role": "user", "content": f'What do the following student questions in a biochemistry course have in common?\n\nStudent questions:\n"""\n{reviews}\n"""\n\nTheme:'},],
        temperature=0.7,
        max_tokens=64,
        top_p=0.5,
        # frequency_penalty=0,
        # presence_penalty=0,
    )
    print(response.choices[0].message.content.replace(". ", ".\n"))


Cluster 0 Theme:
 The common theme in these student questions is that they are seeking clarification or assistance with specific topics or concepts in biochemistry.
They are asking for help with understanding calculations involving the pI (isoelectric point) of a sequence, the impact of using a metabolic pathway on oxygen consumption, and the relationship between the use of
Cluster 1 Theme:
 The common theme among these student questions is that they are all related to specific topics or concepts in biochemistry.
The first question is about the splitting observed in a graph, the second question is about the role of low glucose affinity in the liver, the third question is about the addition of weak acid to mitochondria, and
Cluster 2 Theme:
 The common theme in these student questions is that they are all seeking clarification or explanation on specific topics in biochemistry.
The students are asking about specific questions from exams or assignments, requesting explanations on concepts