### Computing distributed representation (embedding) vectors with OpenAI text embeddings

In [1]:
!pip install --upgrade "httpx<0.28"

Collecting httpx<0.28
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Installing collected packages: httpx
  Attempting uninstall: httpx
    Found existing installation: httpx 0.28.1
    Uninstalling httpx-0.28.1:
      Successfully uninstalled httpx-0.28.1
Successfully installed httpx-0.27.2


#### Import libraries  

In [13]:
import numpy as np
import pandas as pd
import openai
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# INIAD
API_KEY_INIAD = ''
BASE_URL_INIAD = 'https://api.openai.iniad.org/api/v1'

API_KEY = API_KEY_INIAD
BASE_URL = BASE_URL_INIAD

# https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?lang=node
MODEL = 'text-embedding-3-small'  # max: 8191 tokens
# MODEL = 'text-embedding-3-large'  # max: 8191 tokens
# max_input_len = 20000  # rule of thumb: 1 token is roughly 3-4 characters of English text
# max_input_len2 = 8000

In [4]:
client = openai.OpenAI(
    api_key = API_KEY,
    base_url = BASE_URL,
)

In [5]:
def get_embedding(text, model=MODEL):
    text = text.replace('\n', ' ')
    return client.embeddings.create(input=text, model=model).data[0].embedding

#### Setup working directory

In [6]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Documents/ds2024/dsF1/

Mounted at /content/drive
/content/drive/MyDrive/Documents/ds2024/dsF1


#### Parameters  

In [7]:
csv_in = 'sleep-text-score.csv'
vecs_out = 'vecs_sleep_by_openai.npy'
min_words = 0

#### Read CSV file  

In [8]:
df = pd.read_csv(csv_in, sep=',', skiprows=0, header=0, encoding='shift-jis')
print(df.shape)
print(df.info())
display(df.head())

(426, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426 entries, 0 to 425
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   text               426 non-null    object
 1   GPT-4o             426 non-null    int64 
 2   Gemini-1.5-Pro     426 non-null    int64 
 3   Claude-3.5-Sonnet  426 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 13.4+ KB
None


| | text | GPT-4o | Gemini-1.5-Pro | Claude-3.5-Sonnet |
|---|---|---|---|---|
| 0 | 就寝時間を毎日一定にする | 2 | 2 | 2 |
| 1 | 朝日を積極的に浴びる | 2 | 2 | 2 |
| 2 | 寝室の温度を18-22度に保つ | 2 | 2 | 2 |
| 3 | 就寝前のストレッチで体をリラックスさせる | 2 | 2 | 2 |
| 4 | 寝具は定期的に清潔に保つ | 2 | 2 | 2 |


#### Check the number of documents in each category  

In [9]:
print(df['GPT-4o'].value_counts().sort_index(ascending=True))
print(df['Gemini-1.5-Pro'].value_counts().sort_index(ascending=True))
print(df['Claude-3.5-Sonnet'].value_counts().sort_index(ascending=True))

GPT-4o
0    164
1     61
2    201
Name: count, dtype: int64
Gemini-1.5-Pro
0    182
1     75
2    169
Name: count, dtype: int64
Claude-3.5-Sonnet
0    184
1     81
2    161
Name: count, dtype: int64
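The three separate counts can also be collected into one table, which makes the label distributions easier to compare side by side. This is a small sketch on made-up toy labels, not the notebook's data; the `toy` frame and the `counts` name are illustrative only.

```python
import pandas as pd

# Made-up labels for illustration; the real notebook would use df instead of toy
toy = pd.DataFrame({
    "GPT-4o": [0, 1, 2, 2],
    "Gemini-1.5-Pro": [0, 0, 2, 2],
    "Claude-3.5-Sonnet": [0, 1, 1, 2],
})

# One value_counts per column, aligned on the label value.
# Labels a model never used show up as NaN, so fill them with 0.
counts = (
    toy.apply(lambda col: col.value_counts())
       .fillna(0)
       .astype(int)
       .sort_index()
)
print(counts)
```

On the real data, the same expression would be applied to `df[['GPT-4o', 'Gemini-1.5-Pro', 'Claude-3.5-Sonnet']]`.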


#### Calculate embeddings

In [10]:
%%time
from time import sleep

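vecs = []
for i in range(df.shape[0]):
    txt = df.at[i, 'text']
    # Fallback for over-long inputs (disabled; needs max_input_len / max_input_len2 defined above):
    # try:
    #     vec = get_embedding(txt[:max_input_len])
    # except Exception:
    #     vec = get_embedding(txt[:max_input_len2])
    vec = get_embedding(txt)  # one API request per text
    vecs.append(np.array(vec))
    sleep(1.5)  # pause between requests
vecs = np.array(vecs)  # shape: (number of texts, embedding dimension)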

CPU times: user 5.79 s, sys: 730 ms, total: 6.52 s
Wall time: 14min 18s
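A single failed request in the loop above would discard more than 14 minutes of work. One common safeguard is to retry a failed call with exponentially growing pauses. The sketch below is one possible way to do that, not the notebook's method; `with_retry`, `max_attempts`, and `base_delay` are hypothetical names, and catching every `Exception` is a simplification (with the real client one would more likely catch a specific error such as `openai.RateLimitError`).

```python
import time


def with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # simplification: narrow this to the errors worth retrying
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Offline demonstration with a function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []  # record the pauses instead of actually sleeping
result = with_retry(flaky, sleep=delays.append)
print(result, delays)
```

In the loop, the call would become `vec = with_retry(lambda: get_embedding(txt))`.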


In [11]:
print(vecs.shape)

(426, 1536)


#### Typical texts for each label

In [14]:
for label in [0, 1, 2]:
    print(f"\nTop 10 typical texts for Label {label}:")
    print("-" * 50)

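    # Select the texts and vectors that GPT-4o assigned to this label
    mask = (df['GPT-4o'] == label).to_numpy()
    label_vecs = vecs[mask]
    label_texts = df.loc[mask, 'text']

    # Pairwise cosine similarity within the label (the diagonal is self-similarity, 1.0)
    similarities = cosine_similarity(label_vecs)

    # Centrality = mean similarity of a text to every text in the label, itself included
    centrality_scores = np.mean(similarities, axis=1)

    # Indices of the 10 highest scores, in descending order
    top_indices = np.argsort(centrality_scores)[-10:][::-1]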

    for i, idx in enumerate(top_indices, 1):
        score = centrality_scores[idx]
        print(f"{i}. {label_texts.iloc[idx]} (Score: {score:.3f})")


Top 10 typical texts for Label 0:
--------------------------------------------------
1. 寝る直前までの勉強 (Score: 0.473)
2. 就寝前にストレスフルな作業をする (Score: 0.472)
3. 昼寝を長時間とる (Score: 0.470)
4. 昼寝を長時間する (Score: 0.468)
5. 就寝前に激しいデバッグ作業をする (Score: 0.465)
6. 寝る前に心配事を考える (Score: 0.461)
7. 寝る前に激しい運動をする (Score: 0.458)
8. 寝る直前に強い運動をする (Score: 0.458)
9. 寝る直前に激しい運動をする (Score: 0.455)
10. 寝る前に緊張するようなテレビを見る (Score: 0.450)

Top 10 typical texts for Label 1:
--------------------------------------------------
1. 就寝前の短時間のテレビ視聴 (Score: 0.506)
2. 就寝前に軽いテレビ番組を見る (Score: 0.495)
3. 就寝前の短い通話 (Score: 0.478)
4. 寝る前に短い電話をする (Score: 0.478)
5. 夜寝る前にスマホを30分程度だけ見る (Score: 0.475)
6. 就寝前の短時間のSNSチェック (Score: 0.475)
7. 軽いテレビ番組を寝る前に少し見る (Score: 0.473)
8. 週末だけ少し長めに昼寝を取る (Score: 0.470)
9. 就寝前の長時間の電話 (Score: 0.468)
10. 夕方に長時間の昼寝をする (Score: 0.467)

Top 10 typical texts for Label 2:
--------------------------------------------------
1. 就寝前のストレッチで体をリラックスさせる (Score: 0.438)
2. 就寝前の深呼吸でリラックスする (Score: 0.436)
3. 寝る前にホットストーンを使ったリラックスを試す (Score:
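The score printed above is the mean cosine similarity between one text and every text with the same label. Because the similarity matrix includes each text's similarity to itself (1.0), the mean is shifted upward by an amount that depends on the label's size. The toy sketch below reproduces the calculation with NumPy only and adds a variant that leaves the self-similarity out; `centrality` and `include_self` are hypothetical names, and the three 2-D vectors are made up for illustration.

```python
import numpy as np


def centrality(vectors, include_self=True):
    """Mean cosine similarity of each row to the rows of the same array."""
    # Normalise rows to unit length so the dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    if include_self:
        return sims.mean(axis=1)  # same definition as the cell above
    n = len(vectors)
    # Drop the diagonal (self-similarity) and average over the other n - 1 rows
    return (sims.sum(axis=1) - np.diag(sims)) / (n - 1)


# Made-up vectors: the third one lies between the first two
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

with_self = centrality(toy)
without_self = centrality(toy, include_self=False)
print(with_self)
print(without_self)
print("most central row:", int(np.argmax(with_self)))
```

Both variants rank the texts within one label identically; they differ only when scores are compared across labels of different sizes.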

#### Save vecs

In [12]:
np.save(vecs_out, vecs)
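A later notebook can read the array back with `np.load`. The round trip below is a minimal sketch on a small random array written to a temporary directory, so it does not touch `vecs_sleep_by_openai.npy`; `vecs_demo` and the file name `vecs_demo.npy` are illustrative.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the real (426, 1536) array
vecs_demo = np.random.default_rng(0).normal(size=(4, 8))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vecs_demo.npy")
    np.save(path, vecs_demo)   # same call as in the cell above
    loaded = np.load(path)     # reads the array back with its shape and dtype

print(loaded.shape, loaded.dtype)
print("identical:", np.array_equal(vecs_demo, loaded))
```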