This notebook contains work-in-progress (WIP) scrap for the `imbed` project.

It is not meant to be run by everyone.


# The k-d to 2-d problem

After embedding your objects of interest into k-dimensional vectors, you might further embed them into two or three dimensions for visualization. This step inevitably loses some information and introduces distortions, but it is still valuable, for the same reason the first embedding was: converting texts or images into numerical vectors also lost information, yet it served your specific goals.

Dimensionality reduction typically loses some detail. The key is to preserve the essential information (the "signal") and let go of the less important details (the "noise"). Done well, this can even be beneficial, because it improves the signal-to-noise ratio.

This principle also applies to compressing vectors for visualization. The aim is to present complex data in a useful and intuitive form, so that analysts can spot patterns and gain insights. Transforming k-dimensional data into two dimensions is therefore a trade-off between how much useful structure the picture keeps and how practical it is to compute and display.
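One way to make "loses some information" concrete is to compare pairwise distances before and after a reduction. The following is a minimal sketch, not something taken from `imbed`: the helper names (`pairwise_distances`, `distance_correlation`) and the toy points are assumptions made for illustration, and the "reduction" here is simply dropping a coordinate rather than UMAP or PCA.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def distance_correlation(high, low):
    """Pearson correlation between pairwise distances before and after reduction."""
    a, b = pairwise_distances(high), pairwise_distances(low)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy data (hypothetical): four 3-d points, one of them far away along z.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 2.0, 5.0)]

# A crude "reduction": keep (x, y) and drop z.
low = [(x, y) for x, y, _ in high]

score = distance_correlation(high, low)
print(score)
```

A score near 1 would mean the 2-d picture ranks distances much as the original space does. In this toy case the dropped coordinate carried most of the spread, so the score ends up far from 1.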



# Arxiv Abstracts

In [1]:
from imbed.mdat.arxiv import *

## Raw store

In [4]:
from imbed.util import Graze, GrazeReturningFilepaths

g = GrazeReturningFilepaths()
list(g)

['https://drive.google.com/file/d/1Ul5mPePtoPKHZkH5Rm6dWKAO11dG98GN/view?usp=share_link',
 'https://drive.google.com/file/d/1g3K-wlixFxklTSUQNZKpEgN4WNTFTPIZ/view?usp=share_link']

In [6]:
from imbed.mdat.arxiv import raw_sources

list(raw_sources)

['titles', 'abstracts']

In [7]:
raw = raw_sources['titles']
list(raw)

['titles_7.parquet',
 'titles_23.parquet',
 'titles_15.parquet',
 'verifyResults.py',
 'titles_14.parquet',
 'titles_22.parquet',
 'titles_6.parquet',
 'titles_16.parquet',
 'titles_20.parquet',
 'titles_4.parquet',
 'titles_5.parquet',
 'titles_21.parquet',
 'params.txt',
 'titles_17.parquet',
 'exampleEmbed.py',
 'titles_12.parquet',
 'README.md',
 'titles_9.parquet',
 'titles_1.parquet',
 'titles_13.parquet',
 'titles_8.parquet',
 'titles_18.parquet',
 'titles_3.parquet',
 'titles_11.parquet',
 'titles_10.parquet',
 'titles_19.parquet',
 'titles_2.parquet']

In [12]:
print(raw['exampleEmbed.py'].decode())

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Research Paper title for retrieval; Input:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)


In [16]:
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')


load INSTRUCTOR_Transformer
max_seq_length  512


In [17]:
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Research Paper title for retrieval; Input:"
embeddings = model.encode([[instruction, sentence]])


In [82]:
print(raw['params.txt'].decode())

prompt: Represent the Research Paper abstract for retrieval; Input:
type: abstract
time string: 20230518-180240
model: InstructorXL
version: 2.0


In [83]:
print(raw['exampleEmbed.py'].decode())

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')
sentence = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train."
instruction = "Represent the Research Paper abstract for retrieval; Input:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)


## The imbedding data store

Next, we wrap the raw store to get a convenient interface to the actual data of interest: the parquet files, decoded as tables.

In [14]:
from imbed.mdat.arxiv import sources  # raw store + wrapper. See parquet_codec code.

titles_tables = sources['titles']
abstract_tables = sources['abstracts']
print(list(titles_tables))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
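The keys change from file names such as `titles_7.parquet` in the raw store to sorted integers here, and the non-parquet files disappear. A minimal sketch of that kind of key transformation is shown below. It is an assumption about the general idea only, not the actual `parquet_codec` code, and `table_keys` is a hypothetical helper name.

```python
import re

# Matches names ending in '_<number>.parquet' and captures the number.
_PARQUET_NAME = re.compile(r'_(\d+)\.parquet$')


def table_keys(raw_keys):
    """Map raw file names such as 'titles_7.parquet' to sorted integer keys."""
    indexed = {}
    for key in raw_keys:
        match = _PARQUET_NAME.search(key)
        if match:  # skips README.md, params.txt, *.py, ...
            indexed[int(match.group(1))] = key
    return dict(sorted(indexed.items()))


raw_keys = ['titles_7.parquet', 'README.md', 'titles_23.parquet', 'titles_1.parquet']
keys = table_keys(raw_keys)
print(list(keys))
```

Sorting by the integer rather than by the file name avoids the usual string-ordering surprise where `titles_10` would come before `titles_2`.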


In [15]:
titles_df = titles_tables[1]
titles_df

Unnamed: 0,title,embeddings,doi
0,Calculation of prompt diphoton production cros...,"[-0.050620172, 0.041436385, 0.05363288, -0.029...",0704.0001
1,Sparsity-certifying Graph Decompositions,"[0.014515653, 0.023809524, -0.028145121, -0.04...",0704.0002
2,The evolution of the Earth-Moon system based o...,"[-4.766115e-05, 0.017415706, 0.04146007, -0.03...",0704.0003
3,A determinant of Stirling cycle numbers counts...,"[0.027208889, 0.046175897, 0.0010913888, -0.01...",0704.0004
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,"[0.0113909235, 0.0042667952, -0.0008565594, -0...",0704.0005
...,...,...,...
99995,Multiple Time Dimensions,"[0.02682626, -0.0015173098, -0.0019915192, -0....",0812.3869
99996,Depth Zero Representations of Nonlinear Covers...,"[-0.02740943, 0.011689809, -0.0105154915, -0.0...",0812.3870
99997,Decting Errors in Reversible Circuits With Inv...,"[0.0072460608, 0.0028085636, -0.015064359, -0....",0812.3871
99998,Unveiling the birth and evolution of the HII r...,"[0.009408689, -0.0047120117, 0.0021392817, -0....",0812.3872


In [16]:
abstract_df = abstract_tables[1]
abstract_df

Unnamed: 0,abstract,embeddings,doi
0,A fully differential calculation in perturba...,"[-0.035151865, 0.022851437, 0.025942933, -0.02...",0704.0001
1,"We describe a new algorithm, the $(k,\ell)$-...","[0.035485767, -0.0015772493, -0.0016615744, -0...",0704.0002
2,The evolution of Earth-Moon system is descri...,"[-0.014510429, 0.010210799, 0.049661566, -0.01...",0704.0003
3,We show that a determinant of Stirling cycle...,"[0.029191103, 0.047992915, -0.0061754594, -0.0...",0704.0004
4,In this paper we show how to compute the $\L...,"[-0.015174898, 0.01603887, 0.04062805, -0.0246...",0704.0005
...,...,...,...
99995,The possibility of physics in multiple time ...,"[0.016121766, 0.011126887, 0.018650021, -0.044...",0812.3869
99996,"We generalize the methods of Moy-Prasad, in ...","[-7.164341e-05, -0.007114291, -0.008979887, -0...",0812.3870
99997,Reversible logic is experience renewed inter...,"[0.03194286, -0.00771745, 0.015977046, -0.0474...",0812.3871
99998,"Based on a multiwavelength study, the ISM ar...","[-0.012340169, -0.021712925, 0.00806009, -0.00...",0812.3872


In [12]:
abstract_df['doi'].values

array(['0704.0001', '0704.0002', '0704.0003', ..., '0812.3871',
       '0812.3872', '0812.3873'], dtype=object)

## Merged "column oriented" view of tables

(See [Column-oriented DBMS](https://en.wikipedia.org/wiki/Column-oriented_DBMS).)

In [1]:
from tabled import ColumnOrientedMapping
from imbed.mdat.arxiv import sources 

titles_tables = sources['titles']
tables_merge = ColumnOrientedMapping(titles_tables)
list(tables_merge)

['title', 'embeddings', 'doi']
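To illustrate what a column-oriented view over several tables does, here is a minimal sketch using plain dicts of lists instead of DataFrames. `ColumnView` is a hypothetical stand-in written for this note; it is not how `tabled.ColumnOrientedMapping` is implemented, and it assumes every table has the same columns.

```python
from collections.abc import Mapping
from itertools import chain


class ColumnView(Mapping):
    """Map a column name to that column's values chained over all tables."""

    def __init__(self, tables):
        # tables: mapping of table key -> {column name: list of values}
        self.tables = tables

    def __getitem__(self, column):
        if column not in set(self):
            raise KeyError(column)
        return list(chain.from_iterable(t[column] for t in self.tables.values()))

    def __iter__(self):
        # Column names are taken from the first table (assumed shared by all).
        first = next(iter(self.tables.values()), {})
        return iter(first)

    def __len__(self):
        return sum(1 for _ in self)


toy_tables = {
    1: {'doi': ['0704.0001', '0704.0002'], 'title': ['A', 'B']},
    2: {'doi': ['0812.3869'], 'title': ['C']},
}
view = ColumnView(toy_tables)
print(list(view), view['doi'])
```

The point of the design is that the keys are columns rather than table numbers, so a whole column can be requested without caring how the rows are split across files.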

In [3]:
import numpy as np

embeddings_array = np.vstack(tables_merge['embeddings'])
embeddings_array.shape

(2254198, 768)

In [4]:
dois = tables_merge['doi']
assert len(set(dois)) == len(dois), "dois not unique!"
len(dois)

2254198
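When a uniqueness assertion like the one above fails, it helps to see which values repeat. A small sketch using only the standard library follows; `duplicated` is a hypothetical helper name and the identifiers are toy values.

```python
from collections import Counter


def duplicated(values):
    """Return {value: count} for every value that occurs more than once."""
    return {value: n for value, n in Counter(values).items() if n > 1}


toy_ids = ['0704.0001', '0704.0002', '0704.0001', '0812.3869']
print(duplicated(toy_ids))
```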

In [5]:
import umap
reducer = umap.UMAP()

In [6]:
planar_embedding = reducer.fit_transform(embeddings_array)
planar_embedding.shape

(2254198, 2)

In [9]:
w = planar_embedding.tolist()

In [15]:
import pickle, json, os
from dol import PickleFiles, JsonFiles  # TODO: make PickleFiles work
from py2store import PickleStore

p = PickleStore(os.path.expanduser('~/tmp/'))
p['arxiv-titles-2d-embedding.pkl'] = planar_embedding

In [19]:
from py2store import LocalJsonStore
ss = LocalJsonStore(os.path.expanduser('~/tmp/'))
t = [{'doi': doi, 'x': float(w[0]), 'y': float(w[1])} for doi, w in zip(dois, planar_embedding)]
ss['arxiv-titles-2d-embedding.json'] = t

In [20]:
titles = tables_merge['title']

In [21]:
t = {doi: title for doi, title in zip(dois, titles)}
ss['arxiv-doi-and-titles.json'] = t

In [13]:
from imbed.mdat.arxiv import get_dois
t = list(get_dois())


In [16]:
list(a)

['abstract', 'embeddings', 'doi']

In [48]:
from imbed.mdat.arxiv import *
from tabled import ColumnOrientedMapping

merged_tables = ColumnOrientedMapping(tables)
dois = merged_tables['doi']
assert len(set(dois)) == len(dois), "dois not unique!"


In [41]:
table = next(iter(merged_tables.tables.values()))
table

Unnamed: 0,abstract,embeddings,doi
0,A fully differential calculation in perturba...,"[-0.035151865, 0.022851437, 0.025942933, -0.02...",0704.0001
1,"We describe a new algorithm, the $(k,\ell)$-...","[0.035485767, -0.0015772493, -0.0016615744, -0...",0704.0002
2,The evolution of Earth-Moon system is descri...,"[-0.014510429, 0.010210799, 0.049661566, -0.01...",0704.0003
3,We show that a determinant of Stirling cycle...,"[0.029191103, 0.047992915, -0.0061754594, -0.0...",0704.0004
4,In this paper we show how to compute the $\L...,"[-0.015174898, 0.01603887, 0.04062805, -0.0246...",0704.0005
...,...,...,...
99995,The possibility of physics in multiple time ...,"[0.016121766, 0.011126887, 0.018650021, -0.044...",0812.3869
99996,"We generalize the methods of Moy-Prasad, in ...","[-7.164341e-05, -0.007114291, -0.008979887, -0...",0812.3870
99997,Reversible logic is experience renewed inter...,"[0.03194286, -0.00771745, 0.015977046, -0.0474...",0812.3871
99998,"Based on a multiwavelength study, the ISM ar...","[-0.012340169, -0.021712925, 0.00806009, -0.00...",0812.3872


In [45]:
table[table.columns.values.tolist()]


Unnamed: 0,abstract,embeddings,doi
0,A fully differential calculation in perturba...,"[-0.035151865, 0.022851437, 0.025942933, -0.02...",0704.0001
1,"We describe a new algorithm, the $(k,\ell)$-...","[0.035485767, -0.0015772493, -0.0016615744, -0...",0704.0002
2,The evolution of Earth-Moon system is descri...,"[-0.014510429, 0.010210799, 0.049661566, -0.01...",0704.0003
3,We show that a determinant of Stirling cycle...,"[0.029191103, 0.047992915, -0.0061754594, -0.0...",0704.0004
4,In this paper we show how to compute the $\L...,"[-0.015174898, 0.01603887, 0.04062805, -0.0246...",0704.0005
...,...,...,...
99995,The possibility of physics in multiple time ...,"[0.016121766, 0.011126887, 0.018650021, -0.044...",0812.3869
99996,"We generalize the methods of Moy-Prasad, in ...","[-7.164341e-05, -0.007114291, -0.008979887, -0...",0812.3870
99997,Reversible logic is experience renewed inter...,"[0.03194286, -0.00771745, 0.015977046, -0.0474...",0812.3871
99998,"Based on a multiwavelength study, the ISM ar...","[-0.012340169, -0.021712925, 0.00806009, -0.00...",0812.3872


In [18]:

def get_arxiv_pdf_link(doi: str) -> str:
    """Return the arXiv PDF URL for an arXiv identifier such as '0704.0001'."""
    base_url = "https://arxiv.org/pdf/"
    return base_url + doi


# Example usage
doi = '0704.0001'
pdf_link = get_arxiv_pdf_link(doi)
print(pdf_link)  # Outputs: https://arxiv.org/pdf/0704.0001


SyntaxError: invalid syntax (1157999395.py, line 5)

In [None]:
r['arxiv_abstracts']
from dol import zipfiledol

In [18]:
from dol import FilesOfZip, remove_mac_junk_from_zip, Pipe, KeyTemplate, wrap_kvs
KeyTemplate('{}.parquet').filt_iter(z)

ValueError: field_type must be one of typing.Literal['str', 'dict', 'tuple', 'namedtuple', 'simple_str']. Was: FilesOfZip('None', '', {}, <function ZipReader.FILES_ONLY at 0x108875510>)

In [5]:
from dol import FilesOfZip, remove_mac_junk_from_zip, Pipe
from imbed.util import graze

arxiv_abstracts_url = (
    'https://drive.google.com/file/d/1g3K-wlixFxklTSUQNZKpEgN4WNTFTPIZ/view?usp=share_link'
)

z = remove_mac_junk_from_zip(FilesOfZip(graze(arxiv_abstracts_url)))
print(f"{len(z)=}")
list(z)

len(z)=27


['arxiv_abstracts/abstracts_21.parquet',
 'arxiv_abstracts/abstracts_1.parquet',
 'arxiv_abstracts/abstracts_8.parquet',
 'arxiv_abstracts/abstracts_17.parquet',
 'arxiv_abstracts/verifyResults.py',
 'arxiv_abstracts/abstracts_16.parquet',
 'arxiv_abstracts/abstracts_9.parquet',
 'arxiv_abstracts/abstracts_20.parquet',
 'arxiv_abstracts/abstracts_14.parquet',
 'arxiv_abstracts/abstracts_22.parquet',
 'arxiv_abstracts/abstracts_2.parquet',
 'arxiv_abstracts/abstracts_23.parquet',
 'arxiv_abstracts/abstracts_3.parquet',
 'arxiv_abstracts/params.txt',
 'arxiv_abstracts/abstracts_15.parquet',
 'arxiv_abstracts/exampleEmbed.py',
 'arxiv_abstracts/README.md',
 'arxiv_abstracts/abstracts_10.parquet',
 'arxiv_abstracts/abstracts_19.parquet',
 'arxiv_abstracts/abstracts_6.parquet',
 'arxiv_abstracts/abstracts_7.parquet',
 'arxiv_abstracts/abstracts_18.parquet',
 'arxiv_abstracts/abstracts_11.parquet',
 'arxiv_abstracts/abstracts_5.parquet',
 'arxiv_abstracts/abstracts_13.parquet',
 'arxiv_abstr

In [13]:
# TODO:
import tabled
from typing import NewType, TypeVar

blah = NewType('blah', str)
t = blah(3)
dir(t)



['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'as_integer_ratio',
 'bit_count',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes

In [7]:
print(z['arxiv_abstracts/README.md'].decode())

No claims are made about the copyright or license of contained materials. We assume no responsibilty for and are not liable under any circumstances for damages. Use at your own risk.

Good luck, have fun.


In [8]:
import os
k = 'asdf'
os.path.splitext(k)[1]

''

In [14]:
import dol

[x for x in dir(dol) if 'filt' in x]


['add_prefix_filtering', 'filt_iter', 'path_filter']

In [4]:
# remove_mac_junk_from_zip

v = z['arxiv_abstracts/abstracts_21.parquet']

In [6]:
import io
import pandas as pd

decoder = Pipe(
    io.BytesIO,
    pd.read_parquet
)
arxiv = decoder(v)
arxiv

Unnamed: 0,abstract,embeddings,doi
0,"We study a single, motionless three-dimensio...","[-0.016700998, -0.026170395, -0.0017791224, 0....",cond-mat/9912225
1,We argue that effective 1D models of stripes...,"[0.016688664, 0.03121568, 0.013161989, -0.0053...",cond-mat/9912226
2,We study numerically the one dimensional fer...,"[0.012090295, 0.013782813, 0.009999216, -0.019...",cond-mat/9912227
3,We overview some recent work and present new...,"[0.015418335, -0.005732018, -0.0061773793, 0.0...",cond-mat/9912228
4,An english translation of the original work ...,"[0.011050525, 0.009937972, 0.00615069, -0.0062...",cond-mat/9912229
...,...,...,...
99995,We investigate the possible effects on the e...,"[-0.014711679, -0.0066094864, -0.022661554, -0...",hep-th/0203099
99996,"On a historical note, we first describe the ...","[0.029632337, 0.02915576, 0.007837505, -0.0168...",hep-th/0203100
99997,There is strong evidence that the area of an...,"[0.012087484, -0.0109771425, -0.00483123, -0.0...",hep-th/0203101
99998,A three-dimensional simple N=1 supergravity ...,"[-0.020591462, -0.0009915903, 0.031237133, -0....",hep-th/0203102
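The decoder used above chains steps left to right: raw bytes are wrapped in a file-like object, which is then handed to a reader. The same pattern can be sketched with the standard library alone. Here `pipe` is a hypothetical stand-in for `dol.Pipe`, and CSV replaces parquet so that the example has no third-party dependencies.

```python
import csv
import io
from functools import partial, reduce


def pipe(*funcs):
    """Compose functions left to right: pipe(f, g)(x) == g(f(x))."""

    def piped(value):
        return reduce(lambda acc, func: func(acc), funcs, value)

    return piped


# bytes -> binary file-like -> text file-like -> csv rows -> list of rows
csv_decoder = pipe(
    io.BytesIO,
    partial(io.TextIOWrapper, encoding='utf-8'),
    csv.reader,
    list,
)

rows = csv_decoder(b'doi,x\n0704.0001,1.5\n')
print(rows)
```

Keeping the decoder as a single callable makes it easy to attach to a store as a value transformation, so that reading a key returns a table instead of bytes.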


## Langchain

In [51]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()

In [50]:
from imbed.mdat.arxiv import get_abstracts_tables

tables = get_abstracts_tables()
print(*tables)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23


In [61]:
table = next(iter(tables.values()))

from langchain.schema import Document
from itertools import repeat
# make a langchain docs iterable


class LangchainDocs:
    def __init__(self, texts, metadatas=None):
        self.texts = texts
        self.metadatas = metadatas #or repeat({})  # same dict for all: TODO: change

    def __iter__(self):
        if self.metadatas is None:
            return (Document(page_content=text) for text in self.texts)
        else:
            return (
                Document(page_content=text, metadata=meta) 
                for text, meta in zip(self.texts, self.metadatas)
            )
        

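# Slice texts and metadata identically, and pass the metadata as a list
# (not a generator) so that `docs` can be iterated more than once.
n_docs = 10000
docs = LangchainDocs(
    table['abstract'].iloc[:n_docs].values,
    [{'doi': doi} for doi in table['doi'].iloc[:n_docs].values],
)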
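A wrapper like `LangchainDocs` is only re-iterable if the things it wraps are. If the metadata is supplied as a generator, the first pass consumes it and any later pass yields nothing, because `zip` stops at the exhausted iterator. The sketch below shows this with a hypothetical `Pairs` class that mirrors the `__iter__` logic without needing LangChain.

```python
class Pairs:
    """Minimal stand-in for a re-iterable wrapper that zips two sources."""

    def __init__(self, texts, metadatas):
        self.texts = texts
        self.metadatas = metadatas

    def __iter__(self):
        return zip(self.texts, self.metadatas)


texts = ['a', 'b', 'c']
from_generator = Pairs(texts, ({'i': i} for i in range(3)))
from_list = Pairs(texts, [{'i': i} for i in range(3)])

first_pass = list(from_generator)
second_pass = list(from_generator)  # the generator is exhausted by now

print(len(first_pass), len(second_pass))
print(len(list(from_list)), len(list(from_list)))
```

This matters whenever a consumer walks the documents more than once, for example once for the texts and once for the metadata.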

In [62]:
vectorstore = Chroma.from_documents(docs, embeddings)
# 1m02 for 10000 docs

In [64]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

# metadata_field_info = [
#     AttributeInfo(
#         name="genre",
#         description="The genre of the movie",
#         type="string or list[string]",
#     ),
#     AttributeInfo(
#         name="year",
#         description="The year the movie was released",
#         type="integer",
#     ),
#     AttributeInfo(
#         name="director",
#         description="The name of the movie director",
#         type="string",
#     ),
#     AttributeInfo(
#         name="rating", description="A 1-10 rating for the movie", type="float"
#     ),
# ]
# document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)
# retriever = SelfQueryRetriever.from_llm(
#     llm, vectorstore, 'page_content', 'metadata', verbose=True
# )


ValueError: dictionary update sequence element #0 has length 1; 2 is required

# Scrap