learning
# 1. vars(object) vs dir(object)

| Feature          | `vars(obj)`                                                     | `dir(obj)`                                                                     |
|------------------|-----------------------------------------------------------------|--------------------------------------------------------------------------------|
| **Size**         | Smaller                                                         | Larger                                                                         |
| **Scope**        | Attributes stored on the instance itself (`obj.__dict__`)       | Instance attributes plus methods, properties, inherited names, and dunders     |
| **Return type**  | `dict` (name → value)                                           | `list` of names, sorted alphabetically                                         |
| **Relationship** | Keys are a subset of `dir(obj)` (only when `__dict__` exists)   | Superset of what is visible on the object                                      |


In [1]:
import tiktoken
enc = tiktoken.get_encoding("gpt2")


# vars(enc) returns enc.__dict__, so it only works for objects that have a __dict__
# (otherwise it raises TypeError).
assert vars(enc) == enc.__dict__
print("vars\n", list(vars(enc)), end='\n\n', sep='')

callables, non_callables = [], []
for attr in dir(enc):
    if callable(getattr(enc, attr)):
        callables.append(attr)
    else:
        non_callables.append(attr)

print('dir')
print("callables\n", callables, end = '\n', sep = '')
print("non_callables\n", non_callables, end = '\n\n', sep = '')

vars
['name', '_pat_str', '_mergeable_ranks', '_special_tokens', 'max_token_value', '_core_bpe']

dir
callables
['__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '_encode_bytes', '_encode_only_native_bpe', '_encode_single_piece', 'decode', 'decode_batch', 'decode_bytes', 'decode_bytes_batch', 'decode_single_token_bytes', 'decode_tokens_bytes', 'decode_with_offsets', 'encode', 'encode_batch', 'encode_ordinary', 'encode_ordinary_batch', 'encode_single_token', 'encode_with_unstable', 'token_byte_values']
non_callables
['__dict__', '__doc__', '__module__', '__weakref__', '_core_bpe', '_mergeable_ranks', '_pat_str', '_special_tokens', 'eot_token', 'max_token_value', 'n_vocab', 'name', 'special_tokens_set']
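
The same comparison can be reproduced without `tiktoken`. The sketch below uses hypothetical classes (`Base`, `Point`, `SlottedPoint` are illustrative names, not part of any library) to show that `vars` only sees instance attributes, that `dir` also lists inherited names, and that `vars` fails on an object without a `__dict__`.

```python
class Base:
    kind = "base"  # class attribute, stored on the class, not on instances

    def describe(self):
        return "I am " + self.kind


class Point(Base):
    def __init__(self, x, y):
        self.x = x  # instance attributes end up in self.__dict__
        self.y = y


class SlottedPoint:
    __slots__ = ("x", "y")  # no per-instance __dict__ is created

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
instance_attrs = vars(p)  # only what is stored on the instance
visible_names = dir(p)    # instance attributes + class and inherited names

print(instance_attrs)
print("describe" in visible_names, "describe" in instance_attrs)

# vars() needs a __dict__; a slotted instance does not have one
try:
    vars(SlottedPoint(1, 2))
    slotted_error = None
except TypeError as exc:
    slotted_error = type(exc).__name__

print(slotted_error)
```

`dir` still works on the slotted instance, which is why it is usually the safer tool for exploring an unfamiliar object.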



# 2. imap

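Note: `pool.imap` sends the worker function to the worker processes by name, so a function defined in a notebook cell may not be found there (typically when processes are started with `spawn`). Defining `tokenize` in an importable module (`utils.py` here) and importing it avoids the problem.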

In [9]:
from datasets import load_dataset

data_path = "HuggingFaceFW/fineweb-edu"
name = "sample-10BT"

data = load_dataset(data_path, name, split = 'train')
n_sample = 20
sampled_data = [{"text":data[i]['text']} for i in range(n_sample)]
sampled_data

[{'text': 'The Independent Jane\nFor all the love, romance and scandal in Jane Austen’s books, what they are really about is freedom and independence. Independence of thought and the freedom to choose.\nElizabeth’s refusal of Mr. Collins offer of marriage showed an independence seldom seen in heroines of the day. Her refusal of Mr. Darcy while triggered by anger showed a level of independence that left him shocked and stunned.\nThe freedom she exhibited in finally accepting him in direct defiance of Lady Catherine and knowing her father would disapprove was unusual even for Austen. In her last book Anne Elliot is persuaded to refuse Captain Wentworth at Lady Russel’s insistence.\nAlthough Jane played by the rules of the day, all of her writing is infused with how she wanted life to be. She ‘screams’ her outrage at the limitations for women in Emma.\nWhen accosted by Mrs. Elton, Jane Fairfax says,\n“Excuse me, ma’am, but this is by no means my intention; I make no inquiry myself, and sh

In [17]:
import multiprocessing as mp
from utils import tokenize

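# Earlier in-notebook definitions, kept for reference
# (the version actually used is imported from utils.py):
#
# def tokenize(x: str):
#     tokens = enc.encode_ordinary(x)
#     tokens = np.array(tokens).astype(np.uint16)
#     return tokens
#
# def test(x: str):
#     return x[:5]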


with mp.Pool(2) as pool:

    res2 = []

    for token in pool.imap(tokenize, sampled_data, chunksize = 5):
        res2.append(token)
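
While prototyping in a notebook, one way to sidestep the lookup problem is a thread-backed pool. The sketch below uses `multiprocessing.dummy.Pool`, which exposes the same `imap` API but runs workers as threads, so nothing has to be pickled and a locally defined function works. `text_length` and `records` are hypothetical stand-ins for `tokenize` and `sampled_data`; threads are not expected to speed up CPU-bound pure-Python work, so this is only a way to exercise the `imap` call pattern.

```python
from multiprocessing.dummy import Pool  # thread-backed pool, same API as mp.Pool


def text_length(record):
    # Hypothetical stand-in for tokenize(): takes one {"text": ...} record
    return len(record["text"])


# Hypothetical stand-in for sampled_data
records = [{"text": f"document number {i}"} for i in range(20)]

with Pool(2) as pool:
    # imap yields results lazily, in the same order as the inputs;
    # chunksize controls how many records are handed to a worker at once
    results = list(pool.imap(text_length, records, chunksize=5))

print(len(results), results[:3])
```

For real tokenization runs, the process-based pool with an importable `tokenize` remains the pattern used in the cell above.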

# 3. iterator, generator, and data loader

| Feature                    | Generator (`yield`)                       | PyTorch DataLoader / IterableDataset                 |
|---------------------------|-------------------------------------------|------------------------------------------------------|
| Simplicity                | Very lightweight                          | More complex; built for training pipelines           |
| State Management          | Internal to Python generator              | Requires careful handling (epochs, shuffle, etc.)    |
| Parallelism / Performance | Single-threaded                           | Supports multiprocessing (`num_workers > 0`)         |
| Output                    | One item (or batch) at a time             | Typically batches (e.g., dict of tensors)            |
| Reusability               | One-time use unless recreated             | Supports clean re-use over multiple epochs           |
| Flexibility               | Great for prototyping or light use        | Designed for large-scale, performant training        |
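
The "Reusability" row can be illustrated in plain Python, without PyTorch. In this sketch, `batch_generator` and `BatchIterable` are hypothetical names: the generator is exhausted after one pass, while an object whose `__iter__` creates a fresh generator can be looped over once per epoch, which is roughly the contract an iterable-style dataset follows.

```python
def batch_generator(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class BatchIterable:
    """Re-iterable wrapper: every call to iter() starts a fresh generator."""

    def __init__(self, items, batch_size):
        self.items = items
        self.batch_size = batch_size

    def __iter__(self):
        return batch_generator(self.items, self.batch_size)


tokens = list(range(10))

gen = batch_generator(tokens, 4)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to yield

loader = BatchIterable(tokens, 4)
epoch_1 = list(loader)   # fresh generator
epoch_2 = list(loader)   # another fresh generator, same batches

print(first_pass, second_pass)
print(epoch_1 == epoch_2)
```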


# 4. pass config to a class

In [36]:
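from dataclasses import dataclass

@dataclass
class Config:
    # Type annotations are what turn these attributes into dataclass fields
    param1: int = 2
    param2: int = 4


class DummyClass:

    def __init__(self, config: Config):
        self.name = 'dummy class'
        self.config = config

dummy_class = DummyClass(Config())  # pass an instance, not the class itself
dummy_class.config.param1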

2
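
One detail worth checking when a dataclass is used as a config object: only annotated attributes become fields. The sketch below uses two hypothetical classes (`UnannotatedConfig`, `AnnotatedConfig`) to show that an unannotated attribute is just a class attribute, so it cannot be overridden through the constructor.

```python
from dataclasses import dataclass, fields


@dataclass
class UnannotatedConfig:
    param1 = 2  # no annotation: plain class attribute, not a dataclass field


@dataclass
class AnnotatedConfig:
    param1: int = 2  # annotated: becomes a field with a default
    param2: int = 4


unannotated_fields = [f.name for f in fields(UnannotatedConfig)]
annotated_fields = [f.name for f in fields(AnnotatedConfig)]

# Fields can be overridden per instance through the generated __init__
cfg = AnnotatedConfig(param1=8)

# The generated __init__ of UnannotatedConfig takes no arguments at all
try:
    UnannotatedConfig(param1=8)
    unannotated_accepts_override = True
except TypeError:
    unannotated_accepts_override = False

print(unannotated_fields, annotated_fields)
print(cfg, unannotated_accepts_override)
```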

# 5. Python list multiplication and reference behavior

| Code Snippet                       | Safe? | Explanation                                                                 |
|------------------------------------|--------|------------------------------------------------------------------------------|
| `a = [0]*3`<br>`a[1] = 1`          | ✅ Yes | Immutable integers, each slot is independent.                              |
| `b = [[0]*2]*3`<br>`b[1] = [1,1]`  | ⚠️ Partial | Replaces the reference at `b[1]`, so only `b[1]` is safe. Others still shared. |
| `b = [[0]*2]*3`<br>`b[1][:] = [1,1]` | ❌ No  | Mutates shared inner list → all rows reflect the change.                    |
| `c = [[0]*2 for _ in range(3)]`<br>`c[1] = [1,1]` | ✅ Yes | Each inner list is a distinct object. Safe from shared mutation.            |

---

🧠 **Tip**: Use `[[0]*m for _ in range(n)]` instead of `[[0]*m]*n` when building nested lists, so that each row is its own object and no references are shared.


In [7]:
a = [0]*3 # Shared-mutation risk? No: the slots hold immutable ints
a[1] = 1

a

[0, 1, 0]

In [8]:
b = [[0]*2]*3 # Shared-mutation risk? Yes: all three rows are the same list object

b[1] = [1,1] # Rebinds row 1 to a new list; rows 0 and 2 still share one list

b

[[0, 0], [1, 1], [0, 0]]

In [10]:
b = [[0]*2]*3 # Shared-mutation risk? Yes: all three rows are the same list object

b[1][:] = [1,1] # Mutates the shared list in place, so every row changes

b

[[1, 1], [1, 1], [1, 1]]

In [11]:
c = [[0]*2 for _ in range(3)] # Shared-mutation risk? No: each row is a distinct list

c[1] = [1,1] # Rebinds row 1 to a new list; the other rows are unaffected

c

[[0, 0], [1, 1], [0, 0]]
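
The sharing can also be checked directly with object identities rather than inferred from the printed result. A minimal sketch (the variable names `shared` and `independent` are illustrative):

```python
shared = [[0] * 2] * 3                     # one inner list, referenced three times
independent = [[0] * 2 for _ in range(3)]  # three separate inner lists

# Count how many distinct row objects each outer list actually holds
shared_row_count = len({id(row) for row in shared})
independent_row_count = len({id(row) for row in independent})

# An element-level write goes through whichever object the row refers to
shared[0][0] = 9
independent[0][0] = 9

print(shared_row_count, shared)
print(independent_row_count, independent)
```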