In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
from transformers import AutoTokenizer , AutoModelForCausalLM



In [2]:
model_name = 'microsoft/Phi-3.5-mini-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name , trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name , device_map = 'auto' , dtype = torch.float16)
model

Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.59s/it]


Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLUActivation()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_featur

In [3]:
prompt = [
    {'role': 'system' , 'content': 'you are a self-taught programmer'},
    {'role' : 'user' , 'content':'Solve: If 2x + 7 = 19, what is x?'}
]
prompt_text = tokenizer.apply_chat_template(prompt , add_generation_prompt= True , tokenize = False)
# Calling the tokenizer (rather than .encode) also returns the attention mask that generate() asks for
inputs = tokenizer(prompt_text , return_tensors = 'pt').to(model.device)
# temperature only takes effect when do_sample=True, so greedy decoding is made explicit here
output = model.generate(**inputs , max_new_tokens = 100 , do_sample = False)
print(tokenizer.decode(output[0] , skip_special_tokens= False))


<|system|> you are a self-taught programmer<|end|><|user|> Solve: If 2x + 7 = 19, what is x?<|end|><|assistant|> To solve for x, follow these steps:

1. Subtract 7 from both sides of the equation:
2x + 7 - 7 = 19 - 7
2x = 12

2. Divide both sides by 2:
2x / 2 = 12 / 2
x = 6

So, x = 6.<|end|>


## Chunking:


In [4]:
# A sample blog post about productivity
tdoc = """
The Art of Deep Work: A Guide to Productivity

In today's world of constant notifications and endless distractions, the ability to focus deeply on cognitively demanding tasks has become increasingly rare—and increasingly valuable. This is what author Cal Newport calls "deep work": professional activities performed in a state of distraction-free concentration that push your cognitive capabilities to their limit.

Why Deep Work Matters

Deep work is valuable for several reasons. First, it allows you to produce high-quality output in less time. When you're fully concentrated on a single task, you work more efficiently and make fewer mistakes. Second, deep work helps you master complex skills faster. Learning difficult concepts requires sustained attention—something impossible to achieve when constantly switching between tasks.

Creating Your Deep Work Environment

The first step in cultivating deep work is designing an environment that supports concentration. This means more than just finding a quiet space. Consider your physical setup: Is your desk organized? Is the lighting appropriate? Do you have everything you need within reach?

Digital distractions are equally important to address. Turn off notifications on your devices. Close unnecessary browser tabs. Use website blockers if needed. The goal is to create a space where your attention isn't constantly being pulled away from the task at hand.

Time Blocking for Deep Work

Scheduling specific blocks of time for deep work is crucial. Don't wait for free time to magically appear—it won't. Instead, treat deep work sessions as important appointments with yourself. Many people find that early morning hours work best, when their mental energy is highest and distractions are minimal.

Start with manageable blocks. If you're new to deep work, even 60-90 minutes of focused time can feel challenging. As you build your concentration muscles, gradually extend these sessions. Some professionals work up to four-hour blocks of uninterrupted deep work.

The Shutdown Ritual

Just as important as starting deep work is knowing when to stop. Develop a shutdown ritual to mark the end of your workday. This might include reviewing your task list for tomorrow, closing all work-related browser tabs, and saying a specific phrase like "shutdown complete."

This ritual serves multiple purposes. It helps you mentally disconnect from work, reduces anxiety about unfinished tasks, and ensures you've captured anything important before stepping away. Without this clear boundary, work thoughts tend to linger into your evening, preventing true rest and recovery.

Measuring Your Progress

Track your deep work hours each week. This simple metric provides valuable feedback on your habits. You might discover that you're spending less time in deep work than you thought, or that certain days of the week are more conducive to concentration than others.

Remember, the goal isn't to spend every waking hour in deep work. Even the most focused professionals typically max out at four to five hours of truly deep work per day. What matters is consistency and intentionality—making deep work a regular part of your routine rather than an occasional occurrence.
"""

print(f"Document length: {len(tdoc)} characters")
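# Note: this counts blank-line separators (headings included), so it is one less than the number of text blocks
print(f"Number of paragraphs: {tdoc.count(chr(10) + chr(10))}")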

Document length: 3219 characters
Number of paragraphs: 15
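
The `count` call above tallies blank-line separators rather than the text blocks between them. A minimal sketch of counting the blocks directly, using a short stand-in string (`sample` is a hypothetical example, not the `tdoc` above):

```python
# Stand-in text laid out like tdoc: blocks separated by a single blank line.
sample = "\nTitle\n\nFirst paragraph.\n\nHeading\n\nSecond paragraph.\n"

# Number of blank-line separators (what str.count("\n\n") measures).
separators = sample.count("\n\n")

# Number of non-empty blocks (headings and paragraphs together).
blocks = [block.strip() for block in sample.split("\n\n") if block.strip()]

print(f"Blank-line separators: {separators}")
print(f"Blocks (headings + paragraphs): {len(blocks)}")
```

Separating headings from body paragraphs would need an extra rule (for example, a length threshold), which is left out here.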


### Fixed size chunking:


In [5]:
!pip install "chonkie[viz]"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [6]:
from chonkie import TokenChunker , Visualizer
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('gpt2')
chunker = TokenChunker(tokenizer = tokenizer , chunk_size = 50 , chunk_overlap = 0)

chunks = chunker.chunk(tdoc)

print(f'Number of chunks created:{len(chunks)}')
print(f"Size of each chunk:{[chunk.token_count for chunk in chunks]}")

vis = Visualizer()
vis.print(chunks)

Number of chunks created:13
Size of each chunk:[50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 17]


### Fixed size chunking with overlap:


In [7]:
from chonkie import TokenChunker , Visualizer
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('gpt2')
chunker = TokenChunker(tokenizer = tokenizer , chunk_size = 50 , chunk_overlap = 20)

chunks = chunker.chunk(tdoc)

print(f'Number of chunks created:{len(chunks)}')
print(f"Size of each chunk:{[chunk.token_count for chunk in chunks]}")

vis = Visualizer()
vis.print(chunks)

Number of chunks created:20
Size of each chunk:[50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 47]
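
The chunk counts in the two runs above follow from a simple sliding window: each chunk starts `chunk_size - chunk_overlap` tokens after the previous one. The sketch below is a plain-Python illustration of that idea, not chonkie's actual implementation; `fixed_size_chunks` is a hypothetical helper, and the 617-token length is an assumption inferred from the first run (12 × 50 + 17).

```python
def fixed_size_chunks(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of chunk_size tokens, each starting
    chunk_size - chunk_overlap tokens after the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

# Assumption: the document is 617 tokens long (12 * 50 + 17 from the output above).
tokens = list(range(617))

no_overlap = fixed_size_chunks(tokens, chunk_size=50)
with_overlap = fixed_size_chunks(tokens, chunk_size=50, chunk_overlap=20)

print(f"No overlap: {len(no_overlap)} chunks, last has {len(no_overlap[-1])} tokens")
print(f"Overlap 20: {len(with_overlap)} chunks, last has {len(with_overlap[-1])} tokens")
```

Under that assumption the sketch yields 13 and 20 chunks with final sizes 17 and 47, matching the outputs above: overlap buys context continuity at the cost of more chunks to embed and store.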


In [8]:
print(chunks[0])
print(chunks[1])


The Art of Deep Work: A Guide to Productivity

In today's world of constant notifications and endless distractions, the ability to focus deeply on cognitively demanding tasks has become increasingly rare—and increasingly valuable. This is what author Cal Newport
 on cognitively demanding tasks has become increasingly rare—and increasingly valuable. This is what author Cal Newport calls "deep work": professional activities performed in a state of distraction-free concentration that push your cognitive capabilities to their limit.

Why Deep Work


In [9]:
# Load Dataset:
from datasets import load_dataset
data = load_dataset("m-ric/huggingface_doc")
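# len(data) counts the splits in the DatasetDict, so index the split to count documents
print(f"Number of documents: {len(data['train'])}")

Number of documents: 2647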


In [10]:
# Print a sample document, truncating long string fields to 500 characters
doc = data['train'][0]

for key , val in doc.items():
    if isinstance(val , str) and len(val) > 500:
        print(f"{key}: {val[:500]}...")
    else:
        print(f"{key}: {val}")

text:  Create an Endpoint

After your first login, you will be directed to the [Endpoint creation page](https://ui.endpoints.huggingface.co/new). As an example, this guide will go through the steps to deploy [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) for text classification. 

## 1. Enter the Hugging Face Repository ID and your desired endpoint name:

<img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-docu...
source: huggingface/hf-endpoints-documentation/blob/main/docs/source/guides/create_endpoint.mdx


In [11]:
#convert text to chunks:

tokenizer = Tokenizer.from_pretrained('gpt2')
chunker = TokenChunker(
    tokenizer = tokenizer,
    chunk_size = 256,
    chunk_overlap = 32
)

# Collect one list of chunks per document (a list of lists, not a flat list of chunks)
chunks = []
for doc in data['train']:
    chs = chunker.chunk(doc['text'])
    chunks.append(chs)

print(f"Total #of chunk lists:{len(chunks)}")

Total #of chunk lists:2647


In [12]:
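# Note: this keeps only the first chunk of each document, so later chunks are never embedded.
# It also assumes every document produced at least one chunk.
ctexts = [doc_chunks[0].text for doc_chunks in chunks]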
print(f"# of chunks to embed:{len(ctexts)}")

# of chunks to embed:2647
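
Because `chunks` holds one list per document, indexing `[0]` selects just the first chunk of each. If the goal is to embed every chunk, the nested lists can be flattened instead. A minimal sketch, using a hypothetical `FakeChunk` stand-in so it runs without chonkie (the only assumption is that chunk objects expose a `.text` attribute, as used above):

```python
from dataclasses import dataclass
from itertools import chain


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a chunk object with a .text attribute."""
    text: str


# One inner list per document, mirroring the shape of `chunks` above.
chunk_lists = [
    [FakeChunk("doc0 part0"), FakeChunk("doc0 part1")],
    [FakeChunk("doc1 part0")],
    [],  # a document that produced no chunks
]

# First chunk of each document only (skipping empty lists to avoid an IndexError).
first_only = [doc_chunks[0].text for doc_chunks in chunk_lists if doc_chunks]

# Every chunk of every document, flattened into one list.
all_texts = [chunk.text for chunk in chain.from_iterable(chunk_lists)]

print(f"First-chunk-only texts: {len(first_only)}")
print(f"All chunk texts: {len(all_texts)}")
```

Flattening changes the number of embeddings and stored documents, so the counts reported in the cells below would no longer be 2647.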


In [13]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [14]:
embeddings = embedding_model.encode(ctexts)
print(f"Embeddings shape:{embeddings.shape}")


Embeddings shape:(2647, 384)


In [15]:
# Embed a sample query using the same model
sample_query = "How do I load a pretrained model?"
query_embedding = embedding_model.encode(sample_query)

print(f"Query: '{sample_query}'")
print(f"Query embedding shape: {query_embedding.shape}")
print(f"Query embedding (first 10 values): {query_embedding[:10]}")

Query: 'How do I load a pretrained model?'
Query embedding shape: (384,)
Query embedding (first 10 values): [-0.02789504 -0.04522779  0.00200946  0.0494822  -0.01643985  0.08898936
 -0.08490766  0.04477691 -0.05589384 -0.0421176 ]
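
Query and chunk embeddings are compared by vector similarity. The sketch below shows cosine similarity on hypothetical three-dimensional toy vectors (the real embeddings have 384 dimensions), and checks the identity that for unit-length vectors the squared L2 distance equals `2 - 2 * cosine`. Whether a given embedding model returns unit-length vectors is an assumption to verify, not something this notebook shows.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical toy vectors standing in for real embeddings.
query_vec = normalize([1.0, 2.0, 0.5])
doc_vecs = {
    "doc_a": normalize([1.0, 1.8, 0.4]),   # points in nearly the same direction
    "doc_b": normalize([-1.0, 0.2, 3.0]),  # points elsewhere
}

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)

for name in ranked:
    cos = cosine_similarity(query_vec, doc_vecs[name])
    dist = squared_l2(query_vec, doc_vecs[name])
    print(f"{name}: cosine={cos:.4f}  squared_l2={dist:.4f}  2-2cos={2 - 2 * cos:.4f}")
```

This identity is why a vector store that reports L2-style distances on normalized embeddings produces the same ranking as cosine similarity, with smaller distances meaning closer matches.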


In [19]:
!pip install chromadb


In [27]:
import chromadb
client = chromadb.Client()
try:
    client.get_collection(name = 'huggingface_docs')
    client.delete_collection(name = 'huggingface_docs')
except Exception:
    print("Collection doesn't exist yet.")
collection = client.create_collection(
    name = 'huggingface_docs',
    metadata = {'description': "Hugging Face documentation with metadata"}
)
ids = [f"doc{i + 1}" for i in range(len(ctexts))]


collection.add(
    ids = ids,
    documents = ctexts , 
    embeddings = embeddings.tolist(),
)

print(f"{collection.name} now contains {collection.count()} documents.")

huggingface_docs now contains 2647 documents.
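
A single `add` call is fine at this size. If the collection grows well beyond it (for example, after embedding every chunk rather than one per document), inserting in batches keeps each call small. A minimal sketch of the batching logic only; `batched` and the batch size of 4 are hypothetical, and the `collection.add` call is left as a comment so the block runs without a database:

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical small inputs standing in for ids / ctexts above.
sample_ids = [f"doc{i + 1}" for i in range(10)]
sample_texts = [f"text for document {i + 1}" for i in range(10)]

batch_sizes = []
for id_batch, text_batch in zip(batched(sample_ids, 4), batched(sample_texts, 4)):
    # In the notebook, this is where each batch would be inserted, e.g.
    # collection.add(ids=id_batch, documents=text_batch, embeddings=...)
    batch_sizes.append(len(id_batch))

print(f"Batch sizes: {batch_sizes}")
```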


In [31]:
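# Search the collection. Note: query_texts is embedded by the collection's default embedding
# function, not by the embedding_model used for the documents above. Passing
# query_embeddings = embedding_model.encode([query]).tolist() would keep both sides on one model.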
query = 'How do I load a pre-trained model ?'

results = collection.query(
    query_texts = [query],
    n_results = 3
)

print(f"Query: '{query}'\n")
print("="*80)
print("Most Relevant Results:\n")

for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0]), 1):
    print(f"Result {i} (Distance: {distance:.4f}):")
    print(f"{doc[:300]}...")  # Print first 300 characters
    print("-"*80)

Query: 'How do I load a pre-trained model ?'

Most Relevant Results:

Result 1 (Distance: 0.8319):
n this video, we're going to see how to load and fine-tune a pre-trained model. It's very quick, and if you've watched our pipeline videos, which I'll link below, the process is very similar. This time, though, we're going to be using transfer learning and doing some training ourselves, rather than ...
--------------------------------------------------------------------------------
Result 2 (Distance: 0.8459):
FrameworkSwitchCourse {fw} />

# Introduction[[introduction]]

<CourseFloatingBanner
    chapter={3}
    classNames="absolute z-10 right-0 top-0"
/>

In [Chapter 2](/course/chapter2) we explored how to use tokenizers and pretrained models to make predictions. But what if you want to fine-tune a pret...
--------------------------------------------------------------------------------
Result 3 (Distance: 0.8738):
ow to instantiate a Transformers model? In this video we will look at how