## Pydantic Basics

- Pydantic is a data validation library.
- Models define the structure of the data with type hints and validate it at runtime.

In [9]:
from pydantic import BaseModel, Field
from typing import Optional, List

In [None]:
class Person(BaseModel):
    name: str
    age: int
    city: str

In [6]:
person1 = Person(name="Krish", age=35, city="Bangalore")
print(person1)
print(type(person1))

name='Krish' age=35 city='Bangalore'
<class '__main__.Person'>


In [7]:
# city=12 is an int, not a str, so this raises a ValidationError
person2 = Person(name="Krish", age=35, city=12)
print(person2)

ValidationError: 1 validation error for Person
city
  Input should be a valid string [type=string_type, input_value=12, input_type=int]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type
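The error above can also be caught and inspected programmatically. The following is a minimal sketch, assuming Pydantic v2 (the version shown in the error URL); it redefines `Person` so the snippet runs on its own.

```python
from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int
    city: str


errors = []
try:
    Person(name="Krish", age=35, city=12)
except ValidationError as exc:
    # errors() returns one dict per failed field
    errors = exc.errors()
    for err in errors:
        # loc is the path to the field, type is a machine-readable error code
        print(err["loc"], err["type"], err["msg"])
```

Reading `loc` and `type` is usually more robust than parsing the printed message.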

- Model with optional fields 

In [14]:
class Employee(BaseModel):
    id: int
    name: str
    department: str

    # optional fields: they can be omitted because each one has a default value
    salary: Optional[float] = None
    is_active: Optional[bool] = True

In [10]:
emp1 = Employee(id=1, name="John", department="IT")
print(emp1)

id=1 name='John' department='IT' salary=None is_active=True


In [16]:
# salary is passed as an int and coerced to float (50000 -> 50000.0)
emp2 = Employee(id=2, name="Jane", department="HR", salary=50000)
print(emp2)

id=2 name='Jane' department='HR' salary=50000.0 is_active=True
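Note that in Pydantic v2 it is the default value, not `Optional[...]`, that makes a field omittable. A small sketch, using a hypothetical `Profile` model that is not part of the notebook:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Profile(BaseModel):  # hypothetical model, for illustration only
    nickname: Optional[str]      # may be None, but must still be passed
    bio: Optional[str] = None    # may be None and may be omitted


ok = Profile(nickname=None)

missing_fields = []
try:
    Profile()  # nickname has no default, so it is required
except ValidationError as exc:
    missing_fields = [err["loc"] for err in exc.errors()]

print(ok)
print(missing_fields)
```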


- Model with List values

In [18]:
class Classroom(BaseModel):
    room_number: str
    students: List[str]
    capacity: int

In [None]:
classroom1 = Classroom(
    room_number="A612",

    # a tuple (or set) is coerced to a list in Pydantic's default lax mode
    students=('Alice', 'Bob', 'Jamie'),
    capacity=20
)
print(classroom1)

room_number='A612' students=['Alice', 'Bob', 'Jamie'] capacity=20


In [23]:
try:
    invalid_classroom = Classroom(
        room_number="A612",
        students=["Shashank", 42],  # 42 is not a string, so validation fails
        capacity=30
    )
except ValueError as e:  # pydantic's ValidationError is a subclass of ValueError
    print(e)

1 validation error for Classroom
students.1
  Input should be a valid string [type=string_type, input_value=42, input_type=int]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type
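The coercions seen so far (int to float, tuple to list) happen in Pydantic's default lax mode. If they are not wanted, validation can be run in strict mode. A minimal sketch, assuming Pydantic v2 and its `strict` argument to `model_validate`:

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Classroom(BaseModel):
    room_number: str
    students: List[str]
    capacity: int


data = {"room_number": "A612", "students": ("Alice", "Bob"), "capacity": 20}

# lax mode (default): the tuple is converted to a list
lax = Classroom.model_validate(data)

# strict mode: the tuple is rejected because it is not a list
strict_error_types = []
try:
    Classroom.model_validate(data, strict=True)
except ValidationError as exc:
    strict_error_types = [err["type"] for err in exc.errors()]

print(lax.students)
print(strict_error_types)
```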


- Complex Structure Nested Models

In [7]:
class Address(BaseModel):
    street: str
    city: str
    zip_code: int


class Customer(BaseModel): 
    customer_id: int
    name: str
    address: Address 

In [8]:
customer1 = Customer(
    customer_id=1,
    name="Hix",
    # the dict is validated into an Address; zip_code is coerced from str to int
    address={"street": "Kh Road", "city": "Gandhinagar", "zip_code": "382028"}
)

print(customer1)

customer_id=1 name='Hix' address=Address(street='Kh Road', city='Gandhinagar', zip_code=382028)
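A nested model can also be built from an `Address` instance directly and converted back to plain dicts. A short sketch, assuming Pydantic v2 (`model_dump` and `model_validate`); the models are repeated so the snippet is self-contained.

```python
from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    zip_code: int


class Customer(BaseModel):
    customer_id: int
    name: str
    address: Address


customer = Customer(
    customer_id=1,
    name="Hix",
    address=Address(street="Kh Road", city="Gandhinagar", zip_code=382028),
)

# model_dump() converts nested models into nested dicts
as_dict = customer.model_dump()

# the dict can be validated back into an equal model
round_trip = Customer.model_validate(as_dict)
print(as_dict)
```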


- Pydantic Fields: Customization and Constraints (Specify validation rules)

In [10]:
class Item(BaseModel):
    name: str = Field(min_length=2, max_length=50)
    price: float = Field(gt=0, le=1000)
    quantity: int = Field(ge=0)

In [13]:
item1 = Item(name="Book", price=45.43, quantity=1)
print(item1)

name='Book' price=45.43 quantity=1
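The cell above only shows a valid item. As a sketch of what happens when the constraints are violated (assuming Pydantic v2 error codes), each failing field is reported separately:

```python
from pydantic import BaseModel, Field, ValidationError


class Item(BaseModel):
    name: str = Field(min_length=2, max_length=50)
    price: float = Field(gt=0, le=1000)
    quantity: int = Field(ge=0)


violations = {}
try:
    # name is too short, price is not > 0, quantity is not >= 0
    Item(name="B", price=0, quantity=-1)
except ValidationError as exc:
    # map each field name to the error code of the rule it broke
    violations = {err["loc"][0]: err["type"] for err in exc.errors()}

print(violations)
```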


- Fields with default values

In [14]:
class User(BaseModel):
    username: str = Field(description="Unique username for the user")
    age: int = Field(default=18, description="User age, defaults to 18")
    email: str = Field(default="user@email.com", description="Default Email Address")

In [15]:
user1 = User(username="alice")
print(user1)

username='alice' age=18 email='user@email.com'


In [16]:
user2 = User(username="bob", age=25, email="bob@domain.com")
print(user2)

username='bob' age=25 email='bob@domain.com'


In [19]:
User.model_json_schema()

{'properties': {'username': {'description': 'Unique username for the user',
   'title': 'Username',
   'type': 'string'},
  'age': {'default': 18,
   'description': 'User age, defaults to 18',
   'title': 'Age',
   'type': 'integer'},
  'email': {'default': 'user@email.com',
   'description': 'Default Email Address',
   'title': 'Email',
   'type': 'string'}},
 'required': ['username'],
 'title': 'User',
 'type': 'object'}
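`Field` constraints show up in the JSON schema as well, not only descriptions and defaults. A small sketch reusing the `Item` model from the constraints section, assuming Pydantic v2's `model_json_schema`:

```python
from pydantic import BaseModel, Field


class Item(BaseModel):
    name: str = Field(min_length=2, max_length=50)
    price: float = Field(gt=0, le=1000)
    quantity: int = Field(ge=0)


schema = Item.model_json_schema()
props = schema["properties"]

# min_length/max_length -> minLength/maxLength
print(props["name"])
# gt -> exclusiveMinimum, le -> maximum
print(props["price"])
# ge -> minimum
print(props["quantity"])
# fields without a default are listed as required
print(schema["required"])
```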

# LangChain and OpenAI

- LangChain, LangSmith, LangServe

- Basic components: prompt templates, models, output parsers
- Build the app with LangChain
- Trace the app with LangSmith
- Serve the app with LangServe

- .env file: contains the API keys (OpenAI, LangChain)
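The notebook does not show how the .env file is loaded (a common choice is `load_dotenv()` from the python-dotenv package, which is an assumption here). Once the keys are in the environment, a quick check like the sketch below can catch a missing key early. The variable names `OPENAI_API_KEY` and `LANGCHAIN_API_KEY` are assumed, not taken from the notebook.

```python
import os

# assumed variable names; adjust to whatever the .env file actually defines
REQUIRED_KEYS = ("OPENAI_API_KEY", "LANGCHAIN_API_KEY")


def missing_keys(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required API keys are set.")
```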

In [32]:
from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader,
    WebBaseLoader, 
    ArxivLoader,
    WikipediaLoader
)
import bs4

### Data Ingestion
Document loader reference: https://python.langchain.com/docs/integrations/document_loaders/

In [2]:
# text loader
txt_loader = TextLoader('./data/minerva_report.txt')
print(txt_loader)

<langchain_community.document_loaders.text.TextLoader object at 0x792cbc136900>


In [12]:
txt_documents = txt_loader.load()

In [14]:
print(txt_documents[0].metadata)

{'source': './data/minerva_report.txt'}


In [27]:
print(txt_documents[0].page_content[:500])

Detailed Report on Minerva Academy and Its Current Form (as of July 28, 2025)
Overview

Minerva Academy is a prominent institution with multiple branches specializing in sports (notably football), teacher education, and distance learning. Its flagship, Minerva Academy FC, based in Mohali, Punjab, is widely recognized as one of India’s most successful youth football academies, while their educational wings operate B.Ed. and distance education programs across teaching and professional domains.
Foo


In [None]:
# pdf loader
pdf_loader = PyPDFLoader("./data/resume.pdf")
pdf_documents = pdf_loader.load()    # returns one Document per page
len(pdf_documents)

2

In [23]:
pdf_documents[0].metadata

{'producer': 'pdfTeX-1.40.26',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2025-07-04T09:12:57+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2025-07-04T09:12:57+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.26 (TeX Live 2024) kpathsea version 6.4.0',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': './data/resume.pdf',
 'total_pages': 2,
 'page': 0,
 'page_label': 'i'}

In [26]:
print(pdf_documents[0].page_content[:1000])

Shashank Sharma
Department of Computer Science & Engineering /envel⌢peshashank.0901sharma@gmail.com / ♂phone+91-8490849270
Indraprastha Institute of Information Technology, Delhi /linkedinshashankgsharma / /githubshashank23088
EDUCATION
Year Degree/Certificate Institute CPI/%
2023-2025 M.Tech/CSE (AI Specialization) Indraprastha Institute of Information Technology , Delhi 8.07/10
2019-2023 B.E/Information & Communication
Technology
Adani Institute of Infrastructure Engineering,
Ahmedabad (Gujarat)
9.22/10
2019 CBSE(HSC) Kendriya Vidyalaya, Gandhinagar (Gujarat) 84.2%
2017 CBSE(SSC) Kendriya Vidyalaya, Gandhinagar (Gujarat) 10/10
PUBLICATIONS
• Pulse of the Crowd: Quantifying Crowd Energy through Audio and Video Analysis
IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 2024.
◦ Developed a pipeline to score videos based on perceived crowd energy using audio and video analysis.
◦ Utilized STEERER for crowd density maps and ResNet50, VGG18, AlexN

In [19]:
# web-based loader
url = "https://timesofindia.indiatimes.com/business/india-business/tcs-layoffs-biggest-ever-for-indian-it-artificial-intelligence-not-to-blame-for-difficult-decision-top-10-things-to-know-about-mass-sackings/articleshow/122949729.cms"

web_loader = WebBaseLoader(
    web_path=url,
    # parse only elements with these CSS classes (the class names are specific to this page)
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(
        class_=("pZFl7", "heightCalc")
    ))
)

web_documents = web_loader.load()
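The `bs_kwargs` above are passed on to BeautifulSoup, so the effect of `parse_only` can be tried on a small inline HTML string without any network call. A sketch with made-up HTML and a made-up class name (`article`), using a single class string rather than the tuple used above:

```python
import bs4

# made-up HTML: one navigation block and one article block
html = '<div class="nav">menu</div><div class="article">Body text</div>'

# keep only elements whose CSS class is "article"
only_article = bs4.SoupStrainer(class_="article")
soup = bs4.BeautifulSoup(html, "html.parser", parse_only=only_article)

text = soup.get_text()
print(text)
```

Everything outside the matching elements is dropped at parse time, which is why the loaded page content contains only the article body.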

In [20]:
print(len(web_documents))
web_documents[0].metadata

1


{'source': 'https://timesofindia.indiatimes.com/business/india-business/tcs-layoffs-biggest-ever-for-indian-it-artificial-intelligence-not-to-blame-for-difficult-decision-top-10-things-to-know-about-mass-sackings/articleshow/122949729.cms'}

In [25]:
print(len(web_documents[0].page_content))
web_documents[0].page_content[6000:7001]

9962


"tails regarding the calculation of the 2% reduction, the implementation process for the staff reductions, or whether additional rounds of job cuts would follow in subsequent periods.9. Tough IT Sector EnvironmentThe Indian IT sector is facing unprecedented job cuts, mirroring practices commonly seen in US companies, causing widespread concern throughout the industry. The combination of global economic uncertainties and disruptions caused by artificial intelligence technology continues to affect business demand.10. Global Trends of LayoffsBased on the data from Layoffs.fyi, a platform monitoring global tech industry redundancies, more than 80,000 technology sector employees have lost their jobs across 169 companies in 2025.2024, witnessed approximately 150,000 job losses spanning 551 technology firms. These figures reflect both worldwide economic challenges and ongoing discussions within the technology sector regarding artificial intelligence's influence on employment opportunities and

In [27]:
# arXiv as a data source (the query is a single paper ID, so one document is returned)
doc_id = "2407.03305"
arxiv_loader = ArxivLoader(query=doc_id, load_max_docs=2)
arxiv_documents = arxiv_loader.load()

In [28]:
print(len(arxiv_documents))
arxiv_documents[0].metadata

1


{'Published': '2024-07-05',
 'Title': 'Advanced Smart City Monitoring: Real-Time Identification of Indian Citizen Attributes',
 'Authors': 'Shubham Kale, Shashank Sharma, Abhilash Khuntia',
 'Summary': "This project focuses on creating a smart surveillance system for Indian\ncities that can identify and analyze people's attributes in real time. Using\nadvanced technologies like artificial intelligence and machine learning, the\nsystem can recognize attributes such as upper body color, what the person is\nwearing, accessories they are wearing, headgear, etc., and analyze behavior\nthrough cameras installed around the city."}

In [31]:
print(arxiv_documents[0].page_content[:1000])

Advanced Smart City Monitoring: Real-Time
Identification of Indian Citizen Attributes
Shubham Kale
M.Tech CSE
Dept. of CSE
IIIT Delhi
shubham23094@iiitd.ac.in
Shashank Sharma
M.Tech CSE
Dept. of CSE
IIIT Delhi
shashank23088@iiitd.ac.in
Abhilash Khuntia
M.Tech CSE
Dept. of CSE
IIIT Delhi
abhilash23007@iiitd.ac.in
Abstract—This project focuses on creating a smart surveillance
system for Indian cities that can identify and analyze people’s
attributes in real time. Using advanced technologies like artificial
intelligence and machine learning, the system can recognize
attributes such as upper body color what the person is wearing,
accessories that he or she is wearing, headgear check, etc.,
and analyze behavior through cameras installed around the
city. We have provided all our code for our experiments at
https://github.com/abhilashk23/vehant-scs-par We will be contin-
uously updating the above GitHub repo to keep up-to-date with
the most cutting-edge work on person attribute recognition.
I

In [33]:
# Wikipedia as a data source
wikipedia_loader = WikipediaLoader(query="Huffman coding", load_max_docs=2)
wikipedia_docs = wikipedia_loader.load()

In [34]:
print(len(wikipedia_docs))
wikipedia_docs[0].metadata

2


{'title': 'Huffman coding',
 'summary': 'In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".\nThe output from Huffman\'s algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file).  The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol.  As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols.  Huffman\'s method can be efficiently implemented, finding a code in time linear to the number of input weights if these weights are sorted.  Howe

In [37]:
print(wikipedia_docs[0].page_content[:1000])

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file).  The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol.  As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols.  Huffman's method can be efficiently implemented, finding a code in time linear to the number of input weights if these weights are sorted.  However, although optimal among methods encoding