### Installation

In [1]:
%%capture
!pip install langchain
!pip install langchain-openai
!pip install -U langchain-community
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken
!pip install unstructured[all]
!pip install "unstructured[pdf]"
!pip install pdfminer.six
!pip install pi_heif
!pip install unstructured-inference
!pip install --upgrade nltk



In [2]:
%%capture
# %%capture must be the first line of the cell. poppler is a system prerequisite for the pdf2image library, which handles PDF conversion.
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr


### Load Required Packages

In [3]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import FAISS
from langchain_openai import OpenAI

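### Overview
| Module                    | Role                 | Description                                                              |
| ------------------------- | -------------------- | ------------------------------------------------------------------------ |
| `UnstructuredPDFLoader`   | Document loader      | Reads a PDF file and extracts its text                                   |
| `OpenAIEmbeddings`        | Embedding generation | OpenAI embedding class that converts text into vectors                   |
| `VectorstoreIndexCreator` | Vector index builder | Puts documents into a vector store and prepares them for search          |
| `FAISS`                   | Vector store         | Stores vectors and searches them quickly                                 |
| `OpenAI`                  | LLM access           | Wrapper for calling OpenAI completion models (chat models use `ChatOpenAI`) |
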
1. from langchain.document_loaders import UnstructuredPDFLoader
A document loading tool.
- `UnstructuredPDFLoader` reads a PDF and extracts the text inside it.
- LangChain's document loaders handle many formats (PDF, DOCX, HTML, and so on).
- The Unstructured family of loaders attempts layout-aware text extraction.

2. from langchain.embeddings import OpenAIEmbeddings
- LangChain's wrapper around OpenAI's embedding models.
- Converts text into vectors that can be used for search or similarity analysis.
- Runs on OpenAI models such as text-embedding-ada-002.

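3. from langchain.indexes import VectorstoreIndexCreator
- A utility class that bundles the whole process of loading documents, embedding them, and saving them to a vector store.
- It wires an embedding model to a vector store and builds the index for you. It uses its own default vector store unless you pass `vectorstore_cls` (for example `FAISS`).
- Typical usage: `index = VectorstoreIndexCreator(embedding=embeddings).from_loaders([loader])`.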

4. from langchain.vectorstores import FAISS
- The LangChain module that wraps FAISS, the vector search library created by Facebook.
- FAISS is an open-source library for high-speed vector search.
- Besides FAISS, there are many other vector stores, such as Chroma, Weaviate, and Pinecone.

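5. from langchain_openai import OpenAI
- The OpenAI wrapper that follows LangChain's current package layout.
- `langchain_openai` was split out as an official separate package as of LangChain 0.1.
- `OpenAI()` calls completion-style models such as `gpt-3.5-turbo-instruct`. Chat models such as GPT-4 or GPT-3.5 Turbo are called through `ChatOpenAI` from the same package.
- Recent versions recommend `langchain_openai.OpenAI` (`langchain.llms.OpenAI` is deprecated).
```
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0.3)
```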
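
To make the roles of the embedding model and the vector store concrete, here is a minimal sketch of similarity search in plain Python. The three-dimensional vectors, the texts, and the helper names `cosine_similarity` and `top_k` are all made up for illustration. Real embeddings come from the OpenAI API, and FAISS uses far more efficient index structures than this linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; real models return far more dimensions.
toy_store = [
    ("BERT is a transformer model", [0.9, 0.1, 0.0]),
    ("Reinforcement learning uses rewards", [0.1, 0.9, 0.1]),
    ("Statistics underpins machine learning", [0.2, 0.3, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "What is BERT?"

results = top_k(query_vector, toy_store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A vector store does the same job at scale: it keeps (text, vector) pairs and returns the entries whose vectors lie closest to the query vector.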

### End-to-End Example
```
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai import OpenAI, OpenAIEmbeddings

# 1. Load the PDF document
loader = UnstructuredPDFLoader("example.pdf")

# 2. Build the vector index (embed the document and store the vectors)
index = VectorstoreIndexCreator(
    embedding=OpenAIEmbeddings(),
    vectorstore_cls=FAISS
).from_loaders([loader])

# 3. Retrieve relevant chunks and answer the user's question
query = "What is the key summary of this document?"
# OpenAI() targets the completions endpoint, so use a completion model here
result = index.query(query, llm=OpenAI(model="gpt-3.5-turbo-instruct"))
print(result)
```

### OpenAI API Key

In [18]:
# Get your API key from OpenAI; you will need to create an account.
# Here is the link to get the keys: https://platform.openai.com/account/billing/overview
import os

my_openai_key = 'YOUR_OPENAI_API_KEY'  # placeholder: paste your own key here and never commit a real key to a notebook
os.environ['OPENAI_API_KEY'] = my_openai_key  # os.environ is Python's mapping of environment variables. In real projects the key is usually kept only in a .env file or an environment variable.
my_open_key = my_openai_key
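
As the comment above says, real projects usually keep the key out of the notebook. One way to do that is sketched below: read `OPENAI_API_KEY` from the environment and ask for it interactively only when it is missing. The helper name `load_openai_key` and its parameters are hypothetical and not part of LangChain or the OpenAI SDK.

```python
import os
from getpass import getpass

def load_openai_key(env=os.environ, prompt=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        key = prompt("OpenAI API key: ")  # input is not echoed to the screen
        env["OPENAI_API_KEY"] = key       # make it visible to libraries that read the variable
    return key

# In the notebook you would call:
# my_openai_key = load_openai_key()
```

Because the key is typed at a hidden prompt, it never appears in the saved notebook or its outputs.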

### Connect Google Drive
How to connect Google Colab with Google Drive (highly recommended reading):

https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/

In [8]:
import os
from google.colab import drive
drive.mount('/content/drive/')

base_path = '/content/drive/My Drive/'
data_path_googleDrive = 'GenAI_sample_data/'
data_path = os.path.join(base_path, data_path_googleDrive)
print("--------Data Path in Google Drive \n")
print(data_path)
print("--------------------------------")

Mounted at /content/drive/
--------Data Path in Google Drive 

/content/drive/My Drive/GenAI_sample_data/
--------------------------------


In [9]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive"

Mounted at /content/gdrive


In [10]:
pdf_folder_path = f'{root_dir}/GenAI_sample_data/sample_pdf_data/'
os.listdir(pdf_folder_path)

['sample_dl.pdf',
 'sample_DS_07Nov2024.pdf',
 'sample_DS.pdf',
 'sample_ML.pdf',
 'sample_statistical_ml.pdf']

In [11]:
import nltk
import os

# Step 1: Create the NLTK data directory if it doesn't exist
nltk_data_path = "/sample_nltk_data/nltk_data"
os.makedirs(nltk_data_path, exist_ok=True)

# Step 2: Set the NLTK data path
nltk.data.path.append(nltk_data_path)

# Step 3: Download the 'punkt' tokenizer to the specified directory
nltk.download('punkt', download_dir=nltk_data_path)



[nltk_data] Downloading package punkt to
[nltk_data]     /sample_nltk_data/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [14]:
!ls /root/nltk_data/tokenizers/punkt  # fails: punkt was downloaded to /sample_nltk_data/nltk_data, not the default /root/nltk_data

ls: cannot access '/root/nltk_data/tokenizers/punkt': No such file or directory
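
The listing fails because `punkt` was downloaded to `/sample_nltk_data/nltk_data`, while the command looks in the default `/root/nltk_data`. NLTK searches each directory in `nltk.data.path` in turn, and the sketch below mimics that lookup with the standard library only. The helper `find_resource` is hypothetical and is not an NLTK function.

```python
import os

def find_resource(search_dirs, relative_path):
    """Return the first directory in search_dirs that contains relative_path, else None."""
    for base in search_dirs:
        if os.path.exists(os.path.join(base, relative_path)):
            return base
    return None

# In the notebook, search_dirs would be nltk.data.path.
candidate_dirs = ["/root/nltk_data", "/sample_nltk_data/nltk_data"]
print(find_resource(candidate_dirs, os.path.join("tokenizers", "punkt")))
```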


### Load Multiple PDF files

In [15]:
# location of the pdf file/files.
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]

In [16]:
loaders

[<langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x794ed917aa90>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x794eda4261d0>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x794f18790950>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x794f186c0ed0>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x794eeb320790>]
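
The list comprehension above creates a loader for every entry that `os.listdir` returns, including any non-PDF files or subfolders that end up in the directory. A slightly more defensive variant is sketched below. The helper name `list_pdf_paths` is an assumption for illustration.

```python
import os

def list_pdf_paths(folder):
    """Return the sorted full paths of the .pdf files directly inside folder."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")  # skip anything that is not a PDF
    )
```

In the notebook it would be used as `loaders = [UnstructuredPDFLoader(p) for p in list_pdf_paths(pdf_folder_path)]`. Sorting also makes the loader order reproducible between runs.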

### Vector Store
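Chroma serves as the vector store that indexes and searches the embeddings. It is the default for `VectorstoreIndexCreator` in older LangChain releases; pass `vectorstore_cls` explicitly if you need a specific store.


Three main steps happen after the documents are loaded:

- Splitting the documents into chunks

- Creating an embedding for each chunk

- Storing the chunks and their embeddings in a vector store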
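
The first of these steps can be sketched as a fixed-size character splitter with overlap. This is a simplified stand-in, not LangChain's own text splitter, and the function name and the `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_into_chunks(sample, chunk_size=10, overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval.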


In [19]:
# Initialize your embedding model
embeddings = OpenAIEmbeddings()  # Or another embedding class you're using

# Now pass the embedding to the VectorstoreIndexCreator
index = VectorstoreIndexCreator(embedding=embeddings).from_loaders(loaders)




In [22]:
# Initialize the LLM
llm_for_question = OpenAI(temperature=0.1)  # Adjust the temperature as needed

# Query the index with the LLM
response = index.query('What was the main topic of the address?', llm=llm_for_question)

# Print the response
print(response)

 The main topic of the address is the impact of deep learning and BERT on the field of natural language processing.


In [23]:
response = index.query('who is mehdi?', llm=llm_for_question)

# Print the response
print(response)

 Mehdi is an instructor at the school who supervised the collection of this text using ChatGPT. You can contact Mehdi at zadeh1980mehdi@gmail.com.


In [24]:
index.query('What was the summary of the address?', llm=llm_for_question)

' The text discusses the concept of deep learning and its impact on various industries, with a focus on BERT, a pre-trained transformer model designed for natural language processing. It explains the key features of BERT, such as its bidirectional approach and ability to understand contextual relationships between words, and highlights its potential for advancing AI applications.'

In [25]:
index.query_with_sources('what is Reinforcement Learning (RL)? ', llm=llm_for_question)

{'question': 'what is Reinforcement Learning (RL)? ',
 'answer': ' Reinforcement Learning (RL) is a subset of machine learning that involves learning through trial and error to optimize outcomes. It is used in various applications, such as image and speech recognition, recommendation systems, and predictive analytics. \n',
 'sources': '/content/gdrive/My Drive/GenAI_sample_data/sample_pdf_data/sample_ML.pdf'}

In [26]:
index.query_with_sources('who is Mehdi ? ', llm=llm_for_question)

{'question': 'who is Mehdi ? ',
 'answer': ' Mehdi is an instructor at the school.\n',
 'sources': '/content/gdrive/My Drive/GenAI_sample_data/sample_pdf_data/sample_DS_07Nov2024.pdf, /content/gdrive/My Drive/GenAI_sample_data/sample_pdf_data/sample_ML.pdf, /content/gdrive/My Drive/GenAI_sample_data/sample_pdf_data/sample_dl.pdf'}
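
As the output shows, `query_with_sources` returns the sources as one comma-separated string. A small helper can turn that string into a list of file names. The function `source_filenames` below is a hypothetical convenience, and it assumes that no file path contains a comma.

```python
import os

def source_filenames(result):
    """Split the comma-separated 'sources' string into a list of file names."""
    raw = result.get("sources", "")
    paths = [part.strip() for part in raw.split(",") if part.strip()]
    return [os.path.basename(path) for path in paths]

# Shortened copy of the result shown above
example = {
    "question": "who is Mehdi ? ",
    "answer": " Mehdi is an instructor at the school.\n",
    "sources": "/content/gdrive/My Drive/GenAI_sample_data/sample_pdf_data/sample_DS_07Nov2024.pdf, "
               "/content/gdrive/My Drive/GenAI_sample_data/sample_pdf_data/sample_ML.pdf",
}
print(source_filenames(example))
```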

In [27]:
# Generation settings for an LLM (temperature does not apply to embedding models); this dict is not used by the cells below
model_kwargs = {
    "temperature": 0.7                        # Sampling temperature: higher values give more random output (OpenAI accepts 0 to 2)
}

In [28]:
# Initialize a second embedding model, passing the API key explicitly
embeddings_2 = OpenAIEmbeddings(
    api_key=my_open_key  # your OpenAI API key
)


In [29]:
# Now pass the embedding to the VectorstoreIndexCreator
index_2 = VectorstoreIndexCreator(embedding=embeddings_2).from_loaders(loaders)



In [31]:
# Initialize the LLM
llm_for_question_2 = OpenAI(temperature=0.9)  # Adjust the temperature as needed

# Query the second index (index_2, built above) with the LLM
response = index_2.query('What was the main topic of the address?', llm=llm_for_question_2)

# Print the response
print(response)


index_2.query_with_sources('who is Mehdi and what is his email? ', llm=llm_for_question_2)



The main topic of the address was the role of deep learning and the transformer model BERT in revolutionizing natural language processing and its impact on various industries. 


{'question': 'who is Mehdi and what is his email? ',
 'answer': ' Mehdi is an instructor at the school.\n',
 'sources': '/content/gdrive/My Drive/GenAI_sample_data/sample_pdf_data/sample_DS_07Nov2024.pdf'}
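
The two LLMs in this notebook differ only in `temperature` (0.1 versus 0.9). As a rough intuition for what that parameter does, the sketch below applies the standard softmax-with-temperature formula to made-up token scores. It illustrates the general technique and makes no claim about how OpenAI samples internally. The function name and the logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, 0.1)  # nearly all mass on the top token
flat = softmax_with_temperature(toy_logits, 0.9)   # mass spread across the tokens
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

A low temperature concentrates probability on the most likely token, so answers are more repeatable. A high temperature spreads it out, so wording varies more from run to run.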