By [Yulandy Chiu](https://www.youtube.com/@YulandySpace)

Aided with Gemini/Claude/ChatGPT and modified by Yulandy Chiu

Version: 2025/02

Videos:
* [Python實作個人知識庫 knowledge base：LangChain  +Vector Database 完整教學 | Step-by-Step 做中學！](https://youtu.be/1qB-opvJxnU)

YouTube: [Yulandy Chiu的AI觀測站](https://www.youtube.com/@YulandySpace)

Facebook: [Yulandy Chiu的AI資訊站](https://www.facebook.com/yulandychiu)

 This code is licensed under the Creative Commons Attribution-NonCommercial 4.0
 International License (CC BY-NC 4.0). You are free to use, modify, and share this code for non-commercial purposes, provided you give appropriate credit. For more details, see the LICENSE file or visit: https://creativecommons.org/licenses/by-nc/4.0/
 © [2025] Yulandy Chiu


In [1]:
# Step 1: Install required packages
!pip install google-generativeai langchain-google-genai faiss-cpu sentence-transformers pypdf
!pip install langchain-community
!pip install unstructured python-docx
!pip install python-magic
!pip install libmagic
!pip install -U langchain-huggingface
from IPython.display import clear_output
clear_output()
print("All packages installed!")

All packages installed!


In [2]:
# Step 2: Import libraries and define the DocumentProcessor class
import os
from typing import List, Dict
import glob
import google.generativeai as genai
from langchain_google_genai import GoogleGenerativeAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredWordDocumentLoader,
    TextLoader,
    CSVLoader
)
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from google.colab import userdata

class DocumentProcessor:
    def __init__(self):
        """
        Initialize Document Processor
        """
        try:
            # Get API key from Colab Secrets
            api_key = userdata.get('Gemini_API_Key')
            if not api_key:
                raise ValueError("Cannot get Gemini_API_Key from Colab Secrets")

            # Configure Google API
            genai.configure(api_key=api_key)

            # Initialize Gemini model
            self.llm = GoogleGenerativeAI(
                model="gemini-1.5-flash",
                google_api_key=api_key,
                temperature=0.3
            )

            # Initialize embeddings model
            self.embeddings = HuggingFaceEmbeddings(
                model_name="sentence-transformers/all-MiniLM-L6-v2",
                model_kwargs={'device': 'cpu'}
            )

            self.vector_store = None
            self.chain = None
            self.processed_files = []

            print("System initialized successfully!")

        except Exception as e:
            print(f"Initialization failed: {str(e)}")
            raise

    def scan_directory(self, directory_path: str) -> Dict[str, List[str]]:
        """
        Scan directory for supported file types
        Args:
            directory_path: Directory path containing documents
        Returns:
            Dict[str, List[str]]: Dictionary of file paths grouped by type
        """
        try:
            if not os.path.exists(directory_path):
                raise ValueError(f"Directory does not exist: {directory_path}")

            file_types = {
                'pdf': '*.pdf',
                'docx': '*.docx',
                'doc': '*.doc',
                'txt': '*.txt',
                'csv': '*.csv'
            }

            files_by_type = {type_: [] for type_ in file_types}

            for root, _, _ in os.walk(directory_path):
                for file_type, pattern in file_types.items():
                    file_pattern = os.path.join(root, pattern)
                    found_files = glob.glob(file_pattern)
                    files_by_type[file_type].extend(found_files)

            total_files = sum(len(files) for files in files_by_type.values())
            if total_files == 0:
                print(f"No supported files found in {directory_path}")
                return {}

            print(f"Found {total_files} files:")
            for file_type, files in files_by_type.items():
                if files:
                    print(f"\n{file_type.upper()} files ({len(files)}):")
                    for file in files:
                        print(f"- {os.path.basename(file)}")

            return files_by_type

        except Exception as e:
            print(f"Error scanning directory: {str(e)}")
            return {}

    def load_document(self, file_path: str) -> List:
        """
        Load document based on file type
        Args:
            file_path: Path to the document
        Returns:
            List: List of document objects
        """
        file_extension = os.path.splitext(file_path)[1].lower()

        try:
            if file_extension in ['.doc', '.docx']:
                loader = UnstructuredWordDocumentLoader(file_path)
            elif file_extension == '.pdf':
                loader = PyPDFLoader(file_path)
            elif file_extension == '.txt':
                loader = TextLoader(file_path)
            elif file_extension == '.csv':
                loader = CSVLoader(file_path)
            else:
                raise ValueError(f"Unsupported file type: {file_extension}")

            documents = loader.load()

            # Add source metadata
            for doc in documents:
                doc.metadata["source"] = os.path.basename(file_path)
                if "page" not in doc.metadata:
                    doc.metadata["page"] = 1

            return documents

        except Exception as e:
            print(f"Error loading {os.path.basename(file_path)}: {str(e)}")
            return []

    def process_documents(self, files_by_type: Dict[str, List[str]]) -> bool:
        """
        Process multiple documents and create unified vector database
        Args:
            files_by_type: Dictionary of file paths grouped by type
        Returns:
            bool: Whether processing was successful
        """
        try:
            all_texts = []
            self.processed_files = []

            for file_type, files in files_by_type.items():
                for file_path in files:
                    try:
                        print(f"\nProcessing {file_type.upper()}: {os.path.basename(file_path)}")

                        documents = self.load_document(file_path)
                        if not documents:
                            continue

                        text_splitter = RecursiveCharacterTextSplitter(
                            chunk_size=1000,
                            chunk_overlap=200,
                            length_function=len
                        )
                        texts = text_splitter.split_documents(documents)
                        all_texts.extend(texts)
                        self.processed_files.append(os.path.basename(file_path))
                        print(f"Successfully processed {len(texts)} text segments")

                    except Exception as e:
                        print(f"Error processing {os.path.basename(file_path)}: {str(e)}")
                        continue

            if not all_texts:
                print("No documents were successfully processed")
                return False

            print(f"\nProcessed {len(self.processed_files)} files, {len(all_texts)} text segments")

            self.vector_store = FAISS.from_documents(
                documents=all_texts,
                embedding=self.embeddings
            )
            # Build the chain so that answers are grounded in the top 3 retrieved chunks
            self.chain = ConversationalRetrievalChain.from_llm(
                llm=self.llm,
                retriever=self.vector_store.as_retriever(
                    search_kwargs={"k": 3}
                ),
                return_source_documents=True
            )

            print("\nVector database created successfully!")
            print("Processed files:")
            for file in self.processed_files:
                print(f"- {file}")

            return True

        except Exception as e:
            print(f"Error processing documents: {str(e)}")
            return False

    def ask_question(self, question: str) -> Dict:
        """
        Ask questions about the documents
        Args:
            question: Question content
        Returns:
            Dict: Dictionary containing answer and source documents
        """
        try:
            if not self.chain:
                raise ValueError("Please process documents first!")

            print("Thinking about the question...")
            # Use invoke(); calling the chain object directly is deprecated in LangChain
            response = self.chain.invoke({"question": question, "chat_history": []})

            return {
                "answer": response["answer"],
                "sources": [
                    {
                        "file": doc.metadata["source"],
                        "page": doc.metadata["page"],
                        "content": doc.page_content[:200] + "..."
                    }
                    for doc in response["source_documents"]
                ]
            }

        except Exception as e:
            print(f"Error answering question: {str(e)}")
            return {"error": str(e)}
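
The `chunk_size=1000` and `chunk_overlap=200` settings in `process_documents` can be pictured as a sliding character window. The sketch below is a simplified illustration only: `split_with_overlap` is a hypothetical helper, and it does not reproduce `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences before falling back to raw character counts.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration; not the LangChain algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# A 2,500-character sample yields three windows: 1000, 1000, and 900 characters.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
# The last 200 characters of one chunk reappear at the start of the next.
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.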


In [3]:
# Step 3: Create a folder for uploading and storing files manually

document_directory = "/content/source"
# exist_ok=True makes this safe to re-run if the folder already exists
os.makedirs(document_directory, exist_ok=True)

# Example: upload Food.docx, bio.pdf, and friends.docx to the source folder
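
One thing to watch when uploading files: the glob patterns in `scan_directory` (`*.pdf`, `*.docx`, ...) are case-sensitive on Linux, which is what Colab runs on, so a file named `bio.PDF` would be skipped. A possible alternative, sketched here with a hypothetical function name `scan_directory_case_insensitive`, compares lowercased suffixes instead. Unlike the original method, it always returns the full dictionary, even when no files are found.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Same extensions the notebook's scan_directory looks for.
SUPPORTED_EXTENSIONS = ("pdf", "docx", "doc", "txt", "csv")


def scan_directory_case_insensitive(directory_path: str) -> Dict[str, List[str]]:
    """Group supported files by extension, ignoring the case of the suffix."""
    files_by_type: Dict[str, List[str]] = {ext: [] for ext in SUPPORTED_EXTENSIONS}
    # rglob("*") walks the directory tree recursively, like os.walk in the original.
    for path in sorted(Path(directory_path).rglob("*")):
        ext = path.suffix.lower().lstrip(".")
        if path.is_file() and ext in files_by_type:
            files_by_type[ext].append(str(path))
    return files_by_type


# Demo in a temporary folder so the sketch runs without touching /content/source.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("bio.PDF", "Food.docx", "notes.md"):
        (Path(tmp) / name).write_text("demo")
    demo_result = scan_directory_case_insensitive(tmp)
    demo_names = {ext: [Path(p).name for p in paths] for ext, paths in demo_result.items()}

print(demo_names)
```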

In [4]:
# Step 4: Scan the directory for files, process documents, and create a vector database
processor = DocumentProcessor()

files_by_type = processor.scan_directory(document_directory)
if not files_by_type or not processor.process_documents(files_by_type):
    raise SystemExit("File processing failed.")

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


System initialized successfully!
Found 3 files:

PDF files (1):
- bio.pdf

DOCX files (2):
- Food.docx
- friends.docx

Processing PDF: bio.pdf
Successfully processed 2 text segments

Processing DOCX: Food.docx
Successfully processed 1 text segments

Processing DOCX: friends.docx
Successfully processed 2 text segments

Processed 3 files, 5 text segments

Vector database created successfully!
Processed files:
- bio.pdf
- Food.docx
- friends.docx
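
With the five text segments embedded, the retriever configured with `search_kwargs={"k": 3}` returns the three stored vectors closest to the question's embedding. The sketch below illustrates that idea in plain Python on made-up 2-D vectors. It assumes Euclidean (L2) distance, which to my knowledge is the default for LangChain's FAISS wrapper; `top_k_l2` and `toy_vectors` are hypothetical names, and real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math
from typing import List, Tuple


def top_k_l2(query: List[float], vectors: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors nearest to the query."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy stand-ins for the five chunk embeddings (made-up values).
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
hits = top_k_l2([0.0, 0.0], toy_vectors, k=3)
for index, distance in hits:
    print(f"chunk {index}: distance {distance:.3f}")
```

A brute-force scan like this is roughly what a flat index does; FAISS performs the same comparison far more efficiently, which matters once the knowledge base grows beyond a handful of chunks.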


In [5]:
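# Step 5: Query the vector database with a question and retrieve the answer with references
# The question means "Who are Yulandy's friends?" (the sample documents are in Chinese)
response = processor.ask_question("Yulandy有哪些朋友?")
if "error" in response:
    # ask_question returns {"error": ...} on failure, so "answer" may be missing
    print("\nError:", response["error"])
else:
    print("\nAnswer:", response["answer"])
    print("\nReference Sources:")
    for source in response["sources"]:
        print(f"- File: {source['file']}, Page {source['page']}")
        print(f"  Content: {source['content']}")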


Thinking about the question...



Answer: Yulandy的朋友有艾米莉亞·布朗（Amelia Brown）、湯瑪士·貝克（Thomas Baker）和亨利·卡特（Henry Carter）。


Reference Sources:
- File: bio.pdf, Page 1
  Content: Yulandy 的名字最終被記錄在歷史書中，不僅作為一位技術革新者，也作為一
位社會改革家。他用自己的發明與智慧，將工業革命的動力轉化為造福全人類
的力量，證明了在技術的洪流中，人性的光輝永遠不可被取代。 
這個故事激勵了無數後人，告訴我們：無論身處何種時代，勇於追求夢想與突
破極限，便能創造奇蹟。...
- File: friends.docx, Page 1
  Content: 以下是Yulandy幾位朋友的詳細介紹，包括姓名、個性、工作等細節：

1. 艾米莉亞·布朗（Amelia Brown）

個性： 聰明、獨立、富有同情心，有時略帶一點固執。她是一位堅定的女性主義者，對社會的不公義現象充滿批判精神，同時也對科學和教育充滿熱情。

工作： 一位小學教師，致力於提高基層兒童的教育水平。她相信教育是改變社會的關鍵，並經常在課餘時間組織免費的讀書會，為貧困家庭的孩子提供學...
- File: Food.docx, Page 1
  Content: Yulandy的味蕾：工業時代的簡樸與能量

身處於19世紀工業革命時期的英國，Yulandy的飲食習慣很可能反映了當時的社會狀況與他個人的生活方式。作為一位在工廠工作的學徒，以及一位不斷投入研究的科學家，他的飲食可能以簡單、經濟且能提供能量的食物為主。

質樸的滋味，填飽肚子的能量來源：

粗糧麵包： 作為當時的主食，粗糧麵包是Yulandy日常生活中不可或缺的一部分。這種麵包口感扎實，雖然不像...
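
`ask_question` always sends an empty `chat_history`, so every question is answered in isolation. For follow-up questions, one option is to keep the history as a list of `(question, answer)` tuples and pass it on each call, a format that `ConversationalRetrievalChain` accepts as far as I know. The sketch below is an assumption-laden illustration: `ChatSession` and `fake_chain` are hypothetical, and the stub stands in for the real chain so the example runs without an API key.

```python
from typing import Callable, Dict, List, Tuple


class ChatSession:
    """Keeps (question, answer) pairs and forwards them with each new question."""

    def __init__(self, chain: Callable[[Dict], Dict]):
        self.chain = chain
        self.chat_history: List[Tuple[str, str]] = []

    def ask(self, question: str) -> str:
        # Pass a copy so the chain cannot mutate the stored history.
        response = self.chain({"question": question, "chat_history": list(self.chat_history)})
        answer = response["answer"]
        self.chat_history.append((question, answer))
        return answer


def fake_chain(inputs: Dict) -> Dict:
    """Stub for illustration: reports which turn it is instead of calling an LLM."""
    turn = len(inputs["chat_history"]) + 1
    return {"answer": f"turn {turn}: {inputs['question']}"}


session = ChatSession(fake_chain)
first = session.ask("Who are Yulandy's friends?")
second = session.ask("What does the first one do?")
print(first)
print(second)
```

In the notebook itself, passing `processor.chain.invoke` in place of `fake_chain` should work the same way, assuming the documents have already been processed.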
