# Microsoft Word (doc, docx) with LangChain

- Author: [Suhyun Lee](https://github.com/suhyun0115)
- Design: 
- Peer Review: [Sunyoung Park (architectyou)](https://github.com/Architectyou), [Teddy Lee](https://github.com/teddylee777)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/06-WordLoader.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/06-WordLoader.ipynb)

## Overview

This tutorial covers two methods for loading `Microsoft Word` documents into LangChain `Document` objects that can be used in a RAG pipeline.

We demonstrate `Docx2txtLoader` and `UnstructuredWordDocumentLoader` and show how each one processes and loads `.docx` files.

We also provide a comparison to help you choose the appropriate loader for your requirements.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Comparison of docx Loading Methods](#comparison-of-docx-loading-methods)
- [Docx2txtLoader](#docx2txtloader)
- [UnstructuredWordDocumentLoader](#unstructuredworddocumentloader)

### References

- [UnstructuredWordDocumentLoader Documentation](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader.html#langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader/)
- [Docx2txtLoader Documentation](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.word_document.Docx2txtLoader.html/)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [None]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [None]:
# Install required packages
from langchain_opentutorial import package

package.install(
    ["langchain", "langchain_community", "docx2txt", "unstructured", "python-docx"],
    verbose=False,
    upgrade=False,
)

## Comparison of docx Loading Methods

| **Feature**           | **Docx2txtLoader**      | **UnstructuredWordDocumentLoader**    |
|-----------------------|-------------------------|---------------------------------------|
| **Base Library**      | docx2txt               | Unstructured                         |
| **Speed**             | Fast                   | Relatively slow                      |
| **Memory Usage**      | Efficient              | Relatively high                      |
| **Installation Dependencies** | Lightweight (only requires docx2txt) | Heavy (requires multiple dependency packages) |
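
The speed row above is relative, and actual numbers depend on the document and the environment. Below is a minimal timing sketch for checking it on your own files. It assumes any zero-argument callable, such as a loader's `load` method. `time_loader` is a hypothetical helper written for this example, not part of LangChain.

```python
import time
from statistics import mean


def time_loader(load_fn, repeats=3):
    """Call load_fn `repeats` times and return (mean seconds, last result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = load_fn()
        durations.append(time.perf_counter() - start)
    return mean(durations), result


# Hypothetical usage with a loader from this tutorial (needs the packages installed above):
# from langchain_community.document_loaders import Docx2txtLoader
# seconds, docs = time_loader(Docx2txtLoader("data/sample-word-document_eng.docx").load)

# Stand-in callable so the sketch runs without any document on disk.
seconds, docs = time_loader(lambda: ["dummy document"], repeats=2)
print(f"{seconds:.6f}s on average for {len(docs)} document(s)")
```

Passing the bound `load` method of each loader to the same helper gives a like-for-like comparison on a single file.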

## Docx2txtLoader

**Used Library**: `docx2txt`, a lightweight Python module for text extraction.

**Key Features**:
- Extracts text from `.docx` files quickly and simply.
- Suitable for straightforward tasks where plain text is all you need.

**Use Case**:
- When you need to quickly retrieve text data from `.docx` files.
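
For intuition about why plain-text extraction can be lightweight: a `.docx` file is a ZIP archive whose main text lives in `word/document.xml`. The sketch below is not how `docx2txt` is implemented. It only reads paragraph text with the standard library. `extract_paragraphs` and `make_minimal_docx` are hypothetical helper names, and the in-memory archive is a stripped-down stand-in that holds only the one part the sketch reads, so Word itself would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_bytes):
    """Return the text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_minimal_docx(lines):
    """Build an in-memory archive with only word/document.xml (illustration only)."""
    body = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


sample = make_minimal_docx(["Semantic Search", "Definition: meaning-aware search."])
paragraphs = extract_paragraphs(sample)
print(paragraphs)
```

A real loader also has to deal with headers, footers, tables, and embedded objects, which this sketch ignores.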

In [1]:
from langchain_community.document_loaders import Docx2txtLoader

# Initialize the document loader
loader = Docx2txtLoader("data/sample-word-document_eng.docx")

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the document
print(f"Document Metadata: {docs[0].metadata}\n")

# Note: The entire docx file is converted into a single document.
# It needs to be split into smaller parts using a text splitter.
print("Document Content")
print(docs[0])

Document Count: 1

Type of loader: <class 'langchain_community.document_loaders.word_document.Docx2txtLoader'>

Document Metadata: {'source': 'data/sample-word-document_eng.docx'}

Document Content
page_content='Semantic Search



Definition: Semantic search is a search methodology that goes beyond simple keyword matching to understand the meaning of a user's query and return relevant results.

Example: When a user searches for "planets in the solar system," the search returns information about planets like "Jupiter" and "Mars."

Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining



Embedding



Definition: Embedding refers to the process of transforming textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to understand and process text effectively.

Example: The word "apple" might be represented as a vector like [0.65, -0.23, 0.17].

Related Keywords: Natural Language Processing (NLP), Vectorization, 
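
As the comment in the cell above notes, the single `Document` usually has to be split before it is embedded. In practice you would typically use one of LangChain's text splitters for this. The sketch below is a simplified, standard-library stand-in that only illustrates fixed-size chunking with overlap. `split_text` and its parameters are hypothetical names for this example, and `page_content` is a placeholder string rather than the real document.

```python
def split_text(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that context
    is not lost at a chunk boundary.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Placeholder standing in for docs[0].page_content from the cell above.
page_content = "Semantic search goes beyond keyword matching. " * 10
chunks = split_text(page_content, chunk_size=100, overlap=20)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Character-based splitting ignores sentence and paragraph boundaries, which is one reason a structure-aware splitter is usually preferable for real documents.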

## UnstructuredWordDocumentLoader

**Used Library**: `unstructured`, a comprehensive document analysis library.

**Key Features**:
- Capable of understanding the structure of a document, such as titles and body, and separating them into distinct elements.
- Allows hierarchical representation and detailed processing of documents.
- Extracts meaningful information from unstructured data and transforms it into structured formats.

**Use Case**:
- When you need to extract text while preserving the document's structure, formatting, and metadata.
- Suitable for handling complex document structures or converting unstructured data into structured formats.

| **Parameter**           | **Option**              | **Description**                                                                               |
|-------------------------|-------------------------|---------------------------------------------------------------------------------------------|
| `mode`                  | `single` (default)      | Returns the entire document as a single `Document` object.                                  |
|                         | `elements`              | Splits the document into elements (e.g., title, body) and returns each as a `Document` object. |
| `strategy`              | `None` (default)        | No specific strategy is applied.                                                           |
|                         | `fast`                  | Prioritizes speed (may reduce accuracy).                                                    |
|                         | `hi_res`                | Prioritizes high accuracy (slower processing).                                              |
| `include_page_breaks`   | `True` (default)        | Detects page breaks and adds `PageBreak` elements.                                          |
|                         | `False`                 | Ignores page breaks.                                                                        |
| `infer_table_structure` | `True` (default)        | Infers table structure and includes it in HTML format.                                      |
|                         | `False`                 | Does not infer table structure.                                                            |
| `starting_page_number`  | `1` (default)           | Specifies the starting page number of the document.                                         |

### mode: single (default)

In this mode, the entire document is returned as a single LangChain `Document` object that contains all of the document's content.

In [2]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader
loader = UnstructuredWordDocumentLoader("data/sample-word-document_eng.docx")

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the document
print(f"Document Metadata: {docs[0].metadata}\n")

# Note: The entire docx file is converted into a single document.
# It needs to be split into smaller parts using a text splitter.
print("Document Content")
print(docs[0])

Document Count: 1

Type of loader: <class 'langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader'>

Document Metadata: {'source': 'data/sample-word-document_eng.docx'}

Document Content
page_content='Semantic Search

Definition: Semantic search is a search methodology that goes beyond simple keyword matching to understand the meaning of a user's query and return relevant results.

Example: When a user searches for "planets in the solar system," the search returns information about planets like "Jupiter" and "Mars."

Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining

Embedding

Definition: Embedding refers to the process of transforming textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to understand and process text effectively.

Example: The word "apple" might be represented as a vector like [0.65, -0.23, 0.17].

Related Keywords: Natural Language Processing (NLP), Vecto

### mode: elements


The document is divided into individual elements, such as Title and NarrativeText. Each element is returned as a separate Document object, allowing for more detailed analysis or processing of the document's structure.

In [3]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader with "elements" mode
loader = UnstructuredWordDocumentLoader(
    "data/sample-word-document_eng.docx", mode="elements"
)

# Load the document
docs = loader.load()

# Print the number of documents
# In "elements" mode, each element is converted into a separate Document object
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the first document element
print(f"Document Metadata: {docs[0].metadata}\n")

# Print the content of the first document element
print("Document Content")
print(docs[0])

Document Count: 123

Type of loader: <class 'langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader'>

Document Metadata: {'source': 'data/sample-word-document_eng.docx', 'category_depth': 0, 'file_directory': 'data', 'filename': 'sample-word-document_eng.docx', 'last_modified': '2024-12-29T15:35:49', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': 'ae1fc619a1205c0a9c3f6876535ffc46'}

Document Content
page_content='Semantic Search' metadata={'source': 'data/sample-word-document_eng.docx', 'category_depth': 0, 'file_directory': 'data', 'filename': 'sample-word-document_eng.docx', 'last_modified': '2024-12-29T15:35:49', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': 'ae1fc619a1205c0a9c3f6876535ffc46'}


In [4]:
# Print the content of the second document
print(docs[1].page_content)

Definition: Semantic search is a search methodology that goes beyond simple keyword matching to understand the meaning of a user's query and return relevant results.
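
Because each element carries a `category` in its metadata, the flat list returned by `elements` mode can be regrouped into sections. The sketch below is one possible approach, assuming that every `Title` element starts a new section. `group_by_title` is a hypothetical helper, and `SimpleNamespace` objects stand in for LangChain `Document` objects so the example runs without loading a file.

```python
from types import SimpleNamespace


def group_by_title(docs):
    """Group element documents into sections, starting a new one at each Title."""
    sections = []
    for doc in docs:
        if doc.metadata.get("category") == "Title":
            sections.append({"title": doc.page_content, "body": []})
        elif sections:
            sections[-1]["body"].append(doc.page_content)
        else:
            # Text that appears before the first title gets an untitled section.
            sections.append({"title": None, "body": [doc.page_content]})
    return sections


# Stand-ins mimicking the element documents shown above (only the fields used here).
sample_docs = [
    SimpleNamespace(page_content="Semantic Search", metadata={"category": "Title"}),
    SimpleNamespace(page_content="Definition: ...", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Example: ...", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Embedding", metadata={"category": "Title"}),
    SimpleNamespace(page_content="Definition: ...", metadata={"category": "NarrativeText"}),
]

sections = group_by_title(sample_docs)
for section in sections:
    print(section["title"], len(section["body"]))
```

With the real loader output, you could pass `docs` from the cell above in place of `sample_docs`, since the helper only reads `page_content` and `metadata`.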


### Efficient Document Loader Configuration with Various Parameter Combinations

By combining various parameters, you can configure a document loader that fits your specific needs efficiently. Adjusting settings such as `mode`, `strategy`, and `include_page_breaks` allows for tailored handling of different document structures and processing requirements.


In [5]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader with specific parameters
loader = UnstructuredWordDocumentLoader(
    "data/sample-word-document_eng.docx",
    strategy="fast",  # Prioritize fast processing
    include_page_breaks=True,  # Include page breaks as PageBreak elements
    infer_table_structure=True,  # Infer table structures and include in HTML format
    starting_page_number=1,  # Start page numbering from 1
)

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the first document
print(f"Document Metadata: {docs[0].metadata}\n")

# Print the content of the first document
print("Document Content")
print(docs[0])

Document Count: 1

Type of loader: <class 'langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader'>

Document Metadata: {'source': 'data/sample-word-document_eng.docx'}

Document Content
page_content='Semantic Search

Definition: Semantic search is a search methodology that goes beyond simple keyword matching to understand the meaning of a user's query and return relevant results.

Example: When a user searches for "planets in the solar system," the search returns information about planets like "Jupiter" and "Mars."

Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining

Embedding

Definition: Embedding refers to the process of transforming textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to understand and process text effectively.

Example: The word "apple" might be represented as a vector like [0.65, -0.23, 0.17].

Related Keywords: Natural Language Processing (NLP), Vecto