# Excel File Loading in LangChain

- Author: [Hwayoung Cha](https://github.com/forwardyoung)
- Design: []()
- Peer Review: [teddylee777](https://github.com/teddylee777), [jhboyo](https://github.com/jhboyo)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


## Overview

This tutorial covers the process of loading and handling Excel files in `LangChain`.

It focuses on two primary methods: `UnstructuredExcelLoader` for raw text extraction and `DataFrameLoader` for structured data processing.

The guide aims to help developers effectively integrate Excel data into their `LangChain` projects, covering both basic and advanced usage scenarios.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [UnstructuredExcelLoader](#UnstructuredExcelLoader)
- [DataFrameLoader](#DataFrameLoader)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial



In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "unstructured",
        "openpyxl"
    ],
    verbose=False,
    upgrade=False,
)

## UnstructuredExcelLoader

`UnstructuredExcelLoader` is used to load `Microsoft Excel` files.

This loader works with both `xlsx` and `xls` files.

When the loader is used in `"elements"` mode, an HTML representation of the Excel file is provided under the `text_as_html` key in the document metadata.

In [3]:
# install
# !pip install -qU langchain-community unstructured openpyxl

In [4]:
import sys
from langchain_community.document_loaders import UnstructuredExcelLoader

# Set recursion limit
sys.setrecursionlimit(10**6)    

# Create UnstructuredExcelLoader 
loader = UnstructuredExcelLoader("./data/titanic.xlsx", mode="elements")

# Load a document
docs = loader.load()

# Print the number of documents
print(len(docs))

1


This confirms that one document has been loaded.

`page_content` holds the cell values of every row as plain whitespace-separated text, while the `text_as_html` key in `metadata` stores the same data as an HTML table.

In [5]:
# Print the first 200 characters of the page content
print(docs[0].page_content[:200])

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) fem


In [6]:
# Print the first 1,000 characters of text_as_html from the metadata
print(docs[0].metadata["text_as_html"][:1000])

<table><tr><td>PassengerId</td><td>Survived</td><td>Pclass</td><td>Name</td><td>Sex</td><td>Age</td><td>SibSp</td><td>Parch</td><td>Ticket</td><td>Fare</td><td>Cabin</td><td>Embarked</td></tr><tr><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.25</td><td/><td>S</td></tr><tr><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td><td>female</td><td>38</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr><tr><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.925</td><td/><td>S</td></tr><tr><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35</td><td>1</td><td>0</td><td>113803</td><td>53.1</td><td>C123</td><td>S</td></tr><tr><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35</

![text_as_html](./assets/05-Excel-Loader-text-as-html.png)
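
Because `text_as_html` is a plain HTML table, it can be turned back into row records when the column structure matters. The following is a minimal sketch that uses only Python's standard library. `TableRowParser` and `html_table_to_records` are hypothetical helper names, and `sample_html` is a shortened stand-in for `docs[0].metadata["text_as_html"]`. It assumes the first `<tr>` holds the column headers, as in the output above.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Hypothetical helper: collect the cell text of every <tr> in a table."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        # Self-closing <td/> cells also reach this method and yield "".
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_records(html):
    """Return one dict per data row, keyed by the header row (assumed first)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Shortened stand-in for docs[0].metadata["text_as_html"] (assumption).
sample_html = (
    "<table><tr><td>PassengerId</td><td>Name</td><td>Cabin</td></tr>"
    "<tr><td>1</td><td>Braund, Mr. Owen Harris</td><td/></tr>"
    "<tr><td>2</td><td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td>"
    "<td>C85</td></tr></table>"
)

records = html_table_to_records(sample_html)
print(records[0])
```

Every value comes back as a string, so numeric columns would still need converting before any calculation.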

## DataFrameLoader

- As with CSV files, you can read an Excel file into a pandas DataFrame with `pd.read_excel()` and then pass that DataFrame to `DataFrameLoader`.
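
As a rough mental model of this loading step, each row becomes one document: the column chosen as page content supplies the text, and the remaining columns become metadata. The plain-Python sketch below illustrates that split on a single row. `split_row` is a hypothetical helper written for illustration, not part of LangChain, and the row is a shortened excerpt of the Titanic data.

```python
def split_row(row, page_content_column):
    """Hypothetical helper: split one row (a dict) into (page_content, metadata)."""
    metadata = dict(row)  # copy so the caller's row is left untouched
    page_content = metadata.pop(page_content_column)
    return page_content, metadata


# Shortened excerpt of the first Titanic row (illustrative data).
row = {
    "PassengerId": 1,
    "Survived": 0,
    "Pclass": 3,
    "Name": "Braund, Mr. Owen Harris",
    "Sex": "male",
}

text, meta = split_row(row, "Name")
print(text)
print(meta)
```

Choosing a different column, such as `"Ticket"`, would move `Name` into the metadata instead.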

In [7]:
import pandas as pd

# read the Excel file
df = pd.read_excel("./data/titanic.xlsx")

In [8]:
from langchain_community.document_loaders import DataFrameLoader

# Set up DataFrame loader, specifying the page content column
loader = DataFrameLoader(df, page_content_column="Name")

# Load the documents (one per DataFrame row)
docs = loader.load()

# Print the data
print(docs[0].page_content)

# Print the metadata
print(docs[0].metadata)

Braund, Mr. Owen Harris
{'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': nan, 'Embarked': 'S'}
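
The `Cabin` value above is `nan` because that cell is empty in the spreadsheet. NaN is not valid in strict JSON, so it can be worth dropping such entries before the metadata is serialized or stored. A minimal sketch using only the standard library follows. `drop_nan_values` is a hypothetical helper, and the dictionary is copied from the output above with `nan` written as `float("nan")`.

```python
import math


def drop_nan_values(metadata):
    """Hypothetical helper: return a copy of metadata without float NaN entries."""
    return {
        key: value
        for key, value in metadata.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


# Metadata of the first document, as printed above.
metadata = {
    "PassengerId": 1,
    "Survived": 0,
    "Pclass": 3,
    "Sex": "male",
    "Age": 22.0,
    "SibSp": 1,
    "Parch": 0,
    "Ticket": "A/5 21171",
    "Fare": 7.25,
    "Cabin": float("nan"),
    "Embarked": "S",
}

cleaned = drop_nan_values(metadata)
print(cleaned)
```

The same function could be applied to the `metadata` of each loaded document. Replacing NaN with a placeholder such as an empty string is an alternative when every document must keep the same set of keys.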
