# [Analyzing structured data](https://python.langchain.com/docs/use_cases/tabular.html)

A toolkit for working with tabular data.

## Document loading

Tabular data stored in documents is usually handled by loading it with [CSVLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/csv.html), building an index over it, and then querying that index.

### [Data connection](https://python.langchain.com/docs/modules/data_connection)

Modules that cover different data-processing needs:

- [Document loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/): load documents from different data sources
- [Document transformers](https://python.langchain.com/docs/modules/data_connection/document_transformers/): split documents, drop redundant parts, and so on
- [Text embedding models](https://python.langchain.com/docs/modules/data_connection/text_embedding/): turn unstructured text into numeric vectors
- [Vector stores](https://python.langchain.com/docs/modules/data_connection/vectorstores/): store and search data as vectors
- [Retrievers](https://python.langchain.com/docs/modules/data_connection/retrievers/): query the data as needed

#### Document loaders

A loader brings data from a given source in as "documents"; calling its `load` method reads the data into memory.

In [5]:
from langchain.document_loaders import TextLoader

loader = TextLoader("./README.md")
loader.load()

[Document(page_content='# LangChain 官方文档\n\n来自[LangChain](https://python.langchain.com/en/latest/index.html)的官方文档。\n', metadata={'source': './README.md'})]

In [1]:
# Read a CSV file
from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
print(data)

[Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 0}), Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 1}), Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 2}), Document(page_content='Team: Giants\n"Payroll (millions)": 117.62\n"Wins": 94', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 3}), Document(page_content='Team: Braves\n"Payroll (millions)": 83.31\n"Wins": 94', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 4}), Document(page_content='Team: Athletics\n"Payroll (millions)": 55.37\n"Wins": 94', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 5}), Document(page_content='Team: Rangers\n"Payroll (millions)": 120.51\n"Wins": 93', metadata={'source': './
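To make the output above easier to read, here is a minimal sketch of the row-to-document mapping it suggests: one document per CSV row, one `column: value` line per cell, and the source and row number kept as metadata. This is not the `CSVLoader` implementation. It uses only the standard library, the `rows_to_documents` helper is hypothetical, plain dicts stand in for `Document` objects, and the header row is simplified relative to the real file.

```python
import csv
import io

# Three rows copied from the output above; the in-memory text stands in for
# ./example_data/mlb_teams_2012.csv (header simplified for this sketch).
CSV_TEXT = """Team,Payroll (millions),Wins
Nationals,81.34,98
Reds,82.20,97
Yankees,197.96,95
"""


def rows_to_documents(csv_file, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(csv_file)):
        # One "column: value" line per cell, joined with newlines.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": page_content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


docs = rows_to_documents(io.StringIO(CSV_TEXT), source="mlb_teams_2012.csv")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Keeping the row number in the metadata makes it possible to trace a retrieved chunk back to the exact line of the original table.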

In [10]:
# Read every matching file in a directory
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader('../', glob="**/*.md", show_progress=True)
docs = loader.load()
len(docs)

100%|██████████| 2/2 [00:00<?, ?it/s]


2

In [27]:
# Read files from a Git repository
from langchain.document_loaders import GitLoader

loader = GitLoader(
    clone_url="https://github.com/hwchase17/langchain",
    repo_path="./example_data/test_repo2/",
    branch="master",
)

data = loader.load()

data[0]

Document(page_content='.venv\n.github\n.git\n.mypy_cache\n.pytest_cache\nDockerfile', metadata={'file_path': '.dockerignore', 'file_name': '.dockerignore', 'file_type': ''})

In [23]:
# Read a Hugging Face dataset
from langchain.document_loaders import HuggingFaceDatasetLoader

dataset_name = "imdb"
page_content_column = "text"


loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

data = loader.load()
data[:15]

[Document(page_content='I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are 

In [41]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.espn.com/")
# verify=False turns off TLS certificate verification; use it only for local experiments
loader.requests_kwargs = {'verify':False}

data = loader.load()
print(data[0].page_content)

ESPN - Serving Sports Fans. Anytime. Anywhere.
Skip to main content
Skip to navigation
<
>
MenuESPN
Search
scores
NFLNBANHLMLBSoccerNCAA…TennisNCAAFNCAAMNCAAWSports BettingBoxingCFLCricketF1GolfHorseMMANASCARNBA G LeagueOlympic SportsPLLRacingRN BBRN FBRugbyWNBAWWEX GamesXFLMore ESPNFantasyListenWatchESPN+
SUBSCRIBE NOW
NCAA Men's College World Series Finals
PGA TOUR LIVE
MLB: Select Games
Cricket: Select Matches
Quick Links
Men's College World Series
2023 NHL draft
Women's World Cup Rosters
How To Watch PGA TOUR
Favorites
Manage Favorites
Customize ESPNSign UpLog InESPN Sites
ESPN Deportes
Andscape
espnW
ESPNFC
X Games
SEC Network
ESPN Apps
ESPN
ESPN Fantasy

#### Document transformers

These cover splitting, combining, filtering, and other operations on documents.

##### Text splitters

There are many ways to split a document. The default is [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter), which splits on the separators `["\n\n", "\n", " ", ""]`, trying them in order until the chunks are small enough.
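The fallback order can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: the `recursive_split` function is hypothetical, it discards the separators, and it leaves out the step that merges small pieces back up to the chunk size and adds overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Separators are tried in order; a piece that is still too long is split
    again with the next separator in the list.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(separators[0]):
        pieces.extend(recursive_split(part, chunk_size, separators[1:]))
    return pieces


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, chunk_size=7))
# ['aaa bbb', 'ccc', 'ddd', 'eee']
```

The first paragraph fits within 7 characters and is kept whole, while the second is too long and falls through to the space separator. That is the point of the ordering: coarse boundaries such as paragraphs are preferred, and finer ones are used only when needed.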

[CharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter) is the simplest approach: it splits on one specified separator and sizes the chunks according to the specified chunk size. The class also offers alternative constructors; `from_tiktoken_encoder`, for example, measures chunk length in tokens rather than characters. You can also split by tokens directly with `TokenTextSplitter`; see [Split by tokens](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token) for more.
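As a rough illustration of token-based splitting with overlap, the sketch below slides a fixed-size window over a token list. It makes one loud assumption: whitespace-separated words stand in for real tokenizer output (no `tiktoken` is involved), and the `split_by_tokens` helper is hypothetical rather than part of LangChain.

```python
def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Return chunks of at most chunk_size tokens, sharing chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # assumption: words stand in for tokenizer tokens
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


print(split_by_tokens("one two three four five six seven", chunk_size=3, chunk_overlap=1))
# ['one two three', 'three four five', 'five six seven']
```

Each chunk repeats the last token of the previous one, which is what `chunk_overlap` is for: context that straddles a boundary still appears intact in at least one chunk.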

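Code can also be split according to its programming language. Mind the version you have installed: the official documentation now uses [Language](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter), which may not be available unless your installation is up to date.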

In [47]:
# This is a long document we can split up.
with open('./example_data/state_of_the_union.txt') as f:
    state_of_the_union = f.read()

In [50]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
    # add_start_index = True
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={}
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={}


In [51]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(        
    separator = "\n\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' metadata={}


In [52]:
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' metadata={'document': 1}


In [53]:
text_splitter.split_text(state_of_the_union)[0]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'

In [58]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.


In [57]:
from langchain.text_splitter import PythonCodeTextSplitter

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = PythonCodeTextSplitter(
    chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='hello_world():\n    print("Hello, World!")', metadata={}),
 Document(page_content='# Call the function\nhello_world()', metadata={})]

#### Text embedding models

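Embedding text as vectors matters because, once text is vectorized, operations on it can be carried out in vector space; semantic search, for example, reduces to computing vector similarity.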

`langchain` exposes two embedding methods: `embed_documents` embeds a list of texts, and `embed_query` embeds a single query text.
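The sketch below shows how the two methods fit together in a semantic search: documents are embedded once, the query is embedded separately, and cosine similarity ranks the documents. `ToyEmbeddings` is a hypothetical bag-of-words stand-in that only mirrors the two method names; real embedding models return dense learned vectors.

```python
import math
from collections import Counter


class ToyEmbeddings:
    """Hypothetical embedder: one dimension per vocabulary word (word counts)."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def embed_query(self, text):
        counts = Counter(text.lower().split())
        return [float(counts[word]) for word in self.vocabulary]

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


texts = ["hello world", "my friends call me world", "what is your name"]
model = ToyEmbeddings(["hello", "world", "name", "friends"])
doc_vectors = model.embed_documents(texts)
query_vector = model.embed_query("hello")

scores = [cosine_similarity(query_vector, vector) for vector in doc_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(texts[best])
```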

##### OpenAI

In [61]:
from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()

In [62]:
embeddings = embedding_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings), len(embeddings[0])

(5, 1536)

In [63]:
embedded_query = embedding_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]

[0.00538721214979887,
 -0.0005941778072156012,
 0.03892524912953377,
 -0.002979141427204013,
 -0.008912666700780392]

##### Hugging Face Hub

In [2]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

text = "This is a test document."

query_result = embeddings.embed_query(text)

doc_result = embeddings.embed_documents([text])

print(query_result)
print(doc_result)

[-0.04895178973674774, -0.03986210376024246, -0.021562790498137474, 0.009908553212881088, -0.03810388222336769, 0.012684348970651627, 0.04349441081285477, 0.07183387875556946, 0.0097486088052392, -0.006987079046666622, 0.06352805346250534, -0.030322600156068802, 0.013839391991496086, 0.025805896148085594, -0.0011363003868609667, -0.014563580974936485, 0.04164035618305206, 0.03622828796505928, -0.026800865307450294, 0.025120755657553673, -0.024978606030344963, -0.004533299244940281, -0.026667160913348198, 0.004100671969354153, -0.05204806476831436, -0.009930499829351902, -0.052065346390008926, 0.00899211224168539, -0.03830048441886902, -0.04405849426984787, -0.004204418044537306, 0.07047972083091736, 0.005133940372616053, -0.07161541283130646, 1.6975303651634022e-06, -0.006047751288861036, -0.011076485738158226, 0.01751335896551609, -0.022299837321043015, 0.040954891592264175, 0.03379017487168312, 0.056650370359420776, -0.07114940136671066, 0.040976617485284805, -0.00590606639161706, -0

#### Vector stores

A vector store is a database that stores embedding vectors and runs similarity queries and computations over them.
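A minimal in-memory sketch of that idea follows, assuming brute-force cosine similarity over every stored vector. The `InMemoryVectorStore` class and the hand-written 2-D vectors are hypothetical; production stores such as FAISS or Redis use specialised index structures instead of a linear scan.

```python
import math


class InMemoryVectorStore:
    """Hypothetical store: keeps (text, vector) pairs and scans them on search."""

    def __init__(self):
        self._texts = []
        self._vectors = []

    def add(self, text, vector):
        self._texts.append(text)
        self._vectors.append(list(vector))

    def similarity_search(self, query_vector, k=1):
        scored = [
            (self._cosine(query_vector, vector), text)
            for vector, text in zip(self._vectors, self._texts)
        ]
        # Highest similarity first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


store = InMemoryVectorStore()
store.add("about cats", [1.0, 0.0])
store.add("about cars", [0.0, 1.0])
store.add("about kittens", [0.9, 0.1])

print(store.similarity_search([1.0, 0.0], k=2))
# ['about cats', 'about kittens']
```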

In [18]:
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS, Redis


raw_documents = TextLoader('./example_data/state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

db = FAISS.from_documents(documents, OpenAIEmbeddings())

In [5]:
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


In [20]:
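# `docs` holds the search results from the previous cell.
# Use the same embedding model for every write to the "link" index.
rds = Redis.from_documents(
    docs, OpenAIEmbeddings(), redis_url="redis://localhost:6379", index_name="link"
)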

texts = [d.page_content for d in docs]
metadatas = [d.metadata for d in docs]

rds, keys = Redis.from_texts_return_keys(texts,
                                    OpenAIEmbeddings(),
                                    redis_url="redis://localhost:6379",
                                    index_name="link")

In [21]:
rds.index_name

'link'

In [22]:
query = "What did the president say about Ketanji Brown Jackson"
results = rds.similarity_search(query)
print(results[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


In [23]:
print(rds.add_texts(["Ankush went to Princeton"]))

['doc:link:5459054cca3944ad9ca610c6829f62b9']


In [24]:
query = "Princeton"
results = rds.similarity_search(query)
print(results[0].page_content)

Ankush went to Princeton


In [26]:
# Load from existing index
rds = Redis.from_existing_index(
    OpenAIEmbeddings(), redis_url="redis://localhost:6379", index_name="link"
)

query = "What did the president say about Ketanji Brown Jackson"
results = rds.similarity_search(query)
print(results[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


#### Retrievers

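A retriever is a more general interface than a vector store: a vector store can be used for retrieval, but there are many other retrieval methods as well.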
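To illustrate that retrieval does not require vectors, here is a sketch of a retriever that ranks texts purely by how many words they share with the query. The `KeywordRetriever` class is hypothetical; its method name is borrowed from the LangChain retriever interface only to make the parallel visible.

```python
class KeywordRetriever:
    """Hypothetical retriever: ranks texts by word overlap, no embeddings."""

    def __init__(self, texts):
        self.texts = list(texts)

    def get_relevant_documents(self, query, k=2):
        query_words = set(query.lower().split())
        scored = []
        for text in self.texts:
            overlap = len(query_words & set(text.lower().split()))
            if overlap:  # skip texts that share no word with the query
                scored.append((overlap, text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


retriever = KeywordRetriever(
    [
        "the president nominated a judge",
        "the senate passed a bill",
        "baseball payroll and wins",
    ]
)
print(retriever.get_relevant_documents("who did the president nominate"))
# ['the president nominated a judge', 'the senate passed a bill']
```

Anything that maps a query string to a list of relevant documents can fill the retriever role, which is why a vector store is only one possible backend.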

`VectorstoreIndexCreator` handles document splitting, embedding, and building and storing the vector index in one step. The vector store it uses by default is `chromadb`.

In [27]:
from langchain.document_loaders import TextLoader
loader = TextLoader('./example_data/state_of_the_union.txt', encoding='utf8')

In [28]:
from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator().from_loaders([loader])

In [29]:
query = "What did the president say about Ketanji Brown Jackson"
index.query(query)

" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

In [30]:
query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)

{'question': 'What did the president say about Ketanji Brown Jackson',
 'answer': " The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, one of the nation's top legal minds, who will continue Justice Breyer's legacy of excellence, and that she has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\n",
 'sources': './example_data/state_of_the_union.txt'}

In [31]:
index.vectorstore

<langchain.vectorstores.chroma.Chroma at 0x2e831918eb0>

In [32]:
index.vectorstore.as_retriever()

VectorStoreRetriever(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x000002E831918EB0>, search_type='similarity', search_kwargs={})

In [33]:
# VectorstoreIndexCreator does roughly the following

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

documents = loader.load()

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)


from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()


from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)

retriever = db.as_retriever()

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)

query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He said that she is a consensus builder and has received a broad range of support since being nominated."

### [Retrieval QA](https://python.langchain.com/docs/modules/chains/popular/vector_db_qa.html)

This section shows how to query data over an index.
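The cells in this section pass `chain_type="stuff"`, which places ("stuffs") all retrieved chunks into a single prompt together with the question. A rough sketch of that idea follows; the `build_stuff_prompt` helper and the prompt wording are hypothetical, not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved chunk into one prompt for a single LLM call."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Chunk one of the retrieved text.",
    "Chunk two of the retrieved text.",
]
prompt = build_stuff_prompt("What does the text say?", retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks together fit in the model's context window.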

In [1]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [3]:
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever())

In [4]:
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

" The President said that Ketanji Brown Jackson is one of the nation's top legal minds, a consensus builder, and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

## Querying

If you have a lot of tabular data and do not want to build and query an index, you can use `chains` and `agents` instead.

### Chains

A chain works through a plan that is fixed in advance, which makes it easier to control and to understand how the tools are run.

#### [SQL Database Chain](https://python.langchain.com/docs/modules/chains/popular/sqlite.html)
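A minimal sketch of the fixed plan such a chain follows, assuming three steps: turn the question into SQL, run the SQL, and phrase the answer. It uses only the standard-library `sqlite3` module. The `fake_llm_to_sql` function, the `teams` table, and its column names are hypothetical stand-ins, and the rows are taken from the CSV output earlier in this document.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE teams (team TEXT, payroll_millions REAL, wins INTEGER)"
)
connection.executemany(
    "INSERT INTO teams VALUES (?, ?, ?)",
    [("Nationals", 81.34, 98), ("Reds", 82.20, 97), ("Yankees", 197.96, 95)],
)


def fake_llm_to_sql(question):
    """Hypothetical stand-in for the LLM call; returns a hard-coded query."""
    return "SELECT team FROM teams ORDER BY wins DESC LIMIT 1"


def run_sql_chain(question):
    # Step 1: translate the question into SQL (an LLM call in a real chain).
    sql = fake_llm_to_sql(question)
    # Step 2: execute the SQL against the database.
    rows = connection.execute(sql).fetchall()
    # Step 3: phrase the result (also an LLM call in a real chain).
    return f"{question} -> {rows[0][0]}"


print(run_sql_chain("Which team had the most wins?"))
# Which team had the most wins? -> Nationals
```

The steps always run in the same order, which is what makes a chain predictable compared with an agent.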

### Agents

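Agents are more complex because they combine many steps, so their strengths and weaknesses are both pronounced. The strength is power: an agent can complete complicated tasks for you. The weakness is that it will not necessarily stay under your control and follow the path you expect.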

- [SQL Agent](https://python.langchain.com/docs/modules/agents/toolkits/sql_database.html)
- [Pandas Agent](https://python.langchain.com/docs/modules/agents/toolkits/pandas.html)
- [CSV Agent](https://python.langchain.com/docs/modules/agents/toolkits/csv.html)
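For contrast with a chain, the sketch below shows the loop at the heart of an agent: at each step a policy looks at what has been observed so far and decides whether to call a tool or to finish. Everything here is hypothetical. `scripted_policy` stands in for the LLM, the two tools are toy functions over an in-memory CSV, and none of it reflects how the LangChain agents listed above are implemented.

```python
import csv
import io

CSV_TEXT = "Team,Wins\nNationals,98\nReds,97\nYankees,95\n"


def count_rows(_question):
    return str(len(list(csv.DictReader(io.StringIO(CSV_TEXT)))))


def max_wins(_question):
    rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
    best = max(rows, key=lambda row: int(row["Wins"]))
    return f'{best["Team"]} ({best["Wins"]})'


TOOLS = {"count_rows": count_rows, "max_wins": max_wins}


def scripted_policy(question, observations):
    """Hypothetical stand-in for the LLM: pick the next action."""
    if not observations:
        tool = "max_wins" if "most wins" in question else "count_rows"
        return ("call", tool)
    return ("finish", observations[-1])


def run_agent(question, max_steps=3):
    observations = []
    for _ in range(max_steps):
        action, value = scripted_policy(question, observations)
        if action == "finish":
            return value
        # Run the chosen tool and remember what it returned.
        observations.append(TOOLS[value](question))
    return "stopped: step limit reached"


print(run_agent("Which team had the most wins?"))
print(run_agent("How many teams are there?"))
```

The `max_steps` limit is one simple guard against the loss of control described above: with a real LLM choosing the actions, the number and order of tool calls is not known in advance.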