# CSV extraction

This notebook shows examples of text extraction from CSV files using different packages.

**Table of contents**<a id='toc0_'></a>    
- [Methods to load csv files](#toc1_)    
  - [Load from CSV Loader](#toc1_1_)    
    - [Customize csv parsing and loading](#toc1_1_1_)    
    - [Load text splitter](#toc1_1_2_)    
    - [CSV loader with custom field names](#toc1_1_3_)    
  - [Load from pandas](#toc1_2_)    
  - [Load from Unstructured.io](#toc1_3_)    
- [Embedding & Storage](#toc2_)    
  - [Evaluate embedding by similarity](#toc2_1_)    


In [1]:
import sys
sys.path.append('../')

import os
import pandas as pd
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm.autonotebook import trange


# <a id='toc1_'></a>[Methods to load csv files](#toc0_)

In [2]:
file_path = 'sample_data/sample_files/customers-100.csv'

## <a id='toc1_1_'></a>[Load from CSV Loader](#toc0_)

In [3]:
from langchain.document_loaders.csv_loader import CSVLoader


loader_csv = CSVLoader(file_path=file_path, encoding="utf-8", csv_args={'delimiter': ','})
docs_csv = loader_csv.load()
for doc in docs_csv:
    print(f'{doc.page_content}\n')

Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: zunigavanessa@smith.info
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/

Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
City: East Jimmychester
Country: Djibouti
Phone 1: 5153435776
Phone 2: 686-620-1820x944
Email: vmata@colon.com
Subscription Date: 2021-04-23
Website: http://www.hobbs.com/

Index: 3
Customer Id: 6F94879bDAfE5a6
First Name: Roy
Last Name: Berry
Company: Murillo-Perry
City: Isabelborough
Country: Antigua and Barbuda
Phone 1: +1-539-402-0259
Phone 2: (496)978-3969x58947
Email: beckycarr@hogan.com
Subscription Date: 2020-03-25
Website: http://www.lawrence.com/

Index: 4
Customer Id: 5Cef8BFA16c5e3c
First Name: Linda
Last Name: Olsen
Company: Dominguez, Mcmillan and Donovan
City: Bensonview
Country: Dominican Republic
P

### <a id='toc1_1_1_'></a>[Customize csv parsing and loading](#toc0_)

Extract the field names.

`CSVLoader` uses the first line of the file as the field names when `fieldnames` is not specified in `csv_args`.

In [4]:
df = pd.read_csv(file_path)
fieldnames = df.columns.to_list()
fieldnames

['Index',
 'Customer Id',
 'First Name',
 'Last Name',
 'Company',
 'City',
 'Country',
 'Phone 1',
 'Phone 2',
 'Email',
 'Subscription Date',
 'Website']
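
The header behaviour can be reproduced with the standard library alone. This is a minimal sketch, assuming `CSVLoader` forwards `csv_args` to Python's `csv.DictReader` (as the `fieldnames` key suggests). The two-line CSV string is made up for illustration and only mirrors the first columns of the sample file. It shows that passing `fieldnames` for a file that already has a header row returns that header as a data row, and one possible way to avoid it.

```python
import csv
import io

# Illustrative two-line CSV with a header row (not the real sample file).
raw = "Index,Customer Id,First Name\n1,DD37Cf93aecA6Dc,Sheryl\n"

# No fieldnames: the first line supplies the keys and is not returned as a row.
rows_default = list(csv.DictReader(io.StringIO(raw)))

# Explicit fieldnames: every line, including the header, is returned as data.
names = ["Index", "Customer Id", "First Name"]
rows_named = list(csv.DictReader(io.StringIO(raw), fieldnames=names))

# One way to keep explicit names and still drop the header: skip the first line.
buffer = io.StringIO(raw)
next(buffer)
rows_skipped = list(csv.DictReader(buffer, fieldnames=names))

print(len(rows_default), len(rows_named), len(rows_skipped))
print(rows_named[0])
```

Explicit field names are therefore most useful for files without a header row, or for renaming columns after the header has been skipped.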

### <a id='toc1_1_2_'></a>[Load text splitter](#toc0_)

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
        # Set a small chunk size, just to make splitting evident.
        chunk_size = 200,
        chunk_overlap  = 20,
        length_function = len,
        add_start_index = True,
        separators = ["\n\n\n","\n"," "]
    )

### <a id='toc1_1_3_'></a>[CSV loader with custom field names](#toc0_)

In [6]:
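# Note: the sample file already has a header row, so passing `fieldnames`
# makes the loader treat that row as data (see the first document in the output).
loader_csv = CSVLoader(
    file_path=file_path,
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": fieldnames
    }
)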

docs_csv = loader_csv.load_and_split(text_splitter = text_splitter)
for doc in docs_csv:
    print(f'{doc.page_content}\n')

Index: Index
Customer Id: Customer Id
First Name: First Name
Last Name: Last Name
Company: Company
City: City
Country: Country
Phone 1: Phone 1
Phone 2: Phone 2
Email: Email

Email: Email
Subscription Date: Subscription Date
Website: Website

Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718

Email: zunigavanessa@smith.info
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/

Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
City: East Jimmychester
Country: Djibouti
Phone 1: 5153435776
Phone 2: 686-620-1820x944

Email: vmata@colon.com
Subscription Date: 2021-04-23
Website: http://www.hobbs.com/

Index: 3
Customer Id: 6F94879bDAfE5a6
First Name: Roy
Last Name: Berry
Company: Murillo-Perry
City: Isabelborough
Country: Antigua and Barbuda
Phone 1: +1-539-402-0259
Phone 2: (496)978-3969x58947

Email:

## <a id='toc1_2_'></a>[Load from pandas](#toc0_)
This method uses a single column as the text to embed; the remaining columns are stored as document metadata.

In [7]:
from langchain.document_loaders import DataFrameLoader
# If the .csv file has no header row, load it with pd.read_csv(file_path, names=['fieldname1', 'fieldname2', ...])
df = pd.read_csv(file_path)
# Specify which column of the data frame holds the text to embed.
loader_pandas = DataFrameLoader(df, page_content_column="Company")

docs_pandas = loader_pandas.load()
for doc in docs_pandas:
    print(f'{doc.page_content}\n')

Rasmussen Group

Vega-Gentry

Murillo-Perry

Dominguez, Mcmillan and Donovan

Martin, Lang and Andrade

Steele Group

Lester, Woodard and Mitchell

Sanford, Davenport and Giles

Browning-Simon

Beck-Hendrix

Oconnell, Meza and Everett

Hoffman, Reed and Mcclain

Graham-Francis

Carpenter-Cook

Carter-Hancock

Singleton and Sons

Winters-Mendoza

Valentine LLC

Simon LLC

Mays-Mccormick

Patterson Inc

Manning, Hester and Arroyo

Greer and Sons

Watts-Donaldson

Tucker LLC

Giles Ltd

Simmons Group

Hinton, Chaney and Stokes

Santana-Duran

Sawyer PLC

Acosta, Petersen and Morrow

Mcgee Group

Adkins-Salinas

Herrera Group

Waters, Chase and Aguilar

Palmer, Barnes and Houston

Jordan Ltd

Glover and Sons

Huerta-Mclean

Anderson Ltd

Monroe PLC

Kaufman and Sons

Perkins-Trevino

Cross PLC

Herrera, Shepherd and Underwood

Price, Sexton and Mcdaniel

Short-Wiggins

Brennan, Acosta and Ramos

Osborne-Erickson

Hobbs, Garrett and Sanford

Phelps, Forbes and Koch

May, Goodwin and Martin
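
Because only one column becomes the page content, one possible workaround is to build a combined text column first and point the loader at it. The sketch below is an assumption-laden illustration, not part of the original notebook: the toy data frame reuses a few values from the output above, and the helper `row_to_text` and the column name `text` are hypothetical.

```python
import pandas as pd

# Toy frame with a few columns of the sample file (values copied from the output above).
df_small = pd.DataFrame(
    {
        "First Name": ["Sheryl", "Preston"],
        "Last Name": ["Baxter", "Lozano"],
        "Company": ["Rasmussen Group", "Vega-Gentry"],
    }
)


def row_to_text(row: pd.Series) -> str:
    """Render one row as 'column: value' lines (hypothetical helper)."""
    return "\n".join(f"{col}: {row[col]}" for col in row.index)


# Combine the selected columns into a single text column, one string per row.
columns = ["First Name", "Last Name", "Company"]
df_small["text"] = df_small[columns].apply(row_to_text, axis=1)

print(df_small["text"].iloc[1])
```

The combined column could then be passed as `page_content_column="text"` to `DataFrameLoader`, so that names and company end up in the embedded text rather than only in the metadata.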


## <a id='toc1_3_'></a>[Load from Unstructured.io](#toc0_)

In [8]:
from langchain.document_loaders import UnstructuredAPIFileLoader
# Register at Unstructured.io to get a free API key.
load_dotenv('export.env')

loader_unstructured = UnstructuredAPIFileLoader(file_path, mode="elements", 
                                                api_key=os.environ.get("UNSTRUCTURED_API_KEY"), 
                                                url=os.environ.get("UNSTRUCTURED_URL"))
docs_unstructured = loader_unstructured.load_and_split(text_splitter)
for doc in docs_unstructured:
    print(f'{doc.page_content}\n')

Index
Customer Id
First Name
Last Name
Company
City
Country
Phone 1
Phone 2
Email
Subscription Date
Website

1
DD37Cf93aecA6Dc
Sheryl
Baxter
Rasmussen Group
East Leonard
Chile
229.077.5154
397.884.0519x718
zunigavanessa@smith.info
2020-08-24
http://www.stephenson.com/

2
1Ef7b82A4CAAD10
Preston
Lozano
Vega-Gentry
East Jimmychester
Djibouti
5153435776
686-620-1820x944
vmata@colon.com
2021-04-23
http://www.hobbs.com/

3
6F94879bDAfE5a6
Roy
Berry
Murillo-Perry
Isabelborough
Antigua and Barbuda
+1-539-402-0259
(496)978-3969x58947
beckycarr@hogan.com
2020-03-25
http://www.lawrence.com/

4
5Cef8BFA16c5e3c
Linda
Olsen
Dominguez, Mcmillan and Donovan
Bensonview
Dominican Republic
001-808-617-6467x12895
+1-813-324-8756
stanleyblackwell@benson.org
2020-06-02
http://www.good-lyons.com/

5
053d585Ab6b3159
Joanna
Bender
Martin, Lang and Andrade
West Priscilla
Slovakia (Slovak Republic)
001-234-203-0635x76146
001-199-446-3860x3486
colinalvarado@miles.net
2021-04-17

2021-04-17
https://goodwin-ingram

# <a id='toc2_'></a>[Embedding & Storage](#toc0_)

In [13]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

encode_kwargs = {'normalize_embeddings': True}
embd_model = HuggingFaceInstructEmbeddings( model_name='intfloat/e5-large-v2',
                                            embed_instruction="", # no instructions needed for candidate passages
                                            query_instruction="Represent this sentence for searching relevant passages: ",
                                            encode_kwargs=encode_kwargs)
vectorstore_csv = FAISS.from_documents(documents=docs_csv, embedding=embd_model)
vectorstore_pandas = FAISS.from_documents(documents=docs_pandas, embedding=embd_model)
vectorstore_unstructured = FAISS.from_documents(documents=docs_unstructured, embedding=embd_model)
type(vectorstore_csv)

load INSTRUCTOR_Transformer
max_seq_length  512


langchain.vectorstores.faiss.FAISS

## <a id='toc2_1_'></a>[Evaluate embedding by similarity](#toc0_)

In [14]:
query = "What is the information about customer with first name Preston?"
ans = vectorstore_csv.similarity_search(query)
print("-------CSV Loader----------\n")
print(ans[0].page_content)


ans_2 = vectorstore_pandas.similarity_search(query)
print("--------Pandas------------\n")
print(ans_2[0].page_content)


ans_3 = vectorstore_unstructured.similarity_search(query)
print("--------Unstructured------------\n")
print(ans_3[0].page_content)

-------CSV Loader----------

Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
City: East Jimmychester
Country: Djibouti
Phone 1: 5153435776
Phone 2: 686-620-1820x944
--------Pandas------------

Lamb-Peterson
--------Unstructured------------

Index
Customer Id
First Name
Last Name
Company
City
Country
Phone 1
Phone 2
Email
Subscription Date
Website
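
Reading the three results by eye works for one query, but the comparison can be made less manual. The sketch below is only one possible approach: it checks whether an expected substring appears in the top-k result texts. The helper `hit_at_k` is hypothetical, and the result texts are abbreviated copies of the output above rather than live search results.

```python
from typing import Sequence


def hit_at_k(results: Sequence[str], expected: str, k: int = 4) -> bool:
    """Return True if `expected` occurs in any of the first k result texts."""
    return any(expected in text for text in results[:k])


# Top-1 texts, abbreviated from the output above.
top_hits = {
    "csv": ["Index: 2\nCustomer Id: 1Ef7b82A4CAAD10\nFirst Name: Preston\nLast Name: Lozano"],
    "pandas": ["Lamb-Peterson"],
    "unstructured": ["Index\nCustomer Id\nFirst Name\nLast Name"],
}

# Record, per loader, whether the expected first name was retrieved.
scores = {name: hit_at_k(texts, "Preston", k=1) for name, texts in top_hits.items()}
for name, hit in scores.items():
    print(name, hit)
```

With the notebook's own objects, the input lists could be built from `[d.page_content for d in vectorstore_csv.similarity_search(query, k=4)]` and the equivalent calls on the other two stores. On the abbreviated texts here, only the CSV loader's chunk contains the queried first name.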
