### Read Excel file, extract URLs from column A into a list, and treat first line as a header

https://colab.research.google.com/drive/1f_1HeD1mK_wXfjgvY4VGNFKSQBE5Imeh?usp=sharing#scrollTo=jm2nTHTtVTp0

In [99]:
import sys
import pandas as pd
sys.path.append('../..')
from py3810.myUtils import pickle_dump, pickle_load

# Set the path to the directory containing the Excel file
path_lumen_dump = "../langchain/docs/lumen/"
path_lumen_docs = path_lumen_dump + "docs/"

lumen_urls = pickle_load(filename_pickle='lumen_urls', path_pickle_dump=path_lumen_dump)

In [100]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_transformers import Html2TextTransformer

loader = WebBaseLoader(lumen_urls)
docs_html = loader.load()
html2text = Html2TextTransformer()
docs_text_orig = html2text.transform_documents(docs_html)

In [101]:
pickle_dump(file_to_pickle=docs_text_orig, filename_pickle='docs_text_orig', path_pickle_dump=path_lumen_dump)
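
The `pickle_dump` and `pickle_load` helpers come from the local `py3810.myUtils` module, which is not shown in this notebook. The sketch below is one hypothetical way such helpers could look, using only the standard library. The `.pkl` extension and the automatic directory creation are assumptions, not the module's confirmed behavior.

```python
import pickle
import tempfile
from pathlib import Path


def pickle_dump(file_to_pickle, filename_pickle, path_pickle_dump):
    """Hypothetical stand-in: write an object to <path>/<name>.pkl."""
    path = Path(path_pickle_dump)
    path.mkdir(parents=True, exist_ok=True)  # assumption: create the folder if missing
    with open(path / f"{filename_pickle}.pkl", "wb") as f:
        pickle.dump(file_to_pickle, f)


def pickle_load(filename_pickle, path_pickle_dump):
    """Hypothetical stand-in: read the object back from <path>/<name>.pkl."""
    with open(Path(path_pickle_dump) / f"{filename_pickle}.pkl", "rb") as f:
        return pickle.load(f)


# Round-trip check in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pickle_dump(["https://example.com/a"], "urls_demo", tmp)
    restored = pickle_load("urls_demo", tmp)
```

Caching the scraped documents this way means the slow web download only has to run once; later cells can start from the pickle.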

In [151]:
docs_text = pickle_load(filename_pickle='docs_text_orig', path_pickle_dump=path_lumen_dump)

# TODO: make remove_newlines a global function

In [152]:
import re

def remove_strings(text, strings):
  """
  Replace every string in a list, plus pipe ("|"), forward-slash ("/"), hyphen ("-"), and em-dash ("—") characters, with a space.

  Args:
      text: The text string to clean.
      strings: A list of literal strings to remove from the text.

  Returns:
      The cleaned text, with leading and trailing whitespace stripped.
  """
  # Earlier version, without hyphen and em-dash removal:
  # pattern = re.compile("|".join(re.escape(s) for s in strings) + r"|\||/")
  pattern = re.compile("|".join(re.escape(s) for s in strings) + r"|\||/|-|—")
  return pattern.sub(" ", text).strip()

text = "This is Skip to primary navigation Myopia Managementsome text with a few strings to remove. Here are the strings: apple- —/- —banana-/-cherry, and the pipe symbol (|) should also be removed."
strings = [
  "apple",
  "banana",
  "cherry",
  "Skip to primary navigation Myopia Management",
]

print(text)
result = remove_strings(text, strings)
print(result)


This is Skip to primary navigation Myopia Managementsome text with a few strings to remove. Here are the strings: apple- —/- —banana-/-cherry, and the pipe symbol (|) should also be removed.
This is  some text with a few strings to remove. Here are the strings:              , and the pipe symbol ( ) should also be removed.
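
Two details of the alternation pattern are easy to miss. Regex alternatives are tried left to right, so a short phrase listed before a longer phrase that starts with it wins the match and leaves the tail behind. An empty `strings` list would also make the pattern begin with an empty alternative, which matches at every position. The sketch below is a hypothetical variant (`build_removal_pattern` and `remove_strings_sorted` are illustrative names, not part of the notebook) that sorts phrases longest first and skips empty ones.

```python
import re


def build_removal_pattern(strings):
    # Longest first, so a phrase is tried before any shorter phrase it starts with
    ordered = sorted((s for s in strings if s), key=len, reverse=True)
    parts = [re.escape(s) for s in ordered] + [r"\|", "/", "-", "—"]
    return re.compile("|".join(parts))


def remove_strings_sorted(text, strings):
    # Same replacement rule as remove_strings: every match becomes one space
    return build_removal_pattern(strings).sub(" ", text).strip()


demo = "Call Us(626) 921-0199 today"
cleaned = remove_strings_sorted(demo, ["Call Us", "Call Us(626) 921-0199"])
print(cleaned)
```

With the phrases in their original order, only `Call Us` would be removed and the phone number would remain, with its hyphen replaced by a space.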


In [153]:
import re

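def remove_newlines(text):
  # Replace each newline, each run of two or more whitespace characters,
  # each pagination label ("Go to page N", "Go to Next Page"), and each "…"
  # with a single space, then strip leading and trailing whitespace.
  # A single pass can leave runs of spaces behind, so callers may need to repeat it.
  return re.sub(r'(\n)|(\s{2,})|(Go to page \d+)|(Go to Next Page)|(\…)', ' ', text).strip()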

text_1 = "   \n  This is a string  \n  with   multiple \n \n \nnewlines.  "
text_2 = "page_content .... : s whether sleep habits impact eye health and vision development. In this blog post, we'll [...] Read\n"
text_3 = "page_content end:    More Go to page 1 Go to page 2 Go to page 3 Interim pages omitted … Go to page 25 Go to Next Page"
text = text_1 + text_2 + text_3
text_cleaned = remove_newlines(text)
print(f'text: {text}')
print('=========')
print(f'text_cleaned: {text_cleaned}')

text:    
  This is a string  
  with   multiple 
 
 
newlines.  page_content .... : s whether sleep habits impact eye health and vision development. In this blog post, we'll [...] Read
page_content end:    More Go to page 1 Go to page 2 Go to page 3 Interim pages omitted … Go to page 25 Go to Next Page
text_cleaned: This is a string with multiple newlines. page_content .... : s whether sleep habits impact eye health and vision development. In this blog post, we'll [...] Read page_content end: More       Interim pages omitted
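
The cleaned output above still contains a run of spaces after `More`, because each removed pagination label is replaced by a space that sits next to spaces already in the text. One possible alternative, sketched here with the hypothetical name `clean_once`, removes the noise first and collapses whitespace second, so a single pass is enough.

```python
import re

_NOISE = re.compile(r"Go to page \d+|Go to Next Page|…")
_SPACES = re.compile(r"\s+")


def clean_once(text):
    # Step 1: replace pagination labels and ellipsis characters with a space.
    # Step 2: collapse every whitespace run (including newlines) to one space.
    return _SPACES.sub(" ", _NOISE.sub(" ", text)).strip()


sample = "More Go to page 1 Go to page 2 Interim pages omitted … Go to page 25 Go to Next Page"
once = clean_once(sample)
print(once)
```

Because the whitespace collapse runs last, calling `clean_once` again on its own output changes nothing.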


In [156]:
remove_texts = \
[
'SCHEDULE AN APPOINTMENTLumen Optometric 14 West Sierra Madre Blvd, \
Sierra Madre, CA 91024 (626) 921-0199 info@lumenoptometric.comServicesComprehensive \
Eye Exams Contact Lens Exams Orthokeratology Neurolens Therapy Scleral Lenses / Keratoconus \
Quick LinksHome About Us Sitemap Hours Of OperationTuesday9:45 am - 5:30 \
pmWednesday9:45 am - 5:30 pmThursday9:45 am - 1:30 pmFriday9:45 am - 5:30 \
pmSaturday9:45 am - 5:30 pmCopyright © 2020 Lumen Optometric. All Rights Reserved.\
X[contact-form-7 404 "Not Found"]Top', \

"Allow us to take part in your vision care and you'll love what you see. \
Proudly serving Sierra Madre, Arcadia, Pasadena, and the surrounding communities.", \

'Skip to main content', \

'Skip to primary sidebar', \

'(626) 921-0199', \

'14 West Sierra Madre Blvd, Sierra Madre, CA 91024', \

'Call Us(626) 921-0199', \

'Schedule An Appointment', \

'Lumen Optometric | Sierra Madre, CA', \

'Skip to primary navigation', \

'Myopia Management', \

'June 2024 (3) May 2024 (3) April 2024 (3) March 2024 (2) February 2024 (5) \
January 2024 (2) December 2023 (2) November 2023 (2) October 2023 (5) \
September 2023 (2) August 2023 (4) July 2023 (3) June 2023 (3) May 2023 (3) \
April 2023 (3) March 2023 (3) February 2023 (3) January 2023 (3) \
December 2022 (1) November 2022 (2) October 2022 (4) September 2022 (3) \
August 2022 (3) July 2022 (2) June 2022 (3) May 2022 (3) April 2022 (3) \
March 2022 (3) February 2022 (2) January 2022 (3) December 2021 (3) \
November 2021 (3) October 2021 (3) September 2021 (3) August 2021 (3) \
July 2021 (3) June 2021 (3) May 2021 (3) April 2021 (3) March 2021 (2) \
February 2021 (2) January 2021 (3) December 2020 (3) November 2020 (3) \
October 2020 (3) September 2020 (2) August 2020 (4) July 2020 (2) \
June 2020 (3) May 2020 (2) April 2020 (4) March 2020 (2)', \

'Primary SidebarLatest Post', \

'June 20, 2024Beyond Age: Can Children and Young Adults Get Cataracts?June 17, 2024\
How Long Does My Child Need ?June 1, 2024The Manufacturing Process of Contact Lenses\
May 24, 2024Eye Vitamins: Can They Improve Vision?May 20, 2024 Category AMD Awareness \
Monthcovid 19eye careeye healtheye safetyEyesInformationoffice videosvision changes Archives', \

'Why Children Should Have Their Eyes Tested Yearly', \


# 'Primary SidebarLatest Post The Importance of Yearly Eye Tests for ChildrenJune 20, \
# 2024Beyond Age: Can Children and Young Adults Get Cataracts?June 17, 2024How Long Does \
# My Child Need ?June 1, 2024The Manufacturing Process of Contact LensesMay 24, 2024Eye \
# Vitamins: Can They Improve Vision?May 20, 2024 Category AMD Awareness Monthcovid 19eye \
# careeye healtheye safetyEyesInformationoffice videosvision changes Archives', \

# 'Primary SidebarLatest Post Why Children Should Have Their Eyes Tested YearlyJune 20, \
# 2024Beyond Age: Can Children and Young Adults Get Cataracts?June 17, 2024How Long Does \
# My Child Need ?June 1, 2024The Manufacturing Process of Contact LensesMay 24, 2024Eye \
# Vitamins: Can They Improve Vision?May 20, 2024 Category AMD Awareness Monthcovid 19eye \
# careeye healtheye safetyEyesInformationoffice videosvision changes Archive', \

# 'Primary SidebarLatest', \
]

https://platform.openai.com/docs/tutorials/web-qa-embeddings

In [157]:
from langchain_core.documents import Document

In [158]:
lumen_office_info = \
  "Lumen Optometric address is located at 14 West Sierra Madre Blvd, Sierra Madre, CA 91024. \
   Lumen Optometric office is located at 14 West Sierra Madre Blvd, Sierra Madre, CA 91024. \
   Lumen Optometric location is 14 West Sierra Madre Blvd, Sierra Madre, CA 91024. \
   Lumen Optometric phone number is (626) 507-2724. \
   Lumen Optometric office hours are Tuesday, Wednesday, Friday, and Saturday from 9:45 am to 5:30 pm, \
   and Thursday from 9:45 am to 1:30 pm. \
   Lumen Optometric email is info@lumenoptometric.com \
   Lumen Optometric website url is www.lumenoptometric.com"

doc_office_info = [Document(
  page_content=lumen_office_info,
  metadata={
    "source": "https://www.lumenoptometric.com/",
    "description": "info, address, office location, phone number, office hours, email, url"}
  )]

In [159]:
docs_text_n_office_info = doc_office_info + docs_text

In [160]:
docs_text_n_office_info

[Document(page_content='Lumen Optometric address is located at 14 West Sierra Madre Blvd, Sierra Madre, CA 91024.    Lumen Optometric office is located at 14 West Sierra Madre Blvd, Sierra Madre, CA 91024.    Lumen Optometric location is 14 West Sierra Madre Blvd, Sierra Madre, CA 91024.    Lumen Optometric phone number is (626) 507-2724.    Lumen Optometric office hours are Tuesday, Wednesday, Friday, and Saturday from 9:45 am to 5:30 pm,    and Thursday from 9:45 am to 1:30 pm.    Lumen Optometric email is info@lumenoptometric.com    Lumen Optometric website url is www.lumenoptometric.com', metadata={'source': 'https://www.lumenoptometric.com/', 'description': 'info, address, office location, phone number, office hours, email, url'}),
 Document(page_content='Lumen Optometric | Sierra Madre, CA | Eye Care | Best Optometrist Skip to\nprimary navigation Skip to main content(626) 921-019914 West Sierra Madre\nBlvd, Sierra Madre, CA 91024Call Us(626) 921-0199Schedule An AppointmentMyopia\

In [162]:
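# Note: docs_text_n_office_info includes the hand-written office-info document,
# so the address in remove_texts and the phone-number hyphen are stripped from it too.
for doc in docs_text_n_office_info:
  page_content = doc.page_content
  # Repeat both passes three times to catch repeated occurrences
  for i in range(3):
    page_content = remove_newlines(page_content)
    page_content = remove_strings(page_content, remove_texts)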

  doc.page_content = remove_newlines(page_content)
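
Since the cleaning loop runs over every document, it also strips the address from the hand-written office-info entry. A minimal sketch of one way to avoid that is to clean only the scraped pages and prepend the office-info document afterwards. `Doc` and `clean` below are simplified, hypothetical stand-ins for langchain's `Document` and the notebook's two cleaning functions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:  # hypothetical stand-in for langchain's Document
    page_content: str
    metadata: dict = field(default_factory=dict)


def clean(text, phrases):
    # Simplified stand-in for the remove_strings + remove_newlines passes
    for phrase in phrases:
        text = text.replace(phrase, " ")
    return re.sub(r"\s+", " ", text).strip()


address = "14 West Sierra Madre Blvd, Sierra Madre, CA 91024"
office = Doc(f"Lumen Optometric location is {address}.")
scraped = [Doc(f"Skip to main content {address} Eye exams for children")]

# Clean only the scraped pages, then prepend the untouched office-info document
for doc in scraped:
    doc.page_content = clean(doc.page_content, ["Skip to main content", address])
docs_all = [office] + scraped
```

In the notebook this would correspond to running the loop over `docs_text` and building `docs_text_n_office_info` afterwards.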

In [164]:
for i, doc in enumerate(docs_text_n_office_info[8:30]):
  page_content = doc.page_content
  print(f'{i} {page_content[-500:]}')

0  other risk factors for developing AMD, you should get an eye exam with your doctor. If you live in the Sierra Madre area, Lumen Optometric can provide comprehensive eye exams to help detect AMD early and offer recommendations for treatment and management. We also provide other ophthalmologic services and products, such as contact lenses and orthokeratology. Call us today at to schedule an appointment. Filed Under: AMD Awareness Month Tagged With: contact lenses, eye exam doctor, orthokeratology
1 o address that specifically. There is NO scientific evidence to support the claims that contact lens wear is unhealthy during these times. For the vast majority of contact lens wearers, their contact lenses are inserted and removed at home, with no manipulation throughout the day. When worn properly, contact lenses are cleaned or disposed of on a daily basis. In contrast, glasses typically are not disinfected every day! Filed Under: covid 19 Tagged With: contact lens, contacts, covid 19, he

In [165]:
pickle_dump(file_to_pickle=docs_text_n_office_info, filename_pickle='lumen_docs_website', path_pickle_dump=path_lumen_docs)

In [166]:
_docs = pickle_load(filename_pickle='lumen_docs_website', path_pickle_dump=path_lumen_docs)

In [167]:
_docs

[Document(page_content='Lumen Optometric address is located at . Lumen Optometric office is located at . Lumen Optometric location is . Lumen Optometric phone number is (626) 507 2724. Lumen Optometric office hours are Tuesday, Wednesday, Friday, and Saturday from 9:45 am to 5:30 pm, and Thursday from 9:45 am to 1:30 pm. Lumen Optometric email is info@lumenoptometric.com Lumen Optometric website url is www.lumenoptometric.com', metadata={'source': 'https://www.lumenoptometric.com/', 'description': 'info, address, office location, phone number, office hours, email, url'}),
 Document(page_content="Eye Care Best Optometrist What is myopia? Is myopia unhealthy? Nature Versus Nurture Myopia Treatments Our technology Scleral Lenses Keratoconus Poseyedon Lens Neurolens Therapy Orthokeratology How’s Ortho K Work? Is Ortho k Safe? Candidacy Our Technology Frequently Asked Questions Adult vs. Children’s Designs Eye Exams Contact Lens Exams Patient Center Appointment & Forms COVID 19 Protocols Vi