## Remark
This notebook experiments with LangChain before the approach is implemented in Kedro.

After reading the LangChain documentation carefully, I reckon that LangChain can simplify the whole workflow.

## Tutorial : Web Scraping with LangChain - [link](https://python.langchain.com/v0.1/docs/use_cases/web_scraping/) 
Sections 1 and 2 below are independent experiments.  

### General workflow of web scraping with LangChain
- Search: query to URL (e.g., using GoogleSearchAPIWrapper)
- Loading: URL to HTML (e.g., using AsyncHtmlLoader, AsyncChromiumLoader, etc.)
- Transforming: HTML to formatted text (e.g., using HTML2Text or Beautiful Soup)

[Tool List](https://python.langchain.com/v0.1/docs/integrations/tools/)  
[Tool setup](https://python.langchain.com/v0.1/docs/get_started/installation/)   

### Next Steps ??
- Split data
- Store data in the vector database 
- Retrieve
    - Tool binding: tagging or categorization? When?
- LangGraph

### Not sure
- Search Implementation ?? ( We have URLs to look for. )

### Required for me
- Explore RAG architecture : [RAG](https://python.langchain.com/v0.1/docs/use_cases/question_answering/)

#### 0. OpenAI API setup 

In [73]:
# Load the OpenAI API key (OPENAI_API_KEY) from a local .env file
import dotenv 
dotenv.load_dotenv()

True

### 1. Cleansing HTML using Transformer : HTML2Text  
[HTML2Text](https://python.langchain.com/v0.1/docs/integrations/document_transformers/html2text/) provides a straightforward conversion of HTML content into plain text (with markdown-like formatting) without any specific tag manipulation.  
It's best suited for scenarios where the goal is to extract human-readable text without needing to manipulate specific HTML elements.  
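
To make "conversion without tag manipulation" concrete, here is a minimal sketch that uses only Python's standard library. It is a stand-in for illustration, not how `Html2TextTransformer` works internally (that class wraps the `html2text` package and also emits markdown-like formatting). The names `TextExtractor` and `html_to_text` and the sample HTML are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text regardless of the tag it came from (hypothetical helper)."""

    SKIPPED = {"script", "style"}  # content of these tags is never human-readable text

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(raw_html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Kfz-Versicherung</h1><p>Rundum <b>Schutz</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample_html))
```

No tag is selected or filtered by name apart from `script` and `style`; everything else is flattened to text, which is the trade-off described above.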

#### 1.1. Loading already crawled data
The data used here are HTML files crawled by Khalil.

In [47]:
import os
from langchain.schema import Document  # Import the Document class
from langchain_community.document_transformers import Html2TextTransformer

# Define transformer
html2text = Html2TextTransformer()

# Base directory containing the company folders
base_dir = "company_crawled_data"

# Dictionary to store the parsed HTML content for each company
company_data = {}

# Loop through each folder inside company_crawled_data
for company in os.listdir(base_dir):
    company_folder = os.path.join(base_dir, company)
    
    # Check if it is a directory
    if os.path.isdir(company_folder):
        company_data[company] = []
        
        # Loop through HTML files in the company's folder
        for html_file in os.listdir(company_folder):
            file_path = os.path.join(company_folder, html_file)
            
            # Ensure it's an HTML file
            if html_file.endswith(".html"):
                with open(file_path, "r", encoding="utf-8") as file:
                    raw_html = file.read()
                    # Create a Document object for each HTML file
                    document = Document(page_content=raw_html, metadata={"source": file_path})
                    company_data[company].append(document)


# Example: Transform the HTML content for each company
for company, document_list in company_data.items():
    print(f"--- {company} ---")
    if document_list:
        # Transform the list of Document objects
        html_transformed = html2text.transform_documents(document_list)

        # Print the first 2000 characters of up to two transformed documents
        # (adjust both limits to taste). Slicing avoids an IndexError when a
        # company folder contains only one HTML file.
        for i, doc in enumerate(html_transformed[:2]):
            print(f'================== {i} ===================')
            print(doc.page_content[:2000])
    else:
        print("No HTML files found.")


--- generali ---
  * Privatkunden 
  * Geschäftskunden 

  * Journal 
  * Berater finden 
  * Service & Kontakt 

Suchen

  * Rundum-Schutz
  * Fahrzeug & Zuhause
  * Gesundheit & Freizeit
  * Recht & Haftung
  * Vorsorge & Finanzen

Rundum-Schutz

  * Vermögenssicherungspolice 
  * Vermögensaufbau & Sicherheitsplan 
  * Mein Zukunftsplan 
  * Mein Pflegeschutz 

Young Line

  * Young & Drive 
  * Young & Home 
  * Young & Life 
  * Young & Law 
  * Vermögensaufbau4you 

**Vermögenssicherungspolice**  
Rundum geschützt durchs Leben

mehr erfahren

Fahrzeug

  * Kfz-Versicherung 
  * Kfz-Schutzbrief 
  * Young & Drive 
  * Elektro-Fahrzeug 
  * Digitale Pannenhilfe 
  * Fahrer-Mobilitätsschutz 
  * Oldtimer Optimal 
  * Motorradversicherung 
  * Moped & E-Scooter 

Zuhause

  * Hausratversicherung 
  * Wohngebäudeversicherung 
  * Glasversicherung 
  * Haus- und Wohnungsschutzbrief 
  * Konto- und Finanzschutzbrief 
  * Photovoltaikversicherung 
  * Kunstversicherung 
  * Bauversicherun

### 2. URL extraction -> loader -> transform 

#### 2.1. Extracting internal URLs with one example company : AXA Germany

In [7]:
companies = [ { 'name' : 'AXA Germany', 'url' : 'https://www.axa.de' } ]

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl_company_websites(companies, max_pages=200):
    """
        Crawls the websites of a list of companies and collects internal URLs.

        Parameters:
            companies (list): A list of dictionaries, where each dictionary represents a company and must have:
                            - 'name' (str): The name of the company.
                            - 'url' (str): The URL of the company's website.
            max_pages (int): The maximum number of pages to crawl for each company. Default is 200.
        
        Returns:
            dict: A dictionary where the keys are company names (formatted as lowercase with spaces replaced by underscores)
                and the values are sets of crawled URLs.     
    """

    # Mapping of company name -> set of internal URLs, returned at the end
    url_list = {}

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
    }

    for company in companies:
        company_name = company['name'].lower().replace(' ', '_')

        start_url = company['url']
        visited_urls = set()
        to_visit = [start_url]
        pages_crawled = 0

        while to_visit and pages_crawled < max_pages:
            url = to_visit.pop(0)
            if url in visited_urls:
                continue

            # Handle errors per URL so that one failing page does not abort
            # the crawl for the whole company
            try:
                response = requests.get(url, headers=headers, timeout=10)
                response.raise_for_status()
            except requests.RequestException as e:
                print(f"Error crawling {url} ({company['name']}): {e}")
                continue

            soup = BeautifulSoup(response.content, 'html.parser')
            visited_urls.add(url)
            pages_crawled += 1

            # Queue every internal link that has not been visited yet
            for link in soup.find_all('a', href=True):
                full_url = urljoin(url, link['href'])

                if is_internal_link(start_url, full_url) and full_url not in visited_urls:
                    to_visit.append(full_url)

        url_list[company_name] = visited_urls
    return url_list

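def is_internal_link(base_url, test_url):
    """Return True if both URLs have the same host (exact netloc match, so subdomains count as external)."""
    base_domain = urlparse(base_url).netloc
    test_domain = urlparse(test_url).netloc
    return base_domain == test_domain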

In [None]:
url_list = crawl_company_websites(companies, max_pages = 10)
# Each value is the set of URLs crawled for that company
for company, company_urls in url_list.items():
    print(company, company_urls)

axa_germany {'https://www.axa.de/geschaeftskunden', 'https://www.axa.de/karriere', 'https://www.axa.de/medien', 'https://www.axa.de/', 'https://www.axa.de/home', 'https://www.axa.de', 'https://www.axa.de/kontakt', 'https://www.axa.de/wir-ueber-uns', 'https://www.axa.de/geschaeftskunden/services-fuer-geschaeftskunden', 'https://www.axa.de/site/axa-de/redirect/MyAxaLogin?AKTIONSCODE=14015D'}
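
The set above contains both `https://www.axa.de` and `https://www.axa.de/`, which address the same page, so two of the `max_pages` slots are spent on one page. One possible remedy is to normalize each URL before it is added to `visited_urls` or `to_visit`. The sketch below is an assumption about what should count as "the same page" here (fragment dropped, trailing slash ignored, host lowercased). `normalize_url` is a hypothetical helper and not part of the crawler above.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of url so that trivially different spellings compare equal."""
    parts = urlsplit(url)
    # Treat "" and "/" as the same path and ignore trailing slashes elsewhere
    path = parts.path.rstrip("/") or "/"
    # Keep the query string, drop the fragment (it never reaches the server)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


crawled = {
    "https://www.axa.de",
    "https://www.axa.de/",
    "https://www.axa.de/kontakt",
}
unique = {normalize_url(u) for u in crawled}
print(sorted(unique))
```

Whether pages such as `/home` also duplicate the start page cannot be decided from the URL alone, so this sketch leaves them untouched.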


#### 2.2. AsyncChromiumLoader : Retrieving HTML of webpages

Traditionally, web scraping in Python uses requests.get to fetch the HTML and BeautifulSoup to parse it.   
AsyncChromiumLoader is a LangChain component that instead loads pages asynchronously in a headless Chromium browser (via Playwright), so JavaScript on the page runs before the HTML is captured.  

- [AsyncChromiumLoader Document](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.chromium.AsyncChromiumLoader.html)   
    (See also : [AsyncHtmlLoader](https://python.langchain.com/v0.1/docs/integrations/document_loaders/async_chromium/) -- lightweight version )  
    - Asynchronous error issue within Jupyter Notebook 
        - Errors may occur in a Jupyter notebook because ```asyncio.run()``` cannot be called while an event loop is already running, and Jupyter itself runs one in the background. To fix this, use ```nest_asyncio``` to allow nested event loops, or adapt the code to work within Jupyter's existing event loop.
- Document object : A Document is a piece of text (page_content) with associated metadata (metadata). 
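
The event-loop issue can be reproduced without Jupyter. The sketch below uses only the standard library and calls `asyncio.run()` from inside a coroutine that is already running on a loop, which is the situation a notebook cell is in. The function names are illustrative assumptions.

```python
import asyncio


async def fetch_placeholder() -> str:
    """Stand-in for an asynchronous page fetch; returns a fixed string."""
    return "<html></html>"


async def call_run_inside_running_loop() -> bool:
    """Return True if asyncio.run() refuses to start while a loop is already running."""
    coro = fetch_placeholder()
    try:
        asyncio.run(coro)
    except RuntimeError:
        # Close the never-started coroutine to avoid a "never awaited" warning
        coro.close()
        return True
    return False


async def await_directly() -> str:
    """Inside a running loop, awaiting the coroutine is the direct alternative."""
    return await fetch_placeholder()


print(asyncio.run(call_run_inside_running_loop()))
print(asyncio.run(await_directly()))
```

`nest_asyncio.apply()` patches the running loop so that such a nested call is allowed, which is why it is applied before a loader that starts its own loop is used in a notebook.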




In [69]:
import nest_asyncio

# AsyncChromiumLoader needs nest_asyncio when it runs inside a Jupyter notebook.
def is_running_in_jupyter():
    """Return True if the code runs inside a Jupyter (IPython kernel) session."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a Jupyter kernel
        return False
    shell = get_ipython()
    # get_ipython() returns None outside an IPython session
    return shell is not None and 'IPKernelApp' in shell.config

if is_running_in_jupyter():
    print("Running in Jupyter Notebook environment. Applying nest_asyncio...")
    nest_asyncio.apply()
else:
    print("Not running in Jupyter Notebook environment. nest_asyncio is not applied.")

Running in Jupyter Notebook environment. Applying nest_asyncio...


In [None]:
from langchain_community.document_loaders import AsyncChromiumLoader

# List of internal URLs (the crawler returns a set; the loader expects a list)
urls = list(url_list['axa_germany'])

# Loader that fetches the HTML asynchronously
loader = AsyncChromiumLoader(urls=urls)

# Load the pages into Document objects (return type: list[Document])
docs = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


#### 2.3. Transform with HTML2Text (Cleansing)
HTML2Text provides a straightforward conversion of HTML content into plain text (with markdown-like formatting) without any specific tag manipulation.  
It's best suited for scenarios where the goal is to extract human-readable text without needing to manipulate specific HTML elements.

[HTML2Text document](https://python.langchain.com/v0.1/docs/integrations/document_transformers/html2text/)

In [78]:
from langchain_community.document_transformers import Html2TextTransformer

html2text = Html2TextTransformer()
# Sequence of cleansed Document objects (return type: Sequence[Document])
docs_transformed = html2text.transform_documents(docs) 

# Example of the first document with 2000 characters
print(docs_transformed[0].page_content[:2000])

Bitte aktivieren Sie JavaScript in den Browser-Einstellungen, um diese Seite
nutzen zu konnen.

Privatkunden Geschäftskunden Über AXA Karriere Medien My Axa Login Meine
Gesundheit Login Kontakt

Fahrzeuge Haftpflicht & Recht Haus & Wohnung Gesundheit Vorsorge & Vermögen
Kundenservice

Sach- & Ertragsausfall Haftpflicht Bürgschaften Finanzierung Weitere Produkte
Service & Kontakt

Das Unternehmen Unsere Verantwortung Unsere Auszeichnungen

Warum AXA? Berufsfelder Jobs Tipps & Kontakt Karriere im Vertrieb

Pressemitteilungen Mediathek Medienkontakt AXA auf Social Media

Suchvorschläge

______

Fahrzeuge

Versicherungsschutz mit maximaler Flexibilität und für höchste Fahr-Ansprüche.

Kfz im Überblick

Kfz-Versicherung Motorradversicherung Elektroauto-Versicherung
Rollerversicherung E-Scooter Versicherung Oldtimer-Versicherung Lieferwagen-
Versicherung Verkehrsrechtschutzversicherung schadenservice360° Auto Ratgeber
Kfz

Haftpflicht & Recht

Alle Haftpflichtversicherungen bieten einen wich

### 3. Split 
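
LangChain ships text splitters (for example `RecursiveCharacterTextSplitter`) for this step. As a library-free illustration of what splitting with overlap means, here is a minimal sketch. `split_text`, the chunk size, and the overlap are assumptions, not choices made in this notebook.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character chunks whose neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=2))
```

The overlap keeps a sentence that straddles a chunk boundary visible in both chunks, at the cost of storing some text twice.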
### 4. Store

### Next Step : Retrieval, Tool Calling/Binding Experiment 

##### LLM with function calling

In [None]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")

##### Defining a schema
A schema specifies what kind of data to extract.   
Possible attributes for the competitive analysis database : see this [video ( 28:30 )](https://www.youtube.com/watch?app=desktop&v=NHeOMxa7VgU); pause the video there and look through the attributes shown. 


In [65]:
from langchain.chains import create_extraction_chain

# "required" may only name keys that are declared under "properties"
schema = {
    "properties" : {
        "ist_versicherung" : { "type" : "string" },
        "für wen" : { "type" : "string" }
    },
    "required" : ["ist_versicherung", "für wen"],
}

def extract(content: str, schema: dict):
    return create_extraction_chain(schema = schema, llm = llm).run(content)
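
Because the extraction schema is a plain dict, nothing stops `required` from naming a key that is missing from `properties`. A small check like the following can catch that before the chain is run. It is a sketch under the assumption that the schema uses the `properties` / `required` layout shown in this notebook; `undeclared_required_keys` and `example_schema` are hypothetical.

```python
def undeclared_required_keys(schema: dict) -> list:
    """Return the keys listed under 'required' that are not declared under 'properties'."""
    declared = set(schema.get("properties", {}))
    return [key for key in schema.get("required", []) if key not in declared]


# Illustrative schema with a deliberate mismatch
example_schema = {
    "properties": {
        "product_name": {"type": "string"},
    },
    "required": ["product_name", "target_group"],
}

missing = undeclared_required_keys(example_schema)
if missing:
    print(f"Required keys not declared in properties: {missing}")
```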