# Processing HTML Files

We will be using the **html2parquet transform**.

References
- [html2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/python)

## Step-1: Data

We will process the data downloaded by [1_crawl_site.ipynb](1_crawl_site.ipynb).

The crawled HTML files are in the `input` directory.

## Step-2: Configuration

In [1]:
## All config is defined here
from my_config import MY_CONFIG
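
The `my_config` module itself is not shown in this notebook. As a rough sketch only, and assuming the directory names that appear in the log and output messages further down (`input`, `output/1-html2parquet`, `output/2-markdown`), it might look something like the following. The `MyConfig` class and its layout are hypothetical; only the four attribute names are taken from the cells in this notebook.

```python
import os


class MyConfig:
    """Hypothetical container for the paths used in this notebook."""


MY_CONFIG = MyConfig()

# Directory names are assumptions inferred from the log output in this notebook
MY_CONFIG.INPUT_DIR = "input"
MY_CONFIG.OUTPUT_DIR = "output"
MY_CONFIG.OUTPUT_DIR_HTML = os.path.join(MY_CONFIG.OUTPUT_DIR, "1-html2parquet")
MY_CONFIG.OUTPUT_DIR_MARKDOWN = os.path.join(MY_CONFIG.OUTPUT_DIR, "2-markdown")
```

Keeping every path in one object means the later steps never hard-code a directory name.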

In [2]:
import os
import shutil

# Start from a clean output directory on every run
shutil.rmtree(MY_CONFIG.OUTPUT_DIR, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR, exist_ok=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR_HTML, exist_ok=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR_MARKDOWN, exist_ok=True)

print ("✅ Cleared  output directory")

✅ Cleared  output directory


## Step-3: HTML2Parquet

Process the HTML documents and extract their text in Markdown format.

In [3]:
%%time 

import ast
import os
import sys

# from html2parquet_transform import Html2ParquetTransform, Html2ParquetTransformConfiguration
from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration
from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import GB, ParamsUtils


local_conf = {
    "input_folder": MY_CONFIG.INPUT_DIR,
    "output_folder": MY_CONFIG.OUTPUT_DIR_HTML,
}

params =  {
    "data_files_to_use": ast.literal_eval("['.html','.zip']"),
    "html2parquet_output_format": "markdown",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# launcher = PythonTransformLauncher(runtime_config=Html2ParquetTransformConfiguration())
launcher = PythonTransformLauncher(runtime_config=Html2ParquetPythonTransformConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print ("✅ Job completed successfully")
else:
    raise Exception ("❌ Job failed")

00:23:09 INFO - html2parquet parameters are : {'output_format': <html2parquet_output_format.MARKDOWN: 'markdown'>, 'favor_precision': <html2parquet_favor_precision.TRUE: 'True'>, 'favor_recall': <html2parquet_favor_recall.TRUE: 'True'>}
00:23:09 INFO - pipeline id pipeline_id
00:23:09 INFO - code location None
00:23:09 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/1-html2parquet
00:23:09 INFO - data factory data_ max_files -1, n_sample -1
00:23:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.html', '.zip'], files to checkpoint ['.parquet']
00:23:09 INFO - orchestrator html2parquet started at 2024-11-26 00:23:09
00:23:09 INFO - Number of files is 20, source profile {'max_file_size': 0.23515033721923828, 'min_file_size': 0.0885457992553711, 'total_file_size': 2.5425233840942383}
00:23:10 INFO - Completed 1 files (5.0%) in 0.003 min
00:23:10 INFO - Completed 2 files 

✅ Job completed successfully
CPU times: user 1.3 s, sys: 1.06 s, total: 2.36 s
Wall time: 1.42 s
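
The log above reports `Number of files is 20` for the extensions `['.html', '.zip']`. One way to cross-check that number independently is to count the matching files yourself. The helper below is a minimal sketch using only the standard library; `count_matching_files` is a hypothetical name, and searching subfolders recursively is an assumption, not a statement about how the transform scans its input.

```python
import tempfile
from pathlib import Path


def count_matching_files(folder, extensions=(".html", ".zip")):
    """Count files under `folder` (including subfolders) whose suffix is in `extensions`."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


# Small self-contained demo on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.html", "b.zip", "notes.txt"):
        (Path(tmp) / name).write_text("x")
    demo_count = count_matching_files(tmp)  # counts a.html and b.zip only
```

In this notebook, `count_matching_files(MY_CONFIG.INPUT_DIR)` could then be compared against the file count in the log.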


## Step-4: Inspect the Output


In [4]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(MY_CONFIG.OUTPUT_DIR_HTML)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (20, 6)


| | title | document | contents | document_id | size | date_acquired |
|---|---|---|---|---|---|---|
| 0 | thealliance_ai_blog-open-source-ai-demo-night-... | thealliance_ai_blog-open-source-ai-demo-night-... | On August 8th, The AI Alliance, in collaborati... | 7802bb7e50653e6b21f571b28843fd9a4bcf5023eaab3a... | 3151 | 2024-11-26T00:23:10.251906 |
| 1 | thealliance_ai_core-projects-sb1047_text.html | thealliance_ai_core-projects-sb1047_text.html | The AI Alliance, a community of technology cre... | bbfed07faf040c9f276df43207437f0501cf9da14ec956... | 7184 | 2024-11-26T00:23:10.303543 |
| 2 | thealliance_ai_focus-areas-foundation-models-d... | thealliance_ai_focus-areas-foundation-models-d... | # Open Foundation Models and Datasets\n\n### E... | cace8c007c2c65b7a92d9f152b7e012502b1614205e7c9... | 4499 | 2024-11-26T00:23:10.341442 |
| 3 | thealliance_ai_focus-areas-skills-education_te... | thealliance_ai_focus-areas-skills-education_te... | # Skills & Education\n\n### Supporting global ... | d98ef830df5e293bb7903e021b60194e8b4e529ef4824b... | 334 | 2024-11-26T00:23:10.362269 |
| 4 | thealliance_ai_focus-areas-applications-and-to... | thealliance_ai_focus-areas-applications-and-to... | ![abstract gradient](https://images.prismic.io... | 37752caba69be871c683c399ca2d5ab2afbec4d2623563... | 568 | 2024-11-26T00:23:10.333829 |
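
The helper `read_parquet_files_as_df` comes from `my_utils`, which is not shown here. As a sketch of what such a helper might do (not its actual implementation), the function below reads every `*.parquet` file in a folder with pandas and concatenates the results. The name `read_parquet_dir` is hypothetical, and `pd.read_parquet` needs a Parquet engine such as `pyarrow` to be installed.

```python
import glob
import os

import pandas as pd


def read_parquet_dir(parquet_dir):
    """Read every *.parquet file in `parquet_dir` into a single DataFrame."""
    files = sorted(glob.glob(os.path.join(parquet_dir, "*.parquet")))
    if not files:
        # Nothing to read: return an empty frame instead of failing in concat
        return pd.DataFrame()
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```

Returning an empty DataFrame when no files are found keeps a call like `read_parquet_dir(MY_CONFIG.OUTPUT_DIR_HTML).shape` usable even if the transform produced no output.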


In [5]:
output_df.iloc[0,]['title']

'thealliance_ai_blog-open-source-ai-demo-night-sf-2024_text.html'

In [6]:
output_df.iloc[0,]['document']

'thealliance_ai_blog-open-source-ai-demo-night-sf-2024_text.html'

In [7]:
## Display markdown text
print ('content length:', len(output_df.iloc[0,]['contents']), '\n')
print (output_df.iloc[0,]['contents'])


content length: 3151 

On August 8th, The AI Alliance, in collaboration with Cerebral Valley and Ollama, hosted Open Source AI Demo Night in San Francisco, bringing together more than 200+ developers and innovators to showcase and celebrate the latest advances in open-source AI. There were 7 demo teams and a panel discussion on [why open technologies and communities are essential to driving innovation in California](https://youtu.be/tOXzyHJvOKw).

The demo teams included:

[Ollama](https://ollama.com/)- helps developers run language models such as Llama 3.1, Mistral, Gemma 2, and others, locally on the computer or on a server cluster. Watch Michael Yang’s demo here:[Tool calling with Ollama - How an LLM accesses external information.](https://youtu.be/YWLLrgzzbj8)[Continue](https://www.continue.dev/)– a leading open-source AI code assistant that connects any models and any context to build custom autocomplete and chat experiences inside the IDE. Watch Ty Dunn’s demo here:[Using Continu

In [8]:
## display markdown in pretty format
from IPython.display import Markdown
display(Markdown(output_df.iloc[0,]['contents']))


On August 8th, The AI Alliance, in collaboration with Cerebral Valley and Ollama, hosted Open Source AI Demo Night in San Francisco, bringing together more than 200+ developers and innovators to showcase and celebrate the latest advances in open-source AI. There were 7 demo teams and a panel discussion on [why open technologies and communities are essential to driving innovation in California](https://youtu.be/tOXzyHJvOKw).

The demo teams included:

[Ollama](https://ollama.com/)- helps developers run language models such as Llama 3.1, Mistral, Gemma 2, and others, locally on the computer or on a server cluster. Watch Michael Yang’s demo here:[Tool calling with Ollama - How an LLM accesses external information.](https://youtu.be/YWLLrgzzbj8)[Continue](https://www.continue.dev/)– a leading open-source AI code assistant that connects any models and any context to build custom autocomplete and chat experiences inside the IDE. Watch Ty Dunn’s demo here:[Using Continue to understand a brand new code library](https://youtu.be/BUq66FHVqng)[AgentOps](https://www.agentops.ai/)– an industry-leading developer platform to test and debug AI agents. Watch Alex Reibman and Ajay Poshak demo LlamaFS here:[LlamaFS: A self-organizing agentic filesystem](https://youtu.be/P3pND_JSkuQ)[CrewAI](https://www.crewai.com/)- Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.Watch João Moura’s demo here:[Build multi-agent automations with Crew.ai](https://youtu.be/5b07faElxfM).[Based Hardware](https://basedhardware.com/)– a fully open-source AI note taker that provides you with reminders, suggestions, and more; all in one simple app. Watch Nik Shevchenko’s demo here:[Friend: An AI necklace you wear which records your day](https://youtu.be/e0owdgDDP0I)[Datafog](https://www.datafog.ai/)– an open source AI/ML platform with solutions to scan unstructured content in files for PII, either annotating, anonymizing, or redacting sensitive information. Watch Sid Mohan’s demo here:[Using Open Source LLMs for PII data detection with DataFog](https://youtu.be/c1dx2bzaplk)[Semikong](https://www.semikong.ai/)- the World’s First Semiconductor Industry-Specific Large Language Model. Watch Nanda Kishore‘s demo here:[SemiKong: The Open Source Semiconductor LLM powered by Llama](https://youtu.be/zIhyFom_obM)


Demo Night also featured a panel discussion “[AI in the Era of Open Innovation](https://youtu.be/tOXzyHJvOKw),” moderated by CEO & Founder Aitomatic Christopher Nguyen, and featured Matt White, Executive Director of PyTorch Foundation and General Manager of AI, Linux Foundation; Charles Xie, CEO of Zilliz; and Sharon Zhou, CEO of Lamini. The panelists underscored the importance of having access to state of the art open-source AI models in building their company by fine-tuning the models to their respective company needs. Moreover, the panelists opposed California Senate Bill 1047, highlighting that it would stifle open-source AI development and have a downstream chilling effect on AI investment and expansion.

## Step-5: Save the markdown

In [None]:
import os

saved_count = 0
for _, row in output_df.iterrows():
    html_file = row['document']
    base_name = os.path.splitext(os.path.basename(html_file))[0]
    md_output_file = os.path.join(MY_CONFIG.OUTPUT_DIR_MARKDOWN, base_name + '.md')

    # Write as UTF-8 so non-ASCII characters in the content are preserved
    with open(md_output_file, 'w', encoding='utf-8') as md_output_file_handle:
        md_output_file_handle.write(row['contents'])
    saved_count += 1
# -- end loop ---

print (f"✅ Saved {saved_count} md files into '{MY_CONFIG.OUTPUT_DIR_MARKDOWN}'")

✅ Saved 20 md files into 'output/2-markdown'
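
To make the file naming in this step concrete, the sketch below isolates the path logic from the loop and applies it to the first document title shown earlier. `markdown_path_for` is a hypothetical helper name, and the output directory string is taken from the message above.

```python
import os


def markdown_path_for(html_file, markdown_dir):
    """Map an HTML file name to the .md path used when saving its contents."""
    # Drop any directory part, then strip only the final extension
    base_name = os.path.splitext(os.path.basename(html_file))[0]
    return os.path.join(markdown_dir, base_name + ".md")


example = markdown_path_for(
    "thealliance_ai_blog-open-source-ai-demo-night-sf-2024_text.html",
    "output/2-markdown",
)
```

Because `os.path.splitext` removes only the last extension, a name such as `page.tar.html` would become `page.tar.md`.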
