# Processing HTML Files

We will use the **html2parquet** transform.

References
- [html2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/python)

## Step-1: Data

We will process data that is downloaded using [1_crawl_site.ipynb](1_crawl_site.ipynb).

The crawled HTML files are in the `input` directory.

## Step-2: Configuration

In [1]:
## All config is defined here
from my_config import MY_CONFIG
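
`my_config.py` itself is not shown in this notebook. A minimal sketch of what it might contain, assuming the directory names that appear in the log and output lines of this run (`input`, `output/1-html2parquet`, `output/2-markdown`). The class name and layout are hypothetical, not the project's actual file:

```python
import os


class MyConfig:
    """Hypothetical stand-in for the object exported by my_config."""
    pass


MY_CONFIG = MyConfig()

# Directory names are assumptions taken from this run's log output
MY_CONFIG.INPUT_DIR = "input"
MY_CONFIG.OUTPUT_DIR = "output"
MY_CONFIG.OUTPUT_DIR_HTML = os.path.join(MY_CONFIG.OUTPUT_DIR, "1-html2parquet")
MY_CONFIG.OUTPUT_DIR_MARKDOWN = os.path.join(MY_CONFIG.OUTPUT_DIR, "2-markdown")
```

Keeping both output folders under a single `OUTPUT_DIR` is what lets the next cell reset everything with one `shutil.rmtree` call.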

In [2]:
import os
import shutil

# Start from a clean output tree, then recreate the folders used by later steps
shutil.rmtree(MY_CONFIG.OUTPUT_DIR, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR, exist_ok=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR_HTML, exist_ok=True)
os.makedirs(MY_CONFIG.OUTPUT_DIR_MARKDOWN, exist_ok=True)

print ("✅ Cleared  output directory")

✅ Cleared  output directory


## Step-3: HTML2Parquet

Process the HTML documents and extract their text in markdown format.

In [3]:
from dpk_html2parquet.transform_python import Html2Parquet

# Convert every .html file in the input folder to parquet, storing the extracted text as markdown
x = Html2Parquet(input_folder=MY_CONFIG.INPUT_DIR,
                 output_folder=MY_CONFIG.OUTPUT_DIR_HTML,
                 data_files_to_use=['.html'],
                 html2parquet_output_format="markdown"
                 ).transform()

14:39:10 INFO - html2parquet parameters are : {'output_format': <html2parquet_output_format.MARKDOWN: 'markdown'>, 'favor_precision': <html2parquet_favor_precision.TRUE: 'True'>, 'favor_recall': <html2parquet_favor_recall.TRUE: 'True'>}
14:39:10 INFO - pipeline id pipeline_id
14:39:10 INFO - code location None
14:39:10 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/1-html2parquet
14:39:10 INFO - data factory data_ max_files -1, n_sample -1
14:39:10 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.html'], files to checkpoint ['.parquet']
14:39:10 INFO - orchestrator html2parquet started at 2025-02-25 14:39:10
14:39:10 INFO - Number of files is 87, source profile {'max_file_size': 0.3294343948364258, 'min_file_size': 0.10027503967285156, 'total_file_size': 11.12916374206543}
14:39:10 INFO - Completed 1 files (1.15%) in 0.003 min
14:39:10 INFO - Completed 2 files (2.3%) i
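
The log reports how many files the transform picked up (87 in this run). One way to cross-check that number is to count the `.html` files on disk yourself. The helper below is a sketch, not part of data-prep-kit; `count_html_files` is a hypothetical name, and searching subfolders recursively is an assumption about how the input folder is scanned:

```python
from pathlib import Path


def count_html_files(input_dir):
    """Count regular files ending in .html under input_dir, including subfolders."""
    return sum(1 for p in Path(input_dir).rglob("*.html") if p.is_file())
```

Calling `count_html_files(MY_CONFIG.INPUT_DIR)` and comparing the result with the "Number of files" log line is a quick way to spot files that were skipped because of an unexpected extension.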

## Step-4: Inspect the Output


In [4]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(MY_CONFIG.OUTPUT_DIR_HTML)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display only certain columns
# output_df[['title', 'contents', 'size']].head(5)

Output dimensions (rows x columns)=  (87, 6)


| | title | document | contents | document_id | size | date_acquired |
|---|---|---|---|---|---|---|
| 0 | thealliance_ai_working-groups-hardware-enablem... | thealliance_ai_working-groups-hardware-enablem... | [Hardware Enablement Focus Area](/focus-areas/... | 698eddd25c4e6e9f172a19ebec695247c0a72e6ec88c66... | 1553 | 2025-02-25T14:39:11.560563 |
| 1 | thealliance_ai_blog-open-source-ai-demo-night-... | thealliance_ai_blog-open-source-ai-demo-night-... | On August 8th, The AI Alliance, in collaborati... | 7802bb7e50653e6b21f571b28843fd9a4bcf5023eaab3a... | 3151 | 2025-02-25T14:39:10.915017 |
| 2 | thealliance_ai_working-groups-applications-and... | thealliance_ai_working-groups-applications-and... | [Applications and Tools Focus Area](/focus-are... | 1aaa9d752f74d7abd233abbd8688884c99ea64f575162b... | 1565 | 2025-02-25T14:39:11.518781 |
| 3 | thealliance_ai_blog-open-innovation-day-tokyo_... | thealliance_ai_blog-open-innovation-day-tokyo_... | Open innovation in AI software, algorithms, da... | 2f82c2d26c751fcb2528eb7c9273ebf3fac4d21b842787... | 1304 | 2025-02-25T14:39:10.907142 |
| 4 | thealliance_ai_blog-ai-alliance-skills-and-edu... | thealliance_ai_blog-ai-alliance-skills-and-edu... | By Rebekkah Hogan (Meta), Sowmya Kannan (IBM),... | a8c21ef29afc54923a30393be621693674a1ad23965998... | 3615 | 2025-02-25T14:39:10.734115 |
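
In row 0, `size` is 1553, which matches the content length the notebook prints for the same row. Assuming `size` is meant to be the character count of `contents` (an inference from that one row, not something the transform's documentation is quoted on here), a small check can flag rows where the two disagree. The function name and the sample rows are made up for illustration:

```python
def find_size_mismatches(rows):
    """Return the titles of rows whose 'size' differs from len(contents)."""
    return [r["title"] for r in rows if r["size"] != len(r["contents"])]


# Made-up sample rows, not taken from the real output
sample_rows = [
    {"title": "a.html", "contents": "# Hello", "size": 7},
    {"title": "b.html", "contents": "text", "size": 10},  # deliberately wrong
]
print(find_size_mismatches(sample_rows))  # prints ['b.html']
```

With the real data, the rows can be passed in as `output_df.to_dict('records')`.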


In [5]:
output_df.iloc[0]['title']

'thealliance_ai_working-groups-hardware-enablement_text.html'

In [6]:
output_df.iloc[0]['document']

'thealliance_ai_working-groups-hardware-enablement_text.html'

In [7]:
## Display the markdown text of the first document
print('content length:', len(output_df.iloc[0]['contents']), '\n')
print(output_df.iloc[0]['contents'])


content length: 1553 

[Hardware Enablement Focus Area](/focus-areas/hardware-enablement)

# Hardware Enablement Working Group

## Co-leads

- Adam Pingel (IBM)
- Amit Sangani (Meta)

## Frequently Asked Questions (FAQ)

**How can my organization join the AI Alliance as an organizational member?**Please[send a message via our contact form](https://thealliance.ai/contact). Thanks!**How can I join as an individual contributor?**Please complete the[working group application form](https://thealliance.ai/become-a-collaborator). Thanks!**How do I get access to the AI Alliance Slack?**Once your application – as an organization or individual member – has been reviewed and approved, you will be invited to the AI Alliance Slack and receive additional instructions how to join our community.**What if I have additional questions?**[Please contact us](https://thealliance.ai/contact).

## Join the Hardware Enablement Working Group

By submitting this form, you agree that the AI Alliance will collect 

In [8]:
## Uncomment to display the markdown rendered instead of as raw text
# from IPython.display import Markdown
# display(Markdown(output_df.iloc[0]['contents']))


## Step-5: Save the markdown

In [9]:
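import os

# Write each document's markdown to <base name>.md in the markdown output folder
for _, row in output_df.iterrows():
    html_file = row['document']
    base_name = os.path.splitext(os.path.basename(html_file))[0]
    md_output_file = os.path.join(MY_CONFIG.OUTPUT_DIR_MARKDOWN, base_name + '.md')

    # Explicit UTF-8 so non-ASCII characters are written the same way on every platform
    with open(md_output_file, 'w', encoding='utf-8') as md_output_file_handle:
        md_output_file_handle.write(row['contents'])
# -- end loop ---

# len(output_df) is safe even if the DataFrame is empty or its index does not start at 0
print(f"✅ Saved {len(output_df)} md files into '{MY_CONFIG.OUTPUT_DIR_MARKDOWN}'")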

✅ Saved 87 md files into 'output/2-markdown'
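
The output file name is derived from the `document` column by dropping any folder part and replacing the last extension with `.md`. The mapping can be tried in isolation; `markdown_name` is a hypothetical helper that mirrors the two `os.path` calls in the loop:

```python
import os


def markdown_name(html_file):
    """Map an HTML file name (or path) to the markdown file name used when saving."""
    base_name = os.path.splitext(os.path.basename(html_file))[0]
    return base_name + ".md"


print(markdown_name("thealliance_ai_working-groups-hardware-enablement_text.html"))
# prints thealliance_ai_working-groups-hardware-enablement_text.md
```

Only the last extension is replaced, so `page.min.html` becomes `page.min.md`. Because the folder part is discarded, two documents with the same base name in different folders would overwrite each other.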
