# Processing HTML Files

We will use the **html2parquet** transform to convert the crawled HTML files into Parquet files, with the page content rendered as Markdown.

References
- [html2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/python)

## Step-1: Data

We will process data that is downloaded using [1_crawl_site.ipynb](1_crawl_site.ipynb).

The crawled HTML files are in the `input*` directory.
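Before running the transform, it can help to confirm which HTML files are actually present. The following is a minimal sketch using only the standard library; the helper name `list_html_files` is hypothetical and not part of the notebook or of data-prep-kit.

```python
from pathlib import Path


def list_html_files(input_dir: str) -> list[str]:
    """Return the sorted relative paths of all .html files under input_dir.

    Hypothetical helper for illustration only.
    """
    root = Path(input_dir)
    # rglob walks subdirectories too, so nested crawl folders are included
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.html"))
```

Calling `list_html_files(MY_CONFIG.INPUT_DIR)` after the configuration step should list the same files the transform will later pick up, assuming `INPUT_DIR` points at the crawl output.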

## Step-2: Configuration

In [1]:
## All config is defined here
from my_config import MY_CONFIG
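The contents of `my_config.py` are not shown in this notebook. As a rough sketch, it only needs to expose an object with `INPUT_DIR` and `OUTPUT_DIR` attributes, since those are the two fields used below. The class name `MyConfig` is an assumption, and the folder values are taken from the log output of the transform run in Step-3; the real file may define more settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MyConfig:
    """Hypothetical stand-in for the object exported by my_config.py."""

    # Folder values as they appear in the transform log; adjust to your crawl
    INPUT_DIR: str = "input2/thealliance.ai/"
    OUTPUT_DIR: str = "output"


MY_CONFIG = MyConfig()
```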

## Step-3: HTML2Parquet

In [2]:
%%time 

import sys

from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration
from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils

# Local input and output folders come from the shared notebook config
local_conf = {
    "input_folder": MY_CONFIG.INPUT_DIR,
    "output_folder": MY_CONFIG.OUTPUT_DIR,
}

params = {
    # Only pick up files with these extensions from the input folder
    "data_files_to_use": ['.html', '.zip'],
    # Render the extracted page content as Markdown
    "html2parquet_output_format": "markdown",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
}

# Pass the parameters to the launcher as command-line arguments
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=Html2ParquetPythonTransformConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print("✅ Job completed successfully")
else:
    raise Exception("❌ Job failed")

14:35:21 INFO - html2parquet parameters are : {'output_format': <html2parquet_output_format.MARKDOWN: 'markdown'>, 'favor_precision': <html2parquet_favor_precision.TRUE: 'True'>, 'favor_recall': <html2parquet_favor_recall.TRUE: 'True'>}
14:35:21 INFO - pipeline id pipeline_id
14:35:21 INFO - code location None
14:35:21 INFO - data factory data_ is using local data access: input_folder - input2/thealliance.ai/ output_folder - output
14:35:21 INFO - data factory data_ max_files -1, n_sample -1
14:35:21 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.html', '.zip'], files to checkpoint ['.parquet']
14:35:21 INFO - orchestrator html2parquet started at 2024-11-13 14:35:21
14:35:21 INFO - Number of files is 18, source profile {'max_file_size': 0.2035503387451172, 'min_file_size': 0.06348037719726562, 'total_file_size': 1.7128067016601562}
14:35:21 INFO - Completed 1 files (5.56%) in 0.003 min
14:35:21 INFO - Completed 2 fil

✅ Job completed successfully
CPU times: user 1.25 s, sys: 1.04 s, total: 2.29 s
Wall time: 1.34 s
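The cell above hands its parameters to the launcher by overwriting `sys.argv`, so the launcher presumably parses them like ordinary command-line flags. The sketch below illustrates that general pattern with plain `argparse`. The function `dict_to_argv` is a hypothetical stand-in written for this example; it is not the implementation of `ParamsUtils.dict_to_req`, and the real launcher may parse its arguments differently.

```python
import argparse
import ast


def dict_to_argv(params: dict) -> list[str]:
    """Hypothetical stand-in: flatten a dict into ['prog', '--key=value', ...]."""
    argv = ["prog"]  # placeholder for the program name in argv[0]
    for key, value in params.items():
        argv.append(f"--{key}={value}")
    return argv


params = {
    "data_files_to_use": ['.html', '.zip'],
    "html2parquet_output_format": "markdown",
}
argv = dict_to_argv(params)

parser = argparse.ArgumentParser()
# The list arrives as its string form, so literal_eval turns it back into a list
parser.add_argument("--data_files_to_use", type=ast.literal_eval)
parser.add_argument("--html2parquet_output_format", type=str)
args = parser.parse_args(argv[1:])

print(args.data_files_to_use)
print(args.html2parquet_output_format)
```

Because every value travels as a string, structured values such as lists and dicts have to be serialized in a form that can be parsed back, which is the likely reason the notebook wraps `local_conf` with `ParamsUtils.convert_to_ast`.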


## Step-4: Inspect the Output


In [3]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(MY_CONFIG.OUTPUT_DIR)

print("Output dimensions (rows x columns) =", output_df.shape)

output_df.head(5)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (9, 6)


| | title | document | contents | document_id | size | date_acquired |
|---|---|---|---|---|---|---|
| 0 | about-aia.html | about-aia.html | `AI Alliance members plan to start or enhance p...` | 0458ffe75d14867d5ea3c455ab4bf24bc0b6670812209a... | 1381 | 2024-11-13T14:35:21.120730 |
| 1 | become-a-collaborator.html | become-a-collaborator.html | `# Become a collaborator\n\nWant to get involve...` | e3886adc3518259f0ecea4283b99f52a095e93525139c0... | 397 | 2024-11-13T14:35:21.338032 |
| 2 | community.html | community.html | `# Join the Community\n\nJoin leading AI innova...` | eaa326274fea96c05c270dd7527faeb070c6ca704e63ae... | 2957 | 2024-11-13T14:35:21.361282 |
| 3 | blog.html | blog.html | `![](https://images.prismic.io/ai-alliance/ZyPy...` | 9907b2cab6aa23704d0ff3a475c38f33f5f2daccf6bb5a... | 3193 | 2024-11-13T14:35:21.348011 |
| 4 | contact.html | contact.html | `# Contact us\n\nWe’re here to help and answer ...` | 31062192f12429eda320c8a6c5f1d2f725f5579be515a7... | 462 | 2024-11-13T14:35:21.372251 |
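To look at only a few columns, or to find the largest converted page, ordinary pandas indexing is enough. The sketch below builds a small stand-in frame from two rows of the output above, with values truncated exactly as displayed; in the notebook you would apply the same calls to `output_df` directly. The names `sample_df`, `subset`, and `largest` are illustrative only.

```python
import pandas as pd

# Two rows copied from the displayed output (contents truncated as shown)
sample_df = pd.DataFrame(
    {
        "title": ["become-a-collaborator.html", "contact.html"],
        "contents": [
            "# Become a collaborator\n\nWant to get involve...",
            "# Contact us\n\nWe’re here to help and answer ...",
        ],
        "size": [397, 462],
    }
)

# Select a subset of columns by passing a list of column names
subset = sample_df[["title", "size"]]
print(subset.head(5))

# Sort by size (descending) and take the first row to get the largest document
largest = sample_df.sort_values("size", ascending=False).iloc[0]
print(largest["title"])
```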
