# DPK Example: Detect SPAM and Low Quality Documents

 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/data-prep-kit/doc_quality_1.ipynb)

This notebook illustrates how to use Data Prep Kit's [Document Quality feature](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/doc_quality) to detect spam and low-quality documents.

 References and credits:

 - https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/doc_quality

  

## Step-1: Figure out Runtime Environment

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [20]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab
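
The check above relies on the `COLAB_RELEASE_TAG` environment variable. As a minimal sketch, the same test can be wrapped in a small function so it is easy to reuse and to try out without Colab. The function name `is_running_in_colab` and its `env` parameter are illustrative and not part of Data Prep Kit.

```python
import os
from typing import Mapping, Optional


def is_running_in_colab(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when COLAB_RELEASE_TAG is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("COLAB_RELEASE_TAG"))


# Passing a plain dict makes the behaviour easy to check outside Colab.
print(is_running_in_colab({"COLAB_RELEASE_TAG": "some-release-tag"}))  # True
print(is_running_in_colab({}))  # False
```

Calling `is_running_in_colab()` with no argument reads the real environment, which matches the `os.getenv` check used in the cell above.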


### 1.2 - Install dependencies if running on Google Colab

In [21]:
## Download any code files we may need

if RUNNING_IN_COLAB:
    !wget -O 'file_utils.py'   'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data-prep-kit/file_utils.py'

In [22]:
# %%capture
# %%time

import os

if RUNNING_IN_COLAB:
  ## setup a sandbox env to avoid conflicts with colab libraries
  !pip install -q condacolab
  import condacolab
  condacolab.install()

  !conda create -n my_env python=3.11 -y
  !conda activate my_env
  !pip install  --default-timeout=100  \
        'data-prep-toolkit-transforms[pdf2parquet, doc_quality]'==1.1.0 \
        humanfriendly
  ## terminate the current kernel, so we restart the runtime
  os.kill(os.getpid(), 9)
  ## restart the session

### 1.3 - Restart Runtime

After installing dependencies, <font color="red">restart the runtime</font> so the newly installed libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the notebook)

## Step-2: Configuration & Utils

### 2.1 - Basic Config

In [23]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 2.2 - Set up input/output directories

In [24]:
## setup path to utils folder
import sys
sys.path.append('../utils')

In [25]:
# Use a Hugging Face mirror. If https://huggingface.co/ is reachable from your network, comment out the HF_ENDPOINT line below.
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

In [26]:
import os, sys
import shutil

if RUNNING_IN_COLAB:
    input_dir = "input/"
    os.makedirs(input_dir, exist_ok=True)
else:
    input_dir = "../data/doc-quality/"

output_dir = "output"
output_pdf2pq_dir = os.path.join (output_dir, '01_pdf2pq_out')
output_docquality_dir = os.path.join (output_dir, '02_docq_out')
output_md_dir = os.path.join (output_dir, "03_md")


## clear output folder
shutil.rmtree(output_dir, ignore_errors=True)
os.makedirs(output_dir, exist_ok=True)
os.makedirs(output_md_dir, exist_ok=True)
print ("✅ Cleared output directory")

✅ Cleared output directory
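
The cell above clears the output tree and recreates it. A minimal sketch of the same idea as a reusable function, using `pathlib`, is given here. The helper name `reset_output_dirs` is hypothetical; the subfolder names are the ones used in this notebook. The demo runs inside a temporary directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def reset_output_dirs(base_dir, subdirs):
    """Delete base_dir if it exists, then recreate it with the given subfolders."""
    base = Path(base_dir)
    shutil.rmtree(base, ignore_errors=True)  # start from a clean slate
    paths = {}
    for name in subdirs:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


# Demo inside a temporary directory so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    dirs = reset_output_dirs(Path(tmp) / "output",
                             ["01_pdf2pq_out", "02_docq_out", "03_md"])
    print(sorted(p.name for p in dirs.values()))
```

Creating the two transform folders up front is optional here: the cell above only creates `output` and `03_md`.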


## Step-3: Inspect the Data

Sample data files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/doc-quality/)

### 3.1 - Download Data

In [27]:
from file_utils import download_file

if RUNNING_IN_COLAB:
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/doc-quality/earth.pdf', os.path.join(input_dir, 'earth.pdf'))
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/doc-quality/spam.pdf', os.path.join(input_dir, 'spam.pdf'))
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/doc-quality/lorem-ipsum.pdf', os.path.join(input_dir, 'lorem-ipsum.pdf'))
else:
    print ('input : ', input_dir)

input :  ../data/doc-quality/
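
`download_file` comes from `file_utils.py`, whose source is not shown in this notebook. As a rough sketch of what such a helper might look like (an assumption, not the actual implementation), the function below fetches a URL with the standard library and skips the download when the target file already exists. The name `download_file_sketch` and the `overwrite` parameter are hypothetical.

```python
import os
import urllib.request


def download_file_sketch(url, local_path, overwrite=False):
    """Fetch url into local_path; return True if a download happened."""
    if os.path.exists(local_path) and not overwrite:
        print(f"Skipping, already exists: {local_path}")
        return False
    parent = os.path.dirname(local_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # urlretrieve copies the response body to the given file path.
    urllib.request.urlretrieve(url, local_path)
    print(f"Downloaded {url} --> {local_path}")
    return True
```

It would be called the same way as the helper above, for example `download_file_sketch(url, os.path.join(input_dir, 'earth.pdf'))`.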


## Step-4: Extract Data from PDF (pdf2parquet)

In this step, we read the PDF files and extract their text.

[Pdf2Parquet documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/README.md)

We use the [Docling package](https://github.com/DS4SD/docling).

### 4.1 - Execute


In [28]:
%%time

from dpk_pdf2parquet.transform_python import Pdf2Parquet
from dpk_pdf2parquet.transform import pdf2parquet_contents_types

print (f"🏃🏼 Processing input='{input_dir}' --> output='{output_pdf2pq_dir}'\n", flush=True)

result = Pdf2Parquet(input_folder= input_dir,
                    output_folder= output_pdf2pq_dir,
                    data_files_to_use=['.pdf'],
                    pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN,   # markdown
                    ).transform()

if result == 0:
    print ("✅ Operation completed successfully")
else:
    raise Exception ("❌ Operation failed")

🏃🏼 Processing input='../data/doc-quality/' --> output='output/01_pdf2pq_out'



13:15:09 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
13:15:09 INFO - pipeline id pipeline_id
13:15:09 INFO - code location None
13:15:09 INFO - data factory data_ is using local data access: input_folder - ../data/doc-quality/ output_folder - output/01_pdf2pq_out
13:15:09 INFO - data factory data_ max_files -1, n_sample -1
13:15:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
13:15:09 INFO - orchestrator pdf2parquet started at 2025-03-26 13:15:09
13:15:09 INFO - Number of files is 3, source profile {'max_file_size': 0.055823326110839844, 'min_file_si

✅ Operation completed successfully
CPU times: user 3.44 s, sys: 0 ns, total: 3.44 s
Wall time: 4.52 s


### 4.2 - Inspect Generated output


In [29]:
from file_utils import read_parquet_files_as_df

print ("Displaying contents of : ", output_pdf2pq_dir)
output_df = read_parquet_files_as_df(output_pdf2pq_dir)
output_df


Displaying contents of :  output/01_pdf2pq_out
Successfully read 3 parquet files with 3 total rows


| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lorem-ipsum.pdf | Lorem ipsum\n\nLorem ipsum\n\nLorem ipsum | 1 | 0 | 4 | 4272a5ab-2bb1-4285-822e-4af57ef3c23b | 50660627009040522 | pdf | b3c09cf4b822dda90dbfee0c4b358dec8ecfec78df91c9... | 37 | 2025-03-26T13:15:14.025864 | 0.157193 | lorem-ipsum.pdf |
| 1 | spam.pdf | Free xxx | 1 | 0 | 2 | c5710d45-e4cb-45b7-b847-293ad3707d4c | 5577338085393325113 | pdf | 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... | 8 | 2025-03-26T13:15:14.188099 | 0.160242 | spam.pdf |
| 2 | earth.pdf | ## Earth\n\n## Solar System\n\nOur solar syste... | 1 | 0 | 11 | 99cf439b-7ecb-4127-bdcc-1b231c23d600 | 17915699055171962696 | pdf | 3766e7a7dfb15354f2a8c77e43db4cfa40d4627f921126... | 611 | 2025-03-26T13:15:13.864851 | 0.19867 | earth.pdf |
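
The `contents` column is truncated in the table view. To read one document's full text, one option is a small lookup helper like the sketch here. The name `get_contents` is hypothetical, and `demo_df` is a tiny stand-in for `output_df` built from values in the table above.

```python
import pandas as pd


def get_contents(df, filename):
    """Return the 'contents' value of the first row whose 'filename' matches."""
    matches = df.loc[df["filename"] == filename, "contents"]
    if matches.empty:
        raise KeyError(f"No row with filename={filename!r}")
    return matches.iloc[0]


# Tiny stand-in for output_df, using values shown in the table above.
demo_df = pd.DataFrame({
    "filename": ["lorem-ipsum.pdf", "spam.pdf"],
    "contents": ["Lorem ipsum\n\nLorem ipsum\n\nLorem ipsum", "Free xxx"],
})
print(get_contents(demo_df, "spam.pdf"))
```

Raising `KeyError` for an unknown filename gives a clearer message than the `IndexError` that `.iloc[0]` would produce on an empty selection.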


In [30]:
print (output_df[output_df['filename'] == 'spam.pdf'].iloc[0,]['contents'])

Free xxx


In [31]:
print (output_df[output_df['filename'] == 'lorem-ipsum.pdf'].iloc[0,]['contents'])

Lorem ipsum

Lorem ipsum

Lorem ipsum


In [32]:
print (output_df[output_df['filename'] == 'earth.pdf'].iloc[0,]['contents'])

## Earth

## Solar System

Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.

For more details about our Solar system see Chapter 1.

## Earth

Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.

Basic facts about Earth:

- · Distance from the Sun: Average of 149.6 million kilometers (93 million miles)
- • Moons: One moon, called Luna or simply 'the Moon'.
- • Rotation Period: 24 hours (one day)


## Step-5: Document Quality

This transform scores documents across many quality metrics.

Here we focus on the 'bad words' metric.

[Document quality documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality)

By default, it uses this [bad words collection](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality/dpk_doc_quality/ldnoobw). You can supply a custom file by passing the argument `docq_bad_word_filepath=/path/to/badwords_file`.
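
To make the 'bad words' idea concrete, here is a simplified illustration of checking a document against a word list. It is not the transform's actual implementation: the function names are hypothetical, the word-list file is assumed to hold one term per line, and matching is done on whole lower-cased words only.

```python
import re
import tempfile
from pathlib import Path


def load_bad_words(path):
    """Read one term per line, ignoring blank lines; terms are lower-cased."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def contains_bad_word(text, bad_words):
    """Return True if any whole word in text appears in bad_words."""
    words = re.findall(r"\w+", text.lower())
    return any(word in bad_words for word in words)


# Demo with a throwaway word list (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    words_file = Path(tmp) / "badwords.txt"
    words_file.write_text("xxx\nspamword\n", encoding="utf-8")
    bad_words = load_bad_words(words_file)

print(contains_bad_word("Free xxx", bad_words))
print(contains_bad_word("Earth is the third planet from the Sun.", bad_words))
```

The first call prints `True` and the second prints `False`. A custom list for the transform would presumably follow a similar layout, but check the default collection linked above before relying on that.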

### 5.1 - Execute

In [33]:
%%time

from dpk_doc_quality.transform_python import DocQuality

STAGE = 5
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_pdf2pq_dir}' --> output='{output_docquality_dir}'\n", flush=True)

result = DocQuality(input_folder=output_pdf2pq_dir,
                    output_folder= output_docquality_dir,
                    docq_text_lang = "en",
                    docq_doc_content_column ="contents",
                    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed (result={result})")

🏃🏼 STAGE-5: Processing input='output/01_pdf2pq_out' --> output='output/02_docq_out'



13:15:14 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': '/home/sujee/apps/anaconda3/envs/dpk-3-pdf-processing-r1.1.0-py3.11/lib/python3.11/site-packages/dpk_doc_quality/ldnoobw/en', 's3_cred': None, 'docq_data_factory': <data_processing.data_access.data_access_factory.DataAccessFactory object at 0x74e4a5c9a590>}
13:15:14 INFO - data factory docq_ is using local configuration without input/output path
13:15:14 INFO - data factory docq_ max_files -1, n_sample -1
13:15:14 INFO - data factory docq_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:15:14 INFO - pipeline id pipeline_id
13:15:14 INFO - code location None
13:15:14 INFO - data factory data_ is using local data access: input_folder - output/01_pdf2pq_out output_folder - output/02_docq_out
13:15:14 INFO - data factory data_ max_files -1, n_sample -1
13:15:14 INFO - data factory da

✅ Stage:5 completed successfully
CPU times: user 27.9 ms, sys: 0 ns, total: 27.9 ms
Wall time: 23.3 ms


### 5.2 - Inspect the Output

The output has several new columns whose names start with **docq_**.

The column **docq_contain_bad_word** flags documents that contain 'bad words'.

The column **docq_lorem_ipsum_ratio** is greater than zero for documents that contain 'lorem ipsum' placeholder text.

For more information, see the [Doc Quality documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality).

In [34]:
from file_utils import read_parquet_files_as_df
output_df = read_parquet_files_as_df(output_docquality_dir)
print ("Displaying contents of : ", output_docquality_dir)
output_df.head()

Successfully read 3 parquet files with 3 total rows
Displaying contents of :  output/02_docq_out


| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | ... | docq_mean_word_len | docq_symbol_to_word_ratio | docq_sentence_count | docq_lorem_ipsum_ratio | docq_curly_bracket_ratio | docq_contain_bad_word | docq_bullet_point_ratio | docq_ellipsis_line_ratio | docq_alphabet_word_ratio | docq_contain_common_en_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lorem-ipsum.pdf | Lorem ipsum\n\nLorem ipsum\n\nLorem ipsum | 1 | 0 | 4 | 4272a5ab-2bb1-4285-822e-4af57ef3c23b | 50660627009040522 | pdf | b3c09cf4b822dda90dbfee0c4b358dec8ecfec78df91c9... | 37 | ... | 5.0 | 0.0 | 1 | 0.085714 | 0.0 | False | 0.0 | 0.0 | 1.0 | False |
| 1 | spam.pdf | Free xxx | 1 | 0 | 2 | c5710d45-e4cb-45b7-b847-293ad3707d4c | 5577338085393325113 | pdf | 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... | 8 | ... | 3.5 | 0.0 | 1 | 0.0 | 0.0 | True | 0.0 | 0.0 | 1.0 | False |
| 2 | earth.pdf | ## Earth\n\n## Solar System\n\nOur solar syste... | 1 | 0 | 11 | 99cf439b-7ecb-4127-bdcc-1b231c23d600 | 17915699055171962696 | pdf | 3766e7a7dfb15354f2a8c77e43db4cfa40d4627f921126... | 611 | ... | 4.550459 | 0.027523 | 9 | 0.0 | 0.0 | False | 0.176471 | 0.0 | 0.880734 | True |
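
Because the table is wide, it can help to view only the quality columns. The sketch here selects every column whose name starts with `docq_`, plus `filename` for context. `demo_df` is a stand-in for `output_df` that carries just a few columns, with values copied from the output above.

```python
import pandas as pd

# Stand-in for output_df with a few of the columns shown above.
demo_df = pd.DataFrame({
    "filename": ["lorem-ipsum.pdf", "spam.pdf", "earth.pdf"],
    "num_pages": [1, 1, 1],
    "docq_contain_bad_word": [False, True, False],
    "docq_lorem_ipsum_ratio": [0.085714, 0.0, 0.0],
})

# Keep 'filename' for context plus every column whose name starts with 'docq_'.
docq_columns = [col for col in demo_df.columns if col.startswith("docq_")]
quality_view = demo_df[["filename"] + docq_columns]
print(quality_view)
```

The same two lines should work on the real `output_df`, since they only depend on the `docq_` naming prefix.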


## Step-6: Filter 'quality' documents

From the output above, **spam.pdf** is flagged for containing bad words (**docq_contain_bad_word=True**).

**lorem-ipsum.pdf** is flagged for the placeholder text **lorem ipsum** (**docq_lorem_ipsum_ratio > 0**).

We filter both of them out.

In [35]:
all_docs_df = read_parquet_files_as_df(output_docquality_dir)

# remove documents that contain bad words
clean_docs_df = all_docs_df[~all_docs_df['docq_contain_bad_word']]

# also filter out documents with 'lorem ipsum' text
clean_docs_df = clean_docs_df[clean_docs_df['docq_lorem_ipsum_ratio'] == 0]

clean_docs_df.head(10)

Successfully read 3 parquet files with 3 total rows


| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | ... | docq_mean_word_len | docq_symbol_to_word_ratio | docq_sentence_count | docq_lorem_ipsum_ratio | docq_curly_bracket_ratio | docq_contain_bad_word | docq_bullet_point_ratio | docq_ellipsis_line_ratio | docq_alphabet_word_ratio | docq_contain_common_en_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | earth.pdf | ## Earth\n\n## Solar System\n\nOur solar syste... | 1 | 0 | 11 | 99cf439b-7ecb-4127-bdcc-1b231c23d600 | 17915699055171962696 | pdf | 3766e7a7dfb15354f2a8c77e43db4cfa40d4627f921126... | 611 | ... | 4.550459 | 0.027523 | 9 | 0.0 | 0.0 | False | 0.176471 | 0.0 | 0.880734 | True |
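
The two filters can also be expressed as named boolean masks, which makes it easy to list the rejected documents next to the kept ones. This is a sketch on a small stand-in DataFrame (`demo_df`, with values taken from the Step-5 output); the variable names are illustrative.

```python
import pandas as pd

# Stand-in for all_docs_df, using values from the Step-5 output.
demo_df = pd.DataFrame({
    "filename": ["lorem-ipsum.pdf", "spam.pdf", "earth.pdf"],
    "docq_contain_bad_word": [False, True, False],
    "docq_lorem_ipsum_ratio": [0.085714, 0.0, 0.0],
})

has_bad_word = demo_df["docq_contain_bad_word"]
has_lorem_ipsum = demo_df["docq_lorem_ipsum_ratio"] > 0

# A document is kept only if it trips neither check.
clean_df = demo_df[~has_bad_word & ~has_lorem_ipsum]
rejected_df = demo_df[has_bad_word | has_lorem_ipsum]

print("kept    :", clean_df["filename"].tolist())
print("rejected:", rejected_df["filename"].tolist())
```

On this sample, only `earth.pdf` is kept, which agrees with the filtered table above.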


## Step-7: Save as MD

In [36]:
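for _, row in clean_docs_df.iterrows():
    output_file_name = os.path.join (output_md_dir, row['filename'] + '.md')
    # write as UTF-8 so non-ASCII characters (e.g. bullet symbols) are saved correctly on any platform
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        output_file.write(row['contents'])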

print (f"✅ Saved CLEAN markdown output to '{output_md_dir}'")

✅ Saved CLEAN markdown output to 'output/03_md'
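
The loop above names each file by appending `.md` to the original filename (for example `earth.pdf.md`). If a name such as `earth.md` is preferred, a variation like the sketch here could be used. The helper `save_markdown` is hypothetical, and it takes plain dictionaries so the demo runs without the earlier cells.

```python
import tempfile
from pathlib import Path


def save_markdown(records, out_dir):
    """Write each record's 'contents' to <out_dir>/<stem of filename>.md."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        target = out_path / (Path(record["filename"]).stem + ".md")
        target.write_text(record["contents"], encoding="utf-8")
        written.append(target)
    return written


# Demo with a made-up record, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    files = save_markdown(
        [{"filename": "earth.pdf", "contents": "## Earth\n"}], tmp)
    print([f.name for f in files])
```

With the DataFrame from Step-6, the call would look like `save_markdown(clean_docs_df.to_dict("records"), output_md_dir)`.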
