<a href="https://colab.research.google.com/github/rjrizani/pdf_scraper/blob/main/pdf_to_csv_konstruksi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Extract 'NAMA PERUSAHAAN', 'KUALIFIKASI', and 'E-MAIL USAHA' from the PDF file "/content/kontruksi_2024_jateng.pdf" and save the extracted data into a CSV file. If the previous method failed, try an alternative approach to extract the data.

## Load pdf

### Subtask:
Load the PDF file using a library like `pdfplumber`.


**Reasoning**:
Import the `pdfplumber` library and open the PDF file.



In [None]:
import pdfplumber

pdf = pdfplumber.open("/content/kontruksi_2024_jateng.pdf")

## Extract text

### Subtask:
Extract text line by line from the PDF pages instead of extracting tables.


**Reasoning**:
Extract text line by line from each page of the PDF and store them in a list.



In [None]:
all_lines = []
for page in pdf.pages:
  text = page.extract_text()
  lines = text.splitlines()
  all_lines.extend(lines)



## Process data

### Subtask:
Iterate through the extracted text lines. Use regular expressions or pattern matching to identify and extract company names, qualifications, and email addresses from each relevant line.


**Reasoning**:
Initialize empty lists and iterate through the lines, using regular expressions to extract the required information.



In [None]:
import re

company_names = []
qualifications = []
emails = []

# Regex patterns to identify company names, qualifications, and emails
# This is a basic pattern and might need adjustment based on the actual data format
company_pattern = re.compile(r'^\d+\s+(.*)') # Starts with a number, followed by company name
qualification_pattern = re.compile(r'\b(BESAR|MENENGAH|KECIL)\b') # Looks for specific qualification keywords
email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') # Standard email pattern

for line in all_lines:
    line = line.strip() # Remove leading/trailing whitespace

    # Check if the line potentially contains company information
    if re.match(r'^\d+\s+', line): # Line starts with a number followed by a space
        company_match = company_pattern.match(line)
        if company_match:
            company_names.append(company_match.group(1).strip())
        else:
            company_names.append(None) # Append None if company name not found

        qualification_match = qualification_pattern.search(line)
        if qualification_match:
            qualifications.append(qualification_match.group(1))
        else:
            qualifications.append(None) # Append None if qualification not found

        email_match = email_pattern.search(line)
        if email_match:
            emails.append(email_match.group(0))
        else:
            emails.append(None) # Append None if email not found


**Reasoning**:
Create a DataFrame from the extracted lists and display the first few rows to verify the extraction.



In [None]:
import pandas as pd

df_extracted = pd.DataFrame({
    'NAMA PERUSAHAAN': company_names,
    'KUALIFIKASI': qualifications,
    'E-MAIL USAHA': emails
})

display(df_extracted.head())

Unnamed: 0,NAMA PERUSAHAAN,KUALIFIKASI,E-MAIL USAHA
0,Direktori Perusahaan Konstruksi Provinsi Jawa ...,,
1,tahun dan setelahnya wajib mendaftar kembali k...,,
2,"Kecil Min. 300 Juta h t s.d. 2,5 Milyar 6 dan ...",,
3,orang PJTBU dengan SKK minimal,,
4,per,,


## Create dataframe

### Subtask:
Organize the extracted information into a pandas DataFrame.


**Reasoning**:
Create a pandas DataFrame from the extracted lists and display the head.



In [None]:
import pandas as pd

df_extracted = pd.DataFrame({
    'NAMA PERUSAHAAN': company_names,
    'KUALIFIKASI': qualifications,
    'E-MAIL USAHA': emails
})

display(df_extracted.head())

Unnamed: 0,NAMA PERUSAHAAN,KUALIFIKASI,E-MAIL USAHA
0,Direktori Perusahaan Konstruksi Provinsi Jawa ...,,
1,tahun dan setelahnya wajib mendaftar kembali k...,,
2,"Kecil Min. 300 Juta h t s.d. 2,5 Milyar 6 dan ...",,
3,orang PJTBU dengan SKK minimal,,
4,per,,


## Process data

### Subtask:
Retry extracting company names, qualifications, and email addresses by iterating through the text lines and applying more robust regular expressions and pattern matching to identify and extract the required information.


**Reasoning**:
Retry extracting company names, qualifications, and email addresses by iterating through the text lines and applying more robust regular expressions and pattern matching to identify and extract the required information.



In [None]:
import re

company_names = []
qualifications = []
emails = []

# Regex patterns to identify company names, qualifications, and emails
# This is a basic pattern and might need adjustment based on the actual data format
company_pattern = re.compile(r'^\d+\s+(.*)') # Starts with a number, followed by company name
qualification_pattern = re.compile(r'\b(BESAR|MENENGAH|KECIL)\b') # Looks for specific qualification keywords
email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') # Standard email pattern


current_company_name = None
current_qualification = None
current_email = None

for i, line in enumerate(all_lines):
    line = line.strip() # Remove leading/trailing whitespace

    # Check if the line starts a new company entry
    if re.match(r'^\d+\s+', line): # Line starts with a number followed by a space
        # If we were processing a previous company, append its data
        if current_company_name is not None or current_qualification is not None or current_email is not None:
            company_names.append(current_company_name)
            qualifications.append(current_qualification)
            emails.append(current_email)

        # Reset for the new company
        current_company_name = None
        current_qualification = None
        current_email = None

        # Try to extract information from the new line
        company_match = company_pattern.match(line)
        if company_match:
            current_company_name = company_match.group(1).strip()

        qualification_match = qualification_pattern.search(line)
        if qualification_match:
            current_qualification = qualification_match.group(1)

        email_match = email_pattern.search(line)
        if email_match:
            current_email = email_match.group(0)

    else:
        # If not a new company line, try to find qualification or email in the current line
        qualification_match = qualification_pattern.search(line)
        if qualification_match:
            current_qualification = qualification_match.group(1)

        email_match = email_pattern.search(line)
        if email_match:
            current_email = email_match.group(0)


# After the loop, append the last company's data
if current_company_name is not None or current_qualification is not None or current_email is not None:
    company_names.append(current_company_name)
    qualifications.append(current_qualification)
    emails.append(current_email)

## Create dataframe

### Subtask:
Organize the extracted information into a pandas DataFrame.


**Reasoning**:
Create a pandas DataFrame named `df_extracted` using the `company_names`, `qualifications`, and `emails` lists with the specified column names and display the head.



In [None]:
import pandas as pd

df_extracted = pd.DataFrame({
    'NAMA PERUSAHAAN': company_names,
    'KUALIFIKASI': qualifications,
    'E-MAIL USAHA': emails
})

display(df_extracted.head())

Unnamed: 0,NAMA PERUSAHAAN,KUALIFIKASI,E-MAIL USAHA
0,,,
1,Direktori Perusahaan Konstruksi Provinsi Jawa ...,,
2,tahun dan setelahnya wajib mendaftar kembali k...,,
3,"Kecil Min. 300 Juta h t s.d. 2,5 Milyar 6 dan ...",,
4,orang PJTBU dengan SKK minimal,,


## Save to csv

### Subtask:
Save the DataFrame to a CSV file.


In [None]:
#save to csv
df_extracted.to_csv('extracted_data.csv', index=False)
