`tabula-py` is a Python wrapper for `tabula-java`, which allows for extracting tables from PDFs into pandas DataFrames. Below are the primary functions available in `tabula-py`:

1. **read_pdf**:
   - This is the most commonly used function in `tabula-py`.
   - It extracts tables from a PDF into a list of pandas DataFrames.
   - Parameters include `input_path` (path to the PDF file), `output_format`, `encoding`, `java_options`, `pandas_options`, `multiple_tables`, `user_agent`, `password`, and more.
   - Example usage:
     ```python
     tables = tabula.read_pdf("path_to_pdf_file.pdf", pages="all", multiple_tables=True)
     ```

2. **convert_into**:
   - Allows you to convert a PDF directly into a CSV, TSV, or JSON file without going through a DataFrame.
   - Parameters similar to `read_pdf` but with the addition of `output_path` to specify where the resulting file should be saved.
   - Example usage:
     ```python
     tabula.convert_into("path_to_pdf_file.pdf", "output.csv", output_format="csv")
     ```

3. **convert_into_by_batch**:
   - Converts all PDFs in a directory into CSV, TSV, or JSON.
   - Primarily used for batch processing.
   - Example usage:
     ```python
     tabula.convert_into_by_batch("directory_with_pdfs", output_format="csv")
     ```

4. **environment_info**:
   - Prints information about your local environment: the Python version, the Java version, the `tabula-py` version, and OS/platform details.
   - Useful for debugging.
   - Example usage:
     ```python
     tabula.environment_info()  # prints the report itself; the return value is None
     ```

Apart from these functions, there are several configuration options, parameters, and other utilities available, allowing for a wide range of control over table extraction. Always refer to the official documentation or use Python's built-in `help(tabula)` to get more detailed information on available functions and their parameters.
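
As a rough illustration of those configuration options, the sketch below assembles keyword arguments for `read_pdf`. The helper names `build_read_options` and `read_tables` are hypothetical. `lattice` and `stream` are `read_pdf` parameters that the list above does not mention, and the values chosen here are assumptions rather than recommendations.

```python
def build_read_options(pages="all", ruled_tables=False, no_header=False):
    """Assemble keyword arguments for tabula.read_pdf (hypothetical helper)."""
    options = {"pages": pages, "multiple_tables": True}
    # lattice mode targets tables drawn with ruling lines;
    # stream mode targets tables separated by whitespace.
    if ruled_tables:
        options["lattice"] = True
    else:
        options["stream"] = True
    # pandas_options is forwarded to pandas when the DataFrames are built;
    # {"header": None} asks for no row to be treated as the header.
    if no_header:
        options["pandas_options"] = {"header": None}
    return options


def read_tables(pdf_path, **kwargs):
    """Call tabula.read_pdf with the assembled options."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    return tabula.read_pdf(pdf_path, **build_read_options(**kwargs))
```

For example, `read_tables("path_to_pdf_file.pdf", ruled_tables=True)` would request lattice mode on every page. Whether lattice or stream gives better results depends on the PDF.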

In [13]:
import tabula

# Define the path to your PDF file
file_path = "Samplepdf/Material_Declaration.pdf"

# Extract tables from the PDF into a list of DataFrames
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)

# Loop through tables and save each one as CSV
for idx, table in enumerate(tables):
    table.to_csv(f"table_{idx + 1}.csv", index=False)


In [14]:
import pandas as pd
df1 = pd.read_csv('table_1.csv')

In [3]:
df1

Unnamed: 0.1,Component Mass (g),Chemical,Unnamed: 0,CAS number,Percent (%),Mass (g),Unnamed: 1,PPM
0,CORE 0.041,CuO,,1317-38-0,2.2,0.0009,,22000.0
1,FT .12 .065 .065 .8K,,,,,,,
2,0.041,Fe2O3,,1309-37-1,52.6,0.02157,,526000.0
3,0.041,NiO,,1313-99-1,33.5,0.01374,,335000.0
4,0.041,Paracyclophane,,1633-22-3,0.7,0.00029,,7000.0
5,0.041,ZnO,,1314-13-2,11.0,0.00451,,110000.0
6,Magnetic/Ferrite 0.061,Fe2O3,,1309-37-1,68.97,0.04207,,689700.0
7,0.061,Mn3O4,,1317-35-7,19.71,0.01202,,197100.0
8,0.061,ParyleneC,,28804-46-8,1.47,0.0009,,14700.0
9,0.061,ZnO,,1314-13-2,9.85,0.00601,,98500.0
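
The `Unnamed` columns in this output contain no values at all. A minimal cleanup sketch follows, assuming pandas is installed and that dropping entirely empty columns is acceptable for this table. The function name `drop_empty_columns` is hypothetical, and the sample rows are adapted from the output above with the index column omitted.

```python
import io

import pandas as pd

# Three rows adapted from the extracted table above (index column omitted).
SAMPLE = """Component Mass (g),Chemical,Unnamed: 0,CAS number,Percent (%),Mass (g),Unnamed: 1,PPM
CORE 0.041,CuO,,1317-38-0,2.2,0.0009,,22000.0
FT .12 .065 .065 .8K,,,,,,,
0.041,Fe2O3,,1309-37-1,52.6,0.02157,,526000.0
"""


def drop_empty_columns(df):
    """Return a copy of df without columns whose values are all missing."""
    return df.dropna(axis=1, how="all")


df = pd.read_csv(io.StringIO(SAMPLE))
cleaned = drop_empty_columns(df)
print(list(cleaned.columns))
```

Rows are left untouched here, so the label-only row (`FT .12 .065 .065 .8K`) survives. Whether it should be merged into a neighbouring row is a judgment call that depends on the source PDF.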


In [15]:
import tabula

# Define the path to your PDF file
file_path = "/Users/bonnieao/Desktop/Resources/MSiA-Capstone-Cisco/Samplepdf/C0GNP0-Dielectric.pdf"

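# Define the area to consider for extraction, in points: [top, left, bottom, right]
# To ignore the first 100 points from the top, top is 100 and left is 0
# Note: area is not passed to read_pdf below, so it does not affect the output
area = [[100, 0, 10000, 10000]]  # Use large numbers to represent the bottom and right boundaries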


# Extract tables from the PDF into a list of DataFrames
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)

# Loop through tables and save each one as CSV
for idx, table in enumerate(tables):
    table.to_csv(f"table1_{idx + 1}.csv", index=False)
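
The `area` list defined in the cell above is never handed to `read_pdf`. A hedged sketch of how it could be wired in follows, assuming tabula-py's `[top, left, bottom, right]` order in points and a US Letter page of 612 x 792 points (an assumption; the page size of this PDF is not stated). `area_below` and `read_below_header` are hypothetical names.

```python
def area_below(skip_top, page_width=612.0, page_height=792.0):
    """Return [top, left, bottom, right] covering the page below skip_top points."""
    if not 0 <= skip_top < page_height:
        raise ValueError("skip_top must lie within the page height")
    return [skip_top, 0, page_height, page_width]


def read_below_header(pdf_path, skip_top=100):
    """Extract tables while ignoring the top skip_top points of each page."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    return tabula.read_pdf(
        pdf_path,
        pages="all",
        multiple_tables=True,
        area=area_below(skip_top),
    )
```

Passing an explicit page size keeps the rectangle inside the page instead of relying on oversized bottom and right values.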


In [17]:
import pandas as pd
df1 = pd.read_csv('table1_1.csv')
df1

Unnamed: 0,Parameter/Test,NP0 Specification Limits,Measuring Conditions
0,Operating Temperature Range,-55oC to +125oC,Temperature Cycle Chamber
1,Capacitance,Within specified tolerance,Freq.: 1.0 MHz ± 10% for cap ≤ 1000 pF
2,Q,,
3,,<30 pF: Q≥ 400+20 x Cap Value,1.0 kHz ± 10% for cap > 1000 pF
4,,≥30 pF: Q≥ 1000,Voltage: 1.0Vrms ± .2V
...,...,...,...
56,Humidity,<10 pF: Q≥ 200 +10C,
57,Insulation,,Remove from chamber and stabilize at room
58,Resistance,≥ Initial Value x 0.3 (See Above),temperature for 24 ± 2 hours before measuring.
59,Dielectric,,


In [18]:
import tabula

# Define the path to your PDF file
file_path = "Samplepdf/D58V0M4U8MR.pdf"

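# Define the area to consider for extraction, in points: [top, left, bottom, right]
# To ignore the first 100 points from the top, top is 100 and left is 0
# Note: area is not passed to read_pdf below, so it does not affect the output
area = [[100, 0, 10000, 10000]]  # Use large numbers to represent the bottom and right boundaries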


# Extract tables from the PDF into a list of DataFrames
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)

# Loop through tables and save each one as CSV
for idx, table in enumerate(tables):
    table.to_csv(f"table2_{idx + 1}.csv", index=False)
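
The cells in this notebook distinguish their outputs only by prefix (`table_`, `table1_`, `table2_`), which makes it easy to read back a file that came from a different PDF than intended. One possible alternative, sketched under the assumption that the PDF's file name is a good enough label, derives each CSV name from the PDF stem. `csv_name_for` and `save_tables` are hypothetical helpers.

```python
from pathlib import Path


def csv_name_for(pdf_path, table_index):
    """Build a CSV file name such as 'D58V0M4U8MR_table_2.csv'."""
    return f"{Path(pdf_path).stem}_table_{table_index}.csv"


def save_tables(tables, pdf_path, out_dir="."):
    """Write each DataFrame in tables to its own CSV and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for idx, table in enumerate(tables, start=1):
        target = out_dir / csv_name_for(pdf_path, idx)
        table.to_csv(target, index=False)
        written.append(target)
    return written
```

With this naming, `save_tables(tables, file_path)` writes files whose names identify the source PDF, so a later `pd.read_csv` call shows which document the table came from.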


In [19]:
import pandas as pd
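# Note: table1_2.csv was written by the C0GNP0-Dielectric.pdf cell (In [15]);
# the D58V0M4U8MR.pdf cell above writes table2_*.csv instead
df1 = pd.read_csv('table1_2.csv')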
df1

Unnamed: 0.1,SIZE,0101*,0201,0402,Unnamed: 0,0603,Unnamed: 1,Unnamed: 2,Unnamed: 3,0805,Unnamed: 4,Unnamed: 5,Unnamed: 6,1206,Unnamed: 7,Unnamed: 8
0,Soldering,Reflow Only,Reflow Only,Reflow/Wave,,Reflow/Wave,,,,Reflow/Wave,,,,Reflow/Wave,,
1,Packaging,All Paper,All Paper,All Paper,,All Paper,,,,Paper/Embossed,,,,Paper/Embossed,,
2,mm(L) Length,0.40 ± 0.02,0.60 ± 0.03,1.00 ± 0.10,,1.60 ± 0.15,,,,2.01 ± 0.20,,,,3.20 ± 0.20,,
3,(in.),(0.016 ± 0.0008),(0.024 ± 0.001),(0.040 ± 0.004),,(0.063 ± 0.006),,,,(0.079 ± 0.008),,,,(0.126 ± 0.008),,
4,mmW) Width,0.20 ± 0.02,0.30 ± 0.03,0.50 ± 0.10,,0.81 ± 0.15,,,,1.25 ± 0.20,,,,1.60 ± 0.20,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68,0.068,,,,,,,,,,,X,X,X,,
69,0.082,,,,,,,,,,,,,,,
70,0.1,,,,,,,,,,,X,X,X,,
71,WVDC,16,25 50,16 25 50,16,25 50 100,200,16,25,50 100 200,250,16,25,50 100 200,250,500
