# tabula-py example notebook

tabula-py is a tool for converting PDF tables to pandas DataFrames. It is a wrapper of [tabula-java](https://github.com/tabulapdf/tabula-java), so it requires Java on your machine. tabula-py can also convert tables in a PDF into CSV/TSV files.

tabula-py's extraction accuracy is the same as that of tabula-java and the [tabula app](https://tabula.technology/), the GUI tool for tabula. To see how well tabula-py will handle your PDF, I highly recommend trying the tabula app first.

tabula-py is good for:
- automation with Python scripts
- advanced analytics after converting tables to pandas DataFrames
- casual analytics with Jupyter Notebook or Google Colaboratory


In [1]:
!java -version

java version "11.0.9" 2020-10-20 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.9+7-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.9+7-LTS, mixed mode)
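
The same check can be made from plain Python, which is handy in scripts where the `!` shell escape is not available. This is a minimal sketch using only the standard library; `find_java` and `java_version_text` are hypothetical helper names, not part of tabula-py.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_text(java_path):
    """Run `java -version` and return its text (Java prints it on stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; install a Java runtime before using tabula-py")
else:
    print(java_version_text(java_path))
```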


In [2]:
import tabula

tabula.environment_info()

Python version:
    3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0]
Java version:
    java version "11.0.9" 2020-10-20 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.9+7-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.9+7-LTS, mixed mode)
tabula-py version: 2.2.0
platform: Linux-3.10.0-1127.19.1.el7.x86_64-x86_64-with-centos-7.8.2003-Core
uname:
    uname_result(system='Linux', node='woolsey', release='3.10.0-1127.19.1.el7.x86_64', version='#1 SMP Tue Aug 25 17:23:54 UTC 2020', machine='x86_64', processor='x86_64')
linux_distribution: ('CentOS Linux', '7', 'Core')
mac_ver: ('', ('', '', ''), '')
    


## Read a PDF with the `read_pdf()` function

Let's read a PDF from GitHub. With the `read_pdf()` function, tabula-py can load a PDF from a path, a URL, or a file-like object.

In [8]:
import tabula
pdf_path = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"

dfs = tabula.read_pdf(pdf_path, stream=True)
# read_pdf returns list of DataFrames
print(len(dfs))
dfs[0]

'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Nov 05, 2020 6:50:38 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:50:38 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



1


Unnamed: 0.1,Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
7,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
8,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
9,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


## Options for `read_pdf()`

Note that `read_pdf()` reads only page 1 by default. For more details, use `?tabula.read_pdf` and `?tabula.io.build_options`.

Help on function read_pdf in module tabula.io:

read_pdf(input_path, output_format=None, encoding='utf-8', java_options=None, pandas_options=None, multiple_tables=True, user_agent=None, **kwargs)
    Read tables in PDF.
    
    Args:
        input_path (str, path object or file-like object):
            File like object of tareget PDF file.
            It can be URL, which is downloaded by tabula-py automatically.
        output_format (str, optional):
            Output format for returned object (``dataframe`` or ``json``)
        encoding (str, optional):
            Encoding type for pandas. Default: ``utf-8``
        java_options (list, optional):
            Set java options.
    
            Example:
                ``["-Xmx256m"]``
        pandas_options (dict, optional):
            Set pandas options.
    
            Example:
                ``{'header': None}``
    
            Note:
                With ``multiple_tables=True`` (default), pandas_options is passed
                to pandas.DataFrame, otherwise it is passed to pandas.read_csv.
                Those two functions are different for accept options like ``dtype``.
        multiple_tables (bool):
            It enables to handle multiple tables within a page. Default: ``True``
    
            Note:
                If `multiple_tables` option is enabled, tabula-py uses not
                :func:`pd.read_csv()`, but :func:`pd.DataFrame()`. Make
                sure to pass appropriate `pandas_options`.
        user_agent (str, optional):
            Set a custom user-agent when download a pdf from a url. Otherwise
            it uses the default ``urllib.request`` user-agent.
        kwargs:
            Dictionary of option for tabula-java. Details are shown in
            :func:`build_options()`
    
    Returns:
        list of DataFrames or dict.
    
    Raises:
        FileNotFoundError:
            If downloaded remote file doesn't exist.
    
        ValueError:
            If output_format is unknown format, or if downloaded remote file size is 0.
    
        tabula.errors.CSVParseError:
            If pandas CSV parsing failed.
    
        tabula.errors.JavaNotFoundError:
            If java is not installed or found.
    
        subprocess.CalledProcessError:
            If tabula-java execution failed.
    
    
    Examples:
    
        Here is a simple example.
        Note that :func:`read_pdf()` only extract page 1 by default.
    
        Notes:
            As of tabula-py 2.0.0, :func:`read_pdf()` sets `multiple_tables=True` by
            default. If you want to get consistent output with previous version, set
            `multiple_tables=False`.
    
        >>> import tabula
        >>> pdf_path = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
        >>> tabula.read_pdf(pdf_path, stream=True)
        [             Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
        0             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
        1         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
        2            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
        3        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
        4     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
        5               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1
        6            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0     3     4
        7             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
        8              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
        9              Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0     4     4
        10            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0     4     4
        11           Merc 450SE  16.4    8  275.8  180  3.07  4.070  17.40   0   0     3     3
        12           Merc 450SL  17.3    8  275.8  180  3.07  3.730  17.60   0   0     3     3
        13          Merc 450SLC  15.2    8  275.8  180  3.07  3.780  18.00   0   0     3     3
        14   Cadillac Fleetwood  10.4    8  472.0  205  2.93  5.250  17.98   0   0     3     4
        15  Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3     4
        16    Chrysler Imperial  14.7    8  440.0  230  3.23  5.345  17.42   0   0     3     4
        17             Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
        18          Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
        19       Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4     1
        20        Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
        21     Dodge Challenger  15.5    8  318.0  150  2.76  3.520  16.87   0   0     3     2
        22          AMC Javelin  15.2    8  304.0  150  3.15  3.435  17.30   0   0     3     2
        23           Camaro Z28  13.3    8  350.0  245  3.73  3.840  15.41   0   0     3     4
        24     Pontiac Firebird  19.2    8  400.0  175  3.08  3.845  17.05   0   0     3     2
        25            Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
        26        Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
        27         Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
        28       Ford Pantera L  15.8    8  351.0  264  4.22  3.170  14.50   0   1     5     4
        29         Ferrari Dino  19.7    6  145.0  175  3.62  2.770  15.50   0   1     5     6
        30        Maserati Bora  15.0    8  301.0  335  3.54  3.570  14.60   0   1     5     8
        31           Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2]
    
        If you want to extract all pages, set ``pages="all"``.
    
        >>> dfs = tabula.read_pdf(pdf_path, pages="all")
        >>> len(dfs)
        4
        >>> dfs
        [       0    1      2    3     4      5      6   7   8     9
        0    mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear
        1   21.0    6  160.0  110  3.90  2.620  16.46   0   1     4
        2   21.0    6  160.0  110  3.90  2.875  17.02   0   1     4
        3   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4
        4   21.4    6  258.0  110  3.08  3.215  19.44   1   0     3
        5   18.7    8  360.0  175  3.15  3.440  17.02   0   0     3
        6   18.1    6  225.0  105  2.76  3.460  20.22   1   0     3
        7   14.3    8  360.0  245  3.21  3.570  15.84   0   0     3
        8   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4
        9   22.8    4  140.8   95  3.92  3.150  22.90   1   0     4
        10  19.2    6  167.6  123  3.92  3.440  18.30   1   0     4
        11  17.8    6  167.6  123  3.92  3.440  18.90   1   0     4
        12  16.4    8  275.8  180  3.07  4.070  17.40   0   0     3
        13  17.3    8  275.8  180  3.07  3.730  17.60   0   0     3
        14  15.2    8  275.8  180  3.07  3.780  18.00   0   0     3
        15  10.4    8  472.0  205  2.93  5.250  17.98   0   0     3
        16  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3
        17  14.7    8  440.0  230  3.23  5.345  17.42   0   0     3
        18  32.4    4   78.7   66  4.08  2.200  19.47   1   1     4
        19  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4
        20  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4
        21  21.5    4  120.1   97  3.70  2.465  20.01   1   0     3
        22  15.5    8  318.0  150  2.76  3.520  16.87   0   0     3
        23  15.2    8  304.0  150  3.15  3.435  17.30   0   0     3
        24  13.3    8  350.0  245  3.73  3.840  15.41   0   0     3
        25  19.2    8  400.0  175  3.08  3.845  17.05   0   0     3
        26  27.3    4   79.0   66  4.08  1.935  18.90   1   1     4
        27  26.0    4  120.3   91  4.43  2.140  16.70   0   1     5
        28  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5
        29  15.8    8  351.0  264  4.22  3.170  14.50   0   1     5
        30  19.7    6  145.0  175  3.62  2.770  15.50   0   1     5
        31  15.0    8  301.0  335  3.54  3.570  14.60   0   1     5,               0            1             2            3        4
        0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
        1           5.1          3.5           1.4          0.2   setosa
        2           4.9          3.0           1.4          0.2   setosa
        3           4.7          3.2           1.3          0.2   setosa
        4           4.6          3.1           1.5          0.2   setosa
        5           5.0          3.6           1.4          0.2   setosa
        6           5.4          3.9           1.7          0.4   setosa,      0             1            2             3            4          5
        0  NaN  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
        1  145           6.7          3.3           5.7          2.5  virginica
        2  146           6.7          3.0           5.2          2.3  virginica
        3  147           6.3          2.5           5.0          1.9  virginica
        4  148           6.5          3.0           5.2          2.0  virginica
        5  149           6.2          3.4           5.4          2.3  virginica
        6  150           5.9          3.0           5.1          1.8  virginica,        0
        0   supp
        1     VC
        2     VC
        3     VC
        4     VC
        5     VC
        6     VC
        7     VC
        8     VC
        9     VC
        10    VC
        11    VC
        12    VC
        13    VC
        14    VC]


help(tabula.read_pdf)


### tabula.io.build_options

help(tabula.io.build_options)
Help on function build_options in module tabula.io:

build_options(pages=None, guess=True, area=None, relative_area=False, lattice=False, stream=False, password=None, silent=None, columns=None, format=None, batch=None, output_path=None, options='')
    Build options for tabula-java
    
    Args:
        pages (str, int, `list` of `int`, optional):
            An optional values specifying pages to extract from. It allows
            `str`,`int`, `list` of :`int`. Default: `1`
    
            Examples:
                ``'1-2,3'``, ``'all'``, ``[1,2]``
        guess (bool, optional):
            Guess the portion of the page to analyze per page. Default `True`
            If you use "area" option, this option becomes `False`.
    
            Note:
                As of tabula-java 1.0.3, guess option becomes independent from
                lattice and stream option, you can use guess and lattice/stream option
                at the same time.
    
        area (list of float, list of list of float, optional):
            Portion of the page to analyze(top,left,bottom,right).
            Default is entire page.
    
            Note:
                If you want to use multiple area options and extract in one table, it
                should be better to set ``multiple_tables=False`` for :func:`read_pdf()`
    
            Examples:
                ``[269.875,12.75,790.5,561]``,
                ``[[12.1,20.5,30.1,50.2], [1.0,3.2,10.5,40.2]]``
    
        relative_area (bool, optional):
            If all area values are between 0-100 (inclusive) and preceded by ``'%'``,
            input will be taken as % of actual height or width of the page.
            Default ``False``.
        lattice (bool, optional):
            Force PDF to be extracted using lattice-mode extraction
            (if there are ruling lines separating each cell, as in a PDF of an
            Excel spreadsheet)
        stream (bool, optional):
            Force PDF to be extracted using stream-mode extraction
            (if there are no ruling lines separating each cell, as in a PDF of an
            Excel spreadsheet)
        password (str, optional):
            Password to decrypt document. Default: empty
        silent (bool, optional):
            Suppress all stderr output.
        columns (list, optional):
            X coordinates of column boundaries.
    
            Example:
                ``[10.1, 20.2, 30.3]``
        format (str, optional):
            Format for output file or extracted object.
            (``"CSV"``, ``"TSV"``, ``"JSON"``)
        batch (str, optional):
            Convert all PDF files in the provided directory. This argument should be
            directory path.
        output_path (str, optional):
            Output file path. File format of it is depends on ``format``.
            Same as ``--outfile`` option of tabula-java.
        options (str, optional):
            Raw option string for tabula-java.
    
    Returns:
        list:
            Built list of options

Let's set the `pages` option. Here is the extraction result for page 3:

In [9]:
# set pages option
dfs = tabula.read_pdf(pdf_path, pages=3, stream=True)
dfs[0]

Got stderr: Nov 05, 2020 6:50:53 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:50:53 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5
5,10.0,VC,0.5
6,11.2,VC,0.5
7,11.2,VC,0.5
8,5.2,VC,0.5
9,7.0,VC,0.5


In [10]:
# pass pages as string
tabula.read_pdf(pdf_path, pages="1-2,3", stream=True)

Got stderr: Nov 05, 2020 6:51:03 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:51:03 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



[             Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  \
 0             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1   
 1         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1   
 2            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1   
 3        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0   
 4     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0   
 5               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0   
 6            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0   
 7             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0   
 8              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0   
 9              Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0   
 10            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0   
 11           Merc 450SE  16.4    8  275.8  180  3.0
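
To make the string form of `pages` concrete, here is a small sketch of how a spec such as `"1-2,3"` maps to page numbers. `expand_pages` is a hypothetical helper written for illustration; tabula-py passes the spec on to tabula-java and does not expose a function like this.

```python
def expand_pages(spec):
    """Expand a page spec such as '1-2,3' into a sorted list of page numbers.

    Hypothetical helper for illustration; 'all' is returned unchanged because
    the page count is only known once the PDF is opened.
    """
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, (list, tuple)):
        return sorted(set(int(p) for p in spec))
    if spec == "all":
        return "all"
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            pages.update(range(int(first), int(last) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1-2,3"))  # [1, 2, 3]
print(expand_pages(3))        # [3]
print(expand_pages([2, 1]))   # [1, 2]
```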

You can set `pages="all"` to extract all pages. If you hit an out-of-memory (OOM) error in Java, set an appropriate `-Xmx` value in `java_options`.

In [11]:
# extract all pages
tabula.read_pdf(pdf_path, pages="all", stream=True)

Got stderr: Nov 05, 2020 6:51:16 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:51:16 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



[             Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  \
 0             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1   
 1         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1   
 2            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1   
 3        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0   
 4     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0   
 5               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0   
 6            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0   
 7             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0   
 8              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0   
 9              Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0   
 10            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0   
 11           Merc 450SE  16.4    8  275.8  180  3.0
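
If extraction of a large PDF runs out of memory, the heap limit can be raised through `java_options`, which takes a list of JVM flags. The sketch below is one way to wrap that; `heap_option` and `read_all_pages` are hypothetical helpers, and the `1g` default is an arbitrary assumption, not a recommended value.

```python
def heap_option(max_heap):
    """Build a java_options list that caps the JVM heap, e.g. '512m' or '2g'."""
    return ["-Xmx{}".format(max_heap)]


def read_all_pages(path, max_heap="1g"):
    """Extract tables from every page with an explicit JVM heap limit.

    The import is local so heap_option() works without tabula-py installed.
    """
    import tabula

    return tabula.read_pdf(
        path,
        pages="all",
        stream=True,
        java_options=heap_option(max_heap),
    )


print(heap_option("512m"))  # ['-Xmx512m']
```

Calling `read_all_pages(pdf_path, max_heap="2g")` would then run the same extraction as the cell above with a 2 GB heap limit.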

## Read partial area of PDF

If you want to extract only a certain part of a page, use the `area` option.

Note that as of tabula-py 2.0.0, `multiple_tables` defaults to `True`, so if you want to pass multiple areas such as `[[0, 0, 100, 50], [0, 50, 100, 100]]`, you need to set `multiple_tables=False`.

In [None]:
# set area option
dfs = tabula.read_pdf(pdf_path, area=[126,149,212,462], pages=2)
dfs[0]
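
The `area` list is ordered `(top, left, bottom, right)`, and with `relative_area=True` the values are read as percentages of the page height (top, bottom) and width (left, right). The sketch below shows that arithmetic. `percent_area_to_points` is a hypothetical helper, and the 612 x 792 page size (US Letter in PDF points) is an assumption for the example, not a property of `data.pdf`.

```python
def percent_area_to_points(area_pct, page_height, page_width):
    """Convert (top, left, bottom, right) percentages into absolute values.

    top and bottom scale with the page height, left and right with its width.
    """
    top, left, bottom, right = area_pct
    return [
        top * page_height / 100,
        left * page_width / 100,
        bottom * page_height / 100,
        right * page_width / 100,
    ]


# Assumed page size: 612 wide by 792 high.
upper_half = percent_area_to_points([0, 0, 50, 100], page_height=792, page_width=612)
print(upper_half)  # [0.0, 0.0, 396.0, 612.0]
```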

## Read by giving column information

In [6]:
pdf_path2 = "https://github.com/chezou/tabula-py/raw/master/tests/resources/campaign_donors.pdf"

dfs = tabula.read_pdf(pdf_path2, columns=[47, 147, 256, 310, 375, 431, 504], guess=False, pages=1)
df = dfs[0].drop(["Unnamed: 0"], axis=1)
df

Unnamed: 0,Apellido,Nombre,Matricula,Cuit,Fecha,Tipo,Importe
0,MENA,JUAN MARTÍN,27.083.460,20-27083460-5,09/10/2013,EFECTIVO,"$ 10.000,00"
1,MOLLE,MATÍAS,25.348.547,20-25348547-8,09/10/2013,EFECTIVO,"$ 10.000,00"
2,MOLLEVI,FEDERICO OSCAR,25.028.246,20-25028246-0,09/10/2013,EFECTIVO,"$ 10.000,00"
3,PERAZZO,PABLO DANIEL,25.348.394,20-25348394-7,09/10/2013,EFECTIVO,"$ 10.000,00"
4,PICARDI,FRANCO EDUARDO,27.382.271,20-27382271-3,09/10/2013,EFECTIVO,"$ 10.000,00"
5,PISONI,CARLOS ENRIQUE,26.034.823,20-26034823-0,09/10/2013,EFECTIVO,"$ 10.000,00"
6,PONTORIERO,MARÍA PAULA,23.249.597,27-23249597-4,09/10/2013,EFECTIVO,"$ 10.000,00"
7,PULESTON,JUAN MIGUEL,11.895.661,20-11895661-4,09/10/2013,EFECTIVO,"$ 10.000,00"
8,REMÓN,MABEL AURORA,11.292.939,27-11292939-3,09/10/2013,EFECTIVO,"$ 10.000,00"
9,SARRABAYROUSE,DIEGO,24.662.899,20-24662899-9,09/10/2013,EFECTIVO,"$ 10.000,00"
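
The `Importe` column comes back as text such as `$ 10.000,00`, so it needs cleaning before any numeric analysis. Here is a minimal sketch of that step, assuming `.` is the thousands separator and `,` the decimal separator as in the table above; `parse_importe` is a hypothetical helper name.

```python
from decimal import Decimal


def parse_importe(text):
    """Parse an amount such as '$ 10.000,00' into Decimal('10000.00').

    Assumes '.' separates thousands and ',' marks the decimals,
    as in the table above.
    """
    cleaned = text.replace("$", "").strip()
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


print(parse_importe("$ 10.000,00"))  # 10000.00
```

On the DataFrame itself, something like `df["Importe"].map(parse_importe)` would apply the same conversion to every row.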


## Extract to JSON, TSV, or CSV

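tabula-py can return extraction results not only as DataFrames but also as JSON. Set the format with the `output_format` option of `read_pdf()`, which accepts `dataframe` or `json`. CSV and TSV output is written to files with `convert_into()`.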

In [12]:
# read pdf as JSON
tabula.read_pdf(pdf_path, output_format="json")

'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Nov 05, 2020 6:51:29 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:51:29 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



[{'extraction_method': 'lattice',
  'top': 125.17005,
  'left': 247.14917,
  'width': 292.90167236328125,
  'height': 395.14886474609375,
  'right': 540.05084,
  'bottom': 520.3189,
  'data': [[{'top': 125.17005,
     'left': 247.14917,
     'width': 30.773834228515625,
     'height': 12.186347961425781,
     'text': 'mpg'},
    {'top': 125.17005,
     'left': 277.923,
     'width': 24.407867431640625,
     'height': 12.186347961425781,
     'text': 'cyl'},
    {'top': 125.17005,
     'left': 302.33087,
     'width': 34.6478271484375,
     'height': 12.186347961425781,
     'text': 'disp'},
    {'top': 125.17005,
     'left': 336.9787,
     'width': 26.899566650390625,
     'height': 12.186347961425781,
     'text': 'hp'},
    {'top': 125.17005,
     'left': 363.87827,
     'width': 30.24810791015625,
     'height': 12.186347961425781,
     'text': 'drat'},
    {'top': 125.17005,
     'left': 394.12637,
     'width': 34.64752197265625,
     'height': 12.186347961425781,
     'text': 'w
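
The JSON result is a list of tables, each with a `data` list of rows, and each row is a list of cells carrying geometry plus a `text` field. The sketch below pulls out just the text. `json_tables_to_rows` is a hypothetical helper, and `sample` is a cut-down stand-in for the real output with most fields omitted.

```python
def json_tables_to_rows(tables):
    """Turn tabula's JSON output into one list of text rows per table."""
    return [
        [[cell["text"] for cell in row] for row in table["data"]]
        for table in tables
    ]


# A cut-down sample with the same shape as the output above
# (only two header cells, most geometry fields omitted).
sample = [
    {
        "extraction_method": "lattice",
        "data": [
            [
                {"top": 125.17005, "left": 247.14917, "text": "mpg"},
                {"top": 125.17005, "left": 277.923, "text": "cyl"},
            ]
        ],
    }
]

print(json_tables_to_rows(sample))  # [[['mpg', 'cyl']]]
```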

## Convert PDF tables into CSV, TSV, or JSON files

You can convert a PDF directly into a file, rather than creating Python objects, with the `convert_into()` function.

In [13]:
# You can convert from pdf into JSON, CSV, TSV

tabula.convert_into(pdf_path, "test.json", output_format="json")
!cat test.json

'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Nov 05, 2020 6:53:49 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:53:49 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



[{"extraction_method":"lattice","top":125.17005,"left":247.14917,"width":292.90167236328125,"height":395.14886474609375,"right":540.05084,"bottom":520.3189,"data":[[{"top":125.17005,"left":247.14917,"width":30.773834228515625,"height":12.186347961425781,"text":"mpg"},{"top":125.17005,"left":277.923,"width":24.407867431640625,"height":12.186347961425781,"text":"cyl"},{"top":125.17005,"left":302.33087,"width":34.6478271484375,"height":12.186347961425781,"text":"disp"},{"top":125.17005,"left":336.9787,"width":26.899566650390625,"height":12.186347961425781,"text":"hp"},{"top":125.17005,"left":363.87827,"width":30.24810791015625,"height":12.186347961425781,"text":"drat"},{"top":125.17005,"left":394.12637,"width":34.64752197265625,"height":12.186347961425781,"text":"wt"},{"top":125.17005,"left":428.7739,"width":34.64801025390625,"height":12.186347961425781,"text":"qsec"},{"top":125.17005,"left":463.4219,"width":21.1429443359375,"height":12.186347961425781,"text":"vs"},{"top":125.17005,"left"

In [14]:
tabula.convert_into(pdf_path, "test.csv", output_format="csv", stream=True)
!cat test.csv

'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Nov 05, 2020 6:54:28 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:54:28 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



"",mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.90,2.620,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.90,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.320,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.440,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.460,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.570,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.190,20.00,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.150,22.90,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.440,18.30,1,0,4,4
Merc 280C,17.8,6,167.6,123,3.92,3.440,18.90,1,0,4,4
Merc 450SE,16.4,8,275.8,180,3.07,4.070,17.40,0,0,3,3
Merc 450SL,17.3,8,275.8,180,3.07,3.730,17.60,0,0,3,3
Merc 450SLC,15.2,8,275.8,180,3.07,3.780,18.00,0,0,3,3
Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.250,17.98,0,0,3,4
Lincoln Continental,10.4,8,460.0,215,3.00,5.424,17.82,0,0,3,4
Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4
Fiat 128,32.4,4,78.7,66,4
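
The CSV written by `convert_into()` has an empty first header cell (`""`), which is awkward when reading rows back as dictionaries. A minimal sketch of one way to handle that with the standard `csv` module follows. `read_rows` and the column name `model` are hypothetical, and `sample_csv` holds only a few lines copied from the output above.

```python
import csv
import io

# First lines of test.csv as printed above; the real file has more rows.
sample_csv = '''"",mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.90,2.620,16.46,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.320,18.61,1,1,4,1
'''


def read_rows(handle, first_column="model"):
    """Read CSV rows as dicts, naming the blank first header `first_column`."""
    reader = csv.reader(handle)
    header = next(reader)
    header[0] = header[0] or first_column
    return [dict(zip(header, row)) for row in reader]


rows = read_rows(io.StringIO(sample_csv))
print(rows[0]["model"], rows[0]["mpg"])  # Mazda RX4 21.0
```

For the real file, pass `open("test.csv", newline="")` instead of the `io.StringIO` object.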

## Use lattice mode for more accurate extraction of spreadsheet-style tables

If your tables have lines separating cells, you can use the `lattice` option. By default, tabula-py sets `guess=True`, which matches the default behavior of the tabula app. If your tables don't have separating lines, try the `stream` option.

As mentioned earlier, try the tabula app before struggling with tabula-py options. Alternatively, [PDFplumber](https://github.com/jsvine/pdfplumber) may be worth a try, since it uses a different extraction strategy.

In [15]:
pdf_path3 = "https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/spanning_cells.pdf"
dfs = tabula.read_pdf(
    pdf_path3,
    pages="1",
    lattice=True,
    pandas_options={"header": [0, 1]},
    area=[0, 0, 50, 100],
    relative_area=True,
    multiple_tables=False,
)
dfs[0]

Unnamed: 0_level_0,Improved operation scenario,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0
Unnamed: 0_level_1,Volume servers in:,2007,2008,2009,2010,2011
0,Server closets,1505.0,1580.0,1643.0,1673.0,1689.0
1,Server rooms,1512.0,1586.0,1646.0,1677.0,1693.0
2,Localized data centers,1512.0,1586.0,1646.0,1677.0,1693.0
3,Mid-tier data centers,1512.0,1586.0,1646.0,1677.0,1693.0
4,Enterprise-class data centers,1512.0,1586.0,1646.0,1677.0,1693.0
5,Best practice scenario,,,,,
6,Volume servers in:,2007.0,2008.0,2009.0,2010.0,2011.0
7,Server closets,1456.0,1439.0,1386.0,1296.0,1326.0
8,Server rooms,1465.0,1472.0,1427.0,1334.0,1371.0
9,Localized data centers,1465.0,1471.0,1426.0,1334.0,1371.0
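
When you don't know in advance whether a PDF has ruling lines, one option is to try lattice mode first and fall back to stream mode. The sketch below is an illustrative pattern, not something tabula-py does for you. `extract_with_fallback` and `fake_reader` are hypothetical, and the check only tests whether any table was returned, not whether it is a good one.

```python
def extract_with_fallback(path, reader=None, **options):
    """Try lattice mode first and fall back to stream mode if nothing is found.

    `reader` defaults to tabula.read_pdf; it is a parameter so the logic can be
    exercised without Java or tabula-py installed.
    """
    if reader is None:
        import tabula

        reader = tabula.read_pdf

    tables = reader(path, lattice=True, **options)
    if tables:
        return "lattice", tables
    return "stream", reader(path, stream=True, **options)


def fake_reader(path, lattice=False, stream=False, **options):
    """Stand-in reader: pretends only stream mode finds a table."""
    return ["table"] if stream else []


print(extract_with_fallback("example.pdf", reader=fake_reader))
# ('stream', ['table'])
```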


## Use tabula app template

tabula-py can read a tabula app template, which stores the area selections made in the GUI app so that they can be reused.

In [16]:
template_path = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.tabula-template.json"
tabula.read_pdf_with_template(pdf_path, template_path)

Got stderr: Nov 05, 2020 6:56:33 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:56:33 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>

Got stderr: Nov 05, 2020 6:56:35 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:56:35 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>

Got stderr: Nov 05, 2020 6:56:37 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Nov 05, 2020 6:56:37 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>



[             Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  \
 0             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1   
 1         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1   
 2            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1   
 3        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0   
 4     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0   
 5               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0   
 6            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0   
 7             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0   
 8              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0   
 9              Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0   
 10            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0   
 11           Merc 450SE  16.4    8  275.8  180  3.0

If you have any questions, ask on [Stack Overflow](https://stackoverflow.com/search?q=tabula-py).