<h1> Tutorial: How to extract tables from a PDF and save them as CSV </h1>

## Steps:

1. **Install the camelot-py library**
2. **Install PyPDF2==1.26.0**
3. **Install Ghostscript**

**Note:**
- The *Camelot* release used here reads PDFs with `PdfFileReader` from the `PyPDF2` library. That class was removed in `PyPDF2` 3.0.0, so `camelot.read_pdf` fails when `PyPDF2` 3.0.0 or above is installed.
- *You need to install a compatible `PyPDF2` version*, e.g., 1.26.0.
- The *Camelot library requires Ghostscript* to be installed and available on the system PATH. You can download and install Ghostscript from the official website: [Ghostscript](https://www.ghostscript.com/download.html)

- **After installing Ghostscript**, make sure to add the Ghostscript `bin` and `lib` directories to your system's PATH environment variable. The version folder depends on the release you installed; for 9.55.0 the typical paths are:
  - **Windows:** `C:\Program Files\gs\gs9.55.0\bin` and `C:\Program Files\gs\gs9.55.0\lib`

### Verify Ghostscript Installation

#### Windows:
- Open a new command prompt or PowerShell window.
- Type `gswin64c.exe -version` (or `gswin32c.exe` for 32-bit systems) and press Enter.
- If the command prints the Ghostscript version, Ghostscript is installed and available on the PATH.
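
The same check can be sketched from Python with the standard library. This is a minimal sketch, not part of Camelot: the helper name `find_ghostscript` is hypothetical, and `gs` is assumed to be the executable name on Linux and macOS.

```python
import shutil

# Executable names to look for: the Windows console builds mentioned above,
# plus "gs", which is assumed to be the name on Linux and macOS.
CANDIDATES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(candidates=CANDIDATES):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


gs_path = find_ghostscript()
print(gs_path or "Ghostscript was not found on PATH")
```

If this prints the "not found" message in a notebook that was already running before you changed PATH, restart the kernel so it picks up the new environment.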


#### Documentation Links: 
- [Camelot](https://camelot-py.readthedocs.io/en/master/user/quickstart.html#read-the-pdf)
- [pypdf2](https://pypdf2.readthedocs.io/en/3.0.0/user/extract-text.html#using-a-visitor)
- [Ghostscript](https://ghostscript.com/releases/)

In [4]:
!pip install -q camelot-py[cv] pandas
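# Python bindings only; the Ghostscript program itself is installed separately (see the notes above)
!pip install ghostscript
!pip install PyPDF2==1.26.0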

In [32]:
# Name of the PDF file to read
file_name = 'BP-Equities-Pvt-Ltd.pdf'

# Full path to the PDF file: replace the placeholder with your own directory
file_path = fr'your local directory path\{file_name}'
# file_path = fr'C:\Users\raghavendra.k\Documents\{file_name}'
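
As an alternative to the raw f-string, the path can be built with `pathlib`, which avoids hand-written backslashes. This is only a sketch: the `Documents` folder under the home directory is an assumed location, so replace it with the directory that holds your PDF.

```python
from pathlib import Path

file_name = "BP-Equities-Pvt-Ltd.pdf"

# Assumed location for illustration; point this at your own folder.
pdf_dir = Path.home() / "Documents"
file_path = pdf_dir / file_name

# Fail early with a readable message instead of a parser error later on.
if not file_path.is_file():
    print(f"PDF not found: {file_path}")
```

If the Camelot version you use expects a string rather than a `Path`, pass `str(file_path)` to `read_pdf`.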

In [33]:
import pandas as pd
import camelot

tables = camelot.read_pdf(file_path, pages = 'all')
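
`read_pdf` uses the `lattice` flavor by default, which looks for tables with ruling lines; the `stream` flavor uses whitespace between cells instead. If a PDF yields no tables, a fallback along the following lines may help. The helper `read_with_fallback` is hypothetical, and it takes the reader function as an argument so that the sketch does not depend on Camelot being importable.

```python
def read_with_fallback(path, reader, pages="all"):
    """Try the 'lattice' flavor first, then 'stream' if no tables were found.

    `reader` is expected to behave like camelot.read_pdf.
    """
    tables = reader(path, pages=pages, flavor="lattice")
    if len(tables) == 0:
        # No ruled tables detected; retry with whitespace-based parsing.
        tables = reader(path, pages=pages, flavor="stream")
    return tables
```

With the names used in this notebook, the call would be `tables = read_with_fallback(file_path, camelot.read_pdf)`.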

In [34]:
tables

<TableList n=5>

### We now have a `TableList` object called `tables`, which is a list of `Table` objects. Here `n=5` means that 5 tables were found in the whole PDF.

### Each table can be accessed by its index. Let’s access the first table (index 0) and take a look at its shape.

In [35]:
tables[0]

<Table shape=(10, 2)>
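
Rather than inspecting each index by hand, the whole list can be summarised in one pass. The sketch below relies only on the `page` and `shape` attributes of each `Table`; the helper name `describe_tables` is hypothetical.

```python
def describe_tables(tables):
    """Return one (index, page, shape) tuple per table in the list."""
    return [(i, t.page, t.shape) for i, t in enumerate(tables)]
```

With the `tables` object from above, `for row in describe_tables(tables): print(row)` prints one line per table.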

### Let’s print the parsing report

In [10]:
tables[0].parsing_report

{'accuracy': 100.0, 'whitespace': 5.0, 'order': 1, 'page': 1}

### Woah! The accuracy is 100% and the whitespace is low (5%), which means the table was most likely extracted correctly. You can access the table as a pandas DataFrame by using the table object’s `df` property.
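
The parsing report can also be used to flag tables that deserve a manual check. The sketch below is one possible rule, not a Camelot feature: the helper `looks_reliable` and both thresholds (90 and 20) are arbitrary assumptions to tune for your own PDFs.

```python
def looks_reliable(report, min_accuracy=90.0, max_whitespace=20.0):
    """Return True if a parsing report meets both (assumed) thresholds."""
    return (
        report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    )


# The report printed above for tables[0]
report = {'accuracy': 100.0, 'whitespace': 5.0, 'order': 1, 'page': 1}
print(looks_reliable(report))  # True
```

To apply it to every table: `[t for t in tables if not looks_reliable(t.parsing_report)]` lists the tables worth a second look.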

In [36]:
table1 = tables[0].df

In [18]:
table2 = tables[1].df

In [19]:
table3 = tables[2].df

In [20]:
table4 = tables[3].df

In [37]:
table5 = tables[4].df
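
The five cells above repeat the same step, so they can be collapsed into a single helper. This is a sketch with a hypothetical name, `to_dataframes`; it numbers the tables from 1 so that the keys match the `table1` to `table5` variables.

```python
def to_dataframes(tables):
    """Map a 1-based table number to that table's DataFrame (its `df` property)."""
    return {number: table.df for number, table in enumerate(tables, start=1)}
```

With the `tables` object from above, `frames = to_dataframes(tables)` gives `frames[5]`, which holds the same DataFrame as `table5`.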

### Let's see the sample table content now. 

In [38]:
table5

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | Particular | NCL |  |  |  | TOTAL (Net) |
| 1 |  | Cash (T+1) | Cash (T+2) | Cash (Others) | FutOpt |  |
| 2 | State GST @ 9% | 18.94 | 0.00 | 0.00 | 22.04 | 40.98 |
| 3 | Stamp Duty | 13.00 | 0.00 | 0.00 | 0.00 | 13.00 |
| 4 | IPFT Charges OPT | 0.00 | 0.00 | 0.00 | 0.05 | 0.05 |
| 5 | STT | 91.00 | 0.00 | 0.00 | 4.00 | 95.00 |
| 6 | PayIn/Payout Obligation | 84086.00 | 0.00 | 0.00 | -3207.50 | 80878.50 |
| 7 | IPFT Charges | 0.09 | 0.00 | 0.00 | 0.00 | 0.09 |
| 8 | Other Chrgs | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 |
| 9 | Other Chgs | 2.93 | 0.00 | 0.00 | 0.00 | 2.93 |


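### Looks good! A Camelot `Table` can be exported as a CSV file using its `to_csv()` method. Alternatively, you can use its `to_json()`, `to_excel()`, `to_html()`, `to_markdown()` or `to_sqlite()` methods to export the table as a JSON, Excel, HTML or Markdown file or a SQLite database respectively. Since `table5` is already a pandas DataFrame, the cell below uses the DataFrame's own `to_csv()` method.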

In [24]:
# Save the DataFrame to a CSV file

table5.to_csv('Saving_Tables_using_camelot.csv', index = False)

### You can also export all tables at once, using the tables object’s export() method.

In [39]:
tables.export('all_tables.csv', f='csv')

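### This exports every table as its own CSV file, using the given path as a base name: Camelot appends the page and table number to it (for example `all_tables-page-1-table-1.csv`). Alternatively, you can use `f='json'`, `f='excel'`, `f='html'`, `f='markdown'` or `f='sqlite'`.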
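
To confirm what was written, the exported files can be listed afterwards. The sketch below assumes that each file is named `<base>-page-<page>-table-<order><extension>` next to the path passed to `export()`; the helper name `exported_files` is hypothetical.

```python
from pathlib import Path


def exported_files(export_path):
    """List files matching '<stem>-page-*-table-*<suffix>' next to export_path."""
    export_path = Path(export_path)
    pattern = f"{export_path.stem}-page-*-table-*{export_path.suffix}"
    return sorted(export_path.parent.glob(pattern))


# Same path as passed to tables.export() above
for path in exported_files("all_tables.csv"):
    print(path.name)
```

If nothing is printed, check the working directory of the notebook: a relative path such as `all_tables.csv` is resolved against it.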

### You can find more of Camelot's features in the [Camelot Documentation](https://camelot-py.readthedocs.io/en/master/user/quickstart.html#read-the-pdf)