### FILE: extract_tables.ipynb
### DESCRIPTION: This sample demonstrates how to extract table values from a PDF using "Layout Extraction".
#### Notes: Set the markers to your own values before running the sample:
   #### 1) Replace the marker **endpoint** with the endpoint of your Cognitive Services resource.
   #### 2) Replace the marker **key** with your Form Recognizer API key.
   
[docs](https://azuresdkdocs.blob.core.windows.net/$web/python/azure-ai-formrecognizer/latest/index.html)

[api doc](https://docs.microsoft.com/en-us/python/api/azure-ai-formrecognizer/?view=azure-python)

[Quickstart: Use the Form Recognizer client library or REST API](https://docs.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/client-library?tabs=preview%2Cv2-1&pivots=programming-language-python#analyze-layout)
         
[Layout Extraction](https://docs.microsoft.com/en-us/azure/cognitive-services/form-recognizer/concept-layout)

In [85]:
# Run this cell once to install the libraries, or run the pip commands from a terminal
import sys
#! {sys.executable} -m pip install azure.core
! {sys.executable} -m pip install azure.ai.formrecognizer --pre
#! {sys.executable} -m pip install  azure.ai.formrecognizer.aio 

Collecting azure.ai.formrecognizer
  Downloading https://files.pythonhosted.org/packages/04/52/49a4dbbecedb9b5007ab2883c3ed52bd5cf20f79fe223d2f3673d148a68c/azure_ai_formrecognizer-3.1.0b3-py2.py3-none-any.whl (138kB)
     |████████████████████████████████| 143kB 10.1MB/s eta 0:00:01
Installing collected packages: azure.ai.formrecognizer
Successfully installed azure.ai.formrecognizer


### Set your endpoint and key here

In [70]:
# Replace these values with your own.
# Create the key in the Azure portal.
endpoint="<Your endpoint>"
key="<Your key>"
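
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the two values could instead be read from environment variables. The variable names `FORM_RECOGNIZER_ENDPOINT` and `FORM_RECOGNIZER_KEY` and the helper `load_settings` are assumptions made for illustration; the SDK does not require them.

```python
import os


def load_settings(environ=os.environ):
    """Return (endpoint, key) read from environment variables.

    The variable names are hypothetical; use whatever your setup defines.
    """
    endpoint = environ.get("FORM_RECOGNIZER_ENDPOINT", "")
    key = environ.get("FORM_RECOGNIZER_KEY", "")
    # Fail early with a clear message instead of sending an empty credential.
    if not endpoint or not key:
        raise ValueError(
            "Set FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY before running."
        )
    return endpoint, key
```

With the variables set, `endpoint, key = load_settings()` could stand in for the two assignments in the cell.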

### The code below calls the Form Recognizer Cognitive Service using the Python SDK.

In [79]:

def format_bounding_box(bounding_box):
    """Format a list of points as "[x, y], [x, y], ..."; return "N/A" if empty."""
    if not bounding_box:
        return "N/A"
    return ", ".join(["[{}, {}]".format(p.x, p.y) for p in bounding_box])


class RecognizeContentAsync(object):

    async def recognize_table_content(self, path_to_sample_forms=""):
        # [START recognize_content_async]
        from azure.core.credentials import AzureKeyCredential
        from azure.ai.formrecognizer.aio import FormRecognizerClient
        import asyncio

        
        async with FormRecognizerClient(
            endpoint=endpoint, credential=AzureKeyCredential(key)
        ) as form_recognizer_client:

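            # Open the document in binary mode and submit it for layout analysis.
            with open(path_to_sample_forms, "rb") as f:
                poller = await form_recognizer_client.begin_recognize_content(form=f)

            # Wait for the long-running operation; the result is a list of pages.
            form_pages = await poller.result()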

            for idx, content in enumerate(form_pages):
                print("----Recognizing content from page #{}----".format(idx+1))
                print("Page has width: {} and height: {}, measured with unit: {}".format(
                    content.width,
                    content.height,
                    content.unit
                ))
                for table_idx, table in enumerate(content.tables):
                    print("Table # {} has {} rows and {} columns".format(table_idx, table.row_count, table.column_count))
                    for cell in table.cells:
                        print("Cell text: {}".format(cell.text))
                        #print("Location: {}".format(format_bounding_box(cell.bounding_box)))
                        print("Confidence score: {}\n".format(cell.confidence))
                        

    async def recognize_line_content(self, path_to_sample_forms=""):
        # [START recognize_content_async]
        from azure.core.credentials import AzureKeyCredential
        from azure.ai.formrecognizer.aio import FormRecognizerClient
        import asyncio

        
        async with FormRecognizerClient(
            endpoint=endpoint, credential=AzureKeyCredential(key)
        ) as form_recognizer_client:

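            # Open the document in binary mode and submit it for layout analysis.
            with open(path_to_sample_forms, "rb") as f:
                poller = await form_recognizer_client.begin_recognize_content(form=f)

            # Wait for the long-running operation; the result is a list of pages.
            form_pages = await poller.result()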

            for idx, content in enumerate(form_pages):
                print("----Recognizing content from page #{}----".format(idx+1))
                print("Page has width: {} and height: {}, measured with unit: {}".format(
                    content.width,
                    content.height,
                    content.unit
                ))
                for line_idx, line in enumerate(content.lines):
                    print("Line # {} has word count '{}' and text '{}'".format(
                        line_idx,
                        len(line.words),
                        line.text
                    ))
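
`recognize_table_content` prints one cell per line, which loses the row and column layout of the table. The sketch below rebuilds a table as a list of rows. It assumes each cell exposes `row_index`, `column_index`, and `text`, and that the table exposes `row_count` and `column_count`, as the table objects returned by this SDK version do. The helper name `table_to_grid`, the `Cell` stand-in, and the demo values are hypothetical. Cells that span several rows or columns are placed only at their top-left position.

```python
from collections import namedtuple


def table_to_grid(cells, row_count, column_count):
    """Arrange table cells into a row-major grid of strings."""
    # Start with an empty grid; positions without a cell stay "".
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell.row_index][cell.column_index] = cell.text
    return grid


# Stand-in for the SDK's cell objects so the sketch runs without a service call.
Cell = namedtuple("Cell", ["row_index", "column_index", "text"])

# Illustrative values only, not taken from a real response.
demo_cells = [
    Cell(0, 0, "header A"), Cell(0, 1, "header B"),
    Cell(1, 0, "value 1"), Cell(1, 1, "value 2"),
]
for row in table_to_grid(demo_cells, row_count=2, column_count=2):
    print(" | ".join(row))
```

Inside the table loop, the call would look like `table_to_grid(table.cells, table.row_count, table.column_count)`.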
                    






### Recognizing Tables


In [72]:
Tablefinder = RecognizeContentAsync()
await Tablefinder.recognize_table_content(path_to_sample_forms="./data/Andrade_etal_2020.pdf")

----Recognizing content from page #1----
Page has width: 8.2639 and height: 11.6806, measured with unit: inch
----------------------------------------
----Recognizing content from page #2----
Page has width: 8.2639 and height: 11.6806, measured with unit: inch
----------------------------------------
----Recognizing content from page #3----
Page has width: 8.2639 and height: 11.6806, measured with unit: inch
Table # 0 has 25 rows and 5 columns
Cell text: Layer (cm)
Confidence score: 1.0

Cell text: 0-10
Confidence score: 1.0

Cell text: 10-20
Confidence score: 1.0

Cell text: 20-50
Confidence score: 1.0

Cell text: Physical attributes Sand (g kg-1 )
Confidence score: 1.0

Cell text: 845
Confidence score: 1.0

Cell text: 776
Confidence score: 1.0

Cell text: 707
Confidence score: 1.0

Cell text: Silt (g kg-1 )
Confidence score: 1.0

Cell text: 106
Confidence score: 1.0

Cell text: 127
Confidence score: 1.0

Cell text: 137
Confidence score: 1.0

Cell text: Clay (g kg-1 ) Chemical attribu

In [73]:
Tablefinder = RecognizeContentAsync()
await Tablefinder.recognize_table_content(path_to_sample_forms="./data/Campo_Merino_2016.pdf")

----Recognizing content from page #1----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
----------------------------------------
----Recognizing content from page #2----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
----------------------------------------
----Recognizing content from page #3----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
----------------------------------------
----Recognizing content from page #4----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
Table # 0 has 34 rows and 7 columns
Cell text: Semiarid site
Confidence score: 1.0

Cell text: Intermediate site
Confidence score: 1.0

Cell text: Subhumid site
Confidence score: 1.0

Cell text: Coordinates
Confidence score: 1.0

Cell text: 21 170 N, 89 360 W O O
Confidence score: 1.0

Cell text: 20°480 N, 89°260 W
Confidence score: 1.0

Cell text: 20 040 N, 88 020 O O W
Confidence score: 1.0

Cell text: Altitude (m)
Confidence score

In [74]:
Tablefinder = RecognizeContentAsync()
await Tablefinder.recognize_table_content(path_to_sample_forms="./data/Carvalho_etal_2016.pdf")

----Recognizing content from page #1----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
Table # 0 has 2 rows and 2 columns
Cell text: ECOLOGY A Journal of ecology in the Southern Hemisphere Austral
Confidence score: 1.0

Cell text: Ecological Society of Australia
Confidence score: 1.0

----------------------------------------
----Recognizing content from page #2----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
----------------------------------------
----Recognizing content from page #3----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
Table # 0 has 5 rows and 9 columns
Cell text: Granulometric composition (g kg
Confidence score: 1.0

Cell text: 1 ) Humidity (g 100 g
Confidence score: 1.0

Cell text: 1 )
Confidence score: 1.0

Cell text: Nutrients
Confidence score: 1.0

Cell text: Depth
Confidence score: 1.0

Cell text: Coarse Fine Natural 0.033
Confidence score: 1.0

Cell text: 1.5 Useful
Confidence score: 1.0

Cell 

In [75]:
Tablefinder = RecognizeContentAsync()
await Tablefinder.recognize_table_content(path_to_sample_forms="./data/Worku_etal_2019.pdf")

----Recognizing content from page #1----
Page has width: 8.5 and height: 11.0, measured with unit: inch
----------------------------------------
----Recognizing content from page #2----
Page has width: 8.5 and height: 11.0, measured with unit: inch
----------------------------------------
----Recognizing content from page #3----
Page has width: 8.5 and height: 11.0, measured with unit: inch
----------------------------------------
----Recognizing content from page #4----
Page has width: 8.5 and height: 11.0, measured with unit: inch
----------------------------------------
----Recognizing content from page #5----
Page has width: 8.5 and height: 11.0, measured with unit: inch
----------------------------------------
----Recognizing content from page #6----
Page has width: 8.5 and height: 11.0, measured with unit: inch
----------------------------------------
----Recognizing content from page #7----
Page has width: 8.5 and height: 11.0, measured with unit: inch
--------------------------

In [76]:
Tablefinder = RecognizeContentAsync()
await Tablefinder.recognize_table_content(path_to_sample_forms="./data/Fig4_Andrade.png")

----Recognizing content from page #1----
Page has width: 1368.0 and height: 394.0, measured with unit: pixel
----------------------------------------
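
The cells above use `await` at the top level, which Jupyter supports but a plain Python script does not. Outside a notebook, the same coroutine can be driven with `asyncio.run`. The sketch below uses a hypothetical stand-in class, `FakeRecognizer`, so that it runs without credentials; in a real script `RecognizeContentAsync` would take its place. The stand-in returns a string only to make the demo visible, whereas the notebook's method prints its results and returns nothing.

```python
import asyncio


class FakeRecognizer:
    """Hypothetical stand-in for RecognizeContentAsync; makes no service call."""

    async def recognize_table_content(self, path_to_sample_forms=""):
        await asyncio.sleep(0)  # placeholder for the awaited SDK calls
        return "processed {}".format(path_to_sample_forms)


def main(path):
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(FakeRecognizer().recognize_table_content(path))


if __name__ == "__main__":
    print(main("./data/Fig4_Andrade.png"))
```

`asyncio.run` raises a `RuntimeError` when an event loop is already running, so inside the notebook the plain `await` form remains the one to use.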


### Recognizing Text

In [80]:
Linefinder = RecognizeContentAsync()
await Linefinder.recognize_line_content(path_to_sample_forms="./data/Andrade_etal_2020.pdf")

----Recognizing content from page #1----
Page has width: 8.2639 and height: 11.6806, measured with unit: inch
Line # 0 has word count '5' and text 'Universidade Federal Rural do Semi-Árido'
Line # 1 has word count '3' and text 'ISSN 0100-316X (impresso)'
Line # 2 has word count '5' and text 'Pró-Reitoria de Pesquisa e Pós-Graduação'
Line # 3 has word count '3' and text 'ISSN 1983-2125 (online)'
Line # 4 has word count '1' and text 'https://periodicos.ufersa.edu.br/index.php/caatinga'
Line # 5 has word count '1' and text 'http://dx.doi.org/10.1590/1983-21252020v33n218rc'
Line # 6 has word count '10' and text 'RAINFALL REGIME ON FINE ROOT GROWTH IN A SEASONALLY DRY'
Line # 7 has word count '2' and text 'TROPICAL FOREST1'
Line # 8 has word count '4' and text 'EUNICE MAIA DE ANDRADE2'
Line # 9 has word count '1' and text '*'
Line # 10 has word count '3' and text 'GILBERTO QUEVEDO ROSA3'
Line # 11 has word count '5' and text 'ALDENIA MENDES MASCENA DE ALMEIDA3'
Line # 12 has word count '1' 

In [81]:
Linefinder = RecognizeContentAsync()
await Linefinder.recognize_line_content(path_to_sample_forms="./data/Campo_Merino_2016.pdf")

----Recognizing content from page #1----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
Line # 0 has word count '3' and text 'Global Change Biology'
Line # 1 has word count '8' and text 'Global Change Biology (2016) 22, 1942–1956, doi: 10.1111/gcb.13244'
Line # 2 has word count '7' and text 'Variations in soil carbon sequestration and their'
Line # 3 has word count '7' and text 'determinants along a precipitation gradient in seasonally'
Line # 4 has word count '4' and text 'dry tropical forest ecosystems'
Line # 5 has word count '9' and text 'JULIO CAMPO 1 and AGU ST IN MER INO2'
Line # 6 has word count '1' and text '1'
Line # 7 has word count '18' and text 'Instituto de Ecolog ıa, Universidad Nacional Aut onoma de M exico, AP 70-275, 04510 Mexico City, Mexico, 2'
Line # 8 has word count '3' and text 'Escuela Polit ecnica'
Line # 9 has word count '14' and text 'Superior, Soil Science and Agricultural Chemistry University of Santiago de Compostela, 27002 Lugo, Spai

Line # 120 has word count '6' and text 'northeastern Yucatan Peninsula. Biotropica, 23, 434–441.'
Line # 121 has word count '10' and text 'proximate-analysis fractions used to assess litter quality in decomposition studies.'
Line # 122 has word count '14' and text 'Whigham DF, Olmsted I, Cabrera Cano E, Curtis AB (2003) Impacts of hurricanes on'
Line # 123 has word count '6' and text 'Canadian Journal of Botany, 75, 1601–1613.'
Line # 124 has word count '13' and text 'the forests of Quintana Roo, Yucat an Peninsula, Mexico. In: The Lowland Maya'
Line # 125 has word count '15' and text 'Price TD, Bar-Yosef O (2011) The origins of agriculture: new data, new ideas – an'
Line # 126 has word count '11' and text 'Area: Three Millennia at the Human-Wildland Interface (eds Gomez-Pompa A, Allen'
Line # 127 has word count '8' and text 'introduction to supplement 4. Current Anthropology, 52, S163–S174.'
Line # 128 has word count '13' and text 'MF, Fedick SL, Jim enez-Osornio JJ), pp. 193–213. The

In [82]:
Linefinder = RecognizeContentAsync()
await Linefinder.recognize_line_content(path_to_sample_forms="./data/Carvalho_etal_2016.pdf")

----Recognizing content from page #1----
Page has width: 8.2639 and height: 10.8611, measured with unit: inch
Line # 0 has word count '9' and text 'ECOLOGY A Journal of ecology in the Southern Hemisphere'
Line # 1 has word count '1' and text 'Austral'
Line # 2 has word count '1' and text 'Ecological'
Line # 3 has word count '1' and text 'Society'
Line # 4 has word count '1' and text 'of'
Line # 5 has word count '1' and text 'Australia'
Line # 6 has word count '5' and text 'Austral Ecology (2016) 41, 559–571'
Line # 7 has word count '8' and text 'Why is liana abundance low in semiarid climates?'
Line # 8 has word count '4' and text 'ELLEN CRISTINA DANTAS CARVALHO,1'
Line # 9 has word count '4' and text '*† FERNANDO ROBERTO MARTINS,2'
Line # 10 has word count '7' and text 'RAFAEL SILVA OLIVEIRA,2 ARLETE APARECIDA SOARES1 AND'
Line # 11 has word count '3' and text 'FRANCISCA SOARES ARAÚJO1'
Line # 12 has word count '1' and text '1'
Line # 13 has word count '13' and text 'Department of Bio

In [83]:
Linefinder = RecognizeContentAsync()
await Linefinder.recognize_line_content(path_to_sample_forms="./data/Worku_etal_2019.pdf")

----Recognizing content from page #1----
Page has width: 8.5 and height: 11.0, measured with unit: inch
Line # 0 has word count '7' and text '[Worku et. al., Vol.7 (Iss.9): September 2019]'
Line # 1 has word count '4' and text 'ISSN- 2350-0530(O), ISSN- 2394-3629(P)'
Line # 2 has word count '2' and text 'DOI: https://doi.org/10.29121/granthaalayah.v7.i9.2019.605'
Line # 3 has word count '1' and text 'RANTHAALAYA'
Line # 4 has word count '4' and text 'INTERNATIONAL JOURNAL OF RESEARCH'
Line # 5 has word count '1' and text 'GRANTHAALAYAH'
Line # 6 has word count '3' and text 'A knowledge Repository'
Line # 7 has word count '1' and text 'alellaRell:'
Line # 8 has word count '1' and text 'Science'
Line # 9 has word count '9' and text 'FINE ROOT BIOMASS OF ERICA TRIMERA (ENGL.) ALONG AN'
Line # 10 has word count '6' and text 'ALTITUDINAL GRADIENT ON BALE MOUNTAINS, ETHIOPIA'
Line # 11 has word count '3' and text 'Abebe Worku 1'
Line # 12 has word count '4' and text ', Masresha Fetene 2'
Lin

In [84]:
Linefinder = RecognizeContentAsync()
await Linefinder.recognize_line_content(path_to_sample_forms="./data/Fig4_Andrade.png")

----Recognizing content from page #1----
Page has width: 1368.0 and height: 394.0, measured with unit: pixel
Line # 0 has word count '2' and text 'o a'
Line # 1 has word count '1' and text 'a'
Line # 2 has word count '2' and text 'o a'
Line # 3 has word count '1' and text 'Jul/2015'
Line # 4 has word count '1' and text '0-10'
Line # 5 has word count '1' and text 'a'
Line # 6 has word count '2' and text 'o a'
Line # 7 has word count '1' and text 'oa'
Line # 8 has word count '1' and text 'a'
Line # 9 has word count '2' and text 'o a'
Line # 10 has word count '2' and text 'o ab'
Line # 11 has word count '1' and text 'a'
Line # 12 has word count '1' and text '10-20-'
Line # 13 has word count '2' and text 'o a'
Line # 14 has word count '2' and text 'o a'
Line # 15 has word count '1' and text 'a'
Line # 16 has word count '2' and text 'Depth (cm)'
Line # 17 has word count '2' and text 'o a'
Line # 18 has word count '1' and text 'HD'
Line # 19 has word count '2' and text 'o ab'
Line # 20 has w