#### Loading Documents
When documents are uploaded to the platform, they are represented by a special class, `BlobInput`.

To test Abacus functionality locally in your notebook, load your files as `BlobInput` objects.

In [None]:
# Here, we load a training file from the notebook's current directory.
# You can add files to Jupyter Notebook by dragging and dropping them.
from abacusai.client import BlobInput
import abacusai
client = abacusai.ApiClient('YOUR_API_KEY')
document = BlobInput.from_local_file("YOUR_DOCUMENT.pdf/word/etc")

#### Extract text from a local document
You can extract text using two methods:
1. Embedded text extraction: extracts the text that is already stored in the document. It is fast and works well for modern documents.
2. OCR: extracts the text as the end user sees it. It works very well for scanned documents, documents that contain tables, and similar cases.

First, let's take a look at **Embedded Text Extraction**:

In [45]:
extracted_doc_data = client.extract_document_data(document.contents)

# Print the first 100 characters of page 0
print(extracted_doc_data.pages[0][0:100])
print()
# Print the first 100 characters of all embedded text
print(extracted_doc_data.embedded_text[0:100])

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
____________________________

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
____________________________
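
Before moving on to OCR, it can help to check whether the embedded text is good enough. The following is a minimal sketch, not part of the Abacus SDK: `needs_ocr` and its `min_chars_per_page` threshold are hypothetical, and the sketch assumes `pages` is a list with one string per page, as the cell above suggests.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True when the embedded text looks too sparse to rely on.

    page_texts: assumed to be a list of strings, one per page
                (for example, extracted_doc_data.pages).
    min_chars_per_page: hypothetical threshold; tune it for your documents.
    """
    if not page_texts:
        # No embedded text at all, so the document is probably scanned.
        return True
    # Flag the document if any page has almost no embedded text.
    return any(
        len((text or "").strip()) < min_chars_per_page
        for text in page_texts
    )
```

In the notebook this would be called as `needs_ocr(extracted_doc_data.pages)`; a `True` result suggests running the OCR extraction.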


Now, let's extract data using **OCR**. Note that `ocr_mode` accepts multiple values and that several other settings are available. Refer to the official Python SDK API reference for the full list.

In [55]:
extracted_doc_data = client.extract_document_data(
    document.contents,
    document_processing_config={'extract_bounding_boxes': True, 'ocr_mode': 'DEFAULT', 'use_full_ocr': True},
)

# Print the first 100 characters of the OCR-extracted text
print(extracted_doc_data.extracted_text[0:100])

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
(Mark One)
ANNUAL REPORT PUR
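
If you run OCR with different settings in several cells, a small helper keeps the configuration in one place. This is only a sketch: `build_ocr_config` is a hypothetical helper, and it uses only the three keys shown in the cell above. Other keys and valid `ocr_mode` values should be taken from the SDK reference.

```python
def build_ocr_config(ocr_mode="DEFAULT", use_full_ocr=True, extract_bounding_boxes=True):
    """Build a document_processing_config dict from the keys used above."""
    return {
        "extract_bounding_boxes": extract_bounding_boxes,
        "ocr_mode": ocr_mode,
        "use_full_ocr": use_full_ocr,
    }


config = build_ocr_config()
# In the notebook, pass it to the client call from the previous cell:
# client.extract_document_data(document.contents, document_processing_config=config)
```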


#### Extract text from a document that has already been uploaded into the platform

When you upload documents directly into the platform, depending on the settings you choose, you will already have access to `embedded_text`, `extracted_text`, and so on. Here is how you can load the text of a file that has already been uploaded into the platform:

1. Find the `doc_id`. It is in the `doc_id` column of the feature group where the documents were uploaded.
2. Use `get_docstore_document_data` to get the document's data.

If OCR was not used when ingesting the document, then `extracted_text` won't exist.

In [53]:
doc_data = client.get_docstore_document_data('115fd750d0-000000000-bde8f7f6ce6065337e599fcac194739685fb3d3060650f6d7ef95bac914c72bc')
# Print the first 100 characters of the embedded text
print('------------------------------')
print('Embedded Text:\n')
print(doc_data.embedded_text[0:100])
print('------------------------------')
# Print the first 100 characters of the OCR-extracted text
print('Extracted (OCR) Text:\n')
print(doc_data.extracted_text[0:100])

------------------------------
Embedded Text:

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☒
ANNUA
------------------------------
Extracted (OCR) Text:

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
(Mark One)
ANNUAL REPORT PUR
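
Because `extracted_text` may be missing when OCR was not used at ingestion, it can be safer to read the text defensively. The sketch below uses a hypothetical helper, `best_available_text`, and a stand-in object in place of a real response. It assumes only the `embedded_text` and `extracted_text` attribute names shown above.

```python
from types import SimpleNamespace


def best_available_text(doc_data):
    """Return (attribute_name, text), preferring OCR text over embedded text."""
    for attr in ("extracted_text", "embedded_text"):
        # getattr with a default covers both a missing attribute and None.
        text = getattr(doc_data, attr, None)
        if text:
            return attr, text
    return None, ""


# Stand-in for a document ingested without OCR (not a real API response).
sample = SimpleNamespace(embedded_text="UNITED STATES", extracted_text=None)
source, text = best_available_text(sample)
print(source, text[:100])
```

In the notebook, `best_available_text(doc_data)` would replace direct access to `doc_data.extracted_text`.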


#### Load a feature group with documents locally as a pandas dataframe

In [None]:
# To access a docstore later, or one that was created outside this notebook,
# look up its feature group by name (describe_feature_group_by_table_name)
# or by ID (describe_feature_group).

df = client.describe_feature_group_by_table_name('YOUR_FEATURE_GROUP_NAME').load_as_pandas_documents(
    doc_id_column='doc_id', document_column='page_infos')
df['page_infos'][0].keys()
# dict_keys(['pages', 'tokens', 'metadata', 'extracted_text'])

# pages: the embedded text of the document, one entry per page
# extracted_text: the OCR-extracted text of the document
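
Once the documents are loaded, you may want one record per page instead of one per document. Here is a minimal sketch, assuming each `page_infos` value is a dict whose `pages` entry is a list of page strings. `iter_pages` is a hypothetical helper, and the sample data stands in for `df['page_infos']`.

```python
def iter_pages(page_infos_column):
    """Yield (row_index, page_number, page_text) for every page of every document."""
    for row_index, info in enumerate(page_infos_column):
        # 'pages' is assumed to hold the embedded text, one string per page.
        for page_number, page_text in enumerate(info.get("pages") or []):
            yield row_index, page_number, page_text


# Stand-in data with the same keys as the page_infos dicts above.
sample_column = [
    {"pages": ["first page", "second page"], "extracted_text": "first page second page"},
    {"pages": ["only page"], "extracted_text": None},
]
rows = list(iter_pages(sample_column))
```

In the notebook, `list(iter_pages(df['page_infos']))` would produce the same kind of records for the loaded feature group.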