Merged
Binary file added images/dpl-pdf.png
Binary file added images/dpl-spread.png
Binary file added images/dpl-words.png
Binary file added images/dpl-zip.png
Binary file added images/ninja_looking.png
51 changes: 32 additions & 19 deletions introduction.md
@@ -12,6 +12,14 @@ position: 0
table th:first-of-type {
width: 25%;
}

img[alt$="><"] {
display: block;
max-width: 100%;
height: auto;
margin: auto;
float: none!important;
}
</style>

# Welcome to Telerik Document Processing Libraries
@@ -26,15 +34,15 @@ table th:first-of-type {

## Libraries

Telerik Document Processing features the following components:
Telerik Document Processing features the following libraries:

|Library|Description|
|----|----|
| [RadPdfProcessing]({%slug radpdfprocessing-overview%})|A processing library that allows you to create, import, and export PDF documents from your code. You can use it in any web or desktop .NET application without relying on third-party software like Adobe Acrobat.|
|[RadSpreadProcessing]({%slug radspreadprocessing-overview%})|A powerful library that enables you to create applications with native support for spreadsheet documents. With RadSpreadProcessing, you can create spreadsheets from scratch, modify existing documents or convert between the most common spreadsheet formats. You can save the generated workbook to a local file, stream, or stream it to the client browser.|
|[RadSpreadStreamProcessing]({%slug radspreadstreamprocessing-overview%})|Spread streaming is a document processing paradigm that allows you to create or read big spreadsheet documents with great performance and minimal memory footprint. The key for the memory efficiency is that the spread streaming library writes the spreadsheet content directly to a stream without creating and preserving the spreadsheet document model in memory.|
|[RadWordsProcessing]({%slug radwordsprocessing-overview%})|A processing library that allows you to create, modify and export documents to a variety of formats. Through the API, you can access each element in the document and modify, remove it or add a new one. The generated content you can save as a stream, as a file, or sent it to the client browser.|
|[RadZipLibrary]({%slug radziplibrary-overview%})| It allows you to compress and combine files in ZIP archives, browse and extract files from existing ZIP archives and compress streams for easy file shipping and reduced storage space.|
|![Pdf](images/dpl-pdf.png) [RadPdfProcessing]({%slug radpdfprocessing-overview%})|A processing library that allows you to create, import, and export PDF documents from your code. You can use it in any web or desktop .NET application without relying on third-party software like Adobe Acrobat.|
|![Spread](images/dpl-spread.png) [RadSpreadProcessing]({%slug radspreadprocessing-overview%})|A powerful library that enables you to create applications with native support for spreadsheet documents. With RadSpreadProcessing, you can create spreadsheets from scratch, modify existing documents, or convert between the most common spreadsheet formats. You can save the generated workbook to a local file or a stream, or stream it to the client browser.|
|![SpreadStream](images/dpl-spread.png) [RadSpreadStreamProcessing]({%slug radspreadstreamprocessing-overview%})|Spread streaming is a document processing paradigm that allows you to create or read large spreadsheet documents with great performance and a minimal memory footprint. The key to the memory efficiency is that the spread streaming library writes the spreadsheet content directly to a stream without creating and preserving the spreadsheet document model in memory.|
|![Words](images/dpl-words.png) [RadWordsProcessing]({%slug radwordsprocessing-overview%})|A processing library that allows you to create, modify, and export documents to a variety of formats. Through the API, you can access each element in the document and modify or remove it, or add a new one. You can save the generated content as a stream or as a file, or send it to the client browser.|
|![Zip](images/dpl-zip.png) [RadZipLibrary]({%slug radziplibrary-overview%})|A library that allows you to compress and combine files in ZIP archives, browse and extract files from existing ZIP archives, and compress streams for easy file shipping and reduced storage space.|

## Key Features

@@ -52,21 +60,26 @@ For more details about the benefits of using Telerik Document Processing, see th

## Supported Formats


The Telerik Document Processing libraries support the following file formats:

* DOCX (Word Document)
* DOC (Word 97-2003 Document)
* DOT (Word 97-2003 Template)
* HTML
* PDF
* RTF
* TXT
* XLSX (Excel Workbook)
* XLS (Excel 97-2003 Workbook)
* XLSM (macro-enabled spreadsheet created by Microsoft Excel) *Macros are only preserved during import and export. They cannot be executed or changed in the code.
* CSV
* ZIP
![Ninja Looking ><](images/ninja_looking.png)

|Format|Library|Provider|
|----|----|----|
|**DOCX (Word Document)**|[RadWordsProcessing]({%slug radwordsprocessing-overview%})|[DocxFormatProvider]({%slug radwordsprocessing-formats-and-conversion-docx-docxformatprovider%})|
|**DOC (Word 97-2003 Document)**|[RadWordsProcessing]({%slug radwordsprocessing-overview%})|[DocFormatProvider]({%slug radwordsprocessing-formats-and-conversion-doc-docformatprovider%}) <sup>Import only</sup>|
|**DOT (Word 97-2003 Template)**|[RadWordsProcessing]({%slug radwordsprocessing-overview%})|[DocFormatProvider]({%slug radwordsprocessing-formats-and-conversion-doc-docformatprovider%}) <sup>Import only</sup>|
|**HTML**|[RadWordsProcessing]({%slug radwordsprocessing-overview%})|[HtmlFormatProvider]({%slug radwordsprocessing-formats-and-conversion-html-htmlformatprovider%})|
|**PDF**|[RadWordsProcessing]({%slug radwordsprocessing-overview%}) <br> [RadPdfProcessing]({%slug radpdfprocessing-overview%}) <br> [RadSpreadProcessing]({%slug radspreadprocessing-overview%})|[PdfFormatProvider in RadWordsProcessing]({%slug radwordsprocessing-formats-and-conversion-pdf-pdfformatprovider%}) <sup>Export only</sup> <br> [PdfFormatProvider in RadPdfProcessing]({%slug radpdfprocessing-formats-and-conversion-pdf-pdfformatprovider%}) <br> [PdfFormatProvider in RadSpreadProcessing]({%slug radspreadprocessing-formats-and-conversion-pdf-pdfformatprovider%}) <sup>Export only</sup>|
|**RTF**|[RadWordsProcessing]({%slug radwordsprocessing-overview%})|[RtfFormatProvider]({%slug radwordsprocessing-formats-and-conversion-rtf-rtfformatprovider%})|
|**TXT**|[RadWordsProcessing]({%slug radwordsprocessing-overview%}) <br> [RadPdfProcessing]({%slug radpdfprocessing-overview%}) <br> [RadSpreadProcessing]({%slug radspreadprocessing-overview%})|[TxtFormatProvider in RadWordsProcessing]({%slug radwordsprocessing-formats-and-conversion-txt-txtformatprovider%}) <br> [TextFormatProvider in RadPdfProcessing]({%slug radpdfprocessing-formats-and-conversion-plain-text-textformatprovider%}) <sup>Export only</sup> <br> [TxtFormatProvider in RadSpreadProcessing]({%slug radspreadprocessing-formats-and-conversion-txt-txtformatprovider%})|
|**XLSX (Excel Workbook)**|[RadSpreadProcessing]({%slug radspreadprocessing-overview%}) <br> [RadSpreadStreamProcessing]({%slug radspreadstreamprocessing-overview%})|[XlsxFormatProvider]({%slug radspreadprocessing-formats-and-conversion-xlsx-xlsxformatprovider%})|
|**XLS (Excel 97-2003 Workbook)**|[RadSpreadProcessing]({%slug radspreadprocessing-overview%})|[XlsFormatProvider]({%slug radspreadprocessing-formats-and-conversion-xls-xlsformatprovider%})|
|**XLSM (macro-enabled spreadsheet created by Microsoft Excel)** <sup>Macros are only preserved during import and export. They cannot be executed or changed in the code.</sup>|[RadSpreadProcessing]({%slug radspreadprocessing-overview%})|[XlsmFormatProvider]({%slug radspreadprocessing-formats-and-conversion-xlsm-xlsmformatprovider%})|
|**CSV**|[RadSpreadProcessing]({%slug radspreadprocessing-overview%}) <br> [RadSpreadStreamProcessing]({%slug radspreadstreamprocessing-overview%})|[CsvFormatProvider]({%slug radspreadprocessing-formats-and-conversion-csv-csvformatprovider%})|
|**DataTable**|[RadSpreadProcessing]({%slug radspreadprocessing-overview%})|[DataTableFormatProvider]({%slug radspreadprocessing-formats-and-conversion-using-data-table-format-provider%})|
|**ZIP**|[RadZipLibrary]({%slug radziplibrary-overview%})|[ZipArchive]({%slug radziplibrary-gettingstarted%})|
|**Image**|[RadPdfProcessing]({%slug radpdfprocessing-overview%})|[SkiaImageFormatProvider]({%slug radpdfprocessing-formats-and-conversion-image-using-skiaimageformatprovider%}) <sup>Export only</sup> <br> [OcrFormatProvider]({%slug radpdfprocessing-formats-and-conversion-ocr-ocrformatprovider%}) <sup>Import only</sup> |

![DPL Ninja](images/dpl-formats.png)

51 changes: 51 additions & 0 deletions knowledge-base/extract-text-from-pdf.md
@@ -0,0 +1,51 @@
---
title: Extracting Text from PDF Documents
description: Learn how to extract the text from a PDF document using RadPdfProcessing from the Telerik Document Processing libraries.
type: how-to
page_title: How to Extract the Text from PDF Documents
slug: extract-text-from-pdf
tags: pdf, document, processing, text, extract, content
res_type: kb
ticketid: 1657503
---

## Environment

| Version | Product | Author |
| ---- | ---- | ---- |
| 2025.1.128| RadPdfProcessing |[Desislava Yordanova](https://www.telerik.com/blogs/author/desislava-yordanova)|

## Description

Learn how to extract the text content in a PDF document.

## Solution

Follow the steps:

1\. Import the PDF document using the [PdfFormatProvider]({%slug radpdfprocessing-formats-and-conversion-pdf-pdfformatprovider%}).

2\. Export the RadFixedDocument's content to text using the [TextFormatProvider]({%slug radpdfprocessing-formats-and-conversion-plain-text-textformatprovider%}). If the PDF document contains text fragments, they are exported to the plain-text result.

```csharp
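using System.Diagnostics;
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model;

string filePath = "input.pdf";

// Import the PDF file into a RadFixedDocument.
PdfFormatProvider pdf_provider = new PdfFormatProvider();
RadFixedDocument fixed_document;
using (Stream stream = File.OpenRead(filePath))
{
    fixed_document = pdf_provider.Import(stream);
}

// Export the document's text fragments to a plain-text string.
Telerik.Windows.Documents.Fixed.FormatProviders.Text.TextFormatProvider provider = new Telerik.Windows.Documents.Fixed.FormatProviders.Text.TextFormatProvider();
string documentContent = provider.Export(fixed_document);
Debug.WriteLine(documentContent);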
```
>important Depending on the document's internal content, the **TextFormatProvider** may not cover all cases. A common scenario is a document with scanned images that contain text information. In this case, the above approach does not produce plain text because the text inside is actually not text but [Path]({%slug radpdfprocessing-model-path%}) elements. For such documents, use the [OcrFormatProvider]({%slug radpdfprocessing-formats-and-conversion-ocr-ocrformatprovider%}), which converts images of typed, handwritten, or printed text from a scanned document into machine-encoded text.

## See Also

- [RadPdfProcessing]({%slug radpdfprocessing-overview%})
- [OcrFormatProvider]({%slug radpdfprocessing-formats-and-conversion-ocr-ocrformatprovider%})
- [TextFormatProvider]({%slug radpdfprocessing-formats-and-conversion-plain-text-textformatprovider%})
- [Summarizing the Text Content of PDF Documents using Text Analytics with Azure AI services]({%slug summarize-pdf-content%})

Binary file added knowledge-base/images/azure-ai-key.png
148 changes: 148 additions & 0 deletions knowledge-base/summarize-pdf-content.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,148 @@
---
title: Summarizing the Text Content of PDF Documents using Text Analytics with Azure AI services
description: Learn how to summarize the text content from a PDF document using RadPdfProcessing and Text Analytics with Azure AI services.
type: how-to
page_title: How to Summarize the Text Content of PDF Documents using Text Analytics with Azure AI services
slug: summarize-pdf-content
tags: pdf, document, processing, text, summarize, summary, content, azure
res_type: kb
ticketid: 1657503
---

## Environment

| Version | Product | Author |
| ---- | ---- | ---- |
| 2025.1.128| RadPdfProcessing |[Desislava Yordanova](https://www.telerik.com/blogs/author/desislava-yordanova)|

## Description

Learn how to summarize the text content of a PDF document using [Text Analytics with Azure AI services](https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-text-analytics-use-mmlspark).

## Solution

Follow the steps:

1\. Add the following **required** assemblies/NuGet packages to your project:

* [Azure.AI.TextAnalytics](https://www.nuget.org/packages/Azure.AI.TextAnalytics)
* Telerik.Documents.Fixed
* Telerik.Documents.Core
* Telerik.Zip

2\. Generate your Azure AI key and endpoint: [Get your credentials from your Azure AI services resource](https://learn.microsoft.com/en-us/azure/ai-services/use-key-vault?tabs=azure-cli&pivots=programming-language-csharp)

![Azure AI key](images/azure-ai-key.png)

3\. [Extract the text content from a PDF document]({%slug extract-text-from-pdf%}).

4\. Use the custom implementation to summarize the text content extracted in step 3:

```csharp
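using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static async Task Main(string[] args)
{
    // Placeholders: replace with the key and endpoint generated in step 2.
    string azure_key = "<your-azure-ai-key>";
    string azure_endpoint = "<your-azure-ai-endpoint>";

    // Import the PDF (with a 10-second timeout) and export its content as plain text.
    Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.PdfFormatProvider pdf_provider = new Telerik.Windows.Documents.Fixed.FormatProviders.Pdf.PdfFormatProvider();
    Telerik.Windows.Documents.Fixed.FormatProviders.Text.TextFormatProvider text_provider = new Telerik.Windows.Documents.Fixed.FormatProviders.Text.TextFormatProvider();
    Telerik.Windows.Documents.Fixed.Model.RadFixedDocument document = pdf_provider.Import(File.ReadAllBytes("PdfDocument.pdf"), TimeSpan.FromSeconds(10));
    string documentTextContent = text_provider.Export(document);

    // Await the summary instead of blocking the thread on .Result.
    AzureTextSummarizationProvider summarizationProvider = new AzureTextSummarizationProvider(azure_key, azure_endpoint);
    string summary = await summarizationProvider.SummarizeText(documentTextContent);

    Console.WriteLine(summary);
}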

public class AzureTextSummarizationProvider
{
    private string languageKey;
    private string languageEndpoint;

    public AzureTextSummarizationProvider(string azure_key, string azure_endpoint)
    {
        this.languageKey = azure_key;
        this.languageEndpoint = azure_endpoint;
    }

    public async Task<string> SummarizeText(string text)
    {
        Azure.AzureKeyCredential credentials = new Azure.AzureKeyCredential(languageKey);
        Uri endpoint = new Uri(languageEndpoint);

        Azure.AI.TextAnalytics.TextAnalyticsClient client = new Azure.AI.TextAnalytics.TextAnalyticsClient(endpoint, credentials);

        // Prepare analyze operation input. You can add multiple documents to this list and perform the same
        // operation to all of them.
        List<string> batchInput = new List<string>
        {
            text
        };

        Azure.AI.TextAnalytics.TextAnalyticsActions actions = new Azure.AI.TextAnalytics.TextAnalyticsActions()
        {
            ExtractiveSummarizeActions = [new Azure.AI.TextAnalytics.ExtractiveSummarizeAction()]
        };

        // Start analysis process.
        Azure.AI.TextAnalytics.AnalyzeActionsOperation operation = await client.StartAnalyzeActionsAsync(batchInput, actions);
        await operation.WaitForCompletionAsync();

        System.Text.StringBuilder stringBuilder = new System.Text.StringBuilder();
        // View operation status.
        stringBuilder.AppendLine($"AnalyzeActions operation has completed");
        stringBuilder.AppendLine();

        stringBuilder.AppendLine($"Created On : {operation.CreatedOn}");
        stringBuilder.AppendLine($"Expires On : {operation.ExpiresOn}");
        stringBuilder.AppendLine($"Id : {operation.Id}");
        stringBuilder.AppendLine($"Status : {operation.Status}");

        stringBuilder.AppendLine();
        // View operation results.
        await foreach (Azure.AI.TextAnalytics.AnalyzeActionsResult documentsInPage in operation.Value)
        {
            IReadOnlyCollection<Azure.AI.TextAnalytics.ExtractiveSummarizeActionResult> summaryResults = documentsInPage.ExtractiveSummarizeResults;

            foreach (Azure.AI.TextAnalytics.ExtractiveSummarizeActionResult summaryActionResults in summaryResults)
            {
                if (summaryActionResults.HasError)
                {
                    stringBuilder.AppendLine($" Error!");
                    stringBuilder.AppendLine($" Action error code: {summaryActionResults.Error.ErrorCode}.");
                    stringBuilder.AppendLine($" Message: {summaryActionResults.Error.Message}");
                    continue;
                }

                foreach (Azure.AI.TextAnalytics.ExtractiveSummarizeResult documentResults in summaryActionResults.DocumentsResults)
                {
                    if (documentResults.HasError)
                    {
                        stringBuilder.AppendLine($" Error!");
                        stringBuilder.AppendLine($" Document error code: {documentResults.Error.ErrorCode}.");
                        stringBuilder.AppendLine($" Message: {documentResults.Error.Message}");
                        continue;
                    }

                    stringBuilder.AppendLine($" Extracted the following {documentResults.Sentences.Count} sentence(s):");
                    stringBuilder.AppendLine();

                    foreach (Azure.AI.TextAnalytics.ExtractiveSummarySentence sentence in documentResults.Sentences)
                    {
                        stringBuilder.Append($"{sentence.Text} ");
                    }
                }
            }
        }

        string result = stringBuilder.ToString();

        return result;
    }
}
```
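
The `Main` method above expects the `azure_key` and `azure_endpoint` values obtained in step 2. The following is a minimal sketch, assuming you prefer to keep these values in environment variables rather than in source code. The variable names `AZURE_LANGUAGE_KEY` and `AZURE_LANGUAGE_ENDPOINT` and the `AzureLanguageSettings` helper are hypothetical and are not part of the Telerik or Azure APIs; only `Environment.GetEnvironmentVariable` is a standard .NET call.

```csharp
using System;

public static class AzureLanguageSettings
{
    // Hypothetical variable names; use whatever names your environment defines.
    public const string KeyVariable = "AZURE_LANGUAGE_KEY";
    public const string EndpointVariable = "AZURE_LANGUAGE_ENDPOINT";

    // Returns the value of a required environment variable,
    // or throws a descriptive exception when it is missing or empty.
    public static string GetRequired(string variableName)
    {
        var value = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Environment variable '{variableName}' is not set.");
        }

        return value;
    }
}
```

With such a helper, you could pass `AzureLanguageSettings.GetRequired(AzureLanguageSettings.KeyVariable)` and `AzureLanguageSettings.GetRequired(AzureLanguageSettings.EndpointVariable)` to the `AzureTextSummarizationProvider` constructor. Failing early with a clear message is a design choice here: it avoids sending a request with an empty credential.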

## See Also

- [Extracting Text from PDF Documents]({%slug extract-text-from-pdf%})
- [OcrFormatProvider]({%slug radpdfprocessing-formats-and-conversion-ocr-ocrformatprovider%})
- [TextFormatProvider]({%slug radpdfprocessing-formats-and-conversion-plain-text-textformatprovider%})

@@ -12,7 +12,7 @@ position: 1

Since _Q1 2025_, the __RadPdfProcessing__ library supports Optical Character Recognition (OCR). OCR is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text from a scanned document. The library uses the **OcrFormatProvider** class that allows you to import an image, which is returned as a [RadFixedPage]({%slug radpdfprocessing-model-radfixedpage%}). By default, the **OcrFormatProvider** takes as a parameter a **TesseractOcrProvider** implementation, which relies on the third-party library [Tesseract](https://github.com/tesseract-ocr/tesseract). However, you can provide any [custom implementation]({%slug radpdfprocessing-formats-and-conversion-ocr-custom-ocrprovider%}) instead.

You can find all the dependencies and required steps for the implementation in the [Prerequisites]({%slug radpdfprocessing-formats-and-conversion-ocr-prerequisites%}) artilce.
You can find all the dependencies and required steps for the implementation in the [Prerequisites]({%slug radpdfprocessing-formats-and-conversion-ocr-prerequisites%}) article.

## TesseractOcrProvider Public API

@@ -35,3 +35,4 @@ You can find all the dependencies and required steps for the implementation in t
* [Prerequisites]({%slug radpdfprocessing-formats-and-conversion-ocr-prerequisites%})
* [Timeout Mechanism]({%slug timeout-mechanism-in-dpl%})
* [Implementing a Custom OCR Provider]({%slug radpdfprocessing-formats-and-conversion-ocr-custom-ocrprovider%})
* [Extracting Text from PDF Documents]({%slug extract-text-from-pdf%})