<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/opencv-projects-and-guide/ocr-works/sentence_doctor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Sentence-Doctor

Sentence Doctor is a T5 model that attempts to correct errors in sentences.

**Problem**

Many NLP models depend on upstream steps such as text extraction, OCR, speech-to-text and sentence boundary detection.

As a consequence, errors introduced by these steps in your NLP pipeline can degrade the quality of downstream models, especially since those models are often trained on clean input.

**Solution**

Here we provide a model that attempts to reconstruct sentences based on their context (surrounding text).

The task is pretty straightforward:

Given an "erroneous" sentence, and its context, reconstruct the "intended" sentence.
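
The input format used in the Inference cells can be sketched as a small helper. This is only an illustration, assuming the prompt layout `repair_sentence: ... context: {before}{after} </s>` that those cells use; the function name `build_repair_prompt` is hypothetical and is not part of the model or of `transformers`.

```python
def build_repair_prompt(sentence: str, before: str = "", after: str = "") -> str:
    """Format a damaged sentence and its surrounding text as a model prompt.

    `before` and `after` are the text fragments on either side of the
    sentence; leave them empty when no context is available.
    (Hypothetical helper, not part of the model or the transformers API.)
    """
    # Doubled braces emit literal "{" and "}" around each context fragment.
    return f"repair_sentence: {sentence} context: {{{before}}}{{{after}}} </s>"


prompt = build_repair_prompt("m a medical doct", "That is my job I a", "or I save lives")
print(prompt)
```

Calling the helper with only the sentence produces the empty-context form `context: {}{}`.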



**Reference**

https://huggingface.co/flexudy/t5-base-multi-sentence-doctor



##Setup

In [None]:
!pip -q install transformers[sentencepiece]

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

##Loading model

In [None]:
# T5 is an encoder-decoder model, so load it with the seq2seq auto class
# (AutoModelWithLMHead is deprecated).
tokenizer = AutoTokenizer.from_pretrained("flexudy/t5-base-multi-sentence-doctor")
model = AutoModelForSeq2SeqLM.from_pretrained("flexudy/t5-base-multi-sentence-doctor")

##Inference

In [None]:
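# Damaged sentence, plus the text before and after it as context.
input_text = "repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>"

# Tokenize the prompt and generate with greedy decoding (num_beams=1).
input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

# Decode the generated token ids back into text.
sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
sentence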

In [None]:
# Same damaged sentence, this time with no context supplied.
input_text = "repair_sentence: m a medical doct context: {}{} </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
sentence

In [None]:
# OCR text to repair goes here.
ocr_text = ""

# Wrap the text in the prompt format the model expects, with no context.
input_text = "repair_sentence: " + ocr_text.strip() + " context: {}{} </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
sentence

In [None]:
# Full prompt written out by hand, with "audio" as the preceding context.
input_text = "repair_sentence: mudiological Evaluation context: {audio}{}"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
sentence
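
The cells above repair one sentence at a time with a hand-written prompt. As a rough sketch, the same steps can be applied to a list of sentences, using each sentence's neighbours as its context. The helper names `prompts_with_neighbors` and `repair_all` are hypothetical. Using whole neighbouring sentences as context is an assumption, not something the model documentation prescribes. `tokenizer` and `model` are expected to be the objects created in the Loading model cell.

```python
from typing import List


def prompts_with_neighbors(sentences: List[str]) -> List[str]:
    """Build one prompt per sentence, using its neighbours as context.

    Hypothetical helper: the previous and next sentences (when present)
    fill the two context slots of the prompt format used above.
    """
    prompts = []
    for i, sentence in enumerate(sentences):
        before = sentences[i - 1] if i > 0 else ""
        after = sentences[i + 1] if i + 1 < len(sentences) else ""
        prompts.append(
            f"repair_sentence: {sentence} context: {{{before}}}{{{after}}} </s>"
        )
    return prompts


def repair_all(sentences: List[str], tokenizer, model, max_length: int = 32) -> List[str]:
    """Run every prompt through the model and return the decoded outputs."""
    repaired = []
    for prompt in prompts_with_neighbors(sentences):
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        # Greedy decoding, matching the settings used in the cells above.
        outputs = model.generate(input_ids, max_length=max_length, num_beams=1)
        repaired.append(
            tokenizer.decode(
                outputs[0],
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True,
            )
        )
    return repaired


# Inspect the prompts without loading the model.
for p in prompts_with_neighbors(["That is my job I a", "m a medical doct", "or I save lives"]):
    print(p)
```

With the model loaded, `repair_all(["That is my job I a", "m a medical doct", "or I save lives"], tokenizer, model)` would return one repaired string per input sentence. Raise `max_length` if the outputs come back truncated.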