<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/opencv-projects-and-guide/ocr-works/sentence_doctor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Sentence-Doctor

Sentence Doctor is a T5 model that attempts to correct errors in sentences.

**Problem**

Many NLP models depend on upstream steps such as text extraction, OCR, speech-to-text and sentence boundary detection.

As a consequence, errors introduced by these steps in your NLP pipeline can degrade the quality of downstream models, especially since models are often trained on clean input.

**Solution**

Here we provide a model that attempts to reconstruct sentences based on their context (the surrounding text).

The task is pretty straightforward:

Given an "erroneous" sentence and its context, reconstruct the "intended" sentence.



**Reference**

https://huggingface.co/flexudy/t5-base-multi-sentence-doctor



##Setup

In [None]:
!pip -q install transformers[sentencepiece]

In [1]:
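from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

##Loading model

In [None]:
# AutoModelWithLMHead is deprecated; T5 is an encoder-decoder model,
# so AutoModelForSeq2SeqLM is the matching auto class.
tokenizer = AutoTokenizer.from_pretrained("flexudy/t5-base-multi-sentence-doctor")
model = AutoModelForSeq2SeqLM.from_pretrained("flexudy/t5-base-multi-sentence-doctor")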

##Inference

In [3]:
input_text = "repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
sentence

'I am a medical doctor.'
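
The prompt in the cell above follows the pattern `repair_sentence: <sentence> context: {<text before>}{<text after>} </s>`. As a minimal sketch, the helper below assembles that string from its three parts. The name `build_prompt` and its parameters are hypothetical and not part of the model or of `transformers`; the helper only reproduces the string layout used in this notebook.

```python
def build_prompt(sentence, before="", after=""):
    """Assemble a Sentence Doctor prompt (hypothetical helper).

    sentence: the damaged sentence to repair
    before:   text that precedes the sentence (may be empty)
    after:    text that follows the sentence (may be empty)
    """
    # Doubled braces in an f-string emit literal "{" and "}".
    return f"repair_sentence: {sentence} context: {{{before}}}{{{after}}} </s>"


prompt = build_prompt("m a medical doct", "That is my job I a", "or I save lives")
print(prompt)
```

The resulting string can be passed to `tokenizer.encode(...)` exactly as `input_text` is above. Leaving `before` and `after` empty yields the `{}{}` form used in the next cell.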

In [7]:
input_text = "repair_sentence: m a medical doct context: {}{} </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
sentence

'a me a medical doctor.'

In [10]:
ocr_text = """repair_sentence:
~ MAYO Visit date: 7/19/2018 Tv
7 CLINIC 7
aa 07/19/2018 - Diagnostic in Department of Otorhinolaryngology in Rochester, Minnesota (continued) a
7, Documents (continued) —
1 (Fg) Mavo clinic ae a a
13 . i) . Audiologist’ Raya Gomsor. Au i a3
— -” mudiological Evaluation ‘{cense No.: —
14 14
Te Olossopy: Right Ear. bar cane oe poping. merietane woe ite: Left Ear. bat ceed siren ymiganec ee Thtary appear ~~
46 th otrenernge.tgs f° ARSEASmant Ah sepals 416
16 immittance Oetad Acoustc Reflex Summary 16
_—_ Acousne Reflex Data Right tet —
17 Sound mught lon AL calor Cuetec: ONT Cr ONT 17
i. e exo i HK 4K Sper aks oy —
is £ Refiec ihe: Refea Decay 5) 18
_ 8 Gecar.sec- Refea Cacay: * —
90 & Decay ste Can ed Right Lett 20
— Soved Let fone Acesale rte Terme am _
24 & 01K oy ak (arm Vesna +e: 173 m”
__ ® Refer hi Gamperate: Anpulvce —_—

£ aS
22 ah Recay sec + Peas ‘cao: 2
— H ftokoc-be . Cease te 2208) —
2s B Deter om. Use Voce tar Nasonarce 23
24 Auymmaty Probability a
— Esbralee ccopabel, that opsered bews3 yee: 53 aut —_—
76 at aye oe Kens erated eaeng tag 2 ary Deter (000 2
— nates ize
28 Speech Recopnition mn Quiet 2
oo ‘Speech enatertal Ute Ear Level(dB} «Word = Phoneme Contralateral «= Masking Type = ! test z=
7 SL He scorn(*s) Score 1%} Maxkng i428) gems q@
— faopaurenws Gthi x oD e 15 we 0 Speocn weigwed rune 20 —_—
28 hropnaaens Wit ot te fy Speecu weghiaw myse 20 a
23 a 2
—_ Commvnicatron Assossment. Right Ginavral Lest SSO Listenmg Scote —
30 AAO’ Heating Loss Speech in Notse 0
— Loudness Perception gm ten eh a om es _—
+ TY ae ane Social Hearing incex cuanty au
Moat Combstome vase. Chi

* ne om ee Etfoctive Qismancesaw Ettorniny — \rsten'ng Effort =
— sreanMattagio Leven it —
33 Grae arama Participation Resticnon 3
— ve Drstence = * Personal Cotactors _
a4 Environmentat Cofactors “
— Audiometer / Transducer: aad -
— Printed on 11/12/21 6:53 AM Page 4
a “&

context: {}{}
"""
# The whole OCR page is passed as a single "sentence" with empty context.

input_ids = tokenizer.encode(ocr_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
sentence

'— — — — — — — — — — — — — — — '
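
The cell above sends an entire OCR page as one "sentence" with empty context, while generation is capped at `max_length=32` tokens, so a meaningful repair is unlikely. One possible alternative, sketched below under the assumption that each OCR line can be treated as a sentence, is to build one prompt per non-empty line and use the neighbouring lines as context. The function `line_prompts` is hypothetical, and the notebook does not show whether the model can recover lines this badly garbled.

```python
def line_prompts(page_text):
    """Build one repair prompt per non-empty line (hypothetical helper).

    The previous and next non-empty lines serve as the context pair.
    """
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    prompts = []
    for i, line in enumerate(lines):
        before = lines[i - 1] if i > 0 else ""
        after = lines[i + 1] if i + 1 < len(lines) else ""
        prompts.append(
            f"repair_sentence: {line} context: {{{before}}}{{{after}}} </s>"
        )
    return prompts


sample = "Mavo clinic\n\nmudiological Evaluation\n"
for p in line_prompts(sample):
    print(p)
```

Each prompt can then go through the same `tokenizer.encode` / `model.generate` / `tokenizer.decode` steps used in the cells above, one line at a time.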

In [20]:
ocr_text = """repair_sentence: mudiological Evaluation context: {audio}{}
"""
# A single OCR line with a one-word context hint; the prompt is passed as-is.

input_ids = tokenizer.encode(ocr_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
sentence

'ааааааааааааааа'