# Text Detection model for reading RCs


- This notebook uses Amazon Textract to extract the text from the given dataset of photos.
- To learn more, see [What is Amazon Textract?](https://docs.aws.amazon.com/textract/latest/dg/what-is.html)
- Why did I decide to use Amazon Textract? See the section at the bottom of the notebook.
- [AWS Full Documentation](https://docs.aws.amazon.com/)

## <font color='darkblue'>Important Notes</font>
- When this notebook was written, Amazon Textract was not available in the Asia Pacific (Mumbai) region, ap-south-1, which is the nearest region for me.
- I've used the Asia Pacific (Singapore) region, ap-southeast-1, in this notebook. It was the nearest region with support for Amazon Textract.
- Make sure the S3 bucket used below is in the same region as Textract, otherwise the calls won't work.

- I've modified some of the images given in the dataset.

## Set Up before proceeding
- Follow the guide linked below to complete the required setup.
- [Getting started with Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/getting-started.html)

### Importing the packages needed
- **re** is Python's regular expression module. [Learn more about regular expressions](https://developers.google.com/edu/python/regular-expressions)
- [YouTube link for regular expressions](https://www.youtube.com/watch?v=K8L6KVGG-7o) - I recommend watching this video if you want to understand how regular expressions work.

In [2]:
import pandas as pd
import boto3
import re

- Put your dataset folder in an S3 bucket and name it.
- [Getting Started with Amazon Simple Storage Service](https://docs.aws.amazon.com/AmazonS3/latest/gsg/index.html)

- Replace the bucket name with your S3 bucket name 
```
 bucket='YOUR_BUCKET_NAME'
```

In [3]:
# S3 Bucket
bucket='bucketfortextdetection3'

#### Function for analyzing a photo to extract information
- Input parameters:
    - photo: Name of a photo inside the S3 bucket
    - bucket: Name of an S3 bucket
    
- Output parameters:
    - data: The text detected in the photo, returned as a single string
    
Learn more about [Analyzing Document Text with Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/analyzing-document-text.html)

In [4]:
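def analyze(photo, bucket):
    # Uses the region and credentials of the default boto3 configuration
    tex_client = boto3.client('textract')

    # Analyze the image stored in S3, including tables and form fields
    response = tex_client.analyze_document(
        Document={'S3Object': {'Bucket': bucket, 'Name': photo}},
        FeatureTypes=['TABLES', 'FORMS'])

    # Concatenate the text of every block that has any;
    # blocks such as pages, tables and cells carry no 'Text' field and are skipped
    data = ''
    for block in response['Blocks']:
        if block.get('Text'):
            data += block['Text'] + ' '

    return data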
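
Textract reports detected text both as `LINE` blocks and as individual `WORD` blocks, which is why the text printed later in this notebook repeats itself. A minimal sketch of keeping only the `LINE` blocks follows; `lines_only` is a hypothetical helper, and `sample_response` is hand-written to mimic the shape of an `analyze_document` response, not real output.

```python
def lines_only(response):
    """Join the text of LINE blocks only, skipping WORD blocks and blocks without text."""
    return ' '.join(
        block['Text']
        for block in response.get('Blocks', [])
        if block.get('BlockType') == 'LINE' and block.get('Text')
    )


# Hand-written stand-in for a Textract response (not real output)
sample_response = {
    'Blocks': [
        {'BlockType': 'PAGE'},
        {'BlockType': 'LINE', 'Text': 'REGN. NO DL2CAT9109'},
        {'BlockType': 'LINE', 'Text': 'MFG DT 07/2015'},
        {'BlockType': 'WORD', 'Text': 'REGN.'},
        {'BlockType': 'WORD', 'Text': 'NO'},
        {'BlockType': 'WORD', 'Text': 'DL2CAT9109'},
    ]
}

print(lines_only(sample_response))
```

Whether to drop the `WORD` blocks is a design choice: the regular expressions below only need one copy of the text, but the duplicated text gives them a second chance when a line was split badly.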

## Extracting Information
- License plate number or Registration number [Learn more](http://www.rto.org.in/vehicle-registration-plates-of-india/format-of-number-plates.htm)
- VIN or chassis number (typically 17 characters long)
- Name
- Engine number
- Registration date
- Mfg. date

Regular expressions are used to identify patterns in the detected text and extract each piece of information.

Let's see how the functions work:
- First, the function takes the text as input.
- Then it assigns the pattern for the required information to the variable 'reg'.
- It searches for the pattern in the text.
- If the pattern is found, it returns the matched value; otherwise it returns None.
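
The four steps above can be sketched as one generic helper. `find_field` is a hypothetical name used only for illustration; the pattern is the one used by `get_mfg_date` in this notebook, and the text is a fragment of the sample output.

```python
import re


def find_field(pattern, text, group=2):
    """Compile the pattern, search the text, and return one capture group or None."""
    reg = re.compile(pattern)                 # the pattern for the required field
    mo = reg.search(text)                     # look for it anywhere in the text
    return mo.group(group) if mo else None    # value if found, otherwise None


# Pattern copied from get_mfg_date; group 1 is the label, group 2 the value
MFG = (r'(MFG[ .]?DT[. ]?:?[ ]?|Month[/]\sYrof\s?|Month\sand\sYear\sof\sMfg\.?\s)'
       r'(\d{1,2}/\d{4}|\d{2}/\d{2}/\d{4})')

print(find_field(MFG, 'NO OF CYL 03 MFG DT 07/2015 UNLADEN WT 000915'))
print(find_field(MFG, 'no date here'))
```

Checking the match object explicitly keeps the "not found" case separate from genuine programming errors, which a bare `except` would hide.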

In [5]:
# Registration Number
def get_reg_number(data):
  # Group 1 matches the label (e.g. 'REGN NO', 'Registration No.'); group 2 captures the number
  reg = re.compile(r'(REGN ?.?m? ?NO ?.? ?\w*|Registration No.\s)([A-Z]{2}[0-9]{1,2}[\t A-Z-]{2,4}[0-9]+)') 
  mo = reg.search(data)
  # search() returns None when the pattern is not found
  reg_num = mo.group(2) if mo else None
  return reg_num

In [6]:
# Chassis Number or VIN
def get_vin_no(data):
  # Group 1 matches the label ('CH NO', 'Chassis No.'); group 2 captures 11 to 17 word characters
  reg = re.compile(r'(CH\.?[ ]?NO[ ]?|Chass?is No\.?) ?[:-]? ?(\w{11,17})')
  mo = reg.search(data)
  # search() returns None when the pattern is not found
  vin = mo.group(2) if mo else None
  return vin
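
A captured value can be sanity-checked before it is stored. The sketch below assumes the chassis number is a standard 17-character VIN, which uses digits and capital letters except I, O and Q. `looks_like_vin` is a hypothetical helper, and shorter or non-standard chassis numbers (the pattern above accepts 11 to 17 characters) would fail this check.

```python
import re

# 17 characters: digits and capital letters except I, O and Q
VIN_SHAPE = re.compile(r'[A-HJ-NPR-Z0-9]{17}')


def looks_like_vin(value):
    """Return True when the value has the shape of a standard 17-character VIN."""
    return bool(value) and VIN_SHAPE.fullmatch(value) is not None


print(looks_like_vin('MA3ETDE1S00218363'))   # chassis number from the sample text
print(looks_like_vin('MA3ETDE1S0021'))       # too short
print(looks_like_vin(None))                  # nothing was extracted
```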

In [7]:
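# Name
def get_name(data):
  # Group 1 matches the label ('Name & Address', 'NAME', 'Name:'); group 2 captures up to three words
  # Note: '{1,\s}' is not a valid quantifier, so re treats it as literal text and
  # the two-word alternative in practice never matches
  reg = re.compile(r'(Name\s&\sAddress\s|N[Aa][Mm][Ee][ _]*:?[ ]*)([a-zA-Z.]+\s{1}[a-zA-Z]{3,}\s{1}[a-zA-Z]{1,}\s|[a-zA-Z.]+\s{1}[a-zA-Z]{1,\s}|[a-zA-Z.]+\s)')
  mo = reg.search(data)
  # search() returns None when the pattern is not found
  name = mo.group(2) if mo else None
  return name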
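
In the name pattern, `{1,\s}` is not a valid quantifier, so Python's `re` module reads it as literal characters and the two-word alternative can only match text that literally contains `{1, }`. The sketch below demonstrates that behaviour and shows one possible rewrite using `+\s`. The rewritten `NAME_PATTERN`, the helper `get_name_sketch`, and the two-word sample string are assumptions for illustration, not the notebook's pattern or data.

```python
import re

# The malformed quantifier is matched literally: '{', '1', ',', whitespace, '}'
print(re.fullmatch(r'[a-z]{1,\s}', 'ab'))        # None
print(re.fullmatch(r'[a-z]{1,\s}', 'a{1, }'))    # a match object

# Possible rewrite: try three words, then two words, then a single word
NAME_PATTERN = re.compile(
    r'(Name\s&\sAddress\s|N[Aa][Mm][Ee][ _]*:?[ ]*)'
    r'([a-zA-Z.]+\s[a-zA-Z]{3,}\s[a-zA-Z]+\s|[a-zA-Z.]+\s[a-zA-Z]+\s|[a-zA-Z.]+\s)'
)


def get_name_sketch(text):
    """Return the captured name without the trailing space, or None."""
    mo = NAME_PATTERN.search(text)
    return mo.group(2).strip() if mo else None


print(get_name_sketch('NAME ANOOP SURESH DHAWALE S/WID OF SURESH'))
print(get_name_sketch('NAME JOHN DOE S/O X'))   # made-up two-word example
```

With this rewrite, names that previously came back as a single word may come back as two words, so results would differ from the table shown later in this notebook.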


In [8]:
# Engine number
def get_engine_no(data):
  # Group 1 matches the label ('E NO', 'Engine', 'Engine No'); group 2 captures 11 to 17 word characters
  reg = re.compile(r'(E ?NO[ =-]*:? ?|Engine\s|Engine\sNo)(\w{11,17})')
  mo = reg.search(data)
  # search() returns None when the pattern is not found
  engine = mo.group(2) if mo else None
  return engine

In [9]:
# Registration Date
def get_reg_date(data):
  # Group 1 matches the label ('REG. DT', 'Valid From', 'Date of Issue')
  # Group 2 captures either DD/MM (the year is not captured) or a form such as Jul-2015
  reg = re.compile(r'(REG ?. ?DT ?:? ?|Valid\sFrom\s?|Date\s?of\s?Issue\s?)(\d{2}[-/]\d{2}|\w{3}[-/]\d{4})') 
  mo = reg.search(data)
  # search() returns None when the pattern is not found
  reg_date = mo.group(2) if mo else None
  return reg_date
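
As written, the numeric branch `\d{2}[-/]\d{2}` stops after the month, which is why the output table in this notebook shows values such as `21/07`. A possible variant that keeps the year is sketched below. `FULL_DATE` and `get_full_reg_date` are hypothetical names, and the month-name branch (for dates like `05-Mar-2014`, a made-up example) is an assumption about what the original alternation was meant to cover.

```python
import re

FULL_DATE = re.compile(
    r'(REG ?. ?DT ?:? ?|Valid\sFrom\s?|Date\s?of\s?Issue\s?)'
    # day, then a numeric month or a three-letter month name, then a four-digit year
    r'(\d{2}[-/](?:\d{2}|[A-Za-z]{3})[-/]\d{4})'
)


def get_full_reg_date(text):
    """Return the full registration date string, or None when no date is found."""
    mo = FULL_DATE.search(text)
    return mo.group(2) if mo else None


print(get_full_reg_date('REGN. NO I DL2CAT9109 NEW REGN DT: 21/07/2015 CH NO'))
print(get_full_reg_date('Date of Issue 05-Mar-2014'))
```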

In [10]:
# Mfg. Date
def get_mfg_date(data):
  # Group 1 matches the label ('MFG DT', 'Month and Year of Mfg.'); group 2 captures M/YYYY or DD/MM/YYYY
  reg = re.compile(r'(MFG[ .]?DT[. ]?:?[ ]?|Month[/]\sYrof\s?|Month\sand\sYear\sof\sMfg\.?\s)(\d{1,2}/\d{4}|\d{2}/\d{2}/\d{4})')
  mo = reg.search(data)
  # search() returns None when the pattern is not found
  mfg_date = mo.group(2) if mo else None
  return mfg_date
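
The captured manufacturing dates come in slightly different shapes, such as `07/2015` and `9/2006`. One way to make them comparable is to normalise them to `YYYY-MM`. `normalise_mfg_date` below is a hypothetical helper, and treating a three-part value as day/month/year is an assumption.

```python
def normalise_mfg_date(value):
    """Turn 'M/YYYY', 'MM/YYYY' or 'DD/MM/YYYY' into 'YYYY-MM'; return None otherwise."""
    if not value:
        return None
    parts = value.split('/')
    if len(parts) not in (2, 3) or not all(p.isdigit() for p in parts):
        return None
    # The last two parts are month and year in every accepted shape
    month, year = int(parts[-2]), int(parts[-1])
    if not 1 <= month <= 12:
        return None
    return f'{year:04d}-{month:02d}'


print(normalise_mfg_date('07/2015'))
print(normalise_mfg_date('9/2006'))
print(normalise_mfg_date(None))
```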

In [11]:
dataset = []

In [12]:
s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket=bucket)
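for page in result:
    # 'Contents' is missing when a page has no objects, so fall back to an empty list
    for key in page.get("Contents", []):
        photo = key["Key"]
        # Run Textract on every object in the bucket and keep the detected text
        data = analyze(photo, bucket)
        dataset.append(data)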
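
An S3 listing can also contain folder placeholders and non-image files, and a result page without objects carries no `Contents` key. A minimal sketch of a filter that handles both is shown below. `image_keys` is a hypothetical helper, the extension list is an assumption about the dataset, and `sample_pages` is hand-written rather than a real S3 listing.

```python
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')   # assumed formats of the dataset


def image_keys(pages):
    """Yield object keys that look like images from paginated list_objects_v2 results."""
    for page in pages:
        # 'Contents' is absent when a page holds no objects
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.lower().endswith(IMAGE_EXTENSIONS):
                yield key


# Hand-written stand-in for the pages returned by the paginator
sample_pages = [
    {'Contents': [{'Key': 'dataset/'}, {'Key': 'dataset/rc1.jpg'}, {'Key': 'dataset/notes.txt'}]},
    {},
    {'Contents': [{'Key': 'dataset/RC2.PNG'}]},
]

print(list(image_keys(sample_pages)))
```

In the notebook, the result of `paginator.paginate(Bucket=bucket)` could be passed in place of `sample_pages`.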

In [13]:
data_set = pd.DataFrame(dataset, columns=['text'])

In [14]:
data_set.head()

Unnamed: 0,text
0,REGN NO - DL9CAC6215 O SNo - 01 REG. DT: 24/12...
1,REGN. NO I DL2CAT9109 NEW REGN DT: 21/07/2015 ...
2,GOVERNMENT OF HARYANA Transfer of Ownership CE...
3,CERTIFICATE GOVERNMENT OF REGISTRATION VEHICLE...
4,GOVERNMENT OF HARYANA CERTIFICATE OF REGISTRAT...


In [15]:
print(data_set.text[1])

REGN. NO I DL2CAT9109 NEW REGN DT: 21/07/2015 CH NO MA3ETDE1S00218363 O SNO :01 E NO 2 7567094 COLOUR GLISTENINGGREY MFR MARUTI SUZUKI INDIA LTD VEH CL Motor Car NAME ANOOP SURESH DHAWALE S/WID OF SURESH HARISHCHANDRA DHAWALE ADDRESS 72 SHRI BADRINATH APPTT.PLOT NO-18 SEC-4 DWARKA NEW DELHI.. South West, Delhi-110075 MODEL MARUTI CELERIO VXI GREEN BODY RIGID (PASSENGER CAR) WHEEL BASE 002425 NO OF CYL 03 MFG DT 07/2015 UNLADEN WT 000915 FUEL PETROLICNG SEATING C 005 REGD UPTO 20/07/2030 STANDING C 00 Signature TAX UPTO OTT CU CAP 000998 26412/2015A098:09 IP DEPOT REGN. NO I DL2CAT9109 NEW REGN DT: 21/07/2015 CH NO MA3ETDE1S00218363 O SNO :01 E NO 2 7567094 COLOUR GLISTENINGGREY MFR MARUTI SUZUKI INDIA LTD VEH CL Motor Car NAME ANOOP SURESH DHAWALE S/WID OF SURESH HARISHCHANDRA DHAWALE ADDRESS 72 SHRI BADRINATH APPTT.PLOT NO-18 SEC-4 DWARKA NEW DELHI.. South West, Delhi-110075 MODEL MARUTI CELERIO VXI GREEN BODY RIGID (PASSENGER CAR) WHEEL BASE 002425 NO OF CYL 03 MFG DT 07/2015 UNLADEN

In [16]:
def extract_info(text):
    data_info = {}
    data_info['reg_num'] = get_reg_number(text)
    data_info['vin'] = get_vin_no(text)
    data_info['name'] = get_name(text)
    data_info['engine_no'] = get_engine_no(text)
    data_info['reg_date'] = get_reg_date(text)
    data_info['mfg_date'] = get_mfg_date(text)
    
    return data_info

In [17]:
# Extract the six fields from the detected text of every photo
info = [extract_info(text) for text in data_set.text]

In [18]:
df = pd.DataFrame(info)

In [19]:
df.head()

Unnamed: 0,reg_num,vin,name,engine_no,reg_date,mfg_date
0,DL9CAC6215,MA3FHEB1S00358580,SRISHTI,D13A0338461,24/12,
1,DL2CAT9109,MA3ETDE1S00218363,ANOOP SURESH DHAWALE,,21/07,07/2015
2,,MA3EYD81S00765439,SUBE,F8DN3321864,,9/2006
3,,MA3EYD81S01277497,RANBEER,F8DN1266647,,11/2008
4,HR49D 0002,MA3FJEB1S00404062,AMAR,D13A2235550,,9/2013
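
One quick way to judge how well the patterns do is to count how often each field was found. The sketch below does this for the five rows displayed above, copied by hand into plain dictionaries with empty cells written as `None`; `fill_counts` is a hypothetical helper.

```python
rows = [
    {'reg_num': 'DL9CAC6215', 'vin': 'MA3FHEB1S00358580', 'name': 'SRISHTI',
     'engine_no': 'D13A0338461', 'reg_date': '24/12', 'mfg_date': None},
    {'reg_num': 'DL2CAT9109', 'vin': 'MA3ETDE1S00218363', 'name': 'ANOOP SURESH DHAWALE',
     'engine_no': None, 'reg_date': '21/07', 'mfg_date': '07/2015'},
    {'reg_num': None, 'vin': 'MA3EYD81S00765439', 'name': 'SUBE',
     'engine_no': 'F8DN3321864', 'reg_date': None, 'mfg_date': '9/2006'},
    {'reg_num': None, 'vin': 'MA3EYD81S01277497', 'name': 'RANBEER',
     'engine_no': 'F8DN1266647', 'reg_date': None, 'mfg_date': '11/2008'},
    {'reg_num': 'HR49D 0002', 'vin': 'MA3FJEB1S00404062', 'name': 'AMAR',
     'engine_no': 'D13A2235550', 'reg_date': None, 'mfg_date': '9/2013'},
]


def fill_counts(rows):
    """Count, per field, how many rows have a non-empty value."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if value else 0)
    return counts


print(fill_counts(rows))
```

Fields with low counts, such as the registration date here, point to the patterns that miss the most label variants.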


In [20]:
# saving the dataframe 
df.to_csv('file1.csv') 

## Try it yourself
- This assumes you have already completed the setup described at the start of the notebook.

In [None]:
bucket = 'BUCKET_NAME'        #Your S3 Bucket
image = 'IMAGE'               #Image name - put the image in the bucket directly

Detected_text = analyze(image, bucket)
Info = extract_info(Detected_text)

print(Info)

## Why I decided to use Amazon Textract?

- Why not the Google Cloud Vision API? Simply because I didn't have a card that supports recurring payments, even though I consider the Google Cloud Vision API better than Amazon Textract.
- Why not Amazon Rekognition? I tried it first, only to find out later that it identified at most 50 words per image, so I dropped it and used Amazon Textract instead, which doesn't have that limit.

#### I also tried PyTesseract, but its results were not close to what Amazon Textract gives. You can find the link to that notebook below.
[Text Detection using Pytesseract](https://colab.research.google.com/drive/1XvYzPinaG5ejHQNQxl6ZfiyaWu4dIrC7?usp=sharing)