# Case Study
*An insurance company wants to digitize its legacy data. Most of the data is text stored in Portable Document Format (PDF) files. The company wants to extract the required information and export it as CSV so that it can be loaded into a database easily.*

For the purpose of this tutorial, we assume that we have only one PDF, which is stored __[here](https://github.com/vedantnarayan/DataScienceTutorial/blob/master/TextFromPDF/4200410050.PDF)__ for your reference. For simplicity, we are going to extract the following information from the PDF:
* Master Policy No.
* Certificate No.
* GST NO.

## STEP 1 Convert PDF To Text

In Python, many packages can convert text-based PDFs to a text string. Note that this tutorial deals only with text-based PDFs; image-based (scanned) PDFs are out of scope.
We use the PyPDF2 package to extract text from the PDF. You can install PyPDF2 from __[this](https://pypi.org/project/PyPDF2/)__ link.

### Converting PDF to text using PyPDF2

In [1]:
import PyPDF2

In [2]:
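def convert_pdf_to_text_py(path):
    """Return the text of every page in the PDF at `path`, one page per line."""
    # Uses the legacy PyPDF2 API (PdfFileReader, getNumPages, getPage, extractText)
    content = ""
    with open(path, "rb") as f:
        # PdfFileReader's second positional argument is `strict`, not a file mode
        pdfDoc = PyPDF2.PdfFileReader(f)
        for i in range(pdfDoc.getNumPages()):
            content += pdfDoc.getPage(i).extractText() + "\n"
    return content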

In [3]:
text=convert_pdf_to_text_py("4200410050.PDF")
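
Later PyPDF2 releases (and the successor package `pypdf`) rename the reader API: `PdfFileReader`, `getNumPages`, `getPage` and `extractText` give way to `PdfReader`, `reader.pages` and `extract_text()`. The following is a minimal sketch of the same conversion under that assumption. The names `pages_to_text` and `convert_pdf_to_text_new` are hypothetical and not part of the original notebook.

```python
def pages_to_text(pages):
    """Join the text of each page, one page per line.

    `pages` is any iterable of objects exposing extract_text(),
    such as the `pages` attribute of a PdfReader.
    """
    # `or ""` guards against a page for which no text could be extracted
    return "".join((page.extract_text() or "") + "\n" for page in pages)


def convert_pdf_to_text_new(path):
    """Hypothetical variant of convert_pdf_to_text_py for the newer API."""
    # Imported lazily so that pages_to_text works without PyPDF2 installed
    from PyPDF2 import PdfReader

    with open(path, "rb") as f:
        reader = PdfReader(f)
        return pages_to_text(reader.pages)
```

Calling `convert_pdf_to_text_new("4200410050.PDF")` should then play the same role as the call above, although the exact text may differ between library versions.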

In [4]:
text = text.replace('\n', '')  # Remove newline characters
text

'PA Insurance for E-Ticket Passengers of IRCTCCertificate of InsuranceMaster Policy No: 10031/48/20/000002Certificate No: 201911230158587Name and Address of the GroupOrganizer/Group Policy holderIndian Railway Catering and Tourism Corporation 12th floor,IRCTC Corporate Office Statesman House, Barakhamba Road,New Delhi, Pin Code 110001.Originating Station: YESVANTPUR JNDestination Station: HUBBALLI JN#Trip means the actual departure of train from the originating station to actual arrival of train at the destination station asmentioned in booked ticket through which insurance cover has been opted and premium paid, including process ofentrainingand process of detrainingthe train. For other T&C and Policy wording please visit our website www.shriramgi.comInsured DetailsPNR NO. 4200410050NameAgeGenderMobile No.Email IDMr. NITISH KUMAR29Male9921168907vedant.006@gmail.comSum Insured (IN INR) DeathPermanent totaldisabilityPermanent partialdisability uptoHospitalizationexpenses for InjuryuptoTr
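
Deleting newlines outright glues neighbouring lines together, which is why the output above contains runs such as `IRCTCCertificate`. One alternative, sketched here, is to collapse every run of whitespace into a single space. The helper name `normalize_whitespace` is hypothetical, and the sample string only assumes where the PDF's line breaks might fall.

```python
import re


def normalize_whitespace(raw_text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", raw_text).strip()


# Assumed line layout, for illustration only
sample = (
    "Certificate of Insurance\n"
    "Master Policy No: 10031/48/20/000002\n"
    "Certificate No: 201911230158587\n"
)
print(normalize_whitespace(sample))
```

Whether patterns written for the newline-free text still match after this change depends on where the PDF breaks its lines, so they are worth re-testing.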

## STEP 2 Extract Information from Text

#### Extracting Information from Text using Regular Expressions

In [5]:
import re  # Python's regular expression library

In [6]:
def get_info_from_text(text,regex):
    """
    Extract the string that matches the regular expression.
    param:
        text - text string extracted from the PDF file
        regex - regular expression to search for; it must contain one capture group
    returns the text captured by group 1, or '' if there is no match
    """
    required_info=''
    ResSearch = re.search(regex, text)
    if ResSearch:
        required_info = ResSearch.group(1)
    return required_info
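
To make the behaviour concrete, here is a small self-contained sketch that runs an equivalent function on a fragment of the extracted text. The `PNR NO` pattern is an illustrative assumption, chosen only because that label does not occur in the fragment.

```python
import re


def get_info_from_text(text, regex):
    """Return capture group 1 of the first match, or '' if nothing matches."""
    match = re.search(regex, text)
    return match.group(1) if match else ""


# Fragment copied from the extracted text
snippet = "Master Policy No: 10031/48/20/000002Certificate No: 201911230158587Name and Address"

found = get_info_from_text(snippet, r"Certificate No:\s?(\d{15})")
missing = get_info_from_text(snippet, r"PNR NO\.\s?(\d{10})")

print(repr(found))    # '201911230158587'
print(repr(missing))  # ''
```

Because the function reads `group(1)`, a pattern without parentheses raises `IndexError` when it matches, and a missing field cannot be told apart from an empty capture.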

#### Define a regex for each field to be extracted from the text. To test a regular expression, use https://pythex.org/ and paste your text into the "Your test string" field.
#### You can read more about regular expressions at https://docs.python.org/3/howto/regex.html

In [7]:
## Raw strings (r'...') keep backslashes such as \s from being read as string escapes
regex_Master_Policy_No = r'Master Policy No:\s?([0-9/]{18})'
regex_Certificate_No = r'Certificate No:\s?(\d{15})'
regex_GST_No = r'GST No.:\s?(\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1})' ## https://stackoverflow.com/questions/44431819/regex-for-gst-identification-number-gstin
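
The GST pattern is the densest of the three, so it may help to see its capture group spelled out piece by piece. The sketch below uses `re.VERBOSE` to annotate each part. The field meanings in the comments follow the commonly described GSTIN layout and should be read as an assumption, and `GSTIN_PATTERN` is a hypothetical name.

```python
import re

GSTIN_PATTERN = re.compile(
    r"""
    \d{2}        # state code
    [A-Z]{5}     # PAN: five letters
    \d{4}        # PAN: four digits
    [A-Z]        # PAN: one letter
    [A-Z\d]      # entity number for the same PAN
    Z            # fixed letter Z
    [A-Z\d]      # check character
    """,
    re.VERBOSE,
)

print(bool(GSTIN_PATTERN.fullmatch("08AAKCS2509K1Z3")))  # True
print(bool(GSTIN_PATTERN.fullmatch("08AAKCS2509K1Y3")))  # False
```

The pattern checks only the shape of the number; it does not verify the final check character.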

In [8]:
Master_Policy_No = get_info_from_text(text,regex_Master_Policy_No)
Certificate_No=get_info_from_text(text,regex_Certificate_No)
GST_No= get_info_from_text(text,regex_GST_No)

In [9]:
print ('Master Policy Number is ' + Master_Policy_No)
print ('Certificate Number is ' + Certificate_No)
print ('GST Number is ' + GST_No)

Master Policy Number is 10031/48/20/000002
Certificate Number is 201911230158587
GST Number is 08AAKCS2509K1Z3
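
As the number of fields grows, one option is to keep the patterns in a dictionary and extract them in a loop, recording `None` for anything that was not found. This is a sketch rather than the notebook's method: `FIELD_PATTERNS` and `extract_fields` are hypothetical names, and only the two fields visible in the text fragment are included.

```python
import re

FIELD_PATTERNS = {
    "MasterPolicyNo": r"Master Policy No:\s?([0-9/]{18})",
    "CertificateNo": r"Certificate No:\s?(\d{15})",
}


def extract_fields(text, patterns):
    """Return {field: captured value or None} for every pattern."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


# Fragment copied from the extracted text
snippet = "Master Policy No: 10031/48/20/000002Certificate No: 201911230158587Name"

fields = extract_fields(snippet, FIELD_PATTERNS)
missing = [name for name, value in fields.items() if value is None]
print(fields)
print("Missing fields:", missing)
```

Using `None` instead of an empty string makes it easy to list the fields that need a closer look before exporting.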


## STEP 3 Reformat and Export

#### Convert the data into a pandas dataframe

In [10]:
import pandas as pd # Data manipulation library

In [11]:
## Creating a dataframe
df_details = pd.DataFrame(columns=['MasterPolicyNo','CertificateNo','GSTNO'])

In [12]:
data=[Master_Policy_No,Certificate_No,GST_No]

In [13]:
## Inserting the row
df_details.loc[len(df_details.index)] = data

In [14]:
df_details

|   | MasterPolicyNo | CertificateNo | GSTNO |
|---|---|---|---|
| 0 | 10031/48/20/000002 | 201911230158587 | 08AAKCS2509K1Z3 |


In [15]:
## Export the dataframe to a CSV file, without the index column
df_details.to_csv("policydetails.csv",index=False)
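
If pandas is not available, or if rows are collected one document at a time, the standard-library `csv` module can produce the same file. The sketch below is an alternative to the pandas export, not a replacement for it. `rows_to_csv` and `FIELDNAMES` are hypothetical names, and the row reuses the values extracted above.

```python
import csv
import io

FIELDNAMES = ["MasterPolicyNo", "CertificateNo", "GSTNO"]


def rows_to_csv(rows, out):
    """Write a header plus one CSV line per dict in `rows` to the file-like `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# One dict per processed document
rows = [
    {
        "MasterPolicyNo": "10031/48/20/000002",
        "CertificateNo": "201911230158587",
        "GSTNO": "08AAKCS2509K1Z3",
    },
]

buffer = io.StringIO()  # stands in for a real file in this example
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("policydetails.csv", "w", newline="")`; the `newline=""` argument is what the `csv` documentation recommends to avoid blank lines on some platforms.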