In [2]:
#Import required libraries
#Import re so that we can use regular expression to capture the data
import re
#Import pandas to store the extracted data into a DataFrame and then convert that dataframe into whatever format we want
import pandas as pd

## 3. Load the Data

In [3]:
#Open the file in read mode; the with block closes it automatically once the content is read
with open('inp.txt','r') as file:
    text = file.read()  # Content of the file is read into a variable text
text[0:100] #Check the first 100 characters of the file to see what we are dealing with

'<?xml version="1.0" encoding="UTF-8"?>\n<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file'

The 9 lists below store the data extracted from the input file, one list for each of the 9 columns required by the question.

In [3]:
#Initialize the lists we are going to use for storing the extracted data
grant_id = []
patent_title = []
kind = []
number_of_claims = []
inventors = []
citations_applicant_count = []
citatons_examiner_count = []
claims_text = []
abstract = []

The part of the file that holds the data for these columns is located with the regex below and stored as a list of strings. Each element of the list contains the information pertaining to a single patent, so we can iterate through the patents and systematically append the relevant information to the corresponding list.

In [4]:
#Partitioning the file by patent i.e each list element will be a single patent
patents = re.findall('(?s)<us-patent-grant(.*?)/claims',text)
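
To see how the partitioning regex behaves, here is a minimal sketch on a made-up two-patent string. The variable `sample` and its contents are hypothetical stand-ins for the real input file; only the pattern is taken from the cell above.

```python
import re

# Hypothetical two-patent sample standing in for the real input file
sample = (
    '<us-patent-grant lang="EN" file="a.xml">\n'
    '<claims id="c1">\n<claim id="CLM-1">one</claim>\n</claims>\n'
    '</us-patent-grant>\n'
    '<us-patent-grant lang="EN" file="b.xml">\n'
    '<claims id="c2">\n<claim id="CLM-1">two</claim>\n</claims>\n'
    '</us-patent-grant>\n'
)

# (?s) lets '.' match newlines; the lazy '.*?' stops at the first '/claims'
# after each opening tag, so every match covers exactly one patent
chunks = re.findall(r'(?s)<us-patent-grant(.*?)/claims', sample)
print(len(chunks))  # 2
```

Without the `?` the match would be greedy and run from the first opening tag to the last `/claims` in the file, giving a single element.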

### Extracting grant_id

Using a for loop, we iterate through each patent and use the regex below to extract the grant ID. The regex is "doc-number>(.*)</", which captures the text between 'doc-number>' and '</' in the data stored in the patent variable.
The grant ID consists of the country code and the ID number. We use a second regex to find the country as well and concatenate it with the ID number.
The grant ID is then stored in the list grant_id.

In [5]:
#iterate through each patent
for patent in patents:
    #The regex and code below concatenates the first occurrences of country and document number into one string
    doc_num = re.search('<country>(.*)</country>', patent).group(1) + re.search('doc-number>(.*)</', patent).group(1)
    #append result to the list
    grant_id.append(doc_num)
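
`re.search` returns `None` when nothing matches, and calling `.group(1)` on `None` raises an `AttributeError`. The sketch below wraps the lookup in a small helper; `first_group` and the sample `fragment` are hypothetical names introduced only for illustration.

```python
import re

def first_group(pattern, text, default='NA'):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

# Hypothetical fragment shaped like the start of a patent record
fragment = '<country>US</country>\n<doc-number>D0123456</doc-number>\n<kind>S1</kind>'

grant = (first_group(r'<country>(.*)</country>', fragment)
         + first_group(r'doc-number>(.*)</', fragment))
print(grant)  # USD0123456
```

Whether a missing field should become 'NA' or stop the run is a design choice; the helper simply makes that choice explicit.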

### Extracting patent_title

Using a for loop, we iterate through each patent and use the regex below to extract the patent title. The regex is ">(.*)</in", which captures the text that sits between '>' and the first closing tag starting with '</in', i.e. the invention title, in the data stored in the patent variable.
The title is then stored in the list patent_title.

In [6]:
#iterate through the list 
for patent in patents:
    #regex to find the first occurrence of the title
    title = re.search('>(.*)</in', patent).group(1)
    #append title to the list
    patent_title.append(title)
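
The short pattern relies on '</in' first appearing in the title's closing tag. A more explicit variant, sketched below on a hypothetical title line, anchors on the full tag name and also strips any inline markup left inside the title. The sample line is an assumption, not taken from the input file.

```python
import re

# Hypothetical title line containing inline markup
line = '<invention-title id="d2e53">Sensor for CO<sub>2</sub> detection</invention-title>'

# Anchor on the full tag name instead of the '</in' prefix
raw_title = re.search(r'<invention-title[^>]*>(.*?)</invention-title>', line).group(1)
# Drop any inline tags left inside the title
clean_title = re.sub(r'<[^>]+>', '', raw_title)
print(clean_title)  # Sensor for CO2 detection
```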

### Extracting Patent kind

Using a for loop, we iterate through each patent and use the regex below to extract the patent kind. The regex is ">(.*)</kind", which captures the kind code of the patent. There are 6 kinds of patents: B1, B2, E1, P2, P3 and S1.

Each kind refers to:

B1: Utility Patent Grant (no published application) issued on or after January 2, 2001.

B2: Utility Patent Grant (with a published application) issued on or after January 2, 2001.

E1: Reissue Patent

P2: Plant Patent Grant (no published application) issued on or after January 2, 2001

P3: Plant Patent Grant (with a published application) issued on or after January 2, 2001

S1: Design Patent

If you look closely at the list above, you will notice that it has the shape of a dictionary: each kind code maps to one description. We will use a dictionary for the extraction, since looking up a description by its code is simple and efficient.

In [7]:
#Create the dictionary to use as the patent code reference
#Note: this dictionary was created using the sample input as the ground truth
kind_ref = {'B1':'Utility Patent Grant (no published application) issued on or after January 2, 2001.',
            'B2':'Utility Patent Grant (with a published application) issued on or after January 2, 2001.',
            'E1':'Reissue Patent',
            'P2':'Plant Patent Grant (no published application) issued on or after January 2, 2001',
            'P3':'Plant Patent Grant (with a published application) issued on or after January 2, 2001',
            'S1':'Design Patent'
           }
#iterate through the patents
for patent in patents:
    #Regex to get the first occurrence of the patent kind code
    kind_code = re.search('>(.*)</kind', patent).group(1)
    #Append the code reference using the dictionary from earlier
    kind.append(kind_ref[kind_code])
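
Indexing the dictionary directly raises a `KeyError` if a file contains a kind code that is not in the reference. The sketch below shows one possible fallback using `dict.get`; `kind_lookup` and `describe_kind` are hypothetical names, and returning the raw code for unknown kinds is an assumption about the desired behaviour.

```python
import re

# Shortened copy of the reference dictionary, for illustration only
kind_lookup = {
    'B2': 'Utility Patent Grant (with a published application) issued on or after January 2, 2001.',
    'S1': 'Design Patent',
}

def describe_kind(fragment):
    """Map the first kind code in `fragment` to its description."""
    match = re.search(r'<kind>(.*?)</kind>', fragment)
    if match is None:
        return 'NA'
    code = match.group(1)
    # Fall back to the raw code when it is not in the reference dictionary
    return kind_lookup.get(code, code)

print(describe_kind('<kind>S1</kind>'))  # Design Patent
print(describe_kind('<kind>H1</kind>'))  # H1
```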

### Extracting the number_of_claims

Using a for loop, we iterate through each patent and count how many times the string "<claim id" occurs, since every claim opens with that tag. The count is then stored in the list number_of_claims.

In [8]:
#Iterate through the patents
for patent in patents:
    #count the number of times claim ID occurs
    num_cl = len(re.findall('<claim id', patent))
    #append the result to the list
    number_of_claims.append(num_cl)
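
The space after `<claim` matters: it keeps similarly named tags out of the count. A small sketch on a hypothetical claims block (the sample string is invented for illustration):

```python
import re

# Hypothetical claims block with two claims and several look-alike tags
chunk = (
    '<claims id="claims">\n'
    '<claim id="CLM-00001" num="00001">\n'
    '<claim-text>A widget.</claim-text>\n'
    '</claim>\n'
    '<claim id="CLM-00002" num="00002">\n'
    '<claim-text>The widget of <claim-ref idref="CLM-00001">claim 1</claim-ref>.</claim-text>\n'
    '</claim>\n'
)

# '<claims id', '<claim-text>' and '<claim-ref' do not match '<claim id'
count_regex = len(re.findall(r'<claim id', chunk))
# For a fixed string, str.count gives the same result without a regex
count_plain = chunk.count('<claim id')
print(count_regex, count_plain)  # 2 2
```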

### Extracting inventors

Using a for loop, we iterate through each patent and use the regex below to extract the names of the inventors. The regex "(?s)<inventors>(.*?)</inventors>" captures the block that lists the inventors. Each inventor has a first name and a last name, so the first names are extracted with the regex "<first-name>(.*)</first-name>" and the last names with "<last-name>(.*)</last-name>". The first and last names are combined in a for loop, and the resulting string is stored in the list "inventors". If a patent has no inventors block, 'NA' is stored instead.

In [9]:
#Iterate through the patents
for patent in patents:
    #initialize the list of names in each patent
    name_list = []
    #Capture the block that holds all the inventors
    invent = re.search('(?s)<inventors>(.*?)</inventors>', patent)
    if invent is None:
        #Use 'NA' if no inventors
        name_string = 'NA'
    else:
        invent = invent.group(1)
        #Store the last names
        last_name = re.findall('<last-name>(.*)</last-name>',invent)
        #Store the first names
        first_name = re.findall('<first-name>(.*)</first-name>',invent)
        for first, last in zip(first_name, last_name):
            #Concatenate the corresponding names
            name_list.append(first + ' ' + last)
        #Convert name list to a string of the form [name1,name2]
        name_string = '[' + ','.join(name_list) + ']'
    #append result
    inventors.append(name_string)
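
Collecting all first names and all last names separately assumes that every inventor has both. If one entry lacked a first name, the two lists would fall out of step. The sketch below pairs the names inside each `<inventor>` element instead. The sample block, including the entry without a first name, is hypothetical, and `field` is a helper introduced only for this example.

```python
import re

# Hypothetical inventors block; the second entry has no first name
inventors_block = (
    '<inventor sequence="001">\n<last-name>Doe</last-name>\n<first-name>Jane</first-name>\n</inventor>\n'
    '<inventor sequence="002">\n<last-name>Acme Labs</last-name>\n</inventor>\n'
    '<inventor sequence="003">\n<last-name>Kumar</last-name>\n<first-name>Ravi</first-name>\n</inventor>\n'
)

def field(tag, text):
    """Return the content of the first <tag>...</tag> in text, or ''."""
    match = re.search(rf'<{tag}>(.*?)</{tag}>', text)
    return match.group(1) if match else ''

names = []
# \b keeps the pattern from matching the enclosing <inventors> tag
for person in re.findall(r'(?s)<inventor\b.*?</inventor>', inventors_block):
    full = (field('first-name', person) + ' ' + field('last-name', person)).strip()
    names.append(full)

name_string = '[' + ','.join(names) + ']'
print(name_string)  # [Jane Doe,Acme Labs,Ravi Kumar]
```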

### Extracting citations_applicant_count and citations_examiner_count



Using a for loop, we iterate through each patent and count the citations made by the applicant and by the examiner. The regex "cited by applicant" matches every citation attributed to the applicant; the matches are counted with the len function and the count is stored in the "citations_applicant_count" list. Similarly, the regex "cited by examiner" matches every citation attributed to the examiner, and that count is stored in the "citatons_examiner_count" list.

In [10]:
#Iterate through patents
for patent in patents:
    #Count occurrences of 'cited by applicant'
    num_ci_app = len(re.findall('cited by applicant', patent))
    #Count occurrences of 'cited by examiner'
    num_ci_exam = len(re.findall('cited by examiner', patent))
    #Append results to their appropriate list
    citations_applicant_count.append(num_ci_app)
    citatons_examiner_count.append(num_ci_exam)
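
Both counts can also be gathered in a single pass. The sketch below assumes the phrases appear inside `<category>` elements, which is an assumption about the input layout, and uses `collections.Counter` on a made-up fragment.

```python
import re
from collections import Counter

# Hypothetical citation fragment; the <category> wrapper is an assumption
refs = (
    '<category>cited by applicant</category>\n'
    '<category>cited by examiner</category>\n'
    '<category>cited by applicant</category>\n'
)

# One scan collects every category; Counter tallies them
counts = Counter(re.findall(r'<category>(cited by \w+)</category>', refs))
print(counts['cited by applicant'], counts['cited by examiner'])  # 2 1
```

A `Counter` returns 0 for a category that never occurs, so patents without citations need no special case.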

### Extracting claims_text

Using a for loop, we iterate through each patent and use the regex below to extract the claims text. The regex "(?s)<claim (.*?)</claim>" captures the content of every claim. An if-else checks whether the patent has any claims. If a patent has no claims, 'NA' is stored in the claims_text list. Otherwise, a for loop cleans each claim: the re.sub command removes the XML tags and the newline characters, so that text spread over several lines merges into one paragraph. The cleaned claims are combined into one string, which is then stored in the claims_text list.

In [11]:
#Iterate through patents
for patent in patents:
    #capture all the text in the claim section
    claim_list = []
    claims = re.findall('(?s)<claim (.*?)</claim>', patent)
    if not claims:
        #Use 'NA' if no claims (re.findall returns an empty list, not None, when nothing matches)
        claim_string = 'NA'
    else:
        #Iterate through each claim
        for claim in claims:
            #Remove xml tags
            claim = re.sub('<.*?>', '', claim)
            #Remove the leftover attributes of the opening claim tag
            claim = re.sub('.*>', '', claim)
            #Remove line breaks
            claim = re.sub('\n', '', claim)
            #remove all unwanted spaces
            claim = " ".join(claim.split())
            #Append clean claim to list
            claim_list.append(claim)
        #Convert all the claims to a single string format within []
        claim_string = '[' + ','.join(claim_list) + ']'
    #append result to final list
    claims_text.append(claim_string)
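
One detail is worth checking when testing for missing claims: `re.findall` and `re.search` signal "no match" differently. A minimal sketch on a made-up string:

```python
import re

# Hypothetical patent fragment without any claims
text = '<abstract>no claims here</abstract>'

found_all = re.findall(r'(?s)<claim (.*?)</claim>', text)
found_one = re.search(r'(?s)<claim (.*?)</claim>', text)

print(found_all)  # []
print(found_one)  # None
```

A comparison with `None` therefore never succeeds for a `findall` result; testing the list for emptiness (for example `if not found_all:`) covers the no-claims case.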

### Extracting abstract

Using a for loop, we iterate through each patent and use the regex below to extract the abstract. The regex "(?s)<abstract(.*?)</abstract>" captures the content of the abstract section. An if-else checks whether the patent has an abstract. If there is no abstract, 'NA' is stored in the abstract list for that patent. If there is an abstract, the re.sub command strips the XML tags and newline characters and the remaining whitespace is collapsed, so that text from different lines forms a single string. It is then stored in the list 'abstract'.

In [12]:
#Iterate through patents
for patent in patents:
    #Capture all text in the abstract section
    abstracts = re.search('(?s)<abstract(.*?)</abstract>', patent)
    if abstracts is None:
        #Return 'NA' if no abstract
        abstracts = 'NA'
    else :
        abstracts = abstracts.group(1)
        #Remove xml tags and linebreaks
        abstracts = re.sub('<.*?>', '', abstracts)
        abstracts = re.sub('.*>', '', abstracts)
        abstracts = re.sub('\n', '', abstracts)
        #Remove unnecessary spaces
        abstracts = " ".join(abstracts.split())
    #Append to list
    abstract.append(abstracts)
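
If a paragraph wraps across lines, deleting the newline outright glues the last word of one line to the first word of the next. The sketch below compares that with simply collapsing whitespace. The `raw` string is a hypothetical abstract, shaped like the text the regex captures.

```python
import re

# Hypothetical captured abstract whose paragraph wraps across two lines
raw = ' id="abstract">\n<p id="p-0001" num="0000">A widget that spins\nand a frame that holds it.</p>\n'

no_tags = re.sub(r'<.*?>', '', raw)    # remove complete tags
no_tags = re.sub(r'.*>', '', no_tags)  # remove the leftover attribute text

# Deleting '\n' before splitting joins 'spins' and 'and'
glued = ' '.join(re.sub(r'\n', '', no_tags).split())
# str.split() already treats '\n' as whitespace, so the word boundary survives
spaced = ' '.join(no_tags.split())

print(glued)   # A widget that spinsand a frame that holds it.
print(spaced)  # A widget that spins and a frame that holds it.
```

Whether this matters depends on whether the input ever breaks a paragraph mid-sentence; if it does, replacing the newline with a space, or dropping that step, keeps the words apart.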

### Creating a dataframe

The lists are zipped into a list of rows, which is passed to the pd.DataFrame function to create a new dataframe. The column names are those specified in the question.

In [13]:
#zip everything together so that we can convert the data into a single dataframe
zippedList =  list(zip(grant_id, patent_title, kind, number_of_claims, inventors, citations_applicant_count, citatons_examiner_count, claims_text, abstract))
#Convert the zipped data into a Dataframe with appropriate column headers
csv_df = pd.DataFrame(zippedList, columns = ['grant_id' , 'patent_title', 'kind', 'number_of_claims', 'inventors', 'citations_applicant_count', 'citations_examiner_count', 'claims_text', 'abstract'])
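
`zip` stops at the shortest input, so a list that came up one entry short would silently drop rows rather than raise an error. A quick length check before zipping catches that. The lists below are hypothetical stand-ins for the real columns.

```python
# Hypothetical columns; 'titles' is one entry short
ids = ['US1', 'US2', 'US3']
titles = ['A', 'B']

rows = list(zip(ids, titles))
print(rows)  # [('US1', 'A'), ('US2', 'B')]  (the third patent is dropped)

# Compare the lengths of all columns before building the table
lengths = {'grant_id': len(ids), 'patent_title': len(titles)}
mismatched = len(set(lengths.values())) > 1
print(mismatched)  # True
```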


### Generating a CSV file from the dataframe

In [14]:
#Convert the data frame to CSV format
csv_df.to_csv('155.csv', index = False)

### Generating a JSON file from the dataframe

To convert the data into JSON format, we simply iterate through each patent, concatenate the information into one large string in the format of a JSON file, and then save the string to a file.

In [15]:
#Convert all the data to a single string variable
#Note: This format is in accordance with the ground truth, which is the sample output provided
#Initialize string variable
json_output = '{'
#Iterate through all the data elements
for i in range(len(grant_id)):
    json_output += '"' + grant_id[i] + '":{'
    json_output += '"' + 'patent_title' + '":"' + patent_title[i] + '",'
    json_output += '"' + 'kind' + '":"' + kind[i] + '",'
    json_output += '"' + 'number_of_claims' + '":' + str(number_of_claims[i]) + ','
    json_output += '"' + 'inventors' + '":"' + inventors[i] + '",'
    json_output += '"' + 'citations_applicant_count' + '":' + str(citations_applicant_count[i]) + ','
    json_output += '"' + 'citations_examiner_count' + '":' + str(citatons_examiner_count[i]) + ','
    json_output += '"' + 'claims_text' + '":"' + claims_text[i] + '",'
    json_output += '"' + 'abstract' + '":"' + abstract[i] + '"},'
#Remove the unnecessary trailing ','
json_output = json_output[:-1]
#End the JSON file
json_output += '}'
#Properly escape all the '/'
json_output = re.sub('/','\\\\/', json_output)
#Write the string variable into the file; the with block flushes and closes the file automatically
with open("155.json","w") as f:
    f.write(json_output)
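
Building the JSON text by hand does not escape characters such as double quotes or backslashes inside the extracted text, which would produce an invalid file. As a hedged alternative, the sketch below lets the standard `json` module handle the escaping. The `records` dictionary is a made-up example, and matching the sample output's compact layout and `\/` escaping is an assumption about the required format.

```python
import json

# Hypothetical record with characters that need escaping
records = {
    'US0000001': {
        'patent_title': 'A "smart" widget',
        'number_of_claims': 3,
        'abstract': 'Reads input/output pairs.',
    }
}

# separators=(',', ':') drops the spaces json.dumps adds by default
compact = json.dumps(records, separators=(',', ':'))
# json.dumps leaves '/' as it is; escape it afterwards if the format requires '\/'
escaped = compact.replace('/', '\\/')

# The escaped text is still valid JSON and decodes to the original data
round_trip = json.loads(escaped)
print(round_trip == records)  # True
```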