# Process the patent XML file
This notebook:
1. Extracts the necessary fields from the large patent full-text XML file.
2. Extracts technical terms from the patent abstracts and claims.

The input patent file is downloaded from https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2023/ipa230720.zip 
The file is too large to include in the repository, so download it before running this notebook.

Unzipping ipa230720.zip produces ipa230720.xml. Because this file is almost 1 GB, we parse it with etree iterparse, which handles one element at a time instead of building the entire document tree in memory.

The file cannot be parsed as is, for the following reasons:
1. The file has no root XML node. Instead, it contains individual patent documents appended one after another, so the etree parser cannot be used on it directly.
2. The file contains lines such as `<?xml version="1.0" encoding="UTF-8"?>` and `<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v46-2022-02-17.dtd" [ ]>` that must be removed, because the parser fails on them.
3. Most patents start with a `<us-patent-application` tag, but some start with `<sequence-cwu` instead. Some of these also lack end tags, which causes parser errors. All of these need to be fixed.
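The first two problems can be reproduced on a tiny hand-made sample. The sketch below uses the standard-library `xml.etree.ElementTree` rather than lxml so that it runs without extra installs. The sample string and the helper names `parses` and `clean` are illustrative assumptions, not part of this notebook.

```python
import re
import xml.etree.ElementTree as ET

# Two made-up patent documents appended back to back, as in the bulk file.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v46-2022-02-17.dtd" [ ]>\n'
    '<us-patent-application><invention-title>First</invention-title></us-patent-application>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v46-2022-02-17.dtd" [ ]>\n'
    '<us-patent-application><invention-title>Second</invention-title></us-patent-application>\n'
)

def parses(text):
    """Return True if the text is a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def clean(text):
    """Strip XML declarations and DOCTYPE lines, then wrap everything in one root."""
    text = re.sub(r'<\?xml[^>]*\?>', '', text)
    text = re.sub(r'<!DOCTYPE[^>]*>', '', text)
    return '<root>\n' + text + '\n</root>\n'

print(parses(sample))         # False: repeated declarations and more than one top-level element
print(parses(clean(sample)))  # True: one root element wrapping both records
```

The same idea (remove the per-document headers, then add a single root) is what the preprocessing cells in this notebook apply to the full file.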



In [1]:
# To download the en_core_sci_lg language model used for the tests, uncomment and run the following lines
#!pip install scispacy
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_lg-0.5.3.tar.gz 
#!conda install scipy
#!pip install --upgrade scipy #went from 1.7.3 to 1.11.4
#!conda list
#!pip list

In [2]:
#!conda list

In [3]:
import re
from lxml import etree
import pandas as pd
import spacy as sp
import import_ipynb
import joblib
from io import BytesIO
import scispacy

In [4]:
# The spacy_helper_methods notebook must be in the same directory as this notebook
import spacy_helper_methods as sph

importing Jupyter notebook from spacy_helper_methods.ipynb


## Preprocess input file 
1. Remove unnecessary lines from the XML
2. Replace sequence-cwu tags with us-patent-application tags
3. Wrap the records in a single root element
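As a possible alternative to reading the whole file into one string, the same three steps can be applied line by line. This is a minimal sketch, assuming every XML declaration and DOCTYPE sits on its own line. The generator name `iter_clean_lines` and the sample lines are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator

# Lines that start with an XML declaration or a DOCTYPE are dropped.
HEADER_LINE = re.compile(r'^\s*(<\?xml\b|<!DOCTYPE\b)')

def iter_clean_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned lines wrapped in a single <root> element."""
    yield '<root>\n'
    for line in lines:
        if HEADER_LINE.match(line):
            continue  # step 1: remove unnecessary lines
        # step 2: rename sequence-cwu tags
        yield line.replace('sequence-cwu', 'us-patent-application')
    yield '</root>\n'  # step 3: close the added root

# Made-up input lines standing in for the real file.
raw_lines = [
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v46-2022-02-17.dtd" [ ]>\n',
    '<us-patent-application>\n',
    '<invention-title>Example A</invention-title>\n',
    '</us-patent-application>\n',
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    '<sequence-cwu>\n',
    '</sequence-cwu>\n',
]

cleaned = ''.join(iter_clean_lines(raw_lines))
root = ET.fromstring(cleaned)
print(len(root))  # number of records under the added root
```

Because an open file object is itself an iterable of lines, the generator could be fed a file handle directly and its output written to a cleaned copy on disk. That variant is untested against the full ipa230720.xml file.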

In [5]:
%%time
input_file = '../input_files/ipa230720.xml'
with open(input_file,'r') as f:
    xmlfile = f.read()

CPU times: user 2.61 s, sys: 4.61 s, total: 7.21 s
Wall time: 13.1 s


In [6]:
%%time
xmlfile = re.sub(r'<\?xml version="1.0".*','',xmlfile)
xmlfile = re.sub(r'<!DOCTYPE.*','',xmlfile)
xmlfile = re.sub(r'sequence-cwu',r'us-patent-application',xmlfile)
#xmlfile = xmlfile.split('\n')
#xmlfile = [line.strip() for line in xmlfile if line]
xmlfile[:200]


CPU times: user 4.53 s, sys: 2.53 s, total: 7.06 s
Wall time: 8.44 s


'\n\n<us-patent-application lang="EN" dtd-version="v4.6 2022-02-17" file="US20230225235A1-20230720.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20230704" date-publ="202'

In [7]:
#print('sequence-cwu' in xmlfile)
#xmlfile = xmlfile.replace('sequence-cwu','us-patent-application')
#'sequence-cwu' in xmlfile

In [8]:
# iterparse requires a hierarchical XML document, hence add root tags
xmlfile = '<root>\n' + xmlfile + '\n</root>\n'
#xmlstr = ' '.join(xmlfile)

In [9]:
# code to find any missing tags and add them
'''
%%time
open_tags = []
end_tags = []
count = 0
for i in range(len(tmp)):
    if tmp[i].startswith('<us-patent-application'): #and tmp[i].endswith('>'):
        if open_tags:
            print(open_tags)
            tmp.insert(i,'</us-patent-application>')
            count += 1
        open_tags.append(tmp[i])
    elif tmp[i].startswith('</us-patent-application'):# and tmp[i].endswith('>'):
        if open_tags:
            open_tags.pop()
count
'''

In [10]:
# code to count and make sure starting and ending tags match 
'''
patent_s = 0
patent_e = 0
sequence_s = 0
sequence_e = 0
for line in xmldata:
    if "<us-patent-application" in line:
        patent_s += 1
    elif "</us-patent-application" in line:
        patent_e += 1
    elif "<sequence-cwu" in line:
        sequence_s += 1
    elif "</sequence-cwu" in line:
        sequence_e += 1
        
patent_s, patent_e, sequence_s, sequence_e
'''

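The two disabled cells above refer to variables (`tmp`, `xmldata`) that are not defined in this notebook. A runnable sketch of the same balance check on a small made-up string might look like this. The function name `count_tags` and the sample text are assumptions for illustration.

```python
import re

def count_tags(text, tag_names):
    """Return {tag_name: (open_count, close_count)} for each requested tag."""
    counts = {}
    for name in tag_names:
        escaped = re.escape(name)
        # An opening tag is '<name' followed by whitespace or '>'.
        opened = len(re.findall(rf'<{escaped}[\s>]', text))
        closed = len(re.findall(rf'</{escaped}\s*>', text))
        counts[name] = (opened, closed)
    return counts

# Made-up sample: the sequence-cwu record has no closing tag.
sample = (
    '<us-patent-application lang="EN">\n</us-patent-application>\n'
    '<sequence-cwu>\n'
    '<us-patent-application lang="EN">\n</us-patent-application>\n'
)

counts = count_tags(sample, ['us-patent-application', 'sequence-cwu'])
unbalanced = [name for name, (opened, closed) in counts.items() if opened != closed]
print(counts)
print(unbalanced)
```

Any tag name reported in `unbalanced` points at records that need a closing tag inserted before parsing.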

In [11]:
# convert to a file format to feed to iterparse
xmlfile = BytesIO(xmlfile.encode("UTF-8"))

## Parse cleaned patent xml
Walk through one patent record at a time and extract necessary fields
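To illustrate the record-at-a-time idea on something small, here is a sketch that uses the standard-library `xml.etree.ElementTree.iterparse` on an in-memory document. It differs from the cell below in two ways that are assumptions of this sketch: it uses the standard library instead of lxml (so there is no `tag=` filter), and it builds each record when the closing `us-patent-application` tag is reached. The document numbers and titles are made up.

```python
from io import BytesIO
import xml.etree.ElementTree as ET

# Made-up, already cleaned input with two records under one root.
xml_bytes = (
    b'<root>'
    b'<us-patent-application>'
    b'<publication-reference><document-id>'
    b'<country>US</country><doc-number>1001</doc-number><date>20230720</date>'
    b'</document-id></publication-reference>'
    b'<invention-title>Example widget</invention-title>'
    b'<abstract><p>A widget with <b>bold</b> text.</p></abstract>'
    b'</us-patent-application>'
    b'<us-patent-application>'
    b'<publication-reference><document-id>'
    b'<country>US</country><doc-number>1002</doc-number><date>20230720</date>'
    b'</document-id></publication-reference>'
    b'<invention-title>Example gadget</invention-title>'
    b'</us-patent-application>'
    b'</root>'
)

records = []
for _, elem in ET.iterparse(BytesIO(xml_bytes), events=('end',)):
    if elem.tag != 'us-patent-application':
        continue  # wait until a whole record has been read
    abstract = elem.find('abstract')
    records.append({
        'doc-number': int(elem.findtext('publication-reference/document-id/doc-number')),
        'title': elem.findtext('invention-title'),
        # itertext() also returns text nested inside inline tags such as <b>
        'abstract': ''.join(abstract.itertext()) if abstract is not None else None,
    })
    elem.clear()  # release the record's children once its fields are copied

print(records)
```

Building the record at the end of each `us-patent-application` element keeps all fields of one patent together without relying on the order in which child tags appear.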

In [12]:
%%time
patent = []
mydict = dict()

tags = ['publication-reference','application-reference','invention-title',\
        'us-applicant','inventors','abstract','claims']


for _, element in etree.iterparse(xmlfile, tag=tags):
    if element.tag == 'publication-reference':
        if mydict:
            patent.append(mydict)
        mydict = dict()
        for elem in element.iter('doc-number','country','date'):
            if elem.tag == 'doc-number':
                mydict[elem.tag] = int(elem.text)
            else:
                mydict[elem.tag] = elem.text
        element.clear()
    elif element.tag == 'application-reference':
        subdict = dict()
        for elem in element.iter('doc-number','country','date'):
            if elem.tag == 'doc-number':
                subdict[elem.tag] = int(elem.text)
            else:
                subdict[elem.tag] = elem.text
        mydict['application-reference'] = subdict
        element.clear()
    elif element.tag == 'invention-title':
        mydict['title'] = element.text
        element.clear()
    elif element.tag == 'us-applicant':
        subdict = dict()
        for e in element.iter('orgname','city','state','country'):
            subdict[e.tag] = e.text
        mydict['assignee'] = subdict
        element.clear()
    elif element.tag == 'inventors':
        mydict['inventors'] = []
        for elem in element.iter('inventor'):
            subdict = dict()
            for e in elem.iter('first-name','last-name','city','state','country'):
                subdict[e.tag] = e.text
            mydict['inventors'].append(subdict)
        element.clear()
    elif element.tag == 'abstract':
        # itertext() already returns the text nested inside inline tags such as <b>, <i> and <u>
        mydict['abstract'] = ''.join(element.itertext())
        element.clear()
    elif element.tag == 'claims':
        # Keep only the text that belongs to <claim-text> elements
        mydict['claims'] = ''.join(element.itertext('claim-text'))
        element.clear()

patent.append(mydict) # append last one
#patent

CPU times: user 25.9 s, sys: 5.74 s, total: 31.6 s
Wall time: 1min 53s


In [13]:
len(patent)

7502

In [15]:
# Drop any patents that don't have an abstract
patent_df = pd.DataFrame(patent)
patent_df = patent_df[patent_df['abstract'].notna()]
patent_df.head()

Unnamed: 0,country,doc-number,date,application-reference,title,assignee,inventors,abstract,claims
0,US,20230225235,20230720,"{'country': 'US', 'doc-number': 17754513, 'dat...","AGRICULTURAL TRENCH DEPTH SYSTEMS, METHODS, AN...","{'orgname': 'Precision Planting LLC', 'city': ...","[{'last-name': 'Sloneker', 'first-name': 'Dill...",\nA row unit (10) of an agricultural planter w...,a row unit frame;\na furrow opening disc rotat...
1,US,20230225236,20230720,"{'country': 'US', 'doc-number': 18007883, 'dat...",Agricultural Attachment for Cultivating Row Crops,{'orgname': 'Amazonen-Werke H. Dreyer SE & Co....,"[{'last-name': 'RESCH', 'first-name': 'Rainer'...",\nThe invention relates to an agricultural att...,"a row-detection device adapted to detect, duri..."
2,US,20230225237,20230720,"{'country': 'US', 'doc-number': 18121636, 'dat...",TRAVEL LINE CREATION SYSTEM FOR AGRICULTURAL M...,"{'orgname': 'KUBOTA CORPORATION', 'city': 'Osa...","[{'last-name': 'MORIMOTO', 'first-name': 'Taka...",\nA travel line creation system for an agricul...,a position acquirer to acquire position measur...
3,US,20230225238,20230720,"{'country': 'US', 'doc-number': 18187398, 'dat...",AGRICULTURAL HARVESTING MACHINE WITH PRE-EMERG...,"{'orgname': 'Deere & Company', 'city': 'Moline...","[{'last-name': 'BLANK', 'first-name': 'Sebasti...",\nAn agricultural harvesting machine includes ...,crop processing functionality configured to en...
4,US,20230225239,20230720,"{'country': 'US', 'doc-number': 18190358, 'dat...","DETECTION OF PLANT DISEASES WITH MULTI-STAGE, ...","{'orgname': 'CLIMATE LLC', 'city': 'Saint Loui...","[{'last-name': 'Guan', 'first-name': 'Wei', 'c...",\nA computer system is provided comprising a c...,a classification model management server compu...


In [16]:
patent_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7234 entries, 0 to 7501
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   country                7234 non-null   object
 1   doc-number             7234 non-null   int64 
 2   date                   7234 non-null   object
 3   application-reference  7234 non-null   object
 4   title                  7234 non-null   object
 5   assignee               7234 non-null   object
 6   inventors              7234 non-null   object
 7   abstract               7234 non-null   object
 8   claims                 7234 non-null   object
dtypes: int64(1), object(8)
memory usage: 565.2+ KB


## Extract technical terms
Process the patent records generated above as follows:
1. Lemmatize the text of the abstract and claims fields
2. Extract entities for each record using the scispacy entity language model
3. Pass the extracted entities through a binary RandomForest classifier to remove non-technical terms
4. Drop the abstract and claims columns and save the output to CSV
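The internals of `sph.extract_tech_entities` live in the helper notebook and are not shown here. The following is only a toy sketch of steps 2 and 3 (extract candidates, then keep those a classifier accepts), with lemmatization omitted. The function `extract_tech_terms` and both stand-in callables are hypothetical and replace the scispacy model and the trained classifier.

```python
from typing import Callable, Iterable, List

def extract_tech_terms(
    texts: Iterable[str],
    extract_entities: Callable[[str], List[str]],
    is_technical: Callable[[str], bool],
) -> List[List[str]]:
    """For each text, return the de-duplicated candidates accepted by the classifier."""
    results = []
    for text in texts:
        seen = set()
        kept = []
        for term in extract_entities(text):
            key = term.lower()
            if key in seen:
                continue  # skip case-insensitive duplicates
            seen.add(key)
            if is_technical(term):
                kept.append(term)
        results.append(kept)
    return results

# Toy stand-in for the entity model: split on commas.
def toy_extract(text: str) -> List[str]:
    return [part.strip() for part in text.split(',') if part.strip()]

# Toy stand-in for the binary classifier: membership in a fixed set.
TOY_TECH_TERMS = {'furrow opening disc', 'quantum dot'}

def toy_is_technical(term: str) -> bool:
    return term.lower() in TOY_TECH_TERMS

texts = [
    'furrow opening disc, the invention, Furrow Opening Disc',
    'quantum dot, a method',
]
result = extract_tech_terms(texts, toy_extract, toy_is_technical)
print(result)  # [['furrow opening disc'], ['quantum dot']]
```

Passing the extractor and classifier in as callables keeps the filtering logic independent of any particular model.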

In [17]:
!unzip -o ../model/trained_tech_classifier_model.joblib.zip -d ../model

Archive:  ../model/trained_tech_classifier_model.joblib.zip
  inflating: ../model/trained_tech_classifier_model.joblib  


In [18]:
model = joblib.load('../model/trained_tech_classifier_model.joblib')
# the classifier was trained with scikit-learn

In [19]:
nlp = sp.load('en_core_sci_lg')



In [21]:
%%time
# Processing all patents and extracting entities can take a very long time. Uncomment as necessary.
#patent_df['abstract_entities'] = sph.extract_tech_entities(nlp, model,patent_df['abstract'])
#patent_df['claim_entities'] = sph.extract_tech_entities(nlp, model, patent_df['claims'])
#patent_df.to_csv('../preprocessed_files/patents_entities.csv')

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 15.3 µs


In [22]:
# get a sample of 1k patents
pmatch_df = patent_df.sample(1000) #pd.read_json('../preprocessed_files/patents.json')

### Manually add companies that have entries in both the patent and SBIR databases

In [23]:
winning_list = ['Beirobotics LLC',
'Ultra Safe Nuclear Corporation',
'Andluca Technologies Inc.',
'FURCIFER INC.',
'Kurt J. Lesker Company',
'Nanosys, Inc.'   
]

In [24]:
winning_str = '|'.join(winning_list)
winning_str

'Beirobotics LLC|Ultra Safe Nuclear Corporation|Andluca Technologies Inc.|FURCIFER INC.|Kurt J. Lesker Company|Nanosys, Inc.'

In [25]:
winning_patents = []
for p in patent:
    if 'assignee' in p and 'orgname' in p['assignee']:
        if p['assignee']['orgname'] in winning_list:
            print(p['assignee']['orgname'])
            winning_patents.append(p)

Beirobotics LLC
FURCIFER INC.
Nanosys, Inc.
Ultra Safe Nuclear Corporation
Kurt J. Lesker Company
Andluca Technologies Inc.
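The filter above relies on exact string equality, which happened to match all six names in this file. If company names were spelled differently across the two databases, a normalized comparison could be used instead. This is a minimal sketch, where `normalize_org` and the sample records are hypothetical.

```python
import re

def normalize_org(name):
    """Lower-case the name, turn punctuation into spaces and collapse whitespace."""
    name = name.casefold()
    name = re.sub(r'[^\w\s]', ' ', name)
    return ' '.join(name.split())

winning_names = ['FURCIFER INC.', 'Nanosys, Inc.']
winning_keys = {normalize_org(name) for name in winning_names}

# Made-up records, including ones with no assignee or no orgname.
records = [
    {'assignee': {'orgname': 'NANOSYS, INC'}},
    {'assignee': {'orgname': 'Furcifer Inc.'}},
    {'assignee': {'orgname': 'Other Corp'}},
    {'assignee': {}},
    {'title': 'record without an assignee'},
]

matches = [
    record for record in records
    if normalize_org(record.get('assignee', {}).get('orgname') or '') in winning_keys
]
print([record['assignee']['orgname'] for record in matches])
```

Using `dict.get` with defaults also handles records that lack an `assignee` or `orgname` key without a separate membership check.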


In [26]:
winning_patents = pd.DataFrame(winning_patents)

In [27]:
winning_patents

Unnamed: 0,country,doc-number,date,application-reference,title,assignee,inventors,abstract,claims
0,US,20230227158,20230720,"{'country': 'US', 'doc-number': 18123405, 'dat...",UNMANNED AERIAL SYSTEM AND METHOD FOR CONTACT ...,"{'orgname': 'Beirobotics LLC', 'city': 'Richmo...","[{'last-name': 'Beiro', 'first-name': 'Michael...",\nA system for performing work on electrical p...,a power line tool adapted to perch on an energ...
1,US,20230229051,20230720,"{'country': 'US', 'doc-number': 17577538, 'dat...",METHOD AND DEVICE FOR CONTROLLING STATES OF DY...,"{'orgname': 'FURCIFER INC.', 'city': 'FREMONT'...","[{'last-name': 'WANG', 'first-name': 'JIAN', '...",\nThe disclosure relates generally to a method...,selecting a desired optical state of the elect...
2,US,20230229087,20230720,"{'country': 'US', 'doc-number': 18098167, 'dat...",UV-CURABLE QUANTUM DOT FORMULATIONS,"{'orgname': 'Nanosys, Inc.', 'city': 'Milpitas...","[{'last-name': 'IPPEN', 'first-name': 'Christi...",\nProvided are patterned films comprising nano...,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
3,US,20230230714,20230720,"{'country': 'US', 'doc-number': 18010358, 'dat...",CONTROL DRUM CONTROLLER FOR NUCLEAR REACTOR SY...,"{'orgname': 'Ultra Safe Nuclear Corporation', ...","[{'last-name': 'Chaleff', 'first-name': 'Ethan...",\nA nuclear reactor system includes a nuclear ...,a pressure vessel;\na nuclear reactor core dis...
4,US,20230230802,20230720,"{'country': 'US', 'doc-number': 18178664, 'dat...",Ultra High Purity Conditions for Atomic Scale ...,"{'orgname': 'Kurt J. Lesker Company', 'city': ...","[{'last-name': 'Rayner, JR.', 'first-name': 'G...",\nAn apparatus for atomic scale processing is ...,a reactor having inner and outer surfaces;\nwh...
5,US,20230231508,20230720,"{'country': 'US', 'doc-number': 18123817, 'dat...",WINDOWS WITH POWER GENERATION FROM TRANSPARENT...,"{'orgname': 'Andluca Technologies Inc.', 'city...","[{'last-name': 'Davy', 'first-name': 'Nicholas...",\nIllustrative embodiments of the invention ge...,a rigid transparent panel including a transpar...


In [92]:
#pmatch_df = pd.read_json('../preprocessed_files/patents.json')

In [28]:
pfinal_df = pd.concat([winning_patents, pmatch_df], ignore_index=True)

In [29]:
pfinal_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1006 entries, 0 to 1005
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   country                1006 non-null   object
 1   doc-number             1006 non-null   int64 
 2   date                   1006 non-null   object
 3   application-reference  1006 non-null   object
 4   title                  1006 non-null   object
 5   assignee               1006 non-null   object
 6   inventors              1006 non-null   object
 7   abstract               1006 non-null   object
 8   claims                 1006 non-null   object
 9   abstract_entities      1000 non-null   object
dtypes: int64(1), object(9)
memory usage: 78.7+ KB


In [30]:
%%time
pfinal_df['abstract_entities'] = sph.extract_tech_entities(nlp, model,pfinal_df['abstract'])
pfinal_df['claim_entities'] = sph.extract_tech_entities(nlp, model, pfinal_df['claims'])

CPU times: user 7min 39s, sys: 37.2 s, total: 8min 16s
Wall time: 8min 55s


In [31]:
pfinal_df.to_json('../preprocessed_files/patents.json')