### Investigate application patent data.

In [1]:
import pandas as pd

In [2]:
application_patent = pd.read_pickle("../data/extacted_application2017_data.dat")

In [3]:
application_patent.head()

Unnamed: 0,application_id,cited_doc_number,patent_data
0,15338532,20170043899,"[<us-patent-application lang=""EN"" dtd-version=..."
1,15338532,20170043899,"[<us-patent-application lang=""EN"" dtd-version=..."
2,15051356,20170246242,"[<us-patent-application lang=""EN"" dtd-version=..."
3,15051356,20170246242,"[<us-patent-application lang=""EN"" dtd-version=..."
4,15238319,20170066736,"[<us-patent-application lang=""EN"" dtd-version=..."


Use regular expressions to extract the kind code and the claim text.

In [83]:
one_patent = application_patent['patent_data'][0]

In [154]:
one_patent

['<us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20170043899A1-20170216.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20170201" date-publ="20170216">',
 '<us-bibliographic-data-application lang="EN" country="US">',
 '<publication-reference>',
 '<document-id>',
 '<country>US</country>',
 '<doc-number>20170043899</doc-number>',
 '<kind>A1</kind>',
 '<date>20170216</date>',
 '</document-id>',
 '</publication-reference>',
 '<application-reference appl-type="utility">',
 '<document-id>',
 '<country>US</country>',
 '<doc-number>15338532</doc-number>',
 '<date>20161031</date>',
 '</document-id>',
 '</application-reference>',
 '<us-application-series-code>15</us-application-series-code>',
 '<classifications-ipcr>',
 '<classification-ipcr>',
 '<ipc-version-indicator><date>20060101</date></ipc-version-indicator>',
 '<classification-level>A</classification-level>',
 '<section>B</section>',
 '<class>65</class>',
 '<subclass>D</subclass>',
 

In [108]:
import re

### For kind codes
kind_pattern = "<kind>"
kind_re = re.compile(kind_pattern)

element_extract_pattern = r"<[^>]*?>"
element_extract_re = re.compile(element_extract_pattern)

### For claims
claim_s_pattern = r"<claim id="
claim_s = re.compile(claim_s_pattern)

claim_e_pattern = r"</claim>"
claim_e = re.compile(claim_e_pattern)

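# Strip claim markup: claim-text tags, the opening <claim id=...> tag,
# the bold claim number with its trailing period, and claim references.
tag_pattern = r"(<claim-text>|</claim-text>|<claim id=.*?>|<b>[0-9]*</b>\.|<claim-ref idref=.*?</claim-ref>)"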
tag_re = re.compile(tag_pattern)

In [110]:
element_extract_re.sub("", "<kind>A1</kind>")

'A1'

In [86]:
tag_re.sub("", "<claim-text><b>1</b>. A bottle extending along a longitudinal axis and that includes, a base;")

' A bottle extending along a longitudinal axis and that includes, a base;'

In [112]:
kind_code = ""
claims = ""

for idx, elem in enumerate(one_patent):
    if kind_re.match(elem):
        # Note: += appends the text of every <kind> line found in the document.
        kind_code += element_extract_re.sub("", elem)
    if claim_s.match(elem):
        # Collect the tag-stripped lines from <claim id=...> up to </claim>.
        i = 0
        while not claim_e.match(one_patent[idx+i]):
            one_line_text = tag_re.sub("", one_patent[idx+i])
            claims += one_line_text
            i += 1

In [113]:
kind_code

'A1'

In [95]:
claims

' A bottle extending along a longitudinal axis and that includes, a base;a neck;an insulative body extending axially between the base and the neck, and including:radially outwardly facing first surfaces spaced axially apart from one another;a radially outwardly facing second surface radially smaller than, and located axially between, the first surfaces;a plurality of projections projecting from the second surface and collectively establishing a radially outwardly facing third surface radially larger than the second surface; andparting line bridges projecting radially outwardly from the second surface, diametrically opposed to one another, and extending axially between the first surfaces; anda label carried by the body over at least a portion of the third surface, wherein a continuous insulation volume is established between the label and the second surface, and extends continuously over more than 90 angular degrees around the bottle. The bottle set forth in , wherein the second surface

Extract the (kind code, claims) pair for every application.

In [115]:
pd.DataFrame([[1,2],[3,4]])

Unnamed: 0,0,1
0,1,2
1,3,4


In [127]:
def extract_kindcode_and_claims(one_patent):
    """Return [kind_code, claims] for one patent given as a list of XML lines."""
    kind_code = ""
    claims = ""

    for idx, elem in enumerate(one_patent):
        if kind_re.match(elem):
            kind_code += element_extract_re.sub("", elem)
        if claim_s.match(elem):
            i = 0
            while not claim_e.match(one_patent[idx+i]):
                one_line_text = tag_re.sub("", one_patent[idx+i])
                claims += one_line_text
                i += 1

    return [kind_code, claims]

In [128]:
# [kind_code, claims] = extract_kindcode_and_claims(one_patent)

In [130]:
%%time

all_pairs = []

for patent in application_patent['patent_data']:
    all_pairs.append( extract_kindcode_and_claims(patent) )

CPU times: user 3.3 s, sys: 11.6 ms, total: 3.32 s
Wall time: 3.32 s


In [131]:
df_for_modeling = pd.DataFrame(all_pairs)

In [132]:
df_for_modeling.head()

Unnamed: 0,0,1
0,A1,A bottle extending along a longitudinal axis ...
1,A1,A bottle extending along a longitudinal axis ...
2,A1,An oral composition comprising minicapsules o...
3,A1,An oral composition comprising minicapsules o...
4,A1,(canceled) A method for processing biomass co...


Count the kind-code categories.

In [140]:
from collections import Counter

In [143]:
cnt = Counter( df_for_modeling[0] )

In [144]:
cnt

Counter({'A1': 4624,
         'A100': 78,
         'A2A1': 4,
         'A2A100': 1,
         'A9A1': 29,
         'A9A100': 2})
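
Keys such as 'A100' and 'A2A1' look like several kind values glued together, which is what the `+=` in the loop produces whenever a document contains more than one line starting with `<kind>`. The following is a minimal sketch, not the method used in this notebook, that keeps only the first `<kind>` value and joins claim lines with a space. The function name `extract_first_kind_and_claims` and the sample lines are hypothetical, and unlike `tag_re` it keeps the text inside `<claim-ref>` elements.

```python
import re

KIND_RE = re.compile(r"<kind>(.*?)</kind>")
CLAIM_START_RE = re.compile(r"<claim id=")
CLAIM_END_RE = re.compile(r"</claim>")
CLAIM_NUMBER_RE = re.compile(r"<b>[0-9]+</b>\.")  # bold claim number, e.g. "<b>1</b>."
TAG_RE = re.compile(r"<[^>]*>")                    # any remaining XML tag


def extract_first_kind_and_claims(lines):
    """Return (first kind code or None, claim text joined with spaces)."""
    kind_code = None
    claim_parts = []
    in_claim = False
    for line in lines:
        if kind_code is None:
            m = KIND_RE.match(line)
            if m:
                kind_code = m.group(1)  # keep the first <kind> only
        if CLAIM_START_RE.match(line):
            in_claim = True
        elif CLAIM_END_RE.match(line):
            in_claim = False
        if in_claim:
            text = TAG_RE.sub("", CLAIM_NUMBER_RE.sub("", line)).strip()
            if text:
                claim_parts.append(text)
    return kind_code, " ".join(claim_parts)


# Hypothetical sample: a second <kind> line is an assumption for illustration.
sample = [
    "<kind>A1</kind>",
    "<kind>00</kind>",
    '<claim id="CLM-00001" num="00001">',
    "<claim-text><b>1</b>. A bottle that includes, a base;",
    "<claim-text>a neck; and</claim-text>",
    "</claim-text>",
    "</claim>",
]
kind_code, claims = extract_first_kind_and_claims(sample)
print(kind_code, "|", claims)
```

Joining with a space also avoids run-together words such as "andparting" in the claims output shown earlier.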

The data is too skewed for analysis: almost every row has kind code A1.
It also contains many duplicated rows.
Remove them.

In [150]:
df_for_modeling = df_for_modeling.drop_duplicates()

In [151]:
cnt = Counter( df_for_modeling[0] )

In [152]:
cnt

Counter({'A1': 2501,
         'A100': 76,
         'A2A1': 2,
         'A2A100': 1,
         'A9A1': 15,
         'A9A100': 2})

Kind code alone does not give a useful target, so we need to look at other data sources.

### Investigate the combination of application data and office_action data.

In [155]:
office_action_df = pd.read_pickle("../data/office_15.dat")

In [156]:
office_action_df.head()

Unnamed: 0,app_id,ifw_number,document_cd,mail_dt,art_unit,uspc_class,uspc_subclass,header_missing,fp_missing,rejection_fp_mismatch,...,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type
3549653,15001440,IKIANOLHRXEAPX4,CTNF,2016-02-16,3762,607,42000,0,0,0,...,0,1,0,0,0,0,0,0,0,1
3560595,15009822,IKSBCEFQRXEAPX1,CTNF,2016-02-22,2139,711,162000,0,0,0,...,1,0,0,0,0,0,0,0,2,1
3561801,15005636,IKSG9XK1RXEAPX1,CTNF,2016-02-22,3766,600,509000,0,0,0,...,0,1,1,0,0,0,0,0,0,0
3562763,15002146,IKSLCJQLRXEAPX4,CTNF,2016-02-22,3723,15,104930,0,0,0,...,0,0,1,1,0,0,0,0,0,1
3564336,15001553,IKTT4P0TRXEAPX1,CTNF,2016-02-19,1625,514,279000,0,1,0,...,0,1,0,0,0,0,0,0,0,1


In [163]:
df_for_modeling.head()

Unnamed: 0,0,1
0,A1,A bottle extending along a longitudinal axis ...
2,A1,An oral composition comprising minicapsules o...
4,A1,(canceled) A method for processing biomass co...
6,A1,An exercise cable handle assembly comprising:...
7,A1,"A composition for repelling an arthropod, com..."


In [164]:
df_for_modeling.columns = ['kind_code', 'claims']

In [165]:
df_for_modeling['app_id'] = application_patent['application_id'].drop_duplicates()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [167]:
df_for_modeling.head()

Unnamed: 0,kind_code,claims,app_id
0,A1,A bottle extending along a longitudinal axis ...,15338532
2,A1,An oral composition comprising minicapsules o...,15051356
4,A1,(canceled) A method for processing biomass co...,15238319
6,A1,An exercise cable handle assembly comprising:...,15078852
7,A1,"A composition for repelling an arthropod, com...",15073698


Merge the two DataFrames on app_id.

In [170]:
df_for_modeling.merge(office_action_df, left_on='app_id', right_on='app_id', how = 'inner')

Unnamed: 0,kind_code,claims,app_id,ifw_number,document_cd,mail_dt,art_unit,uspc_class,uspc_subclass,header_missing,...,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type


In [172]:
office_action_df[ office_action_df['app_id'] == 15338532 ]

Unnamed: 0,app_id,ifw_number,document_cd,mail_dt,art_unit,uspc_class,uspc_subclass,header_missing,fp_missing,rejection_fp_mismatch,...,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type
4316208,15338532,J082O8FWRXEAPX1,CTNF,2017-03-13,3781,215,384000,0,0,0,...,1,0,1,0,0,0,0,0,3,0


In [173]:
pd.merge(df_for_modeling, office_action_df, on='app_id')

Unnamed: 0,kind_code,claims,app_id,ifw_number,document_cd,mail_dt,art_unit,uspc_class,uspc_subclass,header_missing,...,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type


In [176]:
df_for_modeling['app_id'][:1][0]

'15338532'
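
The empty merge results above are consistent with a dtype mismatch: app_id is a string here, while the filter `office_action_df['app_id'] == 15338532` matched an integer. A minimal sketch with toy frames (the frames `left` and `right` are hypothetical stand-ins, not the notebook's data) that checks the dtypes, drops missing ids, and converts before merging:

```python
import pandas as pd

# Toy stand-ins: string ids (with one missing) on the left, integer ids on the right.
left = pd.DataFrame({
    "app_id": ["15338532", "15051356", None],
    "kind_code": ["A1", "A1", "A1"],
})
right = pd.DataFrame({
    "app_id": [15338532, 15051356],
    "document_cd": ["CTNF", "CTFR"],
})

# Inspect the key dtypes first; a string key never equals an integer key.
print(left["app_id"].dtype, right["app_id"].dtype)

# Drop rows without an id, then convert the key to int64 on an explicit copy.
left_clean = left.dropna(subset=["app_id"]).copy()
left_clean["app_id"] = pd.to_numeric(left_clean["app_id"]).astype("int64")

merged = left_clean.merge(right, on="app_id", how="inner")
print(merged)
```

Working on an explicit `.copy()` is also one way to avoid the SettingWithCopyWarning shown in this notebook.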

In [179]:
df_for_modeling['app_id'].astype('int')

ValueError: cannot convert float NaN to integer

In [180]:
len(df_for_modeling)

2597

In [182]:
len( df_for_modeling.dropna() )

2272
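
The 325 rows lost here (2597 - 2272) are probably a side effect of In [165]: assigning a Series to a column aligns on the index, so any row of df_for_modeling whose index label is missing from the de-duplicated application_id Series receives NaN. A minimal sketch with toy data (the frame `raw` and its values are hypothetical) that reproduces the effect and shows one alternative, keeping the id next to the claims before dropping duplicates:

```python
import pandas as pd

# Toy data: application "300" appears twice with different claim text.
raw = pd.DataFrame({
    "application_id": ["100", "100", "200", "300", "300"],
    "claims": ["a", "a", "b", "c", "c2"],
})

# Approach as in the notebook: de-duplicate separately, then assign by index.
pairs = raw[["claims"]].drop_duplicates().copy()   # keeps index 0, 2, 3, 4
ids = raw["application_id"].drop_duplicates()      # keeps index 0, 2, 3
misaligned = pairs.copy()
misaligned["app_id"] = ids                         # index 4 has no id -> NaN

# Alternative: de-duplicate with the id attached, so rows stay aligned.
aligned = (
    raw[["application_id", "claims"]]
    .drop_duplicates()
    .rename(columns={"application_id": "app_id"})
)

print(misaligned["app_id"].isna().sum(), aligned["app_id"].isna().sum())
```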

In [183]:
df_for_modeling = df_for_modeling.dropna()

In [185]:
df_for_modeling['app_id'] = df_for_modeling['app_id'].astype('int')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [187]:
df_combined = df_for_modeling.merge(office_action_df, left_on='app_id', right_on='app_id', how = 'inner')

In [188]:
df_combined.head()

Unnamed: 0,kind_code,claims,app_id,ifw_number,document_cd,mail_dt,art_unit,uspc_class,uspc_subclass,header_missing,...,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type
0,A1,A bottle extending along a longitudinal axis ...,15338532,J082O8FWRXEAPX1,CTNF,2017-03-13,3781,215,384000,0,...,1,0,1,0,0,0,0,0,3,0
1,A1,An oral composition comprising minicapsules o...,15051356,IT4LPNM1RXEAPX0,CTNF,2016-09-19,1613,424,400000,0,...,1,0,1,0,0,0,1,0,10,1
2,A1,(canceled) A method for processing biomass co...,15238319,J4K4AWPURXEAPX3,CTNF,2017-07-05,1653,435,41000,0,...,1,0,0,0,0,0,0,0,3,0
3,A1,An exercise cable handle assembly comprising:...,15078852,J0JYIXB8RXEAPX3,CTNF,2017-03-22,3764,482,139000,0,...,1,1,0,1,0,0,0,0,3,3
4,A1,"A composition for repelling an arthropod, com...",15073698,ITVQ26LXRXEAPX0,CTNF,2016-10-04,1672,514,311000,0,...,0,0,0,0,0,0,0,0,0,1


In [196]:
df_combined.columns

Index(['kind_code', 'claims', 'app_id', 'ifw_number', 'document_cd', 'mail_dt',
       'art_unit', 'uspc_class', 'uspc_subclass', 'header_missing',
       'fp_missing', 'rejection_fp_mismatch', 'closing_missing',
       'rejection_101', 'rejection_102', 'rejection_103', 'rejection_112',
       'rejection_dp', 'objection', 'allowed_claims', 'cite102_gt1',
       'cite103_gt3', 'cite103_eq1', 'cite103_max', 'signature_type'],
      dtype='object')

In [189]:
df_combined.kind_code.value_counts()

A1        2767
A100        86
A9A1        14
A2A1         4
A9A100       3
A2A100       1
Name: kind_code, dtype: int64

In [191]:
df_combined.document_cd.value_counts()

CTNF    2355
CTFR     520
Name: document_cd, dtype: int64
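
Both counts above sum to 2875 rows, more than the 2272 applications that went into the merge, so some applications apparently have more than one office action. A minimal sketch with toy data (the frame `combined` is a hypothetical stand-in for df_combined) that counts repeated applications and keeps only the earliest action per app_id, which is one possible choice and not something this notebook does:

```python
import pandas as pd

# Toy stand-in: application 1 has two office actions.
combined = pd.DataFrame({
    "app_id": [1, 1, 2, 3],
    "document_cd": ["CTNF", "CTFR", "CTNF", "CTNF"],
    "mail_dt": ["2016-09-19", "2017-03-13", "2016-10-04", "2017-07-05"],
})

# How many applications have more than one office action?
actions_per_app = combined.groupby("app_id").size()
n_repeated = int((actions_per_app > 1).sum())

# Keep the earliest action per application (ISO dates sort correctly as strings).
first_action = (
    combined.sort_values("mail_dt")
    .drop_duplicates(subset="app_id", keep="first")
    .reset_index(drop=True)
)

print(n_repeated)
print(first_action)
```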

In [193]:
df_combined.art_unit.value_counts()

3711    26
3762    24
3633    22
3766    22
2852    22
3711    20
2853    19
3762    19
3733    19
3641    19
2853    19
3763    18
3765    18
3766    17
3733    16
2852    16
3638    15
3631    15
3735    15
3678    14
2844    14
3673    14
3632    14
3651    14
3763    13
1615    13
2831    13
3731    12
3618    12
3633    12
        ..
2854     1
2427     1
3622     1
2437     1
2411     1
2438     1
2441     1
2442     1
2124     1
1726     1
2448     1
2644     1
2864     1
2857     1
3691     1
2491     1
1782     1
1732     1
1616     1
2672     1
2625     1
3782     1
2141     1
1645     1
1781     1
1643     1
3625     1
2874     1
2863     1
2456     1
Name: art_unit, Length: 779, dtype: int64
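
In the output above, 3711 and 3762 each appear twice with different counts. That usually means the column mixes types, for example the integer 3711 and the string '3711', which value_counts treats as different keys. A minimal sketch with a toy Series (the values are hypothetical) that exposes the mixed types and normalizes them before counting:

```python
import pandas as pd

# Toy column mixing integers and strings for the same art units.
art_unit = pd.Series([3711, "3711", "3711", 3762, "3762"], dtype=object)

raw_counts = art_unit.value_counts()            # 3711 and "3711" counted separately
type_counts = art_unit.map(type).value_counts() # shows which Python types occur

# Normalize to strings so that equal art units share one key.
normalized_counts = art_unit.astype(str).value_counts()

print(type_counts)
print(normalized_counts)
```

If the same holds for df_combined.art_unit, the 779 distinct values reported above would overstate the real number of art units.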

In [194]:
df_combined.uspc_class.value_counts()

257    157
424    105
606     80
455     79
600     71
345     69
370     69
052     64
607     62
707     59
514     53
623     52
438     47
709     46
340     40
726     40
348     40
604     39
347     38
705     36
473     36
435     34
375     29
428     29
280     28
248     28
725     28
399     27
439     26
381     26
      ... 
057      1
111      1
384      1
392      1
440      1
137      1
184      1
336      1
126      1
326      1
433      1
169      1
505      1
141      1
369      1
507      1
380      1
238      1
352      1
075      1
181      1
124      1
071      1
585      1
318      1
570      1
105      1
445      1
518      1
101      1
Name: uspc_class, Length: 266, dtype: int64

With 779 distinct art_unit values and 266 USPC classes, most of them with only a few rows, it is not easy to set up a classification problem.