# Training-Test set setup

Create the dataset from Paul-san's data.

- gs://ppr/fast_data/statistical_learning/ml-study-phys/patent/patent_applications_with_office_actions_2005-2018.zip
- gs://ppr/fast_data/statistical_learning/ml-study-phys/patent/patent_grants_with_office_actions_2005-2012.zip
- gs://ppr/fast_data/statistical_learning/ml-study-phys/patent/applications_with_office_actions_citing_grants_with_inner_join.zip

Use only the rejection 102 data.
From the rejection 102 data, sample 1000 application IDs for the training set and another 1000 application IDs for the test set.

The resulting datasets are:

- gs://karino2-uspatent/citations_info_2000.df.gz
- gs://karino2-uspatent/testset_app_1000.df.gz
- gs://karino2-uspatent/training_app_1000.df.gz
- gs://karino2-uspatent/grants_for_2000.df.gz



### Download Paul's data from Cloud Storage (prerequisite)

https://github.com/Pawlovicky/US-patent-analysis

The data is downloaded outside of Docker because setting up gsutil inside Docker is a little messy.
Run the following commands.

```
cd ../data
gsutil cp gs://ppr/fast_data/statistical_learning/ml-study-phys/patent/patent_applications_with_office_actions_2005-2018.zip ./
gsutil cp gs://ppr/fast_data/statistical_learning/ml-study-phys/patent/patent_grants_with_office_actions_2005-2012.zip ./
gsutil cp gs://ppr/fast_data/statistical_learning/ml-study-phys/patent/applications_with_office_actions_citing_grants_with_inner_join.zip ./
unzip applications_with_office_actions_citing_grants_with_inner_join.zip
# takes 10min or more.
unzip patent_applications_with_office_actions_2005-2018.zip
mv applicationsY.h5 patent_applications_with_office_actions_2005-2018.h5
unzip patent_grants_with_office_actions_2005-2012.zip
mv grants_n6.h5 patent_grants_with_office_actions_2005-2012.h5

```
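
Before starting the notebook, it may help to confirm that the expected files are in place. The following is a minimal sketch: the helper name `missing_files` is hypothetical, and the file list simply mirrors the names produced by the commands above and read by the cells below.

```python
from pathlib import Path

# File names produced by the unzip/mv commands above.
EXPECTED_FILES = [
    "applications_with_office_actions_citing_grants_with_inner_join.csv",
    "patent_applications_with_office_actions_2005-2018.h5",
    "patent_grants_with_office_actions_2005-2012.h5",
]

def missing_files(data_dir, names=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_files("../data")
    print("missing:", missing if missing else "none")
```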

In [1]:
import h5py
import pandas as pd
import numpy as np

In [3]:
oa_citations = pd.read_csv("../data/applications_with_office_actions_citing_grants_with_inner_join.csv").iloc[:, 1:]

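
The `.iloc[:, 1:]` in the cell above drops the first CSV column. Assuming that column is a row index written out by an earlier `to_csv` call (an assumption; the source does not say), an alternative sketch is to let `read_csv` consume it as the index. The toy CSV below is made up for illustration. `low_memory=False` is optional: it asks pandas to parse the file in one pass instead of in chunks, which avoids per-chunk dtype guessing on large files at the cost of more memory.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file; the leading unnamed column is a saved index.
csv_text = """,app_id,parsed,rejection_102
0,13910109,7968444,0
1,13910109,7780839,0
2,12000025,7000001,1
"""

# Approach used in the notebook: read everything, then drop the first column.
df_a = pd.read_csv(io.StringIO(csv_text)).iloc[:, 1:]

# Alternative: treat the first column as the index while reading.
df_b = pd.read_csv(io.StringIO(csv_text), index_col=0, low_memory=False)

print(list(df_a.columns) == list(df_b.columns))
print(df_a.shape, df_b.shape)
```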


In [4]:
oa_citations.head()

| | app_id | app_fnm | citation_pat_pgpub_id | parsed | ifw_number | action_type | action_subtype | form892 | form1449 | citation_in_oa | ... | rejection_103 | rejection_112 | rejection_dp | objection | allowed_claims | cite102_gt1 | cite103_gt3 | cite103_eq1 | cite103_max | signature_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13910109 | /work/data/apps/2014/ipa141204/F_1142.xml | 7968444 | 7968444 | IAFGKOGCPXXIFW4 | 103.0 | | 1 | 1 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 |
| 1 | 13910109 | /work/data/apps/2014/ipa141204/F_1142.xml | 7780839 | 7780839 | IAFGKOGCPXXIFW4 | 103.0 | | 1 | 1 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 |
| 2 | 13910109 | /work/data/apps/2014/ipa141204/F_1142.xml | 7968444 | 7968444 | IF5G7INKPXXIFW4 | 103.0 | | 1 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 |
| 3 | 13910109 | /work/data/apps/2014/ipa141204/F_1142.xml | 7780839 | 7780839 | IF5G7INKPXXIFW4 | 103.0 | | 1 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 |
| 4 | 13910109 | /work/data/apps/2014/ipa141204/F_1142.xml | 7968444 | 7968444 | IMNIOWMYPXXIFW4 | 103.0 | | 1 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |


In [5]:
oa_citations.shape

(1229300, 41)

In [6]:
oa_citations.app_id.unique().shape

(494374,)

## Use rejection 102 only

In [7]:
oc_citations_103 = oa_citations[oa_citations.rejection_103 == 1]

In [8]:
oc_citations_103.shape

(1150519, 41)

In [9]:
oc_citations_102 = oa_citations[oa_citations.rejection_102 == 1]

In [10]:
oc_citations_102.shape

(574503, 41)

In [11]:
oc_citations_102.app_id.unique().shape

(268955,)

In [12]:
app_id_102s = sorted(list(set(oc_citations_102.app_id)))

In [13]:
app_id_102s[0:5]

[12000025, 12000026, 12000028, 12000033, 12000034]
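
The filtering and deduplication steps can be illustrated on a small made-up frame. This sketch assumes only the `app_id` and `rejection_102` columns seen above; the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "app_id":        [12000026, 12000025, 12000025, 12000028],
    "rejection_102": [1, 1, 1, 0],
})

# Keep only rows flagged as 102 rejections.
toy_102 = toy[toy.rejection_102 == 1]

# One application can appear in several rows, so deduplicate before sampling.
toy_ids = sorted(toy_102.app_id.unique().tolist())
print(toy_ids)
```

Deduplicating first matters because sampling directly from the rows would favor applications with many citations.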

## Sample 1000 training IDs and 1000 test IDs

In [14]:
import random

In [15]:
random.seed(1234)

In [16]:
training_id = random.sample(app_id_102s, 1000)

In [17]:
training_id_set = set(training_id)

In [18]:
app_id_except_training = [app for app in app_id_102s if app not in training_id_set]

In [19]:
len(app_id_except_training)

267955

In [20]:
testset_id = random.sample(app_id_except_training, 1000)

In [21]:
testset_id[0:5]

[14307191, 13137006, 12741959, 12643447, 14200253]

In [22]:
len(training_id), len(testset_id)

(1000, 1000)
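
Because the test IDs are drawn only from the IDs left after removing the training set, the two samples cannot overlap. A minimal sketch of the same two-stage sampling, wrapped in a hypothetical helper `split_ids` with an explicit disjointness check, might look like this. It uses a local `random.Random` instance rather than the module-level seed, which is a design choice of the sketch and not of the notebook, so it will not reproduce the exact IDs shown above.

```python
import random

def split_ids(ids, n_train, n_test, seed=1234):
    """Sample disjoint training and test ID lists from ids."""
    rng = random.Random(seed)           # local generator; leaves global state untouched
    train = rng.sample(ids, n_train)
    train_set = set(train)
    remaining = [i for i in ids if i not in train_set]
    test = rng.sample(remaining, n_test)
    assert train_set.isdisjoint(test)   # no ID may appear in both splits
    return train, test

# Made-up ID range, for illustration only.
train, test = split_ids(list(range(12000000, 12000100)), 10, 10)
print(len(train), len(test), set(train).isdisjoint(test))
```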

### Keep citation info only for the training and test sets

In [23]:
target_idset = set(training_id)|set(testset_id)

In [24]:
citations_info_target = oc_citations_102[oc_citations_102.app_id.isin(target_idset)]

In [25]:
citations_info_target.shape

(4179, 41)

In [26]:
citations_info_target = citations_info_target.reset_index(drop=True)

In [27]:
citations_info_target.to_pickle("../data/citations_info_2000.df.gz")
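
`to_pickle` infers gzip compression from the `.gz` suffix, and `pd.read_pickle` does the same when loading. A short round-trip sketch on a made-up frame, written to a temporary directory:

```python
import os
import tempfile
import pandas as pd

# Invented values; only the column names follow the notebook.
toy = pd.DataFrame({"app_id": [12000025, 12000026], "parsed": [7000001, 7000002]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.df.gz")
    toy.to_pickle(path)                 # default compression="infer" picks gzip from ".gz"
    restored = pd.read_pickle(path)     # compression is inferred the same way on load

print(restored.equals(toy))
```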

### Retrieve related application XMLs

In [28]:
appsh5  = h5py.File('../data/patent_applications_with_office_actions_2005-2018.h5' , 'r')

In [29]:
# Dataset.value was removed in h5py 3.0; [()] reads the whole dataset instead.
appsh5[str(training_id[0])][()][0:50]

'<us-patent-application lang="EN" dtd-version="v4.3'

In [30]:
training_app_df = pd.DataFrame({"app_id": training_id, "xml": [appsh5[str(tid)][()] for tid in training_id]})

In [31]:
training_app_df.head()

| | app_id | xml |
|---|---|---|
| 0 | 14222691 | `<us-patent-application lang="EN" dtd-version="...` |
| 1 | 12515852 | `<us-patent-application lang="EN" dtd-version="...` |
| 2 | 12033424 | `<us-patent-application lang="EN" dtd-version="...` |
| 3 | 12402344 | `<us-patent-application lang="EN" dtd-version="...` |
| 4 | 12155425 | `<us-patent-application lang="EN" dtd-version="...` |


In [32]:
testset_app_df = pd.DataFrame({"app_id": testset_id, "xml": [appsh5[str(tid)][()] for tid in testset_id]})

In [33]:
training_app_df.shape, testset_app_df.shape

((1000, 2), (1000, 2))

In [34]:
training_app_df.head().app_id

0    14222691
1    12515852
2    12033424
3    12402344
4    12155425
Name: app_id, dtype: int64

In [35]:
testset_app_df.iloc[1]

app_id                                             13137006
xml       <us-patent-application lang="EN" dtd-version="...
Name: 1, dtype: object

### Save the application DataFrames

In [36]:
training_app_df.to_pickle("../data/training_app_1000.df.gz")
testset_app_df.to_pickle("../data/testset_app_1000.df.gz")

In [37]:
appsh5.close()
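
A minimal sketch of the same lookup using a context manager, so the file is closed even if a key is missing. It builds a tiny made-up HDF5 file first; the real files are assumed to store one XML string per dataset, keyed by the ID as a string, as the cells above suggest. Depending on the h5py version, string datasets may come back as `bytes`, so the hypothetical helper `read_xml` decodes when needed.

```python
import os
import tempfile
import h5py
import pandas as pd

def read_xml(h5file, key):
    """Read one scalar string dataset and return it as str."""
    value = h5file[str(key)][()]        # [()] reads the whole dataset
    return value.decode("utf-8") if isinstance(value, bytes) else value

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.h5")
    with h5py.File(path, "w") as f:     # made-up stand-in for the real file
        f.create_dataset("12000025", data="<us-patent-application/>")
        f.create_dataset("12000026", data="<us-patent-application/>")

    ids = [12000025, 12000026]
    with h5py.File(path, "r") as f:     # closed automatically on exit
        toy_df = pd.DataFrame({"app_id": ids, "xml": [read_xml(f, i) for i in ids]})

print(toy_df.shape)
```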

### Retrieve grant XMLs

In [40]:
grantsh5 = h5py.File('../data/patent_grants_with_office_actions_2005-2012.h5', 'r')

In [41]:
grants_target_ids = sorted(list(citations_info_target.parsed.unique()))

In [42]:
grants_target_df = pd.DataFrame({"parsed": grants_target_ids, "xml": [grantsh5[str(pid)][()] for pid in grants_target_ids]})

In [43]:
grants_target_df.shape

(2524, 2)

In [44]:
grants_target_df.to_pickle("../data/grants_for_2000.df.gz")

In [45]:
grantsh5.close()