# Loading seed data from BigQuery

## Your Cloud Project ID
This notebook and its associated code need a Google Cloud project with BigQuery enabled (it is enabled by default). If you don't already have an account, go to [cloud.google.com](http://cloud.google.com) to create one; the Cloud account is free and you won't be auto-billed. Then copy your project ID and assign it to `bq_project` below.

In [1]:
# bq_project = 'patent-embeddings'
bq_project = 'theta-style-318821'
# Set up authentication with a service account instead of a user account
%env GOOGLE_APPLICATION_CREDENTIALS=/home/schikanski/projects/patent-embedding-visualization/theta-style-318821-65dd848288f4.json
    
%load_ext autoreload
%autoreload 2

env: GOOGLE_APPLICATION_CREDENTIALS=/home/schikanski/projects/patent-embedding-visualization/theta-style-318821-65dd848288f4.json
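
Outside a notebook, the `%env` magic is not available. The following is a minimal sketch of setting the same `GOOGLE_APPLICATION_CREDENTIALS` variable from plain Python. The helper name `set_service_account_credentials` is hypothetical and not part of this project; the early file check is an added convenience, not something the notebook does.

```python
import os
from pathlib import Path


def set_service_account_credentials(key_path: str) -> str:
    """Point Google client libraries at a service-account key file.

    Raises FileNotFoundError right away if the key file is missing,
    instead of failing later at the first BigQuery call.
    """
    resolved = Path(key_path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Service-account key not found: {resolved}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(resolved)
    return str(resolved)


# Example call (hypothetical path, replace with your own key file):
# set_service_account_credentials("~/keys/my-project-key.json")
```

The variable has to be set before the first BigQuery client is created, since the client libraries read it when they look up default credentials.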


## Basic Configuration

In [2]:
# import tensorflow as tf
import pandas as pd
import os

# seed_name = 'hair_dryer'
# seed_name = 'video_codec'
seed_name = "contact_lens"
# seed_name = "contact_lens_us_c"
# seed_name = "3d_printer"

seed_file = 'seeds/' + seed_name + '.seed.csv'

src_dir = "."

patent_dataset = 'patents-public-data:patents.publications_latest'
num_anti_seed_patents = 15000
if not bq_project:
    raise ValueError('You must enter a bq_project above for this code to run.')
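
The `patent_dataset` value uses the `project:dataset.table` form, which is how BigQuery legacy SQL refers to a table. Standard SQL instead expects `project.dataset.table` wrapped in backticks. If you want to query the same table yourself in standard SQL, a conversion along these lines may help. The helper `to_standard_sql_table` is hypothetical, and it assumes a plain project ID without a domain prefix; whether the expander needs such a conversion internally is not shown in this notebook.

```python
def to_standard_sql_table(legacy_ref: str) -> str:
    """Convert 'project:dataset.table' to '`project.dataset.table`'.

    Assumes a plain project ID (no 'domain.com:' prefix).
    """
    project, sep, dataset_table = legacy_ref.partition(":")
    if not sep or "." not in dataset_table:
        raise ValueError(f"Expected 'project:dataset.table', got {legacy_ref!r}")
    return f"`{project}.{dataset_table}`"


print(to_standard_sql_table("patents-public-data:patents.publications_latest"))
# prints `patents-public-data.patents.publications_latest`
```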

## Patent Landscape Expansion

This section creates an instance of `PatentLandscapeExpander`, which queries a BigQuery table of patent data to expand the provided seed set. It produces each expansion level, as well as the final training dataset, as a Pandas dataframe.

In [3]:
import fiz_lernmodule.expansion

expander = fiz_lernmodule.expansion.PatentLandscapeExpander(
    seed_file,
    seed_name,
    bq_project=bq_project,
    patent_dataset=patent_dataset,
    num_antiseed=num_anti_seed_patents,
    us_only=True,
    prepare_training=False)


The next cell runs the expansion, or loads a previously saved result from disk if one exists. The cell after it displays the final training data dataframe.

In [4]:
%%time

training_data_full_df, seed_patents_df, l1_patents_df, l2_patents_df, anti_seed_patents = \
    expander.load_from_disk_or_do_expansion()

Loading landscape data from filesystem at data/contact_lens/landscape_data.pkl
CPU times: user 7.49 ms, sys: 27.9 ms, total: 35.4 ms
Wall time: 54.9 ms
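
The log line above indicates that the result came from a pickle file on disk rather than from a fresh BigQuery run, which explains the short wall time. The general load-or-compute pattern can be sketched as follows. This is an illustration only, not the implementation of `load_from_disk_or_do_expansion`; the function `load_or_compute` and its parameters are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_compute(cache_path: str, compute: Callable[[], T]) -> T:
    """Return the pickled result at cache_path, computing and saving it if absent."""
    path = Path(cache_path)
    if path.is_file():
        # Cache hit: skip the expensive computation entirely.
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    # Cache miss: store the result so the next call is fast.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

With a cache like this, deleting the pickle file is the usual way to force a fresh computation after changing the seed set or the expansion parameters.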


In [5]:
training_data_full_df

Unnamed: 0,pub_num,publication_number,country_code,family_id,priority_date,title_text,abstract_text,claims_text,refs,cpcs,ipcs,assignees_harmonized,ExpansionLevel
0,2009020683,US-2009020683-A1,US,24924433,20001201,High optical quality molds for use in contact ...,The invention provides molds and mold inserts ...,"1 . A mold insert, comprising at least one opt...","US-4327203-A,US-4703097-A,US-7422710-B2,US-584...","Y10S425/808,Y10T428/31663,B29D11/00134,G02B1/0...","B29C33/42,B29D11/00,B29L9/00,B28B7/28,B29C33/3...","STEFFEN ROBERT B,MATIACIO THOMAS A,WILDSMITH C...",Seed
1,8985764,US-8985764-B2,US,44059294,20091117,Contact lens,Provided is a contact lens of double thin type...,The invention claimed is: \n \n 1. A...,"JP-2007538288-A,US-2006203190-A1,US-5912719-A,...",G02C7/048,G02C7/04,"SAKAI YUKIHISA,YAMAGUCHI HIROYUKI,MENICON CO L...",Seed
2,5347674,US-5347674-A,US,22206014,19930709,Contact lens treatment apparatus,A contact lens treatment apparatus for polishi...,What is claimed as being new and desired to be...,"JP-H02123323-A,JP-S63187216-A,JP-S6421818-A","B08B11/00,Y10S134/901,G02C13/008,B08B11/02","B08B11/02,B08B11/00,G02C13/00",GABBERT CHUCK,Seed
3,5238843,US-5238843-A,US,23698565,19891027,Method for cleaning a surface on which is boun...,A method for cleaning a surface on which is bo...,What is claimed is: \n \n 1. A metho...,"EP-0233721-A2,JP-S632911-A,US-4521254-A,JP-S49...","C11D3/38636,A61K8/4933,A61Q5/02,A61Q11/02,A61Q...","A61Q17/04,A61K8/49,A61Q5/02,G02C13/00,A61Q15/0...","GENENCOR INT,PROCTER & GAMBLE",Seed
4,8770747,US-8770747-B2,US,45496267,20101214,Colored contact lens,"A contact lens, comprising a preprint of a fir...",What is claimed is: \n \n 1. A color...,"US-4720188-A,US-6834955-B2,US-4923480-A,WO-994...","G02C7/046,G02C7/04","G02C7/00,G02C7/04,G02C7/02","CORTI SANDRA,CREECH LAURA ASHLEY,NOVARTIS AG",Seed
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3299,10139521,US-10139521-B2,US,58640923,20160420,Silicone elastomer-hydrogel hybrid contact lenses,A silicone elastomer-hydrogel hybrid contact l...,What is claimed is: \n \n 1. A silic...,"US-2015036100-A1,US-8979261-B2,US-9278489-B2,,...","B29D11/00134,B29D11/00067,B29K2083/00,C08L101/...","B29D11/00,B29K83/00,G02C7/04,B29K105/00,B29K33...",COOPERVISION INT HOLDING CO LP,Seed
3300,7572841,US-7572841-B2,US,38476986,20060615,Wettable silicone hydrogel contact lenses and ...,Silicone hydrogel contact lenses having ophtha...,1. A polymerizable silicone hydrogel contact l...,",US-6533415-B2,EP-0395583-B1,WO-9309154-A1,US-...","G02B1/043,C08L83/14","C08G77/14,C08F290/06,G02C7/04",COOPERVISION INT HOLDING CO LP,Seed
3301,2014183767,US-2014183767-A1,US,38537905,20060615,Wettable Silicone Hydrogel Contact Lenses And ...,Silicone hydrogel contact lenses having ophtha...,1 - 57 . (canceled) \n \n \n 5...,"US-5387632-A,US-5965631-A,US-8552085-B2","B29K2039/06,G02B1/043,B29K2077/00,B29K2105/006...","G02B1/04,B29D11/00",COOPERVISION INT HOLDING CO LP,Seed
3302,9625616,US-9625616-B2,US,50382476,20130315,Silicone hydrogel contact lenses,A method is provided for manufacturing ophthal...,What is claimed is: \n \n 1. A metho...,"US-2008048350-A1,,US-2007138668-A1,US-20160039...","G02B1/043,C08L83/06,B29D11/0025,B29D11/00192","G02B1/04,C08L83/06,B29D11/00",COOPERVISION INT HOLDING CO LP,Seed
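
The `refs`, `cpcs`, `ipcs`, and `assignees_harmonized` columns hold comma-separated strings, and some `refs` values contain empty entries (note the `,,` in the rows above). A small sketch for turning such a cell into a clean list follows; `split_codes` is a hypothetical helper, not part of `fiz_lernmodule`.

```python
from typing import List


def split_codes(value: object) -> List[str]:
    """Split a comma-separated cell into a list, dropping empty entries.

    Non-string input (for example a missing value) yields an empty list.
    """
    if not isinstance(value, str):
        return []
    return [token.strip() for token in value.split(",") if token.strip()]


# Value shaped like the start of the `refs` cell of US-9625616-B2 above.
print(split_codes("US-2008048350-A1,,US-2007138668-A1"))
# prints ['US-2008048350-A1', 'US-2007138668-A1']
```

On the dataframe itself this could be applied per column, for example `training_data_full_df["refs"].apply(split_codes)`.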


### Show some stats about the landscape training data

In [6]:
print('Seed/Positive examples:')
print(training_data_full_df[training_data_full_df.ExpansionLevel == 'Seed'].count())

print('\n\nAnti-Seed/Negative examples:')
print(training_data_full_df[training_data_full_df.ExpansionLevel == 'AntiSeed'].count())

Seed/Positive examples:
pub_num                 3304
publication_number      3304
country_code            3304
family_id               3304
priority_date           3304
title_text              3304
abstract_text           3304
claims_text             3304
refs                    3304
cpcs                    3304
ipcs                    3304
assignees_harmonized    3304
ExpansionLevel          3304
dtype: int64


Anti-Seed/Negative examples:
pub_num                 0
publication_number      0
country_code            0
family_id               0
priority_date           0
title_text              0
abstract_text           0
claims_text             0
refs                    0
cpcs                    0
ipcs                    0
assignees_harmonized    0
ExpansionLevel          0
dtype: int64
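
The per-column counts above repeat the same number thirteen times. A more compact summary is one row count per expansion level, sketched below on a made-up toy dataframe. The helper `expansion_level_counts` is hypothetical; it only assumes the `ExpansionLevel` column and the `Seed` and `AntiSeed` labels used in the cell above.

```python
import pandas as pd


def expansion_level_counts(df: pd.DataFrame, levels=("Seed", "AntiSeed")) -> pd.Series:
    """Count rows per expansion level, reporting 0 for levels that are absent."""
    counts = df["ExpansionLevel"].value_counts()
    return counts.reindex(list(levels), fill_value=0)


# Toy data for illustration only (not taken from the real landscape).
toy_df = pd.DataFrame({
    "publication_number": ["A", "B", "C"],
    "ExpansionLevel": ["Seed", "Seed", "Seed"],
})
print(expansion_level_counts(toy_df))
```

Reindexing against the expected labels makes a missing class visible as an explicit zero, which is the situation shown in the anti-seed output above.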


In [7]:
l2_patents_df

Unnamed: 0,publication_number
