In [30]:

import selenium
from selenium import webdriver
from time import sleep
import pandas as pd
import pickle

## Scrape text for Abstracts using Selenium

Within a patent, the abstract is short and carries the gist of what the patent is about. Such distilled text tends to work well for topic modeling: it needs less cleaning and text normalization, and it usually gives better results.

Read the `pickle` file from the previous step.

In [31]:
file_path = './data/outputs/autonomous.pkl'
df = pd.read_pickle(file_path)

In [32]:
# Keep the first 50 rows; copy so that adding columns later does not act on a slice
df = df[:50].copy()

Import the helper functions that create the Chrome driver and run XPath queries.

In [33]:
from utils.selenium_utils import get_driver, get_element

In [34]:
driver = get_driver()
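
The `utils.selenium_utils` module is not shown in this notebook. As a rough sketch of what the two helpers might look like, `get_element` could load a page and return the text of every node matching the XPath, which would be consistent with the list returned in the test further down. The function bodies, the `headless` and `delay` parameters, and the one-second pause are all assumptions, not the actual implementation.

```python
from time import sleep


def get_driver(headless=True):
    """Create a Chrome driver (hypothetical stand-in for utils.selenium_utils.get_driver)."""
    # Imported lazily so that get_element can be tried with any driver-like object.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless")  # run Chrome without opening a window
    return webdriver.Chrome(options=options)


def get_element(url, query, driver, delay=1.0):
    """Load `url` and return the text of every element matching the XPath `query`."""
    driver.get(url)
    sleep(delay)  # assumed pause to let the page render; an explicit wait may be more robust
    # "xpath" is the string value of selenium's By.XPATH locator strategy.
    elements = driver.find_elements("xpath", query)
    return [element.text for element in elements]
```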

In [35]:
query = '//*[@id="abstract"]'

In [36]:
# Getting a single url to test the result
url = df.loc[1, 'result link']
url

'https://patents.google.com/patent/US9315178B1/en'

In [37]:
get_element(url, query, driver)

['',
 'Abstract\nIn an example method, a vehicle configured to operate in an autonomous mode could predict an output of the vehicle based on an input provided to control the vehicle and a state of the vehicle. The method could include receiving an indication of an input from at least one input-indication sensor and an indication of an output from at least one output-indication sensor. A predicted output value could be calculated based on the indication of the input and a state of the vehicle. The predicted output value could be compared with the indication of the output. If the comparison is not within a threshold range, an alert indicator could be created. Upon creating the alert indicator, an alert action could be activated.']

### Apply the function for all the urls

For this we will use the pandas `apply()` method, which runs the function on every row.

We can use `progress_apply()` instead to get a progress bar (this requires importing `tqdm`).

In [38]:
from tqdm.notebook import tqdm
tqdm.pandas()

In [49]:
# Apply the function to every value in the 'result link' column
df['text'] = df['result link'].progress_apply(lambda url: get_element(url, query, driver))

  0%|          | 0/50 [00:00<?, ?it/s]
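
Scraping many pages in one `apply()` call means that a single timeout or missing page can abort the whole run. Below is a minimal sketch of one way to guard against that, assuming it is acceptable to record an empty list for a URL that keeps failing. The `safe_scrape` wrapper and its `fetch` argument are hypothetical names; `fetch` stands in for a call such as `get_element`, and `fake_fetch` is a toy replacement so the example runs without a browser.

```python
import pandas as pd


def safe_scrape(url, fetch, retries=2):
    """Call fetch(url), retrying on any exception; return [] if every attempt fails."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: network and driver errors vary
            print(f"attempt {attempt + 1} failed for {url}: {error}")
    return []


# Toy stand-in for get_element: fails on one URL, succeeds on the other.
def fake_fetch(url):
    if "bad" in url:
        raise RuntimeError("page did not load")
    return ["", "Abstract\nExample text"]


links = pd.Series(["https://example.com/good", "https://example.com/bad"])
texts = links.apply(lambda url: safe_scrape(url, fake_fetch))
```

With the notebook's objects, the same idea would read `df['result link'].progress_apply(lambda url: safe_scrape(url, lambda u: get_element(u, query, driver)))`.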

Clean the data and convert the date columns to datetimes for exploratory data analysis.

In [50]:
df['priority date'] = pd.to_datetime(df['priority date'])
df['filing/creation date'] = pd.to_datetime(df['filing/creation date'])
df['publication date'] = pd.to_datetime(df['publication date'])
df['grant date'] = pd.to_datetime(df['grant date'], errors = 'ignore')
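
Some rows have no grant date (the summary further down counts 39 grant dates for 50 rows). As a small illustration on made-up values rather than the patent data, `pd.to_datetime` turns missing entries into `NaT`, and `errors='coerce'` does the same for strings that cannot be parsed instead of raising an error.

```python
import pandas as pd

# Made-up sample: one valid date, one missing value, one unparseable string.
raw = pd.Series(["2016-04-19", None, "not a date"])

parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.isna().tolist())  # which entries became NaT
print(parsed.notna().sum())    # number of usable dates, comparable to `count` in describe()
```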

In [51]:
df.dtypes

id                              object
title                           object
assignee                        object
inventor/author                 object
priority date           datetime64[ns]
filing/creation date    datetime64[ns]
publication date        datetime64[ns]
grant date              datetime64[ns]
result link                     object
abstract                        object
text                            object
dtype: object

In [52]:
df.sample()

| priority date (index) | id | title | assignee | inventor/author | priority date | filing/creation date | publication date | grant date | result link | abstract | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-03-19 | US-9835465-B2 | Method for operating an autonomous vehicle on ... | Ford Global Technologies, Llc | Frederic Stefan | 2015-03-19 | 2016-03-16 | 2017-12-05 | 2017-12-05 | https://patents.google.com/patent/US9835465B2/en | [, Abstract\nA method for operating a motor ve... | [, Abstract\nA method for operating a motor ve... |


In [53]:
df.describe(datetime_is_numeric=True)

| | priority date | filing/creation date | publication date | grant date |
|---|---|---|---|---|
| count | 50 | 50 | 50 | 39 |
| mean | 2015-12-04 02:52:48 | 2016-11-22 04:48:00 | 2018-10-29 17:45:36 | 2018-10-19 09:50:46.153846272 |
| min | 2012-03-15 00:00:00 | 2012-04-13 00:00:00 | 2014-03-18 00:00:00 | 2014-03-18 00:00:00 |
| 25% | 2014-04-23 00:00:00 | 2015-12-24 12:00:00 | 2017-10-26 06:00:00 | 2017-05-09 00:00:00 |
| 50% | 2016-04-03 12:00:00 | 2017-06-12 12:00:00 | 2019-01-08 00:00:00 | 2019-02-05 00:00:00 |
| 75% | 2017-06-12 18:00:00 | 2018-01-18 12:00:00 | 2020-01-08 12:00:00 | 2020-06-16 00:00:00 |
| max | 2018-07-19 00:00:00 | 2019-09-06 00:00:00 | 2021-11-09 00:00:00 | 2021-11-09 00:00:00 |


## Set index to patents' priority date

In [54]:
df.index = df['priority date']
# df = df.set_index('priority date')
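
The two lines above are not quite equivalent: assigning to `df.index` keeps `priority date` as a regular column as well, while `set_index('priority date')` moves it out of the columns by default. A small sketch on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["A", "B"],
    "priority date": pd.to_datetime(["2018-01-02", "2012-04-13"]),
})

# Variant 1: assign the column as the index; the column itself stays in place.
assigned = toy.copy()
assigned.index = assigned["priority date"]

# Variant 2: set_index drops the column unless drop=False is passed.
moved = toy.set_index("priority date")
kept = toy.set_index("priority date", drop=False)

print(list(assigned.columns))  # columns after assigning to .index
print(list(moved.columns))     # columns after set_index with the default drop=True
print(list(kept.columns))      # columns after set_index with drop=False
```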

In [55]:
df.head()

| priority date (index) | id | title | assignee | inventor/author | priority date | filing/creation date | publication date | grant date | result link | abstract | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | US-2019204842-A1 | Trajectory planner with dynamic cost learning ... | GM Global Technology Operations LLC | Sayyed Rouhollah Jafari Tafti, Guangyu J. Zou,... | 2018-01-02 | 2018-01-02 | 2019-07-04 | NaT | https://patents.google.com/patent/US2019020484... | [, Abstract\nA vehicle, system and method of a... | [, Abstract\nA vehicle, system and method of a... |
| 2012-04-13 | US-9315178-B1 | Model checking for autonomous vehicles | Google Inc. | David I. Ferguson, Dmitri A. Dolgov, Christoph... | 2012-04-13 | 2012-04-13 | 2016-04-19 | 2016-04-19 | https://patents.google.com/patent/US9315178B1/en | [, Abstract\nIn an example method, a vehicle c... | [, Abstract\nIn an example method, a vehicle c... |
| 2017-01-13 | US-2018201256-A1 | Autonomous parking of vehicles inperpendicular... | Ford Global Technologies, Llc | Eric Hongtei Tseng, Li Xu, Kyle Simmons, Dougl... | 2017-01-13 | 2017-01-13 | 2018-07-19 | NaT | https://patents.google.com/patent/US2018020125... | [, Abstract\nMethod and apparatus are disclose... | [, Abstract\nMethod and apparatus are disclose... |
| 2015-03-19 | US-9835465-B2 | Method for operating an autonomous vehicle on ... | Ford Global Technologies, Llc | Frederic Stefan | 2015-03-19 | 2016-03-16 | 2017-12-05 | 2017-12-05 | https://patents.google.com/patent/US9835465B2/en | [, Abstract\nA method for operating a motor ve... | [, Abstract\nA method for operating a motor ve... |
| 2016-05-04 | US-10486699-B2 | Off-road autonomous driving | Ford Global Technologies, Llc | Jianbo Lu, Davor David Hrovat, Hongtei Eric Tseng | 2016-05-04 | 2017-09-28 | 2019-11-26 | 2019-11-26 | https://patents.google.com/patent/US10486699B2/en | [, Abstract\nA vehicle system includes a proce... | [, Abstract\nA vehicle system includes a proce... |


Each value in the `text` column is a list. We can use the `explode()` method to turn every list element into its own row.

In [57]:
df = df.explode('text')
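
To see what `explode()` does to a list column, here is a sketch on a toy frame with made-up values. Each list element becomes its own row, and the index label is repeated for every element that came from the same list.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "id": ["US-1", "US-2"],
        "text": [["", "Abstract\nFirst example"], ["", "Abstract\nSecond example"]],
    },
    index=pd.to_datetime(["2018-01-02", "2012-04-13"]),
)

exploded = toy.explode("text")

print(len(toy), "rows before,", len(exploded), "rows after")
print(exploded.index.is_unique)  # whether index labels are still unique after explode
```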

In [None]:
df.head(3)

| priority date (index) | id | title | assignee | inventor/author | priority date | filing/creation date | publication date | grant date | result link | abstract | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | US-2019204842-A1 | Trajectory planner with dynamic cost learning ... | GM Global Technology Operations LLC | Sayyed Rouhollah Jafari Tafti, Guangyu J. Zou,... | 2018-01-02 | 2018-01-02 | 2019-07-04 | NaT | https://patents.google.com/patent/US2019020484... | [, Abstract\nA vehicle, system and method of a... | |
| 2018-01-02 | US-2019204842-A1 | Trajectory planner with dynamic cost learning ... | GM Global Technology Operations LLC | Sayyed Rouhollah Jafari Tafti, Guangyu J. Zou,... | 2018-01-02 | 2018-01-02 | 2019-07-04 | NaT | https://patents.google.com/patent/US2019020484... | [, Abstract\nA vehicle, system and method of a... | Abstract\nA vehicle, system and method of auto... |
| 2012-04-13 | US-9315178-B1 | Model checking for autonomous vehicles | Google Inc. | David I. Ferguson, Dmitri A. Dolgov, Christoph... | 2012-04-13 | 2012-04-13 | 2016-04-19 | 2016-04-19 | https://patents.google.com/patent/US9315178B1/en | [, Abstract\nIn an example method, a vehicle c... | |


Let's remove the rows where `text` is empty. The filter below keeps only rows whose text is longer than 10 characters.

In [64]:
df = df[df['text'].str.len() > 10]
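
The length filter also drops missing values, because `.str.len()` gives no length for them and the comparison with 10 is then not satisfied. A short sketch with made-up strings:

```python
import pandas as pd

text = pd.Series(["", "Abstract\nA vehicle system includes a processor.", None, "Abstract"])

lengths = text.str.len()  # character count per entry; missing for the None entry
mask = lengths > 10       # not satisfied for empty, short, and missing entries
kept = text[mask]

print(kept.tolist())
```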

In [67]:
df.head(3)

| priority date (index) | id | title | assignee | inventor/author | priority date | filing/creation date | publication date | grant date | result link | abstract | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | US-2019204842-A1 | Trajectory planner with dynamic cost learning ... | GM Global Technology Operations LLC | Sayyed Rouhollah Jafari Tafti, Guangyu J. Zou,... | 2018-01-02 | 2018-01-02 | 2019-07-04 | NaT | https://patents.google.com/patent/US2019020484... | [, Abstract\nA vehicle, system and method of a... | Abstract\nA vehicle, system and method of auto... |
| 2012-04-13 | US-9315178-B1 | Model checking for autonomous vehicles | Google Inc. | David I. Ferguson, Dmitri A. Dolgov, Christoph... | 2012-04-13 | 2012-04-13 | 2016-04-19 | 2016-04-19 | https://patents.google.com/patent/US9315178B1/en | [, Abstract\nIn an example method, a vehicle c... | Abstract\nIn an example method, a vehicle conf... |
| 2017-01-13 | US-2018201256-A1 | Autonomous parking of vehicles inperpendicular... | Ford Global Technologies, Llc | Eric Hongtei Tseng, Li Xu, Kyle Simmons, Dougl... | 2017-01-13 | 2017-01-13 | 2018-07-19 | NaT | https://patents.google.com/patent/US2018020125... | [, Abstract\nMethod and apparatus are disclose... | Abstract\nMethod and apparatus are disclosed f... |


Save the new dataframe as a pickle file.

In [68]:
output_filepath = "./data/outputs/autonomous_clean.pkl"
df.to_pickle(output_filepath)
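
A quick way to confirm that the file was written correctly is to read it back and compare it with the original. The sketch below does this with a toy frame and a temporary directory rather than the notebook's `./data/outputs` path; the file name `toy_clean.pkl` is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame(
    {"id": ["US-1"], "text": ["Abstract\nExample"]},
    index=pd.to_datetime(["2018-01-02"]),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_clean.pkl"
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle keeps dtypes and the datetime index, so the restored frame can be compared directly.
print(restored.equals(toy))
```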