This page hosts the current iteration of BC's internal patent scraper. Below is a Python wrapper for the USPTO's PatentsView API that makes it easier to query the full patent database. There is much more that could be done to make the tool more intuitive and more powerful. The adjustments listed below would significantly improve its usability:

---

1.   Creating a UI that lets users enter search terms into an input box, rather than editing the list `` terms `` in the `` generate_search_terms `` function (and likewise for the search terms to be permuted, which are also set in the `` generate_search_terms `` function)
2.   Adding a drop-down selectable list of types of searches (currently, the only included is `` _text_phrase ``, as seen in `` format_search_terms ``) - other possible search types are listed on [this webpage](http://www.patentsview.org/api/query-language.html)
3.   Adding two input boxes for start- and end-dates of the search (currently, this is adjusted through the `` start_date `` and `` end_date `` parameters underneath the `` stringify_list `` function)
4.   Adding a better input method for the number of patents that should be returned (currently, this is adjusted through the `` limit `` parameter underneath `` end_date ``)
5.   Adding a better input method for what types of data should be returned from the search (currently, this is adjusted through the `` fields `` list underneath `` limit ``, and all of the other options for what type of data can be output are listed in the "7 API Endpoints" drop-down in the menu of [this webpage](http://www.patentsview.org/api/doc.html))
6.   The wrapper currently outputs a JSON object, which is easily legible when put through a JSON reader (like [this one](https://jsoneditoronline.org)). However, it would be ideal to have the wrapper output a CSV (after flattening the JSON object). An unsuccessful attempt at this is commented out at the bottom of the current iteration of the wrapper, for anyone who'd like to play around with it and get it working properly
7.   Build in some method of combining multiple calls on the query API to allow for more than 90 search terms to be inputted, possibly of multiple types (the limit can be seen, written explicitly, in the `` generate_search_terms `` function)
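
As a starting point for items 2 through 5, the sketch below builds the query URL from Python dictionaries with `json.dumps` instead of hand-escaped strings, so the search type, dates, limit, and fields all become ordinary arguments that a UI could fill in. The helper name `build_query_url` and its parameters are hypothetical and not part of the wrapper below. The alternative operators (for example `_text_any` and `_text_all`) are assumed to behave as described on the query-language page linked in item 2.

```python
import json
from urllib.parse import urlencode

# Hypothetical helper: not part of the existing wrapper.
def build_query_url(terms, operator="_text_phrase", field="patent_abstract",
                    start_date="2012-01-01", end_date="2012-12-31", limit=50,
                    fields=("patent_number", "patent_title")):
    # Same criteria shape as the wrapper: a date window AND any one of the terms
    criteria = {
        "_and": [
            {"_gte": {"patent_date": start_date}},
            {"_lte": {"patent_date": end_date}},
            {"_or": [{operator: {field: term}} for term in terms]},
        ]
    }
    params = {
        "q": json.dumps(criteria),              # search criteria
        "o": json.dumps({"per_page": limit}),   # number of patents to return
        "f": json.dumps(list(fields)),          # fields to return
    }
    # urlencode percent-escapes quotes and spaces in the JSON
    return "http://www.patentsview.org/api/patents/query?" + urlencode(params)

url = build_query_url(["carbon capture"], operator="_text_any")
print(url)
```

Building the JSON from dictionaries also means that a search term containing a quotation mark is escaped correctly, which plain string concatenation does not guarantee.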

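For item 7, one possible approach is to split a long list of search terms into batches of at most 90, run one query per batch, and merge the results while dropping patents that appear in more than one batch. The sketch below covers only the splitting and merging. The function names are hypothetical, and it assumes each result record is a dictionary carrying a `patent_number` key, as requested in `fields`.

```python
# Hypothetical helpers: not part of the existing wrapper.
def chunk_terms(terms, max_terms=90):
    """Split terms into consecutive batches of at most max_terms each."""
    return [terms[i:i + max_terms] for i in range(0, len(terms), max_terms)]

def merge_results(batches_of_patents, key="patent_number"):
    """Combine per-batch result lists, keeping the first copy of each patent."""
    merged = {}
    for patents in batches_of_patents:
        for patent in patents:
            merged.setdefault(patent[key], patent)
    return list(merged.values())

# Usage with made-up data: 200 terms become three batches
batches = chunk_terms(["term %d" % i for i in range(200)])
print([len(b) for b in batches])

# Two made-up result lists that overlap on patent "2"
combined = merge_results([
    [{"patent_number": "1"}, {"patent_number": "2"}],
    [{"patent_number": "2"}, {"patent_number": "3"}],
])
print([p["patent_number"] for p in combined])
```

Note that `limit` would then apply to each request separately, so the merged list can hold more patents than a single query returns.
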
---

Anyone who lands on this page is welcome to copy the code and play around with it, making any of the above adjustments, provided you're willing to own the changes. When updating this page, please preserve the above list, and add any feature requests that you feel would be a worthwhile improvement to the tool. If you implement any of the above, feel free to add your implementation to the code below, making sure that everything continues to work properly. Please strike through the implemented item in the above list without removing it, and add the date the adjustment was made along with your name, for the sake of tracking changes.

---

The wrapper itself was written mostly by Anish Balaji, with some help from Sam Braverman.



In [None]:
import tkinter as tk
import requests as re
import json
import csv
import pandas

#Create the master window for input
master = tk.Tk()

#Label it as the scraper
master.title("USPTO scraper")

#Set its geometry
master.geometry("640x640+0+0")

#Label all of the input fields
tk.Label(master, text="Individual search terms").grid(row=0)
tk.Label(master, text="Processes or objects (e.g. X of CO2, where X is inputted here)").grid(row=1)
tk.Label(master, text="Type of process or object (e.g. adsorbent of X, where X is inputted here)").grid(row=2)

#Define three entries
e1 = tk.Entry(master)
e2 = tk.Entry(master)
e3 = tk.Entry(master)

#Place the entries in the window
e1.grid(row=0, column=1)
e2.grid(row=1, column=1)
e3.grid(row=2, column=1)

# Generate the list of search terms from a list of individual terms ("terms") and two lists, process_or_object and type_of_process_or_object, whose entries are combined with each other as described under # Permutation
def generate_search_terms():

    # Individual search term, read from the first entry box
    terms = [e1.get()]

    # Inputs for permuting, read from the second and third entry boxes
    process_or_object = [e2.get()]
    type_of_process_or_object = [e3.get()]

    # Permutation
    for t in type_of_process_or_object:
        for p in process_or_object:
            terms.append(t + " " + p)
            terms.append(p + " of " + t)

    #The PatentsView API will only accept searches with <= 90 search terms
    assert(len(terms) <= 90), "Too many search terms inputted."

    return terms

# Formatting of search terms into a form which can be appended onto the extant query URL
def format_search_terms(terms):
    
    base_a = "{\"_text_phrase\":{\"patent_abstract\":\""
    # Deprecated addition for searching patent titles, on top of the search for patent abstracts -> base_t = "{\"_text_phrase\":{\"patent_title\":\""
    close = "\"}}"

    return ",".join([base_a + str(t) + close for t in terms]) # Deprecated addition for searching patent titles, on top of the search for patent abstracts -> + [base_t + str(t) + close for t in terms])

# Converting a list to an acceptable string format
def stringify_list(lst):
    return "[" + ",".join(["\"" + str(e) + "\"" for e in lst]) + "]"

# Start and end date parameters
start_date = "2012-01-01"
end_date = "2012-12-31"

# Number of patents to be shown by the search
limit = 50


# List of what information should be returned by the search
fields = [
            "patent_number",
            "patent_date",
            "patent_abstract",
            "patent_title",
            "cpc_subsection_id",
            "cpc_subsection_title",
            "assignee_organization",
            "assignee_lastknown_country"
         ]


# Build the full query URL; this has to run after the user has filled in the entry boxes
def build_query():

    # Definition of the initial search query
    query = "http://www.patentsview.org/api/patents/query?q={\"_and\":[{\"_gte\":{\"patent_date\":\"" + start_date + "\"}},{\"_lte\":{\"patent_date\":\"" + end_date + "\"}},{\"_or\":[" + format_search_terms(generate_search_terms()) + "]}]}"

    # Appending on the search limit
    query += "&o={\"per_page\":" + str(limit) + "}"

    # Appending on the requested fields for the return data
    query += "&f=" + stringify_list(fields)

    return query

#Function that will execute the search
def execute_search():
    print("Collecting data...")
    # Build the query on click so that the current contents of the entry boxes are used
    r = re.get(build_query())
    print(r.text)
    print("Search complete.")

#Create buttons which will eventually link to the appropriate functions
tk.Button(master, text='Quit', command=master.quit).grid(row=3, column=0)
tk.Button(master, text='Search', command=execute_search).grid(row=3, column=1)    
    
#mainloop for Tkinter to run
master.mainloop()

# # Signal start of query
# print("Collecting data...")

# # Get the query
# r = re.get(query)

# # Print the resultant JSON object
# print(r.text)

# # Signal completion of search
# print("Search complete.")

# # Use the pandas library to read the resultant JSON object into a dataframe [THIS LINE AND BELOW CURRENTLY NON-FUNCTIONAL]
# def flattenjson(b, delim):
#     val = {}
#     for i in b.keys():
#         if isinstance(b[i], dict):
#             get = flattenjson( b[i], delim )
#             for j in get.keys():
#                 val[i + delim + j] = get[j]
#         else:
#             val[i] = b[i]

#     return val

# df = pandas.read_json(r.text)
# flattened = flattenjson(df, "__")

# # (Again) use the pandas library to convert this dataframe object into a CSV
# df.to_csv("2012.csv")