This page hosts the current iteration of BC's internal patent scraper: a Python wrapper for the USPTO's PatentsView API that makes it easier to query the full patent database. The tool could still be made more intuitive and more powerful. The adjustments listed below would significantly improve its usability:

---

1.   <s>Creating some UI that allows users to simply enter search terms into a box, as opposed to editing the list `` terms `` in the `` generate_search_terms `` function (and likewise for the search terms to be permuted, which are also accessible under the `` generate_search_terms `` function)</s> S.B. 30-06-18 <s>(Although permutation must now be fixed, as it no longer works properly because of how Tk.get() outputs)</s> S.B. 30-06-18
2.   Adding a drop-down selectable list of types of searches (currently, the only one included is `` _text_phrase ``, as seen in `` format_search_terms ``) - other possible search types are listed on [this webpage](http://www.patentsview.org/api/query-language.html)
3.   <s>Adding two input boxes for start- and end-dates of the search (currently, this is adjusted through the `` start_date `` and `` end_date `` parameters underneath the `` stringify_list `` function)</s> S.B. 30-06-18
4.   <s>Adding a better input method for the number of patents that should be returned (currently, this is adjusted through the `` limit `` parameter underneath `` end_date ``)</s> S.B. 30-06-18
5.   Adding a better input method for what types of data should be returned from the search (currently, this is adjusted through the `` fields `` list underneath `` limit ``, and all of the other options for what type of data can be output are listed in the "7 API Endpoints" drop-down in the menu of [this webpage](http://www.patentsview.org/api/doc.html))
6.   <s>The wrapper currently outputs a JSON object, which yields something easily legible when put through a JSON reader (like [this one](https://jsoneditoronline.org)) - however, it'd be ideal to have this wrapper output a CSV (after flattening the JSON object); an unsuccessful attempt has been made at this in the current iteration of the wrapper, and is commented out at the bottom for anyone who'd like to play around with it and attempt to get it working properly</s> A.B. 04-18
7.   Build in some method of combining multiple calls on the query API to allow for more than 90 search terms to be inputted, possibly of multiple types (the limit can be seen, written explicitly, in the `` generate_search_terms `` function)
8.   Build in some method of specifying an output filepath and a generator for descriptive filenames.
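
Item 7 could be approached by splitting the term list into batches of at most 90, issuing one query per batch, and merging the results while dropping duplicate patents. The following is a minimal sketch, assuming each response is a dict with a `patents` list whose entries carry a `patent_number` (as in the sample output further down this page). The helper names `chunk_terms` and `merge_patents` are hypothetical and are not part of the wrapper:

```python
MAX_TERMS_PER_QUERY = 90  # the limit asserted in generate_search_terms


def chunk_terms(terms, size=MAX_TERMS_PER_QUERY):
    """Split a list of search terms into consecutive batches of at most `size`."""
    return [terms[i:i + size] for i in range(0, len(terms), size)]


def merge_patents(responses):
    """Combine the 'patents' lists of several responses, keeping the first copy of each patent_number."""
    seen = set()
    merged = []
    for response in responses:
        # Assumption: a response with no hits may carry None (or no key) instead of a list
        for patent in response.get("patents") or []:
            number = patent.get("patent_number")
            if number not in seen:
                seen.add(number)
                merged.append(patent)
    return merged


batches = chunk_terms(["term %d" % i for i in range(200)])
print([len(b) for b in batches])  # [90, 90, 20]
```

Each batch would then be run through the existing query logic, and the parsed responses handed to `merge_patents` before flattening to CSV.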

---

If you land on this page, feel free to copy the code and experiment with it, making any of the above adjustments, provided you're willing to own the changes. When updating this page, please preserve the list above and add any feature requests that you feel would be a worthwhile improvement to the tool. If you implement any of the items, feel free to add the implementation to the code below, making sure that everything continues to work properly. Please strike through the implemented item in the list without removing it, and add your name and the date of the change, for the sake of tracking changes.

---

The majority of the wrapper was written by Anish Balaji, with some help from Sam Braverman.

---

<b>Instructions for Use:</b>
1.   Clone/download this repository to your local machine, and navigate to the working directory for it. Open the "USPTO scraper.ipynb" file with Anaconda Navigator or similar.
2.   Run the cell below, either by the options in the bar at the top of the page or by pressing `` Shift + Enter/Return ``
3.   Input your desired individual search terms as comma-separated values, without quotes. The "processes or objects" and "types of processes or objects" fields must have their input in the format `` ["term_1", "term_2", ..., "term_n"] ``. Apologies for that, but it was a surprisingly large hassle to find another way to get permutation working. Dates should be in the format MM-DD-YY (sorry). Search terms in "processes or objects" and "types of processes or objects" will be permuted with each other. The "Display limit" field limits how many patents are returned by the search.
4.   Click search.
5.   The results will be printed in the output of the cell, and will be saved to the file "patents.csv" in the working directory of the repository.
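
To make step 3 concrete, the sketch below mirrors the permutation in `generate_search_terms` as a standalone function, so the generated phrases can be inspected without opening the UI. It also splits the individual terms on commas, as the instructions describe. The function name `permute_terms` and the sample inputs are illustrative assumptions, not part of the wrapper:

```python
import ast


def permute_terms(individual, processes, types):
    """Return the individual terms plus every process/type pairing, in both word orders."""
    # Individual terms arrive as one comma-separated string
    terms = [t.strip() for t in individual.split(",") if t.strip()]
    # The two permutation fields arrive as Python list literals
    process_list = ast.literal_eval(processes)
    type_list = ast.literal_eval(types)
    for t in type_list:
        for p in process_list:
            terms.append(t + " " + p)
            terms.append(p + " of " + t)
    return terms


example = permute_terms(
    "carbon capture, direct air capture",  # "Individual search terms" field
    '["capture", "adsorbent"]',            # "Processes or objects" field
    '["CO2"]',                             # "Type of process or object" field
)
print(example)
# ['carbon capture', 'direct air capture', 'CO2 capture', 'capture of CO2',
#  'CO2 adsorbent', 'adsorbent of CO2']
```

Two processes and one type therefore add four phrases to the two individual terms, which is worth keeping in mind given the 90-term limit.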

In [None]:
import tkinter as tk
import requests as re
import json
import csv
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import ast

class USPTO_scraper(tk.Tk):
    def __init__(self):
        # tKinter initialization
        tk.Tk.__init__(self)
        # Create the entries
        self.terms = tk.Entry(self)
        self.process = tk.Entry(self)
        self.type = tk.Entry(self)
        self.start_date = tk.Entry(self)
        self.end_date = tk.Entry(self)
        self.display_limit = tk.Entry(self)
        # Create the buttons
        self.search_button = tk.Button(self, text="Search", command=self.execute_search)
        self.exit_button = tk.Button(self, text="Quit", command=self.quit)
        # Position the entries
        self.terms.grid(row=0, column=1)
        self.process.grid(row=1, column=1)
        self.type.grid(row=2, column=1)
        self.start_date.grid(row=3, column=1)
        self.end_date.grid(row=4, column=1)
        self.display_limit.grid(row=5, column=1)
        # Position the buttons
        self.search_button.grid(row=6, column=1)
        self.exit_button.grid(row=6, column=0)
        # Label the entries
        tk.Label(self, text="Individual search terms:").grid(row=0)
        tk.Label(self, text="Processes or objects (e.g. X of CO2, where X is inputted here):").grid(row=1)
        tk.Label(self, text="Type of process or object (e.g. adsorbent of X, where X is inputted here):").grid(row=2)
        tk.Label(self, text="Start date:").grid(row=3)
        tk.Label(self, text="End date:").grid(row=4)
        tk.Label(self, text="Display limit:").grid(row=5)
        # UI-locked parameters
        self.fields = [
                        "patent_number",
                        "patent_date",
                        "patent_abstract",
                        "patent_title",
                        "cpc_subsection_id",
                        "cpc_subsection_title",
                        "assignee_organization",
                        "assignee_lastknown_country"
                    ]

    def generate_search_terms(self):

        # Grab individual search terms (comma-separated in the UI), dropping empty entries
        terms = [t.strip() for t in self.terms.get().split(",") if t.strip()]

        # Grab terms for permutation; each field holds a Python list literal,
        # and a blank field is treated as an empty list
        process_or_object = ast.literal_eval(self.process.get().strip() or "[]")
        type_of_process_or_object = ast.literal_eval(self.type.get().strip() or "[]")

        # Permutation
        for t in type_of_process_or_object:
            for p in process_or_object:
                terms.append(t + " " + p)
                terms.append(p + " of " + t)

        # The PatentsView API will only accept searches with <= 90 search terms
        assert len(terms) <= 90, "Too many search terms inputted."

        return terms
    
    def format_search_terms(self):
        # Format each search term as a `_text_phrase` clause on the abstract, ready to be
        # appended onto the query URL; json.dumps handles the quoting and escaping, and the
        # compact separators keep the output free of spaces
        return ",".join(
            json.dumps({"_text_phrase": {"patent_abstract": str(t)}}, separators=(",", ":"))
            for t in self.generate_search_terms()
        )

    def stringify_fields(self):
        # Convert the list of fields to a query-acceptable format
        return json.dumps(self.fields, separators=(",", ":"))

    def execute_search(self):
        # Definition of the initial search query
        query = "http://www.patentsview.org/api/patents/query?q={\"_and\":[{\"_gte\":{\"patent_date\":\"" + self.start_date.get() + "\"}},{\"_lte\":{\"patent_date\":\"" + self.end_date.get() + "\"}},{\"_or\":[" + self.format_search_terms() + "]}]}"

        # Appending on the search limit
        query += "&o={\"per_page\":" + str(self.display_limit.get()) + "}"

        # Appending on the requested fields for the return data
        query += "&f=" + self.stringify_fields()

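        # Execute the search through requests (imported above under the alias `re`)
        print("Collecting data...")
        response = re.get(query)
        response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
        print(response.text)
        output = json.loads(response.text)
        # Keep the parsed response on the instance so it can be reused after the window closes
        self.output = output
        # Assumption: "patents" may be missing or null when nothing matches
        patents = output.get("patents") or []
        patent_df = json_normalize(patents)
        patent_df.to_csv("patents.csv")
        print("CSV saved.")
        print("Search complete.")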

app = USPTO_scraper()
app.mainloop()

Collecting data...
{"patents":[{"patent_number":"8124049","patent_date":"2012-02-28","patent_abstract":"A high thermal efficiency process for hydrogen recovery is provided. The present invention includes combusting a first fuel stream to a reforming furnace, producing reforming heat and a hot exhaust stream. Then exchanging heat indirectly between the hot exhaust stream and a first feed water stream, producing a first steam stream. Then providing a hydrocarbon containing stream and a feed steam stream to the reforming furnace, utilizing the reforming heat and producing a hot raw syngas stream. Then exchanging heat indirectly between the hot raw syngas stream and second feedwater stream, producing a second steam stream and a cooled, raw syngas stream. Then introducing the cooled, raw syngas stream to a CO shift converter, producing a shifted syngas stream. Then introducing the shifted syngas stream into a pressure swing adsorption unit, producing a hydrogen product stream and a tail gas
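
The query string assembled by hand in `execute_search` can also be built from plain Python data structures and serialised with `json.dumps`, which takes care of quoting and escaping. The following is a minimal sketch, assuming the same `q`, `f` and `o` parameters the cell above uses. `build_query_params` is a hypothetical helper, the sample values are placeholders, and no request is sent:

```python
import json


def build_query_params(terms, start_date, end_date, fields, limit):
    """Return the q/f/o parameters as JSON strings."""
    query = {
        "_and": [
            {"_gte": {"patent_date": start_date}},
            {"_lte": {"patent_date": end_date}},
            {"_or": [{"_text_phrase": {"patent_abstract": t}} for t in terms]},
        ]
    }
    return {
        "q": json.dumps(query),
        "f": json.dumps(fields),
        "o": json.dumps({"per_page": int(limit)}),  # Entry.get() returns a string
    }


params = build_query_params(
    terms=["CO2 capture", "capture of CO2"],
    start_date="2010-01-01",  # placeholder dates; the accepted format is an assumption
    end_date="2018-06-30",
    fields=["patent_number", "patent_title"],
    limit="25",
)
print(params["q"])
```

The resulting dict could be passed as `params=` to `requests.get` together with the query URL, in which case `requests` handles the URL encoding.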

In [None]:
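# TO SAVE AS A JSON RUN THIS CELL
# `output` is local to execute_search, so look for a copy kept on the app object instead
output = getattr(app, "output", None)
if output is None:
    print("No search results available; run a search first.")
else:
    with open('patents.txt', 'w') as outfile:
        json.dump(output, outfile)
    print("SAVED JSON as TXT FILE")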
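
Item 8 in the feature list near the top of this page (an output filepath and descriptive filenames) could start from something like the following sketch. The naming scheme, the `build_output_path` helper, and its parameters are all assumptions made for illustration:

```python
from pathlib import Path


def build_output_path(directory, terms, start_date, end_date, extension="csv"):
    """Build a path such as <directory>/patents_<first-term>_<start>_to_<end>.<extension>."""
    if terms:
        # Use the first search term as a short label, reduced to filesystem-safe characters
        label = "".join(c if c.isalnum() else "-" for c in terms[0]).strip("-").lower()
    else:
        label = "all"
    name = "patents_{}_{}_to_{}.{}".format(label, start_date, end_date, extension)
    return Path(directory) / name


path = build_output_path("results", ["CO2 capture"], "01-01-10", "06-30-18")
print(path.name)  # patents_co2-capture_01-01-10_to_06-30-18.csv
```

The returned path could replace the hard-coded `"patents.csv"` and `'patents.txt'` filenames, provided the target directory already exists.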