# Hypothesis

### Can an algorithm save the USPTO money by more accurately classifying a patent application than the human contractor that the USPTO pays? (target US classification and IPC classification and assign it to the right art unit)

Years ago the classification of newly filed US patent applications was done by USPTO employees. But at some point a decision got made within the USPTO to outsource the classification process for new patent applications. Since then, the classification has been carried out by a government contractor. The contractor often misclassifies cases; in our recent experience this happens about 10% of the time. When it happens, the case has to be transferred from one art unit to another. Normally this only delays things by a week or two. I am not sure what exactly went wrong with all seven of SPE V’s efforts to kick the case. But as of today he has not managed to get rid of the case despite four months of trying. [source](http://www.ipwatchdog.com/2014/03/11/when-uspto-classifies-an-application-incorrectly/id=48457/)

# Import data
Bulk patent data from the USPTO is available in JSON or XML format. I chose JSON for no particular reason.
- To import the JSON data with pandas I used [pandas.read_json()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html)
- To download bulk patent data I used the [USPTO bulk download site](https://pairbulkdata.uspto.gov/)
    - I limited the files to utility patents filed between 1/1/1995 and 9/6/2016.
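
The import and filtering step can be sketched roughly as follows. This is a minimal illustration, not the actual import script: the records are made up, and the flat list-of-objects layout is an assumption (the real bulk files may nest records differently). Only the field names `applId`, `appType`, and `appFilingDate` come from the API field list in this document.

```python
import io

import pandas as pd

# Hypothetical records: field names come from the API field list in this
# document, but the values and the flat layout are made up for illustration.
sample_json = """
[
  {"applId": "10000001", "appType": "Utility", "appFilingDate": "1994-12-30"},
  {"applId": "10000002", "appType": "Utility", "appFilingDate": "2003-05-14"},
  {"applId": "10000003", "appType": "Design",  "appFilingDate": "2010-02-01"},
  {"applId": "10000004", "appType": "Utility", "appFilingDate": "2016-09-06"}
]
"""

# dtype=False turns off type inference so application numbers stay strings.
df = pd.read_json(io.StringIO(sample_json), dtype=False, convert_dates=False)

# Parse the filing date explicitly, then keep utility applications filed
# between 1/1/1995 and 9/6/2016 (both ends inclusive).
df["appFilingDate"] = pd.to_datetime(df["appFilingDate"])
in_window = df["appFilingDate"].between(
    pd.Timestamp("1995-01-01"), pd.Timestamp("2016-09-06")
)
utility = df[(df["appType"] == "Utility") & in_window]

print(utility["applId"].tolist())
```

For a real download, the `io.StringIO(...)` argument would be replaced with the path to the downloaded JSON file.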

###### Search Fields of bulk data download
What search criteria are available?
1.	Application Number - The application number is a number given to a patent application when it is filed. The number contains a two digit series code followed by a six digit serial number assigned by the USPTO (Example: 99999999 or 99/999999).
2.	Patent Number - A patent number is a number given to the patent and can fall into one of the following categories:
    - Utility: Contains six or seven digits (Example: 8000000). Enter the number excluding commas and spaces and omit leading zeroes
    - Reissue: Enter leading zeroes between "RE" and number to create 6 digits (Example: Re999999)
    - Plant Patents: Enter leading zeroes between "PP" and number to create 6 digits (Example: PP999999)
    - Design: Enter leading zeroes between "D" and number to create 7 digits (Example: D9999999)
    - Additions of Improvements: Enter leading zeroes between "AI" and number to create 6 digits (Example: AI999999)
    - X Patents: Enter leading zeroes between "X" and number to create 7 digits (Example: X9999999)
    - H Documents: Enter leading zeroes between "H" and number to create 7 digits (Example: H9999999)
    - T Documents: Enter leading zeroes between "T" and number to create 7 digits (Example: T9999999)
3.	PCT Number - Can be entered as 14 character format; contains a two-digit year and five-character sequence number, Example: 'PCT/US99/12345'.
    - PCT/CCYY/99999, where
    - PCT = "PCT" (Patent Cooperation Treaty)
    - CC = 2 character Country Code
    - YY = last 2 digits of the year filed
    - 99999 = the 5 digit sequence number
4.	Filing or 371 (c) Date - The date that an application includes (1) a specification containing a description and, if the application is a non-provisional application, at least one claim, and (2) any required drawings
5.	Application Type - types of patent documents issued by USPTO covering different types of subject matter and offering different kinds of protection
    - Utility: Patents issued for the invention of a new and useful process, machine, manufacture, or composition of matter, or a new and useful improvement thereof. A utility patent generally permits its owner to exclude others from making, using, or selling the invention for a period of up to twenty years from the date of patent application filing, subject to the payment of maintenance fees
    - Design: Patents issued for a new, original, and ornamental design embodied in or applied to an article of manufacture. A design patent permits its owner to exclude others from making, using, or selling the design for a period of fourteen years from the date of patent grant. Design patents are not subject to the payment of maintenance fees
    - Plant: Patents issued for a new and distinct, invented or discovered asexually reproduced plant including cultivated sports, mutants, hybrids, and newly found seedlings, other than a tuber propagated plant or a plant found in an uncultivated state. A plant patent permits its owner to exclude others from making, using, or selling the plant for a period of up to twenty years from the date of patent application filing. Plant patents are not subject to the payment of maintenance fees
    - Re-examination includes the following categories:
        - Re-examination: Patents for which a reexamination request has been filed. Ex parte reexaminations have a control number of the form 90/000,000; inter partes reexaminations have the form 95/000,000.
        - Supplemental: Patents for which a request for supplemental examination is submitted are categorized as re-examinations.
    - Re-issue: Patents issued to correct an error in an already issued utility, design, or plant patent. A reissue does not affect the period of protection offered by the original patent. However, the scope of patent protection can change as a result of the reissue patent
    - Provisional: A provisional patent application allows you to file without a formal patent claim, oath or declaration, or any information disclosure (prior art) statement. A provisional application is automatically abandoned 12 months after its filing date and is not examined
    - PCT: Applications under the Patent Cooperation Treaty (PCT) are those that can, at a later date, lead to the grant of a patent in any of the states contracting to the PCT.
6.	Examiner Name - USPTO assigned examiners who review patent applications and accept or reject the application. The examiners are also in charge of classifying approved patents into the appropriate classes/subclasses.
7.	Group Art Unit - a working unit responsible for a cluster of related patent art. Staffed by one or more supervisory patent examiners (SPE) and a number of patent examiners who determine patentability on applications for a patent. Group Art Units are identified by a four digit number, Example: 1642.
8.	Confirmation Number - a four-digit number that is assigned to each newly filed patent application. The confirmation number, in combination with the application number, is used to verify the accuracy of the application number placed on correspondence filed with the Office to avoid misidentification of an application due to a transposition error (misplaced digits) in the application number. The Office recommends that applicants include the application's confirmation number (in addition to the application number) on all correspondence submitted to the Office concerning the application.
9.	Attorney Docket Number - An Attorney Docket Number is a number of up to 25 alphanumeric characters that is used to identify the attorney or its representative party who has filed a patent application. This number is not assigned by the USPTO and can be any combination of numbers and letters. A list of applications can be retrieved by a complete or partial attorney docket number.

10.	**Class** - A category, organized by subject matter, into which patents are classified
11.	**Subclass** - A category, organized by subject matter, into which patents are classified. Subclasses fall below classes in the patent organization hierarchy

12.	First Named Inventor - the Primary inventor name listed on a patent.
13.	First Named Applicant - the Applicant's name listed on a patent application.
14.	Entity Status - An applicant may qualify for small entity or micro entity status for fee purposes. See www.uspto.gov/patents-application-process/applying-online/entity-status-fee-purposes for more details.
15.	AIA (First Inventor to File) - whether the patent or patent application was filed under the America Invents Act. See www.uspto.gov/patent/laws-and-regulations/america-invents-act-aia/america-invents-act-aia-frequently-asked for more details.
16.	Correspondence Address Customer Number - A unique number assigned by the USPTO to an applicant.
17.	**Status** - The state of a patent application. Examples include "Patented Case", "Pending" or "Abandoned"
    - **"Sent to Classification contractor"** total in download starting from 1/1/2007 = 5412 applications
    
18.	Status Date - The date of the status.
19.	Location - A location is the current site of the official file. When the term "Electronic" appears as the "location" of an application or patent, the official file is an electronic image file as described in the Official Gazette Notice 1271 OG 100, published June 17, 2003.
20.	Location Date
21.	Earliest Publication Number (PGPUB) - Number given to a patent application when it is published. The number contains a four-digit year, followed by a seven-digit sequence code followed by a two-character Kind Code that is assigned by the USPTO. Example: 20140012712A1.
22.	Earliest Publication Date - Date of publishing domestic/US applications
23.	Issue Date of Patent
24.	International Registration Publication Date
25.	International Filing Date - An international filing date is accorded to the earliest date on which the requirements under PCT Article 11 (1) were satisfied
26.	Control Number - Made up of a two-digit series code followed by a six-digit serial number; assigned by the USPTO. Ex parte reexaminations have a control number of the form 90/000,000; inter partes reexaminations have the form 95/000,000
27.	International Registration Number - Is a six digit number preceded with "DM/" assigned by the International Bureau (IB) of the World Intellectual Property Organization (WIPO). (Example: DM/999999)
28.	WIPO Publication number - WIPO (World Intellectual Property Organization) Publication is a publication of an International Application (IA) under PCT Article 21(2) (e.g., Publication No. WO 1999/12345). Source of information: http://www.wipo.int/pct/guide/en/gdvol1/annexes/annexk/ax_k.pdf. WIPO number format includes:
    - WO - The code "WO" is used in relation to the international publication under the Patent Cooperation Treaty (PCT) of international applications filed with any PCT receiving Office, as well as in the publication of international deposits of industrial designs under the Hague Agreement Concerning the International Deposit of Industrial Designs
    - YYYY - year
    - XXXXX - all numeric serial number
29.	WIPO Publication Date - Date of publishing International Application
30.	Title of Invention - The title of the invention appears as the heading on the first page of the specification
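
Several of the number formats described above are regular enough to check with a few patterns. The sketch below is only an illustration of those descriptions: the helper names (`normalize_application_number`, `parse_pct_number`, `parse_pgpub_number`) are hypothetical, and the patterns assume the formats are exactly as stated in the list (for example, that a kind code is two alphanumeric characters).

```python
import re


def normalize_application_number(raw: str) -> str:
    """Return the 8-digit form: two-digit series code + six-digit serial number.

    Accepts '99999999', '99/999999', and '99/999,999'.
    """
    match = re.fullmatch(r"(\d{2})/?(\d{3}),?(\d{3})", raw.strip())
    if match is None:
        raise ValueError(f"not an application number: {raw!r}")
    return "".join(match.groups())


def parse_pct_number(raw: str) -> dict:
    """Split 'PCT/CCYY/99999' into country code, two-digit year, and sequence."""
    match = re.fullmatch(r"PCT/([A-Z]{2})(\d{2})/(\d{5})", raw.strip())
    if match is None:
        raise ValueError(f"not a PCT number: {raw!r}")
    country, year, sequence = match.groups()
    return {"country": country, "year": year, "sequence": sequence}


def parse_pgpub_number(raw: str) -> tuple:
    """Split a PGPUB number into (four-digit year, seven-digit sequence, kind code)."""
    match = re.fullmatch(r"(\d{4})(\d{7})([A-Z0-9]{2})", raw.strip())
    if match is None:
        raise ValueError(f"not a PGPUB number: {raw!r}")
    return match.groups()


print(normalize_application_number("12/345,678"))
print(parse_pct_number("PCT/US99/12345"))
print(parse_pgpub_number("20140012712A1"))
```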

**Below is the list of all available fields that can be queried by API users:**
- patentNumber
- patentTitle
- applId
- appFilingDate
- appType
- appExamLastName, appExamFirstName, appExamMdlName, appExamName
- appConfrNumber
- appAttrDockNumber
- **appCls, appSubCls, appClsSubCls**
- primaryInventor, primaryInventorFirstName, primaryInventorMiddleName, primaryInventorLastName, primaryInventorCity, primaryInventorRegion, primaryInventorCountry
- rankAndInventorsList_str
- appGrpArtNumber
- appCustNumber
- **appStatus**
- appStatusDate
- appEntityStatus
- appLocation
- appLocationDate
- appEarlyPubNumber
- appEarlyPubDate
- patentIssueDate
- firstInventorFile
- appPCTNumber
- appIntlPubNumber
- appIntlPubDate
- wipoEarlyPubNumber
- wipoEarlyPubDate
- firstnamedapplicant
- ptaPteType
- patentTermJson
- publishDocJson
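
The bolded fields are the ones most relevant to the hypothesis, since they carry the class, subclass, and status of each application. As a rough sketch of how they might be turned into training labels, the example below uses a small made-up DataFrame. The values, the `clsSubCls` column name, and the choice to drop rows with missing labels are all assumptions for illustration, not the actual pipeline.

```python
import pandas as pd

# Hypothetical rows using field names from the list above; all values are made up.
records = pd.DataFrame(
    {
        "applId": ["10000001", "10000002", "10000003", "10000004"],
        "appCls": ["705", "705", "424", None],
        "appSubCls": ["014000", "035000", "085100", None],
        "appGrpArtNumber": ["3622", "3691", "1642", None],
        "appStatus": [
            "Patented Case",
            "Abandoned",
            "Patented Case",
            "Sent to Classification contractor",
        ],
    }
)

# Applications still waiting on the contractor have no usable label yet.
unclassified = records["appStatus"].eq("Sent to Classification contractor")
print("awaiting classification:", int(unclassified.sum()))

# Keep rows that have a class, subclass, and art unit to learn from.
labeled = (
    records.loc[~unclassified]
    .dropna(subset=["appCls", "appSubCls", "appGrpArtNumber"])
    .copy()
)

# Combined class/subclass label (hypothetical column name).
labeled["clsSubCls"] = labeled["appCls"] + "/" + labeled["appSubCls"]

# How many distinct art units handle each class in this sample.
units_per_class = labeled.groupby("appCls")["appGrpArtNumber"].nunique()
print(units_per_class.to_dict())
```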