Yang Yu (yy5bm@virginia.edu) DS 5001 Spring 2023

# Introduction

This project explores a corpus of machine learning patents published in the U.S. between 1997 and 2003, with the goal of identifying patents that are both innovative and influential on later work.

The corpus contains 9,399 patents, each with a text description, a publication year, and a CPC classification assigned by the USPTO. The OHCO has two levels: patent and sentence. We cannot parse by paragraph because the text data contains no double line breaks.
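As a rough illustration of the two-level OHCO, the sketch below splits each patent's text into sentences and keys every row by patent and sentence number. The regex splitter and the names `split_sentences` and `build_sentence_rows` are assumptions made for illustration; the project may well have used a different tokenizer.

```python
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def build_sentence_rows(corpus):
    """Return (patent_id, sent_num, sentence) rows, one per OHCO leaf."""
    rows = []
    for patent_id, text in corpus.items():
        for sent_num, sentence in enumerate(split_sentences(text)):
            rows.append((patent_id, sent_num, sentence))
    return rows

# Toy corpus with made-up ids and text.
corpus = {
    "P1": "A neural network is trained. The network classifies images.",
    "P2": "A method for ranking documents is disclosed.",
}
rows = build_sentence_rows(corpus)
```

A splitter this simple will mis-handle abbreviations such as "Fig. 1", so a dedicated sentence tokenizer is likely a better choice on real patent text.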

We follow Kelly et al. (2021) and compute a novel measure of patent importance based on a modified version of TFIDF, which lets us identify the most important patents. In addition, we apply four analyses: PCA, LDA, word2vec, and sentiment analysis.

# Source data

We collect data from three sources. 

First, we download the Artificial Intelligence Patent Dataset (AIPD) in DTA format from the official USPTO website: https://www.uspto.gov/ip-policy/economic-research/research-datasets/artificial-intelligence-patent-dataset. This dataset includes all patents issued between 1976 and 2020 that are associated with one or more AI technology components (machine learning, natural language processing, computer vision, speech, knowledge processing, AI hardware, evolutionary computation, and planning and control). We keep an observation if it is labelled 'machine learning' and was published between 1997 and 2003. The final dataset contains 9,399 observations; for each one, we observe the patent id, publication date, and AI technology category. This dataset lets us identify all patents on machine learning.

Second, we download patent description data for 1997 to 2003 in TSV format from PatentsView, a web-based platform created and maintained by the United States Patent and Trademark Office (USPTO) that provides data on U.S. patents and patent applications: https://patentsview.org/download/draw_desc_text. The dataset has over 1 million observations, each consisting of a patent id and its text description. The mean text length is 3886.

Finally, we download CPC classification data in TSV format from PatentsView: https://patentsview.org/download/data-download-tables. This dataset provides the current CPC classifications of all granted U.S. patents. It consists of 48,473,812 observations, each pairing a patent with a CPC classification.

We take the first dataset as our master dataset, and match to it the description and classification data from the second and third datasets based on the shared patent id. Table 1 shows what our final dataset looks like.

Table 1: LIB

<img src="tab1.png" alt="Alt text">
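The filtering and matching steps described in this section can be sketched as a left join on the patent id. This is a minimal illustration in plain Python, not the project's actual code: the field names `patent_id`, `pub_year`, and `is_ml`, the function `build_lib`, and the toy records are all hypothetical.

```python
def build_lib(aipd_rows, descriptions, cpc_rows, first_year=1997, last_year=2003):
    """Filter the master table, then left-join descriptions and CPC codes."""
    # Collect every CPC code per patent, since one patent can have several.
    cpc_by_patent = {}
    for patent_id, cpc in cpc_rows:
        cpc_by_patent.setdefault(patent_id, []).append(cpc)

    lib = []
    for row in aipd_rows:
        if not row["is_ml"]:
            continue  # keep only patents labelled 'machine learning'
        if not first_year <= row["pub_year"] <= last_year:
            continue  # keep only the 1997-2003 window
        lib.append({
            "patent_id": row["patent_id"],
            "pub_year": row["pub_year"],
            # Left join: the patent is kept even when a match is missing.
            "text": descriptions.get(row["patent_id"]),
            "cpc": cpc_by_patent.get(row["patent_id"], []),
        })
    return lib

# Toy inputs with made-up ids (hypothetical data).
aipd_rows = [
    {"patent_id": "100", "pub_year": 1998, "is_ml": True},
    {"patent_id": "101", "pub_year": 2005, "is_ml": True},   # outside the window
    {"patent_id": "102", "pub_year": 2000, "is_ml": False},  # not machine learning
    {"patent_id": "103", "pub_year": 2001, "is_ml": True},   # no description available
]
descriptions = {"100": "A method for training a classifier."}
cpc_rows = [("100", "G06N"), ("100", "G06F"), ("102", "A61B")]

lib = build_lib(aipd_rows, descriptions, cpc_rows)
```

Using the first dataset as the master means unmatched patents stay in the table with empty fields, which makes missing descriptions easy to spot before any text processing.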

# Box link

Below are the definitions of and links to the source files:

- Data: raw data we used, https://virginia.box.com/s/hu8gh60xpevsrkfq2a0otz1bpyevz0fx

- LIB: library table, https://virginia.box.com/s/1td9sjj2r0wj2lj83yrlzun10561a6zz

- TOKENS: token table, https://virginia.box.com/s/e1cpk5mu5gxoptxdjuvj4hrgniso62x6

- VOCAB: vocab table, https://virginia.box.com/s/bcn9523mqk7ssy4r9l2lqnm8ybs7ucq0

- BOW: bag of words, https://virginia.box.com/s/2e238fvbj426u5h0r0oh5l5ulcn7tu0r

- LOADINGS_sk: loadings in PCA, https://virginia.box.com/s/pzrddfzdvril43vjkzku6l5e73qshmkq

- DCM_sk: DCM table in PCA, https://virginia.box.com/s/dwblbfqm9jwby75ngbkiucac7ygkg1tr

- THETA: theta table in Topic model, https://virginia.box.com/s/scg1u475eydp20wsd6o796qqrovfkfbc

- PHI: phi table in topic model, https://virginia.box.com/s/bbcl1eamnzgsjyzirt32lelwcaubdlnb

- Coord: Terms and embeddings, added to the VOCAB table, https://virginia.box.com/s/7zqtov9rgrrez832vf8z2r1c6907hwk5

- V: Sentiment and emotion values added to VOCAB, https://virginia.box.com/s/0bmfys62sa09rshlmbperjltnk9h3ul0

- EMO_PTS: Sentiment and emotion value for each patent, https://virginia.box.com/s/37ls9cuepl0e2yexgv76glllyp36s6dt

# Data Model
## A new measure of TF-BIDF

The goal of this section is to measure the importance of each patent. Importance has two dimensions: novelty and influence. A patent is novel if it is unlike its predecessors, and influential if it inspires subsequent research. In this spirit, we use backward similarity to measure novelty and forward similarity to measure influence.


Instead of using the standard IDF to compute TFIDF, we follow Kelly et al. (2021) and use Backward IDF (BIDF). The reasoning is that timing matters when evaluating novelty and influence. For example, if a concept such as 'machine learning' was first proposed in 1970 and became widespread in subsequent patents, then the traditional IDF would understate the significance of 'machine learning' in the patent where it first appeared. Therefore, when computing IDF for a patent published in year $t$, we use only the pool of patents published before year $t$, and we call the result BIDF.

Formally, the TF-BIDF of a word $w$ in patent $i$ is computed as follows:

$$TF_{w,i} = \frac{c_{w,i}}{\sum_k c_{k,i}}\rightarrow \text{Frequency of word w in patent i}$$

$$BIDF_{w,i} = \log\left(\frac{\text{# of patents prior to }t}{1+\text{# of patents published two years prior to t that include word w}}\right)$$

$$t\text{: the publication year of patent i}$$

$$TFBIDF_{w,i}=TF_{w,i}\times BIDF_{w,i}$$
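The formulas above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the numerator counts every earlier patent in the corpus, "two years prior to $t$" is read as publication years $t-2$ and $t-1$, and the names `tf`, `bidf`, `tf_bidf`, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf(tokens):
    """Relative frequency of each word within one patent."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def bidf(word, year, corpus, window=2):
    """Backward IDF; corpus is a list of (pub_year, set_of_words) pairs."""
    # Numerator: all patents published before `year`.
    n_prior = sum(1 for y, _ in corpus if y < year)
    if n_prior == 0:
        raise ValueError("no earlier patents, so BIDF is undefined")
    # Denominator: earlier patents inside the window that contain the word.
    n_with_word = sum(
        1 for y, words in corpus
        if year - window <= y < year and word in words
    )
    return math.log(n_prior / (1 + n_with_word))

def tf_bidf(tokens, year, corpus, window=2):
    """TF-BIDF weight for every word of a patent published in `year`."""
    return {w: f * bidf(w, year, corpus, window) for w, f in tf(tokens).items()}

# Toy corpus of earlier patents (made-up data).
corpus = [
    (1999, {"neural", "network"}),
    (1999, {"network", "data"}),
    (2000, {"data", "model"}),
    (2000, {"kernel", "data"}),
]
scores = tf_bidf(["kernel", "kernel", "data", "neural"], 2001, corpus)
```

In this toy case "data" appears in three of the four earlier patents, so its BIDF is $\log(4/4)=0$ and it contributes nothing, while the rarer "kernel" keeps a positive weight. Because of the $1+$ term in the denominator, a word that appears in every earlier patent receives a negative weight.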

## PCA

Because the TFIDF table is large, we first perform PCA and select a smaller set of components that explains 80% of the variance before computing patent importance. Based on Figure 1, we keep the top 2000 components.

Figure 1: Cumulative explained variance by number of components

<img src="fig1.png" alt="Alt text" width="600" height="400">
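Reading the cutoff off a cumulative-variance curve amounts to finding the smallest number of components whose explained-variance ratios sum to the target. The helper below is a hypothetical illustration of that rule; the name `n_components_for` and the example ratios are made up and are not taken from the project's data.

```python
from itertools import accumulate

def n_components_for(explained_variance_ratio, target=0.80):
    """Smallest k whose cumulative explained variance reaches `target`."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= target:
            return k
    # The target is never reached, so every component is needed.
    return len(explained_variance_ratio)

# Made-up ratios, sorted from the largest component to the smallest.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
k = n_components_for(ratios)
```

If the PCA is fitted with scikit-learn, the fitted model's `explained_variance_ratio_` attribute supplies a list of this kind.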

## Measure of importance

Since we use a two-year window to compute TF-BIDF, we drop patents published in 1997 and 1998, because the corpus does not contain two full prior years for them and their BIDF cannot be computed properly. This leaves patents from 1999 to 2003. We again use a two-year window to compute backward and forward similarity, which means we can compute backward similarity only for patents published in 2001 or later and forward similarity only for patents published in 2001 or earlier. As a result, the importance measure, which requires both backward and forward similarity, can be computed only for patents published in 2001.

We use 'cosine' distance to measure the similarity between two patents. Formally, for patent $i$, its backward similarity (BS) and forward similarity (FS) are defined as follows:

$$BS_i=\sum_{j\in B(i)} \text{cosine distance}(i,j)$$

$$FS_i=\sum_{j\in F(i)} \text{cosine distance}(i,j)$$

$B(i)$ and $F(i)$ are the sets of patents issued within the two years before and the two years after the year in which patent $i$ was published, respectively. The importance of patent $i$, $\rho_i$, is formally defined as

$$\rho_i=\frac{FS_i}{BS_i}$$
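The definitions of $BS_i$, $FS_i$, and $\rho_i$ can be sketched as below. Several choices here are assumptions rather than facts about the project: the pairwise score is cosine similarity (swap in a distance function if that is what is intended), patents from the same year as patent $i$ belong to neither $B(i)$ nor $F(i)$, and the vectors stand in for the PCA-reduced TF-BIDF rows. The names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def importance(i, patents, window=2):
    """Return (BS, FS, rho) for patent i; patents maps id -> (year, vector)."""
    year_i, vec_i = patents[i]
    bs = fs = 0.0
    for j, (year_j, vec_j) in patents.items():
        if j == i:
            continue
        sim = cosine_similarity(vec_i, vec_j)
        if year_i - window <= year_j < year_i:
            bs += sim  # j is in B(i): published in the two prior years
        elif year_i < year_j <= year_i + window:
            fs += sim  # j is in F(i): published in the two following years
    rho = fs / bs if bs else float("nan")
    return bs, fs, rho

# Toy patents with made-up two-dimensional vectors.
patents = {
    "A": (1999, [1.0, 0.0]),
    "B": (2000, [1.0, 1.0]),
    "C": (2001, [0.0, 1.0]),
    "D": (2002, [0.0, 1.0]),
    "E": (2003, [1.0, 1.0]),
}
bs, fs, rho = importance("C", patents)
```

In this toy case patent C shares little with its predecessors and a lot with its successors, so its ratio is above one.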

Figures 2, 3, and 4 show the distributions of BS, FS, and $\rho$. The key message from Figure 4 is that patent importance has a fat left tail and a long right tail, meaning that most patents are not important, but a few are highly valuable.

Figure 2: Backward similarity

<img src="BS.png" alt="Alt text" width="600" height="400">

Figure 3: Forward similarity

<img src="FS.png" alt="Alt text" width="600" height="400">

Figure 4: Importance

<img src="importance.png" alt="Alt text" width="600" height="400">

## Topic model

When estimating the LDA topic model, we keep only nouns and set the maximum number of features to 4000 and the number of topics to 9, which is consistent with the number of CPC classifications. We run LDA only on patents published in 1997 because of limited computing capacity.

The $\theta$ and $\phi$ tables are shown in Tables 2 and 3. The most important words in each topic are shown in Table 4. The topic similarity is shown in Figure 5.

The key insight from Table 4 is that machine learning technology is applied in different fields. T4 is apparently associated with biology and the pharmaceutical industry, and T8 is probably related to the brain and neuroscience. Figure 5 shows how similar the topics are to one another.

Table 2: $\theta$

<img src="THETA.png" alt="Alt text">

Table 3: $\phi$

<img src="PHI.png" alt="Alt text">

Table 4: Topic components

<img src="topic.png" alt="Alt text">

Figure 5: Topic similarity

<img src="hac.png" alt="Alt text">

## Word embedding

For word embedding, we use machine learning patents published in 1997.

The parameters chosen for word embedding are: window = 5, vector_size = 246, min_count = 50, workers = 4. Figure 6 shows a cluster of words that are close to each other. This cluster mainly contains biology terms, including cells, stem, and antibody, which tend to appear together.

Figure 6: Word embedding

<img src="we.png" alt="Alt text">

## Sentiment analysis

We use machine learning patents published in 1997 and compute the standard TFIDF, on which we base the sentiment analysis. We acknowledge that patent text is generally neutral, so the results of sentiment analysis are unlikely to be meaningful. We perform it for the purpose of this class.

Figure 7 shows the distribution of emotion values across patents. Most patents are emotionally neutral, but a few have low emotion values. One explanation is that patents are usually about solving a problem, especially patents in pharma and biology, where disease-related words such as 'chronic' and 'diagnosis' are common. Given this, we would expect some patents to receive low emotion scores mechanically.

Figure 7: Emotion value distribution

<img src="sa.png" alt="Alt text" width="600" height="400">
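One plausible way to turn word-level sentiment into a patent-level value is a TFIDF-weighted average of lexicon scores, sketched below. This is an assumption about the aggregation, not a description of the project's code; the function `patent_emotion`, the three-word lexicon, and the weights are all made up for illustration.

```python
def patent_emotion(tfidf_weights, lexicon):
    """TFIDF-weighted mean polarity over the words found in the lexicon."""
    weighted_sum = 0.0
    total_weight = 0.0
    for word, weight in tfidf_weights.items():
        if word in lexicon:
            weighted_sum += weight * lexicon[word]
            total_weight += weight
    # A patent with no lexicon words is treated as neutral.
    return weighted_sum / total_weight if total_weight else 0.0

# Hypothetical polarity lexicon and TFIDF weights for a single patent.
lexicon = {"chronic": -1.0, "improve": 1.0, "disease": -1.0}
weights = {"chronic": 0.5, "improve": 0.25, "disease": 0.25, "method": 0.3}
score = patent_emotion(weights, lexicon)
```

Under this scheme a patent dominated by disease vocabulary scores below zero even though its tone is purely technical, which matches the mechanical effect discussed above.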