**Topic Modeling Patent Portfolios:** Identifying Technology Clusters with Latent Dirichlet Allocation

**Contact:**  
Author:  Tyler Seymour  
Email:   <tylerseymour@protonmail.com>   
Website: <https://tylerseymour.pw> 

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#About" data-toc-modified-id="About-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>About</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Hide-Warnings" data-toc-modified-id="Hide-Warnings-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Hide Warnings</a></span></li></ul></li><li><span><a href="#Process-Text" data-toc-modified-id="Process-Text-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Process Text</a></span><ul class="toc-item"><li><span><a href="#Ingest-Abstracts" data-toc-modified-id="Ingest-Abstracts-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Ingest Abstracts</a></span></li><li><span><a href="#Part-of-Speech-Tagging" data-toc-modified-id="Part-of-Speech-Tagging-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Part-of-Speech Tagging</a></span></li><li><span><a href="#Stopwords" data-toc-modified-id="Stopwords-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Stopwords</a></span></li><li><span><a href="#Tokenize-Abstracts" data-toc-modified-id="Tokenize-Abstracts-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Tokenize Abstracts</a></span></li></ul></li><li><span><a href="#Calculate-Word-Statistics" data-toc-modified-id="Calculate-Word-Statistics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Calculate Word Statistics</a></span><ul class="toc-item"><li><span><a href="#Count-Word-Frequencies" data-toc-modified-id="Count-Word-Frequencies-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Count Word Frequencies</a></span></li><li><span><a href="#Find-Frequent-Words" data-toc-modified-id="Find-Frequent-Words-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Find Frequent 
Words</a></span></li><li><span><a href="#Dictionary-of-Frequent-Words" data-toc-modified-id="Dictionary-of-Frequent-Words-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Dictionary of Frequent Words</a></span></li><li><span><a href="#Bag-of-Words" data-toc-modified-id="Bag-of-Words-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Bag-of-Words</a></span></li><li><span><a href="#TF-IDF-Model" data-toc-modified-id="TF-IDF-Model-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>TF-IDF Model</a></span></li><li><span><a href="#Associate-Token--->-ID" data-toc-modified-id="Associate-Token--->-ID-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Associate Token --&gt; ID</a></span></li><li><span><a href="#Calculate-TF-IDF-Values" data-toc-modified-id="Calculate-TF-IDF-Values-3.7"><span class="toc-item-num">3.7&nbsp;&nbsp;</span>Calculate TF-IDF Values</a></span></li></ul></li><li><span><a href="#Visualizations" data-toc-modified-id="Visualizations-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Visualizations</a></span><ul class="toc-item"><li><span><a href="#Latent-Dirichlet-Allocation" data-toc-modified-id="Latent-Dirichlet-Allocation-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Latent Dirichlet Allocation</a></span></li><li><span><a href="#Export-HTML-Visualization" data-toc-modified-id="Export-HTML-Visualization-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Export HTML Visualization</a></span></li></ul></li></ul></div>

# Introduction

## About

This notebook uses natural language processing techniques to model topics in a patent portfolio. The model is useful for breaking a large group of patents into natural technology groupings. It then generates an interactive visualization that can be exported to HTML and used to further explore the contents of the portfolio. In this example, 1000 patent abstracts in US Class 705, which covers software and business method patents, are analyzed.

The workhorse in this analysis is Latent Dirichlet Allocation ("LDA"), an unsupervised topic modeling technique. As used here, each patent abstract is modeled as a mixture of topics, and each topic is a weighted set of technology-related words.
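
As a compact restatement of the paragraph above, the standard LDA generative model can be written as follows. The symbols ($K$, $D$, $\alpha$, $\eta$, $\theta_d$, $\beta_k$, $z_{d,n}$, $w_{d,n}$) are notation introduced here for illustration and are not names used elsewhere in this notebook; $K$ corresponds to the number of technology groupings and $D$ to the number of abstracts.

$$
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{aligned}
$$

Here $\theta_d$ is the topic mixture of abstract $d$, $\beta_k$ is the word distribution of topic $k$, and $w_{d,n}$ is the $n$-th word of abstract $d$, drawn from the topic $z_{d,n}$ assigned to that position.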

The abstracts are pre-processed to remove frequently occurring patent jargon and other stop words. This improves the quality of the technology topics the model is able to identify.

LDA fits the abstracts into an arbitrary, pre-set number of technology groupings. That number is currently set at 7 to keep processing time manageable.

## Imports

This patent topic modeling notebook uses the following natural language processing libraries: 

   * **gensim** library for topic modeling using TF-IDF and Bag-of-Words (LDA model);
   * **TextBlob** for part-of-speech tagging to extract nouns and plural nouns;
   * **pyLDAvis** for interactive visualization. 

In [1]:
from gensim import corpora, models
from collections import defaultdict
from textblob import TextBlob
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import os


print()
print("Imports complete.")


Imports complete.


## Hide Warnings

In [2]:
import warnings
warnings.simplefilter('ignore')

print()
print("No more warnings!")
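
The cell above silences every warning for the rest of the session. Where that is broader than wanted, a narrower option is to ignore one warning category inside a single block. The sketch below uses only the standard library; `noisy_step` is a hypothetical stand-in for any library call that emits a warning.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for a library call that emits a warning
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "done"


# record=True collects any warnings that still get through the filter
with warnings.catch_warnings(record=True) as caught:
    # Ignore only DeprecationWarning, and only inside this block
    warnings.simplefilter("ignore", category=DeprecationWarning)
    result = noisy_step()

print(result)
print("Warnings that got through:", len(caught))
```

Outside the `with` block the previous warning filters are restored, so other cells still report their warnings.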




# Process Text

## Ingest Abstracts

Read 1000 patent abstracts into a list of strings.

In [3]:
abstracts = []
print()
# Use a context manager so the file is closed after reading
with open('./patent-analytics/1000x705-claims.txt', 'r') as f:
    for line in f:
        abstracts.append(line)
        print("Abstract #" + str(len(abstracts)) + ":")
        print(line)

print()
print("Abstracts loaded.")


Abstract #1:
"An image processing apparatus for processing inputted image data defining a first image that includes a picture image and a portion that is not the picture image, the apparatus comprising:a watermark pattern creating means for creating a watermark pattern;a picture image extractor means for extracting the picture image from the first image shown by inputted image data and creating a second image that is a remaining image obtained by removing the picture image from the first image and including a position where the picture image had been arranged;a superimposing means for executing a processing of superimposing the watermark pattern onto the second image including the position where the picture image had been arranged to thereby define a third image; anda document image creating means for superimposing the extracted picture image over the third image in the position where the picture image had been arranged to thereby create a document image that includes the picture imag

"An advertising system comprising:memory which stores an advertising template for an advertising sponsor;memory which stores:a categorizer for categorizing a set of a user's images, based on image content, the categorizer including at least one classifier which has been trained on image signatures of a labeled set of training images, the labels being selected from a finite set of image categories, the image signatures for the user's images and training images being based on pixels of the respective image by extracting features from patches of the image;a comparison computing component for selecting an advertising image based on the categorization of the user's images, the advertising image being selected from a set of advertising images categorized by the categorizer; anda combining component for combining the template with the selected advertising image to create personalized advertising content for each user which is displayable to the user on a web page viewed by the user; anda comp

## Part-of-Speech Tagging

Tag each word with the part of speech using TextBlob and extract the nouns ("NN") and plural nouns ("NNS").

In [4]:
nounAbstracts = []
count = 0
for abstract in abstracts:
    count += 1
    nounsOnly = []
    blob = TextBlob(abstract)
    if count == 1:
        print()
        print("Part of Speech Tags for Abstract #1: ")
        print(blob.tags)    
    for word, pos in blob.tags:
        if pos == 'NN' or pos=='NNS':
            nounsOnly.append(word)
    nounAbstracts.append(nounsOnly)
print()
print("Displaying Nouns for Abstracts #1-5 of " + str(len(nounAbstracts)) + ": ")
print()
print(nounAbstracts[0:5])  # slice end is exclusive, so 0:5 covers abstracts #1-5


Part of Speech Tags for Abstract #1: 
[('An', 'DT'), ('image', 'NN'), ('processing', 'NN'), ('apparatus', 'NN'), ('for', 'IN'), ('processing', 'NN'), ('inputted', 'VBN'), ('image', 'NN'), ('data', 'NNS'), ('defining', 'VBG'), ('a', 'DT'), ('first', 'JJ'), ('image', 'NN'), ('that', 'WDT'), ('includes', 'VBZ'), ('a', 'DT'), ('picture', 'NN'), ('image', 'NN'), ('and', 'CC'), ('a', 'DT'), ('portion', 'NN'), ('that', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('the', 'DT'), ('picture', 'NN'), ('image', 'NN'), ('the', 'DT'), ('apparatus', 'NN'), ('comprising', 'NN'), ('a', 'DT'), ('watermark', 'NN'), ('pattern', 'NN'), ('creating', 'VBG'), ('means', 'NNS'), ('for', 'IN'), ('creating', 'VBG'), ('a', 'DT'), ('watermark', 'NN'), ('pattern', 'NN'), ('a', 'DT'), ('picture', 'NN'), ('image', 'NN'), ('extractor', 'NN'), ('means', 'VBZ'), ('for', 'IN'), ('extracting', 'VBG'), ('the', 'DT'), ('picture', 'NN'), ('image', 'NN'), ('from', 'IN'), ('the', 'DT'), ('first', 'JJ'), ('image', 'NN'), ('shown', 'VB
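
The noun filter can be tried without TextBlob installed by applying the same test to pre-tagged pairs. This is a minimal sketch: `keep_nouns` and its `noun_tags` parameter are names introduced here for illustration, and the sample pairs are the first ten tags from the Abstract #1 output above.

```python
def keep_nouns(tagged_words, noun_tags=("NN", "NNS")):
    """Return the words whose part-of-speech tag is in noun_tags."""
    return [word for word, pos in tagged_words if pos in noun_tags]


# The first ten (word, tag) pairs from the Abstract #1 output
sample_tags = [
    ('An', 'DT'), ('image', 'NN'), ('processing', 'NN'), ('apparatus', 'NN'),
    ('for', 'IN'), ('processing', 'NN'), ('inputted', 'VBN'), ('image', 'NN'),
    ('data', 'NNS'), ('defining', 'VBG'),
]
print(keep_nouns(sample_tags))
```

Passing a wider tuple such as `("NN", "NNS", "NNP", "NNPS")` would also keep proper nouns, which the notebook as written leaves out.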

## Stopwords

Create a list of stopwords, including basics such as "for", "a", and "of", plus other common words. The list also includes specialty words such as "comprising", "method", and "apparatus": patent jargon that is overly broad because it describes most of the inventions in this set, and therefore should not be counted when computing the word statistics. Many of the basic stopwords should already be excluded because we selected only nouns and plural nouns using TextBlob's POS tagger. 


In [5]:
stoplist = set('for a of the analysis client code program programs transaction transactions end ends online internet business businesses software application applications interface interfaces list lists product products object objects set network networks content item items and to in data user information consumer consumers computer computers device devices company companies transitory method methods service services apparatus system systems comprise comprising device devices item items plurality process i ii processing'.split())
print("There are " + str(len(stoplist)) + " words in the stoplist.")
print()
print(stoplist)

There are 63 words in the stoplist.

{'businesses', 'online', 'processing', 'information', 'end', 'in', 'consumers', 'code', 'applications', 'network', 'apparatus', 'data', 'i', 'interface', 'program', 'networks', 'system', 'transaction', 'products', 'internet', 'list', 'devices', 'object', 'comprise', 'a', 'process', 'content', 'the', 'and', 'programs', 'objects', 'software', 'analysis', 'business', 'comprising', 'ends', 'to', 'product', 'device', 'lists', 'user', 'method', 'services', 'plurality', 'consumer', 'methods', 'computers', 'of', 'set', 'application', 'ii', 'interfaces', 'item', 'transitory', 'for', 'computer', 'transactions', 'items', 'company', 'client', 'systems', 'service', 'companies'}


## Tokenize Abstracts

Create a list of token lists, one per abstract, keeping only the single-word tokens that are not in the stoplist.


In [6]:
texts = []

for abstract in nounAbstracts:
    tempAbstract = []
    for word in abstract:
        if word not in stoplist:
            tempAbstract.append(word)
    texts.append(tempAbstract)
print()
print("Displaying token lists for Abstracts #1-5: ")
print(texts[0:5])  # slice end is exclusive, so 0:5 covers abstracts #1-5


Displaying token lists for Abstracts #1-5: 
[['image', 'image', 'image', 'picture', 'image', 'portion', 'picture', 'image', 'watermark', 'pattern', 'means', 'watermark', 'pattern', 'picture', 'image', 'extractor', 'picture', 'image', 'image', 'image', 'image', 'image', 'picture', 'image', 'image', 'position', 'picture', 'image', 'superimposing', 'watermark', 'pattern', 'image', 'position', 'picture', 'image', 'image', 'image', 'means', 'picture', 'image', 'image', 'position', 'picture', 'image', 'document', 'image', 'picture', 'image', 'watermark', 'pattern', 'portion', 'picture', 'image', 'watermark', 'pattern'], ['advertisements', 'radio', 'broadcast', 'steps', 'portion', 'radio', 'broadcast', 'portion', 'radio', 'broadcast', 'portion', 'transcription', 'text', 'portion', 'transcription', 'text', 'portion', 'radio', 'broadcast', 'portion', 'radio', 'broadcast', 'proximity', 'media', 'player', 'radio', 'broadcast', 'portion', 'radio', 'broadcast', 'portion', 'radio', 'broadcast', 'po
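
The membership test in the cell above is case-sensitive, while every stoplist entry is lowercase, so a capitalized noun such as "System" would pass the filter. Whether that affects this data set is not shown in the notebook. The sketch below is one possible variant that lower-cases tokens first; `remove_stopwords` and the demo values are made up for illustration.

```python
def remove_stopwords(tokens, stoplist):
    """Lower-case each token and drop those found in stoplist."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stoplist]


# Small made-up stoplist and noun list for illustration
demo_stoplist = set("system method data".split())
demo_nouns = ["System", "image", "Data", "watermark", "method"]
print(remove_stopwords(demo_nouns, demo_stoplist))
```

Lower-casing also merges variants such as "Image" and "image" into a single token, which changes the word counts computed later.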

# Calculate Word Statistics

## Count Word Frequencies

Create a dictionary that counts the occurrences of each word across all abstracts. 

In [7]:
frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1
print(frequency)



## Find Frequent Words

Keep only the words that occur more than once across the whole corpus. 

In [8]:
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
print(len(texts))
print(texts)

1000
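
The counting and filtering steps can also be written with `collections.Counter`. This is an equivalent sketch on a made-up three-document corpus, not a change to the notebook's results; all `demo_` names are illustrative.

```python
from collections import Counter
from itertools import chain

# Made-up token lists standing in for the notebook's texts
demo_texts = [
    ["image", "watermark", "image"],
    ["radio", "broadcast", "image"],
    ["broadcast", "player"],
]

# Count every token across all documents in one pass
demo_frequency = Counter(chain.from_iterable(demo_texts))

# Keep only tokens seen more than once in the whole corpus
demo_filtered = [[t for t in text if demo_frequency[t] > 1] for text in demo_texts]

print(demo_frequency.most_common(2))
print(demo_filtered)
```

The threshold applies to corpus-wide counts, so a token that appears once in each of two documents is kept.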


## Dictionary of Frequent Words

Create a gensim dictionary that maps each unique token to an integer ID, for use in calculating BOW and TF-IDF.

In [9]:
dictionary = corpora.Dictionary(texts)
print()
print(dictionary)


Dictionary(2508 unique tokens: ['document', 'image', 'means', 'pattern', 'picture']...)


## Bag-of-Words

Convert each abstract to a list of (word ID, frequency) pairs to make the bag-of-words (BOW) corpus. 


In [10]:
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

[[(0, 1), (1, 24), (2, 2), (3, 5), (4, 11), (5, 2), (6, 3), (7, 5)], [(5, 12), (8, 1), (9, 12), (10, 1), (11, 2), (12, 2), (13, 2), (14, 12), (15, 2), (16, 4), (17, 4)], [(18, 4), (19, 1), (20, 3), (21, 18), (22, 1), (23, 3), (24, 6), (25, 1), (26, 7), (27, 6), (28, 1), (29, 6), (30, 4), (31, 1), (32, 1), (33, 3), (34, 2), (35, 4)], [(5, 4), (11, 1), (36, 1), (37, 3), (38, 2), (39, 2), (40, 1), (41, 1), (42, 1), (43, 1), (44, 2), (45, 3), (46, 6), (47, 1)], [(48, 1), (49, 2), (50, 3), (51, 3), (52, 1), (53, 3), (54, 3), (55, 2), (56, 1), (57, 2), (58, 3), (59, 6), (60, 1), (61, 1), (62, 1)], [(40, 2), (41, 2), (63, 5), (64, 1), (65, 15), (66, 1), (67, 5), (68, 1), (69, 9), (70, 1), (71, 1), (72, 1), (73, 2), (74, 1), (75, 2), (76, 9), (77, 2), (78, 2), (79, 1)], [(18, 1), (39, 1), (40, 2), (41, 2), (53, 5), (67, 1), (79, 1), (80, 2), (81, 5), (82, 1), (83, 2), (84, 1), (85, 2), (86, 1), (87, 1), (88, 1), (89, 2), (90, 1), (91, 2), (92, 2), (93, 1), (94, 2), (95, 6), (96, 1), (97, 2), (
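
To make the dictionary and bag-of-words steps concrete, here is a pure-Python sketch of the same idea. It is an illustration rather than gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and the ID ordering is an assumption chosen to match the outputs above, where the first abstract's tokens receive IDs 0-7 in alphabetical order.

```python
from collections import Counter


def build_token2id(docs):
    """Assign consecutive integer IDs to tokens, document by document."""
    token2id = {}
    for doc in docs:
        # Within a document, new tokens receive IDs in alphabetical order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens present in token2id."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


# Made-up documents for illustration
demo_docs = [["picture", "image", "image"], ["radio", "image", "broadcast"]]
demo_ids = build_token2id(demo_docs)
demo_bow_corpus = [to_bow(doc, demo_ids) for doc in demo_docs]
print(demo_ids)
print(demo_bow_corpus)
```

Each document becomes a sparse vector: only tokens that actually occur in it are listed, which is why the pairs in the output above skip most of the 2508 IDs.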

## TF-IDF Model

Calculate term frequency - inverse document frequency (TF-IDF) using the word statistics and gensim's model. 

In [11]:
tfidf = models.TfidfModel(corpus)
print()
print(tfidf)


TfidfModel(num_docs=1000, num_nnz=21143)
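
The weighting can be sketched in plain Python, assuming gensim's default settings: raw term count times log base 2 of (number of documents / document frequency), followed by scaling each document to unit length. The function `tfidf_weights` and the demo corpus are illustrative, not part of the notebook.

```python
import math


def tfidf_weights(bow_corpus):
    """Weight each (id, count) pair by count * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weighted = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in doc]
        # A token present in every document gets weight 0 and is dropped
        weighted = [(tid, w) for tid, w in weighted if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        weighted_corpus.append([(tid, w / norm) for tid, w in weighted])
    return weighted_corpus


# Made-up BOW corpus: token 0 appears in all three documents
demo_bow = [
    [(0, 2), (1, 1), (2, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1), (3, 2)],
]
demo_tfidf = tfidf_weights(demo_bow)
for doc in demo_tfidf:
    print(doc)
```

Under this scheme a word that appears in every document carries no weight, while a word concentrated in a few documents scores highly there.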


## Associate Token --> ID

Display the mapping from each token to its integer ID.

In [12]:
print()
print(dictionary.token2id)




## Calculate TF-IDF Values

In [13]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.03575733899088009), (1, 0.6800162462758899), (2, 0.07908309730840547), (3, 0.2413612922414725), (4, 0.5721209138874248), (5, 0.043427145553783304), (6, 0.10727201697264029), (7, 0.3620419383622087)]
[(5, 0.29590022552527656), (8, 0.041741274859565074), (9, 0.6442122843264096), (10, 0.05081361825885325), (11, 0.06331046420736691), (12, 0.11495031693101872), (13, 0.1016272365177065), (14, 0.5906890878955187), (15, 0.059543538416063795), (16, 0.1912881117954457), (17, 0.2766029152891373)]
[(18, 0.1088349042291638), (19, 0.0671894664733831), (20, 0.22100217362465582), (21, 0.54712288186035), (22, 0.024842576591419654), (23, 0.23642762801100353), (24, 0.23428270297475667), (25, 0.05973557693020666), (26, 0.5156717384575302), (27, 0.22643165144105934), (28, 0.03372283136974718), (29, 0.2572825602137547), (30, 0.20378232312390904), (31, 0.04446765604614176), (32, 0.047837113311688835), (33, 0.10283491458367407), (34, 0.14003842637554356), (35, 0.24491686815416955)]
[(5, 0.2091225106399

[(5, 0.4401414641838312), (157, 0.564506951990288), (160, 0.2856977002170423), (161, 0.5201671849445277), (457, 0.06589212320404164), (638, 0.06589212320404164), (860, 0.17359791173339553), (1314, 0.0854323435062435), (1529, 0.17790818281100917), (2192, 0.2400844244664213)]
[(13, 0.5556683004572377), (43, 0.0358031546811619), (57, 0.15038149819806165), (63, 0.31672416177698026), (65, 0.37203347854033914), (94, 0.020821577159822927), (155, 0.3873850500623143), (159, 0.06727964749364625), (248, 0.06877647986492022), (290, 0.06680280491045133), (293, 0.07467620804558746), (296, 0.07757323358973343), (322, 0.29353051851652817), (929, 0.14672185747039784), (1433, 0.31425752305559934), (1665, 0.2152990664586915)]
[(39, 0.04685924014415898), (50, 0.24520706479055102), (54, 0.2090173471694838), (67, 0.05970578924339015), (94, 0.06836980550155079), (115, 0.07058597370478979), (143, 0.06569830123332171), (217, 0.0343911575791858), (314, 0.3739800835589249), (415, 0.06880350472185343), (441, 0.20

# Visualizations

## Latent Dirichlet Allocation

In [14]:
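Lda = models.ldamodel.LdaModel
# Fit a 7-topic LDA model on the TF-IDF-weighted corpus
ldamodel = Lda(corpus_tfidf, num_topics=7, id2word=dictionary, alpha='auto', passes=5)
# Prepare the visualization from the model, the BOW corpus, and the dictionary
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)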
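
Once a model is fitted, one way to turn it into technology groupings is to assign each patent to its highest-probability topic. The sketch below assumes each document's topic mixture is available as a list of (topic_id, probability) pairs, which is the shape gensim's `get_document_topics` returns. The helper names and the demo mixtures are made up and are not model output.

```python
from collections import defaultdict


def dominant_topic(doc_topics):
    """Return the topic ID with the highest probability for one document."""
    return max(doc_topics, key=lambda pair: pair[1])[0]


def group_by_topic(all_doc_topics):
    """Map each topic ID to the indices of the documents it dominates."""
    groups = defaultdict(list)
    for doc_index, doc_topics in enumerate(all_doc_topics):
        groups[dominant_topic(doc_topics)].append(doc_index)
    return dict(groups)


# Made-up topic mixtures for three documents (not model output)
demo_mixtures = [
    [(0, 0.10), (3, 0.85), (5, 0.05)],
    [(2, 0.60), (3, 0.40)],
    [(3, 0.55), (6, 0.45)],
]
demo_groups = group_by_topic(demo_mixtures)
print(demo_groups)
```

Hard assignment discards the mixture information, so a patent split nearly evenly between two topics, like the third demo document, lands in only one group.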

## Export HTML Visualization

In [15]:
name = input("Save As: ")
location = ("./patent-analytics/" + name + ".html")
pyLDAvis.save_html(vis, location)

cwd = os.getcwd()
user = (cwd.split("-"))

print()
print("Exported to: " + location)
link = ("https://tylerseymour.pw/user/" + user[1] + "/notebooks/patent-analytics/" + name + ".html")
print(link)

Save As: patentTopics

Exported to: ./patent-analytics/patentTopics.html
http://104.197.20.244/user/admin/notebooks/patent-analytics/patentTopics.html
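
The export cell builds the output path by string concatenation from whatever the user types. A slightly more defensive variant is sketched below using `pathlib`; `build_export_path` is a hypothetical helper, and the default directory simply mirrors the one used in the cell above.

```python
from pathlib import Path


def build_export_path(name, base_dir="./patent-analytics"):
    """Return base_dir/<name>.html, using only the final component of name."""
    # Path(name).name discards any directory parts the user may have typed
    stem = Path(name).name
    if not stem:
        raise ValueError("file name must not be empty")
    if not stem.lower().endswith(".html"):
        stem += ".html"
    return Path(base_dir) / stem


export_path = build_export_path("patentTopics")
print(export_path.name)
```

The result can be passed to `pyLDAvis.save_html` as `str(export_path)`.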
