<a href="https://colab.research.google.com/github/rcmckee/Summarization-of-Claims/blob/master/Claim_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
from gensim.summarization import summarize, keywords
from pprint import pprint
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.api as sm
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

# Load data from gDrive

In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [0]:
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
downloaded = drive.CreateFile({'id':'1UcoQDxQe5MGruMUoD4013HrTz346OH8i'}) 
downloaded.GetContentFile('small_700_through_710_descr_clm_code.csv')  
df = pd.read_csv('small_700_through_710_descr_clm_code.csv')
# Dataset is now stored in a Pandas Dataframe

# Look at the data

In [27]:
df.head()

Unnamed: 0.1,Unnamed: 0,descr,clm,code
0,0,This application claims priority under 35 U.S....,What is claimed is: \n \n 1 . A pr...,700
1,1,BACKGROUND \n 1. Field of Invention \n ...,What is claimed is: \n \n 1 . A st...,700
2,2,CROSS-REFERENCE TO RELATED APPLICATIONS \n ...,What is claimed is: \n \n 1 . A me...,700
3,3,FIELD OF THE INVENTION \n The present inve...,1 . A method for state-transition-controlled p...,700
4,4,RELATED APPLICATION \n This application cl...,What is claimed is: \n \n 1 . A me...,700


I only want the claims. I also need to clean the text and remove things like '\n'.

# Clean Text

In [0]:
def remove_string(dataframe,column_list,string_in_quotes):
    '''
    Input:
            dataframe: name of pandas dataframe
            column_list: list of column name strings (ex. ['col_1','col_2'])
            string_in_quotes: string to remove in quotes (ex. ',')
    
    Output:
            none
            modifies pandas dataframe to remove string.
                
    Example:
            remove_string(df, ['col_1','col_2'], ',')
    
    Warning:
            If memory issues occur, limit to one column at a time.
        
    '''
    for i in column_list:
        # regex=False treats the argument as a literal string, not a pattern
        dataframe[i] = dataframe[i].str.replace(string_in_quotes, "", regex=False).astype(str)

In [0]:
remove_string(df,['clm'],'\n')
remove_string(df,['clm'],'                   ')
remove_string(df,['clm'],'              ')
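# remove the preamble before the first claim number
df.clm = df.clm.str.replace(r'\A(\D+)', '', regex=True)
# remove each number, the following space, and one more character ('.' is unescaped, so it matches any character)
df.clm = df.clm.str.replace(r'\d+ .', '', regex=True)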
df.clm = df.clm.str.strip() #remove leading and trailing white spaces
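
To see what the regex steps above do, here is a minimal sketch on a made-up claim string. It uses plain `re.sub` instead of the pandas string methods and skips the removal of long space runs; `clean_claims` and the sample text are hypothetical names used only for illustration.

```python
import re

def clean_claims(raw):
    """Apply the same patterns as the cell above to a single string."""
    text = raw.replace("\n", "")          # drop newlines
    text = re.sub(r"\A(\D+)", "", text)   # drop everything before the first digit
    text = re.sub(r"\d+ .", "", text)     # drop number + space + any one character
    return text.strip()

# Made-up sample in the same layout as the 'clm' column
sample = "What is claimed is: \n 1 . A method. \n 2 . The method of claim 1 , wherein x."
cleaned = clean_claims(sample)
print(cleaned)
```

Because the `.` in the second pattern is not escaped, the number inside "claim 1 ," is removed as well, which is why the cleaned claims read "The method of  claim  wherein".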

In [59]:
df.head()

Unnamed: 0.1,Unnamed: 0,descr,clm,code
0,0,This application claims priority under 35 U.S....,A programming template for developing an appli...,700
1,1,BACKGROUND \n 1. Field of Invention \n ...,"A state machine engine, comprising: a state ma...",700
2,2,CROSS-REFERENCE TO RELATED APPLICATIONS \n ...,A method of controlling a PLC digital output s...,700
3,3,FIELD OF THE INVENTION \n The present inve...,A method for state-transition-controlled proce...,700
4,4,RELATED APPLICATION \n This application cl...,A method of online and dynamic schedule config...,700


In [60]:
df.clm[2]

'A method of controlling a PLC digital output signal, the method comprising: receiving the PLC digital output signal; and   interpolating a gradient of the PLC digital output signal by applying a nonlinear correction function to the received PLC digital output signal.    The method of  claim  wherein the interpolating of the gradient comprises interpolating the gradient by setting at least one of an interpolation frequency, a target value and a scan time.  The method of  claim  wherein the nonlinear correction function comprises a sigmoid function.  The method of  claim  further comprising: selecting at least one channel according to a user input, wherein the interpolating of the gradient is performed only when a digital-analog conversion and an interpolation are allowed in the selected channel.    The method of  claim  further comprising: performing an interpolation by applying a linear correction function to adjust an offset gain of the PLC digital output signal to which the nonlinea

# Text Summarization trial 1

Gensim implements **TextRank** summarization through the **summarize()** function in its summarization module (available in gensim 3.x; the module was removed in gensim 4.0). Pass in the text string along with either the output summarization ratio (`ratio`) or the maximum number of words in the summary (`word_count`).

There is no need to split the text into sentences beforehand, because gensim does the splitting itself with the built-in split_sentences() function in the **gensim.summarization.textcleaner** module.

In [61]:
text = df.clm[2]
pprint(summarize(text, word_count=20))

('The method of  claim  further comprising: performing an interpolation by '
 'applying a linear correction function to adjust an offset gain of the PLC '
 'digital output signal to which the nonlinear correction function is applied.')


In [62]:
print(keywords(text))

comprising
comprises
unit
output
signal
correction
analog
interpolating
interpolation
interpolate
interpolated
interpolates


# Text Summarization trial 2
Increase the word limit from 20 to 150 words (the maximum for a patent abstract).

In [63]:
text = df.clm[2]
print(summarize(text, word_count=150))

The method of  claim  wherein the interpolating of the gradient comprises interpolating the gradient by setting at least one of an interpolation frequency, a target value and a scan time.
The method of  claim  further comprising: performing an interpolation by applying a linear correction function to adjust an offset gain of the PLC digital output signal to which the nonlinear correction function is applied.
The apparatus of  claim  wherein the interpolation unit interpolates the PLC digital output signal calculated in the calculation unit by setting at least one of an interpolation frequency, a target value and a scan time.
The apparatus of  claim  wherein the nonlinear correction function used in the interpolation unit comprises a sigmoid function.
The apparatus of  claim  wherein the interpolation unit performs an interpolation by applying a linear correction function to adjust an offset gain of the PLC digital output signal to which the nonlinear correction function is applied.


# That wasn't great

# Text Summarization trial 3
TextRank built by hand: GloVe sentence vectors, cosine similarity, and PageRank.

In [64]:
import nltk
nltk.download('punkt') # one time execution
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
from nltk.tokenize import sent_tokenize

In [0]:
sentences = []

for s in df.clm[2].split('.'):
  s = s.strip()
  sentences.append(sent_tokenize(s))
  


In [71]:
sentences[:5]

[['A method of controlling a PLC digital output signal, the method comprising: receiving the PLC digital output signal; and   interpolating a gradient of the PLC digital output signal by applying a nonlinear correction function to the received PLC digital output signal'],
 ['The method of  claim  wherein the interpolating of the gradient comprises interpolating the gradient by setting at least one of an interpolation frequency, a target value and a scan time'],
 ['The method of  claim  wherein the nonlinear correction function comprises a sigmoid function'],
 ['The method of  claim  further comprising: selecting at least one channel according to a user input, wherein the interpolating of the gradient is performed only when a digital-analog conversion and an interpolation are allowed in the selected channel'],
 ['The method of  claim  further comprising: performing an interpolation by applying a linear correction function to adjust an offset gain of the PLC digital output signal to whic

In [0]:
sentences = [y for x in sentences for y in x] # flatten list

In [73]:
sentences[:5]

['A method of controlling a PLC digital output signal, the method comprising: receiving the PLC digital output signal; and   interpolating a gradient of the PLC digital output signal by applying a nonlinear correction function to the received PLC digital output signal',
 'The method of  claim  wherein the interpolating of the gradient comprises interpolating the gradient by setting at least one of an interpolation frequency, a target value and a scan time',
 'The method of  claim  wherein the nonlinear correction function comprises a sigmoid function',
 'The method of  claim  further comprising: selecting at least one channel according to a user input, wherein the interpolating of the gradient is performed only when a digital-analog conversion and an interpolation are allowed in the selected channel',
 'The method of  claim  further comprising: performing an interpolation by applying a linear correction function to adjust an offset gain of the PLC digital output signal to which the non

## Download GloVe Word Embeddings

In [74]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2019-03-26 16:18:42--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2019-03-26 16:18:42--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2019-03-26 16:19:57 (11.0 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [0]:
# Extract word vectors
word_embeddings = {}

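# the context manager closes the file even if parsing fails
with open('glove.6B.100d.txt', encoding='utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]  # first token is the word
    coefs = np.asarray(values[1:], dtype='float32')  # remaining tokens are the vector
    word_embeddings[word] = coefs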

In [76]:
len(word_embeddings)

400000

Word vectors for 400,000 terms are now stored in the `word_embeddings` dictionary.
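
As a rough illustration of the file layout the loader relies on, the sketch below parses two made-up lines in the GloVe text format: a token followed by its vector components, separated by spaces. The data and the name `toy_embeddings` are hypothetical, and plain Python lists stand in for the NumPy arrays used above.

```python
import io

# Two made-up lines in the GloVe text layout: token, then vector components.
# Lines in glove.6B.100d.txt carry 100 components; 3 are used here for brevity.
fake_glove = io.StringIO("signal 0.1 -0.2 0.3\noutput 0.4 0.5 -0.6\n")

toy_embeddings = {}
for line in fake_glove:
    token, *components = line.split()
    toy_embeddings[token] = [float(c) for c in components]

print(len(toy_embeddings), toy_embeddings["signal"])
```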

## Text Preprocessing

In [0]:
# replace punctuation, digits, and special characters with spaces in each sentence
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [78]:
# download the NLTK stopword list (used below to remove stopwords)
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [0]:
def remove_stopwords(sen):
  sen_new = " ".join([i for i in sen if i not in stop_words])
  return sen_new

In [0]:
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

## Vector Representation of Sentences

In [0]:
# create vectors for our sentences
# word vectors were created above

sentence_vectors = []

for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
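
To make the averaging step concrete, here is a small sketch with two-dimensional toy vectors. `average_vector` and the toy dictionary are hypothetical, and lists replace NumPy arrays; the denominator mirrors the `len(...) + 0.001` used in the cell above.

```python
def average_vector(words, embeddings, dim):
    """Average the word vectors, counting unknown words as zero vectors."""
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)   # unknown word -> zero vector
        total = [t + v for t, v in zip(total, vec)]
    return [t / (len(words) + 0.001) for t in total]

toy = {"plc": [1.0, 0.0], "signal": [0.0, 1.0]}
vec = average_vector(["plc", "signal", "unknownword"], toy, dim=2)
print(vec)
```

Note that a word missing from the embeddings still counts toward the denominator, so it pulls the sentence vector toward zero.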

## Similarity Matrix Preparation using Cosine Similarity
The next step is to measure how similar each pair of sentences is, using the cosine similarity of their sentence vectors.

In [0]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

In [0]:
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
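
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python; `cosine` is a hypothetical helper, not the scikit-learn function, and returning 0.0 for a zero vector is a choice made here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Since the measure ignores vector length, claims that reuse the same vocabulary score close to 1 regardless of how long they are.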

## Applying PageRank Algorithm

In [0]:
import networkx as nx

In [0]:
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
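
To give a feel for what the ranking step computes, here is a simplified power-iteration sketch on a toy 3x3 similarity matrix. It is not the networkx implementation: `pagerank_sketch` is a hypothetical function, it assumes non-negative weights, a damping factor of 0.85 (the networkx default for `alpha`), and a fixed number of iterations instead of a convergence check.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on a square matrix."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if out_sums[i] > 0:
                    # node i passes its score on in proportion to edge weight
                    incoming += scores[i] * weights[i][j] / out_sums[i]
                else:
                    # a node with no outgoing weight spreads its score evenly
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy matrix: sentences 0 and 1 are very similar, sentence 2 is an outlier
toy_sim = [[0.0, 0.9, 0.1],
           [0.9, 0.0, 0.1],
           [0.1, 0.1, 0.0]]
toy_scores = pagerank_sketch(toy_sim)
print(toy_scores)
```

In this toy case the two near-duplicate sentences reinforce each other and outrank the outlier, which hints at why the ranking favors repetitive dependent claims.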

## Summary Extraction
Extract the top N sentences by PageRank score to form the summary.

In [0]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [89]:
# Print up to the 10 highest-ranked sentences as a summary
for score, sentence in ranked_sentences[:10]:
  print(sentence)

The apparatus of  claim  wherein the interpolation unit performs an interpolation by applying a linear correction function to adjust an offset gain of the PLC digital output signal to which the nonlinear correction function is applied
The method of  claim  further comprising: performing an interpolation by applying a linear correction function to adjust an offset gain of the PLC digital output signal to which the nonlinear correction function is applied
The apparatus of  claim  wherein the interpolation unit interpolates the PLC digital output signal calculated in the calculation unit by setting at least one of an interpolation frequency, a target value and a scan time
An apparatus for controlling a PLC digital output signal, the apparatus comprising: a calculation unit to calculate a value of a signal to be actually output;   an interpolation unit to interpolate a signal by applying a nonlinear correction function;   a conversion unit to convert a digital signal into an analog signal;

## Not great either.