# What is Keyword Extraction?

Keyword extraction is the task of identifying a set of terms that describe the subject of a given document.

It is an important technique in information retrieval (IR) systems, where keywords simplify and speed up search.

It also reduces the dimensionality of text for further analysis (topic modeling, text classification). Typical applications include:
+ indexing data
+ summarizing text
+ generating tag clouds with the most representative keywords

# Data settings

Neural Information Processing Systems (NIPS) is one of the top machine learning conferences in the world. It covers topics ranging from deep learning and computer vision to cognitive science and reinforcement learning.

This dataset includes the title, authors, abstracts, and extracted text for all NIPS papers to date (ranging from the first 1987 conference to the current 2016 conference). I've extracted the paper text from the raw PDF files and am releasing it both as CSV files and as a SQLite database. The code to scrape and create this dataset is on GitHub. Here's a quick RMarkdown exploratory overview of what's in the data.

# Links & References

+ https://thecleverprogrammer.com/2020/12/01/keyword-extraction-with-python/
+ corpus: https://www.kaggle.com/benhamner/nips-papers/download (about 428 MB)


# Corpus overview

In [1]:
import os
import sys
import re
import numpy as np
import pandas as pd

In [2]:
file_input_path = "./nips_papers/papers.csv"

In [3]:
if not os.path.exists(file_input_path):
    print("File not found :", file_input_path)
    sys.exit(1)

In [4]:
!head -n3 $file_input_path

id,year,title,event_type,pdf_name,abstract,paper_text
1,1987,Self-Organization of Associative Database and Its Applications,,1-self-organization-of-associative-database-and-its-applications.pdf,Abstract Missing,"767



In [5]:
df = pd.read_csv(file_input_path, sep=",", header=0)

In [6]:
df.head(3)

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7241 entries, 0 to 7240
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          7241 non-null   int64 
 1   year        7241 non-null   int64 
 2   title       7241 non-null   object
 3   event_type  2422 non-null   object
 4   pdf_name    7241 non-null   object
 5   abstract    7241 non-null   object
 6   paper_text  7241 non-null   object
dtypes: int64(2), object(5)
memory usage: 396.1+ KB


In [8]:
df.describe()

Unnamed: 0,id,year
count,7241.0,7241.0
mean,3655.912167,2006.439718
std,2098.435219,8.759919
min,1.0,1987.0
25%,1849.0,2000.0
50%,3659.0,2009.0
75%,5473.0,2014.0
max,7284.0,2017.0


Thus, the file "papers.csv" has 7 columns: id, year, title, event_type, pdf_name, abstract, paper_text.

We focus on the title, the abstract, and the paper_text for this notebook.

In [9]:
df = df[["title", "abstract", "paper_text"]].copy()  # explicit copy, so later assignments do not warn

In [10]:
df.head(3)

Unnamed: 0,title,abstract,paper_text
0,Self-Organization of Associative Database and ...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,A Mean Field Theory of Layer IV of Visual Cort...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,Storing Covariance by the Associative Long-Ter...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...


How many papers have no abstract (marked "Abstract Missing")?

In [11]:
len(df[df["abstract"]=="Abstract Missing"])

3317

In [12]:
# Assign through .loc rather than chained indexing to avoid SettingWithCopyWarning
df.loc[df["abstract"]=="Abstract Missing", "abstract"] = ""

In [13]:
df.head(3)

Unnamed: 0,title,abstract,paper_text
0,Self-Organization of Associative Database and ...,,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,A Mean Field Theory of Layer IV of Visual Cort...,,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,Storing Covariance by the Associative Long-Ter...,,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...


# Preprocessing

In [14]:
from nltk.corpus import stopwords

In [15]:
stopwords_en = set(stopwords.words("english"))

# Create a list of custom stopwords
new_words = ["fig","figure","image","sample","using", 
             "show", "result", "large", 
             "also", "one", "two", "three", 
             "four", "five", "seven","eight","nine", "table"]
stopwords_en = stopwords_en.union(new_words)  # keep a set: membership tests are O(1)

In [16]:
# !pip install spacy

In [17]:
# !python -m spacy download en_core_web_sm

In [18]:
import spacy

In [19]:
nlp = spacy.load("en_core_web_sm")

In [20]:
def GetLemma(text, nlp):
    return " ".join([token.lemma_ for token in nlp(text)])

def Preprocessing(text, nlp):
    text = text.lower()
    
    text = re.sub('<[^>]*>','',text)
    
    # remove tags (the " <> " placeholder is stripped by the next step)
    text = re.sub(r"</?.*?>", " <> ", text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
        
    # remove stopwords
    text = " ".join([word for word in text.split() if word not in stopwords_en])
    
    # remove words less than three letters
    # text = [word for word in text if len(word) >= 3]
    
    # lemmatize
    text = GetLemma(text=text, nlp=nlp)
    
    return text
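
To make the effect of the regex steps concrete, here is a minimal sketch that applies the same lowercasing, tag stripping, character filtering, and stopword removal to a made-up sentence. It leaves out lemmatization so that it runs without spaCy. The function name `clean_text` and the tiny set `demo_stopwords` are hypothetical stand-ins, not part of the notebook.

```python
import re

# Hypothetical stand-in for the NLTK-based stopword set used in the notebook.
demo_stopwords = {"the", "of", "in", "is", "a", "fig"}

def clean_text(text, stopwords):
    """Apply the regex and stopword steps of Preprocessing (no lemmatization)."""
    text = text.lower()
    # Strip anything that looks like an HTML/XML tag.
    text = re.sub(r"<[^>]*>", "", text)
    # Collapse every run of digits and non-word characters into one space.
    text = re.sub(r"(\d|\W)+", " ", text)
    # Drop stopwords.
    return " ".join(w for w in text.split() if w not in stopwords)

# Made-up input, not taken from the NIPS corpus.
sample = "<b>Fig. 3</b>: The convergence of NMF in 100 iterations is shown."
cleaned = clean_text(sample, demo_stopwords)
print(cleaned)
```

Note that the character filter removes digits and punctuation together, so numbers such as "100" disappear entirely rather than surviving as tokens.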

In [21]:
paper_texts = df["paper_text"].values.tolist()

In [22]:
num_papers = len(paper_texts)
print("Number of papers: ", num_papers)

Number of papers:  7241


In [23]:
file_output_path = "./nips_papers/paper_text_processed.csv"

if not os.path.exists(file_output_path):
    # Pre-processing Phase
    import time
    paper_text_processed = []
    for i, content in enumerate(paper_texts):
        start_time = time.time()
        print("-" * 32)
        print(f"Processing article num: {i+1}/{num_papers}")
        p = Preprocessing(content, nlp)
        paper_text_processed.append(p)
        print(f"Duration: {time.time()-start_time} (s)")
    
    df["paper_text_processed"] = paper_text_processed
    
    # Save to file    
    df.to_csv(file_output_path, header=True, sep="\t", index=False)
    df1 = df  # later cells read from df1
else:
    # Load from file
    df1 = pd.read_csv(file_output_path, header=0, sep="\t")

In [24]:
df1.head(3)

Unnamed: 0,title,abstract,paper_text,paper_text_processed
0,Self-Organization of Associative Database and ...,,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...,self organization associative database applica...
1,A Mean Field Theory of Layer IV of Visual Cort...,,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...,mean field theory layer iv visual cortex appli...
2,Storing Covariance by the Associative Long-Ter...,,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...,store covariance associative long term potenti...


# Our approach

We use TF-IDF, which stands for "Term Frequency-Inverse Document Frequency".

The importance of a word increases in proportion to the number of times it appears in the document (term frequency), and is offset by how frequently the word appears across the corpus (inverse document frequency).

Using the TF-IDF scheme, the keywords are the words with the highest TF-IDF score.
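
A small worked example may help here. The sketch below scores the words of one document in a made-up toy corpus using raw term frequency times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn documents for `smooth_idf=True`. It omits the L2 normalization that `TfidfTransformer` applies by default, so the absolute values differ from the notebook's, although the ranking within a document is unaffected. The function name `tfidf_scores` and the corpus are assumptions for illustration only.

```python
import math
from collections import Counter

# Made-up, already-preprocessed toy corpus (not taken from the NIPS data).
toy_docs = [
    "matrix factorization update rule matrix",
    "neural network update rule",
    "neural network training",
]

def tfidf_scores(doc, corpus):
    """Score each word of `doc` by raw term frequency times smoothed IDF."""
    n_docs = len(corpus)
    tf = Counter(doc.split())
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of documents containing the word.
        df = sum(1 for d in corpus if word in d.split())
        # Smoothed IDF (assumed to mirror smooth_idf=True, without normalization).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = count * idf
    return scores

scores = tfidf_scores(toy_docs[0], toy_docs)
top_word = max(scores, key=scores.get)
print(top_word, round(scores[top_word], 4))
```

In this toy corpus "matrix" ranks first because it occurs twice in the document and in no other document, while "update" and "rule" are penalized for also appearing in the second document.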

We use the CountVectorizer class in scikit-learn to build the bag-of-words representation that the TF-IDF scores are computed from.
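
As a rough illustration of what a bag-of-words with unigrams and bigrams contains, the sketch below counts n-grams for a single made-up document using only the standard library. The helper `ngrams` is hypothetical; CountVectorizer additionally handles tokenization, vocabulary pruning, and sparse matrix output, none of which is reproduced here.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Made-up preprocessed document.
doc = "update rule matrix update rule"
bag = Counter(ngrams(doc.split()))
print(bag.most_common(3))
```

With `n_max=2` the bag holds single words and adjacent word pairs such as "update rule", but no three-word phrases.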

In [25]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Create Bag-Of-Words
cv = CountVectorizer(
    max_df=0.99,         # ignore terms that appear in more than 99% of documents
    max_features=10000,  # cap the vocabulary at the 10,000 most frequent terms
    ngram_range=(1,2)    # vocabulary contains unigrams and bigrams
)

tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)

In [26]:
docs = df1["paper_text_processed"].values.tolist()

In [27]:
len(docs)

7241

In [28]:
docs = [str(d).strip() for d in docs if str(d).strip()]

In [29]:
len(docs)

7241

In [30]:
bow = cv.fit_transform(docs)

In [31]:
tfidf.fit(bow)

TfidfTransformer()

In [32]:
def Extract_TopN(feature_names, sorted_keywords, top_n=10):
    """Get the feature names and TF-IDF score of Top-N keywords
    """
    top_n_keywords = sorted_keywords[:top_n]
    
    scores = []
    features = []
    
    for i, score in top_n_keywords:
        # Keep track of feature name and its corresponding score
        features.append(feature_names[i])
        scores.append(np.round(score, 6))
        
    # Map each feature to its score (dicts preserve insertion order)
    return dict(zip(features, scores))

def ExtractKeywords(tfidf, cv, idx, docs, feature_names):
    # Generate TF-IDF for given document
    tfidf_vector = tfidf.transform(
        cv.transform([docs[idx]])
    )
    
    # Sort the TF-IDF vectors by descending order of scores
    coo_matrix = tfidf_vector.tocoo()
    col_data = zip(coo_matrix.col, coo_matrix.data)
    sorted_keywords = sorted(col_data, key=lambda x: (x[1], x[0]), reverse=True)
    
    # Extract only the top n; n here is 10
    return Extract_TopN(feature_names, sorted_keywords, top_n=10)    
    
def PrintKeywords(df, keywords, idx):
    print("Title: ", df["title"][idx])
    print("Abstract: ", df["abstract"][idx])
    print("Keywords: ")
    for k, score in keywords.items():        
        # print(f"{k} \t {score}")
        print("{0:<22}{1:<8}".format(k, score))
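
The ranking step inside `ExtractKeywords` can be illustrated without scikit-learn. The sketch below assumes the non-zero TF-IDF entries of one document are already available as (column index, score) pairs, which is the shape that `zip(coo_matrix.col, coo_matrix.data)` yields. The vocabulary, the scores, and the name `top_n_keywords` are made up for illustration.

```python
# Made-up vocabulary and (column index, score) pairs for one document.
vocab = ["convergence", "matrix", "rule", "update", "update rule"]
pairs = [(0, 0.12), (1, 0.40), (2, 0.25), (3, 0.40), (4, 0.31)]

def top_n_keywords(pairs, vocab, top_n=3):
    """Return {keyword: score} for the top_n highest-scoring entries."""
    # Sort by score, then by column index, both descending, so that ties
    # between equal scores are broken deterministically.
    ranked = sorted(pairs, key=lambda p: (p[1], p[0]), reverse=True)
    return {vocab[i]: round(score, 6) for i, score in ranked[:top_n]}

keywords = top_n_keywords(pairs, vocab)
print(keywords)
```

Including the column index in the sort key only matters when two terms share exactly the same score; here it places "update" ahead of "matrix".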
        

In [33]:
docs[941][:10]

'algorithm '

In [34]:
# Get feature names
feature_names = cv.get_feature_names_out()

idx = 941
keywords = ExtractKeywords(tfidf, cv, idx, docs, feature_names)
PrintKeywords(df1, keywords, idx)

Title:  Algorithms for Non-negative Matrix Factorization
Abstract:  Non-negative matrix factorization (NMF) has previously been shown to 
be a useful decomposition for multivariate data. Two different multi- 
plicative algorithms for NMF are analyzed. They differ only slightly in 
the multiplicative factor used in the update rules. One algorithm can be 
shown to minimize the conventional least squares error while the other 
minimizes the generalized Kullback-Leibler divergence. The monotonic 
convergence of both algorithms can be proven using an auxiliary func- 
tion analogous to that used for proving convergence of the Expectation- 
Maximization algorithm. The algorithms can also be interpreted as diag- 
onally rescaled gradient descent, where the rescaling factor is optimally 
chosen to ensure convergence. 
Keywords: 
ht                    0.548523
update rule           0.278459
ht ht                 0.230532
update                0.219442
negative matrix       0.170051
aux