Type: Learning Project \
Author: Yash K

#### About: Keyword Extraction
Keyword extraction is the natural language processing (NLP) task of automatically identifying a set of terms that describe the subject of a text. It is an important technique in information retrieval (IR) systems, where keywords simplify and speed up search. It can also be used to reduce the dimensionality of text before further analysis such as topic modeling or text classification.

Data set: 

In [3]:
import nltk
#imports
#generic/data
import numpy as np
import pandas as pd

#plotting
import matplotlib.pyplot as plt
import seaborn as sns

#utility
import re
from tqdm.notebook import tqdm

#ML
import tensorflow as tf
import sklearn

#NLP
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

In [4]:
df = pd.read_csv('./keyword-extraction-data/papers.csv')

### Data visualization

In [5]:
print('year range: ', min(df['year']), max(df['year']))

year range:  1987 2017


In [6]:
print('distinct event types:', df['event_type'].unique())

distinct event types: [nan 'Oral' 'Spotlight' 'Poster']


In [7]:
print('abstracts present:', len(df['abstract'].unique())-1)

abstracts present: 3922


In [8]:
paperTexts = df['abstract'].unique()

In [9]:
paperTexts[1267]

'Statistical models for networks have been typically committed to strong prior assumptions concerning the form of the modeled distributions. Moreover, the vast majority of currently available models are explicitly designed for capturing some specific graph properties (such as power-law degree distributions), which makes them unsuitable for application to domains where the behavior of the target quantities is not known a priori. The key contribution of this paper is twofold. First, we introduce the Fiedler delta statistic, based on the Laplacian spectrum of graphs, which allows to dispense with any parametric assumption concerning the modeled network properties. Second, we use the defined statistic to develop the Fiedler random field model, which allows for efficient estimation of edge distributions over large-scale random networks. After analyzing the dependence structure involved in Fiedler random fields, we estimate them over several real-world networks, showing that they achieve a m

### Processing the data

In [10]:
# get stop words for english
# nltk.download('stopwords')
# nltk.download('wordnet')

In [11]:
stop_words = set(stopwords.words('english'))

In [12]:
##Creating a list of custom stopwords
new_words = ["fig","figure","image","sample","using", 
             "show", "result", "large", 
             "also", "one", "two", "three", 
             "four", "five", "seven","eight","nine"]
stop_words = stop_words.union(new_words) # keep a set so membership tests stay fast

In [13]:
#create preprocessor
lmtzr=WordNetLemmatizer() #create the lemmatizer once and reuse it across calls
def pre_process(text):
    text=text.lower() #lowercase
    text=re.sub(r"</?.*?>"," <> ",text) #replace tags with a placeholder that the next step strips
    text=re.sub(r"(\d|\W)+"," ",text) #replace runs of digits and non-word characters with a space (letters and underscores are kept)
    text=text.split() #tokenize on whitespace
    text=[word for word in text if word not in stop_words] #remove stopwords
    text=[word for word in text if len(word)>=3] #remove words shorter than 3 characters
    text=[lmtzr.lemmatize(word) for word in text] #lemmatize
    return ' '.join(text)
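
The cleaning steps in `pre_process` can be traced on a single string. The following is a minimal sketch, assuming the tag pattern is meant to match `<...>` markup. The sample sentence and `toy_stop_words` are hypothetical stand-ins, and lemmatization is left out so the snippet runs without the WordNet data.

```python
import re

# Hypothetical sample text and stop word set, for illustration only
sample = "<b>Figure 3</b> shows 2 Results using k-means!"
toy_stop_words = {"using", "figure"}

text = sample.lower()
text = re.sub(r"</?.*?>", " <> ", text)   # tags become " <> ", which the next step strips
text = re.sub(r"(\d|\W)+", " ", text)     # runs of digits and non-word characters become one space
tokens = [w for w in text.split() if w not in toy_stop_words and len(w) >= 3]
print(tokens)  # ['shows', 'results', 'means']
```

Note that the hyphen in "k-means" splits the word, and the leftover "k" is then dropped by the length filter.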

In [14]:
docs = df['paper_text'].apply(pre_process)

In [15]:
docs

0       self organization associative database applica...
1       mean field theory layer visual cortex applicat...
2       storing covariance associative long term poten...
3       bayesian query construction neural network mod...
4       neural network ensemble cross validation activ...
                              ...                        
7236    single transistor learning synapsis paul hasle...
7237    bias variance combination least square estimat...
7238    real time clustering cmos neural engine serran...
7239    learning direction global motion class psychop...
7240    correlation interpolation network real time ex...
Name: paper_text, Length: 7241, dtype: object

### TF-IDF 
TF-IDF stands for Term Frequency-Inverse Document Frequency. A word's score increases in proportion to the number of times it appears in a document (term frequency, TF) but is offset by how many documents in the corpus contain it (inverse document frequency, IDF), so words that are common across the whole corpus receive low scores.
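
To make the weighting concrete, here is a small sketch that computes TF-IDF by hand on a toy corpus and compares it with scikit-learn. It assumes the smoothed IDF that `TfidfTransformer` uses when `smooth_idf=True`, namely `ln((1 + n_docs) / (1 + doc_freq)) + 1`, followed by L2 normalization of each row. The corpus and the `toy_` names are hypothetical and not part of the papers data set.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "neural network training",
    "neural network pruning",
    "kernel method",
]

toy_cv = CountVectorizer()
counts_sparse = toy_cv.fit_transform(toy_docs)
counts = counts_sparse.toarray()                  # term counts, shape (n_docs, n_terms)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency

tfidf_manual = counts * idf                       # weight each count by its term's IDF
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # L2-normalize rows

tfidf_sklearn = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts_sparse).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))

vocab = toy_cv.vocabulary_                        # maps each term to its column index
print(tfidf_manual[0, vocab["training"]] > tfidf_manual[0, vocab["neural"]])
```

In the first toy document, "training" and "neural" both occur once, but "training" appears in fewer documents and therefore gets the higher score.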

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
#create a vocabulary of words
cv=CountVectorizer(max_df=0.95,         # ignore terms that appear in more than 95% of documents
                   max_features=10000,  # keep only the 10,000 most frequent terms
                   ngram_range=(1,3)    # vocabulary contains single words, bigrams, trigrams
                  )
word_count_matrix=cv.fit_transform(docs)

In [18]:
word_count_matrix # n_docs * vocab_size sparse matrix

<7241x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 5934926 stored elements in Compressed Sparse Row format>
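
The effect of `ngram_range` and `max_df` is easier to see on a tiny corpus. This is an illustrative sketch with two hypothetical documents, not the settings' behavior on the full papers data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, for illustration only
toy_docs = ["independent component analysis", "component analysis signal"]

# Unigrams, bigrams and trigrams all become vocabulary entries
toy_cv = CountVectorizer(ngram_range=(1, 3))
toy_counts = toy_cv.fit_transform(toy_docs)
print(sorted(toy_cv.vocabulary_))
print(toy_counts.shape)  # (2, 9)

# With max_df=0.95, terms found in more than 95% of documents are dropped;
# here that removes every term that occurs in both documents
toy_cv_filtered = CountVectorizer(ngram_range=(1, 3), max_df=0.95)
toy_cv_filtered.fit(toy_docs)
print(sorted(toy_cv_filtered.vocabulary_))
```

The filtered vocabulary no longer contains "component", "analysis" or "component analysis", since each of them appears in both toy documents.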

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_matrix)

TfidfTransformer()

In [20]:
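# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use cv.get_feature_names()
feature_names=cv.get_feature_names_out()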



In [21]:
#get keywords in the idx index document
def get_keywords(idx, docs, n):
    #generate tf_idf for the given doc
    tf_idf_vec = tfidf_transformer.transform(cv.transform([docs[idx]]))
    #sort tf_idf vectors by descending order of scores
    sorted_items=sort_(tf_idf_vec.tocoo())
    #get only top n and return 
    keywords=extract_topn(feature_names, sorted_items, n)
    return keywords

#sort the coordinate sparse matrix
def sort_(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x:(x[1], x[0]), reverse=True)

#extract and return top n results
def extract_topn(feature_names, sorted_items, n):
    result = {feature_names[idx]:round(score, 3) for idx, score in sorted_items[:n]}
    return result
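
The ranking logic above can be checked on a single hand-made row. This sketch assumes a hypothetical four-term vocabulary and made-up scores. It shows that only non-zero entries are ranked and that ties are broken in favor of the higher column index, because the sort key is `(score, column)` in descending order.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical vocabulary and TF-IDF row, for illustration only
toy_features = ["ica", "kurtosis", "matrix", "signal"]
toy_row = csr_matrix(np.array([[0.6, 0.0, 0.2, 0.6]]))

coo = toy_row.tocoo()   # exposes parallel arrays of column indices and stored values
ranked = sorted(zip(coo.col, coo.data), key=lambda x: (x[1], x[0]), reverse=True)
top2 = {toy_features[i]: round(float(s), 3) for i, s in ranked[:2]}
print(top2)  # {'signal': 0.6, 'ica': 0.6}
```

The zero score for "kurtosis" is never stored in the sparse matrix, so it cannot appear in the result even if `n` exceeds the number of non-zero terms.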

## Results

In [22]:
index = 1023
topn=20
keywords=get_keywords(index, docs, topn)
print(keywords)

{'ica': 0.349, 'independent component': 0.343, 'component': 0.301, 'kurtosis': 0.245, 'multiplier': 0.239, 'lagrange multiplier': 0.239, 'signal': 0.227, 'lagrange': 0.226, 'independent': 0.15, 'matrix': 0.145, 'independent component analysis': 0.126, 'constrained': 0.12, 'resulted': 0.112, 'source': 0.108, 'constraint': 0.107, 'variance': 0.106, 'snr': 0.092, 'component analysis': 0.091, 'augmented lagrangian': 0.09, 'ordering': 0.089}
