## Extracting Important Keywords with TF-IDF and Cosine Similarity

TF-IDF is usually applied as a preprocessing step for other tasks such as clustering, topic modeling, and text classification. It can also be used on its own to extract important keywords from a document and get a sense of what characterizes it.
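To make the idea concrete, here is a minimal sketch of TF-IDF scoring in plain Python on a made-up three-document corpus. It assumes the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which is the default documented for scikit-learn's `TfidfVectorizer`. The corpus and the function name `tfidf_scores` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized
docs = [
    ["ontology", "middleware", "device"],
    ["ontology", "policy", "enforcement"],
    ["ontology", "gait", "recognition", "gait"],
]

def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per document (smoothed idf, L2-normalized)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    result = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        result.append({t: v / norm for t, v in raw.items()})
    return result

scores = tfidf_scores(docs)
# Highest-scoring term of the third document
top_term = max(scores[2], key=scores[2].get)
print(top_term)
```

A term that occurs in every document ("ontology") receives the lowest weight, while a term repeated within a single document ("gait") rises to the top of that document's keyword list.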


## Dataset
In this example, we use a shortened version of a DBLP dataset, which is slightly noisy and simulates what you might encounter in real life.

In [2]:
import pandas as pd

# read the CSV file into a dataframe
df_idf=pd.read_csv("/content/paperforsearch2.csv")

# print schema
print("Schema:\n\n",df_idf.dtypes)
print("Number of rows, columns =",df_idf.shape)


Schema:

 id         int64
ptitle    object
author    object
year       int64
dtype: object
Number of rows, columns = (99997, 4)


In [3]:
df_idf

Unnamed: 0,id,ptitle,author,year
0,1,Ontologies in HYDRA - Middleware for Ambient I...,"Peter Kostelnik, Martin Sarnovsky, Jan Hreno,",2009
1,2,Remote Policy Enforcement for Trusted Applicat...,"Fabio Martinelli, Ilaria Matteucci, Andrea Sar...",2013
2,3,A SIMPLE OBSERVATION REGARDING ITERATIONS OF F...,"Jerzy Mycka,",2009
3,4,Gait based human identity recognition from mul...,"Emdad Hossain, Girija Chetty,",2012
4,5,The GAME Algorithm Applied to Complex Fraction...,"Pavel Kordík, Václav Křemen, Lenka Lhotska,",2008
...,...,...,...,...
99992,99993,"A 14-bit, 200 MS/s digital-to-analog converter...","Kuo-Hsing Cheng, Tsung-Shen Chen, Chia Ming Tu,",2004
99993,99994,A Symmetrical Model Applied to Interval-Valued...,"Marco A. O. Domingues, Renata M. C. R. de Souz...",2009
99994,99995,Spontaneous Agent Networking,"Zhenyan Ji, Malmberg Ake,",2005
99995,99996,A design algorithm for ring topology centraliz...,"Naoki Agata, Akira Agata, Kosuke Nishimura,",2013


Note that this DBLP dataset contains four fields: `id`, `ptitle`, `author`, and `year`. We are mostly interested in `ptitle`, `author`, and `year`, which are our source text. We will now create a single field that combines the title and the author. Later, we will also append the year.

In [4]:
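import re

def pre_process(text):

    # lowercase
    text=text.lower()

    # replace tags with a placeholder (it is removed again by the next step)
    text=re.sub(r"</?.*?>"," <> ",text)

    # replace digits and non-word characters with a single space
    text=re.sub(r"(\d|\W)+"," ",text)

    return text

# NOTE: the two columns are joined without a separator, so a title's last word can fuse
# with the first author name (see 'networkingzhenyan' in the output below);
# df_idf['ptitle'] + ' ' + df_idf['author'] would keep them apart
df_idf['text'] = df_idf['ptitle'] + df_idf['author']
df_idf['text'] = df_idf['text'].apply(pre_process)

# show the combined 'text' column
df_idf['text']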

0        ontologies in hydra middleware for ambient int...
1        remote policy enforcement for trusted applicat...
2        a simple observation regarding iterations of f...
3        gait based human identity recognition from mul...
4        the game algorithm applied to complex fraction...
                               ...                        
99992    a bit ms s digital to analog converter without...
99993    a symmetrical model applied to interval valued...
99994    spontaneous agent networkingzhenyan ji malmber...
99995    a design algorithm for ring topology centraliz...
99996    the govstat statistical interactive glossary s...
Name: text, Length: 99997, dtype: object
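Row 99994 shows the last title word fused with the first author name ("networkingzhenyan"), which is what happens when two strings are concatenated without a separator. Below is a minimal sketch of the same cleaning step with an explicit space, using only `re`. The helper name `combine_fields` is hypothetical and not part of the notebook.

```python
import re

def pre_process(text):
    # Lowercase, replace tags with a placeholder, then turn digits and non-word characters into spaces
    text = text.lower()
    text = re.sub(r"</?.*?>", " <> ", text)
    text = re.sub(r"(\d|\W)+", " ", text)
    return text

def combine_fields(title, author):
    # Hypothetical helper: join with a space so the two fields stay separate tokens
    return pre_process(title + " " + author)

title = "Spontaneous Agent Networking"
author = "Zhenyan Ji, Malmberg Ake,"

without_sep = pre_process(title + author)
with_sep = combine_fields(title, author)
print(without_sep)
print(with_sep)
```

In pandas, the equivalent change would be `df_idf['ptitle'] + ' ' + df_idf['author']`.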

## Data Cleaning and Pre-processing

In [5]:
# Lowercase the text and store it in a new column. pre_process already lowercased it, so this is only a safeguard; it matters because Python treats 'dog' and 'DOG' as different strings
df_idf['ptitle_author']=[entry.lower() for entry in df_idf['text']]

In [6]:
df_idf['ptitle_author'][0]

'ontologies in hydra middleware for ambient intelligent devices peter kostelnik martin sarnovsky jan hreno '

In [7]:
## Data cleaning for the combined title and author text
## (pre_process already replaced digits and punctuation with spaces, so most of these steps act as safeguards)
df_idf.ptitle_author =df_idf.ptitle_author.replace(to_replace='from:(.*\n)',value='',regex=True) # remove "from:" lines
df_idf.ptitle_author =df_idf.ptitle_author.replace(to_replace='Re:',value='',regex=True) # remove "Re:" prefixes
df_idf.ptitle_author =df_idf.ptitle_author.replace(to_replace='lines:(.*\n)',value='',regex=True) # remove "lines:" lines
df_idf.ptitle_author =df_idf.ptitle_author.replace(to_replace='[!"#$%&\'()*+,/:;<=>?@[\\]^_`{|}~]',value=' ',regex=True) # replace punctuation (except '.' and '-') with a space
df_idf.ptitle_author =df_idf.ptitle_author.replace(to_replace='-',value=' ',regex=True) # replace hyphens with a space
df_idf.ptitle_author =df_idf.ptitle_author.replace(to_replace=r'\s+',value=' ',regex=True)    # collapse runs of whitespace, including newlines, into one space
df_idf.ptitle_author =df_idf.ptitle_author.replace(to_replace='  ',value=' ',regex=True)                # collapse any remaining double spaces
df_idf.ptitle_author =df_idf.ptitle_author.apply(lambda x:x.strip())  # trim leading and trailing whitespace

In [8]:
df_idf

Unnamed: 0,id,ptitle,author,year,text,ptitle_author
0,1,Ontologies in HYDRA - Middleware for Ambient I...,"Peter Kostelnik, Martin Sarnovsky, Jan Hreno,",2009,ontologies in hydra middleware for ambient int...,ontologies in hydra middleware for ambient int...
1,2,Remote Policy Enforcement for Trusted Applicat...,"Fabio Martinelli, Ilaria Matteucci, Andrea Sar...",2013,remote policy enforcement for trusted applicat...,remote policy enforcement for trusted applicat...
2,3,A SIMPLE OBSERVATION REGARDING ITERATIONS OF F...,"Jerzy Mycka,",2009,a simple observation regarding iterations of f...,a simple observation regarding iterations of f...
3,4,Gait based human identity recognition from mul...,"Emdad Hossain, Girija Chetty,",2012,gait based human identity recognition from mul...,gait based human identity recognition from mul...
4,5,The GAME Algorithm Applied to Complex Fraction...,"Pavel Kordík, Václav Křemen, Lenka Lhotska,",2008,the game algorithm applied to complex fraction...,the game algorithm applied to complex fraction...
...,...,...,...,...,...,...
99992,99993,"A 14-bit, 200 MS/s digital-to-analog converter...","Kuo-Hsing Cheng, Tsung-Shen Chen, Chia Ming Tu,",2004,a bit ms s digital to analog converter without...,a bit ms s digital to analog converter without...
99993,99994,A Symmetrical Model Applied to Interval-Valued...,"Marco A. O. Domingues, Renata M. C. R. de Souz...",2009,a symmetrical model applied to interval valued...,a symmetrical model applied to interval valued...
99994,99995,Spontaneous Agent Networking,"Zhenyan Ji, Malmberg Ake,",2005,spontaneous agent networkingzhenyan ji malmber...,spontaneous agent networkingzhenyan ji malmber...
99995,99996,A design algorithm for ring topology centraliz...,"Naoki Agata, Akira Agata, Kosuke Nishimura,",2013,a design algorithm for ring topology centraliz...,a design algorithm for ring topology centraliz...


### Checking and dropping empty data rows

In [9]:
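## Check for empty rows and drop them
empty_mask = df_idf['ptitle_author'].str.strip().str.len() == 0
# Print the index of every empty row (nothing is printed if there are none)
for i in df_idf.index[empty_mask]:
    print(i)
# Drop by boolean mask instead of dropping rows while iterating over them
df_idf = df_idf[~empty_mask].reset_index(drop=True)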

In [10]:
df_idf

Unnamed: 0,id,ptitle,author,year,text,ptitle_author
0,1,Ontologies in HYDRA - Middleware for Ambient I...,"Peter Kostelnik, Martin Sarnovsky, Jan Hreno,",2009,ontologies in hydra middleware for ambient int...,ontologies in hydra middleware for ambient int...
1,2,Remote Policy Enforcement for Trusted Applicat...,"Fabio Martinelli, Ilaria Matteucci, Andrea Sar...",2013,remote policy enforcement for trusted applicat...,remote policy enforcement for trusted applicat...
2,3,A SIMPLE OBSERVATION REGARDING ITERATIONS OF F...,"Jerzy Mycka,",2009,a simple observation regarding iterations of f...,a simple observation regarding iterations of f...
3,4,Gait based human identity recognition from mul...,"Emdad Hossain, Girija Chetty,",2012,gait based human identity recognition from mul...,gait based human identity recognition from mul...
4,5,The GAME Algorithm Applied to Complex Fraction...,"Pavel Kordík, Václav Křemen, Lenka Lhotska,",2008,the game algorithm applied to complex fraction...,the game algorithm applied to complex fraction...
...,...,...,...,...,...,...
99992,99993,"A 14-bit, 200 MS/s digital-to-analog converter...","Kuo-Hsing Cheng, Tsung-Shen Chen, Chia Ming Tu,",2004,a bit ms s digital to analog converter without...,a bit ms s digital to analog converter without...
99993,99994,A Symmetrical Model Applied to Interval-Valued...,"Marco A. O. Domingues, Renata M. C. R. de Souz...",2009,a symmetrical model applied to interval valued...,a symmetrical model applied to interval valued...
99994,99995,Spontaneous Agent Networking,"Zhenyan Ji, Malmberg Ake,",2005,spontaneous agent networkingzhenyan ji malmber...,spontaneous agent networkingzhenyan ji malmber...
99995,99996,A design algorithm for ring topology centraliz...,"Naoki Agata, Akira Agata, Kosuke Nishimura,",2013,a design algorithm for ring topology centraliz...,a design algorithm for ring topology centraliz...


In [11]:
# Drop the columns that are no longer needed
df_idf = df_idf.drop(columns=['ptitle', 'author', 'id', 'text'])
df_idf

Unnamed: 0,year,ptitle_author
0,2009,ontologies in hydra middleware for ambient int...
1,2013,remote policy enforcement for trusted applicat...
2,2009,a simple observation regarding iterations of f...
3,2012,gait based human identity recognition from mul...
4,2008,the game algorithm applied to complex fraction...
...,...,...
99992,2004,a bit ms s digital to analog converter without...
99993,2009,a symmetrical model applied to interval valued...
99994,2005,spontaneous agent networkingzhenyan ji malmber...
99995,2013,a design algorithm for ring topology centraliz...


### Word Tokenization

In [12]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
# Tokenization: each entry in ptitle_author is split into a list of words
df_idf['ptitle_author']= [word_tokenize(entry) for entry in df_idf.ptitle_author]
df_idf

Unnamed: 0,year,ptitle_author
0,2009,"[ontologies, in, hydra, middleware, for, ambie..."
1,2013,"[remote, policy, enforcement, for, trusted, ap..."
2,2009,"[a, simple, observation, regarding, iterations..."
3,2012,"[gait, based, human, identity, recognition, fr..."
4,2008,"[the, game, algorithm, applied, to, complex, f..."
...,...,...
99992,2004,"[a, bit, ms, s, digital, to, analog, converter..."
99993,2009,"[a, symmetrical, model, applied, to, interval,..."
99994,2005,"[spontaneous, agent, networkingzhenyan, ji, ma..."
99995,2013,"[a, design, algorithm, for, ring, topology, ce..."


In [14]:
df_idf['ptitle_author']

0        [ontologies, in, hydra, middleware, for, ambie...
1        [remote, policy, enforcement, for, trusted, ap...
2        [a, simple, observation, regarding, iterations...
3        [gait, based, human, identity, recognition, fr...
4        [the, game, algorithm, applied, to, complex, f...
                               ...                        
99992    [a, bit, ms, s, digital, to, analog, converter...
99993    [a, symmetrical, model, applied, to, interval,...
99994    [spontaneous, agent, networkingzhenyan, ji, ma...
99995    [a, design, algorithm, for, ring, topology, ce...
99996    [the, govstat, statistical, interactive, gloss...
Name: ptitle_author, Length: 99997, dtype: object

In [15]:
df_idf.dtypes

year              int64
ptitle_author    object
dtype: object

### Word Lemmatization

In [16]:
import pandas as pd
import numpy as np
import os
import re
import operator
import pickle
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
# WordNetLemmatizer needs a POS tag to know whether a word is a noun, verb, adjective, etc. The default used here is noun.
def wordLemmatizer(data):
    tag_map = defaultdict(lambda : wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV
    # Build the stop-word set and the lemmatizer once instead of once per word or entry
    stop_words = set(stopwords.words('english'))
    word_Lemmatized = WordNetLemmatizer()
    keywords = []
    for entry in data:
        # Words of this entry that pass the filters below
        Final_words = []
        # pos_tag labels each token with its part of speech, e.g. noun (N) or verb (V)
        for word, tag in pos_tag(entry):
            # Keep alphabetic words longer than one character that are not stop words
            if len(word)>1 and word not in stop_words and word.isalpha():
                word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
                Final_words.append(word_Final)
        # Store the processed words of this entry as the string form of a list
        keywords.append(str(Final_words))
    return pd.DataFrame({'Keyword_final': keywords})
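The POS mapping inside `wordLemmatizer` can be tried in isolation. The sketch below avoids any NLTK downloads by assuming the WordNet constants are the single-letter strings `'n'`, `'a'`, `'v'`, and `'r'` (the values of `wn.NOUN`, `wn.ADJ`, `wn.VERB`, and `wn.ADV`). The sample tags are Penn Treebank tags of the kind `pos_tag` returns.

```python
from collections import defaultdict

# Assumed values of the WordNet POS constants (wn.NOUN, wn.ADJ, wn.VERB, wn.ADV in NLTK)
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

# Any tag whose first letter is not J, V or R falls back to noun
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# Sample Penn Treebank tags: adjective, past-tense verb, adverb, plural noun, preposition
sample_tags = ["JJ", "VBD", "RB", "NNS", "IN"]
mapped = [tag_map[tag[0]] for tag in sample_tags]
print(mapped)  # ['a', 'v', 'r', 'n', 'n']
```

Only the first letter of each tag is used, so every verb form (`VB`, `VBD`, `VBG`, ...) is lemmatized as a verb and anything unrecognized is treated as a noun.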

In [18]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [20]:
df_clean_ptitle = wordLemmatizer(df_idf['ptitle_author'][0:99997])

In [21]:
df_clean_ptitle

Unnamed: 0,Keyword_final
0,"['ontology', 'hydra', 'middleware', 'ambient',..."
1,"['remote', 'policy', 'enforcement', 'trusted',..."
2,"['simple', 'observation', 'regard', 'iteration..."
3,"['gait', 'base', 'human', 'identity', 'recogni..."
4,"['game', 'algorithm', 'apply', 'complex', 'fra..."
...,...
99992,"['bit', 'ms', 'digital', 'analog', 'converter'..."
99993,"['symmetrical', 'model', 'apply', 'interval', ..."
99994,"['spontaneous', 'agent', 'networkingzhenyan', ..."
99995,"['design', 'algorithm', 'ring', 'topology', 'c..."


In [22]:
df_clean_ptitle['Keyword_final'][0]

"['ontology', 'hydra', 'middleware', 'ambient', 'intelligent', 'device', 'peter', 'kostelnik', 'martin', 'sarnovsky', 'jan', 'hreno']"

In [23]:
df_clean_ptitle.dtypes

Keyword_final    object
dtype: object

In [24]:
df_idf_year=df_idf['year']
df_idf_year

0        2009
1        2013
2        2009
3        2012
4        2008
         ... 
99992    2004
99993    2009
99994    2005
99995    2013
99996    2003
Name: year, Length: 99997, dtype: int64

In [25]:
# Convert the year values to a NumPy array of strings (df_idf itself is unchanged, so the dtypes printed below still show year as int64)
df_idf_year=df_idf_year.values.astype(str)
print("After converting a specific column to string:\n", df_idf.dtypes)

After converting a specific column to string:
 year              int64
ptitle_author    object
dtype: object


In [26]:
df_idf_year

array(['2009', '2013', '2009', ..., '2005', '2013', '2003'], dtype='<U21')

In [27]:
df_clean_ptitle.insert(loc=1, column='year', value=df_idf_year.tolist())
df_clean_ptitle

Unnamed: 0,Keyword_final,year
0,"['ontology', 'hydra', 'middleware', 'ambient',...",2009
1,"['remote', 'policy', 'enforcement', 'trusted',...",2013
2,"['simple', 'observation', 'regard', 'iteration...",2009
3,"['gait', 'base', 'human', 'identity', 'recogni...",2012
4,"['game', 'algorithm', 'apply', 'complex', 'fra...",2008
...,...,...
99992,"['bit', 'ms', 'digital', 'analog', 'converter'...",2004
99993,"['symmetrical', 'model', 'apply', 'interval', ...",2009
99994,"['spontaneous', 'agent', 'networkingzhenyan', ...",2005
99995,"['design', 'algorithm', 'ring', 'topology', 'c...",2013


In [28]:
# Strip the list formatting ("['", "'", spaces and "]") so each row becomes a comma-separated string
df_clean_ptitle=df_clean_ptitle.replace(to_replace =r"\[.", value = '', regex = True)
df_clean_ptitle=df_clean_ptitle.replace(to_replace ="'", value = '', regex = True)
df_clean_ptitle=df_clean_ptitle.replace(to_replace =" ", value = '', regex = True)
df_clean_ptitle=df_clean_ptitle.replace(to_replace =r'\]', value = '', regex = True)
df_clean_ptitle

Unnamed: 0,Keyword_final,year
0,"ontology,hydra,middleware,ambient,intelligent,...",2009
1,"remote,policy,enforcement,trusted,application,...",2013
2,"simple,observation,regard,iteration,finite,val...",2009
3,"gait,base,human,identity,recognition,multi,vie...",2012
4,"game,algorithm,apply,complex,fractionate,atria...",2008
...,...,...
99992,"bit,ms,digital,analog,converter,without,trimmi...",2004
99993,"symmetrical,model,apply,interval,value,data,co...",2009
99994,"spontaneous,agent,networkingzhenyan,ji,malmber...",2005
99995,"design,algorithm,ring,topology,centralize,radi...",2013
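The regex replacements above undo the `str(list)` formatting produced by the lemmatizer. An alternative sketch, assuming the word lists are still available as Python lists, joins them directly. `to_keyword_string` is a hypothetical helper name, and the sample row is made up.

```python
import re

def to_keyword_string(words, year=None):
    # Join the lemmatized words with commas; optionally append the year as a final token
    parts = list(words)
    if year is not None:
        parts.append(str(year))
    return ",".join(parts)

row = ["ontology", "hydra", "middleware"]

# The regex route: build the list's string form, then strip brackets, quotes and spaces
via_regex = re.sub(r"[\[\]' ]", "", str(row))

print(to_keyword_string(row))
print(to_keyword_string(row, 2009))
print(via_regex == to_keyword_string(row))
```

Both routes give the same string for this row; joining directly avoids the round trip through the list's text representation.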


### Combine the lemmatized keywords and the year into one field

In [29]:
df_clean_ptitle['ptitle_author_year'] = df_clean_ptitle['Keyword_final'] + "," + df_clean_ptitle['year']

# show the dataframe with the combined column
df_clean_ptitle

Unnamed: 0,Keyword_final,year,ptitle_author_year
0,"ontology,hydra,middleware,ambient,intelligent,...",2009,"ontology,hydra,middleware,ambient,intelligent,..."
1,"remote,policy,enforcement,trusted,application,...",2013,"remote,policy,enforcement,trusted,application,..."
2,"simple,observation,regard,iteration,finite,val...",2009,"simple,observation,regard,iteration,finite,val..."
3,"gait,base,human,identity,recognition,multi,vie...",2012,"gait,base,human,identity,recognition,multi,vie..."
4,"game,algorithm,apply,complex,fractionate,atria...",2008,"game,algorithm,apply,complex,fractionate,atria..."
...,...,...,...
99992,"bit,ms,digital,analog,converter,without,trimmi...",2004,"bit,ms,digital,analog,converter,without,trimmi..."
99993,"symmetrical,model,apply,interval,value,data,co...",2009,"symmetrical,model,apply,interval,value,data,co..."
99994,"spontaneous,agent,networkingzhenyan,ji,malmber...",2005,"spontaneous,agent,networkingzhenyan,ji,malmber..."
99995,"design,algorithm,ring,topology,centralize,radi...",2013,"design,algorithm,ring,topology,centralize,radi..."


In [30]:
df_clean_ptitle.dtypes

Keyword_final         object
year                  object
ptitle_author_year    object
dtype: object

In [31]:
df_clean_ptitle['ptitle_author_year'][0]

'ontology,hydra,middleware,ambient,intelligent,device,peter,kostelnik,martin,sarnovsky,jan,hreno,2009'

In [32]:
df_clean_ptitle = df_clean_ptitle.drop('year', axis=1)
df_clean_ptitle = df_clean_ptitle.drop('Keyword_final', axis=1)
df_clean_ptitle

Unnamed: 0,ptitle_author_year
0,"ontology,hydra,middleware,ambient,intelligent,..."
1,"remote,policy,enforcement,trusted,application,..."
2,"simple,observation,regard,iteration,finite,val..."
3,"gait,base,human,identity,recognition,multi,vie..."
4,"game,algorithm,apply,complex,fractionate,atria..."
...,...
99992,"bit,ms,digital,analog,converter,without,trimmi..."
99993,"symmetrical,model,apply,interval,value,data,co..."
99994,"spontaneous,agent,networkingzhenyan,ji,malmber..."
99995,"design,algorithm,ring,topology,centralize,radi..."


In [82]:
# import pandas as pd

# # Read the CSV file into a DataFrame
# df = pd.read_csv('/content/FinalOneRowLeammaKeywords.csv')

In [33]:
df_clean_ptitle['ptitle_author_year'][0]

'ontology,hydra,middleware,ambient,intelligent,device,peter,kostelnik,martin,sarnovsky,jan,hreno,2009'

In [39]:
df_clean_ptitle.to_csv("/content/sample_data/PreprocessedDataset.csv")

In [32]:
# import pandas as pd
# import numpy as np
# import os
# import pickle
# import re
# import operator
# import pickle
# import nltk
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords')

# from nltk import pos_tag
# from nltk.corpus import stopwords
# from nltk.stem import WordNetLemmatizer
# from collections import defaultdict
# from nltk.corpus import wordnet as wn
# from nltk.tokenize import word_tokenize
# from sklearn.feature_extraction.text import TfidfVectorizer

# # read the preprocessed CSV back into a dataframe
# df_clean_ptitle=pd.read_csv("/content/PreprocessedDataset.csv")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Create a Document Search Engine with TF-IDF

### TF-IDF using TfidfVectorizer from sklearn.feature_extraction.text

### CountVectorizer to create a vocabulary and generate word counts
The next step is to start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in our dataset and generate counts for each row.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
import re

# create a vocabulary of words,
# ignoring words that appear in more than 85% of documents
# (stop words were already removed during lemmatization)
cv=CountVectorizer(max_df=0.85)
word_count_vector=cv.fit_transform(df_clean_ptitle['ptitle_author_year'])

In [12]:
word_count_vector

<99997x205563 sparse matrix of type '<class 'numpy.int64'>'
	with 1303621 stored elements in Compressed Sparse Row format>

Let's check the shape of the resulting matrix. The shape is (99997, 205563) because the dataset has 99997 documents (the rows) and a vocabulary of 205563 unique terms (the columns); stop words were already removed during lemmatization.
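What `max_df=0.85` does can be sketched in plain Python: count in how many documents each term appears, then drop the terms whose document frequency exceeds the given fraction. This is a simplified illustration on made-up data, not scikit-learn's implementation, and `build_vocabulary` is a hypothetical name.

```python
from collections import Counter

def build_vocabulary(docs, max_df=0.85):
    """Keep the terms that appear in at most max_df (a fraction) of the documents."""
    n_docs = len(docs)
    # Count each term once per document, however often it occurs inside it
    doc_freq = Counter(term for doc in docs for term in set(doc.split(",")))
    return sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)

docs = [
    "ontology,hydra,2009",
    "ontology,policy,2013",
    "ontology,gait,2012",
    "ontology,game,2008",
]
vocab = build_vocabulary(docs)
print(len(docs), len(vocab))  # rows and columns a count matrix would have
print(vocab)
```

Here "ontology" appears in all four documents, so it is dropped, and the remaining terms form the columns of the count matrix.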

## Computing TF-IDF and Extracting Keywords

We will now extract the top keywords from the dataset.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
import operator

## Create the vocabulary from the comma-separated keyword strings
vocabulary = set()

for doc in df_clean_ptitle.ptitle_author_year:
    vocabulary.update(doc.split(','))

vocabulary = list(vocabulary)

# Initialize the TF-IDF model with the fixed vocabulary
tfidf = TfidfVectorizer(vocabulary=vocabulary,dtype=np.float32)

# Fit the model and transform the documents in one step
tfidf_matrix = tfidf.fit_transform(df_clean_ptitle['ptitle_author_year'])

# The search functions below use this name for the same document-term matrix
tfidf_tran_ptitle = tfidf_matrix



In [14]:
# Find the feature names (words) from the TF-IDF vectorizer
feature_names = tfidf.get_feature_names_out()

# Sum each feature's TF-IDF scores over all documents to get one corpus-level score per feature
tfidf_scores = tfidf_matrix.sum(axis=0).A1

# Create a DataFrame to store the feature names and their TF-IDF scores
keyword_df = pd.DataFrame({'Keyword': feature_names, 'TF-IDF Score': tfidf_scores})

# Sort the keywords by TF-IDF score in descending order
keyword_df = keyword_df.sort_values(by='TF-IDF Score', ascending=False)

# You can set a threshold to select the top keywords
threshold = 0.2  # Adjust this threshold as needed
top_keywords = keyword_df[keyword_df['TF-IDF Score'] >= threshold]

# # Print or save the top keywords
print(top_keywords)

                  Keyword  TF-IDF Score
116258               base   1255.389771
70583              system   1007.648132
81019                 use    971.849182
114230               2015    901.400269
163507               2011    850.370789
...                   ...           ...
138539    evolutionharald      0.200657
77836           offkhalil      0.200369
28929            narinder      0.200369
26491   persistencekatrin      0.200254
165048         hatzikirou      0.200254

[204708 rows x 2 columns]
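With the threshold at 0.2, 204708 keywords still pass the filter, which suggests the cut-off is quite permissive. If a fixed number of keywords is wanted instead, selecting the top N by score is one option. The sketch below uses NumPy on a made-up score matrix; the feature names, the values, and `n_top` are illustrative assumptions.

```python
import numpy as np

# Made-up TF-IDF matrix: 3 documents (rows) x 4 terms (columns)
toy_features = np.array(["base", "system", "use", "hydra"])
toy_matrix = np.array([
    [0.8, 0.0, 0.6, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
])

# Corpus-level score of each term: sum of its TF-IDF weights over all documents
toy_scores = toy_matrix.sum(axis=0)

n_top = 2
# argsort is ascending, so take the last n_top indices and reverse them
top_idx = np.argsort(toy_scores)[-n_top:][::-1]
toy_top_keywords = [(str(toy_features[i]), float(toy_scores[i])) for i in top_idx]
print(toy_top_keywords)
```

On a pandas DataFrame that is already sorted in descending order, such as `keyword_df` above, `keyword_df.head(n_top)` would give the same kind of selection.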


### Save Vocabulary


In [114]:
top_keywords.to_csv("/content/sample_data/Final_Keyword_for_Paper_Search_TFIDF5.0.csv")

In [101]:
Keyword_tfidf= top_keywords['Keyword']
Keyword_tfidf = list(Keyword_tfidf)

In [115]:
### Save keywords
with open("/content/sample_data/Final_Keyword_for_Paper_Search.0.txt", "w") as file:
    file.write(str(Keyword_tfidf))
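Writing `str(Keyword_tfidf)` stores the Python list representation, brackets and quotes included. If a plain text file that other tools can read is preferred, one keyword per line is a common alternative. Here is a minimal sketch with a made-up keyword list and a temporary file path:

```python
import os
import tempfile

keywords = ["base", "system", "use"]  # stand-in for Keyword_tfidf

# Temporary location used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "keywords.txt")

# Write one keyword per line
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(keywords))

# Reading the list back is a simple split on line breaks
with open(path, encoding="utf-8") as f:
    loaded = f.read().splitlines()

print(loaded)
```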

### Create vector for Query/search keywords

In [15]:
## Create vector for Query/search keywords
def gen_vector_T(tokens):

    Q = np.zeros((len(vocabulary)))

    x= tfidf.transform(tokens)
    for token in tokens[0].split(','):
        try:
            ind = vocabulary.index(token)
            Q[ind]  = x[0, tfidf.vocabulary_[token]]
        except:
            pass
    return Q

### Calculate Cosine Similarity with formula

In [16]:
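def cosine_sim(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||); return 0.0 for an all-zero vector to avoid dividing by zero
    denom = np.linalg.norm(a)*np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return np.dot(a, b)/denom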
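As a quick check of the formula, the dot product divided by the product of the two vector lengths, here is a plain-Python version applied to three small vector pairs. Returning 0.0 when either vector has zero length is an added assumption for that edge case, not part of the definition, and the name `cosine_sim_py` is hypothetical.

```python
import math

def cosine_sim_py(a, b):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

same = cosine_sim_py([1, 0], [1, 0])        # same direction
orthogonal = cosine_sim_py([1, 0], [0, 1])  # no shared terms
diagonal = cosine_sim_py([1, 0], [1, 1])    # 45 degrees apart
print(same, orthogonal, diagonal)
```

For TF-IDF vectors, which have no negative entries, the score always lies between 0 (no shared terms) and 1 (same direction).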

### Calculate Cosine similarity of trained Tfidf to input query

In [17]:
def cosine_similarity_T(k, query):
    # Lowercase the query (the documents were lowercased) and replace non-word characters with spaces
    preprocessed_query = re.sub(r"\W+", " ", query.lower()).strip()
    tokens = word_tokenize(str(preprocessed_query))
    # Lemmatize the query with the same function used for the documents
    q_df = pd.DataFrame({'q_clean': [tokens]})
    q_df['q_clean'] = wordLemmatizer(q_df.q_clean)['Keyword_final']
    # Strip the list formatting so the query becomes a comma-separated string
    q_df=q_df.replace(to_replace =r"\[.", value = '', regex = True)
    q_df=q_df.replace(to_replace ="'", value = '', regex = True)
    q_df=q_df.replace(to_replace =" ", value = '', regex = True)
    q_df=q_df.replace(to_replace =r'\]', value = '', regex = True)

    query_vector = gen_vector_T(q_df['q_clean'])

    # Compare the query with every document; toarray() builds the dense matrix,
    # which needs a lot of memory for a corpus of this size
    d_cosines = np.array([cosine_sim(query_vector, d) for d in tfidf_tran_ptitle.toarray()])

    # Indices of the k highest-scoring documents, best first
    out = d_cosines.argsort()[-k:][::-1]

    a = pd.DataFrame()
    for i,index in enumerate(out):
        a.loc[i,'index'] = str(index)
        a.loc[i,'ptitle_author_year'] = df_clean_ptitle['ptitle_author_year'][index]
        a.loc[i,'Score'] = d_cosines[index]
    return a

In [34]:
%time cosine_similarity_T(5,"intelligent")

CPU times: user 58.2 s, sys: 59.7 s, total: 1min 57s
Wall time: 1min 10s


Unnamed: 0,index,ptitle_author_year,Score
0,61930,"machine,intelligence,hci,revisit,intelligent,a...",0.464695
1,55096,"intelligent,product,intelligent,being,agent,pa...",0.431101
2,56951,"structure,model,intelligent,plan,agentlei,wang...",0.34708
3,44131,"intelligent,sensor,analysis,design,eric,dekneu...",0.338577
4,86800,"intelligent,information,processing,iii,ifip,tc...",0.335324
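The run above takes over a minute, and a likely reason is that every document row is converted to a dense vector and compared in a Python loop. Because `TfidfVectorizer` L2-normalizes each row by default, the cosine similarity between a query and all documents reduces to a single sparse matrix product. The following is a sketch on a made-up corpus, assuming default vectorizer settings. `search` and the toy documents are hypothetical, and this is not the notebook's implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus in the same comma-separated format as ptitle_author_year
docs = [
    "intelligent,sensor,analysis,design,2009",
    "remote,policy,enforcement,2013",
    "intelligent,agent,intelligent,product,2012",
    "gait,recognition,human,identity,2012",
]

vectorizer = TfidfVectorizer()               # default norm='l2'
doc_matrix = vectorizer.fit_transform(docs)  # sparse, one unit-length row per document

def search(query, k=2):
    """Return (document index, cosine score) pairs for the k best matches."""
    q_vec = vectorizer.transform([query])    # sparse 1 x n_terms, also unit length
    # Both sides are unit length, so the dot product equals the cosine similarity
    scores = (doc_matrix @ q_vec.T).toarray().ravel()
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

results = search("intelligent")
print(results)
```

The document matrix stays sparse throughout, so memory use grows with the number of stored non-zero entries rather than with documents times vocabulary size.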




*   Document Search by Md Mahbubur Rahman
*   Student Id.: 3820231103
*   Beijing Institute of Technology



