This is the repository for my course project from the Spring 2015 Information Retrieval course at ASU. It was implemented progressively:
update 0: TF-IDF based retrieval
update 1: added options for Authorities and Hubs
update 2: added options for PageRank
To run the code, first set up Jython and install PyLucene.
Then run the code as
jython SearchEngine.py <lexicon_filename>
Running CreateLexicon.py as
jython CreateLexicon.py <lexicon_filename>
will just create an index and save it in a .pkl.gz file. It can also save precomputed document norms.
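As a rough illustration of what saving a lexicon to a .pkl.gz file can look like, here is a minimal sketch using only the standard gzip and pickle modules. The function names save_lexicon and load_lexicon, the payload keys, and the term -> {doc_id: term frequency} layout are assumptions made for this example, not necessarily what CreateLexicon.py does.

```python
import gzip
import pickle


def save_lexicon(lexicon, doc_norms, filename):
    """Write the lexicon and precomputed document norms to a gzipped pickle.

    Hypothetical helper: the payload layout is an assumption for illustration.
    """
    payload = {"lexicon": lexicon, "doc_norms": doc_norms}
    with gzip.open(filename, "wb") as handle:
        # Protocol 2 is readable from both Python 2 (Jython) and Python 3.
        pickle.dump(payload, handle, 2)


def load_lexicon(filename):
    """Read back the (lexicon, doc_norms) pair written by save_lexicon."""
    with gzip.open(filename, "rb") as handle:
        payload = pickle.load(handle)
    return payload["lexicon"], payload["doc_norms"]
```

Storing the norms next to the lexicon means they only have to be computed once, when the index is first scanned, instead of on every search.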
SearchFiles.py is a Python rewrite of SearchFiles.java.
LinkAnalysis.py is a Python rewrite of LinkAnalysis.java.
These can be used to understand how
Setting up SearchFiles.py
verbose = False will run the code silently.
create_lexicon_flag = True. If True, will rebuild the lexicon from scratch; if False, will load a pre-created one as supplied in
normalize = False. If True, will use document norms and normalized TF-IDF.
n_retrieves = 10 sets the number of documents to retrieve.
tf_idf_flag = True. True retrieves based on TF-IDF; False retrieves based on TF only.
ah_flag = True will run the code using Authorities and Hubs weighting.
pr_flag = True will run the code using PageRank weighting.
directory = '../index' sets the location of the index.
- Setting both ah_flag and pr_flag to False will simply run plain TF-IDF.
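To make the interaction of tf_idf_flag, normalize and n_retrieves more concrete, here is a minimal sketch of TF and TF-IDF ranking over an in-memory postings dictionary. The function score_documents, the postings layout (term -> {doc_id: term frequency}) and the idf formula log(N / df) are assumptions chosen for illustration; the weighting in SearchFiles.py may differ.

```python
import math


def score_documents(query_terms, postings, n_docs, tf_idf_flag=True,
                    normalize=False, n_retrieves=10):
    """Rank documents for a query and return the top (doc_id, score) pairs."""

    def weight(term, tf):
        # TF only, or TF times an assumed idf of log(N / df).
        if not tf_idf_flag:
            return float(tf)
        df = len(postings[term])
        return tf * math.log(float(n_docs) / df)

    # Document norm: Euclidean length of each document's weight vector.
    norms = {}
    if normalize:
        for term, docs in postings.items():
            for doc_id, tf in docs.items():
                norms[doc_id] = norms.get(doc_id, 0.0) + weight(term, tf) ** 2
        for doc_id in norms:
            norms[doc_id] = math.sqrt(norms[doc_id])

    # Accumulate the weight of every query term found in each document.
    scores = {}
    for term in query_terms:
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight(term, tf)

    if normalize:
        for doc_id in scores:
            if norms[doc_id] > 0:
                scores[doc_id] /= norms[doc_id]

    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n_retrieves]
```

For example, score_documents(["lucene"], postings, n_docs=3, tf_idf_flag=False) ranks purely by how often "lucene" occurs in each document, while normalize=True divides each score by the document's norm so that long documents are not favored.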
citationsFile refers to the text files containing the links and citations needed by the HITS and PageRank algorithms.
root_set_size is a variable that determines the size of the root set for the HITS algorithm.
maxIter determines how many power-iteration steps are run to compute the eigenvectors.
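The role of maxIter can be illustrated with a small power-iteration sketch for HITS. This is not the code in LinkAnalysis.py: the function hits, the graph layout ({page: [pages it links to]}) and the tol early-stopping threshold are assumptions for this example, and the graph passed in is assumed to be the already expanded base set.

```python
import math


def hits(links, max_iter=50, tol=1e-8):
    """Return (authorities, hubs) score dictionaries for a small link graph."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    auth = dict((n, 1.0) for n in nodes)
    hub = dict((n, 1.0) for n in nodes)

    def normalized(scores):
        # Scale to unit Euclidean length so scores stay bounded.
        length = math.sqrt(sum(v * v for v in scores.values()))
        if length == 0:
            return scores
        return dict((n, v / length) for n, v in scores.items())

    for _ in range(max_iter):
        # Authority update: sum the hub scores of the pages linking in.
        new_auth = dict((n, 0.0) for n in nodes)
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        new_auth = normalized(new_auth)

        # Hub update: sum the new authority scores of the pages linked to.
        new_hub = dict((n, 0.0) for n in nodes)
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]
        new_hub = normalized(new_hub)

        change = sum(abs(new_auth[n] - auth[n]) for n in nodes)
        auth, hub = new_auth, new_hub
        if change < tol:
            break  # converged before reaching max_iter

    return auth, hub
```

With links = {"a": ["c"], "b": ["c"], "c": []}, page "c" ends up as the only authority and "a" and "b" as equally strong hubs. PageRank follows the same iterate-and-normalize pattern on a different matrix, so a cap like maxIter bounds the running time of both.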