🐘 Inverted index for a book repository implemented using Hadoop Map-Reduce.
- The inverted index contains, for each distinct word, a list of the files containing that word, together with the word's locations within each file (line numbers).
Example: "masterpiece (pg35473 | 1066) (pg4300 | 2970, 19224)"
- When running the application, you should have a small cluster/cloud of at least two nodes built from VMs, and eventually a larger cluster built from all your individual VMs.
- A stopwords file `stopwords.txt` is required.

Running the application:

- `hadoop/sbin/start-yarn.sh`
- `hadoop/sbin/start-dfs.sh`
- `jps`, `ssh node1 jps` - make sure all required processes are running on every cluster node
- `hdfs dfs -copyFromLocal input /` - the `input` directory contains the book repository
- Build the `InversedIndex.jar` using gradle
- `hadoop jar InversedIndex.jar /input /output`
- `hdfs dfs -copyToLocal /output/* .`
- `stop-all.sh`
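
The index format described at the top can be illustrated without a cluster. The following is a minimal in-memory sketch in plain Java, not the Map-Reduce implementation of this project. The class name `InvertedIndexSketch`, the tokenization rule (lowercase, split on non-letters), and the way stopwords are passed in are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class; it only mirrors the output format of the example entry.
class InvertedIndexSketch {

    // word -> (file name -> sorted line numbers); TreeMap keeps the output order stable
    private final Map<String, Map<String, TreeSet<Integer>>> index = new TreeMap<>();
    private final Set<String> stopwords;

    InvertedIndexSketch(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    /** Indexes one line of one file; lineNumber is 1-based. */
    void addLine(String fileName, int lineNumber, String line) {
        // Assumed tokenization: lowercase, then split on anything that is not a-z.
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue; // skip empty fragments and stopwords
            }
            index.computeIfAbsent(token, w -> new TreeMap<>())
                 .computeIfAbsent(fileName, f -> new TreeSet<>())
                 .add(lineNumber);
        }
    }

    /** Formats one entry as: word (file | line, line) (file | line) */
    String format(String word) {
        Map<String, TreeSet<Integer>> postings = index.get(word);
        if (postings == null) {
            return word; // word not indexed (unknown or a stopword)
        }
        StringBuilder sb = new StringBuilder(word);
        for (Map.Entry<String, TreeSet<Integer>> entry : postings.entrySet()) {
            StringJoiner lines = new StringJoiner(", ");
            for (int n : entry.getValue()) {
                lines.add(Integer.toString(n));
            }
            sb.append(" (").append(entry.getKey()).append(" | ").append(lines).append(")");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("a", "the"));
        InvertedIndexSketch idx = new InvertedIndexSketch(stop);
        idx.addLine("pg4300", 2970, "The masterpiece");
        idx.addLine("pg4300", 19224, "a masterpiece");
        idx.addLine("pg35473", 1066, "Masterpiece!");
        System.out.println(idx.format("masterpiece"));
        // prints: masterpiece (pg35473 | 1066) (pg4300 | 2970, 19224)
    }
}
```

In a Map-Reduce setting, the `addLine` step would roughly correspond to the map phase (emitting word, file, and line number) and `format` to the reduce phase that groups the postings per word.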
- Method 1 uses a `FilePreprocessor` to add the line numbers to the files in the book repository.
- Method 2 uses the `LineJob`, which computes the line numbers based on the offset.
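
The difference between the two methods can be sketched in plain Java. This is only an illustration under assumptions, not the code of `FilePreprocessor` or `LineJob`: the class `LineNumberSketch`, the tab separator, the `\n` line endings, and the use of a sorted array of line-start offsets are all hypothetical choices. The offset-based variant is motivated by the fact that Hadoop's default `TextInputFormat` hands each mapper the byte offset of a line rather than its line number.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class illustrating both ways of obtaining line numbers.
class LineNumberSketch {

    /** Method 1 style: prefix every line with its 1-based number (tab separator assumed). */
    static List<String> numberLines(List<String> lines) {
        List<String> numbered = new ArrayList<>(lines.size());
        for (int i = 0; i < lines.size(); i++) {
            numbered.add((i + 1) + "\t" + lines.get(i));
        }
        return numbered;
    }

    /** Byte offsets at which each line starts, assuming '\n' line endings. */
    static long[] lineStartOffsets(byte[] content) {
        List<Long> starts = new ArrayList<>();
        if (content.length > 0) {
            starts.add(0L);
        }
        // A trailing newline at the very end does not start a new line.
        for (int i = 0; i < content.length - 1; i++) {
            if (content[i] == '\n') {
                starts.add((long) (i + 1));
            }
        }
        long[] result = new long[starts.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = starts.get(i);
        }
        return result;
    }

    /** Method 2 style: map a byte offset to the 1-based number of the line containing it. */
    static int lineNumberForOffset(long[] sortedStarts, long offset) {
        if (sortedStarts.length == 0 || offset < 0) {
            throw new IllegalArgumentException("no lines or negative offset");
        }
        int pos = Arrays.binarySearch(sortedStarts, offset);
        if (pos < 0) {
            // Not a line start: binarySearch returns -(insertionPoint) - 1,
            // and the containing line is the one just before the insertion point.
            pos = -pos - 2;
        }
        return pos + 1;
    }

    public static void main(String[] args) {
        byte[] content = "first\nsecond\n\nfourth".getBytes(StandardCharsets.UTF_8);
        long[] starts = lineStartOffsets(content);
        System.out.println(lineNumberForOffset(starts, 6)); // prints 2
    }
}
```

The trade-off in this sketch is that the first variant rewrites the input once so that every later job sees the line number directly, while the second leaves the input untouched but needs the line-start offsets of each file before a number can be assigned.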