HadoopInvertedIndex

🐘 Inverted index for a book repository implemented using Hadoop Map-Reduce.

For each distinct word, the inverted index contains a list of the files that contain it, together with the word's locations within each file (line numbers).

Example: "masterpiece (pg35473 | 1066) (pg4300 | 2970, 19224)"
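The entry format above can be illustrated with a small in-memory sketch in plain Java. This is not the repository's Map-Reduce code: the class name InvertedIndexSketch and its methods are hypothetical. It also assumes that words are lowercased, split on non-letters, and that a word occurring twice on one line is listed once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // word -> (file name -> line numbers); TreeMap keeps the output order stable
    private final Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
    private final Set<String> stopwords;

    public InvertedIndexSketch(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    /** Records every non-stopword token of one line under (file, lineNumber). */
    public void addLine(String file, int lineNumber, String line) {
        // Assumption: tokens are lowercase runs of ASCII letters
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue;
            }
            List<Integer> lines = index
                    .computeIfAbsent(token, w -> new TreeMap<>())
                    .computeIfAbsent(file, f -> new ArrayList<>());
            // Assumption: lines of a file arrive in increasing order,
            // so checking the last entry is enough to skip duplicates
            if (lines.isEmpty() || lines.get(lines.size() - 1) != lineNumber) {
                lines.add(lineNumber);
            }
        }
    }

    /** Formats one entry as "word (file | line, line) (file | line)". */
    public String format(String word) {
        Map<String, List<Integer>> postings = index.get(word);
        if (postings == null) {
            return word;
        }
        StringBuilder sb = new StringBuilder(word);
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            sb.append(" (").append(entry.getKey()).append(" | ");
            List<Integer> lines = entry.getValue();
            for (int i = 0; i < lines.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(lines.get(i));
            }
            sb.append(")");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch(Set.of("a", "the"));
        idx.addLine("pg35473", 1066, "A true masterpiece");
        idx.addLine("pg4300", 2970, "the masterpiece");
        idx.addLine("pg4300", 19224, "Masterpiece!");
        System.out.println(idx.format("masterpiece"));
    }
}
```

Running main prints the same entry as the example: masterpiece (pg35473 | 1066) (pg4300 | 2970, 19224). In the Hadoop version, the grouping by word would be done by the shuffle phase rather than by an in-memory map.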

To run the application you need a small cluster/cloud of at least two nodes built from VMs, or possibly a larger cluster built from all your individual VMs.
A stopwords file named stopwords.txt is required.
hadoop/sbin/start-yarn.sh
hadoop/sbin/start-dfs.sh
jps, ssh node1 jps - make sure all required processes are running on every cluster node
hdfs dfs -copyFromLocal input / - the input directory contains the book repository
Build InversedIndex.jar using Gradle
hadoop jar InversedIndex.jar /input /output
hdfs dfs -copyToLocal /output/* .
hadoop/sbin/stop-all.sh

Method 1 uses a FilePreprocessor to add line numbers to the files in the book repository.
Method 2 uses the LineJob, which computes the line numbers from the offset.
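As background for Method 2: with Hadoop's default TextInputFormat, the mapper key is the byte offset at which each line starts, not a line number. The sketch below shows one possible way to turn such offsets into line numbers, by ranking an offset among all line-start offsets of the same file. It is not taken from LineJob; the class OffsetToLineNumber and its methods are hypothetical, and it assumes '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetToLineNumber {

    /** Collects the byte offset at which each line of the file content starts. */
    static long[] lineStartOffsets(byte[] content) {
        List<Long> starts = new ArrayList<>();
        if (content.length > 0) {
            starts.add(0L);
        }
        // A new line starts right after every '\n' that is not the last byte
        for (int i = 0; i < content.length - 1; i++) {
            if (content[i] == '\n') {
                starts.add((long) i + 1);
            }
        }
        return starts.stream().mapToLong(Long::longValue).toArray();
    }

    /** Returns the 1-based line number of the line starting at the given offset. */
    static int lineNumber(long[] sortedStarts, long offset) {
        int pos = Arrays.binarySearch(sortedStarts, offset);
        if (pos < 0) {
            throw new IllegalArgumentException("not a line start: " + offset);
        }
        return pos + 1;
    }

    public static void main(String[] args) {
        byte[] content = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        long[] starts = lineStartOffsets(content);
        System.out.println(Arrays.toString(starts));   // line starts: 0, 6, 11
        System.out.println(lineNumber(starts, 11));    // "gamma" is line 3
    }
}
```

In a distributed job the offsets of one file are spread over several input splits, so they would first have to be collected per file (for example in a separate pass) before they can be ranked this way.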
