In this project, we implement:
- Punctuation Tokenizer
- PorterStemmer
- WordBreak, English version (uses dynamic programming)
- WordBreakCJK (class)
The Chinese and Japanese versions use the dictionaries dic_cn and dic_jp under the resource directory. The corresponding tests are in WordBreakCJKTokenizerTest (6 test cases: 3 for Chinese, 3 for Japanese).
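As a rough illustration of the dynamic-programming word break listed above, here is a minimal sketch. The class name WordBreakSketch and the method segment are hypothetical and are not taken from this repository. The sketch returns any one valid segmentation, whereas the project's tokenizer may instead choose among candidate segmentations (for example by word frequency).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not the project's actual class.
final class WordBreakSketch {
    // Splits text into dictionary words.
    // Returns an empty list if the text cannot be fully segmented.
    static List<String> segment(String text, Set<String> dict) {
        int n = text.length();
        // prev[i] = start index of the last word in a valid segmentation
        // of text[0..i), or -1 if that prefix cannot be segmented.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // the empty prefix is trivially segmentable
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (prev[start] != -1 && dict.contains(text.substring(start, end))) {
                    prev[end] = start;
                    break; // keep the first split found
                }
            }
        }
        List<String> words = new ArrayList<>();
        if (prev[n] == -1) {
            return words;
        }
        // Walk the back-pointers from the end of the text to the start.
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }
}
```

For example, with the dictionary {cat, cats, and, sand, dog}, calling WordBreakSketch.segment("catsanddog", dict) yields [cat, sand, dog]. The table has one entry per prefix and each entry scans all earlier split points, so the sketch makes O(n^2) dictionary lookups.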
Building on the previous project (the analyzer), this project tokenizes and stems the input document. We implement a disk-based index structure based on the idea of the LSM (Log-Structured Merge) tree. A single file stores both the word dictionary and the document IDs. In addition, we use multi-threaded merging and searching to improve performance.
In this project, we implement:
- write and read
- merge
- search (AND and OR)
- delete
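To make the AND and OR searches concrete, here is a minimal sketch of how two posting lists can be combined. It assumes each list holds document IDs in ascending order without duplicates. The class PostingListOps and its methods are hypothetical names, not the project's API, and the sketch ignores disk access and multi-threading.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the project's actual class.
final class PostingListOps {
    // AND: document IDs present in both lists (two-pointer intersection).
    static List<Integer> and(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) {
                out.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    // OR: document IDs present in either list (two-pointer union).
    static List<Integer> or(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j == b.size() || (i < a.size() && a.get(i) < b.get(j))) {
                out.add(a.get(i++));          // only a has this ID next
            } else if (i == a.size() || b.get(j) < a.get(i)) {
                out.add(b.get(j++));          // only b has this ID next
            } else {
                out.add(a.get(i));            // same ID in both: emit once
                i++;
                j++;
            }
        }
        return out;
    }
}
```

With a = [1, 3, 5, 7] and b = [3, 4, 5, 8], and(a, b) gives [3, 5] and or(a, b) gives [1, 3, 4, 5, 7, 8]. Both run in time linear in the combined list length because the inputs are already sorted.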
In this project, building on the previous one, we add a positional list to each entry of the inverted list, so users can search for keywords in a specific order. We also compress the data using delta encoding and variable-length encoding.
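The combination of delta encoding and variable-length encoding can be sketched as follows. This is an illustration under stated assumptions, not the project's actual on-disk format. It assumes non-negative integers in ascending order, and a byte layout of 7 payload bits per byte, low-order group first, with the high bit marking that more bytes follow. The class name DeltaVarByteSketch is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; the byte layout is an assumption.
final class DeltaVarByteSketch {
    // Encodes an ascending list of non-negative ints as variable-length gaps.
    static byte[] encode(List<Integer> sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int value : sorted) {
            int gap = value - prev; // delta encoding: store the difference
            prev = value;
            // Variable-length encoding: 7 bits per byte, low-order group first.
            while (gap >= 0x80) {
                out.write((gap & 0x7F) | 0x80); // high bit set: more bytes follow
                gap >>>= 7;
            }
            out.write(gap); // final byte has the high bit clear
        }
        return out.toByteArray();
    }

    // Reverses encode(): rebuilds each gap, then sums gaps into values.
    static List<Integer> decode(byte[] bytes) {
        List<Integer> result = new ArrayList<>();
        int prev = 0, gap = 0, shift = 0;
        for (byte b : bytes) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;      // continuation byte
            } else {
                prev += gap;     // gap complete: recover the original value
                result.add(prev);
                gap = 0;
                shift = 0;
            }
        }
        return result;
    }
}
```

For the list [3, 130, 131, 1000], the gaps are 3, 127, 1, and 869, which take 5 bytes in this layout compared with 16 bytes as fixed 4-byte integers. Sorted document IDs and positions tend to have small gaps, which is why the two techniques are usually paired.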
- Run mvn clean install -DskipTests on the command line.
- Open IntelliJ -> Open -> choose the directory. Wait for IntelliJ to finish importing and building.
- You can run the HelloWorld program under the src/main/java/edu.uci.ics.cs221 package to test if everything works.
Implement ranking using TF-IDF and PageRank.
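As a rough sketch of the TF-IDF part of ranking, the snippet below scores one document against a query. It assumes raw term counts for TF and log10(N / df) for IDF. This is one common variant and may differ from the weighting this project uses. The class TfIdfSketch and its parameters are hypothetical, and PageRank is not covered here.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; the weighting variant is an assumption.
final class TfIdfSketch {
    // Inverse document frequency: rarer terms get larger weights.
    static double idf(int totalDocs, int docFreq) {
        return Math.log10((double) totalDocs / docFreq);
    }

    // Sums tf * idf over the query terms for a single tokenized document.
    static double score(List<String> queryTerms, List<String> docTokens,
                        Map<String, Integer> docFreqs, int totalDocs) {
        double score = 0.0;
        for (String term : queryTerms) {
            int df = docFreqs.getOrDefault(term, 0);
            if (df == 0) {
                continue; // term appears in no document: contributes nothing
            }
            int tf = Collections.frequency(docTokens, term);
            score += tf * idf(totalDocs, df);
        }
        return score;
    }
}
```

For instance, in a collection of 1000 documents where "new" appears in 10 documents and "pizza" in 100, the document [new, york, new, pizza] scores 2 * 2 + 1 * 1 = 5 for the query [new, pizza].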