- Fall 2021
In this phase, we implement an information retrieval system for both English and Persian.
Two corpora are used in this phase: one in English (news articles) and one in Persian (Wikipedia articles). Both are located in the phase1-data directory.
This part preprocesses the text: punctuation removal, normalization, tokenization, stemming, stop-word removal, etc.
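A minimal sketch of such a pipeline is shown below. The stop-word list and the suffix-stripping rules are placeholders, not the project's actual resources (a real pipeline would use a proper stemmer such as Porter for English and a Persian normalizer for the Wikipedia corpus):

```python
import re
import string

# Hypothetical minimal stop-word list; the project likely uses a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, crude suffix stemming."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.findall(r"\w+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Naive suffix stripping stands in for a real stemmer.
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The markets are rising, and traders reacted quickly."))
# → ['market', 'ris', 'trader', 'react', 'quickly']
```

Note that the crude stemmer over-strips ("rising" becomes "ris"); this is exactly the kind of error a real stemmer's rule set is designed to limit.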
In this part, we create indexes on the preprocessed text. Two types of indexing are used: positional indexing and bigram indexing.
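As a sketch of these two index types (the exact dictionary layout and the `$` word-boundary markers are assumptions, not necessarily the project's own structures):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Positional index: token -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok][doc_id].append(pos)
    return index

def build_bigram_index(vocabulary):
    """Character-bigram index: bigram -> set of words containing it.

    '$' marks word boundaries, so 'cat' yields $c, ca, at, t$.
    """
    index = defaultdict(set)
    for word in vocabulary:
        padded = f"${word}$"
        for i in range(len(padded) - 1):
            index[padded[i:i + 2]].add(word)
    return index

pos_index = build_positional_index({1: ["to", "be", "or", "not", "to", "be"]})
print(dict(pos_index["to"]))   # → {1: [0, 4]}
```

The positional index supports phrase queries (check that positions are adjacent); the bigram index supports wildcard and fuzzy lookups over the vocabulary.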
In this section, the indexes are compressed. Two compression methods are implemented, namely Variable Byte compression and Gamma Code compression.
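Both schemes are typically applied to the gaps between successive doc IDs or positions rather than the raw values. A sketch of the two encoders (continuation-bit convention and bit-string output are illustrative choices):

```python
def vb_encode_number(n):
    """Variable Byte: 7 payload bits per byte; high bit set on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the final byte
    return out

def gamma_encode(n):
    """Elias gamma (n >= 1): unary length prefix, then the binary offset."""
    offset = bin(n)[3:]  # binary representation without its leading 1
    return "1" * len(offset) + "0" + offset

print(vb_encode_number(824))  # → [6, 184]
print(gamma_encode(9))        # → '1110001'
```

Variable Byte is byte-aligned and fast to decode; Gamma coding is bit-level and compresses small gaps harder, which matches the usual trade-off discussed for these two methods.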
In this part, the query is corrected by replacing misspelled words with the best alternative. We use the Jaccard score to shortlist candidate words; the candidate with the minimum edit distance then replaces the misspelled word.
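A sketch of that two-stage correction, assuming Jaccard similarity is computed over character bigrams (a common choice that pairs naturally with the bigram index; the shortlist size `k` is an assumed parameter):

```python
def char_bigrams(word):
    """Character bigrams of a word, with '$' boundary markers."""
    padded = f"${word}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def edit_distance(a, b):
    """Levenshtein distance with a single rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct(word, vocabulary, k=5):
    """Shortlist the k vocabulary words with the best Jaccard score,
    then pick the one with minimum edit distance."""
    wb = char_bigrams(word)
    shortlist = sorted(vocabulary,
                       key=lambda v: jaccard(wb, char_bigrams(v)),
                       reverse=True)[:k]
    return min(shortlist, key=lambda v: edit_distance(word, v))

print(correct("helo", ["hello", "help", "world"]))  # → 'hello'
```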
In this phase, our goal is to build an error correction system to fix the errors in a query.
This phase's data is split into three directories. The first is the "corpus," whose tokens are used to build the model. The second is the "training set," which contains pairs of phrases together with their edit distance. The third is the "dev set," which contains phrases, their correct form (ground truth), and the version corrected by Google. The data can be downloaded from here.
In this part, the prior distributions of unigrams and bigrams are calculated using maximum likelihood estimation (MLE).
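For MLE, the unigram probability is just a count divided by the corpus size, and the conditional bigram probability divides the bigram count by the count of its first word. A minimal sketch:

```python
from collections import Counter

def train_lm(sentences):
    """MLE estimates: P(w) = c(w)/N and P(b | a) = c(a,b)/c(a)."""
    uni, bi = Counter(), Counter()
    for tokens in sentences:
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}
    p_bi = {pair: c / uni[pair[0]] for pair, c in bi.items()}
    return p_uni, p_bi

p_uni, p_bi = train_lm([["the", "cat"], ["the", "dog"]])
print(p_uni["the"])            # → 0.5
print(p_bi[("the", "cat")])    # → 0.5
```

Unsmoothed MLE assigns zero probability to unseen n-grams, so a real scorer would add smoothing or back-off on top of these tables.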
This model calculates the probability of an error in a query.
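One simple way to build such an error model from the training pairs, assuming independent per-character errors (the real project may use a finer-grained confusion model; this is only a sketch):

```python
def levenshtein(a, b):
    """Standard edit distance with unit insert/delete/substitute costs."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def train_edit_rate(pairs):
    """Per-character edit rate estimated from (misspelled, correct) pairs."""
    edits = sum(levenshtein(wrong, right) for wrong, right in pairs)
    chars = sum(len(right) for _, right in pairs)
    return edits / chars

def error_probability(rate, distance, length):
    """P(observed | intended), assuming each character errs independently."""
    return (rate ** distance) * ((1 - rate) ** (length - distance))

rate = train_edit_rate([("helo", "hello")])
print(rate)  # → 0.2
```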
This part takes the initial query and generates candidates to replace the erroneous words.
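A common way to generate candidates is to enumerate every string one edit away and keep those that appear in the vocabulary (a Norvig-style sketch; the project's actual generator and alphabet, especially for Persian, may differ):

```python
import string

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase  # assumption: English alphabet only
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocabulary):
    """Known one-edit neighbours of `word`; fall back to the word itself."""
    known = edits1(word) & vocabulary
    return known or {word}

print(candidates("helo", {"hello", "world"}))  # → {'hello'}
```

For errors two edits away, the same idea is applied to each result of `edits1`, at a much larger candidate cost.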
In this section, candidates are scored using the language model and the edit (error) probability.
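Combining the two signals gives a noisy-channel score. The sketch below assumes a bigram language model with a simple discounted unigram back-off and a flat per-edit penalty; the back-off constant and `edit_rate` are illustrative parameters, not the project's tuned values:

```python
import math

def noisy_channel_score(candidate, prev_word, p_uni, p_bi, distance,
                        edit_rate=0.01):
    """log P_LM(candidate | prev_word) + distance * log(edit_rate).

    Backs off from the bigram to a discounted unigram when the bigram is unseen.
    """
    lm = p_bi.get((prev_word, candidate), 0.4 * p_uni.get(candidate, 1e-9))
    return math.log(lm) + distance * math.log(edit_rate)

def best_correction(prev_word, scored_candidates, p_uni, p_bi):
    """scored_candidates: iterable of (candidate, edit_distance) pairs."""
    return max(scored_candidates,
               key=lambda cd: noisy_channel_score(cd[0], prev_word,
                                                  p_uni, p_bi, cd[1]))[0]

p_bi = {("the", "cat"): 0.4, ("the", "car"): 0.1}
print(best_correction("the", [("cat", 1), ("car", 1)], {}, p_bi))  # → cat
```

With equal edit distances the more probable continuation wins; a candidate that is closer to the typed word can still beat a more probable but more distant one, which is the intended trade-off between the two models.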
In this phase, we implement some classification and clustering algorithms on text. We have also implemented a crawler.
The AG News dataset is used in this phase.
In this section, three classification algorithms, namely Naive Bayes, SVM, and KNN, are trained and evaluated on the text data.
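Of the three, Naive Bayes is compact enough to sketch from scratch; the multinomial bag-of-words variant with add-one smoothing is shown below (SVM and KNN would normally come from a library such as scikit-learn, which is assumed but not shown here):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over bag-of-words."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = set().union(*self.word_counts.values())
        self.totals = {c: sum(self.word_counts[c].values())
                       for c in self.classes}
        return self

    def predict(self, tokens):
        def log_posterior(c):
            lp = math.log(self.priors[c])
            for t in tokens:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                lp += math.log((self.word_counts[c][t] + 1) /
                               (self.totals[c] + len(self.vocab)))
            return lp
        return max(self.classes, key=log_posterior)

nb = NaiveBayes().fit([["goal", "match"], ["stock", "market"]],
                      ["sport", "business"])
print(nb.predict(["match", "goal"]))  # → sport
```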
Two clustering algorithms, namely K-means and agglomerative clustering, are implemented in this part.
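K-means is the simpler of the two to sketch: alternately assign each point to its nearest centroid and recompute centroids as cluster means. The 2-D points below stand in for document vectors, which in the project would be higher-dimensional (e.g. tf-idf):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means on 2-D points (Euclidean distance)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                    (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters
```

Agglomerative clustering works bottom-up instead: start with every point as its own cluster and repeatedly merge the two closest clusters under a chosen linkage.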
In this part, a crawler for the ResearchGate website is implemented.
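The core of any such crawler is extracting links from fetched pages and feeding them back into a frontier. The stdlib sketch below shows only that link-extraction step; the ResearchGate-specific parts (URL patterns, pagination, authentication, rate limiting) are not shown, and the base URL is used purely for resolving relative links:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute href targets from anchor tags in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A full crawler would fetch pages (e.g. with `urllib.request`), keep a visited set alongside the frontier queue, and respect robots.txt and polite request intervals.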