GitHub - salyaa/InformationRetrieval

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
bm25		bm25
query_exp		query_exp
tfidf		tfidf
Project1_DIS_KaggleLaborers.pdf		Project1_DIS_KaggleLaborers.pdf
ReadMe.txt		ReadMe.txt

Repository files navigation

README

In the folder provided, the report for our group can be found alongside the code for the three models that we attempted to implement. The three models (bm25, tfidf, query expansion) can be found in their respective folders, with the bm25 model being the final model that we implemented and which can also be found on Kaggle. There are additional files in the bm25 folder which are used to produce some of the data that is on Kaggle, for example the pkl files.

Data
With respect to data, the bm25 model is the one that requires the most preprocessed data in order to run, however since it is the final model, all the data required is visible on Kaggle. In brief, the bm25 model uses pkl files for the corpus of all languages except English, while for English the corpus is split into 10 different parts and saved as a joblib file. This is also true for the vectorizer in English, which is also stored as a joblib. All the queries and the test file are preprocessed and saved as csv files.
For the tfidf model, the data required is a json file of the corpus for each language that has already been tokenized and stemmed, while the query file is the dev.csv file given to us originally. In terms of the query expansion model, the data required are csv files for each language s corpus and query that have been tokenized and stemmed in a method that is similar to the one used for bm25.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Languages

salyaa/InformationRetrieval

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages