
Search-Engine

Phase 1:

The objective of this assignment is to compare two approaches to tokenizing and downcasing all words in a collection of HTML documents. You may choose any of the following approaches: flex, javacc, another publicly available tokenizer, or custom code in C, C++, Perl, Python, PHP, or Java. Each program should read a directory name for the input documents and a directory name for the output documents from the command line. The program should produce three things:

- a directory of all tokenized documents (one output file per input file)
- a file of all tokens and their frequencies, sorted by token
- a file of all tokens and their frequencies, sorted by frequency

You may use the UNIX sort facility to sort the output files. However, there must be a single command-line call to your function, e.g., tokenize input-dir output-dir.
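As a rough illustration of the custom-code option, here is a minimal Python sketch of such a tokenize driver. The regex-based HTML stripping and the output file names (freq_by_token.txt, freq_by_count.txt) are illustrative assumptions, not part of the assignment specification.

```python
# tokenize.py -- illustrative sketch: tokenize and downcase HTML documents.
# Usage: python tokenize.py input-dir output-dir
import os
import re
import sys
from collections import Counter

def tokenize(text):
    """Strip HTML tags, downcase, and split into alphabetic word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # crude tag removal
    return re.findall(r"[a-z]+", text.lower())

def main(input_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    corpus_freq = Counter()
    for name in os.listdir(input_dir):
        with open(os.path.join(input_dir, name), errors="ignore") as f:
            tokens = tokenize(f.read())
        corpus_freq.update(tokens)
        # one tokenized output file per input file
        with open(os.path.join(output_dir, name + ".tok"), "w") as out:
            out.write("\n".join(tokens))
    # token/frequency files, sorted two different ways
    with open(os.path.join(output_dir, "freq_by_token.txt"), "w") as out:
        for tok, n in sorted(corpus_freq.items()):
            out.write(f"{tok} {n}\n")
    with open(os.path.join(output_dir, "freq_by_count.txt"), "w") as out:
        for tok, n in corpus_freq.most_common():
            out.write(f"{tok} {n}\n")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```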

Phase 2:

The objective of this assignment is to calculate the term weights for the tokens that occur in each document in your collection. To execute your program, it should be sufficient to type:

calcwts input-directory output-directory

where the input-directory contains the original files (i.e., the unprocessed files directory) and the output-directory contains all generated output files. The calcwts command may be a lex/C/C++/Java program or a shell script that calls a series of programs written in C/C++/Java/Unix tools and/or lex/Java/C/C++/Perl/Python/PHP/etc. It is permissible to create at most one set of temporary files. For example, you may tokenize and count frequencies within documents using Perl or PHP and output that information to one temporary file per input file.

Preprocessing: You may choose to build on either preprocessor created in your Homework 1. If you are not satisfied with your preprocessors, you may borrow a preprocessor from a classmate (but attribute the source). You must extend the preprocessing as follows: remove stopwords, words that occur only once in the entire corpus, and words of length 1. Use the provided list of stopwords.

Term Weighting: You may use any of the tf * idf variants discussed in class and/or the textbook to weight your terms. Be sure to clearly describe the formula you chose in your report. You must normalize the term weights by the length of the documents. The easiest normalization is freq(word_i in document_j) / totalfreq(all tokens in document_j), but you may choose to do the proper square-root-of-sum-of-squares vector normalization or any other normalization that takes some measure of document length into account.

Output Files: The goal is to build one output file per input file. In each output file there should be one line per token that survives preprocessing; that line contains the token and the token weight.
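As one concrete reading of the requirements above, here is a hedged Python sketch of a length-normalized tf * idf weighting, tf(i,j)/len(j) * log(N/df(i)); the exact variant is left to the student, and the names compute_weights and doc_tokens are assumptions made for illustration.

```python
import math
from collections import Counter

def compute_weights(doc_tokens):
    """doc_tokens: {doc_name: [token, ...]} after preprocessing.
    Returns {doc_name: {token: weight}} using length-normalized tf * idf."""
    n_docs = len(doc_tokens)
    # document frequency: number of documents containing each term
    df = Counter()
    for tokens in doc_tokens.values():
        df.update(set(tokens))
    weights = {}
    for doc, tokens in doc_tokens.items():
        tf = Counter(tokens)
        total = len(tokens) or 1                 # document-length normalization
        weights[doc] = {
            t: (count / total) * math.log(n_docs / df[t])
            for t, count in tf.items()
        }
    return weights
```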

Algorithm: You may implement either a memory-based or a file-based algorithm, i.e., you may create a hash table in memory for each document to store its term frequencies, or a temporary file on disk to store the term frequencies (the file-based approach is discouraged). Store the global information (the number of documents each term appears in) in memory. You should time your term weight calculation program on a varying number of documents (10, 20, 40, 80, 100, 200, 300, 400, 500) and plot a graph of the indexing time as a function of the number of documents in your report. The timings should cover the FULL process (starting with the raw documents).
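For the timing experiment, one simple approach is to time the full pipeline per sample size and plot the resulting points. The sketch below assumes a calcwts script exists and that sample_dirs/<n> holds the first n raw documents; both are illustrative assumptions.

```python
import subprocess
import time

sizes = [10, 20, 40, 80, 100, 200, 300, 400, 500]
timings = []
for n in sizes:
    start = time.perf_counter()
    # run the FULL process, starting from the raw documents
    subprocess.run(["./calcwts", f"sample_dirs/{n}", f"out_{n}"], check=True)
    timings.append(time.perf_counter() - start)
print(list(zip(sizes, timings)))   # plot these points for the report
```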

Phase 3:

The objective of this assignment is to build an index for the documents in your collection. In this sense, an index is also called an inverted file: instead of looking up a file to see what terms occur in it, we're looking up a term to see what files that term occurs in.

To execute your program, it should be sufficient to type:

index input-directory output-directory

where the input-directory contains the original files (unprocessed) and the output-directory contains all generated output files (temporary files created along the way should be deleted to save space). The index command may be written in the language(s) of your choice.

Terms and Term Weights: You should use the terms and term weights generated by Homework 2. If you are not happy with your Homework 2, you may use any classmate's Homework 2 as your starting point; mention this in your report.

Output Files: The goal is to build a pair of fixed-length-record files: a dictionary file and a postings file. These files should be in ASCII for easier debugging. The dictionary records may be in either hash order or alphabetical order. Each dictionary record should contain three fields:

- the word
- the number of documents that contain that word (this corresponds to the number of records that word gets in the postings file)
- the location of the first record for that word in the postings file
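A hedged Python sketch of writing these two files, assuming the Homework 2 term weights are already available in memory as weights = {doc: {term: weight}}; the field widths and the file names dictionary.txt and postings.txt are illustrative choices, not requirements.

```python
def build_index(weights, dict_path="dictionary.txt", post_path="postings.txt"):
    """weights: {doc_name: {term: weight}} from the Homework 2 output."""
    # invert: term -> list of (doc, weight)
    postings = {}
    for doc, terms in weights.items():
        for term, w in terms.items():
            postings.setdefault(term, []).append((doc, w))
    offset = 0
    with open(dict_path, "w") as d, open(post_path, "w") as p:
        for term in sorted(postings):                 # alphabetical dictionary order
            entries = postings[term]
            # fixed-width ASCII record: term, document count, first postings record
            d.write(f"{term:<20s} {len(entries):>6d} {offset:>8d}\n")
            for doc, w in entries:
                p.write(f"{doc:<30s} {w:12.6f}\n")    # fixed-width postings record
                offset += 1
```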

Phase 4

The objective of this assignment is to build a command-line retrieval engine on top of the inverted files created in Assignment 3. The retrieval engine should take in queries, which are lists of words, run them through the same preprocessing as the collection (i.e., downcase, optionally remove stopwords), and then match the query words against the inverted file to come up with document weights (the sum of the term weights in the document). Display the ten top-ranking document identifiers or filenames to the user.

You may use any algorithm you choose to locate the postings, but your report should discuss your choices (e.g., do a binary search on the index file on disk versus load the entire inverted index with postings into memory and search there). If you manage your inverted index between memory and disk, there will be a 3% bonus. You must use something other than a simple array to tabulate the document-query similarity scores: a hash table, a heap, something with skip lists, a term-document matrix that uses compression, or whatever. You may design your own query syntax; however, you only need to handle queries that are lists of words. Entries in the inverted index shall be non-negative. The query must run in a single command from the Linux command line, i.e., not through an interactive menu-based system. For example, "retrieve dog cat bird" would be an example query.
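A minimal Python sketch of one such design, assuming the dictionary file is small enough to load entirely into memory and that the postings file uses one fixed-length record per line; the file names match the illustrative index sketch above, and a hash table (a Python dict) accumulates the document-query scores.

```python
import sys
from collections import defaultdict

def load_dictionary(path="dictionary.txt"):
    """term -> (doc_count, first_record); small enough to keep in memory."""
    d = {}
    for line in open(path):
        term, count, start = line.split()
        d[term] = (int(count), int(start))
    return d

def retrieve(query_words, dict_path="dictionary.txt", post_path="postings.txt"):
    dictionary = load_dictionary(dict_path)
    postings = open(post_path).readlines()       # one fixed-length record per line
    scores = defaultdict(float)                  # hash table of document-query scores
    for word in (w.lower() for w in query_words):
        if word not in dictionary:
            continue
        count, start = dictionary[word]
        for line in postings[start:start + count]:
            doc, weight = line.split()
            scores[doc] += float(weight)         # sum of term weights in the document
    return sorted(scores, key=scores.get, reverse=True)[:10]

if __name__ == "__main__":
    # e.g., python retrieve.py dog cat bird
    for doc in retrieve(sys.argv[1:]):
        print(doc)
```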

Phase 5:

In this assignment we perform some analysis of the 504-document HTML test corpus, and introduce some concepts related to document clustering.

A similarity matrix is defined as a matrix in which entry i,j is the similarity of documents i and j, computed using the cosine score or some other metric. A document is perfectly similar to itself, so the entries on the main diagonal are all 1. Similarity is also symmetric, i.e. sim(i,j) = sim(j,i), so only the upper triangle of the matrix needs to be computed and stored. In this assignment, you'll need to construct a similarity matrix. You should code this section of the homework on your own.
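A hedged sketch of building the matrix from the per-document weight vectors of the earlier phases, assuming they are available as {doc: {term: weight}}; since the matrix is symmetric, only the upper triangle is stored here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def similarity_matrix(weights):
    """weights: {doc: {term: weight}}. Returns the upper triangle keyed by (doc_i, doc_j)."""
    docs = sorted(weights)
    sim = {}
    for i, a in enumerate(docs):
        for b in docs[i:]:
            sim[(a, b)] = 1.0 if a == b else cosine(weights[a], weights[b])
    return sim
```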

Certain clustering algorithms use a similarity matrix to do their work. In hierarchical agglomerative clustering, two objects are merged if their centroids are closer to each other than are the centroids of any other pair of objects. An object could be a single document, or an existing cluster. Note that a document not yet in any cluster is in fact a trivial singleton cluster, with itself as the centroid. You may assume that the intersection between any two clusters is the empty set.

Your program is to execute agglomerative clustering, using either the group-average link method or the single-link method. As each step of the algorithm proceeds, indicate which objects are being merged (clustered) together, where again an object can be a single document or an existing cluster. The clustering will cease when no two clusters (or documents) have similarity greater than 0.4. Clustering is discussed in the textbook, and more information is available online. You have the option of using a procedure call from a package or, if you code this yourself, you will receive extra credit.
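A minimal sketch of group-average-link agglomerative clustering over the similarity table from the sketch above, stopping once the best pairwise cluster similarity falls to 0.4 or below; the helper names and the list-of-lists cluster representation are illustrative assumptions.

```python
def group_average(sim, c1, c2):
    """Average pairwise similarity between two clusters (lists of document ids)."""
    pairs = [sim[min(a, b), max(a, b)] for a in c1 for b in c2]
    return sum(pairs) / len(pairs)

def agglomerative(sim, docs, threshold=0.4):
    """Group-average-link clustering over a precomputed similarity table."""
    clusters = [[d] for d in docs]               # every document starts as a singleton
    while len(clusters) > 1:
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: group_average(sim, clusters[ij[0]], clusters[ij[1]]),
        )
        if group_average(sim, clusters[i], clusters[j]) <= threshold:
            break                                # no pair is similar enough; stop
        print("merging", clusters[i], "with", clusters[j])
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```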

Remember that we are using similarity rather than distance. A good resource can be found in Chapter 17 of Manning's book on Information Retrieval.