Program that computes a similarity score between two text documents
The program vectorizes each text document. Each vector is made up of Word objects, each of which holds the word as a string and its term frequency score (number of occurrences of the word in the document / total number of words in the document). I take the cosine similarity of the two vectors to get a similarity score ranging from 0 (the documents share no words) to 1 (the documents have the same relative word frequencies). Because term frequencies are never negative, the score cannot drop below 0.
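As a rough illustration of the calculation described above, here is a minimal sketch in Python. The language, the function names `term_frequencies` and `cosine_similarity`, and the whitespace-based tokenization are all assumptions made for the example; the actual program stores its vectors as Word objects and may tokenize differently.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Map each word to (occurrences of the word) / (total words in the text)."""
    # Assumption: lowercase and split on whitespace; the real tokenizer may differ.
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(weight * tf_b.get(word, 0.0) for word, weight in tf_a.items())
    norm_a = math.sqrt(sum(w * w for w in tf_a.values()))
    norm_b = math.sqrt(sum(w * w for w in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two documents with no words in common score 0, and two documents with the same relative word frequencies score 1 (up to floating-point rounding), even if one is a longer repetition of the other.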
- load the stop words into a data structure better optimized for lookup (such as a hash table)
- perform a more efficient search to check whether a word is a stop word (currently a linear search)
- calculate the inverse document frequency, which takes into account the number of documents a word occurs in (used for analyzing multiple documents on the same topic)
- scale to analyze multiple documents by adding a Files class that is composed of Document objects
- classify documents based on similarity (learn classification techniques)
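
The first three items above could look roughly like the following Python sketch. This is not the program's code: the helper names `load_stop_words`, `remove_stop_words`, and `inverse_document_frequency` are hypothetical, the stop words are assumed to come one per line, and the IDF uses the plain log(N / df) form without smoothing.

```python
import math


def load_stop_words(lines):
    """Build a hash-based set from an iterable of lines, one stop word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(words, stop_words):
    """Drop stop words; each set membership test is O(1) on average."""
    return [word for word in words if word.lower() not in stop_words]


def inverse_document_frequency(documents):
    """Return idf(word) = log(N / df(word)) for a list of tokenized documents.

    N is the number of documents and df(word) is the number of documents
    that contain the word at least once.
    """
    n_docs = len(documents)
    doc_freq = {}
    for words in documents:
        for word in set(words):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}
```

With a set, checking whether a word is a stop word no longer depends on the length of the stop word list. A word that appears in every document gets an IDF of 0, so multiplying term frequency by IDF downweights words that are common across the whole collection.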