
ra397/Text-Similarity

Text Similarity Calculator

A program that computes a similarity score between two text documents.

Description

The program vectorizes each text document. Each vector is made up of Word objects, which hold a string representing the word and its term frequency score (the number of occurrences of the word in the document divided by the total number of words in the document). The similarity score is the cosine similarity of the two vectors. Because term frequencies are never negative, the score ranges from 0 (the documents share no words) to 1 (the documents have identical word distributions).
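The calculation described above can be sketched in a few lines of C++. This is a minimal illustration, not the repository's actual code: the `TermVector` alias stands in for the vector of Word objects, the function names `termFrequencies` and `cosineSimilarity` are hypothetical, and tokenization is assumed to be plain whitespace splitting with no stop-word removal.

```cpp
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the README's vector of Word objects:
// maps each word to its term frequency score.
using TermVector = std::unordered_map<std::string, double>;

// Term frequency = occurrences of a word / total words in the document.
// Assumption: words are separated by whitespace and used as-is.
TermVector termFrequencies(const std::string& text) {
    TermVector tf;
    std::istringstream in(text);
    std::string word;
    std::size_t total = 0;
    while (in >> word) {
        tf[word] += 1.0;  // raw count for now
        ++total;
    }
    for (auto& entry : tf) {
        entry.second /= static_cast<double>(total);  // count -> frequency
    }
    return tf;
}

// Cosine similarity = dot(a, b) / (|a| * |b|).
// Returns 0 when either document is empty, to avoid dividing by zero.
double cosineSimilarity(const TermVector& a, const TermVector& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& entry : a) {
        normA += entry.second * entry.second;
        auto it = b.find(entry.first);
        if (it != b.end()) {
            dot += entry.second * it->second;  // only shared words contribute
        }
    }
    for (const auto& entry : b) {
        normB += entry.second * entry.second;
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

Calling `cosineSimilarity(termFrequencies(textA), termFrequencies(textB))` yields the score for two documents already loaded into strings. Storing the vector as a hash map keeps the dot product linear in the number of distinct words, since each lookup in the other document takes constant time on average.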

Next steps

  • load the stop words into a data structure better suited to lookup (e.g. a hash table)
  • perform a more efficient search to check whether a word is a stop word (currently a linear search)
  • calculate the inverse document frequency, which takes into account the number of documents a word occurs in (useful when analyzing multiple documents on the same topic)
  • scale to analyze multiple documents by adding a Files class that is composed of Document objects
  • classify documents based on similarity (learn classification techniques)
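As a rough sketch of how the stop-word lookup and inverse document frequency items might look, the snippet below uses `std::unordered_set` for average constant-time membership checks and the common `log(N / df)` definition of IDF. The function names `isStopWord` and `inverseDocumentFrequency` are hypothetical, and each document is assumed to be already reduced to its set of distinct words. Other IDF variants (for example, with smoothing) exist and may suit the project better.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// A hash set replaces the linear scan: lookup is O(1) on average.
bool isStopWord(const std::unordered_set<std::string>& stopWords,
                const std::string& word) {
    return stopWords.count(word) > 0;
}

// Inverse document frequency, using the common definition
//   idf(word) = log(N / df)
// where N is the number of documents and df is the number of documents
// that contain the word. Assumption: each document is represented by the
// set of distinct words it contains.
// Returns 0 when no document contains the word, to avoid dividing by zero.
double inverseDocumentFrequency(
    const std::vector<std::unordered_set<std::string>>& documents,
    const std::string& word) {
    std::size_t containing = 0;
    for (const auto& doc : documents) {
        if (doc.count(word) > 0) ++containing;
    }
    if (containing == 0) return 0.0;
    return std::log(static_cast<double>(documents.size()) /
                    static_cast<double>(containing));
}
```

A word that appears in every document gets an IDF of 0, so multiplying each term frequency by its IDF down-weights words that are common across the whole collection.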

About

C++ program that computes a similarity score between text documents
