Text Similarity Calculator

A program that computes a similarity score between two text documents.

Description

The program vectorizes each text document. Each vector is made up of Word objects, and each Word holds a string representing the word and its term frequency score (number of occurrences of the word in the document / total number of words in the document). I take the cosine similarity of the two vectors to get a similarity score. Because term frequencies are never negative, the score ranges from 0 (the documents share no words) to 1 (the documents have identical word distributions).
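
The calculation described above can be sketched as follows. This is a minimal illustration, not the code in main.cpp: it assumes whitespace-separated, already normalized words, and it stores term frequencies in a `std::unordered_map` instead of a vector of Word objects. The names `TermFrequencies`, `termFrequencies`, and `cosineSimilarity` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical alias: maps each word to its term frequency.
using TermFrequencies = std::unordered_map<std::string, double>;

// Term frequency = occurrences of a word / total number of words.
// Assumption: words are separated by whitespace and already lowercased.
TermFrequencies termFrequencies(const std::string& text) {
    std::unordered_map<std::string, std::size_t> counts;
    std::size_t total = 0;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {
        ++counts[word];
        ++total;
    }

    TermFrequencies tf;
    for (const auto& entry : counts) {
        tf[entry.first] =
            static_cast<double>(entry.second) / static_cast<double>(total);
    }
    return tf;
}

// Cosine similarity = dot(a, b) / (|a| * |b|).
// Returns 0 when either document has no words, to avoid dividing by zero.
double cosineSimilarity(const TermFrequencies& a, const TermFrequencies& b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;

    for (const auto& entry : a) {
        normA += entry.second * entry.second;
        const auto match = b.find(entry.first);
        if (match != b.end()) {
            dot += entry.second * match->second;  // only shared words contribute
        }
    }
    for (const auto& entry : b) {
        normB += entry.second * entry.second;
    }

    if (normA == 0.0 || normB == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

Calling `cosineSimilarity(termFrequencies(textA), termFrequencies(textB))` gives the score for two strings. Only words present in both documents add to the dot product, which is why documents with no words in common score 0.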

Next steps

load the stop words into a data structure better suited to lookup (e.g. a hash table)
perform a more efficient search to check whether a word is a stop word (currently a linear search)
calculate the inverse document frequency, which takes into account the number of documents a word occurs in (used for analyzing multiple documents on the same topic)
scale to analyze multiple documents by adding a Files class that is composed of Document objects
classify documents based on similarity (learn classification techniques)
