SearchEngine_Inverted Index.ipyn: This is the source code file. It has four main processing tasks:
- Scrape text from 6 URLs and store only the preprocessed text in a separate text file for each HTML page.
- Create an inverted index with document frequencies and posting locations.
- Compute the similarity between the 6 documents using the cosine similarity metric.
- Use the inverted index as a simple search engine for information retrieval (cross-platform desktop application built with Python and Tkinter).
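As a rough illustration of the second task, the sketch below builds an inverted index that records, for each term, the documents it occurs in and the token positions within each document. The function names (`tokenize`, `build_inverted_index`, `document_frequency`), the regex tokenizer, and the toy documents are assumptions made for this example; the notebook's actual preprocessing may differ.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: [token positions]} for a {doc_id: text} dict."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def document_frequency(index):
    """Count the number of documents that contain each term."""
    return {term: len(postings) for term, postings in index.items()}


# Hypothetical toy documents standing in for the scraped pages.
docs = {
    "doc1": "machine learning uses data",
    "doc2": "data mining finds patterns in data",
}
index = build_inverted_index(docs)
df = document_frequency(index)
```

Here `index["data"]` holds one posting list per document, and `df["data"]` is the number of documents in which the term appears.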
Source pages
- https://en.wikipedia.org/wiki/Machine_learning
- https://en.wikipedia.org/wiki/Engineering
- http://my.clevelandclinic.org/research
- https://en.wikipedia.org/wiki/Data_mining
- https://en.wikipedia.org/wiki/Data_mining#Data_mining
- http://cis.csuohio.edu/~sschung/
Computing similarity between the documents using the cosine similarity metric
Cosine similarity is a metric that measures how similar two documents are, irrespective of their size.
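A minimal sketch of the metric, assuming each document is represented by raw term counts (the notebook may weight terms differently, for example with TF-IDF). The function name `cosine_similarity` and its token-list inputs are chosen for this example only.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors built from token lists."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms of the first document; missing terms count as 0.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# The second document repeats the first one twice, so the term proportions match.
same_shape = cosine_similarity(["data", "mining"], ["data", "mining", "data", "mining"])
disjoint = cosine_similarity(["data"], ["engineering"])
```

Because the score depends only on the direction of the count vectors, doubling a document's length without changing its term proportions leaves the similarity unchanged.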
In the results, the documents are numbered as follows:
- Doc 1: Machine Learning
- Doc 2: Engineering
- Doc 3: Research
- Doc 4: Data mining
- Doc 5: Data mining (#Data_mining anchor)
- Doc 6: ss chung
Scores of 1.0 that come from comparing a document with itself (di, di) are excluded, because they carry no information for the analysis.
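The filtering and sorting step might look like the sketch below. It assumes the pairwise scores are available as a square, symmetric matrix; the function name `rank_pairs` and the 3x3 example values are hypothetical and are not the scores from this project.

```python
from itertools import combinations


def rank_pairs(similarity, labels):
    """Return (label_i, label_j, score) for every pair i < j, highest score first.

    Taking only i < j skips the diagonal (di, di) and avoids listing
    each symmetric pair twice.
    """
    pairs = [
        (labels[i], labels[j], similarity[i][j])
        for i, j in combinations(range(len(labels)), 2)
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical similarity matrix for three documents A, B, and C.
similarity = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.5],
    [0.7, 0.5, 1.0],
]
ranked = rank_pairs(similarity, ["A", "B", "C"])
```

Note that this only drops self-comparisons; two distinct documents with identical content would still appear in the ranking with a score of 1.0.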
Top matches, sorted by descending cosine score:
- (Doc 4, Doc 5): 1.0 (the content of Doc 5 is part of Doc 4)
- (Doc 1, Doc 4): 0.87 (Machine Learning vs. Data mining)
- (Doc 1, Doc 2): 0.82 (Machine Learning vs. Engineering)
- (Doc 2, Doc 4): 0.78 (Engineering vs. Data mining)
- (Doc 5, Doc 6): 0.65
- (Doc 1, Doc 6): 0.61
- (Doc 1, Doc 3): 0.61
- (Doc 2, Doc 3): 0.61
- (Doc 3, Doc 4): 0.59
- (Doc 3, Doc 5): 0.59
- (Doc 2, Doc 6): 0.55
- Search for the term ‘research’
- Search for the term ‘data’
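A single-term search like the ones listed above can be sketched as a lookup in the inverted index. This example assumes the index maps each term to `{doc_id: [positions]}` and ranks documents by how often the term occurs; the function name `search`, the ranking rule, and the sample postings are assumptions for illustration, and the Tkinter interface is omitted.

```python
def search(index, query):
    """Return [(doc_id, positions)] for a single-term query, most occurrences first."""
    postings = index.get(query.strip().lower(), {})
    return sorted(postings.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical postings; real positions come from the scraped documents.
index = {
    "research": {"doc1": [7], "doc3": [0, 12, 40]},
    "data": {"doc4": [1, 5], "doc5": [2]},
}
results = search(index, "Research")
no_hits = search(index, "physics")
```

A term that is not in the index yields an empty result list rather than an error, which keeps the calling code simple.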