FAQ-Retrieval-System

In this task, we are given a corpus of frequently asked questions (FAQs) and answers from various domains. The set of questions in the corpus is denoted by Q. The query is written in SMS language and may or may not contain noise. The goal is to find the question Q* in Q that best matches the SMS query S.

I use two parameters to score a question: a keyword score and a similarity score. The methods for calculating the keyword score, such as disemvoweling, are based on general observations about the language and slang people use when typing SMS text. The similarity score, on the other hand, is calculated using dynamic programming techniques for string comparison and pattern matching algorithms, such as Longest Common Subsequence and Gestalt Pattern Matching.

System Implementation

* Preprocessing
* Disemvoweling
* Removal of stop words
* Keyword matching
* Calculation of weight of each word using:
  * Similarity ratio
  * Longest Common Subsequence ratio
  * Levenshtein Distance
  * Inverse Document Frequency
* Creation of variant lists for each SMS word
* Similarity score
* Total Score

Preprocessing

We create a hash table W that contains every word occurring in the questions in Q. The keys are the characters a-z and the digits 0-9, and each key maps to the words that start with that character.

Example: ‘i’ contains all the words in the set Q that start with ‘i’, like ‘insurance’, ‘improve’, and so on.

* A list of stop words is also prepared and disemvoweled.
* Digits occurring in an SMS token are replaced by a string based on a manually designed digit-to-string mapping (“8” -> “eight”).
* Single-character words in the SMS query are removed.
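The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `build_word_index` and `normalize_sms` are hypothetical, and only the “8” -> “eight” entry of the digit mapping comes from the text; the other entries are assumed examples.

```python
import re
from collections import defaultdict

# Only "8" -> "eight" is taken from the README; the other entries are
# illustrative assumptions about what a hand-made mapping might contain.
DIGIT_TO_STRING = {"8": "eight", "2": "to", "4": "for"}


def build_word_index(questions):
    """Group every distinct word in the FAQ questions by its first character."""
    index = defaultdict(set)
    for question in questions:
        for word in re.findall(r"[a-z0-9]+", question.lower()):
            index[word[0]].add(word)
    return index


def normalize_sms(query):
    """Replace digits inside each token, then drop single-character tokens."""
    tokens = []
    for token in query.lower().split():
        expanded = "".join(DIGIT_TO_STRING.get(ch, ch) for ch in token)
        if len(expanded) > 1:
            tokens.append(expanded)
    return tokens


index = build_word_index(["How to improve insurance?", "Is it insured"])
print(sorted(index["i"]))
print(normalize_sms("a gr8 plan 4 u"))
```

Grouping words by first character means a query token only has to be compared against one bucket, assuming the first character of an SMS token is usually typed correctly.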

Disemvoweling

We call the process of removing vowels from a string disemvoweling, and the resulting string is said to be disemvoweled. We apply disemvoweling to the SMS query because users have generally been observed to compress SMS text by dropping vowels.
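A minimal sketch of disemvoweling is shown below. The function name `disemvowel` is hypothetical, and the sketch assumes every vowel is removed regardless of its position; the text does not say whether a leading vowel is treated specially.

```python
VOWELS = set("aeiou")


def disemvowel(word):
    """Return the word with all vowels (a, e, i, o, u) removed."""
    return "".join(ch for ch in word if ch.lower() not in VOWELS)


print(disemvowel("insurance"))  # nsrnc
print(disemvowel("please"))     # pls
```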

Calculation of weight of a word

For each token s of the SMS query (not disemvoweled), we calculate its similarity with every word w in the corpus W. The weight of a word is given by the equation:

Weight(w, s) = (LCSR(w, s) * SMRatio(w, s) * IDF(w)) / LevDistance(w, s) ...(2)

where:

* LCSR(w, s): Longest Common Subsequence Ratio of the SMS query token s and the word w in W.
* SMRatio(w, s): similarity ratio using the Ratcliff/Obershelp algorithm.
* IDF(w): Inverse Document Frequency of the word w.
* LevDistance(w, s): Levenshtein Distance between the disemvoweled w and s.
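The weight computation could be sketched as below. Several details are assumptions rather than the project's confirmed method: the weight is read as a product divided by the Levenshtein distance, the LCS ratio is normalized by the longer string, IDF is taken as log(N / document frequency), and 1 is added to the distance to avoid division by zero when the disemvoweled strings are identical. All function names are hypothetical. Python's standard `difflib.SequenceMatcher` is used for the similarity ratio because it is based on Ratcliff/Obershelp (gestalt) pattern matching.

```python
import math
from difflib import SequenceMatcher

VOWELS = set("aeiou")


def disemvowel(word):
    return "".join(ch for ch in word if ch not in VOWELS)


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(w, s):
    # Assumption: normalize by the length of the longer string.
    longest = max(len(w), len(s))
    return lcs_length(w, s) / longest if longest else 0.0


def sm_ratio(w, s):
    return SequenceMatcher(None, w, s).ratio()


def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def idf(word, questions):
    # Assumption: questions is a list of word sets; IDF = log(N / df).
    containing = sum(1 for q in questions if word in q)
    return math.log(len(questions) / containing) if containing else 0.0


def weight(w, s, idf_w):
    distance = levenshtein(disemvowel(w), disemvowel(s))
    # Assumption: +1 avoids dividing by zero for identical skeletons.
    return lcs_ratio(w, s) * sm_ratio(w, s) * idf_w / (distance + 1)


print(weight("insurance", "insrnce", 1.0) > weight("improve", "insrnce", 1.0))
```

With this reading, a corpus word whose disemvoweled form matches the disemvoweled SMS token is not penalized by the distance term, so `insurance` outranks `improve` for the token `insrnce`.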
