with Python
TF-IDF
Information Retrieval
- Introduction
- Release
- Environment
- Author
- Installation
- User Manual
- Documentation
- Development Introduction
- Contact
- License
The project is an inverted index for a set of documents, and then the inverted index is used to implement quick search for various kinds of query. The input of the program is a large collection of English documents.
- Use the Inverted Index approach to implement a simple Index and Query system for the given document set.
- The system is able to search the relevant documents for a single term query, such as., “movie” or “retrieval”.
- Write an interface for the program which is able to
- Let users input the document directory for indexing;
- Let users input the query they want to search;
- Display the names of the retrieved documents for users.
- The system is able to search the relevant documents for a composite query consisting of two or more words connected by “OR” or “AND”, e.g., “movie” AND “retrieval”; “movie” OR “TV”, etc. The composite query using both “AND” and “OR”, like (“action” OR “comedy”) AND “movie”,.is NOT required,
- The system is able to search the relevant documents for a phrase query consisting of two or more words, like “good movie”.
- The system is able to rank the relevant documents according to their relevance to the query (for example, by Vector Space Model) and display the documents according to the ranking results.
- Develop a better interface for your program. For example, it can display the content of the retrieved documents to users. The words in the query should be highlighted.
- Any other functions which may improve the indexing/searching quality.
- Python: 3.6.5 (also work with python 2)
- Windows OS
git clone git@github.com:Cqyhid/SearchSystem.git
$ cd QueryIndex\CLI
$ python main.py
git clone git@github.com:Cqyhid/SearchSystem.git
$ cd QueryIndex\GUI
Follow Kivy's Document for installation
$ python kivy_main.py
$ python main.py
$ basic
or
$ advance
or
$ superior
Type your Documents's path
$ C:\Users\Alpha\Desktop\IR\Assignment-1\data\data
And wait until Sorting completed!
is displayed.
Type query
to search
$ query
Input the query
For Example
$ Indians
Check the results
Relative documents and term's occur time would be display.
For Example
{'D0601.M.250.A.A': 2, 'D0601.M.250.A.E': 2, 'D0601.M.250.A.C': 6, 'D0614.M.250.E.B': 1, 'D0601.M.250.A.B': 4}
Document name "D0601.M.250.A.A" have the term 'indians' two times.
Terminate the program by typing
$ exit
Type your Documents's path
$ C:\Users\Alpha\Desktop\IR\Assignment-1\data\data
And wait until Sorting completed!
is displayed.
Type query
to search
$ query
Select phrase search or boolean search
$ phrase
or
$ boolean
Type the query
For Example
$ indians comprise
or
$ indians comprise less
Check the results
For Exmaple
The results of the query is:
['D0601.M.250.A.A']
Terminate the program by typing
$ exit
Type the query
For Example
$ indians and less
or
$ indians or less
Check the results
Terminate the program by typing
$ exit
Type your Documents's path
$ C:\Users\Alpha\Desktop\IR\Assignment-1\data\data
And wait until Sorting completed!
is displayed.
Type query
to search
$ query
Type the query
For Example
$ indians less comprise
Check the results
For Example ('D0601.M.250.A.C', 0.5635279098601355)
Means that document D0601.M.250.A.C
with the score 0.563 for the query indians less comprise
The higher score indicates a better matched with the documents.
Terminate the program by typing
$ exit
File Chooser might malfunction for now, better to input your path manually
or
Continutly detecting user’s input and handle error inputs. And call different functions for different input corresponded.
fileReading(file_path): Read files from the directory of basic search. It stores all the term of documents into an inverted index. Convert all word into lowercase. A stop word list is used to delete the common words in all files. The file name and the term appear in each typical file with its times would be stored in a nested dictionary. This function returns a dictionary with {term:file} pairs. The repetitive file name is recorded which allows the program to calculate the times that one term appears in a file. This function also returns a dictionary contains the position information of each word. It returns a dictionary with {file:[]} pairs. The array in value store the words in the file in order. The program can, therefore, access to the position of each word.
termVectorFileReading(file_path): Read files from the directory of vector search. It searches the directory which inputs by the user and opens all the non-directory files. The string lines inside the file will be stored and broke into an array by a blank. The array is composited with words appear in the file. A stop word list is also applied. The nested dictionary is composited with {term : times} pairs while the out layer is the {file : {term:times}} pairs. The times will then be transformed into normalization format by using the vector space model. The Euclidean weights are applied in this function.
IDF(document_frequency,total_document_number): Calculate the IDF using the formula of log10(N/df)
init(self): Display the command line interface of main menu to user.
do_query_simple(self,file_collection): Get the query of user input of command line.
do_query_boolean(self,file_collection): Get the query of user input of command line of using Boolean searching.
do_query_multiwords(self,file_collection,position_information): Get the query of user input of command line of using Phase searching.
do_query_ranking(self,final_word_list,file_collection,doc_idf): Get the query of user input of command line of using Vector Space Ranking searching.
do_path_basic(self): Get the path of documents from user input.
do_path_vector(self): Get the path of documents from user input when using vector space searching.
display_results(self,data): Display the searching’s results.
search_mode(self): Get the searching’s mode of user input. For user to select the searching’s mode.
do_help(self): Display the command line instruction to users.
SearchModes.py simpleSearch(query,inverted_index): Using an inverted index to get the results of the query using the index of the dictionary in python. Return the results together with the counted number to the command line interface.
booleanSearch(query,inverted_index): Check the user input of keyword ‘and’ or ‘or’ first. Separate the query term and store in a query’s list. For ‘AND’ search, check all the query’s term in the inverted index vector. Then delete the document’s result when it was not contained in the later query’s term. For ‘OR’ search, add all the document occur in the for loop searching.
multiwordsSearch(query,totalInfo): Get the query list and find out the results of the first query using the inverted index of the dictionary in python. The inverted index and position information are packed in the object totalInfo. Then, check if the next word in the query list is the same as the next word in position information array. If not, delete the file in the final list.
rankingSearch(query,final_word_list,totalInfo): The query, final word list, normalized vector, and IDF information are used in this function. The function will turn the query into a vector then use the IDF information to transform the query vector into a normalized vector. Then do the cosine similarity between the normalized query vector and every normalized file term vectors. Store the file name and ranking points in a dictionary as a {name: value} pair. Use the sorted function in python to rank the file name then return the results.
We developed the basic one first. We wrote a program for a command line interface that allowed the user to type the command in the program. For the first requirement (Use the Inverted Index approach to implement a simple Index and Query system for the given document set.) We wrote the program to read all documents in the directory. Also, allow the user to indicate the path of the directory. All words in English are separated with a space character. So we found all the space character and store the word in the inverted index vector. Also, count the number of occurrence of the term in the document. The vector of the term indicated which document contained the word and the number of terms. In Python, we use the dictionary. Which is contain key and value pair. In this case, the key is the term. The value is the vector of the document. For example, {“apple”:[“doc1”, ”doc2”]}. After that, the searching is to put the user’s input into the vector of the inverted index. The dictionary could get the correspond key-value pair of the term. For example, query = “apple”, then inverted_index[query] would return the value [“doc1”,”doc2”]. Up to now, all Basic Requirements were fulfilled.
For the Boolean search, we decided to implement a program that first detected the “AND” searching or “OR” searching, then stored all the query’s term in a vector. For the “AND” search. We use a for loop function to check all the query’s term in the inverted index vector. Then delete the document’s result when it was not contained in the later query’s term. For example, the query is “Alice AND Bob” then the Alice result is [“doc1”, ”doc2”], and Bob result is [“doc1”], at that time, doc2 of the final result would be deleted. For the “OR” search. We just add all the document occur in the for loop searching.
For the phrase query’s searching. We store the term of in document in a vector that it also contains the position’s index. When processing the phrase query, we check the second’s term of the query is the next term (index + 1) of the first term or not. If it didn't then remove the document from the result.
We had removed the stop word from documents. And transform all the term to lower case. Also, apply that to query.
For the requirement of ranking search. We use the vector space model to implement the TF-IDF and the cosine similarity. First, we did not use the same way of indexing in the basic version. We read all the document to make a term collection for the calculation of Document Frequency later. And store the total number of documents too. Then, we used a nested dictionary to store the Term Frequency of different documents. Then we created a Document frequency. We create a dictionary to store the document frequency of each term. The program read all the document to count the number of each term’s appearance. Then we use TF and df to calculate IDF by implemented the formula in the program. Then use TF and IDF to calculate TF-IDF weight. And also calculate the length of each document and do the normalization of the weight. For the query function, we pass the IDF to the searching program for calculating the weight of the query. Then we construct a dictionary of the query which is the term frequency in the same way. And calculate the weight by multiplied TF with IDF. And compute the cosine similarity by multiplied normalized weight of document with query’s weight. Use this as a score to store in the dictionary. Then the dictionary showed all the score of the document for this query. Sorted by descended and 0 scores of documented were removed. Then return the result to the user. The result showed the document with the score.
For improve the indexing and searching quality, we use Hashtable in Python for both indexing and searching. The dictionary component in Python is using the data structure of Hashtable.
GUI, In this project, the python GUI was also developed by using kivy API. Users can choose the different mode in the kivy interface. When user click the button, the screen will jump to path input screen. User needs to type in the absolute path of the directory. If the path is incorrect, the system will inform user the error messages. After the sorting completed, the screen will jump to query interface where user can type in the query they want. If they choose the advance mode, they need to specify Boolean or Phrase search first before entering the query interface. In the query interface, the results will be displayed in the bottom of the screen as buttons. User can check the detail information of the return results by clicking the button with the document name they want. The screen will then display the full content of the file with the highlighted key query words. The GUI search system used the logic in basic, advance and superior search system. The file reading and search logic has no changed. The basic mode only allows one-word query and the Boolean mode cannot search “AND” and “OR” simultaneously. Besides, the superior mode will display the score of each return file, the user can therefore choose the one with higher rank.
If you like this, please leave a star.
Email: Peter chenqi1123344@gmail.com
Email: Key tavik002@gmail.com
MIT License
Copyright (c) 2019 key
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
All Copyright Reserved