# Step 2: Create documents
To create the documents in tab-separated values (tsv) format and load them into memory, we built a class named 'DataLoading' with two methods:
-LoadCSVandCreateTSVFiles transforms a single csv file into multiple tsv files. Each tsv file corresponds to a single row of the csv file.
-LoadTSVFilesDataIntoString loads the entire corpus into a list. Each item of this list is a string holding the content of one tsv file.
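The two methods can be pictured with the minimal sketch below. It is not the actual 'DataLoading' code: the function names, the `doc_<n>.tsv` file naming, and the choice to skip the header row are assumptions made for illustration.

```python
import csv
from pathlib import Path


def csv_to_tsv_files(csv_path, out_dir):
    """Write one single-row tsv file per data row of the csv; return the file count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        next(reader, None)  # assumption: the first row is a header and is skipped
        for i, row in enumerate(reader):
            # hypothetical naming scheme: doc_0.tsv, doc_1.tsv, ...
            with open(out_dir / f"doc_{i}.tsv", "w", newline="", encoding="utf-8") as dst:
                csv.writer(dst, delimiter="\t").writerow(row)
            count += 1
    return count


def load_tsv_files(out_dir):
    """Return the content of every tsv file as one string, ordered by document number."""
    paths = sorted(Path(out_dir).glob("doc_*.tsv"),
                   key=lambda p: int(p.stem.split("_")[1]))
    return [p.read_text(encoding="utf-8") for p in paths]
```

Sorting by the numeric part of the file name keeps list positions aligned with the csv row numbers, which a plain alphabetical sort would not (`doc_10` would come before `doc_2`).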

In [21]:
import DataLoading
import Preprocessing
import TextManagement
import TextMining
import DisplayResults

In [5]:
dl = DataLoading.DataLoading()

In [6]:
dl.LoadCSVandCreateTSVFiles()

This call assumes the csv file is stored at ./Resources/Airbnb_Texas_Rentals.csv and writes the tsv files to the folder ./Resources/tsvFiles.

In [7]:
raw_data = dl.LoadTSVFilesDataIntoString()

This loads the whole corpus into memory as a list of strings.

# Step 3: Search Engine
For the preprocessing task we built a class named 'Preprocessing'. It preprocesses the documents by removing stopwords and punctuation and by applying stemming.

In [8]:
preprocessing = Preprocessing.Preprocessing()
data = preprocessing.PreprocessDataForTextManagement(raw_data)

Given a list of strings, each containing the content of a tsv file, this method returns a list of objects representing the preprocessed tsv files. These objects are dictionaries whose keys are the field names.
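As a rough illustration of these preprocessing steps, here is a small sketch. It is not the 'Preprocessing' class itself: the stopword list is a tiny sample, `naive_stem` is a crude stand-in for a real stemmer such as Porter's, and storing each field as a list of tokens is an assumption.

```python
import string

# Tiny illustrative stopword list; a real pipeline would use a complete one.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "with"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_text(text):
    """Lowercase, strip punctuation, drop stopwords, and stem the remaining words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(w) for w in cleaned.split() if w not in STOPWORDS]


def preprocess_document(tsv_line, field_names):
    """Map each field name to the preprocessed tokens of the matching tsv column."""
    values = tsv_line.rstrip("\n").split("\t")
    return {name: preprocess_text(value) for name, value in zip(field_names, values)}
```

For instance, `preprocess_text("The Big Gardens!")` yields `["big", "garden"]`, so a document mentioning "gardens" and a query for "garden" end up with the same term.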

## 3.1.1) Create your index!
The class 'TextManagement' is responsible for creating, saving, and loading the inverted index.

In [9]:
textManagement = TextManagement.TextManagement()
invertedIndex = textManagement.CreateInvertedIndex(data)

In [10]:
textManagement.SaveInvertedIndexJson(invertedIndex, "inverted_index.json")

In [11]:
invertedIndex = textManagement.LoadInvertedIndexJson("inverted_index.json")
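The three operations above can be sketched as follows. This is a simplified picture under stated assumptions, not the 'TextManagement' code: each document is taken to be a flat list of preprocessed tokens, and the index maps a term directly to the ids of the documents containing it.

```python
import json


def create_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents that contain it."""
    index = {}
    for doc_id, tokens in enumerate(documents):
        for term in set(tokens):  # count each term once per document
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def save_index(index, path):
    """Serialize the index to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Read an index back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Using term strings as keys makes the JSON round trip lossless. Integer keys (for example term ids) would come back as strings after loading, because JSON object keys are always strings.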

## 3.1.2) Execute the query

In [16]:
print("Please Enter Search Query: ")
searchQuery = input()

Please Enter Search Query: 
big garden


In [17]:
searchQueryProcessed = preprocessing.PreprocessDataForTextMining(searchQuery)

Before executing the query, we preprocess it just as we did the documents. This is necessary because otherwise no query term would match a row of the inverted index.

In [18]:
textMining = TextMining.TextMining()
documentIndexes = textMining.SearchTextFromInvertedIndexAndReturnResults(invertedIndex, searchQueryProcessed)

The query is executed as follows:
for each word in the query:
    retrieve the corresponding term id
    retrieve all the documents containing the term id
The result is the intersection of the documents retrieved at each step.
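The procedure above can be sketched in a few lines. The function name is hypothetical, and for brevity the sketch looks terms up directly in the index instead of going through term ids.

```python
def conjunctive_search(index, query_terms):
    """Return the sorted ids of the documents that contain every query term."""
    if not query_terms:
        return []
    result = None
    for term in query_terms:
        postings = set(index.get(term, []))
        result = postings if result is None else result & postings
        if not result:
            break  # the intersection is already empty, no need to go on
    return sorted(result)
```

With the index `{"big": [0, 2], "garden": [0, 1]}`, the query `["big", "garden"]` returns `[0]`, the only document that contains both terms.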

In [29]:
from IPython.display import display

displayresults = DisplayResults.DisplayResults()
res = displayresults.PrintSimpleResults(documentIndexes)
display(res.head())

Unnamed: 0,Title,Description,City,Url
0,The Aggie Garden Cottage,My home is less than a mile to Texas A&M U...,College Station,https://www.airbnb.com/rooms/13318421?location...
1,New! A little bit country close to town/The Ga...,My place is close to family-friendly activitie...,Kyle,https://www.airbnb.com/rooms/17270667?location...
2,Cozy cottage in middle of town.,Quaint cottage surrounded by trees. House has ...,Beaumont,https://www.airbnb.com/rooms/18131905?location...
3,Welcome home!,My place is close to Walking distance/across t...,Roanoke,https://www.airbnb.com/rooms/15603065?location...
4,"COTTAGE GARDEN: Heights 2-bath, 2 bedroom, 2-s...",COTTAGE GARDENS: \nPrivate Rear Cottage ~ open...,Houston,https://www.airbnb.com/rooms/14184646?location...


## 3.2) Conjunctive query & Ranking score

### 3.2.1) Inverted index


In [28]:
invertedIndex = textManagement.CreateScoredInvertedIndex(data)
textManagement.SaveInvertedIndexJson(invertedIndex, "tfidf_inverted_index.json")
invertedIndex = textManagement.LoadInvertedIndexJson("tfidf_inverted_index.json")

Now the inverted index is built using the tf-idf scheme.
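To make the scheme concrete, here is a minimal sketch of a tf-idf scored index. The exact weighting used by 'CreateScoredInvertedIndex' is not shown in this notebook, so the variant below (term frequency normalized by document length, idf = log(N / document frequency)) is an assumption, as is the function name.

```python
import math
from collections import Counter


def create_scored_inverted_index(documents):
    """Map each term to a list of [doc_id, tf-idf score] pairs."""
    n_docs = len(documents)
    counts = [Counter(tokens) for tokens in documents]
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for c in counts for term in c)
    index = {}
    for doc_id, c in enumerate(counts):
        total = sum(c.values())
        for term, freq in c.items():
            tf = freq / total                          # relative frequency in this document
            idf = math.log(n_docs / doc_freq[term])    # rarer terms weigh more
            index.setdefault(term, []).append([doc_id, tf * idf])
    return index
```

With this variant, a term that occurs in every document gets a score of 0, so it does not affect the ranking.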

### 3.2.2) Execute the query


In [31]:
print("Please Enter Search Query: ")
searchQuery = input()
searchQueryProcessed = preprocessing.PreprocessDataForTextMining(searchQuery)
documentIndexes = textMining.SearchTextFromInvertedScoredIndexAndReturnResults(invertedIndex, searchQueryProcessed)

Please Enter Search Query: 
big garden


For Shahzad ... We use the tf-idf scheme for the query too, because in order to calculate the similarity ...

In [1]:
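# Re-create the helper: this cell ran in a fresh kernel, where the object from In [29] no longer exists.
displayresults = DisplayResults.DisplayResults()
res = displayresults.PrintScoredResults(documentIndexes)
display(res.head())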