# Step 2: Create documents
To create the documents in tab-separated values (TSV) format and load them into memory, we built a class named 'DataLoading' containing two methods:
- LoadCSVandCreateTSVFiles transforms a single csv file into multiple tsv files. Each of these tsv files corresponds to a single row of the csv file.
- LoadTSVFilesDataIntoString loads the entire corpus into a list. Each item of this list is a string corresponding to one tsv file.

In [1]:
import DataLoading
import Preprocessing
import TextManagement
import TextMining
import DisplayResults

In [2]:
dl = DataLoading.DataLoading()

In [3]:
dl.LoadCSVandCreateTSVFiles()

This method assumes the csv file is stored at ./Resources/Airbnb_Texas_Rentals.csv and writes the tsv files to the folder ./Resources/tsvFiles.

In [4]:
raw_data = dl.LoadTSVFilesDataIntoString()

This loads the data into memory.
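The two DataLoading steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual class: the function names, the `doc_<n>.tsv` naming scheme, and the choice to skip the header row are all hypothetical.

```python
import csv
from pathlib import Path


def csv_to_tsv_files(csv_path, out_dir):
    """Write one TSV file per CSV data row; return the number of files written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        next(reader, None)  # assumption: the first row is a header and is skipped
        for i, row in enumerate(reader):
            # Hypothetical naming scheme: doc_<row number>.tsv
            with open(out / f"doc_{i}.tsv", "w", newline="", encoding="utf-8") as dst:
                csv.writer(dst, delimiter="\t").writerow(row)
            count += 1
    return count


def load_tsv_files(tsv_dir):
    """Return the content of every TSV file as a string, ordered by row number."""
    files = sorted(
        Path(tsv_dir).glob("doc_*.tsv"),
        key=lambda p: int(p.stem.split("_")[1]),
    )
    return [p.read_text(encoding="utf-8") for p in files]
```

Sorting by the numeric suffix keeps the position in the returned list aligned with the row number of the original csv file, which is convenient when document indexes are later used to look up results.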

# Step 3: Search Engine
For the preprocessing task we built a class named 'Preprocessing'. It preprocesses the documents by removing stopwords and punctuation and by applying stemming.

In [5]:
preprocessing = Preprocessing.Preprocessing()
data = preprocessing.PreprocessDataForTextManagement(raw_data)

Given a list of strings, each containing the content of a tsv file, this returns a list of objects representing the preprocessed tsv files. These objects are dictionaries whose keys are the names of the fields.
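As a rough illustration of this step, the sketch below lowercases the text, strips punctuation, drops stopwords, and stems the remaining tokens. Everything here is an assumption made for the example: the tiny stopword list, the crude suffix stripper that stands in for a real stemmer (such as Porter's), and the function names.

```python
import string

# Tiny illustrative stopword list; a real run would use a much fuller one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_text(text):
    """Lowercase, strip punctuation, drop stopwords, then stem each token."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(tok) for tok in cleaned.split() if tok not in STOPWORDS]


def preprocess_document(tsv_line, field_names):
    """Map each field name to the list of preprocessed tokens of that field."""
    values = tsv_line.rstrip("\n").split("\t")
    return {name: preprocess_text(value) for name, value in zip(field_names, values)}
```

For example, `preprocess_document("Nice Rooms\tAustin\n", ["title", "city"])` yields a dictionary with the keys `title` and `city`, matching the per-field layout described above.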

## 3.1.1) Create your index!
The class 'TextManagement' is responsible for creating, saving and loading the inverted index.

In [6]:
textManagement = TextManagement.TextManagement()
invertedIndex = textManagement.CreateInvertedIndex(data)

In [7]:
textManagement.SaveInvertedIndexJson(invertedIndex, "inverted_index.json")

In [8]:
invertedIndex = textManagement.LoadInvertedIndexJson("inverted_index.json")
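A minimal sketch of what creating, saving and loading such an index might look like is given below. The layout is an assumption: a vocabulary mapping each term to a term id, and an index mapping each term id to the list of documents containing it. The function names are hypothetical and not those of TextManagement.

```python
import json


def build_inverted_index(docs):
    """docs: list of token lists. Returns (vocabulary, index).

    vocabulary maps term -> term id; index maps term id -> sorted document ids.
    """
    vocabulary = {}
    index = {}
    for doc_id, tokens in enumerate(docs):
        # dict.fromkeys removes duplicates while keeping first-seen order
        for term in dict.fromkeys(tokens):
            term_id = vocabulary.setdefault(term, len(vocabulary))
            index.setdefault(term_id, []).append(doc_id)
    return vocabulary, index


def save_index(vocabulary, index, path):
    """Store both structures in a single JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"vocabulary": vocabulary, "index": index}, f)


def load_index(path):
    """Load the structures written by save_index."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # JSON object keys are always strings, so term ids are converted back to int.
    return data["vocabulary"], {int(k): v for k, v in data["index"].items()}
```

The key conversion in `load_index` matters: without it, an index reloaded from JSON would be keyed by strings and lookups by integer term id would silently miss.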

## 3.1.2) Execute the query

In [9]:
print("Please Enter Search Query: ")
searchQuery = input()

Please Enter Search Query: 
big garden


In [10]:
searchQueryProcessed = preprocessing.PreprocessDataForTextMining(searchQuery)

Before executing the query, we preprocess it exactly as we did the documents. This is necessary because otherwise the query terms would not match the (stemmed) terms stored in the inverted index.

In [11]:
textMining = TextMining.TextMining()
documentIndexes = textMining.SearchTextFromInvertedIndexAndReturnResults(invertedIndex, searchQueryProcessed)

The query is executed as follows:
for each word in the query:
    retrieve the corresponding term id
    retrieve all the documents containing the term id
The result is the intersection of the documents retrieved at each step.
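The procedure above can be sketched as a small function. It assumes a vocabulary (term to term id) and an index (term id to list of document ids); these structures and the function name are illustrative, not the actual TextMining API.

```python
def conjunctive_search(query_terms, vocabulary, index):
    """Return the sorted ids of the documents that contain every query term."""
    result = None
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            return []  # an unknown term cannot match any document
        postings = set(index.get(term_id, []))
        result = postings if result is None else result & postings
        if not result:
            return []  # the intersection is already empty, stop early
    return sorted(result) if result is not None else []
```

Returning early as soon as the running intersection is empty avoids touching the postings of the remaining terms.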

In [13]:
from IPython.display import display

displayresults = DisplayResults.DisplayResults()
res = displayresults.GetSimpleResults(documentIndexes)
display(res.head())

Unnamed: 0,Title,Description,City,Url
0,Travis Heights Bungalow 2/1,Charming 1940a bungalow in one of Austin's mos...,Austin,https://www.airbnb.com/rooms/5021987?location=...
1,Maison d'Etre,"Perfect ACL, F1, SXSW location! A colorful str...",Austin,https://www.airbnb.com/rooms/5037508?location=...
2,Lovely big room with private bath & entrance,"Bedroom with one bed, attached bath and privat...",San Antonio,https://www.airbnb.com/rooms/2905792?location=...
3,Luxurious Coastal Cottage,My place is close to The Seawall. My place is ...,Galveston,https://www.airbnb.com/rooms/15161770?location...
4,New! A little bit country close to town/The Ga...,My place is close to family-friendly activitie...,Kyle,https://www.airbnb.com/rooms/17270667?location...


## 3.2) Conjunctive query & Ranking score

### 3.2.1) Inverted index


In [14]:
invertedIndex = textManagement.CreateScoredInvertedIndex(data)
textManagement.SaveInvertedIndexJson(invertedIndex, "tfidf_inverted_index.json")
invertedIndex = textManagement.LoadInvertedIndexJson("tfidf_inverted_index.json")

Now the inverted index is built using the tf-idf scheme.
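For illustration, a tf-idf weighted index could be built along these lines. The exact variant used by CreateScoredInvertedIndex is not shown, so this sketch assumes one common choice: tf as the term count divided by the document length, and idf as the natural logarithm of N / df.

```python
import math
from collections import Counter


def build_tfidf_index(docs):
    """docs: list of token lists. Returns {term: [(doc_id, tf-idf weight), ...]}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    index = {}
    for doc_id, tokens in enumerate(docs):
        if not tokens:
            continue  # an empty document contributes no postings
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            index.setdefault(term, []).append((doc_id, tf * idf))
    return index
```

With this variant, a term that occurs in every document gets weight 0, since it does not help to tell documents apart.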

### 3.2.2) Execute the query


In [15]:
print("Please Enter Search Query: ")
searchQuery = input()
searchQueryProcessed = preprocessing.PreprocessDataForTextMining(searchQuery)
documentIndexes = textMining.SearchTextFromInvertedScoredIndexAndReturnResults(invertedIndex,searchQueryProcessed)

Please Enter Search Query: 
big garden


For Shahzad: we use the tf*idf scheme for the query too, because in order to calculate the similarity ...
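The note above breaks off before naming the similarity measure. A common choice for comparing a tf-idf query vector with a tf-idf document vector is cosine similarity, so the sketch below assumes that measure; it is not confirmed by the source. Vectors are represented as `{term: weight}` dictionaries, which is also an assumption.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (q_norm * d_norm)
```

Weighting the query with the same scheme as the documents puts both vectors in the same space, which is what makes the comparison meaningful.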

In [16]:
res = displayresults.GetScoredResults(documentIndexes)
display(res.head())

Unnamed: 0,Title,Description,City,Url,Score
0,Luxurious Coastal Cottage,My place is close to The Seawall. My place is ...,Galveston,https://www.airbnb.com/rooms/15161770?location...,1.0
1,Peaceful home near airport & downtown,Our home is filled with warmth from lots of na...,San Antonio,https://www.airbnb.com/rooms/19014109?location...,1.0
2,Travis Heights Bungalow 2/1,Charming 1940a bungalow in one of Austin's mos...,Austin,https://www.airbnb.com/rooms/5021987?location=...,1.0
3,Travis Heights Bungalow 2/1,Charming 1940a bungalow in one of Austin's mos...,Austin,https://www.airbnb.com/rooms/5021987?location=...,1.0
4,Welcome home!,My place is close to Walking distance/across t...,Roanoke,https://www.airbnb.com/rooms/15603065?location...,1.0


# Step 4: Define a new score!
To define a new score, we build an inverted index based on the terms in the title and description fields.
For each document containing a specific term, we create a posting containing:
* index of the document
* average price per night
* number of bedrooms
* city
* publication date

The query string is split into two parts:
* words
* numbers

If a number is between 0 and 15, it's understood to be the number of bedrooms required by the user.
If it is greater than 15, it is recognized as a price.
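The rule above might be implemented along these lines. This is a sketch with assumptions: the function name and the returned dictionary layout are hypothetical, only non-negative whole numbers are treated as numbers, and the bounds 0 and 15 are taken to be inclusive.

```python
def split_query(query):
    """Separate a query into words, a bedroom count, and a price."""
    words, bedrooms, price = [], None, None
    for token in query.split():
        if not token.isdecimal():
            words.append(token)  # anything that is not a whole number is a word
            continue
        number = int(token)
        if number <= 15:
            bedrooms = number  # 0..15 is read as a number of bedrooms
        else:
            price = number  # anything above 15 is read as a price
    return {"words": words, "bedrooms": bedrooms, "price": price}
```

For the query `apartment with 2 bedrooms`, this returns the three words, a bedroom count of 2, and no price.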

In [17]:
invertedIndex = textManagement.CreateInvertedIndexWithNewScore(data)
# save the index to disk
textManagement.SaveInvertedIndexJson(invertedIndex, "table-custom-scored.json")
# load the index back from the file
invertedIndex = textManagement.LoadInvertedIndexJson("table-custom-scored.json")

Now the inverted index is built as specified above.

In [18]:
print("Please Enter Search Query: ")
searchQuery = input()

Please Enter Search Query: 
apartment with 2 bedrooms


In [19]:
searchQueryProcessed = preprocessing.PreprocessDataForTextMiningCustomScore(searchQuery)

The tokens of the query string are divided into two categories, namely numbers and words.
We apply stemming and stopword and punctuation removal to the tokens that are words.

In [20]:
documentIndexes = textMining.SearchTextFromInvertedCustomScoredIndexAndReturnResults(invertedIndex,searchQueryProcessed)

The execution of the query proceeds as follows:
1. for each word w in the query
    2. get all the documents that contain w
    3. for each of these documents d:
        4. save the identifier of d
        5. save all other information of d (number of bedrooms, average price per night, ...)
6. create a set from all the identifiers stored at step 4.
7. create a priority queue from the set obtained in the previous step \*
8. extract the k documents with the highest priority

Notice that the query is not conjunctive anymore.

\* The priority is defined as (15300 - s) / 15300, where s is the weighted sum of the absolute differences in price, number of bedrooms, city (0 or 1), and date (in days). The weights, which sum to 15300, are assigned as follows:
* 10000 is given to the price
* 1100 is given to the number of rooms
* 200 is given to the city
* 4000 is given to the date
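A sketch of the scoring and top-k extraction follows. It reads the formula as (15300 - weighted sum) / 15300, which is consistent with the scores below 1 shown in the results but remains an interpretation. It also assumes each absolute difference has already been scaled to [0, 1]; the source does not say how that scaling is done. The function names and the `candidates` layout are hypothetical.

```python
import heapq

# Weights quoted in the text; they sum to 15300.
WEIGHTS = {"price": 10000, "bedrooms": 1100, "city": 200, "date": 4000}
TOTAL = sum(WEIGHTS.values())


def priority(differences):
    """differences: {field: absolute difference, assumed scaled to [0, 1]}."""
    penalty = sum(WEIGHTS[field] * differences.get(field, 0.0) for field in WEIGHTS)
    return (TOTAL - penalty) / TOTAL


def top_k(candidates, k):
    """candidates: {doc_id: differences}. Return the k best (doc_id, score) pairs."""
    scored = ((doc_id, priority(diff)) for doc_id, diff in candidates.items())
    # heapq.nlargest keeps only k items in its heap while scanning the candidates
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

Under these assumptions a perfect match scores 1.0, and the large weight on price means a price mismatch lowers the score far more than a different city does.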

In [21]:
res = displayresults.GetScoredResults(documentIndexes)
display(res.head())

Unnamed: 0,Title,Description,City,Url,Score
0,Beachfront Lovely Condo Sleeps 5-6 FREE WIFI!!,Lovely Beachfront Condo sleeps 5-6 (four adult...,Corpus Christi,https://www.airbnb.com/rooms/19209640?location...,0.950261
1,"Nice home & neighborhood Wifi, cable, kitchen etc","Nice house, we are renting a bedroom that shar...",Katy,https://www.airbnb.com/rooms/19244509?location...,0.950261
2,Courts of McCallum,This is a 2bhk apartment shared by students fr...,Dallas,https://www.airbnb.com/rooms/19389772?location...,0.945686
3,Autumn Sunrise Private bed and bath,"One private bedroom, full bath in a subdivisio...",San Antonio,https://www.airbnb.com/rooms/18757453?location...,0.941699
4,"Nice, quite, warm and welcoming. Nice part of ...",I have a new home only 5yrs old. Modern in sty...,Leander,https://www.airbnb.com/rooms/19133409?location...,0.938497


# Bonus Step: Make a nice visualization!

In [1]:
print("Enter comma-separated coordinates:")
cc = input()
# parse "lat, lon" into a pair of floats
lat, lon = cc.split(",")
cor = (float(lat), float(lon))
print("Distance in km: ")
dis = float(input())


Enter comma-separated coordinates:
29.804659, -95.397209
Distance in km: 
5
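The internals of MapDrawer are not shown here. One plausible way to select the listings that lie within the requested distance of the given coordinates is the haversine great-circle distance, sketched below. The function names, the `listings` layout, and the use of the mean Earth radius of 6371 km are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_radius(center, radius_km, listings):
    """listings: iterable of (name, (lat, lon)). Keep those within radius_km."""
    return [name for name, point in listings if haversine_km(center, point) <= radius_km]
```

With the inputs shown above, `within_radius(cor, dis, listings)` would keep only the listings at most 5 km from the chosen point.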


In [2]:
import MapDrawer

md = MapDrawer.MapDrawer(cor, dis)
md.draw()