# Homework 3 - Find the perfect place to stay in Texas!

### Our team:
Giulia Scikibu Maravalli  
Ivana Nastasic  
Sravya Chowdary

### Introduction:
Our target in this homework is to help an hypotetical user of Airbnb to find the perfect place to stay in Texas. In order to make it possible, we have to create a search engine that, given a request, will print out the most relevants results.  
We begin cleaning the data, and then creating a simple search engine that will retrive all the properties with a description/title that matches the user's query. After, we improve it by setting scoring functions. We allow the user to search on map as well.  

But first of all, let's import libraries needed.

In [1]:
# Import libraries

import pandas as pd
import csv, sys
import json
import nltk
from nltk.corpus import stopwords
import string
from textblob import TextBlob
from nltk.stem import PorterStemmer
from collections import defaultdict
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import heapq
import folium
import geopy
from geopy import distance

## Step 1 : Download the data.

We download the csv data file called *Airbnb_Texas_Rentals*. We store it in a Data frame using Pandas, and then perform some pre-cleaning:
1. drop column of *Unamed*
2. remove duplicates
2. remove newline '\n' (beacuse during printing it is interpreted as a character, not newline)
3. remove hex notation '\x...' (for similar reason as above)  

Note: we noticed that there are duplicates, i.e. same properties listed several times, as consequence when we show the result of our research they could appear multiple times. In our opinion, it would be inconvinient for the user, therefore we decide to remove all the duplicates.

In [2]:
# Read the initial csv file and remove unnamed column
df = pd.read_csv('Airbnb_Texas_Rentals.csv')
df = df.drop('Unnamed: 0', axis = 1)
#keeping only rows in which title is unique
df.drop_duplicates(subset='title', keep="last", inplace=True)
df.reset_index(drop=True, inplace=True)
df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\\n',  ' ', regex=True) #remove newline \n
df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'[^\x00-\x7f]',  ' ', regex=True) #remove hex notation

## Step 2: Create documents

In this step we create create a .tsv file for each record of the dataset *df* (i.e. every row corresponds to a different Airbnb listed property).
In order to perform this task, we take every row of the dataframe and write it inside a new .tsv file. 

In [3]:
for i in range(len(df)):
        #with str(i) we can create a different document for every iteration
        new = open('C:/Users/giuli/Desktop/Prova Generale/doc_i/doc_'+str(i)+'.tsv', 'w')
        for j in range(9):
            #since it is a .tsv file, we separate each of the 9 columns with a tab \t
            new.write('%s\t' %df.iloc[i, j])
        new.close()

## Step 3: Search Engine

Before implementing the search, we preprocess our new tsv docs by removing stopwords and punctuation, converting to lowercase and stemming (i.e. reducing words to their word stem, base or root).  
Each of these actions are implemented as separate function, afterwords we are going to apply them to each tsv file.

In [4]:
# Create function to remove stop words from specified column from data frame
# in our case we'll call it for the title (column 4) and description (column 7)
def remove_stopwords(df, n):
    #remove stopwords
    stop = stopwords.words('english')
    df[n] = df[n].apply(lambda x: " ".join(x for x in x.split() if x not in stop)) 
    return df

In [5]:
# Create function to remove punctuation from specified column from data frame
# in our case we'll call it for the title (column 4) and description (column 7)
def remove_punctuation(df, n):
    df[n] = df[n].str.replace('[^\w\s]','') 
    return df

In [6]:
# Create function to put everything to lowercase from specified column from data frame
# in our case we'll call it for the title (column 4) and description (column 7)
def to_lowercase(df, n):
    df[n] = df[n].apply(lambda x: " ".join(x.lower() for x in x.split())) 
    return df

In [7]:
# Create function to do stemming from specified column from data frame
# in our case we'll call it for the title (column 4) and description (column 7)
def stemming(df, n):
    st = PorterStemmer()
    df[n] = df[n].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
    return df

### 3.1) Conjunctive query
Then we estract all unique words from the documents, this will be useful, after, to build the dictionaries.  

In [8]:
# Create a function which makes unique words from max 2 given columns from data frame
# second argument can be omitted

def unique_words(df, n, m = None):
    a=set(" ".join(df[n]).split(" "))
    if m == None:
          b=set() # empty set for the second argument
    else:
        b=set(" ".join(df[m]).split(" "))
    # return unique words
    return a.union(b)

For every tsv file, we are going to store it into a data frame, and then work only on the columns *description* and *title* as required by the task. We apply the functions defined above for pre-processing and use the function unique_words to make a set of unique words appearing in the doc.   
At first we create an empty dictionary, called *vocabulary*, our intent is to map each word to an integer. Therefore we iterate through each word inside the unique_word set, and if the word is not yet in the dictionary we add a new numeric key and place the word as its value.  
Then in order to create the *Inverted Index*, we need to do an intermediate passage, we create a dictionary (with defalutfdict) that takes as key the unique word and as value the list of documents in with this word appears.  
In this way we have just to merge the two previuosly created dictionaries to create the *Inverted index* (the values of the first one are the key of the second one).  
We save the created dictionaries (*vocabulary* and *Inverted index*) in format of both csv and txt file. Afterwords, when we need to use this information we are going to work with the csv, and extract the dictionary as dataframe. 

In [9]:
# dictionary of all unique words appearing in documents
vocab={}
# word index of all unique words appearing in documents
word_idx=defaultdict(list)

# inverted index
inverted_idx={}

for i in range(len(df)):
    #temoporaly create a dataframe from our tsv files
    tsv = pd.read_csv('C:/Users/giuli/Desktop/Prova Generale/doc_i/doc_{}.tsv'.format(i), sep='\t', encoding = 'ISO-8859-1', header = None)
    tsv[4] = tsv[4].astype(str)
    tsv[7] = tsv[7].astype(str)
        
    # replacing / with the space
    tsv = tsv.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'/',  ' ', regex=True)
    
    # Data pre-processing
    tsv=remove_stopwords(tsv, 4) # remove stop words in title
    tsv=remove_stopwords(tsv, 7) # remove stop words in description
    tsv=remove_punctuation(tsv, 4) # remove punctuation in title
    tsv=remove_punctuation(tsv, 7) # remove punctuation in description
    tsv=to_lowercase(tsv, 4) # put title into lower case
    tsv=to_lowercase(tsv, 7) # put title into lower case
    tsv=stemming(tsv, 4) # make stemming in title
    tsv=stemming(tsv, 7) # make stemming in description
    
    # make a set of unique words from description and title of a document
    doc_unique_words=unique_words(tsv,4,7)
    
    # update current vocabulary with the new words found in a document
    for w in doc_unique_words:
        if w in vocab.values(): # if word is already in general vocabulary skip it
            continue
        else:
            if any(vocab)==False: #if vocabulary is empty then define initial integer key for the first word
                v_idx=10000 #term ids will start from ten thousand
                vocab.update( { v_idx: w} ) # add a word to the vocabulary
            else: # if vocabulary already has words, take the first available key (max(key)+1) for the next word
                v_idx=max(list(vocab.keys()))+1
                vocab.update( { v_idx: w} ) # add a word to the vocabulary
    
    # intemediate passage with defalutdict            
    # create index dictionary which for each word will have list of document numbers which contain that word
    # w is the word, and i in the doc id
    for w in doc_unique_words:
        word_idx[w].append(i)
word_idx=dict(word_idx)

# Inverted index for all the words - dictionary with word_id as a key and list of the id of the documents that contain the word
inverted_idx=dict((key,word_idx[value]) for (key, value) in vocab.items())

with open('inverted_index.csv', 'w') as f: #save dict as csv
    w = csv.DictWriter(f, inverted_idx.keys())
    w.writeheader()
    w.writerow(inverted_idx)
    
with open('inverted_index.txt', 'w') as file: #save dict as txt
    file.write(json.dumps(inverted_idx))

with open('vocabulary.csv', 'w') as f:
    w = csv.DictWriter(f, vocab.keys()) #save dict as csv
    w.writeheader()
    w.writerow(vocab)

with open('vocabulary.txt', 'w') as file: #save dict as txt
    file.write(json.dumps(vocab))

Now we define a function that takes as parameter a text query given in input by the user.
At first, we preprocess the query as we did with the tsv file (i.e. remove stopwords, stemming, etc.), then for every word inside the query, we check in which document it appers, thanks to the vocabulary and Inverted index created earlier, and make a set intersection to find the documents that cointain all the words of the query. The function returns a list with these doc-ids. 

In [10]:
def search_fun(q): # search function is taking a user query as an input
    
    query=[str(q)]
    query = pd.DataFrame(query) # transform input query into data frame
    
    
    # Pre-processing input query
    
    query=remove_stopwords(query, 0) # remove stop words
    query=remove_punctuation(query, 0) # remove punctuation
    query=to_lowercase(query, 0) # put query into lower case
    query=stemming(query, 0) # make stemming
       
    # find all unique words appearing in input query
    query_set = unique_words(query, 0)
    
    # read prepared vocabulary of all the words appearing in input documents
    # store the result in the data frame which will have word id-s as column names and all words in one row
    vocabulary = pd.read_csv('vocabulary.csv')
    document_idx=pd.read_csv('inverted_index.csv')

    search_res=set() # initiate search result as an empty set
    for w in query_set:
        # search if the word is in vocabulary and find corresponding id
        word_id=vocabulary.loc[:,vocabulary.eq(w).any()].columns.values[0]
        if(len(word_id)==0): # in case that word is not found in the vocabulary
            word_id=0
            return(list()) # if one of the words doesn't appear in vocabulary return empty result
        else:
            if len(search_res)==0:
                search_res=set(re.findall(r'\d+',(document_idx[word_id])[0]))
            else:
        # each time when word is found check if it appears in any common document with other words
                s=set(re.findall(r'\d+',(document_idx[word_id])[0]))
                search_res=search_res.intersection(s) 
    search_res=list(search_res)
    return search_res

Let's try our search on a query:

In [11]:
query = input('Book unique homes and experiences in Texas! Write below what are you looking for: ')

Book unique homes and experiences in Texas! Write below what are you looking for: Room in San Antonio


In [12]:
query

'Room in San Antonio'

In [13]:
search_res=search_fun(query)
#we need to convert to int beacuse in our dict the index is a int
search_res=[int(x) for x in search_res] 

Now we print out the result in a dataframe, not only title and description, but also city and url.

In [14]:
#create the dataframe with selected doc_i
d = {}
table = pd.DataFrame(columns=['Title', 'Description', 'City', 'Url']) #empty dataframe to be filled
for i in search_res:
    d[i] = pd.read_csv('C:/Users/giuli/Desktop/Prova Generale/doc_i/doc_{}.tsv'.format(i), sep='\t', encoding = 'ISO-8859-1', header = None) #trasform doc_i to dataframe
    df = d[i]
    
    #remove dollar sign beacuse it create a lot of troubles  
    df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\$',  ' ', regex=True) 
    df = df[[7, 4, 2, 8]] #keep just needed columns
    df.columns = ['Title', 'Description', 'City', 'Url'] #rename columns
    table = table.append(df) #append the corrected dataframe
    

#improve dataframe display:

#remove index
table.reset_index(drop = True, inplace = True)

#make url clickable
table = table.style.hide_index()
def make_clickable(val):
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)
table = table.format({'Url': make_clickable})

#change fontsize, allign the text and edit dimension of cells
table = table.set_table_styles({'props': [('font-size', '10pt')]}).set_properties(**{'font-size': '10.5pt'})
d = dict(selector="th", props=[('text-align', 'center')])
table = table.set_properties(**{'width':'20em', 'height':'10em', 'text-align':'center'}).set_table_styles([d])

table #print table

Title,Description,City,Url
Downtown Guest Suite With a View,"This is a great guest suite, practically a full fledged private apartment with a great view and fantastic natural light. Over 500 square feet. Just walking blocks or a trolley ride away from everything downtown San Antonio. Great living room to relax in. Kitchenette offers a refrigerator, microwave, coffee maker, toaster, all utensils, plates, bowls, glassware, and a set of wine glasses. There is a 2-burner hotplate--no stove or oven. Separate bedroom with queen bed, closet space, dresser and flat screen TV. There is a full-size bed in the living area to accommodate your entire party. Large bathroom with deep antique tub. Great private 2nd floor patio deck with a nice view. Enjoy a view of the area and catch a glimpse of the downtown skyline. Wireless Internet, Cable TV. 2 blocks from Metropolitan Methodist Hospital. Down the block from Luther s Cafe and Bar, Armadillo's (burgers), Main Street Pizza, Lulu's (giant Chicken Fried Steak and cinnamon rolls, Luby's Restaurant, Subway, The Cove (fish tacos and live rockabilly. Fast food strip just down the other block: Starbucks, McD's, Whataburger, Sonic, Pizza Hut, Burger King, Jack in the Box, etc. Crocket Park is across the street with a playground for the kids. 25 minute walk to central downtown and the Alamo. 5 minutes to Pearl Brewery. Many other easy to get to attractions just minutes away such as the San Antonio Zoo and Museums. 1 minute from IH35, IH37, Hwy 281, IH-10. You can be anywhere in minutes. Check in is at 3pm and check out is at 10am. See you soon. We have a secure property with electronic driveway gate. We will provide you with gate opener and keys. We are available via phone or text whenever you have questions or need suggestions. You have a private entrance and we have a separate entrance. Our neighborhood is very nice. We are in a community college area, they college is actually just a block away. We have a nice restaurants and local hang-outs like Luther's, The Cove, Armadillo's and others. We are a walking, running and biking community. And best of all, downtown and other great sites, like the San Antonio Museum of Art are just a walk way. Hey, if you haven't used Uber, you might want to give it a try. Please use our promo code: harveym22 if you do. Via Metro Bus lines are available down the block on San Pedro and Main streets. There's also a trolley stop about 4 blocks away that will take you downtown. Cost is 1.25 cash and you pay an additional .15 cents for a trolley transfer. The transfer needs to be used within 3 hours, so you can use it to return or transfer to another trolley. Taxis service is also available. The main thing we ask from our guests is that they make sure to close the gate upon arrival and departure. We also ask that they lock the front door upon arrival and departure. That's about it.",San Antonio,https://www.airbnb.com/rooms/2016685?location=Colorado%20River%2C%20TX
Private Room near Fiesta Texas,"Lovely quiet neighborhood just outside San Antonio in Helotes only 10 minutes away from Fiesta Texas & UTSA, and 15 minute to Sea World. Two full bathrooms available for use. You're welcome to use the kitchen, deck, TV room, and washer & dryer.",Helotes,https://www.airbnb.com/rooms/6360252?location=Boerne%2C%20TX
Friendly room near sea world,House in a neighborhood near sea world. 20 minutes from downtown San Antonio.,San Antonio,https://www.airbnb.com/rooms/18811810?location=Castroville%2C%20TX
Private room in cozy house.,Come and visit or stay longer here in beautiful San Antonio. 1 king bed 1 queen air mattress 1 sleeping pad all sheets for king and queen in room. Sleeping bag for sleeping pad/ 5th adult.,San Antonio,https://www.airbnb.com/rooms/16507221?location=Bulverde%2C%20TX
Private room (Lion Room) in NE San Antonio,"This 4 Bedroom 2.5 bath white brick home is located in a quiet neighborhood off of IH 35. Minutes from the Airport, Retama Polo Center, Randolph Air Force Base and Fort Sam Houston. 1.5 bath is shared with one other Airbnb guest.",San Antonio,https://www.airbnb.com/rooms/18209957?location=Cibolo%2C%20TX
Alamo Ranch/Sea World Area Room in San Antonio,"My place is close to Sea World and Fiesta Texas, Lackland Air Force Base, and the Helotes area. The airport, city center, and medical center are all an easy drive away using local highways. The room is cozy and comfortable with a queen sized bed (extra air mattress available as needed). Guests have their own bathroom just next to the bedroom and can use a shared kitchen, office, and laundry room. My place is good for couples, solo adventurers, business travelers, and families (with kids).",San Antonio,https://www.airbnb.com/rooms/13166279?location=Castroville%2C%20TX
"SIngle Room, Queensize bed,own bath","The house is in a very nice, new development. Quiet, it is five miles from downtown New Braunfels, 25 miles from downtown San Antonio and 60 Miles from Austin. The location is between Schertz and New Braunfels, Closer to New Braunfels, I am 5 miles from the circle in the middle of New Braunfels and 6 Miles from Schlitterbahn. If you are looking for a good price, close to New Braunfels, this is the listing for you. this area has uber and lyft.",New Braunfels,https://www.airbnb.com/rooms/8283243?location=Cibolo%2C%20TX
Cozy Room w/Private Bath & Entrance,This cozy 1 bedroom is the perfect place to rest your head after a day exploring the wonderful city of San Antonio. Located in the heart of Blossom Park on the NW Side of San Antonio. Comfortably sleeps 2 people. Please be sure to read house rules.,San Antonio,https://www.airbnb.com/rooms/14261491?location=Converse%2C%20TX
Comfy bedroom near SAMMC,"Military couple looking to rent out private bedroom and shared bathroom. Room includes memory foam topper on regular full size bed, dresser, closet and desk. Brand new house in quiet neighborhood of San Antonio, just 9 miles from SAMMC. Close to Forum shopping center, Costco and Randolph AFB. Please, no smoking or unauthorized guests. Owners live on site with well behaved dog. Room comes with Wifi and TV. Shared kitchen and other living spaces included with laundry.",San Antonio,https://www.airbnb.com/rooms/14686950?location=Cibolo%2C%20TX
Family Luxury Home in San Antonio,"Come and visit San Antonio in this Luxury Family Home. 1 mile from Shops at la Cantera and The Rim. The first bedroom has 1 king bed and its oun bathroom. Two more bedrooms with two doble beds each. One kids room with bunk beds (with playing space) and another bathroom to share. Spacious kitchen, dinning room and big TV room, laundry room, garage, outsid grill. Swimming pool in the common areas. Check out our pictures!! Very spacious, 2098 square feet!!",San Antonio,https://www.airbnb.com/rooms/17385274?location=Bulverde%2C%20TX


### 3.2) Conjunctive query & Ranking score
#### 3.2.1) Inverted index
In order to implement the *second Inverted Index*, we have to compute the tf-idf for every word in each document. We decide to proceed step by step to make the code more readable.  
At first, we preprocess the 'description'/'title' columns.     
After, we compute the idf. The Inverse document frequency is the log of the ratio of the total number of rows to the number of rows in which that word is present, then it does not change from document to document, i.e. we can compute idf once for every word in a set of documents.    
Then, we calculate tf (= term frequency, the number of time a term appears in a document), to compute it we iterate through each row of the created dataframe 'new'. Once we compute the tf, we normalize it dividing by the total number of words in that document (we do not want to favorize tf score of word in long sentences), after we multiply the normalized tf to its corresponding idf value, now we have tf-idf for every word in the document.  
At the end, we create the *second Inverted Index* dictionary with the structure requested by the assignment.  
But now let's see in practice how we implement the previous explanation in our code:

First, we need to use all the listed properties, then we decide to work again with *Airbnb_Texas_Rentals.csv* as a dataframe.  
For every row of the dataframe, we merge the description and title columns in one new cell and apply on it the same preprocessing that we did on tsv files to create dictionaries, in this way is like we have all our tsv file, cleaned, in a dataframe.

In [15]:
# Preparation
df = pd.read_csv('Airbnb_Texas_Rentals.csv')
df = df.drop('Unnamed: 0', axis = 1)
df.drop_duplicates(subset='title', keep="last", inplace=True)
df.reset_index(drop=True, inplace=True)
df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\\n',  ' ', regex=True) #remove newline \n
df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'[^\x00-\x7f]',  ' ', regex=True) #remove hex notation
new = df.copy()
new['text'] = new['description']+''+new['title'] 
new = new.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'/',  ' ', regex=True)
new['text'] = new['text'].astype(str)
new=remove_stopwords(new, 'text') # remove stop words in title
new=remove_punctuation(new, 'text') # remove punctuation in title
new=to_lowercase(new, 'text') # put title into lower case
new=stemming(new, 'text') # make stemming in title
text = new['text']

text[:5] #just a slice of 'text' series to show what it is, we will work with the whole series 

0    welcom stay privat room queen bed detach priva...
1    first class comfort condo best view bottom dec...
2    thi home north side san antonio 3 minut away g...
3    my place close downtown kerrvil beauti pine ce...
4    cute two bedroom lot window sunni back deck fr...
Name: text, dtype: object

To compute idf we use the function *TfidfVectorizer()*, and we fit it on 'text' (merging of description+title). With *vectorizer.vocabulary_* we create a dictionary, called *vec*, that map each word (key) to a index (value), and with *vectorizer.idf_* we compute the idf value for each word, the result is stored inside a list *l*. This list share as index, the value associated to each word in *vec* dictionary, then with a simple for loop, we put inside a new dictionary *idf* as key the words and as value the corresponding idf.

In [16]:
#calculate idf:

vectorizer = TfidfVectorizer()
vectorizer.fit(text) #vectorize the series of documents
vec = vectorizer.vocabulary_ 
l = vectorizer.idf_ #calculate the idf for each word and stored them into a list

#create a dictionary containing as key the word and as its idf as value:
idf = {}
for key, value in vec.items():
    idf[key] = l[value]

Now we iterate through every document in 'text' series. We calculate tf with Counter, and store the result inside a dictionary, that has as key the word and as value the tf, then normalize the just dividing by the length of the dictionary itself (= n. of words in the document).
We create a new dictioanty *tfIdf*, in witch keys are the words and value is the product of tf to its  corresponding idf value in *idf* dictionary (these dictionaries share the same keys).
At the end we store inside a defaultdict *improved_dict* the words as key and as value a list of tuples, each tuple contains the id of the document in which the word appears and the corresponding tf-idf.

In [17]:
improved_dict = defaultdict(list)

#for every document
for i in range(len(text)):
    
    #calcualte term frequency (tf) 
    test = text[i]
    count = Counter(test.split()) #calculare tf 
    d = dict(count)
    sum_ = len(d)
    tf = {key: value / sum_ for key, value in d.items()} #normalize tf by dividing for the total number of words in the doc
    
    #calculate tf-Idf
    tfIdf = {x: tf[x] * idf[x] for x in tf if x in idf} #for every word we compute tf*idf
    
    #create a dictionary with the word as key and a list as value, the list contains tuples, every tuple has as first element
    #the doc id in which the word appears and a 
    for key, values in tfIdf.items():
        improved_dict[key].append((i, values))

Now we need just to replace the key (words) in *improved_dict* with the corresponding term_id created in STEP 3.1 and stored inside *vocabulary*, in the following code we find the match and create our *final_dict*. It has as key term-id and as value a list of tuples, each tuple contains the id of the document in which the word appears and the correspondinf tf-idf.

In [18]:
final_dict = {}
vocabulary = pd.read_csv('vocabulary.csv')
#in this code we replace the key in improved_dict,i.e. the words, with their correspinding numeric index
#'vocabulary' was the previous dict, uploaded as dataframe, that has as value the word and as index the number associated to
#each word
for key, value in improved_dict.items():
        try:
            match = (vocabulary == key).any() #if the key (word) in improved dict equals the value of vocabulary
            match_n = match.index[match] #we match the correspondent key in vocabulary
            word_id = match_n[0]
            #we put in a new dict as key the number associated to the word, the value (list of tuples) remains unchanged
            final_dict[word_id] = value 
            
        except:
            pass

Now we save the *second Inverted Index* dictionary in a csv and txt files.

In [19]:
with open('second_inverteid.csv', 'w') as f: #save dict as csv
    w = csv.DictWriter(f, final_dict.keys())
    w.writeheader()
    w.writerow(final_dict)

with open('second_invertedid.txt', 'w') as file: #save dict as txt
    file.write(json.dumps(final_dict))

#use the code below if you want to open the file as dictionary
#d = json.load(open('second_invertedid.txt'))

#### 3.2.2) Execute the query
Based on tf-idf values, we now need to compute the cosine similarity between a query typed by the user and our documents, and display the top-k relevant results.  
We decide to show the top ten results, and to use a heap structure to retrive them in a time-efficient way.  
First, we preprocess the query (removing stopwords, stemming etc.), after with *TfidfVectorizer()* we make as vector both our 'text' (description+title) and the query (we fit the query on the 'text). We then compute the cosine similarity (the cosine of angle between two vectors) with the function *linear_kernel*. We have as result a sparse matrix, in few passages we store it into a heap, with *heapq.heappush()*, we put a negative sign on the cosine similarity we push inside the heap beacause the fuction used return the smallest values.
Then with a for loop, iterating k (=10) times, we retrive (using *heapq.heappop()*) the ten most relevant results.

In [20]:
query = input('Book unique homes and experiences in Texas! Write here what are you looking for: ')

Book unique homes and experiences in Texas! Write here what are you looking for: Room in San Antonio


In [21]:
query

'Room in San Antonio'

In [22]:
query=[query]
query = pd.DataFrame(query) # transform input query into data frame
    
    
# Pre-processing input query

query=remove_stopwords(query, 0) # remove stop words
query=remove_punctuation(query, 0) # remove punctuation
query=to_lowercase(query, 0) # put query into lower case
query=stemming(query, 0)

for i in query[0]:
    s = i #trasform back to string format

tfidf = TfidfVectorizer().fit_transform(text) #tf-idf for all the docs
queryTFIDF = TfidfVectorizer().fit(text)
queryTFIDF = queryTFIDF.transform([s]) #tf-idf for the query
cosine_similarities = linear_kernel(queryTFIDF, tfidf).flatten() #cosine similarity

#make cosine_similarities from an narray to a pd.Series()
similarity = list(cosine_similarities)
similarity = pd.Series(similarity)

#HEAP:
h = [] #empty list where we will push value with heap
count = 0
for i in range(len(new)):
    cosine_sim = similarity.values[i]
    heapq.heappush(h, (-cosine_sim, i)) #push values with heappush
    #we put cosine_sim to negative value beacuse the function return the smallest items

k = 10 #top-k we are condidering
l = []
for i in range(k):
    l.append(heapq.heappop(h)) #keep the first k elements 

#put in a list all similarity values and in another list all the correspondent doc ids
#the two list share the same index
#these two list will be useful later, when we are going to print out the results inside a dataframe
similarity = []
related_docs_indices = []
for i in l:
    similarity.append(i[0])
    related_docs_indices.append(i[1])

Now we display our result in a dataframe, showing Description and Title, as well as, City, Url and Cosine similarity.

In [23]:
#create the dataframe with selected doc_i
d = {}
sim = []
count = 0
table = pd.DataFrame(columns=['Title', 'Description', 'City', 'Url']) #empty dataframe to be filled
for i in related_docs_indices:
    d[i] = pd.read_csv('C:/Users/giuli/Desktop/Prova Generale/doc_i/doc_{}.tsv'.format(i), sep='\t', encoding = 'ISO-8859-1', header = None) #trasform doc_i to dataframe
    df = d[i]
    
    #remove dollar sign beacuse it create a lot of troubles  
    df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\$',  ' ', regex=True) 
    df = df[[7, 4, 2, 8]] #keep just needed columns
    df.columns = ['Title', 'Description', 'City', 'Url'] #rename columns
    table = table.append(df) #append the corrected dataframe

#remove index
table.reset_index(drop = True, inplace = True)

#add similarity column to table dataframe
table['Similarity'] = pd.Series(similarity) 
table.Similarity = table.Similarity*(-1)

#improve dataframe display:

#make url clickable
table = table.style.hide_index()
def make_clickable(val):
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)
table = table.format({'Url': make_clickable})

#change fontsize, allign the text and edit dimension of cells
table = table.set_table_styles({'props': [('font-size', '10pt')]}).set_properties(**{'font-size': '10.5pt'})
d = dict(selector="th", props=[('text-align', 'center')])
table = table.set_properties(**{'width':'20em', 'height':'10em', 'text-align':'center'}).set_table_styles([d])

table #print table

Title,Description,City,Url,Similarity
Home away from home! Full house in NW San Antonio.,"A charming 1,700 square foot house in a great neighborhood, close to shopping and dining at the Shops of La Cantera and The Rim. Located minutes from the University of Texas at San Antonio and Six Flags. 5 minutes to the closest HEB. Twenty-five minutes from San Antonio International Airport, downtown San Antonio, the Riverwalk and Sea World.",San Antonio,https://www.airbnb.com/rooms/15309942?location=Boerne%2C%20TX,0.581281
"Quiet Comfort, guest room w/queen bed & bathroom","This room is in a house that is located in a quiet, safe neighborhood. Two city parks are close by - one within walking distance (great for dog walking). It is close to San Antonio's bus system, and in both Uber and Lyft service areas. San Antonio airport is a 15 minute drive, and downtown San Antonio is less than 30 minutes. There are lots of dine-in and fast-food restaurants within a few miles. This place is best for couples, solo adventurers, business travelers, and furry friends (pets).",Converse,https://www.airbnb.com/rooms/15394842?location=Converse%2C%20TX,0.505368
Quaint room in NE San Antonio,Quaint room in NE San Antonio. Housemates reside in 2 of the 3 bedrooms. Easy access to freeway and accessible to shopping outlets.,Schertz,https://www.airbnb.com/rooms/19059682?location=Cibolo%2C%20TX,0.494236
Cozy private room with shared bathroom,"My place is close to downtown, and other trendy places in San Antonio including \",San Antonio,https://www.airbnb.com/rooms/17418648?location=Bulverde%2C%20TX,0.480808
Private Guesthouse close to Downtown San Antonio.,"Welcome to our Casita de San Antonio! Our residence Is located in Dignowity Hill, one of San Antonio's famed historic districts. This unique neighborhood is located just minutes from the Airport, Downtown (the Alamo), Pearl Brewery, Southtown, Alamo Quarry Market, Breckenridge Zoo, multiple museums including the Witte, DoSeum and San Antonio Museum of Art, the AT&T Center, and the AlamoDome. Your one-bedroom casita boasts brand new furniture, a separate living space and fully equipped kitchen.",San Antonio,https://www.airbnb.com/rooms/19196614?location=Alamo%20Heights%2C%20TX,0.47111
Two - Story Condo /17 min From Base,"This condo features the following: - A wood-burning fireplace - Two (2) bedrooms upstairs... each with its own private bath - One bedroom has a king size bed and the other bedroom has two (2) queen beds. - Sofa (downstairs) converts to queen bed - Half bath conveniently located downstairs - Covered parking for two vehicles - Outdoor pool for recreation Great location provides for short distance to many attractions in and around our great city. It is conveniently located close to Loop 410, making it easy access to downtown (15 minutes), San Antonio International Airport (20 minutes), Sea World (10 minutes), Fiesta Texas (20 minutes), movie theaters, malls and restaurants. You can visit San Antonio s lovely River walk, our historic missions, and the Alamo downtown. Have a romantic evening and dine on San Antonio s Riverwalk, then take an evening stroll or be romantic and ride through downtown San Antonio in a quaint carriage. Come visit San Antonio and stay with us. We want to be \",San Antonio,https://www.airbnb.com/rooms/969446?location=Castroville%2C%20TX,0.469921
Apartment near Downtown San Antonio,"One side of a duplex built in 1918 less than two miles from Downtown San Antonio. Everything is close on foot, public transit or bicycle. Near the Mission Reach of the SA River, San Antonio Missions & Southtown and King William Neighborhoods.",San Antonio,https://www.airbnb.com/rooms/8275239?location=Bulverde%2C%20TX,0.468378
25 minutes to Austin or San Antonio,20 minutes to Austin 25 minutes to San Antonio,Kyle,https://www.airbnb.com/rooms/4872865?location=Buda%2C%20TX,0.463806
Gorgeous quiet retreat,"Beautiful and unique, stay in this clean and remodeled home, in the quiet and safe Schertz community , perfect location you will be close to the nicest places San Antonio offers but without the chaos, 25 minutes from San Antonio Airport 26 minutes to New Braunfels 30 minutes to San Antonio River walk 23 minutes natural Bridge Caverns Great for any family or a group",Schertz,https://www.airbnb.com/rooms/18911324?location=Converse%2C%20TX,0.463154
Cozy San Antonio Vacation Home,"Our apartment is a home away from home. Room for the whole family. Great, central location for all San Antonio's tourist sites, family attractions and great golf courses. Just minutes from military bases and training.",San Antonio,https://www.airbnb.com/rooms/3803080?location=Cibolo%2C%20TX,0.459528


# Step 4: Define a new score!
In this step we will allow user to refine the search results. After entering the initial search text query, user can enter the maximum price he would like to pay per room per night and then we sort the result based on the price score.
For enabling user to correctly enter the maximum price we made the function *price_input*, which is checking user input and executes recursively until it is correct (i.e. non negative integer is entered). New score is defined in function *price_score* which takes a list of document numbers (result of the search function based on user query) and maximum price given by user. Score is calculated as a difference between price in the document and maximum price. Smaller the score value is, the higer document will be ranked in the final output.

In [24]:
# Function to handle correct user input for maximum price
# It is recursive until user enters correct non negative integer
def price_input():
    print("Please enter the maximum price per room per night: ")
    max_price=input()
    if max_price.isdigit() == False:
        print("Price should be a positive integer number")
        return price_input()
    else:
        return int(max_price)

In [25]:
# for the list of documents calculate their score relative to the room price
def price_score(related_docs_indices, max_price): 
    d={} # dictionary which will keep the score for each document
    for i in search_res:
        df= pd.read_csv('C:/Users/giuli/Desktop/Prova Generale/doc_i/doc_{}.tsv'.format(i), sep='\t', encoding = 'ISO-8859-1', header = None) #trasform doc_i to dataframe
        #remove dollar sign 
        df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\$',  '', regex=True) 
        df=df[[0]] # keep just a price column
        df.columns=['average_rate_per_night']
        if df['average_rate_per_night'].isnull()[0]:
            #in case that price is NaN score will have value equal to the max price specified by client
            score=1000000 
        else:
            # score is a difference between actual room price and specified max price 
            score=int(df['average_rate_per_night'])- max_price 
        d[i]=score # add a score for the document to the dictionary
    return d

Lets test a search with the new scoring mechanism:

In [26]:
# Enter a search query
query = input('Book unique homes and experiences in Texas! Write here what are you looking for: ')

Book unique homes and experiences in Texas! Write here what are you looking for: Room in San Antonio


In [27]:
# Enter a maximum price per room per night
max_price=price_input()

Please enter the maximum price per room per night: 
100


In [28]:
# Find documents which contain words from the input query
search_res=search_fun(query)
search_res=[int(x) for x in search_res]

In [29]:
# Calculate score based on the maximum price
doc_score=price_score(search_res,max_price)

Using the heap structure from Python we are sorting search result based on the price score. The smallest value for the score will be displayed the first.

In [30]:
# 10 results which are best scored
heap = [(value, key) for key,value in doc_score.items()]
score= heapq.nsmallest(10, heap) 

Now we print out the result in a dataframe, not only title and description, but also city, url and price.

In [31]:
#create the dataframe with selected doc_i
d = {}
table = pd.DataFrame(columns=['Title', 'Description', 'City', 'Url', 'Price']) #empty dataframe to be filled
for x in score:
    i=x[1]
    d[i] = pd.read_csv('C:/Users/giuli/Desktop/Prova Generale/doc_i/doc_{}.tsv'.format(i), sep='\t', encoding = 'ISO-8859-1', header = None) #trasform doc_i to dataframe
    df = d[i]
    
    #remove dollar sign beacuse it create a lot of troubles  
    df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\$',  ' ', regex=True) 
    df = df[[7, 4, 2, 8, 0]] #keep just needed columns
    df.columns = ['Title', 'Description', 'City', 'Url', 'Price'] #rename columns
    table = table.append(df) #append the corrected dataframe
    

#improve dataframe display:

#remove index
table.reset_index(drop = True, inplace = True)

#make url clickable
table = table.style.hide_index()
def make_clickable(val):
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)
table = table.format({'Url': make_clickable})

#change fontsize, allign the text and edit dimension of cells
table = table.set_table_styles({'props': [('font-size', '10pt')]}).set_properties(**{'font-size': '10.5pt'})
d = dict(selector="th", props=[('text-align', 'center')])
table = table.set_properties(**{'width':'20em', 'height':'10em', 'text-align':'center'}).set_table_styles([d])

table #print table

Title,Description,City,Url,Price
J&S Casa,"I am retired military looking to meet new folks and share part of my home. My space is great for couples, solo adventurers, and business travelers. The location is only 15 minutes from beautiful Downtown San Antonio/River Walk. Enjoy your time in a private bed room and bath. Please be aware that I have one small dog, but she will not disturb you during your stay :) I look forward to meeting you!",Converse,https://www.airbnb.com/rooms/17941881?location=Cibolo%2C%20TX,20
"Quiet Comfort, guest room w/queen bed & bathroom","This room is in a house that is located in a quiet, safe neighborhood. Two city parks are close by - one within walking distance (great for dog walking). It is close to San Antonio's bus system, and in both Uber and Lyft service areas. San Antonio airport is a 15 minute drive, and downtown San Antonio is less than 30 minutes. There are lots of dine-in and fast-food restaurants within a few miles. This place is best for couples, solo adventurers, business travelers, and furry friends (pets).",Converse,https://www.airbnb.com/rooms/15394842?location=Converse%2C%20TX,22
"Spacious Room Near Ft. Sam, Randolph and Airport.","Just north the busy street of Walzem and where you can find everything to include San Antonio's IN-N-OUT, this house is a brand new 3 bedroom, 2.5 bathroom in the perfect location. The inner 410 loop and I-35 junction is literally a straight shot west with easy access to both and only 5 miles from Randolph and 7 miles from Fort Sam Houston and 8 miles from the San Antonio Airport.",San Antonio,https://www.airbnb.com/rooms/16568668?location=Bulverde%2C%20TX,22
Cozy Room with fast Wifi only 15 min from Airport,"This guest room features a bed, two nightstands, a small couch, and lamp desk. There is plenty of space to store your clothes. We have high-speed Wifi (100Mbps). Luisa and I both work from home and our offices are located on the second floor as well. Our dog Jacoby is very friendly and social. HEB grocery store, bus station and restaurants are within walking distance. Airport: 15 min travel time. Downtown San Antonio: 25 min travel time. McAllister Park: 5 min travel time.",San Antonio,https://www.airbnb.com/rooms/16738572?location=Cibolo%2C%20TX,25
"Super Clean, Comfy and Cozy - Watercolor Room","The home is a lovely, clean, spacious, contemporary two-story in a warm, inviting setting. Park on right side of the driveway or in front of the home. Conveniently located 15 minutes from downtown San Antonio, just off of a major highway. One and a half bathrooms are shared, if there are other guests.",Converse,https://www.airbnb.com/rooms/2781214?location=Cibolo%2C%20TX,25
Private room (Lion Room) in NE San Antonio,"This 4 Bedroom 2.5 bath white brick home is located in a quiet neighborhood off of IH 35. Minutes from the Airport, Retama Polo Center, Randolph Air Force Base and Fort Sam Houston. 1.5 bath is shared with one other Airbnb guest.",San Antonio,https://www.airbnb.com/rooms/18209957?location=Cibolo%2C%20TX,25
Comfy bedroom near SAMMC,"Military couple looking to rent out private bedroom and shared bathroom. Room includes memory foam topper on regular full size bed, dresser, closet and desk. Brand new house in quiet neighborhood of San Antonio, just 9 miles from SAMMC. Close to Forum shopping center, Costco and Randolph AFB. Please, no smoking or unauthorized guests. Owners live on site with well behaved dog. Room comes with Wifi and TV. Shared kitchen and other living spaces included with laundry.",San Antonio,https://www.airbnb.com/rooms/14686950?location=Cibolo%2C%20TX,25
J and S Casa 2,"I am retired military looking to meet new folks and share part of my home. My space is great for solo adventurers, and business travelers. The location is only 15 minutes from beautiful Downtown San Antonio/River Walk. Enjoy your time in a private bed room and bath. Please be aware that I have one small dog, but she will not disturb you during your stay :) I look forward to meeting you!",Converse,https://www.airbnb.com/rooms/18152740?location=Cibolo%2C%20TX,25
Private Room & Bathroom by Randolph & San Antonio,"My husband and I have an extra private room in our house with a double bed and access to closet and drawers. The guest bathroom has been recently remodeled. Shared kitchen, fridge, and washer and dryer are available. We do have two cats for those who don't like pets or are allergic. My husband speaks German so German visitors are especially welcome. We are 21 minutes from the convention center in downtown San Antonio. We are about 10 minutes from Randolph Air Force Base.",Universal City,https://www.airbnb.com/rooms/16577690?location=Cibolo%2C%20TX,25
"Super Clean, Comfy and Cozy - Green Tree Room","The home is a lovely, clean, spacious, contemporary two-story in a warm, inviting setting. Park on right side of the driveway or in front of the home. Conveniently located 15 minutes from downtown San Antonio, just off of a major highway. One and a half bathrooms are shared, if there are other guests.",Converse,https://www.airbnb.com/rooms/2809776?location=Cibolo%2C%20TX,25


# Bonus Step: Make a nice visualization!

In this step we ask the user to enter as input the coordinates of the place and the distance by which he/she are willing to find an accomodation.
We create a map with folium library, the map has the center in the given coordinates, we also add a circle (the radius is the distance given by the user) to show the area taken into account for the search.
Then we iterate through all the coordinates presents in Airbnb texas retals. With geopy distance we compute the difference in distance between the location inserted by the user and every coodinates in our dataset, if the distance is less than the one selected by the user, a maker is added to the map.  
We make interactive markers, if you click on a marker the price of that property will be shown and it you click on that price, you will be redirected to the url link of this listed room/appartment in Airbnb.  
We have a try, inserting the coordinates of Texan city of San Antonio, and put a small distance, 2000m (=2km), as input (we choose a small area in order to make the visualisation more clear). 

In [32]:
#we take as inputs coordinates and the distance in meters 
lt = float(input('Enter a latitude: '))
lg = float(input('Enter a longitude: '))
dist = float(input('Enter distance in meters: '))
coord = [lt, lg]

#we create a map with the center in our coordinates
m = folium.Map(coord, zoom_start=13)
#we put a mark in center
tooltip = 'Location Selected'
folium.Marker(coord, tooltip=tooltip, icon=folium.Icon(icon='home', color = 'green')).add_to(m)
#create a circle of radius equal to distance inserted by the user
folium.Circle(location=coord, radius=dist, fill_color='#3186cc').add_to(m)

#keep needed columns from original dataframe 'new'
df = pd.read_csv('Airbnb_Texas_Rentals.csv')
df = df.drop('Unnamed: 0', axis = 1)
df.drop_duplicates(subset='title', keep="last", inplace=True)
df.reset_index(drop=True, inplace=True)
new = df.copy()
loc = new[['latitude', 'longitude', 'average_rate_per_night', 'title', 'url']] 
input_loc = (coord[0], coord[1])
for i in range(len(loc)+1): #for every document:
    try:
        lat = loc.iloc[i].latitude
        lon = loc.iloc[i].longitude
        new_loc = (lat, lon) #we take coordinates
        #caluculate distance (in km) from the coordinates given by the user
        d = distance.distance(input_loc, new_loc).km 
        n = dist/1000 #convert distance given by the user from m to km
        #if distance is less than the one selected by the user, we create a marker on the map
        if d < n: 
            lnk = "<a href=\""+loc.iloc[i].url+"\""+" target=\"_blank\">"+loc.iloc[i].average_rate_per_night+"</a>"
            folium.Marker([lat, lon], icon=folium.Icon(icon='home'),
                          popup=folium.Popup(lnk)).add_to(m)
    except:
        pass

m

Enter a latitude: 29.4241
Enter a longitude: -98.4936
Enter distance in meters: 2000
