# 3.1 Conjunctive query

The code is split into four Python modules, one per task:
1. cleaning_of_data.py cleans the raw data and the user's query.
2. writing_of_data.py writes everything that goes to disk: the TSV files, the vocabulary file and the inverted index file.
3. reading_of_data.py reads data back from disk, including inverted_index.json and vocabulary.json.
4. search_engine_processing.py holds the search engine's processing logic, such as running the conjunctive query.

The cell below shows them in action.

In [1]:
import cleaning_of_data
import writing_of_data
import reading_of_data
import search_engine_processing

CSV_FILE_NAME = "Airbnb_Texas_Rentals.csv"  # constant csv file name
FOLDER_NAME_FOR_TSV_File = "doc_files"
VOC_FILE_NAME = "vocabulary.json"
INVERTED_INDEX_FILE_NAME = "inverted_index.json"

# cleaning of data
# df = cleaning_of_data.open_csv_file_and_remove_extra_values(CSV_FILE_NAME)
# writing_of_data.create_tsv_files(df, FOLDER_NAME_FOR_TSV_File)

# applying NLTK techniques

# These calls create the vocabulary and inverted index files.
# writing_of_data.create_vocabulary_file(len(df), FOLDER_NAME_FOR_TSV_File, VOC_FILE_NAME)
# writing_of_data.create_inverted_index_file(len(df), FOLDER_NAME_FOR_TSV_File, INVERTED_INDEX_FILE_NAME)

# Now we have our inverted_index_file
inverted_index_dic = reading_of_data.get_inverted_index_file(INVERTED_INDEX_FILE_NAME)

query = input()
words = cleaning_of_data.remove_extras_from_query(query)

result_items = search_engine_processing.run_simple_conjunctive_query(words, inverted_index_dic)
df = writing_of_data.output_results(FOLDER_NAME_FOR_TSV_File, result_items)

beautiful house with garden


In [2]:
df

Unnamed: 0,Title,Description,City,Url
0,Unique Location! Alamo Heights - Designer Insp...,"Stylish, fully remodeled home in upscale NW – ...",San Antonio,https://www.airbnb.com/rooms/17481455?location...
1,Beautiful queen bedroom in NW Austin,"My house is close to Lakeline Mall, highways a...",Austin,https://www.airbnb.com/rooms/16755710?location...
2,Unique Location! Alamo Heights - Designer Insp...,"Stylish, fully remodeled home in upscale NW – ...",San Antonio,https://www.airbnb.com/rooms/17481455?location...
3,East Austin Hillside Gem,"Beautiful and modern 3Br, 2.5Ba located minute...",Austin,https://www.airbnb.com/rooms/17555039?location...
4,"The Woodlands, BEAUTIFUL HOME, 1 Floor, 2 BT, ...","Attractions: The Woodlands, incredible views, ...",Spring,https://www.airbnb.com/rooms/13065223?location...
5,The Vintage room in Fort Worth,Our place is a beautiful cozy open concept hou...,Fort Worth,https://www.airbnb.com/rooms/18959678?location...
6,"Vintage Airstream in East Austin, T",This fantastic backyard garden oasis has been ...,Austin,https://www.airbnb.com/rooms/949922?location=B...
7,Superb Studio Apartment with Garden View,Excellent design and natural beauty make this ...,San Antonio,https://www.airbnb.com/rooms/17255843?location...
8,The Vintage room in Fort Worth,Our place is a beautiful cozy open concept hou...,Fort Worth,https://www.airbnb.com/rooms/18959678?location...
9,Relaxing house and garden,Three room house situated in the Hill Country ...,Kyle,https://www.airbnb.com/rooms/2927741?location=...
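
To make the conjunctive (AND) query concrete, here is a minimal sketch of the idea: build an inverted index that maps each term to the documents containing it, then intersect the posting sets of all query terms. The function names `build_inverted_index` and `conjunctive_query` and the three toy documents are assumptions made for illustration. They are not the contents of writing_of_data.py or search_engine_processing.py, which may also stem terms and look them up through vocabulary.json.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def conjunctive_query(terms, index):
    """Return the ids of documents that contain every query term (logical AND)."""
    if not terms:
        return []
    # A term missing from the index has an empty posting set,
    # so the intersection is empty as well.
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


# Toy documents, hypothetical and unrelated to the Airbnb data set.
docs = {
    0: "beautiful house with garden",
    1: "beautiful room near the airport",
    2: "house with a beautiful garden view",
}
index = build_inverted_index(docs)
result = conjunctive_query(["beautiful", "house", "garden"], index)
print(result)  # [0, 2]
```

Document 1 is dropped because it contains "beautiful" but neither "house" nor "garden". A conjunctive query only tests membership, so it returns matches without ranking them.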


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [41]:
docA = "The game of life is a game of everlasting learning"
docB = "The unexamined life is not worth living"
docC = "Never stop learning"
query = "life learning"
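# Join with spaces so the last word of one document does not fuse with the first word of the next.
dictionary = " ".join([docA, docB, docC])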

In [42]:
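# The input parameter only accepts 'filename', 'file' or 'content'; the text itself is passed to fit_transform().
tfidf = TfidfVectorizer(sublinear_tf=False)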

In [44]:
response = tfidf.fit_transform([query])

In [46]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = tfidf.get_feature_names_out()

for doc in range(response.shape[0]):
    print("---------- Document %s -------" % doc)
    feature_index = response[doc, :].nonzero()[1]
    tfidf_scores = zip(feature_index, [response[doc, x] for x in feature_index])
    for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
        print(w, s)

---------- Document 0 -------
life 0.7071067811865475
learning 0.7071067811865475
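
Both scores above equal 1/sqrt(2) because the vectorizer was fitted on the query alone: the two terms have the same term frequency and the same IDF, and the vector is L2-normalized. To rank documents, the IDF weights have to come from the document collection instead. The sketch below works through that in plain Python, using the smoothed IDF formula that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`) are assumptions for illustration, and whitespace tokenization is a simplification: TfidfVectorizer's default tokenizer also drops one-character tokens such as "a", so the numbers will not match scikit-learn exactly.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (simpler than scikit-learn's tokenizer)."""
    return text.lower().split()


def fit_idf(corpus):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(corpus)
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))
    return {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}


def tfidf_vector(text, idf):
    """L2-normalized TF-IDF weights; terms unknown to the corpus are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two already L2-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


docs = {
    "docA": "The game of life is a game of everlasting learning",
    "docB": "The unexamined life is not worth living",
    "docC": "Never stop learning",
}
query = "life learning"

idf = fit_idf(list(docs.values()))  # IDF learned from the documents, not the query
query_vec = tfidf_vector(query, idf)
scores = {name: cosine(query_vec, tfidf_vector(text, idf)) for name, text in docs.items()}

for name in sorted(scores, key=scores.get, reverse=True):
    print(name, scores[name])

# Fitting on the query alone reproduces the 1/sqrt(2) weights printed above.
query_only = tfidf_vector(query, fit_idf([query]))
```

Sorting by cosine similarity gives a ranked result list, which the conjunctive query by itself cannot provide.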


welcome  -  0.11704114719613057
to  -  0.11704114719613057
stay  -  0.11704114719613057
in  -  0.11704114719613057
private  -  0.3511234415883917
room  -  0.11704114719613057
with  -  0.23408229439226114
queen  -  0.11704114719613057
bed  -  0.23408229439226114
and  -  0.11704114719613057
detached  -  0.11704114719613057
bathroom  -  0.11704114719613057
on  -  0.11704114719613057
the  -  0.11704114719613057
second  -  0.11704114719613057
floor  -  0.11704114719613057
another  -  0.11704114719613057
bedroom  -  0.11704114719613057
sofa  -  0.11704114719613057
is  -  0.23408229439226114
available  -  0.23408229439226114
for  -  0.3511234415883917
additional  -  0.23408229439226114
guests  -  0.11704114719613057
10  -  0.23408229439226114
an  -  0.11704114719613057
guest  -  0.11704114719613057
10min  -  0.11704114719613057
from  -  0.11704114719613057
iah  -  0.11704114719613057
airport  -  0.23408229439226114
pick  -  0.11704114719613057
up  -  0.11704114719613057
drop  -  0.11704114719