## Purpose
- **Input:** Airbnb listings and reviews for a single location.
- **Output:** The files below, which are used to generate dialogues and training data for the conversation recommendation module.
    - listings_info_filter
    - ratings_filter
    - listings_slot_value_filter
    - Neo4j_nodes.csv
    
- 5.5k listings | 250k reviews

### References:

- https://neo4j.com/developer/guide-import-csv/

### Import libraries, raw data

In [24]:
import pandas as pd
import os
import numpy as np
import warnings
warnings.filterwarnings('ignore')

### Preprocessing helper methods

In [25]:
root='./Data/raw/'
processor = './Data/processing/'

In [26]:
listings = pd.read_csv(root+'listings.csv.gz')


In [27]:
## These are the columns that will be used in the dialog generation.
listingTemplate = pd.read_csv(processor+'listing_Template.csv')
listing_reviews = pd.read_csv(processor+'Processed_Airbnb/listing_with_reviews.csv')
listing_reviews = listing_reviews.set_index('listing_id').T.to_dict('list')
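
The `set_index(...).T.to_dict('list')` chain turns the reviews table into a dictionary keyed by `listing_id`, where each value is the list of the remaining column values in column order. A minimal sketch on toy data follows; the column names `n_reviews` and `comments` are assumptions for illustration, since the real layout of `listing_with_reviews.csv` is not shown here.

```python
import pandas as pd

# Toy stand-in for listing_with_reviews.csv.
# 'n_reviews' and 'comments' are hypothetical column names.
toy = pd.DataFrame({
    "listing_id": [2818, 20168],
    "n_reviews": [2, 1],
    "comments": ["['Great host.', 'Nice place.']", "['Cosy room.']"],
})

# Each listing_id becomes a key; the other columns become a list, in column order.
lookup = toy.set_index("listing_id").T.to_dict("list")

print(lookup[2818][1])  # second non-index column for listing 2818
```

With this shape, position `1` in each list is the second non-index column, so a later lookup of the form `listing_reviews[lid][1]` depends on the column order of the CSV.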

### Compute unique entities, create Neo4j nodes & edge map.

In [28]:
# listings = listings[listingTemplate.columns]
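# Not applied: listings.dropna() returns a new frame instead of modifying listings,
# and dropping rows with any missing value would discard at least the 1,494
# listings that have no neighborhood_overview.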
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5402 entries, 0 to 5401
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            5402 non-null   int64  
 1   listing_url                                   5402 non-null   object 
 2   scrape_id                                     5402 non-null   int64  
 3   last_scraped                                  5402 non-null   object 
 4   name                                          5402 non-null   object 
 5   description                                   5392 non-null   object 
 6   neighborhood_overview                         3908 non-null   object 
 7   picture_url                                   5402 non-null   object 
 8   host_id                                       5402 non-null   int64  
 9   host_url                                      5402 non-null   o

In [29]:
len(listings)

5402
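
The row count stays at 5402 because a bare `dropna()` call returns a new DataFrame and leaves the original untouched. The sketch below shows the difference on toy data; the frame and the `subset` choice are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3], "description": ["a", np.nan, "c"]})

toy.dropna()                  # result is discarded, toy is unchanged
rows_unassigned = len(toy)

# Keep the result, and restrict the check to the columns that matter.
cleaned = toy.dropna(subset=["description"])
rows_cleaned = len(cleaned)

print(rows_unassigned, rows_cleaned)
```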

In [30]:
### Add geo-location columns (TODO: find a more efficient way to add location).
listings['City'] = 'Amsterdam'
listings['Country'] = 'Netherlands'
# Fall back to neighbourhood_cleansed where neighbourhood is missing, then trim whitespace.
listings['neighbourhood'] = listings['neighbourhood'].fillna(listings['neighbourhood_cleansed'])
listings['neighbourhood'] = listings['neighbourhood'].str.strip()
listings['State'] = listings['neighbourhood']
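
The neighbourhood handling above fills gaps from a second column and then trims whitespace. A small sketch of that pattern, using made-up values rather than rows from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "neighbourhood": [" Amsterdam, North Holland ", np.nan],
    "neighbourhood_cleansed": ["Centrum-West", "Oud-Oost"],
})

# fillna with a Series aligns on the index, so each gap takes the value from the same row.
toy["neighbourhood"] = (
    toy["neighbourhood"].fillna(toy["neighbourhood_cleansed"]).str.strip()
)

print(toy["neighbourhood"].tolist())
```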

#### Add all unique entities

In [31]:
### Add listings text description
# 

col_parse = ['id',
 'picture_url',
 'host_identity_verified',
 'neighbourhood',
 'neighbourhood_cleansed',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms_text',
 'bedrooms',
 'beds',
 'price',
 'City',
 'Country',
 'State']

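# Start each description with the city, then append "column:value;" for every parsed column.
# .at writes to the frame directly; chained indexing such as
# listings['Listing_Text'][i] = ... may assign to a temporary copy.
listings['Listing_Text'] = listings['City']

for i, row in listings.iterrows():
    fields = ''.join(str(c_name) + ':' + str(row[c_name]) + ';' for c_name in col_parse)
    listings.at[i, 'Listing_Text'] = listings.at[i, 'Listing_Text'] + fields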
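
The same `column:value;` string can also be built column by column instead of row by row, which avoids `iterrows()` altogether. This is a sketch of one possible alternative on toy data, assuming every column converts cleanly with `astype(str)`; it is not the approach used in the notebook.

```python
import pandas as pd

# Toy frame with a subset of the parsed columns.
toy = pd.DataFrame({
    "id": [2818, 20168],
    "room_type": ["Private room", "Entire home/apt"],
    "City": ["Amsterdam", "Amsterdam"],
})
cols = ["id", "room_type", "City"]

# Start from the city, then append "name:value;" one column at a time.
text = toy["City"].copy()
for c in cols:
    text = text + c + ":" + toy[c].astype(str) + ";"
toy["Listing_Text"] = text

print(toy["Listing_Text"].iloc[0])
```

The loop runs once per column rather than once per row, so the cost grows with the number of parsed fields instead of the number of listings.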
    


In [32]:
# Look up each listing's reviews; position 1 of the stored list holds the review text.
def lookup_review(lid):
    try:
        return listing_reviews[lid][1]
    except (KeyError, IndexError):
        return 'no review found'

listings['Review_Text'] = listings['id'].map(lookup_review)

In [33]:
listings['Review_Text']

0       ['Daniel is really cool. The place was nice an...
1       ["The location of Sasha's B&B makes it ideal f...
2       ['Why stay anywhere else in the city of canals...
3       ['Very nice place and people.  Great location!...
4       ['Not a long ago I stood at  Downtown Guesthou...
                              ...                        
5397                                      no review found
5398                                      no review found
5399                                      no review found
5400                                      no review found
5401                                      no review found
Name: Review_Text, Length: 5402, dtype: object
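
The tail of the output shows the `'no review found'` placeholder. A quick way to measure how many listings fell back to it is to compare the column against that string; the sketch below uses a toy Series in place of `listings['Review_Text']`.

```python
import pandas as pd

# Toy stand-in for listings['Review_Text'].
review_text = pd.Series(
    ["['Great host.']", "no review found", "no review found"],
    name="Review_Text",
)

missing = review_text.eq("no review found")
n_missing = int(missing.sum())        # number of listings without reviews
share_missing = float(missing.mean()) # fraction of listings without reviews

print(n_missing, round(share_missing, 3))
```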

In [34]:
#library that contains punctuation
import string
# string.punctuation
# #defining the function to remove punctuation
# def remove_punctuation(text):
#     punctuationfree="".join([i for i in text if i not in string.punctuation])
#     return punctuationfree
#storing the punctuation free text
# listings['Review_Text']= listings['Review_Text'].apply(lambda x:remove_punctuation(x))
# listings['Listing_Text']= listings['Listing_Text'].apply(lambda x:remove_punctuation(x))
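
If punctuation removal is re-enabled, a translation table does the same job as the character-by-character join in the commented code. A minimal standard-library sketch, with a made-up sample sentence:

```python
import string

# Maps every ASCII punctuation character to None (i.e. deletes it).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Return text with all characters in string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

cleaned = remove_punctuation("Great location! Clean, quiet, cosy.")
print(cleaned)
```

Note that applying this to `Listing_Text` would also strip the `:` and `;` separators between fields.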


In [35]:
listings['Review_Text']= listings['Review_Text'].apply(lambda x: x.lower())
listings['Listing_Text']= listings['Listing_Text'].apply(lambda x: x.lower())
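
The `apply(lambda x: x.lower())` form works here because every value is a string. If a column could contain missing values, the `.str` accessor is a safer option, since it passes them through instead of raising. A sketch under that assumption:

```python
import numpy as np
import pandas as pd

s = pd.Series(["Great Location", np.nan])

# .str.lower() lowercases strings and leaves missing values as missing.
lowered = s.str.lower()

print(lowered.tolist())
```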


In [40]:
listings['Review_Text']

0       ['daniel is really cool. the place was nice an...
1       ["the location of sasha's b&b makes it ideal f...
2       ['why stay anywhere else in the city of canals...
3       ['very nice place and people.  great location!...
4       ['not a long ago i stood at  downtown guesthou...
                              ...                        
5397                                      no review found
5398                                      no review found
5399                                      no review found
5400                                      no review found
5401                                      no review found
Name: Review_Text, Length: 5402, dtype: object

In [36]:
# import nltk
# #Stop words present in the library
# stopwords = nltk.corpus.stopwords.words('english')
# #defining the function to remove stopwords from tokenized text
# def remove_stopwords(text):
#     output= [i for i in text if i not in stopwords]
#     return output
# listings['Review_Text']=listings['Review_Text'].apply(lambda x:remove_stopwords(x))
# listings['Listing_Text']=listings['Listing_Text'].apply(lambda x:remove_stopwords(x))
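
The commented-out `remove_stopwords` iterates over its input directly, so when it receives a plain string it compares single characters against the stop-word list rather than words. The text needs to be tokenized first. The sketch below uses whitespace splitting and a tiny hand-written stop-word set; both are assumptions for illustration and stand in for a real tokenizer and NLTK's English list.

```python
# Illustrative subset only, not NLTK's stop-word list.
STOPWORDS = {"the", "was", "and", "is", "a"}

def remove_stopwords(text):
    """Split on whitespace, drop stop words, and join the rest back together."""
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

filtered = remove_stopwords("the place was nice and clean")
print(filtered)
```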


In [37]:
# #defining the function for lemmatization
# from nltk.stem import WordNetLemmatizer
# #defining the object for Lemmatization
# wordnet_lemmatizer = WordNetLemmatizer()
# def lemmatizer(text):
#     lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
#     return lemm_text
# listings['Review_Text']= listings['Review_Text'].apply(lambda x:lemmatizer(x))
# listings['Listing_Text']= listings['Listing_Text'].apply(lambda x:lemmatizer(x))


### Define headers

In [38]:
listings.to_csv(processor+'Processed_Airbnb/listings_text_processed.csv')
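
By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Whether that column is wanted depends on the downstream consumer; the sketch below only shows the difference in the header line, using an in-memory buffer and toy data.

```python
import io
import pandas as pd

toy = pd.DataFrame({"id": [2818], "Listing_Text": ["amsterdamid:2818;"]})

with_index = io.StringIO()
toy.to_csv(with_index)                    # default: index is written

without_index = io.StringIO()
toy.to_csv(without_index, index=False)    # index is omitted

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

print(repr(header_with))
print(repr(header_without))
```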
