## Feature Extraction from Booking.com Reviews

##### Questions:
 1. What are the top five features that customers talk about most in positive and negative comments about Booking.com hotels?
 2. What are the top five features that customers prefer most if they are a solo traveler vs traveling with a group vs on a business trip vs a leisure trip vs traveling as a couple vs a family with young children (in "Tags" field)? 
 3. What are the top five features that customers like most and the top five features they complain about most about hotels in the United Kingdom, France, Italy, and Spain (in "Hotel_Address" field)?

Note: We assume that hotel features are expressed as phrases such as "comfortable bed" or "poor location", i.e., an adjective immediately followed by a noun. We therefore use POS tagging to extract adjective-noun bigrams.

In [1]:
import pandas as pd
import nltk

df = pd.read_excel('C:/Users/abhatt/Desktop/Text_Analytics/python/data/Hotel_Reviews_BookingDotCom.xlsx')
df.shape 

(515738, 12)

In [2]:
df.columns

Index(['Hotel_Name', 'Hotel_Address', 'Review_Count',
       'Non_Review_Scoring_Count', 'Average_Hotel_Score', 'Review_Date',
       'Reviewer_Nationality', 'Positive_Comments', 'Negative_Comments',
       'Total_Reviewer_Reviews', 'Reviewer_Score', 'Tags'],
      dtype='object')

There are over 500,000 hotel reviews in this data set, which take quite a bit of time to process. To reduce the computational load, let's use a random sample of 50,000 records (roughly 10% of the data).

In [3]:
d = df.sample(n=50000)
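
Because `df.sample` draws a different random subset on every run, the counts shown in the outputs below will not reproduce exactly. A minimal sketch of a repeatable draw, assuming a fixed seed is acceptable for this analysis; the seed value 42 and the small demo frame are arbitrary stand-ins, not part of the original notebook:

```python
import pandas as pd

# Hypothetical stand-in for the review data, just to show the behavior.
demo = pd.DataFrame({"Positive_Comments": [f"comment {i}" for i in range(100)]})

# Passing the same random_state yields the same rows on every call.
first = demo.sample(n=10, random_state=42)
second = demo.sample(n=10, random_state=42)

print(first.index.equals(second.index))
```

On the real data this would correspond to something like `df.sample(n=50000, random_state=42)`.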

##### Top 5 features that customers mention most in positive and negative reviews

The function below extracts the five most frequent adjective-noun pairs from an input pandas Series d.

In [4]:
def top_5(d):
    comment_list = d.tolist()                                      # Convert a Pandas series to a list
    comment_list = [c for c in comment_list if pd.notnull(c)]      # Drop missing comments
    comment_str  = ' '.join(comment_list)                          # Concatenate all comments to one string
    tokenized = nltk.word_tokenize(comment_str)                    # Tokenize into words
    tokenized = [w for w in tokenized if len(w)>2]                 # Drop tokens shorter than 3 characters
    tagged = nltk.pos_tag(tokenized, tagset='universal')           # Extract POS tags for each word
    bigrams = nltk.bigrams(tagged)                                 # Extract bigrams of POS tagged words
    adj_noun = [(x, y) for x, y in bigrams if x[1]=='ADJ' and y[1]=='NOUN']   # Keep ADJ followed by NOUN
    adj_noun = [i[0][0] + ' ' + i[1][0] for i in adj_noun]         # Convert ADJ-NOUN bigram list into a list of strings
    return nltk.FreqDist(adj_noun).most_common(5)                  # Return the 5 most common ADJ-NOUN strings

In [5]:
dpos = d['Positive_Comments'][d['Positive_Comments']!="No Positive"].str.lower()
dneg = d['Negative_Comments'][d['Negative_Comments']!="No Negative"].str.lower()
dpos

418362                                           everything
367148               very nice place and very good location
188321                             the bed and the cleaning
424922                great breakfast nice large comfy bed 
508300     the hotel is very boutique and is sumptuosly ...
                                ...                        
101694                      they looked after us very well 
306852                                             location
60431      spacious room friendly staff very very quiet ...
157658     this hotel is a gem of a find close to major ...
158596     the rooms although small are very comfortable...
Name: Positive_Comments, Length: 46441, dtype: object

In [6]:
top_5(dpos)

[('great location', 2945),
 ('friendly staff', 2388),
 ('good location', 1932),
 ('helpful staff', 1739),
 ('excellent location', 1113)]

In [7]:
top_5(dneg)

[('small room', 551),
 ('double room', 226),
 ('hot water', 211),
 ('little bit', 201),
 ('small rooms', 200)]

##### Top 5 positive features by tags: Leisure trip, Solo traveler, Couple, Family with young children, Group

In [8]:
def top_5_by_tag(tagname):
    dpos = d['Positive_Comments'][d['Tags'].str.contains(tagname)].str.lower()
    return top_5(dpos)

In [9]:
# Top 5 features for Leisure trip
top_5_by_tag('Leisure trip')

[('great location', 2485),
 ('friendly staff', 1938),
 ('good location', 1535),
 ('helpful staff', 1452),
 ('excellent location', 925)]

In [10]:
top_5_by_tag('Solo traveler')

[('great location', 551),
 ('friendly staff', 534),
 ('good location', 392),
 ('helpful staff', 320),
 ('excellent location', 214)]

In [11]:
top_5_by_tag('Couple')

[('great location', 1501),
 ('friendly staff', 1171),
 ('good location', 936),
 ('helpful staff', 889),
 ('excellent location', 548)]

In [12]:
top_5_by_tag('Family with young children')

[('great location', 317),
 ('friendly staff', 260),
 ('good location', 218),
 ('helpful staff', 193),
 ('excellent location', 146)]

In [13]:
top_5_by_tag('Group')

[('great location', 392),
 ('friendly staff', 293),
 ('good location', 256),
 ('helpful staff', 213),
 ('excellent location', 131)]
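
All five traveler types produce the same ranking, so raw counts mostly reflect how many reviews each tag has. One hedged way to compare segments is to look at each phrase's share within its own top-5 list. The sketch below uses the Leisure trip and Solo traveler counts printed above; `shares` is a hypothetical helper, and shares computed over only five phrases from a single random sample should be read as indicative at most:

```python
# Counts copied from the top_5_by_tag outputs above.
leisure = [('great location', 2485), ('friendly staff', 1938), ('good location', 1535),
           ('helpful staff', 1452), ('excellent location', 925)]
solo = [('great location', 551), ('friendly staff', 534), ('good location', 392),
        ('helpful staff', 320), ('excellent location', 214)]

def shares(phrase_counts):
    """Return each phrase's fraction of the total count in the list."""
    total = sum(count for _, count in phrase_counts)
    return {phrase: count / total for phrase, count in phrase_counts}

leisure_shares = shares(leisure)
solo_shares = shares(solo)

# Compare how prominent each phrase is within its own segment.
for phrase in leisure_shares:
    print(phrase, round(leisure_shares[phrase], 3), round(solo_shares[phrase], 3))
```

In this sample, "friendly staff" takes a somewhat larger share of the Solo traveler list than of the Leisure trip list, a difference that the raw rankings hide.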

##### Top 5 positive and negative features by country: United Kingdom, France, Italy, and Spain

In [14]:
def top_5_pos_by_country(cname):
    dtemp = d['Positive_Comments'][d['Hotel_Address'].str.contains(cname)].str.lower()
    return top_5(dtemp)

def top_5_neg_by_country(cname):
    dtemp = d['Negative_Comments'][d['Hotel_Address'].str.contains(cname)].str.lower()
    return top_5(dtemp)

In [15]:
country_list = ['United Kingdom', 'France', 'Italy', 'Spain']

for c in country_list:
    print('\nCountry:', c)
    print('Most liked features:', top_5_pos_by_country(c))
    print('Most disliked features:', top_5_neg_by_country(c))


Country: United Kingdom
Most liked features: [('great location', 1457), ('friendly staff', 1185), ('good location', 985), ('helpful staff', 801), ('excellent location', 513)]
Most disliked features: [('small room', 344), ('negative nothing', 322), ('negative room', 257), ('double room', 132), ('hot water', 122)]

Country: France
Most liked features: [('great location', 392), ('friendly staff', 296), ('helpful staff', 239), ('good location', 230), ('excellent location', 138)]
Most disliked features: [('negative nothing', 105), ('small room', 85), ('negative room', 66), ('small rooms', 40), ('negative breakfast', 40)]

Country: Italy
Most liked features: [('great location', 173), ('friendly staff', 154), ('good breakfast', 151), ('good location', 149), ('helpful staff', 112)]
Most disliked features: [('negative nothing', 62), ('negative breakfast', 28), ('negative room', 25), ('little bit', 25), ('hot water', 18)]

Country: Spain
Most liked features: [('great location', 386), ('friendly
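
Phrases such as "negative nothing" and "negative room" in the output above look like artifacts rather than hotel features. Unlike In [5], the per-country helpers do not drop the "No Negative" placeholder, and because all comments are joined into one string, the word "negative" can pair with the first word of the next comment. A sketch of a filter that could be applied before calling top_5, assuming the placeholders are exactly "No Positive" and "No Negative" as in In [5]; the helper name `comments_for_country` and the demo rows are hypothetical:

```python
import pandas as pd

PLACEHOLDERS = ["No Positive", "No Negative"]

def comments_for_country(frame, column, country):
    """Lowercased comments for one country, without missing or placeholder rows."""
    in_country = frame["Hotel_Address"].str.contains(country, regex=False, na=False)
    text = frame[column]
    # Strip whitespace before comparing, in case the placeholder is padded.
    is_real = text.notna() & ~text.astype(str).str.strip().isin(PLACEHOLDERS)
    return text[in_country & is_real].str.lower()

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "Hotel_Address": ["1 Example St London United Kingdom",
                      "2 Calle Ejemplo Madrid Spain",
                      "3 Calle Ejemplo Madrid Spain"],
    "Negative_Comments": ["No Negative", " No Negative", "Small room"],
})
print(comments_for_country(demo, "Negative_Comments", "Spain").tolist())
# → ['small room']
```

The same filter would apply to the tag-based helper, which also skips the "No Positive" check.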

##### It is NOT ENOUGH to just write some code and show some output; you have to (1) answer the questions asked and (2) make sure that your answers "make sense".

The top 5 liked features collapse to just two (location and staff), and the top 5 disliked features are room, room, hot water, "little bit", and rooms, which is not a meaningful answer. Let's try the top 15 positive and negative features instead.

In [16]:
def top_15(d):
    comment_list = d.tolist()                                      # Convert a Pandas series to a list
    comment_list = [c for c in comment_list if pd.notnull(c)]      # Drop missing comments
    comment_str  = ' '.join(comment_list)                          # Concatenate all comments to one string
    tokenized = nltk.word_tokenize(comment_str)                    # Tokenize into words
    tokenized = [w for w in tokenized if len(w)>2]                 # Drop tokens shorter than 3 characters
    tagged = nltk.pos_tag(tokenized, tagset='universal')           # Extract POS tags for each word
    bigrams = nltk.bigrams(tagged)                                 # Extract bigrams of POS tagged words
    adj_noun = [(x, y) for x, y in bigrams if x[1]=='ADJ' and y[1]=='NOUN']   # Keep ADJ followed by NOUN
    adj_noun = [i[0][0] + ' ' + i[1][0] for i in adj_noun]         # Convert ADJ-NOUN bigram list into a list of strings
    return nltk.FreqDist(adj_noun).most_common(15)                 # Return the 15 most common ADJ-NOUN strings

In [17]:
d = df.sample(frac=0.10)
dpos = d['Positive_Comments'][d['Positive_Comments']!="No Positive"].str.lower()
dneg = d['Negative_Comments'][d['Negative_Comments']!="No Negative"].str.lower()

top_15(dpos)

[('great location', 3022),
 ('friendly staff', 2510),
 ('good location', 1954),
 ('helpful staff', 1733),
 ('excellent location', 1166),
 ('good breakfast', 911),
 ('comfortable bed', 710),
 ('perfect location', 529),
 ('comfortable room', 518),
 ('great breakfast', 487),
 ('great staff', 486),
 ('good value', 481),
 ('clean room', 451),
 ('nice staff', 396),
 ('excellent staff', 376)]

In [18]:
top_15(dneg)

[('small room', 535),
 ('double bed', 248),
 ('double room', 239),
 ('little bit', 208),
 ('small rooms', 196),
 ('hot water', 166),
 ('next day', 164),
 ('next door', 158),
 ('star hotel', 150),
 ('only thing', 149),
 ('first night', 142),
 ('single room', 137),
 ('single beds', 135),
 ('free wifi', 123),
 ('other rooms', 122)]

##### Q1: What are the top five features that customers talk about most in positive and negative comments about Booking.com hotels?

Based on manual inspection of the top 15 results, the top 5 features customers like most are: (1) location, (2) staff helpfulness/friendliness, (3) breakfast quality, (4) bed, and (5) value (amenities commensurate with cost). The top 5 features that customers dislike most are: (1) room size, (2) not having hot water (?), (3) not having free wifi (?), (4) ??? and (5) ???. So even the top 15 is not enough for the negative comments; perhaps try the top 25.
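
The manual grouping above can be partly automated by summing counts over the noun in each phrase. A minimal sketch, using the counts from the top_15(dpos) output above; the helper name `counts_by_noun` is hypothetical, and taking the last word as the noun assumes two-word adjective-noun phrases:

```python
from collections import Counter

# Counts copied from the top_15(dpos) output above.
top_15_pos = [
    ('great location', 3022), ('friendly staff', 2510), ('good location', 1954),
    ('helpful staff', 1733), ('excellent location', 1166), ('good breakfast', 911),
    ('comfortable bed', 710), ('perfect location', 529), ('comfortable room', 518),
    ('great breakfast', 487), ('great staff', 486), ('good value', 481),
    ('clean room', 451), ('nice staff', 396), ('excellent staff', 376),
]

def counts_by_noun(phrase_counts):
    """Sum phrase counts over the final word (the noun) of each phrase."""
    totals = Counter()
    for phrase, count in phrase_counts:
        noun = phrase.split()[-1]
        totals[noun] += count
    return totals

print(counts_by_noun(top_15_pos).most_common(5))
# → [('location', 6671), ('staff', 5501), ('breakfast', 1398), ('room', 969), ('bed', 710)]
```

On these 15 rows "room" ranks ahead of "value", which differs slightly from the manual reading. Plural forms such as "rooms" would need lemmatizing before they merge with "room".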

You also have to answer Q2 and Q3 (which I am not answering here):
- What are the top five features that customers prefer most if they are a solo traveler vs traveling with a group vs on a business trip vs a leisure trip vs traveling as a couple vs a family with young children (in "Tags" field)? 
- What are the top five features that customers like most and the top five features they complain about most about hotels in the United Kingdom, France, Italy, and Spain (in "Hotel_Address" field)?

##### Questions:
1. In this analysis we used 50,000 records. What is a reasonable number of text records to work with?
2. In this exercise, we did not do stopword removal, lemmatization, etc. Was that a good or bad idea?
3. Do you see any problems with the above approach? Is there a better approach for feature extraction?

#### Feature Extraction and Scoring

In [19]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [20]:
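def top_10_nouns(d):
    comment_list = d.tolist()                                      # Convert a Pandas series to a list
    comment_list = [c for c in comment_list if pd.notnull(c)]      # Drop missing comments
    comment_str  = ' '.join(comment_list)                          # Concatenate all comments to one string
    words = nltk.word_tokenize(comment_str)                        # Tokenize into words
    words = [w for w in words if len(w)>2]                         # Drop tokens shorter than 3 characters
    words = [lemmatizer.lemmatize(w) for w in words]               # Lemmatize (default POS is noun)
    tagged = nltk.pos_tag(words, tagset='universal')               # Extract POS tags for each lemma
    nouns = [w[0] for w in tagged if w[1]=='NOUN']                 # Keep nouns only
    return nltk.FreqDist(nouns).most_common(10)                    # Return the 10 most common nouns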

In [21]:
dpos = d['Positive_Comments'][d['Positive_Comments']!="No Positive"].str.lower()
dneg = d['Negative_Comments'][d['Negative_Comments']!="No Negative"].str.lower()

In [22]:
top_10_nouns(dpos)

[('staff', 19149),
 ('location', 18433),
 ('room', 17643),
 ('wa', 13036),
 ('hotel', 12527),
 ('breakfast', 7747),
 ('bed', 4442),
 ('station', 3088),
 ('service', 2716),
 ('everything', 2659)]

In [23]:
top_10_nouns(dneg)

[('room', 20283),
 ('wa', 11794),
 ('hotel', 7694),
 ('breakfast', 5211),
 ('nothing', 3814),
 ('staff', 3735),
 ('night', 2792),
 ('bed', 2716),
 ('bathroom', 2619),
 ('time', 2292)]

Top features mentioned in positive comments are: (1) staff, (2) location, (3) room (size? amenities?), (4) breakfast, (5) service (room service? staff service?). Top features mentioned in negative comments are: (1) room, (2) breakfast, (3) staff, (4) bed, (5) bathroom. Combining both lists, the features hotel customers care most about are: (1) room, (2) staff, (3) location, (4) breakfast, (5) bed, and (6) bathroom. 
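
The "wa" entry in both lists appears to be "was" with its final "s" stripped: WordNetLemmatizer treats every word as a noun by default, and here it runs before POS tagging. One possible variant tags first and lemmatizes only the nouns. The sketch below is hypothetical (the name `top_nouns` and its parameters are not from the original notebook) and takes the tokenizer, tagger, and lemmatizer as arguments so that the counting logic can be tried without NLTK models:

```python
from collections import Counter

def top_nouns(comments, tokenize, pos_tag, lemmatize, n=10):
    """Count lemmatized nouns, tagging each comment before lemmatizing.

    tokenize(str) -> list of tokens
    pos_tag(list of tokens) -> list of (token, tag) pairs with universal tags
    lemmatize(str) -> lemma
    """
    counts = Counter()
    for comment in comments:
        tokens = [t for t in tokenize(comment) if len(t) > 2]
        for word, tag in pos_tag(tokens):
            if tag == 'NOUN':              # Only nouns ever reach the lemmatizer
                counts[lemmatize(word)] += 1
    return counts.most_common(n)

# With NLTK this might be called as (assumes the required NLTK data is installed):
# top_nouns(dpos.dropna(), nltk.word_tokenize,
#           lambda toks: nltk.pos_tag(toks, tagset='universal'),
#           lemmatizer.lemmatize)
```

Tagging each comment separately also keeps words from adjacent comments from influencing each other's tags, at the cost of more tagger calls.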

In [24]:
features = ['room', 'staff', 'location', 'breakfast', 'bed', 'bathroom']

To score each review on the above features, you can do the following.
 1. Sentence tokenize each review.
 2. Extract sentences containing the above features.
 3. Identify sentiment polarity for these sentences.
 4. Save feature and polarity for each review.
 5. Compute mean polarity of each feature in each review. These mean polarities serve as the feature-wise ratings for that review.
 6. Compute mean polarity across reviews for each feature.
 7. Repeat for each hotel; save feature-wise review score for each hotel in a pandas dataframe for display and plotting.
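
The steps above can be sketched as follows. This is an illustration under loud assumptions, not the notebook's implementation: the tiny `TOY_LEXICON` and the regex sentence splitter are stand-ins for a real sentiment analyzer and for `nltk.sent_tokenize`, and all function names are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

FEATURES = ['room', 'staff', 'location', 'breakfast', 'bed', 'bathroom']

# ASSUMPTION: toy word scores for illustration only, not a real sentiment lexicon.
TOY_LEXICON = {'great': 1.0, 'friendly': 1.0, 'comfortable': 1.0, 'clean': 1.0,
               'small': -1.0, 'dirty': -1.0, 'rude': -1.0, 'cold': -1.0}

def words_in(text):
    return re.findall(r"[a-z']+", text.lower())

def toy_polarity(sentence):
    """Mean lexicon score of the scored words in a sentence, 0.0 if none (step 3)."""
    scores = [TOY_LEXICON[w] for w in words_in(sentence) if w in TOY_LEXICON]
    return mean(scores) if scores else 0.0

def split_sentences(review):
    """Crude splitter on ., ! and ? (stand-in for a proper sentence tokenizer, step 1)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', review) if s.strip()]

def score_review(review, features=FEATURES, polarity=toy_polarity):
    """Mean polarity per feature within one review (steps 2 to 5)."""
    per_feature = defaultdict(list)
    for sentence in split_sentences(review):
        present = set(words_in(sentence))
        for feature in features:
            if feature in present:          # Exact word match only
                per_feature[feature].append(polarity(sentence))
    return {f: mean(v) for f, v in per_feature.items()}

def score_hotel(reviews, features=FEATURES, polarity=toy_polarity):
    """Mean of the per-review feature scores across a hotel's reviews (step 6)."""
    collected = defaultdict(list)
    for review in reviews:
        for feature, score in score_review(review, features, polarity).items():
            collected[feature].append(score)
    return {f: mean(v) for f, v in collected.items()}

reviews = ["The room was small. Friendly staff!",
           "Great location. The room was clean and comfortable."]
print(score_hotel(reviews))
```

For step 7, `score_hotel` could be applied to each group of reviews sharing a Hotel_Name and the resulting dictionaries collected into a pandas dataframe. Exact word matching means "rooms" does not count as "room"; lemmatizing first would be one way to address that.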