(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Analyzing Hotel Ratings on Tripadvisor

In this homework we will focus on practicing two techniques: web scraping and regression. For the first part, we will get some basic information for each hotel in Boston. Then, we will fit a regression model on this information and try to analyze it.

** Task 1 (30 pts)**

We will scrape the data using Beautiful Soup. For each hotel that our search returns, we will get the information below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 

In [1]:
"""
This code uses identical set of steps from Davide's lecture. The first 2-3 functions are exactly the same.
I have added more code to scrape hotel's ratings and Omni Parker reviews. More description below.

This code takes huge time (2-2.5 hours) because I couldn't get selenium working properly (explained later). 
Instead, I've kept the execute output and submitted datafiles to make regression part easier. 

Implementation details in further comments:
"""
from BeautifulSoup import BeautifulSoup
import sys
import time
import os
import logging
import argparse
import requests
import codecs
import json
import collections
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException, StaleElementReferenceException
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.keys import Keys


base_url = "http://www.tripadvisor.com"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36"

""" STEP 1  Get the first page for Boston, MA"""
def get_tourism_page(city, state):
    """ 
        Return the json containing the
        URL of the tourism city page
    """
    url = "%s/TypeAheadJson?query=%s%%20%s&action=API" % (base_url, "%20".join(city.split()), state)
    print "URL TO REQUEST:", url
    
    # Given the url, request the HTML page
    headers = { 'User-Agent' : user_agent }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')

    # Parse json to get url
    js = json.loads(html)
    results = js['results']
    print "RESULTS: ", results[0]
    urls = results[0]['urls'][0]

    # get tourism page url
    tourism_url = urls['url']
    return tourism_url

""" STEP 2  Get url for all hotels in Boston"""
def get_city_page(tourism_url):
    """ 
        Get the URL of the hotels of the city
        using the URL returned by the function
        get_tourism_page()
    """

    url = base_url + tourism_url

    # Given the url, request the HTML page
    headers = { 'User-Agent' : user_agent }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')
    
    soup = BeautifulSoup(html)
    li = soup.find("li", {"class": "hotels twoLines"})
    city_url = li.find('a', href = True)
    print "CITY PAGE URL:", city_url['href']
    return city_url['href']


""" STEP 3 return html page for list of Boston hotels"""
def get_hotellist_page(city_url, count):
    """ Get the hotel list page given the url returned by
        get_city_page(). Return the html after saving
        it to the datadir 
    """
    print "Hotel page", count
    url = base_url + city_url
    # Sleep 2 sec before starting a new http request
    time.sleep(2)
    # Request page
    headers = { 'User-Agent' : user_agent }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')

    return html

""" STEP 4 this is where majority of scraping is done. 
Check traveler ratings, traveler type in getTravelerRating() and 
scrape Omni Parker House reviews in scrapeReview()
"""
def parse_hotellist_page(html):
    """ 
    Parse the html pages returned by get_hotellist_page().
    Return the next url page to scrape (a city can have
    more than one page of hotels) if there is, else exit
    the script.
    """
    
    soup = BeautifulSoup(html)
# Extract hotel name, star rating and number of reviews
    hotel_boxes = soup.findAll('div', {'class' :'listing easyClear  p13n_imperfect '})
    print len(hotel_boxes), "hotels"
    for hotel_box in hotel_boxes:
        name = hotel_box.find('div', {'class' :'listing_title'}).find(text=True)
        try:
            rating = hotel_box.find('div', {'class' :'listing_rating'})
            reviews = rating.find('span', {'class' :'more'}).find(text=True)
            stars = hotel_box.find("img", {"class" : "sprite-ratings"})
        except Exception, e:
            print "no ratings for", name
            reviews = "N/A"
            stars = 'N/A'
        hotelref = hotel_box.findAll('a', href= True)
        #print "go to ", hotelref[0]['href']," and get traveler ratings"
        
        """We have a new hotel page. Go to that page and get ratings and dump to file"""
        print name,
        ratingfile.write("++NEW HOTEL++ %s\n" % name)
        print '.',
        getTraverlerRating(hotelref[0]['href'])
        
        
        if stars != 'N/A':
            #log.info("Stars: %s" % stars['alt'].split()[0])
            stars = stars['alt'].split()[0]
        
        """Scrape Omni Parker House reviews and store to file"""
        if name == "Omni Parker House":
            print "Found Omni Parker House. Scrape reviews"
            print "HOTEL NAME:", name
            print "HOTEL REVIEWS: ", reviews
            print "HOTEL STAR RATING:", stars
            omnihrefs = hotel_box.findAll('a', href= True)
            for omnihref in omnihrefs:
                #print omnihref, "######", omnihref['href']
                if omnihref.find(text = True) == 'Omni Parker House':
                    
                    pg = 0
                    #print "DEBUG: Review url is", omnihref['href']
                    print "page #", pg,',',
                    """scrapeReview() returns next url and None if last page"""
                    ret = scrapeReview(omnihref['href'], pg)
                    
                    while ret:
                        pg +=1
                        print "page #", pg,                         
                        ret = scrapeReview(ret, pg)
                    print "Done scraping omni hotel"

                #add this block in main flow to scrape everything
                #return

# # Get next URL page if exists, else exit
    div = soup.find("div", {"class" : "unified pagination standard_pagination"})
    # check if last page
    if div.find('span', {'class' : 'nav next ui_button disabled'}):
        print "\nReached last page"
        return None
    
    # If it is not las page there must be the Next URL
    hrefs = div.findAll('a', href= True)
    for href in hrefs:
        if href.find(text = True) == 'Next':
            print "Next url is", href['href']
            return href['href']

"""Get Traverler's ratings for every hotel"""
def getTraverlerRating(hotelurl):
    headers = { 'User-Agent' : user_agent }
    response = requests.get(base_url+hotelurl, headers=headers)
    #print response
    html = response.text.encode('utf-8')   
    hotelsoup = BeautifulSoup(html)
    """Find all ratings from hotel html page and store into file. 
    Sometimes the values might be missin so safe to use try except"""
    try: 
        filterbox = hotelsoup.findAll("div",{"class":"with_histogram"})
        ratebox = filterbox[0].findAll("div",{"class":"col rating "})
        ratinglist = ratebox[0].findAll("li")
        excel = ratinglist[0].findAll("label",{"for":"taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_5"})[0]
        ratingfile.write("Excellent:%s\n" % excel.findAll("span")[2].find(text=True))
        #print "excel", excel.findAll("span")[2].find(text=True), 
        vgood = ratinglist[1].findAll("label",{"for":"taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_4"})[0]
        ratingfile.write("Very good: %s\n" % vgood.findAll("span")[2].find(text=True))

        avg = ratinglist[2].findAll("label",{"for":"taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_3"})[0]
        ratingfile.write("Average:%s\n" % avg.findAll("span")[2].find(text=True))

        poor = ratinglist[3].findAll("label",{"for":"taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_2"})[0]
        ratingfile.write("Poor:%s\n" % poor.findAll("span")[2].find(text=True))

        terrible = ratinglist[4].findAll("label",{"for":"taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_1"})[0]
        ratingfile.write("Terrible:%s\n" % terrible.findAll("span")[2].find(text=True))

        typebox = filterbox[0].findAll("div",{"class":"col segment "})
        typelist = typebox[0].findAll("li")

        ratingfile.write("Traveler type: ")
        #print 'travel'
        for t in typelist:
            typevallist = t.findAll("span")[1].find(text=True)
            #print typevallist,
            ratingfile.write("%s, " % typevallist)
        ratingfile.write("\n")
        #print "."
        
    except IndexError:
        print "no rating for", base_url+hotelurl
        return
    
    
     
    #sys.exit()
        
"""STEP 5: Go through each review"""   
"""
Tried to use selenium but didn't work because of overlaying window tripadvisor prompts for every new session
The idea is to go on clicking on next review page and parse 6 reviews per http request
def scrapeFaster(url):
    driver.get(base_url+url)
    
    pagehtml = driver.page_source
    pgsoup = BeautifulSoup(pagehtml)
    try:
        nexturl = driver.find_element_by_link_text("More")
        print "More BUTTON", nexturl
    except NoSuchElementException:
        print "NO LINK"
        return
    nexturl.click() 
    time.sleep(0.2)
    
    print "page loaded"
"""    

"""
Takes reviewurl as input along with pagenum and returns next review page.

Goes through all reviews on pgnum, for each review, generates new get() and dumps review ratings to Omni Parker file
"""
def scrapeReview(reviewurl, pgnum):
    #return
    #print base_url+reviewurl
    # debug pupose reviewurl = globalurl
    debugfile.write("\nscrapeReview: url %s," % base_url+reviewurl)
    headers = { 'User-Agent' : user_agent }
    response = requests.get(base_url+reviewurl, headers=headers)
    #print response
    debugfile.write("scrapeReview: response %s\n" % response)
    html = response.text.encode('utf-8')   
    reviewsoup = BeautifulSoup(html) 
    
    revbox = reviewsoup.findAll("div", {"class":"reviewSelector   track_back"})
    olderrevbox = reviewsoup.findAll("div", {"class":"reviewSelector  "})
    oldestbox = reviewsoup.findAll("div", {"class":"reviewSelector  first_aph   track_back"})
    
    #if len(olderrevbox):
        #print "older reviews", len(olderrevbox)
    debugfile.write("scrapeReview: total reviews to be parsed %s\n" % str(len(revbox)+len(olderrevbox)+len(oldestbox)))
    print "(",len(revbox)+len(olderrevbox)+len(oldestbox),")|",
    pg = 1
    revbox += olderrevbox+oldestbox
    
    #click on more button and send expanded cells to getstars2 one-by-one
    for r in revbox:
        reviews = r.findAll('a', href=True)        
        for rev in reviews:            
            thisrevurl = rev['href']
            #print thisrevurl
            
            #now make http request for review url and write values to a file
            getStars2(thisrevurl)
            
    nextpages = reviewsoup.findAll("a", {"class":"pageNum taLnk"})
    pgnum = min(pgnum,4)
    try:
        #print "\nnext page?", nextpages[pgnum]['href']
        debugfile.write("scrapeReview: next page url %s\n" % nextpages[pgnum]['href'])
        return nextpages[pgnum]['href']
    except IndexError:
        print "Done with all pages", pgnum
        return None

"""STEP 6 : Access individual review, parse ratings and store in a file
This is the bottleneck function. Every review generates on get() causing huge delays"""

def getStars2(revurl):
    #print base_url+revurl,
    debugfile.write("getStars2: url %s," % base_url+revurl)
    headers = { 'User-Agent' : user_agent }
    response = requests.get(base_url+revurl, headers=headers)
    #print response
    debugfile.write("getStars2: response %s\n" % response)
    html = response.text.encode('utf-8') 
    
    reviewsoup = BeautifulSoup(html)
    reviewblock = reviewsoup.findAll("div",{"class":"deckC"})
    try:
        reviewlist = reviewblock[0].findAll("div",{"class":"  reviewSelector "})
    except IndexError:
        return
    review = reviewlist[0]
                
    #print review
    """Get the review id from review tab"""
    id = review['id']
    #print id, 
    debugfile.write("getStars2: id: %s\t" % id)
    ratelist = review.findAll("div", {"class":"rating-list"})
    #print ratelist
    
    """Only consider first review in the list. This could be optimized since a review page has 6 more ratings.
    But it'll need additional logic to track scraped reviews to avoid duplicate scraping"""
    try:
        stars = ratelist[0].findAll("li",{"class":"recommend-answer"})
    except IndexError:
        return

    """Access rating attribute and value and dump to file"""
    for val in stars:
        v = val.findAll("img")
        k = val.findAll("div",{"class":"recommend-description"})

        try:
            access = k[0],v[0]
        except IndexError:
            continue
        omnifile.write("%s:" % id)
        omnifile.write("%s:" % k[0].find(text=True))
        omnifile.write("%s\n" % v[0]['alt'][0])

"""
#This is a try using selenium. Couldn't get this to work correctly - next page not loading
def getStars(revurl, pg):
    print base_url+revurl
    driver.get(base_url+revurl)
    print "page #", pg
    #time.sleep(1)
    while True:
        #geturl = base_url+revurl
        #headers = { 'User-Agent' : user_agent }
        #response = requests.get(base_url+revurl, headers=headers)
        #print response
        #html = response.text.encode('utf-8')   
        
        html = driver.page_source
        reviewsoup = BeautifulSoup(html) 
        reviewblock = reviewsoup.findAll("div",{"class":"deckC"})
        reviewlist = reviewblock[0].findAll("div",{"class":"  reviewSelector "})
        #print reviewlist

        revnum = 0
        for review in reviewlist:        
            
            if pg and not revnum:
                revnum += 1
                continue
            
            #print review
            id = review['id']
            print id
            ratelist = review.findAll("div", {"class":"rating-list"})
            #print ratelist
            for i in xrange(len(ratelist)):

                stars = ratelist[i].findAll("li",{"class":"recommend-answer"})
                #inside stars, access sprite and description and write
                #print stars
                #ratedict = collections.defaultdict(list)
                for val in stars:
                    v = val.findAll("img")
                    k = val.findAll("div",{"class":"recommend-description"})
                    #print k,v
                    #print id, ":",k[0].find(text=True),":", v[0]['alt'][0]
                    #ratedict[k[0].find(text=True)] = v[0]['alt'][0]

                    #omnifile.write("%s:" % id)
                    #omnifile.write("%s:" % k[0].find(text=True))
                    #omnifile.write("%s\n" % v[0]['alt'][0])
            revnum += 1
            
        pg += 1
        try:
            nexturl = driver.find_element_by_link_text("Next")
            print "NEXT BUTTON", nexturl
        except NoSuchElementException:
            print "NO LINK"
            break
        nexturl.click() 
        body = driver.find_element_by_tag_name("body")
        body.send_keys(Keys.CONTROL + 't')
        def link_has_gone_stale():
            try:
                # poll the link with an arbitrary call
                nexturl.find_elements_by_id('doesnt-matter') 
                return False
            except StaleElementReferenceException:
                return True
        time.sleep(1)
        wait_for(link_has_gone_stale)
        
        nextpage = reviewsoup.findAll("a",{"class":"pageNum taLnk"})
        print "go to next", nextpage[0]['href']
        revurl = nextpage[0]['href']
    #return nextpage[0]['href']
    #return nextpage[0]['href']
    #omnidict[id] = ratedict
    #print id,":",k,":", v
    #print stars
def wait_for(condition_function):
    start_time = time.time()
    while time.time() < start_time + 10:
        if condition_function():
            return True
        else:
            time.sleep(0.1)
    raise Exception(
        'Timeout waiting for {}'.format(condition_function.__name__)
    )
"""

#globalurl = "/Hotel_Review-g60745-d89599-Reviews-or5270-Omni_Parker_House-Boston_Massachusetts.html#REVIEWS"
print 'get url'

"""
Just to be safe, adding NEWFILE to filename to avoid replacing the original complete file
"""
omnifile = open("omni-scrapte-out-NEWFILE.dat","w")
ratingfile = open("travel-rating.dat-NEWFILE","w")
debugfile = open("debug.log","w")
tourism_url = get_tourism_page('boston', 'massachusetts')
#Get URL to obtaint the list of hotels in a specific city
city_url = get_city_page(tourism_url)
c=0
#driver = webdriver.Firefox()
#driver.wait = WebDriverWait(driver, 5)
while(True):
    c +=1
    """Get first page of hotels in boston"""
    html = get_hotellist_page(city_url,c)
    
    """Invoke scraping for each hotel"""
    city_url = parse_hotellist_page(html)
    if not city_url:
        break
#close all files        
omnifile.close()
ratingfile.close()
debugfile.close()
#driver.quit()

get url
URL TO REQUEST: http://www.tripadvisor.com/TypeAheadJson?query=boston%20massachusetts&action=API
RESULTS:  {u'lookbackServlet': None, u'name': u'Boston, Massachusetts, United States', u'data_type': u'LOCATION', u'title': u'Destinations', u'url': u'/Tourism-g60745-Boston_Massachusetts-Vacations.html', u'value': 60745, u'coords': u'42.357277,-71.05834', u'urls': [{u'url': u'/Tourism-g60745-Boston_Massachusetts-Vacations.html', u'type': u'GEO', u'name': u'Boston Tourism', u'url_type': u'geo'}], u'scope': u'global', u'type': u'GEO'}
CITY PAGE URL: /Hotels-g60745-Boston_Massachusetts-Hotels.html
Hotel page 1
30 hotels
Omni Parker House . Found Omni Parker House. Scrape reviews
HOTEL NAME: Omni Parker House
HOTEL REVIEWS:  5,620 Reviews
HOTEL STAR RATING: 4
page # 0 , ( 10 )| page # 1 ( 10 )| page # 2 ( 10 )| page # 3 ( 10 )| page # 4 ( 10 )| page # 5 ( 10 )| page # 6 ( 10 )| page # 7 ( 10 )| page # 8 ( 10 )| page # 9 ( 10 )| page # 10 ( 10 )| page # 11 ( 10 )| page # 12 ( 10 )| page

** Task 2 (20 pts) **

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1*31 + 2*33 + 3*98 + 4*504 + 5*1861}{2527}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.

In [6]:
import time, math

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import scipy as sp
import scipy.sparse.linalg as linalg
import scipy.cluster.hierarchy as hr
from scipy.spatial.distance import pdist, squareform

import sklearn.datasets as datasets
import sklearn.metrics as metrics
import sklearn.utils as utils
import sklearn.linear_model as linear_model
import sklearn.cross_validation as cross_validation
import sklearn.cluster as cluster
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm
import statsmodels.formula.api as smf

from patsy import dmatrices

import seaborn as sns
from sklearn.linear_model import LinearRegression
import pprint

def mergeOmnifile():
    with open("omni-scrapte-out.dat") as omnifile:
        lines = omnifile.readlines()
        for line in lines:
            #print line.split(":")[1],":", int(line.split(":")[2].split("\n")[0])
            splitline = line.split(":")
            k1,k2 = splitline[1], splitline[2][:-1] 
                    
            if k2 not in omnidict[k1]:
                omnidict[k1][k2] = 0
            omnidict[k1][k2] += 1                
    
    omnifile.close()

def parseDatafiles():
    
    newlist = [0]
    i = 0
    with open("travel-rating.dat") as tratefile:
        lines = tratefile.readlines()     
        hotelname = None
        for line in lines:            
            if line[0] == '+':   
                if hotelname:
                    #print allratedict[hotelname]
                    for attr in attrlist:
                        try:
                            newlist.append(allratedict[hotelname][attr])
                        except KeyError:
                            print hotelname, "missing", attr
                            newlist.append(0)
                    #print newlist, len(newlist)
                    hoteldict[hotelname] = newlist
                    #print len(hoteldict), hotelname, i
                i+=1
                hotelname = line.split('+')[4][1:-1]
                #print hotelname
                newlist = [0]
                avgscore = 0
            else:
                if line.split(":")[0] == "Traveler type":
                    typelist = line.split(":")[1].split(")")
                    for ttype in typelist:
                        try:
                            val = int(ttype.split("(")[1].replace(',',''))
                            #print val
                            newlist.append(int(val))
                        except IndexError:
                            continue
                    newlist[0] = avgscore
                else:
                    key = line.split(":")[0]
                    val = int(line.split(":")[1][:-1].replace(',',''))
                    #print key, val
                    newlist.append(int(val))
                    avgscore += val*avgscoredict[key]
    if hotelname:
        hoteldict[hotelname] = newlist
        
    tratefile.close()

def parseRateDat():        
    with open("rating-summary.dat") as ratefile:
        hotelratings = ratefile.readlines()
        prevhname = None
        newdict = collections.defaultdict(dict)
        for line in hotelratings:
            splitline = line.split(":")
            hotelname = splitline[0]
            if hotelname != prevhname:

                appendDict(newdict, prevhname)
                
                prevhname = hotelname
                #print "###NEW###", hotelname
                newdict = collections.defaultdict(dict)                                
                
            k1,k2,v = splitline[1], splitline[2], splitline[3][:-1]
            #print k1,k2,v
            newdict[k1][k2] = int(v)
        
    appendDict(newdict, prevhname)
    appendDict(omnidict, "Omni Parker House")
    #print omnidict
            
    ratefile.close()
    #print len(hoteldict)

def appendDict(newdict, prevhname):
    for k1 in newdict.keys():
        #print k1,newdict[k1]
        if k1 not in attrlist:
            #print k1
            continue
        ratesum = 0
        totalnum = 0
        for k2 in newdict[k1].keys():

            ratesum += int(k2)*int(newdict[k1][k2])
            totalnum += int(newdict[k1][k2])
        #print prevhname, k1, 1.0*ratesum/totalnum
        score = 1.0*ratesum/totalnum
        allratedict[prevhname][k1] = score #math.ceil(score) if (math.ceil(score)- score <0.5) else int(score) 
        #allratedict[prevhname][k1] = ratesum
        
attrlist = ['Location', 'Sleep Quality', 'Rooms', 'Service', 'Value', 'Cleanliness']
avgscoredict = {"Excellent":5, "Very good":4, "Average":3, "Poor":2, "Terrible":1}    
omnidict = collections.defaultdict(dict)
#  1 avg rating, 5 traveler ratings, 5 traveler types, 6 rating attributes,
allratedict = collections.defaultdict(dict)
hoteldict = collections.defaultdict(list)


mergeOmnifile()    
print len(omnidict), "ratings added to omnidict"


parseRateDat()
print len(allratedict), "hotels added to dict"


parseDatafiles()
print "Generated", len(hoteldict), "lists for all hotels"
print "Now feed all these", len(hoteldict), "vectors to fit a model"
rate_vector = []
avg_vector = []

X = pd.DataFrame(v for v in hoteldict.values())
print X.shape
X.columns = ['Avgscore', 'Excellent', 'Verygood', 'Average', 'Poor', 'Terrible', 'Families', 'Couples', 'Solo', 'Business','Friends', 'Location', 'SleepQuality', 'Rooms', 'Service', 'Value', 'Cleanliness']

#print X.head()

lm = smf.ols(formula='Avgscore ~ Excellent + Verygood + Average + Poor + Terrible + Families + Couples + Solo + Business + Friends + Location + SleepQuality + Rooms + Service + Value + Cleanliness', data=X).fit()
#lm = smf.ols(formula='Avgscore ~ Value + Location + Service + Excellent + Verygood + Average', data=X).fit()
print lm.summary()
data = X.fillna(0)


# create X and y
feature_cols = ['Excellent', 'Verygood', 'Average', 'Poor', 'Terrible', 'Families', 'Couples', 'Solo', 'Business','Friends', 'Location', 'SleepQuality', 'Rooms', 'Service', 'Value', 'Cleanliness']
X = data[feature_cols]
y = data.Avgscore

print "Created", X.shape, y.shape
print y.head()
#print X,y

lm = LinearRegression()
lm.fit(X, y)

# print coefficients
print zip(feature_cols, lm.coef_)

print "Done"

8 ratings added to omnidict
82 hotels added to dict
Hilton Boston Downtown / Faneuil Hall missing Location
Hilton Boston Downtown / Faneuil Hall missing Sleep Quality
Hilton Boston Downtown / Faneuil Hall missing Rooms
Hilton Boston Downtown / Faneuil Hall missing Service
Hilton Boston Downtown / Faneuil Hall missing Value
Hilton Boston Downtown / Faneuil Hall missing Cleanliness
Residence Inn Boston Back Bay / Fenway missing Location
Residence Inn Boston Back Bay / Fenway missing Sleep Quality
Residence Inn Boston Back Bay / Fenway missing Rooms
Residence Inn Boston Back Bay / Fenway missing Service
Residence Inn Boston Back Bay / Fenway missing Value
Residence Inn Boston Back Bay / Fenway missing Cleanliness
Element Boston Seaport missing Location
Generated 82 lists for all hotels
Now feed all these 82 vectors to fit a model
(82, 17)
                            OLS Regression Results                            
Dep. Variable:               Avgscore   R-squared:                       

** Task 3 (30 pts) **

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.

-------

_Use the same datastructured parsed in previous part:_ Please execute previous cell before proceeding (one takes a second or two)

In [7]:

def parseExcellent():        
    with open("rating-summary.dat") as ratefile:
        hotelratings = ratefile.readlines()
        prevhname = None
        newdict = collections.defaultdict(dict)
        for line in hotelratings:
            splitline = line.split(":")
            hotelname = splitline[0]
            if hotelname != prevhname:

                appendExcel(newdict, prevhname)
                
                prevhname = hotelname
                #print "###NEW###", hotelname
                newdict = collections.defaultdict(dict)                                
                
            k1,k2,v = splitline[1], splitline[2], splitline[3][:-1]
            #print k1,k2,v
            newdict[k1][k2] = int(v)
        
    appendExcel(newdict, prevhname)
    appendExcel(omnidict, "Omni Parker House")
    #print omnidict
            
    ratefile.close()
    #print len(hoteldict)

def appendExcel(d, n):
    totalrev = 0
    maxstar = 0
    for k1 in d.keys():
        
        for k2 in d[k1].keys():
            if k2 == '5':
                maxstar +=d[k1][k2]
            totalrev+=d[k1][k2]
    if totalrev == 0:
        #print n, "000000000"
        return
    #print n, 1.0*maxstar/totalrev >= 0.6
    exceldict[n] = 1.0*maxstar/totalrev >= 0.6
        
exceldict = collections.defaultdict()
parseExcellent()
print len(exceldict), "hotels added true/false"

y = pd.DataFrame(v for v in exceldict.values())
print y.shape
y.columns = ['isExcellent']
print y.head()

excelfit = sm.Logit(y, X[feature_cols])
 
# fit the model
result = excelfit.fit() 
print result.summary()


82 hotels added true/false
(82, 1)
  isExcellent
0        True
1        True
2       False
3       False
4        True
Optimization terminated successfully.
         Current function value: 0.582810
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:            isExcellent   No. Observations:                   82
Model:                          Logit   Df Residuals:                       66
Method:                           MLE   Df Model:                           15
Date:                Tue, 29 Mar 2016   Pseudo R-squ.:                  0.1125
Time:                        08:29:05   Log-Likelihood:                -47.790
converged:                       True   LL-Null:                       -53.850
                                        LLR p-value:                    0.6699
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()