Part I

1. Extract the reviewer data from the JSON file into a pandas DataFrame with reviewers in the rows, and the numerical ratings, review date, and review author name in the columns.
2. Calculate the mean, minimum, and maximum of each rating.
3. Save your numeric ratings data as a DataFrame in a pickle file in a shelve DB.
4. Save the reviewers' comments as text data indexed by reviewer name. Include the review date with each written review.

In [2]:
import pandas as pd
import json
import pprint
pp = pprint.PrettyPrinter(indent=4)
    
with open('100506.json') as input_file:
    j_dict = json.load(input_file)

# Iterate through the JSON to collect each review's Ratings dict and the
# set of rating names that appear in any review
unique_keys = set()
item_list = []
for x in j_dict['Reviews']:
    item_list.append(x['Ratings'])
    for rating_key in x['Ratings']:
        unique_keys.add(rating_key)
## Use pp so the output wraps correctly when exported to PDF
pp.pprint(unique_keys)

set([   u'Business service (e.g., internet access)',
        u'Check in / front desk',
        u'Cleanliness',
        u'Location',
        u'Overall',
        u'Rooms',
        u'Service',
        u'Sleep Quality',
        u'Value'])
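
As an alternative to collecting the ratings separately and joining them back later, each review can be flattened into one row in a single pass. The sketch below is only an illustration: the two sample reviews reuse author names, dates and scores from this notebook's output, but the assumption that the raw scores arrive as strings, and the helper name `flatten_review`, are mine.

```python
# Hypothetical sample shaped like j_dict['Reviews']; scores assumed to be strings.
reviews = [
    {'Author': 'luvsroadtrips', 'Date': 'January 3, 2012',
     'Ratings': {'Overall': '1.0', 'Location': '5.0'}},
    {'Author': 'estelle e', 'Date': 'December 29, 2011',
     'Ratings': {'Overall': '4.0', 'Value': '3.0'}},
]


def flatten_review(review):
    """Return one flat dict per review: author, date, and each numeric rating."""
    row = {'Author': review.get('Author'), 'Date': review.get('Date')}
    for name, score in review.get('Ratings', {}).items():
        row[name] = float(score)
    return row


rows = [flatten_review(r) for r in reviews]

# Union of the rating names seen in any review, sorted for a stable column order.
rating_names = sorted(set().union(*(r['Ratings'] for r in reviews)))
```

Passing `rows` to `pd.DataFrame` would give one row per review, with NaN wherever a reviewer skipped a rating.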


In [3]:
## Pass the column names as a list rather than a set
df_scores = pd.DataFrame(item_list, columns=list(unique_keys))

## Convert every column from object to numeric type; otherwise the
## 'describe' method reports counts and frequencies instead of averages
df_scores = df_scores.apply(pd.to_numeric)
print df_scores.head()

   Service  Cleanliness  Business service (e.g., internet access)  \
0      1.0          1.0                                       NaN   
1      4.0          4.0                                       NaN   
2      1.0          2.0                                       NaN   
3      1.0          1.0                                       NaN   
4      1.0          NaN                                       NaN   

   Check in / front desk  Overall  Value  Sleep Quality  Rooms  Location  
0                    NaN      1.0    1.0            1.0    1.0       5.0  
1                    NaN      4.0    3.0            5.0    3.0       5.0  
2                    NaN      1.0    1.0            1.0    1.0       1.0  
3                    NaN      1.0    1.0            1.0    1.0       1.0  
4                    NaN      1.0    3.0            NaN    1.0       5.0  
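
The conversion step can be made tolerant of malformed values. This is a minimal sketch on made-up data: the sample frame and the `'n/a'` entry are assumptions, not values from the review file. With `errors='coerce'`, `pd.to_numeric` replaces anything it cannot parse with NaN instead of raising.

```python
import pandas as pd

# Hypothetical sample: ratings stored as strings, with one missing and one bad value.
raw = pd.DataFrame({'Overall': ['1.0', '4.0', None],
                    'Value': ['1', 'n/a', '3']})

# Extra keyword arguments to apply() are forwarded to pd.to_numeric for each column.
numeric = raw.apply(pd.to_numeric, errors='coerce')
```

Both columns come back as float64, so `describe()` would then report means rather than counts and frequencies.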


In [4]:
## Build the identifying columns (Author, Date, ReviewID) from the original
## dataframe so they can be joined to 'scores' to form the final frame that
## the problem specifies.  Selecting by name avoids relying on column order.
df_origin = pd.DataFrame(j_dict['Reviews'])
df_tmp = df_origin[['Author', 'Date', 'ReviewID']].copy()
print df_tmp.head()

          Author                Date     ReviewID
0  luvsroadtrips     January 3, 2012  UR122476164
1      estelle e   December 29, 2011  UR122239883
2     RobertEddy   December 20, 2011  UR121931325
3        James R    October 30, 2011  UR119896310
4       Shobha49  September 14, 2011  UR118110693


In [5]:
df_ratings = df_tmp.join(df_scores)
print df_ratings.head()

          Author                Date     ReviewID  Service  Cleanliness  \
0  luvsroadtrips     January 3, 2012  UR122476164      1.0          1.0   
1      estelle e   December 29, 2011  UR122239883      4.0          4.0   
2     RobertEddy   December 20, 2011  UR121931325      1.0          2.0   
3        James R    October 30, 2011  UR119896310      1.0          1.0   
4       Shobha49  September 14, 2011  UR118110693      1.0          NaN   

   Business service (e.g., internet access)  Check in / front desk  Overall  \
0                                       NaN                    NaN      1.0   
1                                       NaN                    NaN      4.0   
2                                       NaN                    NaN      1.0   
3                                       NaN                    NaN      1.0   
4                                       NaN                    NaN      1.0   

   Value  Sleep Quality  Rooms  Location  
0    1.0            1.0    1.0 

In [6]:
## Per the problem statement, comments are indexed by reviewer name;
## otherwise I would keep ReviewID as the key.  Keep only Author,
## Content and Date to create the comments frame.
df_comments = df_origin[['Author', 'Content', 'Date']].copy()
print df_comments.head()

          Author                                            Content  \
0  luvsroadtrips  This place is not even suitable for the homele...   
1      estelle e  We stayed in downtown hotel Seattle for two ni...   
2     RobertEddy  i made reservations and when i showed up, i qu...   
3        James R  This hotel is so bad it's a joke. I could bare...   
4       Shobha49  My husband and I stayed at this hotel from 16t...   

                 Date  
0     January 3, 2012  
1   December 29, 2011  
2   December 20, 2011  
3    October 30, 2011  
4  September 14, 2011  
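
The Date column above is plain text, so it sorts alphabetically rather than chronologically. A small sketch of parsing it with the standard library follows. The date strings are taken from the output above, the helper name `parse_review_date` is hypothetical, and `%B` assumes an English locale for the month names.

```python
from datetime import datetime


def parse_review_date(text):
    """Parse a date such as 'January 3, 2012' into a datetime.date."""
    return datetime.strptime(text, '%B %d, %Y').date()


dates = ['January 3, 2012', 'December 29, 2011', 'September 14, 2011']
parsed = [parse_review_date(d) for d in dates]

# Once parsed, the dates compare chronologically.
latest = max(parsed)
oldest_first = sorted(parsed)
```

On a whole column, `pd.to_datetime(df_comments['Date'], format='%B %d, %Y')` should give the same result.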


In [7]:
## Use the pandas describe() method to get basic summary stats.
## It returns more than requested (count, std and the quartiles),
## so keep only the mean, min and max rows
df_stats = df_ratings.describe().loc[['mean', 'min', 'max']]
print df_stats

      Service  Cleanliness  Business service (e.g., internet access)  \
mean      2.3          2.0                                       1.0   
min       1.0          1.0                                       1.0   
max       5.0          5.0                                       1.0   

      Check in / front desk   Overall  Value  Sleep Quality     Rooms  \
mean                    3.0  1.666667    2.0       2.176471  1.545455   
min                     1.0  1.000000    1.0       1.000000  1.000000   
max                     5.0  4.000000    5.0       5.000000  5.000000   

      Location  
mean       4.0  
min        1.0  
max        5.0  
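
These statistics skip missing values, so a mean can rest on very few ratings. The sketch below keeps the count next to the mean, minimum and maximum to make that visible. The sample frame is invented for illustration, and `agg` is used here as an alternative to `describe()`, not as the method of the cell above.

```python
import pandas as pd

# Hypothetical ratings frame with one missing Value rating.
ratings = pd.DataFrame({'Author': ['a', 'b', 'c'],
                        'Overall': [1.0, 4.0, 1.0],
                        'Value': [1.0, 3.0, None]})

# Restrict to numeric columns, then compute only the requested statistics.
stats = ratings.select_dtypes(include='number').agg(['count', 'mean', 'min', 'max'])
```

Here `stats.loc['count']` shows that the Value mean is based on two ratings, while Overall uses all three.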


In [8]:
# pickle and save 
df_stats.to_pickle('stats.pkl')
df_ratings.to_pickle('ratings.pkl')
df_ratings.to_csv('ratings.csv', header=True, index=False, 
                  encoding='utf-8')
df_comments.to_csv('review_comments.csv', header=True, index=False, 
                   encoding='utf-8')
import glob
print glob.glob("*.pkl")
print glob.glob("*.csv")

['hotels.pkl', 'ratings.pkl', 'stats.pkl']
['hotel_info.csv', 'ratings.csv', 'review_comments.csv']
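
The problem statement also mentions a shelve DB and comments indexed by reviewer name. A minimal sketch of that storage with the standard `shelve` module follows. The author names and dates come from the output above, while the review text, the file name `reviews_shelf` and the temporary directory are placeholders of mine.

```python
import os
import shelve
import tempfile
from contextlib import closing

# Comments keyed by reviewer name; the Content strings are placeholders.
comments = {
    'luvsroadtrips': {'Date': 'January 3, 2012', 'Content': 'placeholder review text'},
    'estelle e': {'Date': 'December 29, 2011', 'Content': 'another placeholder'},
}

# Write to a temporary directory so the sketch leaves the working directory untouched.
db_path = os.path.join(tempfile.mkdtemp(), 'reviews_shelf')

# shelve pickles each value on assignment; closing() flushes the file on exit.
with closing(shelve.open(db_path)) as db:
    for author, record in comments.items():
        db[author] = record

# Reopen the shelf and look up one reviewer by name.
with closing(shelve.open(db_path)) as db:
    restored = db['estelle e']
```

Because shelve pickles its values, a DataFrame such as the ratings frame could be stored under a key in the same way. Note that keying by reviewer name keeps only the last review if two reviews share an author name.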


Part II

You'll process additional JSON data files, parsing the "HotelInfo" data from every file into a single pandas DataFrame that is suitable for subsequent analyses.

You may be asked to share your code and data with others.

In [9]:
import glob
import re

## Get the JSON files from the 'hotel_files' subdirectory of the
## current working directory
file_list = glob.iglob('./hotel_files/*.json')
row_count = 0

## Initialize an empty dataframe
df_hotels = pd.DataFrame()

## Iterate over every file in the directory.  Load the JSON blob and
## walk its HotelInfo fields.  For each field, strip the HTML tags
## with a regex, then save the cleaned value to the dataframe.
for f in file_list:
    with open(f) as input_file:
        try:
            tmp = json.load(input_file)
            for x in tmp['HotelInfo']:
                value = str(tmp['HotelInfo'][x])
                clean_val = re.sub('<[^>]*>', '', value)
                df_hotels.set_value(row_count, x, clean_val)
        except ValueError:
            print "File", f, "is not a valid JSON file"
    row_count += 1
print df_hotels.head()
print df_hotels.head()

                                Name  \
0                      Hotel Seattle   
1                                NaN   
2           Kendall Hotel and Suites   
3  San Diego Marriott Mission Valley   
4              Hotel Banys Orientals   

                                            HotelURL  HotelID  \
0  /ShowUserReviews-g60878-d100506-Reviews-Hotel_...   100506   
1  http://www.tripadvisor.com/ShowUserReviews-g60...  1217974   
2  /ShowUserReviews-g34438-d240124-Reviews-Kendal...   240124   
3  /ShowUserReviews-g60750-d80232-Reviews-San_Die...  2515575   
4  /ShowUserReviews-g187497-d287670-Reviews-Hotel...   287670   

                                             Address         Price  \
0               315 Seneca St., Seattle, WA 98101      $96 - $118*   
1                                                NaN       Unkonwn   
2       9100 North Kendall Drive, Miami, FL 33176     $124 - $181*   
3    8757 Rio San Diego Drive, San Diego, CA 9210...  $158 - $248*   
4          c/ Arge
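
Filling the frame one cell at a time is slow, and `set_value` has since been deprecated and removed from pandas. One alternative is to collect a cleaned dict per hotel and build the frame once at the end. The sketch below shows the cleaning and collecting part with the same regex as the loop above. The sample blobs, the `<address>` tag and the helper name `clean_hotel_info` are assumptions for illustration.

```python
import re

# Same pattern as the loop above: anything between '<' and '>' is removed.
TAG_RE = re.compile(r'<[^>]*>')


def clean_hotel_info(info):
    """Return a copy of a HotelInfo dict with HTML tags stripped from every value."""
    return {key: TAG_RE.sub('', str(value)) for key, value in info.items()}


# Hypothetical parsed JSON blobs; the second has no Name and the third no HotelInfo.
blobs = [
    {'HotelInfo': {'Name': 'Hotel Seattle', 'HotelID': '100506',
                   'Address': '<address>315 Seneca St., Seattle, WA 98101</address>'}},
    {'HotelInfo': {'HotelID': '1217974', 'Price': 'Unkonwn'}},
    {'Reviews': []},
]

# One cleaned record per file that actually carries HotelInfo.
records = [clean_hotel_info(b['HotelInfo']) for b in blobs if 'HotelInfo' in b]
```

`pd.DataFrame(records)` would then produce one row per hotel, with NaN for fields a file lacks, such as the missing Name in the second record.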

In [10]:
## Save the HotelInfo dataframe
df_hotels.to_csv('hotel_info.csv', header=True, 
                 index=False, encoding='utf-8')            
df_hotels.to_pickle('hotels.pkl')
print glob.glob("*.pkl")
print glob.glob("*.csv")

['hotels.pkl', 'ratings.pkl', 'stats.pkl']
['hotel_info.csv', 'ratings.csv', 'review_comments.csv']
