<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 5

## Help Yelp

---

In this project you will be investigating a small version of the [Yelp challenge dataset](https://www.yelp.com/dataset_challenge). You'll practice using classification algorithms, cross-validation, gridsearching – all that good stuff.



---

### The data

There are 5 individual .csv files that have the information, zipped into .7z format like with the SF data last project. The dataset is located in your datasets folder:

    DSI-SF-2/datasets/yelp_arizona_data.7z

The columns in each are:

    businesses_small_parsed.csv
        business_id: unique business identifier
        name: name of the business
        review_count: number of reviews per business
        city: city business resides in
        stars: average rating
        categories: categories the business falls into (can be one or multiple)
        latitude
        longitude
        neighborhoods: neighborhoods business belongs to
        variable: "property" of the business (a tag)
        value: True/False for the property
        
    reviews_small_nlp_parsed.csv
        user_id: unique user identifier
        review_id: unique review identifier
        votes.cool: how many thought the review was "cool"
        business_id: unique business id the review is for
        votes.funny: how many thought the review was funny
        stars: rating given
        date: date of review
        votes.useful: how many thought the review was useful
        ... 100 columns of counts of most common 2 word phrases that appear in reviews in this review
        
    users_small_parsed.csv
        yelping_since: signup date
        compliments.plain: # of compliments "plain"
        review_count: # of reviews
        compliments.cute: total # of compliments "cute"
        compliments.writer: # of compliments "writer"
        compliments.note: # of compliments "note" (not sure what this is)
        compliments.hot: # of compliments "hot" (?)
        compliments.cool: # of compliments "cool"
        compliments.profile: # of compliments "profile"
        average_stars: average rating
        compliments.more: # of compliments "more"
        elite: years considered "elite"
        name: user's name
        user_id: unique user id
        votes.cool: # of votes "cool"
        compliments.list: # of compliments "list"
        votes.funny: # of votes "funny"
        compliments.photos: # of compliments "photos"
        compliments.funny: # of compliments "funny"
        votes.useful: # of votes "useful"
       
    checkins_small_parsed.csv
        business_id: unique business identifier
        variable: day-time identifier of checkins (0-0 is Sunday 0:00 - 1:00am,  for example)
        value: # of checkins at that time
    
    tips_small_nlp_parsed.csv
        user_id: unique user identifier
        business_id: unique business identifier
        likes: likes that the tip has
        date: date of tip
        ... 100 columns of counts of most common 2 word phrases that appear in tips in this tip

The reviews and tips datasets in particular have parsed "NLP" columns with counts of 2-word phrases in that review or tip (a "tip", it seems, is some kind of smaller review).

The user dataset has a lot of columns of counts of different compliments and votes. I'm not sure whether the compliments or votes are _by_ the user or _for_ the user.

---

If you look at the website, or the full data, you'll see I have removed pieces of the data and cut it down quite a bit. This is to simplify it for this project. Specifically, businesses are limited to these cities:

    Phoenix
    Surprise
    Las Vegas
    Waterloo

Apparently there is a city called "Surprise" in Arizona. 

Businesses are also restricted to be in at least one of the following categories, because I thought the mix of them was funny:

    Airports
    Breakfast & Brunch
    Bubble Tea
    Burgers
    Bars
    Bakeries
    Breweries
    Cafes
    Candy Stores
    Comedy Clubs
    Courthouses
    Dance Clubs
    Fast Food
    Museums
    Tattoo
    Vape Shops
    Yoga
    
---

### Project requirements

**You will be performing 4 different sections of analysis, like in the last project.**

Remember that classification targets are categorical and regression targets are continuous variables.

In [1]:
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV

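# sklearn.cross_validation and sklearn.grid_search were merged into
# sklearn.model_selection (available from scikit-learn 0.18)
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import datasets
from sklearn.model_selection import GridSearchCV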
from sklearn.metrics import confusion_matrix

# Load graph packages
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-white')
%matplotlib inline

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Constructing a "profile" for Las Vegas

---

Yelp is interested in building out what they are calling "profiles" for cities. They want you to start with just Las Vegas to see what a prototype of this would look like. Essentially, they want to know what makes Las Vegas distinct from the other three.

Use the data you have to predict Las Vegas from the other variables you have. You should not be predicting the city from any kind of location data or other data perfectly associated with that city (or another city).

You may use any classification algorithm you deem appropriate, or even multiple models. You should:

1. Build at least one model predicting Las Vegas vs. the other cities.
- Validate your model(s).
- Interpret and visualize, in some way, the results.
- Write up a "profile" for Las Vegas. This should be a writeup converting your findings from the model(s) into a human-readable description of the city.

***
### Dataset
I've chosen to look at businesses_small_parsed.csv because I think its features are the most interesting, specifically the categories column. My working hypothesis is that the frequency of certain business categories (for example Nightlife, Adult Entertainment, etc.) makes Las Vegas distinct from Phoenix, Surprise, and Waterloo.

In [2]:
tips = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/tips_small_nlp_parsed.csv')

In [3]:
checkins = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/checkins_small_parsed.csv')

In [4]:
reviews = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/reviews_small_nlp_parsed.csv')

In [5]:
users = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/users_small_parsed.csv')

In [6]:
business = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/businesses_small_parsed.csv')

In [7]:
business.head(2)

Unnamed: 0,business_id,name,review_count,city,stars,categories,latitude,longitude,neighborhoods,variable,value
0,EmzaQR5hQlF0WIl24NxAZA,Sky Lounge,25,Phoenix,2.5,"['American (New)', 'Nightlife', 'Dance Clubs',...",33.448399,-112.071702,[],attributes.Ambience.divey,False
1,SiwN7f0N4bs4ZtPc4yPgiA,Palazzo,19,Phoenix,3.0,"['Bars', 'Nightlife', 'Dance Clubs']",33.455885,-112.074177,[],attributes.Ambience.divey,False


In [8]:
business.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152832 entries, 0 to 152831
Data columns (total 11 columns):
business_id      152832 non-null object
name             152832 non-null object
review_count     152832 non-null int64
city             152832 non-null object
stars            152832 non-null float64
categories       152832 non-null object
latitude         152832 non-null float64
longitude        152832 non-null float64
neighborhoods    152832 non-null object
variable         152832 non-null object
value            152832 non-null object
dtypes: float64(3), int64(1), object(7)
memory usage: 12.8+ MB


In [9]:
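# Alternate check for missing values: show any row with at least one NaN.
# (NaN never compares equal to anything, so test with isnull() rather than == np.nan.)
business[business.isnull().any(axis=1)]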

Unnamed: 0,business_id,name,review_count,city,stars,categories,latitude,longitude,neighborhoods,variable,value
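
As a small illustration of why an equality test against `np.nan` comes back empty even when values are missing, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical, not the Yelp data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `business`; the values are made up.
toy = pd.DataFrame({
    'name': ['Sky Lounge', None, 'Palazzo'],
    'stars': [2.5, 3.0, np.nan],
})

# NaN is never equal to anything, including itself, so this finds nothing.
matches_by_equality = int((toy == np.nan).sum().sum())

# isnull() flags missing cells; any(axis=1) keeps rows with at least one.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(matches_by_equality)            # 0
print(list(rows_with_missing.index))  # [1, 2]
```

On the real `business` frame the `info()` output already reports no nulls, so an empty result there is expected either way.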


In [10]:
# Convert city to a binary target: 0 = Las Vegas, 1 = any other city.
# It would be cool to also encode the other cities,
# but for simplicity's sake I'm restricting it to Las Vegas vs. not Las Vegas.
business['city'] = business['city'].map(lambda x: 0 if x == 'Las Vegas' else 1)

In [11]:
business.city.value_counts()

0    93818
1    59014
Name: city, dtype: int64

In [12]:
business.categories.value_counts()

['Burgers', 'Fast Food', 'Restaurants']                                                                         9522
['Fast Food', 'Restaurants']                                                                                    5752
['Fast Food', 'Sandwiches', 'Restaurants']                                                                      5616
['Burgers', 'Restaurants']                                                                                      4817
['Fast Food', 'Mexican', 'Restaurants']                                                                         4726
['Bars', 'Nightlife', 'Lounges']                                                                                3901
['Breakfast & Brunch', 'American (Traditional)', 'Restaurants']                                                 3595
['Bars', 'Nightlife']                                                                                           3030
['Breakfast & Brunch', 'Restaurants']                           

In [13]:
# Create binary category columns with get_dummies; this gives slightly different results than Kiefer's way
stripped = business['categories'].map(lambda x: ''.join([y for y in x if not y in "[]'\""]))
new_cat = stripped.str.get_dummies(', ')

In [14]:
new_cat.head()

Unnamed: 0,Active Life,Adult Entertainment,Afghan,Airport Lounges,Airports,Amateur Sports Teams,American (New),American (Traditional),Arcades,Art Galleries,...,Vegan,Vegetarian,Venues & Event Spaces,Vintage & Consignment,Weight Loss Centers,Wholesale Stores,Wine & Spirits,Wine Bars,Womens Clothing,Yoga
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
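# Create binary values, Kiefer's way: parse each string into a real list first.
# ast.literal_eval only evaluates literals, so it is safer than eval here.
import ast
business['cat_lists'] = business.categories.map(ast.literal_eval)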

In [16]:
unique_categories = []
for catlist in business.cat_lists.values:
    unique_categories.extend(catlist)
    
unique_categories = np.unique(unique_categories)
unique_categories[:10]

array(['Active Life', 'Adult Entertainment', 'Afghan', 'Airport Lounges',
       'Airports', 'Amateur Sports Teams', 'American (New)',
       'American (Traditional)', 'Arcades', 'Art Galleries'], 
      dtype='|S32')

In [18]:
new_cat_2 = pd.DataFrame()

for uc in unique_categories:
    new_cat_2[uc] = business.cat_lists.map(lambda x: 1 if uc in x else 0)

In [19]:
new_cat_2.head()


Unnamed: 0,Active Life,Adult Entertainment,Afghan,Airport Lounges,Airports,Amateur Sports Teams,American (New),American (Traditional),Arcades,Art Galleries,...,"Used, Vintage & Consignment",Vape Shops,Vegan,Vegetarian,Venues & Event Spaces,Weight Loss Centers,Wholesale Stores,Wine Bars,Women's Clothing,Yoga
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
business.categories.value_counts()

['Burgers', 'Fast Food', 'Restaurants']                                                                         9522
['Fast Food', 'Restaurants']                                                                                    5752
['Fast Food', 'Sandwiches', 'Restaurants']                                                                      5616
['Burgers', 'Restaurants']                                                                                      4817
['Fast Food', 'Mexican', 'Restaurants']                                                                         4726
['Bars', 'Nightlife', 'Lounges']                                                                                3901
['Breakfast & Brunch', 'American (Traditional)', 'Restaurants']                                                 3595
['Bars', 'Nightlife']                                                                                           3030
['Breakfast & Brunch', 'Restaurants']                           

In [22]:
# Columns that only the get_dummies way produces: it splits on commas without regard to quotation marks
[c for c in new_cat.columns if c not in new_cat_2.columns]

['Beer',
 'Books',
 'Mags',
 'Music & Video',
 'Used',
 'Vintage & Consignment',
 'Wine & Spirits',
 'Womens Clothing']

In [24]:
# Columns that only Kiefer's way produces: it respects the quoted names, which is more appropriate here
[c for c in new_cat_2.columns if c not in new_cat.columns]

['Beer, Wine & Spirits',
 'Books, Mags, Music & Video',
 'Used, Vintage & Consignment',
 "Women's Clothing"]

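The gap between the two encodings comes from category names that themselves contain a comma. A minimal sketch on two made-up rows (the `cats` series is hypothetical; only the string format mirrors the `categories` column) shows the effect:

```python
import ast
import pandas as pd

# Two made-up rows in the same string-encoded list format as `categories`.
cats = pd.Series([
    "['Beer, Wine & Spirits', 'Bars']",
    "['Bars', 'Nightlife']",
])

# Approach 1: strip brackets and quotes, then split on ', '.
stripped = cats.map(lambda x: ''.join(ch for ch in x if ch not in "[]'\""))
by_comma = stripped.str.get_dummies(', ')

# Approach 2: parse each string into a real list, then test membership.
parsed = cats.map(ast.literal_eval)
labels = sorted(set(label for row in parsed for label in row))
by_list = pd.DataFrame(
    {label: parsed.map(lambda row, label=label: int(label in row)) for label in labels}
)

print(list(by_comma.columns))  # ['Bars', 'Beer', 'Nightlife', 'Wine & Spirits']
print(list(by_list.columns))   # ['Bars', 'Beer, Wine & Spirits', 'Nightlife']
```

Parsing first keeps 'Beer, Wine & Spirits' as one category, which matches the difference listed above.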
In [14]:
# # another approach that was pretty slow
# #convert categories to binary
# def fix_cat(row):
#     categories = eval(row['categories'])
#     for category in categories:
#         row[category] = 1
#     return row
# new_cat = business.apply(fix_cat, axis=1).fillna(0)

In [25]:
new_cat_2.head(2)

Unnamed: 0,Active Life,Adult Entertainment,Afghan,Airport Lounges,Airports,Amateur Sports Teams,American (New),American (Traditional),Arcades,Art Galleries,...,"Used, Vintage & Consignment",Vape Shops,Vegan,Vegetarian,Venues & Event Spaces,Weight Loss Centers,Wholesale Stores,Wine Bars,Women's Clothing,Yoga
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
new_cat_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152832 entries, 0 to 152831
Columns: 210 entries, Active Life to Yoga
dtypes: int64(210)
memory usage: 244.9 MB


In [27]:
#here I'm just cleaning up column names
new_cat_2.rename(columns=lambda x: x.lower().replace(' & ','_').replace(' ','_'), inplace=True)

In [28]:
new_cat_2.head(2)

Unnamed: 0,active_life,adult_entertainment,afghan,airport_lounges,airports,amateur_sports_teams,american_(new),american_(traditional),arcades,art_galleries,...,"used,_vintage_consignment",vape_shops,vegan,vegetarian,venues_event_spaces,weight_loss_centers,wholesale_stores,wine_bars,women's_clothing,yoga
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
#need confirmation this all worked!
list(new_cat_2.columns)

['active_life',
 'adult_entertainment',
 'afghan',
 'airport_lounges',
 'airports',
 'amateur_sports_teams',
 'american_(new)',
 'american_(traditional)',
 'arcades',
 'art_galleries',
 'arts_crafts',
 'arts_entertainment',
 'asian_fusion',
 'bagels',
 'bakeries',
 'barbeque',
 'barbers',
 'barre_classes',
 'bars',
 'beauty_spas',
 'beer_bar',
 'beer_gardens',
 'beer,_wine_spirits',
 'books,_mags,_music_video',
 'boot_camps',
 'botanical_gardens',
 'bowling',
 'brasseries',
 'breakfast_brunch',
 'breweries',
 'british',
 'bubble_tea',
 'buffets',
 'burgers',
 'cabaret',
 'cafes',
 'canadian_(new)',
 'candy_stores',
 'caribbean',
 'casinos',
 'caterers',
 'champagne_bars',
 'cheesesteaks',
 'chicken_shop',
 'chicken_wings',
 'chinese',
 'chocolatiers_shops',
 'cinema',
 'cocktail_bars',
 'coffee_tea',
 'comedy_clubs',
 'comfort_food',
 'comic_books',
 'convenience_stores',
 'cooking_classes',
 'country_dance_halls',
 'courthouses',
 'creperies',
 'cuban',
 'cultural_center',
 'cupcakes'
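
The listing above still contains parentheses and commas (for example 'american_(new)' and 'beer,_wine_spirits'). If fully alphanumeric names are wanted, one possible cleanup is a regular expression; `snake_case` below is a hypothetical helper, not something the notebook defines:

```python
import re

def snake_case(label):
    """Lowercase a label and collapse each run of non-alphanumerics to one underscore."""
    return re.sub(r'[^a-z0-9]+', '_', label.lower()).strip('_')

examples = ['American (New)', 'Beer, Wine & Spirits', "Women's Clothing", 'Breakfast & Brunch']
cleaned = [snake_case(label) for label in examples]
print(cleaned)
# ['american_new', 'beer_wine_spirits', 'women_s_clothing', 'breakfast_brunch']
```

It could be applied with `new_cat_2.rename(columns=snake_case)`. Two distinct labels could in principle collapse to the same name, so checking for duplicate columns afterwards would be prudent.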

In [30]:
new_cat_2['adult_entertainment'].value_counts()

0    152503
1       329
Name: adult_entertainment, dtype: int64

In [48]:
# what's a good way to visualize these variables?
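
One possible answer, sketched on made-up data: because the category columns are 0/1, their mean within each city group is the share of rows carrying that category, and the difference in shares can be sorted and plotted. The `toy` frame and its column values are invented for illustration:

```python
import pandas as pd

# Made-up dummy columns plus the binary city flag (0 = Las Vegas, 1 = elsewhere).
toy = pd.DataFrame({
    'city':    [0, 0, 0, 0, 1, 1, 1, 1],
    'casinos': [1, 1, 0, 1, 0, 0, 0, 1],
    'burgers': [0, 1, 0, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the share of rows with that category.
share = toy.groupby('city').mean().T
share.columns = ['las_vegas', 'other']
share['difference'] = share['las_vegas'] - share['other']
share = share.sort_values('difference', ascending=False)
print(share)

# share['difference'].plot.barh() would draw it as a horizontal bar chart.
```

On the real data each business appears once per attribute row, so dropping duplicates on business_id before computing shares may be sensible.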

#### Here I pivoted but decided not to do anything with these columns because I opted for a simpler approach due to time constraints

In [31]:
business.value.value_counts()

False          70687
True           32400
00:00           5560
11:00           3212
casual          2577
10:00           2496
22:00           2439
1.0             1932
average         1892
07:00           1694
2.0             1679
21:00           1631
06:00           1576
02:00           1572
no              1556
full_bar        1470
none            1399
23:00           1397
09:00           1165
free             983
17:00            894
20:00            886
15:00            750
08:00            726
16:00            710
19:00            684
12:00            677
01:00            653
10:30            624
18:00            569
               ...  
15:30             62
23:30             51
18:30             44
16:30             44
4.0               38
02:30             36
paid              31
yes_corkage       24
12:30             24
13:30             21
04:30             20
23:59             13
00:30             12
03:30              6
18plus             6
01:20              6
05:45        

In [32]:
business.variable.value_counts()

open                                           4132
attributes.Accepts Credit Cards                3896
attributes.Price Range                         3843
attributes.Parking.valet                       3427
attributes.Parking.street                      3427
attributes.Parking.validated                   3427
attributes.Parking.garage                      3427
attributes.Parking.lot                         3427
attributes.Good For Groups                     3362
attributes.Outdoor Seating                     3267
attributes.Has TV                              3084
attributes.Alcohol                             3050
attributes.Good for Kids                       2909
attributes.Ambience.trendy                     2873
attributes.Ambience.touristy                   2873
attributes.Ambience.classy                     2873
attributes.Ambience.romantic                   2873
attributes.Ambience.intimate                   2873
attributes.Ambience.casual                     2873
attributes.A

In [33]:
def select_item_or_nan(x):
    x = x.iloc[0]
    if len(x) == 0:
        return np.nan
    else:
        return x

biz_wide = pd.pivot_table(business, columns=['variable'], values='value',
                            index=['business_id'], aggfunc=select_item_or_nan,
                            fill_value=np.nan)

In [34]:
biz_wide.head()

variable,attributes.Accepts Credit Cards,attributes.Accepts Insurance,attributes.Ages Allowed,attributes.Alcohol,attributes.Ambience.casual,attributes.Ambience.classy,attributes.Ambience.divey,attributes.Ambience.hipster,attributes.Ambience.intimate,attributes.Ambience.romantic,...,hours.Saturday.open,hours.Sunday.close,hours.Sunday.open,hours.Thursday.close,hours.Thursday.open,hours.Tuesday.close,hours.Tuesday.open,hours.Wednesday.close,hours.Wednesday.open,open
business_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--jFTZmywe7StuZ2hEjxyA,True,,,none,,,,,,,...,,,,,,,,,,True
-0HGqwlfw3I8nkJyMHxAsQ,True,,,none,,,,,,,...,,,,,,,,,,True
-0VK5Z1BfUHUYq4PoBYNLw,True,,,full_bar,True,False,False,False,False,False,...,,,,,,,,,,True
-0bUDim5OGuv8R0Qqq6J4A,True,,,,,,,,,,...,,,,,,,,,,False
-1bOb2izeJBZjHC7NWxiPA,True,,,none,True,False,False,False,False,False,...,06:30,14:30,06:30,14:30,06:30,14:30,06:30,14:30,06:30,True


In [35]:
biz_wide['attributes.Alcohol'].value_counts()

full_bar         1470
none             1399
beer_and_wine     181
Name: attributes.Alcohol, dtype: int64

In [36]:
biz_wide['hours.Sunday.close'].value_counts()

00:00    488
22:00    302
02:00    189
21:00    173
23:00    171
15:00     92
20:00     89
01:00     82
19:00     76
18:00     76
14:00     56
17:00     51
16:00     42
03:00     41
04:00     36
06:00     34
21:30     23
22:30     20
14:30     19
05:00     14
12:00     14
09:00     11
05:30     10
13:00     10
20:30      9
07:00      7
18:30      7
23:30      6
01:30      6
13:30      5
02:30      5
17:30      5
08:00      3
19:30      3
11:30      2
10:00      2
15:30      2
12:30      2
23:59      2
19:15      1
08:30      1
01:15      1
03:45      1
16:30      1
11:00      1
10:30      1
16:45      1
22:15      1
Name: hours.Sunday.close, dtype: int64

#### Looking at some more straightforward columns

In [41]:
business.columns

Index([u'business_id', u'name', u'review_count', u'city', u'stars',
       u'categories', u'latitude', u'longitude', u'neighborhoods', u'variable',
       u'value', u'cat_lists'],
      dtype='object')

In [42]:
business.review_count.isnull().sum()

0

In [43]:
business.review_count.describe()

count    152832.00000
mean        113.35011
std         262.24581
min           3.00000
25%          12.00000
50%          34.00000
75%         101.00000
max        5642.00000
Name: review_count, dtype: float64

In [44]:
business.stars.isnull().sum()

0

In [45]:
business.stars.describe()

count    152832.000000
mean          3.471966
std           0.768355
min           1.000000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: stars, dtype: float64

#### Looking at users & reviews briefly

In [46]:
reviews.head(2)

Unnamed: 0,user_id,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful,10 minutes,15 minutes,...,service great,staff friendly,super friendly,sweet potato,tasted like,time vegas,try place,ve seen,ve tried,wait staff
0,o_LCYay4uo5N4eq3U5pbrQ,biEOCicjWlibF26pNLvhcw,0,EmzaQR5hQlF0WIl24NxAZA,0,3,2007-09-14,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,sEWeeq41k4ohBz4jS_iGRw,tOhOHUAS7XJch7a_HW5Csw,3,EmzaQR5hQlF0WIl24NxAZA,12,2,2008-04-21,3,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144206 entries, 0 to 144205
Data columns (total 21 columns):
yelping_since          144206 non-null object
compliments.plain      47034 non-null float64
review_count           144206 non-null int64
compliments.cute       13133 non-null float64
compliments.writer     33222 non-null float64
fans                   144206 non-null int64
compliments.note       39872 non-null float64
compliments.hot        31748 non-null float64
compliments.cool       41069 non-null float64
compliments.profile    12368 non-null float64
average_stars          144206 non-null float64
compliments.more       25066 non-null float64
elite                  144206 non-null object
name                   144206 non-null object
user_id                144206 non-null object
votes.cool             144206 non-null int64
compliments.list       7180 non-null float64
votes.funny            144206 non-null int64
compliments.photos     18759 non-null float64
compliments.funny  

#### Plan
Ideally I'd like to merge some columns from users and reviews into the business dataset (for example users['review_count'], on the idea that Las Vegas reviewers have higher review counts than reviewers in the other cities). Since I'm short on time, for now I'll look only at review_count and stars in the business dataset. I'll try out KNN and logistic regression.


In [43]:
business.columns

Index([u'business_id', u'name', u'review_count', u'city', u'stars',
       u'categories', u'latitude', u'longitude', u'neighborhoods', u'variable',
       u'value'],
      dtype='object')

In [44]:
biz_subset = business[['city', 'review_count', 'stars']]

In [45]:
business.city.value_counts()

0    93818
1    59014
Name: city, dtype: int64

In [77]:
#not sure it makes sense to do this
biz_subset.corr()

Unnamed: 0,city,review_count,stars
city,1.0,-0.096229,-0.004952
review_count,-0.096229,1.0,0.153375
stars,-0.004952,0.153375,1.0


#### Performing stratified cross-validation on a KNN classifier and logistic regression

In [68]:
y = business.city.values
X = business[['review_count','stars']]

In [69]:
y.shape

(152832,)

In [70]:
X.shape

(152832, 2)

In [71]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)

In [74]:
params = {
    'n_neighbors':range(1,101),
    'weights':['uniform','distance']
}

knn = KNeighborsClassifier()

knn_gs = GridSearchCV(knn, params, cv=5, verbose=1, n_jobs=4)
knn_gs.fit(X_train, y_train)

print(knn_gs.best_params_)
best_knn = knn_gs.best_estimator_

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    4.7s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   23.5s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  2.3min


{'n_neighbors': 79, 'weights': 'distance'}


[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:  3.3min finished


In [75]:
print(best_knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=79, p=2,
           weights='distance')


In [76]:
cross_val_score(best_knn, X, y, cv=5)

array([ 0.74321981,  0.73873785,  0.74793078,  0.7466466 ,  0.74254867])
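
For context, the value counts above put the majority class (Las Vegas, coded 0) at 93,818 of 152,832 rows, about 61%, so accuracies near 0.74 are best read against that baseline. The sketch below, on synthetic data rather than the Yelp frame, shows one way to compute such a baseline with explicit stratified folds, and to scale the features before KNN since review counts and stars live on very different scales. All names and numbers in it are illustrative assumptions:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for review_count and stars; not the Yelp data.
rng = np.random.RandomState(0)
n = 600
y_demo = (rng.rand(n) < 0.4).astype(int)
X_demo = np.column_stack([
    rng.exponential(100, n) * np.where(y_demo == 0, 1.5, 1.0),  # count-like, large scale
    np.clip(rng.normal(3.5, 0.8, n), 1, 5),                     # star-like, small scale
])

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: always predict the most frequent class.
baseline = cross_val_score(DummyClassifier(strategy='most_frequent'),
                           X_demo, y_demo, cv=folds)

# KNN is distance based, so put both features on a comparable scale first.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
knn_scores = cross_val_score(knn_scaled, X_demo, y_demo, cv=folds)

print(baseline.mean(), knn_scores.mean())
```

A further caveat for the real data: each business appears on many rows (one per attribute) with identical review_count and stars, so copies of the same business can land in both the training and test folds. Deduplicating on business_id before splitting would likely give a more honest estimate.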

In [None]:
# I'd like to use all the category variables for logistic regression, regularization, lasso
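
A rough sketch of what that next step might look like: an L1-penalized (lasso-style) logistic regression on 0/1 category columns, where the penalty shrinks uninformative coefficients toward zero. The data, column names, and effect sizes below are invented, and the target is coded 1 = Las Vegas for readability (the notebook codes Las Vegas as 0):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic 0/1 category indicators; names and effects are invented.
rng = np.random.RandomState(1)
n = 2000
X_cat = pd.DataFrame({
    'casinos': rng.binomial(1, 0.2, n),
    'burgers': rng.binomial(1, 0.3, n),
    'yoga':    rng.binomial(1, 0.1, n),
})

# Simulated target: casinos raise the odds of "Las Vegas", burgers lower them.
logit = -0.5 + 2.5 * X_cat['casinos'] - 1.0 * X_cat['burgers']
is_vegas = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# penalty='l1' needs a solver that supports it, such as liblinear or saga.
lasso_logit = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
lasso_logit.fit(X_cat, is_vegas)

coefs = pd.Series(lasso_logit.coef_[0], index=X_cat.columns).sort_values()
print(coefs)
```

Sorted coefficients like these could feed the written "profile": positive weights mark categories over-represented in the target city, negative weights the opposite. The value of `C` here is arbitrary and would normally be chosen by cross-validation.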

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Different categories of ratings

---

Yelp is finally ready to admit that their rating system sucks. No one cares about the ratings, they just use the site to find out what's nearby. The ratings are simply too unreliable for people. 

Yelp hypothesizes that this is, in fact, because different people tend to give their ratings based on different things. They believe that perhaps some people always base their ratings on quality of food, others on service, and perhaps other categories as well. 

1. Do some users tend to talk about service more than others in reviews/tips? Divide up the tips/reviews into more "service-focused" ones and those less concerned with service.
2. Create two new ratings for businesses: ratings from just the service-focused reviews and ratings from the non-service reviews.
3. Construct a regression model for each of the two ratings. They should use the same predictor variables (of your choice). 
4. Validate the performance of the models.
5. Do the models' coefficients differ at all? What does this tell you about the hypothesis that there are in fact two different kinds of ratings?
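
Steps 1 and 2 can be sketched as follows, under two stated assumptions: a phrase column is treated as service-related if its name contains "service", and a review counts as service-focused if it uses any such phrase at least once. The `toy_reviews` frame is made up; only its layout mirrors the reviews file:

```python
import pandas as pd

# Made-up reviews; only the column layout mirrors reviews_small_nlp_parsed.csv.
toy_reviews = pd.DataFrame({
    'business_id':      ['a', 'a', 'a', 'b', 'b'],
    'stars':            [5, 2, 4, 3, 1],
    'customer service': [1, 0, 0, 0, 2],
    'great service':    [0, 0, 0, 0, 0],
    'bar food':         [0, 1, 0, 1, 0],
})

# Assumption: service-related columns are those whose name contains "service".
service_cols = [c for c in toy_reviews.columns if 'service' in c]

# Assumption: a review is service-focused if any such phrase appears in it.
toy_reviews['service_focused'] = toy_reviews[service_cols].sum(axis=1) > 0

# One average rating per business for each kind of review.
ratings = (toy_reviews
           .groupby(['business_id', 'service_focused'])['stars']
           .mean()
           .unstack())
ratings.columns = ['non_service_stars', 'service_stars']
print(ratings)
```

Other definitions are equally defensible, for instance a threshold on the share of service phrases per review, or adding wait-time phrases such as "10 minutes" to the service set.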

In [49]:
reviews.head()

Unnamed: 0,user_id,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful,10 minutes,15 minutes,...,service great,staff friendly,super friendly,sweet potato,tasted like,time vegas,try place,ve seen,ve tried,wait staff
0,o_LCYay4uo5N4eq3U5pbrQ,biEOCicjWlibF26pNLvhcw,0,EmzaQR5hQlF0WIl24NxAZA,0,3,2007-09-14,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,sEWeeq41k4ohBz4jS_iGRw,tOhOHUAS7XJch7a_HW5Csw,3,EmzaQR5hQlF0WIl24NxAZA,12,2,2008-04-21,3,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1AqEqmmVHgYCuzcMrF4h2g,2aGafu-x7onydGoDgDfeQQ,0,EmzaQR5hQlF0WIl24NxAZA,2,2,2009-11-16,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,pv82zTlB5Txsu2Pusu__FA,CY4SWiYcUZTWS_T_cGaGPA,4,EmzaQR5hQlF0WIl24NxAZA,9,2,2010-08-16,6,0,0,...,0,0,0,0,0,0,0,0,0,0
4,jlr3OBS1_Y3Lqa-H3-FR1g,VCKytaG-_YkxmQosH4E0jw,0,EmzaQR5hQlF0WIl24NxAZA,1,4,2010-12-04,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
reviews.shape

(322398, 108)

In [53]:
reviews_samp = reviews.sample(frac=0.1)

In [54]:
reviews_samp.shape

(32240, 108)

In [75]:
reviews_samp.head(2)

Unnamed: 0,user_id,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful,10 minutes,15 minutes,...,service great,staff friendly,super friendly,sweet potato,tasted like,time vegas,try place,ve seen,ve tried,wait staff
311069,0trYkEZFTej_lt8mi0ELWw,YBk0SZTs42V8TVbv2vkQyw,0,hsRwhrj0zPJWm_5Z3ZemMw,0,4,2015-06-28,0,0,0,...,0,0,0,0,0,0,0,0,0,0
139684,1Eevry0X_8yb6yzsQilptg,PDnODS86pUXDOxLBe9Jqug,0,8UhKnPs2Jvz7vjzqYCTcWg,0,4,2011-11-06,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [113]:
# Build the service-related subset in one selection and copy it,
# which avoids assigning new columns into a filtered frame (SettingWithCopyWarning)
service_like = list(reviews_samp.filter(like='service').columns)
extra_cols = ['user_id', '10 minutes', '15 minutes', '20 minutes', '30 minutes', 'friendly staff', 'wait staff']
samp_service = reviews_samp[service_like + extra_cols].copy()
samp_service.head()

Unnamed: 0,bottle service,customer service,food service,good service,great service,service excellent,service food,service friendly,service good,service great,user_id,10 minutes,15 minutes,20 minutes,30 minutes,friendly staff,wait staff
311069,0,0,0,0,0,0,0,0,0,0,0trYkEZFTej_lt8mi0ELWw,0,0,0,0,0,0
139684,0,0,0,0,0,0,0,0,0,0,1Eevry0X_8yb6yzsQilptg,0,0,0,0,0,0
309880,0,0,0,0,0,0,0,0,0,0,O9sjk93qlDw3wSX7QNqZvw,0,0,0,0,0,0
90932,0,0,0,0,0,0,0,0,0,0,zF-rTa_IwAoYnykjVfWKgw,0,0,0,0,0,0
99008,0,0,0,0,0,0,0,0,0,0,W2nDVYm6tumd1VPFhIK2Dg,0,0,0,0,0,0


In [83]:
# list(reviews)

In [114]:
base_cols = reviews_samp.iloc[:, 0:8]

In [115]:
base_cols.head(2)

Unnamed: 0,user_id,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful
311069,0trYkEZFTej_lt8mi0ELWw,YBk0SZTs42V8TVbv2vkQyw,0,hsRwhrj0zPJWm_5Z3ZemMw,0,4,2015-06-28,0
139684,1Eevry0X_8yb6yzsQilptg,PDnODS86pUXDOxLBe9Jqug,0,8UhKnPs2Jvz7vjzqYCTcWg,0,4,2011-11-06,0


In [116]:
# Caution: user_id repeats across reviews, so this merge is many-to-many and duplicates rows
samp_service = pd.merge(samp_service, base_cols, on='user_id', how='left')

In [117]:
samp_service.head()

Unnamed: 0,bottle service,customer service,food service,good service,great service,service excellent,service food,service friendly,service good,service great,...,30 minutes,friendly staff,wait staff,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,YBk0SZTs42V8TVbv2vkQyw,0,hsRwhrj0zPJWm_5Z3ZemMw,0,4,2015-06-28,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,PDnODS86pUXDOxLBe9Jqug,0,8UhKnPs2Jvz7vjzqYCTcWg,0,4,2011-11-06,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,KcYm-Ob_tW2oDVr_vuUhQg,1,j72BJLvldv1gw9mSpUhiDg,1,5,2015-05-17,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,biIKD9AzVneziCxU19N9bA,1,HyfFenprdpIA4rmKu6DW3g,0,5,2015-08-18,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,R-73c5XSlyYP2EGOTbUS4g,1,50nln4YPr0QDo_v1uBmHfg,1,4,2010-06-09,1


In [118]:
samp_service.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67584 entries, 0 to 67583
Data columns (total 24 columns):
bottle service       67584 non-null int64
customer service     67584 non-null int64
food service         67584 non-null int64
good service         67584 non-null int64
great service        67584 non-null int64
service excellent    67584 non-null int64
service food         67584 non-null int64
service friendly     67584 non-null int64
service good         67584 non-null int64
service great        67584 non-null int64
user_id              67584 non-null object
10 minutes           67584 non-null int64
15 minutes           67584 non-null int64
20 minutes           67584 non-null int64
30 minutes           67584 non-null int64
friendly staff       67584 non-null int64
wait staff           67584 non-null int64
review_id            67584 non-null object
votes.cool           67584 non-null int64
business_id          67584 non-null object
votes.funny          67584 non-null int64
stars 
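
Note that the merged frame has 67,584 rows although the sample had 32,240. Merging on user_id matches every review by a user with every other review by the same user. A minimal sketch on made-up data shows the effect and one alternative, joining on the shared index, which stays one row per review (the frame and names below are hypothetical):

```python
import pandas as pd

# Made-up sample: user 'u1' wrote two reviews.
sample = pd.DataFrame({
    'user_id':       ['u1', 'u1', 'u2'],
    'review_id':     ['r1', 'r2', 'r3'],
    'stars':         [4, 2, 5],
    'great service': [1, 0, 0],
}, index=[10, 42, 7])

phrases = sample[['user_id', 'great service']]
base = sample[['user_id', 'review_id', 'stars']]

# user_id repeats, so each of u1's rows matches both of u1's rows.
merged = pd.merge(phrases, base, on='user_id', how='left')

# Both pieces share the sample's index, so an index join stays one-to-one.
joined = phrases.join(base.drop(columns='user_id'))

print(len(sample), len(merged), len(joined))  # 3 5 3
```

Merging on review_id would also work, provided that column is carried into both frames.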

In [107]:
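# Columns of the sample that are not already in samp_service (name matches its use below)
reviews_samp_non_service = [x for x in reviews_samp.columns if x not in samp_service.columns]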

In [109]:
reviews_samp[reviews_samp_non_service].head(2)

Unnamed: 0,bar food,beer selection,best ve,bloody mary,chicken waffles,dance floor,decided try,definitely come,definitely recommend,didn want,...,saturday night,second time,staff friendly,super friendly,sweet potato,tasted like,time vegas,try place,ve seen,ve tried
311069,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
139684,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
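service_cols = [c for c in reviews.columns if 'service' in c]  # assumed definition; service_cols was not defined earlier
service_reviews = reviews[list(base_cols.columns) + service_cols]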

In [111]:
# list(reviews_samp[reviews_samp_non_service])

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. Identifying "elite" users

---

Yelp, though having their own formula for determining whether a user is elite or not, is interested in delving deeper into what differentiates an elite user from a normal user at a broader level.

Use a classification model to predict whether a user is elite or not. Note that users can be elite in some years and not in others.

1. What things predict well whether a user is elite or not?
- Validate the model.
- If you were to remove the "counts" metrics for users (reviews, votes, compliments), what distinguishes an elite user, if anything? Validate the model and compare it to the one with the count variables.
- Think of a way to visually represent your results in a compelling way.
- Give a brief write-up of your findings.
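
Since users can be elite in some years and not others, one simple starting point is an "ever elite" target. The sketch below assumes the `elite` column holds string-encoded lists of years (for example "[2012, 2013]" or "[]"); that format is an assumption and should be checked against the actual file. The `toy_users` frame is made up:

```python
import ast
import pandas as pd

# Assumption: `elite` holds string-encoded lists of years, e.g. "[2012, 2013]" or "[]".
toy_users = pd.DataFrame({
    'user_id':      ['u1', 'u2', 'u3'],
    'elite':        ['[2012, 2013]', '[]', '[2015]'],
    'review_count': [250, 4, 80],
})

# Parse the strings into real lists, then count the years.
elite_years = toy_users['elite'].map(ast.literal_eval)
toy_users['n_elite_years'] = elite_years.map(len)

# Binary target: 1 if the user was elite in at least one year.
toy_users['ever_elite'] = (toy_users['n_elite_years'] > 0).astype(int)

print(toy_users[['user_id', 'n_elite_years', 'ever_elite']])
```

The count of elite years is kept alongside the binary flag in case a finer-grained target turns out to be more useful.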


<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 4. Find something interesting on your own

---

You want to impress your superiors at Yelp by doing some investigation into the data on your own. You want to do classification, but you're not sure on what.

1. Create a hypothesis or hypotheses about the data based on whatever you are interested in, as long as it is predicting a category of some kind (classification).
2. Explore the data visually (ideally related to this hypothesis).
3. Build one or more classification models to predict your target variable. **Your modeling should include gridsearching to find optimal model parameters.**
4. Evaluate the performance of your model. Explain why your model may have chosen those specific parameters during the gridsearch process.
5. Write up what the model tells you. Does it validate or invalidate your hypothesis? Write this up as if for a non-technical audience.

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. ROC and Precision-recall

---

Some categories have fewer overall businesses than others. Choose two categories of businesses to predict, one that makes your proportion of target classes as even as possible, and another that has very few businesses and thus makes the target variable imbalanced.

1. Create two classification models predicting these categories. Optimize the models and choose variables as you see fit.
- Make confusion matrices for your models. Describe the confusion matrices and explain what they tell you about your models' performance.
- Make ROC curves for both models. What do the ROC curves describe and what do they tell you about your model?
- Make Precision-Recall curves for the models. What do they describe? How do they compare to the ROC curves?
- Explain when Precision-Recall may be preferable to ROC. Is that the case in either of your models?
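
As a starting point for the imbalanced case, the sketch below computes both curves and their summary scores with scikit-learn on synthetic data with roughly 5% positives. The data and model are illustrative assumptions, not part of the project files:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target (roughly 5% positives); not the Yelp data.
rng = np.random.RandomState(0)
n = 4000
y_rare = (rng.rand(n) < 0.05).astype(int)
X_rare = rng.normal(size=(n, 2)) + 1.2 * y_rare[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_rare, y_rare, test_size=0.33, stratify=y_rare, random_state=0)

# Predicted probability of the positive class on the held-out rows.
scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Points along each curve, one per decision threshold.
fpr, tpr, _ = roc_curve(y_te, scores)
precision, recall, _ = precision_recall_curve(y_te, scores)

roc_auc = roc_auc_score(y_te, scores)
avg_precision = average_precision_score(y_te, scores)
positive_rate = y_te.mean()  # precision of a classifier that flags everything

print(roc_auc, avg_precision, positive_rate)
```

The false positive rate in a ROC curve is computed over the many negatives, so it can stay low even when most flagged rows are wrong. Precision is computed over the flagged rows only, which is why the precision-recall view tends to be more informative when positives are rare. The arrays `fpr`, `tpr`, `precision`, and `recall` can be passed directly to `plt.plot`.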