<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 4

## Help Yelp

---

In this project you will be investigating a small version of the [Yelp challenge dataset](https://www.yelp.com/dataset_challenge). You'll practice using classification algorithms, cross-validation, gridsearching – all that good stuff.



---

### The data

The data comes as 5 individual .csv files, zipped into .7z format as with the SF data in the last project. The dataset is located in your datasets folder:

    DSI-SF-2/datasets/yelp_arizona_data.7z

The columns in each are:

    businesses_small_parsed.csv
        business_id: unique business identifier
        name: name of the business
        review_count: number of reviews per business
        city: city business resides in
        stars: average rating
        categories: categories the business falls into (can be one or multiple)
        latitude
        longitude
        neighborhoods: neighborhoods business belongs to
        variable: "property" of the business (a tag)
        value: True/False for the property
        
    reviews_small_nlp_parsed.csv
        user_id: unique user identifier
        review_id: unique review identifier
        votes.cool: how many thought the review was "cool"
        business_id: unique business id the review is for
        votes.funny: how many thought the review was funny
        stars: rating given
        date: date of review
        votes.useful: how many thought the review was useful
        ... 100 columns: counts, within this review, of the most common 2-word phrases across all reviews
        
    users_small_parsed.csv
        yelping_since: signup date
        compliments.plain: # of compliments "plain"
        review_count: # of reviews
        compliments.cute: total # of compliments "cute"
        compliments.writer: # of compliments "writer"
        compliments.note: # of compliments "note" (not sure what this is)
        compliments.hot: # of compliments "hot" (?)
        compliments.cool: # of compliments "cool"
        compliments.profile: # of compliments "profile"
        average_stars: average rating
        compliments.more: # of compliments "more"
        elite: years considered "elite"
        name: user's name
        user_id: unique user id
        votes.cool: # of votes "cool"
        compliments.list: # of compliments "list"
        votes.funny: # of votes "funny"
        compliments.photos: # of compliments "photos"
        compliments.funny: # of compliments "funny"
        votes.useful: # of votes "useful"
       
    checkins_small_parsed.csv
        business_id: unique business identifier
        variable: day-time identifier of checkins (0-0 is Sunday 0:00 - 1:00am,  for example)
        value: # of checkins at that time
    
    tips_small_nlp_parsed.csv
        user_id: unique user identifier
        business_id: unique business identifier
        likes: likes that the tip has
        date: date of tip
        ... 100 columns: counts, within this tip, of the most common 2-word phrases across all tips

The reviews and tips datasets in particular have parsed "NLP" columns with counts of 2-word phrases in that review or tip (a "tip", it seems, is some kind of smaller review).

The user dataset has a lot of columns of counts of different compliments and votes. I'm not sure whether the compliments or votes are _by_ the user or _for_ the user.

---

If you look at the website, or the full data, you'll see I have removed pieces of the data and cut it down quite a bit. This is to simplify it for this project. Specifically, businesses are limited to these cities:

    Phoenix
    Surprise
    Las Vegas
    Waterloo

Apparently there is a city called "Surprise" in Arizona. 

Businesses are also restricted to be in at least one of the following categories, because I thought the mix of them was funny:

    Airports
    Breakfast & Brunch
    Bubble Tea
    Burgers
    Bars
    Bakeries
    Breweries
    Cafes
    Candy Stores
    Comedy Clubs
    Courthouses
    Dance Clubs
    Fast Food
    Museums
    Tattoo
    Vape Shops
    Yoga
    
---

### Project requirements

**You will be performing 4 different sections of analysis, like in the last project.**

Remember that classification targets are categorical and regression targets are continuous variables.

In [1]:
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV

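# Cross-validation and grid-search utilities live in sklearn.model_selection (scikit-learn >= 0.18)
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import datasets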
from sklearn.metrics import confusion_matrix

# Load graph packages
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-white')
%matplotlib inline

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Constructing a "profile" for Las Vegas

---

Yelp is interested in building out what they are calling "profiles" for cities. They want you to start with just Las Vegas to see what a prototype of this would look like. Essentially, they want to know what makes Las Vegas distinct from the other three cities in the data.

Use the data you have to predict Las Vegas from the other variables you have. You should not be predicting the city from any kind of location data or other data perfectly associated with that city (or another city).

You may use any classification algorithm you deem appropriate, or even multiple models. You should:

1. Build at least one model predicting Las Vegas vs. the other cities.
2. Validate your model(s).
3. Interpret and visualize, in some way, the results.
4. Write up a "profile" for Las Vegas. This should be a writeup converting your findings from the model(s) into a human-readable description of the city.
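
The first two steps could be sketched roughly as follows. This is only an illustration under stated assumptions, not the required solution: the toy DataFrame and its values are synthetic, the feature columns are picked arbitrarily (`has_full_bar` is a made-up name), and `is_vegas` is a hypothetical name for the binary target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for a one-row-per-business table (all values are synthetic).
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    'city': rng.choice(['Las Vegas', 'Phoenix', 'Surprise', 'Waterloo'], size=n),
    'review_count': rng.poisson(30, size=n),
    'stars': rng.choice([2.5, 3.0, 3.5, 4.0, 4.5], size=n),
    'has_full_bar': rng.randint(0, 2, size=n),  # hypothetical 0/1 feature
})

# Binary target: 1 for Las Vegas, 0 for every other city.
is_vegas = (toy['city'] == 'Las Vegas').astype(int)

# Leave out the city itself and any location columns so the target does not leak.
X = toy[['review_count', 'stars', 'has_full_bar']]

# Scale the features, then fit a logistic regression inside each CV fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, is_vegas, cv=cv, scoring='accuracy')

# Accuracy only means something relative to the majority-class baseline.
baseline = max(is_vegas.mean(), 1 - is_vegas.mean())
print('CV accuracy: %.3f, baseline: %.3f' % (scores.mean(), baseline))
```

Because the toy features are random noise, the cross-validated accuracy here should hover around the baseline; on the real data, a gap between the two is what suggests the features carry signal about the city.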

In [None]:
#questions....
#knn vs logistic regression
#pivot vs dummies?
#list data in categories and neighborhoods --> best way to handle this
#change city into 0 not Las Vegas / 1 Las Vegas --> but don't I lose some degree of info when I do this?
#should I merge other datasets? on business_id, then even user_id, etc??
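
On the question of how to handle the list-valued columns: one possible approach (a sketch, not the only option) is to parse each string back into a Python list and expand it into one indicator column per category. The three toy rows below are made up to mimic the format of the `categories` column, and `category_dummies` is a hypothetical name.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows mimicking the string-encoded lists in the `categories` column.
toy = pd.DataFrame({
    'business_id': ['a', 'b', 'c'],
    'categories': ["['Bars', 'Nightlife', 'Dance Clubs']",
                   "['Burgers', 'Fast Food', 'Restaurants']",
                   "['Bars', 'Nightlife']"],
})

# read_csv loads each list as a plain string, so parse it back into a list.
parsed = toy['categories'].apply(ast.literal_eval)

# One 0/1 column per category; a business can have several 1s in its row.
mlb = MultiLabelBinarizer()
category_dummies = pd.DataFrame(mlb.fit_transform(parsed),
                                columns=mlb.classes_,
                                index=toy['business_id'])
print(category_dummies)
```

Unlike `pd.get_dummies` on the raw string, this treats `['Bars', 'Nightlife']` and `['Bars', 'Nightlife', 'Dance Clubs']` as sharing two categories rather than as two unrelated values.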

In [3]:
business = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/businesses_small_parsed.csv')

In [6]:
business.head(2)

Unnamed: 0,business_id,name,review_count,city,stars,categories,latitude,longitude,neighborhoods,variable,value
0,EmzaQR5hQlF0WIl24NxAZA,Sky Lounge,25,Phoenix,2.5,"['American (New)', 'Nightlife', 'Dance Clubs',...",33.448399,-112.071702,[],attributes.Ambience.divey,False
1,SiwN7f0N4bs4ZtPc4yPgiA,Palazzo,19,Phoenix,3.0,"['Bars', 'Nightlife', 'Dance Clubs']",33.455885,-112.074177,[],attributes.Ambience.divey,False


In [27]:
business.value.value_counts()

False          70687
True           32400
00:00           5560
11:00           3212
casual          2577
10:00           2496
22:00           2439
1.0             1932
average         1892
07:00           1694
2.0             1679
21:00           1631
06:00           1576
02:00           1572
no              1556
full_bar        1470
none            1399
23:00           1397
09:00           1165
free             983
17:00            894
20:00            886
15:00            750
08:00            726
16:00            710
19:00            684
12:00            677
01:00            653
10:30            624
18:00            569
               ...  
15:30             62
23:30             51
18:30             44
16:30             44
4.0               38
02:30             36
paid              31
yes_corkage       24
12:30             24
13:30             21
04:30             20
23:59             13
00:30             12
03:30              6
18plus             6
01:20              6
05:45        

In [33]:
# Seems like few neighborhoods for 4 cities. It also doesn't make sense to use this in the model,
# since "The Strip" obviously identifies Las Vegas.
business.neighborhoods.value_counts()

[]                                              71260
['The Strip']                                   18489
['Westside']                                     9542
['Southeast']                                    9435
['Eastside']                                     7649
['Spring Valley']                                7067
['Downtown']                                     6463
['Southwest']                                    3401
['Northwest']                                    3005
['Centennial']                                   2755
['Chinatown']                                    2597
['Sunrise']                                      2168
['Summerlin']                                    1976
['University']                                   1149
['Eastside', 'The Strip']                        1034
['South Summerlin']                               996
['Southwest', 'Spring Valley']                    750
['Eastside', 'Southeast']                         644
['Eastside', 'University']  

In [12]:
business.variable.value_counts()

open                                           4132
attributes.Accepts Credit Cards                3896
attributes.Price Range                         3843
attributes.Parking.valet                       3427
attributes.Parking.street                      3427
attributes.Parking.validated                   3427
attributes.Parking.garage                      3427
attributes.Parking.lot                         3427
attributes.Good For Groups                     3362
attributes.Outdoor Seating                     3267
attributes.Has TV                              3084
attributes.Alcohol                             3050
attributes.Good for Kids                       2909
attributes.Ambience.trendy                     2873
attributes.Ambience.touristy                   2873
attributes.Ambience.classy                     2873
attributes.Ambience.romantic                   2873
attributes.Ambience.intimate                   2873
attributes.Ambience.casual                     2873
attributes.A

In [15]:
business.categories.value_counts()

['Burgers', 'Fast Food', 'Restaurants']                                                                         9522
['Fast Food', 'Restaurants']                                                                                    5752
['Fast Food', 'Sandwiches', 'Restaurants']                                                                      5616
['Burgers', 'Restaurants']                                                                                      4817
['Fast Food', 'Mexican', 'Restaurants']                                                                         4726
['Bars', 'Nightlife', 'Lounges']                                                                                3901
['Breakfast & Brunch', 'American (Traditional)', 'Restaurants']                                                 3595
['Bars', 'Nightlife']                                                                                           3030
['Breakfast & Brunch', 'Restaurants']                           

In [20]:
#is there a benefit over pivot vs dummy variables - with pivot I have the additional step of creating 1s and 0s

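def select_item_or_nan(x):
    # Each (business_id, variable) group should hold a single value; take the first.
    x = x.iloc[0]
    # Missing or empty values become NaN (pd.isnull guards against calling len() on a float NaN).
    if pd.isnull(x) or len(x) == 0:
        return np.nan
    else:
        return x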

biz_wide = pd.pivot_table(business, columns=['variable'], values='value',
                            index=['business_id'], aggfunc=select_item_or_nan,
                            fill_value=np.nan)
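
On the pivot-versus-dummies question, one way to look at it is that the two steps do different jobs: the pivot reshapes the long `variable`/`value` rows into one row per business, and dummy coding is still needed afterwards for any attribute with more than two levels. The sketch below uses a small synthetic long-format table; the values are made up, and `features` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy long-format rows in the same shape as the business file (synthetic values).
long_df = pd.DataFrame({
    'business_id': ['a', 'a', 'b', 'b', 'c'],
    'variable': ['attributes.Has TV', 'attributes.Alcohol',
                 'attributes.Has TV', 'attributes.Alcohol',
                 'attributes.Alcohol'],
    'value': ['True', 'full_bar', 'False', 'none', 'beer_and_wine'],
})

# Long -> wide: one row per business, one column per attribute.
wide = long_df.pivot(index='business_id', columns='variable', values='value')

# True/False strings map directly to 1/0; businesses missing the attribute stay NaN.
wide['attributes.Has TV'] = wide['attributes.Has TV'].map({'True': 1, 'False': 0})

# Multi-level attributes still need dummy columns after the pivot.
alcohol_dummies = pd.get_dummies(wide['attributes.Alcohol'], prefix='alcohol')
features = pd.concat([wide[['attributes.Has TV']], alcohol_dummies], axis=1)
print(features)
```

The remaining NaN values (business `c` has no `Has TV` row) still need a decision before modeling, for example imputing them or adding a separate "missing" indicator.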

In [19]:
biz_wide.head()

business_id,attributes.Accepts Credit Cards,attributes.Accepts Insurance,attributes.Ages Allowed,attributes.Alcohol,attributes.Ambience.casual,attributes.Ambience.classy,attributes.Ambience.divey,attributes.Ambience.hipster,attributes.Ambience.intimate,attributes.Ambience.romantic,...,hours.Saturday.open,hours.Sunday.close,hours.Sunday.open,hours.Thursday.close,hours.Thursday.open,hours.Tuesday.close,hours.Tuesday.open,hours.Wednesday.close,hours.Wednesday.open,open
--jFTZmywe7StuZ2hEjxyA,True,,,none,,,,,,,...,,,,,,,,,,True
-0HGqwlfw3I8nkJyMHxAsQ,True,,,none,,,,,,,...,,,,,,,,,,True
-0VK5Z1BfUHUYq4PoBYNLw,True,,,full_bar,True,False,False,False,False,False,...,,,,,,,,,,True
-0bUDim5OGuv8R0Qqq6J4A,True,,,,,,,,,,...,,,,,,,,,,False
-1bOb2izeJBZjHC7NWxiPA,True,,,none,True,False,False,False,False,False,...,06:30,14:30,06:30,14:30,06:30,14:30,06:30,14:30,06:30,True


In [25]:
biz_wide['attributes.Alcohol'].value_counts()

full_bar         1470
none             1399
beer_and_wine     181
Name: attributes.Alcohol, dtype: int64

In [26]:
biz_wide['hours.Sunday.close'].value_counts()

00:00    488
22:00    302
02:00    189
21:00    173
23:00    171
15:00     92
20:00     89
01:00     82
19:00     76
18:00     76
14:00     56
17:00     51
16:00     42
03:00     41
04:00     36
06:00     34
21:30     23
22:30     20
14:30     19
05:00     14
12:00     14
09:00     11
05:30     10
13:00     10
20:30      9
07:00      7
18:30      7
23:30      6
01:30      6
13:30      5
02:30      5
17:30      5
08:00      3
19:30      3
11:30      2
10:00      2
15:30      2
12:30      2
23:59      2
19:15      1
08:30      1
01:15      1
03:45      1
16:30      1
11:00      1
10:30      1
16:45      1
22:15      1
Name: hours.Sunday.close, dtype: int64

In [7]:
reviews = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/reviews_small_nlp_parsed.csv')

In [8]:
reviews.head(2)

Unnamed: 0,user_id,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful,10 minutes,15 minutes,...,service great,staff friendly,super friendly,sweet potato,tasted like,time vegas,try place,ve seen,ve tried,wait staff
0,o_LCYay4uo5N4eq3U5pbrQ,biEOCicjWlibF26pNLvhcw,0,EmzaQR5hQlF0WIl24NxAZA,0,3,2007-09-14,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,sEWeeq41k4ohBz4jS_iGRw,tOhOHUAS7XJch7a_HW5Csw,3,EmzaQR5hQlF0WIl24NxAZA,12,2,2008-04-21,3,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
users = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/users_small_parsed.csv')

In [31]:
users.head(2)

Unnamed: 0,yelping_since,compliments.plain,review_count,compliments.cute,compliments.writer,fans,compliments.note,compliments.hot,compliments.cool,compliments.profile,...,compliments.more,elite,name,user_id,votes.cool,compliments.list,votes.funny,compliments.photos,compliments.funny,votes.useful
0,2004-10,959.0,1274,206.0,327.0,1179,611.0,1094.0,1642.0,116.0,...,134.0,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201...",Jeremy,rpOyqD_893cqmDAtJLbdog,11093,38.0,7681,330.0,580.0,14199
1,2004-10,89.0,442,23.0,24.0,100,83.0,101.0,145.0,9.0,...,19.0,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201...",Michael,4U9kSBLuBDU391x6bxU-YA,732,4.0,908,24.0,120.0,1483


In [32]:
checkins = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/checkins_small_parsed.csv')

In [34]:
checkins.head(2)

Unnamed: 0,business_id,variable,value
0,SG_gEmEXL4ID6RAEinC5Bg,checkin_info.9-0,1.0
1,45puCRQ6Vh_IIAy7kkfFDQ,checkin_info.9-0,1.0


In [35]:
tips = pd.read_csv('/Users/VanessaG/desktop/DSI-SF-2-vnessified/datasets/yelp_arizona_data/tips_small_nlp_parsed.csv')

In [36]:
tips.head(2)

Unnamed: 0,user_id,business_id,likes,date,24 hours,amazing food,animal style,awesome food,awesome place,awesome service,...,service good,service great,slow service,staff friendly,staff great,steak eggs,super friendly,sweet potato,velvet pancakes,worth wait
0,trdsekNRD-gIs50EBrScwA,EmzaQR5hQlF0WIl24NxAZA,0,2012-02-27,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,liIQCDzDTnvXc7X8twBIjg,EmzaQR5hQlF0WIl24NxAZA,0,2013-04-01,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Different categories of ratings

---

Yelp is finally ready to admit that their rating system sucks. No one cares about the ratings, they just use the site to find out what's nearby. The ratings are simply too unreliable for people. 

Yelp hypothesizes that this is, in fact, because different people tend to give their ratings based on different things. They believe that perhaps some people always base their ratings on quality of food, others on service, and perhaps other categories as well. 

1. Do some users tend to talk about service more than others in reviews/tips? Divide up the tips/reviews into more "service-focused" ones and those less concerned with service.
2. Create two new ratings for businesses: ratings from just the service-focused reviews and ratings from the non-service reviews.
3. Construct a regression model for each of the two ratings. They should use the same predictor variables (of your choice). 
4. Validate the performance of the models.
5. Do the models' coefficients differ at all? What does this tell you about the hypothesis that there are in fact two different kinds of ratings?
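
Steps 1 and 2 might be approached along these lines. This is a rough sketch under assumptions: the toy reviews are synthetic, and treating the three phrase columns in `service_cols` as a proxy for "service-focused" is one arbitrary choice among many; `review_type`, `service_rating`, and `other_rating` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy reviews; the phrase columns reuse names seen in the reviews file,
# but every value here is synthetic.
toy = pd.DataFrame({
    'business_id': ['a', 'a', 'a', 'b', 'b'],
    'stars': [5, 2, 4, 3, 1],
    'service great': [1, 0, 0, 0, 0],
    'staff friendly': [0, 0, 0, 1, 0],
    'wait staff': [0, 1, 0, 0, 0],
    'sweet potato': [0, 0, 2, 0, 1],
})

# Assumption: any mention of these phrases marks a review as service-focused.
service_cols = ['service great', 'staff friendly', 'wait staff']
is_service = toy[service_cols].sum(axis=1) > 0
toy['review_type'] = np.where(is_service, 'service_rating', 'other_rating')

# Mean stars per business, computed separately for each review type.
ratings = (toy.groupby(['business_id', 'review_type'])['stars']
              .mean()
              .unstack('review_type'))
print(ratings)
```

The two resulting columns can then serve as the targets of the two regression models, each fit on the same set of business-level predictors.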

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. Identifying "elite" users

---

Yelp, though having their own formula for determining whether a user is elite or not, is interested in delving deeper into what differentiates an elite user from a normal user at a broader level.

Use a classification model to predict whether a user is elite or not. Note that users can be elite in some years and not in others.

1. What things predict well whether a user is elite or not?
2. Validate the model.
3. If you were to remove the "counts" metrics for users (reviews, votes, compliments), what distinguishes an elite user, if anything? Validate the model and compare it to the one with the count variables.
4. Think of a way to visually represent your results in a compelling way.
5. Give a brief write-up of your findings.
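
A possible starting point for the target and the with/without-counts comparison is sketched below. Everything in the toy table is synthetic, including the rule that ties elite status to `review_count`, so the resulting numbers say nothing about real Yelp users; `is_elite` and `cv_accuracy` are hypothetical names, and "elite in at least one year" is just one way to define the target.

```python
import ast

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic users; only the column names follow the users file.
rng = np.random.RandomState(1)
n = 300
toy = pd.DataFrame({
    'review_count': rng.poisson(40, size=n),
    'fans': rng.poisson(3, size=n),
    'average_stars': rng.uniform(1, 5, size=n).round(2),
})

# Fake `elite` strings: in this toy data, heavy reviewers get elite years.
is_heavy = toy['review_count'] > 45
toy['elite'] = np.where(is_heavy, '[2012, 2013]', '[]')

# Parse the string-encoded list of years; elite = at least one elite year.
toy['is_elite'] = toy['elite'].apply(lambda s: len(ast.literal_eval(s)) > 0).astype(int)

count_cols = ['review_count', 'fans']
other_cols = ['average_stars']


def cv_accuracy(cols):
    """Mean 5-fold CV accuracy of a scaled logistic regression on `cols`."""
    model = make_pipeline(StandardScaler(), LogisticRegression())
    return cross_val_score(model, toy[cols], toy['is_elite'], cv=5).mean()


acc_with_counts = cv_accuracy(count_cols + other_cols)
acc_without_counts = cv_accuracy(other_cols)
print('with counts: %.3f, without counts: %.3f' % (acc_with_counts, acc_without_counts))
```

Since users can be elite in some years and not others, an alternative target would be one row per user-year, which this sketch does not attempt.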


<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 4. Find something interesting on your own

---

You want to impress your superiors at Yelp by doing some investigation into the data on your own. You want to do classification, but you're not sure on what.

1. Create a hypothesis or hypotheses about the data based on whatever you are interested in, as long as it is predicting a category of some kind (classification).
2. Explore the data visually (ideally related to this hypothesis).
3. Build one or more classification models to predict your target variable. **Your modeling should include gridsearching to find optimal model parameters.**
4. Evaluate the performance of your model. Explain why your model may have chosen those specific parameters during the gridsearch process.
5. Write up what the model tells you. Does it validate or invalidate your hypothesis? Write this up as if for a non-technical audience.
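
The gridsearching requirement in step 3 could look something like the sketch below. The data comes from `make_classification` purely as a stand-in for whatever target you choose, and the choice of KNN and the parameter grid are arbitrary illustrations rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and a binary target.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'knn__n_neighbors': [3, 5, 11, 21],
    'knn__weights': ['uniform', 'distance'],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(search.best_params_)
print('held-out accuracy: %.3f' % search.score(X_test, y_test))
```

Keeping a test split that the grid search never sees gives a performance estimate that is not biased by the parameter selection itself.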

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. ROC and Precision-recall

---

Some categories have fewer overall businesses than others. Choose two categories of businesses to predict: one that makes your proportion of target classes as even as possible, and another that has very few businesses and thus makes the target variable imbalanced.

1. Create two classification models predicting these categories. Optimize the models and choose variables as you see fit.
2. Make confusion matrices for your models. Describe the confusion matrices and explain what they tell you about your models' performance.
3. Make ROC curves for both models. What do the ROC curves describe and what do they tell you about your model?
4. Make Precision-Recall curves for the models. What do they describe? How do they compare to the ROC curves?
5. Explain when Precision-Recall may be preferable to ROC. Is that the case in either of your models?
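
For the imbalanced case, the confusion matrix and the points on both curves can be computed roughly as below. This is a sketch on synthetic data from `make_classification` with about 5% positives, standing in for a rare business category; plotting is left out and would amount to drawing `tpr` against `fpr` and `precision` against `recall`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_recall_curve, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 5% of rows belong to the positive class.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Rows are true classes, columns are predicted classes (at the default 0.5 threshold).
cm = confusion_matrix(y_test, model.predict(X_test))

# Each curve sweeps the decision threshold over the predicted probabilities.
fpr, tpr, _ = roc_curve(y_test, probs)
precision, recall, _ = precision_recall_curve(y_test, probs)

roc_auc = roc_auc_score(y_test, probs)
avg_precision = average_precision_score(y_test, probs)

print(cm)
print('ROC AUC: %.3f, average precision: %.3f' % (roc_auc, avg_precision))
```

The false positive rate is divided by the size of the large negative class, so an ROC curve can look reassuring even when many of the predicted positives are wrong; precision is computed only over predicted positives, which is why the precision-recall view tends to be more informative when the target is rare.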