<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 4

###  Big Query, SQL, Classification

---

### The Data

There are 5 individual tables that have the information, contained in a Google BigQuery database.  The setup info for BigQuery is located on our DSI wiki.  You will have to query with SQL, the dataset in order to complete this project.

The tables, with cooresonding attributes that exist are:

### businesses
- business_id: unique business identifier
- name: name of the business
- review_count: number of reviews per business
- city: city business resides in
- stars: average rating
- categories: categories the business falls into (can be one or multiple)
- latitude
- longitude
- neighborhoods: neighborhoods business belongs to
- variable: "property" of the business (a tag)
- value: True/False for the property

### reviews
- user_id: unique user identifier
- review_id: unique review identifier
- votes.cool: how many thought the review was "cool"
- business_id: unique business id the review is for
- votes.funny: how many thought the review was funny
- stars: rating given
- date: date of review
- votes.useful: how many thought the review was useful
- ... 100 columns of counts of most common 2 word phrases that appear in reviews in this review

### users
- yelping_since: signup date
- compliments.plain: # of compliments "plain"
- review_count: # of reviews:
- compliments.cute: total # of compliments "cute"
- compliments.writer: # of compliments "writer"
- compliments.note: # of compliments "note" (not sure what this is)
- compliments.hot: # of compliments "hot" (?)
- compliments.cool: # of compliments "cool"
- compliments.profile: # of compliments "profile"
- average_stars: average rating
- compliments.more: # of compliments "more"
- elite: years considered "elite"
- name: user's name
- user_id: unique user id
- votes.cool: # of votes "cool"
- compliments.list: # of compliments "list"
- votes.funny: # of compliments "funny"
- compliments.photos: # of compliments "photos"
- compliments.funny: # of compliments "funny"
- votes.useful: # of votes "useful"

### checkins
- business_id: unique business identifier
- variable: day-time identifier of checkins (0-0 is Sunday 0:00 - 1:00am,  for example)
- value: # of checkins at that time

### tips
- user_id: unique user identifier
- business_id: unique business identifier
- likes: likes that the tip has
- date: date of tip
- ... 100 columns of counts of most common 2 word phrases that appear in tips in this tip


The reviews and tips datasets in particular have parsed "NLP" columns with counts of 2-word phrases in that review or tip (a "tip", it seems, is some kind of smaller review).

The user dataset has a lot of columns of counts of different compliments and votes. We're not sure whether the compliments or votes are by the user or for the user.

Full details about this dataset area located here:
https://bigquery.cloud.google.com/dataset/bigquery-dsi-dave:yelp_arizona

---


If you look at the website, or the full data, you'll see I have removed pieces of the data and cut it down quite a bit. This is to simplify it for this project. Specifically, business are limited to be in these cities:

- Phoenix
- Surprise
- Las Vegas
- Waterloo

Apparently there is a city called "Surprise" in Arizona. 

Businesses are also restricted to at least be in one of the following categories, because we thought the mix of them was funny:

- Airports
- Breakfast & Brunch
- Bubble Tea
- Burgers
- Bars
- Bakeries
- Breweries
- Cafes
- Candy Stores
- Comedy Clubs
- Courthouses
- Dance Clubs
- Fast Food
- Museums
- Tattoo
- Vape Shops
- Yoga
    
---

### Project requirements

**You will be performing 4 different sections of analysis, like in the last project.**

Remember that classification targets are categorical and regression targets are continuous variables.

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Load your dataset(s) / setup / configure GBQ connection

---

Information about this dataset is located here:


**If you haven't done so, setup a project with the Google developer portal, following the directions here: [Getting Started with BigQuery](https://github.com/ga-students/DSI-SF-4/wiki/Getting-Started-with-BigQuery)**

In [83]:
import pandas as pd
import seaborn as sns

%matplotlib inline

project_id = "bigquery-dsi-vinnie"

sql = """
SELECT * FROM [bigquery-dsi-dave:yelp_arizona.reviews] 
LIMIT 2500
"""

reviews = pd.read_gbq(sql, project_id=project_id)

project_id = "bigquery-dsi-vinnie"

sql = """
SELECT * FROM [bigquery-dsi-dave:yelp_arizona.businesses] 
LIMIT 2500
"""

business = pd.read_gbq(sql, project_id=project_id)

project_id = "bigquery-dsi-vinnie"

sql = """
SELECT * FROM [bigquery-dsi-dave:yelp_arizona.users] 
LIMIT 2500
"""

users = pd.read_gbq(sql, project_id=project_id)

project_id = "bigquery-dsi-vinnie"

sql = """
SELECT * FROM [bigquery-dsi-dave:yelp_arizona.tips] 
LIMIT 2500
"""

tips = pd.read_gbq(sql, project_id=project_id)

project_id = "bigquery-dsi-vinnie"

sql = """
SELECT * FROM [bigquery-dsi-dave:yelp_arizona.checkins] 
LIMIT 2500
"""

checkins = pd.read_gbq(sql, project_id=project_id)

Requesting query... ok.
Query running...
Query done.
Cache hit.

Retrieving results...
  Got page: 1; 100.0% done. Elapsed 8.33 s.
Got 2500 rows.

Total time taken 9.08 s.
Finished at 2016-12-19 17:55:56.
Requesting query... ok.
Query running...
Query done.
Cache hit.

Retrieving results...
Got 2500 rows.

Total time taken 1.33 s.
Finished at 2016-12-19 17:55:57.
Requesting query... ok.
Query running...
Query done.
Cache hit.

Retrieving results...
Got 2500 rows.

Total time taken 2.13 s.
Finished at 2016-12-19 17:56:00.
Requesting query... ok.
Query running...
Query done.
Cache hit.

Retrieving results...
Got 2500 rows.

Total time taken 5.13 s.
Finished at 2016-12-19 17:56:05.
Requesting query... ok.
Query running...
Query done.
Cache hit.

Retrieving results...
Got 2500 rows.

Total time taken 1.02 s.
Finished at 2016-12-19 17:56:06.


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Constructing a "profile" for Las Vegas

---

Yelp is interested in building out what they are calling "profiles" for cities. They want you to start with just Las Vegas to see what a prototype of this would look like. Essentially, they want to know what makes Las Vegas distinct from the other four.

Use the data you have to predict Las Vegas from the other variables you have. You should not be predicting the city from any kind of location data or other data perfectly associated with that city (or another city).

You may use any classification algorithm you deem appropriate, or even multiple models. You should:

1. Build at least one model predicting Las Vegas vs. the other cities.
- Validate your model(s).
- Interpret and visualize, in some way, the results.
- Write up a "profile" for Las Vegas. This should be a writeup converting your findings from the model(s) into a human-readable description of the city.

*Research location data to find the city targets.*

In [87]:
business.head()

Unnamed: 0,business_id,name,review_count,city,stars,categories,latitude,longitude,neighborhoods,variable,value
0,--jFTZmywe7StuZ2hEjxyA,Subway,7,Las Vegas,3.5,"['Fast Food', 'Sandwiches', 'Restaurants']",36.118819,-115.182005,[],attributes.Takes Reservations,False
1,--jFTZmywe7StuZ2hEjxyA,Subway,7,Las Vegas,3.5,"['Fast Food', 'Sandwiches', 'Restaurants']",36.118819,-115.182005,[],attributes.Good For.dessert,False
2,--jFTZmywe7StuZ2hEjxyA,Subway,7,Las Vegas,3.5,"['Fast Food', 'Sandwiches', 'Restaurants']",36.118819,-115.182005,[],attributes.Take-out,True
3,--jFTZmywe7StuZ2hEjxyA,Subway,7,Las Vegas,3.5,"['Fast Food', 'Sandwiches', 'Restaurants']",36.118819,-115.182005,[],attributes.Has TV,False
4,--jFTZmywe7StuZ2hEjxyA,Subway,7,Las Vegas,3.5,"['Fast Food', 'Sandwiches', 'Restaurants']",36.118819,-115.182005,[],attributes.Good For.breakfast,True


In [88]:
business.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 11 columns):
business_id      2500 non-null object
name             2500 non-null object
review_count     2500 non-null int64
city             2500 non-null object
stars            2500 non-null float64
categories       2500 non-null object
latitude         2500 non-null float64
longitude        2500 non-null float64
neighborhoods    2500 non-null object
variable         2500 non-null object
value            2500 non-null bool
dtypes: bool(1), float64(3), int64(1), object(6)
memory usage: 197.8+ KB


In [90]:
#converting city to binary

business['city'] = business['city'].map(lambda x: 0 if x == 'Las Vegas' else 1)

In [91]:
business.city.value_counts()

1    2141
0     359
Name: city, dtype: int64

In [94]:
business.categories.value_counts()

['Burgers', 'Fast Food', 'Restaurants']                                                          194
['Breakfast & Brunch', 'American (Traditional)', 'Restaurants']                                  138
['Wine Bars', 'Bars', 'American (New)', 'Nightlife', 'Restaurants']                              120
['Fast Food', 'Restaurants']                                                                     117
['Fast Food', 'Mexican', 'Restaurants']                                                           95
['American (Traditional)', 'Fast Food', 'Restaurants']                                            89
['Fast Food', 'Sandwiches', 'Restaurants']                                                        79
['Burgers', 'Restaurants']                                                                        75
['Breakfast & Brunch', 'Restaurants']                                                             73
['Delis', 'Fast Food', 'Sandwiches', 'Restaurants']                                        

In [99]:
#creating binary values
business['cat_lists'] = business.categories.map(lambda x: eval(x))

unique_categories = []
for catlist in business.cat_lists.values:
    unique_categories.extend(catlist)
    
unique_categories = np.unique(unique_categories)
unique_categories[:10]

array(['Active Life', 'American (New)', 'American (Traditional)',
       'Arts & Entertainment', 'Bakeries', 'Barbeque', 'Bars',
       'Beer, Wine & Spirits', 'Breakfast & Brunch', 'Bubble Tea'], 
      dtype='|S22')

In [128]:
new_cat = pd.DataFrame()

for uc in unique_categories:
    new_cat[uc] = business.cat_lists.map(lambda x: 1 if uc in x else 0)

In [129]:
new_cat.head()

Unnamed: 0,Active Life,American (New),American (Traditional),Arts & Entertainment,Bakeries,Barbeque,Bars,"Beer, Wine & Spirits",Breakfast & Brunch,Bubble Tea,...,Shopping,Soul Food,Specialty Food,Sports Bars,Steakhouses,Tea Rooms,Tex-Mex,Vape Shops,Wine Bars,Yoga
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [130]:
business.categories.value_counts()

['Burgers', 'Fast Food', 'Restaurants']                                                          194
['Breakfast & Brunch', 'American (Traditional)', 'Restaurants']                                  138
['Wine Bars', 'Bars', 'American (New)', 'Nightlife', 'Restaurants']                              120
['Fast Food', 'Restaurants']                                                                     117
['Fast Food', 'Mexican', 'Restaurants']                                                           95
['American (Traditional)', 'Fast Food', 'Restaurants']                                            89
['Fast Food', 'Sandwiches', 'Restaurants']                                                        79
['Burgers', 'Restaurants']                                                                        75
['Breakfast & Brunch', 'Restaurants']                                                             73
['Delis', 'Fast Food', 'Sandwiches', 'Restaurants']                                        

In [131]:
new_cat.head()

Unnamed: 0,Active Life,American (New),American (Traditional),Arts & Entertainment,Bakeries,Barbeque,Bars,"Beer, Wine & Spirits",Breakfast & Brunch,Bubble Tea,...,Shopping,Soul Food,Specialty Food,Sports Bars,Steakhouses,Tea Rooms,Tex-Mex,Vape Shops,Wine Bars,Yoga
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [132]:
new_cat.rename(columns=lambda x: x.lower().replace(' & ','_').replace(' ','_'), inplace=True)


In [136]:
list(new_cat.columns)

['active_life',
 'american_(new)',
 'american_(traditional)',
 'arts_entertainment',
 'bakeries',
 'barbeque',
 'bars',
 'beer,_wine_spirits',
 'breakfast_brunch',
 'bubble_tea',
 'burgers',
 'cafes',
 'caribbean',
 'coffee_tea',
 'convenience_stores',
 'dance_clubs',
 'delis',
 'desserts',
 'diners',
 'dive_bars',
 'ethnic_food',
 'fast_food',
 'fitness_instruction',
 'food',
 'gay_bars',
 'grocery',
 'irish',
 'karaoke',
 'latin_american',
 'macarons',
 'mexican',
 'music_venues',
 'nightlife',
 'pet_services',
 'pets',
 'pizza',
 'pubs',
 'puerto_rican',
 'restaurants',
 'salad',
 'sandwiches',
 'seafood',
 'shopping',
 'soul_food',
 'specialty_food',
 'sports_bars',
 'steakhouses',
 'tea_rooms',
 'tex-mex',
 'vape_shops',
 'wine_bars',
 'yoga']

In [137]:
business.columns

Index([u'business_id', u'name', u'review_count', u'city', u'stars',
       u'categories', u'latitude', u'longitude', u'neighborhoods', u'variable',
       u'value', u'cat_lists'],
      dtype='object')

In [138]:
business.review_count.isnull().sum()

0

In [140]:
business.review_count.describe()

count    2500.000000
mean      102.607600
std       239.945949
min         3.000000
25%        11.000000
50%        24.500000
75%        83.000000
max      1474.000000
Name: review_count, dtype: float64

In [141]:
business.stars.isnull().sum()

0

In [144]:
business.stars.describe()

count    2500.00000
mean        3.41920
std         0.66772
min         2.00000
25%         3.00000
50%         3.50000
75%         4.00000
max         5.00000
Name: stars, dtype: float64

In [145]:
business.columns

Index([u'business_id', u'name', u'review_count', u'city', u'stars',
       u'categories', u'latitude', u'longitude', u'neighborhoods', u'variable',
       u'value', u'cat_lists'],
      dtype='object')

In [150]:
y = business.city.values
X = business[['review_count','stars']]

In [152]:
y.shape

(2500,)

In [153]:
X.shape

(2500, 2)

In [158]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)

In [160]:
params = {
    'n_neighbors':range(1,101),
    'weights':['uniform','distance']
}

knn = KNeighborsClassifier()

knn_gs = GridSearchCV(knn, params, cv=5, verbose=1, n_jobs=4)
knn_gs.fit(X_train, y_train)

print knn_gs.best_params_
best_knn = knn_gs.best_estimator_

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=4)]: Done 780 tasks      | elapsed:    2.7s


{'n_neighbors': 9, 'weights': 'distance'}


[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:    3.4s finished


In [161]:
print best_knn

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=9, p=2,
           weights='distance')


In [164]:
cross_val_score(best_knn, X, y, cv=5)

array([ 0.84031936,  0.786     ,  0.768     ,  0.544     ,  0.74549098])

In [165]:
cross_val_score(best_knn, X_test, y_test, cv=5)

array([ 0.96987952,  0.98787879,  0.98787879,  0.96363636,  0.94512195])

In [None]:
# Didn't get a chance to try a different model on this question. I would have tried a logistic regression if I had more
# time or would have started earlier...

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. Different categories of ratings

---

Yelp is finally ready to admit that their rating system sucks. No one cares about the ratings, they just use the site to find out what's nearby. The ratings are simply too unreliable for people. 

Yelp hypothesizes that this is, in fact, because different people tend to give their ratings based on different things. They believe that perhaps some people always base their ratings on quality of food, others on service, and perhaps other categories as well. 

1. Do some users tend to talk about service more than others in reviews/tips? Divide up the tips/reviews into more "service-focused" ones and those less concerned with service.
2. Create two new ratings for businesses: ratings from just the service-focused reviews and ratings from the non-service reviews.
3. Construct a regression model for each of the two ratings. They should use the same predictor variables (of your choice). 
4. Validate the performance of the models.
5. Do the models coefficients differ at all? What does this tell you about the hypothesis that there are in fact two different kinds of ratings?

In [388]:
reviews.head()

Unnamed: 0,user_id,review_id,votes_cool,business_id,votes_funny,stars,date,votes_useful,minutes_10,minutes_15,...,service_great,staff_friendly,super_friendly,sweet_potato,tasted_like,time_vegas,try_place,ve_seen,ve_tried,wait_staff
0,0a-ISXPjsyyNAMZfDbrKpw,zixc17WL2V-jsO0eBhFGbA,0,s6zzKc7UvwDRxn-BFlhiTg,0,1,2011-01-25,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,xXmx1eUOzrXt6a9zZ8TjjQ,HiOWflsFyqUqOnYZd0_NTw,0,xH5axHFadmc7Veutt376TA,0,1,2013-08-29,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ao7dIEejgkyN-FHMqbhk2Q,heAu778iL4bIdpKGf_b4rQ,0,L2PpaD1BEW_zKCt5GRLvPg,1,1,2011-06-11,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,3VgHutPPmF3TbWwIzlSNCA,C6_ijYCetrV29WfkEJ0Qnw,0,wrmZqOF6YAE6Dw-QWAoMLQ,0,1,2009-01-14,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Bp-ww5tH22JDEuj0o3WisA,ECDiVK1w9njUuVue6814-w,0,rVUEZpHQfWI_3kt0lBwaxQ,0,1,2015-02-18,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [317]:
reviews.shape

(2500, 108)

In [389]:
service_cols = [col for col in reviews.columns if 'service' in col]
service_cols

['bottle_service',
 'customer_service',
 'food_service',
 'good_service',
 'great_service',
 'service_excellent',
 'service_food',
 'service_friendly',
 'service_good',
 'service_great']

In [390]:
def service_count(review):
    service = 0
    for col in service_cols:
        if reviews.ix[review][col] > 0:
            service += 1
    return service

reviews['service_mentions'] = [service_count(i) for i in range(reviews.shape[0])]

In [392]:
reviews['service_mentions'].value_counts()

0    2160
1     303
2      35
3       2
Name: service_mentions, dtype: int64

In [393]:
reviews['service_focused'] = reviews['service_mentions'].map(lambda x: 1 if x > 1 else 0)

In [396]:
reviews['stars'] = reviews['stars'].map(lambda x: int(x))

In [397]:
reviews['rating'] = reviews['stars'].map(lambda x: 1 if x > 2.5 else 0)

In [399]:
reviews['rating'].value_counts()

1    1278
0    1222
Name: rating, dtype: int64

In [400]:
#This is the portion of positive reviews in the service focused group.
np.sum(reviews['rating'][reviews['service_focused'] == 1])/float(len(reviews['rating'][reviews['service_focused'] == 1]))

0.4864864864864865

In [401]:
#This is the portion of positive reviews in the non-service focused group.
np.sum(reviews['rating'][reviews['service_focused'] == 0])/float(len(reviews['rating'][reviews['service_focused'] == 0]))

0.5115712545676004

In [402]:
predictors = ['votes_funny', 'votes_cool', 'votes_useful']

X_service = reviews[predictors][reviews['service_focused'] == 1]
y_service = reviews['rating'][reviews['service_focused'] == 1]

X_nonservice = reviews[predictors][reviews['service_focused'] == 0]
y_nonservice = reviews['rating'][reviews['service_focused'] == 0]

In [403]:
params = {
    'penalty':['l1', 'l2'],
    'solver':['liblinear'],
    'C': np.linspace(0.00002,1,100)
}

lr = LogisticRegression()
lr_gs = GridSearchCV(lr, params, cv=3, verbose=0)
gs_service = lr_gs.fit(X_service, y_service)

In [404]:
print "Service Best Params", gs_service.best_params_
print "Service Best Score", gs_service.best_score_

Service Best Params {'penalty': 'l1', 'C': 0.54546363636363637, 'solver': 'liblinear'}
Service Best Score 0.540540540541


In [405]:
lr_final = LogisticRegression(penalty = 'l2', C = 0.0707, solver = 'liblinear')
service_results = lr_final.fit(X_service, y_service)

In [406]:
non_service_results = lr_final.fit(X_nonservice, y_nonservice)

In [407]:
service_coef_df = pd.DataFrame({
        'predictor': predictors,
        'coef': service_results.coef_[0]
    })

nonservice_coef_df = pd.DataFrame({
        'predictor': predictors,
        'coef': non_service_results.coef_[0]
    })

In [412]:
nonservice_coef_df 


Unnamed: 0,coef,predictor
0,-0.219073,votes_funny
1,0.619518,votes_cool
2,-0.341643,votes_useful


In [413]:
service_coef_df

Unnamed: 0,coef,predictor
0,-0.219073,votes_funny
1,0.619518,votes_cool
2,-0.341643,votes_useful


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 4. Identifying "elite" users

---

Yelp, though having their own formula for determining whether a user is elite or not, is interested in delving deeper into what differentiates an elite user from a normal user at a broader level.

Use a classification model to predict whether a user is elite or not. Note that users can be elite in some years and not in others.

1. What things predict well whether a user is elite or not?
- Validate the model.
- If you were to remove the "counts" metrics for users (reviews, votes, compliments), what distinguishes an elite user, if anything? Validate the model and compare it to the one with the count variables.
- Think of a way to visually represent your results in a compelling way.
- Give a brief write-up of your findings.


In [None]:
# Didn't get a chance to start this question. If i got a chance to start it I would have set up a binary for the 
# elite column. Then train-test-split the data and set up a logistic regression?

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. Find something interesting on your own

---

You want to impress your superiors at Yelp by doing some investigation into the data on your own. You want to do classification, but you're not sure on what.

1. Create a hypothesis or hypotheses about the data based on whatever you are interested in, as long as it is predicting a category of some kind (classification).
2. Explore the data visually (ideally related to this hypothesis).
3. Build one or more classification models to predict your target variable. **Your modeling should include gridsearching to find optimal model parameters.**
4. Evaluate the performance of your model. Explain why your model may have chosen those specific parameters during the gridsearch process.
5. Write up what the model tells you. Does it validate or invalidate your hypothesis? Write this up as if for a non-technical audience.

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 6. ROC and Precision-recall

---

Some categories have fewer overall businesses than others. Choose two categories of businesses to predict, one that makes your proportion of target classes as even as possible, and another that has very few businesses and thus makes the target varible imbalanced.

1. Create two classification models predicting these categories. Optimize the models and choose variables as you see fit.
- Make confusion matrices for your models. Describe the confusion matrices and explain what they tell you about your models' performance.
- Make ROC curves for both models. What do the ROC curves describe and what do they tell you about your model?
- Make Precision-Recall curves for the models. What do they describe? How do they compare to the ROC curves?
- Explain when Precision-Recall may be preferable to ROC. Is that the case in either of your models?