In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("Proj4.ipynb")

# Project 4: Mongo 

## Due Date: Tuesday 5/4, 11:59 PM

In this project, we will be investigating how different database systems handle semi-structured JSON data. In particular, we will be placing emphasis on the use of MongoDB: a database system that stores data in a construct known as documents. These documents are very similar to the JSON objects we've explored in lecture with a few differences in representation and indexing that we will explore in the following questions. In this project, we will be working with the Yelp Academic Dataset which contains a dataset of `businesses`, `reviews`, and `users`. Due to the limitations of JupyterHub and the Mongo instances we are working with, `reviews` and `users` are truncated to 7500 reviews and 1000 users. We will be using the full `businesses` dataset, however.

By the end of this project, you should understand what Mongo can (and cannot) do with its documents, and be able to compare and contrast this with other data representation formats such as the relational and dataframe models.

## Scoring Breakdown
Question | Points
--- | ---
1a	| 1
1b  | 1
1c	| 1
1d	| 1
1e	| 2
1f  | 1
2a	| 1
2b	| 2
2c  | 2
3a	| 1
3b	| 1
3c	| 2
3d  | 1
3e  | 3
4a	| 1
4b	| 1
4c	| 1
4d  | 1
**Total** | 24

## Loading Up Mongo
We will be using PyMongo, the Python driver for MongoDB, for this project. Every student should have access to their own MongoDB instance, running on the localhost of your Datahub server. Be cautious with your queries, however: this is the first time Mongo has been run on Datahub, so you might run into some hiccups along the way! After running the following cell, you can use the Python variables `business`, `review`, and `user` to access the corresponding collections for the rest of the project.

In [2]:
import json
import pymongo

In [3]:
myclient = pymongo.MongoClient("mongodb://localhost")
mydb = myclient["yelp"]
business = mydb["businesses"]
review = mydb["reviews"]
user = mydb["users"]

## Troubleshooting
You might run into issues on the project where you are certain your code works but the output is incorrect. This may be because your collections have been corrupted. Run the following cell and uncomment the specific collections you would like to drop if you would like to remake your collections from scratch. **Be sure to re-run the Load Datasets cell if you drop your collections so you aren't working with empty collections!**

In [3]:
# RUN THIS CELL IF YOU WOULD LIKE TO REMAKE YOUR COLLECTIONS FROM SCRATCH. IF YOU DROP ANY COLLECTIONS,
# RE-RUN THE NEXT CELL TOO TO LOAD IN DATA.

# review.drop()
# business.drop()
# user.drop()

business = mydb["businesses"]
review = mydb["reviews"]
user = mydb["users"]

## Load Datasets
The following 2 cells will load the JSON datasets into the appropriate Mongo collections. The second cell will probably take a couple of minutes to run. Be sure to re-run it if you dropped your collections in the previous cell.

In [5]:
import zipfile
import os.path

if not os.path.isfile('yelp_academic_dataset_review.json'):
    with zipfile.ZipFile('yelp_academic_dataset_review.json.zip', 'r') as zip_ref:
        zip_ref.extractall()

if not os.path.isfile('yelp_academic_dataset_user.json'):
    with zipfile.ZipFile('yelp_academic_dataset_user.json.zip', 'r') as zip_ref:
        zip_ref.extractall()

if not os.path.isfile('yelp_academic_dataset_business.json'):
    with zipfile.ZipFile('yelp_academic_dataset_business.json.zip', 'r') as zip_ref:
        zip_ref.extractall()

In [17]:
# THIS CELL MAY TAKE AT MOST 5 MINUTES. BUT HOPEFULLY YOU WILL ONLY NEED TO RUN IT ONCE.
if business.count_documents({}) == 0:
    with open('yelp_academic_dataset_business.json', encoding='utf-8') as f:
        for line in f:
            business.insert_one(json.loads(line))

if review.count_documents({}) == 0:
    with open('yelp_academic_dataset_review.json', encoding='utf-8') as f:
        for line in f:
            review.insert_one(json.loads(line))
            
if user.count_documents({}) == 0:
    with open('yelp_academic_dataset_user.json', encoding='utf-8') as f:
        for line in f:
            user.insert_one(json.loads(line))

Let's take a quick look at our collections. For the command below, replace `user` with `review` or `business` to count the number of documents in each collection.

In [4]:
user.count_documents({})

1000

Now let's inspect our collections. Replace `business` with `review` and `user` to see the first document in each collection.

In [5]:
list(business.aggregate([{"$limit": 1}]))

[{'_id': ObjectId('63bc8fa899903bc4b31f7573'),
  'business_id': '6iYb2HFDywm3zjuRg0shjw',
  'name': 'Oskar Blues Taproom',
  'address': '921 Pearl St',
  'city': 'Boulder',
  'state': 'CO',
  'postal_code': '80302',
  'latitude': 40.0175444,
  'longitude': -105.2833481,
  'stars': 4.0,
  'review_count': 86,
  'is_open': 1,
  'attributes': {'RestaurantsTableService': 'True',
   'WiFi': "u'free'",
   'BikeParking': 'True',
   'BusinessParking': "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}",
   'BusinessAcceptsCreditCards': 'True',
   'RestaurantsReservations': 'False',
   'WheelchairAccessible': 'True',
   'Caters': 'True',
   'OutdoorSeating': 'True',
   'RestaurantsGoodForGroups': 'True',
   'HappyHour': 'True',
   'BusinessAcceptsBitcoin': 'False',
   'RestaurantsPriceRange2': '2',
   'Ambience': "{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casua

If you see a document containing a business named `Oskar Blues Taproom` when you run the command above, it means that our JSON data has successfully been imported into the collection! Now we can get started with exploring Mongo in a bit more detail. Run the following two cells for grading purposes.

In [20]:
# ! mkdir -p results

In [5]:
import bson
from bson.objectid import ObjectId
import pickle
import pandas as pd
import pprint

## Question 1: Basic MQL

### Question 1a

In lecture, we discussed how one could find specific attributes from a JSON object using dot notation. 

While you can still use the dot notation in queries, PyMongo represents documents returned from Mongo queries using Python dictionaries, making it convenient to manipulate JSON using a mix of Mongo queries and array indexing. Specifically, given the result of a retrieval `find` query, you can look up the third document by appending `[2]`. Then, given this document, you can look up the field `'amount'` by appending `['amount']` etc., adding multiple square brackets as needed to "walk down" the JSON tree representation via `collection.find(...)[2]['amount']`. 
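
A minimal sketch of this "walking down" pattern, using a plain Python list of dictionaries as a stand-in for the documents a `find` query would return. The sample documents and their `amount` values are made up for illustration and are not part of the Yelp dataset.

```python
# Hypothetical stand-in for list(collection.find(...)); PyMongo hands back
# each document as a regular Python dict, so the same indexing applies.
documents = [
    {"name": "A", "amount": 10, "hours": {"Tuesday": "9:0-17:0"}},
    {"name": "B", "amount": 20, "hours": {"Tuesday": "10:0-18:0"}},
    {"name": "C", "amount": 30, "hours": {"Tuesday": "11:0-22:0"}},
]

third_document = documents[2]                      # third document (0-based index)
amount = documents[2]["amount"]                    # one level down
tuesday_hours = documents[2]["hours"]["Tuesday"]   # two levels down

# .get() avoids a KeyError when a field is missing from a document.
sunday_hours = documents[2]["hours"].get("Sunday")

print(amount, tuesday_hours, sunday_hours)
```

Against the real collection, `business.find_one(q)` returns a single dict (or `None` if nothing matches), so the same bracket chain applies without the leading `[2]`.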

As a warmup to get you familiarized with PyMongo syntax, find the Tuesday hours for the restaurant named Legal Sea Foods at 100 Huntington Ave, Boston, MA. Be careful! There are many Legal Sea Foods in Boston!

In [27]:
q = {
    "name": "Legal Sea Foods",
    "address": "100 Huntington Ave",
    "city": "Boston",
    "state": "MA"
}

tuesday_hours = business.find_one(q)['hours']['Tuesday']

In [28]:
question_1a_str = tuesday_hours

In [29]:
# Do not delete/edit this cell
pickle.dump(question_1a_str, open("results/result_1a.p","wb"))

In [30]:
grader.check("q1a")

### Question 1b
Now let's get some practice with aggregation and filtering. Our goal is to write a query that computes the average star rating for all businesses in Colorado with 30 reviews or greater. However, this won't be as easy as setting the state to CO! If we inspect this dataset more closely, we will notice that some cities are not matched up with the right states. As an example, run `list(business.find({"state": "CA"}))` below.

In [33]:
pprint.pprint(list(business.find({"state": "CA"})))

[{'_id': ObjectId('63bc8fa999903bc4b31f7c8e'),
  'address': '',
  'attributes': {'BusinessAcceptsBitcoin': 'False',
                 'BusinessAcceptsCreditCards': 'True',
                 'WiFi': "u'no'"},
  'business_id': 'SNCRnaSy6E5fHgQuoCmmbQ',
  'categories': 'Shopping, Clothing Rental, Event Planning & Services, '
                'Fashion, Event Photography, Photographers, Session '
                'Photography',
  'city': 'Portland',
  'hours': {'Friday': '8:0-22:0',
            'Monday': '8:0-22:0',
            'Saturday': '8:0-22:0',
            'Sunday': '8:0-22:0',
            'Thursday': '8:0-22:0',
            'Tuesday': '8:0-22:0',
            'Wednesday': '8:0-22:0'},
  'is_open': 1,
  'latitude': 45.4501529,
  'longitude': -122.8849111,
  'name': 'Katia Photography',
  'postal_code': '97007',
  'review_count': 11,
  'stars': 5.0,
  'state': 'CA'},
 {'_id': ObjectId('63bc8fbd99903bc4b31fecfa'),
  'address': '4655 SW Griffith Dr, Ste 125',
  'attributes': {'BikeParking': 

Notice how cities like Portland, Atlanta, and Austin are classified as California cities! However, the latitude and longitude are generally correct. Colorado lies between latitudes 37 and 41 (inclusive) and longitudes -109 and -102 (inclusive). Now, use this to find the average star rating of all businesses in this range with 30 or more reviews.

Recall that in SQL, we would use a GROUP BY with the AVG aggregation function. In Mongo, we use an aggregation pipeline, comprised of multiple stages. Each stage transforms the documents in some way. Pipeline stages do not need to produce one output document for every input document. For example, some stages may generate new documents or filter out documents.

**HINT**: as in the previous question, you may find it helpful to use the PyMongo array notation to extract the pertinent information once you have composed the right Mongo aggregation query.
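
As a rough sketch of how stages compose, the pipeline below filters with `$match` and then collapses all surviving documents into a single one with `$group` and `$avg`, which is the Mongo counterpart of `AVG` without a `GROUP BY` key. The helper name `build_avg_stars_pipeline` and the output field name `averageStars` are assumptions made for this example.

```python
def build_avg_stars_pipeline(lat_range, lon_range, min_reviews):
    """Return a two-stage pipeline: filter by bounding box, then average `stars`."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    return [
        # Stage 1: keep only businesses inside the box with enough reviews.
        {"$match": {
            "latitude": {"$gte": lat_min, "$lte": lat_max},
            "longitude": {"$gte": lon_min, "$lte": lon_max},
            "review_count": {"$gte": min_reviews},
        }},
        # Stage 2: "_id": None puts every remaining document in one group,
        # so the stage emits a single document holding the average.
        {"$group": {"_id": None, "averageStars": {"$avg": "$stars"}}},
    ]

colorado_pipeline = build_avg_stars_pipeline((37, 41), (-109, -102), 30)
print(colorado_pipeline)

# With the notebook's `business` collection this would be used roughly as:
#   list(business.aggregate(colorado_pipeline))[0]["averageStars"]
```

Doing the averaging inside the pipeline means only one small document travels back to Python, instead of every matching business.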

In [46]:
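pipeline = [
    {
        # Filter by coordinates rather than by `state`, which is unreliable in this dataset,
        # and keep only businesses with 30 or more reviews.
        "$match": {
            "latitude": {"$gte": 37, "$lte": 41},
            "longitude": {"$gte": -109, "$lte": -102},
            "review_count": {"$gte": 30},
        }
    }
]

business_in_colorado = list(business.aggregate(pipeline))
avg_stars_for_business_in_colorado = sum(bic['stars'] for bic in business_in_colorado) / len(business_in_colorado)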

In [47]:
question_1b_str = avg_stars_for_business_in_colorado

In [48]:
# Do not delete/edit this cell
pickle.dump(question_1b_str, open("results/result_1b.p","wb"))

In [49]:
grader.check("q1b")

### Question 1c

In this question, we will explore aggregation and grouping further. We will also make use of the `$project` operator which allows us to output documents with certain fields of our choosing. 

For this question, we would like to create an aggregation pipeline to find towns in each state with the highest average number of stars. **We will only consider towns with greater than or equal to 5 reviews in total across all the restaurants in that town so that the average is meaningful.** Your final output should contain exactly two fields: `city_state` which is the name of the town with the highest value of average stars in the state concatenated with a comma followed by the state initials and `averageStars` which contains the average number of stars for the corresponding town. To ensure your output is consistent with the autograder, sort in descending order by `averageStars` and break ties by sorting second on `city_state` in alphabetical order.

As a concrete example, imagine that Berkeley and Austin have the highest average stars in California and Texas respectively (and both have more than or equal to 5 total reviews). If Berkeley and Austin both have an average star rating of 5.0, your final output should be:

```
{'averageStars': 5.0, 'city_state': 'Austin, TX'}
{'averageStars': 5.0, 'city_state': 'Berkeley, CA'}
```

**NOTE:** You will provide a pipeline to `business.aggregate(...)` as your solution. Make sure that you save your pipeline to `q1c_pipeline` or you will not pass the autograder! 

**HINT:** You may find the `concat` operator helpful. See: https://docs.mongodb.com/manual/reference/operator/aggregation/concat/
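
One possible shape for such a pipeline, offered as a sketch rather than the reference solution. It assumes that "5 reviews in total" means the sum of `review_count` over all businesses in a town, and that each state contributes exactly one town (the alphabetically first one if several tie for the top average). The variable name `q1c_sketch` and the intermediate field `totalReviews` are made up for this example.

```python
q1c_sketch = [
    # 1. One document per (city, state): average stars and total review count.
    {"$group": {
        "_id": {"city": "$city", "state": "$state"},
        "averageStars": {"$avg": "$stars"},
        "totalReviews": {"$sum": "$review_count"},
    }},
    # 2. Keep only towns with at least 5 reviews in total.
    {"$match": {"totalReviews": {"$gte": 5}}},
    # 3. Build the "City, ST" label with $concat and carry the state along.
    {"$project": {
        "_id": 0,
        "state": "$_id.state",
        "averageStars": 1,
        "city_state": {"$concat": ["$_id.city", ", ", "$_id.state"]},
    }},
    # 4. Best town first, ties broken alphabetically.
    {"$sort": {"averageStars": -1, "city_state": 1}},
    # 5. One document per state, taking the first (best) town after the sort.
    {"$group": {
        "_id": "$state",
        "averageStars": {"$first": "$averageStars"},
        "city_state": {"$first": "$city_state"},
    }},
    # 6. Keep exactly the two requested fields and restore the final ordering.
    {"$project": {"_id": 0, "averageStars": 1, "city_state": 1}},
    {"$sort": {"averageStars": -1, "city_state": 1}},
]

print([next(iter(stage)) for stage in q1c_sketch])
```

The key design point is grouping twice: first by town to compute the averages, then by state to pick the best town, with `$concat` applied while both `city` and `state` are still available.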

In [116]:
pipeline = [
    {
        "$match": {
            "review_count": {"$gte": 5}
        }
    },
    {
        "$project": {
            "city": 1,
            "stars": 1,
            "state": 1,
            "city_state": {"$concat": ["$city", ", ", "$state"]}
        }
    },
    {
        "$group": {
            "_id": "$city_state",
            "state": {"$first": "$state"},
            "avgRating": {"$avg": "$stars"}
        }
    },
    {
        "$group": {
            "_id": "$state",
            "averageStars": {"$max": "$avgRating"}
        }
    },
    {
        "$project": {
            "_id": 0,
            "city_state": "$_id",
            "averageStars": 1
        }
    },
    {
      "$sort": { "averageStars": -1, "city_state": 1 }
    }
]

res = list(business.aggregate(pipeline))
pprint.pprint(len(res))
pprint.pprint(res)

31
[{'averageStars': 5.0, 'city_state': 'BC'},
 {'averageStars': 5.0, 'city_state': 'CA'},
 {'averageStars': 5.0, 'city_state': 'DC'},
 {'averageStars': 5.0, 'city_state': 'FL'},
 {'averageStars': 5.0, 'city_state': 'GA'},
 {'averageStars': 5.0, 'city_state': 'MA'},
 {'averageStars': 5.0, 'city_state': 'NY'},
 {'averageStars': 5.0, 'city_state': 'OH'},
 {'averageStars': 5.0, 'city_state': 'ON'},
 {'averageStars': 5.0, 'city_state': 'OR'},
 {'averageStars': 5.0, 'city_state': 'TX'},
 {'averageStars': 5.0, 'city_state': 'WA'},
 {'averageStars': 4.5, 'city_state': 'ABE'},
 {'averageStars': 4.5, 'city_state': 'AZ'},
 {'averageStars': 4.5, 'city_state': 'CO'},
 {'averageStars': 4.5, 'city_state': 'DE'},
 {'averageStars': 4.5, 'city_state': 'HI'},
 {'averageStars': 4.0, 'city_state': 'IL'},
 {'averageStars': 4.0, 'city_state': 'NH'},
 {'averageStars': 3.5, 'city_state': 'ME'},
 {'averageStars': 3.5, 'city_state': 'MI'},
 {'averageStars': 3.5, 'city_state': 'MN'},
 {'averageStars': 3.5, 'city

In [117]:
q1c_pipeline = pipeline

cur = business.aggregate(q1c_pipeline)

In [118]:
# Do not delete/edit this cell
myclient = pymongo.MongoClient("mongodb://localhost")
mydb = myclient["yelp"]
business = mydb["businesses"]
pipeline_for_test_1c = q1c_pipeline[:]
cur_test_1c = business.aggregate(pipeline_for_test_1c)
cur_test_1c = pickle.dump(list(cur_test_1c), open("results/result_1c.p","wb"))

In [119]:
grader.check("q1c")

### Question 1d

In class, we've described structured (rectangular) data as well as semi-structured data. We haven't quite covered unstructured data -- this is basically free-form text. Often, in semi-structured JSON you may have unstructured text data embedded within, such as the text field in the review collection.

MongoDB allows us to build a so-called `text index` to retrieve relevant documents based on keywords found in the text of a predefined field. This index converts our free-form text into a structure that allows us to easily look up documents by their contents. To leverage this text search capability, we must first build a text index on the text field. This has been done for you.

We will then use this text index to do basic sentiment analysis and find all the restaurants we should avoid! Using the text index given, write a query to find all the reviews with "disgusting", "horrible", "horrid", "gross", "bad", or "hate". To use the text index, use the keywords `$text` and `$search` as detailed here: https://docs.mongodb.com/manual/text-search/. Once the index has been created (i.e. you run the following cell at least once), the `$text` and `$search` commands should work immediately contingent on proper syntax.

Enter your query after `cursor` and use the pre-written code to read some of the reviews. Your query should be of the form `review.find({...})`. How many reviews contain any of these six words?
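
As a rough sketch of how the filter document can be assembled: in a `$search` string, space-separated terms are OR-ed together, so a review matches if it contains any one of them. The helper name `build_text_filter` is made up for this example.

```python
def build_text_filter(words):
    """Build a $text filter matching documents that contain any of `words`."""
    # Space-separated terms in $search are combined with a logical OR.
    return {"$text": {"$search": " ".join(words)}}

avoid_words = ["disgusting", "horrible", "horrid", "gross", "bad", "hate"]
text_filter = build_text_filter(avoid_words)
print(text_filter)

# With the notebook's collection, this would be passed as review.find(text_filter).
```

Text search is case-insensitive by default and applies language-specific stemming, so matching happens on word stems rather than exact strings.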

In [125]:
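from pymongo import TEXT

# We create a text index here.
if 'text_text' not in review.index_information():
    review.create_index([('text', TEXT)])

# Space-separated terms in $search are OR-ed together.
cursor = review.find({
    "$text": {
        "$search": "disgusting horrible horrid gross bad hate"
    }
})

# Materialize the cursor once: a cursor is single-pass, so counting it after
# partially iterating over it would miss the documents already consumed.
matching_reviews = list(cursor)

for business_review in matching_reviews[:2]:
    print(business_review["text"])

question_1d_str = len(matching_reviews)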

I had been coming here for years, but Habana's has lost it ways. The food was so bad and expensive.  Bad food that cost a lot leaves you feeling more than a little ripped off.  The drink menu does not have prices,  this is because a glass of wine is 10 dollars.  The $23 Lechon Asado could have been mistaken for bad jerky.  
There are NOW much better Cuban here in Austin, let this one fade away.
I don't have any idea why but my coffee was so bad I couldn't drink it. The same dark roast I order at other nearby locations and love yet so bad here. I thought it could be a one bad day thing so I gave it another try and got the same horrible results. When your morning starts with coffee you can't drink you are in for a long day. I can only guess that employees didn't make the coffee as ordered or else they are using water from Lake Apopka.


In [126]:
# Do not delete/edit this cell
pickle.dump(question_1d_str, open("results/result_1d.p","wb"))

In [127]:
grader.check("q1d")

### Question 1e

Now let's learn Mongo updates, deletions, and creation. Create a new collection called `reviews_boolean` which is the exact same as `reviews` EXCEPT there is a new field called `to_avoid` which is the string "true"  if the review `text` contains the words "disgusting", "horrid", "horrible", "gross", "bad", or "hate" and the string "false" if not.  

This is a tricky task! We have not discussed creation, updates, or insertions in great detail during lecture but luckily, Mongo uses a similar approach to SQL. 

*Insertions*: In order to insert documents into a collection, you may use the functions [`review_boolean.insert_one(...)`](https://docs.mongodb.com/manual/reference/method/db.collection.insertOne/) or [`review_boolean.insert_many(...)`](https://docs.mongodb.com/manual/reference/method/db.collection.insertMany/). These functions take in a document or a list of documents, respectively, and insert them into the collection.

*Updates*: In order to update a document, you may use the functions [`review_boolean.update_one(...)`](https://docs.mongodb.com/manual/reference/method/db.collection.updateOne/) or [`review_boolean.update_many(...)`](https://docs.mongodb.com/manual/reference/method/db.collection.updateMany/). These functions take in two parameters. The first specifies which documents should be modified. If the first parameter is `{}`, this indicates that all documents should be updated. However, you can put a more specific filter here if you would like. The second parameter specifies what you would like to update your field to (the [`$set`](https://docs.mongodb.com/manual/reference/operator/update/set/) operator may come in handy here). Recall that in our SQL model, updates are performed as `UPDATE ... SET ... WHERE ...`. In our case, the first ellipsis corresponds to `reviews_boolean`, the second ellipsis corresponds to the second parameter of `update_*`, and the third ellipsis corresponds to the first parameter of `update_*`.

*Creation*: We handle creation of the collection for you. But in PyMongo, creating a collection is as simple as writing `variable_name = db[collection_name]`, where `db` is the PyMongo database object you have already created (the collection itself comes into existence when the first document is inserted).

Some additional reminders and hints:
- The empty collection `reviews_boolean` has already been created for you and is stored in the variable `review_boolean`
- `review.find({})` creates an iterator that allows you to iterate over every document in `review`
- Do not forget that in order to pass the hidden tests, the `to_avoid` field must exist for every document in `reviews_boolean`!
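
To make the `UPDATE ... SET ... WHERE ...` correspondence concrete, here is a toy in-memory emulation. It is not PyMongo code: `emulate_update_many` is a made-up helper that supports only equality filters and `$set`, and the two sample reviews are invented. It only illustrates which argument plays which role.

```python
def emulate_update_many(docs, flt, update):
    """Toy stand-in for update_many: equality filters and $set only."""
    matched = 0
    for doc in docs:
        # An empty filter ({}) matches every document, like omitting WHERE.
        if all(doc.get(field) == value for field, value in flt.items()):
            doc.update(update["$set"])  # $set adds the field or overwrites it
            matched += 1
    return matched

reviews = [
    {"_id": 1, "stars": 1, "text": "bad food"},
    {"_id": 2, "stars": 5, "text": "great"},
]

# UPDATE reviews SET to_avoid = 'false'
matched_all = emulate_update_many(reviews, {}, {"$set": {"to_avoid": "false"}})

# UPDATE reviews SET to_avoid = 'true' WHERE _id = 1
matched_one = emulate_update_many(reviews, {"_id": 1}, {"$set": {"to_avoid": "true"}})

print(matched_all, matched_one, reviews)
```

With a real collection, the same two dictionaries would be passed as `review_boolean.update_many(filter, update)`.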

In [139]:
review_boolean = mydb["reviews_boolean"]
review_boolean.drop()  # start from an empty collection so re-runs do not raise DuplicateKeyError

# YOUR ANSWER BEGINS HERE
# Copy every review, defaulting `to_avoid` to the string "false".
all_reviews = [{**r, "to_avoid": "false"} for r in review.find({})]
review_boolean.insert_many(all_reviews)

# The text index lives on `review`, so search there and update the copies by `_id`.
cursor = review.find({
    "$text": {
        "$search": "disgusting horrible horrid gross bad hate"
    }
})

for business_review in cursor:
    review_boolean.update_one({"_id": business_review["_id"]},
                              {"$set": {"to_avoid": "true"}})

In [140]:
# Do not delete/edit this cell
import pickle
myclient = pymongo.MongoClient("mongodb://localhost")
mydb = myclient["yelp"]
reviews_boolean_1e = mydb["reviews_boolean"]
pickle.dump(list(reviews_boolean_1e.find({}, {'_id': 0})), open("results/result_1e_1.p","wb"))
pickle.dump(mydb.list_collection_names(), open("results/result_1e_2.p","wb"))

In [141]:
grader.check("q1e")

### Question 1f

Now, you had a change of heart: you decide that it's unfair to label restaurants as `to_avoid` without at least giving them a chance! Remove the `to_avoid` field from the `reviews_boolean` collection. Calculate the `difference` between the data size of `reviews_boolean` with the `to_avoid` field and without it. The code for making this calculation is provided but it is up to you to actually remove the field. Before running the next cell, make sure to re-run your cell for 1e so you don't get a difference of 0!

*Deletions*: Deletions in Mongo make use of the `review_boolean.update_one(...)` or `review_boolean.update_many(...)` functionality discussed in Question 1e. However, this time, instead of using the `$set` operator which allows for the creation of new fields, we will use the [`$unset`](https://docs.mongodb.com/manual/reference/operator/update/unset/) operator which deletes them! Very tidy!


In [143]:
with_avoid = mydb.command("collstats", "reviews_boolean")['size']

# YOUR ANSWER BEGINS HERE
review_boolean.update_many({}, {"$unset": {"to_avoid": ""}})
# END

without_avoid = mydb.command("collstats", "reviews_boolean")['size']
difference = with_avoid - without_avoid

In [144]:
# Do not delete/edit this cell
pickle.dump(difference, open("results/result_1f.p","wb"))

In [145]:
grader.check("q1f")

## Question 2: JSON and Relational Models

### Question 2a

Now we have a good idea of how to do retrieval, aggregation, and updates in Mongo. But we haven't talked about why we
would want to use Mongo to store JSON! In order to explore this, let's take another look at the `business`
collection. We will look at the first two entries.

In [146]:
list(business.aggregate([{"$limit": 2}]))

[{'_id': ObjectId('63bc8fa899903bc4b31f7573'),
  'business_id': '6iYb2HFDywm3zjuRg0shjw',
  'name': 'Oskar Blues Taproom',
  'address': '921 Pearl St',
  'city': 'Boulder',
  'state': 'CO',
  'postal_code': '80302',
  'latitude': 40.0175444,
  'longitude': -105.2833481,
  'stars': 4.0,
  'review_count': 86,
  'is_open': 1,
  'attributes': {'RestaurantsTableService': 'True',
   'WiFi': "u'free'",
   'BikeParking': 'True',
   'BusinessParking': "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}",
   'BusinessAcceptsCreditCards': 'True',
   'RestaurantsReservations': 'False',
   'WheelchairAccessible': 'True',
   'Caters': 'True',
   'OutdoorSeating': 'True',
   'RestaurantsGoodForGroups': 'True',
   'HappyHour': 'True',
   'BusinessAcceptsBitcoin': 'False',
   'RestaurantsPriceRange2': '2',
   'Ambience': "{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casua

<!-- BEGIN QUESTION -->

What are two pros of storing this data in MongoDB with JSON over a relational database management system such as Postgres?
Please reference specific examples from the `business` collection to back up your claims. 

<!--
BEGIN QUESTION
name: q2a
manual: true
points: 1
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 2b

It seems like MongoDB is getting all the love when it comes to JSON support! However, modern iterations of relational databases
such as Postgres 9.3+ also have [excellent JSON functionality](https://www.postgresql.org/docs/9.3/functions-json.html) as we will soon explore in this task. First, let's set up a
bit of scaffolding. The following cell will import the `yelp_academic_dataset_review.json` data into a table called `reviews`.

# Errors!

- got "detail:  Character with value 0x0a must be escaped." error
- got "NaN" result when indexing JSON

In [118]:
import pandas as pd

In [98]:
df = pd.read_json('yelp_academic_dataset_review.json', lines=True)
df.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,lWC-xP3rd6obsecCYsGZRg,ak0TdVmGKo4pwqdJSTLwWw,buF9druCkbuXLX526sGELQ,4,3,1,1,Apparently Prides Osteria had a rough summer a...,2014-10-11 03:34:02
1,8bFej1QE5LXp4O05qjGqXA,YoVfDbnISlW0f7abNQACIg,RA4V8pr014UyUbDvI-LW2A,4,1,0,0,This store is pretty good. Not as great as Wal...,2015-07-03 20:38:25
2,NDhkzczKjLshODbqDoNLSg,eC5evKn1TWDyHCyQAwguUw,_sS2LBIGNT5NQb6PD1Vtjw,5,0,0,0,I called WVM on the recommendation of a couple...,2013-05-28 20:38:06
3,T5fAqjjFooT4V0OeZyuk1w,SFQ1jcnGguO0LYWnbbftAA,0AzLzHfOJgL7ROwhdww2ew,2,1,1,1,I've stayed at many Marriott and Renaissance M...,2010-01-08 02:29:15
4,sjm_uUcQVxab_EeLCqsYLg,0kA0PAJ8QFMeveQWHFqz2A,8zehGz9jnxPqXtOc7KaJxA,4,0,0,0,The food is always great here. The service fro...,2011-07-28 18:05:01


In [113]:
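# Strip newlines and carriage returns from the review text. Otherwise COPY decodes the
# escaped "\n" in the JSON file into a raw 0x0a character, which the JSON parser rejects.
df.text = df.text.apply(lambda x: x.replace("\n", ""))
df.text = df.text.apply(lambda x: x.replace("\r", ""))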

In [114]:
df.to_json('pd_yelp_academic_dataset_review.json', orient='records', lines=True)

# YES GOT IT WORKING

In [110]:
%reload_ext sql
%sql postgresql://postgres:postgres@127.0.0.1:5432/

In [111]:
POSTGRES_URI = "postgresql://postgres:postgres@127.0.0.1:5432"

In [115]:
# ! psql "$POSTGRES_URI" -c 'DROP DATABASE IF EXISTS yelp'
# ! psql "$POSTGRES_URI" -c 'CREATE DATABASE yelp'
! psql "$POSTGRES_URI/yelp" -c 'DROP TABLE IF EXISTS reviews'
! psql "$POSTGRES_URI/yelp" -c 'CREATE TABLE reviews(data JSON);'
! cat pd_yelp_academic_dataset_review.json | psql "$POSTGRES_URI/yelp" -c "COPY reviews (data) FROM STDIN;"
%sql \l

DROP TABLE
CREATE TABLE
COPY 7500
 * postgresql://postgres:***@127.0.0.1:5432/
   postgresql://postgres:***@127.0.0.1:5432/yelp
18 rows affected.


Name,Owner,Encoding,Collate,Ctype,Access privileges
baseball,postgres,UTF8,en_US.utf8,en_US.utf8,
coursera_dwh,postgres,UTF8,en_US.utf8,en_US.utf8,
datacamp,postgres,UTF8,en_US.utf8,en_US.utf8,
dbt_tutorial,postgres,UTF8,en_US.utf8,en_US.utf8,
dpt-viz,postgres,UTF8,en_US.utf8,en_US.utf8,
dvdrental,postgres,UTF8,en_US.utf8,en_US.utf8,
eightweeksqlchallenge,postgres,UTF8,en_US.utf8,en_US.utf8,
imdb,postgres,UTF8,en_US.utf8,en_US.utf8,
mode,postgres,UTF8,en_US.utf8,en_US.utf8,
postgres,postgres,UTF8,en_US.utf8,en_US.utf8,


Run the next two cells to observe how this new `reviews` table looks. **Please note that the `data` column holds each document as a single JSON value (it is declared as `JSON` in the `CREATE TABLE` statement above), not as separate columns.**

In [116]:
%sql postgresql://postgres:postgres@127.0.0.1:5432/yelp

In [117]:
%%sql
SELECT * FROM public.reviews LIMIT 2;

   postgresql://postgres:***@127.0.0.1:5432/
 * postgresql://postgres:***@127.0.0.1:5432/yelp
2 rows affected.


data
"{'review_id': 'lWC-xP3rd6obsecCYsGZRg', 'user_id': 'ak0TdVmGKo4pwqdJSTLwWw', 'business_id': 'buF9druCkbuXLX526sGELQ', 'stars': 4, 'useful': 3, 'funny': 1, 'cool': 1, 'text': ""Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating. After reading the mixed reviews of late I was a little tentative over our choice but luckily there was nothing to worry about in the food department. We started with the fried dough, burrata and prosciutto which were all lovely. Then although they don't offer half portions of pasta we each ordered the entree size and split them. We chose the tagliatelle bolognese and a four cheese filled pasta in a creamy sauce with bacon, asparagus and grana frita. Both were very good. We split a secondi which was the special Berkshire pork secreto, which was described as a pork skirt steak with garlic potato puru00e9e and romanesco broccoli (incorrectly described as a romanesco sauce). Some tables received bread before the meal but for some reason we did not. Management also seems capable for when the tenants in the apartment above began playing basketball she intervened and also comped the tables a dessert. We ordered the apple dumpling with gelato and it was also quite tasty. Portions are not huge which I particularly like because I prefer to order courses. If you are someone who orders just a meal you may leave hungry depending on you appetite. Dining room was mostly younger crowd while the bar was definitely the over 40 set. Would recommend that the naysayers return to see the improvement although I personally don't know the former glory to be able to compare. 
Easy access to downtown Salem without the crowds on this month of October."", 'date': 1412998442000}"
"{'review_id': '8bFej1QE5LXp4O05qjGqXA', 'user_id': 'YoVfDbnISlW0f7abNQACIg', 'business_id': 'RA4V8pr014UyUbDvI-LW2A', 'stars': 4, 'useful': 1, 'funny': 0, 'cool': 0, 'text': 'This store is pretty good. Not as great as Walmart (or my preferred, Milford Target), but closer and in a easier area to get to. The store itself is pretty clean and organized, the staff are friendly (most of the time), and BEST of all is the Self Checkout this store has! Great clearance sections throughout, and great prices on everything in the store, in general (they pricematch too!). Christian, Debbie, Jen and Hanna are all very friendly, helpful, sensitive to all customer needs. Definitely one of the better Target locations in the area, and they do a GREAT job assisting customers for being such a busy store. Located directly in the Framingham Mall on Cochituate Rd / Route 30. 4 stars.', 'date': 1435955905000}"


Observe how the reviews table consists of one column named `data`. This column contains all the JSON documents in the 
reviews collection *in text format*. Use [Postgres' JSON functions](https://www.postgresql.org/docs/9.3/functions-json.html) to write a query that converts the JSON fields into their own columns. To be more concrete, your query should contain 8 columns: `review_id`, `user_id`, `business_id`, `stars`, `useful`, `funny`, `cool`, and `text`. All of these columns should be `TEXT` columns. Each row should correspond to one JSON document. Be sure to `ORDER BY review_id` and `LIMIT 10` so your output corresponds with the autograder. We will also need Pandas for this question's autograder. We will also use Pandas rather extensively in the next question so we will import it in the following cell.
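
For intuition, the Postgres `->>` operator extracts one top-level field of a JSON value and returns it as text. A rough Python analogue is sketched below; the function name `json_field_as_text` and the sample row are made up, and the sketch only approximates the operator for simple values.

```python
import json

def json_field_as_text(raw, key):
    """Approximate `data->>'key'`: return the field as text, or None if absent/null."""
    value = json.loads(raw).get(key)
    if value is None:
        return None
    # Strings come back unquoted; other values come back as their JSON text.
    return value if isinstance(value, str) else json.dumps(value)

row = '{"review_id": "abc123", "stars": 4, "text": "Pretty good."}'
columns = {key: json_field_as_text(row, key) for key in ("review_id", "stars", "text")}
print(columns)
```

Every extracted value is a string here, which mirrors why the columns produced by `->>` are `TEXT` unless they are cast explicitly.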

In [147]:
%%sql result_2b <<
WITH
json_data AS (
    SELECT
        data->>'review_id' AS review_id,
        data->>'user_id' AS user_id,
        data->>'business_id' AS business_id,
        data->>'stars' AS stars,
        data->>'useful' AS useful,
        data->>'funny' AS funny,
        data->>'cool' AS cool,
        data->>'text' AS text
    FROM reviews
)
SELECT
    review_id,
    user_id,
    business_id,
    stars::FLOAT,
    useful::FLOAT,
    funny::FLOAT,
    cool::FLOAT,
    text
FROM json_data
ORDER BY review_id
LIMIT 10
;

   postgresql://postgres:***@127.0.0.1:5432/
 * postgresql://postgres:***@127.0.0.1:5432/yelp
10 rows affected.
Returning data to local variable result_2b


In [148]:
# Do not delete/edit this cell
result_2b.DataFrame().to_csv('results/result_2b.csv', index=False)

In [149]:
pd.read_csv('results/result_2b.csv')

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text
0,000bviMESLXmlIFKDzCEfw,f7LnyAbhP5OSXvv_xiuZhw,SNuCspoI3HKcwJpZL5FcjQ,3.0,2.0,0.0,0.0,I went there not having read any reviews. Fro...
1,003VeQn6SrVQS4sYHlc0gg,ba8ZSYE11LVepGCxwP9Vpg,rdS7hBBeukiX4Led9OT8sg,5.0,0.0,0.0,0.0,Molana has one of the most tender Chicken Barg...
2,00HovWV7VcZZPx5IleoeWA,Nzaq0bJcE3q_bRdFrsFRsA,Ln-8CbKGZGmF-GCqMoMcpA,4.0,0.0,0.0,0.0,"This place is really new, but they were pretty..."
3,00pmZ82_w6Mpky6dl2jpiA,r7e-6OS8A_gE_0CTUjQR3g,IDxaD_0_9TlWyKKXBFwjMA,1.0,0.0,0.0,0.0,The food was great .. we had breakfast... the ...
4,021UtGruSN1RA5YRS92E7w,6cyn5sP2OCarYV02KWtuGQ,5HMXgD_gui5n0Tc_hadesg,3.0,0.0,0.0,0.0,The atmosphere is great for drinks but I'd say...
5,03lRn8FzfnMojQvCvYmPfg,HUiCXGr6u91zQ6RppIpPdg,slBGJOdBV1KHzNoWf2e2FQ,2.0,0.0,0.0,0.0,This place is a so-so. The service is ok. We g...
6,03V4Y5Yql2T7_Fs186Fbfg,mpDCL8ZtHh6hbH_JfHPL2w,CCK1fCaqCC8LUWu5tRCbmg,3.0,0.0,0.0,0.0,"Food is ok, seemed like too many veggies and n..."
7,04dUiVd_qE8K289SZdliyg,iwvndJ05-Q5KjSsHhjNKOw,24bxH8U1DRu1biUYaaEv8w,4.0,0.0,0.0,0.0,I recently tried Hound Dogs Pizza...a little l...
8,05QXu-K-g7Lt1SjqmSvuLA,HYijQXEbn8osKsiaNjqHeg,ERoYrBHNmTEEChY3RGaOGQ,5.0,0.0,0.0,1.0,Make sure to beat the Saturday rush for brunch...
9,05xtlX1HTqaUWssNCy_DEw,uaDx7fnZhiBz3Hqq8VuroQ,PX_xyQcEj1bnaec2oMwH2w,1.0,1.0,0.0,0.0,DON'T GO HERE !!!! YOU WILL WAIT 2 HOURS FOR ...


In [150]:
grader.check("q2b")

### Question 2c

One important aspect of data engineering that we have not yet discussed is joins. We saw that, through the use of indices, selection/projection pushdown, and various physical implementations (as well as orderings), joins can be done quite efficiently in relational SQL-based databases. How do joins fare in Mongo, where the stored data is inherently semi-structured? Let's investigate! For this question, we have given you access to the tables `business_complete` and `review_complete`, which contain the business and review collections in relational form as described in 2b (the columns of the relations are the fields of the JSON documents). Each relation has its respective id column (`business_id` or `review_id`) as its primary key.

In [151]:
! psql "$POSTGRES_URI/yelp" -c 'DROP TABLE IF EXISTS business_complete'
! psql "$POSTGRES_URI/yelp" -c 'CREATE TABLE business_complete(business_id TEXT PRIMARY KEY, name TEXT, address TEXT, city TEXT, state TEXT, postal_code TEXT, latitude TEXT,longitude TEXT, stars TEXT, review_count TEXT, is_open TEXT, attributes TEXT, categories TEXT, hours TEXT);'
! psql "$POSTGRES_URI/yelp" -c 'DROP TABLE IF EXISTS review_complete'
! psql "$POSTGRES_URI/yelp" -c 'CREATE TABLE review_complete(review_id TEXT PRIMARY KEY, user_id TEXT, business_id TEXT, stars TEXT, useful TEXT, funny TEXT, cool TEXT,text TEXT);'
! cat business.csv | psql "$POSTGRES_URI/yelp" -c "COPY business_complete (business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours) FROM STDIN CSV HEADER;"
! cat review.csv | psql "$POSTGRES_URI/yelp" -c "COPY review_complete (review_id, user_id, business_id, stars, useful, funny, cool, text) FROM STDIN CSV HEADER;"

NOTICE:  table "business_complete" does not exist, skipping
DROP TABLE
CREATE TABLE
NOTICE:  table "review_complete" does not exist, skipping
DROP TABLE
CREATE TABLE
COPY 35
COPY 7500


Let's take a look at how `review_complete` looks.

In [152]:
%%sql
SELECT * FROM public.review_complete LIMIT 2;

   postgresql://postgres:***@127.0.0.1:5432/
 * postgresql://postgres:***@127.0.0.1:5432/yelp
2 rows affected.


review_id,user_id,business_id,stars,useful,funny,cool,text
lWC-xP3rd6obsecCYsGZRg,ak0TdVmGKo4pwqdJSTLwWw,buF9druCkbuXLX526sGELQ,4.0,3,1,1,"Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating. After reading the mixed reviews of late I was a little tentative over our choice but luckily there was nothing to worry about in the food department. We started with the fried dough, burrata and prosciutto which were all lovely. Then although they don't offer half portions of pasta we each ordered the entree size and split them. We chose the tagliatelle bolognese and a four cheese filled pasta in a creamy sauce with bacon, asparagus and grana frita. Both were very good. We split a secondi which was the special Berkshire pork secreto, which was described as a pork skirt steak with garlic potato purée and romanesco broccoli (incorrectly described as a romanesco sauce). Some tables received bread before the meal but for some reason we did not. Management also seems capable for when the tenants in the apartment above began playing basketball she intervened and also comped the tables a dessert. We ordered the apple dumpling with gelato and it was also quite tasty. Portions are not huge which I particularly like because I prefer to order courses. If you are someone who orders just a meal you may leave hungry depending on you appetite. Dining room was mostly younger crowd while the bar was definitely the over 40 set. Would recommend that the naysayers return to see the improvement although I personally don't know the former glory to be able to compare. Easy access to downtown Salem without the crowds on this month of October."
8bFej1QE5LXp4O05qjGqXA,YoVfDbnISlW0f7abNQACIg,RA4V8pr014UyUbDvI-LW2A,4.0,1,0,0,"This store is pretty good. Not as great as Walmart (or my preferred, Milford Target), but closer and in a easier area to get to. The store itself is pretty clean and organized, the staff are friendly (most of the time), and BEST of all is the Self Checkout this store has! Great clearance sections throughout, and great prices on everything in the store, in general (they pricematch too!). Christian, Debbie, Jen and Hanna are all very friendly, helpful, sensitive to all customer needs. Definitely one of the better Target locations in the area, and they do a GREAT job assisting customers for being such a busy store. Located directly in the Framingham Mall on Cochituate Rd / Route 30. 4 stars."


Currently, Mongo only supports left outer joins, so that is what we will compare against SQL.
Below is the EXPLAIN plan for a Mongo left outer join between `review` and `business`. For more information on what each part of `$lookup` does, refer to https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/.
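
To make the shape of a `$lookup` result concrete, here is a small pure-Python sketch that emulates an equality-match `$lookup` on two tiny, hypothetical collections. The function `lookup` and the sample documents are illustrative assumptions and say nothing about how Mongo executes the join internally.

```python
# Hypothetical miniature collections; only the fields needed for the join.
reviews = [
    {"review_id": "r1", "business_id": "b1", "stars": 4},
    {"review_id": "r2", "business_id": "b9", "stars": 2},  # no matching business
]
businesses = [
    {"business_id": "b1", "name": "Cafe A"},
    {"business_id": "b2", "name": "Diner B"},
]

def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Attach to every local document an array of the foreign documents that match it."""
    joined = []
    for doc in local_docs:
        matches = [
            other for other in foreign_docs
            if other.get(foreign_field) == doc.get(local_field)
        ]
        joined.append({**doc, as_field: matches})
    return joined

result = lookup(reviews, businesses, "business_id", "business_id", "business_info")
for doc in result:
    print(doc["review_id"], doc["business_info"])
```

Every review survives the join, and a review without a matching business simply gets an empty `business_info` array. This is what makes the operation a left outer join.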

In [153]:
pipeline = [
	{
		"$lookup":
		{
			"from": "business",
			"localField": "business_id",
			"foreignField": "business_id",
			"as": "business_info"
		}
	}
]
mydb.command('explain', {'aggregate': 'review', 'pipeline': pipeline, 'cursor': {}}, verbosity='executionStats')

{'stages': [{'$cursor': {'query': {},
    'queryPlanner': {'plannerVersion': 1,
     'namespace': 'yelp.review',
     'indexFilterSet': False,
     'parsedQuery': {},
     'winningPlan': {'stage': 'EOF'},
     'rejectedPlans': []},
    'executionStats': {'executionSuccess': True,
     'nReturned': 0,
     'executionTimeMillis': 0,
     'totalKeysExamined': 0,
     'totalDocsExamined': 0,
     'executionStages': {'stage': 'EOF',
      'nReturned': 0,
      'executionTimeMillisEstimate': 0,
      'works': 1,
      'advanced': 0,
      'needTime': 0,
      'needYield': 0,
      'saveState': 1,
      'restoreState': 1,
      'isEOF': 1}}}},
  {'$lookup': {'from': 'business',
    'as': 'business_info',
    'localField': 'business_id',
    'foreignField': 'business_id'}}],
 'serverInfo': {'host': 'd84a02ad403b',
  'port': 27017,
  'version': '4.2.23',
  'gitVersion': 'f4e6602d3a4c5b22e9d8bcf0722d0afd0ec01ea2'},
 'ok': 1.0}

Now write the same left outer join in SQL, which we will run on Postgres.

In [154]:
sql_join_str = """
SELECT *
FROM review_complete r1
JOIN review_complete r2
ON r2.business_id = r1.business_id
"""
!psql "$POSTGRES_URI/yelp" -c "explain analyze $sql_join_str"

                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1282.75..4425.60 rows=31948 width=1144) (actual time=10.792..42.938 rows=38052 loops=1)
   Hash Cond: (r1.business_id = r2.business_id)
   ->  Seq Scan on review_complete r1  (cost=0.00..639.00 rows=7500 width=572) (actual time=0.012..1.300 rows=7500 loops=1)
   ->  Hash  (cost=639.00..639.00 rows=7500 width=572) (actual time=10.684..10.685 rows=7500 loops=1)
         Buckets: 8192  Batches: 2  Memory Usage: 2383kB
         ->  Seq Scan on review_complete r2  (cost=0.00..639.00 rows=7500 width=572) (actual time=0.002..1.295 rows=7500 loops=1)
 Planning Time: 0.913 ms
 Execution Time: 45.330 ms
(8 rows)



For this specific query, which join was faster: 

A. Mongo
B. Postgres 

As a sanity check, one should be at least 2-3 milliseconds slower than the other.

**NOTE**: Your answer should either look like `q2c_part1 = ['A']` or `q2c_part1 = ['B']`

In [157]:
q2c_part1 = ['A']

In [158]:
grader.check("q2ci")

It seems like we have a winner! But wait! Remember that due to space limitations on Jupyter, we joined `business` with a truncated version of the full Yelp `reviews` dataset. Now let's assume we had joined `business` (160585 rows, 125MB) with the full `review` collection (8635403 rows, 7 GB). For this specific query, which join would be faster:

A. Mongo
B. Postgres

Though this might require some research, it is best to try to use your intuition to think about how both Mongo and Postgres store data and how each database would approach joins.
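
As a rough aid to that intuition, the sketch below compares the amount of work done by two generic join strategies on hypothetical data: a nested-loop join, which rescans the other input for every row, and a hash join, the strategy reported in the Postgres plan above. Treating an un-indexed lookup as a nested loop is an assumption made for illustration, not a claim about either system's implementation.

```python
from collections import defaultdict

def nested_loop_join(left, right, key):
    """Compare every left row against every right row."""
    comparisons = 0
    pairs = []
    for l in left:
        for r in right:
            comparisons += 1
            if l[key] == r[key]:
                pairs.append((l, r))
    return pairs, comparisons

def hash_join(left, right, key):
    """Build a hash table on one input, then probe it once per row of the other."""
    buckets = defaultdict(list)
    for r in right:
        buckets[r[key]].append(r)
    probes = 0
    pairs = []
    for l in left:
        probes += 1
        for r in buckets.get(l[key], []):
            pairs.append((l, r))
    return pairs, probes

# Hypothetical data: 200 reviews spread evenly over 50 businesses.
businesses = [{"business_id": i} for i in range(50)]
reviews = [{"review_id": i, "business_id": i % 50} for i in range(200)]

nl_pairs, nl_work = nested_loop_join(reviews, businesses, "business_id")
hj_pairs, hj_work = hash_join(reviews, businesses, "business_id")
print(len(nl_pairs), nl_work)
print(len(hj_pairs), hj_work)
```

Both strategies return the same pairs, but the nested loop's work grows with the product of the input sizes while the hash join's work grows roughly with their sum. That gap matters far more at millions of rows than at a few thousand.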

**NOTE**: Your answer should either look like `q2c_part2 = ['A']` or `q2c_part2 = ['B']`

In [5]:
q2c_part2 = ['B']

In [6]:
grader.check("q2cii")

If your answers to `q2c_part1` and `q2c_part2` were different, explain why. If your answers were the same, explain what gives that database system you chose an advantage over the other.

_Type your answer here, replacing this text._

## Question 3: ETL with Pandas

### Question 3a

So far, we've talked about document databases like Mongo and relational databases like Postgres. Now we will explore ETL in yet another context: dataframes. Dataframes are similar to relations, with some differences that we will dive into here. To that end, we will use Pandas, a Python package for working with dataframes that is widely adopted by data scientists for data loading, wrangling, cleaning, and analysis. To start, let us export our MongoDB collections into Pandas using a function called `json_normalize`. To stay within the memory constraints set by Jupyter, we need to truncate `business` before we can use it. The variable `business_trunc` will hold a reference to `businesses_trunc`, our truncated business collection.
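
Before running the export, it may help to see the flattening idea in isolation. The sketch below is a simplified, pure-Python stand-in for what `json_normalize` does to nested documents: nested keys become dot-separated column names, and fields a document lacks become missing values. The helper `flatten` and the sample documents are hypothetical.

```python
def flatten(doc, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat = {}
    for key, value in doc.items():
        path = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat

# Two hypothetical documents whose nested attributes do not overlap.
docs = [
    {"name": "Cafe A", "attributes": {"WiFi": "free", "HappyHour": "True"}},
    {"name": "Salon B", "attributes": {"ByAppointmentOnly": "True"}},
]

rows = [flatten(d) for d in docs]
# The table needs the union of all keys; absent fields are filled with None.
columns = sorted({col for row in rows for col in row})
table = [{col: row.get(col) for col in columns} for row in rows]

print(columns)
for row in table:
    print(row)
```

Because the column set is the union over all documents, a collection with many optional nested fields produces a wide and sparse table.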

In [11]:
query = {}
business_trunc = mydb["businesses_trunc"]
count = 0
if business_trunc.count_documents({}) != 1000:
    for document in business.find({}):
        count += 1
        business_trunc.insert_one(document)
        if count == 1000:
            break

business_cursor = business_trunc.find(query)
review_cursor = mydb["reviews"].find(query)
user_cursor = mydb["users"].find(query)

# Load the collections into Pandas. 
from pandas import json_normalize
user_df = json_normalize(user_cursor)
review_df = json_normalize(review_cursor)
business_df = json_normalize(business_cursor)

For the rest of Question 3, please use the 3 dataframes we just created: `user_df`, `review_df`, and `business_df`. Let's take a look at the first 5 rows of `business_df`.

In [12]:
business_df.head()

Unnamed: 0,_id,business_id,name,address,city,state,postal_code,latitude,longitude,stars,...,attributes.GoodForDancing,attributes.BestNights,attributes.Music,attributes.BYOB,attributes.CoatCheck,attributes.Smoking,attributes.DriveThru,attributes.BYOBCorkage,attributes.Corkage,attributes.RestaurantsCounterService
0,63bc8fa899903bc4b31f7573,6iYb2HFDywm3zjuRg0shjw,Oskar Blues Taproom,921 Pearl St,Boulder,CO,80302,40.017544,-105.283348,4.0,...,,,,,,,,,,
1,63bc8fa899903bc4b31f7574,tCbdrRPZA0oiIYSmHG3J0w,Flying Elephants at PDX,7000 NE Airport Way,Portland,OR,97218,45.588906,-122.593331,4.0,...,,,,,,,,,,
2,63bc8fa899903bc4b31f7575,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,...,,,,,,,,,,
3,63bc8fa899903bc4b31f7576,oaepsyvc0J17qwi8cfrOWg,Great Clips,2566 Enterprise Rd,Orange City,FL,32763,28.914482,-81.295979,3.0,...,,,,,,,,,,
4,63bc8fa899903bc4b31f7577,PE9uqAjdw0E4-8mjGl3wVA,Crossfit Terminus,1046 Memorial Dr SE,Atlanta,GA,30316,33.747027,-84.353424,4.0,...,,,,,,,,,,


<!-- BEGIN QUESTION -->

What do you notice about how `json_normalize` constructed the columns of `business_df`? Compare and contrast this 
dataframe representation with the document representation we saw with Mongo.

<!--
BEGIN QUESTION
name: q3a
manual: true
points: 1
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 3b

In the previous question, we talked about how Mongo and Postgres approach joins. Pandas can also perform joins using the [`merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function! For this task, perform an inner join of `business_df` with itself on `stars`. Save the final dataframe to a variable called `final_df`; it should contain only 3 columns: `first` (the name of the first restaurant), `second` (the name of the second restaurant), and `stars` (the number of stars).

The columns don't need to be in any particular order.
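
A quick way to sanity-check the size of a self-join is to count rows per key. The sketch below uses a short, hypothetical list of star ratings in plain Python (no Pandas) to show why the joined result is much larger than the input.

```python
from collections import Counter

# Hypothetical star ratings for five businesses.
stars = [4.0, 4.0, 4.5, 3.0, 4.0]

# An inner self-join on `stars` pairs every row with every row sharing its value
# (including itself), so a value that appears n times contributes n * n rows.
counts = Counter(stars)
expected_rows = sum(n * n for n in counts.values())

# Brute-force count of matching pairs, as a cross-check.
brute_force_rows = sum(1 for a in stars for b in stars if a == b)

print(expected_rows, brute_force_rows)
```

The same arithmetic applies to `business_df`: summing the squared frequency of each `stars` value should give the number of rows in the self-join.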

In [43]:
# Don't forget to name your dataframe final_df!

# YOUR ANSWER HERE
final_df = (business_df
            .merge(business_df, how='inner', on='stars')
            .loc[:, ['name_x', 'name_y', 'stars']]
            .rename(columns={'name_x': 'first', 'name_y': 'second'})
)
final_df.head()

Unnamed: 0,first,second,stars
0,Oskar Blues Taproom,Oskar Blues Taproom,4.0
1,Oskar Blues Taproom,Flying Elephants at PDX,4.0
2,Oskar Blues Taproom,Crossfit Terminus,4.0
3,Oskar Blues Taproom,Capital City Barber Shop,4.0
4,Oskar Blues Taproom,Star Kreations Salon and Spa,4.0


In [47]:
# Do not delete/edit this cell
pickle.dump(len(final_df), open("results/result_3b_1.p","wb"))
pickle.dump(list(final_df.columns), open("results/result_3b_2.p","wb"))

In [48]:
final_df_q3b = pickle.load(open("results/result_3b_1.p", "rb" ))
final_df_q3b

153964

In [51]:
grader.check("q3b")

### Question 3c

Due to the nested representation of the data, the Pandas dataframes contain many missing fields with NaN values, as you may have noticed in 3a. Construct a dataframe `missing_value_df` with two columns: `column_name` and `percent_missing`. `percent_missing` should be the percentage of NaN values in the corresponding column (HINT: use Pandas' [`isnull`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) function for this). For example, `missing_value_df.loc[missing_value_df['column_name'] == '_id']["percent_missing"][0] == 0.0` should evaluate to `True`. Plot a histogram of the distribution of the percentage of NaN values across all columns (via Pandas' [`hist()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) function). How many of the columns are entirely null?

In [70]:
# Remember to name your final dataframe missing_value_df!

# YOUR ANSWER HERE
missing_value_df = (pd.DataFrame(business_df.isnull().sum() / len(business_df))
                    .reset_index()
                    .rename(columns={'index': 'column_name', 0: 'percent_missing'})
                   )
missing_value_df.head()

Unnamed: 0,column_name,percent_missing
0,_id,0.0
1,business_id,0.0
2,name,0.0
3,address,0.0
4,city,0.0


In [72]:
# Do not delete/edit this cell
missing_value_df.to_csv('results/result_3c.csv', index=False)

In [73]:
grader.check("q3ci")

How many columns are 90%+ NaN? Enter your answer in `q3cii_str` as a string (e.g. if your answer is 6, then `q3cii_str = "6"`).

<!--
BEGIN QUESTION
name: q3cii
points: 1
-->

In [77]:
missing_value_df[missing_value_df['percent_missing'] > .9].shape

(16, 2)

In [78]:
q3cii_str = '16'

In [79]:
grader.check("q3cii")

### Question 3d

Let us now alter `business_df` to exclude the columns with 80%+ null values. Such sparse columns likely correspond to attributes that are not an important factor for most businesses, so we can drop them from `business_df`. Create a new dataframe called `important_attribute_business_df` which contains only the remaining columns. It might be useful to use `missing_value_df` from the previous subpart!

In [90]:
important_attributes = missing_value_df[missing_value_df['percent_missing'] < .8]['column_name'].values
important_attribute_business_df = business_df[important_attributes]

In [91]:
# Do not delete/edit this cell
important_attribute_business_df.to_csv('results/result_3d.csv', index=False)

In [99]:
grader.check("q3d")

### Question 3e

At this point, you have had experience manipulating data in Mongo, Postgres, and Pandas. In this question, we provide 3 scenarios. Using the lessons you've learned so far, specify which of the three (Mongo, Postgres, or Pandas) would work best for each use case.

1. You are doing a data journalism piece on college sports. You collect a list of colleges and for each collegiate sport program within that college, you find the budget assigned for that program. You have a choice between the following:

    A) Representing this data in JSON (e.g. 
    ```
    {
        "UC Berkeley": {
            "football": "10000000", 
            "wrestling": "344582", 
            ...}
    }
    ```
    ) and importing into Mongo.
    
    B) Representing this data as a schema in Postgres where the columns are the names of the sports.
    
    C) Representing this data as a dataframe in Pandas where the columns are the names of the sports.

You would like to compute aggregates of the budgets across different sports (average, sum, median, mode). What would be the best option for storing this data?

**NOTE**: Your answer should look like `q3ei_str = ['A']` or `q3ei_str = ['B']` or `q3ei_str = ['C']`

<!--
BEGIN QUESTION
name: q3ei
points: 1
-->

In [104]:
q3ei_str = ['B']

In [105]:
grader.check("q3ei")

2. You would now like to investigate what effect budget has on student-athlete scholarships. After doing some research, you find a dataset that lists every athlete at every college along with their sport and scholarship level (this is a massive 10GB+ dataset with millions of rows). You find another dataset that lists colleges, their sports programs, and each program's budget. This is another massive dataset with hundreds of thousands of rows. You would like to perform an inner join between the two datasets on school and program so you can view each student-athlete's scholarship alongside their sport's budget. You have a choice between the following:

    A) Representing each dataset in JSON (e.g. 
    ```
    {"athletes": [
        {"Chase Garbers": {
            "school": "UC Berkeley", 
            "scholarship": "full", 
            "sport": "football", 
            ...
            }
        }, 
        ...
    ]}
    ```
    and 
    ```
    {"schools": [
        {"UC Berkeley": {
            "football": {
                "budget": "10000000"
             }, 
             ...
             }
        }, 
        ...
     ]}
     ```
    ), importing into Mongo, and doing a join there.
    
    B) Representing this data as 2 schemas in Postgres where the columns for the first schema are 
    [`student_name`, `school`, `sport`, `scholarship`] and for the second [`school`, `sport`, `budget`].
    
    C) Representing this data as 2 dataframes in Pandas with the same columns as Postgres.

What would be the best option for storing this data?

**NOTE**: Your answer should look like `q3eii_str = ['A']` or `q3eii_str = ['B']` or `q3eii_str = ['C']`

<!--
BEGIN QUESTION
name: q3eii
points: 1
-->

In [106]:
q3eii_str = ['B']

In [107]:
grader.check("q3eii")

3. Finally, you are ready to start writing your article! You decide to focus on just the data from UC Berkeley. You have access to a dataset of just UC Berkeley athletes along with their sports and scholarship levels. The scholarship level data was improperly cleaned: some scholarships are recorded as the strings "full", "half", or "none", and some are recorded as integer percentages 0-100. You would like to provide this data to your readers in a format that lends itself to easy visualization: e.g. graphs that show how many athletes have a full vs. half vs. no scholarship, which sports have the highest percentages of athletes with full scholarships, etc. What is the best way to store this data for this purpose?

    A) Represent the dataset in JSON e.g.
    ```
    {"athletes": [
        {
           "Chase Garbers": {
             "scholarship": "full", 
             "sport": "football"
           }
        },
        {
            "Danielle Vosk": {
              "scholarship": 25,
              "sport": "basketball"
            }
        },
        ...
        ]
    }
    ```
    B) Represent this data as a schema in Postgres where the columns are [`student_name`, `sport`, `scholarship`]
    
    C) Represent this data as a dataframe in Pandas with the same columns as Postgres.
    
**NOTE**: Your answer should look like `q3eiii_str = ['A']` or `q3eiii_str = ['B']` or `q3eiii_str = ['C']`

<!--
BEGIN QUESTION
name: q3eiii
points: 1
-->

In [108]:
q3eiii_str = ['C']

In [109]:
grader.check("q3eiii")

## Question 4: Messy JSON

### Question 4a
Many of the queries you've seen or written thus far were relatively reliable: aggregating and collecting over fields
that you know exist for sure. But the nature of Mongo documents is that they are inherently flexible and semi-structured. Not every document will share every single field! In this question, we will explore how Mongo handles these use cases.

For this question and the rest of Q4, please use the `businesses` collection (this should be stored in the `business` Python variable).

Imagine you are in charge of managing your family reunion. You would like to book a private room at a restaurant.
However, you would also like to optimize for chaos. You notice that there is an attribute called `RestaurantsGoodForGroups`. You would like to write a query that returns all restaurants that **do not** have the `RestaurantsGoodForGroups` attribute so that the trajectory of the reunion is determined by fate. Your output for the autograder will be the number of restaurants that do not have the `RestaurantsGoodForGroups` attribute stored in `q4a_str`. 

**NOTE: You would like this list to consist solely of restaurants. This means that the business must have `Restaurants` in the `categories` field.**
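
Keep in mind that a field that is absent is not the same as a field that is present with the value `'False'`. The sketch below shows one way to express "the nested field does not exist" as a `$match` stage using Mongo's `$exists` operator, together with a local Python stand-in so the idea can be checked without a database. The variable and function names and the sample documents are hypothetical, and this is an illustration rather than the full answer (it does not filter on `categories`).

```python
# A $match stage that keeps documents lacking the nested field entirely.
missing_attribute_stage = {
    "$match": {"attributes.RestaurantsGoodForGroups": {"$exists": False}}
}

def lacks_attribute(doc, name="RestaurantsGoodForGroups"):
    """Local stand-in for the stage above, for documents held in Python."""
    attributes = doc.get("attributes") or {}
    return name not in attributes

sample = [
    {"name": "A", "attributes": {"RestaurantsGoodForGroups": "False"}},  # present
    {"name": "B", "attributes": {"WiFi": "free"}},                       # absent
    {"name": "C", "attributes": None},                                   # no attributes at all
]

absent = [d["name"] for d in sample if lacks_attribute(d)]
print(absent)
```

Business `A` is excluded because the attribute exists, even though its value is `'False'`.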

In [114]:
list(business.aggregate([{"$limit": 1}]))

[{'_id': ObjectId('63bc8fa899903bc4b31f7573'),
  'business_id': '6iYb2HFDywm3zjuRg0shjw',
  'name': 'Oskar Blues Taproom',
  'address': '921 Pearl St',
  'city': 'Boulder',
  'state': 'CO',
  'postal_code': '80302',
  'latitude': 40.0175444,
  'longitude': -105.2833481,
  'stars': 4.0,
  'review_count': 86,
  'is_open': 1,
  'attributes': {'RestaurantsTableService': 'True',
   'WiFi': "u'free'",
   'BikeParking': 'True',
   'BusinessParking': "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}",
   'BusinessAcceptsCreditCards': 'True',
   'RestaurantsReservations': 'False',
   'WheelchairAccessible': 'True',
   'Caters': 'True',
   'OutdoorSeating': 'True',
   'RestaurantsGoodForGroups': 'True',
   'HappyHour': 'True',
   'BusinessAcceptsBitcoin': 'False',
   'RestaurantsPriceRange2': '2',
   'Ambience': "{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casua

In [None]:
from pymongo import TEXT

# The following text index may be useful!
if 'categories_text' not in business.index_information():
    business.create_index([('categories', TEXT)])

In [7]:
# YOUR ANSWER HERE
pipeline = [
    {
        "$match": {
            "$text": {"$search": "Restaurants"}
        }
    },
    {
        "$match": {
            "attributes.RestaurantsGoodForGroups": {"$eq": 'False'}
        }
    },
]

res = list(business.aggregate(pipeline))
print(len(res))
pprint.pprint(res[0])

6658
{'_id': ObjectId('63bc8fcb99903bc4b3203fdb'),
 'address': '2179 Lawrenceville Hwy',
 'attributes': {'Ambience': "{'romantic': False, 'intimate': False, "
                            "'touristy': False, 'hipster': False, 'divey': "
                            "False, 'classy': False, 'trendy': False, "
                            "'upscale': False, 'casual': False}",
                'BikeParking': 'False',
                'BusinessAcceptsCreditCards': 'True',
                'BusinessParking': "{'garage': False, 'street': False, "
                                   "'validated': False, 'lot': False, 'valet': "
                                   'False}',
                'Caters': 'False',
                'GoodForKids': 'True',
                'HasTV': 'True',
                'NoiseLevel': "u'quiet'",
                'OutdoorSeating': 'False',
                'RestaurantsAttire': "'casual'",
                'RestaurantsDelivery': 'True',
                'RestaurantsGoodForGroups': '

How many restaurants do not have the `RestaurantsGoodForGroups` attribute? You may either compute this as a function of your query result or hardcode either the string or the numeric version of the answer you computed.

In [8]:
q4a_str = 6658

In [9]:
# Do not delete/edit this cell
pickle.dump(q4a_str, open("results/result_4a.p","wb"))

In [10]:
grader.check("q4a")

### Question 4b

Your relatives inform you that they would like to be at the restaurant when it opens to beat the crowds. Furthermore, after sending out a when2meet, you learn that most of your relatives would prefer the meal to be on a Friday, with a start time between 5-6:59PM (17:00-18:59). Find the number of restaurants that open on Fridays between 17:00-18:59 and store this in a variable named `q4b_str`. As a reminder, in order for a business to be a restaurant, it must have `Restaurants` in its categories. Be aware that `hours` can either be an array or `None`!

**HINT**: It will be advantageous to use the `aggregate()` Pymongo function along with the `$addFields` (adds a new field to documents) and `$match` (filters out documents based on a condition) stage operators. You may also want to use the `$split` operator (similar to Python's string `split()` function) to parse out the Friday hours.
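
The parsing logic behind the hint can be prototyped in plain Python before translating it into aggregation stages. This sketch assumes that `hours`, when present, maps a day name to a string such as `'17:0-22:0'` (the format that appears in this notebook's sample output). The function names and sample values are hypothetical.

```python
def opening_hour(hours, day="Friday"):
    """Return the opening hour for `day`, or None when no hours are recorded."""
    if not hours or day not in hours:
        return None
    open_part = hours[day].split("-")[0]   # "17:0-22:0" -> "17:0"
    return int(open_part.split(":")[0])    # "17:0" -> 17

def opens_in_window(hours, start=17, end=18):
    """True when the business opens at an hour in [start, end], i.e. 17:00-18:59."""
    hour = opening_hour(hours)
    return hour is not None and start <= hour <= end

sample = [
    {"Friday": "17:0-22:0"},    # opens at 17:00, inside the window
    {"Friday": "10:0-0:0"},     # opens at 10:00, too early
    {"Monday": "18:30-23:0"},   # no Friday hours
    None,                       # no hours recorded
]
flags = [opens_in_window(h) for h in sample]
print(flags)
```

Each Python step has a natural counterpart in a pipeline: the two `split` calls correspond to `$split`, and the final range check corresponds to a `$match` on the derived field.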

In [79]:
# BEGIN QUERY HERE
pipeline = [
    {
        "$match": {
            "$text": {"$search": "Restaurants"},
            "attributes.RestaurantsGoodForGroups": {"$eq": 'False'},
        }
    },
    {
        "$project": {
            "hours.Friday": {"$ifNull": ["$hours.Friday", False]}
        }
    },
    {
        "$match": {
            "hours.Friday": {"$ne": False}
        }
    },
    {
        "$project": {
            "hours.Friday": 1,
            "openString": {"$arrayElemAt":[{"$split": ["$hours.Friday" , "-"]}, 0]},
            "closeString": {"$arrayElemAt":[{"$split": ["$hours.Friday" , "-"]}, 1]},
        }
    },
    {
        "$project": {
            "hours.Friday": 1,
            "openHour": {"$toInt": {"$arrayElemAt":[{"$split": ["$openString" , ":"]}, 0]}},
            "closeHour": {"$toInt": {"$arrayElemAt":[{"$split": ["$closeString" , ":"]}, 0]}},
        }
    },
    {
        "$project": {
            "hours.Friday": 1,
            "reservable": {
                "$cond": { "if": { "$lt": [ "$openHour", 17 ] }, "then": {"$and": [{"$ne": [ "$closeHour", 19 ]}]}, "else": False }
            }
        }
    },
    {
        "$match": {
            "reservable": {"$eq": True}
        }
    }
]

res = list(business.aggregate(pipeline))
print(len(res))
pprint.pprint(res[0])

4686
{'_id': ObjectId('63bc8ffc99903bc4b321a059'),
 'hours': {'Friday': '10:0-0:0'},
 'reservable': True}


How many restaurants open on Fridays between 17:00-18:59?

In [80]:
q4b_str = 4686

In [81]:
# Do not delete/edit this cell
pickle.dump(q4b_str, open("results/result_4b.p","wb"))

In [82]:
grader.check("q4b")

### Question 4c

Some members of your family are vegetarian so you would like to only eat at restaurants with the Vegetarian category. 
However, the `categories` are stored as a single string! You would like to make it easy to access Vegetarian as a separate field. Write a query that does the following: for every category in `categories`, add a new document that contains the category as a field with the value `'true'`, the `ObjectId` for the previous document (labelled `_id`), and the name of the business (labelled `name`).

For example, a document 
```
{
    "_id": ObjectId('606ffb0123cf2e5079dbd91f'), 
    "name": "Wendy's", 
    ..., 
    "categories": "Salad, Vegetarian"
} 
```
would become 
```
{
    "Salad": "true", 
    "_id": ObjectId('606ffb0123cf2e5079dbd91f'), 
    "name": "Wendy's"
}
```
and 
```
{
    "Vegetarian": "true", 
    "_id": ObjectId('606ffb0123cf2e5079dbd91f'), 
    "name": "Wendy's"
}
```

We will save your pipeline to a variable called `q4c_pipeline`. **Do not change this variable name or you won't pass the autograder!** Your query should be of the form `business.aggregate(q4c_pipeline)`, so to get full credit you must store your pipeline in the variable `q4c_pipeline`.
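
The target transformation can be stated precisely in a few lines of plain Python, which may help when checking a pipeline's output. This is a sketch of the expected result only, not a Mongo pipeline. The function name is hypothetical, and the `_id` is written as a plain string instead of an `ObjectId` to keep the example self-contained.

```python
def explode_categories(doc):
    """Return one small document per category in the comma-separated string."""
    categories = doc.get("categories") or ""
    return [
        {category: "true", "_id": doc["_id"], "name": doc["name"]}
        for category in categories.split(", ")
        if category
    ]

wendys = {
    "_id": "606ffb0123cf2e5079dbd91f",
    "name": "Wendy's",
    "categories": "Salad, Vegetarian",
}
exploded = explode_categories(wendys)
for doc in exploded:
    print(doc)
```

A business with no `categories` value yields no output documents under this sketch.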

In [96]:
pipeline = [
    {
        "$project": {
            "_id": 1,
            "name": 1,
            "category": {"$split": ["$categories" , ", "]},
        }
    },
    {
        "$unwind": "$category"
    },
    {
        "$addFields": {
            "array": [{
                "k": "$category",
                "v": "true"
            }]
        }
    },
    {
        "$replaceRoot": {
            "newRoot": {
                "$mergeObjects": [
                    { "$arrayToObject": "$array" },
                    "$$ROOT",
                ]
            }
        }
    },
    {
        "$unset": ['array', 'category']
    }
]

res = list(business.aggregate(pipeline))
print(len(res))
pprint.pprint(res[0])

708968
{'Gastropubs': 'true',
 '_id': ObjectId('63bc8fa899903bc4b31f7573'),
 'name': 'Oskar Blues Taproom'}


In [97]:
q4c_pipeline = pipeline

cur = business.aggregate(q4c_pipeline)

In [98]:
# Do not delete/edit this cell
myclient = pymongo.MongoClient("mongodb://localhost")
mydb = myclient["yelp"]
business = mydb["businesses"]
pipeline_for_test_4c = q4c_pipeline[:]
pipeline_for_test_4c.extend([{"$match": {"name": "Everything POP Shopping & Dining"}}, {"$project": {"_id": 0}}])
cur_test_4c = business.aggregate(pipeline_for_test_4c)
pickle.dump(list(cur_test_4c), open("results/result_4c.p","wb"))

pipeline_for_test_4c_2 = q4c_pipeline[:]
pipeline_for_test_4c_2.extend([{"$match": {"name": "Longwood Galleria"}}, {"$project": {"_id": 0}}])
cur_test_4c_2 = business.aggregate(pipeline_for_test_4c_2)
pickle.dump(list(cur_test_4c_2), open("results/result_4c_2.p","wb"))

In [99]:
grader.check("q4c")

### Question 4d
This change in representation has made it super easy to view all the vegetarian restaurants and count them without the use of an index since
we can now simply filter by whether or not 'Vegetarian' is a field in our document! We have provided some code here to
count how many vegetarian restaurants are in our dataset. Simply provide the actual number, in either string or numeric format, to get points for this question.

In [100]:
# Question 4d
myclient = pymongo.MongoClient("mongodb://localhost")
mydb = myclient["yelp"]
business = mydb["businesses"]
pipeline_for_4d = q4c_pipeline[:]
pipeline_for_4d.append({"$match": {"Vegetarian": 'true'}})
cur_for_4d = business.aggregate(pipeline_for_4d)

In [103]:
veg_count = len(list(cur_for_4d))

In [104]:
# Do not delete/edit this cell
pickle.dump(veg_count, open("results/result_4d.p","wb"))

In [105]:
grader.check("q4d")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()

## Results

Your Gradescope submission will require a results.zip file. Run the following cell to generate this file.

In [None]:
!zip -r results.zip results