In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw04.ipynb")

# Homework 04: Food Safety (Continued)

You must submit this assignment to Gradescope by the on-time deadline. **We strongly encourage you to plan to submit your work to Gradescope at least several hours, and ideally days, before the stated deadline.** This way, you will have ample time to reach out to staff for support if you encounter difficulties with submission. While course staff is happy to help guide you with submitting your assignment ahead of the deadline, we will not respond to last-minute requests for assistance (TAs need to sleep, after all!).

**Please read the instructions carefully when you are submitting your work to Gradescope**.




## This Assignment

In this homework, we will continue our exploration of restaurant food safety scores for restaurants in San Francisco. The main goal for this assignment is to focus more on the analysis of the dataset, building on the data cleaning we have done earlier in HW 03. 

**Ethical Note:**  
- This dataset contains publicly available health inspection scores for restaurants in San Francisco.  
- While the data is used here for educational purposes, please interpret findings responsibly and avoid drawing unfair conclusions about specific businesses.




## Before You Start

For each question in the assignment, please write down your answer in the answer cell(s) right below the question. 

We understand that it is helpful to have extra cells breaking down the process towards reaching your final answer. If you create new cells to run code, place them below your answer cell; **NEVER** add cells between a question cell and the answer cell below it. Doing so will cause errors when we run the autograder, and it will sometimes cause a failure to generate the PDF file.

**Important note: The local autograder tests will not be comprehensive. You can pass the automated tests in your notebook but still fail tests in the autograder.** Please be sure to check your results carefully.

Finally, unless we state otherwise, **do not use for loops or list comprehensions**. The majority of this assignment can be done using built-in commands in `pandas` and `NumPy`.  Our autograder isn't smart enough to check, but you're depriving yourself of key learning objectives if you write loops / comprehensions, and you also won't be ready for the midterm.
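
As a rough illustration of what "built-in commands" means here, the sketch below contrasts a list comprehension with the equivalent column-wise operation. The `toy` `DataFrame` and its `rating` column are made up for this example and are not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Toy data; the column name is hypothetical and only used for illustration.
toy = pd.DataFrame({"rating": [72, 95, 88, 100]})

# Comprehension-style version (the pattern the assignment asks you to avoid).
flags_loop = [r >= 90 for r in toy["rating"]]

# Vectorized version: the comparison is applied to the whole column at once
# and returns a boolean Series.
flags_vectorized = toy["rating"] >= 90

# NumPy functions also operate directly on an entire column.
mean_rating = np.mean(toy["rating"])
```

Both versions produce the same flags, but the vectorized one stays in `pandas`, so its result can be used directly as a boolean mask.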

### Debugging Guide
If you run into any technical issues, we highly recommend checking out the [Debugging Guide](https://mtu.instructure.com/courses/1571598/pages/debugging-guide). This guide answers common questions about Jupyter notebooks / Datahub, Gradescope, and common `pandas` errors.

In [None]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.style.use('fivethirtyeight')

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

In HW03, we took you through the entire process of reading data from a file and performing some exploration of the data. Here, we again load the dataset that we will be using in HW04, along with some of the columns we added in HW03. For any additional context regarding the dataset, feel free to revisit HW03.

In [None]:
bus = pd.read_csv('data/bus.csv', encoding='ISO-8859-1').rename(columns={"business id column": "bid"})
bus['postal5'] = bus['postal_code'].str[:5]
ins = pd.read_csv('data/ins.csv')
ins['timestamp'] = pd.to_datetime(ins['date'], format='%m/%d/%Y %I:%M:%S %p')
ins['bid'] = ins['iid'].str.split("_", expand=True)[0].astype(int) 

ins = ins[ins["score"] > 0]

In [None]:
bus.head()

We also join the `ins` `DataFrame` with the appropriate portion of the `bus` `DataFrame` as was done in HW03.
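
For reference, here is a minimal sketch of how `pd.merge` behaves when no `on=` argument is given. The `left` and `right` frames and their column names are toy data invented for this example.

```python
import pandas as pd

# Toy frames; "key" is the only column name they share.
left = pd.DataFrame({"key": [1, 1, 2, 3], "rating": [90, 85, 70, 60]})
right = pd.DataFrame({"key": [1, 2, 4], "label": ["A", "B", "D"]})

# With no `on=` argument, merge joins on the columns both frames share ("key").
# The default how="inner" keeps only keys that appear in both frames,
# so key 3 (left only) and key 4 (right only) are dropped.
merged = pd.merge(left, right)
```

Because the join is an inner join, rows without a match on the shared column do not appear in the result.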

In [None]:
ins_named = pd.merge(ins, bus[['bid', 'name', 'address']])
ins_named.head()

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

# Part 1: Let Them Eat Cake

Now that you've analyzed and found out which restaurants to avoid in SF (HW03), we can turn toward the more interesting question of what dessert places are the best! For the purposes of this question, we assume that cake is the best dessert. 

<br/>

--- 

## Question 1.1

In your quest to find the best cake shop, the first step is to find all the businesses in `ins_named` that **contain the word 'cake'** in their `name`, and assign the resulting `DataFrame` to `cake_shops`. To help you out, we created the `lowercase_name` column so you do not need to worry about checking for capitalized letters when checking if `name` contains `'cake'`.

**Hint:** You might find the `.str` accessors useful yet again!
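
As a reminder of how the `.str` accessor works, the sketch below filters a toy `Series` by substring. The `fruit` data is made up for illustration. By default `.str.contains` treats its pattern as a regular expression, which makes no difference for a plain word like the one used here.

```python
import pandas as pd

# Toy data, unrelated to the assignment's restaurant names.
fruit = pd.Series(["green apple", "banana", "pineapple juice", "cherry"])

# .str.contains returns a boolean Series: True where the substring appears.
has_apple = fruit.str.contains("apple")

# A boolean Series can be used as a mask to keep only the matching entries.
apple_items = fruit[has_apple]
```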

In [None]:
ins_named['lowercase_name'] = ins_named['name'].str.lower()
cake_shops = ...
cake_shops.head(5)

In [None]:
grader.check("q11")

<!-- BEGIN QUESTION -->

<br/>

--- 

## Question 1.2

Assign `cake_at_least_3` to a `DataFrame` consisting of only those cake shops that have had at least 3 inspections. Remember, the `bid` uniquely defines a cake shop, not its `name`!
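
One possible pattern for keeping only the groups with enough rows is `groupby(...).filter(...)`. It is sketched below on toy data; the `visits` frame and its `shop_id` and `rating` columns are hypothetical, and other approaches work equally well.

```python
import pandas as pd

# Toy data: shop 10 has 3 rows, shop 20 has 2 rows, shop 30 has 4 rows.
visits = pd.DataFrame({
    "shop_id": [10, 10, 10, 20, 20, 30, 30, 30, 30],
    "rating":  [90, 92, 88, 75, 80, 60, 65, 70, 72],
})

# filter() calls the function once per group and keeps every row of the
# groups for which it returns True.
frequent = visits.groupby("shop_id").filter(lambda group: len(group) >= 3)
```

Note that the result still has one row per visit; `filter` removes whole groups rather than aggregating them.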

In [None]:
...
cake_at_least_3 = ...
cake_at_least_3.head()

In [None]:
grader.check("q12")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br/>

--- 
## Question 1.3

In the cell below, run the following line of code:  
`q1c_df = cake_at_least_3.sort_values('timestamp').groupby('bid').agg('first')`

Is the granularity of `cake_at_least_3` the same as the granularity of `q1c_df`? In other words, what does a single row of `q1c_df` represent, and what does a single row in `cake_at_least_3` represent? Explain the granularity of each `DataFrame`. Your answer does not need to be more than 2-3 lines, but you should be specific. 

**Note**: For more details on what the granularity of a `DataFrame` means, feel free to check [Section 8.6](https://learningds.org/ch/08/files_granularity.html) of the LDS book! 


In [None]:
q1c_df = cake_at_least_3.sort_values('timestamp').groupby('bid').agg('first')
q1c_df.head()

*Type your answer here, replacing this text*

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br/>

--- 
## Question 1.4

Rather than the inspection scores, you find that the number of vowels present in the business `name` is a better indicator of how good the cake is when it comes to the shops in `cake_at_least_3`. Using the helper function `count_vowels` we have defined for you, sort all the cake shops in `cake_at_least_3` based on the number of vowels in the business's name in descending order. Then, return a **Python `list`** consisting of the top 3 **uniquely named** cake shops using this sorted `DataFrame`.  You do not need to stick to the skeleton code provided, but you are **not allowed to add any new columns!**

This is pretty challenging, but rest assured, the price of knowing the best cake shops is well worth it! 

**Hint**: When working on this problem, it might be helpful to check out Lecture 04 - Pandas III on custom sorts.  
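
As a refresher on custom sorts, the sketch below uses the `key` parameter of `sort_values` on toy data. The `codes` frame and the `count_digits` helper are hypothetical and exist only for this example.

```python
import pandas as pd

def count_digits(text):
    # Hypothetical helper for this example: number of digit characters in a string.
    return sum(ch.isdigit() for ch in text)

# Toy data, unrelated to the assignment.
codes = pd.DataFrame({"code": ["a1", "b222", "c33", "d"]})

# `key` receives the whole column as a Series and must return a Series of the
# same length; rows are then ordered by those returned values.
sorted_codes = codes.sort_values(
    "code",
    key=lambda col: col.map(count_digits),
    ascending=False,
)
```

The sort happens on the values returned by `key`, so no extra column is added to the `DataFrame`.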

In [None]:
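def count_vowels(name):
    """Return the number of vowels (upper- or lowercase) in the string `name`."""
    vowels = 'aeiouAEIOU'
    # Each membership test is a bool; summing counts the True values.
    return sum(letter in vowels for letter in name)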

In [None]:
sorted_by_vowel_count = ...
top_3_cake = ...

top_3_cake

In [None]:
grader.check("q14")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br/>

---
## Question 1.5

Finally, to examine the different parts of a chained `pandas` statement, first describe in words the purpose of each of the functions used (`.loc`, `.groupby`, `.idxmax()`) in the statement below.

Second, share what you think this line of code accomplishes. In other words, write a question that could be answered using this statement.

While the first part of this question will be graded for correctness, the second part of this question is a bit more open-ended. Answers demonstrating your understanding will get full credit.

In [None]:
cake_at_least_3.loc[cake_at_least_3.groupby("bid")["score"].idxmax()].head()

*Replace text here, with your answer*

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br/><br/>

--- 

## Question 2

As a final challenge, we consider a scenario involving restaurants with multiple ratings over time.

Let's see which restaurant location has had the most extreme improvement in its scores. Let the "swing" of a restaurant location be defined as the difference between its highest-ever and lowest-ever score. **Only consider restaurant locations with at least 3 scores—that is, restaurants that were rated at least 3 times.** Assign `max_swing` to the name of the restaurant that has the maximum swing. 

We have not provided any skeleton, as there are many paths to getting the correct answer. The recommended approach to solving this problem is to break it down into smaller chunks (e.g., first, ensure all restaurants have at least 3 scores; second, compute the swing, etc.). This will likely require more than one line, so feel free to add/remove columns and define new temporary variables. Remember to assign your solution - a string containing the `name` of the restaurant location that experienced the most extreme improvement - to `max_swing` after you do so. 

**Note**: The "swing" is of a specific restaurant location. There might be some restaurants with multiple locations; we are focusing on the swing of a particular restaurant as specified by its `name` and `address`.
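
To make the definition of "swing" concrete, here is a small sketch on toy data that groups by two columns and subtracts the per-group minimum from the maximum. The `ratings` frame and its `shop`, `branch`, and `rating` columns are invented for illustration, and this is only one of several possible ways to compute the difference.

```python
import pandas as pd

# Toy data: two locations, each identified by a (shop, branch) pair.
ratings = pd.DataFrame({
    "shop":   ["X", "X", "X", "Y", "Y", "Y"],
    "branch": ["north", "north", "north", "south", "south", "south"],
    "rating": [70, 95, 80, 88, 90, 85],
})

# Grouping by both columns treats each (shop, branch) pair as one location.
extremes = ratings.groupby(["shop", "branch"])["rating"].agg(["min", "max"])

# Swing = highest rating minus lowest rating for each location.
swing = extremes["max"] - extremes["min"]
```

Here location `("X", "north")` has ratings 70, 95, and 80, so its swing is 95 - 70 = 25, while `("Y", "south")` has a swing of 90 - 85 = 5.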

In [None]:
...

max_swing = ...

<!-- END QUESTION -->

## Congratulations! You have finished Homework 04! ##

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers.  If you run into any issues when running this cell, feel free to check the [Debugging Guide](https://mtu.instructure.com/courses/1571598/pages/debugging-guide).


### Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**


In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)