In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw02B.ipynb")

# 🕵️ Homework 2B: Food Safety (Continued)

## Due Date: Thursday, February 13, 11:59 PM
You must submit this assignment to Gradescope by the on-time deadline, Thursday, February 13, 11:59 PM. Please read the syllabus for the Slip Day policy. No late submissions beyond what is outlined in the Slip Day policy will be accepted. **We strongly encourage you to plan to submit your work to Gradescope several hours before the stated deadline.** This way, you will have ample time to reach out to staff for support if you encounter difficulties with submission. While course staff is happy to help guide you with submitting your assignment ahead of the deadline, we will not respond to last-minute requests for assistance (TAs need to sleep, after all!).

Please read the instructions carefully when you are submitting your work to Gradescope.

## 👥 Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** below.

**Collaborators**: *list collaborators here*


## 📝 This Assignment

In this homework, we will continue our exploration of restaurant food safety scores for restaurants in San Francisco. The main goal for this assignment is to focus more on the analysis of the dataset, building on the data cleaning we have done earlier in HW 2A. 


After this homework, you should be comfortable with:
* Reading `pandas` documentation and using `pandas` methods,
* Working with data at different levels of granularity,
* Using `groupby` with different aggregation functions, and
* Chaining different `pandas` functions and methods to find answers to exploratory questions.


## Score Breakdown 
Question | Manual | Points
--- | --- | ---
1a | no | 2
1b | no | 3
1c | no | 3
2a | no | 2
2b | no | 3
2c | no | 3
3a | yes | 4
3b | yes | 4
4a | no | 2
4b | no | 2
4c | no | 3
4d | no | 3
4e | yes | 1
Total | 3 | 35


## 🏎️ Before You Start

For each question in the assignment, please write down your answer in the answer cell(s) right below the question. 

We understand that it is helpful to have extra cells breaking down the process towards reaching your final answer. If you happen to create new cells below your answer to run code, **NEVER** add cells between a question cell and the answer cell below it. It will cause errors when we run the autograder, and it will sometimes cause a failure to generate the PDF file.

**Important note: The local autograder tests will not be comprehensive. You can pass the automated tests in your notebook but still fail tests in the autograder.** Please be sure to check your results carefully.

Finally, unless we state otherwise, **do not use for loops or list comprehensions**. The majority of this assignment can be done using built-in commands in `pandas` and `NumPy`. Our autograder isn't smart enough to check, but you're depriving yourself of key learning objectives if you write loops / comprehensions, and you also won't be ready for the midterm.

### 🐛 Debugging Guide
If you run into any technical issues, we highly recommend checking out the [Data 100 Debugging Guide](https://ds100.org/debugging-guide/). In this guide, you can find general questions about Jupyter notebooks / Datahub, Gradescope, and common `pandas` errors.

In [None]:
import numpy as np
import pandas as pd

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

In HW 2A, we took you through the entire process of reading data from a file to perform some exploration of the data. Here, we again load the dataset that we will be using in HW 2B along with some of the columns we had added in HW 2A. For any additional context regarding the dataset, we encourage you to revisit HW 2A.

In [None]:
bus = pd.read_csv('data/bus.csv', encoding='ISO-8859-1').rename(columns={"business id column": "bid"})
bus['postal5'] = bus['postal_code'].str[:5]
ins = pd.read_csv('data/ins.csv')
ins['timestamp'] = pd.to_datetime(ins['date'], format='%m/%d/%Y %I:%M:%S %p')
ins['bid'] = ins['iid'].str.split("_", expand=True)[0].astype(int) 

# This code is essential for the autograder to function properly. Do not edit.
ins_test = ins

<br/>

---

# 🔎 Question 1: Inspecting the Inspections

## 🚀 Question 1a

Let's start by looking again at the first 5 rows of `ins` to see what we're working with.

In [None]:
ins.head(5)

To better understand how the scores have been allocated, let's examine how the maximum score varies for each type of inspection. 

Create a `DataFrame` object `ins_score_by_type`, indexed by all the inspection types (e.g., New Construction, Routine - Unscheduled, etc.), with a single column named `max_score` containing the highest score received. Additionally, order `ins_score_by_type` by `max_score` in descending order. 

**Hint:** You may find the `rename` method ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)) useful!
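As a rough sketch of the pattern described above (group, take a maximum, rename, sort), here is an example on made-up data. The `toy` frame and the names `category`, `value`, and `max_value` are hypothetical and are not part of the homework dataset.

```python
import pandas as pd

# Toy data (hypothetical); not the homework dataset.
toy = pd.DataFrame({
    "category": ["a", "a", "b", "b", "c"],
    "value": [3, 7, 10, 2, 5],
})

# Selecting with a list ([["value"]]) keeps the result a DataFrame
# rather than a Series. Then take the per-group maximum, rename the
# column, and sort from largest to smallest.
toy_max = (
    toy.groupby("category")[["value"]]
    .max()
    .rename(columns={"value": "max_value"})
    .sort_values("max_value", ascending=False)
)
print(toy_max)
```

Note that the group labels become the index of the result, which is the shape this question asks for.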

In [None]:
ins_score_by_type = ...
ins_score_by_type

In [None]:
grader.check("q1a")

<br/>

---

## 🚀 Question 1b


Given the variability of `ins['score']` observed in 1a, let's examine the inspection scores `ins['score']` further.

In [None]:
ins['score'].value_counts().head()

There are a large number of inspections with a score of -1. These are probably missing values. Let's see what types of inspections have scores and which do not (score of -1). 

- First, define a new column `Missing Score` in `ins` where each row maps to the string `"Yes"` if the `score` for that inspection is -1 and `"No"` otherwise. 

- Then, use `groupby` to find the number of inspections for every combination of `type` and `Missing Score`. Store these values in a new column `Count` of a `DataFrame` named `ins_missing_score_group`. 

- Finally, sort `ins_missing_score_group` by descending `Count`s. 
The result should be a `DataFrame` that looks like the one shown below.

**Hint**: You may find the `map` method ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html)) useful for defining `Missing Score`! 

<table border="1" class="dataframe" >  <thead>    
    <tr style="text-align: right;">      <th></th>      <th></th>      <th>Count</th>   </tr>
    <tr style="text-align: right;">      <th>type</th>      <th>Missing Score</th>      <th></th>   </tr>  </thead>  <tbody>    
    <tr  align="right">      <th>Routine - Unscheduled</th>      <th>No</th>      <td>14031</td>         </tr>    
    <tr  align="right">      <th>...</th>      <td>...</td>      <td>...</td>        </tr>    
    <tr  align="right">      <th>...</th>      <td>...</td>      <td>...</td>       </tr>    </tbody> </table>
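
To illustrate counting rows for every combination of two columns, here is a minimal sketch on made-up data. The `toy` frame and the names `kind`, `flag`, and `n` are hypothetical; this is one possible approach, not necessarily the intended one.

```python
import pandas as pd

# Toy data (hypothetical); not the homework dataset.
toy = pd.DataFrame({
    "kind": ["x", "x", "x", "x", "y", "y"],
    "flag": ["No", "No", "No", "Yes", "Yes", "Yes"],
})

# size() counts the rows in each (kind, flag) group and returns a Series
# with a two-level index; to_frame() turns it into a one-column DataFrame.
toy_counts = (
    toy.groupby(["kind", "flag"])
    .size()
    .to_frame("n")
    .sort_values("n", ascending=False)
)
print(toy_counts)
```

Combinations that never occur in the data (here, `kind` "y" with `flag` "No") simply do not appear in the result.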

In [None]:
ins['Missing Score'] = ...
ins_missing_score_group = ...


ins_missing_score_group

In [None]:
grader.check("q1b")

<br/>

---

## 🚀 Question 1c


Using `groupby` to perform the analysis above gave us a `DataFrame` that wasn't the most readable at first glance. There are better ways to represent the information above that take advantage of the fact that we are looking at combinations of two variables. It's time to pivot (pun intended)!

Create a `DataFrame` that looks like the one below, and assign it to the variable `ins_missing_score_pivot`. 

You'll want to use the `pivot_table` method of the `DataFrame` class, which you can read about in the `pivot_table` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot_table.html). 

- Once you create `ins_missing_score_pivot`, add another column titled `Percentage Missing`, which contains the proportion of missing scores within each `type`. 

- Then, sort `ins_missing_score_pivot` in ascending order of `Percentage Missing`. Reassign the sorted `DataFrame` back to `ins_missing_score_pivot`.

**Hint:** Consider what happens if no values correspond to a particular combination of `Missing Score` and `type`. Looking at the documentation for `pivot_table`, is there any function argument that allows you to specify what value to fill in?
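
The effect of a fill value can be seen in a small sketch on made-up data. The `toy` frame and the names `kind`, `flag`, `item`, and `share_yes` are hypothetical, and the choice of `aggfunc="count"` on a helper column is an assumption made for illustration.

```python
import pandas as pd

# Toy data (hypothetical); not the homework dataset.
toy = pd.DataFrame({
    "kind": ["x", "x", "x", "y"],
    "flag": ["No", "No", "Yes", "Yes"],
    "item": [1, 2, 3, 4],
})

# Count rows for each (kind, flag) pair. The pair ("y", "No") never
# occurs, so without fill_value that cell would be NaN.
toy_pivot = toy.pivot_table(
    index="kind",
    columns="flag",
    values="item",
    aggfunc="count",
    fill_value=0,
)

# A derived column: the share of "Yes" rows within each kind.
toy_pivot["share_yes"] = toy_pivot["Yes"] / (toy_pivot["No"] + toy_pivot["Yes"])
print(toy_pivot)
```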

If you've done everything right, you should observe that inspection scores appear only to be assigned to `Routine - Unscheduled` inspections and that `ins_missing_score_pivot` looks like the `DataFrame` below:


<table border="1" class="dataframe" >  <thead>    
    <tr style="text-align: right;">      <th>Missing Score</th>      <th>No</th>      <th>Yes</th>      <th>Percentage Missing</th>    </tr>    <tr style="text-align: right;">      <th>type</th>      <th></th>      <th></th>      <th></th>    </tr>  </thead>  <tbody>    
    <tr  align="right">      <th>Routine - Unscheduled</th>      <td>14031</td>      <td>46</td>      <td>0.003268</td>    </tr>    
    <tr  align="right">      <th>...</th>      <td>...</td>      <td>...</td>      <td>...</td>    </tr>    
    </tbody></table>


In [None]:
ins_missing_score_pivot = ...

...

ins_missing_score_pivot

In [None]:
grader.check("q1c")

Notice that inspection scores appear only to be assigned to `Routine - Unscheduled` inspections. Also, it is reasonable for inspection types such as `New Ownership` and `Complaint` to have no associated inspection scores, but you might be curious why there are no inspection scores for the `Reinspection/Followup` inspection type. Later in the HW, we will examine these `Reinspection/Followup` inspections.

<br/>

---

# 🚀 Question 2: Joining Data Across Tables

In this question, we will start to connect data across multiple tables. We will be using the `pd.merge` function. 

<br/>

--- 

## 🚀 Question 2a

Let's figure out which restaurants had the lowest scores. Before we proceed, filter out missing scores from `ins` so that negative scores don't influence our results. 

In [None]:
ins = ins[ins["score"] > 0]

We'll start by creating a new `DataFrame` called `ins_named`. `ins_named` should be exactly the same as `ins`, except that it should have the name and address of every business, as determined by the `bus` `DataFrame`. 

**Hint**: Use the `DataFrame` method `merge` to join the `ins` `DataFrame` with the appropriate portion of the `bus` `DataFrame`. See the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) for guidance on how to use the `merge` function to combine two `DataFrame` objects. The first few rows of the `ins_named` `DataFrame` are shown below:

<img src="pics/2a.png" width="1080"/>
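
As a hedged illustration of joining one table with only some columns of another, here is a sketch on made-up data. The `visits` and `shops` frames and all of their column names are hypothetical, and a left join is just one reasonable choice.

```python
import pandas as pd

# Toy tables (hypothetical); not the homework dataset.
visits = pd.DataFrame({"shop_id": [1, 1, 2], "rating": [90, 85, 70]})
shops = pd.DataFrame({
    "shop_id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Gamma"],
    "street": ["1 Main St", "2 Oak Ave", "3 Pine Rd"],
    "phone": ["555-0001", "555-0002", "555-0003"],
})

# Select only the columns we want to bring over (plus the key), then
# left-join so that every row of `visits` is kept exactly once.
visits_named = visits.merge(
    shops[["shop_id", "title", "street"]],
    how="left",
    on="shop_id",
)
print(visits_named)
```

Restricting the right-hand table before merging avoids pulling in columns such as `phone` that are not needed.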

In [None]:
...
ins_named.head()

In [None]:
grader.check("q2a")

<br/>

--- 

## 🚀 Question 2b

Look at the 20 businesses in `ins_named` with the lowest scores. Order `ins_named` by each business's minimum score in ascending order. Use the business names in ascending order to break ties. The resulting `DataFrame` should look like the table below.

This one is pretty challenging! Don't forget to rename the `score` column. 

**Hint**: The `agg` function can accept a dictionary as an input. See the `agg` [documentation](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html). Additionally, when thinking about what aggregation functions to use, ask yourself what value would be in the `name` column for each entry across the group? Can we select just one of these values to represent the whole group?

As usual, **YOU SHOULD NOT USE LOOPS OR LIST COMPREHENSIONS**. Try to break down the problem piece by piece instead, gradually chaining together different `pandas` functions. Feel free to use more than one line!

<table border="1" class="dataframe">  <thead>    
    <tr style="text-align: right;">      <th></th>      <th>name</th>      <th>min score</th>    </tr> 
    <tr  align="right">  <th align="right">bid</th>      <th></th>      <th></th>    </tr> </thead>  <tbody>    
    <tr  align="right">      <th>86718</th>      <td>Lollipot</td>      <td>45</td>    </tr>  
    <tr  align="right">      <th>...</th>      <td>...</td>      <td>...</td>    </tr> 
  </tbody></table>
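
The dictionary form of `agg` mentioned in the hint can be sketched on made-up data as follows. The `toy` frame and the names `shop_id`, `title`, `rating`, and `min rating` are hypothetical; choosing `"first"` for the text column assumes every row in a group carries the same value.

```python
import pandas as pd

# Toy data (hypothetical); not the homework dataset.
toy = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3],
    "title": ["Alpha", "Alpha", "Beta", "Beta", "Gamma"],
    "rating": [90, 60, 75, 80, 60],
})

# The dictionary maps each column to its own aggregation function.
# Sorting by a list of columns uses the later columns to break ties.
summary = (
    toy.groupby("shop_id")
    .agg({"title": "first", "rating": "min"})
    .rename(columns={"rating": "min rating"})
    .sort_values(["min rating", "title"])
)
print(summary)
```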

In [None]:
twenty_lowest_scoring = ... 

# DO NOT USE LIST COMPREHENSIONS OR LOOPS OF ANY KIND!!!

...

twenty_lowest_scoring

In [None]:
grader.check("q2b")

<br/>

--- 
## 🚀 Question 2c
Let's do some more interesting analysis with our lowest score calculations. In the cell below, assign `worst_3_inspection_restaurants` to a two-column `DataFrame` with 15 rows.

- One column is the `name` of each business.

- The other column is a modified average inspection score of each business called `lowest 3 average`. 

    - To calculate `lowest 3 average`, find the average of each business's **three lowest inspection scores**. 

    - If a business has fewer than three inspection scores, take the average of all of its inspection scores (i.e., either one or two scores). 

- Finally, assign `worst_3_inspection_restaurants` to a `DataFrame` of the 15 rows with the lowest `lowest 3 average`, sorted by `lowest 3 average` ascending.  

`worst_3_inspection_restaurants` should look like the one below.

**Hint**: 2b’s advice also applies here! Furthermore, your answer to 2b may be helpful as a starting point. This question is intentionally left open-ended, so feel free to use any combination of `pandas` functions found online. Similarly to 2b, do not use loops or list comprehensions. Use as many lines as you see fit, so long as your final answer is saved to `worst_3_inspection_restaurants`. For isolating each business's lowest three inspection scores, it may be useful to know that when `groupby` applies aggregating functions, it preserves the sort order of the inputted `DataFrame`. 


<table border="1" class="dataframe">  <thead>    
    <tr style="text-align: right;">      <th></th>      <th>name</th>      <th>lowest 3 average</th>    </tr> 
    <tr  align="right">  <th align="right">bid</th>      <th></th>      <th></th>    </tr> </thead>  <tbody>    
    <tr  align="right">      <th>84590</th>      <td>Chaat Corner</td>      <td>54.0</td>    </tr>  
    <tr  align="right">      <th>...</th>      <td>...</td>      <td>...</td>    </tr> 
  </tbody></table>
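
One way to use the fact that `groupby` preserves row order within each group is sketched below on made-up data. The `toy` frame and the names `shop_id` and `rating` are hypothetical, and `groupby(...).head(n)` is only one of several possible approaches, not necessarily the intended one.

```python
import pandas as pd

# Toy data (hypothetical); not the homework dataset.
toy = pd.DataFrame({
    "shop_id": [1, 1, 1, 1, 2],
    "rating": [90, 60, 70, 80, 50],
})

# Sort first. head(2) then keeps the first two rows of each group in
# that sorted order, i.e. each group's two smallest ratings. A group
# with fewer than two rows keeps all of its rows.
lowest_two = toy.sort_values("rating").groupby("shop_id").head(2)

# Average the retained rows within each group.
lowest_two_mean = lowest_two.groupby("shop_id")["rating"].mean()
print(lowest_two_mean)
```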


In [None]:
worst_3_inspection_restaurants = ...

...
worst_3_inspection_restaurants

In [None]:
grader.check("q2c")

<br/>

---

# 🌮 Question 3: `pandas` Potpourri

In this question, we ask you to describe `pandas` operations and explain specific concepts using `ins_named`.

<!-- BEGIN QUESTION -->

<br/>

---

## 🌮 Question 3a

Consider the chained `pandas` statement below:

`q3a_df = ins_named[ins_named["name"].str.lower().str.contains("taco")].groupby("bid").filter(lambda sf: sf["score"].max() > 95).agg("count")`

We can decompose this statement into three parts:

```
temp1 = ins_named[ins_named["name"].str.lower().str.contains("taco")]
 
temp2 = temp1.groupby("bid").filter(lambda sf: sf["score"].max() > 95)
 
q3a_df = temp2.agg("count")
```

For each line of code above, write one sentence describing what the line of code accomplishes. Feel free to create a cell to see what each line does. In total, you'll write three sentences.

Finally, write an example homework question whose answer is `q3a_df`. 

- This example homework question should only be one sentence.

**Note: While the first part of this question will be graded for correctness, the second part is a bit more open-ended. Answers that demonstrate correct understanding will receive full credit.** 

An example answer will look like the following: "`temp1` creates a ... `temp2` transforms `temp1` by ... Finally, `q3a_df` results in a `DataFrame` that ... A question that is answered by this chain of operations is ..."

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br/>

---

## 🌮 Question 3b

Consider `ins_named`, `temp1`, `temp2`, and `q3a_df` from the previous problem. What is the granularity of each `DataFrame`? Explain your answer in no more than four sentences.

**Note**: For more details on what the granularity of a `DataFrame` means, feel free to check the [course notes](https://ds100.org/course-notes/eda/eda.html#granularity)!

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/>

---

# 🚀 Question 4: Missing Inspections

With our inspection data, we are given the `type` of each inspection. These categories were lightly investigated in Question 1, centered on the number of missing scores within each `type`. Since the `timestamp` and `score` for each inspection are also provided, we can do a more interesting analysis relating the `score` and `timestamp` of specific types of inspections. 

Specifically, in Question 4, we are interested in the possible relationship between inspections of the `type` "Routine - Unscheduled" and "Reinspection/Followup" (the two most frequent inspection types in our dataset). We might guess that a follow-up ("Reinspection/Followup") inspection occurs more frequently when an initial ("Routine - Unscheduled") inspection receives a low score. To confirm this hunch, let’s investigate the rate of follow-up inspections for different initial scores. To simplify your analysis, we have provided a new `DataFrame` (`reinspections`). 

- `reinspections` contains every "Routine - Unscheduled" inspection, along with the relevant `bid` and `name` associated with the initial inspection. 
- `routine timestamp` indicates when the initial inspection occurred. 
- `routine score` is the score that the initial inspection received. 
- `day difference` is the number of days between the initial inspection and a follow-up inspection if done within one year. 
    
Some initial inspections did not have any follow-up inspections within one year. In these cases, `day difference` is assigned a filler value of -1.

Run the cell below to load in `reinspections`.

In [None]:
reinspections = pd.read_csv('data/reinspections.csv')
reinspections

<br/>

--- 
## 🚀 Question 4a
First, create a new `Boolean` column `recent reinspection?` that indicates whether a follow-up inspection occurred within 62 days inclusive (~2 months) of an initial inspection. 

In [None]:
reinspections['recent reinspection?'] = ...

reinspections

In [None]:
grader.check("q4a")

<br/>

--- 
## 🚀 Question 4b
To simplify our analysis, let’s assign `routine score`s to buckets. Buckets are similar to the bins of a histogram. Each bucket contains all scores that fall in a particular range.

Below we have defined the function `bucketify`. Use `bucketify` to create a new column in `reinspections` called `score buckets` that **maps** the score of an initial inspection to one of these predefined buckets.

**Hint:** You may find the `map` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html) useful. Alternatively, see [this demonstration](https://ds100.org/course-notes/pandas_3/pandas_3.html#approach-3-sorting-using-the-map-function) in the course notes for an example use case.
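
As a small illustration of applying a function to every element of a column, here is a sketch on made-up data. The `label_temperature` helper, the `weather` frame, and its column names are hypothetical.

```python
import pandas as pd

def label_temperature(celsius):
    """Hypothetical helper: turn a temperature into a coarse label."""
    if celsius < 10:
        return "cold"
    elif celsius < 25:
        return "mild"
    else:
        return "hot"

# Toy data (hypothetical); not the homework dataset.
weather = pd.DataFrame({"celsius": [3, 18, 31, 25]})

# Series.map calls the function once per element and returns a new
# Series of the results, aligned with the original index.
weather["label"] = weather["celsius"].map(label_temperature)
print(weather)
```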

In [None]:
def bucketify(score):
    if score < 65: 
        return '0 - 65'
    elif score < 70:
        return '65 - 69'
    elif score < 75:
        return '70 - 74'
    elif score < 80:
        return '75 - 79'
    elif score < 85:
        return '80 - 84'
    elif score < 90:
        return '85 - 89'
    elif score < 95:
        return '90 - 94'
    else:
        return '95 - 100'
        
reinspections['score buckets'] = ...

reinspections

In [None]:
grader.check("q4b")

<br/>

--- 
## 🚀 Question 4c
Before we complete our analysis, remove all rows that belong to a `score buckets` value with fewer than 125 rows. Assign `reinspections_filtered` to this new `DataFrame`. 
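
Dropping the rows of small groups can be sketched on made-up data as follows. The `toy` frame, the names `bucket` and `value`, and the threshold of 2 are hypothetical; `groupby(...).filter` is one possible approach among several.

```python
import pandas as pd

# Toy data (hypothetical); not the homework dataset.
toy = pd.DataFrame({
    "bucket": ["a", "a", "a", "b", "c", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})

# filter() calls the function once per group and keeps every row of the
# groups for which it returns True. Here, groups with at least 2 rows.
kept = toy.groupby("bucket").filter(lambda group: len(group) >= 2)
print(kept)
```

The result has the same columns as the input and keeps the surviving rows in their original order.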

In [None]:
reinspections_filtered = ...

reinspections_filtered

In [None]:
grader.check("q4c")

<br/>

--- 
## 🚀 Question 4d

To conclude our analysis, use `reinspections_filtered` to generate a `DataFrame` with the **proportion** of initial inspections within each bucket that were reinspected within 62 days, along with the total **count** of initial inspections included in each bucket. Sort this `DataFrame` by ascending counts. Assign this new `DataFrame` to `reinspection_proportions`.

`reinspection_proportions` should look like the `DataFrame` below.

<table border="1" class="dataframe" >  <thead>    
    <tr style="text-align: right;">      <th></th>      <th>recent reinspection?</th>   <th></th> </tr>    
    <tr style="text-align: right;">      <th></th>      <th>proportion</th>      <th>count</th>    </tr>    
    <tr style="text-align: right;">      <th>score buckets</th>      <th></th>      <th></th>     </tr>  </thead>  <tbody>    
    <tr  align="right">      <th>70 - 74</th>      <td>0.407821</td>      <td>358</td>    </tr>    
    <tr  align="right">      <th>...</th>      <td>...</td>      <td>...</td>    </tr>    
    </tbody></table>
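
A proportion and a count can be computed in one pass because the mean of a Boolean column equals the fraction of `True` values. The sketch below uses made-up data; the `toy` frame and the names `bucket` and `done` are hypothetical, and this is one possible approach rather than the intended solution.

```python
import pandas as pd

# Toy data (hypothetical); not the homework dataset.
toy = pd.DataFrame({
    "bucket": ["a", "a", "a", "a", "b", "b"],
    "done": [True, False, False, False, True, True],
})

# Passing a list of functions to agg() produces two-level column labels:
# ("done", "mean") and ("done", "count"). rename(..., level=1) relabels
# only the inner level. A tuple selects one column when sorting.
stats = (
    toy.groupby("bucket")[["done"]]
    .agg(["mean", "count"])
    .rename(columns={"mean": "proportion"}, level=1)
    .sort_values(("done", "count"))
)
print(stats)
```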

In [None]:
reinspection_proportions = ...

reinspection_proportions

In [None]:
grader.check("q4d")

<!-- BEGIN QUESTION -->

<br/>

--- 
## 🚀 Question 4e

Do you notice any trends? Are your results consistent with your prior knowledge about restaurants that receive high or low health inspection scores? Answer in the cell below.

**This question is graded on effort; there is no one "correct" answer.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Summary of Inspections Data

We have done a lot in this homework! 
 
- Broke down the inspection scores in detail using `groupby` and `pivot_table`.
- Joined the business and inspection data and identified restaurants with the worst ratings.
- Took a deep dive into understanding any trends between an inspection score and reinspection frequency.

Over the course of this 2-part homework, we hope you have become more familiar with `pandas` - in terms of identifying when to use particular functions, how they work, when they can support EDA - as well as with EDA and Data Cleaning, as part of the broader Data Science Lifecycle. These tools will serve you well as a data scientist!

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Congratulations! You have finished Homework 2B! ##

Coco and Oreo say hi :)

<img src = "pics/IMG_2887.jpg" width = "400" class="center">

### Course Content Feedback

If you have any feedback about this assignment or about any of our other weekly assignments, lectures, or discussions, please fill out the [Course Content Feedback Form](https://forms.gle/Yc3kdzNLPsVKNz2g6). Your input is valuable in helping us improve the quality and relevance of our content to better meet your needs and expectations!

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Once you submit this file to the HW 2B Coding assignment on Gradescope, Gradescope will automatically submit a PDF file with your written answers to the HW 2B Written assignment. If you run into any issues when running this cell, feel free to check this [section](https://ds100.org/debugging-guide/autograder_gradescope/autograder_gradescope.html#why-does-grader.exportrun_teststrue-fail-if-all-previous-tests-passed) in the Data 100 Debugging Guide.

**Important**: Please check that your written responses were generated and submitted correctly to the HW 2B Written Assignment.

**You are responsible for ensuring your submission follows our requirements and that the PDF for HW 2B written answers was generated/submitted correctly. We will not be granting regrade requests nor extensions to submissions that don't follow instructions.** If you encounter any difficulties with submission, please don't hesitate to reach out to staff prior to the deadline. 

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)