In [11]:
# Initialize Otter
import otter
grader = otter.Notebook("project.ipynb")

# Project 1 – Gradebook 💯

## DSC 80, Fall 2024

### Checkpoint Due Date: Tuesday, October 8th (Questions 1-7)
### Due Date: Tuesday, October 15th

## Instructions

Welcome to Project 1! Be sure to read the instructions below carefully to understand how projects differ from labs.

### Working on the Project

This Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems.

* Like the lab, your coding work will be developed in the accompanying `project.py` file, which will be imported into the current notebook. This code will be autograded.
* There is no manually-graded component to Project 1, so the only thing you will ever submit is `project.py`.
* **For the Checkpoint, which is required, you only need to turn in a `project.py` containing solutions for Questions 1-7!**
    - The "Project 1 Checkpoint" autograder on Gradescope does not thoroughly check your code – it only runs the public tests on Questions 1-7 to make sure that you have completed them. There are no hidden tests for the checkpoint, and you will see your score upon submission. 
    - When you submit the final version of the project, however, we will use hidden tests to check your answers more thoroughly.
    - Note that this means you will ultimately have to submit the project twice – once to the "Project 1 Checkpoint" autograder (Questions 1-7 only), and once to the "Project 1" autograder (once you're fully done).
- **Do not change the function names in the `project.py` file!** The functions in `project.py` are how your assignment is graded, and they are graded by their name. If you changed something you weren't supposed to, you can find the original code in the [course GitHub repository](https://github.com/dsc-courses/dsc80-2024-ss2).
- **To ensure that all of your work to be submitted is in `project.py`, we've included a script named `project-validation.py` in the project folder. You shouldn't edit it, but instead, you should call it from the command line (e.g. the Terminal) to test your work.** More details on its usage are given at the bottom of this notebook.
- You are encouraged to write your own additional helper functions to solve the project, as long as they also end up in `project.py`.

### Working with a Partner

You may work together on projects (and projects only!) with a partner. If you work with a partner, you are both required to actively contribute to all parts of the project. You must both be working on the assignment at the same time together, either physically or virtually on a Zoom call. You are encouraged to follow the pair programming model, in which you work on just a single computer and alternate who writes the code and who thinks about the problems at a high level.

In particular, you **cannot** split up the project and each work on separate parts independently.

Note that if you do work with a partner, you and your partner must submit the Checkpoint together and the whole project together. See [here](https://dsc80.com/syllabus/#projects) for more details.

In [12]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

In [14]:
from project import *

## About the Assignment

In this project, you'll work with the gradebook for CSD 18, a fictional data science course with 535 students co-taught by Professors Yutian and Dylan. You'll help Professors Yutian and Dylan compute the total course grade for every student in their course and analyze their students' performances throughout the quarter.

---

### Navigating the Project

Click on the links below to navigate to different parts of the project. Note that Part 1 – that is, Questions 1, 2, 3, 4, 5, 6, and 7 – constitutes your Checkpoint submission.

- [Part 1: Initial Calculations 🔢](#part1)
    - [Question 1](#Question-1-(Checkpoint-Question))
    - [Question 2](#Question-2-(Checkpoint-Question))
    - [Question 3](#Question-3-(Checkpoint-Question))
    - [Question 4](#Question-4-(Checkpoint-Question))
    - [Question 5](#Question-5-(Checkpoint-Question))
    - [Question 6](#Question-6-(Checkpoint-Question))
    - [Question 7](#Question-7-(Checkpoint-Question))
- [Part 2: Redemption 🙏](#part2)
    - [Question 8](#Question-8)
    - [Question 9](#Question-9)
    - [Question 10](#Question-10)
- [Part 3: Analysis 🧠](#part3)
    - [Question 11](#Question-11)
    - [Question 12](#Question-12)
    - [Question 13](#Question-13)

<!--     - [✅ Question 1 (Checkpoint Question)](#Question-1-(Checkpoint-Question))
    - [✅ Question 2 (Checkpoint Question)](#Question-2-(Checkpoint-Question))
    - [✅ Question 3 (Checkpoint Question)](#Question-3-(Checkpoint-Question))
    - [✅ Question 4 (Checkpoint Question)](#Question-4-(Checkpoint-Question))
    - [✅ Question 5 (Checkpoint Question)](#Question-5-(Checkpoint-Question))
    - [✅ Question 6 (Checkpoint Question)](#Question-6-(Checkpoint-Question))
    - [✅ Question 7 (Checkpoint Question)](#Question-7-(Checkpoint-Question)) -->
---

### The Syllabus

Professor Yutian has taught this course several times, so the instructors decide to use her syllabus at the start of the quarter. (Note that this syllabus is **not** the same as the course syllabus for DSC 80 in Spring 2024.)

* **Lab assignments (20% total)**
    - Each lab is worth the same amount, regardless of each lab's raw point total.
    - The lowest lab is dropped.
    - Each lab may be revised for up to (and including) one week after the deadline for a 10% penalty, for up to (and including) two weeks after the deadline for a 30% penalty, and beyond that for a 60% penalty. Such revisions are reflected in the `'Lateness'` columns in the gradebook.
    - Labs also have a two hour grace period that needs to be factored in before assigning late penalties.
    - Note that lateness penalties are not assessed for any other type of assignment – that is, students can submit projects, checkpoints, and discussions late without penalty.
* **Projects (30% total)** 
    - Each project consists of an autograded portion, and **possibly** a free response portion.
    - The total points for a single project consist of the sum of the raw score of the two portions.
    - Each project is worth the same amount, regardless of each project's raw point total.
* **Checkpoints (2.5% total)**
    - Each project checkpoint is worth the same amount, regardless of each project checkpoint's raw point total.
* **Discussions (2.5% total)**
    - Each discussion is worth the same amount, regardless of each discussion's raw point total.
* **Midterm Exam (15%)**
* **Final Exam (30%)**

You will need to refer to this syllabus repeatedly throughout the project, and several questions will link you back to it.

---

### Generalization

Your code only needs to work for courses that follow the syllabus above. That is, you may assume that the DataFrame `grades` looks **like** the given one in `data/grades.csv`.

However, your code should work regardless of:
- The numbers of labs, projects, discussions, and checkpoints in the course.
- The number of students in the course.

For instance, if CSD 18 is taught in a different quarter with more labs, fewer projects, and fewer students, your code should still work on a `grades.csv` from that quarter.

You may assume the course components and the naming conventions are as given in `grades.csv`, and you may assume that the course has no more than 99 of any type of assignment.

---

### Putting Everything Together

Here are a few remarks and tips for approaching Project 1, and projects more generally:

1. If you are having trouble figuring out what a question is asking you to do, look at the big picture and try to understand what the current step is doing to contribute to this big picture. This may clarify what's being asked!
1. These questions intentionally build off of each other and the final result matters! In fact, you can "get a question correct," but only receive partial credit for it because a previous answer was wrong.
    - You will typically receive partial credit for a question based on *how close* your answer is to correct (as well as some credit for a solution in the correct form). 
    - You should try to assess your answer to each question based on what you understand of the data. This might involve writing extensive code (that isn't turned in) just to check your work! Suggestions on checking your work are given in the assignment, but you should also think of your own ways of checking your work.
    - As you do this project, think about the data from the perspective of the student (which should be easy to do, since you've used Gradescope before!).
1. To test the correctness of your answers:
    - Once you have implemented a particular function in `project.py`, you should test out your function in the notebook. In particular, you should inspect/analyze the output to assess its correctness.
    - Run your functions on the main dataset (`grades`, and later `grades_combined` and `grades_analysis`) and ask yourself if the output *looks correct.*
    - Run your functions on very small datasets (e.g. 1-5 row DataFrames that you construct by hand), calculate the expected output by hand, and see if the function output matches (this *is* unit-testing your code with data).
    - Run your functions on (large and small) samples of the dataset `grades`. Does your code break, or does it still run as expected?
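
The hand-check described in the tips above might look like the sketch below. It is only an illustration: `toy` is a made-up two-student gradebook that borrows the column naming pattern of `grades`, and `normalized_lab01` is a hypothetical stand-in for whichever function you are testing.

```python
import pandas as pd

# Made-up two-student gradebook; only the column naming pattern comes from `grades`.
toy = pd.DataFrame({
    'PID': ['A00000001', 'A00000002'],
    'lab01': [50.0, 80.0],
    'lab01 - Max Points': [100.0, 100.0],
})

def normalized_lab01(df):
    # Hypothetical stand-in for the function under test.
    return df['lab01'] / df['lab01 - Max Points']

# Expected values worked out by hand: 50/100 and 80/100.
expected = pd.Series([0.5, 0.8])

# Raises an AssertionError if the two Series differ.
pd.testing.assert_series_equal(normalized_lab01(toy), expected)
```

Because the input is tiny, a mismatch points directly at the row and the rule that went wrong.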

Run the cell below to load in the aforementioned `grades` dataset.

In [15]:
grades_fp = Path('data') / 'grades.csv'
grades = pd.read_csv(grades_fp)
grades.head()

Unnamed: 0,PID,College,Level,Section,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),...,discussion07 - Lateness (H:M:S),discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S)
0,A99706914,ERC,JR,A22,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,...,00:00:00,8.895294,10,00:00:00,10.0,10,780:01:28,10.0,10,00:00:00
1,A99237411,Eighth,JR,A29,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,...,669:12:21,9.022407,10,00:00:00,9.020283,10,00:00:00,9.437368,10,00:00:00
2,A99690544,Revelle,SR,A12,86.513369,100.0,00:00:00,47.80282,100.0,00:00:00,...,00:00:00,3.030538,10,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00
3,A99427381,Seventh,JR,A14,100.0,100.0,00:00:00,100.0,100.0,00:00:00,...,00:00:00,10.0,10,00:00:00,9.249126,10,00:00:00,10.0,10,00:00:00
4,A99489712,Sixth,JR,A24,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,...,00:00:00,4.439606,10,00:00:00,4.485291,10,00:00:00,6.282712,10,00:00:00


***Tip***: The `grades` DataFrame has 101 columns, and you can't see them all right now. To get a feel for what all of the columns represent, you might consider opening `grades.csv` with a spreadsheet application, like Google Sheets or Excel.

<a name='part1'></a>

## Part 1: Initial Calculations 🔢

([return to the outline](#Navigating-the-Project))

In Part 1, you'll compute students' letter grades in CSD 18 using [the syllabus](#The-Syllabus) provided above. As you'll see, this requires many steps. Let's get started!

<!-- ### ✅ Question 1 (Checkpoint Question) -->
### Question 1 


<a name='Question-1-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `get_assignment_names`, which takes in a DataFrame like `grades` and returns a dictionary with the following structure:
- The keys are the general areas of [the syllabus](#The-Syllabus): `'lab'`, `'project'`, `'midterm'`, `'final'`, `'disc'`, and `'checkpoint'`.
- The values are **lists** that contain all the assignment names of that type. For example, the lab assignments all have names of the form `'labXX'` where `XX` is a zero-padded two digit number. If the class has 5 labs, the returned dictionary's value for the `'lab'` key should be `['lab01', 'lab02', 'lab03', 'lab04', 'lab05']`.

***Notes***: 
- Some of the column names in the DataFrame contain the assignment name in the zero-padded fashion requested; you should use this to your advantage when building the dictionary.
- The point of this question is to familiarize you with the names of the columns in `grades`. Try to reuse your `get_assignment_names` function in future questions – if you find yourself never using it again, you may be redoing work unnecessarily.
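
One possible way to pick out names of the form `'labXX'` is a regular expression that must match the entire column name. The sketch below is illustrative only: the helper name `names_matching` and the short column list are assumptions, and it handles only labs rather than building the full dictionary.

```python
import re

def names_matching(columns, pattern):
    # Keep only the columns whose entire name matches the pattern.
    return [col for col in columns if re.fullmatch(pattern, col)]

# A few column names in the style of the gradebook.
columns = ['PID', 'lab01', 'lab01 - Max Points', 'lab01 - Lateness (H:M:S)',
           'lab02', 'lab02 - Max Points', 'lab02 - Lateness (H:M:S)']

labs = names_matching(columns, r'lab\d{2}')
print(labs)  # ['lab01', 'lab02']
```

Matching the whole name, rather than checking for a substring, keeps the `'- Max Points'` and `'- Lateness (H:M:S)'` columns out of the result.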

In [16]:
[x for x in list(grades.columns) if 'checkpoint' in x]

['project02_checkpoint01',
 'project02_checkpoint01 - Max Points',
 'project02_checkpoint01 - Lateness (H:M:S)',
 'project02_checkpoint02',
 'project02_checkpoint02 - Max Points',
 'project02_checkpoint02 - Lateness (H:M:S)',
 'project03_checkpoint01',
 'project03_checkpoint01 - Max Points',
 'project03_checkpoint01 - Lateness (H:M:S)']

In [17]:
get_assignment_names(grades)

{'lab': ['lab01',
  'lab02',
  'lab03',
  'lab04',
  'lab05',
  'lab06',
  'lab07',
  'lab08',
  'lab09'],
 'project': ['project01', 'project02', 'project03', 'project04', 'project05'],
 'midterm': ['Midterm'],
 'final': ['Final'],
 'disc': ['discussion01',
  'discussion02',
  'discussion03',
  'discussion04',
  'discussion05',
  'discussion06',
  'discussion07',
  'discussion08',
  'discussion09',
  'discussion10'],
 'checkpoint': ['project02_checkpoint01',
  'project02_checkpoint02',
  'project03_checkpoint01']}

In [18]:
grader.check("q1")

Now you're ready to compute each student's overall grade on the first type of assignment – projects.

<!-- ### ✅ Question 2 (Checkpoint Question) -->
### Question 2


<a name='Question-2-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `projects_total`, which takes in a DataFrame like `grades` and returns a Series containing the total project grade for each student for the entire quarter, according to [the syllabus](#The-Syllabus). The output Series should contain values between 0 and 1.

***Notes***:

- If a student didn't turn in a particular project, what should their grade for it be? 
- Some projects have free response components that you need to account for when calculating the total points earned by a student and the max points possible for that project.
    - For instance, let's say Tiffany earned 82 points on the autograded portion of Project 1 and 13 points on the free response portion. This means that her overall Project 1 grade should be:
    $$
        \text{Project 1 Grade} = \frac{82+13}{85+15} = 0.95
    $$
- Per [the syllabus](#The-Syllabus), students may submit projects (and checkpoints and discussions) late without penalty.
- Do not include scores on checkpoint assignments in your calculations.
- To check your work, try:
    1. Calculating the total project scores for a few types of students by hand.
    2. Calculating summary statistics for the whole class' performance on a few projects in particular and ensuring the results seem reasonable.
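
Tiffany's example above can be reproduced with a one-row DataFrame. This is a sketch under assumptions: the free response column names used here are hypothetical, so check `grades.columns` for the names actually used in the gradebook.

```python
import pandas as pd

# One made-up row for Tiffany. The free response column names are
# assumptions for illustration only.
tiffany = pd.DataFrame({
    'project01': [82.0],
    'project01 - Max Points': [85.0],
    'project01_free_response': [13.0],
    'project01_free_response - Max Points': [15.0],
})

# Add the two portions before dividing, so the project is graded as a whole.
earned = tiffany['project01'] + tiffany['project01_free_response']
possible = (tiffany['project01 - Max Points']
            + tiffany['project01_free_response - Max Points'])
project01_grade = earned / possible
print(project01_grade.iloc[0])  # 0.95
```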

In [19]:
grades.loc[:, get_assignment_names(grades)['project']]

Unnamed: 0,project01,project02,project03,project04,project05
0,75.282632,75.000000,85.519583,68.230985,73.917020
1,52.929482,62.926075,88.201035,49.884266,57.680370
2,46.122801,62.290938,77.043708,41.548308,46.714963
3,79.121806,70.558311,94.299439,74.267897,75.000000
4,41.823703,71.043328,90.805754,44.484302,50.848312
...,...,...,...,...,...
530,78.936816,65.932566,98.070705,71.937036,75.000000
531,72.076801,71.058797,76.384220,68.344038,73.221261
532,66.273252,75.000000,90.107681,58.207189,63.822615
533,63.965217,75.000000,88.717789,51.220883,60.524285


In [20]:
x = projects_total(grades)
x

0      0.916234
1      0.765932
2      0.681279
3      0.962581
4      0.737446
         ...   
530    0.949434
531    0.866795
532    0.862050
533    0.813468
534    0.939433
Length: 535, dtype: float64

In [21]:
grader.check("q2")

Now that projects are out of the way, you need to clean and process the lab grades. This will involve a bit more work than was necessary for projects. Specifically, you'll:
- identify late submissions (Question 3), 
- compute normalized scores for each lab assignment, factoring in late penalties (Question 4), and 
- drop the lowest lab grade and compute a total lab score for each student (Question 5).

<!-- ### ✅ Question 3 (Checkpoint Question) -->
### Question 3 


<a name='Question-3-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Recall, per [the syllabus](#The-Syllabus), labs are the only assignment category for which late penalties are enforced:

>  Each lab may be revised for up to (and including) one week after the deadline for a 10% penalty, for up to (and including) two weeks after the deadline for a 30% penalty, and beyond that for a 60% penalty. Such revisions are reflected in the `'Lateness'` columns in the gradebook.

For labs, students have a **two hour grace period** after the deadline during which their submissions are counted as on time. The grace period only applies to the original deadline – for instance, if a student submits one week and one hour late, their submission falls into the "up to (and including) two weeks after the deadline" category and they are assessed a 30% penalty.

Your job is to adjust lab grades to penalize **truly** late submissions, factoring in the grace period. To adjust a student's grade, multiply their lab score by `1` (on time, factoring in the grace period), `0.9`, `0.7`, or `0.4`. We'll call these four numbers – `1`, `0.9`, `0.7`, and `0.4` – "lateness multipliers." 

Complete the implementation of the function `lateness_penalty`, which takes in a Series containing information on how late each student turned in a particular lab, such as `grades['lab01 - Lateness (H:M:S)']`, and returns a Series containing each student's lateness multiplier for that lab. The only possible values in the returned Series should be `1.0`, `0.9`, `0.7`, and `0.4`.

**Don't forget to factor in the grace period!** Remember, we will only be enforcing late penalties for labs, not for any other assignment category.

**Note**: Real Gradescope has no grace period! Make sure you submit your assignments on time.
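
The thresholds above can be sketched by converting each `'H:M:S'` string to a number of seconds and comparing against the cutoffs. This is not a reference solution: the function name is made up, and treating a submission exactly two hours late as on time is an assumption about how the grace period boundary is handled.

```python
import numpy as np
import pandas as pd

def lateness_multiplier_sketch(lateness):
    # Split 'H:M:S' strings into three integer columns: hours, minutes, seconds.
    parts = lateness.str.split(':', expand=True).astype(int)
    seconds = parts[0] * 3600 + parts[1] * 60 + parts[2]

    hour = 3600
    week = 7 * 24 * hour
    # Conditions are checked in order; the first one that holds wins.
    # Assumption: exactly two hours late still counts as on time.
    conditions = [seconds <= 2 * hour, seconds <= week, seconds <= 2 * week]
    choices = [1.0, 0.9, 0.7]
    multipliers = np.select(conditions, choices, default=0.4)
    return pd.Series(multipliers, index=lateness.index)

sample = pd.Series(['00:04:51', '02:00:01', '168:00:00', '169:00:00', '384:50:21'])
print(lateness_multiplier_sketch(sample).tolist())  # [1.0, 0.9, 0.9, 0.7, 0.4]
```

Working in whole seconds avoids floating-point comparisons at the boundaries.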

In [22]:
lateness_penalty(grades['lab05 - Lateness (H:M:S)']).iloc[150]


np.float64(0.4)

In [23]:
grades['lab05 - Lateness (H:M:S)'].iloc[150]

'384:50:21'

In [24]:
grades['lab05 - Lateness (H:M:S)'].idxmax()

150

In [25]:
lateness_penalty(grades['lab04 - Lateness (H:M:S)']).unique()

array([1. , 0.9, 0.7])

In [26]:
grader.check("q3")

<!-- ### ✅ Question 4 (Checkpoint Question) -->
### Question 4 

<a name='Question-4-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `process_labs`, which takes in a DataFrame like `grades` and returns a DataFrame of processed lab scores. The returned DataFrame should:
* have the same index as `grades`,
* have one column for each lab assignment (e.g. `'lab01'`, `'lab02'`,..., `'lab09'`),
* have values representing the final score for each lab assignment, adjusted for lateness and **normalized** to a score between 0 and 1.

Remember to correctly handle the case where a student _doesn't_ turn in a lab.

In [27]:
processed = process_labs(grades)
processed

Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09
0,0.997353,0.849902,0.637744,1.000000,1.000000,0.994518,0.389141,0.887917,0.874913
1,0.988295,0.507842,0.714477,0.783672,1.000000,0.393887,0.914061,0.944378,0.902977
2,0.865134,0.478028,0.433667,0.738875,0.927838,0.345076,0.734070,0.718204,0.757840
3,1.000000,1.000000,0.925903,0.950614,0.891614,0.688403,0.985371,0.963307,0.777880
4,0.665070,0.334224,0.706932,0.747915,0.659720,0.731345,0.607859,0.370186,1.000000
...,...,...,...,...,...,...,...,...,...
530,0.900000,0.820228,1.000000,0.792935,1.000000,0.284106,0.770281,0.931245,1.000000
531,1.000000,0.874981,0.809945,0.592866,0.987597,0.759688,0.856178,0.849694,0.582645
532,0.886566,0.903260,1.000000,1.000000,0.941425,0.768909,0.967282,0.877898,1.000000
533,0.837997,0.856369,0.909363,0.955287,0.737854,0.382781,0.769093,0.947450,0.867373


In [28]:
grader.check("q4")

<!-- ### ✅ Question 5 (Checkpoint Question) -->
### Question 5 

<a name='Question-5-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `lab_total`, which takes in a DataFrame returned by `process_labs` – that is, a DataFrame that contains each student's score on each lab after lateness penalties – and returns a Series containing the total lab grade for each student according to [the syllabus](#The-Syllabus) (i.e. with the lowest lab dropped). All values in the returned Series should be proportions between 0 and 1. 

For example, if CSD 18 only has 3 labs, and Aritra received lab scores of 20%, 90%, and 100% after lateness penalties, then your output Series should contain the value `0.95` for Aritra. This is because we drop the lowest score, and then compute the average of just 90% and 100%, which is 95%, or 0.95 as a proportion.
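
Aritra's example can be checked with a few lines of pandas. The sketch below uses made-up data for a single student and is one possible approach: subtract each row's minimum from its sum, then average over the remaining labs.

```python
import pandas as pd

# Aritra's lab scores after lateness penalties, as proportions.
labs = pd.DataFrame({'lab01': [0.2], 'lab02': [0.9], 'lab03': [1.0]})

# Removing the row minimum from the row sum drops the lowest lab;
# dividing by (number of labs - 1) averages the rest.
total = (labs.sum(axis=1) - labs.min(axis=1)) / (labs.shape[1] - 1)
print(round(total.iloc[0], 6))  # 0.95 (rounded only for display)
```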

In [29]:
lab_total(processed)

0      0.905293
1      0.844463
2      0.706707
3      0.936836
4      0.686128
         ...   
530    0.901836
531    0.841369
532    0.947054
533    0.860098
534    0.865609
Length: 535, dtype: float64

In [30]:
grader.check("q5")

Now that projects and labs are processed, we're almost ready to compute the letter grade of each student in CSD 18.

### Question 6
<!-- ### ✅ Question 6 (Checkpoint Question) -->

<a name='Question-6-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

First, you need to compute each student's course grade, which results from adding their total grades in each course component according to the weights given in [the syllabus](#The-Syllabus).

Complete the implementation of the function `total_points`, which takes in a DataFrame like `grades` and returns a Series containing each student's course grade. **Course grades should be proportions between 0 and 1.**

***Notes***: 

- Don't repeat yourself when computing the checkpoint and discussion portions of the course.
- Remember, only the lab portion of the course accounts for late assignments; you may assume all assignments in other portions are turned in without penalty.
- Do the work by hand for a few students to check your code!
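
As a rough illustration of the weighting step, the sketch below multiplies made-up category totals by the syllabus weights and adds them up. The DataFrame `totals` and its column labels are assumptions for illustration; they are not produced by any function in this project.

```python
import pandas as pd

# Weights copied from the syllabus; they sum to 1.
weights = {'lab': 0.20, 'project': 0.30, 'checkpoint': 0.025,
           'disc': 0.025, 'midterm': 0.15, 'final': 0.30}

# Made-up category totals (proportions) for two students.
totals = pd.DataFrame({
    'lab': [1.0, 0.5], 'project': [1.0, 0.5], 'checkpoint': [1.0, 0.5],
    'disc': [1.0, 0.5], 'midterm': [1.0, 0.5], 'final': [1.0, 0.5],
})

# Multiplying by a Series aligns its index with the DataFrame's columns,
# so each category is scaled by its own weight before summing across columns.
course_grade = (totals * pd.Series(weights)).sum(axis=1)
```

A student with a perfect score in every category should end up at 1, which is a quick sanity check on the weights.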

In [31]:
total_points(grades)

0      0.902389
1      0.816924
2      0.759308
3      0.908499
4      0.674545
         ...   
530    0.864016
531    0.765390
532    0.859673
533    0.866263
534    0.896365
Length: 535, dtype: float64

In [32]:
grader.check("q6")

<!-- ### ✅ Question 7 (Checkpoint Question) -->
### Question 7

<a name='Question-7-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

How well did the students in CSD 18 do?

#### `final_grades`

Complete the implementation of the function `final_grades`, which takes in a Series of final course grades (as computed by `total_points` in Question 6) and returns a Series of letter grades as determined by the following cutoffs:

| Letter Grade | Cutoff |
|:--- | --- |
| A | grade >= 0.9 |
| B | 0.8 <= grade < 0.9 |
| C | 0.7 <= grade < 0.8 |
| D | 0.6 <= grade < 0.7 |
| F | grade < 0.6 |

***Note***: These cutoffs do not have pluses or minuses. **Do not round** anyone's course grade when determining their letter grade.
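
One way to express the cutoff table is with `pd.cut` and left-closed bins. This is a sketch, not the required approach, and the function name `letters_sketch` is made up.

```python
import numpy as np
import pandas as pd

def letters_sketch(course_grades):
    # right=False makes each bin closed on the left, e.g. [0.8, 0.9),
    # which matches the "0.8 <= grade < 0.9" style of the table.
    bins = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
    labels = ['F', 'D', 'C', 'B', 'A']
    return pd.cut(course_grades, bins=bins, labels=labels, right=False).astype(str)

print(letters_sketch(pd.Series([0.9, 0.8999, 0.6, 0.59])).tolist())
# ['A', 'B', 'D', 'F']
```

Note that 0.8999 stays a B: nothing is rounded before the comparison.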

<br>

#### `letter_proportions`

Complete the implementation of the function `letter_proportions`, which takes in a Series of final course grades (as computed by `total_points` in Question 6) and returns a Series containing the proportion of the class that received each letter grade. For instance, this Series might tell us that the proportion of the class receiving B's was 0.45, A's was 0.33, C's was 0.16, D's was 0.05, and F's was 0.01 (though these are made up numbers). The index of this Series should be letters, and the **values should be sorted in decreasing order**.

***Notes***: 

- The values in your returned Series should add up to exactly `1.0`. If you are getting something close such as `0.99999`, that means there is an issue with your code in a function you implemented earlier.
- **Do not round**.
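
For the proportions, `Series.value_counts` with `normalize=True` is one option, since it returns proportions sorted in decreasing order by default. The letter grades below are made up.

```python
import pandas as pd

letters = pd.Series(['B', 'A', 'B', 'C'])  # made-up letter grades

# normalize=True divides each count by the total number of students.
proportions = letters.value_counts(normalize=True)
print(proportions.index[0], proportions.iloc[0])  # B 0.5
```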

In [33]:
tp = total_points(grades)
final_grades(tp)

0      A
1      B
2      C
3      A
4      D
      ..
530    B
531    C
532    B
533    B
534    B
Length: 535, dtype: object

In [34]:
lp = letter_proportions(tp)
lp

B    0.499065
C    0.228037
A    0.211215
D    0.031776
F    0.029907
Name: proportion, dtype: float64

In [35]:
assert 1.0 == lp.sum()

In [36]:
grader.check("q7")

<a name='part2'></a>

## Part 2: Redemption 🙏

([return to the outline](#Navigating-the-Project))

The syllabus we've used so far was put together by Professor Yutian, who has taught CSD 18 for several iterations. This was Professor Dylan's first time teaching CSD 18, and towards the end of the quarter he proposed a new idea to reward students for showing an improvement in their understanding of the earlier ideas in the course on the final exam. Specifically, here's what he proposed:

- The instructors will identify the questions on the final exam that contain content that was also covered on the midterm exam. Call these "redemption questions."
- For each student, compute their "raw redemption score", which is the proportion of points available on redemption questions that they earned. If they did not take the final exam, their raw redemption score is 0.
- Convert the class' raw redemption scores to z-scores, i.e. to standard units.
- Convert the class' original midterm exam grades, as proportions, to z-scores.
- If a student's raw redemption z-score is higher than their original midterm exam z-score, replace their original midterm exam score with one that has a z-score equal to their raw redemption z-score. This is done by converting their raw redemption z-score back to a midterm grade proportion using the standard deviation and mean of the midterm exam.
- If not, leave their original midterm exam score as-is. **Note that this policy can only increase a student's midterm exam score (and, hence, their total course grade), not decrease!**

As a refresher from [DSC 10](https://dsc-courses.github.io/dsc10-2022-fa/resources/lectures/lec21/lec21.html#Standard-units), to convert a sequence of numbers to z-scores, or standard units, we use the following formula:

$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of }x}$$
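
In code, the formula might be sketched as below. The helper name `standard_units` is an assumption, and so is the choice of the population SD (`ddof=0`). Be aware that `Series.std` in pandas defaults to `ddof=1`, while `np.std` defaults to `ddof=0`, so the two give slightly different answers unless you set `ddof` explicitly.

```python
import pandas as pd

def standard_units(x):
    # Assumption: use the population SD (ddof=0) in the denominator.
    return (x - x.mean()) / x.std(ddof=0)

z = standard_units(pd.Series([1.0, 2.0, 3.0, 4.0]))
# After conversion, the values have mean 0 and (population) SD 1.
```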

To illustrate this redemption policy, let's look at a concrete example.

- Suppose the final exam was worth 80 points. 55 of these points came from Questions 2, 4, 6, 8, and 9, which were the redemption questions. The class' mean score on just the redemption questions was 0.8, with a standard deviation of 0.15.
- Suppose the midterm exam was worth 70 points. The class' mean score on the midterm exam was 0.6, with a standard deviation of 0.25.
- Jasmine, a student in the course, earned a $\frac{74}{80}$ on the final exam, including a $\frac{51}{55}$ on the redemption questions, and a $\frac{53}{70}$ on the midterm exam. Then:
    - Her raw redemption score is $\frac{51}{55}$, and her redemption z-score is $\frac{\frac{51}{55} - 0.8}{0.15} \approx 0.8485$.
    - Her midterm z-score is $\frac{\frac{53}{70} - 0.6}{0.25} \approx 0.6286$.
    - Since her redemption z-score, $0.8485$, is greater than her midterm z-score, $0.6286$, her midterm exam score of $\frac{53}{70} \approx 0.7571$ will be replaced with:
    
    $$\text{Jasmine's redemption z-score} \cdot \text{class' midterm SD} + \text{class' midterm mean} \approx 0.8485 \cdot 0.25 + 0.6 = \boxed{0.8121}$$
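
The arithmetic in Jasmine's example can be reproduced with plain Python. The variable names below are chosen for illustration; the numbers all come from the example, and the cap at 1 reflects the rule that redeemed midterm grades cannot exceed 100%.

```python
# Class-level statistics given in the example.
redemption_mean, redemption_sd = 0.8, 0.15
midterm_mean, midterm_sd = 0.6, 0.25

# Jasmine's scores as proportions.
raw_redemption_score = 51 / 55
midterm_score = 53 / 70

redemption_z = (raw_redemption_score - redemption_mean) / redemption_sd
midterm_z = (midterm_score - midterm_mean) / midterm_sd

if redemption_z > midterm_z:
    # Convert the redemption z-score back to the midterm scale, capped at 1.
    new_midterm = min(redemption_z * midterm_sd + midterm_mean, 1.0)
else:
    new_midterm = midterm_score

# Rounded only for display.
print(round(redemption_z, 4), round(midterm_z, 4), round(new_midterm, 4))
# 0.8485 0.6286 0.8121
```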

Now, your job will be to implement this redemption policy and recompute each student's total course points. Before proceeding, you should think about _why_ Professor Dylan has chosen to implement the redemption policy in terms of z-scores, rather than in terms of raw scores.

A few more things to consider:
- We rounded in the example above, but you should not round at any point in this part.
- After redemption, midterm exam grades should be capped at 1 (as a proportion), i.e. 100%.

It turns out that CSVs like `grades.csv` don't actually contain all of the information you'll need to implement this policy. For instance, `grades` only contains each student's total final exam grade, but not the number of points they earned on each question.

That information will come from another source. For the students whose grades are in `grades`, the CSV `data/final_exam_breakdown.csv` contains the number of points each student earned on each question of CSD 18's final exam. Run the cell below to load this CSV in as a DataFrame named `final_breakdown`.

In [37]:
final_breakdown_fp = Path('data') / 'final_exam_breakdown.csv'
final_breakdown = pd.read_csv(final_breakdown_fp)
final_breakdown.head()

Unnamed: 0,PID,Question 1 (5.0 pts),Question 2 (6.0 pts),Question 3 (8.0 pts),Question 4 (6.0 pts),Question 5 (10.0 pts),Question 6 (6.0 pts),Question 7 (10.0 pts),Question 8 (6.0 pts),Question 9 (9.0 pts),Question 10 (10.0 pts),Question 11 (4.0 pts),Question 12 (7.0 pts)
0,A99432453,3.0,6.0,8.0,5.0,5.0,6.0,4.0,3.0,4.0,4.0,4.0,4.0
1,A99152420,5.0,6.0,8.0,5.0,10.0,6.0,9.0,6.0,4.0,10.0,4.0,5.0
2,A99892710,3.0,5.0,8.0,4.0,4.0,5.0,3.0,6.0,9.0,9.0,4.0,7.0
3,A99381181,,,,,,,,,,,,
4,A99990217,4.0,6.0,8.0,6.0,7.0,6.0,7.0,5.0,9.0,4.0,4.0,7.0


Note that `final_breakdown` has the same number of rows as `grades`, but a different number of columns:

In [38]:
final_breakdown.shape

(535, 13)

Also note that student `'A99381181'` has a score of `NaN` for each question because they did not take the final exam:

In [39]:
grades.loc[grades['PID'] == 'A99381181', 'Final']

86   NaN
Name: Final, dtype: float64

### Question 8

([return to the outline](#Navigating-the-Project))

Let's get started.

#### `raw_redemption`

Complete the implementation of the function `raw_redemption`, which takes in a DataFrame like `final_breakdown` and a list of integers, corresponding to the question numbers for "redemption questions." The function should return a DataFrame with two columns:
- `'PID'`, the PID for each student in `final_breakdown`.
- `'Raw Redemption Score'`, which is the proportion of points each student earned, when only considering redemption questions.

For example, suppose `example_breakdown` is as follows:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>PID</th>
      <th>Question 1 (6.0 pts)</th>
      <th>Question 2 (3.0 pts)</th>
      <th>Question 3 (1.0 pts)</th>
      <th>Question 4 (4.5 pts)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>A99706914</td>
      <td>6</td>
      <td>3</td>
      <td>1</td>
      <td>4.5</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A99237411</td>
      <td>2</td>
      <td>0</td>
      <td>1</td>
      <td>4.5</td>
    </tr>
    <tr>
      <th>2</th>
      <td>A99489712</td>
      <td>4</td>
      <td>1</td>
      <td>0</td>
      <td>4.0</td>
    </tr>
  </tbody>
</table>

`raw_redemption(example_breakdown, [1, 3])` should return the following DataFrame:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>PID</th>
      <th>Raw Redemption Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>A99706914</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A99237411</td>
      <td>0.428571</td>
    </tr>
    <tr>
      <th>2</th>
      <td>A99489712</td>
      <td>0.571429</td>
    </tr>
  </tbody>
</table>



***Notes***:
- **Assume that for each question in `final_breakdown`, at least one student received a perfect score.**
- Assume that the input DataFrame will be in the same format as `final_breakdown`, in that the column at position 0 will be labeled `'PID'`, the column at position 1 will contain scores for Question 1, the column at position 2 will contain scores for Question 2, and so on.
- If a student didn't take the final, their raw redemption score should be 0.
- Again, do not round.
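
To see where the numbers in the example output come from, here is a minimal sketch that reproduces them by hand. It is only one way to do the arithmetic, not a required implementation: it relies on the "at least one perfect score" assumption to read each question's point value off the column maximum, and it does not address students with missing scores.

```python
import pandas as pd

# Toy data copied from the example_breakdown table above.
example_breakdown = pd.DataFrame({
    'PID': ['A99706914', 'A99237411', 'A99489712'],
    'Question 1 (6.0 pts)': [6, 2, 4],
    'Question 2 (3.0 pts)': [3, 0, 1],
    'Question 3 (1.0 pts)': [1, 1, 0],
    'Question 4 (4.5 pts)': [4.5, 4.5, 4.0],
})

# Redemption Questions 1 and 3 sit at column positions 1 and 3.
redemption = example_breakdown.iloc[:, [1, 3]]

# Since someone earned a perfect score on every question, each column's
# maximum equals the points available for that question.
points_possible = redemption.max().sum()   # 6 + 1 = 7
points_earned = redemption.sum(axis=1)     # 7, 3, 4

raw_scores = points_earned / points_possible
print(raw_scores.tolist())                 # 7/7, 3/7, 4/7
```

The three values are 7/7, 3/7, and 4/7, which match the `1.000000`, `0.428571`, and `0.571429` shown in the table.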

<br>

#### `combine_grades`

Then, complete the implementation of the function `combine_grades`, which takes in a DataFrame like `grades` and a DataFrame like the one returned by `raw_redemption`. The function should return a new DataFrame with all the columns from `grades`, plus a new column labelled `'Raw Redemption Score'` which contains the raw redemption score for each student.

***Hint***: We cannot directly add the `'Raw Redemption Score'` from the redemption DataFrame to the `grades` DataFrame, as the `'PID'` columns in the two DataFrames won't necessarily match.
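
The hint can be illustrated with a small sketch. The two toy DataFrames below are made up for illustration (only the column names `'PID'` and `'Raw Redemption Score'` come from this notebook), and `merge` is just one reasonable way to line rows up by PID.

```python
import pandas as pd

# Hypothetical toy data: same students, but listed in a different order.
grades_toy = pd.DataFrame({
    'PID': ['A99706914', 'A99237411', 'A99489712'],
    'lab01': [99.7, 98.8, 66.5],
})
redemption_toy = pd.DataFrame({
    'PID': ['A99489712', 'A99706914', 'A99237411'],
    'Raw Redemption Score': [0.571429, 1.0, 0.428571],
})

# Direct assignment pairs rows by index label and ignores PID,
# so the first student receives someone else's score.
misaligned = grades_toy.assign(
    **{'Raw Redemption Score': redemption_toy['Raw Redemption Score']}
)

# Merging on 'PID' pairs each student with their own score.
combined = grades_toy.merge(redemption_toy, on='PID')

print(misaligned[['PID', 'Raw Redemption Score']])
print(combined[['PID', 'Raw Redemption Score']])
```

In `misaligned`, student `'A99706914'` ends up with 0.571429, while in `combined` they correctly have 1.0.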

In [40]:
final_breakdown

Unnamed: 0,PID,Question 1 (5.0 pts),Question 2 (6.0 pts),Question 3 (8.0 pts),Question 4 (6.0 pts),Question 5 (10.0 pts),Question 6 (6.0 pts),Question 7 (10.0 pts),Question 8 (6.0 pts),Question 9 (9.0 pts),Question 10 (10.0 pts),Question 11 (4.0 pts),Question 12 (7.0 pts)
0,A99432453,3.0,6.0,8.0,5.0,5.0,6.0,4.0,3.0,4.0,4.0,4.0,4.0
1,A99152420,5.0,6.0,8.0,5.0,10.0,6.0,9.0,6.0,4.0,10.0,4.0,5.0
2,A99892710,3.0,5.0,8.0,4.0,4.0,5.0,3.0,6.0,9.0,9.0,4.0,7.0
3,A99381181,,,,,,,,,,,,
4,A99990217,4.0,6.0,8.0,6.0,7.0,6.0,7.0,5.0,9.0,4.0,4.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A99190272,5.0,6.0,8.0,6.0,9.0,6.0,5.0,6.0,9.0,10.0,4.0,6.0
531,A99330622,5.0,6.0,8.0,6.0,9.0,6.0,8.0,6.0,6.0,10.0,4.0,4.0
532,A99152694,3.0,3.0,5.0,6.0,7.0,5.0,9.0,6.0,7.0,10.0,3.0,7.0
533,A99174029,5.0,3.0,5.0,6.0,6.0,1.0,8.0,5.0,7.0,5.0,4.0,1.0


In [41]:
raw_redemption(final_breakdown, [1, 2])

Unnamed: 0,PID,Raw Redemption Score
0,A99432453,0.818182
1,A99152420,1.000000
2,A99892710,0.727273
3,A99381181,0.000000
4,A99990217,0.909091
...,...,...
530,A99190272,1.000000
531,A99330622,1.000000
532,A99152694,0.545455
533,A99174029,0.727273


In [42]:
grader.check("q8")

For our particular offering of CSD 18, the redemption questions on the final exam were Questions 1, 2, 3, 7, 9, and 12. Run the cell below to define a new DataFrame named `grades_combined` that results from calling the above two functions on grades from this class.

In [43]:
grades_combined = combine_grades(grades, raw_redemption(final_breakdown, [1, 2, 3, 7, 9, 12]))
grades_combined.head()

Unnamed: 0,PID,College,Level,Section,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),...,discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S),Raw Redemption Score
0,A99706914,ERC,JR,A22,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,...,8.895294,10,00:00:00,10.0,10,780:01:28,10.0,10,00:00:00,0.844444
1,A99237411,Eighth,JR,A29,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,...,9.022407,10,00:00:00,9.020283,10,00:00:00,9.437368,10,00:00:00,0.866667
2,A99690544,Revelle,SR,A12,86.513369,100.0,00:00:00,47.80282,100.0,00:00:00,...,3.030538,10,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00,0.777778
3,A99427381,Seventh,JR,A14,100.0,100.0,00:00:00,100.0,100.0,00:00:00,...,10.0,10,00:00:00,9.249126,10,00:00:00,10.0,10,00:00:00,0.888889
4,A99489712,Sixth,JR,A24,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,...,4.439606,10,00:00:00,4.485291,10,00:00:00,6.282712,10,00:00:00,0.822222


### Question 9

([return to the outline](#Navigating-the-Project))

Now that we have all of our information about each student in one DataFrame, we can compute their z-score on both the original midterm exam and the redemption questions on the final exam.

#### `z_score`

Complete the implementation of the function `z_score`, which takes in a Series of numbers and returns a Series in which all elements are converted to z-scores. As a reminder, to convert a sequence of numbers to z-scores, or standard units, we use the following formula:

$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of }x}$$

***Notes***:

- Make sure to set `ddof=0` in whichever method or function you use to compute standard deviation, because `numpy` and `pandas` use different default denominators: `np.std` defaults to `ddof=0`, while the `pandas` `.std` method defaults to `ddof=1`. (`ddof=0` computes the "population" standard deviation and `ddof=1` computes the "sample" standard deviation.)
- Do **not** fill null values – that is, if a value in the input Series is `NaN`, its value in the output Series should also be `NaN`. (Depending on how you implement `z_score`, this may happen automatically.)
    - Address null midterm scores in `add_post_redemption`, not `z_score`.
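
The notes above can be checked on a small made-up example. This is a sketch for building intuition, with arbitrary numbers chosen so that the mean is 5 and the population SD is exactly 2. It is not a prescribed implementation of `z_score`.

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# The defaults differ between the two libraries.
numpy_sd = np.std(values)                       # ddof=0 by default
pandas_sd = pd.Series(values).std()             # ddof=1 by default
pandas_sd_pop = pd.Series(values).std(ddof=0)   # matches numpy's default

# The same numbers with one missing value appended.
scores = pd.Series(list(values) + [np.nan])

# pandas skips NaN when computing the mean and SD, and arithmetic
# with NaN yields NaN, so the missing entry stays missing.
z = (scores - scores.mean()) / scores.std(ddof=0)

print(numpy_sd, pandas_sd, pandas_sd_pop)
print(z)
```

Here the first score, 2, is 1.5 population SDs below the mean, so its z-score is -1.5, and the last entry remains `NaN`.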

<br>

#### `add_post_redemption`

Complete the implementation of the function `add_post_redemption`, which takes in a DataFrame like `grades_combined` and returns a DataFrame with all the columns from `grades_combined` in addition to two new columns:
- `'Midterm Score Pre-Redemption'`, which contains each student's midterm exam score as a proportion between 0 and 1 **before** redemption.
- `'Midterm Score Post-Redemption'`, which contains each student's midterm exam score **after** the redemption policy has been applied, again as a proportion between 0 and 1.

You can use your `z_score` function to compute the z-scores of each student's original midterm exam grades and raw redemption scores. **Note that there are students who didn't take the midterm; such students need to have their `NaN` scores fixed prior to calculating their pre-redemption z-scores.** Otherwise, you may end up incorrectly giving them `NaN` post-redemption midterm scores. None of the redemption z-scores should be `NaN`, since you handled null values in your implementation of `raw_redemption`.

If it's not clear, **computing the `'Midterm Score Post-Redemption'` column is the most complicated part of this question**. Make sure you understand how the redemption policy for CSD 18 works before approaching this question. If you need to refresh your understanding, re-read the instructions at the start of [Part 2](#part2).

In [53]:
add_post_redemption(grades_combined)

Unnamed: 0,PID,College,Level,Section,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),...,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S),Raw Redemption Score,Midterm Score Pre-Redemption,Midterm Score Post-Redemption
0,A99706914,ERC,JR,A22,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,...,00:00:00,10.000000,10,780:01:28,10.000000,10,00:00:00,0.844444,1.000000,1.000000
1,A99237411,Eighth,JR,A29,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,...,00:00:00,9.020283,10,00:00:00,9.437368,10,00:00:00,0.866667,0.912166,0.912166
2,A99690544,Revelle,SR,A12,86.513369,100.0,00:00:00,47.802820,100.0,00:00:00,...,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00,0.777778,0.804012,0.804012
3,A99427381,Seventh,JR,A14,100.000000,100.0,00:00:00,100.000000,100.0,00:00:00,...,00:00:00,9.249126,10,00:00:00,10.000000,10,00:00:00,0.888889,0.947108,0.947108
4,A99489712,Sixth,JR,A24,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,...,00:00:00,4.485291,10,00:00:00,6.282712,10,00:00:00,0.822222,0.416396,0.826174
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A99073025,Warren,JR,A12,100.000000,100.0,47:26:10,82.022753,100.0,00:00:00,...,12:08:58,9.169447,10,00:00:00,10.000000,10,00:00:00,0.755556,0.882106,0.882106
531,A99257552,Warren,SO,A02,100.000000,100.0,00:00:00,87.498073,100.0,00:00:00,...,00:00:00,10.000000,10,00:00:00,10.000000,10,00:00:00,0.800000,0.537590,0.801022
532,A99592629,Revelle,JR,A15,88.656641,100.0,00:00:00,90.326041,100.0,00:00:00,...,00:00:00,8.878946,10,00:00:00,10.000000,10,00:00:00,0.711111,0.880445,0.880445
533,A99033808,Seventh,SR,A13,83.799719,100.0,00:00:00,85.636947,100.0,00:00:00,...,00:00:00,8.655478,10,419:06:41,8.102277,10,00:00:00,0.866667,1.000000,1.000000


In [51]:
grader.check("q9")

### Question 10

([return to the outline](#Navigating-the-Project))

Now, we're equipped to re-compute each student's course grade after the redemption policy.

#### `total_points_post_redemption`

Complete the implementation of the function `total_points_post_redemption`, which takes in a DataFrame like `grades_combined` and returns a Series containing each student's course grade after redemption. As a refresher, **course grades should be proportions between 0 and 1.**

You should not have to repeat any of your calculations for assignments other than the midterm exam – use your output from `total_points` and adjust it. Remember that, per [the syllabus](#The-Syllabus), the midterm exam is worth 15%.
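
As a sanity check on the "adjust it" idea, here is a short sketch using the rounded values this notebook displays for student `'A99489712'` (row 4). It assumes the midterm is the only component that changes and that it enters the course grade with weight 0.15. The variable names are ours.

```python
# Rounded values displayed in this notebook for row 4 (PID 'A99489712').
midterm_weight = 0.15
pre_total = 0.674545   # total_points output
pre_mid = 0.416396     # 'Midterm Score Pre-Redemption'
post_mid = 0.826174    # 'Midterm Score Post-Redemption'

# Swap the old midterm contribution for the new one.
post_total = pre_total + midterm_weight * (post_mid - pre_mid)
print(post_total)
```

Up to rounding in the displayed inputs, the result agrees with the 0.736012 shown for the same student in the output of `total_points_post_redemption`.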

<br>

#### `proportion_improved`

Finally, complete the implementation of the function `proportion_improved`, which takes in a DataFrame like `grades_combined` and returns the **proportion of students in the class whose letter grade increased** due to the redemption policy.

***Hints***:
- If you've implemented everything correctly, `proportion_improved(grades_combined)` should evaluate to a proportion between 0.07 and 0.12.
- Remember, it is impossible for a student's letter grade to decrease due to the redemption policy.
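
To make the second hint concrete, here is a small sketch using the pre- and post-redemption letter grades of the first five students as they appear in this notebook's preview of `grades_analysis`. Because grades can't decrease, a changed letter grade is an improved one. This is an illustration on five rows, not the full computation.

```python
import pandas as pd

# Letter grades for the first five students, as displayed in this notebook.
pre = pd.Series(['A', 'B', 'C', 'A', 'D'])
post = pd.Series(['A', 'B', 'C', 'A', 'C'])

# Under the redemption policy a letter grade can only stay the same or
# go up, so "changed" and "improved" describe the same students.
improved = pre != post
proportion = improved.mean()
print(proportion)
```

Only the last of these five students moved (from D to C), so the proportion for this tiny sample is 1/5.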

In [54]:
grades_combined

Unnamed: 0,PID,College,Level,Section,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),...,discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S),Raw Redemption Score
0,A99706914,ERC,JR,A22,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,...,8.895294,10,00:00:00,10.000000,10,780:01:28,10.000000,10,00:00:00,0.844444
1,A99237411,Eighth,JR,A29,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,...,9.022407,10,00:00:00,9.020283,10,00:00:00,9.437368,10,00:00:00,0.866667
2,A99690544,Revelle,SR,A12,86.513369,100.0,00:00:00,47.802820,100.0,00:00:00,...,3.030538,10,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00,0.777778
3,A99427381,Seventh,JR,A14,100.000000,100.0,00:00:00,100.000000,100.0,00:00:00,...,10.000000,10,00:00:00,9.249126,10,00:00:00,10.000000,10,00:00:00,0.888889
4,A99489712,Sixth,JR,A24,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,...,4.439606,10,00:00:00,4.485291,10,00:00:00,6.282712,10,00:00:00,0.822222
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A99073025,Warren,JR,A12,100.000000,100.0,47:26:10,82.022753,100.0,00:00:00,...,10.000000,10,12:08:58,9.169447,10,00:00:00,10.000000,10,00:00:00,0.755556
531,A99257552,Warren,SO,A02,100.000000,100.0,00:00:00,87.498073,100.0,00:00:00,...,10.000000,10,00:00:00,10.000000,10,00:00:00,10.000000,10,00:00:00,0.800000
532,A99592629,Revelle,JR,A15,88.656641,100.0,00:00:00,90.326041,100.0,00:00:00,...,9.878661,10,00:00:00,8.878946,10,00:00:00,10.000000,10,00:00:00,0.711111
533,A99033808,Seventh,SR,A13,83.799719,100.0,00:00:00,85.636947,100.0,00:00:00,...,7.759434,10,00:00:00,8.655478,10,419:06:41,8.102277,10,00:00:00,0.866667


In [58]:
add_post_redemption(grades_combined)

Unnamed: 0,PID,College,Level,Section,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),...,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S),Raw Redemption Score,Midterm Score Pre-Redemption,Midterm Score Post-Redemption
0,A99706914,ERC,JR,A22,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,...,00:00:00,10.000000,10,780:01:28,10.000000,10,00:00:00,0.844444,1.000000,1.000000
1,A99237411,Eighth,JR,A29,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,...,00:00:00,9.020283,10,00:00:00,9.437368,10,00:00:00,0.866667,0.912166,0.912166
2,A99690544,Revelle,SR,A12,86.513369,100.0,00:00:00,47.802820,100.0,00:00:00,...,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00,0.777778,0.804012,0.804012
3,A99427381,Seventh,JR,A14,100.000000,100.0,00:00:00,100.000000,100.0,00:00:00,...,00:00:00,9.249126,10,00:00:00,10.000000,10,00:00:00,0.888889,0.947108,0.947108
4,A99489712,Sixth,JR,A24,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,...,00:00:00,4.485291,10,00:00:00,6.282712,10,00:00:00,0.822222,0.416396,0.826174
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A99073025,Warren,JR,A12,100.000000,100.0,47:26:10,82.022753,100.0,00:00:00,...,12:08:58,9.169447,10,00:00:00,10.000000,10,00:00:00,0.755556,0.882106,0.882106
531,A99257552,Warren,SO,A02,100.000000,100.0,00:00:00,87.498073,100.0,00:00:00,...,00:00:00,10.000000,10,00:00:00,10.000000,10,00:00:00,0.800000,0.537590,0.801022
532,A99592629,Revelle,JR,A15,88.656641,100.0,00:00:00,90.326041,100.0,00:00:00,...,00:00:00,8.878946,10,00:00:00,10.000000,10,00:00:00,0.711111,0.880445,0.880445
533,A99033808,Seventh,SR,A13,83.799719,100.0,00:00:00,85.636947,100.0,00:00:00,...,00:00:00,8.655478,10,419:06:41,8.102277,10,00:00:00,0.866667,1.000000,1.000000


In [60]:
total_points_post_redemption(grades_combined)

0      0.902389
1      0.816924
2      0.759308
3      0.908499
4      0.736012
         ...   
530    0.864016
531    0.804905
532    0.859673
533    0.866263
534    0.921190
Length: 535, dtype: float64

In [61]:
total_points(grades_combined)

0      0.902389
1      0.816924
2      0.759308
3      0.908499
4      0.674545
         ...   
530    0.864016
531    0.765390
532    0.859673
533    0.866263
534    0.896365
Length: 535, dtype: float64

In [62]:
proportion_improved(grades_combined)

0.10654205607476636

In [63]:
grader.check("q10")

Great! Thanks to your implementation of the redemption policy, a sizeable fraction of CSD 18 students saw their letter grades improve.

<a name='part3'></a>

## Part 3: Analysis 🧠

([return to the outline](#Navigating-the-Project))


Now that we have students' letter grades before and after redemption, it's time to analyze how the class performed overall. First, because we're going to use them frequently in this part, we'll add a few extra columns to `grades_combined` and call the resulting DataFrame `grades_analysis`.

In [64]:
grades_analysis = grades_combined.assign(**{
    'Total Points Pre-Redemption': total_points(grades_combined),
    'Letter Grade Pre-Redemption': final_grades(total_points(grades_combined)),
    'Total Points Post-Redemption': total_points_post_redemption(grades_combined),
    'Letter Grade Post-Redemption': final_grades(total_points_post_redemption(grades_combined))
})
grades_analysis.head()

Unnamed: 0,PID,College,Level,Section,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),...,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S),Raw Redemption Score,Total Points Pre-Redemption,Letter Grade Pre-Redemption,Total Points Post-Redemption,Letter Grade Post-Redemption
0,A99706914,ERC,JR,A22,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,...,10,780:01:28,10.0,10,00:00:00,0.844444,0.902389,A,0.902389,A
1,A99237411,Eighth,JR,A29,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,...,10,00:00:00,9.437368,10,00:00:00,0.866667,0.816924,B,0.816924,B
2,A99690544,Revelle,SR,A12,86.513369,100.0,00:00:00,47.80282,100.0,00:00:00,...,10,00:00:00,9.624617,10,00:00:00,0.777778,0.759308,C,0.759308,C
3,A99427381,Seventh,JR,A14,100.0,100.0,00:00:00,100.0,100.0,00:00:00,...,10,00:00:00,10.0,10,00:00:00,0.888889,0.908499,A,0.908499,A
4,A99489712,Sixth,JR,A24,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,...,10,00:00:00,6.282712,10,00:00:00,0.822222,0.674545,D,0.736012,C


You may have noticed that `grades_analysis` has a `'Section'` column that we haven't yet touched. There are 30 unique values in the `'Section'` column – `'A01'`, `'A02'`, ..., `'A30'`, corresponding to the 30 different discussion sections the students in CSD 18 were enrolled in. Discussion sections and discussion assignments have nothing to do with one another, for the purposes of calculating grades, and moving forward, we'll refer to these just as "sections."

In [247]:
grades_analysis['Section'].nunique()

In [248]:
grades_analysis['Section'].unique()

Much of our analysis in this part will pertain to how students in different sections performed in CSD 18.

### Question 11

([return to the outline](#Navigating-the-Project))

#### `section_most_improved`

Complete the implementation of the function `section_most_improved`, which takes in a DataFrame like `grades_analysis` and returns the section in which **the greatest proportion of students had their letter grades increase due to the redemption policy**. For example, if 48\% of students in section `'A25'` had their letter grades increase due to the redemption policy, and no other section had more than 48\% of students increase, then `section_most_improved` should return `'A25'`. 

If there is a tie, return any one of the sections.
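
The idea of a "proportion improved per section" can be sketched on made-up data. The rows below are hypothetical, and the approach shown (a Boolean Series averaged within groups) is only one possible route.

```python
import pandas as pd

# Hypothetical toy data with the same column names as grades_analysis.
toy = pd.DataFrame({
    'Section': ['A01', 'A01', 'A02', 'A02', 'A02'],
    'Letter Grade Pre-Redemption': ['B', 'C', 'D', 'C', 'A'],
    'Letter Grade Post-Redemption': ['B', 'B', 'C', 'B', 'A'],
})

# 'A' < 'B' < ... as strings, so a better grade compares as "smaller".
improved = toy['Letter Grade Post-Redemption'] < toy['Letter Grade Pre-Redemption']

# The mean of a Boolean Series within each group is a proportion.
by_section = improved.groupby(toy['Section']).mean()
best = by_section.idxmax()

print(by_section)
print(best)
```

In this toy data, 1 of 2 students in `'A01'` improved and 2 of 3 in `'A02'` did, so `best` is `'A02'`.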

<br>

#### `top_sections`

Complete the implementation of the function `top_sections`, which takes in a DataFrame like `grades_analysis`, a float `t` between 0 and 1, and an integer `n`, and returns **an array containing the sections in which at least `n` students earned a raw score of at least `t` on the final exam**. The section names in the returned array should be sorted in alphanumeric order.

For example, `top_sections(grades_analysis, 0.75, 10)` should return an array of the sections in which at least 10 students scored at least 75% on the final exam.
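
Here is a rough sketch of the counting logic on made-up data. It assumes the final exam columns are named `'Final'` and `'Final - Max Points'`, following the naming pattern of the other assignments. Check the actual column names in `grades_analysis` before relying on this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are an assumption.
toy = pd.DataFrame({
    'Section': ['A02', 'A01', 'A02', 'A01', 'A03', 'A02'],
    'Final': [80.0, 90.0, 75.0, 50.0, 95.0, 40.0],
    'Final - Max Points': [100.0] * 6,
})
t, n = 0.75, 2

# Raw final exam score as a proportion between 0 and 1.
raw_final = toy['Final'] / toy['Final - Max Points']

# Count, per section, the students who scored at least t.
counts = toy.loc[raw_final >= t, 'Section'].value_counts()

# Keep the sections with at least n such students, sorted by name.
result = np.sort(counts[counts >= n].index.to_numpy())
print(result)
```

With `t = 0.75` and `n = 2`, only `'A02'` qualifies, since it has two students at or above 75% while `'A01'` and `'A03'` each have one.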

In [None]:
grader.check("q11")

### Question 12

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `rank_by_section`, which takes in a DataFrame like `grades_analysis` and returns a DataFrame describing **students' _ranks_ based on total points (post-redemption) for each section**.

Specifically, the DataFrame should have `n` rows that describe the rank – indexed `1`, `2`, ..., `n` (where `n` is the number of students in the largest section), in that order – and 30 columns – `'A01'`, `'A02'`, ..., `'A30'`, in that order. **The entry in row `r` and column `s` should correspond to the PID of the student who had the `r`th most total points in section `s`, after redemption.** For sections that have fewer than `n` students, fill the extra entries in those columns with **empty strings**. There might exist ties for students with total points of 0.

For instance, suppose there were only four sections, and the largest section had five students. The DataFrame returned by `rank_by_section` might look like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>Section</th>
      <th>A01</th>
      <th>A02</th>
      <th>A03</th>
      <th>A04</th>
    </tr>
    <tr>
      <th>Section Rank</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>A99404117</td>
      <td>A99318825</td>
      <td>A99093358</td>
      <td>A99339719</td>
    </tr>
    <tr>
      <th>2</th>
      <td>A99477753</td>
      <td>A99396913</td>
      <td>A99933171</td>
      <td>A99082089</td>
    </tr>
    <tr>
      <th>3</th>
      <td></td>
      <td>A99159214</td>
      <td>A99164028</td>
      <td>A99950565</td>
    </tr>
    <tr>
      <th>4</th>
      <td></td>
      <td>A99322859</td>
      <td></td>
      <td>A99715029</td>
    </tr>
    <tr>
      <th>5</th>
      <td></td>
      <td>A99739120</td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>

Note that the PIDs in your DataFrame will be different than those above; also note that your DataFrame may have a different string where the example has `'Section Rank'`, and that's fine.

***Hints***: 
- Our solution used `groupby` with a helper function, and then `pivot` on the result. This is a tricky problem – work through it one step at a time.

- Try to use `.sort_values` rather than `.rank` in this question, because by default `.rank` assigns tied values the mean of the ranks they span. For more information, please refer to the `.rank` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html).
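
The tie behavior described above is easy to see on a small made-up example. The sketch below also shows one way to turn sorted rows into a wide table. It uses `cumcount` rather than the helper-function approach mentioned in the first hint, and the PIDs `'P1'` through `'P5'` are placeholders.

```python
import pandas as pd

# Hypothetical toy data; two students in A01 are tied at 0 points.
toy = pd.DataFrame({
    'PID': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'Section': ['A01', 'A01', 'A01', 'A02', 'A02'],
    'Total Points Post-Redemption': [0.9, 0.0, 0.0, 0.7, 0.8],
})

# .rank gives the tied students the mean of ranks 2 and 3, i.e. 2.5.
ranks = toy.groupby('Section')['Total Points Post-Redemption'].rank(ascending=False)

# Sorting and then numbering rows within each section gives every
# student a distinct integer rank, even when there are ties.
ordered = toy.sort_values('Total Points Post-Redemption', ascending=False)
ordered = ordered.assign(
    **{'Section Rank': ordered.groupby('Section').cumcount() + 1}
)

# One column per section, one row per rank; pad short sections with ''.
wide = ordered.pivot(index='Section Rank', columns='Section', values='PID').fillna('')

print(ranks.tolist())
print(wide)
```

Here `ranks` contains 2.5 twice, which can't be used as a row label for a single student, whereas `wide` has rows 1, 2, and 3, with an empty string in row 3 of `'A02'`.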

In [None]:
grader.check("q12")

### Question 13

([return to the outline](#Navigating-the-Project))

To wrap up, let's _visualize_ how the students in each section of CSD 18 performed.

Complete the implementation of the function `letter_grade_heat_map`, which takes in a DataFrame like `grades_analysis` and returns a `plotly` figure object containing a heatmap describing **the distribution of letter grades (post-redemption) for each section**.

Specifically, the heatmap should have 5 rows – `'A'`, `'B'`, `'C'`, `'D'`, and `'F'`, in that order – and 30 columns – `'A01'`, `'A02'`, ..., `'A30'`, in that order. **The color of the square in row `g` and column `s` should correspond to the proportion of students in section `s` who earned a letter grade (post-redemption) of `g`.**

To create your figure, you'll use the `px.imshow` function and provide several arguments. This [`plotly` article](https://plotly.com/python/imshow/) will be extremely helpful.

Here are some additional requirements to get full credit for your heatmap:
- Set the color scale to be something other than the default. Note that in this heatmap, you should use a sequential color scheme, which means that the intensity of the color assigned to a square is proportional to the value being plotted for that square (e.g. darker colors should correspond to larger proportions and lighter colors should correspond to smaller proportions, or vice versa). Read more about the theory of sequential and diverging color schemes [here](https://blog.datawrapper.de/diverging-vs-sequential-color-scales/).
- Set the title of the plot to `'Distribution of Letter Grades by Section'`.

An example plot that satisfies all of these conditions is shown below, though we encourage you to customize yours within the confines above. Can you change the font?

<img src="data/heatmap-example.png" width=100%>

It's fine if your x-axis labels are rotated.

Remember to return the figure object itself. That is, somewhere in your code you will have `fig = px.imshow(...)`; make sure to also `return fig`.

***Hint***: Most of the work in this question is creating the DataFrame to call `px.imshow` on.
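
As a sketch of what that DataFrame might look like, the made-up example below builds a grade-by-section table of proportions with `pd.crosstab`. This is one possible approach among several. The resulting DataFrame is the kind of object you would then hand to `px.imshow`, along with your chosen color scale and title.

```python
import pandas as pd

# Hypothetical toy data with two sections.
toy = pd.DataFrame({
    'Section': ['A01', 'A01', 'A01', 'A01', 'A02', 'A02'],
    'Letter Grade Post-Redemption': ['A', 'A', 'B', 'F', 'C', 'C'],
})

# Rows are letter grades, columns are sections; normalize='columns'
# turns the counts in each section into proportions that sum to 1.
props = pd.crosstab(
    toy['Letter Grade Post-Redemption'],
    toy['Section'],
    normalize='columns',
)

# Make sure all five letter grades appear, in order, even when no
# student earned a particular grade.
props = props.reindex(['A', 'B', 'C', 'D', 'F'], fill_value=0.0)
print(props)
```

In this toy data no one earned a D, so the `'D'` row is all zeros, and each column still sums to 1.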

Run the cell below to see your heatmap.

In [264]:
# Run this cell to see the result, and don't change this cell --- it is needed for the tests.
fig = letter_grade_heat_map(grades_analysis)
fig.show()

In [None]:
grader.check("q13")

## Congratulations, you've finished Project 1! 🎉

As a reminder, all of the work you want to submit needs to be in `project.py` – this notebook should not be uploaded because there are no manually-graded questions in this project.

To ensure that all of the work you want to submit is in `project.py`, we've included a script named `project-validation.py` in the project folder. You shouldn't edit it, but instead, you should call it from the command line (e.g. the Terminal) to test your work.

Once you've finished the project, you should open the command line and run, in the directory for this project:

```
python project-validation.py
```

**This will run all of the `grader.check` cells that you see in this notebook, but only using the code in `project.py` – that is, it doesn't look at any of the code in this notebook. If all of your `grader.check` cells pass in this notebook but not all of them pass in your command line with the above command, then you likely have code in your notebook that isn't in your `project.py`!**

You can also use `project-validation.py` to test individual questions. For instance,

```
python project-validation.py q1 q4 q7 q8
```

will run the `grader.check` cells for Questions 1, 4, 7, and 8 – again, only using the code in `project.py`.

Once `python project-validation.py` shows that you're passing all test cases, you're ready to submit your `project.py` (and only your `project.py`) to Gradescope. After submitting to Gradescope, make sure to stick around until all test cases pass.

There is also a call to `grader.check_all()` below in _this_ notebook, but make sure to also follow the steps above.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()