# DSC 80: Project 01

### Due Date: Thursday, April 18, 11:59:59 PM

---
# Instructions

This Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems.  
* As in the labs, you will develop your code in the accompanying `project01.py` file, which is imported into this notebook. This code will be autograded.
* The project also has free response questions. To answer them, edit the markdown cell where specified (as in DSC 10). Submitting the project includes uploading a PDF of this notebook to Gradescope for manual grading.

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Do not change the free response cells outside the horizontal lines**
- The format of the cells will be used in grading the free response questions.


**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the HW! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `project01.py` (much like we do in the notebook).
- Always document your code!

**Tips for writing the free response questions**:
- You should treat the notebook as a final report for the assignment, containing conclusions and answers to open ended questions that are graded.
- When you submit the notebook, it should not contain extraneous code (e.g. debugging code). It should contain only your answers to the questions, plus the code and output that serve as evidence for your responses.
- Generally, the free response questions will involve you *using* the functions defined in your `.py` file to justify portions of your argument.
- They should not be long, verbose answers! Typically a short paragraph will do.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import project01 as proj

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np

import os

# The other side of Gradescope

The file contains the grade-book from a fictional data science course with 535 students. 

**Note: this dataset is synthetically generated; it does not contain real student grades.**

In this project, you will:
1. clean and process the data to compute total course grades according to the fictional syllabus (below),
2. qualitatively understand how students did in the course,
3. create a curve and assess its effect.

---

The course syllabus is as follows:

* The course consists of HW assignments, projects, 1 midterm, and a final exam.
* The weights of the course components are HW (20%), projects (30%), midterm (20%), final (30%).
* For the HW assignments, students can revise an assignment for one week after submission for a 10% penalty, for two weeks after submission for a 20% penalty, and beyond that for a 50% penalty. Such revisions are reflected in the `Lateness` columns in the gradebook.
* The lowest HW assignment is dropped.
* Students can earn extra-credit through the `extra-credit` assignment, as well as turning in project checkpoints. All of the extra-credit should amount to the equivalent of *one HW assignment*.
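
As a rough illustration of how the syllabus weights combine, the sketch below assumes each component has already been reduced to a proportion between 0 and 1. The names `SYLLABUS_WEIGHTS` and `weighted_course_total` are hypothetical and are not part of the starter code. Extra credit is deliberately left out.

```python
# Hypothetical helper (not in the starter code): combine component
# proportions using the weights stated in the syllabus.
SYLLABUS_WEIGHTS = {'hw': 0.20, 'project': 0.30, 'midterm': 0.20, 'final': 0.30}

def weighted_course_total(components):
    """components maps each key of SYLLABUS_WEIGHTS to a proportion in [0, 1]."""
    return sum(SYLLABUS_WEIGHTS[name] * components[name]
               for name in SYLLABUS_WEIGHTS)

# Made-up component scores for a single student.
example = {'hw': 0.9, 'project': 0.8, 'midterm': 0.7, 'final': 0.85}
total = weighted_course_total(example)
```

Because the function only multiplies and adds, the same code should also work when the dictionary values are pandas Series with one entry per student.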

### A note on generalization

You may assume that your code will only need to work on a gradebook for a class with the syllabus given above. That is, you may assume that the dataframe `grades` looks like the given one (in `data/grades.csv`), but 
1. may have more/fewer HW and projects,
2. may have more/fewer students.

You may assume the course components and the naming conventions are as given in the data file. 

In [4]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)

In [5]:
hw_cols = []
max_points = []
lateness_cols = []

for c in grades.columns:
    if(c in ['hw%0*d' % (2, d) for d in range(1,100)] ):
        hw_cols.append(c)
    if(c in ['hw%0*d - Max Points' % (2, d) for d in range(1,100)] ):
        max_points.append(c)
    if(c in ['hw%0*d - Lateness (H:M:S)' % (2, d) for d in range(1,100)] ):
        lateness_cols.append(c)

### Computing homework grades

First, you will clean and process the HW grades. To do this, you will develop functions that normalize the grades, adjust for lateness, drop the lowest grade, and total the HW grades for each student.

*Note:* You should adapt the questions in this section to process the project assignments as well, as you will need to compute the project grades for a later question. The two are similar (but not identical).

**Question 1**

Create a function `normalize_hw` that takes in a dataframe like `grades` and outputs a dataframe of normalized HW grades (see doctest for the format of the output). The output should **not** take the late penalty into account.


In [6]:
hw = pd.DataFrame(columns = hw_cols)
for i in range(len(hw_cols)):
    hw_col = hw_cols[i]
    mp_col = max_points[i]
    hw[hw_col] = grades[hw_col] / grades[mp_col]

**Question 2**

Unfortunately, Gradescope sometimes experiences a delay in registering when an assignment is submitted during "periods of heavy usage" (i.e. near a submission deadline). You need to assess when a student's assignment was actually turned in on time, even if Gradescope did not process it in time. To do this, it is helpful to know:
* Every late submission has to be submitted by a TA (late submissions are turned off).
* TAs never submitted a late assignment "just after" the deadline. 
* The deadlines were at midnight and students had to come to staff hours to late-submit their assignment.

Create a function `last_minute_submissions` that takes in the dataframe `grades` and outputs the number of submissions that were turned in on time by the student and marked 'late' by Gradescope (for each homework assignment). See the doctest for more details.

*Note:* You have to figure out what's truly a late submission by looking at the data and understanding the facts about the data generating process above. There is some ambiguity in finding which submissions are truly late; your answer will be specific to this dataset.
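
One way to make the lateness values easier to inspect is to convert the `H:M:S` strings into fractional hours first. The sketch below is only an illustration on made-up values. The helper name `lateness_to_hours` is hypothetical, and it assumes every entry is a well-formed `H:M:S` string.

```python
import pandas as pd

def lateness_to_hours(lateness):
    """Convert a Series of 'H:M:S' strings into fractional hours."""
    # Split into three integer columns: hours, minutes, seconds.
    parts = lateness.str.split(':', expand=True).astype(int)
    return parts[0] + parts[1] / 60 + parts[2] / 3600

# Made-up lateness values, not taken from data/grades.csv.
sample = pd.Series(['00:00:00', '01:30:00', '200:15:36'])
hours = lateness_to_hours(sample)
```

Plotting or sorting these hours for each assignment is one way to look for a gap between submissions delayed by processing and submissions that were truly late.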

In [8]:
late_count = []
for i in range(len(lateness_cols)):
    lateness_col = lateness_cols[i]
    
    late_time_split = grades.loc[grades[lateness_col] != '00:00:00', lateness_col].str.split(':')
    count = late_time_split.apply(lambda x: int(x[0]) < 9).sum()
    late_count.append(count)
    
late = pd.Series(late_count, index = hw_cols)
late

hw01     2
hw02     0
hw03     2
hw04    12
hw05     7
hw06     8
hw07    16
hw08    11
hw09    26
dtype: int64

**Question 3**

Now you need to adjust the HW grades for late submissions. Create a function `adjust_lateness` that takes in the dataframe `grades` and returns a dataframe of HW grades adjusted for lateness according to the syllabus. Only *truly* late submissions should be counted as late (as in question 2). The adjusted HW grades should be proportions between 0 and 1.

*Note:* You should use your work from question 1 here!
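
The penalty schedule from the syllabus can be sketched as a vectorized lookup. This is a minimal illustration, not the required implementation. `lateness_multiplier` and its `grace_hours` parameter are hypothetical names, the grace threshold must come from your own analysis in Question 2, and the handling of exact boundary values (`<=` here) is an assumption.

```python
import numpy as np
import pandas as pd

def lateness_multiplier(hours_late, grace_hours=0.0):
    """Map hours late to a grade multiplier following the syllabus penalties."""
    hours_late = pd.Series(hours_late, dtype=float)
    conditions = [
        hours_late <= grace_hours,   # treated as on time: no penalty
        hours_late <= 7 * 24,        # within one week: 10% penalty
        hours_late <= 14 * 24,       # within two weeks: 20% penalty
    ]
    choices = [1.0, 0.9, 0.8]
    # Anything later than two weeks gets the 50% penalty.
    values = np.select(conditions, choices, default=0.5)
    return pd.Series(values, index=hours_late.index)

# Made-up example with a hypothetical 2-hour grace window.
multipliers = lateness_multiplier([0, 5, 200, 400], grace_hours=2)
```

Multiplying the normalized scores by these multipliers column by column avoids looping over individual students.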

In [150]:
grades = pd.read_csv(grades_fp)

one_week = 168
two_week = 336
    
hw = proj.normalize_hw(grades)
    
def adjust(hms):
    index = next(index_iter)
    if(int(hms[0]) < one_week):
        hw.loc[index, col] = hw.loc[index, col] * 0.9
    elif(int(hms[0]) < two_week):
        hw.loc[index, col] = hw.loc[index, col] * 0.8
    else:
        hw.loc[index, col] = hw.loc[index, col] * 0.5
        
for i in range(len(lateness_cols)):
    lateness_col = lateness_cols[i]

    marked_late_df = grades.loc[grades[lateness_col] != '00:00:00']
    late_time_split = marked_late_df[lateness_col].str.split(':')
    actually_late = late_time_split.apply(lambda x: int(x[0]) >= 9)
    actually_late_df = marked_late_df.loc[actually_late == True]
    
    col = hw_cols[i]
    index_iter = iter(actually_late_df[lateness_col].str.split(":").index)
    
    actually_late_df[lateness_col].str.split(":").apply(adjust)
hw

Unnamed: 0,hw01,hw02,hw03,hw04,hw05,hw06,hw07,hw08,hw09
0,0.990,0.860,0.720,0.980,1.000000,0.976471,0.485,0.880,0.860
1,0.980,0.520,0.730,0.770,1.000000,0.500000,0.890,0.940,0.860
2,0.860,0.450,0.400,0.730,0.900000,0.429412,0.720,0.710,0.760
3,1.000,1.000,0.920,0.910,0.885714,0.670588,1.000,0.950,0.780
4,0.660,0.330,0.690,0.729,0.642857,0.741176,0.600,0.360,1.000
5,0.910,1.000,1.000,0.970,1.000000,0.917647,0.910,1.000,0.860
6,0.960,0.470,0.740,0.756,0.871429,0.752941,0.490,0.640,0.540
7,1.000,0.980,0.970,1.000,0.971429,0.952941,0.910,0.920,0.980
8,0.900,1.000,0.990,0.873,1.000000,0.476471,0.860,0.620,1.000
9,1.000,1.000,0.820,0.801,0.900000,0.458824,0.920,0.500,0.740


**Question 4**

Create a function `hw_total` that takes in a dataframe of lateness-adjusted HW grades, and computes the total HW grade for each student according to the syllabus. All homework assignments should be equally weighted. Your answer should be a proportion between 0 and 1. (Don't forget to drop the lowest score!)

*Note*: Don't forget to properly handle students who didn't turn in assignments! (Use your experience and common sense)

In [12]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
adj = proj.adjust_lateness(grades)

adj = adj.fillna(0)
def adjusted_mean(data):
    return (np.mean(data) * len(data) - data.min()) / (len(data) - 1)
adj.apply(adjusted_mean , axis=1)


0      0.908309
1      0.836250
2      0.694926
3      0.930714
4      0.677879
5      0.963456
6      0.718796
7      0.971796
8      0.905375
9      0.835125
10     0.898382
11     0.675000
12     0.867911
13     0.931429
14     0.905000
15     0.852710
16     0.923897
17     0.596607
18     0.866250
19     0.835746
20     0.840082
21     0.863569
22     0.683036
23     0.796339
24     0.782712
25     0.827779
26     0.798588
27     0.814054
28     0.823411
29     0.849107
         ...   
505    0.890875
506    0.894518
507    0.557379
508    0.915000
509    0.972500
510    0.896250
511    0.780200
512    0.878929
513    0.752300
514    0.865893
515    0.865000
516    0.851429
517    0.838265
518    0.904375
519    0.905221
520    0.911250
521    0.956250
522    0.903500
523    0.773618
524    0.809397
525    0.915641
526    0.873687
527    0.988088
528    0.883319
529    0.739950
530    0.895250
531    0.823414
532    0.938571
533    0.845357
534    0.856460
Length: 535, dtype: float64

**Question 5** 

Now, you want to understand the effect that "missing assignments" have on the HW grade distribution.

* Create a function `average_student` that takes in a dataframe like `grades` and outputs the overall HW grade of a student who hypothetically received the average grade on each HW assignment. When computing the 'average of each assignment' you *shouldn't* include people who didn't turn in the assignment.

* Is this value lower or higher than the average total HW grades given by the function `hw_total`? Write your answer in the function `higher_or_lower`.

In [14]:
total = []
mp = 0
for i in range(len(hw_cols)):
    col = hw_cols[i]
    mp_col = max_points[i]
    
    temp = grades[col].dropna()
    total.append(temp.mean())
    mp += grades[mp_col].mean()

total.remove(min(total))
np.mean(total)


79.017427896830185

In [15]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = proj.average_student(grades)
import numbers
print(isinstance(out, numbers.Real))
print(np.isclose(out, 80, atol=5))

True
True


### Computing extra-credit grades

**Question 6**

Compute the extra credit grades. To do this, you need to identify which assignments are extra-credit, total them up, *then* normalize them (the extra-credit assignments should *not* all have equal weight). To find the extra-credit assignments **read the syllabus**.

Create a function `extra_credit_total` that takes in a dataframe like `grades` and returns the total extra-credit grade as a proportion between 0 and 1.

In [17]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)


In [18]:
ec_cols = []
for c in grades.columns:
    if("extra-credit" in c):
        ec_cols.append(c)
    if("checkpoint" in c):
        ec_cols.append(c)
ec = []
ec_max_points = []
ec_lateness = []
for c in ec_cols:
    if("Max Points" in c):
        ec_max_points.append(c)
    elif("Lateness" in c):
        ec_lateness.append(c)
    else:
        ec.append(c)

In [19]:
grades[ec].sum(axis=1) / grades[ec_max_points].sum(axis=1)

0      0.542857
1      0.514286
2      0.314286
3      0.085714
4      0.028571
5      0.457143
6      0.128571
7      0.500000
8      0.571429
9      0.471429
10     0.528571
11     0.285714
12     0.400000
13     0.528571
14     0.371429
15     0.342857
16     0.414286
17     0.142857
18     0.628571
19     0.485714
20     0.300000
21     0.471429
22     0.000000
23     0.242857
24     0.442857
25     0.128571
26     0.171429
27     0.285714
28     0.300000
29     0.114286
         ...   
505    0.614286
506    0.571429
507    0.000000
508    0.642857
509    0.571429
510    0.285714
511    0.114286
512    0.585714
513    0.000000
514    0.514286
515    0.385714
516    0.385714
517    0.428571
518    0.485714
519    0.157143
520    0.185714
521    0.214286
522    0.471429
523    0.285714
524    0.200000
525    0.071429
526    0.485714
527    0.742857
528    0.214286
529    0.014286
530    0.585714
531    0.214286
532    0.228571
533    0.285714
534    0.171429
Length: 535, dtype: float64

### Putting it together

**Question 7**

Finally, you need to create the final course grades. To do this, you will add up the total of each course component according to the weights given in the syllabus. 

* Create a function `total_points` that takes in `grades` and returns the final course grades according to the syllabus. Course grades should be proportions between zero and one.
* Create a function `final_grades` that takes in the final course grades as above and returns a Series of letter grades given by the standard cutoffs (`A >= .90`, `.90 > B >= .80`, `.80 > C >= .70`, `.70 > D >= .60`, `.60 > F`). You should not use rounding when determining the letter grades.
* Create a function `letter_proportions` which takes in the dataframe `grades` and outputs a Series that contains the proportion of the class that received each grade. (This question requires you to put everything together).
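
The cutoffs above can also be expressed as bins. The sketch below uses `pd.cut` with left-closed intervals so that, for example, exactly `0.90` maps to `A`. The helper name `to_letters` is hypothetical and the scores are made up.

```python
import numpy as np
import pandas as pd

def to_letters(totals):
    """Map course totals (proportions) to letter grades without rounding."""
    bins = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
    labels = ['F', 'D', 'C', 'B', 'A']
    # right=False makes each bin [low, high), matching "A >= .90" etc.
    return pd.cut(totals, bins=bins, labels=labels, right=False).astype(str)

# Made-up totals.
letters = to_letters(pd.Series([0.92, 0.81, 0.41, 0.90, 0.60]))
# Share of the class receiving each letter.
proportions = letters.value_counts(normalize=True)
```

An explicit `if`/`elif` chain applied to each value gives the same result. The binned version simply keeps the cutoffs in one place.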

*Note*: You can and should use your functions from previous questions in this problem!

*Note*: You need to create a helper function that is an analogue to question 1 for the projects. Be aware that projects may consist of both autograded (final) and free-response portions. The checkpoints are part of the extra-credit.

Verify for yourself the course grade distribution and relevant statistics!

In [151]:
hw_grade = proj.adjust_lateness(grades)
num_hw = hw_grade.shape[1]
hw_grade = proj.hw_total(hw_grade) + (proj.extra_credit_total(grades) / num_hw)
hw_grade = hw_grade * 0.20


midterm_cols = []
final_cols = []
for c in grades.columns:
    if("Midterm" in c):
        midterm_cols.append(c)
    if("Final" in c):
        final_cols.append(c)

midterm_grade = (grades[midterm_cols[0]] / grades[midterm_cols[1]]) * 0.20
final_grade = (grades[final_cols[0]] / grades[final_cols[1]]) * 0.30

def normalize_proj(grades):
    proj_cols = []
    for c in grades.columns:
        if("project" in c):
            if("checkpoint" not in c):
                proj_cols.append(c)
    proj_points_cols = []
    proj_max_points_cols = []
    proj_lateness_cols = []
    for c in proj_cols:
        if('Max Points' in c):
            proj_max_points_cols.append(c)
        elif('Lateness' in c):
            proj_lateness_cols.append(c)
        else:
            proj_points_cols.append(c)

    proj_df = pd.DataFrame()
    for i in np.arange(1, len(proj_points_cols) + 1):

        proj_num = 'project{:02d}'.format(i)
        proj_points= []
        proj_max_points = []
        proj_lateness = []

        for j in np.arange(len(proj_points_cols)):
            if(proj_num in proj_points_cols[j]):
                proj_points.append(proj_points_cols[j])
            if(proj_num in proj_max_points_cols[j]):
                proj_max_points.append(proj_max_points_cols[j])
            if(proj_num in proj_lateness_cols[j]):
                proj_lateness.append(proj_lateness_cols[j])

        if(len(proj_points) == 0):
            break

        proj_df[proj_num] = grades[proj_points].sum(axis=1) / grades[proj_max_points].sum(axis=1)
        
    return proj_df
proj_df = normalize_proj(grades)
proj_grade = proj_df.mean(axis=1) * 0.30

total = hw_grade + proj_grade + midterm_grade + final_grade

In [22]:
#A >= .90, .90 > B >= .80, .80 > C >= .70, .70 > D >= .60, .60 > F

def letter(final_grade):
    if(final_grade >= 0.9):
        return 'A'
    elif(final_grade >= 0.8):
        return 'B'
    elif(final_grade >= 0.7):
        return 'C'
    elif(final_grade >= 0.6):
        return 'D'
    else:
        return 'F'
total.apply(letter)

out = proj.final_grades(pd.Series([0.92, 0.81, 0.41]))
print(np.all(out == ['A', 'B', 'F']))

True


In [23]:
total = proj.total_points(grades)
grade = proj.final_grades(total)
grade.value_counts() / grade.shape[0]

B    0.506542
C    0.287850
A    0.091589
D    0.085981
F    0.028037
dtype: float64

### Do Sophomores get better grades?

**Question 8**

You notice that students who are sophomores on average did better in the class (if you can't verify this, you should go back and check your work!). Is this difference significant, or just due to noise?

Perform a hypothesis test, assessing likelihood of the null hypothesis: 
> "sophomores earn grades that are roughly equal on average to the rest of the class."


Create a function `simulate_pval` which takes in the number of simulations `N` and `grades` and returns the likelihood that the grade of sophomores was no better on average than the class as a whole (i.e. calculate the p-value).
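
The simulation idea can be sketched on synthetic numbers: repeatedly draw a random group of the same size as the group of interest from the whole class, and count how often its mean is at least as large as the observed mean. The function name `simulate_pval_sketch`, its arguments, and the `seed` parameter are hypothetical. The real `simulate_pval` takes `N` and `grades` instead.

```python
import numpy as np

def simulate_pval_sketch(totals, is_group, n_sims=1000, seed=0):
    """Estimate P(random group mean >= observed group mean) under the null."""
    rng = np.random.default_rng(seed)
    totals = np.asarray(totals, dtype=float)
    is_group = np.asarray(is_group, dtype=bool)

    observed = totals[is_group].mean()
    group_size = int(is_group.sum())

    # Draw same-sized groups from the whole class, without replacement.
    simulated = np.array([
        rng.choice(totals, size=group_size, replace=False).mean()
        for _ in range(n_sims)
    ])
    return float(np.mean(simulated >= observed))

# Synthetic class: the "group" is deliberately the ten highest scores.
fake_totals = np.arange(100) / 100
fake_group = fake_totals >= 0.9
p_value = simulate_pval_sketch(fake_totals, fake_group)
```

A small p-value means a randomly chosen group rarely does as well as the observed group, which is evidence against the null hypothesis.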

In [24]:
proj.total_points(grades.loc[grades['Level'] == 'SO']).mean()

0.8300956308315962

In [25]:
sophomores = grades.loc[grades['Level'] == "SO"]
soph_grades = proj.total_points(sophomores)
observed_stat = np.mean(soph_grades)
class_grades = proj.total_points(grades)
    
averages = []
N = 10000
for i in range(N):
    random_sample = class_grades.sample(sophomores.shape[0], replace = False)
    curr_avg = np.mean(random_sample)
    averages.append(curr_avg)
        
averages = np.array(averages)
np.count_nonzero(averages >= observed_stat) / N


0.0187

### Creating a curve

You realize that certain assignments in the course were harder than others, and you would like to take this into account. You feel that doing very well on a difficult assignment should have more effect than doing well on an easy one. You decide to try out a curve as follows:

1. Convert *every* assignment to [Standard Units](https://www.inferentialthinking.com/chapters/14/2/Variability.html#standard-units).
2. Calculate the proportion of the course grade that every assignment represents.
3. Calculate the weighted sum of the standardized assignment scores and their weights.
4. Now that you have a sorted list of total scores, assign the same number of each letter grade as in the un-curved distribution (this allows for an entire class to get `A`s for example, if the class is easy).
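
Steps 1 through 3 can be sketched on a toy gradebook as follows. This is an illustration under stated assumptions: the helper names and the data are made up, missing values and lateness are ignored, and the population standard deviation (`ddof=0`) is used, whereas pandas' default `.std()` uses `ddof=1`.

```python
import pandas as pd

def standard_units(col):
    """Convert one column of scores to standard units (mean 0, SD 1)."""
    return (col - col.mean()) / col.std(ddof=0)

def curved_score_sketch(scores, weights):
    """Weighted sum of standardized scores; weights maps column name to proportion."""
    standardized = scores.apply(standard_units)  # applied column by column
    return standardized.mul(pd.Series(weights), axis=1).sum(axis=1)

# Toy data: one easy and one hard assignment, three students.
toy = pd.DataFrame({'easy': [90, 95, 100], 'hard': [40, 60, 80]})
curved = curved_score_sketch(toy, {'easy': 0.4, 'hard': 0.6})
```

After standardizing, a student at the class mean on every assignment ends up with a curved score of 0, regardless of how hard each assignment was.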

**Question 9**

Create a function `get_assignment_proportions` that takes in `grades` and returns a dictionary 
* keyed by assignment name 
* with values given by the proportion of the final grade that assignment makes up. 

*Note*: Every column in `grades` that represents a student score should be a key.

In [63]:
#1
su_df = pd.DataFrame()
assignment_cols = []
assignment_mp_cols = []

for c in grades.columns:
    if("hw" in c or "project" in c or "extra-credit" in c or 'Midterm' in c or 'Final' in c):
        if("Max Points" not in c and "Lateness" not in c):
            assignment_cols.append(c)
        elif("Max Points" in c):
            assignment_mp_cols.append(c)
for c in assignment_cols:
    su_df[c] = (grades[c] - grades[c].mean()) / grades[c].std()
#su_df


Index(['hw01', 'hw02', 'project01', 'hw03', 'project01_free_response',
       'extra-credit', 'hw04', 'hw05', 'project02_checkpoint01', 'Midterm',
       'hw06', 'project02_checkpoint02', 'hw07', 'project02_final',
       'project02_free_response', 'hw08', 'hw09', 'project03_checkpoint01',
       'project03_final', 'Final'],
      dtype='object')

In [100]:
#2
hw_cols = []
ec_cols = []
proj_cols = []
proj_max_points_cols = []
for c in grades.columns:
    if(c in ['hw%0*d' % (2, d) for d in range(1,100)] ):
        hw_cols.append(c)
    if("checkpoint" in c or 'extra-credit' in c):
        if('Max Points' not in c and 'Lateness' not in c):
            ec_cols.append(c)
    
    if("project" in c):
        if('checkpoint' not in c):
            if('Max Points' not in c and 'Lateness' not in c):
                proj_cols.append(c)
            if('Max Points' in c):
                proj_max_points_cols.append(c)

num_hw = len(hw_cols)
hw_prop = 0.2 / num_hw
num_ec = len(ec_cols)

proportions = dict()
for c in hw_cols:
    proportions[c] = hw_prop
for c in ec_cols:
    proportions[c] = hw_prop / num_ec

proj_combined_cols = normalize_proj(grades).columns
num_proj = len(proj_combined_cols)
proj_prop = 0.3 / num_proj

for proj_num in proj_combined_cols:
    part_of_proj = []
    for c in proj_max_points_cols:
        if proj_num in c:
            part_of_proj.append(c)

    proj_num_total = grades[part_of_proj].mean().sum()
    for col, max_point_col in zip(proj_cols, proj_max_points_cols):
        # Only weight the parts that belong to the current project.
        if proj_num in col:
            weight = grades[max_point_col].mean()
            proportions[col] = (weight / proj_num_total) * proj_prop

proportions['Midterm'] = 0.2
proportions['Final'] = 0.3


In [28]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = proj.get_assignment_proportions(grades)
print('project01_free_response' in out.keys())
print('project03_checkpoint01' in out.keys())
print(np.isclose(sum(out.values()), 1.0222, atol=0.01))

True
True
True


**Question 10**

Create a function `curved_total_points` which takes in `grades` and outputs the curved total scores for each student. For the HW questions, grade adjustments should *still* be made for late-submissions, however, for simplicity, **do not** drop the lowest HW assignment. 

*Note*: When standardizing scores, the mean/std that you are standardizing to should *not* incorporate missing values. However, a missing assignment *should* be set to zero *before* standardizing (otherwise, a student could earn an average score by skipping all the work!).
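
The order of operations in this note can be sketched as follows: compute the mean and standard deviation from the submitted scores only, then fill missing scores with zero, then standardize. The helper name `standardize_with_missing` is hypothetical, the data is made up, and the choice of `ddof=0` is an assumption.

```python
import numpy as np
import pandas as pd

def standardize_with_missing(col):
    """Standardize a score column, treating missing work as a zero score."""
    mean = col.mean()        # pandas skips NaN by default
    std = col.std(ddof=0)    # also computed from non-missing values only
    # Missing work becomes 0 *before* standardizing, so it is penalized.
    return (col.fillna(0) - mean) / std

# Two submitted scores and one missing assignment.
scores = pd.Series([80.0, 100.0, np.nan])
standardized = standardize_with_missing(scores)
```

Filling with zero *after* standardizing would instead place the missing assignment exactly at the class average.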

Create a function `curved_letter_grades` which takes in:
1. a Series of curved course grades (as above),
2. a Series of letter grade distributions (e.g. the output of `letter_proportions`)

and returns a Series containing the letter grade of each student according to the curve.    

*Note:* You may find the `np.percentile` function useful here!

In [193]:
su_df = pd.DataFrame()

proportions = proj.get_assignment_proportions(grades)
assignment_cols = proportions.keys()

for c in assignment_cols:
    su_df[c] = (grades[c] - grades[c].mean()) / grades[c].std()

su_df = su_df.fillna(0)

def curve(student):
    su_grade = 0
    for c in su_df.columns:
        su_grade += proportions[c] * student[c]
    return su_grade
curved_points = su_df.apply(curve, axis=1)
curved_points

0      0.550334
1     -0.066361
2     -0.532369
3      0.542574
4     -1.199240
5      0.623490
6      0.013805
7      1.081229
8      0.308239
9      0.442001
10     0.707017
11    -0.851103
12     0.418066
13     0.919248
14    -0.218232
15     0.281184
16     0.702039
17    -0.825229
18     0.746545
19     0.278802
20    -0.704219
21     0.229005
22     0.045726
23     0.411947
24    -0.260452
25    -1.154006
26     0.088384
27    -0.059990
28     0.453386
29    -0.466237
         ...   
505    0.168025
506    0.205196
507   -1.162098
508   -0.242987
509    1.017191
510   -0.225114
511   -1.157019
512    0.713340
513   -0.621706
514    0.317614
515   -0.105656
516   -0.084234
517   -0.916144
518    0.829970
519    0.455772
520   -0.170949
521    0.812430
522    0.662728
523   -0.341280
524   -0.416228
525    0.250805
526   -0.152888
527    0.662811
528   -0.251441
529   -0.025974
530    0.142018
531   -0.877857
532    0.157077
533    0.356239
534    0.497789
Length: 535, dtype: float64

In [197]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = proj.curved_total_points(grades)
print(isinstance(out, pd.Series))
print(out.max() < 2)
print(out.min() > -10)

True
True
True



In [186]:
#prop = proj.letter_proportions(grades)

def curved_letter(final_grade_su):
    if(final_grade_su >= cutoff[0]):
        return 'A'
    elif(final_grade_su >= cutoff[1]):
        return 'B'
    elif(final_grade_su >= cutoff[2]):
        return 'C'
    elif(final_grade_su >= cutoff[3]):
        return 'D'
    else:
        return 'F'

prop = pd.Series([0.2]*5, index='A B C D F'.split())
curved_grades = pd.Series([-0.2, 0, -0.5, 0.2, 2, -1, -3.1, 3, 0.4, 5])

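# Sort so the proportions are ordered A, B, C, D, F.
prop = prop.sort_index()
cutoff = []
percentile = 100
# Use a separate loop variable so that `prop` is not overwritten.
for p in prop:
    percentile -= (p * 100)
    cutoff.append(np.percentile(curved_grades, percentile))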
    
(curved_grades).apply(curved_letter)

0    D
1    C
2    D
3    C
4    B
5    F
6    F
7    A
8    B
9    A
dtype: object

### Assessing the curve

**Question 11**

Do data analysis to understand the effect the curve has on students' grades in the given course. Write a summary of your analysis in the free response section below. You should address:
1.  Was there a change in the median letter grade in the course between the not-curved/curved grades?
2. How many students saw a grade increase due to the curve? Why did their grades increase?
3. How many students saw a grade decrease due to the curve? Why did their grades decrease?
4. Describe a hypothetical class where a student's grade might decrease due to implementing such a curve.
5. Discuss the advantages and disadvantages of using the curve over grading on a straight-scale.

**Free Response Cell**

---

**Response to Question 10 here**

1. No 
2. 50
3. 34
4. 
5. Some advantages of using this curve is that if you did fairly well on a hard assignment (above the mean), you would expect to get a higher grade. We would expect harder assignments to have more people 


---

In [241]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
curved_grades = proj.curved_total_points(grades)
prop = proj.letter_proportions(grades)

curved = proj.curved_letter_grades(curved_grades, prop)
curved.value_counts()

B    271
C    154
A     49
D     46
F     15
dtype: int64

In [212]:
uncurved_grades = proj.total_points(grades)
uncurved = proj.final_grades(uncurved_grades)
uncurved.value_counts()

B    271
C    154
A     49
D     46
F     15
dtype: int64

In [237]:
change = (curved != uncurved)
grade_diff = (curved[change] + uncurved[change])
def test(grade_diff):
    if(grade_diff == 'AB' or grade_diff == 'AC' or grade_diff == 'AD' or grade_diff == 'AF'
       or grade_diff == 'BC' or grade_diff == 'BD' or grade_diff == 'BF'
       or grade_diff == 'CD' or grade_diff == 'CF'
       or grade_diff == 'DF'):
        return 1
    else:
        return 0
len(grade_diff) - grade_diff.apply(test).sum()

50

In [236]:
len(grade_diff)

84

# Congratulations, you finished the project!

### Before you submit:
* Be sure you run the doctests on all your code in project01.py
* Be sure your free response questions are all answered, readable, and that you haven't changed the cells outside the horizontal lines!

### To submit:
* **Convert the notebook to PDF and upload to gradescope for grading the free response.**
* **Upload the .py file to gradescope**