<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject #1: Trends in High School GPAs</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/trends-in-high-school-gpas/">https://discovery.cs.illinois.edu/microproject/trends-in-high-school-gpas/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Common Data Set

The [Common Data Set (CDS)](https://commondataset.org/) is an annual report published by nearly every major college and university in the United States with the goal to achieve the *"development of clear, standard data items and definitions in order to determine a specific cohort relevant to each item"*.

As part of the Common Data Set, universities report the "percentage of all enrolled, degree-seeking, first-time, first-year students [...] high school grade-point averages" in the following GPA ranges:

- Percentage of incoming freshmen with a high school GPA of 4.00
- Percentage of incoming freshmen with a high school GPA of 3.75 or greater
- Percentage of incoming freshmen with a high school GPA of 3.50 or greater
- ...and so on...

For example, the University of Wisconsin-Madison's [2024 CDS](https://data.wisc.edu/common-data-set-and-rankings/) reports the following high school GPAs for the freshman class entering in Fall 2023:

| High School GPA | Percentage of Freshman Class |
| --------------: | ---------------------------: |
| =4.00            | 47.8%                        |
| $\ge$ 3.75 (includes 4.00s)    | 83.8%  |
| $\ge$ 3.50     | 95.9%                        |
| $\ge$ 3.25     | 98.8%                        |
| $\ge$ 3.00     | 99.7%                        |
| $\ge$ 2.50     | 100%                        |
| $\ge$ 2.00     | 100%                        |
| $\ge$ 1.00     | 100%                        |
| $\ge$ 0.00     | 100%                        |
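
Because each row of the table is cumulative, the share of students inside a single GPA band is the difference between two adjacent rows. The following is a minimal sketch of that arithmetic using the UW-Madison values from the table above; the variable names are illustrative only and are not part of the dataset.

```python
# Cumulative percentages copied from the UW-Madison table above.
cumulative = {
    "=4.00": 47.8,
    ">=3.75": 83.8,
    ">=3.50": 95.9,
    ">=3.25": 98.8,
    ">=3.00": 99.7,
}

labels = list(cumulative)
bands = {}
for higher, lower in zip(labels, labels[1:]):
    # Students counted by the lower threshold but not by the higher one.
    # round() hides tiny floating-point noise from the subtraction.
    bands[f"{lower} but not {higher}"] = round(cumulative[lower] - cumulative[higher], 1)

for band, pct in bands.items():
    print(f"{band}: {pct}%")
```

For instance, 83.8% minus 47.8% gives 36.0% of the class with a GPA of at least 3.75 that was not a 4.00.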

In January 2025, we compiled all of the high school GPAs from the Common Data Sets published by all Big Ten universities and provide them as a dataset, `cds-high-school-gpas.csv`.  In this MicroProject, you will nerd out with this data and explore any trends in the high school GPAs of freshmen at Big Ten schools.  Let's nerd out! 🎉


### Background Knowledge

To finish this MicroProject, we assume you already know how to:

- Load a CSV file into a DataFrame using `pd.read_csv` ([review loading a CSV file](https://discovery.cs.illinois.edu/learn/Basics-of-Data-Science-with-Python/Python-for-Data-Science-Introduction-to-DataFrames/)), and
- Perform simple row selection of a DataFrame ([review row selection](https://discovery.cs.illinois.edu/learn/Basics-of-Data-Science-with-Python/Row-Selection-with-DataFrames/)).

With that knowledge, this MicroProject will guide you through nerding out with the Common Data Set and creating a visualization from a DataFrame.  Let's get started! :)
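
As a quick refresher on the first prerequisite, here is a small sketch of `pd.read_csv` on made-up data. The CSV text, the column names (`Team`, `Season`, `Score`), and the variable `toy` are all hypothetical and unrelated to the CDS dataset; an in-memory string is used only so the example runs without a network connection.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file or URL.
csv_text = """Team,Season,Score
Ravens,2022,81.5
Ravens,2023,84.0
Owls,2023,77.25
"""

# pd.read_csv accepts a file path, a URL string, or a file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)            # (number of rows, number of columns)
print(list(toy.columns))    # column names taken from the first CSV line
```

Passing a URL string in place of `io.StringIO(csv_text)` works the same way, as long as the URL points at CSV data.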

<hr style="color: #DD3403;">

## Part 1: Importing the CDS Big Ten High School GPAs Dataset

You can find the `cds-high-school-gpas.csv` dataset at the following URL: https://waf.cs.illinois.edu/discovery/cds-high-school-gpas.csv

Load the dataset from the URL as a new DataFrame in a variable named `df`:

In [0]:
# Load the URL as a new DataFrame in a variable named `df`:
# (The URL is: "https://waf.cs.illinois.edu/discovery/cds-high-school-gpas.csv")
...

Once you have the dataset loaded, you can display the DataFrame by placing the variable name on the last line of any Python cell.  For example, run the following cell that just contains the variable named `df` to see the DataFrame you just loaded:

In [0]:
df
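
Beyond displaying the whole table, it can help to list the column names on their own, since later steps depend on typing them exactly. The sketch below uses a hypothetical toy DataFrame (the names `Team`, `Season`, and `>=80` are invented for illustration) to show that a column name containing symbols is still just a string that must match character for character.

```python
import pandas as pd

# Hypothetical toy data; these column names are NOT from the CDS dataset.
toy = pd.DataFrame({
    "Team": ["Ravens", "Owls", "Hawks"],
    "Season": [2023, 2023, 2023],
    ">=80": [61.0, 42.5, 77.0],
})

# List the exact column names as plain Python strings.
print(toy.columns.tolist())

# A column is selected by its exact name, symbols included.
print(toy[">=80"])

# A near miss (extra space) is a different string and is not a column.
print(">= 80" in toy.columns)
```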

<hr style="color: #DD3403;">

## Part 2: Exploring the Data

One of the first steps in Data Science is to understand your data and explore the dataset!

### Part 2.1: Highest Percentage of Freshman with 4.00 High School GPAs

First, find the **ten rows** that have the highest percentage of incoming freshmen with a 4.00 GPA.  Store those ten rows in a new variable named `df_highest400`.

Helpful Tips: 
- You will need to reference Part 1 to understand the structure of the dataset to find the exact name of the column.
- You can review the [DISCOVERY page: "Row Selection with DataFrames"](https://discovery.cs.illinois.edu/learn/Basics-of-Data-Science-with-Python/Row-Selection-with-DataFrames/) to find out how to select rows with the largest values.
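
As an illustration of selecting rows with the largest values, here is a sketch on a hypothetical toy DataFrame (the `Team` and `Score` columns are made up). It uses pandas' `nlargest`, which is one common way to do this; sorting and taking the first rows is an equivalent alternative.

```python
import pandas as pd

# Hypothetical toy data; not the CDS dataset.
toy = pd.DataFrame({
    "Team": ["Ravens", "Owls", "Hawks", "Finches", "Larks"],
    "Score": [81.5, 77.25, 90.0, 68.0, 88.5],
})

# Keep the 2 rows with the largest values in the "Score" column.
top2 = toy.nlargest(2, "Score")
print(top2)

# An equivalent approach: sort from largest to smallest, then keep the first 2 rows.
top2_sorted = toy.sort_values("Score", ascending=False).head(2)
print(top2_sorted)
```

Both results keep every column of the original rows, so the output is still a subset of the original DataFrame.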

In [0]:
# Find the **ten rows** that have the highest percentage of incoming freshmen with a 4.00 GPA:
df_highest400 = ...
df_highest400

In [0]:
### TEST CASE for Part 2.1: Highest Percentage of Freshman with 4.00 High School GPAs
#
# What is this cell?
# - This cell contains test cases for the MicroProject.  You can modify anything except
#   the first line of this cell, but we will replace this cell with a new version of this
#   cell when your MicroProject is graded.  It's usually best to not change this cell!
#
# - To run the test cases we have for you, just run this Python cell like any other cell! :)
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try to make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests are meant to give you confidence that code you already understand is
#   actually correct, not to be a robust check that catches all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)
#
# - You will find more cells that begin with the words "TEST CASE" throughout the
#   notebook at important points to make sure everything is looking good so far!
#

tada = "\N{PARTY POPPER}"

assert( "df_highest400" in vars() ), \
  "You must define a new variable called `df_highest400`."

assert( "DataFrame" in str(type(df_highest400)) ), \
  "The variable `df_highest400` must contain a DataFrame."

assert( "School" in df_highest400 ), \
  "The DataFrame stored in `df_highest400` does not contain a column \"School\".  Are you sure it's a subset of the original DataFrame?"

assert( ">=3.75" in df_highest400 ), \
  "The DataFrame stored in `df_highest400` does not contain a column \">=3.75\".  Are you sure it's a subset of the original DataFrame?"

assert( len(df_highest400) == 10 ), \
  "The DataFrame stored in `df_highest400` must ONLY contain 10 rows."

import math
assert( math.isclose( df_highest400["=4.00"].sum(), 509.01 ) ), \
  "The DataFrame stored in `df_highest400` must contain ONLY the rows with the highest percentage of students with 4.00 GPAs."

assert( len(df_highest400["School"].unique()) == 2 ), \
  "The DataFrame stored in `df_highest400` must contain ONLY the rows with the highest percentage of students with 4.00 GPAs."

print(f"{tada} All tests passed! {tada}")

### Part 2.2: Exploring UW-Madison Data

In the introduction, we previewed the Fall 2023 freshman class at the University of Wisconsin-Madison.  To understand if there's a trend in the data, we want to select **ALL** years of data from University of Wisconsin-Madison.
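
Selecting every row that matches a single value is usually done with a boolean mask. The sketch below shows the idea on a hypothetical toy DataFrame; the `Team`, `Season`, and `Score` columns are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names here are illustrative only.
toy = pd.DataFrame({
    "Team": ["Ravens", "Owls", "Ravens", "Owls"],
    "Season": [2022, 2022, 2023, 2023],
    "Score": [81.5, 76.0, 84.0, 77.25],
})

# Comparing a column to a value gives one True/False per row (a "mask").
mask = toy["Team"] == "Ravens"
print(mask.tolist())

# Indexing the DataFrame with the mask keeps only the rows marked True.
ravens = toy[mask]
print(ravens)
```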

To do this, store the rows with data about the University of Wisconsin-Madison in a new variable named `df_wisconsin`:

In [0]:
# Select all the rows with data about the University of Wisconsin-Madison
# and store those rows in a new variable named `df_wisconsin`:
df_wisconsin = ...
df_wisconsin

In [0]:
### TEST CASE for Part 2.2: Exploring UW-Madison Data
tada = "\N{PARTY POPPER}"

assert( "df_wisconsin" in vars() ), \
  "You must define a new variable called `df_wisconsin`."

assert( "DataFrame" in str(type(df_wisconsin)) ), \
  "The variable `df_wisconsin` must contain a DataFrame."

assert( "School" in df_wisconsin ), \
  "The DataFrame stored in `df_wisconsin` does not contain a column \"School\".  Are you sure it's a subset of the original DataFrame?"

assert( ">=3.75" in df_wisconsin ), \
  "The DataFrame stored in `df_wisconsin` does not contain a column \">=3.75\".  Are you sure it's a subset of the original DataFrame?"

assert( len(df_wisconsin["School"].unique()) == 1 ), \
  "The DataFrame stored in `df_wisconsin` must ONLY contain data about \"University of Wisconsin-Madison\" and no other schools (your DataFrame contains multiple schools). "

assert( df_wisconsin["School"].unique()[0] == "University of Wisconsin-Madison" ), \
  "The DataFrame stored in `df_wisconsin` must ONLY contain data about \"University of Wisconsin-Madison\" and no other schools (your DataFrame contains a different school). "

assert( len(df_wisconsin) == 24 ), \
  "The DataFrame stored in `df_wisconsin` must contain ALL of the rows about \"University of Wisconsin-Madison\" (your DataFrame contains the incorrect number of rows). "

import math
assert( math.isclose( df_wisconsin[">=3.75"].std(), 11.990700587738619 ) ), \
  "The DataFrame stored in `df_wisconsin` contains incorrect data (did you accidentally change the data somewhere?)"

print(f"{tada} All tests passed! {tada}")

### Part 2.3: Exploring Data From U-Michigan

An extremely helpful tool is to list **ALL the unique values for a given column** by using the command:

> ```py
> df["Column"].unique()
> ```

To list all of the different universities that appear in the dataset, we would need to find the **unique values for the column `School`**.  Using the syntax above, list all of the unique Big Ten universities stored in the DataFrame `df`:

In [0]:
# List all of the unique Big Ten universities stored in the DataFrame `df`:
...

Once you have a list of all the University names, find the exact spelling and capitalization for The University of Michigan as it appears in the dataset.  Then, just like in Part 2.2, select all the rows from the original DataFrame (`df`) that contain data about The University of Michigan.
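
Exact spelling and capitalization matter because string comparison in Python is case-sensitive and never matches partial names. A small sketch on hypothetical toy data (the team names and columns are made up) shows how `.unique()` reveals the exact values and how a near miss silently selects zero rows:

```python
import pandas as pd

# Hypothetical toy data with two similar-looking names.
toy = pd.DataFrame({
    "Team": ["Ravens", "State Ravens", "Ravens", "State Ravens"],
    "Score": [81.5, 70.0, 84.0, 72.5],
})

# See the exact spelling of every distinct value in the column.
names = toy["Team"].unique()
print(names)

# A different capitalization is a different string: no rows match, and no error is raised.
wrong = toy[toy["Team"] == "state ravens"]
print(len(wrong))

# The exact string matches only the rows with exactly that name.
right = toy[toy["Team"] == "State Ravens"]
print(len(right))
```

Checking `len(...)` after a selection is a quick way to notice when a name did not match anything.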

Store your data in a new variable called `df_michigan`:

In [0]:
# Select all the rows from the original DataFrame (`df`) that contain data about
# The University of Michigan.  Store your data in a new variable called `df_michigan`:
df_michigan = ...
df_michigan

In [0]:
### TEST CASE for Part 2.3: Exploring Data From U-Michigan
tada = "\N{PARTY POPPER}"

assert( "df_michigan" in vars() ), \
  "You must define a new variable called `df_michigan`."

assert( "DataFrame" in str(type(df_michigan)) ), \
  "The variable `df_michigan` must contain a DataFrame."

assert( "School" in df_michigan ), \
  "The DataFrame stored in `df_michigan` does not contain a column \"School\".  Are you sure it's a subset of the original DataFrame?"

assert( ">=3.75" in df_michigan ), \
  "The DataFrame stored in `df_michigan` does not contain a column \">=3.75\".  Are you sure it's a subset of the original DataFrame?"

assert( len(df_michigan["School"].unique()) == 1 ), \
  "The DataFrame stored in `df_michigan` must ONLY contain data about \"University of Michigan\" and no other schools (your DataFrame contains multiple schools). "

assert( df_michigan["School"].unique()[0] == "University of Michigan" ), \
  "The DataFrame stored in `df_michigan` must ONLY contain data about \"University of Michigan\" and no other schools (your DataFrame contains a different school). "

assert( len(df_michigan) == 25 ), \
  "The DataFrame stored in `df_michigan` must contain ALL of the rows about \"University of Michigan\" (your DataFrame contains the incorrect number of rows). "

import math
assert( math.isclose( df_michigan[">=3.75"].std(), 10.547687577630183 ) ), \
  "The DataFrame stored in `df_michigan` contains incorrect data (did you accidentally change the data somewhere?)"

print(f"{tada} All tests passed! {tada}")

### Part 2.4: Exploring Data From Michigan State

Finally, let's create a variable for one final school!  Find all the rows about Michigan State and store those rows in a variable called `df_michiganState`:

In [0]:
# Select all the rows from the original DataFrame (`df`) that contain data about
# Michigan State.  Store your data in a new variable called `df_michiganState`:
df_michiganState = ...
df_michiganState

In [0]:
### TEST CASE for Part 2.4: Exploring Data From Michigan State
tada = "\N{PARTY POPPER}"

assert( "df_michiganState" in vars() ), \
  "You must define a new variable called `df_michiganState`."

assert( "DataFrame" in str(type(df_michiganState)) ), \
  "The variable `df_michiganState` must contain a DataFrame."

assert( "School" in df_michiganState ), \
  "The DataFrame stored in `df_michiganState` does not contain a column \"School\".  Are you sure it's a subset of the original DataFrame?"

assert( ">=3.75" in df_michiganState ), \
  "The DataFrame stored in `df_michiganState` does not contain a column \">=3.75\".  Are you sure it's a subset of the original DataFrame?"

assert( len(df_michiganState["School"].unique()) == 1 ), \
  "The DataFrame stored in `df_michiganState` must ONLY contain data about \"Michigan State\" and no other schools (your DataFrame contains multiple schools). "

assert( df_michiganState["School"].unique()[0] == "Michigan State" ), \
  "The DataFrame stored in `df_michiganState` must ONLY contain data about \"Michigan State\" and no other schools (your DataFrame contains a different school). "

assert( len(df_michiganState) == 25 ), \
  "The DataFrame stored in `df_michiganState` must contain ALL of the rows about \"Michigan State\" (your DataFrame contains the incorrect number of rows). "

import math
assert( math.isclose( df_michiganState[">=3.75"].std(), 8.284110410546061 ) ), \
  "The DataFrame stored in `df_michiganState` contains incorrect data (did you accidentally change the data somewhere?)"

print(f"{tada} All tests passed! {tada}")

<hr style="color: #DD3403;">

## Part 3: Data Visualization

If you look back over the three DataFrames you created in Part 2.2 (Wisconsin), 2.3 (U-Michigan), and 2.4 (Michigan State), you can examine the tables for trends.  However, it's often more impactful to see the data visually!

For any visualization, we generally need to specify:
- What data do we want on the **x-axis**?
- What data do we want on the **y-axis**?

### Part 3.1: X-Axis Values

To understand a trend over time, it's common to visualize the year on the x-axis.  To help create this visualization, find the name of the column in your DataFrame that contains the year of the data and store it in the Python variable `x_column` below:

In [0]:
x_column = ...
x_column

### Part 3.2: Y-Axis Values

In addition, we need to decide which data we want to track over time.
- Do we want to visualize only students who have a 3.00? 
- ...or those with at least a 3.50? 
- ...or a perfect 4.0?
- ...or something else entirely?
- We can choose any column in our dataset!

To start, find the name of the column that contains the **percentage of all freshmen who have a high school GPA of 3.50 or greater**.  Store that column name in the variable `y_column`:

In [0]:
y_column = ...
y_column

In [0]:
### MINI TEST CASE for Part 3.1 and Part 3.2
tada = "\N{PARTY POPPER}"
assert( isinstance(x_column, str) ), "The value stored in `x_column` must be just the column name for the x-values of data -- not the actual data. (Make sure your `x_column` variable is a string, not a whole column.)"
assert( isinstance(y_column, str) ), "The value stored in `y_column` must be just the column name for the y-values of data -- not the actual data. (Make sure your `y_column` variable is a string, not a whole column.)"
print(f"{tada} All tests passed! {tada}")

### Part 3.3: Create the visualization!

In Module 2 of DISCOVERY (the very next module!), you will learn how to create a visualization -- but take a peek at the code below before you run it.
- In the first line, you will find we use your `df_wisconsin` that you created in Part 2.2.  This is how we source data about the high school GPAs of the freshman class at The University of Wisconsin. 
- Immediately after `df_wisconsin`, we use `.plot.line` to create a line graph with the data! This line of code is stored in the variable `ax` that is referenced later in the code to add modifications to the original line graph.
- This is repeated in a very similar way in Line #2 and Line #3 below for `df_michigan` and `df_michiganState`. Note the `ax=` parameter in these lines, which adds each university's line to the original Wisconsin-Madison line graph stored in the variable `ax`.

You'll create line plots like this, along with dozens of other visualizations, when we cover Data Visualization in DISCOVERY in Module 2!  However, for now, we've provided it for you. :)
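
To see the role of the `ax=` parameter in isolation, here is a small sketch with two hypothetical toy DataFrames (the `Season` and `Score` columns and the team names are invented). It checks that the second call draws onto the same axes rather than creating a new figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; remove this line inside Jupyter
import pandas as pd

# Two hypothetical toy DataFrames with the same columns.
ravens = pd.DataFrame({"Season": [2021, 2022, 2023], "Score": [80.0, 81.5, 84.0]})
owls = pd.DataFrame({"Season": [2021, 2022, 2023], "Score": [75.0, 76.0, 77.25]})

# The first call creates the plot and returns its axes object.
ax = ravens.plot.line(x="Season", y="Score")

# Passing ax= draws the second line onto those same axes.
returned = owls.plot.line(x="Season", y="Score", ax=ax)
ax.legend(["Ravens", "Owls"])

print(returned is ax)   # the same axes object is handed back
print(len(ax.lines))    # number of lines currently drawn on the axes
```

Without `ax=ax`, the second call would start a separate plot instead of adding a line to the first one.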

When you're ready, run the code to create the visualization:

In [0]:
# Line 1: Line Plot for University of Wisconsin (using `df_wisconsin`)
ax = df_wisconsin.plot.line(x=x_column, y=y_column)

# Line 2: Line Plot for University of Michigan (using `df_michigan`)
df_michigan.plot.line(x=x_column, y=y_column, ax=ax)

# Line 3: Line Plot for Michigan State (using `df_michiganState`)
df_michiganState.plot.line(x=x_column, y=y_column, ax=ax)

# Setting up meaningful label names:
ax.set_ylabel(f"% of Freshman Class with High School GPA {y_column}")
ax.set_title("Trends in High School GPAs among Incoming Freshman Classes")
ax.set_xticks(range(2004, 2024))
ax.set_xticklabels(labels=range(2004, 2024), rotation=90)
ax.legend( ["Wisconsin", "Michigan", "Michigan State"])

### Part 3.4: Modify the Data Being Visualized

Finally, return to Part 3.2 and modify the `y_column` to visualize the percentage of the freshman class that has a **high school GPA of 3.75 or greater**.

⚠️ - You must modify the `y_column` cell and re-run that cell **AND** generate a new graph by re-running the visualization cell in order to pass the next test case! - ⚠️
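
Both cells need re-running because Python computes a value at the moment a line runs and does not update it when a variable changes later. A minimal sketch with hypothetical names (`chosen`, `label_before`, `label_after` are illustrative only):

```python
# First "cell": pick a column name and build a label from it.
chosen = "Score"
label_before = f"Plotting column {chosen}"

# Later "cell": the variable changes...
chosen = "Rank"

# ...but anything built earlier still holds the old value.
print(label_before)

# Re-running the line that uses the variable picks up the new value.
label_after = f"Plotting column {chosen}"
print(label_after)
```

A plot behaves the same way: it reflects the variables as they were when the plotting cell last ran.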


In [0]:
### TEST CASE for Part 3: Data Visualization
assert(len(ax.lines) >= 3), "You must have (at least) three line plots in your visualization."
assert(y_column == ">=3.75"), "Make sure to read Part 3.4 and update the value of y_column variable to the correct value."

values = []
for line in ax.lines:
  values.append( line.get_data()[1].data[0] )

assert( 83.8 in values ), "You must have a line for data from the University of Wisconsin."
assert( 92.3 in values ), "You must have a line for data from the University of Michigan."
assert( 55.3 in values ), "You must have a line for data from Michigan State."
print(f"{tada} All tests passed! {tada}")

<hr style="color: #DD3403;">

## JKLMNOP: Just Keep Letting Me Nerd Out Please!

The visualization you created forms the foundation of the visualization *"Trends in High School GPAs among Incoming Freshman Classes of Big Ten Schools"*, which won first prize at The University of Illinois in the Big Ten Academic Alliance's 2025 Data Visualization Championship:

- You can view the visualization here: https://waf.cs.illinois.edu/visualizations/Trends-in-High-School-GPAs-of-Incoming-Freshman/
- The first of the two line graphs on the page linked above (about halfway down the page) shows the same data you graphed in Part 3. 🎉

If you want to keep nerding out, here are two additional challenges!
- You can complete them by adding new "Code" cells below this cell in your Jupyter notebook.
- The JKLMNOP challenges are not required to finish the MicroProject, but they give you a way to explore a little bit more! :)
- If you want to skip these challenges, you can continue to "Run the GitHub Action Autograder" to grade your MicroProject.

**JKLMNOP Challenge #1**: Continue building your data visualization and include all ten schools in the dataset in a single visualization!

**JKLMNOP Challenge #2**: As an advanced challenge, add the "B1G Average" line to your visualization similar to what is displayed in the *"Trends in High School GPAs among Incoming Freshman Classes of Big Ten Schools"* visualization linked above.
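
One possible starting point for an average line is to group rows by time period and take the mean of a column. The sketch below does this on hypothetical toy data (the `Team`, `Season`, and `Score` columns are invented). It computes a simple unweighted mean, which is an assumption: how the "B1G Average" in the linked visualization was calculated is not stated here.

```python
import pandas as pd

# Hypothetical toy data: two teams, two seasons each.
toy = pd.DataFrame({
    "Team": ["Ravens", "Owls", "Ravens", "Owls"],
    "Season": [2022, 2022, 2023, 2023],
    "Score": [80.0, 76.0, 84.0, 78.0],
})

# Group the rows by season and average the "Score" column within each group.
# as_index=False keeps "Season" as a regular column, which is handy for plotting.
average = toy.groupby("Season", as_index=False)["Score"].mean()
print(average)
```

The resulting DataFrame has one row per season, so it can be drawn as one more line on an existing plot in the same way as the per-school DataFrames.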

In [0]:
# You can add as many or as few new code cells as you need! :)

<hr style="color: #DD3403;">

## Run the GitHub Action Autograder

You're almost done!  All that's left is to commit your lab to GitHub and then run the GitHub Action grader on it! 

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**.

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/trends-in-high-school-gpas/ and complete the section **"Commit and Grade Your Notebook"**.

3. Once you have a green check mark (✅) on your GitHub Action for this MicroProject, you have completed this MicroProject! 🎉