<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: World University Rankings</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/world-university-rankings/">https://discovery.cs.illinois.edu/microproject/world-university-rankings/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: The Times Higher Education

There are hundreds of organizations that rank universities, including US News and World Report, QS World University Rankings, Times Higher Education (THE), and many others.

The Times Higher Education (THE) provides a clean, well-documented CSV of their rankings, which are based on "performance data on universities for students and their families, academics, university leaders, governments and industry".  Their 2020 dataset covers almost 1,400 universities across 92 countries and uses 13 performance indicators that measure an institution’s performance across teaching, research, knowledge transfer and international outlook.  Additional details on this dataset are available on their website: https://www.timeshighereducation.com/content/world-university-rankings

In this MicroProject, you will explore basic DataFrame operations on the Times Higher Education university rankings.

<hr style="color: #DD3403;">

## Importing the World University Rank Dataset

To use the `pandas` library, you must first **import** it into your notebook. Import pandas as `pd` in the following cell:

In [1]:
import pandas as pd

Use pandas' `read_csv` function to read the `World_University_Rank_2020.csv` file (already provided for you) and store that data in a DataFrame called `df`:

In [2]:
# Read the CSV into a new DataFrame `df`:
df = pd.read_csv("World_University_Rank_2020.csv")

In [3]:
### TEST CASE for Importing the World University Rank Dataset
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert("df" in vars()), "Make sure to name the dataframe df"
assert(df["University"].iloc[0] == "University of Oxford")
assert(df["University"].iloc[1392] == "Pontifical Catholic University of Minas Gerais")
assert("University" in df)
assert("Score" not in df)
print(f"{tada} All Tests Passed! {tada}") 

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Puzzle 1: Rankings with the United States

In this dataset, each row represents one university.

Find the variable that encodes where the university is located.  Using a conditional, create a new DataFrame called `df_US` that includes the subset of data containing only the universities in the United States:

In [4]:
df_US = df[df.Country == 'United States']
df_US

Unnamed: 0,Score_Rank,University,Country,Number_students,Numb_students_per_Staff,International_Students,Percentage_Female,Percentage_Male,Teaching,Research,Citations,Industry_Income,International_Outlook,Score_Result,Overall_Ranking
1,2,California Institute of Technology,United States,2240,6.4,30%,34%,66%,92.1,97.2,97.9,88.0,82.5,94.5,94.50
3,4,Stanford University,United States,16135,7.3,23%,43%,57%,92.8,96.4,99.9,66.2,79.5,94.3,94.30
4,5,Massachusetts Institute of Technology,United States,11247,8.6,34%,39%,61%,90.5,92.4,99.5,86.9,89.0,93.6,93.60
5,6,Princeton University,United States,7983,8.1,25%,45%,55%,90.3,96.3,98.8,58.6,81.1,93.2,93.20
6,7,Harvard University,United States,20823,9.2,24%,49%,51%,89.2,98.6,99.1,47.3,76.3,93.0,93.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1011,428,Oakland University,United States,15887,18.9,4%,57%,43%,25.5,12.7,24.5,35.3,28.9,21.9,10.7–22.1
1015,429,"California State University, Long Beach",United States,30595,19.7,4%,58%,42%,23.1,10.1,27.3,35.9,37.3,21.8,10.7–22.1
1048,437,University of North Florida,United States,12132,17.7,2%,57%,43%,24.8,10.5,25.7,34.6,24.0,21.0,10.7–22.1
1052,438,Texas State University,United States,31893,19.1,1%,58%,42%,17.4,14.3,27.3,36.4,30.4,20.9,10.7–22.1


In [5]:
### TEST CASE for Puzzle 1: Rankings with the United States
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert(df.iloc[47]["University"] == "University of Illinois at Urbana-Champaign")
assert(df.iloc[47]["Number_students"] == 44916)

assert('df_US' in vars()), "Make sure to name the dataframe df_US."
assert(len(df_US) == 172), "It looks like you did not subset df_US to only include universities located in the United States."
assert(df_US["University"].iloc[0] == "California Institute of Technology")
assert(df_US["Number_students"].iloc[171] == 14791)

print(f"{tada} All Tests Passed! {tada}") 

🎉 All Tests Passed! 🎉
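
The conditional selection used in Puzzle 1 can be sketched on a tiny, made-up DataFrame. The university names below are hypothetical and only the `University` and `Country` column names mirror the real dataset; this is an illustration of the technique, not part of the graded solution:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "University": ["A University", "B College", "C Institute"],
    "Country": ["United States", "United Kingdom", "United States"],
})

# The comparison produces a boolean Series with one True/False per row.
mask = toy["Country"] == "United States"

# Indexing with that Series keeps only the rows marked True.
toy_US = toy[mask]
print(toy_US)
```

Note that the rows kept in `toy_US` retain their original index labels (`0` and `2`), which is the behavior the next section explores.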


## **Key Idea**: Indexes

By default, pandas creates an **index column** that will always start with the index `0`.  Each row after the first receives an index in an increasing order (the second row has index `1`, the third row index `2`, and so on).
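
As a minimal sketch of this default behavior, the following builds a small DataFrame from made-up values (the names are hypothetical) and inspects the index that pandas assigns when none is specified:

```python
import pandas as pd

# Hypothetical toy data with no index specified.
toy = pd.DataFrame({"University": ["A University", "B College", "C Institute"]})

# pandas assigns a RangeIndex that counts up from 0, one label per row.
print(toy.index)
print(list(toy.index))
```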

The top university in the full original dataset, the University of Oxford, has an index of `0`.  Using the full original dataset (`df`), `df.loc[0]` will display the row at the index (or `loc`) `0`.  See this for yourself by running the following code:

In [6]:
# Find the row with the index `0` in the DataFrame `df`:
df.loc[0]

Score_Rank                                    1
University                 University of Oxford
Country                          United Kingdom
Number_students                           20664
Numb_students_per_Staff                    11.2
International_Students                      41%
Percentage_Female                           46%
Percentage_Male                             54%
Teaching                                   90.5
Research                                   99.6
Citations                                  98.4
Industry_Income                            65.5
International_Outlook                      96.4
Score_Result                               95.4
Overall_Ranking                           95.40
Name: 0, dtype: object

### Indexes after Row Selection

Our dataset of US-based universities -- `df_US` -- is a subset of the original full dataset and **still has the original index values**.

When we attempt to find index `0` of `df_US`, we **expect to get a `KeyError`** indicating that this index does NOT exist within `df_US`.  Try this yourself:

In [7]:
# Find the row with the index `0` in the DataFrame `df_US`:
#          ** An error is expected!! **
#   ** (You will fix it in the next section.) **
df_US = df_US.reset_index()

df_US.loc[0]

index                                                       1
Score_Rank                                                  2
University                 California Institute of Technology
Country                                         United States
Number_students                                          2240
Numb_students_per_Staff                                   6.4
International_Students                                    30%
Percentage_Female                                         34%
Percentage_Male                                           66%
Teaching                                                 92.1
Research                                                 97.2
Citations                                                97.9
Industry_Income                                          88.0
International_Outlook                                    82.5
Score_Result                                             94.5
Overall_Ranking                                         94.50
Name: 0, dtype: object

### Fix The KeyError

(⚠️ You **MUST** make this fix for grading to work. It's easy to miss this section! ⚠️)

In the above cell, fix the error by updating the code so that it selects the top-ranked university in `df_US`.

Fix it and re-run the cell before you continue to the next puzzle. :)
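
To see why the `KeyError` happens, here is a small sketch on made-up data (hypothetical names, not the real dataset). It contrasts `.loc`, which looks rows up by index label, with `.iloc`, which looks rows up by position. Using `.iloc` is only one possible way to reach the first row of a filtered DataFrame; resetting the index is another:

```python
import pandas as pd

# Hypothetical toy data; the first row is NOT in the United States.
toy = pd.DataFrame({
    "University": ["A University", "B College", "C Institute"],
    "Country": ["United Kingdom", "United States", "United States"],
})
toy_US = toy[toy["Country"] == "United States"]  # keeps index labels 1 and 2

# .loc looks up by label; label 0 was filtered out, so this raises KeyError.
try:
    toy_US.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

# .iloc looks up by position, so position 0 is the first remaining row.
first_row = toy_US.iloc[0]
print(label_zero_found, first_row["University"])
```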

<hr style="color: #DD3403; border-style: dashed;">

## Puzzle 1.2: Re-indexing the Universities in the United States

The command `df_US.reset_index()` can be used to **regenerate** the indexes for `df_US`.

When you run `reset_index`, the pandas library will:
1. Move the existing indexes to a new column called `index`
2. Then, replace all of the indexes in `df_US` with a new index (starting with `0`, and counting up by one for each row, just like it was a new dataset)
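
The two steps above can be sketched on a small, made-up DataFrame (the names and index labels are hypothetical). The sketch also shows two related behaviors that are not required by this puzzle: the `drop=True` argument, which discards the old index instead of keeping it, and what happens when `reset_index()` is run a second time:

```python
import pandas as pd

# Hypothetical toy data; pretend labels 1 and 3 were left over from filtering.
toy = pd.DataFrame(
    {"University": ["B College", "C Institute"]},
    index=[1, 3],
)

reset_once = toy.reset_index()            # old labels move into an `index` column
reset_drop = toy.reset_index(drop=True)   # old labels are discarded instead
reset_twice = reset_once.reset_index()    # `index` is taken, so pandas adds `level_0`

print(list(reset_once.columns), list(reset_once.index))
print(list(reset_drop.columns))
print(list(reset_twice.columns))
```

Because `reset_index()` returns a new DataFrame each time, re-running a cell that contains `df = df.reset_index()` adds another column on every run, which is why a `level_0` column can appear.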

Use the `reset_index()` function on `df_US` to reset the indexes:

In [8]:
df_US = df_US.reset_index()
df_US

Unnamed: 0,level_0,index,Score_Rank,University,Country,Number_students,Numb_students_per_Staff,International_Students,Percentage_Female,Percentage_Male,Teaching,Research,Citations,Industry_Income,International_Outlook,Score_Result,Overall_Ranking
0,0,1,2,California Institute of Technology,United States,2240,6.4,30%,34%,66%,92.1,97.2,97.9,88.0,82.5,94.5,94.50
1,1,3,4,Stanford University,United States,16135,7.3,23%,43%,57%,92.8,96.4,99.9,66.2,79.5,94.3,94.30
2,2,4,5,Massachusetts Institute of Technology,United States,11247,8.6,34%,39%,61%,90.5,92.4,99.5,86.9,89.0,93.6,93.60
3,3,5,6,Princeton University,United States,7983,8.1,25%,45%,55%,90.3,96.3,98.8,58.6,81.1,93.2,93.20
4,4,6,7,Harvard University,United States,20823,9.2,24%,49%,51%,89.2,98.6,99.1,47.3,76.3,93.0,93.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
167,167,1011,428,Oakland University,United States,15887,18.9,4%,57%,43%,25.5,12.7,24.5,35.3,28.9,21.9,10.7–22.1
168,168,1015,429,"California State University, Long Beach",United States,30595,19.7,4%,58%,42%,23.1,10.1,27.3,35.9,37.3,21.8,10.7–22.1
169,169,1048,437,University of North Florida,United States,12132,17.7,2%,57%,43%,24.8,10.5,25.7,34.6,24.0,21.0,10.7–22.1
170,170,1052,438,Texas State University,United States,31893,19.1,1%,58%,42%,17.4,14.3,27.3,36.4,30.4,20.9,10.7–22.1


In [9]:
### TEST CASE for Puzzle 1.2: Re-indexing the universities in the United States
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"
assert(df_US["University"].loc[0] == "California Institute of Technology")
assert("index" in df_US.columns)
print(f"{tada} All Tests Passed! {tada}") 

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Puzzle 2: Large Universities in the United States

In the United States, there are generally large universities (often "large state schools") and small universities (often "small liberal arts colleges").  Explore the dataset and find the variable that stores the number of students who attend each university.

For the next analysis, we want to focus only on **large universities in the United States**.  For the purpose of this analysis, we'll refer to any university with at least 30,000 students as "large".
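
As a small sketch of what "at least 30,000" means in code, the following uses made-up enrollment numbers (hypothetical universities; only the `Number_students` column name comes from the dataset). The `>=` comparison keeps a university with exactly 30,000 students, while `>` would exclude it:

```python
import pandas as pd

# Hypothetical toy data chosen to sit just below, at, and above the cutoff.
toy = pd.DataFrame({
    "University": ["A University", "B College", "C Institute"],
    "Number_students": [29999, 30000, 45000],
})

# "At least 30,000" is inclusive, so use >= rather than >.
toy_large = toy[toy["Number_students"] >= 30000]
print(toy_large)
```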

Create a new DataFrame called `df_US_large`, which contains all large universities in the United States:

In [10]:
df_US_large = df_US[df_US.Number_students >= 30000]

In [11]:
### TEST CASE for Puzzle 2: Large universities in the United States
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert('df_US_large' in vars()), "Make sure to name the dataframe df_US_large."
assert(len(df_US_large) == 45), "It looks like you did not subset df_US_large to only universities with at least 30000 students."
assert(df_US_large["University"].iloc[7] == "University of Illinois at Urbana-Champaign")
print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


## Puzzle 2.2: Creating a Ranking List of Large US Universities

The THE dataset provides a `Score_Rank` column that contains the global rank of each school. For our analysis, we want to add a new column called `US_large_rank` that contains the rank of each school among large schools in the United States.  There are several ways to do this, but one of the easiest is to use the `reset_index()` function.

There are three observations we can make:

- Since the list is already sorted, we know the first university in `df_US_large` is the top university (Rank #1).  The second row is Rank #2, and so on.
- If we use `reset_index`, pandas will re-index the rows starting from zero.
- The first row is index 0, which will be Rank #1, so we'd just need to add one to all the index values!

Complete the following two steps:
1. Use `reset_index()` to reset the index values of `df_US_large`,
2. Then, create a new column in `df_US_large` called `"US_large_rank"` with the value: `df_US_large.index + 1`.
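
The two steps can be sketched on a made-up DataFrame that is already sorted from best to worst (the names and index labels are hypothetical). This sketch uses `drop=True` so that the old labels are discarded, which is an optional choice rather than a requirement of the puzzle:

```python
import pandas as pd

# Hypothetical toy data, already sorted, with gaps in the index from filtering.
toy_large = pd.DataFrame(
    {"University": ["B College", "C Institute", "E University"]},
    index=[1, 2, 4],
)

# Step 1: re-index the rows as 0, 1, 2, ...
toy_large = toy_large.reset_index(drop=True)

# Step 2: the first row has index 0, so adding one gives ranks 1, 2, 3, ...
toy_large["US_large_rank"] = toy_large.index + 1
print(toy_large)
```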

In [12]:
# Step 1: Reset the index of df_US_large (if needed, see Puzzle 1.2 to refresh how you can do this):
df_US_large = df_US[df_US.Number_students >= 30000]
df_US_large = df_US_large.reset_index(drop=True)



In [13]:
# Step 2: Create a new column in df_US_large:
df_US_large["US_large_rank"] = df_US_large.index + 1


## Viewing the Top Large Schools

After all that work, let's check out the top 10 large universities in the United States:

In [14]:
# Select the top ten large universities in the United States:
df_US_large.head(10)

Unnamed: 0,level_0,index,Score_Rank,University,Country,Number_students,Numb_students_per_Staff,International_Students,Percentage_Female,Percentage_Male,Teaching,Research,Citations,Industry_Income,International_Outlook,Score_Result,Overall_Ranking,US_large_rank
0,9,12,13,"University of California, Berkeley",United States,41081,13.7,17%,50%,50%,83.0,90.6,99.2,46.1,70.4,88.3,88.3,1
1,11,16,16,"University of California, Los Angeles",United States,41066,9.4,17%,54%,46%,83.1,88.6,97.3,51.3,64.1,86.8,86.8,2
2,14,20,20,University of Michigan-Ann Arbor,United States,42982,8.3,17%,49%,51%,79.4,86.1,94.9,47.7,59.2,83.8,83.8,3
3,16,25,25,University of Washington,United States,45692,11.1,16%,53%,47%,72.2,82.2,98.6,47.5,60.4,81.6,81.6,4
4,18,28,27,New York University,United States,44466,8.9,33%,57%,43%,76.8,77.5,96.5,38.7,65.4,81.1,81.1,5
5,19,30,29,"University of California, San Diego",United States,33579,13.0,23%,46%,54%,62.6,78.9,97.7,90.3,63.7,78.8,78.8,6
6,21,39,34,University of Texas at Austin,United States,49165,17.2,10%,52%,48%,68.2,76.2,93.2,46.6,39.1,75.4,75.4,7
7,22,47,41,University of Illinois at Urbana-Champaign,United States,44916,17.9,24%,47%,53%,63.2,78.0,84.4,47.9,54.3,73.0,72.9,8
8,23,50,44,University of Wisconsin-Madison,United States,39154,10.0,13%,0%,0%,68.8,70.3,85.3,46.3,47.4,72.0,72.0,9
9,26,53,47,University of North Carolina at Chapel Hill,United States,35419,9.4,8%,57%,43%,59.7,63.2,96.9,43.9,38.0,69.9,69.9,10


In [15]:
### TEST CASE for Puzzle 2.2: Using the index to store a US_large_rank
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert("US_large_rank" in df_US_large)
assert(df_US_large.loc[0][0] == df_US_large.iloc[0][0])
assert(df_US_large.loc[0]["US_large_rank"] == 1)
print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉




<hr style="color: #DD3403;">

## Puzzle 3: Creating Random Subsets

Instead of focusing on just a subset of a DataFrame, researchers often need to look at a random sample of a DataFrame.

Returning to the original dataset of nearly 1,400 universities, create one new DataFrame, `df_random_15`, that gives us a random sample of 15 rows in the dataset.
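
As a brief sketch of random sampling, the following uses a made-up DataFrame of 100 rows (the `value` column is hypothetical). It also shows the optional `random_state` argument of `sample`, which this MicroProject does not require but which makes a sample repeatable:

```python
import pandas as pd

# Hypothetical toy data with 100 rows.
toy = pd.DataFrame({"value": range(100)})

# Each call draws 15 distinct rows; the rows differ from run to run.
sample_a = toy.sample(15)

# Passing the same random_state reproduces the same sample.
sample_b = toy.sample(15, random_state=0)
sample_c = toy.sample(15, random_state=0)

print(len(sample_a), sample_b.equals(sample_c))
```

By default, `sample` draws without replacement, so no row appears twice, and the sampled rows keep their original index labels.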

In [16]:
df_random_15 = df.sample(15)
df_random_15

Unnamed: 0,Score_Rank,University,Country,Number_students,Numb_students_per_Staff,International_Students,Percentage_Female,Percentage_Male,Teaching,Research,Citations,Industry_Income,International_Outlook,Score_Result,Overall_Ranking
1142,459,Tomas Bata University in Zlín,Czech Republic,8546,19.5,10%,58%,42%,18.7,17.4,12.6,37.2,42.0,18.7,10.7–22.1
717,342,Palacký University Olomouc,Czech Republic,17443,13.8,12%,69%,31%,20.9,18.3,46.9,35.4,57.3,31.0,28.3–35.2
689,333,Kansas State University,United States,21536,14.3,9%,50%,50%,25.0,20.4,45.8,41.9,47.2,31.9,28.3–35.2
1133,458,The University of Electro-Communications,Japan,4813,15.1,5%,12%,88%,19.3,18.6,14.1,39.1,29.0,18.8,10.7–22.1
442,249,Indian Institute of Technology Delhi,India,7284,14.8,1%,20%,80%,48.5,27.9,49.0,73.9,17.6,40.8,38.8–42.3
14,14,UCL,United Kingdom,32665,10.6,52%,57%,43%,77.8,88.7,96.1,42.7,96.2,87.1,87.10
1336,509,Ming Chuan University,Taiwan,18590,22.7,12%,60%,40%,13.6,11.1,10.3,35.0,30.4,13.7,10.7–22.1
261,169,University of Surrey,United Kingdom,13125,16.0,37%,55%,45%,30.2,34.3,71.9,46.8,93.7,49.1,46.9–50.0
1347,512,Pirogov Russian National Research Medical Univ...,Russian Federation,8078,6.3,6%,72%,28%,23.2,9.5,3.5,37.9,21.0,13.4,10.7–22.1
1041,436,Bu-Ali Sina University,Iran,11609,26.3,1%,53%,47%,22.2,15.3,25.8,35.3,16.0,21.1,10.7–22.1


In [17]:
### TEST CASE for Puzzle 3: Creating random subsets
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert('df_random_15' in vars()), "Make sure to name the dataframe df_random_15."
assert(len(df_random_15) == 15), "It looks like you did not sample exactly 15 rows."
assert(len(df_random_15[df_random_15.Country != "United States"]) > 0), "Make sure to sample from the full dataset stored in `df`"
print(f"{tada} All Tests Passed! {tada}") 

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/world-university-rankings/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉
