<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject #3: Highest Mountains in the World</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/highest-mountain/">https://discovery.cs.illinois.edu/microproject/highest-mountain/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Wikipedia's "List of mountains by elevation"

Wikipedia is an ad-free, open source, and very reliable source of information about almost every topic you can imagine!  In this MicroProject, you will explore how to easily import data from the tables on any Wikipedia page into a DataFrame.

The Wikipedia article "List of mountains by elevation" (https://en.wikipedia.org/wiki/List_of_mountains_by_elevation) contains information on hundreds of mountains -- including **Mount Everest** (tallest in the world), **Denali / Mount McKinley** (tallest in the United States), and hundreds more!
- Click the Wikipedia link above to view how the Wikipedia page looks in your web browser before we begin to work with it in Python!

Once you've taken a look at the data, let's nerd out with gathering data from Wikipedia!

### Background Knowledge

To finish this MicroProject, we assume you already know how to:

- Load a CSV file into a DataFrame using `pd.read_csv` ([review loading a CSV file](https://discovery.cs.illinois.edu/learn/Basics-of-Data-Science-with-Python/Python-for-Data-Science-Introduction-to-DataFrames/)),
- Perform simple row selection of a DataFrame ([review row selection](https://discovery.cs.illinois.edu/learn/Basics-of-Data-Science-with-Python/Row-Selection-with-DataFrames/)), and
- Select rows of a DataFrame using conditionals ([review conditionals with DataFrames](https://discovery.cs.illinois.edu/learn/Basics-of-Data-Science-with-Python/DataFrames-with-Conditionals/))

With that knowledge, this MicroProject will guide you through nerding out with gathering data from Wikipedia and finding some facts about the tallest mountains in the world.  Let's get started! :)

<hr style="color: #DD3403;">

## Part 1: Fetching Data From Wikipedia

The Wikipedia article "List of mountains by elevation" is organized into several tables of data:

- One table for mountains that are at least 8,000m in height,
- Another table for 7,000m mountains  (7,000m - 7,999m),
- Another table for 6,000m mountains  (6,000m - 6,999m),
- ...and so on...

The `pd.read_html(...)` function in the pandas library is designed to **read data from tables found in webpages**.

### Part 1.1: Using `pd.read_html`

In the following cell, we'll use `pd.read_html(...)` to read all of the tables from the Wikipedia page  "List of mountains by elevation" (https://en.wikipedia.org/wiki/List_of_mountains_by_elevation).  Here's a brief overview of the `pd.read_html` function:

- The `read_html` function is **very similar** to the commonly used `read_csv`.
- Instead of returning a single DataFrame from a CSV file, `read_html` returns **one DataFrame for each table on the webpage** as a Python list of DataFrames.  *(This means we'll have eight different DataFrames when there are eight different tables on the Wikipedia page.)*
- Just like `read_csv`, you only need to provide the URL of the data or an HTML file that you've downloaded! 🎉
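
To make the "list of DataFrames" idea concrete, here is a minimal sketch that runs `pd.read_html` on a tiny, made-up webpage stored in a string rather than on the Wikipedia page.  The names `html` and `tables` and all of the data are hypothetical, and the sketch assumes an HTML parser that pandas can use (such as `lxml`) is installed:

```python
from io import StringIO

import pandas as pd

# A tiny, made-up webpage with two tables (hypothetical data, not from Wikipedia)
html = """
<html><body>
  <table>
    <tr><th>Fruit</th><th>Count</th></tr>
    <tr><td>Apple</td><td>3</td></tr>
    <tr><td>Pear</td><td>5</td></tr>
  </table>
  <table>
    <tr><th>City</th><th>State</th></tr>
    <tr><td>Urbana</td><td>Illinois</td></tr>
  </table>
</body></html>
"""

# `read_html` returns a Python list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))

print(len(tables))   # how many tables were found
tables[0]            # the first table, as a DataFrame
```

Passing a filename or a URL instead of `StringIO(html)` follows the same pattern: the result is still a list, even when the page contains only one table.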


### Using an HTML File vs. Using a URL

Historically, Python could download many websites automatically, so you could pass a URL directly to `pd.read_html`.  However, with the growth of AI agents scraping websites, many sites (including Wikipedia) are adding protections against automated "scraping" of data.  These protections can prevent Python from downloading the webpage directly and extracting the data.

To use data from a site, you can save a copy of the webpage as an HTML file.  We have done that for the Wikipedia page and provided it for you as part of this MicroProject.  The file name is `List of mountains by elevation - Wikipedia.html`.

In the cell below, load the HTML page using the filename above and fetch a list of DataFrames.  Each DataFrame is a table from the Wikipedia page "List of mountains by elevation" and you will store those in the Python variable `dfList`:

In [0]:
import pandas as pd

# Fetch a list of DataFrames, where each DataFrame is a table from the Wikipedia page
# "List of mountains by elevation", and store it in the Python variable `dfList`.
# The filename of the downloaded HTML page is:
#    "List of mountains by elevation - Wikipedia.html"
dfList = ...
dfList

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Part 1.1: Using pd.read_html
#
# What is this cell?
# - This cell contains test cases for the MicroProject.  You can modify anything except
#   the first line of this cell, but we will replace this cell with a new version of this
#   cell when your MicroProject is graded.  It's usually best to not change this cell!
#
# - To run the test cases we have for you, just run this Python cell like any other cell! :)
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try to make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you already
#   understand is actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)
#
# - You will find more cells that begin with the words "TEST CASE" throughout the
#   notebook at important points to make sure everything is looking good so far!
#

tada = "\N{PARTY POPPER}"

assert("dfList" in vars()), "You must define a Python variable called `dfList`."
assert(type(dfList) == type([])), "Your Python variable `dfList` must contain a list."
assert(type(dfList[0]) == type(pd.DataFrame())), "Your Python variable `dfList` must contain a list of DataFrames."
assert("Feet" in dfList[0]), "Your Python variable `dfList` must be from the Wikipedia page 'List of mountains by elevation'."
assert("Range" in dfList[1]), "Your Python variable `dfList` must be from the Wikipedia page 'List of mountains by elevation'."
assert("Mountain" in dfList[2]), "Your Python variable `dfList` must be from the Wikipedia page 'List of mountains by elevation'."
assert("Location and notes" in dfList[3]), "Your Python variable `dfList` must be from the Wikipedia page 'List of mountains by elevation'."
assert("Metres" in dfList[4]), "Your Python variable `dfList` must be from the Wikipedia page 'List of mountains by elevation'."

print(f"{tada} All Tests Passed! {tada}")

### Part 1.2: Exploring the List of DataFrames

Your variable `dfList` contains a **Python `list`** of several DataFrames, one DataFrame for each table on the webpage.  To use this as one complete dataset, we need to join the DataFrames together into one large DataFrame.

Before we do that, let's explore the individual DataFrames.  To look at the first item of a `list`, we access the `0th` index of the list by using the Python code:

> ```py
> # Accesses the first element (index 0) of a list called `myList`
> myList[0]
> ```

Applying this to the variable `dfList`, the following code displays the first DataFrame stored in `dfList`.  This first DataFrame contains the data from the first table on the Wikipedia page:

In [0]:
dfList[0]

The second DataFrame is accessed at index `[1]`:

In [0]:
dfList[1]

Continue to look at each index, until you find the **very last DataFrame** in the list that contains data about the mountains.  *(We'll need to know the last index for the next section.)*
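
If clicking through the indexes one at a time gets tedious, a loop can summarize every item in a list at once.  The sketch below uses a made-up list called `exampleList` (a hypothetical stand-in, not your `dfList`) where the final item is deliberately unrelated to the others, since the last item in a list is not guaranteed to hold the data you care about:

```python
import pandas as pd

# Hypothetical stand-in for a list of DataFrames (made-up data)
exampleList = [
    pd.DataFrame({"Mountain": ["A", "B"], "Metres": [8000, 7900]}),
    pd.DataFrame({"Mountain": ["C"], "Metres": [6500]}),
    pd.DataFrame({"Unrelated": ["footer text"]}),
]

# Print each index with the shape and column names of the DataFrame stored there
for i, table in enumerate(exampleList):
    print(i, table.shape, list(table.columns))

# The final index of any list is always its length minus one
lastIndex = len(exampleList) - 1
```

Comparing the column names at each index is one way to decide which DataFrames belong to the dataset; whether every item in your own list qualifies is something to check for yourself.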

In [0]:
# Display the very last DataFrame containing data about the mountains from Wikipedia:
...

### Part 1.3: Joining the individual DataFrames into one large DataFrame

Before we can do analysis on the whole dataset, we need to join the individual DataFrames together into one large DataFrame.  When we join DataFrames end-to-end, where the last row of the previous DataFrame is followed by the first row of the next DataFrame, the operation is called **concatenation**.

Read the DISCOVERY guide to learn the syntax on "Combining DataFrames by Concatenation"
- [DISCOVERY Guide: "Combining DataFrames by Concatenation"](https://discovery.cs.illinois.edu/guides/Combining-DataFrames/Combining-DataFrames-by-Concatenation/)
- https://discovery.cs.illinois.edu/guides/Combining-DataFrames/Combining-DataFrames-by-Concatenation/
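
As a small illustration of concatenation, the sketch below stacks two made-up DataFrames (`top` and `bottom` are hypothetical names) end-to-end with `pd.concat`.  The `ignore_index=True` argument is an optional choice made here so that the combined rows are renumbered from 0; the guide above may present the syntax differently:

```python
import pandas as pd

# Two small, made-up DataFrames with the same columns
top = pd.DataFrame({"Name": ["A", "B"], "Height": [10, 9]})
bottom = pd.DataFrame({"Name": ["C", "D"], "Height": [8, 7]})

# `pd.concat` takes a list of DataFrames and stacks them in the order given
combined = pd.concat([top, bottom], ignore_index=True)
combined
```

The list passed to `pd.concat` can contain any number of DataFrames, and the result is a single DataFrame rather than a list.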

Use concatenation to create a single DataFrame `df` that contains data about every mountain found on the Wikipedia page:

In [0]:
df = ...
df

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Part 1.3: Joining the individual DataFrames into one large DataFrame
tada = "\N{PARTY POPPER}"

assert("df" in vars()), "You must create a variable called `df` that is a single DataFrame (not a list)."
assert(type(dfList) == type([])), "You must NOT override `dfList`.  The variable `dfList` must still be a list containing one DataFrame for each table on Wikipedia."
assert(len(df) > len(dfList[0])), "You must concat all of the DataFrames together into one large DataFrame `df`."
assert("Feet" in df), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`."
assert("Mountain" in df), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`."
assert("Hindu Kush" in df["Range"].values), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`.  (You are missing at least one table.)"
assert("K2" in df["Mountain"].values), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`.  (You are missing at least one table.)"
assert("Batura Sar" in df["Mountain"].values), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`.  (You are missing at least one table.)"
assert("Meru Peak" in df["Mountain"].values), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`.  (You are missing at least one table.)"
assert("Ubinas" in df["Mountain"].values), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`.  (You are missing at least one table.)"
assert(len(df) > 1600 and len(df) < 1700), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`.  (You are missing at least one table.)"
assert(len(df[ df["Location and notes"].str.contains("Himalayas")]) == 35), "Your DataFrame stored in the variable `df` must be a concatenation of all the individual DataFrames in `dfList`.  (You are missing at least one table.)"

print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Part 2: Mountains in the United States

Now that we have every mountain in a single DataFrame, we can do some analysis!

In the dataset, the `Location and notes` column contains a human-written description of the location and other notes.  For example, the entry for the mountain "Makalu" notes that the mountain is in "Nepal".

To do the next analysis, we want to select from all the mountains in the entire dataset `df` and find only the mountains located in the United States.

To do this, you'll need to do two things:

1. First, look back at the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation), or explore `df` here in Python, to find out **all the different ways mountains in the United States might be labeled**.  *(Hint: There are two different ways!)*
2. Second, read the DISCOVERY guide on "Selecting DataFrame Rows Based on String Contents" to learn the syntax for selecting rows that match either of the two labels:
    - [DISCOVERY Guide: "Selecting DataFrame Rows Based on String Contents"](https://discovery.cs.illinois.edu/guides/DataFrame-Row-Selection/dataframe-string-contains/)
    - https://discovery.cs.illinois.edu/guides/DataFrame-Row-Selection/dataframe-string-contains/
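
To show the general idea of matching more than one label, here is a minimal sketch on a made-up DataFrame of offices where one country is written two different ways.  All names and data are hypothetical, and combining two conditions with `|` is just one possible approach:

```python
import pandas as pd

# Made-up data: the same country appears spelled out and abbreviated
offices = pd.DataFrame({
    "Office": ["North", "South", "East", "West"],
    "Notes": ["London, United Kingdom", "Paris, France", "Leeds, UK", "Rome, Italy"],
})

# Each condition is a Series of True/False values, one per row
spelledOut = offices["Notes"].str.contains("United Kingdom")
abbreviated = offices["Notes"].str.contains("UK")

# `|` means "or": keep the rows where either condition is True
offices_uk = offices[spelledOut | abbreviated]
offices_uk
```

By default `str.contains` is case-sensitive and matches anywhere in the text, so a short abbreviation can also match inside a longer word.  It is worth scanning the selected rows to confirm they are the ones you expected.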

Create a DataFrame of only the mountains in the United States and store it in the variable `df_us` in the cell below.

In [0]:
# Create a DataFrame of only the mountains in the United States and store it in the variable `df_us`:
df_us = ...
df_us

### Part 2 Analysis: Percentage of Mountains in the Dataset in the United States?

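What percentage of mountains in the entire dataset are found in the United States?  Store your answer in the variable `pct_us` as a decimal between 0 and 1 (for example, 50% would be stored as `0.5`).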
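
As a worked example of this kind of calculation, the sketch below finds the fraction of rows in a made-up DataFrame that belong to a subset by dividing the number of rows in the subset by the number of rows in the whole dataset.  The names `pets`, `dogs`, and `fraction_dogs` are hypothetical:

```python
import pandas as pd

# Made-up data: five pets, two of which are dogs
pets = pd.DataFrame({
    "Pet": ["Rex", "Tom", "Fido", "Kit", "Sky"],
    "Kind": ["dog", "cat", "dog", "cat", "bird"],
})

# Select the subset, then divide its row count by the total row count
dogs = pets[pets["Kind"] == "dog"]
fraction_dogs = len(dogs) / len(pets)
fraction_dogs
```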

In [0]:
pct_us = ...
pct_us

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Part 2: Mountains in the United States
tada = "\N{PARTY POPPER}"

assert("df_us" in vars()), "You must create a variable called `df_us`."
assert(len(df_us) > 300 and len(df_us) < 400), "Your DataFrame must contain ALL mountains in the United States. It appears you have not identified the two different ways that the United States appears in the Wikipedia tables."
assert(len(df_us[ df_us.Mountain.str.contains("Mount Saint Elias")]) == 1), "Your DataFrame must contain ALL mountains in the United States. It appears you have not identified the two different ways that the United States appears in the Wikipedia tables."
assert(len(df_us[ df_us.Mountain.str.contains("Denali")]) == 1), "Your DataFrame must contain ALL mountains in the United States. It appears you have not identified the two different ways that the United States appears in the Wikipedia tables."
assert("Ubinas" not in df_us["Mountain"].values), "Your DataFrame `df_us` must contain ONLY mountains in the United States, but it contains at least one mountain located elsewhere."
assert("Carihuairazo" not in df_us["Mountain"].values), "Your DataFrame `df_us` must contain ONLY mountains in the United States, but it contains at least one mountain located elsewhere."
assert("Sirbal Peak" not in df_us["Mountain"].values), "Your DataFrame `df_us` must contain ONLY mountains in the United States, but it contains at least one mountain located elsewhere."

assert("pct_us" in vars()), "You must create a variable called `pct_us`."
assert(pct_us > 0 and pct_us < 1), "The variable called `pct_us` must contain a percentage between 0 and 1."
assert(pct_us == len(df_us) / len(df)), "The variable called `pct_us` must contain the percentage of mountains in the United States among all the mountains in the dataset."

print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Part 3: Higher than the Highest Mountain in the United States

You have identified the highest mountains in the United States, and also have the data for the highest mountains across the world! 🎉

In the final puzzle for this MicroProject, create a DataFrame that contains **ALL** of the mountains that are higher than the highest mountain in the United States.  Store the DataFrame in the Python variable `df_higherThanUS`:
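
The general pattern here (find the largest value within a subset, then compare every row against it) can be sketched on made-up data.  The DataFrame `towers` and every name below are hypothetical, and using `.max()` is just one way to find the largest value:

```python
import pandas as pd

# Made-up data: towers in three hypothetical countries
towers = pd.DataFrame({
    "Tower": ["T1", "T2", "T3", "T4", "T5"],
    "Country": ["X", "Y", "X", "Z", "Y"],
    "Height": [300, 450, 380, 500, 200],
})

# Step 1: find the largest height among the towers in country X
towers_x = towers[towers["Country"] == "X"]
tallest_x = towers_x["Height"].max()

# Step 2: select every tower, from any country, that is strictly taller
higher = towers[towers["Height"] > tallest_x]
higher
```

Because the comparison uses `>` rather than `>=`, the tallest tower in country X is not included in the result.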


In [0]:
# Store your new DataFrame in the variable `df_higherThanUS`:
df_higherThanUS = ...
df_higherThanUS

### Part 3 Visualization: Heights of Various Mountains

A bar chart is a great way to visualize data that contains non-numeric data or categories.  In this MicroProject, you have explored the height of various mountains -- let's visualize just how tall Mount Everest is compared to other mountains in this dataset.

Since there are over 1,000 mountains in the dataset, our visualization will show a subset of all the mountains.  Specifically, we'll visualize **every 49th mountain** -- indexes `[0]`, `[49]`, `[98]`, `[147]`, `[196]`, etc.

Selecting only every 49th mountain can be done by selecting a range from your DataFrame with the following format:
> ```py
> # Selects every 49th row, starting with [0]:
> df[::49]
> ```
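
To see which rows a step like this keeps, the sketch below applies the same slicing pattern with a smaller step to a made-up DataFrame of ten rows (`numbers` and `everyThird` are hypothetical names):

```python
import pandas as pd

# Made-up data: ten rows with index labels 0 through 9
numbers = pd.DataFrame({"Value": range(10)})

# Keep every 3rd row, starting with the first row
everyThird = numbers[::3]
print(everyThird.index.tolist())   # [0, 3, 6, 9]
```

The slice counts by row position, so the first row is always kept and the step decides how many rows are skipped in between.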

Creating a bar chart from a DataFrame is done by using the general format:

> ```py
> # Generic format for a bar chart from a DataFrame:
> df.plot.bar(x="data-column-name", y="data-column-name")
> ```

Combining these together, your bar chart can be created with the general format:

> ```py
> # Generic format for a bar chart of every 49th row of a DataFrame:
> df[::49].plot.bar(x="data-column-name", y="data-column-name")
> ```
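
Here is a minimal sketch of that combined format on a made-up DataFrame of rivers, with a smaller step so the example stays tiny.  The data and the names `rivers` and `ax` are hypothetical, and the sketch assumes `matplotlib` is installed, since pandas uses it for plotting:

```python
import pandas as pd

# Made-up data: six rivers and their lengths
rivers = pd.DataFrame({
    "River": ["R1", "R2", "R3", "R4", "R5", "R6"],
    "Length": [600, 500, 400, 300, 200, 100],
})

# Plot every 2nd row: one bar per selected river, labeled by the "River" column
ax = rivers[::2].plot.bar(x="River", y="Length")

# The plot call returns a matplotlib Axes object that can be customized further
ax.set_ylabel("Length")
```

The column given as `x` supplies the label under each bar, and the column given as `y` sets the height of each bar.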

#### Create Your Visualization

Create a bar chart below, using the mountain name for your `x`-axis data and the height (either feet or meters, your choice!) for the `y`-axis data:

In [0]:
# Create a bar chart
# - using the mountain name for your `x`-axis data, and
# - using the height (either feet or meters, your choice!) for the `y`-axis data
...

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Part 3: Higher than the Highest Mountain in the United States
tada = "\N{PARTY POPPER}"

assert("df_higherThanUS" in vars()), "You must create a variable called `df_higherThanUS`."
assert( len(list(set(df_higherThanUS["Mountain"]) & set(df_us["Mountain"]))) == 0 ), "Some of the mountains in `df_higherThanUS` are in the United States, but there should be none."
assert( len(df_higherThanUS) == len(df[ df.Feet > df_us.sort_values("Feet", ascending=False).iloc[0].Feet ]) ), "There are some mountains in `df_higherThanUS` that are shorter than the highest mountain in the United States, but there should be none."

print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your MicroProject to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/highest-mountain/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉