<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Highest Mountains in the World</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/highest-mountain/">https://discovery.cs.illinois.edu/microproject/highest-mountain/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Wikipedia's "List of mountains by elevation"

Wikipedia is an absolutely amazing source of information about almost every topic you can imagine!  In this microproject, you will explore how to easily use data in Wikipedia tables as datasets.

The Wikipedia article "[List of mountains by elevation](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation)" (https://en.wikipedia.org/wiki/List_of_mountains_by_elevation) contains information on hundreds of mountains -- including Mount Everest (tallest in the world), Denali (tallest in the United States), and many more!
- Click the link above [(or right here)](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation) to view how the Wikipedia page looks in your web browser!

### Using the pandas `read_html` Function

The `pd.read_html(...)` function in the pandas library is designed to read data from tables found in webpages.
- `read_html` is very similar to the more commonly used `read_csv`
- Instead of returning a DataFrame like `read_csv`, `read_html` returns a **list of DataFrames** -- one DataFrame for each table!
- Just like `read_csv`, you only need to provide the URL of the data!
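To see what "a list of DataFrames" looks like, here is a minimal sketch that runs `read_html` on a small, made-up HTML snippet instead of a real webpage. The mountain names and heights are hypothetical, and the sketch assumes an HTML parser such as `lxml` is installed alongside pandas:

```python
from io import StringIO

import pandas as pd

# Hypothetical toy page with two tables (NOT the real Wikipedia data).
html = """
<table>
  <thead><tr><th>Mountain</th><th>Feet</th></tr></thead>
  <tbody>
    <tr><td>Peak A</td><td>12000</td></tr>
    <tr><td>Peak B</td><td>9000</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Mountain</th><th>Feet</th></tr></thead>
  <tbody>
    <tr><td>Peak C</td><td>4000</td></tr>
  </tbody>
</table>
"""

# read_html accepts a URL or a file-like object; StringIO wraps the string.
tables = pd.read_html(StringIO(html))

print(type(tables))  # a plain Python list
print(len(tables))   # one entry per <table> found
print(tables[0])     # the first table, as a DataFrame
```

With a real webpage, you would pass the URL as a string in place of `StringIO(html)`.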

Import `pandas` (as `pd`) and create a new variable called `pages` that reads in all of the tables on the Wikipedia page "[List of mountains by elevation](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation)":

In [None]:
pages = ...
pages

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("pages" in vars())
assert(type(pages[0]) == type(pd.DataFrame()))
assert("Feet" in pages[0])
assert("Range" in pages[1])
assert("Mountain" in pages[2])
assert("Location and Notes" in pages[3])
assert("Metres" in pages[4])
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Joining the individual DataFrames into one large DataFrame

Now that you have **ALL** of the tables in the `pages` variable, we want to combine them into one large DataFrame.  The webpage splits the mountains across several tables, so `pages` holds several DataFrames instead of just one.

Let's explore the individual tables.  Using `pages[0]`, you can view the first table of data found on the Wikipedia page:

In [None]:
pages[0]

Using `pages[1]`, you can view the second table that was found:

In [None]:
pages[1]

### Finding the Last DataFrame

Continue to look at the tables the Wikipedia page contains.  Find out the **last index** of `pages` that contains data about the mountains:
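One way to inspect a list of DataFrames without opening each one by hand is to loop over it and print each table's index, shape, and column names. The sketch below uses a made-up list named `tables` with hypothetical contents, so the index it finds says nothing about the real Wikipedia page:

```python
import pandas as pd

# Hypothetical stand-in for a list of tables read from a webpage.
tables = [
    pd.DataFrame({"Mountain": ["Peak A"], "Feet": [12000]}),
    pd.DataFrame({"Mountain": ["Peak B"], "Feet": [9000]}),
    pd.DataFrame({"Note": ["A table that is not about mountains"]}),
]

# Print the position, size, and column names of every table.
for i, table in enumerate(tables):
    print(i, table.shape, list(table.columns))

# Keep only the positions of tables that have a "Mountain" column.
mountain_indexes = [i for i, t in enumerate(tables) if "Mountain" in t.columns]
last_mountain_index = mountain_indexes[-1]
print(last_mountain_index)
```

Checking for a `Mountain` column is only one possible rule for deciding whether a table holds mountain data. Looking at the tables yourself is still the most reliable check.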

In [None]:
...

### Combining the DataFrames Together

Before we can do analysis on the whole dataset, we need to join the individual tables together.  When we join DataFrames end-to-end, where the last row of the previous DataFrame is followed by the first row of the next DataFrame, the operation is called concatenation.

Read the DISCOVERY guide "Combining DataFrames by Concatenation" to learn the syntax:
- [Guide: "Combining DataFrames by Concatenation"](https://discovery.cs.illinois.edu/guides/Combining-DataFrames/Combining-DataFrames-by-Concatenation/) (https://discovery.cs.illinois.edu/guides/Combining-DataFrames/Combining-DataFrames-by-Concatenation/)
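As a quick illustration of end-to-end concatenation, the sketch below joins two small, made-up DataFrames with `pd.concat`. The names `first`, `second`, and `combined` and the data inside them are hypothetical:

```python
import pandas as pd

# Two hypothetical tables with the same columns.
first = pd.DataFrame({"Mountain": ["Peak A", "Peak B"], "Feet": [12000, 9000]})
second = pd.DataFrame({"Mountain": ["Peak C"], "Feet": [4000]})

# Stack the rows of `second` underneath the rows of `first`.
# ignore_index=True renumbers the rows 0, 1, 2 instead of keeping 0, 1, 0.
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```

`pd.concat` takes a list of DataFrames, so the same call works whether the list has two entries or many.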

Use concatenation to create a single DataFrame `df` that contains data about every mountain found on the Wikipedia page:

In [None]:
df = ...
df

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("df" in vars())
assert(len(df) > len(pages[0]))
assert("Feet" in df)
assert("Mountain" in df)
assert(len(df[df.Feet > 26000]) > 0)
assert(len(df[df.Feet < 2000]) > 0)
assert(len(df[ (df.Feet < 26000) & (df.Feet > 22000) ]) > 0)
assert(len(df[ (df.Feet < 22000) & (df.Feet > 18000) ]) > 0)
assert(len(df[ (df.Feet < 18000) & (df.Feet > 14000) ]) > 0)
assert(len(df[ (df.Feet < 14000) & (df.Feet > 10000) ]) > 0)
assert(len(df[ (df.Feet < 10000) & (df.Feet > 6000) ]) > 0)
assert(len(df[ (df.Feet < 6000) & (df.Feet > 2000) ]) > 0)
print(f"{tada} All Tests Passed! {tada}")


<hr style="color: #DD3403;">

## Mountains in the United States

Now that we have every mountain in a single DataFrame, we can do some analysis!  In the dataset, the `Location and Notes` column contains a human-written description of the location and other notes.

Create a DataFrame called `df_us` that contains all of the mountains in the United States.

- You will need to look back at the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation), or explore `df` here in Python, to find out all the different ways mountains in the United States might be labeled.  *(Hint: There are two different ways!)*
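When a text column can label the same place in more than one way, one common approach is to build a condition for each label with `str.contains` and combine them with `|` (or). The sketch below uses a made-up country ("Freedonia", abbreviated "FD") and a made-up DataFrame named `toy`, so it does not reveal the labels used on the real page:

```python
import pandas as pd

# Hypothetical data: the same country is written two different ways.
toy = pd.DataFrame({
    "Mountain": ["Peak A", "Peak B", "Peak C", "Peak D"],
    "Location and Notes": ["Freedonia", "Sylvania", "Highest point in FD", None],
})

notes = toy["Location and Notes"]

# na=False treats missing notes as "no match" instead of raising an error
# when the mask is used to select rows.
mask = notes.str.contains("Freedonia", na=False) | notes.str.contains("FD", na=False)

toy_fd = toy[mask]
print(toy_fd)
```

Short labels can also match inside unrelated words, so it is worth checking a few of the selected rows by eye.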

In [None]:
df_us = ...
df_us

### Analysis: Percentage of Mountains in the Dataset in the United States?

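What percentage of the mountains in the entire dataset are found in the United States?  Store your answer in `pct_us` as a proportion (the number of rows in `df_us` divided by the number of rows in `df`), which is what the checkpoint test checks.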
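A share of a dataset is the number of rows in the subset divided by the number of rows in the whole. The sketch below shows that arithmetic on a made-up DataFrame, where `toy`, `subset`, and `fraction` are hypothetical names:

```python
import pandas as pd

# Hypothetical data: 2 of the 4 rows belong to country "X".
toy = pd.DataFrame({
    "Mountain": ["Peak A", "Peak B", "Peak C", "Peak D"],
    "Country": ["X", "Y", "X", "Z"],
})

subset = toy[toy["Country"] == "X"]

# len() of a DataFrame is its number of rows.
fraction = len(subset) / len(toy)

print(fraction)           # 0.5
print(f"{fraction:.1%}")  # 50.0%
```

The value stays a proportion between 0 and 1. The `:.1%` format only changes how it is displayed.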

In [None]:
pct_us = ...
pct_us

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("df_us" in vars())
assert(len(df_us) > 300)
assert(len(df_us[ df_us.Mountain.str.contains("Mount Saint Elias")]) == 1)
assert(len(df_us[ df_us.Mountain.str.contains("Denali")]) == 1)

assert("pct_us" in vars())
assert(pct_us == len(df_us) / len(df))

print(f"{tada} DataFrame Analysis: All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## 🔬 MicroProject - All Checkpoints 🔬

The final check is that you pass all the tests, all at once!

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert("pages" in vars())
assert(type(pages[0]) == type(pd.DataFrame()))
assert("Feet" in pages[0])
assert("Range" in pages[1])
assert("Mountain" in pages[2])
assert("Location and Notes" in pages[3])
assert("Metres" in pages[4])

assert("df" in vars())
assert(len(df) > len(pages[0]))
assert("Feet" in df)
assert("Mountain" in df)
assert(len(df[df.Feet > 26000]) > 0)
assert(len(df[df.Feet < 2000]) > 0)
assert(len(df[ (df.Feet < 26000) & (df.Feet > 22000) ]) > 0)
assert(len(df[ (df.Feet < 22000) & (df.Feet > 18000) ]) > 0)
assert(len(df[ (df.Feet < 18000) & (df.Feet > 14000) ]) > 0)
assert(len(df[ (df.Feet < 14000) & (df.Feet > 10000) ]) > 0)
assert(len(df[ (df.Feet < 10000) & (df.Feet > 6000) ]) > 0)
assert(len(df[ (df.Feet < 6000) & (df.Feet > 2000) ]) > 0)

assert("df_us" in vars())
assert(len(df_us) > 300)
assert(len(df_us[ df_us.Mountain.str.contains("Mount Saint Elias")]) == 1)
assert(len(df_us[ df_us.Mountain.str.contains("Denali")]) == 1)

assert("pct_us" in vars())
assert(pct_us == len(df_us) / len(df))

print(f"{tada}{tada} All Tests Passed! {tada}{tada}")


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the instructions to commit and grade this MicroProject on GitHub!
