<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Highest Mountains in the World</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/highest-mountain/">https://discovery.cs.illinois.edu/microproject/highest-mountain/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Wikipedia's "List of mountains by elevation"

Wikipedia is an absolutely amazing source of information about almost every topic you can imagine!  In this MicroProject, you will explore how to easily use data in Wikipedia tables as datasets.

The Wikipedia article "[List of mountains by elevation](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation)" (https://en.wikipedia.org/wiki/List_of_mountains_by_elevation) contains information on hundreds of mountains -- including **Mount Everest** (tallest in the world), **Denali** (tallest in the United States), and many more!
- Click the link above [(or right here)](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation) to view how the Wikipedia page looks in your web browser!

### Using the pandas `read_html` function

The `pd.read_html(...)` function in the pandas library is designed to read data from tables found in webpages.
- `read_html` is very similar to the more commonly used `read_csv`
- Instead of returning a single DataFrame like `read_csv`, `read_html` returns a **list of DataFrames -- one DataFrame for each table on the webpage**!
- Just like `read_csv`, you only need to provide the URL of the data!
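To make the "list of DataFrames" idea concrete, here is a minimal sketch that runs `read_html` on a small HTML string instead of a live webpage. The HTML snippet, the mountain names ("Peak A", etc.), and the variable names `html` and `tables` are made up for illustration, and the sketch assumes an HTML parser such as `lxml` is installed alongside pandas:

```python
from io import StringIO

import pandas as pd

# A made-up HTML fragment containing two separate tables (illustrative data only).
html = """
<table>
  <tr><th>Mountain</th><th>Metres</th></tr>
  <tr><td>Peak A</td><td>1000</td></tr>
  <tr><td>Peak B</td><td>900</td></tr>
</table>
<table>
  <tr><th>Mountain</th><th>Metres</th></tr>
  <tr><td>Peak C</td><td>800</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))

print(type(tables))   # a Python list
print(len(tables))    # one entry per table in the HTML
print(tables[0])      # the first table, as a DataFrame
```

Each element of the list is an ordinary DataFrame, so `tables[0]`, `tables[1]`, and so on can be explored just like the result of `read_csv`.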

Import `pandas` and use `pd.read_html` to create a new variable called `pages` that reads in all of the tables on the Wikipedia page "[List of mountains by elevation](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation)":

In [12]:
import pandas as pd
pages = pd.read_html("https://en.wikipedia.org/wiki/List_of_mountains_by_elevation")
pages

[                          Mountain  Metres   Feet      Range  \
 0                    Mount Everest    8848  29029  Himalayas   
 1                               K2    8611  28251  Karakoram   
 2                    Kangchenjunga    8586  28169  Himalayas   
 3                           Lhotse    8516  27940  Himalayas   
 4                           Makalu    8485  27838  Himalayas   
 5                          Cho Oyu    8188  26864  Himalayas   
 6                       Dhaulagiri    8167  26795  Himalayas   
 7                          Manaslu    8163  26781  Himalayas   
 8                     Nanga Parbat    8126  26660  Himalayas   
 9                        Annapurna    8091  26545  Himalayas   
 10  Gasherbrum I (Hidden peak; K5)    8080  26509  Karakoram   
 11                      Broad Peak    8051  26414  Karakoram   
 12              Gasherbrum II (K4)    8035  26362  Karakoram   
 13                    Shishapangma    8027  26335  Himalayas   
 
                       

### 🔬 Checkpoint Tests 🔬

In [13]:
### TEST CASE for Data Import
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert("pages" in vars())
assert(type(pages[0]) == type(pd.DataFrame()))
assert("Feet" in pages[0])
assert("Range" in pages[1])
assert("Mountain" in pages[2])
assert("Location and Notes" in pages[3])
assert("Metres" in pages[4])
print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Joining the individual DataFrames into one large DataFrame

Now that you have **ALL** of the tables in the `pages` variable, we want to combine them into one large DataFrame.  The webpage splits the mountains across several separate tables, so `pages` holds several DataFrames instead of just one.

Let's explore the individual tables.  Using `pages[0]`, you can view the first table of data found on the Wikipedia page:

In [14]:
pages[0]

Unnamed: 0,Mountain,Metres,Feet,Range,Location and Notes
0,Mount Everest,8848,29029,Himalayas,Nepal/China
1,K2,8611,28251,Karakoram,Pakistan/China
2,Kangchenjunga,8586,28169,Himalayas,Nepal/India
3,Lhotse,8516,27940,Himalayas,Nepal – Climbers ascend Lhotse Face in climbin...
4,Makalu,8485,27838,Himalayas,Nepal
5,Cho Oyu,8188,26864,Himalayas,"Nepal – Considered ""easiest"" eight-thousander"
6,Dhaulagiri,8167,26795,Himalayas,Nepal – Presumed world's highest from 1808-1838
7,Manaslu,8163,26781,Himalayas,Nepal
8,Nanga Parbat,8126,26660,Himalayas,Pakistan
9,Annapurna,8091,26545,Himalayas,Nepal – First eight-thousander to be climbed (...


Using `pages[1]`, you can view the second table that was found:

In [15]:
pages[1]

Unnamed: 0,Mountain,Metres,Feet,Range,Location and Notes
0,Gasherbrum III,7952,26089,Karakoram,Pakistan
1,Gyachung Kang,7952,26089,Himalayas,Nepal (Khumbu)/China
2,Annapurna II,7937,26040,Himalayas,Nepal
3,Gasherbrum IV (K3),7932,26024,Karakoram,Pakistan
4,Himalchuli,7893,25896,Himalayas,"Manaslu, Nepal"
...,...,...,...,...,...
129,Saipal,7031,23068,Himalayas,Nepal
130,Padmanabh,7030,23064,Himalayas,India
131,Spantik,7027,23054,Karakoram,Pakistan
132,Pamri Sar,7016,23018,Karakoram,Pakistan


### Finding the Last DataFrame

Continue to look at the tables the Wikipedia page contains.  Find out the **last index** of `pages` that contains data about the mountains:
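Since `pages` is a regular Python list, the usual list-indexing rules apply. The sketch below uses a made-up list called `tables` (plain strings standing in for DataFrames) to show two equivalent ways to reach the last element:

```python
# A stand-in list; in the notebook, each element would be a DataFrame.
tables = ["first table", "second table", "third table"]

# The last valid index is always one less than the length of the list.
last_index = len(tables) - 1
print(last_index)          # 2
print(tables[last_index])  # third table

# Python also accepts negative indexes, counting from the end of the list.
print(tables[-1])          # third table
```

Keep in mind that the last table on a webpage is not guaranteed to be a data table, so it is worth looking at the result before relying on it.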

In [16]:
last_index = len(pages) - 1   # index of the last table in `pages`
pages[last_index]

Unnamed: 0,Mountain,Metres,Feet,Range,Location and Notes
0,Sgurr Dearg,986.0,3235,Cuillin,Scotland
1,Mount Sizer,980.0,3215,Diablo Range,US (California)
2,Mount Valin,980.0,3215,Saguenay Lac St-Jean,Canada (Québec)
3,Hyangnosan,979.0,3212,,"Gyeongnam Province, South Korea"
4,Scafell Pike,978.0,3209,Southern Fells,England (Cumbria) – Highest in England
...,...,...,...,...,...
126,Mount Ngerchelchuus,242.0,794,,"Babeldaob, Palau – Highest point"
127,Mount Royal,233.0,764,,"Quebec, Canada"
128,Diamond Head,232.0,761,,US (Hawaii)
129,Yeomposan,203.0,666,,"Ulsan, South Korea"


### Combining the DataFrames Together

Before we can do analysis on the whole dataset, we need to join the individual tables together.  When we join DataFrames end-to-end, where the last row of the previous DataFrame is followed by the first row of the next DataFrame, the operation is called concatenation.

Read the DISCOVERY guide to learn the syntax for "Combining DataFrames by Concatenation":
- [Guide: "Combining DataFrames by Concatenation"](https://discovery.cs.illinois.edu/guides/Combining-DataFrames/Combining-DataFrames-by-Concatenation/) (https://discovery.cs.illinois.edu/guides/Combining-DataFrames/Combining-DataFrames-by-Concatenation/)
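As a small illustration of concatenation, the sketch below stacks two tiny made-up DataFrames (`first` and `second`, with invented mountain names) end-to-end. It also shows the optional `ignore_index=True` argument of `pd.concat`, which renumbers the rows. Whether you want the renumbering depends on your analysis, and it is not required here:

```python
import pandas as pd

# Two small, made-up DataFrames that share the same columns.
first = pd.DataFrame({"Mountain": ["Peak A", "Peak B"], "Metres": [1000, 900]})
second = pd.DataFrame({"Mountain": ["Peak C", "Peak D"], "Metres": [800, 700]})

# Default behavior: rows are stacked and each row keeps its original index label.
stacked = pd.concat([first, second])
print(list(stacked.index))      # [0, 1, 0, 1]

# With ignore_index=True, the combined rows are renumbered from 0.
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))   # [0, 1, 2, 3]
```

This explains why a concatenated DataFrame can show the same index label more than once: by default, each original table contributes its own row numbers.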

Use concatenation to create a single DataFrame `df` that contains data about every mountain found on the Wikipedia page:

In [17]:
df = pd.concat(pages)
df

Unnamed: 0,Mountain,Metres,Feet,Range,Location and Notes
0,Mount Everest,8848.0,29029,Himalayas,Nepal/China
1,K2,8611.0,28251,Karakoram,Pakistan/China
2,Kangchenjunga,8586.0,28169,Himalayas,Nepal/India
3,Lhotse,8516.0,27940,Himalayas,Nepal – Climbers ascend Lhotse Face in climbin...
4,Makalu,8485.0,27838,Himalayas,Nepal
...,...,...,...,...,...
126,Mount Ngerchelchuus,242.0,794,,"Babeldaob, Palau – Highest point"
127,Mount Royal,233.0,764,,"Quebec, Canada"
128,Diamond Head,232.0,761,,US (Hawaii)
129,Yeomposan,203.0,666,,"Ulsan, South Korea"


### 🔬 Checkpoint Tests 🔬

In [18]:
### TEST CASE for Data Merging
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert("df" in vars())
assert(len(df) > len(pages[0]))
assert("Feet" in df)
assert("Mountain" in df)
assert("Hindu Kush" in df["Range"].values)
assert("K2" in df["Mountain"].values)
assert("Batura Sar" in df["Mountain"].values)
assert("Meru Peak" in df["Mountain"].values)
assert("Ubinas" in df["Mountain"].values)
assert(len(df) > 1500 and len(df) < 1650)
assert(len(df[ df["Location and Notes"].str.contains("Himalayas")]) == 35)
print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Mountains in the United States

Now that we have every mountain in a single DataFrame, we can do some analysis!  In the dataset, the `Location and Notes` column contains a human-written description of the location and other notes.

Create a DataFrame called `df_us` that contains all of the mountains in the United States.

- You will need to look back at the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation), or explore `df` here in Python, to find out all the different ways mountains in the United States might be labeled.  *(Hint: There are two different ways!)*
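Filtering on a text column usually relies on `.str.contains(...)`. The sketch below shows the general pattern on a tiny made-up DataFrame called `toy` (the mountain names and notes are invented). It assumes two labels, "United States" and "US", and combines them into one pattern with `|`, which `str.contains` treats as a regular-expression "or" by default:

```python
import pandas as pd

# Made-up rows for illustration; the last note is missing on purpose.
toy = pd.DataFrame({
    "Mountain": ["Peak A", "Peak B", "Peak C", "Peak D"],
    "Location and Notes": ["Alaska, US", "United States (Colorado)", "Nepal", None],
})

# na=False treats missing notes as "no match" instead of producing a missing value.
mask = toy["Location and Notes"].str.contains("United States|US", na=False)

toy_us = toy[mask]
print(toy_us)
```

The match is case-sensitive and looks for the text anywhere in the note, so a short label such as "US" also matches any longer text that happens to contain those two capital letters. It is worth scanning the filtered rows for unexpected matches.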

In [19]:
notes = df["Location and Notes"]

# Keep every row whose notes mention any of these US-related labels.
df_us = df[
    notes.str.contains("United States")
    | notes.str.contains("Alaska")
    | notes.str.contains("US")
    | notes.str.contains("Hawaii")
]
df_us

Unnamed: 0,Mountain,Metres,Feet,Range,Location and Notes
97,Denali (Mount McKinley),6190.0,20308,,"Alaska Range, United States (Alaska) – Highest..."
32,Mount Saint Elias,5489.0,18009,Saint Elias Mountains,"Yukon, Canada/Alaska, US – Second highest in b..."
49,Mount Foraker,5304.0,17402,Alaska Range,"Alaska, US"
78,Mount Bona,5005.0,16421,Saint Elias Mountains,"Alaska, US – Also given as 5,030 m or 5,045m"
0,Mount Blackburn,4996.0,16391,,"Wrangell Mtns., Alaska, US (also given 5036 m)"
...,...,...,...,...,...
89,Taum Sauk Mountain,540.0,1772,,"Missouri, US"
96,Little Si,480.0,1575,Cascade Range,"Washington, US"
106,Storm King Mountain,408.0,1339,Hudson Highlands,US (New York)
125,Jerimoth Hill,247.0,810,,"Rhode Island, US"


### Analysis: Percentage of Mountains in the Dataset in the United States?

What percentage of mountains in the entire dataset are found in the United States?

In [20]:
pct_us = (len(df_us) / len(df))
pct_us

0.21406539173349784
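The value above is a proportion (a number between 0 and 1), which is exactly what the checkpoint test compares against `len(df_us) / len(df)`. To read a proportion as a percentage, you can format it for display without changing the stored value. The sketch below uses made-up counts (`n_matching` and `n_total`) purely for illustration:

```python
# Made-up counts for illustration only.
n_matching = 3
n_total = 12

proportion = n_matching / n_total
print(proportion)                  # 0.25

# Format for display as a percentage; the variable itself is unchanged.
print(f"{proportion:.1%}")         # 25.0%
print(round(proportion * 100, 1))  # 25.0
```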

### 🔬 Checkpoint Tests 🔬

In [21]:
### TEST CASE for Mountains
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert("df_us" in vars())
assert(len(df_us) > 300 and len(df_us) < 400)
assert(len(df_us[ df_us.Mountain.str.contains("Mount Saint Elias")]) == 1)
assert(len(df_us[ df_us.Mountain.str.contains("Denali")]) == 1)
assert("Ubinas" not in df_us["Mountain"].values)
assert("Carihuairazo" not in df_us["Mountain"].values)
assert("Sirbal Peak" not in df_us["Mountain"].values)
assert("pct_us" in vars())
assert(pct_us == len(df_us) / len(df))

print(f"{tada} DataFrame Analysis: All Tests Passed! {tada}")

🎉 DataFrame Analysis: All Tests Passed! 🎉
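One pandas detail is worth knowing when writing membership checks of your own: applying `in` directly to a Series tests the **index labels**, not the values. The sketch below uses a small made-up Series `s` to show the difference, and why checks on the values use `.values`:

```python
import pandas as pd

# A made-up Series with the default index labels 0 and 1.
s = pd.Series(["K2", "Denali"])

# `in` on a Series looks at the index labels (0 and 1), not the stored values.
print("K2" in s)          # False
print(0 in s)             # True

# To test the stored values, check against `.values` instead.
print("K2" in s.values)   # True
```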


<hr style="color: #DD3403;">

## 🔬 MicroProject - All Checkpoints 🔬

The final check is that you pass all the tests, all at once!

In [22]:
### TEST CASE for Final Checkpoint
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"

assert("pages" in vars())
assert(type(pages[0]) == type(pd.DataFrame()))
assert("Feet" in pages[0])
assert("Range" in pages[1])
assert("Mountain" in pages[2])
assert("Location and Notes" in pages[3])
assert("Metres" in pages[4])

assert("df" in vars())
assert(len(df) > len(pages[0]))
assert("Feet" in df)
assert("Mountain" in df)
assert("Hindu Kush" in df["Range"].values)
assert("K2" in df["Mountain"].values)
assert("Batura Sar" in df["Mountain"].values)
assert("Meru Peak" in df["Mountain"].values)
assert("Ubinas" in df["Mountain"].values)
assert(len(df) > 1500 and len(df) < 1650)
assert(len(df[ df["Location and Notes"].str.contains("Himalayas")]) == 35)

assert("df_us" in vars())
assert(len(df_us) > 300 and len(df_us) < 400)
assert(len(df_us[ df_us.Mountain.str.contains("Mount Saint Elias")]) == 1)
assert(len(df_us[ df_us.Mountain.str.contains("Denali")]) == 1)
assert("Ubinas" not in df_us["Mountain"].values)
assert("Carihuairazo" not in df_us["Mountain"].values)
assert("Sirbal Peak" not in df_us["Mountain"].values)

assert("pct_us" in vars())
assert(pct_us == len(df_us) / len(df))

print(f"{tada}{tada} All Tests Passed! {tada}{tada}")


🎉🎉 All Tests Passed! 🎉🎉


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is commit your notebook to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/highest-mountain/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉