<h1 style='text-align: center'>
<div style='color: #DD3403; font-size: 60%'>Data Science DISCOVERY MicroProject</div>
<span style=''>Creating Choropleth Maps from DataFrames with folium</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/choropleth-map-dataframe/">https://discovery.cs.illinois.edu/microproject/choropleth-map-dataframe/</a></div>
</h1>

<hr style='color: #DD3403;'>

## Data Visualization: Choropleth Maps

Geographical data visualizations are some of the most impactful forms of data visualization, as these visualizations easily allow the user to locate places familiar to them.  One popular geographical visualization is a [choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) -- a map that shades a geographical region to visually encode data about the region.  As an example, population density maps and per-capita income maps are common choropleth maps.

Understanding how to use an external library, and how to read the documentation provided by its developers, is a critical skill for continuing to learn and expand your Data Science skills!  In this MicroProject, you will use the `folium` Python library -- [https://python-visualization.github.io/folium/](https://python-visualization.github.io/folium/) -- to create choropleth maps from a DataFrame!

Let's nerd out! :)

<hr style='color: #DD3403;'>

## Part 1: Exploring the `folium` Python library

Most widely used Python libraries have extensive examples, and it is often easiest to get started by viewing example code written by the authors of the library.  The `folium` project provides a "Getting Started" guide that includes a section on choropleth maps: https://python-visualization.github.io/folium/latest/getting_started.html#Choropleth-maps

When I take a look at the code, which we provide below, I see that the provided code has four distinct sections:

1. **Data Import**: The first six lines import data about United States geography (`state_geo`) and then load unemployment data into a DataFrame (`state_data`).
2. **Map Creation**: The next line of code creates a blank map, and sets the initial latitude/longitude and zoom level to provide a view of the entire United States.
3. **Data Visualization**: The next several lines of code are one giant call to `folium.Choropleth`, which configures the data visualization on the map.
4. **Rendering**: The final two lines are used to display the map inside of your notebook.

Try it out, and see your first choropleth map! 🗺️

In [0]:
## "Choropleth maps" from folium's "Getting Started" Guide:
## - https://python-visualization.github.io/folium/latest/getting_started.html#Choropleth-maps

import folium
import requests
import pandas

# == Section 1: Data Import ==
state_geo = requests.get(
    "https://raw.githubusercontent.com/python-visualization/folium-example-data/main/us_states.json"
).json()
state_data = pandas.read_csv(
    "https://raw.githubusercontent.com/python-visualization/folium-example-data/main/us_unemployment_oct_2012.csv"
)

# == Section 2: Map Creation ==
m = folium.Map(location=[48, -102], zoom_start=3)

# == Section 3: Data Visualization ==
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)

# == Section 4: Rendering ==
folium.LayerControl().add_to(m)
m


<hr style='color: #DD3403;'>

## Part 2: The Dataset: University of Illinois Demographics by State

The [Division of Management Information (DMI)](https://www.dmi.illinois.edu/) at The University of Illinois is a service unit that provides current and historical student enrollment statistics.  One of the many datasets that DMI provides is the "Final Statistical Abstract" that provides "a summary of student information on the tenth day of the term".

> Only students taking at least one on-campus, credit-bearing class are included in these reports. The following categories of students are excluded: auditors (students taking only non-credit classes); students taking only extramural or off-campus classes; Medical Scholars taking no on-campus, non-MSP classes. (Note: Illini Center MBA students are included in these statistics.)

The exact data is provided as a large, visually formatted spreadsheet that can be viewed here: https://www.dmi.illinois.edu/stuenr/abstracts/SP25_ten.htm

To help focus on building the choropleth maps, we have extracted the data shown in the teal subtable titled "By Permanent Home Address" and provided it for you as `uiuc-dmi-students-by-permanent-home-address.csv`.

### Part 2.1: Create a DataFrame of Illinois Students by State

In the Python variable `df`, create a DataFrame containing the University of Illinois student demographic by state by loading the provided CSV file `uiuc-dmi-students-by-permanent-home-address.csv`:

In [0]:
# Load the University of Illinois student demographic by state:
# - The data is found in: "uiuc-dmi-students-by-permanent-home-address.csv"
df = ...
df

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Part 2.1: Create a DataFrame of Illinois Students by State
tada = "\N{PARTY POPPER}"
assert('df' in vars()), "Make sure your DataFrame is named `df`."
assert('State' in df), "Make sure you've imported the right dataset."
assert(len(df) == 58), "Make sure you've imported the right dataset."
assert(df.Total.sum() == 56619), "Make sure you've imported the right dataset."
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Part 3: Making Our Own Choropleth Map

One of the best ways to begin to use a new library is to modify existing code to create your own visualization!

This MicroProject will walk you through modifying the code from `folium`'s "Getting Started" guide that you saw in Part 1.

### Part 3.1: Using the Illinois Dataset

In the "Getting Started" example, you used the dataset they provided about unemployment by state.  In what we labeled as "Section 1: Data Import", this was imported into the variable `state_data`.

- **Modification #1**: Instead of loading in the dataset they've provided, change `state_data` to be equal to `df` (the DataFrame containing Illinois data).

Incorporate Modification #1 below and run the code (*you should get an error!*, continue to Part 3.2 to fix it):

In [0]:
## "Choropleth maps" from folium's "Getting Started" Guide:
## - https://python-visualization.github.io/folium/latest/getting_started.html#Choropleth-maps

import folium
import requests
import pandas

# == Section 1: Data Import ==
state_geo = requests.get(
    "https://raw.githubusercontent.com/python-visualization/folium-example-data/main/us_states.json"
).json()
state_data = pandas.read_csv(
    "https://raw.githubusercontent.com/python-visualization/folium-example-data/main/us_unemployment_oct_2012.csv"
)

# == Section 2: Map Creation ==
m = folium.Map(location=[48, -102], zoom_start=3)

# == Section 3: Data Visualization ==
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)

# == Section 4: Rendering ==
folium.LayerControl().add_to(m)
m

### Part 3.2: Fix the `KeyError: 'Unemployment'`

After making Modification #1, you will find yourself with a `KeyError: 'Unemployment'`.  A `KeyError` indicates that Python attempted to look for data in a column that does not exist.

This hint leads us to look into the "Section 3: Data Visualization" for any mentions of the columns that are being used.  Specifically, you will find:

> ```py
>   ...
>   data=state_data,
>   columns=["State", "Unemployment"],    # <-- Line about `columns`
>   key_on="feature.id",
>   ...
> ```

To fix this:
- Look back at your DataFrame in Part 2.  What column name contains the **total** number of students from Illinois in each state?
- **Modification #2**: Replace the `"Unemployment"` column name with the column name that contains the total number of students at Illinois for each state.

Return to Part 3.1 and incorporate Modification #2 to your visualization.
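
When a `KeyError` like this appears, it often helps to list the columns that actually exist before deciding which name to use.  Here is a minimal sketch on a hypothetical toy DataFrame (the `Region` and `Count` column names are made up for illustration and are not the columns in your dataset):

```python
import pandas as pd

# Hypothetical toy data; these column names are illustrative only.
toy = pd.DataFrame({"Region": ["North", "South"], "Count": [12, 7]})

# Listing the columns shows exactly which names are available.
available = list(toy.columns)
print(available)

# Requesting a column that is not in the DataFrame raises a KeyError.
try:
    toy["Unemployment"]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)
```

Running the same `list(df.columns)` check on your own `df` shows the column names you can choose from.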


### Part 3.3: Fix the Mapping to Each State

After making Modification #2, the entire map is gray! :(  When using `folium`, a gray location indicates there is no data available for that location -- but our DataFrame has the data!

The mapping between data in your DataFrame and the visual encoding of the state on the data visualization is done through the `key_on` field.  The `key_on` field specifically tells `folium` how to map your data to the visualization.

In the "Getting Started" guide, the `key_on` field is `feature.id` (*`key_on="feature.id"`*).
- This indicates that the key that maps the data to the visual encoding is found in each entry of the `"features"` list, in that entry's `"id"` field.

Examining an entry in the `"features"` list by running the code below, you can see that `'id': 'AL'` represents the state of Alabama by the two-letter postal code:

In [0]:
state_geo["features"][0]

#### Part 3.3 (Continued)

Unfortunately for us, the `State` column in our dataset contains the **full state name** (ex: "Alabama") instead of the two-letter abbreviation.

There are two ways to solve this problem:
1. We can modify the `state_geo` variable so that the `id` field contains the **full state name**, OR
2. We can modify the `key_on` attribute to reference the field with the full state name (instead of `id`)

Looking at the JSON object above, you can find the **full state name** ("Alabama") NOT in `'id'`, but in `'properties'` and then in `'name'`.  This makes the second way to solve this problem way easier than the first.
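
One way to see why a map comes out gray is to compare the keys found in the GeoJSON against the values in your DataFrame.  The sketch below does this on hypothetical toy data; the `lookup` and `matches` helpers are illustrative only and are not part of `folium`, which does its own matching internally:

```python
import pandas as pd

# Hypothetical, heavily trimmed GeoJSON; real files also carry a "geometry" entry.
toy_geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "AA", "properties": {"name": "Alpha"}},
        {"type": "Feature", "id": "BB", "properties": {"name": "Beta"}},
    ],
}
toy_df = pd.DataFrame({"Place": ["Alpha", "Beta", "Gamma"], "Value": [5, 9, 1]})


def lookup(feature, key_on):
    """Follow a dotted path such as 'feature.properties.name' into one feature."""
    value = feature
    for part in key_on.split(".")[1:]:  # skip the leading "feature"
        value = value[part]
    return value


def matches(key_on):
    """Return the sorted keys that appear in both the GeoJSON and the DataFrame."""
    geo_keys = {lookup(f, key_on) for f in toy_geo["features"]}
    return sorted(geo_keys & set(toy_df["Place"]))


for key_on in ["feature.id", "feature.properties.name"]:
    print(key_on, "->", matches(key_on))
```

In this toy data, a path that yields no shared keys means no region would receive a value, which is the situation that shows up as an all-gray map.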

To do this, 
- **Modification #3**: Replace the `key_on=` attribute to be equal to `feature.properties.name` (the location of the full state name).

Return to Part 3.1 and incorporate Modification #3 to your visualization.

### Part 3.4: Fix the Legend

After making Modification #3, we see **the majority of students at the University of Illinois** come from Illinois -- but what about the other states?  They all look the same!

Before we worry about that in Part 4, let's fix one other thing that looks odd before we continue: *Why does our legend say "Unemployment Rate?"*
- **Modification #4**: Replace the `legend_name=` attribute with a descriptive title of the data you're visualizing.

Return to Part 3.1 and incorporate Modification #4 to your visualization.

In [0]:
### TEST CASE for Part 3: Making Our Own Choropleth Map
tada = "\N{PARTY POPPER}"
assert( "m" in vars() ), "Ensure your map variable remains `m`."
html = m._repr_html_()
assert( "choropleth" in html ), "Ensure you have a choropleth map."
assert( "30872" in html ), "Ensure you are using `Total` for your data from your DataFrame."
assert( "Unemployment" not in html ), "Ensure your `legend_name` is referencing the correct data."
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Part 4: Scaling Data with Data Spanning Multiple Orders of Magnitude

The visualization implies that **everyone** at Illinois comes from Illinois, and nowhere else?  Let's explore the raw data and ensure there's not an error in the data.

### Part 4.1: Sorting the Data by Enrollment

To be certain that there is not an error in the data, first sort the data based on the total number of students for each state:

(*Not sure how to do this? The [DISCOVERY guide "Sorting a DataFrame Using Pandas"](https://discovery.cs.illinois.edu/guides/Modifying-DataFrames/sorting-a-dataframe-with-pandas/) can help refresh your knowledge on sorting a DataFrame*)
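
As a quick refresher, here is a minimal sketch of sorting with `sort_values` on a hypothetical toy DataFrame (the `Region` and `Count` names are made up for illustration; use the columns from your own `df`):

```python
import pandas as pd

# Hypothetical toy data; these column names are illustrative only.
toy = pd.DataFrame({"Region": ["North", "South", "East"], "Count": [12, 7, 30]})

# Sort from the largest count to the smallest.
toy_sorted = toy.sort_values("Count", ascending=False)
print(toy_sorted)
```

Note that `sort_values` returns a new DataFrame rather than changing the original in place.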

In [0]:
# Sort your DataFrame `df` by the total number of students:
...

### Part 4.2: Finding the State with the Largest Enrollment besides Illinois

Now, find the single row for the **US state** that has the **second largest representation of students** at The University of Illinois and store it in `df_secondLargestState`:

In [0]:
df_secondLargestState = ...
df_secondLargestState

In [0]:
### Part 4.2: Finding the State with the Largest Enrollment besides Illinois
tada = "\N{PARTY POPPER}"
assert( "df_secondLargestState" in vars() ), "You must define a variable `df_secondLargestState`." 
assert( len(df_secondLargestState) == 1 ), "The DataFrame `df_secondLargestState` must contain only one row." 
X = df.nlargest(4, "Total").reset_index().iloc
assert( df_secondLargestState["Total"].values[0] != X[1]["Total"] ), f"The second largest state is not \"Other Countries\", that is not a US state."
assert( df_secondLargestState["Total"].values[0] == X[2]["Total"] ), f"Make sure you have the second largest US state."
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Part 5: Scaling Data

The data in our dataset spans many orders of magnitude, from just a single-digit number of students from Alaska to thousands of students from California and tens of thousands from Illinois.  It is incredibly hard to create a simple scale that visually represents numbers from as small as 2 all the way to over 30,000.
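
To get a feel for the problem, here is a minimal sketch using hypothetical counts (not values from the dataset) that compares where each value lands on a linear scale and on a log10 scale, both expressed as a fraction of the largest value:

```python
import numpy as np

# Hypothetical counts spanning several orders of magnitude.
counts = np.array([2, 40, 900, 30000])

# Position of each value on a linear scale running from 0 to the maximum.
linear_share = counts / counts.max()

# Position of each value when both it and the maximum are log10-transformed.
log_share = np.log10(counts) / np.log10(counts.max())

print(np.round(linear_share, 3))
print(np.round(log_share, 3))
```

On the linear scale, everything except the largest value is crowded near zero, so those regions would be shaded almost identically.  On the log10 scale the same values are spread much more evenly.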

A common way to visually represent data with many orders of magnitude is to use a **log scale**.  Here's some information about "base10 logarithm" or $log_{10}$:

- The logarithm function is the inverse of the exponent function.  *(A base10 logarithm is the inverse of $10^{x}$.)*

- The base10 logarithm is easy to understand since it **roughly counts the number of possible zeros in a number**:
    - 1 has one digit (no zeros), and $log_{10}(1) = 0$
    - 10 has two digits (one possible zero), and $log_{10}(10) = 1$
    - 100 has three digits (two possible zeros), and $log_{10}(100) = 2$
    - 1000 has four digits (three possible zeros), and $log_{10}(1000) = 3$
    - 10,000 has five digits (four possible zeros), and $log_{10}(10000) = 4$
    - ...some people prefer to think of a number's base10 logarithm as the number of digits minus one, instead of the number of zeros.
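
The "number of digits minus one" rule can be checked with a short sketch using Python's built-in `math.log10` (the list of numbers is just an illustrative sample):

```python
import math

# An illustrative sample: powers of ten plus a few in-between values.
numbers = [1, 10, 100, 1000, 10000, 33, 30872]

# For each number, compare (digits - 1) with the whole-number part of log10.
checks = []
for n in numbers:
    digits = len(str(n))
    whole_part = math.floor(math.log10(n))
    checks.append((n, digits, whole_part))
    print(n, "has", digits, "digits; floor(log10) =", whole_part)
```

For every number in the sample, the whole-number part of the base10 logarithm equals the number of digits minus one.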

For numbers that fall between powers of ten, the $log_{10}$ falls between the corresponding whole numbers:

- 33 is between 10 and 100, so we expect the base10 logarithm to be between 1 and 2.  It's `1.52`.
- 72 is also between 10 and 100, so we expect the base10 logarithm to be between 1 and 2, and closer to 2.  It's `1.86`.
- The number of students at Illinois from Illinois, 30872, has five digits so its base10 logarithm should be between `4` and `5`.  It's `4.49`.
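
These values can be reproduced with a minimal sketch using Python's built-in `math.log10`, rounded to two decimal places:

```python
import math

# The three example values discussed above.
values = [33, 72, 30872]

# Compute each base10 logarithm, rounded to two decimal places.
logs = {n: round(math.log10(n), 2) for n in values}
print(logs)
```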

### Using the `np.log10` function

The `numpy` library, commonly imported as `np`, provides a function that will transform a column of data into the `log10` values.  Create a new column, `Total_log10`, that uses the `np.log10(...)` function with the `Total` column:

In [0]:
import numpy as np
df["Total_log10"] = ...
df

In [0]:
### TEST CASE for Part 5: Scaling Data
import math 
import numpy as np
tada = "\N{PARTY POPPER}"
assert('Total_log10' in df), "Ensure that your new column in `df` is called `Total_log10`."
assert( math.isclose((np.log10(df.Total) - df["Total_log10"]).sum(), 0) ), "The values in your `Total_log10` column are incorrect."
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Part 6: Creating a log10-scaled Choropleth Map


Finally, let's try remaking our choropleth map from before using our new column `Total_log10`! 🎉
- First, return to Part 3.1 and copy and paste all your code into the cell below.
- Then, modify the code to use the column `Total_log10` instead of `Total`.

In [0]:
...

### Analysis

The choropleth map you have just created shows an easy-to-read distribution of our entire dataset in a single data visualization.  A reader can now quickly answer contextually relevant questions, including:
- Which states that border Illinois send the fewest students to the University of Illinois?
- Other than California, what other Western states do many University of Illinois students call home?
- ...and many other similar questions!

In [0]:
### TEST CASE for Part 6: Creating a log10-scaled Choropleth Map
tada = "\N{PARTY POPPER}"
assert( "m" in vars() ), "Ensure your map variable remains `m`."
html = m._repr_html_()
assert( "choropleth" in html ), "Ensure you have a choropleth map."
assert( str(round(df["Total_log10"].max(), 5))  in html ), "Ensure you are using `Total_log10` for your data."
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/choropleth-map-dataframe/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉