# Welcome to lab_kmeans! 🌎

In this lab, you will continue your exploration of machine learning by doing some k-means clustering! 

A few tips to remember:

- **You are not alone on your journey in learning programming!** You have your lab TA, your CAs, your lab group, and the professors (Prof. Wade and Prof. Karle), who are all here to help you out!
- If you find yourself stuck for more than a few minutes, ask a neighbor or course staff for help! When you are giving help to your neighbor, explain the **idea and approach** to the problem without sharing the answer itself so they can have the same ***ah-hah*** moment!
- We are here to help you! Don't feel embarrassed or shy to ask us for help!

Let's get started!

In [None]:
# Meet your CAs and TA if you haven't already!
# ...first name is enough, we'll know who they are! :)
ta_name = ""
ca1_name = ""
ca2_name = ""


# Say hello to each other!
# - Groups of 3 are ideal :)
# - However, groups of 2 or 4 are fine too!
#
# Question of the Day (QOTD) to Ask Your Group: "When are you leaving campus for winter break?"
partner1_name = ""
partner1_netid = ""
partner1_day = ""

partner2_name = ""
partner2_netid = ""
partner2_day = ""

partner3_name = ""
partner3_netid = ""
partner3_day = ""

<hr style="color: #DD3403;">

# Part 1: The World Happiness Dataset
Every year, the UN Sustainable Development Solutions Network (SDSN) creates a **report** detailing the "happiness" of various countries in the world. Utilizing economic, social, and health data, they create the [World Happiness Report](https://worldhappiness.report/about/). Curators of the report **observed survey data** of seven variables (GDP Per Capita, Social Support, Life Expectancy, Freedom, Generosity, Corruption, and Dystopia), estimating their **associations with life evaluations**, ultimately coming up with a `Happiness Score` for each country.

Some of their report uses data that we can analyze for this lab. We've collected a version of the 2023 **World Happiness Report** data into a dataset and provided it in **CSV format** - it's the `happiness-report-2023.csv` file!

## Puzzle 1.1: Loading In

Load the **World Happiness Dataset** from `happiness-report-2023.csv` and store it in the DataFrame `df`:

In [None]:
...

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == TEST CASES for Puzzle 1.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert('df' in vars()), "The DataFrame should be loaded in as a variable named `df`."
assert(len(df) == 137), "This is not the dataset we are looking for..."
assert('Happiness Score' in df), "This is not the dataset we are looking for..."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

## Puzzle 1.2: Correlation Coefficients
Let's explore our dataset a bit. It's always good practice to explore and understand your data before performing any machine learning task. Generate the **correlation coefficient matrix** of our **World Happiness Dataset**, `df`, in the cell below:

Note: Some versions of Pandas will not run the `.corr()` function on DataFrames with non-numerical columns. There is a `numeric_only` argument in `.corr()` that you can set to `True` to include only the numerical columns in the correlation coefficient matrix. It should look like `.corr(numeric_only = True)`.
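To see what `numeric_only` does, here is a minimal sketch on a tiny made-up table (the `toy` DataFrame and its column names are hypothetical, not part of the lab dataset). It assumes a version of Pandas that supports the `numeric_only` argument:

```python
import pandas as pd

# Hypothetical toy data (NOT the lab dataset): one text column, two numeric columns.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
})

# numeric_only=True leaves the text column out of the correlation matrix.
toy_corr = toy.corr(numeric_only=True)
print(toy_corr)
```

Since `y` is exactly twice `x` in this toy table, the two columns are perfectly correlated, and the `name` column does not appear in the matrix at all.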

In [None]:
...

### Group Analysis: Correlations


**Q1: Which two columns were the most highly correlated, producing a coefficient of `0.837533`? Given the real-life context, does this make sense? Why or why not?**

*(✏️ Edit this cell to replace this text with your group's answer. ✏️)*

**Q2: All of the columns in our dataset show significant correlation with `Happiness Score` except for `Generosity` (with a coefficient of `0.044082`). This means there is almost no linear relationship between the perceived `Generosity` of a country and its `Happiness Score`. Explain why you think this is the case.**

*(✏️ Edit this cell to replace this text with your group's answer. ✏️)*

## Puzzle 1.3: Visualization
Now, let's visualize the relationships between some variables in our dataset. 

Generate a scatter plot showing the relationship between `GDP Per Capita` and `Healthy Life Expectancy`. Remember to specify these columns as the **x** and **y** of the scatter plot.


In [None]:
...

**Q3: Observing the plot above, how would you divide the data into two groups (clusters)?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

Now, generate a scatter plot showing the relationship between `Happiness Score` and `Generosity`:

In [None]:
...

**Q4: Observing the plot above, how would you divide the data into two groups (clusters)?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

<hr style="color: #DD3403;">

# Part 2: Clustering
Now that we've observed some relationships in our dataset, it's time to try **clustering the data**! You may have noticed that our dataset contains fairly high **correlation coefficients** across **multiple columns**. Despite this, it will still be **valuable** to perform **k-means clustering**. 

With a **k-means clustering** model, we will be able to both:
- Identify potential **non-linear** relationships between variables, and
- Identify the **most important *features*** of our dataset by looking at **cluster centroids**

Before we begin, remember that KMeans clustering is a method of **unsupervised** machine learning, meaning unlike `lab_regression`, we **do not provide** **labels** or target values. Rather, we will **allow the model** to determine **groups of the data** based on their similarity alone. 
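As a rough illustration of what "unsupervised" means in practice, here is a small sketch on made-up points (the `points` array and `toy_model` name are hypothetical). Notice that `fit` receives only the data, with no labels or target values. The `n_init` and `random_state` arguments are set here only so the sketch behaves the same way each time it is run:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two obvious groups of 2D points, with no labels attached.
points = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],   # group near (0, 0)
    [5.0, 5.1], [5.1, 5.0], [5.2, 5.1],   # group near (5, 5)
])

# n_init and random_state are set only to make this sketch reproducible.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_model.fit(points)   # only the data is passed in; there is no target column

# labels_ holds the cluster number the model assigned to each point.
toy_labels = toy_model.labels_
print(toy_labels)
```

Which group ends up being called `0` and which `1` is arbitrary. What matters is that points in the same group share a label.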

## Part 2.1: Creating a KMeans Model

In the cell below, create a new k-means model named `kmeans` that contains just **two (2) clusters**:


In [None]:
# Import the KMeans library:


# Create a KMeans model:
kmeans = ...

## Part 2.2: Fitting Numeric Columns to the Model
The k-means model only works on **numeric data** (*you can't find a mean -- or average -- of non-numbers*). The following code provides a list of all the **numeric columns** in our `df`. We've provided you with the cell below to define and store the `numeric_columns` list.

*Do not modify the cell below - just **run it!***

In [None]:
numeric_columns = ['Happiness Score', 'GDP Per Capita', 'Social Support', 'Healthy Life Expectancy', 'Freedom', 'Generosity', 'Corruption','Dystopia']
numeric_columns

Try to `fit` the k-means model to the data from the `numeric_columns`.

Note: You should **expect to see** a `ValueError: Input X contains NaN.` Make sure to get that **error message** - this is **intended**!

In [None]:
kmeans.fit(df[numeric_columns])

## Puzzle 2.3: Drop the Rows with Missing Data

The **error message** above provides details on several solutions to dealing with **missing data**. To continue to use `KMeans`, our best option is to **drop rows with missing data**. Create a new DataFrame, `df2`, that contains only **rows in `df` with no NaN values** that can be used for clustering:

Hint: The `.dropna()` function would be useful here!
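If you have not used `.dropna()` before, here is a minimal sketch of how it behaves on a tiny made-up table (the `toy` and `toy_clean` names are hypothetical and unrelated to `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in the second row.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [4.0, 5.0, 6.0],
})

# dropna() returns a NEW DataFrame without the rows that contain NaN;
# the original `toy` is left unchanged.
toy_clean = toy.dropna()
print(len(toy), len(toy_clean))
```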



In [None]:
df2 = ...
df2

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == TEST CASES for Puzzle 2.3 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(not(df2[numeric_columns].isna().any().any())), "Make sure to drop all rows in `df2` with NaN values."
assert(len(df) - len(df2) == 1), "The length of your `df2` is not correct. Make sure you dropped all NaN rows in `df2`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

## Part 2.4: Normalizing the Numeric Data

In lecture, you learned that means are **VERY sensitive** to **outliers**. Since our data comes from **different ranges**, we must **normalize the numeric data**. To **normalize the data**, we can divide **each numeric column** by the **maximum value** of the column. This is done in the provided code below.

Notice that, in the result, all column values are now **scaled to be between 0 and 1**. Since all values are now in the **same range**, no single column dominates the distance calculations that k-means relies on.

*Do not modify the cell below - just **run it!***

In [None]:
for column in numeric_columns:
    df2[column] = df2[column] / df2[column].abs().max()
df2[numeric_columns]
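To see the effect of this kind of scaling in isolation, here is a small sketch that applies the same divide-by-maximum idea to a made-up table (the `toy` DataFrame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: two columns measured on very different scales.
toy = pd.DataFrame({
    "small_range": [0.2, 0.5, 1.0],
    "large_range": [10_000.0, 40_000.0, 80_000.0],
})

# Work on a copy so the original toy table is left untouched.
toy_scaled = toy.copy()
for column in toy_scaled.columns:
    # Divide each column by its largest absolute value.
    toy_scaled[column] = toy_scaled[column] / toy_scaled[column].abs().max()
print(toy_scaled)
```

After scaling, the largest value in each toy column is 1. Note that if a column held negative values, dividing by the largest absolute value would place its values between -1 and 1 instead.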

## Puzzle 2.5: Training with `numeric_columns` from `df2`

Now that we have handled missing data and normalized the numeric data, we can **train our model**! Using the `numeric_columns` and `df2`, `fit` your `kmeans` model:

In [None]:
...

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == TEST CASES for Puzzle 2.5 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(isinstance(kmeans, KMeans)), "The `kmeans` model should be saved as `kmeans`."
assert(kmeans.n_clusters == 2), "The `kmeans` model should have two clusters."
assert(kmeans.cluster_centers_ is not None), "The `kmeans` model should be trained with `df2`'s `numeric_columns`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

## Puzzle 2.6: Centroids

In lecture, you learned that `kmeans.cluster_centers_` will display the location of the **final centroids**.

The **order they're displayed** will be the **same order** as the columns are listed in `numeric_columns`. 

Explore the **centroids** and **numeric columns** by **running the cells below**:


In [None]:
kmeans.cluster_centers_

In [None]:
numeric_columns
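If matching raw centroid numbers to column names by eye gets tedious, one option is to wrap the centroids in a DataFrame. The sketch below does this on made-up data (the `toy`, `toy_columns`, and `toy_model` names are hypothetical, and `n_init` and `random_state` are set only for reproducibility). It is one possible way to read centroids, not a required step of the lab:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data with two made-up feature columns.
toy_columns = ["feature_a", "feature_b"]
toy = pd.DataFrame(
    [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]],
    columns=toy_columns,
)
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy[toy_columns])

# One row per centroid, one column per feature, in the same order as toy_columns.
toy_centroids = pd.DataFrame(toy_model.cluster_centers_, columns=toy_columns)

# The absolute gap between the two centroids, feature by feature.
toy_gap = (toy_centroids.iloc[0] - toy_centroids.iloc[1]).abs()
print(toy_centroids)
print(toy_gap)
```

Features with the largest gaps are the ones on which the two centroids differ the most.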

### Analysis: Centroids

**Q5: In the following cell, answer both of these questions:**
- Looking at the centroids above, which cluster would you expect to have the **"happier"** countries? How do you know? 
- Which **features** are arguably **most important** to defining the **centroids**? (**Hint:** Look for the largest *differences* between values from centroid to centroid)



*(✏️ Edit this cell to replace this text with your answer. ✏️)*

## Puzzle 2.7: Prediction

Now, we can use our model to **assign each country** in our dataset to one of the two **clusters**!
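As a rough sketch of what assigning rows to clusters looks like, the example below fits a model on made-up data and stores one predicted label per row (all names here, including `toy`, `toy_model`, and `toy_cluster`, are hypothetical, and `n_init` and `random_state` are set only for reproducibility):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data with two made-up feature columns.
toy_columns = ["feature_a", "feature_b"]
toy = pd.DataFrame(
    [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]],
    columns=toy_columns,
)
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy[toy_columns])

# predict returns one integer label per row, in the same order as the rows.
toy["toy_cluster"] = toy_model.predict(toy[toy_columns])
print(toy)
```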

Using your `kmeans` model and the `numeric_columns`, `predict` the centroid for each row and store that prediction in a new column in `df2` called `cluster`:



In [None]:
...

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == TEST CASES for Puzzle 2.7 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import pandas.api.types as ptypes
assert('cluster' in df2.columns.to_list()), "The cluster predictions should be stored in a column `cluster` of `df2`."
assert(ptypes.is_numeric_dtype(df2['cluster'])), "The `cluster` column should be numeric."
assert(set(df2.cluster.unique()) == set([0, 1])), "The `cluster` column should contain values of 0 and 1 (representing the cluster of each row)."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

## Puzzle 2.8: Visualizing Your Model

In the cell below, we've used `df2.plot.scatter()` to display a **scatter plot** of some columns of our data. We specify the **parameter** `c` (color) to be the `cluster` of each data point, visualizing the **two clusters** in **two different colors**. 

The columns in the scatter plot that will be generated below are of `GDP Per Capita` (on the x-axis) and `Happiness Score` (on the y-axis).

*Do not modify the cell below - just **run it!***

In [None]:
df2.plot.scatter(
    x='GDP Per Capita', 
    y='Happiness Score',
    c="cluster",
    colormap='Set1'
)

**Q6: In the plot above, do we see clean clusters, or are points intermingled? What does this say about our clustering model in relation to the columns shown in the plot?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

Next, using the code from **above** as a template, generate a **scatter plot** of `df2` such that:
- We observe the **relationship between** `Happiness Score` and `Generosity` (these are your `x` and `y`)
- Visualize the **clusters by color** (keep `c` and `colormap` identical):

In [None]:
...

**Q7: In the plot above, do we see clean clusters, or are points intermingled? In Part 1, you may have observed a low correlation between `Happiness Score` and `Generosity`. Does this make sense given the clusters above? Why or why not?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

Finally, generate a **scatter plot** of `df2` such that:
- We observe the **relationship between** `GDP Per Capita` and `Healthy Life Expectancy` (these are your `x` and `y`)
- Visualize the **clusters by color** (keep `c` and `colormap` identical):

In [None]:
...

**Q8: In the plot above, do we see clean clusters, or are points intermingled? In Part 1, you may have observed a high correlation between `GDP Per Capita` and `Healthy Life Expectancy`. Does this make sense given the clusters above? Why or why not?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

<hr style="color: #DD3403;">

# Part 3: A Third Cluster
Our first `kmeans` model was created with only **two (2) clusters**. This is a valid start, but perhaps **more clusters** suit the dataset better. 

Let's experiment by adding **an additional cluster** and observe any differences in model results!

## Puzzle 3.1: Creating and Training our Model

In the cell below, create a new KMeans model named `kmeans_three` that contains **three (3) clusters**:


In [None]:
kmeans_three = ...

Recall that k-means models can only be fit to **numeric data**, and the data **cannot be NaN**. 

Using the `numeric_columns` and `df2`, `fit` your `kmeans_three` model:

In [None]:
...

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == TEST CASES for Puzzle 3.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(isinstance(kmeans_three, KMeans)), "The KMeans model with three clusters should be saved as `kmeans_three`."
assert(kmeans_three.n_clusters == 3), "The `kmeans_three` model should have three clusters."
assert(kmeans_three.cluster_centers_ is not None), "The `kmeans_three` model should be trained with `df2`'s `numeric_columns`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

## Puzzle 3.2: Model Centroids

Explore the **centroids** of your `kmeans_three` model alongside the  **numeric columns** by **running the cells below**:

Remember, the **order of numbers displayed** will be the **same order** as the columns are listed in `numeric_columns`. 

In [None]:
kmeans_three.cluster_centers_

In [None]:
numeric_columns

### Analysis: Centroids, Again
**Q9: Once more, answer both of these questions:**
- Looking at the centroids above, which cluster would you expect to have the **"happiest"** countries? How do you know? 
- Which **features** are arguably **most important** to defining the **centroids**? (**Hint:** Look for the largest *differences* between values from centroid to centroid)

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

## Puzzle 3.3: Prediction, Again
Now, we are going to predict the closest cluster centroids for each of our rows (countries). Using your `kmeans_three` model and the `numeric_columns`, `predict` the centroid for each row and store that prediction in a new column in `df2` called `three_cluster`:

In [None]:
...

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == TEST CASES for Puzzle 3.3 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import pandas.api.types as ptypes
assert('three_cluster' in df2.columns.to_list()), "The cluster predictions for your kmeans_three model should be stored in a column `three_cluster` of `df2`."
assert(ptypes.is_numeric_dtype(df2['three_cluster'])), "The `three_cluster` column should be numeric."
assert(set(df2.three_cluster.unique()) == set([0, 1, 2])), "The `three_cluster` column should contain values of 0, 1, and 2 (representing the cluster of each row)."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

## Puzzle 3.4: Visualizing Your Model, Again

Now that we've trained and predicted our `kmeans_three` model, we can visualize a **scatter plot** of our data in clusters. We specify the **parameter** `c` (color) to be the `three_cluster` of each data point, visualizing the **three clusters** in **three different colors**. 

The columns in the scatter plot that will be generated below are of `GDP Per Capita` (on the x-axis) and `Happiness Score` (on the y-axis).

Notice this code is **identical** to that from Puzzle 2.8, with the single change of the `c` parameter passed to `scatter()`.

*Do not modify the cell below - just **run it!***

In [None]:
df2.plot.scatter(
    x='GDP Per Capita', 
    y='Happiness Score',
    c="three_cluster",
    colormap='Set1'
)

**Q10: In the plot above, do we see clean clusters, or are points intermingled? What does this say about our clustering model in relation to the columns shown in the plot?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

Next, again, using the code from **above** as a template, generate a **scatter plot** of `df2` such that:
- We observe the **relationship between** `Happiness Score` and `Generosity` (these are your `x` and `y`)
- Visualize the **clusters by color** (keep `c` and `colormap` identical):

In [None]:
...

**Q11: In the plot above, do we see clean clusters, or are points intermingled? In Part 1, you may have observed a low correlation between `Happiness Score` and `Generosity`. Does this make sense given the clusters above? Why or why not?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

One last time, generate a **scatter plot** of `df2` such that:
- We observe the **relationship between** `GDP Per Capita` and `Healthy Life Expectancy` (these are your `x` and `y`)
- Visualize the **clusters by color** (keep `c` and `colormap` identical):

In [None]:
...

**Q12: In the plot above, do we see clean clusters, or are points intermingled? In Part 1, you may have observed a high correlation between `GDP Per Capita` and `Healthy Life Expectancy`. Does this make sense given the clusters above? Why or why not?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

### Group Analysis: Comparing Models
**Q13:** Now that we've **trained and visualized** both a **two-cluster** and **three-cluster** KMeans model on the World Happiness Dataset, make some observations about the **performance of both**. Do you think one of the two models we've created **better clusters** the dataset or **reflects patterns** within it? Back up your answer with **at least three observations** from your visualizations.



*(✏️ Edit this cell to replace this text with your group's answer. ✏️)*

*Side Note:* If you are curious, there are **numeric metrics** in the `sklearn.metrics` module that can be used to evaluate clustering models **quantitatively**. Feel free to investigate any of those metrics on your own time (**not required** for this lab). One example of such a metric is **silhouette score** - which measures how similar an object is to its own cluster versus neighboring cluster(s).
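For the curious, here is a minimal sketch of how silhouette score could be used to compare two cluster counts. It runs on made-up points rather than the lab dataset (the `points` array and `scores` dictionary are hypothetical names, and `n_init` and `random_state` are set only for reproducibility):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two tight, well-separated groups of 2D points.
points = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],   # group near (0, 0)
    [5.0, 5.1], [5.1, 5.0], [5.2, 5.1],   # group near (5, 5)
])

# Fit one model per cluster count and record its silhouette score.
scores = {}
for k in (2, 3):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    scores[k] = silhouette_score(points, model.labels_)
print(scores)
```

Silhouette scores range from -1 to 1, and higher is better. On this toy data, two clusters score higher than three, since a third cluster has to split one of the tight groups.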

<hr style="color: #DD3403;">

# Part 4: Reflecting on Machine Learning

**Q14**: Consider the takeaways from both this lab and the previous (`lab_regression`). You've learned a lot about two **fundamental machine learning** techniques - **regression** and **clustering**. Given your experience, answer the following questions in a paragraph-style response:
- When would you employ **regression** rather than **clustering** and vice-versa? What are the most important *"tells"* that a dataset can have to guide your intuition? 
- Do you believe that **clustering** was particularly effective for the **World Happiness Dataset**? What takeaways do you think **regression** would have provided instead?
- How would you approach conducting Machine Learning on our **Hello Dataset**? Do you believe it is better suited for **regression** or **clustering**? Both?  



*(✏️ Edit this cell to replace this text with your answer. ✏️)*

<hr style="color: #DD3403;">

# Submission

You're almost done!  All you need to do is to commit your lab to GitHub:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the Canvas instructions to commit this lab to your Git repository!

3. Your TA will grade your submission and provide you feedback after the lab is due. :)