<a href="https://colab.research.google.com/github/sujathasivaraman/mlai/blob/main/Copy_of_Part_1__Examining_the_Problem_and_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h2>Numerical Classification - Day 1</h2>

<p style="font-size: 16px;">
Welcome to your journey into <b>Numerical Classification</b>! We're excited to have you here and can't wait to see what you'll achieve. This series of notebooks has been thoughtfully designed to guide you through the world of numerical data analysis and classification, a foundational area in machine learning where algorithms learn to distinguish between different classes based on numerical features.
</p>

<p style="font-size: 16px;">
Each notebook will lead you step-by-step through the key stages of building a numerical classification model. From preprocessing raw data to visualizing trends, engineering features, and training powerful machine learning algorithms, you'll gain both theoretical knowledge and practical hands-on experience. Along the way, you'll encounter diverse datasets and challenges that mirror real-world problems, helping you build skills that are directly applicable to the field.
</p>

<p style="font-size: 16px;">
By the end of these notebooks, you'll have a robust understanding of numerical classification techniques and a project that showcases your unique insights and creativity. You'll learn not only to build effective models but also to evaluate their performance and refine them for optimal results. This is your opportunity to dive deep, explore different approaches, and customize your project to reflect your interests.
</p>

<p style="font-size: 16px;">
We encourage you to experiment boldly, ask questions, and think critically as you progress. This isn't just about following a recipe; it's about discovery and innovation. Take ownership of your project, try out new ideas, and make it truly your own.
</p>

<p style="font-size: 16px;">
So, let's dive in and start building! We're here to support you every step of the way as you embark on this exciting journey into numerical classification machine learning. Enjoy the process, and don't hesitate to go beyond the guidelines: this project is as much about exploring the possibilities as it is about mastering the fundamentals.
</p>

<font color="orange"><b>Disclaimer:</b> This project uses numerical datasets from publicly available sources. While efforts have been made to ensure data quality, some datasets may contain errors or inconsistencies. Be prepared to handle missing values, outliers, and other common data challenges.</font>


# PART I: Choosing Your Project

<img src="https://drive.google.com/uc?export=view&id=1FGLwKxlZi7iuqD1MnwiuVHr2s9JYG4TT" height=300>

In [None]:
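%%capture
# @title Import Packages (might take a couple minutes)
# A cell magic such as %%capture must be the first line of the cell and may appear only once.

# Data handling and visualization libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# %%capture hides this cell's output; remove the magic to see this message.
print("import statements run")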


In [None]:


# @title Project Selection
Project = "Heart Disease Detection" # @param ["Salary Prediction","Heart Disease Detection","Weather Classification"]

if Project == "Salary Prediction":
  url = "https://drive.google.com/uc?id=10abbbVs3fQSK_Dqig_gzS22CKT6IMsOl"
  df = pd.read_csv(url)

elif Project == "Heart Disease Detection":
  url = "https://drive.google.com/uc?id=1u5YfMAkLH2ybUt14d8o8OVFV8CKTHS8c"
  df = pd.read_csv(url)

elif Project == "Weather Classification":
  url = "https://drive.google.com/uc?id=1ukw5BvceV-F2eeiS6pd3xRiffFBdtBGO"
  df = pd.read_csv(url)


# Runs for every project choice, not only the last branch
print("Ran project selection")

# PART II: Understanding Your Project

The following exercises will help you dive deep into the purpose of your project and consider the broader implications of your work. Take your time with each question to think critically about your project's goals, the importance of your predictions, and the impact your insights could have.


### Exercise 2A: Define the Goal of Your Project

- What is the primary goal of this dataset?
- If you can accurately predict this label, what real-world problem does it help solve?
- Who does it benefit? How might they use your prediction tool?

In [None]:
# @title

# Please answer the following questions to clarify the goals and implications of your project.

# What is the primary goal of this dataset?
primary_goal = "Show the salaries of a diverse set of individuals"  # @param {type:"string"}

# If you can accurately predict this label, what real-world problem does it help solve?
real_world_problem = "Identify inequalities and patterns in salaries"  # @param {type:"string"}

# Who does it benefit? How might they use your prediction tool?
benefits = "Help ensure equality and fairness"  # @param {type:"string"}

print("Ran the goal section of project selection")

Ran the goal section of project selection


### Exercise 2B: Understand the Broader Implications

- What could be the potential positive impacts of your project?
- Are there any potential negative impacts or risks associated with your project?


In [None]:
# @title

# Please answer the following questions to clarify the goals and implications of your project.

# What could be the potential positive impacts of your project?
positive_impacts = "Help youngsters decide and design their future"  # @param {type:"string"}

# Are there any potential negative impacts or risks associated with your project?
risks = "This is again prediction not clairvoyance"  # @param {type:"string"}

## ⚙ Machine Learning

### Exercise 2C: Utilizing Machine Learning

- Brainstorm some ideas for how machine learning could provide a solution to your project.

In [None]:
# @title

idea_1 = "Enter your answer here"  # @param {type:"string"}

idea_2 = "Enter your answer here"  # @param {type:"string"}

### Exercise 2D: Supervised Learning

**Supervised machine learning** is a type of machine learning where a model learns from labeled data. In this approach, we provide the model with both inputs (features) and the correct outputs (labels) during training, allowing it to learn relationships between them. This enables the model to make predictions on new, unseen data based on what it has learned.
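
To make the idea concrete, here is a minimal sketch of supervised learning with scikit-learn. The tiny dataset, its column names (`hours_studied`, `hours_slept`, `passed`), and the choice of `LogisticRegression` are illustrative assumptions and are not part of this project's datasets.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: every column name and value here is made up.
toy = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 6, 7, 8, 9],
    "hours_slept":   [7, 6, 8, 7, 7, 8, 6, 7],
    "passed":        [0, 0, 0, 0, 1, 1, 1, 1],
})

X = toy[["hours_studied", "hours_slept"]]  # features: the inputs
y = toy["passed"]                          # labels: the correct outputs

model = LogisticRegression()
model.fit(X, y)  # training: learn the relationship between X and y

# Predict labels for new inputs the model has not seen before
new_students = pd.DataFrame({"hours_studied": [0, 10], "hours_slept": [7, 7]})
predictions = model.predict(new_students)
print(predictions)
```

The same pattern (separate `X` from `y`, call `fit`, then `predict`) applies whichever project you chose; only the columns change.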

  <img src="https://miro.medium.com/v2/resize:fit:1200/1*fq4smdRhVA2ZL6dxrikbKg.jpeg" height="400">

---

##### Questions to Consider

- **What would be the features in a numerical dataset?**
  
- **What would be the labels for your chosen project?**

In [None]:
# @title
features_X = "" # @param {type:"string"}

labels_y = "" # @param {type:"string"}


# Part III: Data Wrangling

<img src="https://cdn.prod.website-files.com/62d80a87294cc0d49739df0e/66ed6f86016807b84b3a330a_AD_4nXcvYi3hjKzYNbb61NFA7py3wVlg9sRcv8qZFgzYrF2l4AJL_iyRtZCnzfyyN2M0PUOH2rd5YEDaeYoWct4jKuTBl89QJubiIIbvV0_NpRFquqw_N9yP7i-foJBKat2TGmGi5cPh0H_8OwK_LGZzKDZ-NZhgMBD7L-LiSbEcvHm2kpzxQiXUqw.png" height=400>

The go-to library for working with numerical data in Python is Pandas. This powerful open-source library is designed specifically for data manipulation and analysis. With its versatile data structures and rich functionality, Pandas makes handling structured data fast, efficient, and intuitive.

In this notebook, we'll rely heavily on Pandas to conduct our analyses. By the time you've completed the next few sections, you won't just have your data prepared for building machine learning models; you'll also have taken a significant step toward mastering Pandas!

When working with numerical data, these pandas functions will help you explore, clean, and understand your dataset effectively:

1. **`df.head()`**  
   Displays the first few rows of the DataFrame (default is 5).

2. **`df.info()`**  
   Provides a summary of the DataFrame, including the number of non-null entries and data types.

3. **`df['column'].value_counts()`**  
   Returns how many times each unique value appears in the specified column (e.g., `labels`).

4. **`df.isnull().sum()`**  
   Checks for missing values in each column and returns the count of `NaN` values.

5. **`df.dropna()`**  
   Removes rows with missing values. You can also use `df.dropna(subset=['labels'])` to remove rows only where the column, `labels`, contains `NaN`, which is helpful for preparing your data for analysis.

6. **`df.fillna(value)`**  
   Replaces `NaN` values with a specified `value`. For example, you might fill missing text values with an empty string (`""`) or a placeholder, depending on your needs.

7. **`df['labels'].apply(function)`**  
   Applies a function to each entry in the `labels` column. This is powerful for text preprocessing tasks, such as converting text to lowercase, removing punctuation, or applying custom text-cleaning functions (which will come in handy in the next notebook!).

8. **`df.groupby('labels')`**  
   Groups the DataFrame by the `labels` column and can be followed by aggregate functions (e.g., `.size()` to count samples per label).

We've pre-loaded your DataFrame into the variable ```df```, so you can start practicing the functions listed above right away!
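
If you would like to see these functions in action before touching your own data, the sketch below runs several of them on a small stand-in DataFrame. The name `demo` and all of its columns and values are hypothetical; swap in `df` and your own column names when you try it yourself.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for `df`; the column names and values are made up.
demo = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 25],
    "city":   ["Austin", "Boston", "Austin", None, "Austin"],
    "labels": ["yes", "no", "yes", "no", "yes"],
})

print(demo.head(3))                       # first 3 rows
demo.info()                               # dtypes and non-null counts
print(demo["labels"].value_counts())      # how often each label occurs

missing_per_column = demo.isnull().sum()  # NaN count per column
print(missing_per_column)

# Fill each column with its own replacement value
filled = demo.fillna({"age": demo["age"].median(), "city": "unknown"})

upper = demo["labels"].apply(str.upper)    # apply a function to every entry
per_label = demo.groupby("labels").size()  # number of rows per label
print(per_label)
```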

### Exercise 3A

Use ```df.head()``` to view the first 10 rows of your dataset.

💡 *Hint*: You can adjust the number of rows displayed by specifying a different number in `df.head()`. For more details, refer to the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html).
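
As a small illustration of how a parameter changes a function's behavior, the sketch below calls `head()` with and without its `n` parameter. The 20-row DataFrame `numbers` is a hypothetical example, not your project data.

```python
import pandas as pd

# Hypothetical 20-row DataFrame used only to illustrate the `n` parameter.
numbers = pd.DataFrame({"value": range(20)})

default_rows = numbers.head()    # n defaults to 5
ten_rows = numbers.head(10)      # n passed as a positional argument
also_ten = numbers.head(n=10)    # the same call with a keyword argument

print(len(default_rows), len(ten_rows), len(also_ten))
```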

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,label
0,0,5199,65,Local-gov,254413.0,Some-college,10.0,Divorced,Exec-managerial,Not-in-family,White,Female,0.0,0.0,40,United-States,<=50K
1,1,2447,28,Private,331381.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,0.0,0.0,40,United-States,<=50K
2,2,4227,35,Private,255191.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,45,United-States,>50K
3,3,4093,45,Private,195554.0,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,45,United-States,>50K
4,4,2426,45,Self-emp-not-inc,40690.0,Some-college,10.0,Never-married,Farming-fishing,Own-child,White,Male,0.0,0.0,75,United-States,<=50K


What observations did you make about the documentation? What are "parameters," and how do they influence the behavior of functions?

In [None]:
#@title

reflection = "" # @param {type:"string"}

### Exercise 3B

Use ```df.info()``` to get some more information about your dataframe.

💡 For more details, refer to the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html).

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1206 entries, 0 to 1205
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      1206 non-null   int64  
 1   id              1206 non-null   int64  
 2   age             1206 non-null   int64  
 3   workclass       1206 non-null   object 
 4   fnlwgt          1205 non-null   float64
 5   education       1206 non-null   object 
 6   education.num   1205 non-null   float64
 7   marital.status  1206 non-null   object 
 8   occupation      1205 non-null   object 
 9   relationship    1204 non-null   object 
 10  race            1205 non-null   object 
 11  sex             1205 non-null   object 
 12  capital.gain    1205 non-null   float64
 13  capital.loss    1205 non-null   float64
 14  hours.per.week  1206 non-null   int64  
 15  native.country  1205 non-null   object 
 16  label           1206 non-null   object 
dtypes: float64(4), int64(4), object(9)

What insights can you gather from the dataset summary provided by df.info()? Identify any key observations. Discuss your findings with your mentor and log your observations for each section below.

In [None]:
#@title

reflection = "" # @param {type:"string"}

### Exercise 3C

Next, we'll remove duplicates from our DataFrame. Duplicate entries can occur during data collection or processing and can introduce bias into your machine learning models. For example, if the same data point appears multiple times, it may disproportionately influence the model, leading to overfitting, where the model performs well on the training data but struggles to generalize to unseen data.

To identify and handle duplicates, we can use the powerful functions provided by the pandas library. The ```df.duplicated()``` function helps us find duplicate rows in the dataset, while the ```df.drop_duplicates()``` function allows us to remove them effectively.
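
The sketch below shows one way to count duplicates, remove them, and confirm they are gone. The DataFrame `people` is a hypothetical example. Note that `drop_duplicates()` returns a new DataFrame rather than changing the original, so the result has to be assigned.

```python
import pandas as pd

# Hypothetical data with one exact duplicate row (rows 0 and 2 are identical).
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cy"],
    "age":  [30, 41, 30, 25],
})

# duplicated() marks each repeat of an earlier row as True
n_duplicates = people.duplicated().sum()
print("duplicates before:", n_duplicates)

# drop_duplicates() returns a new DataFrame, so assign the result
people = people.drop_duplicates()

n_after = people.duplicated().sum()
print("duplicates after:", n_after)
print("rows remaining:", len(people))
```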

In [None]:
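df[df.duplicated()]        # not displayed: a cell shows only its last expression
df = df.drop_duplicates()  # returns a new DataFrame, so assign the result to keep it
df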

Unnamed: 0.1,Unnamed: 0,id,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,label
0,0,5199,65,Local-gov,254413.0,Some-college,10.0,Divorced,Exec-managerial,Not-in-family,White,Female,0.0,0.0,40,United-States,<=50K
1,1,2447,28,Private,331381.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,0.0,0.0,40,United-States,<=50K
2,2,4227,35,Private,255191.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,45,United-States,>50K
3,3,4093,45,Private,195554.0,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,45,United-States,>50K
4,4,2426,45,Self-emp-not-inc,40690.0,Some-college,10.0,Never-married,Farming-fishing,Own-child,White,Male,0.0,0.0,75,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1201,1201,2468,64,Self-emp-not-inc,339321.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,24,United-States,>50K
1202,1202,9745,38,Private,207202.0,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,48,United-States,>50K
1203,1203,9162,48,Private,107231.0,Prof-school,15.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,99999.0,0.0,50,United-States,>50K
1204,1204,1307,35,State-gov,193241.0,HS-grad,9.0,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,1651.0,40,United-States,<=50K


How many duplicates did you find in your dataframe? How can you verify that you've dropped them from ```df```?

In [None]:
#@title

reflection = "" # @param {type:"string"}

### Exercise 3D

Next, we'll address missing values in our DataFrame. Missing data can disrupt the training process, as models struggle to interpret and make predictions when certain values are absent. As data scientists, it's our responsibility to decide how best to handle these gaps.

There are two common approaches:

1.   **Imputation**: Replacing missing values with a calculated value, such as the mean, median, or mode of the column. This helps maintain the integrity of the dataset while ensuring no data points are lost.
2.   **Dropping Rows or Columns**: If you have a sufficiently large dataset or the missing values are concentrated in unimportant rows or columns, you can opt to drop them entirely.

To identify and handle missing values in your DataFrame, you can use the following pandas functions:

*   `df.isnull().sum()`: Identifies missing values in each column.
*   `df.dropna()`: Drops rows or columns with missing values.
*   `df.fillna(value)`: Fills missing values with a specified value (e.g., mean, median, or mode).

💡 *Hint*: You can search each function online to access the official documentation, where you'll find detailed usage information and examples.
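
Here is a minimal sketch contrasting the two approaches on a hypothetical DataFrame named `scores`. Using the median for the numeric column and the mode for the categorical column is one reasonable choice among several, not a rule.

```python
import pandas as pd
import numpy as np

# Hypothetical data; column names and values are made up.
scores = pd.DataFrame({
    "hours": [1.0, np.nan, 3.0, 4.0],
    "grade": ["A", "B", None, "B"],
})

print(scores.isnull().sum())  # missing values per column

# Option 1: imputation. Median for the numeric column, mode for the categorical one.
imputed = scores.copy()
imputed["hours"] = imputed["hours"].fillna(imputed["hours"].median())
imputed["grade"] = imputed["grade"].fillna(imputed["grade"].mode()[0])

# Option 2: drop every row that contains at least one missing value.
dropped = scores.dropna()

print(len(imputed), len(dropped))
```

Imputation keeps all four rows, while dropping keeps only the two complete ones, which is why dropping is usually reserved for datasets that can afford to lose rows.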

-------

Additionally, use column filtering to print out the rows that contain NaN. To clean and understand your dataset, it's often helpful to identify rows where specific columns contain missing values (`NaN`). In this step, you'll use **column filtering** to isolate these rows for inspection.

#### Steps to Follow

1. **Select the Column**:  
   Decide which column you'd like to check for `NaN` values.

2. **Apply Column Filtering**:  
   Use column filtering with the `.isna()` method to isolate rows where the selected column contains `NaN`.  
   
   **Hint**: The basic structure looks like `df[df['column_name'].isna()]`, where `'column_name'` is the name of the column you want to inspect.

#### General Column Filtering Tip

Beyond finding `NaN` values, you can also use column filtering to select rows based on other conditions. For example, `df[df['column_name'] == "condition"]` allows you to filter rows where a column matches a specific value.

Using column filtering is a powerful tool for exploring and preparing your data for analysis!
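
The filtering pattern described above can be sketched as follows. The DataFrame `pets` and its columns are hypothetical; the key idea is that the inner expression produces one True/False value per row, and the outer `[...]` keeps only the rows marked True.

```python
import pandas as pd
import numpy as np

# Hypothetical data; column names and values are made up.
pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird"],
    "weight":  [4.2, np.nan, 3.9, np.nan],
})

# A Series of True/False values, one per row
mask = pets["weight"].isna()

rows_with_nan = pets[mask]             # rows where weight is missing
cats = pets[pets["species"] == "cat"]  # rows matching a specific value

print(rows_with_nan)
print(cats)
```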

In [None]:
df.isnull().sum().sum()

np.int64(10)

In [None]:
df

Unnamed: 0.1,Unnamed: 0,id,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,label
0,0,5199,65,Local-gov,254413.0,Some-college,10.0,Divorced,Exec-managerial,Not-in-family,White,Female,0.0,0.0,40,United-States,<=50K
1,1,2447,28,Private,331381.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,0.0,0.0,40,United-States,<=50K
2,2,4227,35,Private,255191.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,45,United-States,>50K
3,3,4093,45,Private,195554.0,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,45,United-States,>50K
4,4,2426,45,Self-emp-not-inc,40690.0,Some-college,10.0,Never-married,Farming-fishing,Own-child,White,Male,0.0,0.0,75,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1201,1201,2468,64,Self-emp-not-inc,339321.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,24,United-States,>50K
1202,1202,9745,38,Private,207202.0,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,48,United-States,>50K
1203,1203,9162,48,Private,107231.0,Prof-school,15.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,99999.0,0.0,50,United-States,>50K
1204,1204,1307,35,State-gov,193241.0,HS-grad,9.0,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,1651.0,40,United-States,<=50K


In [None]:
# Drop every row that contains at least one NaN; dropna() returns a new DataFrame
df_clean = df.dropna()
df_clean

Unnamed: 0.1,Unnamed: 0,id,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,label
0,0,5199,65,Local-gov,254413.0,Some-college,10.0,Divorced,Exec-managerial,Not-in-family,White,Female,0.0,0.0,40,United-States,<=50K
1,1,2447,28,Private,331381.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,0.0,0.0,40,United-States,<=50K
2,2,4227,35,Private,255191.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,45,United-States,>50K
3,3,4093,45,Private,195554.0,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,45,United-States,>50K
4,4,2426,45,Self-emp-not-inc,40690.0,Some-college,10.0,Never-married,Farming-fishing,Own-child,White,Male,0.0,0.0,75,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1201,1201,2468,64,Self-emp-not-inc,339321.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,24,United-States,>50K
1202,1202,9745,38,Private,207202.0,HS-grad,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,48,United-States,>50K
1203,1203,9162,48,Private,107231.0,Prof-school,15.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,99999.0,0.0,50,United-States,>50K
1204,1204,1307,35,State-gov,193241.0,HS-grad,9.0,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,1651.0,40,United-States,<=50K


How many NaN values did you find in your dataset? Did you decide to replace the values or drop them? Why?

In [None]:
#@title

reflection = "10" # @param {type:"string"}

### Exercise 3E

The next step in our data preprocessing pipeline is to ensure that all categorical columns (those containing strings instead of numbers) are properly converted into numerical representations. This transformation is essential because machine learning models work with numbers and cannot directly process categorical data. The process of converting categorical variables into numerical ones is called **encoding**.  

The most common approach to encoding categorical variables is to assign each unique value in a column to a corresponding number. However, depending on the nature of your data, there are different encoding techniques you can use:  

**Label Encoding**  
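Label Encoding assigns each unique category an integer. This method is simple, but scikit-learn's `LabelEncoder` numbers the categories in sorted (alphabetical) order, so "High", "Low", and "Medium" become 0, 1, and 2 regardless of their meaning. It is intended mainly for encoding the target label; for features with a natural order (e.g., "Low", "Medium", "High"), use an explicit ordinal mapping instead.  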

```python
from sklearn.preprocessing import LabelEncoder

# Apply Label Encoding
label_encoder = LabelEncoder()
df['encoded_column'] = label_encoder.fit_transform(df['categorical_column'])
```

**One-Hot Encoding**  
One-Hot Encoding creates a binary column for each unique category in the original column, assigning a value of `1` if the category is present in the row and `0` otherwise. This method is particularly useful for unordered categorical variables.  

```python
# Apply One-Hot Encoding using pandas
df = pd.get_dummies(df, columns=['categorical_column'], prefix='category')
```

**Ordinal Encoding**  
For categorical variables with a clear and meaningful order (e.g., education levels like "High School", "Bachelor's", "Master's", "PhD"), you can manually assign numerical values to reflect their rank. This approach ensures that the order is preserved in the encoded data, which can be particularly useful for models that benefit from this structure.

```python
# Define ordinal mapping
education_mapping = {'High School': 1, "Bachelor's": 2, "Master's": 3, 'PhD': 4}
df['encoded_column'] = df['categorical_column'].map(education_mapping)
```

This method works well when the order of categories has a logical or meaningful impact on the model's predictions. For example, encoding education levels or survey ratings (e.g., "Poor", "Average", "Excellent") as ordinal values preserves their inherent ranking and ensures that the model can interpret the relationship between them.  

However, it's important to use ordinal encoding only when the categories have a natural order. If the data does not have a meaningful ranking, this approach may introduce unintended bias into your model.  
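
To see why the choice of encoder matters for ordered categories, the sketch below encodes the same hypothetical `size` column twice: once with `LabelEncoder`, which numbers categories in sorted (alphabetical) order, and once with an explicit mapping. The DataFrame `sizes` and the mapping `size_order` are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordered categories
sizes = pd.DataFrame({"size": ["Low", "Medium", "High", "Low"]})

# LabelEncoder numbers the categories in sorted (alphabetical) order:
# High, Low, Medium
sizes["label_encoded"] = LabelEncoder().fit_transform(sizes["size"])

# An explicit mapping keeps the intended order Low < Medium < High
size_order = {"Low": 0, "Medium": 1, "High": 2}
sizes["ordinal_encoded"] = sizes["size"].map(size_order)

print(sizes)
```

In the `label_encoded` column "High" receives the smallest number, which does not reflect the real ranking; the `ordinal_encoded` column does.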

### Things to Keep in Mind  
- **Dimensionality Considerations:** One-hot encoding can lead to an increase in dataset size if there are many unique categories, so choose your encoding method carefully based on the dataset's size and complexity.  
- **Logical Relationships:** Ordinal encoding works best when there is a clear and interpretable relationship between the categories. Avoid using it for unordered data.  
- **Efficiency:** Libraries like `pandas` and `scikit-learn` make it easy to implement these encoding techniques, but always validate the results to ensure they align with your data's context.  

By encoding your categorical variables thoughtfully, you ensure that your data is ready for machine learning models to process effectively. Take a moment to evaluate which encoding method best suits your categorical columns, and proceed to implement it in your DataFrame!
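
As a quick check on the dimensionality point, this sketch one-hot encodes a hypothetical `city` column and compares the DataFrame's shape before and after. The DataFrame `trips` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical column with 4 unique categories
trips = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago", "Denver", "Austin"],
    "fare": [12.0, 9.5, 14.0, 11.0, 13.5],
})

encoded = pd.get_dummies(trips, columns=["city"], prefix="city")

print(trips.shape)            # before encoding
print(encoded.shape)          # after encoding: one new column per unique city
print(list(encoded.columns))
```

One column with four unique values becomes four columns, so a column with hundreds of unique values would add hundreds of columns.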




How many columns did you have to encode? What strategies did you use for each of them and why?

In [None]:
#@title

reflection = "" # @param {type:"string"}

## Part IV: Data Visualization

<img src="https://www.causeweb.org/cause/sites/default/files/resources/fun/cartoons/Art_of_Visualization.jpg" height=600>

While Python and its libraries allow us to analyze and output data in powerful ways, it‚Äôs often true that **a picture speaks a thousand words**. Data visualization can reveal patterns, trends, and insights that might not be immediately obvious through raw numbers alone. Effective visuals add tremendous value to research papers, presentations for stakeholders, and project reports by making complex data more accessible and engaging.

### Why Visualize Your Data?

Visualizing your data can help you:

- **Identify trends and patterns**: Charts and graphs can highlight relationships within your data, allowing you to uncover insights that may go unnoticed in tabular formats.

- **Communicate findings clearly**: Visuals provide a straightforward way to convey key points, making it easier for others to understand and act on your analysis.

- **Support decision-making**: Well-designed visualizations can help stakeholders make informed decisions by presenting data in a way that is both informative and persuasive.

### Exercise 4A

We'll be using **Matplotlib** to visualize our columns and find a potential outlier.

```python
# Create a boxplot to identify outliers
plt.figure(figsize=(8, 6))
plt.boxplot(df['Column_A'], vert=False, patch_artist=True, boxprops=dict(facecolor='lightblue'))
plt.title('Boxplot of Column_A')
plt.xlabel('Values')
plt.show()

# Create a histogram to visualize the distribution
plt.figure(figsize=(8, 6))
plt.hist(df['Column_A'], bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram of Column_A')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
```

Take some time to explore customization options for your plot! Use Google and consult with your mentor to learn how to adjust colors, bar sizes, and other visual elements to create a unique, personalized plot. The [Matplotlib Bar Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) is a great resource to get started with customization ideas.

Once you've designed your custom plot, save it using the following code. Call it before `plt.show()` in the same cell; otherwise the saved image may be blank, because the figure is cleared once it has been shown. You can then download the file from the folder icon in the left-hand menu and upload it to your Slack Channel!

```python
plt.savefig("YourName.jpg")
```


Now that you've found the column with your outlier, use what you've learned to drop the row that contains that value!
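
If you want a starting point, here is a sketch of one common way to flag and drop outlier rows: the 1.5 × IQR rule, which is also what Matplotlib's boxplot uses by default to place its whiskers. The DataFrame `data` and its values are hypothetical, and `Column_A` is the same placeholder name used in the plotting code above.

```python
import pandas as pd

# Hypothetical column with one extreme value (500)
data = pd.DataFrame({"Column_A": [10, 12, 11, 13, 12, 500, 11, 10]})

# Interquartile range: the spread of the middle 50% of the values
q1 = data["Column_A"].quantile(0.25)
q3 = data["Column_A"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (data["Column_A"] < lower) | (data["Column_A"] > upper)
print(data[is_outlier])   # inspect the flagged rows before removing them

data = data[~is_outlier]  # keep only the rows inside the range
print(len(data))
```

Always look at the flagged rows before dropping them: an extreme value can be a data-entry error, but it can also be a genuine observation worth keeping.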

In [None]:
### YOUR CODE HERE

Take a few minutes to document some of the changes you made to really make the plot your own! Where was the outlier in your dataframe? What steps did you take to remove it?

In [None]:
#@title

reflection = "" # @param {type:"string"}

## PART V: Day 1 Homework

1. Collaborate with your mentor to find a compelling new dataset where you can apply your recently acquired skills! Be sure to check out our [Dataset Selection Task Sheet](https://docs.google.com/document/d/15U87xJS0nhOqtFTT1i-foW-D-ukiEs63SdZuo7xrAiM/edit?usp=sharing) to stay organized and focused throughout the process!

2. Once you've identified an exciting dataset that sparks your curiosity, set up a new Google Colab notebook and refer to our [Loading Your Dataset](https://docs.google.com/document/d/1XAyU-Mu30AnZsN8DoZ4ghoikuM-JULfhHOw-bZdc0LI/edit?usp=drive_link) tutorial for a step-by-step guide to getting started.

3. Dive into coding! Begin experimenting with the techniques you learned today and see how far you can push your understanding. Keep track of any questions and don't hesitate to post them in your Slack channel or attend our Open Labs for additional guidance and support.

4. Reflect on your experience. What concepts clicked for you? What challenges did you face? Document your insights to solidify your learning and identify areas for improvement. Drop this reflection in your Slack channel once you're done!

## 🎁 Wrapping Up Day 1

Great work! You've expertly selected a project, thoroughly explored its data, cleaned it up, and even added your unique customizations. Next up, we will see how to build and evaluate your machine learning model!