# Python ML Assignment: Exploring Toy Datasets

**Course:** Python ML for Middle School Students  
**Assignment Points:** 25 points  


---

## 📋 Assignment Overview

In this assignment, you will:
1. Learn how to load scikit-learn toy datasets
2. Convert them into pandas DataFrames
3. Perform Exploratory Data Analysis (EDA)
4. Answer questions about the data

### 📚 Reference Material
You should refer to the **Exploratory Data Analysis notebook**:  
https://github.com/sjasthi/Python-DS-Data-Science/blob/main/pandas/2_pandas_Exploratory_Data_Analysis.ipynb

---

## 🎯 Learning Objectives

By completing this assignment, you will be able to:
- ✅ Load scikit-learn toy datasets
- ✅ Convert sklearn datasets to pandas DataFrames
- ✅ Use pandas methods for data exploration
- ✅ Create visualizations to understand data
- ✅ Draw conclusions from data analysis

---

## Part 1: Setup and Example (5 points)

### Task 1.1: Import Required Libraries (1 point)

In [None]:
# TODO: Import the following libraries
# - pandas as pd
# - numpy as np
# - matplotlib.pyplot as plt
# - seaborn as sns
# - datasets from sklearn

# Your code here:



# Set display options
pd.set_option('display.max_columns', None)
%matplotlib inline

print("✓ Libraries imported successfully!")

### Task 1.2: Example - Loading Iris Dataset (2 points)

**I'll show you how to do this first, then you'll do it yourself!**

In [None]:
# EXAMPLE: How to load a toy dataset and convert to DataFrame

# Step 1: Load the dataset
iris = datasets.load_iris()
print("Data type of iris:", type(iris))

# Step 2: Understand the dataset structure
print("📦 Dataset Components:")
print("   - data: Features (measurements)")
print("   - target: Labels (species)")
print("   - feature_names: Column names for features")
print("   - target_names: Names of species")
print("   - DESCR: Dataset description\n")

# Step 3: Create DataFrame from features
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
print("Data type of df_iris:", type(df_iris))

# Step 4: Add the target column
df_iris['species'] = iris.target

# Step 5: Replace numeric targets with actual names
df_iris['species_name'] = df_iris['species'].map({
    0: iris.target_names[0],
    1: iris.target_names[1],
    2: iris.target_names[2]
})

print("✅ Iris Dataset loaded successfully!")
print(f"\nDataFrame shape: {df_iris.shape}")
print(f"Columns: {list(df_iris.columns)}")

In [None]:
# Display the first few rows
print("🔍 First 5 rows of Iris dataset:\n")
display(df_iris.head())

print("\n📊 Basic Information:")
df_iris.info()
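
Step 5 in the example writes one dictionary entry per class by hand. The sketch below shows a shorter way to build the same kind of lookup, using a small made-up DataFrame. The `names` list and the `label` and `weight_kg` columns are invented for illustration; they are not part of any scikit-learn dataset.

```python
import pandas as pd

# Hypothetical class names, standing in for something like iris.target_names
names = ["cat", "dog", "bird"]

# A tiny made-up DataFrame whose 'label' column holds the codes 0, 1, 2
df = pd.DataFrame({
    "weight_kg": [4.0, 20.5, 0.1, 5.2],
    "label": [0, 1, 2, 0],
})

# enumerate() pairs each name with its position: {0: 'cat', 1: 'dog', 2: 'bird'}
code_to_name = dict(enumerate(names))

# .map() looks up every label in the dictionary and returns the matching name
df["label_name"] = df["label"].map(code_to_name)

print(df)
```

The result is the same as typing out the dictionary yourself, but it keeps working if the list of names gets longer.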

### Task 1.3: Read the Dataset Description (2 points)

Every toy dataset comes with a detailed description. Let's read it!

In [None]:
# TODO: Print the dataset description
# Hint: Use iris.DESCR

# Your code here:
print(iris.DESCR)

**Question 1.3:** After reading the description, answer these questions:

1. How many samples (rows) are in the Iris dataset? **[Your answer]**
2. How many features (measurements) are there? **[Your answer]**
3. What are the three species of Iris flowers? **[Your answer]**
4. What do the four features measure? **[Your answer]**

---

## Part 2: Wine Dataset Exploration (10 points)

Now it's your turn! You will load and explore the **Wine dataset**.

### Task 2.1: Load the Wine Dataset (3 points)

In [None]:
# TODO: Complete the following steps

# Step 1: Load the wine dataset
# Hint: wine = datasets.load_wine()

# Your code here:


# Step 2: Create a DataFrame with the features
# Hint: Use pd.DataFrame() with wine.data and wine.feature_names

# Your code here:
df_wine = None  # Replace with your code


# Step 3: Add the target column
# Hint: df_wine['target'] = wine.target

# Your code here:


# Step 4: Add target names
# Hint: Map 0, 1, 2 to wine.target_names

# Your code here:


print("✅ Wine dataset loaded!")
print(f"Shape: {df_wine.shape}")

### Task 2.2: Basic Exploration (4 points)

Use pandas methods to explore the Wine dataset.

In [None]:
# TODO: Display the first 10 rows
# Your code here:



In [None]:
# TODO: Display the last 5 rows
# Your code here:



In [None]:
# TODO: Display 5 random rows
# Hint: Use .sample()
# Your code here:



In [None]:
# TODO: Display basic information about the dataset
# Hint: Use .info()
# Your code here:



In [None]:
# TODO: Display statistical summary
# Hint: Use .describe()
# Your code here:



In [None]:
# TODO: Check for missing values
# Hint: Use .isna().sum()
# Your code here:



### Task 2.3: Answer Questions (3 points)

**Question 2.3a:** How many wine samples are in the dataset? **[Your answer]**

**Question 2.3b:** How many features (chemical measurements) are there? **[Your answer]**

**Question 2.3c:** What is the average (mean) alcohol content? **[Your answer]**

**Question 2.3d:** Are there any missing values in the dataset? **[Your answer]**

**Question 2.3e:** What is the maximum value of 'proline'? **[Your answer]**

---

## Part 3: Breast Cancer Dataset Analysis (10 points)

Now explore the **Breast Cancer** dataset on your own!

### Task 3.1: Load and Prepare the Dataset (3 points)

In [None]:
# TODO: Load the breast cancer dataset
# Create a DataFrame with feature names
# Add target column (0 = malignant, 1 = benign)
# Add target_name column with actual names

# Your code here:




print("✅ Breast Cancer dataset loaded!")

### Task 3.2: Comprehensive EDA (4 points)

Perform a complete exploratory data analysis:

In [None]:
# TODO: Display dataset shape
# Your code here:



In [None]:
# TODO: Display first 5 rows
# Your code here:



In [None]:
# TODO: Display column names
# Hint: Use .columns
# Your code here:



In [None]:
# TODO: Display data types of all columns
# Hint: Use .dtypes
# Your code here:



In [None]:
# TODO: Display statistical summary
# Your code here:



In [None]:
# TODO: Count how many samples are in each class (malignant vs benign)
# Hint: Use .value_counts() on the target_name column
# Your code here:



### Task 3.3: Create Visualizations (3 points)

In [None]:
# TODO: Create a bar plot showing the count of each class
# Hint: Use df['target_name'].value_counts().plot(kind='bar')

# Your code here:



plt.title('Distribution of Breast Cancer Classes')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

In [None]:
# TODO: Create a histogram of 'mean radius'
# Hint: Use df['mean radius'].hist()

# Your code here:



plt.title('Distribution of Mean Radius')
plt.xlabel('Mean Radius')
plt.ylabel('Frequency')
plt.show()

In [None]:
# TODO: Create a scatter plot of 'mean radius' vs 'mean texture'
# Color the points by target (malignant vs benign)
# Hint: Use plt.scatter() with c=df['target']

# Your code here:



plt.title('Mean Radius vs Mean Texture')
plt.xlabel('Mean Radius')
plt.ylabel('Mean Texture')
plt.colorbar(label='Class (0=Malignant, 1=Benign)')
plt.show()

---

## 📊 Final Questions and Reflection

### Breast Cancer Dataset Questions:

**Question 3.1:** How many total samples are in the breast cancer dataset? **[Your answer]**

**Question 3.2:** How many features are there? **[Your answer]**

**Question 3.3:** Which class has more samples - malignant or benign? **[Your answer]**

**Question 3.4:** What is the mean value of 'mean radius'? **[Your answer]**

**Question 3.5:** Looking at your scatter plot, can you see any pattern that might help distinguish between malignant and benign cases? Explain in 2-3 sentences.

**[Your answer here]**

---

## 📝 Reflection (Required)

**Question R.1:** What was the most interesting thing you learned from exploring these datasets?

**[Your answer here]**

**Question R.2:** Which dataset did you find most interesting and why?

**[Your answer here]**

**Question R.3:** How do you think machine learning could be used with the Breast Cancer dataset in real life?

**[Your answer here]**

---

## 📤 Submission Instructions

1. ✅ Make sure all code cells run without errors
2. ✅ Answer all questions in the markdown cells
3. ✅ Include your name at the top of the notebook
4. ✅ Save your notebook as: `YourLastName_ML_ToyDatasets.ipynb`
5. ✅ Submit through [instructor's preferred method]

---

## 📊 Grading Rubric

| Section | Points | Description |
|---------|--------|-------------|
| Part 1: Setup & Example | 5 | Imports, understanding example, reading description |
| Part 2: Wine Dataset | 10 | Loading, exploration, and questions |
| Part 3: Breast Cancer | 10 | Loading, EDA, visualizations, questions |
| **Total** | **25** | |

### Grading Criteria:
- **Code Quality:** Code runs without errors, proper use of pandas methods
- **Completeness:** All tasks attempted and completed
- **Understanding:** Correct answers to questions demonstrate understanding
- **Visualizations:** Clear, properly labeled charts
- **Reflection:** Thoughtful responses showing engagement with material

---

## 💡 Helpful Resources

- **Pandas Documentation:** https://pandas.pydata.org/docs/
- **Scikit-learn Toy Datasets:** https://scikit-learn.org/stable/datasets/toy_dataset.html
- **Your EDA Reference:** https://github.com/sjasthi/Python-DS-Data-Science/blob/main/pandas/2_pandas_Exploratory_Data_Analysis.ipynb

### Common Pandas Methods You'll Need:
```python
df.head()           # First 5 rows
df.tail()           # Last 5 rows
df.sample(n)        # Random n rows
df.info()           # Data types and non-null counts
df.describe()       # Statistical summary
df.shape            # (rows, columns)
df.columns          # Column names
df.dtypes           # Data types
df.isna().sum()     # Count missing values
df['col'].value_counts()  # Count how many times each unique value appears
```
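
To see what a few of these methods return, here is a minimal sketch on a tiny made-up DataFrame. The city names and temperatures are invented for illustration, and one reading is left missing on purpose.

```python
import numpy as np
import pandas as pd

# Made-up data: one temperature reading is missing (NaN)
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Cairo"],
    "temp_c": [3.5, 19.0, np.nan, 30.2],
})

print(df.shape)                   # (rows, columns)
print(df.isna().sum())            # number of missing values in each column
print(df["city"].value_counts())  # how many times each city appears
print(df["temp_c"].mean())        # mean() skips missing values by default
```

Try changing the values or adding a row, then run the cell again to see how each result changes.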

---

**Good luck! Remember: Data exploration is like being a detective - you're looking for clues in the data! 🔍**