# Guided Exercise: Iris Flower Classification

## Introduction

Welcome to this guided exercise on building a machine learning classifier for the famous Iris dataset! In this exercise, you'll work through the complete machine learning pipeline:

1. **Data Loading & Exploration** - Understanding your data
2. **Data Preprocessing** - Preparing data for modeling
3. **Model Building** - Training classification algorithms
4. **Model Evaluation** - Assessing performance
5. **Model Improvement** - Hyperparameter tuning

### Learning Objectives
By the end of this exercise, you will be able to:
- Load and explore datasets using scikit-learn
- Perform exploratory data analysis (EDA)
- Preprocess data for machine learning
- Train multiple classification models
- Evaluate model performance using appropriate metrics
- Compare different algorithms and select the best one

### The Iris Dataset
The Iris dataset contains measurements of 150 iris flowers from three different species:
- **Iris Setosa**
- **Iris Versicolor**
- **Iris Virginica**

For each flower, we have four measurements (features):
- Sepal length
- Sepal width
- Petal length
- Petal width

### Instructions
- Read each section carefully
- Complete the code in the cells marked with `# TODO:`
- Answer the questions in the markdown cells
- Don't worry if you get stuck - think about what you've learned in the course!
- Compare your results with classmates after completing the exercise

## Section 1: Data Loading and Initial Exploration

### 1.1 Loading the Iris Dataset

**Task:** Load the Iris dataset using scikit-learn's `load_iris()` function.

**Instructions:**
1. Import the `load_iris` function from `sklearn.datasets`
2. Load the iris dataset
3. Print some basic information about the dataset (shape, feature names, target names)

**Questions to think about:**
- How many samples are in the dataset?
- How many features does each sample have?
- What are the three classes (species) we're trying to predict?

In [None]:
# TODO: Import the load_iris function from sklearn.datasets

# TODO: Load the iris dataset and assign it to a variable called 'iris'

# TODO: Print the shape of the data (number of samples, number of features)

# TODO: Print the feature names

# TODO: Print the target names (class names)

# TODO: Print the first 5 samples of the data and their corresponding labels
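
If you get stuck, the following is one possible sketch of this step rather than the only valid answer. It assumes a standard scikit-learn installation; the loop that prints each sample alongside its species name is an illustrative choice, not a requirement of the task.

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch: a dict-like object with attribute access
iris = load_iris()

print(iris.data.shape)      # (number of samples, number of features)
print(iris.feature_names)   # names of the four measurements
print(iris.target_names)    # names of the three species

# First five samples with their integer label and the matching species name
for row, label in zip(iris.data[:5], iris.target[:5]):
    print(row, label, iris.target_names[label])
```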

### 1.2 Basic Dataset Information

**Task:** Extract the features (X) and target labels (y) from the dataset.

**Questions to think about:**
- What is the difference between features and labels in a supervised learning problem?
- Why do we separate X (features) from y (target)?

In [None]:
# TODO: Extract features (X) and target labels (y) from the iris dataset
# X should contain the measurements, y should contain the species labels

# TODO: Print the shape of X and y

# TODO: Print a few examples of X and corresponding y values

## Section 2: Exploratory Data Analysis (EDA)

### 2.1 Statistical Summary

**Task:** Create a pandas DataFrame from the iris data and examine basic statistics.

**Questions to think about:**
- What can you learn from the mean and standard deviation of each feature?
- Are there any features that might be more useful for classification than others?

In [None]:
# TODO: Import pandas and create a DataFrame from the iris data
# Include both features and target labels in the DataFrame

# TODO: Display the first few rows of the DataFrame

# TODO: Use .describe() to get statistical summary of the features

# TODO: Check for any missing values in the dataset
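
A minimal sketch of how the DataFrame could be built is shown here. The column name `species` is an assumption made for illustration; any label column name works.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# One column per feature, plus a readable species column (name is our choice)
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

print(df.head())          # first few rows
print(df.describe())      # count, mean, std, min, quartiles, max per feature
print(df.isnull().sum())  # number of missing values per column
```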

### 2.2 Data Visualization

**Task:** Create visualizations to understand the relationships in the data.

**Questions to think about:**
- Which features seem to be most correlated?
- Can you see natural groupings of the different species?
- Which visualization techniques would be most appropriate here?

In [None]:
# TODO: Import matplotlib.pyplot and seaborn for visualization

# TODO: Create a pairplot to visualize relationships between all features
# Color the points by species to see how they cluster

# TODO: Create a correlation heatmap to see feature correlations

# TODO: Create boxplots for each feature grouped by species

### 2.3 Class Distribution

**Task:** Check if the classes are balanced in the dataset.

**Questions to think about:**
- Are the classes evenly distributed?
- What problems might arise if classes are imbalanced?
- How would you handle imbalanced classes?

In [None]:
# TODO: Count the number of samples in each class

# TODO: Create a bar plot showing the distribution of classes

# TODO: Calculate the percentage of each class
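
One way to count the classes without plotting is sketched below. It uses `np.bincount`, which works here because the labels are small non-negative integers; `pandas.Series.value_counts()` would be an equally reasonable choice.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# bincount returns how many times each integer label (0, 1, 2) occurs
counts = np.bincount(iris.target)
percentages = 100 * counts / counts.sum()

for name, count, pct in zip(iris.target_names, counts, percentages):
    print(f"{name}: {count} samples ({pct:.1f}%)")
```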

## Section 3: Data Preprocessing

### 3.1 Train-Test Split

**Task:** Split your data into training and testing sets.

**Questions to think about:**
- Why do we need to split the data?
- What happens if we train and test on the same data?
- What is a reasonable ratio for the train/test split?

In [None]:
# TODO: Import train_test_split from sklearn.model_selection

# TODO: Split the data into training and testing sets (80% train, 20% test)
# Use random_state=42 for reproducibility

# TODO: Print the shapes of X_train, X_test, y_train, y_test
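
The split could look roughly like the sketch below. The `stratify=y` argument is an optional addition not required by the task: it keeps the class proportions the same in both sets, which can help on small datasets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% train / 20% test; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify is optional
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```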

### 3.2 Feature Scaling

**Task:** Standardize the features so they're on the same scale.

**Questions to think about:**
- Why might feature scaling be important for some algorithms?
- Which algorithms are sensitive to feature scales?
- When might you not need to scale features?

In [None]:
# TODO: Import StandardScaler from sklearn.preprocessing

# TODO: Create a StandardScaler object

# TODO: Fit the scaler on the training data and transform both train and test sets

# TODO: Print the mean and standard deviation of the scaled features
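
The key point of this step is that the scaler learns its statistics from the training set only and then reuses them on the test set. A minimal sketch, with variable names such as `X_train_scaled` chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
# fit_transform learns the mean and std from the training data, then scales it
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training statistics; the test set is never used for fitting
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_train_scaled.std(axis=0))   # approximately 1 for every feature
print(X_test_scaled.mean(axis=0))   # close to 0, but usually not exactly 0
```

The test-set means are typically not exactly zero, which is expected: they are scaled with the training statistics, not their own.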

## Section 4: Model Building

### 4.1 K-Nearest Neighbors (KNN)

**Task:** Train a KNN classifier on the iris data.

**Questions to think about:**
- How does the KNN algorithm work?
- What is the effect of choosing different values of k?
- What are the advantages and disadvantages of KNN?

In [None]:
# TODO: Import KNeighborsClassifier from sklearn.neighbors

# TODO: Create a KNN classifier with k=3 (n_neighbors=3)

# TODO: Train (fit) the classifier on the training data

# TODO: Make predictions on the test data

# TODO: Print the first 10 predictions and actual labels for comparison
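
As a hedged sketch of the fit/predict workflow, the block below repeats the split and scaling steps so that it runs on its own. In scikit-learn, k is set through the `n_neighbors` parameter; the variable name `y_pred_knn` is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# k = 3: each prediction is a majority vote among the 3 closest training points
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)

print("Predicted:", y_pred_knn[:10])
print("Actual:   ", y_test[:10])
```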

### 4.2 Logistic Regression

**Task:** Train a Logistic Regression classifier.

**Questions to think about:**
- How does logistic regression work for multi-class problems?
- What assumptions does logistic regression make?
- When would you choose logistic regression over other algorithms?

In [None]:
# TODO: Import LogisticRegression from sklearn.linear_model

# TODO: Create a LogisticRegression classifier

# TODO: Train the classifier on the training data

# TODO: Make predictions on the test data

### 4.3 Support Vector Machine (SVM)

**Task:** Train an SVM classifier.

**Questions to think about:**
- What is the basic idea behind SVM?
- What is the kernel trick and when is it useful?
- How does SVM handle non-linearly separable data?

In [None]:
# TODO: Import SVC from sklearn.svm

# TODO: Create an SVM classifier with RBF kernel

# TODO: Train the classifier on the training data

# TODO: Make predictions on the test data

### 4.4 Decision Tree

**Task:** Train a Decision Tree classifier.

**Questions to think about:**
- How does a decision tree make predictions?
- What are the advantages of decision trees?
- What problems can decision trees suffer from?

In [None]:
# TODO: Import DecisionTreeClassifier from sklearn.tree

# TODO: Create a DecisionTreeClassifier

# TODO: Train the classifier on the training data

# TODO: Make predictions on the test data

### 4.5 Random Forest

**Task:** Train a Random Forest classifier.

**Questions to think about:**
- How is Random Forest different from a single Decision Tree?
- What is bagging and how does it help?
- Why might Random Forest be better than individual trees?

In [None]:
# TODO: Import RandomForestClassifier from sklearn.ensemble

# TODO: Create a RandomForestClassifier

# TODO: Train the classifier on the training data

# TODO: Make predictions on the test data
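
All five classifiers in this section share the same `fit`/`predict` interface, so they can also be trained in a loop. The sketch below is one possible arrangement; the dictionary keys, `max_iter=1000` (raised to avoid convergence warnings), and `random_state=42` are assumptions rather than required settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Every scikit-learn classifier exposes the same fit/predict methods
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (RBF)": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    predictions[name] = model.predict(X_test_scaled)
    print(name, predictions[name][:10])
```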

## Section 5: Model Evaluation

### 5.1 Accuracy Score

**Task:** Calculate the accuracy of each model.

**Questions to think about:**
- What does accuracy tell us?
- When might accuracy be misleading?
- How is accuracy calculated?

In [None]:
# TODO: Import accuracy_score from sklearn.metrics

# TODO: Calculate accuracy for each of the 5 models you trained
# Store the results in variables or a dictionary

# TODO: Print the accuracy scores for all models

# TODO: Which model performed best? Which performed worst?
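
To see how accuracy is calculated, it can help to compute it by hand on a tiny example first. The label arrays below are made up for illustration and are not Iris results.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, invented for this illustration
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 1])

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy)    # 0.75 (6 of 8 predictions are correct)
print(library_accuracy)   # 0.75
```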

### 5.2 Confusion Matrix

**Task:** Create confusion matrices for each model.

**Questions to think about:**
- What does each cell in a confusion matrix represent?
- How can you tell which classes are being confused?
- What are precision and recall?

In [None]:
# TODO: Import confusion_matrix and classification_report from sklearn.metrics

# TODO: Create confusion matrices for each model

# TODO: Display the confusion matrices (you can use print or visualization)

# TODO: Look at which classes are most often confused with each other
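
A small worked example can make the layout of a confusion matrix concrete. In scikit-learn, rows correspond to the true class and columns to the predicted class. The labels below are hypothetical, not Iris results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for this illustration
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 1])

# cm[i, j] = number of samples whose true class is i and predicted class is j
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 0 0]
#  [0 2 1]
#  [0 1 2]]
```

The off-diagonal entries show that class 1 and class 2 are confused with each other once in each direction, while class 0 is always predicted correctly.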

### 5.3 Classification Report

**Task:** Generate detailed classification reports.

**Questions to think about:**
- What do precision, recall, and F1-score tell you?
- When might you care more about precision vs recall?
- How is F1-score calculated?

In [None]:
# TODO: Generate classification reports for each model

# TODO: Print the reports and compare them

# TODO: Which model has the best overall performance?

# TODO: Are there any classes that are harder to predict?
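
Precision, recall, and F1-score can all be derived from the confusion matrix. The sketch below computes them by hand on hypothetical labels and compares the result with scikit-learn's `precision_recall_fscore_support`; it is meant as an illustration of the formulas, not as a replacement for `classification_report`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels, invented for this illustration
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 1])

cm = confusion_matrix(y_true, y_pred)
true_positives = np.diag(cm)

# Precision: of everything predicted as class c, how much really is class c?
precision = true_positives / cm.sum(axis=0)   # column sums = predicted counts
# Recall: of everything that really is class c, how much did we find?
recall = true_positives / cm.sum(axis=1)      # row sums = actual counts
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

sk_precision, sk_recall, sk_f1, support = precision_recall_fscore_support(
    y_true, y_pred
)

print(precision, recall, f1)
print(sk_precision, sk_recall, sk_f1)
```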

### 5.4 Cross-Validation

**Task:** Use cross-validation to get more reliable performance estimates.

**Questions to think about:**
- Why is cross-validation better than a single train-test split?
- What does k-fold cross-validation do?
- How many folds should you typically use?

In [None]:
# TODO: Import cross_val_score from sklearn.model_selection

# TODO: Perform 5-fold cross-validation on each of your models

# TODO: Print the cross-validation scores (mean and standard deviation)

# TODO: How do the cross-validation results compare to the single train-test split?
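
One possible sketch of 5-fold cross-validation for a single model is shown below. Wrapping the scaler and classifier in a pipeline is an assumption beyond what the task asks for: it ensures the scaler is refit on the training folds only, so no information from the validation fold leaks into preprocessing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline refits the scaler inside every fold
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# cv=5 yields five accuracy scores, one per held-out fold
scores = cross_val_score(pipeline, X, y, cv=5)

print(scores)
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```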

## Section 6: Model Improvement

### 6.1 Hyperparameter Tuning with Grid Search

**Task:** Use GridSearchCV to find the best hyperparameters for one of your models.

**Questions to think about:**
- What is the difference between hyperparameters and parameters?
- Why is it important to tune hyperparameters?
- What is the difference between grid search and random search?

In [None]:
# TODO: Import GridSearchCV from sklearn.model_selection

# TODO: Choose one model (e.g., Random Forest or SVM) to tune

# TODO: Define a parameter grid to search over

# TODO: Create a GridSearchCV object

# TODO: Fit the grid search on the training data

# TODO: Print the best parameters and best score

# TODO: Evaluate the tuned model on the test set
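
As a hedged illustration, a grid search over an RBF SVM might look like the sketch below. The grid values are arbitrary starting points rather than recommended settings, and the `svc__` prefix comes from the step name that `make_pipeline` assigns automatically.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative grid: every combination of C and gamma is tried (3 x 3 = 9)
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1, 1],
}

# Each combination is scored with 5-fold cross-validation on the training set
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)
# The test set is touched only once, after tuning is finished
print("Test accuracy:", grid.score(X_test, y_test))
```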

### 6.2 Feature Importance

**Task:** Analyze which features are most important for prediction.

**Questions to think about:**
- How can you determine feature importance?
- Which features seem most important for iris classification?
- Could you use fewer features and still get good performance?

In [None]:
# TODO: For models that provide feature importance (like Random Forest), extract and visualize it

# TODO: Create a bar plot of feature importances

# TODO: Which features are most/least important?
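
For tree-based models, scikit-learn exposes impurity-based importances through the `feature_importances_` attribute. A minimal sketch without plotting, assuming `n_estimators=100` and `random_state=42` as illustrative settings:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(iris.data, iris.target)

# One importance value per feature; the values sum to 1
importances = rf.feature_importances_

# Print features from most to least important
ranked = sorted(zip(iris.feature_names, importances), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances are only one way to measure importance; `sklearn.inspection.permutation_importance` is an alternative worth comparing against.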

## Section 7: Final Comparison and Conclusions

### 7.1 Model Comparison Summary

**Task:** Create a summary comparing all your models.

**Questions to think about:**
- Which model would you recommend for this problem?
- What are the trade-offs between different models?
- How might your choice change if this were a different problem?

In [None]:
# TODO: Create a summary table or visualization comparing all models
# Include metrics like accuracy, precision, recall, F1-score

# TODO: Write your conclusions about which model performs best and why

### 7.2 What You Learned

**Reflection Questions:**
1. What was the most challenging part of this exercise?
2. What surprised you about the results?
3. How might you apply what you learned to a different dataset?
4. What would you do differently if you were to repeat this analysis?

### 7.3 Next Steps

**Advanced Topics to Explore:**
- Try other classification algorithms (Naive Bayes, Gradient Boosting)
- Experiment with different preprocessing techniques
- Try the exercise on a different dataset
- Learn about ensemble methods and stacking
- Explore techniques for handling imbalanced datasets

## Congratulations!

You've completed a comprehensive machine learning exercise covering the pipeline from data exploration through model evaluation, tuning, and comparison. Take some time to reflect on what you've learned and how you can apply these techniques to your own projects.

**Remember:** Machine learning is an iterative process. Don't be discouraged if your first models don't perform perfectly - the key is to keep experimenting and learning!