# Welcome to Machine Learning and Artificial Intelligence!

## About Me
- **Name:** Davide Torre
- **Role:** Teaching Assistant (TA)
- **Contact:** dtorre@luiss.it

---

## Office Hours
- **When:** Fridays from 3:00 PM to 4:30 PM
- **Where:** [Location/Online Platform]
- **Contact:** Please email me to schedule a meeting.

---

## Course Materials
- Lecture notes, assignments, and additional resources will be provided on the course website.

---

Looking forward to an exciting learning journey together! If you have any questions or need assistance, don't hesitate to reach out.

Let's explore the world of Machine Learning and Artificial Intelligence!



# Premise
*Much of your time as a data scientist is likely to be spent wrangling data: figuring out how to get it, getting it, examining it, making sure it's correct and complete.*

Stephen Klosterman - Data Science Projects in Python

# Today's topics
- What is a dataset?
- Downloading a dataset from the web via Google Colab
- Getting started with Pandas and Seaborn

# Importing libraries with Python 🐍📚

In this part of the course, we'll dive into the fundamental step of setting up our Python environment for Machine Learning and Artificial Intelligence. One of the key aspects of this setup is importing the necessary libraries and tools that will empower us throughout the course. 🚀

## Why Importing Libraries Matters 🧩

Python is an incredibly versatile programming language, and its power lies in its extensive ecosystem of libraries. These libraries provide pre-built functions and modules that make complex tasks, such as data manipulation, visualization, and machine learning, much more accessible.

By importing the right libraries, we can tap into the collective knowledge and effort of the Python community to streamline our workflow and achieve our goals efficiently.

## Common Libraries We'll Use 📦

Here are some of the essential libraries we'll frequently import and work with in this course:

- **Pandas:** For data manipulation and analysis.
- **NumPy:** For efficient numerical operations.
- **Seaborn and Matplotlib:** For data visualization and plotting.
- **scikit-learn:** For machine learning algorithms.
- **Keras:** For building and training artificial neural networks.

## How to Import Libraries 🌟

Importing libraries in Python is a straightforward process. Typically, we use the `import` statement followed by the library name. For example:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
import keras
```

In [None]:
import numpy as np
import pandas as pd

import requests
import zipfile
import os

# Downloading the dataset from the web 🌐

In [None]:
# Step 1: Download the ZIP file
url = 'https://tinyurl.com/AInML001'
response = requests.get(url)

# Check if the download was successful
if response.status_code == 200:
    with open("iris.zip", "wb") as file:
        file.write(response.content)
        print("Downloaded the iris dataset successfully.")
else:
    raise Exception("Failed to download the iris dataset.")

# Step 2: Unzip the file
with zipfile.ZipFile("iris.zip", "r") as zip_ref:
    zip_ref.extractall("iris_dataset")

# Step 3: List the contents of the folder
extracted_files = os.listdir("iris_dataset")
print("Contents of the iris_dataset folder:")
for file in extracted_files:
    print(file)

# Step 4: Load the dataset with pandas
if "iris.data" in extracted_files:
    iris_df = pd.read_csv(os.path.join("iris_dataset", "iris.data"), header=None)
    print("\nLoaded the Iris dataset with Pandas:")
    print(iris_df.head())
else:
    print("\nFailed to find the iris data file in the extracted folder.")


# 🎉 Our First Dataset: The Iris Dataset 🌼

Congratulations, everyone! We've successfully taken our first step in the world of data with our very own dataset - the Iris dataset. 🙌

## What We've Achieved So Far 🏆

- 📥 **Downloaded the Dataset:** We triumphantly downloaded the Iris dataset from the web. It's a small but classic dataset that's often used for introductory machine learning tasks.

## What's Next? 🤔

Now that we have our dataset on hand, the next exciting step is to explore and understand it. 🧐 We'll dig into the dataset's structure, get a sense of its contents, and start unraveling the mysteries hidden within. 🕵️‍♀️

Stay tuned for some hands-on data exploration as we dive into the Iris dataset in our upcoming sessions! 🚀

Let's embark on this journey together and uncover the insights that await us in the world of data! 🌟


In [None]:
# Step 1 : A look into the dataset
iris_df.head()

     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa


## Datatypes 🚣
A dataset can be seen as a *collection* of measurements. In this course we'll (mostly) see tabular datasets, in which each row is a series of measurements that refer to a single occurrence.
In this context, each value can be one of:
- **number** (continuous data, e.g. the first column of this example)
- **class** (categorical data, e.g. the last column of this example)
- **metadata** (for example an ID, a code that uniquely identifies the measurement, the hour of the day, etc.)

## What is the meaning of each column? ❓

There's a little mystery we need to solve. Right now, when we look at the dataset, the columns have no names, and we don't know what each column represents. 🕵️‍♂️

But fear not! 🦸‍♀️ There's a clever solution at our fingertips. In the 'iris_dataset' folder, we have some additional files that can shed light on this mystery. 🌟

In a notebook, we sometimes use special syntax to interact with our environment. One example is the `!` prefix, which lets us run shell commands right from a Google Colab notebook cell. 🚀

The `cat` command is like a curious cat 🐱 that lets us peek into the content of a file. So, when we run the command:

```python
!cat 'iris_dataset/iris.names'
```
We're essentially asking the notebook to have the shell print the contents of the 'iris.names' file located in our 'iris_dataset' folder.

Now, let's go ahead and run this command to reveal the meaning of each column in our dataset! 🤓

In [None]:
!cat 'iris_dataset/iris.names'

1. Title: Iris Plants Database
	Updated Sept 21 by C.Blake - Added discrepency information

2. Sources:
     (a) Creator: R.A. Fisher
     (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
     (c) Date: July, 1988

3. Past Usage:
   - Publications: too many to mention!!!  Here are a few.
   1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
      to Mathematical Statistics" (John Wiley, NY, 1950).
   2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
      (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   3. Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
      Structure and Classification Rule for Recognition in Partially Exposed
      Environments".  IEEE Transactions on Pattern Analysis and Machine
      Intelligence, Vol. PAMI-2, No. 1, 67-71.
      -- Results:
         -- very low misclassification rates (0% for t
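The `!cat` call relies on a Unix shell being available, as it is in Colab. Here is a minimal plain-Python sketch of the same peek, assuming the `iris_dataset` folder from the download step exists. The helper name `peek` and its `max_lines` parameter are made up for this illustration.

```python
from pathlib import Path


def peek(path, max_lines=20):
    """Return the first `max_lines` lines of a text file as one string."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # zip stops as soon as range(max_lines) is exhausted
        lines = [line.rstrip("\n") for _, line in zip(range(max_lines), handle)]
    return "\n".join(lines)


# Assumed location: the folder created by the download cell
names_file = Path("iris_dataset") / "iris.names"
if names_file.exists():
    print(peek(names_file))
else:
    print(f"{names_file} not found; run the download cell first.")
```

Reading only the first lines keeps the output short for long description files.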

## Adding Meaningful Column Names 📊

### Assigning Column Names 🏷️

With Pandas, assigning column names to a dataset is as simple as creating a list of names and assigning it to the DataFrame's `columns` attribute. We've chosen the following names for our columns:

- Sepal Length (in cm)
- Sepal Width (in cm)
- Petal Length (in cm)
- Petal Width (in cm)
- Class (Iris Species)

We've now breathed life into our dataset by providing clear labels for each column. Let's take a look at our updated Iris dataset with these meaningful column names! 🌼



In [None]:
# Assigning column names to the Iris dataset
column_names = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "class"]
iris_df.columns = column_names

# Display the updated DataFrame
print("\nUpdated Iris dataset with column names:")
print(iris_df.head())



Updated Iris dataset with column names:
   sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm  \
0              5.1             3.5              1.4             0.2   
1              4.9             3.0              1.4             0.2   
2              4.7             3.2              1.3             0.2   
3              4.6             3.1              1.5             0.2   
4              5.0             3.6              1.4             0.2   

         class  
0  Iris-setosa  
1  Iris-setosa  
2  Iris-setosa  
3  Iris-setosa  
4  Iris-setosa  


## Understanding the Iris Dataset Columns 🌱

- **sepal_length_cm:** This column represents the length of the sepal (a part of the iris flower) in centimeters. In many flowers the sepals are green, leaf-like structures that protect the bud; in irises they are the large, colored outer segments of the flower.

- **sepal_width_cm:** This column represents the width of the sepal in centimeters. It provides information about the size and shape of the sepal.

- **petal_length_cm:** Here, we have the length of the petal, another part of the iris flower, in centimeters. The petal is the colorful part of the flower that attracts pollinators.

- **petal_width_cm:** Similar to the previous column, this one represents the width of the petal in centimeters.

- **class:** The iris species the flower belongs to (Setosa, Versicolor, or Virginica).

<img src=https://miro.medium.com/v2/resize:fit:720/1*YYiQed4kj_EZ2qfg_imDWA.png>
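To connect these columns with the datatypes discussed earlier, here is a small sketch that asks Pandas which columns are numeric and which are not. It rebuilds the five rows printed above by hand (the variable `sample` is only for illustration) so that it runs on its own. In the notebook you would call the same methods on `iris_df`.

```python
import pandas as pd

# First five rows of the Iris data as printed above, rebuilt by hand so the
# sketch is self-contained; in the notebook, use iris_df instead.
sample = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width_cm": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length_cm": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width_cm": [0.2, 0.2, 0.2, 0.2, 0.2],
    "class": ["Iris-setosa"] * 5,
})

# The four measurements are stored as floats; the label is stored as text
print(sample.dtypes)

# Split the columns into continuous (numeric) and categorical (non-numeric)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("Continuous columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
```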


# The quest for data integrity

## Understanding NaN in Datasets: The "Data Void" 🧩

In the world of data, NaN, which stands for "Not-a-Number," represents missing or undefined values in a dataset. It's like a placeholder that indicates the absence of data or a value that couldn't be recorded for some reason. NaN can occur for various reasons, such as data entry errors, sensor malfunctions, or simply because a value was not available at the time of recording.

### Why Are NaNs "Dangerous" for a Data Project? ⚠️

The presence of NaN values in a dataset can be problematic for several reasons:

- **Data Integrity:** NaNs can compromise the integrity of your data, making it unreliable for analysis. Missing values can lead to biased results and incorrect conclusions.

- **Algorithm Behavior:** Many machine learning algorithms can't handle NaN values and may raise errors or produce incorrect predictions when they encounter missing data.

- **Visualization:** NaNs can disrupt data visualizations, leading to gaps or inaccuracies in charts and graphs.

- **Imputation:** Before analyzing data or training machine learning models, NaN values often need to be handled through imputation (replacing missing values with estimated values). The choice of imputation method can impact the results.

Now, let's see how we can identify and count NaN values in our Iris dataset.

In [None]:
# Count NaN values in the Iris dataset
nan_count = iris_df.isna().sum()

# Display the count of NaN values for each column
print("NaN count in each column:")
print(nan_count)


NaN count in each column:
sepal_length_cm    0
sepal_width_cm     0
petal_length_cm    0
petal_width_cm     0
class              0
dtype: int64


Very good! None of our columns contains missing values.
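Since the Iris data has nothing missing, every count is zero. To see what the same check reports when values really are absent, here is a small sketch on a made-up table (the `toy` DataFrame is hypothetical and not part of the course data):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing values, for illustration only
toy = pd.DataFrame({
    "petal_length_cm": [1.4, np.nan, 1.3, 1.5],
    "class": ["Iris-setosa", "Iris-setosa", None, "Iris-setosa"],
})

# isna() marks each missing cell as True; sum() counts them per column
print(toy.isna().sum())

# Rows that contain at least one missing value
rows_with_nan = toy[toy.isna().any(axis=1)]
print(rows_with_nan)
```

Note that Pandas treats both `np.nan` and `None` as missing, so a single `isna()` call covers numeric and text columns.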

## Understanding Outliers: The Unusual Suspects 📈

In data analysis, an outlier refers to an observation or data point that significantly deviates from the majority of the data points in a dataset. These are the "unusual suspects" in your data, and they can have a substantial impact on your analysis.

### Why Are Outliers Relevant to Data Integrity? 🤔

Outliers are essential to consider in any data analysis for several reasons:

- **Data Integrity:** Outliers can indicate potential data quality issues, such as data entry errors or sensor malfunctions. Identifying and addressing outliers is crucial to maintaining data integrity.

- **Statistical Validity:** Outliers can distort statistical measures such as the mean and standard deviation, leading to inaccurate summaries of your data. This can mislead your analysis and conclusions.

- **Machine Learning:** In machine learning, outliers can negatively affect the performance of models. Some algorithms are sensitive to outliers, which can lead to biased predictions or reduced model accuracy.

- **Business Insights:** Outliers can sometimes represent important real-world events or anomalies. Understanding why outliers occur can provide valuable insights into your domain.

In summary, outliers are relevant to data integrity because they can distort your analysis, impact the validity of statistical measures, affect machine learning models, and potentially hide important insights within your data. Detecting and appropriately handling outliers is a critical step in data preprocessing and analysis.
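To make the point about distorted statistics concrete, here is a minimal sketch with made-up numbers. The five "clean" values are hypothetical sepal lengths, and the extra `50.0` imitates a data entry error (a misplaced decimal point).

```python
import pandas as pd

# Hypothetical sepal lengths; the extra value mimics a typo (50.0 instead of 5.0)
clean = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])
with_outlier = pd.concat([clean, pd.Series([50.0])], ignore_index=True)

# Compare summary statistics with and without the suspicious value
for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name:>12}: mean={values.mean():.2f}  "
        f"median={values.median():.2f}  std={values.std():.2f}"
    )
```

A single bad value drags the mean from about 4.9 to above 12 and inflates the standard deviation, while the median barely moves. That is why the median is often described as robust to outliers.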


### Spotting Outliers: Leveraging the Power of Visualization 👁️📊

When it comes to detecting outliers in your data, your best tool is your own senses, especially your sense of sight. Visualizing your data can be an incredibly effective way to spot those unusual data points that might otherwise go unnoticed. Here's why data visualization is so powerful:

- **Visual Patterns:** Our eyes are excellent at recognizing patterns, and data visualization allows us to see trends, clusters, and anomalies that might not be apparent in raw numbers.

- **Immediate Insight:** A well-designed graph or chart can provide immediate insights. You can quickly identify outliers by observing data points that lie far from the main cluster.

- **Comparative Analysis:** Visualization enables us to compare different data points visually, making it easier to identify values that stand out.

- **Contextual Understanding:** Visualizations provide context for your data. Outliers become more meaningful when you can see how they relate to the overall data distribution.

However, it's essential to recognize that not everyone may have access to traditional visual cues. For individuals with visual impairments, alternatives like sonification (representing data with sound) or tactile graphics (using touch to interpret data) can be invaluable. Data science is becoming more inclusive, and efforts are being made to ensure that everyone, regardless of their abilities, can participate in and benefit from data analysis and interpretation.

# Getting started with seaborn

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(iris_df, hue='class')

## Understanding Seaborn: Elegant Data Visualization 📊🐍

**Seaborn** is a popular Python data visualization library built on top of Matplotlib. It's designed to create aesthetically pleasing and informative statistical graphics. Seaborn is widely used in data science and research for its simplicity and versatility in producing attractive visualizations.

### Key Features of Seaborn 🌟

Seaborn offers several features that make it a valuable tool for data visualization:

- **High-Level Interface:** Seaborn provides an intuitive and high-level interface for creating a variety of statistical plots. You can create complex visualizations with just a few lines of code.

- **Beautiful Aesthetics:** Seaborn is known for its attractive and customizable color palettes and themes, making your plots visually appealing.

- **Statistical Capabilities:** Seaborn integrates well with Pandas DataFrames and can automatically calculate and display summary statistics, confidence intervals, and more within your plots.

- **Wide Range of Plot Types:** Whether you need scatter plots, bar charts, heatmaps, or violin plots, Seaborn has you covered with a broad range of plot types.

- **Faceting:** Seaborn makes it easy to create multi-plot grids, allowing you to visualize relationships between multiple variables simultaneously.

- **Regression Analysis:** Seaborn includes built-in functions for visualizing linear and non-linear relationships between variables.

### When to Use Seaborn 🤔

Seaborn is an excellent choice when you want to create informative and elegant visualizations for your data analysis, data exploration, and data presentation tasks. Whether you're a beginner or an experienced data scientist, Seaborn simplifies the process of creating beautiful and meaningful plots, helping you convey your data's story effectively.

With its rich functionality and ability to integrate seamlessly with Pandas and Matplotlib, Seaborn has become an indispensable tool in the data science toolkit.


In [None]:
# Scatter plot of sepal length vs. sepal width
sns.scatterplot(data=iris_df, x="sepal_length_cm", y="sepal_width_cm", hue="class")
plt.title("Scatter Plot of Sepal Length vs. Sepal Width")
plt.show()


This scatter plot visualizes the relationship between sepal length and sepal width, with different colors representing each iris class. It shows that Setosa species typically have shorter sepal lengths and slightly wider sepals compared to the other two species.

In fact, if we take a look at the photo, we can see that the other two classes are difficult to tell apart by the size of their sepals alone!

In [None]:
# Box plot of petal length by class
sns.boxplot(data=iris_df, x="class", y="petal_length_cm")
plt.title("Box Plot of Petal Length by Iris Class")
plt.show()


This box plot provides a summary of petal length for each iris class.


It highlights the differences in petal length between the three classes, with Setosa having the shortest petals and Virginica the longest. It also identifies potential outliers (the diamonds).
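With its default settings, Seaborn's box plot draws as separate points any values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. Here is a minimal sketch of that rule, where the helper name `iqr_outliers` and the toy values are assumptions made for illustration:

```python
import pandas as pd


def iqr_outliers(values, whis=1.5):
    """Return the entries of `values` lying beyond whis * IQR from the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical measurements with one suspicious entry
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
print(iqr_outliers(toy))
```

In the notebook, something like `iris_df.groupby("class")["petal_length_cm"].apply(iqr_outliers)` would list the candidate outliers per class. The points flagged this way are candidates to inspect, not values to delete automatically.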

In [None]:
# Violin plot of sepal width by class
sns.violinplot(data=iris_df, x="class", y="sepal_width_cm")
plt.title("Violin Plot of Sepal Width by Iris Class")
plt.show()

The violin plot displays the distribution of sepal widths for each iris class. It reveals that Setosa's sepal widths are centered on higher values than those of the other two classes, while Versicolor and Virginica have similar distributions centered on lower values.

<img src=https://miro.medium.com/v2/resize:fit:640/format:webp/1*TTMOaNG1o4PgQd-e8LurMg.png>

Source: medium.com

In [None]:
# Pair plot for all numerical columns
sns.pairplot(iris_df, hue="class")

This pair plot visualizes relationships between all numerical features (sepal length, sepal width, petal length, and petal width) for each iris class. It provides a comprehensive overview of the dataset, highlighting correlations and separations between the classes.



In [None]:
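# Correlation heatmap (numeric columns only: "class" holds text labels,
# which corr() cannot use and which raise an error in recent pandas versions)
correlation_matrix = iris_df.select_dtypes(include="number").corr()
sns.heatmap(correlation_matrix, annot=True, cmap="magma_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Iris Dataset")
plt.show()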

Insights: The heatmap displays the correlation between the numerical features in the Iris dataset. It shows which pairs of features are strongly correlated and therefore carry overlapping information. On its own it does not show which features separate the iris classes, which is what the pair plot above is for.

These Seaborn examples offer various ways to visualize and gain insights from the Iris dataset, helping to better understand the relationships between different attributes and the differences between iris classes.
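To get a feel for the numbers shown in the heatmap, here is a small sketch on a made-up table (the `toy` DataFrame and its column names are hypothetical). By default, `corr()` computes the Pearson coefficient, which ranges from -1 (perfectly opposite linear trend) to 1 (perfectly aligned linear trend).

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "double_x": [2.0, 4.0, 6.0, 8.0],  # rises whenever x rises
    "minus_x": [4.0, 3.0, 2.0, 1.0],   # falls whenever x rises
    "label": ["a", "a", "b", "b"],     # text column, excluded below
})

# Keep only the numeric columns before computing pairwise correlations
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

The result is a square, symmetric table with ones on the diagonal, because every column is perfectly correlated with itself.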

## Creating Subplots with `plt.subplots` 📊

In data visualization, it's often useful to display multiple plots in a grid layout to compare and analyze different aspects of your data simultaneously. The line of code:

```python
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
```
is used to create a grid of subplots in a figure.

### Breaking Down the Line of Code 🧐

- **`fig`:** This variable holds the reference to the main figure object that contains all the subplots. It represents the entire canvas where your plots will be placed.

- **`axes`:** This variable holds a 2D array (matrix) of subplot objects. Each element of this matrix represents a specific subplot within the grid. In this case, it's a 2x2 grid, so `axes` will be a 2x2 matrix.

- **`plt.subplots(2, 2, figsize=(12, 8))`:** This function call does the following:
  - **`2, 2`:** Specifies that you want a 2x2 grid of subplots, meaning there will be four subplots arranged in two rows and two columns.
  - **`figsize=(12, 8)`:** Sets the size of the entire figure to 12 inches in width and 8 inches in height. This determines the dimensions of the canvas that holds your subplots.

### Why Use Subplots? 🤔

Creating subplots is essential when you want to visualize multiple aspects of your data side by side. It allows you to compare distributions, trends, or relationships more effectively, making your data analysis more insightful and comprehensive.

Here's how you can create and customize subplots using `plt.subplots`:


In [None]:
import matplotlib.pyplot as plt

# Create a 2x2 grid of subplots with a specific figure size
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Customize and populate each subplot with data
# (Add your individual subplot code here)

# Adjust spacing between subplots for better readability
plt.tight_layout()

# Show the subplots
plt.show()
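As a quick check of what `plt.subplots` hands back, this sketch inspects `axes` and fills each subplot with a placeholder line. The plotted values and labels are arbitrary and only serve to show how indexing works.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# axes is a NumPy array of Axes objects arranged like the grid
print(type(axes).__name__, axes.shape)

# Index a single subplot by (row, column)
axes[0, 1].set_title("Top-right subplot")

# axes.flat walks the grid row by row, which is handy for loops
for i, ax in enumerate(axes.flat):
    ax.plot([0, 1, 2], [0, i, 2 * i])  # arbitrary placeholder data
    ax.set_xlabel(f"subplot {i}")

plt.tight_layout()
plt.show()
```

Indexing with `axes[row, column]` is convenient when each subplot gets different content, while looping over `axes.flat` avoids repetition when every subplot follows the same pattern.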


### Example distribution plot
Now that we know how to make subplots, let's plot the distribution of each variable in the dataset.
To do that we'll use seaborn's `histplot` and we'll set the **target** as one of the subplots via the argument `ax`.

In [None]:
# Set the style
sns.set(style="whitegrid")

# Create a figure with subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Sepal Length Distribution Plot
sns.histplot(data=iris_df, x="sepal_length_cm", kde=True, ax=axes[0, 0], hue='class')
axes[0, 0].set_title("Distribution of Sepal Length")

# Sepal Width Distribution Plot
sns.histplot(data=iris_df, x="sepal_width_cm", kde=True, ax=axes[0, 1], hue='class')
axes[0, 1].set_title("Distribution of Sepal Width")

# Petal Length Distribution Plot
sns.histplot(data=iris_df, x="petal_length_cm", kde=True, ax=axes[1, 0], hue='class')
axes[1, 0].set_title("Distribution of Petal Length")

# Petal Width Distribution Plot
sns.histplot(data=iris_df, x="petal_width_cm", kde=True, ax=axes[1, 1], hue='class')
axes[1, 1].set_title("Distribution of Petal Width")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plots
plt.show()


## The Challenge of Class Imbalance in Datasets 📊

In many real-world datasets, especially those used in classification tasks, you may encounter a common challenge known as "class imbalance." Class imbalance occurs when the number of data points in each class is not evenly distributed. This imbalance can lead to biased model training and affect the model's ability to accurately predict minority classes.

### The Imbalanced Dataset Problem ⚖️

Class imbalance can pose several issues, including:

- **Bias Towards Majority Classes:** Machine learning models tend to perform well on the majority class while struggling to predict minority classes. This can result in low sensitivity or recall for the minority classes.

- **Misleading Accuracy:** A model trained on imbalanced data can achieve high accuracy by simply predicting the majority class, even if it fails to detect minority class instances.

- **Ineffective Decision Boundaries:** Class imbalance can affect the decision boundary learned by the model, making it less representative of the underlying data distribution.
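The "misleading accuracy" issue is easy to reproduce with plain Python. The labels below are made up (95 samples of the majority class and 5 of the minority class), and the "model" is a stand-in that always predicts the majority class.

```python
# Hypothetical labels: 95 majority-class samples (0) and 5 minority-class samples (1)
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the minority class: share of true 1s that were predicted as 1
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_minority = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {recall_minority:.2f}")
```

The accuracy is 0.95 even though not a single minority sample is detected (recall 0.00), which is why accuracy alone is a poor guide on imbalanced data.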

### Addressing Class Imbalance 🛠️

To mitigate the challenges posed by class imbalance, common techniques include resampling (undersampling the majority class or oversampling the minority class, for example with SMOTE, the Synthetic Minority Over-sampling Technique) and using evaluation metrics that account for imbalance (e.g., F1-score).

### Visualizing Class Distribution 📈

One way to understand the extent of class imbalance is by visualizing the distribution of classes within the dataset. Let's create a plot to visualize the class distribution in our dataset.

In [None]:
# Count the occurrences of each class in the Iris dataset
class_counts = iris_df["class"].value_counts()

# Create a bar plot to visualize class distribution
plt.figure(figsize=(8, 6))
sns.barplot(x=class_counts.index, y=class_counts.values)
plt.title("Class Distribution in the Iris Dataset")
plt.xlabel("Iris Species")
plt.ylabel("Count")
plt.show()


### Balanced Dataset Confirmed 📊⚖️

After visualizing the class distribution in the Iris dataset, it's evident that we have a balanced dataset. The three classes, each representing a different iris species (Setosa, Versicolor, and Virginica), have the same number of instances (the standard Iris data contains 50 of each).

This balance in the dataset is a positive sign for our classification tasks. With every class equally represented, no class dominates during model training, which reduces the risk of bias toward any particular class.

This balance will contribute to more reliable and accurate model predictions across all classes. Let's continue exploring and analyzing our Iris dataset with confidence!
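A bar plot gives a visual impression. The same check can also be done numerically, as in this small sketch on a made-up label column (the `labels` Series is hypothetical). In the notebook, you would apply the same calls to `iris_df["class"]`.

```python
import pandas as pd

# Hypothetical label column: 6 samples of one class, 2 of another
labels = pd.Series(["A"] * 6 + ["B"] * 2, name="class")

# Absolute counts and relative shares per class
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(counts)
print(shares)

# One simple summary: ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

A ratio of 1.0 means perfectly balanced classes, and the further above 1 it gets, the stronger the imbalance.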


# Key Takeaways: A Guided EDA Process 🚀🔍

Exploratory Data Analysis (EDA) is an essential step in any data science project, helping us understand, clean, and gain insights from our dataset. Here's a structured EDA process to guide your data exploration:

1. **Understand Column Meanings:** Begin by comprehending the meaning of each column in your dataset. Know what each attribute represents and its significance in your analysis. This provides context for your data.

2. **Check Data Integrity:** Ensure the integrity of your dataset by inspecting for missing values (NaNs) and outliers. Addressing data quality issues is crucial to reliable analysis.

3. **Visualize Distributions:** Utilize data visualization techniques to explore the distribution of individual variables. Tools like histograms, kernel density plots, and box plots can reveal insights into the data's spread and central tendencies.

4. **Pair Plots for Relationships:** Create pair plots to visualize relationships between pairs of variables. This matrix of scatter plots helps identify correlations and patterns, aiding in understanding the data's structure.

By following this structured EDA process, you'll develop a comprehensive understanding of your dataset, identify potential data quality issues, and gain valuable insights that will inform subsequent analysis and modeling efforts. EDA is not just a preliminary step; it's an integral part of data-driven decision-making!

Next time we'll see how to deal with missing values.