<a href="https://colab.research.google.com/github/swopnimghimire-123123/Machine-Learning-Journey/blob/main/18_Data_Exploration_Explained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's break down the data exploration steps and the outputs of the code cell below:

### Data Exploration Explained

1.  **Loading the Dataset:**
    *   The code first loads the built-in **Iris dataset** with `load_iris()` from the `sklearn.datasets` module.
    *   This classic machine learning dataset contains four measurements (sepal length, sepal width, petal length, and petal width, all in centimeters) for each of 150 iris flowers.
    *   The data is then converted into a **pandas DataFrame** for easier manipulation and analysis.
    *   A `target` column is then added to the DataFrame. It encodes the species of each flower as an integer: 0 (setosa), 1 (versicolor), or 2 (virginica).

2.  **Displaying Data Head:**
    *   The command `display(df.head())` shows the **first 5 rows** of the DataFrame.
    *   This gives you a quick look at the column names and the format of the values. Because the Iris rows are ordered by species, the first five rows all belong to class 0, so the head alone is not representative of the whole dataset.

3.  **Displaying Data Sample:**
    *   The command `display(df.sample(5))` shows **5 random rows** from the DataFrame.
    *   Because the rows are drawn at random, a sample can reveal variation that the first rows hide; in the output below it includes all three species. The selected rows change on every run unless you pass a `random_state`.

4.  **Displaying Data Types:**
    *   The command `display(df.dtypes)` shows the **data type of each column** in the DataFrame.
    *   Understanding data types (e.g., `float64`, `int64`) is crucial for data cleaning and analysis, as it dictates what kind of operations can be performed on each column. In this case, all four measurement columns are `float64`, and `target` is `int64`, even though it holds category labels rather than quantities.

5.  **Checking for Missing Values:**
    *   The command `display(df.isnull().sum())` calculates the **number of missing (null) values** in each column.
    *   The output shows 0 missing values in every column, so no imputation or row removal is needed for missing data.

6.  **Displaying Data Description:**
    *   The command `display(df.describe())` provides **descriptive statistics** for the numerical columns.
    *   This includes:
        *   `count`: The number of non-null values.
        *   `mean`: The average value.
        *   `std`: The standard deviation, which measures the spread of the data.
        *   `min`: The minimum value.
        *   `25%`, `50%` (median), `75%`: The quartiles, which divide the data into four equal parts.
        *   `max`: The maximum value.
    *   This output summarizes the central tendency and spread of each numerical feature. The statistics for `target` are less meaningful, because its values are class labels rather than measurements.

7.  **Checking for Duplicate Rows:**
    *   The command `display(df.duplicated().sum())` counts the rows that exactly repeat an earlier row; the first occurrence of each repeated row is not counted.
    *   The output `np.int64(1)` indicates that **one row duplicates another**. Whether to drop it depends on the analysis: it may be a repeated entry, or two different flowers that happen to share the same measurements.

8.  **Displaying Correlation Matrix:**
    *   The command `display(df.corr())` calculates the **pairwise Pearson correlation** (the default method) between all numeric columns, including `target`.
    *   The correlation matrix shows a value between -1 and 1 for each pair of columns:
        *   A value close to 1 indicates a strong positive linear relationship (as one increases, the other tends to increase).
        *   A value close to -1 indicates a strong negative linear relationship (as one increases, the other tends to decrease).
        *   A value close to 0 indicates a weak or no linear relationship.
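    *   In the matrix, `petal length (cm)` and `petal width (cm)` are strongly correlated with each other (0.96) and with `target` (0.95 and 0.96), which suggests that the petal measurements are closely tied to species. `sepal length (cm)` also correlates strongly with both petal measurements (0.87 and 0.82). `sepal width (cm)` is negatively correlated with every other column, and its correlations are weaker (from -0.12 with `sepal length (cm)` to -0.43 with `petal length (cm)`). Treat the `target` correlations with some caution: `target` is an integer-coded label, so they depend on the order in which the species happen to be numbered.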

These checks establish the size, types, completeness, and main relationships of the Iris dataset before any more advanced analysis or modeling.

In [None]:
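# display() is built in to notebooks; the explicit import lets the code run elsewhere too
from IPython.display import display
from sklearn.datasets import load_iris
import pandas as pd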

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Display the head of the data
print("## Data Head")
display(df.head())

# Display a sample of the data
print("## Data Sample")
display(df.sample(5))

# Display data types of columns
print("## Data Types")
display(df.dtypes)

# Check for missing values
print("## Missing Values")
display(df.isnull().sum())

# Display summary statistics for the numeric columns
print("## Data Description")
display(df.describe())

# Check for duplicate rows
print("## Duplicate Rows")
display(df.duplicated().sum())

# Display correlation matrix
print("## Correlation Matrix")
display(df.corr())

## Data Head


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
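
The integer codes in `target` are easier to read once they are mapped to species names. The following is a minimal sketch of one way to do that, assuming the same `load_iris()` data as the notebook; the `species` column name is a hypothetical addition and is not part of the original code.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target

# `species` is a hypothetical column name used only for this illustration.
# iris.target_names is an array of labels indexed by the integer code,
# so indexing it with the target array yields one name per row.
df["species"] = iris.target_names[iris.target]

# Show each distinct code next to its species name
print(df[["target", "species"]].drop_duplicates())
```

Keeping both columns is a deliberate choice here: the integer `target` stays available for modeling, while `species` is used for display and grouping.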


## Data Sample


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 49 | 5.0 | 3.3 | 1.4 | 0.2 | 0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 88 | 5.6 | 3.0 | 4.1 | 1.3 | 1 |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | 0 |
| 137 | 6.4 | 3.1 | 5.5 | 1.8 | 2 |
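
The rows above will differ each time the cell runs. If a repeatable sample is useful, for example when sharing a notebook, one option is to fix the seed. This sketch assumes an arbitrary seed of 42 and also shows a per-species sample; neither appears in the original code.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Fixing random_state makes the sample repeatable (42 is an arbitrary choice)
sample_a = df.sample(5, random_state=42)
sample_b = df.sample(5, random_state=42)
print(sample_a.equals(sample_b))

# Sampling within each group guarantees every species is represented
per_class = df.groupby("target").sample(n=2, random_state=42)
print(per_class)
```

The grouped version is worth considering on small samples, where a plain random draw can miss a class entirely.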


## Data Types


| Column | dtype |
|---|---|
| sepal length (cm) | float64 |
| sepal width (cm) | float64 |
| petal length (cm) | float64 |
| petal width (cm) | float64 |
| target | int64 |


## Missing Values


| Column | Missing values |
|---|---|
| sepal length (cm) | 0 |
| sepal width (cm) | 0 |
| petal length (cm) | 0 |
| petal width (cm) | 0 |
| target | 0 |
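
Since Iris has no missing values, the check above never shows a non-zero count. To see how `isnull().sum()` behaves when data is missing, this sketch blanks one cell in a copy of the DataFrame. The missing value is introduced artificially for illustration and does not exist in the real dataset; `demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Work on a copy so the original data stays intact.
# The NaN below is artificial and only demonstrates the check.
demo = df.copy()
demo.loc[0, "sepal width (cm)"] = np.nan

# Per-column count of missing values
missing = demo.isnull().sum()
print(missing)

# Rows that contain at least one missing value
print(demo[demo.isnull().any(axis=1)])
```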


## Data Description


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |
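
The `target` column in the summary above has a mean of 1.0, which says little about a class label. Counting rows per class, and describing a feature separately for each class, is usually more informative. This is a sketch under the assumption that per-species statistics are of interest; the choice of `petal length (cm)` is arbitrary.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Number of rows in each class, ordered by class code
class_counts = df["target"].value_counts().sort_index()
print(class_counts)

# The same statistics as describe(), computed separately per class
per_species = df.groupby("target")["petal length (cm)"].describe()
print(per_species)
```

The classes are balanced at 50 rows each, which is consistent with the overall count of 150 and the mean label of 1.0.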


## Duplicate Rows


np.int64(1)
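
The count alone does not show which rows are involved. The following sketch lists every row that takes part in a duplicate and then builds a de-duplicated copy; whether dropping is appropriate is left as a judgment call, and `deduped` is a hypothetical name.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target

# keep=False flags every member of a duplicate group, not just the repeats
dupes = df[df.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence of each repeated row by default
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(df), "->", len(deduped))
```

With a count of 1 as shown above, `dupes` should contain the original row plus its single repeat.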

## Correlation Matrix


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| sepal length (cm) | 1.0 | -0.11757 | 0.871754 | 0.817941 | 0.782561 |
| sepal width (cm) | -0.11757 | 1.0 | -0.42844 | -0.366126 | -0.426658 |
| petal length (cm) | 0.871754 | -0.42844 | 1.0 | 0.962865 | 0.949035 |
| petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.0 | 0.956547 |
| target | 0.782561 | -0.426658 | 0.949035 | 0.956547 | 1.0 |
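
A correlation matrix lists every pair twice and includes the trivial diagonal, so it can help to rank the unique feature pairs by strength. This is a minimal sketch that leaves out `target`, on the assumption that an integer-coded label is better examined separately; `ranked` is a hypothetical name.

```python
from itertools import combinations

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Correlate the four measurements only (Pearson is the default method)
corr = df.drop(columns="target").corr()

# Visit each unordered pair of features exactly once
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Rank by absolute value so strong negative correlations are not buried
ranked = sorted(pairs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for (a, b), r in ranked:
    print(f"{a} vs {b}: {r:.3f}")
```

Ranking by absolute value is a design choice: it treats a correlation of -0.43 as stronger than one of 0.12, which matches how the matrix is usually read.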
