# Data Exploration (Iris Species Dataset)

### In this notebook, we will:
1. Load the dataset from the `Data/Raw` folder.
2. Understand the structure of the dataset (rows, columns, data types).
3. Explore basic statistics of numeric features.
4. Check for missing values.
5. Analyze the target variable (`Species`).
6. Summarize initial observations for next steps (data cleaning, visualization, modeling).


## Step 1: Import Libraries
We'll import the necessary libraries for data handling.

In [1]:
import pandas as pd
import numpy as np

## Step 2: Load the Dataset
Load the Iris CSV dataset from the `Data/Raw` folder and preview the first few rows.

In [2]:
# Raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'D:\Thiru\ML_Projects\Iris-Species-Prediction\Data\Raw\iris.csv')
df.head()

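|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |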
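The absolute path in the cell above only resolves on the machine where the notebook was written. A minimal sketch of a more portable loader follows, assuming the project keeps the CSV at `Data/Raw/iris.csv` under some parent of the notebook's working directory. The helper names `find_project_root` and `load_iris` are hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def find_project_root(start: Path, marker: str = "Data") -> Path:
    """Walk upward from `start` until a folder containing `marker` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' folder found at or above {start}")


def load_iris(start: Optional[Path] = None) -> pd.DataFrame:
    """Read Data/Raw/iris.csv relative to the detected project root.

    Assumption: the file layout is <root>/Data/Raw/iris.csv.
    """
    root = find_project_root((start or Path.cwd()).resolve())
    return pd.read_csv(root / "Data" / "Raw" / "iris.csv")
```

In the notebook this would be called as `df = load_iris()`, which searches upward from the current working directory.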


## Step 3: Dataset Information
Check the number of rows, columns, data types, and non-null values to understand the dataset structure.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
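The same structural facts can also be pulled out programmatically, which is handy when they need to be reused later in the notebook. The sketch below uses a three-row stand-in frame copied from the preview so that it runs on its own; in the notebook, the real `df` would take the place of `sample`.

```python
import pandas as pd

# Stand-in rows copied from the preview; replace `sample` with `df` in the notebook.
sample = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "SepalLengthCm": [5.1, 4.9, 4.7],
        "SepalWidthCm": [3.5, 3.0, 3.2],
        "PetalLengthCm": [1.4, 1.4, 1.3],
        "PetalWidthCm": [0.2, 0.2, 0.2],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

# Row and column counts
n_rows, n_cols = sample.shape

# Split the columns by dtype: numeric versus everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} rows, {n_cols} columns")
print("Numeric:", numeric_cols)
print("Non-numeric:", other_cols)
```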


## Step 4: Basic Statistics
View summary statistics for the numeric columns (count, mean, standard deviation, min, quartiles, max) to get an idea of the feature ranges.

In [4]:
df.describe()

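|       | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|-------|----|---------------|--------------|---------------|--------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |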
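`Id` appears in the summary only because it is stored as an integer; it is a row identifier, so its mean and standard deviation carry no information about the flowers. One possible way to restrict the summary to the four measurements is sketched below on a small stand-in frame (values copied from the preview); in the notebook, `df` would replace `sample`.

```python
import pandas as pd

# Stand-in rows copied from the preview; replace `sample` with `df` in the notebook.
sample = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "SepalLengthCm": [5.1, 4.9, 4.7],
        "SepalWidthCm": [3.5, 3.0, 3.2],
        "PetalLengthCm": [1.4, 1.4, 1.3],
        "PetalWidthCm": [0.2, 0.2, 0.2],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

# Drop the identifier before summarising; describe() skips the
# non-numeric Species column by default.
feature_stats = sample.drop(columns="Id").describe()
print(feature_stats)
```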


## Step 5: Missing Values
Check if there are any missing values in the dataset.

In [5]:
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
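Since every count here is zero, the output does not show what a problem would look like. The sketch below runs the same check on a deliberately broken toy frame (the values are made up for illustration and are not from the Iris data) and also pulls out the affected rows, which is usually the next thing to inspect.

```python
import numpy as np
import pandas as pd

# Toy frame with two injected gaps; illustrative values only.
demo = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, np.nan, 4.7],
        "Species": ["Iris-setosa", "Iris-setosa", None],
    }
)

# Per-column count of missing entries (isna is the same check as isnull)
missing_per_column = demo.isna().sum()

# Rows that contain at least one missing entry
rows_with_missing = demo[demo.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```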

## Step 6: Target Variable: Species
Analyze the distribution of each Iris species in the dataset.

In [6]:
# Count of each species
print(df['Species'].value_counts())

print('\n')

# Percentage distribution
print(df['Species'].value_counts(normalize=True) * 100)

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64


Species
Iris-setosa        33.333333
Iris-versicolor    33.333333
Iris-virginica     33.333333
Name: proportion, dtype: float64
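Balance can also be expressed as a single number, the ratio of the largest class to the smallest. The sketch below computes it on a small stand-in series with three labels per species; in the notebook, `df['Species']` would replace `species`. The 1.5 cutoff is an arbitrary assumption for illustration, not a standard threshold.

```python
import pandas as pd

# Stand-in labels; replace `species` with df['Species'] in the notebook.
species = pd.Series(
    ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
    name="Species",
)

counts = species.value_counts()

# Ratio of the most frequent class to the least frequent class;
# exactly 1.0 means every class has the same number of samples.
imbalance_ratio = counts.max() / counts.min()

# Assumed cutoff for calling the classes "balanced" (illustrative only)
is_balanced = bool(imbalance_ratio <= 1.5)

print(f"Imbalance ratio: {imbalance_ratio:.2f}, balanced: {is_balanced}")
```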


## Step 7: Summary & Observations

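- **Dataset Size:** 150 rows and 6 columns.
- **Missing Values:** None found in any of the six columns.
- **Feature Overview:** The four measurement columns are `float64`; `Id` is an integer row identifier rather than a feature, and `Species` is the categorical target (`object` dtype).
- **Target Distribution:** Perfectly balanced, with 50 samples (33.3%) for each of the three species.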
- **Next Steps:** 
  1. Perform data cleaning if needed (02_Data-Cleaning).
  2. Visualize relationships between features and target (03_EDA).
  3. Prepare dataset for model training (04_Algo-Comparison).