# Exploring Data with Visualizations and EDA

## **Introduction**
Exploratory Data Analysis (EDA) and visualizations are fundamental steps in understanding the structure and patterns within a dataset. Visual tools can provide insights that are not immediately evident from raw numbers.

### **Key Topics**:

#### **1. Histograms**:
- Used to visualize the distribution of a single numeric variable.
- Displays how many observations fall into each bin (interval of values).
- Formula for bin width, assuming equal-width bins:
  $$
  \text{Bin Width} = \frac{\text{Range of Data}}{\text{Number of Bins}}
  $$
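
As a rough illustration of the bin width formula, the sketch below computes the width by hand and compares it with the bin edges NumPy produces. The `values` array is a small hypothetical sample chosen for illustration, not the full dataset, and `n_bins = 4` is an arbitrary choice.

```python
import numpy as np

# Illustrative measurements in cm (hypothetical sample, not the full dataset)
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0])
n_bins = 4  # arbitrary choice for this sketch

# Bin width = range of data / number of bins
bin_width = (values.max() - values.min()) / n_bins

# By default np.histogram builds equal-width bins spanning [min, max]
counts, edges = np.histogram(values, bins=n_bins)

print(f"bin width: {bin_width:.2f}")
print("bin edges:", np.round(edges, 2))
print("counts per bin:", counts)
```

The spacing between consecutive entries of `edges` should match `bin_width`, and the counts should add up to the number of observations.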

#### **2. Boxplots**:
- Used to summarize the distribution of a variable and identify outliers.
- Key components:
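  - **Median (Q2)**: Middle value of the sorted data.
  - **Quartiles (Q1, Q3)**: The 25th and 75th percentiles, which form the edges of the box.
  - **Interquartile Range (IQR)**: Difference between Q3 and Q1.
  - **Outliers**: Values below $Q1 - 1.5 \times \text{IQR}$ or above $Q3 + 1.5 \times \text{IQR}$.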
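
The 1.5 × IQR rule can be sketched in a few lines of NumPy. The `data` array is hypothetical and chosen so that one value sits far below and one far above the rest. Quartiles are computed with `np.percentile` and its default linear interpolation; other quartile conventions can give slightly different fences.

```python
import numpy as np

# Hypothetical measurements with one unusually low and one unusually high value
data = np.array([2.0, 2.9, 3.0, 3.1, 3.2, 3.4, 4.4])

# Quartiles (default linear interpolation) and interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences: anything beyond them is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print("outliers:", outliers)
```

These are the same points a boxplot would typically draw as individual markers beyond the whiskers.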

#### **3. Scatterplots**:
- Used to explore relationships between two numeric variables.
- Useful for identifying trends, clusters, and potential correlations.

#### **4. Correlation vs. Causation**:
- **Correlation**: A statistical measure of the strength and direction of the linear relationship between two variables. The Pearson coefficient $r$ ranges from $-1$ to $1$.
  $$
  \text{Correlation Coefficient (r)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
  $$
- **Causation**: Indicates a cause-and-effect relationship.
- **Key Insight**: Correlation does not imply causation.
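
To make the correlation formula concrete, the sketch below evaluates it term by term and checks the result against `np.corrcoef`. The arrays `x` and `y` are hypothetical paired measurements used only for illustration.

```python
import numpy as np

# Hypothetical paired measurements (e.g. a length and a width, in cm)
x = np.array([1.4, 1.3, 1.5, 4.7, 4.5, 6.0])
y = np.array([0.2, 0.2, 0.2, 1.4, 1.5, 2.5])

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of products of deviations over the root of the product of sums of squares
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# NumPy's built-in estimate, for comparison
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual r: {r_manual:.4f}")
print(f"numpy r:  {r_numpy:.4f}")
```

A value of `r` close to 1 here says the two variables rise together; it says nothing about whether one causes the other.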

#### **5. Heatmaps and Pairplots**:
- **Heatmaps**: Visualize correlations between numeric variables using a color-coded matrix.
- **Pairplots**: Visualize pairwise relationships between numeric variables.
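
A heatmap is a colored rendering of a correlation matrix, so it can help to look at the matrix itself first. The sketch below builds one with pandas; the small `df` is a hypothetical stand-in with three numeric columns, not the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for a table of numeric measurements
df = pd.DataFrame({
    "petal length (cm)": [1.4, 1.3, 1.5, 4.7, 4.5, 6.0],
    "petal width (cm)": [0.2, 0.2, 0.2, 1.4, 1.5, 2.5],
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.2, 3.2, 3.3],
})

# Pairwise Pearson correlations: a symmetric matrix with 1.0 on the diagonal
corr = df.corr()

print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would draw this matrix as a color-coded grid with the coefficients printed in each cell.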


## **Dataset**
We will use the **Iris Flowers Dataset** for this notebook.

**Dataset Description**:
- The Iris dataset contains measurements of 150 iris flowers, 50 from each of three species.

**Columns of Interest**:
- **`sepal length (cm)`**: Numerical, represents the length of the sepals in centimeters.
- **`sepal width (cm)`**: Numerical, represents the width of the sepals in centimeters.
- **`petal length (cm)`**: Numerical, represents the length of the petals in centimeters.
- **`petal width (cm)`**: Numerical, represents the width of the petals in centimeters.
- **`species`**: Categorical, represents the species of the flower (`setosa`, `versicolor`, or `virginica`).

## **Loading the Dataset**

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
# Load the Iris flowers dataset from scikit-learn
from sklearn.datasets import load_iris

In [6]:
iris = load_iris(as_frame=True)

In [7]:
iris_df = iris.frame 
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

In [8]:
iris_df.head()

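|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | species |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |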


## **Exercises**

### **Exercise 1: Data Overview**
**Question**: What are the data types of the columns? How many unique values are present in the `species` column?

In [11]:
# knowing the target values 


### **Exercise 2: Histograms**
**Scenario**: The botanist wants to understand the distribution of sepal and petal lengths.

**Questions**:
- Plot histograms for `sepal length (cm)` and `petal length (cm)`.
- Identify the range of values and observe the overall distribution.


### **Exercise 3: Boxplots**
**Scenario**: The researcher wants to identify any potential outliers in the measurements.

**Questions**:
- Create boxplots for `sepal width (cm)` and `petal width (cm)`.
- Highlight any outliers and explain their significance.


### **Exercise 4: Scatterplots**
**Scenario**: The scientist wants to study the relationship between petal length and width.

**Questions**:
- Create a scatterplot to visualize the relationship between `petal length (cm)` and `petal width (cm)`.
- Is there a visible trend?


### **Exercise 5: Correlation and Heatmaps**
**Scenario**: The researcher wants to analyze the correlation between all numeric variables.

**Questions**:
- Calculate the correlation coefficients for all numeric columns.
- Create a heatmap to visualize the correlations.
- Which variables have the strongest correlation?


### **Exercise 6: Pairplots**
**Scenario**: The researcher wants to understand pairwise relationships across all features, categorized by species.

**Questions**:
- Create a pairplot for all numeric features, grouped by `species`.
- Are there any noticeable clusters or separations between species?


### **Exercise 7: Species Analysis**
**Scenario**: The botanist wants to explore the differences in sepal and petal measurements between species.

**Questions**:
- What is the average sepal length for each species?
- Which species has the largest average petal width?


### **Exercise 8: Outlier Detection**
**Questions**:
- Identify outliers for `sepal width (cm)` using the IQR method.
- How do the outliers compare across different species?


### **Exercise 9: Visualizing Categorical Data**
**Scenario**: The botanist wants to understand the species distribution.

**Questions**:
- Create a bar plot to show the count of each species.


### **Exercise 10: Advanced Scatterplots**
**Scenario**: The researcher wants to add more context to the scatterplots.

**Questions**:
- Create a scatterplot for `sepal length (cm)` vs `sepal width (cm)`, using `species` for color coding.


## **Conclusion**
Visualizations and EDA are powerful tools for uncovering patterns and relationships in data. Through this notebook, you explored key visualization techniques and their applications to the Iris dataset. These skills are essential for analyzing real-world datasets effectively.
