## Why EDA is Important

EDA helps you:

* Understand what your data contains
* Detect missing values and errors
* Find patterns and trends
* Identify outliers
* Decide what analysis or model is appropriate

Without EDA, you’re basically **guessing**.

## Core Python Libraries Used in EDA

### 1. NumPy

**Purpose:** Numerical computing

**What it’s used for:**
* Working with numbers efficiently
* Arrays and mathematical operations
* Powering many other data libraries under the hood (Pandas is built on it)

**Example use cases:**
* Calculating averages
* Performing fast math operations
* Handling numerical datasets

Think of NumPy as:
> “The math engine of data analysis”

In [9]:
import numpy as np

# Comments help explain what's happening
# Create a simple array (list of numbers)
numbers = np.array([10, 20, 30, 40, 50])

print("Our Array:", numbers)

# Perform fast mathematical operations: mean (average) and standard deviation
average_val = np.mean(numbers)
std_val = np.std(numbers)

print("Average Value:", average_val)
print("STD:", std_val)

Our Array: [10 20 30 40 50]
Average Value: 30.0
STD: 14.142135623730951
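
One detail worth knowing about the standard deviation printed above: `np.std` divides by the number of values by default (the "population" version). The sketch below, which reuses the same five numbers, shows the `ddof` parameter that switches to the "sample" version; the variable names are only illustrative.

```python
import numpy as np

numbers = np.array([10, 20, 30, 40, 50])

# Default: population standard deviation (divides by n, ddof=0)
pop_std = np.std(numbers)

# Sample standard deviation (divides by n - 1)
sample_std = np.std(numbers, ddof=1)

print("Population STD:", pop_std)
print("Sample STD:", sample_std)
```

Pandas uses the sample version by default, so `describe()` on the same numbers reports a slightly larger value than `np.std`.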


In [2]:
# The same numbers stored two ways: as NumPy arrays and as plain Python lists
array_1 = np.array([1, 2, 3, 4, 5])
array_2 = np.array([6, 7, 8, 9, 10])

list_1 = [1, 2, 3, 4, 5]
list_2 = [6, 7, 8, 9, 10]

In [8]:
# With a plain list, doubling every value needs an explicit loop
for i in list_1:
    print(i * 2)

2
4
6
8
10


In [4]:
# With a NumPy array, the operation is applied to every element at once
array_1 * 2

array([ 2,  4,  6,  8, 10])
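
The same difference shows up when two collections are combined. As a small sketch using the arrays and lists defined above (redefined here so the cell runs on its own), `+` means something different for each type:

```python
import numpy as np

array_1 = np.array([1, 2, 3, 4, 5])
array_2 = np.array([6, 7, 8, 9, 10])

list_1 = [1, 2, 3, 4, 5]
list_2 = [6, 7, 8, 9, 10]

# "+" on two lists joins them into one longer list
joined = list_1 + list_2

# "+" on two arrays adds the values at matching positions
summed = array_1 + array_2

print("Lists: ", joined)
print("Arrays:", summed)
```

This element-wise behavior is what makes NumPy convenient for column-style math on datasets.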

### 2. Pandas

**Purpose:** Data manipulation and analysis

**What it’s used for:**
* Reading data (CSV, Excel, SQL, etc.)
* Cleaning data
* Filtering and transforming data
* Summarizing datasets

Key data structure:
* **DataFrame** → looks like an Excel table

Think of Pandas as:
> “Excel, but smarter and programmable”

In [None]:
import pandas as pd

# Creating a simple DataFrame (table)
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}

df = pd.DataFrame(data)

# Display the table
print("Our DataFrame:")
display(df)  # In Jupyter, display() shows a nice table
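
Filtering and summarizing, two of the uses listed above, can be sketched on the same small table. The age threshold of 28 is an arbitrary value picked for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
})

# Filtering: keep only the rows where Age is above 28 (arbitrary cutoff)
older = df[df['Age'] > 28]

# Summarizing: average of the Age column
mean_age = df['Age'].mean()

print(older)
print("Mean age:", mean_age)
```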

### 3. Matplotlib

**Purpose:** Data visualization

**What it’s used for:**
* Line charts
* Bar charts
* Histograms
* Scatter plots

Visualization helps you:
* See trends
* Spot outliers
* Compare values easily

Think of Matplotlib as:
> “Turning numbers into pictures”

### 4. (Optional but common) Seaborn

**Purpose:** Statistical visualization (built on Matplotlib)

**What it’s used for:**
* Cleaner and more attractive charts
* Visualizing distributions and relationships

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Data for plotting
x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 10, 5]

# Create a simple line plot using Matplotlib
plt.figure(figsize=(8, 4))  # Set plot size
plt.plot(x, y, marker='o', linestyle='-', color='b')
plt.title("Simple Line Plot Example")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.grid(True)
plt.show()
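
Bar charts, also listed above, follow the same pattern. A minimal sketch that reuses the `x` and `y` values from the line plot:

```python
import matplotlib.pyplot as plt

# Same data as the line plot
x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 10, 5]

# Draw one bar per x value, with height taken from y
plt.figure(figsize=(8, 4))
bars = plt.bar(x, y, color='g')
plt.title("Simple Bar Chart Example")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.show()
```

Swapping `plt.bar` for `plt.scatter` or `plt.hist` gives the other chart types with little extra code.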

## Typical EDA Workflow

1. **Import libraries**
2. **Load the dataset**
3. **Preview the data**
4. **Understand data types**
5. **Check for missing values**
6. **Generate summary statistics**
7. **Visualize the data**
8. **Draw insights**
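
Steps 2 to 4 usually start from a file. The sketch below assumes a CSV source; an in-memory string stands in for the file so the cell runs anywhere, and with a real file you would pass its path to `pd.read_csv` instead (the name `scores.csv` in the comment is hypothetical).

```python
import io
import pandas as pd

# Stand-in for a file on disk, e.g. pd.read_csv("scores.csv") (hypothetical name)
csv_text = io.StringIO(
    "name,age,math_score\n"
    "Emma,20,85\n"
    "Liam,21,90\n"
    "Olivia,19,78\n"
)

df = pd.read_csv(csv_text)   # Step 2: load the dataset
print(df.head())             # Step 3: preview the first rows
print(df.dtypes)             # Step 4: data type of each column
print(df.shape)              # (rows, columns)
```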

## Small Example: Introductory EDA (Conceptual)

Imagine a dataset of **students’ exam scores** with columns:

* `name`
* `age`
* `math_score`
* `english_score`

The cells below build this dataset in code and walk through the workflow steps.

In [None]:
# Steps 1 & 2: Libraries are already imported above; now load the data

# Creating a dataset directly in code (normally you would read a CSV)
student_data = {
    'name': ['Emma', 'Liam', 'Olivia', 'Noah', 'Ava', 'Ethan', 'Sophia'],
    'age': [20, 21, 19, 22, 20, 23, 19],
    'math_score': [85, 90, 78, 92, 60, 45, 88],
    'english_score': [88, 79, 85, 95, 70, 50, 92]
}

df_students = pd.DataFrame(student_data)

# Preview the data (Step 3: Preview)
print("Top 5 rows of our data:")
display(df_students.head())

# Check data types and info (Step 4: Understand data types)
print("\nInfo about the columns:")
df_students.info()

In [None]:
# Step 5 & 6: Check Data Quality & Summary Statistics

# Check for missing values
print("Missing values per column:")
print(df_students.isnull().sum())

# Summary statistics (mean, min, max, std)
print("\nSummary Statistics:")
display(df_students.describe())
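
The student dataset happens to be complete, so the missing-value check reports zeros. To show what the check looks like when something is missing, here is a sketch on a small demo table in which one score has been blanked on purpose; `df_demo` and the two handling options are illustrative, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'name': ['Emma', 'Liam', 'Olivia'],
    'math_score': [85, np.nan, 78],   # Liam's score deliberately blanked
})

# Count missing values in each column
missing = df_demo.isnull().sum()
print(missing)

# Option A: drop rows that contain a missing value
dropped = df_demo.dropna()

# Option B: fill the gap with the column mean
filled = df_demo.fillna({'math_score': df_demo['math_score'].mean()})

print(dropped)
print(filled)
```

Which option is appropriate depends on why the values are missing, which is itself something EDA should investigate.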

In [None]:
# Step 7: Visualize

# A. Histogram of Math Scores -> see score distribution
plt.figure(figsize=(6, 4))
sns.histplot(df_students['math_score'], bins=5, kde=True, color='skyblue')
plt.title("Distribution of Math Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()

# B. Scatter plot: Age vs Math Score
plt.figure(figsize=(6, 4))
sns.scatterplot(x='age', y='math_score', data=df_students, color='red', s=100)
plt.title("Age vs Math Score")
plt.show()

### Step 8: Insights

* Most students (5 of 7) score between 78 and 92 in math
* Two students scored well below the rest (45 and 60)
* No strong relationship between age and math score is apparent in this small sample

This is **EDA** — not predicting, just **understanding**.
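
Observations like the ones above can be checked numerically instead of being read off the plots alone. A minimal sketch on the same student data; the 78 to 92 band is simply one range chosen for illustration.

```python
import pandas as pd

df_students = pd.DataFrame({
    'name': ['Emma', 'Liam', 'Olivia', 'Noah', 'Ava', 'Ethan', 'Sophia'],
    'age': [20, 21, 19, 22, 20, 23, 19],
    'math_score': [85, 90, 78, 92, 60, 45, 88],
    'english_score': [88, 79, 85, 95, 70, 50, 92]
})

# How many math scores fall in the 78-92 band (both ends included)?
in_band = df_students['math_score'].between(78, 92).sum()

# Pearson correlation between age and math score
# (ranges from -1 to 1; values near 0 mean little linear relationship)
corr = df_students['age'].corr(df_students['math_score'])

print("Scores from 78 to 92:", in_band, "of", len(df_students))
print("Age vs math correlation:", round(corr, 2))
```

With only seven rows, any correlation value is weak evidence, so treat it as a prompt for collecting more data rather than as a conclusion.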

## Key Takeaways

* EDA is **mandatory**, not optional
* Pandas = data handling
* NumPy = numerical operations
* Matplotlib = visualization
* Always explore before modeling
