# EDA Task 1: Understanding Data Structure

## Activity 1: Real-World Dataset — Titanic

In [1]:
import pandas as pd
import seaborn as sns

# Load Titanic dataset
df_titanic = sns.load_dataset('titanic')

print("--- Titanic Dataset Overview ---")

# Shape of the dataset
print("Shape:", df_titanic.shape)

# Column names
print("\nColumn Names:")
print(df_titanic.columns.tolist())

# Data types
print("\nData Types:")
print(df_titanic.dtypes)

# Preview first few rows
print("\nFirst 5 Rows:")
print(df_titanic.head())

# Info summary: info() prints directly and returns None, so it is not wrapped in print()
print("\nInfo Summary:")
df_titanic.info()

# Count missing values per column
print("\nMissing Values Summary:")
print(df_titanic.isnull().sum())

--- Titanic Dataset Overview ---
Shape: (891, 15)

Column Names:
['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']

Data Types:
survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

First 5 Rows:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4      
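
The separate checks above (dtypes, non-null counts, missing values) can also be collected into one table with a row per column. The sketch below is one possible way to do that. `summarize_structure` and the `toy` frame are hypothetical names with made-up values, not part of the original notebook. A small hand-built frame is used so the example runs without downloading the Titanic data.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, missing count and share."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # dtype name as text
        "non_null": df.notna().sum(),     # values present per column
        "missing": df.isna().sum(),       # values absent per column
    })
    # Share of missing values as a percentage of all rows
    summary["missing_pct"] = (summary["missing"] / len(df) * 100).round(1)
    return summary


# Toy frame with made-up values, standing in for a real dataset
toy = pd.DataFrame({
    "age": [22.0, None, 26.0, 35.0],
    "sex": ["male", "female", "female", None],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

structure = summarize_structure(toy)
print(structure)
```

Assuming `df_titanic` has been loaded as in the cell above, the same helper could be called as `summarize_structure(df_titanic)`.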

## Activity 2: Simulated Dataset — Clean & Controlled (simulate customer behavior on a product page)

In [2]:
import numpy as np
import pandas as pd  # already imported in the first cell; repeated so this cell runs on its own

# Set seed for reproducibility
np.random.seed(0)

# Create simulated dataset
df_sim = pd.DataFrame({
    'age': np.random.randint(18, 80, 100),
    'income': np.random.normal(50000, 15000, 100),
    'gender': np.random.choice(['male', 'female'], size=100),
    'purchased': np.random.choice([0, 1], size=100)
})

print("\n--- Simulated Dataset Overview ---")

# Shape of the dataset
print("Shape:", df_sim.shape)

# Column names
print("\nColumn Names:")
print(df_sim.columns.tolist())

# Data types
print("\nData Types:")
print(df_sim.dtypes)

# Preview first few rows
print("\nFirst 5 Rows:")
print(df_sim.head())

# Info summary: info() prints directly and returns None, so it is not wrapped in print()
print("\nInfo Summary:")
df_sim.info()

# Count missing values per column
print("\nMissing Values Summary:")
print(df_sim.isnull().sum())


--- Simulated Dataset Overview ---
Shape: (100, 4)

Column Names:
['age', 'income', 'gender', 'purchased']

Data Types:
age            int32
income       float64
gender        object
purchased      int32
dtype: object

First 5 Rows:
   age        income  gender  purchased
0   62  56080.966760  female          0
1   65  51779.098779  female          1
2   71  68816.211025  female          1
3   18  71286.530612    male          1
4   21  38842.158757  female          1

Info Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        100 non-null    int32  
 1   income     100 non-null    float64
 2   gender     100 non-null    object 
 3   purchased  100 non-null    int32  
dtypes: float64(1), int32(2), object(1)
memory usage: 2.5+ KB

Missing Values Summary:
age          0
income       0
gender       0
purchased    0
dtype: int64
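
The `int32` dtypes reported for `age` and `purchased` above likely reflect the platform the notebook ran on. On other systems NumPy's default integer is often `int64`. If consistent dtypes matter, one option is to cast explicitly after building the frame. The sketch below uses a trimmed, hypothetical frame `df_demo` with only the two integer columns, so its values are not guaranteed to match the output above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Hypothetical trimmed frame: only the two integer columns
df_demo = pd.DataFrame({
    "age": np.random.randint(18, 80, 100),
    "purchased": np.random.choice([0, 1], size=100),
})

# Cast after creation so the values stay the same and only the dtype changes
df_demo = df_demo.astype({"age": "int64", "purchased": "int64"})

print(df_demo.dtypes)
```

Casting with `astype` after the frame is built leaves the generated values untouched, which keeps a seeded simulation comparable across machines.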
