ðŸ§  WHAT IS EDA?

EDA â€” Exploratory Data Analysis â€” is the process of understanding your data before doing anything with it. You're asking questions like:

* How big is this dataset?
* What columns do I have and what do they mean?
* Are there missing values?
* What does the distribution of each column look like?
* Are there any patterns or relationships between columns?

Skipping EDA and jumping straight to building a model is like driving to a new city without looking at the map. You'll crash.

ðŸ§  THE DATASET â€” Titanic

ðŸ§  TOPIC 1 â€” First Look

In [2]:
import pandas as pd

df = pd.read_csv("titanic.csv")

print(df.shape)          # how many rows and columns
print(df.columns)        # what are the columns
print(df.head())         # first 5 rows
print(df.dtypes)         # data type of each column
print(df.info())         # non-null counts + dtypes together
print(df.describe())     # statistics for numeric columns

(891, 12)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='str')
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2

Understanding what each column means is called knowing your domain. For Titanic:

* Survived â€” 0 = died, 1 = survived (this is what we'll predict later)
* Pclass â€” ticket class: 1 = first class, 2 = second, 3 = third
* Sex â€” male/female
* Age â€” age in years
* SibSp â€” number of siblings/spouses aboard
* Parch â€” number of parents/children aboard
* Fare â€” ticket price paid
* Embarked â€” port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton

ðŸ§  TOPIC 2 â€” Missing Value Analysis

In [None]:
# How many missing values per column
print(df.isnull().sum())

# Percentage missing â€” more useful
print((df.isnull().sum() / len(df)) * 100)

You'll find Age, Cabin, and Embarked have missing values. A column with more than 70% missing is usually dropped entirely â€” it has too little information to be useful.

ðŸ§  TOPIC 3 â€” Value Counts

For columns with categories, value_counts() tells you how many of each:

In [4]:
print(df["Survived"].value_counts())
# 0    549  â€” died
# 1    342  â€” survived

print(df["Pclass"].value_counts())
print(df["Sex"].value_counts())
print(df["Embarked"].value_counts())

Survived
0    549
1    342
Name: count, dtype: int64
Pclass
3    491
1    216
2    184
Name: count, dtype: int64
Sex
male      577
female    314
Name: count, dtype: int64
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64


This immediately tells you something important â€” more people died than survived. This is called class imbalance and it affects how you build and evaluate ML models. Remember this term

ðŸ§  TOPIC 4 â€” Groupby â€” Finding Patterns
* groupby() splits data into groups and lets you compute statistics per group. This is where insights come from:

In [7]:
# Average survival rate by gender
print(df.groupby("Sex")["Survived"].mean())

# Average survival rate by class
print(df.groupby("Pclass")["Survived"].mean())

# Average fare paid per class
print(df.groupby("Pclass")["Fare"].mean())

# Multiple aggregations at once
print(df.groupby("Pclass").agg({
    "Survived": "mean",
    "Fare": "mean",
    "Age": "mean"
}))

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
        Survived       Fare        Age
Pclass                                
1       0.629630  84.154687  38.233441
2       0.472826  20.662183  29.877630
3       0.242363  13.675550  25.140620


When you run these you'll discover real historical patterns in the data â€” women survived at much higher rates, first class passengers survived more than third class. That's EDA working â€” you're finding signal in data.

ðŸ§  TOPIC 5 â€” Correlation

Correlation measures how strongly two numeric columns are related. Value ranges from -1 to 1:
* 1 means perfectly positively related
* -1 means perfectly negatively related
* 0 means no relationship

In [11]:
#print(df.corr(numeric_only=True))

# Correlation of everything with Survived specifically
print(df.corr(numeric_only=True)["Survived"].sort_values(ascending=False))

Survived       1.000000
Fare           0.257307
Parch          0.081629
PassengerId   -0.005007
SibSp         -0.035322
Age           -0.077221
Pclass        -0.338481
Name: Survived, dtype: float64


This tells you which features are most related to survival â€” useful for deciding which columns to use in your model later.

Task 1 â€” First Look (15 mins)

In [1]:
import pandas as pd

df = pd.read_csv("Titanic.csv")

print(df.shape)
print(df.columns)
print(df.head())
print(df.info())
print(df.describe())
print(df.isnull().sum()) 
print(df.isnull().sum()/len(df)*100) 

#data has total 891 rows and 12 columns
#only age cabin and embarked has null values where cabin has the highest number of null values
#cabin is almost useless data as it contains more than 70% empty values

(891, 8)
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='str')
   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S
<class 'pandas.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    str    
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    str    
dtyp

Task 2 â€” Cleaning (25 mins)

In [None]:
df['Age']=df['Age'].fillna(df['Age'].median())
print(df['Age'].value_counts())
df['Embarked']=df['Embarked'].fillna(df['Embarked'].mode()[0])
print(df.isnull().sum())
df.drop(columns=['Cabin','Name','Ticket','PassengerId'], inplace=True)
df.to_csv("titanic.csv", index=False)


inplace=True is a useful trick it modifies the DataFrame directly without needing to reassign again..

* df['Embarked'].mode(): The mode() function calculates the mode(s) (most frequent value(s)) of the Embarked column. It returns a pandas Series, even if there is only one mode.
* [0]: The mode() function returns a Series because a column can have multiple modes. Appending [0] accesses the first mode as a single scalar value, which fillna() requires to fill all the missing entries consistently

Task 3 â€” EDA + Insights (30 mins)

In [82]:
df

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.000000,1,0,7.2500,S
1,1,1,female,38.000000,1,0,71.2833,C
2,1,3,female,26.000000,0,0,7.9250,S
3,1,1,female,35.000000,1,0,53.1000,S
4,0,3,male,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S
887,1,1,female,19.000000,0,0,30.0000,S
888,0,3,female,29.699118,1,2,23.4500,S
889,1,1,male,26.000000,0,0,30.0000,C


In [None]:
survived = df.groupby("Survived")["Survived"].value_counts()
print(survived[1]/len(df['Survived'])*100)
#38.38% survived

sex_survivals = df.groupby("Sex")["Survived"].mean()
print(sex_survivals)
print(sex_survivals*100)
# Women survived at 74% rate, men at only 19%

class_survivals = df.groupby("Pclass")["Survived"].mean()
print(class_survivals*100)
#1class-62%, 2class-47%, 3class-24%

print(df.groupby("Survived")["Age"].mean())
# 0 (died)    â†’ average age ~30
# 1 (survived) â†’ average age ~28

print(df.corr(numeric_only=True)["Survived"].sort_values(ascending=False))
#fare has the highest correlation with survived

print(df.groupby("Survived")["Fare"].mean())
# 0 (died)    â†’ average fare ~22
# 1 (survived) â†’ average fare ~48

38.38383838383838


AttributeError: 'Series' object has no attribute 'value'