In [15]:
import pandas as pd
import numpy as np

df = pd.read_csv("/kaggle/input/test-file/tested.csv")

df




Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


# Titanic Dataset: Attribute Description

The Titanic dataset contains demographic, socioeconomic, and travel-related attributes for passengers aboard the RMS Titanic.  
The table below summarizes each attribute clearly for analysis and data preparation.

---

## 📘 Attribute Description Table

| **Attribute** | **Type** | **Description** | **Notes / Importance** |
|---------------|----------|-----------------|-------------------------|
| **PassengerId** | Numerical (int) | Unique ID assigned to each passenger | Only an index; no predictive meaning |
| **Survived** | Binary categorical (0/1) | Survival status: 1 = Survived, 0 = Died | Target variable for prediction |
| **Pclass** | Ordinal categorical | Passenger class: 1 = First, 2 = Second, 3 = Third | Strong indicator of socioeconomic status |
| **Name** | Text | Full name including title (Mr, Mrs, Miss, etc.) | Titles can be extracted as features |
| **Sex** | Categorical | Gender of passenger | One of the strongest predictors of survival |
| **Age** | Numerical (float) | Age in years | Contains missing values; important demographic |
| **SibSp** | Numerical (int) | Number of siblings/spouses aboard | Helps determine family structure |
| **Parch** | Numerical (int) | Number of parents/children aboard | Useful for creating "family size" |
| **Ticket** | Text | Ticket number/code | Sometimes groups passengers; messy attribute |
| **Fare** | Numerical (float) | Ticket price | Strongly related to class (Pclass) |
| **Cabin** | Text | Cabin number | Many missing values; deck letter is useful |
| **Embarked** | Categorical (C/Q/S) | Port of embarkation | Reflects geographic/socioeconomic patterns |

---


In [16]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
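
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch of that conversion; the `toy` frame and its values are hypothetical stand-ins for `df`, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (not the real values)
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, np.nan],
    "Fare": [7.83, 7.00, np.nan, 8.66],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
    "Embarked": ["Q", "S", "Q", "S"],
})

# isnull().mean() gives the fraction of missing entries per column;
# multiply by 100 to read it as a percentage
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the real data the same expression would be applied to `df` instead of `toy`.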

#### MCAR: Missing Completely At Random

The missingness is unrelated to any variable, observed or unobserved.

In the Titanic dataset

Possible MCAR example: Fare (1 missing)
A single missing Fare value looks like an isolated accident. It is most likely a data-entry glitch, unrelated to the passenger's class, wealth, or survival.

If Fare is MCAR:

- The missingness doesn't bias the analysis.
- You can safely drop that row or fill it using simple imputation.
- This is the "no hidden meaning, just bad luck" category.
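
Both options for an MCAR value can be sketched in a few lines. This is an illustration on a hypothetical `toy` frame, not the notebook's own cell; the median is one reasonable choice of simple imputation, and the mean would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing Fare
toy = pd.DataFrame({"Pclass": [3, 3, 1, 2], "Fare": [7.8, np.nan, 82.3, 9.7]})

# Option 1: fill with the column median (median() skips NaN by default)
fare_median = toy["Fare"].median()
toy_filled = toy.assign(Fare=toy["Fare"].fillna(fare_median))

# Option 2: drop only the rows where Fare is missing
toy_dropped = toy.dropna(subset=["Fare"])

print(toy_filled)
print(toy_dropped.shape)
```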

#### MAR: Missing At Random

The missingness depends on other known variables, but not on the value of the missing variable itself.

The classic example: Age might be missing more often for women, or people in 3rd class, or passengers with no siblings.

In the Titanic dataset

Age missing for 86 passengers is likely MAR.

Why?
Age plausibly goes missing more often for certain groups. These are hypotheses to check against the data, not established facts:

- Women may have reported their age less often in early records
- Third-class passengers tended to have the least documentation

So the missingness depends on variables like Sex, Pclass, or Ticket group, but not typically on Age itself.
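
Whether Age missingness actually tracks other variables can be checked directly by comparing the missing rate across groups. The sketch below uses a hypothetical `toy` frame; on the real data the same two `groupby` lines would be run against `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Pclass": [3, 3, 1, 1, 3, 2],
    "Age": [np.nan, np.nan, 41.0, 23.0, 27.0, 30.0],
})

# Boolean indicator: True where Age is missing
age_missing = toy["Age"].isnull()

# Share of missing ages within each class and each sex
rate_by_class = age_missing.groupby(toy["Pclass"]).mean()
rate_by_sex = age_missing.groupby(toy["Sex"]).mean()

print(rate_by_class)
print(rate_by_sex)
```

Clearly different rates between groups are consistent with MAR; similar rates everywhere point more toward MCAR.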

If Age is MAR:

- You can use predictive imputation, e.g., regression or a random-forest-based imputer (such as scikit-learn's `IterativeImputer` with a `RandomForestRegressor` estimator).
- Dropping rows can introduce bias.
- This is the "the missingness has a pattern, but you can measure that pattern" category.
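
A lightweight way to use the measured pattern is to impute Age from the group it depends on. The sketch below fills each missing Age with the median of the passenger's class. Note that this is group-wise median imputation, a simpler stand-in for a regression or random forest model, and the `toy` frame and the `Age_imputed` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# Median Age of each passenger's own class, aligned row by row
group_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing entries; observed ages are left untouched
toy["Age_imputed"] = toy["Age"].fillna(group_median)

print(toy)
```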

#### MNAR: Missing Not At Random

The missingness depends on the value itself.
This is the trickiest, moodiest category.

In the Titanic dataset

Cabin (327 missing) is likely MNAR.

Why?
Cabin numbers were usually recorded only for:

- First-class passengers
- Some second-class passengers

Third-class passengers often did not have cabin assignments, or their records were not kept, so the missingness is tied to socioeconomic status and physical location on the ship.

In other words:
Passengers with missing Cabin values are likely people who didn't have a cabin, not just people whose cabin wasn't recorded.

This is "the missingness reveals something important about the value itself."

If Cabin is MNAR:

- Simple imputation is dangerous because it destroys the meaning carried by the missingness.
- Best practice is to convert Cabin to a binary feature, for example `df['HasCabin'] = df['Cabin'].notnull().astype(int)`.
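
The claim that Cabin missingness follows class can be checked once the indicator exists. The sketch below builds `HasCabin` on a hypothetical `toy` frame, compares the recorded-cabin rate by class, and also pulls out the deck letter that the attribute table mentions; the `Deck` column name and the `"Unknown"` label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Cabin": ["B45", "E31", np.nan, np.nan, np.nan, np.nan],
})

# 1 if a cabin was recorded, 0 otherwise
toy["HasCabin"] = toy["Cabin"].notnull().astype(int)

# First character of the cabin code is the deck; missing cabins get an explicit label
toy["Deck"] = toy["Cabin"].str[0].fillna("Unknown")

# Share of passengers with a recorded cabin in each class
cabin_rate = toy.groupby("Pclass")["HasCabin"].mean()

print(toy)
print(cabin_rate)
```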


#### Treating missing values

#### 1. Listwise Deletion

Drop every row that has a missing value in any column.

In [17]:
# Listwise deletion: drop every row that has at least one missing value
df_listwise = df.dropna()

print(df_listwise.shape)
df_listwise.head()

# This drastically shrinks the dataset because Cabin has 327 missing values.

(87, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
14,906,1,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance...",female,47.0,1,0,W.E.P. 5734,61.175,E31,S
24,916,1,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
26,918,1,1,"Ostby, Miss. Helene Ragnhild",female,22.0,0,1,113509,61.9792,B36,C
28,920,0,1,"Brady, Mr. John Bertram",male,41.0,0,0,113054,30.5,A21,S
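
Listwise deletion does not have to consider every column. A minimal sketch, on a hypothetical `toy` frame, of restricting it to the columns an analysis actually needs through the `subset` argument of `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Fare": [7.83, 7.00, np.nan, 8.66],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
})

# Full listwise deletion: any missing value in any column removes the row
full_listwise = toy.dropna()

# Restricted listwise deletion: only Age and Fare must be present
subset_listwise = toy.dropna(subset=["Age", "Fare"])

print(full_listwise.shape, subset_listwise.shape)
```

Here the mostly empty Cabin column no longer decides which rows survive.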


#### 2. Pairwise Deletion

Instead of dropping entire rows, you drop missing values only for the specific pair of variables being analyzed.

This is used only during analysis, not as a permanent fix.

Example:
You want the correlation between Age and Fare, so you keep only the rows where both exist.
For another analysis, you may use different rows.

In [18]:
# Pairwise deletion for specific pairs
age_fare_corr = df[['Age', 'Fare']].dropna().corr()
age_fare_corr


Unnamed: 0,Age,Fare
Age,1.0,0.337932
Fare,0.337932,1.0
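
As a side note, `DataFrame.corr()` already excludes missing values pair by pair, so the explicit `dropna()` mainly makes the intent visible. The sketch below, on a hypothetical `toy` frame, compares the two routes and shows that different pairs end up using different numbers of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({
    "Age": [22.0, 35.0, np.nan, 48.0, 60.0],
    "Fare": [7.0, 10.0, 12.0, np.nan, 30.0],
    "SibSp": [0, 1, 0, 1, 0],
})

# Explicit pairwise deletion for the Age/Fare pair
explicit = toy[["Age", "Fare"]].dropna().corr().loc["Age", "Fare"]

# corr() on the whole frame drops missing values per pair of columns
implicit = toy.corr().loc["Age", "Fare"]

# Each pair keeps a different number of complete rows
n_age_fare = toy[["Age", "Fare"]].dropna().shape[0]
n_age_sibsp = toy[["Age", "SibSp"]].dropna().shape[0]

print(explicit, implicit, n_age_fare, n_age_sibsp)
```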


#### 3. Dropping Entire Columns

Sometimes a column has too much missingness and is not worth keeping.

The Cabin column is the perfect example.

Why drop Cabin?

- 327 of the 418 rows are missing (≈ 78%)
- Cabin numbers are too granular to use directly
- "HasCabin" (binary) is more meaningful for prediction
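
The "too much missingness" decision can be made explicit with a cutoff. This is a minimal sketch on a hypothetical `toy` frame; the 0.5 threshold is an arbitrary assumption for illustration, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Fare": [7.83, 7.00, np.nan, 8.66],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
})

threshold = 0.5  # assumed cutoff: drop columns with more than 50% missing

# Fraction of missing values per column
missing_frac = toy.isnull().mean()

# Columns whose missing fraction exceeds the cutoff
cols_to_drop = missing_frac[missing_frac > threshold].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(toy_reduced.columns.tolist())
```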



In [19]:
# Drop Cabin entirely
df_dropcol = df.drop(columns=['Cabin'])
df_dropcol.head()

# Or convert Cabin to HasCabin first



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,S


In [20]:
# This preserves a useful signal:

df['HasCabin'] = df['Cabin'].notnull().astype(int)
df = df.drop(columns=['Cabin'])
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,HasCabin
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Q,0
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,S,0
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Q,0
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,S,0
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,S,0
