In [1]:
import pandas as pd
import numpy as np

In [5]:
data = pd.read_csv('train.csv')
data.head()  # not displayed here: a notebook cell only shows its last expression
data.info()

# Check for missing values
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
# Count duplicates
print("Total duplicates before:", data.duplicated().sum())
# Remove duplicates
data = data.drop_duplicates()
print("Total duplicates after:", data.duplicated().sum())

Total duplicates before: 0
Total duplicates after: 0
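The check above reports zero duplicates, so `drop_duplicates()` changes nothing on this file. To show what the two calls do when duplicates are present, here is a minimal sketch on a made-up three-row frame. The `toy` data is hypothetical and is not taken from train.csv.

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first exactly.
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Fare": [7.25, 7.25, 8.05],
})

# duplicated() marks every repeat of an earlier row (keep='first' is the default).
n_before = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()
n_after = int(deduped.duplicated().sum())

print("Total duplicates before:", n_before)
print("Total duplicates after:", n_after)
print("Rows kept:", len(deduped))
```

Both calls compare whole rows by default; passing `subset=[...]` restricts the comparison to chosen columns.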


In [7]:
# Show missing before cleaning
print("Missing values before:\n", data.isnull().sum())
# Fill Age with mean
data['Age'] = data['Age'].fillna(data['Age'].mean())
# Fill Embarked with mode
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
# Drop Cabin (too many missing)
if 'Cabin' in data.columns:
    data = data.drop(columns=['Cabin'])
# Show missing after cleaning
print("Missing values after:\n", data.isnull().sum())


Missing values before:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
Missing values after:
 PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
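The decision to drop Cabin but fill Age and Embarked is easier to read as percentages. The sketch below redoes that arithmetic from the counts printed above. The 50% cutoff is an assumption chosen only for illustration; the notebook does not state a threshold.

```python
import pandas as pd

# Missing-value counts and row count copied from the output above.
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891

# Share of rows missing per column, as a percentage.
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)

# Hypothetical cutoff, for illustration only: flag columns that are
# missing in more than half of the rows.
threshold_pct = 50.0
to_drop = missing_pct[missing_pct > threshold_pct].index.tolist()
print("Columns over threshold:", to_drop)
```

On these counts Cabin is missing in roughly 77% of rows, Age in roughly 20%, and Embarked in well under 1%.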


In [8]:
# Create a new column: Sum of squares of Age and Fare
data['Age_Fare_SumSq'] = data['Age']**2 + data['Fare']**2
# Filter: keep passengers older than 18 (Age > 18 excludes age exactly 18)
filtered_data = data[data['Age'] > 18]
filtered_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Age_Fare_SumSq
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,536.5625
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,6525.308859
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,738.805625
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,4044.61
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,1289.8025
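Two details of this cell are worth checking on a small example. First, `Age > 18` is a strict comparison, so a passenger aged exactly 18 is dropped. Second, because Age was mean-filled earlier, every row that originally had no age now carries the mean and passes the filter whenever that mean is above 18. The sketch below uses made-up values (the `toy` frame is hypothetical) and repeats the same steps.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing age, one exactly 18, one child, one adult.
toy = pd.DataFrame({
    "Age": [np.nan, 18.0, 4.0, 38.0],
    "Fare": [7.25, 8.05, 16.7, 71.2833],
})

# Same steps as the cells above: mean-fill, derive the column, filter.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())  # mean of 18, 4, 38 is 20.0
toy["Age_Fare_SumSq"] = toy["Age"] ** 2 + toy["Fare"] ** 2
adults = toy[toy["Age"] > 18]

# Row 0 (imputed to 20.0) and row 3 survive; rows 1 and 2 are dropped.
print(adults)
```

If imputed ages should not count as adults, one option is to filter before filling, or to keep a flag column recording which ages were imputed.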


In [11]:
filtered_data.to_csv('cleaned_titanic.csv', index=False)
print("Cleaned and filtered dataset saved as 'cleaned_titanic.csv'")

Cleaned and filtered dataset saved as 'cleaned_titanic.csv'
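A quick way to confirm the save worked is to read the file back and compare it with what was written. The sketch below does this round trip in a temporary directory on a hypothetical two-row frame standing in for `filtered_data`, so it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for filtered_data.
frame = pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, 38.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_titanic.csv")
    # index=False keeps the row index out of the file.
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.shape == frame.shape)
```

Without `index=False`, the row index is written as an unnamed first column, and `read_csv` shows it as `Unnamed: 0` on reload.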


In [14]:
"""
Summary of Concepts Learned

In this task, I worked with data structures and data transformation techniques in Python. The main goal was to clean and transform the Titanic dataset from Kaggle.

I started by importing the pandas and NumPy libraries, which are widely used for handling structured data. Through this, I learned how to load, explore, and manipulate datasets using DataFrames.

The first step was to check for duplicate records (none were found) and remove any that exist, to keep the data consistent. Then I handled missing values: the 177 missing Age values were replaced with the column mean, and the 2 missing Embarked values with the mode. Cabin, which is missing in 687 of 891 rows, was dropped.

Next, I performed data transformations to derive new features, for example a new column holding the sum of squares of two numerical features (Age and Fare). I also filtered the data to keep only passengers older than 18.

Finally, I learned the importance of saving cleaned and transformed data for further analysis and how to organize my code into modular functions for reusability and clarity.
"""

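The summary mentions organizing code into modular functions, but the cells above run the steps inline. One possible way to package them is sketched below. The function names (`fill_missing`, `add_sum_of_squares`, `keep_older_than`, `clean`) and the `sample` frame are hypothetical and chosen for illustration; they are not part of the original notebook.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Mean-fill Age, mode-fill Embarked, and drop Cabin if present."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    return out.drop(columns=["Cabin"], errors="ignore")


def add_sum_of_squares(df: pd.DataFrame) -> pd.DataFrame:
    """Add Age_Fare_SumSq = Age**2 + Fare**2."""
    out = df.copy()
    out["Age_Fare_SumSq"] = out["Age"] ** 2 + out["Fare"] ** 2
    return out


def keep_older_than(df: pd.DataFrame, min_age: float = 18) -> pd.DataFrame:
    """Keep rows with Age strictly greater than min_age."""
    return df[df["Age"] > min_age].copy()


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Run the notebook's steps in order: dedupe, fill, derive, filter."""
    df = df.drop_duplicates()
    df = fill_missing(df)
    df = add_sum_of_squares(df)
    return keep_older_than(df)


# Hypothetical sample rows; the first mirrors the Age and Fare of passenger 1.
sample = pd.DataFrame({
    "Age": [22.0, None, 10.0],
    "Fare": [7.25, 8.05, 3.0],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
})
result = clean(sample)
print(result)
```

Each function returns a new frame instead of modifying its input, so the steps can be tested or reordered independently. On the real data the call would be `clean(pd.read_csv('train.csv'))`.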