
## Task 1: DATA PREPROCESSING

**Objective**

To clean and prepare the dataset before applying any machine learning model. This involves:

- Handling missing values
- Removing outliers
- Scaling/normalizing numeric data
- Splitting the dataset into training and testing sets

**Step 01: Import Libraries**

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

**Step 02: Loading the Dataset**

The dataset was successfully loaded from the provided CSV file.

The shape (number of rows and columns) and first few rows were displayed to understand the data structure.

In [2]:
# Load the Dataset
df = pd.read_csv('titanic_data.csv')

In [3]:
# Shape of the Dataset
df.shape

(889, 15)

In [4]:
# Display first few rows of the dataset
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     889 non-null    int64  
 1   pclass       889 non-null    int64  
 2   sex          889 non-null    object 
 3   age          713 non-null    float64
 4   sibsp        889 non-null    int64  
 5   parch        889 non-null    int64  
 6   fare         889 non-null    float64
 7   embarked     887 non-null    object 
 8   class        889 non-null    object 
 9   who          889 non-null    object 
 10  adult_male   889 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  887 non-null    object 
 13  alive        889 non-null    object 
 14  alone        889 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.2+ KB


**Step 03: Handling Missing Values**

Checked for missing values in all columns.

Numeric columns: Missing values were filled with the median (to reduce the effect of outliers).

Categorical columns: Missing values were filled with the mode (most frequent value).

Result: No missing values remain in the dataset.
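
As a minimal sketch of the two fill rules described above, the snippet below imputes a small toy frame. The values are hypothetical and are not taken from `titanic_data.csv`; only the column names `age` and `embarked` mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one missing port, one extreme age
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 80.0],
    "embarked": ["S", "C", "S", None, "S"],
})

# Median of the observed ages [22, 35, 38, 80] is (35 + 38) / 2 = 36.5;
# the mean would be 43.75, pulled upward by the extreme value 80
age_median = toy["age"].median()

# Mode = most frequent observed category
port_mode = toy["embarked"].mode()[0]

filled = toy.assign(
    age=toy["age"].fillna(age_median),
    embarked=toy["embarked"].fillna(port_mode),
)

print(filled)
print("Remaining missing values:", int(filled.isnull().sum().sum()))
```

The median is used for the numeric column because a single extreme value shifts it far less than it shifts the mean.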

In [6]:
# Display count of missing values
df.isnull().sum()

Unnamed: 0,0
survived,0
pclass,0
sex,0
age,176
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0


In [7]:
# Numeric columns → fill with median
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Categorical columns → fill with mode (most frequent value)
categorical_cols = df.select_dtypes(include=['object']).columns
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))

print("\n Missing values handled successfully!")


 Missing values handled successfully!


In [8]:
# Check for missing values again
df.isnull().sum()

Unnamed: 0,0
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
embarked,0
class,0
who,0


**Step 04: Handling Outliers**

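Used the Interquartile Range (IQR) method to detect and remove outliers.

For each numeric column in turn, rows with a value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR were removed, where IQR = Q3 - Q1. Because the columns are filtered sequentially, each column's quartiles are computed on the rows that remain after the previous columns were filtered.

In total this removed 330 of the 889 rows, leaving 559. `parch` has an IQR of 0 here, so every row with a nonzero `parch` (144 rows) was treated as an outlier.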
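
To make the IQR rule concrete, here is a small sketch that computes the bounds for two toy columns. The helper name `iqr_bounds` and the data are assumptions for illustration only. The second column shows what happens when most values are identical: the IQR collapses to 0 and every other value falls outside the bounds.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy columns (not values from the Titanic file)
fare_like = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 30.0, 500.0])
parch_like = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])

# Q1 = 8, Q3 = 27, IQR = 19 -> bounds (-20.5, 55.5); only 500.0 is outside
fare_low, fare_high = iqr_bounds(fare_like)
fare_kept = fare_like[(fare_like >= fare_low) & (fare_like <= fare_high)]

# Q1 = Q3 = 0, IQR = 0 -> bounds (0.0, 0.0); every nonzero value is outside
parch_low, parch_high = iqr_bounds(parch_like)
parch_kept = parch_like[(parch_like >= parch_low) & (parch_like <= parch_high)]

print("fare bounds:", (fare_low, fare_high), "kept:", len(fare_kept))
print("parch bounds:", (parch_low, parch_high), "kept:", len(parch_kept))
```

For count-like columns dominated by a single value, it may be worth checking the bounds before filtering, since the rule can discard every row that differs from that value.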

In [9]:
# Handling Outliers using IQR method
def remove_outliers_iqr(dataframe, columns):
    for col in columns:
        Q1 = dataframe[col].quantile(0.25)
        Q3 = dataframe[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        before = dataframe.shape[0]
        dataframe = dataframe[(dataframe[col] >= lower_bound) & (dataframe[col] <= upper_bound)]
        after = dataframe.shape[0]
        print(f"{col}: Removed {before - after} outliers")
    return dataframe

df = remove_outliers_iqr(df, numeric_cols)
print("\n Outliers handled successfully!")

survived: Removed 0 outliers
pclass: Removed 0 outliers
age: Removed 66 outliers
sibsp: Removed 39 outliers
parch: Removed 144 outliers
fare: Removed 81 outliers

 Outliers handled successfully!


In [10]:
# Confirm outlier removal by inspecting the dataset summary (row count and dtypes)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 559 entries, 0 to 888
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     559 non-null    int64  
 1   pclass       559 non-null    int64  
 2   sex          559 non-null    object 
 3   age          559 non-null    float64
 4   sibsp        559 non-null    int64  
 5   parch        559 non-null    int64  
 6   fare         559 non-null    float64
 7   embarked     559 non-null    object 
 8   class        559 non-null    object 
 9   who          559 non-null    object 
 10  adult_male   559 non-null    bool   
 11  deck         559 non-null    object 
 12  embark_town  559 non-null    object 
 13  alive        559 non-null    object 
 14  alone        559 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 62.2+ KB


**Step 05: Normalization / Feature Scaling**

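Applied StandardScaler to all numeric columns (the same `numeric_cols` used in the earlier steps, which includes the label `survived` and the ordinal `pclass`).

Each scaled column now has a mean of 0 and, unless it is constant, a standard deviation of 1. `parch` is constant after outlier removal, so it scales to 0 in every row.

Scaling puts the features on a comparable scale, so that scale-sensitive models (for example distance-based or gradient-based ones) are not dominated by variables with large magnitudes.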
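
The sketch below illustrates what standardization computes, using a hypothetical toy array rather than the notebook's data. It compares `StandardScaler` with the manual formula z = (x - mean) / std, where std is the population standard deviation, and includes a constant column to show how that case behaves.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two varying columns and one constant column
X = np.array([
    [22.0, 7.25, 0.0],
    [38.0, 71.28, 0.0],
    [26.0, 7.92, 0.0],
    [35.0, 53.10, 0.0],
])

scaler = StandardScaler()
Z = scaler.fit_transform(X)

# Manual standardization of the varying columns.
# np.std defaults to ddof=0 (population std), which is what StandardScaler uses.
varying = X[:, :2]
manual = (varying - varying.mean(axis=0)) / varying.std(axis=0)

print("matches manual formula:", np.allclose(Z[:, :2], manual))
print("column means:", Z.mean(axis=0).round(6))
print("column stds: ", Z.std(axis=0).round(6))
```

The constant column comes out as all zeros with a standard deviation of 0, because there is no spread to rescale.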

In [11]:
# Normalization / Feature Scaling

scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])

print("\n Features normalized/scaled successfully!")


 Features normalized/scaled successfully!


In [12]:
# Check Normalization result
df_scaled.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,-0.636021,0.670946,male,-0.846681,1.844514,0.0,-0.615205,S,Third,man,True,C,Southampton,no,False
2,1.572275,0.670946,female,-0.374449,-0.431608,0.0,-0.551309,S,Third,woman,False,C,Southampton,yes,True
3,1.572275,-2.117596,female,0.688074,1.844514,0.0,3.724995,S,First,woman,False,C,Southampton,yes,False
4,-0.636021,0.670946,male,0.688074,-0.431608,0.0,-0.539476,S,Third,man,True,C,Southampton,no,True
5,-0.636021,0.670946,male,-0.138333,-0.431608,0.0,-0.500826,Q,Third,man,True,C,Queenstown,no,True


**Step 06: Train–Test Splitting**

The dataset was divided into:

Training Set: 80% of the data (used for model training)

Testing Set: 20% of the data (used for model evaluation)

Random seed = 42 for reproducibility.
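
In this dataset the label column is named `survived`, so the `'target'` lookup in the cell below finds no match and only the features are split. As a hedged sketch of the labelled case, the snippet below separates a label column from a toy frame and splits both together. The toy values are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same label name as the Titanic file
toy = pd.DataFrame({
    "age": [22, 38, 26, 35, 35, 28, 54, 2, 27, 14],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
    "survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

target_column = "survived"
X = toy.drop(columns=[target_column])  # features
y = toy[target_column]                 # label

# 80% training, 20% testing; a fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set shape:", X_train.shape)  # (8, 2)
print("Testing set shape:", X_test.shape)    # (2, 2)
```

Passing `X` and `y` in one call keeps each label aligned with its feature row in both subsets.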

In [13]:
# Train-test split
target_column = 'target'  # Placeholder: this dataset has no 'target' column (its label column is 'survived'), so the else-branch below runs

if target_column in df_scaled.columns:
    X = df_scaled.drop(columns=[target_column])
    y = df_scaled[target_column]
else:
    # If you don't have a target column yet, treat all columns as features
    X = df_scaled.copy()
    y = None
    print("\n No target column specified — splitting only features.")

# Split data (80% training, 20% testing)
if y is not None:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print("\n Data split into training and testing sets successfully!")
    print("Training set shape:", X_train.shape)
    print("Testing set shape:", X_test.shape)
else:
    X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
    print("\n Data split into training and testing sets (features only) successfully!")
    print("Training set shape:", X_train.shape)
    print("Testing set shape:", X_test.shape)


 No target column specified — splitting only features.

 Data split into training and testing sets (features only) successfully!
Training set shape: (447, 15)
Testing set shape: (112, 15)


**Result:**

Data preprocessing was completed successfully.
The dataset is now scaled and free from missing values and IQR-flagged outliers, making it ready for further analysis and machine learning tasks.