
## Task 2: FEATURE ENGINEERING

**Create additional features useful for predicting equipment failure**

**Objective**

To generate new, meaningful features that improve the prediction of equipment failure or any target outcome.

**Time-Based Features**

If the dataset includes a time or date column, the following features are extracted from it:

Hour, Day, Month, Day of Week, and Is Weekend

These help capture time-dependent behavior (for example, machines might fail more often on weekends or at certain hours).
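
Because the dataset loaded in this notebook turns out to have no time column, the extraction described above never runs on real data. The following is a minimal sketch on a made-up sensor log; the column names `timestamp` and `reading` and all values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical sensor log: names and values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "timestamp": ["2024-01-05 08:30:00", "2024-01-06 23:15:00", "2024-01-08 12:00:00"],
    "reading": [0.51, 0.73, 0.64],
})

# Parse strings to datetimes; unparseable entries become NaT instead of raising.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], errors="coerce")

toy["hour"] = toy["timestamp"].dt.hour
toy["day"] = toy["timestamp"].dt.day
toy["month"] = toy["timestamp"].dt.month
toy["day_of_week"] = toy["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["is_weekend"] = toy["day_of_week"].isin([5, 6]).astype(int)

# The three dates fall on a Friday, a Saturday, and a Monday,
# so is_weekend is 0, 1, 0.
print(toy[["hour", "day", "month", "day_of_week", "is_weekend"]])
```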

**Step 01: Import Libraries**

In [1]:
# Import libraries
import pandas as pd
import numpy as np

**Step 02: Load the preprocessed data**

In [2]:
# Load the Dataset
df = pd.read_csv("titanic_data.csv")

In [3]:
df.shape

(889, 15)

In [4]:
# Displaying first few rows of the dataset
df.head()

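|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |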


In [5]:
# Missing values
df.isnull().sum()

| Column | Missing values |
|---|---|
| survived | 0 |
| pclass | 0 |
| sex | 0 |
| age | 176 |
| sibsp | 0 |
| parch | 0 |
| fare | 0 |
| embarked | 2 |
| class | 0 |
| who | 0 |


In [6]:
# Numeric columns → fill with median
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Categorical columns → fill with mode (most frequent value)
categorical_cols = df.select_dtypes(include=['object']).columns
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))

print("\n Missing values handled successfully!")


 Missing values handled successfully!


In [7]:
# Check for missing values again
df.isnull().sum()

| Column | Missing values |
|---|---|
| survived | 0 |
| pclass | 0 |
| sex | 0 |
| age | 0 |
| sibsp | 0 |
| parch | 0 |
| fare | 0 |
| embarked | 0 |
| class | 0 |
| who | 0 |


**Step 03: Rolling Statistics**

For each numeric feature, rolling (windowed) statistics were created over a window of three consecutive rows:

Rolling Mean, Rolling Standard Deviation, Rolling Min, and Rolling Max

These capture short-term trends, fluctuations, or stability in sensor readings, provided the rows are ordered in time.
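
To make the four statistics concrete, here is a small sketch on a hand-made series. The name `reading` and the values are assumptions for illustration; the window of 3 and `min_periods=1` mirror the settings used in this notebook.

```python
import pandas as pd

# Illustrative values only; not taken from the dataset.
s = pd.Series([10.0, 12.0, 11.0, 15.0], name="reading")

# min_periods=1 lets the first rows use a shorter window instead of returning NaN.
roller = s.rolling(window=3, min_periods=1)

stats = pd.DataFrame({
    "reading": s,
    "rolling_mean": roller.mean(),
    "rolling_std": roller.std(),   # sample std; NaN when the window holds one value
    "rolling_min": roller.min(),
    "rolling_max": roller.max(),
})
print(stats)
```

Note that the first `rolling_std` value is NaN even with `min_periods=1`, because a sample standard deviation needs at least two observations.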

In [8]:
# Identify numeric and time columns

numeric_cols = df.select_dtypes(include=[np.number]).columns
time_cols = [col for col in df.columns if 'time' in col.lower() or 'date' in col.lower()]

print("Numeric Columns:", list(numeric_cols))
print("Time Columns:", list(time_cols))

Numeric Columns: ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']
Time Columns: []


**Step 04: Lag Features**

Lag-1 and lag-2 values were created for each numeric feature.
These represent the feature’s value from the previous one or two time steps, which is often useful for predicting future behavior.
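
As a minimal sketch of what a lag feature looks like, the example below shifts a hand-made column by one and two rows. The column name `reading` and its values are assumptions for illustration.

```python
import pandas as pd

# Illustrative values only; rows are assumed to be in time order.
toy = pd.DataFrame({"reading": [10.0, 12.0, 11.0, 15.0]})

# shift(k) moves values down by k rows, so each row sees the value from k steps earlier.
toy["reading_lag1"] = toy["reading"].shift(1)
toy["reading_lag2"] = toy["reading"].shift(2)

# The first k rows of a lag-k column have no earlier value and are NaN.
print(toy)
```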

In [9]:
# Convert time columns to datetime (if any)

for col in time_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')


**Step 05: Interaction Features**

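New features were built by combining existing ones:

Sum, Difference, and Ratio between the first two numeric columns.
These capture relationships and dependencies between variables. In this dataset the first two numeric columns are `survived` and `pclass`, so the three features are derived from the `survived` label and would leak it into any model that predicts survival.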
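
The sketch below shows the three interaction features on two hand-made columns. The names `a` and `b`, the values, and the variable `eps` are assumptions for illustration; the small constant added to the denominator follows the same idea as the notebook's ratio feature.

```python
import pandas as pd

# Illustrative columns; the last row has a zero denominator on purpose.
toy = pd.DataFrame({"a": [4.0, 9.0, 5.0], "b": [2.0, 3.0, 0.0]})

eps = 1e-6  # small constant so a zero in "b" does not produce inf

toy["sum_feat"] = toy["a"] + toy["b"]
toy["diff_feat"] = toy["a"] - toy["b"]
toy["ratio_feat"] = toy["a"] / (toy["b"] + eps)

print(toy)
```

With this guard, a zero denominator yields a very large finite ratio rather than `inf`, and the other ratios are shifted only slightly. Depending on the model, it may be preferable to set such rows to NaN and handle them explicitly.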

In [10]:
# Create Time-Based Features

# If you have a timestamp column, extract useful parts
if len(time_cols) > 0:
    time_col = time_cols[0]  # using first detected time column
    df['hour'] = df[time_col].dt.hour
    df['day'] = df[time_col].dt.day
    df['month'] = df[time_col].dt.month
    df['day_of_week'] = df[time_col].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    print("\n Time-based features created successfully!")
else:
    print("\n No time column detected — skipping time-based feature creation.")


 No time column detected — skipping time-based feature creation.


In [11]:
# Create Rolling Statistics (for numeric sensor-like data)

# Rolling mean, std, min, max over a window of 3 rows.
# min_periods=1 lets the first rows use a shorter window; the std of a
# single value is NaN, so row 0 of each *_rolling_std column starts as NaN.
for col in numeric_cols:
    roller = df[col].rolling(window=3, min_periods=1)
    df[f'{col}_rolling_mean'] = roller.mean()
    df[f'{col}_rolling_std'] = roller.std()
    df[f'{col}_rolling_min'] = roller.min()
    df[f'{col}_rolling_max'] = roller.max()

print("\n Rolling statistics features created successfully!")


 Rolling statistics features created successfully!


In [12]:
# Create Lag Features (previous value)

for col in numeric_cols:
    df[f'{col}_lag1'] = df[col].shift(1)
    df[f'{col}_lag2'] = df[col].shift(2)

print("\n Lag features created successfully!")


 Lag features created successfully!


In [13]:
# Create Derived/Interaction Features

# Example: sum, difference, and ratio of the first two numeric columns
if len(numeric_cols) >= 2:
    col1, col2 = numeric_cols[0], numeric_cols[1]
    df['sum_feat'] = df[col1] + df[col2]
    df['diff_feat'] = df[col1] - df[col2]
    df['ratio_feat'] = df[col1] / (df[col2] + 1e-6)  # small epsilon avoids division by zero
    print("\n Derived interaction features created successfully!")
else:
    print("\n Fewer than two numeric columns; skipping interaction features.")


 Derived interaction features created successfully!


**Step 06: Handling Missing Data**

Missing values introduced by the lag and rolling operations were filled with a backward fill followed by a forward fill.
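
A small sketch of how this filling behaves on lagged columns is shown below. The column name `reading` and its values are assumptions for illustration, and the example uses the `DataFrame.bfill` and `DataFrame.ffill` methods.

```python
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({"reading": [10.0, 12.0, 11.0, 15.0]})
toy["reading_lag1"] = toy["reading"].shift(1)  # NaN in row 0
toy["reading_lag2"] = toy["reading"].shift(2)  # NaN in rows 0 and 1

# Backward fill copies the next valid value upward into the leading NaNs;
# forward fill then covers any NaNs that remain at the end of a column.
filled = toy.bfill().ffill()

print(filled)
```

After filling, the leading rows of each lag column repeat the first available lagged value, so those rows carry no real history. Dropping them instead is a reasonable alternative when the data is long enough.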

In [14]:
# Handle Missing Values Created by Lagging

# Back-fill first (covers the leading NaNs from shift/rolling), then forward-fill any remaining gaps
df = df.bfill().ffill()


**Step 07: Saving and Output**

All new features were added to the original dataset, which was then saved as `feature_engineered_data.csv`.

In [15]:
# Save Enhanced Dataset

df.to_csv("feature_engineered_data.csv", index=False)
print("\n Feature-engineered dataset saved as 'feature_engineered_data.csv'")


 Feature-engineered dataset saved as 'feature_engineered_data.csv'


In [16]:
#  Preview of new dataset
df.shape

(889, 54)

In [17]:
df.head()

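|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | ... | age_lag2 | sibsp_lag1 | sibsp_lag2 | parch_lag1 | parch_lag2 | fare_lag1 | fare_lag2 | sum_feat | diff_feat | ratio_feat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | ... | 22.0 | 1.0 | 1.0 | 0.0 | 0.0 | 7.25 | 7.25 | 3 | -3 | 0.0 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | ... | 22.0 | 1.0 | 1.0 | 0.0 | 0.0 | 7.25 | 7.25 | 2 | 0 | 0.999999 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | ... | 22.0 | 1.0 | 1.0 | 0.0 | 0.0 | 71.2833 | 7.25 | 4 | -2 | 0.333333 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | ... | 38.0 | 0.0 | 1.0 | 0.0 | 0.0 | 7.925 | 71.2833 | 2 | 0 | 0.999999 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | ... | 26.0 | 1.0 | 0.0 | 0.0 | 0.0 | 53.1 | 7.925 | 3 | -3 | 0.0 |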
