# Social Media Addiction Prediction: Preprocessing Scenario Analysis

This notebook uses a scenario-based approach to choose preprocessing steps for predicting students' social media addiction scores. All feature engineering is completed before any scenario is tested, so every scenario starts from the same feature set. The analysis includes:

1. Data loading and cleaning
2. Feature engineering (all new features created before scenario testing)
3. Scenario 1: Feature selection comparison (All features, SelectKBest, RFE, VarianceThreshold)
4. Scenario 2: Normalization comparison (StandardScaler, MinMaxScaler, RobustScaler, None)
5. Results summary and visualization
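
As a rough illustration of what a Scenario 1 comparison can look like, the sketch below fits the same linear model behind each feature selector. It runs on synthetic data from `make_regression` rather than the student dataset, and the choice of `k=8` features, the 80/20 split, and the `selection_results` name are assumptions made for the example only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the student data (assumption: 17 numeric features).
X, y = make_regression(n_samples=300, n_features=17, n_informative=6,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One entry per Scenario 1 option; None means "use all features".
selectors = {
    "All features": None,
    "SelectKBest": SelectKBest(score_func=f_regression, k=8),
    "RFE": RFE(estimator=LinearRegression(), n_features_to_select=8),
    "VarianceThreshold": VarianceThreshold(threshold=0.0),
}

selection_results = {}
for name, selector in selectors.items():
    # Putting the selector inside the pipeline means it is fitted on the
    # training split only, which avoids leaking test data into selection.
    steps = [] if selector is None else [("select", selector)]
    steps.append(("model", LinearRegression()))
    pipe = Pipeline(steps)
    pipe.fit(X_train, y_train)
    selection_results[name] = r2_score(y_test, pipe.predict(X_test))

for name, score in selection_results.items():
    print(f"{name:>18}: R2 = {score:.3f}")
```

With a threshold of 0.0, `VarianceThreshold` only drops constant columns, so on data without constant features it should match the all-features baseline.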


In [52]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression, RFE, VarianceThreshold
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

plt.style.use('default')
sns.set_palette('husl')

## 1. Data Loading & Cleaning
Load the dataset and perform initial cleaning.

In [53]:
# Load the dataset
# (Update the path if needed)
df = pd.read_csv('dataset/Students_Social_Media_Addiction.csv')
print('Dataset shape:', df.shape)
display(df.head())

# Check for missing values
missing = df.isnull().sum()
print('Missing values per column:')
print(missing[missing > 0] if missing.sum() > 0 else 'No missing values found.')

Dataset shape: (705, 13)


| | Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | Relationship_Status | Conflicts_Over_Social_Media | Addicted_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | Female | Undergraduate | Bangladesh | 5.2 | Instagram | Yes | 6.5 | 6 | In Relationship | 3 | 8 |
| 1 | 2 | 22 | Male | Graduate | India | 2.1 | Twitter | No | 7.5 | 8 | Single | 0 | 3 |
| 2 | 3 | 20 | Female | Undergraduate | USA | 6.0 | TikTok | Yes | 5.0 | 5 | Complicated | 4 | 9 |
| 3 | 4 | 18 | Male | High School | UK | 3.0 | YouTube | No | 7.0 | 7 | Single | 1 | 4 |
| 4 | 5 | 21 | Male | Graduate | Canada | 4.5 | Facebook | Yes | 6.0 | 6 | In Relationship | 2 | 7 |


Missing values per column:
No missing values found.
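
The cell above checks only for missing values. A minimal sketch of two further cleaning checks, duplicate rows and implausible hour values, is shown below. The `demo` frame is hypothetical data that reuses three column names from the dataset, and the 0 to 24 hour range is an assumed plausibility rule, not something the notebook applies.

```python
import pandas as pd

# Hypothetical mini-frame reusing three column names from the dataset.
demo = pd.DataFrame({
    "Student_ID": [1, 2, 2, 3],
    "Avg_Daily_Usage_Hours": [5.2, 2.1, 2.1, 30.0],
    "Sleep_Hours_Per_Night": [6.5, 7.5, 7.5, 7.0],
})

# Exact duplicate rows would be counted twice by any model.
n_duplicates = int(demo.duplicated().sum())
demo = demo.drop_duplicates().reset_index(drop=True)

# Daily hours cannot exceed 24; flag rows outside the plausible range.
hour_cols = ["Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night"]
out_of_range = ~demo[hour_cols].apply(lambda s: s.between(0, 24)).all(axis=1)

print("Duplicate rows removed:", n_duplicates)
print("Student_IDs with implausible hours:",
      demo.loc[out_of_range, "Student_ID"].tolist())
```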


## 2. Feature Engineering
All new features are created before scenario testing. They include usage categories, an adequate-sleep flag, mental health risk levels, label-encoded categorical columns, and interaction features.
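
Binned features such as the usage categories depend on how `pd.cut` treats bin edges. The sketch below uses made-up hour values to show that, with default settings, each bin excludes its left edge, so a value of exactly 0 falls outside the first bin and becomes `NaN`. Whether the real dataset contains such values is not shown here; `include_lowest=True` is one possible safeguard if it does.

```python
import pandas as pd

# Made-up usage values chosen to sit on or near the bin edges.
hours = pd.Series([0.0, 2.0, 2.1, 6.0, 6.5])
bins = [0, 2, 4, 6, float("inf")]
labels = ["Low", "Medium", "High", "Very High"]

# Default: intervals are (0, 2], (2, 4], (4, 6], (6, inf).
default_cut = pd.cut(hours, bins=bins, labels=labels)

# include_lowest=True also places the lowest edge (0) in the first bin.
inclusive_cut = pd.cut(hours, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "hours": hours,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```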

In [54]:
# Feature Engineering
fe = df.copy()

# Usage category
fe['Usage_Category'] = pd.cut(fe['Avg_Daily_Usage_Hours'], bins=[0,2,4,6,float('inf')], labels=['Low','Medium','High','Very High'])

# Adequate sleep (7-9 hours)
fe['Adequate_Sleep'] = ((fe['Sleep_Hours_Per_Night'] >= 7) & (fe['Sleep_Hours_Per_Night'] <= 9)).astype(int)

# Mental health risk
fe['Mental_Health_Risk'] = pd.cut(fe['Mental_Health_Score'], bins=[0,4,6,8,float('inf')], labels=['High','Medium','Low','Very Low'])

# Encode categorical features
le_gender = LabelEncoder()
fe['Gender_Encoded'] = le_gender.fit_transform(fe['Gender'])
le_academic = LabelEncoder()
fe['Academic_Level_Encoded'] = le_academic.fit_transform(fe['Academic_Level'])
le_relationship = LabelEncoder()
fe['Relationship_Status_Encoded'] = le_relationship.fit_transform(fe['Relationship_Status'])

# Interaction features
fe['SM_Impact_Score'] = fe['Avg_Daily_Usage_Hours'] * fe['Affects_Academic_Performance'].map({'Yes':1, 'No':0})
fe['Lifestyle_Balance'] = fe['Sleep_Hours_Per_Night'] - fe['Avg_Daily_Usage_Hours']

# One-hot encode usage category and mental health risk
fe = pd.get_dummies(fe, columns=['Usage_Category','Mental_Health_Risk'], drop_first=True)

print('Feature engineering complete. New shape:', fe.shape)
display(fe.head())

Feature engineering complete. New shape: (705, 25)


| | Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | ... | Academic_Level_Encoded | Relationship_Status_Encoded | SM_Impact_Score | Lifestyle_Balance | Usage_Category_Medium | Usage_Category_High | Usage_Category_Very High | Mental_Health_Risk_Medium | Mental_Health_Risk_Low | Mental_Health_Risk_Very Low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | Female | Undergraduate | Bangladesh | 5.2 | Instagram | Yes | 6.5 | 6 | ... | 2 | 1 | 5.2 | 1.3 | False | True | False | True | False | False |
| 1 | 2 | 22 | Male | Graduate | India | 2.1 | Twitter | No | 7.5 | 8 | ... | 0 | 2 | 0.0 | 5.4 | True | False | False | False | True | False |
| 2 | 3 | 20 | Female | Undergraduate | USA | 6.0 | TikTok | Yes | 5.0 | 5 | ... | 2 | 0 | 6.0 | -1.0 | False | True | False | True | False | False |
| 3 | 4 | 18 | Male | High School | UK | 3.0 | YouTube | No | 7.0 | 7 | ... | 1 | 2 | 0.0 | 4.0 | True | False | False | False | True | False |
| 4 | 5 | 21 | Male | Graduate | Canada | 4.5 | Facebook | Yes | 6.0 | 6 | ... | 0 | 1 | 4.5 | 1.5 | False | True | False | True | False | False |
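
The encoded values in the table above follow alphabetical order, because `LabelEncoder` sorts class names before assigning codes. For `Academic_Level` this puts Graduate (0) below High School (1), which a linear model will read as a numeric ordering. The sketch below contrasts that with an explicit mapping; the `level_order` dictionary is an assumed educational ordering and is not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = pd.Series(["Undergraduate", "Graduate", "High School"])

# LabelEncoder sorts the classes, so codes follow alphabetical order.
alphabetical_codes = LabelEncoder().fit_transform(levels)

# Assumed educational ordering (hypothetical, not stated in the notebook).
level_order = {"High School": 0, "Undergraduate": 1, "Graduate": 2}
ordinal_codes = levels.map(level_order)

print(alphabetical_codes.tolist())  # [2, 0, 1]
print(ordinal_codes.tolist())       # [1, 2, 0]
```

For columns with no natural order, such as `Gender` or `Relationship_Status`, one-hot encoding avoids implying any ranking at all.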


## 📁 Export Feature-Engineered Dataset

Save the feature-engineered dataset to CSV for later use and analysis.
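
One way to confirm that a CSV export preserves the data is to read the file back and compare it with the in-memory frame. The sketch below does this for a small hypothetical frame written to a temporary directory; the `sample` data and the file name are illustrative assumptions, and only three of the engineered column names are reused.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample with an integer, a float, and a boolean dummy column.
sample = pd.DataFrame({
    "Student_ID": [1, 2, 3],
    "Lifestyle_Balance": [1.3, 5.4, -1.0],
    "Usage_Category_High": [True, False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "feature_engineered_sample.csv"
    sample.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Raises an AssertionError if values, dtypes, or column names differ.
pd.testing.assert_frame_equal(sample, reloaded)
print("Round trip OK:", reloaded.shape)
```

Categorical dtypes do not survive a CSV round trip, so a check like this is most useful after `pd.get_dummies` has replaced them with plain columns.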

In [55]:
# Create feature engineered CSV file
output_file = 'dataset/Students_Social_Media_Addiction_Feature_Engineered.csv'

# Save the feature engineered dataset
fe.to_csv(output_file, index=False)

print(f"✅ Feature engineered dataset saved to: {output_file}")
print(f"Original shape: {df.shape}")
print(f"Feature engineered shape: {fe.shape}")
print(f"Added {fe.shape[1] - df.shape[1]} new features")

# Show column comparison
print("\n=== COLUMN COMPARISON ===")
print("Original columns:")
original_cols = list(df.columns)
for i, col in enumerate(original_cols, 1):
    print(f"{i:2d}. {col}")

print(f"\nNew columns added ({fe.shape[1] - df.shape[1]} total):")
new_columns = [col for col in fe.columns if col not in df.columns]
for i, col in enumerate(new_columns, 1):
    print(f"{i:2d}. {col}")

# Show sample of key engineered features
print("\n=== SAMPLE OF ENGINEERED FEATURES ===")
key_features = ['Student_ID', 'Addicted_Score', 'Adequate_Sleep', 'SM_Impact_Score', 
               'Lifestyle_Balance', 'Gender_Encoded', 'Academic_Level_Encoded', 'Relationship_Status_Encoded']
sample_data = fe[key_features].head(10)
print(sample_data)

# Show one-hot encoded features
print("\n=== ONE-HOT ENCODED FEATURES ===")
onehot_cols = [col for col in fe.columns if ('Usage_Category_' in col or 'Mental_Health_Risk_' in col)]
print("One-hot encoded columns:")
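for col in onehot_cols:
    # Dummy columns are boolean, so count True values directly instead of
    # indexing value_counts() by 1/0, which fails when a value is absent.
    ones = int(fe[col].sum())
    print(f"  - {col}: {ones} ones, {len(fe) - ones} zeros")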

print(f"\nThe complete feature-engineered dataset is now saved as: {output_file}")
print("You can open this CSV file to see all engineered features!")

✅ Feature engineered dataset saved to: dataset/Students_Social_Media_Addiction_Feature_Engineered.csv
Original shape: (705, 13)
Feature engineered shape: (705, 25)
Added 12 new features

=== COLUMN COMPARISON ===
Original columns:
 1. Student_ID
 2. Age
 3. Gender
 4. Academic_Level
 5. Country
 6. Avg_Daily_Usage_Hours
 7. Most_Used_Platform
 8. Affects_Academic_Performance
 9. Sleep_Hours_Per_Night
10. Mental_Health_Score
11. Relationship_Status
12. Conflicts_Over_Social_Media
13. Addicted_Score

New columns added (12 total):
 1. Adequate_Sleep
 2. Gender_Encoded
 3. Academic_Level_Encoded
 4. Relationship_Status_Encoded
 5. SM_Impact_Score
 6. Lifestyle_Balance
 7. Usage_Category_Medium
 8. Usage_Category_High
 9. Usage_Category_Very High
10. Mental_Health_Risk_Medium
11. Mental_Health_Risk_Low
12. Mental_Health_Risk_Very Low

=== SAMPLE OF ENGINEERED FEATURES ===
   Student_ID  Addicted_Score  Adequate_Sleep  SM_Impact_Score  \
0           1               8               0         

### Feature Engineering Summary

**Key Accomplishments:**
- Expanded the raw dataset from 13 to 25 columns by adding 12 engineered columns
- Produced 17 candidate features for modeling (the 12 engineered columns plus the original numeric ones, excluding the target and non-predictive columns)
- Applied domain knowledge: adequate sleep, usage categories, mental health risk
- Generated interaction features: social media impact score, lifestyle balance
- Label-encoded `Gender`, `Academic_Level`, and `Relationship_Status`, and one-hot encoded the usage and mental health risk categories
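
The count of 17 modeling features can be reproduced from the column lists printed earlier. The sketch below assumes that the "non-predictive columns" are `Student_ID` plus the six raw text columns; that grouping is an inference from the arithmetic, not something the notebook states. `fe_demo` is an empty frame that carries only the column names.

```python
import pandas as pd

original_cols = [
    "Student_ID", "Age", "Gender", "Academic_Level", "Country",
    "Avg_Daily_Usage_Hours", "Most_Used_Platform",
    "Affects_Academic_Performance", "Sleep_Hours_Per_Night",
    "Mental_Health_Score", "Relationship_Status",
    "Conflicts_Over_Social_Media", "Addicted_Score",
]
engineered_cols = [
    "Adequate_Sleep", "Gender_Encoded", "Academic_Level_Encoded",
    "Relationship_Status_Encoded", "SM_Impact_Score", "Lifestyle_Balance",
    "Usage_Category_Medium", "Usage_Category_High", "Usage_Category_Very High",
    "Mental_Health_Risk_Medium", "Mental_Health_Risk_Low",
    "Mental_Health_Risk_Very Low",
]
fe_demo = pd.DataFrame(columns=original_cols + engineered_cols)

target = "Addicted_Score"
# Assumption: the identifier and raw text columns are excluded from modeling.
non_predictive = [
    "Student_ID", "Gender", "Academic_Level", "Country",
    "Most_Used_Platform", "Affects_Academic_Performance", "Relationship_Status",
]

feature_cols = [
    col for col in fe_demo.columns
    if col != target and col not in non_predictive
]
print(len(feature_cols), "features:", feature_cols)
```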

**Next Steps:** Test different feature selection and normalization approaches to optimize model performance.
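
A minimal sketch of a normalization comparison is given below, again on synthetic data with assumed settings (`make_regression`, an 80/20 split, one feature inflated by a factor of 1000). One point worth keeping in mind: an unregularized `LinearRegression` with an intercept gives the same predictions under any per-feature linear rescaling, so the scalers should yield essentially identical R² here. Differences would be expected mainly with regularized or distance-based models.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic data; one feature is inflated to mimic mixed units.
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X[:, 0] *= 1000.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scalers = {
    "None": None,
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
}

scaling_results = {}
for name, scaler in scalers.items():
    # The pipeline fits the scaler on the training split only.
    model = (LinearRegression() if scaler is None
             else make_pipeline(scaler, LinearRegression()))
    model.fit(X_train, y_train)
    scaling_results[name] = r2_score(y_test, model.predict(X_test))

for name, score in scaling_results.items():
    print(f"{name:>15}: R2 = {score:.6f}")
```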