In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

**1. Loading Dataset**

In [21]:
df = pd.read_csv(r"/content/Indian_Kids_Screen_Time.csv")
df.head()

Unnamed: 0,Age,Gender,Avg_Daily_Screen_Time_hr,Primary_Device,Exceeded_Recommended_Limit,Educational_to_Recreational_Ratio,Health_Impacts,Urban_or_Rural
0,14,Male,3.99,Smartphone,True,0.42,"Poor Sleep, Eye Strain",Urban
1,11,Female,4.61,Laptop,True,0.3,Poor Sleep,Urban
2,18,Female,3.73,TV,True,0.32,Poor Sleep,Urban
3,15,Female,1.21,Laptop,False,0.39,,Urban
4,12,Female,5.89,Smartphone,True,0.49,"Poor Sleep, Anxiety",Urban


In [22]:
# Check missing values
print("Missing values per column:\n", df.isnull().sum())


Missing values per column:
 Age                                     0
Gender                                  0
Avg_Daily_Screen_Time_hr                0
Primary_Device                          0
Exceeded_Recommended_Limit              0
Educational_to_Recreational_Ratio       0
Health_Impacts                       3218
Urban_or_Rural                          0
dtype: int64


**2. Handling missing values and inconsistent categories**

In [24]:
# Create Age_Band first
age_bins = [0, 5, 12, 18, 25, 40, 60, 100]
age_labels = ['Child', 'Pre-Teen', 'Teen', 'Young Adult', 'Adult', 'Middle-Aged', 'Senior']
df['Age_Band'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=True)

# Numerical → mean
num_cols = df.select_dtypes(include=['int64','float64']).columns
for col in num_cols:
    # Assign the result back; inplace=True on a column selection may not update df
    df[col] = df[col].fillna(df[col].mean())

# Categorical → rules
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
# Fill with the most frequent value within each Age_Band (not the band label itself)
for col in ['Primary_Device', 'Urban_or_Rural']:
    band_mode = df.groupby('Age_Band', observed=True)[col].transform(
        lambda s: s.mode().iloc[0] if not s.mode().empty else None)
    df[col] = df[col].fillna(band_mode)
df['Health_Impacts'] = df['Health_Impacts'].fillna("Unknown")


# -------------------------
# Verify
# -------------------------
print("\nAfter preprocessing:\n", df.isnull().sum())
df.head()


After preprocessing:
 Age                                  0
Gender                               0
Avg_Daily_Screen_Time_hr             0
Primary_Device                       0
Exceeded_Recommended_Limit           0
Educational_to_Recreational_Ratio    0
Health_Impacts                       0
Urban_or_Rural                       0
Age_Band                             0
dtype: int64


Unnamed: 0,Age,Gender,Avg_Daily_Screen_Time_hr,Primary_Device,Exceeded_Recommended_Limit,Educational_to_Recreational_Ratio,Health_Impacts,Urban_or_Rural,Age_Band
0,14,Male,3.99,Smartphone,True,0.42,"Poor Sleep, Eye Strain",Urban,Teen
1,11,Female,4.61,Laptop,True,0.3,Poor Sleep,Urban,Pre-Teen
2,18,Female,3.73,TV,True,0.32,Poor Sleep,Urban,Teen
3,15,Female,1.21,Laptop,False,0.39,Unknown,Urban,Teen
4,12,Female,5.89,Smartphone,True,0.49,"Poor Sleep, Anxiety",Urban,Pre-Teen



**3. Creating derived fields: age bands, weekday/weekend flags, device/activity shares, formatting any date/time fields**

In [25]:
# Get column names as a list
column_names_list = df.columns.tolist()
print("\nColumn names (list):")
column_names_list



Column names (list):


['Age',
 'Gender',
 'Avg_Daily_Screen_Time_hr',
 'Primary_Device',
 'Exceeded_Recommended_Limit',
 'Educational_to_Recreational_Ratio',
 'Health_Impacts',
 'Urban_or_Rural',
 'Age_Band']

In [26]:
# 3. Device/Activity Shares
# Since only 'Primary_Device' and 'Avg_Daily_Screen_Time_hr' are available,
# we can calculate proportional usage based on device type counts

device_counts = df['Primary_Device'].value_counts(normalize=True)
print("\nDevice usage proportion:\n", device_counts)

# Optional: add a column mapping each device to its overall share (simplified example)
# Devices that are missing or not present in device_counts get a share of 0
df['Device_Share'] = df['Primary_Device'].map(device_counts).fillna(0)

# 4. Educational to Recreational Ratio
# You already have 'Educational_to_Recreational_Ratio'; you can normalize if needed
# Example: convert to percentage
df['Edu_Recreational_Percent'] = df['Educational_to_Recreational_Ratio'] * 100



Device usage proportion:
 Primary_Device
Smartphone    0.470346
TV            0.256075
Laptop        0.147549
Tablet        0.126030
Name: proportion, dtype: float64
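
The column list above contains no date or time field, so weekday/weekend flags cannot be derived from this dataset. As a minimal sketch of how that step could look if timestamps were available, the example below uses made-up dates and a hypothetical column name, `Recorded_At`, that does not exist in the data.

```python
import pandas as pd

# Hypothetical timestamps; `Recorded_At` is NOT a column in the dataset above.
log = pd.DataFrame({"Recorded_At": ["2024-03-01", "2024-03-02", "2024-03-03", "2024-03-04"]})

# Parse the strings into datetimes with an explicit format.
log["Recorded_At"] = pd.to_datetime(log["Recorded_At"], format="%Y-%m-%d")

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 mark the weekend.
log["Is_Weekend"] = log["Recorded_At"].dt.dayofweek >= 5
log["Day_Name"] = log["Recorded_At"].dt.day_name()
print(log)
```
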


**6. Save preprocessed data for reuse; document logic**

In [28]:
# 5. Save preprocessed dataset
df.to_csv("Indian_Kids_Screen_Time_Preprocessed.csv", index=False)
# 6. Verify
df.head()

Unnamed: 0,Age,Gender,Avg_Daily_Screen_Time_hr,Primary_Device,Exceeded_Recommended_Limit,Educational_to_Recreational_Ratio,Health_Impacts,Urban_or_Rural,Age_Band,Device_Share,Edu_Recreational_Percent
0,14,Male,3.99,Smartphone,True,0.42,"Poor Sleep, Eye Strain",Urban,Teen,0.470346,42.0
1,11,Female,4.61,Laptop,True,0.3,Poor Sleep,Urban,Pre-Teen,0.147549,30.0
2,18,Female,3.73,TV,True,0.32,Poor Sleep,Urban,Teen,0.256075,32.0
3,15,Female,1.21,Laptop,False,0.39,Unknown,Urban,Teen,0.147549,39.0
4,12,Female,5.89,Smartphone,True,0.49,"Poor Sleep, Anxiety",Urban,Pre-Teen,0.470346,49.0


**Understandings**


**Loading the dataset:**

The dataset `Indian_Kids_Screen_Time.csv` was loaded using `pd.read_csv()`.

Initial preview (`df.head()`) showed the first 5 rows and confirmed the columns present.

**Handling missing values:**

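- Numerical columns (`Age`, `Avg_Daily_Screen_Time_hr`, etc.) were filled with the column mean; none of them had missing values in this dataset, so the step changed nothing here.
- Categorical columns: `Gender` was filled with its mode, `Primary_Device` and `Urban_or_Rural` had a fill rule based on `Age_Band`, and `Health_Impacts` (3218 missing, the only column with nulls) was filled with `"Unknown"`.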
- After this step, there were no missing values in any column.
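
One way to fill a categorical column with the mode inside each age band is sketched below. The toy frame and the helper name `fill_with_group_mode` are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def fill_with_group_mode(frame, col, by):
    """Fill missing values of `col` with the most frequent value in each `by` group.

    Hypothetical helper for illustration; falls back to the overall mode
    when a group has no observed values.
    """
    def group_mode(s):
        m = s.mode()  # mode() ignores missing values by default
        return m.iloc[0] if not m.empty else None

    per_group = frame.groupby(by, observed=True)[col].transform(group_mode)
    overall = frame[col].mode().iloc[0]
    return frame[col].fillna(per_group).fillna(overall)

# Toy data, not taken from the dataset.
toy = pd.DataFrame({
    "Age_Band": ["Teen", "Teen", "Teen", "Pre-Teen", "Pre-Teen", "Pre-Teen"],
    "Primary_Device": ["Smartphone", "Smartphone", None, "TV", None, "TV"],
})
toy["Primary_Device"] = fill_with_group_mode(toy, "Primary_Device", "Age_Band")
print(toy)
```

Filling from the group mode keeps the imputed value a real device label, whereas filling from `Age_Band` directly would write band names such as `Teen` into the device column.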

**Handling inconsistent categories:**

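- No standardization step appears in the code above, and the category values in the output (`Smartphone`, `TV`, `Laptop`, `Tablet`) are already consistent.
- If variants such as `tv` or ` TV ` did occur, `str.strip()` would remove the extra spaces; `str.title()` alone would turn `TV` into `Tv`, so an explicit mapping to the canonical label is safer.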
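
A minimal sketch of label standardization that keeps acronyms such as `TV` intact is shown below. The messy variants and the `canonical` mapping are assumed for illustration; they do not come from the dataset.

```python
import pandas as pd

# Assumed messy variants, for illustration only.
raw = pd.Series(["tv", " TV ", "Tv", "smartphone", "Laptop ", None])

# Canonical spellings keyed by a lowercase, whitespace-stripped form.
canonical = {"tv": "TV", "smartphone": "Smartphone", "laptop": "Laptop", "tablet": "Tablet"}

key = raw.str.strip().str.lower()
# Map known keys to canonical labels; keep the stripped original for anything unmapped.
clean = key.map(canonical).fillna(raw.str.strip())
print(clean.tolist())
```

Missing values stay missing here, so this step can run before or after imputation.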

**Creating derived fields:**

- **Age Bands (`Age_Band`)**: Binned `Age` with `pd.cut` into Child (0, 5], Pre-Teen (5, 12], and Teen (12, 18], with further labels (Young Adult through Senior) defined for ages above 18. Helps in analyzing trends by age group.
- **Device Usage Proportion (`Device_Share`)**: Calculated the share of kids using each primary device (e.g., Smartphone = 0.47 → 47% of kids). Provides a numeric representation of device popularity for analysis or modeling.
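- **Educational-to-Recreational Ratio (`Edu_Recreational_Percent`)**: Multiplied `Educational_to_Recreational_Ratio` by 100 (e.g., 0.42 → 42.0) for easier reading. The value is still a ratio of the two categories, not a share of total screen time.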
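
The binning and the ratio conversion can be checked on the five preview rows, as sketched below. The second part assumes the ratio means educational hours divided by recreational hours; the source does not state this, and the name `edu_share_pct` is illustrative.

```python
import pandas as pd

ages = pd.Series([14, 11, 18, 15, 12])  # ages from the preview rows
bins = [0, 5, 12, 18, 25, 40, 60, 100]
labels = ["Child", "Pre-Teen", "Teen", "Young Adult", "Adult", "Middle-Aged", "Senior"]

# right=True closes each interval on the right: (5, 12] is Pre-Teen, (12, 18] is Teen.
bands = pd.cut(ages, bins=bins, labels=labels, right=True)
print(bands.tolist())  # ['Teen', 'Pre-Teen', 'Teen', 'Teen', 'Pre-Teen']

# Assumption: ratio = educational hours / recreational hours.
# Under that reading, the educational share of total screen time is r / (1 + r).
ratio = pd.Series([0.42, 0.30, 0.32, 0.39, 0.49])
edu_share_pct = (ratio / (1 + ratio) * 100).round(1)
print(edu_share_pct.tolist())
```

Because the intervals are right-inclusive, ages 12 and 18 fall into Pre-Teen and Teen respectively, which matches the `Age_Band` values in the notebook output.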

**Insights from output:**

- After cleaning, all missing values were filled.
- Device usage distribution shows Smartphone is the most used device (~47%), followed by TV, Laptop, and Tablet.
- In the five-row preview, the assigned age bands are Pre-Teen (ages 11 and 12) and Teen (ages 14, 15, and 18).
- The dataset now contains both original columns and derived features, which allows for flexible analysis and reporting.
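
As a rough illustration of the kind of analysis the derived `Age_Band` column enables, the sketch below summarizes screen time by band. It uses only the five preview rows, so the numbers are not representative of the full dataset.

```python
import pandas as pd

# The five preview rows shown above, reduced to the columns used here.
preview = pd.DataFrame({
    "Age_Band": ["Teen", "Pre-Teen", "Teen", "Teen", "Pre-Teen"],
    "Avg_Daily_Screen_Time_hr": [3.99, 4.61, 3.73, 1.21, 5.89],
    "Exceeded_Recommended_Limit": [True, True, True, False, True],
})

summary = preview.groupby("Age_Band").agg(
    mean_hours=("Avg_Daily_Screen_Time_hr", "mean"),
    share_exceeding=("Exceeded_Recommended_Limit", "mean"),  # mean of booleans = fraction True
    n=("Avg_Daily_Screen_Time_hr", "size"),
)
print(summary.round(2))
```
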

**Saving preprocessed dataset:**

- Cleaned and enhanced dataset was saved as `Indian_Kids_Screen_Time_Preprocessed.csv` for future reuse.
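
One detail worth documenting when reusing the saved file: CSV stores only text, so the categorical dtype of `Age_Band` is lost on reload. The sketch below demonstrates this with a small stand-in frame and an in-memory buffer instead of a file on disk; `df_small` and the shortened bin list are assumptions for illustration.

```python
import io
import pandas as pd

df_small = pd.DataFrame({"Age": [14, 11], "Health_Impacts": ["Unknown", "Poor Sleep"]})
band_labels = ["Child", "Pre-Teen", "Teen"]
df_small["Age_Band"] = pd.cut(df_small["Age"], bins=[0, 5, 12, 18], labels=band_labels)

buffer = io.StringIO()  # in-memory stand-in for the CSV file
df_small.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
dtype_after_reload = reloaded["Age_Band"].dtype
print(df_small["Age_Band"].dtype)  # category
print(dtype_after_reload)          # no longer categorical after the round trip

# Restore the ordered categorical explicitly after loading.
reloaded["Age_Band"] = pd.Categorical(
    reloaded["Age_Band"], categories=band_labels, ordered=True
)
```
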

**Overall understanding:**

- Preprocessing ensures data consistency, completeness, and readiness for analysis or modeling.
- Derived features like `Age_Band`, `Device_Share`, and `Edu_Recreational_Percent` make it easier to analyze trends, create visualizations, and perform predictive modeling.

