Before building any model, we must clearly understand:

1. What are the input features (X)?

2. What are the target variables (y)?

3. How are they aligned?

4. What is the prediction task type (regression or classification)?

This dataset is separated into two files:

smartphone_battery_features.csv → engineered input features

smartphone_battery_targets.csv → ground truth battery health and replacement recommendation

Since features and targets are stored separately, the first engineering task is data alignment.

We need to:

- Inspect shapes of both datasets

- Check whether there is a common identifier column

- Verify whether row ordering matches

- Identify the target column(s)

- Determine the prediction problem type

In [1]:
import pandas as pd

# Load the feature table (X) and the target table (y)
X = pd.read_csv("../data/raw/smartphone_battery_features.csv")
y = pd.read_csv("../data/raw/smartphone_battery_targets.csv")

print("X shape:", X.shape)
print("y shape:", y.shape)

print("\nX columns:")
print(list(X.columns))

print("\ny columns:")
print(list(y.columns))


X shape: (5000, 15)
y shape: (5000, 3)

X columns:
['Device_ID', 'device_age_months', 'battery_capacity_mah', 'avg_screen_on_hours_per_day', 'avg_charging_cycles_per_week', 'avg_battery_temp_celsius', 'fast_charging_usage_percent', 'overnight_charging_freq_per_week', 'gaming_hours_per_week', 'video_streaming_hours_per_week', 'background_app_usage_level', 'signal_strength_avg', 'charging_habit_score', 'usage_intensity_score', 'thermal_stress_index']

y columns:
['Device_ID', 'current_battery_health_percent', 'recommended_action']


In [None]:
print("\nX head:")
print(X.head())


X head:
                              Device_ID  device_age_months  \
0  207dd94c-0430-43aa-b388-4893447e628e                 38   
1  3f4d1d33-ba89-4814-a168-7b4cc75be26b                 28   
2  b4adca05-564f-4b70-ab69-e8d66e656463                 14   
3  4147e039-31b7-480a-bbc9-03cd0f66e9f1                 42   
4  3f9b0fb7-73c2-4ab7-8e30-7b492097a3f5                  7   

   battery_capacity_mah  avg_screen_on_hours_per_day  \
0                  4500                          7.1   
1                  3000                          6.8   
2                  3000                          7.2   
3                  3000                          5.5   
4                  3000                          7.6   

   avg_charging_cycles_per_week  avg_battery_temp_celsius  \
0                          11.4                      34.8   
1                          10.3                      35.4   
2                          11.2                      29.4   
3                           8.3      

In [None]:
print("\ny head:")
print(y.head())

In [None]:
print("\nX info:")
X.info()


X info:
<class 'pandas.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Device_ID                         5000 non-null   str    
 1   device_age_months                 5000 non-null   int64  
 2   battery_capacity_mah              5000 non-null   int64  
 3   avg_screen_on_hours_per_day       5000 non-null   float64
 4   avg_charging_cycles_per_week      5000 non-null   float64
 5   avg_battery_temp_celsius          5000 non-null   float64
 6   fast_charging_usage_percent       5000 non-null   float64
 7   overnight_charging_freq_per_week  5000 non-null   int64  
 8   gaming_hours_per_week             5000 non-null   float64
 9   video_streaming_hours_per_week    5000 non-null   float64
 10  background_app_usage_level        5000 non-null   str    
 11  signal_strength_avg               5000 non-null   str    
 12  charging

In [None]:
print("\ny info:")
y.info()


y info:
<class 'pandas.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Device_ID                       5000 non-null   str    
 1   current_battery_health_percent  5000 non-null   float64
 2   recommended_action              5000 non-null   str    
dtypes: float64(1), str(2)
memory usage: 117.3 KB


Dataset Overview

We are given two separate files:

smartphone_battery_features.csv (X)

smartphone_battery_targets.csv (y)

Shape Check

X shape: (5000, 15)

y shape: (5000, 3)

Both datasets contain 5000 rows. This is consistent with row-level alignment, but equal row counts alone do not prove that the rows appear in the same order.

Key Observations
1. Identifier Column

Both X and y contain a common column:

Device_ID

This suggests we should merge using Device_ID instead of assuming row ordering alignment.
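
The identifier check can be sketched as follows. This is a minimal illustration on toy stand-in frames (`X_demo` and `y_demo` are hypothetical names with made-up IDs), not the notebook's actual verification; the same expressions should apply to the real X and y.

```python
import pandas as pd

# Toy stand-ins for X and y (hypothetical values, not the real CSV contents).
X_demo = pd.DataFrame({
    "Device_ID": ["a", "b", "c"],
    "device_age_months": [38, 28, 14],
})
y_demo = pd.DataFrame({
    "Device_ID": ["c", "a", "b"],  # same IDs, different row order
    "current_battery_health_percent": [66.1, 32.8, 50.3],
})

# 1. Columns present in both frames are candidate join keys.
common_cols = sorted(set(X_demo.columns) & set(y_demo.columns))

# 2. A join key should be unique within each frame.
ids_unique = X_demo["Device_ID"].is_unique and y_demo["Device_ID"].is_unique

# 3. Both frames should cover the same set of IDs.
same_id_set = set(X_demo["Device_ID"]) == set(y_demo["Device_ID"])

# 4. Row order matches only if the ID columns agree position by position.
same_row_order = X_demo["Device_ID"].tolist() == y_demo["Device_ID"].tolist()

print("common columns:", common_cols)
print("IDs unique:", ids_unique)
print("same ID set:", same_id_set)
print("same row order:", same_row_order)
```

In this toy case the ID sets match while the row order does not, which is exactly the situation where concatenating by position would silently misalign features and targets.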

2. Feature Types (X)

Numerical features:

int64: 4

float64: 8

String columns (3):

Device_ID → identifier, used as the join key rather than as a model input

background_app_usage_level → categorical feature

signal_strength_avg → categorical feature

No missing values detected.
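
The type breakdown above can also be derived programmatically instead of being read off the `info()` output. The sketch below assumes a small stand-in frame (`X_demo`, a hypothetical name) built from three rows and five columns of the data shown earlier; `id_col` is likewise an illustrative variable.

```python
import pandas as pd

# Stand-in built from the first three rows shown above (subset of columns).
X_demo = pd.DataFrame({
    "Device_ID": [
        "207dd94c-0430-43aa-b388-4893447e628e",
        "3f4d1d33-ba89-4814-a168-7b4cc75be26b",
        "b4adca05-564f-4b70-ab69-e8d66e656463",
    ],
    "device_age_months": [38, 28, 14],
    "avg_screen_on_hours_per_day": [7.1, 6.8, 7.2],
    "background_app_usage_level": ["Medium", "Medium", "Medium"],
    "signal_strength_avg": ["Poor", "Good", "Good"],
})

# Set the identifier aside so it is not counted as a feature.
id_col = "Device_ID"
features = X_demo.drop(columns=[id_col])

# Split the remaining columns by dtype family.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column.
missing_per_column = X_demo.isna().sum()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("total missing:", int(missing_per_column.sum()))
```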

3. Target Variables (y)

Columns:

current_battery_health_percent → Continuous variable (float64)

recommended_action → Categorical variable (string)

This indicates a multi-task prediction setup:

Regression task → battery health %

Classification task → recommended action
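
One way to make the two tasks explicit is to split the targets into separate series and inspect each. The sketch below uses only the five target rows visible in this notebook (`y_demo`, `y_reg`, and `y_clf` are hypothetical names); the full dataset may contain additional `recommended_action` classes that do not appear in these rows.

```python
import pandas as pd

# The five target rows visible in the notebook output.
y_demo = pd.DataFrame({
    "current_battery_health_percent": [32.8, 50.3, 66.1, 46.8, 67.2],
    "recommended_action": [
        "Change Phone",
        "Replace Battery",
        "Replace Battery",
        "Change Phone",
        "Replace Battery",
    ],
})

# Regression target: continuous battery health percentage.
y_reg = y_demo["current_battery_health_percent"]

# Classification target: discrete action labels.
y_clf = y_demo["recommended_action"].astype("category")

# A float dtype supports treating the first target as a regression problem.
is_regression_target = pd.api.types.is_float_dtype(y_reg)

# Class counts show how many labels exist and how balanced they are.
class_counts = y_clf.value_counts()

print("float target:", is_regression_target)
print(class_counts)
```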

Initial Conclusion

We are dealing with a supervised multi-task learning problem.

Next step:
Merge X and y using Device_ID and construct a unified dataset.

In [None]:
# Inner join on the shared identifier
df = pd.merge(X, y, on="Device_ID", how="inner")

print(df.shape)
df.head()

Unnamed: 0,Device_ID,device_age_months,battery_capacity_mah,avg_screen_on_hours_per_day,avg_charging_cycles_per_week,avg_battery_temp_celsius,fast_charging_usage_percent,overnight_charging_freq_per_week,gaming_hours_per_week,video_streaming_hours_per_week,background_app_usage_level,signal_strength_avg,charging_habit_score,usage_intensity_score,thermal_stress_index,current_battery_health_percent,recommended_action
0,207dd94c-0430-43aa-b388-4893447e628e,38,4500,7.1,11.4,34.8,90.8,7,7.9,14.0,Medium,Poor,4,10.0,4.04,32.8,Change Phone
1,3f4d1d33-ba89-4814-a168-7b4cc75be26b,28,3000,6.8,10.3,35.4,60.6,2,8.6,11.0,Medium,Good,7,10.0,4.23,50.3,Replace Battery
2,b4adca05-564f-4b70-ab69-e8d66e656463,14,3000,7.2,11.2,29.4,29.3,4,0.3,10.3,Medium,Good,6,10.0,2.21,66.1,Replace Battery
3,4147e039-31b7-480a-bbc9-03cd0f66e9f1,42,3000,5.5,8.3,32.8,62.5,0,1.9,4.9,Medium,Good,8,10.0,3.13,46.8,Change Phone
4,3f9b0fb7-73c2-4ab7-8e30-7b492097a3f5,7,3000,7.6,11.6,38.7,85.4,6,7.9,9.3,High,Good,5,10.0,4.95,67.2,Replace Battery
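
An inner join silently drops any device that appears in only one file, so it can be worth auditing the merge first. The sketch below is one possible approach on toy frames (all names and IDs are hypothetical): `validate="one_to_one"` makes pandas raise an error if `Device_ID` is duplicated on either side, and `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Toy stand-ins with one unmatched ID on each side (hypothetical values).
X_demo = pd.DataFrame({
    "Device_ID": ["a", "b", "c"],
    "device_age_months": [38, 28, 14],
})
y_demo = pd.DataFrame({
    "Device_ID": ["c", "a", "d"],
    "current_battery_health_percent": [66.1, 32.8, 50.3],
})

# Outer join keeps unmatched rows so they can be counted.
audit = pd.merge(
    X_demo,
    y_demo,
    on="Device_ID",
    how="outer",
    validate="one_to_one",  # raises MergeError on duplicate keys
    indicator=True,         # adds the _merge column
)
merge_counts = audit["_merge"].value_counts()

# Keep only rows present in both frames, equivalent to an inner join.
df_demo = audit.loc[audit["_merge"] == "both"].drop(columns="_merge")

print(merge_counts)
print(df_demo.shape)
```

If every row is labelled `both`, the inner join loses nothing and the merged frame has the same number of rows as each input.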


In [None]:
import sys

# Record the interpreter version and path for reproducibility
print(sys.version)
print(sys.executable)

