In [2]:
import pandas as pd

df = pd.read_csv('/content/iot_sensor.csv')
print("DataFrame loaded successfully. Displaying the first 5 rows:")
print(df.head())

DataFrame loaded successfully. Displaying the first 5 rows:
             timestamp sensor_id  temperature  humidity
0  2025-02-01 00:00:00        S2         24.0      40.0
1  2025-02-01 01:00:00        S3         30.0       NaN
2  2025-02-01 02:00:00        S1         24.0      50.0
3  2025-02-01 03:00:00        S2         24.0       NaN
4  2025-02-01 04:00:00        S3         23.0      42.0
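
Without `parse_dates`, `read_csv` leaves the 'timestamp' column as text. The sketch below shows one way to parse it at load time and to count missing values per column. It uses an inline copy of the five rows printed above rather than the real file, and the names `sample_csv`, `sample`, and `missing_per_column` are illustrative only.

```python
import io

import pandas as pd

# Inline stand-in for the first five rows of iot_sensor.csv shown above.
sample_csv = io.StringIO(
    "timestamp,sensor_id,temperature,humidity\n"
    "2025-02-01 00:00:00,S2,24.0,40.0\n"
    "2025-02-01 01:00:00,S3,30.0,\n"
    "2025-02-01 02:00:00,S1,24.0,50.0\n"
    "2025-02-01 03:00:00,S2,24.0,\n"
    "2025-02-01 04:00:00,S3,23.0,42.0\n"
)

# parse_dates converts the column to datetime64 instead of leaving it as text.
sample = pd.read_csv(sample_csv, parse_dates=["timestamp"])

# Count missing values per column before deciding how to fill them.
missing_per_column = sample.isna().sum()
print(sample.dtypes)
print(missing_per_column)
```

Knowing which columns have gaps, and how many, helps in judging whether forward fill is a reasonable strategy for the data.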


## Handle Missing Values

### Subtask:
Apply forward fill (`ffill`) to handle any missing values in the dataset.


**Reasoning**:
I will apply forward fill to handle missing values in the DataFrame and then display the first few rows to verify the changes.



In [3]:
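# Forward fill runs down the rows in file order, so a gap is filled from the
# previous row even when that row belongs to a different sensor.
df = df.ffill()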
print("DataFrame after forward fill. Displaying the first 5 rows:")
print(df.head())

DataFrame after forward fill. Displaying the first 5 rows:
             timestamp sensor_id  temperature  humidity
0  2025-02-01 00:00:00        S2         24.0      40.0
1  2025-02-01 01:00:00        S3         30.0      40.0
2  2025-02-01 02:00:00        S1         24.0      50.0
3  2025-02-01 03:00:00        S2         24.0      50.0
4  2025-02-01 04:00:00        S3         23.0      42.0
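
In the output above, row 1 (sensor S3) receives the humidity value 40.0 from row 0 (sensor S2), because a plain `ffill` ignores which sensor a row belongs to. If each sensor should only inherit its own previous reading, a group-wise forward fill is one option. This is a minimal sketch on the five rows shown earlier, not the method used in this notebook. The names `sample`, `value_cols`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# The first five rows of the raw data, as printed earlier.
sample = pd.DataFrame({
    "sensor_id": ["S2", "S3", "S1", "S2", "S3"],
    "temperature": [24.0, 30.0, 24.0, 24.0, 23.0],
    "humidity": [40.0, np.nan, 50.0, np.nan, 42.0],
})

value_cols = ["temperature", "humidity"]

# Forward fill within each sensor, so a gap is only filled from that
# sensor's own earlier reading.
filled = sample.copy()
filled[value_cols] = sample.groupby("sensor_id")[value_cols].ffill()
print(filled)
```

With this approach, row 3 (S2) takes 40.0 from the earlier S2 row, while row 1 (S3) stays missing because S3 has no earlier reading. Such leading gaps would need a separate fallback, for example a backward fill or dropping the row.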


## Remove Sensor Drift

### Subtask:
Calculate a rolling mean for the sensor reading columns ('temperature' and 'humidity') and subtract it from the original readings to remove drift. A window of 5 or 10 rows is a reasonable starting point.


**Reasoning**:
To remove sensor drift, I will calculate a rolling mean for both 'temperature' and 'humidity' columns with a window size of 10 and then subtract these rolling means from their respective original columns. Finally, I will display the first 5 rows of the modified DataFrame.



In [4]:
window_size = 10

# Subtract a trailing rolling mean from each reading column. With min_periods=1
# the first rows are averaged over fewer than window_size values, so row 0
# always becomes 0.
for column in ['temperature', 'humidity']:
    rolling_mean = df[column].rolling(window=window_size, min_periods=1).mean()
    df[column] = df[column] - rolling_mean

print(f"DataFrame after removing sensor drift with a rolling mean window of {window_size}. Displaying the first 5 rows:")
print(df.head())

DataFrame after removing sensor drift with a rolling mean window of 10. Displaying the first 5 rows:
             timestamp sensor_id  temperature  humidity
0  2025-02-01 00:00:00        S2          0.0  0.000000
1  2025-02-01 01:00:00        S3          3.0  0.000000
2  2025-02-01 02:00:00        S1         -2.0  6.666667
3  2025-02-01 03:00:00        S2         -1.5  5.000000
4  2025-02-01 04:00:00        S3         -2.0 -2.400000
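
The values above can be checked by hand. The sketch below reruns the same rolling-mean subtraction on the first five temperature readings only. It reproduces the printed column because, with `min_periods=1` and a window of 10, the first rows are averaged over only the readings seen so far. The variable names are illustrative.

```python
import pandas as pd

# First five temperature readings after forward fill, as printed earlier.
temperature = pd.Series([24.0, 30.0, 24.0, 24.0, 23.0])

# Trailing mean over up to 10 rows; early rows use the values available so far.
rolling_mean = temperature.rolling(window=10, min_periods=1).mean()
residual = temperature - rolling_mean

print(pd.DataFrame({
    "temperature": temperature,
    "rolling_mean": rolling_mean,
    "residual": residual,
}))
```

The first residual is always 0 because the first rolling mean is the reading itself, so the earliest rows carry little information about drift.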


## Normalize Numerical Sensor Readings

### Subtask:
Normalize the numerical sensor readings ('temperature', 'humidity') using standard scaling.

**Reasoning**:
I need to normalize the numerical sensor readings ('temperature', 'humidity') using standard scaling. This requires importing StandardScaler from sklearn.preprocessing, fitting it to the relevant columns, and then transforming the data.



In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['temperature', 'humidity']] = scaler.fit_transform(df[['temperature', 'humidity']])

print("DataFrame after standard scaling numerical features. Displaying the first 5 rows:")
print(df.head())

DataFrame after standard scaling numerical features. Displaying the first 5 rows:
             timestamp sensor_id  temperature  humidity
0  2025-02-01 00:00:00        S2     0.018389 -0.011347
1  2025-02-01 01:00:00        S3     1.176150 -0.011347
2  2025-02-01 02:00:00        S1    -0.753451  1.762878
3  2025-02-01 03:00:00        S2    -0.560491  1.319322
4  2025-02-01 04:00:00        S3    -0.753451 -0.650067
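
`StandardScaler` centers each column on its mean and divides by the population standard deviation (`ddof=0`), whereas pandas `std()` defaults to `ddof=1`. The sketch below applies the same formula with pandas alone to the five drift-corrected rows shown earlier. Because it uses five rows rather than all 50, its numbers will not match the output above. The names `residuals` and `scaled` are illustrative.

```python
import pandas as pd

# Drift-corrected values for the first five rows, as printed earlier.
residuals = pd.DataFrame({
    "temperature": [0.0, 3.0, -2.0, -1.5, -2.0],
    "humidity": [0.0, 0.0, 6.666667, 5.0, -2.4],
})

# z = (x - mean) / std, using the population standard deviation (ddof=0).
scaled = (residuals - residuals.mean()) / residuals.std(ddof=0)
print(scaled.round(3))
```

After scaling, each column has mean 0 and population standard deviation 1, which keeps features with different units on a comparable scale.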


## Encode Categorical Sensor IDs

### Subtask:
Encode categorical sensor IDs using an appropriate method (e.g., One-Hot Encoding or Label Encoding).

**Reasoning**:
To encode the categorical 'sensor_id' column, One-Hot Encoding is a suitable method as it avoids implying any ordinal relationship between the sensor IDs. I will use `pd.get_dummies` for this purpose and then display the head of the DataFrame to show the new encoded columns.



In [6]:
df = pd.get_dummies(df, columns=['sensor_id'], prefix='sensor')

print("DataFrame after One-Hot Encoding sensor_id. Displaying the first 5 rows:")
print(df.head())

DataFrame after One-Hot Encoding sensor_id. Displaying the first 5 rows:
             timestamp  temperature  humidity  sensor_S1  sensor_S2  sensor_S3
0  2025-02-01 00:00:00     0.018389 -0.011347      False       True      False
1  2025-02-01 01:00:00     1.176150 -0.011347      False      False       True
2  2025-02-01 02:00:00    -0.753451  1.762878       True      False      False
3  2025-02-01 03:00:00    -0.560491  1.319322      False       True      False
4  2025-02-01 04:00:00    -0.753451 -0.650067      False      False       True
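
The encoded columns above are boolean. If integer 0/1 columns are preferred downstream, `pd.get_dummies` accepts a `dtype` argument. A minimal sketch on the five sensor IDs shown earlier, with `sample` and `encoded` as illustrative names:

```python
import pandas as pd

# Sensor IDs of the first five rows, as printed earlier.
sample = pd.DataFrame({"sensor_id": ["S2", "S3", "S1", "S2", "S3"]})

# dtype=int yields 0/1 indicator columns instead of True/False.
encoded = pd.get_dummies(sample, columns=["sensor_id"], prefix="sensor", dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, which is the defining property of one-hot encoding.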


## Summarize Processed Data

### Subtask:
Summarize the shape and first few rows of the processed dataset, confirming its readiness for anomaly detection.

**Reasoning**:
To confirm the dataset's readiness for anomaly detection, I will display the shape of the DataFrame and its first 5 rows to provide a summary of the processed data.



In [7]:
print("Shape of the processed DataFrame:", df.shape)
print("\nFirst 5 rows of the processed DataFrame:")
print(df.head())

Shape of the processed DataFrame: (50, 6)

First 5 rows of the processed DataFrame:
             timestamp  temperature  humidity  sensor_S1  sensor_S2  sensor_S3
0  2025-02-01 00:00:00     0.018389 -0.011347      False       True      False
1  2025-02-01 01:00:00     1.176150 -0.011347      False      False       True
2  2025-02-01 02:00:00    -0.753451  1.762878       True      False      False
3  2025-02-01 03:00:00    -0.560491  1.319322      False       True      False
4  2025-02-01 04:00:00    -0.753451 -0.650067      False      False       True
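
Shape and `head()` alone do not reveal leftover gaps or non-numeric columns. The sketch below shows one possible readiness check. The helper `readiness_report` is hypothetical, not part of pandas, and the `processed` frame copies only the first two rows printed above.

```python
import pandas as pd

# First two processed rows, as printed above.
processed = pd.DataFrame({
    "timestamp": ["2025-02-01 00:00:00", "2025-02-01 01:00:00"],
    "temperature": [0.018389, 1.176150],
    "humidity": [-0.011347, -0.011347],
    "sensor_S1": [False, False],
    "sensor_S2": [True, False],
    "sensor_S3": [False, True],
})


def readiness_report(frame: pd.DataFrame) -> dict:
    """Summarize shape, remaining gaps, and columns a model cannot use as-is."""
    non_numeric = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    return {
        "shape": frame.shape,
        "missing_values": int(frame.isna().sum().sum()),
        "non_numeric_columns": non_numeric,
    }


report = readiness_report(processed)
print(report)
```

Here the report lists 'timestamp' as non-numeric, a reminder that it would need to be dropped or converted before model fitting.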


## Summary:

### Data Analysis Key Findings
*   The `iot_sensor.csv` dataset was successfully loaded into a pandas DataFrame.
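*   Missing values in the dataset were handled by applying a forward fill strategy. The fill runs in row order across all sensors, so a gap can be filled from a different sensor's reading (row 1, sensor S3, received humidity 40.0 from row 0, sensor S2).
*   Sensor drift was reduced in the 'temperature' and 'humidity' readings by subtracting a trailing rolling mean with a window size of 10 and `min_periods=1`, computed over all rows in file order.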
*   The numerical 'temperature' and 'humidity' features were normalized using `StandardScaler`.
*   The categorical 'sensor_id' column was successfully One-Hot Encoded, creating three new boolean columns: 'sensor_S1', 'sensor_S2', and 'sensor_S3'.
*   The final processed DataFrame has a shape of (50, 6), indicating 50 rows and 6 columns, comprising the timestamp, two normalized sensor readings, and three one-hot encoded sensor ID columns.

### Insights or Next Steps
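*   The scaled sensor readings and one-hot encoded sensor columns are ready for an anomaly detection algorithm. The 'timestamp' column is still text and would need to be dropped or converted to numeric features before fitting a model.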
*   The next step would involve selecting and implementing an anomaly detection model, such as Isolation Forest or One-Class SVM, to identify unusual patterns in the processed sensor data.
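
As a rough illustration of the suggested next step, the sketch below fits scikit-learn's `IsolationForest` to synthetic data. The random `features` frame stands in for the processed readings, and `contamination=0.05` is an arbitrary assumption, not a value derived from this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the 50 scaled readings; not the real sensor data.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "temperature": rng.normal(size=50),
    "humidity": rng.normal(size=50),
})

# contamination is the assumed share of anomalies; 0.05 is a placeholder.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)

# fit_predict returns 1 for inliers and -1 for anomalies.
labels = model.fit_predict(features)
is_anomaly = labels == -1
print(int(is_anomaly.sum()), "rows flagged as anomalies")
```

On the real data, the feature matrix would be the processed DataFrame without the 'timestamp' column, and the contamination value would need to be chosen from domain knowledge or validation.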
