# Smart Factory Energy Prediction Challenge

## Problem Overview
The goal is to develop a machine learning model that accurately predicts the energy consumption of industrial equipment (`equipment_energy_consumption`) from sensor data collected in a manufacturing facility. Accurate predictions can help the facility optimize operations for energy efficiency and cost reduction.

## Objectives
1. Analyze sensor data to identify patterns and relationships.
2. Build a robust regression model.
3. Evaluate model performance using RMSE, MAE, and R².
4. Provide actionable insights and recommendations.
5. Determine the utility of `random_variable1` and `random_variable2`.
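
For reference, the three metrics named in objective 3 can be computed as in the sketch below. This is a minimal NumPy illustration rather than the evaluation code used later in the notebook: the function name `regression_metrics` and the sample values are hypothetical, and `sklearn.metrics` offers equivalent functions.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for two equal-length sequences (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = float(np.sqrt(np.mean(residuals ** 2)))  # root mean squared error
    mae = float(np.mean(np.abs(residuals)))         # mean absolute error

    ss_res = float(np.sum(residuals ** 2))                  # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot  # assumes y_true is not constant (ss_tot > 0)

    return {"rmse": rmse, "mae": mae, "r2": r2}


# Made-up values for demonstration only (not taken from the dataset)
metrics = regression_metrics([60.0, 60.0, 50.0, 50.0], [58.0, 62.0, 52.0, 48.0])
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a rough sense of whether a few large misses dominate the error.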

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings

# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import lightgbm as lgb

# Settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x) # Format floats to 3 decimal places
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Load and Inspect Data

In [2]:
# Load the dataset
try:
    df = pd.read_csv('data/data.csv')
except FileNotFoundError:
    print("Error: 'data/data.csv' not found. Please ensure the data file is in the correct path.")
    # Re-raise so execution stops here without shutting down the notebook kernel
    raise


print("Dataset Shape:", df.shape)
print("\nFirst 5 rows of the dataset:")
display(df.head())

print("\nDataset Info:")
df.info()

print("\nSummary Statistics:")
display(df.describe(include='all'))

Dataset Shape: (16857, 29)

First 5 rows of the dataset:


Unnamed: 0,timestamp,equipment_energy_consumption,lighting_energy,zone1_temperature,zone1_humidity,zone2_temperature,zone2_humidity,zone3_temperature,zone3_humidity,zone4_temperature,zone4_humidity,zone5_temperature,zone5_humidity,zone6_temperature,zone6_humidity,zone7_temperature,zone7_humidity,zone8_temperature,zone8_humidity,zone9_temperature,zone9_humidity,outdoor_temperature,atmospheric_pressure,outdoor_humidity,wind_speed,visibility_index,dew_point,random_variable1,random_variable2
0,2016-01-11 17:00:00,60.0,-77.78778596503064,33.74660933896648,47.59666666666671,19.2,44.79,19.79,,19.0,45.567,17.167,55.2,,84.257,17.2,41.627,18.2,48.9,17.033,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275,13.275
1,2016-01-11 17:10:00,60.0,30.0,19.89,46.69333333333329,19.2,44.722,19.79,44.79,19.0,45.992,17.167,55.2,6.833,84.063,17.2,,18.2,48.863,17.067,45.56,6.483,733.6,92.0,6.667,59.167,5.2,18.606,18.606
2,2016-01-11 17:20:00,50.0,30.0,19.89,46.3,19.2,44.627,19.79,44.933,35.921,45.89,,55.09,6.56,83.157,17.2,41.433,18.2,48.73,17.0,45.5,6.367,733.7,92.0,6.333,55.333,5.1,28.643,28.643
3,2016-01-11 17:30:00,50.0,40.0,33.74660933896648,46.0666666666667,19.2,44.59,19.79,45.0,,45.723,17.167,55.09,6.433,83.423,17.133,41.29,18.1,94.386,17.0,45.4,6.25,733.8,92.0,6.0,51.5,37.674,45.41,45.41
4,2016-01-11 17:40:00,60.0,40.0,19.89,46.33333333333329,19.2,44.53,19.79,45.0,18.89,45.53,17.2,55.09,6.367,84.893,17.2,41.23,18.1,48.59,4.477,45.4,6.133,733.9,92.0,5.667,47.667,4.9,10.084,10.084



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16857 entries, 0 to 16856
Data columns (total 29 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   timestamp                     16857 non-null  object 
 1   equipment_energy_consumption  16013 non-null  object 
 2   lighting_energy               16048 non-null  object 
 3   zone1_temperature             15990 non-null  object 
 4   zone1_humidity                16056 non-null  object 
 5   zone2_temperature             16004 non-null  object 
 6   zone2_humidity                15990 non-null  float64
 7   zone3_temperature             16055 non-null  float64
 8   zone3_humidity                15979 non-null  float64
 9   zone4_temperature             16041 non-null  float64
 10  zone4_humidity                16076 non-null  float64
 11  zone5_temperature             16019 non-null  float64
 12  zone5_humidity                16056 non-null 

Unnamed: 0,timestamp,equipment_energy_consumption,lighting_energy,zone1_temperature,zone1_humidity,zone2_temperature,zone2_humidity,zone3_temperature,zone3_humidity,zone4_temperature,zone4_humidity,zone5_temperature,zone5_humidity,zone6_temperature,zone6_humidity,zone7_temperature,zone7_humidity,zone8_temperature,zone8_humidity,zone9_temperature,zone9_humidity,outdoor_temperature,atmospheric_pressure,outdoor_humidity,wind_speed,visibility_index,dew_point,random_variable1,random_variable2
count,16857,16013.0,16048.0,15990.0,16056.0,16004.0,15990.0,16055.0,15979.0,16041.0,16076.0,16019.0,16056.0,16009.0,16010.0,16063.0,16052.0,16009.0,16080.0,16084.0,15969.0,16051.0,16015.0,16058.0,16029.0,16042.0,16031.0,16031.0,16033.0
unique,16769,130.0,20.0,433.0,2172.0,838.0,,,,,,,,,,,,,,,,,,,,,,,
top,2016-01-25 21:50:00,50.0,0.0,21.0,3.348059697903068,19.2,,,,,,,,,,,,,,,,,,,,,,,
freq,2,3400.0,11687.0,479.0,164.0,303.0,,,,,,,,,,,,,,,,,,,,,,,
mean,,,,,,,39.495,21.666,38.201,20.24,37.946,19.053,50.289,6.47,59.163,19.672,34.033,21.606,41.854,18.851,40.318,6.219,755.758,78.978,4.196,38.457,2.784,24.855,25.094
std,,,,,,,10.13,2.594,10.144,2.783,10.77,2.346,18.723,8.868,52.658,2.88,11.345,2.975,12.302,2.529,11.169,7.555,13.644,28.566,4.41,21.319,6.095,26.215,25.524
min,,,,,,,-77.266,6.544,-71.406,4.613,-81.446,5.921,-141.64,-42.987,-353.393,3.578,-84.883,4.502,-94.386,4.477,-81.582,-37.525,678.16,-221.669,-20.93,-82.33,-32.098,-120.17,-120.41
25%,,,,,,,37.758,20.533,36.593,19.267,35.2,18.061,45.29,2.93,37.067,18.5,31.0,20.5,38.627,17.89,38.23,3.0,750.8,71.0,2.0,29.0,0.45,12.18,12.194
50%,,,,,,,40.293,21.767,38.4,20.29,38.09,19.05,48.854,6.263,62.767,19.6,34.23,21.79,42.04,18.89,40.363,6.0,756.2,84.167,4.0,40.0,2.75,24.867,24.834
75%,,,,,,,43.0,22.76,41.433,21.357,41.561,20.1,53.918,9.69,86.59,21.0,38.157,22.79,46.004,20.2,43.79,9.25,762.1,91.988,6.0,40.0,5.308,37.95,37.972


### Initial Observations:
- The dataset has 16857 rows and 29 columns.
- `timestamp` is stored as `object` and needs conversion to datetime. Only 16769 of its 16857 values are unique, so some timestamps are duplicated.
- Several columns, including the target `equipment_energy_consumption` and sensor readings such as `lighting_energy`, `zone1_temperature`, `zone1_humidity`, and `zone2_temperature`, are read as `object`. They likely contain numeric data mixed with non-numeric strings and need to be converted to numeric.
- Every column except `timestamp` has missing values, roughly 770 to 890 per column (about 5% of rows).
- Some features have implausible ranges that suggest outliers or data entry errors. For example, the minimum is -353.393 for `zone6_humidity`, -221.669 for `outdoor_humidity`, and -20.93 for `wind_speed`.
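
The type and quality issues listed above could be checked along the following lines. This is a minimal sketch under stated assumptions, not necessarily the cleaning applied later in this notebook: the helper name `coerce_sensor_types` is hypothetical, the small `demo` frame is made up to mimic the observed problems (the `"error"` string is a placeholder, since the actual non-numeric entries are not shown here), and unparseable entries are assumed to be treated as missing.

```python
import pandas as pd


def coerce_sensor_types(frame: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
    """Return a copy with `time_col` parsed as datetime and every other column numeric.

    Assumption: values that cannot be parsed are treated as missing (NaT/NaN).
    """
    out = frame.copy()
    out[time_col] = pd.to_datetime(out[time_col], errors="coerce")
    for col in out.columns.drop(time_col):
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Hypothetical demo data mimicking the issues observed above (not read from data.csv)
demo = pd.DataFrame({
    "timestamp": ["2016-01-11 17:00:00", "2016-01-11 17:10:00", "2016-01-11 17:10:00"],
    "equipment_energy_consumption": ["60.0", "error", "50.0"],
    "zone6_humidity": [84.257, -353.393, None],
})

clean = coerce_sensor_types(demo)

# Quick checks corresponding to the observations above
missing_per_column = clean.isna().sum()
duplicate_timestamps = int(clean["timestamp"].duplicated().sum())
negative_humidity = int((clean["zone6_humidity"] < 0).sum())

print(clean.dtypes)
print(missing_per_column)
print("Duplicate timestamps:", duplicate_timestamps)
print("Negative humidity readings:", negative_humidity)
```

Coercing with `errors="coerce"` turns non-numeric strings into `NaN`, so the missing-value counts should be recomputed after conversion rather than taken from `df.info()` on the raw frame.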