# Import Dependencies

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [3]:
# Load data
data = pd.read_csv('/Users/ridwanmahenra/Documents/skilvul/Data/appliance_data.csv')

In [9]:
df = pd.DataFrame(data)
df

Unnamed: 0,Voltage (V),Ampere (A),Timestamp,Device ID
0,4.5,0.36,2024-03-01 00:00:00,TV
1,4.5,0.05,2024-03-01 00:05:00,TV
2,4.5,0.05,2024-03-01 00:10:00,TV
3,4.5,0.49,2024-03-01 00:15:00,TV
4,4.5,0.47,2024-03-01 00:20:00,TV
...,...,...,...,...
10075,5.0,0.91,2024-03-07 23:35:00,Refrigerator
10076,5.0,0.57,2024-03-07 23:40:00,Refrigerator
10077,5.0,0.85,2024-03-07 23:45:00,Refrigerator
10078,5.0,0.37,2024-03-07 23:50:00,Refrigerator


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10080 entries, 0 to 10079
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Voltage (V)  10080 non-null  float64
 1   Ampere (A)   10080 non-null  float64
 2   Timestamp    10080 non-null  object 
 3   Device ID    10080 non-null  object 
dtypes: float64(2), object(2)
memory usage: 315.1+ KB


# Data preprocessing

## Convert timestamp to datetime
Convert the 'Timestamp' column of the DataFrame df to a datetime dtype with the pd.to_datetime() function. This makes the time and date information usable in later analysis, such as time series modeling.

In [12]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10080 entries, 0 to 10079
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Voltage (V)  10080 non-null  float64       
 1   Ampere (A)   10080 non-null  float64       
 2   Timestamp    10080 non-null  datetime64[ns]
 3   Device ID    10080 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 315.1+ KB
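Once the column is a datetime, calendar fields can be read through the `.dt` accessor. The sketch below is only an illustration on a small hypothetical frame; the `hour` and `day_of_week` columns are assumed names and are not used by the model in this notebook.

```python
import pandas as pd

# Hypothetical mini-frame mimicking the 5-minute sampling of the dataset
sample = pd.DataFrame({
    "Timestamp": [
        "2024-03-01 00:00:00",
        "2024-03-01 00:05:00",
        "2024-03-07 23:50:00",
    ],
})

# Same conversion as in the notebook
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])

# Derived calendar features (illustrative only)
sample["hour"] = sample["Timestamp"].dt.hour
sample["day_of_week"] = sample["Timestamp"].dt.dayofweek  # Monday = 0

print(sample)
```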


## Calculate power consumption
* Calculate electrical power (in watts) from the 'Voltage (V)' column, which holds the voltage in volts, and the 'Ampere (A)' column, which holds the current in amperes.

* For each row of the DataFrame, power is the voltage multiplied by the current (P = V × I).

In [13]:
df['Power (W)'] = df['Voltage (V)'] * df['Ampere (A)']

In [14]:
df

Unnamed: 0,Voltage (V),Ampere (A),Timestamp,Device ID,Power (W)
0,4.5,0.36,2024-03-01 00:00:00,TV,1.620
1,4.5,0.05,2024-03-01 00:05:00,TV,0.225
2,4.5,0.05,2024-03-01 00:10:00,TV,0.225
3,4.5,0.49,2024-03-01 00:15:00,TV,2.205
4,4.5,0.47,2024-03-01 00:20:00,TV,2.115
...,...,...,...,...,...
10075,5.0,0.91,2024-03-07 23:35:00,Refrigerator,4.550
10076,5.0,0.57,2024-03-07 23:40:00,Refrigerator,2.850
10077,5.0,0.85,2024-03-07 23:45:00,Refrigerator,4.250
10078,5.0,0.37,2024-03-07 23:50:00,Refrigerator,1.850


# Feature transformation
* Feature Selection/Extraction: We select two features, "Voltage (V)" and "Ampere (A)". The target was computed directly as their product, so these two features fully determine electric power consumption.

* Data Labeling (Target Labeling): Our target is "Power (W)", the electrical power consumption in watts.

In [15]:

X = df[['Voltage (V)', 'Ampere (A)']]
y = df['Power (W)']

# Splitting Data

At this stage, the previously prepared data (features and labels) is divided into two subsets: a training set and a validation set. With test_size=0.2, 20% of the rows are held out for validation.

In [16]:
# Split data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
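As a rough check of what the split produces, the sketch below runs the same call on placeholder arrays with the same row count as the dataset (10080). `X_demo` and `y_demo` are made-up stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (10080).
# Row i of X_demo is [2*i, 2*i + 1], and y_demo[i] is i.
X_demo = np.arange(10080 * 2).reshape(10080, 2)
y_demo = np.arange(10080)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 80% of the rows go to training, 20% to validation
print(X_tr.shape, X_va.shape)

# Features and labels are shuffled together, so each row keeps its own label
print((X_tr[:, 0] == 2 * y_tr).all())
```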

# Model Training
* RandomForestRegressor: A machine learning algorithm for building regression models. Random Forest is an ensemble method that combines the predictions of many randomized decision trees.

* random_state=42: This parameter sets the seed of the random number generator used by the algorithm, so that the trees are built the same way on every run. The value 42 is an arbitrary choice; any fixed integer gives reproducible results.

Once the model is created, the next step is to train it on the training data with the fit method, where the model learns the patterns contained in that data.

In [17]:
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
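The effect of a fixed seed can be sketched on synthetic data: two forests built with the same `random_state` should produce identical predictions. The data, the `n_estimators=20` setting, and the variable names below are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic voltage/current-like features and their product as the target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 5.0, size=(200, 2))
y_demo = X_demo[:, 0] * X_demo[:, 1]

# Two independent models trained with the same seed
model_a = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)
model_b = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Identical seeds give identical trees, hence identical predictions
same = np.array_equal(model_a.predict(X_demo), model_b.predict(X_demo))
print(same)
```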

# Predicting power consumption
Predicting power consumption for the validation set means using the trained model to estimate the power consumption that corresponds to the input features (voltage and current) in the validation dataset (X_valid).
* model.predict(X_valid): This method applies the trained model to the input features in the validation set (X_valid) to produce predictions of the target variable (power consumption). It returns an array of predicted power consumption values, which is stored in the variable y_pred.


In [18]:
y_pred = model.predict(X_valid)

# MSE
The mean squared error is calculated from the actual power consumption values (y_valid) in the validation set and the predicted power consumption values (y_pred) generated by the model. The mean_squared_error function computes the MSE between these two arrays.

In [19]:
mse = mean_squared_error(y_valid, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.0002696350148809011


* The Mean Squared Error (MSE) value of about 0.0002696 indicates that, on average, the squared difference between the actual power consumption value and the value predicted by the model is very low.

* Since MSE is a measure of the average squared deviation of the prediction from the actual value, a lower MSE indicates a better performance of this regression model. In this case, the low MSE value indicates that the model's predictions are very close to the actual power consumption values in the validation set.

* This indicates that the Random Forest regression model performed well on the validation data, with relatively accurate power consumption predictions.
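To make the definition concrete, MSE can be computed by hand on a few values. In the sketch below the actual values are the first power readings shown earlier, while the predictions are hypothetical numbers chosen for illustration. Taking the square root gives the RMSE, which is in watts; for the MSE reported above that is roughly 0.016 W.

```python
# Actual power values (W) from the first rows of the table above
actual = [1.620, 0.225, 2.205]
# Hypothetical predictions, made up for this illustration
predicted = [1.600, 0.250, 2.200]

# MSE: the mean of the squared differences
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse_manual = sum(squared_errors) / len(squared_errors)

# RMSE: back in the same unit as the target (watts)
rmse_manual = mse_manual ** 0.5

print("MSE:", mse_manual)
print("RMSE:", rmse_manual)
```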

# Performance Metrics

## Performance on Training Data
* The MAE value of about 0.00157 indicates that, on average, the absolute difference between the actual power consumption value and the value predicted by the model is very low. A lower MAE indicates better performance, and in this case, the model's predictions are very close to the actual values.
* The R2 score of approximately 0.999999 indicates that the model explains almost all of the variability of the response data around its mean. A value close to 1.0 suggests that the model fits the data very well and captures nearly all of the variance in the target variable from the features.

In [20]:
from sklearn.metrics import mean_absolute_error, r2_score

y_train_pred = model.predict(X_train)

mae_train = mean_absolute_error(y_train, y_train_pred)

r2_train = r2_score(y_train, y_train_pred)

print("Performance Metrics for Training Data:")
print("Mean Absolute Error (MAE):", mae_train)
print("R-squared (R2):", r2_train)


Performance Metrics for Training Data:
Mean Absolute Error (MAE): 0.0015742559523878822
R-squared (R2): 0.999999890713417


## Performance on Validation Data
* The MAE value of approximately 0.00416 indicates that, on average, the absolute difference between the actual power consumption values and the values predicted by the model is very low. While this value is higher than the MAE for the training data, it still suggests that the model's predictions are very close to the actual values.
* The R2 score of approximately 0.999999425 indicates that the model explains almost all of the variability of the response data around its mean for the validation set. As with the training data, this suggests that the model fits the validation data very well, capturing a high level of variance in the target variable based on the features.

In [21]:
mae_valid = mean_absolute_error(y_valid, y_pred)

r2_valid = r2_score(y_valid, y_pred)

print("\nPerformance Metrics for Validation Data:")
print("Mean Absolute Error (MAE):", mae_valid)
print("R-squared (R2):", r2_valid)



Performance Metrics for Validation Data:
Mean Absolute Error (MAE): 0.004160863095242394
R-squared (R2): 0.9999994250174804


# Prediction

In [22]:
import pandas as pd

new_data = pd.DataFrame({
    'Voltage (V)': [4.5, 18],
    'Ampere (A)': [0.2, 2.0],
    'Device ID': ['TV', 'AC']
})

print("New Data:")
print(new_data)


New Data:
   Voltage (V)  Ampere (A) Device ID
0          4.5         0.2        TV
1         18.0         2.0        AC


* Using the trained model, we predict the power consumption for the given new data.

* After getting the predicted power consumption for each device, we sum them up to get the total power consumption.

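* Using a predefined electricity rate (for example, 450 rupiah per kWh), we multiply the total predicted power by the rate to get the electricity bill. Note that this treats the predicted watts as if they were kWh; a realistic bill would first convert power to energy (watts × hours of use ÷ 1000).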

* To calculate each device's contribution to the electricity bill, we divide each device's predicted power consumption by the total power consumption.

In [27]:
# Predict power consumption for new data
predicted_power = model.predict(new_data[['Voltage (V)', 'Ampere (A)']])

# Calculate total power consumption
total_power_consumption = predicted_power.sum()

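# Calculate electricity bill based on electricity tariff
# NOTE: this multiplies watts directly by a per-kWh rate, i.e. it treats the
# total as if it were energy in kWh; no usage duration is applied here.
electricity_rate_per_kWh = 450  # Rupiah per kWh
electricity_bill = total_power_consumption * electricity_rate_per_kWh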

print("Predicted Electricity Bill: Rp ", electricity_bill)

# Calculate the contribution of each device to the electricity bill
device_contributions = predicted_power / total_power_consumption

# Create DataFrame for devices and their contribution
device_ranking = pd.DataFrame({
    'Device': new_data['Device ID'],  # Take device names from the new data
    'Contribution': device_contributions
})

# Sort the DataFrame by contribution
device_ranking = device_ranking.sort_values(by='Contribution', ascending=False)

print("Ranking of Devices based on Contribution:")
print(device_ranking)


Predicted Electricity Bill: Rp  16603.38
Ranking of Devices based on Contribution:
  Device  Contribution
1     AC      0.975607
0     TV      0.024393
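The bill above multiplies watts by a per-kWh rate directly. A unit-consistent variant would first convert power to energy over a usage period. The sketch below is one possible way to do that; the `estimate_bill` helper and the 24-hour usage period are assumptions, and the power values are computed as voltage times current rather than taken from the model.

```python
def estimate_bill(power_watts, hours_of_use, rate_per_kwh):
    """Convert power (W) to energy (kWh) over a period, then apply the tariff."""
    energy_kwh = power_watts * hours_of_use / 1000.0
    return energy_kwh * rate_per_kwh


# Power of the two example devices, computed as voltage x current
tv_power = 4.5 * 0.2   # 0.9 W
ac_power = 18 * 2.0    # 36 W

hours_of_use = 24      # hypothetical usage period
rate_per_kwh = 450     # Rupiah per kWh, as in the notebook

bill = estimate_bill(tv_power + ac_power, hours_of_use, rate_per_kwh)
print("Estimated bill: Rp", round(bill, 2))

# Shares do not depend on the period when all devices run equally long
ac_share = ac_power / (tv_power + ac_power)
print("AC share:", round(ac_share, 4))
```

With equal running times, the contribution shares match the ratio of the power values, so the ranking in the output above is unaffected by the conversion.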


# Save Model

In [None]:
import joblib

# Save the model to a file
joblib.dump(model, 'model.pkl')
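A saved model can be restored with `joblib.load` and used for prediction right away. The sketch below shows the round trip on a tiny stand-in model written to a temporary directory; `demo_model`, the toy data, and the temporary path are assumptions for illustration, not the notebook's model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in model trained on made-up voltage/current pairs
X_demo = np.array([[4.5, 0.2], [4.5, 0.5], [5.0, 0.4], [5.0, 0.9]])
y_demo = X_demo[:, 0] * X_demo[:, 1]
demo_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
round_trip_ok = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(round_trip_ok)
```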
