
# **Project Name - Appliance Energy Prediction (Regression)**



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Name** -  Saket Vaibhav


# **Project Summary -**

The Appliance Energy Prediction project focuses on predicting the appliance energy use (in Wh) of a house based on time series data obtained from the UCI repository. The dataset includes temperatures (in Celsius) and humidity percentages from different rooms of the house, along with additional weather data from the Chievres weather station, such as temperature, humidity, and wind speed. The main objective is to build a predictive model that accurately estimates the energy consumption of household appliances using these available variables.

The first step of the project is to conduct Exploratory Data Analysis (EDA) to understand the dataset's features and their relationships. The analysis will provide insights into data distributions, correlations, and potential patterns related to energy consumption. The project aims to handle any missing values by imputing them closer to real-world values, ensuring data integrity for model building.

Next, the project addresses the presence of outliers, as they can distort model estimation. Outliers will be carefully handled or removed to improve the model's performance and prediction accuracy.

Feature engineering is an essential part of the project. The plan is to add extra features that could provide additional information and improve the model's prediction capabilities. These engineered features will be chosen based on their potential impact on the target variable and insights from previous work on similar Kaggle problems.

As the dataset might contain correlated features, the project aims to remove redundant variables and retain the most informative ones. In cases where features are highly correlated, the plan is to replace them with alternatives that have less impact on the target variable but still provide valuable information.
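As a rough illustration of the pruning idea above, the sketch below drops one feature from every pair whose absolute Pearson correlation exceeds a threshold. The helper name `drop_correlated`, the 0.9 threshold, and the toy columns are assumptions made for illustration; they are not the values used later in this notebook.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair with |corr| above `threshold` (hypothetical helper)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: 'b' is almost a copy of 'a', while 'c' is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})

reduced, dropped = drop_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Which member of a correlated pair to keep is a design choice; this sketch simply keeps the column that appears first.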

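To support model training, the features will be scaled. Scaling matters most for distance- and margin-based models such as SVM, which are sensitive to feature magnitudes; tree-based ensembles are largely unaffected by it, but a common scale keeps the preprocessing identical across models.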
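The scaling step can be sketched by hand as follows. The statistics are estimated on the training split only and then reused for the test split, which is the same idea `StandardScaler` applies inside a pipeline. The function names and the synthetic data are assumptions made for this example.

```python
import numpy as np

def fit_standardizer(X_train):
    """Estimate per-column mean and standard deviation on the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return mean, std

def standardize(X, mean, std):
    """Apply previously fitted statistics to any split."""
    return (X - mean) / std

# Synthetic features on very different scales (roughly temperature, humidity, pressure)
rng = np.random.default_rng(0)
X = rng.normal(loc=[20.0, 40.0, 755.0], scale=[2.0, 5.0, 7.0], size=(100, 3))
X_train, X_test = X[:80], X[80:]

mean, std = fit_standardizer(X_train)
X_train_scaled = standardize(X_train, mean, std)
X_test_scaled = standardize(X_test, mean, std)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

Fitting on the training split alone avoids leaking information from the test data into the model.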

The project will then proceed with building and evaluating three different models: Support Vector Machines (SVM), one boosting algorithm, and one bagging algorithm. Hyperparameter tuning will be performed to find the best parameters for each model, enhancing their predictive capabilities.
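The tuning loop behind a grid search with k-fold cross-validation can be sketched in a few lines. Ridge regression is used here only as a stand-in because it has a closed-form solution; it is not one of the three models named above, and the grid values and synthetic data are assumptions. `GridSearchCV` and `RandomizedSearchCV` automate the same loop for scikit-learn estimators.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_rmse(X, y, alpha, k=5):
    """Average RMSE over k contiguous folds for one hyperparameter value."""
    indices = np.arange(len(y))
    errors = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        weights = ridge_fit(X[train], y[train], alpha)
        residuals = X[fold] @ weights - y[fold]
        errors.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(errors))

# Synthetic regression problem with known coefficients and noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Evaluate every candidate and keep the one with the lowest cross-validated error
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: cv_rmse(X, y, alpha) for alpha in grid}
best_alpha = min(scores, key=scores.get)
print(best_alpha, round(scores[best_alpha], 3))
```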

Finally, once the best model and hyperparameters are determined, the plan is to save the model weights to avoid retraining in the future, ensuring efficiency and quick predictions.
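A minimal sketch of the save-and-reload step, using the standard-library `pickle` module. The dictionary of fitted parameters stands in for a trained estimator, and the file name is hypothetical; libraries such as `joblib` are a common alternative for scikit-learn models.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: in practice this would be the trained estimator object
model = {"name": "toy_regressor", "coef": [0.4, -1.2], "intercept": 97.5}

# Hypothetical file location inside the system temp directory
path = os.path.join(tempfile.gettempdir(), "appliance_model.pkl")

# Serialize once after training
with open(path, "wb") as f:
    pickle.dump(model, f)

# Reload later without retraining
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.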

The Appliance Energy Prediction project aims to deliver a robust predictive model that can accurately estimate household appliance energy consumption. By leveraging advanced modelling techniques and thorough data analysis, the project seeks to provide valuable insights into energy usage patterns, contributing to energy conservation and informed decision-making for homeowners and utility companies alike.

# **GitHub Link -**

https://github.com/saketvaibhav7114/Regression-Project-on-Appliance-Energy-Prediction

# **Problem Statement**


* Build a predictive model that accurately forecasts the energy consumption of various household appliances.


* Identify and analyze the most influential factors and features that impact appliance energy consumption, such as appliance type, usage patterns, environmental conditions, household characteristics, and time-related data.


* Ensure that the predictive model exhibits high accuracy and generalizability on unseen data, enabling reliable predictions even for appliances not present in the training dataset.

* Enhance the model's interpretability to provide users with meaningful insights into the factors influencing energy consumption and facilitating data-driven decision-making.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following format is answered.


*   Explain the ML model used and its performance using an Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Libraries for Data Manipulation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import calendar
import scipy.stats as stats

# Avoid Warning
import warnings
warnings.filterwarnings('ignore')

# Libraries for Prediction
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split



### Dataset Loading

In [None]:
appliance_df=pd.read_csv("/content/data_application_energy.csv")
pd.set_option("display.max_columns",None)

In [None]:
appliance_df['lights'].value_counts()

### Dataset First View

In [None]:
appliance_df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
appliance_df.shape

### Dataset Information

In [None]:
# Dataset Info
appliance_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
appliance_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
appliance_df.isna().sum()

### What did you know about your dataset?

The dataset is well-prepared for further analysis, as it contains 19,735 rows and 29 features. One advantage is that there are no missing values in any of the features, ensuring the data's completeness. Additionally, there are no duplicate rows, providing a clean and unique dataset for analysis. Most of the features are numerical, making preprocessing relatively straightforward. There is only one object-type feature, which can be easily converted to a datetime format for compatibility with numerical data. This favorable data condition simplifies the preprocessing steps, allowing the focus to be on feature engineering and model development to achieve accurate appliance energy consumption predictions.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
appliance_df.columns

In [None]:
# Dataset Describe
appliance_df.describe().T

### Variables Description

**Appliances :** Target variable (appliance energy use in Wh)


(All the temperature measures here are measured in Celsius; all humidity measures are in %)

**T1 :** Temperature in Kitchen Area

**T2 :** Temperature in Living Room

**T3 :** Temperature in Laundry Area

**T4 :** Temperature in Office Room

**T5 :** Temperature in Bathroom Area

**T6 :** Temperature Outside the Building

**T7 :** Temperature in Ironing Room

**T8 :** Temperature in Teenager Room

**T9 :** Temperature in Parents Room

**RH_1 :** Humidity in Kitchen Area

**RH_2 :** Humidity in Living Room

**RH_3 :** Humidity in Laundry Area

**RH_4 :** Humidity in Office Room

**RH_5 :** Humidity in Bathroom Area

**RH_6 :** Humidity Outside the Building

**RH_7 :** Humidity in Ironing Room

**RH_8 :** Humidity in Teenager Room

**RH_9 :** Humidity in Parents Room

**Windspeed :** Wind speed (from Chievres weather station) in m/s. This variable has some outliers.

**RH_out :** Humidity Outside (from Chievres weather station) in %

**Visibility :** Visibility (from Chievres weather station) in km

**Press_mm_hg :** Pressure (from Chievres weather station) in mm Hg

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in appliance_df.columns:
    print('\033[1m {}:\033[0m {}'.format(column, appliance_df[column].unique()[:5]))

In [None]:
# Check count of Unique Values for each variable.
for column in appliance_df.columns:
    print('\033[1m {}:\033[0m {}'.format(column, appliance_df[column].nunique()))

## Check Skewness of Data


In [None]:
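# 'date' is still an object column here, so restrict the skewness to numeric columns
appliance_df.skew(numeric_only=True)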

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#changing date column dtype as datetime type
appliance_df['date'] = pd.to_datetime(appliance_df['date'])
appliance_df.info()

In [None]:
appliance_df.columns

In [None]:
# Rename the columns
appliance_df.rename(columns={'Appliances':'appliances','T1':'temp_kitchen', 'RH_1':'hum_kitchen', 'T2':'temp_living_room', 'RH_2':'hum_living_room',
                             'T3':'temp_laundary_room','RH_3':'hum_laundary_room', 'T4':'temp_office_room','RH_4':'hum_office',
                             'T5':'temp_bathroom', 'RH_5':'hum_bathroom', 'T6':'temp_build_out','RH_6':'hum_build_out',
                             'T7':'temp_ironing_room', 'RH_7':'hum_ironing_room', 'T8':'temp_teen_room','RH_8':'hum_teen_room',
                             'T9':'temp_parent_room','RH_9':'hum_parent_room', 'T_out':'temp_out','Press_mm_hg':'press_mm_hg',
                             'RH_out':'out_humidity', 'Windspeed':'windspeed', 'Visibility':'visibility','Tdewpoint':'temp_dewpoint'},inplace=True)

In [None]:
# Extracting month, hour & weekdays from date
appliance_df['month_no']= appliance_df['date'].dt.month
appliance_df['weekday']= appliance_df['date'].dt.weekday
appliance_df['hour']= appliance_df['date'].dt.hour
appliance_df['date_day']= appliance_df['date'].dt.day
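# Series.dt.week was removed in pandas 2.0; isocalendar() returns the ISO week number
appliance_df['week_no']= appliance_df['date'].dt.isocalendar().week.astype(int)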


In [None]:
# dropping the date column
appliance_df.drop('date',axis=1,inplace=True)

In [None]:
appliance_df.head()

In [None]:
# Grouping Columns:
temperature=['temp_kitchen','temp_living_room','temp_laundary_room','temp_office_room',
              'temp_bathroom','temp_build_out','temp_ironing_room','temp_teen_room',
              'temp_parent_room','temp_out','temp_dewpoint']

humidity=['hum_kitchen','hum_living_room','hum_laundary_room','hum_office','hum_bathroom',
          'hum_build_out','hum_ironing_room','hum_teen_room','hum_parent_room','out_humidity']

datetime=['month_no', 'weekday', 'hour', 'date_day', 'week_no']

### What all manipulations have you done and insights you found?



1.   Renaming the columns for easier understanding
2.   Splitting the date column into day, week, month, hour & weekday
3.   Dropping the date column
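The date-splitting step above can be checked on a two-row toy frame. The timestamps and the extra `is_weekend` column are assumptions added for illustration. Note that pandas numbers weekdays from Monday=0 to Sunday=6, and that `.dt.isocalendar().week` is the replacement for `.dt.week`, which was removed in pandas 2.0.

```python
import pandas as pd

# Two made-up timestamps: a Monday evening and a Saturday morning
toy = pd.DataFrame({'date': ['2016-01-11 17:00:00', '2016-01-16 08:30:00']})
toy['date'] = pd.to_datetime(toy['date'])

toy['month_no'] = toy['date'].dt.month
toy['weekday'] = toy['date'].dt.weekday          # Monday=0 ... Sunday=6
toy['hour'] = toy['date'].dt.hour
toy['date_day'] = toy['date'].dt.day
toy['week_no'] = toy['date'].dt.isocalendar().week.astype(int)
toy['is_weekend'] = toy['weekday'] >= 5          # hypothetical helper column

print(toy.drop(columns='date'))
```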



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Density Distribution plot of Target Variable

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(15,8))
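# sns.distplot is deprecated; histplot with a KDE overlay draws the equivalent density plot
sns.histplot(appliance_df['appliances'], kde=True, stat='density')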
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2:- Log Transformation of Target Variable

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(15,10))
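# sns.distplot is deprecated; histplot with a KDE overlay draws the equivalent density plot
sns.histplot(np.log10(appliance_df['appliances']), kde=True, stat='density')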
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

In [None]:
lights_counts = appliance_df['lights'].value_counts()

# Plotting the bar plot
sns.barplot(x=lights_counts.index, y=lights_counts.values)

# Adding labels and title
plt.xlabel('Lights')
plt.ylabel('Count')
plt.title('Bar Plot: Count of Lights')

plt.show()


#### Chart - 3

In [None]:
# Chart - 3 visualization code
fig, ax = plt.subplots(len(temperature), 2, figsize=(20, 50))

for i, col in enumerate(temperature):
    # Original data distribution with mean (magenta) and median (cyan) markers
    sns.histplot(appliance_df[col], kde=True, stat='density', ax=ax[i, 0], color='blue')
    ax[i, 0].axvline(appliance_df[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax[i, 0].axvline(appliance_df[col].median(), color='cyan', linestyle='dashed', linewidth=2)

    # Log-scaled distribution. log10 is undefined for values <= 0 (outdoor
    # temperatures can be sub-zero), so only positive readings are transformed
    positive = appliance_df.loc[appliance_df[col] > 0, col]
    log_data = np.log10(positive)
    sns.histplot(log_data, kde=True, stat='density', ax=ax[i, 1], color='blue')
    ax[i, 1].axvline(log_data.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax[i, 1].axvline(log_data.median(), color='cyan', linestyle='dashed', linewidth=2)

plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
fig, ax = plt.subplots(len(humidity), 2, figsize=(20, 50))

for i, col in enumerate(humidity):
    # Original data distribution with mean (magenta) and median (cyan) markers
    sns.histplot(appliance_df[col], kde=True, stat='density', ax=ax[i, 0], color='blue')
    ax[i, 0].axvline(appliance_df[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax[i, 0].axvline(appliance_df[col].median(), color='cyan', linestyle='dashed', linewidth=2)

    # Logarithmically scaled data distribution; epsilon guards against log10(0)
    epsilon = 1e-10
    log_data = np.log10(appliance_df[col] + epsilon)
    sns.histplot(log_data, kde=True, stat='density', ax=ax[i, 1], color='blue')
    ax[i, 1].axvline(log_data.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax[i, 1].axvline(log_data.median(), color='cyan', linestyle='dashed', linewidth=2)

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
n = len(temperature)
# One panel per temperature column (a single column avoids creating and removing empty axes)
fig, ax = plt.subplots(n, 1, figsize=(10, 40))

for i, col in enumerate(temperature):
    # Create line plot for the current temperature column against 'appliances'
    sns.lineplot(data=appliance_df, x=col, y='appliances', color='green', ax=ax[i])
    ax[i].set_xlabel(col)
    ax[i].set_ylabel('Appliance Consumption')
    ax[i].set_title(f'Line Plot: {col} vs. Appliance Consumption')

# Adjust the layout to prevent overlapping of labels and titles
plt.tight_layout()

# Show the plots
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
n = len(humidity)
# One panel per humidity column (a single column avoids creating and removing empty axes)
fig, ax = plt.subplots(n, 1, figsize=(10, 40))

for i, col in enumerate(humidity):
    # Create line plot for the current humidity column against 'appliances'
    sns.lineplot(data=appliance_df, x=col, y='appliances', color='green', ax=ax[i])
    ax[i].set_xlabel(col)
    ax[i].set_ylabel('Appliance Consumption')
    ax[i].set_title(f'Line Plot: {col} vs. Appliance Consumption')

# Adjust the layout to prevent overlapping of labels and titles
plt.tight_layout()

# Show the plots
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

n = len(datetime)
fig, ax = plt.subplots(n, 2, figsize=(20, 15))

for i, col in enumerate(datetime):
    # Create bar plot for the current datetime column against 'appliances'
    sns.barplot(data=appliance_df, x=col, y='appliances', color='orange', ax=ax[i, 0])
    ax[i, 0].set_xlabel(col)
    ax[i, 0].set_ylabel('Appliance Consumption')
    ax[i, 0].set_title(f'Bar Plot: {col} vs. Appliance Consumption')

    # Create line plot for the current datetime column against 'appliances'
    sns.lineplot(data=appliance_df, x=col, y='appliances', color='orange', ax=ax[i, 1])
    ax[i, 1].set_xlabel(col)
    ax[i, 1].set_ylabel('Appliance Consumption')
    ax[i, 1].set_title(f'Line Plot: {col} vs. Appliance Consumption')

# Adjust the layout to prevent overlapping of labels and titles
plt.tight_layout()

# Show the plots
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

fig, ax= plt.subplots(figsize=(12, 8))

# Plot boxplots for temperature columns
sns.boxplot(data=appliance_df[temperature], ax=ax)
ax.set_xticklabels(temperature, rotation=45, ha='right')
ax.set_xlabel('Temperature Columns')
ax.set_ylabel('Values')
ax.set_title('Boxplots for Temperature Columns')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

fig, ax= plt.subplots(figsize=(12, 8))

# Plot boxplots for humidity columns
sns.boxplot(data=appliance_df[humidity], ax=ax)
ax.set_xticklabels(humidity, rotation=45, ha='right')
ax.set_xlabel('Humidity Columns')
ax.set_ylabel('Values')
ax.set_title('Boxplots for Humidity Columns')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

fig, ax= plt.subplots(figsize=(12, 8))

sns.boxplot(data=appliance_df[datetime], ax=ax)
ax.set_xticklabels(datetime, rotation=45, ha='right')
ax.set_xlabel('Datetime Columns')
ax.set_ylabel('Values')
ax.set_title('Boxplots for Datetime Columns')


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

fig, axes = plt.subplots(len(temperature), 1, figsize=(20, 60))

for i, col in enumerate(temperature):
    feature = appliance_df[col]
    label = appliance_df['appliances']
    correlation = feature.corr(label)

    # Scatter plot
    axes[i].scatter(x=feature, y=label,c='skyblue')
    axes[i].set_xlabel(col,fontsize=15)
    axes[i].set_ylabel('Appliances Consumption',fontsize=15)
    axes[i].set_title('Appliances Consumption vs ' + col + ' - Correlation: ' + str(correlation),fontsize=20)

    # Linear regression line
    z = np.polyfit(appliance_df[col], appliance_df['appliances'], 1)
    y_hat = np.poly1d(z)(appliance_df[col])
    # The "r-" format string already sets a red solid line, so no separate color argument is needed
    axes[i].plot(appliance_df[col], y_hat, "r-", lw=1)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
fig, axes = plt.subplots(len(humidity), 1, figsize=(20, 60))

for i, col in enumerate(humidity):
    feature = appliance_df[col]
    label = appliance_df['appliances']
    correlation = feature.corr(label)

    # Scatter plot
    axes[i].scatter(x=feature, y=label)
    axes[i].set_xlabel(col,fontsize=15)
    axes[i].set_ylabel('Appliances Consumption',fontsize=15)
    axes[i].set_title('Appliances Consumption vs ' + col + ' - Correlation: ' + str(correlation),fontsize=20)

    # Linear regression line
    z = np.polyfit(appliance_df[col], appliance_df['appliances'], 1)
    y_hat = np.poly1d(z)(appliance_df[col])
    axes[i].plot(appliance_df[col], y_hat, "r-", lw=1)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

fig, axes = plt.subplots(len(datetime), 1, figsize=(20, 25))

for i, col in enumerate(datetime):
    feature = appliance_df[col]
    label = appliance_df['appliances']
    correlation = feature.corr(label)

    # Scatter plot
    axes[i].scatter(x=feature, y=label)
    axes[i].set_xlabel(col,fontsize=15)
    axes[i].set_ylabel('Appliances Consumption',fontsize=15)
    axes[i].set_title('Appliances Consumption vs ' + col + ' - Correlation: ' + str(correlation),fontsize=20)

    # Linear regression line
    z = np.polyfit(appliance_df[col], appliance_df['appliances'], 1)
    y_hat = np.poly1d(z)(appliance_df[col])
    axes[i].plot(appliance_df[col], y_hat, "r-", lw=1)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(20,10))
correlation = appliance_df.corr()
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm',fmt='.1f')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(appliance_df[temperature])

In [None]:
sns.pairplot(appliance_df[humidity])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1) There is no change in appliance energy consumption on weekdays & weekend.


2) There is no significant difference in the energy consumption for appliance between day & night.


3) The mean temperature in kitchen is greater than normal room temperature.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)-** There is no change in appliance energy consumption on weekdays & weekend.

**Alternate Hypothesis(Ha)-** There is higher appliance energy consumption on weekend as compared to weekdays.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# pandas encodes weekday as Monday=0 ... Sunday=6, so 0-4 are weekdays and 5-6 are the weekend
data_weekday=appliance_df[appliance_df['weekday']<5]['appliances']
data_weekend=appliance_df[appliance_df['weekday']>=5]['appliances']


# Extract 50 samples from data_weekday & data_weekend
sample_weekday=data_weekday.sample(50)
sample_weekend=data_weekend.sample(50)

# Perform a one-sided t-test: Ha states that the weekday mean is lower than the weekend mean
t_statistics,p_value=stats.ttest_ind(sample_weekday,sample_weekend,equal_var=True,alternative='less')

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistics)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis: There is higher appliance energy consumption on weekend as compared to weekdays.")
else:
    print("Fail to reject the null hypothesis: There is no change in appliance energy consumption on weekdays & weekend.")

##### Which statistical test have you done to obtain P-Value?

An independent two-sample t-test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.
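To make the test statistic concrete, the sketch below computes the pooled-variance (Student) two-sample t statistic by hand, which is the quantity `stats.ttest_ind(..., equal_var=True)` reports. The two small samples are made-up values in Wh, not readings from the dataset, and the function name is hypothetical.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic under the equal-variance assumption."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Pool the two sample variances, weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err

# Made-up consumption samples in Wh
weekday_wh = [50, 60, 70, 60, 50, 80, 60]
weekend_wh = [90, 70, 110, 80, 100, 60, 120]

t = pooled_t_statistic(weekday_wh, weekend_wh)
print(round(t, 3))
```

A negative statistic means the first sample has the lower mean; the p-value then follows from the t distribution with `n_a + n_b - 2` degrees of freedom.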

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in appliance energy consumption between day and night.

**Alternative Hypothesis (Ha):** There is a significant difference in appliance energy consumption between day and night.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Create a column indicating whether it's day or night (you might need to adjust this condition)
appliance_df['time_of_day'] = ['day' if 7 <= hour < 19 else 'night' for hour in appliance_df['hour']]

# Separate data for day and night
data_day = appliance_df[appliance_df['time_of_day'] == 'day']['appliances']
data_night = appliance_df[appliance_df['time_of_day'] == 'night']['appliances']


# Extract 50 samples from data_day & data_night
sample_day=data_day.sample(50)
sample_night=data_night.sample(50)

# Perform t-test
t_statistic, p_value = stats.ttest_ind(sample_day, sample_night, equal_var=True)

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in energy consumption between day and night.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in energy consumption between day and night.")

##### Which statistical test have you done to obtain P-Value?

An independent two-sample t-test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

T-test is commonly used to compare the means of two samples or groups to assess whether the observed difference is statistically significant or if it could have occurred by chance.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis(H0):** The mean temperature in kitchen is equal to normal room temperature.

**Alternative Hypothesis(Ha):** The mean temperature in kitchen is greater than normal room temperature.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Specify the normal room temperature
normal_room_temperature = 25.0

# Select data for the kitchen temperature
data_kitchen = appliance_df['temp_kitchen']

# Extract 50 sample from data_kitchen
sample_kitchen_data=data_kitchen.sample(50)

# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample_kitchen_data, normal_room_temperature, alternative='greater')

# Set the significance level (alpha)
alpha = 0.05

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis: The mean temperature in the kitchen is greater than normal room temperature.")
else:
    print("Fail to reject the null hypothesis: The mean temperature in the kitchen is not greater than normal room temperature.")

##### Which statistical test have you done to obtain P-Value?

A one-sample t-test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

A one-sample t-test compares the mean of a single sample against a fixed reference value (here, the assumed normal room temperature) to assess whether the observed difference is statistically significant or could have occurred by chance.
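The one-sample statistic can also be worked out by hand. The readings below are made-up kitchen temperatures, the function name is hypothetical, and the reference of 25.0 mirrors the value assumed in the test cell above.

```python
import math
import statistics

def one_sample_t(sample, reference):
    """t statistic for H0: population mean equals `reference`."""
    n = len(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - reference) / std_err

# Made-up kitchen temperatures in Celsius
kitchen_temps = [21.0, 22.5, 20.5, 23.0, 21.5, 22.0]

t = one_sample_t(kitchen_temps, reference=25.0)
print(round(t, 2))
```

With `alternative='greater'`, a strongly negative statistic such as this one gives a p-value close to 1, so the null hypothesis would not be rejected for these values.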

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Outliers

In [None]:
# Checking Outliers

# Select numeric columns for box plots
numeric_columns = ['appliances', 'lights', 'temp_kitchen', 'hum_kitchen',
                   'temp_living_room', 'hum_living_room', 'temp_laundary_room',
                   'hum_laundary_room', 'temp_office_room', 'hum_office', 'temp_bathroom',
                   'hum_bathroom', 'temp_build_out', 'hum_build_out', 'temp_ironing_room',
                   'hum_ironing_room', 'temp_teen_room', 'hum_teen_room',
                   'temp_parent_room', 'hum_parent_room', 'temp_out', 'press_mm_hg',
                   'out_humidity', 'windspeed', 'visibility', 'temp_dewpoint']



# Set the number of rows and columns for subplots
num_rows = 8
num_cols = 4

# Create subplots for box plots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))
sns.set(style="whitegrid")

for n, column in enumerate(numeric_columns):
    row = n // num_cols
    col = n % num_cols
    sns.boxplot(x=appliance_df[column], ax=axes[row, col])
    axes[row, col].set_title(column.upper(), fontsize=10)

# Remove any unused subplots
for n in range(len(numeric_columns), num_rows * num_cols):
    fig.delaxes(axes.flatten()[n])

plt.tight_layout()
plt.show()


In [None]:
# Treating Outliers
def find_outliers_iqr(data):
  # Calculate the 1st & 3rd Quartile for each variable
  q1=data.quantile(0.25)
  q3=data.quantile(0.75)

  # Calculate the Interquartile range(IQR) for each columns
  iqr=q3-q1

  # Calculate the lower & upper bounds for each outliers of each columns
  lower_bound=q1-1.5*iqr
  upper_bound=q3+1.5*iqr

  # Check for outliers in each column & count the number of outliers
  outliers_count=(data<lower_bound) | (data>upper_bound)
  num_outliers=outliers_count.sum()
  return num_outliers

# Restrict the check to the numeric columns so quantiles are well defined
outliers_per_column=find_outliers_iqr(appliance_df[numeric_columns])
print("Number of Outliers per columns:")
print(outliers_per_column.sort_values(ascending=False))


In [None]:
def remove_outliers_iqr(column):
  # Convert the data to numeric (with errors coerced to NaN)
  numeric_data = pd.to_numeric(column, errors='coerce')

  # Calculate the 1st & 3rd Quartile for the variable
  q1 = numeric_data.quantile(0.25)
  q3 = numeric_data.quantile(0.75)

  # Calculate the Interquartile range(IQR) for the column
  iqr = q3 - q1

  # Calculate the lower & upper bounds for outliers
  lower_bound = q1 - 1.5 * iqr
  upper_bound = q3 + 1.5 * iqr

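  # Keep only the values inside the bounds; when this function is used with
  # DataFrame.apply, the dropped positions are re-aligned on the index and become NaN
  data_no_outliers = numeric_data[(numeric_data >= lower_bound) & (numeric_data <= upper_bound)]

  return data_no_outliers

# Apply the function to each numeric column in the DataFrame
appliance_df_no_outliers = appliance_df[numeric_columns].apply(remove_outliers_iqr)

# Rows are not dropped: outliers are replaced by NaN, so the row count stays the same
print("Shape before removing outliers:", appliance_df.shape)
print("Shape after removing outliers:", appliance_df_no_outliers.shape)
print("NaN values after masking outliers:", appliance_df_no_outliers.isna().sum().sum())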



In [None]:
# Plotting the box-plot to again check outliers

# Set the number of rows and columns for subplots
num_rows = 8
num_cols = 4

# Create subplots for box plots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 25))
sns.set(style="whitegrid")

numeric_columns = ['appliances', 'lights', 'temp_kitchen', 'hum_kitchen',
                   'temp_living_room', 'hum_living_room', 'temp_laundary_room',
                   'hum_laundary_room', 'temp_office_room', 'hum_office', 'temp_bathroom',
                   'hum_bathroom', 'temp_build_out', 'hum_build_out', 'temp_ironing_room',
                   'hum_ironing_room', 'temp_teen_room', 'hum_teen_room',
                   'temp_parent_room', 'hum_parent_room', 'temp_out', 'press_mm_hg',
                   'out_humidity', 'windspeed', 'visibility', 'temp_dewpoint']

for n, column in enumerate(numeric_columns):
    row = n // num_cols
    col = n % num_cols
    sns.boxplot(x=appliance_df_no_outliers[column], ax=axes[row, col])
    axes[row, col].set_title(column.upper(), fontsize=10)

# Remove any unused subplots
for n in range(len(numeric_columns), num_rows * num_cols):
    fig.delaxes(axes.flatten()[n])

plt.tight_layout()
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

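The Interquartile Range (IQR) method is used for outlier detection and removal: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers. Because it relies on quartiles rather than the mean and standard deviation, the IQR method is robust to extreme values and works well for skewed or non-normally distributed data.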
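
The IQR rule can be illustrated on a toy column. The sketch below uses made-up values and hypothetical variable names; it shows how the bounds are derived and that masking an outlier leaves the number of rows unchanged while introducing a NaN.

```python
import pandas as pd

# Toy column with one obvious extreme value (illustrative data, not from the dataset)
values = pd.Series([10, 11, 11, 12, 12, 13, 100], dtype="float64")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Series.where keeps values that satisfy the condition and sets the rest to NaN
masked = values.where(values.between(lower_bound, upper_bound))

print("Bounds:", lower_bound, upper_bound)
print("Masked values:", masked.isna().sum(), "of", len(masked))
```

Only the value 100 falls outside the bounds here, so it is the only entry replaced by NaN.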

### 2. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df=appliance_df_no_outliers.copy()

In [None]:
df.isnull().sum()

In [None]:
df.fillna(df.mean(), inplace=True)

In [None]:
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

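All missing values are replaced by the mean of their column; here these include the values masked as outliers in the previous step. Mean imputation is suitable when the data are missing at random and the variable is approximately normally distributed.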
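
The caveat about the distribution matters in practice. As a small sketch on made-up data (the column name and values are hypothetical), the following compares mean imputation with median imputation, a common alternative for skewed columns that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Toy skewed column with one missing value (illustrative data only)
toy = pd.DataFrame({"energy": [1.0, 2.0, 3.0, 4.0, 100.0, np.nan]})

mean_filled = toy.fillna(toy.mean())
median_filled = toy.fillna(toy.median())

print("Imputed with mean:  ", mean_filled["energy"].iloc[-1])
print("Imputed with median:", median_filled["energy"].iloc[-1])
```

On this toy column the mean is pulled up by the single large value, whereas the median stays close to the bulk of the data.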

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Creating new features
# `temperature` and `humidity` are assumed to be lists of column names defined earlier in the notebook

# Create a column for the average building temperature across all temperature columns
df['average_building_temperature']=df[temperature].mean(axis=1)

# Create a column for the absolute temperature difference between outside & inside of the building
df['temp_difference']=abs(df['average_building_temperature']-df['temp_build_out'])

# Create a column for the average building humidity across all humidity columns
df['average_building_humidity']=df[humidity].mean(axis=1)

# Create a column for the absolute humidity difference between outside & inside of the building
df['humidity_difference']=abs(df['hum_build_out']-df['average_building_humidity'])

In [None]:
df

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
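
One possible approach, in line with the project's stated aim of removing redundant correlated variables, is to drop one feature from every highly correlated pair. The sketch below is not the notebook's method; the function name `drop_highly_correlated`, the threshold of 0.9, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# Toy data: "b" is an exact multiple of "a", "c" is only weakly related to both
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
selected = drop_highly_correlated(toy)
print(list(selected.columns))
```

The threshold is a judgment call; a lower value removes more features at the risk of discarding useful signal.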

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
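
If the target turns out to be strongly right-skewed, a logarithmic transformation is one common option. The following sketch assumes such skew and uses made-up values; whether the notebook's data needs it should be checked on the actual distribution first.

```python
import numpy as np
import pandas as pd

# Toy right-skewed target (illustrative values, not taken from the dataset)
energy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 1000.0])

# log1p computes log(1 + x), which is defined at zero and compresses large values
energy_log = np.log1p(energy)

# expm1 is the inverse of log1p, so predictions can be mapped back to the original scale
energy_restored = np.expm1(energy_log)

print("Skewness before:", round(energy.skew(), 2))
print("Skewness after: ", round(energy_log.skew(), 2))
```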

### 6. Data Scaling

In [None]:
# Scaling your data
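
As an illustration of standardization (z-score scaling), the sketch below computes the scaling statistics on a training split only and reuses them on the test split. The split, the values, and the variable names are hypothetical; the population standard deviation (`ddof=0`) is used, which matches what scikit-learn's `StandardScaler` computes.

```python
import pandas as pd

# Toy training and test splits (illustrative values only)
train = pd.DataFrame({"temp_kitchen": [20.0, 22.0, 24.0]})
test = pd.DataFrame({"temp_kitchen": [26.0]})

# Statistics come from the training split only, to avoid leaking test information
train_mean = train.mean()
train_std = train.std(ddof=0)  # population standard deviation

train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_scaled["temp_kitchen"].tolist())
print(test_scaled["temp_kitchen"].tolist())
```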

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
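
Because the dataset is a time series, one option is a chronological split that keeps the most recent rows for testing instead of shuffling. The sketch below assumes the rows are already sorted by time; the helper name `chronological_split` and the 80/20 ratio are illustrative choices, not the notebook's.

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, test_fraction: float = 0.2):
    """Return (train, test) where test holds the last test_fraction of the rows."""
    n_test = int(round(len(frame) * test_fraction))
    split_index = len(frame) - n_test
    return frame.iloc[:split_index], frame.iloc[split_index:]

# Toy frame whose row order stands in for time order
toy = pd.DataFrame({"appliances": range(10)})
train_part, test_part = chronological_split(toy)

print(len(train_part), len(test_part))
```

A random split is simpler, but with time-ordered data it can let information from the future leak into training.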

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
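
Whatever model is chosen, the score chart needs a set of regression metrics. As a sketch, the hypothetical helper below computes MAE, RMSE, and R² with NumPy on made-up values; it is not tied to any model in the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Toy actual and predicted energy values (illustrative only)
scores = regression_metrics([50, 60, 70, 80], [55, 60, 65, 80])
print(scores)
```

MAE and RMSE are expressed in the units of the target, which makes them easier to relate to energy usage than R².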

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
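
A save-and-reload round trip with `pickle` could look like the sketch below. The "model" here is only a dictionary of made-up linear coefficients standing in for a trained estimator, and the file name `model.pkl` and the helper `predict` are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical fitted parameters standing in for a trained model
model = {"intercept": 10.0, "coefficients": [2.0, -1.0]}

def predict(params, row):
    """Linear prediction: intercept + sum(coefficient * feature)."""
    return params["intercept"] + sum(c * x for c, x in zip(params["coefficients"], row))

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    # Save the model to disk
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    # Load it back as a sanity check
    with open(model_path, "rb") as handle:
        loaded_model = pickle.load(handle)

unseen_row = [3.0, 4.0]  # made-up feature values
print(predict(loaded_model, unseen_row))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.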

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The Appliance Energy Prediction project leverages data-driven approaches to enhance energy efficiency and sustainability in residential settings. The developed predictive model empowers homeowners and utility companies to make informed decisions, ultimately leading to a more sustainable and environmentally responsible use of energy resources.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***