# **Household Power Consumption**

### **Dataset Description:**

This dataset captures electric power consumption in a single household at one-minute intervals over nearly four years (December 2006 to November 2010). It provides detailed time-series data on several electrical quantities and on energy sub-metering.

**Characteristics:**

**Type**: Multivariate, Time-Series

**Tasks**: Regression, Clustering

**Records**: 2,075,259

**Period**: 47 months

**Sampling Rate**:  One minute

**Attributes**:

**Date**: Observation date (dd/mm/yyyy)

**Time**: Observation time (hh:mm:ss)

**Global_active_power**: Active power consumption (kW)

**Global_reactive_power**: Reactive power consumption (kW)

**Voltage**: Supply voltage (V)

**Global_intensity**: Current intensity (A)

**Sub_metering_1**: Energy for the kitchen (Wh)

**Sub_metering_2**: Energy for the laundry room (Wh)

**Sub_metering_3**: Energy for the electric water heater and air conditioner (Wh)

**Notes**:

**Energy Calculation**: `global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3` gives the active energy (Wh) consumed each minute by equipment not covered by the three sub-meters.

**Missing Values**: Around 1.25% of rows have missing data.
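
The energy calculation in the notes can be sketched in pandas as follows. This is a minimal illustration on three sample rows taken from the start of the dataset; the column name `Unmeasured_energy` and the frame `sample` are assumptions introduced here, not part of the dataset.

```python
import pandas as pd

# Three sample rows (values from the first minutes of the dataset).
sample = pd.DataFrame({
    "Global_active_power": [4.216, 5.360, 5.374],  # kW, minute-averaged
    "Sub_metering_1": [0.0, 0.0, 0.0],             # Wh
    "Sub_metering_2": [1.0, 1.0, 2.0],             # Wh
    "Sub_metering_3": [17.0, 16.0, 17.0],          # Wh
})

# kW * 1000 / 60 converts minute-averaged power to Wh consumed in that minute.
total_wh = sample["Global_active_power"] * 1000 / 60

# Hypothetical column name: energy not captured by any of the three sub-meters.
sample["Unmeasured_energy"] = (
    total_wh
    - sample["Sub_metering_1"]
    - sample["Sub_metering_2"]
    - sample["Sub_metering_3"]
)
print(sample["Unmeasured_energy"].round(2).tolist())
```

The division by 60 is what makes the units agree: the sub-meters report watt-hours per minute, while the global reading is an average power in kilowatts.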

**Summary**

The dataset is ideal for analyzing energy consumption trends, time-series forecasting, and predictive modeling, with some data cleaning needed due to missing values.

# **Import libraries and Load the dataset**

In [44]:
import pandas as pd
import numpy as np

In [45]:
!gdown --fuzzy https://drive.google.com/file/d/1bvaXJJqNObOCkX-i475BNxpidk024pyx/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1bvaXJJqNObOCkX-i475BNxpidk024pyx
From (redirected): https://drive.google.com/uc?id=1bvaXJJqNObOCkX-i475BNxpidk024pyx&confirm=t&uuid=e602d3f0-a2e7-4ff0-b35b-054d1edc35ae
To: /content/household_power_consumption.txt
100% 133M/133M [00:01<00:00, 106MB/s]


In [46]:
df=pd.read_csv('/content/household_power_consumption.txt',sep=";")
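
The raw file marks missing readings with `?`, which is why most measurement columns load as `object` with the call above. A minimal sketch of an alternative load that treats `?` as missing is shown below on an in-memory sample, so it runs without the download. The names `raw_text`, `df_sample`, and `Datetime` are introduced for illustration, and the second sample row only imitates a minute with missing readings.

```python
import io
import pandas as pd

# Small in-memory stand-in for the semicolon-separated file (hypothetical sample;
# the second row imitates a minute with missing readings).
raw_text = (
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;?;?;?;?;\n"
    "16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000\n"
)

# na_values="?" turns the placeholder into NaN, so measurement columns parse as float64.
df_sample = pd.read_csv(io.StringIO(raw_text), sep=";", na_values="?")

# Combine Date and Time into one timestamp; the date format is day/month/year.
df_sample["Datetime"] = pd.to_datetime(
    df_sample["Date"] + " " + df_sample["Time"], format="%d/%m/%Y %H:%M:%S"
)
print(df_sample.dtypes)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.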



# **Basic Data Exploration**

In [47]:
df.shape

(2075259, 9)

**Observation**: The dataset consists of 2,075,259 rows and 9 columns.

In [48]:
df.head()

| | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16/12/2006 | 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 16/12/2006 | 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 16/12/2006 | 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 16/12/2006 | 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 16/12/2006 | 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


**Observation**: The first five rows show that the recording starts on 16/12/2006 at 17:24:00, with one row per minute.

In [49]:
df.tail()

| | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|
| 2075254 | 26/11/2010 | 20:58:00 | 0.946 | 0.0 | 240.43 | 4.0 | 0.0 | 0.0 | 0.0 |
| 2075255 | 26/11/2010 | 20:59:00 | 0.944 | 0.0 | 240.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| 2075256 | 26/11/2010 | 21:00:00 | 0.938 | 0.0 | 239.82 | 3.8 | 0.0 | 0.0 | 0.0 |
| 2075257 | 26/11/2010 | 21:01:00 | 0.934 | 0.0 | 239.7 | 3.8 | 0.0 | 0.0 | 0.0 |
| 2075258 | 26/11/2010 | 21:02:00 | 0.932 | 0.0 | 239.55 | 3.8 | 0.0 | 0.0 | 0.0 |


**Observation**: The last five rows show that the recording ends on 26/11/2010 at 21:02:00.

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    object 
 3   Global_reactive_power  object 
 4   Voltage                object 
 5   Global_intensity       object 
 6   Sub_metering_1         object 
 7   Sub_metering_2         object 
 8   Sub_metering_3         float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB



**Observations**:


**Data Types Issue**: Eight columns are `object` type. `Date` and `Time` are expected to be text, but the six measurement columns from `Global_active_power` to `Sub_metering_2` should be numeric, which indicates non-numeric values or formatting issues.

**Memory Usage**: The dataset uses at least 142.5 MB, so it is large enough to require efficient handling.

**Data Cleaning Needed**: Conversion to numeric types is necessary for meaningful analysis and modeling.
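
The `+` in `142.5+ MB` means pandas reports a lower bound, because by default it does not inspect the strings held in `object` columns. The sketch below, on toy data, compares the same values stored as strings and as floats; `toy` and `usage` are hypothetical names, and the frame is not the dataset.

```python
import pandas as pd

# Toy frame: the same numbers stored as strings (object) and as float64.
values = [4.216, 5.360, 5.374, 5.388, 3.666] * 1000
toy = pd.DataFrame({
    "as_object": pd.Series([str(v) for v in values], dtype=object),
    "as_float": values,
})

# deep=True measures the Python string objects themselves, not just the pointers.
usage = toy.memory_usage(deep=True, index=False)
print(usage)
```

Converting the measurement columns to `float64` should therefore reduce the real memory footprint as well as enable numeric analysis.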

In [51]:
len(df)

2075259

**Observation**: The DataFrame `df` has 2,075,259 rows, matching the row count reported by `df.shape`.

In [52]:
df.describe()

| | Sub_metering_3 |
|---|---|
| count | 2049280.0 |
| mean | 6.458447 |
| std | 8.437154 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 17.0 |
| max | 31.0 |


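**Observations**:

`describe()` covers only `Sub_metering_3`, the single numeric column at this point.

**Count**: 2,049,280 non-null entries out of 2,075,259 rows, so 25,979 values are missing.

**Mean**: 6.46 watt-hours.

**Standard Deviation**: 8.44 watt-hours, larger than the mean (high variability).

**Minimum**: 0 watt-hours (no consumption).

**Percentiles**:

25% of values are 0 watt-hours.

50% (median) is 1 watt-hour.

75% of values are at most 17 watt-hours.

**Maximum**: 31 watt-hours.

Overall, at least half of the readings are at most 1 watt-hour, so low or zero consumption on this sub-meter is common.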

# **Finding Null Values and Unique Values**

In [53]:
# Count of unique values in each column
unique_values = df.nunique()
print(unique_values)

Date                     1442
Time                     1440
Global_active_power      6534
Global_reactive_power     896
Voltage                  5168
Global_intensity          377
Sub_metering_1            153
Sub_metering_2            145
Sub_metering_3             32
dtype: int64


In [54]:
# Check for any null (missing) values in each column
df.isnull().any()

| | 0 |
|---|---|
| Date | False |
| Time | False |
| Global_active_power | False |
| Global_reactive_power | False |
| Voltage | False |
| Global_intensity | False |
| Sub_metering_1 | False |
| Sub_metering_2 | False |
| Sub_metering_3 | True |


In [55]:
# Count of null values in each column
null_values = df.isnull().sum()
print(null_values)


Date                         0
Time                         0
Global_active_power          0
Global_reactive_power        0
Voltage                      0
Global_intensity             0
Sub_metering_1               0
Sub_metering_2               0
Sub_metering_3           25979
dtype: int64


In [56]:
# Percentage of null values in each column
null_percentage = (df.isnull().sum() / len(df)) * 100
print(null_percentage)


Date                     0.000000
Time                     0.000000
Global_active_power      0.000000
Global_reactive_power    0.000000
Voltage                  0.000000
Global_intensity         0.000000
Sub_metering_1           0.000000
Sub_metering_2           0.000000
Sub_metering_3           1.251844
dtype: float64
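
`isnull()` only detects real `NaN` values, so the other measurement columns report zero nulls even though the raw file uses `?` as a placeholder for missing readings. A hedged sketch of counting those placeholders follows, on a toy frame rather than the full dataset; `toy`, `placeholder_counts`, and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame imitating the loaded data: "?" strings in object columns, NaN in the float column.
toy = pd.DataFrame({
    "Global_active_power": ["4.216", "?", "5.374"],
    "Voltage": ["234.840", "?", "233.290"],
    "Sub_metering_3": [17.0, np.nan, 17.0],
})

# Count cells equal to the "?" placeholder in each column.
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)

# Replacing the placeholder with NaN makes isnull() see every missing reading.
cleaned = toy.replace("?", np.nan)
print(cleaned.isnull().sum())
```

Running the same comparison on `df` would show whether the `?` rows coincide with the rows where `Sub_metering_3` is null.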


# **Handling Null Values**

In [57]:
# Drop rows where Sub_metering_3 is null
df = df.dropna(subset=['Sub_metering_3'])
df

| | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16/12/2006 | 17:24:00 | 4.216 | 0.418 | 234.840 | 18.400 | 0.000 | 1.000 | 17.0 |
| 1 | 16/12/2006 | 17:25:00 | 5.360 | 0.436 | 233.630 | 23.000 | 0.000 | 1.000 | 16.0 |
| 2 | 16/12/2006 | 17:26:00 | 5.374 | 0.498 | 233.290 | 23.000 | 0.000 | 2.000 | 17.0 |
| 3 | 16/12/2006 | 17:27:00 | 5.388 | 0.502 | 233.740 | 23.000 | 0.000 | 1.000 | 17.0 |
| 4 | 16/12/2006 | 17:28:00 | 3.666 | 0.528 | 235.680 | 15.800 | 0.000 | 1.000 | 17.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2075254 | 26/11/2010 | 20:58:00 | 0.946 | 0.0 | 240.43 | 4.0 | 0.0 | 0.0 | 0.0 |
| 2075255 | 26/11/2010 | 20:59:00 | 0.944 | 0.0 | 240.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| 2075256 | 26/11/2010 | 21:00:00 | 0.938 | 0.0 | 239.82 | 3.8 | 0.0 | 0.0 | 0.0 |
| 2075257 | 26/11/2010 | 21:01:00 | 0.934 | 0.0 | 239.7 | 3.8 | 0.0 | 0.0 | 0.0 |


In [60]:
# Fill any remaining missing values in Sub_metering_3 with the median and assign it back
# (a no-op here: the dropna above already removed every row where it was null)
df['Sub_metering_3'] = df['Sub_metering_3'].fillna(df['Sub_metering_3'].median())

# Display the updated DataFrame (for the first few rows)
print("Updated DataFrame:")
print(df.head())

Updated DataFrame:
         Date      Time Global_active_power Global_reactive_power  Voltage  \
0  16/12/2006  17:24:00               4.216                 0.418  234.840   
1  16/12/2006  17:25:00               5.360                 0.436  233.630   
2  16/12/2006  17:26:00               5.374                 0.498  233.290   
3  16/12/2006  17:27:00               5.388                 0.502  233.740   
4  16/12/2006  17:28:00               3.666                 0.528  235.680   

  Global_intensity Sub_metering_1 Sub_metering_2  Sub_metering_3  
0           18.400          0.000          1.000            17.0  
1           23.000          0.000          1.000            16.0  
2           23.000          0.000          2.000            17.0  
3           23.000          0.000          1.000            17.0  
4           15.800          0.000          1.000            17.0  


In [61]:
# Count of null values in each column after filling
null_values = df.isnull().sum()
print("\nNull Values Count After Filling:")
print(null_values)


Null Values Count After Filling:
Date                     0
Time                     0
Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64


In [62]:
# Convert Global_active_power from text to float; unparseable entries become NaN
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2049280 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    float64
 3   Global_reactive_power  object 
 4   Voltage                object 
 5   Global_intensity       object 
 6   Sub_metering_1         object 
 7   Sub_metering_2         object 
 8   Sub_metering_3         float64
dtypes: float64(2), object(7)
memory usage: 156.3+ MB
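
The cell above converts only `Global_active_power`, so five measurement columns are still `object`. The same conversion could be applied to the remaining columns in one pass. The sketch below does so on a toy frame, where `toy` and `numeric_cols` are names introduced for illustration and only three of the columns are included.

```python
import pandas as pd

# Toy frame imitating text-typed measurement columns, with one "?" placeholder.
toy = pd.DataFrame({
    "Date": ["16/12/2006", "16/12/2006"],
    "Global_reactive_power": ["0.418", "?"],
    "Voltage": ["234.840", "233.630"],
    "Global_intensity": ["18.400", "23.000"],
})

# Columns assumed to hold measurements stored as text.
numeric_cols = ["Global_reactive_power", "Voltage", "Global_intensity"]

# errors="coerce" turns unparseable entries such as "?" into NaN.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```

Listing the columns explicitly keeps `Date` and `Time` untouched, since coercing them would turn every value into `NaN`.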


In [None]:
df.describe(include=object)

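| | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 |
|---|---|---|---|---|---|---|---|---|
| count | 2075259 | 2075259 | 2075259 | 2075259 | 2075259 | 2075259 | 2075259 | 2075259 |
| unique | 1442 | 1440 | 6534 | 896 | 5168 | 377 | 153 | 145 |
| top | 6/12/2008 | 17:24:00 | ? | 0.0 | ? | 1.0 | 0.0 | 0.0 |
| freq | 1440 | 1442 | 25979 | 472786 | 25979 | 169406 | 1840611 | 1408274 |

**Observation**: The `?` entries, which occur 25,979 times in both `Global_active_power` and `Voltage`, are the file's placeholder for missing readings, and their count matches the number of nulls in `Sub_metering_3`. The count of 2,075,259 and the presence of `Global_active_power` among the object columns indicate that this output reflects the DataFrame before the null rows were dropped and the column was converted.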
