# Clustering Individual Household Electric Power Consumption and Future Consumption Regression Analysis

Our group proposes to use the Individual household electric power consumption data set to study power consumption trends over time. We plan to cluster the data with descriptive methods to discover usage patterns, and then to apply predictive methods such as regression to forecast future power consumption.

Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip

In [None]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
from datetime import datetime
from numpy.linalg import norm
from collections import Counter, defaultdict
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Preprocessing

## Process and clean the data
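Process the data by reading each line, dropping rows with missing measurements (marked with `?`), removing the column header, and splitting each row on the semicolon separators. Then convert the date and time stamps into a single numeric value (the fraction of the year that has elapsed) and merge it with the remaining columns, so that every column in the dataset can be treated as numeric.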
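As a worked example of the conversion described above, the sketch below processes a single row. The function name `year_fraction` and the sample row are illustrative assumptions (the row only follows the dataset's `Date;Time;...` layout); the logic mirrors the notebook's `time_to_ratio`.

```python
from datetime import datetime

def year_fraction(time_stamp):
    """Return the fraction of the calendar year elapsed at time_stamp.

    0.0 corresponds to 1 January 00:00:00; values stay below 1.0.
    """
    t = datetime.strptime(time_stamp, '%d/%m/%Y %H:%M:%S')
    start = datetime(t.year, 1, 1)
    end = datetime(t.year + 1, 1, 1)
    return (t - start).total_seconds() / (end - start).total_seconds()

# Illustrative row in the dataset's semicolon-separated layout (assumed values).
sample_line = "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000"
fields = sample_line.split(';')

# Merge date and time into one ratio, then keep the seven measurements.
ratio = year_fraction(f'{fields[0]} {fields[1]}')
numeric_row = [ratio] + [float(x) for x in fields[2:]]
print(numeric_row)
```

One consequence of this encoding is that it keeps only the position within the year: the same calendar moment in two different years maps to nearly the same value. That may suit seasonal clustering, but a regression over several years would probably need the year as a separate feature.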

In [None]:
def time_to_ratio(time_stamp):
    """Return the fraction of the calendar year elapsed at time_stamp ('dd/mm/YYYY HH:MM:SS')."""
    time = datetime.strptime(time_stamp, '%d/%m/%Y %H:%M:%S')
    start = datetime(year=time.year, month=1, day=1)
    end = datetime(year=time.year+1, month=1, day=1)
    return (time - start).total_seconds()/(end - start).total_seconds()

# read data from text document
with open('household_power_consumption.txt', 'r', encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]

# Drop rows with missing measurements, which the dataset marks with '?'
data_raw_reduced = [line for line in lines if '?' not in line]

# Split each row on ';' and drop the header row (the first line)
data_raw = [l.split(';') for l in data_raw_reduced][1:]

# Convert date and time to a numeric value/ratio
time_ratios = [time_to_ratio(f'{t[0]} {t[1]}') for t in data_raw]

# Replace the date and time strings (first two fields) with the numeric ratio
data_time_raw = [[t] + row[2:] for row, t in zip(data_raw, time_ratios)]
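The same cleaning steps could also be expressed with pandas, which the notebook already imports. This is a minimal sketch, not the notebook's method: it reads a small in-memory sample instead of `household_power_consumption.txt`, the sample rows are made up for illustration, and the `timestamp` column name is an assumption.

```python
import io
import pandas as pd

# Made-up sample in the dataset's layout; the second row has missing values.
sample = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;?;?;?;?;?\n"
    "16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000\n"
)

# sep=';' handles the separators; na_values='?' turns missing markers into NaN.
df = pd.read_csv(sample, sep=';', na_values='?')

# Drop rows with any missing measurement.
df = df.dropna()

# Combine the date and time strings into a single datetime column.
df['timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
                                 format='%d/%m/%Y %H:%M:%S')
df = df.drop(columns=['Date', 'Time'])
print(df.dtypes)
```

Because `?` is parsed as NaN, the measurement columns come out as floats directly, so no separate string-to-float conversion is needed.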


In [None]:
# Verify columns/rows/data are as expected.
print("Number of rows: {}".format(len(data_time_raw)))
print("Number of columns: {}".format(len(data_time_raw[0])))
print(data_time_raw[:10])

# Convert to a NumPy array of floats (the fields are still strings at this point).
data_time_np = np.array(data_time_raw, dtype=float)
print("Number of rows: {}".format(data_time_np.shape[0]))
print("Number of columns: {}".format(data_time_np.shape[1]))


In [None]:
## Additional Preprocessing Steps here ##
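One additional step that may be worth considering is feature scaling. K-means relies on Euclidean distance, so columns with large numeric ranges would otherwise dominate columns with small ones. The sketch below standardizes each column to zero mean and unit variance; the helper name `standardize_columns` and the demo values are assumptions for illustration only.

```python
import numpy as np

def standardize_columns(X):
    """Scale each column of X to zero mean and unit variance.

    Returns the scaled array together with the per-column mean and
    standard deviation so the transform can be reused on new data.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std, mean, std

# Made-up demo rows: a time ratio, a power reading, and a voltage reading.
demo = np.array([[0.1, 4.2, 234.8],
                 [0.5, 1.3, 241.0],
                 [0.9, 2.6, 238.2]])
scaled, mean, std = standardize_columns(demo)
print(scaled)
```

In the notebook this could be applied as `standardize_columns(data_time_np)` once the array has been built.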


# Cluster Analysis
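
The notebook imports `KMeans` from scikit-learn, so the clustering step might look roughly like the sketch below. It runs on a handful of synthetic points rather than the household data, and the choice of `n_clusters=2` is an arbitrary assumption made only so the example has an obvious answer.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points forming two well-separated groups (not the household data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# n_init=10 reruns the algorithm from several starts and keeps the best result;
# random_state makes the run reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)
print(kmeans.cluster_centers_)
```

For the real data, the number of clusters would need to be chosen with some care, for example by comparing `kmeans.inertia_` across several values of `n_clusters`.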