## The goal of the project
In this project, a CSV file with one year of power consumption data for Tetuan city is investigated using Python with pandas and Plotly Express. The goal is to analyze the attributes in the file, learn more about the city's power consumption, and gain insight into potential use cases for the dataset.

In [1]:
import numpy as np
import pandas as pd

import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

In [2]:
origin_tetuan = pd.read_csv('Tetuan City power consumption.csv')
origin_tetuan.head(5)

Unnamed: 0,DateTime,Temperature,Humidity,Wind Speed,general diffuse flows,diffuse flows,Zone 1 Power Consumption,Zone 2 Power Consumption,Zone 3 Power Consumption
0,1/1/2017 0:00,6.559,73.8,0.083,0.051,0.119,34055.6962,16128.87538,20240.96386
1,1/1/2017 0:10,6.414,74.5,0.083,0.07,0.085,29814.68354,19375.07599,20131.08434
2,1/1/2017 0:20,6.313,74.5,0.08,0.062,0.1,29128.10127,19006.68693,19668.43373
3,1/1/2017 0:30,6.121,75.0,0.083,0.091,0.096,28228.86076,18361.09422,18899.27711
4,1/1/2017 0:40,5.921,75.7,0.081,0.048,0.085,27335.6962,17872.34043,18442.40964


In [3]:
origin_tetuan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52416 entries, 0 to 52415
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   DateTime                   52416 non-null  object 
 1   Temperature                52416 non-null  float64
 2   Humidity                   52416 non-null  float64
 3   Wind Speed                 52416 non-null  float64
 4   general diffuse flows      52416 non-null  float64
 5   diffuse flows              52416 non-null  float64
 6   Zone 1 Power Consumption   52416 non-null  float64
 7   Zone 2  Power Consumption  52416 non-null  float64
 8   Zone 3  Power Consumption  52416 non-null  float64
dtypes: float64(8), object(1)
memory usage: 3.6+ MB
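
The non-null counts above already suggest that nothing is missing, and this can be checked explicitly. The sketch below uses a small hypothetical frame called `sample` (two columns with values taken from the head() output, plus one deliberately inserted NaN) rather than the real file, so that the effect of the check is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for origin_tetuan: values from the head() output,
# with one Temperature value replaced by NaN on purpose.
sample = pd.DataFrame({
    'Temperature': [6.559, 6.414, np.nan],
    'Humidity': [73.8, 74.5, 74.5],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(sample.isna().any().any())
print(has_missing)
```

Running the same two calls on origin_tetuan should report zero for every column if the info() output is accurate.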


The dataset has no missing values, so we can move directly to exploring some interesting questions:

* Which city zone consumes the most power? Which consumes the least?
* Is there a relationship between temperature and power consumption?
* When are the highest and lowest points of power consumption?

## Which city zone consumes the most power? Which consumes the least?

A line graph is a good fit for this question, because it shows how power consumption in each city zone changes over the year.
Some preprocessing is needed first:

1. Converting the DateTime column from object dtype to datetime format:

In [4]:
origin_tetuan.DateTime = pd.to_datetime(origin_tetuan.DateTime)
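
Strings such as 1/1/2017 are ambiguous between month/day and day/month order, so it can help to state the format explicitly instead of letting pandas infer it. The sketch below assumes the file uses month/day/year order (an assumption, although it agrees with the January dates shown later); `raw` and `parsed` are hypothetical names.

```python
import pandas as pd

# Timestamps copied from the head() output; month/day/year order is an assumption.
raw = pd.Series(['1/1/2017 0:00', '1/1/2017 0:10', '1/1/2017 0:20'])

# An explicit format removes the guesswork and fails loudly on unexpected strings.
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')
print(parsed.dt.month.tolist(), parsed.dt.minute.tolist())

# With this format, '2/1/2017' is read as February 1, not January 2.
ambiguous = pd.to_datetime('2/1/2017 0:00', format='%m/%d/%Y %H:%M')
print(ambiguous.month, ambiguous.day)
```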

In [5]:
tetuan = origin_tetuan.copy()

In [6]:
tetuan['date'] = tetuan.DateTime.dt.date
tetuan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52416 entries, 0 to 52415
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   DateTime                   52416 non-null  datetime64[ns]
 1   Temperature                52416 non-null  float64       
 2   Humidity                   52416 non-null  float64       
 3   Wind Speed                 52416 non-null  float64       
 4   general diffuse flows      52416 non-null  float64       
 5   diffuse flows              52416 non-null  float64       
 6   Zone 1 Power Consumption   52416 non-null  float64       
 7   Zone 2  Power Consumption  52416 non-null  float64       
 8   Zone 3  Power Consumption  52416 non-null  float64       
 9   date                       52416 non-null  object        
dtypes: datetime64[ns](1), float64(8), object(1)
memory usage: 4.0+ MB


2. Grouping the dataset by date, because the original dataset contains a power consumption record every 10 minutes over a one-year period, which results in 52,416 rows:

In [7]:
tetuan = tetuan.groupby(['date'],as_index=False).agg({
    'Temperature': 'mean',
    'Humidity': 'mean',
    'Wind Speed': 'mean',
    'general diffuse flows': 'mean',
    'diffuse flows': 'mean',
    'Zone 1 Power Consumption': 'mean',
    'Zone 2  Power Consumption': 'mean',
    'Zone 3  Power Consumption': 'mean'
})
tetuan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   date                       364 non-null    object 
 1   Temperature                364 non-null    float64
 2   Humidity                   364 non-null    float64
 3   Wind Speed                 364 non-null    float64
 4   general diffuse flows      364 non-null    float64
 5   diffuse flows              364 non-null    float64
 6   Zone 1 Power Consumption   364 non-null    float64
 7   Zone 2  Power Consumption  364 non-null    float64
 8   Zone 3  Power Consumption  364 non-null    float64
dtypes: float64(8), object(1)
memory usage: 25.7+ KB
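
The drop from 52,416 rows to 364 follows from the sampling rate: 6 readings per hour times 24 hours gives 144 readings per day. The sketch below reproduces the date-based grouping on two days of synthetic 10-minute readings; the frame `readings` and its values are made up for illustration and are not the real data.

```python
import numpy as np
import pandas as pd

# 6 readings per hour * 24 hours = 144 readings per day.
readings_per_day = 6 * 24
days_in_dataset = 52416 / readings_per_day
print(days_in_dataset)

# Synthetic stand-in: two days of 10-minute readings with values 0, 1, 2, ...
stamps = pd.date_range('2017-01-01', periods=2 * readings_per_day, freq='10min')
readings = pd.DataFrame({
    'DateTime': stamps,
    'Zone 1 Power Consumption': np.arange(len(stamps), dtype=float),
})

# Same pattern as the cell above: derive the calendar date, then average per date.
readings['date'] = readings['DateTime'].dt.date
daily = readings.groupby('date', as_index=False).agg({'Zone 1 Power Consumption': 'mean'})
print(daily)
```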


3. Adding two more variables (month and name_month) to make it easier to create a line graph of power consumption over the year:

In [8]:
tetuan['date'] = pd.to_datetime(tetuan.date)
tetuan['month'] = tetuan.date.dt.month
tetuan['name_month'] = tetuan.date.dt.strftime('%b')
tetuan.head()

Unnamed: 0,date,Temperature,Humidity,Wind Speed,general diffuse flows,diffuse flows,Zone 1 Power Consumption,Zone 2 Power Consumption,Zone 3 Power Consumption,month,name_month
0,2017-01-01,9.675299,68.519306,0.315146,121.390771,25.993924,28465.232067,17737.791287,17868.795181,1,Jan
1,2017-01-02,12.476875,71.456319,0.076563,120.404486,27.22741,28869.493671,19557.725431,17820.763053,1,Jan
2,2017-01-03,12.1,74.981667,0.076715,120.686014,28.57466,30562.447257,20057.269504,17620.803213,1,Jan
3,2017-01-04,10.509479,75.459792,0.082417,122.959319,28.827222,30689.831224,20102.077001,17673.694779,1,Jan
4,2017-01-05,10.866444,71.040486,0.083896,118.749861,29.741437,30802.911393,20033.941237,17664.176707,1,Jan


In [9]:
tetuan_line = tetuan.groupby(['month','name_month'],as_index=False).agg({
    'Zone 1 Power Consumption': 'mean',
    'Zone 2  Power Consumption': 'mean',
    'Zone 3  Power Consumption': 'mean'
})
tetuan_line

Unnamed: 0,month,name_month,Zone 1 Power Consumption,Zone 2 Power Consumption,Zone 3 Power Consumption
0,1,Jan,31032.493535,19394.444717,17746.095349
1,2,Feb,30985.753632,18787.793096,17335.002154
2,3,Mar,31155.165408,18457.937484,16947.686004
3,4,Apr,31169.76821,17633.966395,18593.167677
4,5,May,32396.009166,19977.287859,17621.100953
5,6,Jun,34605.540839,20670.928621,20430.941538
6,7,Jul,35831.553603,24147.886893,28194.111216
7,8,Aug,36435.189574,24656.216575,24648.894732
8,9,Sep,33396.681416,20180.432259,14922.798774
9,10,Oct,32827.660055,21468.993441,13264.095173
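
Grouping by both month and name_month, rather than by name_month alone, keeps the rows in calendar order. A minimal sketch of the difference, using a hypothetical four-row frame called `frame` instead of the real data:

```python
import pandas as pd

# Synthetic stand-in: one value per month for January to April.
frame = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-15', '2017-02-15', '2017-03-15', '2017-04-15']),
    'power': [1.0, 2.0, 3.0, 4.0],
})
frame['month'] = frame['date'].dt.month
frame['name_month'] = frame['date'].dt.strftime('%b')

# Grouping by the name alone sorts the groups alphabetically.
by_name = frame.groupby('name_month', as_index=False)['power'].mean()
print(by_name['name_month'].tolist())

# Adding the numeric month as the first key sorts the groups by calendar order.
by_both = frame.groupby(['month', 'name_month'], as_index=False)['power'].mean()
print(by_both['name_month'].tolist())
```

Because the x axis of the line graph follows the row order, the numeric key is what keeps the months from appearing in alphabetical order.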


In [10]:
title_text = "<span style='color:black;'>Dominant Power Consumer is</span> <span style='color:rebeccapurple;'>city Zone 1</span>"

fig = px.line(
    tetuan_line,
    x='name_month',
    y=['Zone 1 Power Consumption', 'Zone 2  Power Consumption', 'Zone 3  Power Consumption'],
    title=title_text,
    labels={'value': 'Power Consumption [watt]', 'name_month': 'Month'},
    line_shape='linear',
    color_discrete_sequence=['rebeccapurple','forestgreen','peru'],
)

fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    xaxis=dict(linecolor='black', mirror=False),
    yaxis=dict(linecolor='black', mirror=False)
)

upper_limit = 40000
fig.update_layout(
    yaxis=dict(range=[0, upper_limit])
)

for i, month in enumerate(tetuan_line['name_month']):
    if month in ['Jan', 'Jul', 'Dec']:
        for trace in fig.data:
            value = trace.y[i]
            color = trace.line.color
            fig.add_annotation(
                x=tetuan_line['name_month'][i],
                y=value + 700,
                text=f'{value:.0f}',
                showarrow=False,
                font=dict(size=13, color=color)
            )
annotations = ['city Zone 1', 'city Zone 2', 'city Zone 3']

for i,trace in enumerate(fig.data):
    last_value = trace.y[-1]
    label = annotations[i]
    color = trace.line.color
    fig.add_annotation(
        x=tetuan_line['name_month'].iloc[-1],
        y=last_value,
        text=label,
        showarrow=False,
        font=dict(size=14, color=color),
        xshift=40,
        yshift=-5
    )
fig.update_layout(height=600)
fig.update_layout(showlegend=False)
fig.update_layout(title_font_size=24)
fig.update_xaxes(title_text='')

fig.show()

From the line graph above, it is evident that city Zone 1 consumes significantly more power than city Zones 2 and 3.

Power usage in city Zones 2 and 3 changes little during the first months of the year and rises toward the summer. City Zone 3 reaches its peak in July and then drops drastically by December. City Zone 2 peaks in August, falls in September, and rises again in the subsequent months.
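
The same comparison can be made numerically. The sketch below copies the monthly means from the tetuan_line output above (rounded to two decimals, and limited to the ten months shown there) into a hypothetical frame called `monthly`, so the result describes only those months.

```python
import pandas as pd

# Monthly means copied from the tetuan_line output above (Jan to Oct only).
monthly = pd.DataFrame({
    'name_month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct'],
    'Zone 1': [31032.49, 30985.75, 31155.17, 31169.77, 32396.01,
               34605.54, 35831.55, 36435.19, 33396.68, 32827.66],
    'Zone 2': [19394.44, 18787.79, 18457.94, 17633.97, 19977.29,
               20670.93, 24147.89, 24656.22, 20180.43, 21468.99],
    'Zone 3': [17746.10, 17335.00, 16947.69, 18593.17, 17621.10,
               20430.94, 28194.11, 24648.89, 14922.80, 13264.10],
}).set_index('name_month')

# Average consumption per zone over the months listed.
zone_means = monthly.mean()
highest_zone = zone_means.idxmax()
lowest_zone = zone_means.idxmin()
print(highest_zone, lowest_zone)

# Month in which each zone reaches its highest monthly mean.
peak_month = monthly.idxmax()
print(peak_month)
```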

## Is there a relationship between temperature and power consumption?

A scatter plot is a good way to see whether there is a relationship between two numerical variables, here temperature and power consumption:

In [11]:
title_text = "<span style='color:black;'>Relationship Between Power Consumption and Temperature in</span>" \
                "<br><span style='color:rebeccapurple;'>Zone 1, </span> <span style='color:forestgreen;'>Zone 2, </span> <span style='color:peru;'>Zone 3</span>"

fig = px.scatter(tetuan, x='Temperature', y=['Zone 1 Power Consumption',
                                            'Zone 2  Power Consumption','Zone 3  Power Consumption'],
                 title=title_text,
                 labels={'value': 'Power Consumption [watt]', 'Temperature': 'Temperature [°C]'},
                 color_discrete_map={'Zone 1 Power Consumption': 'rebeccapurple', 'Zone 2  Power Consumption': 'forestgreen', 'Zone 3  Power Consumption': 'peru'})

fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    xaxis=dict(linecolor='black', mirror=False),
    yaxis=dict(linecolor='black', mirror=False)
)
fig.update_layout(title_font_size=24)
fig.update_layout(showlegend=False)

fig.show()

The scatter plot suggests a positive, roughly linear relationship between the two variables: the points form a cloud that is close to the shape of a line. Lower temperatures are associated with lower power consumption, and higher temperatures with higher power usage.
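
One way to make "close to the shape of a line" concrete is to fit a straight line and read off its slope. The sketch below uses `np.polyfit` on made-up, exactly linear values (not the real data), so the recovered slope and intercept are known in advance; on the real daily frame the same call could be applied to the Temperature column and one of the zone columns.

```python
import numpy as np

# Synthetic stand-in for daily temperature [°C] and Zone 1 consumption:
# an exactly linear relationship, so the fit is easy to verify.
temperature = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
consumption = 20000.0 + 500.0 * temperature

# A degree-1 polynomial fit returns the coefficients as (slope, intercept).
slope, intercept = np.polyfit(temperature, consumption, deg=1)
print(round(slope, 2), round(intercept, 2))
```

A positive slope means consumption rises with temperature; the real data will scatter around the fitted line rather than lie on it.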

The relationship between the variables can also be quantified numerically with a correlation matrix.

In [12]:
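# Keep only the weather and consumption columns: 'date' and 'name_month' are not numeric
# (and 'month' is only a helper), so calling corr() on the whole frame can fail on recent pandas.
numeric_cols = ['Temperature', 'Humidity', 'Wind Speed', 'general diffuse flows', 'diffuse flows',
                'Zone 1 Power Consumption', 'Zone 2  Power Consumption', 'Zone 3  Power Consumption']
corr_matrix = tetuan[numeric_cols].corr()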

fig = px.imshow(corr_matrix,
                x=corr_matrix.index,
                y=corr_matrix.columns,
                color_continuous_scale='Viridis',
                labels=dict(color='Correlation'))

for i in range(len(corr_matrix.index)):
    for j in range(len(corr_matrix.columns)):
        fig.add_annotation(x=corr_matrix.index[i],
                           y=corr_matrix.columns[j],
                           text=f'{corr_matrix.iloc[i, j]:.2f}',
                           showarrow=False,
                           font=dict(color='white'))

fig.update_layout(title='Correlation Matrix',
                  width=800,
                  height=800)
fig.update_layout(title_font_size=24)
custom_labels = ['Temperature', 'Humidity', 'Wind Speed', 'General diffuse flows', 'Diffuse flows', 'Zone 1', 'Zone 2', 'Zone 3']

fig.show()

The correlation coefficients between temperature and power consumption are 0.73, 0.42 and 0.61 for city Zone 1, Zone 2 and Zone 3 respectively. All three are positive, which supports a linear relationship between the variables: it is fairly strong for Zone 1, moderate for Zone 3, and weaker for Zone 2.
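
For reference, the coefficient reported by corr() is the Pearson correlation: the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below computes it by hand on made-up values (not the real data) and compares the result with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in values for temperature and consumption.
temperature = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0, 27.0])
consumption = pd.Series([28000.0, 29500.0, 29000.0, 32000.0, 35000.0, 34000.0])

# Pearson r written out with deviations from the mean.
dx = temperature - temperature.mean()
dy = consumption - consumption.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity; Series.corr defaults to method='pearson'.
r_pandas = temperature.corr(consumption)
print(round(r_manual, 3), round(r_pandas, 3))
```

The value always lies between -1 and 1, and values closer to 1 indicate a stronger positive linear relationship.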

## When are the highest and lowest points of power consumption?

1. Let's explore power consumption through the months:

In [13]:
title_text = "Monthly Power Consumption in city Zone 1"
fig = px.box(tetuan,x="name_month", y="Zone 1 Power Consumption",color="name_month")
fig.update_layout(title_text=title_text,
                 xaxis_title='',
                 yaxis_title='Zone 1 Power Consumption, [watt]')

fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    xaxis=dict(linecolor='black', mirror=False),
    yaxis=dict(linecolor='black', mirror=False)
)
fig.update_layout(title_font_size=24)
fig.update_layout(showlegend=False)

fig.show()

As we can see, June, July and August are the months with the highest power consumption.

2. Let's explore power consumption through the weekdays:

In [14]:
tetuan['date'] = pd.to_datetime(tetuan.date)
tetuan['name_week'] = tetuan.date.dt.strftime('%A')

In [15]:
tetuan.head()

Unnamed: 0,date,Temperature,Humidity,Wind Speed,general diffuse flows,diffuse flows,Zone 1 Power Consumption,Zone 2 Power Consumption,Zone 3 Power Consumption,month,name_month,name_week
0,2017-01-01,9.675299,68.519306,0.315146,121.390771,25.993924,28465.232067,17737.791287,17868.795181,1,Jan,Sunday
1,2017-01-02,12.476875,71.456319,0.076563,120.404486,27.22741,28869.493671,19557.725431,17820.763053,1,Jan,Monday
2,2017-01-03,12.1,74.981667,0.076715,120.686014,28.57466,30562.447257,20057.269504,17620.803213,1,Jan,Tuesday
3,2017-01-04,10.509479,75.459792,0.082417,122.959319,28.827222,30689.831224,20102.077001,17673.694779,1,Jan,Wednesday
4,2017-01-05,10.866444,71.040486,0.083896,118.749861,29.741437,30802.911393,20033.941237,17664.176707,1,Jan,Thursday


In [16]:
title_text = "Weekday Power Consumption in Zone 1"
fig = px.box(tetuan,x="name_week", y="Zone 1 Power Consumption",color="name_week")
fig.update_layout(title_text=title_text,
                 xaxis_title='',
                 yaxis_title='Zone 1 Power Consumption, [watt]')

fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    xaxis=dict(linecolor='black', mirror=False),
    yaxis=dict(linecolor='black', mirror=False)
)
fig.update_layout(title_font_size=24)
fig.update_layout(showlegend=False)

fig.show()

We can see that power consumption is lower on Sunday than on the other days.
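
The weekday comparison can also be summarized as a table of means in calendar order. The sketch below uses two weeks of made-up daily values with a lower value on Sundays (not the real data); `days`, `power` and `weekday_order` are hypothetical names, and `dt.day_name()` is used in place of strftime('%A') because it returns English names by default regardless of the system locale.

```python
import pandas as pd

# Synthetic stand-in: 14 consecutive days starting on Sunday, 2017-01-01.
days = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=14, freq='D')})
days['name_week'] = days['date'].dt.day_name()

# Made-up consumption: lower on Sundays than on other days.
days['power'] = [100.0 if name == 'Sunday' else 120.0 for name in days['name_week']]

# Average per weekday, reordered from alphabetical to calendar order.
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']
weekday_mean = days.groupby('name_week')['power'].mean().reindex(weekday_order)
print(weekday_mean)

# Weekday with the lowest average consumption.
lowest_day = weekday_mean.idxmin()
print(lowest_day)
```

The same list could be passed to the box plot through the `category_orders` argument of Plotly Express, for example `category_orders={'name_week': weekday_order}`, to fix the order of the boxes.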

3. Let's explore power consumption through the hours:

In [17]:
origin_tetuan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52416 entries, 0 to 52415
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   DateTime                   52416 non-null  datetime64[ns]
 1   Temperature                52416 non-null  float64       
 2   Humidity                   52416 non-null  float64       
 3   Wind Speed                 52416 non-null  float64       
 4   general diffuse flows      52416 non-null  float64       
 5   diffuse flows              52416 non-null  float64       
 6   Zone 1 Power Consumption   52416 non-null  float64       
 7   Zone 2  Power Consumption  52416 non-null  float64       
 8   Zone 3  Power Consumption  52416 non-null  float64       
dtypes: datetime64[ns](1), float64(8)
memory usage: 3.6 MB


In [18]:
origin_tetuan['hour'] = origin_tetuan.DateTime.dt.hour

In [19]:
fig = px.box(origin_tetuan,x="hour", y="Zone 1 Power Consumption",color="hour")
fig.update_layout(title_text='Hourly energy consumption in city Zone 1',
                 xaxis_title='hours',
                 yaxis_title='Zone 1 Power Consumption, [watt]',
                 xaxis=dict(dtick=1))

fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    xaxis=dict(linecolor='black', mirror=False),
    yaxis=dict(linecolor='black', mirror=False))

fig.update_layout(title_font_size=24)
fig.update_layout(showlegend=False)

fig.show()

Box plots reveal that power consumption:
* is lowest at night and in the early morning (01:00 to 09:00)
* steadily increases through the day (10:00 to 16:00)
* reaches its peak in the evening (17:00 to 00:00)
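
The hours read off the box plots can be cross-checked by aggregating per hour. The sketch below builds one day of 10-minute readings from a made-up load profile (the function `synthetic_load` and its numbers are hypothetical, not the real data) and picks out the hours with the highest and lowest median.

```python
import pandas as pd

def synthetic_load(hour):
    """Made-up load profile: low at night, highest in the evening."""
    if hour < 8:
        return 20000.0
    if 18 <= hour <= 21:
        return 30000.0
    return 25000.0

# One day of 10-minute timestamps, as in the original dataset.
stamps = pd.date_range('2017-01-01', periods=144, freq='10min')
readings = pd.DataFrame({'DateTime': stamps})
readings['hour'] = readings['DateTime'].dt.hour
readings['Zone 1 Power Consumption'] = readings['hour'].map(synthetic_load)

# Median per hour, which is the center line of each box in the box plot.
hourly_median = readings.groupby('hour')['Zone 1 Power Consumption'].median()
peak_hours = hourly_median[hourly_median == hourly_median.max()].index.tolist()
quiet_hours = hourly_median[hourly_median == hourly_median.min()].index.tolist()
print(peak_hours)
print(quiet_hours)
```

Applied to origin_tetuan, the same groupby on the hour column would give the real hourly medians.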

## In **conclusion**, some recommendations can be given:
* Schedule energy-intensive tasks, such as laundry or dishwashing, for the early morning hours, when demand is lowest.
* Take advantage of the lower electricity demand on Sundays by using electrical devices during this time. 
* Develop a weekly schedule that prioritizes the use of electrical devices during times of lower demand.