<a href="https://colab.research.google.com/github/xixihaha1995/esp_proj3/blob/main/_1_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preprocessing


## Correlation coefficient


1.   $Corr_{xy} = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum (x_i -\bar x)^2\sum(y_i - \bar y)^2}}$</br>
$x_i$ = values of the x variable in a sample</br>
$y_i$ = values of the y variable in a sample</br>
$\bar x$ = mean of the values of the x variable</br>
$\bar y$ = mean of the values of the y variable</br>

2. [Guess the correlation](http://guessthecorrelation.com/)
3. [Interpreting Correlations](https://rpsychologist.com/correlation/)
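
As a rough illustration of the formula in item 1, the sketch below computes the coefficient directly in plain Python. The function name `pearson_corr` and the toy values are assumptions made for this example and are not part of the project data.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of the products of the deviations from the means
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    # (this is zero, and the division fails, if either sequence is constant)
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs)
                    * sum((y - y_bar) ** 2 for y in ys))
    return num / den

# Toy values, made up for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # increases linearly with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # decreases linearly with x

r_up = pearson_corr(x, y_up)
r_down = pearson_corr(x, y_down)
print(r_up, r_down)
```

A perfectly increasing linear relationship gives a coefficient of 1, and a perfectly decreasing one gives -1.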

## Original data

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("Change directory to the following and print full pathname out:")
%cd '/content/drive/MyDrive/studying/Summer 2022/Project3/'

Mounted at /content/drive
Change directory to the following and print full pathname out:
/content/drive/MyDrive/studying/Summer 2022/Project3


In [None]:
import pandas as pd
from google.colab import data_table
data_table.enable_dataframe_formatter()

data_dir = "./data/"
preview_df = pd.read_csv(data_dir + 'pyep_results.csv')
# Convert Unix epoch seconds to pandas datetimes
preview_df['timestamp'] = pd.to_datetime(preview_df['timestamp'], unit='s')
# Derive calendar features with the vectorized .dt accessor
preview_df['minute'] = preview_df['timestamp'].dt.minute
preview_df['hour'] = preview_df['timestamp'].dt.hour
preview_df['dayofweek'] = preview_df['timestamp'].dt.dayofweek
preview_df['dayofmonth'] = preview_df['timestamp'].dt.day
preview_df['month'] = preview_df['timestamp'].dt.month

# Sort chronologically and keep the timestamp as the index rather than as a column
preview_df = preview_df.sort_values("timestamp").set_index("timestamp")

# new_cols = ['elec_hvac[J]','timestamp','month','dayofmonth','dayofweek', 'hour', 'minute',
#         'oat[C]', 'solar[W/m2]','zone_load_rate[W]']
new_cols = ['oat[C]', 'elec_hvac[J]', 'solar[W/m2]', 'zone_load_rate[W]']
preview_df = preview_df[new_cols]
preview_df.iloc[:100]

In [None]:
import plotly.express as px
# "timestamp" is not a column of preview_df at this point, so plot against the index
fig = px.line(preview_df, x=preview_df.index, y="oat[C]")
fig.update_layout(
     title='Outdoor dry air temperature [C]',
     yaxis_title="Air temperature [C]",
     legend_title="Legend list",
     font=dict(
         family="Times New Roman",
         size=20,
         color="Black"
         )
     )
fig.show()

In [None]:
import plotly.express as px
data=preview_df.corr()
fig = px.imshow(data,
                x=['oat[C]', 'elec_hvac[J]', 'solar[W/m2]', 'zone_load_rate[W]'],
                y=['oat[C]', 'elec_hvac[J]', 'solar[W/m2]', 'zone_load_rate[W]'],
               )
fig.update_xaxes(side="top")
fig.show()

## Scalers
> *Indeed many estimators are designed with the assumption that each feature takes values close to zero or more importantly that all features vary on comparable scales.* 

> *Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.*

### MinMaxScaler
$x_{transformed} = \frac{(x - x_{min})}{x_{max} - x_{min}}$</br>
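
A minimal sketch of this formula applied to a single column, in plain Python. The helper name `min_max_scale` and the sample temperatures are assumptions for illustration; raising an error on a constant column is a choice of this sketch, not a description of how sklearn behaves.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # x_max - x_min is zero, so the formula is undefined here
        raise ValueError("constant column cannot be min-max scaled")
    return [(v - lo) / (hi - lo) for v in values]

# Made-up outdoor air temperatures in degrees C
oat = [-5.0, 0.0, 10.0, 15.0]
scaled = min_max_scale(oat)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything in between keeps its relative position.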

### Normalizer (L2 Norm)
1.   Wolfram MathWorld [References](https://mathworld.wolfram.com/L2-Norm.html)
2.   The $l^2$-norm of a vector $\textbf{x} = [x_1, x_2,...x_n]^T$ is defined as follows:</br>
$|\textbf{x}|_2 = \sqrt{\sum_{k=1}^{n}|x_k|^2}$. The subscript 2 indicates that each component of $\textbf{x}$ is raised to the power of 2. </br>
For example, the $l^2$-norm for $\textbf{x}=[x_1,x_2,x_3]^T$ is given by:</br>
$|\textbf{x}|_2 = \sqrt{x_1^2 +x_2^2 +x_3^2}$. </br>
3. By default, the [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) from sklearn rescales each sample (each row) independently to unit $l^2$-norm:</br>
$x_{transformed} =\frac{x}{|\textbf{x}|_2 } $, where $\textbf{x}$ is the row that contains the entry $x$.</br>
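
Because `Normalizer` works on each sample (row) rather than on each column, a plain-Python sketch of the same operation may help. The helper name `l2_normalize_rows` and the numbers are assumptions chosen for illustration. When one entry in a row is orders of magnitude larger than the others, it dominates the norm, so its normalized value lands close to 1 while the rest shrink toward 0.

```python
import math

def l2_normalize_rows(rows):
    """Divide every row by its own Euclidean norm (hypothetical helper)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        # Leave all-zero rows unchanged to avoid dividing by zero (sketch choice)
        out.append([v / norm for v in row] if norm > 0 else list(row))
    return out

rows = [
    [3.0, 4.0],        # norm is 5
    [1.0, 1000000.0],  # the second entry dominates the norm
]
normalized = l2_normalize_rows(rows)
for r in normalized:
    # Print each normalized row with its squared norm, which should be about 1
    print(r, sum(v * v for v in r))
```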

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
min_max_sc = MinMaxScaler()
l2norm_sc = Normalizer()
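# MinMaxScaler rescales each column; Normalizer rescales each row
data_min_max_scaled = min_max_sc.fit_transform(preview_df.values)
data_l2norm_scaled = l2norm_sc.fit_transform(preview_df.values)
print(data_l2norm_scaled.shape)
print(data_l2norm_scaled)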

(35040, 4)
[[-5.03095671e-08  9.99999984e-01  0.00000000e+00  1.79010790e-04]
 [-4.16863342e-08  9.99997129e-01  0.00000000e+00  2.39608229e-03]
 [-3.52881387e-08  9.99999575e-01  0.00000000e+00  9.21623441e-04]
 ...
 [ 3.12697269e-08  1.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 2.52317883e-08  1.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 1.92073865e-08  1.00000000e+00  0.00000000e+00  0.00000000e+00]]


In [None]:
# Wrap the scaled arrays in DataFrames with the original column names
cols = ['oat[C]', 'elec_hvac[J]', 'solar[W/m2]', 'zone_load_rate[W]']
df_min_max = pd.DataFrame(data_min_max_scaled, columns=cols)
df_l2norm = pd.DataFrame(data_l2norm_scaled, columns=cols)

In [None]:
import plotly.express as px
# Correlation matrix of the L2-normalized data
data_l2norm = df_l2norm.corr()
fig = px.imshow(data_l2norm,
                x=['oat[C]', 'elec_hvac[J]', 'solar[W/m2]', 'zone_load_rate[W]'],
                y=['oat[C]', 'elec_hvac[J]', 'solar[W/m2]', 'zone_load_rate[W]'],
               )
fig.update_xaxes(side="top")
fig.show()

In [None]:
def df_to_plotly(df):
    return {'z': df.values.tolist(),
            'x': df.columns.tolist(),
            'y': df.index.tolist()}
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=3, cols=1,
                    subplot_titles=("Original Correlation", 
                                    "MinMax Scaled Correlation", 
                                    "L2Norm Scaled Correlation"))
corr_ori = preview_df.corr()
corr_min_max = df_min_max.corr()
corr_l2norm =df_l2norm.corr()

fig.add_trace(
    go.Heatmap(df_to_plotly(corr_ori)),
    row=1, col=1
)

fig.add_trace(
    go.Heatmap(df_to_plotly(corr_min_max)),
    row=2, col=1
)

fig.add_trace(
    go.Heatmap(df_to_plotly(corr_l2norm)),
    row=3, col=1
)
fig.show()