The data is currently stored as `.xls` files. In this notebook, we will implement some code to manipulate the data as `pandas.DataFrame` objects and store it on disk as more efficient `.parquet` files.

In [1]:
#!pip install pyarrow

In [2]:
# import any required libraries here
import numpy as np # numerical python
import pandas as pd # panel data
import pyarrow as pa
import pyarrow.parquet as pq
#import re

First, we need to read the `.xls` files into `pandas.DataFrame` objects. You can use [pandas.read_excel](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for this.

In [3]:
Path = 'C:/Users/rahul/Desktop/case study/data/'

##### For each building, the datasets were merged individually. An example of how this was done is shown below.
Example: building OH12

Step 1: Excel was used to concatenate the six header rows of each column into a single string. The result was saved in a separate Excel file that holds the naming scheme.

Step 2: Read the two files for each building. If *OBIS Bezeichnung* is used as the header row, pandas automatically makes recurring column names unique by appending a numeric suffix (for example `Betriebsstunden.1`).

```python
# load the data using pandas
OH12 = pd.read_excel('C:/Users/rahul/Desktop/case study/raw_data/OH12.xls', skiprows=[0,1,2,3,5])
OH12 = OH12.rename({'OBIS Bezeichnung': 'Date',}, axis=1)
OH12['Date']= pd.to_datetime(OH12['Date'])

# Change type of all variables except date into numeric
cols=[i for i in OH12.columns if i not in ["Date"]]
for col in cols:
    OH12[col]=pd.to_numeric(OH12[col])

# load the data using pandas
OH121 = pd.read_excel('C:/Users/rahul/Desktop/case study/raw_data/OH12_01_26-07_19.xls', skiprows=[0,1,2,3,5])
OH121 = OH121.rename({'OBIS Bezeichnung': 'Date',}, axis=1)
OH121['Date']= pd.to_datetime(OH121['Date'])

# Change type of all variables except date into numeric
cols=[i for i in OH121.columns if i not in ["Date"]]
for col in cols:
    OH121[col]=pd.to_numeric(OH121[col])
```
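The automatic renaming of recurring columns mentioned in Step 2 can be checked on a toy input. This is a minimal sketch that uses `pd.read_csv` on an in-memory string instead of the real `.xls` files, on the assumption that both readers resolve duplicate header names the same way; the values are made up for illustration.

```python
import io

import pandas as pd

# Toy header with a recurring column name (values are illustrative only)
csv_text = "Date,Betriebsstunden,Betriebsstunden\n2021-01-01,1,2\n"

demo = pd.read_csv(io.StringIO(csv_text))

# The second occurrence receives the suffix ".1"
column_names = list(demo.columns)
print(column_names)
```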

Step 3: Concatenate both the files and sort them.
```python
OH12concat = pd.concat([OH12, OH121])
OH12concat.sort_values(by=['Date'], inplace=True, ignore_index=True)
```
Step 4: Use the `drop_duplicates()` function to remove the overlapping rows. The columns that only the second dataset introduced during concatenation are excluded from the comparison, because they are empty for rows from the first file.
```python
OH12concat = OH12concat.drop_duplicates(subset=OH12concat.columns.difference(['P Summe','Fehler Flags','WV+ Arbeit tariflos',
                                        'Betriebsstunden.1','Fehlerstunden.1','Temperaturdifferenz.1']))
```
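Steps 3 and 4 can be traced on two small, made-up frames that overlap on one date. This is only a sketch: the column values are invented, and `kind="stable"` is an addition here so that rows with the same date keep their file order.

```python
import pandas as pd

# Two toy exports that overlap on 2021-01-02 (values are illustrative only)
first = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "Volumen": [1.0, 2.0],
})
second = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-02", "2021-01-03"]),
    "Volumen": [2.0, 3.0],
    "P Summe": [5.0, 6.0],  # column that exists only in the second file
})

combined = pd.concat([first, second])
combined = combined.sort_values(by="Date", kind="stable", ignore_index=True)

# Compare rows only on the columns that both files share
shared = combined.columns.difference(["P Summe"]).tolist()
deduped = combined.drop_duplicates(subset=shared).reset_index(drop=True)
print(deduped)
```

Because `drop_duplicates` keeps the first occurrence by default, the overlapping row that survives is the one from the first file, so its `P Summe` value is missing. Passing `keep="last"` would keep the row from the second file instead.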
Step 5: Write the concatenated dataset back to Excel format, then copy its header columns into the naming-scheme Excel file mentioned in Step 1. A sample of the naming scheme is given below.
```python
OH12concat.to_excel('C:/Users/rahul/Desktop/case study/data/OH12.xlsx')
```
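The header concatenation from Step 1 was done in Excel, but the same idea can be sketched in pandas. The example below assumes the header levels have already been read into a `MultiIndex` and uses a made-up three-level header; it is an illustration, not the procedure that was used for the real files.

```python
import pandas as pd

# Toy multi-level header (energy type, description, unit); illustrative only
header = pd.MultiIndex.from_tuples([
    ("Wärme", "Durchfluss", "m³/h"),
    ("Wärme", "Volumen", "m³"),
])
demo = pd.DataFrame([[0.5, 10.0]], columns=header)

# Join the levels of each column into one string, without a separator
demo.columns = ["".join(str(part) for part in col) for col in demo.columns]
flat_names = list(demo.columns)
print(flat_names)
```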

In [4]:
# Naming scheme for each building. Sheet names: ['HGII', 'OH12', 'Chemie', 'OH14', 'Kito', 'Grosstage']
Naming_Scheme = pd.read_excel(Path + 'Naming_Scheme.xlsx', sheet_name = 'HGII')
Naming_Scheme

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2
0,,Value,Key
1,,EnergieartSeriennummerBeschreibungOBIS Kennzah...,Date
2,,WärmeCN08 12 01 136-1:1.8.1Wärmeenergie Tarif ...,Wärmeenergie Tarif 1
3,,WärmeCN08 12 01 136-1:80.7.1Durchflussm³/h,Durchfluss
4,,WärmeCN08 12 01 136-1:80.8.1Volumenm³,Volumen
5,,WärmeCN08 12 01 136-1:81.7.1Vorlauftemperatur°C,Vorlauftemperatur
6,,WärmeCN08 12 01 136-1:82.7.1Rücklauftemperatur°C,Rücklauftemperatur
7,,WärmeCN08 12 01 136-1:83.7.1TemperaturdifferenzK,Temperaturdifferenz
8,,WärmeCN08 12 01 136-1:84.7.1WärmeleistungkW,Wärmeleistung
9,,WärmeCN08 21 01 016-1:1.8.1Wärmeenergie Tarif ...,Wärmeenergie Tarif 1.1


In [5]:
# load the building data using pandas
# consider the different number of header rows!

Chemie = pd.read_excel(Path + 'Chemie.xlsx', parse_dates=['Date'])
HGII = pd.read_excel(Path + 'HGII.xlsx', parse_dates=['Date'])
OH12 = pd.read_excel(Path + 'OH12.xlsx', parse_dates=['Date'])
OH14 = pd.read_excel(Path + 'OH14.xlsx', parse_dates=['Date'])
Großtagespflege = pd.read_excel(Path + 'Großtagespflege.xlsx', parse_dates=['Date'])
KitaHokido = pd.read_excel(Path + 'Kita Hokido.xlsx', parse_dates=['Date'])
#Outside_temp = pd.read_csv(Path + 'metdata_dwd_Waltrop.csv')


Next, we need to implement a function that takes a `pandas.DataFrame` and a path string as input and writes the data to disk as a `.parquet` file. You can use the [PyArrow library](https://arrow.apache.org/docs/python/parquet.html) for this:

In [6]:
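def write_as_parquet(df, path):
    """Write `df` to `<path><name>.parquet`, where `name` is the global variable bound to `df`.

    The name lookup only works if `df` is bound to a global variable in this notebook.
    """
    # Take the first global variable name that refers to this exact DataFrame object
    name = [x for x in globals() if globals()[x] is df][0]

    # Convert to an Arrow table and write it to disk
    table = pa.Table.from_pandas(df)
    pq.write_table(table, path + name + '.parquet')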

In [7]:
# write_as_parquet(HGII, Path)
# write_as_parquet(Chemie, Path)
# write_as_parquet(OH12, Path)
# write_as_parquet(OH14, Path)
# write_as_parquet(Großtagespflege, Path)
# write_as_parquet(KitaHokido, Path)

Now we need the opposite functionality: a function that reads data from a `.parquet` file on disk and returns it as a `pandas.DataFrame`. Implement this function such that it can take a list of column names to load as an _optional_ parameter.

In [8]:
def load_to_pandas(path, columns=None):
    """Read a .parquet file into a pandas.DataFrame, optionally loading only `columns`."""
    # None or an empty list means "load all columns"
    df = pq.read_pandas(path, columns=columns or None).to_pandas()
    return df

In [9]:
columns =[]
#columns = ['Date', 'Wärmeenergie Tarif 1']
HGII = load_to_pandas(Path + 'HGII.parquet', columns)
Chemie = load_to_pandas(Path + 'Chemie.parquet', columns)
OH12 = load_to_pandas(Path + 'OH12.parquet', columns)
OH14 = load_to_pandas(Path + 'OH14.parquet', columns)
KitaHokido = load_to_pandas(Path + 'KitaHokido.parquet', columns)
Großtagespflege = load_to_pandas(Path + 'Großtagespflege.parquet', columns)

Great! We can now store data more efficiently on disk and know how to load it again. Store all the data we have as one `.parquet` file per building.