The data is currently stored as `.xls` files. In this notebook, we will implement code to work with the data as `pandas.DataFrame` objects and to store it on disk as more efficient `.parquet` files.

In [5]:
pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-10.0.0-cp39-cp39-win_amd64.whl (20.0 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-10.0.0
Note: you may need to restart the kernel to use updated packages.


In [6]:
# import any required libraries here
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

First, we need to read the `.xls` files into `pandas.DataFrame` objects. You can use [pandas.read_excel](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for this.

In [3]:
# load the building data
# consider the different number of header rows!

# Every building is loaded here because the cells below refer to these variables.
OH12 = pd.read_excel(r'C:\Users\neena\Documents\Masters Courses\CaseStudy\Data Modified\OH12Modified1.xlsx')
OH14 = pd.read_excel(r'C:\Users\neena\Documents\Masters Courses\CaseStudy\Data Modified\OH14Modified1.xlsx')
Chemie = pd.read_excel(r'C:\Users\neena\Documents\Masters Courses\CaseStudy\Data Modified\ChemieModified1.xlsx')
HGII = pd.read_excel(r'C:\Users\neena\Documents\Masters Courses\CaseStudy\Data Modified\HGIIModified1.xlsx')
KH = pd.read_excel(r'C:\Users\neena\Documents\Masters Courses\CaseStudy\Data Modified\Kita HokidoModified1.xlsx')
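The comment about differing header rows can be handled with the `header` argument of `pandas.read_excel`, which takes the 0-indexed row that holds the column labels. The following is a minimal sketch; the helper name `read_building_excel`, the file names, and the header positions in the usage comments are hypothetical placeholders, not values taken from the actual files.

```python
import pandas as pd


def read_building_excel(path, header_row=0):
    """Read one building's Excel export into a DataFrame.

    `header_row` is the 0-indexed row holding the column labels;
    rows above it are skipped by pandas.
    """
    return pd.read_excel(path, header=header_row)


# Hypothetical usage: file names and header positions are placeholders.
# oh12 = read_building_excel("OH12Modified1.xlsx", header_row=0)
# chemie = read_building_excel("ChemieModified1.xlsx", header_row=2)
```

Inspecting the first few rows of each file with `df.head()` is a simple way to check whether the chosen header row is the right one.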


Next, we need to implement a function that takes a `pandas.DataFrame` and a path string as input and writes the data to disk as a `.parquet` file. You can use the [PyArrow library](https://arrow.apache.org/docs/python/parquet.html) for this.

In [36]:
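def write_as_parquet(df, path):
    """Write `df` to `<path><name>.parquet`.

    `path` is the target directory and must end with a path separator.
    `<name>` is the name of the global variable bound to `df`,
    e.g. `OH14` is written to `OH14.parquet`.
    """
    # Find the first global name that refers to this exact DataFrame object.
    # This only works if `df` is bound to a global variable.
    file_name = [name for name, value in globals().items() if value is df][0]
    arrow_table = pa.Table.from_pandas(df)
    pq.write_table(arrow_table, path + file_name + '.parquet')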

In [40]:
write_as_parquet(OH12, 'C:/Users/neena/Documents/Masters Courses/CaseStudy/Data Modified/')
write_as_parquet(OH14, 'C:/Users/neena/Documents/Masters Courses/CaseStudy/Data Modified/')
write_as_parquet(Chemie, 'C:/Users/neena/Documents/Masters Courses/CaseStudy/Data Modified/')
write_as_parquet(HGII, 'C:/Users/neena/Documents/Masters Courses/CaseStudy/Data Modified/')

Now we need the opposite functionality: a function that reads data from a `.parquet` file on disk and returns it as a `pandas.DataFrame`. Implement this function such that it can take a list of column names to load as an _optional_ parameter.

In [44]:
def load_to_pandas(path, columns=None):
    """Read the `.parquet` file at `path` and return it as a `pandas.DataFrame`.

    `columns` is an optional list of column names to load;
    if it is `None` (the default), all columns are loaded.
    """
    # pq.read_table reads every column when `columns` is None.
    return pq.read_table(path, columns=columns).to_pandas()

In [46]:
columns = ['Date','Wärmeleistung','Volumen Kanal 1','WV+ Arbeit Tarif 1']
OH12_df = load_to_pandas('C:/Users/neena/Documents/Masters Courses/CaseStudy/Data Modified/OH12.parquet', columns)
OH14_df = load_to_pandas('C:/Users/neena/Documents/Masters Courses/CaseStudy/Data Modified/OH14.parquet', columns)
HGII_df = load_to_pandas('C:/Users/neena/Documents/Masters Courses/CaseStudy/Data Modified/HGII.parquet', columns)
Chemie_df = load_to_pandas('C:/Users/neena/Documents/Masters Courses/CaseStudy/Data Modified/Chemie.parquet', columns)

Chemie_df

Unnamed: 0,Date,Wärmeleistung,Volumen Kanal 1,WV+ Arbeit Tarif 1
0,2020-03-14 15:15:00,,,1490274.3
1,2020-03-14 15:30:00,,,1490361.1
2,2020-03-14 15:45:00,,,1490449.5
3,2020-03-14 16:00:00,126.965,,1490538.8
4,2020-03-14 16:15:00,,,1490623.7
...,...,...,...,...
82210,2022-07-19 00:45:00,,10204.29,10063532.0
82211,2022-07-19 01:00:00,155.897,,10063625.0
82212,2022-07-19 01:15:00,,,10063715.0
82213,2022-07-19 01:30:00,,,10063806.8


Great! We can now store data more efficiently on disk and know how to load it again. Store all the data we have as one `.parquet` file per building.