# Working with data
In this chapter, we look at common data preparation tasks and
how to accomplish them with dedicated libraries from the Python
ecosystem.

We look at three core libraries: 
* Numpy
* Matplotlib
* Pandas

## Introduction to Numpy
As always, make the best use of the best resources out there:

https://numpy.org/devdocs/user/absolute_beginners.html 

https://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-2-Numpy.ipynb 
https://www.youtube.com/watch?v=GB9ByFAIAH4

#### What we need
Numpy is a very powerful library used in academia and engineering.

We can only scratch the surface in this course.

We focus on its use as our **time series data holder**.

It is fast, so we won't get into trouble when manipulating and looping through large numbers of timesteps.

Here is a good introduction:
https://github.com/jrjohansson/scientific-python-lectures/blob/master/Lecture-2-Numpy.ipynb
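
To illustrate why NumPy suits hourly time series, here is a minimal sketch that sums a year of hourly values once with a plain Python loop and once with the vectorized method. The array name `load` and its values are made up for illustration.

```python
import numpy as np

# One year of hourly values (8760 timesteps); the values are arbitrary.
load = np.arange(8760, dtype=float)

# Plain Python loop: one interpreter step per timestep.
total_loop = 0.0
for value in load:
    total_loop += value

# Vectorized: the loop runs inside NumPy's compiled code.
total_vec = load.sum()

print(total_loop, total_vec)
```

Both give the same total; the vectorized form is shorter and usually much faster on long arrays.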

In [None]:
import numpy as np
a2 = np.arange(10)
a2


In [None]:
a = np.linspace(1,100,10)
a

In [None]:
z = np.zeros(8760)
len(z)

In [None]:
z.shape

In [None]:
z.size

In [None]:
z2 = np.ones(8760)
z2.sum()


In [None]:
# 

In [None]:
b= np.random.random(10)
b

In [None]:
a, b, a2

In [None]:
c = np.asarray([a,b, a2])


NumPy can also handle other data types

In [None]:
dates = np.arange('2021-01-01', '2022-01-01', dtype='datetime64[h]')
dates[1]+24
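
A short sketch of what the integer addition above does: with `dtype='datetime64[h]'`, a plain integer is counted in hours. Writing the step as an explicit `np.timedelta64` makes the unit visible. The variable names are only for illustration.

```python
import numpy as np

# Hourly timestamps for one (non-leap) year.
dates = np.arange('2021-01-01', '2022-01-01', dtype='datetime64[h]')

# Adding a plain integer counts in the array's unit (hours here).
next_day_implicit = dates[0] + 24

# An explicit timedelta64 states the unit in the code.
next_day_explicit = dates[0] + np.timedelta64(1, 'D')

print(len(dates))
print(next_day_implicit, next_day_explicit)
```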

### Write to file
How do we write the arrays in column form?

In [None]:
np.savetxt("Exercises/data.csv", c)

In [None]:
np.savetxt("Exercises/data2.csv", c.transpose())

In [None]:
a.tofile("Exercises/data3.csv", sep="\n",format="%s")
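
One possible way to get column form, sketched under the assumption that all arrays have the same length: stack them with `np.column_stack` and write a delimiter and a header line. The file name `columns.csv` and the temporary directory are chosen only so the example does not touch the course data.

```python
import os
import tempfile
import numpy as np

a = np.linspace(1, 100, 10)
a2 = np.arange(10)

# column_stack puts each 1-D array into its own column: shape (10, 2).
table = np.column_stack([a, a2])

# Write to a temporary file (hypothetical name, not part of the exercises).
path = os.path.join(tempfile.mkdtemp(), "columns.csv")
np.savetxt(path, table, delimiter=",", header="a,a2", comments="")

# Read it back, skipping the header line.
restored = np.genfromtxt(path, delimiter=",", skip_header=1)
print(restored.shape)
```

Passing `comments=""` keeps `savetxt` from prefixing the header with `# `.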

### Read from file

https://numpy.org/doc/stable/user/how-to-io.html

In [None]:
!more "Exercises\data\PV.csv" 
#on linux/mac: !cat

In [None]:
PV = np.genfromtxt("Exercises/data/PV.csv", delimiter=",")
PV

In [None]:
PV = PV[1:]
PV.ndim
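
Dropping the first row with `PV[1:]` removes a header line that was parsed as `nan`. As an alternative sketch, `genfromtxt` can skip that line while reading via `skip_header`. The CSV text below is a made-up stand-in, not the real `PV.csv`.

```python
import io
import numpy as np

# A tiny stand-in for a CSV file with a header row (made-up values).
csv_text = "roof_a,roof_b\n0.0,0.0\n1.5,2.5\n3.0,4.0\n"

# Without skip_header the header line is parsed as numbers and becomes nan.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# skip_header=1 ignores the first line while reading.
clean = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(raw.shape, clean.shape)
```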

Analysis of a PV profile

In [None]:
Dach_320kWp = PV[:,0] # all rows, first column

In [None]:
Dach_320kWp.size

In [None]:
print("Roof 320 kWp: ",round(Dach_320kWp.sum(),2) ," kWh")

What is the specific yield of the time series?

In [None]:
Ertrag_kWh = round(Dach_320kWp.sum(),2)
Spez = Ertrag_kWh / 320 # kWh/kWp
print(Spez, "kWh/kWp")
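
The calculation above can be wrapped in a small helper. This is only a sketch: the function name `specific_yield`, the `step_hours` parameter, and the constant test profile are assumptions made for illustration.

```python
import numpy as np

def specific_yield(profile_kw, peak_power_kwp, step_hours=1.0):
    """Energy per installed kWp over the profile, in kWh/kWp."""
    energy_kwh = np.sum(profile_kw) * step_hours  # kW * h = kWh
    return energy_kwh / peak_power_kwp

# Made-up profile: constant 32 kW for 1000 hours from a 320 kWp system.
profile = np.full(1000, 32.0)
print(specific_yield(profile, 320))
```

With hourly data, the sum of the kW values already equals the energy in kWh, which is why `step_hours` defaults to 1.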

Plot

In [None]:
import matplotlib.pyplot as plt
plt.plot(Dach_320kWp)
plt.ylabel("Yield [kW]")
plt.xlabel("Hours")

Determining the daily yield

In [None]:
pv = Dach_320kWp
pv_daily = []
for day in range(365):
    pv_day = pv[day*24:(day+1)*24].sum()
    pv_daily.append(pv_day)
    
pv_daily

In [None]:
plt.plot(pv_daily)
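
The daily loop can also be written without an explicit loop. The sketch below assumes exactly 8760 hourly values (365 days of 24 hours) and uses random numbers as a stand-in for the PV profile.

```python
import numpy as np

# Stand-in hourly profile for a non-leap year (random values, not real PV data).
rng = np.random.default_rng(0)
pv = rng.random(8760)

# Loop version: slice out each day and sum it.
daily_loop = [pv[day * 24:(day + 1) * 24].sum() for day in range(365)]

# Reshape to (days, hours) and sum along the hour axis.
daily_vec = pv.reshape(365, 24).sum(axis=1)

print(daily_vec.shape)
```

The reshape only works when the length is an exact multiple of 24; for a leap year or incomplete data the shape would have to be adapted.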

Plot together with original data

### Normalization

A typical task is to work with just the shape of a profile or time series, independent of its absolute size

In [None]:
pv.sum() # kWh of a 320 kWp system


In [None]:
# Scale it down to the yield of 1 kWp
pv_1kWp = pv / 320
pv_1kWp.sum() #1038 kWh / kWp

In [None]:
#let's save that for later
pv_1kWp.tofile("Exercises/data/pv_1kWp.csv", sep="\n")

In [None]:
# let's double check: does it still work?
pv_check = np.genfromtxt("Exercises/data/pv_1kWp.csv", delimiter="\n")
(pv_check != pv_1kWp)

In [None]:
pv_check.size
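
An element-wise comparison returns one boolean per timestep, which is hard to read for 8760 values. As a sketch of a more compact check, `np.allclose` reduces the comparison to a single answer and tolerates tiny rounding differences. Note that this example swaps in `np.savetxt` and `np.loadtxt` for the write and read steps, and uses random data and a temporary file instead of the course data.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(1)
profile = rng.random(8760)  # stand-in for pv_1kWp

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "profile.csv")
np.savetxt(path, profile)      # one value per line
restored = np.loadtxt(path)

# A single True/False answer instead of an array of 8760 booleans.
print(np.allclose(restored, profile))
```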

Now we can write a function that takes the path of the
profile and returns it as a NumPy array, scaled by a factor


In [None]:
def gen_pv(path, kWp=1):
    profile = np.genfromtxt(path, delimiter="\n")
    return profile * kWp

test = gen_pv("Exercises/data/pv_1kWp.csv", 500)
plt.plot(test)
plt.plot(pv)

In [None]:
# Ok, but in terms of visualization, we can do better.
# let's write a function that takes a profile and plots the daily sums

def plot_daily(profile):
    pv_daily = []
    for day in range(365):
        pv_day = profile[day*24:(day+1)*24].sum()
        pv_daily.append(pv_day)
    plt.plot(pv_daily, linewidth=1)
    
plot_daily(pv)
plot_daily(test)
plot_daily(gen_pv("Exercises/data/pv_1kWp.csv", 1000))


## Introduction to pandas
Pandas introduces two data containers: **Series** and **DataFrame**. A DataFrame consists of a number of Series and is in many ways like an Excel table or pivot table. The Series in a DataFrame (df) are its columns.

For a good overview: 
https://github.com/Tanu-N-Prabhu/Python/blob/master/Pandas/Pandas_DataFrame.ipynb
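
A minimal sketch of the relation between the two containers: picking one column out of a DataFrame gives a Series that shares the DataFrame's index. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "cost": [0.0, 1000.0, 1100.0],
    "l_T": [500.0, 600.0, 400.0],
})

column = df["cost"]  # a single column is a Series
print(type(column).__name__)
print(column.index.equals(df.index))  # it shares the DataFrame's index
```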


In [None]:
# import the pandas library
import pandas as pd

df = pd.DataFrame(["A","B","C"], columns=["test"])
df

In [None]:
# creating a dataframe from a dictionary:

Hulls =  {
        "name": ["test hull", "OIB", "PH"],
        "cost": [0., 1000., 1100.], # €/m²BGF Mehrkosten
        "l_T": [500., 600., 400.], # W/K
    }

Hulls

In [None]:
df = pd.DataFrame(Hulls) # pandas Dataframes are typically assigned to "df"
df

In [None]:
# Selecting a column:
df[["name"]]

In [None]:
# can also be done with object notation 
# if the column name is a valid python symbol
df.name

In [None]:
# selecting a row works a bit differently:
df.loc[0] # accessed by the index 
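
The selection variants used so far, side by side. This is only a sketch with a small made-up DataFrame; the variable names are for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["test hull", "OIB", "PH"],
    "cost": [0.0, 1000.0, 1100.0],
})

col_series = df["name"]    # single brackets: a Series
col_frame = df[["name"]]   # double brackets: a one-column DataFrame
row_by_label = df.loc[0]   # the row whose index label is 0
row_by_pos = df.iloc[0]    # the first row, whatever its label

print(type(col_series).__name__, type(col_frame).__name__)
print(row_by_label.equals(row_by_pos))
```

With the default integer index, label and position coincide, so `loc[0]` and `iloc[0]` return the same row. That changes as soon as the index is something else, for example dates.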

data selection, addition, deletion

In [None]:
# columns can be deleted
del df["name"]

In [None]:
df

In [None]:
# but usually it is better to just
# assign a new variable with just 
# what you want


In [None]:
df = pd.DataFrame(Hulls) # pandas Dataframes are typically assigned to "df"
df

In [None]:
df = df[["name", "l_T"]]
df

In [None]:
# renaming columns requires a mapping
# from old to new names as a dictionary
newnames = {
    "name": "Standard",
    "l_T": "Transmissions-Leitwert"
}
df.rename(columns=newnames)

In [None]:
df #??

In [None]:
# careful, some operations return a new df
# and leave the old one unchanged.
# if you wanna change the df, use inplace=True
df.rename(columns=newnames, inplace=True)
df

In [None]:
# or simply reassign to a new/old variable
df = pd.DataFrame(Hulls)
df = df.rename(columns=newnames)
df
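
A compact sketch of the behaviour described in the comments: `rename` hands back a new DataFrame, and the original keeps its column names unless the result is assigned back. The small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["OIB", "PH"], "l_T": [600.0, 400.0]})

renamed = df.rename(columns={"l_T": "Transmissions-Leitwert"})

print(list(df.columns))       # original is unchanged
print(list(renamed.columns))  # the returned copy has the new name
```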

### Working with Numpy Arrays

In [None]:

# DataFrames work well with NumPy arrays

In [None]:
arr = np.random.random(8760)
df = pd.DataFrame(arr,columns=["random"])
df

In [None]:
# we can use the usual numpy aggregation functions
# mean
df["random"].mean()

In [None]:
df.max(), df.min()

In [None]:
df["Zeros"] = np.zeros(8760)
df

In [None]:
df["Ones"] = np.ones(8760)
df

In [None]:
df.random

In [None]:
# you can access just the first
# 5 rows of the dataframe, if it is very big
df.head()

In [None]:
# sometimes it's faster to look at the info
df.info()

In [None]:
# often you wanna see, which columns are 
# in your df
df.columns

Access to the data by numeric row and column identifiers:

In [None]:
df.iloc[1,:] # [i, j] ith row, jth column

In [None]:
df.iloc[-1,-1]

Calculations on whole columns work just like with NumPy arrays

In [None]:
df["random*10"] = df.random * 10
df

In [None]:
# you can also always use python sequences
# in your columns
df["Hours"] = range(1,8761)
df

#### Datetimes

In [None]:
# Often it is very useful to have an index
# that is a DATE, not a number
# that helps a lot with automatic 
# aggregation
dates = np.arange('2021-01-01', '2022-01-01', dtype='datetime64[h]')
dates

In [None]:
# note that these dates are smart:
# you can do arithmetic with them
dates[0] + 12 # note the hour

In [None]:
# you can set the index of a dataframe
# to an appropriate datetime array or series
df.index = dates
df

In [None]:
df.loc[1] #does not work anymore, 
# because your index is now a
# different format

In [None]:
df.loc["2021-12-31 23:00:00"]

In [None]:
# our index is now "smarter"
# it is aware of days, months and years
df.index

In [None]:
df.index.day

In [None]:
# we can use it to filter our data
for month in df.index.month.unique():
    print(len(df[df.index.month == month]))
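
Besides boolean masks on `df.index.month`, a datetime index also accepts partial date strings in `.loc`. The sketch below builds its own small DataFrame with a made-up `Ones` column so that it runs on its own.

```python
import numpy as np
import pandas as pd

dates = np.arange('2021-01-01', '2022-01-01', dtype='datetime64[h]')
df = pd.DataFrame({"Ones": np.ones(len(dates))}, index=pd.DatetimeIndex(dates))

# Boolean mask on an index attribute.
february_mask = df[df.index.month == 2]

# Partial string indexing: every row whose timestamp falls in 2021-02.
february_str = df.loc["2021-02"]

print(len(february_mask), len(february_str))
```

The two selections agree here because the data covers a single year; with several years of data the month mask would pick February of every year.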
    


#### Resample
The reason to put up with datetimes

In [None]:
# with datetimes 
# you can choose a resample interval
rand_monthly = df["random"].resample("M")
rand_monthly

In [None]:
# That's like a PIVOT table:
# we need to combine it with an aggregation
monthly_mean = rand_monthly.mean()
monthly_mean

In [None]:
ones_monthly = df["Ones"].resample("M").sum()
ones_monthly/24
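
To make the resample idea concrete, here is a sketch on a series of ones: daily resampling must give 24 per day, and grouping by the month number of the index is shown as an alternative route to a per-month aggregate. The variable names are for illustration only.

```python
import numpy as np
import pandas as pd

dates = np.arange('2021-01-01', '2022-01-01', dtype='datetime64[h]')
ones = pd.Series(np.ones(len(dates)), index=pd.DatetimeIndex(dates))

# One row per calendar day, each the sum of 24 hourly ones.
daily = ones.resample("D").sum()

# Alternative to resampling: group by the month number, then convert hours to days.
days_per_month = ones.groupby(ones.index.month).sum() / 24

print(len(daily))
print(days_per_month.loc[2])
```

Unlike `resample`, the `groupby` variant merges the same month from different years, so it is only equivalent when the data covers one year.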

In [None]:
# Oh, and it is really easy to 
# plot dataframes and their resamples
ones_monthly.plot()

In [None]:
ones_monthly.plot(kind="bar")


In [None]:
fig, ax = plt.subplots(2)
(ones_monthly / 24).plot(kind="bar", ax=ax[0])
ax[0].set_title("Days in a month")
monthly_mean.plot(ax=ax[1])
ax[1].set_title("Average random")

df["random"].resample("D").mean().plot(ax=ax[1])

### Read CSV with Pandas

In [None]:

pd.read_csv?


In [None]:
em = pd.read_csv("Exercises/data/em_common_15-19.csv") # electricity map
em

In [None]:
em = pd.read_csv("Exercises/data/em_common_15-19.csv", delimiter=";", index_col="datetime", parse_dates=True) # electricity map
em.head()

In [None]:
em.total_production_avg - em.total_consumption_avg

In [None]:
em = pd.read_csv("Exercises/data/em_common_15-19.csv", 
                 delimiter=";", index_col="datetime", 
                 parse_dates=True, decimal=",") # electricity map
em.head()
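
Why the `decimal=","` argument matters, sketched on a made-up three-line sample in the same style (semicolon as separator, comma as decimal mark). The column names `production` and `consumption` are hypothetical and not those of the real file.

```python
import io
import pandas as pd

# Made-up sample: ';' separates columns, ',' is the decimal mark.
csv_text = (
    "datetime;production;consumption\n"
    "2019-01-01 00:00:00;10,5;9,0\n"
    "2019-01-01 01:00:00;11,0;9,5\n"
)

# Without decimal=",", the numbers are read as text.
as_text = pd.read_csv(io.StringIO(csv_text), delimiter=";",
                      index_col="datetime", parse_dates=True)

# With decimal=",", they become floats and arithmetic works.
as_float = pd.read_csv(io.StringIO(csv_text), delimiter=";",
                       index_col="datetime", parse_dates=True, decimal=",")

balance = as_float["production"] - as_float["consumption"]
print(as_text.dtypes["production"], as_float.dtypes["production"])
print(balance.tolist())
```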

In [None]:
em["balance"] = em.total_production_avg - em.total_consumption_avg

In [None]:
em.plot()
#plt.legend("")

In [None]:

PV = em["power_consumption_solar_avg"].dropna()
PV.sum()

In [None]:
PV

In [None]:
# now we can easily plot the daily, weekly and monthly sums
PV.resample("D").sum().plot()
#PV.resample("W").sum().plot()
#PV.resample("M").sum().plot()

In [None]:
# Try plotting the average CO2 Emissions
# per year ("carbon_intensity_avg")

co2 = em["carbon_intensity_avg"]
co2

In [None]:
co2.resample("Y").mean().plot(kind="bar")

### Working with Excel

How do we read Excel files into pandas?


In [None]:
df = pd.read_excel("Exercises/data/E-control.xlsx",
              #sheet_name="FLUCCOplus",
              skiprows=[0,1,2,3,4,5,6,8],
              #index_col=0
                  )

#df= df.iloc[:36]
#df