# Pandas

The Numpy library is excellent for numerical computations, but it lacks support for missing data and for non-homogeneous arrays. The **Pandas** library is built on top of Numpy and extends its functionality. It is currently one of the most widely used tools for data manipulation, providing high-performance, easy-to-use data structures and advanced data analysis tools.

In particular Pandas features:

* A fast and efficient `DataFrame` object for data manipulation with integrated indexing;
* Tools for reading and writing data between in-memory data structures and different formats (CSV, Excel, SQL, HDF5);
* Convenient label-based slicing, fancy indexing, and subsetting of large data sets;
* Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
* Smart data alignment and integrated handling of missing data;
* Aggregating and transforming data with a powerful "group-by" engine; 
* High performance merging and joining of data sets;
* Time-series functionality;
* Highly optimized for performance, with critical code paths written in Cython or C.


In [1]:
import numpy as np
import pandas as pd # standard naming convention

## Series

A Pandas Series is an extension of a Numpy 1D array. The content of a Series is equivalent to a Numpy array, and in addition the axis is labeled. Labels do not need to be unique, but they must be of a hashable type. A Series represents a single column of a tabular data structure.

Since the content is of type `ndarray`, it has to be *homogeneous*. It is still possible to store heterogeneous data, but in that case the content is of type `object`.

One of the most important examples is the time series, which is used to keep track of the time evolution of a certain quantity.

Link to the official Pandas Series [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

In [3]:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# Calling the Series constructor
# Constructor requires the data, and optionally the indices and data type
sr = pd.Series(np.arange(10)*0.5, index=tuple(letters[:10]), dtype=float) # specifying the index (or at least the data type) is strongly recommended
print("series:\n", sr, '\n')
print("series type:\n", type(sr), '\n')
print("indices:\n", sr.index, '\n') # only the indices
print("values:", sr.values, type(sr.values), '\n') # only the values, which are actually a numpy array
print("type:\n", sr.dtype, '\n') # the data type of the elements

series:
 a    0.0
b    0.5
c    1.0
d    1.5
e    2.0
f    2.5
g    3.0
h    3.5
i    4.0
j    4.5
dtype: float64 

series type:
 <class 'pandas.core.series.Series'> 

indices:
 Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object') 

values: [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5] <class 'numpy.ndarray'> 

type:
 float64 



In [4]:
print("element by index:", sr['f'], '\n') # Accessing elements by label, like arrays: the recommended way
print("element by attribute:", sr.f, '\n') # Accessing elements like attributes - not recommended

# selecting a subset of the series
subsr = sr[['d', 'f', 'h']] # note the double square brackets: a list of labels is passed as index
print("series subset:\n", subsr, type(subsr), '\n') # Multiple indexing returns another series

element by index: 2.5 

element by attribute: 2.5 

series subset:
 d    1.5
f    2.5
h    3.5
dtype: float64 <class 'pandas.core.series.Series'> 



In [5]:
# Extracting elements and operations work as for numpy arrays
print(sr[:3], '\n')
print(sr[7:], '\n')
print(sr[::3], '\n')

# Fancy indexing works on Series, too
print(sr[sr > 3], '\n') # two operations in one: sr > 3 builds a boolean Series of the same length as the original (a mask), which is then used to return only the elements where it is True

# You can also pass Series to numpy functions
print(np.exp(sr), '\n')
print(np.mean(sr), np.std(sr), '\n')

a    0.0
b    0.5
c    1.0
dtype: float64 

h    3.5
i    4.0
j    4.5
dtype: float64 

a    0.0
d    1.5
g    3.0
j    4.5
dtype: float64 

h    3.5
i    4.0
j    4.5
dtype: float64 

a     1.000000
b     1.648721
c     2.718282
d     4.481689
e     7.389056
f    12.182494
g    20.085537
h    33.115452
i    54.598150
j    90.017131
dtype: float64 

2.25 1.4361406616345072 



Series may contain non-homogeneous data, i.e. elements of different types; in this case, the data type is reported as `object`. Pandas handles non-homogeneous data without problems, but at the price of less time-efficient operations.

In [6]:
# Series can be created from a python dictionary, too
# Note that the elements can be of different types
d = {'b' : 1, 'a' : 'cat', 'c' : [2, 3]} # a dictionary maps naturally onto a Series: the keys become the index, the values become the values
so = pd.Series(d)
print(so, '\n')

b         1
a       cat
c    [2, 3]
dtype: object 



A key difference between Pandas Series and Numpy arrays is that operations between Series **automatically align the data based on the label**.

Thus, you can write operations without considering whether the Series involved have the same labels, or even the same size: operations between Series are driven solely by the index labels.

If there is no matching element, the resulting value would be a `NaN`.

In [9]:
s = pd.Series(np.arange(5), index=tuple(letters[:5]))
print("series:\n", s, '\n')

s1 = s[1:] # drop the first element (labels b..e remain)
print("shifted series:\n", s1, '\n')

s2 = s1 + s # s1 and s have different lengths, but the operation is still possible: the result is computed label by label
print("shifted sum:\n", s2, '\n') # NaN marks an invalid value, an invalid operation, or an empty slot

s3 = s1 + s[:-1] # label 'e' is missing from s[:-1]
print("double shifted sum:\n", s3, '\n')

series:
 a    0
b    1
c    2
d    3
e    4
dtype: int64 

shifted series:
 b    1
c    2
d    3
e    4
dtype: int64 

shifted sum:
 a    NaN
b    2.0
c    4.0
d    6.0
e    8.0
dtype: float64 

double shifted sum:
 a    NaN
b    2.0
c    4.0
d    6.0
e    NaN
dtype: float64 
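
When the `NaN` produced by non-matching labels is not wanted, the arithmetic methods of a Series accept a `fill_value` argument. The following is a minimal sketch (the variable names `plain_sum` and `filled_sum` are illustrative) that treats a missing partner as 0:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=list('abcde'))
s1 = s.iloc[1:]                        # labels b..e

plain_sum = s1 + s                     # 'a' has no partner in s1 -> NaN
filled_sum = s1.add(s, fill_value=0)   # a missing partner is treated as 0

print(plain_sum)
print(filled_sum)
```

Whether 0 is a sensible replacement depends on the meaning of the data, so this should be a deliberate choice.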



### Time series

**Datetime**

When dealing with time, Python provides the `datetime` library, which stores the date and time in a dedicated object with several attributes to access the relevant quantities (day, month, year, hours, minutes, seconds, ...).

In [10]:
# To define a date, the datetime module is very useful
import datetime as dt

date = dt.date.today()
print("Today's date:", date)

# specify year, month, day, hour, minutes, seconds, and microseconds
date = dt.datetime(2020, 11, 12, 10, 45, 10, 15) # all these fields can be specified
print("Date and time:", date)
print("Month:", date.month, "and minutes:", date.minute)

Today's date: 2022-11-09
Date and time: 2020-11-12 10:45:10.000015
Month: 11 and minutes: 45


**Pandas Timestamps**

Timestamped data is the most basic type of time series data that associates values with points in time.

Functions like `pd.to_datetime` can be used to convert between different formats, for instance when reading times stored as strings in a dataset:

In [11]:
# Get the timestamp, i.e. the number of nanoseconds elapsed since January 1st 1970
tstamp = pd.Timestamp(date) # an object that identifies a specific date and time
#tstamp = pd.Timestamp(dt.datetime(1970, 1, 1, 0, 0, 0, 1)) # the origin of time
print("Timestamp:", tstamp.value)

# when creating a timestamp the format can be explicitly passed
ts = pd.to_datetime('2010/11/12', format='%Y/%m/%d') # pass the string and the format used to parse it
print("Time:", ts, ", timestamp:", ts.value, ", type:", type(ts)) # ts.value is the timestamp in nanoseconds

ts = pd.to_datetime('12-11-2010 10:39', format='%d-%m-%Y %H:%M')
print("Time:", ts, ", timestamp:", ts.value, ", type:", type(ts))

Timestamp: 1605177910000015000
Time: 2010-11-12 00:00:00 , timestamp: 1289520000000000000 , type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Time: 2010-11-12 10:39:00 , timestamp: 1289558340000000000 , type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
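
Real datasets often contain entries that cannot be parsed. As a hedged sketch (the list `raw` is a made-up stand-in for a column read from a file), `pd.to_datetime` also accepts a list of strings, and `errors='coerce'` replaces unparsable entries with `NaT` instead of raising an exception:

```python
import pandas as pd

raw = ['2010/11/12', '2010/11/13', 'not a date']   # hypothetical input column

# errors='coerce' turns unparsable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y/%m/%d', errors='coerce')
print(parsed)
```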


**Pandas Date range**

Time series are very often used to describe the behaviour of a quantity as a function of time. Pandas has a special type of index for that, `DatetimeIndex`, that can be created e.g. with the function `pd.date_range()`.

In [12]:
# create DatetimeIndex using ranges:
days = pd.date_range(date, periods=7, freq='D') # periods = number of timestamps to generate
print("7 days range:", days)

seconds = pd.date_range(date, periods=3600, freq='s')
print("1 hour in seconds:", seconds)

7 days range: DatetimeIndex(['2020-11-12 10:45:10.000015', '2020-11-13 10:45:10.000015',
               '2020-11-14 10:45:10.000015', '2020-11-15 10:45:10.000015',
               '2020-11-16 10:45:10.000015', '2020-11-17 10:45:10.000015',
               '2020-11-18 10:45:10.000015'],
              dtype='datetime64[ns]', freq='D')
1 hour in seconds: DatetimeIndex(['2020-11-12 10:45:10.000015', '2020-11-12 10:45:11.000015',
               '2020-11-12 10:45:12.000015', '2020-11-12 10:45:13.000015',
               '2020-11-12 10:45:14.000015', '2020-11-12 10:45:15.000015',
               '2020-11-12 10:45:16.000015', '2020-11-12 10:45:17.000015',
               '2020-11-12 10:45:18.000015', '2020-11-12 10:45:19.000015',
               ...
               '2020-11-12 11:45:00.000015', '2020-11-12 11:45:01.000015',
               '2020-11-12 11:45:02.000015', '2020-11-12 11:45:03.000015',
               '2020-11-12 11:45:04.000015', '2020-11-12 11:45:05.000015',
               '2020-11-12 11

To learn more about the frequency strings, please check the [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases).
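
As a small illustration of a few frequency strings (only a selection; the start date is arbitrary), the sketch below builds ranges with calendar days, business days, weeks and month starts:

```python
import pandas as pd

start = '2020-11-09'   # a Monday

daily = pd.date_range(start, periods=3, freq='D')         # calendar days
business = pd.date_range(start, periods=6, freq='B')      # business days, weekends skipped
weekly = pd.date_range(start, periods=3, freq='W')        # weekly, anchored on Sundays
month_start = pd.date_range(start, periods=3, freq='MS')  # first day of each month

for name, idx in [('D', daily), ('B', business), ('W', weekly), ('MS', month_start)]:
    print(name, list(idx.strftime('%Y-%m-%d')))
```

Note that anchored frequencies such as `'W'` and `'MS'` start from the first matching date on or after the start date, so the start date itself may not appear in the range.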

A standard series can be created, and (a range of) elements can be used as indices:

In [13]:
print("index:\n", days, '\n')
tseries = pd.Series(np.random.normal(10, 1, len(days)), index=days)
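print("time series:\n", days, '\n') # note: this prints the index again; print(tseries) would show the values
# Extracting elements
print("slice by position:\n", tseries[0:4], '\n')
print("slice by date range:\n", tseries['2020-9-11' : '2020-9-14'], '\n') # label slices include the end point; this range lies outside the index (which starts on 2020-11-12), so the result is empty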

index:
 DatetimeIndex(['2020-11-12 10:45:10.000015', '2020-11-13 10:45:10.000015',
               '2020-11-14 10:45:10.000015', '2020-11-15 10:45:10.000015',
               '2020-11-16 10:45:10.000015', '2020-11-17 10:45:10.000015',
               '2020-11-18 10:45:10.000015'],
              dtype='datetime64[ns]', freq='D') 

time series:
 DatetimeIndex(['2020-11-12 10:45:10.000015', '2020-11-13 10:45:10.000015',
               '2020-11-14 10:45:10.000015', '2020-11-15 10:45:10.000015',
               '2020-11-16 10:45:10.000015', '2020-11-17 10:45:10.000015',
               '2020-11-18 10:45:10.000015'],
              dtype='datetime64[ns]', freq='D') 

slice by position:
 2020-11-12 10:45:10.000015     8.405041
2020-11-13 10:45:10.000015    10.951249
2020-11-14 10:45:10.000015     9.484155
2020-11-15 10:45:10.000015     8.111454
Freq: D, dtype: float64 

slice by date range:
 Series([], Freq: D, dtype: float64) 
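
The date-range slice above is empty because the requested dates fall outside the index. A minimal sketch of a slice that does overlap the index (with placeholder values instead of random numbers, so the result is reproducible):

```python
import numpy as np
import pandas as pd

days = pd.date_range('2020-11-12 10:45:10', periods=7, freq='D')
tseries = pd.Series(np.arange(7, dtype=float), index=days)  # placeholder values

# Partial date strings select whole days; both end points are included
subset = tseries.loc['2020-11-12':'2020-11-14']
print(subset)
```

Unlike positional slicing, the end label is part of the result, so three days are returned here.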



`pd.to_datetime` can also be used to create a `DatetimeIndex` if the argument is a list:

In [14]:
print(pd.to_datetime([1, 2, 3, 4], unit='D', origin=pd.Timestamp('1980-02-03')))

DatetimeIndex(['1980-02-04', '1980-02-05', '1980-02-06', '1980-02-07'], dtype='datetime64[ns]', freq=None)


## DataFrame

A pandas DataFrame can be thought of as a tabular spreadsheet, although the performance, the functionalities and the capabilities are very different.
It behaves like a dictionary of Series: each Series is a column, and all the columns share the same row index, so a DataFrame has both a column index and a row index.

Similarly to Series, the DataFrame structure also contains labeled axes (rows and columns). Arithmetic operations **align on both row and column labels**. Each column in a DataFrame is a Series object: as a matter of fact, a DataFrame can be thought of as a dict-like container for Series objects.
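
The alignment on both axes can be sketched with two small DataFrames (the names `left` and `right` and their contents are made up for illustration) that only partially share row and column labels:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'B': [10, 20], 'C': [30, 40]}, index=['y', 'z'])

# The result spans the union of the labels; only the cell ('y', 'B')
# exists in both operands, every other cell is NaN
total = left + right
print(total)
```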

The elements can be of all types, and missing data could be present too (represented as NaN).

For future reference (or for people already familiar with R), a pandas DataFrame is also similar to the R DataFrame.

Link to the official [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

### Constructor

A DataFrame object can be created by passing a dictionary of objects. Note that the dictionary values are not homogeneous and do not all have the same length. In these cases, pandas automatically adjusts the sizes, replicating scalar values or adding NaN where necessary.

In [15]:
df = pd.DataFrame({
    'A' : 1.,
    'B' : pd.Timestamp('20130102'),
    'C' : pd.Series(3, index=range(4), dtype='float32'),
    'D' : np.arange(7, 11),
    'E' : pd.Categorical(["test", "train", "test", "train"]), # a Series that represents a category label
})
# the keys of the dictionary represent the labels of the columns

df # row indices were not specified for every element, so they are created automatically

Unnamed: 0,A,B,C,D,E
0,1.0,2013-01-02,3.0,7,test
1,1.0,2013-01-02,3.0,8,train
2,1.0,2013-01-02,3.0,9,test
3,1.0,2013-01-02,3.0,10,train
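
The example above only shows the replication of scalar values. The following sketch (column names are illustrative) shows where the `NaN` entries come from: when the dictionary contains Series with different indices, the row index is the union of the labels and the missing entries are filled with `NaN`:

```python
import pandas as pd

frame = pd.DataFrame({
    'scalar': 1.0,                                        # broadcast to every row
    'short': pd.Series([10, 20], index=['a', 'b']),       # no value for 'c'
    'full': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
})
print(frame)
```

The `short` column becomes a float column, because `NaN` cannot be stored in an integer column.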


An example of a DataFrame with a `DatetimeIndex` object as index:

In [16]:
entries = 10
columns = ['A', 'B', 'C', 'D']
dates = pd.date_range('11/9/2020 14:45:00', freq='h', periods=entries) # parsed as month/day/year by default
df = pd.DataFrame(np.random.randn(entries, len(columns)), index=dates, columns=columns)
df # note that the date is printed as year-month-day

Unnamed: 0,A,B,C,D
2020-11-09 14:45:00,-0.584588,-1.607488,-0.281793,0.128556
2020-11-09 15:45:00,0.549843,-1.484993,2.135181,0.346267
2020-11-09 16:45:00,-0.39577,0.405768,1.556294,0.089274
2020-11-09 17:45:00,-0.663736,0.69242,1.049939,-0.837275
2020-11-09 18:45:00,-1.682505,-0.17058,-0.901752,1.3469
2020-11-09 19:45:00,0.055407,1.822185,0.370045,-0.485863
2020-11-09 20:45:00,-0.850912,-0.338874,-1.288974,1.076645
2020-11-09 21:45:00,-1.764013,0.720927,0.5805,0.35555
2020-11-09 22:45:00,-0.336454,-0.507634,0.790993,-0.102413
2020-11-09 23:45:00,0.073741,0.169104,0.289152,1.431716


### Viewing data: useful methods

In [17]:
df.head() # shows only the first rows (5 by default; the number can be specified)

Unnamed: 0,A,B,C,D
2020-11-09 14:45:00,-0.584588,-1.607488,-0.281793,0.128556
2020-11-09 15:45:00,0.549843,-1.484993,2.135181,0.346267
2020-11-09 16:45:00,-0.39577,0.405768,1.556294,0.089274
2020-11-09 17:45:00,-0.663736,0.69242,1.049939,-0.837275
2020-11-09 18:45:00,-1.682505,-0.17058,-0.901752,1.3469


In [None]:
df.tail(4) # same, but starting from the end

In [18]:
df.index #index object

DatetimeIndex(['2020-11-09 14:45:00', '2020-11-09 15:45:00',
               '2020-11-09 16:45:00', '2020-11-09 17:45:00',
               '2020-11-09 18:45:00', '2020-11-09 19:45:00',
               '2020-11-09 20:45:00', '2020-11-09 21:45:00',
               '2020-11-09 22:45:00', '2020-11-09 23:45:00'],
              dtype='datetime64[ns]', freq='H')

In [19]:
df.columns #index object

Index(['A', 'B', 'C', 'D'], dtype='object')

In [20]:
df.values # this is a numpy array instead

array([[-0.58458792, -1.6074878 , -0.28179345,  0.1285557 ],
       [ 0.54984262, -1.48499314,  2.13518067,  0.34626668],
       [-0.39576955,  0.40576812,  1.55629419,  0.08927389],
       [-0.66373587,  0.69242043,  1.04993934, -0.83727501],
       [-1.68250488, -0.17058024, -0.90175208,  1.34689951],
       [ 0.05540692,  1.82218537,  0.37004532, -0.48586293],
       [-0.85091188, -0.33887419, -1.28897448,  1.07664518],
       [-1.7640135 ,  0.72092687,  0.58050046,  0.35554983],
       [-0.33645416, -0.50763441,  0.7909928 , -0.10241295],
       [ 0.07374115,  0.16910355,  0.28915244,  1.43171605]])

In [21]:
df.describe() # statistical overview of the DataFrame

Unnamed: 0,A,B,C,D
count,10.0,10.0,10.0,10.0
mean,-0.559899,-0.029917,0.429959,0.334936
std,0.737466,1.037876,1.052615,0.754224
min,-1.764013,-1.607488,-1.288974,-0.837275
25%,-0.804118,-0.465444,-0.139057,-0.054491
50%,-0.490179,-0.000738,0.475273,0.237411
75%,-0.042558,0.620757,0.985203,0.896371
max,0.549843,1.822185,2.135181,1.431716


In [23]:
df.T # transpose

Unnamed: 0,2020-11-09 14:45:00,2020-11-09 15:45:00,2020-11-09 16:45:00,2020-11-09 17:45:00,2020-11-09 18:45:00,2020-11-09 19:45:00,2020-11-09 20:45:00,2020-11-09 21:45:00,2020-11-09 22:45:00,2020-11-09 23:45:00
A,-0.584588,0.549843,-0.39577,-0.663736,-1.682505,0.055407,-0.850912,-1.764013,-0.336454,0.073741
B,-1.607488,-1.484993,0.405768,0.69242,-0.17058,1.822185,-0.338874,0.720927,-0.507634,0.169104
C,-0.281793,2.135181,1.556294,1.049939,-0.901752,0.370045,-1.288974,0.5805,0.790993,0.289152
D,0.128556,0.346267,0.089274,-0.837275,1.3469,-0.485863,1.076645,0.35555,-0.102413,1.431716


In [24]:
df.sort_index(axis=1, ascending=False) # the axis selects which index to sort: axis=1 sorts the column labels

Unnamed: 0,D,C,B,A
2020-11-09 14:45:00,0.128556,-0.281793,-1.607488,-0.584588
2020-11-09 15:45:00,0.346267,2.135181,-1.484993,0.549843
2020-11-09 16:45:00,0.089274,1.556294,0.405768,-0.39577
2020-11-09 17:45:00,-0.837275,1.049939,0.69242,-0.663736
2020-11-09 18:45:00,1.3469,-0.901752,-0.17058,-1.682505
2020-11-09 19:45:00,-0.485863,0.370045,1.822185,0.055407
2020-11-09 20:45:00,1.076645,-1.288974,-0.338874,-0.850912
2020-11-09 21:45:00,0.35555,0.5805,0.720927,-1.764013
2020-11-09 22:45:00,-0.102413,0.790993,-0.507634,-0.336454
2020-11-09 23:45:00,1.431716,0.289152,0.169104,0.073741


In [25]:
df.sort_values(by="C", ascending=False) # sort the rows by the values of a given column

Unnamed: 0,A,B,C,D
2020-11-09 15:45:00,0.549843,-1.484993,2.135181,0.346267
2020-11-09 16:45:00,-0.39577,0.405768,1.556294,0.089274
2020-11-09 17:45:00,-0.663736,0.69242,1.049939,-0.837275
2020-11-09 22:45:00,-0.336454,-0.507634,0.790993,-0.102413
2020-11-09 21:45:00,-1.764013,0.720927,0.5805,0.35555
2020-11-09 19:45:00,0.055407,1.822185,0.370045,-0.485863
2020-11-09 23:45:00,0.073741,0.169104,0.289152,1.431716
2020-11-09 14:45:00,-0.584588,-1.607488,-0.281793,0.128556
2020-11-09 18:45:00,-1.682505,-0.17058,-0.901752,1.3469
2020-11-09 20:45:00,-0.850912,-0.338874,-1.288974,1.076645


### Selection

#### Slicing

DataFrame slicing selects a subset of the DataFrame, or an entire column (a Series):

In [26]:
# standard and safe
print(df['A'], '\n', type(df['A']), '\n') # Returns a Series (a column)

# equivalent but dangerous (imagine blank spaces in the name of the column, or a column named "T")
print(df.A, '\n') # strongly discouraged

2020-11-09 14:45:00   -0.584588
2020-11-09 15:45:00    0.549843
2020-11-09 16:45:00   -0.395770
2020-11-09 17:45:00   -0.663736
2020-11-09 18:45:00   -1.682505
2020-11-09 19:45:00    0.055407
2020-11-09 20:45:00   -0.850912
2020-11-09 21:45:00   -1.764013
2020-11-09 22:45:00   -0.336454
2020-11-09 23:45:00    0.073741
Freq: H, Name: A, dtype: float64 
 <class 'pandas.core.series.Series'> 

2020-11-09 14:45:00   -0.584588
2020-11-09 15:45:00    0.549843
2020-11-09 16:45:00   -0.395770
2020-11-09 17:45:00   -0.663736
2020-11-09 18:45:00   -1.682505
2020-11-09 19:45:00    0.055407
2020-11-09 20:45:00   -0.850912
2020-11-09 21:45:00   -1.764013
2020-11-09 22:45:00   -0.336454
2020-11-09 23:45:00    0.073741
Freq: H, Name: A, dtype: float64 
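
The danger of attribute-style access can be made concrete with a small, made-up DataFrame whose column names clash with an existing attribute or contain a blank space:

```python
import pandas as pd

frame = pd.DataFrame({'T': [1, 2], 'my column': [3, 4]})

print(frame['T'])          # the column named 'T'
print(frame.T)             # the transposed DataFrame, not the column
print(frame['my column'])  # a name with a space has no attribute-style equivalent
```

Bracket access always refers to the column, which is why it is the safe choice.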



Numpy-like slicing by row ranges is possible, and usually returns a **view** of the original DataFrame:

In [27]:
# selecting rows by range. Returns another DataFrame (usually a view)
print(df[0:3], '\n') # a range selects rows, whereas a single label selects a column

# or by index range
print(df["2020-11-09 14:45:00" : "2020-11-09 16:45:00"])

                            A         B         C         D
2020-11-09 14:45:00 -0.584588 -1.607488 -0.281793  0.128556
2020-11-09 15:45:00  0.549843 -1.484993  2.135181  0.346267
2020-11-09 16:45:00 -0.395770  0.405768  1.556294  0.089274 

                            A         B         C         D
2020-11-09 14:45:00 -0.584588 -1.607488 -0.281793  0.128556
2020-11-09 15:45:00  0.549843 -1.484993  2.135181  0.346267
2020-11-09 16:45:00 -0.395770  0.405768  1.556294  0.089274


#### Selection by label

The most common way to select elements, rows, or columns in a DataFrame is the `.loc[]` indexer, which selects purely by label.

`.loc` supports multi-indexing, and usually returns a **copy** of the DataFrame.

In [28]:
# getting a part of the DataFrame (in this case, a row) using a label. Returns a Series
dfs = df.loc[dates[0]] # equivalent to df.loc[df.index[0]]
print(dfs, '\n', type(dfs), '\n')

A   -0.584588
B   -1.607488
C   -0.281793
D    0.128556
Name: 2020-11-09 14:45:00, dtype: float64 
 <class 'pandas.core.series.Series'> 



In [29]:
# selecting on a multi-axis by label:
dfa = df.loc[:, ['A','B']] # first argument: rows (here all of them); second argument: columns (A and B)
dfa # returns a DataFrame rather than a Series (usually a copy of the original DataFrame)

Unnamed: 0,A,B
2020-11-09 14:45:00,-0.584588,-1.607488
2020-11-09 15:45:00,0.549843,-1.484993
2020-11-09 16:45:00,-0.39577,0.405768
2020-11-09 17:45:00,-0.663736,0.69242
2020-11-09 18:45:00,-1.682505,-0.17058
2020-11-09 19:45:00,0.055407,1.822185
2020-11-09 20:45:00,-0.850912,-0.338874
2020-11-09 21:45:00,-1.764013,0.720927
2020-11-09 22:45:00,-0.336454,-0.507634
2020-11-09 23:45:00,0.073741,0.169104


In [30]:
# showing label slicing, both endpoints are included:
df.loc['2020-11-09 18:45:00':'2020-11-09 20:45:00', ['A','B']]

Unnamed: 0,A,B
2020-11-09 18:45:00,-1.682505,-0.17058
2020-11-09 19:45:00,0.055407,1.822185
2020-11-09 20:45:00,-0.850912,-0.338874


In [31]:
# getting an individual element (when its coordinates are known)
print(df.loc[dates[1], 'A'], '\n', type(df.loc[dates[1], 'A']), '\n')

0.5498426164447107 
 <class 'numpy.float64'> 



The `.at[]` indexer is equivalent to `.loc[]` for scalar access. Use `.at[]` if you only need to get or set a single value in a DataFrame or Series.

In [32]:
print(df.at[dates[1], 'A']) # for a single element either loc or at can be used (both take square brackets)

0.5498426164447107


#### Selecting by position

`.iloc[]` is similar to `.loc[]`, but instead of labels it uses pure integer-location based indexing, i.e. selection by position.

But differently from `.loc[]`, `.iloc[]` usually returns a **view**, not a copy.

In [33]:
# select via the position of the passed integers:
print(df.iloc[3], '\n', type(df.iloc[3]), '\n')

# row and column ranges selected with numpy-like notation:
dfv = df.iloc[3:5, 0:2]
print(dfv, '\n')

A   -0.663736
B    0.692420
C    1.049939
D   -0.837275
Name: 2020-11-09 17:45:00, dtype: float64 
 <class 'pandas.core.series.Series'> 

                            A        B
2020-11-09 17:45:00 -0.663736  0.69242
2020-11-09 18:45:00 -1.682505 -0.17058 
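
One practical difference between the two indexers is the treatment of the end point of a slice. A minimal sketch with a small stand-in DataFrame (`frame` is illustrative, not the `df` used above):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-11-09 14:45:00', periods=5, freq='h')
frame = pd.DataFrame(np.arange(10).reshape(5, 2), index=dates, columns=['A', 'B'])

by_label = frame.loc[dates[1]:dates[3]]   # end label included: 3 rows
by_position = frame.iloc[1:3]             # end position excluded: 2 rows
print(len(by_label), len(by_position))
```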



In [34]:
# selection of multiple elements with lists
df.iloc[[1, 2, 4], [0, 2]] # selecting rows 1,2 and 4 for columns 0 and 2

Unnamed: 0,A,C
2020-11-09 15:45:00,0.549843,2.135181
2020-11-09 16:45:00,-0.39577,1.556294
2020-11-09 18:45:00,-1.682505,-0.901752


In [35]:
# slicing rows explicitly
df.iloc[1:3, :]

# slicing columns explicitly
df.iloc[:, 1:3]

Unnamed: 0,B,C
2020-11-09 14:45:00,-1.607488,-0.281793
2020-11-09 15:45:00,-1.484993,2.135181
2020-11-09 16:45:00,0.405768,1.556294
2020-11-09 17:45:00,0.69242,1.049939
2020-11-09 18:45:00,-0.17058,-0.901752
2020-11-09 19:45:00,1.822185,0.370045
2020-11-09 20:45:00,-0.338874,-1.288974
2020-11-09 21:45:00,0.720927,0.5805
2020-11-09 22:45:00,-0.507634,0.790993
2020-11-09 23:45:00,0.169104,0.289152


Similarly to `.loc[]` and `.at[]`, there is also `.iat[]` alongside `.iloc[]`:

In [36]:
# selecting an individual element by position: no difference between iloc and iat
print(df.iloc[1,1], ", type:", type(df.iloc[1,1]))
print(df.iat[1,1], ", type:", type(df.iat[1,1]))

-1.4849931433112153 , type: <class 'numpy.float64'>
-1.4849931433112153 , type: <class 'numpy.float64'>


#### Masks

Boolean masks can be used in the same way as numpy, and they represent a very powerful way of filtering out data with certain features. Just like Numpy fancy indexing, using a mask usually returns a **copy** of the DataFrame.

In [37]:
# Selecting on the basis of boolean conditions applied to the whole DataFrame
dfc = df[df > 0] # the condition is applied to the WHOLE DataFrame (all rows and columns)
dfc.iat[0, 0] = -99
# a DataFrame with the same shape is returned, with NaN's where condition is not met
# Note that when a NaN is present in a column of integers, the column (Series) is cast to float (the elements filtered out by the mask are replaced by NaN)
dfc

Unnamed: 0,A,B,C,D
2020-11-09 14:45:00,-99.0,,,0.128556
2020-11-09 15:45:00,0.549843,,2.135181,0.346267
2020-11-09 16:45:00,,0.405768,1.556294,0.089274
2020-11-09 17:45:00,,0.69242,1.049939,
2020-11-09 18:45:00,,,,1.3469
2020-11-09 19:45:00,0.055407,1.822185,0.370045,
2020-11-09 20:45:00,,,,1.076645
2020-11-09 21:45:00,,0.720927,0.5805,0.35555
2020-11-09 22:45:00,,,0.790993,
2020-11-09 23:45:00,0.073741,0.169104,0.289152,1.431716


In [38]:
# Filter by a boolean condition on the values of a single column
dfc[dfc['B'] < 0.5] # all columns are kept, but only the rows where the condition on column B holds (rows with NaN in B are dropped, since the comparison is False)

Unnamed: 0,A,B,C,D
2020-11-09 16:45:00,,0.405768,1.556294,0.089274
2020-11-09 23:45:00,0.073741,0.169104,0.289152,1.431716
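
Several conditions can be combined in a single mask. The sketch below uses a small hand-written DataFrame (values chosen only for illustration); each condition needs its own parentheses, and the element-wise operators `&`, `|` and `~` replace `and`, `or` and `not`:

```python
import pandas as pd

frame = pd.DataFrame({'A': [-1.0, 0.2, 0.7, 1.5], 'B': [0.3, -0.4, 0.9, 0.1]})

# each condition needs its own parentheses; use & (and), | (or), ~ (not)
both = frame[(frame['A'] > 0) & (frame['B'] > 0)]
either = frame[(frame['A'] > 1) | (frame['B'] < 0)]
print(both)
print(either)
```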


**Queries**

Pandas can also select elements according to a query string on the columns of the DataFrame. This is another, more database-oriented (SQL-like) way of doing exactly the same thing:

In [39]:
dfq = df.query('C > 0.5')
dfq

Unnamed: 0,A,B,C,D
2020-11-09 15:45:00,0.549843,-1.484993,2.135181,0.346267
2020-11-09 16:45:00,-0.39577,0.405768,1.556294,0.089274
2020-11-09 17:45:00,-0.663736,0.69242,1.049939,-0.837275
2020-11-09 21:45:00,-1.764013,0.720927,0.5805,0.35555
2020-11-09 22:45:00,-0.336454,-0.507634,0.790993,-0.102413


which is equivalent to `dfq = df[df['C'] > 0.5]`:

In [40]:
dfq = df[df['C'] > 0.5]
dfq

Unnamed: 0,A,B,C,D
2020-11-09 15:45:00,0.549843,-1.484993,2.135181,0.346267
2020-11-09 16:45:00,-0.39577,0.405768,1.556294,0.089274
2020-11-09 17:45:00,-0.663736,0.69242,1.049939,-0.837275
2020-11-09 21:45:00,-1.764013,0.720927,0.5805,0.35555
2020-11-09 22:45:00,-0.336454,-0.507634,0.790993,-0.102413
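
A query string can also combine conditions and refer to Python variables by prefixing their name with `@`. A minimal sketch (the DataFrame and the variable `threshold` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'C': [0.2, 0.8, 1.4], 'D': [1.0, -1.0, 0.5]})
threshold = 0.5   # illustrative local variable

by_query = frame.query('C > @threshold and D > 0')
by_mask = frame[(frame['C'] > threshold) & (frame['D'] > 0)]
print(by_query.equals(by_mask))
```

Both forms select the same rows; the choice is mostly a matter of readability.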


### Copy and views in DataFrames

The view/copy behaviour in Pandas is not as simple as it may appear, as there are counter-intuitive exceptions. A fix has been planned for quite some time, but it has not been deployed yet.

Check this discussion [here](https://www.practicaldatascience.org/html/views_and_copies_in_pandas.html):

    In numpy, the rules for when you get views and when you don’t are a little complicated, but they are consistent: certain behaviors (like simple indexing) will always return a view, and others (fancy indexing) will never return a view (always a copy).

    But in pandas, whether you get a view or not—and whether changes made to a view will propagate back to the original DataFrame—depends on the structure and data types in the original DataFrame.


In summary, there is only one way to write safe code when dealing with slices of a DataFrame: after every instruction that selects a subset of a DataFrame, force the copy by appending `.copy()` to the slice. This prevents changes made to a view (which may also raise warnings) from propagating to the original.
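
A minimal sketch of this rule, with a small stand-in DataFrame: once the slice has been copied explicitly, modifying it cannot affect the original.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.zeros((3, 2)), columns=['A', 'B'])

subset = frame.iloc[0:2].copy()   # explicit, independent copy
subset.iloc[0, 0] = 99.0          # modifies the copy only

print(frame.iloc[0, 0])    # the original is untouched
print(subset.iloc[0, 0])
```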

### Assignment

Assignment is typically performed after selection:

In [41]:
# Make sure to copy the DataFrame if you plan to modify it, and you don't want to change the original object
dfa = df.copy()

# setting values by label (same as by position)
dfa.at[dates[0], 'A'] = -1

# setting and assigning a numpy array
dfa.loc[:, 'D'] = np.array([5] * len(dfa))

# defining a new column
dfa['E'] = np.arange(len(dfa)) * 0.5

# defining a brand new column by means of a pd.Series: indexes must be the same!
dfa['E prime'] = pd.Series(np.arange(len(dfa))*2, index=dfa.index)

# using masks for assignment
dfa[dfa < 0] = -dfa

dfa


Unnamed: 0,A,B,C,D,E,E prime
2020-11-09 14:45:00,1.0,1.607488,0.281793,5,0.0,0
2020-11-09 15:45:00,0.549843,1.484993,2.135181,5,0.5,2
2020-11-09 16:45:00,0.39577,0.405768,1.556294,5,1.0,4
2020-11-09 17:45:00,0.663736,0.69242,1.049939,5,1.5,6
2020-11-09 18:45:00,1.682505,0.17058,0.901752,5,2.0,8
2020-11-09 19:45:00,0.055407,1.822185,0.370045,5,2.5,10
2020-11-09 20:45:00,0.850912,0.338874,1.288974,5,3.0,12
2020-11-09 21:45:00,1.764013,0.720927,0.5805,5,3.5,14
2020-11-09 22:45:00,0.336454,0.507634,0.790993,5,4.0,16
2020-11-09 23:45:00,0.073741,0.169104,0.289152,5,4.5,18
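
The comment "indexes must be the same" deserves a short illustration. In this sketch (labels are made up), a Series with a partially different index is assigned as a new column: the values are aligned on the labels, not on the position, so non-matching rows get `NaN` and extra labels are discarded.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])

# the new column is aligned on the index labels, not on the position
frame['shifted'] = pd.Series([10, 20, 30], index=['y', 'z', 'w'])
print(frame)
```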


### Dropping

Dropping columns is an example of a method that does not modify the original object, and returns a new modified object. In other words, if you want to keep the modified DataFrame, perform a new assignment:

```python
df = df.drop(...)
```
Alternatively, the modification of the original object can be forced by specifying `inplace=True` among the arguments.

In [42]:
dfb = dfa.copy()

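# Dropping by column: the result is returned, dfb itself is not modified...
dfb.drop(['E prime'], axis=1)

# ...so assign the result to keep it; this form is equivalent to the one above
dfb = dfb.drop(columns=['E prime'])
#dfb.drop(columns=['E prime'], inplace=True) # equivalent to the previous one

dfb.drop(dfb.index[[0, 1, 2]]) # drop by rows: again the result is not assigned, so dfb keeps all its rows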

dfb

Unnamed: 0,A,B,C,D,E
2020-11-09 14:45:00,1.0,1.607488,0.281793,5,0.0
2020-11-09 15:45:00,0.549843,1.484993,2.135181,5,0.5
2020-11-09 16:45:00,0.39577,0.405768,1.556294,5,1.0
2020-11-09 17:45:00,0.663736,0.69242,1.049939,5,1.5
2020-11-09 18:45:00,1.682505,0.17058,0.901752,5,2.0
2020-11-09 19:45:00,0.055407,1.822185,0.370045,5,2.5
2020-11-09 20:45:00,0.850912,0.338874,1.288974,5,3.0
2020-11-09 21:45:00,1.764013,0.720927,0.5805,5,3.5
2020-11-09 22:45:00,0.336454,0.507634,0.790993,5,4.0
2020-11-09 23:45:00,0.073741,0.169104,0.289152,5,4.5


### Dealing with missing data

Pandas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations. If there is a `NaN` entry in a Series of integers, the type of the Series will be changed to floats.
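
Both statements can be checked on a small Series. The sketch below (names are illustrative) introduces a missing value with `reindex` and compares the mean with and without skipping `NaN`:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
with_gap = ints.reindex([0, 1, 2, 3])   # label 3 has no value -> NaN

print(ints.dtype, with_gap.dtype)       # the integer dtype becomes float64
print(with_gap.mean())                  # NaN is skipped by default
print(with_gap.mean(skipna=False))      # NaN propagates when skipna=False
```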

In [43]:
df_wNan = dfb[dfb > 0.5]
df_wNan

Unnamed: 0,A,B,C,D,E
2020-11-09 14:45:00,1.0,1.607488,,5,
2020-11-09 15:45:00,0.549843,1.484993,2.135181,5,
2020-11-09 16:45:00,,,1.556294,5,1.0
2020-11-09 17:45:00,0.663736,0.69242,1.049939,5,1.5
2020-11-09 18:45:00,1.682505,,0.901752,5,2.0
2020-11-09 19:45:00,,1.822185,,5,2.5
2020-11-09 20:45:00,0.850912,,1.288974,5,3.0
2020-11-09 21:45:00,1.764013,0.720927,0.5805,5,3.5
2020-11-09 22:45:00,,0.507634,0.790993,5,4.0
2020-11-09 23:45:00,,,,5,4.5


In [45]:
# dropping rows with at least one NaN
df_wNan.dropna(how='any')
df_wNan
# this has no effect, because dropna does not modify the original DataFrame;
# to keep the result, assign it:
dfc = df_wNan.dropna(how='any')
dfc

Unnamed: 0,A,B,C,D,E
2020-11-09 17:45:00,0.663736,0.69242,1.049939,5,1.5
2020-11-09 21:45:00,1.764013,0.720927,0.5805,5,3.5


In [46]:
# getting a mask
df_wNan.isna()
# df_wNan.notna()

Unnamed: 0,A,B,C,D,E
2020-11-09 14:45:00,False,False,True,False,True
2020-11-09 15:45:00,False,False,False,False,True
2020-11-09 16:45:00,True,True,False,False,False
2020-11-09 17:45:00,False,False,False,False,False
2020-11-09 18:45:00,False,True,False,False,False
2020-11-09 19:45:00,True,False,True,False,False
2020-11-09 20:45:00,False,True,False,False,False
2020-11-09 21:45:00,False,False,False,False,False
2020-11-09 22:45:00,True,False,False,False,False
2020-11-09 23:45:00,True,True,True,False,False


In [47]:
# filling missing data (not recommended, unless you really mean it)
df_wNan.fillna(value=0)

Unnamed: 0,A,B,C,D,E
2020-11-09 14:45:00,1.0,1.607488,0.0,5,0.0
2020-11-09 15:45:00,0.549843,1.484993,2.135181,5,0.0
2020-11-09 16:45:00,0.0,0.0,1.556294,5,1.0
2020-11-09 17:45:00,0.663736,0.69242,1.049939,5,1.5
2020-11-09 18:45:00,1.682505,0.0,0.901752,5,2.0
2020-11-09 19:45:00,0.0,1.822185,0.0,5,2.5
2020-11-09 20:45:00,0.850912,0.0,1.288974,5,3.0
2020-11-09 21:45:00,1.764013,0.720927,0.5805,5,3.5
2020-11-09 22:45:00,0.0,0.507634,0.790993,5,4.0
2020-11-09 23:45:00,0.0,0.0,0.0,5,4.5
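
When filling is really intended, a constant is not the only option. As an illustration on hypothetical data (`demo` is not part of the notebook), missing entries can be replaced by the column mean or by the last valid value; whether either choice is appropriate depends on the dataset:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [np.nan, 4.0, 8.0]})

# Passing a Series to fillna() fills each column with the value matching its label
filled_with_mean = demo.fillna(demo.mean())

# Forward fill: propagate the last valid value downwards
# (a NaN in the first row has nothing before it, so it stays NaN)
filled_forward = demo.ffill()

print(filled_with_mean)
print(filled_forward)
```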


### Operations

Operations on the elements of a DataFrame are quite straightforward, as the syntax is the same as the one used for Series. As with Series, operations are performed between elements that share the same labels. Operations on columns are extremely fast, almost as fast as the actual operation between elements in a row.
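
To illustrate the label matching, here is a small sketch with two hypothetical DataFrames (`x` and `y`) whose indices only partially overlap:

```python
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
y = pd.DataFrame({'A': [10, 20, 30]}, index=['b', 'c', 'd'])

# Elements are paired by label, not by position;
# labels present in only one operand produce NaN
total = x + y
print(total)

# The add() method can treat a missing operand as a given value instead
total_filled = x.add(y, fill_value=0)
print(total_filled)
```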

In [48]:
# Some statistics (mean() just as an example)
# axis=0: average over the rows, giving one value per column
print(df.mean(axis=0), '\n')
# axis=1: average over the columns, giving one value per row
print(df.mean(axis=1), '\n')

A   -0.559899
B   -0.029917
C    0.429959
D    0.334936
dtype: float64 

2020-11-09 14:45:00   -0.586328
2020-11-09 15:45:00    0.386574
2020-11-09 16:45:00    0.413892
2020-11-09 17:45:00    0.060337
2020-11-09 18:45:00   -0.351984
2020-11-09 19:45:00    0.440444
2020-11-09 20:45:00   -0.350529
2020-11-09 21:45:00   -0.026759
2020-11-09 22:45:00   -0.038877
2020-11-09 23:45:00    0.490928
Freq: H, dtype: float64 



In [49]:
# Global operations on columns
df.apply(np.sum) # or whatever function defined by the user

A   -5.598987
B   -0.299165
C    4.299585
D    3.349356
dtype: float64

In [50]:
# Also lambda functions are accepted
df.apply(lambda x: x - x.max()) # x.max() is computed once per column; the subtraction then acts on every element of that column

Unnamed: 0,A,B,C,D
2020-11-09 14:45:00,-1.134431,-3.429673,-2.416974,-1.30316
2020-11-09 15:45:00,0.0,-3.307179,0.0,-1.085449
2020-11-09 16:45:00,-0.945612,-1.416417,-0.578886,-1.342442
2020-11-09 17:45:00,-1.213578,-1.129765,-1.085241,-2.268991
2020-11-09 18:45:00,-2.232347,-1.992766,-3.036933,-0.084817
2020-11-09 19:45:00,-0.494436,0.0,-1.765135,-1.917579
2020-11-09 20:45:00,-1.400754,-2.16106,-3.424155,-0.355071
2020-11-09 21:45:00,-2.313856,-1.101259,-1.55468,-1.076166
2020-11-09 22:45:00,-0.886297,-2.32982,-1.344188,-1.534129
2020-11-09 23:45:00,-0.476101,-1.653082,-1.846028,0.0


In [51]:
# syntax is as usual similar to that of numpy arrays
df['S'] = df['A'] + df['C']
df

Unnamed: 0,A,B,C,D,S
2020-11-09 14:45:00,-0.584588,-1.607488,-0.281793,0.128556,-0.866381
2020-11-09 15:45:00,0.549843,-1.484993,2.135181,0.346267,2.685023
2020-11-09 16:45:00,-0.39577,0.405768,1.556294,0.089274,1.160525
2020-11-09 17:45:00,-0.663736,0.69242,1.049939,-0.837275,0.386203
2020-11-09 18:45:00,-1.682505,-0.17058,-0.901752,1.3469,-2.584257
2020-11-09 19:45:00,0.055407,1.822185,0.370045,-0.485863,0.425452
2020-11-09 20:45:00,-0.850912,-0.338874,-1.288974,1.076645,-2.139886
2020-11-09 21:45:00,-1.764013,0.720927,0.5805,0.35555,-1.183513
2020-11-09 22:45:00,-0.336454,-0.507634,0.790993,-0.102413,0.454539
2020-11-09 23:45:00,0.073741,0.169104,0.289152,1.431716,0.362894


### Application of a function

User-defined or standard functions can be applied on entire DataFrames or columns, with very short execution times.

There are two main methods, `apply()` and `transform()`:

In [52]:
def dcos(theta):
    theta = theta * (np.pi / 180)
    return np.cos(theta)

# Apply method with custom function
dfa['cosine'] = dfa["E"].apply(dcos)

# Transform method with lambda function
dfa['EplusOne'] = dfa["E"].transform(lambda x: x + 1)
dfa

Unnamed: 0,A,B,C,D,E,E prime,cosine,EplusOne
2020-11-09 14:45:00,1.0,1.607488,0.281793,5,0.0,0,1.0,1.0
2020-11-09 15:45:00,0.549843,1.484993,2.135181,5,0.5,2,0.999962,1.5
2020-11-09 16:45:00,0.39577,0.405768,1.556294,5,1.0,4,0.999848,2.0
2020-11-09 17:45:00,0.663736,0.69242,1.049939,5,1.5,6,0.999657,2.5
2020-11-09 18:45:00,1.682505,0.17058,0.901752,5,2.0,8,0.999391,3.0
2020-11-09 19:45:00,0.055407,1.822185,0.370045,5,2.5,10,0.999048,3.5
2020-11-09 20:45:00,0.850912,0.338874,1.288974,5,3.0,12,0.99863,4.0
2020-11-09 21:45:00,1.764013,0.720927,0.5805,5,3.5,14,0.998135,4.5
2020-11-09 22:45:00,0.336454,0.507634,0.790993,5,4.0,16,0.997564,5.0
2020-11-09 23:45:00,0.073741,0.169104,0.289152,5,4.5,18,0.996917,5.5


The major differences between `apply` and `transform` are most visible on grouped data (see the Grouping section below):

   - Input: on grouped data, `apply` passes all the columns of each group to the custom function as a DataFrame, while `transform` passes each column separately as a Series.
   - Output: the custom function passed to `apply` can return a scalar, a Series or a DataFrame, while the custom function passed to `transform` must return a sequence (a Series, array or list) with the same length as its input.

In summary, `transform` works on one Series at a time and preserves the shape of its input, whereas `apply` can work on an entire (grouped) DataFrame and may reduce it.
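
The distinction is easiest to see on grouped data. The following is a sketch on a small hypothetical DataFrame (`demo`, with made-up columns `key`, `x` and `y`), not on the notebook's data:

```python
import pandas as pd

demo = pd.DataFrame({'key': ['a', 'a', 'b'],
                     'x': [1, 2, 3],
                     'y': [10, 20, 30]})
grouped = demo.groupby('key')[['x', 'y']]

# apply: the function receives each group as a DataFrame,
# so it can combine several columns and return one scalar per group
x_times_y = grouped.apply(lambda g: (g['x'] * g['y']).sum())

# transform: the function is applied column by column within each group
# and must return a result as long as its input
centered = grouped.transform(lambda col: col - col.mean())

print(x_times_y)
print(centered)
```

Here `x_times_y` has one entry per group, while `centered` keeps the shape and index of the selected columns.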

### Merge

Pandas provides various functions for easily combining together Series and DataFrames in join / merge-type operations.

**Concat**

Concatenation (adding rows) is straightforward:

In [53]:
rdf = pd.DataFrame(np.arange(40).reshape(10, 4))
rdf

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19
5,20,21,22,23
6,24,25,26,27
7,28,29,30,31
8,32,33,34,35
9,36,37,38,39


In [54]:
# split DataFrame into 3 pieces, row-wise
pieces = [rdf[:3], rdf[3:7], rdf[7:]]
pieces #si ottiene una lista di dataframe

[   0  1   2   3
 0  0  1   2   3
 1  4  5   6   7
 2  8  9  10  11,
     0   1   2   3
 3  12  13  14  15
 4  16  17  18  19
 5  20  21  22  23
 6  24  25  26  27,
     0   1   2   3
 7  28  29  30  31
 8  32  33  34  35
 9  36  37  38  39]

In [55]:
# put it back together
pd.concat(pieces)

# in this case, indices are already set; if they are not, indices can be ignored
#pd.concat(pieces, ignore_index=True)

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19
5,20,21,22,23
6,24,25,26,27
7,28,29,30,31
8,32,33,34,35
9,36,37,38,39


In case of a dimension mismatch, `NaN` values are added where needed.
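
As an illustration, consider two hypothetical DataFrames (`top` and `bottom`) that share only one column:

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'b': [5, 6], 'c': [7, 8]})

# Default behaviour (join='outer'): keep the union of the columns
# and put NaN where a piece has no data
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)

# join='inner' keeps only the columns common to all pieces
common = pd.concat([top, bottom], ignore_index=True, join='inner')
print(common)
```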


**Append**

Appending rows and columns also works:

In [56]:
# appending a single row (as a Series)
s = rdf.iloc[3]
# DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() does the same job
rdf = pd.concat([rdf, s.to_frame().T], ignore_index=True) # remember to assign the returned object
rdf


Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19
5,20,21,22,23
6,24,25,26,27
7,28,29,30,31
8,32,33,34,35
9,36,37,38,39


**Merge/Join**

SQL-like operations on tables can be performed on DataFrames. This is a fairly advanced use case; refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html#merging) for more information and examples.

In [57]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

pd.merge(left, right, on="key")

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5
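
When the keys of the two tables do not match completely, the `how` argument of `pd.merge` selects which rows are kept. A sketch with hypothetical tables (`l_demo` and `r_demo`, chosen so that one key is missing on each side):

```python
import pandas as pd

l_demo = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'lval': [1, 2, 3]})
r_demo = pd.DataFrame({'key': ['foo', 'bar', 'qux'], 'rval': [4, 5, 6]})

# how='inner' (the default): only keys present in both tables
inner = pd.merge(l_demo, r_demo, on='key')

# how='left': all keys of the left table; rval is NaN for 'baz'
left_join = pd.merge(l_demo, r_demo, on='key', how='left')

# how='outer': the union of the keys of both tables
outer = pd.merge(l_demo, r_demo, on='key', how='outer')

print(inner, left_join, outer, sep='\n\n')
```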


### Grouping

In real-world applications, it is quite common that several entries (rows) belong to a certain entity, or "group". DataFrames have a powerful tool to perform operations on entries of the same group. The method is called `.groupby()`, and it usually involves one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently (after the data have been split)
* Combining the results into a data structure (not done by `groupby()` itself, but by the operation that follows it)
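
The three steps can be written out by hand to see what happens at each stage. This is only a sketch on a small hypothetical DataFrame (`demo`); in practice the one-line form at the end is preferable:

```python
import pandas as pd

demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                     'C': [1, 2, 3, 4]})

# 1. split: iterating over a GroupBy object yields (group label, sub-DataFrame) pairs
# 2. apply: compute the sum of column C for each group
partial_sums = {label: group['C'].sum() for label, group in demo.groupby('A')}

# 3. combine: collect the per-group results into a single Series
manual = pd.Series(partial_sums, name='C')

# The same result in one step
direct = demo.groupby('A')['C'].sum()

print(manual)
print(direct)
```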


In [58]:
gdf = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B' : ['one', 'one', 'two', 'three',
                           'two', 'two', 'one', 'three'],
                    'C' : np.arange(8),
                    'D' : np.linspace(10, -10, 8)})
gdf

Unnamed: 0,A,B,C,D
0,foo,one,0,10.0
1,bar,one,1,7.142857
2,foo,two,2,4.285714
3,bar,three,3,1.428571
4,foo,two,4,-1.428571
5,bar,two,5,-4.285714
6,foo,one,6,-7.142857
7,foo,three,7,-10.0


In [None]:
# Grouping and then applying the sum() function to the resulting groups.
# The grouping criterion is the content of column A; numeric_only=True
# restricts the sum to the numerical columns
gdf.groupby('A').sum(numeric_only=True)

In [None]:
# Example: find maximum value in column D for each group, and assign the value to a new column
gdf['M'] = gdf.groupby('A')['D'].transform('max') # (): grouping criterion, []: column the function is applied to
gdf

### Multi-indexing

Hierarchical / Multi-level indexing allows sophisticated data analysis on higher dimensional data. In practice, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1D) and DataFrames (2D).

In [None]:
# Create multi-dimensional index
tuples = list(zip(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']))
multi_index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
print(multi_index, '\n', type(multi_index), '\n')

# Create multi-indexed dataframe or series
s = pd.Series(np.arange(8)/np.pi, index=multi_index)
s
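
Once a hierarchical index is in place, elements can be selected by one level or by all levels at once. The sketch below builds its own small Series (`ms`, a hypothetical name) so that it can run independently of the cell above:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
ms = pd.Series(np.arange(4), index=index)

# Selecting on the outer level returns a Series indexed by the inner level
outer_slice = ms.loc['bar']

# A tuple addresses a single element through both levels
single_value = ms.loc[('baz', 'two')]

# unstack() moves the inner level to the columns, giving a 2D view of the same data
as_table = ms.unstack('second')

print(outer_slice, single_value, as_table, sep='\n\n')
```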

In [None]:
# multi-indexing enables further features of the groupby method,
# e.g. grouping by multiple columns
gdf.groupby(['A', 'B']).sum()

## Summary: a demonstration of the efficiency of the DataFrame

Let's go the hard way and load a (relatively) large dataset with approximately 1 million rows:

In [64]:
# Download the file (requires wget). Run the command just once
!wget https://www.dropbox.com/s/xvjzaxzz3ysphme/data_000637.txt -P ./data/

/bin/bash: wget: command not found


In [None]:
file_name = "./data/data_000637.txt"
data = pd.read_csv(file_name)
data

Let's now perform some operations on (elements of) columns:

In [61]:
itime = dt.datetime.now()
print("Begin time:", itime)

# the one-liner command
data['WEIGHTEDSUM'] = data['TDC_CHANNEL'] * 2.1 + data['BX_COUNTER'] * 0.1 + 2

ftime = dt.datetime.now()
print("End time:", ftime)
print("Elapsed time:", ftime - itime)

data

Begin time: 2022-11-11 10:48:03.626554


NameError: name 'data' is not defined

In [None]:
# the loop
def conversion(data):
    result = []
    for i in range(len(data)): 
        result.append(data.loc[data.index[i], 'TDC_CHANNEL'] * 2.1 + data.loc[data.index[i], 'BX_COUNTER'] * 0.1 + 2)
    return result

itime = dt.datetime.now()
print("Begin time:", itime)
data['WEIGHTEDSUM'] = conversion(data)
ftime = dt.datetime.now()
print("End time:", ftime)
print("Elapsed time:", ftime - itime)

data
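
If the data file is not available, the same comparison can be sketched on synthetic data. Everything about the table below is an assumption made for illustration: the number of rows, the random values and their ranges are arbitrary, and only the column names are borrowed from the dataset above. Timings depend on the machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (random values, arbitrary size and ranges)
rng = np.random.default_rng(0)
n_rows = 20_000
fake = pd.DataFrame({'TDC_CHANNEL': rng.integers(1, 140, n_rows),
                     'BX_COUNTER': rng.integers(0, 3564, n_rows)})

# Vectorized version: one expression acting on whole columns
start = time.perf_counter()
vectorized = fake['TDC_CHANNEL'] * 2.1 + fake['BX_COUNTER'] * 0.1 + 2
vectorized_time = time.perf_counter() - start

# Explicit Python loop over the rows
start = time.perf_counter()
looped = [row.TDC_CHANNEL * 2.1 + row.BX_COUNTER * 0.1 + 2
          for row in fake.itertuples(index=False)]
loop_time = time.perf_counter() - start

print(f"vectorized: {vectorized_time:.4f} s, loop: {loop_time:.4f} s")
```

Both approaches compute the same values; only the elapsed time differs.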