# pandas

pandas is a Python library for data analysis. It provides two additional data types: `pandas.Series` and `pandas.DataFrame`.

pandas must be imported before it can be used. By convention, it is abbreviated as `pd`:

In [1]:
import pandas as pd

## Series and DataFrames

A Series corresponds to a table column or a one-dimensional array whose values each carry a label (by default a consecutive integer index starting at 0):

In [2]:
s = pd.Series(["A", "B", "C"])
print(s)

0    A
1    B
2    C
dtype: object


However, the labels (the index) can also be set by the user:

In [3]:
s = pd.Series(["A", "B", "C"], index=["x", "y", "z"])
print(s)

x    A
y    B
z    C
dtype: object
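
Once a Series has custom labels, its values can be looked up by label. The following minimal sketch reuses the Series from above and contrasts label-based access via `loc` with position-based access via `iloc`; `iloc` is not covered in this section and is shown only for comparison:

```python
import pandas as pd

s = pd.Series(["A", "B", "C"], index=["x", "y", "z"])

# Label-based access uses the user-defined index
by_label = s.loc["y"]

# Position-based access ignores the labels and counts from 0
by_position = s.iloc[1]

print(by_label, by_position)  # B B
```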


DataFrames are two-dimensional data structures, comparable to tables with rows and columns. In addition to the row index, they have column names. Like the index, the column names are numbered consecutively starting at 0 by default:

In [4]:
df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60]])
print(df)

    0   1   2
0  10  20  30
1  40  50  60


Column names can also be specified explicitly. The same applies to the index, although the default consecutive index starting at 0 is usually kept:

In [5]:
data = [["Joel Embiid", "Philadelphia 76ers", "#21", 33.4, 10.2, 4.1], 
        ["Giannis Antetokounmpo", "Milwaukee Bucks", "#34", 32.2, 12.4, 5.3],
        ["Luka Doncic", "Dallas Mavericks", "#77", 33.4, 8.9, 8.2], 
        ["Damien Lillard", "Portland Trail Blazers", "#0", 30.8, 4.0, 7.2]]
df = pd.DataFrame(data=data, columns=["name", "team", "number", "ppg", "rpg", "apg"])
print(df)

                    name                    team number   ppg   rpg  apg
0            Joel Embiid      Philadelphia 76ers    #21  33.4  10.2  4.1
1  Giannis Antetokounmpo         Milwaukee Bucks    #34  32.2  12.4  5.3
2            Luka Doncic        Dallas Mavericks    #77  33.4   8.9  8.2
3         Damien Lillard  Portland Trail Blazers     #0  30.8   4.0  7.2
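
For completeness, a custom index can be passed through the `index` parameter of `pd.DataFrame`. This is only a sketch with a reduced data set; the labels `"c"` and `"pg"` are hypothetical and chosen purely for illustration:

```python
import pandas as pd

data = [["Joel Embiid", 33.4],
        ["Luka Doncic", 33.4]]

# The row labels "c" and "pg" are made up for this example
df_labeled = pd.DataFrame(data, columns=["name", "ppg"], index=["c", "pg"])

# Rows are now addressed by their labels instead of 0 and 1
print(df_labeled.loc["pg", "name"])  # Luka Doncic
```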


## Indexing DataFrames

To access cells within a DataFrame, use the `loc` attribute. It is attached to the DataFrame with the dot operator and takes, in square brackets, the row label and optionally the column label. If only the row is specified, the entire row is returned as a Series; if both row and column are specified, the value of the corresponding cell is returned:

In [6]:
print("Indexing a row:")
print("==========================")
print(df.loc[2])

print("\nIndexing a cell:")
print("==========================")
print(df.loc[1, "name"])

Indexing a row:
name           Luka Doncic
team      Dallas Mavericks
number                 #77
ppg                   33.4
rpg                    8.9
apg                    8.2
Name: 2, dtype: object

Indexing a cell:
Giannis Antetokounmpo


It is also possible to access multiple rows or columns at once by passing them as lists:

In [7]:
print(df.loc[[1, 2], ["name", "team"]])

                    name              team
1  Giannis Antetokounmpo   Milwaukee Bucks
2            Luka Doncic  Dallas Mavericks


To get all columns of a row, it is sufficient to omit the column, as described above. To get all rows of a column, however, a `:` must be specified in place of the rows:

In [8]:
print(df.loc[:, "name"])

0              Joel Embiid
1    Giannis Antetokounmpo
2              Luka Doncic
3           Damien Lillard
Name: name, dtype: object


The colon is used to slice a DataFrame. This makes it possible to select a range of rows without listing each row individually. Note that with `loc`, both the start and the end label are included:

In [9]:
print(df.loc[1:2])

                    name              team number   ppg   rpg  apg
1  Giannis Antetokounmpo   Milwaukee Bucks    #34  32.2  12.4  5.3
2            Luka Doncic  Dallas Mavericks    #77  33.4   8.9  8.2
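
Because `loc` works with labels, its slices behave differently from ordinary Python slices. As a rough sketch, the example below compares `loc` with the position-based `iloc`, which is not covered in this section and is included only for contrast:

```python
import pandas as pd

df = pd.DataFrame({"ppg": [33.4, 32.2, 33.4, 30.8]})

label_slice = df.loc[1:2]      # labels 1 and 2, the end label is included
position_slice = df.iloc[1:2]  # position 1 only, the end is excluded

print(len(label_slice), len(position_slice))  # 2 1
```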


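It is also possible to filter rows by conditions. The comparison produces a Series of Boolean values, and `loc` keeps the rows for which the value is `True`. The following example shows how to extract all players with more than ten rebounds per game (rpg):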

In [10]:
print(df.loc[df.loc[:, "rpg"] > 10])

                    name                team number   ppg   rpg  apg
0            Joel Embiid  Philadelphia 76ers    #21  33.4  10.2  4.1
1  Giannis Antetokounmpo     Milwaukee Bucks    #34  32.2  12.4  5.3


If you want all rows of a column, you can also access them without the `loc` attribute. Square brackets or the dot operator are sufficient for this:

In [11]:
print(df["name"])

0              Joel Embiid
1    Giannis Antetokounmpo
2              Luka Doncic
3           Damien Lillard
Name: name, dtype: object


In [12]:
print(df.name)

0              Joel Embiid
1    Giannis Antetokounmpo
2              Luka Doncic
3           Damien Lillard
Name: name, dtype: object


It follows that the expressions `df.loc[:, "name"]`, `df["name"]`, and `df.name` are interchangeable, which is why filtering by conditions can be rewritten as follows:

In [13]:
print(df.loc[df["rpg"] > 10])

                    name                team number   ppg   rpg  apg
0            Joel Embiid  Philadelphia 76ers    #21  33.4  10.2  4.1
1  Giannis Antetokounmpo     Milwaukee Bucks    #34  32.2  12.4  5.3


In [14]:
print(df.loc[df.rpg > 10])

                    name                team number   ppg   rpg  apg
0            Joel Embiid  Philadelphia 76ers    #21  33.4  10.2  4.1
1  Giannis Antetokounmpo     Milwaukee Bucks    #34  32.2  12.4  5.3
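
The equivalence can be verified with `Series.equals`. The sketch below also points out one limitation of the dot notation: it only works if the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. The column `count` is hypothetical and was added only to demonstrate such a collision:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Joel Embiid", "Luka Doncic"],
    "rpg": [10.2, 8.9],
    "count": [1, 2],  # made-up column that shares its name with DataFrame.count
})

# All three spellings return the same column
same_as_brackets = df.loc[:, "name"].equals(df["name"])
same_as_dot = df["name"].equals(df.name)
print(same_as_brackets, same_as_dot)  # True True

# df.count refers to the DataFrame method, not to the column
print(callable(df.count))    # True
print(df["count"].tolist())  # [1, 2]
```

In such cases, square brackets or `loc` remain the safe choice.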
