In [1]:
import pandas as pd
import numpy as np
import os

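# build the path from the current working directory (os.path.join works on any OS)
fp = os.getcwd()

df = pd.read_csv(os.path.join(fp, "data", "Input", "transfermarkt_values.csv"), sep=";")
# drop the unnamed sixth column picked up while parsing the file
del df["Unnamed: 5"]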

df.head(3)

Unnamed: 0,id,first_name,last_name,age,value
0,1,Kylian,Mbappe,23,160000000
1,2,Erling,Haaland,21,150000000
2,3,Vinicius,Junior,21,100000000


# Selecting Data

In this chapter, we access a subset of the DataFrame to perform CRUD operations (create, read, update, delete). There are multiple ways to filter a DataFrame, and they overlap quite a lot. I will show my preferred way of filtering: df[col] and df.loc[].

## 1) Specific columns (all rows)

![05_column](img/05_column.png)

<code> df[column(s)] </code>
* column name, list of column names, all (:)
* return type depends on what you pass (a single column name = Series, a list of column names = DataFrame)
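To make the return types concrete, here is a minimal sketch. It builds a small stand-in DataFrame from the three rows shown in the `df.head(3)` output above instead of reading the CSV. Note that a list with a single name still returns a DataFrame.

```python
import pandas as pd

# Stand-in data: the three rows from the df.head(3) output above
df = pd.DataFrame({
    "first_name": ["Kylian", "Erling", "Vinicius"],
    "last_name": ["Mbappe", "Haaland", "Junior"],
    "age": [23, 21, 21],
})

one_col = df["age"]                   # single column name -> Series
one_col_df = df[["age"]]              # list with one name  -> DataFrame
two_cols = df[["age", "last_name"]]   # list of names       -> DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(type(two_cols).__name__)
```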

#### Create new col

In [2]:
# from a list or array (its length must match the number of rows)
df["age_new"] = [50, 50, 50, 50, 50, 45, 45, 45, 45, 45]

#### Read

In [11]:
# single column (returns a Series; use .values for a NumPy array of the values)
subset = df["age"]

# multiple columns (returns new dataframe)
subset = df[["age", "last_name"]]

#### Update (same as create)

In [12]:
# by another column
df["age_copy"] = df["age"]

#### Delete

In [13]:
del df["age_copy"]
del df["age_new"]

## 2) Specific rows

There are multiple ways to specify rows: the basic df[condition] notation, df.iloc[row index, col index], or df.loc[condition, col]. The first is the best performance-wise, but it can't perform all CRUD operations. iloc can be useful at some points, for example when iterating, but I think the most natural way to filter the data is df.loc[] with the columns referenced by name.

![05_conditional](img/05_conditional.png)

<code> df.loc[condition] </code>
* condition, index or all (:)
* return type depends on the column selection (a single column name -> Series, a list of names or all columns -> DataFrame)
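The three notations can be compared side by side. This is only a sketch on stand-in data (the three rows from the `df.head(3)` output); the value written in the last step is an arbitrary placeholder.

```python
import pandas as pd

# Stand-in data: the three rows from the df.head(3) output
df = pd.DataFrame({
    "first_name": ["Kylian", "Erling", "Vinicius"],
    "last_name": ["Mbappe", "Haaland", "Junior"],
    "age": [23, 21, 21],
    "value": [160000000, 150000000, 100000000],
})

mask = df["age"] < 23                    # boolean Series, one entry per row

by_mask = df[mask]                       # matching rows, all columns
by_position = df.iloc[1:3, 0]            # rows 1-2 and first column, by position
by_label = df.loc[mask, "first_name"]    # rows by condition, column by name

# The same .loc selector can also be used on the left-hand side to update
df.loc[mask, "value"] = 91000000         # placeholder value for illustration

print(by_label.values)
print(df["value"].values)
```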

#### Create new row

In [14]:
# assigning to an index label that does not exist yet creates a new row
df.loc[11, :] = [11, "Jadon", "Sancho", 21, 85000000]

#### Read

In [15]:
# conditional selection (returns Series - use .values for array of values)
subset = df.loc[df["age"] < 23, "first_name"]

# index selection (all columns -> DataFrame)
subset = df.loc[[2, 3, 5], :]

#### Update

In [16]:
# update one column for all rows that match multiple conditions
df.loc[(df["age"] > 23) & (df["age"] < 28), "value"] = 91000000

# update one column for all rows whose value is in a list (the right-hand side is aligned by index)
df.loc[df["last_name"].isin(["Haaland", "Mbappe"]), "id"] = df["id"] + 20

#### Delete

In [17]:
# To delete we need to use df.drop combined with df.loc (loc to get the index)
df = df.drop(df.loc[df["first_name"] == "Jadon"].index)
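As a possible alternative, rows can also be removed by keeping everything that does not match the condition. The sketch below uses a small stand-in DataFrame (two rows from the head output plus the Jadon Sancho row added earlier) and checks that both approaches give the same result.

```python
import pandas as pd

# Stand-in data, including the row that was added in the "Create new row" cell
df = pd.DataFrame({
    "id": [1, 2, 11],
    "first_name": ["Kylian", "Erling", "Jadon"],
    "last_name": ["Mbappe", "Haaland", "Sancho"],
})

# Variant from above: look up the matching index labels, then drop them
dropped = df.drop(df.loc[df["first_name"] == "Jadon"].index)

# Alternative: keep only the rows that do NOT match the condition
kept = df.loc[df["first_name"] != "Jadon"]

print(dropped.equals(kept))
```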

# Iteration
Before showing how to iterate by column or by row, I want to note that iterating over a DataFrame isn't optimal. It is better to filter the DataFrame as shown before and use vectorization (functions or new values applied to whole columns at once). Still, in some rare cases iterating is helpful, and that's why I'm showing it here.
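To illustrate the difference, here is a small sketch that computes the same result once with a row loop and once vectorized. The data is the three rows from the `df.head(3)` output; the column name `value_m` is made up for this example.

```python
import pandas as pd

# Stand-in data: the three rows from the df.head(3) output
df = pd.DataFrame({
    "last_name": ["Mbappe", "Haaland", "Junior"],
    "value": [160000000, 150000000, 100000000],
})

# Row-by-row: every row is materialized as a Series, one at a time
values_loop = []
for _, row in df.iterrows():
    values_loop.append(row["value"] / 1_000_000)

# Vectorized: one operation on the whole column
df["value_m"] = df["value"] / 1_000_000   # hypothetical column name

print(values_loop)
print(df["value_m"].tolist())
```

Both variants produce the same numbers; the vectorized one is shorter and avoids the Python-level loop.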

## Iterate by column

![05_iterate_by_col](img/05_iterate_by_col.png)

In [10]:
# shortening df for readability (first 5 rows)
df = df.loc[:4, ["id", "first_name", "last_name"]]

# yields (column name, Series) pairs; the values can be accessed as an array with .values
# note: DataFrame.iteritems() was removed in pandas 2.0, items() is the replacement
for key, value in df.items():
    print(key, value.values)

id [1. 2. 3. 4. 5.]
first_name ['Kylian' 'Erling' 'Vinicius' 'Mohamed' 'Harry']
last_name ['Mbappe' 'Haaland' 'Junior' 'Salah' 'Kane']


## Iterate by row

![05_iterate_by_row](img/05_iterate_by_row.png)

In [11]:
# yields (index, Series) pairs; the values can be accessed as an array with .values
for key, value in df.iterrows():
    print(key, value.values)

0 [1.0 'Kylian' 'Mbappe']
1 [2.0 'Erling' 'Haaland']
2 [3.0 'Vinicius' 'Junior']
3 [4.0 'Mohamed' 'Salah']
4 [5.0 'Harry' 'Kane']
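If row iteration is really needed, `DataFrame.itertuples()` may be worth a look as well: it yields one namedtuple per row instead of a Series, with the columns available as attributes. A minimal sketch on stand-in data (the three rows from the `df.head(3)` output):

```python
import pandas as pd

# Stand-in data: the three rows from the df.head(3) output
df = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Kylian", "Erling", "Vinicius"],
    "last_name": ["Mbappe", "Haaland", "Junior"],
})

labels = []
# index=False leaves the index out of the tuple
for row in df.itertuples(index=False):
    labels.append(f"{row.id} {row.first_name} {row.last_name}")

print(labels)
```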
