In [1]:
import pandas as pd
import numpy as np
import os

fp = os.getcwd()

# os.path.join builds the path with the separator of the current operating system
df = pd.read_csv(os.path.join(fp, "data", "Input", "transfermarkt_values.csv"), sep=";")
del df["Unnamed: 5"]

df

Unnamed: 0,id,first_name,last_name,age,value
0,1,Kylian,Mbappe,23,160000000
1,2,Erling,Haaland,21,150000000
2,3,Vinicius,Junior,21,100000000
3,4,Mohamed,Salah,29,100000000
4,5,Harry,Kane,28,100000000
5,6,Romelu,Lukaku,28,100000000
6,7,Bruno,Fernandes,27,90000000
7,8,Kevin,De Bruyne,30,90000000
8,9,Neymar,Junior,30,90000000
9,10,Phil,Foden,21,85000000


# Selecting Data

In this chapter, we access a subset of the DataFrame in order to print, read or change its values. There are several ways to filter for a subset, and they overlap considerably: loc/iloc (see 3), for example, can also be used for the first two tasks, but they are slightly slower, more verbose in syntax and not what the pandas documentation recommends for these tasks. That is why I'm showing the officially recommended way of filtering data.


*Source:* https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html


## 1) Specific columns (all rows)

![05_column](img/05_column.png)

<code> df[column(s)] </code>
* column name, list of column names, all (:)
* return type depends solely on how the columns are specified (single column name = Series, list of column names = DataFrame)

<br>
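To make the return type rule concrete, here is a minimal sketch. It uses a small hand-built DataFrame as a stand-in for the CSV (the three rows are copied from the table above, the variable names are only for illustration), so it runs without the input file.

```python
import pandas as pd

# Stand-in for the transfermarkt table, built by hand from the first rows shown above
df = pd.DataFrame({
    "first_name": ["Kylian", "Erling", "Vinicius"],
    "last_name": ["Mbappe", "Haaland", "Junior"],
    "age": [23, 21, 21],
})

single = df["age"]                    # single column name -> Series
as_frame = df[["age"]]                # list with one name -> DataFrame with one column
several = df[["age", "last_name"]]    # list of names -> DataFrame

print(type(single).__name__, type(as_frame).__name__, type(several).__name__)
```

Note that it is the list, not the number of names in it, that produces a DataFrame.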

#### Create new col

In [2]:
# by array
df["age_new"] = [50, 50, 50, 50, 50, 45, 45, 45, 45, 45]

# by another column
df["age_copy"] = df["age"]

#### Read

In [3]:
# single column
subset = df["age"]

# multiple columns
subset = df[["age", "last_name"]]

#### Update (same as create)

In [4]:
# by array
df["age_new"] = [50, 50, 50, 50, 50, 45, 45, 45, 45, 45]

# by another column
df["age_copy"] = df["age"]

#### Delete

In [5]:
del df["age_copy"]
del df["age_new"]

## 2) Specific rows

There are several ways to specify rows: the basic df[condition] notation, df.iloc[row index, col index] or df.loc[condition, col]. The first one performs best, but it can't handle all CRUD operations. iloc is useful in some situations, for example when iterating, but I think the most natural way to filter the data is df.loc[ ] with the columns referred to by name.

![05_conditional](img/05_conditional.png)

<code> df.loc[condition] </code>
* condition, index or all (:)
* return type depends on the number of columns selected (single column name -> Series, several or all columns -> DataFrame)

<br>
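The three notations mentioned above can be compared side by side. This is only a sketch on a small hand-built DataFrame (four rows taken from the table at the top, variable names made up for the example), not a performance comparison.

```python
import pandas as pd

# Stand-in data, copied by hand from the table at the top of the notebook
df = pd.DataFrame({
    "first_name": ["Kylian", "Erling", "Mohamed", "Harry"],
    "age": [23, 21, 29, 28],
    "value": [160000000, 150000000, 100000000, 100000000],
})

mask = df["age"] < 25                   # boolean Series with one entry per row

by_mask = df[mask]                      # basic notation: filters rows, keeps all columns
by_loc = df.loc[mask, "first_name"]     # label based: rows by condition, column by name
by_iloc = df.iloc[0:2, 0]               # position based: first two rows, first column

print(by_mask)
print(by_loc.tolist())
print(by_iloc.tolist())
```

Here df.loc and df.iloc happen to return the same two names, but only because the matching rows are the first two rows of this small example.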

#### Create new row

In [6]:
# accessing by index to create new row at the end
df.loc[11, :] = [11, "Jadon", "Sancho", 21, 85000000]

#### Read

In [7]:
# conditional selection (one column -> Series)
subset = df.loc[df["age"] < 23, "first_name"]

# index selection (all columns -> DataFrame)
subset = df.loc[[2, 3, 5], :]

#### Update

In [8]:
# updating one column for all rows that match multiple conditions
df.loc[(df["age"] > 23) & (df["age"] < 28), "value"] = 91000000

#### Delete

In [9]:
# To delete we need to use df.drop combined with df.loc (loc to get the index)
df = df.drop(df.loc[df["first_name"] == "Jadon"].index)
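As a possible alternative to df.drop, the rows can also be removed by keeping everything that does not match the condition. The sketch below uses a two-row stand-in DataFrame (built by hand for the example) and checks that both variants give the same result.

```python
import pandas as pd

# Two-row stand-in, built by hand for this example
df = pd.DataFrame({
    "first_name": ["Phil", "Jadon"],
    "last_name": ["Foden", "Sancho"],
    "age": [21, 21],
})

# Variant from the cell above: look up the matching index labels, then drop them
dropped = df.drop(df.loc[df["first_name"] == "Jadon"].index)

# Alternative: keep every row that does NOT match the condition
kept = df.loc[df["first_name"] != "Jadon"]

print(dropped.equals(kept))
```

Which variant reads better is a matter of taste; the drop variant makes the intent to delete more explicit.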

# Iteration
Before showing how to iterate by column or by row, I want to note that iterating over a DataFrame isn't optimal. It is better to filter the DataFrame as shown before and use vectorization (applying functions or assigning new values to whole columns at once). Still, in some rare cases it is helpful to iterate, and that's why I'm showing it here.
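To illustrate the difference, here is a small sketch that computes the same result once with a row loop and once vectorized. The DataFrame is a hand-built stand-in and the column name value_in_millions is made up for the example.

```python
import pandas as pd

# Stand-in data, copied by hand from the table at the top of the notebook
df = pd.DataFrame({
    "last_name": ["Mbappe", "Haaland", "Salah"],
    "value": [160000000, 150000000, 100000000],
})

# Row by row: one Python-level step per row
values_loop = []
for _, row in df.iterrows():
    values_loop.append(row["value"] / 1_000_000)

# Vectorized: one expression for the whole column
df["value_in_millions"] = df["value"] / 1_000_000

print(values_loop)
print(df["value_in_millions"].tolist())
```

Both give the same numbers; the vectorized version is shorter and leaves the looping to pandas.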

## Iterate by column

![05_iterate_by_col](img/05_iterate_by_col.png)

In [10]:
# returns each column as a Series, values can be accessed with .values as an array
# (df.iteritems() was removed in pandas 2.0, df.items() is the replacement)
for key, value in df.items():
    print(key, value.values)

id [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
first_name ['Kylian' 'Erling' 'Vinicius' 'Mohamed' 'Harry' 'Romelu' 'Bruno' 'Kevin'
 'Neymar' 'Phil']
last_name ['Mbappe' 'Haaland' 'Junior' 'Salah' 'Kane' 'Lukaku' 'Fernandes'
 'De Bruyne' 'Junior' 'Foden']
age [23. 21. 21. 29. 28. 28. 27. 30. 30. 21.]
value [1.6e+08 1.5e+08 1.0e+08 1.0e+08 1.0e+08 1.0e+08 9.1e+07 9.0e+07 9.0e+07
 8.5e+07]


## Iterate by row

![05_iterate_by_row](img/05_iterate_by_row.png)

In [11]:
# returns each row as Series, values can be accessed with .values as an array 
for key, value in df.iterrows():
    print(key, value.values)

0 [1.0 'Kylian' 'Mbappe' 23.0 160000000.0]
1 [2.0 'Erling' 'Haaland' 21.0 150000000.0]
2 [3.0 'Vinicius' 'Junior' 21.0 100000000.0]
3 [4.0 'Mohamed' 'Salah' 29.0 100000000.0]
4 [5.0 'Harry' 'Kane' 28.0 100000000.0]
5 [6.0 'Romelu' 'Lukaku' 28.0 100000000.0]
6 [7.0 'Bruno' 'Fernandes' 27.0 91000000.0]
7 [8.0 'Kevin' 'De Bruyne' 30.0 90000000.0]
8 [9.0 'Neymar' 'Junior' 30.0 90000000.0]
9 [10.0 'Phil' 'Foden' 21.0 85000000.0]
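If a Series per row isn't needed, df.itertuples() is another way to iterate by row; it yields one namedtuple per row. A minimal sketch on a hand-built two-row DataFrame (not the CSV from above):

```python
import pandas as pd

# Two-row stand-in, built by hand for this example
df = pd.DataFrame({
    "first_name": ["Kylian", "Erling"],
    "age": [23, 21],
})

rows = []
for row in df.itertuples(index=True):
    # row.Index is the row label, the columns are available as attributes
    rows.append((row.Index, row.first_name, row.age))

print(rows)
```

Attribute access only works for column names that are valid Python identifiers, so this fits the column names used in this chapter.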
