<a href="https://colab.research.google.com/github/MMRES-PyBootcamp/MMRES-python-bootcamp2021/blob/master/04_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 2 - Pandas (Second part)
> A second look at Pandas. Here you will learn how to *transform* the data stored in a DataFrame, how to *export* it, and how to *group*, *aggregate*, *pivot* and *melt* it. Finally, you will see how to *apply* your own functions to a DataFrame.

## Outline
 * [DataFrame transformations](#DataFrame-transformations)
   * [DataFrame numerical transformations](#DataFrame-numerical-transformations)
   * [DataFrame text transformations](#DataFrame-text-transformations)
 * [Exporting DataFrames](#Exporting-DataFrames)
 * [Grouping-by and aggregating DataFrames](#Grouping-by-and-aggregating-DataFrames)
 * [Pivoting DataFrames](#Pivoting-DataFrames)
 * [Melting DataFrames](#Melting-DataFrames)
 * [User defined functions](#User-defined-functions)

<div class="alert alert-block alert-success"><b>Practice:</b> Practice cells announce exercises that you should try during the current boot camp session.
</div>

<div class="alert alert-block alert-warning"><b>Extension:</b> Extension cells correspond to exercises (or links to contents) that are a bit more advanced. We recommend trying them after the current boot camp session.
</div>

<div class="alert alert-block alert-info"><b>Tip:</b> Tip cells just give some advice or complementary information.
</div>

<div class="alert alert-block alert-danger"><b>Caveat:</b> Caveat cells warn you about the most common pitfalls people run into when they start learning Python.

</div>

**This document is devised as a tool to enable your self-learning process. If you get stuck at some step or need any kind of help, please don't hesitate to raise your hand and ask for the teacher's guidance.**

---

## DataFrame transformations

We are now familiar with how to *access* the data stored in a DataFrame. Our next step is to learn how to *transform* such data. Let's begin again by loading Pandas with the `pd` alias and by importing `ToySpreadsheet.xlsx` from the `/MMRES-python-bootcamp2022/datasets` sub-folder:

In [None]:
# Load package with its corresponding alias
import pandas as pd

# Reading an Excel SpreadSheet and storing it as a DataFrame called `df`
df = pd.read_excel(io='datasets/ToySpreadsheet.xlsx')

# Return the DataFrame
df

### DataFrame numerical transformations

Let's start by *standardizing* the values of a numerical column. By *standardizing* we mean taking a given distribution of values and transforming it into a new distribution with mean equal to zero and standard deviation equal to one. Each *standardized* value is usually known as the [standard score](https://en.wikipedia.org/wiki/Standard_score) or *Z-score*. The $i$<sup>th</sup> observation of an $x$ magnitude, $(x_i)$, has a Z-score, $(Z_i)$, given by the following equation:

\begin{equation}
Z_i = \frac{x_i - \mu(x)}{\sigma(x)} ,
\end{equation}

where, $\mu(x)$ and $\sigma(x)$ are the mean and the standard deviation of $x$, respectively. For example, let's get the Z-score of `df['Intensity']`:

In [None]:
# Get the mean of the 'Intensity': I_mean
I_mean = df['Intensity'].mean()
print(I_mean)

# Get the standard deviation of the 'Intensity': I_std
I_std = df['Intensity'].std()
print(I_std)

# Computing the Z-score of the 'Intensity' column
I_z = (df['Intensity'] - I_mean) / I_std

# Storing it in a new 'Z-Intensity' column
df['Z-Intensity'] = I_z

# Return the DataFrame
df

Note how easy it is:
* To operate with a Pandas Series and numeric constants stored in variables: `(df['Intensity'] - I_mean) / I_std`.
* To store a freshly created Series `I_z` into a pre-existing DataFrame `df` with a new column name `'Z-Intensity'`.
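
As a quick sanity check, a standardized column should end up with a mean of (approximately) zero and a standard deviation of (approximately) one. The minimal sketch below uses made-up numbers in a hypothetical `toy` DataFrame rather than `ToySpreadsheet.xlsx`. Note that the Pandas `.std()` method computes the *sample* standard deviation by default.

```python
import pandas as pd

# Hypothetical toy data standing in for df['Intensity']
toy = pd.DataFrame({'Intensity': [2.0, 4.0, 6.0, 8.0]})

# Z-score: subtract the mean, then divide by the (sample) standard deviation
z = (toy['Intensity'] - toy['Intensity'].mean()) / toy['Intensity'].std()

# Mean and standard deviation of the standardized values
z_mean = z.mean()
z_std = z.std()
print(z_mean, z_std)
```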

<div class="alert alert-block alert-success"><b>Practice:</b>

The $i$<sup>th</sup> observation of an $x$ magnitude, $(x_i)$, has a 0-to-1 normalization, $(N_i)$, given by the following equation:

\begin{equation}
N_i = \frac{x_i - m(x)}{M(x) - m(x)},
\end{equation}

where, $m(x)$ and $M(x)$ are the minimum and the maximum values of $x$, respectively.
    
1) In the 1<sup>st</sup> code cell below, compute the 0-to-1 normalization of `df['Amplitude']`. 
    
Un-comment and fill only those code lines with underscores `___` or `print()`.
</div>

In [None]:
# Get the minimum of the 'Amplitude': A_min
#A_min = ___
#print(A_min)

# Get the maximum of the 'Amplitude': A_max
#A_max = ___
#print(A_max)

# Computing the N-normalization of the 'Amplitude' column and storing it in a new 'N-Amplitude' column
#df['N-Amplitude'] = ___

# Return the DataFrame
#df

In [None]:
# Get the minimum of the 'Amplitude': A_min
A_min = df['Amplitude'].min()
print(A_min)

# Get the maximum of the 'Amplitude': A_max
A_max = df['Amplitude'].max()
print(A_max)

# Computing the N-normalization of the 'Amplitude' column and storing it in a new 'N-Amplitude' column
df['N-Amplitude'] = (df['Amplitude'] - A_min) / (A_max - A_min)

# Return the DataFrame
df

Let's now devote some time to tidying `df` a bit more. For example, now that we have `'Z-Intensity'` and `'N-Amplitude'` we could discard `'Intensity'` and `'Amplitude'` using the [`.drop()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method:

In [None]:
# Dropping redundant columns 'Intensity' and 'Amplitude'
list_drop = ['Intensity', 'Amplitude']
df = df.drop(columns=list_drop)

# Return the DataFrame
df

In general, it is good practice to use *self-explanatory* labels for DataFrame columns. However, it is also recommended to keep labels as *short* as possible (try to find your balance between self-explanatory and short). With this in mind, let's update some column labels from `df` using the [`.rename()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) method:

In [None]:
# Creating a renaming dictionary for the upcoming column rename
dic_rename = {'Software': 'Soft', 'Sequence': 'Seq', 'Z-Intensity': 'I', 'N-Amplitude': 'A'}
# Key (Old name). Value (New name)

# Renaming some columns from `df`
df = df.rename(columns=dic_rename)

# Return the DataFrame
df

Do you remember that dictionaries are known as a *mapping data type*? When calling `df.rename(columns=dic_rename)`, we used `dic_rename` to *map* old (*keys*) to new (*values*) column labels.

### DataFrame text transformations

Sometimes it is useful to apply text transformations to a given DataFrame column. For example, look at the column `'Raw'`. The strings within this column have a well-organized structure comprising multiple substrings joined with underscores (`_`):

In [None]:
# Return 'Raw' column as a Series
df['Raw']

It seems that we have a date (`1985-04-06`), a four-digit code (`0123`), a two-letter code (`GA`), some kind of single-letter indicator (`T` / `C`), and another correlative indicator (`R1` / `R2` / `R3` / `R4`). Let's store each piece of information in its own `df` column:

In [None]:
# Splitting the 'Raw' column by the underscore '_'
df['Split raw'] = df['Raw'].str.split('_')

# Return the DataFrame
df

Note that we get a new column called `'Split raw'` that has lists within! To achieve this we first used the accessor method [`.str`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html) to *access* the strings stored in column `'Raw'`. Then, we chained the string method [`.split()`](https://docs.python.org/3/library/stdtypes.html#str.split) which (as you already know) returns lists. Now we should access the substrings stored within the lists stored in column `'Split raw'`:

In [None]:
# Taking the 1st element of the list in 'Split raw' as 'Date'
df['Date'] = df['Split raw'].str[0]

# Return the DataFrame
df

We get a new column `'Date'` with the information we were looking for.

<div class="alert alert-block alert-success"><b>Practice:</b>
    
1) In the 1<sup>st</sup> code cell below, get a new column called `'ID'` for the four-digit code (`0123`); a new column called `'User'` for the two-letter code (`GA`); a new column called `'Cond'` for the single-letter indicator (`T` / `C`); and a new column called `'Rep'` for the correlative indicator (`R1` / `R2` / `R3` / `R4`).
    
2) In the 2<sup>nd</sup> code cell below, discard columns `df['Raw']` and `df['Split raw']`.

    
    
Un-comment and fill only those code lines with underscores `___`.
</div>

In [None]:
# Taking 2nd, 3rd, 4th and 5th elements of the list in 'Split raw' as 'ID', 'User', 'Cond' and 'Rep'
#df['ID'] = ___
#df['User'] = ___
#df['Cond'] = ___
#df['Rep'] = ___

In [None]:
# Taking 2nd, 3rd, 4th and 5th elements of the list in 'Split raw' as 'ID', 'User', 'Cond' and 'Rep'
df['ID'] = df['Split raw'].str[1]
df['User'] = df['Split raw'].str[2]
df['Cond'] = df['Split raw'].str[3]
df['Rep'] = df['Split raw'].str[4]

In [None]:
# Dropping redundant columns 'Raw' and 'Split raw'
#list_drop = ___
#df = df.drop(columns=___)

# Return the DataFrame
#___

In [None]:
# Dropping redundant columns 'Raw' and 'Split raw'
list_drop = ['Raw', 'Split raw']
df = df.drop(columns=list_drop)

# Return the DataFrame
df
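
As an alternative to building a column of lists and then picking elements one by one, `.str.split()` accepts `expand=True`, which returns one column per substring. The sketch below is only illustrative: the `toy` DataFrame and its two strings are hypothetical, written to follow the structure described above.

```python
import pandas as pd

# Hypothetical strings that follow the structure of the 'Raw' column
toy = pd.DataFrame({'Raw': ['1985-04-06_0123_GA_T_R1',
                            '1985-04-06_0123_GA_C_R2']})

# expand=True returns a DataFrame with one column per substring
parts = toy['Raw'].str.split('_', expand=True)

# Label the new columns and attach them to the original DataFrame
parts.columns = ['Date', 'ID', 'User', 'Cond', 'Rep']
toy = pd.concat([toy, parts], axis=1)

print(toy)
```

With this approach there is no intermediate `'Split raw'` column to drop afterwards.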

## Exporting DataFrames

At this point, `df` is tidy enough to be exported and stored locally. Look how easy it is to save a DataFrame to our hard disk with the method [`.to_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html):

In [None]:
# Exporting the DataFrame as an Excel SpreadSheet
df.to_excel(excel_writer='datasets/DataFrameSpreadsheet.xlsx', sheet_name='Excel_df', index=False)

## Grouping-by and aggregating DataFrames

The DataFrame method [`.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) is one of the most useful for diving into your data. A group-by-and-aggregate operation takes place in three steps.

1) DataFrame rows are **grouped by** the categories within a given column (or columns).
2) The column (or columns) we want to aggregate are accessed.
3) The accessed columns are then **aggregated** using an aggregating function.
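
The three steps above can be written one at a time to see what each one contributes. This is a minimal sketch on a hypothetical `toy` DataFrame with made-up values; in practice the steps are usually chained in a single line.

```python
import pandas as pd

# Hypothetical toy data with one categorical and two numerical columns
toy = pd.DataFrame({'Soft': ['X', 'X', 'Y', 'Y'],
                    'I': [1.0, 3.0, 10.0, 20.0],
                    'A': [0.2, 0.4, 0.6, 0.8]})

# Step 1: group the rows by the categories in 'Soft'
grouped = toy.groupby(by=['Soft'])

# Step 2: access the columns we want to aggregate
selected = grouped[['I', 'A']]

# Step 3: aggregate the accessed columns with the mean
means = selected.mean()

print(means)
```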

For example, suppose that we would like to know the mean `'I'` and `'A'` according to each `Soft`:

In [None]:
# Grouping by 'Soft' and aggregating with mean
df_g = df.groupby(by=['Soft'])[['I', 'A']].mean()

# Return the DataFrame
df_g

Similarly, maybe we would like to know the mean `'I'` and `'A'` according to each `Node`:

In [None]:
# Grouping by 'Node' and aggregating with mean
df_g = df.groupby(by=['Node'])[['I', 'A']].mean()

# Return the DataFrame
df_g

We can also group by multiple columns to make the information a bit more explicit:

In [None]:
# Grouping by 'Soft', 'Node', 'RNA' and aggregating with mean
df_g = df.groupby(by=['Soft', 'Node', 'RNA'])[['I', 'A']].mean()

# Return the DataFrame
df_g

Note that grouping by `'RNA'` is not really necessary with this DataFrame (we have a single RNA sequence instead of multiple RNA sequences).

<div class="alert alert-block alert-success"><b>Practice:</b>
    
1) In the 1<sup>st</sup> code cell below, group `df` by `'Soft'`, `'Node'`, `'RNA'`, and aggregate `'A'` with the minimum. Store the "grouped-by-and-aggregated" DataFrame as `df_g_Amin`.
    
2) In the 2<sup>nd</sup> code cell below, group `df` by `'Soft'`, `'Node'`, `'RNA'`, and aggregate `'I'` with the maximum. Store the "grouped-by-and-aggregated" DataFrame as `df_g_Imax`.
    
Un-comment and fill only those code lines with underscores `___`.
</div>

In [None]:
# Grouping by 'Soft', 'Node', 'RNA' and aggregating 'A' with min
#df_g_Amin = ___

# Return the DataFrame
#___

In [None]:
# Grouping by 'Soft', 'Node', 'RNA' and aggregating 'A' with min
df_g_Amin = df.groupby(by=['Soft', 'Node', 'RNA'])[['A']].min()

# Return the DataFrame
df_g_Amin

In [None]:
# Grouping by 'Soft', 'Node', 'RNA' and aggregating 'I' with max
#df_g_Imax = ___

# Return the DataFrame
#___

In [None]:
# Grouping by 'Soft', 'Node', 'RNA' and aggregating 'I' with max
df_g_Imax = df.groupby(by=['Soft', 'Node', 'RNA'])[['I']].max()

# Return the DataFrame
df_g_Imax

Calling the `.groupby()` method on a DataFrame gives a DataFrameGroupBy object that has another method called [`.agg()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html). This method is useful when we want to use multiple aggregating functions at the same time:

In [None]:
# Creating a list of columns to group by with
list_gby = ['Soft', 'Node', 'RNA']

# Creating a list of columns to aggregate
list_agg = ['A', 'I']

# Creating a list with string function names to aggregate with
list_funs = ['min', 'max']

# Group by and aggregate with multiple functions
df_g = df.groupby(by=list_gby)[list_agg].agg(func=list_funs)

# Return the DataFrame
df_g

Note that now we get the minimum and the maximum for both columns `I` and `A`. Since we just want maximum `I` and minimum `A`, it would be great to specify which aggregating functions we want for each column. We can achieve this with a dictionary:

In [None]:
# Creating a dictionary specifying how to aggregate each column
dict_aggfuns = {'A': 'min', 'I': 'max'}

# Group by and aggregate specifying how to aggregate each column
df_g = df.groupby(by=list_gby).agg(func=dict_aggfuns)

# Return the DataFrame
df_g

<div class="alert alert-block alert-warning"><b>Extension:</b>

We can specify the aggregating functions passed to the `.agg()` method as:
1) Lists of builtin functions (like `min`, `max`).
2) Lists of functions from packages (like `np.min`, `np.max`).
3) Lists of "function string names" (like `'min'`, `'max'`).
</div>
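
A related option, not used elsewhere in this notebook, is *named aggregation*: each keyword passed to `.agg()` names an output column and is given a `(column, function)` tuple. The sketch below uses a hypothetical `toy` DataFrame, and the output names `A_min` and `I_max` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Soft': ['X', 'X', 'Y', 'Y'],
                    'Node': ['n1', 'n1', 'n2', 'n2'],
                    'A': [0.1, 0.3, 0.5, 0.7],
                    'I': [1.0, 3.0, 10.0, 20.0]})

# Named aggregation: output_name=(input_column, aggregating_function)
summary = toy.groupby(by=['Soft', 'Node']).agg(A_min=('A', 'min'),
                                               I_max=('I', 'max'))

print(summary)
```

Compared with the dictionary form, this yields flat and self-explanatory column labels in the result.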

## Pivoting DataFrames

If you are an experienced spreadsheet user, you may find the term "pivot table" more familiar than "grouping-by and aggregating". In general, everything that can be achieved by grouping-by-and-aggregating can also be done with the [`.pivot_table()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) method:

In [None]:
# Group by and aggregate with multiple functions
df.groupby(by=list_gby).agg(func=dict_aggfuns)

In [None]:
# Pivoting and aggregating with multiple functions
df.pivot_table(index=list_gby, aggfunc=dict_aggfuns)

Note the correspondence between the `by=` parameter from `.groupby()` and the `index=` parameter from `.pivot_table()`, and similarly, between the `func=` parameter from the `.agg()` method and the `aggfunc=` parameter from `.pivot_table()`. One distinguishing feature of `.pivot_table()` is the parameter `columns=`:

In [None]:
# Creating a list of columns to group by with
list_indexes = ['RNA']

# Creating a list of columns whose categories will become DataFrame columns
list_columns = ['Soft', 'Node']

df.pivot_table(index=list_indexes, columns=list_columns, aggfunc=dict_aggfuns)

By specifying `columns=`, we can now split `'Soft'` and `'Node'` categories by means of DataFrame columns.

In [None]:
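# Pivoting with the default aggregating function (mean), restricted to the numerical columns 'I' and 'A'
df.pivot_table(index=list_indexes, columns=list_columns, values=['I', 'A'])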
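
To make the link between both approaches concrete, the sketch below builds the same table twice on a hypothetical `toy` DataFrame: once with `.pivot_table()` and once by grouping, aggregating and then moving one grouping level to the columns with `.unstack()`. The data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'RNA': ['r1', 'r1', 'r1', 'r1'],
                    'Soft': ['X', 'X', 'Y', 'Y'],
                    'I': [1.0, 3.0, 10.0, 20.0]})

# Pivot table: 'RNA' on the rows, 'Soft' categories on the columns
pivoted = toy.pivot_table(index='RNA', columns='Soft', values='I', aggfunc='mean')

# Same table from a group-by: aggregate, then move 'Soft' to the columns
unstacked = toy.groupby(by=['RNA', 'Soft'])['I'].mean().unstack('Soft')

# Compare the numerical content of both tables
matches = np.allclose(pivoted.to_numpy(), unstacked.to_numpy())
print(pivoted)
print(matches)
```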

## Melting DataFrames

In a "[Tidy DataFrame](https://www.jstatsoft.org/article/view/v059i10)", each variable is a column and each observation is a row. The Pandas function [`melt()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) makes it easy to switch from a "Non-tidy DataFrame" to a "Tidy DataFrame". Since our example DataFrame `df` is quite tidy, let's rename columns `'I'` and `'A'` just to better illustrate how `melt()` works:

In [None]:
# Creating a renaming dictionary for the upcoming column rename
dic_rename = {'I': 'Cat', 'A': 'Dog'}
# Key (Old name). Value (New name)

# Renaming some columns from `df`
df = df.rename(columns=dic_rename)

# Return the DataFrame (BEFORE melting)
df

Now it looks as if we have rows that mix observations for `'Cat'` and `'Dog'`. Let's melt this "Non-tidy DataFrame":

In [None]:
# Melting 'Cat' and 'Dog', keeping 'Soft', 'Node', 'RNA', 'Cond' and 'Rep'
df_melt = pd.melt(frame=df,
                  id_vars=['Soft', 'Node', 'RNA', 'Cond', 'Rep'],
                  value_vars=['Cat', 'Dog'],
                  var_name='Animal',
                  value_name='Score')

# Return the DataFrame (AFTER melting)
df_melt

Now in `df_melt`, each row is an observation and each column is a variable. Note the arguments we used in `pd.melt()`:
 + `id_vars=`: List of columns to use as identifiers on the "melted" DataFrame.
 + `value_vars=`: List of columns to "melt".
 + `var_name=`: String to name "melted" columns.
 + `value_name=`: String to name "melted" values.
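
The effect of these arguments is easier to see on a very small table. The sketch below melts a hypothetical two-row `wide` DataFrame (made-up values) and then uses `.pivot()` to go back to the wide layout, which shows that melting only rearranges the data.

```python
import pandas as pd

# Hypothetical wide table: one row per replicate, one column per animal
wide = pd.DataFrame({'Rep': ['R1', 'R2'],
                     'Cat': [0.5, 1.5],
                     'Dog': [2.5, 3.5]})

# Melt: one row per (replicate, animal) observation
long = pd.melt(frame=wide,
               id_vars=['Rep'],
               value_vars=['Cat', 'Dog'],
               var_name='Animal',
               value_name='Score')

# Pivot back to the wide layout
back = long.pivot(index='Rep', columns='Animal', values='Score')

print(long)
print(back)
```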

Although pivot tables are easier to inspect at a glance than Tidy DataFrames, it is always recommended to work with *tidy data*. In the boot camp session devoted to data visualization on September 21<sup>st</sup> (16:00-17:00), we will see that many Python plotting functions work better with "Tidy DataFrames".

## `Apply()`

In [None]:
import numpy as np
import pandas as pd
print('Pandas:', pd.__version__)

In [None]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)

In [None]:
numbers = [1, 2, 3]
pd.Series(numbers)

Notice that the series is indexed by default by integers. We can change this indexing by using a dictionary instead of a list to create the series.

In [None]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

On the other hand, dataframes can be built from two-dimensional arrays, with the ability of labelling columns and indexing the rows. **Every column in a dataframe is a series**. 

In [None]:
# Sampling a 1000 rows 6 cols 2D array from the standard normal distribution and creating DataFrame
u = pd.DataFrame(np.random.randn(1000, 6),
                 index=np.arange(0, 3000, 3),
                 columns=['A', 'B', 'C', 'D', 'E', 'F'])

print(type(u))

u

As you might have noticed, displaying a massive dataframe in full is not very practical. Some methods give us a more convenient look at parts of the dataframe, or a summary of it, to get an idea of "how things are going".

In [None]:
u.head()

In [None]:
u.tail()

In [None]:
u.info()

In [None]:
u.describe()

### Indexing/Slicing in Pandas

The easiest way to access information in a Pandas dataframe, equivalent to the way used in NumPy, is using the `iloc` command. With `iloc` we can use the same indexing techniques that we saw with NumPy in the previous notebook.

In [None]:
# Slice-in rows at positions 125 to 131 (132 excluded!) from columns at positions 0, 2 and 5
u.iloc[125:132, [0, 2, 5]]

We can choose specific columns according to their names using `loc` instead of `iloc`.

In [None]:
# Slice-in rows 375 to 393 (393 included!) from columns A, C and F
u.loc[375:393, ['A', 'C', 'F']]
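
The difference between the two slicing rules is easy to miss: `iloc` slices by *position* and excludes the end, while `loc` slices by *label* and includes it. The sketch below shows both on a small hypothetical `toy` DataFrame whose index, like that of `u`, goes in steps of 3.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with index labels 0, 3, 6, ..., 27
toy = pd.DataFrame({'A': np.arange(10)}, index=np.arange(0, 30, 3))

# Position-based slice: positions 2, 3 and 4 (position 5 is excluded)
by_position = toy.iloc[2:5]

# Label-based slice: labels 6, 9, 12 and 15 (label 15 is included)
by_label = toy.loc[6:15]

print(list(by_position.index))
print(list(by_label.index))
```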

However, there are a few other ways of accessing the data in a Pandas dataframe, which typically have a more "direct" connection with the actual content of the dataframe. Individual columns or sets of columns can also be accessed by their column names. Choosing one single column will give a Series, while two or more will produce a DataFrame.

In [None]:
u['A'].head()

In [None]:
u[['A', 'D']].head()

Not only that, we can also access a single column as an attribute, without the need for brackets `[]`:

In [None]:
u.A.head()

Or, we can retrieve the rows that satisfy some condition:

In [None]:
u[u.D > 2]

Dataframes also provide the `query` method for the same purpose. It is less flexible than boolean indexing, but it is often shorter to write (especially when the dataframe name is longer than just `u`) and can be faster on large dataframes.

In [None]:
u.query('D > 2')
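
Inside a `query` string, Python variables can be referenced by prefixing them with `@`. The sketch below, on a hypothetical `toy` DataFrame, checks that boolean indexing and `query` select the same rows; the variable name `threshold` is just an illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'D': [0.5, 2.5, 3.0, -1.0],
                    'E': [1, 2, 3, 4]})

# A Python variable we want to use in the condition
threshold = 2

# Boolean indexing
mask_result = toy[toy['D'] > threshold]

# Same selection with query, referencing the variable with '@'
query_result = toy.query('D > @threshold')

print(query_result)
```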

### Reshaping `DataFrame`

We can reshape and concatenate dataframes in a pretty similar way to numpy arrays. 

In [None]:
df1 = pd.DataFrame()

df1['sample'] = ['A', 'A', 'A', 'B', 'B', 'B']
df1['replicate'] = ['01', '02', '03', '01', '02', '03']
df1['protein'] = 'P02768'
df1['value1'] = np.random.randn(6)

df1

In [None]:
pivot_df1 = df1.pivot(index='replicate', columns='sample', values='value1')

pivot_df1.head()

### Computing With `DataFrames`

We can calculate with `DataFrames` or their columns (which are `Series`) the same way we would work with numpy arrays.

In [None]:
df1['value2'] = 1 / df1['value1']
df1.head()

In [None]:
# Column-wise mean of the numerical columns (text columns have no mean)
df1[['value1', 'value2']].mean()

We can also apply functions to the whole dataset or to specific columns with the `apply` method. `apply` acts on a whole column at a time (i.e. a Pandas `Series`), so we can compute things that depend on several values of the column, for instance, the mean value. To apply functions on a truly element-by-element basis, `Series.apply` or `DataFrame.applymap` (renamed to `DataFrame.map` in Pandas 2.1) should be used.

In [None]:
def mean(col):
    return sum(col) / len(col)

df1[['value1', 'value2']].apply(mean)

While most of these quantities can be calculated directly (including the mean in the example above), `apply` also works on columns with strings or categorical data, where no mathematical operations are defined. The limit is the imagination.
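
The sketch below contrasts the three ways of applying a function mentioned above: column by column, row by row (with `axis=1`), and element by element on a single column. The `toy` DataFrame and the `'high'` / `'low'` labels are hypothetical and only meant as an illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'value1': [1.0, 2.0, 4.0],
                    'value2': [10.0, 20.0, 40.0]})

# Column-wise: the function receives each column as a Series
col_range = toy.apply(lambda col: col.max() - col.min())

# Row-wise: with axis=1 the function receives each row as a Series
row_sum = toy.apply(lambda row: row['value1'] + row['value2'], axis=1)

# Element-wise: Series.apply calls the function on each single value
labels = toy['value1'].apply(lambda x: 'high' if x > 1.5 else 'low')

print(col_range)
print(row_sum)
print(labels)
```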

### Combining `DataFrames`

Something we will do quite often as scientists is combining data from different sources into one single source. This can be achieved by different commands in Pandas, depending on the actual goal we want.

To begin with, new rows of data are appended with the function `pd.concat()`. Older Pandas code often uses the DataFrame method `.append()` for this, but that method was deprecated in Pandas 1.4 and removed in Pandas 2.0.

In [None]:
df2 = pd.DataFrame()

df2['sample'] = ['A', 'A', 'A', 'B', 'B', 'B']
df2['replicate'] = ['01', '02', '03', '01', '02', '03']
df2['protein'] = 'P69892'
df2['value1'] = np.random.randn(6)
df2['value2'] = 1 / df2['value1']

df2

In [None]:
# Stack the rows of `df2` below those of `df1` and renumber the index
df = pd.concat([df1, df2], ignore_index=True)

df

### Grouping Data

In [None]:
df.groupby('protein').agg(sum)

In [None]:
df.groupby(['protein', 'sample']).agg(sum)

In [None]:
df.groupby(['protein', 'sample', 'replicate']).agg(sum)

In [None]:
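# Replace each numerical value by the mean of its group (text columns are skipped)
df.groupby('protein').transform('mean', numeric_only=True)

In [None]:
# Same, selecting the columns explicitly (note the list inside the brackets)
df.groupby('protein')[['value1', 'value2']].transform('mean')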
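
Unlike `.agg()`, which returns one row per group, `.transform()` returns a result with the same length as the original data, so it can be stored directly as a new column. A minimal sketch on a hypothetical `toy` DataFrame (made-up protein labels and values), using it to center each value on its group mean:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'protein': ['P1', 'P1', 'P2', 'P2'],
                    'value1': [1.0, 3.0, 10.0, 30.0]})

# agg: one value per group
agg_mean = toy.groupby('protein')['value1'].agg('mean')

# transform: one value per original row (the mean of the row's group)
group_mean = toy.groupby('protein')['value1'].transform('mean')

# Center each value on the mean of its own group
toy['centered'] = toy['value1'] - group_mean

print(agg_mean)
print(toy)
```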

In [None]:
for g, g_df in df.groupby(['protein', 'sample']):
    print(g_df)
    print(f"{g} --> mean value1: {np.mean(g_df['value1'])}")
    print(f"      mean value2: {np.mean(g_df['value2'])}\n")

In [None]:
df.groupby(['protein', 'sample']).describe()

In [None]:
df.pivot_table(index='protein',
               columns='sample',
               values=['value1', 'value2'],
               aggfunc='mean')

In [None]:
df.pivot_table(index='protein',
               columns='sample',
               aggfunc={'value1': min,
                        'value2': max})

### Loading and saving dataframes

To load and save Pandas dataframes we will use the `to_csv` and `read_csv` commands. Whenever the dataframe does not contain any kind of column that is of type `object` we can also use feather format with `to_feather`. In case we have objects in the cells, such as functions, for example, we can use pickle format with `to_pickle`. 

In [None]:
df.to_csv('test.csv')
pd.read_csv('test.csv', index_col=0)
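
The `index_col=0` argument matters because `to_csv` writes the index as the first column of the file. The sketch below shows the difference on a hypothetical `toy` DataFrame, using an in-memory text buffer (`io.StringIO`) instead of a file on disk so that nothing is written to your hard disk.

```python
import io

import pandas as pd

# Hypothetical toy data with a labelled index
toy = pd.DataFrame({'value1': [1.5, 2.5]}, index=['a', 'b'])

# Write the CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer)

# Without index_col, the saved index comes back as an ordinary column
buffer.seek(0)
without_index_col = pd.read_csv(buffer)

# With index_col=0, the first column is restored as the index
buffer.seek(0)
with_index_col = pd.read_csv(buffer, index_col=0)

print(list(without_index_col.columns))
print(list(with_index_col.columns))
```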

In addition, Pandas has dedicated commands to load and save Excel spreadsheets (yay!). To use them you'll need the `openpyxl` package for `.xlsx` files (`xlrd` is only needed for legacy `.xls` files).

In [None]:
df.to_excel('test.xlsx', sheet_name='My sheet')
pd.read_excel('test.xlsx', sheet_name='My sheet', index_col=0)

**Exercise 5**: Download [this dataset](https://raw.githubusercontent.com/ChihChengLiang/pokemongor/master/data-raw/pokemons.csv) and load it, using the first column as the index. Take a look at it, and do the following things:
- Choose the columns 'Identifier', 'BaseStamina', 'BaseAttack', 'BaseDefense', 'Type1' and 'Type2' 
- Create a function that lowercases strings and apply it to 'Type1' and 'Type2' (*Extra: just capitalize the strings, i.e., leave the first letter uppercase and lowercase the rest*)
- Create a function that returns a Boolean value (don't be afraid of this, it is just a function that returns either True or False) that tells if a Pokémon has high stamina (BaseStamina>170) or not. Store this information in a new column and show the list of Pokémon with high stamina
- Show the instructor the last 15 rows of your dataset

In [None]:
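# Load the dataset, using the first column as the index
df = pd.read_csv('https://raw.githubusercontent.com/ChihChengLiang/pokemongor/master/data-raw/pokemons.csv',
                 index_col=0)

# Keep only the requested columns
df = df[['Identifier', 'BaseStamina', 'BaseAttack', 'BaseDefense', 'Type1', 'Type2']]

# Capitalize a string (first letter uppercase, the rest lowercase)
def capitalize(st):
    return st.capitalize()

for col in ['Type1', 'Type2']:
    df[col] = df[col].apply(capitalize)

# Tell whether a stamina value is high (greater than 170)
def highstamina(x):
    return x > 170

df['HighStamina'] = df['BaseStamina'].apply(highstamina)

# Show the Pokémon with high stamina
print(df[df['HighStamina']]['Identifier'])

# Show the last 15 rows
df.tail(15)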
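
The exercise asks for functions used with `apply`, but the same result can usually be obtained with vectorized operations, which tend to be shorter. The sketch below is an alternative under that assumption; it uses a hypothetical `toy` DataFrame with made-up rows rather than the downloaded dataset. One practical difference is that `.str` methods leave missing values as missing instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data mimicking a few columns of the dataset
toy = pd.DataFrame({'Identifier': ['Alpha', 'Beta', 'Gamma'],
                    'BaseStamina': [90, 180, 320],
                    'Type1': ['GRASS', 'fire', 'Water']})

# Vectorized string method instead of apply with a custom function
toy['Type1'] = toy['Type1'].str.capitalize()

# Vectorized comparison instead of apply with a custom function
toy['HighStamina'] = toy['BaseStamina'] > 170

# Names of the rows flagged as high stamina
high = toy.loc[toy['HighStamina'], 'Identifier']

print(high)
```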