## Pandas - Looping & Manipulation in a Series/DataFrame


* there are many ways to access values within a DataFrame
* often we want to calculate new values based on existing ones

<ins>Various Looping Methods</ins>
1. Accessing/Manipulating data in Series / Dataframe with **iterrows()**
2. Accessing/Manipulating data in Series / DataFrame with **apply()**
3. Additional handy functions: **map(), applymap()**
4. Extra: Speed up the process - Accessing/Manipulating data in Series / DataFrame using **vectorization** methods


<ins>Why do we look at different methods for doing the "same thing"?</ins>
- simple looping takes a lot more time than using specialized methods -> important when manipulating big or multiple data sets
- sometimes it is hard to vectorize your function -> then you need to fall back on apply



A DataFrame consists of Series, which we know as its columns.

<img src="./pics/Dataframe.JPG" width = 300/>

In [49]:
import pandas as pd

# read in the Titanic dataframe
df = pd.read_csv("http://bit.ly/kaggletrain")

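# create a sub dataframe with only three rows to work with
# (.copy() gives an independent DataFrame and avoids a possible SettingWithCopyWarning on later assignments)
df_short = df.loc[:2, "PassengerId": "Age"].copy()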
df_short

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0


**How would you loop through a dataframe with the knowledge you have up to now?** ("Crude Looping")

In [50]:
df_short.shape

(3, 6)

- create looping variables for the row and the column and then iterate through the dataframe

In [51]:
for row in range(len(df_short)): #  number of rows
    for col in range(len(df_short.columns)): # number of columns
        print(df_short.iloc[row, col])

1
0
3
Braund, Mr. Owen Harris
male
22.0
2
1
1
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
female
38.0
3
1
3
Heikkinen, Miss. Laina
female
26.0


**Problem**: 
- not efficient and slow: every cell is accessed individually from Python, which becomes noticeable on large DataFrames

<img src="./pics/Speed_looping.png" width = 500/>

source: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

### 1. Looping through/Manipulating data in Series with **items()** / Dataframe with **.iterrows()**

- this chapter demonstrates the methods items() and iterrows()
- they work like looping through an array, which you may know from other programming languages
- but they are slow! For performance reasons you should prefer apply or vectorization methods, which are covered in the following chapters
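
To make the speed difference tangible, here is a minimal timing sketch. It uses a synthetic DataFrame (`df_demo`, made up for illustration, not the Titanic data), and the measured times depend on your machine, so no numbers are claimed here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data, assumed only for this illustration.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({"a": rng.random(5_000), "b": rng.random(5_000)})

# Variant 1: row-by-row loop with iterrows().
start = time.perf_counter()
loop_result = [row["a"] + row["b"] for _, row in df_demo.iterrows()]
loop_seconds = time.perf_counter() - start

# Variant 2: one vectorized operation on whole columns.
start = time.perf_counter()
vector_result = df_demo["a"] + df_demo["b"]
vector_seconds = time.perf_counter() - start

print(f"iterrows:   {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
```

Both variants compute the same values; only the time needed differs.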

### 1.1  Looping through/Manipulating data in <ins>Series</ins> with a **for loop / items()**

In [52]:
series_short = df_short["Name"].copy()
series_short

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
Name: Name, dtype: object

**a) .items() methods to retrieve index, value**
- returns the index and the value of each element of a Series

<img src="./pics/Series_items().JPG" width=200>
* green = index, 
* orange = value

In [53]:
for index, value in series_short.items():
    print("The index (green) is: " + str(index))
    print("The value (orange) is: " + str(value))

The index (green) is: 0
The value (orange) is: Braund, Mr. Owen Harris
The index (green) is: 1
The value (orange) is: Cumings, Mrs. John Bradley (Florence Briggs Thayer)
The index (green) is: 2
The value (orange) is: Heikkinen, Miss. Laina


In [54]:
# convert each value in the Series to lowercase

for index, value in series_short.items():
    series_short.loc[index] = value.lower()
    
series_short

0                              braund, mr. owen harris
1    cumings, mrs. john bradley (florence briggs th...
2                               heikkinen, miss. laina
Name: Name, dtype: object

**b) simple for-loop to retrieve only the value**

<img src="./pics/apply Series.JPG" width = 250/>

In [55]:
# this loop yields only the values, not the index, so it is awkward for changing data in place; it works well for reading each value
list_of_names = []
for element in series_short:
    list_of_names.append(element)

list_of_names

['braund, mr. owen harris',
 'cumings, mrs. john bradley (florence briggs thayer)',
 'heikkinen, miss. laina']

### 1.2  Looping through/Manipulating data in <ins>DataFrames</ins> with **iterrows()**
* generates two pieces of data per row -> the row's index label and the row's data
* the row is a Series, whose values can then be accessed via subsetting, e.g. row["Name"]
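
A minimal sketch of what iterrows() hands back, using a small made-up frame (`df_demo` and its column names are assumptions for illustration). It also shows a side effect worth knowing: because each row is packed into a single Series, an integer column may be upcast to float.

```python
import pandas as pd

# Made-up frame with one integer and one float column.
df_demo = pd.DataFrame({"count": [1, 2], "weight": [1.5, 2.5]})

for label, row in df_demo.iterrows():
    # label is the index label; row is a Series indexed by the column names
    print(label, type(row).__name__, row["count"], row.dtype)

# Take only the first (label, row) pair for a closer look.
first_label, first_row = next(df_demo.iterrows())
print(df_demo["count"].dtype)  # dtype of the original column
print(first_row.dtype)         # dtype of the row Series
```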

**You may think: let's do it like with a Series**

In [56]:
for each in df_short():
    print(each)

TypeError: 'DataFrame' object is not callable

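- this raises an error because the parentheses in df_short() try to call the DataFrame like a function. Without the parentheses the loop runs, but it iterates over the column labels only, not over the rows.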
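
For reference, a small sketch of what a plain for-loop over a DataFrame yields when it is written without the call parentheses. The two-column frame `df_demo` is made up for illustration.

```python
import pandas as pd

# Made-up frame, assumed only for this illustration.
df_demo = pd.DataFrame({"Name": ["Braund", "Cumings"], "Age": [22.0, 38.0]})

# Iterating over the DataFrame itself (no parentheses) yields the column labels.
column_labels = [label for label in df_demo]
print(column_labels)
```

The loop runs without an error, but it never sees a row.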

**This is why we use iterrows() here**

<img src="./pics/iterrows.jpg" width = 400/>


In [57]:
for index, value in df_short.iterrows():
    print(type(index))
    print(type(value))    

<class 'int'>
<class 'pandas.core.series.Series'>
<class 'int'>
<class 'pandas.core.series.Series'>
<class 'int'>
<class 'pandas.core.series.Series'>


In [10]:
for index, value in df_short.iterrows():
    print(index)
    print(value)

0
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
Name: 0, dtype: object
1
PassengerId                                                    2
Survived                                                       1
Pclass                                                         1
Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                       female
Age                                                           38
Name: 1, dtype: object
2
PassengerId                         3
Survived                            1
Pclass                              3
Name           Heikkinen, Miss. Laina
Sex                            female
Age                                26
Name: 2, dtype: object


In [58]:
## let's extract the last name from the Name column
for index, value in df_short.iterrows():
    df_short.loc[index,"last_name"] = value["Name"].split(",")[0]
df_short

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,last_name
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,Braund
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,Cumings
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,Heikkinen
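
As a point of comparison, the same extraction can be sketched without an explicit loop by using the string methods of a Series. The short `names` Series below is a made-up stand-in for the Name column.

```python
import pandas as pd

# Stand-in for the "Name" column, assumed for illustration.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], name="Name")

# Split every name at the comma and keep the first part.
last_names = names.str.split(",").str[0]
print(last_names.tolist())
```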


**Exercise**

- Calculate a new column "Culmen combined" in the DataFrame df_penguins that adds up "Culmen Length (mm)" and "Culmen Depth (mm)" using .iterrows()
- print out the first three rows 

In [59]:
import pandas as pd
df_penguins = pd.read_csv("./data/penguins/penguins.csv")
df_penguins.head(3)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,


**Solution**

In [60]:
for index, row in df_penguins.iterrows():
    # row is the current row as a Series, so its values can be read by column label
    df_penguins.loc[index, "Culmen combined"] = row["Culmen Length (mm)"] + row["Culmen Depth (mm)"]

df_penguins.head(3)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments,Culmen combined
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.,57.8
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,,56.9
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,,58.3


### 2. Looping through/Manipulating data in a Series/DataFrame with **apply()**

**2.1 For Series:**
- apply(): calls a function on each value of a Series.

**2.2 For DataFrame:**
- apply(): calls a function along one axis of a DataFrame, passing each column or each row as a Series.
    - axis = 0: the function receives one column at a time (default)
    - axis = 1: the function receives one row at a time
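
The effect of the axis argument can be sketched on a tiny numeric frame. `df_demo` and the range function are made up for illustration.

```python
import pandas as pd

# Made-up numeric frame.
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def value_range(series):
    # Works on whatever Series apply() passes in: a column or a row.
    return series.max() - series.min()

col_range = df_demo.apply(value_range, axis=0)  # one result per column
row_range = df_demo.apply(value_range, axis=1)  # one result per row

print(col_range)
print(row_range)
```

With axis=0 the result has one entry per column, with axis=1 one entry per row.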


### 2.1  Accessing/Manipulating data in <ins>Series</ins> with **.apply()**
- applies a function to each element in a Series e.g. calculate the string length of each value within a column

<img src="./pics/Series_apply.JPG" width = 700/>

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html


<img src="./pics/apply Series.JPG" width = 250/>

In [61]:
df_short = df.loc[:2, "PassengerId": "Age"].copy()
series_short = df_short["Name"].copy()
series_short

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
Name: Name, dtype: object

In [62]:
series_short_length = series_short.apply(len)
series_short_length 

0    23
1    51
2    22
Name: Name, dtype: int64

### 2.2  Accessing/Manipulating data in a <ins>DataFrame</ins> with **.apply()**

- applies a function along either axis of a DataFrame
- inside the function, .name gives the label of the current row (axis=1) or column (axis=0)
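
A minimal sketch of reading the index label inside an applied function. The frame, its index labels and the function name `describe` are hypothetical.

```python
import pandas as pd

# Made-up frame with string index labels.
df_demo = pd.DataFrame({"Age": [22.0, 38.0]}, index=["p1", "p2"])

def describe(row):
    # row.name holds the index label of the row currently being processed
    return f"{row.name}: {row['Age']:.0f} years"

labels = df_demo.apply(describe, axis=1)
print(labels.tolist())
```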

<img src="pics/DataFrame_apply.JPG" width = 700/>


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html


In [63]:
df_short

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0



<table><tr><td> axis = 0<img src="./pics/apply_df_axis=0.JPG" width = 400></td><td> axis = 1 <img src="./pics/apply_df_axis=1.JPG" width = 400></td></tr></table>


We create a function that returns True if the value in the column "Age" of the given row is above 25, and False otherwise.
We then apply this function to each row of our DataFrame.

In [64]:
def filter_on_age(row):
    if row["Age"] > 25:
        return True
    else:
        return False

df_short.loc[:,"Old"] = df_short.apply(filter_on_age, axis=1)

In [65]:
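# another example: check whether a person survived and travelled in first class
def check_survived_Pclass(row):
    # access the values by column label; integer keys such as row[1] depend on the column order
    # and are deprecated for label-indexed Series in recent pandas versions
    if row["Survived"] == 1 and row["Pclass"] == 1:
        return True
    else:
        return False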

df_short.loc[:,"PC-Class Survived"] = df_short.apply(check_survived_Pclass, axis = 1)

df_short

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Old,PC-Class Survived
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,False,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,True,True
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,True,False


**Exercise**
- Calculate a new column "Culmen combined" in the DataFrame df_penguins that adds up "Culmen Length (mm)" and "Culmen Depth (mm)" using .apply()
- print out the first three rows 

In [67]:
import pandas as pd
df_penguins = pd.read_csv("./data/penguins/penguins.csv")
df_penguins.head(3)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,


**Solution 1**

In [69]:
def culmen_combined(row):
    return row["Culmen Length (mm)"] + row["Culmen Depth (mm)"] 

df_penguins.loc[:,"Culmen combined"] = df_penguins.apply(culmen_combined, axis =1)
df_penguins.head(2)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments,Culmen combined
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.,57.8
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,,56.9


**Solution 2 with lambda**

In [70]:
df_penguins = pd.read_csv("./data/penguins/penguins.csv")

df_penguins.loc[:,"Culmen combined"] = df_penguins.apply(lambda row: row["Culmen Length (mm)"] + row["Culmen Depth (mm)"], axis =1)
df_penguins.head(3)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments,Culmen combined
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.,57.8
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,,56.9
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,,58.3


### 3. Additional handy functions: map() / applymap()
- map(): substitutes each value of a Series with another value.
- applymap(): performs an element-wise operation across the whole DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map()).


### Map 
- is a Series method
- maps each existing value of a Series to a new value, e.g. via a dictionary
- e.g. translate female and male into 0 and 1
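
One detail worth sketching: values that have no entry in the dictionary become NaN. The Series `sex` below is made up for illustration, and the fill value -1 is an arbitrary choice.

```python
import pandas as pd

# Made-up values, including one unexpected entry and one missing value.
sex = pd.Series(["male", "female", "unknown", None])

mapped = sex.map({"female": 0, "male": 1})
print(mapped)  # "unknown" and the missing value both end up as NaN

# Optional: replace the NaN values afterwards (-1 is an arbitrary marker).
filled = mapped.fillna(-1).astype(int)
print(filled.tolist())
```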


In [71]:
df_short = df.loc[:2, "PassengerId": "Age"].copy()
df_short

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0


In [72]:
df_short["Sex_num"] = df_short["Sex"].map({"female": 0, "male": 1})

In [73]:
df_short

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Sex_num
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0


### Applymap
- DataFrame method
- applies a function to every element of a DataFrame
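
Since pandas 2.1 the same element-wise operation is available as DataFrame.map(), and applymap() is deprecated. The sketch below, on a made-up frame, picks whichever method the installed version offers.

```python
import pandas as pd

# Made-up integer frame.
df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Newer pandas versions provide DataFrame.map; older ones only applymap.
if hasattr(pd.DataFrame, "map"):
    as_float = df_demo.map(float)
else:
    as_float = df_demo.applymap(float)

print(as_float)
```

For a pure type conversion like this one, df_demo.astype(float) gives the same result more directly.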


<img src="./pics/df_applymap.JPG" width = 400/>

In [74]:
df_short = df.loc[:2, "PassengerId": "Age"]
df_short

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0


In [75]:
df_short = df_short.loc[:,["PassengerId", "Survived", "Pclass", "Age"]].applymap(float)
df_short

Unnamed: 0,PassengerId,Survived,Pclass,Age
0,1.0,0.0,3.0,22.0
1,2.0,1.0,1.0,38.0
2,3.0,1.0,3.0,26.0


**Exercise**
- map the values "MALE/FEMALE" of the column "Sex" to "1/0" in the penguins DataFrame with .map() and name the new column "Sex_numerical"

In [76]:
df_penguins = pd.read_csv("./data/penguins/penguins.csv")
df_penguins.head(3)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,


**Solution**

In [77]:
# values without an entry in the dictionary (including missing values) become NaN
df_penguins.loc[:,"Sex_numerical"] = df_penguins.loc[:,"Sex"].map({"MALE":1, "FEMALE":0})
df_penguins.head(3)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments,Sex_numerical
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.,1.0
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,,0.0
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,,0.0


### 4. Extra: Speed up the process - Accessing/Manipulating data in <ins>Series / DataFrame</ins> using **vectorization** methods
<ins>What are vectorization methods?</ins>
- applying an operation to a whole array (vector) at once, instead of to single values
- we have already used vectorization indirectly, e.g. with groupby

<ins>Why should we use it?</ins>
- we can use it to avoid looping over our dataset and save a lot of time



In [22]:
import numpy as np
# just an example: assume we want to add two columns and save the result in a new one
# here we add the PassengerId to the Age of each passenger (an arbitrary choice) and save it in a new column
df.loc[:,"New_column"] = df["PassengerId"] + df["Age"]
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,New_column
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,23.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,40.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,29.0


**Let's convert our apply function to a vectorized function**

In [23]:
## review: apply function

def filter_on_age(row):
    if row["Age"] > 25:
        return True
    else:
        return False

df_short.loc[:, "Old"] = df_short.apply(filter_on_age, axis=1)

A first guess would be to pass the whole vector (Series), instead of single rows as before, into our function and then do the comparison.

In [23]:
def filter_on_age_df(series):
    print(series)
    if series > 25:
        return True
    else:
        return False

df_short.loc[:,"Old"] = filter_on_age_df(df_short["Age"])

0    22.0
1    38.0
2    26.0
Name: Age, dtype: float64


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

**This raises a ValueError** because series > 25 compares every element and returns a whole Series of True/False values, while the if statement needs a single True or False. This is where NumPy (another library, like pandas, but aimed at calculating with arrays/vectors) comes in handy.
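
A small sketch of what the comparison itself returns, using a made-up `age` Series with the three values from above. The element-wise result can be stored directly, and any() or all() reduce it to a single truth value when one is really needed.

```python
import pandas as pd

# The three ages from df_short, rebuilt here so the sketch is self-contained.
age = pd.Series([22.0, 38.0, 26.0], name="Age")

# The comparison is evaluated element by element and returns a boolean Series.
is_old = age > 25
print(is_old.tolist())

# Reduce the boolean Series to one value if an if statement needs it.
print(is_old.any())  # is at least one value above 25?
print(is_old.all())  # are all values above 25?
```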

**np.where()** (similar to the IF function in Excel)

In [None]:
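## do not run this cell: schematic signature only
import numpy as np

np.where(
    condition,       # conditional statement -> bool array
    value_if_true,   # series/array/function()/scalar used where the condition is True
    value_if_false   # series/array/function()/scalar used where the condition is False
)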

In [24]:
import numpy as np
df.loc[:, "Old"] = np.where(
    df["Age"]>25, # <-- condition
    True, # <-- return if true
    False # <-- return if false
    )
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,New_column,Old
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,23.0,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,40.0,True
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,29.0,True


- for a further speed-up, work on the underlying NumPy arrays via .values (or .to_numpy())
- for multiple conditions use **np.select**
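
The first bullet can be sketched as follows, on a made-up frame. Whether the gain is noticeable depends on the size of the data, so no timing is claimed here.

```python
import numpy as np
import pandas as pd

# Made-up frame, assumed only for this illustration.
df_demo = pd.DataFrame({"Age": [22.0, 38.0, 26.0]})

# Extract the plain NumPy array first, then run the condition on it.
age_values = df_demo["Age"].to_numpy()
df_demo["Old"] = np.where(age_values > 25, True, False)

print(df_demo)
```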

In [None]:
## do not run this cell: schematic only
conditions = [
    condition1,
    condition2,
    # etc.
]

choices = [
    value1,
    value2,
    # etc.
]

df["new column"] = np.select(conditions, choices, default="NA")

In [26]:
conditions = [
    df["Age"]<20, # first condition to test: if true return choice1, if false check next condition
    df["Age"]<30, # second condition to test: if true return choice2, if false check next condition
    df["Age"]<40 # third condition to test: if true return choice3, if false default value is returned
]

choices = [
    "young",  # choice1
    "middle", # choice2
    "old",    # choice3
]

# the default value is used if none of the conditions are true (this includes rows with a missing Age)
df.loc[:,"Age_grouped"] = np.select(conditions, choices, default="very old")
print(df.Age_grouped.value_counts())
df.head(3)

very old    340
middle      220
old         167
young       164
Name: Age_grouped, dtype: int64


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,New_column,Old,Age_grouped
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,23.0,False,middle
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,40.0,True,old
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,29.0,True,middle
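
Comparisons with a missing value are always False, so rows without an Age fall through to the default. If that is not wanted, one possible sketch is to test for missing values first. The `age` Series and the label "unknown" are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value.
age = pd.Series([15.0, 25.0, 35.0, 50.0, np.nan], name="Age")

conditions = [
    age.isna(),  # checked first, so missing ages get their own label
    age < 20,
    age < 30,
    age < 40,
]
choices = ["unknown", "young", "middle", "old"]

groups = np.select(conditions, choices, default="very old")
print(groups.tolist())
```

The order of the conditions matters: the first condition that is True decides the label.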


**Exercise**

- Calculate a new column "Culmen combined" in the DataFrame df_penguins that adds up "Culmen Length (mm)" and "Culmen Depth (mm)" using vectorization
- print out the first three rows 

In [35]:
df_penguins = pd.read_csv("./data/penguins/penguins.csv")
df_penguins.head(3)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,


**Solution**

In [36]:
df_penguins.loc[:,"Culmen combined"] = df_penguins["Culmen Length (mm)"] + df_penguins["Culmen Depth (mm)"]
df_penguins.head()

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments,Culmen combined
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.,57.8
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,,56.9
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,,58.3
3,4,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.,
4,5,PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,,56.0


**Exercise 2** <br>
Create a new column that identifies whether a penguin weighs too much. Use the following formula, which calculates the "BMI" of a penguin:
$$\frac{\text{Culmen Length (mm)} + \text{Culmen Depth (mm)} + \text{Flipper Length (mm)}}{\text{Body Mass (g)}} < 0.07$$

Name the new column "BMI". If the penguin weighs too much (according to the formula above), store True, otherwise False.

**Solution**

In [47]:
import numpy as np

df_penguins.loc[:, "BMI"] = np.where(
    (df_penguins["Culmen Length (mm)"] + df_penguins["Culmen Depth (mm)"] + df_penguins["Flipper Length (mm)"])/df_penguins["Body Mass (g)"]  <0.07, # <-- condition
    True, # <-- return if true (a boolean, not the string "True")
    False # <-- return if false
    )
df_penguins.head(3)

Unnamed: 0.1,Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments,Culmen combined,BMI
0,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.,57.8,True
1,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,,56.9,True
2,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,,58.3,False


<img src='./pics/single colored_line.JPG' width = 700 />