In [1]:
import pandas as pd
import numpy as np
import random

In [2]:
random.seed(42)

In [3]:
URL = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv"
penguins = pd.read_csv(URL)

In [4]:
#add a new column body_mass_kg
penguins['body_mass_kg'] = penguins['body_mass_g']/1000

# confirm the new column is in the data frame
print("body_mass_kg is in the data frame's columns: ", 'body_mass_kg' in penguins.columns)
#look at the new column
penguins.head()

body_mass_kg is in the data frame's columns:  True


,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,body_mass_kg
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,3.75
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,3.8
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,3.25
3,Adelie,Torgersen,,,,,,2007,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,3.45


### …at a specific location

The new column was added at the end of the data frame, which is the default. If we want to create a new column and insert it at a particular position, we can use the data frame method `insert()`:

### example

give each penguin observation a unique identifier as a three-digit number and add this column at the beginning of the data frame

In [5]:
# create unique random 3-digit codes
codes = random.sample(range(100, 1000), len(penguins))  # sampling without replacement

penguins.insert(loc=0,             # position of the new column
                column='id_code',  # new column name
                value=codes)

In [6]:
penguins.head()

,id_code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,body_mass_kg
0,754,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,3.75
1,214,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,3.8
2,125,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,3.25
3,859,Adelie,Torgersen,,,,,,2007,
4,381,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,3.45


## adding multiple columns

use `assign()` to create or update multiple columns in the same call:
    
```
df = df.assign( new_col1_name = new_col1_values,
                new_col2_name = new_col2_values)
```
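
The reassignment `df = df.assign(...)` in the syntax above matters because `assign()` returns a new data frame and leaves the original untouched. A minimal sketch on a toy data frame (the column names and values here are made up for illustration):

```python
import pandas as pd

# Toy data frame; names and values are illustrative only
df = pd.DataFrame({'length_mm': [181.0, 186.0]})

# assign() returns a new data frame with the extra columns
result = df.assign(length_cm=df['length_mm'] / 10,
                   observer=['A', 'B'])

print(list(df.columns))      # ['length_mm']
print(list(result.columns))  # ['length_mm', 'length_cm', 'observer']
```

Without the reassignment, the new columns would exist only in `result`.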

### example

we want to add these new columns:

- flipper length converted from mm to cm and
- a code representing the observer

In [7]:
# create columns with observer codes and flipper length in cm
penguins = penguins.assign(flipper_length_cm=penguins['flipper_length_mm']/10,
                          observer=random.choices(['A','B','C'], k=len(penguins)))

## removing columns

remove columns using the `drop()` method:

```
df = df.drop(columns=col_names)
```

`col_names` can be a single column name (string) or a list of column names (each a string).
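
As a quick sketch of both forms, here is `drop()` called with a single name and with a list of names. The toy data frame and its column names are made up for illustration:

```python
import pandas as pd

# Toy data frame; the column names are illustrative only
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Drop a single column by passing its name as a string
one_dropped = df.drop(columns='a')

# Drop several columns by passing a list of names
two_dropped = df.drop(columns=['a', 'b'])

print(list(one_dropped.columns))  # ['b', 'c']
print(list(two_dropped.columns))  # ['c']
print(list(df.columns))           # ['a', 'b', 'c'] (original is unchanged)
```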

### example

In [8]:
# Remove duplicate length and mass measurements
penguins = penguins.drop(columns=['flipper_length_mm','body_mass_g'])

# Confirm result
print(penguins.columns)

Index(['id_code', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'sex', 'year', 'body_mass_kg', 'flipper_length_cm', 'observer'],
      dtype='object')


## updating values

sometimes we want to update specific values in our data frame.

### a single value

access a single value in a `pd.DataFrame` using the locators:

- `at[]` to select by labels, or
- `iat[]` to select by position.

The syntax for `at[]`:

```
df.at[single_value_index, 'column_name']
```

think of `at[]` as the equivalent of `loc[]` when we want to access a single value
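
To make the comparison concrete, here is a small sketch on a toy data frame indexed by made-up ID codes. Both locators take a row label and a column label and return the same single value:

```python
import pandas as pd

# Toy data frame indexed by made-up ID codes
df = pd.DataFrame({'bill_length_mm': [39.1, 39.5, 40.3]},
                  index=[754, 214, 125])

# Both locators select by row label and column label
value_at = df.at[214, 'bill_length_mm']
value_loc = df.loc[214, 'bill_length_mm']
print(value_at == value_loc)  # True

# at[] can also be used to assign a single value
df.at[214, 'bill_length_mm'] = 39.6
print(df.loc[214, 'bill_length_mm'])  # 39.6
```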

### example

first, update the index of the data frame to be the `id_code` column

In [9]:
penguins = penguins.set_index('id_code')
penguins

id_code,species,island,bill_length_mm,bill_depth_mm,sex,year,body_mass_kg,flipper_length_cm,observer
754,Adelie,Torgersen,39.1,18.7,male,2007,3.750,18.1,C
214,Adelie,Torgersen,39.5,17.4,female,2007,3.800,18.6,A
125,Adelie,Torgersen,40.3,18.0,female,2007,3.250,19.5,C
859,Adelie,Torgersen,,,,2007,,,A
381,Adelie,Torgersen,36.7,19.3,female,2007,3.450,19.3,B
...,...,...,...,...,...,...,...,...,...
140,Chinstrap,Dream,55.8,19.8,male,2009,4.000,20.7,C
183,Chinstrap,Dream,43.5,18.1,female,2009,3.400,20.2,A
969,Chinstrap,Dream,49.6,18.2,male,2009,3.775,19.3,C
635,Chinstrap,Dream,50.8,19.0,male,2009,4.100,21.0,A


what was the bill length of the penguin with ID number 127?

In [10]:
# Check bill length of penguin with ID 127
penguins.at[127, 'bill_length_mm']

38.2

In [11]:
# update the bill length of penguin with ID 127
penguins.at[127, 'bill_length_mm'] = 38.3

# confirm value was updated
penguins.loc[127]

species              Adelie
island               Biscoe
bill_length_mm         38.3
bill_depth_mm          18.1
sex                    male
year                   2007
body_mass_kg           3.95
flipper_length_cm      18.5
observer                  B
Name: 127, dtype: object

if we want to access or update a single value by position we use the `iat[]` locator:

```
df.iat[index_integer_location, column_integer_location]
```

we can dynamically get the location of a single column this way:

```
df.columns.get_loc('column_name')
```
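
Putting the two pieces together, a minimal sketch on a toy data frame (the values and ID codes are made up for illustration) looks up a column's integer position and then uses it with `iat[]`:

```python
import numpy as np
import pandas as pd

# Toy data frame; values and ID codes are illustrative only
df = pd.DataFrame({'species': ['Adelie', 'Gentoo'],
                   'bill_length_mm': [39.1, 46.1]},
                  index=[754, 214])

# Integer position of the column, counting from 0
col_pos = df.columns.get_loc('bill_length_mm')
print(col_pos)  # 1

# Update the value in the first row (position 0) of that column
df.iat[0, col_pos] = np.nan

# Inspect the first row by position
print(df.iloc[0])
```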

## check-in
a. obtain the location of the `bill_length_mm` column

b. use `iat[]` to access the same bill length value for *your* penguin and revert it to an NA value. confirm your update using `iloc[]`.

In [12]:
bill_length_index = penguins.columns.get_loc("bill_length_mm")

In [13]:
penguins.iat[3, bill_length_index] = np.nan
penguins.iloc[3]

species                 Adelie
island               Torgersen
bill_length_mm             NaN
bill_depth_mm              NaN
sex                        NaN
year                      2007
body_mass_kg               NaN
flipper_length_cm          NaN
observer                     A
Name: 859, dtype: object

## update multiple values in a column

what if we want to update multiple values in a column?

### using a condition

this approach is useful when we need to create a new column whose values depend on conditions on another column

#### example

we want to classify the penguins such that:

- penguins with body mass < 3 kg are small,
- penguins with 3 kg <= body mass < 5 kg are medium,
- penguins with 5 kg <= body mass are large

we can add this info to a new column with the `numpy.select()` function:

In [14]:
# create a list with the conditions
conditions = [penguins.body_mass_kg < 3,
             (3 <= penguins.body_mass_kg) & (penguins.body_mass_kg < 5),
             5 <= penguins.body_mass_kg]

# create a list with the choices
choices = ['small', 'medium', 'large']

# add the selections using np.select
penguins['size'] = np.select(conditions,
                            choices,
                            default = np.nan) # value for anything outside conditions
# Display the updated data frame to confirm the new column
penguins.head()

id_code,species,island,bill_length_mm,bill_depth_mm,sex,year,body_mass_kg,flipper_length_cm,observer,size
754,Adelie,Torgersen,39.1,18.7,male,2007,3.75,18.1,C,medium
214,Adelie,Torgersen,39.5,17.4,female,2007,3.8,18.6,A,medium
125,Adelie,Torgersen,40.3,18.0,female,2007,3.25,19.5,C,medium
859,Adelie,Torgersen,,,,2007,,,A,
381,Adelie,Torgersen,36.7,19.3,female,2007,3.45,19.3,B,medium
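
As an alternative sketch (not the approach used in this lesson), the same three size classes can be expressed as bins with `pd.cut()`. The toy masses below are made up to hit the bin edges. Setting `right=False` makes each bin include its left edge, which matches the `<=` and `<` conditions listed earlier:

```python
import numpy as np
import pandas as pd

# Toy body masses in kg, chosen to sit on and around the bin edges
mass_kg = pd.Series([2.9, 3.0, 4.99, 5.0, np.nan])

# Bins are [-inf, 3), [3, 5), [5, inf) because right=False
size = pd.cut(mass_kg,
              bins=[-np.inf, 3, 5, np.inf],
              labels=['small', 'medium', 'large'],
              right=False)

print(size.tolist())  # ['small', 'medium', 'medium', 'large', nan]
```

Missing masses stay missing in the result, and the output is a categorical column rather than plain strings.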


## by selecting values and then updating

we can update some values in a column by selecting this data using `loc` (if selecting by labels) or `iloc` (if selecting by position). the general syntax for updating with `loc` is:

```
df.loc[row_selection, column_name] = new_values
```

where:

- `row_selection`: the rows we want to update; this can be any expression that gives us a boolean `pandas.Series`,
- `column_name`: a single column name,
- `new_values`: the new value or values we want. if using multiple values, then `new_values` must have the same length as the number of rows selected

using `loc[]` assignment modifies the data frame directly, without the need for reassignment
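
The two forms of `new_values` can be sketched on a toy data frame (column names and values are made up for illustration): a single value is applied to every selected row, while a list must supply one value per selected row.

```python
import pandas as pd

# Toy data frame; names and values are illustrative only
df = pd.DataFrame({'sex': ['male', 'female', 'male', None],
                   'mass_kg': [3.75, 3.80, 3.25, 3.45]})

# Boolean Series marking the rows to update (two rows are True)
is_male = df['sex'] == 'male'

# A single value is applied to every selected row
df.loc[is_male, 'sex'] = 'M'

# A list must have one value per selected row (here, two)
df.loc[is_male, 'mass_kg'] = [3.8, 3.3]

print(df)
```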

### example

update the `male` values in the sex column to `'M'`

In [15]:
# select rows with sex=male and simplify values in 'sex' column
penguins.loc[penguins.sex=='male', 'sex'] = 'M'

# check changes in sex column specifically
print(penguins.sex.unique())

['M' 'female' nan]


### best practices

we want to similarly update the 'female' values in the sex column to 'F'. we might try to do it this way:


In [16]:
# select rows where 'sex' is 'female' and attempt to update values
penguins[penguins.sex=='female']['sex'] = 'F'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  penguins[penguins.sex=='female']['sex'] = 'F'


the use of double brackets `[][]` is called **chained indexing**

when we select the data we want to update using chained indexing instead of `loc[]` we get a `SettingWithCopyWarning`

the bug this warning is alerting us to is that our data frame was not updated:

```
penguins['sex'].unique()
array(['M', 'female', nan], dtype = object)
```

**Avoid chained `[][]` and use `.loc[]` instead.** This warning often arises from chained indexing.

update the 'female' values in the penguins data frame without using chained indexing. confirm that the values are updated

In [17]:
# no chained indexing = no warning

penguins.loc[penguins.sex=='female', 'sex'] = 'F'

penguins['sex'].unique()

array(['M', 'F', nan], dtype=object)

In [18]:
penguins

id_code,species,island,bill_length_mm,bill_depth_mm,sex,year,body_mass_kg,flipper_length_cm,observer,size
754,Adelie,Torgersen,39.1,18.7,M,2007,3.750,18.1,C,medium
214,Adelie,Torgersen,39.5,17.4,F,2007,3.800,18.6,A,medium
125,Adelie,Torgersen,40.3,18.0,F,2007,3.250,19.5,C,medium
859,Adelie,Torgersen,,,,2007,,,A,
381,Adelie,Torgersen,36.7,19.3,F,2007,3.450,19.3,B,medium
...,...,...,...,...,...,...,...,...,...,...
140,Chinstrap,Dream,55.8,19.8,M,2009,4.000,20.7,C,medium
183,Chinstrap,Dream,43.5,18.1,F,2009,3.400,20.2,A,medium
969,Chinstrap,Dream,49.6,18.2,M,2009,3.775,19.3,C,medium
635,Chinstrap,Dream,50.8,19.0,M,2009,4.100,21.0,A,medium


The `SettingWithCopyWarning` comes up because some pandas operations return a view of your data, while others return a copy of your data.

- **Views** are actual subsets of the original data. When we update them, we update the original data frame.
- **Copies** are unique objects, independent of our original data frame. When we update a copy we are not modifying the original data frame.

In [19]:
# select penguins from biscoe island
biscoe = penguins[penguins.island=='Biscoe']
#...other analyses...

# add column
biscoe['sample_column'] = 100 # this raises SettingWithCopyWarning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


to fix this we can **take control of the copy-view situation** and explicitly ask for a copy of the dataset when subsetting the data. use the `copy()` method to do this:

In [20]:
biscoe = penguins[penguins.island=='Biscoe'].copy()

# add column
biscoe['sample_column'] = 100  # no warning: biscoe is an independent copy

biscoe.head()

id_code,species,island,bill_length_mm,bill_depth_mm,sex,year,body_mass_kg,flipper_length_cm,observer,size,sample_column
338,Adelie,Biscoe,37.8,18.3,F,2007,3.4,17.4,A,medium,100
617,Adelie,Biscoe,37.7,18.7,M,2007,3.6,18.0,C,medium,100
716,Adelie,Biscoe,35.9,19.2,F,2007,3.8,18.9,C,medium,100
127,Adelie,Biscoe,38.3,18.1,M,2007,3.95,18.5,B,medium,100
674,Adelie,Biscoe,38.8,17.2,M,2007,3.8,18.0,C,medium,100


In [21]:
print('sample_column' in penguins.columns)

False
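
If the intent were instead to change the original data frame for the Biscoe rows only, one option is to assign through `loc[]` on the original rather than on a subset. This is a sketch on a toy data frame with made-up values, not part of the lesson's workflow:

```python
import pandas as pd

# Toy data frame; values are illustrative only
df = pd.DataFrame({'island': ['Biscoe', 'Dream', 'Biscoe'],
                   'mass_kg': [3.4, 3.6, 3.8]})

# Assign on the original data frame, restricted to the Biscoe rows.
# The new column is created and rows outside the selection get NaN.
df.loc[df['island'] == 'Biscoe', 'sample_column'] = 100

print(df)
```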
