16 October 2023

# Updating Data Frames  
https://carmengg.github.io/eds-220-book/lectures/lesson-5-updating-dataframes.html

#### Data: Palmer Penguins

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import random

In [4]:
penguins = sns.load_dataset('penguins') #import data from seaborn

penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Adding a Column

```
df['new_col_name'] = new_col_values
```

`new_col_values` can be
- `pd.Series` or `numpy.array` of the same length as the data frame
- a single scalar (single number or string)
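
As a minimal sketch of these options, the block below uses a small made-up data frame (not the penguins data) and hypothetical column names. It also shows two behaviors worth knowing: a `pd.Series` is matched to rows by index label rather than by position, and an array of the wrong length raises an error.

```python
import numpy as np
import pandas as pd

# toy data frame standing in for penguins (values are made up)
df = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0]})

# a scalar is repeated for every row
df["units"] = "grams"

# a numpy array needs one value per row
df["rank"] = np.array([2, 1, 3])

# a pd.Series is aligned on the index labels, not on position
df["note"] = pd.Series(["c", "a", "b"], index=[2, 0, 1])

# an array of the wrong length raises a ValueError
try:
    df["bad"] = np.array([1, 2])
except ValueError as err:
    print("ValueError:", err)

print(df)
```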

**Example**  
We want to create a new column where body mass is in kilograms instead of grams.

In [13]:
penguins['body_mass_kg'] = penguins.body_mass_g/1000

print('body_mass_kg' in penguins.columns) # returns True if this col is in the df

penguins.head()

True


Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25
3,Adelie,Torgersen,,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45


### Adding a column in a specific place
To create a column and insert at a specific position, we use `insert()`

```
df.insert(loc = integer_index,      # location of new column
          column = 'new_col_name',
          value = new_col_values)
```

In [14]:
# create random 3 digit codes
codes = random.sample(range(100,1000), len(penguins))

# insert codes at the front of data frame = index 0
penguins.insert(loc=0, 
                column = 'code',
                value = codes)
        
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
0,384,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75
1,757,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8
2,208,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25
3,109,Adelie,Torgersen,,,,,,
4,974,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45
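
Two details of `insert()` are easy to miss: it modifies the data frame in place (so there is nothing to reassign), and by default it refuses to insert a column whose name already exists. The sketch below illustrates both on a made-up two-row data frame.

```python
import pandas as pd

# toy data frame (values are made up)
df = pd.DataFrame({"species": ["Adelie", "Gentoo"],
                   "island": ["Torgersen", "Biscoe"]})

# insert() works in place and returns None
result = df.insert(loc=1, column="code", value=[101, 202])
print(result)
print(list(df.columns))

# inserting a column name that already exists raises a ValueError
try:
    df.insert(loc=0, column="code", value=[1, 2])
except ValueError as err:
    print("ValueError:", err)
```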


### Adding multiple columns 

We can assign multiple columns in the same call by using the data frame’s `assign()` method. It returns a new data frame rather than modifying the original.

```
df.assign( new_col1_name = new_col1_values, 
           new_col2_name = new_col2_values)
```

In [15]:
# create new columns in the data frame
# random.choices used for random sampling with replacement
# need to reassign output of assign() to update the data frame
penguins = penguins.assign(flipper_length_cm = penguins.flipper_length_mm / 10,
                           observer = random.choices(['A','B','C'], k=len(penguins)))
# look at result
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg,flipper_length_cm,observer
0,384,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75,18.1,A
1,757,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8,18.6,C
2,208,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25,19.5,C
3,109,Adelie,Torgersen,,,,,,,,B
4,974,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45,19.3,A
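
The need to reassign the output of `assign()` can be checked on a small made-up data frame, as sketched below. The sketch also passes a function to `assign()`, which lets a later column be computed from one created earlier in the same call; the column name `flipper_length_m` is hypothetical.

```python
import pandas as pd

# toy data frame (values are made up)
df = pd.DataFrame({"flipper_length_mm": [181.0, 186.0, 195.0]})

# assign() returns a new data frame; df itself is left unchanged
updated = df.assign(
    flipper_length_cm=df.flipper_length_mm / 10,
    # a callable receives the data frame built so far,
    # so it can use the column created on the previous line
    flipper_length_m=lambda d: d.flipper_length_cm / 100,
)

print(list(df.columns))       # original columns only
print(list(updated.columns))  # original plus the two new columns
```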


## Removing columns

We can remove columns using the `drop()` method for data frames.

```
df = df.drop(columns = col_names)
```

In [16]:
# use a list of column names
# reassign output of drop() to dataframe to update it
penguins = penguins.drop(columns=['flipper_length_mm','body_mass_g'])

# check columns
print(penguins.columns)

Index(['code', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'sex',
       'body_mass_kg', 'flipper_length_cm', 'observer'],
      dtype='object')
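
A short sketch of how `drop()` behaves, on a made-up data frame: the original is untouched unless the result is reassigned, a column name that does not exist raises a `KeyError`, and `errors="ignore"` skips missing names instead. The name `not_a_column` is hypothetical.

```python
import pandas as pd

# toy data frame (values are made up)
df = pd.DataFrame({"code": [384, 757],
                   "species": ["Adelie", "Adelie"],
                   "body_mass_g": [3750.0, 3800.0]})

# drop() returns a new data frame; df still has all three columns
trimmed = df.drop(columns=["body_mass_g"])
print(list(trimmed.columns))
print(list(df.columns))

# dropping a column that does not exist raises a KeyError
try:
    df.drop(columns=["not_a_column"])
except KeyError as err:
    print("KeyError:", err)

# errors="ignore" silently skips names that are not in the data frame
same = df.drop(columns=["not_a_column"], errors="ignore")
```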


## Updating values

### A single value

We can access a single value in a `pd.DataFrame` using the locators  
- `at[]` to select by labels, or
- `iat[]` to select by position

```
df.at[single_index_value, 'column_name']
```

In [17]:
# access value at row with index=3 and column='bill_length_mm'
penguins.at[3,'bill_length_mm']

nan

In [18]:
# update NA to 38.3
penguins.at[3,'bill_length_mm'] = 38.3

# check it was updated
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,flipper_length_cm,observer
0,384,Adelie,Torgersen,39.1,18.7,Male,3.75,18.1,A
1,757,Adelie,Torgersen,39.5,17.4,Female,3.8,18.6,C
2,208,Adelie,Torgersen,40.3,18.0,Female,3.25,19.5,C
3,109,Adelie,Torgersen,38.3,,,,,B
4,974,Adelie,Torgersen,36.7,19.3,Female,3.45,19.3,A
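
The example above uses `at[]`. As a sketch of the difference from `iat[]`, the made-up data frame below has index labels 10, 11, 12 so that labels and positions do not coincide.

```python
import numpy as np
import pandas as pd

# toy data frame whose index labels differ from row positions
df = pd.DataFrame({"species": ["Adelie", "Adelie", "Adelie"],
                   "bill_length_mm": [39.1, np.nan, 40.3]},
                  index=[10, 11, 12])

# at[] uses the row label and the column name
df.at[11, "bill_length_mm"] = 38.3

# iat[] uses integer positions for both row and column
col_pos = df.columns.get_loc("bill_length_mm")
value = df.iat[1, col_pos]   # second row, same cell as label 11
print(value)
```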


### Multiple values in a column

#### By condition
 
Often we want to create a new column where the new values depend on conditions on another column’s values.

Suppose we want to classify all penguins with body mass less than 3 kg as small, penguins with body mass greater than or equal to 3 kg but less than 5 kg as medium, and those with body mass greater than or equal to 5 kg as large. One way to add this information in a new column is to use `numpy.select()`.

In [19]:
# create a list with the conditions
conditions = [penguins.body_mass_kg < 3, 
              (3 <= penguins.body_mass_kg) & (penguins.body_mass_kg < 5),
              5 <= penguins.body_mass_kg]

# create a list with the choices
choices = ["small",
           "medium",
           "large"]

# add the selections using np.select
# default = value for anything that falls outside conditions
penguins['size'] = np.select(conditions, choices, default=np.nan)

penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,flipper_length_cm,observer,size
0,384,Adelie,Torgersen,39.1,18.7,Male,3.75,18.1,A,medium
1,757,Adelie,Torgersen,39.5,17.4,Female,3.8,18.6,C,medium
2,208,Adelie,Torgersen,40.3,18.0,Female,3.25,19.5,C,medium
3,109,Adelie,Torgersen,38.3,,,,,B,
4,974,Adelie,Torgersen,36.7,19.3,Female,3.45,19.3,A,medium
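
A different technique for the same binning, not the one used in the lesson, is `pd.cut()`. It is sketched below on a made-up column as a point of comparison. It leaves missing body masses as missing values, which sidesteps mixing string choices with a float default such as `np.nan` (how `np.select()` handles that mix may depend on the NumPy version; this is an assumption worth checking in your environment).

```python
import numpy as np
import pandas as pd

# toy column of body masses in kg, including a missing value
df = pd.DataFrame({"body_mass_kg": [2.9, 3.0, 3.75, 5.0, np.nan]})

# right=False makes each bin closed on the left: [-inf, 3), [3, 5), [5, inf)
df["size"] = pd.cut(df.body_mass_kg,
                    bins=[-np.inf, 3, 5, np.inf],
                    right=False,
                    labels=["small", "medium", "large"])

print(df)
```

The resulting column has a categorical dtype, so the labels keep their small, medium, large order.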


#### By selecting values

When we only want to update some values in a column, we can select those values using `loc` (if selecting by labels) or `iloc` (if selecting by position) and assign to the selection.

```
# modifies data in place
df.loc[row_selection, col_name] = new_values
```

- `row_selection` is the rows we want to update,
- `col_name` is a single column name, and
- `new_values` is the new value or values we want. If using multiple values, then `new_values` must have the same length as the number of rows selected.

In [20]:
# select rows with sex == 'Male' and update the values in the sex column
penguins.loc[penguins.sex=='Male', 'sex'] = 'M'

# check changes
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,flipper_length_cm,observer,size
0,384,Adelie,Torgersen,39.1,18.7,M,3.75,18.1,A,medium
1,757,Adelie,Torgersen,39.5,17.4,Female,3.8,18.6,C,medium
2,208,Adelie,Torgersen,40.3,18.0,Female,3.25,19.5,C,medium
3,109,Adelie,Torgersen,38.3,,,,,B,
4,974,Adelie,Torgersen,36.7,19.3,Female,3.45,19.3,A,medium
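
The example above assigns a single value to every selected row. A sketch of the other two cases, on a made-up data frame: assigning several values at once with `loc`, and updating by position with `iloc`.

```python
import pandas as pd

# toy data frame (values are made up)
df = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
                   "observer": ["A", "B", "C", "A"]})

# two rows are selected, so we supply exactly two new values
mask = df.island == "Biscoe"
df.loc[mask, "observer"] = ["X", "Y"]

# iloc selects by position: first row, position of the observer column
df.iloc[0, df.columns.get_loc("observer")] = "Z"

print(df)
```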


## A note
- **Views are actual subsets of the original data. When we update a view, we are modifying the original data frame.**
- **Copies are unique objects, independent of our original data frames. When we update a copy, we are not modifying the original data frame.**

In [24]:
# select penguins from Biscoe island
biscoe = penguins[penguins.island=='Biscoe']

# 50 lines of code here

# add a column, we get a warning
biscoe['sample_col'] = 100

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  biscoe['sample_col'] = 100


To fix this, we can take control of the copy-view situation and explicitly ask for a copy of the dataset when subsetting the data. Use the `copy()` method to do this:

In [25]:
# make sure you get a new data frame with penguins from Biscoe island
biscoe = penguins[penguins.island=='Biscoe'].copy()

# add a column, no warning
biscoe['sample_col'] = 100

biscoe.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,flipper_length_cm,observer,size,sample_col
20,819,Adelie,Biscoe,37.8,18.3,F,3.4,17.4,B,medium,100
21,550,Adelie,Biscoe,37.7,18.7,M,3.6,18.0,A,medium,100
22,338,Adelie,Biscoe,35.9,19.2,F,3.8,18.9,B,medium,100
23,885,Adelie,Biscoe,38.2,18.1,M,3.95,18.5,A,medium,100
24,232,Adelie,Biscoe,38.8,17.2,M,3.8,18.0,B,medium,100
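
To confirm that an explicit copy is independent of the original, here is a small sketch with made-up data. The names `penguins_toy` and `biscoe_toy` are hypothetical stand-ins for the data frames above.

```python
import pandas as pd

# toy stand-in for the penguins data frame (values are made up)
penguins_toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Biscoe"],
                             "body_mass_kg": [3.4, 3.6, 3.8]})

# explicit copy of the subset
biscoe_toy = penguins_toy[penguins_toy.island == "Biscoe"].copy()

# changes to the copy: a new column and an updated value
biscoe_toy["sample_col"] = 100
biscoe_toy.loc[0, "body_mass_kg"] = 9.9

# the original has no new column and keeps its old value
print(list(penguins_toy.columns))
print(penguins_toy.at[0, "body_mass_kg"])
```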


The `SettingWithCopyWarning` can be tricky: it produces both false positives and false negatives. Avoiding chained indexing and making a copy of your data frame subset whenever possible will save you from the usual pitfalls.
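
As a sketch of what chained indexing looks like in an assignment, the made-up example below first selects rows with `[]` and then assigns to a column of that intermediate result. The assignment lands on a temporary object, so the original data frame is not updated. Warnings are silenced here only to keep the output short; the exact warning raised may differ between pandas versions.

```python
import warnings
import pandas as pd

# toy data frame (values are made up)
df = pd.DataFrame({"island": ["Biscoe", "Dream", "Biscoe"]})

# chained indexing: df[...] creates a temporary object, then ['island'] assigns to it
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df.island == "Biscoe"]["island"] = "B"
after_chained = df.island.tolist()   # unchanged

# a single loc call selects rows and column together and updates df
df.loc[df.island == "Biscoe", "island"] = "B"
after_loc = df.island.tolist()

print(after_chained)
print(after_loc)
```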

## Practice Check-In

Update the “Female” values in the penguins data frame to “F”. Don’t use chained indexing.

In [23]:
# no chained indexing in assignment = no warning
penguins.loc[penguins.sex=='Female','sex'] = 'F'

# notice the values were updated now
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,flipper_length_cm,observer,size
0,384,Adelie,Torgersen,39.1,18.7,M,3.75,18.1,A,medium
1,757,Adelie,Torgersen,39.5,17.4,F,3.8,18.6,C,medium
2,208,Adelie,Torgersen,40.3,18.0,F,3.25,19.5,C,medium
3,109,Adelie,Torgersen,38.3,,,,,B,
4,974,Adelie,Torgersen,36.7,19.3,F,3.45,19.3,A,medium
