# Modifying Data in DataFrames
Updating rows, columns, and cells.

In [79]:
import pandas as pd
import numpy as np
from IPython.display import display

In [80]:
people = {
    "first": ["Lorem", "John", "Jane"],
    "last": ["Ipsum", "Doe", "Doe"],
    "email": ["lorem@yahoo.com", "john@gmail.com", "jane@outlook.com"],
}
df = pd.DataFrame(people)
display(df)

Unnamed: 0,first,last,email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


---
## Assigning Columns
We can assign or rename columns in two ways:  
- Using the `.columns` attribute
- Using the `.rename()` method

---

---
### Assigning Columns Using The .columns Attribute 
We can assign or rename columns by assigning an iterable, whose length equals the number of columns, to the DataFrame's `columns` attribute.  
**Note:** this modifies the original df in-place.
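
The length requirement can be checked with a minimal sketch like the one below. The `demo` frame and `mismatch_error` name are hypothetical and separate from the notebook's `df`; the sketch assumes pandas raises a `ValueError` when the number of new labels differs from the number of columns.

```python
import pandas as pd

# Hypothetical three-column frame, separate from the notebook's df
demo = pd.DataFrame(
    {"first": ["Lorem"], "last": ["Ipsum"], "email": ["lorem@yahoo.com"]}
)

mismatch_error = None
try:
    # Only 2 labels for 3 columns, so the assignment is rejected
    demo.columns = ["first_name", "last_name"]
except ValueError as err:
    mismatch_error = err
    print(f"ValueError: {err}")

# The rejected assignment leaves the existing labels in place
print(list(demo.columns))
```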

---

In [81]:
# Renaming all columns
print(f"n of columns: {df.shape[1]}")
display(df)
df.columns = ["first_name", "last_name", "email"]
display(df)

n of columns: 3


Unnamed: 0,first,last,email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


Unnamed: 0,first_name,last_name,email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


In [82]:
# Title case for columns using list comprehension.
display(df)
df.columns = [c.title() for c in df.columns]
display(df)

Unnamed: 0,first_name,last_name,email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


Unnamed: 0,First_Name,Last_Name,Email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


---
### Assigning Columns Using The .rename() Method 
We can rename columns by passing a dict of any length to the `columns` parameter of the `.rename()` method. Each key is the name of a column to be replaced and each value is the new column name, e.g.    
{"old_colname": "new_colname", "old2": "new2"}  

**Note:** this does not modify the original df; it returns a new DataFrame. Pass `inplace=True` to modify the original df.
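
As a rough illustration of the note above, the sketch below calls `.rename()` without `inplace=True` and keeps the returned DataFrame. The `demo` and `renamed` names are hypothetical. It also relies on the default behavior where keys that match no column are silently ignored.

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's df
demo = pd.DataFrame({"First_Name": ["Lorem"], "Last_Name": ["Ipsum"]})

# "Missing" matches no column, so it is skipped by default
renamed = demo.rename(columns={"First_Name": "first", "Missing": "ignored"})

print(list(demo.columns))     # original labels are unchanged
print(list(renamed.columns))  # only First_Name was renamed
```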

---

In [83]:
# Changing column names back to original
display(df)

# Note that we can change any number of columns.
replacement = {
    "First_Name": "first",
    "Last_Name": "last",
}
df.rename(columns=replacement, inplace=True)
display(df)

# Change Email column back to email for consistency
df.rename(columns={"Email": "email"}, inplace=True)
display(df)


Unnamed: 0,First_Name,Last_Name,Email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


Unnamed: 0,first,last,Email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


Unnamed: 0,first,last,email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


---
## Updating Row and Column Data
We can modify data by selecting the cell(s) and assigning desired data to it.

**Note:** row data here is **not** the "row names". Row names in pandas are the `indices`; refer to `pandas_03` for working with indices.

---

---
### Updating Multiple Cells
Generally, we can update values by selecting the DataFrame or Series and assigning the new values on it.

The behavior of the assignment depends on whether the assigned value is an iterable or a single (scalar) value:
- `a.` Assigning an iterable whose length matches the selected cells assigns each item to the corresponding cell.
  
- `b.` Assigning a single scalar object assigns that object to all the selected cells. A string such as "Dog" counts as a scalar here, even though Python strings are iterable.  

**Note:** this modifies the original df in-place.
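
Both behaviors also apply when the cells are selected with a boolean mask instead of a label. The following is a minimal sketch under that assumption; `demo`, `is_doe`, and the replacement values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's df
demo = pd.DataFrame(
    {
        "first": ["Lorem", "John", "Jane"],
        "last": ["Ipsum", "Doe", "Doe"],
        "email": ["lorem@yahoo.com", "john@gmail.com", "jane@outlook.com"],
    }
)

# Boolean mask selecting the two rows whose last name is "Doe"
is_doe = demo["last"] == "Doe"

# a. Iterable whose length matches the 2 selected rows
demo.loc[is_doe, "email"] = ["john@smith.com", "jane@smith.com"]

# b. Single scalar, assigned to every selected cell
demo.loc[is_doe, "last"] = "Smith"

print(demo)
```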

---

In [84]:
# Assigning a Series to a column

# Turn all emails to uppercase using the pandas string method
display(df)
df["email"] = df["email"].str.upper()
display(df)

Unnamed: 0,first,last,email
0,Lorem,Ipsum,lorem@yahoo.com
1,John,Doe,john@gmail.com
2,Jane,Doe,jane@outlook.com


Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,John,Doe,JOHN@GMAIL.COM
2,Jane,Doe,JANE@OUTLOOK.COM


In [85]:
# a. Assigning row cells using an iterable
display(df)
df.loc[1] = ["Foo", "Bar", "foobar@email.com"]
display(df)

Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,John,Doe,JOHN@GMAIL.COM
2,Jane,Doe,JANE@OUTLOOK.COM


Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,Foo,Bar,foobar@email.com
2,Jane,Doe,JANE@OUTLOOK.COM


In [86]:
# b. Assigning a single object to row cells. 
display(df)
df.loc[2] = "Dog"
display(df)

Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,Foo,Bar,foobar@email.com
2,Jane,Doe,JANE@OUTLOOK.COM


Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,Foo,Bar,foobar@email.com
2,Dog,Dog,Dog


In [87]:
# Assigning n-length values
display(df)
df.loc[2, ["last", "email"]] = ["Doe", "jane@gmail.com"]
display(df)

Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,Foo,Bar,foobar@email.com
2,Dog,Dog,Dog


Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,Foo,Bar,foobar@email.com
2,Dog,Doe,jane@gmail.com


---
### Updating Single Cells
Functionally, both `at` and `loc` can update a single cell, but `at` is generally faster than `loc` for this. `at` can only select a single cell, so both a row indexer and a column indexer are needed, while `loc` can select one or multiple cells.

**Note:** this modifies the original df in-place.
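
Outside a notebook, where the `%timeit` magic is unavailable, the comparison could be sketched with the standard-library `timeit` module. The `demo` frame and the iteration count are arbitrary choices for illustration, and the measured times will vary by machine and pandas version, so no particular result is assumed here.

```python
import timeit

import pandas as pd

# Hypothetical frame, separate from the notebook's df
demo = pd.DataFrame(
    {"email": ["lorem@yahoo.com", "john@gmail.com", "jane@outlook.com"]}
)

# Time 1000 single-cell lookups with each accessor
loc_seconds = timeit.timeit(lambda: demo.loc[1, "email"], number=1000)
at_seconds = timeit.timeit(lambda: demo.at[1, "email"], number=1000)

print(f"loc: {loc_seconds:.4f}s, at: {at_seconds:.4f}s for 1000 lookups")
```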

---

In [88]:
# Uncomment this to see performance comparison

# %timeit df.loc[1, "email"]
# %timeit df.at[1, "email"]

In [89]:
display(df)

df.at[2, "first"] = "Jane"
display(df)

df.at[0, "email"] = df.loc[0, "email"].lower()
display(df)

Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,Foo,Bar,foobar@email.com
2,Dog,Doe,jane@gmail.com


Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,Foo,Bar,foobar@email.com
2,Jane,Doe,jane@gmail.com


Unnamed: 0,first,last,email
0,Lorem,Ipsum,lorem@yahoo.com
1,Foo,Bar,foobar@email.com
2,Jane,Doe,jane@gmail.com


---
### Updating Data Using Methods
We can use built-in pandas methods to create a Series or DataFrame, which we can then use to update our DataFrame.

Commonly used methods:    
- `a.` apply()
- `b.` map()
- `c.` applymap() (deprecated since pandas 2.1 in favor of `DataFrame.map()`)
- `d.` replace()  

**Note:** these methods do not modify the original df.
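
Since `map()` and `replace()` are listed but not yet demonstrated, here is a minimal sketch of how the two differ when given the same dict on a Series. The `last` Series and the "Smith" substitution are hypothetical. With a dict, `map()` turns values that have no matching key into NaN, while `replace()` leaves them as they are.

```python
import pandas as pd

# Hypothetical Series, separate from the notebook's df
last = pd.Series(["Ipsum", "Doe", "Doe"], name="last")

# map(): values without a matching key ("Ipsum") become NaN
mapped = last.map({"Doe": "Smith"})

# replace(): values without a matching key are left untouched
replaced = last.replace({"Doe": "Smith"})

print(mapped)
print(replaced)
print(last)  # the original Series is unchanged by both calls
```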

---

---
### apply()
`apply()` is a Series or DataFrame method.

It takes a function as its argument and applies that function to the object it is called on. If the function takes additional arguments, you can supply them through the `args` parameter (positional) or as keyword arguments to `apply()`.
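
A minimal sketch of forwarding extra arguments through `apply()` follows. The `add_domain` function, its parameters, and the `users` Series are hypothetical names introduced only for this illustration.

```python
import pandas as pd


# Hypothetical helper: the first parameter receives each Series value,
# the remaining ones are supplied through apply()
def add_domain(user: str, domain: str, upper: bool = False) -> str:
    email = f"{user}@{domain}"
    return email.upper() if upper else email


users = pd.Series(["lorem", "john", "jane"])

# args supplies extra positional arguments; upper is forwarded as a keyword
emails = users.apply(add_domain, args=("example.com",), upper=True)
print(emails)
```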

If `apply()` is used on a Series, the function passed will be applied to all the values in the Series.  

If used on a DataFrame, the function is applied to each Series in the DataFrame, **NOT** to the individual values. These Series can be either the columns or the rows, as specified by the `axis` parameter. NumPy functions are commonly passed to `apply()` because they can operate on a whole Series.

The `axis` parameter is specific to DataFrames and determines whether the Series passed to the function are vertical or horizontal. Pass "rows" or 0 to pass each vertical Series (each column), or pass "columns" or 1 to pass each horizontal Series (each row). The default is `axis=0`.

**Note:** `apply()` does not modify the original df.

---

---
### Axes Quick Reference

`axis="rows" or axis=0`  
(Default)  
Refers to the vertical axis, or "rows": the function receives each column as a Series.

`axis="columns" or axis=1`  
Refers to the horizontal axis, or "columns": the function receives each row as a Series.
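
The two axes can be compared side by side with a small sketch. The `demo` frame (3 rows, 2 columns) and the result names are hypothetical; the shape is chosen so that the two results differ visibly.

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
demo = pd.DataFrame(
    {"first": ["Lorem", "John", "Jane"], "last": ["Ipsum", "Doe", "Doe"]}
)

# axis=0: len() receives each column, which holds 3 values
per_column = demo.apply(len, axis=0)

# axis=1: len() receives each row, which holds 2 values
per_row = demo.apply(len, axis=1)

# axis=1 also lets a function combine values from the same row
full_name = demo.apply(lambda row: f"{row['first']} {row['last']}", axis=1)

print(per_column)
print(per_row)
print(full_name)
```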

In [90]:
# Using apply() on a Series
# Example 1

# User-defined function
def to_upper(s: str):
    return s.upper()


display(df)

# Using apply() on a Series (the email column) to make str values uppercase.
# This applies our user-defined function to_upper()
# to every value in the Series.
a = df["email"].apply(to_upper)
display(a)

# If we want to apply these changes to the df,
# assign the updated values back to the email Series.
df["email"] = a
display(df)

# Revert.
df["email"] = df["email"].str.lower()

Unnamed: 0,first,last,email
0,Lorem,Ipsum,lorem@yahoo.com
1,Foo,Bar,foobar@email.com
2,Jane,Doe,jane@gmail.com


0     LOREM@YAHOO.COM
1    FOOBAR@EMAIL.COM
2      JANE@GMAIL.COM
Name: email, dtype: object

Unnamed: 0,first,last,email
0,Lorem,Ipsum,LOREM@YAHOO.COM
1,Foo,Bar,FOOBAR@EMAIL.COM
2,Jane,Doe,JANE@GMAIL.COM


In [93]:
# Using apply() on a Series
# Example 2

# Let us see what happens if we pass the len() function
# to the apply() method on the "first" column.
a = df["first"].apply(len)
display(a)

# Makes sense. It returned the number of characters
# in each first name.

0    5
1    3
2    4
Name: first, dtype: int64

In [102]:
# Using apply() on a DataFrame

# On "vertical" Series (i.e. axis="rows")(default)
# Let us apply the same len() function to a DataFrame
# with axis="rows" (or axis=0).
a = df.apply(len, axis="rows")
display(a)

# Note how apply() returns a Series with the values
# 3, 3, 3 instead of the character counts of all
# the cells in the DataFrame.
# This is because len() is not applied to the individual cells,
# but to each Series of the DataFrame. In this case, those are
# the vertical Series ("rows" axis), i.e. the columns, and each
# one holds 3 values.

# AXES NAMING CONVENTION STILL FEELS COUNTER-INTUITIVE FOR ME.
# TODO Continue on apply() and the other methods.

first    3
last     3
email    3
dtype: int64

[RangeIndex(start=0, stop=3, step=1),
 Index(['first', 'last', 'email'], dtype='object')]
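
If the goal is the character count of every cell rather than the length of each Series, one possible approach is to call `Series.map()` on each column from inside `DataFrame.apply()`. This is a sketch, not the notebook's own continuation; `demo` and `cell_lengths` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's df
demo = pd.DataFrame({"first": ["Lorem", "Foo"], "last": ["Ipsum", "Bar"]})

# apply() hands each column to the lambda, and Series.map() then
# applies len() to every value in that column
cell_lengths = demo.apply(lambda col: col.map(len))
print(cell_lengths)
```

This combination avoids depending on `applymap()`, which newer pandas versions deprecate.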