<img src=images/gdd-logo.png width=300px align=right> 

# Creating New Columns

Often you will want to create a new column so that you can use it later in your analysis.

This notebook covers:

* [Creating new columns: avoid common bad practice](#bad-pract)
* [Using `.assign()` to create new columns](#assign)
    * [<mark>Exercise: Make new weight columns</mark>](#ex-weight)
* [Finding the differences in a column](#diff)
    * [<mark>Exercise: Calculate the differences</mark>](#ex-diff)
* [Other methods](#other-methods)
    * [Shifting values](#shifting)
    * [Renaming columns](#rename)

First of all, let's load Pandas and the dataset again:

In [None]:
import pandas as pd

chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns')
)

<a id='bad-pract'></a>
## Creating new columns: avoid common bad practice

Say you want to create a new column where the weight is doubled.

You could use the assignment operator (`=`) to create a new column, as seen below.

In [None]:
chickweight['weight_doubled'] = chickweight['weight'] * 2

In [None]:
chickweight.head()

However, adding columns like this is considered bad practice, as you have modified the original dataframe.

**Code should always perform in the same way regardless of where it is in the project**

Imagine in your analysis you were using the code below to find the max of the second-last column (`'chick'`) of the dataframe:

In [None]:
chickweight.iloc[:,-2].max()

<mark>**Question:** How is `.iloc[]` different to `.loc[]`? </mark>

<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
- `.loc[]` is label-based. This means it makes selections based on the row/column labels you provide.
- `.iloc[]` is integer position-based. This means you specify rows and columns by their integer position values (0-based integer position).

</details>

If someone else picks up your code and doesn't realise the original dataframe was overwritten, they may get a different result.

In [None]:
chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns')
)

In [None]:
chickweight.iloc[:,-2].max()
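
The underlying issue is that position-based selection depends on the column order at the moment the code runs. The sketch below reproduces the effect on a small made-up dataframe (the `toy` name and its values are illustrative assumptions, not the real chickweight data) and shows that label-based selection is not affected:

```python
import pandas as pd

# Hypothetical toy data standing in for chickweight
toy = pd.DataFrame({
    'weight': [42, 51, 59],
    'chick': [1, 1, 2],
    'diet': [1, 1, 2],
})

# Position-based: the second-last column is 'chick'
before = toy.iloc[:, -2].name

# Modifying the dataframe in place appends a column at the end...
toy['weight_doubled'] = toy['weight'] * 2

# ...so the same positional code now selects 'diet' instead
after = toy.iloc[:, -2].name

# Label-based selection still returns the 'chick' column
by_label = toy.loc[:, 'chick'].name

print(before, after, by_label)  # chick diet chick
```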

This kind of thing can lead to Pandas frustration...

<img src='images/04_Creating_Columns/panda.gif' width='300px' align='left'>

To avoid this, you should never overwrite your dataframe.

You should also avoid things like:

In [None]:
chickweight_temp = chickweight.copy()
chickweight_temp['weight_doubled'] = chickweight_temp['weight'] * 2

In this case you didn't overwrite the dataframe, but eventually you will end up with too many versions of a dataframe, which is memory-inefficient and confusing.

<mark>***So what's the answer?***</mark>

<a id='assign'></a>
## Using `.assign()` to create new columns
You can tell pandas to make a new column with `.assign()`, and specify **how** to calculate it with a lambda function.

In [None]:
(
    chickweight
    .assign(weight_doubled = lambda df: df['weight'] * 2)
).head()

Note that the original dataframe is unchanged.

In [None]:
chickweight.head()
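
The same behaviour can be checked on a small made-up dataframe. This is only a sketch; `toy` and `doubled` are illustrative names, not part of the notebook's data:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'weight': [42, 51, 59]})

# .assign() returns a new dataframe with the extra column
doubled = toy.assign(weight_doubled=lambda df: df['weight'] * 2)

print(list(toy.columns))      # ['weight']
print(list(doubled.columns))  # ['weight', 'weight_doubled']
```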

Want to assign two columns at the same time? No problem!

In [None]:
(
    chickweight
    .assign(weight_doubled = lambda df: df['weight'] * 2, 
            weight_quadrupled = lambda df: df['weight_doubled'] * 2)
).head()
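
This works because, in current pandas versions, the keyword arguments of `.assign()` are evaluated in the order they are written, so a later lambda can use a column created earlier in the same call. A minimal sketch on made-up data (the `toy` dataframe is an illustrative assumption):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'weight': [42, 51, 59]})

result = toy.assign(
    weight_doubled=lambda df: df['weight'] * 2,
    # This lambda receives a dataframe that already contains weight_doubled
    weight_quadrupled=lambda df: df['weight_doubled'] * 2,
)

print(result['weight_quadrupled'].tolist())  # [168, 204, 236]
```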

<mark>**Question:** What are the benefits of using a lambda instead of using `weight_doubled = chickweight['weight'] * 2`?</mark>

<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
- The code works with any dataframe that has a `weight` column.
- If you decide to rename the `chickweight` dataframe, you don't need to change your code.

</details>
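
A related point: the lambda receives the dataframe as it exists at that step of the chain, not the original variable. The sketch below illustrates this on made-up data (the `toy` dataframe and the `weight_share` column are hypothetical names chosen for the example). The filter runs first, so the sum inside the lambda only covers the remaining rows:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'weight': [42, 51, 59], 'diet': [1, 1, 2]})

result = (
    toy
    .loc[lambda df: df['diet'] == 1]  # keep only the rows with diet 1
    # df here is the filtered dataframe, so the shares add up to 1
    .assign(weight_share=lambda df: df['weight'] / df['weight'].sum())
)

print(len(result), result['weight_share'].sum())
```

Writing `toy['weight'].sum()` instead would use all three rows, and the shares would no longer add up to 1.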

Note that you can also drop columns if required!

In [None]:
(
    chickweight
    .drop(columns = ['rownum', 'time'])
).head()

<a id=ex-weight></a>

### <mark>Exercise: Make new weight columns</mark>

1. Assuming that the chick weights are in grams, can you add a column that gives the chick weights in kg?
2. In the same `.assign()`, also add the chick weights in pounds.

*1000 g = 1 kg = 2.205 pounds*

In [None]:
# %load answers/04_Creating_Columns/new-column.py

<a id='diff'></a>

## Finding the differences in a column

Imagine you want to investigate the rate of growth for different diets. The `.diff()` method will come in handy here: it calculates the difference between consecutive rows.

In [None]:
chickweight['weight'].diff()
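
To see what `.diff()` does on a handful of values, here is a minimal sketch with a made-up series (the numbers are illustrative, not taken from the dataset). The first row has no previous row to compare with, so its result is `NaN`:

```python
import pandas as pd

# Hypothetical weights, only for illustration
weights = pd.Series([42, 51, 59, 64])

# Each value minus the value in the previous row
differences = weights.diff()

print(differences.tolist())  # [nan, 9.0, 8.0, 5.0]
```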

<a id='ex-diff'></a>

### <mark>Exercise: Calculate the differences</mark>

1. Add a new column called `diff` to chickweight containing the differences in weight.

2. Take a look at the first 15 rows of the resulting DataFrame. What issue do you see?

3. To fix this, you need to calculate the difference *per chicken*. In other words: inside your `.assign()`, you should group by `chick` and then apply `.diff()` to the `weight` column.

<details>
    
  <summary><span style="color:blue">Show hint</span></summary>
    
The syntax of your lambda function should look like this:
  
`lambda df: df.groupby('col')['col2'].diff()`

</details>

4. Let's see what you could do to remove the missing values (NaN).

   Use the following two methods after the assign. What does each one do?

   1. `.dropna()`
   2. `.fillna(0)`

In [None]:
# %load answers/04_Creating_Columns/differences.py

<details>
    
  <summary><span style="color:blue">Show answers to 2 and 4</span></summary>

*Exercise 2*
    
The differences are not calculated per chick, so the first difference for chick 2 is -165 (first weight of chick 2 minus last weight of chick 1).
    
*Exercise 4*
- `.dropna()` removes all rows that contain a missing value
- `.fillna(0)` fills missing values with 0

</details>
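
The pattern from this exercise can be sketched on a small made-up dataframe. The `group`, `value`, and `change` names are hypothetical and chosen so the example does not depend on the chickweight data:

```python
import pandas as pd

# Hypothetical toy data with two groups of two rows each
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [10, 15, 3, 7],
})

# Differences are calculated within each group,
# so the first row of every group is NaN
with_diff = toy.assign(
    change=lambda df: df.groupby('group')['value'].diff()
)

dropped = with_diff.dropna()  # removes the rows that contain NaN
filled = with_diff.fillna(0)  # keeps all rows, replaces NaN with 0

print(len(dropped))               # 2
print(filled['change'].tolist())  # [0.0, 5.0, 0.0, 4.0]
```

Both methods look at every column by default. If other columns may also contain missing values, `.dropna(subset=['change'])` restricts the check to one column.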

<a id='other-methods'></a>

## Other Methods

The following methods can also be useful.

<a id='shifting'></a>
### Shifting values

If you're interested in shifting the values in a column up or down, you can use the `.shift()` method.

In [None]:
chickweight['weight'].shift(3)

The shifted values can of course also be added as a column using the `.assign()` method.

In [None]:
(
    chickweight
    .assign(weight_shifted = lambda df: df['weight'].shift(3))
).head(15)
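
A positive argument shifts values down and a negative one shifts them up. The sketch below uses a made-up series (illustrative values only) and also shows how `.shift()` relates to `.diff()`: subtracting the series shifted by one row gives the same result as `.diff()`.

```python
import pandas as pd

# Hypothetical weights, only for illustration
weights = pd.Series([42, 51, 59, 64])

down = weights.shift(1)   # values move one row down, first row becomes NaN
up = weights.shift(-1)    # values move one row up, last row becomes NaN

# Current value minus previous value is what .diff() calculates
manual_diff = weights - weights.shift(1)

print(manual_diff.equals(weights.diff()))  # True
```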

<a id='rename'></a>
### Renaming columns

The `.rename()` method can be used to rename your columns. 

In [None]:
(
    chickweight
    .rename(columns = {"chick": "chicken_id"})
    .head(3)
)

In [None]:
(
    chickweight
    .rename(str.upper, axis = "columns")
    .head(3)
)
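
One thing to watch out for when renaming with a dictionary: by default, keys that do not match any column are silently ignored. The sketch below shows this on made-up data (the `toy` dataframe and the misspelled key are assumptions for illustration) and uses the `errors='raise'` option to surface the mistake instead:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'chick': [1, 2], 'weight': [42, 40]})

# 'wieght' is a deliberate typo: by default it is silently ignored
renamed = toy.rename(columns={'chick': 'chicken_id', 'wieght': 'weight_g'})
print(list(renamed.columns))  # ['chicken_id', 'weight']

# errors='raise' makes pandas raise a KeyError for unknown labels
try:
    toy.rename(columns={'wieght': 'weight_g'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(raised)  # True
```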

## Conclusions

You have now seen `.assign()` as the recommended way of creating new columns. Dataframes are mutable objects, so take care when creating new columns or making other changes that you don't accidentally modify the original dataframe.

You have also seen some extra methods like:
- `.diff()`: Find the difference between rows
- `.shift()`: Shift the data row-wise
- `.rename()`: Rename columns