# Data Journalism Cheat Sheet: Recoding

Damian Trilling and Penny Sheets

This notebook summarizes some approaches to recoding data in pandas dataframes.

In [1]:
import pandas as pd

In [2]:
# get example data: the titanic dataset
import seaborn as sns
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## 1. Recoding using a function

Sometimes, we want to apply a function that maps each value to another one. For instance, we could use the function `int` to convert a value to an integer, or `len` to calculate the length of something.

In [3]:
df['adult_male_as_int'] = df['adult_male'].map(int)

We don't get any output, but we could now inspect `df` again and see that we have a new column.
NB: We could of course also overwrite the original column, but it's often more convenient and less confusing to create a new column.

We could also define our own function:

In [4]:
def is_adult(x):      # you could extend this to distinguish more categories of course
    if x >= 18:
        return True   # you could also return something else, for instance "adult" ...
    else:
        return False  # ... and "child"

df['adult'] = df['age'].map(is_adult)
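
One thing to keep in mind: a comparison with a missing value evaluates to `False`, so the function above codes passengers with an unknown age as not adult. Below is a minimal sketch of a variant that keeps missing values missing. It uses a small made-up series instead of the real `age` column, and the name `is_adult_or_missing` is hypothetical:

```python
import pandas as pd

# made-up ages standing in for df['age']; None is stored as NaN (missing)
ages = pd.Series([22.0, 38.0, None, 4.0])

def is_adult_or_missing(x):
    if pd.isna(x):
        return None      # keep unknown ages unknown
    return x >= 18

adult = ages.map(is_adult_or_missing)
```

Whether missing ages should stay missing or count as "not adult" depends on your analysis. The point is to make that choice deliberately.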

You may have seen the `lambda` keyword in other examples. It allows us to define "throwaway" functions: unnamed functions that we only want to use once. You essentially say: take the value from the cell, call it `x` (or anything else, the name is arbitrary), and do the following with it. For instance:

In [5]:
df['adult2'] = df['age'].map(lambda x: x >= 18)  # same result as above

In [6]:
df['fare_in_euros'] = df['fare'].map(lambda x: x * 233.8)   # assuming a (fictitious) exchange rate of 1:233.8
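
For simple arithmetic like this, pandas also lets you operate on the whole column at once, without `.map()`. Here is a small sketch with made-up fares (not the real data) and the same fictitious exchange rate:

```python
import pandas as pd

fares = pd.Series([7.25, 71.2833, 7.925])   # made-up stand-in for df['fare']

via_map = fares.map(lambda x: x * 233.8)    # element by element, as above
via_column = fares * 233.8                  # the whole column at once
```

Both give the same numbers. `.map()` becomes useful once the recoding is more involved than a single arithmetic operation.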

## 2. Recoding using a mapping

Sometimes, especially for nominal or categorical variables, you just want to rename values, or merge a couple of them. You can then pass a dictionary to `.map()`, and `.map()` will look up the original value (the key) and replace it with the corresponding value:

In [7]:
my_mapping = {'First': 'high',
              'Second': 'low',
              'Third': 'low'}

In [8]:
df['class_dichotomized'] = df['class'].map(my_mapping)

In [9]:
# or:
class_to_numbers = {'First':1,
                   'Second':2,
                   'Third':3}
df['class_as_int'] = df['class'].map(class_to_numbers)
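
One caveat: when you pass a dictionary to `.map()`, any value that does not appear as a key becomes missing (`NaN`). Here is a minimal sketch with a made-up series and a deliberately incomplete mapping (the names `classes` and `incomplete_mapping` are just for illustration):

```python
import pandas as pd

classes = pd.Series(['First', 'Second', 'Third'])       # made-up stand-in for df['class']
incomplete_mapping = {'First': 'high', 'Third': 'low'}  # 'Second' is left out on purpose

recoded = classes.map(incomplete_mapping)
n_missing = recoded.isna().sum()   # number of values that could not be mapped
```

Counting the missing values after recoding is a quick way to check whether your mapping covers every category.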

## 3. Recoding into bins

To recode a continuous variable into bins, use `pd.cut`. Given a number, it splits the observed range into that many equally wide bins:

In [10]:
df['age_in_four_categories'] = pd.cut(df['age'],4)

In [11]:
df['age_in_four_categories'].value_counts(sort=False)

(0.34, 20.315]     179
(20.315, 40.21]    385
(40.21, 60.105]    128
(60.105, 80.0]      22
Name: age_in_four_categories, dtype: int64
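
If the automatically computed boundaries above are not meaningful for your story, you can pass your own bin edges and labels instead. Here is a sketch with made-up ages. The cut-off points and label names are assumptions chosen purely for illustration:

```python
import pandas as pd

ages = pd.Series([4.0, 17.0, 18.0, 35.0, 64.0, 80.0])   # made-up stand-in for df['age']

# bins include their right edge by default: (0, 17], (17, 64], (64, 120]
age_groups = pd.cut(ages,
                    bins=[0, 17, 64, 120],
                    labels=['child', 'adult', 'senior'])
```

Because each bin includes its right edge, a 17-year-old falls into `'child'` and an 18-year-old into `'adult'` here.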

## In short:

Using a function, you can do essentially any recoding, as specific as you want it to be. But there are shortcuts for frequently occurring tasks, such as renaming categories or binning.