## Transforming

Data scientists transform dataframe columns when they need to change each value in a feature in the same way, for example, converting heights from feet to centimeters.
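As a minimal sketch of the heights example, the snippet below converts a small, made-up series of heights from feet to centimeters. The values and the names `heights_ft` and `CM_PER_FOOT` are assumptions for illustration and are not part of the baby names data.

```python
import pandas as pd

# Hypothetical toy data: these heights are made up for illustration.
heights_ft = pd.Series([5.0, 5.5, 6.0], name='height_ft')

CM_PER_FOOT = 30.48  # one foot is defined as exactly 30.48 cm

# Multiplying a Series by a scalar applies the same change to every value.
heights_cm = heights_ft * CM_PER_FOOT
print(heights_cm)
```

Every value is changed by the same rule, which is what makes this a transformation rather than a filter or an aggregation.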

In [1]:
import pandas as pd

In [2]:
baby = pd.read_csv('chapter6/babynames.txt')
baby

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020
2,Oliver,M,14147,2020
3,Elijah,M,13034,2020
4,William,M,12541,2020
...,...,...,...,...
2020717,Ula,F,5,1880
2020718,Vannie,F,5,1880
2020719,Verona,F,5,1880
2020720,Vertie,F,5,1880


The NYT article makes two claims:
1. Names starting with the letter L or K became popular after 2000.
2. Names starting with the letter J peaked in popularity in the 1970s and 1980s and have dropped off in popularity since.

We can verify these claims using the ```baby``` dataset:
1. Transform the ```Name``` column into a new column that contains the first letter of each value in ```Name```.
2. Group the dataframe by the first letter and year.
3. Aggregate the name counts by summing.

In [4]:
names = baby['Name']

To extract the first letter of each name, we define a function and pass it into ```.apply()```:

In [5]:
def first_letter(string):
    return string[0]

names.apply(first_letter)

0          L
1          N
2          O
3          E
4          W
          ..
2020717    U
2020718    V
2020719    V
2020720    V
2020721    W
Name: Name, Length: 2020722, dtype: object
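As an aside, pandas also offers the `.str` accessor, which can index into every string in a series without a helper function. The sketch below uses a hypothetical three-name series (`toy_names`) in place of the full ```Name``` column and checks that both approaches give the same letters.

```python
import pandas as pd

# Hypothetical toy data standing in for baby['Name'].
toy_names = pd.Series(['Liam', 'Noah', 'Ula'], name='Name')

def first_letter(string):
    return string[0]

# Approach used in this section: call a Python function on each value.
via_apply = toy_names.apply(first_letter)

# Alternative: index into each string through the .str accessor.
via_str = toy_names.str[0]

# Both approaches produce the same first letters.
print(via_apply.tolist() == via_str.tolist())
```

Which one to prefer is a judgment call: ```.apply()``` works with any Python function, while `.str` covers common string operations more concisely.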

Now we can assign the first letters to a new column in the dataframe. The keyword argument name, ```Firsts```, becomes the **name of the new column**:

In [6]:
letters = baby.assign(Firsts=names.apply(first_letter))
letters

Unnamed: 0,Name,Sex,Count,Year,Firsts
0,Liam,M,19659,2020,L
1,Noah,M,18252,2020,N
2,Oliver,M,14147,2020,O
3,Elijah,M,13034,2020,E
4,William,M,12541,2020,W
...,...,...,...,...,...
2020717,Ula,F,5,1880,U
2020718,Vannie,F,5,1880,V
2020719,Verona,F,5,1880,V
2020720,Vertie,F,5,1880,V
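One detail worth noting is that ```.assign()``` returns a new dataframe and leaves the original unchanged, which is why the result is stored under a new name. The sketch below illustrates this on a hypothetical two-row dataframe (`toy`) rather than the full dataset.

```python
import pandas as pd

# Hypothetical two-row dataframe used only for illustration.
toy = pd.DataFrame({'Name': ['Liam', 'Noah'], 'Count': [19659, 18252]})

def first_letter(string):
    return string[0]

# The keyword name (Firsts) becomes the new column's label.
with_firsts = toy.assign(Firsts=toy['Name'].apply(first_letter))

print(list(toy.columns))          # the original dataframe has no Firsts column
print(list(with_firsts.columns))  # the returned copy does
```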


Now we can use the ```letters``` dataframe to see the popularity of first letters over time:

In [7]:
letters_counts = (
    letters.groupby(['Firsts', 'Year'])
    ['Count']
    .sum()
    .reset_index()
)
letters_counts

Unnamed: 0,Firsts,Year,Count
0,A,1880,16740
1,A,1881,16257
2,A,1882,18790
3,A,1883,18831
4,A,1884,20898
...,...,...,...
3636,Z,2016,54679
3637,Z,2017,54495
3638,Z,2018,55996
3639,Z,2019,55293
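To see what each step of the chained expression does, the following sketch runs the same grouping on a hypothetical four-row dataframe whose sums are easy to check by hand. The rows and the names `toy`, `grouped`, and `flat` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows: two "L" entries in 2020, one "N" in 2020, one "L" in 2019.
toy = pd.DataFrame({
    'Firsts': ['L', 'L', 'N', 'L'],
    'Year': [2020, 2020, 2020, 2019],
    'Count': [10, 5, 7, 3],
})

# Group by letter and year, then sum the counts within each group.
# The result is a Series indexed by (Firsts, Year) pairs.
grouped = toy.groupby(['Firsts', 'Year'])['Count'].sum()

# reset_index() turns the group labels back into ordinary columns,
# which is the shape the plotting code expects.
flat = grouped.reset_index()
print(flat)
```

The two "L" rows for 2020 collapse into a single row with a count of 15, while the other groups keep their single values.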


In [16]:
import plotly.express as px

fig = px.line(letters_counts.loc[letters_counts['Firsts'] == 'L'],
              x='Year', y='Count', title='Popularity of "L" names',
              width=350, height=250)
fig.update_layout(margin=dict(t=50))
fig.show()

The plot shows that "L" names were popular in the 1960s, dipped in the decades after, but have indeed resurged in popularity since 2000.

What about "J" names?

In [17]:
fig = px.line(letters_counts.loc[letters_counts['Firsts'] == 'J'],
              x='Year', y='Count', title='Popularity of "J" names',
              width=350, height=250)
fig.update_layout(margin=dict(t=50))
fig.show()

The NYT article says that "J" names were popular in the 1970s and 1980s. The plot agrees and shows that they have become less popular since 2000.

#### The Price of Apply
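Using ```.apply()``` for numeric calculations is much slower than using vectorised operations directly on ```pd.Series``` objects. In the timings below, the ```.apply()``` version takes roughly 70 times as long as the vectorised one.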

In [18]:
%%timeit

# Calculate the decade using vectorised operators
baby['Year'] // 10 * 10

3.4 ms ± 140 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
%%timeit

def decade(yr):
    return yr // 10 * 10

# Calculate the decade using apply
baby['Year'].apply(decade)

240 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
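Outside a notebook, where the `%%timeit` magic is unavailable, a similar comparison can be sketched with the standard library's `timeit` module. The series below is synthetic (the years 1880 through 2020 repeated 500 times) and only stands in for ```baby['Year']```, so the timings it prints will differ from the ones above and will vary by machine.

```python
import timeit

import pandas as pd

# Synthetic stand-in for baby['Year']: the years 1880-2020 repeated 500 times.
years = pd.Series(list(range(1880, 2021)) * 500)

def decade(yr):
    return yr // 10 * 10

# Both approaches should compute the same decades.
vectorised = years // 10 * 10
applied = years.apply(decade)
print((vectorised == applied).all())

# Time five runs of each approach and report the totals.
t_vec = timeit.timeit(lambda: years // 10 * 10, number=5)
t_apply = timeit.timeit(lambda: years.apply(decade), number=5)
print(f'vectorised: {t_vec:.4f} s, apply: {t_apply:.4f} s')
```

Checking that the two results match before timing them guards against comparing the speed of two computations that do different things.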
