# UBC
## Programming in Python for DS
### Week 3
Instructor: Socorro Dominguez-Vidana

- [ ] Explain what **tidy data** is.
- [ ] Use `.melt()` and `.pivot()`.
- [ ] Learn how to reset a data frame's index.
- [ ] Combine data frames using `.merge()` and `.concat()`.

## Tidy Data

- The concept stems from a [paper](https://vita.had.co.nz/papers/tidy-data.pdf) written by Hadley Wickham in 2014.
- We tidy our data to create a standard across multiple analysis tools. 

Rules for Tidy Data:

* each row is a single observation,
* each column is a single variable, and
* each value is a single cell (i.e., its entry in the data frame is not shared with another value).

**Note:** Tidy data will also be defined in the context of the problem that you are solving.

![](https://datasciencebook.ca/_main_files/figure-html/02-tidy-image-1.png)

Or:

![](https://datasciencebook.ca/_main_files/figure-html/02-obs-1.png)
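
The third rule (one value per cell) is easy to violate without noticing. Below is a minimal sketch on a made-up table, where the column name `cases_per_pop` and all values are hypothetical. It shows one way to split a shared cell into two variables with pandas string methods.

```python
import pandas as pd

# Hypothetical toy data: 'cases_per_pop' packs two values into one cell
untidy = pd.DataFrame({
    'country': ['A', 'B'],
    'cases_per_pop': ['10/1000', '25/5000'],
})

# Split the shared cell so that each value gets its own cell and column
parts = untidy['cases_per_pop'].str.split('/', expand=True).astype(int)
tidy = untidy[['country']].assign(cases=parts[0], population=parts[1])
print(tidy)
```

After the split, each row is still one observation (a country), and `cases` and `population` are separate variables.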

## What to do when data is not tidy...

Data often comes in tables that are laid out to be easy for humans to read.

- However, that layout is not necessarily the best format for computation.

Most common scenarios:
- We need to convert from Wide to Long tables
- We need to convert from Long to Wide tables

## Wide to Long

## Goal

![](https://datasciencebook.ca/img/wrangling/pivot_functions.001.png)

A common task:
- Get data into a tidy format by combining values that are stored in separate columns, but are really part of the same variable, into one.
- Data is often stored this way because this format is sometimes more intuitive for human readability.

## From Long to Wide

![](https://datasciencebook.ca/img/wrangling/pivot_functions.003.png)

- Sometimes we have observations spread across multiple rows rather than in a single row.
- In the previous figure, the table on the left is in an untidy, long format because the count column contains three variables (*population*, *commuter count*, and *year the city was incorporated*) and information about each observation (here, *population*, *commuter*, and *incorporated* values for a region) is split across three rows.
- **Remember:** one of the criteria for tidy data is that each observation must be in a single row.



### How to Manipulate Long and Wide Tables In Python

### Melt

> wide to long format

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)

![](https://pandas.pydata.org/docs/_images/reshaping_melt.png)

In [1]:
import pandas as pd

cereal = pd.read_csv('https://raw.githubusercontent.com/UBC-MDS/programming-in-python-for-data-science/master/data/cereal.csv')
cereal = cereal.drop(labels=['type', 'shelf', 'weight', 'cups', 'rating'], axis=1)
cereal.head()

Unnamed: 0,name,mfr,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins
0,100% Bran,N,70,4,1,130,10.0,5.0,6,280,25
1,100% Natural Bran,Q,120,3,5,15,2.0,8.0,8,135,0
2,All-Bran,K,70,4,1,260,9.0,7.0,5,320,25
3,All-Bran with Extra Fiber,K,50,4,0,140,14.0,8.0,0,330,25
4,Almond Delight,R,110,2,2,200,1.0,14.0,8,1,25


In [2]:
cereal[cereal['name'] == '100% Bran']

Unnamed: 0,name,mfr,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins
0,100% Bran,N,70,4,1,130,10.0,5.0,6,280,25


In [4]:
# Keep 'name' and 'mfr' as identifiers; every other column becomes a (variable, value) pair
cereal_long = cereal.melt(id_vars=['name', 'mfr'])
cereal_long[cereal_long['name'] == '100% Bran']

Unnamed: 0,name,mfr,variable,value
0,100% Bran,N,calories,70.0
77,100% Bran,N,protein,4.0
154,100% Bran,N,fat,1.0
231,100% Bran,N,sodium,130.0
308,100% Bran,N,fiber,10.0
385,100% Bran,N,carbo,5.0
462,100% Bran,N,sugars,6.0
539,100% Bran,N,potass,280.0
616,100% Bran,N,vitamins,25.0
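
The default column names `variable` and `value` are not very descriptive. As a small sketch on a made-up table (the `region` and year columns are hypothetical, not part of the cereal data), the `var_name` and `value_name` arguments of `.melt()` can name the new columns directly:

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame({
    'region': ['North', 'South'],
    '2020': [100, 200],
    '2021': [110, 210],
})

# 'region' identifies each observation; the year columns are stacked into rows
long = wide.melt(id_vars='region', var_name='year', value_name='population')
print(long)
```

Each original row contributes one long row per melted column, so the 2 x 2 block of year values becomes 4 rows.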


### Pivot

> long to wide

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html)


![](https://pandas.pydata.org/docs/_images/reshaping_pivot.png)

In [5]:
lego = pd.read_csv('https://raw.githubusercontent.com/UBC-MDS/programming-in-python-for-data-science/master/data/lego_untidy.csv')
lego.head()

Unnamed: 0,set_num,name,year,lego_info,value
0,00-1,Weetabix Castle,1970,theme_id,414
1,00-1,Weetabix Castle,1970,num_parts,471
2,00-2,Weetabix Promotional House 1,1976,num_parts,147
3,00-2,Weetabix Promotional House 1,1976,theme_id,413
4,00-3,Weetabix Promotional House 2,1976,num_parts,149


In [6]:
lego_tidy = lego.pivot(index='set_num',
                       columns='lego_info',
                       values='value')
lego_tidy

set_num,num_parts,theme_id
00-1,471,414
00-2,147,413
00-3,149,413
00-4,126,413
00-6,3,67
...,...,...
tominifigs-1,2,50
trucapam-1,71,598
tsuper-1,3,12
vwkit-1,22,366


In [9]:
lego_tidy = lego.pivot(index=['set_num', 'name'],
                       columns='lego_info',
                       values='value')
lego_tidy.head()

set_num,name,num_parts,theme_id
00-1,Weetabix Castle,471,414
00-2,Weetabix Promotional House 1,147,413
00-3,Weetabix Promotional House 2,149,413
00-4,Weetabix Promotional Windmill,126,413
00-6,Special Offer,3,67


In [10]:
lego_tidy2 = lego_tidy.reset_index()
lego_tidy2

lego_info,set_num,name,num_parts,theme_id
0,00-1,Weetabix Castle,471,414
1,00-2,Weetabix Promotional House 1,147,413
2,00-3,Weetabix Promotional House 2,149,413
3,00-4,Weetabix Promotional Windmill,126,413
4,00-6,Special Offer,3,67
...,...,...,...,...
11668,tominifigs-1,Town Minifig Packs 2-Pack,2,50
11669,trucapam-1,Captain America Mosaic,71,598
11670,tsuper-1,Technic Super Set,3,12
11671,vwkit-1,Volkswagen Kit,22,366
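
After `.reset_index()`, the columns axis still carries the name `lego_info`, which is why it appears in the header above. The sketch below uses a small made-up long table (the `set_num` and `name` values are hypothetical) to show the pivot, the index reset, and one way to drop the leftover axis name with `.rename_axis(columns=None)`.

```python
import pandas as pd

# Hypothetical long table with the same layout as the lego data
long = pd.DataFrame({
    'set_num': ['a-1', 'a-1', 'b-1', 'b-1'],
    'name': ['Castle', 'Castle', 'House', 'House'],
    'lego_info': ['num_parts', 'theme_id', 'num_parts', 'theme_id'],
    'value': [471, 414, 147, 413],
})

# Pivot: 'set_num' and 'name' become a two-level row index
wide = long.pivot(index=['set_num', 'name'], columns='lego_info', values='value')

# Move the index levels back into ordinary columns and clear the columns-axis name
flat = wide.reset_index().rename_axis(columns=None)
print(flat)
```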


In [11]:
# What if I want to use only the name?
lego.pivot(index='name',
           columns='lego_info',
           values='value')

ValueError: Index contains duplicate entries, cannot reshape

## What duplicates?

### How do I find them?

In [13]:
# Count how many sets share each name
lego_tidy.reset_index().value_counts('name')

name
Basic Building Set                                     55
Universal Building Set                                 32
Helicopter                                             23
Basic Set                                              23
Fire Station                                           14
                                                       ..
FIRST LEGO League Challenge 2010 - Body Forward v46     1
FIRST LEGO League Challenge 2011 - Food Factor          1
FIRST LEGO League Challenge 2012 - Senior Solutions     1
FIRST LEGO League Challenge 2013 - Nature's Fury        1
{Yellow Cab}                                            1
Name: count, Length: 10370, dtype: int64
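
Another way to locate the offending rows is `.duplicated()`, checked on the same columns that the pivot would use as index and column labels. This is a minimal sketch on a six-row sample built by hand from values shown in this notebook, not on the full data set:

```python
import pandas as pd

# Hand-built sample: two sets share the name 'Basic Building Set'
long = pd.DataFrame({
    'set_num': ['010-3', '010-3', '011-1', '011-1', '00-1', '00-1'],
    'name': ['Basic Building Set'] * 4 + ['Weetabix Castle'] * 2,
    'lego_info': ['num_parts', 'theme_id'] * 3,
    'value': [77, 366, 145, 366, 471, 414],
})

# keep=False flags every row whose (name, lego_info) pair occurs more than once
clashes = long[long.duplicated(subset=['name', 'lego_info'], keep=False)]
print(clashes)

# Pivoting on 'name' alone fails because those pairs are not unique
try:
    long.pivot(index='name', columns='lego_info', values='value')
except ValueError as err:
    print(err)
```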

In [14]:
bbset = lego[lego['name'] == 'Basic Building Set']
bbset

Unnamed: 0,set_num,name,year,lego_info,value
30,010-3,Basic Building Set,1968,num_parts,77
31,010-3,Basic Building Set,1968,theme_id,366
32,011-1,Basic Building Set,1968,num_parts,145
33,011-1,Basic Building Set,1968,theme_id,366
34,022-1,Basic Building Set,1968,num_parts,110
...,...,...,...,...,...
15727,730-2,Basic Building Set,1985,num_parts,432
15842,735-1,Basic Building Set,1990,num_parts,538
15843,735-1,Basic Building Set,1990,theme_id,467
15846,740-1,Basic Building Set,1985,num_parts,530


In [15]:
bbset.pivot(index=['set_num', 'name'],
            columns='lego_info',
            values='value').head()

set_num,name,num_parts,theme_id
010-3,Basic Building Set,77,366
011-1,Basic Building Set,145,366
022-1,Basic Building Set,110,366
033-2,Basic Building Set,177,366
044-1,Basic Building Set,225,366


What if I do not care about the `set_num`, just about the `name`?

### Alternative: Use `pivot_table`

In [16]:
bbset.head()

Unnamed: 0,set_num,name,year,lego_info,value
30,010-3,Basic Building Set,1968,num_parts,77
31,010-3,Basic Building Set,1968,theme_id,366
32,011-1,Basic Building Set,1968,num_parts,145
33,011-1,Basic Building Set,1968,theme_id,366
34,022-1,Basic Building Set,1968,num_parts,110


In [17]:
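# Unlike .pivot(), .pivot_table() aggregates duplicate (name, lego_info) pairs, here with the mean.
# Note: this is applied to the full `lego` table, not only to `bbset`.
bbset2 = lego.pivot_table(index='name',
                          columns='lego_info',
                          values='value',
                          aggfunc='mean').reset_index()
bbset2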

lego_info,name,num_parts,theme_id
0,1 stud Blue Storage Brick,0.0,501.0
1,Scenery and Dagger Trap polybag,25.0,435.0
2,Spectre,7.0,552.0
3,White Spaceman Key Chain,0.0,503.0
4,'Where Are My Pants?' Guy,6.0,549.0
...,...,...,...
10365,{Red Race Car Number 3},39.0,82.0
10366,{Roadplates and Scenery},85.0,533.0
10367,{Rock Saw Vehicle},22.0,442.0
10368,{Town Vehicles},158.0,533.0


In [18]:
bbset2[bbset2['name']=="Basic Building Set"]

lego_info,name,num_parts,theme_id
1663,Basic Building Set,216.672727,454.363636


In [19]:
bbset

Unnamed: 0,set_num,name,year,lego_info,value
30,010-3,Basic Building Set,1968,num_parts,77
31,010-3,Basic Building Set,1968,theme_id,366
32,011-1,Basic Building Set,1968,num_parts,145
33,011-1,Basic Building Set,1968,theme_id,366
34,022-1,Basic Building Set,1968,num_parts,110
...,...,...,...,...,...
15727,730-2,Basic Building Set,1985,num_parts,432
15842,735-1,Basic Building Set,1990,num_parts,538
15843,735-1,Basic Building Set,1990,theme_id,467
15846,740-1,Basic Building Set,1985,num_parts,530


In [20]:
bbset.shape

(110, 5)

In [21]:
bbset.pivot(index=['set_num', 'name'],
            columns='lego_info',
            values='value').reset_index()['num_parts'].mean()

216.6727272727273

In [22]:
bbset.pivot(index=['set_num', 'name'],
            columns='lego_info',
            values='value').reset_index()['theme_id'].mean()

454.3636363636364
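
The two checks above match the `pivot_table` row for "Basic Building Set", which suggests that `aggfunc='mean'` simply averages the duplicated entries. The sketch below reproduces the idea on four hand-copied rows and compares it with an explicit `groupby`. The variable names are illustrative. Averaging an identifier such as `theme_id` is rarely meaningful, so the aggregation function should be chosen with the variable in mind.

```python
import pandas as pd

# Four rows copied by hand from the 'Basic Building Set' output
long = pd.DataFrame({
    'name': ['Basic Building Set'] * 4,
    'set_num': ['010-3', '010-3', '011-1', '011-1'],
    'lego_info': ['num_parts', 'theme_id', 'num_parts', 'theme_id'],
    'value': [77, 366, 145, 366],
})

# pivot_table collapses the duplicated (name, lego_info) pairs with the mean
table = long.pivot_table(index='name', columns='lego_info',
                         values='value', aggfunc='mean')

# The same numbers computed by grouping explicitly
by_hand = long.groupby(['name', 'lego_info'])['value'].mean().unstack()

print(table)
print(by_hand)
```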

## Merging Data Frames

In [23]:
df = pd.DataFrame({'animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'id' : [1, 2, 1, 2],
                   'speed': [390., 370., 24., 26.]})

df_food = pd.DataFrame({'animal': ['Falcon', 'Falcon',
                                   'Parrot', 'Parrot'],
                        'id' : [1, 2, 1, 2],
                        'food': ['kibble', 'meat', 'seeds', 'fruits']})

In [24]:
df

Unnamed: 0,animal,id,speed
0,Falcon,1,390.0
1,Falcon,2,370.0
2,Parrot,1,24.0
3,Parrot,2,26.0


In [25]:
df_food

Unnamed: 0,animal,id,food
0,Falcon,1,kibble
1,Falcon,2,meat
2,Parrot,1,seeds
3,Parrot,2,fruits


In [26]:
df.merge(df_food, on = 'animal')

Unnamed: 0,animal,id_x,speed,id_y,food
0,Falcon,1,390.0,1,kibble
1,Falcon,1,390.0,2,meat
2,Falcon,2,370.0,1,kibble
3,Falcon,2,370.0,2,meat
4,Parrot,1,24.0,1,seeds
5,Parrot,1,24.0,2,fruits
6,Parrot,2,26.0,1,seeds
7,Parrot,2,26.0,2,fruits


In [27]:
df.merge(df_food, on = ['animal', 'id'])

Unnamed: 0,animal,id,speed,food
0,Falcon,1,390.0,kibble
1,Falcon,2,370.0,meat
2,Parrot,1,24.0,seeds
3,Parrot,2,26.0,fruits


In [29]:
df.merge(df_food, on = ['animal', 'id'], how='right')

Unnamed: 0,animal,id,speed,food
0,Falcon,1,390.0,kibble
1,Falcon,2,370.0,meat
2,Parrot,1,24.0,seeds
3,Parrot,2,26.0,fruits


In [30]:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'ID' : [1, 2, 1, 2],
                   'Speed': [380, 370, 24, 26]})
df_food = pd.DataFrame({'animal': ['Falcon', 'Falcon',
                                   'Parrot', 'Parrot',
                                   'Tiger'],
                        'id' : [1, 2, 1, 2, 1],
                        'food': ['kibble', 'meat', 'seeds', 'fruits', 'meat']})
df_food2 = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                                   'Parrot', 'Parrot',
                                   'Tiger'],
                        'ID' : [1, 2, 1, 2, 1],
                        'Food': ['kibble', 'meat', 'seeds', 'fruits', 'meat']})

In [31]:
df

Unnamed: 0,Animal,ID,Speed
0,Falcon,1,380
1,Falcon,2,370
2,Parrot,1,24
3,Parrot,2,26


In [32]:
df_food

Unnamed: 0,animal,id,food
0,Falcon,1,kibble
1,Falcon,2,meat
2,Parrot,1,seeds
3,Parrot,2,fruits
4,Tiger,1,meat


In [37]:
df2 = df.merge(df_food,
               left_on=['Animal', 'ID'],
               right_on=['animal', 'id'],
               how='outer',
               indicator=True)
df2

Unnamed: 0,Animal,ID,Speed,animal,id,food,_merge
0,Falcon,1.0,380.0,Falcon,1,kibble,both
1,Falcon,2.0,370.0,Falcon,2,meat,both
2,Parrot,1.0,24.0,Parrot,1,seeds,both
3,Parrot,2.0,26.0,Parrot,2,fruits,both
4,,,,Tiger,1,meat,right_only


In [38]:
df2[df2['_merge']=='right_only']

Unnamed: 0,Animal,ID,Speed,animal,id,food,_merge
4,,,,Tiger,1,meat,right_only
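
The `how` argument decides which keys survive the merge. As a rough sketch, the snippet below rebuilds the two tables from this section under the hypothetical names `speed` and `food`, with matching lowercase column names, and counts the rows each join type returns:

```python
import pandas as pd

speed = pd.DataFrame({'animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                      'id': [1, 2, 1, 2],
                      'speed': [380, 370, 24, 26]})
food = pd.DataFrame({'animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Tiger'],
                     'id': [1, 2, 1, 2, 1],
                     'food': ['kibble', 'meat', 'seeds', 'fruits', 'meat']})

# Number of rows returned by each join type on the same pair of keys
row_counts = {how: len(speed.merge(food, on=['animal', 'id'], how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)
```

The Tiger row exists only in the right-hand table, so it appears in the `right` and `outer` results (5 rows) but not in `inner` or `left` (4 rows).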


## Difference with Concat

In [39]:
df

Unnamed: 0,Animal,ID,Speed
0,Falcon,1,380
1,Falcon,2,370
2,Parrot,1,24
3,Parrot,2,26


In [40]:
df_food2

Unnamed: 0,Animal,ID,Food
0,Falcon,1,kibble
1,Falcon,2,meat
2,Parrot,1,seeds
3,Parrot,2,fruits
4,Tiger,1,meat


In [41]:
pd.concat([df, df_food2])


Unnamed: 0,Animal,ID,Speed,Food
0,Falcon,1,380.0,
1,Falcon,2,370.0,
2,Parrot,1,24.0,
3,Parrot,2,26.0,
0,Falcon,1,,kibble
1,Falcon,2,,meat
2,Parrot,1,,seeds
3,Parrot,2,,fruits
4,Tiger,1,,meat


In [42]:
df_food

Unnamed: 0,animal,id,food
0,Falcon,1,kibble
1,Falcon,2,meat
2,Parrot,1,seeds
3,Parrot,2,fruits
4,Tiger,1,meat


In [43]:
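# Rename df's columns to match df_food so the rows stack into the same columns.
# Note: the speed values now sit under the 'food' label.
df.columns = df_food.columns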

In [44]:
pd.concat([df, df_food])

Unnamed: 0,animal,id,food
0,Falcon,1,380
1,Falcon,2,370
2,Parrot,1,24
3,Parrot,2,26
0,Falcon,1,kibble
1,Falcon,2,meat
2,Parrot,1,seeds
3,Parrot,2,fruits
4,Tiger,1,meat
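
Note that the stacked result keeps the original row labels, so 0 to 3 appear twice. If unique labels are wanted, `pd.concat` accepts `ignore_index=True`. A minimal sketch with two small made-up frames (`top` and `bottom` are hypothetical names):

```python
import pandas as pd

top = pd.DataFrame({'animal': ['Falcon', 'Parrot'], 'id': [1, 1]})
bottom = pd.DataFrame({'animal': ['Tiger'], 'id': [1]})

# Default: each frame keeps its own row labels, so label 0 repeats
stacked = pd.concat([top, bottom])

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([top, bottom], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```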


### Concatenating sideways ("merging" by position?)

In [47]:
df3 = pd.concat([df_food, df], axis = 1)
df3

Unnamed: 0,animal,id,food,animal.1,id.1,food.1
0,Falcon,1,kibble,Falcon,1.0,380.0
1,Falcon,2,meat,Falcon,2.0,370.0
2,Parrot,1,seeds,Parrot,1.0,24.0
3,Parrot,2,fruits,Parrot,2.0,26.0
4,Tiger,1,meat,,,


In [53]:
df4 = df3[['animal', 'id']]
df4

Unnamed: 0,animal,animal.1,id,id.1
0,Falcon,Falcon,1,1.0
1,Falcon,Falcon,2,2.0
2,Parrot,Parrot,1,1.0
3,Parrot,Parrot,2,2.0
4,Tiger,,1,


In [51]:
df4[df4['animal'] == 'Parrot']

ValueError: cannot reindex on an axis with duplicate labels
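
The error arises because `df4['animal']` matches two columns with the same label, so the comparison returns a data frame rather than a single boolean column. One possible workaround, sketched here on small made-up frames with a hypothetical `_r` suffix, is to make the column labels unique before concatenating sideways:

```python
import pandas as pd

food = pd.DataFrame({'animal': ['Falcon', 'Parrot', 'Tiger'],
                     'food': ['meat', 'seeds', 'meat']})
speed = pd.DataFrame({'animal': ['Falcon', 'Parrot'],
                      'speed': [380, 24]})

# Suffix the right-hand columns so that no label appears twice
side_by_side = pd.concat([food, speed.add_suffix('_r')], axis=1)

# With unique labels, 'animal' selects a single column and filtering works
parrots = side_by_side[side_by_side['animal'] == 'Parrot']
print(parrots)
```

Keep in mind that sideways concatenation aligns rows by index label, not by key. When the rows need to be matched on `animal`, `.merge()` is usually the safer choice.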