# Long vs. Wide Format 🐼

There are two general forms of tabular data: **long** and **wide** format (see Course Material for image).  

**Long:**   
- one column holds all the values, and a second column holds the variable name for each value
- great for plotting 
- to get a table into long format, use ``stack`` or ``melt``

**Wide:**   
- each variable has its own column
- great for descriptive statistics and Machine Learning
- to get a table into wide format, use ``unstack`` or ``pivot``

Depending on your use case, you might want to represent some information as individual rows of a single column (long format), or represent that same information across multiple, separate columns (wide format).
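
To make the difference concrete, here is a minimal sketch on a tiny, made-up table (the countries `A` and `B` and all values are hypothetical, not taken from the Gapminder file). It shows the same four numbers once in wide and once in long format.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration only)
wide = pd.DataFrame(
    {"country": ["A", "B"], "2000": [70.0, 65.0], "2001": [71.0, 66.0]}
)

# Wide -> long: each year column is turned into rows of (year, value) pairs
long = wide.melt(id_vars="country", var_name="year", value_name="life_expectancy")

print(wide.shape)  # (2, 3): one row per country, one column per year
print(long.shape)  # (4, 3): one row per country-year combination
print(long)
```

The wide table grows sideways with every new year, while the long table only grows downwards.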

In [59]:
import pandas as pd

In [60]:
df = pd.read_excel('../data/gapminder_lifeexpectancy.xlsx', index_col=0)
df.head()

Life expectancy,1800,1801,1802,1803,1804,1805,1806,1807,1808,1809,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Abkhazia,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,28.13,28.12,...,52.4,52.8,53.3,53.6,54.0,54.4,54.8,54.9,53.8,52.72
Akrotiri and Dhekelia,,,,,,,,,,,...,,,,,,,,,,
Albania,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,...,76.6,76.8,77.0,77.2,77.4,77.5,77.7,77.9,78.0,78.1
Algeria,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,...,75.3,75.5,75.7,76.0,76.1,76.2,76.3,76.3,76.4,76.5


# Warmup: Handling the Index

In [61]:
df.index

Index(['Abkhazia', 'Afghanistan', 'Akrotiri and Dhekelia', 'Albania',
       'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla',
       'Antigua and Barbuda',
       ...
       'Vietnam', 'Virgin Islands (U.S.)', 'North Yemen (former)',
       'South Yemen (former)', 'Yemen', 'Yugoslavia', 'Zambia', 'Zimbabwe',
       'Åland', 'South Sudan'],
      dtype='object', name='Life expectancy', length=260)

In [62]:
df.index.name = 'country'

In [63]:
df.head()

country,1800,1801,1802,1803,1804,1805,1806,1807,1808,1809,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Abkhazia,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,28.13,28.12,...,52.4,52.8,53.3,53.6,54.0,54.4,54.8,54.9,53.8,52.72
Akrotiri and Dhekelia,,,,,,,,,,,...,,,,,,,,,,
Albania,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,...,76.6,76.8,77.0,77.2,77.4,77.5,77.7,77.9,78.0,78.1
Algeria,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,...,75.3,75.5,75.7,76.0,76.1,76.2,76.3,76.3,76.4,76.5


In [64]:
df.reset_index(inplace=True)  # same as df = df.reset_index()

In [65]:
df

,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Abkhazia,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,28.21,28.20,28.19,28.18,28.17,28.16,28.15,28.14,28.13,...,52.4,52.8,53.3,53.6,54.0,54.4,54.8,54.9,53.8,52.72
2,Akrotiri and Dhekelia,,,,,,,,,,...,,,,,,,,,,
3,Albania,35.40,35.40,35.40,35.40,35.40,35.40,35.40,35.40,35.40,...,76.6,76.8,77.0,77.2,77.4,77.5,77.7,77.9,78.0,78.10
4,Algeria,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,...,75.3,75.5,75.7,76.0,76.1,76.2,76.3,76.3,76.4,76.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255,Yugoslavia,,,,,,,,,,...,,,,,,,,,,
256,Zambia,32.60,32.60,32.60,32.60,32.60,32.60,32.60,32.60,32.60,...,49.0,51.1,52.3,53.1,53.7,54.7,55.6,56.3,56.7,57.10
257,Zimbabwe,33.70,33.70,33.70,33.70,33.70,33.70,33.70,33.70,33.70,...,46.4,47.3,48.0,49.1,51.6,54.2,55.7,57.0,59.3,61.69
258,Åland,,,,,,,,,,...,,,,,,,,,,


In [66]:
df = df.set_index('country')
df

country,1800,1801,1802,1803,1804,1805,1806,1807,1808,1809,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Abkhazia,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,28.21,28.20,28.19,28.18,28.17,28.16,28.15,28.14,28.13,28.12,...,52.4,52.8,53.3,53.6,54.0,54.4,54.8,54.9,53.8,52.72
Akrotiri and Dhekelia,,,,,,,,,,,...,,,,,,,,,,
Albania,35.40,35.40,35.40,35.40,35.40,35.40,35.40,35.40,35.40,35.40,...,76.6,76.8,77.0,77.2,77.4,77.5,77.7,77.9,78.0,78.10
Algeria,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,...,75.3,75.5,75.7,76.0,76.1,76.2,76.3,76.3,76.4,76.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yugoslavia,,,,,,,,,,,...,,,,,,,,,,
Zambia,32.60,32.60,32.60,32.60,32.60,32.60,32.60,32.60,32.60,32.60,...,49.0,51.1,52.3,53.1,53.7,54.7,55.6,56.3,56.7,57.10
Zimbabwe,33.70,33.70,33.70,33.70,33.70,33.70,33.70,33.70,33.70,33.70,...,46.4,47.3,48.0,49.1,51.6,54.2,55.7,57.0,59.3,61.69
Åland,,,,,,,,,,,...,,,,,,,,,,


In [67]:
df.reset_index(inplace=True)

# Let's bring this table into long format so we can plot it nicely!
What we want: three columns, `country`, `year` and `life_expectancy`. So we need to turn the year columns into rows.

In [68]:
df_long = df.melt(id_vars='country', var_name='year', value_name='life_expectancy')

In [69]:
df_long

,country,year,life_expectancy
0,Abkhazia,1800,
1,Afghanistan,1800,28.21
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,35.40
4,Algeria,1800,28.82
...,...,...,...
56415,Yugoslavia,2016,
56416,Zambia,2016,57.10
56417,Zimbabwe,2016,61.69
56418,Åland,2016,


In [56]:
df_long = df_long.sort_values(["country", "year"])  # sort the long table; the wide df has no 'year' column

In [57]:
df_long

,country,year,life_expectancy
0,Abkhazia,1800,
260,Abkhazia,1801,
520,Abkhazia,1802,
780,Abkhazia,1803,
1040,Abkhazia,1804,
...,...,...,...
55378,Åland,2012,
55638,Åland,2013,
55898,Åland,2014,
56158,Åland,2015,


In [45]:
df_long[df_long['country'] == 'Abkhazia']

,country,year,life_expectancy
0,Abkhazia,1800,
260,Abkhazia,1801,
520,Abkhazia,1802,
780,Abkhazia,1803,
1040,Abkhazia,1804,
...,...,...,...
55120,Abkhazia,2012,
55380,Abkhazia,2013,
55640,Abkhazia,2014,
55900,Abkhazia,2015,


In [70]:
df_long.dtypes

country             object
year                object
life_expectancy    float64
dtype: object

In [71]:
df_long['year'] = df_long['year'].astype(int)

In [72]:
df_long.dtypes

country             object
year                 int64
life_expectancy    float64
dtype: object

#### The other way of getting a wide table into long format: 

In [None]:
# stack 

In [73]:
df.head()

,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Abkhazia,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,28.13,...,52.4,52.8,53.3,53.6,54.0,54.4,54.8,54.9,53.8,52.72
2,Akrotiri and Dhekelia,,,,,,,,,,...,,,,,,,,,,
3,Albania,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,...,76.6,76.8,77.0,77.2,77.4,77.5,77.7,77.9,78.0,78.1
4,Algeria,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,...,75.3,75.5,75.7,76.0,76.1,76.2,76.3,76.3,76.4,76.5


In [74]:
df.set_index('country', inplace=True)

In [75]:
df

country,1800,1801,1802,1803,1804,1805,1806,1807,1808,1809,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Abkhazia,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,28.21,28.20,28.19,28.18,28.17,28.16,28.15,28.14,28.13,28.12,...,52.4,52.8,53.3,53.6,54.0,54.4,54.8,54.9,53.8,52.72
Akrotiri and Dhekelia,,,,,,,,,,,...,,,,,,,,,,
Albania,35.40,35.40,35.40,35.40,35.40,35.40,35.40,35.40,35.40,35.40,...,76.6,76.8,77.0,77.2,77.4,77.5,77.7,77.9,78.0,78.10
Algeria,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,...,75.3,75.5,75.7,76.0,76.1,76.2,76.3,76.3,76.4,76.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yugoslavia,,,,,,,,,,,...,,,,,,,,,,
Zambia,32.60,32.60,32.60,32.60,32.60,32.60,32.60,32.60,32.60,32.60,...,49.0,51.1,52.3,53.1,53.7,54.7,55.6,56.3,56.7,57.10
Zimbabwe,33.70,33.70,33.70,33.70,33.70,33.70,33.70,33.70,33.70,33.70,...,46.4,47.3,48.0,49.1,51.6,54.2,55.7,57.0,59.3,61.69
Åland,,,,,,,,,,,...,,,,,,,,,,


In [78]:
df_stacked = pd.DataFrame(df.stack())  # stack() returns a Series with a (country, year) MultiIndex

In [79]:
df_stacked

country,,0
Afghanistan,1800,28.21
Afghanistan,1801,28.20
Afghanistan,1802,28.19
Afghanistan,1803,28.18
Afghanistan,1804,28.17
...,...,...
South Sudan,2012,56.00
South Sudan,2013,56.00
South Sudan,2014,56.10
South Sudan,2015,56.10


In [81]:
df_stacked = df_stacked.reset_index()

In [82]:
df_stacked.head()

,country,level_1,0
0,Afghanistan,1800,28.21
1,Afghanistan,1801,28.2
2,Afghanistan,1802,28.19
3,Afghanistan,1803,28.18
4,Afghanistan,1804,28.17


In [83]:
df_stacked = df_stacked.rename(columns={'level_1': 'year', 0: 'life_expectancy'})
df_stacked

,country,year,life_expectancy
0,Afghanistan,1800,28.21
1,Afghanistan,1801,28.20
2,Afghanistan,1802,28.19
3,Afghanistan,1803,28.18
4,Afghanistan,1804,28.17
...,...,...,...
43852,South Sudan,2012,56.00
43853,South Sudan,2013,56.00
43854,South Sudan,2014,56.10
43855,South Sudan,2015,56.10
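
Note that Abkhazia, which has no values at all, appears in `df_long` but not in `df_stacked`. The sketch below reproduces this on a made-up table (country names and values are hypothetical). It assumes only that `melt` keeps missing cells; whether `stack` itself drops them depends on the pandas version, so the example calls `dropna()` explicitly to get the same rows either way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country "A" has no values at all, like Abkhazia above
toy = pd.DataFrame(
    {"2000": [np.nan, 65.0], "2001": [np.nan, 66.0]},
    index=pd.Index(["A", "B"], name="country"),
)

# melt keeps every cell, including the missing ones
melted = toy.reset_index().melt(
    id_vars="country", var_name="year", value_name="life_expectancy"
)

# An explicit dropna() gives the same rows whether or not stack() already
# removed the missing values on its own
stacked = toy.stack().dropna()

print(len(melted))                                      # 4
print(len(stacked))                                     # 2
print(len(melted.dropna(subset=["life_expectancy"])))   # 2
```

If the two long tables need to match, dropping the missing values from the melted one with `dropna(subset=[...])` is one way to align them.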


### Let's plot something:

In [None]:
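germany = df_long[df_long['country'] == 'Germany']  # assumes 'Germany' is one of the countries
germany.plot(x='year', y='life_expectancy')  # line plot; requires matplotlib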

# Let's bring the table back into wide format! (for practice)
- What we want: year as index and countries as columns
- We do the inverse of melting/stacking: pivot or unstack
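
As a minimal sketch of both routes, the example below rebuilds a wide table from a small, made-up long table (countries `A` and `B` and all values are hypothetical). The same calls should carry over to `df_long`, assuming each country-year pair occurs only once there.

```python
import pandas as pd

# Hypothetical toy long table (values are made up for illustration only)
long = pd.DataFrame(
    {
        "country": ["A", "B", "A", "B"],
        "year": [2000, 2000, 2001, 2001],
        "life_expectancy": [70.0, 65.0, 71.0, 66.0],
    }
)

# pivot: name the index, the new column labels and the values directly
wide_pivot = long.pivot(index="year", columns="country", values="life_expectancy")

# unstack: put both keys into the index first, then move "country" to the columns
wide_unstack = (
    long.set_index(["year", "country"])["life_expectancy"].unstack("country")
)

print(wide_pivot)
print(wide_unstack)
```

`pivot` raises a `ValueError` when an index/column pair occurs more than once, so duplicates have to be removed or aggregated before reshaping.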

## One last, neat little trick: 

### Nice resource on reshaping with pandas: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html?highlight=reshape

# Exercises:

 
## 🥳🥳  Start with your first project! 🥳🥳 

Go through the steps here: http://krspiced.pythonanywhere.com/chapters/project_gapminder/long_vs_wide.html  

- Bonus: Read the first couple of pages from the paper "Tidy Data" in the Course Material