# Tidy Datasets

In [None]:
%pip install -r https://raw.githubusercontent.com/vrughetti/python4DS/main/requirements.txt

In [8]:
import pandas as pd
import numpy as np

## Untidy Data Example

In [4]:
untidy_data = {
    'Name': ['John', 'Mary', 'Adam'],
    'Maths': [80, 90, 75],
    'Science': [85, 92, 78],
    'English': [70, 88, 82]
}

df_untidy = pd.DataFrame(untidy_data)
df_untidy.head()

,Name,Maths,Science,English
0,John,80,85,70
1,Mary,90,92,88
2,Adam,75,78,82


Let's transform the previous dataset into a tidy one, in which each row holds a single observation (one student's score in one subject). We will use the `melt()` method of pandas DataFrames:

```python
df_untidy.melt(id_vars='Name', var_name='Subject', value_name='Score')
```

In [6]:
df_tidy = df_untidy.melt(id_vars='Name', var_name='Subject', value_name='Score')
df_tidy.head()

,Name,Subject,Score
0,John,Maths,80
1,Mary,Maths,90
2,Adam,Maths,75
3,John,Science,85
4,Mary,Science,92
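
Since `head()` shows only the first five rows, it can help to check the full result. The sketch below rebuilds the same example data and, as an illustrative variation not used in the cell above, passes `value_vars` explicitly to name the columns being unpivoted. The name `df_tidy_full` is only a placeholder for this example. Each of the 3 students should end up with one row per subject, so 9 rows in total.

```python
import pandas as pd

df_untidy = pd.DataFrame({
    'Name': ['John', 'Mary', 'Adam'],
    'Maths': [80, 90, 75],
    'Science': [85, 92, 78],
    'English': [70, 88, 82],
})

# value_vars lists the columns to unpivot; omitting it melts every non-id column
df_tidy_full = df_untidy.melt(
    id_vars='Name',
    value_vars=['Maths', 'Science', 'English'],
    var_name='Subject',
    value_name='Score',
)

print(df_tidy_full.shape)  # (9, 3)
print(df_tidy_full['Subject'].value_counts())
```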


We now have a tidy dataset. To go back to the original wide (untidy) layout, we can use `pivot()`:

```python
df_tidy.pivot(index='Name', columns='Subject', values='Score')
```

In [7]:
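df_untidy_from_tidy = df_tidy.pivot(index='Name', columns='Subject', values='Score')
# Drop the 'Subject' label that pivot() leaves on the column axis
df_untidy_from_tidy.columns.name = None
# Turn the 'Name' index back into a regular column
df_untidy_from_tidy = df_untidy_from_tidy.reset_index()
df_untidy_from_tidy.head()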

,Name,English,Maths,Science
0,Adam,82,75,78
1,John,70,80,85
2,Mary,88,90,92
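
Note that `pivot()` sorts both the rows and the new columns, which is why the result lists Adam first and English before Maths. As a rough check that the round trip loses no information, the sketch below rebuilds the example and compares the pivoted frame with the original after putting both in the same order. The names `df_wide` and `expected` are placeholders for this example only.

```python
import pandas as pd

df_untidy = pd.DataFrame({
    'Name': ['John', 'Mary', 'Adam'],
    'Maths': [80, 90, 75],
    'Science': [85, 92, 78],
    'English': [70, 88, 82],
})
df_tidy = df_untidy.melt(id_vars='Name', var_name='Subject', value_name='Score')

# Wide again: one row per student, one column per subject
df_wide = df_tidy.pivot(index='Name', columns='Subject', values='Score')
df_wide.columns.name = None
df_wide = df_wide.reset_index()

# Align the original to the sorted row and column order produced by pivot()
expected = df_untidy.sort_values('Name').reset_index(drop=True)[df_wide.columns]

print(df_wide.to_dict('records') == expected.to_dict('records'))  # True
```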


## Long and Wide Format

In [16]:
df = pd.DataFrame(
    {
        'year': np.random.randint(1950, 1970, size=20),
        'month': np.random.randint(1, 13, size=20),
        'passengers': np.random.randint(100, 1000, size=20)
    }
)
df.head()

,year,month,passengers
0,1957,5,593
1,1963,1,479
2,1955,6,933
3,1967,1,235
4,1962,2,768


From long to wide format:

```python
df.pivot(index=..., columns=..., values=...)
```

In [20]:
df.pivot(index='year', columns='month', values='passengers')

month,1,2,3,4,5,6,7,8,9,12
year,,,,,,,,,,
1950,,,,,,,,737.0,,
1955,,,,,,933.0,440.0,,,
1956,,,,275.0,,,,,,
1957,,,,476.0,593.0,,,,,986.0
1959,272.0,,,,,,,,444.0,
1961,,,643.0,,328.0,,945.0,,,
1962,,768.0,,843.0,,,,,,
1963,479.0,,,,,,,,,
1965,,,,,769.0,,,,,563.0
1967,235.0,,509.0,,,,,,,


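If there is no value for a particular combination of index and column, that cell is filled with `NaN`. Because `NaN` is a float, the passenger counts in the wide table are displayed as floats (e.g. `737.0`) even though the original column held integers.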
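
One caveat with randomly generated data: `pivot()` raises a `ValueError` when the same (index, column) pair occurs more than once, because it cannot decide which value to keep. A possible alternative is `pivot_table()`, which aggregates duplicates and can fill missing combinations. The sketch below is only an illustration: the seed, the choice of `'sum'` as the aggregation, and `fill_value=0` are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility
df = pd.DataFrame({
    'year': rng.integers(1950, 1970, size=20),
    'month': rng.integers(1, 13, size=20),
    'passengers': rng.integers(100, 1000, size=20),
})

# pivot_table aggregates repeated (year, month) pairs instead of raising;
# fill_value replaces the NaN that would mark missing combinations
wide = df.pivot_table(
    index='year',
    columns='month',
    values='passengers',
    aggfunc='sum',
    fill_value=0,
)
print(wide)
```

Whether summing duplicates and filling gaps with 0 is appropriate depends on the data: a 0 can be misleading when a missing month does not actually mean zero passengers.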

---

# What's next?

Next notebook: [Overview of Pandas](https://github.com/vrughetti/python4DS/blob/main/notebooks/pandas/pandas.ipynb)