# Tidy data
---
#### *Sergiy Tkachuk*  
tkachuk.sergiy23@gmail.com

<details><summary>See the presentation materials</summary>
<p>
    
Author: **Hadley Wickham**  
[Wikipedia](https://en.wikipedia.org/wiki/Tidy_data)  
[Link to the paper](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf)

</p>
</details>

**Remember the basic principles:**  
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
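
To make the three principles concrete, here is a minimal sketch with made-up data (the `student`, `subject`, and `score` names are hypothetical and not part of the course datasets). The same values are stored once in a messy layout, where subject names are column headers, and once in a tidy layout, where each variable has its own column.

```python
import pandas as pd

# Messy layout: the values of the "subject" variable are used as column headers.
messy_scores = pd.DataFrame({
    'student': ['Ann', 'Bob'],
    'math': [90, 70],
    'physics': [85, 60],
})

# Tidy layout: one column per variable, one row per observation.
tidy_scores = pd.DataFrame({
    'student': ['Ann', 'Ann', 'Bob', 'Bob'],
    'subject': ['math', 'physics', 'math', 'physics'],
    'score': [90, 85, 70, 60],
})

# With tidy data, grouping by a variable needs no reshaping first.
mean_by_subject = tidy_scores.groupby('subject')['score'].mean()
print(mean_by_subject)
```

In the tidy layout, adding a third subject only adds rows, while the messy layout would need a new column.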

In [None]:
import pandas as pd
import numpy as np

In [None]:
url1 = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1273/datasets/df1.csv'
url2 = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1273/datasets/df2.csv'

df1 = pd.read_csv(url1, sep = ',')
df2 = pd.read_csv(url2, sep = ',')

#### Which rules do the loaded tables violate?

<details><summary>Answer</summary>
<p>

df2 --> rule 2  
Each observation is made at a specific point in time, so time values belong in a column of their own and cannot serve as column headers.

</p>
</details>

#### Make the data more 'tidy'

In [None]:
df2_melted = ...  # exercise: replace ... with your solution (the answer is below)

print(df2_melted)

<details><summary>Answer</summary>
<p>

pd.melt(df2, id_vars=['Country'])

</p>
</details>
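
As a small illustration of what `pd.melt` does, the sketch below runs it on a hypothetical two-country frame (the `toy` data is invented for this example). It also passes `var_name` and `value_name`, which name the two new columns directly, so no separate rename step is needed.

```python
import pandas as pd

# Hypothetical wide table: one column per year.
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    '1990': [1.0, 2.0],
    '2000': [3.0, 4.0],
})

# id_vars stay as they are; every other column is stacked into
# a (Year, Income) pair of columns.
toy_long = pd.melt(toy, id_vars=['Country'], var_name='Year', value_name='Income')
print(toy_long)
```

The result has one row for each combination of an original row and a melted column, here 2 x 2 = 4 rows.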

In [None]:
df2_tidy = df2_melted.rename(columns = {'variable': 'Year', 'value': 'Income'})
df2_tidy

In [None]:
df2_melted

In [None]:
df2_melted.rename(columns = {'variable': 'Year', 'value': 'Income'}, inplace=True)
df2_melted

#### Formatting

In [None]:
df2_melted.dtypes

In [None]:
df2_melted['Year'] = df2_melted['Year'].apply(lambda x: x[1:5])
df2_melted

In [None]:
# pd.to_numeric converts the whole Series at once; the result is displayed, not stored
pd.to_numeric(df2_melted['Year'])

In [None]:
# Same conversion with an explicit dtype; assign the result back to the column to keep it
df2_melted['Year'].astype('int64')
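
Neither conversion above changes `df2_melted`, because both return a new Series. The sketch below shows the same steps on a standalone Series and keeps the result. The `'X1800'`-style labels are an assumption made for illustration, not a statement about the actual column names in `df2`.

```python
import pandas as pd

# Hypothetical year labels with a one-character prefix.
years = pd.Series(['X1800', 'X1801', 'X1802'], name='Year')

# Vectorized slicing with .str, then an explicit integer dtype.
years_int = years.str[1:5].astype('int64')
print(years_int.dtype)

# pd.to_numeric with errors='coerce' turns unparseable entries into NaN
# instead of raising, at the cost of a float dtype.
dirty = pd.Series(['1800', 'n/a'])
dirty_num = pd.to_numeric(dirty, errors='coerce')
print(dirty_num)
```

`astype('int64')` raises on values it cannot parse, which suits data expected to be clean, while `errors='coerce'` suits exploratory work on messy input.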

### More fun with data

In [None]:
messy = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'], 
                      'Last' : ['Smith', 'Doe', 'Johnson'], 
                      'Treatment A' : [np.nan, 16, 3], 
                      'Treatment B' : [2, 11, 1]})
messy

In [None]:
# messy.transpose()
messy.T

In [None]:
tidy = pd.melt(messy, 
               id_vars=['First','Last'], 
               var_name='treatment', 
               value_name='result')
tidy

In [None]:
tidy['Name'] = tidy['First'] + ' ' + tidy['Last']

In [None]:
messy1 = tidy.pivot(index='Name',columns='treatment',values='result')
messy1

In [None]:
messy1.index

In [None]:
messy1.reset_index(inplace=True)
messy1
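
After `pivot` and `reset_index`, the columns axis still carries the name `treatment`, which appears as a stray label in the top-left corner of the output. The sketch below rebuilds a small long-format frame by hand (same kind of data as above, trimmed to two people) and clears that name. Treat it as one possible cleanup, not the only one.

```python
import numpy as np
import pandas as pd

# Hand-written long-format data, analogous to the melted frame above.
long_df = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'John Smith', 'Jane Doe'],
    'treatment': ['Treatment A', 'Treatment A', 'Treatment B', 'Treatment B'],
    'result': [np.nan, 16, 2, 11],
})

# pivot spreads the values of 'treatment' back into columns.
wide_df = long_df.pivot(index='Name', columns='treatment', values='result')
wide_df = wide_df.reset_index()

# Remove the leftover axis name so the frame looks like an ordinary table.
wide_df.columns.name = None
print(wide_df)
```

Note that `pivot` raises a `ValueError` when an index/column pair occurs more than once; in that case the values have to be aggregated, for example with `pivot_table`.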

#### Multiple variables stored in one column

In [None]:
messy_df = pd.read_csv('https://raw.githubusercontent.com/hadley/tidy-data/master/data/tb.csv', sep=',')

display(messy_df.head())

print(messy_df.columns)

print('\nNumber of observations: %d.\nNumber of variables: %d.' % messy_df.shape)

In [None]:
messy_df.columns = messy_df.columns.str.replace('new_sp_','')
messy_df.rename(columns = {'iso2' : 'country'}, inplace=True)
messy_df = messy_df[messy_df['year'] == 2000].copy()  # copy so the in-place drop below acts on an independent frame
messy_df.drop(['new_sp','m04','m514','f04','f514'], axis=1, inplace=True)
messy_df.iloc[:,:11].head(10)

In [None]:
molten = pd.melt(messy_df, id_vars=['country', 'year'], value_name='cases')
molten.sort_values(by=['year', 'country'], inplace=True)
molten.head(10)

In [None]:
tidy = molten[~molten['variable'].isin(['mu', 'fu'])].copy()  # drop the unknown-age codes for both sexes
def parse_age(s):
    s = s[1:]
    if s == '65':
        return '65+'
    else:
        return s[:-2]+'-'+s[-2:]

tidy['sex'] = tidy['variable'].apply(lambda s: s[:1])
tidy['age'] = tidy['variable'].apply(parse_age)
tidy = tidy[['country', 'year', 'sex', 'age', 'cases']]
tidy.head(10)

In [None]:
tidy.fillna(0)
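
As an alternative to slicing the code by position, the split into sex and age can be sketched with a regular expression and `Series.str.extract`. The example runs on three hand-picked codes of the same shape as those in the `variable` column. The helper `format_age` is a hypothetical name introduced here, and it follows the same formatting rule as `parse_age` above.

```python
import pandas as pd

# Sample codes: one letter for sex, followed by the age-range digits.
codes = pd.Series(['m014', 'f1524', 'm65'], name='variable')

# Named groups become the column names of the resulting DataFrame.
parts = codes.str.extract(r'^(?P<sex>[mf])(?P<age>\d+)$')

def format_age(digits):
    """Turn '014' into '0-14', '1524' into '15-24', and '65' into '65+'."""
    if digits == '65':
        return '65+'
    return digits[:-2] + '-' + digits[-2:]

parts['age'] = parts['age'].map(format_age)
print(parts)
```

Codes that do not match the pattern, such as an unknown-age code, come back as NaN in both columns, so they need to be filtered out before `format_age` is applied.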