### What should a DataFrame look like?


#### Intro:

- each variable (attribute, feature) is a column
- each observation (unit, record, instance) is a row
- each type of observational unit is a table
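
As a minimal sketch of these rules, consider a small made-up table (the values and column names below are invented for illustration and do not come from the penguin dataset file):

```python
import pandas as pd

# Hypothetical toy data: one row per penguin (observation),
# one column per variable, one table for one kind of unit (penguins).
tidy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750, 5000, 3700],
    "sex": ["MALE", "FEMALE", "FEMALE"],
})

# shape is (number of observations, number of variables)
print(tidy.shape)
```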



### `Tidy data`

#### If your data is tidy, you can easily switch between wide and long formats
- In the **long format**, there is a single value column and another column that contains ***the variable name as a category*** for each of the values. This format is great for plotting with seaborn.

- In the **wide format**, each variable has its own column. This format is great for calculating descriptive statistics or for applying machine learning with sklearn.
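
The two formats can be sketched on a tiny made-up table. The names `wide_toy`, `long_toy`, `back`, `id`, `length_mm` and `depth_mm` are assumptions chosen for this illustration only:

```python
import pandas as pd

# Hypothetical wide table: each variable has its own column.
wide_toy = pd.DataFrame({
    "id": [1, 2],
    "length_mm": [39.1, 46.5],
    "depth_mm": [18.7, 17.9],
})

# Wide -> long: one value column plus a column naming the variable.
long_toy = wide_toy.melt(id_vars="id", var_name="measurement", value_name="value")

# Long -> wide: spread the variable names back into columns.
back = long_toy.pivot(index="id", columns="measurement", values="value")

print(long_toy)
print(back)
```

The long table has one row per (id, measurement) pair, so two rows of wide data with two measurement columns become four rows.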



In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

#### Each observation is a penguin

In [None]:
df = pd.read_csv('penguins_simple.csv', sep=";")
df.head()

In [None]:
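# Clean column names: lower case, underscores instead of spaces, no parentheses.
# regex=False makes '(' and ')' literal characters regardless of the pandas version.
df.columns = (df.columns.str.lower()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))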


#### 1. Create a new column

In [None]:

df['penguin_id'] = range(len(df))
df.head()

#### 2. Move columns into the index and out again

In [None]:
df.set_index('penguin_id')  # returns a new DataFrame indexed by penguin_id; df itself is unchanged

In [None]:
df.set_index('penguin_id').reset_index()  # moves the index back into a column
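
Both methods return a new DataFrame rather than modifying the original. A minimal sketch on a made-up table (`toy`, `indexed` and `restored` are names assumed for this example):

```python
import pandas as pd

# Hypothetical toy table with an id column.
toy = pd.DataFrame({
    "penguin_id": [0, 1, 2],
    "species": ["Adelie", "Gentoo", "Adelie"],
})

indexed = toy.set_index("penguin_id")  # new DataFrame, penguin_id is now the index
restored = indexed.reset_index()       # new DataFrame, penguin_id is a column again

print(toy.columns.tolist())      # the original still has both columns
print(indexed.columns.tolist())  # penguin_id is no longer a column here
print(restored.equals(toy))      # the round trip reproduces the original
```

To keep the result, assign it back (for example `toy = toy.set_index("penguin_id")`).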

In [None]:
df.columns


In [None]:
df

In [None]:
df.isna().sum()  # count missing values per column (all rows, not only the first five)

In [None]:
sns.scatterplot(data=df, x='culmen_length_mm', y='culmen_depth_mm', hue='species', style='sex')

## Long vs Wide 

## `df.melt`: ***from wide to long***



- **id_vars**: Column(s) to use as identifier variables
- **value_vars**: Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- **var_name**: Name to use for the ‘variable’ column.
- **value_name**: Name to use for the ‘value’ column.
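
A small sketch of how these parameters behave, using a made-up table (`toy`, `default_melt` and `named_melt` are names assumed for this example). When only `id_vars` is given, all remaining columns are unpivoted and the new columns get the default names `variable` and `value`:

```python
import pandas as pd

# Hypothetical toy table with one identifier and two measurement columns.
toy = pd.DataFrame({
    "penguin_id": [0, 1],
    "culmen_length_mm": [39.1, 46.5],
    "culmen_depth_mm": [18.7, 17.9],
})

# Only id_vars: every other column is unpivoted, default column names are used.
default_melt = toy.melt(id_vars="penguin_id")
print(default_melt.columns.tolist())

# var_name and value_name rename the two new columns.
named_melt = toy.melt(id_vars="penguin_id",
                      var_name="bill_measurement",
                      value_name="value_mm")
print(named_melt.columns.tolist())
```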

In [None]:
df.columns

In [None]:
df_melted = df.melt(id_vars=['penguin_id','sex', 'species'],  # column names from the original df
        value_vars=['culmen_length_mm', 'culmen_depth_mm'], # column names from the original df
        var_name='bill_measurement',  # new column name for the variable column
        value_name='value_mm')    # new column name for the values column
df_melted

- The long format is often used for plotting with seaborn (except for the pairplot, which expects one column per variable, i.e. the wide format)

In [None]:
df_melted.bill_measurement.unique()

In [None]:
sns.boxplot(data=df_melted, y='value_mm', x='bill_measurement', hue = 'species')

In [None]:
sns.violinplot(data=df_melted, y='value_mm', x='bill_measurement', hue = 'sex', split = True)

##  `df.pivot`: ***from long to wide***


In [None]:
wide = df_melted.pivot(index='penguin_id', columns='bill_measurement', values='value_mm')
wide

In [None]:
del wide['culmen_length_mm']
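
One caveat worth sketching: `pivot` needs each (index, columns) pair to occur only once. The made-up table below (`long_toy` and `averaged` are names assumed for this example) contains a duplicated pair, so `pivot` raises a `ValueError`; `pivot_table` is shown as one possible alternative that aggregates the duplicates instead:

```python
import pandas as pd

# Hypothetical long table where penguin 0 has two culmen_length_mm values.
long_toy = pd.DataFrame({
    "penguin_id": [0, 0, 0],
    "bill_measurement": ["culmen_length_mm", "culmen_length_mm", "culmen_depth_mm"],
    "value_mm": [39.0, 41.0, 18.7],
})

try:
    long_toy.pivot(index="penguin_id", columns="bill_measurement", values="value_mm")
except ValueError as err:
    # pivot cannot decide which of the duplicated values to keep
    print("pivot failed:", err)

# pivot_table aggregates duplicates (here: the mean of 39.0 and 41.0).
averaged = long_toy.pivot_table(index="penguin_id", columns="bill_measurement",
                                values="value_mm", aggfunc="mean")
print(averaged)
```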


## ***Stack and Unstack***

In [None]:
data = [
    ['Germany', 2000, 80, 1.5],
    ['Germany', 2010, 81, 1.4],
    ['Germany', 2020, 82, 1.3],
    
    ['Iran', 2000, 61, 2.5],
    ['Iran', 2010, 70, 2.4],
    ['Iran', 2020, 80, 2.3],    
]

x = pd.DataFrame(data, columns=('country', 'year','pop', 'fert'))
x.set_index(['country', 'year'], inplace=True)
x


In [None]:
y = x.stack()  # moves the column labels (pop, fert) into a third index level; the result is a Series
y

In [None]:
y.unstack(0)  # moves index level 0 (country) into the columns

In [None]:
y.unstack(0).plot.bar()

In [None]:
y.unstack(1).plot.bar()

In [None]:
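# Plot Germany's values per year; fert is drawn against its own right-hand y-axis
ax = y.unstack(2).loc['Germany'].plot.bar(secondary_y="fert")
#y.unstack(2).loc['Iran'].plot.bar()

ax.right_ax.set_ylim([0, 5])  # limit the fertility axis explicitly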


In [None]:
y.head()

In [None]:
y["Germany",2010,"fert"]
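
The level numbers passed to `unstack` refer to the positions of the index levels. As a sketch on a shortened, made-up copy of the table above (`toy`, `stacked`, `by_country` and `by_year` are names assumed for this example), named levels can also be addressed by name, which is often easier to read:

```python
import pandas as pd

# Hypothetical shortened version of the country/year table.
toy = pd.DataFrame({
    "country": ["Germany", "Germany", "Iran", "Iran"],
    "year": [2000, 2010, 2000, 2010],
    "pop": [80, 81, 61, 70],
    "fert": [1.5, 1.4, 2.5, 2.4],
}).set_index(["country", "year"])

stacked = toy.stack()        # Series with three index levels
print(stacked.index.names)   # country, year, and an unnamed level for pop/fert

by_country = stacked.unstack("country")  # same as stacked.unstack(0)
by_year = stacked.unstack("year")        # same as stacked.unstack(1)

# Select a single value with one label per index level.
print(stacked.loc[("Germany", 2010, "fert")])
```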

### What should I do next?
- Your job is to melt this data into a long format so that it is easier to work with and to merge with other tables in the Animated Scatterplot Exercise in the course material: https://krspiced.pythonanywhere.com/chapters/project_gapminder/long_vs_wide.html#animated-scatterplot-exercise