# Melt (Columns=>Rows)

For data to be tidy, it must have:

    Each variable as a separate column.
    Each row as a separate observation.

As a data scientist, you'll encounter data that is represented in a variety of different ways, so it is important to be able to recognize tidy (or untidy) data when you see it.

Melting data is the process of turning columns of your data into rows of data. 

In the tidy DataFrame, the variables Ozone, Solar.R, Wind, and Temp each had their own column. If, however, we wanted these variables to be in rows instead, we could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: Depending on how your data is represented, you will have to reshape it differently.

There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.
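
As a small illustration of how id_vars and value_vars interact, the sketch below melts a hand-made DataFrame. The `toy` frame and its values are made up for illustration and are not read from airquality.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from airquality.csv
toy = pd.DataFrame({
    'Month': [5, 5],
    'Day': [1, 2],
    'Wind': [7.4, 8.0],
    'Temp': [67, 72],
})

# Melt only 'Wind'; 'Temp' is dropped because it is in neither list
wind_only = pd.melt(toy, id_vars=['Month', 'Day'], value_vars=['Wind'])

# Omit value_vars: every column not in id_vars ('Wind' and 'Temp') is melted
melt_all = pd.melt(toy, id_vars=['Month', 'Day'])

print(wind_only)
print(melt_all)
```

With value_vars given, the result has one row per original row; without it, there is one row per original row for each melted column.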

In [11]:
# Load airquality and print it
import pandas as pd
airquality = pd.read_csv('airquality.csv')
df = pd.DataFrame(airquality)
print(airquality)

# Melt airquality, keeping Month and Day as identifier columns: airquality_melt
airquality_melt = pd.melt(frame=df, id_vars=['Month', 'Day'])
print('\n\n')
# Print airquality_melt
print(airquality_melt)


     Ozone  Solar.R  Wind  Temp  Month  Day
0     41.0    190.0   7.4    67      5    1
1     36.0    118.0   8.0    72      5    2
2     12.0    149.0  12.6    74      5    3
3     18.0    313.0  11.5    62      5    4
4      NaN      NaN  14.3    56      5    5
5     28.0      NaN  14.9    66      5    6
6     23.0    299.0   8.6    65      5    7
7     19.0     99.0  13.8    59      5    8
8      8.0     19.0  20.1    61      5    9
9      NaN    194.0   8.6    69      5   10
10     7.0      NaN   6.9    74      5   11
11    16.0    256.0   9.7    69      5   12
12    11.0    290.0   9.2    66      5   13
13    14.0    274.0  10.9    68      5   14
14    18.0     65.0  13.2    58      5   15
15    14.0    334.0  11.5    64      5   16
16    34.0    307.0  12.0    66      5   17
17     6.0     78.0  18.4    57      5   18
18    30.0    322.0  11.5    68      5   19
19    11.0     44.0   9.7    62      5   20
20     1.0      8.0   9.7    59      5   21
21    11.0    320.0  16.6    73 

When melting DataFrames, it is better to use column names more meaningful than the defaults, variable and value.

The default names may work in certain situations, but it's best to always have data that is self-explanatory.

You can rename the variable column by passing an argument to the var_name parameter, and the value column by passing an argument to the value_name parameter.

In [14]:
# Print the head of airquality
print(airquality.head())

# Melt airquality with descriptive column names: airquality_melt
airquality_melt = pd.melt(df, id_vars=['Month', 'Day'], var_name='Measurement', value_name='Reading')
print('\n\n')
# Print airquality_melt
print(airquality_melt)


   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5



     Month  Day Measurement  Reading
0        5    1       Ozone     41.0
1        5    2       Ozone     36.0
2        5    3       Ozone     12.0
3        5    4       Ozone     18.0
4        5    5       Ozone      NaN
5        5    6       Ozone     28.0
6        5    7       Ozone     23.0
7        5    8       Ozone     19.0
8        5    9       Ozone      8.0
9        5   10       Ozone      NaN
10       5   11       Ozone      7.0
11       5   12       Ozone     16.0
12       5   13       Ozone     11.0
13       5   14       Ozone     14.0
14       5   15       Ozone     18.0
15       5   16       Ozone     14.0
16       5   17       Ozone     34.0
17       5   18       Ozone      6.0
18       5   19       Ozone     30.0
19   

# Pivot (Rows=>Columns)

Pivoting data is the opposite of melting it.

While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column.

.pivot_table() has an index parameter, which you can use to specify the columns that you don't want pivoted; it is similar to the id_vars parameter of pd.melt(). Two other parameters you should specify are columns (the name of the column you want to pivot) and values (the values to be used when the column is pivoted).

In [17]:
# Print the head of airquality_melt
print(airquality_melt.head())

# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='Measurement', values='Reading')
print('\n\n')
# Print the head of airquality_pivot
print(airquality_pivot.head())


   Month  Day Measurement  Reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN



Measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3


After pivoting airquality_melt in the previous exercise, you didn't quite get back the original DataFrame.

What you got back instead was a pandas DataFrame with a hierarchical index (also known as a MultiIndex). Hierarchical indexes allow you to group columns or rows by another variable - in this case, by 'Month' as well as 'Day'. Calling .reset_index() moves the index levels back into ordinary columns.

In [19]:
# Print the index of airquality_pivot
print(airquality_pivot.index)
print('\n\n')
# Reset the index of airquality_pivot: airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the new index of airquality_pivot
print(airquality_pivot.index)

# Print the head of airquality_pivot
print(airquality_pivot.head())


RangeIndex(start=0, stop=153, step=1)



RangeIndex(start=0, stop=153, step=1)
Measurement  index  Month  Day  Ozone  Solar.R  Temp  Wind
0                0      5    1   41.0    190.0  67.0   7.4
1                1      5    2   36.0    118.0  72.0   8.0
2                2      5    3   12.0    149.0  74.0  12.6
3                3      5    4   18.0    313.0  62.0  11.5
4                4      5    5    NaN      NaN  56.0  14.3
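
For comparison, here is a minimal sketch of the index before and after .reset_index() on a small made-up frame (the `long_df` data is hypothetical). It also shows that calling .reset_index() on a frame whose index is already a RangeIndex adds a column named index, which is one way output like the extra index column above can arise, for example if a cell is run twice.

```python
import pandas as pd

# Hypothetical long-format data, made up for illustration
long_df = pd.DataFrame({
    'Month': [5, 5, 5, 5],
    'Day': [1, 1, 2, 2],
    'Measurement': ['Wind', 'Temp', 'Wind', 'Temp'],
    'Reading': [7.4, 67.0, 8.0, 72.0],
})

wide = long_df.pivot_table(index=['Month', 'Day'],
                           columns='Measurement',
                           values='Reading')
print(type(wide.index).__name__)   # MultiIndex

# First reset: the 'Month' and 'Day' levels become regular columns
flat = wide.reset_index()
print(type(flat.index).__name__)   # RangeIndex
print(list(flat.columns))          # ['Month', 'Day', 'Temp', 'Wind']

# Second reset: the old RangeIndex is moved into a column named 'index'
twice = flat.reset_index()
print(list(twice.columns))         # ['index', 'Month', 'Day', 'Temp', 'Wind']
```

If the extra column is unwanted, .reset_index(drop=True) discards the old index instead of keeping it as a column.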


## Pivoting Duplicate Values

So far, we've used the .pivot_table() method when there are multiple index values we want to hold constant during a pivot. We'll see that by using .pivot_table() and the aggfunc parameter, we can not only reshape our data, but also remove duplicates by aggregating them into a single value. Finally, we can then flatten the columns of the pivoted DataFrame using .reset_index().

In [25]:

# Pivot airquality_melt, averaging any duplicate readings: airquality_pivot
# (aggfunc='mean' is the string form of np.mean)
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='Measurement', values='Reading', aggfunc='mean')

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())
print('\n\n')
# Print the head of airquality
print(airquality.head())


Measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3



   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
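
To see aggfunc actually collapse duplicates, the sketch below uses a small made-up frame in which one reading appears twice (the `dup` data is hypothetical and is not part of the airquality dataset). It contrasts .pivot(), which raises an error on duplicate entries, with .pivot_table(), which aggregates them.

```python
import pandas as pd

# Hypothetical data: the Day 1 'Wind' reading appears twice
dup = pd.DataFrame({
    'Day': [1, 1, 1],
    'Measurement': ['Wind', 'Wind', 'Temp'],
    'Reading': [7.0, 9.0, 67.0],
})

# .pivot() cannot choose between the duplicates and raises ValueError
try:
    dup.pivot(index='Day', columns='Measurement', values='Reading')
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)

# .pivot_table() averages the two 'Wind' readings into one value
deduped = dup.pivot_table(index='Day', columns='Measurement',
                          values='Reading', aggfunc='mean').reset_index()
print(deduped)
```

Whether the mean is the right way to combine duplicates depends on the data; other aggregations such as 'first' or 'max' can be passed to aggfunc instead.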


In [27]:
# Melt tb: tb_melt
tb = pd.read_csv('tb.csv')
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])

# Create the 'gender' column from the first character of 'variable'
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column from the remaining characters
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())

# Notice the new 'gender' and 'age_group' columns you created.
# It is vital to be able to split columns
# as needed so you can access the data that is relevant to your question.

  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014


Another common way multiple variables are stored in columns is with a delimiter. We'll learn how to deal with such cases using a dataset consisting of Ebola cases and death counts by state and country.

The data has column names such as Cases_Guinea and Deaths_Guinea. Here, the underscore _ serves as a delimiter between the first part (cases or deaths), and the second part (country).

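This time, we cannot directly slice the variable by position as in the previous exercise, because the parts do not have a fixed length. You now need to use the string method .split(), which pandas exposes through the str attribute of a column. By default, this method splits a string on whitespace; because the delimiter here is an underscore, pass '_' as the argument. The result is a list of parts for each row.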

Next we need to extract the first element of this list and assign it to a type variable, and the second element of the list to a country variable. You can accomplish this by accessing the str attribute of the column and using the .get() method to retrieve the 0 or 1 index, depending on the part you want.
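
The two steps can be sketched on a short Series before applying them to the full dataset. The `names` Series below is built by hand from the two column names mentioned above; the variable names are illustrative only.

```python
import pandas as pd

# Two of the column names described above, placed in a Series by hand
names = pd.Series(['Cases_Guinea', 'Deaths_Guinea'])

# With no argument, .str.split() splits on whitespace, so nothing is split here
default_split = names.str.split()
print(default_split.tolist())

# Passing '_' splits each string on the underscore
parts = names.str.split('_')
print(parts.tolist())

# .str.get(i) picks element i from each list
kind = parts.str.get(0)
country = parts.str.get(1)
print(kind.tolist(), country.tolist())
```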

In [29]:
# Melt ebola: ebola_melt
ebola=pd.read_csv('ebola.csv')
ebola_melt = pd.melt(ebola, id_vars=['Date','Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column by splitting on the underscore
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)

# Print ebola_melt
print(ebola_melt)


            Date  Day  type_country  counts        str_split    type country
0       1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]   Cases  Guinea
1       1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]   Cases  Guinea
2       1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]   Cases  Guinea
3       1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]   Cases  Guinea
4     12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]   Cases  Guinea
5     12/28/2014  281  Cases_Guinea  2706.0  [Cases, Guinea]   Cases  Guinea
6     12/27/2014  280  Cases_Guinea  2695.0  [Cases, Guinea]   Cases  Guinea
7     12/24/2014  277  Cases_Guinea  2630.0  [Cases, Guinea]   Cases  Guinea
8     12/21/2014  273  Cases_Guinea  2597.0  [Cases, Guinea]   Cases  Guinea
9     12/20/2014  272  Cases_Guinea  2571.0  [Cases, Guinea]   Cases  Guinea
10    12/18/2014  271  Cases_Guinea     NaN  [Cases, Guinea]   Cases  Guinea
11    12/14/2014  267  Cases_Guinea  2416.0  [Cases, Guinea]   Cases  Guinea