<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Reshaping Data</h1>
</div>

© Copyright Machine Learning Plus

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. Cross Tabulation</h2>
</div>

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df.head()

In [None]:
rows = df['Survived']
columns = df['Pclass']
pd.crosstab(rows, columns, rownames=['Survived'])

You can add row and column sums using the `margins` parameter.

In [None]:
# `margins` expects a boolean; True adds an 'All' row and an 'All' column
pd.crosstab(rows, columns, rownames=['Survived'], margins=True)

To compute fractions instead of counts, set `normalize=True`.

In [None]:
pd.crosstab(rows, columns, rownames=['Survived'], normalize=True)

Looking at fractions rather than raw counts makes it easier to see whether survival rates differ between the classes.
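Note that `normalize=True` divides every count by the grand total. To compare survival rates between classes directly, one option is to normalize within each column using `normalize='columns'`. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and only stands in for the Titanic columns):

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1, 0, 0],
    'Pclass':   [1, 1, 1, 2, 2, 2, 3, 3],
})

# Normalize within each column: every Pclass column sums to 1,
# so each cell is the fraction of that class with the given outcome
rates = pd.crosstab(toy['Survived'], toy['Pclass'], normalize='columns')
print(rates)
```

Use `normalize='index'` instead if you want each row to sum to 1.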

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Pivoting</h2>
</div>

__When to use__

Pivoting reshapes data so that you can view how a value is distributed across two or more categorical columns.

For example: in the Titanic data, you may want to know the average Fare paid by people in different classes (in columns) vs Survived (in rows).


__How to use__

You can use the `pd.pivot_table` function for this. You need to specify which categorical columns go in the rows and columns of the pivot table, which numeric column should be used to fill in the cells, and how that numeric column should be aggregated.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
pd.pivot_table(index='Survived', columns='Pclass', values=['Fare'], aggfunc='mean', data=df)

Unnamed: 0_level_0,Fare,Fare,Fare
Pclass,1,2,3
Survived,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,64.684008,19.412328,13.669364
1,95.608029,22.0557,13.694887


`pivot_table` is available as a method of the dataframe as well.

In [None]:
df_pivot = df.pivot_table(index='Survived', columns='Pclass', values=['Fare'], aggfunc='mean')
df_pivot

Unnamed: 0_level_0,Fare,Fare,Fare
Pclass,1,2,3
Survived,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,64.684008,19.412328,13.669364
1,95.608029,22.0557,13.694887


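You can unstack this. The column levels move into the row index, and the result is a long-format Series with a MultiIndex.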

In [None]:
df_unstack = df_pivot.unstack()
df_unstack

      Pclass  Survived
Fare  1       0           64.684008
              1           95.608029
      2       0           19.412328
              1           22.055700
      3       0           13.669364
              1           13.694887
dtype: float64

In [None]:
type(df_unstack)

pandas.core.series.Series
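If you would rather have an ordinary DataFrame than a MultiIndexed Series, one option is to call `reset_index` on the unstacked result. The sketch below uses a small made-up frame (`toy` is hypothetical), and passes `values='Fare'` as a string so that the pivot has single-level columns:

```python
import pandas as pd

# Hypothetical toy data; not the Titanic file
toy = pd.DataFrame({
    'Survived': [0, 0, 1, 1],
    'Pclass':   [1, 2, 1, 2],
    'Fare':     [60.0, 20.0, 90.0, 30.0],
})

wide = toy.pivot_table(index='Survived', columns='Pclass', values='Fare', aggfunc='mean')

# unstack() returns a Series indexed by (Pclass, Survived)
long_series = wide.unstack()

# reset_index turns the index levels back into regular columns
long_df = long_series.reset_index(name='Fare')
print(long_df)
```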

The same information can also be computed using the groupby-aggregate construct. The values are the same as those computed by the pivot, but they are represented in long format.

In [None]:
# Passing the string 'mean' avoids the warning newer pandas versions emit for np.mean
df_agg = df.groupby(['Pclass', 'Survived']).agg({'Fare': 'mean'})
df_agg

Unnamed: 0_level_0,Unnamed: 1_level_0,Fare
Pclass,Survived,Unnamed: 2_level_1
1,0,64.684008
1,1,95.608029
2,0,19.412328
2,1,22.0557
3,0,13.669364
3,1,13.694887
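To see the link between the two approaches, you can unstack one level of the grouped result to get back the wide layout. This is a minimal sketch on made-up data (`toy` is hypothetical), assuming a mean aggregation as in the cells above:

```python
import pandas as pd

# Hypothetical toy data; not the Titanic file
toy = pd.DataFrame({
    'Survived': [0, 0, 1, 1, 1],
    'Pclass':   [1, 2, 1, 2, 2],
    'Fare':     [60.0, 20.0, 90.0, 30.0, 10.0],
})

# Wide format via pivot_table
pivoted = toy.pivot_table(index='Survived', columns='Pclass', values='Fare', aggfunc='mean')

# Long format via groupby, then move Pclass into the columns
grouped = toy.groupby(['Survived', 'Pclass'])['Fare'].mean().unstack('Pclass')

print(pivoted)
print(grouped)
```

Both frames hold the same means, with Survived in the rows and Pclass in the columns.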


### Challenge

Compute a pivot showing the average age of people by:
1. Survived vs Class
2. Survived vs Sex
3. Class vs Sex

What inferences can you draw from it?

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('Datasets/Titanic.csv')
df.head()

In [None]:
# Solution 1
df_pivot1 = df.pivot_table(index='Survived', columns='Pclass', values=['Age'], aggfunc='mean')
df_pivot1

Unnamed: 0_level_0,Age,Age,Age
Pclass,1,2,3
Survived,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,43.695312,33.544444,26.555556
1,35.368197,25.901566,20.646118


__Inference__
1. In every class, survivors were younger on average than non-survivors.
2. The average age of survivors rises from class 3 to class 1. Ex: the average age of survivors in class 1 (35.4) is greater than the average age of non-survivors in class 2 (33.5).

In [None]:
# Solution 2
df_pivot2 = df.pivot_table(index='Survived', columns='Sex', values=['Age'], aggfunc='mean')
df_pivot2

Unnamed: 0_level_0,Age,Age
Sex,female,male
Survived,Unnamed: 1_level_2,Unnamed: 2_level_2
0,25.046875,31.618056
1,28.847716,27.276022


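Female survivors were slightly older on average than female non-survivors (28.8 vs 25.0), whereas male survivors were younger than male non-survivors (27.3 vs 31.6). Older men seem to have been given lower priority.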

In [None]:
# Solution 3
df_pivot3 = df.pivot_table(index='Pclass', columns='Sex', values=['Age'], aggfunc='mean')
df_pivot3

Unnamed: 0_level_0,Age,Age
Sex,female,male
Pclass,Unnamed: 1_level_2,Unnamed: 2_level_2
1,34.611765,41.281386
2,28.722973,30.740707
3,21.75,26.507589


In every class, the women aboard the ship were younger on average than the men.

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Wide to long and back</h2>
</div>


__When to use__

Sometimes, when storing a dataset in a database, you need to convert it to a standard format that already exists there, typically with "Id", "Variable Name" and "Value" columns.

This can happen when your company mandates that certain datasets be stored in a standard format so that they are compatible with existing data visualization software and applications.

__Example__

In the Titanic dataset, each row contains multiple attributes (like Survived, Pclass, Age etc.) for one individual. Suppose the table in the database allows only one attribute to be stored per row, with the following columns:
"PassengerId", "Name", "Variable", "Value"


Store the "Survived" and "Pclass" columns of the dataset in this format.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


__melt__

In [None]:
df_melted = pd.melt(df, id_vars=['PassengerId', 'Name'], value_vars=['Survived', 'Pclass'])
df_melted

Unnamed: 0,PassengerId,Name,variable,value
0,1,"Braund, Mr. Owen Harris",Survived,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Survived,1
2,3,"Heikkinen, Miss. Laina",Survived,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Survived,1
4,5,"Allen, Mr. William Henry",Survived,0
...,...,...,...,...
1777,887,"Montvila, Rev. Juozas",Pclass,2
1778,888,"Graham, Miss. Margaret Edith",Pclass,1
1779,889,"Johnston, Miss. Catherine Helen ""Carrie""",Pclass,3
1780,890,"Behr, Mr. Karl Howell",Pclass,1
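By default, `pd.melt` names the new columns `variable` and `value`. If the target table expects "Variable" and "Value", you can set them with the `var_name` and `value_name` parameters. A minimal sketch on a two-row made-up frame (`toy` and its names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the same column layout as the example
toy = pd.DataFrame({
    'PassengerId': [1, 2],
    'Name': ['A', 'B'],
    'Survived': [0, 1],
    'Pclass': [3, 1],
})

# var_name / value_name control the names of the two new columns
melted = pd.melt(
    toy,
    id_vars=['PassengerId', 'Name'],
    value_vars=['Survived', 'Pclass'],
    var_name='Variable',
    value_name='Value',
)
print(melted)
```

Each passenger now appears once per melted column, so the result has twice as many rows as the input.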


__Make Long to Wide again__

Use `pd.pivot()`. Unlike `pd.pivot_table`, `pd.pivot` does not aggregate; it only reshapes. It therefore raises a `ValueError` if any index/column combination appears more than once.
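The difference with duplicate records can be sketched on a tiny made-up long-format frame (`long_df` and its column names are hypothetical). Here the pair `(1, 'x')` occurs twice:

```python
import pandas as pd

# Hypothetical long-format data with a duplicate (id, variable) pair
long_df = pd.DataFrame({
    'id':       [1, 1, 1, 2],
    'variable': ['x', 'x', 'y', 'x'],
    'value':    [10, 20, 5, 7],
})

# pivot cannot decide which of the duplicate values to keep
try:
    long_df.pivot(index='id', columns='variable', values='value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table resolves duplicates by aggregating (mean by default)
table = long_df.pivot_table(index='id', columns='variable', values='value')
print(table)
```

The melted Titanic frame has exactly one row per passenger and variable, which is why `pd.pivot` works on it.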

In [None]:
df_wide = pd.pivot(df_melted, index=['PassengerId', 'Name'], columns=['variable'], values='value')
df_wide

Unnamed: 0_level_0,variable,Pclass,Survived
PassengerId,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
1,"Braund, Mr. Owen Harris",3,0
2,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",1,1
3,"Heikkinen, Miss. Laina",3,1
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,1
5,"Allen, Mr. William Henry",3,0
...,...,...,...
887,"Montvila, Rev. Juozas",2,0
888,"Graham, Miss. Margaret Edith",1,1
889,"Johnston, Miss. Catherine Helen ""Carrie""",3,0
890,"Behr, Mr. Karl Howell",1,1


__Convert index to column__

In [None]:
# Convert index to column
df_wide.reset_index()

variable,PassengerId,Name,Pclass,Survived
0,1,"Braund, Mr. Owen Harris",3,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,1
2,3,"Heikkinen, Miss. Laina",3,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,1
4,5,"Allen, Mr. William Henry",3,0
...,...,...,...,...
886,887,"Montvila, Rev. Juozas",2,0
887,888,"Graham, Miss. Margaret Edith",1,1
888,889,"Johnston, Miss. Catherine Helen ""Carrie""",3,0
889,890,"Behr, Mr. Karl Howell",1,1
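After `reset_index`, the columns axis still carries the name `variable`, which is why it shows up in the header corner above. If that label is unwanted, one way to drop it is to clear `columns.name`. A minimal sketch on made-up data (`long_df` is hypothetical):

```python
import pandas as pd

# Hypothetical long-format data shaped like the melted frame
long_df = pd.DataFrame({
    'PassengerId': [1, 1, 2, 2],
    'variable': ['Survived', 'Pclass', 'Survived', 'Pclass'],
    'value': [0, 3, 1, 1],
})

wide = long_df.pivot(index='PassengerId', columns='variable', values='value')

# Move the index into a column, then clear the leftover axis name
flat = wide.reset_index()
flat.columns.name = None
print(flat)
```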
