---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.20(Pandas-11)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

## _Reshape using pivot, melt, and crosstab.ipynb_

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## Learning agenda of this notebook

1. Reshape Data Using `pivot()` and `pivot_table()` methods
2. Reshape Data Using `melt()` method
3. Reshape Data Using `crosstab()` method
4. Reshape Data Using `stack()` and `unstack()`

**Note: How we reshape the data depends on the kind of analysis we want to perform.**

## 1. Reshaping Data Using `df.pivot()`  and `df.pivot_table()` Methods

**```df.pivot(index=None, columns=None, values=None)```**<br>
**```pandas.pivot(data, index=None, columns=None, values=None)```**

Where,
- `index`: Column to use as new dataframe's index. If None, uses existing index.
- `columns`: Column to use to make new dataframe columns.
- `values`:  Column(s) to use for populating new frame's values. 

Read more about `pd.pivot()`: https://pandas.pydata.org/docs/reference/api/pandas.pivot.html

Read more about `df.pivot()`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
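As a quick illustration of the three arguments, here is a minimal sketch on a small in-memory DataFrame. The data values and the names `weather` and `wide` are made up for this example and are not taken from the lecture's CSV files.

```python
import pandas as pd

# Hypothetical toy data: exactly one row per (city, date) pair
weather = pd.DataFrame({
    "date": ["2021-06-20", "2021-06-20", "2021-06-21", "2021-06-21"],
    "city": ["Lahore", "Karachi", "Lahore", "Karachi"],
    "temperature": [40, 35, 41, 34],
    "humidity": [30, 70, 28, 72],
})

# Long -> wide: cities become the row index, dates become the column headers,
# and the cells are filled from the temperature column
wide = weather.pivot(index="city", columns="date", values="temperature")
print(wide)
```

Because every (city, date) pair appears only once, `pivot()` can place each value without any aggregation.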

**`df.pivot_table(index=None, columns=None, values=None, aggfunc='mean', fill_value=None)`**<br>
**`pandas.pivot_table(data, index=None, columns=None, values=None, aggfunc='mean', fill_value=None)`**

Where,
- `index`: Column to use as new dataframe's index. If None, uses existing index.
- `columns`: Column to use to make new dataframe columns.
- `values`:  Column(s) to use for populating new frame's values. 
- `aggfunc`: Aggregation function applied when several rows fall into the same cell; the default is `'mean'`.
- `fill_value`: Value to replace missing values with



Read more about `pd.pivot_table()`: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

Read more about `df.pivot_table()`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot_table.html?highlight=dataframe%20pivot_table#pandas.DataFrame.pivot_table
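The effect of `aggfunc` and `fill_value` can be seen in a minimal sketch like the one below. The DataFrame `readings` and its values are hypothetical and only serve to show one duplicated combination and one missing combination.

```python
import pandas as pd

# Hypothetical toy data: Lahore has two readings on 2021-06-20,
# and Karachi has no reading on 2021-06-21
readings = pd.DataFrame({
    "date": ["2021-06-20", "2021-06-20", "2021-06-20", "2021-06-21"],
    "city": ["Lahore", "Lahore", "Karachi", "Lahore"],
    "temperature": [40, 42, 35, 41],
})

# Duplicate (date, city) pairs are averaged; empty cells are filled with 0
table = readings.pivot_table(
    index="date",
    columns="city",
    values="temperature",
    aggfunc="mean",
    fill_value=0,
)
print(table)
```

The two Lahore readings for the first date are combined into their mean, and the missing Karachi cell shows the fill value instead of `NaN`.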

### DataSet 1:

You can use both `pivot()` as well as `pivot_table()` methods over here

<img align="center" width="900" height="600"  src="images/ds1.png"  >

In [None]:
import pandas as pd
df = pd.read_csv('datasets/pivot_weather1.csv')
df

In [None]:
df.pivot_table(index='city', columns='date', values='humidity')

>**Suppose we want to have one record for each city, containing temperature and humidity for each date**

In [None]:
# using pivot()
df1 = df.pivot(index='city', columns='date')
df1

>**Let us repeat the same using `pivot_table()`**

In [None]:
# using pivot_table()
df1 = df.pivot_table(index='city', columns='date')
df1

- By setting the `index='city'`, the city column is the left most column now having unique values. 
- By setting the `columns='date'`, the values from the date column have become the  column headers now.


**Suppose we want to see only the temperature or humidity column in the output dataframe. This can be achieved by setting the `values` argument to the name of the column**

In [None]:
df1 = df.pivot(index='city', columns='date', values='temperature')
df1

In [None]:
df1 = df.pivot(index='city', columns='date', values='humidity')
df1

**Let us keep the date along index and city at the column, so that the output dataframe should have one record for each date, containing temperature and humidity for each city**

In [None]:
# using pivot()
df1 = df.pivot(index='date', columns='city')
df1

In [None]:
# using pivot_table()
df1 = df.pivot_table(index='date', columns='city')
df1

### DataSet 2:
Here `pivot()` cannot be used because some (date, city) combinations occur more than once. `pivot_table()` still works because it aggregates those values, taking their `mean()` by default.
<img align="center" width="900" height="600"  src="images/ds2.png"  >

In [None]:
import pandas as pd
df = pd.read_csv('datasets/pivot_weather2.csv')
df

>Note that in this dataset the combination of date and city is not unique.

In [None]:
#df1 = df.pivot(index='date', columns='city')
#df1

- When we set the index to `date` and the columns to `city`, `pivot()` must place exactly one value in each (date, city) cell, for example the cell for `20/06/2021` and `Lahore`.
- In this dataset two rows share the date `20/06/2021` and the city `Lahore`, so `pivot()` cannot tell which value belongs in that cell.
- It therefore raises `ValueError: Index contains duplicate entries, cannot reshape`.
- `pivot()` and `pivot_table()` give the same result only when every index/column combination is unique. If duplicates are possible, use `pivot_table()`, which aggregates them, rather than `pivot()`, which raises the duplicate-entries error.
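One way to check for this situation before reshaping is sketched below. The DataFrame `dup` is a hypothetical stand-in with made-up temperatures, not the contents of `pivot_weather2.csv`.

```python
import pandas as pd

# Hypothetical toy data with a repeated (date, city) pair
dup = pd.DataFrame({
    "date": ["20/06/2021", "20/06/2021", "21/06/2021"],
    "city": ["Lahore", "Lahore", "Lahore"],
    "temperature": [40, 42, 41],
})

# Rows whose (date, city) combination occurs more than once
clashes = dup[dup.duplicated(subset=["date", "city"], keep=False)]
print(clashes)

# pivot() refuses to reshape when such rows exist
try:
    dup.pivot(index="date", columns="city", values="temperature")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot() failed:", err)
```

If `clashes` is empty, `pivot()` is safe to use; otherwise an aggregation has to be chosen.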


- Let us try the same reshaping with the `pivot_table()` method.
- `pivot_table()` has an extra argument, `aggfunc` (default `'mean'`), that decides how duplicate values are combined into a single cell.

In [None]:
df1 = df.pivot_table(index='date', columns='city')
df1

>The default value to the `aggfunc` argument is `mean`, and you can explicitly pass any other aggregate function name.

In [None]:
df1 = df.pivot_table(index='date', columns='city', aggfunc='sum')
df1

### DataSet 3:
<img align="center" width="900" height="600"  src="images/ds3.png"  >

In [None]:
import pandas as pd
df = pd.read_csv('datasets/pivot_std1.csv')
df

**Suppose we want to have one record for each gender, containing age, height and weight for each sport**

In [None]:
df1 = df.pivot_table(index='gender', columns='sport')
df1

When we try to repeat the same using `pivot()`, we get a ValueError: Index contains duplicate entries, cannot reshape

In [None]:
#df1 = df.pivot(index='gender', columns='sport')
#df1

- When we set the index to `gender` and the columns to `sport`, `pivot()` needs exactly one row for each (gender, sport) combination.
- In this case there are two rows which have `female` and play `basketball`.
- The `pivot()` function doesn't know which value to put into that cell.
- So it raises `ValueError: Index contains duplicate entries, cannot reshape`.
- The `pivot_table()` method uses its `aggfunc` argument (default `'mean'`) to decide this.

**Use of margins argument to `pivot_table()` method**

In [None]:
df.pivot_table(index='gender', columns='sport', margins=True)
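To see what `margins=True` adds, here is a minimal sketch on a hypothetical `students` DataFrame with a single numeric column. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: one student per (gender, sport) pair
students = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "sport": ["basketball", "tennis", "basketball", "tennis"],
    "height": [170, 160, 180, 175],
})

# margins=True appends an "All" row and an "All" column holding the
# aggregate (here the mean) over each column, each row, and the whole table
with_totals = students.pivot_table(
    index="gender",
    columns="sport",
    values="height",
    aggfunc="mean",
    margins=True,
)
print(with_totals)
```

The label of the extra row and column can be changed with the `margins_name` argument.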

### DataSet 4:
In this dataset, since we have ...., the `pivot()` method will raise an error because it does not know which of the three values to place in the dataframe. However, the `pivot_table()` method will use an aggregation function to compute the value to be placed and will work fine....

<img align="center" width="900" height="600"  src="images/ds4.png"  >

In [None]:
import pandas as pd
df = pd.read_csv('datasets/waterneed.csv')
df

- **The `pivot()` method always needs the `columns` argument, and in practice it is called with `index` as well.**

- **The `pivot_table()` method, in contrast, can work with the `index` argument alone; the values it places are computed with the mean aggregate function.**

In [None]:
df1 = df.pivot_table(index='animal')
df1

You can apply aggregation function on the new dataframe as well, such as compute the average speed

In [None]:
df1['speed'].agg('mean')

In [None]:
# You can also perform aggregation to summarize data
df1[['speed','water_need']].agg('mean')

**Multilevel indexing:** You can perform multi-level indexing by passing a list of columns to the `index` argument of `pivot_table()`.

In [None]:
df.pivot_table(index=['animal','uniq_id'])
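The difference between a single-column index and a list of columns can be sketched as follows. The `zoo` DataFrame reuses the column names seen above (`animal`, `uniq_id`, `water_need`), but its rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data in the spirit of the water-need dataset
zoo = pd.DataFrame({
    "animal": ["lion", "lion", "zebra", "zebra"],
    "uniq_id": [1001, 1002, 1003, 1004],
    "water_need": [500, 600, 200, 220],
})

# Single index column: one row per animal, values are the mean water need
by_animal = zoo.pivot_table(index="animal", values="water_need")

# List of index columns: rows are indexed by a two-level MultiIndex
nested = zoo.pivot_table(index=["animal", "uniq_id"], values="water_need")

# Selecting on the outer level returns all rows for that animal
lion_rows = nested.loc["lion"]
print(by_animal)
print(nested)
print(lion_rows)
```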

## 2. Reshaping Data Using `df.melt()` Method

- Similar to `pivot()` and `pivot_table()`, Pandas `melt()` method is also used to transform or reshape data. 
- The `pd.melt()` method is used to change the DataFrame format from wide to long
- The `pd.melt()` method reshapes a DataFrame so that one or more columns act as identifier variables (`id_vars`), while all other columns are treated as measured variables (`value_vars`) and are unpivoted into rows. Its signature is:
```
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', ignore_index=True)
```
Where,
- `id_vars`: tuple, list, or ndarray, optional  (Column(s) to use as identifier variables)
- `value_vars`: tuple, list, or ndarray, optional (Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars)
- `var_name`: Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- `value_name`: Name to use for the ‘value’ column.
- `ignore_index`: bool, default True (If True, original index is ignored. If False, the original index is retained.)
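A minimal sketch of the wide-to-long transformation, and of turning the result back with `pivot()`, is shown below. The DataFrame `wide` and its values are hypothetical and do not come from `weather.csv`.

```python
import pandas as pd

# Hypothetical wide table: one column per city
wide = pd.DataFrame({
    "day": ["Mon", "Tue"],
    "lahore": [40, 41],
    "karachi": [35, 34],
})

# Wide -> long: 'day' stays as the identifier, the city columns are
# stacked into a 'city' column with their values in 'temperature'
long = pd.melt(wide, id_vars=["day"], var_name="city", value_name="temperature")
print(long)

# Long -> wide again: pivot() is the inverse of melt() for this data
back = long.pivot(index="day", columns="city", values="temperature")
print(back)
```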

### DataSet 1:
<img align="center" width="900" height="600"  src="images/melt2.png"  >

<img align="center" width="900" height="600"  src="images/melt1.png"  >

In [None]:
df = pd.read_csv('datasets/weather.csv')
df

In [None]:
df1 = pd.melt(df, id_vars =['day'])
df1

>You can rename the `variable` and `value` columns to something more meaningful, such as `city` and `temperature`, using the `var_name` and `value_name` arguments of the `melt()` method.

In [None]:
df1 = pd.melt(df, id_vars =['day'], var_name='city', value_name='temperature')
df1

>You can restrict which columns are melted (and therefore which rows appear in the result) using the `value_vars` argument of the `melt()` method.

In [None]:
df2 = pd.melt(df, id_vars = ['day'], value_vars =['lahore'], var_name='city', value_name='temperature')
df2

In [None]:
df3 = pd.melt(df, id_vars =['day'], value_vars=['karachi'],var_name='city', value_name='temperature')
df3

In [None]:
# You can achieve a similar result by using Boolean indexing
df1[df1['city'] == 'karachi']

>You can apply aggregation function on the new dataframe as well, such as compute the average temperatures

In [None]:
df1

In [None]:
# compute the average temperature of entire dataframe
df1['temperature'].agg('mean')

In [None]:
df1[df1['city'] == 'lahore' ]

In [None]:
# compute the average temperature of lahore city only
df1[df1['city'] == 'lahore' ].temperature.agg('mean')

In [None]:
df1[df1['city'] == 'karachi' ]

In [None]:
# compute the average temperature of karachi city only
df1[df1['city'] == 'karachi' ].temperature.agg('mean')

## 3. Reshaping Data Using `pd.crosstab()` Method

- The `pd.crosstab()` method is also used for data restructuring and reshaping.
- It is normally used for quickly comparing categorical variables.
- The cross table is also known as contingency table, which is a matrix type table that displays the (multivariate) frequency distribution of variables.
```
pandas.crosstab(index, 
                columns, 
                values=None,
                aggfunc=None,
                margins=False, 
                normalize=False)
```
Where,
- `index`: array-like, Series, or list of arrays/Series (Values to group by in the rows)
- `columns`: array-like, Series, or list of arrays/Series (Values to group by in the columns)
- `values`: array-like, optional (Array of values to aggregate according to the factors. Requires `aggfunc` to be specified.)
- `aggfunc`: function, optional (If specified, requires `values` to be specified as well.)
- `margins`: bool, default False (Add row/column margins, i.e. subtotals.)
- `normalize`: bool, {'all', 'index', 'columns'}, or {0, 1}, default False (Normalize by dividing all values by the sum of values.)
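The arguments above can be tried on a small in-memory example. This is a minimal sketch: the `people` DataFrame reuses the column names `city`, `gender` and `age`, but its rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: five people in two cities
people = pd.DataFrame({
    "city": ["Lahore", "Lahore", "Karachi", "Karachi", "Karachi"],
    "gender": ["male", "female", "male", "male", "female"],
    "age": [30, 25, 40, 20, 35],
})

# Frequency table: how many rows fall into each (city, gender) cell
counts = pd.crosstab(index=people["city"], columns=people["gender"])
print(counts)

# With values + aggfunc the cells hold an aggregate instead of a count
mean_age = pd.crosstab(
    index=people["city"],
    columns=people["gender"],
    values=people["age"],
    aggfunc="mean",
)
print(mean_age)
```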

In [None]:
# Reading data from 'datasets/sample1.csv' file
import numpy as np
import pandas as pd
df = pd.read_csv('datasets/sample1.csv')
df

>Suppose we want to get the frequency distribution of males and females. You pass `city` column as `index` argument and `gender` column as `columns` argument to the `pd.crosstab()` method. It returns a frequency table containing the male and female count in each city

In [None]:
pd.crosstab(index=df.city, columns=df.gender)

> You can also add row and column totals (the total per city and per gender) by setting the `margins` argument to `True`

In [None]:
pd.crosstab(index=df.city, columns=df.gender, margins=True)

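>Instead of whole-number frequencies, you can get proportions by setting the `normalize` argument. With `normalize=True`, every count is divided by the grand total, so each cell is a fraction of all records; use `normalize='index'` to get the fraction of males and females within each city.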

In [None]:
pd.crosstab(index=df.city, columns=df.gender, normalize=True)
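The difference between the `normalize` options can be sketched on a hypothetical `residents` DataFrame (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: two people in Lahore, three in Karachi
residents = pd.DataFrame({
    "city": ["Lahore", "Lahore", "Karachi", "Karachi", "Karachi"],
    "gender": ["male", "female", "male", "male", "female"],
})

# normalize=True (same as 'all'): each cell is divided by the grand total
shares_all = pd.crosstab(residents["city"], residents["gender"], normalize=True)

# normalize='index': each row is divided by its own total,
# giving the gender split within each city
shares_by_city = pd.crosstab(residents["city"], residents["gender"], normalize="index")

print(shares_all)
print(shares_by_city)
```

Multiplying either result by 100 turns the fractions into percentages.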

>Suppose you want to get the average age of males and females in different cities. To achieve this, set the `values` argument to the `age` column, and pass the appropriate aggregate function to the `aggfunc` argument

In [None]:
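# passing the function name as a string avoids the deprecation warning newer pandas versions emit for np.mean
pd.crosstab(index=df.city, columns=df.gender, values=df.age, aggfunc='mean')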

## Practice Exercise no 01
**A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.**

In [None]:
# load the necessary dataset for this task, we will use `SaleData.xlsx`
import pandas as pd
import numpy as np

In [2]:
df = pd.read_excel('datasets/SaleData.xlsx')
# df.head()

In [108]:
# df.shape

In [109]:
# df.columns

#### Write a Pandas program to create a Pivot table with multiple indexes like Region, SalesMan from a given Excel sheet (SaleData.xlsx).

In [110]:
# df1 = pd.pivot_table(df,index=['Region','SalesMan'])
# df1

#### Write a Pandas program to create a Pivot table and find the total sale amount region wise, manager wise.

In [111]:
# df2 = pd.pivot_table(df, index=['Region','Manager'], values='Sale_amt', aggfunc=np.sum)
# df2

In [None]:
# df2 = pd.pivot_table(df, index=['Region','Manager'], values='Sale_amt', aggfunc=np.sum, margins=True)
# df2

#### Write a Pandas program to create a Pivot table and find the total sale amount region wise, manager wise, sales man wise.

In [None]:
# df3 = pd.pivot_table(df, index=['Region','Manager', 'SalesMan'], values='Sale_amt', aggfunc=np.sum)
# df3

#### Write a Pandas program to create a Pivot table and find the item wise unit sold.

In [None]:
# df.columns

In [None]:
# df4 = pd.pivot_table(df, index=['Item'], values='Units', aggfunc=np.sum)
# df4

#### Write a Pandas program to create a Pivot table and find the region wise total sale.

In [None]:
# pd.pivot_table(df, index='Region',values='Sale_amt', aggfunc=np.sum)

#### Write a Pandas program to create a Pivot table and find the region wise, item wise unit sold.

In [6]:
# pd.pivot_table(df, index=['Region','Item'], values='Units', aggfunc=np.sum)

#### Write a Pandas program to create a Pivot table and count the manager wise sale and mean value of sale amount. 

In [113]:
# pd.pivot_table(df, index=['Manager'],values=['SalesMan','Sale_amt'],aggfunc={'SalesMan':len,
#                                                                             'Sale_amt':np.mean})

#### Write a Pandas program to create a Pivot table and find manager wise, salesman wise total sale and also display the sum of all sale amount at the bottom. 

In [114]:
# pd.pivot_table(df, index=['Manager','SalesMan'], values='Sale_amt', aggfunc=np.sum, margins=True)

#### Write a Pandas program to create a Pivot table and find the total sale amount region wise, manager wise, sales man wise where Manager = "Douglas".

In [115]:
# result = pd.pivot_table(df, index=['Region','Manager','SalesMan'], values='Sale_amt', aggfunc=np.sum)
# result.query('Manager==["Douglas"]')

In [116]:
# a = result.reset_index()
# a[a.Manager =='Douglas'].set_index(keys=['Region','Manager','SalesMan'])

In [117]:
# df.head()

#### Write a Pandas program to create a Pivot table and find the region wise Television and Home Theater sold.

In [118]:
# result = pd.pivot_table(df,index=['Region', 'Item'],values= ['Units'], aggfunc=np.sum)
# result.query('Item == ["Television","Home Theater"]')

In [119]:
# a = result.reset_index()
# a.loc[(a.Item=='Television') | (a.Item == 'Home Theater')].set_index(['Region','Item'])

In [120]:
# df.columns

#### Write a Pandas program to create a Pivot table and find the maximum sale value of the items.

In [121]:
# pd.pivot_table(df, index=['Item'], values= ['Sale_amt'], aggfunc=np.max )

#### Write a Pandas program to create a Pivot table and find the minimum sale value of the items.

In [122]:
# pd.pivot_table(df, index=['Item'], values= ['Sale_amt'], aggfunc=np.min)

In [123]:
# df.groupby('Item').agg({'Unit_price':['min','max','sum'],
#                        'Sale_amt':['min','max','sum']})

## Practice Exercise no 02

In [None]:
# For this exercise, we will use the `titanic3.csv` dataset

In [124]:
df = pd.read_csv('datasets/titanic3.csv')
df.head()

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


### Write a Pandas program to create a Pivot table with multiple indexes like sex and pclass, and find minimum and maximum age and fare from the data set.

In [131]:
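# function names as strings avoid the warning newer pandas versions emit for the builtins min and max
pd.pivot_table(df, index=['sex','pclass'], values=['age', 'fare'], aggfunc=['min', 'max'])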

| sex | pclass | min age | min fare | max age | max fare |
|---|---|---|---|---|---|
| female | 1.0 | 2.0 | 25.7 | 76.0 | 512.3292 |
| female | 2.0 | 0.9167 | 10.5 | 60.0 | 65.0 |
| female | 3.0 | 0.1667 | 6.75 | 63.0 | 69.55 |
| male | 1.0 | 0.9167 | 0.0 | 80.0 | 512.3292 |
| male | 2.0 | 0.6667 | 0.0 | 70.0 | 73.5 |
| male | 3.0 | 0.3333 | 0.0 | 74.0 | 69.55 |



## Check Your Concepts:
- What is Pandas?

# Pandas - Assignment no 11
- Here is link of [Pandas - Assignment no 11]()