In [1]:
import pandas as pd
import numpy as np

# Create date range (12 days)
dates = pd.date_range(start="2024-01-01", periods=12, freq="D")

# Tickers
tickers = ["GOOG", "MSFT"]

# Create cartesian product of dates and tickers
stocks = pd.DataFrame(
    [(date, ticker) for date in dates for ticker in tickers],
    columns=["date", "ticker"]
)

# Add price and volume data (example realistic values)
np.random.seed(42)

stocks["open"] = np.where(
    stocks["ticker"] == "GOOG",
    np.random.uniform(135, 145, len(stocks)),
    np.random.uniform(360, 380, len(stocks))
).round(2)

stocks["close"] = (stocks["open"] + np.random.uniform(-2, 2, len(stocks))).round(2)

stocks["volume"] = np.where(
    stocks["ticker"] == "GOOG",
    np.random.randint(1_800_000, 2_200_000, len(stocks)),
    np.random.randint(2_000_000, 2_400_000, len(stocks))
)

# Final result
stocks


,date,ticker,open,close,volume
0,2024-01-01,GOOG,138.75,138.94,1924375
1,2024-01-01,MSFT,375.7,374.44,2273109
2,2024-01-02,GOOG,142.32,144.2,1936330
3,2024-01-02,MSFT,370.28,371.38,2370210
4,2024-01-03,GOOG,136.56,138.32,1964231
5,2024-01-03,MSFT,360.93,362.51,2375396
6,2024-01-04,GOOG,135.58,135.97,1861858
7,2024-01-04,MSFT,363.41,365.1,2158338
8,2024-01-05,GOOG,141.01,139.36,1812666
9,2024-01-05,MSFT,378.98,377.76,2214020


1. Reshape the data so:

- Rows = date

- Columns = ticker

- Values = close

In [2]:
close_wide = stocks.pivot(index='date', columns='ticker', values='close')

2. What happens if there are multiple rows per date + ticker and you try pivot?
Why does Pandas raise an error?

If the data has more than one row for the same date and ticker, pivot cannot tell which value belongs in the resulting cell. pivot only reshapes and never aggregates, so it raises a ValueError ("Index contains duplicate entries, cannot reshape"). Use pivot_table instead, which combines the duplicates with an aggregation function (mean by default).
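A minimal sketch of that failure, using a small made-up frame (the `dupes` DataFrame and its values are hypothetical and not part of the stocks data above). It shows pivot raising on a duplicated date + ticker pair, while pivot_table averages the two rows:

```python
import pandas as pd

# Hypothetical data: the first two rows share the same (date, ticker) pair.
dupes = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-01"]),
    "ticker": ["GOOG", "GOOG", "MSFT"],
    "close": [138.0, 140.0, 374.0],
})

# pivot has no rule for choosing between 138.0 and 140.0, so it raises.
try:
    dupes.pivot(index="date", columns="ticker", values="close")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)
print("pivot failed with:", pivot_error)

# pivot_table resolves the clash by aggregating the duplicates.
averaged = dupes.pivot_table(
    index="date", columns="ticker", values="close", aggfunc="mean"
)
print(averaged)
```

Here the GOOG cell for 2024-01-01 becomes the mean of the two duplicated closes.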

3. Create a table showing:

- Rows = date

- Columns = ticker

- Values = average close price

In [3]:
stocks.pivot_table(index='date', columns='ticker', values='close', aggfunc='mean')

ticker,GOOG,MSFT
date,,
2024-01-01,138.94,374.44
2024-01-02,144.2,371.38
2024-01-03,138.32,362.51
2024-01-04,135.97,365.1
2024-01-05,139.36,377.76
2024-01-06,133.39,375.47
2024-01-07,142.87,361.04
2024-01-08,138.13,368.23
2024-01-09,137.16,370.07
2024-01-10,137.88,379.4


4. Modify the previous result to include:

- Values = close and volume

- Aggregation = mean

In [4]:
stocks.pivot_table(index='date', columns='ticker', values=['close', 'volume'], aggfunc='mean')

,close,close,volume,volume
ticker,GOOG,MSFT,GOOG,MSFT
date,,,,
2024-01-01,138.94,374.44,1924375,2273109
2024-01-02,144.2,371.38,1936330,2370210
2024-01-03,138.32,362.51,1964231,2375396
2024-01-04,135.97,365.1,1861858,2158338
2024-01-05,139.36,377.76,1812666,2214020
2024-01-06,133.39,375.47,1934633,2373133
2024-01-07,142.87,361.04,2126649,2121626
2024-01-08,138.13,368.23,1983323,2393422
2024-01-09,137.16,370.07,2070536,2218969
2024-01-10,137.88,379.4,2175713,2023419


5. Why would you prefer pivot_table over pivot in production code?

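pivot_table handles duplicate index/column combinations by aggregating them, whereas pivot raises a ValueError as soon as one appears. In production, where incoming data may occasionally contain duplicate rows, pivot_table keeps the pipeline running and makes the aggregation rule explicit through aggfunc. The trade-off is that duplicates are combined silently, so it is worth checking for them if they are not expected.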
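One possible way to keep that aggregation deliberate is to count duplicated keys before pivoting. This is only a sketch: the `rows` frame is hypothetical, and what to do with the count (log it, raise, ignore it) depends on the pipeline.

```python
import pandas as pd

# Hypothetical input with one duplicated (date, ticker) pair.
rows = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
    "ticker": ["GOOG", "GOOG", "GOOG"],
    "close": [138.0, 140.0, 144.0],
})

# keep=False marks every row that belongs to a duplicated key.
dup_mask = rows.duplicated(subset=["date", "ticker"], keep=False)
n_dupes = int(dup_mask.sum())
print("rows with a duplicated key:", n_dupes)

# Aggregate anyway, with the rule stated explicitly.
wide = rows.pivot_table(
    index="date", columns="ticker", values="close", aggfunc="mean"
)
print(wide)
```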

In [5]:
close_wide

ticker,GOOG,MSFT
date,,
2024-01-01,138.94,374.44
2024-01-02,144.2,371.38
2024-01-03,138.32,362.51
2024-01-04,135.97,365.1
2024-01-05,139.36,377.76
2024-01-06,133.39,375.47
2024-01-07,142.87,361.04
2024-01-08,138.13,368.23
2024-01-09,137.16,370.07
2024-01-10,137.88,379.4


6. Convert close_wide table into long format with columns:

- date

- ticker

- close

In [8]:
# melt needs 'date' as a regular column, so reset the index first.
# Chaining avoids overwriting close_wide, which stays in its pivoted (wide) form.
close_wide.reset_index().melt(id_vars=['date'], var_name='ticker', value_name='close')

,date,ticker,close
0,2024-01-01,GOOG,138.94
1,2024-01-02,GOOG,144.2
2,2024-01-03,GOOG,138.32
3,2024-01-04,GOOG,135.97
4,2024-01-05,GOOG,139.36
5,2024-01-06,GOOG,133.39
6,2024-01-07,GOOG,142.87
7,2024-01-08,GOOG,138.13
8,2024-01-09,GOOG,137.16
9,2024-01-10,GOOG,137.88


7. Why is long format preferred for:

- BI tools

- Machine learning

- Groupby operations


In long format, each row represents a single observation and each column represents one variable. BI tools can therefore filter, slice, and aggregate on any column (for example, ticker) without reshaping. Machine learning models expect a fixed set of features (columns); in a long table, new observations or new tickers are added as rows, so the number of columns does not change. For groupby operations, all values of a variable sit in a single column, so one groupby on the key column aggregates them.
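As a small illustration of the groupby point, the sketch below uses a hypothetical `long_df` with made-up prices. Because every close price lives in the same column, one groupby on ticker covers all tickers, including any added later as extra rows.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, ticker) observation.
long_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]
    ),
    "ticker": ["GOOG", "GOOG", "MSFT", "MSFT"],
    "close": [138.0, 142.0, 374.0, 370.0],
})

# One groupby handles every ticker; no per-column logic is needed.
mean_close = long_df.groupby("ticker")["close"].mean()
print(mean_close)
```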

Starting from the pivoted close_wide DataFrame, use stack() to move the column values into rows.
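A possible approach, sketched on a two-day stand-in for close_wide (the `close_wide_demo` name is hypothetical, and the sketch assumes date is still the index, as it is right after the pivot):

```python
import pandas as pd

# Stand-in for the pivoted table: date index, one column per ticker.
close_wide_demo = pd.DataFrame(
    {"GOOG": [138.94, 144.20], "MSFT": [374.44, 371.38]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02"]),
)
close_wide_demo.index.name = "date"
close_wide_demo.columns.name = "ticker"

# stack() moves the ticker column labels into an inner index level,
# giving a Series indexed by (date, ticker).
stacked = close_wide_demo.stack()

# Name the values and turn both index levels back into columns.
close_long_demo = stacked.rename("close").reset_index()
print(close_long_demo)
```

Unlike melt, stack() works on the index directly, so no reset_index() is needed beforehand, and the rows come out grouped by date rather than by ticker.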