
# Chapter 10: Data Aggregation and Group Operations

Python skills learnt in this notebook:
* *average* and *weighted average* using `np.average`
* the `apply()` function

In [24]:
# import the necessary packages
import numpy as np
import pandas as pd

## Group Weighted Average and Correlation (p. 344)

**Example 1: Simulated data**

In [25]:
# Prepare the sample data
df = pd.DataFrame({"category": ["a", "a", "a", "a",
                                "b", "b", "b", "b"],
                   "data": np.random.standard_normal(8),
                   "weights": np.random.uniform(size=8)})
df

| | category | data | weights |
|---|---|---|---|
| 0 | a | 0.488534 | 0.106716 |
| 1 | a | -1.529343 | 0.415267 |
| 2 | a | 0.66828 | 0.368923 |
| 3 | a | -0.125261 | 0.027237 |
| 4 | b | 0.379524 | 0.781314 |
| 5 | b | -2.721607 | 0.039885 |
| 6 | b | 1.305529 | 0.180044 |
| 7 | b | 1.682803 | 0.845112 |


In [26]:
# Weighted average by category
grouped = df.groupby("category")

def get_wavg(group):
    # Weighted average of "data" within one group, using "weights" as the weights
    return np.average(group["data"], weights=group["weights"])

grouped.apply(get_wavg)

category
a   -0.370117
b    0.999367
dtype: float64
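
The group-wise result can be cross-checked without `apply()`. The following is a minimal sketch, not part of the original notebook: it builds a small frame of the same shape (the names `demo`, `via_apply`, and `manual` and the seeded generator are illustrative assumptions) and compares `apply()` with a direct sum-of-products divided by sum-of-weights per category.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same layout as df; seeded so the run is repeatable
rng = np.random.default_rng(0)
demo = pd.DataFrame({"category": ["a"] * 4 + ["b"] * 4,
                     "data": rng.standard_normal(8),
                     "weights": rng.uniform(size=8)})

# Route 1: apply np.average to each group (only the two numeric columns are passed)
via_apply = demo.groupby("category")[["data", "weights"]].apply(
    lambda g: np.average(g["data"], weights=g["weights"]))

# Route 2: sum(data * weights) / sum(weights), computed per category
weighted = demo["data"] * demo["weights"]
manual = (weighted.groupby(demo["category"]).sum()
          / demo["weights"].groupby(demo["category"]).sum())

print(np.allclose(via_apply, manual))
```

The second route avoids calling a Python function once per group, which may matter when there are many groups.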

**`np.average`** function

The following example demonstrates how to use `np.average` to calculate a simple average and a weighted average.

In [27]:
# Prepare the sample data
df2 = pd.DataFrame({"Obs": ["a", "b", "c", "d"],
                   "data": [100, 200, 300, 400],
                   "weights": [0.1, 0.3, 0.4, 0.2]})
df2

| | Obs | data | weights |
|---|---|---|---|
| 0 | a | 100 | 0.1 |
| 1 | b | 200 | 0.3 |
| 2 | c | 300 | 0.4 |
| 3 | d | 400 | 0.2 |


In [28]:
# Calculate the average
avg = np.average(df2.data)
print(f"The average is {avg}.")

# Calculate the weighted average
wavg = np.average(df2.data, weights=df2.weights)
print(f"The weighted average is {wavg}.")

The average is 250.0.
The weighted average is 270.0.
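
To see where the weighted figure comes from, the sketch below recomputes it by hand as the sum of `data * weights` divided by the sum of the weights (0.1·100 + 0.3·200 + 0.4·300 + 0.2·400, divided by 1.0). The variable names `manual_wavg` and `scaled` are illustrative and not part of the original notebook.

```python
import numpy as np

data = np.array([100, 200, 300, 400])
weights = np.array([0.1, 0.3, 0.4, 0.2])

# Weighted average by definition: sum(x * w) / sum(w)
manual_wavg = (data * weights).sum() / weights.sum()
print(manual_wavg, np.average(data, weights=weights))

# np.average normalises by the sum of the weights, so they need not sum to 1:
# multiplying every weight by 10 leaves the result unchanged
scaled = np.average(data, weights=weights * 10)
print(np.isclose(scaled, manual_wavg))
```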


**Example 2: Stock prices and S&P 500 index**

In [32]:
# Load "stock_px.csv" from my GitHub site
stock_px = "https://raw.githubusercontent.com/stevenkhwun/P4DS/main/Data/stock_px.csv"
close_px = pd.read_csv(stock_px, parse_dates=True, index_col=0)
close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-02-01 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB


In [33]:
close_px.tail(4)

| Date | AAPL | MSFT | XOM | SPX |
|---|---|---|---|---|
| 2011-11-10 | 400.29 | 27.0 | 76.27 | 1195.54 |
| 2011-12-10 | 402.19 | 26.96 | 77.16 | 1207.25 |
| 2011-10-13 | 408.43 | 27.18 | 76.37 | 1203.66 |
| 2011-10-14 | 422.0 | 27.27 | 78.11 | 1224.58 |
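
The last four index values above are not in chronological order (2011-11-10 and 2011-12-10 come before 2011-10-13). One possible cause is that the dates are stored as day/month/year and were partly read as month/day/year. The sketch below works under that assumption, which has not been verified against the actual file: it uses a small in-memory sample (`sample_csv` is hypothetical) and parses the index with an explicit format.

```python
import io
import pandas as pd

# Hypothetical sample assuming day/month/year dates; the real file may differ
sample_csv = io.StringIO(
    "Date,SPX\n"
    "11/10/2011,1195.54\n"
    "12/10/2011,1207.25\n"
    "13/10/2011,1203.66\n"
    "14/10/2011,1224.58\n"
)

# Read the dates as plain strings first, then parse them with an explicit format
sample = pd.read_csv(sample_csv, index_col=0)
sample.index = pd.to_datetime(sample.index, format="%d/%m/%Y")

# A correctly parsed daily price series should have an increasing index
print(sample.index.is_monotonic_increasing)
```

Checking `is_monotonic_increasing` on the real `close_px.index` is a quick way to tell whether the dates were parsed as intended.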
