# 07. Data Grouping and Filtering

Two other very common operations on data are **grouping** and **filtering**.
The former splits the values into multiple subgroups so that each group can be operated on separately; the latter narrows the data down to specific categories or to values that satisfy a condition.
Most of the time, these two operations are performed together.

Pandas supports both grouping and filtering through the `groupby` and `filter` functions.
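
As a minimal sketch of how the two calls fit together, the toy frame below is grouped by one column and then filtered on a per-group condition. The `toy` frame, its column names and its values are invented for illustration and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b", "c"],
        "score": [1, 2, 10, 20, 5],
    }
)

# Grouping: split the rows by "team" and sum the scores of each group.
totals = toy.groupby("team")["score"].sum()

# Filtering: keep only the rows of groups whose total score exceeds 4.
kept = toy.groupby("team").filter(lambda g: g["score"].sum() > 4)

print(totals)
print(kept)
```

Note that `filter` drops or keeps whole groups: team `a` disappears entirely because its total is 3.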

We will see examples in this notebook.

**Note** 

In this notebook, we will use `numpy.random` functions to pick values at random and to generate random numbers, so that we can work on fake data.

In [1]:
import numpy as np
import pandas as pd

In [2]:
gender = ["Male", "Female"]
income = ["Poor", "Middle Class", "Rich"]

In [3]:
n = 500

gender_data = []
income_data = []

for _ in range(n):
    gender_data.append(np.random.choice(gender))
    income_data.append(np.random.choice(income))

In [11]:
gender_data[:10]

['Female',
 'Male',
 'Male',
 'Female',
 'Female',
 'Male',
 'Male',
 'Male',
 'Male',
 'Female']

In [12]:
income_data[:10]

['Middle Class',
 'Poor',
 'Middle Class',
 'Middle Class',
 'Poor',
 'Poor',
 'Middle Class',
 'Rich',
 'Poor',
 'Rich']

$Z \sim N(0, 1)$
<br>
$m + s \cdot Z \sim N(m, s)$, where $m$ is the mean and $s$ the standard deviation
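
The relation above can be checked numerically. The following is only a sketch: it uses a seeded `numpy.random.default_rng` generator instead of the `np.random.randn` call used in this notebook, and the seed and sample size are arbitrary choices.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

m, s = 160, 30                     # target mean and standard deviation
z = rng.standard_normal(100_000)   # Z ~ N(0, 1)
x = m + s * z                      # shift and scale: X ~ N(m, s)

# The sample mean and standard deviation should be close to m and s.
print(round(x.mean(), 2), round(x.std(), 2))
```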

In [6]:
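# Z ~ N(0, 1)  =>  m + s * Z ~ N(m, s)
height = 160 + 30 * np.random.randn(n)
weight = 65 + 25 * np.random.randn(n)
age = 30 + 12 * np.random.randn(n)
# np.random.rand draws from a uniform distribution on [0, 1),
# so incomes are uniform on [18000, 21500), not normally distributed.
# Note: this rebinds `income`, which previously held the list of economic statuses.
income = 18000 + 3500 * np.random.rand(n)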

In [7]:
data = pd.DataFrame(
    {
        "Gender" : gender_data,
        "Economic Status" : income_data,
        "Height" : height,
        "Weight" : weight,
        "Age" : age,
        "Income" : income
    }
)

In [9]:
data.head(30)

   Gender  Economic Status      Height     Weight        Age        Income
0  Female     Middle Class  108.919310  54.053707  27.542020  20977.295007
1    Male             Poor  180.440425  69.915184  34.208488  19744.997546
2    Male     Middle Class  136.215786  41.815437  33.663745  19142.291562
3  Female     Middle Class  142.516745  49.996199  35.041471  18743.683989
4  Female             Poor  175.520617  43.014027  26.323569  20884.448053
5    Male             Poor  147.943771  70.137523  38.354636  21425.638293
6    Male     Middle Class  154.451622  53.411307  37.749500  20217.130755
7    Male             Rich  221.004251  53.192329  30.268086  18808.019685
8    Male             Poor  168.643705  75.435287  33.187742  19036.924644
9  Female             Rich  183.068506  58.904473  45.840153  19719.025395


## Data Grouping


In [13]:
grouped_gender = data.groupby("Gender")

In [16]:
grouped_gender.groups

{'Female': [0, 3, 4, 9, 11, 12, 13, 14, 15, 18, 21, 25, 26, 27, 32, 33, 34, 36, 37, 41, 44, 47, 48, 50, 51, 52, 55, 57, 58, 59, 61, 62, 64, 65, 66, 68, 69, 72, 73, 75, 78, 80, 82, 83, 84, 86, 87, 89, 90, 92, 93, 94, 96, 98, 99, 102, 105, 106, 107, 108, 109, 111, 116, 117, 119, 121, 122, 124, 126, 127, 129, 130, 131, 133, 134, 135, 136, 137, 140, 141, 145, 146, 147, 150, 151, 153, 154, 157, 162, 164, 167, 168, 170, 171, 172, 177, 178, 179, 182, 185, ...], 'Male': [1, 2, 5, 6, 7, 8, 10, 16, 17, 19, 20, 22, 23, 24, 28, 29, 30, 31, 35, 38, 39, 40, 42, 43, 45, 46, 49, 53, 54, 56, 60, 63, 67, 70, 71, 74, 76, 77, 79, 81, 85, 88, 91, 95, 97, 100, 101, 103, 104, 110, 112, 113, 114, 115, 118, 120, 123, 125, 128, 132, 138, 139, 142, 143, 144, 148, 149, 152, 155, 156, 158, 159, 160, 161, 163, 165, 166, 169, 173, 174, 175, 176, 180, 181, 183, 184, 189, 191, 194, 195, 196, 199, 201, 202, 206, 207, 211, 213, 214, 215, ...]}

In [17]:
for names, groups in grouped_gender:
    print(names)
    print(groups)

Female
     Gender Economic Status      Height      Weight        Age        Income
0    Female    Middle Class  108.919310   54.053707  27.542020  20977.295007
3    Female    Middle Class  142.516745   49.996199  35.041471  18743.683989
4    Female            Poor  175.520617   43.014027  26.323569  20884.448053
9    Female            Rich  183.068506   58.904473  45.840153  19719.025395
11   Female            Rich  171.571673   68.311118  41.476114  20538.716060
..      ...             ...         ...         ...        ...           ...
490  Female            Poor  126.309601   72.281999  43.244447  19042.881567
492  Female            Rich  171.943335   68.929817  25.569880  18428.515754
493  Female            Rich  142.943854   46.138044  15.355971  19952.083041
495  Female    Middle Class  214.295610   46.354274  45.142851  20893.703853
499  Female    Middle Class  157.474534  100.245393  16.270387  20483.511671

[250 rows x 6 columns]
Male
    Gender Economic Status      Height  

In [None]:
grouped_gender.get_group("Female")

In [None]:
double_group = data.groupby(["Gender", "Economic Status"])

In [None]:
len(double_group)

In [None]:
for names, groups in double_group:
    print(names)
    print(groups)
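
When grouping by more than one column, each group name is a tuple with one entry per grouping column, and only combinations that actually occur in the data produce a group. A small sketch on a hypothetical `toy` frame (invented values, not this notebook's data):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female"],
        "Economic Status": ["Poor", "Rich", "Poor", "Poor"],
        "Age": [30, 41, 25, 36],
    }
)

toy_groups = toy.groupby(["Gender", "Economic Status"])

# Each key is a (Gender, Economic Status) tuple.
keys = [name for name, _ in toy_groups]
print(keys)

# get_group takes the same kind of tuple to retrieve a single subgroup.
male_poor = toy_groups.get_group(("Male", "Poor"))
print(male_poor)
```

Here ("Male", "Rich") never occurs, so the grouping has three groups rather than four.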

## Operations on Groups

In [None]:
double_group.sum()

In [None]:
double_group.mean()

In [None]:
double_group.size()

In [None]:
double_group.describe()

In [None]:
grouped_income = double_group["Income"]

In [None]:
grouped_income.describe()

In [None]:
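# Function names are passed as strings; this avoids the deprecation warning
# that recent pandas versions emit for NumPy callables such as np.sum.
double_group.aggregate(
    {
        "Income": "sum",
        "Age": "mean",
        "Height": "std"
    }
)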

In [None]:
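# A custom function can be mixed with built-in names:
# here Height is summarised by the ratio of its mean to its standard deviation.
double_group.aggregate(
    {
        "Age": "mean",
        "Height": lambda h: np.mean(h) / np.std(h)
    }
)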

In [None]:
double_group.aggregate(["sum", "mean", "std"])

In [None]:
double_group.aggregate([lambda x: np.mean(x) / np.std(x)])
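
To make the shape of the aggregated result easier to read, here is a sketch of the same two call styles on a small hypothetical frame (the `toy` values are invented). A dictionary applies one function per column, while a list applies every function to every non-key column.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female"],
        "Income": [100.0, 300.0, 200.0, 400.0],
        "Age": [20.0, 40.0, 30.0, 50.0],
    }
)

# Dictionary form: one aggregation per column.
summary = toy.groupby("Gender").aggregate({"Income": "sum", "Age": "mean"})
print(summary)

# List form: the result has one sub-column per (column, function) pair.
wide = toy.groupby("Gender").aggregate(["sum", "mean"])
print(wide.columns.tolist())
```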

## Data Filtering

In [None]:
double_group["Age"].filter(lambda x: x.sum()>2400)
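
The function given to `filter` is evaluated once per group and must return a single boolean; the rows of every group for which it returns `True` are kept, with their original index. A minimal sketch on invented data (the threshold of 50 is arbitrary):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female"],
        "Age": [30, 20, 40, 25],
    }
)

# Male ages sum to 70, Female ages sum to 45: only the Male group passes.
kept_ages = toy.groupby("Gender")["Age"].filter(lambda x: x.sum() > 50)
print(kept_ages)
```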

## Transforming variables

In [None]:
zscore = lambda x : (x - x.mean())/x.std()

In [None]:
z_group = double_group.transform(zscore)
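
Unlike an aggregation, `transform` returns one value per original row, computed from the statistics of the group that row belongs to. The sketch below standardises a hypothetical `Age` column within each gender; the `toy` values are invented so that the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
        "Age": [20.0, 25.0, 30.0, 40.0, 50.0, 60.0],
    }
)

zscore = lambda x: (x - x.mean()) / x.std()

# Male: mean 25, std 5. Female: mean 50, std 10 (sample standard deviation).
z_age = toy.groupby("Gender")["Age"].transform(zscore)
print(z_age.tolist())
```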

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.hist(z_group["Age"])

In [None]:
fill_na_mean = lambda x : x.fillna(x.mean())

In [None]:
double_group.transform(fill_na_mean)
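
The generated data in this notebook contains no missing values, so the effect of `fill_na_mean` is easier to see on a frame that does. A sketch with invented values, where each `NaN` is replaced by the mean of its own group:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing ages, invented for illustration only.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
        "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 60.0],
    }
)

fill_na_mean = lambda x: x.fillna(x.mean())

# Male mean is 25, Female mean is 50; NaN entries take their group's mean.
filled = toy.groupby("Gender")["Age"].transform(fill_na_mean)
print(filled.tolist())
```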

## Other operations on data and groups

In [None]:
double_group.head(1)

In [None]:
double_group.tail(1)

In [None]:
double_group.nth(32)

In [None]:
double_group.nth(82)

In [None]:
data_sorted = data.sort_values(["Age", "Income"])

In [None]:
data_sorted.head(10)

In [None]:
age_grouped = data_sorted.groupby("Gender")

In [None]:
age_grouped.head(1)

In [None]:
age_grouped.tail(1)
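
Because `groupby` preserves the order of the rows inside each group, sorting first and then taking `head(1)` or `tail(1)` gives the rows with the smallest and largest sort key per group. A sketch on invented data:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female"],
        "Age": [35, 28, 22, 41],
    }
)

toy_sorted = toy.sort_values("Age")
by_gender = toy_sorted.groupby("Gender")

youngest = by_gender.head(1)  # first row of each group in sorted order
oldest = by_gender.tail(1)    # last row of each group in sorted order
print(youngest)
print(oldest)
```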