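### Group operations
- Complex data can be split into groups by some criterion and examined group by group; processing data this way is called a group operation.
- Group operations are an efficient way to aggregate, transform, and filter data, and pandas provides them through the groupby() method.
- Creating a group object (split)
- Group operation methods (apply, combine)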
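
The split, apply, and combine steps listed above can be sketched on a tiny table before moving to the titanic data. The DataFrame `toy` and its columns `team` and `score` are hypothetical names made up for this illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the titanic dataset
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [10, 20, 30, 40, 50],
})

# Split: one sub-DataFrame per distinct value of "team"
toy_grouped = toy.groupby("team")
group_sizes = {key: len(group) for key, group in toy_grouped}

# Apply + combine: one mean per group, collected into a Series indexed by team
team_means = toy_grouped["score"].mean()

print(group_sizes)
print(team_means)
```

Iterating over the group object yields (key, sub-DataFrame) pairs, which is the same pattern the cells below use on the titanic data.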

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]

print('Number of passengers:', len(df))
print(df.head())

Number of passengers: 891
    age     sex  class     fare  survived
0  22.0    male  Third   7.2500         0
1  38.0  female  First  71.2833         1
2  26.0  female  Third   7.9250         1
3  35.0  female  First  53.1000         1
4  35.0    male  Third   8.0500         0


In [4]:
df.value_counts(['class'])

class 
Third     491
First     216
Second    184
dtype: int64

In [5]:
# A scalar key (rather than a one-element list) keeps the group keys plain strings
grouped = df.groupby('class')
# list(grouped)
for key, group in grouped:
    print('* key :', key)
    print('* number:', len(group))
    print(group.head(), '\n')

* key : First
* number: 216
     age     sex  class     fare  survived
1   38.0  female  First  71.2833         1
3   35.0  female  First  53.1000         1
6   54.0    male  First  51.8625         0
11  58.0  female  First  26.5500         1
23  28.0    male  First  35.5000         1 

* key : Second
* number: 184
     age     sex   class     fare  survived
9   14.0  female  Second  30.0708         1
15  55.0  female  Second  16.0000         1
17   NaN    male  Second  13.0000         1
20  35.0    male  Second  26.0000         0
21  34.0    male  Second  13.0000         1 

* key : Third
* number: 491
    age     sex  class     fare  survived
0  22.0    male  Third   7.2500         0
2  26.0  female  Third   7.9250         1
4  35.0    male  Third   8.0500         0
5   NaN    male  Third   8.4583         0
7   2.0    male  Third  21.0750         0 



In [6]:
# Mean of each group (numeric columns only; 'sex' is a string column)

average = grouped.mean(numeric_only=True)
average

class,age,fare,survived
First,38.233441,84.154687,0.62963
Second,29.87763,20.662183,0.472826
Third,25.14062,13.67555,0.242363


In [7]:
# Maximum of each group

grouped.max()

class,age,sex,fare,survived
First,80.0,male,512.3292,1
Second,70.0,male,73.5,1
Third,74.0,male,69.55,1


In [8]:
# Select only the 'Third' group, store it as group3, and print its summary statistics with describe().
# https://kongdols-room.tistory.com/172

group3 = grouped.get_group('Third')
group3.describe()

,age,fare,survived
count,355.0,491.0,491.0
mean,25.14062,13.67555,0.242363
std,12.495398,11.778142,0.428949
min,0.42,0.0,0.0
25%,18.0,7.75,0.0
50%,24.0,8.05,0.0
75%,32.0,15.5,0.0
max,74.0,69.55,1.0


In [14]:
# Split by the class and sex columns and store the result in grouped_two

grouped_two = df.groupby(['class','sex'])
for key, group in grouped_two:
    print('* key :', key)
    print('* number :', len(group))
    print(group.head())  # call head(); printing the bare attribute shows the bound method instead

* key : ('First', 'female')
* number : 94
     age     sex  class      fare  survived
1   38.0  female  First   71.2833         1
3   35.0  female  First   53.1000         1
11  58.0  female  First   26.5500         1
31   NaN  female  First  146.5208         1
52  49.0  female  First   76.7292         1
* key : ('First', 'male')
* number : 122
     age   sex  class      fare  survived
6   54.0  male  First   51.8625         0
23  28.0  male  First   35.5000         1
27  19.0  male  First  263.0000         0
30  40.0  male  First   27.7208         0
34  28.0  male  First   82.1708         0
...

In [10]:
average_two = grouped_two.mean()
average_two.describe()

,age,fare,survived
count,6.0,6.0,6.0
mean,30.602403,40.640712,0.508474
std,6.764608,37.853902,0.36426
min,21.75,12.661633,0.135447
25%,27.061435,17.024553,0.210269
50%,29.73184,20.855952,0.434426
75%,33.644,55.912126,0.815789
max,41.281386,106.125798,0.968085


In [11]:
grouped_two.describe(percentiles = np.arange(1, 10) * 0.1).transpose()

# np.arange(1, 11) * 0.1
# [i * 0.1 for i in range(1, 11)]

,class,First,First,Second,Second,Third,Third
,sex,female,male,female,male,female,male
age,count,85.0,101.0,74.0,99.0,102.0,253.0
age,mean,34.611765,41.281386,28.722973,30.740707,21.75,26.507589
age,std,13.612052,15.13957,12.872702,14.793894,12.729964,12.159514
age,min,2.0,0.92,2.0,0.67,0.75,0.42
age,10%,18.0,24.0,9.5,16.0,4.0,14.0
age,20%,22.0,28.0,19.0,21.0,9.2,18.4
age,30%,24.0,33.0,24.0,24.0,16.0,20.3
age,40%,30.0,36.0,26.2,27.2,18.0,22.0
age,50%,35.0,40.0,28.0,30.0,21.5,25.0
age,60%,38.0,46.0,30.8,32.4,24.0,28.0


In [19]:
# Select the ('Third', 'female') group, store it as group3f, and print its first 5 rows

group3f = grouped_two.get_group(('Third', 'female'))
display(group3f.head())
group3f[['age', 'fare']].describe()

,age,sex,class,fare,survived
2,26.0,female,Third,7.925,1
8,27.0,female,Third,11.1333,1
10,4.0,female,Third,16.7,1
14,14.0,female,Third,7.8542,0
18,31.0,female,Third,18.0,0


,age,fare
count,102.0,144.0
mean,21.75,16.11881
std,12.729964,11.690314
min,0.75,6.75
25%,14.125,7.8542
50%,21.5,12.475
75%,29.75,20.221875
max,63.0,69.55


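### Apply-combine

#### Aggregation (agg)
- Use the agg() method to apply a user-defined aggregation function to a group object.
- Map several functions to every column: group_object.agg([func1, func2, func3, ...])
- Map a different function to each column: group_object.agg({'col1': func1, 'col2': func2, ...}), passing the mapping as a dict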
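
As a minimal sketch of the two agg() call shapes above, the example below uses a hypothetical table `toy` and a hypothetical helper `value_range`; neither comes from the notebook.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [10, 20, 30, 50],
    "minutes": [5, 7, 9, 3],
})
toy_grouped = toy.groupby("team")

def value_range(x):
    """User-defined aggregation: spread between largest and smallest value."""
    return x.max() - x.min()

# List form: every function is applied to every column (two-level column index)
same_for_all = toy_grouped.agg(["min", "max"])

# Dict form: each column gets its own function
per_column = toy_grouped.agg({"score": "mean", "minutes": value_range})

print(same_for_all)
print(per_column)
```

The list form produces a two-level column index (column name, function name), which is why the agg(['min', 'max']) output later in this section has a double header.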

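#### Transformation (transform)
- Returns a result aligned with the original row index and column names, so every input row gets one output value.
- Transformation: group_object.transform(mapping_function)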
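
To illustrate the index alignment described above, here is a small sketch that centers each value on its group mean. The table `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [10, 20, 30, 50],
})

# Subtract the group mean from every row; the result has one value per input row
centered = toy.groupby("team")["score"].transform(lambda x: x - x.mean())

print(centered)
```

Unlike agg(), which returns one row per group, the result here keeps the original four-row index, so it can be assigned straight back as a new column.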

#### Filtering (filter)
- Passing a function that evaluates a condition to filter() keeps only the groups for which the condition is true.
- Filtering a group object: group_object.filter(condition_function)
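
A minimal sketch of group-level filtering, again on a hypothetical table `toy`: the condition is evaluated once per group, and all rows of a failing group are dropped together.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [10, 20, 30, 40, 50],
})

# Keep only groups that contain at least 3 rows; the function must return a single bool per group
kept = toy.groupby("team").filter(lambda g: len(g) >= 3)

print(kept)
```

The returned DataFrame contains the surviving original rows with their original index, not one row per group.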

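#### Mapping a function onto groups (apply)
- Passes each group as a whole to the given function and combines the per-group results.
- Covers most operations a user may want to run on a group object.
- group_object.apply(mapping_function)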
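
As a rough sketch of how apply() differs from agg() and transform(), the function below returns a result whose length is neither one per group nor one per row. The table `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [10, 20, 30, 50, 40],
})

# Each group's "score" Series is passed to the function as a whole;
# here the function returns the two largest values of that group
top2 = toy.groupby("team")["score"].apply(lambda s: s.nlargest(2))

print(top2)
```

Because the function may return a scalar, a Series, or a DataFrame, apply() is the most flexible of the four methods, at the cost of being harder for pandas to optimize.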

In [280]:
# Define a user function that returns (max - min), pass it to agg() to aggregate each group, store the result as agg_minmax, and print the first 5 rows

titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]
grouped = df.groupby('class')

# Restrict to numeric columns: subtraction is undefined for the string column 'sex'
agg_minmax = grouped[['age', 'fare', 'survived']].agg(lambda x: x.max() - x.min())
agg_minmax.head()

class,age,fare,survived
First,79.08,512.3292,1
Second,69.33,73.5,1
Third,73.58,69.55,1


In [283]:
# Apply the min and max functions to every column of grouped and print the result.

agg_all = grouped.agg(['min', 'max'])
agg_all

,age,age,sex,sex,fare,fare,survived,survived
class,min,max,min,max,min,max,min,max
First,0.92,80.0,female,male,0.0,512.3292,0,1
Second,0.67,70.0,female,male,0.0,73.5,0,1
Third,0.42,74.0,female,male,0.0,69.55,0,1


In [273]:
# In grouped, apply mean to the age column and min and max to the fare column, then print the result

agg_1 = grouped.agg({'age' : 'mean', 'fare' : ['min', 'max']})
agg_1

,age,fare,fare
class,mean,min,max
First,38.233441,0.0,512.3292
Second,29.87763,0.0,73.5
Third,25.14062,0.0,69.55


In [301]:
# Keep only the groups with at least 200 rows, return them as a DataFrame, and inspect the result

a = grouped.filter(lambda x: len(x) >= 200)
print(a)
print(a.value_counts('class'))

      age     sex  class     fare  survived
0    22.0    male  Third   7.2500         0
1    38.0  female  First  71.2833         1
2    26.0  female  Third   7.9250         1
3    35.0  female  First  53.1000         1
4    35.0    male  Third   8.0500         0
..    ...     ...    ...      ...       ...
885  39.0  female  Third  29.1250         0
887  19.0  female  First  30.0000         1
888   NaN  female  Third  23.4500         0
889  26.0    male  First  30.0000         1
890  32.0    male  Third   7.7500         0

[707 rows x 5 columns]
class
Third     491
First     216
Second      0
dtype: int64


In [314]:
# Keep only the groups whose mean age is below 30, return them as a DataFrame named age_filter, and inspect the last rows

age_filter = grouped.filter(lambda x: x.age.mean() < 30)
print(age_filter.value_counts())
age_filter

age   sex     class   fare     survived
19.0  male    Third   7.8958   0           3
22.0  male    Third   7.2500   0           3
30.0  male    Second  13.0000  0           3
23.0  male    Second  13.0000  0           3
25.0  male    Second  13.0000  0           3
                                          ..
21.0  male    Third   7.7333   0           1
                      7.2500   0           1
              Second  11.5000  0           1
      female  Third   34.3750  0           1
74.0  male    Third   7.7750   0           1
Length: 486, dtype: int64


,age,sex,class,fare,survived
0,22.0,male,Third,7.2500,0
2,26.0,female,Third,7.9250,1
4,35.0,male,Third,8.0500,0
5,,male,Third,8.4583,0
7,2.0,male,Third,21.0750,0
...,...,...,...,...,...
884,25.0,male,Third,7.0500,0
885,39.0,female,Third,29.1250,0
886,27.0,male,Second,13.0000,0
888,,female,Third,23.4500,0


In [317]:
# Print the summary statistics of each group

for key, group in grouped:
    print(key, '\n', group.describe(), '\n')

First 
               age        fare    survived
count  186.000000  216.000000  216.000000
mean    38.233441   84.154687    0.629630
std     14.802856   78.380373    0.484026
min      0.920000    0.000000    0.000000
25%     27.000000   30.923950    0.000000
50%     37.000000   60.287500    1.000000
75%     49.000000   93.500000    1.000000
max     80.000000  512.329200    1.000000 

Second 
               age        fare    survived
count  173.000000  184.000000  184.000000
mean    29.877630   20.662183    0.472826
std     14.001077   13.417399    0.500623
min      0.670000    0.000000    0.000000
25%     23.000000   13.000000    0.000000
50%     29.000000   14.250000    0.000000
75%     36.000000   26.000000    1.000000
max     70.000000   73.500000    1.000000 

Third 
               age        fare    survived
count  355.000000  491.000000  491.000000
mean    25.140620   13.675550    0.242363
std     12.495398   11.778142    0.428949
min      0.420000    0.000000    0.000000
25%  

In [328]:
# Write a user function that measures how many standard deviations a value lies from the mean

def z_score(x):
    return (x - x.mean()) / x.std()

In [339]:
# Use the function defined above to convert the age column with transform()

print(list((grouped['age'])))
# grouped.age.transform(z_score)
grouped['age'].transform(z_score)

[('First', 1      38.0
3      35.0
6      54.0
11     58.0
23     28.0
       ... 
871    47.0
872    33.0
879    56.0
887    19.0
889    26.0
Name: age, Length: 216, dtype: float64), ('Second', 9      14.0
15     55.0
17      NaN
20     35.0
21     34.0
       ... 
866    27.0
874    28.0
880    25.0
883    28.0
886    27.0
Name: age, Length: 184, dtype: float64), ('Third', 0      22.0
2      26.0
4      35.0
5       NaN
7       2.0
       ... 
882    22.0
884    25.0
885    39.0
888     NaN
890    32.0
Name: age, Length: 491, dtype: float64)]


0     -0.251342
1     -0.015770
2      0.068776
3     -0.218434
4      0.789041
         ...   
886   -0.205529
887   -1.299306
888         NaN
889   -0.826424
890    0.548953
Name: age, Length: 891, dtype: float64
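
One way to sanity-check a per-group z-score like the one above is to confirm that, within every group, the transformed values have mean 0 and standard deviation 1. The sketch below does this on a hypothetical table `toy` rather than on the titanic data; `z_score` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b", "b"],
    "score": [1.0, 2.0, 3.0, 10.0, 20.0, 60.0],
})

def z_score(x):
    # Distance from the group mean, in units of the group standard deviation
    return (x - x.mean()) / x.std()

z = toy.groupby("team")["score"].transform(z_score)

# Regroup the z-scores by team and check their mean and standard deviation
check = z.groupby(toy["team"]).agg(["mean", "std"])

print(check)
```

Because the mean and standard deviation are computed inside each group, a value is compared only with the other members of its own group, which is why passengers of the same age in different classes receive different scores in the output above.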

In [444]:
# Use the function defined above to map the age column with apply() and print the result

grouped.age.apply(z_score).head()

grouped[['age']].apply(lambda x: z_score(x))[:3]

,age
0,-0.251342
1,-0.01577
2,0.068776


In [627]:
# Select and print the rows whose class is First

df1 = df.groupby(['class', 'sex'])

# Code for viewing the raw rows of one particular group when grouping by several columns

df1_First_F = df1.get_group(('First', 'female'))
print(df1_First_F)
df1_First_M = df1.get_group(('First', 'male'))
print(df1_First_M)

a = pd.concat((df1_First_F,df1_First_M))
a.sort_index(ascending = True)

      age     sex  class      fare  survived
1    38.0  female  First   71.2833         1
3    35.0  female  First   53.1000         1
11   58.0  female  First   26.5500         1
31    NaN  female  First  146.5208         1
52   49.0  female  First   76.7292         1
..    ...     ...    ...       ...       ...
856  45.0  female  First  164.8667         1
862  48.0  female  First   25.9292         1
871  47.0  female  First   52.5542         1
879  56.0  female  First   83.1583         1
887  19.0  female  First   30.0000         1

[94 rows x 5 columns]
      age   sex  class      fare  survived
6    54.0  male  First   51.8625         0
23   28.0  male  First   35.5000         1
27   19.0  male  First  263.0000         0
30   40.0  male  First   27.7208         0
34   28.0  male  First   82.1708         0
..    ...   ...    ...       ...       ...
839   NaN  male  First   29.7000         1
857  51.0  male  First   26.5500         1
867  31.0  male  First   50.4958         0
872  33

,age,sex,class,fare,survived
1,38.0,female,First,71.2833,1
3,35.0,female,First,53.1000,1
6,54.0,male,First,51.8625,0
11,58.0,female,First,26.5500,1
23,28.0,male,First,35.5000,1
...,...,...,...,...,...
871,47.0,female,First,52.5542,1
872,33.0,male,First,5.0000,0
879,56.0,female,First,83.1583,1
887,19.0,female,First,30.0000,1


In [629]:
# Select and print the rows whose class is First and whose sex is female

df1_First_F = df1.get_group(('First', 'female'))
df1_First_F

,age,sex,class,fare,survived
1,38.0,female,First,71.2833,1
3,35.0,female,First,53.1000,1
11,58.0,female,First,26.5500,1
31,,female,First,146.5208,1
52,49.0,female,First,76.7292,1
...,...,...,...,...,...
856,45.0,female,First,164.8667,1
862,48.0,female,First,25.9292,1
871,47.0,female,First,52.5542,1
879,56.0,female,First,83.1583,1
