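# Group Operations

- Splitting data into several groups according to a given criterion and processing each group
- Efficient for aggregating, transforming, and filtering data

- groupby()
    - How the groupby() method processes data
        1. Split: divide the data according to a given condition
        2. Apply: apply the method needed to aggregate, transform, or filter each group
        3. Combine: merge the results of step 2 into a single result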
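
The three steps above can be sketched by hand and compared with a single groupby() call. This is only an illustration on a hypothetical toy DataFrame (the names `toy`, `team`, and `score` are assumptions, not part of the dataset used below).

```python
import pandas as pd

# Hypothetical toy data, not the occupation dataset used in this notebook
toy = pd.DataFrame({
    "team": ["a", "b", "a", "b", "a"],
    "score": [10, 20, 30, 40, 80],
})

# 1. Split: one sub-DataFrame per distinct value of "team"
pieces = {key: toy[toy["team"] == key] for key in toy["team"].unique()}

# 2. Apply: compute the mean score within each piece
means = {key: piece["score"].mean() for key, piece in pieces.items()}

# 3. Combine: gather the per-group results into one Series
manual = pd.Series(means, name="score").sort_index()

# groupby() performs the same three steps in a single call
auto = toy.groupby("team")["score"].mean()

print(manual)
print(auto)
```

Both Series hold the same per-team means, which is why groupby() is usually preferred over the manual loop.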

# Creating Group Objects

## Grouping by a Single Column

In [1]:
import pandas as pd

In [3]:
# https://url.kr/g4nreb
df = pd.read_csv("./data/occupation.tsv", sep = "|")

In [6]:
df.head()

,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [10]:
df["gender"].unique()

array(['M', 'F'], dtype=object)

In [11]:
# Group by the gender column
grouped = df.groupby("gender")
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000269EF28D490>
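
Printing a group object only shows its type and memory address, so it can help to inspect it through its attributes. A minimal sketch on a hypothetical toy DataFrame (`toy` and `g` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset above
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "M", "F"],
    "age": [24, 53, 23, 24, 33],
})
g = toy.groupby("gender")

print(g.ngroups)       # number of distinct groups
print(list(g.groups))  # group keys
print(g.groups["F"])   # row labels that belong to group "F"
```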

In [12]:
# Iterate over the group object and print each group
# Use head() to print only the first 5 rows
for key, group in grouped:
    print("KEY :", key)
    print("NUMBER :", len(group))
    print(group.head())
    print()

KEY : F
NUMBER : 273
    user_id  age gender occupation zip_code
1         2   53      F      other    94043
4         5   33      F      other    15213
10       11   39      F      other    30329
11       12   28      F      other    06405
14       15   49      F   educator    97301

KEY : M
NUMBER : 670
   user_id  age gender     occupation zip_code
0        1   24      M     technician    85711
2        3   23      M         writer    32067
3        4   24      M     technician    43537
5        6   42      M      executive    98101
6        7   57      M  administrator    91344



In [13]:
# Apply an aggregation method
# mean() fails here because non-numeric columns such as occupation cannot be averaged
grouped.mean()

TypeError: Could not convert otherotherotherothereducatorother... to numeric
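
In recent pandas versions the `numeric_only` argument of group aggregations defaults to `False`, so string columns trigger the error above. Two common ways around it are sketched below on a hypothetical toy DataFrame (`toy`, `means_all`, and `means_age` are illustrative names).

```python
import pandas as pd

# Hypothetical toy data mixing numeric and string columns
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "age": [24, 53, 23, 33],
    "occupation": ["technician", "other", "writer", "other"],
})
g = toy.groupby("gender")

# Option 1: ask pandas to skip non-numeric columns
means_all = g.mean(numeric_only=True)

# Option 2: select the numeric columns explicitly before aggregating
means_age = g[["age"]].mean()

print(means_all)
print(means_age)
```

Selecting columns explicitly makes it obvious which values are being averaged, while `numeric_only=True` is shorter when there are many numeric columns.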

In [14]:
grouped["age"].mean()

gender
F    33.813187
M    34.149254
Name: age, dtype: float64

In [15]:
# Select an individual group
group_female = grouped.get_group("F")
group_female.head()

,user_id,age,gender,occupation,zip_code
1,2,53,F,other,94043
4,5,33,F,other,15213
10,11,39,F,other,30329
11,12,28,F,other,06405
14,15,49,F,educator,97301


## Grouping by Multiple Columns

In [16]:
# Group by the gender and occupation columns
grouped2 = df.groupby(["gender", "occupation"])

In [17]:
# Iterate over the grouped2 group object and print each group
for key, group in grouped2:
    print("KEY :", key)
    print("NUMBER :", len(group))
    print(group.head(2))
    print()

KEY : ('F', 'administrator')
NUMBER : 36
    user_id  age gender     occupation zip_code
33       34   38      F  administrator    42141
61       62   27      F  administrator    97214

KEY : ('F', 'artist')
NUMBER : 13
    user_id  age gender occupation zip_code
22       23   30      F     artist    48197
23       24   21      F     artist    94533

KEY : ('F', 'educator')
NUMBER : 26
    user_id  age gender occupation zip_code
14       15   49      F   educator    97301
64       65   51      F   educator    48118

KEY : ('F', 'engineer')
NUMBER : 2
     user_id  age gender occupation zip_code
785      786   36      F   engineer    01754
826      827   23      F   engineer    80228

KEY : ('F', 'entertainment')
NUMBER : 2
     user_id  age gender     occupation zip_code
720      721   24      F  entertainment    11238
838      839   38      F  entertainment    90814

KEY : ('F', 'executive')
NUMBER : 3
     user_id  age gender occupation zip_code
97        98   49      F  executive   

In [19]:
# Apply an aggregation method to the grouped2 group object
grouped2.mean()

TypeError: Could not convert 42141972147303403755522416810615237481035230260202582024412478756554061680378213173454320420817494280410219711442248030320879900956047644265165068053819716V1G4L33763553370306221114 to numeric

In [20]:
grouped2["age"].mean()

gender  occupation   
F       administrator    40.638889
        artist           30.307692
        educator         39.115385
        engineer         29.500000
        entertainment    31.000000
        executive        44.000000
        healthcare       39.818182
        homemaker        34.166667
        lawyer           39.500000
        librarian        40.000000
        marketing        37.200000
        none             36.500000
        other            35.472222
        programmer       32.166667
        retired          70.000000
        salesman         27.000000
        scientist        28.333333
        student          20.750000
        technician       38.000000
        writer           37.631579
M       administrator    37.162791
        artist           32.333333
        doctor           43.571429
        educator         43.101449
        engineer         36.600000
        entertainment    29.000000
        executive        38.172414
        healthcare       45.40000

In [21]:
# Select an individual group from the grouped2 group object
grouped2.get_group(("M", "scientist")).head()

,user_id,age,gender,occupation,zip_code
13,14,45,M,scientist,55106
39,40,38,M,scientist,27514
70,71,39,M,scientist,98034
73,74,39,M,scientist,T8H1N
106,107,39,M,scientist,60466
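
Grouping by several columns produces a result indexed by a tuple of keys, one level per grouping column. The sketch below, on a hypothetical toy DataFrame (`toy`, `mean_age`, and `wide` are illustrative names), shows how to pick one group's value with a tuple and how to reshape the result into a table.

```python
import pandas as pd

# Hypothetical toy data, not the occupation dataset
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "occupation": ["artist", "writer", "artist", "writer", "writer"],
    "age": [30, 40, 20, 50, 60],
})
mean_age = toy.groupby(["gender", "occupation"])["age"].mean()

# A single group's result is addressed by a tuple of keys
print(mean_age.loc[("M", "writer")])

# unstack() moves the inner level (occupation) into the columns
wide = mean_age.unstack()
print(wide)
```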


## Group Operation Methods

### Data Aggregation

- The process of applying various operations to a group object
- Built-in pandas methods that provide aggregation
    - mean()
    - max()
    - min()
    - sum()
    - count()
    - size()
    - var()
    - std()
    - describe()
    - info()
    - first()
    - last()
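
Two of the methods listed above, size() and count(), are easy to confuse. A small sketch on hypothetical toy data with one missing value (`toy`, `sizes`, and `counts` are illustrative names) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one age value is missing on purpose
toy = pd.DataFrame({
    "occupation": ["writer", "writer", "artist"],
    "age": [30, np.nan, 25],
})
g = toy.groupby("occupation")

sizes = g.size()            # rows per group, missing values included
counts = g["age"].count()   # non-missing age values per group

print(sizes)
print(counts)
```

size() counts rows, whereas count() counts only the non-missing entries of each column.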
    

In [22]:
df.head()

,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [24]:
grouped = df.groupby("occupation")

In [25]:
# Aggregate the standard deviation of every column in each group and return a DataFrame
grouped.std()

ValueError: could not convert string to float: 'M'

In [26]:
# Aggregate the standard deviation of the age column in each group and return a Series
grouped["age"].std()

occupation
administrator    11.123397
artist            8.668116
doctor           12.501428
educator         10.413264
engineer         11.199236
entertainment    10.056052
executive        10.608075
healthcare       11.313524
homemaker        10.737119
lawyer           10.830303
librarian        11.023611
marketing         9.474500
none             13.757826
other            10.738227
programmer        9.624512
retired           5.757461
salesman         14.079859
scientist         7.392964
student           5.284081
technician        9.867210
writer           11.423306
Name: age, dtype: float64

In [28]:
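# Apply agg() to a group object, passing a user-defined function as the argument
def min_max(x):
    return x.max() - x.min()

- To apply a user-defined aggregation function to a group object, use the agg() method

In [30]:
# Compute the difference between the maximum and minimum of each group
# (the lambda in the next cell is equivalent and shows the result)
grouped["age"].agg(min_max)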

In [31]:
grouped["age"].agg(lambda x : x.max() - x.min())

occupation
administrator    49
artist           29
doctor           36
educator         40
engineer         48
entertainment    35
executive        47
healthcare       40
homemaker        30
lawyer           32
librarian        46
marketing        31
none             44
other            51
programmer       43
retired          22
salesman         48
scientist        32
student          35
technician       34
writer           42
Name: age, dtype: int64

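- Several functions can be applied at once to aggregate the data in each group
    - To apply the same set of functions to every column, pass them as a list
    - To apply different functions to different columns, pass a dictionary of the form {column : function}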
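
The list and dictionary forms both return a two-level column index. As an alternative that this notebook does not use, pandas also supports named aggregation, where each output column is given a name and a `(column, function)` pair. A minimal sketch on hypothetical toy data (`toy`, `summary`, and the output column names are illustrative):

```python
import pandas as pd

# Hypothetical toy data, not the occupation dataset
toy = pd.DataFrame({
    "occupation": ["writer", "writer", "artist", "artist"],
    "user_id": [3, 7, 23, 24],
    "age": [23, 41, 30, 21],
})

# Each keyword names an output column; the tuple is (input column, function)
summary = toy.groupby("occupation").agg(
    first_user=("user_id", "min"),
    mean_age=("age", "mean"),
    age_range=("age", lambda x: x.max() - x.min()),
)
print(summary)
```

The result has flat, descriptive column names, which can be easier to work with than a two-level header.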

In [32]:
# Aggregate by applying the same functions to every column
grouped.agg(["min", "max"])

,user_id,user_id,age,age,gender,gender,zip_code,zip_code
occupation,min,max,min,max,min,max,min,max
administrator,7,940,21,70,F,M,2154,V1G4L
artist,23,920,19,48,F,M,1945,V5A2B
doctor,138,935,28,64,M,M,47401,97405
educator,13,937,23,63,F,M,1602,M4J2K
engineer,25,934,22,70,F,M,0,T8H1N
entertainment,16,926,15,50,F,M,1040,V3N4P
executive,6,901,22,69,F,M,0,L1V3W
healthcare,60,910,22,62,F,M,2154,97232
homemaker,20,898,20,50,F,M,17331,96349
lawyer,10,846,21,53,F,M,6371,90703


In [33]:
# Aggregate by applying different functions to each column
grouped.agg({"user_id" : "min",
            "age" : ["mean", "std"],
            "gender" : "count",
            "zip_code" : "max"})

,user_id,age,age,gender,zip_code
occupation,min,mean,std,count,max
administrator,7,38.746835,11.123397,79,V1G4L
artist,23,31.392857,8.668116,28,V5A2B
doctor,138,43.571429,12.501428,7,97405
educator,13,42.010526,10.413264,95,M4J2K
engineer,25,36.38806,11.199236,67,T8H1N
entertainment,16,29.222222,10.056052,18,V3N4P
executive,6,38.71875,10.608075,32,L1V3W
healthcare,60,41.5625,11.313524,16,97232
homemaker,20,32.571429,10.737119,7,96349
lawyer,10,36.75,10.830303,12,90703
