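# Advanced Data Processing

### Main Topics

1. Modifying, adding, and removing variables
2. Converting variable types
3. Handling missing values and creating derived variables

<br>

### Goals
1. Modify variables and add derived variables to suit the purpose of the analysis.
2. Work with variable types such as dates.
3. Learn how to replace missing values with appropriate values.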


<br>
<hr>
<br>

<br>

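## 1. Modifying, Adding, and Removing Variables (Columns)

Variables can be added, modified, updated, or removed with the basic features and methods of **pandas**.  
Just as when selecting a variable, **=** can be used to add or update a variable.

### 1.1. Modifying and Adding Variables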

In [122]:
# Import the library
import pandas as pd

# Build an example: create a DataFrame from a dictionary
df_own = pd.DataFrame({
    'FIRST' : ['A', 'B', 'C', 'D', 'E'],
    'SECOND': [7, 6, 5, 8, 9],
    'THIRD' : pd.date_range('2023-01-01', periods=5, freq='W-SAT') # freq='W-SAT': every Saturday; periods: number of dates to generate
})
df_own

Unnamed: 0,FIRST,SECOND,THIRD
0,A,7,2023-01-07
1,B,6,2023-01-14
2,C,5,2023-01-21
3,D,8,2023-01-28
4,E,9,2023-02-04


In [51]:
# Select a variable by name and overwrite all of its values
df_own['SECOND'] = 0
df_own

Unnamed: 0,FIRST,SECOND,THIRD
0,A,0,2023-01-07
1,B,0,2023-01-14
2,C,0,2023-01-21
3,D,0,2023-01-28


In [55]:
# Add a variable with =
df_own['FOURTH'] = 1
df_own

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH
0,A,7,2023-01-07,1
1,B,6,2023-01-14,1
2,C,5,2023-01-21,1
3,D,8,2023-01-28,1


In [56]:
# Update a variable with =
df_own['FOURTH'] = df_own['SECOND'] + 1
df_own

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH
0,A,7,2023-01-07,8
1,B,6,2023-01-14,7
2,C,5,2023-01-21,6
3,D,8,2023-01-28,9


<br>

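### 1.2. Comparing Object Methods and Series Methods

Date-time variables in particular allow components such as month, day, weekday, and hour to be extracted and added as new variables.  
Python is strict about object types, so methods of a single value must be distinguished from methods of a **Series**.  
Rather than methods that apply to one date at a time, the **Series** methods of **pandas** are recommended.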

In [57]:
df_own

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH
0,A,7,2023-01-07,8
1,B,6,2023-01-14,7
2,C,5,2023-01-21,6
3,D,8,2023-01-28,9


In [58]:
# A method of a single value: weekday of the 'THIRD' value at index 0
df_own.loc[0, 'THIRD'].weekday()
    ## 0-6: Monday-Sunday
    ## 5: Saturday
    
    ## Timestamp methods can be called on a single value

5

In [7]:
# A Series only supports Series methods, so this call raises an error
df_own['THIRD'].weekday()

AttributeError: 'Series' object has no attribute 'weekday'

In [59]:
df_own

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH
0,A,7,2023-01-07,8
1,B,6,2023-01-14,7
2,C,5,2023-01-21,6
3,D,8,2023-01-28,9


In [60]:
# Pass a lambda (a short anonymous function) to apply
df_own['DAY'] = df_own['THIRD'].apply(lambda x: x.weekday())
df_own

# Alternatively, define a named function and pass it to apply
# def convertToDay(x):
#     if x.weekday() == 5 :
#         return 'Saturday'
#     elif x.weekday() == 0:
#         return 'Monday'
    
# df_own['DAY'] = df_own['THIRD'].apply(convertToDay)
# df_own

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH,DAY
0,A,7,2023-01-07,8,5
1,B,6,2023-01-14,7,5
2,C,5,2023-01-21,6,5
3,D,8,2023-01-28,9,5


<br>

>The *dt.weekday* accessor of pandas creates the derived variable far more easily
 * Reference: [dt.weekday](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.weekday.html)
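
The `.dt` accessor exposes more than the weekday. The following is a minimal sketch on a hypothetical Series named `dates` (not part of the original notebook), built the same way as the `THIRD` variable, showing a few of the components that can be pulled out in one vectorized call.

```python
import pandas as pd

# Hypothetical example Series: three consecutive Saturdays
dates = pd.Series(pd.date_range('2023-01-01', periods=3, freq='W-SAT'))

# Each .dt attribute or method is applied to every element at once
weekdays = dates.dt.weekday.tolist()       # Monday=0 ... Sunday=6
months = dates.dt.month.tolist()           # calendar month number
day_names = dates.dt.day_name().tolist()   # weekday names as strings

print(weekdays, months, day_names)
```

Because these calls return a Series of the same length, the result can be assigned directly as a new variable, as with `dt.weekday` in this section.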

In [61]:
# Use dt.weekday from pandas
df_own['THIRD'].dt.weekday

0    5
1    5
2    5
3    5
Name: THIRD, dtype: int32

In [62]:
df_own['WEEKDAY'] = df_own['THIRD'].dt.weekday

In [63]:
df_own

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH,DAY,WEEKDAY
0,A,7,2023-01-07,8,5,5
1,B,6,2023-01-14,7,5,5
2,C,5,2023-01-21,6,5,5
3,D,8,2023-01-28,9,5,5


<br>

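### 1.3. Changing and Creating Values with Conditions

Just as a condition can select a subset of observations, a condition can also be used to add or update a variable for only those observations.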

In [64]:
df_own

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH,DAY,WEEKDAY
0,A,7,2023-01-07,8,5,5
1,B,6,2023-01-14,7,5,5
2,C,5,2023-01-21,6,5,5
3,D,8,2023-01-28,9,5,5


In [65]:
# Select a subset of observations with a condition
cond = df_own['FIRST'].isin(['A','B'])
df_own.loc[cond]

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH,DAY,WEEKDAY
0,A,7,2023-01-07,8,5,5
1,B,6,2023-01-14,7,5,5


In [66]:
# Select one variable for the observations that meet the condition (the next cell changes these values)
cond = df_own['FIRST'].isin(['A','B'])
df_own.loc[cond, 'FOURTH']

0    8
1    7
Name: FOURTH, dtype: int64

In [21]:
df_own.loc[df_own['FIRST'].isin(['A','B']), 'FOURTH'] = 0
df_own

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH,DAY,WEEKDAY
0,A,0,2023-01-07,0,5,5
1,B,0,2023-01-14,0,5,5
2,C,0,2023-01-21,1,5,5
3,D,0,2023-01-28,1,5,5


In [22]:
# Create values for only some observations
df_own.loc[df_own['FIRST'].isin(['A','B']), 'OPTIONAL'] = 9999
df_own
    ## NaN: missing value

Unnamed: 0,FIRST,SECOND,THIRD,FOURTH,DAY,WEEKDAY,OPTIONAL
0,A,0,2023-01-07,0,5,5,9999.0
1,B,0,2023-01-14,0,5,5,9999.0
2,C,0,2023-01-21,1,5,5,
3,D,0,2023-01-28,1,5,5,


In [23]:
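# Chained indexing (a slice followed by a column) triggers the warning below
# Preferred form, a single .loc call: df_own.loc[[0,1], 'SECOND'] = 0
df_own[0:2]['SECOND'] = 0
df_own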

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_own[0:2]['SECOND'] = 0


Unnamed: 0,FIRST,SECOND,THIRD,FOURTH,DAY,WEEKDAY,OPTIONAL
0,A,0,2023-01-07,0,5,5,9999.0
1,B,0,2023-01-14,0,5,5,9999.0
2,C,0,2023-01-21,1,5,5,
3,D,0,2023-01-28,1,5,5,
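
The warning above recommends a single `.loc[row_indexer, col_indexer]` assignment. A minimal sketch of that form, on a hypothetical frame `df_demo` rather than the notebook's `df_own`:

```python
import pandas as pd

# Hypothetical frame with the same kind of columns as df_own
df_demo = pd.DataFrame({'FIRST': ['A', 'B', 'C'], 'SECOND': [7, 6, 5]})

# One .loc call selects rows and column together, so the assignment
# is applied to df_demo itself rather than to a temporary object.
# Label slices in .loc include both endpoints (rows 0 and 1 here).
df_demo.loc[0:1, 'SECOND'] = 0

# A Boolean condition works in the row position as well
df_demo.loc[df_demo['FIRST'] == 'C', 'SECOND'] = -1

print(df_demo)
```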


<br>

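### 1.4. Removing Variables

*drop()* removes observations or variables, identified through **index** or **columns**.  
With the `axis=` option, `axis=0` removes observations (rows) and `axis=1` removes variables (columns).  

Naming the `columns=` option explicitly is the clearest way to remove variables and reduces mistakes.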
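
As a small illustration of the keyword form, the sketch below uses a hypothetical frame `df_demo` (not the notebook's data) and removes a column with `columns=` and rows with `index=`, without relying on `axis=`.

```python
import pandas as pd

# Hypothetical three-row frame
df_demo = pd.DataFrame({
    'FIRST': ['A', 'B', 'C'],
    'SECOND': [7, 6, 5],
    'THIRD': [1, 2, 3],
})

# columns= names the variables to remove; no axis argument is needed
without_third = df_demo.drop(columns=['THIRD'])

# index= names the row labels to remove
without_rows = df_demo.drop(index=[0, 2])

# drop() returns a new DataFrame, so df_demo still has all rows and columns
print(without_third.columns.tolist(), without_rows.index.tolist(), df_demo.shape)
```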

In [28]:
df_own

In [25]:
# Remove observations/variables with drop()
    # axis = 0 : observations
    # axis = 1 : variables
# drop() returns a new DataFrame; never assign the result of inplace=True, which is None
df_own.drop('FOURTH', axis=1)

In [27]:
df_own

In [26]:
# Remove observations/variables with drop() (using columns=)
df_own.drop(columns=['THIRD'])

In [None]:
# The original data is unchanged after drop()
df_own

In [None]:
# Update the original data by assigning the result back
df_own = df_own.drop(columns=['FOURTH'])
df_own

In [None]:
# Remove rows
df_own.drop([0, 3], axis=0)

<br>

#### [Exercise] Using drop()

1. Select the rows where 'region' is 'southwest' or 'northwest'
2. Store the index of the result from step 1 in target
3. Use drop() to remove the rows stored in target

In [123]:
df_ins = pd.read_csv('./data/insurance.csv')
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [46]:
# 1. Select the rows where 'region' is 'southwest' or 'northwest'
df_ins['region'].unique() # check the distinct values
cond = df_ins['region'].isin(['southwest', 'northwest']) # Boolean mask: True where region is one of the two values
df_res = df_ins.loc[cond] # keep the rows where the mask is True
df_res

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
7,37,female,27.740,3,no,northwest,7281.50560
9,60,female,25.840,0,no,northwest,28923.13692
...,...,...,...,...,...,...,...
1331,23,female,33.400,0,no,southwest,10795.93733
1332,52,female,44.700,3,no,southwest,11411.68500
1333,50,male,30.970,3,no,northwest,10600.54830
1336,21,female,25.800,0,no,southwest,2007.94500


In [124]:
# 2. Store the index of the result from step 1 in target

target = df_res.index
target

Index([   0,    3,    4,    7,    9,   12,   15,   18,   19,   21,
       ...
       1316, 1319, 1320, 1324, 1329, 1331, 1332, 1333, 1336, 1337],
      dtype='int64', length=650)

In [49]:
# 3. Use drop() to remove the rows stored in target
df_ins2 = df_ins.drop(target, axis=0)
df_ins2

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
5,31,female,25.740,0,no,southeast,3756.62160
6,46,female,33.440,1,no,southeast,8240.58960
8,37,male,29.830,2,no,northeast,6406.41070
...,...,...,...,...,...,...,...
1327,51,male,30.030,1,no,southeast,9377.90470
1328,23,female,24.225,2,no,northeast,22395.74424
1330,57,female,25.740,2,no,southeast,12629.16560
1334,18,female,31.920,0,no,northeast,2205.98080


In [42]:
# 1. Select the rows where 'region' is 'southwest' or 'northwest'
df_ins['region'].unique()
cond = df_ins['region'].isin(['southwest', 'northwest'])
df_temp = df_ins.loc[cond]
df_temp

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
7,37,female,27.740,3,no,northwest,7281.50560
9,60,female,25.840,0,no,northwest,28923.13692
...,...,...,...,...,...,...,...
1331,23,female,33.400,0,no,southwest,10795.93733
1332,52,female,44.700,3,no,southwest,11411.68500
1333,50,male,30.970,3,no,northwest,10600.54830
1336,21,female,25.800,0,no,southwest,2007.94500


In [44]:
# 2. Store the index of the result from step 1 in target
target = df_temp.index
target

Index([   0,    3,    4,    7,    9,   12,   15,   18,   19,   21,
       ...
       1316, 1319, 1320, 1324, 1329, 1331, 1332, 1333, 1336, 1337],
      dtype='int64', length=650)

In [45]:
# 3. Use drop() to remove the rows stored in target
df_ins2 = df_ins.drop(target, axis=0)
df_ins2

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
5,31,female,25.740,0,no,southeast,3756.62160
6,46,female,33.440,1,no,southeast,8240.58960
8,37,male,29.830,2,no,northeast,6406.41070
...,...,...,...,...,...,...,...
1327,51,male,30.030,1,no,southeast,9377.90470
1328,23,female,24.225,2,no,northeast,22395.74424
1330,57,female,25.740,2,no,southeast,12629.16560
1334,18,female,31.920,0,no,northeast,2205.98080


<br>

### 1.5. Renaming Variables

To rename variables, use the **DataFrame** method *rename()*.  
Pass a dictionary to the `columns=` option, pairing each existing name with its new name using a colon.
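
A minimal sketch of the dictionary form, assuming a hypothetical frame `df_demo`: only the names listed in the dictionary change, and the original frame is left untouched unless the result is assigned back.

```python
import pandas as pd

# Hypothetical two-column frame
df_demo = pd.DataFrame({'FIRST': ['A', 'B'], 'SECOND': [7, 6]})

# {old_name: new_name}; columns not listed keep their names
renamed = df_demo.rename(columns={'FIRST': 'var1'})

print(renamed.columns.tolist())   # new labels
print(df_demo.columns.tolist())   # original labels are unchanged
```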

In [None]:
df_own

In [75]:
# Rename variables with rename()
df_own = df_own.rename(columns={'FIRST':'var1', 'SECOND':'var2'})

# Assign the column labels directly (the list length must match the number of columns)
df_own.columns = ['a', 'b', 'c', 'd', 'e', 'f']
df_own

Unnamed: 0,a,b,c,d,e,f
0,A,7,2023-01-07,8,5,5
1,B,6,2023-01-14,7,5,5
2,C,5,2023-01-21,6,5,5
3,D,8,2023-01-28,9,5,5


<br>

#### [Exercise] Using df_sp

1. Add a variable 'total' to **df_sp** that sums 'math score', 'reading score', and 'writing score'
2. Add a variable 'grade' that has the value 'EX' only for students whose 'total' from step 1 is 270 or higher
3. Check whether any of 'math score', 'reading score', or 'writing score' is below 40
4. Using the result of step 3, set 'grade' to 'FAIL' for students with any of the three scores below 40
5. Rename the variable 'grade' to 'class'
6. Remove the variable 'total'

In [106]:
df_sp = pd.read_csv('./data/StudentsPerformance.csv')
df_sp

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


In [113]:
# 1. Add a variable 'total' to df_sp that sums 'math score', 'reading score', and 'writing score'
df_sp['total'] = df_sp['math score'] + df_sp['reading score'] + df_sp['writing score']
df_sp['total']

0      218
1      247
2      278
3      148
4      229
      ... 
995    282
996    172
997    195
998    223
999    249
Name: total, Length: 1000, dtype: int64

In [108]:
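# 2. Add a variable 'grade' that has the value 'EX' only for students whose 'total' from step 1 is 270 or higher
cond1 = df_sp['total'] >= 270
df_sp.loc[cond1, 'grade'] = 'EX' # create 'grade' and assign 'EX' to rows with total >= 270; other rows stay missing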
df_sp

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total,grade
0,female,group B,bachelor's degree,standard,none,72,72,74,218,na
1,female,group C,some college,standard,completed,69,90,88,247,na
2,female,group B,master's degree,standard,none,90,95,93,278,EX
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,na
4,male,group C,some college,standard,none,76,78,75,229,na
...,...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,282,EX
996,male,group C,high school,free/reduced,none,62,55,55,172,na
997,female,group C,high school,free/reduced,completed,59,71,65,195,na
998,female,group D,some college,standard,completed,68,78,77,223,na


In [109]:
# 3. Check whether any of 'math score', 'reading score', or 'writing score' is below 40
cond2 = (df_sp['math score'] < 40) | (df_sp['reading score'] < 40) | (df_sp['writing score'] < 40)
df_sp[cond2]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total,grade
7,male,group B,some college,free/reduced,none,40,43,39,122,na
9,female,group B,high school,free/reduced,none,38,60,50,148,na
17,female,group B,some high school,free/reduced,none,18,32,28,78,na
33,male,group D,some college,standard,none,40,42,38,120,na
55,female,group C,high school,free/reduced,none,33,41,43,117,na
59,female,group C,some high school,free/reduced,none,0,17,10,27,na
61,male,group A,some high school,free/reduced,none,39,39,34,112,na
66,male,group D,some high school,free/reduced,none,45,37,37,119,na
69,female,group C,associate's degree,standard,none,39,64,57,160,na
75,male,group B,associate's degree,free/reduced,none,44,41,38,123,na


In [116]:
# 4. Using the result of step 3, set 'grade' to 'FAIL' for students with any of the three scores below 40
df_sp.loc[cond2, 'grade'] = 'FAIL'
df_sp[df_sp['grade'] == 'FAIL'].shape # (number of 'FAIL' rows, number of columns)

(51, 11)

In [111]:
# 5. Rename the variable 'grade' to 'class'
df_sp = df_sp.rename(columns={'grade' : 'class'}) # alternatively, inplace=True modifies the original directly
df_sp

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total,class
0,female,group B,bachelor's degree,standard,none,72,72,74,218,na
1,female,group C,some college,standard,completed,69,90,88,247,na
2,female,group B,master's degree,standard,none,90,95,93,278,EX
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,na
4,male,group C,some college,standard,none,76,78,75,229,na
...,...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,282,EX
996,male,group C,high school,free/reduced,none,62,55,55,172,na
997,female,group C,high school,free/reduced,completed,59,71,65,195,na
998,female,group D,some college,standard,completed,68,78,77,223,na


In [112]:
# 6. Remove the variable 'total'
df_sp2 = df_sp.drop('total', axis=1)
df_sp2

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,class
0,female,group B,bachelor's degree,standard,none,72,72,74,na
1,female,group C,some college,standard,completed,69,90,88,na
2,female,group B,master's degree,standard,none,90,95,93,EX
3,male,group A,associate's degree,free/reduced,none,47,57,44,na
4,male,group C,some college,standard,none,76,78,75,na
...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,EX
996,male,group C,high school,free/reduced,none,62,55,55,na
997,female,group C,high school,free/reduced,completed,59,71,65,na
998,female,group D,some college,standard,completed,68,78,77,na


In [None]:
df_sp = pd.read_csv('./data/StudentsPerformance.csv')
df_sp

# 1. Add a variable 'total' to df_sp that sums 'math score', 'reading score', and 'writing score'
df_sp['total'] = df_sp['math score'] + df_sp['reading score'] + df_sp['writing score']
df_sp

# 2. Add a variable 'grade' that has the value 'EX' only for students whose 'total' from step 1 is 270 or higher
cond = df_sp['total'] >= 270
df_sp.loc[cond, 'grade'] = 'EX'
df_sp

# 3. Check whether any of 'math score', 'reading score', or 'writing score' is below 40
cond2 = (df_sp['math score'] < 40) | (df_sp['reading score'] < 40) | (df_sp['writing score'] < 40)
df_sp[cond2]

# 4. Using the result of step 3, set 'grade' to 'FAIL' for students with any of the three scores below 40
df_sp.loc[cond2, 'grade'] = 'FAIL'
df_sp[df_sp['grade'] == 'FAIL'].shape[0]

# 5. Rename the variable 'grade' to 'class'
df_sp = df_sp.rename(columns={'grade':'class'})
df_sp

# 6. Remove the variable 'total'
df_sp.drop('total', axis=1, inplace=True)
df_sp

<br>
<hr>
<br>

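## 2. Handling Missing Values

Missing values arise for various reasons
- No value existed in the first place
- A value existed but was omitted through human error
- A value was not received because of a sensor or network error

First check whether missing values exist, then decide whether to replace them or leave them as they are.  
If replacing them, also consider which value to fill in.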
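
One quick way to carry out the first step, checking whether missing values exist, is to count them per variable. The sketch below is an illustration on a hypothetical frame `df_demo`, not the notebook's `df_na`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both columns
df_demo = pd.DataFrame({
    'amount': [20.0, np.nan, 32.0],
    'info1': ['C887', None, None],
})

# isnull() gives a Boolean frame; summing counts the True values per column
na_counts = df_demo.isnull().sum()

# The mean of the Boolean values is the share of missing values per column
na_share = df_demo.isnull().mean()

print(na_counts)
print(na_share)
```

A variable with a high share of missing values may be a candidate for removal, while a low share often favors replacement; the threshold is a judgment call for each analysis.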

In [125]:
# Load the example data
df_na = pd.read_csv('./data/data_dupna.csv')
df_na
    # NaN: missing value

Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3
0,101,A,2022-01-03,20.0,FC,C887,N,
1,101,C,2022-05-03,120.0,FC,C887,,
2,101,B,2022-04-12,32.0,FC,C887,,N
3,103,C,2022-03-03,,CM,,,
4,103,B,2022-03-02,25.0,FC,C453,,N
5,105,C,2022-02-23,92.0,CM,,,
6,201,B,2022-02-16,31.0,FC,C453,,
7,204,A,2022-04-11,15.0,CM,,Y,
8,204,B,2022-04-11,18.0,CM,,,N


<br>

> The commands below identify the observations or variables that contain missing values across the whole dataset.


In [126]:
# Filter on missing values with isnull()
cond = df_na['info1'].isnull()
# df_na[cond]   # rows where info1 is missing
df_na[~cond]    # rows where info1 is not missing

Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3
0,101,A,2022-01-03,20.0,FC,C887,N,
1,101,C,2022-05-03,120.0,FC,C887,,
2,101,B,2022-04-12,32.0,FC,C887,,N
4,103,B,2022-03-02,25.0,FC,C453,,N
6,201,B,2022-02-16,31.0,FC,C453,,


In [127]:
# Check and filter missing values across several variables
cond = df_na['info1'].isnull() | df_na['info3'].isnull()
# df_na[cond]
df_na[~cond]

Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3
2,101,B,2022-04-12,32.0,FC,C887,,N
4,103,B,2022-03-02,25.0,FC,C453,,N


In [128]:
# Use any() to check whether each selected column contains a missing value
df_na.isnull()[['info1', 'info3']].any()
# cond = df_na.isnull().any(axis=1) # True for rows with a missing value in any column
# df_na[~cond]

info1    True
info3    True
dtype: bool

In [None]:
# Use any() to filter rows with a missing value in a subset of columns
cond = df_na[['info1', 'info3']].isnull().any(axis=1)
df_na[~cond]

<br>

### 2.1. Removing Observations with Missing Values

The simplest way to deal with missing values is to remove the variables or observations that contain them.

In [None]:
# Remove observations that have at least one missing value
df_na.dropna()

In [None]:
# Remove observations with a missing value in specific variables
df_na.dropna(subset=['info1', 'info3'])

<br>

### 2.2. Replacing Missing Values

In general, missing values are either left as they are or replaced with an appropriate value as shown below.
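
The main replacement options can be compared side by side on a small example. This is a sketch on a hypothetical frame `df_demo` (its column names are assumptions, not the notebook's data), using `fillna()` with a dictionary and the `ffill()` method, alone and within groups.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two ids, with gaps in both value columns
df_demo = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'amount': [10.0, np.nan, np.nan, 40.0],
    'info': ['Y', None, None, 'N'],
})

# A different constant for each variable
filled_const = df_demo.fillna({'amount': 0, 'info': 'NA'})

# Forward fill: carry the previous non-missing value downward
filled_fwd = df_demo['amount'].ffill()

# Forward fill within each id only; the first row of id 2 has no
# earlier value in its own group, so it stays missing
filled_grp = df_demo.groupby('id')['amount'].ffill()

print(filled_const)
print(filled_fwd.tolist(), filled_grp.tolist())
```

Filling within groups avoids carrying a value from one id into another, which a plain forward fill does for the third row here.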

In [None]:
df_na

In [129]:
# Replace all missing values at once, here with 0
df_na.fillna(value=0)

Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3
0,101,A,2022-01-03,20.0,FC,C887,N,0
1,101,C,2022-05-03,120.0,FC,C887,0,0
2,101,B,2022-04-12,32.0,FC,C887,0,N
3,103,C,2022-03-03,0.0,CM,0,0,0
4,103,B,2022-03-02,25.0,FC,C453,0,N
5,105,C,2022-02-23,92.0,CM,0,0,0
6,201,B,2022-02-16,31.0,FC,C453,0,0
7,204,A,2022-04-11,15.0,CM,0,Y,0
8,204,B,2022-04-11,18.0,CM,0,0,N


In [130]:
# Specify a replacement value for each variable
df_na.fillna(value={'info1':0, 'info2':'NA'})

Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3
0,101,A,2022-01-03,20.0,FC,C887,N,
1,101,C,2022-05-03,120.0,FC,C887,,
2,101,B,2022-04-12,32.0,FC,C887,,N
3,103,C,2022-03-03,,CM,0,,
4,103,B,2022-03-02,25.0,FC,C453,,N
5,105,C,2022-02-23,92.0,CM,0,,
6,201,B,2022-02-16,31.0,FC,C453,,
7,204,A,2022-04-11,15.0,CM,0,Y,
8,204,B,2022-04-11,18.0,CM,0,,N


In [131]:
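# Replace with the nearest preceding non-missing value (forward fill)
    ## Useful for gaps in sensor readings and similar data
    ## ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
df_na.ffill()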


Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3
0,101,A,2022-01-03,20.0,FC,C887,N,
1,101,C,2022-05-03,120.0,FC,C887,N,
2,101,B,2022-04-12,32.0,FC,C887,N,N
3,103,C,2022-03-03,32.0,CM,C887,N,N
4,103,B,2022-03-02,25.0,FC,C453,N,N
5,105,C,2022-02-23,92.0,CM,C453,N,N
6,201,B,2022-02-16,31.0,FC,C453,N,N
7,204,A,2022-04-11,15.0,CM,C453,Y,N
8,204,B,2022-04-11,18.0,CM,C453,Y,N


In [132]:
df_na

Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3
0,101,A,2022-01-03,20.0,FC,C887,N,
1,101,C,2022-05-03,120.0,FC,C887,,
2,101,B,2022-04-12,32.0,FC,C887,,N
3,103,C,2022-03-03,,CM,,,
4,103,B,2022-03-02,25.0,FC,C453,,N
5,105,C,2022-02-23,92.0,CM,,,
6,201,B,2022-02-16,31.0,FC,C453,,
7,204,A,2022-04-11,15.0,CM,,Y,
8,204,B,2022-04-11,18.0,CM,,,N


In [None]:
# Replace with the next non-missing value (backward fill)
    ## Use groupby() to fill only within a group such as id
df_na.groupby('id').bfill()

In [None]:
# Replace missing values in a specific variable only
    ## Filling after groupby() drops the grouping variable from the result
    ## Select the variable, fill its missing values, and assign the result back
df_na['info2'] = df_na.groupby('id')['info2'].ffill()
df_na

<br>
<hr>
<br>

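## 3. Converting Variable Types and Creating Derived Variables

*read_csv()* assigns a suitable type to each variable when loading data, but the type sometimes has to be changed by hand.  
In some cases a new variable must be derived from an existing one, such as extracting the weekday from a date, and used in the analysis.  

<br>

### 3.1. Checking and Converting Variable Types
A **DataFrame** uses the following Series types

+ float: real numbers (numbers with a decimal point)
+ int: integers
+ datetime: date and time
+ bool: Boolean (True or False)
+ category: categorical
+ object: strings or anything else

*.dtypes* shows the type of each variable.  
*.astype()* converts the type of a variable.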
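
A point that is easy to miss is that *.astype()* returns a new Series and leaves the original alone. The sketch below, on a hypothetical frame `df_demo`, shows the difference, and also converts a date stored as text with `pd.to_datetime()`, which is an addition here and not used in the original notebook.

```python
import pandas as pd

# Hypothetical frame: an integer column and a date stored as text
df_demo = pd.DataFrame({
    'children': [0, 1, 3],
    'date': ['2022-01-03', '2022-05-03', '2022-04-12'],
})

# The converted Series is discarded, so the column keeps its type
df_demo['children'].astype('float')
dtype_before = str(df_demo['children'].dtype)

# Assigning the result back is what updates the DataFrame
df_demo['children'] = df_demo['children'].astype('float')
dtype_after = str(df_demo['children'].dtype)

# Text dates become datetime values, which enables the .dt accessor
df_demo['date'] = pd.to_datetime(df_demo['date'])

print(dtype_before, dtype_after)
print(df_demo.dtypes)
```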



In [133]:
# Load the data
import pandas as pd
df_ins = pd.read_csv('./data/insurance.csv')
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [134]:
# Check the variable types
df_ins.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [135]:
# Convert children to float
# astype() returns a new Series; without assignment, children stays int64 in df_ins
df_ins['children'].astype('float')
df_ins.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [136]:
# Convert children to object
df_ins['children'].astype('object')

0       0
1       1
2       3
3       0
4       0
       ..
1333    3
1334    0
1335    0
1336    0
1337    0
Name: children, Length: 1338, dtype: object

In [137]:
# Store the converted values as a new variable
df_ins['children_float'] = df_ins['children'].astype('float')
df_ins.head()
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,children_float
0,19,female,27.900,0,yes,southwest,16884.92400,0.0
1,18,male,33.770,1,no,southeast,1725.55230,1.0
2,28,male,33.000,3,no,southeast,4449.46200,3.0
3,33,male,22.705,0,no,northwest,21984.47061,0.0
4,32,male,28.880,0,no,northwest,3866.85520,0.0
...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,3.0
1334,18,female,31.920,0,no,northeast,2205.98080,0.0
1335,18,female,36.850,0,no,southeast,1629.83350,0.0
1336,21,female,25.800,0,no,southwest,2007.94500,0.0


In [138]:
# Update the types of several variables at once
category_vars = ['sex', 'smoker', 'region']
df_ins[category_vars] = df_ins[category_vars].astype('category')
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,children_float
0,19,female,27.900,0,yes,southwest,16884.92400,0.0
1,18,male,33.770,1,no,southeast,1725.55230,1.0
2,28,male,33.000,3,no,southeast,4449.46200,3.0
3,33,male,22.705,0,no,northwest,21984.47061,0.0
4,32,male,28.880,0,no,northwest,3866.85520,0.0
...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,3.0
1334,18,female,31.920,0,no,northeast,2205.98080,0.0
1335,18,female,36.850,0,no,southeast,1629.83350,0.0
1336,21,female,25.800,0,no,southwest,2007.94500,0.0


In [139]:
# Select variables by type with select_dtypes()
df_ins.select_dtypes('category')
# df_ins

Unnamed: 0,sex,smoker,region
0,female,yes,southwest
1,male,no,southeast
2,male,no,southeast
3,male,no,northwest
4,male,no,northwest
...,...,...,...
1333,male,no,northwest
1334,female,no,northeast
1335,female,no,southeast
1336,female,no,southwest




#### [Note] Benefits of Categorical Data

Categorical data can carry an order among its categories, which defines a relationship between values,<br/>
and it also offers advantages in performance and plotting.
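
Both benefits can be checked on small, made-up data. The sketch below uses `pd.CategoricalDtype` to declare an order (one possible way to do it; the notebook's own commented code uses `cat.reorder_categories`) and compares memory use on a hypothetical Series of repeated labels.

```python
import pandas as pd

# Ordering: declare the categories from smallest to largest
sizes = pd.Series(['X-Large', 'Small', 'Large', 'X-Small', 'Medium'])
order = ['X-Small', 'Small', 'Medium', 'Large', 'X-Large']
size_type = pd.CategoricalDtype(categories=order, ordered=True)
sizes_cat = sizes.astype(size_type)

# Sorting now follows the declared order instead of the alphabet
sorted_sizes = sizes_cat.sort_values().tolist()

# Memory: a column with few distinct values repeated many times
labels = pd.Series(['line1', 'line2'] * 50_000)   # hypothetical values
mem_plain = labels.memory_usage(deep=True)
mem_category = labels.astype('category').memory_usage(deep=True)

print(sorted_sizes)
print(mem_plain, mem_category)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest when a column has few distinct values.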


In [142]:
# Performance benefit (lower memory usage)

df_subway = pd.read_csv('./data/서울교통공사_역별일별승하차인원정보_20220731.csv')
df_subway
# df_subway.nunique()
# df_subway.info()

df_subway['호선'] = df_subway['호선'].astype('category')
df_subway['구분'] = df_subway['구분'].astype('category')
df_subway.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115606 entries, 0 to 115605
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   날짜      115606 non-null  object  
 1   호선      115606 non-null  category
 2   역번호     115606 non-null  int64   
 3   역명      115606 non-null  object  
 4   구분      115606 non-null  category
 5   이용객수    115606 non-null  int64   
dtypes: category(2), int64(2), object(2)
memory usage: 3.7+ MB


In [143]:
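# Sorting by the order defined among categories
df_products = pd.DataFrame({
    'id': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'size': ['X-Large', 'Small', 'Large', 'X-Small', 'Medium']
})
df_products
df_products.sort_values('size') # plain strings sort alphabetically, not by size

# With an ordered category, sort_values() follows the declared order
# df_products['size'] = df_products['size'].astype('category')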
# df_products['size'] = df_products['size'].cat.reorder_categories(["X-Small", "Small", "Medium", "Large", "X-Large"], ordered=True)
# df_products
# df_products.sort_values('size')

Unnamed: 0,id,size
2,p3,Large
4,p5,Medium
1,p2,Small
0,p1,X-Large
3,p4,X-Small


<br>

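#### [Exercise] Using df_pr

1. Compute the difference between Pulse2 (after running) and Pulse1 (before running) and add it as the variable 'Diff'
2. Check the types with `df_pr.dtypes` and the number of distinct values with `df_pr.nunique()`
3. Make a list of the variable names that suit a categorical type
4. Convert the variables from step 3 to the category type with astype() and update them
5. Compute the mean of 'Diff' from step 1 by Ran, Smokes, and Alcohol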

In [144]:
df_pr = pd.read_csv('./data/PulseRates.csv')
df_pr

Unnamed: 0,Height,Weight,Age,Gender,Smokes,Alcohol,Exercise,Ran,Pulse1,Pulse2,Year
0,173,57.0,18,2,2,1,2,2,86.0,88.0,93
1,179,58.0,19,2,2,1,2,1,82.0,150.0,93
2,167,62.0,18,2,2,1,1,1,96.0,176.0,93
3,195,84.0,18,1,2,1,1,2,71.0,73.0,93
4,173,64.0,18,2,2,1,3,2,90.0,88.0,93
...,...,...,...,...,...,...,...,...,...,...,...
105,93,27.0,19,2,2,2,3,2,119.0,120.0,98
106,161,43.0,19,2,2,2,3,2,90.0,89.0,98
107,182,60.0,22,1,2,1,3,2,86.0,84.0,98
108,170,65.0,18,1,2,1,1,2,69.0,64.0,98


In [145]:
# 1. Compute the difference between Pulse2 (after running) and Pulse1 (before running) and add it as the variable 'Diff'
df_pr['Diff'] = df_pr['Pulse2'] - df_pr['Pulse1']
df_pr

Unnamed: 0,Height,Weight,Age,Gender,Smokes,Alcohol,Exercise,Ran,Pulse1,Pulse2,Year,Diff
0,173,57.0,18,2,2,1,2,2,86.0,88.0,93,2.0
1,179,58.0,19,2,2,1,2,1,82.0,150.0,93,68.0
2,167,62.0,18,2,2,1,1,1,96.0,176.0,93,80.0
3,195,84.0,18,1,2,1,1,2,71.0,73.0,93,2.0
4,173,64.0,18,2,2,1,3,2,90.0,88.0,93,-2.0
...,...,...,...,...,...,...,...,...,...,...,...,...
105,93,27.0,19,2,2,2,3,2,119.0,120.0,98,1.0
106,161,43.0,19,2,2,2,3,2,90.0,89.0,98,-1.0
107,182,60.0,22,1,2,1,3,2,86.0,84.0,98,-2.0
108,170,65.0,18,1,2,1,1,2,69.0,64.0,98,-5.0


In [147]:
# 2. Check the types with df_pr.dtypes and the number of distinct values with df_pr.nunique()
df_pr.dtypes
df_pr.nunique()

Height      41
Weight      51
Age         13
Gender       2
Smokes       2
Alcohol      2
Exercise     3
Ran          2
Pulse1      38
Pulse2      54
Year         5
Diff        53
dtype: int64

In [148]:
# 3. Make a list of the variable names that suit a categorical type
c_list = ['Ran', 'Smokes', 'Alcohol']

In [149]:
# 4. Convert the variables from step 3 to the category type with astype() and update them
df_pr[c_list] = df_pr[c_list].astype('category')
df_pr

Unnamed: 0,Height,Weight,Age,Gender,Smokes,Alcohol,Exercise,Ran,Pulse1,Pulse2,Year,Diff
0,173,57.0,18,2,2,1,2,2,86.0,88.0,93,2.0
1,179,58.0,19,2,2,1,2,1,82.0,150.0,93,68.0
2,167,62.0,18,2,2,1,1,1,96.0,176.0,93,80.0
3,195,84.0,18,1,2,1,1,2,71.0,73.0,93,2.0
4,173,64.0,18,2,2,1,3,2,90.0,88.0,93,-2.0
...,...,...,...,...,...,...,...,...,...,...,...,...
105,93,27.0,19,2,2,2,3,2,119.0,120.0,98,1.0
106,161,43.0,19,2,2,2,3,2,90.0,89.0,98,-1.0
107,182,60.0,22,1,2,1,3,2,86.0,84.0,98,-2.0
108,170,65.0,18,1,2,1,1,2,69.0,64.0,98,-5.0


In [150]:
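# 5. Compute the mean of 'Diff' from step 1 by Ran, Smokes, and Alcohol
# observed=False keeps category combinations with no data (shown as NaN) and avoids the FutureWarning in pandas 2.x
df_pr.groupby(c_list, as_index=False, observed=False)['Diff'].mean()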


Unnamed: 0,Ran,Smokes,Alcohol,Diff
0,1,1,1,47.666667
1,1,1,2,
2,1,2,1,50.642857
3,1,2,2,53.533333
4,2,1,1,-2.5
5,2,1,2,-1.0
6,2,2,1,-0.666667
7,2,2,2,-1.04


<br>

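### 3.2. Binning Numeric Variables

Numeric variables are often binned into intervals rather than used as they are.  
The *cut()* and *qcut()* functions are the usual tools.

+ *cut()*: bins by equal-width intervals or by given bin edges
+ *qcut()*: bins by equal frequency (quantiles)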
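
The difference between the two functions shows most clearly on skewed data. The sketch below is an illustration on a hypothetical Series `values` with one large outlier; the bin labels are made up for the example.

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut(): three intervals of equal width over the range 1..100
equal_width = pd.cut(values, bins=3, labels=['low', 'mid', 'high'])

# qcut(): two groups holding the same number of observations
equal_freq = pd.qcut(values, q=2, labels=['lower', 'upper'])

width_counts = equal_width.value_counts().to_dict()
freq_counts = equal_freq.value_counts().to_dict()

print(width_counts)   # almost everything falls into the first interval
print(freq_counts)    # the groups are balanced
```

Equal-width bins keep interval lengths comparable but can leave some bins nearly empty, while quantile bins balance the counts at the cost of uneven interval lengths.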

In [161]:
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,children_float,age_grp,charges_grp2
0,19,female,27.900,0,yes,southwest,16884.92400,0.0,10대,8
1,18,male,33.770,1,no,southeast,1725.55230,1.0,10대,1
2,28,male,33.000,3,no,southeast,4449.46200,3.0,20대,3
3,33,male,22.705,0,no,northwest,21984.47061,0.0,30대,9
4,32,male,28.880,0,no,northwest,3866.85520,0.0,30대,2
...,...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,3.0,50대,6
1334,18,female,31.920,0,no,northeast,2205.98080,0.0,10대,1
1335,18,female,36.850,0,no,southeast,1629.83350,0.0,10대,1
1336,21,female,25.800,0,no,southwest,2007.94500,0.0,20대,1


In [162]:
# Create an age-group variable
    ## //: integer (floor) division
    ## %: remainder
df_ins['age'] // 10
df_ins['age_grp'] = (df_ins['age'] // 10).astype('category')
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,children_float,age_grp,charges_grp2
0,19,female,27.900,0,yes,southwest,16884.92400,0.0,1,8
1,18,male,33.770,1,no,southeast,1725.55230,1.0,1,1
2,28,male,33.000,3,no,southeast,4449.46200,3.0,2,3
3,33,male,22.705,0,no,northwest,21984.47061,0.0,3,9
4,32,male,28.880,0,no,northwest,3866.85520,0.0,3,2
...,...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,3.0,5,6
1334,18,female,31.920,0,no,northeast,2205.98080,0.0,1,1
1335,18,female,36.850,0,no,southeast,1629.83350,0.0,1,1
1336,21,female,25.800,0,no,southwest,2007.94500,0.0,2,1


In [163]:
df_ins['age'] // 10

# Append '0대' (Korean for an age decade, e.g. '10대' = ages 10-19) to build a label
df_ins['age_grp'] = (df_ins['age'] // 10).apply(lambda x: str(x)+'0대')
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,children_float,age_grp,charges_grp2
0,19,female,27.900,0,yes,southwest,16884.92400,0.0,10대,8
1,18,male,33.770,1,no,southeast,1725.55230,1.0,10대,1
2,28,male,33.000,3,no,southeast,4449.46200,3.0,20대,3
3,33,male,22.705,0,no,northwest,21984.47061,0.0,30대,9
4,32,male,28.880,0,no,northwest,3866.85520,0.0,30대,2
...,...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,3.0,50대,6
1334,18,female,31.920,0,no,northeast,2205.98080,0.0,10대,1
1335,18,female,36.850,0,no,southeast,1629.83350,0.0,10대,1
1336,21,female,25.800,0,no,southwest,2007.94500,0.0,20대,1


<br>

*cut()* bins at equal width, and explicit bin edges can also be passed to the `bins=` option.

In [167]:
# Bin at equal width
pd.cut(df_ins['charges'], bins=10)

0        (13651.585, 19916.44]
1         (1059.225, 7386.729]
2         (1059.225, 7386.729]
3        (19916.44, 26181.296]
4         (1059.225, 7386.729]
                 ...          
1333     (7386.729, 13651.585]
1334      (1059.225, 7386.729]
1335      (1059.225, 7386.729]
1336      (1059.225, 7386.729]
1337    (26181.296, 32446.151]
Name: charges, Length: 1338, dtype: category
Categories (10, interval[float64, right]): [(1059.225, 7386.729] < (7386.729, 13651.585] < (13651.585, 19916.44] < (19916.44, 26181.296] ... (38711.006, 44975.862] < (44975.862, 51240.717] < (51240.717, 57505.573] < (57505.573, 63770.428]]

In [168]:
charges_breaks = [0, 5000, 10000, 20000, 100000000]

In [170]:
pd.cut(df_ins['charges'], bins=charges_breaks, right=True, labels=['A','B','C','D'])

0       C
1       A
2       A
3       D
4       A
       ..
1333    C
1334    A
1335    A
1336    A
1337    D
Name: charges, Length: 1338, dtype: category
Categories (4, object): ['A' < 'B' < 'C' < 'D']

In [171]:
# Split into 10 levels with cut()
df_ins['charges_grp'] = pd.cut(df_ins['charges'], bins=10, labels=range(10))
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,children_float,age_grp,charges_grp2,charges_grp
0,19,female,27.900,0,yes,southwest,16884.92400,0.0,10대,8,2
1,18,male,33.770,1,no,southeast,1725.55230,1.0,10대,1,0
2,28,male,33.000,3,no,southeast,4449.46200,3.0,20대,3,0
3,33,male,22.705,0,no,northwest,21984.47061,0.0,30대,9,3
4,32,male,28.880,0,no,northwest,3866.85520,0.0,30대,2,0
...,...,...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,3.0,50대,6,1
1334,18,female,31.920,0,no,northeast,2205.98080,0.0,10대,1,0
1335,18,female,36.850,0,no,southeast,1629.83350,0.0,10대,1,0
1336,21,female,25.800,0,no,southwest,2007.94500,0.0,20대,1,0


In [172]:
# Equal-width bins leave the observations unevenly distributed
df_ins['charges_grp'].value_counts()

charges_grp
0    536
1    398
2    129
3     86
5     59
6     57
4     35
7     32
9      4
8      2
Name: count, dtype: int64

In [173]:
# Equal-frequency binning with qcut()
df_ins['charges_grp2'] = pd.qcut(df_ins['charges'], q=10, labels=range(1, 11))
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,children_float,age_grp,charges_grp2,charges_grp
0,19,female,27.900,0,yes,southwest,16884.92400,0.0,10대,8,2
1,18,male,33.770,1,no,southeast,1725.55230,1.0,10대,1,0
2,28,male,33.000,3,no,southeast,4449.46200,3.0,20대,3,0
3,33,male,22.705,0,no,northwest,21984.47061,0.0,30대,9,3
4,32,male,28.880,0,no,northwest,3866.85520,0.0,30대,2,0
...,...,...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,3.0,50대,6,1
1334,18,female,31.920,0,no,northeast,2205.98080,0.0,10대,1,0
1335,18,female,36.850,0,no,southeast,1629.83350,0.0,10대,1,0
1336,21,female,25.800,0,no,southwest,2007.94500,0.0,20대,1,0


In [174]:
df_ins['charges_grp2'].value_counts()

charges_grp2
1     134
2     134
3     134
5     134
6     134
8     134
9     134
10    134
4     133
7     133
Name: count, dtype: int64
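The contrast between the two counts above can be reproduced on a small synthetic sample. The following is a minimal sketch, assuming a hypothetical right-skewed series `values` in place of `df_ins['charges']`; `labels=False` is used only to get plain integer bin codes.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for a column such as 'charges'
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=10000, size=1000))

# cut(): 4 bins of equal width, so most observations land in the lowest bin
equal_width = pd.cut(values, bins=4, labels=False)

# qcut(): 4 bins bounded by the quartiles, so each bin holds about the same count
equal_freq = pd.qcut(values, q=4, labels=False)

width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts)
print(freq_counts)
```

Equal-width bins keep the interval boundaries easy to interpret, while quantile bins keep the group sizes comparable. Which one fits depends on whether the analysis cares more about the value ranges or about balanced groups.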

<br>

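#### [Exercise] Using the df_sp data

1. Use cut() to add a variable that splits 'reading score' into five groups in 20-point steps.
2. Use cut() to add a variable that splits 'reading score' into five equal-width groups (every interval has the same length).
3. Use qcut() to grade 'reading score' into five equal-frequency groups.
4. Use pivot_table() to compute the mean 'math score' by 'parental level of education' and the group variable from step 3.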

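### 3.3. Creating Within-Group Rank, Shift, and Cumulative Variables

> During analysis, we often rank observations within each group or compare each value with the previous one to measure change. We also compute quantities such as moving averages and cumulative maxima.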

In [175]:
# Load the data
df_dup = pd.read_csv('data/data_dupna.csv')
df_dup

Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3
0,101,A,2022-01-03,20.0,FC,C887,N,
1,101,C,2022-05-03,120.0,FC,C887,,
2,101,B,2022-04-12,32.0,FC,C887,,N
3,103,C,2022-03-03,,CM,,,
4,103,B,2022-03-02,25.0,FC,C453,,N
5,105,C,2022-02-23,92.0,CM,,,
6,201,B,2022-02-16,31.0,FC,C453,,
7,204,A,2022-04-11,15.0,CM,,Y,
8,204,B,2022-04-11,18.0,CM,,,N


In [176]:
# Rank the values (ties receive the average rank)
df_dup['amount'].rank(ascending=False)

0    6.0
1    1.0
2    3.0
3    NaN
4    5.0
5    2.0
6    4.0
7    8.0
8    7.0
Name: amount, dtype: float64

In [177]:
# Rank the values (ties are broken by index order)
df_dup['date'].rank(ascending=True, method='first')

0    1.0
1    9.0
2    8.0
3    5.0
4    4.0
5    3.0
6    2.0
7    6.0
8    7.0
Name: date, dtype: float64

In [178]:
# Add a per-user rank as a derived variable
df_dup['seq'] = df_dup.groupby('id')['date'].rank(method='min',ascending=False)
df_dup

Unnamed: 0,id,product_cd,date,amount,channel,info1,info2,info3,seq
0,101,A,2022-01-03,20.0,FC,C887,N,,3.0
1,101,C,2022-05-03,120.0,FC,C887,,,1.0
2,101,B,2022-04-12,32.0,FC,C887,,N,2.0
3,103,C,2022-03-03,,CM,,,,1.0
4,103,B,2022-03-02,25.0,FC,C453,,N,2.0
5,105,C,2022-02-23,92.0,CM,,,,1.0
6,201,B,2022-02-16,31.0,FC,C453,,,1.0
7,204,A,2022-04-11,15.0,CM,,Y,,1.0
8,204,B,2022-04-11,18.0,CM,,,N,1.0
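In the output above, rows 7 and 8 share the same date and both receive `seq` 1.0 because `method='min'` gives tied values the lowest rank of the tie. The sketch below compares the tie-breaking methods side by side, assuming a small hypothetical series `scores` that is not part of the course data.

```python
import pandas as pd

# Hypothetical scores with a tie between the two 80s
scores = pd.Series([90, 80, 80, 70])

ranks = pd.DataFrame({
    'score': scores,
    # ties share the mean of the positions they occupy (the default)
    'average': scores.rank(ascending=False),
    # ties share the lowest position of the tie
    'min': scores.rank(ascending=False, method='min'),
    # every value gets a distinct rank, even when tied
    'first': scores.rank(ascending=False, method='first'),
    # like 'min', but the next rank follows without a gap
    'dense': scores.rank(ascending=False, method='dense'),
})
print(ranks)
```

When the rank is later used as a filter such as `seq == 1`, `'min'` can return several rows per group, whereas `'first'` returns exactly one.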


In [None]:
# Select the most recent record per id using the rank
df_dup[df_dup['seq']==1]

In [None]:
# Sort the data and convert the date column to datetime
df_dup = df_dup.sort_values(['id','date']).reset_index(drop=True)
df_dup['date'] = pd.to_datetime(df_dup['date'])  # unit-less astype('datetime64') is rejected by pandas 2.x
df_dup.dtypes

In [None]:
# Add the previous value within each group as a new variable
df_dup['date_prev'] = df_dup.groupby('id')['date'].shift()
df_dup

In [None]:
# Compute the time difference from the previous record
df_dup['date_diff'] = df_dup['date'] - df_dup['date_prev']
df_dup
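Subtracting two datetime columns yields a timedelta column, which is often easier to use once converted to a plain number of days. The following is a minimal sketch, assuming a hypothetical frame `visits` with made-up dates.

```python
import pandas as pd

# Hypothetical visit log, already sorted by user and date
visits = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2022-01-03', '2022-01-10', '2022-02-01', '2022-02-04']),
})

# Gap to the previous visit of the same user (NaT for each user's first visit)
visits['gap'] = visits['date'] - visits.groupby('user')['date'].shift()

# .dt.days turns the timedelta into a number of days (NaN where the gap is NaT)
visits['gap_days'] = visits['gap'].dt.days
print(visits)
```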

In [None]:
# Compute the cumulative sum within each group
df_dup['cum_amount'] = df_dup.groupby('id')['amount'].cumsum()
df_dup

In [None]:
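# Per-group moving average with rolling()
# Drop only the group level so the result keeps the original row index and aligns on assignment
df_dup['ma_amount'] = df_dup.groupby('id').rolling(2)['amount'].mean().reset_index(level=0, drop=True)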
df_dup
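The shift, cumulative, and moving calculations in this section can be checked by hand on a tiny frame. The following is a minimal sketch, assuming a hypothetical purchase log `log`; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical purchase log, already sorted by user
log = pd.DataFrame({
    'user': ['a', 'a', 'a', 'b', 'b'],
    'amount': [10.0, 20.0, 30.0, 5.0, 15.0],
})

grouped = log.groupby('user')['amount']

log['prev_amount'] = grouped.shift()   # previous value within each user (NaN on the first row)
log['cum_amount'] = grouped.cumsum()   # running total within each user
log['cum_max'] = grouped.cummax()      # running maximum within each user

# rolling() on a groupby returns a (user, original index) MultiIndex;
# dropping the group level restores the original index so the assignment aligns by label
log['ma_amount'] = grouped.rolling(2).mean().reset_index(level=0, drop=True)
print(log)
```

Dropping only the group level, rather than the whole index, keeps the result correct even when the rows are not sorted by group.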

## 4. Working with Datetime Variables

Components can be extracted from a datetime variable, and data aggregated by date or time can be visualized.

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
# Windows
sns.set(rc={'font.family':'Malgun Gothic', 'axes.unicode_minus':False})

#macOS
# sns.set(rc={'font.family':'AppleGothic', 'axes.unicode_minus':False})

In [None]:
df_subway = pd.read_csv('data/서울교통공사_역별일별승하차인원정보_20220731.csv')
df_subway

In [None]:
# Aggregate and plot without converting the column types
agg = df_subway.groupby(['날짜','호선'], as_index=False)['이용객수'].sum()
display(agg)
sns.lineplot(data=agg, x='날짜', y='이용객수', hue='호선')

In [None]:
# Convert the column types with astype() and to_datetime()
df_subway['호선'] = df_subway['호선'].astype('category')
df_subway['날짜'] = pd.to_datetime(df_subway['날짜'])
df_subway.dtypes

In [None]:
# Create a day-of-week variable (0 = Monday, ..., 6 = Sunday)
df_subway['요일'] = df_subway['날짜'].dt.weekday
df_subway

In [None]:
# Create a month variable
df_subway['월'] = df_subway['날짜'].dt.month
df_subway
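The `.dt` accessor exposes other components in the same way. The following is a minimal sketch, assuming a hypothetical series `stamps` that includes a time of day, which the subway data may not have.

```python
import pandas as pd

# Hypothetical timestamps; in practice they would come from pd.to_datetime() on a text column
stamps = pd.Series(pd.to_datetime(['2022-07-01 08:30', '2022-07-02 23:10', '2022-08-15 02:45']))

parts = pd.DataFrame({
    'weekday': stamps.dt.weekday,       # 0 = Monday, ..., 6 = Sunday
    'day_name': stamps.dt.day_name(),   # English day names by default
    'month': stamps.dt.month,
    'hour': stamps.dt.hour,
})
print(parts)
```

The extracted columns are ordinary integer or string columns, so they can be used directly in boolean filters or as `groupby()` keys.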

In [None]:
# Aggregate ridership by date and line
agg = df_subway.groupby(['날짜','호선'], as_index=False)['이용객수'].sum()
agg

In [None]:
# Plot the time series
sns.lineplot(data=agg, x='날짜', y='이용객수', hue='호선')

#### [Exercise] Use df_accident to count the accidents that occurred between 1 and 5 a.m. in July and August

#### End of script