# [Answers & Explanations] H&M Customer Data: Basic Preprocessing & Indexing/Slicing

In [1]:
import pandas as pd
path = "customer_hm.csv"
df = pd.read_csv(path)
print("Load complete! shape:", df.shape)

Load complete! shape: (1048575, 6)


### 1. `head()`: Previewing the data

In [2]:
df.head()

| | customer_id | FN | Active | club_member_status | fashion_news_frequency | age |
|---|---|---|---|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | 0 | 0 | ACTIVE | NONE | 49 |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | 0 | 0 | ACTIVE | NONE | 25 |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 0 | 0 | ACTIVE | NONE | 24 |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | 0 | 0 | ACTIVE | NONE | 54 |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | 1 | 1 | ACTIVE | Regularly | 52 |


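**Explanation:** `head()` returns the first five rows by default, which gives a quick look at the column names and typical values before any processing.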

### 2. `shape`: (rows, columns) size

In [3]:
df.shape

(1048575, 6)

**Explanation:** Compare `shape` before and after each preprocessing step to track how the row and column counts change.
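
One way to do this tracking is to store `shape` before and after a step and compare the two. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from `customer_hm.csv`).

```python
import pandas as pd

# Toy data for illustration only; not rows from customer_hm.csv.
toy = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "age": [17, 25, 49, 16],
})

before = toy.shape                      # (rows, columns) before filtering
filtered = toy.loc[toy["age"] >= 18]    # example step: keep ages 18 and over
after = filtered.shape                  # (rows, columns) after filtering

rows_dropped = before[0] - after[0]
print(f"before={before}, after={after}, rows dropped={rows_dropped}")
```

A row filter changes only the first element of `shape`; dropping or adding columns changes the second.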

### 3. `info()`: Dtypes and missing-value overview

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 6 columns):
 #   Column                  Non-Null Count    Dtype 
---  ------                  --------------    ----- 
 0   customer_id             1048575 non-null  object
 1   FN                      1048575 non-null  int64 
 2   Active                  1048575 non-null  int64 
 3   club_member_status      1048575 non-null  object
 4   fashion_news_frequency  1048574 non-null  object
 5   age                     1048575 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 48.0+ MB


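**Explanation:** Shows each column's dtype and non-null count in one call. Here `fashion_news_frequency` has 1048574 non-null values out of 1048575 rows, so it contains one missing value.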
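
`info()` prints its summary rather than returning it, so the same figures are sometimes wanted as objects. A minimal sketch, assuming a made-up frame `toy` with one missing value:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not rows from customer_hm.csv.
toy = pd.DataFrame({
    "FN": [0, 1, 0],
    "fashion_news_frequency": ["NONE", np.nan, "Regularly"],
})

dtypes = toy.dtypes               # dtype per column
non_null = toy.notna().sum()      # same figure as info()'s "Non-Null Count"
missing = len(toy) - non_null     # row count minus non-null = missing per column
print(missing)
```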

### 4. `describe()`: Summary statistics for numeric columns

In [5]:
df.describe()

| | FN | Active | age |
|---|---|---|---|
| count | 1048575.0 | 1048575.0 | 1048575.0 |
| mean | 0.3555397 | 0.3461326 | 36.36919 |
| std | 0.4786767 | 0.4757363 | 14.30899 |
| min | 0.0 | 0.0 | 16.0 |
| 25% | 0.0 | 0.0 | 24.0 |
| 50% | 0.0 | 0.0 | 32.0 |
| 75% | 1.0 | 1.0 | 49.0 |
| max | 1.0 | 1.0 | 99.0 |


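**Explanation:** Gives a feel for range, distribution, and possible outliers. For example, `age` spans 16 to 99 with a median of 32.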
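
One common rule of thumb for turning these quartiles into an outlier check is the 1.5 × IQR fence. The notebook itself does not apply it; the sketch below simply plugs in the `age` quartiles reported by `describe()`.

```python
# Quartiles and extremes of `age`, copied from the describe() output.
q1, q3 = 24.0, 49.0
age_min, age_max = 16.0, 99.0

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this would be flagged
upper_fence = q3 + 1.5 * iqr     # values above this would be flagged

print(lower_fence, upper_fence)
print("max beyond upper fence:", age_max > upper_fence)
print("min beyond lower fence:", age_min < lower_fence)
```

Under this rule the maximum age of 99 lies beyond the upper fence while the minimum of 16 does not. Whether such ages are errors or real customers is a judgment the rule cannot make.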

### 5. Column selection: multiple columns with a list

In [6]:
df[['customer_id','age','club_member_status']].head()

| | customer_id | age | club_member_status |
|---|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | 49 | ACTIVE |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | 25 | ACTIVE |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 24 | ACTIVE |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | 54 | ACTIVE |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | 52 | ACTIVE |


**Explanation:** `[['col1','col2']]` returns a **DataFrame** with the columns in the order listed.

### 6. `Series` vs `DataFrame`: how the selection differs

In [7]:
s = df['age']     # Series
d = df[['age']]   # DataFrame
type(s), type(d)

(pandas.core.series.Series, pandas.core.frame.DataFrame)

**Explanation:** Single brackets → Series (1-D); double brackets → DataFrame (2-D).
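
The dimensionality difference shows up directly in `ndim` and `shape`. A minimal sketch on a made-up frame (`toy` is an assumption for illustration):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"age": [49, 25, 24], "FN": [0, 0, 1]})

s = toy["age"]      # single brackets -> Series
d = toy[["age"]]    # double brackets -> one-column DataFrame

print(s.ndim, s.shape)   # 1-D: shape has a single entry
print(d.ndim, d.shape)   # 2-D: shape is (rows, columns)
```

This matters when a downstream function expects 2-D input: the one-column DataFrame keeps its column label and a `(rows, 1)` shape, while the Series does not.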

### 7. `.loc`: Label-based slicing (end included)

In [8]:
df.loc[0:5, ['customer_id','club_member_status','age']]

| | customer_id | club_member_status | age |
|---|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | ACTIVE | 49 |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | ACTIVE | 25 |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | ACTIVE | 24 |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | ACTIVE | 54 |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | ACTIVE | 52 |
| 5 | 0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d... | ACTIVE | 20 |


**Explanation:** `.loc[a:b]` **includes b**. In this call, `0:5` returns six rows (labels 0 through 5).

### 8. `.iloc`: Integer-position slicing (end excluded)

In [9]:
df.iloc[0:6, [0,1,2,5]]

| | customer_id | FN | Active | age |
|---|---|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | 0 | 0 | 49 |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | 0 | 0 | 25 |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 0 | 0 | 24 |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | 0 | 0 | 54 |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | 1 | 1 | 52 |
| 5 | 0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d... | 0 | 0 | 20 |


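**Explanation:** `.iloc[a:b]` **excludes b**. Here `0:6` returns positions 0 through 5, which are the same six rows as `.loc[0:5]` only because the index is the default `RangeIndex`.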
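
The label/position distinction is easier to see when the index is not the default 0, 1, 2, and so on. A minimal sketch, assuming a made-up frame whose index labels (10, 20, 30, 40) are chosen purely for illustration:

```python
import pandas as pd

# Toy frame with a non-default integer index.
toy = pd.DataFrame({"age": [49, 25, 24, 54]}, index=[10, 20, 30, 40])

by_label = toy.loc[10:30]      # labels 10 through 30, end label included
by_position = toy.iloc[0:2]    # positions 0 and 1, end position excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

`.loc[10:30]` selects three rows because the slice is interpreted as labels, while `.iloc[0:2]` selects two rows because it follows ordinary Python slice semantics.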

### 9. Boolean filter: `age >= 50`

In [10]:
df.loc[df['age'] >= 50, ['customer_id','age','Active']].head()

| | customer_id | age | Active |
|---|---|---|---|
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | 54 | 0 |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | 52 | 1 |
| 12 | 00009d946eec3ea54add5ba56d5210ea898def4b46c685... | 56 | 1 |
| 14 | 0000b2f1829e23b24feec422ef13df3ccedaedc85368e6... | 54 | 1 |
| 15 | 0000b7a134c3ec0d8842fad1fd4ca28517424c14fc4848... | 75 | 0 |


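**Explanation:** Keeps only the rows where the condition is True. The surviving rows keep their original index labels (3, 4, 12, ...).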
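
It can help to look at the mask itself: it is a boolean Series with one value per row, so summing it counts the matches. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"age": [49, 25, 54, 75]})

mask = toy["age"] >= 50                      # boolean Series, one value per row
n_match = int(mask.sum())                    # True counts as 1
older = toy.loc[mask]                        # original labels are preserved
renumbered = older.reset_index(drop=True)    # optional: relabel from 0

print(n_match, older.index.tolist(), renumbered.index.tolist())
```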

### 10. Compound conditions: using `&`

In [11]:
mask = (df['Active'] == 1) & (df['fashion_news_frequency'] == 'Regularly')
df.loc[mask, ['customer_id','fashion_news_frequency','age']].head()

| | customer_id | fashion_news_frequency | age |
|---|---|---|---|
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | Regularly | 52 |
| 6 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | Regularly | 32 |
| 12 | 00009d946eec3ea54add5ba56d5210ea898def4b46c685... | Regularly | 56 |
| 13 | 0000ae1bbb25e04bdc7e35f718e852adfb3fbb72ef38b3... | Regularly | 29 |
| 14 | 0000b2f1829e23b24feec422ef13df3ccedaedc85368e6... | Regularly | 54 |


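**Explanation:** `&` is AND, `|` is OR. Each condition must be wrapped in parentheses because `&` and `|` bind more tightly than comparison operators such as `==`.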
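
Naming each condition first is one way to sidestep the parenthesis issue, and it makes OR and NOT (`~`) easy to read as well. A minimal sketch on made-up data (`toy` and the mask names are assumptions for illustration):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Active": [0, 1, 1, 0],
    "fashion_news_frequency": ["NONE", "Regularly", "NONE", "Regularly"],
})

is_active = toy["Active"] == 1
is_regular = toy["fashion_news_frequency"] == "Regularly"

both = toy.loc[is_active & is_regular]          # AND: both conditions hold
either = toy.loc[is_active | is_regular]        # OR: at least one holds
neither = toy.loc[~(is_active | is_regular)]    # NOT: negate a mask with ~

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```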

### 11. `value_counts()`: Category distribution

In [12]:
df['club_member_status'].value_counts()

club_member_status
ACTIVE        982635
PRE-CREATE     65581
LEFT CLUB        359
Name: count, dtype: int64

**Explanation:** Useful for checking the frequency of each category; pass `normalize=True` to get proportions instead of counts.
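
The counts-versus-proportions difference can be sketched on a tiny made-up Series (the four values here are assumptions for illustration, not the real distribution):

```python
import pandas as pd

# Toy data for illustration only.
status = pd.Series(
    ["ACTIVE", "ACTIVE", "ACTIVE", "PRE-CREATE"],
    name="club_member_status",
)

counts = status.value_counts()                  # absolute frequencies
shares = status.value_counts(normalize=True)    # proportions that sum to 1

print(counts)
print(shares)
```

Both results are sorted by frequency in descending order, and by default missing values are left out of the tally.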

### 12. `isnull().sum()`: Missing-value counts

In [13]:
df.isnull().sum()

customer_id               0
FN                        0
Active                    0
club_member_status        0
fashion_news_frequency    1
age                       0
dtype: int64

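**Explanation:** These counts are the starting point for choosing a missing-value strategy. Here only `fashion_news_frequency` has a missing value, and only one.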
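
Two common strategies are dropping the affected rows or filling them with a placeholder. The notebook does not say which it prefers, so the sketch below shows both on made-up data; the `"UNKNOWN"` label is a hypothetical choice, not a value from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "fashion_news_frequency": ["NONE", np.nan, "Regularly"],
})

# Option A: drop rows where this column is missing.
dropped = toy.dropna(subset=["fashion_news_frequency"])

# Option B: fill with a placeholder label ("UNKNOWN" is a hypothetical choice).
filled = toy.fillna({"fashion_news_frequency": "UNKNOWN"})

print(len(dropped), filled["fashion_news_frequency"].tolist())
```

Both calls return new frames and leave `toy` unchanged. With a single missing row out of roughly a million, either option would have a negligible effect on the row count.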

### 13. `to_csv`: Saving the filtered result

In [14]:
adults = df.loc[df['age'] >= 18]
adults.to_csv('hm_adults.csv', index=False)
print('saved: hm_adults.csv')

saved: hm_adults.csv


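**Explanation:** Writing preprocessing outputs to a file improves reproducibility and makes collaboration easier. `index=False` keeps the row labels out of the file.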
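
A quick way to confirm what was written is to read the file back. The sketch below does this round trip on made-up data in a temporary directory; `toy` and the file name `toy_adults.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"customer_id": ["a", "b", "c"], "age": [17, 25, 49]})
adults = toy.loc[toy["age"] >= 18]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "toy_adults.csv")   # hypothetical file name
    adults.to_csv(out_path, index=False)             # row labels are not written
    reloaded = pd.read_csv(out_path)                 # fresh 0-based index on load

print(reloaded.columns.tolist(), reloaded.index.tolist())
```

Without `index=False`, the row labels would be written as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`.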