# Analysis Assumptions & Data Considerations

This analysis uses the Kaggle H&M dataset and rests on the preprocessing rules and interpretation caveats below.

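1. Missing values and outliers

age may contain unrealistic values (age 0, extreme old age, and so on), so outliers must be checked before deriving age segments.

price may contain zeros or extreme values, which could reflect free giveaways, entry errors, or promotions. This analysis interprets price as the unit purchase price.

State explicitly whether customers whose club_member_status is LEFT CLUB are included.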
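The checks described in this section can be sketched as follows. This is a minimal illustration on a tiny made-up frame: the 10 to 100 age bounds mirror the filter applied later in this notebook, while the 0.4 price cut-off and the sample rows are assumptions chosen only for demonstration.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the customer and transaction tables
cust = pd.DataFrame({"age": [0, 25, 49, 101, 38]})
tran = pd.DataFrame({"price": [0.0, 0.016932, 0.033881, 0.5, 0.00322]})

# Flag ages outside a plausible range before deriving age segments
age_outliers = cust[(cust["age"] < 10) | (cust["age"] > 100)]

# Count zero prices and prices at or above an illustrative cut-off
zero_price_count = int((tran["price"] == 0).sum())
extreme_price_count = int((tran["price"] >= 0.4).sum())

print(len(age_outliers), zero_price_count, extreme_price_count)
```

Counting first, rather than filtering immediately, keeps the decision about what to exclude explicit.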

2. Join strategy

The baseline analysis uses inner joins centered on transactions.

When joining the articles table, check for duplicated columns and keep only the columns that are needed.
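A transactions-centered inner join might look like the sketch below. The miniature tables and their values are hypothetical; only the column names follow the dataset. Passing `validate="many_to_one"` is an optional safeguard that raises an error if a key is duplicated on the lookup side.

```python
import pandas as pd

# Hypothetical miniature tables standing in for transactions, customers, and articles
tran = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "article_id": [1, 2, 9],
    "price": [0.01, 0.02, 0.03],
})
cust = pd.DataFrame({"customer_id": ["a", "b"], "age": [25, 49]})
art = pd.DataFrame({
    "article_id": [1, 2],
    "prod_name": ["Tee", "Sock"],
    "index_group_name": ["Ladieswear", "Menswear"],
})

# Keep only the article columns that are needed, so no redundant columns are carried along
art_cols = art[["article_id", "index_group_name"]]

# Inner joins centered on transactions; rows without a match on either side are dropped
merged = (
    tran
    .merge(cust, on="customer_id", how="inner", validate="many_to_one")
    .merge(art_cols, on="article_id", how="inner", validate="many_to_one")
)
print(merged.shape)
```

Because the join is inner, the transaction for customer "c" disappears; comparing row counts before and after the merge shows how much data the join discards.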

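3. Revenue interpretation

The data has no quantity field, so the sum of price is used as revenue.

The currency is taken to be SEK; because the price values appear scaled (for example 0.016932), results are interpreted through relative comparison rather than absolute magnitude.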
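One way to keep the interpretation relative is to report shares rather than totals. The sketch below sums price per channel and divides by the overall sum; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical transactions; each row counts as one purchased item
tran = pd.DataFrame({
    "sales_channel_id": [1, 2, 2, 1, 2],
    "price": [0.01, 0.02, 0.03, 0.01, 0.03],
})

# Revenue proxy: the sum of price per group
revenue = tran.groupby("sales_channel_id")["price"].sum()

# Relative view: each group's share of total revenue
share = revenue / revenue.sum()
print(share)
```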

4. Time handling

t_dat is converted to datetime, and month and day-of-week features are derived from it.

Seasonal events may drive monthly spikes, so a monthly surge is not assumed to be growth.
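The month and day-of-week derivation can be sketched as below. The three dates are taken from the transaction preview in this notebook; the column names `month` and `weekday` are illustrative choices, not names used elsewhere in the analysis.

```python
import pandas as pd

tran = pd.DataFrame({"t_dat": ["2019-11-05", "2019-05-22", "2019-08-10"]})

# Parse the date strings with an explicit format
tran["t_dat"] = pd.to_datetime(tran["t_dat"], format="%Y-%m-%d")

# Derived time features
tran["month"] = tran["t_dat"].dt.month
tran["weekday"] = tran["t_dat"].dt.day_name()
print(tran)
```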

5. Data bias

This is a public Kaggle sample and does not represent the full customer base.

The online channel may be overrepresented.

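6. Business interpretation caveats

Observed patterns are correlations, not evidence of causation.

A top-revenue product is not assumed to be a trend-leading product.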

In [346]:
import pandas as pd

df_art = pd.read_csv("../../h&m dataset/articles_hm.csv")
df_cust = pd.read_csv("../../h&m dataset/customer_hm.csv")
df_tran = pd.read_csv("../../h&m dataset/transactions_hm.csv")

df_tran.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2019-11-05,3e2b60b679e62fb49516105b975560082922011dd752ec...,698328010,0.016932,2
1,2019-05-22,89647ac2274f54c770aaa4b326e0eea09610c252381f37...,760597002,0.033881,2
2,2019-05-10,2ebe392150feb60ca89caa8eff6c08b7ef1138cd6fdc71...,488561032,0.016932,2
3,2019-08-26,7b3205de4ca17a339624eb5e3086698e9984eba6b47c56...,682771001,0.033881,2
4,2019-08-10,3b77905de8b32045f08cedb79200cdfa477e9562429a39...,742400033,0.00322,1


In [347]:
df_art2 = df_art.copy()
df_cust2 = df_cust.copy()
df_tran2 = df_tran.copy()


In [348]:
df_art2.info()

<class 'pandas.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   article_id                    105542 non-null  int64
 1   product_code                  105542 non-null  int64
 2   prod_name                     105542 non-null  str  
 3   product_type_no               105542 non-null  int64
 4   product_type_name             105542 non-null  str  
 5   product_group_name            105542 non-null  str  
 6   graphical_appearance_no       105542 non-null  int64
 7   graphical_appearance_name     105542 non-null  str  
 8   colour_group_code             105542 non-null  int64
 9   colour_group_name             105542 non-null  str  
 10  perceived_colour_value_id     105542 non-null  int64
 11  perceived_colour_value_name   105542 non-null  str  
 12  perceived_colour_master_id    105542 non-null  int64
 13  perceived_colour_master_n

In [349]:
df_cust2.info()

<class 'pandas.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 6 columns):
 #   Column                  Non-Null Count    Dtype
---  ------                  --------------    -----
 0   customer_id             1048575 non-null  str  
 1   FN                      1048575 non-null  int64
 2   Active                  1048575 non-null  int64
 3   club_member_status      1048575 non-null  str  
 4   fashion_news_frequency  1048574 non-null  str  
 5   age                     1048575 non-null  int64
dtypes: int64(3), str(3)
memory usage: 48.0 MB


In [350]:
df_tran2.info()

<class 'pandas.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 5 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   t_dat             1048575 non-null  str    
 1   customer_id       1048575 non-null  str    
 2   article_id        1048575 non-null  int64  
 3   price             1048575 non-null  float64
 4   sales_channel_id  1048575 non-null  int64  
dtypes: float64(1), int64(2), str(2)
memory usage: 40.0 MB


In [351]:
df_cust2.shape

(1048575, 6)

In [352]:
df_art2.shape

(105542, 25)

In [353]:
df_tran2.shape 
# About 1.05 million transaction rows (tran) form the core; customer (cust, about 1.05 million rows) and article (art, about 106,000 items) data will be joined to them for analysis

(1048575, 5)

## Dataset Overview

- Articles: 105,542 products × 25 attributes  
- Customers: 1,048,575 customers × 6 attributes  
- Transactions: 1,048,575 purchase records × 5 attributes (1,040,101 after removing duplicate rows)  

This analysis uses transactions as the central fact table and joins the customer and article tables to it.

In [354]:
df_art2 = df_art2.drop_duplicates()
df_cust2 = df_cust2.drop_duplicates() 
df_tran2 = df_tran2.drop_duplicates()

# Drop duplicate rows
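Before dropping duplicates it can help to count them. The row counts reported in this notebook (1,048,575 before and 1,040,101 after) imply that 8,474 transaction rows are removed. Since the data has no quantity column, identical rows may be repeated purchases of the same item rather than errors; that is an assumption worth checking, not something this notebook confirms. The sketch below, on hypothetical rows, counts duplicates and shows an alternative that keeps them as a `quantity` column (a name introduced here for illustration).

```python
import pandas as pd

# Hypothetical rows; the first two are identical
tran = pd.DataFrame({
    "t_dat": ["2019-05-10", "2019-05-10", "2019-08-26"],
    "customer_id": ["a", "a", "b"],
    "article_id": [488561032, 488561032, 682771001],
    "price": [0.016932, 0.016932, 0.033881],
})

# How many rows would drop_duplicates() remove?
n_dupes = int(tran.duplicated().sum())
deduped = tran.drop_duplicates()

# Alternative: collapse identical rows but remember how often each occurred
with_qty = (
    tran.groupby(list(tran.columns), as_index=False)
    .size()
    .rename(columns={"size": "quantity"})
)
print(n_dupes, len(deduped))
print(with_qty)
```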

In [355]:
print(df_cust2.isnull().sum())

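# Among cust, tran, and art, missing values appear in customer (1 in fashion_news_frequency) and in articles (416 in detail_desc, handled later); transactions has none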

customer_id               0
FN                        0
Active                    0
club_member_status        0
fashion_news_frequency    1
age                       0
dtype: int64
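To avoid checking one table at a time, the missing-value count can be collected for all three tables in a single pass. The frames below are small hypothetical stand-ins for the customer, transaction, and article tables.

```python
import pandas as pd

# Hypothetical stand-ins for the three tables
tables = {
    "cust": pd.DataFrame({"fashion_news_frequency": ["NONE", None], "age": [38, 25]}),
    "tran": pd.DataFrame({"price": [0.01, 0.02]}),
    "art": pd.DataFrame({"detail_desc": [None, "T-shirt"]}),
}

# Total number of missing cells per table
missing = {name: int(df.isnull().sum().sum()) for name, df in tables.items()}
print(missing)
```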


In [356]:
df_cust2[df_cust2['fashion_news_frequency'].isnull()]

# This is the row with the missing value

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age
876108,a79d9cbfaceb0d25a91caccfad167d4d6391fd5bb4292b...,1,0,ACTIVE,,38


In [357]:
df_cust2['fashion_news_frequency'] = df_cust2['fashion_news_frequency'].fillna('Unknown_but_FN1')
# The missing fashion_news_frequency value is treated as an uncollected-information group and kept as a separate category.

In [358]:
df_cust2[df_cust2['fashion_news_frequency'].isnull()]

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age


In [359]:
df_cust2['fashion_news_frequency'].value_counts()

fashion_news_frequency
NONE               674698
Regularly          373218
Monthly               658
Unknown_but_FN1         1
Name: count, dtype: int64

The customer table is now preprocessed; next, the transactions table.

In [360]:
df_tran2.info()

<class 'pandas.DataFrame'>
Index: 1040101 entries, 0 to 1048574
Data columns (total 5 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   t_dat             1040101 non-null  str    
 1   customer_id       1040101 non-null  str    
 2   article_id        1040101 non-null  int64  
 3   price             1040101 non-null  float64
 4   sales_channel_id  1040101 non-null  int64  
dtypes: float64(1), int64(2), str(2)
memory usage: 47.6 MB


In [361]:
df_tran2.isnull().sum()

t_dat               0
customer_id         0
article_id          0
price               0
sales_channel_id    0
dtype: int64

In [362]:
df_tran2["t_dat"] = pd.to_datetime(df_tran2["t_dat"], format="%Y-%m-%d")

In [363]:
df_tran2.info()

<class 'pandas.DataFrame'>
Index: 1040101 entries, 0 to 1048574
Data columns (total 5 columns):
 #   Column            Non-Null Count    Dtype         
---  ------            --------------    -----         
 0   t_dat             1040101 non-null  datetime64[us]
 1   customer_id       1040101 non-null  str           
 2   article_id        1040101 non-null  int64         
 3   price             1040101 non-null  float64       
 4   sales_channel_id  1040101 non-null  int64         
dtypes: datetime64[us](1), float64(1), int64(2), str(1)
memory usage: 47.6 MB


In [364]:
# 1 = offline store ('오프라인'), 2 = online ('온라인')
df_tran2['sales_channel_id'] = df_tran2['sales_channel_id'].map({1: '오프라인', 2: '온라인'})

In [365]:
# '판매채널' means 'sales channel'
df_tran2.rename(columns={'sales_channel_id': '판매채널'}, inplace=True)
df_tran2.head()

Unnamed: 0,t_dat,customer_id,article_id,price,판매채널
0,2019-11-05,3e2b60b679e62fb49516105b975560082922011dd752ec...,698328010,0.016932,온라인
1,2019-05-22,89647ac2274f54c770aaa4b326e0eea09610c252381f37...,760597002,0.033881,온라인
2,2019-05-10,2ebe392150feb60ca89caa8eff6c08b7ef1138cd6fdc71...,488561032,0.016932,온라인
3,2019-08-26,7b3205de4ca17a339624eb5e3086698e9984eba6b47c56...,682771001,0.033881,온라인
4,2019-08-10,3b77905de8b32045f08cedb79200cdfa477e9562429a39...,742400033,0.00322,오프라인


In [366]:
df_tran2['price'].describe()  # result is not shown: only the last expression of a cell is displayed
# Count transactions at or above each price threshold
for threshold in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"price >= {threshold}:", (df_tran2["price"] >= threshold).sum())


price >= 0.1: 10704
price >= 0.2: 719
price >= 0.3: 126
price >= 0.4: 29
price >= 0.5: 2


In [367]:
high_price_transactions = df_tran2[df_tran2["price"] >= 0.4]
# '판매채널' already holds the mapped labels, so no further replace() is needed
high_price_transactions['판매채널'].value_counts()

판매채널
온라인    29
Name: count, dtype: int64

In [368]:
df_tran2['year_month'] = df_tran2['t_dat'].dt.to_period('M')

In [369]:
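# Number of transactions per month (row counts, not revenue)
monthly_sales = df_tran2.groupby('year_month').size()
# Revenue proxy per month: the sum of price
monthly_price = df_tran2.groupby('year_month')['price'].sum()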

print(monthly_price)

year_month
2019-01    2129.926131
2019-02    1989.217641
2019-03    2374.905504
2019-04    2703.443538
2019-05    2748.199469
2019-06    3088.776976
2019-07    2552.035334
2019-08    1943.422489
2019-09    2559.226862
2019-10    2358.486793
2019-11    2463.769270
2019-12    1985.492149
Freq: M, Name: price, dtype: float64


In [370]:
best_month = monthly_price.idxmax()
best_value = monthly_price.max()

print("Month with the highest revenue:", best_month)
print("Revenue in that month:", best_value)

Month with the highest revenue: 2019-06
Revenue in that month: 3088.776976142
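A single peak month says little on its own. As a rough complement, the sketch below re-enters the monthly sums printed earlier and computes each month's share of the annual total and the month-over-month change. The variable names `share` and `mom_change` are illustrative, and the series is rebuilt by hand only so the example is self-contained.

```python
import pandas as pd

# Monthly price sums copied from the monthly_price output in this notebook
monthly_price = pd.Series(
    [2129.926131, 1989.217641, 2374.905504, 2703.443538, 2748.199469, 3088.776976,
     2552.035334, 1943.422489, 2559.226862, 2358.486793, 2463.769270, 1985.492149],
    index=pd.period_range("2019-01", "2019-12", freq="M"),
    name="price",
)

# Share of the annual total per month
share = monthly_price / monthly_price.sum()

# Relative change versus the previous month (first month has no predecessor)
mom_change = monthly_price.pct_change()

print(share.round(3))
print(mom_change.round(3))
```

Reading the peak alongside the changes before and after it makes it easier to tell a seasonal spike from a sustained trend.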


Next, preprocess the articles table.

In [371]:
df_art2.info()

<class 'pandas.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   article_id                    105542 non-null  int64
 1   product_code                  105542 non-null  int64
 2   prod_name                     105542 non-null  str  
 3   product_type_no               105542 non-null  int64
 4   product_type_name             105542 non-null  str  
 5   product_group_name            105542 non-null  str  
 6   graphical_appearance_no       105542 non-null  int64
 7   graphical_appearance_name     105542 non-null  str  
 8   colour_group_code             105542 non-null  int64
 9   colour_group_name             105542 non-null  str  
 10  perceived_colour_value_id     105542 non-null  int64
 11  perceived_colour_value_name   105542 non-null  str  
 12  perceived_colour_master_id    105542 non-null  int64
 13  perceived_colour_master_n

In [372]:
df_art2.isnull().sum() 

article_id                        0
product_code                      0
prod_name                         0
product_type_no                   0
product_type_name                 0
product_group_name                0
graphical_appearance_no           0
graphical_appearance_name         0
colour_group_code                 0
colour_group_name                 0
perceived_colour_value_id         0
perceived_colour_value_name       0
perceived_colour_master_id        0
perceived_colour_master_name      0
department_no                     0
department_name                   0
index_code                        0
index_name                        0
index_group_no                    0
index_group_name                  0
section_no                        0
section_name                      0
garment_group_no                  0
garment_group_name                0
detail_desc                     416
dtype: int64

In [373]:
df_art2["detail_desc"] = df_art2["detail_desc"].fillna("No description")
# Articles with no product description get the placeholder 'No description' in place of the missing value

In [374]:
df_art2.isnull().sum()

article_id                      0
product_code                    0
prod_name                       0
product_type_no                 0
product_type_name               0
product_group_name              0
graphical_appearance_no         0
graphical_appearance_name       0
colour_group_code               0
colour_group_name               0
perceived_colour_value_id       0
perceived_colour_value_name     0
perceived_colour_master_id      0
perceived_colour_master_name    0
department_no                   0
department_name                 0
index_code                      0
index_name                      0
index_group_no                  0
index_group_name                0
section_no                      0
section_name                    0
garment_group_no                0
garment_group_name              0
detail_desc                     0
dtype: int64

In [375]:
df_art2['detail_desc'].value_counts().head()

detail_desc
No description                                                        416
T-shirt in printed cotton jersey.                                     159
Leggings in soft organic cotton jersey with an elasticated waist.     138
T-shirt in soft, printed cotton jersey.                               137
Socks in a soft, jacquard-knit cotton blend with elasticated tops.    136
Name: count, dtype: int64

In [376]:
cols_to_drop = [
    'product_type_no', 
    'graphical_appearance_no', 
    'colour_group_code', 
    'perceived_colour_value_id',
    'perceived_colour_master_id',
    'department_no',
    'index_code',
    'index_group_no',
    'section_no',
    'garment_group_no'
]
df_art_cleaned = df_art2.drop(columns=cols_to_drop) 
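Dropping the numeric code columns is safe only if each code carries the same information as its name column. A quick check, sketched here on hypothetical rows for one code and name pair, is to count the distinct names per code.

```python
import pandas as pd

# Hypothetical rows for one code/name pair
art = pd.DataFrame({
    "product_type_no": [253, 253, 306],
    "product_type_name": ["Vest top", "Vest top", "Bra"],
})

# Each code should map to exactly one name if the code column is redundant
names_per_code = art.groupby("product_type_no")["product_type_name"].nunique()
each_code_has_one_name = bool((names_per_code == 1).all())
print(each_code_has_one_name)
```

The same check can be repeated for the other pairs in cols_to_drop, and in the reverse direction if names must also map to a single code.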

In [385]:
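# Create derived variables and give them more intuitive names
# Note: get_season is defined in cell In [384] further below; that cell must be run first
df_art_cleaned['product_season'] = df_art_cleaned['section_name'].apply(get_season)
df_art_cleaned['category_main'] = df_art_cleaned['index_group_name']
# Keep the original column and copy it under a new name (defines the main category)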

In [378]:
status_map = {'ACTIVE': 2, 'PRE-CREATE': 1, 'LEFT CLUB': 0}
frequency_map = {'Regularly': 2, 'Monthly': 1, 'NONE': 0, 'Unknown_but_FN1': 0.5}

In [379]:
df_cust2['club_member_status'] = df_cust2['club_member_status'].map(status_map)
df_cust2['fashion_news_frequency'] = df_cust2['fashion_news_frequency'].map(frequency_map)
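`Series.map` with a dictionary silently turns any value that is not a key into NaN, so it is worth checking for unmapped values right after mapping. The sketch below reuses status_map from this notebook; the lowercase 'active' entry is a hypothetical bad value added to show the failure mode.

```python
import pandas as pd

status_map = {'ACTIVE': 2, 'PRE-CREATE': 1, 'LEFT CLUB': 0}

# The last value is a hypothetical typo that the mapping does not cover
status = pd.Series(['ACTIVE', 'PRE-CREATE', 'LEFT CLUB', 'active'])

mapped = status.map(status_map)

# Original values that ended up as NaN after mapping
unmapped = status[mapped.isna()].unique().tolist()
print(unmapped)
```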

In [380]:
df_cust2['club_member_status'].value_counts()

club_member_status
2    982635
1     65581
0       359
Name: count, dtype: int64

In [381]:
df_cust2['fashion_news_frequency'].value_counts()

fashion_news_frequency
0.0    674698
2.0    373218
1.0       658
0.5         1
Name: count, dtype: int64

In [382]:
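def cate_age(age):
    # Labels are Korean age bands: '10대' = teens, '20대' = 20s, ..., '60대 이상' = 60 and over
    if age < 20: return '10대'
    elif age < 30: return '20대'
    elif age < 40: return '30대'
    elif age < 50: return '40대'
    elif age < 60: return '50대'
    else: return '60대 이상'

# Keep ages 10 to 100 only; .copy() avoids assigning into a slice of the original frame
df_cust2 = df_cust2[(df_cust2['age'] >= 10) & (df_cust2['age'] <= 100)].copy()
df_cust2['age_segment'] = df_cust2['age'].apply(cate_age)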
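As an alternative to a row-wise apply, the same age bands can be produced with `pd.cut`. This is a sketch, not the method used in this notebook: the bin edges are chosen to match cate_age (left-closed intervals via `right=False`), and the upper edge of 200 is an arbitrary cap.

```python
import pandas as pd

ages = pd.Series([15, 19, 20, 29, 38, 49, 54, 60, 100])

# Left-closed bins: [0, 20), [20, 30), ..., [60, 200)
bins = [0, 20, 30, 40, 50, 60, 200]
labels = ['10대', '20대', '30대', '40대', '50대', '60대 이상']

segments = pd.cut(ages, bins=bins, labels=labels, right=False)
print(segments.tolist())
```

The result is a categorical column with an explicit order, which keeps the bands sorted by age in tables and charts.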

In [383]:
df_cust2.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,age_segment
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0,0,2,0.0,49,40대
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0,0,2,0.0,25,20대
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0,0,2,0.0,24,20대
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0,0,2,0.0,54,50대
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1,1,2,2.0,52,50대


In [384]:
# Mappings and derived variables (season / high vs. low involvement / new-product flag / colour tone)
# 1. Season mapping (product_season): keywords that define seasonality
def get_season(section):
    section = section.lower()
    # FW
    if any(kw in section for kw in ['outerwear', 'nightwear', 'socks', 'tights', 'knitted']):
        return 'FW'
    # SS
    elif any(kw in section for kw in ['swimwear', 'sport', 'shorts', 'sandals']):
        return 'SS'
    # Year-round items (all season)
    else:
        return 'All-Season'
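The keyword rule is a case-insensitive substring match in which the FW keywords are checked first. The sketch below repeats the function so it runs on its own and applies it to a few section names; the names are illustrative examples, not a list taken from the data.

```python
def get_season(section):
    section = section.lower()
    # FW keywords are tested first, so they win when both groups match
    if any(kw in section for kw in ['outerwear', 'nightwear', 'socks', 'tights', 'knitted']):
        return 'FW'
    elif any(kw in section for kw in ['swimwear', 'sport', 'shorts', 'sandals']):
        return 'SS'
    else:
        return 'All-Season'

# Illustrative section names (assumed for this example)
samples = [
    "Mens Outerwear",
    "Womens Swimwear, beachwear",
    "Ladies H&M Sport",
    "Womens Everyday Basics",
    "Kids Sports Socks",
]
results = {name: get_season(name) for name in samples}
for name, season in results.items():
    print(name, "->", season)
```

A name containing both "sport" and "socks" is labelled FW because of the check order, and all sport sections fall under SS regardless of the actual season, so the labels are best treated as a coarse heuristic.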