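## preprocessing.py design

[1. Import libraries](#1-import-libraries)  
[2. Remove duplicate rows](#2-remove-duplicate-rows)  
[3. Keyword-based filtering](#3-keyword-based-filtering)  
[4. Category-based filtering](#4-category-based-filtering)  
[5. Handle missing values](#5-handle-missing-values)  
[6. Handle outliers](#6-handle-outliers)  
[7. Process the seller_location column](#7-process-the-seller_location-column)  
[8. Save as pickle](#8-save-as-pickle)  

### 1. Import libraries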

In [190]:
import pandas as pd
import os

In [191]:
current_directory = os.getcwd()
data_directory = os.path.join(current_directory, '..', 'data')
data_file_path = os.path.join(data_directory, '호텔_results_20241010_113155.xlsx')
df = pd.read_excel(data_file_path)

Input sources  
- xlsx
- mysql
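
The two input types listed above could share a single entry point. The sketch below is one hypothetical way to do that: `load_listings` is an illustrative name that does not appear in the original script, and an in-memory SQLite connection stands in for MySQL so the example stays self-contained. For a real MySQL source, `pd.read_sql` would typically receive a SQLAlchemy engine instead; the connection details are left open here.

```python
import sqlite3

import pandas as pd


def load_listings(source, query=None):
    """Hypothetical loader: an .xlsx path goes to read_excel, anything else to read_sql."""
    if isinstance(source, str) and source.lower().endswith(".xlsx"):
        return pd.read_excel(source)
    if query is None:
        raise ValueError("a SQL query is required for database sources")
    return pd.read_sql(query, source)


# Stand-in database (assumption): SQLite in memory instead of MySQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE listings (original_link TEXT, price INTEGER)")
con.execute("INSERT INTO listings VALUES ('https://example.com/a', 44000)")

df_db = load_listings(con, "SELECT * FROM listings")
print(df_db)
```

Either branch returns a plain DataFrame, so the later steps do not need to know where the rows came from.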

### 2. Remove duplicate rows

Unique key: original_link

In [192]:
# Drop duplicate rows based on original_link
df = df.drop_duplicates(subset=['original_link']).reset_index(drop=True)
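
As a quick illustration of what this step does, the sketch below runs the same call on a tiny made-up frame (the URLs and titles are hypothetical, not taken from the dataset). By default `drop_duplicates` keeps the first row for each `original_link`.

```python
import pandas as pd

# Hypothetical sample: the first and third rows share the same original_link.
sample = pd.DataFrame({
    "original_link": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
    "title": ["first copy", "other listing", "second copy"],
})

# Same call as in the cell above: keep the first occurrence per link,
# then renumber the rows from 0.
deduped = sample.drop_duplicates(subset=["original_link"]).reset_index(drop=True)

print(deduped["title"].tolist())  # ['first copy', 'other listing']
```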

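### 3. Keyword-based filtering

In [193]:
# Drop listings whose title contains an excluded keyword (rentals, admission tickets, gift vouchers, points, parking passes, coupons, purchase requests, flights, day passes, and similar)
df = df[~df['title'].str.contains('임대|야놀자|입장권|상품권|포인트|야놀|주차권|쿠폰|구매|비행기|종일권|자유이용권', na=False)]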
df.reset_index(drop=True, inplace=True)
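
If the keyword list keeps growing, it may be easier to maintain as a Python list than as a hand-written pattern. The following is a minimal sketch under that assumption: `EXCLUDE_KEYWORDS` and `drop_keyword_rows` are hypothetical names, and `re.escape` is added so that a future keyword containing regex metacharacters is matched literally.

```python
import re

import pandas as pd

# Keywords copied from the filter above. '야놀' already matches '야놀자',
# so the longer form is left out of this list.
EXCLUDE_KEYWORDS = [
    "임대", "입장권", "상품권", "포인트", "야놀", "주차권",
    "쿠폰", "구매", "비행기", "종일권", "자유이용권",
]


def drop_keyword_rows(df, column="title", keywords=EXCLUDE_KEYWORDS):
    """Return the rows whose `column` contains none of `keywords`."""
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = df[column].str.contains(pattern, na=False)
    return df[~mask].reset_index(drop=True)


# Hypothetical titles: one clean listing, one with excluded keywords, one missing.
sample = pd.DataFrame({"title": ["롯데호텔 숙박권", "야놀자 쿠폰 팝니다", None]})
filtered = drop_keyword_rows(sample)
print(filtered)
```

Because of `na=False`, a missing title is treated as a non-match and the row is kept, which mirrors the behaviour of the original cell.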

### 4. Category-based filtering

In [194]:
# Rows from 중고나라 (Joonggonara) have a missing value in the category column
df.loc[df['category'].isnull(), 'platform'].unique()

array(['중고나라'], dtype=object)

In [195]:
# Fill the missing categories with '여행/숙박이용권' (travel/accommodation vouchers)
df['category'] = df['category'].fillna("여행/숙박이용권")

In [196]:
# Keep only three categories: 여행/숙박/렌트 (travel/accommodation/rental), 티켓/교환권 (tickets/vouchers), 여행/숙박이용권 (travel/accommodation vouchers)
df = df[df['category'].isin(['여행/숙박/렌트', '티켓/교환권', '여행/숙박이용권'])]
df.reset_index(drop=True, inplace=True)

In [197]:
df

Unnamed: 0,platform,original_link,post_time,title,view_count,like_count,price,images,description,category,...,seller_location,expiration_date,market_price,capacity,parking,options,check_in_out_time,shipping_fee,transaction_location,transaction_method
0,당근마켓,https://www.daangn.com/articles/845853060,2024-10-10 10:27:51.013,10/11-12 금토) 호텔케니여수 스탠다드더블룸 양도해요,7,0,44000,https://img.kr.gcp-karroter.net/origin/article...,못 가게되서요 양도합니다,티켓/교환권,...,담양군 금성면,,,,,,,,,
1,당근마켓,https://www.daangn.com/articles/845851669,2024-10-10 10:24:51.013,●농심호텔 허심청브로이 옥토버페스트 수제맥주축제 OKTOBERFEST 티켓,29,2,28000,https://img.kr.gcp-karroter.net/origin/article...,농심호텔\n허심청브로이 옥토버페스트 수제맥주축제 \nOKTOBERFEST\n2024...,티켓/교환권,...,연제구 연산제2동,,,,,,,,,
2,당근마켓,https://www.daangn.com/articles/845850450,2024-10-10 10:22:51.013,호텔스카이파크센트럴 판교 10/12(토)~13(일),21,0,100000,https://img.kr.gcp-karroter.net/origin/article...,환불불가 상품으로\n개인사정으로 못가게 되어서 팝니다\n객실만이에요\n조식 미포함\...,티켓/교환권,...,광진구 구의제3동,,,,,,,,,
3,당근마켓,https://www.daangn.com/articles/845834566,2024-10-10 09:54:51.013,신라호텔 더 파크뷰 레스토랑 식사권 2매 판매합니다,87,6,360000,https://img.kr.gcp-karroter.net/origin/article...,신라호텔 더 파크뷰 레스토랑 식사권 2매 판매합니다,티켓/교환권,...,영등포구 여의동,,,,,,,,,
4,당근마켓,https://www.daangn.com/articles/845832050,2024-10-10 09:54:51.013,"호텔 드포레 ""주말"" 숙박권 (~11-30) [네이처파크 무료]",354,33,100000,https://img.kr.gcp-karroter.net/origin/article...,스파밸리 호텔\n호텔 드포레\n친환경 힐링 숲속호텔\n(네이처파크 무료이용)\n주중...,티켓/교환권,...,수성구 지산동,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5166,중고나라,https://web.joongna.com/product/177050925,2024-10-10 11:31:55.993,"@전세계 4,5성급 호텔 네이버 최저가에서 25% 할인 예약@",9,0,2000,https://img2.joongna.com/cafe-article-data/liv...,셀러회원으로 다시 돌아왔습니다. 전세계 호텔들을 네이버 호텔 비교 최저가 기준에서 ...,여행/숙박이용권,...,,,,,,,,,,
5167,중고나라,https://web.joongna.com/product/177043994,2024-10-10 11:31:55.993,"@전세계 4,5성급 호텔 네이버 최저가에서 25% 할인 예약@",5,0,2000,https://img2.joongna.com/cafe-article-data/liv...,셀러회원으로 다시 돌아왔습니다. 전세계 호텔들을 네이버 호텔 비교 최저가 기준에서 ...,여행/숙박이용권,...,,,,,,,,,,
5168,중고나라,https://web.joongna.com/product/177036544,2024-10-10 11:31:55.993,"@전세계 4,5성급 호텔 네이버 최저가에서 25% 할인 예약@",10,0,2000,https://img2.joongna.com/cafe-article-data/liv...,셀러회원으로 다시 돌아왔습니다. 전세계 호텔들을 네이버 호텔 비교 최저가 기준에서 ...,여행/숙박이용권,...,,,,,,,,,,
5169,중고나라,https://web.joongna.com/product/177030077,2024-10-10 11:31:55.993,당일 롯데호텔 숙박권,63,0,200000,https://img2.joongna.com/media/original/2024/0...,오늘날짜 숙박권 입니다 저렴하게 판매해요,여행/숙박이용권,...,구의제1동,,,,,,,,,


### 5. Handle missing values

In [198]:
df.isnull().sum()

platform                   0
original_link              0
post_time                  0
title                      0
view_count                 0
like_count                 0
price                      0
images                     3
description               34
category                   0
status                     0
seller_location         4809
expiration_date         5171
market_price            5171
capacity                5171
parking                 5171
options                 5171
check_in_out_time       5171
shipping_fee            5171
transaction_location    5171
transaction_method      5171
dtype: int64

Drop missing values (only rows in which every column is NaN)

In [199]:
df.dropna(axis=0,how='all',inplace=True)
df.reset_index(drop=True,inplace=True)

Fill missing values  
images, description -> "정보없음" (no information)

In [200]:
df[['images','description']] = df[['images','description']].fillna('정보없음')
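
To make the difference between the two operations concrete, the sketch below applies them to a small hypothetical frame (the values are invented for illustration). `how="all"` removes a row only when every column is missing, while `fillna` patches the remaining gaps in the two text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the middle row is entirely NaN, the others only partly.
sample = pd.DataFrame({
    "title": ["hotel voucher", np.nan, "room transfer"],
    "images": ["img.jpg", np.nan, np.nan],
    "description": [np.nan, np.nan, "one night"],
})

# Only the middle row is dropped; partially filled rows survive.
cleaned = sample.dropna(axis=0, how="all").reset_index(drop=True)

# Remaining gaps in the two text columns get the placeholder used above.
cleaned[["images", "description"]] = cleaned[["images", "description"]].fillna("정보없음")

print(cleaned)
```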

In [201]:
df.isnull().sum()

platform                   0
original_link              0
post_time                  0
title                      0
view_count                 0
like_count                 0
price                      0
images                     0
description                0
category                   0
status                     0
seller_location         4809
expiration_date         5171
market_price            5171
capacity                5171
parking                 5171
options                 5171
check_in_out_time       5171
shipping_fee            5171
transaction_location    5171
transaction_method      5171
dtype: int64

### 6. Handle outliers

price: upper bound 5,000,000 KRW, lower bound 10,000 KRW

In [202]:
df['price'].describe()

count    5.171000e+03
mean     2.970032e+05
std      1.390770e+07
min      0.000000e+00
25%      2.000000e+03
50%      2.222000e+03
75%      1.000000e+05
max      1.000000e+09
Name: price, dtype: float64

In [203]:
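# Keep prices between 10,000 and 5,000,000 KRW (inclusive); the explicit copy
# means adding columns later does not raise SettingWithCopyWarning
df = df.query('10000 <= price <= 5000000').copy()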

In [204]:
df.reset_index(drop=True,inplace=True)
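
An equivalent way to express the same range check, shown here only as a sketch, is `Series.between` with named bounds. `PRICE_MIN` and `PRICE_MAX` are hypothetical constant names; their values are the ones used in the query above, and the sample prices are made up.

```python
import pandas as pd

# Bounds taken from the query above (KRW); the constant names are illustrative.
PRICE_MIN = 10_000
PRICE_MAX = 5_000_000

sample = pd.DataFrame({"price": [0, 2_000, 28_000, 100_000, 1_000_000_000]})

# Series.between is inclusive on both ends by default, matching `<=` in the query.
in_range = sample[sample["price"].between(PRICE_MIN, PRICE_MAX)].reset_index(drop=True)

print(in_range["price"].tolist())  # [28000, 100000]
```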

### 7. Process the seller_location column

In [205]:
df['city'] = None
df['city_goo'] = None
df['city_dong'] = None

def split_location(location):
    if pd.isna(location):
        return pd.Series([None, None, None])

    # Trim surrounding whitespace, then split on spaces
    location_parts = location.strip().split(" ")
    
    si = None
    goo = None
    dong = None
    
    for part in location_parts:
        clean_part = part.strip()
        if clean_part.endswith('시'):
            si = clean_part
        elif clean_part.endswith('구'):
            goo = clean_part
        elif clean_part.endswith('동'):
            dong = clean_part
    
    return pd.Series([si, goo, dong])

# Assign the parsed parts; the dong column is named city_dong
df[['city', 'city_goo', 'city_dong']] = df['seller_location'].apply(split_location)



In [212]:
seoul_goo = ["강남구", "강동구", "강북구", "강서구", "관악구", "광진구", "구로구", "금천구", "노원구", "도봉구", "동대문구", "동작구", "마포구", "서대문구", "서초구", "성동구", "성북구", "송파구", "양천구", "영등포구", "용산구", "은평구", "종로구", "중구", "중랑구"]

In [213]:
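# Fill in 서울특별시 (Seoul) when the gu is one of Seoul's 25 districts and no city was parsed.
# Caveat: names such as 중구 and 강서구 also exist in other cities (for example Busan), so those rows may be mislabeled.
df.loc[df['city_goo'].isin(seoul_goo) & df['city'].isna(), 'city'] = '서울특별시'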

In [218]:
df[['seller_location','city', 'city_goo', 'city_dong']].head(30)

Unnamed: 0,seller_location,city,city_goo,city_dong
0,담양군 금성면,,,
1,연제구 연산제2동,,연제구,연산제2동
2,광진구 구의제3동,서울특별시,광진구,구의제3동
3,영등포구 여의동,서울특별시,영등포구,여의동
4,수성구 지산동,,수성구,지산동
5,전주시 덕진구 인후3동,전주시,덕진구,인후3동
6,서귀포시 동홍동,서귀포시,,동홍동
7,송파구 삼전동,서울특별시,송파구,삼전동
8,강남구 삼성동,서울특별시,강남구,삼성동
9,용산구 이태원제1동,서울특별시,용산구,이태원제1동
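
To make the suffix rule concrete, the standalone sketch below restates the same logic as a plain function returning a tuple (a simplification for illustration, not the notebook's exact cell) and runs it on three values from the table above. The Seoul list is shortened to a single district, which is an assumption made to keep the example brief.

```python
import pandas as pd


def split_location(location):
    """Split a seller_location string into (si, gu, dong) by suffix."""
    if pd.isna(location):
        return (None, None, None)
    si = goo = dong = None
    for part in location.split():
        if part.endswith("시"):
            si = part
        elif part.endswith("구"):
            goo = part
        elif part.endswith("동"):
            dong = part
    return (si, goo, dong)


print(split_location("전주시 덕진구 인후3동"))  # ('전주시', '덕진구', '인후3동')
print(split_location("광진구 구의제3동"))  # (None, '광진구', '구의제3동')
print(split_location("담양군 금성면"))  # (None, None, None): 군 and 면 are not handled

# Apply the Seoul fallback to the parsed parts (list shortened for the example).
rows = ["전주시 덕진구 인후3동", "광진구 구의제3동", "담양군 금성면"]
parsed = pd.DataFrame(
    [split_location(r) for r in rows],
    columns=["city", "city_goo", "city_dong"],
)
parsed.loc[parsed["city_goo"].isin(["광진구"]) & parsed["city"].isna(), "city"] = "서울특별시"
print(parsed)
```

The third call shows a limitation of the rule as written: county-level (군) and township-level (면, 읍) names fall through every branch, so such rows end up with all three fields empty.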


### 8. Save as pickle

In [207]:
df.to_pickle('전처리_호텔_results_20241010_113155.pickle')
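
A downstream script would presumably load the file back with `pd.read_pickle`. The sketch below checks the round trip on a small hypothetical frame written to a temporary directory; the file name and values are made up. Pickle keeps dtypes and the index as they were, but pickle files should only be loaded from trusted sources.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample with a missing value, similar in shape to the real columns.
sample = pd.DataFrame({"price": [44000, 28000], "city_goo": [None, "연제구"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pickle")  # illustrative file name
    sample.to_pickle(path)
    restored = pd.read_pickle(path)

# True when values, dtypes, and index all survived the round trip.
print(restored.equals(sample))
```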

In [222]:
df

Unnamed: 0,platform,original_link,post_time,title,view_count,like_count,price,images,description,category,...,capacity,parking,options,check_in_out_time,shipping_fee,transaction_location,transaction_method,city,city_goo,city_dong
0,당근마켓,https://www.daangn.com/articles/845853060,2024-10-10 10:27:51.013,10/11-12 금토) 호텔케니여수 스탠다드더블룸 양도해요,7,0,44000,https://img.kr.gcp-karroter.net/origin/article...,못 가게되서요 양도합니다,티켓/교환권,...,,,,,,,,,,
1,당근마켓,https://www.daangn.com/articles/845851669,2024-10-10 10:24:51.013,●농심호텔 허심청브로이 옥토버페스트 수제맥주축제 OKTOBERFEST 티켓,29,2,28000,https://img.kr.gcp-karroter.net/origin/article...,농심호텔\n허심청브로이 옥토버페스트 수제맥주축제 \nOKTOBERFEST\n2024...,티켓/교환권,...,,,,,,,,,연제구,연산제2동
2,당근마켓,https://www.daangn.com/articles/845850450,2024-10-10 10:22:51.013,호텔스카이파크센트럴 판교 10/12(토)~13(일),21,0,100000,https://img.kr.gcp-karroter.net/origin/article...,환불불가 상품으로\n개인사정으로 못가게 되어서 팝니다\n객실만이에요\n조식 미포함\...,티켓/교환권,...,,,,,,,,서울특별시,광진구,구의제3동
3,당근마켓,https://www.daangn.com/articles/845834566,2024-10-10 09:54:51.013,신라호텔 더 파크뷰 레스토랑 식사권 2매 판매합니다,87,6,360000,https://img.kr.gcp-karroter.net/origin/article...,신라호텔 더 파크뷰 레스토랑 식사권 2매 판매합니다,티켓/교환권,...,,,,,,,,서울특별시,영등포구,여의동
4,당근마켓,https://www.daangn.com/articles/845832050,2024-10-10 09:54:51.013,"호텔 드포레 ""주말"" 숙박권 (~11-30) [네이처파크 무료]",354,33,100000,https://img.kr.gcp-karroter.net/origin/article...,스파밸리 호텔\n호텔 드포레\n친환경 힐링 숲속호텔\n(네이처파크 무료이용)\n주중...,티켓/교환권,...,,,,,,,,,수성구,지산동
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1908,중고나라,https://web.joongna.com/product/177054727,2024-10-10 11:31:55.993,8월 15일 ~17일 (2박)세인트 존스 경포 호텔 고져스 오션 트윈,31,0,800000,/cafe-article-data/live/2024/07/18/1067047568/...,세인트존스경포호텔 고져스 오션 트윈 객실 숙박 -예약일정 : 08/15 토 ~ 08...,여행/숙박이용권,...,,,,,,,,,,
1909,중고나라,https://web.joongna.com/product/177054641,2024-10-10 11:31:55.993,센츄리온 호텔 우에노 7.19-21 (2박) 급매!,6,0,150000,/cafe-article-data/live/2024/07/18/1067047466/...,https://search.naver.com/search.naver?where=ne...,여행/숙박이용권,...,,,,,,,,,,
1910,중고나라,https://web.joongna.com/product/177046630,2024-10-10 11:31:55.993,마카오 호텔 숙박권 양도합니다,30,0,90000,https://img2.joongna.com/media/original/2024/0...,세나도광장 걸어서 3-5분 거리에있는 위치 아주 좋은 숙소입니다 ! 16일에 체크인...,여행/숙박이용권,...,,,,,,,,,,
1911,중고나라,https://web.joongna.com/product/177030077,2024-10-10 11:31:55.993,당일 롯데호텔 숙박권,63,0,200000,https://img2.joongna.com/media/original/2024/0...,오늘날짜 숙박권 입니다 저렴하게 판매해요,여행/숙박이용권,...,,,,,,,,,,구의제1동
