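##### Requirements
* Split the performance period into a start date and an end date, stored in two columns
* Remove performances that have already ended as of the current date
* Fill missing 'URL' values with a default profile image
* Merge the performance data from all sites into one dataset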
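
The last requirement, merging every site's data, is not carried out in this notebook section. A minimal sketch of one way to do it, assuming each site has already been cleaned into a frame with the same column names; `df_site_a`, `df_site_b`, and the `site` column are hypothetical names used only for illustration:

```python
import pandas as pd

# Hypothetical per-site frames; only the column names follow this notebook.
df_site_a = pd.DataFrame({
    "상품 제목": ["Show A"],
    "시작일": ["2024.08.15"],
    "종료일": ["2024.08.15"],
})
df_site_b = pd.DataFrame({
    "상품 제목": ["Show B"],
    "시작일": ["2024.12.04"],
    "종료일": ["2024.12.05"],
})

# Tag each row with its source site, then stack the frames vertically.
frames = {"site_a": df_site_a, "site_b": df_site_b}
df_all = pd.concat(
    [df.assign(site=name) for name, df in frames.items()],
    ignore_index=True,  # rebuild a clean 0..n-1 index for the merged frame
)
```

Keeping a source column makes it possible to trace a row back to its site after the merge.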

### Cleaning the performance data

In [192]:
import pandas as pd
import numpy as np

In [193]:
df_inpark = pd.read_csv('../ticketMerge/티켓머지_일정데이터/interpark_data.csv')

In [194]:
df_inpark.head(2)

Unnamed: 0,상품 제목,포스터 URL,상세 포스터 URL,공연장소,공연기간,캐스팅 리스트,해당 링크
0,2024 TREASURE RELAY TOUR ［REBOOT］ FINAL IN SEOUL,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,올림픽공원 KSPO DOME(자세히),2024.08.15,트레저,https://tickets.interpark.com/goods/24008528
1,두아 리파 내한공연,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,고척스카이돔(자세히),2024.12.04 ~2024.12.05,두아 리파,https://tickets.interpark.com/goods/24007623


In [135]:
# Check for missing values

In [136]:
df_inpark.isna().sum()

상품 제목           0
포스터 URL         0
상세 포스터 URL      5
공연장소            0
공연기간            0
캐스팅 리스트       192
해당 링크           0
dtype: int64

In [137]:
# Split '공연기간' (performance period) into two columns: start date and end date
# Work on a separate copy, dt_inpark, to guard against data loss

In [201]:
dt_inpark = pd.DataFrame(df_inpark['공연기간'])  

In [202]:
dt_inpark

Unnamed: 0,공연기간
0,2024.08.15
1,2024.12.04 ~2024.12.05
2,2024.08.17 ~2024.08.18
3,2024.08.17
4,2024.07.13 ~2024.07.14
...,...
295,2024.08.31
296,2024.06.29
297,2024.06.28
298,2024.09.21


In [140]:
# Split the performance period on "~" and add the parts as new columns

In [203]:
dt_inpark[['시작일', '종료일']] = dt_inpark['공연기간'].str.split(' ~', expand=True)

In [204]:
# Drop the '공연기간' column, which is no longer needed
dt_inpark = dt_inpark.drop(columns=['공연기간'])

In [205]:
dt_inpark

Unnamed: 0,시작일,종료일
0,2024.08.15,
1,2024.12.04,2024.12.05
2,2024.08.17,2024.08.18
3,2024.08.17,
4,2024.07.13,2024.07.14
...,...,...
295,2024.08.31,
296,2024.06.29,
297,2024.06.28,
298,2024.09.21,


In [144]:
dt_inpark.isna().sum()

시작일      0
종료일    230
dtype: int64

In [145]:
# Where the end date is empty, fill it with the start date (one-day performances)

In [206]:
dt_inpark['종료일'] = dt_inpark['종료일'].fillna(dt_inpark['시작일'])


In [207]:
dt_inpark

Unnamed: 0,시작일,종료일
0,2024.08.15,2024.08.15
1,2024.12.04,2024.12.05
2,2024.08.17,2024.08.18
3,2024.08.17,2024.08.17
4,2024.07.13,2024.07.14
...,...,...
295,2024.08.31,2024.08.31
296,2024.06.29,2024.06.29
297,2024.06.28,2024.06.28
298,2024.09.21,2024.09.21
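
The split-and-fill steps above can also be wrapped in a single helper. This is a sketch, not the notebook's method: `split_period` is a hypothetical name, and it assumes every period is either a single date or two dates separated by `~`. Taking the last piece of the split as the end date means a one-day performance gets its start date as its end date without a separate fill step.

```python
import pandas as pd

def split_period(period: pd.Series) -> pd.DataFrame:
    """Split 'YYYY.MM.DD ~YYYY.MM.DD' strings into start/end columns."""
    pieces = period.str.split("~")
    return pd.DataFrame({
        # First piece is always the start date.
        "시작일": pieces.str[0].str.strip(),
        # Last piece is the end date; for a single date it equals the start.
        "종료일": pieces.str[-1].str.strip(),
    })

sample = pd.Series(["2024.08.15", "2024.12.04 ~2024.12.05"], name="공연기간")
dates = split_period(sample)
```

Stripping whitespace after the split keeps the result independent of whether the source writes ` ~` or ` ~ `.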


In [148]:
# Join the start/end date columns back onto the performance data

In [208]:
df_inpark = pd.concat([df_inpark, dt_inpark], axis=1)

In [150]:
df_inpark.head(1)

Unnamed: 0,상품 제목,포스터 URL,상세 포스터 URL,공연장소,공연기간,캐스팅 리스트,해당 링크,시작일,종료일
0,2024 TREASURE RELAY TOUR ［REBOOT］ FINAL IN SEOUL,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,올림픽공원 KSPO DOME(자세히),2024.08.15,트레저,https://tickets.interpark.com/goods/24008528,2024.08.15,2024.08.15


In [151]:
# Drop the '공연기간' column

In [212]:
df_inpark = df_inpark.drop(columns=['공연기간'])

In [153]:
df_inpark.head(2)

Unnamed: 0,상품 제목,포스터 URL,상세 포스터 URL,공연장소,캐스팅 리스트,해당 링크,시작일,종료일
0,2024 TREASURE RELAY TOUR ［REBOOT］ FINAL IN SEOUL,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,올림픽공원 KSPO DOME(자세히),트레저,https://tickets.interpark.com/goods/24008528,2024.08.15,2024.08.15
1,두아 리파 내한공연,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,고척스카이돔(자세히),두아 리파,https://tickets.interpark.com/goods/24007623,2024.12.04,2024.12.05


In [154]:
# Find rows whose poster URL is Interpark's default (no-image) placeholder
# Default image: 'http://ticketimage.interpark.com/TicketImage/main/common/mobile/noimage_vtc.jpg'

In [155]:
df_inpark.loc[df_inpark['포스터 URL'] == 'http://ticketimage.interpark.com/TicketImage/main/common/mobile/noimage_vtc.jpg']

Unnamed: 0,상품 제목,포스터 URL,상세 포스터 URL,공연장소,캐스팅 리스트,해당 링크,시작일,종료일
68,〈재즈 콜렉티브－윤석철트리오〉－인천,http://ticketimage.interpark.com/TicketImage/m...,,트라이보울(자세히),,https://tickets.interpark.com/goods/24009322,2024.07.31,2024.07.31
115,2024 월간뮤지크 〈어게인 싱어〉,http://ticketimage.interpark.com/TicketImage/m...,https:https://ticketimage.interpark.com/Play/I...,양천문화회관 대극장(자세히),정동화,https://tickets.interpark.com/goods/24009159,2024.07.31,2024.07.31
218,야놀자 연동 테스트_비지정_콘서트,http://ticketimage.interpark.com/TicketImage/m...,,임시공연장(자세히),,https://tickets.interpark.com/goods/24009277,2024.08.01,2024.09.30
272,ASAC오픈클래스 〈쇼팽의 만년을 찾아서〉 -안산,http://ticketimage.interpark.com/TicketImage/m...,,안산문화예술의전당 별무리극장(자세히),,https://tickets.interpark.com/goods/24009158,2024.09.07,2024.09.07


In [156]:
# Replace Interpark's default image and missing detail-poster URLs with the default image we will use
# Namuwiki blank image: https://i.namu.wiki/i/GFw2SMaGiqbS3yeTWO08FD5Df8LKk-xB2ckQ0MtDjdoryMCJtsDap9msW17NVgbTM4432kao4DkEGsgwhs_In0zw9vMOldvNbYoc-n1Ng4XgxBVZXNZz33WScPw6zCJY5XAVrdsAFGDx24HN_nu3oQ.webp

In [157]:
df_inpark = df_inpark.replace('http://ticketimage.interpark.com/TicketImage/main/common/mobile/noimage_vtc.jpg', 'https://i.namu.wiki/i/GFw2SMaGiqbS3yeTWO08FD5Df8LKk-xB2ckQ0MtDjdoryMCJtsDap9msW17NVgbTM4432kao4DkEGsgwhs_In0zw9vMOldvNbYoc-n1Ng4XgxBVZXNZz33WScPw6zCJY5XAVrdsAFGDx24HN_nu3oQ.webp')

In [158]:
df_inpark['상세 포스터 URL'] = df_inpark['상세 포스터 URL'].fillna('https://i.namu.wiki/i/GFw2SMaGiqbS3yeTWO08FD5Df8LKk-xB2ckQ0MtDjdoryMCJtsDap9msW17NVgbTM4432kao4DkEGsgwhs_In0zw9vMOldvNbYoc-n1Ng4XgxBVZXNZz33WScPw6zCJY5XAVrdsAFGDx24HN_nu3oQ.webp')

In [159]:
df_inpark.isna().sum()

상품 제목           0
포스터 URL         0
상세 포스터 URL      0
공연장소            0
캐스팅 리스트       192
해당 링크           0
시작일             0
종료일             0
dtype: int64
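
The placeholder swap and the missing-value fill can be limited to the two URL columns so that no other column is touched. A minimal sketch on made-up rows; `NO_IMAGE` is the Interpark placeholder quoted above, while `DEFAULT_IMAGE` and the `example.com` URLs are hypothetical stand-ins:

```python
import pandas as pd

NO_IMAGE = "http://ticketimage.interpark.com/TicketImage/main/common/mobile/noimage_vtc.jpg"
DEFAULT_IMAGE = "https://example.com/default.webp"  # hypothetical replacement image

df = pd.DataFrame({
    "포스터 URL": [NO_IMAGE, "http://example.com/a.jpg"],
    "상세 포스터 URL": [None, "http://example.com/b.jpg"],
})

url_cols = ["포스터 URL", "상세 포스터 URL"]
# Swap the site's placeholder, then fill whatever is still missing.
df[url_cols] = df[url_cols].replace(NO_IMAGE, DEFAULT_IMAGE).fillna(DEFAULT_IMAGE)
```

Assigning the result back to the columns avoids `inplace=True` on a column selection, which recent pandas versions warn about.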

In [160]:
# Fill missing cast lists with the placeholder '업로드 예정' ("to be uploaded")

In [161]:
df_inpark['캐스팅 리스트'] = df_inpark['캐스팅 리스트'].fillna('업로드 예정')

In [162]:
df_inpark.isna().sum()

상품 제목         0
포스터 URL       0
상세 포스터 URL    0
공연장소          0
캐스팅 리스트       0
해당 링크         0
시작일           0
종료일           0
dtype: int64

In [163]:
df_inpark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   상품 제목       300 non-null    object
 1   포스터 URL     300 non-null    object
 2   상세 포스터 URL  300 non-null    object
 3   공연장소        300 non-null    object
 4   캐스팅 리스트     300 non-null    object
 5   해당 링크       300 non-null    object
 6   시작일         300 non-null    object
 7   종료일         300 non-null    object
dtypes: object(8)
memory usage: 18.9+ KB


In [164]:
# Convert '종료일' (end date) to a datetime type, then remove performances that have already ended as of today

In [209]:
pd.to_datetime(df_inpark['종료일'])

0     2024-08-15
1     2024-12-05
2     2024-08-18
3     2024-08-17
4     2024-07-14
         ...    
295   2024-08-31
296   2024-06-29
297   2024-06-28
298   2024-09-21
299   2024-07-20
Name: 종료일, Length: 300, dtype: datetime64[ns]

In [210]:
df_inpark['종료일_date'] = pd.to_datetime(df_inpark['종료일'])

In [213]:
df_inpark.head(2)

Unnamed: 0,상품 제목,포스터 URL,상세 포스터 URL,공연장소,캐스팅 리스트,해당 링크,시작일,종료일,종료일_date
0,2024 TREASURE RELAY TOUR ［REBOOT］ FINAL IN SEOUL,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,올림픽공원 KSPO DOME(자세히),트레저,https://tickets.interpark.com/goods/24008528,2024.08.15,2024.08.15,2024-08-15
1,두아 리파 내한공연,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,고척스카이돔(자세히),두아 리파,https://tickets.interpark.com/goods/24007623,2024.12.04,2024.12.05,2024-12-05
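
Passing the date format explicitly makes the conversion stricter than letting pandas infer it. A small sketch on made-up values, assuming every valid date looks like `YYYY.MM.DD`; the string `'미정'` is a hypothetical bad value added only to show what `errors="coerce"` does:

```python
import pandas as pd

end_dates = pd.Series(["2024.08.15", "2024.12.05", "미정"])

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(end_dates, format="%Y.%m.%d", errors="coerce")

# Count the rows that failed to parse so they can be inspected.
n_bad = int(parsed.isna().sum())
```

Checking `n_bad` before filtering on the parsed column helps catch rows that would otherwise be dropped silently, since any comparison with `NaT` is false.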


In [168]:
# df_inpark['종료일_연도'] = df_inpark['종료일2'].dt.year
# df_inpark['종료일_월'] = df_inpark['종료일2'].dt.month
# df_inpark['종료일_일'] = df_inpark['종료일2'].dt.day

In [169]:
import datetime as dt

In [170]:
dt.datetime.now()

datetime.datetime(2024, 7, 5, 14, 49, 59, 61503)

In [171]:
current = dt.datetime.now()

In [214]:
df_inpark = df_inpark[df_inpark['종료일_date'] >= current]

In [215]:
df_inpark['종료일_date'] >= current

0      True
1      True
2      True
3      True
4      True
       ... 
293    True
294    True
295    True
298    True
299    True
Name: 종료일_date, Length: 255, dtype: bool

In [216]:
df_inpark

Unnamed: 0,상품 제목,포스터 URL,상세 포스터 URL,공연장소,캐스팅 리스트,해당 링크,시작일,종료일,종료일_date
0,2024 TREASURE RELAY TOUR ［REBOOT］ FINAL IN SEOUL,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,올림픽공원 KSPO DOME(자세히),트레저,https://tickets.interpark.com/goods/24008528,2024.08.15,2024.08.15,2024-08-15
1,두아 리파 내한공연,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,고척스카이돔(자세히),두아 리파,https://tickets.interpark.com/goods/24007623,2024.12.04,2024.12.05,2024-12-05
2,싸이흠뻑쇼 SUMMERSWAG2024 - 인천,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,인천아시아드 주경기장(자세히),싸이,https://tickets.interpark.com/goods/24007199,2024.08.17,2024.08.18,2024-08-18
3,"송소희, 두번째달, 김준수의 모던민요 - 여주",http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,세종국악당(자세히),,https://tickets.interpark.com/goods/24009112,2024.08.17,2024.08.17,2024-08-17
4,싸이흠뻑쇼 SUMMERSWAG2024 - 대구,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,대구스타디움 주경기장(자세히),싸이,https://tickets.interpark.com/goods/24007190,2024.07.13,2024.07.14,2024-07-14
...,...,...,...,...,...,...,...,...,...
293,SOUND WA,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,홍대 BENDER(자세히),,https://tickets.interpark.com/goods/24008524,2024.07.13,2024.07.13,2024-07-13
294,Summer Jazzbreak 2024,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/240075662024...,구로아트밸리 예술극장(자세히),"젠틀레인, 카즈미 타테이시, 마오 스즈키, 시노부 사토, 요시다 케이코",https://tickets.interpark.com/goods/24007566,2024.08.31,2024.08.31,2024-08-31
295,THE GREATEST : 불후의명곡 정동하X알리 - 이천,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,이천아트홀 대공연장(자세히),,https://tickets.interpark.com/goods/24009074,2024.08.31,2024.08.31,2024-08-31
298,TOMIOKA AI 〈BLUE SPOT〉,http://ticketimage.interpark.com/Play/image/la...,https://ticketimage.interpark.com/Play/image/e...,무신사 개러지(자세히),,https://tickets.interpark.com/goods/24007900,2024.09.21,2024.09.21,2024-09-21
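
One detail of the filter above: the parsed end dates sit at midnight, while `dt.datetime.now()` includes the time of day, so a performance ending today compares as earlier than `current` and is removed. If such performances should stay, the comparison can be made at day granularity. A sketch under that assumption; `drop_finished` and its `today` parameter are hypothetical names:

```python
import pandas as pd

def drop_finished(df: pd.DataFrame, today=None) -> pd.DataFrame:
    """Keep rows whose end date is today or later, comparing dates only."""
    if today is None:
        today = pd.Timestamp.now()
    # normalize() resets the time to 00:00 so only the date matters.
    today = pd.Timestamp(today).normalize()
    end = pd.to_datetime(df["종료일"], format="%Y.%m.%d")
    return df[end >= today].reset_index(drop=True)

sample = pd.DataFrame({
    "상품 제목": ["past", "today", "future"],
    "종료일": ["2024.07.04", "2024.07.05", "2024.07.06"],
})
kept = drop_finished(sample, today="2024-07-05 14:49")
```

Passing `today` explicitly also makes the filter reproducible when the notebook is re-run on a later day.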


In [176]:
# Save as a CSV file

In [217]:
df_inpark.to_csv('../ticketMerge/df_inpark.csv')
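
By default `to_csv` also writes the row index, and after the filtering step that index has gaps. Whether to keep it is a design choice. The sketch below uses made-up data and in-memory buffers rather than the notebook's files, and shows how each choice looks when the CSV is read back:

```python
import io
import pandas as pd

# Made-up frame with a gapped index, as left behind by row filtering.
df = pd.DataFrame({"상품 제목": ["a", "b"]}, index=[0, 3])

with_index = io.StringIO()
df.to_csv(with_index)  # default: the index is written as an unnamed first column

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # the index is left out

reloaded_with = pd.read_csv(io.StringIO(with_index.getvalue()))
reloaded_without = pd.read_csv(io.StringIO(without_index.getvalue()))
```

Reading the first file back yields an extra `Unnamed: 0` column unless `index_col=0` is passed to `read_csv`.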

### Cleaning the artist data

In [178]:
artist_inpark = pd.read_csv('../ticketMerge/티켓머지_아티스트데이터/interpark_artist.csv')

In [179]:
artist_inpark

Unnamed: 0,이름,URL
0,트레저,https://ticketimage.interpark.com/PlayDictiona...
1,두아 리파,https://ticketimage.interpark.com/PlayDictiona...
2,싸이,https://ticketimage.interpark.com/PlayDictiona...
3,싸이,https://ticketimage.interpark.com/PlayDictiona...
4,싸이,https://ticketimage.interpark.com/PlayDictiona...
...,...,...
180,마오 스즈키,https://ticketimage.interpark.com/PlayDictiona...
181,시노부 사토,https://ticketimage.interpark.com/PlayDictiona...
182,요시다 케이코,https://ticketimage.interpark.com/PlayDictiona...
183,소향,https://ticketimage.interpark.com/PlayDictiona...


In [180]:
artist_inpark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   이름      185 non-null    object
 1   URL     185 non-null    object
dtypes: object(2)
memory usage: 3.0+ KB


In [181]:
# Remove duplicate rows

In [182]:
artist_inpark = artist_inpark.drop_duplicates()

In [183]:
# Verify that no duplicates remain

In [184]:
artist_double = artist_inpark.duplicated()

In [185]:
artist_double

0      False
1      False
2      False
7      False
9      False
       ...  
180    False
181    False
182    False
183    False
184    False
Length: 147, dtype: bool

In [186]:
artist_double.unique()

array([False])
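
`drop_duplicates()` with no arguments removes only rows where every column matches. If the same artist could appear with different image URLs (the URLs shown above are truncated, so this is an assumption), deduplicating on the name alone behaves differently. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows: one artist listed twice with different image URLs.
artists = pd.DataFrame({
    "이름": ["싸이", "싸이", "소향"],
    "URL": [
        "https://example.com/psy_1.jpg",
        "https://example.com/psy_2.jpg",
        "https://example.com/sohyang.jpg",
    ],
})

# Whole-row comparison: both 싸이 rows differ in URL, so both survive.
exact = artists.drop_duplicates()

# Name-only comparison: keep the first row for each artist name.
by_name = artists.drop_duplicates(subset="이름", keep="first")
```

Which behaviour is wanted depends on whether one image per artist is enough for the downstream use.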

In [187]:
# Save

In [190]:
artist_inpark.to_csv('../ticketMerge/artist_inpark.csv')