## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#b57edc; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #b57edc">Libraries</p>

In [23]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm, tqdm_notebook

import warnings
warnings.filterwarnings('ignore')

import re
import requests

from googletrans import Translator
from concurrent.futures import ThreadPoolExecutor

from google.cloud import translate_v2 as translate

import os

pd.options.display.max_colwidth = 99999
#pd.options.display.max_rows = 99999

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#b57edc; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #b57edc">Intro</p>

The goal is to identify safe areas within Tokyo and recommend Airbnb accommodations in those areas to travelers.

> We conduct EDA to explore key areas of Tokyo and collect real-estate-related incidents reported in those areas.

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#b57edc; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #b57edc">Data</p>

**The data** is public information compiled from the Airbnb website, including the availability calendar for the next 365 days and the reviews for each listing.

There are 74 independent variables:
<ul>
<li><strong>listing_gz.csv</strong><ul>
<li><code>id</code> Airbnb's unique identifier for the listing</li>
<li><code>scrape_id</code> The Inside Airbnb "scrape" this listing was part of</li>
<li><code>host_id</code> Airbnb's unique identifier for the host/user</li>
<li><code>listing_url</code></li>
<li><code>last_scraped</code> UTC. The date and time this listing was "scraped".</li>
<li><code>source</code> One of "neighbourhood search" or "previous scrape". "neighbourhood search" means that the listing was found by searching the city, while "previous scrape" means that the listing was seen in another scrape performed in the last 65 days, and the listing was confirmed to be still available on the Airbnb site.</li>
<li><code>description</code> Detailed description of the listing</li>
<li><code>neighborhood_overview</code> Host's description of the neighbourhood</li>
<li><code>picture_url</code> URL to the Airbnb hosted regular sized image for the listing</li>
<li><code>host_url</code> The Airbnb page for the host</li>
<li><code>host_name</code> Name of the host. Usually just the first name(s)</li>
<li><code>host_since</code> The date the host/user was created. For hosts that are Airbnb guests this could be the date they registered as a guest.</li>
<li><code>host_location</code> The host's self reported location</li>
<li><code>host_about</code> Description about the host</li>
<li><code>host_response_time</code></li>
<li><code>host_response_rate</code></li>
<li><code>host_acceptance_rate</code> The rate at which a host accepts booking requests.</li>
<li><code>host_is_superhost</code></li>
<li><code>host_thumbnail_url</code></li>
<li><code>host_picture_url</code></li>
<li><code>host_listings_count</code> The number of listings the host has (per Airbnb calculations)</li>
<li><code>host_total_listings_count</code> The number of listings the host has (per Airbnb calculations)</li>
<li><code>host_verifications</code></li>
<li><code>host_has_profile_pic</code></li>
<li><code>host_identity_verified</code></li>
<li><code>neighbourhood</code></li>
<li><code>neighbourhood_cleansed</code> The neighbourhood as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.</li>
<li><code>neighbourhood_group_cleansed</code> The neighbourhood group as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.</li>
<li><code>latitude</code> Uses the World Geodetic System (WGS84) projection for latitude and longitude.</li>
<li><code>longitude</code> Uses the World Geodetic System (WGS84) projection for latitude and longitude.</li>
<li><code>property_type</code> Self selected property type. Hotels and Bed and Breakfasts are described as such by their hosts in this field</li>
<li><code>room_type</code> Entire home/apt|Private room|Shared room|Hotel</li>
<li><code>accommodates</code> The maximum capacity of the listing</li>
<li><code>bathrooms</code> The number of bathrooms in the listing</li>
<li><code>bathrooms_text</code> The number of bathrooms in the listing.</li>
<li><code>bedrooms</code> The number of bedrooms</li>
<li><code>beds</code> The number of bed(s)</li>
<li><code>price</code> daily price in local currency</li>
<li><code>minimum_nights</code> minimum number of night stay for the listing (calendar rules may be different)</li>
<li><code>maximum_nights</code> maximum number of night stay for the listing (calendar rules may be different)</li>
<li><code>minimum_minimum_nights</code> the smallest minimum_night value from the calendar (looking 365 nights in the future)</li>
<li><code>maximum_minimum_nights</code> the largest minimum_night value from the calendar (looking 365 nights in the future)</li>
<li><code>minimum_maximum_nights</code> the smallest maximum_night value from the calendar (looking 365 nights in the future)</li>
<li><code>maximum_maximum_nights</code> the largest maximum_night value from the calendar (looking 365 nights in the future)</li>
<li><code>minimum_nights_avg_ntm</code> the average minimum_night value from the calendar (looking 365 nights in the future)</li>
<li><code>maximum_nights_avg_ntm</code> the average maximum_night value from the calendar (looking 365 nights in the future)</li>
<li><code>calendar_updated</code></li>
<li><code>has_availability</code></li>
<li><code>availability_30</code> availability_x. The availability of the listing x days in the future as determined by the calendar. Note a listing may not be available because it has been booked by a guest or blocked by the host.</li>
<li><code>availability_60</code> availability_x. The availability of the listing x days in the future as determined by the calendar. Note a listing may not be available because it has been booked by a guest or blocked by the host.</li>
<li><code>availability_90</code> availability_x. The availability of the listing x days in the future as determined by the calendar. Note a listing may not be available because it has been booked by a guest or blocked by the host.</li>
<li><code>availability_365</code> availability_x. The availability of the listing x days in the future as determined by the calendar. Note a listing may not be available because it has been booked by a guest or blocked by the host.</li>
<li><code>number_of_reviews</code> The number of reviews the listing has</li>
<li><code>number_of_reviews_ltm</code> The number of reviews the listing has (in the last 12 months)</li>
<li><code>number_of_reviews_l30d</code> The number of reviews the listing has (in the last 30 days)</li>
<li><code>first_review</code> The date of the first/oldest review</li>
<li><code>last_review</code> The date of the last/newest review</li>
<li><code>review_scores_rating</code></li>
<li><code>review_scores_accuracy</code></li>
<li><code>review_scores_cleanliness</code></li>
<li><code>review_scores_checkin</code></li>
<li><code>review_scores_communication</code></li>
<li><code>review_scores_location</code></li>
<li><code>review_scores_value</code></li>
<li><code>license</code> The licence/permit/registration number</li>
<li><code>calculated_host_listings_count</code> The number of listings the host has in the current scrape, in the city/region geography.</li>
<li><code>calculated_host_listings_count_entire_homes</code> The number of Entire home/apt listings the host has in the current scrape, in the city/region geography</li>
<li><code>calculated_host_listings_count_private_rooms</code> The number of Private room listings the host has in the current scrape, in the city/region geography</li>
<li><code>calculated_host_listings_count_shared_rooms</code> The number of Shared room listings the host has in the current scrape, in the city/region geography</li>
<li><code>reviews_per_month</code> The number of reviews the listing has over the lifetime of the listing</li>
</ul></li>
</ul>

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#b57edc; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #b57edc">EDA</p>

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#1A5D1A; font-size:75%; text-align:left;padding: 0px; border-bottom: 3px solid #1A5D1A">Input Data</p>

In [24]:
listing = pd.read_csv('C:\\Users\\lucky\\Documents\\COLLABORATION\\AirbnbWise\\Tokyo_Airbnb\\data\\listings.csv.gz')
listing.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,197677,https://www.airbnb.com/rooms/197677,20230629055629,2023-06-29,city scrape,Rental unit in Sumida · ★4.78 · 1 bedroom · 2 beds · 1 bath,"<b>The space</b><br />We are happy to welcome you to our apartment, located in the heart of Tokyo downtown. This is an authentic Japanese apartment with Tatami mattress room and sleeping on Japanese Futon, like Ryokan style.<br /><br />Fully equipped and convienient kitchen will give you oportunity to feel like at home. Automatic bath tub. Separate toilet with heating seat and washlet.<br /><br />Direct acces from both Narita and Haneda airports.<br /><br />Easy access to most of Tokyo attractions.<br /><br />10min walk from Oshiage Station,<br /><br />7min walk from Tobu Hikifune Station,<br /><br />8min walk from Heisei Hikifune Station.<br /><br />Free internet access.<br />Air conditioning, 2 semi-double futon bed (for 2 person each), LCD 32 inch TV, full<br />kitchen, microwave, toster, electric pot, refrigerator, coffee maker, iron, hair dryer, washing machine, bathroom with a bath tub and shower, gas grill. Cooking utensils and linens provided.<br /><br />Our apartment is locate",,https://a0.muscache.com/pictures/38437056/d27fa43f_original.jpg,964081,...,4.83,4.53,4.79,M130003350,f,1,1,0,0,1.21
1,776070,https://www.airbnb.com/rooms/776070,20230629055629,2023-06-29,city scrape,Home in Kita-ku · ★4.98 · 1 bedroom · 1 bed · 1 shared bath,"We have been in airbnb since 2011 and it has given us many new opportunity to meet and learn everyday. <br />It is a private room with a toilet just beside the room. <br /><br />This is a house that owned by us Kotaro, Kei and Miya<br />It is 3 story house. 2 Toilets 1 Shower/bathroom.<br />Vending machine in front our house manage by us. <br /><br />We have now in airbnb experience. <br />If you are interested in making bento and gyoza please let us know so we can send you the link for booking. <br /><br /><br /><br />Thank you very much.<br /><br /><b>The space</b><br />INTRO<br /><br />ε(*´･･`)зﾞHey""ε(´･･`*)з<br /><br />*TAKE NOTE* <br />Just a few Notes to inform. <br />From June 15 we have certified for home sharing.<br />New law have enforced in Japan for home sharing. <br />We are certified for Home sharing. <br />We hope people would come enjoy Japan and and to learn about Japanese culture. <br />Please let us know if you have questions.<br />We have also varieties of airbnb ex",We love Nishinippori because is nearer to Tokyo which we are fond of parks and more comfortable space.,https://a0.muscache.com/pictures/efd9f039-dbd2-4996-9467-7742a0c0813b.jpg,801494,...,4.98,4.83,4.91,,f,1,0,1,0,1.89
2,905944,https://www.airbnb.com/rooms/905944,20230629055629,2023-06-29,city scrape,Rental unit in Shibuya · ★4.76 · 2 bedrooms · 4 beds · 1 bath,"NEWLY RENOVATED property entirely for you & your travel companions, located in the vibrant neighborhood of Hatagaya, 3 minutes on train or 15 minutes walking to Shinjuku. Easy access to Shibuya / Shinjuku & fits up to 6 guests. The calendar is always up to date. The apartment has 40+ sq mts / 450+ sq ft, 2 bedrooms, 1 living/dining room. Simply put: it's TWICE as big as most apartments in central Tokyo and features ADSL Wifi, 2 HVACs ( 1 per room), washing machine, shower room and a kitchenette.<br /><br /><b>The space</b><br />Welcome and thank you for reading the details that follow, as they can help answer most common questions.<br /><br />A few things before you start: <br />1.\tI have been in the shared economy community for over 10 years and have come to learn a lot about how it works. I suggest that while comparing my apartments you take into consideration the size, location and ownership. This apartment is 7 minutes from Shinjuku in the heart of Tokyo, very spacious, owned and l","Hatagaya is a great neighborhood located 4 minutes by train train from Shinjuku (15-20 minutes walking).<br />It is extremely lively by any international standards but not as crowded a Shinjuku.<br />It features dozens of family restaurants, convenience stores and bars open 24X7.",https://a0.muscache.com/pictures/miso/Hosting-905944/original/30dc6aa5-faea-4d4f-b365-5bf3b8066407.jpeg,4847803,...,4.9,4.77,4.77,Hotels and Inns Business Act | 渋谷区保健所長 | 31渋健生収第4972号,t,5,5,0,0,1.49
3,1016831,https://www.airbnb.com/rooms/1016831,20230629055629,2023-06-29,city scrape,Home in Setagaya · ★4.94 · 1 bedroom · 2 beds · 1 shared bath,"Hi there, I am Wakana and I live with my two friendly cats.<br />I designed and built a house 11 years ago.<br /> I am a business freelancer and one of my business is a real estate agent, so if you are looking for properties in Japan, I can be of a help!<br /><br />My house is located near Ikenoue station, 5 minutes from Shibuya by Inokashira line. 12 mins walk to Shimokitazawa, hipster neighborhood.<br /><br />If you have time, it would be nice to have a chat or go out for a drink.<br /><br /><b>The space</b><br />The room is located on the first floor, right after the entrance. The toilet, washbasin, and bathroom are all on the first floor, so you can go directly to your room without meeting anyone when you come home, so you can maintain a moderate sense of privacy. We have a double bed in the Ryukyu Tatami room. We have chosen a high quality mattress. There is a built-in walk-in closet for the exclusive use of guests. There is a small study room. There is a table and a chair for rem","The location is walkable distance to famous Shimokitazawa, known to be a posh residential area without any high rise buildings. In just 10 minutes you can be at the heart of the trendy, funky, and bohemian Shimokitazawa that has a smorgasbord of affordable yet mouth-wateringly good restaurants and vintage clothes shops. The house is well situated in between train stations: 5 minutes walk to Ikenoue station on Inokashira line, 3rd stop from Shibuya station. 8 minutes walk to Shimokitazawa station on Odakyu line, 10 minutes from Shimokitazawa to Shinjuku by train. All very convenient! 
<br /><br />I am happy to share the map for the near-by restaurant (breakfast, lunch and dinner) once the booking is done!",https://a0.muscache.com/pictures/airflow/Hosting-1016831/original/a9db6fec-9d5b-4764-ad3f-c13695c9a041.jpg,5596383,...,4.98,4.92,4.89,,f,1,0,1,0,1.96
4,1196177,https://www.airbnb.com/rooms/1196177,20230629055629,2023-06-29,city scrape,Home in 足立区 · ★4.71 · 1 bedroom · 1.5 shared baths,"Ｓtay with host.We can help your travel.<br />Big hub station Kitasenju,is walking distance and from there,you can go Ginza,Roppongi,Asakusa,Tsukiji ,Ueno,Tokyo sky tree directly.<br />Easy access to all major spot.<br /><br /><b>The space</b><br />One or two people will fit.<br />air conditioner and heater is inside.<br /><br />since our house is built in March 2015,everything is new and clean.<br /><br /><b>Guest access</b><br />your private room and toilet are on 3rd floor.<br /><br />shared bathroom, kitchen and washingmachine are on 2nd floor.<br /><br />microwave,electrickettle,toaster,refrigreat or is available.<br />gas range is not usable.<br />airconditioner ,free wifi,free tea are provided.<br /><br /><b>During your stay</b><br />I will go to pick you up at Senjuohashi station(keisei main line) on your arrival.<br /><br /><b>Other things to note</b><br />We have 0years old and 5years old boys and they may noisy in the morning around 7-8am.<br />They need to go to the preschoo","There are shopping mall near Senjuohashi station.<br />supermarket,restaurants(Mcdonald's,Yoshinoya,Otoya,Chinese,Italian ) are 5min away.<br /><br />Big hub station,Kitasenju is walking distance.15min by foot.<br />From Kitasenju,you can go Ginza,Roppongi,Tsukiji,Tokyo sky tree,even Nikko,Hakone directly.<br /><br /><br />Close to fish market.<br />you can enjoy fresh sashimi.<br /><br />There are many Japanese pubric bath called ' SENTO' near my house.Everybody take a bath all naked. <br />There are some rules to take a bath,so I will teach you how to enjoy SENTO.",https://a0.muscache.com/pictures/72890882/05ecbdaa_original.jpg,5686404,...,4.88,4.67,4.75,,f,1,0,1,0,0.79


In [25]:
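#* Select the columns from listings.csv.gz to use as input data
#* .copy() gives an independent DataFrame, so the column assignments below do not write to a slice of listing
inputDF = listing[['latitude', 'longitude', 'price', 'room_type', 'accommodates', 'bedrooms', 'beds', 'review_scores_rating']].copy() #* the bathrooms column has no values, so it is not used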
inputDF.head()

Unnamed: 0,latitude,longitude,price,room_type,accommodates,bedrooms,beds,review_scores_rating
0,35.71707,139.82608,"$11,000.00",Entire home/apt,2,1.0,2.0,4.78
1,35.73844,139.76917,"$7,208.00",Private room,1,,1.0,4.98
2,35.67878,139.67847,"$23,066.00",Entire home/apt,6,2.0,4.0,4.76
3,35.658,139.67134,"$16,000.00",Private room,2,,2.0,4.94
4,35.744731,139.797384,"$10,000.00",Private room,4,,,4.71


In [26]:
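#* Remove special characters
def remove_special_characters(text):
    #* \w matches letters and digits, \s matches whitespace; the leading ^ negates the set, so the pattern matches every other character
    pattern = r'[^\w\s]'
    return re.sub(pattern, '', text)

#* Among the object-type columns, strip special characters from the price and room_type columns
#* Note: this also strips the decimal point, so "$11,000.00" becomes 1100000 (100 times the listed price)
inputDF['price'] = inputDF['price'].apply(remove_special_characters)
#* Convert the price column to float
inputDF['price'] = inputDF['price'].astype('float64')
#* Replace the '/' in the room_type column with a space
inputDF['room_type'] = inputDF['room_type'].str.replace('/', ' ')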
inputDF.head()

Unnamed: 0,latitude,longitude,price,room_type,accommodates,bedrooms,beds,review_scores_rating
0,35.71707,139.82608,1100000.0,Entire home apt,2,1.0,2.0,4.78
1,35.73844,139.76917,720800.0,Private room,1,,1.0,4.98
2,35.67878,139.67847,2306600.0,Entire home apt,6,2.0,4.0,4.76
3,35.658,139.67134,1600000.0,Private room,2,,2.0,4.94
4,35.744731,139.797384,1000000.0,Private room,4,,,4.71
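
The cleaned prices above are 100 times the listed amounts because the regex removes the decimal point along with `$` and `,`. A minimal sketch of an alternative, assuming prices are always formatted like `"$11,000.00"`; `parse_price` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd

def parse_price(series: pd.Series) -> pd.Series:
    """Strip currency symbols and thousands separators but keep the decimal point."""
    # Keep only digits and '.', so "$11,000.00" becomes "11000.00".
    cleaned = series.astype(str).str.replace(r"[^\d.]", "", regex=True)
    # Anything that still cannot be parsed (for example a missing value) becomes NaN.
    return pd.to_numeric(cleaned, errors="coerce")

prices = pd.Series(["$11,000.00", "$7,208.00", "$23,066.00"])
parsed = parse_price(prices)
print(parsed.tolist())
```

If the downstream files were built from the inflated values, they would need to be regenerated or divided by 100 to stay consistent.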


In [27]:
#* Get the street address from latitude and longitude
# def get_address_from_latlng(latitude, longitude, api_key):
#     url = f'https://maps.googleapis.com/maps/api/geocode/json?latlng={latitude},{longitude}&key={api_key}'
#     response = requests.get(url)
#     data = response.json()
#     if data['status'] == 'OK':
#         return data['results'][0]['formatted_address']
#     else:
#         return None

# # Google Maps API key, read from the environment instead of being hard-coded (the variable name is an assumption)
# api_key = os.environ['GOOGLE_MAPS_API_KEY']

# addressList = []
# for latitude, longitude in tqdm_notebook(zip(inputDF['latitude'],inputDF['longitude'])):
#      address = get_address_from_latlng(latitude, longitude, api_key)
#      if address:
#          #print(f'Address: {address}')
#          addressList.append(address)
#      else:
#          #print('No address found for this location')
#          addressList.append(None)

In [28]:
#inputDF['address'] = addressList

----------------------------------------------------------------------------------------------

In [29]:
inputDF = pd.read_csv('input_main_tmp.csv') #* Data with addresses looked up from latitude/longitude, before address preprocessing
inputDF

Unnamed: 0,latitude,longitude,price,room_type,accommodates,bedrooms,beds,review_scores_rating,address
0,35.717070,139.826080,1100000.0,Entire home apt,2,1.0,2.0,4.78,"2-chōme-27-16 Yahiro, Sumida City, Tokyo 131-0041, Japan"
1,35.738440,139.769170,720800.0,Private room,1,,1.0,4.98,"1-chōme-26-7 Tabatashinmachi, Kita City, Tokyo 114-0012, Japan"
2,35.678780,139.678470,2306600.0,Entire home apt,6,2.0,4.0,4.76,"2-chōme-1-2-1 Hatagaya, Shibuya City, Tokyo 151-0072, Japan"
3,35.658000,139.671340,1600000.0,Private room,2,,2.0,4.94,"Japan, 〒155-0032 Tokyo, Setagaya City, Daizawa, 2-chōme−16, 池ノ上フラット"
4,35.744731,139.797384,1000000.0,Private room,4,,,4.71,"1-2 Senjumiyamotochō, Adachi City, Tokyo 120-0043, Japan"
...,...,...,...,...,...,...,...,...,...
11172,35.697773,139.706543,1200000.0,Entire home apt,4,1.0,3.0,,"2-chōme-4-11 Kabukichō, Shinjuku City, Tokyo 160-0021, Japan"
11173,35.698980,139.694320,1600000.0,Entire home apt,3,1.0,2.0,,"Japan, 〒169-0074 Tokyo, Shinjuku City, Kitashinjuku, 1-chōme−16, メゾン・ブランシェ"
11174,35.700080,139.695020,1600000.0,Entire home apt,4,1.0,2.0,,"1-chōme-6-14 Kitashinjuku, Shinjuku City, Tokyo 169-0074, Japan"
11175,35.699860,139.693340,4000000.0,Entire home apt,9,3.0,6.0,,"1-chōme-25-21 Kitashinjuku, Shinjuku City, Tokyo 169-0074, Japan"


In [30]:
#inputDF.to_csv('input_tmp.csv', index = False)

In [31]:
#! The googletrans attempt failed; the Google Cloud Translation API is more accurate anyway.
#* Recent versions of the googletrans library use the 'raise_exception' attribute instead of 'raise_Exception'
# def translate_english_to_japanese(text):
#     try:
#         translator = Translator(service_urls=['translate.googleapis.com'], raise_exception=True)
#         result = translator.translate(text, src='en', dest='ja')
#         return result.text
#     except Exception as e:
#         #print(f"Error occurred during translation: {e}")
#         return np.nan

# with ThreadPoolExecutor() as executor: #* ThreadPoolExecutor is an Executor class that runs tasks in parallel across multiple threads
#     tqdm.pandas()  #* Register tqdm with pandas
#     translated_addresses = list(tqdm(executor.map(translate_english_to_japanese, inputDF['address']), total=len(inputDF))) #* Run in parallel; total is the number of tasks

# inputDF['address_Ja'] = translated_addresses

In [32]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:\\Users\\lucky\\Documents\\genie-393805-2ca6729ec32d.json'
!gcloud auth application-default print-access-token

'gcloud' is not recognized as an internal or external command,
operable program or batch file.


In [33]:
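#* Translate the addresses to Japanese using the Google Cloud Translation API instead of googletrans
def translate_english_to_japanese(text):
    translator = translate.Client()
    #* Run the translation
    result = translator.translate(text, source_language='en', target_language='ja')
    #* Return the translated text
    return result['translatedText']

#* Drop rows with a missing address before translating
inputDF.dropna(subset=['address'], inplace=True)

#TODO Translate the English 'address' column to Japanese in parallel and store it in the 'address_japanese' column
tqdm.pandas()
with ThreadPoolExecutor() as executor:
    translated_addresses = list(tqdm(executor.map(translate_english_to_japanese, inputDF['address']), total=len(inputDF)))

# Add the translated addresses as the 'address_japanese' column.
# Assign the list directly: wrapping it in pd.Series would align on a fresh 0..n-1 index,
# which no longer matches inputDF's index once dropna has removed rows.
inputDF['address_japanese'] = translated_addresses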

  0%|          | 0/11176 [00:03<?, ?it/s]


KeyboardInterrupt: 
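
Two details of the cell above are easy to miss: `executor.map` returns results in input order, and assigning a `pd.Series` to a column aligns on the index rather than on position. A small offline sketch of both points; `fake_translate` is a hypothetical stand-in for the Cloud Translation call and the three-row frame is made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def fake_translate(text: str) -> str:
    # Hypothetical stand-in for the real translation call; no network access.
    return text.upper()

# After dropna the index is [0, 2], not [0, 1].
df = pd.DataFrame({"address": ["a", None, "c"]}).dropna(subset=["address"])

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in the same order as the inputs.
    translated = list(executor.map(fake_translate, df["address"]))

# A new Series gets index [0, 1]; aligning it to [0, 2] leaves the last row NaN.
misaligned = df.assign(address_japanese=pd.Series(translated))
# A plain list is assigned by position, so every row receives its translation.
aligned = df.assign(address_japanese=translated)

print(misaligned)
print(aligned)
```

Passing `index=df.index` to `pd.Series` would be another way to keep the rows aligned.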

In [34]:
#* Drop the latitude, longitude, and address columns
inputDF2 = inputDF.drop(['latitude', 'longitude', 'address'], axis = 1)

In [35]:
inputDF2

Unnamed: 0,price,room_type,accommodates,bedrooms,beds,review_scores_rating
0,1100000.0,Entire home apt,2,1.0,2.0,4.78
1,720800.0,Private room,1,,1.0,4.98
2,2306600.0,Entire home apt,6,2.0,4.0,4.76
3,1600000.0,Private room,2,,2.0,4.94
4,1000000.0,Private room,4,,,4.71
...,...,...,...,...,...,...
11172,1200000.0,Entire home apt,4,1.0,3.0,
11173,1600000.0,Entire home apt,3,1.0,2.0,
11174,1600000.0,Entire home apt,4,1.0,2.0,
11175,4000000.0,Entire home apt,9,3.0,6.0,


In [36]:
test = pd.read_csv('input_main_listing.csv')
test

Unnamed: 0,price,room_type,accommodates,bedrooms,beds,review_scores_rating,address_japanese,shiku,townpart,detailpart
0,1100000.0,Entire home apt,2,1.0,2.0,4.78,墨田区八広2丁目27-16,墨田区,墨田区八広2丁目,27-16
1,720800.0,Private room,1,,1.0,4.98,北区田端新町1丁目26-7,北区,北区田端新町1丁目,26-7
2,2306600.0,Entire home apt,6,2.0,4.0,4.76,渋谷区幡ヶ谷2丁目1-2-1,渋谷区,渋谷区幡ヶ谷2丁目,1-2-1
3,1600000.0,Private room,2,,2.0,4.94,世田谷区代沢2丁目−16、池ノ上フラット,世田谷区,世田谷区代沢2丁目,−16、池ノ上フラット
4,1000000.0,Private room,4,,,4.71,足立区千住宮元町1-2,足立区,,足立区千住宮元町1-2
...,...,...,...,...,...,...,...,...,...,...
11171,1200000.0,Entire home apt,4,1.0,3.0,,新宿区北新宿1丁目16番地メゾン・ブランシェ,新宿区,新宿区北新宿1丁目,16番地メゾン・ブランシェ
11172,1600000.0,Entire home apt,3,1.0,2.0,,新宿区北新宿1丁目6-14,新宿区,新宿区北新宿1丁目,6-14
11173,1600000.0,Entire home apt,4,1.0,2.0,,新宿区北新宿1丁目25-21,新宿区,新宿区北新宿1丁目,25-21
11174,4000000.0,Entire home apt,9,3.0,6.0,,新宿区歌舞伎町2丁目16-12,新宿区,新宿区歌舞伎町2丁目,16-12


In [None]:
#TODO Preprocess the address data
#* (1) Remove the postal code
def postnum_remove(address):
    #* Convert address to a string if it is not one already
    if not isinstance(address, str):
        address = str(address)
        
    pattern = r"〒\d{3}-\d{4}"
    #* Remove the postal-code pattern with a regular expression
    cleaned_address = re.sub(pattern, "", address)
    return cleaned_address

inputDF2['address_japanese'] = inputDF2['address_japanese'].apply(postnum_remove)

In [None]:
inputDF2['address_japanese'] = inputDF2['address_japanese'].str.replace('日本、 ', '')

In [None]:
inputDF2['address_japanese'] = inputDF2['address_japanese'].str.replace(' ', '')

In [None]:
#* (2) Every address contains '東京都' (Tokyo Metropolis), so remove it
inputDF2['address_japanese'] = inputDF2['address_japanese'].str.replace('東京都','')
inputDF2.tail()
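
Steps (1) and (2) and the two `str.replace` cells between them can be read as one cleaning function. A minimal sketch under that assumption; `clean_address` is a hypothetical helper and the sample string is an illustrative guess at what a translated address looks like before cleaning:

```python
import re

POSTAL_CODE = re.compile(r"〒\d{3}-\d{4}")

def clean_address(address) -> str:
    """Remove the postal code, the country prefix, spaces, and the 東京都 prefix."""
    address = POSTAL_CODE.sub("", str(address))  # (1) postal code such as 〒131-0041
    address = address.replace("日本、", "")       # country prefix
    address = address.replace(" ", "")           # remaining ASCII spaces
    return address.replace("東京都", "")          # (2) prefix shared by every address

sample = "日本、 〒131-0041 東京都墨田区八広2丁目27-16"  # illustrative input
cleaned = clean_address(sample)
print(cleaned)
```

Applied to a column, this would be `inputDF2['address_japanese'].apply(clean_address)`.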

In [None]:
#* (3) Add a city (市) / ward (区) column (shiku)
#* inputDF2['address_japanese'].str.contains('区').sum(): 10,913 addresses contain '区' (ward), about 97% of the data
#* If the address has no 市 (city), extract up to 区 (ward)
def districtExtract(address):
    if '市' in address:
        extracted_data = address[:address.index('市') + 1]
    elif '区' in address:
        extracted_data = address[:address.index('区') + 1]
    else: 
        extracted_data = address #* '外国外国板橋二丁目639' is the only record with neither a city nor a ward, so it should be dropped.
    return extracted_data

inputDF2['shiku'] = inputDF2['address_japanese'].apply(districtExtract)

In [None]:
#* (4) For addresses that contain a district (丁目, chōme), extract it into a new column
#* inputDF2['address_japanese'].str.contains('丁目').sum(): 10,467 records contain '丁目' and 709 do not
def partExtract(address):
    if '丁目' in address:
        extracted_data = address[:address.find('丁目')] + '丁目'
    else:
        extracted_data = np.nan #* No district
    return extracted_data

inputDF2['townpart'] = inputDF2['address_japanese'].apply(partExtract)

In [None]:
#* (5) Extract the detailed address into a new column: the part after '丁目', or the whole address when there is no district
def detailExtract(address):
    if pd.isna(address):  
        return address  
    
    if '丁目' not in address:
        extracted_data = address
    else:
        extracted_data = address[address.find('丁目') + 2:]
    return extracted_data 

inputDF2['detailpart'] = inputDF2['address_japanese'].apply(detailExtract)
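
Steps (3) to (5) can be combined into a single split. One edge case worth checking in step (3): because `市` is tested first, a ward address whose town name contains `市` (for example one in 新宿区市谷) would be cut at `市` rather than at `区`. The sketch below takes whichever of `市` or `区` appears first instead. `split_address` is a hypothetical helper, it returns `None` rather than `np.nan` for a missing district, and the 市谷 address is made up for illustration:

```python
import re
from typing import Optional, Tuple

def split_address(address: str) -> Tuple[str, Optional[str], str]:
    """Return (shiku, townpart, detailpart) for a cleaned Japanese address."""
    # Non-greedy match: stop at the first 市 or 区, whichever comes first.
    match = re.match(r".*?[市区]", address)
    shiku = match.group(0) if match else address
    if "丁目" in address:
        cut = address.find("丁目") + len("丁目")
        return shiku, address[:cut], address[cut:]
    # No chōme: no district, and the whole address is the detail part.
    return shiku, None, address

print(split_address("墨田区八広2丁目27-16"))
print(split_address("足立区千住宮元町1-2"))
print(split_address("新宿区市谷本村町5-1"))
```

For the first two addresses the result matches the `shiku`, `townpart`, and `detailpart` values shown in `input_main_listing.csv` above.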

In [None]:
inputDF2.head()

In [None]:
inputDF2.info()

In [None]:
inputDF2.to_csv("input_main_listing.csv", index=False) #* Data including the processed addresses