## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#b57edc; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #b57edc">Libraries</p>

In [1]:
import pandas as pd
import numpy as np

from tqdm import tqdm, tqdm_notebook

import warnings
warnings.filterwarnings('ignore')

import re
import requests

from googletrans import Translator
import concurrent.futures

from google.cloud import translate_v2 as translate

import os

pd.options.display.max_colwidth = 99999
#pd.options.display.max_rows = 99999

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#b57edc; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #b57edc">Intro</p>

Identifying safe areas within Tokyo and recommending Airbnb accommodations in those areas for travelers

> We will conduct EDA to explore key areas in Tokyo and collect incidents related to Tokyo real estate in those areas.

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#b57edc; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #b57edc">Data</p>

**The data** is Summary information and metrics for listings in Tokyo (good for visualisations)

There are 19 independent variables:
<ul>
<li><strong>listings.csv</strong><ul>
<li><code>id</code> Airbnb's unique identifier for the listing</li>
<li><code>name</code></li>
<li><code>host_id</code></li>
<li><code>host_name</code></li>
<li><code>neighbourhood_group</code> The neighbourhood group as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.</li>
<li><code>neighbourhood</code> The neighbourhood as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.</li>
<li><code>district</code> </li>
<li><code>latitude</code> Uses the World Geodetic System (WGS84) projection for latitude and longitude.</li>
<li><code>longitude</code> Uses the World Geodetic System (WGS84) projection for latitude and longitude.</li>
<li><code>room_type</code> The Airbnb page for the host</li>
<li><code>price</code> daily price in local currency. Note, $ sign may be used despite locale</li>
<li><code>minimum_nights</code> minimum number of night stay for the listing (calendar rules may be different)</li>
<li><code>number_of_reviews</code> The number of reviews the listing has</li>
<li><code>last_review</code > The date of the last/newest review</li>
<li><code>reviews_per_month</code > Description about the host</li>
<li><code>calculated_host_listings_count</code > The number of listings the host has in the current scrape, in the city/region geography.</li>
<li><code>availability_365</code > avaliability_x. The availability of the listing x days in the future as determined by the calendar. Note a listing may be available because it has been booked by a guest or blocked by the host.</li>
<li><code>number_of_reviews_ltm</code > The number of reviews the listing has (in the last 12 months)</li>
<li><code>license</code ></li>

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#b57edc; font-size:140%; text-align:left;padding: 0px; border-bottom: 3px solid #b57edc">EDA</p>

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing:2px; color:#1A5D1A; font-size:75%; text-align:left;padding: 0px; border-bottom: 3px solid #1A5D1A">Input Data</p>

In [2]:
listing = pd.read_csv('C:\\Users\\lucky\\Documents\\COLLABORATION\\AirbnbWise\\Tokyo_Airbnb\\data\\listings.csv')
listing.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,197677,Rental unit in Sumida · ★4.78 · 1 bedroom · 2 beds · 1 bath,964081,Yoshimi & Marek,,Sumida Ku,35.71707,139.82608,Entire home/apt,11000,3,173,2023-05-30,1.21,1,24,8,M130003350
1,776070,Home in Kita-ku · ★4.98 · 1 bedroom · 1 bed · 1 shared bath,801494,Kei,,Kita Ku,35.73844,139.76917,Private room,7208,3,243,2023-06-20,1.89,1,67,15,
2,3427384,Rental unit in Edogawa · ★4.82 · 1 bedroom · 2 beds · 1.5 baths,13018876,Masakatsu,,Edogawa Ku,35.68374,139.85971,Entire home/apt,7847,2,100,2023-05-22,0.93,2,231,19,Hotels and Inns Business Act | 東京都江戸川区保健所 | 18江衛環01第42号
3,905944,Rental unit in Shibuya · ★4.76 · 2 bedrooms · 4 beds · 1 bath,4847803,Best Stay In Tokyo!,,Shibuya Ku,35.67878,139.67847,Entire home/apt,23066,3,186,2023-06-26,1.49,5,229,1,Hotels and Inns Business Act | 渋谷区保健所長 | 31渋健生収第4972号
4,3514008,Rental unit in Arakawa-ku · ★4.86 · 1 bedroom · 2 beds · 1 bath,17694529,Hisao,,Arakawa Ku,35.72672,139.78201,Entire home/apt,2871,1,269,2023-06-08,2.59,9,11,29,Hotels and Inns Business Act | 東京都荒川区保健所 | 31荒保衛環へ第１3号


In [3]:
#* listing.csv.gz 에서 input data로 사용할 칼럼 지정
inputDF = listing[['latitude', 'longitude', 'price', 'room_type']] 
inputDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11177 entries, 0 to 11176
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   latitude   11177 non-null  float64
 1   longitude  11177 non-null  float64
 2   price      11177 non-null  int64  
 3   room_type  11177 non-null  object 
dtypes: float64(2), int64(1), object(1)
memory usage: 349.4+ KB


In [4]:
#* 특수문자 제거
def remove_special_characters(text):
    #* \w는 숫자와 문자를, \s는 공백을, ^는 이들을 제외한 모든 문자를 의미
    pattern = r'[^\w\s]'
    return re.sub(pattern, '', str(text))

#* objet type 칼럼 중 price 칼럼과 room_type 칼럼의 특수 문자 제거
inputDF['price'] = inputDF['price'].apply(remove_special_characters)
#* price 칼럼 타입 float로 변경
inputDF['price'] = inputDF['price'].astype('float64')
#* room_type 칼럼 '/' 특수문자 제거
inputDF['room_type'] = inputDF['room_type'].str.replace('/', ' ')
inputDF.head()

Unnamed: 0,latitude,longitude,price,room_type
0,35.71707,139.82608,11000.0,Entire home apt
1,35.73844,139.76917,7208.0,Private room
2,35.68374,139.85971,7847.0,Entire home apt
3,35.67878,139.67847,23066.0,Entire home apt
4,35.72672,139.78201,2871.0,Entire home apt


In [5]:
#* 위도, 경도 데이터를 활용한 실제 주소 얻기
# def get_address_from_latlng(latitude, longitude, api_key):
#     url = f'https://maps.googleapis.com/maps/api/geocode/json?latlng={latitude},{longitude}&key={api_key}'
#     response = requests.get(url)
#     data = response.json()
#     if data['status'] == 'OK':
#         return data['results'][0]['formatted_address']
#     else:
#         return None

# # Google Maps API 키
# api_key = 'AIzaSyAbPJzcE8aKus-zTk45YZJdLwP9I9Zo01w'

# addressList = []
# for latitude, longitude in tqdm_notebook(zip(inputDF['latitude'],inputDF['longitude'])):
#      address = get_address_from_latlng(latitude, longitude, api_key)
#      if address:
#          #print(f'주소 : {address}')
#          addressList.append(address)
#      else:
#          #print(f'해당 위치의 주소를 찾을 수 없습니다')
#          addressList.append(None)

In [6]:
'''
멀티스레딩(Multithreading)은 하나의 프로세스에서 여러 개의 스레드(Thread)를 동시에 실행하는 기술
여러 작업을 동시에 처리
'''
#* 위도, 경도 데이터를 활용한 실제 주소 얻기 - 멀티스레드 방식 사용
def get_address_from_latlng(latitude, longitude, api_key):
    url = f'https://maps.googleapis.com/maps/api/geocode/json?latlng={latitude},{longitude}&key={api_key}'
    response = requests.get(url)
    data = response.json()
    if data['status'] == 'OK':
        return data['results'][0]['formatted_address']
    else:
        return None

def process_coordinates(lat_lng_data):
    latitude, longitude = lat_lng_data
    return get_address_from_latlng(latitude, longitude, api_key)

#* Google Maps API 키
api_key = 'AIzaSyAbPJzcE8aKus-zTk45YZJdLwP9I9Zo01w'

#* 주소를 얻고자 하는 위도, 경도 데이터 (inputDF['latitude'], inputDF['longitude'])가 있는 리스트
coordinates_list = list(zip(inputDF['latitude'], inputDF['longitude']))

#* 멀티스레딩으로 처리하기 위한 스레드 개수 설정
num_threads = 8  

#* 멀티스레딩 실행
addressList = []
'''
futures = [executor.submit(process_coordinates, data) for data in coordinates_list]: 
executor.submit() 메서드를 사용하여 process_coordinates 함수를 병렬로 실행하는 작업을 생성
coordinates_list에 있는 각각의 위도와 경도 데이터에 대해 process_coordinates 함수를 실행하고, 생성된 작업들을 futures 리스트에 저장
'''
with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    futures = [executor.submit(process_coordinates, data) for data in coordinates_list]
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
        address = future.result() #* future.result()는 해당 작업이 완료되면 결과 값을 반환
        if address:
            addressList.append(address)
        else:
            addressList.append(None)

100%|██████████| 11177/11177 [17:46<00:00, 10.48it/s] 


In [None]:
inputDF['address'] = addressList

----------------------------------------------------------------------------------------------

In [None]:
inputDF.to_csv('input_sub_tmp.csv', index = False)

In [None]:
inputDF = pd.read_csv('input_sub_tmp.csv') #* 위도, 경도값으로 주소 얻어온 데이터로, 주소 전처리 전 데이터이다.
inputDF

Unnamed: 0,latitude,longitude,price,room_type,address
0,35.717070,139.826080,11000.0,Entire home apt,"4 Chome-7 Higashinippori, Arakawa City, Tokyo 116-0014, Japan"
1,35.738440,139.769170,7208.0,Private room,"1-chōme-26-7 Tabatashinmachi, Kita City, Tokyo 114-0012, Japan"
2,35.683740,139.859710,7847.0,Entire home apt,"Japan, 〒134-0091 Tokyo, Edogawa City, Funabori, 2-chōme−11−１４ Ｋ－２"
3,35.678780,139.678470,23066.0,Entire home apt,"2-chōme-1-2-1 Hatagaya, Shibuya City, Tokyo 151-0072, Japan"
4,35.726720,139.782010,2871.0,Entire home apt,"2-chōme-27-16 Yahiro, Sumida City, Tokyo 131-0041, Japan"
...,...,...,...,...,...
11172,35.697773,139.706543,12000.0,Entire home apt,"2-chōme-4-11 Kabukichō, Shinjuku City, Tokyo 160-0021, Japan"
11173,35.698980,139.694320,16000.0,Entire home apt,"Japan, 〒169-0074 Tokyo, Shinjuku City, Kitashinjuku, 1-chōme−16, メゾン・ブランシェ"
11174,35.700080,139.695020,16000.0,Entire home apt,"1-chōme-6-14 Kitashinjuku, Shinjuku City, Tokyo 169-0074, Japan"
11175,35.699860,139.693340,40000.0,Entire home apt,"1-chōme-25-21 Kitashinjuku, Shinjuku City, Tokyo 169-0074, Japan"


In [None]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:\\Users\\lucky\\Documents\\genie-393805-2ca6729ec32d.json'
!gcloud auth application-default print-access-token

'gcloud'��(��) ���� �Ǵ� �ܺ� ����, ������ �� �ִ� ���α׷�, �Ǵ�
��ġ ������ �ƴմϴ�.


In [11]:
#* 주소를 일본어로 번역, googletrans 가 아닌 구글 번역 API 사용
def translate_english_to_japanese(text):
    translator = translate.Client()
    #* 번역 수행
    result = translator.translate(text, source_language='en', target_language='ja')
    #* 번역된 텍스트 반환
    return result['translatedText']

#* NaN 값을 제거한 후에 번역 작업을 진행
inputDF.dropna(subset=['address'], inplace=True)

#TODO 병렬 처리를 사용하여 'address' 칼럼의 영어를 일본어로 번역하여 'address_japanese' 열에 추가
tqdm.pandas()

num_threads = 8  

with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    translated_addresses = list(tqdm(executor.map(translate_english_to_japanese, inputDF['address']), total=len(inputDF)))

# 번역된 주소를 'address_japanese' 열에 추가
inputDF['address_japanese'] = pd.Series(translated_addresses)

  2%|▏         | 269/11177 [00:45<30:25,  5.97it/s]


KeyboardInterrupt: 

In [None]:
#* latitude, longitude, address 칼럼 삭제
inputDF2 = inputDF.drop(['latitude', 'longitude', 'address'], axis = 1)

In [None]:
#TODO 주소 데이터 전처리
#* (1) 우편 번호 제거 
def postnum_remove(address):
    #* address 변수가 문자열이 아니라면 문자열로 변환
    if not isinstance(address, str):
        address = str(address)
        
    pattern = r"〒\d{3}-\d{4}"
    #* 정규표현식으로 우편번호 패턴 제거
    cleaned_address = re.sub(pattern, "", address)
    return cleaned_address

inputDF2['address_japanese'] = inputDF2['address_japanese'].apply(postnum_remove)

In [None]:
inputDF2['address_japanese'] = inputDF2['address_japanese'].str.replace('日本、 ', '')

In [None]:
inputDF2['address_japanese'] = inputDF2['address_japanese'].str.replace(' ', '')

In [None]:
#* (2) address의 모든 데이터에 '東京都(도쿄도)' 포함되어 있으므로 제거
inputDF2['address_japanese'] = inputDF2['address_japanese'].str.replace('東京都','')
inputDF2.tail()

In [None]:
#* (3) '시(市) / 구(区) 칼럼(shiku) 추가'
#* inputDF2['address_japanese'].str.contains('区').sum() 주소에 '区'(구)가 포함되어 있는 주소는 10913개로 데이터의 97% 차지
#* 주소에 市(시)가 없다면 구(区)로 추출
def districtExtract(address):
    if '市' in address:
        extracted_data = address[:address.index('市') + 1]
    elif '市' not in address and '区' in address:
        extracted_data = address[:address.index('区') + 1]
    else: 
        extracted_data = address 
    return extracted_data

inputDF2['shiku'] = inputDF2['address_japanese'].apply(districtExtract)

In [None]:
#* (4) 주소 내에서 구역이 존재하는 데이터 들의 구역 추출하여 칼럼 추가
def partExtract(address):
    if '丁目' in address:
        extracted_data = address[:address.find('丁目')] + '丁目'
    else:
        extracted_data = np.nan #* 구역 없음
    return extracted_data

inputDF2['townpart'] = inputDF2['address_japanese'].apply(partExtract)

In [None]:
#* (5) 주소 내에서 구역이 존재하지 않는 데이터들의 상세 주소 추출하여 칼럼 추가
def detailExtract(address):
    if pd.isna(address):  
        return address  
    
    if '丁目' not in address:
        extracted_data = address
    else:
        extracted_data = address[address.find('丁目') + 2:]
    return extracted_data 

inputDF2['detailpart'] = inputDF2['address_japanese'].apply(detailExtract)

In [None]:
inputDF2.head()

In [None]:
inputDF2.info()

In [None]:
inputDF2.to_csv("input_sub_listing.csv", index=False) #* 주소 가공 포함한 데이터