### Airbnb Host Recommender Using Content-Based Filtering

#### Big Data Group 17. 서현종, 김상현, 김승엽

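### Scenario

Ernest, a new Airbnb host, wants to list his home in New York as a rental.  
However, Ernest only knows the facts about his own property:  
its location, number of rooms, number of beds, bed type, furnishings, and so on.  
His main concern is what price he should set when listing the home on Airbnb  
in order to earn a meaningfully high rating.  
Ernest therefore turns to the Content-Based Filtering recommender system built by CAUSWE Big Data Group 17  
for help in setting an appropriate price.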

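### Program overview

The Airbnb Open Data set defines 106 features for each host listing.  
A typical recommender system would use Collaborative Filtering,  
but this data set contains no information about the users who gave the ratings,  
so a recommender based on content information seemed more appropriate.  
  
Using Content-Based Filtering, the system determines which listings in the data are most similar  
to the listing information entered by the host (cosine similarity) and selects the Top N. From those Top N  
it keeps only the highly rated listings and uses them to recommend an appropriate price and amenities.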
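
The flow described above (cosine similarity, Top N, keep the highly rated listings, derive a price) can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual implementation: the function name `recommend_price`, the `top_n` and `min_rating` parameters, the use of the median as the suggested price, and the toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def recommend_price(listings, features, query, top_n=50, min_rating=95.0):
    """Hypothetical sketch: cosine similarity -> Top N -> highly rated -> median price."""
    X = listings[features].to_numpy(dtype=float)
    # Standardize with the listings' own statistics so all features share one scale
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Xz = (X - mean) / std
    qz = (np.asarray(query, dtype=float) - mean) / std
    # Cosine similarity between the host's listing and every known listing
    norms = np.linalg.norm(Xz, axis=1) * np.linalg.norm(qz)
    norms[norms == 0] = 1.0
    sims = Xz @ qz / norms
    # Keep the Top N most similar listings, then only the highly rated ones
    top = listings.iloc[np.argsort(-sims)[:top_n]]
    good = top[top["review_scores_rating"] >= min_rating]
    if good.empty:  # fall back to the whole Top N if nothing passes the rating filter
        good = top
    return float(good["price"].median())

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "accommodates": [2, 2, 4, 1],
    "bedrooms": [1, 1, 2, 1],
    "price": [100.0, 150.0, 300.0, 60.0],
    "review_scores_rating": [98.0, 90.0, 97.0, 99.0],
})
suggested = recommend_price(toy, ["accommodates", "bedrooms"], query=[2, 1],
                            top_n=2, min_rating=95.0)
print(suggested)
```

The median is used here only because it is robust to a few extreme prices; a mean or a price range would fit the same outline.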

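### Model evaluation

#### A heuristic evaluation using a future dataset

The most basic and efficient way to evaluate a recommender system is a heuristic evaluation after it has actually been released.
  
The current dataset is from 2020. Suppose that the host using this recommender, Ernest,  
used it in 2020. If the recommender is meaningful,  
he should receive a meaningfully high rating in the following year, 2021.  
  
To evaluate the model, we therefore take the recommendation produced from the 2020 data,  
select the most similar listing in the 2021 dataset, and check whether that listing's rating is meaningfully high  
in order to assess the accuracy of the model.
  
In other words, the key to measuring this model's accuracy is how high a rating a listing received  
in the following year when its price was set according to the recommendation.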
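
The evaluation idea above could look something like the sketch below. It is only an outline under stated assumptions: the function name `evaluate_recommendation`, the choice of the next-year median rating as the baseline for "meaningfully high", and the toy 2021 data are all hypothetical, and the features are left unstandardized to keep the example short.

```python
import numpy as np
import pandas as pd

def evaluate_recommendation(rec_vector, next_year, features,
                            rating_col="review_scores_rating"):
    """Hypothetical check: find the most similar next-year listing and compare its rating."""
    X = next_year[features].to_numpy(dtype=float)
    q = np.asarray(rec_vector, dtype=float)
    # Cosine similarity between the recommended profile and each next-year listing
    denom = np.linalg.norm(X, axis=1) * np.linalg.norm(q)
    denom[denom == 0] = 1.0
    sims = X @ q / denom
    best = int(np.argmax(sims))
    matched_rating = float(next_year[rating_col].iloc[best])
    # Assumed baseline: the median rating of the next-year data
    baseline = float(next_year[rating_col].median())
    return matched_rating, matched_rating > baseline

# Toy "2021" data, invented for illustration only
data_2021 = pd.DataFrame({
    "accommodates": [2, 2, 4],
    "price": [100.0, 200.0, 300.0],
    "review_scores_rating": [97.0, 90.0, 92.0],
})
rating, is_high = evaluate_recommendation([2, 105.0], data_2021, ["accommodates", "price"])
print(rating, is_high)
```

Averaging the `is_high` outcome over many simulated hosts would give a single accuracy-like number, though what counts as "meaningfully high" remains a design decision.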

## 0. Loading the dataset used by the model

In [1]:
from IPython.core.interactiveshell import InteractiveShell  # IPython's interactive shell (interpreter)
InteractiveShell.ast_node_interactivity = "all"  # display the value of every expression in a cell, not just the last one
import warnings
warnings.filterwarnings("ignore")

In [2]:
import numpy as np
import pandas as pd

pd.options.display.max_rows = 200
pd.options.display.max_columns = 50

In [3]:
abm =  pd.read_csv('../input/airbnb-new-york-city-with-106-features/airbnbmark1.csv')
abm.head(3)
print('abm.shape',abm.shape)
print('abm.size',abm.size)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,...,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2595,https://www.airbnb.com/rooms/2595,20200212052319,2020-02-12,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...","- Spacious (500+ft²), immaculate and nicely fu...","Beautiful, spacious skylit studio in the heart...",none,Centrally located in the heart of Manhattan ju...,,Apartment is located on 37th Street between 5t...,"Guests have full access to the kitchen, bathro...",I am a Sound Therapy Practitioner and Kundalin...,"Make yourself at home, respect the space and t...",,,https://a0.muscache.com/im/pictures/f0813a11-4...,,2845,https://www.airbnb.com/users/show/2845,Jennifer,2008-09-09,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,...,2020-02-12,48,5,2009-11-21,2019-11-04,94.0,9.0,9.0,10.0,10.0,10.0,9.0,f,,,f,f,strict_14_with_grace_period,t,t,2,2,0,0,0.39
1,3831,https://www.airbnb.com/rooms/3831,20200212052319,2020-02-13,Cozy Entire Floor of Brownstone,Urban retreat: enjoy 500 s.f. floor in 1899 br...,Greetings! We own a double-duplex brownst...,Urban retreat: enjoy 500 s.f. floor in 1899 br...,none,Just the right mix of urban center and local n...,,B52 bus for a 10-minute ride to downtown Brook...,"You will have the private, exclusive use of an...","We'll be around, but since you have the top fl...",Smoking - outside please; pets allowed but ple...,,,https://a0.muscache.com/im/pictures/e49999c2-9...,,4869,https://www.airbnb.com/users/show/4869,LisaRoxanne,2008-12-07,"New York, New York, United States",Laid-back bi-coastal actor/professor/attorney.,...,2020-02-13,307,70,2014-09-30,2020-02-08,90.0,9.0,9.0,10.0,9.0,10.0,9.0,f,,,f,f,moderate,f,f,1,1,0,0,4.69
2,5099,https://www.airbnb.com/rooms/5099,20200212052319,2020-02-12,Large Cozy 1 BR Apartment In Midtown East,My large 1 bedroom apartment has a true New Yo...,I have a large 1 bedroom apartment centrally l...,My large 1 bedroom apartment has a true New Yo...,none,My neighborhood in Midtown East is called Murr...,Read My Full Listing For All Information. New ...,From the apartment is a 10 minute walk to Gran...,I will meet you upon arrival.,I usually check in with guests via text or ema...,• Check-in time is 2PM. • Check-out time is 12...,,,https://a0.muscache.com/im/pictures/24020910/1...,,7322,https://www.airbnb.com/users/show/7322,Chris,2009-02-02,"New York, New York, United States","I'm an artist, writer, traveler, and a native ...",...,2020-02-12,78,8,2009-04-20,2019-10-13,90.0,10.0,9.0,10.0,10.0,10.0,9.0,f,,,f,f,moderate,t,t,1,1,0,0,0.59


abm.shape (153254, 106)
abm.size 16244924


## 1. Data Cleaning

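We keep only the columns that fall into three groups:  
information the host can enter (location, room type, capacity, number of beds, etc.),  
information usable as an evaluation metric (rating, number of reviews),  
and the information that is ultimately recommended (price).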

In [4]:
abm1 = abm[['neighbourhood_group_cleansed', 'property_type', 'room_type'
           ,'accommodates', 'bathrooms', 'bedrooms', 'beds'
           , 'price','review_scores_rating', 'number_of_reviews', 'amenities']]

In [5]:
abm1.shape
abm1.head(3)

(153254, 11)

Unnamed: 0,neighbourhood_group_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities
0,Manhattan,Apartment,Entire home/apt,1,1.0,0.0,1.0,$225.00,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park..."
1,Brooklyn,Guest suite,Entire home/apt,3,1.0,1.0,4.0,$89.00,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."
2,Manhattan,Apartment,Entire home/apt,2,1.0,1.0,1.0,$200.00,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."


In [6]:
abm1 = abm1.drop_duplicates() # drop duplicated rows
print('abm1.shape after dropping duplicate rows: ',abm1.shape)
print('abm1.size:  ',abm1.size)
print('DataTypes wise size: \n', abm1.dtypes.value_counts())

abm1.shape after dropping duplicate rows:  (93309, 11)
abm1.size:   1026399
DataTypes wise size: 
 object     5
float64    4
int64      2
dtype: int64


### Checking for missing values, Data Cleaning

In [7]:
abm1.replace((' '),np.nan,inplace=True) # inplace=True modifies the original object instead of returning a new one
abm1.isnull().sum()

neighbourhood_group_cleansed        0
property_type                       0
room_type                           0
accommodates                        0
bathrooms                          67
bedrooms                          143
beds                              735
price                               0
review_scores_rating            16377
number_of_reviews                   0
amenities                           0
dtype: int64

In [8]:
abm1 = abm1.dropna(subset=['bathrooms', 'bedrooms', 'beds', 'review_scores_rating'], how='any', axis=0)
abm1.isnull().sum()

neighbourhood_group_cleansed    0
property_type                   0
room_type                       0
accommodates                    0
bathrooms                       0
bedrooms                        0
beds                            0
price                           0
review_scores_rating            0
number_of_reviews               0
amenities                       0
dtype: int64

In [9]:
print('abm1.shape after dropping nan rows: ',abm1.shape)

abm1.shape after dropping nan rows:  (76604, 11)


In [10]:
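def clean_data(df):
    """Strip the '$' sign and thousands separators from price, then cast to float."""
    for col in ['price']:
        # regex=False so that '$' is treated literally, not as an end-of-string anchor
        df[col] = df[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

    df.replace('', np.nan, inplace=True)  # treat empty strings as missing values

    return df.head(2)
clean_data(abm1)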

Unnamed: 0,neighbourhood_group_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities
0,Manhattan,Apartment,Entire home/apt,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park..."
1,Brooklyn,Guest suite,Entire home/apt,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."


### For accuracy, drop host listings with fewer than 10 reviews
Listings with only a handful of reviews and abnormally high ratings are spurious data points, so we cut them.  

In [11]:
indexNames = abm1[abm1['number_of_reviews'] < 10].index
abm1.drop(indexNames , inplace=True)
abm1.shape

(46537, 11)

### One-hot encoding the neighbourhood

In [12]:
abm1['neighbourhood_group_cleansed'].value_counts()

Brooklyn         19226
Manhattan        18332
Queens            6976
Bronx             1479
Staten Island      524
Name: neighbourhood_group_cleansed, dtype: int64

In [13]:
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Brooklyn'), 'Brooklyn'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Manhattan'), 'Manhattan'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Queens'), 'Queens'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Bronx'), 'Bronx'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Staten Island'), 'Staten Island'] = 1

abm1 = abm1.fillna(0)

abm1.drop('neighbourhood_group_cleansed', axis = 1, inplace=True)
abm1.isnull().sum()
abm1.head(5)

property_type           0
room_type               0
accommodates            0
bathrooms               0
bedrooms                0
beds                    0
price                   0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
dtype: int64

Unnamed: 0,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island
0,Apartment,Entire home/apt,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0
1,Guest suite,Entire home/apt,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0
2,Apartment,Entire home/apt,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0
4,Apartment,Private room,2,1.0,1.0,1.0,79.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0
5,Apartment,Private room,1,1.0,1.0,1.0,79.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0
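
The same encoding can also be produced with `pd.get_dummies`, which avoids writing one `loc` assignment per category. The sketch below runs on a small invented frame rather than on `abm1`. Unlike the manual version, `get_dummies` only creates columns for categories that actually occur and orders them alphabetically.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
    "price": [225.0, 89.0, 200.0, 79.0],
})

# One 0/1 column per borough; dtype=float mirrors the 0.0/1.0 values used in the notebook
dummies = pd.get_dummies(toy["neighbourhood_group_cleansed"], dtype=float)
encoded = pd.concat([toy.drop(columns="neighbourhood_group_cleansed"), dummies], axis=1)
print(encoded)
```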


### One-hot encoding the property type

In [14]:
# check which values property_type takes
abm1['property_type'].value_counts()

Apartment             34161
House                  5182
Townhouse              2427
Loft                   1558
Condominium            1486
Guest suite             717
Hotel                   223
Boutique hotel          179
Serviced apartment      146
Hostel                   93
Guesthouse               84
Bed and breakfast        46
Bungalow                 45
Villa                    39
Other                    38
Tiny house               34
Camper/RV                26
Cottage                  18
Resort                   11
Boat                     10
Earth house               4
Aparthotel                3
Barn                      2
Castle                    2
Houseboat                 2
Cabin                     1
Name: property_type, dtype: int64

In [15]:
Mod_prop_type=abm1['property_type'].value_counts()[5:len(abm1['property_type'].value_counts())].index.tolist()

def change_prop_type(label):
    if label in Mod_prop_type:
        label='Other'
    return label

In [16]:
abm1.loc[:,'property_type'] = abm1.loc[:,'property_type'].apply(change_prop_type)

In [17]:
abm1['property_type'].value_counts() # property types outside the top 5 are grouped into Other

Apartment      34161
House           5182
Townhouse       2427
Other           1723
Loft            1558
Condominium     1486
Name: property_type, dtype: int64

In [18]:
abm1.loc[abm1['property_type'].str.contains('Apartment'), 'Apartment'] = 1
abm1.loc[abm1['property_type'].str.contains('House'), 'House'] = 1
abm1.loc[abm1['property_type'].str.contains('Other'), 'Other'] = 1
abm1.loc[abm1['property_type'].str.contains('Townhouse'), 'Townhouse'] = 1
abm1.loc[abm1['property_type'].str.contains('Condominium'), 'Condominium'] = 1
abm1.loc[abm1['property_type'].str.contains('Loft'), 'Loft'] = 1
abm1 = abm1.fillna(0)

abm1.isnull().sum()
abm1.head(5)

property_type           0
room_type               0
accommodates            0
bathrooms               0
bedrooms                0
beds                    0
price                   0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
Apartment               0
House                   0
Other                   0
Townhouse               0
Condominium             0
Loft                    0
dtype: int64

Unnamed: 0,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft
0,Apartment,Entire home/apt,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Other,Entire home/apt,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,Apartment,Entire home/apt,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,Apartment,Private room,2,1.0,1.0,1.0,79.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,Apartment,Private room,1,1.0,1.0,1.0,79.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [19]:
abm1.drop('property_type', axis = 1, inplace=True)
abm1.head(3)

Unnamed: 0,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft
0,Entire home/apt,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Entire home/apt,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,Entire home/apt,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### One-hot encoding the room_type

In [20]:
abm1['room_type'].value_counts()

Entire home/apt    24902
Private room       20394
Shared room          875
Hotel room           366
Name: room_type, dtype: int64

In [21]:
abm1.loc[abm1['room_type'].str.contains('Entire home/apt'), 'Entire home/apt'] = 1
abm1.loc[abm1['room_type'].str.contains('Private room'), 'Private room'] = 1
abm1.loc[abm1['room_type'].str.contains('Shared room'), 'Shared room'] = 1
abm1.loc[abm1['room_type'].str.contains('Hotel room'), 'Hotel room'] = 1
abm1 = abm1.fillna(0)

abm1.drop('room_type', axis = 1, inplace=True)
abm1.isnull().sum()
abm1.head(5)

accommodates            0
bathrooms               0
bedrooms                0
beds                    0
price                   0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
Apartment               0
House                   0
Other                   0
Townhouse               0
Condominium             0
Loft                    0
Entire home/apt         0
Private room            0
Shared room             0
Hotel room              0
dtype: int64

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room
0,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,2,1.0,1.0,1.0,79.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,1,1.0,1.0,1.0,79.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


### Splitting amenities into columns

In [22]:
# splitting amenities feature 

amenities_list = list(abm1.amenities)
amenities_list_string = " ".join(amenities_list)
amenities_list_string = amenities_list_string.replace('{', '')
amenities_list_string = amenities_list_string.replace('}', ',')
amenities_list_string = amenities_list_string.replace('"', '')
amenities_set = [x.strip() for x in amenities_list_string.split(',')]
amenities_set = set(amenities_set)
print('\n Number of amenities present in total:',len(amenities_set))
print(amenities_set)


 Number of amenities present in total: 146
{'', 'Baby monitor', 'Stove', 'Toilet paper', '24-hour check-in', 'Wide clearance to shower', 'BBQ grill', 'Buzzer/wireless intercom', 'Lake access', 'Microwave', 'Cable TV', 'Lock on bedroom door', 'Pocket wifi', 'Safety card', 'Air purifier', 'Patio or balcony', 'Internet', 'Extra pillows and blankets', 'Outlet covers', 'Single level home', 'Window guards', 'Bedroom comforts', 'Fireplace guards', 'Free street parking', 'Well-lit path to entrance', 'Pack ’n Play/travel crib', 'Suitable for events', 'Laptop friendly workspace', 'Record player', 'Bath towel', 'Hot water kettle', 'Pool', 'Mobile hoist', 'Cooking basics', 'Handheld shower head', 'Ceiling hoist', 'Smoking allowed', 'Wide entrance for guests', 'Accessible-height toilet', 'Bathtub', 'Paid parking off premises', 'Table corner guards', 'Bathtub with bath chair', 'Oven', 'Private entrance', 'Washer/Dryer', 'Dryer', 'Bread maker', 'Wide entrance', 'Hot water', 'Bathroom essentials', 'B
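
As a complement to the global set built above, a single listing's `amenities` string can be parsed into its own set. The helper below is a hypothetical sketch (the name `parse_amenities` is not from the notebook) and assumes that no amenity name contains a comma. It also skips the empty string that shows up in the set printed above.

```python
def parse_amenities(raw):
    """Turn a string like '{TV,Wifi,"Air conditioning"}' into a set of amenity names."""
    inner = raw.strip().strip("{}")  # drop the surrounding braces
    # Split on commas, remove whitespace and quotes, skip empty fragments
    return {item.strip().strip('"') for item in inner.split(",") if item.strip()}

example = parse_amenities('{TV,Wifi,"Air conditioning",Kitchen}')
print(sorted(example))
```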

In [23]:
# Related amenities were grouped by hand into single categories.
# A new column is added for each amenity category.
# Rows are selected with loc and the column is set to 1 where the pattern matches.
abm1.loc[abm1['amenities'].str.contains('Air conditioning|Central air conditioning'), 'air_conditioning'] = 1
abm1.loc[abm1['amenities'].str.contains('Amazon Echo|Apple TV|Game console|Netflix|Projector and screen|Smart TV'), 'high_end_electronics'] = 1
abm1.loc[abm1['amenities'].str.contains('BBQ grill|Fire pit|Propane barbeque'), 'bbq'] = 1
abm1.loc[abm1['amenities'].str.contains('Balcony|Patio'), 'balcony'] = 1
abm1.loc[abm1['amenities'].str.contains('Beach view|Beachfront|Lake access|Mountain view|Ski-in/Ski-out|Waterfront'), 'nature_and_views'] = 1
abm1.loc[abm1['amenities'].str.contains('Bed linens'), 'bed_linen'] = 1
abm1.loc[abm1['amenities'].str.contains('Breakfast'), 'breakfast'] = 1
abm1.loc[abm1['amenities'].str.contains('TV'), 'tv'] = 1
abm1.loc[abm1['amenities'].str.contains('Coffee maker|Espresso machine'), 'coffee_machine'] = 1
abm1.loc[abm1['amenities'].str.contains('Cooking basics'), 'cooking_basics'] = 1
abm1.loc[abm1['amenities'].str.contains('Dishwasher|Dryer|Washer'), 'white_goods'] = 1
abm1.loc[abm1['amenities'].str.contains('Elevator'), 'elevator'] = 1
abm1.loc[abm1['amenities'].str.contains('Exercise equipment|Gym|gym'), 'gym'] = 1
abm1.loc[abm1['amenities'].str.contains('Family/kid friendly|Children|children'), 'child_friendly'] = 1
abm1.loc[abm1['amenities'].str.contains('parking'), 'parking'] = 1
abm1.loc[abm1['amenities'].str.contains('Garden|Outdoor|Sun loungers|Terrace'), 'outdoor_space'] = 1
abm1.loc[abm1['amenities'].str.contains('Host greets you'), 'host_greeting'] = 1
abm1.loc[abm1['amenities'].str.contains('Hot tub|Jetted tub|hot tub|Sauna|Pool|pool'), 'hot_tub_sauna_or_pool'] = 1
abm1.loc[abm1['amenities'].str.contains('Internet|Pocket wifi|Wifi'), 'internet'] = 1
abm1.loc[abm1['amenities'].str.contains('Long term stays allowed'), 'long_term_stays'] = 1
abm1.loc[abm1['amenities'].str.contains(r'Pets|pet|Cat\(s\)|Dog\(s\)'), 'pets_allowed'] = 1  # parentheses escaped so the literal "Cat(s)"/"Dog(s)" match
abm1.loc[abm1['amenities'].str.contains('Private entrance'), 'private_entrance'] = 1
abm1.loc[abm1['amenities'].str.contains('Safe|Security system'), 'secure'] = 1
abm1.loc[abm1['amenities'].str.contains('Self check-in'), 'self_check_in'] = 1
abm1.loc[abm1['amenities'].str.contains('Smoking allowed'), 'smoking_allowed'] = 1
abm1.loc[abm1['amenities'].str.contains('Step-free access|Wheelchair|Accessible'), 'accessible'] = 1
abm1.loc[abm1['amenities'].str.contains('Suitable for events'), 'event_suitable'] = 1
abm1.loc[abm1['amenities'].str.contains('24-hour check-in'), 'check_in_24h'] = 1
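
The long run of `loc` assignments above can be driven by a dictionary instead, which keeps the keyword lists in one place. The sketch below is an illustration and not the notebook's code: `AMENITY_GROUPS` shows only three of the groups, `add_amenity_flags` is a hypothetical helper, and the toy rows are invented. Passing each keyword through `re.escape` makes names such as `Cat(s)` match literally.

```python
import re
import pandas as pd

# Subset of the groups above, for illustration only
AMENITY_GROUPS = {
    "air_conditioning": ["Air conditioning", "Central air conditioning"],
    "pets_allowed": ["Pets", "pet", "Cat(s)", "Dog(s)"],
    "internet": ["Internet", "Pocket wifi", "Wifi"],
}

def add_amenity_flags(df, groups, source_col="amenities"):
    """Add one 0.0/1.0 column per group; 1.0 where any keyword occurs in the amenities string."""
    out = df.copy()
    for col, keywords in groups.items():
        # Escape each keyword so characters like '(' and ')' are matched literally
        pattern = "|".join(re.escape(k) for k in keywords)
        out[col] = out[source_col].str.contains(pattern, regex=True).astype(float)
    return out

toy = pd.DataFrame({"amenities": ['{TV,Wifi,"Air conditioning"}',
                                  '{"Cat(s)",Kitchen}',
                                  '{Kitchen}']})
flags = add_amenity_flags(toy, AMENITY_GROUPS)
print(flags.drop(columns="amenities"))
```

Because every column is written as 0.0 or 1.0 directly, this variant would not need a later `fillna(0)` for the amenity columns.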

In [24]:
abm1.head(5)

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room,air_conditioning,high_end_electronics,...,balcony,nature_and_views,bed_linen,breakfast,tv,coffee_machine,cooking_basics,white_goods,elevator,gym,child_friendly,parking,outdoor_space,host_greeting,hot_tub_sauna_or_pool,internet,long_term_stays,pets_allowed,private_entrance,secure,self_check_in,smoking_allowed,accessible,event_suitable,check_in_24h
0,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,,...,,,1.0,,1.0,1.0,1.0,,,,1.0,1.0,,,,1.0,1.0,,,,1.0,,,,
1,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,,...,,,,,1.0,1.0,1.0,,,,1.0,1.0,,,,1.0,1.0,1.0,,,1.0,,,,1.0
2,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,,...,,,1.0,,1.0,1.0,1.0,,,,,,,1.0,,1.0,,,,,,,,,
4,2,1.0,1.0,1.0,79.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,,...,,,1.0,,1.0,,,,,,1.0,1.0,,,,1.0,,,,,,,,,
5,1,1.0,1.0,1.0,79.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,,...,,,,1.0,,,,,1.0,,,1.0,,1.0,,1.0,,1.0,,,,,,,


In [25]:
abm1 = abm1.fillna(0)
abm1.drop('amenities', axis = 1, inplace=True)
abm1.isnull().sum()
abm1.head(5)

accommodates             0
bathrooms                0
bedrooms                 0
beds                     0
price                    0
review_scores_rating     0
number_of_reviews        0
Brooklyn                 0
Manhattan                0
Queens                   0
Bronx                    0
Staten Island            0
Apartment                0
House                    0
Other                    0
Townhouse                0
Condominium              0
Loft                     0
Entire home/apt          0
Private room             0
Shared room              0
Hotel room               0
air_conditioning         0
high_end_electronics     0
bbq                      0
balcony                  0
nature_and_views         0
bed_linen                0
breakfast                0
tv                       0
coffee_machine           0
cooking_basics           0
white_goods              0
elevator                 0
gym                      0
child_friendly           0
parking                  0
o

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room,air_conditioning,high_end_electronics,bbq,balcony,nature_and_views,bed_linen,breakfast,tv,coffee_machine,cooking_basics,white_goods,elevator,gym,child_friendly,parking,outdoor_space,host_greeting,hot_tub_sauna_or_pool,internet,long_term_stays,pets_allowed,private_entrance,secure,self_check_in,smoking_allowed,accessible,event_suitable,check_in_24h
0,1,1.0,0.0,1.0,225.0,94.0,48,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,3,1.0,1.0,4.0,89.0,90.0,307,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,2,1.0,1.0,1.0,200.0,90.0,78,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,1.0,1.0,1.0,79.0,84.0,463,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1,1.0,1.0,1.0,79.0,98.0,118,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Now split the amenities into common and uncommon ones.

In [26]:
print('Amenities Column Names:\n',abm1.columns[22:],'\n')
print(' Number of Amenities columns after categorizing under same names:',abm1.columns[22:].shape)

Amenities Column Names:
 Index(['air_conditioning', 'high_end_electronics', 'bbq', 'balcony',
       'nature_and_views', 'bed_linen', 'breakfast', 'tv', 'coffee_machine',
       'cooking_basics', 'white_goods', 'elevator', 'gym', 'child_friendly',
       'parking', 'outdoor_space', 'host_greeting', 'hot_tub_sauna_or_pool',
       'internet', 'long_term_stays', 'pets_allowed', 'private_entrance',
       'secure', 'self_check_in', 'smoking_allowed', 'accessible',
       'event_suitable', 'check_in_24h'],
      dtype='object') 

 Number of Amenities columns after categorizing under same names: (28,)


In [29]:
frequent_amenities = []
infrequent_amenities=[]
for col in abm1.iloc[:,22:].columns:
    if abm1[col].sum() > len(abm1)/5: # an amenity present in more than 20% of all listings counts as common
        frequent_amenities.append(col)
    else:
        infrequent_amenities.append(col)
print('Common_amenities: \n',frequent_amenities)
print('-----------------------')
print('Special_amenities: \n',infrequent_amenities)
print('frequent_amenities',len(frequent_amenities))
print('infrequent_amenities',len(infrequent_amenities))

Common_amenities: 
 ['air_conditioning', 'bed_linen', 'tv', 'coffee_machine', 'cooking_basics', 'white_goods', 'child_friendly', 'parking', 'host_greeting', 'internet', 'long_term_stays', 'private_entrance', 'self_check_in']
-----------------------
Special_amenities: 
 ['high_end_electronics', 'bbq', 'balcony', 'nature_and_views', 'breakfast', 'elevator', 'gym', 'outdoor_space', 'hot_tub_sauna_or_pool', 'pets_allowed', 'secure', 'smoking_allowed', 'accessible', 'event_suitable', 'check_in_24h']
frequent_amenities 13
infrequent_amenities 15
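
The same 20% split can be written without an explicit loop, since the mean of a 0/1 column is the share of listings that have that amenity. The helper name `split_by_frequency` and the toy flags below are assumptions for illustration.

```python
import pandas as pd

def split_by_frequency(flags, threshold=0.2):
    """Split 0/1 amenity columns into frequent (> threshold share) and infrequent ones."""
    share = flags.mean()  # fraction of listings that have each amenity
    frequent = share[share > threshold].index.tolist()
    infrequent = share[share <= threshold].index.tolist()
    return frequent, infrequent

# Toy flags for ten listings, invented for illustration only
toy_flags = pd.DataFrame({
    "tv": [1] * 8 + [0] * 2,
    "bbq": [1] + [0] * 9,
    "internet": [1] * 10,
})
frequent, infrequent = split_by_frequency(toy_flags)
print(frequent, infrequent)
```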


In [30]:
# how many special amenities each listing has
abm1['special_amenities']=abm1[['high_end_electronics', 'bbq', 'balcony'
                                , 'nature_and_views', 'breakfast', 'elevator'
                                , 'gym', 'outdoor_space', 'hot_tub_sauna_or_pool'
                                , 'pets_allowed', 'secure', 'smoking_allowed'
                                , 'accessible', 'event_suitable'
                                , 'check_in_24h']].sum(axis=1)
abm1['special_amenities'].isnull().sum()
abm1.columns
abm1['special_amenities'].astype(float)
abm1['special_amenities'][:10]

0

Index(['accommodates', 'bathrooms', 'bedrooms', 'beds', 'price',
       'review_scores_rating', 'number_of_reviews', 'Brooklyn', 'Manhattan',
       'Queens', 'Bronx', 'Staten Island', 'Apartment', 'House', 'Other',
       'Townhouse', 'Condominium', 'Loft', 'Entire home/apt', 'Private room',
       'Shared room', 'Hotel room', 'air_conditioning', 'high_end_electronics',
       'bbq', 'balcony', 'nature_and_views', 'bed_linen', 'breakfast', 'tv',
       'coffee_machine', 'cooking_basics', 'white_goods', 'elevator', 'gym',
       'child_friendly', 'parking', 'outdoor_space', 'host_greeting',
       'hot_tub_sauna_or_pool', 'internet', 'long_term_stays', 'pets_allowed',
       'private_entrance', 'secure', 'self_check_in', 'smoking_allowed',
       'accessible', 'event_suitable', 'check_in_24h', 'special_amenities'],
      dtype='object')

0         0.0
1         2.0
2         0.0
4         0.0
5         3.0
         ... 
150752    1.0
150903    1.0
151139    1.0
151218    0.0
151596    0.0
Name: special_amenities, Length: 46537, dtype: float64

0     0.0
1     2.0
2     0.0
4     0.0
5     3.0
6     1.0
7     1.0
8     4.0
9     1.0
10    0.0
Name: special_amenities, dtype: float64

In [31]:
abm1['common_amenities']=abm1[['air_conditioning', 'bed_linen', 'tv'
                               , 'coffee_machine', 'cooking_basics'
                               , 'white_goods', 'child_friendly'
                               , 'parking', 'host_greeting', 'internet'
                               , 'long_term_stays', 'private_entrance'
                               , 'self_check_in']].sum(axis=1)
abm1['common_amenities'].isnull().sum()
abm1.columns
abm1['common_amenities'].astype(float)
abm1['common_amenities'][:10]

0

Index(['accommodates', 'bathrooms', 'bedrooms', 'beds', 'price',
       'review_scores_rating', 'number_of_reviews', 'Brooklyn', 'Manhattan',
       'Queens', 'Bronx', 'Staten Island', 'Apartment', 'House', 'Other',
       'Townhouse', 'Condominium', 'Loft', 'Entire home/apt', 'Private room',
       'Shared room', 'Hotel room', 'air_conditioning', 'high_end_electronics',
       'bbq', 'balcony', 'nature_and_views', 'bed_linen', 'breakfast', 'tv',
       'coffee_machine', 'cooking_basics', 'white_goods', 'elevator', 'gym',
       'child_friendly', 'parking', 'outdoor_space', 'host_greeting',
       'hot_tub_sauna_or_pool', 'internet', 'long_term_stays', 'pets_allowed',
       'private_entrance', 'secure', 'self_check_in', 'smoking_allowed',
       'accessible', 'event_suitable', 'check_in_24h', 'special_amenities',
       'common_amenities'],
      dtype='object')

0         10.0
1          9.0
2          7.0
4          6.0
5          4.0
          ... 
150752     7.0
150903     8.0
151139     7.0
151218     4.0
151596     4.0
Name: common_amenities, Length: 46537, dtype: float64

0     10.0
1      9.0
2      7.0
4      6.0
5      4.0
6      7.0
7      6.0
8      7.0
9      6.0
10     6.0
Name: common_amenities, dtype: float64

In [32]:
abm1.isnull().sum()

accommodates             0
bathrooms                0
bedrooms                 0
beds                     0
price                    0
review_scores_rating     0
number_of_reviews        0
Brooklyn                 0
Manhattan                0
Queens                   0
Bronx                    0
Staten Island            0
Apartment                0
House                    0
Other                    0
Townhouse                0
Condominium              0
Loft                     0
Entire home/apt          0
Private room             0
Shared room              0
Hotel room               0
air_conditioning         0
high_end_electronics     0
bbq                      0
balcony                  0
nature_and_views         0
bed_linen                0
breakfast                0
tv                       0
coffee_machine           0
cooking_basics           0
white_goods              0
elevator                 0
gym                      0
child_friendly           0
parking                  0
o

In [33]:
abm1 = abm1.drop(abm1.columns[22:-2], axis = 1, inplace=False)
abm1.head(5)

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room,special_amenities,common_amenities
0,1,1.0,0.0,1.0,225.0,94.0,48,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,10.0
1,3,1.0,1.0,4.0,89.0,90.0,307,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,9.0
2,2,1.0,1.0,1.0,200.0,90.0,78,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,7.0
4,2,1.0,1.0,1.0,79.0,84.0,463,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,6.0
5,1,1.0,1.0,1.0,79.0,98.0,118,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,4.0


## 2. Content-Based Filtering Recommender System

### Building a new matrix for the similarity check

In [52]:
abm2 = abm1.drop(['price', 'review_scores_rating', 'number_of_reviews'], axis = 1, inplace=False)
abm2.head(5)

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room,special_amenities,common_amenities
0,1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,10.0
1,3,1.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,9.0
2,2,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,7.0
4,2,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,6.0
5,1,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,4.0


In [53]:
abm2 = (abm2 - abm2.mean())/abm2.std()
abm2.head(5)

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room,special_amenities,common_amenities
0,-1.051304,-0.342984,-1.567475,-0.56758,-0.839017,1.240376,-0.419918,-0.181173,-0.106714,0.601894,-0.353981,-0.196079,-0.234564,-0.181615,-0.186112,0.932087,-0.88322,-0.138427,-0.089033,-0.903032,1.25295
1,-0.082309,-0.342984,-0.295538,1.852492,1.191845,-0.80619,-0.419918,-0.181173,-0.106714,-1.661386,-0.353981,5.099875,-0.234564,-0.181615,-0.186112,0.932087,-0.88322,-0.138427,-0.089033,0.638186,0.842283
2,-0.566807,-0.342984,-0.295538,-0.56758,-0.839017,1.240376,-0.419918,-0.181173,-0.106714,0.601894,-0.353981,-0.196079,-0.234564,-0.181615,-0.186112,0.932087,-0.88322,-0.138427,-0.089033,-0.903032,0.020949
4,-0.566807,-0.342984,-0.295538,-0.56758,-0.839017,1.240376,-0.419918,-0.181173,-0.106714,0.601894,-0.353981,-0.196079,-0.234564,-0.181615,-0.186112,-1.072838,1.132197,-0.138427,-0.089033,-0.903032,-0.389717
5,-1.051304,-0.342984,-0.295538,-0.56758,-0.839017,1.240376,-0.419918,-0.181173,-0.106714,0.601894,-0.353981,-0.196079,-0.234564,-0.181615,-0.186112,-1.072838,1.132197,-0.138427,-0.089033,1.408795,-1.211051
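
The z-score step above can be reproduced on a toy matrix. The sketch below is illustrative only: `features` and `query` hold made-up values, not rows from the Airbnb data. It passes `ddof=1` because pandas' `DataFrame.std` defaults to the sample standard deviation, whereas NumPy's `std` defaults to `ddof=0`.

```python
import numpy as np

# Toy feature matrix (rows = listings, columns = features); values are made up.
features = np.array([
    [1.0, 1.0, 10.0],
    [3.0, 1.0, 9.0],
    [2.0, 2.0, 7.0],
    [2.0, 1.0, 6.0],
])

# Column statistics; ddof=1 mirrors the pandas default for .std().
col_mean = features.mean(axis=0)
col_std = features.std(axis=0, ddof=1)

# Standardize the whole matrix: each column ends up with mean 0 and std 1.
z = (features - col_mean) / col_std

# A new query row has to be scaled with the SAME statistics as the matrix,
# otherwise it does not live in the same space as the standardized rows.
query = np.array([4.0, 1.0, 5.0])
z_query = (query - col_mean) / col_std
```

Keeping `col_mean` and `col_std` around is what lets a host's raw input be compared against the standardized listings later on.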


### Computing similarity
We use cosine similarity.

In [54]:
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
  return dot(A, B)/(norm(A)*norm(B))
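
As a quick sanity check of the definition, the sketch below evaluates the same `cos_sim` on hand-picked vectors (the vectors are made up for illustration). Parallel vectors should give 1, orthogonal vectors 0, and opposite vectors -1. Note that the value is undefined for an all-zero vector, because the denominator becomes 0.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
    # Cosine of the angle between A and B.
    return dot(A, B) / (norm(A) * norm(B))

a = np.array([1.0, 2.0, 3.0])

same_direction = cos_sim(a, 2 * a)   # scaling does not change the angle
orthogonal = cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cos_sim(a, -a)
```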

In [55]:
abm2.columns # values that must be collected from the user

Index(['accommodates', 'bathrooms', 'bedrooms', 'beds', 'Brooklyn',
       'Manhattan', 'Queens', 'Bronx', 'Staten Island', 'Apartment', 'House',
       'Other', 'Townhouse', 'Condominium', 'Loft', 'Entire home/apt',
       'Private room', 'Shared room', 'Hotel room', 'special_amenities',
       'common_amenities'],
      dtype='object')

In [56]:
checklist = ['accommodates', 'bathrooms', 'bedrooms', 'beds', 'Borough'
             , 'property_type', 'room_type', 'special_amenities'
             , 'common_amenities']

def one_hot(x, size):
    # Return a one-hot list of length `size` with a 1 at position x.
    # Out-of-range choices are rejected.
    if x < 0 or x >= size:
        raise ValueError('error. wrong input')
    return [1 if i == x else 0 for i in range(size)]

def read_count(limit=20):
    # Read an amenity count and reject values outside 0..limit.
    x = int(input('How many do you have ?: '))
    if x < 0 or x > limit:
        raise ValueError('error. wrong input')
    return x

def simple_ui():
    data = []
    for idx, item in enumerate(checklist):
        if idx <= 3:
            # accommodates, bathrooms, bedrooms, beds
            data.append(int(input(f'\'{item}\' ?: ')))
        elif idx == 4:
            print('Brooklyn: 0, Manhattan: 1, Queens: 2, Bronx: 3, Staten Island: 4')
            data.extend(one_hot(int(input('your local?: ')), 5))
        elif idx == 5:
            print('Apartment: 0, House: 1, Other: 2, Townhouse: 3, Condominium: 4, Loft: 5')
            data.extend(one_hot(int(input('your property type?: ')), 6))
        elif idx == 6:
            print('Entire home/apt: 0, Private room: 1, Shared room: 2, Hotel room: 3')
            data.extend(one_hot(int(input('your room type?: ')), 4))
        elif idx == 7:
            print('Please input how many you have special amenities.')
            print('''Ex) high_end_electronics, bbq, balcony, nature_and_views
            , breakfast, elevator, gym, outdoor_space, hot_tub_sauna_or_pool
            , pets_allowed, smoking_allowed, etc...''')
            data.append(read_count())
        elif idx == 8:
            print('Please input how many you have common amenities.')
            print('''Ex) air_conditioning, bed_linen, tv, coffee_machine
                  , cooking_basics, white_goods, child_friendly
                  , parking, host_greeting, internet, self_check_in, etc...''')
            data.append(read_count())
    return np.array(data)

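def norm_input(test):
    # Standardize the raw input in place, using the mean and std of the
    # unscaled columns in abm1 (abm2 itself has already been standardized).
    for idx, item in enumerate(abm2.columns):
        test[idx] = (test[idx] - abm1[item].mean())/abm1[item].std()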
        
data = simple_ui().astype(float)
print(data)

'accommodates' ?:  4
'bathrooms' ?:  1
'bedrooms' ?:  1
'beds' ?:  2


Brooklyn: 0, Manhattan: 1, Queens: 2, Bronx: 3, Staten Island: 4


your local?:  1


Apartment: 0, House: 1, Other: 2, Townhouse: 3, Condominium: 4, Loft: 5


your property type?:  0


Entire home/apt: 0, Private room: 1, Shared room: 2, Hotel room: 3


your room type?:  0


Please input how many you have special amenities.
Ex) high_end_electronics, bbq, balcony, nature_and_views
            , breakfast, elevator, gym, outdoor_space, hot_tub_sauna_or_pool
            , pets_allowed, smoking_allowed, etc...


How many do you have ?:  2


Please input how many you have common amenities.
Ex) air_conditioning, bed_linen, tv, coffee_machine
                  , cooking_basics, white_goods, child_friendly
                  , parking, host_greeting, internet, self_check_in, etc...


How many do you have ?:  5


[4. 1. 1. 2. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 2. 5.]


In [57]:
sim = []
norm_input(data)

for item in abm2.to_numpy():
    sim.append(cos_sim(data, item))
    
sim[:10]

[0.3863006537692014,
 -0.09807740754348374,
 0.635074493569073,
 0.10525657651208936,
 0.3249314862878927,
 0.8721548794694693,
 0.18802268827214957,
 -0.33911535118673336,
 0.14906366678989583,
 0.6131875911267202]
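
The row-by-row loop above can also be written as a single matrix operation. The following is a minimal sketch, not the notebook's code: `cos_sim_matrix` is a hypothetical helper name, and the matrix and query are random toy data. It computes the same quantity as calling `cos_sim` once per row.

```python
import numpy as np

def cos_sim_matrix(query, matrix):
    """Cosine similarity between one query vector and every row of matrix."""
    dots = matrix @ query                                   # one dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms

rng = np.random.default_rng(0)
matrix = rng.normal(size=(5, 4))   # 5 toy listings, 4 features
query = rng.normal(size=4)

vectorized = cos_sim_matrix(query, matrix)

# Reference: the same computation, one row at a time.
looped = np.array([
    np.dot(query, row) / (np.linalg.norm(query) * np.linalg.norm(row))
    for row in matrix
])
```

On a table with tens of thousands of rows, the vectorized form avoids the Python-level loop, which is usually the slower part.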

In [58]:
sim_sorted_byindex = sorted(range(len(sim)), key=lambda k: sim[k])
sim_sorted_byindex = sim_sorted_byindex[-30:]
print(sim_sorted_byindex)

sim.sort()
sim = sim[-30:]
print(sim)
# Extract the 30 rows with the highest similarity

# 2,1,2,4,200,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0
#abm1.to_numpy()[42430] # The most similar host; however, a high rating is not guaranteed

[38244, 24592, 38076, 4642, 7649, 10465, 22299, 23683, 24676, 25035, 37466, 5854, 8152, 8485, 9069, 19110, 26470, 44956, 46444, 3254, 4403, 5850, 7548, 11343, 28197, 34315, 7755, 16168, 16682, 31734]
[0.9706743120457662, 0.9764013748004905, 0.9764013748004905, 0.9810913987206142, 0.9810913987206142, 0.9810913987206142, 0.9810913987206142, 0.9810913987206142, 0.9810913987206142, 0.9810913987206142, 0.9810913987206142, 0.9832505936736774, 0.986795980037229, 0.986795980037229, 0.986795980037229, 0.986795980037229, 0.986795980037229, 0.986795980037229, 0.986795980037229, 0.9892594027127277, 0.9892594027127277, 0.9892594027127277, 0.9892594027127277, 0.9892594027127277, 0.9892594027127277, 0.9892594027127277, 1.0000000000000002, 1.0000000000000002, 1.0000000000000002, 1.0000000000000002]
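
An alternative way to take the top N, sketched here with `np.argsort`, keeps the positions and their similarity values paired without sorting the original list in place. The `sim` values below are the first ten similarities from the earlier output, rounded to two decimals, and `top_n = 3` is chosen only to keep the example small.

```python
import numpy as np

sim = np.array([0.39, -0.10, 0.64, 0.11, 0.32, 0.87, 0.19, -0.34, 0.15, 0.61])
top_n = 3

# argsort orders positions by ascending similarity, so the last top_n
# positions belong to the most similar rows.
top_idx = np.argsort(sim, kind="stable")[-top_n:]
top_sim = sim[top_idx]   # similarities in the same order as top_idx
```

Because `sim` is left untouched, it can still be indexed by row position afterwards.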


### Criterion for a "high" rating
We now have the 30 most similar listings,  
but from these we still need to select the ones whose ratings are meaningfully high.

In [59]:
rating_sum = 0

for idx in sim_sorted_byindex:
    rating_sum += abm1.to_numpy()[idx][5]  # column 5 is review_scores_rating

mean = rating_sum / len(sim_sorted_byindex)
print(mean)

92.03333333333333


### Keeping only the top-30 listings whose rating is above the mean

In [60]:
selected = {}

for idx, s in zip(sim_sorted_byindex, sim):
    if abm1.to_numpy()[idx][5] > mean:
        selected[idx] = s
        
selected

{4642: 0.9810913987206142,
 23683: 0.9810913987206142,
 37466: 0.9810913987206142,
 5854: 0.9832505936736774,
 8485: 0.986795980037229,
 9069: 0.986795980037229,
 19110: 0.986795980037229,
 26470: 0.986795980037229,
 44956: 0.986795980037229,
 46444: 0.986795980037229,
 4403: 0.9892594027127277,
 5850: 0.9892594027127277,
 7548: 0.9892594027127277,
 28197: 0.9892594027127277,
 34315: 0.9892594027127277,
 7755: 1.0000000000000002,
 16682: 1.0000000000000002,
 31734: 1.0000000000000002}
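
The same filter can be expressed with a boolean mask. This is a small sketch on hypothetical data: `candidate_idx`, `ratings`, and `sims` are invented values standing in for the top-N row positions, their review scores, and their similarities.

```python
import numpy as np

# Hypothetical candidates: row position, rating, and similarity to the query.
candidate_idx = np.array([11, 42, 7, 93, 58])
ratings = np.array([93.0, 88.0, 96.0, 90.0, 95.0])
sims = np.array([0.97, 0.98, 0.98, 0.99, 1.00])

threshold = ratings.mean()    # mean rating of the candidates
keep = ratings > threshold    # strict comparison, as in the cell above

# Map each surviving row position to its similarity.
selected = dict(zip(candidate_idx[keep].tolist(), sims[keep].tolist()))
```

Using the mean of the candidates as the cutoff means the threshold adapts to each query; a fixed cutoff would be a different design choice.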

## 3. Final recommended price

In [61]:
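def recommend_price(selected):
    # Similarity-weighted average price of the selected listings.
    numerator = 0
    denominator = 0
    for key in selected.keys():
        numerator += abm1.to_numpy()[key][4] * selected[key]  # column 4 is price
        denominator += selected[key]
    return numerator/denominator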

In [62]:
recommend_price(selected)

209.20883709478488
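
The recommended price is a similarity-weighted mean, which `np.average` computes directly. The sketch below uses three made-up price and similarity pairs to show that it matches the manual sum-of-products form.

```python
import numpy as np

# Hypothetical prices of selected listings and their similarity weights.
prices = np.array([200.0, 150.0, 110.0])
sims = np.array([0.98, 0.99, 1.00])

weighted = np.average(prices, weights=sims)        # sum(p * s) / sum(s)
manual = (prices * sims).sum() / sims.sum()
plain = prices.mean()                              # unweighted, for comparison
```

When the weights are all close to 1, as with the similarities selected above, the weighted mean stays close to the plain mean, so the rating filter is likely to influence the result more than the weights do.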

In [63]:
for idx in selected.keys():
    print(abm1.to_numpy()[idx])

[  3.   1.   1.   2. 200.  93. 118.   0.   1.   0.   0.   0.   1.   0.
   0.   0.   0.   0.   1.   0.   0.   0.   2.   5.]
[  3.   1.   1.   2. 200.  93. 112.   0.   1.   0.   0.   0.   1.   0.
   0.   0.   0.   0.   1.   0.   0.   0.   2.   5.]
[  3.   1.   1.   2. 200.  94. 121.   0.   1.   0.   0.   0.   1.   0.
   0.   0.   0.   0.   1.   0.   0.   0.   2.   5.]
[  5.   1.   1.   2. 150.  95. 110.   0.   1.   0.   0.   0.   1.   0.
   0.   0.   0.   0.   1.   0.   0.   0.   2.   5.]
[  4.   1.   1.   2. 175.  93.  10.   0.   1.   0.   0.   0.   1.   0.
   0.   0.   0.   0.   1.   0.   0.   0.   2.   6.]
[  4.   1.   1.   2. 110.  96.  18.   0.   1.   0.   0.   0.   1.   0.
   0.   0.   0.   0.   1.   0.   0.   0.   2.   6.]
[  4.   1.   1.   2. 225.  93.  16.   0.   1.   0.   0.   0.   1.   0.
   0.   0.   0.   0.   1.   0.   0.   0.   2.   6.]
[  4.   1.   1.   2. 110.  96.  17.   0.   1.   0.   0.   0.   1.   0.
   0.   0.   0.   0.   1.   0.   0.   0.   2.   6.]
[  4.   1.   1. 