### Airbnb Host Recommender Using Content-Based Filtering

#### Big Data Group 17. 서현종, 김상현, 김승엽

### Scenario

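Ernest, a new Airbnb host, wants to list his home in New York as a rental.  
However, Ernest only knows the facts about his own property: its location, the number of rooms, the number of beds, the bed type, and so on.  
His main concerns are what price and security deposit to set when listing the home on Airbnb,  
and which amenities (furnishings and appliances) to provide.  
Ernest therefore turns to the content-based filtering recommender built by CAUSWE Big Data Group 17  
for help with setting an appropriate price and choosing amenities.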

### Program Overview

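Airbnb Open Data is a dataset that defines 106 features for each host listing.  
A typical recommender system would use collaborative filtering,  
but this dataset contains no user-level information (only the average ratings of each listing are available),  
so a recommender based on the content of the listings seemed more appropriate.  
  
Using content-based filtering, the recommender compares the listing information entered by the host  
with every listing in the dataset using cosine similarity and keeps the Top N most similar listings. From this Top N,  
it keeps only the highly rated listings and uses them to recommend an appropriate price and amenities.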
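
The flow described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the notebook's implementation: the function name `recommend_price_sketch`, its parameters, and the toy arrays are all assumptions introduced here.

```python
import numpy as np

def recommend_price_sketch(features, prices, ratings, query, top_n=3, min_rating=97.0):
    """Toy pipeline: similarity -> Top N -> rating filter -> weighted price."""
    # Cosine similarity between the query and every listing (one row per listing).
    sims = features @ query / (np.linalg.norm(features, axis=1) * np.linalg.norm(query))
    # Positions of the top_n most similar listings.
    top = np.argsort(sims)[-top_n:]
    # Among those, keep only the highly rated listings.
    keep = top[ratings[top] >= min_rating]
    if keep.size == 0:
        return None  # nothing to base a recommendation on
    # Similarity-weighted average price of the remaining listings.
    return float(np.average(prices[keep], weights=sims[keep]))

# Hypothetical, already standardized feature vectors for four listings.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
prices = np.array([100.0, 200.0, 300.0, 400.0])
ratings = np.array([98.0, 90.0, 99.0, 97.0])
query = np.array([1.0, 0.0])

price = recommend_price_sketch(features, prices, ratings, query, top_n=2)
```

In this toy run the two most similar listings are the first two rows, and only the first one passes the rating filter, so its price is returned unchanged.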

### Model Evaluation Method

#### Heuristic Evaluation Using a Future Dataset

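The most basic and efficient way to evaluate a recommender system is a heuristic evaluation after it has actually been released.
  
The current dataset is from 2020, so assume that the host, Ernest, used this recommender in 2020.  
If the recommender is accurate, his listing should then receive a meaningfully high rating  
in the following year, 2021.  
  
To evaluate the model, we will therefore select from the 2021 dataset the listing most similar  
to the recommendation produced from the 2020 data, and judge the accuracy of the model  
by whether that listing's rating is meaningfully high.
  
In other words, the key measure of the model's accuracy is how high a rating a listing receives in the following year  
when its price and amenities follow the recommendation.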
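
The evaluation idea could look roughly like the sketch below. It assumes a 2021 dataset that has been cleaned and standardized the same way as the 2020 data. The function name, the toy arrays, and the threshold of 97 are hypothetical; the notebook does not define what counts as meaningfully high for 2021.

```python
import numpy as np

def evaluate_recommendation(rec_vector, features_next, ratings_next, threshold):
    """Find the next-year listing closest to the recommendation and check its rating."""
    sims = features_next @ rec_vector / (
        np.linalg.norm(features_next, axis=1) * np.linalg.norm(rec_vector)
    )
    best = int(np.argmax(sims))          # position of the most similar listing
    rating = float(ratings_next[best])
    return best, rating, rating >= threshold

# Hypothetical standardized 2021 listings and their ratings.
features_2021 = np.array([[0.5, -1.0, 0.2], [1.0, 0.8, -0.3], [-0.7, 0.1, 1.2]])
ratings_2021 = np.array([88.0, 98.0, 93.0])
rec = np.array([0.9, 1.0, -0.2])         # hypothetical recommended configuration

best, rating, passed = evaluate_recommendation(rec, features_2021, ratings_2021, threshold=97.0)
```

Repeating this over many simulated hosts and reporting the share that passes would give a single accuracy-like number, though that aggregation is a design choice beyond what the notebook specifies.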

## 0. Loading the Dataset Used by the Model

In [1]:
from IPython.core.interactiveshell import InteractiveShell  # IPython's interactive shell
InteractiveShell.ast_node_interactivity = "all"  # display the value of every expression in a cell, not just the last one
import warnings
warnings.filterwarnings("ignore")

In [2]:
import numpy as np
import pandas as pd

pd.options.display.max_rows = 200
pd.options.display.max_columns = 50

In [3]:
abm =  pd.read_csv('../input/airbnb-new-york-city-with-106-features/airbnbmark1.csv')
abm.head(3)
print('abm.shape',abm.shape)
print('abm.size',abm.size)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,...,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2595,https://www.airbnb.com/rooms/2595,20200212052319,2020-02-12,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...","- Spacious (500+ft²), immaculate and nicely fu...","Beautiful, spacious skylit studio in the heart...",none,Centrally located in the heart of Manhattan ju...,,Apartment is located on 37th Street between 5t...,"Guests have full access to the kitchen, bathro...",I am a Sound Therapy Practitioner and Kundalin...,"Make yourself at home, respect the space and t...",,,https://a0.muscache.com/im/pictures/f0813a11-4...,,2845,https://www.airbnb.com/users/show/2845,Jennifer,2008-09-09,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,...,2020-02-12,48,5,2009-11-21,2019-11-04,94.0,9.0,9.0,10.0,10.0,10.0,9.0,f,,,f,f,strict_14_with_grace_period,t,t,2,2,0,0,0.39
1,3831,https://www.airbnb.com/rooms/3831,20200212052319,2020-02-13,Cozy Entire Floor of Brownstone,Urban retreat: enjoy 500 s.f. floor in 1899 br...,Greetings! We own a double-duplex brownst...,Urban retreat: enjoy 500 s.f. floor in 1899 br...,none,Just the right mix of urban center and local n...,,B52 bus for a 10-minute ride to downtown Brook...,"You will have the private, exclusive use of an...","We'll be around, but since you have the top fl...",Smoking - outside please; pets allowed but ple...,,,https://a0.muscache.com/im/pictures/e49999c2-9...,,4869,https://www.airbnb.com/users/show/4869,LisaRoxanne,2008-12-07,"New York, New York, United States",Laid-back bi-coastal actor/professor/attorney.,...,2020-02-13,307,70,2014-09-30,2020-02-08,90.0,9.0,9.0,10.0,9.0,10.0,9.0,f,,,f,f,moderate,f,f,1,1,0,0,4.69
2,5099,https://www.airbnb.com/rooms/5099,20200212052319,2020-02-12,Large Cozy 1 BR Apartment In Midtown East,My large 1 bedroom apartment has a true New Yo...,I have a large 1 bedroom apartment centrally l...,My large 1 bedroom apartment has a true New Yo...,none,My neighborhood in Midtown East is called Murr...,Read My Full Listing For All Information. New ...,From the apartment is a 10 minute walk to Gran...,I will meet you upon arrival.,I usually check in with guests via text or ema...,• Check-in time is 2PM. • Check-out time is 12...,,,https://a0.muscache.com/im/pictures/24020910/1...,,7322,https://www.airbnb.com/users/show/7322,Chris,2009-02-02,"New York, New York, United States","I'm an artist, writer, traveler, and a native ...",...,2020-02-12,78,8,2009-04-20,2019-10-13,90.0,10.0,9.0,10.0,10.0,10.0,9.0,f,,,f,f,moderate,t,t,1,1,0,0,0.59


abm.shape (153254, 106)
abm.size 16244924


## 1. Data Cleaning

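We keep only the following columns:  
information the host can enter (location, room type, capacity, number of beds, etc.),  
information usable as an evaluation metric (rating, number of reviews),  
and the information that is ultimately recommended (price, amenities).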

In [4]:
abm1 = abm[['neighbourhood_group_cleansed', 'property_type', 'room_type'
           ,'accommodates', 'bathrooms', 'bedrooms', 'beds'
           , 'price','review_scores_rating', 'number_of_reviews', 'amenities']]

In [5]:
abm1.shape
abm1.head(3)

(153254, 11)

Unnamed: 0,neighbourhood_group_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities
0,Manhattan,Apartment,Entire home/apt,1,1.0,0.0,1.0,$225.00,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park..."
1,Brooklyn,Guest suite,Entire home/apt,3,1.0,1.0,4.0,$89.00,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."
2,Manhattan,Apartment,Entire home/apt,2,1.0,1.0,1.0,$200.00,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."


In [6]:
abm1 = abm1.drop_duplicates()  # remove duplicate rows
print('abm1.shape after dropping duplicate rows: ',abm1.shape)
print('abm1.size:  ',abm1.size)
print('DataTypes wise size: \n', abm1.dtypes.value_counts())

abm1.shape after dropping duplicate rows:  (93309, 11)
abm1.size:   1026399
DataTypes wise size: 
 object     5
float64    4
int64      2
dtype: int64


### Missing-Value Check and Data Cleaning

In [7]:
abm1.replace(' ', np.nan, inplace=True)  # inplace=True modifies the original DataFrame instead of returning a new one
abm1.isnull().sum()

neighbourhood_group_cleansed        0
property_type                       0
room_type                           0
accommodates                        0
bathrooms                          67
bedrooms                          143
beds                              735
price                               0
review_scores_rating            16377
number_of_reviews                   0
amenities                           0
dtype: int64

In [8]:
abm1 = abm1.dropna(subset=['bathrooms', 'bedrooms', 'beds', 'review_scores_rating'], how='any', axis=0)
abm1.isnull().sum()

neighbourhood_group_cleansed    0
property_type                   0
room_type                       0
accommodates                    0
bathrooms                       0
bedrooms                        0
beds                            0
price                           0
review_scores_rating            0
number_of_reviews               0
amenities                       0
dtype: int64

In [9]:
print('abm1.shape after dropping nan rows: ',abm1.shape)

abm1.shape after dropping nan rows:  (76604, 11)


In [10]:
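def clean_data(df):  # strip the $ sign and the thousands separator from price
    for i in ['price',]:
        # regex=False treats '$' as a literal character rather than the regex end-of-string anchor
        df[i]=df[i].str.replace('$','', regex=False).str.replace(',', '', regex=False).astype(float)
    
    df.replace('', np.nan, inplace=True)
    
    return df.head(2)
clean_data(abm1)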

Unnamed: 0,neighbourhood_group_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities
0,Manhattan,Apartment,Entire home/apt,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park..."
1,Brooklyn,Guest suite,Entire home/apt,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."


### Removing Listings with Fewer than 10 Reviews for Accuracy
Listings with only a handful of reviews and abnormally high ratings are noise, so we drop them.  

In [11]:
indexNames = abm1[abm1['number_of_reviews'] < 10].index
abm1.drop(indexNames , inplace=True)
abm1.shape

(46537, 11)

### One-Hot Encoding the Neighbourhood

In [12]:
abm1['neighbourhood_group_cleansed'].value_counts()

Brooklyn         19226
Manhattan        18332
Queens            6976
Bronx             1479
Staten Island      524
Name: neighbourhood_group_cleansed, dtype: int64

In [13]:
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Brooklyn'), 'Brooklyn'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Manhattan'), 'Manhattan'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Queens'), 'Queens'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Bronx'), 'Bronx'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Staten Island'), 'Staten Island'] = 1

abm1 = abm1.fillna(0)

abm1.drop('neighbourhood_group_cleansed', axis = 1, inplace=True)
abm1.isnull().sum()
abm1.head(5)

property_type           0
room_type               0
accommodates            0
bathrooms               0
bedrooms                0
beds                    0
price                   0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
dtype: int64

Unnamed: 0,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island
0,Apartment,Entire home/apt,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0
1,Guest suite,Entire home/apt,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0
2,Apartment,Entire home/apt,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0
4,Apartment,Private room,2,1.0,1.0,1.0,79.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0
5,Apartment,Private room,1,1.0,1.0,1.0,79.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0


### One-Hot Encoding the Property Type

In [14]:
# Which values does property_type take?
abm1['property_type'].value_counts()

Apartment             34161
House                  5182
Townhouse              2427
Loft                   1558
Condominium            1486
Guest suite             717
Hotel                   223
Boutique hotel          179
Serviced apartment      146
Hostel                   93
Guesthouse               84
Bed and breakfast        46
Bungalow                 45
Villa                    39
Other                    38
Tiny house               34
Camper/RV                26
Cottage                  18
Resort                   11
Boat                     10
Earth house               4
Aparthotel                3
Castle                    2
Barn                      2
Houseboat                 2
Cabin                     1
Name: property_type, dtype: int64

In [15]:
Mod_prop_type = abm1['property_type'].value_counts().index[5:].tolist()  # every type ranked below the top 5

def change_prop_type(label):
    if label in Mod_prop_type:
        label='Other'
    return label

In [16]:
abm1.loc[:,'property_type'] = abm1.loc[:,'property_type'].apply(change_prop_type)

In [17]:
abm1['property_type'].value_counts() # property types ranked below the top 5 are grouped into Other

Apartment      34161
House           5182
Townhouse       2427
Other           1723
Loft            1558
Condominium     1486
Name: property_type, dtype: int64

In [18]:
# Exact matches instead of str.contains, so that one label can never match as a substring of another
abm1.loc[abm1['property_type'] == 'Apartment', 'Apartment'] = 1
abm1.loc[abm1['property_type'] == 'House', 'House'] = 1
abm1.loc[abm1['property_type'] == 'Other', 'Other'] = 1
abm1.loc[abm1['property_type'] == 'Townhouse', 'Townhouse'] = 1
abm1.loc[abm1['property_type'] == 'Condominium', 'Condominium'] = 1
abm1.loc[abm1['property_type'] == 'Loft', 'Loft'] = 1
abm1 = abm1.fillna(0)

abm1.isnull().sum()
abm1.head(5)

property_type           0
room_type               0
accommodates            0
bathrooms               0
bedrooms                0
beds                    0
price                   0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
Apartment               0
House                   0
Other                   0
Townhouse               0
Condominium             0
Loft                    0
dtype: int64

Unnamed: 0,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft
0,Apartment,Entire home/apt,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Other,Entire home/apt,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,Apartment,Entire home/apt,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,Apartment,Private room,2,1.0,1.0,1.0,79.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,Apartment,Private room,1,1.0,1.0,1.0,79.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [19]:
abm1.drop('property_type', axis = 1, inplace=True)
abm1.head(3)

Unnamed: 0,room_type,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft
0,Entire home/apt,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Entire home/apt,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,Entire home/apt,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### One-Hot Encoding the Room Type

In [20]:
abm1['room_type'].value_counts()

Entire home/apt    24902
Private room       20394
Shared room          875
Hotel room           366
Name: room_type, dtype: int64

In [21]:
abm1.loc[abm1['room_type'].str.contains('Entire home/apt'), 'Entire home/apt'] = 1
abm1.loc[abm1['room_type'].str.contains('Private room'), 'Private room'] = 1
abm1.loc[abm1['room_type'].str.contains('Shared room'), 'Shared room'] = 1
abm1.loc[abm1['room_type'].str.contains('Hotel room'), 'Hotel room'] = 1
abm1 = abm1.fillna(0)

abm1.drop('room_type', axis = 1, inplace=True)
abm1.isnull().sum()
abm1.head(5)

accommodates            0
bathrooms               0
bedrooms                0
beds                    0
price                   0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
Apartment               0
House                   0
Other                   0
Townhouse               0
Condominium             0
Loft                    0
Entire home/apt         0
Private room            0
Shared room             0
Hotel room              0
dtype: int64

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room
0,1,1.0,0.0,1.0,225.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,3,1.0,1.0,4.0,89.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2,1.0,1.0,1.0,200.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,2,1.0,1.0,1.0,79.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,1,1.0,1.0,1.0,79.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
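
The hand-written encodings above could also be produced with `pandas.get_dummies`. The following is a minimal sketch on a made-up frame; the `toy` data is an assumption, and `prefix=""` with `prefix_sep=""` is used so that the new columns are named after the category values, as in the cells above.

```python
import pandas as pd

# Hypothetical miniature of the listings table.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "beds": [1.0, 2.0, 1.0, 1.0],
})

# One 0/1 column per category; the original room_type column is dropped automatically.
encoded = pd.get_dummies(toy, columns=["room_type"], prefix="", prefix_sep="", dtype=float)
```

Note that `get_dummies` orders the new columns alphabetically by category, so the column order can differ from the manual approach.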


## 2. Content-Based Filtering Recommender System

### Building a New Matrix for the Similarity Computation

In [22]:
abm2 = abm1.drop(['review_scores_rating', 'number_of_reviews', 'amenities'], axis = 1, inplace=False)
abm2.head(5)

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room
0,1,1.0,0.0,1.0,225.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,3,1.0,1.0,4.0,89.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2,1.0,1.0,1.0,200.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,2,1.0,1.0,1.0,79.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,1,1.0,1.0,1.0,79.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [23]:
abm2 = (abm2 - abm2.mean())/abm2.std()
abm2.head(5)

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room
0,-1.051304,-0.342984,-1.567475,-0.56758,0.332042,-0.839017,1.240376,-0.419918,-0.181173,-0.106714,0.601894,-0.353981,-0.196079,-0.234564,-0.181615,-0.186112,0.932087,-0.88322,-0.138427,-0.089033
1,-0.082309,-0.342984,-0.295538,1.852492,-0.223321,1.191845,-0.80619,-0.419918,-0.181173,-0.106714,-1.661386,-0.353981,5.099875,-0.234564,-0.181615,-0.186112,0.932087,-0.88322,-0.138427,-0.089033
2,-0.566807,-0.342984,-0.295538,-0.56758,0.229953,-0.839017,1.240376,-0.419918,-0.181173,-0.106714,0.601894,-0.353981,-0.196079,-0.234564,-0.181615,-0.186112,0.932087,-0.88322,-0.138427,-0.089033
4,-0.566807,-0.342984,-0.295538,-0.56758,-0.264156,-0.839017,1.240376,-0.419918,-0.181173,-0.106714,0.601894,-0.353981,-0.196079,-0.234564,-0.181615,-0.186112,-1.072838,1.132197,-0.138427,-0.089033
5,-1.051304,-0.342984,-0.295538,-0.56758,-0.264156,-0.839017,1.240376,-0.419918,-0.181173,-0.106714,0.601894,-0.353981,-0.196079,-0.234564,-0.181615,-0.186112,-1.072838,1.132197,-0.138427,-0.089033
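
The z-score standardization above only helps if a new query is scaled with the same column means and standard deviations as the dataset. A small sketch of that idea on made-up numbers follows; the `raw` frame and the `query` values are assumptions. One detail worth keeping in mind is that pandas' `std()` uses the sample standard deviation (`ddof=1`), whereas NumPy's default is `ddof=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical raw feature table.
raw = pd.DataFrame({"beds": [1.0, 2.0, 3.0], "price": [100.0, 200.0, 300.0]})

# Statistics are taken from the raw data once and reused for every query.
mean, std = raw.mean(), raw.std()        # pandas: sample std (ddof=1)
scaled = (raw - mean) / std

# A new query must be scaled with the SAME statistics, not with its own.
query = pd.Series({"beds": 2.0, "price": 300.0})
query_scaled = (query - mean) / std
```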


### Computing the Similarity
We use cosine similarity.

In [24]:
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
  return dot(A, B)/(norm(A)*norm(B))

In [25]:
abm2.columns  # values that must be supplied by the user

Index(['accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'Brooklyn',
       'Manhattan', 'Queens', 'Bronx', 'Staten Island', 'Apartment', 'House',
       'Other', 'Townhouse', 'Condominium', 'Loft', 'Entire home/apt',
       'Private room', 'Shared room', 'Hotel room'],
      dtype='object')

In [26]:
# Example data
# Hard-coded for now; it should eventually come from user input.
test_data = np.array([2,1,2,4,200,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0])
test_data = test_data.astype(float)

def norm_input(test):
    # Standardize the query in place, using the raw (pre-standardization) column statistics from abm1
    for idx, item in enumerate(abm2.columns):
        test[idx] = (test[idx] - abm1[item].mean())/abm1[item].std()

In [27]:
sim = []
norm_input(test_data)

for item in abm2.to_numpy():
    sim.append(cos_sim(test_data, item))
    
sim[:10]

[-0.13565189265993036,
 0.29321153481476664,
 -0.04123322629015947,
 -0.5179680881678635,
 -0.4562102212500541,
 0.12553941098724,
 -0.5170304170248761,
 -0.260618062278595,
 -0.4559192017502526,
 -0.013250040579322206]
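
The Python loop above can also be written as a single matrix operation, which tends to be much faster on tens of thousands of rows. A sketch under stated assumptions: `cos_sim_matrix` is a hypothetical helper, the toy rows are made up, and no row may be all zeros (that would divide by zero).

```python
import numpy as np

def cos_sim_matrix(matrix, query):
    """Cosine similarity between `query` and every row of `matrix`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Dot product of each row with the query, divided by the product of the norms.
    return matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))

rows = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0], [1.0, 1.0]])
sims = cos_sim_matrix(rows, np.array([1.0, 0.0]))
```

With the notebook's variables this would correspond to something like `cos_sim_matrix(abm2.to_numpy(), test_data)`.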

In [28]:
sim_sorted_byindex = sorted(range(len(sim)), key=lambda k: sim[k])
sim_sorted_byindex = sim_sorted_byindex[-30:]
print(sim_sorted_byindex)

sim.sort()
sim = sim[-30:]
print(sim)
# Extract the 30 listings with the highest similarity

# Query: 2,1,2,4,200,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0
abm1.to_numpy()[37465] # The most similar listing; a high rating is not guaranteed, though

[15198, 42430, 10587, 38461, 6782, 25027, 1652, 8732, 26260, 36277, 39220, 17594, 43882, 3245, 22825, 36871, 8441, 17306, 26065, 32721, 39092, 43700, 37021, 9715, 39670, 3578, 23029, 4640, 23682, 37465]
[0.9477340551438512, 0.9477340551438512, 0.9478242030251346, 0.9478242030251346, 0.9497367778805557, 0.9497367778805557, 0.9501573847987658, 0.9501573847987658, 0.9501573847987658, 0.9501573847987658, 0.9501573847987658, 0.9503825123581845, 0.9503825123581845, 0.9515128248257533, 0.9515128248257533, 0.9515128248257533, 0.952220311677238, 0.952220311677238, 0.952220311677238, 0.952220311677238, 0.952220311677238, 0.952220311677238, 0.9552002897261638, 0.9557991740627566, 0.9557991740627566, 0.957911571430615, 0.957911571430615, 0.9581575201702647, 0.9581575201702647, 0.9581575201702647]


array([3, 1.0, 2.0, 3.0, 175.0, 90.0, 173,
       '{TV,"Cable TV",Internet,Wifi,"Air conditioning",Kitchen,"Free street parking",Heating,Washer,"Smoke detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace",Microwave,"Coffee maker",Refrigerator,"Dishes and silverware","Cooking basics"}',
       1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,
       0.0, 0.0], dtype=object)
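
The indices printed above are row positions, not DataFrame index labels; after rows have been dropped the two no longer coincide. One way to keep similarities attached to their labels is a pandas `Series` with `nlargest`. The following is a sketch on made-up data, where the `listings` frame and the similarity values are assumptions.

```python
import pandas as pd

# Hypothetical cleaned table: label 3 is missing because that row was dropped.
listings = pd.DataFrame(
    {"price": [225.0, 89.0, 200.0, 79.0],
     "review_scores_rating": [94.0, 90.0, 98.0, 84.0]},
    index=[0, 1, 2, 4],
)

# Similarities share the table's index, so each score stays tied to its row label.
sims = pd.Series([0.10, 0.95, 0.40, 0.80], index=listings.index)

top = sims.nlargest(2)                 # highest similarity first
top_rows = listings.loc[top.index]     # label-based lookup, no positional bookkeeping
```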

### Criterion for a "High" Rating
We now have the Top 30 listings by similarity,  
but we still need to select those with meaningfully high ratings.

In [29]:
abm1['review_scores_rating'].mean()
abm1['review_scores_rating'].quantile(q=0.7, interpolation='nearest')
# The 70th percentile of the ratings, i.e. the cutoff for the top 30%, is 97.

94.48223993811376

97.0

### Keeping Only the Top 30 Listings Rated 97 or Higher

In [30]:
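selected = {}
ratings = abm1['review_scores_rating'].to_numpy()  # look up ratings by column name instead of positional column 5

for idx, s in zip(sim_sorted_byindex, sim):
    if ratings[idx] >= 97.0:
        selected[idx] = s
        
selected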

{10587: 0.9478242030251346,
 38461: 0.9478242030251346,
 6782: 0.9497367778805557,
 25027: 0.9497367778805557,
 1652: 0.9501573847987658,
 8732: 0.9501573847987658,
 26260: 0.9501573847987658,
 36277: 0.9501573847987658,
 39220: 0.9501573847987658,
 17594: 0.9503825123581845,
 43882: 0.9503825123581845,
 3245: 0.9515128248257533,
 22825: 0.9515128248257533,
 36871: 0.9515128248257533,
 8441: 0.952220311677238,
 26065: 0.952220311677238,
 39092: 0.952220311677238,
 37021: 0.9552002897261638,
 9715: 0.9557991740627566,
 39670: 0.9557991740627566,
 3578: 0.957911571430615,
 23029: 0.957911571430615}

## 3. Final Recommended Price

In [31]:
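def recommend_price(selected):
    # Similarity-weighted average price of the selected listings
    prices = abm1['price'].to_numpy()
    numerator = 0
    denominator = 0
    for key in selected.keys():
        numerator += prices[key] * selected[key]
        denominator += selected[key]
    return numerator/denominator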

In [32]:
recommend_price(selected)

156.70574373975955
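
Written out, the value returned by `recommend_price` is a similarity-weighted mean. The symbols are introduced here only for illustration: $S$ is the set of selected listings, $s_i$ is the cosine similarity of listing $i$ to the query, and $p_i$ is its price.

$$
\hat{p} = \frac{\sum_{i \in S} s_i \, p_i}{\sum_{i \in S} s_i}
$$

Because every selected similarity in this run lies between roughly 0.948 and 0.958, the weights are nearly equal and the result stays close to the plain average of the selected prices.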