### Airbnb Host Recommender Using Content-Based Filtering

#### Big Data Group 17. 서현종, 김상현, 김승엽

### Scenario

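> Ernest, a new Airbnb host, wants to list his home in New York as a place to stay.  
However, Ernest knows only the facts about the home he owns: its location, the number of rooms, the number of beds, the bed type, and so on.  
His main concerns are what price and what security deposit to set once the home is listed on Airbnb, and which amenities (household items and the like)
to provide.  
So Ernest turns to the Content-Based Filtering recommender system built by CAUSWE Big Data Group 17
for help with setting an appropriate price and choosing amenities.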

### Program Overview

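Airbnb Open Data is a dataset that defines 106 features for each host listing.  
A typical recommender system would use collaborative filtering,  
but this dataset has no user information (only the average ratings of each listing are available),  
so a recommender based on content information seemed more suitable.  
  
Using Content-Based Filtering, this recommender determines which listings in the data are most similar  
to the listing information entered by the host (cosine similarity), selects the Top N, and then, among those Top N,  
keeps only the highly rated listings in order to recommend an appropriate price and amenities.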
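
The pipeline described above can be sketched on toy data as follows. This is only an illustration: the feature values, the ratings, and the helper names `cosine_similarity` and `recommend_from_top_n` are made up here and are not part of the notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_from_top_n(query, features, ratings, top_n=3):
    """Row position of the best-rated listing among the top_n most similar ones."""
    sims = np.array([cosine_similarity(query, row) for row in features])
    top = np.argsort(sims)[-top_n:]            # positions of the top_n most similar rows
    return int(top[np.argmax(ratings[top])])   # best rating within that shortlist

# Toy data (hypothetical): columns are [beds, bathrooms, is_brooklyn]
features = np.array([
    [2.0, 1.0, 1.0],
    [2.0, 1.0, 0.0],
    [1.0, 4.0, 0.0],
    [1.0, 1.0, 1.0],
])
ratings = np.array([80.0, 95.0, 99.0, 90.0])
query = np.array([2.0, 1.0, 1.0])

best = recommend_from_top_n(query, features, ratings, top_n=2)
print(best)
```

With `top_n=2` the shortlist contains only the two most similar rows, so the listing with the highest overall rating is ignored if it is not similar enough to the query.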

In [1]:
from IPython.core.interactiveshell import InteractiveShell  # IPython's interactive shell (interpreter)
InteractiveShell.ast_node_interactivity = "all"  # display the value of every expression in a cell, not only the last one
import warnings
warnings.filterwarnings("ignore")

In [2]:
import numpy as np
import pandas as pd

pd.options.display.max_rows = 200
pd.options.display.max_columns = 50

In [3]:
abm =  pd.read_csv('../input/airbnb-new-york-city-with-106-features/airbnbmark1.csv')
abm.head(3)
print('abm.shape',abm.shape)
print('abm.size',abm.size)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,...,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2595,https://www.airbnb.com/rooms/2595,20200212052319,2020-02-12,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...","- Spacious (500+ft²), immaculate and nicely fu...","Beautiful, spacious skylit studio in the heart...",none,Centrally located in the heart of Manhattan ju...,,Apartment is located on 37th Street between 5t...,"Guests have full access to the kitchen, bathro...",I am a Sound Therapy Practitioner and Kundalin...,"Make yourself at home, respect the space and t...",,,https://a0.muscache.com/im/pictures/f0813a11-4...,,2845,https://www.airbnb.com/users/show/2845,Jennifer,2008-09-09,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,...,2020-02-12,48,5,2009-11-21,2019-11-04,94.0,9.0,9.0,10.0,10.0,10.0,9.0,f,,,f,f,strict_14_with_grace_period,t,t,2,2,0,0,0.39
1,3831,https://www.airbnb.com/rooms/3831,20200212052319,2020-02-13,Cozy Entire Floor of Brownstone,Urban retreat: enjoy 500 s.f. floor in 1899 br...,Greetings! We own a double-duplex brownst...,Urban retreat: enjoy 500 s.f. floor in 1899 br...,none,Just the right mix of urban center and local n...,,B52 bus for a 10-minute ride to downtown Brook...,"You will have the private, exclusive use of an...","We'll be around, but since you have the top fl...",Smoking - outside please; pets allowed but ple...,,,https://a0.muscache.com/im/pictures/e49999c2-9...,,4869,https://www.airbnb.com/users/show/4869,LisaRoxanne,2008-12-07,"New York, New York, United States",Laid-back bi-coastal actor/professor/attorney.,...,2020-02-13,307,70,2014-09-30,2020-02-08,90.0,9.0,9.0,10.0,9.0,10.0,9.0,f,,,f,f,moderate,f,f,1,1,0,0,4.69
2,5099,https://www.airbnb.com/rooms/5099,20200212052319,2020-02-12,Large Cozy 1 BR Apartment In Midtown East,My large 1 bedroom apartment has a true New Yo...,I have a large 1 bedroom apartment centrally l...,My large 1 bedroom apartment has a true New Yo...,none,My neighborhood in Midtown East is called Murr...,Read My Full Listing For All Information. New ...,From the apartment is a 10 minute walk to Gran...,I will meet you upon arrival.,I usually check in with guests via text or ema...,• Check-in time is 2PM. • Check-out time is 12...,,,https://a0.muscache.com/im/pictures/24020910/1...,,7322,https://www.airbnb.com/users/show/7322,Chris,2009-02-02,"New York, New York, United States","I'm an artist, writer, traveler, and a native ...",...,2020-02-12,78,8,2009-04-20,2019-10-13,90.0,10.0,9.0,10.0,10.0,10.0,9.0,f,,,f,f,moderate,t,t,1,1,0,0,0.59


abm.shape (153254, 106)
abm.size 16244924


## Keeping only the useful columns

A justification for why these columns count as useful is still needed.

In [4]:
abm1 = abm[['neighbourhood_group_cleansed', 'property_type', 'room_type'
           ,'accommodates', 'bathrooms', 'bedrooms', 'beds'
           ,'bed_type', 'price', 'security_deposit'
            ,'review_scores_rating', 'number_of_reviews', 'amenities']].copy()  # copy, so later in-place edits do not act on a view of abm

In [5]:
abm1.shape
abm1.head(3)

(153254, 13)

Unnamed: 0,neighbourhood_group_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,price,security_deposit,review_scores_rating,number_of_reviews,amenities
0,Manhattan,Apartment,Entire home/apt,1,1.0,0.0,1.0,Real Bed,$225.00,$350.00,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park..."
1,Brooklyn,Guest suite,Entire home/apt,3,1.0,1.0,4.0,Real Bed,$89.00,$500.00,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."
2,Manhattan,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,$200.00,$300.00,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."


In [6]:
abm1 = abm1.drop_duplicates() # drop duplicated rows
print('abm1.shape after dropping duplicate rows: ',abm1.shape)
print('abm1.size:  ',abm1.size)
print('DataTypes wise size: \n', abm1.dtypes.value_counts())

abm1.shape after dropping duplicate rows:  (93627, 13)
abm1.size:   1217151
DataTypes wise size: 
 object     7
float64    4
int64      2
dtype: int64


## Checking for missing values, data cleaning

In [7]:
abm1.replace((' '),np.nan,inplace=True) # with inplace=True the original is modified instead of a new object being returned
abm1.isnull().sum()

neighbourhood_group_cleansed        0
property_type                       0
room_type                           0
accommodates                        0
bathrooms                          67
bedrooms                          144
beds                              737
bed_type                            0
price                               0
security_deposit                27798
review_scores_rating            16506
number_of_reviews                   0
amenities                           0
dtype: int64

In [8]:
abm1 = abm1.dropna(subset=['neighbourhood_group_cleansed','bathrooms', 'bedrooms', 'beds', 'review_scores_rating'], how='any', axis=0)
abm1.isnull().sum()

neighbourhood_group_cleansed        0
property_type                       0
room_type                           0
accommodates                        0
bathrooms                           0
bedrooms                            0
beds                                0
bed_type                            0
price                               0
security_deposit                19453
review_scores_rating                0
number_of_reviews                   0
amenities                           0
dtype: int64

In [9]:
abm1 = abm1.fillna(0) # treat a missing security deposit as 0
abm1.isnull().sum()

neighbourhood_group_cleansed    0
property_type                   0
room_type                       0
accommodates                    0
bathrooms                       0
bedrooms                        0
beds                            0
bed_type                        0
price                           0
security_deposit                0
review_scores_rating            0
number_of_reviews               0
amenities                       0
dtype: int64
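
The two steps above (drop rows that miss a required feature, then fill the remaining gaps) can be sketched on a toy frame. The values below are made up, and filling only the `security_deposit` column through a dict is an assumption of this sketch; the notebook itself calls `fillna(0)` on the whole frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with the same kinds of gaps as the real data
toy = pd.DataFrame({
    "beds": [1.0, np.nan, 2.0],
    "security_deposit": ["$350.00", "$500.00", np.nan],
    "review_scores_rating": [94.0, 90.0, 98.0],
})

# Drop rows that miss a required feature ...
cleaned = toy.dropna(subset=["beds", "review_scores_rating"], how="any")
# ... then fill only the deposit column, keeping it a string like the other entries
cleaned = cleaned.fillna({"security_deposit": "$0.00"})
print(cleaned)
```

Restricting the fill to one column keeps a stray missing value elsewhere from being turned into 0 silently.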

In [10]:
print('abm1.shape after dropping nan rows: ',abm1.shape)

abm1.shape after dropping nan rows:  (76792, 13)


In [11]:
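def clean_data(df):  # strip the $ sign and the thousands separators
    for i in ['price','security_deposit']:
        # astype(str) turns the 0 filled in above into '0'; regex=False treats '$' literally
        df[i]=df[i].astype(str).str.replace('$','', regex=False).str.replace(',', '', regex=False).astype(float)
    
    df.replace('', np.nan, inplace=True)
    
    return df.head(2)
clean_data(abm1)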

Unnamed: 0,neighbourhood_group_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,price,security_deposit,review_scores_rating,number_of_reviews,amenities
0,Manhattan,Apartment,Entire home/apt,1,1.0,0.0,1.0,Real Bed,225.0,350.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park..."
1,Brooklyn,Guest suite,Entire home/apt,3,1.0,1.0,4.0,Real Bed,89.0,500.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning..."
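
The currency clean-up can also be sketched as a standalone helper. `parse_money` is a hypothetical name used only here. Passing `regex=False` makes pandas treat `$` as a literal character rather than as the end-of-string anchor, and converting to `str` first means a numeric 0 left by an earlier `fillna(0)` is parsed like any other value.

```python
import pandas as pd

def parse_money(series):
    """Convert strings such as '$1,250.00' to floats; missing values become 0.0."""
    cleaned = (
        series.fillna("0").astype(str)
              .str.replace("$", "", regex=False)  # literal '$'
              .str.replace(",", "", regex=False)  # drop thousands separators
    )
    return cleaned.astype(float)

prices = pd.Series(["$225.00", "$1,250.00", None])
parsed = parse_money(prices)
print(parsed.tolist())
```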


## Dropping hosts with fewer than 10 reviews, for accuracy

In [12]:
indexNames = abm1[abm1['number_of_reviews'] < 10].index
abm1.drop(indexNames , inplace=True)
abm1.shape

(46604, 13)

## One-hot encoding the neighbourhood

In [13]:
abm1['neighbourhood_group_cleansed'].value_counts()

Brooklyn         19248
Manhattan        18361
Queens            6989
Bronx             1482
Staten Island      524
Name: neighbourhood_group_cleansed, dtype: int64

In [14]:
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Brooklyn'), 'Brooklyn'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Manhattan'), 'Manhattan'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Queens'), 'Queens'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Bronx'), 'Bronx'] = 1
abm1.loc[abm1['neighbourhood_group_cleansed'].str.contains('Staten Island'), 'Staten Island'] = 1

abm1 = abm1.fillna(0)

abm1.drop('neighbourhood_group_cleansed', axis = 1, inplace=True)
abm1.isnull().sum()
abm1.head(5)

property_type           0
room_type               0
accommodates            0
bathrooms               0
bedrooms                0
beds                    0
bed_type                0
price                   0
security_deposit        0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
dtype: int64

Unnamed: 0,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,price,security_deposit,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island
0,Apartment,Entire home/apt,1,1.0,0.0,1.0,Real Bed,225.0,350.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0
1,Guest suite,Entire home/apt,3,1.0,1.0,4.0,Real Bed,89.0,500.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0
2,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,200.0,300.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0
4,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,79.0,0.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0
5,Apartment,Private room,1,1.0,1.0,1.0,Real Bed,79.0,0.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0
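
The same encoding can be sketched with `pd.get_dummies`, which builds one indicator column per distinct value. The toy frame below is made up. Note that `get_dummies` orders the new columns alphabetically, which differs from the column order produced by the cell above, so a hand-written query vector would have to follow the new order.

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
    "beds": [1.0, 4.0, 1.0, 2.0],
})

# One 0.0/1.0 column per borough; astype(float) matches the notebook's float flags
dummies = pd.get_dummies(toy["neighbourhood_group_cleansed"]).astype(float)
encoded = pd.concat([toy.drop(columns="neighbourhood_group_cleansed"), dummies], axis=1)
print(encoded)
```

Unlike `str.contains`, this matches each category exactly, so one label can never be mistaken for a substring of another.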


## One-hot encoding the property type

In [15]:
# which values does property_type take?
abm1['property_type'].value_counts()

Apartment             34206
House                  5192
Townhouse              2430
Loft                   1561
Condominium            1489
Guest suite             717
Hotel                   223
Boutique hotel          179
Serviced apartment      146
Hostel                   93
Guesthouse               85
Bed and breakfast        47
Bungalow                 45
Villa                    39
Other                    38
Tiny house               34
Camper/RV                26
Cottage                  18
Resort                   12
Boat                     10
Earth house               4
Aparthotel                3
Barn                      2
Castle                    2
Houseboat                 2
Cabin                     1
Name: property_type, dtype: int64

In [16]:
Mod_prop_type=abm1['property_type'].value_counts()[5:len(abm1['property_type'].value_counts())].index.tolist()

def change_prop_type(label):
    if label in Mod_prop_type:
        label='Other'
    return label

In [17]:
abm1.loc[:,'property_type'] = abm1.loc[:,'property_type'].apply(change_prop_type)

In [18]:
abm1['property_type'].value_counts() # property types ranked below the top 5 are grouped into Other

Apartment      34206
House           5192
Townhouse       2430
Other           1726
Loft            1561
Condominium     1489
Name: property_type, dtype: int64
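
Grouping the rare categories can be sketched as one reusable helper. `group_rare` and its parameters are hypothetical names for this sketch; it keeps the `keep` most frequent labels and maps everything else to `Other`, which is the behaviour of the cells above.

```python
import pandas as pd

def group_rare(series, keep=5, other_label="Other"):
    """Keep the `keep` most frequent labels and map every other label to other_label."""
    top = series.value_counts().index[:keep]
    return series.where(series.isin(top), other_label)

# Hypothetical toy labels
types = pd.Series(["Apartment"] * 4 + ["House"] * 3 + ["Loft"] * 2 + ["Boat", "Castle"])
grouped = group_rare(types, keep=3)
print(grouped.value_counts().to_dict())
```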

In [19]:
abm1.loc[abm1['property_type'].str.contains('Apartment'), 'Apartment'] = 1
abm1.loc[abm1['property_type'].str.contains('House'), 'House'] = 1
abm1.loc[abm1['property_type'].str.contains('Other'), 'Other'] = 1
abm1.loc[abm1['property_type'].str.contains('Townhouse'), 'Townhouse'] = 1
abm1.loc[abm1['property_type'].str.contains('Condominium'), 'Condominium'] = 1
abm1.loc[abm1['property_type'].str.contains('Loft'), 'Loft'] = 1
abm1 = abm1.fillna(0)

abm1.isnull().sum()
abm1.head(5)

property_type           0
room_type               0
accommodates            0
bathrooms               0
bedrooms                0
beds                    0
bed_type                0
price                   0
security_deposit        0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
Apartment               0
House                   0
Other                   0
Townhouse               0
Condominium             0
Loft                    0
dtype: int64

Unnamed: 0,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,price,security_deposit,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft
0,Apartment,Entire home/apt,1,1.0,0.0,1.0,Real Bed,225.0,350.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Other,Entire home/apt,3,1.0,1.0,4.0,Real Bed,89.0,500.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,200.0,300.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,79.0,0.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,Apartment,Private room,1,1.0,1.0,1.0,Real Bed,79.0,0.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [20]:
abm1.drop('property_type', axis = 1, inplace=True)
abm1.head(3)

Unnamed: 0,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,price,security_deposit,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft
0,Entire home/apt,1,1.0,0.0,1.0,Real Bed,225.0,350.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Entire home/apt,3,1.0,1.0,4.0,Real Bed,89.0,500.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,Entire home/apt,2,1.0,1.0,1.0,Real Bed,200.0,300.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


## One-hot encoding the room type

In [21]:
abm1['room_type'].value_counts()

Entire home/apt    24940
Private room       20420
Shared room          878
Hotel room           366
Name: room_type, dtype: int64

In [22]:
abm1.loc[abm1['room_type'].str.contains('Entire home/apt'), 'Entire home/apt'] = 1
abm1.loc[abm1['room_type'].str.contains('Private room'), 'Private room'] = 1
abm1.loc[abm1['room_type'].str.contains('Shared room'), 'Shared room'] = 1
abm1.loc[abm1['room_type'].str.contains('Hotel room'), 'Hotel room'] = 1
abm1 = abm1.fillna(0)

abm1.drop('room_type', axis = 1, inplace=True)
abm1.isnull().sum()
abm1.head(5)

accommodates            0
bathrooms               0
bedrooms                0
beds                    0
bed_type                0
price                   0
security_deposit        0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
Apartment               0
House                   0
Other                   0
Townhouse               0
Condominium             0
Loft                    0
Entire home/apt         0
Private room            0
Shared room             0
Hotel room              0
dtype: int64

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,bed_type,price,security_deposit,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room
0,1,1.0,0.0,1.0,Real Bed,225.0,350.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,3,1.0,1.0,4.0,Real Bed,89.0,500.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2,1.0,1.0,1.0,Real Bed,200.0,300.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,2,1.0,1.0,1.0,Real Bed,79.0,0.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,1,1.0,1.0,1.0,Real Bed,79.0,0.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


## One-hot encoding the bed type

In [23]:
abm1['bed_type'].value_counts()

Real Bed         45848
Futon              293
Pull-out Sofa      281
Airbed             120
Couch               62
Name: bed_type, dtype: int64

In [24]:
abm1.loc[abm1['bed_type'].str.contains('Real Bed'), 'Real Bed'] = 1
abm1.loc[abm1['bed_type'].str.contains('Futon'), 'Futon'] = 1
abm1.loc[abm1['bed_type'].str.contains('Pull-out Sofa'), 'Pull-out Sofa'] = 1
abm1.loc[abm1['bed_type'].str.contains('Airbed'), 'Airbed'] = 1
abm1.loc[abm1['bed_type'].str.contains('Couch'), 'Couch'] = 1
abm1 = abm1.fillna(0)

abm1.drop('bed_type', axis = 1, inplace=True)
abm1.isnull().sum()
abm1.head(5)

accommodates            0
bathrooms               0
bedrooms                0
beds                    0
price                   0
security_deposit        0
review_scores_rating    0
number_of_reviews       0
amenities               0
Brooklyn                0
Manhattan               0
Queens                  0
Bronx                   0
Staten Island           0
Apartment               0
House                   0
Other                   0
Townhouse               0
Condominium             0
Loft                    0
Entire home/apt         0
Private room            0
Shared room             0
Hotel room              0
Real Bed                0
Futon                   0
Pull-out Sofa           0
Airbed                  0
Couch                   0
dtype: int64

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,security_deposit,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room,Real Bed,Futon,Pull-out Sofa,Airbed,Couch
0,1,1.0,0.0,1.0,225.0,350.0,94.0,48,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,3,1.0,1.0,4.0,89.0,500.0,90.0,307,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,2,1.0,1.0,1.0,200.0,300.0,90.0,78,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,2,1.0,1.0,1.0,79.0,0.0,84.0,463,"{TV,Wifi,""Air conditioning"",""Paid parking off ...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,1,1.0,1.0,1.0,79.0,0.0,98.0,118,"{Internet,Wifi,""Air conditioning"",""Paid parkin...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


# Content-Based Filtering Recommender System

## Building a new matrix for the similarity check

In [25]:
abm2 = abm1.drop(['review_scores_rating', 'number_of_reviews', 'amenities'], axis = 1, inplace=False)
abm2.head(5)

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,security_deposit,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room,Real Bed,Futon,Pull-out Sofa,Airbed,Couch
0,1,1.0,0.0,1.0,225.0,350.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,3,1.0,1.0,4.0,89.0,500.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,2,1.0,1.0,1.0,200.0,300.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,2,1.0,1.0,1.0,79.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,1,1.0,1.0,1.0,79.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [26]:
# normalization
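
The normalization cell is left empty, and the `normalize_data` helper in the next cell is commented out, so the similarity is computed on raw values. Since `price` and `security_deposit` are on a much larger scale than the 0/1 indicator columns, they tend to dominate a raw cosine similarity. A minimal sketch of min-max scaling, assuming a hypothetical helper `min_max_scale` and made-up toy rows:

```python
import numpy as np

def min_max_scale(matrix, query):
    """Scale every column of matrix to [0, 1] and apply the same scaling to query."""
    col_min = matrix.min(axis=0)
    col_range = matrix.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return (matrix - col_min) / col_range, (query - col_min) / col_range

# Toy rows (hypothetical): [beds, price, is_brooklyn]
matrix = np.array([[1.0, 79.0, 0.0],
                   [4.0, 89.0, 1.0],
                   [1.0, 225.0, 0.0]])
query = np.array([2.0, 200.0, 1.0])

scaled_matrix, scaled_query = min_max_scale(matrix, query)
print(scaled_matrix)
print(scaled_query)
```

One caveat of this sketch: a row that equals the column minimum everywhere becomes an all-zero vector after scaling, and cosine similarity is undefined for it, so such rows would need separate handling.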


## Computing the similarity

In [27]:
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
  return dot(A, B)/(norm(A)*norm(B))

#def normalize_data(test):
#    idx = 0
#    for item in abm2.columns:
#        test[idx] = (test[idx] - abm1[item].min())/(abm1[item].max() - abm1[item].min())
#        idx += 1
#    print("normalize test data: ", test)
#    return test

In [28]:
# Example data
# It is hard-coded for now, but it should eventually be entered by the user.
test_data = np.array([2,1,2,4,200,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0])
test_data = test_data.astype(float)

In [29]:
sim = []
#test_data = normalize_data(test_data)

for item in abm2.to_numpy():
    sim.append(cos_sim(test_data, item))
    
sim[:10]

[0.5406708676072626,
 0.17544694294590638,
 0.5546698350997032,
 0.9995796729095607,
 0.9996931262357279,
 0.9998820387929884,
 0.44370659048732747,
 0.40658415392806974,
 0.3912187930399915,
 0.23340151394284064]
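
The per-row loop can also be written as a single matrix product. The sketch below uses a hypothetical helper `cos_sim_all` on a small made-up matrix; it computes the same quantity as `cos_sim`, just for all rows at once.

```python
import numpy as np

def cos_sim_all(query, matrix):
    """Cosine similarity between query and every row of matrix, in one pass."""
    dots = matrix @ query                                            # one dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)   # product of vector lengths
    return dots / norms

matrix = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
query = np.array([1.0, 0.0])
sims = cos_sim_all(query, matrix)
print(sims)
```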

In [30]:
sim_sorted_byindex = sorted(range(len(sim)), key=lambda k: sim[k])
sim_sorted_byindex = sim_sorted_byindex[-10:]
print(sim_sorted_byindex)
# extract the 10 rows with the highest similarity

abm2.to_numpy()[37507] # the most similar host; however, a high rating is not guaranteed

[27252, 39962, 123, 10553, 27486, 3578, 23038, 4640, 23692, 37507]


array([  3.,   1.,   2.,   3., 175.,   0.,   1.,   0.,   0.,   0.,   0.,
         1.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,   1.,
         0.,   0.,   0.,   0.])
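
Selecting the Top N can be sketched with `np.argsort` as well. `top_n_positions` is a hypothetical helper, and the similarity values below are made up; unlike the cell above, it returns the positions with the most similar row first.

```python
import numpy as np

def top_n_positions(sim, n=10):
    """Row positions of the n largest similarities, most similar first."""
    order = np.argsort(np.asarray(sim))  # ascending order of similarity
    return order[-n:][::-1].tolist()

sim = [0.54, 0.17, 0.55, 0.99, 0.44]
print(top_n_positions(sim, n=3))
```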

In [31]:
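def recommend(arr):
    # arr holds row positions sorted by ascending similarity;
    # column 6 of abm1 is review_scores_rating
    temp = abm1.to_numpy()
    ans = arr[0]
    for idx in arr:
        if temp[idx][6] >= temp[ans][6]:  # on ties, the later (more similar) row wins
            ans = idx
    return ans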

In [32]:
# [2,1,2,4,200,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0]
find = recommend(sim_sorted_byindex)
print(find)

27486
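
The value printed here is a row position, not an index label: after the earlier row drops the labels have gaps, which is why position 27486 shows up with label 73136 in the final output. A sketch of the same selection that looks up the rating by column name instead of by the number 6, using a hypothetical helper `recommend_best_rated` and a made-up frame:

```python
import pandas as pd

def recommend_best_rated(df, positions, rating_col="review_scores_rating"):
    """Among the rows at the given positions, return the position with the highest rating."""
    ratings = df[rating_col].to_numpy()
    best = positions[0]
    for pos in positions:
        if ratings[pos] >= ratings[best]:  # ties go to the later position in the list
            best = pos
    return best

# Hypothetical toy frame whose index has gaps, as after dropping rows
toy = pd.DataFrame(
    {"review_scores_rating": [94.0, 90.0, 99.0, 84.0],
     "price": [225.0, 89.0, 200.0, 79.0]},
    index=[0, 1, 4, 7],
)
best = recommend_best_rated(toy, [3, 0, 2])
print(best, toy.index[best])
```

Because `best` is a position, the matching row has to be read with `toy.iloc[best]` rather than `toy.loc[best]`.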


## Final result: which amenities are recommended?

In [33]:
result = abm1[find:find+1]['amenities']

for item in result.to_numpy().tolist():
    print(item)

{TV,Wifi,"Air conditioning",Kitchen,Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50","Private entrance","Window guards","Hot water","Bed linens","Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware","Cooking basics",Oven,Stove,"BBQ grill","Garden or backyard","Luggage dropoff allowed","Wide hallways","Host greets you"}
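
The amenities value is a single string, so it has to be split before it can be compared across listings. The sketch below parses that format with the standard `csv` module and then counts which amenities recur across several listings. Aggregating over more than one listing is an assumption of this sketch, not something the notebook does, and `parse_amenities` and `common_amenities` are hypothetical names.

```python
import csv
from collections import Counter

def parse_amenities(raw):
    """Split a string like '{TV,"Air conditioning",Wifi}' into a list of names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    return next(csv.reader([inner]))  # csv handles the double-quoted entries

def common_amenities(raw_strings, min_count=2):
    """Amenities that appear in at least min_count of the given listings."""
    counts = Counter(a for raw in raw_strings for a in set(parse_amenities(raw)))
    return sorted(a for a, c in counts.items() if c >= min_count)

# Hypothetical amenities strings in the dataset's format
listings = [
    '{TV,Wifi,"Air conditioning",Kitchen}',
    '{Wifi,"Air conditioning",Heating}',
    '{Wifi,Kitchen}',
]
print(parse_amenities(listings[0]))
print(common_amenities(listings, min_count=2))
```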


In [34]:
abm1[find:find+1]
# [2,1,2,4,200,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0]

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,security_deposit,review_scores_rating,number_of_reviews,amenities,Brooklyn,Manhattan,Queens,Bronx,Staten Island,Apartment,House,Other,Townhouse,Condominium,Loft,Entire home/apt,Private room,Shared room,Hotel room,Real Bed,Futon,Pull-out Sofa,Airbed,Couch
73136,2,1.0,2.0,3.0,200.0,0.0,99.0,21,"{TV,Wifi,""Air conditioning"",Kitchen,Heating,""F...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
