## Introduction
The goal of this notebook is to show how to train a PCA model on AWS SageMaker. The training data resides on S3 at `s3://skuchkula-sagemaker-airbnb/`. The objective is to apply Principal Component Analysis (PCA) to Airbnb listings in NYC.

### STEP 1: Create an S3 bucket with `sagemaker` in its name
Having `sagemaker` in the bucket name is not a requirement. However, as per the Amazon documentation:

> Note: 
Amazon SageMaker needs permission to access these buckets. You grant permission with an IAM role, which you create in the next step when you create an Amazon SageMaker notebook instance. This IAM role automatically gets permissions to access any bucket that has sagemaker in the name. It gets these permissions through the AmazonSageMakerFullAccess policy, which Amazon SageMaker attaches to the role. If you add a policy to the role that grants the SageMaker service principal S3FullAccess permission, the name of the bucket does not need to contain sagemaker.
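
The bucket can also be created from code rather than the console. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names `bucket_request`, `covered_by_default_policy`, and `create_bucket` are hypothetical, and the region default is an assumption.

```python
def covered_by_default_policy(bucket_name):
    """True if the name contains 'sagemaker', as the quoted note describes."""
    return "sagemaker" in bucket_name


def bucket_request(bucket_name, region="us-east-1"):
    """Build the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # Outside us-east-1, S3 expects an explicit location constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region="us-east-1"):
    """Create the bucket (needs boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so the helpers above work offline

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**bucket_request(bucket_name, region))
```

Calling `covered_by_default_policy('skuchkula-sagemaker-airbnb')` returns `True` for the bucket used in this notebook.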

### STEP 2: Create an Amazon SageMaker Notebook instance
An Amazon SageMaker notebook instance is a fully managed machine learning (ML) Amazon Elastic Compute Cloud (Amazon EC2) compute instance that runs the Jupyter Notebook App. You use the notebook instance to create and manage Jupyter notebooks that you can use to prepare and process data and to train and deploy machine learning models. 

To create an Amazon SageMaker notebook instance:

- Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.

- Choose Notebook instances, then choose Create notebook instance.

- On the Create notebook instance page, provide the following information (if a field is not mentioned, leave the default values):

    - For Notebook instance name, type a name for your notebook instance.

    - For Instance type, choose ml.t2.medium. This is the least expensive instance type that notebook instances support, and it suffices for this exercise.

    - For IAM role, choose Create a new role, then choose Create role.

- Choose Create notebook instance.

In a few minutes, Amazon SageMaker launches an ML compute instance—in this case, a notebook instance—and attaches an ML storage volume to it. The notebook instance has a preconfigured Jupyter notebook server and a set of Anaconda libraries.
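
The console steps above can also be scripted. This is a rough sketch using the boto3 SageMaker client's `create_notebook_instance` call; the helper names are hypothetical, and the role ARN has to be the ARN of an IAM role that already exists in your account (the console creates one for you, this sketch does not).

```python
def notebook_instance_request(name, role_arn, instance_type="ml.t2.medium"):
    """Build the keyword arguments for create_notebook_instance."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,  # same instance type as chosen above
        "RoleArn": role_arn,            # ARN of an existing IAM role
    }


def create_notebook_instance(name, role_arn):
    """Launch the notebook instance (needs boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so the helper above works offline

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(
        **notebook_instance_request(name, role_arn)
    )
```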

### STEP 3: Create a Jupyter notebook and start writing code

In [1]:
# data managing and display libs
import pandas as pd
import numpy as np
import os
import io

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline 

# sagemaker libraries
import boto3
import sagemaker

### STEP 4: Get data from S3

In [2]:
# boto3 client to get S3 data
s3_client = boto3.client('s3')
bucket_name='skuchkula-sagemaker-airbnb'

In [3]:
# get a list of objects in the bucket
obj_list=s3_client.list_objects(Bucket=bucket_name)

# print object(s) in S3 bucket
files=[]
for contents in obj_list['Contents']:
    files.append(contents['Key'])
    
print(files)

['detailed_listings.csv', 'summary_listings.csv']


In [4]:
# there are two files --> two keys
detailed_listings=files[0]
summary_listings=files[1]
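
Picking keys by position works here because the bucket holds exactly two files in a known order. A slightly more defensive sketch selects each key by file name instead; `pick_key` is a hypothetical helper, and the `files` list is copied from the output above.

```python
import posixpath


def pick_key(keys, filename):
    """Return the single key whose base name equals filename."""
    matches = [k for k in keys if posixpath.basename(k) == filename]
    if len(matches) != 1:
        raise KeyError(
            f"expected exactly one key named {filename!r}, found {len(matches)}"
        )
    return matches[0]


# keys as printed by the listing cell above
files = ['detailed_listings.csv', 'summary_listings.csv']
detailed_listings = pick_key(files, 'detailed_listings.csv')
summary_listings = pick_key(files, 'summary_listings.csv')
```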

In [5]:
# check the file formats
file_name = detailed_listings

# get an S3 object by passing in the bucket and file name
data_object = s3_client.get_object(Bucket=bucket_name, Key=file_name)

# what info does the object contain?
display(data_object)

{'ResponseMetadata': {'RequestId': 'C6DAE855FAFB68CD',
  'HostId': 'AshhLC/JbPE5I3Gokgj0/kzcz8T89oog0bYTlfMdLNGmVprf3c67vmPyjV9YvXeLq/e/aHIr/r8=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'AshhLC/JbPE5I3Gokgj0/kzcz8T89oog0bYTlfMdLNGmVprf3c67vmPyjV9YvXeLq/e/aHIr/r8=',
   'x-amz-request-id': 'C6DAE855FAFB68CD',
   'date': 'Sat, 05 Oct 2019 00:30:54 GMT',
   'last-modified': 'Thu, 03 Oct 2019 21:24:49 GMT',
   'etag': '"01067810107b6eb6cfc6bf52cf02de2c-22"',
   'accept-ranges': 'bytes',
   'content-type': 'text/csv',
   'content-length': '184372589',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'AcceptRanges': 'bytes',
 'LastModified': datetime.datetime(2019, 10, 3, 21, 24, 49, tzinfo=tzutc()),
 'ContentLength': 184372589,
 'ETag': '"01067810107b6eb6cfc6bf52cf02de2c-22"',
 'ContentType': 'text/csv',
 'Metadata': {},
 'Body': <botocore.response.StreamingBody at 0x7f5634aa90f0>}

In [6]:
# information is in the "Body" of the object
data_body = data_object["Body"].read()
print('Data type: ', type(data_body))

Data type:  <class 'bytes'>


In [7]:
# read in bytes data
data_stream = io.BytesIO(data_body)

# create a dataframe
listings_df = pd.read_csv(data_stream, header=0, delimiter=",") 
listings_df.head()


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2595,https://www.airbnb.com/rooms/2595,20190806030549,2019-08-07,Skylit Midtown Castle,"Find your romantic getaway to this beautiful, ...","- Spacious (500+ft²), immaculate and nicely fu...","Find your romantic getaway to this beautiful, ...",none,Centrally located in the heart of Manhattan ju...,...,f,f,strict_14_with_grace_period,t,t,2,1,0,1,0.39
1,3647,https://www.airbnb.com/rooms/3647,20190806030549,2019-08-06,THE VILLAGE OF HARLEM....NEW YORK !,,WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T...,WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T...,none,,...,f,f,strict_14_with_grace_period,t,t,1,0,1,0,
2,3831,https://www.airbnb.com/rooms/3831,20190806030549,2019-08-06,Cozy Entire Floor of Brownstone,Urban retreat: enjoy 500 s.f. floor in 1899 br...,Greetings! We own a double-duplex brownst...,Urban retreat: enjoy 500 s.f. floor in 1899 br...,none,Just the right mix of urban center and local n...,...,f,f,moderate,f,f,1,1,0,0,4.64
3,5022,https://www.airbnb.com/rooms/5022,20190806030549,2019-08-06,Entire Apt: Spacious Studio/Loft by central park,,Loft apartment with high ceiling and wood floo...,Loft apartment with high ceiling and wood floo...,none,,...,f,f,strict_14_with_grace_period,t,t,1,1,0,0,0.1
4,5099,https://www.airbnb.com/rooms/5099,20190806030549,2019-08-06,Large Cozy 1 BR Apartment In Midtown East,My large 1 bedroom apartment is true New York ...,I have a large 1 bedroom apartment centrally l...,My large 1 bedroom apartment is true New York ...,none,My neighborhood in Midtown East is called Murr...,...,f,f,strict_14_with_grace_period,t,t,1,1,0,0,0.6


In [6]:
def get_data_frame(bucket_name, file_name):
    # get an S3 object by passing in the bucket and file name
    data_object = s3_client.get_object(Bucket=bucket_name, Key=file_name)
    
    # information is in the "Body" of the object
    data_body = data_object["Body"].read()
    
    # read in bytes data
    data_stream = io.BytesIO(data_body)
    
    # create a dataframe
    df = pd.read_csv(data_stream, header=0, delimiter=",", low_memory=False)
    
    return df
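
The bytes-to-DataFrame part of `get_data_frame` can be tried without touching S3. The sketch below feeds a tiny in-memory CSV through the same `io.BytesIO` and `pd.read_csv` path; `bytes_to_frame` is a hypothetical helper and the three-column sample is cut down from the first two listings shown earlier. With `low_memory=False`, pandas parses the file in one pass instead of in chunks, so each column gets a single inferred dtype.

```python
import io

import pandas as pd


def bytes_to_frame(data_body):
    """Parse raw CSV bytes (such as an S3 object body) into a DataFrame."""
    data_stream = io.BytesIO(data_body)
    return pd.read_csv(data_stream, header=0, delimiter=",", low_memory=False)


# a two-row stand-in for the real file
sample = (
    b"id,name,price\n"
    b"2595,Skylit Midtown Castle,$225.00\n"
    b"3647,THE VILLAGE OF HARLEM....NEW YORK !,$150.00\n"
)
sample_df = bytes_to_frame(sample)
```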

In [7]:
df_summary_listings = get_data_frame(bucket_name, summary_listings)
df_detailed_listings = get_data_frame(bucket_name, detailed_listings)

In [8]:
df_summary_listings.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,46,2019-07-14,0.39,2,288
1,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
2,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,274,2019-07-26,4.64,1,212
3,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
4,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,75,2019-07-21,0.6,1,127


In [9]:
df_detailed_listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2595,https://www.airbnb.com/rooms/2595,20190806030549,2019-08-07,Skylit Midtown Castle,"Find your romantic getaway to this beautiful, ...","- Spacious (500+ft²), immaculate and nicely fu...","Find your romantic getaway to this beautiful, ...",none,Centrally located in the heart of Manhattan ju...,...,f,f,strict_14_with_grace_period,t,t,2,1,0,1,0.39
1,3647,https://www.airbnb.com/rooms/3647,20190806030549,2019-08-06,THE VILLAGE OF HARLEM....NEW YORK !,,WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T...,WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T...,none,,...,f,f,strict_14_with_grace_period,t,t,1,0,1,0,
2,3831,https://www.airbnb.com/rooms/3831,20190806030549,2019-08-06,Cozy Entire Floor of Brownstone,Urban retreat: enjoy 500 s.f. floor in 1899 br...,Greetings! We own a double-duplex brownst...,Urban retreat: enjoy 500 s.f. floor in 1899 br...,none,Just the right mix of urban center and local n...,...,f,f,moderate,f,f,1,1,0,0,4.64
3,5022,https://www.airbnb.com/rooms/5022,20190806030549,2019-08-06,Entire Apt: Spacious Studio/Loft by central park,,Loft apartment with high ceiling and wood floo...,Loft apartment with high ceiling and wood floo...,none,,...,f,f,strict_14_with_grace_period,t,t,1,1,0,0,0.1
4,5099,https://www.airbnb.com/rooms/5099,20190806030549,2019-08-06,Large Cozy 1 BR Apartment In Midtown East,My large 1 bedroom apartment is true New York ...,I have a large 1 bedroom apartment centrally l...,My large 1 bedroom apartment is true New York ...,none,My neighborhood in Midtown East is called Murr...,...,f,f,strict_14_with_grace_period,t,t,1,1,0,0,0.6


The detailed listings contain 106 columns for each Airbnb listing. We need to do some exploratory data analysis (EDA) to check whether some of these features can be trimmed.

In [10]:
print("Summary Listings row, cols: ", df_summary_listings.shape)
print("Detailed Listings row, cols: ", df_detailed_listings.shape)

Summary Listings row, cols:  (48864, 16)
Detailed Listings row, cols:  (48864, 106)


The summary listings contain a small subset of the features in the detailed listings. They are good for visualization, but machine learning models benefit from having more features to learn from.

In [11]:
df_summary_listings.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

In [12]:
# check how the data types are distributed.
# pandas does its best to interpret the datatype while reading in
# however it is our duty to check if the datatype makes sense
df_detailed_listings.dtypes.value_counts()

object     63
float64    22
int64      21
dtype: int64

### Data Cleaning

In [13]:
df_detailed_listings.select_dtypes(include=['object']).columns

Index(['listing_url', 'last_scraped', 'name', 'summary', 'space',
       'description', 'experiences_offered', 'neighborhood_overview', 'notes',
       'transit', 'access', 'interaction', 'house_rules', 'picture_url',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'is_location_exact',
       'property_type', 'room_type', 'bed_type', 'amenities', 'price',
       'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee',
       'extra_people', 'calendar_updated', 'has_availability',
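       'calendar_last_scraped', 'first_review', 'last_review',
       'requires_license', 'license', 'jurisdiction_names', 'instant_bookable',
       'is_business_travel_ready', 'cancellation_policy',
       'require_guest_profile_picture', 'require_guest_phone_verification'],
      dtype='object')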


In [14]:
print(pd.get_option("display.max_columns"))
pd.set_option("display.max_columns", 100)
print(pd.get_option("display.max_columns"))

20
100


In [22]:
drop_object_cols = ['listing_url',
             'last_scraped',
             #'name',
             'picture_url',
             'host_url',
             'host_name',
             'host_since',
             'host_location',
             'host_about',
             'host_thumbnail_url',
             'host_picture_url',
             'host_neighbourhood',
             'street',
             #'neighbourhood',
             #'neighbourhood_cleansed',
             #'neighbourhood_group_cleansed',
             'city',
             'state',
             'zipcode',
             'market',
             'smart_location',
             'country_code',
             'country',
             'calendar_updated',
             'calendar_last_scraped',
             'first_review',
             'last_review' 
            ]

In [18]:
df_detailed_listings.select_dtypes(include=['object']).head()

Unnamed: 0,listing_url,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,picture_url,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,is_location_exact,property_type,room_type,bed_type,amenities,price,weekly_price,monthly_price,security_deposit,cleaning_fee,extra_people,calendar_updated,has_availability,calendar_last_scraped,first_review,last_review,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification
0,https://www.airbnb.com/rooms/2595,2019-08-07,Skylit Midtown Castle,"Find your romantic getaway to this beautiful, ...","- Spacious (500+ft²), immaculate and nicely fu...","Find your romantic getaway to this beautiful, ...",none,Centrally located in the heart of Manhattan ju...,,Apartment is located on 37th Street between 5t...,"Guests have full access to the kitchen, bathro...",I am a Sound Therapy Practitioner and Kundalin...,"Make yourself at home, respect the space and t...",https://a0.muscache.com/im/pictures/f0813a11-4...,https://www.airbnb.com/users/show/2845,Jennifer,2008-09-09,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,within a few hours,90%,f,https://a0.muscache.com/im/users/2845/profile_...,https://a0.muscache.com/im/users/2845/profile_...,Midtown,"['email', 'phone', 'reviews', 'kba', 'work_ema...",t,t,"New York, NY, United States",Midtown,Midtown,Manhattan,New York,NY,10018,New York,"New York, NY",US,United States,f,Apartment,Entire home/apt,Real Bed,"{TV,Wifi,""Air conditioning"",Kitchen,""Paid park...",$225.00,"$1,995.00",,$350.00,$100.00,$0.00,a week ago,t,2019-08-07,2009-11-21,2019-07-14,f,,,f,f,strict_14_with_grace_period,t,t
1,https://www.airbnb.com/rooms/3647,2019-08-06,THE VILLAGE OF HARLEM....NEW YORK !,,WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T...,WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY T...,none,,,,,,Upon arrival please have a legibile copy of yo...,https://a0.muscache.com/im/pictures/838341/9b3...,https://www.airbnb.com/users/show/4632,Elisabeth,2008-11-25,"New York, New York, United States",Make Up Artist National/ (Website hidden by Ai...,within a day,100%,f,https://a0.muscache.com/im/users/4632/profile_...,https://a0.muscache.com/im/users/4632/profile_...,Harlem,"['email', 'phone', 'google', 'reviews', 'jumio...",t,t,"New York, NY, United States",Harlem,Harlem,Manhattan,New York,NY,10027,New York,"New York, NY",US,United States,t,Apartment,Private room,Pull-out Sofa,"{""Cable TV"",Internet,Wifi,""Air conditioning"",K...",$150.00,,,$200.00,$75.00,$20.00,35 months ago,t,2019-08-06,,,f,,,f,f,strict_14_with_grace_period,t,t
2,https://www.airbnb.com/rooms/3831,2019-08-06,Cozy Entire Floor of Brownstone,Urban retreat: enjoy 500 s.f. floor in 1899 br...,Greetings! We own a double-duplex brownst...,Urban retreat: enjoy 500 s.f. floor in 1899 br...,none,Just the right mix of urban center and local n...,,B52 bus for a 10-minute ride to downtown Brook...,You will have exclusive use of and access to: ...,"We'll be around, but since you have the top fl...",Smoking - outside please; pets allowed but ple...,https://a0.muscache.com/im/pictures/e49999c2-9...,https://www.airbnb.com/users/show/4869,LisaRoxanne,2008-12-07,"New York, New York, United States",Laid-back bi-coastal actor/professor/attorney.,within an hour,90%,f,https://a0.muscache.com/im/users/4869/profile_...,https://a0.muscache.com/im/users/4869/profile_...,Clinton Hill,"['email', 'phone', 'reviews', 'kba']",t,t,"Brooklyn, NY, United States",Brooklyn,Clinton Hill,Brooklyn,Brooklyn,NY,11238,New York,"Brooklyn, NY",US,United States,t,Guest suite,Entire home/apt,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$89.00,$575.00,"$2,100.00",$500.00,,$0.00,today,t,2019-08-06,2014-09-30,2019-07-26,f,,,f,f,moderate,f,f
3,https://www.airbnb.com/rooms/5022,2019-08-06,Entire Apt: Spacious Studio/Loft by central park,,Loft apartment with high ceiling and wood floo...,Loft apartment with high ceiling and wood floo...,none,,,,,,Please be considerate when staying in the apar...,https://a0.muscache.com/im/pictures/feb453bd-f...,https://www.airbnb.com/users/show/7192,Laura,2009-01-29,"Miami, Florida, United States",I have been a NYer for almost 10 years. I came...,,,f,https://a0.muscache.com/im/users/7192/profile_...,https://a0.muscache.com/im/users/7192/profile_...,East Harlem,"['email', 'phone', 'facebook', 'reviews', 'kba']",t,t,"New York, NY, United States",East Harlem,East Harlem,Manhattan,New York,NY,10029,New York,"New York, NY",US,United States,t,Apartment,Entire home/apt,Real Bed,"{Internet,Wifi,""Air conditioning"",Kitchen,Elev...",$80.00,$600.00,"$1,600.00",$100.00,$80.00,$20.00,4 months ago,t,2019-08-06,2012-03-20,2018-11-19,f,,,f,f,strict_14_with_grace_period,t,t
4,https://www.airbnb.com/rooms/5099,2019-08-06,Large Cozy 1 BR Apartment In Midtown East,My large 1 bedroom apartment is true New York ...,I have a large 1 bedroom apartment centrally l...,My large 1 bedroom apartment is true New York ...,none,My neighborhood in Midtown East is called Murr...,Read My Full Listing For All Information. New ...,From the apartment is a 10 minute walk to Gran...,I will meet you upon arrival.,I usually check in with guests via text or ema...,• Check-in time is 2PM. • Check-out time is 12...,https://a0.muscache.com/im/pictures/be2fdcf6-e...,https://www.airbnb.com/users/show/7322,Chris,2009-02-02,"New York, New York, United States","I'm an artist, writer, traveler, and a native ...",within a few hours,90%,f,https://a0.muscache.com/im/pictures/user/26745...,https://a0.muscache.com/im/pictures/user/26745...,Flatiron District,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,f,"New York, NY, United States",Midtown East,Murray Hill,Manhattan,New York,NY,10016,New York,"New York, NY",US,United States,f,Apartment,Entire home/apt,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Buzzer/w...",$200.00,,,$300.00,$125.00,$100.00,4 days ago,t,2019-08-06,2009-04-20,2019-07-21,f,,,f,f,strict_14_with_grace_period,t,t
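
Several of the object columns above hold values that are really numeric or boolean: prices such as `$1,995.00`, rates such as `90%`, and flags stored as `t`/`f`. Below is a small sketch of converters for these formats; the function names are hypothetical, and anything that is not a string (for example a missing value read as NaN) is mapped to `None`.

```python
def parse_price(text):
    """'$1,995.00' -> 1995.0; non-strings and blanks -> None."""
    if not isinstance(text, str) or not text.strip():
        return None
    return float(text.replace("$", "").replace(",", ""))


def parse_percent(text):
    """'90%' -> 0.9; non-strings and blanks -> None."""
    if not isinstance(text, str) or not text.strip():
        return None
    return float(text.strip().rstrip("%")) / 100.0


def parse_flag(text):
    """'t' -> True, 'f' -> False, anything else -> None."""
    return {"t": True, "f": False}.get(text)
```

These could be applied column by column, for example `df_detailed_listings['price'].map(parse_price)`.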


In [15]:
df_detailed_listings.select_dtypes(include=['float64']).columns

Index(['thumbnail_url', 'medium_url', 'xl_picture_url', 'host_acceptance_rate',
       'host_listings_count', 'host_total_listings_count', 'latitude',
       'longitude', 'bathrooms', 'bedrooms', 'beds', 'square_feet',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month'],
      dtype='object')

In [16]:
drop_float_cols = ['thumbnail_url',
                   'medium_url',
                   'xl_picture_url',
                   #'latitude',
                   #'longitude'
                  ]

In [17]:
df_detailed_listings.select_dtypes(include=['float64']).head()

Unnamed: 0,thumbnail_url,medium_url,xl_picture_url,host_acceptance_rate,host_listings_count,host_total_listings_count,latitude,longitude,bathrooms,bedrooms,beds,square_feet,minimum_nights_avg_ntm,maximum_nights_avg_ntm,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month
0,,,,,5.0,5.0,40.75362,-73.98377,1.0,0.0,1.0,,1.0,1125.0,95.0,10.0,9.0,10.0,10.0,10.0,9.0,0.39
1,,,,,1.0,1.0,40.80902,-73.9419,1.0,1.0,1.0,,3.0,7.0,,,,,,,,
2,,,,,1.0,1.0,40.68514,-73.95976,1.0,1.0,4.0,500.0,1.0,730.0,90.0,10.0,9.0,10.0,9.0,10.0,9.0,4.64
3,,,,,1.0,1.0,40.79851,-73.94399,1.0,,1.0,,10.0,120.0,93.0,10.0,9.0,10.0,10.0,9.0,10.0,0.1
4,,,,,1.0,1.0,40.74767,-73.975,1.0,1.0,1.0,,3.0,21.0,89.0,10.0,9.0,10.0,10.0,9.0,9.0,0.6


In [18]:
df_detailed_listings.select_dtypes(include=['int64']).columns

Index(['id', 'scrape_id', 'host_id', 'accommodates', 'guests_included',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'availability_30', 'availability_60',
       'availability_90', 'availability_365', 'number_of_reviews',
       'number_of_reviews_ltm', 'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms'],
      dtype='object')

In [19]:
drop_int_cols = ['scrape_id',
                 'host_id',
                 'minimum_minimum_nights',
                 'maximum_minimum_nights',
                 'minimum_maximum_nights',
                 'maximum_maximum_nights'
                ]

In [20]:
df_detailed_listings.select_dtypes(include=['int64']).head()

Unnamed: 0,id,scrape_id,host_id,accommodates,guests_included,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms
0,2595,20190806030549,2845,2,2,1,1125,1,1,1125,1125,13,17,31,288,46,12,2,1,0,1
1,3647,20190806030549,4632,2,2,3,7,3,3,7,7,30,60,90,365,0,0,1,0,1,0
2,3831,20190806030549,4869,3,1,1,730,1,1,730,730,0,1,4,212,274,70,1,1,0,0
3,5022,20190806030549,7192,1,1,10,120,10,10,120,120,0,0,0,0,9,2,1,1,0,0
4,5099,20190806030549,7322,2,2,3,21,3,3,21,21,24,33,63,127,75,10,1,1,0,0


In [23]:
drop_cols = drop_object_cols + drop_float_cols + drop_int_cols + ['experiences_offered']
len(drop_cols)

33

In [24]:
drop_cols

['listing_url',
 'last_scraped',
 'picture_url',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'street',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'calendar_updated',
 'calendar_last_scraped',
 'first_review',
 'last_review',
 'thumbnail_url',
 'medium_url',
 'xl_picture_url',
 'scrape_id',
 'host_id',
 'minimum_minimum_nights',
 'maximum_minimum_nights',
 'minimum_maximum_nights',
 'maximum_maximum_nights',
 'experiences_offered']

In [25]:
df_dl_clean_df = df_detailed_listings.drop(columns=drop_cols)

In [26]:
df_dl_clean_df.shape

(48864, 73)
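
`DataFrame.drop` raises a `KeyError` by default when a label in the list is not a column, so it can help to check the drop list before using it. The sketch below is one way to do that; `check_drop_list` is a hypothetical helper, not part of pandas.

```python
from collections import Counter


def check_drop_list(columns, drop_cols):
    """Return (labels missing from columns, labels listed more than once)."""
    present = set(columns)
    missing = [c for c in drop_cols if c not in present]
    duplicates = [c for c, n in Counter(drop_cols).items() if n > 1]
    return missing, duplicates
```

With the real data this would be called as `check_drop_list(df_detailed_listings.columns, drop_cols)`; two empty lists mean the drop is safe, and 106 columns minus 33 dropped leaves the 73 shown above.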

## Missing value analysis
Calculate the number and percentage of missing values in each column.

In [27]:
print(pd.get_option("display.max_rows"))
pd.set_option("display.max_rows", 100)
print(pd.get_option("display.max_rows"))

60
100


In [29]:
num_missing = df_dl_clean_df.isnull().sum().to_frame()
num_missing.columns = ['num_missing']
num_missing['pct_missing'] = np.round(100 * (num_missing['num_missing'] / df_dl_clean_df.shape[0]))
num_missing.sort_values(by='num_missing', ascending=False)

Unnamed: 0,num_missing,pct_missing
host_acceptance_rate,48864,100.0
jurisdiction_names,48853,100.0
license,48842,100.0
square_feet,48469,99.0
monthly_price,43715,89.0
weekly_price,42964,88.0
notes,28692,59.0
access,21916,45.0
interaction,19947,41.0
house_rules,18912,39.0
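
One way to act on this table is to flag columns whose missing fraction exceeds a cutoff. The sketch below uses the counts from the output above; the 0.8 threshold is an assumption chosen for illustration, and `sparse_columns` is a hypothetical helper. Note that `pct_missing` is rounded to whole percentages, so `jurisdiction_names` (48853 of 48864 rows) shows as 100.0 even though a few values are present.

```python
N_ROWS = 48864

# missing-value counts copied from the table above
num_missing = {
    "host_acceptance_rate": 48864,
    "jurisdiction_names": 48853,
    "license": 48842,
    "square_feet": 48469,
    "monthly_price": 43715,
    "weekly_price": 42964,
    "notes": 28692,
    "access": 21916,
    "interaction": 19947,
    "house_rules": 18912,
}


def sparse_columns(counts, n_rows, threshold=0.8):
    """Columns whose fraction of missing values is above threshold."""
    return [col for col, n in counts.items() if n / n_rows > threshold]


too_sparse = sparse_columns(num_missing, N_ROWS)
```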


## Analyze amenities

In [30]:
# strip the enclosing curly braces, e.g. '{TV,Wifi}' -> 'TV,Wifi'
df_dl_clean_df['amenities'] = df_dl_clean_df['amenities'].apply(lambda x: x[1:-1])

In [31]:
amenities = df_dl_clean_df['amenities']

In [32]:
# map each distinct amenity token to a column index
amenities_idx = {}
idx = 0
# corpus holds one list of tokens per listing
corpus = []
for items in amenities:
    # lower-case and split on commas (quoted names keep their quotes)
    tokens = items.lower().split(',')
    corpus.append(tokens)
    for token in tokens:
        if token not in amenities_idx:
            amenities_idx[token] = idx
            idx += 1

In [33]:
len(amenities_idx)

131
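
Splitting on commas keeps the double quotes around multi-word amenities (for example `"air conditioning"`) and produces an empty-string token for a listing with no amenities. A sketch of an alternative tokenizer, using the standard-library `csv` module to handle the quoting, is shown below; `parse_amenities` is a hypothetical helper and would change the vocabulary size reported above.

```python
import csv


def parse_amenities(raw):
    """'{TV,"Air conditioning"}' -> ['tv', 'air conditioning']."""
    # drop the braces if they are still present
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader understands the double-quoted, comma-separated format
    row = next(csv.reader([inner]))
    return [token.strip().lower() for token in row if token.strip()]
```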

In [34]:
len(corpus)

48864

In [35]:
# Get the number of items and tokens 
M = len(amenities)
N = len(amenities_idx)

# Initialize a matrix of zeros
A = np.zeros((M, N))

In [36]:
# Define the amenity_encoder function
def amenity_encoder(tokens):
    x = np.zeros(N)
    for token in tokens:
        # Get the index for each amenity
        idx = amenities_idx[token]
        # Put 1 at the corresponding indices
        x[idx] = 1
    return x

In [37]:
# Make a document-term matrix
for i, tokens in enumerate(corpus):
    A[i, :] = amenity_encoder(tokens)

In [38]:
A.shape

(48864, 131)
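
To make the encoding concrete, here is the same idea on a toy corpus of three listings. The names `toy_corpus`, `vocab`, and `toy_A` are made up for this illustration; each row is a listing, each column an amenity, and a 1 marks that the listing has that amenity.

```python
import numpy as np

# three hypothetical listings with their amenity tokens
toy_corpus = [["tv", "wifi"], ["wifi", "kitchen"], ["tv"]]

# assign column indices in order of first appearance
vocab = {}
for tokens in toy_corpus:
    for token in tokens:
        vocab.setdefault(token, len(vocab))

# one row per listing, one column per amenity
toy_A = np.zeros((len(toy_corpus), len(vocab)))
for row, tokens in enumerate(toy_corpus):
    toy_A[row, [vocab[t] for t in tokens]] = 1
```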

In [40]:
from sklearn.manifold import TSNE

In [None]:
# Dimension reduction with t-SNE
model = TSNE(n_components=2, learning_rate=200, random_state=42)
tsne_features = model.fit_transform(A)

In [42]:
tsne_df = pd.DataFrame({'TSNE1': tsne_features[:,0], 
              'TSNE2': tsne_features[:,1]
             })
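
Since the stated objective of this notebook is PCA, it may be useful to compare the t-SNE embedding with a linear projection of the same matrix. The following is a minimal NumPy sketch of PCA via singular value decomposition; it is not SageMaker's built-in PCA algorithm, `pca_reduce` is a hypothetical helper, and the random binary matrix `demo` only stands in for `A`.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its first n_components principal components."""
    # PCA works on mean-centered data
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # coordinates of each row along the leading components
    scores = U[:, :n_components] * S[:n_components]
    # fraction of total variance captured by each component
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# random 0/1 data standing in for the amenity matrix
rng = np.random.default_rng(0)
demo = (rng.random((200, 20)) < 0.3).astype(float)
scores, explained = pca_reduce(demo, 5)
```

On the real data this would be called as `pca_reduce(A, 2)` to get a two-dimensional view comparable to the t-SNE output.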

In [43]:
subset_cols = ['id', 'price', 'amenities', 
               'neighbourhood_cleansed', 
               'neighbourhood_group_cleansed',
               'latitude', 'longitude'
              ]

In [44]:
df_dl_clean_df = df_dl_clean_df[subset_cols]

In [45]:
df_dl_clean_df = pd.concat([df_dl_clean_df, tsne_df], axis='columns')

In [46]:
df_dl_clean_df.head()

Unnamed: 0,id,price,amenities,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,TSNE1,TSNE2
0,2595,$225.00,"TV,Wifi,""Air conditioning"",Kitchen,""Paid parki...",Midtown,Manhattan,40.75362,-73.98377,42.80336,-10.623417
1,3647,$150.00,"""Cable TV"",Internet,Wifi,""Air conditioning"",Ki...",Harlem,Manhattan,40.80902,-73.9419,-34.314285,13.559643
2,3831,$89.00,"TV,""Cable TV"",Internet,Wifi,""Air conditioning""...",Clinton Hill,Brooklyn,40.68514,-73.95976,3.4063,-22.669363
3,5022,$80.00,"Internet,Wifi,""Air conditioning"",Kitchen,Eleva...",East Harlem,Manhattan,40.79851,-73.94399,-11.913917,-8.425117
4,5099,$200.00,"TV,""Cable TV"",Internet,Wifi,Kitchen,""Buzzer/wi...",Murray Hill,Manhattan,40.74767,-73.975,-12.099396,-1.838848


### Visualizing t-SNE results

In [88]:
from bokeh.io import show, output_notebook, push_notebook, output_file
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
output_notebook()


In [97]:
# Make a data source from the first 100 listings for the scatter plot
source = ColumnDataSource(df_dl_clean_df[:100])

In [98]:
plot = figure(x_axis_label = 'T-SNE 1', 
              y_axis_label = 'T-SNE 2', 
              width = 500, height = 400)
plot.circle(x = 'TSNE1', 
    y = 'TSNE2', 
    source = source, 
    size = 10, color = '#FF7373', alpha = .8)

In [99]:
# Create a HoverTool object
# price already contains the "$" sign (e.g. "$225.00"), so no extra "$" is added
hover = HoverTool(tooltips = [('id', '@id'),
                              ('price', '@price')
                             ])
plot.add_tools(hover)

In [100]:
output_file('amenities.html')
show(plot)

## To do

In [None]:
# more columns that can be removed
# experiences_offered contains only 'none', so it carries no information
more_cols = ['experiences_offered']