 
# Data Cleansing with Airbnb 

We'll start with some exploratory data analysis and cleansing. The data set contains Airbnb rental listings from San Francisco (SF). You can read more at [Inside Airbnb](http://insideairbnb.com/get-the-data.html).

## In this notebook

 - Find the API Docs for the running version of Pandas
 - Work with messy CSV files
 - Impute missing values
 - Identify & remove outliers
 - Preprocess data
 
 


Let's take a look at what data we have here

In [0]:
%ls 

airbnb-listings.csv  sample_data/



How are we going to load this data?

**Pandas**

    A Python package that provides an easy-to-use API and tools for structured data analysis.
 
**DataFrame**

* The most commonly used object in Pandas.
* It is a 2D indexed data structure whose columns can hold different data types.

We are going to use the Pandas DataFrame API for importing and analyzing data.

Let's check the Docs first!

* Check which version of [Pandas](https://pandas.pydata.org/) is installed, then find the API docs for that version.

 


In [0]:
import pandas as pd 

In [0]:
#find pandas version
help(pd)

Help on package pandas:

NAME
    pandas

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    
    **pandas** is a Python package providing fast, flexible, and expressive data
    structures designed to make working with "relational" or "labeled" data both
    easy and intuitive. It aims to be the fundamental high-level building block for
    doing practical, **real world** data analysis in Python. Additionally, it has
    the broader goal of becoming **the most powerful and flexible open source data
    analysis / manipulation tool available in any language**. It is already well on
    its way toward this goal.
    
    Main Features
    -------------
    Here are just a few of the things that pandas does well:
    
      - Easy handling of missing data in floating point as well as non-floating
        point data.
      - Size mutability: columns can be inserted and deleted from DataFrame and
        higher dimensional objects
      - Automatic an
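
`help(pd)` prints the whole package description, so the version itself is easy to miss. As a more direct alternative (not part of the original notebook), the sketch below reads `pd.__version__` and derives the major.minor prefix, which is usually enough to pick the matching API docs. The variable names are illustrative.

```python
import pandas as pd

# The version string identifies which release of the API docs to read.
version = pd.__version__
print(version)

# Keep only "major.minor"; patch releases rarely change the API.
major_minor = ".".join(version.split(".")[:2])
print(major_minor)
```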




Let's load the data into a DataFrame  from the CSV file.

In [0]:
filePath = 'airbnb-listings.csv'
 
rawDF = pd.read_csv(filePath)
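
To see how `read_csv` copes with the kind of messiness found in the listings file, here is a minimal sketch on a made-up, in-memory sample. The rows, the `sample_csv` and `sample_df` names, and the choice to read `zipcode` as text are assumptions for illustration, not part of the notebook's data.

```python
import io
import pandas as pd

# Hypothetical three-row sample mimicking the listing file's quirks:
# quoted fields with embedded commas, an empty zipcode, currency strings.
sample_csv = io.StringIO(
    'id,name,zipcode,price\n'
    '1,"Garden Unit, 1BR",94117,$170.00\n'
    '2,Creative Sanctuary,,$235.00\n'
    '3,"Room, Shared Bath",94110,"$1,685.00"\n'
)

# Reading zipcode as str keeps it as text instead of a float like 94117.0.
sample_df = pd.read_csv(sample_csv, dtype={"zipcode": str})

print(sample_df.shape)
print(sample_df.dtypes)
print(sample_df)
```

Quoted fields keep their embedded commas, the empty zipcode becomes a null, and the currency strings stay as text until they are cleaned explicitly.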

Let's take a look at the first few records 

In [0]:
 
rawDF.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,...,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,958,https://www.airbnb.com/rooms/958,20181206023014,2018-12-06,"Bright, Modern Garden Unit - 1BR/1B",Our bright garden unit overlooks a grassy back...,"Newly remodeled, modern, and bright garden uni...",Our bright garden unit overlooks a grassy back...,none,*Quiet cul de sac in friendly neighborhood *St...,Due to the fact that we have children and a do...,*Public Transportation is 1/2 block away. *Ce...,*Full access to patio and backyard (shared wit...,A family of 4 lives upstairs with their dog. N...,* No Pets - even visiting guests for a short t...,,,https://a0.muscache.com/im/pictures/b7c2a199-4...,,1169,https://www.airbnb.com/users/show/1169,Holly,2008-07-31,"San Francisco, California, United States",We are a family with 2 boys born in 2009 and 2...,,,,t,https://a0.muscache.com/im/pictures/efdad96a-3...,https://a0.muscache.com/im/pictures/efdad96a-3...,Duboce Triangle,1,1,"['email', 'phone', 'facebook', 'reviews', 'kba']",t,t,"San Francisco, CA, United States",Duboce Triangle,Western Addition,...,2.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Pets liv...",,$170.00,"$1,120.00","$4,200.00",$100.00,$100.00,2,$25.00,1,30,today,t,3,4,5,74,2018-12-06,172,2009-07-23,2018-11-16,97.0,10.0,10.0,10.0,10.0,10.0,10.0,t,STR-0001256,"{""SAN FRANCISCO""}",t,f,moderate,f,f,1,1.51
1,5858,https://www.airbnb.com/rooms/5858,20181206023014,2018-12-06,Creative Sanctuary,,We live in a large Victorian house on a quiet ...,We live in a large Victorian house on a quiet ...,none,I love how our neighborhood feels quiet but is...,All the furniture in the house was handmade so...,The train is two blocks away and you can stop ...,"Our deck, garden, gourmet kitchen and extensiv...",,"Please respect the house, the art work, the fu...",,,https://a0.muscache.com/im/pictures/17714/3a7a...,,8904,https://www.airbnb.com/users/show/8904,Philip And Tania,2009-03-02,"San Francisco, California, United States",Philip: English transplant to the Bay Area and...,within a few hours,70%,,f,https://a0.muscache.com/im/users/8904/profile_...,https://a0.muscache.com/im/users/8904/profile_...,Bernal Heights,2,2,"['email', 'phone', 'reviews', 'kba', 'work_ema...",t,t,"San Francisco, CA, United States",Bernal Heights,Bernal Heights,...,3.0,Real Bed,"{Internet,Wifi,Kitchen,Heating,""Family/kid fri...",,$235.00,"$1,600.00","$5,500.00",,$100.00,2,$0.00,30,60,5 days ago,t,30,60,90,365,2018-12-06,112,2009-05-03,2017-08-06,98.0,10.0,10.0,10.0,10.0,10.0,9.0,t,,"{""SAN FRANCISCO""}",f,f,strict_14_with_grace_period,f,f,1,0.96
2,7918,https://www.airbnb.com/rooms/7918,20181206023014,2018-12-06,A Friendly Room - UCSF/USF - San Francisco,Nice and good public transportation. 7 minute...,Room rental-sunny view room/sink/Wi Fi (inner ...,Nice and good public transportation. 7 minute...,none,"Shopping old town, restaurants, McDonald, Whol...",Please email your picture id with print name (...,N Juda Muni and bus stop. Street parking.,,,"No party, No smoking, not for any kinds of smo...",,,https://a0.muscache.com/im/pictures/26356/8030...,,21994,https://www.airbnb.com/users/show/21994,Aaron,2009-06-17,"San Francisco, California, United States",7 minutes walk to UCSF. 15 minutes walk to US...,,,,f,https://a0.muscache.com/im/users/21994/profile...,https://a0.muscache.com/im/users/21994/profile...,Cole Valley,10,10,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"San Francisco, CA, United States",Cole Valley,Haight Ashbury,...,1.0,Real Bed,"{TV,Internet,Wifi,Kitchen,""Free street parking...",,$65.00,$485.00,"$1,685.00",$200.00,$50.00,1,$12.00,32,60,13 months ago,t,30,60,90,365,2018-12-06,17,2009-08-31,2016-11-21,85.0,8.0,8.0,9.0,9.0,9.0,8.0,t,,"{""SAN FRANCISCO""}",f,f,strict_14_with_grace_period,f,f,9,0.15
3,8142,https://www.airbnb.com/rooms/8142,20181206023014,2018-12-06,Friendly Room Apt. Style -UCSF/USF - San Franc...,Nice and good public transportation. 7 minute...,Room rental Sunny view Rm/Wi-Fi/TV/sink/large ...,Nice and good public transportation. 7 minute...,none,,Please email your picture id with print name (...,"N Juda Muni, Bus and UCSF Shuttle. small shopp...",,,no pet no smoke no party inside the building,,,https://a0.muscache.com/im/pictures/27832/3b1f...,,21994,https://www.airbnb.com/users/show/21994,Aaron,2009-06-17,"San Francisco, California, United States",7 minutes walk to UCSF. 15 minutes walk to US...,,,,f,https://a0.muscache.com/im/users/21994/profile...,https://a0.muscache.com/im/users/21994/profile...,Cole Valley,10,10,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"San Francisco, CA, United States",Cole Valley,Haight Ashbury,...,1.0,Real Bed,"{TV,Internet,Wifi,Kitchen,""Free street parking...",,$65.00,$490.00,"$1,685.00",$200.00,$50.00,1,$12.00,32,90,13 months ago,t,30,60,90,365,2018-12-06,8,2014-09-08,2018-09-12,93.0,9.0,9.0,10.0,10.0,9.0,9.0,t,,"{""SAN FRANCISCO""}",f,f,strict_14_with_grace_period,f,f,9,0.15
4,8339,https://www.airbnb.com/rooms/8339,20181206023014,2018-12-06,Historic Alamo Square Victorian,Pls email before booking. Interior featured i...,Please send us a quick message before booking ...,Pls email before booking. Interior featured i...,none,,tax ID on file tax ID on file,,Guests have access to everything listed and sh...,,House Manual and House Rules will be provided ...,,,https://a0.muscache.com/im/pictures/6f84a7c2-e...,,24215,https://www.airbnb.com/users/show/24215,Rosy,2009-07-02,"San Francisco, California, United States",Always searching for a perfect piece at Europe...,within a few hours,100%,,f,https://a0.muscache.com/im/users/24215/profile...,https://a0.muscache.com/im/users/24215/profile...,Alamo Square,2,2,"['email', 'phone', 'reviews', 'kba']",t,t,"San Francisco, CA, United States",Alamo Square,Western Addition,...,2.0,Real Bed,"{TV,Internet,Wifi,Kitchen,Heating,""Family/kid ...",,$785.00,,,$0.00,$225.00,2,$150.00,7,1125,4 days ago,t,30,60,89,89,2018-12-06,27,2009-09-25,2018-08-11,97.0,10.0,10.0,10.0,10.0,10.0,9.0,t,STR-0000264,"{""SAN FRANCISCO""}",f,f,strict_14_with_grace_period,t,t,2,0.24


How many records are in this DataFrame? How many columns?

In [0]:
rawDF.shape

(7072, 96)

What are those column names?

In [0]:
rawDF.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'access', 'interaction', 'house_rules',
       'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url',
       'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms',

What's the data schema of this dataframe, i.e. the data type for each column? 

In [0]:
rawDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7072 entries, 0 to 7071
Data columns (total 96 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                7072 non-null   int64  
 1   listing_url                       7072 non-null   object 
 2   scrape_id                         7072 non-null   int64  
 3   last_scraped                      7072 non-null   object 
 4   name                              7072 non-null   object 
 5   summary                           6868 non-null   object 
 6   space                             5996 non-null   object 
 7   description                       7052 non-null   object 
 8   experiences_offered               7072 non-null   object 
 9   neighborhood_overview             5147 non-null   object 
 10  notes                             4346 non-null   object 
 11  transit                           5129 non-null   object 
 12  access

For the sake of simplicity, only keep certain columns from this dataset.

In [0]:
columnsToKeep = [
  'host_is_superhost',
  'cancellation_policy',
  'instant_bookable',
  'host_total_listings_count',
  'neighbourhood_cleansed',
  'zipcode',
  'latitude',
  'longitude',
  'property_type',
  'room_type',
  'accommodates',
  'bathrooms',
  'bedrooms',
  'beds',
  'bed_type',
  'minimum_nights',
  'number_of_reviews',
  'review_scores_rating',
  'review_scores_accuracy',
  'review_scores_cleanliness',
  'review_scores_checkin',
  'review_scores_communication',
  'review_scores_location',
  'review_scores_value',
  'price']

baseDF = rawDF[columnsToKeep]
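
A small sketch of the same column selection on a made-up frame. Adding `.copy()` is an optional extra that the notebook does not use; it makes the subset explicitly independent of the original, which can avoid chained-assignment warnings if the subset is modified in place later. The frame and the names `raw`, `base`, and `columns_to_keep` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the raw listings data.
raw = pd.DataFrame({
    "id": [1, 2],
    "room_type": ["Private room", "Entire home/apt"],
    "price": ["$65.00", "$170.00"],
})

columns_to_keep = ["room_type", "price"]

# Select the columns and take an explicit copy of the result.
base = raw[columns_to_keep].copy()

# Modifying the copy leaves the original frame untouched.
base["price"] = base["price"].str.replace("$", "", regex=False)

print(base)
print(raw["price"].tolist())
```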
 
 

In [0]:
#check the first few records in baseDF
baseDF.head()

Unnamed: 0,host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
0,t,moderate,t,1,Western Addition,94117.0,37.76931,-122.433856,Apartment,Entire home/apt,3,1.0,1.0,2.0,Real Bed,1,172,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$170.00
1,f,strict_14_with_grace_period,f,2,Bernal Heights,94110.0,37.745112,-122.421018,Apartment,Entire home/apt,5,1.0,2.0,3.0,Real Bed,30,112,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$235.00
2,f,strict_14_with_grace_period,f,10,Haight Ashbury,94117.0,37.76669,-122.452505,Apartment,Private room,2,4.0,1.0,1.0,Real Bed,32,17,85.0,8.0,8.0,9.0,9.0,9.0,8.0,$65.00
3,f,strict_14_with_grace_period,f,10,Haight Ashbury,94117.0,37.764872,-122.451828,Apartment,Private room,2,4.0,1.0,1.0,Real Bed,32,8,93.0,9.0,9.0,10.0,10.0,9.0,9.0,$65.00
4,f,strict_14_with_grace_period,f,2,Western Addition,94117.0,37.775249,-122.436374,House,Entire home/apt,5,1.5,2.0,2.0,Real Bed,7,27,97.0,10.0,10.0,10.0,10.0,10.0,9.0,$785.00


In [0]:
#get schema and more info 
baseDF.dtypes

host_is_superhost               object
cancellation_policy             object
instant_bookable                object
host_total_listings_count        int64
neighbourhood_cleansed          object
zipcode                        float64
latitude                       float64
longitude                      float64
property_type                   object
room_type                       object
accommodates                     int64
bathrooms                      float64
bedrooms                       float64
beds                           float64
bed_type                        object
minimum_nights                   int64
number_of_reviews                int64
review_scores_rating           float64
review_scores_accuracy         float64
review_scores_cleanliness      float64
review_scores_checkin          float64
review_scores_communication    float64
review_scores_location         float64
review_scores_value            float64
price                           object
dtype: object

## Fixing Data Types

Take a look at the schema above. You'll notice that the `price` field got picked up as object (string). For our task, we need it to be a numeric (float64 type) field.

Let's fix that.

In [0]:
baseDF.price

0         $170.00
1         $235.00
2          $65.00
3          $65.00
4         $785.00
          ...    
7067    $1,800.00
7068       $55.00
7069       $65.00
7070       $50.00
7071      $120.00
Name: price, Length: 7072, dtype: object

In [0]:
#DataFrame.replace() accepts a regex option; check the usage
help(baseDF.replace)

Help on method replace in module pandas.core.frame:

replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad') method of pandas.core.frame.DataFrame instance
    Replace values given in `to_replace` with `value`.
    
    Values of the DataFrame are replaced with other values dynamically.
    This differs from updating with ``.loc`` or ``.iloc``, which require
    you to specify a location to update with some value.
    
    Parameters
    ----------
    to_replace : str, regex, list, dict, Series, int, float, or None
        How to find the values that will be replaced.
    
        * numeric, str or regex:
    
            - numeric: numeric values equal to `to_replace` will be
              replaced with `value`
            - str: string exactly matching `to_replace` will be replaced
              with `value`
            - regex: regexs matching `to_replace` will be replaced with
              `value`
    
        * list of str, regex, or numeric:
 

In [0]:
fixedPriceDF = baseDF.replace({'price': r'[\$]'}, {'price': ''}, regex=True)

In [0]:
#check whether the values in the price field have been fixed
fixedPriceDF.price.head() 

0    170.00
1    235.00
2     65.00
3     65.00
4    785.00
Name: price, dtype: object

In [0]:
#check data types of fixedPriceDF
fixedPriceDF.dtypes 

host_is_superhost               object
cancellation_policy             object
instant_bookable                object
host_total_listings_count        int64
neighbourhood_cleansed          object
zipcode                        float64
latitude                       float64
longitude                      float64
property_type                   object
room_type                       object
accommodates                     int64
bathrooms                      float64
bedrooms                       float64
beds                           float64
bed_type                        object
minimum_nights                   int64
number_of_reviews                int64
review_scores_rating           float64
review_scores_accuracy         float64
review_scores_cleanliness      float64
review_scores_checkin          float64
review_scores_communication    float64
review_scores_location         float64
review_scores_value            float64
price                           object
dtype: object

In [0]:
#fix data type of price using .astype()
fixedPriceDF['price'] = fixedPriceDF['price'].astype('float64') 

ValueError: ignored

Oops! As the error message suggested, at least one value contains a ",". Let's fix that.

In [0]:
#fix price values again
fixedPriceDF = baseDF.replace({'price': r'[\$,]'}, {'price': ''}, regex=True) 

In [0]:
#then try fixing datatype of price  
fixedPriceDF['price'] = fixedPriceDF['price'].astype('float64')  
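
An alternative sketch of the same cleanup, working on a single Series instead of the whole DataFrame. It first uses `pd.to_numeric(..., errors="coerce")` to find the values that still fail to parse after removing only the `$`, then strips both `$` and `,` before converting. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical price strings in the same format as the listings data.
prices = pd.Series(["$170.00", "$65.00", "$1,800.00"], name="price")

# Step 1: strip only "$" and see which values still cannot be parsed.
dollar_only = prices.str.replace("$", "", regex=False)
unparsed = prices[pd.to_numeric(dollar_only, errors="coerce").isna()]
print(unparsed)

# Step 2: strip "$" and "," together, then convert to a numeric dtype.
cleaned = pd.to_numeric(prices.str.replace(r"[$,]", "", regex=True))
print(cleaned)
```

Inspecting the unparsed values first shows why the conversion fails before the regex is widened.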

In [0]:
#check data type
fixedPriceDF.dtypes

host_is_superhost               object
cancellation_policy             object
instant_bookable                object
host_total_listings_count        int64
neighbourhood_cleansed          object
zipcode                        float64
latitude                       float64
longitude                      float64
property_type                   object
room_type                       object
accommodates                     int64
bathrooms                      float64
bedrooms                       float64
beds                           float64
bed_type                        object
minimum_nights                   int64
number_of_reviews                int64
review_scores_rating           float64
review_scores_accuracy         float64
review_scores_cleanliness      float64
review_scores_checkin          float64
review_scores_communication    float64
review_scores_location         float64
review_scores_value            float64
price                          float64
dtype: object

## Summary statistics

Use `.describe()` on a DataFrame to get summary statistics.

In [0]:
#checking the usage of describe()

help(fixedPriceDF.describe)

Help on method describe in module pandas.core.generic:

describe(percentiles=None, include=None, exclude=None) -> ~FrameOrSeries method of pandas.core.frame.DataFrame instance
    Generate descriptive statistics.
    
    Descriptive statistics include those that summarize the central
    tendency, dispersion and shape of a
    dataset's distribution, excluding ``NaN`` values.
    
    Analyzes both numeric and object series, as well
    as ``DataFrame`` column sets of mixed data types. The output
    will vary depending on what is provided. Refer to the notes
    below for more detail.
    
    Parameters
    ----------
    percentiles : list-like of numbers, optional
        The percentiles to include in the output. All should
        fall between 0 and 1. The default is
        ``[.25, .5, .75]``, which returns the 25th, 50th, and
        75th percentiles.
    include : 'all', list-like of dtypes or None (default), optional
        A white list of data types to include in the result

In [0]:
#get the stats of some columns, e.g. 'host_is_superhost', 'host_total_listings_count', 'zipcode', using .describe()

fixedPriceDF[['host_is_superhost','host_total_listings_count','zipcode']].describe() 

Unnamed: 0,host_total_listings_count,zipcode
count,7072.0,6879.0
mean,60.621324,94114.884576
std,221.850955,15.652162
min,0.0,94014.0
25%,1.0,94109.0
50%,2.0,94114.0
75%,7.0,94121.0
max,1475.0,94965.0


By default, describe() only summarizes numerical columns, which is why `host_is_superhost` is missing from the output above. To get the stats of all columns, pass include='all'.

In [0]:
#get the stats of all columns
 
fixedPriceDF.describe(include='all')

Unnamed: 0,host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
count,7072,7072,7072,7072.0,7072,6879.0,7072.0,7072.0,7072,7072,7072.0,7053.0,7071.0,7067.0,7072,7072.0,7072.0,5713.0,5710.0,5710.0,5709.0,5712.0,5709.0,5708.0,7072.0
unique,2,6,2,,36,,,,24,3,,,,,5,,,,,,,,,,
top,f,strict_14_with_grace_period,f,,Mission,,,,Apartment,Entire home/apt,,,,,Real Bed,,,,,,,,,,
freq,4443,3182,4393,,725,,,,3080,4366,,,,,6999,,,,,,,,,,
mean,,,,60.621324,,94114.884576,37.76595,-122.430704,,,3.184389,1.328584,1.346768,1.741616,,14156.35,43.247031,95.565902,9.761121,9.625044,9.867402,9.83736,9.613593,9.401542,212.99392
std,,,,221.850955,,15.652162,0.022484,0.026684,,,1.902724,0.743306,0.913959,1.1633,,1189129.0,71.176549,6.963257,0.678495,0.765363,0.506203,0.599992,0.744,0.81789,333.335336
min,,,,0.0,,94014.0,37.705088,-122.513065,,,1.0,0.0,0.0,0.0,,1.0,0.0,20.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0
25%,,,,1.0,,94109.0,37.751316,-122.443053,,,2.0,1.0,1.0,1.0,,2.0,1.0,94.0,10.0,9.0,10.0,10.0,9.0,9.0,100.0
50%,,,,2.0,,94114.0,37.767871,-122.425435,,,2.0,1.0,1.0,1.0,,4.0,12.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0
75%,,,,7.0,,94121.0,37.784606,-122.411558,,,4.0,1.5,2.0,2.0,,30.0,54.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,232.25


As you can see, the count differs across columns, which means some columns contain null values.
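
Rather than comparing counts by eye, the nulls per column can be counted directly. A minimal sketch on a made-up frame (the data and the names `df` and `null_counts` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values.
df = pd.DataFrame({
    "zipcode": [94117.0, np.nan, 94110.0, 94114.0],
    "bathrooms": [1.0, 1.5, np.nan, np.nan],
    "price": [170.0, 235.0, 65.0, 785.0],
})

# isna() marks each missing cell; sum() counts them per column.
null_counts = df.isna().sum()

# Show only the columns that actually contain nulls.
print(null_counts[null_counts > 0])
```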

## Nulls

There are a lot of different ways to handle null values.  

Some ways to handle nulls:
* Drop any records that contain nulls
* Numeric:
  * Replace them with mean/median/zero/etc.
* Categorical:
  * Replace them with the mode, i.e. the most frequently observed data value
  * Create a special category for null
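
The options listed above can be sketched on a toy frame as follows. The data is made up, and the `"Unknown"` label for the special category is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one numeric and one categorical column with nulls.
df = pd.DataFrame({
    "beds": [1.0, 2.0, np.nan, 5.0],
    "bed_type": ["Real Bed", None, "Real Bed", "Futon"],
})

# 1. Drop any record that contains a null in any column.
dropped = df.dropna()

# 2. Numeric: replace nulls with the column median.
median_filled = df["beds"].fillna(df["beds"].median())

# 3. Categorical: replace nulls with the mode (most frequent value).
mode_filled = df["bed_type"].fillna(df["bed_type"].mode()[0])

# 4. Categorical: replace nulls with a special category.
flagged = df["bed_type"].fillna("Unknown")

print(len(dropped))
print(median_filled.tolist())
print(mode_filled.tolist())
print(flagged.tolist())
```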
 
  

There are a few nulls in the categorical feature `zipcode`. Let's drop the rows where that column is null; this is the simplest approach for the time being.
 

In [0]:
#.dropna() can be used, checking usage
help(fixedPriceDF.dropna)

Help on method dropna in module pandas.core.frame:

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. versionchanged:: 1.0.0
    
           Pass tuple or list to drop on multiple axes.
           Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
        * 'any' : If any NA values are present, dro

In [0]:
#drop rows that have a null in `zipcode`
noNullsDF = fixedPriceDF.dropna(subset=['zipcode'])

In [0]:
#checking stats after dropna ..
noNullsDF.describe()

Unnamed: 0,host_total_listings_count,zipcode,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
count,6879.0,6879.0,6879.0,6879.0,6879.0,6860.0,6878.0,6874.0,6879.0,6879.0,5637.0,5634.0,5634.0,5633.0,5636.0,5633.0,5632.0,6879.0
mean,48.828754,94114.884576,37.76563,-122.431046,3.182149,1.326968,1.346612,1.747163,14552.74,44.339148,95.557921,9.760561,9.623536,9.866146,9.83978,9.615125,9.406072,213.863934
std,210.523399,15.652162,0.022571,0.0268,1.909663,0.74497,0.911025,1.167019,1205694.0,71.786236,6.950989,0.68043,0.767311,0.508988,0.588008,0.740421,0.808663,337.785933
min,0.0,94014.0,37.705088,-122.513065,1.0,0.0,0.0,0.0,1.0,0.0,20.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0
25%,1.0,94109.0,37.750935,-122.443491,2.0,1.0,1.0,1.0,2.0,2.0,94.0,10.0,9.0,10.0,10.0,9.0,9.0,100.0
50%,2.0,94114.0,37.767096,-122.425715,2.0,1.0,1.0,1.0,3.0,13.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0
75%,6.0,94121.0,37.784318,-122.411701,4.0,1.5,2.0,2.0,30.0,56.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,234.0
max,1475.0,94965.0,37.810306,-122.370427,16.0,10.0,7.0,14.0,100000000.0,649.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,9999.0


Some rows were dropped, but some columns still have a different number of non-null records.

In [0]:
#checking more info  ..
noNullsDF.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6879 entries, 0 to 7071
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   host_is_superhost            6879 non-null   object 
 1   cancellation_policy          6879 non-null   object 
 2   instant_bookable             6879 non-null   object 
 3   host_total_listings_count    6879 non-null   int64  
 4   neighbourhood_cleansed       6879 non-null   object 
 5   zipcode                      6879 non-null   float64
 6   latitude                     6879 non-null   float64
 7   longitude                    6879 non-null   float64
 8   property_type                6879 non-null   object 
 9   room_type                    6879 non-null   object 
 10  accommodates                 6879 non-null   int64  
 11  bathrooms                    6860 non-null   float64
 12  bedrooms                     6878 non-null   float64
 13  beds              


Now let's try imputation for numerical features. We want to fill the nulls in some numerical features with the median of that column. 


In [0]:
imputeCols = [
  "bedrooms",
  "bathrooms",
  "beds",
  "review_scores_rating",
  "review_scores_accuracy",
  "review_scores_cleanliness",
  "review_scores_checkin",
  "review_scores_communication",
  "review_scores_location",
  "review_scores_value"
]
 

Fill any nulls in those columns with the median value of the column where the null is located.

In [0]:
# .fillna() can be used, checking usage
help(noNullsDF.fillna)

In [0]:
#call .fillna() with the medians of the selected columns only
imputedDF = noNullsDF.fillna(noNullsDF[imputeCols].median())
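
A minimal sketch of how `fillna` behaves when it is given a Series of medians: each value is matched to a column by name, and columns that are not in the Series are left alone. The toy data and the names `impute_cols`, `medians`, and `imputed` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; zipcode is deliberately left out of the imputation.
df = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0],
    "bathrooms": [1.0, 1.5, np.nan],
    "zipcode": [94117.0, np.nan, 94110.0],
    "room_type": ["Private room", "Shared room", "Private room"],
})

impute_cols = ["beds", "bathrooms"]

# Compute medians on the numeric columns of interest only.
medians = df[impute_cols].median()

# fillna aligns the Series index with the column names.
imputed = df.fillna(medians)

print(medians)
print(imputed)
```

Computing the median on the selected columns first keeps non-numeric columns out of the calculation altogether.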

In [0]:
#let's check info on imputedDF
imputedDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6879 entries, 0 to 7071
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   host_is_superhost            6879 non-null   object 
 1   cancellation_policy          6879 non-null   object 
 2   instant_bookable             6879 non-null   object 
 3   host_total_listings_count    6879 non-null   int64  
 4   neighbourhood_cleansed       6879 non-null   object 
 5   zipcode                      6879 non-null   float64
 6   latitude                     6879 non-null   float64
 7   longitude                    6879 non-null   float64
 8   property_type                6879 non-null   object 
 9   room_type                    6879 non-null   object 
 10  accommodates                 6879 non-null   int64  
 11  bathrooms                    6879 non-null   float64
 12  bedrooms                     6879 non-null   float64
 13  beds              

In [0]:
#check the summary stats on imputedDF
imputedDF.describe(include='all')

## Getting rid of extreme values

Let's take a look at the *min* and *max* values of the `price` column:

In [0]:
imputedDF['price'].describe()

count    6879.000000
mean      213.863934
std       337.785933
min         0.000000
25%       100.000000
50%       150.000000
75%       234.000000
max      9999.000000
Name: price, dtype: float64

There are some super-expensive listings, but it's the data scientist's job to decide what to do with them. We can certainly filter out the "free" Airbnbs, though.

Let's see first how many listings we can find where the *price* is zero.

In [0]:
#.query() can be used to get a dataframe with the records that match a given condition, checking usage
help(imputedDF.query)

In [0]:
#find number of listings that have price as zero 
imputedDF.query('price==0').price.count() 

1

Now only keep rows with a strictly positive *price*.

In [0]:
#we can again use .query() to get a dataframe that only has positive prices
posPricesDF = imputedDF.query('price > 0') 
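
For comparison, here is a short sketch showing that `.query()` and a boolean mask select the same rows, and that `.query()` can reference a Python variable with the `@` prefix. The toy data and the `min_price` threshold are made up for illustration.

```python
import pandas as pd

# Hypothetical listings, including one "free" listing.
df = pd.DataFrame({
    "price": [0.0, 10.0, 150.0],
    "minimum_nights": [1, 30, 2],
})

# Two equivalent ways to keep strictly positive prices.
by_query = df.query("price > 0")
by_mask = df[df["price"] > 0]
print(by_query.equals(by_mask))

# Conditions can be combined, and @ pulls in a local Python variable.
min_price = 50
above = df.query("price > @min_price and minimum_nights <= 30")
print(above)
```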

In [0]:
#check the stats of price in posPricesDF
posPricesDF.describe()

Unnamed: 0,host_total_listings_count,zipcode,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
count,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0,6878.0
mean,48.835272,94114.885286,37.765633,-122.431047,3.182466,1.326112,1.346612,1.746729,14554.85,44.344432,95.998982,9.803867,9.691625,9.890375,9.868712,9.684937,9.513812,213.895028
std,210.538011,15.653189,0.022571,0.026802,1.90962,0.744182,0.911025,1.166819,1205782.0,71.790118,6.362367,0.622679,0.70941,0.463491,0.535825,0.686209,0.76667,337.800646
min,0.0,94014.0,37.705088,-122.513065,1.0,0.0,0.0,0.0,1.0,0.0,20.0,2.0,2.0,2.0,2.0,2.0,2.0,10.0
25%,1.0,94109.0,37.750945,-122.443496,2.0,1.0,1.0,1.0,2.0,2.0,95.0,10.0,10.0,10.0,10.0,10.0,9.0,100.0
50%,2.0,94114.0,37.767099,-122.425717,2.0,1.0,1.0,1.0,3.0,13.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0
75%,6.0,94121.0,37.784319,-122.411701,4.0,1.5,2.0,2.0,30.0,56.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,234.0
max,1475.0,94965.0,37.810306,-122.370427,16.0,10.0,7.0,14.0,100000000.0,649.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,9999.0


Let's take a look at the *min* and *max* values of the *minimum_nights* column:

In [0]:
#checking stats of a single column 'minimum_nights'
posPricesDF[['minimum_nights']].describe() 

Unnamed: 0,minimum_nights
count,6878.0
mean,14554.85
std,1205782.0
min,1.0
25%,2.0
50%,3.0
75%,30.0
max,100000000.0


Let's take a look at the distribution of the number of records by `minimum_nights`.

In [0]:
#use .value_counts() on a column
posPricesDF.minimum_nights.value_counts()

30           2560
2            1397
1            1201
3             853
4             280
5             185
31            126
7              78
60             35
32             25
180            25
6              25
90             22
14              7
120             7
45              6
365             5
50              3
10              3
40              2
12              2
35              2
183             2
21              2
80              1
188             1
140             1
55              1
18              1
75              1
100000000       1
179             1
28              1
8               1
200             1
170             1
360             1
1000            1
62              1
9               1
13              1
17              1
25              1
29              1
85              1
185             1
1125            1
58              1
999             1
Name: minimum_nights, dtype: int64
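
Because `value_counts()` orders its result by frequency, the extreme values are scattered through the output above. A minimal sketch, on made-up toy values rather than the Airbnb data, of sorting the counts by the value itself so that the largest entries land at the end:

```python
import pandas as pd

# Toy stand-in for the minimum_nights column (values are made up for illustration)
nights = pd.Series([30, 2, 1, 2, 30, 100000000, 3, 30], name="minimum_nights")

# value_counts() orders by frequency; sort_index() reorders by the value itself
counts_by_value = nights.value_counts().sort_index()
print(counts_by_value)

# The largest values now sit at the end of the sorted result
print(counts_by_value.tail(2))
```

Applied to the notebook's data, the same idea would be `posPricesDF.minimum_nights.value_counts().sort_index()`.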

A minimum stay of one year seems to be a reasonable limit here. Let's filter out those records where *minimum_nights* is greater than 365:

In [0]:
#again, using .query()
cleanDF =  posPricesDF.query('minimum_nights<=365')

In [0]:
#check the stats of minimum_nights in cleanDF
cleanDF[['minimum_nights']].describe()

Unnamed: 0,minimum_nights
count,6874.0
mean,15.298225
std,21.573598
min,1.0
25%,2.0
50%,3.0
75%,30.0
max,365.0
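
The count drops from 6878 to 6874, so the filter removed four rows. As a small sketch on made-up toy data (not the Airbnb file), the same condition can also be written as a boolean mask, and the number of removed rows can be checked directly:

```python
import pandas as pd

# Toy data; the column name matches the notebook, the values are made up
df = pd.DataFrame({"minimum_nights": [1, 30, 365, 999, 100000000]})

# .query() takes the condition as a string...
by_query = df.query("minimum_nights <= 365")

# ...while a boolean mask expresses the same condition in plain Python
by_mask = df[df["minimum_nights"] <= 365]

# Both approaches keep the same rows
assert by_query.equals(by_mask)

# Number of rows removed by the filter
removed = len(df) - len(by_query)
print(removed)
```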


In [0]:
cleanDF.describe(include='all')

Unnamed: 0,host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
count,6874,6874,6874,6874.0,6874,6874.0,6874.0,6874.0,6874,6874,6874.0,6874.0,6874.0,6874.0,6874,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0
unique,2,6,2,,36,,,,24,3,,,,,5,,,,,,,,,,
top,f,strict_14_with_grace_period,f,,Mission,,,,Apartment,Entire home/apt,,,,,Real Bed,,,,,,,,,,
freq,4415,3164,4203,,710,,,,2934,4184,,,,,6801,,,,,,,,,,
mean,,,,48.862671,,94114.888711,37.765632,-122.431055,,,3.182427,1.326157,1.346814,1.747018,,15.298225,44.366308,95.998982,9.803899,9.691592,9.890311,9.868635,9.684754,9.513675,213.506255
std,,,,210.5962,,15.657071,0.022575,0.026807,,,1.909766,0.744323,0.911251,1.16705,,21.573598,71.805219,6.363578,0.622772,0.709538,0.463618,0.535972,0.686366,0.766801,335.965629
min,,,,0.0,,94014.0,37.705088,-122.513065,,,1.0,0.0,0.0,0.0,,1.0,0.0,20.0,2.0,2.0,2.0,2.0,2.0,2.0,10.0
25%,,,,1.0,,94109.0,37.750945,-122.443526,,,2.0,1.0,1.0,1.0,,2.0,2.0,95.0,10.0,10.0,10.0,10.0,10.0,9.0,100.0
50%,,,,2.0,,94114.0,37.767105,-122.425743,,,2.0,1.0,1.0,1.0,,3.0,13.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0
75%,,,,6.0,,94121.0,37.784319,-122.411701,,,4.0,1.5,2.0,2.0,,30.0,56.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,234.0


In [0]:
cleanDF.price.value_counts()

150.0    284
120.0    186
200.0    185
100.0    183
250.0    175
        ... 
324.0      1
752.0      1
376.0      1
459.0      1
332.0      1
Name: price, Length: 461, dtype: int64

Convert price to a categorical column by binning it

In [42]:

labels = ["Cheap","Moderate","Expensive"]
cleanDF['binned_price'] = pd.cut(cleanDF['price'], bins=3,  labels=labels)
cleanDF.binned_price.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Cheap        6864
Expensive       6
Moderate        4
Name: binned_price, dtype: int64
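
Two things stand out in this output. The message is pandas' `SettingWithCopyWarning`, raised because `cleanDF` was derived from another DataFrame by filtering. And because `pd.cut` with `bins=3` splits the price *range* into three equal-width intervals, a handful of very high prices pushes almost every listing into the first bin. The sketch below uses made-up prices (an assumption, not the Airbnb data) to show an explicit `.copy()` before adding a column, and to contrast `pd.cut` with the quantile-based `pd.qcut`:

```python
import pandas as pd

# Toy prices with one extreme value (made up for illustration)
source = pd.DataFrame({"price": [50.0, 80.0, 100.0, 150.0, 200.0, 250.0, 9000.0]})

# An explicit .copy() makes the filtered frame independent of its source,
# so adding a column does not trigger SettingWithCopyWarning
df = source.query("price > 0").copy()

labels = ["Cheap", "Moderate", "Expensive"]

# pd.cut with an integer splits the price range into equal-width bins
df["equal_width"] = pd.cut(df["price"], bins=3, labels=labels)

# pd.qcut splits at quantiles, so each bin holds roughly the same number of rows
df["quantile"] = pd.qcut(df["price"], q=3, labels=labels)

print(df["equal_width"].value_counts())
print(df["quantile"].value_counts())
```

Which binning is appropriate depends on the goal: equal-width bins keep the price scale interpretable, while quantile bins give balanced categories.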

In [43]:
cleanDF.query('binned_price=="Expensive"')

Unnamed: 0,host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,binned_price
1364,f,strict_14_with_grace_period,t,1,Nob Hill,94133.0,37.797069,-122.410507,Condominium,Entire home/apt,5,2.0,3.0,4.0,Real Bed,30,2,100.0,10.0,10.0,10.0,10.0,10.0,8.0,9000.0,Expensive
3145,t,moderate,t,1,Parkside,94116.0,37.742552,-122.479172,House,Private room,2,1.0,1.0,1.0,Real Bed,1,87,97.0,10.0,10.0,10.0,10.0,10.0,10.0,8000.0,Expensive
5113,f,super_strict_60,t,419,Pacific Heights,94109.0,37.795744,-122.425657,Apartment,Entire home/apt,6,3.5,3.0,4.0,Real Bed,1,0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,8000.0,Expensive
5515,f,strict_14_with_grace_period,t,5,Western Addition,94115.0,37.78023,-122.440461,Condominium,Entire home/apt,14,3.0,6.0,7.0,Real Bed,2,5,100.0,10.0,10.0,10.0,10.0,10.0,10.0,8000.0,Expensive
6734,f,flexible,t,1,Russian Hill,94109.0,37.798163,-122.419621,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,30,0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,9999.0,Expensive
6735,f,strict_14_with_grace_period,t,33,Diamond Heights,94131.0,37.739226,-122.436527,House,Entire home/apt,4,2.0,3.0,3.0,Real Bed,30,0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,9270.0,Expensive


Convert room_type to numeric using one hot encoding

In [45]:
# List unique room types and their count
cleanDF.room_type.value_counts()

Entire home/apt    4184
Private room       2509
Shared room         181
Name: room_type, dtype: int64

In [46]:
# Convert room_type to numeric using one hot encoding
pd.get_dummies(cleanDF.room_type, prefix='room_type')

# use pd.concat to join the new columns with your original dataframe
cleanDF = pd.concat([cleanDF,pd.get_dummies(cleanDF['room_type'], prefix='room_type', drop_first=True)],axis=1)

# now drop the original 'room_type' column (you don't need it anymore)
cleanDF.drop(['room_type'],axis=1, inplace=True)

cleanDF.head()

Unnamed: 0,host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,binned_price,room_type_Private room,room_type_Shared room
0,t,moderate,t,1,Western Addition,94117.0,37.76931,-122.433856,Apartment,3,1.0,1.0,2.0,Real Bed,1,172,97.0,10.0,10.0,10.0,10.0,10.0,10.0,170.0,Cheap,0,0
1,f,strict_14_with_grace_period,f,2,Bernal Heights,94110.0,37.745112,-122.421018,Apartment,5,1.0,2.0,3.0,Real Bed,30,112,98.0,10.0,10.0,10.0,10.0,10.0,9.0,235.0,Cheap,0,0
2,f,strict_14_with_grace_period,f,10,Haight Ashbury,94117.0,37.76669,-122.452505,Apartment,2,4.0,1.0,1.0,Real Bed,32,17,85.0,8.0,8.0,9.0,9.0,9.0,8.0,65.0,Cheap,1,0
3,f,strict_14_with_grace_period,f,10,Haight Ashbury,94117.0,37.764872,-122.451828,Apartment,2,4.0,1.0,1.0,Real Bed,32,8,93.0,9.0,9.0,10.0,10.0,9.0,9.0,65.0,Cheap,1,0
4,f,strict_14_with_grace_period,f,2,Western Addition,94117.0,37.775249,-122.436374,House,5,1.5,2.0,2.0,Real Bed,7,27,97.0,10.0,10.0,10.0,10.0,10.0,9.0,785.0,Cheap,0,0
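
Only two dummy columns appear above because `drop_first=True` omits the first category, 'Entire home/apt': a row with zeros in both remaining columns implies it. A minimal sketch on a made-up toy frame showing that `pd.get_dummies` can also encode and drop the original column in one step through its `columns=` argument:

```python
import pandas as pd

# Toy frame using the three room types seen above (prices are made up)
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "price": [200.0, 90.0, 40.0, 110.0],
})

# columns= encodes the named column and removes the original in one step;
# drop_first=True omits the first category ('Entire home/apt')
encoded = pd.get_dummies(df, columns=["room_type"], prefix="room_type", drop_first=True)

print(encoded.columns.tolist())
```

Depending on the pandas version, the dummy columns hold 0/1 integers or booleans.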


## Saving Cleaned Data back to Disk

OK, our data is cleansed now. Let's save this DataFrame to a file so that we can start building models with it.

In [0]:
outputPath = "airbnb-cleaned.csv"
cleanDF.to_csv(outputPath)  
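
By default, `to_csv` also writes the row index as an unnamed first column, which comes back as 'Unnamed: 0' when the file is read again. A small sketch, assuming the index carries no information worth keeping, that writes a made-up toy frame without the index and reads it back (the file name is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for cleanDF (values are made up)
df = pd.DataFrame({"price": [170.0, 235.0], "minimum_nights": [1, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example-cleaned.csv")  # hypothetical file name

    # index=False leaves the row index out of the file
    df.to_csv(path, index=False)

    # Reading the file back gives the same columns, with no extra 'Unnamed: 0'
    restored = pd.read_csv(path)

print(restored.columns.tolist())
```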

In [48]:
%ls

airbnb-cleaned.csv  airbnb-listings.csv  sample_data/


Summary
--
What we have done in this notebook:

* Fixed data type for the column 'price' 
    * using .replace() with a regex to fix values
    * using .astype() to fix data type
* Got rid of null values
    * using .dropna() to drop rows where 'zipcode' is null
    * using .fillna() to impute columns that have nulls with the median of that column
* Identified & removed some outliers
    * using .query() to drop records with 'price' as 0
    * using .query() to drop records with 'minimum_nights' greater than 365 days

We have also learned:

* Where to find the API docs 
* Checking dataframe attributes
    * using .shape to check dataframe shape
    * using .dtypes to check column data types
    * using .columns to check column names
* Using help() to get the usage info of a function/method  
* Using .info() to check the schema of a dataframe  
* Using describe() to get summary statistics of dataframe  
  
Now we have a cleaned dataset ready for trying some machine learning tasks.   

 