In [1]:
import pandas

In [4]:
df = pandas.read_csv('http://bit.ly/airbnbcsv')

Create a data frame that is only the rooms with minimum nights > 7

First, create a series of booleans that we can use to pull the right rows later


In [10]:
minimum_nights_bools = df.minimum_nights > 7

minimum_nights_bools

0        False
1        False
2        False
3        False
4         True
         ...  
48890    False
48891    False
48892     True
48893    False
48894    False
Name: minimum_nights, Length: 48895, dtype: bool

Use the boolean list to create a new data set that is a subset of the original

In [13]:
minimum_nights_df = df[minimum_nights_bools]

How many rows does the new data frame have?

In [56]:
len(minimum_nights_df)

7333

OK, now let's narrow it down to only Brooklyn. We'll use the neighbourhood_group column for that.

In [61]:
brooklyn_bools = minimum_nights_df.neighbourhood_group == "Brooklyn"

brooklyn_df = minimum_nights_df[brooklyn_bools]

brooklyn_df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
6,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,45,49,2017-10-05,0.40,1,0
36,11452,Clean and Quiet in Brooklyn,7355,Vt,Brooklyn,Bedford-Stuyvesant,40.68876,-73.94312,Private room,35,60,0,,,1,365
45,12627,Entire apartment in central Brooklyn neighborh...,49670,Rana,Brooklyn,Prospect-Lefferts Gardens,40.65944,-73.96238,Entire home/apt,150,29,11,2019-06-05,0.49,1,95
55,14377,Williamsburg 1 bedroom Apartment,56512,Joanna,Brooklyn,Williamsburg,40.70881,-73.95930,Entire home/apt,150,30,105,2019-06-22,0.90,1,30
63,16326,Comfortable 4-bedroom apt in family house.,63588,Dimitri,Brooklyn,Prospect Heights,40.67811,-73.96428,Entire home/apt,200,30,143,2019-01-26,1.33,2,297
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48772,36424129,Bright & Cheerful! Modern w/ Laundry + Fast WiFi,107407045,Brett & Megan,Brooklyn,Bedford-Stuyvesant,40.69217,-73.93191,Private room,35,30,0,,,1,44
48794,36428255,Skyscraper Ultimate Luxury at the Heart of BKLYN.,148289089,Elmar,Brooklyn,Boerum Hill,40.68780,-73.98145,Entire home/apt,235,10,0,,,1,64
48843,36453642,"☆ HUGE, SUNLIT Room - 3 min walk from Train !",53966115,Nora,Brooklyn,Bedford-Stuyvesant,40.69635,-73.93743,Private room,45,29,0,,,2,341
48879,36480292,Gorgeous 1.5 Bdr with a private yard- Williams...,540335,Lee,Brooklyn,Williamsburg,40.71728,-73.94394,Entire home/apt,120,20,0,,,1,22


So how many listings have a minimum nights of more than 7 and are also in Brooklyn?

In [63]:
len(brooklyn_df)

2474

The basic procedure here is to first create the data frame of minimum nights. Then we treat that minimum nights data frame as our main data frame and repeat the steps, this time indexing for the borough. The trick is to not go back to the original df for the second indexing, and to not skip steps along the way. That is harder than it sounds: when you have a bunch of variables defined and are taking a lot of steps, you can forget where you are in the process.
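One way to avoid losing track of intermediate data frames is to combine both conditions into a single boolean list with the `&` operator. The sketch below is not part of the original walkthrough; `toy_df` and its rows are made up for illustration, with only the column names borrowed from the Airbnb data.

```python
import pandas

# Toy stand-in for the Airbnb data; these rows are invented for illustration.
toy_df = pandas.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn", "Queens"],
    "minimum_nights": [30, 10, 2, 45],
})

# Two-step approach, as above: filter, then filter the filtered result.
long_stay_df = toy_df[toy_df.minimum_nights > 7]
two_step_df = long_stay_df[long_stay_df.neighbourhood_group == "Brooklyn"]

# One-step approach: & is an element-wise "and" of two boolean series.
# Each condition needs its own parentheses because & binds tighter than > and ==.
both_bools = (toy_df.minimum_nights > 7) & (toy_df.neighbourhood_group == "Brooklyn")
one_step_df = toy_df[both_bools]

print(len(two_step_df), len(one_step_df))
```

Both approaches should select the same rows. The two-step version is easier to inspect along the way, while the one-step version leaves fewer variables to keep straight.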

# Indexing by string matching

Instead of using a conditional statement to create the boolean list, we can use special methods that work with strings to create a boolean list of the items that match a string.

First, we get a column as a series. Then we use the .str accessor to show that we're going to work with the values as strings. Then we can use the .contains() method to check for the presence of a specific string.

Let's look for listings that contain the word "spacious." I got the idea from this article I found on Google:

https://bnbfacts.com/the-most-overused-adjectives-in-airbnb-listing-names/


In [18]:
spacious_bools = df.name.str.contains('spacious')

spacious_bools

0        False
1        False
2        False
3        False
4        False
         ...  
48890    False
48891    False
48892    False
48893    False
48894    False
Name: name, Length: 48895, dtype: object

Now we use the boolean list to create a new data frame and save it to a variable.

There's a problem with our usual method, though. Some values in the df.name column are not strings; they are missing (they contain nothing). So we have to clean up our boolean list before we can use it.

You can see this is the case with the below method, which shows what values are in the series. The "nan" is causing the problem.

In [23]:
spacious_bools.unique()

array([False, True, nan], dtype=object)

The below is a trick to change the nan values to False values. It basically means "make False anything in this list that is not equal to True."

In [22]:
spacious_bools_cleaned = spacious_bools == True
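As an alternative to the `== True` trick, `.str.contains()` accepts an `na` argument that sets the result for missing values directly. Here is a small sketch on a made-up series (`names` is hypothetical, not the real column) showing that the two approaches agree:

```python
import pandas

# Made-up names for illustration; None stands in for a missing listing name.
names = pandas.Series(["Bright and spacious loft", None, "Cozy studio"])

# How many names are missing?
print(names.isna().sum())

# The trick from above: anything that is not True becomes False.
cleaned = names.str.contains("spacious") == True

# Alternative: ask contains() to fill missing values with False itself.
no_nan = names.str.contains("spacious", na=False)

print(cleaned.tolist())
print(no_nan.tolist())
```

Either version gives a boolean list with no missing values, so it can be used for indexing.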

Let's just check that our boolean list doesn't have any nan values now.

In [24]:
spacious_bools_cleaned.unique()

array([False,  True])

With that annoying step out of the way, let's do what we wanted, which is to create a new data set of just the listings that contain the word "spacious."

In [29]:
spatious_df = df[spacious_bools_cleaned]

spatious_df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
239,60794,"Bright and spacious, garden below!",293394,Rachel,Manhattan,Upper West Side,40.80021,-73.96071,Entire home/apt,195,4,4,2017-08-25,0.04,1,0
299,68974,Unique spacious loft on the Bowery,281229,Alicia,Manhattan,Little Italy,40.71943,-73.99627,Entire home/apt,575,2,191,2019-06-20,1.88,1,298
408,135393,"Private, spacious room in Brooklyn",663764,Karen,Brooklyn,East Flatbush,40.65100,-73.94886,Private room,50,2,263,2019-06-24,2.69,2,136
493,173151,spacious studio,826459,Jane,Brooklyn,Greenpoint,40.72901,-73.95812,Private room,91,3,241,2019-06-24,2.49,1,287
548,202273,Cozy and spacious - rare for NYC!,918087,Kestrel,Brooklyn,Bedford-Stuyvesant,40.68812,-73.94934,Private room,67,4,72,2016-12-26,0.76,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48417,36243240,spacious private room #3,35783912,Pi & Leo,Bronx,Fordham,40.86263,-73.89088,Private room,33,2,0,,,8,84
48485,36280357,"Bright and sunny top floor, spacious apartment",49037454,Nicole,Manhattan,Chelsea,40.74218,-73.99813,Entire home/apt,150,5,0,,,1,4
48491,36281984,No-frills master bedroom in spacious Bushwick ...,45909314,Andrew,Brooklyn,Bushwick,40.69807,-73.93466,Private room,55,1,0,,,1,14
48518,36307792,Private room in spacious East Village oasis,41698272,Rachel,Manhattan,East Village,40.72797,-73.98155,Private room,72,3,0,,,1,170


How many listings have the word "spacious"?

In [33]:
len(spatious_df)

582

Get a random sample of the names of the listings. We can pull the "name" column, and every name will have the word spacious in it, since our new data frame was created as a subset of the data where the name column contains the word spacious.

In [34]:
spatious_df.name.sample(10)

44122       Clean and spacious room in charming Greenpoint
42875    Bright, spacious Brooklyn Loft with large terrace
48349            PVT spacious room in queens near airports
16398      Sunny and spacious 1 bedroom Brooklyn apartment
13084                           cute spacious bk apartment
42040    Your bright, spacious & central WILLIAMSBURG h...
3093     Cool and spacious room-Approx 150 yard from LT...
8887                       Bright spacious 1BR on the park
5871                      Cozy/spacious 3bd, room for rent
27348    Beautiful room in spacious apartment in Manhattan
Name: name, dtype: object

For fun, let's check if spacious rooms are more expensive on average.

In [35]:
df.price.mean()

152.7206871868289

In [38]:
spatious_df.price.mean()

118.91580756013745

In [39]:
df.price.median()

106.0

In [42]:
spatious_df.price.median()

95.5

Spacious rooms are actually cheaper on average, by both the mean and the median. This might be for the reason outlined in the article I got the idea from. Basically, the really fancy rooms are named something like "The Coral Reef," not "Spacious Lower East Side Room" or whatever. That's my guess, anyway.
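The comparison above sets the spacious listings against the whole data set, which still includes the spacious listings themselves. A slightly cleaner comparison is spacious versus everything else, using `~` to flip the boolean list. This is only a sketch: `toy_df`, its names, and its prices are invented for illustration.

```python
import pandas

# Made-up listings for illustration only; not rows from the Airbnb data.
toy_df = pandas.DataFrame({
    "name": ["spacious loft", "The Coral Reef", "cozy spacious room", None, "studio near park"],
    "price": [100, 200, 80, 50, 110],
})

# Same cleanup trick as above: anything that is not True becomes False.
is_spacious = toy_df.name.str.contains("spacious") == True

# ~ inverts a boolean series, so this selects every row that is NOT spacious.
spacious_prices = toy_df[is_spacious].price
other_prices = toy_df[~is_spacious].price

print(spacious_prices.mean(), other_prices.mean())
print(spacious_prices.median(), other_prices.median())
```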

One last thing: the .contains() method is case sensitive. That means there are a number of spacious rooms that didn't get picked up because they used a capital letter. One way to fix this is to make the name column all lower case before checking.

In [53]:
all_spatious_bools = df.name.str.lower().str.contains('spacious')

all_spatious_bools_cleaned = all_spatious_bools == True

all_spatious_df = df[all_spatious_bools_cleaned]
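Another option, instead of lower-casing the column first, is the `case=False` argument of `.str.contains()`. The sketch below uses a made-up `names` series to show that the two approaches should give the same boolean list:

```python
import pandas

# Made-up names for illustration; None stands in for a missing listing name.
names = pandas.Series(
    ["Spacious loft", "cozy and spacious", "SPACIOUS 2BR", "Tiny room", None]
)

# Approach used above: lower-case everything, match, then clean up missing values.
lowered = names.str.lower().str.contains("spacious") == True

# Alternative: ignore case while matching and treat missing names as False.
ignore_case = names.str.contains("spacious", case=False, na=False)

print(lowered.tolist())
print(ignore_case.tolist())
```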

How many did we miss when we didn't make the series lower case first?

In [55]:
len(all_spatious_df)

3800

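So we actually missed most of them the first time: 3800 - 582 = 3218 listings. That also suggests to me that "spacious" is often the first word in the listing name, since most of the uses of the word have a capital letter in them (though title-case or all-caps names would produce the same result).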
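The first-word guess can be checked rather than assumed. Here is a sketch, again on a made-up `names` series rather than the real data, that counts how many matching names actually begin with the word, using `.str.startswith()`:

```python
import pandas

# Made-up names for illustration; None stands in for a missing listing name.
names = pandas.Series(
    ["Spacious loft", "Bright, Spacious 1BR", "cozy and spacious", "SPACIOUS room", None]
)

# Every name that mentions the word, regardless of capitalization.
has_word = names.str.contains("spacious", case=False, na=False)

# Names where the word comes first (checked on the lower-cased names).
starts_with = names.str.lower().str.startswith("spacious", na=False)

# Names the case-sensitive search would have missed.
missed = has_word & ~names.str.contains("spacious", na=False)

print(has_word.sum(), starts_with.sum(), missed.sum())
```

Running the same three lines against `df.name` would show how many of the missed listings really do start with the word.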