## When to use SQL in pandas
* The data fits in memory
* Data volume is not a performance concern
* You don't need a fully functional SQL Server database
* In complex programs, SQL queries can be much easier to read than the equivalent pandas code

In [3]:
import pandas as pd
from pandasql import sqldf



In [35]:
airbnb = pd.read_csv('data/AB_NYC_2019.csv')
airbnb.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


### Basics with SQL queries

In [15]:
query = '''SELECT name, neighbourhood_group 
            FROM airbnb 
            LIMIT 5'''

# Pandas equivalent
# print(airbnb[['name', 'neighbourhood_group']].head(5))

print(sqldf(query))
print(type(sqldf(query)))

                                               name neighbourhood_group
0                Clean & quiet apt home by the park            Brooklyn
1                             Skylit Midtown Castle           Manhattan
2               THE VILLAGE OF HARLEM....NEW YORK !           Manhattan
3                   Cozy Entire Floor of Brownstone            Brooklyn
4  Entire Apt: Spacious Studio/Loft by central park           Manhattan
<class 'pandas.core.frame.DataFrame'>


Above, we extracted the name and neighbourhood group of the first five listings in the airbnb dataframe. Note that the sqldf function returns a pandas DataFrame.
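Because the result is an ordinary DataFrame, it can be passed straight into further pandas operations. The sketch below illustrates that round trip on a tiny made-up table (the `listings` frame and its values are hypothetical, not taken from the Airbnb file). To avoid extra dependencies it uses the standard-library `sqlite3` module with `DataFrame.to_sql` and `pd.read_sql_query` as a stand-in for `sqldf`, on the assumption that a SQLite backend behaves like the pandasql setup used here.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for the airbnb dataframe
listings = pd.DataFrame({
    "name": ["Loft", "Studio", "Townhouse"],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn"],
    "price": [120, 200, 350],
})

# Copy the dataframe into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
listings.to_sql("listings", conn, index=False)

query = '''SELECT name, neighbourhood_group
            FROM listings
            LIMIT 2'''

# The query result comes back as a regular pandas DataFrame ...
result = pd.read_sql_query(query, conn)
conn.close()

# ... so pandas methods can be chained onto it directly
counts = result["neighbourhood_group"].value_counts()
print(type(result))
print(counts)
```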

In [16]:
query = '''SELECT DISTINCT neighbourhood_group 
            FROM airbnb'''

# Pandas equivalent
# print(airbnb['neighbourhood_group'].unique())

print(sqldf(query))

  neighbourhood_group
0            Brooklyn
1           Manhattan
2              Queens
3       Staten Island
4               Bronx


### Sorting data

In [20]:
query = '''SELECT name, price
            FROM airbnb
            ORDER BY price DESC 
            LIMIT 5'''

# Pandas equivalent
# print(airbnb[['name', 'price']].sort_values(by=['price'], ascending=False, ignore_index=True).head(5))

print(sqldf(query))

                                                name  price
0                Furnished room in Astoria apartment  10000
1    Luxury 1 bedroom apt. -stunning Manhattan views  10000
2                                1-BR Lincoln Center  10000
3  2br - The Heart of NYC: Manhattans Lower East ...   9999
4                Quiet, Clean, Lit @ LES & Chinatown   9999
                                              name  price
0              Furnished room in Astoria apartment  10000
1  Luxury 1 bedroom apt. -stunning Manhattan views  10000
2                              1-BR Lincoln Center  10000
3                               Spanish Harlem Apt   9999
4              Quiet, Clean, Lit @ LES & Chinatown   9999
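The two result tables above agree on the prices but not on which 9999 listing appears in row 3. That is probably a tie effect: when several rows share the same sort value, neither SQL's `ORDER BY` nor pandas' default sort promises a particular order among them. A minimal sketch of one way to make the ordering reproducible, using a hypothetical toy frame and `name` as an arbitrary tie-breaker:

```python
import pandas as pd

# Hypothetical toy data with a tie on price
listings = pd.DataFrame({
    "name": ["Apt C", "Apt A", "Apt B"],
    "price": [9999, 10000, 9999],
})

# Sort by price (descending), then by name (ascending) to break ties
top = listings.sort_values(
    by=["price", "name"],
    ascending=[False, True],
    ignore_index=True,
)
print(top)
```

The SQL counterpart of this tie-breaker would be `ORDER BY price DESC, name ASC`.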


### Filtering data

In [29]:
query = '''SELECT DISTINCT neighbourhood_group
            FROM airbnb 
            WHERE room_type = 'Private room'
               AND price > 900'''

# Pandas equivalent
# print(airbnb[(airbnb['room_type'] == 'Private room') & (airbnb['price'] > 900)]['neighbourhood_group'].unique())

print(sqldf(query))


  neighbourhood_group
0           Manhattan
1            Brooklyn
2              Queens
3               Bronx


### Grouping and aggregating data 

In [32]:
query = '''SELECT neighbourhood_group, MAX(price)
            FROM airbnb 
            GROUP BY neighbourhood_group'''

# Pandas equivalent
# print(airbnb[['neighbourhood_group', 'price']].groupby('neighbourhood_group', as_index=False).max())

print(sqldf(query))



  neighbourhood_group  MAX(price)
0               Bronx        2500
1            Brooklyn       10000
2           Manhattan       10000
3              Queens       10000
4       Staten Island        5000
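The aggregated column above is literally named `MAX(price)`, which is awkward to reference from pandas afterwards. The sketch below shows one way to give it a friendlier name with `AS`, alongside a pandas counterpart using named aggregation. The `listings` frame, its values, and the `max_price` alias are hypothetical, and the standard-library `sqlite3` module stands in for `sqldf` so the example has no extra dependencies.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for the airbnb dataframe
listings = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Brooklyn", "Bronx", "Brooklyn"],
    "price": [80, 150, 120, 95],
})

# Copy the dataframe into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
listings.to_sql("listings", conn, index=False)

# Alias the aggregate and order the groups for a stable result
query = '''SELECT neighbourhood_group, MAX(price) AS max_price
            FROM listings
            GROUP BY neighbourhood_group
            ORDER BY neighbourhood_group'''
sql_result = pd.read_sql_query(query, conn)
conn.close()

# Pandas equivalent using named aggregation
pandas_result = (
    listings.groupby("neighbourhood_group", as_index=False)
    .agg(max_price=("price", "max"))
)

print(sql_result)
print(pandas_result)
```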


### Performing mathematical operations

In [34]:
query = '''SELECT price * minimum_nights AS minimum_payment
            FROM airbnb
            ORDER BY minimum_payment DESC
            LIMIT 5'''

# Pandas equivalent
# airbnb['minimum_payment'] = airbnb['price'] * airbnb['minimum_nights']
# print(airbnb['minimum_payment'].sort_values(ascending=False, ignore_index=True).head())

print(sqldf(query))


   minimum_payment
0          1170000
1          1000000
2           989901
3           857750
4           730000
