<a href="https://colab.research.google.com/github/ua-datalab/DataEngineering/blob/main/05_Workshop_Feb_26_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***NoSQL Workshop - MongoDB and PyMongo***

## First install pymongo client on colab

In [43]:
!pip install -q pymongo[srv]==3.10.1
# !pip install -q python-dotenv==0.13.0 ##install dotenv in case you do not want to use open password and load it from an .env file

In [46]:
from google.colab import userdata
password = userdata.get('mongo')

## Once installed, load MongoClient: to connect to the cloud hosted database.
### The sample database has open ip connections allowed so you all can access it but it is recommended to not do this while doing actual development.

In [2]:
from pymongo.mongo_client import MongoClient

## this is a sample url specifically designed for the workshop tutorial

In [47]:
uri = "mongodb+srv://danew52417:"+password+"@datascienceworkshop.aqtuipj.mongodb.net/?retryWrites=true&w=majority&appName=datascienceworkshop"

# Create a new client and connect to the server

In [4]:
client = MongoClient(uri)

# Send a ping to confirm a successful connection

In [6]:
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


# **How does mongoDB stores the data**

![](https://studio3t.com/wp-content/uploads/2022/04/hierachy-768x469.png)

# List all the databases in the cluster:

In [7]:
for db_info in client.list_database_names():
   print(db_info)

sample_airbnb
sample_analytics
sample_geospatial
sample_guides
sample_mflix
sample_restaurants
sample_supplies
sample_training
sample_weatherdata
admin
local


## let's choose the sample_airbnb database

In [8]:
db = client["sample_airbnb"]

## Show Collections
### List all collections in the sample_airbnb database. Collections in MongoDB are similar to tables in relational databases.

In [9]:
collections = db.list_collection_names()
print("Collections:", collections)

Collections: ['listingsAndReviews']


### This code will print the names of all collections within the sample_airbnb database. For the Airbnb sample data, you should see a collection named listingsAndReviews.


## Explore the listingsAndReviews Collection
### Now, let's perform some basic queries on the listingsAndReviews collection to understand the data better.

### **Find the First Document**
### Retrieve and print the first document from the collection to see what the data looks like.

In [10]:
first_listing = db.listingsAndReviews.find_one()
print(first_listing)

{'_id': '10006546', 'listing_url': 'https://www.airbnb.com/rooms/10006546', 'name': 'Ribeira Charming Duplex', 'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.', 'space': 'Privileged views of the Douro River and Ribeira square, our apartment offers the perfect conditions to discover the history and the charm of Porto. Apartment comfortable, charming, romantic and cozy in the heart of Ribeira. Within walking distance of all the most emblematic places of the city of Porto. The apartment is fully equipped to host 8 people, with cooker, oven, washing machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The apartment is located in a very typical area of the city that allows to cross with the most picturesque population of the city, welcoming, genuine and happy people that fills the streets with his o

In [None]:
# print(first_listing['reviews'])

[{'_id': '58663741', 'date': datetime.datetime(2016, 1, 3, 5, 0), 'listing_id': '10006546', 'reviewer_id': '51483096', 'reviewer_name': 'Cátia', 'comments': 'A casa da Ana e do Gonçalo foram o local escolhido para a passagem de ano com um grupo de amigos. Fomos super bem recebidos com uma grande simpatia e predisposição a ajudar com qualquer coisa que fosse necessário.\r\nA casa era ainda melhor do que parecia nas fotos, totalmente equipada, com mantas, aquecedor e tudo o que pudessemos precisar.\r\nA localização não podia ser melhor! Não há melhor do que acordar de manhã e ao virar da esquina estar a ribeira do Porto.'}, {'_id': '62413197', 'date': datetime.datetime(2016, 2, 14, 5, 0), 'listing_id': '10006546', 'reviewer_id': '40031996', 'reviewer_name': 'Théo', 'comments': "We are french's students, we traveled some days in Porto, this space was good and we can cooking easly. It was rainning so we eard every time the water fall to the ground in the street when we sleeping. But It was

### using 'print', doesn't display the output correctly so we would use pprint (pretty print)

In [11]:
import pprint

In [12]:
pprint.pprint(first_listing)
## we will get a nice dictionary like output

{'_id': '10006546',
 'access': 'We are always available to help guests. The house is fully '
           'available to guests. We are always ready to assist guests. when '
           'possible we pick the guests at the airport.  This service transfer '
           'have a cost per person. We will also have service "meal at home" '
           'with a diverse menu and the taste of each. Enjoy the moment!',
 'accommodates': 8,
 'address': {'country': 'Portugal',
             'country_code': 'PT',
             'government_area': 'Cedofeita, Ildefonso, Sé, Miragaia, Nicolau, '
                                'Vitória',
             'location': {'coordinates': [-8.61308, 41.1413],
                          'is_location_exact': False,
                          'type': 'Point'},
             'market': 'Porto',
             'street': 'Porto, Porto, Portugal',
             'suburb': ''},
 'amenities': ['TV',
               'Cable TV',
               'Wifi',
               'Kitchen',
              

### This will give you a sense of the structure of the documents in the collection, including the fields and types of data stored.

### Since it is JSON notation, we can check the available keys for our search

In [13]:
first_listing.keys()

dict_keys(['_id', 'listing_url', 'name', 'summary', 'space', 'description', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules', 'property_type', 'room_type', 'bed_type', 'minimum_nights', 'maximum_nights', 'cancellation_policy', 'last_scraped', 'calendar_last_scraped', 'first_review', 'last_review', 'accommodates', 'bedrooms', 'beds', 'number_of_reviews', 'bathrooms', 'amenities', 'price', 'security_deposit', 'cleaning_fee', 'extra_people', 'guests_included', 'images', 'host', 'address', 'availability', 'review_scores', 'reviews'])

## Count Documents
### Count how many documents (listings) are in the collection.

In [14]:
count = db.listingsAndReviews.count_documents({})
print("Number of listings:", count)

Number of listings: 5555


## Query Documents with Criteria
### Find listings with at least 5 bedrooms and 5 bathrooms.

### Query Object: {"bedrooms": {"$gte": 5}, "bathrooms": {"$gte": 5}}

#### The query is made up of key-value pairs where the keys are the field names in the documents you're querying, and the values specify the conditions those fields must meet for a document to be included in the result set.


## $gte is a MongoDB query operator that stands for "greater than or equal to."


In [15]:
query = {"bedrooms": {"$gte": 5}, "bathrooms": {"$gte": 5}}
results = db.listingsAndReviews.find(query)

## the above query returns a mongo cursor ; similar to an iterator with the results

## How many documents matched our query?

In [16]:
num_documents = db.listingsAndReviews.count_documents(query)
num_documents

23

In [17]:
for listing in results:
    print(listing['name'], ":", listing['bedrooms'], "bedrooms and", listing['bathrooms'], "bathrooms", listing['listing_url'])

Colonial Mansion in Santa Teresa : 6 bedrooms and 5.0 bathrooms https://www.airbnb.com/rooms/1066648
Barra da Tijuca beach house : 9 bedrooms and 7.0 bathrooms https://www.airbnb.com/rooms/12509339
Cosy penthouse close to Barra beach : 5 bedrooms and 8.0 bathrooms https://www.airbnb.com/rooms/12591225
Casa completa p olimpíadas com serviços incluído : 5 bedrooms and 5.0 bathrooms https://www.airbnb.com/rooms/13927230
Whale Beach Sensation : 6 bedrooms and 5.0 bathrooms https://www.airbnb.com/rooms/15958590
Kahala Ali'i : 6 bedrooms and 5.0 bathrooms https://www.airbnb.com/rooms/16215566
Venue Hotel Old City : 20 bedrooms and 16.0 bathrooms https://www.airbnb.com/rooms/20701559
Most Beautiful Villa on Bosphorus Istanbul... : 7 bedrooms and 8.0 bathrooms https://www.airbnb.com/rooms/20955863
Great Complex of the Cellars : 10 bedrooms and 5.0 bathrooms https://www.airbnb.com/rooms/20958766
12 people or we can negotiate. : 5 bedrooms and 6.0 bathrooms https://www.airbnb.com/rooms/2119780
G

## **Introduction to CRUD**
![](https://media.licdn.com/dms/image/C5612AQGtLYxl0tyq3g/article-cover_image-shrink_720_1280/0/1640764212366?e=1714608000&v=beta&t=zzAEq-5O5pRK1tjriI5hTUp7mumjuzUQEjjYhUHRKFo)

- Create: Insert documents into a collection with insertOne(), insertMany().
- Read: Query documents using methods like find() or findOne() to retrieve data.
- Update: Modify existing documents using updateOne(), updateMany(), or replaceOne() commands, specifying criteria and changes.
- Delete: Remove documents from a collection using deleteOne() or deleteMany() based on specified criteria.

## Some Examples

## Find Listings in a Specific Location
### Suppose you want to find all listings located in Portugal. You can specify the "address.country" field in your query.

In [18]:
portugal_listings = db.listingsAndReviews.find({"address.country": "Portugal"})

for listing in portugal_listings:
    print(listing['name'], "in", listing['address']['market'])


Ribeira Charming Duplex in Porto
Be Happy in Porto in Porto
Downtown Oporto Inn (room cleaning) in Porto
A Casa Alegre é um apartamento T1. in Porto
FloresRooms 3T in Porto
BBC OPORTO 4X2 in Porto
Where Castles and Art meet the Sea in Porto
Heroísmo IV in Porto
Spacious and well located apartment in Porto
Ribeira Smart Flat in Porto
PORTO DOWNTOWN FLATS-RIBEIRA STUDIO in Porto
Renovated Classic Design Studio with Sun Room in Porto
Lulapartment II in Porto
O'Porto Studio | Historic Center in Porto
Apartamento São Bento da Vitória 2 - Amazing view in Porto
LightHouse with Sea View in Porto
Fabulous occasion orange in Porto
Casa "Voltaria", o desejo de voltar in Porto
Apartamento Centro histórico in Porto
Porto-amazing room with ensuite in Porto
B. Arts IV in Porto
HOUSE INHISTORICAL AREA GAIA-OPORTO in Porto
Premium Exclusive house & garden Rua Almada in Porto
Central apartment in Almada street in Porto
Shining view in the city heart 5 in Porto
Myboat & Yacht rentals in Porto
Casa da Por

## **Update Documents**
### To update documents, for example, adding a new field to indicate a listing is verified, you can use update_one or update_many.

In [19]:
db.listingsAndReviews.update_many(
    {"address.country": "Brazil"},  # Criteria
    {"$set": {"verified": True}}  # Update
)

<pymongo.results.UpdateResult at 0x7ca56d5e4fc0>

### **Verify the update**

In [20]:
verified_listing = db.listingsAndReviews.find_one({"address.country": "Brazil"})
print(verified_listing['name'], "verified status:", verified_listing.get('verified', False))

Horto flat with small garden verified status: True


## **Delete Documents**
### If you need to delete listings from a specific host, you can use delete_many.

# # Caution: This will delete documents! Use with care.

In [21]:
db.listingsAndReviews.delete_many({"host.host_name": "John Doe"})

# Verify deletion
deleted_count = db.listingsAndReviews.count_documents({"host.host_name": "John Doe"})
print("Listings deleted:", deleted_count)


Listings deleted: 0


## now let's actually delete many entries for host

In [22]:
db.listingsAndReviews.delete_many({"host.host_name": "Ana&Gonçalo"})

# Verify deletion
deleted_count = db.listingsAndReviews.count_documents({"host.host_name": "Ana&Gonçalo"})
# print("Listings deleted:", deleted_count)

Listings deleted: 0


# lets count again

In [24]:
count = db.listingsAndReviews.count_documents({})
print("Number of listings:", count)

Number of listings: 5554


## Find listings that have WiFi and Cable TV:

In [25]:
result = db.listingsAndReviews.find({
    "amenities": {
        "$all": ["Wifi", "Cable TV",'Wifi', 'Kitchen', 'Heating', 'Family/kid friendly', 'Washer']
    }
})

for listing in result:
    print(listing['name'], "in", listing['amenities'])

Charming Flat in Downtown Moda in ['TV', 'Cable TV', 'Internet', 'Wifi', 'Kitchen', 'Free parking on premises', 'Pets allowed', 'Pets live on this property', 'Cat(s)', 'Heating', 'Family/kid friendly', 'Washer', 'Essentials', 'Shampoo', '24-hour check-in', 'Hangers', 'Hair dryer', 'Iron', 'Laptop friendly workspace']
Deluxe Loft Suite in ['TV', 'Cable TV', 'Internet', 'Wifi', 'Air conditioning', 'Kitchen', 'Doorman', 'Gym', 'Elevator', 'Heating', 'Family/kid friendly', 'Washer', 'Dryer', 'Smoke detector', 'Carbon monoxide detector', 'First aid kit', 'Fire extinguisher', 'Essentials', 'Shampoo', '24-hour check-in', 'Hangers', 'Hair dryer', 'Iron']
THE Place to See Sydney's FIREWORKS in ['TV', 'Cable TV', 'Wifi', 'Air conditioning', 'Kitchen', 'Buzzer/wireless intercom', 'Heating', 'Family/kid friendly', 'Washer', 'Dryer', 'Smoke detector', 'Essentials', 'Shampoo', 'Hangers', 'Hair dryer', 'Iron', 'Laptop friendly workspace']
LAHAINA, MAUI! RESORT/CONDO BEACHFRONT!! SLEEPS 4! in ['TV', '

## Find listings within a specific price range (e.g., \$900  to  \$1000 per night):

In [26]:
result = db.listingsAndReviews.find({
    "price": {"$gte": 900, "$lte": 1000}
})

for listing in result:
    print(listing['name'], listing['listing_url'])

GOLF ROYAL RESIDENCE SUİTES(2+1)-2 https://www.airbnb.com/rooms/10116256
Apartamento zona sul do RJ https://www.airbnb.com/rooms/10116578
Big, Bright & Convenient Sheung Wan https://www.airbnb.com/rooms/10141950
Alugo Apart frente mar Barra Tijuca https://www.airbnb.com/rooms/10306879
Best location 1BR Apt in HK - Shops & Sights https://www.airbnb.com/rooms/10343118
Greenwich Fun and Luxury https://www.airbnb.com/rooms/10459480
GREAT Apartment at Barra da Tijuca https://www.airbnb.com/rooms/10635276
Beautiful 1 Bedroom Apartment https://www.airbnb.com/rooms/10659534
NEW FLAT + PRIVATE ROOFTOP IN SAY YING PUN https://www.airbnb.com/rooms/11305966
Luxury 1 Bedroom Central Park Views https://www.airbnb.com/rooms/1146653
Apartamento de Luxo para Olimpíadas https://www.airbnb.com/rooms/11472884
Studio Apartment w/ Private Rooftop https://www.airbnb.com/rooms/11688660
Aluguel para Olimpiadas https://www.airbnb.com/rooms/11695637
Barra da Tijuca: cobertura ESPETACULAR https://www.airbnb.com/r

## Intially there were 5555 entries now we are left with 5554

## Examples of **aggregate**

## Count listings by country:

In [27]:
result = db.listingsAndReviews.aggregate([
    {"$group": {"_id": "$address.country", "count": {"$sum": 1}}}
])

for document in result:
    print(document['_id'], ":", document['count'])


United States : 1222
Canada : 649
Portugal : 554
Turkey : 661
Australia : 610
Brazil : 606
Hong Kong : 600
China : 19
Spain : 633


## Average number of bedrooms and bathrooms by country:

In [29]:
result = db.listingsAndReviews.aggregate([
    {"$group": {
        "_id": "$address.country",
        "averageBedrooms": {"$avg": "$bedrooms"},
        "averageBathrooms": {"$avg": "$bathrooms"}
    }}
])

for document in result:
    print(document['_id'], ":", "Average Bedrooms:", round(document['averageBedrooms'],2), "Average Bedrooms:", round(document['averageBedrooms'],2))

United States : Average Bedrooms: 1.32 Average Bedrooms: 1.32
Canada : Average Bedrooms: 1.42 Average Bedrooms: 1.42
Spain : Average Bedrooms: 1.55 Average Bedrooms: 1.55
Portugal : Average Bedrooms: 1.53 Average Bedrooms: 1.53
China : Average Bedrooms: 1.0 Average Bedrooms: 1.0
Brazil : Average Bedrooms: 1.59 Average Bedrooms: 1.59
Hong Kong : Average Bedrooms: 1.07 Average Bedrooms: 1.07
Turkey : Average Bedrooms: 1.35 Average Bedrooms: 1.35
Australia : Average Bedrooms: 1.56 Average Bedrooms: 1.56


## Sort Listings by Number of Reviews
### To understand which listings are most popular, you might sort them by the number of reviews in descending order.

In [30]:
popular_listings = db.listingsAndReviews.aggregate([
    # Adds a field to each document that represents the size of the reviews list
    {"$addFields": {"reviewsCount": {"$size": "$reviews"}}},
    # Sort the documents by the newly added reviewsCount field in descending order
    {"$sort": {"reviewsCount": -1}},
    # Limit the results to the top 10
    {"$limit": 10}
])

for listing in popular_listings:
    print(listing['name'], "with", listing.get('reviewsCount', 0), "reviews",  listing['listing_url'])

#Private Studio - Waikiki Dream with 533 reviews https://www.airbnb.com/rooms/4069429
Near Airport private room, 2 bedroom granny flat** with 469 reviews https://www.airbnb.com/rooms/12954762
La Sagrada Familia (and metro) 4 blocks! with 463 reviews https://www.airbnb.com/rooms/95560
PRIVATE Room in Spacious, Quiet Apt with 420 reviews https://www.airbnb.com/rooms/476983
traditional and Charming room with 408 reviews https://www.airbnb.com/rooms/5283892
Porto city centre apartment H with 402 reviews https://www.airbnb.com/rooms/2758817
ABEL'S IN DOWNTOWN - THE PLACE TO BE 1 OF 2 with 399 reviews https://www.airbnb.com/rooms/1284759
Beautiful apartment by Pl Catalunya with 397 reviews https://www.airbnb.com/rooms/1482060
B & B  Room Yoga Garden's Place with 391 reviews https://www.airbnb.com/rooms/127208
The Ohana at Volcanoes National Park! with 385 reviews https://www.airbnb.com/rooms/11610598


## Aggregate Average Price by Room Type

In [31]:
from bson.son import SON  # SON ensures order is maintained, important for certain MongoDB operations
from decimal import Decimal
avg_price_by_room = db.listingsAndReviews.aggregate([
    {"$group": {"_id": "$room_type", "averagePrice": {"$avg": "$price"}}},
    {"$sort": SON([("averagePrice", 1)])}  # Sort by average price in ascending order
])

for avg in avg_price_by_room:
    print(avg['_id'], ":", round(Decimal(avg['averagePrice'].to_decimal()),3), "$")

Private room : 212.297 $
Entire home/apt : 314.927 $
Shared room : 349.590 $


## **Update** Operations

## Add a new amenity to a listing

In [32]:
### first get a random listing
first_listing = db.listingsAndReviews.find_one()
print(first_listing['_id'], first_listing['amenities'] )

10009999 ['Wifi', 'Wheelchair accessible', 'Kitchen', 'Free parking on premises', 'Smoking allowed', 'Hot tub', 'Buzzer/wireless intercom', 'Family/kid friendly', 'Washer', 'First aid kit', 'Essentials', 'Hangers', 'Hair dryer', 'Iron', 'Laptop friendly workspace']


#### we will add "smart TV" as amenities here; this should appear at the end.

In [33]:
db.listingsAndReviews.update_one(
    {"_id": "10009999"},
    {"$addToSet": {"amenities": "smart TV"}}
)

<pymongo.results.UpdateResult at 0x7ca56e0bcac0>

## let's verify

In [34]:
query = {"_id": "10009999"}
results = db.listingsAndReviews.find(query)

In [35]:
for listing in results:
    print(listing['_id'], ":", listing['amenities'])

10009999 : ['Wifi', 'Wheelchair accessible', 'Kitchen', 'Free parking on premises', 'Smoking allowed', 'Hot tub', 'Buzzer/wireless intercom', 'Family/kid friendly', 'Washer', 'First aid kit', 'Essentials', 'Hangers', 'Hair dryer', 'Iron', 'Laptop friendly workspace', 'smart TV']


### **update_many** example

## Set a new field verified to true for all listings:

In [36]:
db.listingsAndReviews.update_many(
    {},
    {"$set": {"verified": True}}
)

<pymongo.results.UpdateResult at 0x7ca56e1630c0>

## **DELETE** operations - be careful while using.

## Delete listings without reviews:

In [37]:
db.listingsAndReviews.delete_many({"reviews": {"$size": 0}})

<pymongo.results.DeleteResult at 0x7ca56e163680>

## verification - lets get listings in ascending order


In [41]:
unpopular_listings = db.listingsAndReviews.aggregate([
    # Adds a field to each document that represents the size of the reviews list
    {"$addFields": {"reviewsCount": {"$size": "$reviews"}}},
    # Sort the documents by the newly added reviewsCount field in ascending order
    {"$sort": {"reviewsCount": 1}},
    # Limit the results to the bottom 10
    {"$limit": 10}
])

for listing in unpopular_listings:
    print(listing['name'], "with", listing.get('reviewsCount', 0), "reviews",  listing['listing_url'])

Private Room in Bushwick with 1 reviews https://www.airbnb.com/rooms/10021707
Big, Bright & Convenient Sheung Wan with 1 reviews https://www.airbnb.com/rooms/10141950
LAHAINA, MAUI! RESORT/CONDO BEACHFRONT!! SLEEPS 4! with 1 reviews https://www.airbnb.com/rooms/10227000
Charming Flat in Downtown Moda with 1 reviews https://www.airbnb.com/rooms/10047964
Catete's Colonial Big Hause Room B with 1 reviews https://www.airbnb.com/rooms/10051164
Kailua-Kona, Kona Coast II 2b condo with 1 reviews https://www.airbnb.com/rooms/1022200
Small Room w Bathroom Flamengo Rio de Janeiro with 1 reviews https://www.airbnb.com/rooms/10299095
Quarto inteiro na Tijuca with 1 reviews https://www.airbnb.com/rooms/10228731
Resort-like living in Williamsburg with 1 reviews https://www.airbnb.com/rooms/10166986
Easy 1 Bedroom in Chelsea with 1 reviews https://www.airbnb.com/rooms/10096773


In [42]:
for listing in unpopular_listings:
    print(listing['name'], "with", listing.get('reviewsCount', 0), "reviews",  listing['listing_url'])

### So, Minimum we see is one review

## let's count the number of documents again

In [39]:
count = db.listingsAndReviews.count_documents({})
print("Number of listings:", count)

Number of listings: 3922


# thanks! 🙏