
# MongoDB Atlas with Python: Using the `sample_mflix` Dataset
This notebook introduces the basics of querying, filtering, and performing aggregations with MongoDB using the **sample_mflix** dataset. The dataset contains movie-related information such as movies, comments, theaters, and users. You'll learn how to:
- Connect to MongoDB Atlas
- Perform basic queries and filtering
- Execute advanced operations like aggregation
- Create indexes and update/delete documents

Let's get started!

In [1]:
!pip install --upgrade pymongo certifi

Collecting pymongo
  Downloading pymongo-4.10.1-cp312-cp312-win_amd64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.10.1-cp312-cp312-win_amd64.whl (926 kB)
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.10.1


## 1. Setup and Connection to MongoDB Atlas


In [1]:

# pymongo is installed by the pip cell above


# Import necessary libraries
from pymongo import MongoClient
import pprint

# Replace <username> and <password> with your own MongoDB Atlas credentials.
# Avoid committing real credentials to a shared notebook.
connection_string = 'mongodb+srv://<username>:<password>@cluster0.k0akc.mongodb.net/test?retryWrites=true&w=majority'
# Connect to MongoDB Atlas
client = MongoClient(connection_string)

# Access the sample_mflix database and the movies collection
db = client['sample_mflix']
collection = db['movies']

# sample_mflix is a database. MongoDB stores documents in collections,
# which are analogous to tables in a relational database.

# Note: I had to add my IP address to the Atlas access list before connecting:
#   Data Services > Security (left menu bar) > Network Access
# Database Access (same menu) is where you add users and change permissions such as read/write.
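
Hard-coding a username and password in a notebook makes them easy to leak. A minimal sketch of one alternative, assuming the connection string is kept in an environment variable; the variable name `MONGODB_URI` and the helper `get_connection_string` are hypothetical and not part of the original notebook:

```python
import os


def get_connection_string(env=None, var_name="MONGODB_URI"):
    """Return the Atlas connection string stored in an environment variable.

    MONGODB_URI is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    uri = env.get(var_name)
    if not uri:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return uri


# Usage (requires pymongo and a reachable cluster):
# from pymongo import MongoClient
# client = MongoClient(get_connection_string())
```

The helper takes an optional mapping so it can be tried without touching the real environment.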

## 2. Basic MongoDB Commands
### Searching for Documents (Basic Query)


In [2]:

# Find one document from the movies collection
document = collection.find_one()
pprint.pprint(document)

# A document is a set of field-value pairs, similar to a JSON object.
# Documents are the objects MongoDB stores.
# find() supports filtering and sorting, much like SELECT in MySQL.

{'_id': ObjectId('573a1390f29313caabcd587d'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['John McCann', 'James A. Marcus', 'Maggie Weston', 'Harry McCoy'],
 'countries': ['USA'],
 'directors': ['Raoul Walsh'],
 'fullplot': 'At 10 years old, Owens becomes a ragged orphan when his sainted '
             'mother dies. The Conways, who are next door neighbors, take Owen '
             'in, but the constant drinking by Jim soon puts Owen on the '
             'street. By 17, Owen learns that might is right. By 25, Owen is '
             'the leader of his own gang who spend most of their time gambling '
             'and drinking. But Marie comes into the gangster area of town and '
             'everything changes for Owen as he falls for Marie. But he cannot '
             'tell her so, so he comes to her settlement to find education and '
             'inspiration. But soon, his old way of life will rise to confront '
             'him again.',
 'genres': ['Bio

### Searching with a Filter (Filtering)


In [7]:

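# Find movies whose "genres" array contains "Action" (limited to the first 5)
action_movies = collection.find({"genres": "Action"}).limit(5)

# The filter is a dictionary: it matches documents where the given field has the specified value.
# For an array field such as "genres", a document matches if any element equals that value.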

# Print the results
for movie in action_movies:
    pprint.pprint(movie)


{'_id': ObjectId('573a1393f29313caabcdc77e'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['Humphrey Bogart',
          'Mary Astor',
          'Sydney Greenstreet',
          'Charles Halton'],
 'countries': ['USA'],
 'directors': ['John Huston', 'Vincent Sherman'],
 'fullplot': 'Rick Leland makes no secret of the fact he has no loyalty to his '
             'home country after he is court-marshaled out of the army and '
             'boards a Japanese ship for the Orient in late 1941. But has '
             'Leland really been booted out, or is there some other motive for '
             'his getting close to fellow passenger Doctor Lorenz? Any motive '
             'for getting close to attractive traveller Alberta Marlow would '
             'however seem pretty obvious.',
 'genres': ['Action', 'Adventure', 'Drama'],
 'imdb': {'id': 34428, 'rating': 6.9, 'votes': 2892},
 'languages': ['English', 'Japanese'],
 'lastupdated': '2015-08-23 00:05:01.140000000',
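
The filter `{"genres": "Action"}` matches even though `genres` is an array. The rule can be sketched in plain Python; this only mimics the matching behavior and is not how the server evaluates queries. The function name and the toy documents are made up for illustration:

```python
def matches_equality(doc, field, value):
    """Pure-Python illustration of an equality filter such as {"genres": "Action"}."""
    stored = doc.get(field)
    if isinstance(stored, list):
        # Array field: match when the whole array equals the value
        # or when any single element equals it.
        return stored == value or value in stored
    return stored == value


# Toy documents (invented for this sketch, not taken from sample_mflix)
movies = [
    {"title": "A", "genres": ["Action", "Drama"]},
    {"title": "B", "genres": ["Comedy"]},
    {"title": "C", "genres": "Action"},
]

action_titles = [m["title"] for m in movies if matches_equality(m, "genres", "Action")]
print(action_titles)  # ['A', 'C']
```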
 

### Sorting Results


In [9]:

# Find and sort movies by release year in descending order
sorted_movies = collection.find().sort("year", -1).limit(5)
# sort() sets the order in which the query returns matching documents:
# 1 for ascending or -1 for descending, based on the values of one or more fields.

# Print the sorted results
for movie in sorted_movies:
    pprint.pprint(movie)


{'_id': ObjectId('573a13f1f29313caabddc613'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['Steven Waddington',
          'Siennah Buck',
          'Mike Colter',
          'Christian Contreras'],
 'fullplot': 'In the 28th Century, the prolonged war between humanity and the '
             'fanatical alien alliance the Covenant has ended with a tenuous '
             "treaty. Despite the ceasefire, Earth's outer colonies remain "
             "vulnerable to the Covenant's covert intrusions. The ONI - Office "
             'of Naval Intelligence has been tasked with counterintelligence '
             'to beat the Covenant. In Planet Sedra, Commander Jameson Locke '
             "and his team witness a Covenant's spacecraft and a Zealot Elite "
             'warrior disembarks with a bomb. They unsuccessfully try to stop '
             'the alien that explodes the bomb in a mall. They realize that it '
             'is a biological attack with an element fatal to 

### Searching with Multiple Conditions


In [11]:

# Find movies where the genre is "Action" and the rating is greater than 8
multi_condition_query = {"genres": "Action", "imdb.rating": {"$gt": 8}}

# Execute the query
results = collection.find(multi_condition_query).limit(5)

# Print the results
for result in results:
    pprint.pprint(result)


{'_id': ObjectId('573a1395f29313caabce2498'),
 'awards': {'nominations': 1, 'text': '1 win & 1 nomination.', 'wins': 1},
 'cast': ['Clint Eastwood',
          'Marianne Koch',
          'Gian Maria Volontè',
          'Wolfgang Lukschy'],
 'countries': ['Italy', 'Spain', 'West Germany'],
 'directors': ['Sergio Leone'],
 'fullplot': 'An anonymous, but deadly man rides into a town torn by war '
             "between two factions, the Baxters and the Rojo's. Instead of "
             'fleeing or dying, as most other would do, the man schemes to '
             'play the two sides off each other, getting rich in the bargain.',
 'genres': ['Action', 'Drama', 'Western'],
 'imdb': {'id': 58461, 'rating': 8.1, 'votes': 126585},
 'languages': ['Italian', 'Spanish', 'English'],
 'lastupdated': '2015-09-02 00:17:22.303000000',
 'num_mflix_comments': 0,
 'plot': 'A wandering gunfighter plays two rival families against each other '
         'in a town torn apart by greed, pride, and revenge.',
 'pos
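
The query above combines two ideas: several keys in one filter are joined with an implicit AND, and a dotted name such as `imdb.rating` reaches into an embedded document. A plain-Python sketch of that behavior follows; the helper names and toy documents are hypothetical, and the code is an emulation rather than the server's logic:

```python
def get_path(doc, dotted):
    """Follow a dotted field path such as "imdb.rating" into nested dicts."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches_action_over(doc, min_rating):
    """Emulate {"genres": "Action", "imdb.rating": {"$gt": min_rating}}."""
    rating = get_path(doc, "imdb.rating")
    return (
        "Action" in doc.get("genres", [])        # first condition
        and isinstance(rating, (int, float))     # a numeric $gt ignores non-numeric values
        and rating > min_rating                  # second condition
    )


# Toy documents (invented for this sketch)
movies = [
    {"title": "A", "genres": ["Action"], "imdb": {"rating": 8.1}},
    {"title": "B", "genres": ["Action"], "imdb": {"rating": 7.9}},
    {"title": "C", "genres": ["Drama"], "imdb": {"rating": 9.0}},
    {"title": "D", "genres": ["Action"], "imdb": {"rating": ""}},
]

hits = [m["title"] for m in movies if matches_action_over(m, 8)]
print(hits)  # ['A']
```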

## 3. Advanced MongoDB Operations
### Aggregation Example: Average IMDb Rating by Genre


In [13]:

# Aggregation pipeline to calculate average IMDb rating by genre
aggregation_pipeline = [
    {"$unwind": "$genres"},  # Separate each movie's genres into individual documents
    {"$group": {"_id": "$genres", "avg_rating": {"$avg": "$imdb.rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
]

# Reminder: a MongoDB document is a set of field-value pairs, similar to a JSON object.

# $unwind: deconstructs an array field and outputs one document per array element.
# Each output document is the input document with the array replaced by that element.

# $group: separates documents into groups according to a "group key".
# The output is one document for each unique group key.

# Execute the aggregation
aggregated_data = collection.aggregate(aggregation_pipeline)

# Print the results
for data in aggregated_data:
    pprint.pprint(data)


{'_id': 'Film-Noir', 'avg_rating': 7.397402597402598}
{'_id': 'Short', 'avg_rating': 7.377574370709382}
{'_id': 'Documentary', 'avg_rating': 7.365679824561403}
{'_id': 'News', 'avg_rating': 7.252272727272728}
{'_id': 'History', 'avg_rating': 7.1696100917431185}
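
To make the pipeline stages concrete, here is a rough plain-Python equivalent of `$unwind`, `$group` with `$avg`, `$sort`, and `$limit`, run on three invented documents. It illustrates what each stage does conceptually; it is not how MongoDB executes the pipeline:

```python
from collections import defaultdict

# Toy documents (invented for this sketch, not taken from sample_mflix)
movies = [
    {"title": "A", "genres": ["Action", "Drama"], "imdb": {"rating": 8.0}},
    {"title": "B", "genres": ["Drama"], "imdb": {"rating": 6.0}},
    {"title": "C", "genres": ["Action"], "imdb": {"rating": 7.0}},
]

# $unwind: one (genre, rating) row per array element
unwound = [(genre, m["imdb"]["rating"]) for m in movies for genre in m["genres"]]

# $group with $avg: collect the ratings per genre, then average them
ratings_by_genre = defaultdict(list)
for genre, rating in unwound:
    ratings_by_genre[genre].append(rating)
averages = [
    {"_id": genre, "avg_rating": sum(values) / len(values)}
    for genre, values in ratings_by_genre.items()
]

# $sort descending by avg_rating, then $limit 5
top = sorted(averages, key=lambda d: d["avg_rating"], reverse=True)[:5]
print(top)
# [{'_id': 'Action', 'avg_rating': 7.5}, {'_id': 'Drama', 'avg_rating': 7.0}]
```

Movie "A" contributes to both genres, which is the effect `$unwind` has on multi-genre movies in the real pipeline.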


### Indexing: Creating an Index


In [35]:

# Create an index on the "year" field to improve query performance for year-related searches
collection.create_index([("year", 1)])

# To drop the index, pass the same key specification used to create it (or its name, "year_1"):
#collection.drop_index([("year", 1)])


# Show existing indexes
indexes = collection.index_information()
pprint.pprint(indexes)



{'_id_': {'key': [('_id', 1)], 'v': 2},
 'cast_text_fullplot_text_genres_text_title_text': {'default_language': 'english',
                                                    'key': [('_fts', 'text'),
                                                            ('_ftsx', 1)],
                                                    'language_override': 'language',
                                                    'textIndexVersion': 3,
                                                    'v': 2,
                                                    'weights': SON([('cast', 1), ('fullplot', 1), ('genres', 1), ('title', 1)])},
 'year_1': {'key': [('year', 1)], 'v': 2}}
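
The index names in the output follow a visible pattern: each field name and its direction or type, joined by underscores (`year_1`, `cast_text_fullplot_text_...`). A small sketch that reproduces the pattern; `default_index_name` is a hypothetical helper written for illustration, not a pymongo function:

```python
def default_index_name(keys):
    """Build the default-style index name from a list of (field, direction) pairs."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)


print(default_index_name([("year", 1)]))          # year_1
print(default_index_name([("imdb.rating", -1)]))  # imdb.rating_-1
```

Knowing the name is handy because `drop_index` accepts either the key specification or the index name.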


### Updating Documents


In [25]:

# Set the IMDb rating of one movie; update_one() changes only the first document that matches the filter
collection.update_one({"title": "The Godfather"}, {"$set": {"imdb.rating": 9.3}})


UpdateResult({'n': 1, 'electionId': ObjectId('7fffffff0000000000000121'), 'opTime': {'ts': Timestamp(1728213057, 23), 't': 289}, 'nModified': 0, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1728213057, 23), 'signature': {'hash': b'Y+\xd4\x17\xc8` \xecE\x8bj\xf5\xe8\xac\xd0\xfcJ\xf5\xbc\xb1', 'keyId': 7363326094432272385}}, 'operationTime': Timestamp(1728213057, 23), 'updatedExisting': True}, acknowledged=True)
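
In the output above, `n` is 1 but `nModified` is 0: the filter matched a document, yet nothing changed. pymongo exposes these counts as `matched_count` and `modified_count` on the result object. A sketch of how they might be interpreted; `describe_update` is a hypothetical helper, and a stand-in object replaces a real `UpdateResult` so the example runs without a database:

```python
from types import SimpleNamespace


def describe_update(result):
    """Summarize an update result from its matched_count and modified_count."""
    if result.matched_count == 0:
        return "no document matched the filter"
    if result.modified_count == 0:
        return "matched, but the stored value was already the same"
    return f"modified {result.modified_count} document(s)"


# Stand-in with the same counts as the output above (n=1, nModified=0)
fake_result = SimpleNamespace(matched_count=1, modified_count=0)
print(describe_update(fake_result))  # matched, but the stored value was already the same
```

With the notebook's collection, the same helper could be called on the return value of `collection.update_one(...)`.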

### Deleting Documents


In [None]:

# Delete every document that matches a condition (here, all movies released before 1950).
# Commented out so that running the cell does not remove data.
#collection.delete_many({"year": {"$lt": 1950}})
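
Because `delete_many` removes every matching document, it can help to count the matches first. A minimal sketch, assuming a pymongo collection is passed in; `delete_with_preview` is a hypothetical helper, while `count_documents`, `delete_many`, and `deleted_count` are the pymongo names it relies on:

```python
def delete_with_preview(collection, query, confirm=False):
    """Count the documents a filter matches; delete them only if confirm is True.

    Returns a (matched, deleted) tuple.
    """
    matched = collection.count_documents(query)
    if not confirm:
        return matched, 0
    result = collection.delete_many(query)
    return matched, result.deleted_count


# Dry run against the notebook's collection (deletes nothing):
# matched, deleted = delete_with_preview(collection, {"year": {"$lt": 1950}})
# print(matched, "documents would be deleted")
```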


## 4. Exercises for Hands-On Practice
### Exercise 1: Searching and Filtering
**Task**: Find all movies where the genre is 'Comedy' and the IMDb rating is greater than 7.


In [37]:

# Your task: Write a query to find comedies with an IMDb rating greater than 7
comedies = collection.find({"genres": "Comedy", "imdb.rating": {"$gt": 7}}).limit(5)

# Print the first 5 results
for comedy in comedies:
    pprint.pprint(comedy)


{'_id': ObjectId('573a1393f29313caabcdc91c'),
 'awards': {'nominations': 3,
            'text': 'Won 1 Oscar. Another 2 nominations.',
            'wins': 0},
 'cast': ['Bing Crosby', 'Fred Astaire', 'Marjorie Reynolds', 'Virginia Dale'],
 'countries': ['USA'],
 'directors': ['Mark Sandrich'],
 'fullplot': 'Lovely Linda Mason has crooner Jim Hardy head over heels, but '
             'suave stepper Ted Hanover wants her for his new dance partner '
             "after femme fatale Lila Dixon gives him the brush. Jim's supper "
             'club, Holiday Inn, is the setting for the chase by Hanover and '
             "manager Danny Reed. The music's the thing.",
 'genres': ['Comedy', 'Drama', 'Musical'],
 'imdb': {'id': 34862, 'rating': 7.6, 'votes': 7890},
 'languages': ['English'],
 'lastupdated': '2015-08-05 01:05:53.350000000',
 'num_mflix_comments': 0,
 'plot': 'At an inn which is only open on holidays, a crooner and a hoofer vie '
         'for the affections of a beautiful up-and-

### Exercise 2: Aggregation Pipeline
**Task**: Write an aggregation pipeline to find the top 5 directors by the average IMDb rating of their movies.


In [None]:

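# Your task: Write an aggregation pipeline to calculate average IMDb rating by director
# Note: "directors" is an array, so grouping on it directly treats each distinct
# list of directors as one key (the _id values in the output are lists).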
pipeline = [
    {"$group": {"_id": "$directors", "avg_rating": {"$avg": "$imdb.rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
]

# Execute the pipeline and print results
avg_rating_by_director = collection.aggregate(pipeline)
for data in avg_rating_by_director:
    pprint.pprint(data)


{'_id': ['Sara Hirsh Bordo'], 'avg_rating': 9.4}
{'_id': ['Kevin Derek'], 'avg_rating': 9.3}
{'_id': ['Michael Benson'], 'avg_rating': 9.0}
{'_id': ['Slobodan Sijan'], 'avg_rating': 8.95}
{'_id': ['Sundar C.'], 'avg_rating': 8.9}
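
The `_id` values above are lists because the pipeline groups on the whole `directors` array. To average per individual director, one option is to add a `$unwind` stage first, as in the genre example earlier. A sketch under that assumption; `top_directors_pipeline` is a hypothetical helper, and results are not shown because they depend on the data:

```python
def top_directors_pipeline(limit=5):
    """Aggregation pipeline: average IMDb rating per individual director."""
    return [
        {"$unwind": "$directors"},  # one document per director
        {"$group": {"_id": "$directors", "avg_rating": {"$avg": "$imdb.rating"}}},
        {"$sort": {"avg_rating": -1}},
        {"$limit": limit},
    ]


pipeline = top_directors_pipeline()
print(pipeline[0])  # {'$unwind': '$directors'}

# To run it against the notebook's collection:
# for row in collection.aggregate(pipeline):
#     pprint.pprint(row)
```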


### Exercise 3: Create an Index and Measure Performance
**Task**: Create an index on the `imdb.rating` field. Measure performance before and after creating the index.


In [None]:

# Task: Create an index on imdb.rating and query before and after indexing

# Query without index
from time import time

start_time = time()
no_index_result = collection.find({"imdb.rating": {"$gt": 8}}).limit(5)
print("Time without index:", time() - start_time)

# Create an index
collection.create_index([("imdb.rating", 1)])

# Query with index
start_time = time()
with_index_result = collection.find({"imdb.rating": {"$gt": 8}}).limit(5)
print("Time with index:", time() - start_time)

# Print the results
for result in with_index_result:
    pprint.pprint(result)

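# The two timings differ by only about 0.0000014 s. find() returns a lazy cursor and
# sends no query until the cursor is iterated, so both timings measure cursor creation,
# not query execution.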

Time without index: 0.0001518726348876953
Time with index: 0.00015044212341308594
{'_id': ObjectId('573a1391f29313caabcd72f0'),
 'awards': {'nominations': 0, 'text': '2 wins.', 'wins': 2},
 'cast': ['Richard Barthelmess',
          'Gladys Hulette',
          'Walter P. Lewis',
          'Ernest Torrence'],
 'countries': ['USA'],
 'directors': ['Henry King'],
 'fullplot': 'When three thuggish men are responsible for the death of his '
             'father and the crippling of his brother, young David must choose '
             'between supporting his family or risking his life and exacting '
             'vengeance.',
 'genres': ['Drama'],
 'imdb': {'id': 12763, 'rating': 8.1, 'votes': 1455},
 'lastupdated': '2015-08-23 01:12:08.943000000',
 'num_mflix_comments': 0,
 'plot': 'When three thuggish men are responsible for the death of his father '
         'and the crippling of his brother, young David must choose between '
         'supporting his family or risking his life and exacting 
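
Since `find()` only builds a cursor, a timing is meaningful only if the results are actually fetched inside the timed region. A minimal sketch of one way to do that, assuming a pymongo collection is passed in; `time_query` is a hypothetical helper:

```python
from time import perf_counter


def time_query(collection, query, limit=5):
    """Time a query including retrieval of its results.

    list() iterates the cursor, which is what sends the query to the server.
    """
    start = perf_counter()
    docs = list(collection.find(query).limit(limit))
    return perf_counter() - start, docs


# Usage with the notebook's collection:
# elapsed, docs = time_query(collection, {"imdb.rating": {"$gt": 8}})
# print("Elapsed seconds:", elapsed)
```

Even with this change, a small `limit` over a network connection may be dominated by round-trip latency, so the difference between indexed and unindexed runs can stay small. Calling `explain()` on the cursor is another way to check whether an index was used.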