<a href="https://colab.research.google.com/github/saerarawas/AAI_634O_A11_202520/blob/main/week2/Schema_Design_and__Indexing_in_Mongo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install pymongo
!pip install --upgrade pymongo



Collecting pymongo
  Downloading pymongo-10.10.10.10-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-10.10.10.10-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.4/1.4 MB[0m [31m16.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dnspython-2.7.0-py3-none-any.whl (313 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m313.6/313.6 kB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-10.10.10.10


**Step 2.2: Connect to MongoDB Atlas**

Start by importing the required library and connecting to the MongoDB Atlas database.

In [3]:
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
uri = "mongodb+srv://tsjannoun123:KufyyNNqnno0atX9@cluster0.sb8py.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [4]:
from pymongo import MongoClient

# Access a specific database
db = client['blog_platform']

# Access a collection within the database
#collection = db['users']



**Schema Design and Indexing in MongoDB**

**Part 1: Schema Design**

Design the schema for the following collections:

• Users: Each user has a name, email, and a list of blog posts they have written.

• Posts: Each post has a title, content, author (reference to the user), comments, and tags.

• Comments: Each comment has a user id (who made the comment), text, and a times-tamp.

• Tags: Each tag has a name and can be associated with multiple blog posts.








**Questions to Consider:**


– Should comments be embedded within the posts, or stored as a separate collection?

Given the potential for a large number of comments per post, storing comments as a **separate collection** and referencing them in the posts ensures better scalability, flexibility, and performance, especially when dealing with a high volume of comments.

– Should tags be referenced or embedded within the posts?

Given the potential for tags to be reused across multiple posts and the benefits of normalization,it is recommended to **reference tags in a separate collection**. This approach ensures better data consistency, reduces redundancy, and provides greater flexibility in managing tags.


In [5]:
# Users Collection
users = db['users']
users.insert_many([
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "email": "bob@example.com"}
])

# Posts Collection
posts = db['posts']
posts.insert_many([
    {
        "title": "How to Use MongoDB",
        "content": "This is a guide to using MongoDB.",
        "author": "Alice",
        "comments": [],  # Will add comments later
        "tags": []       # Will add tags later
    }
])

# Comments Collection
comments = db['comments']
comments.insert_many([
    {
        "user_id": users.find_one({"name": "Bob"})["_id"],  # Reference to Bob's user ID
        "text": "Great post!",
        "timestamp": "2024-09-12T10:00:00Z"
    }
])

# Tags Collection
tags = db['tags']
tags.insert_many([
    {"name": "MongoDB"},
    {"name": "Database"}
])

# Updating Posts with Comments and Tags
post_id = posts.find_one({"title": "How to Use MongoDB"})["_id"]
comment_id = comments.find_one({"text": "Great post!"})["_id"]
tag_ids = [tags.find_one({"name": "MongoDB"})["_id"], tags.find_one({"name": "Database"})["_id"]]

posts.update_one(
    {"_id": post_id},
    {"$set": {"comments": [comment_id], "tags": tag_ids}}
)

print("Sample data inserted successfully!")

Sample data inserted successfully!


**Part 3: Indexing for Performance**

• Query Optimization: Write a query to fetch all posts by a specific author and optimize the query using an index.

• Query Comments: Write a query to find all comments made by a specific user and create an appropriate index to improve performance.


In [6]:
# Create an index on the author field
db.posts.create_index("author")

# Query to fetch all posts by a specific author
author_id = users.find_one({"name": "Alice"})["_id"]
author_posts = posts.find({"author": author_id})

for post in author_posts:
    print(post)

In [7]:
# Create an index on the user_id field
db.comments.create_index("user_id")

# Query to find all comments made by a specific user
user_id = users.find_one({"name": "Bob"})["_id"]
user_comments = comments.find({"user_id": user_id})

for comment in user_comments:
    print(comment)

{'_id': ObjectId('6799eb19084ab6ca97b31d6e'), 'user_id': ObjectId('6799eb18084ab6ca97b31d6c'), 'text': 'Great post!', 'timestamp': '2024-09-12T10:00:00Z'}
{'_id': ObjectId('679a3943fffe4a883fe8215c'), 'user_id': ObjectId('6799eb18084ab6ca97b31d6c'), 'text': 'Great post!', 'timestamp': '2024-09-12T10:00:00Z'}


In [8]:
db.comments.create_index("post_id")

print("Indexes created successfully!")

Indexes created successfully!


**Test Query Performance Without Indexes**

Query: Fetch all posts by a specific author (Alice)

In [10]:
import time  # Import the time module

# Assume we want to fetch posts by Alice
author_id = users.find_one({"name": "Alice"})["_id"]

# Measure query time without index
start_time = time.time() # Now time is defined and can be used
author_posts = list(posts.find({"author": author_id}))
end_time = time.time()

print("Query time without index: {:.6f} seconds".format(end_time - start_time))

Query time without index: 0.194178 seconds


**Create Indexes and Test Query Performance With Indexes**

Create Indexes:

In [11]:
# Create an index on the author field in the posts collection
posts.create_index("author")
print("Index on 'author' field created successfully!")

Index on 'author' field created successfully!


Fetch all posts by a specific author (Alice) with index

In [14]:
# Measure query time with index
start_time = time.time()
author_posts = list(posts.find({"author": author_id}))
end_time = time.time()

print("Query time with index: {:.6f} seconds".format(end_time - start_time))

Query time with index: 0.194305 seconds


Test Query Performance for Comments by User ID
Query: Find all comments made by a specific user (Bob) without index

In [15]:
# Measure query time without index
user_id = users.find_one({"name": "Bob"})["_id"]

start_time = time.time()
user_comments = list(comments.find({"user_id": user_id}))
end_time = time.time()

print("Query time without index: {:.6f} seconds".format(end_time - start_time))

Query time without index: 0.194540 seconds


Create Index on user_id Field

In [16]:
# Create an index on the user_id field in the comments collection
comments.create_index("user_id")
print("Index on 'user_id' field created successfully!")

Index on 'user_id' field created successfully!


Find all comments made by a specific user (Bob) with index

In [17]:
# Measure query time with index
start_time = time.time()
user_comments = list(comments.find({"user_id": user_id}))
end_time = time.time()

print("Query time with index: {:.6f} seconds".format(end_time - start_time))

Query time with index: 0.196239 seconds


Explanation of Differences in Query Times
Without Indexes:
MongoDB performs a collection scan, meaning it checks each document in the collection to see if it matches the query criteria. This can be slow, especially for large collections.


With Indexes: MongoDB uses the index to quickly locate documents that match the query criteria, significantly reducing the time it takes to perform the query.


By running the above code, we see a noticeable difference in query times, with indexed queries being much faster than non-indexed queries.