# MongoDB Assignment

## Theoretical Questions

Q1. What are the key differences between SQL and NoSQL databases?

<img src = 'Ans1.png'>

Q2. What makes MongoDB a good choice for modern applications?

MongoDB is a great choice for modern applications because of its:

1. Flexible Schema – It stores data in a JSON-like format (BSON), allowing dynamic fields and easy schema evolution without complex migrations.

2. High Scalability – Supports horizontal scaling using sharding, enabling applications to handle large volumes of data and traffic efficiently.

3. Fast Performance – Uses indexes, an in-memory cache for frequently accessed data, and a query optimizer to keep read and write operations fast.

4. Cloud-Native & Distributed – Works seamlessly with cloud platforms (MongoDB Atlas), ensuring high availability, automatic backups, and easy scaling.

This makes MongoDB well suited to real-time analytics, IoT, content management, and scalable web applications.

Q3. Explain the concept of collections in MongoDB.

Collections in MongoDB

In MongoDB, a collection is a group of documents, similar to a table in relational databases. However, unlike tables, collections in MongoDB:

1. Do Not Have a Fixed Schema – Documents in a collection can have different fields, allowing flexibility.

2. Store Data as BSON Documents – Each document is in JSON-like format, making it easy to store and retrieve data.

3. Are Automatically Created – Collections are created when a document is inserted, without needing predefined structures.
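
The three points above can be sketched in PyMongo. This is a minimal illustration, not a definitive implementation: the `people` collection, the sample documents, and the helper name `insert_people` are hypothetical, and `db` is assumed to be a PyMongo database object.

```python
# Two documents with different fields can live in the same collection.
people_docs = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "age": 31, "tags": ["admin"]},
]


def insert_people(db):
    """Insert the sample documents; `db` is assumed to be a pymongo Database.

    The hypothetical "people" collection is created implicitly by the
    first insert if it does not exist yet.
    """
    result = db["people"].insert_many(people_docs)
    return result.inserted_ids
```

With a running local server, this could be called as `insert_people(pymongo.MongoClient("mongodb://localhost:27017/")["demo_db"])`, where `demo_db` is again a made-up name.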

Q4. How does MongoDB ensure high availability using replication?

MongoDB ensures high availability using replication, specifically through Replica Sets. A Replica Set is a group of MongoDB instances that maintain the same data to provide redundancy and failover capabilities. Here's how it works:

1. Primary-Secondary Architecture

   A replica set consists of:
   - Primary node: Receives all writes and, by default, all reads (reads can be directed to secondaries through a read preference).
   - Secondary nodes: Maintain copies of the primary’s data via replication.
   - Arbiter (optional): Votes in elections but does not store data.

2. Automatic Failover

   - If the primary becomes unavailable, the remaining members automatically elect a new primary from the secondaries.
   - The election usually completes within a few seconds (typically under 12 seconds with default settings).
   - Once a new primary is elected, applications can resume operations with minimal downtime.

3. Data Redundancy and Synchronization

   - The primary records every data change in its oplog (operations log), and the secondaries copy these entries.
   - Secondaries apply the operations in the same order to stay synchronized.
   - A secondary that falls too far behind the oplog must be resynchronized from another member. A rollback is a different case: it reverts writes on a former primary that were never replicated before a failover.
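
From the application side, failover is handled mostly by the driver, which discovers the members and follows the primary when it is given the replica set name. The sketch below only builds such a connection string; the host names and the set name `rs0` are hypothetical, while `replicaSet` and `readPreference` are standard connection string options.

```python
def replica_set_uri(hosts, replica_set, read_preference="primary"):
    """Build a MongoDB connection string that lists several replica set members."""
    return (
        "mongodb://" + ",".join(hosts)
        + f"/?replicaSet={replica_set}&readPreference={read_preference}"
    )


# Hypothetical three-member replica set; reads may go to secondaries when available.
uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    replica_set="rs0",
    read_preference="secondaryPreferred",
)
print(uri)
```

Passing this string to `pymongo.MongoClient` would let the driver reconnect to the newly elected primary without any change to the application code.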

Q5. What are the main benefits of MongoDB Atlas?

MongoDB Atlas offers several key benefits, including:

1. Fully Managed Service – It automates database deployment, scaling, and maintenance, reducing operational overhead.

2. Scalability – Supports auto-scaling and sharding to handle growing workloads efficiently.

3. High Availability – Provides built-in redundancy and failover with multi-region replication.

4. Security – Features encryption, role-based access control (RBAC), and compliance with industry standards like GDPR and HIPAA.

5. Performance Optimization – Includes automated performance monitoring, indexing suggestions, and workload optimization tools.

6. Multi-Cloud Support – Allows deployment across AWS, Azure, or Google Cloud for flexibility and resilience.



Q6. What is the role of indexes in MongoDB, and how do they improve performance?

Role of Indexes in MongoDB:

Indexes in MongoDB play a crucial role in improving query performance by enabling efficient data retrieval. Without indexes, MongoDB must scan every document in a collection to fulfill a query, which is slow for large datasets.

How Indexes Improve Performance:

1. Faster Query Execution – Indexes allow MongoDB to locate documents quickly without scanning the entire collection.
   
2. Efficient Sorting – Queries with sort() perform better when indexed fields are used.

3. Optimized Filtering – Queries with indexed fields reduce the number of documents MongoDB needs to inspect.

4. Supports Unique Constraints – Unique indexes ensure data integrity by preventing duplicate values in a field.

5. Improved Aggregation Performance – Indexes speed up aggregation pipelines by reducing the amount of data processed.
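
As a rough sketch of how such indexes are created with PyMongo, assuming `collection` is a PyMongo collection holding Superstore-style documents. The field names and the helper `create_sales_indexes` are illustrative assumptions; `create_index` is the PyMongo call, and 1 / -1 stand for ascending / descending order.

```python
def create_sales_indexes(collection):
    """Create example indexes; `collection` is assumed to be a pymongo Collection."""
    names = []
    # Single-field index: speeds up filters such as {"Region": "West"}.
    names.append(collection.create_index([("Region", 1)]))
    # Compound index: supports filtering by Region and sorting by Sales (descending).
    names.append(collection.create_index([("Region", 1), ("Sales", -1)]))
    # Unique index: rejects a second document with the same "Row ID".
    names.append(collection.create_index([("Row ID", 1)], unique=True))
    return names
```

Each `create_index` call returns the name of the index, so the function returns the three names in creation order.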

Q7. Describe the stages of the MongoDB aggregation pipeline.

The MongoDB aggregation pipeline processes data through multiple stages to transform and analyze documents efficiently. Key stages include:

$match (filters documents)

$group (aggregates data)

$project (reshapes documents)

$sort (orders results)

$unwind (splits arrays into multiple documents)

$lookup (performs joins)

$out (writes results to a collection)

$addFields (adds computed fields)

These stages can be combined for powerful data processing and analytics.
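
A small pipeline combining four of these stages might look like the sketch below. The field names (`Region`, `Category`, `Sales`) are assumed from the Superstore dataset used later in this assignment, and the output field names are made up for illustration.

```python
pipeline = [
    # 1. Keep only documents from the West region.
    {"$match": {"Region": "West"}},
    # 2. Group the remaining documents by category and sum their sales.
    {"$group": {"_id": "$Category", "totalSales": {"$sum": "$Sales"}}},
    # 3. Order the groups from highest to lowest total.
    {"$sort": {"totalSales": -1}},
    # 4. Rename _id to category and drop the original _id field.
    {"$project": {"_id": 0, "category": "$_id", "totalSales": 1}},
]

# Each stage is a one-key dictionary; the key is the stage name.
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)
```

With PyMongo the list is passed unchanged to `collection.aggregate(pipeline)`; documents flow through the stages in list order.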

Q8. What is sharding in MongoDB? How does it differ from replication?

Sharding is a method of distributing data across multiple servers to handle large datasets and high throughput. It ensures horizontal scalability by partitioning data into smaller, more manageable pieces (shards).

Uses a shard key to determine how data is distributed.

Each shard stores only a subset of the data.

A mongos router directs queries to the correct shard(s).
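
The idea of routing by shard key can be illustrated with a toy model. This is not MongoDB's actual hashing or chunk-balancing logic; it only shows that a stable hash of the shard key sends each document to exactly one shard. The function name and the customer IDs are made up.

```python
import hashlib


def shard_for(shard_key_value, num_shards):
    """Map a shard key value to a shard index with a stable hash (toy model)."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Distribute hypothetical customer IDs across three shards.
customer_ids = ["C-1001", "C-1002", "C-1003", "C-1004", "C-1005", "C-1006"]
shards = {i: [] for i in range(3)}
for customer_id in customer_ids:
    shards[shard_for(customer_id, 3)].append(customer_id)

for shard_index, members in shards.items():
    print(shard_index, members)
```

Because the mapping depends only on the key, a query that includes the shard key can be sent to a single shard, while a query without it has to be broadcast to all of them.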

<img src = 'Ans.png'>

Q9. What is PyMongo, and why is it used?

PyMongo is the official Python driver for MongoDB, allowing Python applications to interact with MongoDB databases. It provides an interface to perform CRUD operations (Create, Read, Update, Delete), run aggregation pipelines, and manage collections.

Why is PyMongo Used?

1. Database Connectivity – Enables Python applications to connect to MongoDB.

2. Easy CRUD Operations – Simplifies inserting, querying, updating, and deleting documents.


3. Aggregation Support – Allows running MongoDB aggregation pipelines.


4. Index Management – Helps create and manage indexes for performance optimization.


5. Scalability & Replication Support – Works with sharded clusters and replica sets.

Q10. What are the ACID properties in the context of MongoDB transactions?

In the context of MongoDB transactions, the ACID properties ensure data integrity and consistency. ACID stands for:

Atomicity: A transaction is treated as a single unit—either all operations within the transaction succeed, or none of them are applied. If an error occurs, MongoDB rolls back the transaction.

Consistency: The database remains in a valid state before and after the transaction. Any changes made within a transaction must adhere to the schema and validation rules of the database.

Isolation: Transactions execute independently, ensuring that intermediate changes are not visible to other operations until the transaction is committed. In MongoDB, transactions follow snapshot isolation, meaning they use a consistent view of the data when executing.

Durability: Once a transaction is committed, changes are permanently saved in the database, even if there is a system crash or failure. MongoDB achieves this by writing data to the journal before confirming the commit.
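
A minimal sketch of a multi-document transaction in PyMongo, assuming `client` is a `MongoClient` connected to a replica set (transactions are not available on a standalone server). The `bank` database, the `accounts` collection, and the `transfer` helper are hypothetical.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two account documents atomically.

    `client` is assumed to be a pymongo MongoClient connected to a
    replica set; the database and collection names are made up.
    """
    accounts = client["bank"]["accounts"]
    with client.start_session() as session:
        # Both updates commit together when the block exits normally;
        # an exception raised inside the block aborts the transaction.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
            )
```

Passing `session=session` to each operation is what ties it to the transaction; an operation issued without the session would run outside it.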

Q11. What is the purpose of MongoDB’s explain() function?

MongoDB’s explain() function is used to analyze how a query is executed, helping developers and database administrators optimize performance. It provides insights into how MongoDB processes a query by detailing aspects such as:

Query Execution Plan: Shows whether an index was used, if a collection scan occurred, or how many documents were examined.

Execution Statistics: In executionStats mode, reports the execution time along with the number of index keys and documents examined, which shows how efficient the query is.

Index Usage: Indicates whether an index was used and which one, aiding in performance tuning.
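
The sketch below pulls these headline values out of an explain result. The helper `summarize_explain` is hypothetical, and `sample_plan` is a hand-written dictionary with illustrative numbers, shaped like the output for an unindexed query; it is not real server output.

```python
def summarize_explain(plan):
    """Pull the headline numbers out of an explain() result dictionary."""
    stats = plan.get("executionStats", {})
    winning = plan.get("queryPlanner", {}).get("winningPlan", {})
    return {
        "stage": winning.get("stage"),  # e.g. COLLSCAN for a full collection scan
        "docs_examined": stats.get("totalDocsExamined"),
        "keys_examined": stats.get("totalKeysExamined"),
        "returned": stats.get("nReturned"),
        "millis": stats.get("executionTimeMillis"),
    }


# Hand-written sample (illustrative values only).
sample_plan = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {
        "nReturned": 12,
        "totalKeysExamined": 0,
        "totalDocsExamined": 1000,
        "executionTimeMillis": 7,
    },
}
print(summarize_explain(sample_plan))
```

With a live collection, something like `summarize_explain(collection.find({"Region": "West"}).explain())` should give the same shape of summary. A large gap between documents examined and documents returned usually suggests that an index is missing.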

Q12. How does MongoDB handle schema validation?

Schema Validation in MongoDB

MongoDB does not require a fixed schema, but it supports schema validation to enforce data integrity within collections. Validation rules are defined at the collection level, most commonly with the $jsonSchema operator.

Key Features of Schema Validation:

Ensures data consistency by defining rules for document structure.

Applies validation rules at the time of insert and update operations.

Specifies field types with BSON type names (e.g., string, int, double, object).
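
A minimal sketch of a validator, assuming `db` is a PyMongo database object. The collection name `validated_orders`, the chosen fields, and the helper name are illustrative; `$jsonSchema`, `bsonType`, `required`, `properties`, and `minimum` are standard validation keywords.

```python
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["Order ID", "Sales"],
        "properties": {
            "Order ID": {"bsonType": "string"},
            # Sales must be numeric and non-negative.
            "Sales": {"bsonType": ["double", "int"], "minimum": 0},
        },
    }
}


def create_validated_collection(db, name="validated_orders"):
    """Create a collection that rejects documents failing the schema.

    `db` is assumed to be a pymongo Database.
    """
    return db.create_collection(name, validator=order_validator)
```

After this, inserting a document without a `Sales` field into the collection should be rejected by the server with a validation error.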


Q13. What is the difference between a primary and a secondary node in a replica set?

<img src = 'A.png'>

Q14. What security mechanisms does MongoDB provide for data protection?

MongoDB Security Mechanisms

1. Authentication – Verifies users using SCRAM, X.509, LDAP, or Kerberos.
   
2. Authorization (RBAC) – Controls user permissions with role-based access control.

3. TLS/SSL Encryption – Secures data in transit between clients and servers.

4. Encryption at Rest – Protects stored data using the WiredTiger encrypted storage engine (available in MongoDB Enterprise and Atlas).

5. Firewall & IP Whitelisting – Restricts database access to trusted IPs.

6. Audit Logging – Tracks database events and user actions for security monitoring.

7. Backup & Recovery – Ensures data protection using replication, snapshots, and backups.

Q15. Explain the concept of embedded documents and when they should be used.

In MongoDB, embedded documents (also called nested documents) allow storing related data within a single document instead of using separate collections and references. This follows a denormalized approach, improving performance by reducing the need for joins.

When to Use Embedded Documents?

✅ Use Embedded Documents When:

1. One-to-Few Relationship – Example: A user profile with an embedded address or preferences.

2. Data is Frequently Accessed Together – Example: Order details (products, price, shipping) inside an orders collection.

3. No Need for Separate Queries – If you always fetch data together, embedding reduces the need for joins.
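
A small example of the one-to-few case. The user document and its field names are made up for illustration, and `get_path` is a hypothetical helper that mimics how dot notation walks into an embedded document.

```python
# A user document with two embedded documents.
user = {
    "_id": "u1",
    "name": "Asha",
    "address": {"city": "Pune", "zip": "411001"},
    "preferences": {"newsletter": True, "theme": "dark"},
}

# MongoDB queries reach into embedded documents with dot notation.
query = {"address.city": "Pune"}


def get_path(doc, dotted):
    """Follow a dotted path such as "address.city" through nested dictionaries."""
    for part in dotted.split("."):
        doc = doc[part]
    return doc


print(get_path(user, "address.city"))
```

The whole profile comes back in a single read, for example `collection.find_one(query)` in PyMongo, with no second query for the address.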

Q16. What is the purpose of MongoDB’s $lookup stage in aggregation?

Purpose of MongoDB’s $lookup Stage in Aggregation

The $lookup stage in MongoDB’s aggregation pipeline performs a left outer join between two collections. It is used to retrieve related data from another collection and embed it in the query result.

When to Use $lookup?

✅ Use $lookup when:

1. You need to join two collections (similar to SQL joins).

2. You want to avoid data duplication by keeping related data in separate collections.

3. You need to retrieve additional details for a query result.
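
A sketch of a $lookup pipeline, assuming hypothetical `orders` and `customers` collections in which each order stores a `customer_id` that matches a customer's `_id`.

```python
lookup_pipeline = [
    {
        "$lookup": {
            "from": "customers",          # collection to join with
            "localField": "customer_id",  # field in the input (orders) documents
            "foreignField": "_id",        # field in the customers documents
            "as": "customer",             # name of the output array field
        }
    },
    # $lookup always produces an array; $unwind flattens it to one document.
    # preserveNullAndEmptyArrays keeps orders that have no matching customer,
    # which preserves the left outer join behaviour.
    {"$unwind": {"path": "$customer", "preserveNullAndEmptyArrays": True}},
]
```

Run against the orders collection, for example `db["orders"].aggregate(lookup_pipeline)`, each result is an order with its customer embedded under `customer`.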

Q17. What are some common use cases for MongoDB?

Common Use Cases for MongoDB

1. Real-Time Analytics – Used in financial markets and website traffic monitoring.

2. Content Management – Ideal for blogs, product catalogs, and news platforms.

3. IoT & Sensor Data – Stores smart home and industrial sensor logs.

4. E-Commerce – Manages product inventories, customer orders, and reviews.

5. Gaming – Tracks player profiles, in-game purchases, and leaderboards.

6. Social Media & Chat Apps – Handles messages, notifications, and user interactions.

7. Geospatial Apps – Used in ride-hailing services and location tracking.

8. Healthcare – Stores patient records and medical history securely.

Q18. What are the advantages of using MongoDB for horizontal scaling?

Advantages of Using MongoDB for Horizontal Scaling:

1. Sharding – Distributes data across multiple servers, enabling scalability.
   
2. Automatic Data Distribution – Handles data balancing automatically across nodes.

3. High Availability – Uses replica sets for data redundancy and automatic failover.

4. Easy to Scale Out – Add more shards to scale horizontally, typically without taking the cluster offline.

5. Efficient Data Handling – Handles large datasets and high-traffic applications.

6. Flexible Schema – Easily adapts to changing data models as the app grows.

Q19. How do MongoDB transactions differ from SQL transactions?

MongoDB transactions and SQL transactions differ in several ways, mainly due to their underlying database architectures. Here are the key differences:

1. Data Model

   - MongoDB: Uses a document-based NoSQL model where data is stored in JSON-like BSON documents.
   - SQL Databases: Use a relational model with tables, rows, and columns.

2. ACID Compliance

   - MongoDB: Single-document operations have always been atomic. Multi-document ACID transactions are supported on replica sets since version 4.0 and on sharded clusters since version 4.2, but they carry performance overhead.
   - SQL Databases: Designed with ACID compliance at their core, ensuring strong consistency across multiple tables and rows.

3. Scope of Transactions

   - MongoDB: Transactions are mainly used for multi-document operations within a single replica set or sharded cluster.
   - SQL Databases: Transactions can span multiple tables, rows, and even databases.

4. Performance Impact

   - MongoDB: Since MongoDB is optimized for high-speed document-based operations, transactions introduce additional overhead, potentially reducing performance.
   - SQL Databases: Transactions are natively integrated and optimized for handling multiple complex operations efficiently.

5. Concurrency Control

   - MongoDB: The WiredTiger storage engine uses optimistic concurrency control; when two operations conflict on the same document, one of them fails with a write conflict and is retried.
   - SQL Databases: Depending on the engine and isolation level, use row or table locks, multi-version concurrency control (MVCC), or a combination of both.

Q20. What are the main differences between capped collections and regular collections?

<img src = 'A2.png'>

Q21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?

Purpose of the $match Stage in MongoDB’s Aggregation Pipeline

The $match stage is used to filter documents based on specified conditions, similar to the find() query but within the aggregation pipeline. It improves performance by reducing the number of documents processed in later stages.

Key Features:
Acts as a filter to include only relevant documents.

Improves efficiency by reducing the dataset early in the pipeline.


Q22. How can you secure access to a MongoDB database?

To secure a MongoDB database:

1. Enable Authentication – Use username-password authentication (SCRAM-SHA-256).

2. Role-Based Access Control (RBAC) – Assign minimal privileges using roles (read, readWrite).

3. Enable TLS/SSL Encryption – Encrypt data in transit using SSL/TLS.

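4. Restrict Network Exposure – Use bindIp to limit the network interfaces MongoDB listens on, and use an IP access list (in Atlas) or firewall rules to limit which client IPs can connect.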

5. Disable Remote Access – Run MongoDB on localhost unless external access is needed.

6. Enable Firewall Rules – Use firewalls (UFW, IPTables) to block unauthorized access.

7. Encrypt Data at Rest – Use the WiredTiger encrypted storage engine (available in MongoDB Enterprise and Atlas) for stored data.

8. Audit & Monitor – Enable logging and use monitoring tools (MongoDB Atlas, Prometheus).
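
On the client side, the first and third points show up in the connection string. The sketch below builds one with credentials and TLS enabled; the user name, password, and host are placeholders, and `secure_uri` is a hypothetical helper. Credentials are percent-encoded because characters such as `@` and `/` would otherwise break the URI.

```python
from urllib.parse import quote_plus


def secure_uri(user, password, host, auth_source="admin"):
    """Build a connection string with credentials and TLS enabled."""
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}"
        f"/?authSource={auth_source}&tls=true"
    )


# Placeholder credentials and host, for illustration only.
print(secure_uri("app_user", "p@ss/word", "db.example.net:27017"))
# prints mongodb://app_user:p%40ss%2Fword@db.example.net:27017/?authSource=admin&tls=true
```

In practice the password would come from an environment variable or a secrets manager rather than being written in the source code.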








Q23. What is MongoDB’s WiredTiger storage engine, and why is it important?

MongoDB’s WiredTiger Storage Engine and Its Importance

WiredTiger is MongoDB’s default storage engine (since version 3.2), designed for high performance, concurrency, and efficient memory use.

Key Features & Importance:

1. Document-Level Concurrency – Controls concurrency at the document level, so multiple clients can modify different documents in the same collection at the same time.

2. Compression – Uses Snappy, Zlib, or Zstd to reduce storage footprint.

3. Journaling – Ensures data durability and crash recovery.

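4. Checkpointing – Writes a consistent snapshot of the data to disk at regular intervals (every 60 seconds by default); after a crash, recovery starts from the last checkpoint and replays the journal.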

5. Indexing Efficiency – Uses B-Trees for faster lookups.

## Practical Questions

Q1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB.

In [6]:
import pandas as pd
import pymongo

# MongoDB connection details
MONGO_URI = "mongodb://localhost:27017/"  # Change if needed
DB_NAME = "superstore_db"
COLLECTION_NAME = "sales"
CSV_FILE_PATH = "Superstore.csv"  # Change to the actual path of your dataset

# Establish connection to MongoDB
client = pymongo.MongoClient(MONGO_URI)
db = client[DB_NAME]
collection = db[COLLECTION_NAME]

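# Load CSV data into a DataFrame with encoding handling.
# Try UTF-8 first: ISO-8859-1 accepts any byte sequence, so it never raises
# UnicodeDecodeError and has to be the fallback rather than the first attempt.
try:
    df = pd.read_csv(CSV_FILE_PATH, encoding="utf-8-sig")
except UnicodeDecodeError:
    df = pd.read_csv(CSV_FILE_PATH, encoding="ISO-8859-1")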

# Convert DataFrame to dictionary format for MongoDB
data = df.to_dict(orient="records")

# Insert data into MongoDB
collection.insert_many(data)

print(f"Inserted {len(data)} records into MongoDB collection '{COLLECTION_NAME}'.")


Inserted 9994 records into MongoDB collection 'sales'.


Q2. Retrieve and print all documents from the Orders collection.

In [13]:
import pymongo
from pprint import pprint

# MongoDB connection details
MONGO_URI = "mongodb://localhost:27017/"  # Change if needed
DB_NAME = "superstore_db"

# Establish connection to MongoDB.
# The data was already loaded in Q1; reloading the CSV here would insert
# every record a second time.
client = pymongo.MongoClient(MONGO_URI)
db = client[DB_NAME]

# Retrieve and print all documents from the Orders collection.
# NOTE: Q1 inserted the data into the "sales" collection, so "Orders" is
# empty and nothing is found below; use db["sales"] to query the loaded data.
orders_collection = db["Orders"]
all_orders = list(orders_collection.find())

if all_orders:
    print("Orders Collection Documents:")
    for order in all_orders:
        pprint(order)
else:
    print("No documents found in the Orders collection.")


No documents found in the Orders collection.


Q3. Count and display the total number of documents in the Orders collection.

In [16]:
# Count and display the total number of documents in the Orders collection.
# Reuses orders_collection from Q2; the count is 0 because Q1 loaded the
# data into the "sales" collection, not "Orders".
total_orders = orders_collection.count_documents({})
print(f"Total number of documents in the Orders collection: {total_orders}")


Total number of documents in the Orders collection: 0


Q4. Write a query to fetch all orders from the "West" region.

In [17]:
# Fetch and print all orders from the 'West' region.
# Reuses orders_collection and pprint from Q2; no documents are printed
# because "Orders" is empty (the data was loaded into "sales").
west_orders = list(orders_collection.find({"Region": "West"}))
print("Orders from the West region:")
for order in west_orders:
    pprint(order)


Orders from the West region:


Q5. Write a query to find orders where Sales is greater than 500.

In [19]:
# Fetch and print all orders where Sales is greater than 500
high_sales_orders = list(orders_collection.find({"Sales": {"$gt": 500}}))
print("Orders where Sales is greater than 500:")
for order in high_sales_orders:
    pprint(order)

Orders where Sales is greater than 500:


Q6. Fetch the top 3 orders with the highest Profit.

In [20]:
# Fetch and print the top 3 orders with the highest Profit
top_profit_orders = list(orders_collection.find().sort("Profit", -1).limit(3))
print("Top 3 orders with the highest Profit:")
for order in top_profit_orders:
    pprint(order)


Top 3 orders with the highest Profit:


Q7. Update all orders with Ship Mode as "First Class" to "Premium Class".

In [21]:
# Update all orders with Ship Mode as "First Class" to "Premium Class"
update_result = orders_collection.update_many(
    {"Ship Mode": "First Class"},
    {"$set": {"Ship Mode": "Premium Class"}}
)
print(f"Updated {update_result.modified_count} documents where Ship Mode was 'First Class'.")


Updated 0 documents where Ship Mode was 'First Class'.


Q8. Delete all orders where Sales is less than 50.

In [22]:
# Delete all orders where Sales is less than 50
delete_result = orders_collection.delete_many({"Sales": {"$lt": 50}})
print(f"Deleted {delete_result.deleted_count} documents where Sales was less than 50.")


Deleted 0 documents where Sales was less than 50.


Q9. Use aggregation to group orders by Region and calculate total sales per region.

In [23]:
# Use aggregation to group orders by Region and calculate total sales per region
region_sales = list(orders_collection.aggregate([
    {"$group": {"_id": "$Region", "TotalSales": {"$sum": "$Sales"}}}
]))

print("Total Sales per Region:")
for region in region_sales:
    pprint(region)

Total Sales per Region:


Q10. Fetch all distinct values for Ship Mode from the collection.

In [24]:
# Fetch all distinct values for Ship Mode
distinct_ship_modes = orders_collection.distinct("Ship Mode")
print("Distinct Ship Modes:")
pprint(distinct_ship_modes)

Distinct Ship Modes:
[]


Q11. Count the number of orders for each category.

In [25]:
# Count the number of orders for each category
category_orders = list(orders_collection.aggregate([
    {"$group": {"_id": "$Category", "TotalOrders": {"$sum": 1}}}
]))

print("Total Orders per Category:")
for category in category_orders:
    pprint(category)


Total Orders per Category:
