# MongoDB Assignment - Theoretical

1. **What are the key differences between SQL and NoSQL databases?**

    SQL databases are relational and use structured schemas with tables, rows, and columns to store data. They rely on Structured Query Language (SQL) for defining and manipulating data and are ideal for applications requiring complex queries, strict consistency, and well-defined relationships, such as banking or enterprise systems.

    NoSQL databases are non-relational and store data in flexible formats such as key-value pairs, documents, wide-column stores, or graphs. They are designed for horizontal scalability, speed, and handling unstructured or semi-structured data, making them suitable for real-time analytics, content management, and applications with rapidly changing data like social media or IoT platforms.
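
To make the contrast concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module for the relational side and plain dictionaries for the document side. The table name, field names, and sample values are illustrative assumptions, not part of the assignment dataset.

```python
import sqlite3

# Relational side: the table schema is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers (id, name, city) VALUES (1, 'Asha', 'Pune')")

# A row that uses a column the schema does not define is rejected.
try:
    conn.execute("INSERT INTO customers (id, name, phone) VALUES (2, 'Ravi', '555-0100')")
    sql_accepts_new_field = True
except sqlite3.OperationalError:
    sql_accepts_new_field = False

# Document side: each record carries its own fields, so shapes can differ.
documents = [
    {"_id": 1, "name": "Asha", "city": "Pune"},
    {"_id": 2, "name": "Ravi", "phone": "555-0100", "tags": ["vip"]},
]

print("SQL accepted undeclared column:", sql_accepts_new_field)
print("Fields per document:", [sorted(doc) for doc in documents])
```

In the relational case the schema would have to be altered before the new field could be stored; in the document case the second record simply carries extra fields.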

2. **What makes MongoDB a good choice for modern applications?**

    MongoDB is a great choice for modern applications due to its flexible document-based model, which allows easy handling of dynamic and unstructured data. It supports high scalability, real-time performance, and built-in features like replication and sharding, making it ideal for fast-changing, large-scale applications such as mobile apps, IoT, and real-time analytics.

3. **Explain the concept of collections in MongoDB.**

    In MongoDB, a collection is a group of documents, similar to a table in a relational database. Each document is a JSON-like object (stored as BSON) with dynamic fields, so documents in the same collection can have different structures. Collections do not require a fixed schema, which makes them well suited to varied or evolving data formats.


4. **How does MongoDB ensure high availability using replication?**

    MongoDB ensures high availability through a feature called replication, where data is copied across multiple servers using a replica set. A replica set includes a primary node that handles all writes and multiple secondary nodes that maintain copies of the data. If the primary node fails, one of the secondaries is automatically elected as the new primary, ensuring the database remains available without manual intervention.

5. **What are the main benefits of MongoDB Atlas?**

    MongoDB Atlas is a fully managed cloud database that offers auto-scaling, high availability, built-in security, and real-time performance monitoring. It supports multi-cloud deployment and reduces the burden of managing infrastructure, making it ideal for building scalable and secure modern applications.

6. **What is the role of indexes in MongoDB, and how do they improve performance?**

    In MongoDB, indexes play a key role in improving query performance by allowing the database to find and retrieve data faster. Without indexes, MongoDB must scan every document in a collection to find matches, which is slow for large datasets. Indexes help by creating a structured path to the data, reducing the amount of data scanned, and speeding up operations like search, sort, and filter.
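
As a rough sketch of how an index might be created from Python, the helper below calls `create_index` on a collection object. The choice of `Region` and `Sales` as index keys is an assumption about which fields are queried most often. The function takes the collection as a parameter so that it can be tried without a running server.

```python
# Same integer values that PyMongo exports as pymongo.ASCENDING / pymongo.DESCENDING.
ASCENDING, DESCENDING = 1, -1


def ensure_region_sales_index(collection):
    """Create a compound index on Region (ascending) and Sales (descending).

    `collection` is expected to be a PyMongo Collection, or any object
    with a compatible create_index method. Returns the index name.
    """
    return collection.create_index([("Region", ASCENDING), ("Sales", DESCENDING)])


# Usage against a real server (assumes a local MongoDB instance):
# from pymongo import MongoClient
# orders = MongoClient("mongodb://localhost:27017/")["SuperstoreDB"]["Orders"]
# print(ensure_region_sales_index(orders))
```

A query that filters on `Region` and sorts by `Sales` descending can then be answered from the index instead of a full collection scan.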

7. **Describe the stages of the MongoDB aggregation pipeline.**

    The MongoDB aggregation pipeline processes data through a sequence of stages, each transforming the documents as they pass through. Key stages include:

    $match – Filters documents based on specified conditions (like SQL WHERE).

    $group – Groups documents by a field and performs operations like sum, average, or count.

    $project – Reshapes documents by including, excluding, or computing new fields.

    $sort – Sorts documents based on one or more fields.

    $limit – Limits the number of documents passed to the next stage.

    $skip – Skips a specified number of documents, useful for pagination.
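
The stages above can be combined into a single pipeline. The sketch below builds one as a plain Python list, which is the form `collection.aggregate()` accepts in PyMongo. The function name, the threshold, and the output field names are illustrative assumptions. The field names `Sales` and `Region` follow the Superstore examples used later in this notebook.

```python
def top_regions_pipeline(min_sales=100, page=0, page_size=3):
    """Build a pipeline that ranks regions by total sales above a threshold."""
    return [
        # Filter early so later stages process fewer documents.
        {"$match": {"Sales": {"$gte": min_sales}}},
        # One output document per region, with a sum and a count.
        {"$group": {"_id": "$Region",
                    "total_sales": {"$sum": "$Sales"},
                    "orders": {"$sum": 1}}},
        # Rename _id to region and keep the computed fields.
        {"$project": {"_id": 0, "region": "$_id", "total_sales": 1, "orders": 1}},
        # Highest total first.
        {"$sort": {"total_sales": -1}},
        # Pagination: skip earlier pages, then cap the page size.
        {"$skip": page * page_size},
        {"$limit": page_size},
    ]


pipeline = top_regions_pipeline()
print([next(iter(stage)) for stage in pipeline])
```

For pagination, `$skip` is placed before `$limit` so that the limit applies to the page being requested rather than to the documents being skipped.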

8. **What is sharding in MongoDB? How does it differ from replication?**

    Sharding in MongoDB is used for scaling by distributing data across multiple servers, allowing the database to handle large volumes and high traffic. It improves performance by splitting the data set. In contrast, replication ensures high availability by copying the same data across multiple servers, providing backup and failover support.
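
As a hedged sketch, sharding a collection from Python could look like the helper below, which issues the `enableSharding` and `shardCollection` administrative commands. It assumes the client is connected to a `mongos` router of an already configured sharded cluster. The hashed `Order ID` shard key is only an assumption for illustration. On recent server versions the `enableSharding` step may not be required.

```python
def shard_orders_collection(client, database="SuperstoreDB", collection="Orders"):
    """Enable sharding for a database, then shard one collection on a hashed key.

    `client` is expected to be a PyMongo MongoClient connected to a mongos router.
    """
    client.admin.command("enableSharding", database)
    return client.admin.command(
        "shardCollection",
        f"{database}.{collection}",
        key={"Order ID": "hashed"},  # assumed shard key; choose one that fits your queries
    )
```

Replication needs no comparable per-collection step: every member of a replica set holds the full data set, whereas each shard holds only the subset of documents that its shard key range selects.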

9. **What is PyMongo, and why is it used?**

    PyMongo is the official Python driver for MongoDB. It allows Python applications to connect to, query, and manipulate MongoDB databases. PyMongo is used because it provides an easy and efficient way to interact with MongoDB from Python code, supporting operations like CRUD (Create, Read, Update, Delete), aggregation, indexing, and more.

10. **What are the ACID properties in the context of MongoDB transactions?**

    In MongoDB, ACID properties ensure that transactions are reliable and consistent. Here's what each property means:

    Atomicity: All operations in a transaction are completed or none at all.

    Consistency: The database moves from one valid state to another, maintaining data integrity.

    Isolation: Transactions are isolated from each other until completed, preventing conflicts.

    Durability: Once a transaction is committed, the changes are permanent, even after a crash.

    MongoDB supports multi-document ACID transactions on replica sets starting from version 4.0 and on sharded clusters starting from version 4.2. Operations on a single document have always been atomic.
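
A minimal sketch of a multi-document transaction with PyMongo's session API follows. The `ShopDB` database, the `Inventory` collection, and the `sku` and `qty` fields are hypothetical names chosen for illustration. A real run needs a replica set or sharded cluster, since standalone servers do not support transactions.

```python
def transfer_stock(client, source_sku, target_sku, quantity):
    """Move stock between two documents; both updates commit or neither does."""
    inventory = client["ShopDB"]["Inventory"]  # hypothetical database and collection
    with client.start_session() as session:
        # Leaving this block normally commits the transaction;
        # an exception raised inside it aborts the transaction.
        with session.start_transaction():
            inventory.update_one(
                {"sku": source_sku}, {"$inc": {"qty": -quantity}}, session=session
            )
            inventory.update_one(
                {"sku": target_sku}, {"$inc": {"qty": quantity}}, session=session
            )
```

Passing `session=session` to every operation is what ties it to the transaction. An operation called without the session would run outside it.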

11. **What is the purpose of MongoDB’s explain() function?**

    The purpose of MongoDB’s explain() function is to show how a query is executed. It returns the query plan, including whether an index is used. In its "executionStats" verbosity mode it also reports how many documents and index keys were examined and how long execution took. This helps developers spot inefficiencies, such as full collection scans, and optimize queries accordingly.
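
As an illustration of reading explain output, the sketch below walks the winning plan of an explain document and lists its stage names. The sample dictionary is a hand-written, simplified stand-in for real output, and the helper assumes each stage has at most one `inputStage`. Real plans can be more complex.

```python
def plan_stages(explain_output):
    """Return the stage names of the winning plan, outermost first."""
    plan = explain_output["queryPlanner"]["winningPlan"]
    # Some server versions nest the readable plan under "queryPlan".
    plan = plan.get("queryPlan", plan)
    names = []
    while plan is not None:
        names.append(plan["stage"])
        plan = plan.get("inputStage")
    return names


def uses_index(explain_output):
    """True if the winning plan contains an index scan stage."""
    return "IXSCAN" in plan_stages(explain_output)


# Simplified, hand-written sample; real output carries many more fields.
sample = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "Region_1"},
        }
    }
}
print(plan_stages(sample), uses_index(sample))

# With PyMongo, an explain document can be obtained from a cursor:
# explain_output = collection.find({"Region": "West"}).explain()
```

A plan whose only stage is `COLLSCAN` indicates a full collection scan, which is usually the cue to consider adding an index.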

12. **How does MongoDB handle schema validation?**

    MongoDB handles schema validation using JSON Schema rules defined at the collection level. When creating or updating a collection, you can specify validation rules that enforce the structure, data types, and required fields of documents. This ensures that only documents matching the defined schema are allowed, helping maintain data consistency and integrity while still allowing flexibility when needed.
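
A validator of this kind might be sketched as follows. The required fields, the list of regions, and the collection name `ValidatedOrders` are assumptions based on the Superstore examples in this notebook, not a schema taken from the assignment.

```python
# $jsonSchema validator: documents must be objects with these fields and types.
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["Order ID", "Region", "Sales"],
        "properties": {
            "Order ID": {"bsonType": "string"},
            "Region": {"enum": ["East", "West", "Central", "South"]},  # assumed values
            "Sales": {"bsonType": ["double", "int"], "minimum": 0},
        },
    }
}


def create_validated_orders(db, name="ValidatedOrders"):
    """Create a collection that rejects documents failing the validator.

    `db` is expected to be a PyMongo Database.
    """
    return db.create_collection(
        name, validator=order_validator, validationAction="error"
    )
```

With `validationAction` set to `"error"`, inserts and updates that violate the schema are rejected. Setting it to `"warn"` logs the violation and lets the write through.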

13. **What is the difference between a primary and a secondary node in a replica set?**

    In a MongoDB replica set, the primary node is the only member that accepts writes, and by default it also serves all reads. Secondary nodes maintain copies of the primary’s data by continuously replicating its operations. If the primary fails, one of the secondaries is automatically elected as the new primary, ensuring high availability. Secondaries can also serve reads when the client sets a read preference such as secondary or secondaryPreferred, although such reads may return slightly stale data.
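
From the client's side, these roles show up in the connection string. The sketch below assembles a URI that names the replica set and a read preference. The host names and the replica set name `rs0` are placeholders.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, read_preference="primary"):
    """Build a MongoDB URI that lists the set members and a read preference."""
    options = urlencode({"replicaSet": replica_set, "readPreference": read_preference})
    return f"mongodb://{','.join(hosts)}/?{options}"


uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    "rs0",
    "secondaryPreferred",
)
print(uri)
```

Listing several members lets the driver discover the current primary even if one host is down. With `secondaryPreferred`, reads go to a secondary when one is available and fall back to the primary otherwise.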

14. **What security mechanisms does MongoDB provide for data protection?**

    MongoDB provides several security mechanisms to protect data:

    Authentication – Ensures only authorized users can access the database.

    Authorization – Uses role-based access control (RBAC) to limit user actions.

    Encryption – Supports encryption at rest and in transit (TLS/SSL) to secure data.

    Auditing – Tracks database activity for compliance and monitoring.

    IP Whitelisting & Network Isolation – Controls which clients can connect to the database.

    These features work together to ensure data confidentiality, integrity, and access control.

15. **Explain the concept of embedded documents and when they should be used.**

    Embedded documents in MongoDB are documents stored within other documents, creating a nested structure. They allow related data to be kept together in a single document, which can improve read performance and make data more self-contained.

    They should be used when the related data has a one-to-one or one-to-few relationship and is frequently accessed together. For example, storing a user's address or order items inside the user or order document is a good use of embedding. However, for large or growing related data, referencing (using separate documents) is more efficient.
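
The two modelling styles can be compared side by side using plain dictionaries. All identifiers and values below are made up for illustration.

```python
# Embedding: the address and line items live inside the order document,
# because they are few and are read together with the order.
embedded_order = {
    "_id": "CA-1001",
    "customer": "Asha",
    "shipping_address": {"city": "Pune", "postal_code": "411001"},
    "items": [
        {"product": "Stapler", "quantity": 2},
        {"product": "Binder", "quantity": 5},
    ],
}

# Referencing: reviews can grow without bound, so each one is a separate
# document that points back to its product by id.
product = {"_id": "P-42", "name": "Stapler"}
reviews = [
    {"_id": 1, "product_id": "P-42", "rating": 5},
    {"_id": 2, "product_id": "P-42", "rating": 3},
]

# Embedded data is available in one read of the parent document.
total_quantity = sum(item["quantity"] for item in embedded_order["items"])

# Referenced data needs a second lookup keyed on the parent's id.
product_reviews = [r for r in reviews if r["product_id"] == product["_id"]]

print(total_quantity, len(product_reviews))
```

Embedding keeps the order self-contained, while referencing keeps the product document small no matter how many reviews accumulate.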

16. **What is the purpose of MongoDB’s $lookup stage in aggregation?**

    The purpose of MongoDB’s $lookup stage in aggregation is to perform a left outer join between documents from two collections. It allows you to combine related data, similar to SQL joins, by matching fields in one collection with fields in another. This is useful for retrieving nested or related data in a single query without denormalizing your schema.
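
A `$lookup` stage can be written as a Python dictionary for use with `collection.aggregate()`. In the sketch below, the `Customers` collection and the `Customer ID` join field are hypothetical. The small `emulate_lookup` helper is a plain-Python illustration of the left outer join semantics, not how MongoDB implements the stage.

```python
lookup_stage = {
    "$lookup": {
        "from": "Customers",           # hypothetical second collection
        "localField": "Customer ID",   # field in the input (Orders) documents
        "foreignField": "Customer ID", # field in the Customers documents
        "as": "customer_info",         # output array field
    }
}


def emulate_lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Illustrate an equality $lookup: keep every left document and attach
    the list of matching right documents (empty when nothing matches)."""
    joined = []
    for doc in left_docs:
        matches = [r for r in right_docs if r.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})
    return joined


orders = [
    {"Order ID": "A1", "Customer ID": "C1"},
    {"Order ID": "A2", "Customer ID": "C9"},  # no matching customer
]
customers = [{"Customer ID": "C1", "Segment": "Consumer"}]

spec = lookup_stage["$lookup"]
result = emulate_lookup(orders, customers, spec["localField"], spec["foreignField"], spec["as"])
for doc in result:
    print(doc["Order ID"], doc["customer_info"])
```

The second order is still present in the result with an empty `customer_info` array, which is what makes the join a left outer join rather than an inner join.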

17. **What are some common use cases for MongoDB?**

    Some common use cases for MongoDB include:

    Content Management Systems (CMS) – Flexible schema is ideal for managing varied content types.

    Real-Time Analytics – Fast reads/writes and aggregation make it great for dashboards and data analysis.

    Mobile and Web Apps – Easily handles rapidly changing data and high user loads.

    Internet of Things (IoT) – Manages large volumes of semi-structured sensor data efficiently.

    Catalogs and Inventory – Perfect for e-commerce platforms with diverse product attributes.

    Personalization Engines – Stores user profiles and activity data to serve dynamic content.

18. **What are the advantages of using MongoDB for horizontal scaling?**

    The advantages of using MongoDB for horizontal scaling include:

    Sharding Support – MongoDB natively supports sharding, allowing data to be distributed across multiple servers for better performance.

    Increased Capacity – Easily handles large volumes of data and high traffic by adding more nodes.

    Cost-Effective Scaling – You can scale out using commodity hardware instead of expensive high-end servers.

    High Availability – Works well with replication for fault tolerance and uptime.

    Improved Performance – Distributes query and write load, reducing bottlenecks and improving response times.

19. **How do MongoDB transactions differ from SQL transactions?**

    MongoDB transactions and SQL transactions both support ACID properties, but they differ in design and usage:

    SQL transactions are built-in and widely used, often involving multiple rows and tables in complex queries, with full ACID support by default.

    MongoDB transactions were introduced later (from version 4.0) and are typically used for multi-document operations. MongoDB is designed for high performance with single-document atomicity, so multi-document transactions are used only when necessary.

20. **What are the main differences between capped collections and regular collections?**

    Capped collections in MongoDB have a fixed size and automatically overwrite the oldest data when full, preserving insertion order. They don’t allow deletions or large updates. In contrast, regular collections have no size limit and support full CRUD operations. Capped collections are ideal for logs or real-time data; regular collections suit general-purpose storage.
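
A capped collection might be created from PyMongo as sketched below. The collection name and the size limits are arbitrary illustrative values. The `deque` part is only an in-memory analogy for the "oldest entries are overwritten" behaviour and does not talk to MongoDB.

```python
from collections import deque


def create_event_log(db, name="EventLog", size_bytes=1_048_576, max_docs=1000):
    """Create a capped collection limited by byte size and document count.

    `db` is expected to be a PyMongo Database.
    """
    return db.create_collection(name, capped=True, size=size_bytes, max=max_docs)


# Analogy: a fixed-length buffer that drops its oldest entry when full.
recent_events = deque(maxlen=3)
for n in range(1, 6):
    recent_events.append({"event": n})

print([doc["event"] for doc in recent_events])
```

After five inserts into a buffer of three, only the three most recent entries remain, in insertion order.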

21. **What is the purpose of the $match stage in MongoDB’s aggregation pipeline?**

    The purpose of the $match stage in MongoDB’s aggregation pipeline is to filter documents based on specified conditions, similar to a WHERE clause in SQL. It narrows down the documents passed to the next stages, improving performance by reducing the amount of data processed in the pipeline.

22. **How can you secure access to a MongoDB database?**

    You can secure access to a MongoDB database using the following methods:

    Authentication – Require users to log in with valid credentials.

    Authorization – Use role-based access control (RBAC) to limit user actions.

    Encryption – Enable encryption for data at rest and in transit (TLS/SSL).

    IP Whitelisting – Allow access only from trusted IP addresses.

    Firewalls & Network Isolation – Use VPCs and firewalls to restrict unauthorized network access.

    Audit Logging – Track and monitor database activity for compliance and security audits.
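
On the client side, authentication and encryption in transit are typically switched on through the connection string. The sketch below builds such a URI. The user name, password, and host are placeholders, and real credentials would normally come from environment variables or a secrets manager rather than source code.

```python
from urllib.parse import quote_plus


def secure_uri(username, password, host, auth_db="admin"):
    """Build a URI with percent-encoded credentials, an auth database, and TLS."""
    user = quote_plus(username)
    pwd = quote_plus(password)  # characters such as '@' and '/' must be encoded
    return f"mongodb://{user}:{pwd}@{host}/?authSource={auth_db}&tls=true"


print(secure_uri("app_user", "p@ss/word", "db.example.net:27017"))
```

Authorization is then governed by the roles granted to `app_user` on the server, so the account should receive only the privileges the application needs.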

23. **What is MongoDB’s WiredTiger storage engine, and why is it important?**

    WiredTiger is MongoDB’s default storage engine, responsible for managing how data is stored, indexed, and retrieved on disk. It is important because it offers high performance, compression, and concurrency control, allowing multiple read and write operations to happen efficiently.

    WiredTiger uses document-level locking and supports data compression, which reduces storage space and improves speed, making it ideal for modern, high-throughput applications.

# Practical Questions

In [None]:
# 1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB
%pip install pymongo pandas

import pandas as pd
from pymongo import MongoClient

csv_file = 'superstore.csv'
df = pd.read_csv(csv_file)

# Convert each row to a dict so it can be stored as one document
data = df.to_dict(orient='records')

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

# insert_many() rejects an empty list, so only insert when rows were read
if data:
    collection.insert_many(data)

print(f"Inserted {len(data)} records into MongoDB collection 'Orders' in 'SuperstoreDB'")

In [None]:
# 2. Retrieve and print all documents from the Orders collection
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

documents = collection.find()

for doc in documents:
    print(doc)

In [None]:
# 3. Count and display the total number of documents in the Orders collection.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

total_docs = collection.count_documents({})

print(f"Total number of documents in 'Orders' collection: {total_docs}")

In [None]:
# 4. Write a query to fetch all orders from the "West" region
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

west_orders = collection.find({"Region": "West"})

for order in west_orders:
    print(order)

In [None]:
# 5. Write a query to find orders where Sales is greater than 500
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

high_sales_orders = collection.find({"Sales": {"$gt": 500}})

for order in high_sales_orders:
    print(order)

In [None]:
# 6. Fetch the top 3 orders with the highest Profit.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

top_profit_orders = collection.find().sort("Profit", -1).limit(3)

for order in top_profit_orders:
    print(order)

In [None]:
# 7. Update all orders with Ship Mode as "First Class" to "Premium Class".

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

result = collection.update_many(
    {"Ship Mode": "First Class"},
    {"$set": {"Ship Mode": "Premium Class"}}
)

print(f"Modified {result.modified_count} documents.")


In [None]:
# 8. Delete all orders where Sales is less than 50.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

result = collection.delete_many({"Sales": {"$lt": 50}})

print(f"Deleted {result.deleted_count} documents.")


In [None]:
# 9. Use aggregation to group orders by Region and calculate total sales per region.

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

pipeline = [
    {
        "$group": {
            "_id": "$Region",              # Group by Region
            "total_sales": {"$sum": "$Sales"}  # Calculate total Sales per region
        }
    }
]

results = collection.aggregate(pipeline)

for region in results:
    print(f"Region: {region['_id']}, Total Sales: {region['total_sales']}")


In [None]:
# 10. Fetch all distinct values for Ship Mode from the collection.

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

distinct_ship_modes = collection.distinct("Ship Mode")

print("Distinct Ship Modes:")
for mode in distinct_ship_modes:
    print("-", mode)


In [None]:
# 11. Count the number of orders for each category.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["SuperstoreDB"]
collection = db["Orders"]

pipeline = [
    {
        "$group": {
            "_id": "$Category",         # Group by Category
            "order_count": {"$sum": 1}  # Add 1 for every document in the group
        }
    }
]

results = collection.aggregate(pipeline)

print("Order Count by Category:")
for category in results:
    print(f"{category['_id']}: {category['order_count']}")
