#### What is Hadoop?
Hadoop is an open-source framework developed by the Apache Software Foundation for storing and processing large-scale data across a distributed network of computers. It is designed to handle big data, meaning data that is too large or complex for traditional data-processing software to manage efficiently.

Hadoop allows applications to run on clusters of commodity hardware (inexpensive, commonly available machines) and can scale from a single server to thousands of machines. Its architecture is fault-tolerant, scalable, and efficient, making it ideal for handling massive volumes of structured and unstructured data.
![image.png](attachment:9eb9c6b6-b4ef-4901-9a2e-dd7db86bb8b3.png)
#### Key features of Hadoop include:

* Distributed storage and processing

* Scalability (horizontal scaling)

* High fault tolerance

* Cost-effective (runs on low-cost hardware)

#### Role of Distributed Computing Environment in Hadoop

A Distributed Computing Environment (DCE) refers to a system in which the processing and storage of data are distributed across multiple interconnected computers (nodes) that work together as a unified system. Hadoop makes full use of this model.

#### In the context of Hadoop:

* Data is divided into fixed-size blocks and stored across multiple nodes using HDFS (Hadoop Distributed File System).

* Data is processed in parallel using MapReduce, which schedules tasks close to where the data resides (data locality).

* If a node fails, its tasks are automatically rescheduled on another node that holds a replica of the same data.

#### Importance of DCE in Hadoop:

* Parallel processing: Tasks are distributed, allowing for faster data processing.

* Fault tolerance: Copies of data are stored on different nodes to ensure data is not lost if a node fails.

* High availability: Even if one part of the system fails, the rest continues to function.

* Efficient resource utilization: Workload is shared across all nodes, making efficient use of hardware.

#### Two Components of the Hadoop Ecosystem

HDFS (Hadoop Distributed File System):

* HDFS is the storage component of Hadoop.

* It splits large files into blocks (128 MB by default in Hadoop 2.x and later; often configured to 256 MB) and stores them across multiple nodes.

* Each block is replicated (3 times by default) to ensure reliability and fault tolerance.

* The system consists of a NameNode (manages metadata and the file system namespace) and DataNodes (store the actual data blocks).
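
The splitting and replication described above can be sketched in plain Python. This is only an illustration, assuming a 128 MB block size, a replication factor of 3, and a simple round-robin placement; real HDFS placement is rack-aware, and `plan_blocks` and the DataNode names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size
REPLICATION = 3      # assumed replication factor


def plan_blocks(file_size_mb, datanodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a list of (block_index, block_size_mb, replica_nodes) tuples."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    plan = []
    for i in range(num_blocks):
        # The last block only holds the remainder of the file.
        size = min(block_size_mb, file_size_mb - i * block_size_mb)
        # Simplified round-robin placement (not the real HDFS policy).
        replicas = [datanodes[(i + r) % len(datanodes)]
                    for r in range(replication)]
        plan.append((i, size, replicas))
    return plan


plan = plan_blocks(300, ["dn1", "dn2", "dn3", "dn4"])
for block in plan:
    print(block)
```

A 300 MB file yields three blocks here (128, 128, and 44 MB), each stored on three different nodes, so losing any single node still leaves two copies of every block.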

MapReduce:

* MapReduce is the processing engine of Hadoop.

* It follows a divide-and-conquer approach in which data is processed in two phases:

    * Map phase: Input records are transformed into intermediate key-value pairs (and can be filtered along the way).

    * Reduce phase: After the framework groups the pairs by key (the shuffle and sort step), the values for each key are aggregated to produce the final output.

* It enables parallel processing, which shortens processing time, and minimizes data transfer by processing data close to where it is stored.
![image.png](attachment:c959caf1-fad0-409f-b49e-3a507128757c.png)
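
To make the two phases concrete, here is a minimal single-machine word count in plain Python. It only mimics the MapReduce flow (map, group by key, reduce) and does not use the Hadoop API; the function names and sample lines are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (word, 1) pair per word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Aggregate all values that share a key.
    return key, sum(values)


lines = ["big data big cluster", "data node"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a real cluster, each call to `map_phase` would run on the node that holds the corresponding block, and only the grouped pairs would travel over the network.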

Conclusion:

Hadoop is a powerful tool for managing big data in a distributed computing environment. Its architecture, based on HDFS and MapReduce, allows it to store and process data reliably and efficiently across multiple machines. The distributed nature of Hadoop provides high throughput, fault tolerance, and scalability, all of which are essential for modern data-driven applications.



#### Hive Architecture

Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows users to query and analyze large datasets stored in Hadoop using a SQL-like language called HiveQL. Hive compiles these queries into jobs that run on the Hadoop cluster, traditionally MapReduce jobs, although Tez or Spark can also be configured as the execution engine.

#### The architecture of Hive consists of the following major components:

User Interface:
    It allows users to submit queries to the system. It can be a command-line interface (CLI), a web UI, or applications using JDBC/ODBC.

Driver:
    The driver acts as the controller. It receives the HiveQL statements from the user, initiates their execution, and maintains the lifecycle of the query.

Compiler:
    It parses the HiveQL, checks for syntax and semantics, and then converts the query into a logical execution plan. This plan is further optimized and translated into a DAG (Directed Acyclic Graph) of MapReduce jobs.

Metastore:
    It is a central repository that stores metadata about Hive tables, such as schema, location, partitions, and data types. It is crucial for query planning and optimization.

Execution Engine:
    This component executes the execution plan created by the compiler. It communicates with the Hadoop framework (MapReduce, Tez, or Spark, depending on the configured engine) to carry out the actual query execution.

#### Roles of Key Components

1. Metastore
    Stores metadata about tables, columns, partitions, and data types.
    
    Maintains information about data location in HDFS.
    
    Provides metadata to the compiler for query parsing and planning.
    
    Usually backed by a traditional RDBMS like MySQL or Derby.

2. Driver
    Acts as a controller to manage the complete query lifecycle.
    
    Receives HiveQL statements from the user.
    
    Sends queries to the compiler and coordinates the execution.
    
    Maintains session handles and query execution status.
    
    Manages the results returned to the user.

3. Compiler
    Parses and analyzes the HiveQL query.
    
    Performs syntax and semantic checks.
    
    Creates a logical plan and optimizes it.
    
    Converts the logical plan into a physical execution plan (MapReduce/Tez jobs).

4. Execution Engine
    Executes the physical plan created by the compiler.
    
    Interacts with Hadoop (MapReduce/Tez/Spark) to execute jobs.
    
    Monitors execution and handles job failures.
    
    Returns the result back to the driver after execution.

Conclusion
Hive simplifies querying and analyzing large datasets stored in Hadoop by providing a SQL-like interface. Its architecture efficiently translates HiveQL into distributed execution plans using components like the Metastore, Driver, Compiler, and Execution Engine, each playing a critical role in query execution.
![image.png](attachment:b4aa167d-27c0-4680-abfb-92105f3f6e96.png)

#### What is the CAP Theorem?

The CAP Theorem, also known as Brewer's Theorem, is a principle in distributed computing that states that it is impossible for a distributed data system to simultaneously provide all three of the following guarantees:

* Consistency (C): Every read receives the most recent write or an error, so all nodes appear to return the same data at the same time.

* Availability (A): Every request (read or write) sent to a working node receives a non-error response, even if other nodes are down, although the response is not guaranteed to reflect the latest write.

* Partition Tolerance (P): The system continues to operate even if there is a network partition (loss of communication between some parts of the system).

Put simply, a distributed system can guarantee at most two of these three properties at the same time.

#### CAP Theorem in the Context of NoSQL Databases
NoSQL databases are non-relational databases designed to scale horizontally and handle large volumes of unstructured data. In a distributed environment, they must often choose between Consistency and Availability in the presence of a Partition (P), because Partition Tolerance is generally assumed in distributed systems.

#### Different NoSQL databases make different trade-offs:

* CP (Consistency + Partition Tolerance): Prioritizes consistency but might sacrifice availability during network failures. Example: HBase, MongoDB (in some configurations)

* AP (Availability + Partition Tolerance): Prioritizes availability, allowing for eventual consistency. Example: Cassandra, CouchDB

* CA (Consistency + Availability): These systems work only when there is no network partition, which is not practical in real-world distributed systems, so this combination is rarely used.
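
The CP versus AP trade-off can be illustrated with a toy two-replica store. This is a deliberately simplified sketch, not how HBase or Cassandra are implemented; `TinyStore`, its `mode` flag, and the reconciliation in `heal` are hypothetical.

```python
class TinyStore:
    """Toy two-replica store; mode is "CP" or "AP" (illustration only)."""

    def __init__(self, mode):
        self.mode = mode
        self.replica_a = None
        self.replica_b = None
        self.partitioned = False

    def write(self, value):
        """Write through replica A; return True if the write is accepted."""
        if not self.partitioned:
            self.replica_a = self.replica_b = value
            return True
        if self.mode == "CP":
            # Replica B is unreachable, so refuse rather than diverge.
            return False
        # AP: accept locally; replica B keeps serving the old value.
        self.replica_a = value
        return True

    def heal(self):
        """End the partition and copy A's value to B (simplified)."""
        self.partitioned = False
        self.replica_b = self.replica_a


cp, ap = TinyStore("CP"), TinyStore("AP")
for store in (cp, ap):
    store.write("v1")
    store.partitioned = True

cp_accepted = cp.write("v2")  # rejected: consistent but unavailable
ap_accepted = ap.write("v2")  # accepted: available but replicas differ
print(cp_accepted, cp.replica_a, cp.replica_b)
print(ap_accepted, ap.replica_a, ap.replica_b)

ap.heal()  # replicas converge again (eventual consistency)
print(ap.replica_a, ap.replica_b)
```

During the partition the CP store keeps both replicas identical by rejecting the write, while the AP store accepts it and lets the replicas disagree until the partition heals.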

#### Four Main Types of NoSQL Databases

Document-based Databases:

* Store data as documents (usually JSON, BSON, or XML).

* Flexible schema.

* Example: MongoDB, CouchDB

Key-Value Stores:

* Store data as key-value pairs.

* Very fast lookups by key and easy to scale horizontally.

* Example: Redis, Riak, DynamoDB

Column-family Stores:

* Store each row as a set of columns grouped into column families, so rows need not share the same columns.

* Good for large-scale, sparse data.

* Example: Apache Cassandra, HBase

Graph Databases:

* Designed for data with complex relationships.

* Use nodes, edges, and properties.

* Example: Neo4j, Amazon Neptune
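
As a rough illustration of how the four data models differ, the same customer purchase can be laid out with plain Python structures. The field names and values are made up for this sketch, and real databases add indexing, storage formats, and query languages on top.

```python
# Document store: one self-contained, nested record per customer.
document = {"_id": 1, "name": "Alice",
            "orders": [{"item": "book", "qty": 2}]}

# Key-value store: an opaque value looked up by a single key.
key_value = {"customer:1": '{"name": "Alice"}'}

# Column-family store: row key -> column family -> columns (rows may be sparse).
column_family = {"row1": {"profile": {"name": "Alice"},
                          "orders": {"book": 2}}}

# Graph store: nodes plus edges that carry the relationship.
nodes = {1: {"label": "Customer", "name": "Alice"},
         2: {"label": "Product", "name": "book"}}
edges = [(1, "BOUGHT", 2)]

# Answer "who bought what" by following edges between nodes.
purchases = [(nodes[src]["name"], nodes[dst]["name"])
             for src, rel, dst in edges if rel == "BOUGHT"]
print(purchases)
```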

Conclusion
The CAP Theorem helps explain the design choices behind various NoSQL databases. Depending on the use case, different NoSQL databases prioritize Consistency, Availability, or Partition Tolerance. Understanding CAP is essential when selecting the right NoSQL database for a distributed application.

#### Apache Spark as a Unified Analytics Engine
Apache Spark is often described as a unified analytics engine because it provides a comprehensive platform for handling a wide range of big data processing tasks within a single framework. Spark supports:

* Batch processing

* Real-time streaming

* Interactive querying

* Machine learning

* Graph processing

This versatility allows organizations to use one engine for all their analytics needs instead of integrating multiple separate tools.

Spark achieves this through different built-in libraries:

* Spark SQL: for structured data processing

* Spark Streaming (and the newer Structured Streaming API): for real-time data streams

* MLlib: for machine learning algorithms

* GraphX: for graph analytics

By integrating these components into one framework, Spark eliminates the need for complex workflows that span multiple systems.

#### In-Memory Computation Model of Apache Spark
One of the key features of Spark is its in-memory computation. This means Spark:

* Loads data into memory (RAM)

* Performs transformations and actions directly in memory

* Avoids writing intermediate results to disk where possible (shuffle data, and data that does not fit in memory, can still be written to disk)

This results in faster processing, especially for iterative algorithms and machine learning tasks, where the same data is reused multiple times.
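
The benefit for iterative workloads can be shown with a plain-Python analogy. It does not use Spark: `load_dataset` is a hypothetical stand-in for an expensive read from disk, and keeping its result in a variable plays the role of caching a dataset in memory.

```python
stats = {"loads": 0}


def load_dataset():
    """Stand-in for an expensive read from disk."""
    stats["loads"] += 1
    return list(range(1, 6))


def iterate(get_data, rounds):
    """Run the same computation several times over the data."""
    total = 0
    for _ in range(rounds):
        total += sum(x * x for x in get_data())
    return total


# Disk-style: the data is reloaded on every iteration.
result_reload = iterate(load_dataset, 3)
reload_count = stats["loads"]

# In-memory style: load once, then reuse the in-memory copy.
stats["loads"] = 0
cached = load_dataset()
result_cached = iterate(lambda: cached, 3)
cached_count = stats["loads"]

print(result_reload, reload_count)
print(result_cached, cached_count)
```

Both runs produce the same result, but the second one reads the data only once, which is the effect that persisting a dataset in memory aims for when an algorithm makes many passes over it.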

   ![image.png](attachment:54756083-c630-4eb2-841e-aec57944a030.png)
   
Conclusion
Apache Spark is a unified analytics engine because it supports multiple analytics tasks in a single platform and offers high performance through its in-memory computation model. In contrast, Hadoop MapReduce writes intermediate results to disk between stages, which makes it slower and less efficient for complex and iterative tasks.

## Section B - 1 

1. Create a directory named InputDir under the path /user/hadoop/

In [None]:
hdfs dfs -mkdir /user/hadoop/InputDir

2. List the contents of SampleDir with file sizes and modification times

In [None]:
hdfs dfs -ls -h /user/hadoop/SampleDir
# -h prints human-readable file sizes; -ls shows modification times by default.

3. Upload all .txt files from local directory XYZ to HDFS directory /user/hadoop/InputDir/

In [None]:
# The local shell expands *.txt before hdfs runs, so every matching file is uploaded.
hdfs dfs -put /home/hadoop/XYZ/*.txt /user/hadoop/InputDir/

4. Read file.txt from HDFS and browse its content using a pager like less

In [None]:
hdfs dfs -cat /user/hadoop/InputDir/file.txt | less

5. Move the directory OutputDir to ArchiveDir within HDFS

In [None]:
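# The source must be OutputDir (assumed to live under /user/hadoop/ like the other directories here).
hdfs dfs -mv /user/hadoop/OutputDir /user/hadoop/ArchiveDir/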

## Section B - 2 

1. Create an RDD from the list

In [None]:
# Assuming SparkContext is available as sc
data = ["apple", "banana", "orange", "grape"]
rdd = sc.parallelize(data)

2. Count how many elements are in the RDD

In [None]:
count = rdd.count()
print(f"Total elements: {count}")

3. Filter only even numbers from an RDD

In [None]:
# Example numeric RDD
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Filter even numbers
even_numbers = numbers_rdd.filter(lambda x: x % 2 == 0)
print(even_numbers.collect())

4. Read a text file and count how many lines contain the word "error"

In [None]:
# Read text file (replace 'path/to/file.txt' with the actual path)
text_rdd = sc.textFile("path/to/file.txt")

# Count lines containing the word "error"
error_count = text_rdd.filter(lambda line: "error" in line.lower()).count()
print(f"Lines containing 'error': {error_count}")
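
Note that the substring test above also counts lines such as "no errors found" or "terror alert". If only the standalone word should match, a word-boundary regular expression is one option. The sketch below is plain Python with made-up sample lines, and `has_error_word` is a hypothetical helper name.

```python
import re

# \b marks a word boundary, so "errors" and "terror" do not match.
ERROR_WORD = re.compile(r"\berror\b", re.IGNORECASE)


def has_error_word(line):
    """Return True if the line contains "error" as a whole word."""
    return ERROR_WORD.search(line) is not None


sample = ["Error: disk full", "no errors found",
          "terror alert", "fatal ERROR at 10:02"]
matches = [line for line in sample if has_error_word(line)]
print(len(matches))
```

Assuming the same `text_rdd` as above, the helper could be passed to the filter directly, for example `text_rdd.filter(has_error_word).count()`.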

## Section B - 3

1. Create an external table products in the sales_db database

In [None]:
CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.products (
    product_id INT,
    product_name STRING,
    category STRING,
    price FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/products';

2. Perform a left outer join between employees and departments on dept_id

In [None]:
SELECT 
    e.*, 
    d.*
FROM 
    employees e
LEFT OUTER JOIN 
    departments d
ON 
    e.dept_id = d.dept_id;

3. Select the top 5 highest-paid employees from employee_details

In [None]:
SELECT * 
FROM employee_details
ORDER BY salary DESC
LIMIT 5;

## Section B - 4 

1. Create a MongoDB collection named customer_data

In [None]:
db.createCollection("customer_data")

2. Insert 4 documents into customer_data with fields: name, email, and city

In [None]:
db.customer_data.insertMany([
  { name: "John", email: "john@example.com", city: "Delhi" },
  { name: "Alice", email: "alice@example.com", city: "Bombay" },
  { name: "Bob", email: "bob@example.com", city: "Chennai" },
  { name: "Mary", email: "mary@example.com", city: "Delhi" }
])

3. Query to find all customers from the city "Delhi"

In [None]:
db.customer_data.find({ city: "Delhi" })

4. Update all records where the city is "Bombay" and change it to "Mumbai", then display modified documents

In [None]:
db.customer_data.updateMany(
  { city: "Bombay" },
  { $set: { city: "Mumbai" } }
)

// Display updated documents
db.customer_data.find({ city: "Mumbai" })

5. Delete documents where the name is "John"

In [None]:
db.customer_data.deleteMany({ name: "John" })