# 13th_June_BigData_Assignment.ipynb
**Data Science Masters – Big Data Assignment**

This notebook contains answers to the Big Data assignment questions including theoretical explanations and practical implementations using Python and Apache Spark.


## Q1: Core Components of the Hadoop Ecosystem

Hadoop consists of the following core components:

1. **HDFS (Hadoop Distributed File System)** – Stores data as replicated blocks across multiple machines, which provides fault tolerance.
2. **MapReduce** – A programming model for processing large data sets in parallel across the cluster.
3. **YARN (Yet Another Resource Negotiator)** – Manages cluster resources and schedules jobs.

- **HDFS** provides high-throughput access to application data.
- **MapReduce** processes data in two phases, Map and Reduce, with a shuffle step in between.
- **YARN** separates resource management from data processing, so multiple processing engines can share one cluster.


## Q2: Hadoop Distributed File System (HDFS)

### Key Components:
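- **NameNode**: Manages the file system namespace and metadata, including the mapping of files to blocks and of blocks to DataNodes.
- **DataNode**: Stores the actual data blocks and serves read and write requests for them.
- **Blocks**: Files are split into fixed-size blocks (128 MB by default in Hadoop 2.x and later), and each block is replicated across several DataNodes (three copies by default).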

### Reliability and Fault Tolerance:
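- Data replication ensures that if a DataNode fails, its blocks can still be read from replicas on other DataNodes.
- DataNodes send periodic heartbeats and block reports to the NameNode; when a block has fewer replicas than its target, the NameNode schedules re-replication.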
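
The block and replication arithmetic above can be made concrete with a small sketch. It is plain Python rather than an HDFS API call, and it assumes the 128 MB block size cited above together with a replication factor of 3 (the usual HDFS default, not stated in the question). The function name `hdfs_footprint` is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # default block size cited above
REPLICATION_FACTOR = 3   # assumption: the usual HDFS default replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (block count, size of the last block in MB, raw storage in MB)."""
    if file_size_mb <= 0:
        raise ValueError("file_size_mb must be positive")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The final block only occupies the data it actually holds.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times across the cluster.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, raw_storage_mb


blocks, last_block, raw = hdfs_footprint(1000)
print(blocks, last_block, raw)  # 8 104 3000
```

Under these assumptions, a 1000 MB file occupies 8 blocks (seven full blocks plus one of 104 MB) and consumes 3000 MB of raw cluster storage.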


## Q3: MapReduce Framework

### Step-by-Step:
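1. **Map Phase**: Input data is divided into splits; each mapper processes one split and emits intermediate key-value pairs.
2. **Shuffle and Sort**: The framework sorts the intermediate pairs by key and routes all values for a given key to the same reducer.
3. **Reduce Phase**: Each reducer aggregates the values for its keys to produce the final result.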

### Example: Word Count
- Map: Emit `(word, 1)` for each word in the input.
- Reduce: Sum the counts for each word.
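
The three phases can be simulated in a few lines of plain Python. This is only an in-process sketch of the programming model, not Hadoop code: the function names and the sample input are hypothetical, and a real job would run mappers and reducers on separate machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


input_lines = ["the quick brown fox", "the lazy dog", "The fox"]  # hypothetical input
intermediate = [pair for line in input_lines for pair in map_phase(line)]
totals = dict(reduce_phase(key, values)
              for key, values in shuffle_and_sort(intermediate))
print(totals)
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```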

### Advantages:
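- Scales horizontally across commodity hardware, tolerates node failures by re-running failed tasks, and suits large batch workloads.

### Limitations:
- Slow for iterative tasks because intermediate results are written to disk between jobs; high latency makes it unsuitable for interactive queries; verbose for simple operations.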


## Q4: Role of YARN

YARN manages resources and schedules tasks in a Hadoop cluster.

### Components:
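- **ResourceManager**: Allocates cluster resources among all running applications.
- **NodeManager**: Runs on each node; launches and monitors the containers that execute tasks.
- **ApplicationMaster**: One per application; negotiates resources from the ResourceManager and coordinates that application's tasks.

### Comparison with Hadoop 1.x:
- Hadoop 1.x had a monolithic JobTracker that handled both resource management and job scheduling, which limited scalability.
- YARN splits these duties between the ResourceManager and per-application ApplicationMasters, so frameworks other than MapReduce (for example, Spark) can run on the same cluster.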


## Q5: Hadoop Ecosystem Components

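- **HBase**: Column-oriented NoSQL database that runs on top of HDFS and supports random, real-time reads and writes.
- **Hive**: Data warehouse layer that provides a SQL-like interface (HiveQL) for querying data in HDFS.
- **Pig**: High-level scripting language (Pig Latin) for data analysis.
- **Spark**: Fast, in-memory engine for large-scale data processing.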

### Example: Using Hive with Hadoop
Hive enables SQL-like queries over large datasets stored in HDFS, improving accessibility for users familiar with SQL.


## Q6: Apache Spark vs. Hadoop MapReduce

| Feature          | Spark                            | MapReduce                        |
|------------------|----------------------------------|----------------------------------|
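| Speed            | Keeps intermediate data in memory; faster | Writes intermediate data to disk; slower |
| Ease of Use      | APIs in Python, Scala, Java, R, and SQL | Primarily Java; verbose          |
| Iterative Jobs   | Efficient (datasets can be cached) | Inefficient (each pass rereads from disk) |
| Machine Learning | Built-in MLlib library           | No built-in library; poor fit for iterative training |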


In [None]:
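from pyspark import SparkContext

# Reuse the active SparkContext if one exists, otherwise create one
sc = SparkContext.getOrCreate()
text_file = sc.textFile("sample.txt")

word_counts = (text_file
               .flatMap(lambda line: line.split())              # split each line into words
               .map(lambda word: (word.lower(), 1))             # emit (word, 1), case-insensitive
               .reduceByKey(lambda a, b: a + b)                 # sum the counts per word
               .sortBy(lambda pair: pair[1], ascending=False))  # most frequent words first

# Return the 10 most frequent words
word_counts.take(10)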


In [None]:
# Sample dataset of (name, age) pairs as an RDD (sc comes from the earlier cell)
data = sc.parallelize([
    ("John", 25), ("Jane", 30), ("Joe", 45), ("Jill", 28)
])

# a. Filter: keep people older than 30
filtered = data.filter(lambda x: x[1] > 30)

# b. Map: add 5 years to every age
mapped = data.map(lambda x: (x[0], x[1] + 5))

# c. Reduce: average age = sum of ages / number of records
total = data.map(lambda x: x[1]).reduce(lambda a, b: a + b)
count = data.count()
average = total / count
filtered.collect(), mapped.collect(), average


In [None]:
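from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Load people.csv; the first row holds column names and column types are inferred
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# a. Select: keep only the Name and Age columns
df.select("Name", "Age").show()

# b. Filter: rows where Age is greater than 30
df.filter(df.Age > 30).show()

# c. Group and aggregate: average Salary per Department
df.groupBy("Department").avg("Salary").show()

# d. Join: inner join on DepartmentID, which must exist in both files
df2 = spark.read.csv("departments.csv", header=True, inferSchema=True)
df.join(df2, on="DepartmentID").show()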


In [None]:
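from pyspark.streaming import StreamingContext

# Process the stream in 1-second micro-batches (sc comes from the earlier cell)
ssc = StreamingContext(sc, 1)
# Read lines from a TCP socket; for local testing, `nc -lk 9999` can supply input
lines = ssc.socketTextStream("localhost", 9999)

# Word count, computed independently for each batch
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()          # print the first elements of each batch
ssc.start()              # start receiving and processing data
ssc.awaitTermination()   # block until the stream is stopped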


## Q11: Apache Kafka Concepts

Kafka is a distributed streaming platform that enables:
- Publish-subscribe messaging
- Fault-tolerant storage
- Real-time processing

Kafka addresses the need for real-time, durable, high-throughput data pipelines in big data systems.


## Q12: Kafka Architecture

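- **Producers**: Publish messages to topics.
- **Topics**: Named, logical channels for messages; each topic is divided into one or more partitions.
- **Brokers**: Kafka servers that store messages and serve producer and consumer requests.
- **Consumers**: Read messages from topics and track their position with offsets.
- **ZooKeeper**: Stores cluster metadata and coordinates controller election. Newer Kafka releases replace ZooKeeper with the built-in KRaft mode.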
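
How producers, topics, and consumers interact can be illustrated with a toy in-memory model. This is a sketch only, not the Kafka API: the `ToyTopic` class is hypothetical, it models a single partition with no brokers or replication, and it tracks offsets per consumer name (real Kafka tracks them per consumer group). New readers here always start from the beginning of the log.

```python
class ToyTopic:
    """Hypothetical in-memory stand-in for a single-partition Kafka topic."""

    def __init__(self):
        self._log = []       # append-only list of messages
        self._offsets = {}   # consumer name -> next offset to read

    def publish(self, message):
        """Producer side: append a message and return its offset."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, consumer):
        """Consumer side: return unread messages and advance this consumer's offset."""
        start = self._offsets.get(consumer, 0)
        self._offsets[consumer] = len(self._log)
        return self._log[start:]


topic = ToyTopic()
topic.publish("page_view")
topic.publish("click")

first_read = topic.poll("analytics")   # sees both messages published so far
topic.publish("purchase")
second_read = topic.poll("analytics")  # sees only the new message
late_reader = topic.poll("audit")      # a new consumer can still read everything

print(first_read, second_read, late_reader)
```

The point of the sketch is that reading does not remove messages from the log: each consumer only moves its own offset, so several consumers can read the same topic independently.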


In [None]:
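# Producer example (Python, kafka-python package)
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('test-topic', b'Hello, Kafka!')
producer.flush()  # block until buffered messages have been sent

# Consumer example
from kafka import KafkaConsumer
consumer = KafkaConsumer(
    'test-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',  # also read messages sent before this consumer started
    consumer_timeout_ms=10000,     # stop iterating after 10 s without new messages
)
for message in consumer:
    print(message.value)  # raw bytes, e.g. b'Hello, Kafka!'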


## Q14: Kafka Data Retention and Partitioning

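- **Data Retention**: Messages stay in a topic for a configurable period (7 days by default) or until a configurable size limit is reached, whether or not they have been consumed.
- **Partitioning**: Splits a topic into partitions for parallelism. Each partition is an ordered, append-only log; ordering is guaranteed within a partition, not across partitions, and messages with the same key go to the same partition.

Partitioning lets Kafka scale by spreading load across brokers and consumers, while retention bounds storage growth and lets consumers replay recent data.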
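
Both ideas can be sketched in plain Python. The code below is an illustration, not Kafka's implementation: CRC32 stands in for the hash used by Kafka's own partitioner, the partition count and function names are hypothetical, and real brokers delete whole log segments rather than individual records.

```python
import zlib

NUM_PARTITIONS = 3                     # hypothetical topic setting
RETENTION_SECONDS = 7 * 24 * 60 * 60   # the 7-day default mentioned above


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition; CRC32 is a stand-in for Kafka's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(partitions, key, value, timestamp):
    """Append a record to the partition selected by its key."""
    partition = choose_partition(key, len(partitions))
    partitions[partition].append((timestamp, key, value))
    return partition


def apply_retention(partitions, now, retention_seconds=RETENTION_SECONDS):
    """Drop records older than the retention window from every partition."""
    cutoff = now - retention_seconds
    return [[record for record in log if record[0] >= cutoff]
            for log in partitions]


partitions = [[] for _ in range(NUM_PARTITIONS)]
day = 24 * 60 * 60

p1 = append(partitions, "user-42", "login", timestamp=0)
p2 = append(partitions, "user-42", "logout", timestamp=8 * day)
append(partitions, "user-7", "login", timestamp=8 * day)

partitions = apply_retention(partitions, now=9 * day)
remaining = sum(len(log) for log in partitions)
print(p1 == p2, remaining)  # True 2
```

Both `user-42` records land in the same partition, so their relative order is preserved, and the record written on day 0 is gone by day 9 because it is older than the 7-day window.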


## Q15: Real-World Use Cases

- **LinkedIn**: Activity tracking
- **Uber**: Real-time analytics
- **Netflix**: Event sourcing
- **Financial Services**: Fraud detection

Kafka offers scalability, durability, and real-time capabilities crucial for modern data systems.
