DAY-6 (Apache Spark)

1. Working with RDDs:
   a) Write a Python program to create an RDD from a local data source.
   b) Implement transformations and actions on the RDD to perform data processing tasks.
   c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.


In [None]:
##1a.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName('LocalDataRDDExample') \
    .getOrCreate()

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Perform operations on the RDD
squared_rdd = rdd.map(lambda x: x * x)
sum_rdd = squared_rdd.reduce(lambda x, y: x + y)

# Print the results
print(f'Squared RDD: {squared_rdd.collect()}')
print(f'Sum of squared RDD: {sum_rdd}')

# Stop the SparkSession
spark.stop()


In [None]:
##1b.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName('RDDProcessingExample') \
    .getOrCreate()

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Transformations
squared_rdd = rdd.map(lambda x: x * x)
filtered_rdd = squared_rdd.filter(lambda x: x > 10)

# Actions
sum_rdd = squared_rdd.reduce(lambda x, y: x + y)
count_rdd = filtered_rdd.count()
first_element = filtered_rdd.first()
collected_rdd = filtered_rdd.collect()

# Print the results
print(f'Squared RDD: {squared_rdd.collect()}')
print(f'Filtered RDD: {filtered_rdd.collect()}')
print(f'Sum of squared RDD: {sum_rdd}')
print(f'Count of filtered RDD: {count_rdd}')
print(f'First element of filtered RDD: {first_element}')
print(f'Collected RDD: {collected_rdd}')

# Stop the SparkSession
spark.stop()


In [None]:
##1c.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName('RDDDataManipulationExample') \
    .getOrCreate()

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Perform RDD operations
squared_rdd = rdd.map(lambda x: x * x)
filtered_rdd = squared_rdd.filter(lambda x: x > 10)
sum_rdd = squared_rdd.reduce(lambda x, y: x + y)
product_rdd = filtered_rdd.reduce(lambda x, y: x * y)
mean_rdd = squared_rdd.mean()

# Print the results
print(f'Squared RDD: {squared_rdd.collect()}')
print(f'Filtered RDD: {filtered_rdd.collect()}')
print(f'Sum of squared RDD: {sum_rdd}')
print(f'Product of filtered RDD: {product_rdd}')
print(f'Mean of squared RDD: {mean_rdd}')

# Stop the SparkSession
spark.stop()


2. Spark DataFrame Operations:
   a) Write a Python program to load a CSV file into a Spark DataFrame.
   b) Perform common DataFrame operations such as filtering, grouping, or joining.
   c) Apply Spark SQL queries on the DataFrame to extract insights from the data.


In [None]:
##2a.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName('CSVLoadingExample') \
    .getOrCreate()

# Load CSV file into DataFrame
csv_file_path = 'path/to/your/csv_file.csv'
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Perform operations on the DataFrame
# For example, display the DataFrame schema and the first few rows
df.printSchema()
df.show(5)

# Stop the SparkSession
spark.stop()


In [None]:
##2b.


3. Spark Streaming:
   a) Write a Python program to create a Spark Streaming application.
   b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).
   c) Implement streaming transformations and actions to process and analyze the incoming data stream.


In [None]:
##3a.
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Create a SparkSession
spark = SparkSession.builder \
    .appName('SparkStreamingExample') \
    .getOrCreate()

# Create a StreamingContext with a batch interval of 1 second
ssc = StreamingContext(spark.sparkContext, 1)

# Set the log level to only display error messages
spark.sparkContext.setLogLevel('ERROR')

# Create a DStream by consuming data from a TCP socket
lines = ssc.socketTextStream('localhost', 9999)

# Perform transformations and actions on the DStream
word_counts = lines.flatMap(lambda line: line.split(' ')) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda x, y: x + y)

# Print the word counts
word_counts.pprint()

# Start the streaming context
ssc.start()

# Wait for the streaming to finish
ssc.awaitTermination()


In [None]:
##3b.
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
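# Note: pyspark.streaming.kafka (the Kafka 0.8 DStream connector) was removed in Spark 3.0,
# so this cell only runs on Spark 2.x with the spark-streaming-kafka-0-8 package available.
from pyspark.streaming.kafka import KafkaUtils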

# Create a SparkSession
spark = SparkSession.builder \
    .appName('SparkStreamingKafkaExample') \
    .getOrCreate()

# Create a StreamingContext with a batch interval of 1 second
ssc = StreamingContext(spark.sparkContext, 1)

# Set the log level to only display error messages
spark.sparkContext.setLogLevel('ERROR')

# Kafka broker details
bootstrap_servers = 'localhost:9092'
topic = 'my_topic'

# Create a Kafka direct stream
kafka_stream = KafkaUtils.createDirectStream(ssc, [topic], {'metadata.broker.list': bootstrap_servers})

# Each Kafka record arrives as a (key, value) tuple; keep only the message value
messages = kafka_stream.map(lambda x: x[1])

# Perform transformations and actions on the messages
word_counts = messages.flatMap(lambda line: line.split(' ')) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda x, y: x + y)

# Print the word counts
word_counts.pprint()

# Start the streaming context
ssc.start()

# Wait for the streaming to finish
ssc.awaitTermination()


In [None]:
##3c.
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
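# Note: pyspark.streaming.kafka (the Kafka 0.8 DStream connector) was removed in Spark 3.0,
# so this cell only runs on Spark 2.x with the spark-streaming-kafka-0-8 package available.
from pyspark.streaming.kafka import KafkaUtils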

# Create a SparkSession
spark = SparkSession.builder \
    .appName('StreamingDataProcessingExample') \
    .getOrCreate()

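# Create a StreamingContext with a batch interval of 1 second
ssc = StreamingContext(spark.sparkContext, 1)

# updateStateByKey requires checkpointing; 'checkpoint' is a placeholder directory
ssc.checkpoint('checkpoint')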

# Set the log level to only display error messages
spark.sparkContext.setLogLevel('ERROR')

# Kafka broker details
bootstrap_servers = 'localhost:9092'
topic = 'my_topic'

# Create a Kafka direct stream
kafka_stream = KafkaUtils.createDirectStream(ssc, [topic], {'metadata.broker.list': bootstrap_servers})

# Each Kafka record arrives as a (key, value) tuple; keep only the message value
messages = kafka_stream.map(lambda x: x[1])

# Perform streaming transformations and actions
word_counts = messages.flatMap(lambda line: line.split(' ')) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda x, y: x + y)

# Print the word counts for each batch interval
word_counts.pprint()

# Perform a windowed word count over the last 10 seconds of data, recomputed every 5 seconds
windowed_word_counts = messages.window(10, 5) \
                              .flatMap(lambda line: line.split(' ')) \
                              .map(lambda word: (word, 1)) \
                              .reduceByKey(lambda x, y: x + y)

# Print the windowed word counts
windowed_word_counts.pprint()

# Perform a stateful word count
def update_func(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)

stateful_word_counts = messages.flatMap(lambda line: line.split(' ')) \
                              .map(lambda word: (word, 1)) \
                              .updateStateByKey(update_func)

# Print the stateful word counts
stateful_word_counts.pprint()

# Start the streaming context
ssc.start()

# Wait for the streaming to finish
ssc.awaitTermination()
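
The stateful count above can be hard to follow without a running stream. As a rough illustration, the sketch below replays the same update_func over two made-up micro-batches in plain Python, with no Spark involved. The helper simulate_stateful_counts and the sample lines are hypothetical and only approximate what updateStateByKey does per key.

```python
from collections import defaultdict

def update_func(new_values, last_sum):
    # Same update function as in the cell above
    return sum(new_values) + (last_sum or 0)

def simulate_stateful_counts(batches):
    """Apply update_func batch by batch, roughly as updateStateByKey does per key."""
    state = {}
    history = []
    for lines in batches:
        # Group the 1s emitted for each word within this batch
        new_values = defaultdict(list)
        for line in lines:
            for word in line.split(' '):
                new_values[word].append(1)
        # Every key seen so far is updated, including keys absent from this batch
        for word in set(state) | set(new_values):
            state[word] = update_func(new_values.get(word, []), state.get(word))
        history.append(dict(state))
    return history

# Hypothetical input: two micro-batches of text lines
batches = [['spark streaming', 'spark'], ['streaming kafka']]
history = simulate_stateful_counts(batches)
for i, counts in enumerate(history, start=1):
    print(f'After batch {i}: {sorted(counts.items())}')
```

The count for a word that does not appear in a later batch is carried over unchanged, because update_func is then called with an empty list of new values and the previous sum.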


4. Spark SQL and Data Source Integration:
   a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).
   b) Perform SQL operations on the data stored in the database using Spark SQL.
   c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.
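
No solution cells are given for section 4, so the following is only a sketch of how the three tasks might be approached. It assumes a PostgreSQL database and defines small helper functions that take an existing SparkSession as the spark argument. All helper names, the connection details, and the table name are placeholders. Actually running the reads requires the PostgreSQL JDBC driver jar (and, for S3, the hadoop-aws module) to be available to Spark.

```python
def jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's JDBC data source (PostgreSQL assumed)."""
    return {
        'url': f'jdbc:postgresql://{host}:{port}/{database}',
        'dbtable': table,
        'user': user,
        'password': password,
        'driver': 'org.postgresql.Driver',
    }

def load_table(spark, options):
    """4a: read a database table into a DataFrame through JDBC."""
    return spark.read.format('jdbc').options(**options).load()

def query_table(spark, options, view_name, sql):
    """4b: expose the table as a temporary view and run a Spark SQL query on it."""
    load_table(spark, options).createOrReplaceTempView(view_name)
    return spark.sql(sql)

def read_csv_from(spark, uri):
    """4c: read CSV from another storage system; the URI scheme selects it,
    for example hdfs://... for HDFS or s3a://... for Amazon S3."""
    return spark.read.csv(uri, header=True, inferSchema=True)

# Placeholder connection details, not a real server
options = jdbc_options('localhost', 5432, 'shop', 'orders', 'spark_user', 'secret')
print(options['url'])
```

With a live session and database, a call such as query_table(spark, options, 'orders', 'SELECT COUNT(*) AS n FROM orders') would return a DataFrame holding the query result.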
