In [None]:
1. Working with RDDs:
   a) Write a Python program to create an RDD from a local data source.
   b) Implement transformations and actions on the RDD to perform data processing tasks.
   c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.



In [None]:
from pyspark import SparkContext

# Create SparkContext
sc = SparkContext(appName="RDD Example")

# Create RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Display the RDD
print(rdd.collect())

# Transform each element in the RDD using the map() transformation
squared_rdd = rdd.map(lambda x: x**2)
print(squared_rdd.collect())

# Filter the RDD based on a condition using the filter() transformation
filtered_rdd = rdd.filter(lambda x: x > 3)
print(filtered_rdd.collect())

# Calculate the sum of all elements in the RDD using reduce()
sum_result = rdd.reduce(lambda x, y: x + y)

# Calculate the average of the elements using the sum from reduce() and count()
count = rdd.count()
average = sum_result / count
print(sum_result, average)

# Close SparkContext only after all RDD operations have run
sc.stop()
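
Task 1c also mentions aggregate, which the cell above does not use. The following is a minimal sketch, assuming the same list of integers, of how `RDD.aggregate` could compute the sum and the count in a single pass instead of running `reduce()` and `count()` separately. The helper names `seq_op`, `comb_op`, `average_with_aggregate`, and `demo` are hypothetical and chosen for illustration only.

```python
def seq_op(acc, value):
    """Fold one element into a (running_sum, running_count) accumulator."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(left, right):
    """Merge the accumulators produced by two partitions."""
    return (left[0] + right[0], left[1] + right[1])


def average_with_aggregate(rdd):
    """Return the mean of a non-empty numeric RDD using one aggregate() pass."""
    # (0, 0) is the zero value: an empty sum and an empty count.
    total, count = rdd.aggregate((0, 0), seq_op, comb_op)
    return total / count


def demo():
    """Run the sketch on a local SparkContext (requires PySpark)."""
    # Imported here so the helpers above can be read and tested without Spark.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    print(average_with_aggregate(rdd))
    sc.stop()
```

Calling `demo()` on a machine with PySpark installed runs the aggregation. `seq_op` is applied within each partition and `comb_op` merges the per-partition results, so the two functions must agree on the accumulator shape.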



In [None]:

2. Spark DataFrame Operations:
   a) Write a Python program to load a CSV file into a Spark DataFrame.
   b) Perform common DataFrame operations such as filtering, grouping, or joining.
   c) Apply Spark SQL queries on the DataFrame to extract insights from the data.



In [None]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("Spark DataFrame Example") \
    .getOrCreate()

# Load CSV file into DataFrame
csv_path = "/path/to/file.csv"
df = spark.read.format("csv").option("header", "true").load(csv_path)

# Display the DataFrame
df.show()

# Close SparkSession
spark.stop()
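
The cell above only loads and displays the CSV, so tasks 2b and 2c are not yet covered. Below is a rough sketch of filtering, grouping, joining, and a Spark SQL query. Every column name (`category_id`, `region`, `amount`, `name`), the view name `sales`, and the sample rows are assumptions made up for illustration, since the original CSV's schema is not given. Note that a CSV loaded with only the `header` option yields string columns; setting the `inferSchema` option to `"true"` is one way to get numeric types.

```python
# Hypothetical query: total amount per region, largest first.
SALES_BY_REGION_SQL = """
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
"""


def explore(spark, sales_df, categories_df):
    """Filter, group, join, and query two DataFrames (hypothetical columns)."""
    # Filtering: keep rows whose amount exceeds 100
    large_sales = sales_df.filter(sales_df["amount"] > 100)

    # Grouping: count the remaining rows per category
    per_category = large_sales.groupBy("category_id").count()

    # Joining: attach the category name to each count
    labelled = per_category.join(categories_df, on="category_id", how="inner")

    # Spark SQL: register a temporary view and query it
    sales_df.createOrReplaceTempView("sales")
    totals = spark.sql(SALES_BY_REGION_SQL)
    return labelled, totals


def demo():
    """Run the sketch on small in-memory DataFrames (requires PySpark)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DataFrame Operations Sketch").getOrCreate()
    sales_df = spark.createDataFrame(
        [(1, "north", 250.0), (2, "south", 80.0), (1, "north", 120.0)],
        ["category_id", "region", "amount"],
    )
    categories_df = spark.createDataFrame(
        [(1, "books"), (2, "games")], ["category_id", "name"]
    )
    labelled, totals = explore(spark, sales_df, categories_df)
    labelled.show()
    totals.show()
    spark.stop()
```

To apply `explore` to the CSV loaded earlier, pass that DataFrame in place of `sales_df` and substitute the file's real column names.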


In [None]:
3. Spark Streaming:
   a) Write a Python program to create a Spark Streaming application.
   b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).
   c) Implement streaming transformations and actions to process and analyze the incoming data stream.


In [None]:
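from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Note: pyspark.streaming.kafka ships only with Spark 2.x (it was removed in Spark 3.0)
from pyspark.streaming.kafka import KafkaUtils

# Reuse the active SparkContext, or create one if the earlier session was stopped
sc = SparkContext.getOrCreate()

# Create StreamingContext with a batch interval of 1 second
ssc = StreamingContext(sc, 1)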

# Configure Kafka parameters
kafka_params = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-streaming-app",
    "auto.offset.reset": "latest"
}

# Create a DStream by consuming data from Kafka topic
kafka_stream = KafkaUtils.createDirectStream(ssc, ["my-topic"], kafka_params)

# Process and analyze the incoming data stream
processed_stream = kafka_stream.map(lambda record: record[1])  # Each record is a (key, value) tuple; keep the value
processed_stream.pprint()  # Print the processed stream to console

# Start the streaming context
ssc.start()
ssc.awaitTermination()
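
Task 3b also names a socket as a possible source, and task 3c asks for streaming transformations beyond extracting the value. The sketch below shows one common pattern, a per-batch word count over a socket text stream, using the same DStream API. The host `localhost`, port `9999`, and the helper names are assumptions for illustration; a text server (for example `nc -lk 9999`) would need to be listening for the demo to receive data.

```python
from operator import add


def split_words(line):
    """Lower-case a line and split it on whitespace."""
    return line.lower().split()


def build_word_counts(lines):
    """Turn a DStream of text lines into a DStream of (word, count) pairs."""
    return (
        lines.flatMap(split_words)      # one record per word
        .map(lambda word: (word, 1))    # pair each word with a count of 1
        .reduceByKey(add)               # sum the counts per word within each batch
    )


def demo(host="localhost", port=9999):
    """Print word counts for each 1-second batch (requires PySpark)."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream(host, port)
    build_word_counts(lines).pprint()
    ssc.start()
    ssc.awaitTermination()
```

The counts here are per batch, not cumulative. Keeping totals across batches would need a stateful operation such as `updateStateByKey`, which also requires a checkpoint directory.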


In [None]:
4. Spark SQL and Data Source Integration:
   a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).
   b) Perform SQL operations on the data stored in the database using Spark SQL.
   c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.




In [None]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("Spark MySQL Example") \
    .config("spark.driver.extraClassPath", "/path/to/mysql-connector-java.jar") \
    .getOrCreate()

# MySQL configuration
jdbc_url = "jdbc:mysql://localhost:3306/mydatabase"
connection_properties = {
    "user": "username",
    "password": "password",
    # Driver class for Connector/J 5.x; Connector/J 8.x uses "com.mysql.cj.jdbc.Driver"
    "driver": "com.mysql.jdbc.Driver"
}

# Read data from MySQL table
df = spark.read.jdbc(jdbc_url, "table_name", properties=connection_properties)

# Perform Spark SQL operations on the data
df.createOrReplaceTempView("my_table")
result = spark.sql("SELECT * FROM my_table WHERE ...")

# Display the result
result.show()

# Close SparkSession
spark.stop()
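
For task 4c, the same `spark.read` API used for local files generally accepts HDFS and S3 locations through the URI scheme. The sketch below assumes CSV data with a header row; the URIs in the comments, the `SUPPORTED_SCHEMES` set, and the helper names are hypothetical. Reading `s3a://` paths additionally assumes the `hadoop-aws` connector is on the classpath and that credentials are configured.

```python
from urllib.parse import urlparse

# Schemes this sketch accepts; an empty scheme means a plain local path.
SUPPORTED_SCHEMES = {"", "file", "hdfs", "s3a"}


def storage_scheme(uri):
    """Return the URI scheme, raising ValueError if the sketch does not handle it."""
    scheme = urlparse(uri).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"Unsupported storage scheme: {scheme!r}")
    return scheme


def load_csv(spark, uri):
    """Load a CSV with a header row from a local, HDFS, or S3 location."""
    storage_scheme(uri)  # fail early on schemes this sketch does not cover
    return spark.read.format("csv").option("header", "true").load(uri)


# Hypothetical usage, given an active SparkSession named `spark`:
#   hdfs_df = load_csv(spark, "hdfs://namenode:8020/data/events.csv")
#   s3_df = load_csv(spark, "s3a://example-bucket/data/events.csv")
```

Once loaded, these DataFrames can be registered with `createOrReplaceTempView` and queried with `spark.sql` in the same way as the JDBC table above.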
