1. Working with RDDs:
   a) Write a Python program to create an RDD from a local data source.
   b) Implement transformations and actions on the RDD to perform data processing tasks.
   c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.

2. Spark DataFrame Operations:
   a) Write a Python program to load a CSV file into a Spark DataFrame.
   b) Perform common DataFrame operations such as filtering, grouping, or joining.
   c) Apply Spark SQL queries on the DataFrame to extract insights from the data.

3. Spark Streaming:
   a) Write a Python program to create a Spark Streaming application.
   b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).
   c) Implement streaming transformations and actions to process and analyze the incoming data stream.

4. Spark SQL and Data Source Integration:
   a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).
   b) Perform SQL operations on the data stored in the database using Spark SQL.
   c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.


Working with RDDs:
a) Here's an example of creating an RDD from a local data source:

from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "RDD Example")

# Create an RDD from a local data source (list)
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Print the RDD elements
print(rdd.collect())

# Close the SparkContext (run this only after the snippets in b) and c), which reuse sc and rdd)
sc.stop()


b) Implementing transformations and actions on the RDD:

# Transformations
rdd = rdd.map(lambda x: x * 2)  # Multiply each element by 2

# Actions
total = rdd.reduce(lambda x, y: x + y)  # Calculate the sum of all elements

# Print the transformed RDD and the result of the action
print(rdd.collect())
print(total)


c) Analyzing and manipulating data using RDD operations:

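# Filter RDD elements
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)  # Keep only even numbers (after the doubling in b), every element is even)

# Aggregate RDD elements: calculate the product of all elements
product = rdd.aggregate(
    1,                                # Zero value: the identity for multiplication
    lambda acc, x: acc * x,           # seqOp: fold each element into a partition's accumulator
    lambda acc1, acc2: acc1 * acc2,   # combOp: merge the per-partition accumulators
)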

# Print the filtered RDD and the result of the aggregation
print(filtered_rdd.collect())
print(product)
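
To see why the zero value passed to aggregate should be an identity element, the following plain-Python sketch models what aggregate does. It is not Spark code: the helper name simulate_aggregate and the split into two partitions are assumptions made for illustration, and Spark decides the real partitioning.

```python
from functools import reduce


def simulate_aggregate(partitions, zero, seq_op, comb_op):
    """Model RDD.aggregate: fold each partition, then merge the partial results."""
    # Each partition is folded independently, starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # The partial results are merged, again starting from the zero value.
    return reduce(comb_op, partials, zero)


def multiply(a, b):
    return a * b


# The doubled data from b), split into two hypothetical partitions.
partitions = [[2, 4], [6, 8, 10]]

product = simulate_aggregate(partitions, 1, multiply, multiply)
print(product)  # 3840

# A non-identity zero value is applied once per partition and once in the merge.
skewed = simulate_aggregate(partitions, 2, multiply, multiply)
print(skewed)  # 30720
```

With a zero value of 2 the result is 8 times too large here (2 for each of the two partitions and 2 for the merge), so the outcome would also change with the number of partitions.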


Spark DataFrame Operations:
a) Loading a CSV file into a Spark DataFrame:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrame Example").getOrCreate()

# Load a CSV file into a DataFrame
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Show the DataFrame schema and some sample data
df.printSchema()
df.show()

# Stop the SparkSession (run this only after the snippets in b) and c), which reuse spark and df)
spark.stop()


b) Performing common DataFrame operations:

# Filter DataFrame rows
filtered_df = df.filter(df["column"] > 10)  # Keep only rows where "column" is greater than 10

# Group DataFrame by a column and calculate aggregations
grouped_df = df.groupBy("column").agg({"other_column": "sum"})  # Group by "column" and calculate the sum of "other_column"

# Join two DataFrames (df1 and df2 are placeholders for any two DataFrames that share "common_column")
joined_df = df1.join(df2, on="common_column", how="inner")  # Inner join df1 and df2 on the common_column

# Show the results of the operations
filtered_df.show()
grouped_df.show()
joined_df.show()


c) Applying Spark SQL queries on the DataFrame:

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("table_name")

# Run SQL queries on the DataFrame
result = spark.sql("SELECT * FROM table_name WHERE column > 10")

# Show the result of the query
result.show()


Spark Streaming:
a), b) Creating a Spark Streaming application that consumes data from a socket:

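from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a SparkContext with at least two local threads (one receives data, one processes it),
# then a StreamingContext with a 1-second batch interval
sc = SparkContext("local[2]", "Streaming Example")
ssc = StreamingContext(sc, batchDuration=1)

# Create a DStream from a socket source (for local testing, start a server first, e.g. with: nc -lk 9999)
dstream = ssc.socketTextStream("localhost", 9999)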

# Count the words in each batch: split lines into words, pair each word with 1, sum the counts per word
processed_stream = (
    dstream.flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

# Print the first ten elements of each batch (pprint is an output operation)
processed_stream.pprint()

# Start the streaming context and block until it is stopped
ssc.start()
ssc.awaitTermination()


c) Implementing streaming transformations and actions:

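The flatMap, map, and reduceByKey operations in the example above are transformations: each returns a new DStream, and nothing is computed until the streaming context is started. pprint is the output operation (the streaming counterpart of an RDD action) that triggers the computation for every batch. Spark Streaming provides other transformations, such as filter, window, and updateStateByKey, and other output operations, such as saveAsTextFiles and foreachRDD, to process and analyze the incoming data stream.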
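
As a rough illustration of what the pipeline computes for a single batch, here is a plain-Python sketch that mirrors the three steps. It does not use Spark, and the function name count_words_in_batch and the sample lines are hypothetical.

```python
from collections import defaultdict


def count_words_in_batch(lines):
    """Mirror flatMap -> map -> reduceByKey for the lines of one batch."""
    # flatMap: every line yields zero or more words.
    words = [word for line in lines for word in line.split()]
    # map: pair each word with a count of 1.
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts that share a key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


batch = ["spark streaming", "spark sql"]
print(count_words_in_batch(batch))  # {'spark': 2, 'streaming': 1, 'sql': 1}
```

In the streaming application each batch is counted independently like this; keeping running totals across batches needs a stateful operation such as updateStateByKey.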

Spark SQL and Data Source Integration:
a) Connecting Spark with a relational database:
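
A minimal sketch of the connection step, assuming a PostgreSQL server and that the PostgreSQL JDBC driver JAR is available to Spark (for example through the spark.jars setting). The host, database, table, and credentials are placeholders, and the helper names jdbc_options and load_table are hypothetical:

```python
def jdbc_options(host, port, database, table, user, password):
    """Build the options for Spark's JDBC data source (PostgreSQL assumed)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def load_table(spark, options):
    """Read one database table into a DataFrame through JDBC."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder connection details; replace them with real ones.
options = jdbc_options("localhost", 5432, "mydb", "my_table", "username", "password")
print(options["url"])  # jdbc:postgresql://localhost:5432/mydb

# With an active SparkSession named spark:
# df = load_table(spark, options)
# df.printSchema()
```

For MySQL, the URL would take the form jdbc:mysql://host:3306/database and the driver class would be com.mysql.cj.jdbc.Driver.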


b) Performing SQL operations on the data stored in the database:

# Register the DataFrame loaded from the database as a temporary view
df.createOrReplaceTempView("table_name")

# Run SQL queries on the DataFrame
result = spark.sql("SELECT * FROM table_name WHERE column > 10")

# Show the result of the query
result.show()
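
The query above runs in Spark after the table has been read. As an alternative, the JDBC source also accepts a parenthesized subquery in place of a table name in the dbtable option, so the database itself evaluates the SELECT. The sketch below assumes PostgreSQL or MySQL alias syntax; the helper name as_dbtable_subquery, the table my_table, and the column amount are hypothetical.

```python
def as_dbtable_subquery(sql, alias="subq"):
    """Wrap a SELECT statement so it can be passed as the JDBC dbtable option."""
    # Strip whitespace and a trailing semicolon, which is not valid inside a subquery.
    cleaned = sql.strip().rstrip(";").strip()
    return f"({cleaned}) AS {alias}"


dbtable = as_dbtable_subquery("SELECT * FROM my_table WHERE amount > 10;")
print(dbtable)  # (SELECT * FROM my_table WHERE amount > 10) AS subq

# With an active SparkSession named spark and placeholder connection details:
# df = (
#     spark.read.format("jdbc")
#     .option("url", "jdbc:postgresql://localhost:5432/mydb")
#     .option("dbtable", dbtable)
#     .option("user", "username")
#     .option("password", "password")
#     .load()
# )
```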


c) Exploring integration capabilities with other data sources:

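Spark has built-in support for several storage systems, including HDFS and Amazon S3. The same DataFrame reader and writer methods work for all of them; only the URI scheme in the path changes (hdfs:// or s3a://). Accessing S3 through s3a:// additionally requires the hadoop-aws connector on the classpath and AWS credentials configured for it: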

# Load data from HDFS into a DataFrame
df_hdfs = spark.read.csv("hdfs://localhost:9000/path/to/file.csv", header=True, inferSchema=True)

# Save DataFrame to HDFS (Spark writes a directory of part files at this path, not a single file)
df_hdfs.write.csv("hdfs://localhost:9000/path/to/save.csv", header=True)

# Load data from Amazon S3 into a DataFrame
df_s3 = spark.read.csv("s3a://bucket_name/path/to/file.csv", header=True, inferSchema=True)

# Save DataFrame to Amazon S3 (again written as a directory of part files)
df_s3.write.csv("s3a://bucket_name/path/to/save.csv", header=True)
