1. Working with RDDs:
   a) Write a Python program to create an RDD from a local data source.
   b) Implement transformations and actions on the RDD to perform data processing tasks.
   c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.



In [None]:
from pyspark import SparkContext

# Start a local Spark context (single worker thread).
sc = SparkContext("local", "RDD Example")

# a) Create an RDD from a local Python list.
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
print(rdd.collect())

# b) Transformations are lazy; they run only when an action is called.
rdd_mapped = rdd.map(lambda x: x * 2)            # double each element
rdd_filtered = rdd.filter(lambda x: x % 2 == 0)  # keep even numbers

# Actions trigger the computation and return a result to the driver.
sum_of_elements = rdd.reduce(lambda x, y: x + y)

# aggregate() builds a (sum, count) pair: the first function folds one value
# into a partition's accumulator, the second merges two accumulators.
sum_count = rdd.aggregate(
    (0, 0),
    lambda acc, value: (acc[0] + value, acc[1] + 1),
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]),
)

print(rdd_mapped.collect())
print(rdd_filtered.collect())
print(sum_of_elements)
print(sum_count)

# c) Further analysis with the same operations.
squared_rdd = rdd.map(lambda x: x ** 2)
filtered_rdd = rdd.filter(lambda x: x > 3)
max_number = rdd.reduce(lambda x, y: x if x > y else y)

# Reuse the (sum, count) pair computed above to get the mean.
average = sum_count[0] / sum_count[1]

print(squared_rdd.collect())
print(filtered_rdd.collect())
print(max_number)
print(average)
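
The two functions passed to `aggregate` can be easier to follow without a cluster. The sketch below is a plain-Python model of the idea, not Spark's implementation: the helper `aggregate_model` and the two-way partition split are assumptions made for illustration. It folds each partition into its own accumulator and then merges the accumulators, which is why the zero value must be neutral for both functions.

```python
from functools import reduce


def aggregate_model(partitions, zero, seq_op, comb_op):
    """Plain-Python model of RDD.aggregate over a list of partitions."""
    # Fold each partition's values into its own accumulator, starting from zero.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Merge the per-partition accumulators, again starting from zero.
    return reduce(comb_op, partials, zero)


# Same functions as in the cell above: accumulate a (sum, count) pair.
seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

# Hypothetical split of [1, 2, 3, 4, 5] into two partitions.
partitions = [[1, 2], [3, 4, 5]]
sum_count = aggregate_model(partitions, (0, 0), seq_op, comb_op)
average = sum_count[0] / sum_count[1]
print(sum_count, average)  # (15, 5) 3.0
```

The result does not depend on how the data is split, because addition is associative and `(0, 0)` is neutral for both functions.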


2. Spark DataFrame Operations:
   a) Write a Python program to load a CSV file into a Spark DataFrame.
   b) Perform common DataFrame operations such as filtering, grouping, or joining.
   c) Apply Spark SQL queries on the DataFrame to extract insights from the data.






In [None]:
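from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrame Example").getOrCreate()

# a) Load a CSV file; the first row holds column names and types are inferred.
# "file_path.csv" is a placeholder; the code below expects age, gender and salary columns.
df = spark.read.csv("file_path.csv", header=True, inferSchema=True)
df.show()

# b) Filtering: keep rows where age is greater than 30.
filtered_df = df.filter(df["age"] > 30)

# Grouping: mean age and total salary per gender.
grouped_df = df.groupBy("gender").agg({"age": "mean", "salary": "sum"})

# Joining: attach each gender's aggregates to the matching rows of df.
joined_df = df.join(grouped_df, on="gender", how="inner")

filtered_df.show()
grouped_df.show()
joined_df.show()

# c) Register the DataFrame as a temporary view and query it with Spark SQL.
df.createOrReplaceTempView("people")
result = spark.sql("SELECT * FROM people WHERE age > 30")
result.show()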
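
To try the cell above without a real dataset, a small sample file can be generated first. This is only a sketch: the helper `write_sample_csv` and every row value are made up for illustration, and the column names are simply the ones the cell refers to (`age`, `gender`, `salary`, plus an `id`).

```python
import csv

FIELDNAMES = ["id", "age", "gender", "salary"]

# Hypothetical sample rows, invented purely so the DataFrame cell has input.
SAMPLE_ROWS = [
    {"id": 1, "age": 34, "gender": "F", "salary": 5200},
    {"id": 2, "age": 28, "gender": "M", "salary": 4100},
    {"id": 3, "age": 45, "gender": "F", "salary": 6300},
    {"id": 4, "age": 31, "gender": "M", "salary": 4800},
]


def write_sample_csv(path):
    """Write the sample rows to `path` with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)


# Same placeholder file name as in the cell above.
write_sample_csv("file_path.csv")
```

Run this before the DataFrame cell so that `spark.read.csv("file_path.csv", ...)` finds the file in the working directory.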


3. Spark Streaming:
   a) Write a Python program to create a Spark Streaming application.
   b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).
   c) Implement streaming transformations and actions to process and analyze the incoming data stream.



In [None]:
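from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# a) A socket receiver occupies one thread, so a local master needs at least two.
# getOrCreate() reuses a running context if one exists, in which case this conf is ignored.
conf = SparkConf().setMaster("local[2]").setAppName("Streaming Example")
sc = SparkContext.getOrCreate(conf)
ssc = StreamingContext(sc, 1)  # 1-second batch interval

# b) Consume lines of text from a TCP socket.
dstream = ssc.socketTextStream("localhost", 9999)

# c) Split each line into words and keep lines containing the exact token "error".
processed_dstream = dstream.map(lambda line: line.split(" ")).filter(lambda words: "error" in words)

# Print the first elements of each batch.
processed_dstream.pprint()

# Start the computation and block until it is stopped.
ssc.start()
ssc.awaitTermination()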
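
The map and filter chain above can be checked on ordinary Python lists before wiring it to a live socket. The sketch below applies the same two lambdas to one made-up batch of lines; the helper `keep_error_lines` and the sample lines are assumptions for illustration. It shows that `"error" in words` matches only the exact token `error`.

```python
def keep_error_lines(lines):
    """Plain-Python equivalent of the map/filter chain applied to one batch."""
    split_lines = map(lambda line: line.split(" "), lines)
    return list(filter(lambda words: "error" in words, split_lines))


# Hypothetical batch of incoming lines.
batch = [
    "disk error on node 3",
    "error: connection refused",
    "all systems nominal",
    "ERROR in job 7",
]

print(keep_error_lines(batch))
# [['disk', 'error', 'on', 'node', '3']]
```

Lines containing `error:` or `ERROR` are dropped because the test compares whole, case-sensitive tokens. If those should match as well, one option is to filter on `"error" in line.lower()` before splitting.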


4. Spark SQL and Data Source Integration:
   a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).
   b) Perform SQL operations on the data stored in the database using Spark SQL.
   c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.

In [None]:
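from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

# a) Connection details are placeholders; the MySQL JDBC driver JAR must be on Spark's classpath.
jdbc_url = "jdbc:mysql://localhost:3306/database_name"
db_properties = {
    "user": "username",
    "password": "password",
}

# Load the database table into a DataFrame.
df = spark.read.jdbc(url=jdbc_url, table="table_name", properties=db_properties)
df.show()

# b) Register the DataFrame as a temporary view and query it with Spark SQL.
# "table_name" and "column_name" are placeholders; plain names avoid clashing
# with the SQL keywords TABLE and COLUMN.
df.createOrReplaceTempView("table_name")
result = spark.sql("SELECT * FROM table_name WHERE column_name > 10")
result.show()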
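
For part c), the same DataFrame reader can typically be pointed at HDFS or Amazon S3 by changing only the URI scheme. The sketch below is a minimal illustration under assumptions: the helper `read_csv_from`, the host, port, bucket name and paths are all hypothetical, and reading `s3a://` paths additionally requires the `hadoop-aws` module and valid credentials to be configured.

```python
def read_csv_from(spark, uri):
    """Read a CSV dataset from any storage URI the Spark session can reach."""
    return spark.read.csv(uri, header=True, inferSchema=True)


# Hypothetical locations; replace host, port, bucket and paths with real ones.
HDFS_URI = "hdfs://namenode:8020/data/people.csv"
S3_URI = "s3a://example-bucket/data/people.csv"

# Usage, given an active SparkSession named `spark`:
# hdfs_df = read_csv_from(spark, HDFS_URI)
# s3_df = read_csv_from(spark, S3_URI)
```

Once loaded, such a DataFrame can be registered with `createOrReplaceTempView` and queried with `spark.sql`, in the same way as the JDBC-backed DataFrame above.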

