### 1. Working with RDDs:
   a) Write a Python program to create an RDD from a local data source.
   
   b) Implement transformations and actions on the RDD to perform data processing tasks.
   
   c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.


In [1]:
from pyspark import SparkContext

# Create SparkContext
sc = SparkContext(appName="RDDExample")

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Perform transformations and actions on the RDD
squared_rdd = rdd.map(lambda x: x**2)
filtered_rdd = squared_rdd.filter(lambda x: x > 10)
sum_of_values = filtered_rdd.reduce(lambda x, y: x + y)
count = rdd.count()

# Print the results
print("Original RDD: ", rdd.collect())
print("Squared RDD: ", squared_rdd.collect())
print("Filtered RDD: ", filtered_rdd.collect())
print("Sum of values: ", sum_of_values)
print("Count: ", count)

# Stop SparkContext
sc.stop()
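
Task (c) also names `aggregate`, which the cell above does not use. The sketch below imitates in plain Python what `rdd.aggregate(zeroValue, seqOp, combOp)` does: `seqOp` folds the elements of each partition into an accumulator, and `combOp` merges the per-partition accumulators. The helper `simulate_aggregate` and the two-partition split are assumptions made for illustration. They are not part of PySpark, and Spark decides the real partitioning.

```python
from functools import reduce


def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    """Hypothetical helper: fold each partition with seq_op, then merge with comb_op."""
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, per_partition, zero_value)


# Assumed split of data = [1, 2, 3, 4, 5] into two partitions
partitions = [[1, 2, 3], [4, 5]]

# The accumulator is a (running_sum, running_count) pair
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

total, count = simulate_aggregate(partitions, (0, 0), seq_op, comb_op)
mean = total / count
print("Sum:", total, "Count:", count, "Mean:", mean)
```

With a live SparkContext, the same two functions can be passed to the RDD directly as `rdd.aggregate((0, 0), seq_op, comb_op)`.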


### 2. Spark DataFrame Operations:
   a) Write a Python program to load a CSV file into a Spark DataFrame.
   
   b) Perform common DataFrame operations such as filtering, grouping, or joining.
   
   c) Apply Spark SQL queries on the DataFrame to extract insights from the data.


In [2]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Load a CSV file into a Spark DataFrame
df = spark.read.csv("file:///path/to/file.csv", header=True, inferSchema=True)

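# Perform common DataFrame operations
filtered_df = df.filter(df["age"] > 30)  # keep rows where age is greater than 30
grouped_df = df.groupBy("gender").agg({"salary": "mean"})  # mean salary per gender, in column "avg(salary)"
joined_df = df.join(grouped_df, on="gender", how="left")  # attach each gender's mean salary to every row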

# Apply Spark SQL queries on the DataFrame
df.createOrReplaceTempView("employees")
sql_query = "SELECT * FROM employees WHERE age > 30"
result_df = spark.sql(sql_query)

# Show the results
df.show()
filtered_df.show()
grouped_df.show()
joined_df.show()
result_df.show()

# Stop SparkSession
spark.stop()


### 3. Spark Streaming:
   a) Write a Python program to create a Spark Streaming application.
  
   b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).
   
   c) Implement streaming transformations and actions to process and analyze the incoming data stream.


In [None]:
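from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Note: pyspark.streaming.kafka exists only in Spark 2.x (it was removed in Spark 3.0)
# and requires the spark-streaming-kafka-0-8 package on the classpath.
from pyspark.streaming.kafka import KafkaUtils

# Create a Spark StreamingContext with a 1-second batch interval
spark_context = SparkContext(appName="SparkStreamingExample")
streaming_context = StreamingContext(spark_context, 1)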

# Configure the application to consume data from a Kafka topic
kafka_params = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-streaming-group"
}
kafka_topic = "my-topic"
kafka_stream = KafkaUtils.createDirectStream(streaming_context, [kafka_topic], kafka_params)

# Implement streaming transformations and actions
stream_data = kafka_stream.map(lambda x: x[1])  # Extract values from Kafka stream
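word_counts = (
    stream_data
    .flatMap(lambda line: line.split(" "))  # split each line into words
    .map(lambda word: (word, 1))            # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)        # sum the counts per word within each batch
)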

# Print the word counts
word_counts.pprint()

# Start the streaming context
streaming_context.start()
streaming_context.awaitTermination()
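
The `flatMap` / `map` / `reduceByKey` chain above runs once per micro-batch, so each printed result covers only the lines that arrived in that batch interval. The sketch below reproduces the same three steps in plain Python for one batch at a time. The function `count_words` and the sample batches are hypothetical and stand in for data that would arrive from Kafka.

```python
from collections import Counter


def count_words(batch_lines):
    """Plain-Python stand-in for flatMap -> map -> reduceByKey on a single batch."""
    # flatMap: split every line and flatten the results into one list of words
    words = [word for line in batch_lines for word in line.split(" ")]
    # map: pair each word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts that share the same word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


# Two hypothetical micro-batches; counts are not carried over between them
batch_1 = ["spark streaming", "spark kafka"]
batch_2 = ["kafka"]

counts_1 = count_words(batch_1)
counts_2 = count_words(batch_2)
print(counts_1)
print(counts_2)
```

The second batch reports `kafka` once even though it also appeared in the first batch, which matches the per-batch behaviour of `reduceByKey` on a DStream. Running totals across batches would need a stateful operation such as `updateStateByKey`.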


### 4. Spark SQL and Data Source Integration:
   a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).
   
   b) Perform SQL operations on the data stored in the database using Spark SQL.
   
   c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.


In [None]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("SparkSQLExample") \
    .config("spark.driver.extraClassPath", "/path/to/database-connector.jar") \
    .getOrCreate()

# Connect Spark with a relational database
database_url = "jdbc:mysql://localhost:3306/mydatabase"
database_properties = {
    "user": "myuser",
    "password": "mypassword",
    # Legacy driver class name; MySQL Connector/J 8.x names it "com.mysql.cj.jdbc.Driver"
    "driver": "com.mysql.jdbc.Driver"
}

# Read data from the database table over JDBC
# (user, password, and driver are passed through the properties dict;
# .option() takes a single string value per key, not a dict)
df = spark.read.jdbc(url=database_url, table="mytable", properties=database_properties)

# Perform SQL operations on the data
df.createOrReplaceTempView("mytable_view")
result_df = spark.sql("SELECT * FROM mytable_view WHERE age > 30")

# Explore integration capabilities with other data sources
hdfs_path = "hdfs://localhost:9000/path/to/data"
s3_path = "s3a://bucket-name/path/to/data"

# Read data from HDFS
hdfs_df = spark.read.text(hdfs_path)

# Read data from Amazon S3
s3_df = spark.read.text(s3_path)

# Show the results
df.show()
result_df.show()
hdfs_df.show()
s3_df.show()

# Stop SparkSession
spark.stop()
