1. Working with RDDs:
   a) Write a Python program to create an RDD from a local data source.
   b) Implement transformations and actions on the RDD to perform data processing tasks.
   c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.


In [None]:
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext('local', 'RDD Example')

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Perform transformations and actions on the RDD
# Example 1: Map operation - square each element
squared_rdd = rdd.map(lambda x: x ** 2)

# Example 2: Filter operation - select only even numbers
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)

# Example 3: Reduce operation - sum all elements
sum_of_elements = rdd.reduce(lambda x, y: x + y)

# Example 4: Aggregate operation - calculate sum and count
sum_and_count = rdd.aggregate((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1),
                              lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))

# Print the results
print("Original RDD:", rdd.collect())
print("Squared RDD:", squared_rdd.collect())
print("Filtered RDD (Even numbers):", filtered_rdd.collect())
print("Sum of elements:", sum_of_elements)
print("Sum and count:", sum_and_count)

# Stop the SparkContext
sc.stop()
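
The aggregate call in Example 4 takes two functions: one folds the values inside each partition, the other merges the per-partition results. The sketch below emulates that flow in plain Python, without Spark, to show how the (sum, count) pair is built. The helper name `emulate_aggregate` and the split of the data into two partitions are assumptions made for illustration; Spark decides the real partitioning itself.

```python
from functools import reduce


def emulate_aggregate(partitions, zero, seq_op, comb_op):
    """Plain-Python emulation of the two-step fold behind rdd.aggregate."""
    # Step 1: fold each partition on its own, starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Step 2: merge the per-partition results, again starting from zero.
    return reduce(comb_op, partials, zero)


# Assumed partitioning of [1, 2, 3, 4, 5]; not what Spark necessarily picks.
partitions = [[1, 2], [3, 4, 5]]

# The same two functions that the cell above passes to rdd.aggregate.
seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])

total, count = emulate_aggregate(partitions, (0, 0), seq_op, comb_op)
mean = total / count
print(total, count, mean)  # prints: 15 5 3.0
```

Carrying the count alongside the sum is what makes the mean computable in a single pass over the data.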



2. Spark DataFrame Operations:
   a) Write a Python program to load a CSV file into a Spark DataFrame.
   b) Perform common DataFrame operations such as filtering, grouping, or joining.
   c) Apply Spark SQL queries on the DataFrame to extract insights from the data.


In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrame Example").getOrCreate()

# Load a CSV file into a DataFrame
df = spark.read.csv("path_to_file.csv", header=True, inferSchema=True)

# Perform common DataFrame operations
# Example 1: Filtering
filtered_df = df.filter(df["age"] > 30)

# Example 2: Grouping
grouped_df = df.groupBy("city").count()

# Example 3: Joining
other_df = spark.read.csv("path_to_another_file.csv", header=True, inferSchema=True)
# Joining on the column name keeps a single "id" column in the result
joined_df = df.join(other_df, on="id", how="inner")

# Apply Spark SQL queries on the DataFrame
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name, age FROM people WHERE age > 30")

# Show the results
filtered_df.show()
grouped_df.show()
joined_df.show()
result.show()

# Stop the SparkSession
spark.stop()
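
The cell above reads placeholder CSV paths and assumes the columns id, name, age and city. To try it without real data, a small helper along these lines could generate two sample files. All rows, the file names `people_sample.csv` and `scores_sample.csv`, and the `score` column are invented for illustration; only the four column names come from the cell.

```python
import csv
import os
import tempfile

# Invented sample rows; only the column names id, name, age and city
# come from the DataFrame cell.
PEOPLE = [
    {"id": 1, "name": "Asha", "age": 34, "city": "Pune"},
    {"id": 2, "name": "Ben", "age": 28, "city": "Leeds"},
    {"id": 3, "name": "Chen", "age": 45, "city": "Pune"},
]
# Hypothetical second table that shares the "id" column for the join.
SCORES = [
    {"id": 1, "score": 88},
    {"id": 3, "score": 71},
]


def write_csv(path, rows):
    """Write a list of dicts to path, with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


sample_dir = tempfile.mkdtemp()
people_csv = write_csv(os.path.join(sample_dir, "people_sample.csv"), PEOPLE)
scores_csv = write_csv(os.path.join(sample_dir, "scores_sample.csv"), SCORES)
print(people_csv)
print(scores_csv)
```

The two printed paths can then be passed to `spark.read.csv` in place of the placeholder file names.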



3. Spark Streaming:
   a) Write a Python program to create a Spark Streaming application.
   b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).
   c) Implement streaming transformations and actions to process and analyze the incoming data stream.


In [None]:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

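# Create a SparkSession
# (when running locally, the socket receiver occupies one thread,
# so the master needs at least two, e.g. local[2])
spark = SparkSession.builder.appName("Streaming Example").getOrCreate()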

# Create a StreamingContext
ssc = StreamingContext(spark.sparkContext, batchDuration=1)

# Configure the streaming source
stream = ssc.socketTextStream("localhost", 9999)  # Replace with your streaming source (e.g., Kafka)

# Implement streaming transformations and actions
# Example 1: Word count
word_counts = stream.flatMap(lambda line: line.split(" ")) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda x, y: x + y)

# Example 2: Filter
filtered_stream = stream.filter(lambda line: "error" in line.lower())

# Print the results
word_counts.pprint()
filtered_stream.pprint()

# Start the streaming context
ssc.start()

# Block until the streaming context is stopped or fails with an error
ssc.awaitTermination()

# Stop the SparkSession
spark.stop()
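
The socket source in the cell above connects to localhost:9999 as a client, so something must be listening there and sending newline-terminated text. A minimal feeder could look like the sketch below. The function name `start_feeder`, its parameters, and the half-second delay are assumptions for illustration, not part of Spark.

```python
import socket
import threading
import time


def start_feeder(lines, host="localhost", port=9999, delay=0.5):
    """Listen on host:port and send each line to the first client that connects.

    Returns the background thread and the port actually bound
    (useful when port=0 asks the OS to pick a free one).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def _serve():
        # Close the listening socket once the single client has been served.
        with server:
            conn, _ = server.accept()
            with conn:
                for line in lines:
                    # The stream reader splits its input on newlines.
                    conn.sendall((line + "\n").encode("utf-8"))
                    time.sleep(delay)

    thread = threading.Thread(target=_serve, daemon=True)
    thread.start()
    return thread, bound_port
```

Calling, for example, `start_feeder(["hello spark", "error: disk full"])` before `ssc.start()` gives the word count and the error filter something to process. The feeder closes the connection after the last line, so the stream receives no further input after that.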



4. Spark SQL and Data Source Integration:
   a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).
   b) Perform SQL operations on the data stored in the database using Spark SQL.
   c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.


In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

# Configure the MySQL database connection
mysql_host = 'localhost'  # Replace with the MySQL host address
mysql_port = '3306'  # Replace with the MySQL port number
mysql_database = 'your_database_name'  # Replace with the MySQL database name
mysql_username = 'your_username'  # Replace with your MySQL username
mysql_password = 'your_password'  # Replace with your MySQL password

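# Connect Spark with the MySQL database
# (the MySQL Connector/J JAR must be on Spark's classpath, e.g. via spark.jars)
jdbc_url = f"jdbc:mysql://{mysql_host}:{mysql_port}/{mysql_database}"
connection_properties = {
    'user': mysql_username,
    'password': mysql_password,
    # Class name for Connector/J 8.0 and later; 'com.mysql.jdbc.Driver' is the deprecated older name
    'driver': 'com.mysql.cj.jdbc.Driver'
}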

# Load data from the MySQL database into a DataFrame
df = spark.read.jdbc(url=jdbc_url, table='your_table_name', properties=connection_properties)

# Perform SQL operations on the DataFrame
df.createOrReplaceTempView("data")

# Example 1: Select all rows from the table
all_rows = spark.sql("SELECT * FROM data")

# Example 2: Filter rows based on a condition
filtered_rows = spark.sql("SELECT * FROM data WHERE column1 = 'value'")

# Example 3: Aggregation - Calculate the average of a column
average_value = spark.sql("SELECT AVG(column2) FROM data")

# Show the results
all_rows.show()
filtered_rows.show()
average_value.show()

# Explore integration capabilities with other data sources
# Example: Read data from HDFS
hdfs_path = 'hdfs://localhost:9000/path_to_file.csv'  # Replace with the HDFS file path
hdfs_data = spark.read.csv(hdfs_path, header=True, inferSchema=True)
hdfs_data.show()

# Stop the SparkSession
spark.stop()
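
The cell above hard-codes the database credentials, which is fine for a placeholder but easy to leak once real values are filled in. One possible alternative, sketched below, reads them from environment variables. The variable names (`MYSQL_HOST`, `MYSQL_PORT`, `MYSQL_DATABASE`, `MYSQL_USER`, `MYSQL_PASSWORD`) and the helper `mysql_jdbc_settings` are hypothetical and chosen only for this example.

```python
import os


def mysql_jdbc_settings(env=None):
    """Build the JDBC URL and connection properties from environment variables."""
    env = os.environ if env is None else env
    # Host and port fall back to the defaults used in the cell above.
    host = env.get("MYSQL_HOST", "localhost")
    port = env.get("MYSQL_PORT", "3306")
    # Database name and credentials have no default: a missing one raises KeyError.
    database = env["MYSQL_DATABASE"]
    url = f"jdbc:mysql://{host}:{port}/{database}"
    properties = {
        "user": env["MYSQL_USER"],
        "password": env["MYSQL_PASSWORD"],
        # Driver class name for MySQL Connector/J 8.0 and later.
        "driver": "com.mysql.cj.jdbc.Driver",
    }
    return url, properties
```

The returned pair can be passed straight to `spark.read.jdbc(url=..., table=..., properties=...)`, so the rest of the cell stays unchanged.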
