# Assignment 4 - (Apache Spark)


1. Working with RDDs:
   a) Write a Python program to create an RDD from a local data source.
   b) Implement transformations and actions on the RDD to perform data processing tasks.
   c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.


In [None]:
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "RDD Example")

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Perform transformations and actions on the RDD
squared_rdd = rdd.map(lambda x: x**2)  # Square each element
filtered_rdd = squared_rdd.filter(lambda x: x > 10)  # Keep only elements greater than 10
sum_result = filtered_rdd.reduce(lambda x, y: x + y)  # Calculate the sum of remaining elements

# Analyze and manipulate data using RDD operations
count = rdd.count()  # Count the number of elements in the RDD
first_element = rdd.first()  # Get the first element of the RDD
collected_data = rdd.collect()  # Collect all elements of the RDD into a local list

# Print the results
print("Original RDD: ", rdd.collect())
print("Squared RDD: ", squared_rdd.collect())
print("Filtered RDD: ", filtered_rdd.collect())
print("Sum Result: ", sum_result)
print("Count: ", count)
print("First Element: ", first_element)
print("Collected Data: ", collected_data)

# Stop the SparkContext
sc.stop()
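
Task 1c also names `aggregate`, which the program above does not use. The sketch below is a pure-Python model of how `RDD.aggregate(zeroValue, seqOp, combOp)` behaves: each partition is folded with `seqOp` starting from the zero value, and the per-partition results are then merged with `combOp`. The helper `aggregate` and the two-way split of the data are assumptions made for illustration; Spark chooses the real partitioning itself.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Pure-Python model of RDD.aggregate (illustrative helper, not a Spark API)."""
    # Fold every partition independently, starting from the zero value
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Merge the per-partition results, again starting from the zero value
    return reduce(comb_op, partials, zero)


# The data [1, 2, 3, 4, 5] split into two hypothetical partitions
partitions = [[1, 2], [3, 4, 5]]


def seq_op(acc, x):
    # Add one element to a (sum, count) accumulator
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    # Merge two (sum, count) accumulators
    return (a[0] + b[0], a[1] + b[1])


total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
mean = total / count
print("Sum:", total, "Count:", count, "Mean:", mean)
```

With an active SparkContext, the corresponding call would be `rdd.aggregate((0, 0), seq_op, comb_op)`, made before `sc.stop()`.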



2. Spark DataFrame Operations:
   a) Write a Python program to load a CSV file into a Spark DataFrame.
   b) Perform common DataFrame operations such as filtering, grouping, or joining.
   c) Apply Spark SQL queries on the DataFrame to extract insights from the data.


In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrame Example").getOrCreate()

# Load a CSV file into a Spark DataFrame
data_path = "path/to/your/file.csv"
df = spark.read.csv(data_path, header=True, inferSchema=True)

# Perform common DataFrame operations
filtered_df = df.filter(df["age"] > 30)  # Filter rows where age is greater than 30
grouped_df = df.groupBy("gender").count()  # Group by gender and count occurrences
joined_df = df.join(grouped_df, "gender")  # Join with grouped DataFrame on gender

# Apply Spark SQL queries on the DataFrame
df.createOrReplaceTempView("people")  # Create a temporary view for the DataFrame
sql_result = spark.sql("SELECT * FROM people WHERE age > 30")  # Run SQL query on the DataFrame

# Show the results
df.show()
filtered_df.show()
grouped_df.show()
joined_df.show()
sql_result.show()

# Stop the SparkSession
spark.stop()
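
The program above expects a CSV file with `age` and `gender` columns. To try it without a real dataset, a small sample file can be generated first. This is a minimal sketch using only the standard library; the file name `people_sample.csv`, the `name` column, and every row value are made-up illustration data, not part of the assignment.

```python
import csv
import os
import tempfile


def write_sample_csv(path):
    """Write a tiny people table with a header row and return the row count."""
    # Made-up rows, for illustration only
    rows = [
        {"name": "Alice", "age": 34, "gender": "F"},
        {"name": "Bob", "age": 28, "gender": "M"},
        {"name": "Carol", "age": 45, "gender": "F"},
        {"name": "Dave", "age": 31, "gender": "M"},
    ]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "age", "gender"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Hypothetical location in the system temp directory
data_path = os.path.join(tempfile.gettempdir(), "people_sample.csv")
row_count = write_sample_csv(data_path)
print("Wrote", row_count, "rows to", data_path)
```

Assigning this `data_path` in place of `"path/to/your/file.csv"` lets the filter, group, join, and SQL steps run against known input.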


3. Spark Streaming:
   a) Write a Python program to create a Spark Streaming application.
   b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).
   c) Implement streaming transformations and actions to process and analyze the incoming data stream.


In [None]:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext  # Legacy DStream API; Structured Streaming is the newer alternative

# Create a SparkSession
spark = SparkSession.builder.appName("Spark Streaming Example").getOrCreate()

# Create a StreamingContext with a batch interval of 1 second
ssc = StreamingContext(spark.sparkContext, 1)

# Configure the application to consume data from a socket
host = "localhost"
port = 9999
lines = ssc.socketTextStream(host, port)

# Implement streaming transformations and actions
word_counts = lines.flatMap(lambda line: line.split(" ")) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda x, y: x + y)

# Print the word counts
word_counts.pprint()

# Start the streaming context
ssc.start()

# Block until the streaming computation is stopped or fails
ssc.awaitTermination()


# To run this program, first start a socket server that streams text lines
# on the same host and port, for example with netcat:
#   nc -lk 9999
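
If `nc` is not available, a small Python server can play the same role. The sketch below listens on a port and sends newline-terminated text lines to the first client that connects, which is the input form `socketTextStream` reads. The function name `start_line_server`, its parameters, and the sample sentences are assumptions made for illustration.

```python
import socket
import threading
import time


def start_line_server(lines, host="localhost", port=9999, delay=1.0):
    """Send each line to the first client that connects, then close.

    Returns the listening socket and the background thread.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)

    def serve():
        conn, _ = server.accept()  # Blocks until a client connects
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
                time.sleep(delay)  # Spread the lines over several batches
        server.close()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return server, thread


# Example usage (run before ssc.start()):
# server, thread = start_line_server(["hello spark", "hello streaming"])
```

Because the server runs in a daemon thread, it suits a notebook session; in a standalone script, call `thread.join()` so the process stays alive until all lines are sent.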

4. Spark SQL and Data Source Integration:
   a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).
   b) Perform SQL operations on the data stored in the database using Spark SQL.
   c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.



In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

# Connect Spark with a relational database (MySQL)
jdbc_url = "jdbc:mysql://localhost:3306/mydatabase"  # Replace with your MySQL database URL
table_name = "mytable"  # Replace with your table name
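connection_properties = {
    "user": "username",  # Replace with your MySQL username
    "password": "password",  # Replace with your MySQL password
    # MySQL Connector/J 8.x driver class; the connector JAR must be on Spark's classpath
    "driver": "com.mysql.cj.jdbc.Driver"
}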

# Read data from the database table
df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=connection_properties)

# Perform SQL operations on the data stored in the database using Spark SQL
df.createOrReplaceTempView("my_temp_view")  # Create a temporary view for the DataFrame
result = spark.sql("SELECT * FROM my_temp_view WHERE age > 30")  # Run SQL query on the DataFrame

# Explore integration capabilities of Spark with other data sources (e.g., HDFS or Amazon S3)
hdfs_path = "hdfs://localhost:9000/mydata/file.csv"  # Replace with your HDFS file path
s3_path = "s3a://my-bucket/mydata/file.csv"  # Replace with your S3 file path

# Read data from HDFS
df_hdfs = spark.read.csv(hdfs_path, header=True, inferSchema=True)

# Read data from Amazon S3
df_s3 = spark.read.csv(s3_path, header=True, inferSchema=True)

# Show the results
df.show()
result.show()
df_hdfs.show()
df_s3.show()

# Stop the SparkSession
spark.stop()
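
The connection settings above are hardcoded, which is easy to leak when a notebook is shared. One possible alternative, sketched below, reads the credentials from environment variables and builds the JDBC URL from its parts. The helper `build_jdbc_config` and the variable names `MYSQL_USER` and `MYSQL_PASSWORD` are assumptions made for illustration; the fallback values simply mirror the placeholders used above.

```python
import os


def build_jdbc_config(host="localhost", port=3306, database="mydatabase"):
    """Return (url, properties) in the shape spark.read.jdbc expects."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    properties = {
        # Hypothetical environment variable names; placeholders as fallback
        "user": os.environ.get("MYSQL_USER", "username"),
        "password": os.environ.get("MYSQL_PASSWORD", "password"),
        # MySQL Connector/J 8.x driver class
        "driver": "com.mysql.cj.jdbc.Driver",
    }
    return url, properties


jdbc_url, connection_properties = build_jdbc_config()
print(jdbc_url)
```

The returned pair can be passed as `spark.read.jdbc(url=jdbc_url, table=table_name, properties=connection_properties)`.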
