In [1]:
from pyspark import SparkConf, SparkContext

# Initialize SparkContext
conf = SparkConf().setAppName("MaxSnowfall").setMaster("local")
sc = SparkContext(conf=conf)

# Sample input data (simulating random weather data)
weather_data = [
    ("001", 2023, 1, 15, 5),
    ("002", 2023, 1, 15, 3),
    ("001", 2023, 2, 5, 7),
    ("002", 2023, 2, 5, 1),
    ("001", 2023, 12, 25, 10),
    ("003", 2023, 12, 25, 6),
    ("001", 2023, 11, 30, 2),
    ("003", 2023, 11, 30, 8)
]

# Parallelize the data (simulating reading from a file)
rdd = sc.parallelize(weather_data)

# Mapper: Emit (year, station_id, day) as key and snowfall as value
def map_function(record):
    station_id, year, month, day, snowfall = record
    return ((year, station_id, day), snowfall)

# Applying the map function
mapped_rdd = rdd.map(map_function)

# Reducer: Find maximum snowfall per (year, station_id, day)
def reduce_function(a, b):
    return max(a, b)

# Apply the reduce function to get the maximum snowfall
reduced_rdd = mapped_rdd.reduceByKey(reduce_function)

# Collect results and find the (year, station_id, day) with the maximum snowfall
max_snowfall_record = reduced_rdd.collect()

# Output the results
max_snowfall = max(max_snowfall_record, key=lambda x: x[1])
year, station_id, day = max_snowfall[0]
snowfall = max_snowfall[1]
print(f"Max Snowfall in {year}: Station {station_id}, Day {day}, Snowfall {snowfall} inches")

# Stop SparkContext
sc.stop()


Max Snowfall in 2023: Station 001, Day 25, Snowfall 10 inches


Explanation of the Program
1. Input Data:

    The program simulates a weather database using a list of tuples in the format (station_id, year, month, day, snowfall).

    In a real scenario, this data would be read from a file, for example one stored on HDFS.
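
    As a rough sketch of the file-based case, assume each input line is a comma-separated record in the same field order as the tuples. Both that line layout and the parse_line helper are assumptions made for illustration, not part of the original program:

```python
def parse_line(line):
    """Parse 'station_id,year,month,day,snowfall' into a typed tuple.

    The comma-separated layout is an assumption made for illustration.
    """
    station_id, year, month, day, snowfall = line.strip().split(",")
    return (station_id, int(year), int(month), int(day), float(snowfall))


# Plain-Python check on two sample lines
lines = ["001,2023,12,25,10", "003,2023,11,30,8"]
records = [parse_line(line) for line in lines]
print(records)
```

    With Spark, the same function could be applied as sc.textFile(path).map(parse_line), where path is a local or HDFS location.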

2. Map Function:

def map_function(record):
    station_id, year, month, day, snowfall = record
    return ((year, station_id, day), snowfall)

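    Key: (year, station_id, day), which identifies a station's reading by year and day of the month. The month is unpacked but is not part of the key, so readings that fall on the same day number in different months share a key.

    Value: snowfall, the snowfall amount for that day.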
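
    If each calendar date should get its own key, the month can be added to the tuple. The sketch below compares the original mapper with a hypothetical variant, map_function_with_month, which is not part of the original program. The February reading is made up for the comparison:

```python
def map_function(record):
    # Original mapper: the month is unpacked but not used in the key
    station_id, year, month, day, snowfall = record
    return ((year, station_id, day), snowfall)


def map_function_with_month(record):
    # Hypothetical variant: the month is part of the key
    station_id, year, month, day, snowfall = record
    return ((year, station_id, month, day), snowfall)


jan = ("001", 2023, 1, 15, 5)
feb = ("001", 2023, 2, 15, 9)  # made-up reading for illustration

# Same key without the month, different keys with it
print(map_function(jan)[0] == map_function(feb)[0])                        # True
print(map_function_with_month(jan)[0] == map_function_with_month(feb)[0])  # False
```

    The overall maximum is the same with either key. The difference only matters when the per-key results themselves are of interest.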

3. Reduce Function:

def reduce_function(a, b):
    return max(a, b)

    reduceByKey applies this function pairwise to the snowfall values that share a key, leaving one maximum per (year, station_id, day).
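
    To see what a per-key reduction does with this function, here is a single-process stand-in in plain Python. reduce_by_key_local is a hypothetical helper written for illustration. It does not model how Spark combines values within each partition before shuffling:

```python
def reduce_function(a, b):
    return max(a, b)


def reduce_by_key_local(pairs, func):
    """Fold values that share a key, one pair at a time (single-process stand-in)."""
    result = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are folded in
        result[key] = func(result[key], value) if key in result else value
    return result


pairs = [
    ((2023, "001", 15), 5),
    ((2023, "001", 15), 9),  # made-up second reading for the same key
    ((2023, "002", 15), 3),
]
print(reduce_by_key_local(pairs, reduce_function))
# {(2023, '001', 15): 9, (2023, '002', 15): 3}
```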

4. Final Output:

max_snowfall = max(max_snowfall_record, key=lambda x: x[1])

    The program finds the record with the maximum snowfall from the output of the reducer.
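
    The sample data has a single largest value, but real data may contain ties. Python's max returns the first maximal item it encounters, and the order of collected results after a shuffle is not something to rely on. One way to make the choice repeatable, offered here as an assumption rather than as part of the original program, is to break ties on the key itself:

```python
results = [
    ((2023, "003", 30), 8),
    ((2023, "001", 25), 10),
    ((2023, "002", 25), 10),  # made-up tie for illustration
]

# Compare by snowfall first, then by key, so the winner does not depend on list order
best = max(results, key=lambda kv: (kv[1], kv[0]))
print(best)  # ((2023, '002', 25), 10)
```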

5. Result:

    The output prints the station, day, and snowfall for the largest value in the data set. The program does not filter by year, so this is the 2023 maximum only because every sample record is from 2023.

Example Output:

Max Snowfall in 2023: Station 001, Day 25, Snowfall 10 inches

📌 How This Relates to Big Data and Hadoop:

    MapReduce: The program follows the MapReduce model using Spark's API. map turns each record into a key-value pair, and reduceByKey aggregates the values (snowfall) for each key to compute the maximum.

    Scalability: This approach can scale horizontally to handle large datasets across multiple nodes in a Hadoop cluster.

    Fault Tolerance: Spark recovers lost partitions by recomputing them from the RDD lineage (the recorded chain of transformations), while HDFS protects the stored input by replicating data blocks across nodes.

    Distributed Computing: The work is distributed across the cluster, and each node processes part of the data in parallel.
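
    A maximum parallelizes well because max is associative and commutative: each partition can compute a local result, and the partial results can be merged in any order. The plain-Python sketch below illustrates the idea. The two-way split stands in for partitions on different nodes and is purely illustrative:

```python
weather_values = [5, 3, 7, 1, 10, 6, 2, 8]  # snowfall column from the sample data

# Pretend each slice lives on a different node
partitions = [weather_values[:4], weather_values[4:]]

local_maxima = [max(part) for part in partitions]  # computed independently per slice
global_max = max(local_maxima)                     # partial results merged at the end

print(local_maxima, global_max)  # [7, 10] 10
```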

🚀 Running on Hadoop

To run this on a Hadoop cluster:

    Store the input file on HDFS.

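    Submit the script using spark-submit, passing the HDFS input and output paths as arguments. The script would first need to read its input with sc.textFile instead of sc.parallelize, and drop the hard-coded setMaster("local") so that the master given to spark-submit takes effect.

    Monitor the job through the YARN ResourceManager UI or the Spark UI, and fetch the output from the specified HDFS directory.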
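
    Because the script above builds its data in memory, it would need to accept paths before it could be submitted this way. A minimal sketch of that argument handling follows. The script name max_snowfall.py, the argument names, and the example paths are all assumptions:

```python
import argparse


def parse_args(argv=None):
    """Read the input and output locations from the command line."""
    parser = argparse.ArgumentParser(description="Find the maximum snowfall record")
    parser.add_argument("input_path", help="e.g. hdfs:///data/weather.csv (hypothetical)")
    parser.add_argument("output_path", help="e.g. hdfs:///output/max_snowfall (hypothetical)")
    return parser.parse_args(argv)


# Example invocation (hypothetical script name and paths):
#   spark-submit --master yarn max_snowfall.py hdfs:///data/weather.csv hdfs:///output/max_snowfall
args = parse_args(["hdfs:///data/weather.csv", "hdfs:///output/max_snowfall"])
print(args.input_path, args.output_path)
```

    Inside the job, sc.textFile(args.input_path) would then supply the records, and saveAsTextFile(args.output_path) on the result RDD would write the output.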