In [1]:
# Import SparkContext to initialize Spark application
from pyspark import SparkContext

# Function to parse one data line from the CSV file
def parse_line(line):
    """
    Parses one data line of ratings.csv into (movieId, rating).
    The header line must be filtered out before calling this function.
    """
    parts = line.split(",")
    return (int(parts[1]), float(parts[2]))  # Return (movieId, rating)

# Main function
def main():
    # Initialize SparkContext (local mode for standalone execution)
    sc = SparkContext("local", "MovieRatings")

    # Provide the local path to the ratings.csv file
    dataset_file = "ratings.csv"  # Change path if file is in another folder

    # Read the file into an RDD (line-by-line)
    input_rdd = sc.textFile(dataset_file)

    # Filter out header and parse remaining lines into (movieId, rating)
    mapped_rdd = input_rdd.filter(
        lambda line: not line.startswith("userId,movieId,rating,timestamp")
    ).map(parse_line)

    # Group ratings by movieId and compute average rating for each movie
    reduced_rdd = mapped_rdd.groupByKey().mapValues(
        lambda ratings: sum(ratings) / len(ratings)
    )

    # Collect and print the results
    results = reduced_rdd.collect()
    for movie_id, avg_rating in results:
        print(f"Movie {movie_id} has an average rating of {avg_rating:.2f}")

    # Stop SparkContext after processing is done
    sc.stop()

# Run the main function when script is executed
if __name__ == "__main__":
    main()


Movie 1 has an average rating of 3.92
Movie 3 has an average rating of 3.26
Movie 6 has an average rating of 3.95
Movie 47 has an average rating of 3.98
Movie 50 has an average rating of 4.24
Movie 70 has an average rating of 3.51
Movie 101 has an average rating of 3.78
Movie 110 has an average rating of 4.03
Movie 151 has an average rating of 3.55
Movie 157 has an average rating of 2.86
Movie 163 has an average rating of 3.56
Movie 216 has an average rating of 3.33
Movie 223 has an average rating of 3.86
Movie 231 has an average rating of 3.06
Movie 235 has an average rating of 3.68
Movie 260 has an average rating of 4.23
Movie 296 has an average rating of 4.20
Movie 316 has an average rating of 3.38
Movie 333 has an average rating of 3.78
Movie 349 has an average rating of 3.60
Movie 356 has an average rating of 4.16
Movie 362 has an average rating of 3.53
Movie 367 has an average rating of 3.18
Movie 423 has an average rating of 2.85
Movie 441 has an average rating of 3.93
Movie 457

In [None]:
"""
📚 Dataset
File: ratings.csv

Attributes:

userId: Unique ID of the user.

movieId: Unique ID of the movie.

rating: Rating given by the user to the movie (float).

timestamp: Time at which the rating was recorded.

🧠 Concepts Covered
1. Apache Spark & PySpark
Apache Spark is a distributed computing engine for big data.

PySpark is its Python interface.

Spark's core abstraction is the RDD (Resilient Distributed Dataset), which lets operations run in parallel across a cluster or across the cores of a single machine.

2. SparkContext
The entry point for Spark's RDD API; it represents the connection to a cluster (or to a local Spark instance).

sc.textFile() loads a text file as an RDD with one string element per line.

3. RDD Operations
Transformations: Lazy operations like map(), filter(), groupByKey(); they only describe a computation and return a new RDD.

Actions: Trigger execution and return a result to the driver, like collect().
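
The difference between lazy transformations and actions can be seen without a cluster. The sketch below is a plain-Python analogy, not Spark code: generator expressions stand in for transformations, and list() stands in for an action such as collect(). The names calls and traced() are hypothetical and exist only for this illustration.

```python
# Plain-Python analogy for lazy transformations vs. actions (not Spark code).
calls = []  # records every element that actually gets processed


def traced(x):
    # Hypothetical helper: logs the element, then doubles it.
    calls.append(x)
    return x * 2


data = [1, 2, 3, 4]

# "Transformations": building the pipeline processes nothing yet.
doubled = (traced(x) for x in data)
large = (y for y in doubled if y > 4)
processed_before_action = len(calls)

# "Action": consuming the pipeline forces the work to happen.
result = list(large)

print(processed_before_action)  # 0
print(result)                   # [6, 8]
print(calls)                    # [1, 2, 3, 4]
```

In the same way, Spark runs nothing for map() or filter() until an action asks for a result.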

4. Map & Filter
filter() keeps only the rows that satisfy a predicate; here it drops the header row.

map() transforms each line to a tuple (movieId, rating).

5. GroupByKey & MapValues
groupByKey() groups all ratings by movieId.

mapValues() applies a function to the values of each group, here computing the average rating, and leaves the keys unchanged.
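
To make the two steps concrete, here is a minimal pure-Python emulation of what groupByKey() followed by mapValues() computes on a tiny in-memory sample. It does not use Spark; average_by_key() and the sample pairs are hypothetical and only mirror the logic.

```python
from collections import defaultdict


def average_by_key(pairs):
    # Step 1 (like groupByKey): collect every value under its key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 2 (like mapValues): reduce each group to its mean, keys untouched.
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Made-up (movieId, rating) pairs for illustration.
sample = [(1, 4.0), (1, 3.0), (2, 5.0), (1, 5.0)]
print(average_by_key(sample))  # {1: 4.0, 2: 5.0}
```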

6. Data Output
collect() returns the result to the driver as a list of tuples (movieId, average_rating).

The list can be printed or exported to a file or database.
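
As one possible way to export the collected list, the sketch below writes it to a CSV file with Python's standard csv module. The function name write_averages(), the output file name, and the column header avgRating are assumptions made for this example.

```python
import csv


def write_averages(results, path):
    # results: iterable of (movie_id, avg_rating) tuples, e.g. from collect().
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["movieId", "avgRating"])
        # Sort by movie id so the file has a stable order.
        for movie_id, avg_rating in sorted(results):
            writer.writerow([movie_id, f"{avg_rating:.2f}"])


# Two sample tuples standing in for the collected results.
write_averages([(3, 3.26), (1, 3.92)], "average_ratings.csv")
```

This approach suits results small enough to fit on the driver. For larger outputs, RDD.saveAsTextFile() writes directly from the executors instead.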

🔹 PySpark Overview
PySpark is the Python API for Apache Spark, a distributed data processing engine used for big data analytics. It lets Python code process datasets too large for a single machine by distributing the work across Spark executors.

🔹 RDD (Resilient Distributed Dataset)
Fundamental data structure in Spark.

Immutable and distributed collection of objects.

Supports fault tolerance (lost partitions are recomputed from their lineage) and parallel processing.

Operations:

Transformations: e.g., map(), filter(), groupByKey()

Actions: e.g., collect(), count()

🔹 Data Parsing & Transformation
ratings.csv has four columns: userId, movieId, rating, and timestamp.
Using filter() and map(), the header is dropped and each remaining line is transformed into a key-value pair (movieId, rating).
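
Real CSV files sometimes contain blank or malformed rows. The following is a minimal sketch of a more defensive parser, assuming that such rows should simply be skipped. parse_rating() and the sample lines are hypothetical and not part of the notebook's code.

```python
HEADER = "userId,movieId,rating,timestamp"


def parse_rating(line):
    # Returns (movieId, rating), or None for the header and malformed rows.
    if line.startswith(HEADER):
        return None
    parts = line.strip().split(",")
    if len(parts) != 4:
        return None
    try:
        return (int(parts[1]), float(parts[2]))
    except ValueError:
        return None


# Made-up input lines: header, one valid row, two malformed rows.
lines = [HEADER, "1,31,2.5,1260759144", "bad,row", "1,x,4.0,1260759200"]
pairs = [p for p in map(parse_rating, lines) if p is not None]
print(pairs)  # [(31, 2.5)]
```

With an RDD, the same function could be applied as input_rdd.map(parse_rating).filter(lambda p: p is not None).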

🔹 Data Aggregation with groupByKey()
groupByKey() gathers all ratings for each movieId into one iterable per key.

This shuffles every individual rating across the network, so it can be expensive on large datasets.

🔹 Computing Average Ratings
After grouping, the average rating for each movie is calculated using:
average = sum(ratings) / len(ratings)

This gives the mean rating of each movie across all users who rated it.
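
The same average can also be computed without collecting every rating for a movie in one place. The sketch below is an alternative to the groupByKey() approach used in this notebook, not a description of it: it keeps a running (sum, count) pair per movie with aggregateByKey(). The names ZERO, add_rating(), merge_accs(), and average_ratings() are assumptions introduced here, and pair_rdd is expected to be an RDD of (movieId, rating) pairs such as mapped_rdd.

```python
from functools import reduce

ZERO = (0.0, 0)  # (running sum, running count)


def add_rating(acc, rating):
    # Fold one rating into a partition-local (sum, count) accumulator.
    return (acc[0] + rating, acc[1] + 1)


def merge_accs(a, b):
    # Combine accumulators that come from different partitions.
    return (a[0] + b[0], a[1] + b[1])


def average_ratings(pair_rdd):
    # pair_rdd: an RDD of (movieId, rating) pairs, such as mapped_rdd.
    totals = pair_rdd.aggregateByKey(ZERO, add_rating, merge_accs)
    return totals.mapValues(lambda acc: acc[0] / acc[1])


# Local check of the two helpers, simulating two partitions for one movie.
part_a = reduce(add_rating, [4.0, 3.0], ZERO)
part_b = reduce(add_rating, [5.0], ZERO)
total, count = merge_accs(part_a, part_b)
print(total / count)  # 4.0
```

Because each partition sends only one (sum, count) pair per movie through the shuffle, this variant usually moves far less data than grouping all ratings first.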

🔹 Benefits of Using Spark
Efficient processing of large-scale data.

Parallel execution across cores or cluster nodes shortens processing time on large inputs.

Simple and expressive transformations.

🔹 Use Cases in Real Life
Recommendation systems

Customer feedback analysis

Movie or product review summarization

"""