# 🧠 PySpark collect() Function
🔹 Purpose:

The collect() function in PySpark is used to retrieve all rows of a DataFrame (or RDD) from the executors (worker nodes) back to the driver program as a list of Row objects.

It's typically used for small datasets or for debugging, not for large-scale distributed data, because it brings every row into the driver's memory.

In [0]:
# Step 1: Create a sample DataFrame
data = [
    (1, "Alice", 3500),
    (2, "Bob", 4000),
    (3, "Charlie", 4500),
    (4, "David", 5000)
]

columns = ["emp_id", "name", "salary"]

df = spark.createDataFrame(data, columns)

# display() is a Databricks notebook helper; in plain PySpark use df.show()
df.display()

emp_id,name,salary
1,Alice,3500
2,Bob,4000
3,Charlie,4500
4,David,5000


In [0]:
data_collect = df.collect()

# Print every row collected to the driver
for row in data_collect:
    print(row)


Row(emp_id=1, name='Alice', salary=3500)
Row(emp_id=2, name='Bob', salary=4000)
Row(emp_id=3, name='Charlie', salary=4500)
Row(emp_id=4, name='David', salary=5000)


In [0]:
# Access individual elements
for row in data_collect:
    print(row.name, "earns", row.salary)


Alice earns 3500
Bob earns 4000
Charlie earns 4500
David earns 5000


### ⚠️ Important Notes

collect() loads every row into the driver's memory, so on a large DataFrame it can exhaust that memory and fail the job. Avoid it for large datasets.

If your DataFrame is large, use alternatives like:

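- `.take(n)` → retrieves only the first n rows, which is safe as long as n is small
- `.limit(n).toPandas()` → converts the first n rows to a pandas DataFrame on the driver
- `.show()` → prints a preview (20 rows by default) without moving all data to the driver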

### ✅ Summary
| Function    | Purpose              | Returns      | Safe for Large Data |
| ----------- | -------------------- | ------------ | ------------------- |
| `collect()` | Fetches all rows     | List of Rows | ❌ No               |
| `take(n)`   | Fetches first n rows | List of Rows | ✅ Yes (small n)    |
| `show(n)`   | Prints first n rows  | None         | ✅ Yes              |

