## Unique & Sorted Records in PySpark

### Unique records in PySpark
To extract unique records in PySpark, call `distinct()` on a DataFrame. It removes duplicate rows and returns only the unique ones.

**Example**

In [0]:
# Sample data
data = [
    (10, 'Rohish', 50000, 18),
    (11, 'Vikas', 75000, 16),
    (12, 'Nisha', 40000, 18),
    (13, 'Nidhi', 60000, 17),
    (14, 'Priya', 80000, 18),
    (15, 'Mohit', 45000, 18),
    (16, 'Rajesh', 90000, 10),
    (17, 'Raman', 55000, 16),
    (18, 'Sam', 65000, 17),
    (15, 'Mohit', 45000, 18),
    (13, 'Nidhi', 60000, 17),
    (14, 'Priya', 90000, 18),
    (18, 'Sam', 65000, 17),
]

columns = ["id", "name", "salary", "age"]

emp_df = spark.createDataFrame(data, columns)

emp_df.show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
| 10|Rohish| 50000| 18|
| 11| Vikas| 75000| 16|
| 12| Nisha| 40000| 18|
| 13| Nidhi| 60000| 17|
| 14| Priya| 80000| 18|
| 15| Mohit| 45000| 18|
| 16|Rajesh| 90000| 10|
| 17| Raman| 55000| 16|
| 18|   Sam| 65000| 17|
| 15| Mohit| 45000| 18|
| 13| Nidhi| 60000| 17|
| 14| Priya| 90000| 18|
| 18|   Sam| 65000| 17|
+---+------+------+---+



In [0]:
# without distinct()
emp_df.count()

Out[20]: 13

In [0]:
# with distinct(): count only the unique rows
# select("*") is optional here; emp_df.distinct().count() gives the same result
emp_df.select("*").distinct().count()

Out[21]: 10

In [0]:
emp_df.distinct().show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
| 10|Rohish| 50000| 18|
| 12| Nisha| 40000| 18|
| 11| Vikas| 75000| 16|
| 13| Nidhi| 60000| 17|
| 15| Mohit| 45000| 18|
| 14| Priya| 80000| 18|
| 16|Rajesh| 90000| 10|
| 17| Raman| 55000| 16|
| 18|   Sam| 65000| 17|
| 14| Priya| 90000| 18|
+---+------+------+---+



**Key Notes:**
- `distinct()` compares entire rows, so two rows that differ in even one column are both kept. For example, the two Priya rows (salary 80000 and 90000) both appear in the output above.
- To select unique rows based on specific columns, use `dropDuplicates()`.

**dropDuplicates():**

In [0]:
# Get unique records based on specific columns.
# Which of the duplicate rows is kept (e.g. Priya's 80000 vs 90000) is not guaranteed.
emp_df.dropDuplicates(["id", "name"]).show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
| 10|Rohish| 50000| 18|
| 11| Vikas| 75000| 16|
| 12| Nisha| 40000| 18|
| 13| Nidhi| 60000| 17|
| 15| Mohit| 45000| 18|
| 14| Priya| 80000| 18|
| 17| Raman| 55000| 16|
| 16|Rajesh| 90000| 10|
| 18|   Sam| 65000| 17|
+---+------+------+---+



### Sorting In PySpark
- In PySpark, you can sort a DataFrame using the `sort()` or `orderBy()` methods. 
- Both methods work similarly, allowing you to sort the data either in ascending or descending order, and by one or multiple columns.

**Syntax for Sorting:**
- Ascending order (default): `df.sort("column_name")`
- Descending order: `df.sort(col("column_name").desc())`
- Multiple columns: `df.sort("col1", "col2")`

In [0]:
# sort
from pyspark.sql.functions import col

emp_df.sort(col("id")).show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
| 10|Rohish| 50000| 18|
| 11| Vikas| 75000| 16|
| 12| Nisha| 40000| 18|
| 13| Nidhi| 60000| 17|
| 13| Nidhi| 60000| 17|
| 14| Priya| 90000| 18|
| 14| Priya| 80000| 18|
| 15| Mohit| 45000| 18|
| 15| Mohit| 45000| 18|
| 16|Rajesh| 90000| 10|
| 17| Raman| 55000| 16|
| 18|   Sam| 65000| 17|
| 18|   Sam| 65000| 17|
+---+------+------+---+



In [0]:
# orderBy
emp_df.orderBy(col("id")).show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
| 10|Rohish| 50000| 18|
| 11| Vikas| 75000| 16|
| 12| Nisha| 40000| 18|
| 13| Nidhi| 60000| 17|
| 13| Nidhi| 60000| 17|
| 14| Priya| 90000| 18|
| 14| Priya| 80000| 18|
| 15| Mohit| 45000| 18|
| 15| Mohit| 45000| 18|
| 16|Rajesh| 90000| 10|
| 17| Raman| 55000| 16|
| 18|   Sam| 65000| 17|
| 18|   Sam| 65000| 17|
+---+------+------+---+



**`orderBy()` is an alias for `sort()`: the two can be used interchangeably and perform identically.**

**Examples:**


**Sorting in Ascending Order**

In [0]:
emp_df.sort(col("name").asc()).show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
| 15| Mohit| 45000| 18|
| 15| Mohit| 45000| 18|
| 13| Nidhi| 60000| 17|
| 13| Nidhi| 60000| 17|
| 12| Nisha| 40000| 18|
| 14| Priya| 90000| 18|
| 14| Priya| 80000| 18|
| 16|Rajesh| 90000| 10|
| 17| Raman| 55000| 16|
| 10|Rohish| 50000| 18|
| 18|   Sam| 65000| 17|
| 18|   Sam| 65000| 17|
| 11| Vikas| 75000| 16|
+---+------+------+---+



**Sorting in Descending Order**

In [0]:
emp_df.sort(col("name").desc()).show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
| 11| Vikas| 75000| 16|
| 18|   Sam| 65000| 17|
| 18|   Sam| 65000| 17|
| 10|Rohish| 50000| 18|
| 17| Raman| 55000| 16|
| 16|Rajesh| 90000| 10|
| 14| Priya| 80000| 18|
| 14| Priya| 90000| 18|
| 12| Nisha| 40000| 18|
| 13| Nidhi| 60000| 17|
| 13| Nidhi| 60000| 17|
| 15| Mohit| 45000| 18|
| 15| Mohit| 45000| 18|
+---+------+------+---+



**Sorting by Multiple Columns**

In [0]:
# First by name (Ascending), then by salary (Descending):
emp_df.sort(col("name").asc(), col("salary").desc()).show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
| 15| Mohit| 45000| 18|
| 15| Mohit| 45000| 18|
| 13| Nidhi| 60000| 17|
| 13| Nidhi| 60000| 17|
| 12| Nisha| 40000| 18|
| 14| Priya| 90000| 18|
| 14| Priya| 80000| 18|
| 16|Rajesh| 90000| 10|
| 17| Raman| 55000| 16|
| 10|Rohish| 50000| 18|
| 18|   Sam| 65000| 17|
| 18|   Sam| 65000| 17|
| 11| Vikas| 75000| 16|
+---+------+------+---+

