# **Number of Calls Between Two Persons**

## **Problem Statement**
You are given a table **Calls** with the following structure:

### **Table: Calls**
| Column Name | Type |
|-------------|------|
| `from_id`   | int  |
| `to_id`     | int  |
| `duration`  | int  |

- This table does **not have a primary key** and **may contain duplicates**.
- Each row records the duration of one phone call from `from_id` to `to_id`.
- `from_id` is **not equal** to `to_id`.

### **Objective**
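Write a query to **report the number of calls** (`call_count`) and **the total call duration** (`total_duration`) between each **pair of distinct persons** `(person1, person2)` where `person1 < person2`. Calls in either direction between the same two persons count toward the same pair.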

---

## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create a DataFrame for `Calls` Table**
3. **Normalize the Data** (Ensure `person1 < person2`)
4. **Group by (person1, person2) and Aggregate**
5. **Select and Display the Result**

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in sum()

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("NumberOfCalls").getOrCreate()

# Step 2: Create DataFrame for Calls Table
calls_data = [
    (1, 2, 59),
    (2, 1, 11),
    (1, 3, 20),
    (3, 4, 100),
    (3, 4, 200),
    (3, 4, 200),
    (4, 3, 499),
]
calls_columns = ["from_id", "to_id", "duration"]

calls_df = spark.createDataFrame(calls_data, calls_columns)

# Step 3: Normalize Data (Ensure person1 < person2)
normalized_df = (
    calls_df
    .withColumn("person1", F.least(F.col("from_id"), F.col("to_id")))
    .withColumn("person2", F.greatest(F.col("from_id"), F.col("to_id")))
)

# Step 4: Group by (person1, person2) and Aggregate
result_df = (
    normalized_df
    .groupBy("person1", "person2")
    .agg(
        F.count("*").alias("call_count"),
        F.sum("duration").alias("total_duration"),
    )
)

# Step 5: Display Result (sorted, because groupBy does not guarantee row order)
result_df.orderBy("person1", "person2").show()
```

```
+-------+-------+----------+--------------+
|person1|person2|call_count|total_duration|
+-------+-------+----------+--------------+
|      1|      2|         2|            70|
|      1|      3|         1|            20|
|      3|      4|         4|           999|
+-------+-------+----------+--------------+
```
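
As a sanity check on the result table above, the same normalize-then-aggregate logic can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: `summarize_calls` is a hypothetical helper name that does not appear in the notebook, and the rows are copied from `calls_data`.

```python
from collections import defaultdict

# Same sample rows as calls_data: (from_id, to_id, duration)
calls = [
    (1, 2, 59),
    (2, 1, 11),
    (1, 3, 20),
    (3, 4, 100),
    (3, 4, 200),
    (3, 4, 200),
    (4, 3, 499),
]


def summarize_calls(rows):
    """Hypothetical helper: return sorted (person1, person2, call_count, total_duration) tuples."""
    totals = defaultdict(lambda: [0, 0])  # pair -> [call_count, total_duration]
    for from_id, to_id, duration in rows:
        # Normalize the pair so the smaller id always comes first,
        # mirroring least()/greatest() in the Spark code.
        pair = (min(from_id, to_id), max(from_id, to_id))
        totals[pair][0] += 1          # duplicates are counted, not dropped
        totals[pair][1] += duration
    return [(p1, p2, n, d) for (p1, p2), (n, d) in sorted(totals.items())]


for row in summarize_calls(calls):
    print(row)
```

Because the pair is normalized before grouping, the rows `(1, 2, 59)` and `(2, 1, 11)` land in the same bucket, which is why the pair `(1, 2)` shows two calls totalling 70.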



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Create Spark Session** (reused from Approach 1)
2. **Create DataFrame for `Calls` Table** (reused from Approach 1)
3. **Register it as a SQL View**
4. **Write and Execute SQL Query**
5. **Display the Output**


### **Code**


```python
# Steps 1-2: Reuse `spark` and `calls_df` from Approach 1

# Step 3: Register DataFrame as a SQL View
calls_df.createOrReplaceTempView("Calls")

# Step 4: Write and Execute SQL Query
# No trailing semicolon: some Spark versions reject it inside spark.sql()
sql_query = """
SELECT
    LEAST(from_id, to_id) AS person1,
    GREATEST(from_id, to_id) AS person2,
    COUNT(*) AS call_count,
    SUM(duration) AS total_duration
FROM Calls
GROUP BY person1, person2
ORDER BY person1, person2
"""

result_sql = spark.sql(sql_query)

# Step 5: Display the Output
result_sql.show()
```

```
+-------+-------+----------+--------------+
|person1|person2|call_count|total_duration|
+-------+-------+----------+--------------+
|      1|      2|         2|            70|
|      1|      3|         1|            20|
|      3|      4|         4|           999|
+-------+-------+----------+--------------+
```
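
The query can also be cross-checked without a Spark cluster. The sketch below is an assumption-laden illustration rather than part of the notebook: it uses Python's built-in `sqlite3` module, where the multi-argument scalar functions `MIN(a, b)` and `MAX(a, b)` stand in for `LEAST` and `GREATEST`, and it normalizes the pair in a subquery (aliased `normalized`, a name chosen here) so the outer `GROUP BY` refers to plain column names.

```python
import sqlite3

# In-memory database holding the same sample rows as the Calls view
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Calls (from_id INTEGER, to_id INTEGER, duration INTEGER)")
conn.executemany(
    "INSERT INTO Calls VALUES (?, ?, ?)",
    [
        (1, 2, 59),
        (2, 1, 11),
        (1, 3, 20),
        (3, 4, 100),
        (3, 4, 200),
        (3, 4, 200),
        (4, 3, 499),
    ],
)

# SQLite's two-argument MIN/MAX are scalar functions, playing the role
# of LEAST/GREATEST in the Spark SQL version.
query = """
SELECT person1, person2, COUNT(*) AS call_count, SUM(duration) AS total_duration
FROM (
    SELECT MIN(from_id, to_id) AS person1,
           MAX(from_id, to_id) AS person2,
           duration
    FROM Calls
) AS normalized
GROUP BY person1, person2
ORDER BY person1, person2
"""

rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

The rows printed should match the three pairs in the Spark output, which gives a quick way to validate the grouping logic on a laptop before running it on a cluster.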



---

## **Summary**
| Approach  | Method                      | Key constructs |
|-----------|-----------------------------|----------------|
| **Approach 1** | PySpark DataFrame API    | `least()`, `greatest()`, `.withColumn()`, `.groupBy()`, `.agg()` |
| **Approach 2** | SQL Query in PySpark     | `LEAST()`, `GREATEST()`, `GROUP BY`, `COUNT(*)`, `SUM()` |
