# **Patients With a Condition**

## **Problem Statement**
We have a table named **Patients**.

### **Table: Patients**
This table contains hospital patient information.

| Column Name  | Type    |
|-------------|---------|
| patient_id  | int     |
| patient_name | varchar |
| conditions  | varchar |

- `patient_id` is the **primary key**.
- `conditions` contains **0 or more medical codes separated by spaces**.
- A **Type I Diabetes** code always starts with the **prefix `DIAB1`**.

---

## **Objective**
Find the `patient_id`, `patient_name`, and `conditions` for patients who have **Type I Diabetes** (i.e., their `conditions` column contains a code that starts with `"DIAB1"`).

### **Example**
#### **Input: Patients Table**
| patient_id | patient_name | conditions   |
|------------|--------------|--------------|
| 1          | Daniel       | YFEV COUGH   |
| 2          | Alice        |              |
| 3          | Bob          | DIAB100 MYOP |
| 4          | George       | ACNE DIAB100 |
| 5          | Alain        | DIAB201      |

#### **Expected Output**
| patient_id | patient_name | conditions   |
|------------|--------------|--------------|
| 3          | Bob          | DIAB100 MYOP |
| 4          | George       | ACNE DIAB100 |

- **Bob** has `DIAB100`, which starts with `"DIAB1"`, so he is included.
- **George** has `DIAB100`, which starts with `"DIAB1"`, so he is included.
- **Daniel** and **Alice** have no diabetes codes, and **Alain**'s `DIAB201` starts with `"DIAB2"`, not `"DIAB1"`, so all three are excluded.
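
The rule can be checked in plain Python before involving Spark. The sketch below is only an illustration of the matching logic under the stated table format; `has_type1_diabetes` is a hypothetical helper name, not part of the original solution.

```python
# Plain-Python sketch of the matching rule (no Spark required).
# `has_type1_diabetes` is a hypothetical helper introduced for illustration.

def has_type1_diabetes(conditions: str) -> bool:
    """Return True if any space-separated code starts with 'DIAB1'."""
    return any(code.startswith("DIAB1") for code in conditions.split())

# Sample rows copied from the example input table.
patients = [
    (1, "Daniel", "YFEV COUGH"),
    (2, "Alice", ""),
    (3, "Bob", "DIAB100 MYOP"),
    (4, "George", "ACNE DIAB100"),
    (5, "Alain", "DIAB201"),
]

# Keep only the rows whose conditions satisfy the rule.
matching = [row for row in patients if has_type1_diabetes(row[2])]
for row in matching:
    print(row)
```

Splitting on whitespace makes the "starts with" test apply to each code individually, which is the behavior the Spark solutions need to reproduce.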

---


## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create DataFrame for Patients**
3. **Filter Rows Where `conditions` Contains a Code Starting With `"DIAB1"` Using `rlike()`**
4. **Select Required Columns**
5. **Display the Result**

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("PatientsWithCondition").getOrCreate()

# Step 2: Create DataFrame
patients_data = [
    (1, "Daniel", "YFEV COUGH"),
    (2, "Alice", ""),
    (3, "Bob", "DIAB100 MYOP"),
    (4, "George", "ACNE DIAB100"),
    (5, "Alain", "DIAB201"),
]
patients_columns = ["patient_id", "patient_name", "conditions"]

patients_df = spark.createDataFrame(patients_data, patients_columns)

# Step 3: Keep rows where some space-separated code starts with "DIAB1".
# (^|\s) anchors the prefix at the start of the string or right after whitespace,
# and nothing is required after the prefix, so any suffix is accepted.
filtered_df = patients_df.filter(col("conditions").rlike(r"(^|\s)DIAB1"))

# Step 4: Select Required Columns
result_df = filtered_df.select("patient_id", "patient_name", "conditions")

# Step 5: Display Output
result_df.show()
```


```text
+----------+------------+------------+
|patient_id|patient_name|  conditions|
+----------+------------+------------+
|         3|         Bob|DIAB100 MYOP|
|         4|      George|ACNE DIAB100|
+----------+------------+------------+
```
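
How the regular expression is anchored matters on inputs beyond the sample table. The sketch below compares two candidate patterns on a few hypothetical condition strings. It uses Python's `re` module as a stand-in for the Java regex engine behind `rlike()`, on the assumption that both engines treat this simple syntax the same way; the extra strings are invented for illustration.

```python
import re

# Two candidate patterns (Python's `re` is a stand-in for rlike()'s Java regex).
DIGITS_ONLY = re.compile(r"\bDIAB1\d*\b")  # word boundary, digits-only suffix
PREFIX = re.compile(r"(^|\s)DIAB1")        # start of string or whitespace, any suffix

# Hypothetical condition strings, including edge cases not in the sample table.
samples = [
    "DIAB100 MYOP",   # code at the start
    "ACNE DIAB100",   # code after a space
    "DIAB201",        # wrong prefix
    "DIAB1A",         # prefix followed by a letter
    "ACNE+DIAB100",   # single code with "DIAB1" after punctuation
    "SADIAB100",      # "DIAB1" in the middle of a code
]

# Map each string to (digits_only_matches, prefix_matches).
results = {
    s: (bool(DIGITS_ONLY.search(s)), bool(PREFIX.search(s)))
    for s in samples
}

for s, (digits_only, prefix) in results.items():
    print(f"{s!r:16} digits_only={digits_only!s:5} prefix={prefix}")
```

The two patterns agree on the sample rows but differ on `DIAB1A` (only the prefix pattern matches) and on `ACNE+DIAB100` (only the word-boundary pattern matches), so the choice depends on how strictly "a code that starts with `DIAB1`" should be read.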



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Reuse the `patients_df` DataFrame From Approach 1**
2. **Register It as a Temporary SQL View**
3. **Run a SQL Query Using `LIKE`**
4. **Display the Result**


### **Code**

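```python
# Step 1: Reuse patients_df created in Approach 1

# Step 2: Register DataFrame as a temporary SQL view
patients_df.createOrReplaceTempView("Patients")

# Step 3: Run SQL Query
# 'DIAB1%' matches a code at the start of the string;
# '% DIAB1%' matches a code that follows a space.
sql_query = """
SELECT patient_id, patient_name, conditions
FROM Patients
WHERE conditions LIKE 'DIAB1%'
   OR conditions LIKE '% DIAB1%'
"""

result_sql = spark.sql(sql_query)

# Step 4: Display Output
result_sql.show()
```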

```text
+----------+------------+------------+
|patient_id|patient_name|  conditions|
+----------+------------+------------+
|         3|         Bob|DIAB100 MYOP|
|         4|      George|ACNE DIAB100|
+----------+------------+------------+
```
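
A bare substring match such as `LIKE '%DIAB1%'` also accepts codes that merely contain `DIAB1`, while a pair of patterns (`'DIAB1%'` and `'% DIAB1%'`) only accepts codes that start with it. The sketch below compares the two using Python's built-in `sqlite3` module as a stand-in for Spark SQL, an assumption made only so the example runs without a cluster. SQLite's `LIKE` is case-insensitive for ASCII, unlike Spark's, which does not matter for this uppercase data. Patient 6 is a hypothetical row added to expose the difference.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temporary view.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Patients ("
    "patient_id INTEGER PRIMARY KEY, patient_name TEXT, conditions TEXT)"
)
conn.executemany(
    "INSERT INTO Patients VALUES (?, ?, ?)",
    [
        (1, "Daniel", "YFEV COUGH"),
        (2, "Alice", ""),
        (3, "Bob", "DIAB100 MYOP"),
        (4, "George", "ACNE DIAB100"),
        (5, "Alain", "DIAB201"),
        (6, "Eve", "SADIAB100"),  # hypothetical row: "DIAB1" inside another code
    ],
)

# Substring match: accepts "DIAB1" anywhere in the string.
loose_query = """
SELECT patient_id FROM Patients
WHERE conditions LIKE '%DIAB1%'
ORDER BY patient_id
"""

# Prefix match: the code is first in the string or follows a space.
strict_query = """
SELECT patient_id FROM Patients
WHERE conditions LIKE 'DIAB1%' OR conditions LIKE '% DIAB1%'
ORDER BY patient_id
"""

loose_ids = [row[0] for row in conn.execute(loose_query)]
strict_ids = [row[0] for row in conn.execute(strict_query)]

print("substring match:", loose_ids)   # [3, 4, 6]
print("prefix match:   ", strict_ids)  # [3, 4]
```

Both queries return the same rows for the original five patients; they only diverge once a code such as `SADIAB100` appears.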



---

## **Summary**
| Approach  | Method                      | Filter |
|-----------|-----------------------------|--------|
| **Approach 1** | PySpark DataFrame API    | Uses **`rlike()`** with a regular expression to keep rows whose `conditions` include a code starting with `"DIAB1"` |
| **Approach 2** | SQL Query in PySpark     | Uses **`LIKE`** on a temporary view to keep rows whose `conditions` include a code starting with `"DIAB1"` |
