# Lab 6.1 — Summarizing the Healthcare Survey with PySpark

This Colab notebook implements the required transformation steps using **PySpark** (no `melt`).  
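It reads the `_v2` files (dots replaced by underscores), reshapes the responses to long format with `stack`, joins the reverse-coding metadata, recodes the responses (flipping the scale for reverse-coded items), averages the recoded values per respondent and feature, and writes the result to `health_survey_summary_pyspark.csv` in the `data/` folder. Spark creates that path as a folder of part files, not a single CSV.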

**Files included in `/content/data`:**
- `health_survey_v2.csv`: survey responses
- `ReverseCodingItems_v2.csv`: reverse-coding metadata
- `health_survey_summary.csv`: pre-built summary for verification (the notebook writes its own result to `health_survey_summary_pyspark.csv`)

Follow the instructions and run the cells in order. Screenshot placeholders are provided where your instructor requests them.


In [None]:
# Install and start PySpark if running in Colab (uncomment when in Colab)
# !apt-get install -y openjdk-11-jdk-headless -qq > /dev/null
# !pip install -q pyspark
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.master("local[*]").appName("Lab6_1").getOrCreate()

# Files are expected under /content/data
import os
print('Files in /content/data:')
print(os.listdir('/content/data'))
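
`os.listdir` raises `FileNotFoundError` when `/content/data` does not exist, which is easy to hit on a fresh Colab runtime before the files are uploaded. A minimal sketch of a friendlier check follows. It assumes only the two input file names listed in the introduction; the helper name `missing_inputs` is hypothetical and not part of the lab.

```python
import os

# Input files named in this notebook's introduction.
REQUIRED_FILES = ["health_survey_v2.csv", "ReverseCodingItems_v2.csv"]


def missing_inputs(data_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in data_dir."""
    if not os.path.isdir(data_dir):
        # Nothing can be present if the folder itself is absent.
        return list(required)
    present = set(os.listdir(data_dir))
    return [name for name in required if name not in present]


if __name__ == "__main__":
    missing = missing_inputs("/content/data")
    print("Missing input files:", missing or "none")
```

If the list is not empty, upload the missing files before running the Spark cells.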


In [None]:
# === PySpark code (run this in Colab after installing pyspark) ===
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, col, when, regexp_extract

spark = SparkSession.builder.master('local[*]').appName('Lab6_1').getOrCreate()

survey = spark.read.csv('/content/data/health_survey_v2.csv', header=True, inferSchema=True)
reverse = spark.read.csv('/content/data/ReverseCodingItems_v2.csv', header=True, inferSchema=True)

# show top rows
survey.show(5, truncate=False)
reverse.show(5, truncate=False)


In [None]:
# 1) Identify question columns (all except ID)
question_cols = [c for c in survey.columns if c != 'ID']
print('Question columns (count={}):'.format(len(question_cols)), question_cols)

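# 2) Build the stack() expression and reshape from wide to long.
# stack(n, 'name1', `name1`, ...) emits one (Question, Response) row per question column.
# All question columns must share one data type (strings here), or Spark rejects the expression.
pairs = ", ".join(f"'{c}', `{c}`" for c in question_cols)
stack_expr = f"stack({len(question_cols)}, {pairs})"
long_df = survey.select(col('ID'), expr(stack_expr).alias('Question', 'Response'))
long_df.show(10, truncate=False)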
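
The `stack` expression is plain string building, so it can be checked without a Spark session. The sketch below is one possible helper, not the lab's required code: the name `build_stack_expr` and the sample column names are hypothetical, and it rejects names containing quotes or backticks instead of trying to escape them.

```python
def build_stack_expr(columns):
    """Build a Spark SQL stack() expression that unpivots `columns`.

    Each column contributes a ('name', `name`) pair: a string literal for the
    Question value and a backtick-quoted reference for the Response value.
    """
    if not columns:
        raise ValueError("need at least one column to stack")
    for name in columns:
        # Quotes or backticks inside a name would break the generated SQL.
        if "'" in name or "`" in name:
            raise ValueError(f"unsupported character in column name: {name!r}")
    pairs = ", ".join(f"'{c}', `{c}`" for c in columns)
    return f"stack({len(columns)}, {pairs}) as (Question, Response)"


# Hypothetical column names, for illustration only.
print(build_stack_expr(["F1_Q1", "F1_Q2"]))
# stack(2, 'F1_Q1', `F1_Q1`, 'F1_Q2', `F1_Q2`) as (Question, Response)
```

Because the aliases are part of the string, the result could be passed to `survey.selectExpr('ID', build_stack_expr(question_cols))` as an alternative to `expr(...).alias(...)`.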


In [None]:
# 3) Join with reverse coding info
reverse_small = reverse.select(col('Column Name').alias('Question'), col('Needs Reverse Coding?').alias('NeedsReverse'))
joined = long_df.join(reverse_small, on='Question', how='left').fillna({'NeedsReverse':'No'})

# 4) Create normal and reverse coded values using when()
coded = joined.withColumn('Temp_Normal',
    when(col('Response')=='Strongly Disagree', 1)
    .when(col('Response')=='Somewhat Disagree', 2)
    .when(col('Response')=='Neither Agree nor Disagree', 3)
    .when(col('Response')=='Somewhat Agree', 4)
    .when(col('Response')=='Strongly Agree', 5)
    .otherwise(None)
).withColumn('Temp_Reverse',
    when(col('Response')=='Strongly Disagree', 5)
    .when(col('Response')=='Somewhat Disagree', 4)
    .when(col('Response')=='Neither Agree nor Disagree', 3)
    .when(col('Response')=='Somewhat Agree', 2)
    .when(col('Response')=='Strongly Agree', 1)
    .otherwise(None)
)

# 5) Final recoded value based on NeedsReverse
coded = coded.withColumn('RecodedValue', when(col('NeedsReverse')=='Yes', col('Temp_Reverse')).otherwise(col('Temp_Normal')))

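# 6) Extract the Feature prefix (F1..F6) from the question name.
# Names that do not start with F<digits> yield '' and are left out by the pivot in step 7.
coded = coded.withColumn('Feature', regexp_extract(col('Question'), r'^(F\d+)', 1))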

coded.show(20, truncate=False)
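
The two `when()` chains in step 4 encode the same five labels twice. On a 1 to 5 scale the reversed score equals 6 minus the normal score, so the recoding rule can be checked by hand in plain Python. This is an illustrative sketch only; `NORMAL_CODES` and `recode` are hypothetical names, not part of the notebook.

```python
# Forward coding for the five Likert labels used in step 4.
NORMAL_CODES = {
    "Strongly Disagree": 1,
    "Somewhat Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Somewhat Agree": 4,
    "Strongly Agree": 5,
}


def recode(response, needs_reverse):
    """Return the 1-5 score for a response, or None for an unknown label."""
    score = NORMAL_CODES.get(response)
    if score is None:
        return None  # mirrors .otherwise(None) in the when() chain
    # On a 1-5 scale, reversing is the same as subtracting from 6.
    return 6 - score if needs_reverse == "Yes" else score


print(recode("Somewhat Agree", "No"))   # 4
print(recode("Somewhat Agree", "Yes"))  # 2
print(recode("N/A", "Yes"))             # None
```

The same identity suggests that `Temp_Reverse` could be written in Spark as `6 - col('Temp_Normal')`, though the explicit chains above are easier to show in a screenshot.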


In [None]:
# 7) Aggregate mean RecodedValue per ID per Feature and pivot wide
summary = coded.groupBy('ID','Feature').avg('RecodedValue').withColumnRenamed('avg(RecodedValue)','MeanScore')
final = summary.groupBy('ID').pivot('Feature', ['F1','F2','F3','F4','F5','F6']).agg({'MeanScore':'first'}).orderBy('ID')

final.show(20, truncate=False)

# 8) Write result to CSV
final.coalesce(1).write.csv('/content/data/health_survey_summary_pyspark.csv', header=True, mode='overwrite')
print('Written /content/data/health_survey_summary_pyspark.csv (folder with part files)')
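
Because `write.csv` produces a folder, the actual data sits in a single `part-*.csv` file inside it (single because of `coalesce(1)`). If one ordinary CSV file is needed, a sketch like the following could copy that part file out. The helper name `export_single_csv` and the target file name in the usage note are assumptions for illustration.

```python
import glob
import os
import shutil


def export_single_csv(spark_output_dir, target_path):
    """Copy the lone part-*.csv from a Spark output folder to target_path."""
    parts = sorted(glob.glob(os.path.join(spark_output_dir, "part-*.csv")))
    if len(parts) != 1:
        # More than one part file means the DataFrame was not coalesced to 1.
        raise RuntimeError(f"expected exactly one part file, found {len(parts)}")
    shutil.copyfile(parts[0], target_path)
    return target_path
```

For example, `export_single_csv('/content/data/health_survey_summary_pyspark.csv', '/content/data/summary_single.csv')` would leave a plain CSV next to the Spark folder; `summary_single.csv` is a made-up name chosen so the pre-built `health_survey_summary.csv` is not overwritten.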


## Screenshots to include in your worksheet

1. **Loading data** — screenshot the cell output that lists files and `survey.show()`.
2. **Stack using `stack()`** — screenshot the long format `long_df.show()` output.
3. **Joined dataset** — screenshot `joined.show()` or `coded.show()` showing NeedsReverse and Temp columns.
4. **Coded columns** — screenshot the `RecodedValue` column.
5. **Aggregated final result** — screenshot the `final.show()` output.

Place each screenshot in your worksheet as required by the assignment.


### Notes
- This notebook is written for Colab. Uncomment the pip/apt install lines at the top to install pyspark before running.
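- The small CSV `health_survey_summary.csv` was produced separately with pandas, not by this notebook, and saved to `/mnt/data/data/health_survey_summary.csv` for quick download and verification. The PySpark result is written to `/content/data/health_survey_summary_pyspark.csv`.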
