<a href="https://colab.research.google.com/github/tyri0n11/distributed-system/blob/main/lab6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise I

The input is a textual CSV file containing the daily PM10 value for a set of sensors. Each line of the file has the following format:
```sensorId,date,PM10 value (μg/m3)\n```

Here is an example of the data:
```
s1,2016-01-01,20.5
s2,2016-01-01,30.1
s1,2016-01-02,60.2
s2,2016-01-02,20.4
s1,2016-01-03,55.5
s2,2016-01-03,52.5
```

You're required to use PySpark to load the file, filter the values, and apply the map/reduce idea to produce the output. The output is one line per sensor on the standard output.
Each line contains a `sensorId` and the list of `dates` with a PM10 value greater than 50 for that sensor. Example output:
```
(s1, [2016-01-02, 2016-01-03])
(s2, [2016-01-03])
```
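
The filter, map, and group steps can be illustrated without Spark. The following is a minimal plain-Python sketch over the sample rows above, not the PySpark solution itself; the names `SAMPLE_LINES`, `critical_dates`, and `threshold` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict

# Sample rows copied from the example above (sensorId,date,PM10)
SAMPLE_LINES = [
    "s1,2016-01-01,20.5",
    "s2,2016-01-01,30.1",
    "s1,2016-01-02,60.2",
    "s2,2016-01-02,20.4",
    "s1,2016-01-03,55.5",
    "s2,2016-01-03,52.5",
]


def critical_dates(lines, threshold=50.0):
    """Return {sensorId: [dates]} for readings strictly above threshold."""
    grouped = defaultdict(list)
    for line in lines:
        sensor_id, date, pm10 = line.strip().split(",")
        if float(pm10) > threshold:          # filter step
            grouped[sensor_id].append(date)  # map to (sensorId, date), then group
    return dict(grouped)


if __name__ == "__main__":
    for sensor_id, dates in critical_dates(SAMPLE_LINES).items():
        print((sensor_id, dates))
```

Dates appear in input order here; in a distributed setting the order within each group is generally not guaranteed unless the lists are sorted explicitly.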



In [17]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PM10_Filter").getOrCreate()
sc = spark.sparkContext

# Load CSV as RDD (map/reduce style instead of DataFrame SQL)
rdd = sc.textFile("sensor.csv")

# Split lines by comma
# Format: (sensorId, date, pm10)
parsed_rdd = rdd.map(lambda line: line.split(","))

# Keep only rows with PM10 > 50
filtered_rdd = parsed_rdd.filter(lambda x: float(x[2]) > 50)

# Map to (sensorId, date)
mapped_rdd = filtered_rdd.map(lambda x: (x[0], x[1]))

# Reduce: collect the dates for each sensor.
# groupByKey does not guarantee order, so sort each list (ISO dates sort chronologically as strings).
result_rdd = mapped_rdd.groupByKey().mapValues(sorted)

# Print result
for row in result_rdd.collect():
    print(row)


('s1', ['2016-01-02', '2016-01-03', '2016-01-04', '2016-01-07'])
('s3', ['2016-01-05', '2016-01-06', '2016-01-08', '2016-01-09'])
('s2', ['2016-01-03', '2016-01-07', '2016-01-09'])
('s4', ['2016-01-05', '2016-01-06', '2016-01-10'])


## Exercise II

Using the same data as in Exercise I, you're required to produce the following output: the sensors ordered by their number of critical days. Each line of the output contains the number of days with a PM10 value greater than 50 for a sensor `s`, followed by the `sensorId` of `s`.

Example output:
```
2, s1
1, s2
```
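
The counting and ordering can be sketched in plain Python before moving to Spark. This is only an illustration under assumptions: the names `SAMPLE_LINES` and `critical_day_counts` are hypothetical, and breaking ties by `sensorId` is a choice made here so the order is deterministic, since the exercise does not specify one.

```python
from collections import Counter

# Sample rows copied from Exercise I (sensorId,date,PM10)
SAMPLE_LINES = [
    "s1,2016-01-01,20.5",
    "s2,2016-01-01,30.1",
    "s1,2016-01-02,60.2",
    "s2,2016-01-02,20.4",
    "s1,2016-01-03,55.5",
    "s2,2016-01-03,52.5",
]


def critical_day_counts(lines, threshold=50.0):
    """Return [(count, sensorId)] ordered by count, highest first."""
    counts = Counter()
    for line in lines:
        sensor_id, _date, pm10 = line.strip().split(",")
        if float(pm10) > threshold:
            counts[sensor_id] += 1  # same role as mapping to (sensorId, 1) and summing
    # Sort by count descending; ties are broken by sensorId (an assumption)
    pairs = [(n, sensor_id) for sensor_id, n in counts.items()]
    return sorted(pairs, key=lambda p: (-p[0], p[1]))


if __name__ == "__main__":
    for count, sensor_id in critical_day_counts(SAMPLE_LINES):
        print(f"{count}, {sensor_id}")
```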



In [18]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Exercise_2").getOrCreate()
sc = spark.sparkContext

# Load CSV as RDD (map/reduce style instead of DataFrame SQL)
rdd = sc.textFile("sensor.csv")

# Split lines by comma
# Format: (sensorId, date, pm10)
parsed_rdd = rdd.map(lambda line: line.split(","))

# Keep only rows with PM10 > 50
filtered_rdd = parsed_rdd.filter(lambda x: float(x[2]) > 50)

# Map to key-value: (sensorId, 1)
sensor_count_rdd = filtered_rdd.map(lambda x: (x[0], 1))

# Reduce by key to count how many times each sensorId appears
result_rdd = sensor_count_rdd.reduceByKey(lambda a, b: a + b)

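# Swap to (count, sensorId) and sort by count, highest first
sorted_rdd = result_rdd.map(lambda x: (x[1], x[0])).sortByKey(ascending=False)

# Print in desired format: count, sensorId
for count, sensorId in sorted_rdd.collect():
    print(f"{count}, {sensorId}")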


4, s1
4, s3
3, s2
3, s4


## Exercise III

In this exercise, the input is a CSV file containing a list of profiles:

- Header: `name,surname,age`
- Each line of the file contains the information about one user

Example input data:
```
name,surname,age
Paolo,Garza,42
Luca,Boccia,41
Maura,Bianchi,16
```

You're required to use PySpark to load and analyze the data and produce the output: a CSV file containing one line for each profile. The original `age` attribute is replaced with a new attribute called `rangeage` of type String, where `age/10` is integer division:
```
rangeage = "[" + (age/10)*10 + "-" + ((age/10)*10 + 9) + "]"
```
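
As a worked example of the formula, the sketch below computes `rangeage` in plain Python. It assumes `age` is a non-negative integer and that the division is integer division (written `//` in Python); the function name `range_age` is hypothetical.

```python
def range_age(age: int) -> str:
    """Bucket an age into its decade, e.g. 42 -> "[40-49]"."""
    lower = (age // 10) * 10  # integer division drops the last digit
    return f"[{lower}-{lower + 9}]"


if __name__ == "__main__":
    for age in (42, 41, 16):  # ages from the example input
        print(age, range_age(age))
```

For the three sample profiles this gives `[40-49]`, `[40-49]`, and `[10-19]`.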





In [30]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col,concat,lit,floor

spark = SparkSession.builder.appName("Exercise_3").getOrCreate()
sc = spark.sparkContext

# header=True takes the column names (name, surname, age) from the first line,
# so no renaming of the default _c0/_c1/_c2 columns is needed
df = spark.read.csv("person.csv", header=True)

# Convert age to integer
df = df.withColumn("age", col("age").cast("int"))

# Compute rangeage
df = df.withColumn(
    "rangeage",
    concat(
        lit("["),
        floor(col("age") / 10) * 10,
        lit("-"),
        (floor(col("age") / 10) * 10) + 9,
        lit("]")
    )
)
df.show()

+-----+--------+---+--------+
| name| surname|age|rangeage|
+-----+--------+---+--------+
|Paolo|   Garza| 42| [40-49]|
| Luca|  Boccia| 41| [40-49]|
|Maura| Bianchi| 16| [10-19]|
|Alice|   Cochi| 17| [10-19]|
|Laura|  Latini| 28| [20-29]|
|Paula| Zachini| 19| [10-19]|
|Carta|  Cianci| 29| [20-29]|
| Rita|Lisatini| 31| [30-39]|
+-----+--------+---+--------+

