# Ex - GroupBy

### Introduction:

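GroupBy follows the split-apply-combine pattern: split the rows into groups by a key, apply an aggregation to each group, and combine the per-group results into a single DataFrame.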

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)  
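
To make the three phases concrete before bringing in Spark, here is a minimal plain-Python sketch. The `rows` list is hypothetical toy data, not taken from the dataset; it only mirrors the idea of grouping beer servings by continent.

```python
from collections import defaultdict

# Hypothetical toy rows: (continent, beer_servings)
rows = [("EU", 200), ("EU", 100), ("AS", 30), ("AS", 50), ("AS", 10)]

# Split: bucket the values by their group key.
groups = defaultdict(list)
for continent, beer in rows:
    groups[continent].append(beer)

# Apply: reduce each bucket to one value (here, the mean).
# Combine: collect the per-group results into a single mapping.
means = {continent: sum(vals) / len(vals) for continent, vals in groups.items()}

print(means)  # {'EU': 150.0, 'AS': 30.0}
```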
### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session, or reuse one that is already running.
spark = (
    SparkSession.builder
    .appName("MyApp")
    .getOrCreate()
)

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv).

In [2]:
!wget -O drinks.csv https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv

--2025-12-02 11:55:36--  https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4973 (4.9K) [text/plain]
Saving to: ‘drinks.csv’


2025-12-02 11:55:36 (28.8 MB/s) - ‘drinks.csv’ saved [4973/4973]



### Step 3. Assign it to a variable called drinks.

In [6]:
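# header=True takes column names from the first row;
# inferSchema=True lets Spark detect each column's type from the data.
drinks = spark.read.csv('drinks.csv', header=True, inferSchema=True)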

In [7]:
drinks.printSchema()

root
 |-- country: string (nullable = true)
 |-- beer_servings: integer (nullable = true)
 |-- spirit_servings: integer (nullable = true)
 |-- wine_servings: integer (nullable = true)
 |-- total_litres_of_pure_alcohol: double (nullable = true)
 |-- continent: string (nullable = true)



### Step 4. Which continent drinks more beer on average?

In [12]:
beer_stats = (
    drinks
    .groupBy('continent')
    .agg(
        F.count("*").alias('count'),
        F.sum('beer_servings').alias('beer_servings_sum')
    ).select(
        F.col('continent'),
        (F.col('beer_servings_sum') / F.col('count')).alias('average')
    )
)

beer_stats.show()

+---------+------------------+
|continent|           average|
+---------+------------------+
|       NA|145.43478260869566|
|       SA|175.08333333333334|
|       AS| 37.04545454545455|
|       OC|           89.6875|
|       EU|193.77777777777777|
|       AF|61.471698113207545|
+---------+------------------+



### Step 5. For each continent print the statistics for wine consumption.

In [15]:
wine_stats = (
    drinks.groupBy('continent')
    .agg(
        F.count("*").alias('count'),
        F.mean('wine_servings').alias('average'),
        F.stddev('wine_servings').alias('std_wine'),
        F.min('wine_servings').alias('min_wine'),
        F.expr('percentile(wine_servings, 0.5)').alias('median_wine'),
        F.max('wine_servings').alias('max_wine')
    )
)

wine_stats.show()

+---------+-----+------------------+------------------+--------+-----------+--------+
|continent|count|           average|          std_wine|min_wine|median_wine|max_wine|
+---------+-----+------------------+------------------+--------+-----------+--------+
|       NA|   23| 24.52173913043478|28.266378301658847|       1|       11.0|     100|
|       SA|   12|62.416666666666664| 88.62018888937148|       1|       12.0|     221|
|       AS|   44| 9.068181818181818|21.667033931944484|       0|        1.0|     123|
|       OC|   16|            35.625| 64.55578982554547|       0|        8.5|     212|
|       EU|   45|142.22222222222223| 97.42173756146497|       0|      128.0|     370|
|       AF|   53|16.264150943396228| 38.84641897335842|       0|        2.0|     233|
+---------+-----+------------------+------------------+--------+-----------+--------+



### Step 6. Print the mean alcohol consumption per continent for every column

In [18]:
# Average every column except the two string columns.
numeric_cols = [c for c in drinks.columns if c not in ['continent', 'country']]

drinks_means = (
    drinks.groupBy('continent')
    .agg(*[F.mean(c).alias(f"avg_{c}") for c in numeric_cols])
)

drinks_means.show()

+---------+------------------+-------------------+------------------+--------------------------------+
|continent| avg_beer_servings|avg_spirit_servings| avg_wine_servings|avg_total_litres_of_pure_alcohol|
+---------+------------------+-------------------+------------------+--------------------------------+
|       NA|145.43478260869566|  165.7391304347826| 24.52173913043478|               5.995652173913044|
|       SA|175.08333333333334|             114.75|62.416666666666664|               6.308333333333334|
|       AS| 37.04545454545455|  60.84090909090909| 9.068181818181818|              2.1704545454545454|
|       OC|           89.6875|            58.4375|            35.625|              3.3812500000000005|
|       EU|193.77777777777777| 132.55555555555554|142.22222222222223|               8.617777777777777|
|       AF|61.471698113207545| 16.339622641509433|16.264150943396228|                3.00754716981132|
+---------+------------------+-------------------+------------------+--------------------------------+

### Step 7. Print the median alcohol consumption per continent for every column

In [19]:
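# F.mode returns the most frequent value in each group, not the median,
# so the table below lists per-continent modes under "avg_" column names.
# For the median this step asks for, the percentile(<column>, 0.5)
# expression from Step 5 can be used in place of F.mode.
drinks_means = (
    drinks.groupBy('continent')
    .agg(*[F.mode(c).alias(f"avg_{c}")
    for c in drinks.columns
    if c not in ['continent','country']])
)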

drinks_means.show()

+---------+-----------------+-------------------+-----------------+--------------------------------+
|continent|avg_beer_servings|avg_spirit_servings|avg_wine_servings|avg_total_litres_of_pure_alcohol|
+---------+-----------------+-------------------+-----------------+--------------------------------+
|       NA|               52|                 69|                2|                             6.3|
|       SA|              159|                100|                3|                             4.2|
|       AS|                0|                  0|                0|                             0.0|
|       OC|               21|                  0|                1|                             1.0|
|       EU|              224|                100|               56|                            11.4|
|       AF|               25|                  0|                1|                             1.8|
+---------+-----------------+-------------------+-----------------+--------------------------------+

### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

In [26]:
res = (
    drinks.select(F.mean('spirit_servings').alias('mean_spirit'),
                  F.max('spirit_servings').alias('max_spirit'),
                  F.min('spirit_servings').alias('min_spirit'))
)

res.show()


+-----------------+----------+----------+
|      mean_spirit|max_spirit|min_spirit|
+-----------------+----------+----------+
|80.99481865284974|       438|         0|
+-----------------+----------+----------+

