<a href="https://colab.research.google.com/github/saurater/ciencia_de_dados_pyspark/blob/main/PySpark_Tutorial_Part_5_Dataset_Grouping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark - Tutorial - Part 5 - Dataset Aggregations and Grouping
Notebook by Sam Faraday
June 2022

1. Sum
2. Mean
3. Min
4. Max
5. Count





Sources:

Free Code Camp: PySpark Tutorial at https://www.youtube.com/watch?v=_C8kWso4ne4

Apache Spark API Reference at https://spark.apache.org/docs/latest/api/python/reference/index.html

# 1. Installing PySpark

In [18]:
pip install pyspark  # run this every time you connect to a new Google Colab runtime

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# 2. Importing the required libraries

In [19]:
from pyspark.sql.functions import col,isnan,when,count

In [20]:
import pandas as pd

In [21]:
import numpy as np

# 3. Creating the Test5 Dataset

In [22]:
# Build the sample data as a dictionary of columns
data = {
    'Index': [1, 2, 3, 4, 5, 6, 7],
    'Name': ['Tom', 'Nick', 'Krish', 'Paul', 'Jack', 'John', 'Sam'],
    'Department': ['IOT', 'Big Data', 'IOT', 'Big Data', 'Big Data', 'Data Science', 'Data Science'],
    'Age': [20, 21, 20, 19, 18, 19, 25],
    'Salary': [2000, 3000, 3500, 4000, 3000, 3500, 2800],
}
# Create the pandas DataFrame
df = pd.DataFrame(data)
df

   Index   Name    Department  Age  Salary
0      1    Tom           IOT   20    2000
1      2   Nick      Big Data   21    3000
2      3  Krish           IOT   20    3500
3      4   Paul      Big Data   19    4000
4      5   Jack      Big Data   18    3000
5      6   John  Data Science   19    3500
6      7    Sam  Data Science   25    2800


# 4. Saving the Dataset

In [23]:
df.to_csv('test5.csv', index=False)

# 5. Initializing PySpark


In [24]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Agg").getOrCreate()

spark

# 6. Reading the Dataset

In [25]:
# header=True uses the first row as column names; inferSchema=True detects column types
df_spark = spark.read.csv("test5.csv", header=True, inferSchema=True)
df_spark.show()

+-----+-----+------------+---+------+
|Index| Name|  Department|Age|Salary|
+-----+-----+------------+---+------+
|    1|  Tom|         IOT| 20|  2000|
|    2| Nick|    Big Data| 21|  3000|
|    3|Krish|         IOT| 20|  3500|
|    4| Paul|    Big Data| 19|  4000|
|    5| Jack|    Big Data| 18|  3000|
|    6| John|Data Science| 19|  3500|
|    7|  Sam|Data Science| 25|  2800|
+-----+-----+------------+---+------+



# 7. Checking the Schema and Summary Statistics

In [26]:
df_spark.printSchema()

root
 |-- Index: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [27]:
df_spark.summary().show()

+-------+-----------------+----+----------+------------------+-----------------+
|summary|            Index|Name|Department|               Age|           Salary|
+-------+-----------------+----+----------+------------------+-----------------+
|  count|                7|   7|         7|                 7|                7|
|   mean|              4.0|null|      null|20.285714285714285|3114.285714285714|
| stddev|2.160246899469287|null|      null|2.2886885410853175|638.8233230676836|
|    min|                1|Jack|  Big Data|                18|             2000|
|    25%|                2|null|      null|                19|             2800|
|    50%|                4|null|      null|                20|             3000|
|    75%|                6|null|      null|                21|             3500|
|    max|                7| Tom|       IOT|                25|             4000|
+-------+-----------------+----+----------+------------------+-----------------+



# 8. Sum of Salary Grouped by Department


In [43]:
df_spark.groupBy('Department').sum('Salary').show()

+------------+-----------+
|  Department|sum(Salary)|
+------------+-----------+
|         IOT|       5500|
|    Big Data|      10000|
|Data Science|       6300|
+------------+-----------+
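To make explicit what the grouped sum computes, here is a plain-Python sketch of the same calculation, without Spark. It only illustrates the logic (bucket the rows by department, then add up the salaries in each bucket) and is not how Spark executes the query internally. The names `records` and `totals` are introduced for this example.

```python
from collections import defaultdict

# (Department, Salary) pairs copied from the dataset above.
records = [
    ("IOT", 2000),
    ("Big Data", 3000),
    ("IOT", 3500),
    ("Big Data", 4000),
    ("Big Data", 3000),
    ("Data Science", 3500),
    ("Data Science", 2800),
]

# Accumulate one running total per department.
totals = defaultdict(int)
for department, salary in records:
    totals[department] += salary

print(dict(totals))
# {'IOT': 5500, 'Big Data': 10000, 'Data Science': 6300}
```

The totals match the `sum(Salary)` column in the Spark output above.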



# 9. Mean, Min, Max, Count Grouped by Department

In [47]:
df_spark.groupBy('Department').mean('Salary').show()

+------------+------------------+
|  Department|       avg(Salary)|
+------------+------------------+
|         IOT|            2750.0|
|    Big Data|3333.3333333333335|
|Data Science|            3150.0|
+------------+------------------+



In [49]:
df_spark.groupBy('Department').min('Salary').show()

+------------+-----------+
|  Department|min(Salary)|
+------------+-----------+
|         IOT|       2000|
|    Big Data|       3000|
|Data Science|       2800|
+------------+-----------+



In [48]:
df_spark.groupBy('Department').max('Salary').show()

+------------+-----------+
|  Department|max(Salary)|
+------------+-----------+
|         IOT|       3500|
|    Big Data|       4000|
|Data Science|       3500|
+------------+-----------+



In [50]:
df_spark.groupBy('Department').count().show()

+------------+-----+
|  Department|count|
+------------+-----+
|         IOT|    2|
|    Big Data|    3|
|Data Science|    2|
+------------+-----+



# 10. Ungrouped Sum, Mean, Min, Max, Count of Salary

In [53]:
df_spark.agg({'Salary': 'sum'}).show()

+-----------+
|sum(Salary)|
+-----------+
|      21800|
+-----------+



In [55]:
df_spark.agg({'Salary': 'mean'}).show()

+-----------------+
|      avg(Salary)|
+-----------------+
|3114.285714285714|
+-----------------+



In [56]:
df_spark.agg({'Salary': 'min'}).show()

+-----------+
|min(Salary)|
+-----------+
|       2000|
+-----------+



In [60]:
df_spark.agg({'Salary': 'max'}).show()

+-----------+
|max(Salary)|
+-----------+
|       4000|
+-----------+



In [59]:
df_spark.agg({'Salary': 'count'}).show()

+-------------+
|count(Salary)|
+-------------+
|            7|
+-------------+

