![spark logo](https://datascientest.com/en/files/2024/02/pyspark.webp)

# <b><center>PySpark Tutorial</center></b>

### What will we cover?
* PySpark DataFrame
* Reading Datasets
* Checking Schema Datatypes
* Selecting & Indexing Columns
* PySpark Describe Alternative
* Adding Columns
* Dropping Columns

## Part 1: PySpark DataFrames - Reading & Schema Datatypes

Install Notebook Dependencies

In [1]:
!pip install pyspark python-dotenv pandas



Import libraries

In [2]:
import pyspark
import dotenv
import pandas as pd

dotenv.load_dotenv()


True

First, let's load the CSV as a regular pandas DataFrame

In [3]:
df = pd.read_csv('data/gym_members_exercise_tracking.csv')
df.head()


| | Age | Gender | Weight (kg) | Height (m) | Max_BPM | Avg_BPM | Resting_BPM | Session_Duration (hours) | Calories_Burned | Workout_Type | Fat_Percentage | Water_Intake (liters) | Workout_Frequency (days/week) | Experience_Level | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Male | 88.3 | 1.71 | 180 | 157 | 60 | 1.69 | 1313.0 | Yoga | 12.6 | 3.5 | 4 | 3 | 30.2 |
| 1 | 46 | Female | 74.9 | 1.53 | 179 | 151 | 66 | 1.3 | 883.0 | HIIT | 33.9 | 2.1 | 4 | 2 | 32.0 |
| 2 | 32 | Female | 68.1 | 1.66 | 167 | 122 | 54 | 1.11 | 677.0 | Cardio | 33.4 | 2.3 | 4 | 2 | 24.71 |
| 3 | 25 | Male | 53.2 | 1.7 | 190 | 164 | 56 | 0.59 | 532.0 | Strength | 28.8 | 2.1 | 3 | 1 | 18.41 |
| 4 | 38 | Male | 46.1 | 1.79 | 188 | 158 | 68 | 0.64 | 556.0 | Strength | 29.2 | 2.8 | 3 | 1 | 14.39 |


In [4]:
from pyspark.sql import SparkSession

In [5]:
# spark = SparkSession.builder.appName('Practice').getOrCreate()
spark = SparkSession.builder \
    .config('spark.sql.debug.maxToStringFields', 2000) \
    .getOrCreate()

24/10/26 10:16:15 WARN Utils: Your hostname, Roys-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.0.0.11 instead (on interface en0)
24/10/26 10:16:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/26 10:16:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/10/26 10:16:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [6]:
spark

In [7]:
# read the dataset
df_pyspark = spark.read.csv('data/gym_members_exercise_tracking.csv')

In [8]:
# treat the first row as the header
df_pyspark = spark.read.option('header', 'true').csv(
    'data/gym_members_exercise_tracking.csv')

In [9]:
# type of the DataFrame object
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [10]:
# column names and datatypes in the schema
df_pyspark.printSchema()

root
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Weight (kg): string (nullable = true)
 |-- Height (m): string (nullable = true)
 |-- Max_BPM: string (nullable = true)
 |-- Avg_BPM: string (nullable = true)
 |-- Resting_BPM: string (nullable = true)
 |-- Session_Duration (hours): string (nullable = true)
 |-- Calories_Burned: string (nullable = true)
 |-- Workout_Type: string (nullable = true)
 |-- Fat_Percentage: string (nullable = true)
 |-- Water_Intake (liters): string (nullable = true)
 |-- Workout_Frequency (days/week): string (nullable = true)
 |-- Experience_Level: string (nullable = true)
 |-- BMI: string (nullable = true)



### Why were all columns read as strings?
Unless told otherwise, PySpark's CSV reader treats every column as a string. To have Spark detect the column types from the data, we have to set the `inferSchema` parameter to `True`.

In [11]:
df_pyspark = spark.read.option('header', 'true').csv(
    'data/gym_members_exercise_tracking.csv', inferSchema=True)

df_pyspark.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Weight (kg): double (nullable = true)
 |-- Height (m): double (nullable = true)
 |-- Max_BPM: integer (nullable = true)
 |-- Avg_BPM: integer (nullable = true)
 |-- Resting_BPM: integer (nullable = true)
 |-- Session_Duration (hours): double (nullable = true)
 |-- Calories_Burned: double (nullable = true)
 |-- Workout_Type: string (nullable = true)
 |-- Fat_Percentage: double (nullable = true)
 |-- Water_Intake (liters): double (nullable = true)
 |-- Workout_Frequency (days/week): integer (nullable = true)
 |-- Experience_Level: integer (nullable = true)
 |-- BMI: double (nullable = true)



## Part 2: Selecting & Indexing Columns

Select a single column

In [12]:
df_pyspark.select('Gender').show(3)

+------+
|Gender|
+------+
|  Male|
|Female|
|Female|
+------+
only showing top 3 rows



Select multiple columns

In [13]:
df_pyspark.select(['Gender','Age']).show(3)

+------+---+
|Gender|Age|
+------+---+
|  Male| 56|
|Female| 46|
|Female| 32|
+------+---+
only showing top 3 rows



## Part 3: Describe Dataframe

In [19]:
df_pyspark.describe().show(vertical=True)

-RECORD 0--------------------------------------------
 summary                       | count               
 Age                           | 973                 
 Gender                        | 973                 
 Weight (kg)                   | 973                 
 Height (m)                    | 973                 
 Max_BPM                       | 973                 
 Avg_BPM                       | 973                 
 Resting_BPM                   | 973                 
 Session_Duration (hours)      | 973                 
 Calories_Burned               | 973                 
 Workout_Type                  | 973                 
 Fat_Percentage                | 973                 
 Water_Intake (liters)         | 973                 
 Workout_Frequency (days/week) | 973                 
 Experience_Level              | 973                 
 BMI                           | 973                 
-RECORD 1--------------------------------------------
 summary                    

## Part 4: Adding & Removing Columns


In [18]:
# add a boolean column that is true when Gender is 'Male'
# (withColumn returns a new DataFrame; df_pyspark itself is unchanged)
df_pyspark.withColumn('Is_Male',df_pyspark['Gender']=='Male').show(vertical=True)

-RECORD 0---------------------------------
 Age                           | 56       
 Gender                        | Male     
 Weight (kg)                   | 88.3     
 Height (m)                    | 1.71     
 Max_BPM                       | 180      
 Avg_BPM                       | 157      
 Resting_BPM                   | 60       
 Session_Duration (hours)      | 1.69     
 Calories_Burned               | 1313.0   
 Workout_Type                  | Yoga     
 Fat_Percentage                | 12.6     
 Water_Intake (liters)         | 3.5      
 Workout_Frequency (days/week) | 4        
 Experience_Level              | 3        
 BMI                           | 30.2     
 Is_Male                       | true     
-RECORD 1---------------------------------
 Age                           | 46       
 Gender                        | Female   
 Weight (kg)                   | 74.9     
 Height (m)                    | 1.53     
 Max_BPM                       | 179      
 Avg_BPM   