Create a DataFrame in PySpark by reading data from a CSV file and explore its structure and contents.

In [1]:
import os
import findspark

# Point to Java & Spark
os.environ["JAVA_HOME"] = "C:/Progra~1/Java/jdk1.8"
os.environ["SPARK_HOME"] = "C:/spark/spark-3.5.7-bin-hadoop3-scala2.13"
os.environ["HADOOP_HOME"] = "C:/hadoop"   # if you installed winutils here
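# Reuse the variables above so the bin directories stay in sync with them
os.environ["PATH"] += os.pathsep + os.environ["SPARK_HOME"] + "/bin"
os.environ["PATH"] += os.pathsep + os.environ["HADOOP_HOME"] + "/bin"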

# Initialize findspark
findspark.init(os.environ["SPARK_HOME"])

from pyspark.sql import SparkSession

# Now build SparkSession
spark = SparkSession.builder \
    .appName("CSV_RDD_Example") \
    .master("local[*]") \
    .getOrCreate()

sc = spark.sparkContext

In [2]:
sc

📊 Dataset Overview

    --Total Records: 50 students

    --Columns: 7 → id, name, age, gender, math, science, english

    --No missing values

👥 Demographics

    --Age: 18 – 25 years (average ≈ 21.5)

    --Gender: 29 Female, 21 Male

📚 Academic Performance

**Math:

    --Range: 40 – 100

    --Mean: 68.9

    --Std. Dev.: 17.6 (high variation)

**Science:

    --Range: 44 – 99

    --Mean: 70.2

    --Std. Dev.: 14.6 (moderate variation)

**English:

    --Range: 42 – 100

    --Mean: 69.4

    --Std. Dev.: 18.7 (highest variation)

✨ Key Insights

    --Science is the strongest subject on average (mean 70.2, against 69.4 for English and 68.9 for math).

    --English has the most variation in performance (std. dev. 18.7).

    --Students' strengths vary; performance is not uniform across subjects.

In [3]:
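from pyspark.sql import SparkSession

# Step 1: Initialize Spark Session
# getOrCreate() returns the session already started in In [1],
# so the SparkContext keeps the app name that was set there.
spark = SparkSession.builder.appName("BasicDataFrameOps").getOrCreate()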

In [4]:
# Step 2: Read CSV file into DataFrame
df = spark.read.csv("students.csv", header=True, inferSchema=True)

In [5]:
# === Basic Operations ===
# 1. View first 5 rows
print("=== First 5 rows ===")
df.show(5)

=== First 5 rows ===
+---+-------+---+------+----+-------+-------+
| id|   name|age|gender|math|science|english|
+---+-------+---+------+----+-------+-------+
|  1|  Alice| 20|     F|  66|     92|     44|
|  2|    Bob| 20|     M|  82|     52|     77|
|  3|Charlie| 22|     F|  43|     57|     76|
|  4|  David| 19|     M|  95|     69|     46|
|  5|    Eva| 19|     F|  62|     44|     96|
+---+-------+---+------+----+-------+-------+
only showing top 5 rows



In [6]:
# 2. Print schema (structure of DataFrame)
print("=== Schema ===")
df.printSchema()

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- math: integer (nullable = true)
 |-- science: integer (nullable = true)
 |-- english: integer (nullable = true)



In [8]:
# 3. Select specific columns: name and math
print("=== Select name and math columns ===")
df.select("name", "math").show(5)

=== Select name and math columns ===
+-------+----+
|   name|math|
+-------+----+
|  Alice|  66|
|    Bob|  82|
|Charlie|  43|
|  David|  95|
|    Eva|  62|
+-------+----+
only showing top 5 rows



In [9]:
# 4. Filter students with math >= 80
print("=== Students with math >= 80 ===")
df.filter(df.math >= 80).show(5)

=== Students with math >= 80 ===
+---+------+---+------+----+-------+-------+
| id|  name|age|gender|math|science|english|
+---+------+---+------+----+-------+-------+
|  2|   Bob| 20|     M|  82|     52|     77|
|  4| David| 19|     M|  95|     69|     46|
| 11| Kathy| 25|     M|  85|     71|     89|
| 12|   Leo| 24|     M|  97|     84|     83|
| 15|Olivia| 18|     M|  87|     90|     87|
+---+------+---+------+----+-------+-------+
only showing top 5 rows



In [10]:
# 5. Sort students by science marks (descending)
print("=== Sorted by science (desc) ===")
df.orderBy(df.science.desc()).show(5)

=== Sorted by science (desc) ===
+---+------+---+------+----+-------+-------+
| id|  name|age|gender|math|science|english|
+---+------+---+------+----+-------+-------+
| 27| Aaron| 25|     F|  81|     99|     44|
| 32| Fiona| 22|     F|  48|     96|     48|
| 33|George| 22|     M|  66|     95|     84|
| 29|  Carl| 22|     F|  53|     92|     52|
|  1| Alice| 20|     F|  66|     92|     44|
+---+------+---+------+----+-------+-------+
only showing top 5 rows



In [13]:
# 6. Count total rows
print("Total Number of Rows:", df.count())

Total Number of Rows: 50


In [14]:
# 7. Show column names
print("Columns:", df.columns)

Columns: ['id', 'name', 'age', 'gender', 'math', 'science', 'english']


Summary

PySpark Operations

  **Here we performed the following key operations on the DataFrame:

    --Data Exploration: Displayed the first 5 rows, printed the schema and data types, counted the rows, and listed the column names.

    --Column Selection: Selected specific columns (name and math).

    --Filtering: Filtered the data to find students with a math score of 80 or higher.

    --Sorting: Sorted students by science score in descending order.

    --Further Operations (not run in the cells above): The same DataFrame also supports summary statistics, combined filters such as an age of 21 or older with a math score of 70 or higher, a new average column calculated from the math, science, and english scores, filtering and sorting on that average (for example 75 or higher, in descending order), and grouping by gender to calculate the average score for each subject.

In [15]:
# Stop Spark session
# spark.stop()