
#### Telecom Domain Read & Write Ops Assignment - Building Datalake & Lakehouse

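###### First, import all required libraries and create the Spark session object

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Databricks notebooks already provide `spark`; getOrCreate() returns that
# session there and builds a new one when the code runs elsewhere.
spark = SparkSession.builder.appName("telecom_read_write_ops").getOrCreate()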

###### Create sample telecom data

In [0]:
telecom_data = [
    (1, "Airtel", "Prepaid", 249.0, "Active"),
    (2, "Jio", "Postpaid", 399.0, "Active"),
    (3, "Vi", "Prepaid", 199.0, "Inactive"),
    (4, "BSNL", "Prepaid", 149.0, "Active")
]

schema = ["customer_id", "operator", "plan_type", "monthly_charge", "status"]

df = spark.createDataFrame(telecom_data, schema)
df.show()

###### Create Catalog, Schema and Volume

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS telecom_catalog_assign;
CREATE SCHEMA IF NOT EXISTS telecom_catalog_assign.landing_zone;
CREATE VOLUME IF NOT EXISTS telecom_catalog_assign.landing_zone.landing_vol;

###### Using dbutils.fs.mkdirs, create folders

In [0]:
# Create Customer folder
dbutils.fs.mkdirs(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/"
)

# Create usage folder
dbutils.fs.mkdirs(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/"
)

# Create tower folder
dbutils.fs.mkdirs(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/"
)

Difference between Volumes and DBFS/FileStore

DBFS/FileStore → Dev, temporary, non-governed storage

Volumes → Secure, governed, production-ready storage for regulated data

##Data files to use in this usecase:

In [0]:
# Step 1: Define raw data files (as given)

customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''


##2. Filesystem operations
1. Write dbutils.fs code to copy the above datasets into your created Volume folders:
Customer → /Volumes/.../customer/
Usage → /Volumes/.../usage/
Tower (region-based) → /Volumes/.../tower/region1/ and /Volumes/.../tower/region2/

2. Write a command to validate whether files were successfully copied

In [0]:
# Step 2: Write raw files into landing volume folders
# Customer CSV --> landing/customer
dbutils.fs.put(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv",
    customer_csv,
    overwrite=True
)

# Usage TSV --> landing/usage
dbutils.fs.put(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",
    usage_tsv,
    overwrite=True
)

# Tower logs --> landing/tower
dbutils.fs.put(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower_logs_region1.csv",
    tower_logs_region1,
    overwrite=True
)
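The second task above asks for a check that the files actually landed. The following is a minimal sketch, assuming volume paths can be read through ordinary Python file APIs; in a Databricks notebook, calling `dbutils.fs.ls` on the same folders is the more usual check. The helper names `list_landed_files` and `missing_files` are hypothetical, and the demo runs against a temporary directory standing in for the volume, so it can be tried anywhere.

```python
import os
import tempfile


def list_landed_files(base_dir):
    """Return {relative_path: size_in_bytes} for every file under base_dir."""
    landed = {}
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            full_path = os.path.join(root, name)
            rel_path = os.path.relpath(full_path, base_dir).replace(os.sep, "/")
            landed[rel_path] = os.path.getsize(full_path)
    return landed


def missing_files(base_dir, expected):
    """Return the expected relative paths that are absent or empty."""
    landed = list_landed_files(base_dir)
    return [path for path in expected if landed.get(path, 0) == 0]


# Demo: a temporary directory stands in for
# /Volumes/telecom_catalog_assign/landing_zone/landing_vol (assumption).
demo_root = tempfile.mkdtemp()
demo_files = {
    "customer/customer.csv": "101,Arun,31,Chennai,PREPAID\n",
    "usage/usage.tsv": "customer_id\tvoice_mins\n101\t320\n",
}
for rel_path, content in demo_files.items():
    target = os.path.join(demo_root, rel_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as handle:
        handle.write(content)

expected = [
    "customer/customer.csv",
    "usage/usage.tsv",
    "tower/tower_logs_region1.csv",
]
# The tower file was deliberately not written in this demo, so it is reported.
not_landed = missing_files(demo_root, expected)
print(not_landed)
```

Pointing `missing_files` at the real landing volume path with the three file names from Step 2 should return an empty list once all writes have succeeded.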

In [0]:
# Step 3: Read customer CSV file
customer_df = spark.read.csv(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv"
)
customer_df.show()

In [0]:
# Step 4: Read customer csv with correct options

customer_df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "true") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv") \
    .toDF("customer_id", "name", "age", "city", "plan_type")

customer_df.show()


In [0]:
# Step 5: Read usage tsv file
usage_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", "\t") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv") 
    #.toDF("customer_id", "voice_mins", "data_mb", "sms_count")
usage_df.show()

In [0]:
# Step 6: Read tower logs pipe delimited
tower_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", "|") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower_logs_region1.csv") 
    #.toDF("event_id", "customer_id", "tower_id", "signal_strength", "timestamp")
tower_df.show()

##3. Spark Directory Read Use Cases
1. Read all tower logs using:
Path glob filter (example: *.csv)
Multiple paths input
Recursive lookup

2. Demonstrate these 3 reads separately:
Using pathGlobFilter
Using list of paths in spark.read.csv([path1, path2])
Using .option("recursiveFileLookup","true")

3. Compare the outputs and understand when each should be used.

Use case 1: Read using pathGlobFilter

Description: Reads files matching a specific filename pattern within a single directory


In [0]:
# Use case 1: Read using pathGlobFilter
# Reads files matching a specific filename pattern within a single directory.
# The option key is "pathGlobFilter"; an unrecognised key is silently ignored.
# The tower logs were landed with a .csv extension, so the pattern is *.csv.

tower_glob_df = spark.read \
    .option("header", "true") \
    .option("sep", "|") \
    .option("pathGlobFilter", "*.csv") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")

tower_glob_df.show()

Use case 1: When to use

- Files are in one directory
- Naming pattern is consistent
- No subfolders

In [0]:
# Use case 2: Read using multiple paths input
# Explicitly specify the exact file paths to read.

# Note: Step 2 writes only tower_logs_region1.csv. The region2 file must be
# landed in the volume first, otherwise this read fails with a path-not-found error.
paths = [
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower_logs_region1.csv",
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower_logs_region2.csv"
]

tower_multi_df = spark.read \
    .option("header", "true") \
    .option("sep", "|") \
    .csv(paths)

tower_multi_df.show()


Use case 2: When to use

- You know exact file names
- Reading specific files only
- Reprocessing selected data

In [0]:
# Use case 3: Read using recursive file lookup
# Reads files recursively from all subdirectories.

tower_recursive_df = spark.read \
    .option("header", "true") \
    .option("sep", "|") \
    .option("recursiveFileLookup", "true") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")

tower_recursive_df.show()

Use Case 3: When to use

- Deep folder hierarchy
- Nested folders where partition columns are not needed (recursiveFileLookup disables partition discovery)
- Unknown or dynamic folder structure

Final One-Line Summary (Exam Ready)

- pathGlobFilter → Pattern-based read in a single directory
- Multiple paths → Full control over files read
- recursiveFileLookup → Auto-discovery across subfolders
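To compare which files each approach picks up, here is a plain-Python sketch of the selection logic only. It is an analogy built on `fnmatch` and `os.walk`, not Spark's implementation. The helper names are hypothetical, and every file in the demo layout other than `tower_logs_region1.csv` is an assumption made for illustration.

```python
import fnmatch
import os
import tempfile


def select_by_glob(directory, pattern):
    """Files directly inside `directory` whose names match `pattern`."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
        and fnmatch.fnmatch(name, pattern)
    )


def select_recursive(directory):
    """Every file under `directory`, at any depth, as relative paths."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), directory)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


# Hypothetical tower folder: one landed log, one stray text file,
# and one log inside a region subfolder.
tower_dir = tempfile.mkdtemp()
for rel in ["tower_logs_region1.csv", "notes.txt", "region2/tower_logs_region2.csv"]:
    path = os.path.join(tower_dir, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        handle.write("event_id|customer_id\n")

# Glob filter: top-level files matching the pattern only.
glob_files = select_by_glob(tower_dir, "*.csv")
# Multiple paths: exactly the files that are listed, nothing else.
explicit_files = ["tower_logs_region1.csv"]
# Recursive lookup: everything at every depth, including non-CSV files.
recursive_files = select_recursive(tower_dir)

print(glob_files)
print(explicit_files)
print(recursive_files)
```

In this sketch the recursive listing also returns `notes.txt`, which suggests pairing a recursive read with a filename pattern when a folder tree may hold files of mixed types.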

##4. Schema Inference, Header, and Separator
1. Read the Customer and Usage files with option/options, using both read.csv and the format function:<br>
header=false, inferSchema=false<br>
or<br>
header=true, inferSchema=true<br>
2. Write a note on what changed when header or inferSchema is set to true/false.<br>
3. How did schema inference handle "abc" in age?<br>

In [0]:
# Case 1: customer file --> header = false, inferSchema = false
cust_no_header_no_schema_df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

cust_no_header_no_schema_df.show()
cust_no_header_no_schema_df.printSchema()

In [0]:
# Case 1: usage file --> header = false, inferSchema = false
# The header line (customer_id, voice_mins, ...) is read as the first data row.
usage_no_header_no_schema_df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("sep", "\t") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")

usage_no_header_no_schema_df.show()
usage_no_header_no_schema_df.printSchema()


In [0]:
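# Case 2: header = true and inferSchema = true
# customer file
# Note: customer.csv has no header row, so header=true consumes the first record
# (101, Arun, ...) as column names. toDF then renames them and that record is lost.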
customer_header_schema_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv") \
    .toDF("customer_id", "name", "age", "city", "plan_type")

customer_header_schema_df.show()
customer_header_schema_df.printSchema()

In [0]:
# Case 2: header = true and inferSchema = true
# usage file
usage_header_schema_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", "\t") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv") \
    .toDF("customer_id", "voice_mins", "data_mb", "sms_count")

usage_header_schema_df.show()
usage_header_schema_df.printSchema()

###### 2. Write a note on what changed when we use header or inferSchema with true/false?

header = false

- Spark treats the first row as data
- Column names default to _c0, _c1, etc.
- Manual renaming required

header = true

- First row becomes column names
- Data is more readable and usable
- Prevents accidental ingestion of the header as data

inferSchema = false

- All columns are read as string
- No type validation
- Faster read but unsafe for analytics

inferSchema = true

- Spark scans the data to determine data types
- Enables numeric operations and aggregations
- Slight performance overhead
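The header behaviour described above can be reproduced without Spark. This is a small sketch using Python's `csv` module, with a hypothetical helper `read_rows` that mimics only how the header flag splits names from data. It is not how Spark reads files internally.

```python
import csv
import io


def read_rows(text, header, sep=","):
    """Return (column_names, data_rows) the way a header flag would split them."""
    # Blank lines are dropped before deciding which row is the header.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=sep) if row]
    if not rows:
        return [], []
    if header:
        return rows[0], rows[1:]
    # Without a header, generate positional names in the _c0, _c1, ... style.
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows


usage_sample = "customer_id\tvoice_mins\tdata_mb\tsms_count\n101\t320\t1500\t20\n"
customer_sample = "101,Arun,31,Chennai,PREPAID\n102,Meera,45,Bangalore,POSTPAID\n"

# Usage file has a real header: header=False leaks it into the data.
names_no_header, rows_no_header = read_rows(usage_sample, header=False, sep="\t")
names_header, rows_header = read_rows(usage_sample, header=True, sep="\t")

# Customer file has no header: header=True swallows the first record.
cust_names, cust_rows = read_rows(customer_sample, header=True)

print(names_no_header, len(rows_no_header))
print(names_header, len(rows_header))
print(cust_names, len(cust_rows))
```

The last call shows the risk of setting header=true on a file without a header: the first customer record becomes the column names and only one data row remains.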

###### 3. How did schema inference handle "abc" in age?

In [0]:
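# "abc" cannot be parsed as a number, so inferSchema falls back to string for the
# whole age column and the value is kept as-is. printSchema() confirms the type.
customer_header_schema_df.select("customer_id", "age").show()
customer_header_schema_df.select("customer_id", "age").printSchema()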
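The effect of a single non-numeric value on an inferred column type can be sketched in plain Python. The function below is a simplified illustration: it picks the narrowest of int, double, or string that fits every non-empty value. The name `infer_column_type` and the three-type ladder are assumptions made for the example; Spark's own inference covers more types and edge cases.

```python
def infer_column_type(values):
    """Pick the narrowest of int, double, string that fits every non-empty value."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return "string"
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in non_empty:
                caster(value)  # raises ValueError if the value does not fit
            return type_name
        except ValueError:
            continue
    return "string"


# Age values from the customer file, including the bad record "abc".
ages = ["31", "45", "29", "52", "27", "abc"]
age_type = infer_column_type(ages)
# The same column without the bad record.
clean_age_type = infer_column_type(ages[:-1])

print(age_type)
print(clean_age_type)
```

One bad value is enough to widen the whole column to string, which is why an explicit schema or a later cast is usually needed before numeric work on age.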

##5. Column Renaming Usecases

##### 1. Apply column names as strings using the toDF function for customer data

###### Use Case:
###### File has no header and all columns are read as strings.
###### We only want to rename columns not enforce data types yet.

In [0]:
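# inferSchema is off so every column stays a string, matching the use case:
# only the column names change here, data types are not enforced yet.
customer_df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv") \
    .toDF("customer_id", "name", "age", "city", "plan_type")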

customer_df.show()
customer_df.printSchema()

###### 2. Apply column names and data types using the schema function for usage data
Use case:
File has a header, but we want full control of the schema.