### Create catalog, schema, and volume


In [0]:

%sql
create catalog if not exists catalog1_dropme; 
create database if not exists catalog1_dropme.schema1_dropme; 
create volume if not exists catalog1_dropme.schema1_dropme.volume1_dropme;

## Typical Spark Program
### `from pyspark.sql.session import SparkSession`


| Component      | Name                | Type       |
| -------------- | ------------------- | ---------- |
| `pyspark`      | pyspark             | Package    |
| `sql`          | pyspark.sql         | Subpackage |
| `session`      | pyspark.sql.session | Module     |
| `SparkSession` | SparkSession        | Class      |

In Python, a package is a directory of modules, a subpackage is a package nested inside another package, a module is a single `.py` file, and a class is a blueprint defined inside a module.
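
The four kinds in the table can be told apart at runtime. The sketch below uses the standard-library path `xml.etree.ElementTree.ElementTree` as a stand-in so that it runs without Spark; the helper name `classify` is hypothetical and not part of any library.

```python
import importlib
import inspect

def classify(dotted_name):
    """Return 'package', 'subpackage', 'module', or 'class' for a dotted path."""
    try:
        obj = importlib.import_module(dotted_name)
    except ModuleNotFoundError:
        # Not importable as a module: treat the last part as an attribute of its parent module
        parent, _, attr = dotted_name.rpartition(".")
        obj = getattr(importlib.import_module(parent), attr)
    if inspect.ismodule(obj):
        if hasattr(obj, "__path__"):  # only packages carry a __path__ attribute
            return "subpackage" if "." in dotted_name else "package"
        return "module"
    return "class" if inspect.isclass(obj) else type(obj).__name__

for name in ["xml", "xml.etree", "xml.etree.ElementTree", "xml.etree.ElementTree.ElementTree"]:
    print(f"{name:40s} {classify(name)}")
```

On a cluster where PySpark is installed, `classify("pyspark.sql.session.SparkSession")` should give the same breakdown as the table.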

In [0]:
from pyspark.sql.session import SparkSession
print(spark)  # already instantiated by Databricks
spark1 = SparkSession.builder.getOrCreate()
print(spark1)  # the session we requested ourselves
# SparkSession.builder.getOrCreate() either returns an existing SparkSession or creates
# one if none exists. In a Databricks notebook it returns the pre-created session.

### Read/extract the data from the filesystem and load it into distributed memory for further processing/load.

### (default) If we don't use any options in spark.read.csv():
1. By default it uses comma ( , ) as delimiter (sep=",")
2. By default header=False, so first row is treated as data and column names are auto-generated as _c0, _c1, ... _cn
3. By default inferSchema=False, so all columns are read as StringType
4. Default read mode is PERMISSIVE
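
To make the four defaults concrete, here is a minimal sketch that spells them out as explicit options. The names `DEFAULT_CSV_OPTIONS`, `default_column_names`, and `read_csv_with_explicit_defaults` are hypothetical; the option values are the ones from the list above, and the `spark` argument is assumed to be the notebook's session.

```python
# Explicit equivalents of the defaults listed above
DEFAULT_CSV_OPTIONS = {
    "sep": ",",             # 1. comma delimiter
    "header": False,        # 2. first row is data
    "inferSchema": False,   # 3. every column is StringType
    "mode": "PERMISSIVE",   # 4. default read mode
}

def default_column_names(n):
    """Names Spark generates for n columns when header=False."""
    return [f"_c{i}" for i in range(n)]

def read_csv_with_explicit_defaults(spark, path):
    """Intended to behave like spark.read.csv(path) with no options at all."""
    return spark.read.options(**DEFAULT_CSV_OPTIONS).csv(path)

print(default_column_names(4))  # ['_c0', '_c1', '_c2', '_c3']
```

Spelling the defaults out like this is mostly useful as documentation: overriding one key (for example `"header": True`) shows exactly which default is being changed.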

In [0]:
%python
# File path inside the volume
file_path = "dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv"

# Sample CSV content
csv_data = """cust_id,cust_name,age,city
1,John,30,Hyderabad
2,Jane,28,Bangalore
3,Robert,35,Pune
4,Emily,32,Chennai
"""

# Method 1: dbutils.fs.put (part of the dbutils.fs utility)
dbutils.fs.put(
    file_path,
    csv_data,
    overwrite=True
)

# Verify file creation
dbutils.fs.ls("dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme")


In [0]:
# dbfs:/// is the URI (Uniform Resource Identifier) prefix for accessing DBFS paths
# dbfs:// → URI scheme / protocol (tells Spark/Databricks to use DBFS), like http:// or file://
# /       → root directory of the DBFS filesystem

csv_df1 = spark.read.csv("dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv")
# csv_df1.show(2)  # show() prints the rows as a plain-text table
csv_df1.printSchema()
display(csv_df1)  # display() renders a formatted, interactive table; it is specific to Databricks
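
The pieces of the `dbfs:///` prefix can be inspected with Python's standard URL parser. This is only an illustration of the URI structure (scheme, empty authority, path); it is not the parser that Spark or Databricks uses internally.

```python
from urllib.parse import urlsplit

parts = urlsplit("dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv")

print(parts.scheme)        # dbfs
print(repr(parts.netloc))  # '' (nothing between "//" and the next "/")
print(parts.path)          # /Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv
```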


In [0]:
# 1. Header concepts (2 ways)
# By default Spark names the columns _c0, _c1, ..., _cn. Here we ask Spark to treat the first row as the header, not as data.
# Way 1
csv_df1 = spark.read.csv("dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv", header=True)
csv_df1.printSchema()  # printSchema() prints by itself and returns None, so no print() is needed
csv_df1.show(2)
# csv_df1.write.csv("/Volumes/workspace/wd36schema2/volume1/folder1/outputdata")
# Way 2
# Read with the default _c0, _c1, ..., _cn names, then use toDF(colnames) to define our own headers.
# Because header is not set here, the file's header line stays in the DataFrame as the first data row.
csv_df2 = spark.read.csv("dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv").toDF("my_id", "name", "age", "city")
csv_df2.printSchema()
csv_df2.show(2)

In [0]:
#2. Printing Schema (equivalent to describe table)
csv_df1.printSchema()
csv_df2.printSchema()

In [0]:
# 3. Inferring schema
# (Performance consideration: use inferSchema cautiously. It makes Spark read the data eagerly,
# at load time, to work out the column types, so it is a poor fit for large data and unnecessary when the schema is already known.)


### Reading CSV without inferSchema

In [0]:
csv_df1 = spark.read.csv(
    "/Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv",
    header=True        # Use first row as column names
)
csv_df1.printSchema()
csv_df1.show(5)

'''
Observations without inferSchema
1) All columns are read as strings (StringType):
 |-- cust_id: string (nullable = true)
 |-- cust_name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)
2) Even numeric columns such as cust_id and age are strings
3) Numeric operations (sum, avg, comparisons) on age depend on implicit string-to-number casts; cast the column explicitly for reliable results '''
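
A minimal sketch of the cast mentioned in the last observation, assuming a DataFrame whose `age` column was read as a string. The helper name `with_numeric_age` is hypothetical; `withColumn` and `Column.cast` are the standard DataFrame calls.

```python
def with_numeric_age(df):
    """Return df with the string column 'age' replaced by an integer column."""
    return df.withColumn("age", df["age"].cast("int"))

# In the notebook (assumes csv_df1 from the cell above):
# typed_df = with_numeric_age(csv_df1)
# typed_df.printSchema()
```

Values that cannot be parsed as integers become NULL after the cast, so it is worth checking for unexpected NULLs afterwards.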

### Reading CSV with inferSchema=True
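How inferSchema works internally:
Spark makes an extra pass over the CSV data to work out each column's type. By default (samplingRatio=1.0) it examines every row; a lower samplingRatio limits inference to a fraction of the rows.

If a column contains values of conflicting types, Spark falls back to StringType for that column.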
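
The fallback to StringType can be imitated in plain Python. The sketch below is a deliberately simplified stand-in (integer, then double, then string) and not Spark's actual inference code; `infer_type` and `read_with_sampled_inference` are hypothetical names, while `samplingRatio` is the CSV reader option that sets the fraction of rows used for inference.

```python
def infer_type(values):
    """Simplified per-column inference: integer -> double -> string."""
    present = [v for v in values if v not in (None, "")]  # empty fields count as nulls
    if not present:
        return "string"

    def all_parse(cast):
        for v in present:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "string"  # mixed or non-numeric values fall back to string

def read_with_sampled_inference(spark, path, ratio=0.1):
    """Infer the schema from roughly `ratio` of the rows instead of all of them."""
    return spark.read.csv(path, header=True, inferSchema=True, samplingRatio=ratio)

print(infer_type(["30", "28", "35"]))   # integer
print(infer_type(["30", "n/a", "35"]))  # string
```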

In [0]:
csv_df2 = spark.read.csv(
    "/Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv",
    header=True,       # First row is column names
    inferSchema=True   # Spark will detect correct data types
)
csv_df2.printSchema()
csv_df2.show(5)
'''
Observations with inferSchema=True
1) Spark automatically detected cust_id and age as integers:
 |-- cust_id: integer (nullable = true)
 |-- cust_name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
2) cust_name and city remain strings
3) Numeric operations on cust_id and age now work without casting
'''

### Generic way to read and load data into a DataFrame using the fundamental options (inferSchema, header, sep) of the built-in sources (csv/orc/parquet/xml/json/table)

In [0]:
# For a file with rows like 1~John~30~Hyderabad use sep='~'; the default separator is ','
csv_df1 = spark.read.csv("dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv", inferSchema=True, header=True, sep=',')
csv_df1.show(2)
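
The same read can be written in a format-agnostic way with `format(...).options(...).load(...)`, which is what makes the pattern reusable across csv, json, parquet, and the other built-in sources. A minimal sketch follows; the helper name `read_generic` is hypothetical, and `spark` is assumed to be the notebook's session.

```python
def read_generic(spark, fmt, path, **options):
    """Format-agnostic counterpart of spark.read.<fmt>(path, ...)."""
    return spark.read.format(fmt).options(**options).load(path)

csv_options = {"header": True, "inferSchema": True, "sep": ","}

# In the notebook:
# df = read_generic(
#     spark, "csv",
#     "dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv",
#     **csv_options,
# )
```

Only the `fmt` string and the options change from one source to the next; options that a given format does not understand are typically ignored rather than rejected, so a typo in an option name can go unnoticed.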

### Provide schema with a SQL string or programmatically (very important)
Important part: using StructType to define a custom, possibly complex schema.
Import the required classes from `pyspark.sql.types`, then define the structure:
`define_structure = StructType([StructField("colname", DataType(), True), StructField("colname", DataType(), True), ...])`


In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
custom_schema = StructType([
    StructField("cust_id", IntegerType(), False),  # names and order follow customers.csv
    StructField("cust_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType()),  # nullable defaults to True when omitted
])
csv_df1 = spark.read.schema(custom_schema).csv("dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv", header=True)
csv_df1.printSchema()
csv_df1.show(2)
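
The SQL-string form of a schema is a comma-separated list of `name TYPE` pairs, and the reader accepts it wherever it accepts a StructType. A minimal sketch for the customers file follows; the names `COLUMNS`, `DDL_SCHEMA`, and `read_with_ddl_schema` are hypothetical.

```python
# Column names and SQL types, in file order
COLUMNS = [("cust_id", "INT"), ("cust_name", "STRING"), ("age", "INT"), ("city", "STRING")]

# Build the DDL string: "cust_id INT, cust_name STRING, age INT, city STRING"
DDL_SCHEMA = ", ".join(f"{name} {sql_type}" for name, sql_type in COLUMNS)

def read_with_ddl_schema(spark, path):
    """Read a CSV file using the DDL string instead of a StructType."""
    return spark.read.csv(path, schema=DDL_SCHEMA, header=True)

print(DDL_SCHEMA)
```

The string form is shorter to write; the StructType form is easier to build or modify in code, for example when the column list comes from configuration.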

### ✅ What is a manual schema? (most used)

A manual schema means you explicitly define column names, data types, and nullability, instead of letting Spark guess.

StructField("cust_id", IntegerType(), **_True_**)

The third argument indicates whether this column can contain NULL values:

- **_True_** → NULL allowed
- **_False_** → NULL not allowed (strict)

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# 1️⃣ Define schema explicitly
manual_schema = StructType([
    StructField("cust_id", IntegerType(), True),
    StructField("cust_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])
# 2️⃣ Read file using the schema
v_df1 = spark.read.csv(
    "dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv",
    schema=manual_schema,  # Manual schema applied
    header=True,           # First row is header
    sep=','                # comma is the default delimiter, so this argument is optional
)
# 3️⃣ Validate data
v_df1.show(2)
v_df1.printSchema()
display(v_df1)

Data arrives from different source systems in different regions (NY, TX, CA) at different landing locations. How do we access all of it in one read? Pass a list of paths to `spark.read.csv`.

In [0]:
df_multiple_sources = spark.read.csv(
    [
        "dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv",
        "/Volumes/workspace/default/usage_metrics/mobile_os_usage.csv"
    ],
    inferSchema=True,
    header=True,
    sep=','
)

df_multiple_sources.show(4)
print(df_multiple_sources.count())


In [0]:
df_multiple_sources = spark.read.csv(
    path=["dbfs:///Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/customers.csv", "/Volumes/workspace/default/usage_metrics/mobile_os_usage.csv"],
    inferSchema=True, header=True, sep=',',
    pathGlobFilter="customers*.csv", recursiveFileLookup=True)  # the glob must match the whole file name, so a bare "customers" matches nothing
#.toDF("cid","fn","ln","a","p")
print(df_multiple_sources.count())
df_multiple_sources.show(4)
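
`pathGlobFilter` is matched against file names as a glob pattern, so wildcards matter. The sketch below uses Python's `fnmatch` as a stand-in for simple patterns; Spark relies on Hadoop's glob rules, which may differ for more advanced patterns. The file name `customers_2024.csv` and the helper `matching` are hypothetical.

```python
from fnmatch import fnmatchcase

files = ["customers.csv", "customers_2024.csv", "mobile_os_usage.csv"]

def matching(pattern):
    """File names from `files` that the glob pattern accepts."""
    return [name for name in files if fnmatchcase(name, pattern)]

print(matching("customers"))    # []
print(matching("customers*"))   # ['customers.csv', 'customers_2024.csv']
print(matching("*.csv"))        # ['customers.csv', 'customers_2024.csv', 'mobile_os_usage.csv']
```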

### DataFrame.write (write.csv / write.parquet)
1. Part of the Spark DataFrame API
2. Writes structured data from a DataFrame to CSV, Parquet, JSON, or Delta
3. Handles large datasets in parallel (distributed)
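
A minimal sketch of writing one DataFrame in two formats. The helper name `write_outputs` and the `csv` and `parquet` subfolder names are hypothetical; `mode`, `option`, `csv`, and `parquet` are standard DataFrameWriter calls.

```python
def write_outputs(df, base_path):
    """Write df as CSV (with a header row) and as Parquet under base_path."""
    # mode("overwrite") replaces any existing output at the target path
    df.write.mode("overwrite").option("header", True).csv(f"{base_path}/csv")
    df.write.mode("overwrite").parquet(f"{base_path}/parquet")

# In the notebook (assumes v_df1 from an earlier cell and a writable volume path):
# write_outputs(v_df1, "/Volumes/catalog1_dropme/schema1_dropme/volume1_dropme/output")
```

Each target path becomes a directory that holds one or more part files, because every partition of the DataFrame is written in parallel.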