## PySpark

In [None]:
# What is file reading in PySpark?
# PySpark can read data from many file formats (CSV, JSON, Parquet, ORC, text, and Avro via the external spark-avro package) into a DataFrame.
# You can then apply transformations and actions to that DataFrame.
# The API resembles pandas.read_csv(), but the work is distributed across a cluster, so it scales to data that does not fit in one machine's memory.

### Reading CSV files

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("CSV") \
    .getOrCreate()
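
# header=True takes the column names from the first line;
# inferSchema=True makes an extra pass over the data to guess column types
df_csv = spark.read.format('csv') \
    .option('header', True) \
    .option('inferSchema', True) \
    .load('C:/Git files/My git files/PySpark/files/sales_data.csv')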
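
The `format(...).option(...).load(...)` chain used above is not specific to CSV. As a minimal sketch, the hypothetical helper `read_with_format` below (not part of PySpark) wraps that chain so the format name and reader options become arguments. It assumes an existing `SparkSession` is passed in as `spark`.

```python
def read_with_format(spark, path, file_format, **options):
    """Read `path` into a DataFrame using the generic reader chain.

    `spark` is an existing SparkSession; `options` are forwarded to
    DataFrameReader.options(), e.g. header=True for CSV.
    """
    reader = spark.read.format(file_format)
    if options:
        reader = reader.options(**options)
    return reader.load(path)
```

With the session created earlier, a call might look like `read_with_format(spark, "path/to/file.csv", "csv", header=True, inferSchema=True)`.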

### Reading JSON files

In [None]:
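from pyspark.sql import SparkSession

# getOrCreate() returns the already-running session if there is one
spark = SparkSession.builder \
    .appName("JSON") \
    .getOrCreate()

# The JSON reader always infers the schema and has no header row, so the
# CSV-only 'inferSchema' and 'header' options are dropped here.
# multiLine=True is needed when one JSON document spans several lines.
df_json = spark.read.format('json') \
    .option('multiLine', True) \
    .load('C:/Git files/My git files/PySpark/files/sales_data.json')

# Print the column names found in the JSON data
for column_name in df_json.columns:
    print(column_name)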
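
Whether `multiLine` is needed depends on how the file is laid out. By default Spark expects JSON Lines (one complete object per line); a pretty-printed object or array that spans several lines needs `multiLine=True`. The sketch below writes the same records in both layouts with the standard `json` module and shows the matching read call. `write_json_lines`, `write_json_array`, and `read_json` are illustrative names, and `spark` is assumed to be an existing session.

```python
import json


def write_json_lines(records, path):
    """Write one compact JSON object per line (the layout Spark reads by default)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def write_json_array(records, path):
    """Write a single pretty-printed JSON array that spans several lines."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)


def read_json(spark, path, multi_line=False):
    """Read a JSON file; pass multi_line=True for the pretty-printed layout."""
    return spark.read.json(path, multiLine=multi_line)
```

A file written by `write_json_lines` can be read with the default settings, while one written by `write_json_array` would be read with `read_json(spark, path, multi_line=True)`.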

### Reading Parquet files

In [None]:
df_parquet = spark.read.parquet("path/to/file.parquet")
df_parquet.show()
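
Parquet files carry their own schema, so no `header` or `inferSchema` options are needed. Because the format is columnar, selecting a subset of columns right after the read generally lets Spark avoid scanning the rest. A minimal sketch, where `read_parquet_columns` and the column names in the usage line are hypothetical:

```python
def read_parquet_columns(spark, path, columns):
    """Read a Parquet file and keep only the requested columns.

    `spark` is an existing SparkSession; `columns` is a list of column names.
    """
    return spark.read.parquet(path).select(*columns)
```

For example, `read_parquet_columns(spark, "path/to/file.parquet", ["name", "age"]).printSchema()` would print the stored types of just those two columns, assuming they exist in the file.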

### Reading Text files

In [None]:
df_text = spark.read.text("path/to/file.txt")
df_text.show()
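
`spark.read.text` produces a DataFrame with a single string column named `value`, one row per line. The sketch below derives a second column from it with a SQL expression. `read_lines_with_length` is a hypothetical helper, and `spark` is assumed to be an existing session.

```python
def read_lines_with_length(spark, path):
    """Read a text file and add the character count of each line."""
    lines = spark.read.text(path)  # single column: `value`
    return lines.selectExpr("value AS line", "length(value) AS line_length")
```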

### Options and Schema
Specifying the schema manually makes reads faster, because Spark can skip the extra pass over the data that `inferSchema` requires.

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("country", StringType(), True)
])

df = spark.read.csv("path/to/file.csv", header=True, schema=schema)
df.show()
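
The same schema can also be written as a DDL-formatted string, which `spark.read.csv` accepts in place of a `StructType`. A short sketch of that variant follows; `PEOPLE_SCHEMA_DDL` and `read_people` are illustrative names, and `spark` is assumed to be an existing session.

```python
# Same three columns as the StructType above, written as a DDL string
PEOPLE_SCHEMA_DDL = "name STRING, age INT, country STRING"


def read_people(spark, path):
    """Read a CSV with a header row using the explicit DDL schema."""
    return spark.read.csv(path, header=True, schema=PEOPLE_SCHEMA_DDL)
```

The DDL form is shorter for flat schemas; the `StructType` form is easier to build programmatically or to nest.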

### Reading multiple files

In [None]:
df_multi = spark.read.csv("path/to/folder/*.csv", header=True)
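
Besides glob patterns, `spark.read.csv` also accepts a list of paths, which helps when the files do not share a folder or a naming pattern. A minimal sketch; `read_csv_files` and the paths in the usage line are hypothetical.

```python
def read_csv_files(spark, paths):
    """Read several CSV files (each with a header row) into one DataFrame."""
    return spark.read.csv(list(paths), header=True)
```

A call such as `read_csv_files(spark, ["path/to/jan.csv", "path/to/feb.csv"])` returns one DataFrame; the files are expected to share the same column layout.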

### Reading from cloud storage

In [None]:
# S3 example (the s3a:// scheme needs the hadoop-aws connector on the classpath and AWS credentials configured)
df_s3 = spark.read.csv("s3a://bucket-name/path/file.csv", header=True)
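
Reading from `s3a://` paths usually requires credentials in the Spark configuration. One way, sketched below, is to set the Hadoop S3A keys through `spark.hadoop.*` options when building the session. This assumes the credentials are available in the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables and that the `hadoop-aws` connector is already on the classpath. `s3a_options` and `apply_options` are hypothetical helpers.

```python
import os


def s3a_options(environ=os.environ):
    """Build Spark config entries for S3A access from AWS environment variables."""
    return {
        "spark.hadoop.fs.s3a.access.key": environ["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": environ["AWS_SECRET_ACCESS_KEY"],
    }


def apply_options(builder, options):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder
```

A session could then be built with `apply_options(SparkSession.builder.appName("S3"), s3a_options()).getOrCreate()`. These settings generally need to be in place before the session is first created.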

| File Type | Function             | Notes                                                        |
| --------- | -------------------- | ------------------------------------------------------------ |
| CSV       | `spark.read.csv`     | `header=True`, `inferSchema=True`                            |
| JSON      | `spark.read.json`    | Nested JSON supported; `multiLine=True` for multi-line files |
| Parquet   | `spark.read.parquet` | Columnar, fast                                               |
| Text      | `spark.read.text`    | Each line → row                                              |