## How to Read Data in PySpark
- The `DataFrameReader` API in PySpark is used to load data from various data sources into a DataFrame.
- It provides multiple methods for reading data, allowing you to specify the file format, schema, options, and more.

### Core Structure

```python
# spark.read returns a DataFrameReader
spark.read.format(...) \
    .option("key", "value") \
    .schema(...) \
    .load(...)
```

**format():**
- The format of the data source you are reading (`CSV`, `JSON`, `Parquet`, `ORC`, a `JDBC` table, etc.).

**option():** (optional)
- Lets you pass additional parameters that customize how the data is read.
- You can chain several `option()` calls, one per parameter. Common options:
  - `header`: whether the first line of the file is a header row (`true` or `false`).
  - `delimiter`: the field separator used in CSV files, e.g. `,` or `\t`.
  - `inferSchema`: whether Spark should infer the column types from the data (`true` or `false`).
  - `mode`: how Spark handles malformed records (`permissive`, `dropMalformed`, or `failfast`).

**schema():** (optional)
- Lets you supply the schema manually instead of having Spark infer it.

**load():**
- The path where the data resides.

### Example:

In [0]:
# reading the csv file using format method
flight_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("mode", "failfast") \
    .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/2010_summary.csv")

# show method to display the dataframe data
flight_df.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [0]:
# to see the schema of DF
flight_df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



### mode option in the DataFrameReader
- The `mode` option in the DataFrameReader API specifies what Spark does when it encounters invalid or corrupt data while loading.
- It controls how issues such as missing values, malformed records, or other parsing errors are handled.

#### Supported Modes in DataFrameReader:
- **permissive (default):**
  - When Spark meets a corrupted record, it keeps the row and sets the fields it cannot parse to `null`.
  - The raw malformed record can be captured in a separate string column, named `_corrupt_record` by default, if that column is included in the schema.
- **dropMalformed:**
  - Drops any row that contains malformed or corrupted data.
  - If a row has an issue (e.g., it does not match the schema), it is excluded from the resulting DataFrame.
- **failfast:**
  - Fails the read with an exception when Spark encounters a malformed or corrupted record.
  - Use this mode when you want the job to fail as soon as bad data is encountered.

## Creating Manual Schema in Spark

**Possible interview questions:**

- How do you create a schema in PySpark?
- What are the other ways to create one?
- What are `StructField` and `StructType` in a schema?
- What if my data has a header row?


**In PySpark, you can manually create a schema for a DataFrame in three main ways:**

##### 1. Using StructType and StructField:
- In PySpark, `StructType` and `StructField` are the building blocks used to define the schema of a DataFrame.
- They live in the `pyspark.sql.types` module and specify each column's name, its data type, and whether it can contain null values.
- This is the most flexible and commonly used method for defining complex schemas.
- **StructField:** represents a single field (or column) within a `StructType`.
- **StructType:** an ordered collection of `StructField` objects that together represent the schema of a DataFrame.

**Example:** let's create a DataFrame for `2010_summary.csv` with a manual schema built from `StructType` and `StructField`.

In [0]:
# first we need to import the StructType and StructField
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [0]:
my_schema = StructType(
    [StructField("DEST_COUNTRY_NAME", StringType(), True),
     StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
     StructField("count", IntegerType(), True)]
)

In [0]:
# Read with inferSchema set to false and no manual schema
flight_df2 = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .option("mode", "failfast") \
    .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/2010_summary.csv")

# With inferSchema set to false, Spark reads every column as a string
flight_df2.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: string (nullable = true)



In [0]:
# Now pass the manual schema
flight_df3 = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .schema(my_schema) \
    .option("mode", "failfast") \
    .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/2010_summary.csv")

flight_df3.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



##### 2. Using a DDL (Data Definition Language) String:
- You can define the schema as a single string of column names and data types, which is handy for simpler schemas.

**Example:**

In [0]:
my_schema_ddl = "DEST_COUNTRY_NAME string, ORIGIN_COUNTRY_NAME string, count int"

In [0]:
# Now pass the manual schema defined as a DDL string
flight_df4 = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .schema(my_schema_ddl) \
    .option("mode", "failfast") \
    .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/2010_summary.csv")

flight_df4.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



##### 3. Using a Python Dictionary (Converted to StructType):
- A dictionary is not a schema by itself, but you can use one to map column names to types and then convert it to a `StructType`.

**Example:**

In [0]:
my_dict = {"DEST_COUNTRY_NAME": StringType(), "ORIGIN_COUNTRY_NAME": StringType(), "count": IntegerType()}
my_schema_dict = StructType([StructField(key, value, True) for key, value in my_dict.items()])

In [0]:
# Now pass the manual schema built from the dictionary
flight_df5 = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .schema(my_schema_dict) \
    .option("mode", "failfast") \
    .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/2010_summary.csv")

flight_df5.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)
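
A variation on the dictionary approach is to keep the type names as plain strings and render a DDL string from them. The sketch below is plain Python; the `column_types` mapping and the `dict_to_ddl` helper are hypothetical names introduced for illustration.

```python
# Hypothetical mapping of column names to Spark SQL type names.
column_types = {
    "DEST_COUNTRY_NAME": "string",
    "ORIGIN_COUNTRY_NAME": "string",
    "count": "int",
}


def dict_to_ddl(columns):
    """Join name/type pairs into a DDL schema string, keeping dict order."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


schema_ddl = dict_to_ddl(column_types)
print(schema_ddl)
```

The resulting string can be passed to `.schema(...)` in the same way as `my_schema_ddl`.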

