## Writing a DataFrame to Disk in PySpark
- Writing a DataFrame to disk in PySpark is a common operation for saving data after performing transformations. 
- PySpark provides the **DataFrame.write** API, which allows saving data in multiple formats (e.g., `CSV`, `Parquet`, `JSON`, etc.) with various options for file organization and data management.

### Writing a DataFrame
The `write` attribute of a DataFrame returns a `DataFrameWriter`, which saves the DataFrame to storage. It supports a variety of file formats and configurations.

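**DataFrame writer API general structure:**

  ```python
  # Schematic only: every call except the final one is optional.
  # bucketBy() is only valid when the chain ends in saveAsTable().
  df.write.format(...) \
          .option(...) \
          .mode(...) \
          .partitionBy(...) \
          .bucketBy(...) \
          .save(...)   # or .saveAsTable(...)
  ```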

**format():**
- Specifies the file format to use when saving the DataFrame.
- If not explicitly specified, the default format is Parquet.
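
As a minimal sketch of the default-format behaviour, the two writes below should produce the same kind of output as long as the session configuration has not changed the default source. The helper name `save_default_and_explicit` and both path arguments are hypothetical; `df` is assumed to be an existing PySpark DataFrame.

```python
def save_default_and_explicit(df, default_path, explicit_path):
    """Write `df` twice: once with the session default format, once as Parquet."""
    # No format() call: Spark falls back to spark.sql.sources.default,
    # which is "parquet" unless the session configuration overrides it.
    df.write.mode("overwrite").save(default_path)

    # Explicit format: the same kind of output under the default configuration.
    df.write.format("parquet").mode("overwrite").save(explicit_path)
```

In a notebook this could be called as `save_default_and_explicit(df, "/tmp/out_default", "/tmp/out_parquet")`, with both paths being placeholders.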

**option(key, value):**
- Allows you to pass additional configurations specific to the file format.
- Common options include:
  | Key          | Value                       | Description                                           |
  |--------------|-----------------------------|-------------------------------------------------------|
  | `header`     | `true` or `false`           | Writes column names as the first row (for CSV).       |
  | `delimiter`  | Any single character (e.g., `,`, `\|`) | Sets a custom field delimiter for CSV files (alias of `sep`). |
  | `compression`| `snappy`, `gzip`, `none`, etc. | Specifies the compression codec to use.             |
  | `inferSchema`| `true` or `false`           | Read-side option (`spark.read`) for CSV or JSON; it has no effect when writing. |
  | `quote`      | Character (e.g., `"`)       | Specifies the character used to wrap quoted fields (CSV). |
  | `escape`     | Character (e.g., `\`)       | Sets the character used to escape quotes inside quoted fields (CSV). |
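
The options in the table can be passed one at a time with `option()` or together with `options()`. The sketch below takes the second route; the dictionary name `CSV_WRITE_OPTIONS`, the helper name, and the chosen values are illustrative assumptions, and `df` is assumed to be an existing DataFrame.

```python
# Writer-side CSV options (illustrative values, not recommendations).
CSV_WRITE_OPTIONS = {
    "header": "true",       # write column names as the first row
    "delimiter": "|",       # field separator
    "quote": '"',           # character wrapped around quoted fields
    "escape": "\\",         # character that escapes quotes inside quoted fields
    "compression": "gzip",  # codec applied to each output file
}


def write_csv_with_options(df, path, options=CSV_WRITE_OPTIONS):
    """Write `df` as CSV, applying all entries of `options` in one call."""
    # options(**kwargs) is equivalent to chaining one option(key, value)
    # call per dictionary entry.
    df.write.format("csv").options(**options).mode("overwrite").save(path)
```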



**repartition() and coalesce():**
- To control how many files are written, you can change the number of partitions using repartition() or coalesce().
- **Using `repartition()`:**
  - Redistributes the data into the specified number of partitions, which can increase or decrease the number of output files.
  - Triggers a full shuffle of the data across the cluster.
  - Example: redistribute the data into 4 partitions (writes up to 4 output files, one per non-empty partition)
    - `df.repartition(4).write.csv("/path/to/output")`
- **Using `coalesce()`:**
  - Reduces the number of partitions by merging existing ones, which avoids a full shuffle.
  - Usually cheaper than `repartition()` when reducing the partition count, but it cannot increase the count and may leave partitions unevenly sized.
  - Example: reduce to 2 partitions (writes up to 2 output files)
    - `df.coalesce(2).write.csv("/path/to/output")`
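
One way to combine the two methods is to pick between them based on the current partition count. This is only a sketch: the helper name `with_partition_count` is hypothetical, and `df` is assumed to be an existing DataFrame.

```python
def with_partition_count(df, target):
    """Return `df` with `target` partitions, avoiding a full shuffle when shrinking."""
    current = df.rdd.getNumPartitions()
    if target < current:
        # coalesce() merges existing partitions; it cannot increase the count.
        return df.coalesce(target)
    if target > current:
        # Growing the partition count requires a shuffle.
        return df.repartition(target)
    # Already at the requested count: nothing to do.
    return df
```

A write could then look like `with_partition_count(df, 4).write.csv("/path/to/output")`. If evenly sized output files matter more than avoiding the shuffle, calling `repartition()` directly may be the better choice.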



**partitionBy(\*columns):**
- Splits the output data into subdirectories based on the values of one or more columns.
- Improves query performance by letting Spark skip (prune) partitions that a filter on the partition columns rules out at read time.
- Example: 
  - `df.write.partitionBy("Age").parquet("/path/to/output")`
  - Example Directory Structure:
    ```bash
    /path/to/output/Age=25/
    /path/to/output/Age=30/
    /path/to/output/Age=35/
    ```
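
The directory naming can be illustrated without Spark. The helper below is a plain-Python sketch of the `column=value` layout shown above, extended to several partition columns; the function name `partition_dir` and the `Country` column are hypothetical. It ignores the escaping Spark applies to some special characters in partition values.

```python
import posixpath


def partition_dir(base_path, partition_values):
    """Build the Hive-style directory for one combination of partition values.

    `partition_values` maps column name -> value, in partitionBy() order.
    Simplification: no escaping of special characters in values.
    """
    segments = [f"{col}={val}" for col, val in partition_values.items()]
    return posixpath.join(base_path, *segments)


print(partition_dir("/path/to/output", {"Age": 25}))
# /path/to/output/Age=25
print(partition_dir("/path/to/output", {"Country": "India", "Age": 30}))
# /path/to/output/Country=India/Age=30
```

With more than one partition column, each additional column adds one directory level, in the order the columns were passed to `partitionBy()`.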

**bucketBy(numBuckets, col, \*cols): (For Spark Tables)**
- Hashes rows on the given column(s) into a fixed number of buckets.
- Must be used with `saveAsTable()`; a bucketed write that ends in `save()` fails.
- Example: `df.write.bucketBy(5, "Age").saveAsTable("bucketed_table")`
- Bucketing is typically used to reduce shuffling in large joins or aggregations on the bucketing column.
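
A slightly fuller bucketed write might look like the sketch below, which also sorts rows within each bucket using `sortBy()`. The helper name `write_bucketed` and its parameters are hypothetical; `df` is assumed to be an existing DataFrame and the session is assumed to have a catalog that can store tables.

```python
def write_bucketed(df, table_name, num_buckets, bucket_col):
    """Save `df` as a table bucketed (and sorted within buckets) by `bucket_col`."""
    (
        df.write
        .format("parquet")
        .bucketBy(num_buckets, bucket_col)  # hash rows on bucket_col into num_buckets buckets
        .sortBy(bucket_col)                 # optional: sort rows inside each bucket
        .mode("overwrite")
        .saveAsTable(table_name)            # bucketing metadata is stored with the table
    )
```

For example, `write_bucketed(df, "bucketed_table", 5, "Age")` corresponds to the one-line example above with sorting and an explicit format added.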

**mode():**
- Controls the behavior when saving data to an existing location.
  | Mode                 | Description                                                          |
  |----------------------|----------------------------------------------------------------------|
  | `overwrite`          | Overwrites any existing data at the output location.                |
  | `append`             | Appends data to the existing data.                                  |
  | `ignore`             | Skips the operation if the target path already exists.              |
  | `error` or `errorifexists` | Fails the operation if the target path already exists (default). |
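
A common pattern that uses two of these modes is to replace the target with the first batch and append the rest. The sketch below assumes `batches` is an iterable of DataFrames with the same schema; the helper name `write_in_batches` is hypothetical.

```python
def write_in_batches(batches, path):
    """Write an iterable of DataFrames to a single Parquet location."""
    for i, batch in enumerate(batches):
        # The first batch replaces whatever is at `path`;
        # later batches add new files next to it.
        mode = "overwrite" if i == 0 else "append"
        batch.write.mode(mode).parquet(path)
```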


**save(path):**
- Specifies the output path where the DataFrame will be saved.
- The format is determined by the `format()` method or defaults to Parquet.
- Use **saveAsTable()** to save as a table instead of writing files to a path.

**saveAsTable(tableName):**
- Saves the DataFrame as a managed or external table in a Hive-compatible metastore.
- The `tableName` can include a database prefix, e.g., `database_name.table_name`.
- Example: `df.write.saveAsTable("my_table")`
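
Whether the table ends up managed or external depends on whether a `path` option is supplied. The sketch below shows both cases; the helper name `save_table` and the `external_path` parameter are hypothetical, and `df` is assumed to be an existing DataFrame.

```python
def save_table(df, table_name, external_path=None):
    """Save `df` as a managed table, or as an external one if a path is given."""
    writer = df.write.format("parquet").mode("overwrite")
    if external_path is not None:
        # With an explicit path the table is external (unmanaged):
        # dropping it removes the metadata but leaves the files in place.
        writer = writer.option("path", external_path)
    # Without a path, Spark stores the data in its warehouse directory
    # and dropping the table deletes the data as well.
    writer.saveAsTable(table_name)
```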

**insertInto(tableName):**
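- Inserts data into an existing table. By default the rows are appended; passing `overwrite=True` replaces the existing data instead.
- The table must already exist. Columns are matched by position, not by name, so the DataFrame's column order must match the table's.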
- Example: `df.write.mode("append").insertInto("existing_table")`
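
Because `insertInto()` resolves columns by position, it can help to reorder the DataFrame to the table's column order first. The sketch below does that; the helper name `insert_in_table_order` is hypothetical, `spark` is assumed to be an active `SparkSession`, and `df` is assumed to contain every column of the target table.

```python
def insert_in_table_order(spark, df, table_name, overwrite=False):
    """Insert `df` into an existing table after aligning its column order."""
    # Read the target table's column order from the catalog.
    target_columns = spark.table(table_name).columns
    # Reorder df so that position-based matching lines up with the names.
    aligned = df.select(*target_columns)
    # overwrite=False appends; overwrite=True replaces the existing data.
    aligned.write.insertInto(table_name, overwrite=overwrite)
```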

**compression (via option() or implicitly):**
- Specifies the compression codec to use for data storage.
- Supported codecs depend on the format:
  - For Parquet: `snappy` (default), `gzip`, `none`, among others
  - For CSV/JSON: `gzip`, `bzip2`, `none`, among others
- Example: `df.write.option("compression", "gzip").csv("/path/to/output")`
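
The format-to-codec pairing above can be checked before a job starts. The sketch below is deliberately conservative: `CODECS_BY_FORMAT` contains only the codecs named in this section, not everything Spark accepts, and both it and the helper name `write_compressed` are hypothetical.

```python
# Only the codecs listed in this section; Spark itself accepts more.
CODECS_BY_FORMAT = {
    "parquet": {"snappy", "gzip", "none"},
    "csv": {"gzip", "bzip2", "none"},
    "json": {"gzip", "bzip2", "none"},
}


def write_compressed(df, path, fmt, codec):
    """Write `df` in `fmt` with `codec`, rejecting pairs not listed above."""
    allowed = CODECS_BY_FORMAT.get(fmt)
    if allowed is None or codec not in allowed:
        raise ValueError(f"codec {codec!r} is not listed for format {fmt!r}")
    df.write.format(fmt).option("compression", codec).mode("overwrite").save(path)
```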


**Full Example with Multiple Configurations**

```python
# Read the CSV file using the format() method
flight_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("mode", "failfast") \
    .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/2010_summary.csv")

# Display the first rows of the DataFrame
flight_df.show(5)
```

```text
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows
```



```python
# Write the DataFrame with several configurations combined
flight_df.write.format("csv") \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("compression", "gzip") \
    .mode("overwrite") \
    .partitionBy("ORIGIN_COUNTRY_NAME") \
    .save("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/")
```

Listing the output directory confirms that the data was written, with one subdirectory per `ORIGIN_COUNTRY_NAME` value:

```text
%fs
ls dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/
```

```text
path,name,size,modificationTime
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Afghanistan/,ORIGIN_COUNTRY_NAME=Afghanistan/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Algeria/,ORIGIN_COUNTRY_NAME=Algeria/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Angola/,ORIGIN_COUNTRY_NAME=Angola/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Anguilla/,ORIGIN_COUNTRY_NAME=Anguilla/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Antigua and Barbuda/,ORIGIN_COUNTRY_NAME=Antigua and Barbuda/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Argentina/,ORIGIN_COUNTRY_NAME=Argentina/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Aruba/,ORIGIN_COUNTRY_NAME=Aruba/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Australia/,ORIGIN_COUNTRY_NAME=Australia/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Austria/,ORIGIN_COUNTRY_NAME=Austria/,0,0
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=Azerbaijan/,ORIGIN_COUNTRY_NAME=Azerbaijan/,0,0
```


```text
%fs
ls dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=India/
```

```text
path,name,size,modificationTime
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=India/_SUCCESS,_SUCCESS,0,1733468417000
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=India/_committed_8171927095212985685,_committed_8171927095212985685,115,1733468414000
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=India/_started_8171927095212985685,_started_8171927095212985685,0,1733468406000
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=India/part-00000-tid-8171927095212985685-7004e94a-5416-4c0f-9c18-f427923198c3-9-64.c000.csv.gz,part-00000-tid-8171927095212985685-7004e94a-5416-4c0f-9c18-f427923198c3-9-64.c000.csv.gz,61,1733468406000
```


```sql
%sql
select * from csv.`dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/`
```

```text
_c0,ORIGIN_COUNTRY_NAME
DEST_COUNTRY_NAME|count,United States
Egypt|24,United States
Equatorial Guinea|1,United States
Costa Rica|477,United States
Senegal|29,United States
Guyana|17,United States
Malta|1,United States
Bolivia|46,United States
Anguilla|21,United States
Turks and Caicos Islands|136,United States
```


```sql
%sql
select * from csv.`dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/out/ORIGIN_COUNTRY_NAME=India/part-00000-tid-8171927095212985685-7004e94a-5416-4c0f-9c18-f427923198c3-9-64.c000.csv.gz`
```

```text
_c0
DEST_COUNTRY_NAME|count
United States|69
```
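
Both SQL queries above read the files with default CSV options, which is why the header line shows up as a data row and each line lands in a single `_c0` column. Reading with the same `header` and `delimiter` settings that were used for the write should split the columns properly. The helper name `read_pipe_delimited` below is hypothetical, and `spark` is assumed to be an active `SparkSession`.

```python
def read_pipe_delimited(spark, path):
    """Read CSV output that was written with a header row and a '|' delimiter."""
    return (
        spark.read.format("csv")
        .option("header", "true")    # treat the first line of each file as column names
        .option("delimiter", "|")    # must match the delimiter used on write
        .load(path)
    )
```

Pointing this at the top-level output directory rather than a single file also lets Spark recover `ORIGIN_COUNTRY_NAME` from the partition directory names.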
