## SparkSession Methods
1.  **`sql()`**:
    *   This method lets you execute SQL queries against your Spark data.
    *   It returns a DataFrame containing the results.
    *   The tables or views you query must exist first. For example, read data into a DataFrame with `spark.read`, then call `df.createOrReplaceTempView("my_table")`.
    *   This is useful when a transformation is easier to express in SQL than with DataFrame methods.

2.  **`table()`**:
    *   This method returns a DataFrame that represents an existing table or view.
    *   The table must already exist, for example created with a `CREATE TABLE` statement, written by a Spark job, or registered as a temporary view.
    *   This is a quick way to load tables for further processing.

3.  **`read`**:
    *   This is a property, not a method, so it is accessed without parentheses (`spark.read`). It returns a `DataFrameReader` object, which is used to read different types of data.
    *   You use methods on the `DataFrameReader` (like `.csv()`, `.parquet()`, `.json()`, etc.) to specify the data source and any options (header, schema).
    *   This is the entry point for loading data from external files or databases.

4.  **`range()`**:
    *   This method creates a DataFrame with a single `id` column containing numbers from a range with a given step.
    *   It is useful for creating a test dataset or for generating a sequence of numeric ids.
    *   The arguments are `start`, `end` (exclusive), `step` (optional, defaults to 1), and `numPartitions` (optional, defaults to Spark's default parallelism, not to 1). When called with a single argument, that argument is used as `end` and `start` is 0.

**Important Notes:**

*   **SparkSession:**  You must create a `SparkSession` to interact with Spark. This is created for you in Databricks.
*   **Paths:**  Make sure to adjust file paths to where your data is located.
*   **Schema Inference:**  `inferSchema=True` in `read.csv()` makes Spark infer the data types of columns. Use it carefully: it requires an extra pass over the data, which can be slow on large files, so providing an explicit schema is usually better.
*   **Show:** The `.show()` method prints the first rows of a DataFrame (20 by default) in tabular form.



In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import os

# 1. Create a DataFrame with 100 rows
# ----------------------------------
num_rows = 100

# Create a schema with two columns
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])
# Create the data as a list
data = [(i, f"row_{i}") for i in range(num_rows)]

# Create the DataFrame
df = spark.createDataFrame(data, schema)

# Register the DataFrame as a temporary view
# ----------------------------------------------------
df.createOrReplaceTempView("my_table")

# 2. Define output paths
# -----------------------
output_path_csv = "/tmp/my_data.csv"
output_path_json = "/tmp/my_data.json"
output_path_parquet = "/tmp/my_data.parquet"

# 3. Save the DataFrame to CSV, JSON, and Parquet
# ---------------------------------------------
df.write.csv(output_path_csv, header=True, mode="overwrite")
df.write.json(output_path_json, mode="overwrite")
df.write.parquet(output_path_parquet, mode="overwrite")

# 4. Load the data back into DataFrames
# ------------------------------------
df_csv_loaded = spark.read.csv(output_path_csv, header=True, inferSchema=True)
df_json_loaded = spark.read.json(output_path_json)
df_parquet_loaded = spark.read.parquet(output_path_parquet)

# Example: Run a SQL query and get the results as a DataFrame
query_result_df = spark.sql("SELECT * FROM my_table WHERE id > 97")
query_result_df.show() # Show sample rows from the DataFrame

# Example: Load data from a table
table_df = spark.table("my_table")  # Works for tables in the current database and for temp views like the one registered above
table_df.show()

# Example: Create a DataFrame with numbers from 0 to 9
range_df_1 = spark.range(0, 10)  # step defaults to 1; the partition count defaults to Spark's default parallelism
range_df_1.show()

# Example: Create a DataFrame with the even numbers from 10 to 98 (end=100 is exclusive), in 2 partitions.
range_df_2 = spark.range(10, 100, 2, 2) # start, end, step, number of partitions
range_df_2.show()

# Example: Create a DataFrame with a range and custom column name
range_df_3 = spark.range(10, 20).withColumnRenamed("id","my_custom_range_column")
range_df_3.show()

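# Clean up the output paths (if needed)
# -------------------------------------
# Note: this removes paths on the driver's local filesystem. On Databricks, a path like "/tmp/..."
# passed to Spark usually resolves to DBFS, so use dbutils.fs.rm(path, True) there instead.
import shutil

def delete_files(path):
    """Delete the file or directory tree at `path`, if it exists."""
    if os.path.isdir(path):
        shutil.rmtree(path)  # Spark writes each output as a directory of part files
    elif os.path.exists(path):
        os.remove(path)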

delete_files(output_path_csv)
delete_files(output_path_json)
delete_files(output_path_parquet)

## Spark Function and Method Summary

| Category                 | Method/Function               | Description                                                                                                                                                            |
|--------------------------|-------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **SparkSession Methods** | `sql`                          | Returns a DataFrame representing the result of the given query.                                                                                                       |
|                          | `table`                        | Returns the specified table as a DataFrame.                                                                                                                            |
|                          | `read`                         | Returns a DataFrameReader that can be used to read data in as a DataFrame.                                                                                           |
|                          | `range`                        | Creates a DataFrame with a column containing elements in a range from start to end (exclusive) with a step value and number of partitions.                              |
| **DataFrame Transformation Methods**| `select`                    | Returns a new DataFrame by computing given expressions for each element.                                                                                         |
|                          | `drop`                        | Returns a new DataFrame with a column dropped.                                                                                                                         |
|                          | `withColumnRenamed`           | Returns a new DataFrame with a column renamed.                                                                                                                         |
|                          | `withColumn`                  | Returns a new DataFrame by adding a column or replacing the existing column that has the same name.                                                                 |
|                          | `filter`, `where`             | Filters rows using the given condition.                                                                                                                                  |
|                          | `sort`, `orderBy`             | Returns a new DataFrame sorted by the given expressions.                                                                                                                   |
|                          | `dropDuplicates`, `distinct` | Returns a new DataFrame with duplicate rows removed.                                                                                                                |
|                          | `limit`                       | Returns a new DataFrame by taking the first n rows.                                                                                                                       |
|                          | `groupBy`                     | Groups the DataFrame using the specified columns, so aggregations can be run on them.                                                                                     |
| **DataFrame Action Methods**| `show`                        | Displays the top n rows of a DataFrame in a tabular form.                                                                                                               |
|                          | `count`                       | Returns the number of rows in the DataFrame.                                                                                                                             |
|                          | `describe`, `summary`          | Computes basic statistics for numeric and string columns.                                                                                                                |
|                          | `first`                       | Returns the first row.                                                                                                                                                  |
|                          | `head`                        | Returns the first n rows.                                                                                                                                                 |
|                          | `collect`                     | Returns an array that contains all rows in this DataFrame.                                                                                                               |
|                          | `take`                        | Returns an array of the first n rows in the DataFrame.                                                                                                                    |
| **DataFrameNaFunctions** | `drop`                        | Returns a new DataFrame omitting rows with any, all, or a specified number of null values, considering an optional subset of columns.                                    |
|                          | `fill`                        | Replaces null values with the specified value for an optional subset of columns.                                                                                         |
|                          | `replace`                     | Returns a new DataFrame replacing a value with another value, considering an optional subset of columns.                                                                |
| **Built-in Functions**  |  **Math functions** | |
|                          | `ceil`                         | Computes the ceiling of the given column.                                                                                                                                  |
|                          | `log`                          | Computes the natural logarithm of the given value.                                                                                                                        |
|                          | `round`                        | Rounds the column to the given number of decimal places (0 by default) using HALF_UP rounding mode.                                                                       |
|                          | `sqrt`                         | Computes the square root of the specified float value.                                                                                                                     |
|                          | **Collection functions**       |                                                                                                                                                                      |
|                          | `array_contains`               | Returns null if the array is null, true if the array contains value, and false otherwise.                                                                             |
|                          | `explode`                      | Creates a new row for each element in the given array or map column.                                                                                                      |
|                          | `slice`                        | Returns an array containing the elements of the array column starting at index start (1-based, or counted from the end if start is negative) with the specified length. |
|                         | **Date time functions**  | |
|                          | `date_format`                  | Converts a date/timestamp/string to a value of a string in the format specified by the date format given by the second argument.                                           |
|                          | `add_months`                   | Returns the date that is numMonths after startDate.                                                                                                                     |
|                          | `dayofweek`                    | Extracts the day of the week as an integer (1 = Sunday, ..., 7 = Saturday) from a given date/timestamp/string.                                                            |
|                          | `from_unixtime`                | Converts the number of seconds from the unix epoch to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.|
|                          | `minute`                       | Extracts the minutes as an integer from a given date/timestamp/string.                                                                                                   |
|                          | `unix_timestamp`              | Converts a time string with a given pattern to a Unix timestamp (in seconds).                                                                                            |
|                         | **String functions**  | |
|                          | `translate`                  | Replaces each character in the source column that appears in the matching string with the character at the same position in the replacement string. |
|                          | `regexp_replace`             | Replaces all substrings of the specified string value that match the regular expression with the replacement.                                        |
|                          | `regexp_extract`             | Extracts a specific group matched by a Java regex from the specified string column.                                                                   |
|                          | `ltrim`                      | Trims whitespace from the left end of the specified string column.                                                                                    |
|                          | `lower`                     | Converts a string column to lowercase.                                                                                    |
|                          | `split`                     | Splits the string around matches of the given pattern.                                                                             |
| **Row Methods (Python)**    | `index`                       | Returns the first index of value.                                                                                                                                        |
|                          | `count`                       | Returns the number of occurrences of value.                                                                                                                                |
|                          | `asDict`                      | Returns a row as a dictionary.                                                                                                                                           |
|                          | `row.key`                       | Access fields like attributes.                                                                                                                                            |
|                          | `row["key"]`                    | Access fields like dictionary values.                                                                                                                                   |
|                          | `key in row`                   | Search through row keys.                                                                                                                                                |
| **Grouped Data Object** |  `agg`    | Computes aggregates by specifying a series of aggregate columns.|
|                         |  `avg`    | Computes the mean value of each numeric column for each group.|
|                         |  `count`  | Counts the number of rows in each group.|
|                         |  `max`    | Computes the maximum value of each numeric column for each group.|
|                         |  `mean`   | Computes the mean value of each numeric column for each group (alias for `avg`).|
|                         |  `min`    | Computes the minimum value of each numeric column for each group.|
|                         |  `pivot`  | Pivots a column of the current DataFrame and performs the specified aggregation.|
|                         |  `sum`    | Computes the sum of each numeric column for each group.|


In [0]:
from pyspark.sql import functions as F  # Importing common Spark functions with alias F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, DateType  # Importing data types for schema definition
from datetime import datetime # Importing datetime for date generation

# 1. Create DataFrame with 1000 rows
# ----------------------------------
num_rows = 1000  # Defining the number of rows for the DataFrame

# Defining the schema of the DataFrame using StructType and StructField
schema = StructType([
    StructField("id", IntegerType(), True),  # Integer column 'id'
    StructField("category", StringType(), True),  # String column 'category'
    StructField("value", IntegerType(), True),  # Integer column 'value'
    StructField("items", ArrayType(StringType()), True),  # Array of Strings column 'items'
    StructField("date", DateType(), True)  # Date column 'date'
])

# Creating sample data for the DataFrame
data = [
    (i, "A" if i % 3 == 0 else "B" if i % 2 == 0 else "C", i * 2, [f"item_{i % 5}", f"item_{i % 7}"], datetime.now().date())
    for i in range(num_rows)
]
# Creating the DataFrame using the data and schema
df = spark.createDataFrame(data, schema)


# 2. SparkSession Methods
# -----------------------
# The SparkSession methods are used to create DataFrames, or to run SQL Queries on the DataFrames.

# sql: creates a temporary view so that we can use the sql method to query the data.
df.createOrReplaceTempView("my_table")
# Using spark.sql to run a SQL query against the temporary view.
sql_df = spark.sql("SELECT * FROM my_table WHERE value > 500")
print("\nDataFrame after spark.sql():")
sql_df.show(5)  # Showing the first 5 rows of the resulting DataFrame


# 3. DataFrame Transformation Methods
# -------------------------------------
# These methods return a new, transformed DataFrame and leave the source DataFrame unchanged.
# They are lazy: nothing is computed until an action such as show() is called.

# --- select() ---
# selects columns from the DataFrame, and can create new ones based on expressions.
df_selected = df.select("id", "category", (F.col("value") * 2).alias("double_value"),
                        F.col("items").alias("my_items"))
print("\nDataFrame after select():")
df_selected.show(5)

# --- drop() ---
# Drops the specified column from the DataFrame.
df_dropped = df.drop("items")
print("\nDataFrame after drop():")
df_dropped.show(5)

# --- withColumnRenamed() ---
# Renames the specified column with a new name
df_renamed = df.withColumnRenamed("category", "group")
print("\nDataFrame after withColumnRenamed():")
df_renamed.show(5)

# --- withColumn() ---
# adds a new column to the DataFrame, or replaces an existing column with a new value.
df_with_column = df.withColumn("value_plus_10", F.col("value") + 10)
print("\nDataFrame after withColumn (adding a column):")
df_with_column.show(5)

df_with_column_2 = df.withColumn("value", F.lit(100))  # Replacing the value column with literal 100
print("\nDataFrame after withColumn (replacing a column):")
df_with_column_2.show(5)

# --- filter(), where() ---
# Filters rows of the DataFrame based on the given expression (the where() is the alias for filter())
df_filtered = df.filter(F.col("value") > 100)
print("\nDataFrame after filter()/where():")
df_filtered.show(5)

# --- sort(), orderBy() ---
# Sorts rows of the DataFrame based on the given column (orderBy() is the alias for sort())
df_sorted = df.sort(F.col("value"), ascending=False)
print("\nDataFrame after sort()/orderBy():")
df_sorted.show(5)


# --- dropDuplicates(), distinct() ---
# Removes duplicate rows (distinct() is equivalent to dropDuplicates() with no arguments; dropDuplicates() can also take a subset of columns)
df_with_dupes = df.union(df.limit(5))  # Creating a DataFrame with duplicate rows
df_no_dupes = df_with_dupes.dropDuplicates()
print("\nDataFrame after dropDuplicates()/distinct():")
df_no_dupes.show(5)


# --- limit() ---
# limits the DataFrame to the given number of rows
df_limited = df.limit(10)
print("\nDataFrame after limit():")
df_limited.show(5)


# --- groupBy() ---
# Groups the DataFrame by the specified columns and can run aggregations.
df_grouped = df.groupBy("category").count()
print("\nDataFrame after groupBy():")
df_grouped.show(5)


# 4. DataFrame Action Methods
# ----------------------------
# These methods trigger the computation and return results (non-lazy operations)

# --- show() ---
# Displays a given number of rows of the DataFrame in a tabular format
print("\nDataFrame show():")
df.show(3)  # Showing the first 3 rows of the DataFrame


# --- count() ---
# Returns the number of rows in the DataFrame
print(f"\nDataFrame count(): {df.count()}")

# --- describe(), summary() ---
# Computes basic descriptive statistics for each column of the DataFrame
print("\nDataFrame describe():")
df.describe().show() # count, mean, stddev, min and max for numeric and string columns.
print("\nDataFrame summary():")
df.summary().show() # the same statistics plus approximate 25%, 50% and 75% percentiles.


# --- first() ---
# Returns the first row of the DataFrame
print(f"\nDataFrame first(): {df.first()}")

# --- head() ---
# Returns the first 'n' rows of the DataFrame
print("\nDataFrame head():")
print(df.head(3)) # returns as python list of Row objects

# --- collect() ---
# Collects all the rows of the DataFrame into a python list
print(f"\nDataFrame collect(): (first 3 rows) {df.collect()[:3]}")

# --- take() ---
# returns the first n rows of the DataFrame as a python list.
print(f"\nDataFrame take(): (first 3 rows) {df.take(3)}")


# 5. DataFrameNaFunctions
# -----------------------
# These methods handle null values in the DataFrame.

# Creating a DataFrame with null values for demonstration purposes
df_with_nulls = df.withColumn("value", F.when(F.col("id") % 10 == 0, F.lit(None)).otherwise(F.col("value")))

# --- drop() ---
# Drops rows that have null values, can be configured for specific columns.
df_na_dropped = df_with_nulls.na.drop()
print("\nDataFrame after na.drop():")
df_na_dropped.show(5)

# --- fill() ---
# Replaces all the null values of the given column with the value provided, can be configured for specific columns.
df_na_filled = df_with_nulls.na.fill({"value": -1}) #replaces null values in value column with -1
print("\nDataFrame after na.fill():")
df_na_filled.show(5)

# --- replace() ---
# Replaces a value with another specified value in the given column (can use other types as well).
df_na_replaced = df_with_nulls.na.replace(100, 1000, subset=["value"])
print("\nDataFrame after na.replace():")
df_na_replaced.show(5)


# 6. Built-in Functions
# -----------------------
# These are functions that can be used in expressions for column manipulation.

# Math Functions:
# Methods that perform math related functions.
df_math_functions = df.withColumn("ceil_value", F.ceil("value")) \
    .withColumn("log_value", F.log("value")) \
    .withColumn("round_value", F.round("value", 0)) \
    .withColumn("sqrt_value", F.sqrt("value"))
print("\nDataFrame with Math Functions:")
df_math_functions.select("value", "ceil_value", "log_value", "round_value", "sqrt_value").show(5)

# Collection Functions:
# These are functions that can be used for working with arrays or maps.
df_collection_functions = df.withColumn("array_contains", F.array_contains("items", "item_1")) \
                            .withColumn("exploded_items", F.explode("items"))
print("\nDataFrame with Collection Functions:")
df_collection_functions.select("items", "array_contains", "exploded_items").show(5)


df_collection_functions_2 = df.withColumn("sliced_items", F.slice("items", 1, 1))
print("\nDataFrame with Collection Functions and slice")
df_collection_functions_2.select("items", "sliced_items").show(5)


# Date time Functions:
# These functions manipulate date and timestamp columns. The 'date' column has no time part, so minute() returns 0 here.
df_date_functions = df.withColumn("formatted_date", F.date_format("date", "yyyy-MM-dd")) \
    .withColumn("add_months_date", F.add_months("date", 1)) \
    .withColumn("day_of_week", F.dayofweek("date")) \
    .withColumn("minute", F.minute("date")) \
    .withColumn("unix_timestamp", F.unix_timestamp("date")) \
    .withColumn("from_unixtime", F.from_unixtime(F.unix_timestamp("date")))
print("\nDataFrame with Date Time Functions:")
df_date_functions.select("date", "formatted_date", "add_months_date", "day_of_week", "minute", "unix_timestamp", "from_unixtime").show(5)


# String functions
# These functions manipulate String columns.
df_string_functions = df.withColumn("translated_cat", F.translate("category", "A", "Z")) \
    .withColumn("replaced_category", F.regexp_replace("category", "A", "AAA")) \
    .withColumn("extracted_category", F.regexp_extract("category", "(A|B)", 1)) \
    .withColumn("trimmed_category", F.ltrim(F.concat(F.lit("   "), F.col("category"), F.lit("  ")))) \
    .withColumn("lowered_category", F.lower("category")) \
    .withColumn("split_category", F.split("category", ""))
print("\nDataFrame with String Functions:")
df_string_functions.select("category", "translated_cat", "replaced_category", "extracted_category",
                           "trimmed_category", "lowered_category", "split_category").show(5)


# 7. Row Methods (Example with first() row object)
# -----------------------------------------------
# methods that can be called when using Row objects.
first_row = df.first()  # Getting the first row of the DataFrame as Row object
print(f"\nFirst row as is: {first_row}")

print(f"\nFirst row index of value 0: {first_row.index(0)}")  # position of the first field equal to 0; raises ValueError if absent
print(f"\nFirst row count of value 'A': {first_row.count('A')}")
print(f"\nFirst row as a dictionary: {first_row.asDict()}")
print(f"\nFirst row attribute access: {first_row.id}")
print(f"\nFirst row dictionary access: {first_row['category']}")
print(f"\n'id' in first row: {'id' in first_row}")


# 8. Grouped Data Object Methods
# ------------------------------
# These methods are called on the GroupedData object returned by calling groupBy() on the DataFrame.
grouped_df = df.groupBy("category")

# Using agg method to call multiple aggregations.
df_aggregated = grouped_df.agg(
    F.avg("value").alias("avg_value"),
    F.count("*").alias("count"),
    F.max("value").alias("max_value"),
    F.min("value").alias("min_value"),
    F.sum("value").alias("sum_value")
)
print("\nDataFrame after groupBy().agg():")
df_aggregated.show(5)


# Using avg, count, max, min and sum individually to show their usage.
df_aggregated_2 = grouped_df.avg("value").withColumnRenamed("avg(value)", "avg_value")
df_aggregated_2 = df_aggregated_2.join(grouped_df.count().withColumnRenamed("count", "count_rows"), "category")
df_aggregated_2 = df_aggregated_2.join(grouped_df.max("value").withColumnRenamed("max(value)", "max_value"), "category")
df_aggregated_2 = df_aggregated_2.join(grouped_df.min("value").withColumnRenamed("min(value)", "min_value"), "category")
df_aggregated_2 = df_aggregated_2.join(grouped_df.sum("value").withColumnRenamed("sum(value)", "sum_value"), "category")
print("\nDataFrame after groupBy().avg,count, max, min, sum():")
df_aggregated_2.show(5)


#  pivot
# pivots a table by selecting the values from the category column as new columns.
df_pivoted = df.groupBy("id").pivot("category").sum("value")
print("\nDataFrame after groupBy().pivot():")
df_pivoted.show(5)


# 9. Column Operators & Methods
# -------------------------------
# methods and operators to manipulate columns.

df_column_ops = df.withColumn("combined_value", F.col("value") + 10) \
    .withColumn("is_even", (F.col("value") % 2 == 0)) \
    .withColumn("double_value_cast", F.col("value").cast("double")) \
    .withColumn("is_value_null", F.col("value").isNull())
print("\nDataFrame with Column Operations:")
df_column_ops.select("value", "combined_value", "is_even", "double_value_cast", "is_value_null").show(5)
