# What are DataFrames in PySpark:

  - DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database.
  - They are part of the higher-level API provided by PySpark's `pyspark.sql` module.
  - DataFrames are immutable and lazily evaluated: transformations only build up a query plan, and nothing is executed until an action is called.
  - DataFrames support a wide range of data sources, including CSV, Parquet, JSON, ORC, Avro, Hive, JDBC, and more.

![DataFrames.png](attachment:ec499e0d-d4bc-4e3a-81a1-2a405c4b5de7.png)
### **Creating DataFrames:**
  - DataFrames can be created from various data sources, such as RDDs, Python lists, Pandas DataFrames, and external files (CSV, JSON, etc.).
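  - `spark.createDataFrame()` builds a DataFrame from in-memory data (RDDs, Python lists, Pandas DataFrames), while files and other external sources are loaded through `spark.read` (for example `spark.read.csv()` or `spark.read.json()`).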

### **DataFrame Operations:**
  - DataFrame operations can be broadly categorized into transformations and actions.
  - Transformations (e.g., `select`, `filter`, `groupBy`) create a new DataFrame from an existing one without executing immediately.
  - Actions (e.g., `show`, `count`, `collect`) trigger the execution and return results or display data.

### **Schema:**
  - DataFrames have a defined schema, which specifies the names and data types of the columns.
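  - Spark can infer the schema when reading files (self-describing formats such as Parquet carry it, while CSV needs `inferSchema=True` and otherwise reads every column as a string), but you can also define it explicitly.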

### **Data Manipulation:**
  - DataFrames provide methods for data manipulation such as `select`, `filter`, `groupBy`, `orderBy`, `withColumn`, and `drop`.
  - Operations can be chained together to create complex data processing pipelines.

### **Aggregation and Grouping:**
  - DataFrames support aggregation functions such as `avg`, `sum`, `min`, `max`, and `count`, which are typically applied to the groups produced by `groupBy`.

### **SQL-Like Queries:**
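  - DataFrames can be queried with SQL through Spark SQL: register a DataFrame as a temporary view and pass the query to `spark.sql()`, which returns another DataFrame. This lets users familiar with SQL perform data analysis without learning the DataFrame API first.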

### **Broadcast and Join Optimization:**
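  - Spark's Catalyst optimizer plans joins automatically and broadcasts a DataFrame to every executor when its estimated size is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), which avoids shuffling the larger side of the join.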

### **Built-in Functions:**
  - PySpark provides a rich set of built-in functions (`pyspark.sql.functions`) for data manipulation, aggregation, string operations, date/time handling, and more.

### **Integration with MLlib and GraphX:**
  - DataFrames are the input format for Spark's DataFrame-based machine learning API (`pyspark.ml`, part of MLlib). GraphX has no Python API; graph processing on DataFrames from PySpark is usually done with the separate GraphFrames package.

### **Performance Optimization:**
  - PySpark provides various options to optimize DataFrame performance, like tuning memory configurations, adjusting parallelism, and using data partitioning.

### **Data Sources and Formats:**
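  - DataFrames read data through `spark.read` and write it through `df.write`, against storage such as HDFS, cloud object stores, local files, and JDBC databases, in formats including Parquet, ORC, JSON, CSV, and Avro (Avro requires the external `spark-avro` module).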

### **Dynamic Partition Pruning:**
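  - With dynamic partition pruning (Spark 3.0 and later), Spark can skip partitions of a partitioned table at run time, based on the filtered values from the other side of a join. This reduces the amount of data scanned, for example in star-schema queries that join a large fact table to a filtered dimension table.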

DataFrames in PySpark offer a higher-level API for working with structured data, allowing users to focus on data manipulation and analysis without dealing with low-level distributed computing complexities.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [2]:
spark

In [3]:
# header defaults to False, so columns are named _c0, _c1, ... and the header row is read as data
empDF = spark.read.csv("data/emp.csv")

In [4]:
print(type(empDF))

<class 'pyspark.sql.dataframe.DataFrame'>


In [8]:
empDF.show()

+------+----------+-----------------+----------+--------------+-------------+-------------+
|   _c0|       _c1|              _c2|       _c3|           _c4|          _c5|          _c6|
+------+----------+-----------------+----------+--------------+-------------+-------------+
|emp_id|  emp_name|  emp_designation|emp_salary|emp_department|emp_join_date| emp_location|
|   101|John Smith|Software Engineer|     75000|            IT|   15-01-2022|     New York|
|   102|  Jane Doe|     Data Analyst|     60000|     Analytics|   20-08-2021|San Francisco|
|   103|Mike Brown|  Product Manager|     90000|       Product|   10-05-2023|       London|
|   104|Lisa Green|       HR Manager|     85000|            HR|   05-11-2020|         NULL|
+------+----------+-----------------+----------+--------------+-------------+-------------+

