# Introduction to PySpark DataFrames
This notebook contains examples of PySpark DataFrame operations. PySpark is the Python API for Apache Spark, and the DataFrame is its main abstraction for working with structured data.

In this notebook, we will explore various DataFrame operations such as creating DataFrames, reading from external sources, filtering, aggregating, and more.


In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext("local")
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame

**createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)**

`createDataFrame()` is a `SparkSession` method that builds a DataFrame from an RDD, a Python list, or a pandas DataFrame, with a schema that is either supplied or inferred from the data.

**Parameters:**
- `data`: An RDD, a list, or a pandas DataFrame containing the data to create the DataFrame.
- `schema`: Optional. The schema of the DataFrame. It can be a `pyspark.sql.types.DataType`, a datatype string, or a list of column names. If `schema` is None, the schema will be inferred from the data. The schema determines the column names and types of the DataFrame.
- `samplingRatio`: Optional. The sample ratio of rows used for inferring the schema. It is only used when the schema needs to be inferred.
- `verifySchema`: Optional. Specifies whether to verify the data types of every row against the schema. If set to `True` (default), the data types will be verified.

**Returns:**
A DataFrame created from the provided data and schema.

**Note:**
- If a schema is provided, it must match the actual data. If the schema is a `DataType` or datatype string that is not a `pyspark.sql.types.StructType`, it is wrapped into a `StructType` with a single field named "value", and each record is wrapped into a tuple that can be converted to a row later.
- When the schema has to be inferred, `samplingRatio` sets the ratio of rows used for inference. If it is not set, the first row is used.
- With `verifySchema=True` (the default), the data types of every row are checked against the schema, and a value of the wrong type raises an error.

In [9]:
# From list
a = [('Chris', 'Berliner', 5), ('Peter', 'Bud Light', 9), ('John', 'Corona Extra', 6)]
df = spark.createDataFrame(a, ['drinker', 'beer', 'score'])  
df.show()

+-------+------------+-----+
|drinker|        beer|score|
+-------+------------+-----+
|  Chris|    Berliner|    5|
|  Peter|   Bud Light|    9|
|   John|Corona Extra|    6|
+-------+------------+-----+



In [10]:
df

DataFrame[drinker: string, beer: string, score: bigint]

In [11]:
# Create a DataFrame from RDD
rdd = sc.parallelize(a)
df = spark.createDataFrame(rdd)
df.show()

+-----+------------+---+
|   _1|          _2| _3|
+-----+------------+---+
|Chris|    Berliner|  5|
|Peter|   Bud Light|  9|
| John|Corona Extra|  6|
+-----+------------+---+



In [12]:
# From RDD and column names from list
df = spark.createDataFrame(rdd, ['drinker', 'beer', 'score'])
df.show()

+-------+------------+-----+
|drinker|        beer|score|
+-------+------------+-----+
|  Chris|    Berliner|    5|
|  Peter|   Bud Light|    9|
|   John|Corona Extra|    6|
+-------+------------+-----+



In [5]:
# From RDD and add schema

from pyspark.sql.types import StructType, StructField, StringType, ByteType
schema = StructType([
    StructField("drinker", StringType(), True),
    StructField("beer", StringType(), True),
    StructField("score", ByteType(), True)])
df3 = spark.createDataFrame(rdd, schema)
df3.show()

+-------+------------+-----+
|drinker|        beer|score|
+-------+------------+-----+
|  Chris|    Berliner|    5|
|  Peter|   Bud Light|    9|
|   John|Corona Extra|    6|
+-------+------------+-----+



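If the data does not match the schema, Spark raises an error. With an RDD the rows are verified lazily, so in the example below the error only surfaces when `show()` triggers a job, not when `createDataFrame()` is called.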

In [14]:
from pyspark.sql.types import StructType, StructField, StringType, ByteType
a = [('Chris', 'Berliner', 5), ('Peter', 'Bud Light', 9), ('John', 'Corona Extra', "t")]
rdd = sc.parallelize(a)
print(rdd.collect())   

schema = StructType([
    StructField("drinker", StringType(), True),
    StructField("beer", StringType(), True),
    StructField("score", ByteType(), True)])
df3 = spark.createDataFrame(rdd, schema)
# df3 = spark.createDataFrame(rdd)  # This will not throw an error
df3.show()

[('Chris', 'Berliner', 5), ('Peter', 'Bud Light', 9), ('John', 'Corona Extra', 't')]


Py4JJavaError: An error occurred while calling o393.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 1 times, most recent failure: Lost task 0.0 in stage 20.0 (TID 20) (DT-Inspiron.lan executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 619, in main
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 611, in process
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\util.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\session.py", line 683, in prepare
    return obj
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1410, in verify
    if not verify_nullability(obj):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1391, in verify_struct
    for v, (_, verifier) in zip(obj, verifiers):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1410, in verify
    if not verify_nullability(obj):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1314, in verify_byte
    if obj < -128 or obj > 127:
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1292, in verify_acceptable_types
    if type(obj) not in _acceptable_types[_type]:
TypeError: field score: ByteType can not accept object 't' in type <class 'str'>

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:556)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:762)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:744)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:509)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 619, in main
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 611, in process
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\util.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\session.py", line 683, in prepare
    return obj
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1410, in verify
    if not verify_nullability(obj):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1391, in verify_struct
    for v, (_, verifier) in zip(obj, verifiers):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1410, in verify
    if not verify_nullability(obj):
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1314, in verify_byte
    if obj < -128 or obj > 127:
  File "C:\Spark\spark-3.2.2-bin-hadoop3.2\python\pyspark\sql\types.py", line 1292, in verify_acceptable_types
    if type(obj) not in _acceptable_types[_type]:
TypeError: field score: ByteType can not accept object 't' in type <class 'str'>

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:556)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:762)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:744)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:509)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


# Load a DataFrame from external storage
The `DataFrameReader` class is an interface used to load a DataFrame from external storage systems such as file systems, key-value stores, and more. It is accessed through the `read` attribute of the `SparkSession` object.

Here are some of the methods available in the `DataFrameReader` class:

- `csv(path[, schema, sep, encoding, quote, ...])`: Loads a CSV file and returns the result as a DataFrame.
- `format(source)`: Specifies the input data source format.
- `jdbc(url, table[, column, lowerBound, ...])`: Constructs a DataFrame representing a database table accessible via a JDBC URL and connection properties.
- `json(path[, schema, primitivesAsString, ...])`: Loads JSON files and returns the results as a DataFrame.
- `load([path, format, schema])`: Loads data from a data source and returns it as a DataFrame.
- `option(key, value)`: Adds an input option for the underlying data source.
- `options(**options)`: Adds input options for the underlying data source.
- `orc(path[, mergeSchema, pathGlobFilter, ...])`: Loads ORC files and returns the result as a DataFrame.
- `parquet(*paths, **options)`: Loads Parquet files and returns the result as a DataFrame.
- `schema(schema)`: Specifies the input schema.
- `table(tableName)`: Returns the specified table as a DataFrame.
- `text(paths[, wholetext, lineSep, ...])`: Loads text files and returns a DataFrame.

The configuration methods (`format`, `option`, `options`, `schema`) return the reader itself, so they can be chained before a loading method such as `load` or `csv`.

The `DataFrameReader` class was added in PySpark 1.4.0; since version 3.4.0 it also supports Spark Connect.

In [15]:
# Read from CSV file
dfa = spark.read.format('csv')\
    .options(header='true', inferSchema='true', sep=",")\
    .load("data/advertising.csv")
dfa.show()

+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3| 12.0|
|151.5| 41.3|     58.5| 16.5|
|180.8| 10.8|     58.4| 17.9|
|  8.7| 48.9|     75.0|  7.2|
| 57.5| 32.8|     23.5| 11.8|
|120.2| 19.6|     11.6| 13.2|
|  8.6|  2.1|      1.0|  4.8|
|199.8|  2.6|     21.2| 15.6|
| 66.1|  5.8|     24.2| 12.6|
|214.7| 24.0|      4.0| 17.4|
| 23.8| 35.1|     65.9|  9.2|
| 97.5|  7.6|      7.2| 13.7|
|204.1| 32.9|     46.0| 19.0|
|195.4| 47.7|     52.9| 22.4|
| 67.8| 36.6|    114.0| 12.5|
|281.4| 39.6|     55.8| 24.4|
| 69.2| 20.5|     18.3| 11.3|
|147.3| 23.9|     19.1| 14.6|
+-----+-----+---------+-----+
only showing top 20 rows



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toDF">
<img align=left src="images/pyspark-pictures-dataframes-page58.svg" width=360 height=203 />
</a>

**toDF(\*cols)**

The `toDF()` method returns a new DataFrame with the given column names.

**Parameters:**
- `cols`: The new column names, passed as separate string arguments. The number of names must equal the number of columns in the original DataFrame.

**Returns:**
A new DataFrame with the specified column names.

**Note:**
- Names are assigned by position: the first name in `cols` replaces the name of the first column, and so on. The data and the column order are unchanged.

In [None]:
# toDF
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.toDF("seller","buyer","amt")
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+------+-----+---+
|seller|buyer|amt|
+------+-----+---+
| Alice|  Bob|0.1|
|   Bob|Carol|0.2|
| Carol| Dave|0.3|
+------+-----+---+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toPandas">
<img align=left src="images/pyspark-pictures-dataframes-page60.svg" width=360 height=203 />
</a>

**toPandas()**

The `toPandas()` method is used to retrieve the contents of a DataFrame as a Pandas DataFrame. This method is available only if the Pandas library is installed and accessible.

**Returns:**
A Pandas DataFrame containing the contents of the Spark DataFrame.

**Note:**
- Use `toPandas()` with caution: it brings all the data from the distributed Spark DataFrame into the driver's memory, which can cause memory problems if the result is large.
- It is most useful on small results, when you want the analysis and display capabilities of Pandas, for example to view columns with long content that `show()` would truncate.
- For large data, keep the processing in Spark (Spark SQL or the DataFrame API) and convert only a filtered or aggregated result.

In [17]:
# toPandas
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.toPandas()
x.show()
print(type(y))
y

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

<class 'pandas.core.frame.DataFrame'>


    from     to  amt
0  Alice    Bob  0.1
1    Bob  Carol  0.2
2  Carol   Dave  0.3


# DataFrame to RDD

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.rdd">
<img align=left src="images/pyspark-pictures-dataframes-page42.svg" width=360 height=203 />
</a>

**property rdd**

The `rdd` property returns the content of a DataFrame as an RDD of `Row` objects. Each `Row` represents a row of data in the DataFrame.

**Returns:**
An RDD of `Row` objects representing the content of the DataFrame.

**Note:**
- The `rdd` property allows you to access the underlying RDD representation of the DataFrame. This can be useful when you want to apply RDD-specific transformations and actions that are not available directly on DataFrames.
- To get an RDD of plain tuples instead of `Row` objects, map each row through `tuple`, for example `df.rdd.map(tuple)`. This is helpful when subsequent RDD operations expect tuples.

In [17]:
# rdd
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.rdd
x.show()
print(y.collect())

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

[Row(from='Alice', to='Bob', amt=0.1), Row(from='Bob', to='Carol', amt=0.2), Row(from='Carol', to='Dave', amt=0.3)]


In [19]:
# rdd
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.rdd.map(tuple)
x.show()
print(y.collect())

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

[('Alice', 'Bob', 0.1), ('Bob', 'Carol', 0.2), ('Carol', 'Dave', 0.3)]


# Showing DataFrame and metadata

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show">
<img align=left src="images/pyspark-pictures-dataframes-page52.svg" width=360 height=203 />
</a>

**show(n=20, truncate=True, vertical=False)**

The `show()` method is used to print the first `n` rows of a DataFrame to the console in a tabular format.

**Parameters:**
- `n`: The number of rows to display. By default, it is set to 20.
- `truncate`: If `True` (the default), strings longer than 20 characters are truncated. If set to a number greater than 1, strings are truncated to that length and cells are right-aligned. If `False`, full values are shown.
- `vertical`: If set to `True`, the output rows will be printed vertically, with each column value on a separate line.

**Note:**
- The `show()` method is mainly used for quick data inspection and debugging purposes, especially when dealing with smaller datasets. It is not intended for displaying the entire content of large DataFrames.
- Pass `truncate=False` to print full column values, or an integer to choose the truncation length.
- The `vertical` parameter is useful when you want to view the values in each column separately, which can be helpful when dealing with wide tables or when inspecting specific columns more closely.

In [20]:
# show
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3), ("Big name big name big name big name", "Big", 0.2)], ['from','to','amt'])
x.show()
x.show(truncate=False)
x.show(vertical=True)
x.toPandas()

+--------------------+-----+---+
|                from|   to|amt|
+--------------------+-----+---+
|               Alice|  Bob|0.1|
|                 Bob|Carol|0.2|
|               Carol| Dave|0.3|
|Big name big name...|  Big|0.2|
+--------------------+-----+---+

+-----------------------------------+-----+---+
|from                               |to   |amt|
+-----------------------------------+-----+---+
|Alice                              |Bob  |0.1|
|Bob                                |Carol|0.2|
|Carol                              |Dave |0.3|
|Big name big name big name big name|Big  |0.2|
+-----------------------------------+-----+---+

-RECORD 0--------------------
 from | Alice                
 to   | Bob                  
 amt  | 0.1                  
-RECORD 1--------------------
 from | Bob                  
 to   | Carol                
 amt  | 0.2                  
-RECORD 2--------------------
 from | Carol                
 to   | Dave                 
 amt  | 0.3         

                                  from     to  amt
0                                Alice    Bob  0.1
1                                  Bob  Carol  0.2
2                                Carol   Dave  0.3
3  Big name big name big name big name    Big  0.2


<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.printSchema">
<img align=left src="images/pyspark-pictures-dataframes-page40.svg" width=360 height=203 />
</a>

**printSchema()**

The `printSchema()` method is used to print the schema of a DataFrame in a tree format.

**Note:**
- The `printSchema()` method provides a concise way to view the structure and data types of columns in a DataFrame.
- The output is a tree: the root represents the DataFrame, and each line below it shows a column name, its data type, and whether it is nullable.
- This method is particularly useful when working with complex schemas or when you need to quickly understand the structure of a DataFrame.

In [12]:
# printSchema
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
x.show()
x.printSchema()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

root
 |-- from: string (nullable = true)
 |-- to: string (nullable = true)
 |-- amt: double (nullable = true)



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.schema">
<img align=left src="images/pyspark-pictures-dataframes-page49.svg" width=360 height=203 />
</a>

**property schema**

The `schema` property is used to retrieve the schema of a DataFrame as a `pyspark.sql.types.StructType` object.

**Returns:**
A `pyspark.sql.types.StructType` object representing the schema of the DataFrame.

**Note:**
- The schema of a DataFrame defines the structure and data types of its columns.
- The `schema` property provides access to the schema information, allowing you to inspect and work with the column names and data types programmatically.
- The `pyspark.sql.types.StructType` object returned by the `schema` property contains a list of `pyspark.sql.types.StructField` objects, which represent individual columns with their respective name and data type information.
- This property is useful when you need to perform advanced operations or transformations based on the DataFrame schema, such as dynamically generating SQL statements or working with nested structures.

In [13]:
# schema
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.schema
x.show()
print(y)

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

StructType(List(StructField(from,StringType,true),StructField(to,StringType,true),StructField(amt,DoubleType,true)))


<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.columns">
<img align=left src="images/pyspark-pictures-dataframes-page8.svg" width=360 height=203 />
</a>

**property columns**

The `columns` property is used to retrieve all column names of a DataFrame as a list.

**Returns:**
A list containing the names of all columns in the DataFrame.

**Note:**
- The `columns` property is a convenient way to access and work with the column names of a DataFrame.
- The order of the column names in the list corresponds to the order of the columns in the DataFrame.
- This property is often used when you need to reference or manipulate specific columns in a DataFrame, such as selecting columns, renaming columns, or performing aggregations.
- You can use the `columns` property along with other DataFrame methods and operations to perform various transformations and computations on the data.

In [14]:
# columns
x = spark.createDataFrame([("Alice","Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.columns #creates list of column names on driver
x.show()
print(y)

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

['from', 'to', 'amt']


<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.describe">
<img align=left src="images/pyspark-pictures-dataframes-page14.svg" width=360 height=203 />
</a>

**describe(\*cols)**

The `describe()` method computes basic statistics for numeric and string columns in a DataFrame.

**Parameters:**
- `*cols`: Optional. Columns for which statistics need to be computed. If no columns are provided, statistics will be computed for all numerical or string columns.

**Returns:**
A DataFrame containing the computed statistics for the specified columns.

**Note:**
- The statistics are count, mean, standard deviation, minimum, and maximum. For string columns, mean and standard deviation are null.
- If no columns are given, statistics are computed for all numerical or string columns in the DataFrame.
- The output DataFrame has a "summary" column naming each statistic, followed by one column for each described column.
- This method is useful for gaining insights into the distribution and summary of data in the specified columns, helping to understand the range, central tendency, and dispersion of values.

In [21]:
# describe
x = spark.createDataFrame([("Alice","Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
x.show()
x.describe().show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-------+-----+----+-------------------+
|summary| from|  to|                amt|
+-------+-----+----+-------------------+
|  count|    3|   3|                  3|
|   mean| null|null|0.20000000000000004|
| stddev| null|null|0.09999999999999998|
|    min|Alice| Bob|                0.1|
|    max|Carol|Dave|                0.3|
+-------+-----+----+-------------------+



# Selecting columns

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.select">
<img align=left src="images/pyspark-pictures-dataframes-page50.svg" width=360 height=203 />
</a>

**select(\*cols)**

The `select()` method is used to project a set of columns or expressions from a DataFrame and returns a new DataFrame.

**Parameters:**
- `*cols`: A list of column names (as strings) or expressions (as `Column` objects) to be selected. If one of the column names is `'*'`, it expands to include all columns in the current DataFrame.

**Returns:**
A new DataFrame that includes only the selected columns or expressions.

**Note:**
- The `select()` method is commonly used to extract specific columns from a DataFrame for further processing or analysis.
- You can specify column names as strings in the `cols` list to include those columns in the resulting DataFrame. For example, `select("column1", "column2")` selects only the "column1" and "column2" columns.
- Alternatively, you can pass `Column` expressions in the `cols` list to perform transformations or computations on the columns. For example, `select((col("column1") + col("column2")).alias("sum"))` computes the sum of "column1" and "column2" and names the resulting column "sum". Note that `alias()` is called on the `Column` expression inside `select()`, not on the DataFrame that `select()` returns.
- By using `'*'` as one of the column names in the `cols` list, you can include all columns from the current DataFrame in the selection.
- The order of the columns in the resulting DataFrame will match the order of the columns specified in the `cols` list.

In [22]:
# 1 - Pandas like select using a list of columns
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.select(['from','amt'])
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+---+
| from|amt|
+-----+---+
|Alice|0.1|
|  Bob|0.2|
|Carol|0.3|
+-----+---+



In [17]:
# 2 - Columns as parameters
x.select("from","amt").show()

+-----+---+
| from|amt|
+-----+---+
|Alice|0.1|
|  Bob|0.2|
|Carol|0.3|
+-----+---+



In [23]:
# 3 - Using F.col
from pyspark.sql import functions as F
x.select(F.col("from"),F.col("amt")).show()


+-----+---+
| from|amt|
+-----+---+
|Alice|0.1|
|  Bob|0.2|
|Carol|0.3|
+-----+---+



In [24]:
# 4 - Using list of F.col and unpack it
x.select(*[F.col("from"),F.col("amt")]).show()

+-----+---+
| from|amt|
+-----+---+
|Alice|0.1|
|  Bob|0.2|
|Carol|0.3|
+-----+---+



**Apply functions to the columns**
PySpark provides a large library of built-in functions that can be applied to the columns of a DataFrame. They live in the `pyspark.sql.functions` module, imported above as `F`.

The full list is available at https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html

In [25]:
# apply function to the columns
x.select(F.upper(F.col("from")),F.col("amt")*2).show()

+-----------+---------+
|upper(from)|(amt * 2)|
+-----------+---------+
|      ALICE|      0.2|
|        BOB|      0.4|
|      CAROL|      0.6|
+-----------+---------+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.selectExpr">
<img align=left src="images/pyspark-pictures-dataframes-page51.svg" width=360 height=203 />
</a>

**selectExpr(\*expr)**

The `selectExpr()` method is a variant of `select()` that allows you to project a set of SQL expressions and returns a new DataFrame.

**Parameters:**
- `*expr`: SQL expressions to be selected. Each expression can be a column name, a SQL expression, or an alias expression.

**Returns:**
A new DataFrame that includes the selected SQL expressions.

**Note:**
- The `selectExpr()` method is useful when you want to perform complex transformations or computations on columns using SQL expressions.
- You can specify SQL expressions as strings in the `*expr` parameter. For example, `selectExpr("column1 + column2", "column3 * 2 AS doubled_column")` performs addition and multiplication operations on columns and assigns an alias to the computed column.
- The SQL expressions can include column names, SQL functions, arithmetic operations, or any valid SQL expression that can be evaluated.
- By using `*expr` as the parameter, you can pass multiple SQL expressions to be selected.
- The resulting DataFrame will include the selected SQL expressions as columns, with the column names derived from the expressions or aliases assigned.

In [None]:
# selectExpr
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.selectExpr(['substr(from,1,1)','amt+10'])
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+------------------+----------+
|substr(from, 1, 1)|(amt + 10)|
+------------------+----------+
|                 A|      10.1|
|                 B|      10.2|
|                 C|      10.3|
+------------------+----------+



# Filtering rows

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.where">
<img align=left src="images/pyspark-pictures-dataframes-page63.svg" width=360 height=203 />
</a>

**where(condition)** / **filter(condition)**

The `where()` and `filter()` methods are used to filter rows in a DataFrame based on the given condition.

**Parameters:**
- `condition`: A `Column` of type `BooleanType` or a string of SQL expression representing the filter condition.

**Returns:**
A new DataFrame that includes only the rows satisfying the given condition.

**Note:**
- `where()` is an alias for `filter()`; the two can be used interchangeably.
- The `condition` parameter can be a `Column` object representing a boolean condition, or it can be a string of SQL expression that evaluates to a boolean value.
- The resulting DataFrame will contain only the rows that satisfy the provided condition.
- Multiple filtering conditions can be combined using the operators `&` (AND), `|` (OR), and `~` (NOT). Wrap each comparison in parentheses, because these operators bind more tightly than comparison operators in Python.
- The filter condition can involve comparisons, arithmetic operations, SQL functions, or any valid SQL expression that evaluates to a boolean value.
- The `where()` and `filter()` methods are lazy operations, meaning that the filtering is not applied immediately. It is executed when an action is performed on the resulting DataFrame.

In [3]:
# where (filter) - SQL like syntax
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.where("amt > 0.1")
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+



In [27]:
# filter: function syntax
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.filter(F.col("amt") > 0.2)
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+----+---+
| from|  to|amt|
+-----+----+---+
|Carol|Dave|0.3|
+-----+----+---+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.distinct">
<img align=left src="images/pyspark-pictures-dataframes-page15.svg" width=360 height=203 />
</a>

**distinct()**

The `distinct()` method is used to return a new DataFrame that contains only the distinct (unique) rows from the original DataFrame.

**Returns:**
A new DataFrame containing the distinct rows.

**Note:**
- The `distinct()` operation eliminates duplicate rows from the DataFrame, resulting in a DataFrame that contains only unique rows.
- The distinctness of rows is determined by comparing all columns in the DataFrame.
- The order of rows in the resulting DataFrame may not be the same as the original DataFrame.
- The `distinct()` operation is a transformation and is lazily evaluated. The actual computation occurs when an action is performed on the resulting DataFrame.
- This method is commonly used to identify unique values or to remove duplicates from a DataFrame.

In [56]:
# distinct
x = spark.createDataFrame([("Alice","Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3),("Bob","Carol",0.2)], ['from','to','amt'])
y = x.distinct()
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
|  Bob|Carol|0.2|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|  Bob|Carol|0.2|
|Alice|  Bob|0.1|
|Carol| Dave|0.3|
+-----+-----+---+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropna">
<img align=left src="images/pyspark-pictures-dataframes-page18.svg" width=360 height=203 />
</a>

**dropna(how='any', thresh=None, subset=None)**

The `dropna()` method is used to return a new DataFrame omitting rows with null values.

**Parameters:**
- `how`: Specifies how to drop rows with null values. If set to `'any'`, a row is dropped if it contains any nulls. If set to `'all'`, a row is dropped only if all its values are null. By default, it is set to `'any'`.
- `thresh`: If specified, drops rows that have fewer than `thresh` non-null values. This parameter overrides the `how` parameter.
- `subset`: An optional list of column names to consider. If specified, only the specified columns are checked for null values.

**Returns:**
A new DataFrame with the rows containing null values dropped.

**Note:**
- The `dropna()` method is used to remove rows from a DataFrame that have null values, allowing you to clean the data and ensure data quality.
- By default, the method drops rows if they have any null values (`how='any'`). If you want to drop rows only if all their values are null, you can set `how='all'`.
- The `thresh` parameter allows you to specify a threshold for the number of non-null values a row must have to be retained. Rows with fewer than `thresh` non-null values will be dropped.
- The `subset` parameter allows you to specify a subset of columns to consider for null value checking. Only the specified columns will be checked, and rows with null values in other columns will not be dropped.
- The `dropna()` operation is a transformation and is lazily evaluated. The actual computation occurs when an action is performed on the resulting DataFrame.
- This method is useful for handling missing or null values in a DataFrame and ensuring data integrity.

In [4]:
# dropna
x = spark.createDataFrame([(None,"Bob",0.1),("Bob","Carol",None),("Carol",None,0.3),("Bob","Carol",0.2)], ['from','to','amt'])
y = x.dropna(how='any',subset=['from','to'])
x.show()
y.show()

+-----+-----+----+
| from|   to| amt|
+-----+-----+----+
| null|  Bob| 0.1|
|  Bob|Carol|null|
|Carol| null| 0.3|
|  Bob|Carol| 0.2|
+-----+-----+----+

+----+-----+----+
|from|   to| amt|
+----+-----+----+
| Bob|Carol|null|
| Bob|Carol| 0.2|
+----+-----+----+



# Add columns to Data Frame

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn">
<img align=left src="images/pyspark-pictures-dataframes-page64.svg" width=360 height=203 />
</a>

**withColumn(colName, col)**

The `withColumn()` method is used to return a new DataFrame by adding a column or replacing an existing column with the same name.

**Parameters:**
- `colName`: A string representing the name of the new column.
- `col`: A `Column` expression for the new column. The expression must be based on the current DataFrame; attempting to add a column from another DataFrame will raise an error.

**Returns:**
A new DataFrame with the specified column added or replaced.

**Note:**
- The `withColumn()` method allows you to add a new column to a DataFrame or replace an existing column with a new one.
- The `colName` parameter specifies the name of the new column.
- The `col` parameter is a `Column` expression that defines the values of the new column. The expression should be based on the current DataFrame, as attempting to add a column from another DataFrame will result in an error.
- If a column with the same name as `colName` already exists in the DataFrame, it will be replaced with the new column.
- It's important to note that calling `withColumn()` multiple times, especially within loops, to add multiple columns can result in inefficient plans and potential performance issues. To avoid this, it is recommended to use `select()` with multiple columns at once, rather than calling `withColumn()` multiple times. This helps optimize the execution plan and improves performance.

In [5]:

# withColumn
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",None),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.withColumn('conf',x.amt.isNotNull()) # pandas equivalent: df['conf'] = df['amt'].notna()
x.show()
y.show()

+-----+-----+----+
| from|   to| amt|
+-----+-----+----+
|Alice|  Bob| 0.1|
|  Bob|Carol|null|
|Carol| Dave| 0.3|
+-----+-----+----+

+-----+-----+----+-----+
| from|   to| amt| conf|
+-----+-----+----+-----+
|Alice|  Bob| 0.1| true|
|  Bob|Carol|null|false|
|Carol| Dave| 0.3| true|
+-----+-----+----+-----+



In [6]:
# lit(val) Creates a Column of a literal value.
from pyspark.sql.functions import lit
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",None),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.withColumn('constant',lit(1))
x.show()
y.show()

+-----+-----+----+
| from|   to| amt|
+-----+-----+----+
|Alice|  Bob| 0.1|
|  Bob|Carol|null|
|Carol| Dave| 0.3|
+-----+-----+----+

+-----+-----+----+--------+
| from|   to| amt|constant|
+-----+-----+----+--------+
|Alice|  Bob| 0.1|       1|
|  Bob|Carol|null|       1|
|Carol| Dave| 0.3|       1|
+-----+-----+----+--------+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumnRenamed">
<img align=left src="images/pyspark-pictures-dataframes-page65.svg" width=360 height=203 />
</a>

**withColumnRenamed(existing, new)**

The `withColumnRenamed()` method is used to return a new DataFrame by renaming an existing column. If the schema does not contain the given column name, this operation has no effect.

**Parameters:**
- `existing`: A string representing the name of the existing column to rename.
- `new`: A string representing the new name of the column.

**Returns:**
A new DataFrame with the specified column renamed.

**Note:**
- If the schema of the DataFrame does not contain the given `existing` column name, this operation has no effect. The resulting DataFrame will be the same as the original DataFrame.
- Renaming a column does not modify the original DataFrame. Instead, it creates a new DataFrame with the renamed column.
- This method is useful when you want to update or clarify the column names in a DataFrame for improved readability or compatibility with downstream operations.

In [31]:
# withColumnRenamed
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.withColumnRenamed('amt','amount')
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+------+
| from|   to|amount|
+-----+-----+------+
|Alice|  Bob|   0.1|
|  Bob|Carol|   0.2|
|Carol| Dave|   0.3|
+-----+-----+------+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop">
<img align=left src="images/pyspark-pictures-dataframes-page16.svg" width=360 height=203 />
</a>

**drop(\*cols)**

The `drop()` method is used to return a new DataFrame that drops the specified column(s). If the schema does not contain the given column name(s), this operation has no effect.

**Parameters:**
- `cols`: Can be one of the following:
  - A string representing the name of the column to drop.
  - A `Column` object representing the column to drop.
  - Several column names passed as separate string arguments, for example `drop('to', 'amt')`. A Python list must be unpacked first, as in `drop(*names)`.

**Returns:**
A new DataFrame with the specified column(s) dropped.

**Note:**
- If the schema of the DataFrame does not contain any of the given column name(s), this operation has no effect. The resulting DataFrame will be the same as the original DataFrame.
- Dropping a column does not modify the original DataFrame. Instead, it creates a new DataFrame without the dropped column(s).
- This method is useful when you want to exclude specific columns from a DataFrame for further analysis or processing.

In [31]:
# drop
x = spark.createDataFrame([("Alice","Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.drop('amt')
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+
| from|   to|
+-----+-----+
|Alice|  Bob|
|  Bob|Carol|
|Carol| Dave|
+-----+-----+



# Aggregates

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy">
<img align=left src="images/pyspark-pictures-dataframes-page28.svg" width=360 height=203 />
</a>

**groupBy(\*cols)**

The `groupBy()` method is used to group the DataFrame using the specified columns, allowing for subsequent aggregation operations. The `groupBy()` method returns a GroupedData object that provides access to various aggregate functions for performing calculations on the grouped data.

**Parameters:**
- `cols`: A list of columns to group by. Each element can be a column name (string) or a Column expression.

**Returns:**
A GroupedData object representing the grouped DataFrame.

**Note:**
- The resulting GroupedData object provides access to various aggregate functions such as `avg()`, `max()`, `min()`, `sum()`, `count()`, and more. These functions can be used to calculate summary statistics and perform aggregations on the grouped data.
- Additionally, user-defined aggregate functions created with `pandas_udf()` can be used as aggregate functions with `groupBy()`.
- Grouped aggregations generally shuffle data so that all rows of a group end up in the same partition. Built-in aggregates such as `sum()` and `avg()` are partially combined before the shuffle, but grouped pandas UDFs load all the data of a group into memory. It's therefore important to consider memory usage and out-of-memory risks, especially with skewed data or groups that are too large to fit in memory.
- The `groupBy()` method is commonly used in combination with aggregate functions to perform group-level calculations and summarize data based on specific grouping criteria.


In [33]:
# groupBy
x = spark.createDataFrame([('Alice',"Bob",1),("Alice","Carol",2),("Carol","Dave",3),('Carol',"Bob",4)], ['from','to','amt'])
y = x.groupBy('from')
x.show()
print(y)

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|  1|
|Alice|Carol|  2|
|Carol| Dave|  3|
|Carol|  Bob|  4|
+-----+-----+---+

<pyspark.sql.group.GroupedData object at 0x000001B4D090DF00>


<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy">
<img align=left src="images/pyspark-pictures-dataframes-page29.svg" width=360 height=203 />
</a>

For example, `avg()` computes the mean of a numeric column for each group:

In [7]:
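# groupBy(col1).avg(col2)
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",None),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.groupBy('from').avg('amt')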
x.show()
y.show()

+-----+-----+----+
| from|   to| amt|
+-----+-----+----+
|Alice|  Bob| 0.1|
|  Bob|Carol|null|
|Carol| Dave| 0.3|
+-----+-----+----+

+-----+--------+
| from|avg(amt)|
+-----+--------+
|Carol|     0.3|
|  Bob|    null|
|Alice|     0.1|
+-----+--------+



**agg(\*exprs)**

Aggregates over the entire DataFrame without groups (shorthand for `df.groupBy().agg()`).

**Parameters:**
- `exprs`: Columns or expressions to aggregate the DataFrame by. A single dict mapping column names to aggregate function names is also accepted.

In [9]:
from pyspark.sql import functions as F
print(x.agg({"amt": "max"}).collect())
print(x.agg(F.min(x.amt)).collect())


[Row(max(amt)=0.3)]
[Row(min(amt)=0.1)]


# Top K queries

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy">
<img align=left src="images/pyspark-pictures-dataframes-page38.svg" width=360 height=203 />
</a>

**orderBy(\*cols, \*\*kwargs)**

The `orderBy()` method is used to return a new DataFrame that is sorted by the specified column(s).

**Parameters:**
- `cols`: A list of Column objects or column names to sort the DataFrame by.
- `ascending`: Optional. A boolean or a list of boolean values specifying the sort order for each column. If set to `True`, the corresponding column is sorted in ascending order.

**Returns:**
A new DataFrame that is sorted based on the specified column(s).

**Note:**
- The `cols` parameter specifies the columns by which the DataFrame should be sorted. Each element in `cols` can be a Column object or a column name.
- By default, the sorting is done in ascending order. You can control the sort order using the `ascending` parameter. If set to `True`, the corresponding column is sorted in ascending order, and if set to `False`, the column is sorted in descending order. If a list of boolean values is provided, it must have the same length as `cols` and specify the sort order for each column individually.
- The resulting DataFrame will have the rows sorted based on the specified column(s). If multiple columns are specified, the DataFrame is sorted first by the first column, then by the second column, and so on.
- The `orderBy()` operation performs a global sort, which shuffles data across the cluster when the query is executed. This can be an expensive operation, especially for large datasets.

In [10]:
from pyspark.sql import functions as F
# orderBy
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
x.show()
y = x.orderBy(['to'],ascending=[False])
y.show()
# The same result using F.desc
y = x.orderBy(F.desc('to'))
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Carol| Dave|0.3|
|  Bob|Carol|0.2|
|Alice|  Bob|0.1|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Carol| Dave|0.3|
|  Bob|Carol|0.2|
|Alice|  Bob|0.1|
+-----+-----+---+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.limit">
<img align=left src="images/pyspark-pictures-dataframes-page34.svg" width=360 height=203 />
</a>

**limit(num)**

The `limit()` method is used to limit the number of results in a DataFrame to the specified number.

**Parameters:**
- `num`: The number of results to limit the DataFrame to.

**Returns:**
A new DataFrame containing a limited number of results.

**Note:**
- The `limit()` method is commonly used to restrict the number of rows in a DataFrame and obtain a smaller subset of the data.
- The `num` parameter specifies the maximum number of results to be included in the resulting DataFrame.
- The resulting DataFrame will contain at most `num` rows. If the original DataFrame has fewer than `num` rows, all rows from the original DataFrame will be included in the result.
- This method is useful when you want a small subset of data for exploration or testing. Without a preceding `orderBy()`, Spark does not guarantee which rows are returned; combine `orderBy()` with `limit()` to obtain a top-K result.

In [11]:
# limit
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.limit(2)
z = x.orderBy(['amt'],ascending=False).limit(2)
x.show()
y.show()
z.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Carol| Dave|0.3|
|  Bob|Carol|0.2|
+-----+-----+---+



# Join

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join">
<img align=left src="images/pyspark-pictures-dataframes-page33.svg" width=360 height=203 />
</a>

**join(other, on=None, how=None)**

The `join()` method is used to join a DataFrame with another DataFrame, using the given join expression.

**Parameters:**
- `other`: The right side DataFrame to join with.
- `on`: Optional. Specifies the join column name(s) as a string, a list of column names, a join expression (Column), or a list of Columns. If `on` is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
- `how`: Optional. Specifies the type of join to perform. Defaults to `inner`. It must be one of the following:
  - `inner`: Performs an inner join, keeping only the rows that have matching keys in both DataFrames.
  - `cross`: Performs a cross join, producing a Cartesian product of the rows in both DataFrames.
  - `outer`, `full`, `fullouter`, `full_outer`: Performs a full outer join, keeping all rows from both DataFrames and filling in missing values with null.
  - `left`, `leftouter`, `left_outer`: Performs a left outer join, keeping all rows from the left DataFrame and filling in missing values from the right DataFrame with null.
  - `right`, `rightouter`, `right_outer`: Performs a right outer join, keeping all rows from the right DataFrame and filling in missing values from the left DataFrame with null.
  - `semi`, `leftsemi`, `left_semi`: Performs a semi join, keeping only the rows from the left DataFrame that have matching keys in the right DataFrame.
  - `anti`, `leftanti`, `left_anti`: Performs an anti join, keeping only the rows from the left DataFrame that do not have matching keys in the right DataFrame.

**Returns:**
A new DataFrame resulting from the join operation.

**Note:**
- The resulting DataFrame will contain the combined rows from both DataFrames, based on the join condition and type specified.
- Joins can be used to merge data from multiple sources, perform data matching, or combine data for further analysis.

In [18]:
# join
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = spark.createDataFrame([('Alice',20),("Bob",40),("Dave",80)], ['name','age'])
z = x.join(y,x.to == y.name,'inner').select('from','to','amt','age')
x.show()
y.show()
z.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+---+
| name|age|
+-----+---+
|Alice| 20|
|  Bob| 40|
| Dave| 80|
+-----+---+

+-----+----+---+---+
| from|  to|amt|age|
+-----+----+---+---+
|Alice| Bob|0.1| 40|
|Carol|Dave|0.3| 80|
+-----+----+---+---+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.intersect">
<img align=left src="images/pyspark-pictures-dataframes-page31.svg" width=360 height=203 />
</a>

**intersect(other)**

The `intersect()` method is used to return a new DataFrame containing rows that exist in both the current DataFrame and another DataFrame. This operation is equivalent to the `INTERSECT` operation in SQL.

**Parameters:**
- `other`: The other DataFrame to intersect with.

**Returns:**
A new DataFrame containing the intersected rows.

**Note:**
- The `intersect()` method is used to find the common rows between two DataFrames, creating a new DataFrame that includes only the rows present in both DataFrames.
- The resulting DataFrame will contain only the rows that exist in both the current DataFrame and the other DataFrame.
- The `intersect()` operation compares the rows based on their values and does not consider the order of the rows.
- This method can be useful for finding common records, performing data deduplication, or identifying overlapping data between different datasets.

In [6]:
# intersect
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Alice",0.2),("Carol","Dave",0.1)], ['from','to','amt'])
z = x.intersect(y)
x.show()
y.show()
z.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Alice|0.2|
|Carol| Dave|0.1|
+-----+-----+---+

+-----+---+---+
| from| to|amt|
+-----+---+---+
|Alice|Bob|0.1|
+-----+---+---+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.subtract">
<img align=left src="images/pyspark-pictures-dataframes-page56.svg" width=360 height=203 />
</a>

**subtract(other)**

The `subtract()` method is used to return a new DataFrame containing rows that exist in the current DataFrame but not in another DataFrame. This operation is equivalent to the `EXCEPT DISTINCT` operation in SQL.

**Parameters:**
- `other`: The other DataFrame to subtract from the current DataFrame.

**Returns:**
A new DataFrame containing the subtracted rows.

**Note:**
- The resulting DataFrame will contain only the rows that exist in the current DataFrame and do not exist in the other DataFrame.
- The `subtract()` operation compares the rows based on their values and does not consider the order of the rows.
- This method can be useful for finding unique records, removing duplicate data, or identifying the differences between two datasets.

In [39]:
# subtract
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.1)], ['from','to','amt'])
z = x.subtract(y)
x.show()
y.show()
z.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.1|
+-----+-----+---+

+-----+----+---+
| from|  to|amt|
+-----+----+---+
|Carol|Dave|0.3|
+-----+----+---+



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.unionAll">
<img align=left src="images/pyspark-pictures-dataframes-page61.svg" width=360 height=203 />
</a>

**unionAll(other)**

The `unionAll()` method is used to return a new DataFrame containing the union of rows from the current DataFrame and other DataFrame. This operation is equivalent to the `UNION ALL` operation in SQL. 

**Parameters:**
- `other`: The other DataFrame to union with the current DataFrame.

**Returns:**
A new DataFrame containing the union of rows.

**Note:**
- The resulting DataFrame will contain all the rows from the current DataFrame and all the rows from the other DataFrame, with duplicates included if they exist.
- The `unionAll()` operation does not perform deduplication of elements. If you want to remove duplicates and obtain a distinct set of rows, follow it with `distinct()`. Since Spark 2.0, `unionAll()` is an alias for `union()`.
- Column resolution in `unionAll()` is done by position (not by name): the first column of one DataFrame is combined with the first column of the other, regardless of column names. To match columns by name, use `unionByName()` instead.
- This method is useful for combining multiple DataFrames vertically, appending new rows, or merging datasets with compatible schemas.

In [7]:
# unionAll
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2)], ['from','to','amt'])
y = spark.createDataFrame([("Bob","Carol",0.2),("Carol","Dave",0.1)], ['from','to','amt'])
z = x.unionAll(y)
x.show()
y.show()
z.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|  Bob|Carol|0.2|
|Carol| Dave|0.1|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|  Bob|Carol|0.2|
|Carol| Dave|0.1|
+-----+-----+---+



# SQL like Operations on DataFrames

**sql(sqlQuery)**

The `sql()` method of `SparkSession` executes a SQL query against the tables and views registered in the session and returns the result as a new DataFrame.

**Parameters:**
- `sqlQuery`: A string containing the SQL query to be executed.

**Returns:**
A new DataFrame representing the result of the SQL query.

**Note:**
- `sql()` is called on the `SparkSession` (here `spark`), not on a DataFrame. It lets you use SQL syntax for data manipulation and analysis.
- The schema and data of the resulting DataFrame are determined by the query.
- The query can select columns, filter rows, aggregate data, join tables, and more.
- This method is particularly useful for complex manipulations that are easier to express concisely in SQL.
- The query is executed by the Spark SQL engine.
- To query a DataFrame by name, first register it as a temporary view with `createOrReplaceTempView()`. Otherwise, the query cannot reference the DataFrame.
- Column names in the query are resolved against the schema of the referenced view.

**createOrReplaceTempView(name)**

The `createOrReplaceTempView()` method registers the DataFrame as a temporary view with the given name, replacing any existing temporary view of that name. The view can then be referenced in SQL queries.

**Parameters:**
- `name`: A string giving the name of the temporary view.

**Note:**
- The temporary view is tied to the lifetime of the SparkSession that was used to create the DataFrame. Once the SparkSession is terminated, the view is no longer accessible.
- Once the DataFrame is registered, you can refer to it by the specified name in queries passed to the `sql()` method.
- Registering a DataFrame as a temporary view lets you use SQL syntax for data manipulation and analysis, including complex queries, joins, and aggregations.
- Temporary views are particularly useful when you need to run several SQL queries against the same DataFrame or want to separate SQL logic from DataFrame transformations.
- A temporary view can be queried from any part of the application, and from different Spark jobs, as long as they use the same SparkSession.

In [12]:
# createOrReplaceTempView and sql
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
x.createOrReplaceTempView(name="TRANSACTIONS")
y = spark.sql('SELECT * FROM TRANSACTIONS WHERE amt > 0.1')
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+



# Statistics

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.count">
<img align=left src="images/pyspark-pictures-dataframes-page10.svg" width=360 height=203 />
</a>

**count()**

The `count()` method is used to return the number of rows in the DataFrame.

**Returns:**
The number of rows in the DataFrame.

**Note:**
- This method is useful for understanding the size of the DataFrame, checking data completeness, or performing basic data profiling tasks.
- The `count()` operation is an action, meaning that it triggers the evaluation of the DataFrame and retrieves the actual row count from the underlying data source.
- It is important to note that the count operation may require reading and processing the entire DataFrame, which can be time-consuming for large datasets.

In [9]:
# count
x = spark.createDataFrame([("Alice","Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
x.show()
print(x.count())

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

3


<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.corr">
<img align=left src="images/pyspark-pictures-dataframes-page9.svg" width=360 height=203 />
</a>

**corr(col1, col2, method=None)**

The `corr()` method calculates the correlation between two columns of a DataFrame as a double value. Currently, it only supports the Pearson Correlation Coefficient.

**Parameters:**
- `col1`: The name of the first column.
- `col2`: The name of the second column.
- `method`: Optional. The correlation method to use. Currently, only the "pearson" method is supported.

**Returns:**
The correlation between the two columns as a double value.

**Note:**
- The correlation measures the strength and direction of the linear relationship between the two columns. The Pearson Correlation Coefficient is a commonly used correlation measure.
- The resulting correlation value ranges from -1.0 to 1.0. A value of 1.0 indicates a perfect positive linear relationship, -1.0 a perfect negative linear relationship, and 0.0 no linear relationship (the columns may still be related non-linearly).
- The `corr()` method can be useful for understanding the relationship between different variables in the DataFrame and can be used in exploratory data analysis, feature selection, or statistical modeling.

In [10]:
# corr
x = spark.createDataFrame([("Alice","Bob",0.1,0.001),("Bob","Carol",0.2,0.02),("Carol","Dave",0.3,0.02)], ['from','to','amt','fee'])
y = x.corr(col1="amt",col2="fee")
x.show()
print(y)

+-----+-----+---+-----+
| from|   to|amt|  fee|
+-----+-----+---+-----+
|Alice|  Bob|0.1|0.001|
|  Bob|Carol|0.2| 0.02|
|Carol| Dave|0.3| 0.02|
+-----+-----+---+-----+

0.8660254037844389
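
As a cross-check on the value printed above, the Pearson coefficient can be computed by hand from its definition: the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations. The following is a plain-Python sketch for the three rows of this example; the function name `pearson` is an illustrative helper, not part of PySpark.

```python
import math

# The 'amt' and 'fee' columns from the example DataFrame above.
amt = [0.1, 0.2, 0.3]
fee = [0.001, 0.02, 0.02]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations from the mean
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    sum_sq_x = sum(a * a for a in dx)
    sum_sq_y = sum(b * b for b in dy)
    return cross / math.sqrt(sum_sq_x * sum_sq_y)

r = pearson(amt, fee)
print(r)  # agrees with x.corr("amt", "fee") up to floating-point rounding
```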


<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.cov">
<img align=left src="images/pyspark-pictures-dataframes-page11.svg" width=360 height=203 />
</a>

**cov(col1, col2)**

The `cov()` method calculates the sample covariance between two columns in the DataFrame, specified by their names.

**Parameters:**
- `col1`: The name of the first column.
- `col2`: The name of the second column.

**Returns:**
The sample covariance between the two columns, represented as a double value.

**Note:**
- The sample covariance measures the linear relationship between the two columns and provides insight into how they vary together.
- The resulting covariance value is returned as a double value.
- This method can be useful for understanding the relationship between different variables in the DataFrame and can be used in exploratory data analysis, feature engineering, or statistical modeling.

In [45]:
# cov
x = spark.createDataFrame([("Alice","Bob",0.1,0.001),("Bob","Carol",0.2,0.02),("Carol","Dave",0.3,0.02)], ['from','to','amt','fee'])
y = x.cov(col1="amt",col2="fee")
x.show()
print(y)

+-----+-----+---+-----+
| from|   to|amt|  fee|
+-----+-----+---+-----+
|Alice|  Bob|0.1|0.001|
|  Bob|Carol|0.2| 0.02|
|Carol| Dave|0.3| 0.02|
+-----+-----+---+-----+

0.0009500000000000001
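
"Sample" covariance means that the sum of products of deviations is divided by `n - 1` rather than `n`. The plain-Python sketch below reproduces the value printed above and contrasts it with the population covariance. The function `covariance` and its `ddof` parameter (named after the NumPy convention) are assumptions made for this illustration and are not part of the PySpark API.

```python
# The 'amt' and 'fee' columns from the example DataFrame above.
amt = [0.1, 0.2, 0.3]
fee = [0.001, 0.02, 0.02]

def covariance(xs, ys, ddof=1):
    """Covariance of two equal-length sequences.

    ddof=1 divides by n - 1 (sample covariance, what cov() reports);
    ddof=0 divides by n (population covariance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - ddof)

sample_cov = covariance(amt, fee)               # divides by 2
population_cov = covariance(amt, fee, ddof=0)   # divides by 3

print(sample_cov)      # matches x.cov("amt", "fee") up to rounding
print(population_cov)  # smaller, because the divisor is larger
```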


# Other

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.persist">
<img align=left src="images/pyspark-pictures-dataframes-page39.svg" width=360 height=203 />
</a>

**persist(storageLevel=StorageLevel(True, True, False, False, 1))**

The `persist()` method is used to set the storage level for persisting the contents of the DataFrame across operations after the first computation. This allows for faster access to the DataFrame in subsequent operations. By default, the storage level is set to `MEMORY_AND_DISK` if no storage level is specified.

**Parameters:**
- `storageLevel`: Optional. The storage level to assign to the DataFrame, which determines whether it is stored in memory, on disk, or both. If not specified, the default is `MEMORY_AND_DISK`, which keeps partitions in memory and spills those that do not fit to disk.

**Note:**
- The `storageLevel` parameter specifies the storage level to assign to the DataFrame. It determines where and how the DataFrame is stored. The storage level can be customized based on the available memory and disk resources.
- The storage level is an instance of the `StorageLevel` class, which represents various options for storing data, such as storing it in memory, on disk, or both.
- The storage level set using `persist()` is persistent across operations, meaning that the DataFrame remains stored in memory or on disk until explicitly unpersisted.
- The `persist()` operation is lazy and does not trigger immediate storage. The actual storage occurs when an action is performed on the DataFrame.
- The `persist()` method is useful when you have intermediate results or frequently accessed DataFrames that you want to cache in memory or on disk for faster access in subsequent operations.

In [46]:
# persist
from pyspark.storagelevel import StorageLevel
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
x.persist(storageLevel=StorageLevel(True,True,False,True,1)) # StorageLevel(useDisk,useMemory,useOffHeap,deserialized,replication=1)
x.show()
x.is_cached

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+



True

<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.unpersist">
<img align=left src="images/pyspark-pictures-dataframes-page62.svg" width=360 height=203 />
</a>

**unpersist(blocking=False)**

The `unpersist()` method is used to mark the DataFrame as non-persistent and remove all blocks associated with it from memory and disk. This frees up the resources used by the DataFrame.

**Parameters:**
- `blocking`: Optional. If set to `True`, the method blocks until all blocks of the DataFrame have been deleted from all nodes. If set to `False` (the default), the method returns immediately without waiting for the deletion to complete.

**Note:**
- By marking the DataFrame as non-persistent, all the blocks associated with it are removed from memory and disk, allowing the resources to be reclaimed.
- When a DataFrame is unpersisted, subsequent access to the DataFrame will trigger recomputation of the DataFrame's contents.
- Unpersisting does not invalidate the DataFrame object itself. The DataFrame remains usable; only its persistence status and the cached blocks in memory and on disk are removed.
- The `unpersist()` method is useful when you want to free up memory and disk resources after you no longer need the persisted DataFrame. This can help optimize resource usage and improve overall performance.

In [47]:
# unpersist
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
x.cache()
x.count()
x.show()
print(x.is_cached)
x.unpersist()
print(x.is_cached)

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

True
False


<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace">
<img align=left src="images/pyspark-pictures-dataframes-page45.svg" width=360 height=203 />
</a>

**replace(to_replace, value=<no value>, subset=None)**

The `replace()` method is used to return a new DataFrame with specified values replaced by other values. This operation can be used to perform value replacement in a DataFrame. 

**Parameters:**
- `to_replace`: The value(s) to be replaced. It can be a boolean, integer, float, string, list, or dictionary. If it is a dictionary, the `value` parameter is ignored, and `to_replace` must be a mapping between a value and its replacement.
- `value`: The replacement value(s) to be used. It must be of the same type as `to_replace` and can be a boolean, integer, float, string, list, or None. If `value` is a list, it should have the same length and type as `to_replace`. If `value` is a scalar and `to_replace` is a sequence, the scalar value is used as a replacement for each item in `to_replace`.
- `subset`: Optional. A list of column names to consider for replacement. Only the columns specified in `subset` will be processed. Columns with a different data type than the replacement value(s) will be ignored.

**Returns:**
A new DataFrame with specified values replaced by other values.

**Note:**
- `replace()` does not modify the original DataFrame; it returns a new DataFrame with the replacements applied.
- When `to_replace` is a dictionary, each key is replaced by its mapped value and the `value` argument is omitted.
- When performing numeric replacements, it is important to note that all values to be replaced should have unique floating point representations to avoid conflicts. In cases of conflicts, an arbitrary replacement will be used.
- The replacement values will be cast to the type of the existing column in the DataFrame.

In [11]:
# replace
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.replace('Dave','David',['from','to'])
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol|David|0.3|
+-----+-----+---+
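
The example above uses a single scalar replacement. The dictionary form described under **Parameters** applies several replacements at once. The plain-Python sketch below models that behaviour on row tuples; `replace_values` is a hypothetical helper written for this illustration, and the PySpark call it imitates would be `x.replace({'Dave': 'David', 'Bob': 'Robert'}, subset=['from', 'to'])`.

```python
# Plain-Python model of dictionary-based replace() restricted to a subset
# of columns. This illustrates the semantics only; it is not Spark code.

def replace_values(columns, rows, mapping, subset):
    """Return new rows where values in `subset` columns are mapped via `mapping`."""
    positions = [columns.index(c) for c in subset]
    result = []
    for row in rows:
        new_row = list(row)
        for i in positions:
            # Values without an entry in the mapping are left unchanged.
            new_row[i] = mapping.get(new_row[i], new_row[i])
        result.append(tuple(new_row))
    return result

columns = ["from", "to", "amt"]
rows = [("Alice", "Bob", 0.1), ("Bob", "Carol", 0.2), ("Carol", "Dave", 0.3)]

replaced = replace_values(columns, rows, {"Dave": "David", "Bob": "Robert"}, ["from", "to"])
for row in replaced:
    print(row)
```

The original `rows` list is left untouched, mirroring the fact that `replace()` returns a new DataFrame.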



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample">
<img align=left src="images/pyspark-pictures-dataframes-page47.svg" width=360 height=203 />
</a>

**sample(withReplacement=None, fraction=None, seed=None)**

The `sample()` method is used to return a sampled subset of the DataFrame.

**Parameters:**
- `withReplacement`: Optional. Specifies whether to sample with replacement or not. If set to `True`, sampling will be done with replacement (default is `False`).
- `fraction`: Optional. Specifies the expected fraction of rows to include in the sample, as a value between 0.0 and 1.0. The size of the sample is not guaranteed to match this fraction exactly.
- `seed`: Optional. Specifies the seed for sampling. If provided, the same seed will generate the same sample in subsequent runs (default is a random seed).

**Returns:**
A sampled subset of the DataFrame.

**Note:**
- The `fraction` parameter is an expected proportion, not an exact one. For example, when sampling without replacement with a value of 0.5, each row is kept with probability 0.5, so the sample contains about 50% of the rows on average. In the example below, only one of the three rows is selected.
- The `seed` parameter allows you to specify a seed for the random number generator used in sampling. Providing a specific seed ensures reproducibility of the sample across multiple runs. If no seed is provided, a random seed will be used.
- The resulting DataFrame will contain a random subset of the rows from the original DataFrame, based on the specified sampling parameters.
- The `sample()` method is useful when you want to explore or analyze a smaller subset of data, or when you need to perform testing or validation on a fraction of the data.

In [49]:
# sample
x = spark.createDataFrame([('Alice',"Bob",0.1),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.sample(False,0.5)
x.show()
y.show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+----+---+
| from|  to|amt|
+-----+----+---+
|Carol|Dave|0.3|
+-----+----+---+
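
The output above contains one row rather than "half" of three because sampling without replacement decides independently for every row whether to keep it. The sketch below models this in plain Python under that assumption. It is not Spark's actual sampler (Spark uses its own random number generator and samples each partition separately), and `bernoulli_sample` is a hypothetical helper name.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = [("Alice", "Bob", 0.1), ("Bob", "Carol", 0.2), ("Carol", "Dave", 0.3)]

# The same seed yields the same sample on every run.
first = bernoulli_sample(rows, 0.5, seed=42)
second = bernoulli_sample(rows, 0.5, seed=42)
print(first == second)

# Different seeds can yield samples of different sizes, anywhere from 0 to 3 rows.
sizes = [len(bernoulli_sample(rows, 0.5, seed=s)) for s in range(20)]
print(sizes)
```

In PySpark, the equivalent way to get a repeatable sample is to pass a seed, for example `x.sample(False, 0.5, seed=42)`.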



<a href="http://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit">
<img align=left src="images/pyspark-pictures-dataframes-page41.svg" width=360 height=203 />
</a>

**randomSplit(weights, seed=None)**

The `randomSplit()` method is used to randomly split the DataFrame into multiple subsets with the provided weights.

**Parameters:**
- `weights`: A list of doubles representing the weights with which to split the DataFrame. The weights will be normalized if they don't sum up to 1.0.
- `seed`: Optional. The seed for sampling. If provided, the same seed will generate the same split in subsequent runs.

**Returns:**
A list of DataFrames resulting from the random split.

**Note:**
- Providing a specific seed ensures reproducibility of the split across multiple runs. If no seed is provided, a random seed will be used.
- The resulting list contains one DataFrame per weight, each holding a subset of the original rows. The subset sizes are only approximately proportional to the weights, especially for small DataFrames.
- Each row of the original DataFrame is assigned at random to exactly one subset, with a probability proportional to that subset's weight. The subsets are therefore disjoint and together cover the whole DataFrame.
- The `randomSplit()` method is useful when you need to split the data into training and test sets, or when you want to create multiple subsets for cross-validation or data exploration purposes.

In [50]:
# randomSplit
x = spark.createDataFrame([('Alice',"Bob",0.1),('Alice',"Bob",0.4),('Alice',"Bob",0.5),('Alice',"Bob",0.6),("Bob","Carol",0.2),("Carol","Dave",0.3)], ['from','to','amt'])
y = x.randomSplit([0.5,0.5],10)
x.show()
y[0].show()
y[1].show()

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|Alice|  Bob|0.4|
|Alice|  Bob|0.5|
|Alice|  Bob|0.6|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+-----+---+
| from|   to|amt|
+-----+-----+---+
|Alice|  Bob|0.1|
|  Bob|Carol|0.2|
|Carol| Dave|0.3|
+-----+-----+---+

+-----+---+---+
| from| to|amt|
+-----+---+---+
|Alice|Bob|0.4|
|Alice|Bob|0.5|
|Alice|Bob|0.6|
+-----+---+---+
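
One way to picture how weights turn into subsets: normalize the weights, turn them into cumulative boundaries on the interval [0, 1), draw one random number per row, and put the row into the subset whose interval contains that number. The plain-Python sketch below implements this simplified model. It is an illustration under that assumption, not Spark's implementation, whose results also depend on partitioning and its own random number generator; `random_split` is a hypothetical helper name.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed=None):
    """Assign each row to exactly one subset, with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [1, 3] -> [0.25, 1.0].
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                parts[i].append(row)
                break
    return parts

rows = [("Alice", "Bob", 0.1), ("Alice", "Bob", 0.4), ("Alice", "Bob", 0.5),
        ("Alice", "Bob", 0.6), ("Bob", "Carol", 0.2), ("Carol", "Dave", 0.3)]

# Weights that do not sum to 1.0 are normalized: [1, 3] behaves like [0.25, 0.75].
parts = random_split(rows, [1, 3], seed=10)
print([len(p) for p in parts])

# Every row lands in exactly one subset.
print(sorted(parts[0] + parts[1]) == sorted(rows))
```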

