## Partitioning a DataFrame
A PySpark DataFrame can be split into a specified number of partitions using the **repartition(~)** method. The method can also partition the DataFrame by column values.

### Parameters
1. numPartitions | int
    - Number of partitions to split the DataFrame into.

2. cols | str or Column
    - Columns by which to partition the DataFrame.

### Return Value
  - A new PySpark DataFrame.

[API Reference](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html)

### Example

## Partitioning DataFrame

In [0]:
df = spark.createDataFrame(
  [('Shiva', 'Appu'), ('Teja', 'Sweety'), ('Bhavishya', 'Bhavi'), ('Shiva', 'Chinu')],
  ['name','nick_name']
)
df.show(truncate=False)

+---------+---------+
|name     |nick_name|
+---------+---------+
|Teja     |Sweety   |
|Bhavishya|Bhavi    |
|Shiva    |Appu     |
|Shiva    |Chinu    |
+---------+---------+



In [0]:
# Get the number of partitions
df.rdd.getNumPartitions()

Out[24]: 8

The DataFrame is split into 8 partitions by default here, because the default number of partitions depends on the parallelism level of your Spark configuration.

In [0]:
df.rdd.glom().collect()

Out[25]: [[],
 [Row(name='Teja', nick_name='Sweety')],
 [],
 [Row(name='Bhavishya', nick_name='Bhavi')],
 [],
 [Row(name='Shiva', nick_name='Appu')],
 [],
 [Row(name='Shiva', nick_name='Chinu')]]

Here, we can see that there are indeed 8 partitions, but only 4 of them contain a Row.

Now, let's repartition our DataFrame so that the Rows are divided into only 2 partitions using the 'repartition' method:

In [0]:
df1 = df.repartition(2)
df1.rdd.getNumPartitions()

Out[26]: 2

Now let's look at how the rows are distributed in the repartitioned DataFrame (df1):

In [0]:
df1.rdd.glom().collect()

Out[27]: [[Row(name='Teja', nick_name='Sweety'),
  Row(name='Bhavishya', nick_name='Bhavi'),
  Row(name='Shiva', nick_name='Appu'),
  Row(name='Shiva', nick_name='Chinu')],
 []]

> **Note:** *There is no guarantee that the rows will be evenly distributed across the partitions.*
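
To make the uneven distribution noted above easier to spot, the per-partition row counts can be summarized. The following is a minimal sketch in plain Python, not part of the PySpark API: `partition_sizes` is a hypothetical helper, and the `glommed` list is a hand-written copy of the `glom().collect()` result shown above, with each Row written as a plain tuple.

```python
def partition_sizes(glommed):
    """Return the number of rows in each partition of a glom().collect() result."""
    return [len(part) for part in glommed]


# Hand-written copy of the df1 result shown above (Rows replaced by tuples).
glommed = [
    [('Teja', 'Sweety'), ('Bhavishya', 'Bhavi'), ('Shiva', 'Appu'), ('Shiva', 'Chinu')],
    [],
]

sizes = partition_sizes(glommed)
print(sizes)  # [4, 0]

# Count how many partitions ended up with no rows at all.
empty = sum(1 for n in sizes if n == 0)
print(empty)  # 1
```

With a live Spark session, the same helper could be fed `df1.rdd.glom().collect()` directly. For larger DataFrames, computing the counts on the executors with something like `df1.rdd.glom().map(len).collect()` is probably preferable, since it avoids pulling every row to the driver.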

## Partitioning DataFrame by columns

In [0]:
df.show(truncate=False)

+---------+---------+
|name     |nick_name|
+---------+---------+
|Teja     |Sweety   |
|Bhavishya|Bhavi    |
|Shiva    |Appu     |
|Shiva    |Chinu    |
+---------+---------+



In [0]:
# repartition DataFrame by the column 'name' 

df3 = df.repartition("name")
print(f'Number of Partitions = {df3.rdd.getNumPartitions()}')
df3.rdd.glom().collect()

Number of Partitions = 1
Out[31]: [[Row(name='Teja', nick_name='Sweety'),
  Row(name='Bhavishya', nick_name='Bhavi'),
  Row(name='Shiva', nick_name='Appu'),
  Row(name='Shiva', nick_name='Chinu')]]

When no partition count is given, Spark uses its configured default number of shuffle partitions (`spark.sql.shuffle.partitions`). Where adaptive query execution is enabled, small partitions may then be coalesced, which is likely why a single partition appears here.

In [0]:
# repartition DataFrame by the column 'name' into 2 partitions
df4 = df.repartition(2, "name")
print(f'Number of Partitions = {df4.rdd.getNumPartitions()}')
df4.rdd.glom().collect()

Number of Partitions = 2
Out[32]: [[],
 [Row(name='Teja', nick_name='Sweety'),
  Row(name='Bhavishya', nick_name='Bhavi'),
  Row(name='Shiva', nick_name='Appu'),
  Row(name='Shiva', nick_name='Chinu')]]

In [0]:
# repartition DataFrame by multiple columns
df5 = df.repartition(2, "name", "nick_name")
print(f'Number of Partitions = {df5.rdd.getNumPartitions()}')
df5.rdd.glom().collect()

Number of Partitions = 2
Out[34]: [[Row(name='Teja', nick_name='Sweety'), Row(name='Shiva', nick_name='Appu')],
 [Row(name='Bhavishya', nick_name='Bhavi'),
  Row(name='Shiva', nick_name='Chinu')]]
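
The column-based results above follow from hash partitioning: rows with the same values in the partitioning columns always land in the same partition. That is why both 'Shiva' rows stay together when partitioning by 'name' alone, but can be separated once 'nick_name' is part of the key. The sketch below imitates the idea in plain Python. It is an illustration under stated assumptions, not Spark's implementation: `assign_partition` and `hash_partition` are hypothetical helpers, and CRC32 stands in for Spark's own internal hash function, so the partition indices will not match Spark's output.

```python
import zlib


def assign_partition(key_values, num_partitions):
    """Map a tuple of string column values to a partition index (illustrative only)."""
    # Stand-in hash: CRC32 over the joined key values. Spark uses a different
    # hash internally, so these indices are not the ones Spark would produce.
    digest = zlib.crc32("\x00".join(key_values).encode("utf-8"))
    return digest % num_partitions


def hash_partition(rows, key_indexes, num_partitions):
    """Distribute rows into num_partitions buckets by hashing the key columns."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        partitions[assign_partition(key, num_partitions)].append(row)
    return partitions


rows = [('Shiva', 'Appu'), ('Teja', 'Sweety'), ('Bhavishya', 'Bhavi'), ('Shiva', 'Chinu')]

# Partition by 'name' only: both 'Shiva' rows share a key, so they share a partition.
by_name = hash_partition(rows, key_indexes=[0], num_partitions=2)

# Partition by ('name', 'nick_name'): the two 'Shiva' rows now have different keys.
by_both = hash_partition(rows, key_indexes=[0, 1], num_partitions=2)

print(by_name)
print(by_both)
```

Because the partition index depends only on the key, a partition can also end up empty, as in the `df.repartition(2, "name")` output above, when every key happens to hash to the same index.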