## Spark DataFrame
You can create a PySpark DataFrame manually with the `toDF()` and `createDataFrame()` methods. The two take different signatures, and between them they can build a DataFrame from an existing RDD, a list, or another DataFrame.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("testApp") \
    .getOrCreate()

#### Create DataFrame from RDD

| No. | Task | Input | Command | Output | Comment |
|:----|:-----|:-----:|:-------:|:------:|:--------|
| 1 | Create DataFrame from RDD | RDD | `rdd.toDF()` | DF | |
| 2 | Create DataFrame from RDD with columns | RDD | `rdd.toDF(col_list)` | DF | |
| 3 | Create DataFrame from RDD using createDataFrame | RDD | `spark.createDataFrame(rdd)` | DF | |
| 4 | Create DataFrame from RDD using createDataFrame with columns | RDD | `spark.createDataFrame(rdd, schema=cols)` | DF | |

##### Using toDF() 

In [2]:
data = ((1, 2, 3),(4, 5, 6),(7, 8, 9))
rdd = spark.sparkContext.parallelize(data)
type(rdd)

pyspark.rdd.RDD

In [3]:
df = rdd.toDF()
print(type(df))

<class 'pyspark.sql.dataframe.DataFrame'>


In [4]:
df.show()

+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+



In [5]:
df.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: long (nullable = true)
 |-- _3: long (nullable = true)



##### Using toDF() with column names

In [6]:
cols = ["COL1","COL2","COL3"]
df = rdd.toDF(cols)
df.show()

+----+----+----+
|COL1|COL2|COL3|
+----+----+----+
|   1|   2|   3|
|   4|   5|   6|
|   7|   8|   9|
+----+----+----+



In [7]:
df.printSchema()

root
 |-- COL1: long (nullable = true)
 |-- COL2: long (nullable = true)
 |-- COL3: long (nullable = true)



##### Using createDataFrame()
`createDataFrame()` can also be chained with `toDF()` to name the columns: `spark.createDataFrame(rdd).toDF(*columns)`.

In [8]:
df = spark.createDataFrame(rdd)

In [9]:
df.show()

+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+



In [10]:
df.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: long (nullable = true)
 |-- _3: long (nullable = true)



##### Using createDataFrame() with column names

In [11]:
df = spark.createDataFrame(rdd, schema=cols)

In [12]:
df.show()

+----+----+----+
|COL1|COL2|COL3|
+----+----+----+
|   1|   2|   3|
|   4|   5|   6|
|   7|   8|   9|
+----+----+----+



In [13]:
df.printSchema()

root
 |-- COL1: long (nullable = true)
 |-- COL2: long (nullable = true)
 |-- COL3: long (nullable = true)



#### Create DataFrame from List/Tuple

In [14]:
data = [(1,"John","25"), (2,"Sam","26"), (3,"Saul", "30"),(4,"Jorah", "30")]

In [15]:
df = spark.createDataFrame(data)

In [16]:
df.show()

+---+-----+---+
| _1|   _2| _3|
+---+-----+---+
|  1| John| 25|
|  2|  Sam| 26|
|  3| Saul| 30|
|  4|Jorah| 30|
+---+-----+---+



In [17]:
columns = ["id","name","age"]
df = spark.createDataFrame(data,schema=columns)

In [18]:
df.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1| John| 25|
|  2|  Sam| 26|
|  3| Saul| 30|
|  4|Jorah| 30|
+---+-----+---+



#### Create DataFrame with the Row type

In [19]:
from pyspark.sql import Row

In [20]:
data = [
    Row(ID=1, NAME="John", AGE=20),
    Row(ID=2, NAME="Sam", AGE=25),
]

In [21]:
data

[Row(AGE=20, ID=1, NAME='John'), Row(AGE=25, ID=2, NAME='Sam')]

In [22]:
df = spark.createDataFrame(data)

In [23]:
df.show()

+---+---+----+
|AGE| ID|NAME|
+---+---+----+
| 20|  1|John|
| 25|  2| Sam|
+---+---+----+



In [24]:
df.printSchema()

root
 |-- AGE: long (nullable = true)
 |-- ID: long (nullable = true)
 |-- NAME: string (nullable = true)



#### Create DataFrame using namedtuple

In [25]:
from collections import namedtuple

In [26]:
cust = namedtuple("CUSTOMER",["CUSTOMER_ID","CUSTOMER_NAME","CUSTOMER_ADDR","CUSTOMER_EMAIL","CUSTOMER_PHONE"])

In [27]:
data = [
    cust(1, "James", "639 Main St", "james@company.com", "504-845-1427"),
    cust(2, "John", "#45 Main St", "john@company.com", "804-895-1427"),
    cust(3, "Sam", "34 Center St", "sam@company.com", "704-895-1427"),
    cust(4, "John", "322 New Horizon", "john@company.com", "604-895-1427"),
]

In [28]:
df = spark.createDataFrame(data)

In [29]:
df.show()

+-----------+-------------+---------------+-----------------+--------------+
|CUSTOMER_ID|CUSTOMER_NAME|  CUSTOMER_ADDR|   CUSTOMER_EMAIL|CUSTOMER_PHONE|
+-----------+-------------+---------------+-----------------+--------------+
|          1|        James|    639 Main St|james@company.com|  504-845-1427|
|          2|         John|    #45 Main St| john@company.com|  804-895-1427|
|          3|          Sam|   34 Center St|  sam@company.com|  704-895-1427|
|          4|         John|322 New Horizon| john@company.com|  604-895-1427|
+-----------+-------------+---------------+-----------------+--------------+



In [30]:
df.printSchema()

root
 |-- CUSTOMER_ID: long (nullable = true)
 |-- CUSTOMER_NAME: string (nullable = true)
 |-- CUSTOMER_ADDR: string (nullable = true)
 |-- CUSTOMER_EMAIL: string (nullable = true)
 |-- CUSTOMER_PHONE: string (nullable = true)



#### Create DataFrame with StructType Schema

In [31]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

# Each StructField gives a column name, its type, and whether nulls are allowed.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.createDataFrame([(1, "john"), (2, "sam")], schema)

In [32]:
df.show()

+---+----+
| id|name|
+---+----+
|  1|john|
|  2| sam|
+---+----+



In [33]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)



##### Creating an empty DataFrame

In [34]:
field = [
    StructField("COL1", FloatType(), True),
    StructField("COL2", StringType(), True),
]
schema = StructType(field)
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

In [35]:
empty_df.show()

+----+----+
|COL1|COL2|
+----+----+
+----+----+



In [36]:
empty_df.printSchema()

root
 |-- COL1: float (nullable = true)
 |-- COL2: string (nullable = true)



#### Create a single-column DataFrame

In [37]:
df = spark.createDataFrame([1,2,3,4,5], "integer").toDF("numbers")

In [38]:
df.show()

+-------+
|numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
|      5|
+-------+



In [39]:
df = spark.createDataFrame([1,2,3,4,5], IntegerType()).toDF("numbers")

In [40]:
df.show()

+-------+
|numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
|      5|
+-------+



In [41]:
df = spark.createDataFrame(["x","y","z"], StringType()).toDF("char_data")

In [42]:
df.show()

+---------+
|char_data|
+---------+
|        x|
|        y|
|        z|
+---------+



When column names are supplied instead of a type, each element should be a tuple and the schema should be a sequence of names:

In [43]:
df = spark.createDataFrame([(1,), (2,), (2,)], ["num"])

In [44]:
df.show()

+---+
|num|
+---+
|  1|
|  2|
|  2|
+---+



Converting RDD elements into tuple form: `toDF()` expects row-like records, so wrap each scalar in a one-element tuple first.

In [45]:
myRdd = spark.sparkContext.parallelize([1.0,2.0,3.0,4.0])

In [46]:
df = myRdd.map(lambda x: (x, )).toDF(["COL1"])

In [47]:
df.show()

+----+
|COL1|
+----+
| 1.0|
| 2.0|
| 3.0|
| 4.0|
+----+



Another way, using `Row`:

In [48]:
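from pyspark.sql import Row

# Row("val") builds a Row factory with a single field named "val" (any column name works);
# mapping it over the RDD wraps each scalar in a Row.
row = Row("val")
df_other = myRdd.map(row).toDF()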

In [49]:
df_other.show()

+---+
|val|
+---+
|1.0|
|2.0|
|3.0|
|4.0|
+---+

