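There are several ways to select or access column values on a DataFrame:
1. df.select('col_name1', 'col_name2')
2. df.select(df.col_name1, df.col_name2)
3. df.select(df['col_name1'], df['col_name2'])
4. col() from pyspark.sql.functions
5. expr() or selectExpr()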

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

In [2]:
spark = SparkSession.builder.appName('Tip 2 select column').getOrCreate()

In [3]:
schema_df1 = "order_id long, order_date string, cust_id long, status string"

In [4]:
df1 = spark.read.format('csv').schema(schema_df1)\
        .load('data/orders.csv')

In [5]:
df1.show(3)

+--------+----------+-------+---------------+
|order_id|order_date|cust_id|         status|
+--------+----------+-------+---------------+
|       1|2013-07-25|  11599|         CLOSED|
|       2|2013-07-25|    256|PENDING_PAYMENT|
|       3|2013-07-25|  12111|       COMPLETE|
+--------+----------+-------+---------------+
only showing top 3 rows



1. df.select("col1", "col2", ...)
- Shows only the listed columns, or all of them.
- Creates a new DataFrame from the existing one.
- Reorders columns: they appear in the order they are listed.

In [6]:
output1 = df1.select("cust_id","order_id","status")

In [7]:
output1.show(3)

+-------+--------+---------------+
|cust_id|order_id|         status|
+-------+--------+---------------+
|  11599|       1|         CLOSED|
|    256|       2|PENDING_PAYMENT|
|  12111|       3|       COMPLETE|
+-------+--------+---------------+
only showing top 3 rows



2. df.select(df1.col1, df2.col1, df1.col3, ...)
- Reorders columns, as in method 1.
- Useful after joining other DataFrames/tables, where the DataFrame prefix shows which one a column comes from.
- Can create new columns from existing ones.


In [20]:
output2 = df1.select(df1.order_id, df1.status, (df1.cust_id + 1).alias('customer'))

In [21]:
output2.show(3)

+--------+---------------+--------+
|order_id|         status|customer|
+--------+---------------+--------+
|       1|         CLOSED|   11600|
|       2|PENDING_PAYMENT|     257|
|       3|       COMPLETE|   12112|
+--------+---------------+--------+
only showing top 3 rows



3. df.select(df1['col1'], df1['col2'], ...)
- Works the same way as method 2, using bracket syntax instead of attribute access.

In [22]:
output3 = df1.select(df1['cust_id'], df1['status'],
                     (df1['cust_id']+1).alias('new_customer'))

In [23]:
output3.show(3)

+-------+---------------+------------+
|cust_id|         status|new_customer|
+-------+---------------+------------+
|  11599|         CLOSED|       11600|
|    256|PENDING_PAYMENT|         257|
|  12111|       COMPLETE|       12112|
+-------+---------------+------------+
only showing top 3 rows



4. Using pyspark.sql.functions.col
- col() builds a Column from its name; the result can be used in select(), where(), and aggregates such as sum().
- A Column offers many built-in methods, such as like, between, isNull, contains, ...

In [35]:
# like() without a wildcard behaves as an exact match on 'COMPLETE'
output4 = df1.select(col('cust_id'), col('order_date'),
                     col('order_id').alias('order'), col('status')) \
    .where(col('status').like('COMPLETE') & col('cust_id').between(600, 1300))
output4.show(7)

+-------+----------+-----+--------+
|cust_id|order_date|order|  status|
+-------+----------+-----+--------+
|    656|2013-07-25|   28|COMPLETE|
|   1148|2013-07-25|   63|COMPLETE|
|   1265|2013-07-25|   83|COMPLETE|
|    610|2013-07-26|  126|COMPLETE|
|   1104|2013-07-26|  186|COMPLETE|
|   1137|2013-07-26|  258|COMPLETE|
|    815|2013-07-26|  271|COMPLETE|
+-------+----------+-----+--------+
only showing top 7 rows



5. Using expr() or selectExpr()
- select(col("col1"), expr("expr1").alias("name1"), ...)
- selectExpr("col1", "expr1 as name1", "expr2", ...)

In [41]:
output6 = df1.select(col('status'),
                     expr("order_id * 2").alias('new_order'),
                     expr("cust_id / 10").alias('divide_cust'))

In [42]:
output6.show(5)

+---------------+---------+-----------+
|         status|new_order|divide_cust|
+---------------+---------+-----------+
|         CLOSED|        2|     1159.9|
|PENDING_PAYMENT|        4|       25.6|
|       COMPLETE|        6|     1211.1|
|         CLOSED|        8|      882.7|
|       COMPLETE|       10|     1131.8|
+---------------+---------+-----------+
only showing top 5 rows



In [51]:
output7 = df1.selectExpr('status', "order_id * 2 as new_order", "cust_id/10 as divide_cust")

In [52]:
output7.show(5)

+---------------+---------+-----------+
|         status|new_order|divide_cust|
+---------------+---------+-----------+
|         CLOSED|        2|     1159.9|
|PENDING_PAYMENT|        4|       25.6|
|       COMPLETE|        6|     1211.1|
|         CLOSED|        8|      882.7|
|       COMPLETE|       10|     1131.8|
+---------------+---------+-----------+
only showing top 5 rows

