## Import Functions

In [0]:
from pyspark.sql.functions import col, column, concat, lit

## Creating a DataFrame

In [0]:
df = spark.read.format("csv").option("header", True).load("/FileStore/tables/Auto-mpg/auto_mpg.csv")

## Function documentation

In [0]:
help(df.withColumn)

Help on method withColumn in module pyspark.sql.dataframe:

withColumn(colName, col) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` by adding a column or replacing the
    existing column that has the same name.
    
    The column expression must be an expression over this :class:`DataFrame`; attempting to add
    a column from some other :class:`DataFrame` will raise an error.
    
    .. versionadded:: 1.3.0
    
    Parameters
    ----------
    colName : str
        string, name of the new column.
    col : :class:`Column`
        a :class:`Column` expression for the new column.
    
    Notes
    -----
    This method introduces a projection internally. Therefore, calling it multiple
    times, for instance, via loops in order to add multiple columns can generate big
    plans which can cause performance issues and even `StackOverflowException`.
    To avoid this, use :func:`select` with the multiple columns at once.
    
    Examples
    

In [0]:
help(df.withColumnRenamed)

Help on method withColumnRenamed in module pyspark.sql.dataframe:

withColumnRenamed(existing, new) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` by renaming an existing column.
    This is a no-op if schema doesn't contain the given column name.
    
    .. versionadded:: 1.3.0
    
    Parameters
    ----------
    existing : str
        string, name of the existing column to rename.
    new : str
        string, new name of the column.
    
    Examples
    --------
    >>> df.withColumnRenamed('age', 'age2').collect()
    [Row(age2=2, name='Alice'), Row(age2=5, name='Bob')]



In [0]:
help(df.toDF)

Help on method toDF in module pyspark.sql.dataframe:

toDF(*cols) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` that with new specified column names
    
    Parameters
    ----------
    cols : str
        new column names
    
    Examples
    --------
    >>> df.toDF('f1', 'f2').collect()
    [Row(f1=2, f2='Alice'), Row(f1=5, f2='Bob')]



## Renaming columns

## Select

In [0]:
# Using alias with select; the column is referenced through the DataFrame (df[...])
df.select(df["`car name`"].alias("name")).show()

+--------------------+
|                name|
+--------------------+
|chevrolet chevell...|
|   buick skylark 320|
|  plymouth satellite|
|       amc rebel sst|
|         ford torino|
|    ford galaxie 500|
|    chevrolet impala|
|   plymouth fury iii|
|    pontiac catalina|
|  amc ambassador dpl|
| dodge challenger se|
|  plymouth 'cuda 340|
|chevrolet monte c...|
|buick estate wago...|
|toyota corona mar...|
|     plymouth duster|
|          amc hornet|
|       ford maverick|
|        datsun pl510|
|volkswagen 1131 d...|
+--------------------+
only showing top 20 rows



In [0]:
# Using alias with select; the column is referenced through col()
df.select(col("`car name`").alias("name")).show()

+--------------------+
|                name|
+--------------------+
|chevrolet chevell...|
|   buick skylark 320|
|  plymouth satellite|
|       amc rebel sst|
|         ford torino|
|    ford galaxie 500|
|    chevrolet impala|
|   plymouth fury iii|
|    pontiac catalina|
|  amc ambassador dpl|
| dodge challenger se|
|  plymouth 'cuda 340|
|chevrolet monte c...|
|buick estate wago...|
|toyota corona mar...|
|     plymouth duster|
|          amc hornet|
|       ford maverick|
|        datsun pl510|
|volkswagen 1131 d...|
+--------------------+
only showing top 20 rows



In [0]:
# Using alias with select to name a concatenated expression
df.select(concat(col("car name"), lit("-"), col("model year")).alias("car_details")).show()

+--------------------+
|         car_details|
+--------------------+
|chevrolet chevell...|
|buick skylark 320-70|
|plymouth satellit...|
|    amc rebel sst-70|
|      ford torino-70|
| ford galaxie 500-70|
| chevrolet impala-70|
|plymouth fury iii-70|
| pontiac catalina-70|
|amc ambassador dp...|
|dodge challenger ...|
|plymouth 'cuda 34...|
|chevrolet monte c...|
|buick estate wago...|
|toyota corona mar...|
|  plymouth duster-70|
|       amc hornet-70|
|    ford maverick-70|
|     datsun pl510-70|
|volkswagen 1131 d...|
+--------------------+
only showing top 20 rows



In [0]:
# Changing multiple column names
df.select(col("mpg").alias("mpg1"),col("car name").alias("name")).show()

+----+--------------------+
|mpg1|                name|
+----+--------------------+
|  18|chevrolet chevell...|
|  15|   buick skylark 320|
|  18|  plymouth satellite|
|  16|       amc rebel sst|
|  17|         ford torino|
|  15|    ford galaxie 500|
|  14|    chevrolet impala|
|  14|   plymouth fury iii|
|  14|    pontiac catalina|
|  15|  amc ambassador dpl|
|  15| dodge challenger se|
|  14|  plymouth 'cuda 340|
|  15|chevrolet monte c...|
|  14|buick estate wago...|
|  24|toyota corona mar...|
|  22|     plymouth duster|
|  18|          amc hornet|
|  21|       ford maverick|
|  27|        datsun pl510|
|  26|volkswagen 1131 d...|
+----+--------------------+
only showing top 20 rows



In [0]:
df.select(df["mpg"].alias("mpg1"),df["car name"].alias("name")).show()

+----+--------------------+
|mpg1|                name|
+----+--------------------+
|  18|chevrolet chevell...|
|  15|   buick skylark 320|
|  18|  plymouth satellite|
|  16|       amc rebel sst|
|  17|         ford torino|
|  15|    ford galaxie 500|
|  14|    chevrolet impala|
|  14|   plymouth fury iii|
|  14|    pontiac catalina|
|  15|  amc ambassador dpl|
|  15| dodge challenger se|
|  14|  plymouth 'cuda 340|
|  15|chevrolet monte c...|
|  14|buick estate wago...|
|  24|toyota corona mar...|
|  22|     plymouth duster|
|  18|          amc hornet|
|  21|       ford maverick|
|  27|        datsun pl510|
|  26|volkswagen 1131 d...|
+----+--------------------+
only showing top 20 rows
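
The `select` calls above return only the aliased columns. As a minimal sketch, assuming you want to rename a few columns and keep all the others in their original order, every column can be aliased in a single `select`. This also follows the note in the `withColumn` help text about preferring one `select` over repeated calls. The helper name `select_with_renames` and its `renames` dictionary are hypothetical, not part of PySpark.

```python
def select_with_renames(df, renames):
    """Return `df` with the columns listed in `renames` aliased.

    `df` is assumed to be a PySpark DataFrame and `renames` a dict
    mapping existing column names to new names (hypothetical helper).
    """
    # df[name] resolves names that contain spaces, such as "car name".
    exprs = [df[name].alias(renames.get(name, name)) for name in df.columns]
    # One select means one projection, however many columns are renamed.
    return df.select(*exprs)
```

For the DataFrame in this notebook, `select_with_renames(df, {"mpg": "mpg1", "car name": "name"}).show()` should display all nine columns with only those two renamed.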



## withColumn
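withColumn appends a new column as the last column of the DataFrame; when the name matches an existing column, that column is replaced in its original position. Using it to copy a column under a new name therefore does not keep the original column's position.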

In [0]:
# withColumn usage. select is used to display only the new column instead of the entire DataFrame.
df.withColumn("car_details", concat(col("car name"), lit("-"), col("model year"))).select("car_details").show()

+--------------------+
|         car_details|
+--------------------+
|chevrolet chevell...|
|buick skylark 320-70|
|plymouth satellit...|
|    amc rebel sst-70|
|      ford torino-70|
| ford galaxie 500-70|
| chevrolet impala-70|
|plymouth fury iii-70|
| pontiac catalina-70|
|amc ambassador dp...|
|dodge challenger ...|
|plymouth 'cuda 34...|
|chevrolet monte c...|
|buick estate wago...|
|toyota corona mar...|
|  plymouth duster-70|
|       amc hornet-70|
|    ford maverick-70|
|     datsun pl510-70|
|volkswagen 1131 d...|
+--------------------+
only showing top 20 rows



In [0]:
# Using withColumn to copy "model year" into a new column named "year"
df.withColumn("year",col("model year")).select("year","car name").show()

+----+--------------------+
|year|            car name|
+----+--------------------+
|  70|chevrolet chevell...|
|  70|   buick skylark 320|
|  70|  plymouth satellite|
|  70|       amc rebel sst|
|  70|         ford torino|
|  70|    ford galaxie 500|
|  70|    chevrolet impala|
|  70|   plymouth fury iii|
|  70|    pontiac catalina|
|  70|  amc ambassador dpl|
|  70| dodge challenger se|
|  70|  plymouth 'cuda 340|
|  70|chevrolet monte c...|
|  70|buick estate wago...|
|  70|toyota corona mar...|
|  70|     plymouth duster|
|  70|          amc hornet|
|  70|       ford maverick|
|  70|        datsun pl510|
|  70|volkswagen 1131 d...|
+----+--------------------+
only showing top 20 rows



In [0]:
#Expected Error as String is not a valid argument.
df.withColumn("year", "model year") 

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-2111398855956371> in <module>
      1 #Expected Error as String is not a valid argument.
----> 2 df.withColumn("year", "model year")

/databricks/spark/python/pyspark/sql/dataframe.py in withColumn(self, colName, col)
   2651         """
   2652         if not isinstance(col, Column):
-> 2653             raise TypeError("col should be Column")

In [0]:
df.withColumn("year", df["model year"]).show()

+---+---------+------------+----------+------+------------+----------+------+--------------------+----+
|mpg|cylinders|displacement|horsepower|weight|acceleration|model year|origin|            car name|year|
+---+---------+------------+----------+------+------------+----------+------+--------------------+----+
| 18|        8|         307|       130|  3504|          12|        70|     1|chevrolet chevell...|  70|
| 15|        8|         350|       165|  3693|        11.5|        70|     1|   buick skylark 320|  70|
| 18|        8|         318|       150|  3436|          11|        70|     1|  plymouth satellite|  70|
| 16|        8|         304|       150|  3433|          12|        70|     1|       amc rebel sst|  70|
| 17|        8|         302|       140|  3449|        10.5|        70|     1|         ford torino|  70|
| 15|        8|         429|       198|  4341|          10|        70|     1|    ford galaxie 500|  70|
| 14|        8|         454|       220|  4354|           9|     

In [0]:
# withColumn to transform the mpg column: cast it to int, then derive mpg+2 from the cast value.
df.withColumn("mpg", col("mpg").cast("int")).withColumn("mpg+2", col("mpg") + 2).select("mpg", "mpg+2").show()

+---+-----+
|mpg|mpg+2|
+---+-----+
| 18|   20|
| 15|   17|
| 18|   20|
| 16|   18|
| 17|   19|
| 15|   17|
| 14|   16|
| 14|   16|
| 14|   16|
| 15|   17|
| 15|   17|
| 14|   16|
| 15|   17|
| 14|   16|
| 24|   26|
| 22|   24|
| 18|   20|
| 21|   23|
| 27|   29|
| 26|   28|
+---+-----+
only showing top 20 rows
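
To make the ordering note at the top of this section concrete, here is a minimal sketch of "renaming" with `withColumn`: copy the column under the new name, then drop the original. The helper name `rename_via_with_column` is hypothetical; it only chains the `withColumn` and `drop` DataFrame methods.

```python
def rename_via_with_column(df, existing, new):
    """Rename `existing` to `new` by copying and dropping (hypothetical helper).

    `df` is assumed to be a PySpark DataFrame.
    """
    # withColumn appends the copy as the last column ...
    with_copy = df.withColumn(new, df[existing])
    # ... and drop removes the original, so the data ends up at the end.
    return with_copy.drop(existing)
```

With the columns of this notebook, `rename_via_with_column(df, "mpg", "mpg1").columns` should list `mpg1` last rather than first, which is why `withColumnRenamed` is the better fit when the position matters.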



## withColumnRenamed
withColumnRenamed keeps the column order: the renamed column stays in its original position.

In [0]:
df.columns

Out[54]: ['mpg',
 'cylinders',
 'displacement',
 'horsepower',
 'weight',
 'acceleration',
 'model year',
 'origin',
 'car name']

In [0]:
df.withColumnRenamed("mpg", "mpg1").show()

+----+---------+------------+----------+------+------------+----------+------+--------------------+
|mpg1|cylinders|displacement|horsepower|weight|acceleration|model year|origin|            car name|
+----+---------+------------+----------+------+------------+----------+------+--------------------+
|  18|        8|         307|       130|  3504|          12|        70|     1|chevrolet chevell...|
|  15|        8|         350|       165|  3693|        11.5|        70|     1|   buick skylark 320|
|  18|        8|         318|       150|  3436|          11|        70|     1|  plymouth satellite|
|  16|        8|         304|       150|  3433|          12|        70|     1|       amc rebel sst|
|  17|        8|         302|       140|  3449|        10.5|        70|     1|         ford torino|
|  15|        8|         429|       198|  4341|          10|        70|     1|    ford galaxie 500|
|  14|        8|         454|       220|  4354|           9|        70|     1|    chevrolet impala|


In [0]:
# Changing multiple column names by chaining withColumnRenamed
df.withColumnRenamed("mpg", "mpg1").withColumnRenamed("car name", "name").show()

+----+---------+------------+----------+------+------------+----------+------+--------------------+
|mpg1|cylinders|displacement|horsepower|weight|acceleration|model year|origin|                name|
+----+---------+------------+----------+------+------------+----------+------+--------------------+
|  18|        8|         307|       130|  3504|          12|        70|     1|chevrolet chevell...|
|  15|        8|         350|       165|  3693|        11.5|        70|     1|   buick skylark 320|
|  18|        8|         318|       150|  3436|          11|        70|     1|  plymouth satellite|
|  16|        8|         304|       150|  3433|          12|        70|     1|       amc rebel sst|
|  17|        8|         302|       140|  3449|        10.5|        70|     1|         ford torino|
|  15|        8|         429|       198|  4341|          10|        70|     1|    ford galaxie 500|
|  14|        8|         454|       220|  4354|           9|        70|     1|    chevrolet impala|
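
When the list of renames grows, writing the chain by hand gets tedious. A minimal sketch, assuming the renames are kept in a dictionary, folds the same `withColumnRenamed` call over each pair with `functools.reduce`. The helper name `rename_columns` is hypothetical.

```python
from functools import reduce


def rename_columns(df, renames):
    """Apply withColumnRenamed once per (old, new) pair in `renames`.

    `df` is assumed to be a PySpark DataFrame (hypothetical helper).
    """
    # Each step renames one column in place; per the help text above,
    # a name that is not in the schema is a no-op.
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        renames.items(),
        df,
    )
```

`rename_columns(df, {"mpg": "mpg1", "car name": "name"})` is equivalent to the chained call in the cell above. Each call still adds a step to the query plan, so for a large number of columns a single `select` or `toDF` may be preferable.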


## Changing multiple column names - toDF

In [0]:
# toDF takes one new name per existing column, in order; selecting all columns first is not needed.
df.toDF(*["car_mpg","car_cylinders","car_displacement","car_horsepower","car_weight","car_acceleration","car_model year","car_origin","car_name"]).show()

+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+--------------------+
|car_mpg|car_cylinders|car_displacement|car_horsepower|car_weight|car_acceleration|car_model year|car_origin|            car_name|
+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+--------------------+
|     18|            8|             307|           130|      3504|              12|            70|         1|chevrolet chevell...|
|     15|            8|             350|           165|      3693|            11.5|            70|         1|   buick skylark 320|
|     18|            8|             318|           150|      3436|              11|            70|         1|  plymouth satellite|
|     16|            8|             304|           150|      3433|              12|            70|         1|       amc rebel sst|
|     17|            8|             302|           140|      3449|            10.5|

In [0]:
# Changing a subset of column names
column_name = ["mpg", "cylinders", "displacement"]
desired_column_names = ["car_mpg", "car_cylinders", "car_displacement"]

In [0]:
df.select(column_name).toDF(*desired_column_names).show()

+-------+-------------+----------------+
|car_mpg|car_cylinders|car_displacement|
+-------+-------------+----------------+
|     18|            8|             307|
|     15|            8|             350|
|     18|            8|             318|
|     16|            8|             304|
|     17|            8|             302|
|     15|            8|             429|
|     14|            8|             454|
|     14|            8|             440|
|     14|            8|             455|
|     15|            8|             390|
|     15|            8|             383|
|     14|            8|             340|
|     15|            8|             400|
|     14|            8|             455|
|     24|            4|             113|
|     22|            6|             198|
|     18|            6|             199|
|     21|            6|             200|
|     27|            4|              97|
|     26|            4|              97|
+-------+-------------+----------------+
only showing top
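
Because `toDF` needs one name per column, the new names can be generated from `df.columns` instead of typed out. The following is a minimal sketch with a hypothetical helper, `prefixed_names`. As an assumption that differs from the cell above, it also replaces spaces with underscores, so `model year` becomes `car_model_year`.

```python
def prefixed_names(columns, prefix="car_"):
    """Build new column names for toDF (hypothetical helper).

    Spaces become underscores, and the prefix is added unless the
    name already starts with it.
    """
    names = []
    for name in columns:
        name = name.replace(" ", "_")
        names.append(name if name.startswith(prefix) else prefix + name)
    return names
```

A call such as `df.toDF(*prefixed_names(df.columns))` would then rename every column in one step while keeping the column order.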