## Column Functions

**orderBy(..) & sort(..)**
> Returns a new Dataset sorted by the given expressions.  
> Both methods perform the same operation; they are aliases for each other.

*There are two variants of each of these two methods:*
  * `orderBy(Column)`
  * `orderBy(String)`
  * `sort(Column)`
  * `sort(String)`

## The Column Class

A `Column` object encapsulates more than just the name of a column; it also carries column-level transformations, such as sorting in descending order.

In [0]:
countryDF=spark.read.table("country_lookup_csv")

In [0]:
countryDF.show()

In [0]:
countryDF.orderBy("continent").show()

**To sort in descending order**  
> The code below throws an error, `'DataFrame' object has no attribute 'desc'`, because `orderBy(..)` returns a DataFrame and `desc()` is a method of `Column`.

In [0]:
countryDF.orderBy("continent").desc().show() #Error: 'DataFrame' object has no attribute 'desc'

In [0]:
countryDF.col("continent").desc().show() #Error: 'DataFrame' object has no attribute 'col'

Instead, create a `Column` object and pass it to `orderBy(..)`, either through a variable or nested directly inside the call.

In [0]:
# Importing col from pyspark.sql.functions lets us build Column objects by name.
# An explicit import avoids shadowing Python built-ins such as sum and max.
from pyspark.sql.functions import col
continentCol = col("continent").desc()

In [0]:
type(continentCol) #pyspark.sql.column.Column

In [0]:
sortedCountryDF = countryDF.orderBy(continentCol) # same as countryDF.orderBy(col("continent").desc())
sortedCountryDF.show() # show() returns None, so call it separately from the assignment

### Column Class - Additional Info

The `Column` objects provide us a programmatic way to build up SQL-ish expressions.

Besides the `Column.desc()` operation we used above, we have a number of other operations that can be performed on a `Column` object.

Here is a preview of the various functions - we will cover many of these as we progress through the class:

**Column Functions**
* Various mathematical functions such as add, subtract, multiply & divide
* Various bitwise operators such as AND, OR & XOR
* Various null tests such as `isNull()`, `isNotNull()` & `isNaN()`.
* `as(..)`, `alias(..)` & `name(..)` - Returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
* `between(..)` - A boolean expression that is evaluated to true if the value of this expression is between the given columns.
* `cast(..)` & `astype(..)` - Convert the column into type dataType.
* `asc(..)` - Returns a sort expression based on the ascending order of the given column name.
* `desc(..)` - Returns a sort expression based on the descending order of the given column name.
* `startswith(..)` - String starts with.
* `endswith(..)` - String ends with another string literal.
* `isin(..)` - A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
* `like(..)` - SQL like expression
* `rlike(..)` - SQL RLIKE expression (LIKE with Regex).
* `substr(..)` - An expression that returns a substring.
* `when(..)` & `otherwise(..)` - Evaluates a list of conditions and returns one of multiple possible result expressions.

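The complete list of functions differs from language to language. For example, `as(..)` and `isNaN()` belong to the Scala API, while PySpark offers `alias(..)` and the `isnan(..)` function from `pyspark.sql.functions`.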

## `filter(..)` and `where(..)`
> Filters rows using the given condition.
> Both return a new dataset containing only those records for which the specified condition is true.

* Like `distinct()` and `dropDuplicates()`, `filter(..)` and `where(..)` are aliases for each other.
  * `filter(..)` appeals to functional programmers.
  * `where(..)` appeals to developers with an SQL background.
* Like `orderBy(..)` there are two variants of these two methods:
  * `filter(Column)`
  * `filter(String)`
  * `where(Column)`
  * `where(String)`
* Unlike `orderBy(String)` which requires a column name, `filter(String)` and `where(String)` both expect an SQL expression.

In [0]:
countryDF.where("country == 'India'").show() # using SQL expression

In [0]:
countryDF.filter("country == 'India'").show() #using SQL expression

In [0]:
countryDF.where(col("country") == 'India').show() # using column 

In [0]:
countryDF.filter(col("country") == 'India').show() # using column 

In [0]:
countryDF.filter(col("continent") == 'Asia').orderBy(col("country").desc()).show() # using column

## first() & head()

> Returns the first row.

Just like `distinct()` & `dropDuplicates()` are aliases for each other, so are `first(..)` and `head(..)`.

However, unlike `distinct()` & `dropDuplicates()`, which are **transformations**, `first(..)` and `head(..)` are **actions**.

Once all processing is done, these methods return the object backing the first record.

In the case of `DataFrames` (both Scala and Python) that object is a `Row`.

In the case of `Datasets` (the strongly typed version of `DataFrames` in Scala and Java), the object may be a `Row`, a `String`, a `Customer`, a `PendingApplication` or any number of custom objects.

In [0]:
countryDF.filter(col("continent") == 'Asia').orderBy(col("country").desc()).head() #returns a Row object - Row(country='Yeman, ...')

In [0]:
countryDF.first() #returns the first row. Not first N rows (doesn't take any arguments)

In [0]:
populationDF=countryDF.orderBy(col("population").desc()).head()

In [0]:
type(populationDF) #pyspark.sql.types.Row

In [0]:
populationDF #displays the row

In [0]:
populationDF["country"] #accessing elements in the row object.

## collect()

> Returns an array that contains all of the Rows in this Dataset.

`collect()` returns a collection of the specific type backing each record of the `DataFrame`.
* In the case of Python, this is always the `Row` object.
* In the case of Scala, this is also a `Row` object.
* If the `DataFrame` was converted to a `Dataset` the backing object would be the user-specified object.

In [0]:
continents = (countryDF  # collecting all continents and the number of countries in each
  .groupBy("continent")
  .count()
  .collect()             # The action returning all records in the DataFrame
)



In [0]:
type(continents) #list

In [0]:
continents #continents holds the list of row objects

In [0]:
# continents is a list of Row objects. Now in the driver,
# we can just loop over the list and print 'em out.

listItems = ""
for row in continents:
  continent = row['continent']
  countries_count = row['count']
  listItems += "    <li><b>{}</b> {:0,d} countries</li>\n".format(continent, countries_count)

# Build the page once, after the loop has gathered every list item
html = """
<body>
  <h1>Continents and Number of countries</h1>
  <ol>
    %s
  </ol>
</body>
""" % (listItems.strip())

#print(html)

# displayHTML is provided by Databricks notebooks; elsewhere, use print(html) instead
displayHTML(html)

## take(n)

> Returns the first n rows in the Dataset.

`take(n)` returns a collection of the first N records of the specific type backing each record of the `DataFrame`.
* In the case of Python, this is always the `Row` object.
* In the case of Scala, this is also a `Row` object.
* If the `DataFrame` was converted to a `Dataset` the backing object would be the user-specified object.

In short, it's the same basic function as `collect()` except you specify as the first parameter the number of records to return.

In [0]:
countryDF.take(10) #returns the list of row objects

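Use **collect()** and **take()** to return records from a DataFrame to the driver of the cluster. Because `collect()` brings every record into the driver's memory, prefer `take(n)` when the DataFrame is large.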