# DataFrame Column Class

**Data Source**
* One hour of Pagecounts from the English Wikimedia projects captured August 5, 2016, at 12:00 PM UTC.
* Size on Disk: ~23 MB
* Type: Compressed Parquet File
* More Info: <a href="https://dumps.wikimedia.org/other/pagecounts-raw" target="_blank">Page view statistics for Wikimedia projects</a>

**Technical Accomplishments:**
* Continue exploring the `DataFrame` set of APIs.
* Introduce the `Column` class

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

In [None]:
from pyspark.sql import SparkSession

In [None]:
# Initialize Spark Session
spark = (SparkSession.builder
         .appName("Read Parquet Data")
         .getOrCreate())

In [None]:
print(spark)
spark

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) **The Data Source**

We will be using the same data source as our previous notebook.

As such, we can go ahead and start by creating our initial `DataFrame`.

In [None]:
parquetFile = "../dataset/pagecounts/staging_parquet_en_only_clean/"

In [None]:
pagecountsEnAllDF = (spark  # Our SparkSession & Entry Point
  .read                     # Our DataFrameReader
  .parquet(parquetFile)     # Returns an instance of DataFrame
  .cache()                  # cache the data
)
print(pagecountsEnAllDF)

Let's take another look at the number of records in our `DataFrame`

In [None]:
total = pagecountsEnAllDF.count()

print("Record Count: {0:,}".format( total ))

Now let's take another peek at our data...

In [None]:
pagecountsEnAllDF.show()

As we view the data, we can see that there is no real rhyme or reason as to how the data is sorted.
* We cannot even tell if the column **project** is sorted - `show()` displays only the first 20 of some 2.3 million records.
* The column **article** is not sorted as evident by the article **A_Little_Boy_Lost** appearing between a bunch of articles starting with numbers and symbols.
* The column **requests** is clearly not sorted.
* And our **bytes_served** contains nothing but zeros.

So let's start by sorting our data. In doing this, we can answer the following question:

What are the top 10 most requested articles?

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) orderBy(..) & sort(..)

If you look at the API docs, `orderBy(..)` is described like this:
> Returns a new Dataset sorted by the given expressions.

Both `orderBy(..)` and `sort(..)` arrange all the records in the `DataFrame` as specified.
* Like `distinct()` and `dropDuplicates()`, `sort(..)` and `orderBy(..)` are aliases for each other.
  * `sort(..)` appealing to functional programmers.
  * `orderBy(..)` appealing to developers with an SQL background.
* There are two variants of each of these two methods:
  * `orderBy(Column)`
  * `orderBy(String)`
  * `sort(Column)`
  * `sort(String)`

All we need to do now is sort our previous `DataFrame`.

In [None]:
sortedDF = (pagecountsEnAllDF
  .orderBy("requests")
)
sortedDF.show(10, False)

As you can see, this is not the order we want: `orderBy(..)` sorts in ascending order by default, so the least requested articles come first.

We need to reverse the sort.

One might conclude that we could make a call like this:

`pagecountsEnAllDF.orderBy("requests desc")`

Try it in the cell below:

In [None]:
# Uncomment and try this:
# pagecountsEnAllDF.orderBy("requests desc")


Why does this not work?
* The `DataFrame` API is built upon an SQL engine.
* As a result, there is a lot of similarity between this API and SQL syntax in general.
* The problem is that `orderBy(..)` expects the name of the column.
* What we specified was an SQL expression in the form of **requests desc**.
* What we need is a way to programmatically express such an expression.
* This leads us to the second variant, `orderBy(Column)` and more specifically, the class `Column`.

***Note:*** *Some of the calls in the `DataFrame` API actually accept SQL expressions.*<br/>
*While these functions will appear in the docs as `someFunc(String)`, it's very*<br/>
*important to thoroughly read and understand what the parameter actually represents.*

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) The Column Class

A `Column` object encompasses more than just the name of a column: it also carries column-level transformations, such as sorting in descending order.

The first question to ask is how do I create a `Column` object?

In Python we have these options:

In [None]:
# Python (like Scala) supports accessing a column from a known DataFrame
columnA = pagecountsEnAllDF["requests"]

# If we import from ...sql.functions, we get a few more options.
# Explicit imports avoid shadowing Python built-ins such as sum and max.
from pyspark.sql.functions import col, expr, lit

# This uses the col(..) function
columnC = col("requests")

# This uses the expr(..) function which parses an SQL Expression
columnD = expr("a + 1")

# This uses the lit(..) to create a literal (constant) value.
columnE = lit("abc")

# Print each Column object (its representation shows the underlying expression)
print("columnA: {}".format(columnA))
print("columnC: {}".format(columnC))
print("columnD: {}".format(columnD))
print("columnE: {}".format(columnE))


In the case of Scala, the cleanest version is the **$"column-name"** variant.

In the case of Python, the cleanest version is the **col("column-name")** variant.

So with that, we can now create a `Column` object, and apply the `desc()` operation to it:

In [None]:
column = col("requests").desc()

# Print the Column object, including its sort direction
print("column:", column)


And now we can piece it all together...

In [None]:
sortedDescDF = (pagecountsEnAllDF
  .orderBy( col("requests").desc() )
)  
sortedDescDF.show(10, False) # The top 10 is good enough for now

It should come as no surprise that the **Main_Page** (in both the Wikipedia and Wikimedia projects) is the most requested page.

Following shortly after that is **Special:Search**, Wikipedia's search page.

### Review Column Class

The `Column` objects provide us a programmatic way to build up SQL-ish expressions.

Besides the `Column.desc()` operation we used above, we have a number of other operations that can be performed on a `Column` object.

Here is a preview of the various functions - we will cover many of these as we progress through the class:

**Column Functions**
* Various mathematical functions such as add, subtract, multiply & divide
* Various bitwise operators such as AND, OR & XOR
* Various null tests such as `isNull()`, `isNotNull()` & `isNaN()`.
* `as(..)`, `alias(..)` & `name(..)` - Returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
* `between(..)` - A boolean expression that is evaluated to true if the value of this expression is between the given columns.
* `cast(..)` & `astype(..)` - Convert the column into type dataType.
* `asc(..)` - Returns a sort expression based on the ascending order of the given column name.
* `desc(..)` - Returns a sort expression based on the descending order of the given column name.
* `startswith(..)` - String starts with.
* `endswith(..)` - String ends with another string literal.
* `isin(..)` - A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
* `like(..)` - SQL like expression
* `rlike(..)` - SQL RLIKE expression (LIKE with Regex).
* `substr(..)` - An expression that returns a substring.
* `when(..)` & `otherwise(..)` - Evaluates a list of conditions and returns one of multiple possible result expressions.

The complete list of functions differs from language to language.

# DataFrame Column Expressions

**Technical Accomplishments:**
* Continue exploring the `DataFrame` set of APIs.
* Continue to work with the `Column` class and introduce the `Row` class
* Introduce the transformations...
  * `orderBy(..)`
  * `sort(..)`
  * `filter(..)`
  * `where(..)`
* Introduce the actions...
  * `collect()`
  * `take(n)`
  * `first()`
  * `head()`

Let's look at the data once more...

In [None]:
from pyspark.sql.functions import *

sortedDescDF = (pagecountsEnAllDF
  .orderBy( col("requests").desc() )
)  
sortedDescDF.show(10, False)


In looking at the data, we can see multiple Wikipedia projects.

What if we want to look at only the main Wikipedia project, **en**?

For that, we will need to filter out some records.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) filter(..) & where(..)


If you look at the API docs, `filter(..)` and `where(..)` are described like this:
> Filters rows using the given condition.

Both `filter(..)` and `where(..)` return a new dataset containing only those records for which the specified condition is true.
* Like `distinct()` and `dropDuplicates()`, `filter(..)` and `where(..)` are aliases for each other.
  * `filter(..)` appealing to functional programmers.
  * `where(..)` appealing to developers with an SQL background.
* Like `orderBy(..)` there are two variants of these two methods:
  * `filter(Column)`
  * `filter(String)`
  * `where(Column)`
  * `where(String)`
* Unlike `orderBy(String)` which requires a column name, `filter(String)` and `where(String)` both expect an SQL expression.

Let's start by looking at the variant using an SQL expression:


### filter(..) & where(..) w/SQL Expression

In [None]:
whereDF = (sortedDescDF
  .where( "project = 'en'" )
)
whereDF.show(10, False)

Now that we are only looking at the main Wikipedia articles, we get a better picture of the most popular articles on Wikipedia.

Next, let's take a look at the second variant that takes a `Column` object as its first parameter:


### filter(..) & where(..) w/Column

In [None]:
filteredDF = (sortedDescDF
  .filter( col("project") == "en")
)
filteredDF.show(10, False)

### The Solution...


With that behind us, we can clearly **see** the top ten most requested articles.

But what if we need to **programmatically** extract the value of the most requested article's name and its number of requests?

That is to say, how do we get the first record, and from there...
* the value of the second column, **article**, as a string...
* the value of the third column, **requests**, as an integer...

Before we proceed, let's drop the **bytes_served** column and apply a few more filters to get rid of **Main_Page**, the **-** entry, and anything starting with **Special:** - they're just noise to us.

In [None]:
articlesDF = (filteredDF
  .drop("bytes_served")
  .filter( col("article") != "Main_Page")
  .filter( col("article") != "-")
  .filter( ~col("article").startswith("Special:") )
)
articlesDF.show(10, False)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) first() & head()

If you look at the API docs, both `first(..)` and `head(..)` are described like this:
> Returns the first row.

Just like `distinct()` & `dropDuplicates()` are aliases for each other, so are `first(..)` and `head(..)`.

However, unlike `distinct()` & `dropDuplicates()` which are **transformations** `first(..)` and `head(..)` are **actions**.

Once all processing is done, these methods return the object backing the first record.

In the case of `DataFrames` (both Scala and Python) that object is a `Row`.

In the case of `Datasets` (the strongly typed version of `DataFrames` in Scala and Java), the object may be a `Row`, a `String`, a `Customer`, a `PendingApplication` or any number of custom objects.

Focusing strictly on the `DataFrame` API for now, let's take a look at a call with `first()`:

In [None]:
firstRow = articlesDF.first()

print(firstRow)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) The Row Class

Now that we have a reference to the object backing the first row (or any row), we can use it to extract the data for each column.

Before we do, let's take a look at the API docs for the `Row` class.

At the heart of it, we are simply going to ask for the value of the object in column N: via `Row.get(i)` in Scala, or via `row[i]` (or `row['name']`) in Python.

Python being a dynamically typed language, the declared type of the return value is of no real consequence.

However, Scala is going to return an object of type `Any`. In Java, this would be an object of type `Object`.

In Scala at least, if the data type matters, as it does when performing mathematical operations on the value, we need to call one of the other methods:
* `getAs[T](i):T`
* `getDate(i):Date`
* `getString(i):String`
* `getInt(i):Int`
* `getLong(i):Long`

We can now put it all together to get the number of requests for the most requested article:

In [None]:
article = firstRow['article']
total = firstRow['requests']

print("Most Requested Article: \"{0}\" with {1:,} requests".format( article, total ))

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) collect()

If you look at the API docs, `collect(..)` is described like this:
> Returns an array that contains all of Rows in this Dataset.

`collect()` returns a collection of the specific type backing each record of the `DataFrame`.
* In the case of Python, this is always the `Row` object.
* In the case of Scala, this is also a `Row` object.
* If the `DataFrame` was converted to a `Dataset` the backing object would be the user-specified object.

Building on our last example, let's take the top 10 records and print them out.

In [None]:
rows = (articlesDF
  .limit(10)           # We only want the first 10 records.
  .collect()           # The action returning all records in the DataFrame
)

# rows is a Python list of Row objects. Now in the driver,
# we can just loop over the list and print 'em out.

listItems = ""
for row in rows:
  article = row['article']
  total = row['requests']
  listItems += "    <li><b>{}</b> {:0,d} requests</li>\n".format(article, total)
  
html = """
<body>
  <h1>Top 10 Articles</h1>
  <ol>
    %s
  </ol>
</body>
""" % (listItems.strip())

print(html)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) take(n)

If you look at the API docs, `take(n)` is described like this:
> Returns the first n rows in the Dataset.

`take(n)` returns a collection of the first N records of the specific type backing each record of the `DataFrame`.
* In the case of Python, this is always the `Row` object.
* In the case of Scala, this is also a `Row` object.
* If the `DataFrame` was converted to a `Dataset` the backing object would be the user-specified object.

In short, it's the same basic function as `collect()` except you specify as the first parameter the number of records to return.

In [None]:
rows = articlesDF.take(10)

# rows is a Python list of Row objects. Now in the driver,
# we can just loop over the list and print 'em out.

listItems = ""
for row in rows:
  article = row['article']
  total = row['requests']
  listItems += "    <li><b>{}</b> {:0,d} requests</li>\n".format(article, total)
  
html = """
<body>
  <h1>Top 10 Articles</h1>
  <ol>
    %s
  </ol>
</body>
""" % (listItems.strip())

print(html)

# UNCOMMENT FOR A PRETTIER PRESENTATION
# displayHTML(html)

In [None]:
spark.stop()