# Introduction to DataFrames Lab
## Distinct Articles

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions

In the cell provided below, write the code necessary to count the number of distinct articles in our data set.
0. Copy and paste all you like from the previous notebook.
0. Read in our parquet files.
0. Apply the necessary transformations.
0. Assign the count to the variable `totalArticles`.
0. Run the last cell to verify that the data was loaded correctly.

**Bonus**

Recall from the beginning of the previous notebook that reading in our parquet files triggers a job, because Spark has to inspect the files to infer their schema. To avoid that job:
0. Define a schema that matches the data we are working with.
0. Update the read operation to use the schema.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
# If "/mnt/training" was not mounted correctly by "./Includes/Classroom-Setup",
# unmount it so that the next cell ("./Includes/Dataset-Mounts-New") can mount it again.
try:
    files = dbutils.fs.ls("/mnt/training")
except Exception:
    dbutils.fs.unmount('/mnt/training/')


/mnt/training/ has been unmounted.


In [0]:
%run "./Includes/Dataset-Mounts-New"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Show Your Work

In [0]:
(source, sasEntity, sasToken) = getAzureDataSource()
spark.conf.set(sasEntity, sasToken)

# Override the Azure source returned above with the mounted path
source = '/mnt/training'
path = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"

In [0]:
# TODO
# Replace <<FILL_IN>> with your code. 

df = (spark                    # Our SparkSession & Entry Point
  .read                        # Our DataFrameReader
  .parquet(path)               # Read in the parquet files
  .select("article")           # Reduce the columns to just the one
  .distinct()                  # Produce a unique set of values
)
totalArticles = df.count()     # Identify the total number of records remaining.

print("Distinct Articles: {0:,}".format(totalArticles))

Distinct Articles: 1,783,138
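
To make the chain above concrete, here is a minimal plain-Python sketch (no Spark involved) of what selecting one column, deduplicating, and counting computes. The sample rows and the names `rows`, `distinct_articles`, and `sample_total` are made up for illustration; they are not part of the lab's data set or API.

```python
# Made-up sample rows standing in for the parquet data (hypothetical values).
rows = [
    {"project": "en", "article": "Apache_Spark", "requests": 12, "bytes_served": 34567},
    {"project": "en", "article": "Parquet", "requests": 3, "bytes_served": 8910},
    {"project": "en.m", "article": "Apache_Spark", "requests": 7, "bytes_served": 2345},
    {"project": "en", "article": "DataFrame", "requests": 1, "bytes_served": 678},
]

# Rough equivalent of .select("article"): keep only the one column.
articles = [row["article"] for row in rows]

# Rough equivalent of .distinct(): collapse duplicate values.
distinct_articles = set(articles)

# Rough equivalent of .count(): the number of unique values remaining.
sample_total = len(distinct_articles)

print("Distinct Articles: {0:,}".format(sample_total))
```

"Apache_Spark" appears twice in the sample but is counted once, so the sketch prints `Distinct Articles: 3`. Spark performs the same logical steps, distributed across the cluster.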


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Verify Your Work
Run the following cell to verify that your `DataFrame` was created properly.

In [0]:
expected = 1783138
assert totalArticles == expected, "Expected the total to be " + str(expected) + " but found " + str(totalArticles)


## Bonus

In [0]:
# Bonus: read with a user-defined schema so that Spark does not need a job to infer it

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

parqSchema = StructType(
  [
    StructField('project', StringType(), True),
    StructField('article', StringType(), True),
    StructField('requests', IntegerType(), True),
    StructField('bytes_served', LongType(), True)
  ]
)

df = (spark                    # Our SparkSession & Entry Point
  .read                        # Our DataFrameReader
  .schema(parqSchema)          # Use the schema defined above instead of inferring it
  .parquet(path)               # Read in the parquet files
  .select("article")           # Reduce the columns to just the one
  .distinct()                  # Produce a unique set of values
)
totalArticles = df.count()     # Identify the total number of records remaining.

print("Distinct Articles: {0:,}".format(totalArticles))

Distinct Articles: 1,783,138
