# Reading Data Lab
* The goal of this lab is to put into practice some of what you have learned about reading data with Apache Spark.
* The instructions are provided below along with empty cells for you to do your work.
* At the bottom of this notebook are additional cells that will help verify that your work is accurate.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions
0. Start with the file **dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv**, a file you have not worked with yet.
0. Read in the data and assign it to a `DataFrame` named **testDF**.
0. Run the last cell to verify that the data was loaded correctly and to print its schema.
0. The one untestable requirement is that you should be able to create the `DataFrame` and print its schema **without** executing a single job.

**Note:** For the test to pass, the following columns should have the specified data types:
 * **prev_id**: integer
 * **curr_id**: integer
 * **n**: integer
 * **prev_title**: string
 * **curr_title**: string
 * **type**: string
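
The column list above can also be written as a DDL-formatted schema string, which `DataFrameReader.schema` accepts in place of a `StructType` in recent Spark versions. The following is a minimal sketch in plain Python that only builds the string; the variable names `expected_columns` and `ddl_schema` are illustrative and not part of the lab, and the commented-out reader call assumes an active `spark` session and the `fileName` variable defined later in this notebook.

```python
# Column names and Spark SQL type names, in file order (taken from the note above).
expected_columns = [
    ("prev_id", "INT"),
    ("curr_id", "INT"),
    ("n", "INT"),
    ("prev_title", "STRING"),
    ("curr_title", "STRING"),
    ("type", "STRING"),
]

# Backticks quote each name so that a column such as `type` cannot be
# mistaken for a SQL keyword when the string is parsed.
ddl_schema = ", ".join(f"`{name}` {dtype}" for name, dtype in expected_columns)
print(ddl_schema)

# Assumption: inside the notebook, the string could be handed to the reader, e.g.
# testDF = spark.read.schema(ddl_schema).option("header", "true").option("sep", "\t").csv(fileName)
```

Either form declares the schema up front, which is what lets Spark skip reading the file when the `DataFrame` is created.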

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
# If "/mnt/training" cannot be listed after running "./Includes/Classroom-Setup", unmount it
# so that the next cell ("./Includes/Dataset-Mounts-New") can mount it again.
try:
    files = dbutils.fs.ls("/mnt/training")
except Exception:
    dbutils.fs.unmount('/mnt/training/')


/mnt/training/ has been unmounted.


In [0]:
%run "./Includes/Dataset-Mounts-New"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Show Your Work

In [0]:
# TODO

# Confirm that the file exists, then look at its first bytes to find the header and the delimiter.
fileName = "dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv"
display(dbutils.fs.ls(fileName))
dbutils.fs.head(fileName)



path,name,size,modificationTime
dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv,2015_02_clickstream.tsv,1322171548,1509997283000


[Truncated to first 65536 bytes]
Out[8]: 'prev_id\tcurr_id\tn\tprev_title\tcurr_title\ttype\n\t3632887\t121\tother-google\t!!\tother\n\t3632887\t93\tother-wikipedia\t!!\tother\n\t3632887\t46\tother-empty\t!!\tother\n\t3632887\t10\tother-other\t!!\tother\n64486\t3632887\t11\t!_(disambiguation)\t!!\tother\n2061699\t2556962\t19\tLouden_Up_Now\t!!!_(album)\tlink\n\t2556962\t25\tother-empty\t!!!_(album)\tother\n\t2556962\t16\tother-google\t!!!_(album)\tother\n\t2556962\t44\tother-wikipedia\t!!!_(album)\tother\n64486\t2556962\t15\t!_(disambiguation)\t!!!_(album)\tlink\n600744\t2556962\t297\t!!!\t!!!_(album)\tlink\n\t6893310\t11\tother-empty\t!Hero_(album)\tother\n1921683\t6893310\t26\t!Hero\t!Hero_(album)\tlink\n\t6893310\t16\tother-wikipedia\t!Hero_(album)\tother\n\t6893310\t23\tother-google\t!Hero_(album)\tother\n8127304\t22602473\t16\tJericho_Rosales\t!Oka_Tokat\tlink\n35978874\t22602473\t20\tList_of_telenovelas_of_ABS-CBN\t!Oka_Tokat\tlink\n\t22602473\t57\tother-google\t!Oka_Tokat\tother
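
The `head` output above shows three things that drive the reader options: the first line is a header, fields are separated by tabs, and **prev_id** is empty on some rows. The sketch below checks this with Python's standard `csv` module on a few lines copied from that output. It does not use Spark; the `convert` helper and the `INT_COLUMNS` set are hypothetical names introduced only for this illustration, and mapping an empty field to `None` mirrors what is assumed to be Spark's default handling of empty values in typed columns.

```python
import csv
import io

# A few lines copied from the head output above (header plus three records).
sample = (
    "prev_id\tcurr_id\tn\tprev_title\tcurr_title\ttype\n"
    "\t3632887\t121\tother-google\t!!\tother\n"
    "64486\t3632887\t11\t!_(disambiguation)\t!!\tother\n"
    "2061699\t2556962\t19\tLouden_Up_Now\t!!!_(album)\tlink\n"
)

# Split on tabs and treat quote characters as ordinary text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header = next(reader)
rows = list(reader)

INT_COLUMNS = {"prev_id", "curr_id", "n"}

def convert(name, value):
    """Cast integer columns, mapping an empty field to None."""
    if name in INT_COLUMNS:
        return int(value) if value else None
    return value

records = [
    {name: convert(name, value) for name, value in zip(header, row)}
    for row in rows
]

print(header)
print(records[0])
```

The header yields the six column names from the instructions, and the first record has no value for **prev_id**, which is why every column in the schema is declared nullable.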

In [0]:
%fs ls /mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv

path,name,size,modificationTime
dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv,2015_02_clickstream.tsv,1322171548,1509997283000


In [0]:
%fs head /mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv

In [0]:
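from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Declare the schema up front so that Spark does not have to scan the file to infer it.
testSchema = StructType(
  [
    StructField("prev_id", IntegerType(), True),
    StructField("curr_id", IntegerType(), True),
    StructField("n", IntegerType(), True),
    StructField("prev_title", StringType(), True),
    StructField("curr_title", StringType(), True),
    StructField("type", StringType(), True)
  ]
)

# The file is tab-separated and its first line is a header.
testDF = (spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(testSchema)
  .csv(fileName))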

testDF.printSchema()

root
 |-- prev_id: integer (nullable = true)
 |-- curr_id: integer (nullable = true)
 |-- n: integer (nullable = true)
 |-- prev_title: string (nullable = true)
 |-- curr_title: string (nullable = true)
 |-- type: string (nullable = true)



In [0]:
display(testDF)

prev_id,curr_id,n,prev_title,curr_title,type
,3632887.0,121,other-google,!!,other
,3632887.0,93,other-wikipedia,!!,other
,3632887.0,46,other-empty,!!,other
,3632887.0,10,other-other,!!,other
64486.0,3632887.0,11,!_(disambiguation),!!,other
2061699.0,2556962.0,19,Louden_Up_Now,!!!_(album),link
,2556962.0,25,other-empty,!!!_(album),other
,2556962.0,16,other-google,!!!_(album),other
,2556962.0,44,other-wikipedia,!!!_(album),other
64486.0,2556962.0,15,!_(disambiguation),!!!_(album),link


##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Verify Your Work
Run the following cell to verify that your `DataFrame` was created properly.

**Remember:** This should execute without triggering a single job.

In [0]:
testDF.printSchema()

columns = testDF.dtypes
assert len(columns) == 6, "Expected 6 columns but found " + str(len(columns))

assert columns[0][0] == "prev_id",    "Expected column 0 to be \"prev_id\" but found \"" + columns[0][0] + "\"."
assert columns[0][1] == "int",        "Expected column 0 to be of type \"int\" but found \"" + columns[0][1] + "\"."

assert columns[1][0] == "curr_id",    "Expected column 1 to be \"curr_id\" but found \"" + columns[1][0] + "\"."
assert columns[1][1] == "int",        "Expected column 1 to be of type \"int\" but found \"" + columns[1][1] + "\"."

assert columns[2][0] == "n",          "Expected column 2 to be \"n\" but found \"" + columns[2][0] + "\"."
assert columns[2][1] == "int",        "Expected column 2 to be of type \"int\" but found \"" + columns[2][1] + "\"."

assert columns[3][0] == "prev_title", "Expected column 3 to be \"prev_title\" but found \"" + columns[3][0] + "\"."
assert columns[3][1] == "string",     "Expected column 3 to be of type \"string\" but found \"" + columns[3][1] + "\"."

assert columns[4][0] == "curr_title", "Expected column 4 to be \"curr_title\" but found \"" + columns[4][0] + "\"."
assert columns[4][1] == "string",     "Expected column 4 to be of type \"string\" but found \"" + columns[4][1] + "\"."

assert columns[5][0] == "type",       "Expected column 5 to be \"type\" but found \"" + columns[5][0] + "\"."
assert columns[5][1] == "string",     "Expected column 5 to be of type \"string\" but found \"" + columns[5][1] + "\"."

print("Congratulations, all tests passed... that is if no jobs were triggered :-)\n")


root
 |-- prev_id: integer (nullable = true)
 |-- curr_id: integer (nullable = true)
 |-- n: integer (nullable = true)
 |-- prev_title: string (nullable = true)
 |-- curr_title: string (nullable = true)
 |-- type: string (nullable = true)

Congratulations, all tests passed... that is if no jobs were triggered :-)
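
The assertions in the verification cell repeat the same two checks for each column. As a sketch only, the same comparison can be written as a loop over an expected list. `check_dtypes` and `expected` are hypothetical names, not part of the lab, and the example runs against a stand-in list rather than `testDF.dtypes` so that it works outside a Spark session.

```python
def check_dtypes(actual, expected):
    """Compare (name, type) pairs, such as those returned by DataFrame.dtypes."""
    assert len(actual) == len(expected), (
        f"Expected {len(expected)} columns but found {len(actual)}"
    )
    for i, ((name, dtype), (exp_name, exp_dtype)) in enumerate(zip(actual, expected)):
        assert name == exp_name, (
            f'Expected column {i} to be "{exp_name}" but found "{name}".'
        )
        assert dtype == exp_dtype, (
            f'Expected column {i} to be of type "{exp_dtype}" but found "{dtype}".'
        )

expected = [
    ("prev_id", "int"),
    ("curr_id", "int"),
    ("n", "int"),
    ("prev_title", "string"),
    ("curr_title", "string"),
    ("type", "string"),
]

# Stand-in for testDF.dtypes; in the notebook this would be check_dtypes(testDF.dtypes, expected).
sample_dtypes = list(expected)
check_dtypes(sample_dtypes, expected)
print("All column names and types match.")
```

Like the original cell, this only inspects schema metadata, so it should not trigger a job when run against `testDF.dtypes`.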

