## Overview

This notebook will show you how to create and query a table or DataFrame that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is a Databricks File System that allows you to store data for querying inside of Databricks. This notebook assumes that you have a file already inside of DBFS that you would like to read from.

This notebook is written in **Python** so the default cell type is Python. However, you can use different languages by using the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
sc = spark.sparkContext

# File location and type
file_location = "/FileStore/tables/Moby_Dick.txt"
file_type = "txt"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = " "

# The applied options are for CSV files. For other file types, these will be ignored.
# df = spark.read.format(file_type) \
#   .option("inferSchema", infer_schema) \
#   .option("header", first_row_is_header) \
#   .option("sep", delimiter) \
#   .load(file_location)

rdd = sc.textFile(file_location)

rdd.collect()

In [0]:
import re

def replace_n_mark(str):

    replace_char = ['.', '"', '-', ',', '*', '(', ')', ';', '[', ']']
    str = re.sub(f"{replace_char}", '', str)
    
    return (str.lower(), 1)

In [0]:
words = rdd.flatMap(lambda x: x.split(' ')) \
            .filter(lambda x: x != '') \
            .map(replace_n_mark) \
            .reduceByKey(lambda x, y: x + y)
words.collect()

In [0]:
# write to file
cols = ['word', 'count']
df = words.toDF(cols)
df.show()

In [0]:
df.createOrReplaceTempView('moby_dick')

In [0]:
%sql

select * from moby_dick
where count > 1000

word,count
of,6668
this,1283
is,1605
at,1312
was,1577
in,4115
i,1724
he,1683
his,2485
as,1713


In [0]:
%sql

select * from moby_dick
where word in ('a', 'an', 'the', 'of', 'he', 'she')
sort by count asc

word,count
an,592
he,1683
of,6668
she,114
a,4658
the,14413


In [0]:
# Save file to HDFS
df.write.format("parquet") \
        .mode("overwrite") \
        .saveAsTable('moby_dick_csv')

In [0]:
# to delete a file from DBFS, uncomment the below code and provide path

path = '/FileStore/tables/appl_stock-1.csv'
# dbutils.fs.rm(path, True)