## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in a cell with the `%LANGUAGE` syntax (for example, `%sql`). Python, Scala, SQL, and R are all supported.

In [0]:
from pyspark.sql import functions as F  # F is used in later cells

# File location and type
file_location = "/FileStore/tables/people_10000.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

Index,User Id,First Name,Last Name,Sex,Email,Phone,Date of birth,Job Title
1,5f10e9D33fC5f2b,Sara,Mcguire,Female,tsharp@example.net,(971)643-6089x9160,1921-08-17,"Editor, commissioning"
2,751cD1cbF77e005,Alisha,Hebert,Male,vincentgarrett@example.net,+1-114-355-1841x78347,1969-06-28,Broadcast engineer
3,DcEFDB2D2e62bF9,Gwendolyn,Sheppard,Male,mercadojonathan@example.com,9017807728,1915-09-25,Industrial buyer
4,C88661E02EEDA9e,Kristine,Mccann,Female,lindsay55@example.com,+1-607-333-9911x59088,1978-07-27,Multimedia specialist
5,fafF1aBDebaB2a6,Bobby,Pittman,Female,blevinsmorgan@example.com,3739847538,1989-11-17,Planning and development surveyor
6,BdDb6C8Af309202,Calvin,Ramsey,Female,loretta85@example.com,001-314-829-5014x1792,2017-08-31,Therapeutic radiographer
7,FCdfFf08196f633,Collin,Allison,Male,yvaughn@example.net,(314)591-7413,1979-11-21,Administrator
8,356279dAa0F7CbD,Nicholas,Branch,Male,greerjimmy@example.net,+1-667-666-5867,2006-01-21,Fisheries officer
9,F563CcbFBfEcf5a,Emma,Robinson,Female,charleshiggins@example.org,166-234-6882x7457,2009-03-19,Haematologist
10,f2dceFc00F62542,Pedro,Cordova,Male,leslie08@example.com,(389)824-3204x8287,2008-06-17,Phytotherapist
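
The `header` and `sep` options decide whether the first line supplies the column names and how each line is split into fields. As a plain-Python illustration (not Spark code), the standard `csv` module applies the same two ideas to the first rows shown above. The variable names are illustrative only.

```python
import csv
import io

# First lines of the file, copied from the output above.
sample = (
    "Index,User Id,First Name,Last Name,Sex,Email,Phone,Date of birth,Job Title\n"
    '1,5f10e9D33fC5f2b,Sara,Mcguire,Female,tsharp@example.net,'
    '(971)643-6089x9160,1921-08-17,"Editor, commissioning"\n'
    "2,751cD1cbF77e005,Alisha,Hebert,Male,vincentgarrett@example.net,"
    "+1-114-355-1841x78347,1969-06-28,Broadcast engineer\n"
)

# Header on: the first line becomes the column names.
with_header = list(csv.DictReader(io.StringIO(sample), delimiter=","))

# Header off: every line, including the first, is treated as a data row.
without_header = list(csv.reader(io.StringIO(sample), delimiter=","))

# The quoted field keeps its embedded comma instead of being split in two.
print(with_header[0]["Job Title"])
print(len(with_header), len(without_header))
```

With the header off, the header line itself shows up as a row, which matches the `_c0` to `_c8` outputs later in this notebook.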


In [0]:
# Create a view or table

temp_table_name = "people_10000_csv"

df.createOrReplaceTempView(temp_table_name)

In [0]:
%sql

/* Query the created temp table in a SQL cell */

select * from `people_10000_csv`

_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8
Index,User Id,First Name,Last Name,Sex,Email,Phone,Date of birth,Job Title
1,5f10e9D33fC5f2b,Sara,Mcguire,Female,tsharp@example.net,(971)643-6089x9160,1921-08-17,"Editor, commissioning"
2,751cD1cbF77e005,Alisha,Hebert,Male,vincentgarrett@example.net,+1-114-355-1841x78347,1969-06-28,Broadcast engineer
3,DcEFDB2D2e62bF9,Gwendolyn,Sheppard,Male,mercadojonathan@example.com,9017807728,1915-09-25,Industrial buyer
4,C88661E02EEDA9e,Kristine,Mccann,Female,lindsay55@example.com,+1-607-333-9911x59088,1978-07-27,Multimedia specialist
5,fafF1aBDebaB2a6,Bobby,Pittman,Female,blevinsmorgan@example.com,3739847538,1989-11-17,Planning and development surveyor
6,BdDb6C8Af309202,Calvin,Ramsey,Female,loretta85@example.com,001-314-829-5014x1792,2017-08-31,Therapeutic radiographer
7,FCdfFf08196f633,Collin,Allison,Male,yvaughn@example.net,(314)591-7413,1979-11-21,Administrator
8,356279dAa0F7CbD,Nicholas,Branch,Male,greerjimmy@example.net,+1-667-666-5867,2006-01-21,Fisheries officer
9,F563CcbFBfEcf5a,Emma,Robinson,Female,charleshiggins@example.org,166-234-6882x7457,2009-03-19,Haematologist


In [0]:
# With this registered as a temp view, it will only be available to this particular notebook. If you'd like other users to be able to query this table, you can also create a table from the DataFrame.
# Once saved, this table will persist across cluster restarts as well as allow various users across different notebooks to query this data.
# To do so, choose your table name and uncomment the bottom line.

permanent_table_name = "people_10000_csv"

# df.write.format("parquet").saveAsTable(permanent_table_name)

In [0]:
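# 2. Identify duplicate customers based on first name and last name.
# _c2 and _c3 are Spark's default names for the third and fourth columns (first and last name) when the CSV is read without a header.
df.groupBy("_c2", "_c3").count().where("count > 1").sort("count", ascending=False).show()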

+---------+---------+-----+
|      _c2|      _c3|count|
+---------+---------+-----+
|Katherine|   Savage|    3|
|    Carly|  Terrell|    2|
|  Michele|  Douglas|    2|
|  Beverly|    Ochoa|    2|
|   Connie|  Roberts|    2|
|    Tasha|   Madden|    2|
|Alejandra| Petersen|    2|
| Clifford|   Zuniga|    2|
|    Heidi|    Ortiz|    2|
|      Joe|Robertson|    2|
|  Darrell|   Wagner|    2|
|   Albert|  Mccarty|    2|
|    Laura|  Shelton|    2|
|  Caitlin|   Madden|    2|
|    Daisy|    Weiss|    2|
|    Bruce|    Solis|    2|
|   Sergio|    Ponce|    2|
|    Cesar|   Newton|    2|
|     Juan|     Frye|    2|
|  Diamond|     Ball|    2|
+---------+---------+-----+
only showing top 20 rows
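
The `groupBy(...).count().where(...).sort(...)` chain counts rows per (first name, last name) pair, keeps the pairs seen more than once, and orders them by count. The same logic in plain Python, as a minimal sketch on hypothetical sample rows (the names below are made up, not taken from the file):

```python
from collections import Counter

# Hypothetical (first_name, last_name) pairs; not rows from the real file.
rows = [
    ("Ann", "Lee"),
    ("Bob", "Ray"),
    ("Ann", "Lee"),
    ("Cy", "Fox"),
    ("Ann", "Lee"),
    ("Bob", "Ray"),
]

counts = Counter(rows)  # groupBy + count
duplicates = [(name, n) for name, n in counts.items() if n > 1]  # where count > 1
duplicates.sort(key=lambda item: item[1], reverse=True)  # sort by count, descending

print(duplicates)
```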



In [0]:
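# 3. Assign Master ID to matching customers.
# Note: monotonically_increasing_id() gives every row its own unique ID (increasing, not necessarily consecutive), so rows with the same name do not share an ID here.
df_new = df.withColumn("Monotonically_increasing_id", F.monotonically_increasing_id())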

In [0]:
display(df_new)

_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,Monotonically_increasing_id
Index,User Id,First Name,Last Name,Sex,Email,Phone,Date of birth,Job Title,0
1,5f10e9D33fC5f2b,Sara,Mcguire,Female,tsharp@example.net,(971)643-6089x9160,1921-08-17,"Editor, commissioning",1
2,751cD1cbF77e005,Alisha,Hebert,Male,vincentgarrett@example.net,+1-114-355-1841x78347,1969-06-28,Broadcast engineer,2
3,DcEFDB2D2e62bF9,Gwendolyn,Sheppard,Male,mercadojonathan@example.com,9017807728,1915-09-25,Industrial buyer,3
4,C88661E02EEDA9e,Kristine,Mccann,Female,lindsay55@example.com,+1-607-333-9911x59088,1978-07-27,Multimedia specialist,4
5,fafF1aBDebaB2a6,Bobby,Pittman,Female,blevinsmorgan@example.com,3739847538,1989-11-17,Planning and development surveyor,5
6,BdDb6C8Af309202,Calvin,Ramsey,Female,loretta85@example.com,001-314-829-5014x1792,2017-08-31,Therapeutic radiographer,6
7,FCdfFf08196f633,Collin,Allison,Male,yvaughn@example.net,(314)591-7413,1979-11-21,Administrator,7
8,356279dAa0F7CbD,Nicholas,Branch,Male,greerjimmy@example.net,+1-667-666-5867,2006-01-21,Fisheries officer,8
9,F563CcbFBfEcf5a,Emma,Robinson,Female,charleshiggins@example.org,166-234-6882x7457,2009-03-19,Haematologist,9
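
The output above shows that every row receives a different ID, so two rows for the same customer are not linked. If the goal is one Master ID per matching (first name, last name) pair, the assignment logic could look like the following plain-Python sketch. It uses hypothetical sample rows and illustrates the idea only; it is not the notebook author's method. In Spark, one common way to express the same thing is `dense_rank()` over a window ordered by the name columns.

```python
# Hypothetical (first_name, last_name) pairs; not rows from the real file.
rows = [
    ("Ann", "Lee"),
    ("Bob", "Ray"),
    ("Ann", "Lee"),
    ("Cy", "Fox"),
]

master_ids = {}  # maps each distinct name pair to its Master ID
enriched = []
for first, last in rows:
    key = (first, last)
    # The first time a pair appears it gets the next free ID; repeats reuse it.
    master_id = master_ids.setdefault(key, len(master_ids))
    enriched.append((first, last, master_id))

print(enriched)
```

Both `("Ann", "Lee")` rows end up with the same ID, which is the property a Master ID needs and a per-row ID does not provide.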


In [0]:
#4.	Write back table enriched with Master ID back to another table/csv file
df_new.write.csv('updateFile.csv')

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<command-3727173430892863> in <cell line: 2>()
      1 #4.     Write back table enriched with Master ID back to another table/csv file
----> 2 df_new.write.csv('updateFile.csv')

/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
     46             start = time.perf_counter()
     47             try:
---> 48                 res = func(