# DataFrames

A DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

More data source options are described at https://spark.apache.org/docs/latest/sql-data-sources.html

In [0]:
data = """id,name,age,city,salary
1,John Doe,28,New York,70000
2,Jane Smith,34,Los Angeles,80000
3,Emily Davis,22,Chicago,60000
4,Michael Brown,45,Houston,95000
5,Jessica Taylor,29,Philadelphia,75000
"""
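# Write the sample data to the driver's working directory
# (/databricks/driver on Databricks, which is the path read in the next cell).
with open('test.csv', 'w') as fp:
    fp.write(data)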

In [0]:
df = spark.read.csv('file:/databricks/driver/test.csv', header=True)
display(df)

id,name,age,city,salary
1,John Doe,28,New York,70000
2,Jane Smith,34,Los Angeles,80000
3,Emily Davis,22,Chicago,60000
4,Michael Brown,45,Houston,95000
5,Jessica Taylor,29,Philadelphia,75000
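
The same read can also be written with the generic reader API, which is the form the data sources page linked above uses for other formats. The following is a minimal sketch, assuming the notebook-provided `spark` session from the cell above; the names `csv_options` and `df_generic` are illustrative and not part of the original notebook.

```python
# Reader options collected in one dict; "header" and "sep" are standard CSV options.
csv_options = {"header": "true", "sep": ","}

# `spark` is assumed to be the session the notebook provides, as in the cell above.
# The guard only keeps this snippet harmless when run outside a notebook.
if "spark" in globals():
    df_generic = (
        spark.read.format("csv")      # choose the data source by name
        .options(**csv_options)       # apply all reader options at once
        .load("file:/databricks/driver/test.csv")
    )
    df_generic.printSchema()
```

Swapping the string passed to `format` (for example to `"json"` or `"parquet"`) selects a different built-in source, each with its own set of options.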


In [0]:
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: string (nullable = true)



All the fields are read as strings, but id, age, and salary should be integers. We can use the inferSchema option to let Spark detect the column types.

In [0]:
df = spark.read.csv('file:/databricks/driver/test.csv', header=True, inferSchema=True)
display(df)
df.printSchema()


id,name,age,city,salary
1,John Doe,28,New York,70000
2,Jane Smith,34,Los Angeles,80000
3,Emily Davis,22,Chicago,60000
4,Michael Brown,45,Houston,95000
5,Jessica Taylor,29,Philadelphia,75000


root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: integer (nullable = true)
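
Schema inference needs an extra pass over the data, so when the column types are known in advance they can be declared instead. The sketch below builds a DDL-formatted schema string and passes it through the `schema` argument of `spark.read.csv`. The column types are chosen by hand to match the inferred schema above, and the names `columns`, `schema_ddl`, and `df_typed` are illustrative assumptions.

```python
# Column names and Spark SQL types for test.csv (declared by hand, not inferred).
columns = [
    ("id", "INT"),
    ("name", "STRING"),
    ("age", "INT"),
    ("city", "STRING"),
    ("salary", "INT"),
]

# Build a DDL-formatted schema string of the form "id INT, name STRING, ...".
schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in columns)

# `spark` is assumed to be the session the notebook provides, as in the cells above.
if "spark" in globals():
    df_typed = spark.read.csv(
        "file:/databricks/driver/test.csv", header=True, schema=schema_ddl
    )
    df_typed.printSchema()
```

With an explicit schema, the resulting types no longer depend on what the data happens to contain.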



Data can also be read from DBFS.
Databricks provides many sample datasets under dbfs:/databricks-datasets.

In [0]:
%fs ls dbfs:/databricks-datasets/timeseries/Fires/


path,name,size,modificationTime
dbfs:/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv,Fire_Department_Calls_for_Service.csv,1892561692,1573520565000
dbfs:/databricks-datasets/timeseries/Fires/SFFire_readme.md,SFFire_readme.md,1222,1573520565000


In [0]:
df = spark.read.csv('dbfs:/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
display(df.take(10))
df.printSchema()

Call Number,Unit ID,Incident Number,Call Type,Call Date,Watch Date,Received DtTm,Entry DtTm,Dispatch DtTm,Response DtTm,On Scene DtTm,Transport DtTm,Hospital DtTm,Call Final Disposition,Available DtTm,Address,City,Zipcode of Incident,Battalion,Station Area,Box,Original Priority,Priority,Final Priority,ALS Unit,Call Type Group,Number of Alarms,Unit Type,Unit sequence in call dispatch,Fire Prevention District,Supervisor District,Neighborhooods - Analysis Boundaries,Location,RowID
192910017,E11,19125164,Alarms,2019-10-18,2019-10-17,10/18/2019 12:03:52 AM,10/18/2019 12:06:59 AM,10/18/2019 12:07:05 AM,10/18/2019 12:08:28 AM,10/18/2019 12:11:10 AM,,,Fire,10/18/2019 12:33:57 AM,24TH ST/VALENCIA ST,San Francisco,94110,B06,11,5525,3,3,3,True,Alarm,1,ENGINE,1,6,9,Mission,POINT (-122.42066480228367 37.75210364574824),192910017-E11
192910018,B10,19125165,Alarms,2019-10-18,2019-10-17,10/18/2019 12:05:56 AM,10/18/2019 12:07:27 AM,10/18/2019 12:09:49 AM,,,,,Other,10/18/2019 12:10:14 AM,3300 Block of 23RD ST,San Francisco,94110,B06,11,552,3,3,3,False,Alarm,1,CHIEF,1,6,9,Mission,POINT (-122.4202535645237 37.75368162954947),192910018-B10
192910018,T07,19125165,Alarms,2019-10-18,2019-10-17,10/18/2019 12:05:56 AM,10/18/2019 12:07:27 AM,10/18/2019 12:09:49 AM,,,,,Other,10/18/2019 12:10:14 AM,3300 Block of 23RD ST,San Francisco,94110,B06,11,552,3,3,3,False,Alarm,1,TRUCK,3,6,9,Mission,POINT (-122.4202535645237 37.75368162954947),192910018-T07
192910025,B04,19125166,Alarms,2019-10-18,2019-10-17,10/18/2019 12:09:02 AM,10/18/2019 12:09:02 AM,10/18/2019 12:09:02 AM,10/18/2019 12:09:02 AM,10/18/2019 12:09:02 AM,,,Fire,10/18/2019 12:21:52 AM,3300 Block of FILLMORE ST,San Francisco,94123,B04,16,3554,3,3,3,False,Alarm,1,CHIEF,1,4,2,Marina,POINT (-122.43607739030332 37.80034056356869),192910025-B04
192910034,E01,19125167,Structure Fire,2019-10-18,2019-10-17,10/18/2019 12:12:39 AM,10/18/2019 12:12:39 AM,10/18/2019 12:12:48 AM,10/18/2019 12:13:52 AM,10/18/2019 12:16:16 AM,,,Fire,10/18/2019 12:16:25 AM,7TH ST/MISSION ST,San Francisco,94103,B02,1,2315,3,3,3,True,Alarm,1,ENGINE,1,2,6,South of Market,POINT (-122.41093657380038 37.779211684542084),192910034-E01
192910034,T01,19125167,Structure Fire,2019-10-18,2019-10-17,10/18/2019 12:12:39 AM,10/18/2019 12:12:39 AM,10/18/2019 12:12:48 AM,10/18/2019 12:14:28 AM,,,,Fire,10/18/2019 12:16:18 AM,7TH ST/MISSION ST,San Francisco,94103,B02,1,2315,3,3,3,False,Alarm,1,TRUCK,2,2,6,South of Market,POINT (-122.41093657380038 37.779211684542084),192910034-T01
192910039,76,19125168,Medical Incident,2019-10-18,2019-10-17,10/18/2019 12:14:32 AM,10/18/2019 12:14:32 AM,10/18/2019 12:15:10 AM,10/18/2019 12:15:25 AM,10/18/2019 12:32:14 AM,10/18/2019 12:56:35 AM,10/18/2019 01:15:56 AM,Code 2 Transport,10/18/2019 01:48:10 AM,1600 Block of MCKINNON AVE,San Francisco,94124,B10,17,6515,A,2,2,True,Non Life-threatening,1,MEDIC,1,10,10,Bayview Hunters Point,POINT (-122.38972310330021 37.73607882495912),192910039-76
192910048,T08,19125169,Alarms,2019-10-18,2019-10-17,10/18/2019 12:20:25 AM,10/18/2019 12:21:44 AM,10/18/2019 12:21:51 AM,10/18/2019 12:24:30 AM,10/18/2019 12:26:44 AM,,,Fire,10/18/2019 12:40:41 AM,100 Block of BERRY ST,San Francisco,94158,B03,8,2171,3,3,3,False,Alarm,1,TRUCK,3,3,6,Mission Bay,POINT (-122.3921894505535 37.77663138541027),192910048-T08
192910057,78,19125170,Medical Incident,2019-10-18,2019-10-17,10/18/2019 12:23:58 AM,10/18/2019 12:26:34 AM,10/18/2019 12:26:42 AM,10/18/2019 12:27:20 AM,10/18/2019 12:34:19 AM,10/18/2019 12:53:44 AM,10/18/2019 01:01:18 AM,Code 3 Transport,10/18/2019 01:52:13 AM,800 Block of FOERSTER ST,San Francisco,94127,B09,15,8247,3,3,3,True,Potentially Life-Threatening,1,MEDIC,2,9,7,West of Twin Peaks,POINT (-122.44882464734825 37.736080487699894),192910057-78
192910057,E15,19125170,Medical Incident,2019-10-18,2019-10-17,10/18/2019 12:23:58 AM,10/18/2019 12:26:34 AM,10/18/2019 12:26:42 AM,10/18/2019 12:29:33 AM,10/18/2019 12:33:04 AM,,,Code 3 Transport,10/18/2019 01:17:49 AM,800 Block of FOERSTER ST,San Francisco,94127,B09,15,8247,3,3,3,True,Potentially Life-Threatening,1,ENGINE,1,9,7,West of Twin Peaks,POINT (-122.44882464734825 37.736080487699894),192910057-E15


root
 |-- Call Number: integer (nullable = true)
 |-- Unit ID: string (nullable = true)
 |-- Incident Number: integer (nullable = true)
 |-- Call Type: string (nullable = true)
 |-- Call Date: date (nullable = true)
 |-- Watch Date: date (nullable = true)
 |-- Received DtTm: string (nullable = true)
 |-- Entry DtTm: string (nullable = true)
 |-- Dispatch DtTm: string (nullable = true)
 |-- Response DtTm: string (nullable = true)
 |-- On Scene DtTm: string (nullable = true)
 |-- Transport DtTm: string (nullable = true)
 |-- Hospital DtTm: string (nullable = true)
 |-- Call Final Disposition: string (nullable = true)
 |-- Available DtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode of Incident: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- Station Area: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- Original Priority: string (nullable = true)
 |-- Priority: string (nullable = t

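Spark can also run SQL on files directly. Below is an example for CSV. Because no reader options are passed, the header row is returned as data and every column is read as a string named _c0, _c1, and so on.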

In [0]:
df = spark.sql("SELECT * FROM csv.`/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv`")
display(df.take(10))
df.printSchema()

_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19,_c20,_c21,_c22,_c23,_c24,_c25,_c26,_c27,_c28,_c29,_c30,_c31,_c32,_c33
Call Number,Unit ID,Incident Number,Call Type,Call Date,Watch Date,Received DtTm,Entry DtTm,Dispatch DtTm,Response DtTm,On Scene DtTm,Transport DtTm,Hospital DtTm,Call Final Disposition,Available DtTm,Address,City,Zipcode of Incident,Battalion,Station Area,Box,Original Priority,Priority,Final Priority,ALS Unit,Call Type Group,Number of Alarms,Unit Type,Unit sequence in call dispatch,Fire Prevention District,Supervisor District,Neighborhooods - Analysis Boundaries,Location,RowID
192910017,E11,19125164,Alarms,10/18/2019,10/17/2019,10/18/2019 12:03:52 AM,10/18/2019 12:06:59 AM,10/18/2019 12:07:05 AM,10/18/2019 12:08:28 AM,10/18/2019 12:11:10 AM,,,Fire,10/18/2019 12:33:57 AM,24TH ST/VALENCIA ST,San Francisco,94110,B06,11,5525,3,3,3,true,Alarm,1,ENGINE,1,6,9,Mission,POINT (-122.42066480228367 37.75210364574824),192910017-E11
192910018,B10,19125165,Alarms,10/18/2019,10/17/2019,10/18/2019 12:05:56 AM,10/18/2019 12:07:27 AM,10/18/2019 12:09:49 AM,,,,,Other,10/18/2019 12:10:14 AM,3300 Block of 23RD ST,San Francisco,94110,B06,11,0552,3,3,3,false,Alarm,1,CHIEF,1,6,9,Mission,POINT (-122.4202535645237 37.75368162954947),192910018-B10
192910018,T07,19125165,Alarms,10/18/2019,10/17/2019,10/18/2019 12:05:56 AM,10/18/2019 12:07:27 AM,10/18/2019 12:09:49 AM,,,,,Other,10/18/2019 12:10:14 AM,3300 Block of 23RD ST,San Francisco,94110,B06,11,0552,3,3,3,false,Alarm,1,TRUCK,3,6,9,Mission,POINT (-122.4202535645237 37.75368162954947),192910018-T07
192910025,B04,19125166,Alarms,10/18/2019,10/17/2019,10/18/2019 12:09:02 AM,10/18/2019 12:09:02 AM,10/18/2019 12:09:02 AM,10/18/2019 12:09:02 AM,10/18/2019 12:09:02 AM,,,Fire,10/18/2019 12:21:52 AM,3300 Block of FILLMORE ST,San Francisco,94123,B04,16,3554,3,3,3,false,Alarm,1,CHIEF,1,4,2,Marina,POINT (-122.43607739030332 37.80034056356869),192910025-B04
192910034,E01,19125167,Structure Fire,10/18/2019,10/17/2019,10/18/2019 12:12:39 AM,10/18/2019 12:12:39 AM,10/18/2019 12:12:48 AM,10/18/2019 12:13:52 AM,10/18/2019 12:16:16 AM,,,Fire,10/18/2019 12:16:25 AM,7TH ST/MISSION ST,San Francisco,94103,B02,01,2315,3,3,3,true,Alarm,1,ENGINE,1,2,6,South of Market,POINT (-122.41093657380038 37.779211684542084),192910034-E01
192910034,T01,19125167,Structure Fire,10/18/2019,10/17/2019,10/18/2019 12:12:39 AM,10/18/2019 12:12:39 AM,10/18/2019 12:12:48 AM,10/18/2019 12:14:28 AM,,,,Fire,10/18/2019 12:16:18 AM,7TH ST/MISSION ST,San Francisco,94103,B02,01,2315,3,3,3,false,Alarm,1,TRUCK,2,2,6,South of Market,POINT (-122.41093657380038 37.779211684542084),192910034-T01
192910039,76,19125168,Medical Incident,10/18/2019,10/17/2019,10/18/2019 12:14:32 AM,10/18/2019 12:14:32 AM,10/18/2019 12:15:10 AM,10/18/2019 12:15:25 AM,10/18/2019 12:32:14 AM,10/18/2019 12:56:35 AM,10/18/2019 01:15:56 AM,Code 2 Transport,10/18/2019 01:48:10 AM,1600 Block of MCKINNON AVE,San Francisco,94124,B10,17,6515,A,2,2,true,Non Life-threatening,1,MEDIC,1,10,10,Bayview Hunters Point,POINT (-122.38972310330021 37.73607882495912),192910039-76
192910048,T08,19125169,Alarms,10/18/2019,10/17/2019,10/18/2019 12:20:25 AM,10/18/2019 12:21:44 AM,10/18/2019 12:21:51 AM,10/18/2019 12:24:30 AM,10/18/2019 12:26:44 AM,,,Fire,10/18/2019 12:40:41 AM,100 Block of BERRY ST,San Francisco,94158,B03,08,2171,3,3,3,false,Alarm,1,TRUCK,3,3,6,Mission Bay,POINT (-122.3921894505535 37.77663138541027),192910048-T08
192910057,78,19125170,Medical Incident,10/18/2019,10/17/2019,10/18/2019 12:23:58 AM,10/18/2019 12:26:34 AM,10/18/2019 12:26:42 AM,10/18/2019 12:27:20 AM,10/18/2019 12:34:19 AM,10/18/2019 12:53:44 AM,10/18/2019 01:01:18 AM,Code 3 Transport,10/18/2019 01:52:13 AM,800 Block of FOERSTER ST,San Francisco,94127,B09,15,8247,3,3,3,true,Potentially Life-Threatening,1,MEDIC,2,9,7,West of Twin Peaks,POINT (-122.44882464734825 37.736080487699894),192910057-78


root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- _c20: string (nullable = true)
 |-- _c21: string (nullable = true)
 |-- _c22: string (nullable = true)
 |-- _c23: string (nullable = true)
 |-- _c24: string (nullable = true)
 |-- _c25: string (nullable = true)
 |-- _c26: string (nullable = true)
 |-- _c27: string (nullable = tru
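
The direct file query above treats the header line as an ordinary row. One way to keep using SQL while still passing reader options is to define a temporary view over the file with `USING csv OPTIONS (...)`. This is a minimal sketch, assuming the notebook-provided `spark` session; the view name `fire_calls` and the variable names are hypothetical.

```python
# Same sample file as in the cells above.
path = "/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv"

# Define a temporary view over the CSV file; `header` makes Spark use the
# first line as column names instead of returning it as data.
create_view_sql = f"""
CREATE OR REPLACE TEMPORARY VIEW fire_calls
USING csv
OPTIONS (path '{path}', header 'true')
"""

# Column names containing spaces must be quoted with backticks in Spark SQL.
query_sql = "SELECT `Call Number`, `Call Type`, City FROM fire_calls LIMIT 10"

# `spark` is assumed to be the session the notebook provides, as in the cells above.
if "spark" in globals():
    spark.sql(create_view_sql)
    spark.sql(query_sql).show(truncate=False)
```

The view is scoped to the current Spark session, so it does not persist after the session ends.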