Initialize Spark and run basic commands to query local JSON and CSV files

In [2]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
# When connecting to a standalone Spark master, set resource caps such as spark.cores.max;
# otherwise one application can claim every available core and leave others waiting.
spark = SparkSession \
  .builder \
  .master("spark://rixp330-ubuntu:7077") \
  .appName("LearningSpark") \
  .config("spark.cores.max","1") \
  .getOrCreate()

# spark = SparkSession.builder.master("local[10]").appName("LearningSparkLocal").getOrCreate()
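The core cap set in the cell above can be kept together with other limits in one place. The following is a minimal sketch rather than the notebook's own code: `session_conf` and `build_session` are hypothetical helper names, and the `1g` executor memory is an assumed value chosen only for illustration.

```python
def session_conf(max_cores, executor_memory="1g"):
    """Return resource caps as Spark config key/value strings.

    executor_memory="1g" is an assumed value for illustration.
    """
    if max_cores < 1:
        raise ValueError("max_cores must be at least 1")
    return {
        "spark.cores.max": str(max_cores),
        "spark.executor.memory": executor_memory,
    }


def build_session(master, app_name, conf):
    """Create (or reuse) a SparkSession with every entry of conf applied."""
    # Imported here so session_conf can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session("spark://rixp330-ubuntu:7077", "LearningSpark", session_conf(1))` should request the same single-core cap as the cell above. Because `getOrCreate()` returns an already running session when one exists, the caps are best applied before any session has been started.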

In [3]:
test_json_df = spark.read.json("test.json")
test_json_df.show()

+--------------------+---------+-----+---+-------+---------+-----------------+
|           Campaigns|    First| Hits| Id|   Last|Published|              Url|
+--------------------+---------+-----+---+-------+---------+-----------------+
| [twitter, LinkedIn]|    Jules| 4535|  1|  Damji| 1/4/2016|https://tinyurl.1|
| [twitter, LinkedIn]|   Brooke| 8908|  2|  Wenig| 5/5/2018|https://tinyurl.2|
|[web, twitter, FB...|    Denny| 7659|  3|    Lee| 6/7/2019|https://tinyurl.3|
|       [twitter, FB]|Tathagata|10568|  4|    Das|5/12/2018|https://tinyurl.4|
|[web, twitter, FB...|    Matei|40578|  5|Zaharia|5/14/2014|https://tinyurl.5|
| [twitter, LinkedIn]|  Reynold|25568|  6|    Xin| 3/2/2015|https://tinyurl.6|
+--------------------+---------+-----+---+-------+---------+-----------------+
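By default `spark.read.json` expects one JSON object per line (JSON Lines) rather than a single pretty-printed array. The sketch below writes two of the rows shown above in that layout so the expected file shape is visible. `write_json_lines` and `SAMPLE_BLOGS` are hypothetical names, and the integer types for `Id` and `Hits` are assumptions, since `show()` does not print column types.

```python
import json

# Two records copied from the output above. The Python types (int for Id
# and Hits) are assumptions; show() does not reveal the column types.
SAMPLE_BLOGS = [
    {"Id": 1, "First": "Jules", "Last": "Damji", "Url": "https://tinyurl.1",
     "Published": "1/4/2016", "Hits": 4535, "Campaigns": ["twitter", "LinkedIn"]},
    {"Id": 4, "First": "Tathagata", "Last": "Das", "Url": "https://tinyurl.4",
     "Published": "5/12/2018", "Hits": 10568, "Campaigns": ["twitter", "FB"]},
]


def write_json_lines(path, records):
    """Write one JSON object per line, the layout spark.read.json reads by default."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

After `write_json_lines("blogs_sample.json", SAMPLE_BLOGS)` (a hypothetical file name), `spark.read.json("blogs_sample.json")` should load the two rows. For a file that holds one multi-line JSON document instead, the reader's `multiLine` option can be set to `true`.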



Read CSV files from the local filesystem

In [4]:

test_csv_df = spark.read.option("header","true").csv("test.csv")
test_csv_df.show()


+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|CallFinalDisposition|       AvailableDtTm|
+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+
|  20110016|   T13|       2003235|  Structure Fire|01/11/2002|01/10/2002|               Other|01/11/2002 01:51:...|
|  20110022|   M17|       2003241|Medical Incident|01/11/2002|01/10/2002|               Other|01/11/2002 03:01:...|
|  20110023|   M41|       2003242|Medical Incident|01/11/2002|01/10/2002|               Other|01/11/2002 02:39:...|
|  20110032|   E11|       2003250|    Vehicle Fire|01/11/2002|01/10/2002|               Other|01/11/2002 04:16:...|
|  20110043|   B04|       2003259|          Alarms|01/11/2002|01/10/2002|               Other|01/11/2002 06:01:...|
+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+
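With only the `header` option set, Spark reads every CSV column as a string. A sketch of supplying an explicit schema instead follows. The column types, the `MM/dd/yyyy` date pattern, and the helper names `to_ddl` and `read_fire_calls` are all assumptions inferred from the eight columns printed above, not taken from the file itself. `AvailableDtTm` is left as a string because its value is truncated in the output.

```python
# Column names come from the printed header; the types are assumptions.
FIRE_CALL_COLUMNS = [
    ("CallNumber", "INT"),
    ("UnitID", "STRING"),
    ("IncidentNumber", "INT"),
    ("CallType", "STRING"),
    ("CallDate", "DATE"),
    ("WatchDate", "DATE"),
    ("CallFinalDisposition", "STRING"),
    ("AvailableDtTm", "STRING"),  # value is cut off in the output, so kept as text
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_fire_calls(spark, path="test.csv"):
    """Read the CSV with typed columns, using an existing SparkSession."""
    return (
        spark.read
        .option("header", "true")
        .option("dateFormat", "MM/dd/yyyy")  # assumed month/day/year order
        .schema(to_ddl(FIRE_CALL_COLUMNS))
        .csv(path)
    )
```

`read_fire_calls(spark).printSchema()` would then show whether the types were applied. Setting the `inferSchema` option to `true` is a shorter alternative, at the cost of an extra pass over the data.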

Create a temporary view from the JSON file and query it with SQL

In [5]:
spark.read.json("test.json").createOrReplaceTempView("blogs")
results = spark.sql("select id,url,campaigns from blogs")
results.show()

+---+-----------------+--------------------+
| id|              url|           campaigns|
+---+-----------------+--------------------+
|  1|https://tinyurl.1| [twitter, LinkedIn]|
|  2|https://tinyurl.2| [twitter, LinkedIn]|
|  3|https://tinyurl.3|[web, twitter, FB...|
|  4|https://tinyurl.4|       [twitter, FB]|
|  5|https://tinyurl.5|[web, twitter, FB...|
|  6|https://tinyurl.6| [twitter, LinkedIn]|
+---+-----------------+--------------------+
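Because `campaigns` holds an array per row, the same view can also be queried per array element. The sketch below counts blogs per campaign with `explode`, assuming the `blogs` view registered above still exists in the session. `CAMPAIGN_COUNTS_SQL` and `campaign_counts` are hypothetical names.

```python
# explode() turns each array element into its own row, so the outer
# query can group by the individual campaign value.
CAMPAIGN_COUNTS_SQL = """
select campaign, count(*) as blog_count
from (select explode(campaigns) as campaign from blogs) exploded
group by campaign
order by blog_count desc
"""


def campaign_counts(spark):
    """Run the per-campaign count against a session that has the blogs view."""
    return spark.sql(CAMPAIGN_COUNTS_SQL)
```

`campaign_counts(spark).show(truncate=False)` would print the result without the 20-character cut-off that shortens some `campaigns` values in the output above.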



Read CSV files from S3

In [6]:
csvS3 = (
    spark.read.format('csv')
    .options(header='false', inferSchema='false', delimiter='|')
    .load('s3a://data-lake-demo-rixon/tickitdb/venue/venue_pipe.txt')
)
csvS3.show()

+---+--------------------+---------------+---+-----+
|_c0|                 _c1|            _c2|_c3|  _c4|
+---+--------------------+---------------+---+-----+
|  1|         Toyota Park|     Bridgeview| IL|    0|
|  2|Columbus Crew Sta...|       Columbus| OH|    0|
|  3|         RFK Stadium|     Washington| DC|    0|
|  4|CommunityAmerica ...|    Kansas City| KS|    0|
|  5|    Gillette Stadium|     Foxborough| MA|68756|
|  6|New York Giants S...|East Rutherford| NJ|80242|
|  7|           BMO Field|        Toronto| ON|    0|
|  8|The Home Depot Ce...|         Carson| CA|    0|
|  9|Dick's Sporting G...|  Commerce City| CO|    0|
| 10|      Pizza Hut Park|         Frisco| TX|    0|
| 11|   Robertson Stadium|        Houston| TX|    0|
| 13| Rice-Eccles Stadium| Salt Lake City| UT|    0|
| 14|   Buck Shaw Stadium|    Santa Clara| CA|    0|
| 15|     McAfee Coliseum|        Oakland| CA|63026|
| 16| TD Banknorth Garden|         Boston| MA|    0|
| 17|         Izod Center|East Rutherford| NJ|