# Fake Friends Data


## Overview
This notebook shows you how to create and query a table or DataFrame from a file that you uploaded to DBFS. DBFS (the Databricks File System) lets you store data for querying inside of Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in Python so the default cell type is Python. However, you can use different languages by using the %LANGUAGE syntax. Python, Scala, SQL, and R are all supported.



In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("ISM6562 Spark App01").enableHiveSupport().getOrCreate()

# Let's get the SparkContext object. It's the entry point to the Spark API. It's created when you create a SparkSession.
sc = spark.sparkContext

# Note: if you have multiple Spark sessions running (for example, from a previous notebook you've run),
# this session's web UI will be on a different port than the default (4040). One way to
# identify the port is with the following line. If only one Spark session is running,
# this will be 4040. If it's higher, other Spark sessions are still running.
spark_session_port = spark.sparkContext.uiWebUrl.split(":")[-1]
print("Spark Session WebUI Port: " + spark_session_port)

23/04/09 20:01:54 WARN Utils: Your hostname, MBP-SMITH515.local resolves to a loopback address: 127.0.0.1; using 192.168.4.81 instead (on interface en0)
23/04/09 20:01:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/04/09 20:01:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/04/09 20:01:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/04/09 20:01:55 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark Session WebUI Port: 4042
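
The port lookup above takes whatever follows the last `:` in the URL. As a minimal sketch of a slightly more defensive variant, the standard library's `urlsplit` can pull the port out as an integer. The URL string below is an assumption, written in the shape that `spark.sparkContext.uiWebUrl` is expected to return (host and port taken from the log output above); in the notebook you would pass the real value instead.

```python
from urllib.parse import urlsplit

# Hypothetical value in the shape of spark.sparkContext.uiWebUrl;
# in the notebook, use the real attribute instead of this literal.
ui_web_url = "http://192.168.4.81:4042"

# .port is an int, or None when the URL carries no explicit port
port = urlsplit(ui_web_url).port
print(f"Spark Session WebUI Port: {port}")
```

Because `port` is an integer here, a check such as `port > 4040` can be used to detect that other sessions are still holding the lower ports.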


In [2]:
# Set the log level to ERROR to hide the INFO and WARN messages that are printed by default. To see them again, set this to INFO or WARN.
sc.setLogLevel("ERROR")

In [3]:
spark

In [13]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# see here for more info on the schema: https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
# and here https://sparkbyexamples.com/pyspark/pyspark-sql-types-datatype-with-examples/

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("friendcount", IntegerType(), True),
    ])

friends = spark.read.csv('data/fakefriends.csv', header=False, schema=schema)

# display the first 5 rows of the dataframe
friends.show(5)

+---+--------+---+-----------+
| id|    name|age|friendcount|
+---+--------+---+-----------+
|  1|Jean-Luc| 26|          2|
|  2|    Hugh| 55|        221|
|  3|  Deanna| 40|        465|
|  4|   Quark| 68|         21|
|  5|  Weyoun| 59|        318|
+---+--------+---+-----------+
only showing top 5 rows
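
To make the role of the explicit schema more concrete, here is a rough plain-Python sketch of the same idea: the file has no header row, so column names and types have to be supplied separately and applied to each record. This is only an illustration and not how Spark reads CSV files internally. The `columns` list and the `parse` helper are hypothetical names, and the sample text reuses the five rows shown above.

```python
import csv
import io

# The five rows shown above, as they would appear in a header-less CSV file.
raw = io.StringIO(
    "1,Jean-Luc,26,2\n"
    "2,Hugh,55,221\n"
    "3,Deanna,40,465\n"
    "4,Quark,68,21\n"
    "5,Weyoun,59,318\n"
)

# (column name, converter) pairs standing in for the StructField entries.
columns = [("id", int), ("name", str), ("age", int), ("friendcount", int)]


def parse(value, convert):
    """Convert one field; an empty field becomes None, loosely like a nullable column."""
    return convert(value) if value != "" else None


# Pair each field with its column definition and build one dict per record.
rows = [
    {name: parse(value, convert) for (name, convert), value in zip(columns, record)}
    for record in csv.reader(raw)
]

print(rows[0])
```

Supplying the types up front, as the notebook does with `StructType`, avoids relying on type inference and keeps `age` and `friendcount` numeric for the aggregations that follow.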



In [14]:
friends.createOrReplaceTempView("fakefriends_csv")

If running the cell below causes an error, you need to install the sparksql-magic module. Open a terminal, make sure the spark conda environment is active, and run the following command:
```pip install sparksql-magic```

NOTE: Remember, to activate the spark environment, run the following command:
```conda activate spark```

**NOTE 2: You will need to restart your Jupyter kernel after installing the module!**

In [15]:
%load_ext sparksql_magic

The sparksql_magic extension is already loaded. To reload it, use:
  %reload_ext sparksql_magic


In [16]:
%%sparksql
select * from fakefriends_csv

only showing top 20 row(s)


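+---+--------+---+-----------+
| id|    name|age|friendcount|
+---+--------+---+-----------+
|  1|Jean-Luc| 26|          2|
|  2|    Hugh| 55|        221|
|  3|  Deanna| 40|        465|
|  4|   Quark| 68|         21|
|  5|  Weyoun| 59|        318|
|  6|  Gowron| 37|        220|
|  7|    Will| 54|        307|
|  8|  Jadzia| 38|        380|
|  9|    Hugh| 27|        181|
+---+--------+---+-----------+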


In [20]:
%%sparksql 
select age, round(avg(friendcount), 1) As AvgFriendCount from fakefriends_csv group by age

only showing top 20 row(s)


+---+--------------+
|age|AvgFriendCount|
+---+--------------+
| 31|         267.3|
| 65|         298.2|
| 53|         222.9|
| 34|         245.5|
| 28|         209.1|
| 26|         242.1|
| 27|         228.1|
| 44|         282.2|
| 22|         206.4|
+---+--------------+
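
Note that the ages in the result above come back in no particular order, since a `group by` without an `order by` makes no ordering promise. As a small sketch for experimenting with the query outside Spark, the same SQL with an added `order by age` can be run against an in-memory SQLite table. SQLite is swapped in here purely for illustration and is not Spark SQL. The table is loaded with only the nine rows shown earlier, so every age appears once and each "average" is simply that row's friend count, unlike the full dataset.

```python
import sqlite3

# The nine rows shown in the `select *` output above.
rows = [
    (1, "Jean-Luc", 26, 2),
    (2, "Hugh", 55, 221),
    (3, "Deanna", 40, 465),
    (4, "Quark", 68, 21),
    (5, "Weyoun", 59, 318),
    (6, "Gowron", 37, 220),
    (7, "Will", 54, 307),
    (8, "Jadzia", 38, 380),
    (9, "Hugh", 27, 181),
]

# In-memory SQLite database standing in for the Spark temp view.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table fakefriends_csv (id integer, name text, age integer, friendcount integer)"
)
conn.executemany("insert into fakefriends_csv values (?, ?, ?, ?)", rows)

# Same aggregation as the notebook cell, plus an explicit sort on age.
query = """
    select age, round(avg(friendcount), 1) as AvgFriendCount
    from fakefriends_csv
    group by age
    order by age
"""
result = conn.execute(query).fetchall()
conn.close()

for age, avg_friends in result:
    print(age, avg_friends)
```

Adding the same `order by age` clause to the `%%sparksql` cell should give a sorted result there as well.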


In [21]:
# since our friends data is only stored in volatile memory, let's save the table into our spark-warehouse
friends.write.saveAsTable("fake_friends", mode='overwrite')


                                                                                