### Spark RDD Example: Read CSV File

In [1]:
import findspark
findspark.init()
import pyspark

In [2]:
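# Start a SparkContext and load the file as an RDD of text lines.
sc = pyspark.SparkContext(appName="MyApp")
lines = sc.textFile("students.csv")
print(lines.count())  # number of lines in the file, header included
print(lines.first())  # first line of the file (here, the CSV header)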


3
id,last_name,first_name,email
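
The count of 3 includes the header line, and `textFile` returns every line as an unparsed string. A minimal sketch of splitting those lines into fields follows. `parse_line` and `parse_rows` are hypothetical helper names, not part of PySpark, and the sketch assumes the header occurs only once, as the first line of the file.

```python
import csv


def parse_line(line):
    """Split one CSV line into a list of fields, honoring quoted commas."""
    return next(csv.reader([line]))


def parse_rows(lines):
    """Return an RDD of field lists, skipping the header line.

    `lines` is assumed to be the RDD returned by sc.textFile(...).
    """
    header = lines.first()
    # Drop every line equal to the header, then parse the remaining lines.
    return lines.filter(lambda line: line != header).map(parse_line)
```

While the `SparkContext` is still open, `parse_rows(lines).collect()` would return the data rows as lists of strings.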


Close the `SparkContext` when you are done:

In [3]:
sc.stop()

### Spark SQL Example

Create a new `SparkSession` first:

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

Read in a CSV file as a `DataFrame`:

In [5]:
# Spark 2.0+ ships a built-in CSV reader, so the external com.databricks.spark.csv package is not needed.
df = spark.read.load("students.csv", format="csv", header=True)
df.show()

+----+---------+----------+-----+
|  id|last_name|first_name|email|
+----+---------+----------+-----+
|twan|     Wang|      Tong| null|
|emma|     Wang|      Emma| null|
+----+---------+----------+-----+



Register the `DataFrame` as a temporary view, then run SQL queries against it:

In [6]:
df.createOrReplaceTempView("students")

In [7]:
spark.sql("SELECT * FROM students WHERE first_name = 'Tong'").show()

+----+---------+----------+-----+
|  id|last_name|first_name|email|
+----+---------+----------+-----+
|twan|     Wang|      Tong| null|
+----+---------+----------+-----+



In [8]:
spark.sql("SELECT count(*) FROM students").show()

+--------+
|count(1)|
+--------+
|       2|
+--------+
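
The two queries above can also be written with the DataFrame API instead of SQL strings. The sketch below assumes `df` is the `DataFrame` loaded from `students.csv`. The function names `rows_with_first_name` and `row_count` are hypothetical wrappers added for illustration.

```python
def rows_with_first_name(df, first_name):
    """DataFrame counterpart of: SELECT * FROM students WHERE first_name = ..."""
    return df.filter(df["first_name"] == first_name)


def row_count(df):
    """DataFrame counterpart of: SELECT count(*) FROM students."""
    # count() returns a plain Python int rather than a one-row DataFrame.
    return df.count()
```

For example, `rows_with_first_name(df, "Tong").show()` prints the matching rows in the same tabular layout as the SQL version.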



Close the `SparkSession` when you are done:

In [9]:
spark.stop()