### DataFrame
Spark SQL provides the DataFrame, a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. In the following example, we show how to create a DataFrame from a JSON dataset.

In [1]:
import findspark
findspark.init()

import pyspark
from pyspark.sql.session import SparkSession

sc = pyspark.SparkContext(appName="myAppName")
spark = SparkSession(sc)

In [2]:
# Defines a Python list storing one JSON object.
json_strings = ['{"name":"Bob","address":{"city":"Los Angeles","state":"California"}}']

In [3]:
# Defines an RDD from the Python list.
peopleRDD = sc.parallelize(json_strings)

In [4]:
# Creates a DataFrame from the RDD of JSON strings; the schema is inferred from the data.
people = spark.read.json(peopleRDD)
people.show()

+--------------------+----+
|             address|name|
+--------------------+----+
|[Los Angeles,Cali...| Bob|
+--------------------+----+
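
The `address` column above holds a nested value rather than a plain string, because the JSON object itself is nested. As a rough illustration of that structure, the following plain-Python sketch (it does not use Spark) parses the same record and lists its top-level keys and dotted paths. The `flatten` helper is hypothetical and exists only for this example.

```python
import json

# Same record as in the example above.
json_strings = ['{"name":"Bob","address":{"city":"Los Angeles","state":"California"}}']

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted names such as 'address.city' (illustrative helper)."""
    flat = {}
    for key, value in record.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into the nested object, extending the dotted prefix.
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat

records = [json.loads(s) for s in json_strings]

# Top-level keys, sorted alphabetically to mirror the column order shown above.
top_level_columns = sorted(records[0])
flat_rows = [flatten(r) for r in records]

print(top_level_columns)
print(flat_rows[0])
```

In Spark itself, a nested field can be selected with the same dotted notation, for example `people.select("address.city")`.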



### Text Search
In this example, we show how to search through the error messages in a log file using a DataFrame.

In [5]:
from pyspark.sql import Row

In [6]:
text_data = sc.parallelize(["MYSQL ERROR 1\n","MYSQL ERROR 2\n","MYSQL\n"])

In [7]:
# Creates a DataFrame having a single column named "line"
df = text_data.map(lambda r: Row(r)).toDF(["line"])

In [8]:
df.show()

+--------------+
|          line|
+--------------+
|MYSQL ERROR 1
|
|MYSQL ERROR 2
|
|        MYSQL
|
+--------------+



In [9]:
# Keeps only the lines that contain "ERROR"
errors = df.filter(df["line"].like("%ERROR%"))

In [10]:
errors.show()

+--------------+
|          line|
+--------------+
|MYSQL ERROR 1
|
|MYSQL ERROR 2
|
+--------------+



In [11]:
# Counts all the errors
errors.count()

2
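
To make the `like("%ERROR%")` filter more concrete, here is a minimal plain-Python sketch of how a SQL `LIKE` pattern can be read: `%` stands for any run of characters and `_` for exactly one character. The `like_to_regex` helper is hypothetical, ignores escape characters, and is not how Spark implements the operator. It only reproduces the matching behaviour seen above on the same three lines.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regex (illustrative, no escape handling)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    # DOTALL lets the wildcards also match the trailing newline in each line.
    return re.compile("".join(parts), re.DOTALL)

lines = ["MYSQL ERROR 1\n", "MYSQL ERROR 2\n", "MYSQL\n"]

matcher = like_to_regex("%ERROR%")
# LIKE must match the whole string, hence fullmatch rather than search.
error_lines = [line for line in lines if matcher.fullmatch(line)]

print(len(error_lines))
```

Because the pattern both starts and ends with `%`, it behaves like a substring test, which is why the two lines containing "ERROR" are kept and the third is dropped.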