# Guided Lab - 345.2.3 - PYSQL
## Create DataFrame From the Data Sources (CSV and JSON)


Lab Overview


In this lab, we will first create a SparkSession. Then, we will create DataFrames by reading CSV and JSON files.


## Example 1 - Creating the DataFrame from CSV File
Download the cars dataset (cars.csv) before you start.

Open the Jupyter notebook.
Create a SparkSession, then load cars.csv into a DataFrame with spark.read.load() as shown below:


In [3]:
import pyspark
from pyspark.sql import SparkSession  # Import the libraries

# Create the Spark session
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Read the dataset from the CSV file
cardf = spark.read.load("cars.csv", format="csv", header=True, inferSchema=True)

cardf.show()

+--------------------+----+---------+------------+----------+------+------------+-----+------+--------+-------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|quantity|   city|
+--------------------+----+---------+------------+----------+------+------------+-----+------+--------+-------+
|AMC Ambassador Br...|13.0|        8|       360.0|       175|  3821|        11.0|   73|    US|      25|NewYork|
|  AMC Ambassador DPL|15.0|        8|       390.0|       190|  3850|         8.5|   70|    US|       2|     NJ|
|  AMC Ambassador SST|17.0|        8|       304.0|       150|  3672|        11.5|   72|    US|       4| DALLAS|
|         AMC Concord|19.4|        6|       232.0|        90|  3210|        17.2|   78|    US|      52|  TEXAS|
|         AMC Concord|24.3|        4|       151.0|        90|  3003|        20.1|   80|    US|      42|     OH|
|     AMC Concord d/l|18.1|        6|       258.0|       120|  3410|        15.1|   78|    US|       4|N

To read a CSV file, pass its path to the load() method of spark.read and set format="csv". The header and inferSchema options are optional, but you will usually want both: without header=True, Spark treats the header row as actual data and names the columns _c0, _c1, and so on; without inferSchema=True, Spark reads every column as a string.

To see the column types of a DataFrame, we can use the printSchema() method. Let’s apply printSchema() to the cars DataFrame, which will print the schema in a tree format.

In [2]:
cardf.printSchema()

root
 |-- Car: string (nullable = true)
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Displacement: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Acceleration: double (nullable = true)
 |-- Model: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- city: string (nullable = true)



In [3]:
# Let's have a look at the column names.

cardf.columns

['Car',
 'MPG',
 'Cylinders',
 'Displacement',
 'Horsepower',
 'Weight',
 'Acceleration',
 'Model',
 'Origin',
 'quantity',
 'city']

In [5]:
cardf.show()

+--------------------+----+---------+------------+----------+------+------------+-----+------+--------+-------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|quantity|   city|
+--------------------+----+---------+------------+----------+------+------------+-----+------+--------+-------+
|AMC Ambassador Br...|13.0|        8|       360.0|       175|  3821|        11.0|   73|    US|      25|NewYork|
|  AMC Ambassador DPL|15.0|        8|       390.0|       190|  3850|         8.5|   70|    US|       2|     NJ|
|  AMC Ambassador SST|17.0|        8|       304.0|       150|  3672|        11.5|   72|    US|       4| DALLAS|
|         AMC Concord|19.4|        6|       232.0|        90|  3210|        17.2|   78|    US|      52|  TEXAS|
|         AMC Concord|24.3|        4|       151.0|        90|  3003|        20.1|   80|    US|      42|     OH|
|     AMC Concord d/l|18.1|        6|       258.0|       120|  3410|        15.1|   78|    US|       4|N

Inference: Here, we can see that the show() function has returned the top 20 rows of the dataset, which is its default. Note that we set header to True so that Spark treats the first row as the header, and inferSchema to True so that each column gets its actual data type instead of string.

We can use the head() operation to see the first N rows (say, 5 rows). The head() operation in PySpark is similar to head() in Pandas, except that it returns a list of Row objects rather than a DataFrame.


In [14]:
cardf.head(5)

[Row(Car='AMC Ambassador Brougham', MPG=13.0, Cylinders=8, Displacement=360.0, Horsepower=175, Weight=3821, Acceleration=11.0, Model=73, Origin='US', quantity=25, city='NewYork'),
 Row(Car='AMC Ambassador DPL', MPG=15.0, Cylinders=8, Displacement=390.0, Horsepower=190, Weight=3850, Acceleration=8.5, Model=70, Origin='US', quantity=2, city='NJ'),
 Row(Car='AMC Ambassador SST', MPG=17.0, Cylinders=8, Displacement=304.0, Horsepower=150, Weight=3672, Acceleration=11.5, Model=72, Origin='US', quantity=4, city='DALLAS'),
 Row(Car='AMC Concord', MPG=19.4, Cylinders=6, Displacement=232.0, Horsepower=90, Weight=3210, Acceleration=17.2, Model=78, Origin='US', quantity=52, city='TEXAS'),
 Row(Car='AMC Concord', MPG=24.3, Cylinders=4, Displacement=151.0, Horsepower=90, Weight=3003, Acceleration=20.1, Model=80, Origin='US', quantity=42, city='OH')]

To extract the last N rows of a DataFrame, we can use the tail() function (available in Spark 3.0 and later). Like head(), it returns a list of Row objects.

In [13]:
cardf.tail(2)

[Row(Car='Volvo 264gl', MPG=17.0, Cylinders=6, Displacement=163.0, Horsepower=125, Weight=3140, Acceleration=13.6, Model=78, Origin='Europe', quantity=320, city='NewYork'),
 Row(Car='Volvo Diesel', MPG=30.7, Cylinders=6, Displacement=145.0, Horsepower=76, Weight=3160, Acceleration=19.6, Model=81, Origin='Europe', quantity=406, city='NJ')]

How many columns do we have in cars.csv files along with their names?

To get the column names, we can use the columns attribute of the DataFrame, similar to what we do with a Pandas DataFrame. Its length gives the number of columns.


In [15]:
len(cardf.columns)  # show number of columns

11

In [16]:
cardf.columns     # show name of the columns

['Car',
 'MPG',
 'Cylinders',
 'Displacement',
 'Horsepower',
 'Weight',
 'Acceleration',
 'Model',
 'Origin',
 'quantity',
 'city']

Get the summary statistics (count, mean, standard deviation, minimum, and maximum) of the columns in a DataFrame.

The describe() operation is used to calculate the summary statistics of column(s) in a DataFrame.

If we do not specify any column names, it will calculate summary statistics for all numeric and string columns present in the DataFrame.

Let’s check to see what happens when we specify the name of a categorical/string column in the describe() operation. The mean and stddev come back as null, while min and max follow string ordering.


In [17]:
cardf.describe('Car').show()

+-------+--------------------+
|summary|                 Car|
+-------+--------------------+
|  count|                 406|
|   mean|                null|
| stddev|                null|
|    min|AMC Ambassador Br...|
|    max|        Volvo Diesel|
+-------+--------------------+



We can use the orderBy() operation on a DataFrame to get output sorted by one or more columns.

The orderBy operation takes two arguments:

One or more columns (names or Column expressions).
ascending = True or False for getting the results in ascending or descending order (a list of booleans when sorting by more than one column).

Alternatively, we can call desc() or asc() on a column, as in the cell below. Let’s sort the cars DataFrame by Acceleration in descending order.


In [18]:
cardf.orderBy(cardf.Acceleration.desc()).show(10)

+--------------------+----+---------+------------+----------+------+------------+-----+------+--------+-------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|quantity|   city|
+--------------------+----+---------+------------+----------+------+------------+-----+------+--------+-------+
|         Peugeot 504|27.2|        4|       141.0|        71|  3190|        24.8|   79|Europe|     344|NewYork|
|   Volkswagen Pickup|44.0|        4|        97.0|        52|  2130|        24.6|   82|Europe|      96|NewYork|
|Volkswagen Dasher...|43.4|        4|        90.0|        48|  2335|        23.7|   80|Europe|     371| DALLAS|
|   Volkswagen Type 3|23.0|        4|        97.0|        54|  2254|        23.5|   72|Europe|     104|NewYork|
|  Chevrolet Chevette|29.0|        4|        85.0|        52|  2035|        22.2|   76|    US|     240|  TEXAS|
|Oldsmobile Cutlas...|23.9|        8|       260.0|        90|  3420|        22.2|   79|    US|     345|N

## Example 2 - Creating the DataFrame from JSON File
Download the dataset using the links below for the JSON files:
Zipcode1.json
Zipcode2.json
Zipcode.json
multiline-zipcode.json
 
Read a JSON file into a DataFrame:


In [19]:
df = spark.read.json("zipcode.json")
df.printSchema()
df.show()


root
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Decommisioned: boolean (nullable = true)
 |-- EstimatedPopulation: long (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- LocationText: string (nullable = true)
 |-- LocationType: string (nullable = true)
 |-- Long: double (nullable = true)
 |-- Notes: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- TaxReturnsFiled: long (nullable = true)
 |-- TotalWages: long (nullable = true)
 |-- WorldRegion: string (nullable = true)
 |-- Xaxis: double (nullable = true)
 |-- Yaxis: double (nullable = true)
 |-- Zaxis: double (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)

+-------------------+-------+-------------+-------------------+-----+--------------------+--------------------+--------------+-------+-------------+------------+-----+---------------+-------

In [20]:
# Read multi-line JSON file into dataframe:

multiline_df = spark.read.option("multiline","true") \
      .json("multiline-zipcode.json")
multiline_df.show()


+-------------------+------------+-----+-----------+-------+
|               City|RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|           2|   PR|   STANDARD|    704|
|       BDA SAN LUIS|          10|   PR|   STANDARD|    709|
+-------------------+------------+-----+-----------+-------+



In [22]:
# Read multiple JSON files into dataframe:

df2 = spark.read.json(
    ['zipcode2.json','zipcode1.json'])
df2.show(4)  


+-------------------+-------+-------------+-----+--------------------+--------------------+--------------+------+------------+-----+-----------+-----+-----+-----+-----------+-------+
|               City|Country|Decommisioned|  Lat|            Location|        LocationText|  LocationType|  Long|RecordNumber|State|WorldRegion|Xaxis|Yaxis|Zaxis|ZipCodeType|Zipcode|
+-------------------+-------+-------------+-----+--------------------+--------------------+--------------+------+------------+-----+-----------+-----+-----+-----+-----------+-------+
|PASEO COSTA DEL SUR|     US|        false|17.96|NA-US-PR-PASEO CO...|Paseo Costa Del S...|NOT ACCEPTABLE|-66.22|           2|   PR|         NA| 0.38|-0.87|  0.3|   STANDARD|    704|
|       BDA SAN LUIS|     US|        false|18.14|NA-US-PR-BDA SAN ...|    Bda San Luis, PR|NOT ACCEPTABLE|-66.26|          10|   PR|         NA| 0.38|-0.86| 0.31|   STANDARD|    709|
|        PARC PARQUE|     US|        false|17.96|NA-US-PR-PARC PARQUE|     Parc Parqu

In [23]:
# Read all JSON files from a directory:


# df3 = spark.read.json("*.json")
# df3.show()


In [2]:
# Define a custom schema for zipcode.json, because JSON files carry no schema information of their own:


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, DoubleType

# Reuse the existing session, or create one if the kernel was restarted
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

schema = StructType([
      StructField("RecordNumber",IntegerType(),True),
      StructField("Zipcode",IntegerType(),True),
      StructField("ZipCodeType",StringType(),True),
      StructField("City",StringType(),True),
      StructField("State",StringType(),True),
      StructField("LocationType",StringType(),True),
      StructField("Lat",DoubleType(),True),
      StructField("Long",DoubleType(),True),
      StructField("Xaxis",DoubleType(),True),  # values such as 0.38 are fractional, so not IntegerType
      StructField("Yaxis",DoubleType(),True),
      StructField("Zaxis",DoubleType(),True),
      StructField("WorldRegion",StringType(),True),
      StructField("Country",StringType(),True),
      StructField("LocationText",StringType(),True),
      StructField("Location",StringType(),True),
      StructField("Decommisioned",BooleanType(),True),
      StructField("TaxReturnsFiled",StringType(),True),
      StructField("EstimatedPopulation",IntegerType(),True),
      StructField("TotalWages",IntegerType(),True),
      StructField("Notes",StringType(),True)
  ])

df_with_schema = spark.read.schema(schema).json("zipcode.json")
df_with_schema.printSchema()
df_with_schema.show(3)


Note: this cell needs a SparkSession named spark. Run in a fresh kernel without one, it fails with NameError: name 'spark' is not defined, which is why the session is obtained with getOrCreate() before the schema is used.