![](http://spark.apache.org/images/spark-logo.png)

### Introduction to DataFrames

Objectives:

A DataFrame is a two-dimensional, labeled data structure whose columns can hold different data types. A pandas DataFrame can be built from many kinds of input, including Series and other DataFrames, and accepts an index (row labels) and columns (column labels). Index labels can be numbers, dates, strings, or tuples.
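
As a small illustration of the points above, the following sketch builds a pandas DataFrame from a Series and plain lists, with explicit row labels, and then builds a second DataFrame from the first. The names `mass`, `fruits_demo`, `subset` and the row labels are made up for this example; the values are illustrative only.

```python
import pandas as pd

# A Series used as one of the inputs (values are illustrative).
mass = pd.Series([159, 178, 80], index=["a", "b", "c"], name="Mass")

# Build a DataFrame from a dict of columns and pass explicit row labels.
fruits_demo = pd.DataFrame(
    {
        "Name": ["apple", "apple", "strawberry"],
        "Mass": mass,
        "Width": [3.5, 8.3, 1.2],
    },
    index=["a", "b", "c"],
)

# A DataFrame can also be built from another DataFrame,
# here keeping only two of its columns.
subset = pd.DataFrame(fruits_demo, columns=["Name", "Mass"])

print(fruits_demo)
print(subset.dtypes)
```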

After completing this lab we will be able to:

* Load a data file into a DataFrame
* View the data schema of a DataFrame
* Perform basic data manipulation (add/drop a column)
* Rename a column

In [158]:
import pyspark
import pandas as pd

In [159]:
fruits = pd.read_csv('fruits.csv')
fruits.tail()

Unnamed: 0,Name,Type,Mass,Width,Height
0,apple,granny_smith,159,3.5,3.9
1,apple,golden_apple,178,8.3,6.4
2,apple,granny_smith,163,6.8,9.1
3,strawberry,june-bearing,80,1.2,1.9
4,oramge,typical,8,9.7,7.9


<h3>Exercise 1 - Spark Session</h3>

Whenever we work with PySpark, we must first create (or start) a Spark session. To start a Spark session:

In [160]:
from pyspark.sql import SparkSession

Task 1 - <code>Creating Spark session and context</code>

In [161]:
spark = SparkSession.builder.appName('console').getOrCreate()

Task 2 - <code>Inspecting the Spark session</code>

In [162]:
spark

Exercise 2 - <code>Load the data into a Spark DataFrame</code>

Task 1 - Loading data into a Spark DataFrame

In [163]:
# Read the dataset with Spark (no options yet, so the header row is treated as data)
df_pyspark = spark.read.csv('fruits.csv')
df_pyspark

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string]

In [164]:
df_pyspark.show()

+----------+------------+----+-----+------+
|       _c0|         _c1| _c2|  _c3|   _c4|
+----------+------------+----+-----+------+
|      Name|        Type|Mass|Width|Height|
|     apple|granny_smith| 159|  3.5|   3.9|
|     apple|golden_apple| 178|  8.3|   6.4|
|     apple|granny_smith| 163|  6.8|   9.1|
|strawberry|june-bearing|  80|  1.2|   1.9|
|    oramge|     typical|   8|  9.7|   7.9|
+----------+------------+----+-----+------+



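Task 2 - <code>Reading the dataset with its header row</code>

We want Name, Type, etc. to be the column names rather than _c0, _c1, ... When we read the CSV file directly, Spark assigns the default names _c0, _c1, ... and treats the header row as ordinary data. Setting the header option fixes this.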

In [165]:
spark.read.option('header', 'true').csv('fruits.csv').show() # .show() prints the first 20 rows by default

+----------+------------+----+-----+------+
|      Name|        Type|Mass|Width|Height|
+----------+------------+----+-----+------+
|     apple|granny_smith| 159|  3.5|   3.9|
|     apple|golden_apple| 178|  8.3|   6.4|
|     apple|granny_smith| 163|  6.8|   9.1|
|strawberry|june-bearing|  80|  1.2|   1.9|
|    oramge|     typical|   8|  9.7|   7.9|
+----------+------------+----+-----+------+



In [166]:
df_pyspark = spark.read.option('header', 'true').csv('fruits.csv')

In [167]:
type(df_pyspark) # this is a Spark SQL DataFrame; type(pd.read_csv('fruits.csv')) would be pandas.core.frame.DataFrame

pyspark.sql.dataframe.DataFrame

<code>Preview some Spark DataFrame records</code>

In [168]:
df_pyspark.head(3)

[Row(Name='apple', Type='granny_smith', Mass='159', Width='3.5', Height='3.9'),
 Row(Name='apple', Type='golden_apple', Mass='178', Width='8.3', Height='6.4'),
 Row(Name='apple', Type='granny_smith', Mass='163', Width='6.8', Height='9.1')]

<code>Check Spark DataFrame Schema</code>

In [169]:
# printSchema() prints each column's name, data type and nullability, similar to pandas info()
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Mass: string (nullable = true)
 |-- Width: string (nullable = true)
 |-- Height: string (nullable = true)



Why are Mass, Width and Height shown as strings? By default, Spark reads every CSV column as a string.

To get proper types, we need to add one more option: inferSchema=True.

In [170]:
df_pyspark = spark.read.option('header', 'true').csv('fruits.csv', inferSchema=True)

In [171]:
# Again check the schema
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Mass: integer (nullable = true)
 |-- Width: double (nullable = true)
 |-- Height: double (nullable = true)



Now we can see the correct data types.

<code>Another way to read the dataset</code>

In [172]:
df_pyspark = spark.read.csv('fruits.csv', header=True, inferSchema=True)
df_pyspark.show()

+----------+------------+----+-----+------+
|      Name|        Type|Mass|Width|Height|
+----------+------------+----+-----+------+
|     apple|granny_smith| 159|  3.5|   3.9|
|     apple|golden_apple| 178|  8.3|   6.4|
|     apple|granny_smith| 163|  6.8|   9.1|
|strawberry|june-bearing|  80|  1.2|   1.9|
|    oramge|     typical|   8|  9.7|   7.9|
+----------+------------+----+-----+------+



In [173]:
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Mass: integer (nullable = true)
 |-- Width: double (nullable = true)
 |-- Height: double (nullable = true)



<code>DataFrame:</code>

- It is a data structure, because we can perform different kinds of operations on it

In [174]:
# check type
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

Exercise 3 - <code>Basic Data Analysis and Manipulation</code>

Task 1 - <code>Selecting columns and Indexing</code>

In [175]:
# how we get all columns
df_pyspark.columns

['Name', 'Type', 'Mass', 'Width', 'Height']

<code>Select the column</code>

In [176]:
# to view the whole table
df_pyspark.show()

+----------+------------+----+-----+------+
|      Name|        Type|Mass|Width|Height|
+----------+------------+----+-----+------+
|     apple|granny_smith| 159|  3.5|   3.9|
|     apple|golden_apple| 178|  8.3|   6.4|
|     apple|granny_smith| 163|  6.8|   9.1|
|strawberry|june-bearing|  80|  1.2|   1.9|
|    oramge|     typical|   8|  9.7|   7.9|
+----------+------------+----+-----+------+



In [177]:
# pick only Name column
df_pyspark.select('Name')

DataFrame[Name: string]

In [178]:
type(df_pyspark.select('Name'))

pyspark.sql.dataframe.DataFrame

In [179]:
# pick two columns
df_pyspark.select(['Name', 'Mass'])

DataFrame[Name: string, Mass: int]

In [180]:
# view the values of the two columns
df_pyspark.select(['Name', 'Mass']).show()

+----------+----+
|      Name|Mass|
+----------+----+
|     apple| 159|
|     apple| 178|
|     apple| 163|
|strawberry|  80|
|    oramge|   8|
+----------+----+



In [181]:
# Check dtypes
df_pyspark.dtypes

[('Name', 'string'),
 ('Type', 'string'),
 ('Mass', 'int'),
 ('Width', 'double'),
 ('Height', 'double')]

Checking <code>describe()</code>, as in pandas

In [182]:
df_pyspark.describe()

DataFrame[summary: string, Name: string, Type: string, Mass: string, Width: string, Height: string]

In [183]:
df_pyspark.describe().show()

+-------+----------+------------+-----------------+-----------------+-----------------+
|summary|      Name|        Type|             Mass|            Width|           Height|
+-------+----------+------------+-----------------+-----------------+-----------------+
|  count|         5|           5|                5|                5|                5|
|   mean|      null|        null|            117.6|              5.9|5.839999999999999|
| stddev|      null|        null|72.19626029095967|3.494996423460259|2.935643030070243|
|    min|     apple|golden_apple|                8|              1.2|              1.9|
|    max|strawberry|     typical|              178|              9.7|              9.1|
+-------+----------+------------+-----------------+-----------------+-----------------+



Column operations:

<code>Adding columns</code>

In [184]:
df_pyspark.withColumn('Mass After some month', df_pyspark['Mass']+20)

DataFrame[Name: string, Type: string, Mass: int, Width: double, Height: double, Mass After some month: int]

In [185]:
df_pyspark.withColumn('Mass After some month', df_pyspark['Mass']+20).show()

+----------+------------+----+-----+------+---------------------+
|      Name|        Type|Mass|Width|Height|Mass After some month|
+----------+------------+----+-----+------+---------------------+
|     apple|granny_smith| 159|  3.5|   3.9|                  179|
|     apple|golden_apple| 178|  8.3|   6.4|                  198|
|     apple|granny_smith| 163|  6.8|   9.1|                  183|
|strawberry|june-bearing|  80|  1.2|   1.9|                  100|
|    oramge|     typical|   8|  9.7|   7.9|                   28|
+----------+------------+----+-----+------+---------------------+



withColumn returns a new DataFrame, so we assign the result back to keep the added column:

In [186]:
df_pyspark = df_pyspark.withColumn('Mass After some month', df_pyspark['Mass']+20)

In [187]:
df_pyspark.show()

+----------+------------+----+-----+------+---------------------+
|      Name|        Type|Mass|Width|Height|Mass After some month|
+----------+------------+----+-----+------+---------------------+
|     apple|granny_smith| 159|  3.5|   3.9|                  179|
|     apple|golden_apple| 178|  8.3|   6.4|                  198|
|     apple|granny_smith| 163|  6.8|   9.1|                  183|
|strawberry|june-bearing|  80|  1.2|   1.9|                  100|
|    oramge|     typical|   8|  9.7|   7.9|                   28|
+----------+------------+----+-----+------+---------------------+



<code>Drop column</code>

In [189]:
df_pyspark = df_pyspark.drop('Mass After some month')

In [190]:
df_pyspark.show()

+----------+------------+----+-----+------+
|      Name|        Type|Mass|Width|Height|
+----------+------------+----+-----+------+
|     apple|granny_smith| 159|  3.5|   3.9|
|     apple|golden_apple| 178|  8.3|   6.4|
|     apple|granny_smith| 163|  6.8|   9.1|
|strawberry|june-bearing|  80|  1.2|   1.9|
|    oramge|     typical|   8|  9.7|   7.9|
+----------+------------+----+-----+------+



<code>Rename Column</code>

In [191]:
df_pyspark.withColumnRenamed('Mass', 'Weight')

DataFrame[Name: string, Type: string, Weight: int, Width: double, Height: double]

In [193]:
df_pyspark.withColumnRenamed('Mass', 'Weight').show()

+----------+------------+------+-----+------+
|      Name|        Type|Weight|Width|Height|
+----------+------------+------+-----+------+
|     apple|granny_smith|   159|  3.5|   3.9|
|     apple|golden_apple|   178|  8.3|   6.4|
|     apple|granny_smith|   163|  6.8|   9.1|
|strawberry|june-bearing|    80|  1.2|   1.9|
|    oramge|     typical|     8|  9.7|   7.9|
+----------+------------+------+-----+------+

