# Basic Operations

This lecture will cover some basic operations with Spark DataFrames.

We will play around with some stock data from Apple.

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

In [None]:
import findspark
findspark.init()

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName("Operations").getOrCreate()

In [None]:
from google.colab import files

In [None]:
uploaded=files.upload()

Saving appl_stock.csv to appl_stock.csv


Let's read in the data

In [None]:
# Let Spark know about the header and infer the Schema types!
df = spark.read.csv('appl_stock.csv',inferSchema=True,header=True)

In [None]:
df.show()

+----------+------------------+------------------+------------------+------------------+---------+------------------+
|      Date|              Open|              High|               Low|             Close|   Volume|         Adj Close|
+----------+------------------+------------------+------------------+------------------+---------+------------------+
|2010-01-04|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|
|2010-01-05|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|
|2010-01-06|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|
|2010-01-07|            211.75|        212.000006|        209.050005|            210.58|119282800|          27.28265|
|2010-01-08|        210.299994|        212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|
|2010-01-11|212.79999700000002|        213.000002|      

Check the names of the columns

In [None]:
df.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

Check the schema

In [None]:
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)



### Renaming columns

In [None]:
df1=df.withColumnRenamed('Adj Close','adj_close')

In [None]:
df1.show()

+----------+------------------+------------------+------------------+------------------+---------+------------------+
|      Date|              Open|              High|               Low|             Close|   Volume|         adj_close|
+----------+------------------+------------------+------------------+------------------+---------+------------------+
|2010-01-04|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|
|2010-01-05|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|
|2010-01-06|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|
|2010-01-07|            211.75|        212.000006|        209.050005|            210.58|119282800|          27.28265|
|2010-01-08|        210.299994|        212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|
|2010-01-11|212.79999700000002|        213.000002|      

## Filtering Data



In [None]:
df1=df.filter(df.High>214)

In [None]:
df1.show()

+----------+------------------+------------------+------------------+------------------+---------+------------------+
|      Date|              Open|              High|               Low|             Close|   Volume|         Adj Close|
+----------+------------------+------------------+------------------+------------------+---------+------------------+
|2010-01-04|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|
|2010-01-05|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|
|2010-01-06|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|
|2010-01-19|        208.330002|215.18999900000003|        207.240004|        215.039995|182501900|27.860484999999997|
|2010-01-20|        214.910006|        215.549994|        209.500002|            211.73|153038200|         27.431644|
|2010-03-05|        214.940006|219.69999500000003|214.62

Let's see how many data points are left in the new DataFrame after filtering

In [None]:
# Count each DataFrame once, then report how many rows the filter removed
total_rows=df.count()
kept_rows=df1.count()
print("Number of records dropped=",total_rows-kept_rows)

Number of records dropped= 685


Let's apply a filter based on multiple conditions i.e. High<230 and Low>220

In [None]:
df2=df.filter((df.High<230)& (df.Low>220))

In [None]:
df.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

Let's apply a filter where either of two conditions may hold i.e. High==196 or Low<220

In [None]:
df3=df.filter((df.High==196)|(df.Low<220))

In [None]:
df3.show()

+----------+------------------+------------------+------------------+------------------+---------+------------------+
|      Date|              Open|              High|               Low|             Close|   Volume|         Adj Close|
+----------+------------------+------------------+------------------+------------------+---------+------------------+
|2010-01-04|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|
|2010-01-05|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|
|2010-01-06|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|
|2010-01-07|            211.75|        212.000006|        209.050005|            210.58|119282800|          27.28265|
|2010-01-08|        210.299994|        212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|
|2010-01-11|212.79999700000002|        213.000002|      

Let's look for nulls in the dataset





In [None]:
df3.filter(df.Date.isNull()| df.Open.isNull()).show()

+----+----+----+---+-----+------+---------+
|Date|Open|High|Low|Close|Volume|Adj Close|
+----+----+----+---+-----+------+---------+
+----+----+----+---+-----+------+---------+



Let's apply a filter that combines 3 conditions: (High==196 or Volume==187469100) and Close>195

In [None]:
df3=df3.filter(((df.High==196)|(df.Volume==187469100))& (df.Close>195))

You can also refer to the columns with bracket notation, as shown below. Both forms give the same result here.

In [None]:
df3=df.filter(((df['High']==196)|(df['Volume']==187469100))& (df['Close']>195))

In [None]:
df3.show()

+----------+------------------+-----+----------+----------+---------+------------------+
|      Date|              Open| High|       Low|     Close|   Volume|         Adj Close|
+----------+------------------+-----+----------+----------+---------+------------------+
|2010-02-05|192.63000300000002|196.0|190.850002|195.460001|212576700|25.323710000000002|
+----------+------------------+-----+----------+----------+---------+------------------+



### Changing the datatype and using concat, substring functions

Let's look at some other interesting and useful functions. If I want an additional column in the format 'Date : Open', I can use the 'concat' function, with 'lit' supplying the separator.

In [None]:
from pyspark.sql import functions as F

In [None]:
# Start from the full DataFrame (df3 holds only the filtered rows at this point)
df4=df.withColumn('Concatenated',F.concat(F.col('Date'),F.lit(' : '),F.col('Open')))
df4.show(5)

+----------+----------+----------+------------------+------------------+---------+------------------+--------------------+
|      Date|      Open|      High|               Low|             Close|   Volume|         Adj Close|        Concatenated|
+----------+----------+----------+------------------+------------------+---------+------------------+--------------------+
|2010-01-04|213.429998|214.499996|212.38000099999996|        214.009998|123432400|         27.727039|2010-01-04 : 213....|
|2010-01-05|214.599998|215.589994|        213.249994|        214.379993|150476200|27.774976000000002|2010-01-05 : 214....|
|2010-01-06|214.379993|    215.23|        210.750004|        210.969995|138040000|27.333178000000004|2010-01-06 : 214....|
|2010-01-07|    211.75|212.000006|        209.050005|            210.58|119282800|          27.28265| 2010-01-07 : 211.75|
|2010-01-08|210.299994|212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|2010-01-08 : 210....|
+----------+----

Now let's create a new column, Datenew, that holds Date cast to a timestamp

In [None]:
from pyspark.sql.types import TimestampType,IntegerType,FloatType,DoubleType

In [None]:
df4=df4.withColumn('Datenew',F.col('Date').cast(TimestampType()))

In [None]:
df4.dtypes

[('Date', 'string'),
 ('Open', 'double'),
 ('High', 'double'),
 ('Low', 'double'),
 ('Close', 'double'),
 ('Volume', 'int'),
 ('Adj Close', 'double'),
 ('Concatenated', 'string'),
 ('Datenew', 'timestamp')]

Now let's change the datatype of Volume from integer to double

In [None]:
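# Keep the result in its own variable so df4 (with Concatenated and Datenew) stays intact
df_volume=df.withColumn('Volume',F.col('Volume').cast(DoubleType()))

In [None]:
df_volume.show()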


+----------+------------------+------------------+------------------+------------------+----------+------------------+
|      Date|              Open|              High|               Low|             Close|    Volume|         Adj Close|
+----------+------------------+------------------+------------------+------------------+----------+------------------+
|2010-01-04|        213.429998|        214.499996|212.38000099999996|        214.009998|1.234324E8|         27.727039|
|2010-01-05|        214.599998|        215.589994|        213.249994|        214.379993|1.504762E8|27.774976000000002|
|2010-01-06|        214.379993|            215.23|        210.750004|        210.969995|  1.3804E8|27.333178000000004|
|2010-01-07|            211.75|        212.000006|        209.050005|            210.58|1.192828E8|          27.28265|
|2010-01-08|        210.299994|        212.000006|209.06000500000002|211.98000499999998|1.119027E8|         27.464034|
|2010-01-11|212.79999700000002|        213.00000

In [None]:
df4.withColumn('month',F.substring(F.col('Date'),6,2)).show()

+----------+------------------+------------------+------------------+------------------+---------+------------------+--------------------+-------------------+-----+
|      Date|              Open|              High|               Low|             Close|   Volume|         Adj Close|        Concatenated|            Datenew|month|
+----------+------------------+------------------+------------------+------------------+---------+------------------+--------------------+-------------------+-----+
|2010-01-04|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|2010-01-04 : 213....|2010-01-04 00:00:00|   01|
|2010-01-05|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|2010-01-05 : 214....|2010-01-05 00:00:00|   01|
|2010-01-06|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|2010-01-06 : 214....|2010-01-06 00:00:00|   01|
|2010-01-0

Now, let's try collecting the results as Python objects and converting one of them to a dictionary. The ".collect" function brings the entire result to the driver. If your result is very big this is not advisable, since it can cause an out-of-memory error.

In [None]:
# Collecting results as Python objects
result=df.filter((df['High']==196)|(df['Volume']==187469100)).collect()

In [None]:
result

[Row(Date='2010-02-01', Open=192.36999699999998, High=196.0, Low=191.29999899999999, Close=194.729998, Volume=187469100, Adj Close=25.229131),
 Row(Date='2010-02-05', Open=192.63000300000002, High=196.0, Low=190.850002, Close=195.460001, Volume=212576700, Adj Close=25.323710000000002)]

In [None]:
type(result)

list

In [None]:
result[1]

Row(Date='2010-02-05', Open=192.63000300000002, High=196.0, Low=190.850002, Close=195.460001, Volume=212576700, Adj Close=25.323710000000002)

In [None]:
newdict=result[0].asDict()

In [None]:
newdict

{'Adj Close': 25.229131,
 'Close': 194.729998,
 'Date': '2010-02-01',
 'High': 196.0,
 'Low': 191.29999899999999,
 'Open': 192.36999699999998,
 'Volume': 187469100}

### Save data

In [None]:
from google.colab import drive
drive.mount('drive')

Mounted at drive


In [None]:
df5=df4.toPandas()

df5.to_csv('datafromcolab.csv')




In [None]:
!cp datafromcolab.csv "drive/My Drive/"



In [None]:
df4.repartition(1).write.format('csv').mode('append').save('homework')


In [None]:
!ls -lrt homework

total 56
-rw-r--r-- 1 root root 53822 Dec  2 19:53 part-00000-76f3e14d-25e4-44c1-91a4-304a4cae5500-c000.csv
-rw-r--r-- 1 root root     0 Dec  2 19:53 _SUCCESS


In [None]:
!cp -r homework "drive/My Drive/"



In [None]:
df4.coalesce(1).write.format('csv').mode('overwrite').save('homework_coalesce')
!cp -r homework_coalesce "drive/My Drive"

In [None]:
!ls -lrt homework_coalesce

Remember that Row objects can be turned into dictionaries by calling asDict() on them.

That is all for now. Great job!