Before we start learning Spark, let's get comfortable with downloading and uploading data in the Colab environment.

1) Access the local file system using Python code

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz 


In [None]:
!ls

sample_data  spark-3.1.2-bin-hadoop2.7.tgz


In [None]:
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"

We can read the file either by uploading it from the local machine or by mounting Google Drive.

In [None]:
from google.colab import files

In [None]:
uploaded=files.upload()

Saving people.json to people.json


Let's mount Google Drive. Run the cell below, click on the link that appears, copy the code, paste it into the box, and press Enter.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


PySpark isn't on sys.path by default. You can add pyspark to sys.path at runtime using findspark.

In [None]:
import findspark
findspark.init()

We are ready to create a Spark session. Let's go ahead and do that.

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName("Basics").getOrCreate()

In [None]:
!ls

jsonexample  simple.xml			spark-3.1.2-bin-hadoop2.7.tgz
sample_data  spark-3.1.2-bin-hadoop2.7


In [None]:
import pyspark.sql.functions as F  # use the F. prefix; a wildcard import would shadow Python built-ins such as sum and max

Let's start with a simple JSON file.

In [None]:
df=spark.read.format('json').load('people.json')

In [None]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



Let's try reading a nested JSON file with the contents below.


{"isbn": "123-456-222",  
 "author": 
    {
      "lastname": "Doe",
      "firstname": "Jane"
    },
"editor": 
    {
      "lastname": "Smith",
      "firstname": "Jane"
    },
  "title": "The Ultimate Database Study Guide",  
  "category": ["Non-Fiction", "Technology"]
 }

In [None]:
df=spark.read.format('json').load('jsonexample')

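The read itself succeeds, but calling df.show() on the result raises an error, because by default Spark expects each line of the file to hold one complete JSON object: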


AnalysisException                         Traceback (most recent call last)
<ipython-input-104-1a6ce2362cd4> in <module>()
----> 1 df.show()

2 frames
/content/spark-3.1.2-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().

Let's try the same command with the multiLine parameter set to True, which tells Spark that a single JSON record may span several lines.

In [None]:
df=spark.read.format('json').load('jsonexample',multiLine=True)


In [None]:
df.show()

+-----------+--------------------+-------------+-----------+--------------------+
|     author|            category|       editor|       isbn|               title|
+-----------+--------------------+-------------+-----------+--------------------+
|{Jane, Doe}|[Non-Fiction, Tec...|{Jane, Smith}|123-456-222|The Ultimate Data...|
+-----------+--------------------+-------------+-----------+--------------------+
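To see why the multiLine setting matters, here is a minimal plain-Python sketch (not Spark itself) that mimics a line-by-line JSON reader. The helper name `parse_line_by_line` and the shortened record are illustrative assumptions. Every line of a pretty-printed object fails to parse on its own, while the same object parses fine when read as a whole or when written on a single line.

```python
import json

# One pretty-printed JSON object spread over several lines,
# a shortened stand-in for the nested file shown above.
pretty = """{
  "isbn": "123-456-222",
  "author": {"lastname": "Doe", "firstname": "Jane"},
  "category": ["Non-Fiction", "Technology"]
}"""


def parse_line_by_line(text):
    """Treat every line as its own JSON document; count the lines that fail."""
    records, bad = [], 0
    for line in text.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad += 1
    return records, bad


# Line-by-line parsing recovers nothing from the pretty-printed text.
records, bad = parse_line_by_line(pretty)
print(len(records), "records,", bad, "unparseable lines")

# Parsing the whole text at once works.
whole = json.loads(pretty)
print(whole["author"]["firstname"])

# The same object written on one line is fine for a line-by-line reader.
compact = json.dumps(whole)
compact_records, compact_bad = parse_line_by_line(compact)
print(len(compact_records), "records,", compact_bad, "unparseable lines")
```

In other words, either keep one JSON object per line in the file, or pass multiLine=True when the file is pretty-printed.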



We can see that the data is still nested. Let us print the schema and see

In [None]:
df.printSchema()


root
 |-- author: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- category: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- editor: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- title: string (nullable = true)



In [None]:
df = (
    df.withColumn('author_firstname', F.col('author.firstname'))
      .withColumn('author_lastname', F.col('author.lastname'))
      .withColumn('editor_firstname', F.col('editor.firstname'))
      .withColumn('editor_lastname', F.col('editor.lastname'))
      .withColumn('category', F.explode(F.col('category')))  # one row per category
)

In [None]:
df.columns

['author',
 'category',
 'editor',
 'isbn',
 'title',
 'author_firstname',
 'author_lastname',
 'editor_firstname',
 'editor_lastname']

We can select the columns we require and display them in the order we desire

In [None]:
df=df.select('isbn','title','author_firstname','author_lastname',
             'editor_firstname','editor_lastname','category')
df.show(3)

+-----------+--------------------+----------------+---------------+----------------+---------------+-----------+
|       isbn|               title|author_firstname|author_lastname|editor_firstname|editor_lastname|   category|
+-----------+--------------------+----------------+---------------+----------------+---------------+-----------+
|123-456-222|The Ultimate Data...|            Jane|            Doe|            Jane|          Smith|Non-Fiction|
|123-456-222|The Ultimate Data...|            Jane|            Doe|            Jane|          Smith| Technology|
+-----------+--------------------+----------------+---------------+----------------+---------------+-----------+
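The flattening above does two things: it pulls the struct fields up into top-level columns, and it "explodes" the category array into one row per element. The following plain-Python sketch (an illustration on a dict, not Spark code; `flatten_and_explode` is a hypothetical helper) shows the same transformation on the sample record.

```python
# The sample record from the nested JSON file.
record = {
    "isbn": "123-456-222",
    "author": {"lastname": "Doe", "firstname": "Jane"},
    "editor": {"lastname": "Smith", "firstname": "Jane"},
    "title": "The Ultimate Database Study Guide",
    "category": ["Non-Fiction", "Technology"],
}


def flatten_and_explode(rec):
    """Lift nested fields to the top level, then emit one row per category."""
    base = {
        "isbn": rec["isbn"],
        "title": rec["title"],
        "author_firstname": rec["author"]["firstname"],
        "author_lastname": rec["author"]["lastname"],
        "editor_firstname": rec["editor"]["firstname"],
        "editor_lastname": rec["editor"]["lastname"],
    }
    # "Explode": repeat the flat fields once for every array element.
    return [{**base, "category": category} for category in rec["category"]]


rows = flatten_and_explode(record)
for row in rows:
    print(row["isbn"], row["author_lastname"], row["editor_lastname"], row["category"])
```

One input record with two categories becomes two output rows, which matches the two rows in the table above.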



In [None]:
from google.colab import drive
drive.mount('drive')

Mounted at drive


In [None]:
!ls -lrt drive/MyDrive/Colab\ Notebooks/

total 4607423
-rw------- 1 root root       2939 Jul 23  2019  Untitled1.ipynb
-rw------- 1 root root 4023984855 Sep  4  2019  09-03-2019.csv
-rw------- 1 root root        297 Nov  3  2019  Untitled3.ipynb
-rw------- 1 root root     111843 Nov 24  2019 'Copy of colab_02_mnist.ipynb'
-rw------- 1 root root      99374 Nov 25  2019  mnisthw.ipynb
-rw------- 1 root root   72813993 Nov 29  2019  UScomments.csv
-rw------- 1 root root   76461634 Nov 29  2019  UScomments.txt
-rw------- 1 root root    2982768 Nov 29  2019  USvideos.csv
-rw------- 1 root root    5596544 Nov 29  2019  usvideos.txt
-rw------- 1 root root    1949236 Nov 29  2019  usvideosnew.txt
-rw------- 1 root root      90956 Nov 29  2019  ML_Sentiment_Label_Model-master.zip
-rw------- 1 root root   76461634 Nov 29  2019  usvideosnew2.txt
-rw------- 1 root root      58226 Nov 29  2019  amazon_cells_labelled.txt
-rw------- 1 root root      61320 Nov 29  2019  yelp_labelled.txt
-rw------- 1 root root       1070 Nov 29  2019  readme

In [None]:
!ls -lrt /content/gdrive/My Drive/Colab \Notebooks



ls: cannot access '/content/gdrive/My': No such file or directory
ls: cannot access 'Drive/Colab': No such file or directory
ls: cannot access 'Notebooks': No such file or directory
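The error above comes from the unquoted spaces in the path: the shell splits the path into three separate arguments, and the backslash in front of "Notebooks" escapes the letter N rather than a space. The sketch below uses Python's standard shlex module, which follows POSIX shell splitting rules, to show how the command is tokenized and two ways to keep the path in one piece. The path would also have to match the actual mount point; the cells above mount the drive at /content/drive and drive, not /content/gdrive.

```python
import shlex

# The command from the failing cell (without the leading "!").
broken = r"ls -lrt /content/gdrive/My Drive/Colab \Notebooks"
print(shlex.split(broken))
# ['ls', '-lrt', '/content/gdrive/My', 'Drive/Colab', 'Notebooks']

# Fix 1: wrap the whole path in quotes.
quoted = 'ls -lrt "/content/gdrive/My Drive/Colab Notebooks"'
print(shlex.split(quoted))
# ['ls', '-lrt', '/content/gdrive/My Drive/Colab Notebooks']

# Fix 2: put a backslash directly before each space.
escaped = r"ls -lrt /content/gdrive/My\ Drive/Colab\ Notebooks"
print(shlex.split(escaped))
# ['ls', '-lrt', '/content/gdrive/My Drive/Colab Notebooks']
```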


It is also possible to access Google Cloud Storage, AWS S3, Kaggle datasets, MySQL databases, and so on. Please use the link below if you would like to know more:
https://neptune.ai/blog/google-colab-dealing-with-files

Now that you have found your best way of loading data we can start with spark.


First, we have to download and install the necessary packages



In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz


In [None]:
!ls

sample_data  spark-3.0.2-bin-hadoop2.7.tgz


In [None]:
!tar xf spark-3.0.2-bin-hadoop2.7.tgz
!pip install -q findspark



There are some useful Linux commands you can use to check whether your data has been loaded. To list the contents of the present working directory, run !ls.

In [None]:
!ls

 drive		        spark-3.0.2-bin-hadoop2.7
'new_data_2 (1).json'   spark-3.0.2-bin-hadoop2.7.tgz
 new_data_2.json        spark-3.0.2-bin-hadoop2.7.tgz.1
 sample_data


If you would like to check your present working directory, try !pwd. Don't forget the ! before the Linux command.

In [None]:
!pwd

/content


Two more steps to go before we start coding!
The first one is to set the Spark and Java home directories.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"
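If Spark later fails to start, a wrong JAVA_HOME or SPARK_HOME path is a common cause (for example, when the Spark version in the path does not match the archive that was extracted). A small sanity check, sketched here with a hypothetical helper `missing_dirs`, reports which of the variables are unset or do not point to an existing directory.

```python
import os


def missing_dirs(env_vars):
    """Return the names of variables that are unset or not an existing directory."""
    return [
        name
        for name in env_vars
        if not os.path.isdir(os.environ.get(name, ""))
    ]


# An empty list means both paths exist on this machine.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```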

PySpark isn't on sys.path by default. You can add pyspark to sys.path at runtime using findspark.

In [None]:
import findspark
findspark.init()

We are ready to create a Spark session. Let's go ahead and do that.

In [None]:
from pyspark.sql import SparkSession



Let's go ahead and create a variable for the spark session



In [None]:
spark = SparkSession.builder.appName("Basics").getOrCreate()

**Reading the data into a dataframe and displaying the contents**

We have already loaded the data into the working directory. Remember that we loaded the data earlier into a pandas dataframe; now we are going to load it into a Spark dataframe.

In [None]:
df=spark.read.json('people.json')

Let's see the content of the dataframe

In [None]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [None]:
df=df.fillna(0)  # replace the null age with 0
df.show()

+---+-------+
|age|   name|
+---+-------+
|  0|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+



Let's look at the schema 

In [None]:
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



List the columns

In [None]:
df.columns

['age', 'name']

Get a statistical summary of the columns. For string columns such as name, only count, min, and max are meaningful.

In [None]:
df.describe().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 3|      3|
|   mean|16.333333333333332|   null|
| stddev| 15.17673658377628|   null|
|    min|                 0|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
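As a cross-check, the age statistics in this summary can be reproduced with Python's standard statistics module from the three ages shown earlier (0, 30, 19). This is only a sketch for verification, not how Spark computes them. Note that the stddev row matches the sample standard deviation (divide by n - 1), not the population one.

```python
import statistics

# The ages after the null was replaced with 0.
ages = [0, 30, 19]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "stddev": statistics.stdev(ages),   # sample standard deviation (n - 1)
    "min": min(ages),
    "max": max(ages),
}
for key, value in summary.items():
    print(key, value)

# For comparison: the population standard deviation (divide by n) is smaller.
print("pstdev", statistics.pstdev(ages))
```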



Let's look into pre-defining the schema before we load the data into a dataframe 

In [None]:
from pyspark.sql.types import StructField,StringType,IntegerType,StructType

Within the parentheses of StructField, "age" is the column name, IntegerType() is the data type of the age field, and True marks the field as nullable (which is also the default). Declaring a field non-nullable when the data contains nulls may cause errors.

In [None]:
[StructField("age", IntegerType(), True)]  # name, data type, nullable

In [None]:
schema = [StructField("age", IntegerType(), True),StructField("name", StringType(), True)]

In [None]:
struc=StructType(fields=schema)

Now, let's read the data in along with the schema we have defined.

In [None]:
df = spark.read.json('people.json', schema=struc)

In [None]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



In [None]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [None]:
df=df.fillna(0)

Let's do a small exercise. Read the sales_info data, which is in CSV format, into a dataframe and display it. Predefine the schema with the following data types for the three fields: Company: String, Person: String, Sales: Integer. The field values can be null.



In [None]:
#Load the data from your local system


In [None]:
#uploaded=files.upload()

Saving sales_info.csv to sales_info.csv


In [None]:
#schema=[StructField('Company',StringType(),True),StructField('Person',StringType(),True),StructField('Sales',IntegerType(),True)]

In [None]:
#struc=StructType(schema)

In [None]:
#df=spark.read.csv('sales_info.csv',schema=struc,header=True)

In [None]:
#df.show()

+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
|   GOOG|    Sam|  200|
|   GOOG|Charlie|  120|
|   GOOG|  Frank|  340|
|   MSFT|   Tina|  600|
|   MSFT|    Amy|  124|
|   MSFT|Vanessa|  243|
|     FB|   Carl|  870|
|     FB|  Sarah|  350|
|   APPL|   John|  250|
|   APPL|  Linda| null|
|   APPL|   Mike| null|
|   APPL|  Chris| null|
+-------+-------+-----+



### Grabbing the data

Let's grab the data columnwise.

In [None]:
type(df['age'])

pyspark.sql.column.Column

In [None]:
df.select('age')

DataFrame[age: bigint]

In [None]:
type(df.select('age'))

pyspark.sql.dataframe.DataFrame

In [None]:
df.select('age').show()

+----+
| age|
+----+
|null|
|  30|
|  19|
+----+



Now, grab the first two columns of the dataframe and display the contents of the dataframe

In [None]:
#df.select(['age','name']).show()

Let's grab the data row-wise

In [None]:
# Returns list of Row objects
df.head(2)

[Row(age=None, name='Michael'), Row(age=30, name='Andy')]

What if I want to check only the first row of the dataframe?

In [None]:
df.head(2)[0]

Row(age=None, name='Michael')

Let's create new columns from the age column: current_age adds 5 to age, new_age multiplies age by 5, and finally age itself is overwritten with age divided by 2.

In [None]:
df=df.withColumn('current_age',df['age']+5)
df=df.withColumn('new_age',df['age']*5)
df=df.withColumn('age',df['age']/2)

In [None]:
df.show()

+----+-------+-----------+-------+
| age|   name|current_age|new_age|
+----+-------+-----------+-------+
|null|Michael|       null|   null|
|15.0|   Andy|         35|    150|
| 9.5| Justin|         24|     95|
+----+-------+-----------+-------+



In [None]:
df.dtypes

[('age', 'double'),
 ('name', 'string'),
 ('current_age', 'bigint'),
 ('new_age', 'bigint')]
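Two things stand out in the result: Michael's null age stays null in every derived column, and dividing by 2 turns age into a double while the other two columns stay integers. The plain-Python sketch below (not Spark code; `null_safe` is a hypothetical helper) reproduces that behaviour by treating None like a SQL null.

```python
import operator


def null_safe(op, value, operand):
    """Apply op(value, operand), but let None propagate like a SQL null."""
    return None if value is None else op(value, operand)


people = [(None, "Michael"), (30, "Andy"), (19, "Justin")]

derived = [
    {
        "age": null_safe(operator.truediv, age, 2),     # true division gives a float
        "name": name,
        "current_age": null_safe(operator.add, age, 5),
        "new_age": null_safe(operator.mul, age, 5),
    }
    for age, name in people
]

for row in derived:
    print(row)
```

All three derived values are computed from the original age, because age is only overwritten in the last step.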

### Using SQL

We can try running some SQL queries through PySpark. Let's start by registering the dataframe as a temporary view called people.

In [None]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

In [None]:
sql_results = spark.sql("SELECT * FROM people")

Now, let's make a query based on certain conditions.

In [None]:
result=spark.sql("SELECT * FROM people WHERE new_age>100")

In [None]:
result.show()

+----+----+-----------+-------+
| age|name|current_age|new_age|
+----+----+-----------+-------+
|15.0|Andy|         35|    150|
+----+----+-----------+-------+
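Only Andy comes back: Justin's new_age of 95 fails the condition, and Michael's null new_age is dropped as well, because a comparison with null is never true in SQL. The sketch below replays the same query with Python's built-in sqlite3 module instead of Spark, purely to illustrate that null behaviour; the table is filled by hand with the values shown above.

```python
import sqlite3

# In-memory database standing in for the "people" temporary view.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (age REAL, name TEXT, current_age INTEGER, new_age INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        (None, "Michael", None, None),
        (15.0, "Andy", 35, 150),
        (9.5, "Justin", 24, 95),
    ],
)

# Same condition as the Spark query: the null row never matches.
matches = conn.execute(
    "SELECT name, new_age FROM people WHERE new_age > 100"
).fetchall()
print(matches)

# Rows with nulls have to be asked for explicitly with IS NULL.
nulls = conn.execute("SELECT name FROM people WHERE new_age IS NULL").fetchall()
print(nulls)

conn.close()
```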



In [None]:
result=spark.sql("SELECT * FROM people WHERE name='Michael'")

In [None]:
result.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
+----+-------+

