<a href="https://colab.research.google.com/github/sanat259/SparkPractice/blob/main/Data_types_definingDataFrame.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pyspark


Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=63797be2c04fc57b5789fc4f7e8fd482fb1683c93d5d59636d1add9fd41f8959
  Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.1


In [2]:

import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import count
from pyspark import SparkFiles

In [3]:
spark = SparkSession.builder.appName("Data Spark").getOrCreate()


Defining a schema up front, rather than relying on schema-on-read, is useful because:
* You prevent Spark from creating a separate job just to read a large portion of your file to ascertain the schema, which for a large data file can be expensive and time-consuming.
* You relieve Spark from the onus of inferring data types.
* You can detect errors early if data doesn’t match the schema.

A schema can be declared either programmatically, as below, or as a DDL string similar to SQL.

In [4]:
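# Programmatic declaration
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Each StructField takes a column name, a data type, and a nullable flag.
schema = StructType([
    StructField("author", StringType(), False),
    StructField("title", StringType(), False),
    StructField("pages", IntegerType(), False),
])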

In [5]:
schema

StructType(List(StructField(author,StringType,false),StructField(title,StringType,false),StructField(pages,IntegerType,false)))

In [8]:
# DDL declaration
schema2 = "`Id` INT, `First` STRING, `Last` STRING, `Url` STRING,`Published` STRING, `Hits` INT, `Campaigns` ARRAY<STRING>"

In [9]:
schema2

'`Id` INT, `First` STRING, `Last` STRING, `Url` STRING,`Published` STRING, `Hits` INT, `Campaigns` ARRAY<STRING>'

In [10]:
# Create our static data
data = [
    [1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter", "LinkedIn"]],
    [2, "Brooke", "Wenig", "https://tinyurl.2", "5/5/2018", 8908, ["twitter", "LinkedIn"]],
    [3, "Denny", "Lee", "https://tinyurl.3", "6/7/2019", 7659, ["web", "twitter", "FB", "LinkedIn"]],
    [4, "Tathagata", "Das", "https://tinyurl.4", "5/12/2018", 10568, ["twitter", "FB"]],
    [5, "Matei", "Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web", "twitter", "FB", "LinkedIn"]],
    [6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568, ["twitter", "LinkedIn"]],
]

In [11]:
data

[[1,
  'Jules',
  'Damji',
  'https://tinyurl.1',
  '1/4/2016',
  4535,
  ['twitter', 'LinkedIn']],
 [2,
  'Brooke',
  'Wenig',
  'https://tinyurl.2',
  '5/5/2018',
  8908,
  ['twitter', 'LinkedIn']],
 [3,
  'Denny',
  'Lee',
  'https://tinyurl.3',
  '6/7/2019',
  7659,
  ['web', 'twitter', 'FB', 'LinkedIn']],
 [4,
  'Tathagata',
  'Das',
  'https://tinyurl.4',
  '5/12/2018',
  10568,
  ['twitter', 'FB']],
 [5,
  'Matei',
  'Zaharia',
  'https://tinyurl.5',
  '5/14/2014',
  40578,
  ['web', 'twitter', 'FB', 'LinkedIn']],
 [6,
  'Reynold',
  'Xin',
  'https://tinyurl.6',
  '3/2/2015',
  25568,
  ['twitter', 'LinkedIn']]]

In [12]:
blogs_df = spark.createDataFrame(data=data, schema=schema2)

In [13]:
blogs_df.show()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



In [14]:
blogs_df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Hits: integer (nullable = true)
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [19]:
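# Spark writes a directory of part files at this path, not a single JSON file;
# with the default save mode the call fails if the path already exists.
blogs_df.write.format("json").save("my_blogs_dfv.json")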
