# PySpark Basics

- PySpark Dataframe
- Reading the dataset
- Checking the datatypes of the columns (schema)
- Selecting columns and indexing
- Checking describe options similar to pandas
- Adding columns
- Dropping columns
- Renaming columns

## 0. Setup

In [1]:
!pip install pyspark

Successfully installed py4j-0.10.9.2 pyspark-3.2.0


In [1]:
import pyspark
from pyspark.sql import SparkSession

## 1. Start SparkSession and Load data

Syntax - start session
- `spark = SparkSession.builder.appName('df').getOrCreate()`

Syntax - read data
- `spark.read.option('header', 'true').csv(file_path, inferSchema=True)`
- `spark.read.csv(file_path, header=True, inferSchema=True)`

Syntax - inspect data
- `df.show(num)`
- `df.head(num)`
- `df.printSchema()`
- `df.describe()`
- `df.columns`

In [24]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('DataFrame').getOrCreate()
spark

In [54]:
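# read the dataset (file_path is the path to the CSV file)
df_pyspark = spark.read.csv(file_path, header=True, inferSchema=True)
# equivalent alternative: set the header through .option()
# df_pyspark = spark.read.option('header', 'true').csv(file_path, inferSchema=True)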
df_pyspark.show(5)

+--------+---------+
| keyword|freq_2021|
+--------+---------+
|    food|      540|
|  drying|      341|
| quality|      264|
|   plant|      214|
|products|      188|
+--------+---------+
only showing top 5 rows



In [32]:
# Check the schema
df_pyspark.printSchema()

root
 |-- keyword: string (nullable = true)
 |-- freq_2021: integer (nullable = true)



In [35]:
df_pyspark.columns

['keyword', 'freq_2021']

In [55]:
df_pyspark.head(5)

[Row(keyword='food', freq_2021=540),
 Row(keyword='drying', freq_2021=341),
 Row(keyword='quality', freq_2021=264),
 Row(keyword='plant', freq_2021=214),
 Row(keyword='products', freq_2021=188)]

## 2. Selecting columns and indexing

Syntax
- `df.show()`
- `df.select('keyword', 'freq_2021').show()`
- `df.select(['keyword', 'freq_2021']).show()`

In [44]:
# all three calls print the same table, since both columns are selected
df_pyspark.show()
df_pyspark.select('keyword', 'freq_2021').show()
df_pyspark.select(['keyword', 'freq_2021']).show()

+-----------+---------+
|    keyword|freq_2021|
+-----------+---------+
|       food|      540|
|     drying|      341|
|    quality|      264|
|      plant|      214|
|   products|      188|
|     higher|      177|
|       data|      177|
| properties|      173|
|     health|      171|
|      water|      170|
|development|      162|
|      total|      160|
|       risk|      157|
|    control|      150|
|    species|      150|
|  potential|      147|
|       acid|      147|
|  increased|      142|
| production|      139|
|temperature|      137|
+-----------+---------+
only showing top 20 rows



In [49]:
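df_pyspark['keyword']

Column<'keyword'>

In [50]:
# A Column is an expression, not a DataFrame, so it cannot be displayed with show();
# use df_pyspark.select('keyword').show() instead
df_pyspark['keyword'].show()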

TypeError: 'Column' object is not callable

## 3. Checking data types

Syntax
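- `df.dtypes` (an attribute, not a method)
- `df.describe()` (returns a new DataFrame of summary statistics)
- `df.describe().show()` (prints the summary statistics)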

In [51]:
df_pyspark.dtypes

[('keyword', 'string'), ('freq_2021', 'int')]

In [52]:
df_pyspark.describe()

DataFrame[summary: string, keyword: string, freq_2021: string]

In [53]:
df_pyspark.describe().show()

+-------+-----------------+------------------+
|summary|          keyword|         freq_2021|
+-------+-----------------+------------------+
|  count|            11470|             11470|
|   mean|             null| 5.390061028770706|
| stddev|             null|13.217850003680276|
|    min|               aa|                 1|
|    max|zygosaccharomyces|               540|
+-------+-----------------+------------------+



## 4. Adding columns

Syntax
- `df.withColumn('new_col', df['old_col'] + do_something).show()`
  - Not an in-place operation: it returns a new DataFrame, so assign the result to keep it
- `df = df.withColumn('new_col', df['old_col'] + do_something)`
- `df.show()`

In [67]:
# Add a column: frequency scaled by the maximum value (540)
#df_pyspark.withColumn('Scaled freq_2021', df_pyspark['freq_2021']/540).show(10)
df_pyspark = df_pyspark.withColumn('Scaled freq_2021', df_pyspark['freq_2021']/540)
df_pyspark.show(10)

+----------+---------+-------------------+
|   keyword|freq_2021|   Scaled freq_2021|
+----------+---------+-------------------+
|      food|      540|                1.0|
|    drying|      341| 0.6314814814814815|
|   quality|      264| 0.4888888888888889|
|     plant|      214| 0.3962962962962963|
|  products|      188|0.34814814814814815|
|    higher|      177| 0.3277777777777778|
|      data|      177| 0.3277777777777778|
|properties|      173|0.32037037037037036|
|    health|      171|0.31666666666666665|
|     water|      170| 0.3148148148148148|
+----------+---------+-------------------+
only showing top 10 rows



## 5. Dropping columns

In [70]:
# NOT an in-place operation: drop() returns a new DataFrame
#df_pyspark.drop('Scaled freq_2021')
df_pyspark = df_pyspark.drop('Scaled freq_2021')
df_pyspark.show(5)

+--------+---------+
| keyword|freq_2021|
+--------+---------+
|    food|      540|
|  drying|      341|
| quality|      264|
|   plant|      214|
|products|      188|
+--------+---------+
only showing top 5 rows



## 6. Renaming columns

Syntax
- `df_pyspark.withColumnRenamed('old_name', 'new_name')`

In [73]:
# NOT an in-place operation: df_pyspark itself keeps the column name 'keyword'
df_pyspark.withColumnRenamed('keyword', 'freq_word').show(5)

+---------+---------+
|freq_word|freq_2021|
+---------+---------+
|     food|      540|
|   drying|      341|
|  quality|      264|
|    plant|      214|
| products|      188|
+---------+---------+
only showing top 5 rows

