## A notebook to start practicing with PySpark

In [1]:
import pyspark

In [2]:
from pyspark.sql import SparkSession

In [3]:
# Create my session
spark = SparkSession.builder.appName('Dataframe').getOrCreate()

In [4]:
spark

In [5]:
# Read the dataset
df_pyspark = spark.read.csv('table.csv')

In [6]:
df_pyspark.show()

+------+---+---------+
|   _c0|_c1|      _c2|
+------+---+---------+
|  name|age|     city|
| Paula| 30|   Madrid|
|  Fran| 32|Barcelona|
|Marina| 25| Valencia|
+------+---+---------+



We can see that Spark has generated default column names (_c0, _c1, _c2) and treated my actual header (name, age, city) as a data row.

Next, let's read the dataset again with the option **header=True** so that Spark treats the first row as the header.

In [7]:
df_pyspark = spark.read.csv('table.csv', header=True)

In [8]:
df_pyspark.show()

+------+---+---------+
|  name|age|     city|
+------+---+---------+
| Paula| 30|   Madrid|
|  Fran| 32|Barcelona|
|Marina| 25| Valencia|
+------+---+---------+



In [9]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [10]:
# Head of dataset
df_pyspark.head(3)

[Row(name='Paula', age='30', city='Madrid'),
 Row(name='Fran', age='32', city='Barcelona'),
 Row(name='Marina', age='25', city='Valencia')]

In [11]:
# Check the schema (data types)
df_pyspark.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)



By default, Spark reads every column of a CSV file as a string. Let's fix this by reading the CSV file again, this time with the option **inferSchema=True**.

In [12]:
df_pyspark = spark.read.csv('table.csv', header=True, inferSchema=True)

In [13]:
# Check the schema (data types)
df_pyspark.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)



Now we can see that age has been read as an integer.

In [14]:
# Columns
df_pyspark.columns

['name', 'age', 'city']

In [15]:
# Select a column
df_pyspark.select('name')

DataFrame[name: string]

In [16]:
# Visualise the column
df_pyspark.select('name').show()

+------+
|  name|
+------+
| Paula|
|  Fran|
|Marina|
+------+



In [17]:
# Visualise multiple columns
df_pyspark.select('name','city').show()

+------+---------+
|  name|     city|
+------+---------+
| Paula|   Madrid|
|  Fran|Barcelona|
|Marina| Valencia|
+------+---------+



In [18]:
# Check the data types
df_pyspark.dtypes

[('name', 'string'), ('age', 'int'), ('city', 'string')]

In [19]:
# Describe function
df_pyspark.describe().show()

+-------+-----+-----------------+---------+
|summary| name|              age|     city|
+-------+-----+-----------------+---------+
|  count|    3|                3|        3|
|   mean| null|             29.0|     null|
| stddev| null|3.605551275463989|     null|
|    min| Fran|               25|Barcelona|
|    max|Paula|               32| Valencia|
+-------+-----+-----------------+---------+
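The stddev row above is the sample standard deviation (it divides by n - 1), and min/max on the string columns follow alphabetical order. As a quick sanity check outside Spark, the same figures can be reproduced with the Python standard library; this is only a sketch with the three rows typed in by hand.

```python
import statistics

ages = [30, 32, 25]                  # the age column from the table above
names = ["Paula", "Fran", "Marina"]  # the name column from the table above

print(statistics.mean(ages))    # mean of the ages
print(statistics.stdev(ages))   # sample standard deviation (divides by n - 1)
print(statistics.pstdev(ages))  # population standard deviation (divides by n)

# For strings, min and max compare alphabetically.
print(min(names), max(names))
```

The value printed by `statistics.stdev` matches the stddev reported by describe(), while `statistics.pstdev` gives a smaller number.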



## Adding, dropping and renaming columns

In [20]:
# Add a column
add_df_pyspark = df_pyspark.withColumn('age in 2 years', df_pyspark['age'] + 2)

In [21]:
add_df_pyspark.show()

+------+---+---------+--------------+
|  name|age|     city|age in 2 years|
+------+---+---------+--------------+
| Paula| 30|   Madrid|            32|
|  Fran| 32|Barcelona|            34|
|Marina| 25| Valencia|            27|
+------+---+---------+--------------+



In [22]:
# Drop a column
drop_df_pyspark = add_df_pyspark.drop('city')
drop_df_pyspark.show()

+------+---+--------------+
|  name|age|age in 2 years|
+------+---+--------------+
| Paula| 30|            32|
|  Fran| 32|            34|
|Marina| 25|            27|
+------+---+--------------+



In [26]:
# Rename a column
df_pyspark.withColumnRenamed('name', 'New Name').show()

+--------+---+---------+
|New Name|age|     city|
+--------+---+---------+
|   Paula| 30|   Madrid|
|    Fran| 32|Barcelona|
|  Marina| 25| Valencia|
+--------+---+---------+

