# PySpark DataFrames - part 1
## Start Spark session

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

22/09/24 17:25:49 WARN Utils: Your hostname, RSOLE resolves to a loopback address: 127.0.1.1; using 172.24.238.207 instead (on interface eth0)
22/09/24 17:25:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/09/24 17:25:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark

## Read dataset

In [6]:
df_pyspark = spark.read.option('header', 'true').csv('data/test.csv')

In [7]:
df_pyspark.show()

+------+---+----------+
|  name|age|experience|
+------+---+----------+
| roger| 34|         5|
|  sara| 28|         9|
| arlet|  1|         0|
|victor| 31|         7|
|  aina| 27|         8|
+------+---+----------+



In [8]:
df_pyspark.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- experience: string (nullable = true)



By default, `spark.read.csv` loads every column as `string` unless told otherwise.

Passing `inferSchema=True` makes Spark scan the data and infer each column's type.

In [9]:
df_pyspark = spark.read.option('header', 'true').csv('data/test.csv', inferSchema=True)

In [10]:
df_pyspark.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- experience: integer (nullable = true)



Both options, `header=True` and `inferSchema=True`, can also be passed as keyword arguments in a single `csv()` call.

In [26]:
df_pyspark = spark.read.csv('data/test.csv', header=True, inferSchema=True)

In [27]:
df_pyspark.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- experience: integer (nullable = true)



In [28]:
df_pyspark.columns

['name', 'age', 'experience']

In [29]:
df_pyspark.head()

Row(name='roger', age=34, experience=5)

In [30]:
df_pyspark.tail(1)

[Row(name='aina', age=27, experience=8)]

## Column operations
### Select

In [31]:
df_pyspark.select('name').show()

+------+
|  name|
+------+
| roger|
|  sara|
| arlet|
|victor|
|  aina|
+------+



In [32]:
df_pyspark.select(['name', 'experience']).show()

+------+----------+
|  name|experience|
+------+----------+
| roger|         5|
|  sara|         9|
| arlet|         0|
|victor|         7|
|  aina|         8|
+------+----------+



In [33]:
df_pyspark.dtypes

[('name', 'string'), ('age', 'int'), ('experience', 'int')]

In [34]:
df_pyspark.describe().show()

+-------+------+------------------+------------------+
|summary|  name|               age|        experience|
+-------+------+------------------+------------------+
|  count|     5|                 5|                 5|
|   mean|  null|              24.2|               5.8|
| stddev|  null|13.255187663703596|3.5637059362410923|
|    min|  aina|                 1|                 0|
|    max|victor|                34|                 9|
+-------+------+------------------+------------------+



### Create new column

In [37]:
df_pyspark = df_pyspark.withColumn('experience_2years', df_pyspark['experience'] + 2)
df_pyspark.show()

+------+---+----------+-----------------+
|  name|age|experience|experience_2years|
+------+---+----------+-----------------+
| roger| 34|         5|                7|
|  sara| 28|         9|               11|
| arlet|  1|         0|                2|
|victor| 31|         7|                9|
|  aina| 27|         8|               10|
+------+---+----------+-----------------+



### Drop column

In [40]:
df_pyspark = df_pyspark.drop('experience_2years')
df_pyspark.show()

+------+---+----------+
|  name|age|experience|
+------+---+----------+
| roger| 34|         5|
|  sara| 28|         9|
| arlet|  1|         0|
|victor| 31|         7|
|  aina| 27|         8|
+------+---+----------+



### Rename columns

In [41]:
df_pyspark.withColumnRenamed('name', 'first_name')

DataFrame[first_name: string, age: int, experience: int]