# PySpark Tutorial

[FreeCodeCamp YouTube](https://www.youtube.com/watch?v=_C8kWso4ne4&t=178s&ab_channel=freeCodeCamp.org)

## Install PySpark

In [3]:
%pip install pyspark

Note: you may need to restart the kernel to use updated packages.


## Import PySpark

In [4]:
import pyspark

In [7]:
import pandas as pd
pd.read_csv('data/test.csv')

|   | name | age |
|---|------|-----|
| 0 | roger | 34 |
| 1 | sara | 28 |
| 2 | arlet | 1 |
| 3 | victor | 31 |
| 4 | aina | 27 |


## Start a Spark session

In [8]:
from pyspark.sql import SparkSession

In [10]:
spark = SparkSession.builder.appName('tutorial').getOrCreate()

22/09/18 13:43:29 WARN Utils: Your hostname, RSOLE resolves to a loopback address: 127.0.1.1; using 172.17.20.158 instead (on interface eth0)
22/09/18 13:43:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/09/18 13:43:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [11]:
spark

## Read a dataset with Spark

In [13]:
df_pyspark = spark.read.csv('data/test.csv')
df_pyspark

DataFrame[_c0: string, _c1: string]

In [14]:
df_pyspark.show()

+------+---+
|   _c0|_c1|
+------+---+
|  name|age|
| roger| 34|
|  sara| 28|
| arlet|  1|
|victor| 31|
|  aina| 27|
+------+---+



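By default, Spark does not treat the first row as column names: it names the columns `_c0`, `_c1`, ... and reads the header row as data.

Set the `header` option to `true` to use the first row as column names.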

In [15]:
df_pyspark = spark.read.option('header', 'true').csv('data/test.csv')
df_pyspark

DataFrame[name: string, age: string]

In [16]:
df_pyspark.show()

+------+---+
|  name|age|
+------+---+
| roger| 34|
|  sara| 28|
| arlet|  1|
|victor| 31|
|  aina| 27|
+------+---+



In [17]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [20]:
df_pyspark.head(3)

[Row(name='roger', age='34'),
 Row(name='sara', age='28'),
 Row(name='arlet', age='1')]

In [21]:
df_pyspark.tail(3)

[Row(name='arlet', age='1'),
 Row(name='victor', age='31'),
 Row(name='aina', age='27')]

In [22]:
df_pyspark.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)



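`.printSchema()` is similar to `.info()` on a pandas DataFrame: it lists each column with its data type and whether it can hold nulls. Here both columns are reported as `string` because Spark does not infer column types from a CSV file unless asked to.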