Jupyter PySpark Notebook

Introduction

Following this
Target: local Linux
Future: hopefully an easier way to run Jupyter or a similar notebook with PySpark in AWS/GC
Non-target: Mac laptop

Requirements

Recent Linux server with four or more cores.
Podman, aliased to docker
A directory to save state into
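
The requirements above can be checked before installing. The following is a minimal sketch, not part of this repository: the function name check_requirements and its messages are hypothetical, and it assumes the state directory is passed in by the caller (here taken from a WORKBOOK environment variable, if one is set). A shell alias is not visible to other programs, so the sketch looks for a podman or docker executable on PATH rather than for the alias itself.

```python
import os
import shutil

MIN_CORES = 4  # "four or more cores" from the requirements list


def check_requirements(state_dir, min_cores=MIN_CORES):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # os.cpu_count() may return None when the count cannot be determined.
    cores = os.cpu_count() or 0
    if cores < min_cores:
        problems.append(f"need at least {min_cores} cores, found {cores}")

    # Shell aliases are invisible here, so accept either real executable.
    if shutil.which("podman") is None and shutil.which("docker") is None:
        problems.append("neither 'podman' nor 'docker' found on PATH")

    if not os.path.isdir(state_dir):
        problems.append(f"state directory does not exist: {state_dir}")
    elif not os.access(state_dir, os.W_OK):
        problems.append(f"state directory is not writable: {state_dir}")

    return problems


if __name__ == "__main__":
    # Assumption: WORKBOOK names the state directory; fall back to the current directory.
    for problem in check_requirements(os.environ.get("WORKBOOK", ".")):
        print("WARNING:", problem)
```

If the script prints nothing, the checks it knows about passed. It does not verify that the Linux distribution is recent.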

Installation

ed config.sh # add $WORKBOOK
mkdir $WORKBOOK
sudo ./install-podman.sh

Running

./doit.sh
Open the 127.0.0.1 link it prints out in a browser
Add these cells:

# Setup
from pyspark.sql import SparkSession

# local = this host only, *=use all cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

# add your imports here

from pyspark.sql.functions import *

# Read your data
df = spark.read.option("header", True).csv("10000.csv")

# Analyze your data
df.count()

# df.limit(10).toPandas()
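
The Read cell expects a CSV file with a header row. If no data is at hand yet, a placeholder file can be generated to try the cells out. This is a minimal sketch using only the Python standard library. The function name write_sample_csv and the column names id and value are hypothetical, and the contents are random filler, not the data this README was written against.

```python
import csv
import random


def write_sample_csv(path, rows=10_000, seed=0):
    """Write a header row plus `rows` data rows to `path`; return the number of data rows."""
    rng = random.Random(seed)  # fixed seed so repeated runs produce the same file
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])  # hypothetical column names
        for i in range(rows):
            writer.writerow([i, rng.randint(0, 99)])
    return rows
```

Running write_sample_csv("10000.csv") in a cell before the Read cell should leave a file that the header option parses into two columns, assuming the notebook's working directory is where Spark looks for the relative path.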

References

Installation: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/common.html
PySpark test: https://medium.com/@suci/running-pyspark-on-jupyter-notebook-with-docker-602b18ac4494 (The first cell is wrong; use the above Setup instead.)
PySpark CSV: https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
PySpark getting started: https://spark.apache.org/docs/3.1.1/api/python/getting_started/index.html

wa5znu/linux-jupyter-pyspark

Folders and files

Latest commit

History

Repository files navigation

Jupyter PySpark Notebook

Introduction

Requirements

Installation

Running

References

About

Resources

License

Stars

Watchers

Forks

Languages