Jupyter PySpark Notebook

Introduction

  • Following this
  • Target: local Linux
  • Future: hopefully make it easier to run Jupyter or a similar notebook with PySpark on AWS/GCP
  • Non-target: Mac laptop

Requirements

  • Recent Linux server with four or more cores.
  • Podman, aliased to docker (see the note after this list)
  • A directory to save state into
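
If your system doesn't already provide the alias, a minimal sketch (assuming a bash shell) is to add alias docker=podman to your shell profile; some distributions instead ship a podman-docker package that provides a docker compatibility shim.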

Installation

  • Edit conf.sh to set $WORKBOOK (a minimal sketch follows this list)
  • mkdir $WORKBOOK
  • sudo ./install-podman.sh
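
A minimal conf.sh sketch; the path here is only an assumption, so point it at whatever directory you want the notebook state saved into:

WORKBOOK="$HOME/workbook"  # directory the container uses to persist notebooks and state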

Running

  • ./doit.sh
  • Visit the 127.0.0.1 link it prints out
  • Add these cells:
# Setup
from pyspark.sql import SparkSession

# local = run on this host only; [*] = use all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()
# add your imports here

from pyspark.sql.functions import *
# Read your data
df = spark.read.option("header", True).csv("10000.csv")
# Analyze your data
df.count()  # count() is a method; without the parentheses you only get a reference to it
# df.limit(10).toPandas()
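
To sanity-check the load, a few schema-agnostic follow-up cells can help; this is a minimal sketch using only standard PySpark DataFrame calls, so nothing here depends on the actual columns in 10000.csv:

# Inspect the inferred schema (all columns are strings unless inferSchema is enabled)
df.printSchema()

# Peek at the first rows without collecting the whole DataFrame to the driver
df.limit(10).show()

# Stop the local Spark session when you are finished
spark.stop()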
