Objective: Perform word count tasks and joins using Spark SQL within a Docker container

steve303/sparkSQL

Perform word count and join operations in Spark SQL within a Docker container

Steps:

  1. Create a container from the Docker image
    Note: the purpose of the Docker container is to avoid installing Apache Spark on our "host" machine. Alternatively, you can install Apache Spark directly on your machine, which takes a bit more work.
    First, open a terminal window and issue the commands below:
    git clone https://github.com/steve303/sparkSQL - This downloads the contents of this repo onto your host machine;
    cd sparkSQL/dockerImage_mp9 - This is where the "Dockerfile" resides, which contains the instructions to build the image;
    docker build -t mp9 . - This builds the image, named "mp9"; don't forget the . at the end of the command;
    docker run -it mp9 /bin/bash - This creates an instance, aka "container", of the image mp9; you should now be running in the environment of the container.
    You should now get a prompt that looks like this: root@373d63d17f59:/#. You are now working inside the Docker container via Bash (Bourne Again SHell).

  2. Load supporting docs into the container
    Now that we are working from the container, no longer from the host machine, we need to download the supporting docs. To do this, we will again clone this repo, this time into the container. At the root level of your container, issue the command: git clone https://github.com/steve303/sparkSQL - This creates a new folder, /sparkSQL, in your container. The files we are interested in are in the /sparkSQL/python/ and /sparkSQL/java/ folders.

  3. Load the dataset into the container
    Within the container we need to download the dataset. If you want to work in Python, change the current directory to /sparkSQL/python/; for Java, use /sparkSQL/java/. Then issue the commands below:
    wget http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-a.gz - This downloads the file;
    gzip -d googlebooks-eng-all-1gram-20120701-a.gz - This decompresses the .gz file;
    mv googlebooks-eng-all-1gram-20120701-a gbooks - This renames the file to "gbooks" (it's a big file, 1.8 GB).
    NOTE: Your programs should always assume the "gbooks" file is in the current directory, in this case either /sparkSQL/python/ or /sparkSQL/java/.

  4. Run the example Python Spark SQL scripts
    Within the /sparkSQL/python/ folder, issue the command: spark-submit <filename>, e.g., spark-submit MP3_partA.py. Note that even though we are running a .py file, we do not use the typical command "python3 filename.py", because we want Apache Spark to run it. Each file in this folder performs a different task demonstrating basic Spark SQL operations: creating a schema, defining a table/RDD, and performing SQL queries. A minimal sketch of such a script is shown below.
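    The sketch below is illustrative only; it is not one of the repo's scripts (those may use the older SparkContext/SQLContext API). The script name and the tab-separated four-column layout of the gbooks file (word, year, occurrences, books) are assumptions, not something this repo specifies.

    # word_count_sketch.py - hypothetical example; run with: spark-submit word_count_sketch.py
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("gbooks").getOrCreate()

    # Assumed layout of the 1-gram file: tab-separated columns
    # word, year, occurrences, books
    schema = StructType([
        StructField("word", StringType()),
        StructField("year", IntegerType()),
        StructField("occurrences", IntegerType()),
        StructField("books", IntegerType()),
    ])

    # "gbooks" is assumed to be in the current directory (see step 3)
    df = spark.read.csv("gbooks", sep="\t", schema=schema)

    # Register the DataFrame as a temp view so it can be queried with SQL
    df.createOrReplaceTempView("gbooks")

    # Word count: number of rows (word/year pairs) per word
    spark.sql("SELECT word, COUNT(*) AS n FROM gbooks "
              "GROUP BY word ORDER BY n DESC LIMIT 10").show()

    # Join: a self-join on year, purely to illustrate the syntax
    spark.sql("SELECT a.word, b.word FROM gbooks a "
              "JOIN gbooks b ON a.year = b.year LIMIT 5").show()

    spark.stop()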

  5. Run pyspark interactively
    To run interactively, type pyspark into the terminal. You can run the same commands as in the .py files (e.g., MP3_partA.py) with some minor deviations. Most importantly, you don't need to create the Spark session yourself; one is created for you and is named "spark" (with the underlying SparkContext available as "sc"). Use these names whenever the session or context instance is required, as in the example below.
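    For example, an interactive session might look like the following (again assuming the gbooks column layout sketched in step 4; the query word 'apple' is arbitrary):

    >>> df = spark.read.csv("gbooks", sep="\t", inferSchema=True) \
    ...           .toDF("word", "year", "occurrences", "books")
    >>> df.createOrReplaceTempView("gbooks")
    >>> spark.sql("SELECT year, SUM(occurrences) FROM gbooks "
    ...           "WHERE word = 'apple' GROUP BY year").show()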

  6. Exit and restart the container
    To exit the container, type ctrl-d. To restart the container, enter:
    docker start <container name> - This restarts the container;
    docker ps - This lists the running containers; find your container's ID;
    docker exec -it <container ID> /bin/bash - This brings up a Bash shell inside your container.

  7. Bring up another container instance of your image
    docker run -it <image name> /bin/bash

Credits

Some contents were sourced from UIUC MCS coursework

Supplemental Resources

video demonstration
problem statement
Note: these links require a password, which is the repo name.
