# Recommended: install local PySpark on an OS other than Windows

## or use Docker

## Installing "Local" PySpark on Windows 10

With PySpark on Windows, you may not be able to save or export your dataframe to your local file system unless you successfully trick Windows into thinking Hadoop is installed (that is the purpose of step 8 below). You should at least be able to read a local text or CSV file. The background is that on Windows, having native Hadoop is required, NOT optional, whereas on other OSs it is optional. There is currently no good way to determine which winutils.exe version to obtain; see this GitHub [issue](https://github.com/cdarlint/winutils/issues/20) for details. Expect trial and error: you will have to try different hadoop/winutils.exe versions until you no longer get the error.

1. Install Java 1.8 from the Oracle Java [site](https://www.java.com/download/ie_manual.jsp). Add the folder that contains java.exe to your PATH environment variable.
2. Install Python: download the installer from python.org
3. Create pyspark_dev virtual environment: `python -m venv pyspark_dev`
4. Change directory into the `pyspark_dev` folder: `cd pyspark_dev`.  Then activate the "pyspark_dev" environment with `Scripts\activate.bat`
5. Update pip and then install necessary packages: `python -m pip install -U pip`, then `pip install wheel`, then `pip install pyspark ipykernel`
6. Install kernel: `python -m ipykernel install --user --name pyspark_dev --display-name "Python (pyspark_dev)"`
7. Set environment variables: `PYSPARK_PYTHON=[path_to_python.exe]` and `SPARK_HOME=[path_to_site_packages/pyspark folder]`
8. This step will take trial and error.  Go to https://github.com/cdarlint/winutils and download the entire repo.  Copy the contents of one version's `bin` folder into your local `hadoop/bin` folder and try writing to the local file system; repeat with other versions until the write succeeds without an error. See this GitHub [issue](https://github.com/cdarlint/winutils/issues/20) for background on how others resolved the problem.
9. Set `HADOOP_HOME` with `set HADOOP_HOME=[path_to_hadoop_folder]`, then append `%HADOOP_HOME%\bin` to PATH: `set PATH=%PATH%;%HADOOP_HOME%\bin`
10. De-activate your pyspark_dev virtual environment, then activate your Python virtual environment that has jupyterlab installed.
11. Confirm that pyspark_dev is installed as a kernel by running `jupyter kernelspec list`.  If you see it, launch jupyterlab via `jupyter lab`
12. Choose the PySpark kernel that you defined in Step 6 when opening a new jupyterlab notebook

See this article https://phoenixnap.com/kb/install-spark-on-windows-10
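
Before cycling through winutils versions in step 8, it can help to rule out plain configuration mistakes. The following is a minimal sketch, assuming `HADOOP_HOME` and `PATH` were set as in step 9. The helper name `check_hadoop_home` is hypothetical, and it only checks that the files and PATH entry exist; it cannot tell you whether the winutils.exe version is the right one.

```python
import os
from pathlib import Path


def check_hadoop_home(env=None):
    """Return a list of problems found with the HADOOP_HOME setup (empty if none)."""
    env = os.environ if env is None else env

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]

    problems = []
    bin_dir = Path(hadoop_home) / "bin"

    # winutils.exe comes from the bin folder copied in step 8.
    if not (bin_dir / "winutils.exe").is_file():
        problems.append(f"winutils.exe not found in {bin_dir}")

    # Step 9 appends the bin folder to PATH; normalise before comparing.
    def normalise(p):
        return os.path.normcase(os.path.normpath(p))

    path_entries = [normalise(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if normalise(str(bin_dir)) not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")

    return problems


# Check the current process environment and print anything suspicious.
for problem in check_hadoop_home():
    print(problem)
```

If this prints nothing and writes still fail, the remaining suspect is the winutils.exe version itself, which is where the trial and error in step 8 comes in.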

## Installing "Local" PySpark on Ubuntu Linux WSL via pip in a virtual environment, NOT as a system-level installation

1. Install Java 1.8: run `sudo apt-get update`, then `sudo apt-get install openjdk-8-jdk`. Set the JAVA_HOME environment variable, as an example: `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64`, or execute `whereis java` to find the location.  Do NOT set JAVA_HOME to the `bin` directory, just the root Java directory.
2. Create "pyspark_dev" virtual environment: `python3 -m venv pyspark_dev`
3. Activate the "pyspark_dev" environment with `source pyspark_dev/bin/activate`, then: `python -m pip install -U pip`, then `pip install wheel`, then `PYSPARK_HADOOP_VERSION=3 pip install pyspark pandas ipykernel`
4. Install kernel: `python -m ipykernel install --user --name pyspark_dev --display-name "Python (pyspark_dev)"`
5. Add 2 environment variables (SPARK_HOME and PYSPARK_PYTHON), as an example: `export SPARK_HOME=/home/pybokeh/envs/pyspark_dev/lib/python3.10/site-packages/pyspark` and `export PYSPARK_PYTHON=/home/pybokeh/envs/pyspark_dev/bin/python`
6. Append `SPARK_HOME/bin` and `SPARK_HOME/sbin` to your PATH, as an example: `export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin`
7. If you added the exports above to `~/.bashrc` or `~/.profile`, reload it: `source ~/.bashrc` or `source ~/.profile`
8. Issue the "pyspark" command to check if everything was installed correctly
9. De-activate your pyspark_dev virtual environment, then activate your Python virtual environment that has jupyterlab installed.
10. Confirm that pyspark_dev is installed as a kernel by running `jupyter kernelspec list`.  If you see it, launch jupyterlab via `jupyter lab`
11. Choose your PySpark kernel that you defined in Step 4 when opening a new jupyterlab notebook

**NOTE:** When starting your PySpark session, you will see warnings about `SPARK_LOCAL_IP`, the loopback address, or native-hadoop not being found.  You can safely ignore them for the purpose of running a local PySpark environment.  For learning purposes, we don't need to actually install Hadoop, and we did not install it in the steps above.
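
If the `pyspark` command in step 8 fails, the environment variables from steps 1 and 5 are a common place to look. Below is a minimal sketch that checks them; the helper name `check_spark_env` is hypothetical, and the file layout it looks for (`bin/java` under `JAVA_HOME`, `bin/pyspark` under `SPARK_HOME`) is an assumption based on a pip-installed PySpark and the OpenJDK package used above.

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Return a list of problems with JAVA_HOME, SPARK_HOME and PYSPARK_PYTHON."""
    env = os.environ if env is None else env
    problems = []

    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif Path(java_home).name == "bin":
        # Step 1: JAVA_HOME is the Java root directory, not its bin folder.
        problems.append("JAVA_HOME points at a bin directory; use its parent")
    elif not (Path(java_home) / "bin" / "java").is_file():
        problems.append(f"no bin/java under {java_home}")

    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not (Path(spark_home) / "bin" / "pyspark").is_file():
        # Assumption: the pip-installed pyspark package keeps its launcher in bin/.
        problems.append(f"no bin/pyspark under {spark_home}")

    pyspark_python = env.get("PYSPARK_PYTHON", "")
    if not pyspark_python:
        problems.append("PYSPARK_PYTHON is not set")
    elif not Path(pyspark_python).is_file():
        problems.append(f"PYSPARK_PYTHON does not exist: {pyspark_python}")

    return problems


# Check the current process environment and print anything suspicious.
for problem in check_spark_env():
    print(problem)
```

Run it from a shell where the exports are active; an empty result only means the paths exist, not that the Java and Spark versions are compatible.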

Official installation [instructions](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) from the Spark documentation
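
Both walkthroughs register a `pyspark_dev` kernel and later confirm it with `jupyter kernelspec list`, which prints each kernel name next to its directory. As a rough sketch, the `kernel.json` file in that directory can be read to confirm the kernel starts the Python interpreter from the pyspark_dev environment. The helper name `kernel_python` is hypothetical, and the directory you pass in should be whatever `jupyter kernelspec list` reports on your machine.

```python
import json
from pathlib import Path


def kernel_python(kernel_dir):
    """Return (display_name, interpreter) read from kernel_dir/kernel.json."""
    spec_path = Path(kernel_dir) / "kernel.json"
    spec = json.loads(spec_path.read_text(encoding="utf-8"))
    # argv[0] is the command Jupyter runs to start the kernel; for a kernel
    # installed with ipykernel this is the Python interpreter of that environment.
    return spec["display_name"], spec["argv"][0]
```

The returned interpreter path should match the value you set for `PYSPARK_PYTHON`; if it points at a different environment, re-run the kernel install step from inside the activated pyspark_dev environment.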

## To see if our local PySpark is working correctly, confirm we can read a CSV file and also save a dataframe as CSV

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("local_pyspark").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

In [2]:
df = spark.read.csv('data/cars.csv', header=True, sep=";")

In [3]:
df.show(5)

+--------------------+----+---------+------------+----------+------+------------+-----+------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0| 3504.|        12.0|   70|    US|
|   Buick Skylark 320|15.0|        8|       350.0|     165.0| 3693.|        11.5|   70|    US|
|  Plymouth Satellite|18.0|        8|       318.0|     150.0| 3436.|        11.0|   70|    US|
|       AMC Rebel SST|16.0|        8|       304.0|     150.0| 3433.|        12.0|   70|    US|
|         Ford Torino|17.0|        8|       302.0|     140.0| 3449.|        10.5|   70|    US|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
only showing top 5 rows



#### If you are on Windows and get an error when saving the dataframe to a CSV file, try installing local PySpark on Linux/macOS instead, or repeat step 8 of the Windows instructions with a different hadoop/winutils.exe version

In [4]:
df.coalesce(1).write.mode('overwrite').option("header", "true").csv("data/cars_single_partition.csv")
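
Note that the write call above produces a directory named `data/cars_single_partition.csv`, not a single file. By default Spark places the data in `part-*.csv` files inside it (one here, because of `coalesce(1)`) along with a `_SUCCESS` marker. The sketch below uses only the standard library to confirm the save worked; the helper names `find_part_files` and `write_succeeded` are hypothetical.

```python
from pathlib import Path


def find_part_files(output_dir):
    """Return the CSV part files inside a Spark output directory, sorted by name."""
    out = Path(output_dir)
    if not out.is_dir():
        raise FileNotFoundError(f"{out} is not a directory; did the write succeed?")
    return sorted(out.glob("part-*.csv"))


def write_succeeded(output_dir):
    """Return True if the _SUCCESS marker file exists in the output directory."""
    return (Path(output_dir) / "_SUCCESS").is_file()
```

For the example above, `find_part_files("data/cars_single_partition.csv")` should return a single path, and that file can be opened in any text editor or read back with `spark.read.csv`.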