# Pre-Requisite Notebook
## Welcome
Welcome to Applied Data Science for 2023 Semester 2! 
- We will be using Python, PySpark and Git for the majority of this subject. These tools are staples for big data processing in the workplace.
- This notebook will get you up to speed setting up the environment you will need for this course.
- *You must complete this notebook before the first workshop of the course.*
- Note that **tutorial attendance will be marked after the first 3 tutorials for the industry project component.**

----

## `git` (GitHub) Summary
_Whilst this should have been covered in prerequisite subjects, a refresher may be in order_. For this subject, the minimal requirement for `git` is:
1. `clone`: copy an existing repo from remote (repository) into your local destination.
2. Publishing new changes:
  - `add` + `commit`: create a new snapshot of the local repository and commit the changes.
  - `push`: upload your local commits to remote.
3. Syncing unseen changes:
  - `fetch`: download unseen commits from remote to local.
  - `merge`: merge the commits from remote with changes in local.
    - If local has no new changes (or `is up to date`), the merge does not create a new snapshot.
    - Otherwise, changes will be merged automatically if there is no *conflict*; if there is one, you need to resolve it. You will need to `commit` the merge result once this process finishes.
      - Question: After `merge`, are local and remote now synced? Why or why not?
  - `pull`: Shorthand for chaining `fetch` and `merge`
  
Graphical illustration:
![gitoverview](../media/git-process.png)

4. Authentication:
  - For GitHub, a [Personal Access Token (PAT)](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) is required as a security measure.

### GitHub Desktop
GitHub Desktop hides much of the process under the hood. It is good for those who are not familiar with `git` and, honestly, we use it for industry work because it's easy.

**Cloning:** 
1. Download [GitHub Desktop](https://desktop.github.com/)
2. Login with your credentials
3. On the top-left menu, click on `Add` -> `Clone repository...`
4. Enter https://github.com/liamhodg/MAST30034_Python as the URL
5. Click on `Clone`.
6. Done!

**Publishing:**
1. Add changes.
2. Click on the `MAST30034_Python` repo.
3. Add a summary (e.g. `"removed incorrect transformation for xyz"`)
4. Commit to `main` (or your specified `branch` if you know what it is)
5. Push, and you are done.

**Syncing:**
1. Click on the `MAST30034_Python` repo.
2. Click `Fetch origin` (refresh icon)
3. Pull, and you are done.

### `git` CLI (Command Line Interface)
If you are using the `git` CLI, you will need a PAT:
1. Visit https://github.com/settings/tokens 
2. Generate a token (set it to expire at the end of this semester).
3. Add changes and commit as usual.
4. Now, after entering your `username`, you will be prompted for a `password`. Rather than using your GitHub password, you should use your generated PAT here.
5. Done!

**Cloning:** 
1. Open a terminal (command-line `git` must be installed for this to work).
2. `git clone HTTPS` (where HTTPS is the HTTPS URL of your GitHub repo).
3. Enter your credentials (with PAT).
4. Done.

**Publishing:** 
1. Change directories to inside your repository (`cd NAME_OF_REPO_FOLDER`).
2. `git add -A` (this stages all changed and untracked files for the next commit, except ignored files). You can use `git status` to review changed files before adding.
3. `git commit -m "message"` (make a commit with a message).
4. `git push`
5. Enter your credentials.
    - Here, use the same username
    - BUT, instead of your password, use the PAT you generated.
6. Done.

**Syncing:** 
1. Change directories to inside your repository (`cd NAME_OF_REPO_FOLDER`).
2. `git pull`
3. Done.


---

## General Tips for Jupyter Notebook
Cell shortcuts:
- `shift + enter` : Run current cell (equivalent of pressing <button class='btn btn-default btn-xs'><i class="fa-play fa"></i><span class="toolbar-btn-label">Run</span></button>)
- `ctrl + enter` : Run selected cells

Command mode (press `esc` to enter):
- `m` : Makes the cell markdown
- `y` : Makes the cell into code
- `a` : Insert cell above
- `b` : Insert cell below
- double `d` : Delete current cell

Code Shortcuts:
- `shift + tab` : Brings up the function's arguments and documentation

Multiline Cursor:
- Hold down `ctrl` on Windows or `cmd` on Mac and click on the places you wish to edit all together.

---

## Python / Requests
This section explains how to make HTTP requests to download files with Python.

1. There are several libraries and packages available for Python when it comes to requesting data. For this tutorial, we'll use `urllib`.

In [1]:
from urllib.request import urlretrieve

2. We now want to set an output directory. You can manually create it OR you can also use Python to do so. We will be using the latter method for automation purposes. To do so, we will use the [`os` library](https://docs.python.org/3/library/os.html#os.mkdir).

Important (Paths): 
- Windows Users: https://www.computerhope.com/issues/ch001708.htm#windows
- MacOS/Linux/WSL Users: https://www.computerhope.com/issues/ch001708.htm#linux
- `..` is used to _go up_ a level (i.e. the back button).

We will make a new folder _outside_ this `tutorials/tute_1` folder inside the root `MAST30034` directory. To do so, we will use `../data` to "exit" the current directory or "go up" to the parent directory. Then, we will go into the `data` folder to create subdirectories.

If you cloned the repo, you should already have the `data/taxi_zones` directory.
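
To see how `..` resolves, here is a minimal sketch using the standard library. The working directory `/home/student/MAST30034/notebooks` is hypothetical and only there for illustration; your own folder layout may differ.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical working directory, for illustration only
current_dir = PurePosixPath("/home/student/MAST30034/notebooks")

# `..` steps up to the parent directory, then `data` steps down into it
relative = current_dir / ".." / "data"

# normpath collapses the `..` component into a plain absolute path
resolved = PurePosixPath(posixpath.normpath(str(relative)))

print(relative)  # /home/student/MAST30034/notebooks/../data
print(resolved)  # /home/student/MAST30034/data
```

Both paths point at the same folder; the second is simply the first with the `..` step already applied.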

In [2]:
import os

# `..` goes up one level from the current directory; `data` sits inside that parent directory
output_relative_dir = '../data/'

# check if it exists first, as `os.makedirs` will raise an error if the directory already exists
if not os.path.exists(output_relative_dir):
    os.makedirs(output_relative_dir)
    
# now, for each type of data set we will need, we will create the paths
for target_dir in ('tlc_data', 'tute_data'): # taxi_zones should already exist
    if not os.path.exists(output_relative_dir + target_dir):
        os.makedirs(output_relative_dir + target_dir)

3. Now, we will download the required datasets. For this tutorial, we will only use January-March, but you can adjust it to your requirements.

**Please only use the years where there are zones (post 2015).**

In [3]:
YEAR = '2024'
# adjust the range function to the numerical months i.e 1 = jan, 2 = feb, etc...
# MONTHS = range(1, 13)
MONTHS = range(1, 4)

In [4]:
# this is the URL template as of 07/2023
URL_TEMPLATE = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_"#year-month.parquet

In [5]:
# data output directory is `data/tlc_data/`
tlc_output_dir = output_relative_dir + 'tlc_data'

for month in MONTHS:
    # 0-fill i.e 1 -> 01, 2 -> 02, etc
    month = str(month).zfill(2) 
    print(f"Begin month {month}")
    
    # generate url
    url = f'{URL_TEMPLATE}{YEAR}-{month}.parquet'
    # generate output file path (location and filename)
    output_path = f"{tlc_output_dir}/{YEAR}-{month}.parquet"
    # download
    urlretrieve(url, output_path) 
    
    print(f"Completed month {month}")

Begin month 01
Completed month 01
Begin month 02
Completed month 02
Begin month 03
Completed month 03


4. The shapefile is inside the zip file from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page :
    - https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv
    - https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip
    
and now we are done!
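
If you would rather fetch and unpack the zone archive from Python than by hand, the following is a minimal sketch using only the standard library. The helper name `download_and_extract` and the paths in the usage note are assumptions for illustration, not part of the original notebook.

```python
import os
import zipfile
from urllib.request import urlretrieve


def download_and_extract(url, zip_path, extract_dir):
    """Download a zip archive from `url` and unpack it into `extract_dir`."""
    # make sure both destination folders exist
    os.makedirs(os.path.dirname(zip_path) or ".", exist_ok=True)
    os.makedirs(extract_dir, exist_ok=True)

    # download the archive, then extract every member
    urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
        # return the names of the extracted members
        return sorted(archive.namelist())
```

For example, `download_and_extract("https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip", "../data/taxi_zones.zip", "../data/taxi_zones")` would unpack the archive under `../data/taxi_zones`, assuming the URL listed above is still live.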

_________________

# Working with Larger Datasets with a Scalable Solution!
Consider the size of the datasets you have worked with at university: probably a few hundred megabytes, or a couple of gigabytes at most. Whilst `pandas` and `Excel` do have their use cases, they are not feasible for datasets of several gigabytes or more. In this subject, you will be working with larger datasets (though not quite big data).

For example:
1. 20k rows would be hard for Excel, but easy for `pandas`.
2. A few million records would be doable for `pandas` depending on RAM (let's say 16GB or 32GB to be generous).
3. Now, consider 100 million rows over several gigabytes. `pandas` **is not your solution**.

Why?

`pandas` works in-memory. That is, you are limited by RAM, which can be hard to come by for the average person. Even with 32GB or 64GB of memory, it is best to use Apache Spark, which is designed to work with large datasets.
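
A rough back-of-the-envelope calculation illustrates the RAM limit. The sketch below assumes every value occupies 8 bytes (a 64-bit number) and uses a hypothetical table with 19 columns; real memory use depends on the column types, and text columns usually cost considerably more.

```python
def estimate_memory_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate in-memory size of a fully numeric table, in GiB."""
    return n_rows * n_cols * bytes_per_value / 1024**3


# hypothetical table: 100 million rows, 19 numeric columns
print(f"{estimate_memory_gib(100_000_000, 19):.1f} GiB")  # 14.2 GiB
```

That figure is for holding the table alone. Operations such as sorting or joining often create temporary copies, so a 16GB machine would already struggle.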

![image.png](https://spark.apache.org/images/spark-logo-trademark.png)


**Disclaimer:**
- Windows 10 or 11 users are required to install `WSL` or `WSL2` for `pyspark`. This is something that you should take the time to learn how to use and install now for a future career in the tech industry. If you have yet to install it, please visit https://learn.microsoft.com/en-us/windows/wsl/install
- MacOS (Intel) and Linux work without extra steps. If you are using an Apple Silicon chip (M1, M2, or M3), you will need to follow some specific instructions.

In [None]:
import pandas as pd

df = pd.read_parquet('../data/tlc_data/2024-01.parquet')
df.tail()

You can then follow the tutorial using the alternative `pandas` syntax.

#### Before we begin...
*It may be a good idea to have a chatbot window open for this process: either [ChatGPT](https://chatgpt.com/), [Meta AI](https://www.meta.ai/), or [Gemini](https://gemini.google.com/app). If you encounter any errors, copy and paste the last few lines of the error into the chatbot to ask for assistance. Do not blindly follow its advice - read the response carefully to see if it can solve your problem. This deals with 90% of errors one might encounter. If you encounter other errors with these steps, please visit Liam Hodgkinson during consultation hours.*

**Steps:**

0. (Pre-Req) Windows 10 or 11 users, please install WSL2. MacOS users, please ensure your terminal is set to `bash`.
1. We **strongly recommend** a fresh environment for this subject as there can be package conflicts. If you are getting errors, please reinstall the environment *from scratch* before asking for help.
2. If you are using WSL, it is recommended you install Visual Studio Code and follow [these instructions](https://code.visualstudio.com/docs/remote/wsl-tutorial).
3. Install `Java` and `PySpark`.

**All devices other than MacOS (Linux / WSL):**

```bash
# update the apt package index
sudo apt update
# install java
sudo apt install openjdk-8-jdk -y
# set JAVA_HOME system-wide
echo 'JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"' | sudo tee -a /etc/environment
# apply to environment
source /etc/environment
# install spark
pip3 install pyspark pyarrow pandas
```
    
**MacOS:**
1. Install [Homebrew](https://brew.sh/). If your shell prompts to set `zsh` as default shell with `chsh -s /bin/zsh`, run that first!!  
2. Install and setup `Java` and `JAVA_HOME` (spark uses `Java` for backend, similar to how `Python` sits on top of `C`).
```bash
# For Intel CPU
# install java 8 and link to system java wrapper
brew install openjdk@8 
# For newer version of brew, try the command below if brew install doesn't work
#brew install --cask homebrew/cask-versions/adoptopenjdk8
sudo ln -sfn /usr/local/opt/openjdk@8/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-8.jdk
# set JAVA_HOME (older macOS versions default to bash while newer ones default to zsh)
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v1.8)"' | tee -a $HOME/.bashrc $HOME/.zshrc
```

If you are using MacOS on Apple Silicon (M1, M2, or M3 chip), follow [this guide](https://code2care.org/q/install-native-java-jdk-jre-on-apple-silicon-m1-mac) for Java JDK or [this guide](https://gist.github.com/brianspiering/1e690b593db025b5acee920fa7330366).

3. Install python packages/spark
```bash
# reload java path
source $HOME/.bashrc ; source $HOME/.zshrc
# install spark. Note: if you are using anaconda/conda environments, you need to make sure the pip3 is the correct pip3!
# Or you should install with conda directly!
#conda install pyspark pyarrow pandas
pip3 install pyspark pyarrow pandas==1.5.3
```

Run the code below to see if you have installed it. As long as it runs (despite red warnings) and there are no errors, you're ready to go!

Troubleshooting guides:
1. The module is still not found after I installed everything!
    - Run `which pip` and `which python` in your terminal and compare the results with the output of `import sys; sys.executable` run from your Jupyter notebook. They have to point to the same environment (why?).
    - If the paths differ, change the kernel of your Jupyter notebook to use that Python installation.
    
2. The Java instance stopped when executing the cell
    - Ensure Java is installed (the install commands execute without error).
    - Make sure `echo $JAVA_HOME` produces the proper location (i.e. it points at where your Java is installed).
    
3. `conda`, `pip`, `apt`, `brew`... not found!
    - Install the required software and make sure its install folder is present in the output of `echo $PATH`.
    
4. I am using Windows and I don't want to use WSL
    - NO.
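
The first three checks above can also be run from inside the notebook. This is a minimal sketch using only the standard library; the function name `environment_report` is hypothetical.

```python
import os
import shutil
import sys


def environment_report():
    """Collect the paths the troubleshooting steps ask you to compare."""
    return {
        # interpreter used by the current kernel
        "kernel_python": sys.executable,
        # first matches on PATH (None if nothing is found)
        "path_python": shutil.which("python3") or shutil.which("python"),
        "path_pip": shutil.which("pip3") or shutil.which("pip"),
        "path_java": shutil.which("java"),
        # None if the variable is not set
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
    }


for name, value in environment_report().items():
    print(f"{name:>14}: {value}")
```

If `kernel_python` and `path_python` sit in different environments, packages installed with `pip3` from the terminal may not be visible to the notebook. The notebook's `PATH` can differ from your terminal's, so treat this as a first check only.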

#### Now run the following code...

In [None]:
import traceback
RED = '\033[91m'
GREEN = '\033[92m'
BOLD = '\033[1m'
RESET = '\033[0m'

try:
    from pyspark.sql import SparkSession
    # Create a spark session (which will run spark jobs)
    spark = (
        SparkSession.builder.appName("MAST30034 Tutorial")
        .config("spark.sql.repl.eagerEval.enabled", True) 
        .config("spark.sql.parquet.cacheMetadata", "true")
        .config("spark.sql.session.timeZone", "Etc/UTC")
        .getOrCreate()
    )
    print(f"{GREEN}{BOLD}Success! Your environment is set up and you are ready for the first workshop.{RESET}")
except Exception:
    print(f"{RED}{BOLD}Something went wrong. Reinstall and try again.{RESET}")
    traceback.print_exc()