## General Tutorial / Subject Structure
#### Projects
- **Assignment 1 (20%)**: Released in Week 1 or 2 and due two weeks after the release date. Involves an initial visualisation and analysis.
- **Assignment 2 (20%)**: Released immediately after the first assignment and due two weeks after the release date. Builds upon the first assignment and requires an in-depth analysis, possibly with some feature engineering or prediction.

- **Group Assignment (50%)**: Released immediately after the second assignment, perhaps around Week 6 or 7. This assignment requires groups to build sophisticated predictive models with justification. The project is split into several smaller components which require work every week.

- **Individual Self-Reflection Report (10%)**: A reflection on your individual contribution relative to the other group members, and on what you learned in the subject.

#### General Admin stuff...
- Your Tutor Coordinator is Ruining Dong. Reach out to her for questions not related to the tutorial.
- To start off, we know that COMP30027 and MAST30025 are prerequisites. All material covered in those subjects is assumed knowledge for this subject.
- I genuinely believe that this subject should be more *applied* rather than theoretical. You will cover GLMs in detail in MAST30027, and you should have covered the base ML models in COMP30027.
- Rather, I'm looking to introduce packages and techniques that will complement your work for this subject.

#### (My) Tutorial Structure (Wed 5\:15pm-7\:15pm):
- Labs are predominantly taught with Python
- First hour or so will be based on Lab material / general programming how-to's and walkthroughs, with *edits* made to make it as Pythonic as possible. If the majority of students are keen to learn advanced `pandas` and data parsing tips, let me know and I will arrange to do so.
- Second hour will generally be me introducing industry tools (cloud computing, industry packages, techniques of data parsing, etc) or consultation-like hours where you can ask any question and I will attempt to answer it within my limits. This is also your chance to ask questions about project approaches and borrow my small brain for a certain amount of time.

- Generally, feel free to attend either half (or the full 2 hours) to suit your interests. You're all classified as *experienced uni veterans* so do what works for you. 


**Personal advice: You can put as little or as much effort into this subject as you like. Doing the bare minimum is easy, but you will gain no new knowledge from this capstone subject, and I assume the majority of us are studying to get a job. So, I suggest you experiment with new tools, read up on the latest technologies and learn to manage your time well.**

### Lab 1 Overview
#### First Half
GitLab:
- https://gitlab.unimelb.edu.au/
- Creating a test repo with credentials, GitHub Desktop interface

UniMelb's VM Server:
- http://mast30034.science.unimelb.edu.au/
- Log in to verify access

#### Second Half
- Review of ETL processes in Python
- Plotting maps with cool Python libraries
- *Wait, there's something better than a `CSV`?*

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [2]:
# read in the data (note that this is a subsample)
df = pd.read_csv("../Data/Lab1/sample.csv")

df.tail()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,PickupCell,DropoffCell
99995,2,4/12/15 22:55,4/12/15 23:03,1,0.75,-73.99437,40.746239,1,N,-73.980774,...,2,6.5,0.5,0.5,0.0,0.0,0.3,7.8,25:69,27:68
99996,1,4/12/15 22:55,4/12/15 23:08,1,2.4,-73.968346,40.759735,1,N,-73.969879,...,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3,27:64,24:60
99997,1,4/12/15 22:55,4/12/15 23:01,1,0.8,-73.993484,40.742168,1,N,-73.98439,...,1,6.0,0.5,0.5,1.45,0.0,0.3,8.75,25:69,26:67
99998,2,4/12/15 22:55,4/12/15 23:17,1,4.73,-73.984993,40.747929,1,N,-73.981552,...,1,18.5,0.5,0.5,3.96,0.0,0.3,23.76,26:68,33:76
99999,2,4/12/15 22:55,4/12/15 22:59,2,0.8,-73.975731,40.751968,1,N,-73.981247,...,1,4.5,0.5,0.5,1.16,0.0,0.3,6.96,27:66,27:68


In [3]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RatecodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'PickupCell', 'DropoffCell'],
      dtype='object')

### Base Visualisation
- Getting spatial maps in Python

In [4]:
COORDS = ['pickup_latitude', 'pickup_longitude']

df[COORDS].describe()

Unnamed: 0,pickup_latitude,pickup_longitude
count,100000.0,100000.0
mean,40.143712,-72.875803
std,4.930469,8.950561
min,0.0,-77.047104
25%,40.734241,-73.992464
50%,40.75116,-73.982414
75%,40.765747,-73.969246
max,42.736137,0.0


In [5]:
# drop nans
df.dropna(inplace=True)
df[COORDS].describe()

Unnamed: 0,pickup_latitude,pickup_longitude
count,98195.0,98195.0
mean,40.749306,-73.975085
std,0.026843,0.037463
min,40.583759,-74.084488
25%,40.735245,-73.992615
50%,40.751526,-73.982658
75%,40.766073,-73.970203
max,40.87952,-73.674927
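
`dropna` only removes rows with missing values. If you would rather filter on the coordinates themselves (for example, to catch zeroed-out GPS readings like the `0.0` min/max seen earlier), a bounding-box mask is one option. This is a minimal sketch: the bounds below are rough, assumed values for the NYC area and the helper name `filter_to_bbox` is hypothetical, not part of the lab.

```python
import pandas as pd

# Assumed, approximate bounds for the NYC area (not taken from the lab data)
LAT_MIN, LAT_MAX = 40.4, 41.0
LON_MIN, LON_MAX = -74.3, -73.6


def filter_to_bbox(frame, lat_col="pickup_latitude", lon_col="pickup_longitude"):
    """Keep only rows whose coordinates fall inside the assumed bounding box."""
    mask = (
        frame[lat_col].between(LAT_MIN, LAT_MAX)
        & frame[lon_col].between(LON_MIN, LON_MAX)
    )
    return frame.loc[mask].copy()


# toy rows: one plausible pickup and one with zeroed-out coordinates
toy = pd.DataFrame({
    "pickup_latitude": [40.746239, 0.0],
    "pickup_longitude": [-73.99437, 0.0],
})
clean = filter_to_bbox(toy)
print(len(toy), "->", len(clean))
```

An explicit mask like this makes the cleaning rule visible in the code, which is easier to justify in a report than relying on a side effect of another step.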


In [6]:
df[COORDS].describe().loc[['min','max']]

Unnamed: 0,pickup_latitude,pickup_longitude
min,40.583759,-74.084488
max,40.87952,-73.674927


In [7]:
# Bbox = Boundary box
Bbox = df[COORDS].describe().loc[['min','max']].values

# middle coords for map in lat, long (same order as COORDS)
Mcoords = df[COORDS].describe().loc[["50%"]].values[0]

yRange, xRange = sorted(i[0] for i in Bbox), sorted(i[1] for i in Bbox)

In [8]:
df[COORDS].describe().loc[['min','max']].values

array([[ 40.58375931, -74.08448792],
       [ 40.87952042, -73.67492676]])

In [9]:
Mcoords

array([ 40.75152588, -73.98265839])
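
The same bounding box and centre point can be pulled out without going through `describe()`. The sketch below uses `agg` and `median` on a small made-up frame; the toy values and the snake_case variable names are assumptions for illustration only.

```python
import pandas as pd

COORDS = ["pickup_latitude", "pickup_longitude"]

# made-up coordinates standing in for the lab data
toy = pd.DataFrame({
    "pickup_latitude": [40.58, 40.75, 40.88],
    "pickup_longitude": [-74.08, -73.98, -73.67],
})

# one row per statistic, one column per coordinate
bounds = toy[COORDS].agg(["min", "max"])

y_range = bounds["pickup_latitude"].tolist()    # [min lat, max lat]
x_range = bounds["pickup_longitude"].tolist()   # [min lon, max lon]
centre = toy[COORDS].median().tolist()          # [lat, lon], same order as COORDS

print(y_range, x_range, centre)
```

Selecting by column name avoids relying on positional indices such as `i[0]` and `i[1]`, so the latitude/longitude order is harder to mix up.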

In [9]:
!pip install folium

Collecting folium
  Using cached folium-0.11.0-py2.py3-none-any.whl (93 kB)
Collecting branca>=0.3.0
  Using cached branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


In [10]:
# base folium map
import folium

nycM = folium.Map(location=Mcoords, tiles="Stamen Terrain", zoom_start=10)

# save plot
nycM.save('../plots/foliumBaseMap.html')
nycM

In [11]:
from bokeh.plotting import figure, show

# tile providers are the underlying map 
from bokeh.tile_providers import get_provider, Vendors

# to display bokeh plots inside jupyter, we need to use output_notebook
from bokeh.io import reset_output, output_notebook
reset_output()
output_notebook()
# note below that it says "BokehJS 1.4.0 successfully loaded."

Bokeh's tile maps require axes in Web Mercator coordinates and don't accept raw latitude/longitude
- https://en.wikipedia.org/wiki/Web_Mercator_projection

In [12]:
def lat2mercer(coords):
    """
    Function which converts latitude to its Web Mercator coordinate representation
    """
    k = 6378137
    converted = list()
    for lat in coords:
        converted.append(np.log(np.tan((90 + lat) * np.pi/360.0)) * k)
    return converted

def lon2mercer(coords):
    """
    Function which converts longitude to its Web Mercator coordinate representation
    """
    k = 6378137
    converted = list()
    for lon in coords:
        converted.append(lon * (k * np.pi/180.0))
    return converted
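
The two converters above implement the spherical Web Mercator formulas x = R·λ and y = R·ln(tan(π/4 + φ/2)), with R = 6378137 m and angles in radians. As a sketch, the same arithmetic can be written with NumPy so it works on whole arrays (or `pandas` columns) at once rather than looping. The function names below are hypothetical and are not used elsewhere in this lab.

```python
import numpy as np

EARTH_RADIUS_M = 6378137  # same constant as k in the functions above


def lat_to_web_mercator(lat_deg):
    """Vectorised latitude (degrees) -> Web Mercator y (metres)."""
    lat = np.asarray(lat_deg, dtype=float)
    return np.log(np.tan((90.0 + lat) * np.pi / 360.0)) * EARTH_RADIUS_M


def lon_to_web_mercator(lon_deg):
    """Vectorised longitude (degrees) -> Web Mercator x (metres)."""
    lon = np.asarray(lon_deg, dtype=float)
    return lon * (EARTH_RADIUS_M * np.pi / 180.0)


xs = lon_to_web_mercator([-73.979942, 180.0])
ys = lat_to_web_mercator([0.0, 40.765381])
print(xs, ys)
```

With these, a column conversion could be written as `lon_to_web_mercator(df['pickup_longitude'])` instead of an `apply` with a lambda, which is usually noticeably faster on large frames.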

Let's view all the possible tiles

In [13]:
# for each map type in list of Vendors
for mapType in Vendors:
    # create plot with coords
    p = figure(x_range=lon2mercer(xRange), y_range=lat2mercer(yRange),
           x_axis_type="mercator", y_axis_type="mercator")
    # add underlying tile from provider
    p.add_tile(get_provider(mapType))
    p.title.text = mapType
    
    # display
    show(p)

My two cents... stick with `STAMEN_TERRAIN_RETINA`

If you would like to use Bokeh and GMAPS:
- https://docs.bokeh.org/en/latest/docs/user_guide/geo.html

In [14]:
TILE = get_provider("STAMEN_TERRAIN_RETINA")

pPickup = figure(x_range=lon2mercer(xRange), y_range=lat2mercer(yRange),
       x_axis_type="mercator", y_axis_type="mercator")
pPickup.add_tile(TILE)
pPickup.title.text = "Pickups in NYC"

# convert to Web Mercator
df['pickupX'] = df['pickup_longitude'].apply(lambda x: lon2mercer([x])[0])
df['pickupY'] = df['pickup_latitude'].apply(lambda x: lat2mercer([x])[0])

# plot circles (source = data source)
pPickup.circle(x='pickupX', y='pickupY', 
         size=5, fill_color="blue", fill_alpha=0.5, 
         source=df[['pickupX','pickupY']])

show(pPickup)

pDropoff = figure(x_range=lon2mercer(xRange), y_range=lat2mercer(yRange),
       x_axis_type="mercator", y_axis_type="mercator")
pDropoff.add_tile(TILE)
pDropoff.title.text = "Dropoff in NYC"

# convert to Web Mercator
df['dropoffX'] = df['dropoff_longitude'].apply(lambda x: lon2mercer([x])[0])
df['dropoffY'] = df['dropoff_latitude'].apply(lambda x: lat2mercer([x])[0])

# plot circles (source = data source)
pDropoff.circle(x='dropoffX', y='dropoffY', 
         size=5, color="pink", fill_color="red", fill_alpha=0.5, 
         source=df[['dropoffX','dropoffY']])

show(pDropoff)

Simple inferences:
- More pickups in central Manhattan with dropoffs in the suburbs
- Pickup locations are easily divided into "hubs" (Manhattan, Airport, etc)
- Dropoffs seem a bit more random
- Perhaps we should take a look at time, day of week, weather, events, etc
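
As a starting point for the time-based angle, the pickup timestamps can be parsed and split into hour and day of week. This is a sketch on two timestamps copied from the tables above. It assumes the strings are in day/month/year order, which the sample does not confirm, so check against the data dictionary before relying on it.

```python
import pandas as pd

# two pickup timestamps copied from the sample output
toy = pd.DataFrame({"tpep_pickup_datetime": ["1/12/15 0:00", "4/12/15 22:55"]})

# ASSUMPTION: day/month/year ordering; swap %d and %m if the data is month-first
pickup = pd.to_datetime(toy["tpep_pickup_datetime"], format="%d/%m/%y %H:%M")

toy["pickup_hour"] = pickup.dt.hour
toy["pickup_day"] = pickup.dt.day_name()
print(toy)
```

Passing an explicit `format` avoids pandas guessing the day/month order, which matters for ambiguous strings like these.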

In [15]:
from bokeh.io import save
save(pPickup, '../plots/bokehPickupPoints.html')
save(pDropoff, '../plots/bokehDropoffPoints.html')

'C:\\Users\\HP\\Documents\\GitHub\\Applied-Data-Science\\plots\\bokehDropoffPoints.html'

### Some other libraries to consider
Pickle: 
- Lightweight and super fast serialization for data
- Python native and compatible with several data formats
- High space, Low time

Feather:
- Lightweight and super fast serialization for data
- Python **and** R native, though not compatible with all data formats
- Low space, Low time
- If your pandas doesn't have it installed, use `pip install feather-format`

Plotly:
- Python version of `plotly R` / ggplot
- Super nice visualizations

Geopandas:
- `pandas` variation built for geospatial data
- Creates `GeoDataFrame` which are useful for shapefiles and geo-objects

In [16]:
pip install -U pyarrow

Collecting pyarrow
  Downloading pyarrow-1.0.1-cp38-cp38-win_amd64.whl (10.5 MB)
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 1.0.0
    Uninstalling pyarrow-1.0.0:
      Successfully uninstalled pyarrow-1.0.0
ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'C:\\Users\\HP\\anaconda\\Lib\\site-packages\\~yarrow\\lib.cp38-win_amd64.pyd'
Consider using the `--user` option or check the permissions.
Note: you may need to restart the kernel to use updated packages.


In [19]:
# feather does not support serializing a non-default index, so we need to reset it first.
# reset_index() moves the current index into a regular column named 'index' and gives the
# dataframe a fresh default index (0, 1, 2, ...), which feather can write
df.reset_index().to_feather('../Data/Lab1/sample.feather')
df.to_pickle('../Data/Lab1/sample.pkl')

In [20]:
%%time

# read in the feather format, also drop the index label we generated above
df_feather = pd.read_feather('../Data/Lab1/sample.feather').drop('index', axis=1)

Wall time: 60.9 ms


In [21]:
%%time

df_pickle = pd.read_pickle('../Data/Lab1/sample.pkl')

Wall time: 59.9 ms


In [22]:
%%time

df_csv = pd.read_csv('../Data/Lab1/sample.csv')

Wall time: 246 ms
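
A single `%%time` run can be noisy, so the figures above are only indicative. The sketch below repeats each read a few times and also reports file size, using a synthetic frame in a temporary directory. It compares CSV and pickle only, so it runs without `pyarrow`; the helper `time_read` and the toy data are assumptions for illustration, and results will differ by machine.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_read(reader, path, repeats=5):
    """Return the best wall-clock time (seconds) over several reads."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        timings.append(time.perf_counter() - start)
    return min(timings)


# synthetic numeric data standing in for the taxi sample
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10_000, 5)), columns=list("abcde"))

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl")
    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)

    # (best read time in seconds, file size in bytes) per format
    results = {
        "csv": (time_read(pd.read_csv, csv_path), os.path.getsize(csv_path)),
        "pickle": (time_read(pd.read_pickle, pkl_path), os.path.getsize(pkl_path)),
    }
    roundtrip = pd.read_pickle(pkl_path)

for name, (seconds, size) in results.items():
    print(f"{name}: {seconds * 1000:.1f} ms, {size / 1024:.0f} KiB")
```

To include feather in the comparison, the same pattern should work with `to_feather` and `pd.read_feather` once `pyarrow` is installed.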


In [23]:
df_feather.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,PickupCell,DropoffCell,pickupX,pickupY,dropoffX,dropoffY
0,2,1/12/15 0:00,1/12/15 0:05,5,0.96,-73.979942,40.765381,1,N,-73.966309,...,1.0,0.0,0.3,7.8,25:64,27:63,-8235410.0,4977797.0,-8233892.0,4977460.0
1,2,1/12/15 0:00,1/12/15 0:00,2,2.69,-73.972336,40.762379,1,N,-73.993629,...,3.34,0.0,0.3,25.64,26:64,25:69,-8234563.0,4977356.0,-8236933.0,4974948.0
2,2,1/12/15 0:00,1/12/15 0:00,1,2.62,-73.968849,40.76453,1,N,-73.974548,...,3.56,0.0,0.3,21.36,26:63,22:59,-8234175.0,4977672.0,-8234809.0,4981657.0
3,1,1/12/15 0:00,1/12/15 0:05,1,1.2,-73.993935,40.741684,1,N,-73.997665,...,0.2,0.0,0.3,8.0,25:70,24:69,-8236967.0,4974314.0,-8237382.0,4975164.0
4,1,1/12/15 0:00,1/12/15 0:09,2,3.0,-73.988922,40.72699,1,N,-73.975594,...,0.0,0.0,0.3,12.3,28:71,33:75,-8236409.0,4972156.0,-8234925.0,4967732.0


In [18]:
df_pickle.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,PickupCell,DropoffCell,pickupX,pickupY,dropoffX,dropoffY
0,2,1/12/15 0:00,1/12/15 0:05,5,0.96,-73.979942,40.765381,1,N,-73.966309,...,1.0,0.0,0.3,7.8,25:64,27:63,-8235410.0,4977797.0,-8233892.0,4977460.0
1,2,1/12/15 0:00,1/12/15 0:00,2,2.69,-73.972336,40.762379,1,N,-73.993629,...,3.34,0.0,0.3,25.64,26:64,25:69,-8234563.0,4977356.0,-8236933.0,4974948.0
2,2,1/12/15 0:00,1/12/15 0:00,1,2.62,-73.968849,40.76453,1,N,-73.974548,...,3.56,0.0,0.3,21.36,26:63,22:59,-8234175.0,4977672.0,-8234809.0,4981657.0
3,1,1/12/15 0:00,1/12/15 0:05,1,1.2,-73.993935,40.741684,1,N,-73.997665,...,0.2,0.0,0.3,8.0,25:70,24:69,-8236967.0,4974314.0,-8237382.0,4975164.0
4,1,1/12/15 0:00,1/12/15 0:09,2,3.0,-73.988922,40.72699,1,N,-73.975594,...,0.0,0.0,0.3,12.3,28:71,33:75,-8236409.0,4972156.0,-8234925.0,4967732.0


In [19]:
df_csv.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,PickupCell,DropoffCell
0,2,1/12/15 0:00,1/12/15 0:05,5,0.96,-73.979942,40.765381,1,N,-73.966309,...,1,5.5,0.5,0.5,1.0,0.0,0.3,7.8,25:64,27:63
1,2,1/12/15 0:00,1/12/15 0:00,2,2.69,-73.972336,40.762379,1,N,-73.993629,...,1,21.5,0.0,0.5,3.34,0.0,0.3,25.64,26:64,25:69
2,2,1/12/15 0:00,1/12/15 0:00,1,2.62,-73.968849,40.76453,1,N,-73.974548,...,1,17.0,0.0,0.5,3.56,0.0,0.3,21.36,26:63,22:59
3,1,1/12/15 0:00,1/12/15 0:05,1,1.2,-73.993935,40.741684,1,N,-73.997665,...,1,6.5,0.5,0.5,0.2,0.0,0.3,8.0,25:70,24:69
4,1,1/12/15 0:00,1/12/15 0:09,2,3.0,-73.988922,40.72699,1,N,-73.975594,...,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3,28:71,33:75


### Jupyter Notebook installation for R (Local)
Install the R kernel for Jupyter Notebook:
- `conda install -c r r-irkernel -y`

Install `Rtools40` for Windows:
- https://cran.r-project.org/bin/windows/Rtools/

Now you can create notebooks with R kernels (version 3.6.1). 

**Note that this is generally buggy and not recommended.**

### Jupyter Notebook for Python (default - Local)
Install Anaconda:
- https://www.anaconda.com/products/individual
- Highly recommend you do a fresh installation for this subject
- If you need a step by step guide, refer to https://github.com/akiratwang/Sept2017_PandasWorkshop/blob/master/Python_installation.md

Install Jupyter Notebook extensions (via cmd):
- `conda install -c conda-forge jupyter_contrib_nbextensions`

Install visualisation libraries:
- `conda install -c conda-forge folium`
- `conda install bokeh`

### WSL Environment for Windows 10
Refer to this guide to get a native Linux terminal:
- https://github.com/akiratwang/COMP20003-Setting-Up
- Ignore all the `C` related parts, just get Ubuntu installed

### General Tips for Jupyter Notebook
Cell shortcuts:
- `shift + enter` : Run current cell and move to the next one
- `ctrl + enter` : Run selected cells in place

Command mode (press `esc` to enter):
- `m` : Makes the cell markdown
- `y` : Makes the cell into code
- `a` : Insert cell above
- `b` : Insert cell below
- double `d` : Delete current cell

Code Shortcuts:
- `shift + tab` : Shows the function signature and docstring as a tooltip

### Using LaTeX
- https://www.overleaf.com offers free hosting for real-time collaborative editing in LaTeX
- Much nicer, formatted and cleaner document compared to a google doc / word doc
- Should have already been set as the standard for maths subjects for assignments