<small><i>This notebook was created by Franck Iutzeler, Jerome Malick and Yann Vernaz (2016)</i></small>

<center><img src="UGA.png" width="30%" height="30%"></center>
<center><h3>Master of Science in Industrial and Applied Mathematics (MSIAM)</h3></center>
<hr>
<center><h1>Convex and distributed optimization</h1></center>
<center><h2>Hands-on Exercises - Outline</h2></center>

## Goals

- **Introduce the basics of convex and distributed optimization**
- **Introduce the basics of Apache Spark**

We will focus on the analysis of parallelism and distribution costs of algorithms.

__Important Note__

You are expected to use Python and Spark for this assignment. You cannot use any libraries beyond those already provided in Python. You may use only the built-in constructs of PySpark; MLlib and other Spark libraries are not allowed.

__Brief Introduction to Spark__

Spark is a distributed data processing engine that lets you write your data processing code in Scala, Python, or Java. The data is loaded as a Resilient Distributed Dataset (RDD) from either the local filesystem or HDFS. An RDD can be converted into another RDD using transformations such as `map`, `filter`, `reduceByKey`, etc. The evaluation of RDDs is lazy, i.e., a result is not computed until you explicitly invoke an action indicating that you need it. This allows Spark to optimize the execution of the transformations scheduled on RDDs.
Another useful feature of Spark is in-memory processing. You can ask Spark to cache an RDD in memory if you intend to reuse it across multiple iterations of your data processing job. The full set of transformations (which convert one RDD into another) and actions (which force the computation of a result) can be found in the Spark programming guide. The programming guide is also a good introduction to Spark. A more detailed RDD API reference with examples can be found here. If you prefer a lecture, you can try the tutorial from Spark Summit 2016 available here.
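
To make the distinction between transformations, actions and caching concrete, here is a minimal pure-Python sketch. It is only an analogy: `ToyRDD` is a hypothetical class written for this illustration, it runs on a single machine, and it is not how Spark is implemented. In PySpark the corresponding calls would be `sc.parallelize(...)`, `rdd.map(...)`, `rdd.filter(...)`, `rdd.cache()` and `rdd.collect()`.

```python
class ToyRDD:
    """Toy single-machine stand-in for an RDD (hypothetical, for illustration only)."""

    def __init__(self, source):
        # `source` is a zero-argument callable returning a fresh iterator.
        self._source = source
        self._cache_requested = False
        self._cached = None

    # --- transformations: build a new ToyRDD, compute nothing yet ---
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._iterate()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        # Only records the request; data is stored at the first action.
        self._cache_requested = True
        return self

    # --- action: forces the evaluation of the whole chain ---
    def collect(self):
        return list(self._iterate())

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._cache_requested:
            self._cached = list(self._source())
            return iter(self._cached)
        return self._source()


calls = []  # records every evaluation of `square`


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(lambda: iter(range(5)))
squares = numbers.map(square).cache()         # transformation + cache request
evens = squares.filter(lambda v: v % 2 == 0)  # another transformation

calls_before_action = len(calls)  # still 0: no action has been invoked
first = evens.collect()           # action: triggers the computation
calls_after_first = len(calls)
second = squares.collect()        # served from the cache, no recomputation
calls_after_second = len(calls)

print(calls_before_action, first, calls_after_first, second, calls_after_second)
```

Until `collect()` is called, `square` is never evaluated; after the first action, the cached `squares` can be reused without recomputing the `map`. This is the behavior that makes caching worthwhile in iterative algorithms, where the same dataset is traversed at every iteration.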

## Outline

**Part I** Preliminaries

- Setup and Quick Start
- Learn PySpark (RDD, DataFrame, ...)
- Prepare Dataset (MovieLens and Wikipedia)
- Data Mining

**Part II** Recommender System - Big Matrix Factorization using Spark

- What is a recommender system?
- Non-negative matrix factorization (NMF)
- Stochastic Gradient Descent (SGD) for regularized NMF with $L_2$ penalty
    - program skeleton
    - stopping criterion
    - algorithm analysis (loss vs. time, prediction error vs. time, $\lambda$ vs. final RMSE)

**Part III** Text Categorization

- What is text categorization?
- Logistic regression
- Proximal algorithm for regularized logistic regression with $L_2$ and $L_1$ penalties
    - program skeleton
    - stopping criterion
    - algorithm analysis

## Setup

The three hands-on exercises are all presented as Jupyter notebooks. Below are instructions on how to set up the environment using [Docker](http://www.docker.com).

* Install Docker on your machine.

  You should check your installation by running (open a terminal):  
  
  ```
  docker run hello-world
  ```

* To get Jupyter Notebook with Spark 2.0, start a Docker image by running one of the following commands:

  ```
  docker run -p 8888:8888 -p 4040:4040 -v $(pwd)/notebook:/home/jovyan/work  jupyter/pyspark-notebook:latest
  ```
  
  ```
  docker run -p 8888:8888 -p 4040:4040 -v $(pwd)/notebook:/home/jovyan/work -e GRANT_SUDO=yes --user root ezamir/jupyter-spark-2.0:latest
  ```
  
    - Notes on the Docker options

    `-p 8888:8888` opens the port for the Jupyter Notebook UI.

    `-p 4040:4040` opens the port for the Spark Monitoring and Instrumentation UI.

    `-v $(pwd)/notebook:/home/jovyan/work` mounts the `notebook` directory under your current host directory onto the container's working directory, so your work is preserved even when the container is destroyed and recreated (e.g., during an upgrade).
    

* Open Jupyter by pointing your browser to the notebook link (http://localhost:8888/)

* Upload the hands-on exercises notebooks - Enjoy!

## Checking your installation

You can run the following code to check the versions of the packages on your system. In a Jupyter notebook, press `shift` and `return` together to execute the contents of a cell.

```python
import sys  # determine Python version number
print('Python version ' + sys.version)

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)

import pandas
print('pandas:', pandas.__version__)

import IPython
print("Ipython version:" + str(IPython.__version__))

# Start a local Spark context using all available cores
import pyspark
sc = pyspark.SparkContext("local[*]")
print("pyspark version:" + str(sc.version))
```

```
Python version 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
numpy: 1.10.4
scipy: 0.17.1
matplotlib: 1.5.1
scikit-learn: 0.17.1
pandas: 0.19.0
Ipython version:5.1.0
```

## Useful Resources

- **[Jupyter Notebook](http://jupyter.org)** - Open source, interactive data science and scientific computing.
- **[Apache Spark](http://spark.apache.org)** - Fast and general engine for large-scale data processing. 
- **[Docker](http://www.docker.com)** - Open-source project that automates the deployment of applications inside software containers.