# **Spark stack**

![caption](https://spark.apache.org/images/spark-stack.png)

* Spark combines a stack of libraries, including:
    * SQL and DataFrames. Mix SQL queries with Spark programs. Spark DataFrames are built on top of RDDs.
    * MLlib. A machine learning library that provides ML algorithms and data structures built on top of RDDs.
    * GraphX. A distributed graph-processing library based on RDDs (with a limited number of functions).
    * Spark Streaming. Brings Spark to stream processing, letting the programmer write streaming jobs the same way as batch jobs.


# Spark SQL

* Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the Spark SQL interfaces give Spark more information about the structure of both the data and the computation being performed, which Spark uses to perform extra optimizations. With Spark SQL you can:
    - Seamlessly mix SQL queries with Spark programs.
    - Query structured data inside Spark programs using SQL.
    - Apply functions to the results of SQL queries.
```Python
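# Assumes a DataFrame was registered as a temporary view named "people",
# e.g. with df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")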
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
```

# Datasets and DataFrames

A **Dataset** is a distributed collection of data. It provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).

**Problem:** The Dataset API is available in Scala and Java, but it is not supported in Python ... why?

A **DataFrame** is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Example:

```Python
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
```

In Python it’s possible to access a DataFrame’s columns either by attribute (```df.age```) or by indexing (```df['age']```).

```Python
# Print the schema in a tree format
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# |   name|(age + 1)|
# +-------+---------+
# |Michael|     null|
# |   Andy|       31|
# | Justin|       20|
# +-------+---------+

# Select people older than 21
df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+

# Count people by age
df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# |  19|    1|
# |null|    1|
# |  30|    1|
# +----+-----+
```

# DataFrames Data Types

**DataType** is the abstract base class of all built-in data types in Spark SQL (e.g. strings, longs). These types can also be combined in your code to create complex Spark SQL types, such as arrays or maps.

**StructType** is a built-in data type in Spark SQL to represent a collection of StructFields.
```Python
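from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Build the schema by hand, then print it as a tree via an empty DataFrame
schemaTyped = StructType([StructField("a", IntegerType()), StructField("b", StringType())])
spark.createDataFrame([], schemaTyped).printSchema()
# root
#  |-- a: integer (nullable = true)
#  |-- b: string (nullable = true)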
```

A **StructField** describes a single field in a ```StructType```. It has a name, a data type, a flag indicating whether the field may be null (nullable), and optional metadata, which may include a comment.

# VectorAssembler

**VectorAssembler** is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.

_Examples_

Assume that we have a DataFrame with the columns id, hour, mobile, userFeatures, and clicked:
```
 id | hour | mobile | userFeatures     | clicked
----|------|--------|------------------|---------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0
```

userFeatures is a vector column that contains three user features. We want to combine hour, mobile, and userFeatures into a single feature vector called features and use it to predict clicked or not. If we set VectorAssembler’s input columns to hour, mobile, and userFeatures and output column to features, after transformation we should get the following DataFrame:
```
 id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]
```

In PySpark, this transformation can be written as:

```Python
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)
 ```
 
 

# Annex 1.


# ***Linear Regression***

* Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
    * Example: Height, Gender, Weight → Shoe Size
    
    
![caption](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/220px-Linear_regression.svg.png)

# **Linear Regression General Formulation**

For each observation we have a feature vector $x$ with components $x_i$, $i = 1, \dots, d$, and a label $y$:

$$x^T = [x_1, \dots, x_d]$$

and we assume a linear mapping between features and label: 
 
$$y \sim w_0 + w_1 x_1 + \dots + w_d x_d $$

We can augment the feature vector with a constant component $x_0 = 1$ to incorporate the offset $w_0$:

$$ x^T = [1, x_1, \dots, x_d] $$

We can then rewrite this linear mapping as a scalar product:

$$ y \sim \hat{y} = \sum_{i=0}^d w_i x_i = w^T x $$
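
The augmented scalar product can be checked numerically. The following sketch uses NumPy (an assumption, since the text does not prescribe a library) with made-up weights and features, and compares $w^T x$ against the explicit sum.

```python
import numpy as np

# Hypothetical model with d = 2 features: w = [w_0, w_1, w_2]
w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0])  # original features [x_1, x_2]

# Augment the feature vector with x_0 = 1 so the offset w_0 is absorbed
x_aug = np.concatenate(([1.0], x))

# Prediction as a scalar product w^T x
y_hat = w @ x_aug

# Same prediction written out term by term
y_hat_explicit = w[0] + w[1] * x[0] + w[2] * x[1]
print(y_hat, y_hat_explicit)
```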

# **1D Example**

__Goal__: find the line of best fit

$x$ coordinate: features

$y$ coordinate: labels

$y \sim \hat{y} = w_0 + w_1 x$

$w_0$ intercept

$w_1$ slope

    
![caption](./images/Linear_regression_1.png)
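
A line of best fit can be computed with a few lines of NumPy. This is only a sketch: the data points are invented, and `numpy.polyfit` (which minimizes the squared error, the loss discussed in the next section) is one convenient choice rather than the method used to produce the figure.

```python
import numpy as np

# Invented 1D data lying exactly on y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])  # features
y = np.array([3.0, 5.0, 7.0, 9.0])  # labels

# Degree-1 fit; polyfit returns coefficients from highest degree down,
# so the slope comes first and the intercept second
w1, w0 = np.polyfit(x, y, 1)

# Predictions of the fitted line
y_hat = w0 + w1 * x
print(f"intercept w0 = {w0:.2f}, slope w1 = {w1:.2f}")
```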

# **Evaluating Predictions**

* We can measure the ‘closeness’ between label and prediction
    * Example -> Shoe size: better to be off by one size than by 5 sizes

* What is an appropriate evaluation metric or ‘loss’ function?
    * Option 1: Absolute loss: $|y - \hat{y}|$
    * Option 2: Squared loss: $(y - \hat{y})^2$ -> Has nicer mathematical properties (e.g. it is differentiable everywhere)
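
To see how the two losses treat small and large errors, here is a small sketch based on the shoe-size example. The function names and the sizes are made up for illustration.

```python
def absolute_loss(y, y_hat):
    """Absolute loss |y - y_hat|."""
    return abs(y - y_hat)


def squared_loss(y, y_hat):
    """Squared loss (y - y_hat)^2."""
    return (y - y_hat) ** 2


y = 9  # true shoe size (hypothetical)
for y_hat in (8, 4):  # off by one size, then off by five sizes
    print(y_hat, absolute_loss(y, y_hat), squared_loss(y, y_hat))
```

Under absolute loss, being off by five sizes costs 5 times as much as being off by one. Under squared loss it costs 25 times as much, so large errors are penalized more heavily.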

# **How Can We Learn Model (w)?**

* Assume we have $n$ training points, where $x^i$ denotes the $i$-th point and $y^i$ its label

* Recall two earlier points:
    * Linear assumption: $\hat{y} = w^T x$ 
    * We use squared loss: $(y - \hat{y})^2$

* Idea: Find $w$ that minimizes squared loss over training points:

$$ \min_w \sum_{i=1}^n (w^T x^i - y^i)^2 $$

Given $n$ training points with $d$ features, we define:

* $X \in \mathbb{R}^{n \times d}$ : matrix storing points
* $y \in \mathbb{R}^n $ : real-valued labels
* $\hat{y} \in \mathbb{R}^n $ : predicted labels, where $ \hat{y} = X w$
* $w \in \mathbb{R}^d $ : regression parameters / model to learn

__Least Squares Regression__: Learn mapping (w) from features to labels that minimizes residual sum of squares:

$$ \min_w ||Xw - y||^2_2 $$

Equivalent to $\min_w \sum_{i=1}^n (w^T x^i - y^i)^2$ by definition of the Euclidean norm

Closed form solution (if inverse exists): $w = (X^T X)^{-1} X^T y$
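
The closed form can be sketched in NumPy as follows. The data are synthetic, and the design matrix includes a leading column of ones for the offset. Solving the normal equations $X^T X w = X^T y$ with `np.linalg.solve`, instead of forming the inverse explicitly, is a common numerical choice rather than something the text prescribes.

```python
import numpy as np

# Synthetic data: n = 5 points, one feature, labels exactly y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the offset w_0
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares ||Xw - y||_2^2 at the solution
rss = np.sum((X @ w - y) ** 2)

# Cross-check against NumPy's built-in least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, rss)
```

Because the synthetic labels lie exactly on a line, the recovered parameters match the generating intercept and slope, and the residual is zero up to rounding.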