# Jupyter notebook

Before starting, let's take a look at the Jupyter notebook.

1. Stopping and halting a kernel
1. Looking at which notebooks are running
1. Cells
1. Adding cells above and below
1. Changing type of cell from Markdown to Code
1. Adding math

# Classes and objects

To import a module, use the keyword `import` followed by the name of the module.

In [None]:
import sklearn

You are able to import this because the module `sklearn` is already part of the `Anaconda` distribution.
You can explore the modules that are part of sklearn by doing `from sklearn import ` and then pressing `Tab`.

In [None]:
# try it below
from sklearn import 

# this also works with submodules
from sklearn.linear_model import 

In [None]:
# from the submodule linear_model, let's import LinearRegression
from sklearn.linear_model import LinearRegression

Python supports object-oriented programming (OOP), and every value in Python is an object.

- Objects are containers of _data_ and _functionality_
- Objects are of a _class_, and that class might _inherit_ functionality from other classes
- A class defines when and how the objects of that class store data and how those objects _behave_

The imported `LinearRegression` is a `class` definition. You can find the parents of a class by retrieving its `__bases__` attribute.

In [None]:
LinearRegression.__bases__
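
The vocabulary above (data, functionality, class, inheritance, `__bases__`) can be illustrated with a minimal sketch. The classes `Model` and `ConstantModel` are hypothetical toy classes invented for this example; they are not part of `sklearn`.

```python
# Hypothetical toy classes, used only to illustrate the vocabulary above.
class Model:
    """Base class: stores a name and defines shared behaviour."""

    def __init__(self, name):
        self.name = name  # data stored in the object

    def describe(self):
        # functionality shared by every subclass
        return f"{self.name} ({type(self).__name__})"


class ConstantModel(Model):
    """Subclass: inherits describe() and adds its own behaviour."""

    def __init__(self, name, value):
        super().__init__(name)
        self.value = value

    def predict(self, n):
        # always predicts the same stored value, n times
        return [self.value] * n


m = ConstantModel("baseline", 3.0)
print(m.describe())             # inherited from Model
print(m.predict(2))             # defined in ConstantModel
print(ConstantModel.__bases__)  # the direct parent classes
```

A class that declares no parent, such as `Model` here, has `object` as its only base.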

To create an object, you call the class with parameters. To retrieve the possible parameters of a class (or function) in the notebook,
you can press `Shift-Tab` (preview), double `Shift-Tab` (expanded window), triple `Shift-Tab` (expanded window with no timeout), or quadruple `Shift-Tab` (split view of the help).

In [None]:
# try it below
LinearRegression()

Now, let's create a linear regression object

In [None]:
lr = LinearRegression()

Again, we can explore that object by typing the name of the object, then `.`, and then `Tab`

In [None]:
# try it here
lr.
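
Outside the notebook, similar information can be obtained programmatically. The sketch below uses the standard-library module `json` as a stand-in (an assumption made so that the example runs without `sklearn`); the same calls should work on `LinearRegression` and `lr`.

```python
import inspect
import json

# Parameters accepted by a callable (similar to what Shift-Tab shows)
signature = inspect.signature(json.dumps)
print(list(signature.parameters))

# Public attributes of an object (similar to what Tab completion lists)
public_names = [name for name in dir(json) if not name.startswith("_")]
print(public_names)
```

The built-in `help` function prints the full documentation of an object in the same spirit.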

If we type `lr` into the notebook, we will get a customized description of the object

In [None]:
lr

We can obtain the class of the object programmatically by calling the built-in function `type`

In [None]:
type(lr)

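Now, every object also has an _identity_: an integer, returned by the built-in function `id`, that is unique among the objects that exist at the same time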

In [None]:
id(lr)
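
A minimal sketch of what identity means, using plain lists instead of `lr` (the variable names are chosen only for this illustration): two names can refer to the same object, while two distinct objects can hold equal values.

```python
a = [1, 2, 3]
b = a            # a second name for the same object
c = [1, 2, 3]    # a different object with equal contents

print(id(a) == id(b))  # same identity
print(a is b)          # `is` compares identities
print(a == c, a is c)  # equal values, different identities
```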

# Datasets

`sklearn` has many datasets. We will take a diabetes dataset from it

In [None]:
from sklearn.datasets import load_diabetes

In [None]:
diabetes_ds = load_diabetes()

In [None]:
X = diabetes_ds['data']
y = diabetes_ds['target']

`sklearn` works mostly with `numpy` arrays, which are $n$-dimensional arrays.

In [None]:
[type(X), type(y)]

# `Numpy` arrays

You can check the number of dimensions of an array

In [None]:
X.ndim

Check the size of the dimensions

In [None]:
X.shape

Get slices of the dimensions. The following are all the same thing: grab the first two rows of a matrix

In [None]:
X[0:2]

In [None]:
X[:2]

In [None]:
X[0:2, :]

We can also grab columns in the same way

In [None]:
X[:, 0:2]
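
The equivalence of these slicing forms can be checked on a small stand-in matrix. The array `A` below is made up for the illustration and is not the diabetes data.

```python
import numpy as np

# A small stand-in matrix (3 rows, 4 columns); not the diabetes data
A = np.arange(12).reshape(3, 4)

first_rows_a = A[0:2]
first_rows_b = A[:2]
first_rows_c = A[0:2, :]
same_rows = (np.array_equal(first_rows_a, first_rows_b)
             and np.array_equal(first_rows_b, first_rows_c))

first_cols = A[:, 0:2]
print(same_rows)
print(first_rows_a.shape, first_cols.shape)
```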

Sometimes you want to grab just one column (feature), but `numpy` then returns a one-dimensional array

In [None]:
X[:, 2].shape

We can reshape the $nd$-array and add one dimension:

In [None]:
X[:, 2].reshape([-1, 1])

In [None]:
X[:, 2].reshape([-1, 1]).shape
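
There are a few equivalent ways to keep a single column two-dimensional. The sketch below compares them on a made-up matrix `A` (an assumption for illustration); the form `A[:, [2]]`, which indexes with a list, is the one used later in this notebook when fitting.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)  # stand-in matrix, not the diabetes data

col_1d = A[:, 2]                         # an integer index drops the dimension
col_2d_reshape = A[:, 2].reshape(-1, 1)  # add the dimension back explicitly
col_2d_list = A[:, [2]]                  # a list index keeps the dimension
col_2d_newaxis = A[:, 2][:, np.newaxis]  # insert a new axis

print(col_1d.shape, col_2d_reshape.shape, col_2d_list.shape, col_2d_newaxis.shape)
```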

You can do matrix algebra:

In [None]:
# transpose
X.T.shape

In [None]:
X.dot(X.T).shape

For more functions, you can import `numpy` submodules such as `numpy.linalg`

In [None]:
import numpy.linalg as la

In [None]:
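# X.dot(X.T) is 442 x 442 with rank at most 10, hence singular; invert the 10 x 10 matrix X.T.dot(X) instead
la.inv(X.T.dot(X)).shape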
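
When a matrix has more rows than columns, the order of the product matters for inversion. A minimal sketch on a made-up tall matrix `A` (5 samples, 2 features, not the diabetes data): `A.dot(A.T)` is large and rank deficient, while `A.T.dot(A)` is small and, in this example, invertible.

```python
import numpy as np
import numpy.linalg as la

# Stand-in tall matrix: 5 samples, 2 features (not the diabetes data)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0],
              [0.0, 3.0]])

big = A.dot(A.T)    # 5 x 5, but its rank is at most 2, so it is singular
small = A.T.dot(A)  # 2 x 2 and full rank, so it can be inverted

print(la.matrix_rank(big), la.matrix_rank(small))
small_inv = la.inv(small)
print(np.allclose(small_inv.dot(small), np.eye(2)))
```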

# Fitting models

OK, let's go back to our example with linear regression.

Usually, `sklearn` objects start by _fitting_ the data, and then either _predict_ or _transform_ new data. _Predicting_ is typically used in supervised learning, while _transforming_ is typically used in unsupervised learning and preprocessing.

In [None]:
# explore the parameters of fit
lr.fit

In [None]:
lr2 = lr.fit(X[:, [2]], y)

`fit` returns an object. If we examine the id of the object it returns:

In [None]:
id(lr2)

In [None]:
id(lr)

We see that it is the same object as `lr`. The call therefore fits the data, modifies the internal state of the object, and _returns the object itself_.

As a result, you can __chain__ calls, which is a very powerful feature.
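
The pattern can be reproduced in a few lines. The class `MeanRegressor` below is a hypothetical toy estimator written for this illustration; it is not how `sklearn` implements `LinearRegression`, but it shows why returning `self` from `fit` makes chaining possible.

```python
# Hypothetical toy estimator, not scikit-learn's implementation
class MeanRegressor:
    def fit(self, X, y):
        # learn a single number: the mean of the targets
        self.mean_ = sum(y) / len(y)
        return self  # returning the object itself is what enables chaining

    def predict(self, X):
        # predict the learned mean for every input row
        return [self.mean_ for _ in X]


model = MeanRegressor()
fitted = model.fit([[1], [2], [3]], [10.0, 20.0, 30.0])
print(fitted is model)

chained = MeanRegressor().fit([[1], [2]], [1.0, 3.0]).predict([[5], [6], [7]])
print(chained)
```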

# Explore the fitted object

By looking at the [online documentation of the `LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), we can find the attributes that hold the parameters it estimated.

In [None]:
lr.intercept_

In [None]:
lr.coef_

# Predicting

In [None]:
# explore the parameters
lr.predict

In [None]:
y_pred = lr.predict(X[:, [2]])

Because we know how linear regression works, we can produce the predictions ourselves
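
In symbols, for a model with a single feature $x_i$, the prediction for row $i$ is a straight line:

$$ \hat{y}_i = \beta_0 + \beta_1 x_i $$

Here $\beta_0$ corresponds to `lr.intercept_` and $\beta_1$ to the single entry of `lr.coef_`.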

In [None]:
y_pred2 = lr.intercept_ + X[:, [2]].dot(lr.coef_)

In [None]:
# this checks that all entries agree up to floating-point tolerance
import numpy as np
np.allclose(y_pred2, y_pred)
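
Comparing floating-point arrays with `==` can fail because of rounding, even when two computations are mathematically identical. A minimal sketch with hand-picked numbers (not the model's predictions), showing `np.allclose` as a tolerance-based alternative:

```python
import numpy as np

a = np.array([0.1 + 0.2, 1.0 / 3.0])
b = np.array([0.3, 1.0 - 2.0 / 3.0])

exact = np.all(a == b)     # strict equality, sensitive to rounding
close = np.allclose(a, b)  # equality up to a small tolerance
print(exact, close)
```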

Now, due to the powerful concept of chaining, we can combine fit and predict in one line

In [None]:
y_pred3 = lr.fit(X[:, [2]], y).predict(X[:, [2]])

In [None]:
np.all(y_pred3 == y_pred)

# Additional packages

Sometimes you want to use a package that you found online. Many of these packages are available through `pip`, the standard package installer for Python.

For example, the package [`quandl`](https://www.quandl.com/tools/python) allows quants to load financial data in Python.

We can install it in the console simply by typing
```
pip install quandl
```
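
To check from Python whether a package is available before importing it, one option is the standard-library function `importlib.util.find_spec`. The helper name `is_installed` is hypothetical, introduced only for this sketch.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


print(is_installed("json"))    # standard library, always present
print(is_installed("quandl"))  # depends on whether pip install was run
```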

And now we should be able to import that package

In [None]:
import quandl

In [None]:
import quandl
mydata = quandl.get("YAHOO/AAPL")

In [None]:
mydata.head()

# Pandas

In [None]:
# this helps put the plot results in the browser
%matplotlib inline

`Pandas` is a package for loading, manipulating, and displaying data sets. It tries to mimic the functionality of `data.frame` in `R`

In [None]:
import pandas as pd

Many packages return data in `pandas` `DataFrame` objects

In [None]:
apple_stocks = quandl.get("YAHOO/AAPL")    

In [None]:
type(apple_stocks)

We can display the beginning of a data frame:

In [None]:
apple_stocks.head()

In [None]:
apple_stocks.tail()

And also, we can plot it with `pandas`

In [None]:
apple_stocks.plot(y='Close');

We can manipulate it too. Let's say we want to compute the stock returns

$$ r_t = \frac{V_t - V_{t-1}}{V_{t-1}} = \frac{V_t}{V_{t-1}} - 1$$

For this, each value must be compared with the previous one, which is what the `pct_change` method does

In [None]:
apple_stocks[['Close']].pct_change().head()

In [None]:
apple_stocks[['Close']].pct_change().plot();

In [None]:
apple_stocks[['Close']].pct_change().hist(bins=100);
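
To see what `pct_change` computes, it can be compared with the return $(V_t - V_{t-1}) / V_{t-1}$ calculated by hand using `shift`. The prices below are made up for the illustration; they are not Apple's stock prices.

```python
import numpy as np
import pandas as pd

# Made-up closing prices, only for illustration
prices = pd.Series([100.0, 110.0, 99.0, 99.0])

previous = prices.shift(1)                # V_{t-1}, undefined for the first row
by_hand = (prices - previous) / previous  # (V_t - V_{t-1}) / V_{t-1}
built_in = prices.pct_change()

print(built_in.tolist())
print(np.allclose(by_hand.iloc[1:], built_in.iloc[1:]))
```

The first entry is `NaN` in both cases because there is no previous value to compare with.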

# Spark

`Spark` is a distributed, in-memory big data analytics framework. It is `hadoop` on steroids.

![](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/diagrams/spark-submit-master-workers.png)

Because we launched this `jupyter` notebook with `pyspark`, a variable called `sc` (the `Spark` context) is automatically available. It gives us access to the master and therefore to the workers.

If we go to see the [`Spark` dashboard](http://localhost:4040) (usually in port `4040`), we can see some of the variables.

With the `Spark` context you can read data from many sources, including `HDFS` (Hadoop Distributed File System), `Hive`, Amazon's `S3`, files, and databases.

In [None]:
# explore the variables and functions available in the Spark context
sc

Spark traditionally works with `RDD`s (Resilient Distributed Datasets), and more recently it has been moving towards `DataFrame`s, which are similar to `Pandas` data frames but distributed.

In [None]:
rdd_example = sc.parallelize([1, 2, 3, 4, 5, 6, 7])

We can check the `id` of the `RDD` in the cluster

In [None]:
rdd_example.id()

In [None]:
# this is an RDD
type(rdd_example)

Let's explore the functions we have available

In [None]:
rdd_example.

One such function is `take`, which allows you to get a taste of what the `RDD` contains

In [None]:
rdd_example.take(3)

Let's say you want to apply an operation to each element of the list

In [None]:
def square(x):
    return x**2

Now we can apply that transformation to the `RDD` with the `map` function

In [None]:
rdd_result = rdd_example.map(square)

Now you might notice that this returns immediately. This is because transformations on an `RDD`, such as `map`, are lazily evaluated

In [None]:
type(rdd_result)

So `rdd_result` is another `RDD`

In [None]:
rdd_result.id()

Now in fact, there is no duplication of data. `Spark` builds a computational graph that keeps track of dependencies and recomputes if something crashes.

We can take a look at the contents of the results by using `take` again. Since `take` is an action, it will trigger a job in the Spark cluster

In [None]:
rdd_result.take(3)
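
Lazy evaluation can be illustrated without a cluster. The sketch below is only an analogy in plain Python, using the built-in `map` (which is also lazy); it does not use the Spark API, and Spark's scheduling is considerably more involved. The list `calls` is a helper introduced here to record when `square` actually runs.

```python
calls = []


def square(x):
    calls.append(x)  # record every time the function actually runs
    return x ** 2


lazy = map(square, [1, 2, 3, 4, 5, 6, 7])  # nothing is computed yet
calls_before = len(calls)

first_three = [next(lazy) for _ in range(3)]  # forces only three evaluations
calls_after = len(calls)

print(calls_before, first_three, calls_after)
```

Defining the transformation costs nothing; work happens only when results are requested, and only as much as is needed.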

In [None]:
rdd_result.count()

In [None]:
rdd_result.first()

Usually, once you have your results, you write them back to `Hadoop` for later processing, because they usually won't fit in memory.

In [None]:
# this function can save into HDFS using Pickle (Python's internal) format;
# the output path below is a placeholder, replace it with your own location
rdd_result.saveAsPickleFile("rdd_result_pickle")

# Spark's `DataFrame`

Now, a `DataFrame` has some structure. Again, you can create one from different sources. In this case, `DataFrame` functionality is available from another __context__ called the `sqlContext`. This gives us access to SQL-like transformations.

In this example, we will use the `sklearn` diabetes dataset again

In [None]:
from sklearn.datasets import load_diabetes
import pandas as pd

In [None]:
diabetes_ds = load_diabetes()

To create a dataset useful for machine learning we need to use certain datatypes

In [None]:
from pyspark.mllib.regression import LabeledPoint

In [None]:
from pyspark.ml.linalg import Vectors

In [None]:
Xy_df = sqlContext.createDataFrame([
        [float(l), Vectors.dense(d)] for d, l in zip(diabetes_ds['data'], diabetes_ds['target'])],
                                  ["y", "features"])

In [None]:
Xy_df

We can register the `DataFrame` in Spark as a temporary SQL table

In [None]:
Xy_df.registerTempTable('Xy')

And then run queries

In [None]:
sql_result1_df = sqlContext.sql('select count(*) from Xy')

In [None]:
# which again is lazily executed
sql_result1_df

In [None]:
sql_result1_df.take(1)

We can also run large-scale regression using a `DataFrame`

In [None]:
from pyspark.ml.regression import LinearRegression

In [None]:
lr_spark = LinearRegression(featuresCol='features', labelCol="y")

In [None]:
# the estimator has not been fitted yet, so it has no coefficients (this raises an AttributeError)
lr_spark.coefficients

In [None]:
lr_results = lr_spark.fit(Xy_df)

In [None]:
lr_results.coefficients

In [None]:
lr_results.intercept