## Computing PCA using RDDs

##  PCA

The vectors that we want to analyze have length, or dimension, of 365, corresponding to the number of 
days in a year.

We will perform [Principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)
on these vectors. There are two steps to this process:

1. Computing the covariance matrix: this is a simple computation. However, it takes a long time because it involves all of the input vectors, so it benefits from using an RDD.
2. Computing the eigenvector decomposition: this is a more complex computation, but it takes a fraction of a second because the size of the covariance matrix is $365 \times 365$, which is quite small. We do it on the head node using `linalg`.
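
To illustrate the second step, here is a minimal sketch of an eigen-decomposition of a $365 \times 365$ symmetric matrix. It assumes `numpy.linalg.eigh` as the `linalg` routine, and the matrix `cov` is a random stand-in, not the covariance of the weather data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the covariance matrix: a random symmetric
# positive semi-definite matrix of size 365 x 365.
A = rng.standard_normal((1000, 365))
cov = A.T @ A / A.shape[0]

# eigh is intended for symmetric matrices; it returns the eigenvalues
# in ascending order, with the eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(cov)

# Reorder so that the largest eigenvalue (top principal component) is first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvecs.shape)
```

Because the matrix is small, this runs on a single machine in well under a second, which is why only the covariance step needs to be distributed.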

### Computing the covariance matrix
Suppose that the data vectors are column vectors denoted $x$. Then the covariance matrix is defined to be
$$
E(x x^T)-E(x)E(x)^T
$$

Where $x x^T$ is the **outer product** of $x$ with itself.

If the data that we have is $x_1,x_2,\ldots,x_n$ then the estimates we use are:
$$
\hat{E}(x x^T) = \frac{1}{n} \sum_{i=1}^n x_i x_i^T,\;\;\;\;\;
\hat{E}(x) = \frac{1}{n} \sum_{i=1}^n x_i
$$
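
The two estimates can be checked on a small example. The sketch below uses a hypothetical toy data set (four vectors of dimension 3, stored as rows) and compares the result with `np.cov`, which uses the same $1/n$ normalization when called with `bias=True`.

```python
import numpy as np

# Hypothetical toy data: n=4 vectors of dimension 3, one vector per row.
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 1.0, 0.0],
                 [0.0, 1.0, 5.0],
                 [1.0, 0.0, 4.0]])
n = data.shape[0]

# Estimate of E(x x^T): the average of the outer products x_i x_i^T.
E_xxT = sum(np.outer(x, x) for x in data) / n
# Estimate of E(x): the average of the vectors.
E_x = data.sum(axis=0) / n

# Covariance estimate: E(x x^T) - E(x) E(x)^T
cov = E_xxT - np.outer(E_x, E_x)

# Cross-check against numpy's built-in estimate with 1/n normalization.
matches = np.allclose(cov, np.cov(data, rowvar=False, bias=True))
print(matches)
```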

### The effect of `nan`s in arithmetic operations
* We use an RDD of numpy arrays, instead of Dataframes.
* Why? Because numpy provides `nan`-aware functions that skip `nan` entries:
  * `np.nansum` of `5` and `nan` is `5`, whereas plain addition gives `5+nan=nan`, both in numpy and in dataframe column arithmetic.
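
A quick check of how `nan` behaves in numpy sums; this is a small sketch for illustration, with variable names chosen here:

```python
import numpy as np

a = np.array([5.0, np.nan])

# Plain addition propagates nan.
plain_sum = a[0] + a[1]
# The nan-aware sum treats nan entries as zero.
nan_aware_sum = np.nansum(a)

print(plain_sum, nan_aware_sum)
```

Here `plain_sum` is `nan`, while `nan_aware_sum` is `5.0`.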

### Computing the covariance matrix on vectors with NaNs
As it happens, we often get vectors $x$ in which some, but not all, of the entries are `nan`. 
Suppose that we want to compute the mean of the elements of $x$. If we use `np.mean` we will get the result `nan`. A useful alternative is to use `np.nanmean` which removes the `nan` elements and takes the mean of the rest.

In [1]:
import numpy as np
X=np.array([1,1,1,2])
print('mean of',X,'=',np.mean(X))
print('nanmean of',X,'=',np.nanmean(X))
X=np.array([1,1,np.nan,2])
print('mean of',X,'=',np.mean(X))
print('nanmean of',X,'=',np.nanmean(X))

mean of [1 1 1 2] = 1.25
nanmean of [1 1 1 2] = 1.25
mean of [  1.   1.  nan   2.] = nan
nanmean of [  1.   1.  nan   2.] = 1.33333333333


#### When should you not use `np.nanmean`?
Using `np.nanmean` is equivalent to assuming that the choice of which elements to remove is independent of the values of the elements. 
* Example of a bad case: suppose the larger elements have a higher probability of being `nan`. In that case `np.nanmean` will under-estimate the mean.
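
The bad case can be made concrete with a small deterministic sketch. The data and the threshold of 7 are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical data: the values 1, 2, ..., 10, whose true mean is 5.5.
x = np.arange(1.0, 11.0)
true_mean = x.mean()

# Missingness that depends on the value: every entry larger than 7 is lost.
x_missing = x.copy()
x_missing[x_missing > 7] = np.nan

# nanmean averages only the surviving values 1..7.
biased = np.nanmean(x_missing)

print(true_mean, biased)
```

The surviving values are 1 through 7, so `biased` is 4.0, well below the true mean of 5.5.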

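#### Computing the covariance when there are `nan`s
The covariance is a mean of outer products, so the idea behind `np.nanmean` can be applied entry by entry: each entry of the matrix is averaged only over the vectors for which it is not `nan`.

If the data that we have is $x_1,x_2,\ldots,x_n$ then the estimates we use are: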
$$
\hat{E}(x x^T) = \frac{1}{n} \sum_{i=1}^n x_i x_i^T,\;\;\;\;\;
\hat{E}(x) = \frac{1}{n} \sum_{i=1}^n x_i
$$
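
The following is a minimal sketch of a `nan`-aware version of these estimates, written as a map step and a reduce step. The function names `outer_stats` and `add_stats` and the toy vectors are assumptions made here for illustration; the course library may organize the computation differently. Plain Python `map` and `functools.reduce` stand in for the RDD operations `rdd.map(...)` and `rdd.reduce(...)`.

```python
import numpy as np
from functools import reduce

def outer_stats(x):
    """Map step: per-vector sums and counts, with nan entries contributing 0."""
    valid = ~np.isnan(x)
    x0 = np.where(valid, x, 0.0)   # the vector with nan replaced by 0
    v = valid.astype(float)        # 1 where the entry is defined, else 0
    # (sum of x, count for x, sum of outer products, count for outer products)
    return x0, v, np.outer(x0, x0), np.outer(v, v)

def add_stats(a, b):
    """Reduce step: element-wise sum of the four accumulators."""
    return tuple(p + q for p, q in zip(a, b))

# Hypothetical toy data: four vectors of dimension 3, some entries undefined.
vectors = [np.array([1.0, 2.0, np.nan]),
           np.array([2.0, np.nan, 1.0]),
           np.array([3.0, 4.0, 5.0]),
           np.array([0.0, 2.0, 3.0])]

S, N, SO, NO = reduce(add_stats, map(outer_stats, vectors))

mean = S / N      # per-coordinate mean over the defined entries
E_xxT = SO / NO   # per-entry mean of the defined outer-product entries
cov = E_xxT - np.outer(mean, mean)
print(cov)
```

Each entry of `NO` counts the vectors in which both coordinates are defined, so every entry of the matrix is divided by its own count rather than by $n$. The sketch assumes every count is positive; an entry that is never defined would produce a division by zero.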

In [2]:
import findspark
findspark.init()
from pyspark import SparkContext

#sc.stop()
sc = SparkContext(master="local[3]",pyFiles=['lib/numpy_pack.py','lib/spark_PCA.py','lib/computeStats.py'])

from pyspark.sql import *
sqlContext = SQLContext(sc)

In [3]:
import sys
sys.path.append('./lib')

import numpy as np
from computeStats import  STAT_Descriptions
from computeStatistics import computeStatistics

### Climate data

The data we will use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

There is a large variety of measurements from all over the world, from 1870 until 2012.
In the directory `../../Data/Weather` you will find the following useful files:

* data-source.txt: the source of the data
* ghcnd-readme.txt: A description of the content and format of the data
* ghcnd-stations.txt: A table describing the Meteorological stations.



### Data cleaning

* Most measurements exist only for a tiny fraction of the stations and years. We therefore restrict our use to the following measurements:
```python
['TMAX', 'SNOW', 'SNWD', 'TMIN', 'PRCP', 'TOBS']
```

* We consider only measurement-years that have at most 50 `NaN` entries

* We consider only measurements in the continental USA

* We partition the stations into the states of the continental USA (plus a few stations from states in Canada and Mexico).
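
The `NaN` threshold above can be sketched as a simple predicate on a vector of 365 daily values. The helper name `few_nans` and the two sample vectors are hypothetical; this is not the code that was used to prepare the data files.

```python
import numpy as np

def few_nans(v, max_nans=50):
    """Return True if the vector has at most max_nans undefined entries."""
    return int(np.isnan(v).sum()) <= max_nans

# Hypothetical measurement-years: one with exactly 50 NaNs, one with 51.
year_a = np.full(365, 10.0)
year_a[:50] = np.nan
year_b = np.full(365, 10.0)
year_b[:51] = np.nan

# Keep only the measurement-years that pass the threshold.
kept = [v for v in (year_a, year_b) if few_nans(v)]
print(len(kept))
```

Only `year_a` passes, since the rule keeps vectors with at most 50 `NaN` entries.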

In [4]:
state='NY'
data_dir='../../Data/Weather'
tarname=state+'.tgz'
parquet=state+'.parquet'

In [5]:
!rm -rf $data_dir/$tarname

command="curl https://mas-dse-open.s3.amazonaws.com/Weather/by_state/%s > %s/%s"%(tarname,data_dir,tarname)
print(command)
!$command
!ls -lh $data_dir/$tarname

curl https://mas-dse-open.s3.amazonaws.com/Weather/by_state/NY.tgz > ../../Data/Weather/NY.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.1M  100 22.1M    0     0  5659k      0  0:00:04  0:00:04 --:--:-- 5338k
-rw-r--r--  1 yoavfreund  staff    22M Mar 18 20:25 ../../Data/Weather/NY.tgz


In [6]:
cur_dir,=!pwd
%cd $data_dir
!tar -xzf $tarname
!du ./$parquet
%cd $cur_dir


df=sqlContext.read.parquet(data_dir+'/'+parquet)
print(df.count())
df.show(5)

/Users/yoavfreund/projects/edX-Micro-Master-in-Data-Science/big-data-analytics-using-spark/notebooks/Data/Weather
56480	./NY.parquet
/Users/yoavfreund/projects/edX-Micro-Master-in-Data-Science/big-data-analytics-using-spark/notebooks/Section2-PCA/PCA
84199
+-----------+-----------+----+--------------------+-----+
|    Station|Measurement|Year|              Values|State|
+-----------+-----------+----+--------------------+-----+
|USC00303452|       PRCP|1903|[00 7E 00 7E 00 7...|   NY|
|USC00303452|       PRCP|1904|[00 00 28 5B 00 0...|   NY|
|USC00303452|       PRCP|1905|[00 00 60 56 60 5...|   NY|
|USC00303452|       PRCP|1906|[00 00 00 00 00 0...|   NY|
|USC00303452|       PRCP|1907|[00 00 00 00 60 5...|   NY|
+-----------+-----------+----+--------------------+-----+
only showing top 5 rows



In [7]:
from time import time
t=time()

N=sc.defaultParallelism
print('Number of executors=',N)
print('took',time()-t,'seconds')

Number of executors= 3
took 0.0013091564178466797 seconds


In [8]:
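from pickle import dump
for state in ['NY']:
    # Load the per-state measurements from the parquet file
    parquet=state+'.parquet'
    df=sqlContext.read.parquet(data_dir+'/'+parquet)
    print(state,df.count())

    # Compute the statistics for each measurement type
    STAT=computeStatistics(sqlContext,df)

    # Save the statistics together with their descriptions,
    # closing the file once the dump completes
    filename=data_dir+'/STAT_%s.pickle'%state
    with open(filename,'wb') as pickle_file:
        dump((STAT,STAT_Descriptions),pickle_file)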

NY 84199
TMAX : shape of mdf is  13437
time for TMAX is 42.75972509384155
SNOW : shape of mdf is  15629
time for SNOW is 46.73251724243164
SNWD : shape of mdf is  14617
time for SNWD is 39.53287196159363
TMIN : shape of mdf is  13442
time for TMIN is 38.655359983444214
PRCP : shape of mdf is  16118
time for PRCP is 45.48331594467163
TOBS : shape of mdf is  10956
time for TOBS is 32.675769090652466


In [9]:
X=STAT['TMAX']['Var']
for key in STAT.keys():
    Y=STAT[key]['Var']
    print(key,sum(abs(X-Y)))

TMAX 0.0
SNOW 852107.705867
SNWD 4464167.85212
TMIN 319734.5315
PRCP 1184305.12284
TOBS 277719.008938


In [11]:
!ls -l ../../Data/Weather/STAT*

-rw-r--r--  1 yoavfreund  staff  25684434 Mar 18 20:30 ../../Data/Weather/STAT_NY.pickle
-rw-r--r--  1 yoavfreund  staff  17545930 Mar 18 17:38 ../../Data/Weather/STAT_NY.pickle.gz
-rw-r--r--  1 yoavfreund  staff  26741490 Mar 18 20:19 ../../Data/Weather/STAT_RI.pickle
-rw-r--r--  1 yoavfreund  staff  13522496 Mar 10 12:06 ../../Data/Weather/STAT_SSSSBBBB.pickle.gz


In [12]:
!gzip -f ../../Data/Weather/STAT*.pickle
!ls -l ../../Data/Weather/STAT*

-rw-r--r--  1 yoavfreund  staff  14948048 Mar 18 20:30 ../../Data/Weather/STAT_NY.pickle.gz
-rw-r--r--  1 yoavfreund  staff  11717259 Mar 18 20:19 ../../Data/Weather/STAT_RI.pickle.gz
-rw-r--r--  1 yoavfreund  staff  13522496 Mar 10 12:06 ../../Data/Weather/STAT_SSSSBBBB.pickle.gz
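
A gzipped pickle such as these can be read back without decompressing it on disk first. The sketch below is a round trip on a hypothetical stand-in payload written to a temporary file; the dictionary contents and the file name `STAT_demo.pickle.gz` are made up for illustration.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for the (STAT, STAT_Descriptions) tuple.
payload = ({'TMAX': {'Var': [1.0, 2.0]}}, [('Var', 'variance', 'vector')])

# Write a gzip-compressed pickle to a temporary location.
path = os.path.join(tempfile.mkdtemp(), 'STAT_demo.pickle.gz')
with gzip.open(path, 'wb') as f:
    pickle.dump(payload, f)

# Read it back: gzip.open decompresses transparently while unpickling.
with gzip.open(path, 'rb') as f:
    STAT_loaded, descriptions_loaded = pickle.load(f)

print(sorted(STAT_loaded.keys()))
```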


In [13]:
for state in ['RI','NY']:
    command="aws s3  cp ../../Data/Weather/STAT_%s.pickle.gz s3://mas-dse-open/Weather/by_state/STAT_%s.pickle.gz"%(state,state)
    print(command)
    !$command

aws s3  cp ../../Data/Weather/STAT_RI.pickle.gz s3://mas-dse-open/Weather/by_state/STAT_RI.pickle.gz
upload: ../../Data/Weather/STAT_RI.pickle.gz to s3://mas-dse-open/Weather/by_state/STAT_RI.pickle.gz
aws s3  cp ../../Data/Weather/STAT_NY.pickle.gz s3://mas-dse-open/Weather/by_state/STAT_NY.pickle.gz
upload: ../../Data/Weather/STAT_NY.pickle.gz to s3://mas-dse-open/Weather/by_state/STAT_NY.pickle.gz


In [14]:
!aws s3  ls s3://mas-dse-open/Weather/by_state/ | grep STAT

2018-03-18 20:34:12   14948048 STAT_NY.pickle.gz
2018-03-18 20:33:54   11717259 STAT_RI.pickle.gz
