## Computing PCA using RDDs

##  PCA

The vectors that we want to analyze have length, or dimension, of 365, corresponding to the number of 
days in a year.

We will perform [Principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)
on these vectors. There are two steps to this process:

1. Computing the covariance matrix: this is a simple computation. However, it takes a long time because it involves all of the input vectors, so it benefits from using an RDD.
2. Computing the eigenvector decomposition: this is a more complex computation, but it takes a fraction of a second because the size of the covariance matrix is $365 \times 365$, which is quite small. We do it on the head node using `linalg`.
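
The two steps above can be sketched on a small synthetic data set. This is only an illustration: the array `X` and its size are made up, and the sketch uses `eigh` (a choice made here because a covariance matrix is symmetric), whereas the notebook itself calls `LA.eig`.

```python
import numpy as np
from numpy import linalg as LA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the weather vectors: 1000 vectors of dimension 5
X = rng.normal(size=(1000, 5))

# Step 1: covariance matrix. The cost grows with the number of vectors.
mean = X.mean(axis=0)
cov = (X.T @ X) / X.shape[0] - np.outer(mean, mean)

# Step 2: eigen-decomposition. The cost depends only on the dimension.
# eigh assumes a symmetric matrix and returns eigenvalues in ascending order.
eigval, eigvec = LA.eigh(cov)

print(cov.shape, eigval.shape, eigvec.shape)
```

With 365-dimensional vectors the second step works on a $365 \times 365$ matrix regardless of how many vectors went into the first step.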

### Computing the covariance matrix
Suppose that the data vectors are column vectors denoted $x$. Then the covariance matrix is defined to be
$$
E(x x^T)-E(x)E(x)^T
$$

Here $x x^T$ is the **outer product** of $x$ with itself.

If the data that we have is $x_1,x_2,\ldots,x_n$ then the estimates we use are:
$$
\hat{E}(x x^T) = \frac{1}{n} \sum_{i=1}^n x_i x_i^T,\;\;\;\;\;
\hat{E}(x) = \frac{1}{n} \sum_{i=1}^n x_i
$$
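
As a check on these estimates, the following sketch computes them directly from outer products and compares the result with `np.cov`. The data in `xs` is synthetic and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(size=(200, 4))     # hypothetical data: n=200 vectors x_i of dimension 4
n = xs.shape[0]

E_xxT = sum(np.outer(x, x) for x in xs) / n   # estimate of E(x x^T)
E_x = xs.sum(axis=0) / n                      # estimate of E(x)
cov = E_xxT - np.outer(E_x, E_x)              # E(x x^T) - E(x) E(x)^T

# np.cov with bias=True uses the same 1/n normalization
reference = np.cov(xs, rowvar=False, bias=True)
print(np.allclose(cov, reference))
```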

### The effect of  `nan`s in arithmetic operations
* We use an RDD of numpy arrays, instead of Dataframes.
* Why? Because numpy provides functions that skip `nan` entries:
  * `np.nansum([5,nan])=5`, while ordinary addition gives `5+nan=nan`, in numpy as well as in dataframes.
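
A minimal sketch of how `nan` behaves in numpy arithmetic: ordinary addition propagates `nan`, while the `nan`-aware functions `np.nansum` and `np.nanmean` skip it.

```python
import numpy as np

a = np.array([5.0, np.nan])

plain_sum = a[0] + a[1]     # ordinary addition propagates nan
nan_sum = np.nansum(a)      # nansum treats nan as 0
nan_mean = np.nanmean(a)    # nanmean averages only the non-nan entries

print(plain_sum, nan_sum, nan_mean)
```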

### Computing the covariance matrix on vectors with NaNs
As it happens, we often get vectors $x$ in which some, but not all, of the entries are `nan`. 
Suppose that we want to compute the mean of the elements of $x$. If we use `np.mean` we will get the result `nan`. A useful alternative is to use `np.nanmean` which removes the `nan` elements and takes the mean of the rest.

import numpy as np
X=np.array([1,1,1,2])
print('mean of',X,'=',np.mean(X))
print('nanmean of',X,'=',np.nanmean(X))
X=np.array([1,1,np.nan,2])
print('mean of',X,'=',np.mean(X))
print('nanmean of',X,'=',np.nanmean(X))

#### When should you not use `np.nanmean`?
Using `np.nanmean` is equivalent to assuming that the choice of which elements to remove is independent of the values of the elements.
* Example of a bad case: suppose the larger elements have a higher probability of being `nan`. In that case `np.nanmean` will under-estimate the mean.
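
The bad case can be simulated. In this sketch the missingness mechanism is an assumption made for illustration: each value in $[0,1]$ is replaced by `nan` with probability equal to the value itself, so larger values are dropped more often.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100_000)       # true values, mean close to 0.5

# Hypothetical mechanism: the probability of being nan equals the value,
# so larger elements are removed more often
missing = rng.uniform(0, 1, size=x.size) < x
observed = np.where(missing, np.nan, x)

true_mean = x.mean()
biased_mean = np.nanmean(observed)        # mean of the surviving elements only

print('true mean', true_mean, 'nanmean', biased_mean)
```

Under this mechanism the surviving values have density proportional to $1-x$, whose mean is $1/3$, so `np.nanmean` lands well below the true mean of $1/2$.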

#### Computing the covariance when there are `nan`s
The covariance is a mean of outer products, so the same `nan`-aware averaging can be applied entry by entry.

Recall that if the data that we have is $x_1,x_2,\ldots,x_n$ then the estimates we use are:
$$
\hat{E}(x x^T) = \frac{1}{n} \sum_{i=1}^n x_i x_i^T,\;\;\;\;\;
\hat{E}(x) = \frac{1}{n} \sum_{i=1}^n x_i
$$

x1=np.array([1,np.nan,3,4,5])   # np.nan instead of np.NaN, which was removed in NumPy 2.0
x2=np.array([2,3,4,np.nan,6])
stacked=np.array([np.outer(x1,x1),np.outer(x2,x2)])
stacked

np.nanmean(stacked,axis=0)
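
Stacking all outer products does not scale to many vectors, but the same `nan`-aware mean can be accumulated as running sums and counts. The sketch below is one way to do this. The names `E`, `NE`, `O`, `NO` mirror the entries of `STAT_Descriptions` in this notebook, while `to_sums` and `add_sums` are hypothetical helpers. Whether `computeCov` in `lib/spark_PCA.py` works this way is not shown here.

```python
import numpy as np
from functools import reduce

x1 = np.array([1, np.nan, 3, 4, 5])
x2 = np.array([2, 3, 4, np.nan, 6])

def to_sums(x):
    """Map step: per-vector sums and counts, with nan entries contributing 0."""
    d = (~np.isnan(x)).astype(float)   # 1 where the entry is defined, 0 where it is nan
    x0 = np.nan_to_num(x)              # copy of x with nan replaced by 0
    return (x0, d, np.outer(x0, x0), np.outer(d, d))

def add_sums(a, b):
    """Reduce step: element-wise addition of the partial sums."""
    return tuple(u + v for u, v in zip(a, b))

E, NE, O, NO = reduce(add_sums, map(to_sums, [x1, x2]))

with np.errstate(invalid='ignore', divide='ignore'):
    Mean = E / NE                      # per-coordinate mean over defined entries
    second_moment = O / NO             # nan where two coordinates are never defined together
    Cov = second_moment - np.outer(Mean, Mean)

print(second_moment)
```

Because the map step is applied to each vector independently and the reduce step is a plain addition, the same two functions could be passed to the `map` and `reduce` methods of an RDD.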

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext

#sc.stop()
sc = SparkContext(master="local[3]",pyFiles=['lib/numpy_pack.py','lib/spark_PCA.py','lib/computeStats.py'])

from pyspark.sql import *
sqlContext = SQLContext(sc)

In [2]:
import sys
sys.path.append('./lib')

import numpy as np
from numpy_pack import packArray,unpackArray
from spark_PCA import computeCov
from computeStats import computeOverAllDist, STAT_Descriptions

### Climate data

The data we will use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

There is a large variety of measurements from all over the world, from 1870 until 2012.
In the directory `../../Data/Weather` you will find the following useful files:

* data-source.txt: the source of the data
* ghcnd-readme.txt: A description of the content and format of the data
* ghcnd-stations.txt: A table describing the Meteorological stations.



### Data cleaning

* Most measurements exist only for a tiny fraction of the stations and years. We therefore restrict our use to the following measurements:
```python
['TMAX', 'SNOW', 'SNWD', 'TMIN', 'PRCP', 'TOBS']
```

* We consider only measurement-years that have at most 50 `NaN` entries

* We consider only measurements in the continental USA

* We partition the stations into the states of the continental USA (plus a few stations from states in Canada and Mexico).
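
The rule of at most 50 `NaN` entries per measurement-year can be sketched as a row filter. The array `rows` is synthetic and `MAX_NAN` is a name introduced here. The actual cleaning code is not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical measurement-years: 4 rows of 365 daily values
rows = rng.normal(size=(4, 365))
rows[1, :60] = np.nan      # 60 undefined days: rejected
rows[2, :50] = np.nan      # exactly 50 undefined days: kept
rows[3, :10] = np.nan      # 10 undefined days: kept

MAX_NAN = 50               # threshold from the rule above
nan_counts = np.isnan(rows).sum(axis=1)   # number of NaN entries per row
kept = rows[nan_counts <= MAX_NAN]

print(nan_counts, kept.shape)
```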

In [6]:
!aws s3 ls mas-dse-open/Weather/by_state/ | head

2018-03-16 13:21:36     220190 AB.tgz
2018-03-16 13:23:05   11432311 AL.tgz
2018-03-16 13:23:31   13377712 AR.tgz
2018-03-16 13:26:36   19254643 AZ.tgz
2018-03-16 13:21:47     506076 BC.tgz
2018-03-16 13:27:08   38154057 CA.tgz
2018-03-16 13:26:29   22868140 CO.tgz
2018-03-16 13:22:04    3408494 CT.tgz
2018-03-16 13:21:29      48306 DC.tgz
2018-03-16 13:21:59    1399298 DE.tgz

[Errno 32] Broken pipe


In [18]:
state='RI'
data_dir='../../Data/Weather'

tarname=state+'.tgz'
parquet=state+'.parquet'
!rm -rf $data_dir/$tarname

command="curl https://mas-dse-open.s3.amazonaws.com/Weather/by_state/%s > %s/%s"%(tarname,data_dir,tarname)
print(command)
!$command
!ls -lh $data_dir/$tarname

curl https://mas-dse-open.s3.amazonaws.com/Weather/by_state/RI.tgz > ../../Data/Weather/RI.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  668k  100  668k    0     0   668k      0  0:00:01 --:--:--  0:00:01  721k
-rw-r--r--  1 yoavfreund  staff   668K Mar 17 13:01 ../../Data/Weather/RI.tgz


In [19]:
cur_dir,=!pwd
%cd $data_dir
!tar -xzf $tarname
!du ./$parquet
%cd $cur_dir


/Users/yoavfreund/projects/edX-Micro-Master-in-Data-Science/big-data-analytics-using-spark/notebooks/Data/Weather
1768	./RI.parquet
/Users/yoavfreund/projects/edX-Micro-Master-in-Data-Science/big-data-analytics-using-spark/notebooks/Section2-PCA/PCA


In [26]:
df=sqlContext.read.parquet(data_dir+'/'+parquet)
print(df.count())
df.show(5)

2608
+-----------+-----------+----+--------------------+-----+
|    Station|Measurement|Year|              Values|State|
+-----------+-----------+----+--------------------+-----+
|USC00377581|       PRCP|1998|[00 7E 00 7E 00 7...|   RI|
|USC00377581|       PRCP|1999|[00 00 00 00 9C 5...|   RI|
|USC00377581|       PRCP|2000|[00 00 00 00 00 0...|   RI|
|USC00377581|       PRCP|2001|[00 00 00 00 00 0...|   RI|
|USC00377581|       PRCP|2002|[00 00 00 00 00 0...|   RI|
+-----------+-----------+----+--------------------+-----+
only showing top 5 rows
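
The `Values` column holds each year as a byte array of 365 `float16` numbers. The sketch below shows one way such packing and unpacking could work with numpy. `pack_array` and `unpack_array` are hypothetical helpers written for this illustration, not the `packArray` and `unpackArray` functions from `lib/numpy_pack.py`, whose source is not shown here.

```python
import numpy as np

def pack_array(a):
    """Hypothetical packer: store a numpy array as raw bytes."""
    return bytearray(a.tobytes())

def unpack_array(b, dtype):
    """Hypothetical unpacker: view raw bytes as a numpy array of the given dtype."""
    return np.frombuffer(b, dtype=dtype)

values = np.array([0.0, 1.5, np.nan, 25.0], dtype=np.float16)
packed = pack_array(values)                  # 2 bytes per float16 entry
restored = unpack_array(packed, np.float16)

print(len(packed), restored)
```

In the table above, the byte pairs `00 7E` at the start of the first row correspond to `0x7E00` read in little-endian order, which is a `float16` `nan` bit pattern.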



In [27]:
from time import time
t=time()

N=sc.defaultParallelism
print('Number of executors=',N)
print('took',time()-t,'seconds')

Number of executors= 3
took 0.0008060932159423828 seconds


In [55]:
from numpy import linalg as LA
measurements=['TMAX', 'SNOW', 'SNWD', 'TMIN', 'PRCP', 'TOBS']
sqlContext.registerDataFrameAsTable(df,'weather') #using older sqlContext instead of newer (V2.0) sparkSession

STAT_Descriptions=[
('SortedVals', 'Sample of values', 'vector whose length varies between measurements'),
 ('UnDef', 'sample of number of undefs per row', 'vector whose length varies between measurements'),
 ('mean', 'mean value', ()),
 ('std', 'std', ()),
 ('low100', 'bottom 1%', ()),
 ('high100', 'top 1%', ()),
 ('low1000', 'bottom 0.1%', ()),
 ('high1000', 'top 0.1%', ()),
 ('E', 'Sum of values per day', (365,)),
 ('NE', 'count of values per day', (365,)),
 ('Mean', 'E/NE', (365,)),
 ('O', 'Sum of outer products', (365, 365)),
 ('NO', 'counts for outer products', (365, 365)),
 ('Cov', 'O/NO', (365, 365)),
 ('Var', 'The variance per day = diagonal of Cov', (365,)),
 ('eigval', 'PCA eigen-values', (365,)),
 ('eigvec', 'PCA eigen-vectors', (365, 365))
  ]

def computeStatistics(df):
    """Compute all of the statistics for a given dataframe
    Input: dataframe with the fields 
            Station(string), Measurement(string), Year(integer), Values (byteArray with 365 float16 numbers)
    returns: STAT, a dictionary of dictionaries. First key is measurement, 
             second keys described by STAT_Descriptions above
    """

    STAT={}  # dictionary storing the statistics for each measurement

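    # register the dataframe that was passed in, so that the query below reads from it
    sqlContext.registerDataFrameAsTable(df,'weather')

    for meas in measurements:
        t=time()
        Query="SELECT * FROM weather\n\tWHERE measurement = '%s'"%(meas)
        mdf = sqlContext.sql(Query)

        # unpack the rows selected for this measurement (mdf), not all rows of df
        data=mdf.rdd.map(lambda row: unpackArray(row['Values'],np.float16))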

        #Compute basic statistics
        STAT[meas]=computeOverAllDist(data)   # Compute the statistics 

        # compute covariance matrix
        OUT=computeCov(data)

        #find PCA decomposition
        eigval,eigvec=LA.eig(OUT['Cov'])

        # collect all of the statistics in STAT[meas]
        STAT[meas]['eigval']=eigval
        STAT[meas]['eigvec']=eigvec
        STAT[meas].update(OUT)

        print('time for',meas,'is',time()-t)
    
    return STAT
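
`LA.eig` does not promise any particular ordering of the eigenvalues. For a symmetric matrix such as a covariance matrix, one option is to use `LA.eigh` and sort the eigenpairs from largest to smallest eigenvalue. This is a sketch on a made-up $3 \times 3$ matrix, not a change to `computeStatistics`.

```python
import numpy as np
from numpy import linalg as LA

# Hypothetical 3x3 symmetric matrix standing in for OUT['Cov']
cov = np.array([[4.0, 1.0, 0.0],
                [1.0, 3.0, 0.0],
                [0.0, 0.0, 1.0]])

eigval, eigvec = LA.eigh(cov)        # ascending eigenvalues, eigenvectors in columns
order = np.argsort(eigval)[::-1]     # indices from largest to smallest eigenvalue
eigval, eigvec = eigval[order], eigvec[:, order]

# cumulative fraction of the total variance captured by the top eigenvectors
explained = np.cumsum(eigval) / np.sum(eigval)

print(eigval, explained)
```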

In [57]:
from pickle import dump
for state in ['NY','RI']:
    parquet=state+'.parquet'
    df=sqlContext.read.parquet(data_dir+'/'+parquet)
    print(state,df.count())
    STAT=computeStatistics(df)
    filename=data_dir+'/STAT_%s.pickle'%state
    with open(filename,'wb') as f:   # close the file once the pickle is written
        dump((STAT,STAT_Descriptions),f)


NY 84199
shape of E= (365,) shape of NE= (365,)
time for TMAX is 260.0251200199127


KeyboardInterrupt: 

In [52]:
!ls -lrt $data_dir

total 210096
-rw-r--r--    1 yoavfreund  staff         0 Mar  6 16:57 Weather_Stations.parquet
-rw-r--r--    1 yoavfreund  staff  25674261 Mar 10 12:06 STAT_SSSSBBBB.pickle
-rw-r--r--    1 yoavfreund  staff  12456904 Mar 10 12:06 US_Weather_BBSSBBSS.csv
-rw-r--r--    1 yoavfreund  staff   3430874 Mar 10 12:06 US_Weather_BBSSBBSS.csv.gz
drwxr-xr-x   12 yoavfreund  staff       384 Mar 10 12:06 US_Weather_BBSSBBSS.parquet
-rw-r--r--    1 yoavfreund  staff  12880638 Mar 10 12:06 US_Weather_SSSSBBBB.csv
-rw-r--r--    1 yoavfreund  staff   3245918 Mar 10 12:06 US_Weather_SSSSBBBB.csv.gz
drwxr-xr-x  404 yoavfreund  staff     12928 Mar 10 12:06 decon_SSSSBBBB_SNWD.parquet
drwxr-xr-x    2 yoavfreund  staff        64 Mar 14 20:11 spark-warehouse
-rw-r--r--    1 yoavfreund  staff     25213 Mar 14 20:19 Depickle dist_to_coast.ipynb
-rw-r--r--    1 yoavfreund  staff     11989 Mar 14 21:07 combine station information.ipynb
drwxr-xr-x   12 yoavfreund  staf