## Computing PCA using RDDs

## PCA

The vectors that we want to analyze have length, or dimension, of 365, corresponding to the number of 
days in a year.

We will perform [Principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)
on these vectors. There are two steps to this process:

1. Computing the covariance matrix: this is a simple computation. However, it takes a long time because it involves all of the input vectors, so it benefits from using an RDD.
2. Computing the eigenvector decomposition: this is a more complex computation, but it takes a fraction of a second because the covariance matrix is only $365 \times 365$, which is quite small. We do it on the head node using `linalg`.

### Computing the covariance matrix
Suppose that the data vectors are column vectors denoted $x$. Then the covariance matrix is defined to be
$$
E(x x^T)-E(x)E(x)^T
$$

where $x x^T$ is the **outer product** of $x$ with itself.

If the data that we have is $x_1,x_2,\ldots,x_n$ then the estimates we use are:
$$
\hat{E}(x x^T) = \frac{1}{n} \sum_{i=1}^n x_i x_i^T,\;\;\;\;\;
\hat{E}(x) = \frac{1}{n} \sum_{i=1}^n x_i
$$
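As a quick check of these estimates, here is a minimal sketch on synthetic data (the array `X` and its dimensions are made up for illustration). It builds the covariance from the mean of outer products and compares it with `np.cov` using the same $1/n$ normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # synthetic data: 1000 vectors of dimension 5, one per row

n = X.shape[0]
mean_outer = sum(np.outer(x, x) for x in X) / n   # estimate of E(x x^T)
mean_vec = X.mean(axis=0)                         # estimate of E(x)
cov = mean_outer - np.outer(mean_vec, mean_vec)

# np.cov with bias=True also divides by n rather than n-1
reference = np.cov(X, rowvar=False, bias=True)
print(np.allclose(cov, reference))
```

The explicit loop over outer products is slow in pure Python; it is written this way only to mirror the formula term by term.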

### The effect of  `nan`s in arithmetic operations
* We use an RDD of numpy arrays, instead of Dataframes.
* Why? Because numpy provides `nan`-aware reductions such as `np.nansum` and `np.nanmean`:
  * With `np.nansum`, `5+nan=5`, while ordinary addition (in numpy as well as in dataframes) gives `5+nan=nan`
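To make the distinction concrete, the small sketch below contrasts ordinary addition, which propagates `nan`, with the `nan`-aware reduction `np.nansum`, which treats `nan` as zero. The variable names are illustrative only.

```python
import numpy as np

a = np.array([5.0, np.nan])

plain_sum = a[0] + a[1]        # ordinary addition propagates nan
nan_aware_sum = np.nansum(a)   # nansum skips nan entries

print(plain_sum, nan_aware_sum)
```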

### Computing the covariance matrix on vectors with NaNs
As it happens, we often get vectors $x$ in which some, but not all, of the entries are `nan`. 
Suppose that we want to compute the mean of the elements of $x$. If we use `np.mean` we will get the result `nan`. A useful alternative is to use `np.nanmean` which removes the `nan` elements and takes the mean of the rest.

import numpy as np
X=np.array([1,1,1,2])
print 'mean of',X,'=',np.mean(X)
print 'nanmean of',X,'=',np.nanmean(X)
X=np.array([1,1,np.NaN,2])
print 'mean of',X,'=',np.mean(X)
print 'nanmean of',X,'=',np.nanmean(X)

#### When should you not use `np.nanmean` ?
Using `np.nanmean` is equivalent to assuming that the choice of which elements to remove is independent of the values of the elements. 
* Example of a bad case: suppose the larger elements have a higher probability of being `nan`. In that case `np.nanmean` will underestimate the mean.
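The bias can be seen in a small simulation. This is only a sketch: the distribution and the missingness rule (values above 10 are dropped with probability 0.5) are assumptions chosen for illustration, not properties of the weather data.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=2.0, size=100000)

# Assumed missingness rule: only values above 10 can go missing, each with probability 0.5
missing = (values > 10.0) & (rng.random(values.size) < 0.5)
observed = np.where(missing, np.nan, values)

full_mean = values.mean()              # mean of the complete data
observed_mean = np.nanmean(observed)   # mean of what survives

print(full_mean, observed_mean)
```

Because large values are removed more often than small ones, `observed_mean` comes out below `full_mean`.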

#### Computing the covariance when there are `nan`s
The covariance is a mean of outer products.

If the data that we have is $x_1,x_2,\ldots,x_n$ then the estimates we use are:
$$
\hat{E}(x x^T) = \frac{1}{n} \sum_{i=1}^n x_i x_i^T,\;\;\;\;\;
\hat{E}(x) = \frac{1}{n} \sum_{i=1}^n x_i
$$

x1=np.array([1,np.NaN,3,4,5])
x2=np.array([2,3,4,np.NaN,6])
stacked=np.array([np.outer(x1,x1),np.outer(x2,x2)])
stacked

np.nanmean(stacked,axis=0)
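Stacking every outer product does not scale to many vectors. One possible alternative, sketched below, is to accumulate sums and counts of valid entries so that the work can be expressed as a map followed by an associative reduce. The helper names `contributions` and `add` are hypothetical, and this is not necessarily how `computeCov` in `lib/spark_PCA.py` is implemented.

```python
import numpy as np
from functools import reduce

def contributions(x):
    """Per-vector terms: nan-zeroed outer product, valid-pair counts,
    nan-zeroed vector, and valid-entry counts."""
    valid = ~np.isnan(x)
    filled = np.where(valid, x, 0.0)
    pair_counts = np.outer(valid, valid).astype(float)
    return np.outer(filled, filled), pair_counts, filled, valid.astype(float)

def add(a, b):
    """Element-wise sum of two contribution tuples (associative, so usable in a reduce)."""
    return tuple(u + v for u, v in zip(a, b))

x1 = np.array([1, np.nan, 3, 4, 5])
x2 = np.array([2, 3, 4, np.nan, 6])

S, N, E, NE = reduce(add, map(contributions, [x1, x2]))

with np.errstate(invalid='ignore'):   # 0/0 where no vector has both entries defined
    mean_outer = S / N                # entrywise nan-aware mean of outer products
mean_vec = E / NE                     # entrywise nan-aware mean
cov = mean_outer - np.outer(mean_vec, mean_vec)

print(mean_outer)
```

On an RDD of vectors the same two helpers could, under these assumptions, be used as `data.map(contributions).reduce(add)`. Entries for which no vector has both coordinates defined remain `nan`. Each entry of `mean_outer` and `mean_vec` is averaged over a different subset of vectors, so the result is an approximation rather than an exact covariance.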

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext

#sc.stop()
sc = SparkContext(master="local[3]",pyFiles=['lib/numpy_pack.py','lib/spark_PCA.py','lib/computeStats.py'])

from pyspark.sql import *
sqlContext = SQLContext(sc)

In [2]:
import sys
sys.path.append('./lib')

import numpy as np
from numpy_pack import packArray,unpackArray
from spark_PCA import computeCov
from computeStats import computeOverAllDist, STAT_Descriptions

### Climate data

The data we will use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

There is a large variety of measurements from all over the world, from 1870 to 2012.
In the directory `../../Data/Weather` you will find the following useful files:

* data-source.txt: the source of the data
* ghcnd-readme.txt: A description of the content and format of the data
* ghcnd-stations.txt: A table describing the Meteorological stations.



### Data cleaning

* Most measurements exist only for a tiny fraction of the stations and years. We therefore restrict our use to the following measurements:
```python
['TMAX', 'SNOW', 'SNWD', 'TMIN', 'PRCP', 'TOBS']
```

* We consider only measurement-years that have at most 50 `NaN` entries

* We consider only measurements in the continental USA

* We partition the stations into 256 geographical rectangles, indexed from BBBBBBBB to SSSSSSSS, each containing about 12,000 (station, year) pairs.
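The `NaN`-count rule above can be sketched as a simple predicate on a 365-day vector. The function name `keep_measurement_year` and the test vectors are hypothetical. The files used here were already filtered, so this only illustrates the criterion.

```python
import numpy as np

MAX_UNDEFS = 50  # threshold stated in the list above

def keep_measurement_year(vector, max_undefs=MAX_UNDEFS):
    """Return True if the vector has at most max_undefs nan entries."""
    return int(np.isnan(vector).sum()) <= max_undefs

good = np.full(365, 20.0)
good[:50] = np.nan          # exactly 50 undefined days: kept

bad = np.full(365, 20.0)
bad[:51] = np.nan           # 51 undefined days: rejected

print(keep_measurement_year(good), keep_measurement_year(bad))
```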

In [3]:
#file_index='BBBSBBBB'
file_index='BBBBBSBS'
data_dir='../../Data/Weather'

filebase='US_Weather_%s'%file_index
!rm -rf $data_dir/$filebase*

c_filename=filebase+'.csv.gz'
u_filename=filebase+'.csv'

command="curl https://mas-dse-open.s3.amazonaws.com/Weather/small/%s > %s/%s"%(c_filename,data_dir,c_filename)
print command
!$command
!ls -lh $data_dir/$c_filename

curl https://mas-dse-open.s3.amazonaws.com/Weather/small/US_Weather_BBBBBSBS.csv.gz > ../../Data/Weather/US_Weather_BBBBBSBS.csv.gz


rm: cannot lstat `../../Data/Weather/US_Weather_BBBBBSBS*': Invalid argument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 43 3739k   43 1631k    0     0  1864k      0  0:00:02 --:--:--  0:00:02 1864k
100 3739k  100 3739k    0     0  2918k      0  0:00:01  0:00:01 --:--:-- 2918k


-rw-rw-rw-  1 kdhiman 0 3.7M 2017-05-15 00:24 ../../Data/Weather/US_Weather_BBBBBSBS.csv.gz


In [4]:
!"C:\Program Files\7-Zip\7z.exe" e $data_dir/$c_filename -o$data_dir



7-Zip [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04

Scanning the drive for archives:
1 file, 3828797 bytes (3740 KiB)

Extracting archive: ..\..\Data\Weather\US_Weather_BBBBBSBS.csv.gz
--
Path = ..\..\Data\Weather\US_Weather_BBBBBSBS.csv.gz
Type = gzip
Headers Size = 34

Everything is Ok

Size:       12850086
Compressed: 3828797


In [5]:
#unzip
#!gzip -dq $data_dir/$c_filename > $data_dir/$u_filename
import pickle
List=pickle.load(open(data_dir+'/'+u_filename,'rb'))
len(List)

12471

In [6]:
df=sqlContext.createDataFrame(List)
print df.count()
df.show(5)

12471
+---------+--------+---------+-----------+-----------+------+--------------------+------+--------+
|elevation|latitude|longitude|measurement|    station|undefs|              vector|  year|   label|
+---------+--------+---------+-----------+-----------+------+--------------------+------+--------+
|     27.4| 43.5914| -70.2989|       TMAX|USC00177523|    31|[00 7E 00 7E 00 7...|1997.0|BBBBBSBS|
|    240.8|    43.9|    -71.3|       TMAX|USC00278612|     2|[00 4F 00 4F 00 0...|1975.0|BBBBBSBS|
|    240.8|    43.9|    -71.3|       TMAX|USC00278612|     6|[20 50 00 00 E0 D...|1976.0|BBBBBSBS|
|    240.8|    43.9|    -71.3|       TMAX|USC00278612|     4|[90 D5 E0 D0 00 4...|1977.0|BBBBBSBS|
|    240.8|    43.9|    -71.3|       TMAX|USC00278612|     1|[00 C6 00 D3 00 D...|1978.0|BBBBBSBS|
+---------+--------+---------+-----------+-----------+------+--------------------+------+--------+
only showing top 5 rows
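The byte patterns shown in the `vector` column (for example `00 7E`) appear consistent with little-endian 16-bit floats, where `0x7E00` encodes `NaN`. The sketch below shows one way such packing and unpacking could work. `pack_array` and `unpack_array` are hypothetical stand-ins and may differ from `packArray` and `unpackArray` in `lib/numpy_pack.py`.

```python
import numpy as np

LE_FLOAT16 = np.dtype('<f2')   # little-endian 16-bit float

def pack_array(a):
    """Hypothetical packer: serialize a numpy array to a bytearray."""
    return bytearray(a.tobytes())

def unpack_array(buf, dtype=LE_FLOAT16):
    """Hypothetical unpacker: view a bytearray as an array of the given dtype."""
    return np.frombuffer(buf, dtype=dtype)

raw = bytearray([0x00, 0x7E, 0x00, 0x4F])   # two float16 values: 0x7E00 and 0x4F00
v = unpack_array(raw)
print(v)
```

Storing each day as a 2-byte float keeps a 365-day vector at 730 bytes, at the cost of reduced precision.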



In [7]:
# Filtering
df=df.filter("longitude<-69")

In [8]:
df.describe().show()

+-------+------------------+-------------------+------------------+-----------+-----------+------------------+------------------+--------+
|summary|         elevation|           latitude|         longitude|measurement|    station|            undefs|              year|   label|
+-------+------------------+-------------------+------------------+-----------+-----------+------------------+------------------+--------+
|  count|             12027|              12027|             12027|      12027|      12027|             12027|             12027|   12027|
|   mean|51.142504365178034|  43.25931726116173|-70.99586796374756|       null|       null| 5.796291677059949|1972.3279288268063|    null|
| stddev| 180.6605372374712|0.46366419882061904|0.4902875790511416|       null|       null|10.055189939553868|30.792364092499163|    null|
|    min|            -999.9|            42.6397|            -71.65|       PRCP|US1MAES0003|                 0|            1872.0|BBBBBSBS|
|    max|            1813.9

In [9]:
#store dataframe as parquet file
outfilename=data_dir+'/'+filebase+'.parquet'
!rm -rf $outfilename
df.write.save(outfilename)

In [10]:
# Compare file sizes
!du -sh $data_dir/$filebase*

13M	../../Data/Weather/US_Weather_BBBBBSBS.csv
3.7M	../../Data/Weather/US_Weather_BBBBBSBS.csv.gz
4.5M	../../Data/Weather/US_Weather_BBBBBSBS.parquet


In [11]:
from time import time
t=time()

N=sc.defaultParallelism
print 'Number of executors=',N
print 'took',time()-t,'seconds'

Number of executors= 3
took 0.00899982452393 seconds


In [12]:
measurements=['TMAX', 'SNOW', 'SNWD', 'TMIN', 'PRCP', 'TOBS']

In [13]:
sqlContext.registerDataFrameAsTable(df,'weather') #using older sqlContext instead of newer (V2.0) sparkSession

In [14]:
from numpy import linalg as LA
STAT={}  # dictionary storing the statistics for each measurement
Clean_Tables={}

for meas in measurements:
    t=time()
    Query="SELECT * FROM weather\n\tWHERE measurement = '%s'"%(meas)
    print Query
    df = sqlContext.sql(Query)
    data=df.rdd.map(lambda row: unpackArray(row['vector'],np.float16))
    #get very basic statistics
    STAT[meas]=computeOverAllDist(data)   # Compute the statistics 

    # compute covariance matrix
    OUT=computeCov(data)

    #find PCA decomposition
    eigval,eigvec=LA.eig(OUT['Cov'])

    # collect all of the statistics in STAT[meas]
    STAT[meas]['eigval']=eigval
    STAT[meas]['eigvec']=eigvec
    STAT[meas].update(OUT)

    print 'time for',meas,'is',time()-t

SELECT * FROM weather
	WHERE measurement = 'TMAX'
shape of E= (365L,) shape of NE= (365L,)
time for TMAX is 30.4430000782
SELECT * FROM weather
	WHERE measurement = 'SNOW'
shape of E= (365L,) shape of NE= (365L,)
time for SNOW is 32.0290000439
SELECT * FROM weather
	WHERE measurement = 'SNWD'
shape of E= (365L,) shape of NE= (365L,)
time for SNWD is 26.6519999504
SELECT * FROM weather
	WHERE measurement = 'TMIN'
shape of E= (365L,) shape of NE= (365L,)
time for TMIN is 32.1050000191
SELECT * FROM weather
	WHERE measurement = 'PRCP'
shape of E= (365L,) shape of NE= (365L,)
time for PRCP is 37.5910000801
SELECT * FROM weather
	WHERE measurement = 'TOBS'
shape of E= (365L,) shape of NE= (365L,)
time for TOBS is 20.3989999294
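Once the eigen-decomposition is available, a common next step is to sort the components by eigenvalue and look at the cumulative fraction of variance they explain. The sketch below uses a synthetic covariance matrix as a stand-in for `OUT['Cov']`. It also uses `LA.eigh`, which is meant for symmetric matrices and returns real eigenvalues, as an alternative to the `LA.eig` call in the loop; this is a suggestion, not what the notebook does.

```python
import numpy as np
from numpy import linalg as LA

rng = np.random.default_rng(0)
# Synthetic stand-in for OUT['Cov']: covariance of 500 six-dimensional vectors
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
X = rng.normal(size=(500, 6)) * scales
cov = np.cov(X, rowvar=False, bias=True)

eigval, eigvec = LA.eigh(cov)        # real eigenvalues, ascending order
order = np.argsort(eigval)[::-1]     # reorder to descending
eigval, eigvec = eigval[order], eigvec[:, order]

explained = np.cumsum(eigval) / np.sum(eigval)   # cumulative fraction of variance
print(explained)
```

The columns of `eigvec` are the principal directions, and `explained[k-1]` is the fraction of total variance captured by the top `k` of them. This assumes the covariance matrix contains no `nan` entries.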


In [15]:
from pickle import dump
filename=data_dir+'/STAT_%s.pickle'%file_index
dump((STAT,STAT_Descriptions),open(filename,'wb'))
