### Load your AWS Access Key/Secret

**Never paste your Key/Secret into the notebook. Never print your Key/Secret in the notebook. Never commit your Key/Secret to GitHub.**

In [1]:
import os

def get_secrets():
    """Read the AWS credentials from environment variables; never hard-code them."""
    # A KeyError here means the variable is not set in the notebook's environment.
    return os.environ['AWS_ACCESS_KEY_ID'], os.environ['AWS_SECRET_ACCESS_KEY']

myAccessKey, mySecretKey = get_secrets()
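
The cell above fails with a bare `KeyError` when a variable is unset. As a minimal sketch (the helper name `load_aws_credentials` and its `env` parameter are hypothetical, not part of this notebook), the lookup could instead report which variable names are missing without ever echoing a value:

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def load_aws_credentials(env=None):
    """Return (access_key, secret_key), or raise naming the missing variables.

    `env` is a hypothetical parameter so the function can be exercised with a
    plain dict; by default it reads the real process environment.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Report only the variable names, never their values.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
```

Calling `load_aws_credentials()` with no argument reads `os.environ`, so the secrets still never appear in the notebook source or its output.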

### Set up the Spark Context and AWS S3 credentials

Reference:

[Connecting to PySpark from a Notebook](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#in-a-python-notebook)

[Using PySpark with AWS S3](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/recipes.html#using-pyspark-with-aws-s3)

In [2]:
# This cell can take a while: Spark resolves and downloads the packages listed below on first use
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
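
The `PYSPARK_SUBMIT_ARGS` string in the cell above is easy to mistype. One possible way to keep it readable, sketched here with a hypothetical helper `build_submit_args`, is to assemble it from a list of Maven coordinates (the two coordinates are the ones already used in this notebook):

```python
import os

def build_submit_args(packages):
    """Join Maven coordinates into a PYSPARK_SUBMIT_ARGS value."""
    return "--packages " + ",".join(packages) + " pyspark-shell"

packages = [
    "com.amazonaws:aws-java-sdk:1.10.34",
    "org.apache.hadoop:hadoop-aws:2.6.0",
]
submit_args = build_submit_args(packages)

# Set this before the SparkContext is created; it is read when the JVM starts.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
```

Keeping the coordinates in a list makes it simpler to bump a version or add a package without breaking the comma-separated format.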

### Let's explore the `user` table

In [3]:
user = sqlContext.read.parquet("s3://matters-analytics-dev/ETL/output_pg/user.parquet/")
user.printSchema()

root
 |-- id: long (nullable = true)
 |-- uuid: string (nullable = true)
 |-- user_name: string (nullable = true)
 |-- display_name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- avatar: long (nullable = true)
 |-- email: string (nullable = true)
 |-- email_verified: boolean (nullable = true)
 |-- mobile: string (nullable = true)
 |-- password_hash: string (nullable = true)
 |-- read_speed: integer (nullable = true)
 |-- base_gravity: integer (nullable = true)
 |-- curr_gravity: integer (nullable = true)
 |-- language: string (nullable = true)
 |-- role: string (nullable = true)
 |-- state: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- updated_at: timestamp (nullable = true)



In [4]:
user1 = user.select(user.display_name.alias("displayName"), user.description, user.created_at.alias("registerTime")).cache()

In [5]:
user1.show()

+-----------+-------------------------------------+--------------------+
|displayName|                          description|        registerTime|
+-----------+-------------------------------------+--------------------+
|     test 3|                 test user 3 descr...|2019-01-12 06:57:...|
|     test 4|                                 null|2019-01-12 06:57:...|
|    admin 1|                                 null|2019-01-12 06:57:...|
|    ftggggg|                                 null|2019-01-12 09:34:...|
|       思聰|                                 null|2017-12-04 15:13:...|
|     Edward|                                 null|2017-11-30 06:27:...|
|       佳禾|                                     |2017-12-04 15:27:...|
|     方可成|                                 null|2017-12-04 15:50:...|
|       映昕|                                 null|2017-12-04 15:59:...|
|     黃哲斌|                                 null|2017-12-04 22:12:...|
|       Andy|                           寫東西的人|2017-11-30 05:07:

### Comfortable with Pandas?

In [6]:
userpd = user1.toPandas()

In [7]:
userpd.head()

| | displayName | description | registerTime |
|---|---|---|---|
| 0 | test 3 | test user 3 description | 2019-01-12 06:57:01.254902 |
| 1 | test 4 | | 2019-01-12 06:57:01.254902 |
| 2 | admin 1 | | 2019-01-12 06:57:01.254902 |
| 3 | ftggggg | | 2019-01-12 09:34:50.909537 |
| 4 | 思聰 | | 2017-12-04 15:13:11.655000 |


### Visualize through `plotly`

In [8]:
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)

In [9]:
# Plot a histogram of user registration times
data = [go.Histogram(x=userpd['registerTime'])]

# `iplot` renders an interactive chart inline, so you can zoom, pan, and hover over the bars
plotly.offline.iplot(data, filename='User register time histogram')
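
To make concrete what the histogram counts, here is a rough standard-library sketch that buckets registration times by month. The monthly bucket size is an assumption for illustration (plotly picks its own bins), and the helper `count_by_month` is hypothetical; the timestamps are the five rows shown in the `userpd.head()` output above.

```python
from collections import Counter
from datetime import datetime

# Timestamps copied from the first five rows of userpd shown earlier.
register_times = [
    "2019-01-12 06:57:01.254902",
    "2019-01-12 06:57:01.254902",
    "2019-01-12 06:57:01.254902",
    "2019-01-12 09:34:50.909537",
    "2017-12-04 15:13:11.655000",
]

def count_by_month(timestamps):
    """Count how many timestamps fall in each calendar month (YYYY-MM)."""
    parsed = (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f") for ts in timestamps)
    return Counter(dt.strftime("%Y-%m") for dt in parsed)

monthly = count_by_month(register_times)
for month, count in sorted(monthly.items()):
    print(month, count)
# prints:
# 2017-12 1
# 2019-01 4
```

Each bar in the histogram is a count of this kind, just computed over every row in `registerTime` rather than five.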