## Running pyspark

I assume that you have installed `pyspark` in a way similar to the guide here:

<http://bartek-blog.github.io/python/spark/pyspark/2019/03/27/how-to-install-pyspark.html>

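Then start `pyspark` with the `hadoop-aws` package, which provides the S3 filesystem classes (its version should match the Hadoop version of your Spark build):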
```
pyspark --packages=org.apache.hadoop:hadoop-aws:2.7.3
```

## Code

### Read AWS configuration
For more details on how to configure AWS access, see <http://bartek-blog.github.io/s3/cli/aws/python/boto3/2018/09/10/AWS-CLI-And-S3.html>

In [1]:
import configparser
import os

aws_profile = "myaws"

# Read the access key pair for the chosen profile from the AWS credentials file
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
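One thing to keep in mind is that `ConfigParser.read()` silently skips a file it cannot find, so a wrong path only shows up later as a missing section. A minimal sketch of a helper that fails earlier is shown below; the function name `read_aws_credentials` and its `path` parameter are my own and not part of any AWS library.

```python
import configparser
import os


def read_aws_credentials(profile, path="~/.aws/credentials"):
    """Return (access_key_id, secret_access_key) for the given profile.

    Hypothetical helper: the name and signature are illustrative only.
    """
    config = configparser.ConfigParser()
    # read() returns the list of files it managed to parse; it is empty
    # when the file does not exist or cannot be opened.
    parsed = config.read(os.path.expanduser(path))
    if not parsed:
        raise FileNotFoundError(f"Could not read credentials file: {path}")
    if not config.has_section(profile):
        raise KeyError(f"Profile {profile!r} not found in {path}")
    return (
        config.get(profile, "aws_access_key_id"),
        config.get(profile, "aws_secret_access_key"),
    )
```

With this in place, the cell above could be reduced to `access_id, access_key = read_aws_credentials("myaws")`.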

### Configure Hadoop

In [2]:
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)
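The `s3n` connector used here was removed in Hadoop 3, where `s3a` is the supported connector. The sketch below collects the settings for either scheme in a plain dictionary, so switching is a one-argument change. The helper names `s3_hadoop_settings` and `apply_settings` are hypothetical, and I have not run the `s3a` variant against the bucket in this post, so treat it as an assumption to verify with your own Spark and Hadoop versions.

```python
def s3_hadoop_settings(access_id, access_key, scheme="s3n"):
    """Build the Hadoop properties for the given S3 connector scheme.

    Hypothetical helper; only the property names come from Hadoop itself.
    """
    if scheme == "s3n":
        # Same properties as in the cell above.
        return {
            "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
            "fs.s3n.awsAccessKeyId": access_id,
            "fs.s3n.awsSecretAccessKey": access_key,
        }
    if scheme == "s3a":
        # s3a uses differently named credential properties.
        return {
            "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
            "fs.s3a.access.key": access_id,
            "fs.s3a.secret.key": access_key,
        }
    raise ValueError(f"Unsupported scheme: {scheme!r}")


def apply_settings(hadoop_conf, settings):
    """Copy every key/value pair into an object exposing set(key, value)."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

In the notebook this would be called as `apply_settings(spark._jsc.hadoopConfiguration(), s3_hadoop_settings(access_id, access_key))`. With `scheme="s3a"` the paths would also have to start with `s3a://` instead of `s3n://`.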

### Read data

In [3]:
sdf = spark.read.option("header", "true").csv("s3n://bartek-ml-course/predict_future_sales/sales_train.csv.gz")

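If `pyspark` was started without the `--packages` option above, the read fails because the `NativeS3FileSystem` class from `hadoop-aws` is not on the classpath:

Py4JJavaError: An error occurred while calling o32.csv.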
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:355)
	at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:617)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
	... 30 more


In [4]:
sdf.printSchema()

root
 |-- date: string (nullable = true)
 |-- date_block_num: string (nullable = true)
 |-- shop_id: string (nullable = true)
 |-- item_id: string (nullable = true)
 |-- item_price: string (nullable = true)
 |-- item_cnt_day: string (nullable = true)



### Write data

In [17]:
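import pyspark.sql.functions as F
# Sum the items sold per day; item_cnt_day was read as a string, so cast it explicitly.
# repartition(1) makes Spark write the result as a single Parquet file.
sdf.groupBy("date").agg(F.sum(F.col('item_cnt_day').cast("double")).alias("items"))\
    .repartition(1)\
    .write.mode("overwrite")\
    .parquet("s3n://bartek-ml-course/predict_future_sales-aggregations/daily-total-sales")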

In [18]:
!aws s3 ls s3://bartek-ml-course/predict_future_sales-aggregations/daily-total-sales/ --profile=myaws

2019-04-23 02:16:49          0 _SUCCESS
2019-04-23 02:16:46      10013 part-00000-30d58a19-e3ca-4e34-8bac-19a4d767079d-c000.snappy.parquet


In [19]:
spark.read.parquet("s3n://bartek-ml-course/predict_future_sales-aggregations/daily-total-sales").show()

+----------+------+
|      date| items|
+----------+------+
|16.02.2013|6643.0|
|09.02.2014|4646.0|
|01.09.2014|2887.0|
|18.10.2014|5001.0|
|27.06.2015|2563.0|
|17.09.2015|1887.0|
|29.04.2013|2771.0|
|12.04.2013|3947.0|
|18.09.2014|2441.0|
|15.08.2015|2201.0|
|28.10.2015|3593.0|
|05.02.2013|3302.0|
|21.09.2013|6698.0|
|31.05.2014|5395.0|
|02.11.2014|4390.0|
|08.07.2015|1905.0|
|13.09.2015|2660.0|
|06.10.2015|1343.0|
|13.06.2013|3399.0|
|22.02.2014|8472.0|
+----------+------+
only showing top 20 rows
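The rows come back in no particular order, and because `date` is a `dd.mm.yyyy` string, sorting it as text would not give chronological order. A small sketch of sorting a few collected rows on the driver is shown below. The function name `sort_by_date` is my own; it assumes each row is a `(date, items)` pair, which also holds for the `Row` objects returned by `.collect()` since they behave like tuples.

```python
from datetime import datetime


def sort_by_date(rows):
    """Sort (date_string, items) pairs chronologically.

    Hypothetical helper; dates are expected in the dd.mm.yyyy format.
    """
    return sorted(rows, key=lambda row: datetime.strptime(row[0], "%d.%m.%Y"))


# Three rows copied from the output above.
rows = [("16.02.2013", 6643.0), ("09.02.2014", 4646.0), ("05.02.2013", 3302.0)]
for date, items in sort_by_date(rows):
    print(date, items)
```

For the full dataset it is usually better to do the parsing and ordering in Spark itself before collecting anything.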

