### RDDs
There are three ways to create RDDs:<br>
1 - Parallelized collections<br>
2 - External sources - AWS S3, HDFS, Hive, txt, csv, etc.<br>
3 - Existing RDDs - applying a transformation to an existing RDD returns a new RDD.

In [1]:
# 1st way: parallelize an in-memory Python collection
normal_py_num_list = [10,20,30,40,50,60,70,10,90,80,40]
paralleize_num_rdd = sc.parallelize(normal_py_num_list)
print("Numeric RDD {}".format(paralleize_num_rdd.collect()))
print(type(paralleize_num_rdd))
#-----------------------------------------------------------------#
normal_py_string_list = ['Apache','pyspark','python','java','pop','fan','bottle']
paralleize_string_rdd = sc.parallelize(normal_py_string_list)
print("string RDD {}".format(paralleize_string_rdd.collect()))
print(type(paralleize_string_rdd))

Numeric RDD [10, 20, 30, 40, 50, 60, 70, 10, 90, 80, 40]
<class 'pyspark.rdd.RDD'>
string RDD ['Apache', 'pyspark', 'python', 'java', 'pop', 'fan', 'bottle']
<class 'pyspark.rdd.RDD'>


In [2]:
# 2nd way to create an RDD: read from an external source.
# sc.addFile("D:\\DataEngineering_Learnings\\Week_1_Task\\Week_1_task_requirements.txt")
input_txt_file = sc.textFile("D:\\DataEngineering_Learnings\\Week_1_Task\\Week_1_task_requirements.txt")
input_txt_file.collect()

['Week 1: (Make Note on google drive and repo in bitbucket for source code)',
 'What is Big Data',
 'What is role of Data Engineer',
 'What Spark and Why Spark',
 'PySpark - https://www.tutorialspoint.com/pyspark/index.htm (This is minimum you have to go through. You can even go through any other tutorials on youtube to get understanding of pyspark and Spark) ',
 'Understanding Spark architecture and  try to  relate to what you have learned.',
 'Some practice running spark programs ']

In [3]:
# 3rd way to create an RDD: apply a transformation to an existing RDD, which returns a new RDD
# Keep only the lines that contain the keyword 'Spark'
new_transformed_rdd = input_txt_file.filter(lambda x: 'Spark' in x)
new_transformed_rdd.collect()

['What Spark and Why Spark',
 'PySpark - https://www.tutorialspoint.com/pyspark/index.htm (This is minimum you have to go through. You can even go through any other tutorials on youtube to get understanding of pyspark and Spark) ',
 'Understanding Spark architecture and  try to  relate to what you have learned.']

### Operations on RDDs
1 - Action<br>
2 - Transformation
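The difference between the two kinds of operation is when the work happens: a transformation only describes a new RDD, while an action triggers the computation and returns a result to the driver. The sketch below is a plain-Python analogy, not Spark code. It assumes generators as a stand-in for lazy transformations, and the names `double`, `calls`, `doubled` and `large` are made up for the illustration.

```python
# Plain-Python analogy (not Spark): generators stand in for lazy transformations.
calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


data = [1, 2, 3, 4]

# "Transformations": building the pipeline does not run double() yet.
doubled = (double(x) for x in data)
large = (x for x in doubled if x > 4)
calls_before_action = list(calls)  # still empty at this point

# "Action": materializing the result pulls the data through the whole pipeline.
result = list(large)

print(calls_before_action)
print(result)
```

In the same way, `filter()` and `map()` on an RDD return immediately, and the job only runs when an action such as `collect()` or `count()` is called.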

In [4]:
# Action methods

# collect() - Retrieves all elements from the worker nodes to the driver program; returns a list
print("Fetching Data using collect() action -\n",input_txt_file.collect())

# take(n) - Retrieves the first n elements of the RDD; returns a list
print("\nFetching data using take() action -\n",input_txt_file.take(2))

# count() - Returns the number of elements in the RDD
print("\nCount of lines - \n",input_txt_file.count())

# first() - Returns the first element of the RDD
print("\nFirst line from file - \n",input_txt_file.first())

# reduce() - Aggregates the elements by repeatedly applying a two-argument function to the running result and the next element
# e.g. the sum of [1,2,3,4,5] = 15 can be computed with reduce()
l1 = [1,2,3,4,5]
print("\nReduce method example - \n",sc.parallelize(l1).reduce(lambda x,y:x+y))

Fetching Data using collect() action -
 ['Week 1: (Make Note on google drive and repo in bitbucket for source code)', 'What is Big Data', 'What is role of Data Engineer', 'What Spark and Why Spark', 'PySpark - https://www.tutorialspoint.com/pyspark/index.htm (This is minimum you have to go through. You can even go through any other tutorials on youtube to get understanding of pyspark and Spark) ', 'Understanding Spark architecture and  try to  relate to what you have learned.', 'Some practice running spark programs ']

Fetching data using take() action -
 ['Week 1: (Make Note on google drive and repo in bitbucket for source code)', 'What is Big Data']

Count of lines - 
 7

First line from file - 
 Week 1: (Make Note on google drive and repo in bitbucket for source code)

Reduce method example - 
 15
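Spark's `reduce()` expects a commutative and associative function, because each partition is reduced on its own and the partial results are combined afterwards. The following is a minimal plain-Python sketch of that idea, not Spark code: `partitioned_reduce` is a hypothetical helper, and splitting the list into contiguous chunks is an assumption made for the illustration.

```python
from functools import reduce
import operator


def partitioned_reduce(values, func, num_partitions):
    """Reduce each chunk separately, then reduce the partial results."""
    size = -(-len(values) // num_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)


l1 = [1, 2, 3, 4, 5]

# Addition is commutative and associative: the partitioning does not matter.
sum_one = partitioned_reduce(l1, operator.add, 1)
sum_two = partitioned_reduce(l1, operator.add, 2)

# Subtraction is neither: the answer depends on how the data is split.
diff_one = partitioned_reduce(l1, operator.sub, 1)
diff_two = partitioned_reduce(l1, operator.sub, 2)

print(sum_one, sum_two)
print(diff_one, diff_two)
```

With addition both calls give 15, while subtraction gives -13 for one chunk and -3 for two, which is why a function like subtraction is not safe to pass to `reduce()` on an RDD.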


In [5]:
# Transformation methods

# map() - Applies a function to each element of the RDD. In the example below, every line of the input file is split on whitespace
def split_lines(lines):
    return lines.split()
splitted_rdd = input_txt_file.map(split_lines)
print(splitted_rdd.collect())

# flatMap() - Applies a function to each element and flattens the results into a single 1-D RDD
flatten_array = input_txt_file.flatMap(split_lines)
print("\nFlattened array using flatMap() - \n",flatten_array.take(10))

# filter() - Keeps only the elements for which the predicate returns True
stop_word = ['a','an','and','the','with','on','to','why','go','in','for','you']
filtered_words = input_txt_file.flatMap(split_lines).filter(lambda x: x not in stop_word)
print("\nFiltered stop words using Filter() - \n",filtered_words.collect())

# filter() - Same method, this time keeping only the words that start with 'S'
filtered_words1 = input_txt_file.flatMap(split_lines).filter(lambda x: x.startswith('S'))
print("\nWords starting with 'S' using Filter() - \n",filtered_words1.collect())
print()
# Word count: pair each word with 1, group by word, sum the counts, then sort by count in descending order
mapped_rdd = filtered_words.map(lambda x: (x,1))
grouped_rdd = mapped_rdd.groupByKey()
word_count = grouped_rdd.mapValues(sum).map(lambda x:(x[1],x[0])).sortByKey(False)
print(word_count.take(10))

# distinct() - Removes duplicate elements
print("\nUnique word count before removing stop words - ",flatten_array.distinct().count())
print("\nUnique Word count after removing stop words - ",filtered_words.distinct().count())

[['Week', '1:', '(Make', 'Note', 'on', 'google', 'drive', 'and', 'repo', 'in', 'bitbucket', 'for', 'source', 'code)'], ['What', 'is', 'Big', 'Data'], ['What', 'is', 'role', 'of', 'Data', 'Engineer'], ['What', 'Spark', 'and', 'Why', 'Spark'], ['PySpark', '-', 'https://www.tutorialspoint.com/pyspark/index.htm', '(This', 'is', 'minimum', 'you', 'have', 'to', 'go', 'through.', 'You', 'can', 'even', 'go', 'through', 'any', 'other', 'tutorials', 'on', 'youtube', 'to', 'get', 'understanding', 'of', 'pyspark', 'and', 'Spark)'], ['Understanding', 'Spark', 'architecture', 'and', 'try', 'to', 'relate', 'to', 'what', 'you', 'have', 'learned.'], ['Some', 'practice', 'running', 'spark', 'programs']]

Flattened array using flatMap() - 
 ['Week', '1:', '(Make', 'Note', 'on', 'google', 'drive', 'and', 'repo', 'in']

Filtered stop words using Filter() - 
 ['Week', '1:', '(Make', 'Note', 'google', 'drive', 'repo', 'bitbucket', 'source', 'code)', 'What', 'is', 'Big', 'Data', 'What', 'is', 'role', 'of', '
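
The difference between `map()` and `flatMap()` in the cell above is only in the shape of the result: `map()` keeps one output element per input line (a list of word lists), while `flatMap()` concatenates those lists into a single sequence of words. A small plain-Python analogy, not Spark code, using two lines from the input file as sample data:

```python
from itertools import chain

lines = ['What is Big Data', 'What Spark and Why Spark']

# map-like: one result per input line, so the output is a list of lists.
mapped = [line.split() for line in lines]

# flatMap-like: the per-line results are chained into one flat list.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(len(mapped), mapped)
print(len(flat_mapped), flat_mapped)
```

Here `mapped` has 2 elements (one per line) and `flat_mapped` has 9 (one per word), which mirrors the nested and flattened outputs printed by the cell.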

### Some more functions

In [6]:
rdd1 = sc.parallelize([('a',2),('b',10)])
rdd2 = sc.parallelize([('a',4),('b',20),('c',30)])
rdd1.join(rdd2).collect()

[('a', (2, 4)), ('b', (10, 20))]
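
`join()` on pair RDDs is an inner join on the key, which is why `'c'` is missing from the result: it appears only in `rdd2`. The sketch below reproduces that behaviour in plain Python, not Spark. `inner_join`, `rdd1_data` and `rdd2_data` are hypothetical names for this illustration, and unlike this sketch, Spark does not guarantee the order of the joined output.

```python
from collections import defaultdict


def inner_join(left, right):
    """Pair up values from two (key, value) lists that share the same key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Keys that are missing on either side produce no output rows.
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


rdd1_data = [('a', 2), ('b', 10)]
rdd2_data = [('a', 4), ('b', 20), ('c', 30)]

joined = inner_join(rdd1_data, rdd2_data)
print(joined)
```

To keep unmatched keys, pair RDDs also offer `leftOuterJoin()`, `rightOuterJoin()` and `fullOuterJoin()`, which fill the missing side with `None`.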

### DataFrames

In [83]:
titanic_df = spark.read.csv("titanic/train.csv",inferSchema=True,header=True)

In [84]:
titanic_df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [85]:
titanic_df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [87]:
titanic_df.count()

Py4JJavaError: An error occurred while calling o2316.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 155.0 failed 1 times, most recent failure: Lost task 0.0 in stage 155.0 (TID 248, host.docker.internal, executor driver): java.io.FileNotFoundException: C:\Users\user\AppData\Local\Temp\blockmgr-c9271ab8-bd66-4c71-9c6d-bb772c345760\2c\temp_shuffle_348d2f03-3d5b-4166-a02a-501623a5cdbb (The system cannot find the path specified)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:105)
	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
	at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:2981)
	at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:2980)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2980)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: C:\Users\user\AppData\Local\Temp\blockmgr-c9271ab8-bd66-4c71-9c6d-bb772c345760\2c\temp_shuffle_348d2f03-3d5b-4166-a02a-501623a5cdbb (The system cannot find the path specified)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:105)
	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
