### RDDs
There are 3 ways to create RDDs:<br>
1 - Parallelized collections<br>
2 - External sources - AWS S3, HDFS, Hive, txt, csv etc.<br>
3 - Existing RDDs: applying a transformation to an RDD returns a new RDD.

In [1]:
# 1st way, using a parallelized collection
normal_py_num_list = [10,20,30,40,50,60,70,10,90,80,40]
paralleize_num_rdd = sc.parallelize(normal_py_num_list)
print("Numeric RDD {}".format(paralleize_num_rdd.collect()))
print(type(paralleize_num_rdd))
#-----------------------------------------------------------------#
normal_py_string_list = ['Apache','pyspark','python','java','pop','fan','bottle']
paralleize_string_rdd = sc.parallelize(normal_py_string_list)
print("string RDD {}".format(paralleize_string_rdd.collect()))
print(type(paralleize_string_rdd))

Numeric RDD [10, 20, 30, 40, 50, 60, 70, 10, 90, 80, 40]
<class 'pyspark.rdd.RDD'>
string RDD ['Apache', 'pyspark', 'python', 'java', 'pop', 'fan', 'bottle']
<class 'pyspark.rdd.RDD'>


In [2]:
# 2nd way to create RDD, using external sources.
# sc.addFile("D:\\DataEngineering_Learnings\\Week_1_Task\\Week_1_task_requirements.txt")
input_txt_file = sc.textFile("D:\\DataEngineering_Learnings\\Week_1_Task\\Week_1_task_requirements.txt")
input_txt_file.collect()

['Week 1: (Make Note on google drive and repo in bitbucket for source code)',
 'What is Big Data',
 'What is role of Data Engineer',
 'What Spark and Why Spark',
 'PySpark - https://www.tutorialspoint.com/pyspark/index.htm (This is minimum you have to go through. You can even go through any other tutorials on youtube to get understanding of pyspark and Spark) ',
 'Understanding Spark architecture and  try to  relate to what you have learned.',
 'Some practice running spark programs ']

In [3]:
# 3rd way to create an RDD: apply a transformation to an existing RDD, which returns a new RDD
# Keep only the lines that contain the keyword 'Spark'
new_transformed_rdd = input_txt_file.filter(lambda x: 'Spark' in x)
new_transformed_rdd.collect()

['What Spark and Why Spark',
 'PySpark - https://www.tutorialspoint.com/pyspark/index.htm (This is minimum you have to go through. You can even go through any other tutorials on youtube to get understanding of pyspark and Spark) ',
 'Understanding Spark architecture and  try to  relate to what you have learned.']

### Operations on RDDs
1 - Action<br>
2 - Transformation
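
The practical difference between the two is when the work happens: a transformation only describes what should be done and returns a new RDD, while an action triggers the computation and returns a result to the driver. The sketch below is a plain-Python analogy built on generators, not Spark code; `lazy_map`, `lazy_filter` and the `log` list are names invented for illustration.

```python
# Plain-Python analogy for lazy transformations vs. eager actions (no Spark needed).
log = []  # records when elements are actually processed

def lazy_map(func, items):
    """Like a transformation: returns a generator and does no work yet."""
    for item in items:
        log.append(f"map({item})")
        yield func(item)

def lazy_filter(pred, items):
    """Also lazy: elements are tested only when something consumes them."""
    for item in items:
        if pred(item):
            yield item

numbers = [1, 2, 3, 4, 5]
pipeline = lazy_filter(lambda x: x > 4, lazy_map(lambda x: x * 2, numbers))
work_before_action = len(log)   # nothing has been computed so far

result = list(pipeline)         # plays the role of collect(): forces the pipeline to run
work_after_action = len(log)

print(work_before_action, work_after_action, result)
```

Building `pipeline` costs nothing; only the final `list(...)` call, which stands in for an action such as `collect()`, makes the elements flow through the chain.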

In [4]:
# Action methods

# collect() - Retrieves the data from all worker nodes to the driver program; it returns a list
print("Fetching Data using collect() action -\n",input_txt_file.collect())

# take(n) - Retrieves the first n elements; it returns a list
print("\nFetching data using take() action -\n",input_txt_file.take(2))

# count() - Returns the number of elements in the RDD
print("\nCount of lines - \n",input_txt_file.count())

# first() - Returns the first element of the RDD
print("\nFirst line from file - \n",input_txt_file.first())

# reduce() - Combines the elements pairwise with the given function until a single value is left
# e.g. the sum of [1,2,3,4,5] = 15 can be computed with reduce
l1 = [1,2,3,4,5]
print("\nReduce method example - \n",sc.parallelize(l1).reduce(lambda x,y:x+y))

Fetching Data using collect() action -
 ['Week 1: (Make Note on google drive and repo in bitbucket for source code)', 'What is Big Data', 'What is role of Data Engineer', 'What Spark and Why Spark', 'PySpark - https://www.tutorialspoint.com/pyspark/index.htm (This is minimum you have to go through. You can even go through any other tutorials on youtube to get understanding of pyspark and Spark) ', 'Understanding Spark architecture and  try to  relate to what you have learned.', 'Some practice running spark programs ']

Fetching data using take() action -
 ['Week 1: (Make Note on google drive and repo in bitbucket for source code)', 'What is Big Data']

Count of lines - 
 7

First line from file - 
 Week 1: (Make Note on google drive and repo in bitbucket for source code)

Reduce method example - 
 15
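
`reduce()` combines the elements inside each partition and then merges the partial results, so the function passed to it should be commutative and associative. The plain-Python sketch below imitates that behaviour with `functools.reduce`; `simulate_rdd_reduce` and the hand-picked partition layouts are illustrative assumptions, not Spark internals.

```python
from functools import reduce

def simulate_rdd_reduce(func, partitions):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

layout_a = [[1, 2], [3, 4, 5]]      # one possible split into two partitions
layout_b = [[1], [2, 3], [4, 5]]    # another possible split into three

add = lambda x, y: x + y
sub = lambda x, y: x - y

# Addition is commutative and associative: every layout gives the same answer.
sum_a = simulate_rdd_reduce(add, layout_a)
sum_b = simulate_rdd_reduce(add, layout_b)

# Subtraction is neither, so the result depends on how the data was split.
diff_a = simulate_rdd_reduce(sub, layout_a)
diff_b = simulate_rdd_reduce(sub, layout_b)

print(sum_a, sum_b, diff_a, diff_b)
```

This is why sums, products, `min` and `max` are safe choices for `reduce()`, whereas subtraction or division can give different answers from one run to the next.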


In [5]:
# Transformation methods

# map() - Applies a function to each element of the RDD. In the example below, every line of the input file is split on whitespace
def split_lines(lines):
    return lines.split()
splitted_rdd = input_txt_file.map(split_lines)
print(splitted_rdd.collect())

# flatMap() - Applies a function to each element and flattens the results into a single 1-D RDD
flatten_array = input_txt_file.flatMap(split_lines)
print("\nFlattened array using flaptmap() - \n",flatten_array.take(10))

# filter() - Keeps only the elements for which the predicate returns True
stop_word = ['a','an','and','the','with','on','to','why','go','in','for','you']
filtered_words = input_txt_file.flatMap(split_lines).filter(lambda x: x not in stop_word)
print("\nFiltered stop words using Filter() - \n",filtered_words.collect())

# filter() again, this time keeping only the words that start with 'S'
filtered_words1 = input_txt_file.flatMap(split_lines).filter(lambda x: x.startswith('S'))
print("\nFilter words starts with 'S' using Filter() - \n",filtered_words1.collect())
print()
# Word count: pair each word with 1, group by word, sum the counts, then sort by count (descending)
mapped_rdd = filtered_words.map(lambda x: (x,1))
grouped_rdd = mapped_rdd.groupByKey()
word_count = grouped_rdd.mapValues(sum).map(lambda x:(x[1],x[0])).sortByKey(False)
print(word_count.take(10))

# distinct() - Removes duplicate elements
print("\nUnique Word count before removing stop words - ",flatten_array.distinct().count())
print("\nUnique Word count after removing stop words - ",filtered_words.distinct().count())

[['Week', '1:', '(Make', 'Note', 'on', 'google', 'drive', 'and', 'repo', 'in', 'bitbucket', 'for', 'source', 'code)'], ['What', 'is', 'Big', 'Data'], ['What', 'is', 'role', 'of', 'Data', 'Engineer'], ['What', 'Spark', 'and', 'Why', 'Spark'], ['PySpark', '-', 'https://www.tutorialspoint.com/pyspark/index.htm', '(This', 'is', 'minimum', 'you', 'have', 'to', 'go', 'through.', 'You', 'can', 'even', 'go', 'through', 'any', 'other', 'tutorials', 'on', 'youtube', 'to', 'get', 'understanding', 'of', 'pyspark', 'and', 'Spark)'], ['Understanding', 'Spark', 'architecture', 'and', 'try', 'to', 'relate', 'to', 'what', 'you', 'have', 'learned.'], ['Some', 'practice', 'running', 'spark', 'programs']]

Flattened array using flaptmap() - 
 ['Week', '1:', '(Make', 'Note', 'on', 'google', 'drive', 'and', 'repo', 'in']

Filtered stop words using Filter() - 
 ['Week', '1:', '(Make', 'Note', 'google', 'drive', 'repo', 'bitbucket', 'source', 'code)', 'What', 'is', 'Big', 'Data', 'What', 'is', 'role', 'of', '
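
The difference between `map()` and `flatMap()` in the cell above can be reproduced in plain Python: `map()` keeps one output element per input line (a list of lists), while `flatMap()` concatenates the per-line results into one flat sequence. This is a sketch of the semantics only, using two sample lines from the file; it does not use Spark.

```python
from itertools import chain

lines = ["What is Big Data", "What Spark and Why Spark"]
stop_word = ['a', 'an', 'and', 'the', 'with', 'on', 'to', 'why', 'go', 'in', 'for', 'you']

def split_lines(line):
    return line.split()

# map(): one result per input element, so the nesting is preserved.
mapped = [split_lines(line) for line in lines]

# flatMap(): apply the function, then flatten one level.
flat_mapped = list(chain.from_iterable(split_lines(line) for line in lines))

# filter(): the predicate returns a boolean; matching is case-sensitive,
# so 'Why' survives even though 'why' is in the stop-word list.
filtered = [word for word in flat_mapped if word not in stop_word]

print(mapped)
print(flat_mapped)
print(filtered)
```

Lower-casing each word before the membership test would be one way to make the stop-word filter case-insensitive.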

### Some more functions

In [6]:
rdd1 = sc.parallelize([('a',2),('b',10)])
rdd2 = sc.parallelize([('a',4),('b',20),('c',30)])
rdd1.join(rdd2).collect()

[('a', (2, 4)), ('b', (10, 20))]
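
`join()` is an inner join on the key, which is why `'c'` is absent from the result: it has no match in `rdd1`. The plain-Python sketch below models that behaviour and contrasts it with a left outer join (the RDD API offers `leftOuterJoin()` for that case). `inner_join` and `left_outer_join` are hypothetical helper names, and real Spark joins work on distributed partitions rather than local lists.

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair up values from both sides for every key present in both."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

def left_outer_join(left, right):
    """Keep every left-side pair; use None when the right side has no match."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [None])]

pairs1 = [('a', 2), ('b', 10)]
pairs2 = [('a', 4), ('b', 20), ('c', 30)]

inner = inner_join(pairs1, pairs2)
outer = left_outer_join(pairs2, pairs1)
print(inner)
print(outer)
```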

### DataFrames

In [7]:
titanic_df = spark.read.csv("titanic/train.csv",inferSchema=True,header=True)

In [8]:
titanic_df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [9]:
titanic_df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [10]:
titanic_df.count()

891

In [11]:
titanic_df.select('Survived','Sex').show()

+--------+------+
|Survived|   Sex|
+--------+------+
|       0|  male|
|       1|female|
|       1|female|
|       1|female|
|       0|  male|
|       0|  male|
|       0|  male|
|       0|  male|
|       1|female|
|       1|female|
|       1|female|
|       1|female|
|       0|  male|
|       0|  male|
|       0|female|
|       1|female|
|       0|  male|
|       1|  male|
|       0|female|
|       1|female|
+--------+------+
only showing top 20 rows



In [14]:
titanic_df.describe('Age').show()

+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|               714|
|   mean| 29.69911764705882|
| stddev|14.526497332334035|
|    min|              0.42|
|    max|              80.0|
+-------+------------------+
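
The `count` of 714 is lower than the 891 rows reported by `titanic_df.count()` because `describe()` skips null values in the column. The sketch below mirrors that calculation in plain Python on a handful of made-up ages; it uses the sample standard deviation, which is what Spark's `stddev` computes. `describe_column` is a hypothetical helper, not part of PySpark.

```python
import statistics

def describe_column(values):
    """Summary statistics over the non-null entries of a column."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample standard deviation (n - 1)
        "min": min(present),
        "max": max(present),
    }

ages = [22.0, 38.0, None, 26.0]  # made-up sample with one missing value
summary = describe_column(ages)
print(summary)
```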



In [23]:
titanic_df.filter(titanic_df.Age==40).show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------+--------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|    Ticket|    Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------+--------+-----+--------+
|         31|       0|     1|Uruchurtu, Don. M...|  male|40.0|    0|    0|  PC 17601| 27.7208| null|       C|
|         41|       0|     3|Ahlin, Mrs. Johan...|female|40.0|    1|    0|      7546|   9.475| null|       S|
|        162|       1|     2|"Watt, Mrs. James...|female|40.0|    0|    0|C.A. 33595|   15.75| null|       S|
|        189|       0|     3|    Bourke, Mr. John|  male|40.0|    1|    1|    364849|    15.5| null|       Q|
|        210|       1|     1|    Blank, Mr. Henry|  male|40.0|    0|    0|    112277|    31.0|  A31|       C|
|        264|       0|     1|Harrison, Mr. Wil...|  male|40.0|    0|    0|    112059|     0.0|  B94|       S|
|        3

In [35]:
titanic_df.where((titanic_df.Age > 40) & (titanic_df.Survived==1)).show(10)

+-----------+--------+------+--------------------+------+----+-----+-----+--------+--------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|  Ticket|    Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+--------+--------+-----+--------+
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|  113783|   26.55| C103|       S|
|         16|       1|     2|Hewlett, Mrs. (Ma...|female|55.0|    0|    0|  248706|    16.0| null|       S|
|         53|       1|     1|Harper, Mrs. Henr...|female|49.0|    1|    0|PC 17572| 76.7292|  D33|       C|
|        188|       1|     1|"Romaine, Mr. Cha...|  male|45.0|    0|    0|  111428|   26.55| null|       S|
|        195|       1|     1|Brown, Mrs. James...|female|44.0|    0|    0|PC 17610| 27.7208|   B4|       C|
|        196|       1|     1|Lurette, Miss. Elise|female|58.0|    0|    0|PC 17569|146.5208|  B80|       C|
|        260|       1|     2

### Creating a temp table so that the DataFrame can be queried with SQL

In [36]:
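# Register the DataFrame as a temporary view named titanic_sql
# (registerTempTable is deprecated since Spark 2.0 in favour of createOrReplaceTempView)
titanic_df.createOrReplaceTempView("titanic_sql")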

In [37]:
sqlContext.sql("select * from titanic_sql").show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [41]:
sqlContext.sql("select * from titanic_sql where Age >= 40 and Survived=1").show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------+--------+-------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|    Ticket|    Fare|  Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------+--------+-------+--------+
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|    113783|   26.55|   C103|       S|
|         16|       1|     2|Hewlett, Mrs. (Ma...|female|55.0|    0|    0|    248706|    16.0|   null|       S|
|         53|       1|     1|Harper, Mrs. Henr...|female|49.0|    1|    0|  PC 17572| 76.7292|    D33|       C|
|        162|       1|     2|"Watt, Mrs. James...|female|40.0|    0|    0|C.A. 33595|   15.75|   null|       S|
|        188|       1|     1|"Romaine, Mr. Cha...|  male|45.0|    0|    0|    111428|   26.55|   null|       S|
|        195|       1|     1|Brown, Mrs. James...|female|44.0|    0|    0|  PC 17610| 27.7208|     B4|  

In [42]:
sqlContext.sql("select count(*) from titanic_sql where Age >= 40 and Survived=1").show()

+--------+
|count(1)|
+--------+
|      61|
+--------+



### Connecting to a PostgreSQL database and reading/writing data

In [1]:
# Note - pyspark was launched with the command below
# pyspark --driver-class-path .\postgresql-42.2.18.jar --jars .\postgresql-42.2.18.jar

from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', "postgresql-42.2.18.jar").getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/postgres'
properties = {'user': 'postgres', 'password': 'n0ob007'}
df = spark.read.jdbc(url=url, table='active_users', properties=properties)

In [2]:
df.show()

+-------+-------+-------------+----------+-----+------+
|user_id|country|     platform|  date_utc|years|months|
+-------+-------+-------------+----------+-----+------+
|      1|Android|           PE|01-07-2020| 2020|     7|
|      2|    iOS|United States|01-07-2020| 2020|     7|
|      3|    iOS|United States|01-07-2020| 2020|     7|
|      4|Android|           PT|01-07-2020| 2020|     7|
|      5|    iOS|United States|01-07-2020| 2020|     7|
|      6|    iOS|United States|01-07-2020| 2020|     7|
|      7|    iOS|United States|01-07-2020| 2020|     7|
|      8|    iOS|United States|01-07-2020| 2020|     7|
|      9|    iOS|United States|01-07-2020| 2020|     7|
|     10|    iOS|United States|01-07-2020| 2020|     7|
|     11|Android|           US|01-07-2020| 2020|     7|
|     12|    iOS|United States|01-07-2020| 2020|     7|
|     13|    iOS|United States|01-07-2020| 2020|     7|
|     14|    iOS|United States|01-07-2020| 2020|     7|
|     15|    iOS|United States|01-07-2020| 2020|

In [3]:
df.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- platform: string (nullable = true)
 |-- date_utc: string (nullable = true)
 |-- years: integer (nullable = true)
 |-- months: integer (nullable = true)



In [21]:
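import json
# `implicits` and `createDataset` belong to the Scala API and do not exist in PySpark.
# Here the JSON record is read by passing an RDD of JSON strings to spark.read.json.
json_record = json.dumps({"user_id": 150,
       "country": "Android",
       "platform": "United States",
       "date_utc": "01-07-2020",
       "years": 2020,
       "months": 2})
a = spark.read.json(spark.sparkContext.parallelize([json_record]))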

In [23]:
json_data_df = spark.read.json("test_data.json")

In [28]:
json_data_df.createOrReplaceTempView("json_data_sql")

In [29]:
# json_data_sql is the name of a temp view, not a Python variable, so query it through spark.sql
spark.sql("select * from json_data_sql").show()

### Creating a DataFrame in pandas, converting it to a PySpark DataFrame and inserting it into PostgreSQL

In [30]:
data = {'user_id':[150],
       'country':["Android"],
       'platform':['United States'],
       'date_utc':['01-07-2020'],
       'years':[2020],
       'months':[2]}

In [33]:
import pandas as pd
pandas_df = pd.DataFrame(data)

In [34]:
pyspark_df = spark.createDataFrame(pandas_df)

In [37]:
pyspark_df.show()

+-------+-------+-------------+----------+-----+------+
|user_id|country|     platform|  date_utc|years|months|
+-------+-------+-------------+----------+-----+------+
|    150|Android|United States|01-07-2020| 2020|     2|
+-------+-------+-------------+----------+-----+------+



In [39]:
type(pyspark_df)

pyspark.sql.dataframe.DataFrame

In [None]:
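# mode("overwrite") with truncate=true empties active_users before writing;
# use mode("append") instead to add the row to the existing data.
pyspark_df.write\
        .format("jdbc")\
        .mode("overwrite")\
        .option("truncate","true")\
        .option("url",url)\
        .option("dbtable","active_users")\
        .option("user","postgres")\
        .option("password","n0ob007")\
        .save()  # without save() the write is only configured, never executed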
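
A JDBC write only runs once `.save()` is called on the writer. The sketch below gathers the connection settings in one place and reads the password from an environment variable instead of hard-coding it. `build_jdbc_options`, `write_to_postgres` and the `PG_PASSWORD` variable name are assumptions made for illustration; `org.postgresql.Driver` is the class name of the PostgreSQL JDBC driver.

```python
import os

def build_jdbc_options(url, table, user, password_env="PG_PASSWORD"):
    """Collect JDBC options; the password comes from the environment."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": os.environ.get(password_env, ""),
        "driver": "org.postgresql.Driver",
    }

def write_to_postgres(df, options, mode="append"):
    """Write a Spark DataFrame over JDBC; save() is what triggers the write."""
    df.write.format("jdbc").mode(mode).options(**options).save()

options = build_jdbc_options("jdbc:postgresql://127.0.0.1/postgres", "active_users", "postgres")
print(sorted(options))
# With a live session: write_to_postgres(pyspark_df, options)
```

The helper defaults to `mode="append"` so that a single-row insert does not wipe the existing table; pass `mode="overwrite"` explicitly when replacing the contents is intended.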

In [40]:
url

'jdbc:postgresql://127.0.0.1/postgres'