    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f
    import os

    spark =  SparkSession.builder \
         .master('yarn') \
         .appName("Pyspark Interview Practises") \
         .config("spark.ui.port","56236") \
         .getOrCreate()

    sc = spark.sparkContext

##### What is the difference between map and flatMap and a good use case for each?
    rdd = sc.parallelize(["Roses are red", "Violets are blue"])
###### map returns one result per input element, so splitting each line gives a list of lists (two-dimensional)
    rdd.map( lambda x:x.split(" ")).collect()
     [['Roses', 'are', 'red'], ['Violets', 'are', 'blue']]

###### flatMap flattens the per-element results into a single collection (one-dimensional)
    rdd.flatMap(lambda x :x.split(" ")).collect()
     ['Roses', 'are', 'red', 'Violets', 'are', 'blue']
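
The same distinction can be sketched in plain Python, without a cluster. A list comprehension that keeps one result per input behaves like `map`, and flattening those results with `itertools.chain` behaves like `flatMap`. The helper names `map_like` and `flat_map_like` are made up for this illustration and are not part of Spark. The word count at the end is a typical case where the flattened form is the one you want.

```python
from collections import Counter
from itertools import chain

lines = ["Roses are red", "Violets are blue"]

def map_like(func, items):
    # One output per input: the result keeps the nesting produced by func
    return [func(x) for x in items]

def flat_map_like(func, items):
    # Each input may produce many outputs, which are flattened into one list
    return list(chain.from_iterable(func(x) for x in items))

nested = map_like(lambda x: x.split(" "), lines)
flat = flat_map_like(lambda x: x.split(" "), lines)

# Counting words needs the flat form
word_counts = Counter(flat)
```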


##### Spark - repartition() vs coalesce()
1. `repartition()` can increase or decrease the number of partitions and always performs a full shuffle of the data across the cluster.
2. `coalesce()` can only decrease the number of partitions and, by default, avoids a full shuffle by merging existing partitions.
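
A toy model in plain Python may help to picture the difference. Partitions are modelled as lists; `coalesce` below merges whole partitions, while `repartition` redistributes every record. Both functions are simplified stand-ins written for this note, not Spark's implementation (Spark's real placement of records differs).

```python
def coalesce(partitions, n):
    # Merge whole partitions into at most n groups; no record is moved
    # out of the group its original partition belongs to (no shuffle).
    n = min(n, len(partitions))  # cannot increase the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    # Redistribute records one by one (a shuffle): any record may end up
    # in any of the n output partitions.
    out = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out

parts = [[1, 2], [3], [4, 5, 6], [7]]
fewer_no_shuffle = coalesce(parts, 2)
fewer_with_shuffle = repartition(parts, 2)
more_with_shuffle = repartition(parts, 6)
still_four = coalesce(parts, 6)
```

In this model `coalesce(parts, 6)` leaves the four partitions as they are, which mirrors the rule that coalescing without a shuffle cannot raise the partition count.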

#### Notes
1. A 30 GB file was processed in 6 minutes and produced a 1.5 GB output file.
1. Cluster setup: 2 name nodes, 3 gateway nodes, 5 worker/data nodes, 1 resource manager.
1. Each gateway node has 16 cores and 64 GB of RAM.
1. Every other node has 8 cores and 32 GB of RAM.
1. The input file has 270 blocks.
1. One task runs per block, so 270 tasks are created to process the file.
1. The job produces 270 output files, because each block is treated as one partition.
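
The arithmetic behind these numbers can be sketched as follows. Two things here are assumptions, not facts from the notes: a 128 MiB block size (a common HDFS default), and all 5 x 8 worker cores being available to run tasks at once. With exactly 30 GiB and 128 MiB blocks the count would be 240, so the reported 270 blocks suggests the file is somewhat larger than 30 GiB or the block layout differs.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

def num_blocks(file_size_bytes, block_size_bytes):
    # One partition (and one task) per block when reading the file
    return math.ceil(file_size_bytes / block_size_bytes)

def task_waves(num_tasks, total_cores):
    # Number of rounds needed if each core runs one task at a time
    return math.ceil(num_tasks / total_cores)

# Assumption: 128 MiB blocks
blocks_at_128_mib = num_blocks(30 * GIB, 128 * MIB)

# 270 tasks are from the notes; 5 workers x 8 cores is an optimistic assumption
waves = task_waves(270, 5 * 8)
```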

#### What are workers, executors, cores in Spark Standalone cluster?

##### DRIVER
1. The driver is the process where the main method runs. First it converts the user program into tasks and after that it schedules the tasks on the executors

##### EXECUTORS
1. Executors are worker nodes' processes in charge of running individual tasks in a given Spark job.
2. Once they have run the task they send the results to the driver.

##### APPLICATION EXECUTION FLOW
When you submit an application to the cluster with spark-submit, this is what happens internally:

1. A standalone application starts and instantiates a SparkContext instance (a driver).
2. The driver program asks the cluster manager for resources to launch executors.
3. The cluster manager launches executors.
4. The driver process runs through the user application. Depending on the actions and transformations over RDDs, tasks are sent to executors.
5. Executors run the tasks and save the results.
6. If any worker crashes, its tasks will be sent to different executors to be processed again.

#### What is the difference between cache and persist?
1. `cache()` and `persist()` are both used to improve the performance of Spark computations. They save intermediate results so they can be reused in subsequent stages.

2. The only difference is the storage level: `cache()` always uses the default level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames), while `persist()` lets you choose one, such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER or DISK_ONLY.


###### How to read multiple text files into a single RDD?
    sc.textFile("file1,file2")
    sc.textFile("folder1,folder2")

#### How to stop INFO messages displaying on spark console?
##### log4j.properties
1. Edit your conf/log4j.properties file.
2. Change the line "log4j.rootCategory=INFO, console" to "log4j.rootCategory=ERROR, console".

##### Using Spark-shell
     import org.apache.log4j.Logger
     import org.apache.log4j.Level

     Logger.getLogger("org").setLevel(Level.OFF)
     Logger.getLogger("akka").setLevel(Level.OFF)

##### RDDs support two types of operations: 
1. transformations, which create a new dataset from an existing one, and 
2. actions, which return a value to the driver program after running a computation on the dataset.

#### Spark job internal execution
1. When a SparkContext is created, each worker node starts an executor. Executors are separate processes (JVMs) that connect back to the driver program. Each executor has the jar of the driver program. Quitting the driver shuts down the executors.

2. Each executor can hold some partitions.

3. When a job is executed, an execution plan is created according to the lineage graph.

4. The job is split into stages. Each stage contains as many neighbouring (in the lineage graph) transformations and actions as possible, but no shuffles, so stages are separated by shuffles.

5. A task is a command sent from the driver to an executor by serializing the Function object.

6. The executor deserializes the command (task), using the driver jar, and executes it on a partition.
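
The rule that stages are separated by shuffles can be sketched in plain Python. This is a deliberate simplification: the `SHUFFLE_OPS` set is a hypothetical, incomplete list chosen for the example, and the sketch places each shuffle operation at the start of the next stage, whereas in Spark the map side of a shuffle runs at the end of the previous stage.

```python
# Hypothetical, incomplete set of operations that require a shuffle
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "join", "repartition", "sortByKey", "distinct"}

def split_into_stages(ops):
    # Walk the lineage in order and start a new stage at every shuffle
    stages = []
    for op in ops:
        if not stages or op in SHUFFLE_OPS:
            stages.append([])
        stages[-1].append(op)
    return stages

job = ["textFile", "flatMap", "map", "reduceByKey", "map", "sortByKey", "collect"]
stages = split_into_stages(job)
```

For this lineage the sketch yields three stages, because there are two shuffle operations.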

##### How to overwrite the output directory in spark
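     # RDD output (e.g. saveAsTextFile): disable the output-directory check on your SparkConf
     conf.set("spark.hadoop.validateOutputSpecs", "false")
     # DataFrame output: use overwrite mode
     df.write.mode("overwrite")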

#### What do the numbers on the progress bar mean in spark-shell?
     (14174 + 5) / 62500 is (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage
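
As a small illustration, the three numbers can be pulled out of such a progress string with a regular expression. The function name `parse_progress` and the returned dictionary keys are made up for this sketch; only the `(completed + active) / total` layout comes from the note above.

```python
import re

def parse_progress(text):
    # Matches "(completed + active) / total" anywhere in the string
    match = re.search(r"\((\d+)\s*\+\s*(\d+)\)\s*/\s*(\d+)", text)
    if match is None:
        return None
    completed, active, total = (int(g) for g in match.groups())
    return {
        "completed": completed,
        "active": active,
        "total": total,
        "percent_done": 100 * completed / total,
    }

progress = parse_progress("(14174 + 5) / 62500")
```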

##### How to save DataFrame directly to Hive?
     emp = spark.read.csv("data/emp.csv") 
     emp.write.saveAsTable("rposam_db.emp_from_df") 

#### Write to multiple outputs by key Spark - one Spark job
     emp_df.write.partitionBy("ename").csv("output/emp_by_ename")



#### What is the difference between spark checkpoint and persist to a disk
1. `persist(MEMORY_AND_DISK)` stores the DataFrame temporarily in memory, spilling to disk what does not fit, without breaking the lineage of the program.
1. `checkpoint()`, on the other hand, breaks the lineage and forces the DataFrame to be stored on disk. Unlike `cache()`/`persist()`, frequent checkpointing can slow down your program.
    ###### Checkpointing is recommended when
    1. working in an unstable environment, to allow fast recovery from failures
    1. storing intermediate states of a calculation in which new entries of the RDD depend on the previous entries

#### Concatenate two PySpark dataframes
    # union() matches columns by position; use unionByName() to match by name
    df1.union(df2)

###### AWS file reading from spark
Reading an `s3n://` path without credentials, e.g. `sc.textFile("s3n://bucketname/Filename")`, raises:

    java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

Either embed the credentials in the URL:

    lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")

or set them on the Hadoop configuration (the secret key can contain "/"):

    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "BLABLA")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "....")
    my_rdd = sc.textFile("s3n://myBucket/MyFilePattern")


    # Check where Spark is installed (None if SPARK_HOME is not set)
    spark_home = os.environ.get('SPARK_HOME', None)
    print(spark_home)

##### How to speed up spark df.write jdbc to postgres database?
1. Approach 1

         sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir s3://mybucket/myinputfiles/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16 
2. Approach 2
   Repartition the DataFrame so that multiple executors write to the database in parallel

          (df
           .repartition(10)        # number of concurrent connections from Spark to PostgreSQL
           .write.format('jdbc').options(
               url=psql_url_spark,
               driver=spark_env['PSQL_DRIVER'],
               dbtable="{schema}.{table}".format(schema=schema, table=table),
               user=spark_env['PSQL_USER'],
               password=spark_env['PSQL_PASS'],
               batchsize=2000000,
               queryTimeout=690
           ).mode(mode).save())

##### How to check if a Hive table exists using PySpark
    "your_table" in [t.name for t in spark.catalog.listTables("default")]

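A caution about appending `== True` to a membership test like this one: Python chains `in` and `==` as comparisons, so `x in names == True` means `(x in names) and (names == True)`, which is always false for a list. The sketch below shows the effect on a plain list standing in for the table names; `table_exists` is a hypothetical helper name, and the list is sample data rather than output from a real catalog.

```python
def table_exists(name, table_names):
    # A plain membership test already returns True or False
    return name in table_names

# Sample data standing in for [t.name for t in spark.catalog.listTables("default")]
names = ["your_table", "other_table"]

# Chained comparison: ("your_table" in names) and (names == True)
chained = "your_table" in names == True

found = table_exists("your_table", names)
missing = table_exists("no_such_table", names)
```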

##### Spark. Problem when writing a large file on aws s3a storage
Reported problem: with Spark 2.2.0 and Hadoop 2.7.2, pyspark was started with

    pyspark --driver-memory 16g --executor-memory 16g --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2

Uploading a small file to S3 works fine, but uploading a large file (around 10 GB) throws a confusing 403 error.

Reported fix: after updating Hadoop from 2.7.3 to 2.8.5, everything works without errors.


#### How can I read from S3 in pyspark running in local mode?
I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3 installed via pip in a virtualenv. There is no Hadoop installation on the local host, so there is no Spark installation (thus no SPARK_HOME, HADOOP_HOME, etc.).

    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'
    from pyspark import SparkConf
    from pyspark import SparkContext
    conf = SparkConf()\
        .setMaster("local")\
        .setAppName("pyspark-unittests")\
        .set("spark.sql.parquet.compression.codec", "snappy")
    sc = SparkContext(conf = conf)
    inputFile = sc.textFile("s3://somebucket/file.csv")
    
##### Solution:
    # Use the hadoop-aws version that matches the Hadoop jars bundled with your Spark (2.7.3 here)
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'


    # Only needed if you use s3://
    sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
    sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
    s3File = sc.textFile("s3a://myrepo/test.csv")

    print(s3File.count())
    print(s3File.id())
   
##### Solution 2:
Add the following lines to your Spark config file (for my local PySpark it is /usr/local/spark/conf/spark-defaults.conf):

    spark.hadoop.fs.s3a.access.key=<your access key>
    spark.hadoop.fs.s3a.secret.key=<your secret key>

Python file content:

    from __future__ import print_function
    import os

    from pyspark import SparkConf
    from pyspark import SparkContext

    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

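##### How can I get the file-name list of a directory from hdfs in pyspark?
    # The Hadoop FileSystem API is reachable through the JVM gateway; call listStatus() on it with a Path
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())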

##### Pyspark UDF registration 
    def getTuple(data, separator):
        # Split the string on the separator and return the parts as a tuple
        return tuple(data.split(separator))

    # Without an explicit returnType, register() assumes StringType
    spark.udf.register('convertStringToTuple', getTuple)

    line = getTuple("10 HR Hyderabad", " ")   # ('10', 'HR', 'Hyderabad')
