# Chapter 1

### Big Data


- Volume: Size of the data
- Variety: Different sources and formats
- Velocity: Speed of the data
- Clustered computing: Pooling the resources of multiple machines
- Parallel computing: Simultaneous computation on a single computer
- Distributed computing: Collection of nodes (networked computers) that run in parallel
- Batch processing: Breaking the job into small pieces and running them on individual machines
- Real-time processing: Immediate processing of data
- Big Data processing systems
    - Hadoop/MapReduce: Scalable and fault-tolerant framework written in Java (batch processing)
    - Apache Spark: General-purpose and lightning-fast cluster computing system (both batch and real-time data processing)
    - Note: Apache Spark is now often preferred over Hadoop/MapReduce because it keeps intermediate data in memory instead of writing it to disk between steps

### Spark

- General-purpose data processing engine designed for big data.
- Written in Scala.
- Spark is a platform for cluster computing.
- Spark lets you spread data and computations over clusters with multiple nodes (each node is a separate computer).
- Very large datasets are split into smaller pieces, and each node works with only a small part of the data.
- Data processing and computation are performed in parallel over the nodes in the cluster.
- However, with greater computing power comes greater complexity.
- Can be used for analytics, data integration, machine learning, and stream processing.
- Master and Worker:
    - Master:
        - Connected to the rest of the computers in the cluster, which are called workers
        - Sends the workers data and calculations to run
    - Worker:
        - Runs the calculations and sends the results back to the master
- Spark's core data structure is the Resilient Distributed Dataset (RDD).
- Instead of RDDs, it is easier to work with the Spark DataFrame abstraction built on top of RDDs (operations on DataFrames are automatically optimized).
- Spark DataFrames are immutable: every transformation returns a new DataFrame, so assign the result to a variable to keep it.
- You start from an entry point: `SparkSession` (DataFrame and SQL API) or `SparkContext` (RDD API).
- 2 modes:
    - Local mode: a single computer
    - Cluster mode: a cluster of computers
    - You typically build in local mode first and then deploy in cluster mode (no code change is required)
- Spark shell:
    - Interactive environment for running Spark jobs
    - Allows interacting with data on disk or in memory
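
The master/worker idea above can be sketched on a single machine with plain Python. This is only an analogy, not Spark itself: the helper names `split_into_partitions` and `worker_task` are made up for illustration, and a thread pool stands in for the worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

def worker_task(partition):
    """Work done by one 'worker': it only sees its own partition."""
    return sum(x * x for x in partition)

data = list(range(1, 11))                    # 1..10
partitions = split_into_partitions(data, 3)  # the 'master' splits the data

# The 'master' hands one partition to each worker and collects the partial results
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(worker_task, partitions))

total = sum(partial_results)                 # the 'master' combines the results
print(partitions)       # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(partial_results)  # [30, 110, 245]
print(total)            # 385
```

Each worker only touches its own chunk, which is the property that lets Spark scale the same pattern across many machines.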

### Lambda function

```
func_name = lambda inputs : return_expression

add = lambda a, b : a + b
add(3,6) ## 9
```
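
As a small illustration of where lambdas are handy, the sketch below compares a lambda with the equivalent named function and then uses a lambda inline as a sort key. The `players` list is sample data chosen for the example.

```python
# A lambda and the equivalent named function
add = lambda a, b: a + b

def add_named(a, b):
    return a + b

# Lambdas are most useful inline, for example as a sort key
players = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22)]
by_score = sorted(players, key=lambda pair: pair[1], reverse=True)

print(add(3, 6), add_named(3, 6))  # 9 9
print(by_score)                    # [('Ronaldo', 34), ('Messi', 23), ('Neymar', 22)]
```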

### Map

```
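#### Core Python use case ####
# map(func_name, some_list)

items = [1, 2, 3, 4]
list(map(lambda x: x + 2, items))  ## [3, 4, 5, 6]

#### pandas DataFrame application ####
import pandas as pd

df = pd.DataFrame({"name": ["James", "Jane"], "col": [1, 2]})  # sample data for illustration
# Method 1: apply a function to every value of a column
df["col"].apply(lambda x: x + 1)
# Method 2: map values through a dictionary lookup
genders = {'James': 'Male', 'Jane': 'Female'}
df['gender'] = df['name'].map(genders)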
```

### Filter

```
## filter(boolean_func, list)

items = [1, 2, 3, 4]
list(filter(lambda x: (x%2 != 0), items)) ## [1, 3]
```
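
`map` and `filter` are often chained. The sketch below does that on the same `items` list and shows the equivalent list comprehension. In Python 3 both functions return lazy iterators, which is why the result is wrapped in `list()`.

```python
items = [1, 2, 3, 4]

odd_items = filter(lambda x: x % 2 != 0, items)       # lazy iterator over the odd values
squared_odds = list(map(lambda x: x * x, odd_items))  # list() forces evaluation

# Equivalent list comprehension
squared_odds_lc = [x * x for x in items if x % 2 != 0]

print(squared_odds, squared_odds_lc)  # [1, 9] [1, 9]
```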

# Chapter 2

### PySpark session

```
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Load and Query CSV with SQL") \
    .getOrCreate()

# Load the CSV file into a DataFrame
df = spark.read.csv("file.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary table or view
df.createOrReplaceTempView("my_table")

# Run SQL queries on the DataFrame
query_result = spark.sql("SELECT * FROM my_table WHERE column_name = 'value'")

# Show the query result
query_result.show()

# Print the tables in the catalog
print(spark.catalog.listTables())

# Access the SparkContext from SparkSession
sc = spark.sparkContext
spark = SparkSession(sc) # Create a SparkSession from SparkContext

# Stop SparkSession
spark.stop()

```

### PySpark context

```
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

# Option 1: get the context from a SparkSession
spark = SparkSession.builder \
    .appName("example") \
    .getOrCreate()
sc = spark.sparkContext

# Option 2: create the context explicitly
# getOrCreate() reuses an already-running context; calling SparkContext(conf=conf)
# while another context is active raises an error
conf = SparkConf().setAppName("YourAppName").setMaster("local[*]") # Set configuration for SparkContext
sc = SparkContext.getOrCreate(conf=conf)

print(sc) # Verify SparkContext
print(sc.version) # Print Spark version
print(sc.pythonVer) # Print Python version
print(sc.master) # Print the master URL (e.g. local[*] in local mode)

# Loading data (with a specified number of partitions)
numRDD = sc.parallelize(range(10), numSlices=6) # parallelize() takes numSlices
fileRDD = sc.textFile("README.md", minPartitions=6) # textFile() takes minPartitions
fileRDD.getNumPartitions() # See the number of partitions

# Create a SparkSession from SparkContext
spark = SparkSession(sc)
```

### PySpark DataFrame

```
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("example") \
    .getOrCreate()

df = spark.read.csv("file.csv", header=True, inferSchema=True) # load file
df.printSchema() # Show the DataFrame schema
df.show(5) # Show the first few rows of the DataFrame
df.createOrReplaceTempView("table_name") # Register DataFrame as a temporary view
result = spark.sql("SELECT * FROM table_name") # Run query on table
result.show() # Show result

df_pandas = df.toPandas() # Convert from Spark DataFrame to pandas DataFrame (collects all rows to the driver, so use on small data only)
df_spark = spark.createDataFrame(df_pandas) # Convert from pandas DataFrame to Spark DataFrame
```

### RDD operations

```
RDD = sc.textFile("README.md", minPartitions = 5)
RDD.getNumPartitions() # See number of partitions
RDD = sc.parallelize([1,2,3,4])
RDD_map = RDD.map(lambda x: x * x) # using map with an RDD
RDD_filter = RDD.filter(lambda x: x > 2) # using filter with an RDD
total = RDD.reduce(lambda x, y : x + y) # 10 (reduce is an action: it returns a value, not an RDD)

linesRDD = sc.parallelize(["hello world", "how are you"]) # sample strings for illustration
wordsRDD = linesRDD.flatMap(lambda x: x.split(" ")) # flatMap can return multiple values for each element in the original RDD
combinedRDD = RDD_map.union(RDD_filter) # Combining 2 RDDs
RDD.collect() # Return all elements of the dataset as a list
RDD.take(2)  # Return first n elements of dataset
RDD.first() # Return first element of dataset
RDD.count() # Return number of elements in the RDD

RDD.saveAsTextFile("tempFile") # Save as a directory with one part file per partition
RDD.coalesce(1).saveAsTextFile("tempFile_single")  # Save as a directory with a single part file (the output path must not exist yet)

# Working with paired data
my_tuple = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)]
pairRDD = sc.parallelize(my_tuple)
reducedRDD = pairRDD.reduceByKey(lambda x,y : x + y)
reducedRDD.collect() # e.g. [('Neymar', 22), ('Ronaldo', 34), ('Messi', 47)] (order may vary)
swappedRDD = reducedRDD.map(lambda x: (x[1], x[0])) # keys and values swap places
swappedRDD.sortByKey(ascending=False).collect() # [(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]
RDD1.join(RDD2).collect() # Joining two RDDs by key (RDD1 and RDD2 are placeholders for pair RDDs)
# Groupby operation
grouped = pairRDD.groupByKey().collect() # list of (key, iterable of values)
for key, val in grouped:
    print(key, list(val))

# Countby operation
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
countby_rdd = rdd.countByKey() # returns a dictionary on the driver, not an RDD
for key, val in countby_rdd.items():
    print(key, val) # prints "a 2" and "b 1"

# Turning into dictionary
reducedRDD.collectAsMap() # with duplicate keys, only one value per key would be kept

# Turning into dataframe
iphones_RDD = sc.parallelize([("X10", 2017, 5.65, 2.79, 6.13),
                    ("8Plus", 2017, 6.23, 3.07, 7.12)])
names = ['Model', 'Year', 'Height', 'Width', 'Weight']
spark_df = spark.createDataFrame(iphones_RDD, schema=names) # spark is the SparkSession object
```

# Chapter 3

### Spark DataFrame

```
# Create dataframe from RDD
spark_df = spark.createDataFrame(RDD, schema=colname_list)
# Loading file
df = spark.read.csv("file.csv", header=True, inferSchema=True) # .json, .txt
df.show(3)
df.printSchema() # See schema information
df.describe().show() # Summary stats
# Add a new result column
df = df.withColumn("new_col",df.old_col+10)
# Selecting column
df = df.select(df.col1, df.col2, df.col3)
calculated_col = (df.col1/(df.col2/60)).alias("another_col")
df = df.select("col1", "col2", "col3", calculated_col)
df = df.selectExpr("col1", "col2", "col3", "col1/(col2/60) as another_col")
# Filtering (both produce the same result)
df.filter("col_name > 120").show()
df.filter(df.col_name > 120).show()
# Chaining filters
filterA = df.col1 == "SEA"
filterB = df.col2 == "PDX"
result = df.filter(filterA).filter(filterB)


df.groupBy("col_name").count().show() # Group by and count
df.orderBy("col_name").show(3) # Order by a column and show the first 3 rows
# Aggregation
df.filter(df.col == 'value').groupBy().max("another_col").show()
df = df.na.drop(subset=["col_name"]) # Drop nulls
df = df.dropDuplicates() # Drop duplicates
# Rename column
df = df.withColumnRenamed("old_col_name", "new_col_name")

# Casting / Converting column type
from pyspark.sql.functions import col
df = df.withColumn("col_name", col("col_name").cast("float"))
df = df.withColumn("col_name", df.col_name.cast("float"))

# SQL with dataframe
df.createOrReplaceTempView("table_name")
df2 = spark.sql("SELECT * FROM table_name")
result = df2.collect() # DataFrame as a list of Row objects that you can iterate over

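## Visualization options: pyspark_dist_explore, pandas (NOT RECOMMENDED for large data, since toPandas() collects everything to the driver), HandySpark (RECOMMENDED)
pandas_df = spark_df.toPandas()
from handyspark import * # importing HandySpark adds toHandy() to Spark DataFrames
handy_df = spark_df.toHandy() # Convert to HandySpark DataFrame
handy_df.cols["col_name"].hist()
spark_df = handy_df.to_spark() # Convert back to PySpark DataFrame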
```