# Chapter 1

### Spark

- General-purpose data processing engine designed for big data.
- Spark is a platform for cluster computing.
- Spark lets you spread data and computations over clusters with multiple nodes (each node is a separate computer).
- Very large datasets are split into smaller partitions, so each node works with only a small amount of data.
- Data processing and computation are performed in parallel over the nodes in the cluster.
- However, with greater computing power comes greater complexity.
- Can be used for analytics, data integration, machine learning, and stream processing.
- Master and Worker:
    - Master:
        - Connected to the rest of the computers in the cluster, which are called workers.
        - Sends the workers data and calculations to run.
    - Worker:
        - Runs the calculations and sends its results back to the master.
- Spark's core data structure is the Resilient Distributed Dataset (RDD).
- Instead of working with RDDs directly, it is easier to use the Spark DataFrame abstraction, which is built on top of RDDs. Operations on DataFrames are optimized automatically.
- Spark DataFrames are immutable: a modification returns a new DataFrame, so you need to assign the result (for example, `df = df.withColumn(...)`).
- You start by creating a `SparkSession` (the entry point for DataFrames) or a `SparkContext` (the entry point for RDDs).
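
The split, compute-in-parallel, and combine flow described above can be sketched in plain Python. This is only an analogy, not Spark itself: threads stand in for worker nodes, and `split_into_partitions` and `worker_sum_of_squares` are hypothetical helpers invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, one per worker."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker_sum_of_squares(partition):
    """Work done by one 'worker': it only ever sees its own partition."""
    return sum(x * x for x in partition)


data = list(range(1, 1001))
partitions = split_into_partitions(data, num_partitions=4)

# The 'master' hands each partition to a worker and collects the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(worker_sum_of_squares, partitions))

# The 'master' combines the partial results into the final answer.
total = sum(partial_results)
print(len(partitions), total)
```

In Spark the partitions live on different machines and the scheduling is handled for you, but the shape of the computation is the same.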

### SparkSession

```
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark K-means example") \
    .getOrCreate()
# Print the tables in the catalog
print(spark.catalog.listTables())

# Load CSV file into DataFrame
df = spark.read.csv("file.csv", header=True, inferSchema=True)
# Show the first few rows of the DataFrame
df.show()
# Print the schema of the DataFrame
df.printSchema()
# Perform basic operations or transformations on the DataFrame as needed
# For example, you can filter rows, perform aggregations, etc.
# Load a spark table in the DataFrame
df_new = spark.table("table_name")
# Stop SparkSession
spark.stop()
```

### SparkContext

```
from pyspark import SparkConf, SparkContext

# Create a SparkConf object to configure the SparkContext
conf = SparkConf().setAppName("YourAppName").setMaster("local[*]")

# Create a SparkContext with the configured SparkConf object
sc = SparkContext(conf=conf)

# Verify SparkContext
print(sc)

# Print Spark version
print(sc.version)
```

# Chapter 2

### PySpark operations

```
# Add a new column computed from an existing one
df = df.withColumn("new_col", df.old_col + 10)
# Selecting columns (by name, by Column object, or with SQL expressions)
calculated_col = (df.col1 / (df.col2 / 60)).alias("another_col")
df = df.select("col1", "col2", "col3", calculated_col)
df = df.select(df.col1, df.col2, df.col3)
df = df.selectExpr("col1", "col2", "col3", "col1/(col2/60) as another_col")

# Filtering (both forms produce the same result)
df.filter("col_name > 120").show()
df.filter(df.col_name > 120).show()
# Chaining filters (rows must satisfy both conditions)
filterA = df.col1 == "SEA"
filterB = df.col2 == "PDX"
result = df.filter(filterA).filter(filterB)

# Group by
df.groupBy("col_name").count().show()
# Aggregation
df.filter(df.col == 'value').groupBy().max("another_col").show()

# Drop nulls
df = df.na.drop(subset=["col_name"])
# Rename column
df = df.withColumnRenamed("old_col_name", "new_col_name")

# Casting / Converting column type
from pyspark.sql.functions import col
df = df.withColumn("col_name", col("col_name").cast("float"))
df = df.withColumn("col_name", df.col_name.cast("float"))

```

# Chapter 3

### PySpark Machine Learning

```
# One-hot encoding: index each categorical (string) column, then encode the index
# ("cat_col1" and "cat_col2" are placeholder column names)
from pyspark.ml.feature import StringIndexer, OneHotEncoder
string_indexer1 = StringIndexer(inputCol="cat_col1", outputCol="string_index1") # string -> numeric index
one_hot_encoder1 = OneHotEncoder(inputCol="string_index1", outputCol="onehot_feature1") # index -> one-hot vector
string_indexer2 = StringIndexer(inputCol="cat_col2", outputCol="string_index2")
one_hot_encoder2 = OneHotEncoder(inputCol="string_index2", outputCol="onehot_feature2")
# Make a VectorAssembler: combine all feature columns into a single vector column
from pyspark.ml.feature import VectorAssembler
vec_assembler = VectorAssembler(inputCols=["col1", "air_time", "onehot_feature1", "onehot_feature2", "plane_age"], outputCol="features")

# Pipeline: run the stages in order
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[string_indexer1, one_hot_encoder1, string_indexer2, one_hot_encoder2, vec_assembler])
pipeline_model = pipeline.fit(df) # Fit the pipeline to the DataFrame
transformed_df = pipeline_model.transform(df) # Transform the DataFrame
# Split the data into training and test sets (60% / 40%)
training, test = transformed_df.randomSplit([.6, .4])

# LogisticRegression (by default it reads the "features" and "label" columns)
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()

# Evaluation metric
import pyspark.ml.evaluation as evals
evaluator = evals.BinaryClassificationEvaluator(metricName="areaUnderROC")

# Create the parameter grid for hyperparameter tuning
import numpy as np
import pyspark.ml.tuning as tune
grid = tune.ParamGridBuilder()
grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01))
grid = grid.addGrid(lr.elasticNetParam, [0, 1]) # 0 = L2 (ridge) penalty, 1 = L1 (lasso) penalty
grid = grid.build()

# Create the CrossValidator
cv = tune.CrossValidator(estimator=lr,
               estimatorParamMaps=grid,
               evaluator=evaluator)
models = cv.fit(training) # Fit the training set
best_lr = models.bestModel # Find the best model
# best_lr = lr.fit(training)
print(best_lr)
test_results = best_lr.transform(test) # Use the model to predict the test set
# Evaluate the predictions
print(evaluator.evaluate(test_results))
```
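
To get a feel for how much work the grid search above implies, here is a plain-Python sketch of the combinations that the parameter grid enumerates. It does not call Spark. The list comprehension stands in for `np.arange(0, .1, .01)` (assumed to yield ten values), and `num_folds = 3` assumes `CrossValidator` is left at its default number of folds.

```python
from itertools import product

# Stand-in for np.arange(0, .1, .01): 0.00, 0.01, ..., 0.09
reg_params = [i / 100 for i in range(10)]
# 0 = pure L2 (ridge) penalty, 1 = pure L1 (lasso) penalty
elastic_net_params = [0, 1]

# The grid is the Cartesian product of the values supplied for each parameter.
param_maps = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_params, elastic_net_params)
]

num_folds = 3  # assumption: CrossValidator's default numFolds
fits_during_search = len(param_maps) * num_folds

print(len(param_maps))      # number of candidate parameter combinations
print(fits_during_search)   # model fits performed during cross-validation
```

Every extra value added to the grid multiplies the number of fits, so it is worth keeping the grid small when the training set is large.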