# Chapter 1

### Big Data


- Defined by the 5 Vs:
    - velocity = Data is generated at extremely high speed
    - volume = The amount of data is huge
    - variety = Data comes from many different sources and in different forms (structured, unstructured, semi-structured)
    - veracity = Quality and trustworthiness of the data (consistency, completeness, integrity, ambiguity)
    - value = Ability to derive insight from the data

- Related concepts:
    - Clustered computing: Pooling the resources of multiple machines
    - Parallel computing: Simultaneous computation on a single computer
    - Distributed computing: A collection of nodes (networked computers) that run in parallel
    - Batch processing: Breaking a job into small pieces and running them on individual machines
    - Real-time processing: Processing data immediately as it arrives
    - Big Data processing systems
        - Hadoop/MapReduce: Scalable, fault-tolerant framework written in Java (distributed storage and batch processing)
        - Apache Spark: General-purpose, fast cluster computing system (both batch and real-time data processing)
        - Apache Hive: Data warehouse for querying and analysis
        - Note: Apache Spark is nowadays often preferred over Hadoop/MapReduce
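
The batch-processing idea above (break a job into pieces, process each piece, combine the results) can be sketched in plain Python as a MapReduce-style word count. This is only an illustration on one machine: the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the sample lines are made up for this example and are not part of Hadoop or Spark.

```python
from collections import defaultdict
from functools import reduce

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle_phase(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: reduce(lambda x, y: x + y, values) for key, values in groups.items()}

lines = ["big data is big", "spark is fast", "data is everywhere"]
# Split the job into small pieces; on a cluster each chunk would go to a different machine
chunks = [lines[i:i + 2] for i in range(0, len(lines), 2)]
word_counts = reduce_phase(shuffle_phase(map_phase(chunk) for chunk in chunks))
print(word_counts["is"], word_counts["big"], word_counts["data"])  # 3 2 2
```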

### Spark

- General purpose data processing engine designed for big data.
- Written in Scala
- Spark is a platform for cluster computing.
- Spark lets you spread data and computations over clusters with multiple nodes (each node is a separate computer).
- Very large datasets are split into smaller pieces (partitions), and each node works with only a small part of the data.
- Data processing and computation are performed in parallel over the nodes in the cluster.
- However, with greater computing power comes greater complexity.
- Can be used for Analytics, Data Integration, Machine learning, Stream Processing.
- Master and Worker:
    - Master:
        - Connected to the rest of the computers in the cluster, which are called workers
        - Sends the workers data and calculations to run
    - Worker:
        - Runs the calculations and sends the results back to the master
- Spark's core data structure is the Resilient Distributed Dataset (RDD)
- Instead of RDDs, it is easier to work with the Spark DataFrame abstraction built on top of RDDs (operations on DataFrames are automatically optimized)
- Spark DataFrames are immutable: every modification returns a new DataFrame, so assign the result to a variable
- You start working from an entry point: `SparkContext` for RDDs, `SparkSession` for DataFrames and SQL
- 2 modes:
    - local mode: a single computer
    - cluster mode: a cluster of computers
    - You typically build in local mode first and then deploy in cluster mode (no code change is required)
- Spark shell:
    - interactive environment for running Spark jobs
    - allows interacting with data on disk or in memory
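
The master/worker split described above can be imitated on a single machine. The sketch below is not Spark: threads stand in for worker nodes, and the helper names `split_into_partitions` and `worker_task` are invented for illustration. It only shows the pattern of splitting data into partitions, computing a partial result per partition, and combining the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, contiguous partitions."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

def worker_task(partition):
    """Work done by one worker: sum of squares of its own partition only."""
    return sum(x * x for x in partition)

data = list(range(1, 11))
partitions = split_into_partitions(data, num_partitions=3)

# The "master" hands one partition to each worker and collects the partial results
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(worker_task, partitions))

total = sum(partial_results)
print(partitions)       # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(partial_results)  # [30, 110, 245]
print(total)            # 385
```

Each worker only ever sees its own partition, which is why very large datasets can be handled by nodes that individually hold a small amount of data.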

### Lambda function

```
func_name = lambda inputs : return_expression

add = lambda a, b : a + b
add(3,6) ## 9
```

### Map

```
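#### Core python use case #####
# map(func_name, some_list)

items = [1, 2, 3, 4]
list(map(lambda x: x + 2, items))  ## [3, 4, 5, 6]

#### pandas DataFrame application #####
import pandas as pd
df = pd.DataFrame({'name': ['James', 'Jane'], 'col': [1, 2]})  # small example frame
# Method 1: apply a function to every value of a column
df["col"].apply(lambda x: x + 1)
# Method 2: map the values of a column through a dictionary
genders = {'James': 'Male', 'Jane': 'Female'}
df['gender'] = df['name'].map(genders)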
```

### Filter

```
## filter(boolean_func, list)

items = [1, 2, 3, 4]
list(filter(lambda x: (x%2 != 0), items)) ## [1, 3]
```

### Reduce

```
from functools import reduce

some_list = [1, 2, 3, 4, 5]
total_sum = reduce(lambda x, y: x + y, some_list) # 15
```
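
The three functions above are often chained. The sketch below, with made-up input values, sums the squares of the odd numbers in a list. In Python 3, `map` and `filter` return lazy iterators, so nothing is computed until `reduce` consumes them; Spark's RDD `filter` and `map` follow a similar pattern, being lazy transformations that only run when an action such as `reduce` is called.

```python
from functools import reduce

items = [1, 2, 3, 4, 5, 6]

odd_items = filter(lambda x: x % 2 != 0, items)  # lazily keeps 1, 3, 5
squares = map(lambda x: x * x, odd_items)        # lazily yields 1, 9, 25
total = reduce(lambda x, y: x + y, squares)      # consumes the iterators

print(total)  # 35
```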

# Chapter 2

### Pyspark session

```
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Load and Query CSV with SQL") \
    .getOrCreate()

# Load the CSV file into a DataFrame
df = spark.read.csv("file.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary table or view
df.createOrReplaceTempView("my_table")

# Run SQL queries on the DataFrame
query_result = spark.sql("SELECT * FROM my_table WHERE column_name = 'value'")

# Show the query result
query_result.show()

# Print the tables in the catalog
print(spark.catalog.listTables())

# Access the SparkContext from SparkSession
sc = spark.sparkContext
spark = SparkSession(sc) # Create a SparkSession from SparkContext

# Stop SparkSession
spark.stop()

```

### Pyspark context

```
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

# Create a context from SparkSession
spark = SparkSession.builder \
    .appName("example") \
    .getOrCreate()
sc = spark.sparkContext

# Alternative: create the SparkContext explicitly.
# Only one SparkContext can be active at a time, so getOrCreate reuses a running one.
conf = SparkConf().setAppName("YourAppName").setMaster("local[*]") # Set configuration for SparkContext
sc = SparkContext.getOrCreate(conf=conf)


print(sc) # Verify SparkContext
print(sc.version) # Print Spark version
print(sc.pythonVer) # Print Python version
print(sc.master) # Print the spark mode

# Loading data (with a specified number of partitions)
numRDD = sc.parallelize(range(10), numSlices=6) # parallelize takes numSlices, not minPartitions
fileRDD = sc.textFile("README.md", minPartitions=6)
fileRDD.getNumPartitions() # Number of partitions the file was split into

# Create a SparkSession from SparkContext
spark = SparkSession(sc) 
```

### Pyspark dataframe

```
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("example") \
    .getOrCreate()

df = spark.read.csv("file.csv", header=True, inferSchema=True) # load file
df.printSchema() # Show the DataFrame schema
df.show(5) # Show the first few rows of the DataFrame
df.createOrReplaceTempView("table_name") # Register DataFrame as a temporary view
result = spark.sql("SELECT * FROM table_name") # Run query on table
result.show() # Show result
spark_df = spark.table("table_name") # start using a spark table as spark dataframe

df_pandas = df.toPandas() # Convert from spark dataframe to pandas dataframe
df_spark = spark.createDataFrame(df_pandas) # Convert from pandas dataframe to spark dataframe
```

### RDD operations

```
RDD = sc.textFile("README.md", minPartitions=5) # RDD of text lines
RDD.getNumPartitions() # See number of partitions
words_RDD = RDD.flatMap(lambda line: line.split(" ")) # flatMap can return several values for each element of the original RDD

RDD = sc.parallelize([1, 2, 3, 4])
RDD_map = RDD.map(lambda x: x * x) # using map with an RDD
RDD_filter = RDD.filter(lambda x: x > 2) # using filter with an RDD
RDD_reduce = RDD.reduce(lambda x, y: x + y) # 10

combinedRDD = RDD_map.union(RDD_filter) # Combining 2 RDDs
RDD.collect() # Return all elements of the dataset as a list
RDD.take(2)  # Return first n elements of the dataset
RDD.first() # Return first element of the dataset
RDD.count() # Return number of elements in the RDD

RDD.saveAsTextFile("tempFile") # Writes a directory with one part file per partition
RDD.coalesce(1).saveAsTextFile("tempFileSingle")  # Single part file; the output directory must not already exist

# Working with paired data
my_tuple = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)]
pairRDD = sc.parallelize(my_tuple)
reducedRDD = pairRDD.reduceByKey(lambda x, y: x + y)
reducedRDD.collect() # [('Neymar', 22), ('Ronaldo', 34), ('Messi', 47)] (order may vary)
swappedRDD = reducedRDD.map(lambda x: (x[1], x[0])) # keys and values swap places
swappedRDD.sortByKey(ascending=False).collect() # [(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]
pairRDD.join(reducedRDD).collect() # Joining two pair RDDs on their keys, e.g. ('Messi', (23, 47))
# Groupby operation
grouped_RDD = pairRDD.groupByKey().collect()
for key, val in grouped_RDD:
    print(key, list(val))

# Countby operation
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
countby_rdd = rdd.countByKey()
for key, val in countby_rdd.items():
    print(key, val) # ('a', 2) , ('b', 1)

# Turning into dictionary
pairRDD.collectAsMap()

# Turning into dataframe
iphones_RDD = sc.parallelize([("X10", 2017, 5.65, 2.79, 6.13),
                              ("8Plus", 2017, 6.23, 3.07, 7.12)])
names = ['Model', 'Year', 'Height', 'Width', 'Weight']
spark_df = spark.createDataFrame(iphones_RDD, schema=names) # spark is the SparkSession object
df_pandas = spark_df.toPandas() # Convert from spark dataframe to pandas dataframe
from handyspark import * # third-party HandySpark package; importing it adds toHandy()
handy_df = spark_df.toHandy() # Convert to handyspark dataframe
```
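
To make the pair-RDD semantics concrete, here is a plain-Python sketch of what `reduceByKey` and `groupByKey` compute, using the same sample tuples. The helpers `reduce_by_key` and `group_by_key` are hypothetical names written for this note. They run on one machine and ignore partitioning and shuffling, which is where the real cost lies in Spark.

```python
from collections import defaultdict

def reduce_by_key(pairs, func):
    """Merge the values of each key with func (single-machine analogue of reduceByKey)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

def group_by_key(pairs):
    """Collect all values of each key into a list (single-machine analogue of groupByKey)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

my_tuple = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)]

totals = reduce_by_key(my_tuple, lambda x, y: x + y)
groups = group_by_key(my_tuple)
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(totals)  # {'Messi': 47, 'Ronaldo': 34, 'Neymar': 22}
print(groups)  # {'Messi': [23, 24], 'Ronaldo': [34], 'Neymar': [22]}
print(ranked)  # [('Messi', 47), ('Ronaldo', 34), ('Neymar', 22)]
```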

# Chapter 3

### Spark Dataframe

```
# Create dataframe from RDD
spark_df = spark.createDataFrame(RDD, schema=colname_list)
# Loading file
df = spark.read.csv("file.csv", header=True, inferSchema=True) # .json, .txt
df.show(3)
df.printSchema() # See schema information
df.describe().show() # Summary stats
df.createOrReplaceTempView("table_name") # Register DataFrame as a temporary view
result = spark.sql("SELECT * FROM table_name") # Run query on table
spark_df = spark.table("table_name") # start using a spark table as spark dataframe
# Add a new result column
df = df.withColumn("new_col",df.old_col+10)
# Selecting column
df = df.select(df.col1, df.col2, df.col3)
calculated_col = (df.col1/(df.col2/60)).alias("another_col")
df = df.select("col1", "col2", "col3", calculated_col)
df = df.selectExpr("col1", "col2", "col3", "col1/(col2/60) as another_col")
# Filtering (Both produces same results)
df.filter("col_name > 120").show()
df.filter(df.col_name > 120).show()
# Chaining filters
filterA = df.col1 == "SEA"
filterB = df.col2 == "PDX"
result = df.filter(filterA).filter(filterB)

df.groupBy("col_name").count().show() # Group by and count
df.orderBy("col_name").show(3) # Order by and show the first 3 rows
# Aggregation (groupBy() without arguments aggregates over the whole DataFrame)
df.filter(df.col_name == 'value').groupBy().max("another_col").show()
df = df.na.drop(subset=["col_name"]) # Drop nulls
df = df.dropDuplicates() # Drop duplicates
# Rename column
df = df.withColumnRenamed("old_col_name", "new_col_name")

# Casting / Converting column type
from pyspark.sql.functions import col
df = df.withColumn("col_name", col("col_name").cast("float"))
df = df.withColumn("col_name", df.col_name.cast("float"))

# SQL with dataframe
df.createOrReplaceTempView("table_name")
df2 = spark.sql("SELECT * FROM table_name")
result = df2.collect() # DataFrame as a list of rows that you can iterate over

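## Visualization: pyspark_dist_explore, pandas (not recommended for large data, toPandas() collects everything onto the driver), HandySpark (recommended)
pandas_df = spark_df.toPandas()
from handyspark import * # third-party HandySpark package; importing it adds toHandy()
handy_df = spark_df.toHandy() # Convert to handyspark dataframe
handy_df.cols["col_name"].hist()
spark_df = handy_df.to_spark() # Convert to pyspark dataframe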
```

# Chapter 4

### Machine Learning with dataframe

```
# One-hot encoding
from pyspark.ml.feature import StringIndexer, OneHotEncoder
# StringIndexer maps each category to a numeric index.
# handleInvalid="keep" gives categories not seen during fitting an extra index instead of raising an error.
string_indexer1 = StringIndexer(inputCol="cat_col1", outputCol="string_index1", handleInvalid="keep")
one_hot_encoder1 = OneHotEncoder(inputCol="string_index1", outputCol="onehot_feature1") # One-hot encoding using the category indices
string_indexer2 = StringIndexer(inputCol="cat_col2", outputCol="string_index2", handleInvalid="keep")
one_hot_encoder2 = OneHotEncoder(inputCol="string_index2", outputCol="onehot_feature2")
# Combine all features
from pyspark.ml.feature import VectorAssembler
feature_cols = ["feature1", "feature2", "feature3", "onehot_feature1", "onehot_feature2"]
vec_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Define the model (pick the one that fits the task)
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, MultilayerPerceptronClassifier
from pyspark.ml.regression import LinearRegression
from pyspark.ml.clustering import KMeans
model_rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=10)
model_lr1 = LogisticRegression(featuresCol="features", labelCol="label")
model_lr2 = LinearRegression(featuresCol="features", labelCol="label")
model_kmeans = KMeans(featuresCol="features", predictionCol="kmeans_prediction", k=3)
# Deep learning (multilayer perceptron)
# The input layer size must equal the length of the assembled "features" vector;
# one-hot columns expand into several entries, so adjust this placeholder to your data.
layers = [len(feature_cols) + 2, 5, 2]  # Input layer size, hidden layer sizes, output layer size (number of classes)
model_dl = MultilayerPerceptronClassifier(layers=layers, labelCol="label", featuresCol="features", seed=123)
# Pipeline (the last stage is the model; swap in any model defined above)
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[string_indexer1, one_hot_encoder1, string_indexer2, one_hot_encoder2, vec_assembler, model_lr1])

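# Create the parameter grid (the parameters must belong to the model used in the pipeline)
import numpy as np
import pyspark.ml.tuning as tune
paramGrid = tune.ParamGridBuilder() \
        .addGrid(model_lr1.regParam, np.arange(0, .1, .01)) \
        .addGrid(model_lr1.elasticNetParam, [0, 1]) \
        .build()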

# Evaluation metric
import pyspark.ml.evaluation as evals
evaluator_logistic = evals.BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction")
evaluator_reg = evals.RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
evaluator_rf_dl = evals.MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy") 
# Create cross-validator (pass the evaluator that matches the model, here the binary classifier)
cv = tune.CrossValidator(estimator=pipeline,
                         estimatorParamMaps=paramGrid,
                         evaluator=evaluator_logistic,
                         numFolds=3)

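# Split the data into training and test sets
# (df is the raw DataFrame; the pipeline applies the feature stages itself)
training, test = df.randomSplit([.6, .4], seed=42)

cvModel = cv.fit(training) # Fit the pipeline for every parameter combination and fold
bestModel = cvModel.bestModel # Best fitted pipeline
last_stage = bestModel.stages[-1] # The fitted model is the last pipeline stage
bestParams = last_stage.extractParamMap() # See best parameters
predictions = cvModel.transform(test) # Predict on the test set (uses bestModel)
auc = evaluator_logistic.evaluate(predictions) # areaUnderROC is the default metric of BinaryClassificationEvaluator
# Tree-based models expose featureImportances; linear models expose coefficients instead
feature_importances = last_stage.featureImportances if hasattr(last_stage, "featureImportances") else last_stage.coefficients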
```
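
A parameter grid is the Cartesian product of the value lists passed to `addGrid`, and cross-validation fits the pipeline once per combination and fold. The plain-Python sketch below uses no Spark and writes the values by hand to match the grid above. It counts how many fits that implies, which is worth checking before launching a job on a large dataset.

```python
from itertools import product

reg_params = [round(i * 0.01, 2) for i in range(10)]  # 0.0, 0.01, ..., 0.09
elastic_net_params = [0, 1]
num_folds = 3

# Every combination of the two parameter lists
param_grid = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_params, elastic_net_params)
]

print(len(param_grid))              # 20 parameter combinations
print(param_grid[0])                # {'regParam': 0.0, 'elasticNetParam': 0}
print(len(param_grid) * num_folds)  # 60 fits across the folds
```

Adding one more parameter with five values would multiply the count by five, so grids grow quickly.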

### Machine Learning with RDD

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.evaluation import MulticlassMetrics, RegressionMetrics
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from pyspark.mllib.tree import RandomForest

data = [
    ('A', 0.7, 'm', 1.0),
    ('B', 0.1, 'f', 0.3),
    ('A', 0.8, 'm', 0.2),
    ('C', 0.2, 'f', 0.5),
    ('C', 0.5, 'f', 0.6)
]

# One-hot encoding (using Pipeline and the pyspark dataframe construct; spark is the SparkSession)
df = spark.createDataFrame(data, ["label", "x1", "gender", "x2"])
pipeline = Pipeline(stages=[
    StringIndexer(inputCol='gender', outputCol='gender_index'),
    OneHotEncoder(inputCol='gender_index', outputCol='gender_encoded')
])
df = pipeline.fit(df).transform(df)

# Convert the label to numerical values and each row to a LabeledPoint
# (a LabeledPoint exposes the label as .label and the features as .features)
label_mapping = {'A': 0, 'B': 1, 'C': 2}
labeled_rdd = df.rdd.map(lambda row: LabeledPoint(
    label_mapping[row.label],
    Vectors.dense([row.x1, row.x2] + list(row.gender_encoded.toArray()))))

# Split the data into training and testing sets
(trainingData, testData) = labeled_rdd.randomSplit([0.8, 0.2], seed=42)

# Train the model
model_lr = LogisticRegressionWithLBFGS.train(trainingData, numClasses=3)
# LinearRegressionWithSGD is deprecated since Spark 2.0 in favor of pyspark.ml.regression.LinearRegression
model_lin = LinearRegressionWithSGD.train(trainingData, iterations=100, step=0.1)
model_rf = RandomForest.trainClassifier(trainingData, numClasses=3, categoricalFeaturesInfo={},
                                     numTrees=10, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
kmeans_model = KMeans.train(trainingData.map(lambda p: p.features), k=3, maxIterations=10, initializationMode="random")

# Deep learning: pyspark.mllib has no multilayer perceptron.
# Use pyspark.ml.classification.MultilayerPerceptronClassifier on a DataFrame instead.

# Make predictions on the test data as (prediction, label) pairs
features = testData.map(lambda p: p.features)
labels = testData.map(lambda p: p.label)
predictions_lr = testData.map(lambda p: (float(model_lr.predict(p.features)), p.label))
predictions_lin = testData.map(lambda p: (float(model_lin.predict(p.features)), p.label))
# Tree models wrap a JVM object, so predict on the whole RDD and zip with the labels
predictions_rf = model_rf.predict(features).zip(labels)
predictions_kmeans = kmeans_model.predict(features) # cluster index for each row

# Evaluate the classifiers (Random Forest, Logistic Regression)
accuracy_rf = MulticlassMetrics(predictions_rf).accuracy
accuracy_lr = MulticlassMetrics(predictions_lr).accuracy
# For a two-class problem, BinaryClassificationMetrics(score_and_label).areaUnderROC can be used instead

# Evaluate Linear Regression model
metrics_lin = RegressionMetrics(predictions_lin)
rmse_lin = metrics_lin.rootMeanSquaredError

# Evaluate KMeans: cluster indices are not labels, so RMSE against the labels is not meaningful.
# Use the within-set sum of squared distances to the cluster centers instead.
cost_kmeans = kmeans_model.computeCost(features)

sc.stop()
```
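
The metrics classes above consume pairs of `(prediction, label)`. As a sanity check of what accuracy means on such pairs, here is a plain-Python sketch on made-up values (the pairs below are hypothetical, not output of the models above):

```python
from collections import Counter

# Hypothetical (prediction, label) pairs, in the order the metrics classes expect
prediction_and_label = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (0.0, 2.0)]

# Accuracy: fraction of rows where the prediction equals the label
correct = sum(1 for prediction, label in prediction_and_label if prediction == label)
accuracy = correct / len(prediction_and_label)

# Confusion counts keyed by (label, prediction)
confusion = Counter((label, prediction) for prediction, label in prediction_and_label)

print(accuracy)               # 0.6
print(confusion[(2.0, 0.0)])  # 1, one class-2 row was predicted as class 0
```

Swapping the order of the pair does not change accuracy, but it does transpose the confusion counts, so it is worth keeping the order consistent.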

### Collaborative Filtering pyspark dataframe

```
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ALS Example") \
    .getOrCreate()

# Sample data (User ID, Item ID, Rating, Additional Column1, Additional Column2)
data = [
    (1, 1, 5, "A", "X"),
    (1, 2, 4, "B", "Y"),
    (2, 1, 3, "C", "Z"),
    (2, 2, 5, "D", "W"),
    (3, 1, 4, "E", "V"),
    (3, 2, 2, "F", "U")
]

# Create DataFrame
df = spark.createDataFrame(data, ["user", "item", "rating", "additional_col1", "additional_col2"])

# Split data into training and test sets
(training_data, test_data) = df.randomSplit([0.8, 0.2])

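# Train ALS model
# coldStartStrategy="drop" removes test rows whose user or item was not seen in training;
# otherwise their predictions are NaN and the RMSE becomes NaN as well
als = ALS(rank=10, maxIter=10, regParam=0.01, userCol="user", itemCol="item", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training_data)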

# Make predictions on test data
predictions = model.transform(test_data)

# Evaluate predictions using RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) = " + str(rmse))

# Stop SparkSession
spark.stop()

```

### Collaborative Filtering RDD

```
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

# Initialize SparkContext
sc = SparkContext("local", "ALS Example")

# Sample data (User ID, Item ID, Rating, Additional Column1, Additional Column2)
data = [
    (1, 1, 5, "A", "X"),
    (1, 2, 4, "B", "Y"),
    (2, 1, 3, "C", "Z"),
    (2, 2, 5, "D", "W"), # <--- say this data is for testing
    (3, 1, 4, "E", "V"),
    (3, 2, 2, "F", "U")
]

# Create RDD
ratings_rdd = sc.parallelize(data)

# Map the data to Rating objects (User ID, Item ID, Rating)
ratings = ratings_rdd.map(lambda x: Rating(x[0], x[1], x[2])) # Rating(user=1, product=1, rating=5.0)

# Split data into training and test sets
training_data, test_data = ratings.randomSplit([0.8, 0.2])

# Train ALS model
rank = 10  # Number of latent factors
num_iterations = 10  # Number of iterations
model = ALS.train(training_data, rank, num_iterations)

# Make predictions on test data
test_user_item_pairs = test_data.map(lambda x: (x[0], x[1]))
predictions = model.predictAll(test_user_item_pairs)
predictions = predictions.map(lambda r: ((r[0], r[1]), r[2])) # ((2, 2), 5.008601768134059)

# Join predicted ratings with actual ratings
rates_and_preds = test_data.map(lambda r: ((r[0], r[1]), r[2])).join(predictions) # ((2, 2), (5.0, 5.008601768134059))

# Calculate RMSE
RMSE = (rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())**0.5
print("Root Mean Squared Error (RMSE) = " + str(RMSE))

# Stop SparkContext
sc.stop()

```
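
The join-then-RMSE step above can be traced by hand on a few values. The sketch below is plain Python with made-up ratings and predictions keyed by `(user, item)`. Like the RDD `join`, it keeps only keys present on both sides, then computes the root mean squared error over the matched pairs.

```python
from math import sqrt

# Hypothetical values: actual test ratings and model predictions keyed by (user, item)
actual = {(2, 2): 5.0, (3, 1): 4.0}
predicted = {(2, 2): 4.6, (3, 1): 4.3, (1, 2): 3.9}

# Inner join on the key: only (user, item) pairs present in both dictionaries survive
joined = {key: (actual[key], predicted[key]) for key in actual.keys() & predicted.keys()}

squared_errors = [(rating - prediction) ** 2 for rating, prediction in joined.values()]
rmse = sqrt(sum(squared_errors) / len(squared_errors))

print(sorted(joined))  # [(2, 2), (3, 1)]
print(round(rmse, 4))  # 0.3536
```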