<a href="https://colab.research.google.com/github/tyri0n11/distributed-system/blob/main/lab8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I. Theoretical session:
1. List some limitations of MapReduce.
2. Provide a high-level comparison of Apache Hadoop and Apache Spark.
3. What are the advantages of Apache Spark?
4. Provide a comparison of RDD and DataFrame in Spark.

## Limitations of MapReduce

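1. Slow for multi-stage work: every job writes its results to disk, so a chain of jobs pays disk I/O between each round of computation.

2. No native support for streaming or real-time workloads; it is a batch engine.

3. Verbose to program: every job has to be expressed as hand-written map and reduce functions.

4. Poorly suited to machine learning, because iterative algorithms run as a sequence of separate jobs that each reread their input from disk.

5. Fault recovery is costlier than Spark's lineage mechanism: MapReduce re-runs failed tasks from data materialized on disk, whereas Spark recomputes only the lost partitions from their lineage.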
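
To make the "manual map + reduce logic" point concrete, here is a minimal pure-Python sketch of the three phases of a word count. It only imitates the programming model on one machine; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are made up for this illustration and are not Hadoop APIs.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In real MapReduce the mapped pairs are written to disk before the
    # shuffle; here everything stays in memory for brevity.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["spark is fast", "hadoop is reliable", "spark is popular"])
print(counts)
```

Even this small task needs three explicit steps; anything more complex (joins, iterations) has to be broken into a chain of such jobs by hand.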

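## Comparison of Apache Hadoop and Apache Spark

| Item            | Apache Hadoop (MapReduce)                     | Apache Spark                                       |
| --------------- | --------------------------------------------- | -------------------------------------------------- |
| Speed           | Slower; writes to disk between stages         | Faster; keeps data in memory where possible        |
| Workload type   | Batch only                                    | Batch and streaming                                |
| APIs            | Hand-written map and reduce functions         | Much easier: SQL, Python, Scala                    |
| Fault tolerance | Reliable; re-runs failed tasks from disk data | Reliable; recovery is usually faster via lineage   |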


## Advantages of Apache Spark

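1. In-memory computing: often cited as 10–100× faster than MapReduce for workloads that fit in memory.

2. Unified engine: batch, streaming, machine learning and graph processing in one framework.

3. Developer-friendly APIs in SQL, Python and Scala.

4. Rich ecosystem: MLlib, GraphX, Spark SQL.

5. Smarter failure recovery than Hadoop: lost partitions are recomputed from lineage.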
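
The idea behind lineage-based recovery can be sketched in a few lines of plain Python. This is a toy model, not Spark's implementation: `ToyDataset` and its methods are hypothetical names. Each dataset remembers the recipe (the chain of functions) that produced it, so a lost partition can be rebuilt from the durable source without recomputing the others.

```python
class ToyDataset:
    """A toy stand-in for an RDD: source partitions plus the recipe applied to them."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions  # durable input, e.g. files on disk
        self._lineage = tuple(lineage)    # ordered per-element functions
        self._cache = {}                  # partition index -> computed values

    def map(self, fn):
        # Transformations are lazy: only the recipe grows, nothing is computed.
        return ToyDataset(self._source, self._lineage + (fn,))

    def compute_partition(self, index):
        # Replay the lineage for one partition only, if it is not cached.
        if index not in self._cache:
            values = self._source[index]
            for fn in self._lineage:
                values = [fn(v) for v in values]
            self._cache[index] = values
        return self._cache[index]

    def collect(self):
        return [v for i in range(len(self._source)) for v in self.compute_partition(i)]

    def lose_partition(self, index):
        # Simulate an executor failure that drops one computed partition.
        self._cache.pop(index, None)


data = ToyDataset([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = data.collect()
data.lose_partition(1)
recomputed = data.compute_partition(1)  # only partition 1 is rebuilt
print(before, recomputed)
```

The point of the sketch is that nothing needs to be replicated up front: the recipe alone is enough to rebuild what was lost.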

## Comparison of RDD and DataFrame

| Key difference    | RDD                               | DataFrame                     |
| ----------------- | --------------------------------- | ----------------------------- |
| Optimization      | No automatic optimization         | Catalyst optimizer            |
| Schema            | No schema; holds arbitrary objects | Structured: named, typed columns |
| Use case          | Low-level ops, custom functions   | Fast analytics, SQL           |
| Performance       | Usually slower                    | Usually faster                |
| Abstraction level | Low                               | High                          |


II. You are given a file `appl_stock.csv`. Please carry out the following tasks:

1. Read this file with PySpark and print out the schema.
2. Create a new column that combines High, Low, Close and Adj Close as follows: `[High, Low, Close, Adj Close]`.
3. Create a new column that computes the average of the High and Low prices.
4. Create a new column that computes the amount of money using the formula `Volume * Adj Close`.
5. Use `groupby` and the `year()` function to compute the average closing price per year.


In [1]:
from pyspark.sql import SparkSession
# Import only what is used; a wildcard import would shadow Python builtins such as sum and max
from pyspark.sql.functions import year

spark = SparkSession.builder.appName('appl_stock').getOrCreate()
df = spark.read.csv('appl_stock.csv', inferSchema=True, header=True)
df.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)



In [2]:
df.withColumn('Combination of H L C AdjC', (df['High'] + df['Low'] + df['Close'] + df['Adj Close'])).show(5)

+----------+----------+----------+------------------+------------------+---------+------------------+-------------------------+
|      Date|      Open|      High|               Low|             Close|   Volume|         Adj Close|Combination of H L C AdjC|
+----------+----------+----------+------------------+------------------+---------+------------------+-------------------------+
|2010-01-04|213.429998|214.499996|212.38000099999996|        214.009998|123432400|         27.727039|               668.617034|
|2010-01-05|214.599998|215.589994|        213.249994|        214.379993|150476200|27.774976000000002|               670.994957|
|2010-01-06|214.379993|    215.23|        210.750004|        210.969995|138040000|27.333178000000004|               664.283177|
|2010-01-07|    211.75|212.000006|        209.050005|            210.58|119282800|          27.28265|        658.9126610000001|
|2010-01-08|210.299994|212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|   

In [4]:
df.withColumn('Amount', (df['Volume'] * df['Adj Close'])).show(5)

+----------+----------+----------+------------------+------------------+---------+------------------+--------------------+
|      Date|      Open|      High|               Low|             Close|   Volume|         Adj Close|              Amount|
+----------+----------+----------+------------------+------------------+---------+------------------+--------------------+
|2010-01-04|213.429998|214.499996|212.38000099999996|        214.009998|123432400|         27.727039|   3.4224149686636E9|
|2010-01-05|214.599998|215.589994|        213.249994|        214.379993|150476200|27.774976000000002|4.1794728435712004E9|
|2010-01-06|214.379993|    215.23|        210.750004|        210.969995|138040000|27.333178000000004|3.7730718911200004E9|
|2010-01-07|    211.75|212.000006|        209.050005|            210.58|119282800|          27.28265|     3.25435088342E9|
|2010-01-08|210.299994|212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|3.0732995574918003E9|
+----------+----

In [5]:
df.groupBy(year(df['Date'])).avg('Close').show(5)

+----------+------------------+
|year(Date)|        avg(Close)|
+----------+------------------+
|      2015|120.03999980555547|
|      2013| 472.6348802857143|
|      2014| 295.4023416507935|
|      2012| 576.0497195640002|
|      2016|104.60400786904763|
+----------+------------------+
only showing top 5 rows



III. You are given a dataset, `customer_churn.csv`, which describes the churn status of clients of a marketing agency. As a data scientist, you are required to create a machine learning model **in Spark** that will help predict which customers will churn (stop buying the agency's service). A short description of the data follows:
```
Names: Name of the latest contact at the company
Age: Customer age
Total_Purchase: Total ads purchased
Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
Years: Total years as a customer
Num_Sites: Number of websites that use the service
Onboard_date: Date that the latest contact was onboarded
Location: Client HQ address
Company: Name of the client company
```

1. Read the data, print the schema, and inspect a few rows to get a first look at the data.
2. Assemble the feature columns into a single vector with `VectorAssembler`, which is provided by PySpark's MLlib.
3. Split the data into train and test sets, then fit a logistic regression model on the training data.
4. Evaluate the results and compute the AUC.

In [7]:
spark = SparkSession.builder.appName('Customer Churn').getOrCreate()
df = spark.read.csv('customer_churn.csv', inferSchema=True, header=True)
df.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [8]:
from pyspark.ml.feature import VectorAssembler

# Define the input columns for the VectorAssembler
input_cols = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']

# Create a VectorAssembler instance
assembler = VectorAssembler(inputCols=input_cols, outputCol='features')

# Transform the DataFrame to include the 'features' column
output = assembler.transform(df)

# Print the schema and show the first few rows with the new 'features' column
output.printSchema()
output.select('features', 'Churn').show(5, truncate=False)

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)
 |-- features: vector (nullable = true)

+-----------------------------+-----+
|features                     |Churn|
+-----------------------------+-----+
|[42.0,11066.8,0.0,7.22,8.0]  |1    |
|[41.0,11916.22,0.0,6.5,11.0] |1    |
|[38.0,12884.75,0.0,6.67,12.0]|1    |
|[42.0,8010.76,0.0,6.71,10.0] |1    |
|[37.0,9191.58,0.0,5.56,9.0]  |1    |
+-----------------------------+-----+
only showing top 5 rows



In [9]:
# Split the data into train (70%) and test (30%) sets
train_data, test_data = output.randomSplit([0.7, 0.3])


In [11]:
# Fit a logistic regression model on the training data
from pyspark.ml.classification import LogisticRegression

lr_model = LogisticRegression(featuresCol='features', labelCol='Churn')

fitted_model = lr_model.fit(train_data)

In [15]:
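# fitted_model.summary holds metrics computed on the training data,
# so the value printed here is the training-set AUC
fitted_model.summary.predictions.show(5)
auc_logis = fitted_model.summary.areaUnderROC
print("AUC:", auc_logis)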

+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|           Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|            features|       rawPrediction|         probability|prediction|
+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|      Aaron King|46.0|       7504.79|              0| 5.98|      8.0|2007-05-13 20:42:11|38346 Smith Prair...|       Hernandez PLC|  0.0|[46.0,7504.79,0.0...|[3.06101637843657...|[0.95525575930095...|       0.0|
|     Aaron Meyer|45.0|       9598.03|              0|  5.0|      7.0|2010-07-17 03:30:38|35821 Bailey Skyw...|Steele, Bates and...|  0.0|[45.0,9598