# ML with PySpark (In Progress)
This notebook is still in progress, but I want to show how to tackle the competition with PySpark. Here I use only the categorical columns for training and prediction, because the number of features balloons once the numeric columns are aggregated with mean, std, min and max. I have not evaluated the model against the competition metric yet; the notebook simply runs training -> prediction -> submission.

## Setup and Install

In [None]:
!pip install -q pyspark

## Import Libraries

In [None]:
import itertools
import multiprocessing
import re
from IPython import display

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import pyspark.pandas as ps
from pyspark import StorageLevel
from pyspark.sql import SparkSession, types
from pyspark.sql import functions as F
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression


## Spark Session
A Spark session is the entry point to all Spark functionality. We create one here and tune a few configurations.

In [None]:
# SESSION PARAMETER
CORES = multiprocessing.cpu_count()
MAX_PARTITION_SIZE = "134217728b"

The additional configurations are based on these references:
1. [Configure OffHeap][1] (otherwise training the model fails with an `OutOfMemoryError`)
2. [Spark job tuning tips][2]


[1]: https://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space
[2]: https://cloud.google.com/dataproc/docs/support/spark-job-tuning

In [None]:
spark = (SparkSession.builder.master(f"local[{CORES}]")
                             .config("spark.memory.offHeap.enabled", "true")
                             .config("spark.memory.offHeap.size","5g")
                             .config("spark.sql.shuffle.partitions", CORES * 3)
                             .config("spark.default.parallelism", CORES * 3)
                             .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", MAX_PARTITION_SIZE)
                             .appName("ML_spark")
                             .getOrCreate())
spark
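
The numbers passed to the session builder come from two small calculations. The sketch below reproduces them in plain Python, without Spark, so the values are easy to inspect. `parse_size_bytes` is a hypothetical helper written for this illustration and is not part of PySpark.

```python
import multiprocessing


def parse_size_bytes(size: str) -> int:
    """Parse a size string such as '134217728b' into a number of bytes.

    Hypothetical helper for illustration; only the plain 'b' suffix is handled.
    """
    if not size.endswith("b"):
        raise ValueError(f"expected a size ending in 'b', got {size!r}")
    return int(size[:-1])


cores = multiprocessing.cpu_count()
max_partition_bytes = parse_size_bytes("134217728b")

# Same keys as in the session builder; three partitions per core.
settings = {
    "spark.sql.shuffle.partitions": cores * 3,
    "spark.default.parallelism": cores * 3,
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "134217728b",
}

print(f"Advisory partition size: {max_partition_bytes / 1024**2:.0f} MiB")
for key, value in settings.items():
    print(f"{key} = {value}")
```

So `MAX_PARTITION_SIZE` is 128 MiB, and the shuffle and default parallelism settings scale with the number of cores on the machine.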

## Read Data

In [None]:
train_path = "../input/amex-pyspark-parquet/train_amex"
test_path = "../input/amex-pyspark-parquet/test_amex"
label_path = "../input/amex-pyspark-parquet/label_amex"

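Reading is fast because Spark evaluates lazily: `spark.read.parquet` only resolves the schema and builds a plan, and the rows are not loaded until an action such as `.count()` or `.show()` runs.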

In [None]:
%%time
train_df = spark.read.parquet(train_path)
test_df = spark.read.parquet(test_path)
label_df = spark.read.parquet(label_path)

## Checks on Spark and Data

### Physical plan of spark

In [None]:
train_df.select("customer_ID").explain()  # select one column to simplify the output

### Missing Customer
We need to make sure that no `customer_ID` is missing.

In [None]:
def get_null_count(sql_df, colname):
    count = (sql_df.select(colname)
                   .filter(F.col(colname).isNull())
                   .count())
    return count

In [None]:
missing_customer_train = get_null_count(train_df, "customer_ID") 
missing_customer_test = get_null_count(test_df, "customer_ID")

total_miss = missing_customer_train + missing_customer_test
print(f"Missing customer_ID: {total_miss}")

### Sample Data

In [None]:
train_df.show(1, vertical=True)
label_df.show(1, vertical=True)

## Preprocessing
In short, `pyspark.sql` lets us transform a column or apply a function to it with the `.withColumn()` method. Almost every function we need lives in the `pyspark.sql.functions` namespace (imported here as `F`).

### Here is how to deal with columns in `pyspark.sql`:

**1. Select columns with the `.select` method (returns only the selected columns)**

```
df.select(["col1", "col2"])
```

<br>


**2. Apply functions to the selected columns**

```
df.select([F.func1("col1"), F.func2("col2")])
```

<br>


**3. Add an alias (the column names change when you apply a function)**

```
df.select([F.func1("col1").alias("col1"), F.func2("col2").alias("col2")])
```


<br>



**4. Transform a specific column (the rest of the DataFrame is kept)**

```
df.withColumn("colname", F.func("colname"))
```


<br>


**5. Transform multiple columns**

```
df.withColumns({"colname1": F.func("col1"), "colname2": F.func("col2")})  # requires Spark 3.3+
```

<br>



**6. Rename columns**

```
df.withColumnRenamed("colname", "newcolname")
```

### Not familiar with pyspark.sql?
If you are not familiar with the `pyspark.sql` API, you can still use the pandas API by converting the current DataFrame into a pandas object.

To convert a PySpark DataFrame you can use the `.toPandas` or `.to_pandas_on_spark` method (newer Spark releases expose the latter as `.pandas_api`).
```
pandas_df = spark_df.toPandas()  # collects all rows into driver memory
```
or
```
ps_df = spark_df.to_pandas_on_spark(index_col="id")  # index_col keeps the data spread over multiple partitions
```

The drawback is that, with a large volume of data, the conversion hurts performance and lengthens computation time. I suggest sticking with `pyspark.sql` or `pyspark.rdd` until the data has been reduced to a manageable size.

### Utility Functions

In [None]:
def add_suffix(names, suffix):
    return [name + suffix for name in names]

### Define Columns
The data consists of several groups of columns:
- Individual information (id, date)
- Features (representation of individual)
- Target (label of individual)

In [None]:
# Known Columns
info_cols = ['customer_ID', 'S_2']
target_cols = ['target']
cat_cols = [
    'B_30', 'B_38', 
    'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']


# Define Numeric Columns
excluded = info_cols + cat_cols
num_cols = [col for col in train_df.columns if col not in excluded]

# Define Feature Columns
features_cols =  cat_cols + num_cols

print(f"Number of categoric cols: {len(cat_cols)}")
print(f"Number of numeric cols: {len(num_cols)}")

### Fill Missing Values
Some columns in this DataFrame contain `null` values. Our baseline strategy is:
- Fill null in numeric columns with 0
- Fill null in categoric columns with "null"

In [None]:
train_df = (train_df.fillna(0, subset=num_cols)
                    .fillna("null", subset=cat_cols))

test_df = (test_df.fillna(0, subset=num_cols)
                  .fillna("null", subset=cat_cols))

## Feature Transform and Engineer

**Data Structure**  
Let's look at how the data is structured: it is a collection of individual customer records, and every customer has multiple records of their profile. The default status is labeled once per customer, which means we need to combine each customer's records into a single set of features.

Example:

| id         	| pay 	| remain 	| last       	| due        	| label 	|
|------------	|-----	|--------	|------------	|------------	|-------	|
| customer_1 	| 100 	| 200    	| 2022-01-01 	| 2022-02-01 	|       	|
| customer_1 	| 100 	| 100    	| 2022-02-01 	| 2022-03-01 	|       	|
| customer_1 	| 100 	| 0      	| 2022-02-03 	| -          	| 0     	|
| customer_2 	| 10  	| 290    	| 2022-01-01 	| 2022-02-01 	|       	|
| customer_2 	| 0   	| 290    	| 2022-01-01 	| -           	| 1     	|

<br>


### Feature Engineer
coming soon

### Feature Transform
The data is a mix of numeric and categorical types. Machine learning models take only numeric input, so we need to convert the categorical columns into numeric form. For this task we will use `OneHotEncoder`.

#### Index String
Before one-hot encoding, we need to index the categorical columns as integers. `StringIndexer` handles this task.

In [None]:
# Create columns aliases
cat_index_cols = add_suffix(cat_cols, "_index")

# Fit StringIndexer
indexers = StringIndexer(inputCols=cat_cols, outputCols=cat_index_cols)
indexers_model = indexers.fit(train_df)

# Transform to data
train_df_indexed = indexers_model.transform(train_df)
test_df_indexed = indexers_model.transform(test_df)

In [None]:
# See which columns the indexer handles (print, since only the last expression of a cell is displayed)
print(indexers.getInputCols())

# See the indexed columns
train_df_indexed.select("B_30_index").show(5)
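
To make the indexing step concrete, here is a plain-Python emulation of what `StringIndexer` does with its default ordering: the most frequent label gets index 0, and ties are broken alphabetically. This is a sketch of the behaviour, not Spark code. The function name `fit_string_index` and the label values are made up for illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to a float index, most frequent label first.

    Emulates frequency-descending ordering; ties are broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up categorical column, including the "null" placeholder from fillna
column = ["CO", "CR", "CO", "null", "CO", "CR"]

mapping = fit_string_index(column)
indexed = [mapping[value] for value in column]

print(mapping)
print(indexed)
```

Because the mapping is learned from the training data only, a label that appears only in the test data has no index. With the default `handleInvalid="error"`, `StringIndexer` raises an error in that case, so it is worth checking that both sets share the same categories.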

#### One Hot Encoding
After that, we can apply OneHotEncoder on the indexed categoric columns.

In [None]:
# Create columns aliases
cat_ohe_cols = add_suffix(cat_cols, "_ohe")

# Fit OneHotEncoder
ohe = OneHotEncoder(inputCols=cat_index_cols, outputCols=cat_ohe_cols)
ohe_model = ohe.fit(train_df_indexed)

# Transform to data
train_df_ohed = ohe_model.transform(train_df_indexed)
test_df_ohed = ohe_model.transform(test_df_indexed)

We know that `B_30` has 4 unique values (3 + 1 after `fillna`), but the `OneHotEncoder` output has only 3 slots. This comes from the default `dropLast=True` argument of `OneHotEncoder`. According to the docs, the last category is not included because including it would make the vector entries sum to one, and hence linearly dependent.

See the [`OneHotEncoder` documentation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html?highlight=onehot#pyspark.ml.feature.OneHotEncoder).

In [None]:
train_df_ohed.select("B_30_ohe").show(5)
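
The effect of `dropLast=True` is easier to see on a tiny example. The sketch below is a plain-Python emulation, not Spark code. `one_hot` is a hypothetical helper that returns dense lists, whereas Spark displays the same information as sparse vectors such as `(3,[0],[1.0])`.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Four categories, as for B_30 after fillna
for idx in range(4):
    print(idx, one_hot(idx, num_categories=4))
```

The fourth category is still distinguishable: it is the only one encoded as all zeros, so no information is lost by dropping its slot.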

### Group Customer
After the features are in a form the model can use, we need to adjust the shape of the input. As described under **Data Structure** above, every customer has multiple records but only one label, so we reduce each customer's records to a single row with aggregate functions such as mean, min and max.

**Aggregate Data**  
We need to aggregate the feature columns with functions suited to each data type:
- Numeric: mean, std, min, max
- Categoric: count, last, nunique
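
The aggregation can be sketched in plain Python on a few rows shaped like the example table. This is only an illustration of the idea, not the Spark implementation. The `status` column is a hypothetical categorical feature added for the example.

```python
from collections import defaultdict
from statistics import mean

# Made-up records shaped like the example table: (id, remain, status).
# "status" is a hypothetical categorical column added for illustration.
records = [
    ("customer_1", 200, "A"),
    ("customer_1", 100, "A"),
    ("customer_1", 0, "B"),
    ("customer_2", 290, "A"),
    ("customer_2", 290, "C"),
]

# Collect each customer's rows, preserving the input order.
grouped = defaultdict(list)
for customer, remain, status in records:
    grouped[customer].append((remain, status))

# Reduce every customer to one row of aggregates.
summary = {}
for customer, rows in grouped.items():
    remains = [remain for remain, _ in rows]
    statuses = [status for _, status in rows]
    summary[customer] = {
        "remain_mean": mean(remains),
        "remain_min": min(remains),
        "remain_max": max(remains),
        "status_count": len(statuses),        # no nulls in this toy data
        "status_last": statuses[-1],          # depends on row order
        "status_nunique": len(set(statuses)),
    }

for customer, stats in summary.items():
    print(customer, stats)
```

Note that "last" only means something when the row order is fixed. In Spark, `F.last` after a `groupBy` depends on the order in which rows arrive, which is not guaranteed after a shuffle, so the result may vary unless the records are ordered (for example by `S_2`) first.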

In [None]:
# Functions for each type
# each tuple consist of: (function, column's suffix)
num_funcs = [
    (F.mean, "_mean"),
    (F.stddev, "_std"),
    (F.min, "_min"),
    (F.max, "_max"),
]

cat_funcs = [
    (F.count, "_count"),
    (F.last, "_last"),
    (F.countDistinct, "_nunique"),
]

In [None]:
# Arguments for .agg method
# each arg consist of: func(colname).alias(colname + suffix)
agg_num_args = [
    func(col).alias(col + suffix) 
    for col, (func, suffix) in itertools.product(num_cols, num_funcs)]

agg_cols_args = [
    func(col).alias(col + suffix) 
    for col, (func, suffix) in itertools.product(cat_ohe_cols, cat_funcs)]

# Combine numeric and categoric agg arguments
agg_args = agg_num_args + agg_cols_args
agg_args[0]
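
The list comprehension with `itertools.product` pairs every column with every function, which is also why the feature count grows so quickly. A minimal sketch of the naming logic alone, with two placeholder column names (`example_cols` is an assumption for this illustration):

```python
import itertools

# Two placeholder numeric columns and the four numeric suffixes
example_cols = ["P_2", "B_1"]
suffixes = ["_mean", "_std", "_min", "_max"]

# Every (column, suffix) pair becomes one aggregated feature name
names = [col + suffix for col, suffix in itertools.product(example_cols, suffixes)]

print(names)
print(f"{len(example_cols)} columns x {len(suffixes)} functions = {len(names)} features")
```

Each numeric column therefore turns into four aggregated features, so the total is four times the number of numeric columns.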

In [None]:
# Columns that we won't use
unused_cols = cat_cols + num_cols + cat_index_cols + cat_ohe_cols
print(f"Unused columns {len(unused_cols)}")

In [None]:
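# Aggregate per customer using only the categorical aggregates (see the intro).
# The trailing .drop is a no-op safeguard: agg already returns just customer_ID and the aggregates.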
train_df_grouped = (train_df_ohed.groupBy("customer_ID")
                                 .agg(*agg_cols_args)
                                 .drop(*unused_cols))

test_df_grouped = (test_df_ohed.groupBy("customer_ID")
                                .agg(*agg_cols_args)
                                .drop(*unused_cols))

## Modelling

### Join label and train data
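The label table is narrow (one `target` per customer), so we use the `F.broadcast` hint to send a full copy of it to every executor and avoid shuffling the wider feature table.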

In [None]:
train_joined_df = train_df_grouped.join(F.broadcast(label_df), on="customer_ID")
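
Conceptually, a broadcast join gives every partition of the large table its own copy of the small table and joins locally by key lookup. The sketch below emulates that idea in plain Python. The partition layout, the `join_partition` helper and the values are all made up for illustration, and this is not how Spark is implemented internally.

```python
# Small table: copied ("broadcast") to every worker as a lookup dictionary
labels = {"customer_1": 0, "customer_2": 1}

# Large table split into partitions of (customer_ID, feature) rows
partitions = [
    [("customer_1", 3)],
    [("customer_2", 2), ("customer_3", 5)],
]


def join_partition(rows, small_table):
    """Inner-join one partition against its local copy of the small table."""
    return [
        (customer, feature, small_table[customer])
        for customer, feature in rows
        if customer in small_table
    ]


# Each partition is joined independently; no rows move between partitions.
joined = [row for part in partitions for row in join_partition(part, labels)]
print(joined)
```

As with the default inner join in `DataFrame.join`, a customer without a label (`customer_3` here) is dropped from the result.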

In [None]:
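# customer_ID and target are not features, so exclude them from the count
n_feature_cols = len(train_joined_df.columns) - 2
print(f"Total feature columns: {n_feature_cols}")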

### Assemble Vector

In [None]:
va = VectorAssembler(
    inputCols=train_joined_df.drop("customer_ID", "target").columns,
    outputCol="features",
    handleInvalid="error",
)

train_ready_df = (va.transform(train_joined_df)
                    .select(["customer_ID", "features", "target"])
                    .persist(StorageLevel.DISK_ONLY))

test_ready_df = (va.transform(test_df_grouped)
                   .select(["customer_ID", "features"])
                   .persist(StorageLevel.DISK_ONLY))
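
`VectorAssembler` concatenates its input columns, in order, into a single feature vector, and vector-valued columns contribute all of their entries. The plain-Python sketch below emulates that flattening for one row. `assemble` is a hypothetical helper and the row values are made up.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # Vector column: every entry becomes a feature
            features.extend(float(v) for v in value)
        else:
            # Scalar column: contributes a single feature
            features.append(float(value))
    return features


# One made-up aggregated row for a single customer
row = {
    "B_30_ohe_count": 13,
    "B_30_ohe_last": [1.0, 0.0, 0.0],
    "B_30_ohe_nunique": 2,
}
input_cols = ["B_30_ohe_count", "B_30_ohe_last", "B_30_ohe_nunique"]

features = assemble(row, input_cols)
print(features)
print(f"{len(input_cols)} columns -> {len(features)} features")
```

This is why the length of the assembled `features` vector can be larger than the number of input columns.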

## Training

In [None]:
logres = LogisticRegression(featuresCol="features", labelCol="target")
logres_model = logres.fit(train_ready_df)

### Prediction

In [None]:
test_predictions = logres_model.transform(test_ready_df)
test_predictions

## Submission

In [None]:
submission = (test_predictions.select(["customer_ID", "probability"])
                              .toPandas())

In [None]:
# Get probability of target 1
submission["probability"] = submission["probability"].apply(lambda x: x[-1])

# Rename columns
submission = submission.rename(columns={"probability": "prediction"})

# Save File
submission.to_csv("submission.csv", index=False)

In [None]:
pd.read_csv("submission.csv").head()

In [None]:
# training_summary = logres_model.summary

# # Obtain the objective per iteration
# objective_history = training_summary.objectiveHistory
# print("objective history:")
# for objective in objective_history:
#     print(objective)

# # Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
# training_summary.roc.show()
# print("areaUnderROC: " + str(training_summary.areaUnderROC))

# # Set the model threshold to maximize F-Measure
# f_measure = training_summary.fMeasureByThreshold
# maxf_measure = (f_measure.groupBy()
#                          .max('F-Measure')
#                          .select('max(F-Measure)')
#                          .head())

# best_threshold = (f_measure.where(f_measure['F-Measure'] == maxf_measure['max(F-Measure)'])
#                            .select('threshold')
#                            .head()['threshold'])

# print(f"Best threshold: {best_threshold}")

# logres.setThreshold(best_threshold)