## Multinomial classification modelling

## Objectives:
As part of the evaluation of this course, students carry out a Big Data project in Spark as a team.
The learning objectives are to solve and present an end-to-end solution to a Big Data problem in an 
intercultural team, and to demonstrate expertise in key concepts, techniques, and trends (among 
others). In this project, they apply the knowledge and techniques seen in class to a lifelike Big 
Data project. They also learn how to work in an intercultural team, and how to present business 
insights to a business team and technical analyses to a data science team.


BLU’s data science team has given you a dataset of 6 tables covering roughly 50k orders placed between 
September 2020 and June 2022. Your goal is to build a prediction model with the highest 
possible performance using the provided data, while respecting the fundamental principles 
of good data science. You can be creative and innovative in how you use the available 
information (e.g., create new variables, use unstructured content, etc.), as you would in 
practice! The team that achieves the highest accuracy on the hold-out sample gets +2 bonus 
points on their assignment, provided of course that the modeling setup is sound 
(e.g., no AUC-hacking or other methods of cheating)! Furthermore, each team that solves the 
case using a multi-class classification model (where a probability is given per label) gets +1 
bonus point. Describe your approach in the technical section of the presentation. This section 
should be concise and aimed at a data science audience (e.g., describe variable creation, 
algorithms used, cross-validation approach, evaluation metric(s)).
In addition to focusing on prediction, also provide insights into which criteria are important for 
improving review scores. Think of 2 creative ways (e.g., website features, marketing efforts) 
in which BLU could improve review scores based on the insights from the data. Describe these elements 
in the business section of your presentation. This section should be written for middle to 
senior management responsible for business development.

#### Import functions

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.window import Window
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import RFormula
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, OneVsRest
from pyspark.ml.evaluation import BinaryClassificationEvaluator 
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

import pandas as pd

#### Orders

In [0]:
orders=spark\
      .read\
      .format("csv")\
      .option("header","true")\
      .option("inferSchema","true")\
      .load("/FileStore/tables/orders.csv")

# inspect the table
orders.show(3)
orders.printSchema()

In [0]:
# get the number of observations/rows
print('Total rows:', orders.count())

# distinct rows/observations
print('Unique rows:', orders.distinct().count())

# get number of distinct order_id
print('Unique order id:', orders.select("order_id").distinct().count())

In [0]:
# Remove rows whose order_id is the literal string "NA"
orders = orders.filter(~(orders.order_id == "NA"))

# row count after removing NA
orders.count()

In [0]:
# Drop duplicate order_id
orders = orders.dropDuplicates(["order_id"])
orders.count()

In [0]:
# rename columns
orders = orders.withColumnRenamed("order_purchase_timestamp", "order_date") \
               .withColumnRenamed("order_approved_at", "order_approved")\
               .withColumnRenamed("order_delivered_carrier_date", "delivered_to_logistics")\
               .withColumnRenamed("order_delivered_customer_date", "delivered_to_customer")\
               .withColumnRenamed("order_estimated_delivery_date", "estimated_dlvry_date")
orders.show(3)

In [0]:
# convert the date/time columns to dates (the time of day is dropped)
orders = orders.withColumn("order_date", to_date(col("order_date")))\
               .withColumn("order_approved", to_date(col("order_approved")))\
               .withColumn("delivered_to_logistics", to_date(col("delivered_to_logistics")))\
               .withColumn("delivered_to_customer", to_date(col("delivered_to_customer")))\
               .withColumn("estimated_dlvry_date", to_date(col("estimated_dlvry_date")))
orders.show(3)

In [0]:
# calculate time interval in days between order_dates, approval dates and delivered dates
orders = orders.withColumn("days_to_approval", datediff("order_approved",col("order_date"))) \
               .withColumn("days_to_delivery",datediff("delivered_to_customer",col("order_date"))) \
               .withColumn("delivery_response",datediff("delivered_to_customer",col("estimated_dlvry_date")))

# delivery_response: negative values indicate that delivery came earlier than estimated delivery date (eg 4 days before estimated delivery date)
#                  : positive values indicate that delivery came later than estimated delivery date (eg 4 days after estimated delivery date)

orders.show(3)
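
The sign convention of `delivery_response` can be checked with a few lines of plain Python. This is only an illustrative sketch with made-up dates; `delivery_response_days` is a hypothetical helper, not part of the notebook, and it mirrors `datediff(end, start)`, which returns `end - start` in days.

```python
from datetime import date

def delivery_response_days(delivered: date, estimated: date) -> int:
    """Days between actual and estimated delivery (delivered minus estimated)."""
    return (delivered - estimated).days

# Hypothetical orders sharing the same estimated delivery date
estimated = date(2021, 3, 14)
early = delivery_response_days(date(2021, 3, 10), estimated)
late = delivery_response_days(date(2021, 3, 18), estimated)

print(early, late)  # -4 4
```

A negative value therefore means the parcel arrived ahead of the estimate, and a positive value means it arrived late.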

In [0]:
# extract the months in which orders were placed and delivered from the order_date and delivered_to_customer columns
orders = orders.withColumn('order_month',month(orders.order_date))
orders = orders.withColumn('delivery_month',month(orders.delivered_to_customer))
display(orders)

In [0]:
# count null values in each column
display(orders.select([count(when(col(c).isNull(), c)).alias(c) for c in orders.columns]))

In [0]:
# sort by the date columns to investigate missing values
orders = orders.sort(col("order_date").asc(),
                     col("order_approved").asc(),
                     col("delivered_to_logistics").asc(),
                     col("delivered_to_customer").asc(),
                     col("estimated_dlvry_date").asc())
display(orders)

#### Order_payments

In [0]:
orders_payment = spark\
                .read\
                .format("csv")\
                .option("header","true")\
                .option("inferSchema","true")\
                .load("/FileStore/tables/order_payments.csv")

# inspect the table
orders_payment.show(3)
orders_payment.printSchema()

In [0]:
# get the number of observations/rows
print('Total rows:', orders_payment.count())

# distinct rows/observations
print('Unique rows:', orders_payment.distinct().count())

# get number of distinct order_id
print('Unique order id:', orders_payment.select("order_id").distinct().count())

In [0]:
# Remove rows whose order_id is the literal string "NA"
orders_payment = orders_payment.filter(~(orders_payment.order_id == "NA"))

# row count after removing NA
orders_payment.count()

In [0]:
# Dropping duplicate rows if exists
orders_payment = orders_payment.dropDuplicates()
orders_payment.count()

In [0]:
# Inspect the table
orders_payment.describe().show()

In [0]:
# Count the number of nulls per column
orders_payment.select([count(when(col(c).isNull(), c)).alias(c) for c in orders_payment.columns]).show()

In [0]:
#Calculate statistics by payment type
orders_payment.groupBy("payment_type").agg(count("payment_type").alias("count"),
                                            round(mean("payment_installments"),1).alias("avg_installment"),
                                            round(mean("payment_value"),1).alias("avg_payment")).show()

In [0]:
# get payment value for each payment method by order_id
payment_method = orders_payment\
                .groupBy("order_id")\
                .pivot("payment_type")\
                .sum("payment_value")

# replace null values by 0
payment_method = payment_method.fillna(0)
print(payment_method.count())
payment_method.show(3)
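
What the `groupBy(...).pivot(...).sum(...)` plus `fillna(0)` step produces can be sketched in plain Python. The rows and the `toy_` names below are invented for illustration and are not taken from the BLU data.

```python
from collections import defaultdict

# Hypothetical payment rows: (order_id, payment_type, payment_value)
toy_payments = [
    ("o1", "credit_card", 80.0),
    ("o1", "voucher", 20.0),
    ("o1", "voucher", 5.0),
    ("o2", "debit_card", 50.0),
]

# One output column per payment type seen in the data
toy_payment_types = sorted({ptype for _, ptype, _ in toy_payments})

# One row per order; every payment type starts at 0.0 (the fillna(0) step)
toy_pivot = defaultdict(lambda: dict.fromkeys(toy_payment_types, 0.0))
for order_id, ptype, value in toy_payments:
    toy_pivot[order_id][ptype] += value

for order_id, row in toy_pivot.items():
    print(order_id, row)
```

Each order ends up with one amount per payment type, with 0.0 for types it did not use.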

In [0]:
# Use Binarizer to create dummy variables of payment type with threshold of 0. 
# If the customer made no payment using that method, it assigns 0 else it assigns 1

convert_dummy = Binarizer(threshold=0.0, 
                          inputCols=["credit_card", "debit_card", "mobile", "voucher"], 
                          outputCols=["credit_pymt", "debit_pymt", "mobile_pymt", "voucher_pymt"])

payment_method = convert_dummy.transform(payment_method)

# no of rows/observations
print(payment_method.count())

payment_method.show(3)

In [0]:
# get count of the distinct payment methods used per order_id
payment_method = payment_method.withColumn("distinct_payment_modes", 
                                           expr("credit_pymt + debit_pymt + mobile_pymt + voucher_pymt"))

In [0]:
# Use Binarizer with threshold of 1 to see if an order had multiple payment methods. 
# If the order only had one payment method, it assigns 0 else it assigns 1
multiple_pymt_mode = Binarizer(threshold=1.0, 
                          inputCol="distinct_payment_modes", 
                          outputCol="multiple_pymt_mode")

payment_method = multiple_pymt_mode.transform(payment_method)

# no of rows/observations
print(payment_method.count())

payment_method.show(6)
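
The two `Binarizer` steps rely on the rule that values strictly greater than the threshold map to 1.0 and all other values map to 0.0. The sketch below mimics that rule in plain Python for one made-up order; `binarize` and the `toy_` names are hypothetical helpers, not Spark API.

```python
from math import fsum

def binarize(value: float, threshold: float) -> float:
    """Mimic Binarizer: 1.0 if value is strictly above threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

# Hypothetical summed payment values per method for a single order
toy_order = {"credit_card": 80.0, "debit_card": 0.0, "mobile": 0.0, "voucher": 25.0}

# Threshold 0: did the order use this payment method at all?
toy_flags = {method: binarize(amount, 0.0) for method, amount in toy_order.items()}

# fsum is used instead of the built-in sum, which the wildcard
# pyspark.sql.functions import in this notebook shadows
toy_distinct_modes = fsum(toy_flags.values())

# Threshold 1: did the order use more than one payment method?
toy_multiple_modes = binarize(toy_distinct_modes, 1.0)

print(toy_flags)
print(toy_distinct_modes, toy_multiple_modes)  # 2.0 1.0
```

Because the comparison is strict, an order with exactly one payment method gets `multiple_pymt_mode = 0`.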

In [0]:
# get orders that were paid in one installment
single_payment = orders_payment.filter(orders_payment.payment_installments <= 1)

# no of rows/observations
print(single_payment.count())

single_payment.show(3)

In [0]:
# get orders that were paid in more than one installment
mult_payments = orders_payment.filter(orders_payment.payment_installments > 1)

# no of rows/observations
print(mult_payments.count())

mult_payments.show(3)

In [0]:
# get orders that had both single and multiple payment installments, e.g. paying once with a voucher and in multiple installments with a credit card
# these orders have transactions in both the single_payment and mult_payments tables
hybrid_pymt = single_payment.join(mult_payments, "order_id", "inner").select(single_payment["*"])
print(hybrid_pymt.count())
hybrid_pymt.show(3)

In [0]:
# keep order ids that exist only in single_payment, i.e. not in the mult_payments or hybrid_pymt tables
single_payment = single_payment.join(hybrid_pymt, "order_id", "leftanti")
print(single_payment.count())
single_payment.show(3)

In [0]:
# add a mult_installments column to each of the single_payment, mult_payments and hybrid_pymt tables, stating whether the order was paid in multiple installments
# 0 for No and 1 for Yes
single_payment = single_payment.withColumn("mult_installments", lit(0))
hybrid_pymt = hybrid_pymt.withColumn("mult_installments", lit(1))
mult_payments = mult_payments.withColumn("mult_installments", lit(1))

In [0]:
# get a union of single_payment, mult_payments and hybrid_pymt tables
orders_payment2 = single_payment.union(hybrid_pymt)
orders_payment2 = orders_payment2.union(mult_payments)
print(orders_payment2.count())
display(orders_payment2)

In [0]:
# get the number of observations/rows
print('Total rows:', orders_payment2.count())

# distinct rows/observations
print('Unique rows:', orders_payment2.distinct().count())

# get number of distinct order_id
print('Unique order id:', orders_payment2.select("order_id").distinct().count())

In [0]:
# select only the necessary columns from orders_payment2
distinct_order_pymt = orders_payment2.select("order_id","mult_installments").distinct()
print(distinct_order_pymt.count())
distinct_order_pymt.show(3)

In [0]:
order_pymt_final = payment_method.join(distinct_order_pymt, "order_id", "inner")
print(order_pymt_final.count())
display(order_pymt_final)
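
As far as the cells above can be read, the split into `single_payment`, `hybrid_pymt` and `mult_payments`, followed by the union and `distinct()`, reduces to one rule per order: `mult_installments` is 1 if any payment row of the order has more than one installment, and 0 otherwise. The plain-Python sketch below applies that rule to made-up rows; the `toy_` names are hypothetical.

```python
# Hypothetical payment rows: (order_id, payment_installments)
toy_installments = [
    ("o1", 1),  # single installment only
    ("o2", 1),  # voucher paid once ...
    ("o2", 6),  # ... plus a credit card in six installments (hybrid)
    ("o3", 3),  # multiple installments only
]

toy_mult_installments = {}
for order_id, installments in toy_installments:
    previous = toy_mult_installments.get(order_id, 0)
    # Once an order has a multi-installment payment, the flag stays at 1
    toy_mult_installments[order_id] = 1 if installments > 1 else previous

print(toy_mult_installments)  # {'o1': 0, 'o2': 1, 'o3': 1}
```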

#### Order_reviews

In [0]:
order_reviews = spark\
                .read\
                .format("csv")\
                .option("header","true")\
                .option("inferSchema","true")\
                .load("/FileStore/tables/order_reviews.csv")

# inspect the table
order_reviews.show(3)
order_reviews.printSchema()

In [0]:
# inspect data
order_reviews.describe().show()

In [0]:
# get the number of observations/rows
print('Total rows:', order_reviews.count())

# get number of distinct order_id
print('Unique order id:', order_reviews.select("order_id").distinct().count())

# get number of distinct review_id
print('Unique review id:', order_reviews.select("review_id").distinct().count())

In [0]:
# get number of reviews per order 
display(order_reviews.groupBy("order_id")\
                     .agg(count("review_id").alias("review_count")))

In [0]:
# rank rows within each review_id by answer timestamp (newest first) and keep only the most recent row per review_id
review = order_reviews.withColumn("unique_review_rowID",row_number()
                      .over(Window.partitionBy("review_id")
                      .orderBy(col("review_answer_timestamp")
                      .desc())))

review = review.filter(review.unique_review_rowID==1)
print('Unique review id:', review.select("review_id").distinct().count())
display(review)

In [0]:
# counting review_id for each order 
display(review.groupBy("order_id")
              .agg(count("review_id").alias("review_cnt")))

In [0]:
# rank rows within each order_id by answer timestamp (newest first) and keep only the most recent review per order
fnl_review = review.withColumn("unique_order_rowID",row_number()
                   .over(Window.partitionBy("order_id")
                   .orderBy(col("review_answer_timestamp")
                   .desc())))

fnl_review = fnl_review.filter(fnl_review.unique_order_rowID==1)

print('Unique order id:', fnl_review.select("order_id").distinct().count())
print('Unique review id:', fnl_review.select("review_id").distinct().count())

fnl_review = fnl_review.drop(*('unique_order_rowID',
                             'unique_review_rowID'))
display(fnl_review)
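
The `row_number()` over a window ordered by descending timestamp, followed by a filter on row 1, is a "keep the latest record per key" pattern. A minimal plain-Python sketch of the same idea, using invented rows and hypothetical `toy_` names:

```python
# Hypothetical review rows: (review_id, order_id, review_answer_timestamp)
toy_reviews = [
    ("r1", "o1", "2021-05-01 10:00:00"),
    ("r2", "o1", "2021-05-03 09:30:00"),
    ("r3", "o2", "2021-06-11 18:45:00"),
]

toy_latest_by_order = {}
for review_id, order_id, answered_at in toy_reviews:
    kept = toy_latest_by_order.get(order_id)
    # Timestamps in this fixed format sort chronologically as plain strings
    if kept is None or answered_at > kept[2]:
        toy_latest_by_order[order_id] = (review_id, order_id, answered_at)

print(toy_latest_by_order["o1"][0])  # r2
```

Order `o1` has two reviews and only the later one (`r2`) survives, which is what the filter on `unique_order_rowID == 1` does per `order_id`.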

In [0]:
# convert the answer timestamp to a date and compute the response time in days (answer date minus review creation date)
fnl_review = fnl_review.withColumn("review_answer_timestamp",to_date(col("review_answer_timestamp")))
fnl_review = fnl_review.withColumn("review_repns_time",datediff("review_answer_timestamp",col("review_creation_date")))
display(fnl_review)

In [0]:
#Count the number of nulls per column
fnl_review.select([count(when(col(c).isNull(), c)).alias(c) for c in fnl_review.columns]).show()

In [0]:
#Inspect data
fnl_review.describe().show()

#### Order_items

In [0]:
order_items = spark\
            .read\
            .format("csv")\
            .option("header","true")\
            .option("inferSchema","true")\
            .load("/FileStore/tables/order_items.csv")

# inspect the table
order_items.show(3)
order_items.printSchema()

In [0]:
# number of observations/rows
print('Total rows:', order_items.count())

#count of unique product id
print('Unique product id:', order_items.select("product_id").distinct().count())

#count of unique order_id
print('Unique order id:', order_items.select("order_id").distinct().count())

In [0]:
#Remove duplicates
order_items = order_items.drop_duplicates() 
print('Unique order id:', order_items.select("order_id").distinct().count())

In [0]:
# inspect data
order_items.describe().show()

#### Products

In [0]:
products=spark\
        .read\
        .format("csv")\
        .option("header","true")\
        .option("inferSchema","true")\
        .load("/FileStore/tables/products.csv")

# inspect the table
products.show(3)
products.printSchema()

In [0]:
# number of observations/rows
print('Total rows:', products.count())

#count of unique product id
print('Unique product id:', products.select("product_id").distinct().count())

In [0]:
# merge products and order_items table based on product id
prod_order = order_items.join(products,on=["product_id"], how="left")
display(prod_order)

#### Category

In [0]:
category_df = spark\
              .read\
              .format("csv")\
              .option("header","true")\
              .option("inferSchema","true")\
              .load("/FileStore/tables/products_cat_dict-2.csv")

# inspect the table
category_df.show(3)
category_df.printSchema()

In [0]:
# merge prod_order and category_df
prod_order = prod_order.join(category_df,on=["product_category_name"], how="left")
display(prod_order)

In [0]:
#group prod_order by order id and aggregate other columns 
fnl_prd_order=prod_order.groupBy("order_id")\
                        .agg(countDistinct("product_id").alias("nbr_distict_item"),
                             count("product_id").alias("total_qty"), 
                             round(sum("price"),2).alias("total_price"),
                             round(sum("shipping_cost"),2).alias("total_ship_cost"),
                             round(mean("product_name_lenght"),2).alias("order_prod_avg_length"),
                             round(mean("product_description_lenght"),2).alias("order_avg_desc_length"),
                             round(mean("product_photos_qty"),2).alias("order_avg_photo_qty"),
                             round(mean("product_weight_g"),2).alias("order_avg_weight_g"),
                             round(mean("product_length_cm"),2).alias("order_avg_length_cm"),
                             round(mean("product_height_cm"),2).alias("order_avg_height_cm"),
                             round(mean("product_width_cm"),2).alias("order_avg_width_cm"),
                             countDistinct("new_categories").alias("distinct_prod_category"))

In [0]:
# approximate order volume in cm^3 as the product of the averaged dimensions
fnl_prd_order = fnl_prd_order.withColumn("order_avg_volume_cm", 
                                           round(expr("order_avg_length_cm * order_avg_height_cm * order_avg_width_cm"),2))

# check if distinct products were ordered; 1 - Yes, 0 - No
fnl_prd_order = fnl_prd_order.withColumn("distinct_items", 
                                         when(col("nbr_distict_item")>1, 1).otherwise(0))

# check if products ordered had distinct product categories; 1 - Yes, 0 - No
fnl_prd_order = fnl_prd_order.withColumn("multiple_catgories", 
                                         when(col("distinct_prod_category")>1, 1).otherwise(0))

display(fnl_prd_order)
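
Note that `order_avg_volume_cm` multiplies the averaged dimensions, which is generally not the same as averaging the per-item volumes. The arithmetic below uses two made-up items to show the difference; the `toy_` names are hypothetical, and the rounding to two decimals done in the notebook is left out.

```python
from math import fsum

# Hypothetical (length_cm, height_cm, width_cm) for two items in one order
toy_items = [(10.0, 10.0, 10.0), (30.0, 20.0, 10.0)]
n_items = len(toy_items)

avg_length = fsum(l for l, _, _ in toy_items) / n_items  # 20.0
avg_height = fsum(h for _, h, _ in toy_items) / n_items  # 15.0
avg_width = fsum(w for _, _, w in toy_items) / n_items   # 10.0

# What the notebook computes: product of the averaged dimensions
volume_of_averages = avg_length * avg_height * avg_width

# Alternative: average of the per-item volumes
average_of_volumes = fsum(l * h * w for l, h, w in toy_items) / n_items

print(volume_of_averages, average_of_volumes)  # 3000.0 3500.0
```

Either quantity can serve as a size feature; the point is only that the column is an approximation of order volume rather than an exact average.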

In [0]:
display(fnl_prd_order.describe())

In [0]:
# replace null values in fnl_prd_order with -999
fnl_prd_order = fnl_prd_order.fillna(-999)
display(fnl_prd_order.describe())

#### Basetable

In [0]:
# merge all tables to create base table
base = fnl_review.join(orders, on=["order_id"], how="left")
base = base.join(order_pymt_final,on=["order_id"], how="left")
base = base.join(fnl_prd_order,on=["order_id"], how="left")
print('Basetable rows:', base.count())

In [0]:
display(base)

In [0]:
base.printSchema()

In [0]:
len(base.columns)

In [0]:
base = base.withColumn("review_sent", when(((col("review_creation_date") <= col("order_date")) |
                                            (col("review_creation_date") < col("delivered_to_logistics"))),"Before Logistics")
                                     .when(((col("review_creation_date") >= col("delivered_to_logistics")) &
                                            (col("review_creation_date") < col("delivered_to_customer"))),"After Logistics")
                                     .when(col("review_creation_date") >= col("delivered_to_customer"),"After Delivery")
                                     .otherwise("match not found"))

# Before Logistics: review was sent on or before the order date (to account for orders that were cancelled and never reached the logistics company)
#                   or before the order was handed to the logistics company.
# After Logistics: review was sent on or after the date the order was handed to the logistics company, but before it was delivered to the customer
# After Delivery: review was sent on or after the date the order was delivered to the customer

display(base)
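
The `otherwise("match not found")` branch is reached when the dates needed for a comparison are missing: a comparison against a null date is neither true nor false, and a `when` branch is taken only if its condition is true. The sketch below mimics this three-valued logic in plain Python, with `None` standing in for null; `cmp3`, `or3`, `and3` and `review_sent` are hypothetical helpers and the dates are invented.

```python
import operator
from datetime import date

def cmp3(op, a, b):
    """Compare a and b, returning None (unknown) if either side is missing."""
    return None if a is None or b is None else op(a, b)

def or3(x, y):
    if x is True or y is True:
        return True
    return None if x is None or y is None else False

def and3(x, y):
    if x is False or y is False:
        return False
    return None if x is None or y is None else True

def review_sent(created, ordered, to_logistics, to_customer):
    # A branch is taken only when its condition is True, never when it is None
    if or3(cmp3(operator.le, created, ordered),
           cmp3(operator.lt, created, to_logistics)) is True:
        return "Before Logistics"
    if and3(cmp3(operator.ge, created, to_logistics),
            cmp3(operator.lt, created, to_customer)) is True:
        return "After Logistics"
    if cmp3(operator.ge, created, to_customer) is True:
        return "After Delivery"
    return "match not found"

ordered, shipped, delivered = date(2021, 3, 1), date(2021, 3, 3), date(2021, 3, 10)
print(review_sent(date(2021, 3, 20), ordered, shipped, delivered))  # After Delivery
print(review_sent(date(2021, 3, 20), ordered, shipped, None))       # match not found
```

In the second call the order has no delivery date, so neither the second nor the third condition can be evaluated and the row falls through to the default label.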

In [0]:
# determine whether the customer appears once in the base table (New) or more than once (Return)
cust_type = base.groupBy("customer_id")\
                .agg(count("customer_id").alias("count"))\
                .withColumn("customer_type",
                            when(col("count")>1,"Return")\
                           .otherwise("New"))
print(cust_type.count())
cust_type.show(5)
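
The labelling rule can be sketched with a `Counter` over made-up customer ids (the `toy_` names are hypothetical). Because the count is taken over the whole table, every order of a repeat customer is labelled "Return", including that customer's first order.

```python
from collections import Counter

# Hypothetical customer_id column of the base table, one entry per order
toy_customer_ids = ["c1", "c2", "c1", "c3"]

toy_orders_per_customer = Counter(toy_customer_ids)
toy_customer_type = {
    customer_id: "Return" if n_orders > 1 else "New"
    for customer_id, n_orders in toy_orders_per_customer.items()
}

print(toy_customer_type)  # {'c1': 'Return', 'c2': 'New', 'c3': 'New'}
```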

In [0]:
# merge cust_type to base table
cust_type = cust_type.select("customer_id","customer_type")
base = base.join(cust_type, on=["customer_id"], how="left")
print(base.count())
display(base)

In [0]:
# drop rows that contain a null value in any column
base = base.dropna(how="any")

#Count the number of nulls per column
display(base.select([count(when(col(c).isNull(), c)).alias(c) for c in base.columns]))

In [0]:
display(base)

In [0]:
s=base.groupby('order_status').count()

### Holdout Data

#### Holdout Orders

In [0]:
test_orders=spark\
            .read\
            .format("csv")\
            .option("header","true")\
            .option("inferSchema","true")\
            .load("/FileStore/tables/test_orders.csv")

# get the number of observations/rows
print('Total rows:', test_orders.count())

# distinct rows/observations
print('Unique rows:', test_orders.distinct().count())

# get number of distinct order_id
print('Unique order id:', test_orders.select("order_id").distinct().count())

# Remove NA values from data
test_orders = test_orders.filter(~(test_orders.order_id == "NA"))

# Drop duplicate order_id
test_orders = test_orders.dropDuplicates(["order_id"])

# rename columns
test_orders = test_orders.withColumnRenamed("order_purchase_timestamp", "order_date") \
                         .withColumnRenamed("order_approved_at", "order_approved")\
                         .withColumnRenamed("order_delivered_carrier_date", "delivered_to_logistics")\
                         .withColumnRenamed("order_delivered_customer_date", "delivered_to_customer")\
                         .withColumnRenamed("order_estimated_delivery_date", "estimated_dlvry_date")

#convert string formats of date to date formats
test_orders = test_orders.withColumn("order_date", to_date(col("order_date")))\
                         .withColumn("order_approved", to_date(col("order_approved")))\
                         .withColumn("delivered_to_logistics", to_date(col("delivered_to_logistics")))\
                         .withColumn("delivered_to_customer", to_date(col("delivered_to_customer")))\
                         .withColumn("estimated_dlvry_date", to_date(col("estimated_dlvry_date")))

# calculate time interval in days between order_dates, approval dates and delivered dates
test_orders = test_orders.withColumn("days_to_approval", datediff("order_approved",col("order_date"))) \
                         .withColumn("days_to_delivery",datediff("delivered_to_customer",col("order_date"))) \
                         .withColumn("delivery_response",datediff("delivered_to_customer",col("estimated_dlvry_date")))

# delivery_response: negative values indicate that delivery came earlier than estimated delivery date (eg 4 days before estimated delivery date)
#                  : positive values indicate that delivery came later than estimated delivery date (eg 4 days after estimated delivery date)

# extract the months in which orders were placed and delivered from the order_date and delivered_to_customer columns
test_orders = test_orders.withColumn('order_month',month(test_orders.order_date))
test_orders = test_orders.withColumn('delivery_month',month(test_orders.delivered_to_customer))

# count null values in each column
display(test_orders.select([count(when(col(c).isNull(), c)).alias(c) for c in test_orders.columns]))

# sort by the date columns to investigate missing values
test_orders = test_orders.sort(col("order_date").asc(),
                               col("order_approved").asc(),
                               col("delivered_to_logistics").asc(),
                               col("delivered_to_customer").asc(),
                               col("estimated_dlvry_date").asc())

print('Final rows:', test_orders.count())
display(test_orders)

#### Holdout Order Payments

In [0]:
test_order_pymts = spark\
                  .read\
                  .format("csv")\
                  .option("header","true")\
                  .option("inferSchema","true")\
                  .load("/FileStore/tables/test_order_payments.csv")

# get the number of observations/rows
print('Total rows:', test_order_pymts.count())

# distinct rows/observations
print('Unique rows:', test_order_pymts.distinct().count())

# get number of distinct order_id
print('Unique order id:', test_order_pymts.select("order_id").distinct().count())

# Removing NA values from data
test_order_pymts = test_order_pymts.filter(~(test_order_pymts.order_id == "NA"))

# Dropping duplicate rows if exists
test_order_pymts = test_order_pymts.dropDuplicates()

# get payment value for each payment method by order_id
test_pymt_method = test_order_pymts\
                    .groupBy("order_id")\
                    .pivot("payment_type")\
                    .sum("payment_value")

# replace null values by 0
test_pymt_method = test_pymt_method.fillna(0)

# Use Binarizer to create dummy variables of payment type with threshold of 0. 
# If the customer made no payment using that method, it assigns 0 else it assigns 1
test_convert_dummy = Binarizer(threshold=0.0, 
                               inputCols=["credit_card", "debit_card", "mobile", "voucher"], 
                               outputCols=["credit_pymt", "debit_pymt", "mobile_pymt", "voucher_pymt"])

test_pymt_method = test_convert_dummy.transform(test_pymt_method)

# get count of the distinct payment methods used per order_id
test_pymt_method = test_pymt_method.withColumn("distinct_payment_modes", 
                                               expr("credit_pymt + debit_pymt + mobile_pymt + voucher_pymt"))

# Use Binarizer with threshold of 1 to see if an order had multiple payment methods. 
# If the order only had one payment method, it assigns 0 else it assigns 1
test_mult_pymt_mode = Binarizer(threshold=1.0, 
                                inputCol="distinct_payment_modes", 
                                outputCol="multiple_pymt_mode")

test_pymt_method = test_mult_pymt_mode.transform(test_pymt_method)

# get orders that were paid in one installment
test_single_pymt = test_order_pymts.filter(test_order_pymts.payment_installments <= 1)

# get orders that were paid in more than one installment
test_mult_pymt = test_order_pymts.filter(test_order_pymts.payment_installments > 1)

# get orders that had both single and multiple payment installments, e.g. paying once with a voucher and in multiple installments with a credit card
# these orders have transactions in both the test_single_pymt and test_mult_pymt tables
test_hybrid_pymt = test_single_pymt.join(test_mult_pymt, "order_id", "inner").select(test_single_pymt["*"])

# keep order ids that exist only in test_single_pymt, i.e. not in the test_mult_pymt or test_hybrid_pymt tables
test_single_pymt = test_single_pymt.join(test_hybrid_pymt, "order_id", "leftanti")

# add a mult_installments column to each of the test_single_pymt, test_mult_pymt and test_hybrid_pymt tables, stating whether the order was paid in multiple installments
# 0 for No and 1 for Yes
test_single_pymt = test_single_pymt.withColumn("mult_installments", lit(0))
test_hybrid_pymt = test_hybrid_pymt.withColumn("mult_installments", lit(1))
test_mult_pymt = test_mult_pymt.withColumn("mult_installments", lit(1))

# get a union of the test_single_pymt, test_hybrid_pymt and test_mult_pymt tables
test_order_pymts2 = test_single_pymt.union(test_hybrid_pymt)
test_order_pymts2 = test_order_pymts2.union(test_mult_pymt)

# select only the necessary columns from test_order_pymts2
test_distinct_order_pymt = test_order_pymts2.select("order_id","mult_installments").distinct()

test_order_pymt_final = test_pymt_method.join(test_distinct_order_pymt, "order_id", "inner")
print('Final rows:', test_order_pymt_final.count())
display(test_order_pymt_final)

#### Holdout Order Items

In [0]:
test_order_items = spark\
                  .read\
                  .format("csv")\
                  .option("header","true")\
                  .option("inferSchema","true")\
                  .load("/FileStore/tables/test_order_items.csv")

# number of observations/rows
print('Total rows:', test_order_items.count())

#count of unique product id
print('Unique product id:', test_order_items.select("product_id").distinct().count())

#count of unique order_id
print('Unique order id:', test_order_items.select("order_id").distinct().count())

#Remove duplicates
test_order_items = test_order_items.drop_duplicates() 
print('Final rows:', test_order_items.count())

display(test_order_items)

#### Holdout Products

In [0]:
test_products = spark\
                .read\
                .format("csv")\
                .option("header","true")\
                .option("inferSchema","true")\
                .load("/FileStore/tables/test_products.csv")

# number of observations/rows
print('Total rows:', test_products.count())

#count of unique product id
print('Unique product id:', test_products.select("product_id").distinct().count())

# merge test_order_items and test_products tables based on product id
test_prod_order = test_order_items.join(test_products,on=["product_id"], how="left")

# merge test_prod_order and category_df
test_prod_order = test_prod_order.join(category_df,on=["product_category_name"], how="left")
print('Rows after merge:', test_prod_order.count())

# group test_prod_order by order id and aggregate other columns 
fnl_test_prd_order = test_prod_order.groupBy("order_id")\
                                    .agg(countDistinct("product_id").alias("nbr_distict_item"),
                                         count("product_id").alias("total_qty"), 
                                         round(sum("price"),2).alias("total_price"),
                                         round(sum("shipping_cost"),2).alias("total_ship_cost"),
                                         round(mean("product_name_lenght"),2).alias("order_prod_avg_length"),
                                         round(mean("product_description_lenght"),2).alias("order_avg_desc_length"),
                                         round(mean("product_photos_qty"),2).alias("order_avg_photo_qty"),
                                         round(mean("product_weight_g"),2).alias("order_avg_weight_g"),
                                         round(mean("product_length_cm"),2).alias("order_avg_length_cm"),
                                         round(mean("product_height_cm"),2).alias("order_avg_height_cm"),
                                         round(mean("product_width_cm"),2).alias("order_avg_width_cm"),
                                         countDistinct("new_categories").alias("distinct_prod_category"))

# approximate order volume in cm^3 as the product of the averaged dimensions
fnl_test_prd_order = fnl_test_prd_order.withColumn("order_avg_volume_cm", 
                                                   round(expr("order_avg_length_cm * order_avg_height_cm * order_avg_width_cm"),2))

# check if distinct products were ordered; 1 - Yes, 0 - No
fnl_test_prd_order = fnl_test_prd_order.withColumn("distinct_items", 
                                                   when(col("nbr_distict_item")>1, 1).otherwise(0))

# check if products ordered had distinct product categories; 1 - Yes, 0 - No
fnl_test_prd_order = fnl_test_prd_order.withColumn("multiple_catgories", 
                                                   when(col("distinct_prod_category")>1, 1).otherwise(0))

# replace null values in fnl_test_prd_order with -999
fnl_test_prd_order = fnl_test_prd_order.fillna(-999)
print('Final rows:', fnl_test_prd_order.count())

display(fnl_test_prd_order)

#### Holdout Basetable

In [0]:
# merge all tables to create base table
test_base = test_order_pymt_final.join(test_orders, on=["order_id"], how="left")
test_base = test_base.join(fnl_test_prd_order,on=["order_id"], how="left")
print('Initial basetable rows:', test_base.count())

test_cust_type = test_base.groupBy("customer_id")\
                          .agg(count("customer_id").alias("count"))\
                          .withColumn("customer_type",
                                      when(col("count")>1,"Return")\
                                     .otherwise("New"))

# merge test_cust_type to test_base table
test_cust_type = test_cust_type.select("customer_id","customer_type")
test_base = test_base.join(test_cust_type, on=["customer_id"], how="left")

# drop rows that still contain null values
test_base = test_base.dropna(how="any")

#Count the number of nulls per column
display(test_base.select([count(when(col(c).isNull(), c)).alias(c) for c in test_base.columns]))

print('Final basetable rows:', test_base.count())
display(test_base)

#### Multi-Class Classification

In [0]:
base.groupby("review_score").count().show()
print('Total rows:', base.count())

In [0]:
# drop identifiers, raw timestamps and review fields that are not used as features
fnl_base = base.drop("order_id", "customer_id", "review_id", "review_answer_timestamp", "review_creation_date",
                     "review_catagory", "review_repns_time", "order_status", "order_date", "order_approved",
                     "estimated_dlvry_date", "delivered_to_logistics", "delivered_to_customer")
fnl_base.printSchema()


In [0]:
fnl_base = fnl_base.drop(*("credit_pymt","debit_pymt","mobile_pymt","voucher_pymt","order_prod_avg_length","order_avg_height_cm","order_avg_width_cm","review_sent","review_repns_time","days_to_approva"))

In [0]:
display(fnl_base)
fnl_base.columns

## Feature vectors

In [0]:

# use RFormula to build the feature vector and the label column
rform = RFormula(formula="review_score ~ .")
dfc = rform.fit(fnl_base).transform(fnl_base).select("features","label")



## Splitting data

In [0]:
# split the data into training and test sets
dfc_train_new, dfc_test_new = dfc.randomSplit([0.7, 0.3],seed=123)

## Training, testing and evaluation

In [0]:
# random forest baseline with default hyperparameters
rf_new= RandomForestClassifier().fit(dfc_train_new)
rf_new_pred = rf_new.transform(dfc_test_new)
print("Accuracy:", MulticlassClassificationEvaluator(metricName="accuracy").evaluate(rf_new_pred))
print("F1 Score:", MulticlassClassificationEvaluator(metricName="f1").evaluate(rf_new_pred))

In [0]:
dt_model = DecisionTreeClassifier().fit(dfc_train_new)
dt_pred2 = dt_model.transform(dfc_test_new)
print("Accuracy:", MulticlassClassificationEvaluator(metricName="accuracy").evaluate(dt_pred2))
print("F1 Score:", MulticlassClassificationEvaluator(metricName="f1").evaluate(dt_pred2))
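The two metrics printed above can diverge noticeably when one review score dominates. The following is a minimal plain-Python sketch of how accuracy and a support-weighted F1 are computed; it assumes that Spark's `f1` metric is the per-class F1 averaged by class frequency, and the toy labels are hypothetical.

```python
from collections import Counter

def accuracy(labels, preds):
    """Share of rows whose predicted label equals the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's share of the true labels."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n / len(labels)
    return total

# Hypothetical toy labels: score 5 dominates and the "model" always predicts 5
labels = [5] * 8 + [1] * 2
preds = [5] * 10

acc = accuracy(labels, preds)
f1 = weighted_f1(labels, preds)
print("Accuracy:", acc)
print("Weighted F1:", round(f1, 4))
```

In this toy case the weighted F1 is lower than the accuracy because the minority class is never predicted, which is why both numbers are worth reporting on an imbalanced target.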

## Feature Selection

In [0]:
import pandas as pd

def ExtractFeatureImp(featureImp, dataset, featuresCol):
    # collect the name and vector index of every slot in the assembled feature vector
    list_extract = []
    for i in dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]:
        list_extract = list_extract + dataset.schema[featuresCol].metadata["ml_attr"]["attrs"][i]
    varlist = pd.DataFrame(list_extract)
    # look up the importance of each slot and sort from most to least important
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[x])
    return(varlist.sort_values('score', ascending = False))
  
ExtractFeatureImp(dt_model.featureImportances, dfc_train_new, "features").head(15) 

In [0]:
# reuse ExtractFeatureImp (defined above) for the random forest model
ExtractFeatureImp(rf_new.featureImportances, dfc_train_new, "features").head(15)
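`ExtractFeatureImp` works on the metadata of the feature vector rather than on the column list, which matters once a categorical column is expanded into several slots. The sketch below reproduces the same lookup in plain Python; the `attrs` dictionary and the importance values are hypothetical stand-ins for what `dataset.schema["features"].metadata["ml_attr"]["attrs"]` and `featureImportances` would contain.

```python
# Hypothetical metadata in the shape that ExtractFeatureImp reads:
# attribute groups, each a list of {"idx": slot index, "name": slot name}
attrs = {
    "numeric": [{"idx": 0, "name": "days_to_delivery"},
                {"idx": 1, "name": "total_price"}],
    "binary": [{"idx": 2, "name": "customer_type_New"}],
}

# Hypothetical importances, one value per slot of the feature vector
importances = [0.55, 0.15, 0.30]

def rank_features(attrs, importances):
    """Flatten the attribute groups and pair each slot name with its importance."""
    slots = [attr for group in attrs.values() for attr in group]
    ranked = [(attr["name"], importances[attr["idx"]]) for attr in slots]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

ranking = rank_features(attrs, importances)
for name, score in ranking:
    print(name, score)
```

Looking names up by `idx` keeps the ranking correct even when the number of vector slots differs from the number of input columns.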

## Retraining with selected features

In [0]:
# each column is selected once, in the same order as the hold-out selection further down
fnl_features = fnl_base.select("delivery_response", "days_to_delivery","total_qty","distinct_items","total_ship_cost","delivery_month","multiple_catgories","order_avg_weight_g","order_avg_desc_length","order_avg_volume_cm","total_price","review_score")

In [0]:

rform = RFormula(formula="review_score ~ .")
feature_random = rform.fit(fnl_features).transform(fnl_features).select("features","label")


In [0]:
feature_random_train, feature_random_test = feature_random.randomSplit([0.7, 0.3],seed=123)

## Training and testing

In [0]:
rf_feature_selected= RandomForestClassifier().fit(feature_random_train)
rf_feature_selected_pred = rf_feature_selected.transform(feature_random_test)
print("Accuracy:", MulticlassClassificationEvaluator(metricName="accuracy").evaluate(rf_feature_selected_pred))
print("F1 Score:", MulticlassClassificationEvaluator(metricName="f1").evaluate(rf_feature_selected_pred))

In [0]:
lr = LogisticRegression()
ovr = OneVsRest(classifier=lr)
ovr_model = ovr.fit(feature_random_train)

In [0]:
ovr_model_pred = ovr_model.transform(feature_random_test)

In [0]:
print("Accuracy:", MulticlassClassificationEvaluator(metricName="accuracy").evaluate(ovr_model_pred))


In [0]:
from pyspark.ml.classification import OneVsRest, LogisticRegression, RandomForestClassifier
from pyspark.ml.classification import GBTClassifier, FMClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# wrap each base classifier in OneVsRest so that binary-only learners (GBT, FM) can handle the multi-class label
classifiers = [OneVsRest(classifier=GBTClassifier()),
               OneVsRest(classifier=FMClassifier()),
               OneVsRest(classifier=LogisticRegression()),
               OneVsRest(classifier=RandomForestClassifier())]
classifier_names = ["GBTClassifier", "FMClassifier", "Logistic Regression", "Random Forest"]
# Split the data into training and test sets
training_data, test_data = feature_random.randomSplit([0.7, 0.3], seed=12345)

# Accuracy is used to compare the classifiers
evaluator = MulticlassClassificationEvaluator(metricName='accuracy')

# Loop through the classifiers and evaluate their performance
for classifier, name in zip(classifiers, classifier_names):
    model = classifier.fit(training_data)
    prediction = model.transform(test_data)
    accuracy = evaluator.evaluate(prediction)

    print("Classifier: ", name)
    print("Accuracy: ", accuracy)

In [0]:
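from pyspark.ml.classification import GBTClassifier, OneVsRest
# GBTClassifier only supports binary labels, so it is wrapped in OneVsRest for the multi-class target
gbt = OneVsRest(classifier=GBTClassifier()).fit(feature_random_train)
# score the test split with the GBT model itself
gbt_pred = gbt.transform(feature_random_test)
print("Accuracy:", MulticlassClassificationEvaluator(metricName="accuracy").evaluate(gbt_pred))
print("F1 Score:", MulticlassClassificationEvaluator(metricName="f1").evaluate(gbt_pred))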

In [0]:
fnl_features = fnl_base.select("delivery_response", "days_to_delivery","total_qty","distinct_items","total_ship_cost","delivery_month","multiple_catgories","order_avg_weight_g","order_avg_desc_length","order_avg_volume_cm","total_price","review_score")

## Oversampling technique

In [0]:
from pyspark.sql.functions import rand

# Count the rows in each review_score class
class_counts = {row["review_score"]: row["count"]
                for row in fnl_base.groupBy("review_score").count().collect()}

# Size of the largest class; every smaller class is oversampled up to roughly this size
max_count = max(class_counts.values())

# Sample each minority class with replacement; keep the largest class unchanged
class_dfs = []
for score, n in class_counts.items():
    class_df = fnl_base.where(fnl_base["review_score"] == score)
    if n < max_count:
        class_df = class_df.sample(True, max_count / n, seed=0)
    class_dfs.append(class_df)

# Combine the class dataframes into a single dataframe
oversampled_df = class_dfs[0]
for class_df in class_dfs[1:]:
    oversampled_df = oversampled_df.union(class_df)

# Shuffle the rows of the oversampled dataframe
oversampled_df = oversampled_df.orderBy(rand())
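Because oversampling duplicates rows, splitting the oversampled table afterwards can put copies of the same order in both the training and the test set, which tends to inflate the test score. A common alternative is to split first and oversample only the training part. The sketch below illustrates that order of operations in plain Python; the toy rows, the fixed split and the helper `oversample` are hypothetical and not part of the notebook.

```python
import random
from collections import Counter

def oversample(rows, label_key, seed=0):
    """Resample every class with replacement until it matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # add randomly chosen copies until the class reaches the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy rows: 8 five-star reviews and 2 one-star reviews
rows = [{"id": i, "review_score": 5} for i in range(8)] + \
       [{"id": i, "review_score": 1} for i in range(8, 10)]

# Split first (a fixed split is used here so the example is deterministic) ...
train = rows[:6] + rows[8:9]
test = rows[6:8] + rows[9:]

# ... then oversample the training part only
train_balanced = oversample(train, "review_score")

print(Counter(r["review_score"] for r in train_balanced))
print("ids shared with test:", {r["id"] for r in train_balanced} & {r["id"] for r in test})
```

The test rows keep their original class distribution, so the reported metric reflects what the model would face on the hold-out sample.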

In [0]:
#getting features and labels
rform = RFormula(formula="review_score ~ .")
dfc_over = rform.fit(oversampled_df).transform(oversampled_df).select("features","label")


In [0]:
display(dfc_over)

In [0]:
# splitting data
dfc_over_train, dfc_over_test = dfc_over.randomSplit([0.8, 0.2],seed=123)

In [0]:
# Training the model on the oversampled data and scoring it with that same model
dt_model_ovr = DecisionTreeClassifier().fit(dfc_over_train)
dt_pred2_ovr = dt_model_ovr.transform(dfc_over_test)
#print("Accuracy:", MulticlassClassificationEvaluator(metricName="accuracy").evaluate(dt_pred2_ovr))
#print("F1 Score:", MulticlassClassificationEvaluator(metricName="f1").evaluate(dt_pred2_ovr))

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

print("Accuracy:", MulticlassClassificationEvaluator(metricName="accuracy").evaluate(dt_pred2_ovr))

## Hold-out check

In [0]:
# Use only the features used in the model we have chosen
test_base1=test_base.select("delivery_response","days_to_delivery","total_qty","distinct_items","total_ship_cost","delivery_month","multiple_catgories","order_avg_weight_g","order_avg_desc_length","order_avg_volume_cm","total_price")
# Replace any remaining null values with -999
test_base1 = test_base1.fillna(-999)
display(test_base1)

In [0]:

from pyspark.ml.feature import VectorAssembler
# combine all independent variables into one vector using Vector Assembler
va = VectorAssembler(inputCols=test_base1.columns,outputCol="features")
vaDF = va.transform(test_base1)
# keep a features-only view without overwriting the assembler `va`
holdout_features = vaDF.select("features")
display(vaDF)

In [0]:
#va=vaDF.select("features")

In [0]:
holdout_pred = rf_feature_selected.transform(vaDF)


In [0]:
holdout_pred.display()

## Additional work

In [0]:
feature_names = fnl_base.columns[:-1]

In [0]:
from pyspark.ml.stat import ChiSquareTest
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
result = ChiSquareTest.test(dfc,"features","label").head()


In [0]:
pValues = result.pValues
pValues

In [0]:
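dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
# fit on the 70% training split of dfc (dfc_train_new is assumed to be the intended training set)
model = dt.fit(dfc_train_new)

# Extract the feature importances
importances = model.featureImportances.toArray()

# Get the names of the features from fnl_base, the table dfc was built from
# (assumes review_score is the last column and no column was expanded by one-hot encoding)
feature_names = fnl_base.columns[:-1]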

# Zip the feature names and importances together and sort them by importance
feature_importances = list(zip(feature_names, importances))
feature_importances.sort(key=lambda x: x[1], reverse=True)

# Select the top k features
k = 10
top_features = [x[0] for x in feature_importances[:k]]

In [0]:
import matplotlib.pyplot as plt
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator


In [0]:
# splitting the dataset
dfc_train, dfc_test = dfc.randomSplit([0.7, 0.3],seed=123)

regParam = [0.01, 0.1]
elasticNetParam = [0.0, 0.5]
lr = LogisticRegression(family="multinomial")
# Accuracy evaluator (the default metric of MulticlassClassificationEvaluator is f1)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
# Track the best model and its accuracy
bestModel = None
bestAccuracy = 0

# Loop over the hyperparameters
for r in regParam:
    for e in elasticNetParam:
        # Set the hyperparameters
        lr.setRegParam(r)
        lr.setElasticNetParam(e)

        # Fit the model to the training data
        model = lr.fit(dfc_train)

        # Make predictions on the test data
        predictions = model.transform(dfc_test)

        # Evaluate the model's accuracy
        accuracy = evaluator.evaluate(predictions)

        # If the accuracy is better than the previous best, store the model
        if accuracy > bestAccuracy:
            bestAccuracy = accuracy
            bestModel = model

# Use the best model for future predictions
predictions = bestModel.transform(dfc_test)
        

In [0]:
# logistic regression test accuracy
MulticlassClassificationEvaluator(metricName="accuracy").evaluate(predictions)


In [0]:
# logistic regression training accuracy of the selected model
print(bestModel.summary.accuracy)
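The loop above chooses the hyperparameters by their score on `dfc_test`, so the same rows are used both to pick the model and to report its performance. Cross-validation inside the training data avoids that; Spark's `CrossValidator`, imported earlier in this notebook, is built for this. The plain-Python sketch below only illustrates the mechanics; `toy_score` is a hypothetical stand-in for "fit on the training fold, evaluate on the validation fold".

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; every row is used for validation exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid

def select_by_cv(param_grid, score_fn, n_rows, k=3):
    """Return the parameter setting with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = [score_fn(params, train, valid)
                  for train, valid in k_fold_indices(n_rows, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_score(params, train_idx, valid_idx):
    """Hypothetical score: pretends that regParam = 0.1 fits best."""
    return 1.0 - abs(params["regParam"] - 0.1)

# Same grid values as in the loop above
grid = [{"regParam": r, "elasticNetParam": e}
        for r in (0.01, 0.1) for e in (0.0, 0.5)]

best_params, best_score = select_by_cv(grid, toy_score, n_rows=9, k=3)
print(best_params, best_score)
```

With this setup the test split is touched only once, after the hyperparameters have been fixed.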

In [0]:
# one vs rest method
lr = LogisticRegression(featuresCol="features", labelCol="label", family="multinomial")
ovr = OneVsRest(classifier=lr)
ovr_model = ovr.fit(dfc_train)


In [0]:
# one-vs-rest testing
ovr_model_pred = ovr_model.transform(dfc_test)

In [0]:
# areaUnderROC is only defined for binary labels, so BinaryClassificationEvaluator cannot
# score this multi-class target; the weighted F1 score is reported instead
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")

# Use the model to make predictions on the test data
prediction = ovr_model.transform(dfc_test)

# Evaluate the model using the evaluator
f1 = evaluator.evaluate(prediction)

# Print the weighted F1 score
print("F1 Score: ", f1)
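AUC is defined for a binary target, so for five review scores it has to be computed class by class. One common convention is the macro-averaged one-vs-rest AUC: treat each class in turn as the positive class and average the resulting AUC values. The sketch below shows this in plain Python, assuming a probability per class is available for each row; the labels and probabilities are hypothetical.

```python
def binary_auc(scores, is_positive):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(labels, probabilities):
    """Average the one-vs-rest AUC over all classes that appear in the labels."""
    classes = sorted(set(labels))
    aucs = [binary_auc([p[c] for p in probabilities], [l == c for l in labels])
            for c in classes]
    return sum(aucs) / len(aucs)

# Hypothetical labels and per-class probabilities for three classes (0, 1, 2)
labels = [0, 0, 1, 2]
probabilities = [
    {0: 0.7, 1: 0.2, 2: 0.1},
    {0: 0.4, 1: 0.5, 2: 0.1},
    {0: 0.3, 1: 0.4, 2: 0.3},
    {0: 0.2, 1: 0.2, 2: 0.6},
]

auc = macro_ovr_auc(labels, probabilities)
print("Macro one-vs-rest AUC:", round(auc, 4))
```

The pairwise comparison is quadratic in the number of rows, so this form is only meant to make the definition concrete, not to be run on the full dataset.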

In [0]:
# Create an instance of MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

# Use the model to make predictions on the test data
prediction = ovr_model.transform(dfc_test)

# Evaluate the model using the evaluator
accuracy = evaluator.evaluate(prediction)

# Print the accuracy
print("Accuracy: ", accuracy)

In [0]:
# random forest
from pyspark.ml.classification import RandomForestClassifier

rf_model = RandomForestClassifier(numTrees=100,featureSubsetStrategy="auto" ,seed=123).fit(dfc_train)


In [0]:
#extracting feature importances
rf_model.featureImportances

In [0]:
## top feature names
# Extract the feature importances
importances = rf_model.featureImportances.toArray()

# Get the names of the features
feature_names = fnl_base.columns[:-1]

# Zip the feature names and importances together and sort them by importance
feature_importances = list(zip(feature_names, importances))
feature_importances.sort(key=lambda x: x[1], reverse=True)

# Select the top k features
k = 15
top_features = [x[0] for x in feature_importances[:k]]

In [0]:
# Extract the feature importances
top_features

In [0]:
# rf model testing
rf_model_pred = rf_model.transform(dfc_test)

In [0]:
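from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

# choose the evaluator according to the number of classes the model was trained on
if rf_model.numClasses == 2:
    evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
else:
    evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")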

In [0]:
train_prediction = rf_model.transform(dfc_train)


In [0]:
print(MulticlassClassificationEvaluator(metricName="accuracy").evaluate(rf_model_pred))


In [0]:
dt = DecisionTreeClassifier(featuresCol="features",impurity="gini",maxDepth= 17,labelCol="label")
model = dt.fit(dfc_train)

# Extract the feature importances
importances = model.featureImportances.toArray()

# Get the names of the features
feature_names = fnl_base.columns[:-1]

# Zip the feature names and importances together and sort them by importance
feature_importances = list(zip(feature_names, importances))
feature_importances.sort(key=lambda x: x[1], reverse=True)

# Select the top k features
k = 15
top_features = [x[0] for x in feature_importances[:k]]


In [0]:
top_features

In [0]:
# model testing with test data
model_pred = model.transform(dfc_test)

In [0]:
# model accuracy
print(MulticlassClassificationEvaluator(metricName="accuracy").evaluate(model_pred))

In [0]:
#Train a RandomForest regression model on the numeric label and tune featureSubsetStrategy between 'auto' and 'sqrt' using 2-fold CV
  
#RF: Hyperparameters
  #numTrees: number of trees to train
  #featureSubsetStrategy: how many features should be considered at each split? Values: auto, all, sqrt, log2, n 

from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

#Define the estimator
rfr = RandomForestRegressor()

#Set param grid
rfParams = ParamGridBuilder()\
  .addGrid(rfr.featureSubsetStrategy, ['auto', 'sqrt'])\
  .build()

#Run cross-validation
rfCv = CrossValidator()\
  .setEstimator(rfr)\
  .setEstimatorParamMaps(rfParams)\
  .setEvaluator(RegressionEvaluator())\
  .setNumFolds(2) # Here: 2-fold cross validation

#Run cross-validation, and choose the best set of parameters.
rfrModel = rfCv.fit(dfc_train)

In [0]:
train_prediction = rfrModel.transform(dfc_train)

In [0]:
#Evaluate the RF Regression Model
from pyspark.ml.evaluation import RegressionEvaluator

RegressionEvaluator(metricName="rmse").evaluate(train_prediction)

In [0]:
RegressionEvaluator(metricName="mae").evaluate(train_prediction)

In [0]:
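# regression predictions are continuous, so round them to the nearest score before computing accuracy
from pyspark.sql.functions import col, round as spark_round
rounded_prediction = train_prediction.withColumn("prediction", spark_round(col("prediction")))
print(MulticlassClassificationEvaluator(metricName="accuracy").evaluate(rounded_prediction))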

In [0]:
rfrModel.bestModel.getFeatureSubsetStrategy()

In [0]:
#Evaluate the RF Regression Model on the test split
from pyspark.ml.evaluation import RegressionEvaluator

preds = rfrModel.transform(dfc_test)
RegressionEvaluator(metricName="r2").evaluate(preds)

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt

# Convert PySpark DataFrame to Pandas DataFrame
df_pandas = fnl_base.toPandas()
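# some of these columns were already dropped from fnl_base, so missing names are ignored
df_pandas=df_pandas.drop(["credit_pymt","debit_pymt","mobile_pymt","voucher_pymt","order_avg_length_cm","order_avg_height_cm","order_avg_width_cm"], axis=1, errors="ignore")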

# Generate a correlation heat map using the seaborn library
#sns.heatmap(df_pandas.corr(), annot=True)

# Show the plot
#plt.show()

# correlations are only defined for numeric columns
cor=df_pandas.select_dtypes("number").corr()
cor

In [0]:
from pyspark.ml.stat import Correlation

# Compute the Pearson correlation matrix
correlation = Correlation.corr(dfc, "features").head()

In [0]:
pearson_correlation_matrix = correlation[0].toArray()
pearson_correlation_matrix

In [0]:
display(fnl_base)

In [0]:
len(fnl_base.columns)

In [0]:
from pyspark.sql.functions import coalesce

#spark_df = fnl_base.select([coalesce(c, lit("").cast(c.dataType)) for c in spark_df.columns])
pandas_df = fnl_base.toPandas()
pandas_df.shape

In [0]:

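# some of these columns were already dropped from fnl_base, so missing names are ignored
pandas_df = pandas_df.drop(['review_repns_time', 'review_sent',"credit_pymt","debit_pymt","mobile_pymt","voucher_pymt"], axis=1, errors="ignore")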


In [0]:
import pandas as pd
# one-hot encode the customer type
pandas_df = pd.get_dummies(pandas_df, columns=['customer_type'])


In [0]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
pandas_df["target"]=encoder.fit_transform(pandas_df["review_catagory"])

In [0]:
pandas_df = pandas_df.drop(['review_catagory'],axis=1)
pandas_df.head()

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif ,SelectPercentile

from sklearn.preprocessing import LabelEncoder

# Split the data into features and target
X = pandas_df.iloc[:, :-1]
y = pandas_df.iloc[:, -1]

# Split the data into training and testing sets
# fix the random seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

X_train

In [0]:

# determine the mutual information
mutual_info = mutual_info_classif (X_train, y_train)
mutual_info

In [0]:
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)

In [0]:
mutual_info.sort_values(ascending=False).plot.bar(figsize=(15,5))
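Mutual information measures how much knowing a feature reduces uncertainty about the label: $I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$. The sketch below computes it from empirical frequencies for discrete values only. `mutual_info_classif` uses a nearest-neighbour estimator for continuous features, so its numbers will generally not match this calculation; the toy columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in nats for two discrete sequences, from empirical frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Hypothetical toy data: `late` determines the label, `weekday` carries no information
late = [1, 1, 0, 0]
weekday = [0, 1, 0, 1]
label = ["low", "low", "high", "high"]

mi_late = mutual_information(late, label)
mi_weekday = mutual_information(weekday, label)
print("late:", round(mi_late, 4))
print("weekday:", round(mi_weekday, 4))
```

A feature that fully determines a balanced binary label reaches log 2 (about 0.693 nats), while an independent feature scores 0, which is the ordering the bar chart above is meant to reveal.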

In [0]:
selected_top_columns = SelectPercentile(mutual_info_classif, percentile=10)
selected_top_columns.fit(X_train.fillna(0), y_train)
X_train.columns[selected_top_columns.get_support()]

In [0]:
selected_top_columns = SelectPercentile(mutual_info_classif, percentile=20)
selected_top_columns.fit(X_train.fillna(0), y_train)
X_train.columns[selected_top_columns.get_support()]

In [0]:
selected_top_columns = SelectPercentile(mutual_info_classif, percentile=50)
selected_top_columns.fit(X_train.fillna(0), y_train)
X_train.columns[selected_top_columns.get_support()]

In [0]:
%pip install mlxtend

In [0]:
import seaborn as sns
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
#from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.linear_model import LogisticRegression as LGR
from sklearn.ensemble import RandomForestClassifier as rfc

In [0]:
pandas_df.head()

In [0]:
# review_catagory was already dropped above, so only the target is removed from the features
X=pandas_df.drop(["target"], axis=1)
Y=pandas_df["target"]


In [0]:
pandas_df.shape

In [0]:
SFS1= SFS(rfc(n_jobs=-1, class_weight="balanced",criterion="gini"),
          k_features=(1,10),
          forward=True,
          floating=False,
          verbose=2,
          scoring="accuracy",
          cv=2)

SFS2=SFS1.fit(X, Y)

In [0]:
pd.DataFrame.from_dict(SFS2.get_metric_dict()).T

In [0]:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot    as plt

In [0]:
fig1=plot_sfs(SFS2.get_metric_dict(confidence_interval=0.95), kind ='std_err')
plt.grid()
plt.show()

In [0]:
SFS1_entropy= SFS(rfc(n_jobs=-1, class_weight="balanced",criterion="entropy"),
          k_features=(1,10),
          forward=True,
          floating=False,
          verbose=2,
          scoring="accuracy",
          cv=2)

SFS2_entropy=SFS1_entropy.fit(X, Y)

In [0]:
SFS3= SFS(LGR(max_iter=10000),
          k_features='best',
          forward=True,
          floating=False,
          verbose=2,
          scoring="accuracy",
          cv=2)


SFS4=SFS3.fit(X, Y)
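Forward selection, as configured above with `forward=True`, starts from an empty set and repeatedly adds the single feature that improves the cross-validated score the most. The sketch below shows the greedy loop in plain Python. It stops as soon as no candidate improves the score, which is a simplification compared with searching a whole range of subset sizes; the `gain` values and `toy_score` are hypothetical stand-ins for a cross-validated accuracy.

```python
def forward_select(features, score_fn, max_features):
    """Greedily add the feature that most improves the score; stop when nothing helps."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # score every candidate subset and keep the best one
        score, feature = max((score_fn(selected + [f]), f) for f in candidates)
        if score <= best_score:
            break
        selected.append(feature)
        best_score = score
    return selected, best_score

# Hypothetical additive usefulness per feature, minus a small penalty per added feature
gain = {"days_to_delivery": 0.30, "total_price": 0.10, "delivery_month": 0.01}

def toy_score(subset):
    return sum(gain[f] for f in subset) - 0.02 * len(subset)

selected, score = forward_select(list(gain), toy_score, max_features=3)
print(selected, round(score, 2))
```

Each step evaluates one model per remaining feature, so the cost grows quickly with the number of columns; this is why a low `cv` value is a pragmatic choice in the cells above.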

In [0]:
from imblearn.over_sampling import SMOTE
from pyspark.sql.functions import *
import pandas as pd

# Convert PySpark dataframe to Pandas dataframe
pandas_df = fnl_features.toPandas()

# Apply SMOTE to the Pandas dataframe
smote = SMOTE(sampling_strategy='minority')
X_resampled, y_resampled = smote.fit_resample(pandas_df.drop("review_score", axis=1), pandas_df["review_score"])

# Convert Pandas dataframe back to PySpark dataframe
spark_df = spark.createDataFrame(pd.concat([pd.DataFrame(X_resampled), pd.DataFrame(y_resampled)], axis=1))
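Unlike the random oversampling used earlier, SMOTE creates new minority rows by interpolating between an existing minority row and one of its nearest minority neighbours, so it needs numeric features. The sketch below illustrates only that interpolation step in plain Python; `smote_like` is a hypothetical helper and not the `imblearn` implementation.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # the k nearest minority neighbours of `base` by squared Euclidean distance
        others = sorted((p for p in minority if p is not base),
                        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # pick a random position on the segment from base to neighbour
        lam = rng.random()
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class rows with two numeric features
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]

new_points = smote_like(minority, n_new=4)
for point in new_points:
    print(point)
```

Every synthetic point lies between two real minority rows, which is why SMOTE is usually applied to the training split only, for the same leakage reason as with random oversampling.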