### Continuing to explore restaurant customers
In the notebook <a href='https://www.kaggle.com/erelin6613/customer-is-always-right'>Customer is always right</a> we did some basic analysis on a sample of the data. In recommendation problems we usually face a vast amount of data, and it is often sparse as well. Luckily for us, a few tools come to the rescue.

In [None]:
!pip3 install pyspark --quiet

In [None]:
!pip3 list | grep pyspark

In [None]:
import os
import pyspark as spark
import pyspark.sql.functions as F
import pyspark.ml as ml
import pyspark.mllib as mllib
from pyspark.sql.types import *
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from datetime import datetime
import numpy as np

In [None]:
sc = spark.SparkContext()
sql = spark.sql.SQLContext(sc)

In [None]:
root_dir = '../input/restaurant-recommendation-challenge'
files = dict()
for f in os.listdir(root_dir):
    if f.endswith('.csv') and f != 'SampleSubmission (1).csv':
        files[f] = sql.read.format('csv').options(header='true').load(os.path.join(root_dir, f))

In [None]:
files['train_full.csv'].groupBy('discount_percentage').count().orderBy('count').show()

In [None]:
files['train_full.csv'].groupby('target').count().orderBy('count').show()

In [None]:
prep_s = files['train_full.csv'].select('prepration_time').toPandas()
prep_s['prepration_time'].astype('float').hist(color='gold')
del prep_s

In [None]:
p_method = files['orders.csv'].select('payment_mode').toPandas()
p_method['payment_mode'].astype('float').hist()
del p_method

In [None]:
total = files['orders.csv'].select('grand_total').toPandas()
sns.distplot(total['grand_total'].astype('float'), color='purple')
del total

I hope this is enough to confirm that the distributions we saw in the sample look roughly the same at the larger scale too.

### Preprocessing

We need to dive into the data once again to extract the features we need, but this time we will handle it with pyspark. We begin by dropping columns that hold only two values: a single value or NaN. There could be some signal in whether the value is NaN or not, but for now let's treat these columns as uninformative. For now we also drop columns such as 'wednesday_to_time1', as they appear to show little variation (we might reconsider them later).

In [None]:
"""I defined a function to automate dropping columns by iterating through 
the columns of each dataframe, but it is painfully slow. Feel free to use 
it if you have plenty of spare time"""

def remove_cols(frame):
    for col in tqdm(frame.columns):
        nans = frame.rdd.map(lambda row: (
            row[col], sum([c == None for c in row]))).collect()
        #print(nans)
        if len(nans) > 0:
            distincts = frame.select(col).distinct().collect()
            #print(distincts)
            if len(distincts) == 2:
                frame = frame.drop(col)
    return frame
                
#for k, v in files.items():
#    files[k] = remove_cols(v)
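
As a rough illustration of the rule described above, here is a plain-Python sketch that flags columns holding at most one distinct non-null value. The helper name `low_variation_columns` and the toy rows are made up for this example. On the real Spark frames the same per-column distinct counts could probably be gathered in one aggregation pass (for example with `F.countDistinct`) instead of one job per column.

```python
def low_variation_columns(rows, max_distinct=1):
    """Return column names with at most `max_distinct` distinct non-null values.

    `rows` is a list of dicts, one per record (a stand-in for a dataframe).
    """
    distinct = {}
    for row in rows:
        for name, value in row.items():
            seen = distinct.setdefault(name, set())
            if value is not None:
                seen.add(value)
    return [name for name, seen in distinct.items() if len(seen) <= max_distinct]


# Toy rows, invented for illustration only.
rows = [
    {"country_id": "1", "vendor_rating": "4.5", "commission": None},
    {"country_id": "1", "vendor_rating": "4.1", "commission": "0"},
    {"country_id": None, "vendor_rating": "4.5", "commission": "0"},
]
print(low_variation_columns(rows))
```

Note that this rule is slightly broader than `remove_cols`: it also flags columns that are entirely null or that hold a single value with no nulls at all.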

In [None]:
weekdays = ['monday', 'tuesday', 'wednesday', 'thursday', 
            'friday', 'saturday', 'sunday']

In [None]:
to_drop = {'train_full.csv': ['commission', 'display_orders', 
                              'country_id', 'CID X LOC_NUM X VENDOR',
                             'city_id', 'vendor_category_en', 'latitude_x', 
                              'latitude_y', 'longitude_x', 'longitude_y'],
          'test_full.csv': ['commission', 'display_orders', 
                            'country_id', 'CID X LOC_NUM X VENDOR',
                           'city_id', 'vendor_category_en', 'latitude_x', 
                              'latitude_y', 'longitude_x', 'longitude_y'],
          'orders.csv': ['akeed_order_id', 'CID X LOC_NUM X VENDOR'],
          'train_customers.csv': ['language'],
          'test_customers.csv': ['language']}

for k, v in to_drop.items():
    for col in v:
        files[k] = files[k].drop(col)

for each in ['train_full.csv', 'test_full.csv']:
    for col in weekdays:
        for column in files[each].columns:
            if col in column:
                files[each] = files[each].drop(column)

Now we need the numeric features to be stored as numeric values, not as strings. After that we need to encode the categories; those do not actually need a cast to a numeric type, but first things first.

In [None]:
numeric_cols = ['delivery_charge', 'serving_distance', 'vendor_rating', 
                'prepration_time', 'discount_percentage', 'verified_x', 
                'is_open', 'status_y', 'verified_y', 'rank', 
                'open_close_flags', 'location_number_obj']

In [None]:
for col in numeric_cols:
    files['train_full.csv'] = files['train_full.csv'].withColumn(
        col, files['train_full.csv'][col].cast(DoubleType()))
    files['test_full.csv'] = files['test_full.csv'].withColumn(
        col, files['test_full.csv'][col].cast(DoubleType()))

files['train_full.csv'] = files['train_full.csv'].withColumn(
        'target', files['train_full.csv']['target'].cast(DoubleType()))

The next step is to extract a numeric feature from the 'primary_tags' column. It is really a categorical feature, but let's still make sure we do not carry too much unnecessary data around.

In [None]:
files['train_full.csv'] = files['train_full.csv'].withColumn(
    'primary_tags', F.regexp_extract(
        files['train_full.csv']['primary_tags'], r"[0-9]+", 0))
files['test_full.csv'] = files['test_full.csv'].withColumn(
    'primary_tags', F.regexp_extract(
        files['test_full.csv']['primary_tags'], r"[0-9]+", 0))

# Cast the extracted digits themselves (not the leftover loop variable `col`)
files['train_full.csv'] = files['train_full.csv'].withColumn(
    'primary_tags', files['train_full.csv']['primary_tags'].cast(DoubleType()))
files['test_full.csv'] = files['test_full.csv'].withColumn(
    'primary_tags', files['test_full.csv']['primary_tags'].cast(DoubleType()))
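
To make the extraction step concrete, here is a small plain-Python sketch of what pulling the first run of digits out of a string and turning it into a number amounts to. The function name `first_number` and the sample strings are assumptions for illustration; the real `primary_tags` values may look different.

```python
import re


def first_number(text):
    """Return the first run of digits in `text` as a float, or None if there is none."""
    if text is None:
        return None
    match = re.search(r"[0-9]+", text)
    return float(match.group(0)) if match else None


# Sample strings are invented for illustration.
for raw in ['{"primary_tags":"4"}', "tag 2396 and 12", "no digits", None]:
    print(repr(raw), "->", first_number(raw))
```

As far as I know, Spark's `regexp_extract` yields an empty string when nothing matches, and casting that to a double gives a null, which corresponds to the `None` returned here.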

Next let's join the full frames with the customer tables so we do not miss anything.

In [None]:
files['train_customers.csv'] = files['train_customers.csv'].withColumnRenamed(
    'akeed_customer_id', 'customer_id')
files['test_customers.csv'] = files['test_customers.csv'].withColumnRenamed(
    'akeed_customer_id', 'customer_id')
train_df = files['train_full.csv'].join(files['train_customers.csv'], on=['customer_id'])
test_df = files['test_full.csv'].join(files['test_customers.csv'], on=['customer_id'])

In [None]:
train_df = train_df.drop('gender').drop('language')
test_df = test_df.drop('gender').drop('language')

Remember the funny column with the date of birth? We still need to preprocess it.

In [None]:
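# Note: na.fill(2020.0) with no subset fills nulls in every numeric column
# of the frame, not only in 'dob'.
train_df = train_df.withColumn('dob', train_df.dob.cast(
    DoubleType())).na.fill(2020.0)
test_df = test_df.withColumn('dob', test_df.dob.cast(
    DoubleType())).na.fill(2020.0)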

train_df = train_df.fillna({'location_type': 'unknown'})
test_df = test_df.fillna({'location_type': 'unknown'})

In [None]:
train_df = train_df.withColumn('age', (2020-train_df.dob)).drop('dob')
test_df = test_df.withColumn('age', (2020-test_df.dob)).drop('dob')
median_age = np.array(train_df.select('age').collect())
median_age = np.median(median_age[median_age != 0.0])

train_df = train_df.withColumn('age', F.when(
    (train_df.age<100) & (train_df.age>12), train_df.age).otherwise(median_age))
test_df = test_df.withColumn('age', F.when(
    (test_df.age<100) & (test_df.age>12), test_df.age).otherwise(median_age))
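
The age clean-up above can be followed step by step in this plain-Python sketch. It mirrors the same choices: a missing birth year becomes 2020 (age 0), the fallback is the median of all non-zero ages (implausible ones still included), and only ages strictly between 12 and 100 are kept. The function name `clean_ages` and the toy birth years are assumptions for illustration.

```python
from statistics import median

REFERENCE_YEAR = 2020


def clean_ages(birth_years, low=12, high=100):
    """Turn birth years into ages and replace implausible ages with a median."""
    # A missing birth year is treated as the reference year, i.e. age 0.
    ages = [REFERENCE_YEAR - (year if year is not None else REFERENCE_YEAR)
            for year in birth_years]
    # Median over the non-zero ages, outliers included, as in the cell above.
    fallback = median(age for age in ages if age != 0)
    return [age if low < age < high else fallback for age in ages]


# Toy birth years, invented for illustration.
print(clean_ages([1990, None, 1985, 20, 2015]))
```

Because the median is taken before the outliers are filtered, a few absurd values can pull the fallback away from the typical age; computing it over the plausible ages only would be a reasonable alternative.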

In [None]:
train_df.select('age').distinct().show()

I could be wrong about the columns `created_at_x` and `updated_at_x` (and their `_y` pairs), which range from 2018 to 2020, but it seems to me we should engineer a feature that tells us who the loyal customers are. Let's try to do that.

In [None]:
train_df = train_df.withColumn('created_at_x', F.to_date(train_df.created_at_x))
train_df = train_df.withColumn('created_at_y', F.to_date(train_df.created_at_y))
train_df = train_df.withColumn('updated_at_x', F.to_date(train_df.updated_at_x))
train_df = train_df.withColumn('updated_at_y', F.to_date(train_df.updated_at_y))

test_df = test_df.withColumn('created_at_x', F.to_date(test_df.created_at_x))
test_df = test_df.withColumn('created_at_y', F.to_date(test_df.created_at_y))
test_df = test_df.withColumn('updated_at_x', F.to_date(test_df.updated_at_x))
test_df = test_df.withColumn('updated_at_y', F.to_date(test_df.updated_at_y))

In [None]:
try:
    train_df = train_df.withColumn('x_loyal', F.datediff(
        train_df.updated_at_x, train_df.created_at_x))
    train_df = train_df.withColumn('y_loyal', F.datediff(
        train_df.updated_at_y, train_df.created_at_y))

    test_df = test_df.withColumn('x_loyal', F.datediff(
        test_df.updated_at_x, test_df.created_at_x))
    test_df = test_df.withColumn('y_loyal', F.datediff(
        test_df.updated_at_y, test_df.created_at_y))
except Exception:
    pass

train_df = train_df.drop('created_at_x').drop(
    'created_at_y').drop('updated_at_x').drop('updated_at_y')
test_df = test_df.drop('created_at_x').drop(
    'created_at_y').drop('updated_at_x').drop('updated_at_y')

In [None]:
train_df.select('x_loyal').distinct().show(10)
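
For intuition, the loyalty feature is just the number of calendar days between the creation date and the last update. A plain-Python sketch of the same arithmetic follows; the helper name `days_between`, the timestamp format, and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime


def days_between(created_at, updated_at, fmt="%Y-%m-%d %H:%M:%S"):
    """Whole calendar days from created_at to updated_at; None if either is missing."""
    if created_at is None or updated_at is None:
        return None
    start = datetime.strptime(created_at, fmt).date()
    end = datetime.strptime(updated_at, fmt).date()
    return (end - start).days


# Sample timestamps, invented for illustration.
print(days_between("2018-06-01 10:00:00", "2018-06-01 18:30:00"))
print(days_between("2019-12-31 23:00:00", "2020-01-01 01:00:00"))
print(days_between(None, "2020-01-01 00:00:00"))
```

A value of 0 means the record was last updated on the day it was created; larger values may hint at accounts that stayed active longer, though an update is not necessarily an order.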

I am torturing you a lot with preprocessing and feature crafting. Let's drop the rest and see what we can do with what we have. But first, a small step we neglected earlier: encoding the categories. That is what we will start with next.

In [None]:
to_drop = ['OpeningTime', 'OpeningTime2', 'language', 
           'customer_id', 'vendor_tag', 'vendor_tag_name', 
           'created_at', 'updated_at', 'id', 'authentication_id', 
           'id_obj', 'is_akeed_delivering', 'one_click_vendor']
target = 'target'

for col in to_drop:
    train_df = train_df.drop(col)
    test_df = test_df.drop(col)
train_df.show(1)

In [None]:
categorical = ['location_number', 'location_type', 'status_x',
               'vendor_category_id', 'device_type', 'status', 
               'verified']


In [None]:
# train_df.select('one_click_vendor').distinct().show()

In [None]:
# Fit the indexer and encoder on train only, then apply both to train and test.
for col in categorical:
    stringIndexer = ml.feature.StringIndexer(inputCol=col, outputCol=col + "_ind")
    indexer = stringIndexer.fit(train_df)
    train_df = indexer.transform(train_df)
    test_df = indexer.transform(test_df)
    encoder = ml.feature.OneHotEncoder(
        inputCols=[stringIndexer.getOutputCol()], outputCols=[col + "_ohe"])
    ohe_encoder = encoder.fit(train_df)
    train_df = ohe_encoder.transform(train_df)
    test_df = ohe_encoder.transform(test_df)
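
To show what the two encoding stages produce, here is a plain-Python sketch of string indexing followed by one-hot encoding. It assumes the behaviour I believe `StringIndexer` and `OneHotEncoder` have by default: labels are indexed by descending frequency (ties broken alphabetically) and the last category is dropped from the one-hot vector. The function names and the toy column are made up for this example.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: position for position, label in enumerate(ordered)}


def one_hot(position, n_labels, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last label is all zeros."""
    width = n_labels - 1 if drop_last else n_labels
    vector = [0.0] * width
    if position < width:
        vector[position] = 1.0
    return vector


# Toy column, invented for illustration.
location_type = ["Home", "Work", "Home", "unknown", "Home", "Work"]
index = fit_string_index(location_type)
for label in index:
    print(label, index[label], one_hot(index[label], len(index)))
```

A label that appears only in the test data has no entry in the fitted mapping. In Spark, the `handleInvalid` parameter of `StringIndexer` controls what happens in that case, so it may be worth setting explicitly.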

In [None]:
"""
numeric_cols = ['delivery_charge', 'serving_distance', 'vendor_rating', 
                'prepration_time', 'discount_percentage', 'verified_x', 
                'is_open', 'status_y', 'verified_y', 'rank', 
                'open_close_flags', 'location_number_obj']

"""
train_df.show(1)

In [None]:
columns = numeric_cols + [col+'_ohe' for col in categorical]
assembler = ml.feature.VectorAssembler(
    inputCols=columns, 
    outputCol="features")

train = assembler.transform(train_df)
test = assembler.transform(test_df)

In [None]:
train_fit, train_eval = train.randomSplit([0.75, 0.25], seed=13)

l_reg = ml.classification.LogisticRegression(labelCol='target', featuresCol='features', maxIter=20)
l_reg=l_reg.fit(train_fit)

predict_train=l_reg.transform(train_fit)
predict_test=l_reg.transform(train_eval)

In [None]:
predict_test.select('prediction').distinct().show()

Well, we do not even need to evaluate to see that something is wrong with our results. What exactly did we fail to account for?

#### 1) Class imbalance
Remember the distribution of the target column? In a sample of 1000 rows only 128 had a target of 1, which is just 12.8%. It is reasonable to assume this percentage will not vary much in the full training set. Should we check?
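
The arithmetic behind that concern is short enough to write down. With the sample figures quoted above, a model that always predicts 0 is already right most of the time, which is one plausible reason a classifier ends up predicting a single class. The helper name `majority_baseline` is made up for this sketch.

```python
def majority_baseline(n_total, n_positive):
    """Return the positive share and the accuracy of always predicting 0."""
    positive_share = n_positive / n_total
    accuracy_all_zero = 1 - positive_share
    return positive_share, accuracy_all_zero


share, accuracy = majority_baseline(1000, 128)
print(f"positive share: {share:.1%}, always-0 accuracy: {accuracy:.1%}")
```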

In [None]:
train_fit.groupBy('target').count().orderBy('count').show()

In [None]:
train_eval.groupBy('target').count().orderBy('count').show()

Huh, that is an even more severe imbalance than our sample of 1000 showed before. Will we do better if we handle it?

In [None]:
bal_train = train.filter(train.target==1.0)
target_count = bal_train.count()

In [None]:
target_df = train.filter(train.target==0.0).distinct()
target_df = target_df.sample(False, fraction=target_count/target_df.count())
target_df.count()

In [None]:
bal_train = bal_train.unionByName(target_df)
bal_train.sample(False, 0.1).show(10)
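
The balancing above keeps every positive row and samples the negatives with a fraction equal to positives divided by negatives. A plain-Python sketch of that idea follows; it keeps each negative independently with that probability, so the number of negatives kept is only approximately equal to the number of positives. The function name `downsample_majority` and the toy labels are assumptions for illustration.

```python
import random


def downsample_majority(labels, seed=13):
    """Return indices of all positives plus a random subset of negatives.

    Each negative is kept with probability positives / negatives.
    """
    positives = [i for i, label in enumerate(labels) if label == 1.0]
    negatives = [i for i, label in enumerate(labels) if label == 0.0]
    fraction = len(positives) / len(negatives)
    rng = random.Random(seed)
    kept_negatives = [i for i in negatives if rng.random() < fraction]
    return positives + kept_negatives, fraction


# Toy labels, invented for illustration: 10 positives and 90 negatives.
labels = [1.0] * 10 + [0.0] * 90
kept, fraction = downsample_majority(labels)
print(f"sampling fraction: {fraction:.3f}, rows kept: {len(kept)}")
```

Downsampling throws away most of the negative rows; reweighting the classes instead would be an alternative worth comparing.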

In [None]:
train_fit, train_eval = bal_train.randomSplit([0.75, 0.25], seed=13)

l_reg = ml.classification.LogisticRegression(labelCol='target', featuresCol='features', maxIter=20)
l_reg=l_reg.fit(train_fit)

predict_train=l_reg.transform(train_fit)
predict_test=l_reg.transform(train_eval)

In [None]:
predict_test.select('prediction').distinct().show()

In [None]:
predict_test.show()

That already looks much better, doesn't it? But how accurate are the results really? That is going to be our next step.