## Data New Embeddings and Time Features
### Team Name : Data Crew

In this notebook we derive four additional time features and implement another way of computing embeddings for the two sequence columns, first_20_events and time_since_last_event.
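As a quick illustration of the four time features, here is a minimal sketch on toy data. The values are invented; the column names match the ones created later in the notebook (total_time_spent, time_mean, time_std, time_max).

```python
import numpy as np
import pandas as pd

# Toy data (invented): each row holds the waiting times between consecutive events.
toy = pd.DataFrame({
    "time_since_last_event": [[0.0, 2.0, 4.0], [1.0, 1.0, 1.0]],
})

waits = toy["time_since_last_event"]
toy["total_time_spent"] = waits.apply(lambda x: np.sum(x))
toy["time_mean"] = waits.apply(lambda x: np.mean(x))
toy["time_std"] = waits.apply(lambda x: np.std(x))  # population std (ddof=0), NumPy's default
toy["time_max"] = waits.apply(lambda x: np.max(x))

print(toy.drop(columns="time_since_last_event"))
```

Note that np.std uses the population formula by default, so a sequence with a single event gets a standard deviation of 0 rather than NaN.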

In [1]:
# Basic Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

import warnings
warnings.filterwarnings("ignore")

# Library for preprocessing (written by us)
from utils import *

# Deep learning libraries (PyTorch and Keras)
import torch
import torch.nn as nn
from keras.models import Sequential
from keras.layers import LSTM, Dense

#random.seed(2024)
np.random.seed(2024)

## Data

### Data Retrieval

We retrieve the data from the original source, i.e. the raw export that is uncleaned and unprocessed. We also retrieve the event definition dataset:

In [2]:
# data = pd.read_csv('../../1. Data/export.csv')   # This is the original data
# data = pd.read_csv('../../1. Data/smaller_sample.csv')  # This is a smaller sample
# event_defs = pd.read_csv('../../1. Data/Event+Definitions.csv')  # This is the event dictionary

#### **-> Execute only if working with the original dataset**

The original dataset does not include some of the already-merged variables, so we have to merge them in manually. This takes approximately 31 seconds.

In [3]:
# event_defs.drop(columns=['event_name'], inplace=True)
# event_defs.rename(columns={'event_definition_id':'ed_id'}, inplace=True)
# data = pd.merge(data, event_defs, on='ed_id', how='left')
# event_defs = pd.read_csv('../../1. Data/Event+Definitions.csv')
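The merge in the cell above can be sketched on toy data. Only `ed_id`, `event_definition_id` and `event_name` come from the notebook; the remaining column (`event_group`) and all values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables (values invented).
data = pd.DataFrame({"customer_id": [1, 1, 2], "ed_id": [10, 11, 99]})
event_defs = pd.DataFrame({
    "event_definition_id": [10, 11],
    "event_name": ["a", "b"],
    "event_group": ["apply", "order"],  # hypothetical extra column
})

# Same steps as the notebook cell, without mutating event_defs in place.
defs = (event_defs
        .drop(columns=["event_name"])
        .rename(columns={"event_definition_id": "ed_id"}))

# A left merge keeps every row of `data`; unknown ed_id values get NaN.
merged = pd.merge(data, defs, on="ed_id", how="left")
print(merged)
```

Working on a renamed copy instead of using `inplace=True` would also make the second `read_csv` of the event definitions unnecessary, since the original frame is left untouched.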

#### **-> Execute only if working with the sample dataset**

Then we call the **get_classification_dataset** function with *n_events = $20$* (this parameter can be changed on each run; it only sets the number of sequential events kept in the resulting **first_n_events** column). This takes approximately **2:07** minutes to run on the smaller_sample dataset.

In [4]:
# number_events_fixed = 20
# col_name = 'first_' + str(number_events_fixed) +'_events'

# df = get_classification_dataset(data, event_defs, n_events=number_events_fixed)
# df.reindex(sorted(df.columns), axis=1)
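The implementation of get_classification_dataset lives in utils and is not shown in this notebook. The following is therefore only a hypothetical sketch of what "keeping the first n events per customer" could look like; the column names `customer_id`, `event_timestamp` and the toy values are assumptions, and this is not claimed to be what the function actually does.

```python
import pandas as pd

# Hypothetical event log (column names and values invented).
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-02", "2024-01-05", "2024-01-04"]
    ),
    "ed_id": [10, 12, 11, 20, 21],
})

n_events = 2  # plays the role of the n_events parameter

# Sort chronologically per customer, then keep the first n event ids as a list.
first_n = (events
           .sort_values(["customer_id", "event_timestamp"])
           .groupby("customer_id")["ed_id"]
           .apply(lambda s: s.head(n_events).tolist()))
print(first_n)
```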

We then reset the index and store the previous index (the cust_ids) as a new column:

In [5]:
# cust_ids = df.index
# cust_ids = [x[0] for x in cust_ids]
# df.reset_index(drop=True, inplace=True)
# df['customer_id'] = cust_ids
# df.head()
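The list comprehension `[x[0] for x in cust_ids]` suggests that each index entry is a tuple whose first element is the customer id. Assuming the index is a pandas MultiIndex (an assumption; the level names below are invented), the same step can be sketched as:

```python
import pandas as pd

# Toy frame with a two-level index; the first level is the customer id.
idx = pd.MultiIndex.from_tuples([(101, "a"), (102, "b")],
                                names=["customer_id", "other_level"])
toy = pd.DataFrame({"order_ships": [0, 1]}, index=idx)

# Equivalent to [x[0] for x in toy.index], but without a Python loop.
cust_ids = toy.index.get_level_values(0).tolist()

toy = toy.reset_index(drop=True)
toy["customer_id"] = cust_ids
print(toy)
```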

#### **-> Execute only if working with the original already-preprocessed dataset (or already preprocessed the first section)**

This option was not planned at the beginning. The dataset read here is the one that the code in the section **-> Execute only if working with the original dataset** is supposed to return; because that code is very time consuming, we ran it once and saved the result to a csv file, which is the file read in this section.
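One side effect of caching the dataset as a csv is that list-like columns come back as strings. A minimal sketch of the round trip and of the parsing step, using an in-memory buffer instead of a real file (the helper name `parse_int_list` and the toy values are invented):

```python
import io

import numpy as np
import pandas as pd

# A column of arrays, as it might look before saving.
original = pd.DataFrame({
    "first_20_events": [np.array([3, 7, 7]), np.array([1, 2, 5])],
})

# Write to and read from an in-memory buffer to mimic to_csv / read_csv.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

raw_value = reloaded["first_20_events"].iloc[0]  # a string, no longer an array

def parse_int_list(text):
    """Turn a string such as '[3 7 7]' back into a list of ints."""
    return [int(token) for token in text.strip("[]").split()]

reloaded["first_20_events"] = reloaded["first_20_events"].apply(parse_int_list)
print(reloaded)
```

The same idea with `float` instead of `int` would cover a column of waiting times.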

In [12]:
def embeddings(df, col_name = 'first_20_events'):
    sequences_evs = df[col_name].apply(lambda x: np.array(x)).to_numpy()
    sequences_times = df['time_since_last_event'].apply(lambda x: np.array(x)).to_numpy()

    # np.vstack needs every sequence to have the same length (20 here)
    padded_time_events = np.vstack(sequences_evs)
    padded_time_waits = np.vstack(sequences_times)
    
    max_seq_length = 20
    embedding_dim = 5

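    # Small LSTM encoder: length-20 sequence -> embedding_dim-dimensional vector.
    # Note: the model is never fitted, so the predictions below come from
    # randomly initialised weights.
    model = Sequential()
    model.add(LSTM(units=128, input_shape=(max_seq_length, 1), return_sequences=False))
    model.add(Dense(units=64, activation='relu'))
    model.add(Dense(units=32, activation='relu'))
    model.add(Dense(units=embedding_dim, activation='relu'))
    model.compile(loss='mse', optimizer='adam')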

    print('Predicting embeddings for time...')
    time_embeddings = model.predict(padded_time_waits)

    print('Predicting embeddings for events...')
    event_embeddings = model.predict(padded_time_events)

    event_embd = pd.DataFrame(event_embeddings, columns=[f'event_embd_{i}' for i in range(embedding_dim)])
    time_embd = pd.DataFrame(time_embeddings, columns=[f'time_embd_{i}' for i in range(embedding_dim)])

    df.drop(columns=[col_name, 'time_since_last_event'], inplace=True)
    df.reset_index(drop=True, inplace=True)

    new_dfx = pd.concat([df, time_embd, event_embd], axis=1)
    return new_dfx

def vec_to_list(event_list):
    event_list = event_list.replace('[', '').replace(']', '').split()
    event_list = [int(float(x)) for x in event_list]
    return event_list

def preprocessing_steps_embedding(data):
    df = data.copy()
    
    # Dropping columns that introduce bias to the model
    df = df.drop(columns=['Unnamed: 1', 'downpayment_cleared', 'first_purchase',
                          'max_milestone', 'downpayment_received', 'account_activitation', 'customer_id'])
    
    # We set these parameters for later use with the sequence features
    number_events_fixed = 20
    col_name = 'first_' + str(number_events_fixed) +'_events'
    
    # As we are reading the data from a csv, the list of events is read as a string
    # and therefore we need to transform this type of data
    result = []
    for item in list(df[col_name]):
        numbers = [int(num) for num in item.replace('[', '').replace(']', '').split()]
        #numbers += [0] * (number_events_fixed - len(numbers))    
        result.append(numbers)
    result2 = []
    for item in list(df['time_since_last_event']):
        numbers = [float(num) for num in item.replace('[', '').replace(']', '').split()]
        #numbers += [0] * (number_events_fixed - len(numbers))    
        result2.append(numbers)
    
    # Both columns now hold Python lists again
    df[col_name] = result
    df['time_since_last_event'] = result2
    
    # Cast every column except the last two to float
    # (the last two are expected to be the list-valued sequence columns)
    df = df.astype({col: 'float' for col in df.columns[:-2]})
    
    # The initial_devices column contains NaN values, so we drop the affected rows
    df = df.dropna(axis=0)
    
    # Adding more features
    df['total_time_spent'] = df['time_since_last_event'].apply(lambda x: np.sum(x))
    df['time_mean'] = df['time_since_last_event'].apply(lambda x: np.mean(x))
    print('mean added')
    df['time_std'] = df['time_since_last_event'].apply(lambda x: np.std(x))
    print('std added')
    df['time_max'] = df['time_since_last_event'].apply(lambda x: np.max(x))
    print('max added')
    
    # Generate the embeddings. embeddings() drops the first_20_events and
    # time_since_last_event columns and appends the embedding columns instead.
    df = embeddings(df)

    # Balance the dataset by undersampling class 0 down to the size of class 1
    df_0, df_1 = df[df.order_ships == 0], df[df.order_ships == 1]
    df_0 = df_0.sample(n=len(df_1), random_state=2024)
    # df_1 = df_1.sample(n=(len(df_0)), replace=True)
    df_balanced = pd.concat([df_0, df_1], axis=0).reset_index(drop=True)

    # shuffle
    df_balanced = df_balanced.sample(frac=1)

    df_X = df_balanced.drop(columns='order_ships')
    target = df_balanced.order_ships
    ori_df = df.drop(columns='order_ships')
    ori_target = df.order_ships

    boolean_col = ['discover', 'one_more_journey', 'approved_credit', 'has_prospecting', 'has_pre_application']

    for col in boolean_col:
        df_X[col] = [1 if val == True else 0 for val in df_X[col]]
        ori_df[col] = [1 if val == True else 0 for val in ori_df[col]]

    return ori_df, ori_target, df_X, target

# Read in preprocessed original dataset
df = pd.read_csv('../../1. Data/export_n_20.csv')
df.reindex(sorted(df.columns), axis=1)  # note: the result is not assigned, so df keeps its column order
df.head()

# Preprocess
ori_data, ori_target, df, target = preprocessing_steps_embedding(df)

mean added
std added
max added
Predicting embeddings for time...
Predicting embeddings for events...
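The embeddings function stacks the sequences with np.vstack and feeds them to an LSTM declared with input_shape=(20, 1), which only works if every sequence has exactly 20 entries. The sketch below shows one way to guarantee that shape using NumPy only. Zero padding is an assumption (it mirrors the commented-out padding lines in the preprocessing function), and the helper name `pad_to_length` is invented.

```python
import numpy as np

max_seq_length = 20  # same value as in the notebook

def pad_to_length(seq, length, pad_value=0.0):
    """Truncate or right-pad a 1-D sequence so it has exactly `length` entries."""
    arr = np.asarray(seq, dtype=float)[:length]
    return np.pad(arr, (0, length - len(arr)), constant_values=pad_value)

# Two toy sequences: one too short, one too long.
ragged = [[1.0, 2.0, 3.0], [4.0] * 25]

padded = np.vstack([pad_to_length(s, max_seq_length) for s in ragged])

# Add the trailing feature axis so the array is (n_samples, timesteps, 1).
lstm_input = padded[..., np.newaxis]
print(padded.shape, lstm_input.shape)
```

Passing an array with the explicit trailing axis makes the expected (samples, timesteps, features) layout visible at the call site instead of relying on implicit reshaping.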


### Saving the dataset and reading again (memory and time issues)

Because of limited computational resources and long run times, we save the dataset to a **.csv** file and read it back later instead of recomputing it.

In [18]:
#pd.concat([ori_data, ori_target], axis=1).to_csv('../../1. Data/data_with_embeddings.csv', index=False)
#pd.concat([df, target], axis=1).to_csv('../../1. Data/data_with_embeddings_balanced.csv', index=False) # This data was made in order to train the models and test with the whole dataset
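The balanced file written above is the result of undersampling class 0 inside preprocessing_steps_embedding. A minimal sketch of that step on toy data (the feature column `x` and the class counts are invented; `order_ships` is the target name used in the notebook):

```python
import pandas as pd

# Toy data: 7 rows of class 0 and 3 rows of class 1.
toy = pd.DataFrame({"x": range(10), "order_ships": [0] * 7 + [1] * 3})

majority = toy[toy.order_ships == 0]
minority = toy[toy.order_ships == 1]

balanced = (
    pd.concat([majority.sample(n=len(minority), random_state=2024), minority])
    .sample(frac=1, random_state=2024)  # shuffle the rows
    .reset_index(drop=True)
)
print(balanced.order_ships.value_counts())
```

Passing random_state to both sample calls keeps the subset and the row order reproducible without depending on the global NumPy seed.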