This notebook uses an e-commerce events dataset from Kaggle. [The source page is here](https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-electronics-store?resource=download).

In [None]:
%pip install numpy
%pip install matplotlib
%pip install pandas
%pip install seaborn
%pip install scikit-learn
%pip install scipy

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler  # to standardize the features
from sklearn.decomposition import PCA  # to apply PCA
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from sklearn.mixture import GaussianMixture
sns.set_context("talk", font_scale=1.5)

df = pd.read_csv('events.csv')  # read_csv already returns a DataFrame
df.info()  # check that the data imported successfully (info() prints its own summary)

Feel free to explore the dataset with the commented-out snippets below.

In [None]:
# try to explore the data

# print(df.head())
# print(df.shape)
# df.info()
# print(df.describe())
# print(type(df.loc[234, 'event_type']))

Rows containing NaN values should be removed first.

In [None]:
# remove the rows with NaN values

df.dropna(axis=0, how='any', inplace=True)  # drop every row that contains at least one NaN
df.info()
print(df.isnull().sum())  # confirm that no NaN values remain
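
To see how many rows this step costs, the NaN counts and the row totals can be compared before and after the drop. The sketch below uses a small made-up frame (`toy`) in place of `events.csv`, so the column values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for events.csv (values are made up for illustration)
toy = pd.DataFrame({
    "event_type": ["view", "cart", "purchase", "view"],
    "brand": ["acme", np.nan, "acme", "globex"],
    "price": [10.0, 20.0, np.nan, 5.5],
})

missing_per_column = toy.isnull().sum()   # NaN count for each column
cleaned = toy.dropna(axis=0, how="any")   # keep only fully populated rows
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

Running the same comparison on the real `df` before calling `dropna` shows which columns are responsible for most of the removed rows.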

The `event_type` column contains only four kinds of string: "view", "cart", "purchase", and "remove_from_cart". We are going to digitize this column by converting:

- "view" to the integer 2,
- "cart" to the integer 5,
- "purchase" to the integer 10,
- "remove_from_cart" to the integer 0.

However, no "remove_from_cart" rows remain at this point, because they were dropped along with all other rows containing a `NaN` value.

In [None]:
print(df['event_type'].unique()) # check the unique values in the column before replacing

In [None]:
mapping = {'view': 2, 'cart': 5, 'purchase': 10, 'remove_from_cart': 0}
df['event_type'] = df['event_type'].replace(mapping)

In [None]:
print(df['event_type'].unique()) # check the unique values in the column after replacing
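
As a cross-check on the conversion, `Series.map` can be used as an alternative to `replace`: labels that are missing from the dictionary become NaN, which makes them easy to spot. This is a minimal sketch on a made-up Series; the "wishlist" label is hypothetical and does not come from the dataset.

```python
import pandas as pd

mapping = {"view": 2, "cart": 5, "purchase": 10, "remove_from_cart": 0}

# Toy column; "wishlist" is a made-up label that is not in the mapping
events = pd.Series(["view", "cart", "purchase", "wishlist"])

weights = events.map(mapping)               # unmapped labels become NaN
unmapped = events[weights.isna()].unique()  # labels the mapping did not cover

print(weights.tolist())
print("Unmapped labels:", list(unmapped))
```

With `replace`, an unmapped label would stay in the column as a string, so the column would end up with mixed types.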

Now let's drop the unnecessary columns.

In [None]:
df.head()

In [None]:
df = df.drop(columns=['event_time', 'user_session', 'category_code'])
df.head()

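Hash the brand strings to integers.

In [None]:
import hashlib
# Built-in hash() is salted per process for strings, so a stable digest keeps the codes reproducible
df['brand'] = df['brand'].apply(lambda x: int(hashlib.md5(x.encode('utf-8')).hexdigest(), 16) % (10 ** 8))
df.head()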
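
Because the codes are reduced modulo `10 ** 8`, two different brands can in principle end up with the same integer. The sketch below checks for that on a few made-up brand names. It assumes an MD5 digest from `hashlib` rather than the built-in `hash()`, which is salted per process for strings and so gives different codes on each run; `stable_brand_code` is a hypothetical helper name.

```python
import hashlib
import pandas as pd

def stable_brand_code(name: str) -> int:
    """Hypothetical helper: deterministic integer code for a brand string."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % (10 ** 8)

brands = pd.Series(["samsung", "apple", "sony", "apple"])  # toy values
codes = brands.apply(stable_brand_code)

# If the two counts differ, at least two distinct brands share a code
print(brands.nunique(), codes.nunique())
```

The same comparison of `nunique()` before and after hashing can be run on the real `brand` column.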

Now let's split the dataset into training and testing sets.

In [None]:
training_data = df.sample(frac=0.8, random_state=25)
testing_data = df.drop(training_data.index)

print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")
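
The `sample` and `drop` pattern relies on unique index labels: the test set is whatever remains once the sampled labels are removed. A small sketch on a made-up frame shows that the two parts do not overlap and together cover every row.

```python
import pandas as pd

# Toy frame with ten rows (values are made up for illustration)
toy = pd.DataFrame({"user_id": range(10), "price": [float(i) for i in range(10)]})

train = toy.sample(frac=0.8, random_state=25)  # 80% of rows, reproducible via the seed
test = toy.drop(train.index)                   # the remaining 20%

# The two parts should not overlap and should cover every row
overlap = train.index.intersection(test.index)
covered = set(train.index) | set(test.index)

print(len(train), len(test), len(overlap))
```

If the index contained duplicate labels, `drop` would remove every row sharing a sampled label, so the test set could come out smaller than expected.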

Save both splits to CSV files.

In [None]:
training_data.to_csv('training_data.csv', index=False)
testing_data.to_csv('testing_data.csv', index=False)
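
To confirm that a saved file reads back as expected, it can be reloaded and compared with the frame that was written. This sketch writes a made-up frame to a temporary directory instead of the working directory, so it does not touch the real output files.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for training_data (values are made up)
toy = pd.DataFrame({"event_type": [2, 5, 10], "price": [1.5, 2.0, 3.25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training_data.csv")
    toy.to_csv(path, index=False)  # index=False avoids writing an extra unnamed column
    reloaded = pd.read_csv(path)

print(reloaded.shape == toy.shape)
print(list(reloaded.columns))
```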