# Task
Analyze the provided e-commerce event data ("events.csv" and "item_properties.csv") to develop an algorithm that predicts the properties of items added to the cart based on the viewing behavior of visitors. The algorithm should leverage data from "view" events to infer implicit item properties for items in "addtocart" events for any visitor in the log.
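
To make the goal concrete, here is a minimal baseline sketch on toy data. It is not the notebook's final algorithm: for each `addtocart` event it predicts the category the visitor has viewed most often so far. The toy events, the `item_category` mapping, and the function name `predict_cart_category` are all hypothetical.

```python
from collections import Counter

import pandas as pd

# Toy data: illustrative values only, not taken from events.csv.
toy_events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-06-01 10:00", "2015-06-01 10:05", "2015-06-01 10:10",
        "2015-06-01 10:15", "2015-06-02 09:00", "2015-06-02 09:30",
    ]),
    "visitorid": [1, 1, 1, 1, 2, 2],
    "event": ["view", "view", "view", "addtocart", "view", "addtocart"],
    "itemid": [10, 11, 12, 13, 20, 21],
})
# Hypothetical item -> category lookup (would come from item properties).
item_category = {10: "A", 11: "A", 12: "B", 13: "A", 20: "C", 21: "C"}


def predict_cart_category(events, item_category):
    """Predict each cart item's category from the visitor's earlier views."""
    rows = []
    ordered = events.sort_values("timestamp")
    for visitor, group in ordered.groupby("visitorid"):
        viewed = []  # categories of items viewed so far by this visitor
        for row in group.itertuples(index=False):
            if row.event == "view":
                category = item_category.get(row.itemid)
                if category is not None:
                    viewed.append(category)
            elif row.event == "addtocart" and viewed:
                # Most frequent category among the earlier views.
                predicted = Counter(viewed).most_common(1)[0][0]
                rows.append((visitor, row.itemid, predicted))
    return pd.DataFrame(rows, columns=["visitorid", "itemid", "predicted_category"])


predictions = predict_cart_category(toy_events, item_category)
print(predictions)
```

Cart events with no earlier views are skipped here; a real model would need a fallback for them.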

## Data preparation

### Subtask:
Process the `events` and `item_properties` dataframes: convert timestamps to datetime objects, then merge the relevant dataframes on `itemid`.


**Reasoning**:
Convert timestamps to datetime objects and merge the events and item_props_filtered dataframes.



In [None]:
# --- Import Libraries ---
import pandas as pd
from datetime import datetime

In [None]:
# --- Mount Drive ---
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# --- Load Data ---

chunk_size = 100000  # Rows per chunk; adjust based on available memory


def read_csv_in_chunks(path, chunk_size):
    """Read a large CSV in chunks and concatenate them into one DataFrame."""
    chunks = [chunk for chunk in pd.read_csv(path, chunksize=chunk_size)]
    return pd.concat(chunks, ignore_index=True)


# Read the large files in chunks to avoid session crashes
events = read_csv_in_chunks('/content/drive/MyDrive/data/TMP/events.csv', chunk_size)

category_tree = pd.read_csv('/content/drive/MyDrive/data/TMP/category_tree.csv')

df_prop1 = read_csv_in_chunks('/content/drive/MyDrive/data/TMP/item_properties_part1.1.csv', chunk_size)
df_prop2 = read_csv_in_chunks('/content/drive/MyDrive/data/TMP/item_properties_part2.csv', chunk_size)

item_props = pd.concat([df_prop1, df_prop2], ignore_index=True)

# Display the first few rows of the dataframes to confirm
print("First few rows of items_props:")
display(item_props.head())
print("\nFirst few rows of events:")
display(events.head())
print("\nFirst few rows of category:")
display(category_tree.head())

# Check the shape of each dataset
print("\nShape of item_props:")
display(item_props.shape)
print("\nShape of events:")
display(events.shape)
print("\nShape of category_tree:")
display(category_tree.shape)

First few rows of items_props:


Unnamed: 0,timestamp,itemid,property,value
0,1435460400000,460429,categoryid,1338
1,1441508400000,206783,888,1116713 960601 n277.200
2,1439089200000,395014,400,n552.000 639502 n720.000 424566
3,1431226800000,59481,790,n15360.000
4,1431831600000,156781,917,828513



First few rows of events:


Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,



First few rows of category:


Unnamed: 0,categoryid,parentid
0,1016,213.0
1,809,169.0
2,570,9.0
3,1691,885.0
4,536,1691.0



Shape of item_props:


(20275902, 4)


Shape of events:


(2756101, 5)


Shape of category_tree:


(1669, 2)
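
With roughly 20 million property rows, memory is the main constraint in this notebook. One option that the cells here do not use is to pass narrower dtypes to `pd.read_csv`. The sketch below is a minimal illustration on two rows copied from the events preview. It assumes the ID columns fit in 32 bits, which the `describe()` maxima later in this notebook suggest. The `event_dtypes` mapping is a hypothetical choice.

```python
import io

import pandas as pd

# Two rows copied from the events preview, inlined so the sketch is self-contained.
csv_text = """timestamp,visitorid,event,itemid,transactionid
1433221332117,257597,view,355908,
1433224214164,992329,view,248676,
"""

# Hypothetical dtype choices: 32-bit IDs and a categorical event column.
event_dtypes = {
    "visitorid": "int32",
    "itemid": "int32",
    "event": "category",
    "transactionid": "float32",
}

toy_events = pd.read_csv(io.StringIO(csv_text), dtype=event_dtypes)
toy_events.info(memory_usage="deep")
```

The same `dtype` argument can be combined with `chunksize`. The timestamp column is left as `int64` because millisecond epochs do not fit in 32 bits.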

In [None]:
print("\nInfo for item_props DataFrame:")
item_props.info()

print("\nInfo for events DataFrame:")
events.info()

print("\nInfo for category_tree DataFrame:")
category_tree.info()


Info for item_props DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20275902 entries, 0 to 20275901
Data columns (total 4 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   timestamp  int64 
 1   itemid     int64 
 2   property   object
 3   value      object
dtypes: int64(2), object(2)
memory usage: 618.8+ MB

Info for events DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   timestamp      int64  
 1   visitorid      int64  
 2   event          object 
 3   itemid         int64  
 4   transactionid  float64
dtypes: float64(1), int64(3), object(1)
memory usage: 105.1+ MB

Info for category_tree DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1669 entries, 0 to 1668
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   categoryid  1669 non-null   int64  


In [None]:
# Fill missing transactionid values with 0 in the events DataFrame
events['transactionid'] = events['transactionid'].fillna(0)

# Verify the changes
print("\nInfo for events DataFrame after handling missing transactionid:")
events.info()


Info for events DataFrame after handling missing transactionid:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   timestamp      int64  
 1   visitorid      int64  
 2   event          object 
 3   itemid         int64  
 4   transactionid  float64
dtypes: float64(1), int64(3), object(1)
memory usage: 105.1+ MB
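
Filling missing `transactionid` values with 0 makes "no transaction" indistinguishable from a real transaction whose ID is 0, if such an ID exists in the log. A possible alternative, shown here on toy data rather than on the notebook's `events`, is pandas' nullable `Int64` dtype plus an explicit flag. The column name `has_transaction` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the events table; values are illustrative only.
toy_events = pd.DataFrame({
    "event": ["view", "transaction", "addtocart"],
    "transactionid": [np.nan, 0.0, np.nan],
})

# The nullable integer dtype keeps "no transaction" as <NA> instead of 0.
toy_events["transactionid"] = toy_events["transactionid"].astype("Int64")

# An explicit boolean flag is often all that downstream code needs.
toy_events["has_transaction"] = toy_events["transactionid"].notna()

print(toy_events)
```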


In [None]:
# Drop rows with missing parentid in the category_tree DataFrame
# (this removes 25 rows, which are likely the top-level categories)
category_tree = category_tree.dropna(subset=['parentid'])

# Verify the changes
print("\nInfo for category_tree DataFrame after dropping rows with missing parentid:")
category_tree.info()


Info for category_tree DataFrame after dropping rows with missing parentid:
<class 'pandas.core.frame.DataFrame'>
Index: 1644 entries, 0 to 1668
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   categoryid  1644 non-null   int64  
 1   parentid    1644 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 38.5 KB


In [None]:
# Check for missing values in the item_props DataFrame
print("\nMissing values in item_props DataFrame:")
display(item_props.isnull().sum())


Missing values in item_props DataFrame:


Unnamed: 0,0
timestamp,0
itemid,0
property,0
value,0


In [None]:
# Check for duplicate rows in the item_props DataFrame
print("\nNumber of duplicate rows in item_props DataFrame:")
display(item_props.duplicated().sum())


Number of duplicate rows in item_props DataFrame:


np.int64(0)

In [None]:
# Check for duplicate rows in the events DataFrame
print("\nNumber of duplicate rows in events DataFrame:")
display(events.duplicated().sum())


Number of duplicate rows in events DataFrame:


np.int64(460)

In [None]:
# Check for and drop duplicate rows in the events DataFrame
print("Shape of events DataFrame before dropping duplicates:", events.shape)
events.drop_duplicates(inplace=True)
print("Shape of events DataFrame after dropping duplicates:", events.shape)
print("\nNumber of duplicate rows in events DataFrame after dropping:", events.duplicated().sum())

Shape of events DataFrame before dropping duplicates: (2756101, 5)
Shape of events DataFrame after dropping duplicates: (2755641, 5)

Number of duplicate rows in events DataFrame after dropping: 0


In [None]:
# Check for and drop duplicate rows in the category_tree DataFrame
print("\nShape of category_tree DataFrame before dropping duplicates:", category_tree.shape)
category_tree.drop_duplicates(inplace=True)
print("Shape of category_tree DataFrame after dropping duplicates:", category_tree.shape)
print("\nNumber of duplicate rows in category_tree DataFrame after dropping:", category_tree.duplicated().sum())


Shape of category_tree DataFrame before dropping duplicates: (1644, 2)
Shape of category_tree DataFrame after dropping duplicates: (1644, 2)

Number of duplicate rows in category_tree DataFrame after dropping: 0


In [None]:
print("Description of events DataFrame:")
display(events.describe())

print("\nDescription of item_props DataFrame:")
display(item_props.describe())

print("\nDescription of category_tree DataFrame:")
display(category_tree.describe())

Description of events DataFrame:


Unnamed: 0,timestamp,visitorid,itemid,transactionid
count,2755641.0,2755641.0,2755641.0,2755641.0
mean,1436424000000.0,701922.7,234921.4,71.93124
std,3366334000.0,405689.2,134194.7,917.3886
min,1430622000000.0,0.0,3.0,0.0
25%,1433478000000.0,350566.0,118120.0,0.0
50%,1436453000000.0,702060.0,236062.0,0.0
75%,1439225000000.0,1053443.0,350714.0,0.0
max,1442545000000.0,1407579.0,466867.0,17671.0



Description of item_props DataFrame:


Unnamed: 0,timestamp,itemid
count,20275900.0,20275900.0
mean,1435157000000.0,233390.4
std,3327798000.0,134845.2
min,1431227000000.0,0.0
25%,1432436000000.0,116516.0
50%,1433646000000.0,233483.0
75%,1437880000000.0,350304.0
max,1442113000000.0,466866.0



Description of category_tree DataFrame:


Unnamed: 0,categoryid,parentid
count,1644.0,1644.0
mean,847.354623,847.571168
std,489.7462,505.058485
min,0.0,8.0
25%,425.75,381.0
50%,847.5,866.0
75%,1270.25,1291.0
max,1697.0,1698.0


In [None]:
# Convert timestamp columns to datetime objects
events['timestamp'] = pd.to_datetime(events['timestamp'], unit='ms')
item_props['timestamp'] = pd.to_datetime(item_props['timestamp'], unit='ms')

print(item_props.head())


            timestamp  itemid    property                            value
0 2015-06-28 03:00:00  460429  categoryid                             1338
1 2015-09-06 03:00:00  206783         888          1116713 960601 n277.200
2 2015-08-09 03:00:00  395014         400  n552.000 639502 n720.000 424566
3 2015-05-10 03:00:00   59481         790                       n15360.000
4 2015-05-17 03:00:00  156781         917                           828513

In [None]:
# Separate 'categoryid' and 'available' properties from other properties
category_props = item_props[item_props['property'] == 'categoryid'].copy()
available_props = item_props[item_props['property'] == 'available'].copy()
other_props = item_props[~item_props['property'].isin(['categoryid', 'available'])].copy()

# For 'other_props', the 'value' is hashed and can contain normalized/hashed text or numerical values prefixed with 'n'.
# We need to extract numerical values where possible.
def extract_numerical_value(value):
    if isinstance(value, str) and value.startswith('n'):
        try:
            return float(value[1:])
        except ValueError:
            return None  # Return None for values that can't be converted
    return None # Return None for non-string values or those not starting with 'n'

other_props['numerical_value'] = other_props['value'].apply(extract_numerical_value)

# We can also consider encoding the 'property' and 'value' columns in 'other_props'
# For now, let's focus on the numerical values from 'other_props' and the 'categoryid' and 'available'
# Further feature engineering might be needed depending on the model and task.

# Display the first few rows of the separated dataframes
print("First few rows of category_props:")
display(category_props.head())
print("\nFirst few rows of available_props:")
display(available_props.head())
print("\nFirst few rows of other_props with extracted numerical value:")
display(other_props.head())
print("\nInfo for other_props DataFrame after extracting numerical value:")
other_props.info()

First few rows of category_props:


Unnamed: 0,timestamp,itemid,property,value
0,2015-06-28 03:00:00,460429,categoryid,1338
140,2015-05-24 03:00:00,281245,categoryid,1277
151,2015-06-28 03:00:00,35575,categoryid,1059
189,2015-07-19 03:00:00,8313,categoryid,1147
197,2015-07-26 03:00:00,55102,categoryid,47



First few rows of available_props:


Unnamed: 0,timestamp,itemid,property,value
5,2015-07-05 03:00:00,285026,available,0
15,2015-07-19 03:00:00,186518,available,0
79,2015-06-07 03:00:00,423682,available,0
82,2015-06-14 03:00:00,316253,available,1
96,2015-07-19 03:00:00,430459,available,0



First few rows of other_props with extracted numerical value:


Unnamed: 0,timestamp,itemid,property,value,numerical_value
1,2015-09-06 03:00:00,206783,888,1116713 960601 n277.200,
2,2015-08-09 03:00:00,395014,400,n552.000 639502 n720.000 424566,
3,2015-05-10 03:00:00,59481,790,n15360.000,15360.0
4,2015-05-17 03:00:00,156781,917,828513,
6,2015-06-14 03:00:00,89534,213,1121373,



Info for other_props DataFrame after extracting numerical value:
<class 'pandas.core.frame.DataFrame'>
Index: 17984049 entries, 1 to 20275901
Data columns (total 5 columns):
 #   Column           Dtype         
---  ------           -----         
 0   timestamp        datetime64[ns]
 1   itemid           int64         
 2   property         object        
 3   value            object        
 4   numerical_value  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 823.2+ MB
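
Note that `extract_numerical_value` only succeeds when the whole value is a single `n`-prefixed token. Multi-token values such as `n552.000 639502 n720.000 424566` come back as NaN, as the preview above shows. If those embedded numbers matter, one possible approach is to parse token by token. The sketch below is an illustration on the four sample values from the preview. The function name `split_property_value` is hypothetical.

```python
import pandas as pd


def split_property_value(value):
    """Split a raw property value into numeric tokens and hashed tokens."""
    numbers, hashed = [], []
    for token in str(value).split():
        if token.startswith("n"):
            try:
                numbers.append(float(token[1:]))
                continue
            except ValueError:
                pass  # Not a number after all; treat it as a hashed token
        hashed.append(token)
    return numbers, hashed


# Sample values copied from the item_props preview.
samples = pd.Series([
    "1116713 960601 n277.200",
    "n552.000 639502 n720.000 424566",
    "n15360.000",
    "828513",
])

for raw in samples:
    numbers, hashed = split_property_value(raw)
    print(f"{raw!r}: numbers={numbers}, hashed={hashed}")
```

Applying a Python function row by row to roughly 18 million values is slow, so in practice this would probably be limited to the properties that turn out to be useful.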


In [None]:
# Concatenate category_props and available_props
item_props_filtered = pd.concat([category_props, available_props], ignore_index=True)

# Display the first few rows and info of the filtered DataFrame
print("First few rows of item_props_filtered:")
display(item_props_filtered.head())
print("\nInfo for item_props_filtered DataFrame:")
item_props_filtered.info()

First few rows of item_props_filtered:


Unnamed: 0,timestamp,itemid,property,value
0,2015-06-28 03:00:00,460429,categoryid,1338
1,2015-05-24 03:00:00,281245,categoryid,1277
2,2015-06-28 03:00:00,35575,categoryid,1059
3,2015-07-19 03:00:00,8313,categoryid,1147
4,2015-07-26 03:00:00,55102,categoryid,47



Info for item_props_filtered DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2291853 entries, 0 to 2291852
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   timestamp  datetime64[ns]
 1   itemid     int64         
 2   property   object        
 3   value      object        
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 69.9+ MB


In [None]:
print("Shape of original item_props DataFrame:")
display(item_props.shape)
print("\nShape of filtered item_props_filtered DataFrame:")
display(item_props_filtered.shape)

Shape of original item_props DataFrame:


(20275902, 4)


Shape of filtered item_props_filtered DataFrame:


(2291853, 4)

In [None]:
# Define the path to save the DataFrame in your Google Drive
item_props_filtered_save_path = '/content/drive/MyDrive/item_props_filtered.csv'

# Save the DataFrame to a CSV file
# Using index=False to avoid writing the DataFrame index as a column in the CSV
item_props_filtered.to_csv(item_props_filtered_save_path, index=False)

print(f"item_props_filtered DataFrame saved to: {item_props_filtered_save_path}")

item_props_filtered DataFrame saved to: /content/drive/MyDrive/item_props_filtered.csv


In [None]:
# Merge events and item_props_filtered on 'itemid'.
# Try a direct merge first; if it raises MemoryError, fall back to merging
# the events in smaller chunks and concatenating the results.
try:
    merged_data = pd.merge(events, item_props_filtered, on='itemid', how='left')
except MemoryError:
    print("MemoryError: Merging dataframes directly failed. Attempting chunked merge.")
    chunk_size = 50000  # Smaller chunks to reduce peak memory use
    merged_chunks = []
    for i in range(0, len(events), chunk_size):
        print(f"Processing chunk {i // chunk_size + 1}...")
        events_chunk = events.iloc[i:i + chunk_size]
        merged_chunks.append(
            pd.merge(events_chunk, item_props_filtered, on='itemid', how='left')
        )
    merged_data = pd.concat(merged_chunks, ignore_index=True)

# Sample 40% of the merged data (fixed seed for reproducibility)
merged_data_sampled = merged_data.sample(frac=0.4, random_state=42)

# Display the first few rows and info of the sampled DataFrame
print("First few rows of merged_data_sampled:")
display(merged_data_sampled.head())
print("\nInfo for merged_data_sampled DataFrame:")
merged_data_sampled.info()

First few rows of merged_data_sampled:


Unnamed: 0,timestamp_x,visitorid,event,itemid,transactionid,timestamp_y,property,value
9599723,2015-08-18 02:04:10.946,1150086,addtocart,301602,0.0,2015-08-02 03:00:00,available,0
23822402,2015-07-12 22:14:29.871,267148,view,177773,0.0,2015-05-31 03:00:00,available,1
12791582,2015-09-05 09:26:44.672,1228636,view,92681,0.0,2015-09-13 03:00:00,available,1
21731626,2015-07-03 07:28:04.979,756302,view,343468,0.0,2015-08-30 03:00:00,available,0
20938204,2015-05-31 01:43:16.798,269471,view,202699,0.0,2015-07-19 03:00:00,available,0



Info for merged_data_sampled DataFrame:
<class 'pandas.core.frame.DataFrame'>
Index: 11327563 entries, 9599723 to 17257659
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   timestamp_x    datetime64[ns]
 1   visitorid      int64         
 2   event          object        
 3   itemid         int64         
 4   transactionid  float64       
 5   timestamp_y    datetime64[ns]
 6   property       object        
 7   value          object        
dtypes: datetime64[ns](2), float64(1), int64(2), object(3)
memory usage: 777.8+ MB


In [None]:
merged_data_sampled.shape

(11327563, 8)
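
The plain merge on `itemid` pairs every event with every property snapshot of that item, which is why the row count grows far beyond the size of `events`. It also lets an event match a snapshot recorded after it: in the sample above, an event on 2015-09-05 is joined to a snapshot dated 2015-09-13. If the intent is "the property value in effect when the event happened", one possible alternative is `pd.merge_asof`. The sketch below uses toy data with illustrative values and is not the notebook's method.

```python
import pandas as pd

# Toy events and category snapshots; values are illustrative only.
toy_events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-06-10", "2015-07-01", "2015-06-20"]),
    "visitorid": [1, 1, 2],
    "event": ["view", "addtocart", "view"],
    "itemid": [10, 10, 20],
})
toy_category = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-06-07", "2015-06-28", "2015-06-14"]),
    "itemid": [10, 10, 20],
    "categoryid": ["100", "200", "300"],
})

# merge_asof requires both frames to be sorted by the "on" key.
# direction="backward" picks, per item, the latest snapshot at or before the event.
snapshot = pd.merge_asof(
    toy_events.sort_values("timestamp"),
    toy_category.sort_values("timestamp"),
    on="timestamp",
    by="itemid",
    direction="backward",
)

print(snapshot)
```

This returns exactly one row per event. Events that precede an item's first snapshot get NaN and would need a separate fallback.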