# Building and analysing an OLX user network

In [1]:
import graph_tool_extras as gte
from pathlib import Path
from random import random, choices, seed
import matplotlib.pyplot as plt
import numpy as np

## Introduction

In this notebook, a two-mode network of OLX users and items is projected into a one-mode network of OLX users. The dataset (olx-jobs) is published by Grupa OLX sp. z o.o. and contains 65 502 201 events made on http://olx.pl/praca by 3 295 942 users who interacted with 185 395 job ads over 2 weeks in 2020. It is available at https://www.kaggle.com/datasets/olxdatascience/olx-jobs-interactions.

In [14]:
FOLDER_PATH = Path.cwd() / 'archive (1)'
FILE_PATH = FOLDER_PATH / 'interactions.csv'

## Randomly limiting the Data

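The dataset is too large to load in full: reading it consumed all available computing resources. In addition, projecting the two-mode network into a one-mode network has a time complexity of $O(mn^2)$. For these reasons, the data was limited to at most one sixteenth of the users by randomly selecting user node indexes. The indexes are drawn with replacement, so the number of distinct users kept is somewhat lower than the number of draws.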

In [15]:
seed(42)
listaNodesIndex = range(0, 3295941) ### 3295941 is the total number of nodes
chosenNodes = choices(listaNodesIndex, k=int(3295941 / 16))
chosenNodesSet = set(chosenNodes)
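The effect of drawing with replacement can be seen on a small scale. The sketch below is only an illustration on a hypothetical population of 1000 indexes (not the real user ids): it compares `choices`, which may repeat an index, with `sample`, which never does.

```python
from random import Random

rng = Random(42)            # local generator, so the global seed is left untouched
population = range(1000)    # hypothetical toy population
k = len(population) // 16   # same 1/16 fraction as in the cell above

with_replacement = rng.choices(population, k=k)   # indexes may repeat
without_replacement = rng.sample(population, k)   # indexes are all distinct

unique_with = len(set(with_replacement))
unique_without = len(set(without_replacement))
print(f"draws: {k}, distinct with replacement: {unique_with}, "
      f"distinct without replacement: {unique_without}")
```

With `sample` the set always has exactly `k` elements, while with `choices` it can be smaller.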

## Understanding the data

(Description copied from https://www.kaggle.com/datasets/olxdatascience/olx-jobs-interactions)

The file interactions.csv consists of 65 502 201 rows.
Each row represents an interaction between a user and an item and has the following format:

user, item, event, timestamp.

- user: a numeric id representing the user who made the interaction
- item: a numeric id representing the item the user interacted with
- event: a type of interaction between the user and the item, possible values are:
    - click: the user visited the item detail page
    - bookmark: the user added the item to bookmarks
    - chat_click: the user opened the chat to contact the item’s owner
    - contact_phone_click_1: the user revealed the phone number attached to the item
    - contact_phone_click_2: the user clicked to make a phone call to the item’s owner
    - contact_phone_click_3: the user clicked to send an SMS to the item’s owner
    - contact_partner_click: the user clicked to access the item’s owner external page
    - contact_chat: the user sent a message to the item’s owner
- timestamp: the Unix timestamp of the interaction

Maintaining the confidentiality of ads and users was a priority when preparing this dataset. The measures taken to protect privacy included the following:

- original user and item identifiers were replaced by unique random integers;
- some undisclosed constant integer was added to all timestamps;
- some fraction of interactions were filtered out;
- some additional artificial interactions were added.
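As a rough illustration of the row format described above, the sketch below parses a few rows with the standard `csv` module. The rows are hypothetical, invented for this example, and the header names are assumed to match the column names listed in the description.

```python
import csv
import io

# Hypothetical rows, invented for illustration; they are not taken from the dataset.
sample = io.StringIO(
    "user,item,event,timestamp\n"
    "10,501,click,1581465600\n"
    "10,501,bookmark,1581465660\n"
    "11,501,click,1581465720\n"
)

# Group the event types by user, ignoring the timestamp.
events_by_user = {}
for row in csv.DictReader(sample):
    events_by_user.setdefault(int(row['user']), set()).add(row['event'])

print(events_by_user)
```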


## Projecting two-mode network

In [16]:
interaction_weights = {
    'click': 1,
    'bookmark': 2,
    'chat_click': 2,
    'contact_phone_click_1': 2,
    'contact_phone_click_2': 3,
    'contact_phone_click_3': 3,
    'contact_partner_click': 2,
    'contact_chat': 3,
}

item_user_dict = {}
user_user_projection = {}

In [17]:
with open(FILE_PATH) as file:
    next(file)  # skip the header row

    for line in file:
        parts = line.strip().split(',')
        user_id = int(parts[0])
        item_id = parts[1]
        interaction_type = parts[2]

        # Keep only the sampled users and record, for each item,
        # the set of interaction types each user had with it.
        if user_id in chosenNodesSet:
            item_user_dict.setdefault(item_id, {}).setdefault(user_id, set())
            item_user_dict[item_id][user_id].add(interaction_type)

In [18]:
for item, user_interaction in item_user_dict.items():
    for u1, interaction1 in user_interaction.items():
        for u2, interaction2 in user_interaction.items():
            
            if u1 != u2 and interaction1 == interaction2 and len(interaction1) > 0:
                user_user_projection.setdefault(u1, {}).setdefault(u2, 0)
                interaction = interaction1.pop()
                weight = interaction_weights.get(interaction, 0)
                
                user_user_projection[u1][u2] += weight
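For comparison, here is a minimal sketch of a variant of this projection on toy data. It is not the cell above: it assumes that two users are linked through an item when their sets of interaction types for that item are equal, sums the weights of all types in the set, visits each unordered pair once, and does not modify the sets while iterating. Its scores can therefore differ from the ones produced by the notebook. The toy dictionary and the `weights` subset are hypothetical.

```python
from itertools import combinations

# Subset of the interaction weights, repeated so the sketch is self-contained.
weights = {'click': 1, 'bookmark': 2, 'contact_chat': 3}

# Hypothetical toy data: item -> user -> set of interaction types.
toy_item_user = {
    'item_a': {1: {'click'}, 2: {'click'}, 3: {'click', 'bookmark'}},
    'item_b': {1: {'contact_chat'}, 2: {'contact_chat'}},
}

toy_projection = {}
for users in toy_item_user.values():
    # Each unordered pair of users is visited once per item.
    for u1, u2 in combinations(sorted(users), 2):
        events1, events2 = users[u1], users[u2]
        if events1 == events2:
            # Sum the weights of every shared interaction type; nothing is popped.
            score = sum(weights.get(event, 0) for event in events1)
            toy_projection[(u1, u2)] = toy_projection.get((u1, u2), 0) + score

print(toy_projection)
```

Users 1 and 2 share a click on `item_a` and a chat message on `item_b`, so they are the only pair that receives a score; user 3 has a different set of interactions with `item_a` and is not linked.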

To decide what counts as a reasonable threshold for "similar usage behavior", a list of scores was generated and the relative frequency of each score (in percent) was printed. The idea is to look for a sharp drop in frequency, based on the assumption that OLX users with similar usage behavior interact with the same items with similar intensity.

In [19]:
scores = []
for u1, neighbors in user_user_projection.items():
    for u2, weight in neighbors.items():
        scores.append(weight)

counts = {}
total = 0
for element in scores:
    total += 1
    if element in counts:
        counts[element] += 1
    else:
        counts[element] = 1

for element, count in counts.items():
    percentage = count / total * 100
    print(f"{element}: {percentage:.2f}")

1: 92.07
2: 6.41
3: 1.14
4: 0.25
6: 0.03
9: 0.00
5: 0.07
7: 0.01
8: 0.01
12: 0.00
10: 0.00
11: 0.00
15: 0.00
14: 0.00
22: 0.00
19: 0.00
16: 0.00
21: 0.00
13: 0.00
20: 0.00
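One way to read this output is to ask what share of scored pairs would survive each candidate cut-off. The sketch below does that with the rounded percentages printed above (scores above 8, which print as 0.00, are left out), so the results are approximate. The strict comparison `score > t` is an assumption chosen to mirror the comparison used later when the edges are added.

```python
# Rounded percentages copied from the output above; scores above 8 are omitted.
share = {1: 92.07, 2: 6.41, 3: 1.14, 4: 0.25, 5: 0.07, 6: 0.03, 7: 0.01, 8: 0.01}

# For each candidate cut-off t, add up the share of pairs with score > t.
surviving = {}
for t in sorted(share):
    surviving[t] = sum(pct for score, pct in share.items() if score > t)

for t, pct in surviving.items():
    print(f"score > {t}: about {pct:.2f}% of scored pairs kept")
```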


In [20]:
threshold = 1

## Understanding the projected network

#### **Concept of Vertices**
User.

#### **Concept of Edges**
An edge between i and j represents that i and j have similar usage behavior.

**OR**

An edge between i and j represents that both had the sum of weights of equivalent interactions greater than or equal to 3.

#### **Operationalization of Vertices**
Each vertex represents a user of OLX who had the sum of interaction weights with items greater than or equal to 3. Interaction can mean:

- Visit the item's page (weight 1);
- Add the item to favorites (weight 2);
- Access the chat with the item's owner (weight 2);
- Reveal the item's linked phone number (weight 2);
- Click to access the external page of the item's owner (weight 2);
- Click to make a call to the linked item's number (weight 3);
- Click to send an SMS to the linked item's number (weight 3);
- Send a message in the chat with the item's owner (weight 3).

The criteria for assigning weights were:

- Weight 3: Trying to contact the item's owner;
- Weight 2: Interacting with the item's page;
- Weight 1: Accessing the item's page.

The threshold of 3 was set by analyzing the distribution of weights.

#### **Operationalization of Edges**
Each edge represents that both users had the sum of weights of equivalent interactions greater than or equal to 3. Interaction can mean:

- Visit the item's page (weight 1);
- Add the item to favorites (weight 2);
- Access the chat with the item's owner (weight 2);
- Reveal the item's linked phone number (weight 2);
- Click to access the external page of the item's owner (weight 2);
- Click to make a call to the linked item's number (weight 3);
- Click to send an SMS to the linked item's number (weight 3);
- Send a message in the chat with the item's owner (weight 3).

The criteria for assigning weights were:

- Weight 3: Trying to contact the item's owner;
- Weight 2: Interacting with the item's page;
- Weight 1: Accessing the item's page.

For interactions to be considered equivalent, they must be:

- With the same item;
- Of the same type.

The threshold of 3 was set by analyzing the distribution of weights.

## Creating functions to build the network

In [21]:
def get_or_add_edge(g, user_a, user_b, weight):
    e = g.edge_by_ids(user_a, user_b)
    if e is None:
        e = g.add_edge_by_ids(user_a, user_b)
        e['weight'] = weight
    return e

In [22]:
def get_or_add_vertex(g, vertex_id):
    # Return the vertex with this id, creating it first if it does not exist yet.
    u = g.vertex_by_id(vertex_id)
    if u is None:
        u = g.add_vertex_by_id(vertex_id)
    return u

## Reading the data and building the network

In [23]:
g = gte.Graph(directed=False)

In [24]:
g.add_ep('weight')

In [25]:
for u1, neighbors in user_user_projection.items():
    for u2, weight in neighbors.items():
        if weight > threshold:
            
            vertex_u1 = get_or_add_vertex(g, u1)
            vertex_u2 = get_or_add_vertex(g, u2)
            edge = get_or_add_edge(g, u1, u2, weight)

In [26]:
g = gte.clean(g)

In [27]:
print("Number of vertices:", g.num_vertices())
print("Number of edges:", g.num_edges())

Number of vertices: 77449
Number of edges: 186852


In [28]:
gte.save(g, 'similar_olx_user.net.gz')

## Configuring the layout and rendering the network

In [29]:
from graph_tool import draw
import netpixi

In [30]:
layout = draw.sfdp_layout(g)

In [31]:
gte.move(g, layout)

In [32]:
gte.save(g, 'similar_olx_user_layout.net.gz')

In [33]:
r = netpixi.render('similar_olx_user_layout.net.gz', infinite=True)

## Improving network visualization

In [None]:
r.vertex_default(
    size=2,
    color=0xff7700,
    bwidth=0.2,
    bcolor=0x0000ff,
)

In [None]:
r.edge_default(
    width=0.2,
    color=0xffffff,
    curve1=0,
    curve2=0,
)

## Calculating Density and Transitivity

In [None]:
g.density()

In [None]:
g.transitivity()
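As a sanity check on what these two measures mean, the sketch below computes them by hand. It assumes the usual definitions for a simple undirected graph (density as $2m / (n(n-1))$, transitivity as closed triples over all connected triples); whether `g.density()` and `g.transitivity()` use exactly these definitions is an assumption. The density uses the vertex and edge counts printed earlier, while the transitivity is computed on a hypothetical four-vertex toy graph.

```python
from itertools import combinations

# Vertex and edge counts printed earlier in the notebook.
n, m = 77449, 186852
density = 2 * m / (n * (n - 1))  # share of possible undirected edges that exist
print(f"density = {density:.2e}")

# Hypothetical toy graph: a triangle 1-2-3 plus a pendant vertex 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

triples = 0  # pairs of neighbours around a centre vertex
closed = 0   # those pairs that are themselves connected
for centre, neighbours in toy.items():
    for a, b in combinations(sorted(neighbours), 2):
        triples += 1
        closed += b in toy[a]

transitivity = closed / triples
print(f"toy transitivity = {transitivity:.2f}")
```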

## Analysing Degree Distribution

In [None]:
degrees = g.get_total_degrees()

In [None]:
degrees.describe()

In [None]:
degrees.hist();

In [None]:
import distribution as dst  # assumption: helper module that provides the dst.* tests used here

dst.not_normal(degrees)

In [None]:
dst.more_powerlaw_than_lognormal(degrees)

In [None]:
dst.more_powerlaw_than_exponential(degrees)

## Analysing Distance Distribution

In [None]:
g.describe_distances()

In [None]:
g.hist_distances()
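For reference, "distance" is taken here to mean the number of edges on a shortest path between two vertices, which is an assumption about what the two calls above summarize. A minimal sketch of that computation with breadth-first search, on a hypothetical four-vertex toy graph:

```python
from collections import deque
from itertools import combinations

# Hypothetical toy graph: a triangle 1-2-3 plus a pendant vertex 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}


def bfs_distances(adj, source):
    """Return the shortest-path distance from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


# Distances over all unordered pairs of vertices (the toy graph is connected).
pair_distances = [bfs_distances(toy, a)[b] for a, b in combinations(sorted(toy), 2)]
mean_distance = sum(pair_distances) / len(pair_distances)
print(f"mean distance = {mean_distance:.2f}, max distance = {max(pair_distances)}")
```

Running one search per vertex is fine for a toy graph but becomes expensive on a network with tens of thousands of vertices.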