# Social Platform Emulator

## authors: A. Giovanidis, B. Baynat, C. Magnien, A. Vendeville

8 May 2020

The emulator takes a Twitter trace as input. It computes the fraction of time that posts of a given origin occupy the first position on the Wall of another user; this fraction is defined as the influence Q[user][author]. From all these values we can derive the Psi-score of every user in the trace. The emulator uses only direct information: which posts enter a Wall, at what time, and when the next incoming post pushes the current post out of the top position. In other words, the influence is the occupancy time divided by the total duration of the trace. No further information on the social graph is necessary.

Users are anonymised and treated as numbers. Tweet ids are also anonymised and the content is removed.

In [2]:
import numpy as np

We use a Twitter trace as input, with the following data format in "TextTweet" (one tweet per line):

tweetid timestamp userid retweetid
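
As a minimal sketch of this format, assuming the four fields are whitespace-separated integers and that a `retweetid` of -1 marks an original tweet (the convention the parsing loops in this notebook rely on), one trace line can be read as follows. The names `Tweet` and `parse_line` and the sample values are hypothetical.

```python
from typing import NamedTuple


class Tweet(NamedTuple):
    tweetid: int
    timestamp: int
    userid: int
    retweetid: int  # -1 for an original tweet (convention used in this notebook)


def parse_line(line: str) -> Tweet:
    """Parse one whitespace-separated trace line into a Tweet record."""
    tweetid, timestamp, userid, retweetid = (int(x) for x in line.split()[:4])
    return Tweet(tweetid, timestamp, userid, retweetid)


sample = parse_line("17 1398895201 3474 -1")  # made-up values
print(sample)
```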


In [2]:
K = 1 # Wall list size

The Wall is finite for all users, and we apply a FIFO replacement principle. The Newsfeed is not needed, because we have no information on the policy that the OSP or the user applies to choose among many posts. What we know is: (1) which post is tweeted or retweeted, (2) its time-stamp, and (3) the tweet origin.

For each user we create a Wall list of max size $K$.

We use $K=1$ to find how often posts of a given origin
occupy the most recent position of the user's Wall.
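
The FIFO replacement can be sketched with a bounded queue. This is an illustration only: the notebook itself keeps each Wall as a plain list, and the post labels below are made up.

```python
from collections import deque

K = 1  # Wall list size, as in the notebook

wall = deque(maxlen=K)  # a full deque drops its oldest element on append
evicted = []            # posts pushed out of the Wall, in order

for post in ["p1", "p2", "p3"]:  # hypothetical post labels, in arrival order
    if len(wall) == K:
        evicted.append(wall[0])  # the oldest post is about to be ejected
    wall.append(post)

print(list(wall), evicted)
```

With $K=1$ every new arrival ejects the post currently on the Wall, which is what makes the occupancy periods easy to measure.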

In [3]:
Wall = {}

**Note:** retweetid is the tweetid of the origin; leader information is lost. When a user (userid) retweets some tweet at time (timestamp), the newly produced tweet takes the id (tweetid), and the trace tells us that the origin of this tweet is (retweetid). We can then look up in the trace the author of this original tweetid, but we do not know the id of the user-leader from whom the tweet was retweeted/shared. This mainly causes problems when building the user leader-follower graph, not when calculating user influence.

In [7]:
# Open file containing the origin
# Count the number of tweets (tc) contained in trace. 
f = open("/Users/Fishbone/Desktop/NEWSFEEDfresh/PYTHON/NewResults/PsiEmul.tronc.txt")
tc = 0
tc0 = 0
for lign in f:
    lign = lign.split()
    tc +=1
    l1 = float(lign[1])
    if l1!=0:
        tc0+=1
f.close()
print(tc)
print(tc0)

3754043
405236


In [9]:
# The trace is sorted in increasing order of timestamps
f = open("/Users/Fishbone/Desktop/NEWSFEEDfresh/PYTHON/NewResults/Psi.tronque.sorted22gr.0.5804715.txt")
tc = 0
tc0 = 0
for lign in f:
    lign = lign.split()
    tc +=1
    l1 = float(lign[1])
    if l1!=0:
        tc0+=1
f.close()
print(tc)
print(tc0)

5804715
405236


In [6]:
#f = open("TextTweet.txt")
f = open("tweets.parsed.sorted22n.txt")

In [7]:
# Create dictionary for Authors
Author = {}

In [8]:
# Assign each tweetid to its author and fill in the dictionary
for lign in f:
    lign = lign.split()
    tweetid = int(lign[0])
    userid = int(lign[2])
    Author[tweetid] = userid

In [9]:
#Author

In [10]:
f.close()

In [11]:
# Number of tweets with a known author (the keys of Author are tweetids)
print(len(Author))

27916100


In [12]:
# Some users are not included in the Authors
print(74400 in Author)

False


In [13]:
assert 74400 not in Author

We introduce a dictionary Q of dictionaries, one per userid. `Q[userid][auth_out]` accumulates the time that posts from `auth_out` stay on the Wall of `userid`. Since the Wall has size $K=1$, we actually count the period that a post of origin (auth_out) stays on the first position of the Wall of (userid), before being pushed out by a new post.
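
A minimal sketch of this bookkeeping for a single Wall with $K=1$, assuming arrivals are already sorted by timestamp. It deliberately ignores two refinements that the notebook applies later (posts sharing a timestamp, and the circularity correction); the function name `occupancy` and the toy arrivals are hypothetical.

```python
from collections import defaultdict


def occupancy(arrivals, end_time):
    """Accumulate, per origin author, the time spent at the top of one Wall.

    arrivals: list of (timestamp, author) pairs sorted by timestamp.
    end_time: end of the observation period.
    """
    q = defaultdict(float)
    # Each post stays on top until the next arrival pushes it out.
    for (t, author), (t_next, _) in zip(arrivals, arrivals[1:]):
        q[author] += t_next - t
    # The last post stays on top until the end of the trace.
    if arrivals:
        t_last, last_author = arrivals[-1]
        q[last_author] += end_time - t_last
    return dict(q)


print(occupancy([(0, "A"), (4, "B"), (10, "A")], end_time=12))
# prints {'A': 6.0, 'B': 6.0}
```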

In [14]:
Q = {}
FirstP = {}

**Important!**

- If the tweet is original, the user who posted it is the author.
- Otherwise it is a retweet. If the original post was observed and saved in "Author", we know the author.
- Otherwise the retweet has its origin outside the dataset. In this case every user who posts this retweet is considered the original author of their copy.
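
The three cases above can be sketched as a small helper. This mirrors the branch structure of the parsing loop that follows, but the function name `resolve_author` and the ids in the example are hypothetical.

```python
def resolve_author(tweetid, userid, rtid, author_of):
    """Return the origin author of a post, following the three cases above.

    author_of maps tweetid -> userid (the role played by `Author` in the notebook).
    """
    if rtid == -1:          # original tweet
        return userid
    if rtid in author_of:   # retweet whose origin was observed in the trace
        return author_of[rtid]
    # Retweet whose origin lies outside the dataset:
    # the retweeting user is treated as the author of this copy.
    author_of[tweetid] = userid
    return userid


author_of = {10: 1}  # tweet 10 was posted by user 1 (made-up ids)
print(resolve_author(10, 1, -1, author_of))  # original tweet by user 1
print(resolve_author(11, 2, 10, author_of))  # user 2 retweets tweet 10
print(resolve_author(12, 3, 99, author_of))  # origin 99 is not in the trace
```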

In [1]:
#f = open("TextTweet.txt")
Amphitrions = {}
f = open("tweets.parsed.sorted22n.txt")
nb_neg = 0
# debug
i=0
j=0
# Time0 marks the beginning of time in the trace
Time0 = None
timestamp_old = 0
count_ties = 0
for lign in f:
    #if j<6:
    #    j+=1
    #    print("original=",lign)
    lign = lign.split()
    tweetid = int(lign[0])
    tstamp = int(lign[1])
    userid = int(lign[2])
    rtid = int(lign[3])  
    if Time0 is None:
        Time0 = tstamp
    if tstamp == timestamp_old:
        count_ties+=1
    timestamp_old = tstamp
    if userid not in Amphitrions:
        Amphitrions[userid] = {}
    #if j<6:
    #    print("split=",tweetid, tstamp, userid, rtid)  
    # If the tweet is original, the user who posted it is the author.
    # Otherwise it is a retweet. If the original post was observed and saved in "Author",
    # we know the author.
    # Otherwise the retweet has its origin outside the dataset. In this case the
    # retweeting user is treated as the author of this copy (rtid is not added to Author).
    if rtid == -1:
        auth = userid
    elif rtid in Author:
        auth = Author[rtid] # it is the tweet origin ID
    else:
        assert rtid not in Author
        auth = userid
        Author[tweetid] = userid
    if userid not in Wall:
        Wall[userid] = []
        Q[userid] = {}
    if len(Wall[userid])==0:
        FirstP[userid] = tstamp
    Wall[userid].append((tweetid,tstamp,auth))
    if len(Wall[userid])>K:
        auth_out = Wall[userid][0][2]
        # Use time_wall to count the period that a post stays at position K=1, 
        # or find time_wall =0 to identify simultaneous posts with same timestamp
        time_wall = tstamp-Wall[userid][0][1]
        #if time_wall < 0:
        #    nb_neg += 1
        #    # debug
        #    if i<6:
        #        i+=1
        #        #print(userid, tstamp)
        #if userid not in Q:
        #    Q[userid] = {}
        if auth_out not in Q[userid]:
            Q[userid][auth_out] = 0
        # CORRECTION
        #Q[userid][auth_out] += time_wall
        #
        # CORRECTION
        # In our data-sets we found many tweets with the same time-stamp, so we need to account for this as well.
        # The issue arises when the same author posts or retweets instantly several tweets (e.g. a bot).
        # In this case we assume that the posts that simultaneously enter a Wall will split uniformly among them the
        # occupancy time till the next arrival of a post at a fresh time-stamp.
        if time_wall == 0:
            if auth_out not in Amphitrions[userid]:
                Amphitrions[userid][auth_out] = 0 
            Amphitrions[userid][auth_out] += 1
        else: 
            if auth_out not in Amphitrions[userid]:
                Amphitrions[userid][auth_out] = 0 
            Amphitrions[userid][auth_out] += 1
            Ndt = 0
            for author in Amphitrions[userid]:
                Ndt += Amphitrions[userid][author]
            for author in Amphitrions[userid]:
                Q[userid][author] += time_wall/Ndt*Amphitrions[userid][author]
            Amphitrions[userid] = {}
        #### END of CORRECTION
        #if userid == 3474:
        #    print(auth_out, time_wall)
        Wall[userid] = Wall[userid][1:]
        assert( len(Wall[userid]) == K )
#print(nb_neg)

FileNotFoundError: [Errno 2] No such file or directory: 'tweets.parsed.sorted22n.txt'

In [None]:
# The number of entries with clone time-stamp found in our trace
print(count_ties)
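
The tie-handling rule described in the comments of the parsing loop (posts that enter a Wall at the same time-stamp share the following occupancy period uniformly) can be illustrated in isolation. This is a simplified sketch, not the notebook's exact bookkeeping; the function name `split_interval` and the toy values are hypothetical.

```python
def split_interval(duration, tied_authors):
    """Split `duration` uniformly among posts that entered a Wall together.

    tied_authors lists the origin author of each tied post, so an author
    with several tied posts receives a proportionally larger share.
    """
    counts = {}
    for author in tied_authors:
        counts[author] = counts.get(author, 0) + 1
    n_posts = sum(counts.values())
    return {author: duration * c / n_posts for author, c in counts.items()}


# Three posts arrive at the same instant, two of them of origin "A".
print(split_interval(12.0, ["A", "B", "A"]))
# prints {'A': 8.0, 'B': 4.0}
```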

In [16]:
f.close()

In [18]:
print(Time0)

1398895201


In [19]:
27916100-4026458

23889642

In [20]:
print(len(Author))

27916100


In [21]:
#print(Wall,'\n')
#print('Q=',Q)
#print('FirstP=',FirstP)

**Note1:** The end point of the simulation (EndP) is the time of the last arrival.

In [22]:
EndP = tstamp
print(EndP)

1406843993


After parsing the trace, each Wall still holds a post at position $K=1$.
These posts also contribute to the influence, so we need the period during which each one occupies this position,
from the instant it entered the first Wall position until the end of the simulation (it has not
been ejected by any other post).

**Note2:** Post-processing: Treat extra users with Wall content less than $K$.

In [23]:
for u in Wall:
    if len(Wall[u])>0:
        auth_out = Wall[u][0][2]
        time_wall = EndP-Wall[u][0][1]
        if u not in Q:
            Q[u] = {}
        if auth_out not in Q[u]:
            Q[u][auth_out] = 0
        #Q[u][auth_out] += time_wall
        # CORRECTION
        if time_wall == 0:
            if auth_out not in Amphitrions[u]:
                Amphitrions[u][auth_out] = 0 
            Amphitrions[u][auth_out] += 1
        else: 
            if auth_out not in Amphitrions[u]:
                Amphitrions[u][auth_out] = 0 
            Amphitrions[u][auth_out] += 1
            Ndt = 0
            for author in Amphitrions[u]:
                Ndt += Amphitrions[u][author]
            for author in Amphitrions[u]:
                # Q[u][author] += time_wall/Ndt*Amphitrions[u][author]
                ## CORRECTION 2: circularity
                Q[u][author] += (time_wall+FirstP[u]-Time0)/Ndt*Amphitrions[u][author]
            Amphitrions[u] = {}
        #### END of CORRECTION
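
One way to read the `(time_wall + FirstP[u] - Time0)` term in the cell above: the trace is treated as circular, so the post still on the Wall at `EndP` is also credited with the interval between `Time0` and the user's first arrival, when the Wall was still empty. Under that reading, the occupancy times of each Wall sum to `EndP - Time0`. The toy check below uses made-up timestamps, ignores ties, and the function name is hypothetical.

```python
def wall_occupancy_circular(arrivals, time0, end_time):
    """Occupancy per origin author for one Wall (K=1) with wrap-around.

    arrivals: non-empty list of (timestamp, author), sorted by timestamp.
    """
    q = {}
    for (t, author), (t_next, _) in zip(arrivals, arrivals[1:]):
        q[author] = q.get(author, 0) + (t_next - t)
    first_t = arrivals[0][0]
    last_t, last_author = arrivals[-1]
    # The last post gets the tail of the trace plus the initial empty period.
    q[last_author] = q.get(last_author, 0) + (end_time - last_t) + (first_t - time0)
    return q


q = wall_occupancy_circular([(20, "A"), (50, "B")], time0=0, end_time=100)
print(q, sum(q.values()))
```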

In [24]:
#print(Q)

An example of what has been saved for userid=3474. Posts from 7 origin users entered this Wall;
for each origin, the dictionary gives the cumulative period that their posts occupied position 1 of the Wall of user 3474.

In [25]:
Q[3474]

{2083: 6.5,
 327: 6.5,
 408: 114161.0,
 3474: 926459.0,
 2660: 666122.0,
 16559: 5431010.0,
 23742: 811027.0}

**Estimated Qs**

**EstQ[user][leader]:** the proportion of time that posts of "leader" are at the first position of the Wall of "user". Hence $\sum_{leader} EstQ[user][leader] = 1$.

To find these proportions we just need to divide the occupancy time by the total period (EndP-Time0).
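
As a quick sanity check of this normalisation, the values printed earlier for `Q[3474]`, `Time0` and `EndP` can be plugged in by hand. The snippet only re-uses numbers shown in this notebook; the variable names `Q_3474`, `total` and `est` are introduced here for illustration.

```python
import math

Time0 = 1398895201  # start of the trace, as printed above
EndP = 1406843993   # time of the last arrival, as printed above

# Occupancy times on the Wall of user 3474, copied from the output above.
Q_3474 = {2083: 6.5, 327: 6.5, 408: 114161.0, 3474: 926459.0,
          2660: 666122.0, 16559: 5431010.0, 23742: 811027.0}

total = EndP - Time0
est = {author: t / total for author, t in Q_3474.items()}

print(sum(Q_3474.values()) == total)         # occupancy times cover the whole period
print(math.isclose(sum(est.values()), 1.0))  # proportions sum to 1
```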

In [27]:
EstQ = {}
nb0 = 0
for u in Q:
    EstQ[u] = {}
    for j in Q[u]:
        if EndP-FirstP[u] == 0:
            nb0 += 1
            assert len(Q[u]) == 1
            EstQ.pop(u)
        else:
            #EstQ[u][j] = Q[u][j]/(EndP-FirstP[u])
            # CORRECTION 2: circularity
            EstQ[u][j] = Q[u][j]/(EndP-Time0)
#print(EstQ)
print(nb0)

0


In [28]:
N = len(EstQ)
print(N)

6020228


Here is a test. For user 2500 there are posts from 2 users on his Wall: 33.85% of the time these were self-posts (origin user 2500) and 66.15% of the time they were reposts of origin user 1548. The two proportions sum to 1.

In [29]:
testU =2500
print(EstQ[testU])

{1548: 0.6614825246402221, 2500: 0.33851747535977794}


In [30]:
Psi = 0
for user in EstQ[testU]:
    Psi+=EstQ[testU][user]
print(Psi)

1.0


In [2]:
# Export as .py
#jupyter nbconvert --to script OSPemul.ipynb
#through command line convert to .py

**Calculate Influence of user on his followers**

We have found, for each user's Wall, the percentage of time that posts of each origin occupy its first position. To calculate user influence, we invert this information: for each origin user, we collect the proportions of time that his posts occupy the first position of the Walls of all other users.

In [32]:
Influence = {}
for follower in EstQ:
    for leader in EstQ[follower]:
        if leader not in Influence:
            Influence[leader] = {}
        Influence[leader][follower] = EstQ[follower][leader]

For example, the Influence of user 158410 is a dictionary. This user influences 5 users (himself included); for instance, his posts have occupied 10.5% of the total time on the Wall of user 17700.

In [33]:
print(Influence[158410])

{17700: 0.10512050132900698, 158410: 0.9339905988230665, 172286: 0.26822465602320456, 444382: 5.032211183787424e-07, 1876907: 0.10557893576785}


Similarly, user 5 influences 4 users (himself included). His posts occupy 100% of the time on his own Wall, which means that this user does not repost. They also occupy 100% of the time on the Wall of user 1816833, which means that this user only reposts posts of origin 5.

In [35]:
print(Influence[5])

{5: 1.0, 10728: 0.6129629508483805, 57295: 0.7196136217930976, 1816833: 1.0}


The Psi-score is calculated directly from the trace, by summing the influence of a user on all other users and dividing by (N-1).

In [37]:
Psi = {}
for user in Influence:
    Psi[user] = 0
    for follower in Influence[user]:
        if follower != user:
            Psi[user]+=Influence[user][follower]
    Psi[user] = Psi[user]/(N-1)

Hence, the Psi-score of user 158410 on the platform is the following:

In [38]:
Psi[158410]

7.955258104738906e-08
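
This value can be re-derived by hand from the numbers already printed: the entries of `Influence[158410]` (excluding the user's own Wall) and the number of users `N`. The snippet below only copies those outputs; the variable names are introduced here for illustration.

```python
import math

N = 6020228  # number of users in EstQ, as printed above

# Influence[158410], copied from the output above.
influence_158410 = {17700: 0.10512050132900698, 158410: 0.9339905988230665,
                    172286: 0.26822465602320456, 444382: 5.032211183787424e-07,
                    1876907: 0.10557893576785}

user = 158410
# Sum the influence on all other users, then average over the N-1 other users.
psi = sum(v for follower, v in influence_158410.items() if follower != user) / (N - 1)

print(psi)
print(math.isclose(psi, 7.955258104738906e-08, rel_tol=1e-6))
```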

In [40]:
Psi[1108115]

8.305334665951965e-08

In [41]:
Psi[21586] #4138, 15698, 17896, 19678, 20116, 20198, 20374, 20376, 21586

5.725805124780818e-12

In [42]:
Psi[266]

7.555219818877829e-06

In [43]:
Psi[11604]

0.018266619353711842

Write the results to an external file.

In [36]:
with open('PsiCorrCirc20190208', 'w') as f:
    for u in Psi:
        f.write("%d %g\n" % (u, Psi[u]))
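
A minimal sketch of how such a file could be read back, assuming one `userid value` pair per line as written above. Note that `%g` keeps six significant digits by default, so the reloaded scores are rounded. The helper name `load_psi` and the temporary file name are hypothetical; the two sample values are Psi-scores printed earlier.

```python
import os
import tempfile


def load_psi(path):
    """Read a file of 'userid psi' lines back into a dictionary."""
    psi = {}
    with open(path) as f:
        for line in f:
            userid, value = line.split()
            psi[int(userid)] = float(value)
    return psi


# Round-trip check on a temporary file.
original = {158410: 7.955258104738906e-08, 11604: 0.018266619353711842}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "psi_demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        for u, value in original.items():
            f.write("%d %g\n" % (u, value))
    reloaded = load_psi(path)

print(reloaded)
```

If full precision matters downstream, a format such as `%.17g` would preserve the floats exactly, at the cost of longer lines.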

In [28]:
#import gzip
#f = open('Psi','w')
#for u in Psi:
#    f.write("%d %g\n"%(u, Psi[u]))
#f.close()


## **Part II: Extracting Info from data to feed the analysis**

Here we use the trace to extract the information needed by the model and the analytical formula, rather than by the emulator: the rate of tweets and retweets per user, and the leader graph. Ntweet and Nrtweet count the number of tweets and retweets of each user during the trace. Both are dictionaries keyed by userid.

For the LeadGraph, which is also a dictionary, we record for each user all his leaders. We use the star-graph approach: a user x is assumed to follow another user y if he has retweeted a post of origin y. This approach gives rise to a star graph, in which a user has several leaders and influence is direct, without paths of several hops. We take this approach because the trace gives only the post origin, not the user from whom a post was retweeted. It would not be needed if we knew the real friendship graph.
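
A toy illustration of the star-graph rule, assuming a mapping from tweetid to origin author is already available. All ids and names below are made up.

```python
# Origin author of each observed tweet (tweetid -> userid); made-up values.
author_of = {1: "y", 2: "z"}

# (userid, retweetid) pairs seen in a hypothetical trace.
retweets = [("x", 1), ("x", 2), ("w", 1), ("x", 99)]

lead_graph = {}
for user, rtid in retweets:
    leaders = lead_graph.setdefault(user, set())
    if rtid in author_of:
        # user retweeted a post of origin author_of[rtid]: add a leader edge
        leaders.add(author_of[rtid])
    # retweets of unknown origin (here 99) add no edge

print(lead_graph)
```

User `x` ends up with leaders `y` and `z` even if the retweet of `z`'s post actually reached `x` through some intermediate user; that intermediate hop is exactly the information the trace does not contain.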

In [37]:
Ntweet = {}
Nrtweet = {}
LeadGraph = {}
FirstT = None
LastT = None
el=0
el2=0
f = open("tweets.parsed.sorted22n.txt")
for lign in f:
    lign = lign.split()
    tstamp = int(lign[1])
    userid = int(lign[2])
    rtid = int(lign[3])
    #if el<21:
    #    print(userid, rtid)
    #    el+=1
    if FirstT is None:
        FirstT = tstamp
    if userid not in Ntweet:
        Ntweet[userid] = 0
        Nrtweet[userid] = 0
        LeadGraph[userid] = set() 

    # A retweetid of -1 indicates a self-post
    if rtid == -1:
        Ntweet[userid] += 1
    else: 
        if rtid in Author:
            LeadGraph[userid].add(Author[rtid])
            Nrtweet[userid] += 1
        else:
            Ntweet[userid] += 1
LastT = tstamp
f.close()  

In [38]:
# The Rtweet and Rrtweet are dictionaries which contain the tweet and retweet rates. One simply needs to divide the 
# count from Ntweet and Nrtweet by the total trace period (LastT-FirstT)
Rtweet = {}
Rrtweet = {}
for user in Ntweet:
    Rtweet[user] = Ntweet[user]/(LastT-FirstT)
    Rrtweet[user] = Nrtweet[user]/(LastT-FirstT)

In [51]:
Ntweet[11604]

1142

In [31]:
# An example of the result for the first 11 users (the loop stops after k reaches 10).
k = 0
for user in Rtweet:
    print(user, 'T=', Rtweet[user],'R=', Rrtweet[user])
    print(LeadGraph[user],'\n')
    if k==10:
        break
    else: 
        k+=1

3039 T= 3.887383139475785e-05 R= 0.0
set() 

790 T= 6.54187453892365e-06 R= 0.0
set() 

4806 T= 8.30314845324925e-06 R= 0.0
set() 

175 T= 8.164762645695094e-05 R= 0.0
set() 

6 T= 0.0004423313630549145 R= 0.0
set() 

831 T= 0.00012253434232522377 R= 0.0
set() 

25 T= 0.0004180509440931402 R= 0.0
set() 

374 T= 5.459949134409354e-05 R= 0.0
set() 

2646 T= 1.1322475163521703e-05 R= 0.0
set() 

250 T= 9.699587056750258e-05 R= 0.0
set() 

194 T= 0.0010933736849574123 R= 0.0
set() 



In [32]:
len(Rtweet)

6020228

In [36]:
Rtweet[17700]

2.1386897531096547e-06

In [36]:
len([0 for u in Rtweet if Rtweet[u]==0])

3416701
