# DA3. Some considerations on privacy

The data we are using was collected through the Twitter API and contains only public tweets. That does not mean we can forgo privacy considerations here (among other normative and ethical implications that we should of course also consider carefully). We will talk about this more in the BI class; for now, we will look at the technical solutions that we can implement in our datasets.


## A "dilemma"

* Open Science and replicability
* Privacy protection

## Some considerations

* Data minimization
* Data anonymization
* Data pseudonymization
* Profiling

Let's start with importing data. In this tutorial we will work with Twitter data obtained via the API and twarc. We will start by reading the data, fixing the format, and merging information on tweets and the users who tweeted.

In [1]:
import pandas as pd

In [5]:
df_jsonl = pd.read_json('../DA-StudentFiles/LocalFiles/results_privacy.jsonl', lines=True)

If you remember from last week, the file structure requires more work. We use the functions introduced last week to create two dataframes - one with all tweets and one with all authors of them.

In [30]:
def get_public_metrics(row):
    # Copy each entry of the nested 'public_metrics' dict into its own 'metric_<name>' column
    if 'public_metrics' in row.keys():
        if isinstance(row['public_metrics'], dict):
            for key, value in row['public_metrics'].items():
                row['metric_' + str(key)] = value
    return row

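def get_tweets(df):
    if 'data' not in df.columns:
        return None
    # Each entry of df['data'] is a list of tweets; turn each list into a dataframe
    frames = [pd.DataFrame(item) for item in df['data'].values.tolist()]
    # Concatenate once instead of growing a dataframe inside a loop
    results = pd.concat(frames)

    results = results.apply(get_public_metrics, axis=1)

    # Replace the repeated per-batch index with a clean 0..n-1 index
    results = results.reset_index(drop=True)

    return results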

In [31]:
#This dataframe contains all tweets - one tweet per row
tweets = get_tweets(df_jsonl) 

In [8]:
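def get_users(df):
    if 'includes' not in df.columns:
        return None
    # Each entry of df['includes'] is a dict whose 'users' key holds a list of users
    frames = [pd.DataFrame(item['users']) for item in df['includes'].values.tolist()]
    # Concatenate once instead of growing a dataframe inside a loop
    results = pd.concat(frames)

    results = results.apply(get_public_metrics, axis=1)

    # Replace the repeated per-batch index with a clean 0..n-1 index
    results = results.reset_index(drop=True)

    return results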

In [9]:
#This dataframe contains all users - one user per row
users = get_users(df_jsonl)

Now we have two dataframes. Imagine we would like to analyze tweets in relation to their authors (or at least control for who the author was). To do so, we need to merge the two dataframes. Let's quickly explore them to decide what the merge should look like.

In [32]:
len(tweets), len(users)

(1298, 2194)

In [33]:
tweets.columns

Index(['source', 'id', 'author_id', 'possibly_sensitive', 'reply_settings',
       'created_at', 'conversation_id', 'public_metrics', 'referenced_tweets',
       'entities', 'lang', 'text', 'context_annotations',
       'in_reply_to_user_id', 'attachments', 'geo', 'metric_retweet_count',
       'metric_reply_count', 'metric_like_count', 'metric_quote_count'],
      dtype='object')

In [34]:
users.columns

Index(['pinned_tweet_id', 'description', 'created_at', 'profile_image_url',
       'name', 'username', 'id', 'verified', 'public_metrics', 'url',
       'protected', 'entities', 'location', 'metric_followers_count',
       'metric_following_count', 'metric_tweet_count', 'metric_listed_count'],
      dtype='object')

In [46]:
tweets['author_id'].value_counts()

1649630060             46
1370572561801699336    33
1488813217899954182    17
1458925850766848001    10
480201182               9
                       ..
602778335               1
1496058886704431112     1
398498909               1
1313494222717227021     1
56672287                1
Name: author_id, Length: 1046, dtype: int64

In [38]:
users['id'].value_counts()

1462866268810190848    13
1300511436872003585    13
1147078529856577536    12
466200116              11
1649630060             10
                       ..
1409958014929485834     1
1454837700163375109     1
1225058359289819137     1
1473568762292948993     1
1294109558093291520     1
Name: id, Length: 1711, dtype: int64

We see that there are many more users than tweets. This makes sense - the users are not only the authors of the tweets, but also any users mentioned in the tweets. Looking further at these users, we see that the user id is not unique in either dataframe. This also makes sense: in the tweets dataframe, the same user can be the author of multiple tweets; in the users dataframe, the same user can be included multiple times because they are the author of multiple tweets or are mentioned in multiple tweets. As we know from merging, the ids in the dataframe we merge from must be unique, otherwise the merge multiplies rows. For tweets, we want to keep all the tweets (also when the same user is the author of multiple tweets). But for users, we want a dataframe with unique users so that we can add information on users to the tweets dataframe. We can achieve this with `drop_duplicates`.

In [47]:
#users_unique has unique users only
users_unique = users.drop_duplicates(subset=['id'])

In [48]:
users_unique['id'].value_counts()

1495746927794835457    1
776778236512571392     1
1120948858857504768    1
204002789              1
1445127702260830219    1
                      ..
1187806180908777473    1
1349954650745798657    1
354996294              1
1190955239114969088    1
1294109558093291520    1
Name: id, Length: 1711, dtype: int64

In [49]:
len(tweets), len(users_unique)

(1298, 1711)
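Instead of reading through the `value_counts` output, uniqueness can also be checked directly. The following is a minimal sketch on made-up toy data (the names `users_toy` and `users_toy_unique` are hypothetical and not part of the tutorial files); it only illustrates the check, not the actual dataset.

```python
import pandas as pd

# Toy stand-in for the `users` dataframe; all values are made up
users_toy = pd.DataFrame({
    'id': ['11', '22', '11', '33'],
    'username': ['alice', 'bob', 'alice', 'carol'],
})

# True only if no id occurs more than once
print(users_toy['id'].is_unique)

# Keep the first row for each id, as in the tutorial
users_toy_unique = users_toy.drop_duplicates(subset=['id'])
print(users_toy_unique['id'].is_unique)
print(len(users_toy), len(users_toy_unique))
```

The same check could be run on `users['id']` and `users_unique['id']` before merging.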

We see that there are still many more users than tweets. This makes sense - even when unique, the users are not only the authors of the tweets, but also any users mentioned in the tweets. This means that if we simply want to add information on authors to the tweets dataframe, we do not need all the unique users. This suggests a left merge on tweets (adding information from unique users (right dataframe) to tweets (left dataframe)). There is also an id that we can use - `author_id` in tweets and `id` in users (this is something we know from the documentation of the API). We could rename the columns, but we can also specify two different column names in the `.merge` function. Let's merge the two dataframes.

In [50]:
df = tweets.merge(users_unique, how='left', left_on='author_id', right_on='id', suffixes=('_tweets', '_users'))

In [51]:
df.columns

Index(['source', 'id_tweets', 'author_id', 'possibly_sensitive',
       'reply_settings', 'created_at_tweets', 'conversation_id',
       'public_metrics_tweets', 'referenced_tweets', 'entities_tweets', 'lang',
       'text', 'context_annotations', 'in_reply_to_user_id', 'attachments',
       'geo', 'metric_retweet_count', 'metric_reply_count',
       'metric_like_count', 'metric_quote_count', 'pinned_tweet_id',
       'description', 'created_at_users', 'profile_image_url', 'name',
       'username', 'id_users', 'verified', 'public_metrics_users', 'url',
       'protected', 'entities_users', 'location', 'metric_followers_count',
       'metric_following_count', 'metric_tweet_count', 'metric_listed_count'],
      dtype='object')

In [52]:
len(df)

1298
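The merge kept all 1298 tweets, which is what a left merge on unique users should do. As an optional safeguard, pandas can check these expectations during the merge itself. The sketch below uses made-up toy dataframes (`tweets_toy` and `users_toy` are hypothetical names) and the `validate` and `indicator` parameters of `.merge`.

```python
import pandas as pd

# Toy stand-ins for the tweets and unique-users dataframes; all values are made up
tweets_toy = pd.DataFrame({
    'id': ['t1', 't2', 't3'],
    'author_id': ['11', '11', '22'],
    'text': ['first', 'second', 'third'],
})
users_toy = pd.DataFrame({
    'id': ['11', '22', '33'],
    'username': ['alice', 'bob', 'carol'],
})

merged = tweets_toy.merge(
    users_toy,
    how='left',
    left_on='author_id',
    right_on='id',
    suffixes=('_tweets', '_users'),
    validate='many_to_one',  # raises an error if the ids in users_toy are not unique
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Same number of rows as the tweets dataframe, and every tweet found its author
print(len(merged))
print(merged['_merge'].value_counts())
```

Rows marked `left_only` in `_merge` would be tweets for which no author information was found.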

In [7]:
df.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,metadata,source,in_reply_to_status_id,...,quoted_status_id_str,retweet_count,favorite_count,favorited,retweeted,lang,quoted_status,possibly_sensitive,extended_entities,withheld_in_countries
0,2021-09-13 04:11:25+00:00,1437267650225131526,1437267650225131520,RT @PeterSweden7: The Great Reset:\n\n- Eat bu...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",,...,,726,0,False,False,en,,,,
1,2021-09-14 08:00:16+00:00,1437687628460175362,1437687628460175360,RT @nicoleperlroth: BIG NEWS: Do you own an Ap...,False,"[0, 140]","{'hashtags': [{'text': 'Pegasus', 'indices': [...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",,...,,4114,0,False,False,en,,,,
2,2021-09-15 10:12:00+00:00,1438083169517977601,1438083169517977600,#Apple Issues Emergency #Security Updates to C...,False,"[0, 280]","{'hashtags': [{'text': 'Apple', 'indices': [0,...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://sproutsocial.com"" rel=""nofoll...",,...,,1,5,False,False,en,,0.0,,
3,2021-09-11 15:59:12+00:00,1436720993704230919,1436720993704230912,RT @grandson: The full scale of the global imp...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",,...,1.436715e+18,49,0,False,False,en,,,,
4,2021-09-14 07:12:02+00:00,1437675491226390529,1437675491226390528,An unusual spike in mink deaths at two farms i...,False,"[0, 155]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",1.437675e+18,...,,0,0,False,False,en,,,,


## Data minimization

* What is my RQ?
* Which variables are required to answer my RQ?

Whatever else is not needed - especially if it contains personally identifiable data - I should delete.

Make sure to keep all variables relevant for your RQ. For example, if your research question concerns the relation between the text of a tweet and the engagement it generates, you may want to keep additional information about the tweet and its author (such as the author's follower count) as control variables. For such an RQ, you could select the following columns:

In [82]:
df.columns

Index(['source', 'id_tweets', 'author_id', 'possibly_sensitive',
       'reply_settings', 'created_at_tweets', 'conversation_id',
       'public_metrics_tweets', 'referenced_tweets', 'entities_tweets', 'lang',
       'text', 'context_annotations', 'in_reply_to_user_id', 'attachments',
       'geo', 'metric_retweet_count', 'metric_reply_count',
       'metric_like_count', 'metric_quote_count', 'pinned_tweet_id',
       'description', 'created_at_users', 'profile_image_url', 'name',
       'username', 'id_users', 'verified', 'public_metrics_users', 'url',
       'protected', 'entities_users', 'location', 'metric_followers_count',
       'metric_following_count', 'metric_tweet_count', 'metric_listed_count'],
      dtype='object')

In [83]:
df_min = df[['id_tweets', 'created_at_tweets', 'text', 'metric_retweet_count', 'metric_reply_count', 'metric_like_count',
       'username', 'verified', 'description', 'metric_followers_count', 'metric_following_count', 'metric_tweet_count', 'metric_listed_count']]

Most important is that all your data has a clear purpose - why are you using this data?

## Data pseudonymization

For our analysis we often need to know who the sender of the tweet was. That said, for almost all research we do not need to know who the users are (in principle). We have multiple columns that give us information about the user, including their screen name and Twitter ID. Both columns contain personal information - we can identify the user either by their ID or by their screen name. Hence, we will pseudonymize the authors of the tweets (keeping the information on whether one person sent more than one tweet).

**Note:** there are more elegant ways to pseudonymize the data (e.g., encryption), but here I am using some alternatives that also get the job done.

### Pseudonymizing the user column

In the dataset, we have columns that give us the screen name and id of the user. We can pick one of them and pseudonymize it.



First, let's make a dataframe with only unique users in our dataset.

In [57]:
users = df[['username']].drop_duplicates()

In [58]:
users.head()

         username
0   Thirolagrossa
1       hitechlaw
2    cleopatrabbg
3  CCConsultingSL
4       spdeogarh


Next, we want to be able to replace the names with a pseudonym. We can use a simple trick: after `drop_duplicates`, each username already has its own unique index value, so we can use the index to create a unique id. Then each username gets a number - a unique pseudonym.

In [59]:
users = users.reset_index()

In [60]:
users.head()

   index        username
0      0   Thirolagrossa
1      1       hitechlaw
2      2    cleopatrabbg
3      3  CCConsultingSL
4      4       spdeogarh


In [61]:
users = users.rename(columns={'index': 'pseudID'})

In [62]:
users.head(10)

   pseudID        username
0        0   Thirolagrossa
1        1       hitechlaw
2        2    cleopatrabbg
3        3  CCConsultingSL
4        4       spdeogarh
5        5    CapitalFloat
6        6     zl_submisso
7        7      Nikan_iran
8        8     Swiffer_IDN
9        9      song_title
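One property of the index trick is that each pseudonym is the row number of the user's first tweet in `df`, so the numbers are unique but not consecutive. If consecutive numbers are preferred, `pd.factorize` is one possible alternative. The sketch below runs on a made-up toy dataframe (`df_toy` is a hypothetical name) and is not the approach used in the rest of this tutorial.

```python
import pandas as pd

# Toy stand-in for the tweets dataframe; usernames are made up
df_toy = pd.DataFrame({'username': ['alice', 'bob', 'alice', 'carol', 'bob']})

# factorize assigns 0, 1, 2, ... to usernames in order of first appearance
codes, uniques = pd.factorize(df_toy['username'])
df_toy['pseudID'] = codes

print(df_toy)
```

Here the pseudonym is added to the dataframe directly, so no separate lookup table or merge is needed.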


Next, we want to replace the usernames in our tweets dataframe with the pseudonym. This means that for every tweet in our dataframe, we want to add a column with the pseudonym of the user. This can be done using merge - we want to add the information from the `users` dataframe to the `df_min` dataframe (which has the info on tweets). We have a common id in both dataframes - `username`. We can use it to add the `pseudID` column to `df_min`. This means we need to merge left (with `df_min` as the left dataframe) on `username`.

In [84]:
df_min = df_min.merge(users, how='left', on='username')

In [80]:
df['username'].value_counts()

Trustwalet_sup7    46
toxic_llolaa       33
Mgh112288          17
BollaSilva         10
Alphaboy133         9
                   ..
BrittneyBraylen     1
pijatenak27         1
__ansley            1
Difyvirginiaa       1
TheSuperKim         1
Name: username, Length: 1046, dtype: int64

In [81]:
df_min['pseudID'].value_counts()

206     46
262     33
335     17
366     10
18       9
        ..
431      1
436      1
438      1
439      1
1297     1
Name: pseudID, Length: 1046, dtype: int64

Let's not forget to delete the column that contains user names!

In [86]:
del df_min['username']
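The note at the start of this section mentions that there are more elegant ways to pseudonymize. One commonly used option is a keyed hash, sketched below with Python's standard `hmac` and `hashlib` modules. Note that this is hashing rather than encryption, and that the key `SECRET_KEY`, the helper `pseudonymize`, and the toy dataframe `df_toy` are all hypothetical names introduced for this illustration.

```python
import hashlib
import hmac

import pandas as pd

# Hypothetical secret key; it should be kept out of the notebook and out of any shared data
SECRET_KEY = b'replace-with-a-long-random-secret'


def pseudonymize(value, key=SECRET_KEY):
    """Return the HMAC-SHA256 digest (hex string) of a value."""
    return hmac.new(key, str(value).encode('utf-8'), hashlib.sha256).hexdigest()


# Toy stand-in for the tweets dataframe; usernames are made up
df_toy = pd.DataFrame({'username': ['alice', 'bob', 'alice']})

# The same username always gets the same pseudonym
df_toy['pseudID'] = df_toy['username'].apply(pseudonymize)
df_toy = df_toy.drop(columns=['username'])

print(df_toy)
```

With this approach no lookup table of usernames and pseudonyms has to be stored, but anyone who holds the key can recompute the pseudonym for a known username.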

### Anonymizing the text column

In the column with texts, we also have user names that we do not want to keep in our dataset. We know that when a user name is mentioned, it is preceded by '@'. One possibility is to replace all words that start with '@'. For example, the tweet


*'@NSWHealth Have you done any work to estimate the % of the vaccinated who do not present to testing as a consequence of complacency.\nBecause if you did this graph with surveillance-tested HCWs and ACWs the difference in the case numbers would be evident.'*

could be changed into

*'@mention Have you done any work to estimate the % of the vaccinated who do not present to testing as a consequence of complacency.\nBecause if you did this graph with surveillance-tested HCWs and ACWs the difference in the case numbers would be evident.'*

This way, we know that there was a mention in a tweet, but we do not have the user name in our dataset.


**Note:** here too, there are more elegant ways to remove the user names (e.g., using parsers to identify mentions), but I am using some alternatives that also get the job done.

To get this job done, I will use regular expressions to look for all words that start with '@'. We will not have time in the course to go into detail about regular expressions, but they are very useful and commonly used. If you look closely, you will see them in some of the UsefulFunctions that we have shared with you on GitHub :)

In [87]:
# Raw string (r'...') so that Python does not treat \S as a string escape
df_min_pseudo = df_min.replace(to_replace=r'@\S+', value='@mention', regex=True)

In [88]:
df_min_pseudo['text']

0       RT @mention Gravei meu primeiro conteudo +18 c...
1       Conservazione sostitutiva e messa a norma in a...
2       RT @mention More prayer. More self-care. More ...
3       RT @mention Google is rethinking its privacy p...
4       RT @mention OTP is not only stand for the One ...
                              ...                        
1293    RT @mention Fall in love with privacy, boundar...
1294    The bill contains dangerous provisions such as...
1295    RT @mention Vou mostrar os dias que chego do t...
1296    RT @mention She just wants some privacy 🥺 A ti...
1297    @mention Settigns and Privacy &gt; Security Ac...
Name: text, Length: 1298, dtype: object
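Two details of the replacement above are worth knowing: `@\S+` also swallows punctuation attached to a mention (the colon in `RT @user:` disappears), and `DataFrame.replace` is applied to every text column, not only `text`. The sketch below shows a narrower variant on made-up toy data (`df_toy` and `MENTION` are hypothetical names). It assumes that handles consist of letters, digits and underscores and are at most 15 characters long.

```python
import re

import pandas as pd

# Assumption: a handle is '@' followed by 1-15 letters, digits or underscores
MENTION = re.compile(r'@\w{1,15}')

# Toy stand-in for the tweets dataframe; texts are made up
df_toy = pd.DataFrame({
    'text': ['RT @NSWHealth: thank you!', 'no mention here', 'cc @a_b, @c1'],
})

# Replace mentions in the text column only; surrounding punctuation is kept
df_toy['text'] = df_toy['text'].str.replace(MENTION, '@mention', regex=True)

print(df_toy['text'].tolist())
```

Whether other columns (such as `description`) should get the same treatment is a separate decision that depends on what is kept after data minimization.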

# Questions you cannot solve with code

* Can we share the data openly, even if minimized? In some cases, only tweet IDs are shared.
* Can we anonymize public data if we still share the text (or some shape of it)?
* Are we doing user profiling?

