# PyConversations: A Twitter-based Example

The following is a tutorial notebook that demonstrates how to use `pyconversations` with Twitter data.

In order to directly follow and run this tutorial notebook, you will need a valid set of Twitter API credentials.

First, we'll obtain some data and then we'll show how to load that directly into PyConversations!

In [1]:
import requests, time, json, sys, os

from datetime import datetime as dt

from pyconversations.convo import Conversation
from pyconversations.message import Tweet

In [2]:
# Place your App's Bearer token here:
token = ''

This first portion of the downloading code will ping an author and attempt to identify their most recent Tweet ID.

In [3]:
def download_recent(author, bearer_token, fields='&tweet.fields=id'):
    """
    Given an author and a token, attempts to return the ID of the Tweet
    that this user most recently posted
    
    Parameters
    ----------
    author
        A Twitter username
    bearer_token 
        The API credentials
    fields
        Other associated fields for the request
    
    Returns
    -------
    The most recent Tweet ID
    """
    # set the authentication header
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    
    # access the tweet's data
    tweet_url = "https://api.twitter.com/2/tweets/search/recent?query=from:" + author + fields
    tweet_response = requests.request("GET", tweet_url, headers=headers)
    
    return tweet_response.json()['data'][0]['id']

In [4]:
most_recent_tid = download_recent('YouTube', token)
most_recent_tid

'1427329311556648961'

Next, we'll grab a batch of Tweets associated with this conversation.
We could read the entire Conversation by iteratively querying for more Tweets; here, however, we'll just take the first batch that the API returns for the Conversation.

In [5]:
# all the fields we'll want to extract from the conversation
all_fields = ("&tweet.fields=author_id,conversation_id,created_at,in_reply_to_user_id,referenced_tweets" +
              ",geo,lang,source,reply_settings,id,public_metrics" +
              "&expansions=author_id,in_reply_to_user_id" +
              "&user.fields=name,username")

In [6]:
def download_thread(tid, bearer_token, fields=all_fields, max_results='50'):
    """
    Given an arbitrary Tweet ID `tid` and valid API credential `bearer_token`,
    this function downloads the first batch of the Twitter thread associated
    with the source Tweet `tid`.

    Parameters
    ----------
    tid
        A Tweet ID
    bearer_token
        The API credentials
    fields
        Other associated fields for the request
    max_results
        The maximum number of Tweets to request in the batch, as a string

    Returns
    -------
    A list holding the source Tweet's response followed by the first batch
    of the conversation
    """
    # set the authentication header
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    
    # access the tweet's data
    tweet_url = "https://api.twitter.com/2/tweets?ids=" + tid + fields
    tweet_response = requests.request("GET", tweet_url, headers=headers)
    rate_limit_headers = {ky: tweet_response.headers[ky]
                          for ky in ['x-rate-limit-reset', 'x-rate-limit-limit','x-rate-limit-remaining']}
    data = [tweet_response.json()]
    
    conversation_url = ('https://api.twitter.com/2/tweets/search/recent?query=' +
                        'conversation_id:' + data[0]['data'][0]['conversation_id'] +
                        '&since_id='+str(int(data[0]['data'][0]['conversation_id']) - 1) +
                        '&max_results=' + max_results + fields)
    conversation_response = requests.request("GET", conversation_url, headers=headers)
    rate_limit_headers = {ky: conversation_response.headers[ky]
                          for ky in ['x-rate-limit-reset', 'x-rate-limit-limit','x-rate-limit-remaining']}
    data.append(conversation_response.json())
        
    # only the first batch is collected here; if more results exist,
    # data[-1]['meta'] will contain a 'next_token' for requesting the next page
    return data
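
The function above stops after the first page of results. As a minimal sketch of how the remaining pages could be collected, the loop below follows `meta.next_token` until the API stops returning one. `iter_pages` and `fetch_page` are hypothetical names introduced for illustration; `fetch_page` stands in for whatever performs the actual request, so the pagination logic can be read (and tested) without credentials.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], Dict],
               max_pages: int = 10) -> Iterator[Dict]:
    """Yield response pages, following `meta.next_token` until it runs out.

    `fetch_page` is a hypothetical callable: given a pagination token
    (or None for the first page), it returns one decoded JSON response.
    """
    token = None
    for _ in range(max_pages):
        page = fetch_page(token)
        yield page
        # a missing next_token means there are no further pages
        token = page.get('meta', {}).get('next_token')
        if token is None:
            break


# stand-in for the API: two canned pages keyed by the token that requests them
fake_pages = {
    None: {'data': [{'id': '1'}, {'id': '2'}], 'meta': {'next_token': 'abc'}},
    'abc': {'data': [{'id': '3'}], 'meta': {}},
}
pages = list(iter_pages(lambda tok: fake_pages[tok]))
print(len(pages), sum(len(p['data']) for p in pages))
```

With a live token, `fetch_page` would presumably issue the same `conversation_url` request as above, appending `&next_token=` plus the token when one is given. The `max_pages` cap is an assumption added here to keep the loop within the API rate limits.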

In [7]:
tweets = download_thread(most_recent_tid, token)

In [8]:
len(tweets)

2

In [9]:
# total tweets:
sum([len(x['data']) for x in tweets])

51

In [10]:
tweets[0]

{'data': [{'conversation_id': '1427329311556648961',
   'text': 'RT if you always leave a comment',
   'id': '1427329311556648961',
   'source': 'Sprinklr',
   'author_id': '10228272',
   'lang': 'en',
   'reply_settings': 'everyone',
   'public_metrics': {'retweet_count': 223,
    'reply_count': 152,
    'like_count': 874,
    'quote_count': 17},
   'created_at': '2021-08-16T18:00:01.000Z'}],
 'includes': {'users': [{'id': '10228272',
    'name': 'YouTube',
    'username': 'YouTube'}]}}

In [11]:
tweets[1]['data'][0]

{'conversation_id': '1427329311556648961',
 'text': '@YouTube What’s the point if cowards like dc and Disney  turn comments off 🤣',
 'referenced_tweets': [{'type': 'replied_to', 'id': '1427329311556648961'}],
 'id': '1427334250924285954',
 'source': 'Twitter for iPhone',
 'author_id': '1404903027992051712',
 'lang': 'en',
 'reply_settings': 'everyone',
 'public_metrics': {'retweet_count': 0,
  'reply_count': 0,
  'like_count': 0,
  'quote_count': 0},
 'in_reply_to_user_id': '10228272',
 'created_at': '2021-08-16T18:19:38.000Z'}

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [12]:
# This handles v2 API... Not yet fully integrated in PyConversations
def fromutcformat(utc_str, tz=None):
    iso_str = utc_str.replace('Z', '+00:00')
    return dt.fromisoformat(iso_str).astimezone(tz)
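
A quick check of the helper above may be useful, since `astimezone(None)` converts to the local timezone of the machine running the notebook (which is why the timestamps printed later carry a local offset). The sketch below repeats the helper so it runs on its own and passes `timezone.utc` to keep the result in UTC.

```python
from datetime import datetime, timezone


def fromutcformat(utc_str, tz=None):
    # same helper as in the cell above, repeated so this snippet is self-contained
    iso_str = utc_str.replace('Z', '+00:00')
    return datetime.fromisoformat(iso_str).astimezone(tz)


# tz=None would convert to the local zone; timezone.utc keeps the value in UTC
stamp = fromutcformat('2021-08-16T18:00:01.000Z', tz=timezone.utc)
print(stamp.isoformat())  # 2021-08-16T18:00:01+00:00
```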

In [13]:
# create conversation
conv = Conversation()

In [14]:
for batch in tweets:
    for t in batch['data']:
        # create data for the post constructor
        cons = {
            'uid':        t['id'],
            'created_at': fromutcformat(t['created_at']),
            'text':       t['text'].strip(),
            'author':     [u['username'] for u in batch['includes']['users'] if u['id'] == t['author_id']][0],
            'reply_to':   [r['id'] for r in t.get('referenced_tweets', [])],
            'lang':       t['lang'],
        }
        conv.add_post(Tweet(**cons))

In [15]:
# print number of unique posts contained within the Conversation
len(conv.posts)

51

In [16]:
list(conv.posts.items())[:5]

[('1427329311556648961',
  Tweet(Twitter::YouTube::2021-08-16 14:00:01-04:00::RT if you always leave a comment::tags=)),
 ('1427334250924285954',
  Tweet(Twitter::wba434::2021-08-16 14:19:38-04:00::@YouTube What’s the point if cowards like dc and D::tags=)),
 ('1427334238567747598',
  Tweet(Twitter::43Huntley::2021-08-16 14:19:35-04:00::@YouTube @TechWizYT L A::tags=)),
 ('1427334129008402432',
  Tweet(Twitter::steviexrocker::2021-08-16 14:19:09-04:00::@YouTube I don’t reply if someone always say “nice::tags=)),
 ('1427334111815950342',
  Tweet(Twitter::TempGamers::2021-08-16 14:19:05-04:00::@YouTube https://t.co/eOHIUXlI7F::tags=))]

### Sub-Conversation Segmentation

In [17]:
# separate disjoint conversations
segs = conv.segment()

len(segs)

5
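
To give some intuition for what "disjoint" means here, the following is a rough sketch of segmentation as connected components over reply links. It is not the library's implementation: `connected_segments` and the `reply_map` structure are hypothetical, and the way missing parents are treated (ignored, so their replies form separate segments) is an assumption. Because only the first batch of replies was downloaded, some posts may reply to Tweets that were never collected, which is one plausible reason for seeing more than one segment.

```python
def connected_segments(reply_map):
    """Group post IDs into components linked by reply edges.

    `reply_map` is a hypothetical {post_id: [parent_ids]} mapping. Parents
    that are absent from the mapping are ignored, so their children can
    end up in separate segments.
    """
    # undirected adjacency restricted to posts we actually have
    neighbors = {uid: set() for uid in reply_map}
    for uid, parents in reply_map.items():
        for parent in parents:
            if parent in reply_map:
                neighbors[uid].add(parent)
                neighbors[parent].add(uid)

    seen, segments = set(), []
    for start in reply_map:
        if start in seen:
            continue
        # depth-first walk collecting everything reachable from `start`
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        segments.append(component)
    return segments


example = {'root': [], 'a': ['root'], 'b': ['a'], 'c': ['missing'], 'd': ['c']}
segments = connected_segments(example)
print(len(segments))  # 2
```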

### (Detected) Language Distribution

In [18]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 37, 'und': 13, 'lv': 1})

### Conversation-Level Redaction

Using `Conversation.redact()` produces a thread that is cleaned of user-specific information.
Redaction is scoped to the conversation: all usernames are first enumerated (from author names and, for Reddit and Twitter, from in-text mentions), and then user mentions and author names are replaced by `USER{\d}`, where `{\d}` is the integer assigned to that username during the enumeration stage.

Here's a demonstration:

In [19]:
# pre-redaction 
names = {post.author for post in conv.posts.values()}

len(names), names

(49,
 {'43Huntley',
  'AnitaPa00338598',
  'AsamnewAsamnew',
  'DymeADuzin',
  'Gamer_1745',
  'GenuineTech80',
  'Hollywood_652',
  'ItzDodgerz',
  'J_ffS0n',
  'JosephYZSL',
  'Kelashmeghwar1',
  'KennyHylleberg',
  'KhaledXKetata',
  'KyepoTW',
  'Le09hY',
  'Leyoch234',
  'ManakMazhar',
  'MrJayMick',
  'MuhammadShiblu4',
  'Princehooper_',
  'RykersToybox',
  'Sara_AM_23',
  'Scifiz1',
  'Senistro_Band',
  'Spikiie1',
  'StanReinhardt',
  'TJeffy125',
  'TempGamers',
  'TheCrussVentsel',
  'TheHoodedMan0',
  'ThomasD25788825',
  'UnqualifiedDude',
  'VesuvianLevio',
  'YouTube',
  'cabrobst',
  'finds_e',
  'imbellete',
  'kid_hummus',
  'lexbts7',
  'lucas80628444',
  'nakedtruth_fact',
  'sarikaa_sri',
  'sliprings',
  'sneha_chandra',
  'steviexrocker',
  'theeKenyan_Icon',
  'unchangend',
  'wba434',
  'weeklychatter'})

In [20]:
# redaction step
conv.redact()

In [21]:
# post-redaction
names = {post.author for post in conv.posts.values()}
len(names), names

(49,
 {'USER0',
  'USER1',
  'USER10',
  'USER11',
  'USER12',
  'USER13',
  'USER14',
  'USER15',
  'USER17',
  'USER18',
  'USER19',
  'USER2',
  'USER20',
  'USER21',
  'USER22',
  'USER23',
  'USER24',
  'USER25',
  'USER26',
  'USER27',
  'USER28',
  'USER29',
  'USER30',
  'USER36',
  'USER38',
  'USER4',
  'USER40',
  'USER41',
  'USER43',
  'USER44',
  'USER45',
  'USER46',
  'USER47',
  'USER48',
  'USER49',
  'USER5',
  'USER50',
  'USER51',
  'USER52',
  'USER53',
  'USER55',
  'USER56',
  'USER57',
  'USER58',
  'USER59',
  'USER6',
  'USER7',
  'USER8',
  'USER9'})
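
The two stages described above (enumerate, then replace) can be sketched in plain Python. This is an illustration of the idea, not the code behind `Conversation.redact()`: the `redact_thread` function, the `@(\w+)` mention pattern, and the choice to keep the `@` prefix are all assumptions. Note that mentioned users who never author a post still receive a number, which would be consistent with the gaps in the labels above (for example, there is no `USER3` among the authors).

```python
import re


def redact_thread(posts):
    """Replace author names and @mentions with conversation-scoped labels.

    `posts` is a hypothetical list of {'author': str, 'text': str} dicts.
    """
    mention = re.compile(r'@(\w+)')

    # enumeration stage: number every username, from authors and mentions
    ids = {}
    for post in posts:
        for name in [post['author']] + mention.findall(post['text']):
            ids.setdefault(name, len(ids))

    # replacement stage: swap each username for its assigned label
    redacted = []
    for post in posts:
        text = mention.sub(lambda m: '@USER{}'.format(ids[m.group(1)]), post['text'])
        redacted.append({'author': 'USER{}'.format(ids[post['author']]), 'text': text})
    return redacted


thread = [
    {'author': 'YouTube', 'text': 'RT if you always leave a comment'},
    {'author': 'wba434', 'text': '@YouTube @TechWizYT hello'},
]
clean = redact_thread(thread)
print(clean[1])
```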

### Saving and Loading from the universal format

In [22]:
import json

In [23]:
# saving a conversation to disk
# alternatively: save as a JSONLine file, where each line is a conversation!
with open('twitter_conv.json', 'w') as fp:
    json.dump(conv.to_json(), fp)

In [24]:
# reloading directly from the JSON
with open('twitter_conv.json') as fp:
    conv_reloaded = Conversation.from_json(json.load(fp))
len(conv_reloaded.posts)

51
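
The JSON Lines alternative mentioned in the comment above can be sketched with two small helpers. `write_jsonl` and `read_jsonl` are hypothetical names, and the sketch uses plain dictionaries as stand-ins for conversations; it assumes only that each record is JSON-serializable, as the output of `conv.to_json()` is in the cell above.

```python
import json


def write_jsonl(records, path):
    """Write one JSON document per line."""
    with open(path, 'w', encoding='utf-8') as fh:
        for record in records:
            fh.write(json.dumps(record) + '\n')


def read_jsonl(path):
    """Read a JSON Lines file back into a list of Python objects."""
    with open(path, encoding='utf-8') as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With several conversations in a list `convs`, this would be used as `write_jsonl([c.to_json() for c in convs], 'convs.jsonl')`, and each record read back could be passed to `Conversation.from_json`.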

### Feature Extraction

The remainder of this notebook exhibits some basic vectorization of features from conversations, posts, and users within this conversation. 
For more information, see the documentation for PyConversations.

In [25]:
from pyconversations.feature_extraction import ConversationVectorizer
from pyconversations.feature_extraction import PostVectorizer
from pyconversations.feature_extraction import UserVectorizer

In [26]:
convs = True
# convs = False

users = True
# users = False

posts = True
# posts = False

# normalization = None
# normalization = 'minmax'
# normalization = 'mean'
normalization = 'standard'

# cv = ConversationVectorizer(normalization=normalization, agg_user_fts=users, agg_post_fts=posts, include_source_user=True)
pv = PostVectorizer(normalization=normalization, include_conversation=convs, include_user=users)
# uv = UserVectorizer(normalization=normalization, agg_post_fts=posts)

In [27]:
# cv.fit(conv=conv_reloaded)
pv.fit(conv=conv_reloaded)
# uv.fit(conv=conv_reloaded)

<pyconversations.feature_extraction.extractors.PostVectorizer at 0x117d5cf90>

In [28]:
# cxs = cv.transform(conv=conv_reloaded)
pxs = pv.transform(conv=conv_reloaded)
# uxs = uv.transform(conv=conv_reloaded)

# print(cxs.shape)
print(pxs.shape)
# print(uxs.shape)

(51, 3317)
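
For readers unfamiliar with the `normalization` options used above, the sketch below applies the textbook definitions usually associated with the names `'minmax'`, `'mean'`, and `'standard'` to a single feature column. Whether PyConversations uses exactly these formulas is an assumption; the `normalize` helper is hypothetical and only meant to convey the general idea.

```python
from statistics import mean, pstdev


def normalize(column, how):
    """Rescale one feature column using a textbook normalization scheme."""
    lo, hi, mu = min(column), max(column), mean(column)
    if how == 'minmax':      # map onto [0, 1]
        scale, shift = hi - lo, lo
    elif how == 'mean':      # center on the mean, scale by the range
        scale, shift = hi - lo, mu
    elif how == 'standard':  # zero mean, unit (population) standard deviation
        scale, shift = pstdev(column), mu
    else:
        raise ValueError('unknown normalization: {}'.format(how))
    # constant columns have zero spread; map them to 0.0 instead of dividing by zero
    return [(x - shift) / scale if scale else 0.0 for x in column]


print(normalize([1, 2, 3], 'minmax'))  # [0.0, 0.5, 1.0]
print(normalize([1, 2, 3], 'mean'))    # [-0.5, 0.0, 0.5]
```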
