# PyConversations: A Reddit-based Example

The following is a tutorial notebook that demonstrates how to use `pyconversations` with Reddit data.

The first step is to obtain some data. To do so, configure a personal application via your Reddit account's [App Preferences](https://www.reddit.com/prefs/apps) by setting up a personal use script. See the _Getting Access_ portion of [this blog post](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for additional instructions and visuals.

In [1]:
from pprint import pprint

from pyconversations.convo import Conversation
from pyconversations.message import RedditPost

## Data Sample

Before demonstrating how to use `pyconversations` for pre-processing and analysis, we need to obtain a data sample.
To do so, we'll be using a package called [praw](https://github.com/praw-dev/praw).

In [2]:
import praw

In [3]:
# Private information
CLIENT_ID = ''  # this should be the 'personal use script' on your App Preferences
SECRET_TOKEN = ''  # this should be the 'secret' on your App Preferences
USER_AGENT = ''  # a custom name for your application for the User-Agent parameter in the request headers; gives a brief app description to Reddit
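
Rather than pasting secrets into the notebook, one option is to read them from the environment. The following is a minimal sketch under stated assumptions: the helper `load_reddit_credentials` and the environment variable names (`REDDIT_CLIENT_ID`, `REDDIT_SECRET_TOKEN`, `REDDIT_USER_AGENT`) are hypothetical and not part of `praw` or `pyconversations`.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    keys = {
        'client_id': 'REDDIT_CLIENT_ID',
        'client_secret': 'REDDIT_SECRET_TOKEN',
        'user_agent': 'REDDIT_USER_AGENT',
    }
    # fail early, naming every variable that is unset or empty
    missing = [name for name in keys.values() if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {param: env[name] for param, name in keys.items()}
```

The returned dictionary uses the same keyword names as the `praw.Reddit(...)` call in the next cell, so it could be unpacked as `praw.Reddit(**load_reddit_credentials())`.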

In [4]:
# configure a read-only praw.Reddit instance
reddit = praw.Reddit(
    client_id=CLIENT_ID,
    client_secret=SECRET_TOKEN,
    user_agent=USER_AGENT,
)

reddit, reddit.read_only

(<praw.reddit.Reddit at 0x10a9990d0>, True)

In [5]:
# obtain a sub-reddit of interest
sub_name = 'Drexel'
subreddit = reddit.subreddit(sub_name)

subreddit.title  # PRAW is lazy: no request is sent until an attribute is accessed

pprint(vars(subreddit))

{'_fetched': True,
 '_path': 'r/Drexel/',
 '_reddit': <praw.reddit.Reddit object at 0x10a9990d0>,
 'accounts_active': 103,
 'accounts_active_is_fuzzed': False,
 'active_user_count': 103,
 'advertiser_category': 'College / University',
 'all_original_content': False,
 'allow_chat_post_creation': False,
 'allow_discovery': True,
 'allow_galleries': True,
 'allow_images': True,
 'allow_polls': True,
 'allow_predictions': False,
 'allow_predictions_tournament': False,
 'allow_videogifs': True,
 'allow_videos': True,
 'banner_background_color': '#ffffff',
 'banner_background_image': 'https://styles.redditmedia.com/t5_2qh6g/styles/bannerBackgroundImage_250phz39qiw31.jpg?width=4000&s=0738f848906e0a9602210260124eb92e697c33e0',
 'banner_img': '',
 'banner_size': None,
 'can_assign_link_flair': False,
 'can_assign_user_flair': True,
 'collapse_deleted_comments': False,
 'comment_score_hide_mins': 0,
 'community_icon': 'https://styles.redditmedia.com/t5_2qh6g/styles/communityIcon_ng3537akpiw31.pn

In [6]:
# get the top submission via 'hot'
top_submission = list(subreddit.hot(limit=1))[0]
top_submission.title

pprint(vars(top_submission))

{'_comments_by_id': {},
 '_fetched': False,
 '_reddit': <praw.reddit.Reddit object at 0x10a9990d0>,
 'all_awardings': [],
 'allow_live_comments': False,
 'approved_at_utc': None,
 'approved_by': None,
 'archived': False,
 'author': Redditor(name='AstroGnat'),
 'author_flair_background_color': None,
 'author_flair_css_class': 'textflair',
 'author_flair_richtext': [],
 'author_flair_template_id': '59c1edbc-5345-11e1-9edb-12313b08a511',
 'author_flair_text': 'Alumni | Digital Media',
 'author_flair_text_color': 'dark',
 'author_flair_type': 'text',
 'author_fullname': 't2_3gjpr',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'banned_at_utc': None,
 'banned_by': None,
 'can_gild': False,
 'can_mod_post': False,
 'category': None,
 'clicked': False,
 'comment_limit': 2048,
 'comment_sort': 'confidence',
 'content_categories': None,
 'contest_mode': False,
 'created': 1626638983.0,
 'created_utc': 1626638983.0,
 'discussion_type': N

In [7]:
# get all the comments on this submission
top_submission.comments.replace_more(limit=None)  # expand any "load more comments" stubs first
all_comments = top_submission.comments.list()

print(len(all_comments))

all_comments[0].score
pprint(vars(all_comments[0]))

17
{'_fetched': True,
 '_reddit': <praw.reddit.Reddit object at 0x10a9990d0>,
 '_replies': <praw.models.comment_forest.CommentForest object at 0x11eb1e610>,
 '_submission': Submission(id='omya9w'),
 'all_awardings': [],
 'approved_at_utc': None,
 'approved_by': None,
 'archived': False,
 'associated_award': None,
 'author': Redditor(name='Fun-Obligation5729'),
 'author_flair_background_color': None,
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_883n07bs',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'banned_at_utc': None,
 'banned_by': None,
 'body': 'I have a room available from September 1st- August 31st at 3406 race '
         'street. It’s 775$ . Let me know if anyone’s interested.',
 'body_html': '<div class="md"><p>I have a room available from September 1s

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [8]:
# construct a conversation container
conv = Conversation()

In [9]:
# parse our root submission
cons_params = {
    'lang_detect': True,  # whether to enable the language detection module on the text
    'uid': top_submission.id,  # the unique identifier of the post
    'author': top_submission.author.name if top_submission.author is not None else None,  # name of user
    'created_at': RedditPost.parse_datestr(top_submission.created),  # creation timestamp
    'text': (top_submission.title + '\n\n' + top_submission.selftext).strip()  # text of post
}
top_post = RedditPost(**cons_params)
top_post

RedditPost(Reddit::AstroGnat::2021-07-18 16:09:43::Sublet Thread - Fall Winter

Here's the thread for::tags=)
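
Reddit reports creation times as Unix epoch seconds (see `created_utc` in the submission dump above). As a small sketch that is independent of `RedditPost.parse_datestr` (whose internals are not shown here), the standard library can convert such a value into a timezone-aware datetime. The representation above shows 16:09:43, which would be consistent with rendering the same instant in a UTC-4 local zone, though that is an inference rather than something the source states.

```python
from datetime import datetime, timezone

created = 1626638983.0  # the 'created_utc' value from the submission dump above

# timezone-aware conversion: unambiguous regardless of the machine's local zone
as_utc = datetime.fromtimestamp(created, tz=timezone.utc)
print(as_utc.isoformat())  # 2021-07-18T20:09:43+00:00
```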

In [10]:
# add our post to the conversation
conv.add_post(top_post)

In [11]:
# which we can easily access via the .posts property, to verify inclusion
conv.posts

{'omya9w': RedditPost(Reddit::AstroGnat::2021-07-18 16:09:43::Sublet Thread - Fall Winter
 
 Here's the thread for::tags=)}

In [12]:
# iterate through comments and add them to the conversation
for com in all_comments:
    conv.add_post(RedditPost(**{
        'lang_detect': True,  # whether to enable the language detection module on the text
        'uid': com.id,  # the unique identifier of the post
        'author': com.author.name if com.author is not None else None,  # name of user
        'created_at': RedditPost.parse_datestr(com.created),  # creation timestamp
        'text': com.body.strip(),  # text of post
        'reply_to': {com.parent_id.replace('t1_', '').replace('t3_', '')}  # set of IDs replied to
    }))

len(conv.posts)

18
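
The `reply_to` entry in the loop above strips Reddit's type prefixes from `parent_id` (`t1_` marks a comment, `t3_` a submission) using chained `replace` calls. A slightly more defensive variant is sketched below. The helper name `strip_fullname_prefix` is hypothetical, and the comment ID in the usage note is made up for illustration.

```python
def strip_fullname_prefix(fullname):
    """Drop a Reddit type prefix such as 't1_' or 't3_' from a fullname."""
    prefix, sep, base_id = fullname.partition('_')
    # only strip when the text before the underscore looks like a type prefix
    if sep and len(prefix) == 2 and prefix[0] == 't' and prefix[1].isdigit():
        return base_id
    return fullname
```

For example, `strip_fullname_prefix('t3_omya9w')` gives `'omya9w'`, `strip_fullname_prefix('t1_h5o1abc')` gives `'h5o1abc'`, and a bare ID is returned unchanged. Unlike chained `replace` calls, this only touches the leading prefix.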

### Sub-Conversation Segmentation

In [13]:
# separate disjoint conversations (there is likely just the one with a full query from the site...)
segs = conv.segment()

len(segs)

1
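
To illustrate what separating disjoint conversations means, here is a minimal sketch that groups post IDs into connected components of the reply graph. It is not the implementation behind `Conversation.segment()`. The function `segment_by_replies` and its input format (a dict mapping each post ID to the set of IDs it replies to) are assumptions made for this example.

```python
from collections import defaultdict


def segment_by_replies(reply_map):
    """Group post IDs into disjoint threads (connected components).

    reply_map: dict mapping each post ID to the set of IDs it replies to.
    """
    # build an undirected adjacency list from the reply links
    neighbors = defaultdict(set)
    for uid, parents in reply_map.items():
        neighbors[uid]  # make sure isolated posts get an entry
        for parent in parents:
            neighbors[uid].add(parent)
            neighbors[parent].add(uid)

    seen, segments = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # depth-first traversal collects everything reachable from `start`
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        segments.append(component)
    return segments
```

With a single submission and all of its comments, every post is reachable from the root, so a single segment is expected, matching the output above.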

### (Detected) Language Distribution

In [14]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 17, 'und': 1})

### Conversation-Level Redaction

`Conversation.redact()` produces a thread cleaned of user-specific information.
Redaction is scoped to the conversation: all usernames are first enumerated (from author names and, for Reddit and Twitter, from in-text references), and then user mentions and author names are replaced by `USER{\d}`, where `{\d}` is the integer assigned to that username during enumeration.

Here's a demonstration:

In [15]:
# pre-redaction 
names = {post.author for post in conv.posts.values()}

len(names), names

(17,
 {'According-Rate-2705',
  'AstroGnat',
  'Fit_Web_7741',
  'Fun-Obligation5729',
  'Hasta_La_Pasta827',
  'HumbleAbodee',
  'asapmeelz',
  'cbreck117',
  'ghanshani_ritik',
  'makkirch',
  'memeboi2002',
  'sanjubee',
  'simonest27',
  'starryknight16',
  'thecalk',
  'turtlesturtlesduck',
  'viettran127'})

In [16]:
# redaction step
conv.redact()

In [17]:
# post-redaction
names = {post.author for post in conv.posts.values()}
len(names), names

(17,
 {'USER0',
  'USER1',
  'USER10',
  'USER11',
  'USER12',
  'USER13',
  'USER14',
  'USER15',
  'USER16',
  'USER2',
  'USER3',
  'USER4',
  'USER5',
  'USER6',
  'USER7',
  'USER8',
  'USER9'})
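
The enumerate-then-replace idea described in this section can be sketched with the standard library alone. This is an illustration, not the code behind `Conversation.redact()`. The function `redact_thread`, the `(author, text)` input format, and the mention pattern (Reddit-style `u/name` or `/u/name`) are assumptions for this example.

```python
import re

# Reddit-style mentions: 'u/name' or '/u/name', not preceded by a letter or digit
MENTION = re.compile(r'(?<![A-Za-z0-9])/?u/([A-Za-z0-9_-]+)')


def redact_thread(posts):
    """Return a copy of `posts` (a list of (author, text) pairs) with usernames replaced."""
    # stage 1: enumerate every username, from author fields and in-text mentions
    ids = {}
    for author, text in posts:
        for name in [author, *MENTION.findall(text)]:
            if name is not None and name not in ids:
                ids[name] = len(ids)

    # stage 2: replace authors and mentions with their conversation-scoped alias
    def alias(name):
        return f'USER{ids[name]}'

    return [
        (alias(author) if author is not None else None,
         MENTION.sub(lambda m: alias(m.group(1)), text))
        for author, text in posts
    ]
```

Because the numbering is built per thread, the same person may receive a different alias in a different conversation, which is what "scoped to the conversation" implies.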

### Saving and Loading from the universal format

In [18]:
import json

In [19]:
# saving a conversation to disk
# alternatively: save as a JSON Lines file, where each line is a conversation!
with open('reddit_conv.json', 'w') as fp:  # the context manager closes the file for us
    json.dump(conv.to_json(), fp)

In [20]:
# reloading directly from the JSON
with open('reddit_conv.json') as fp:
    conv_reloaded = Conversation.from_json(json.load(fp))
len(conv_reloaded.posts)

18
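
The one-conversation-per-line alternative mentioned in the saving cell could look roughly like the sketch below. The placeholder dictionaries stand in for whatever `conv.to_json()` returns; their structure is invented for this example and does not reflect the actual `pyconversations` schema. The file name is likewise arbitrary.

```python
import json
import os
import tempfile

# stand-ins for the JSON-serializable objects returned by conv.to_json();
# the real structure is defined by pyconversations and is not reproduced here
conversations = [
    {'placeholder': 'first conversation'},
    {'placeholder': 'second conversation'},
]

path = os.path.join(tempfile.mkdtemp(), 'reddit_convs.jsonl')

# write: one JSON document per line
with open(path, 'w') as fp:
    for raw in conversations:
        fp.write(json.dumps(raw) + '\n')

# read: parse each non-empty line independently
with open(path) as fp:
    reloaded = [json.loads(line) for line in fp if line.strip()]
```

Each reloaded item could then be passed to `Conversation.from_json(...)`, as in the single-file case above. A line-oriented file can also be streamed one conversation at a time instead of loading everything into memory.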

### Feature Extraction

The remainder of this notebook exhibits some basic vectorization of features from conversations, posts, and users within this conversation. 
For more information, see the documentation for PyConversations.

In [21]:
from pyconversations.feature_extraction import ConversationVectorizer
from pyconversations.feature_extraction import PostVectorizer
from pyconversations.feature_extraction import UserVectorizer

In [22]:
convs = True
# convs = False

users = True
# users = False

posts = True
# posts = False

# normalization = None
# normalization = 'minmax'
# normalization = 'mean'
normalization = 'standard'

# cv = ConversationVectorizer(normalization=normalization, agg_user_fts=users, agg_post_fts=posts, include_source_user=True)
pv = PostVectorizer(normalization=normalization, include_conversation=convs, include_user=users)
# uv = UserVectorizer(normalization=normalization, agg_post_fts=posts)

In [23]:
# cv.fit(conv=conv_reloaded)
pv.fit(conv=conv_reloaded)
# uv.fit(conv=conv_reloaded)

<pyconversations.feature_extraction.extractors.PostVectorizer at 0x122860d10>

In [24]:
# cxs = cv.transform(conv=conv_reloaded)
pxs = pv.transform(conv=conv_reloaded)
# uxs = uv.transform(conv=conv_reloaded)

# pprint(cxs.shape)
pprint(pxs.shape)
# pprint(uxs.shape)

(18, 3317)
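
For intuition about the `normalization` options, the sketch below shows two common column-wise schemes, assuming that `'standard'` denotes z-scoring and `'minmax'` denotes rescaling onto [0, 1]. Those are the usual meanings of the terms, but this tutorial does not confirm how `pyconversations` implements them. The helper names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score a feature column: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(x - mu) / sigma for x in column]


def minmax(column):
    """Rescale a feature column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(x - lo) / (hi - lo) for x in column]
```

Applied to one column of a feature matrix such as `pxs`, either helper leaves the number of rows unchanged and only rescales the values.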
