# PyConversations: A 4chan-based Example

The following is a tutorial notebook that demonstrates how to use `pyconversations` with 4chan data.

Interfacing with 4chan data _does not_ require any secret keys, tokens, or anything of that nature. Instead, we'll use a package called `BASC-py4chan`, which provides [an object-oriented interface](https://basc-py4chan.readthedocs.io/en/latest/index.html) for interacting with the 4chan API.

To begin, let's obtain some data using the package and then proceed by integrating the data with PyConversations!

In [1]:
from pprint import pprint

from pyconversations.convo import Conversation
from pyconversations.message import ChanPost

## Data Sample

In [2]:
import basc_py4chan

In [3]:
# get the board we want
board_name = 'news'

board = basc_py4chan.Board(board_name)
board

<Board /news/>

In [6]:
# select a thread on the board

ix = 2
all_thread_ids = board.get_all_thread_ids()
thread_id = all_thread_ids[ix] if ix < len(all_thread_ids) else all_thread_ids[-1]
thread = board.get_thread(thread_id)

thread_id, thread

(904494, <Thread /news/904494, 51 replies>)

In [7]:
# print thread information
print(thread)
print('Sticky?', thread.sticky)
print('Closed?', thread.closed)
print('Replies:', len(thread.replies))

<Thread /news/904494, 51 replies>
Sticky? False
Closed? False
Replies: 51


In [8]:
# print topic post information
topic = thread.topic
print('Topic Repr', topic)
print('Postnumber', topic.post_number)
print('Timestamp', topic.timestamp)
print('Datetime', repr(topic.datetime))
print('Subject', topic.subject)
print('Comment', topic.comment)

Topic Repr <Post /news/904494#904494, has_file: True>
Postnumber 904494
Timestamp 1628458133
Datetime datetime.datetime(2021, 8, 8, 17, 28, 53)
Subject Trump and the RNC forced to return over $12.8 million that they stole from supporters in 2021
Comment And his cult will still support him even as he just steals their money outright.<br><br>https://www.nytimes.com/2021/08/07/us/politics/trump-recurring-donations.html<br><br>The aggressive fund-raising tactics that former President Donald J. Trump deployed late in last year’s presidential campaign have continued to spur an avalanche of refunds into 2021, with Mr. Trump, the Republican Party and their shared accounts returning $12.8 million to donors in the first six months of the year, newly released federal records show.<br>The refunds were some of the biggest outlays that Mr. Trump made in 2021 as he has built up his $102 million political war chest — and amounted to roughly 20 percent of the $56 million he and his committees raised on
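
The `Datetime` value above carries no timezone information. It reads 17:28:53, while the same timestamp in UTC is 21:28:53, so the value presumably reflects the local clock of the machine that ran this notebook (an assumption). If a timezone-independent value is preferred, the raw `Timestamp` can be converted with the standard library alone, as in this small sketch:

```python
from datetime import datetime, timezone

# Unix timestamp of the topic post, copied from the output above
timestamp = 1628458133

# build a timezone-aware datetime in UTC instead of a naive local one
created_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(created_utc.isoformat())  # 2021-08-08T21:28:53+00:00
```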

In [9]:
len(thread.all_posts)

52

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [10]:
# create conversation
conv = Conversation()

In [11]:
for post in thread.all_posts:
    
    # gather up raw text
    raw_text = ((post.subject + ':\n' if post.subject else '') + post.text_comment).strip()
    
    # cleanse text and retrieve references to other posts
    text, rfs = ChanPost.clean_text(raw_text)
    if not rfs and topic.post_number != post.post_id:
        rfs = [topic.post_number]
        
    rfs = list(map(int, rfs))
    
    # create data for the post constructor
    cons = {
        'uid':        post.post_id,  # unique identifier for the post (mandatory field)
        'created_at': post.datetime,  # datetime of post creation
        'text':       text,  # cleaned plaintext
        'author':     post.name,  # self-assigned name of the poster (likely 'Anonymous')
        'reply_to':   rfs,  # cleaned references to other posts
        'lang_detect': True  # whether or not to attempt language detection
    }
    conv.add_post(ChanPost(**cons))
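
To make the reference handling in the loop above more concrete, here is a minimal sketch of how reply references could be pulled out of 4chan text, where a reply is marked with `>>` followed by a post number. This is not the implementation of `ChanPost.clean_text`; the helper `extract_references` and the sample post numbers other than 904494 and 904521 are made up for illustration.

```python
import re

# 4chan marks a reply to another post with '>>' followed by the post number
QUOTE_LINK = re.compile(r'>>(\d+)')


def extract_references(raw_text, post_id, topic_id):
    """Hypothetical helper: return (cleaned_text, referenced_post_ids)."""
    refs = [int(m) for m in QUOTE_LINK.findall(raw_text)]
    # remove the quote links and trim surrounding whitespace
    text = QUOTE_LINK.sub('', raw_text).strip()
    # a reply without an explicit reference is treated as a reply to the topic
    if not refs and post_id != topic_id:
        refs = [topic_id]
    return text, refs


print(extract_references('>>904494\nSource?', 904521, 904494))
print(extract_references('No quote here', 904530, 904494))
print(extract_references('Opening post', 904494, 904494))
```

The fallback mirrors the loop above: only the topic post ends up with an empty reference list.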

In [12]:
# print number of unique posts contained within the Conversation
len(conv.posts)

52

In [13]:
list(conv.posts.items())[:5]

[(904494,
  4chanPost(4chan::Anonymous::2021-08-08 17:28:53::Trump and the RNC forced to return over $12.8 mill::tags=)),
 (904495,
  4chanPost(4chan::Anonymous::2021-08-08 17:29:16::New Federal Election Commission records from WinRe::tags=)),
 (904499,
  4chanPost(4chan::Anonymous::2021-08-08 17:30:17::The Federal Election Commission has since unanimou::tags=)),
 (904500,
  4chanPost(4chan::Anonymous::2021-08-08 17:31:18::WinRed said there was simply a greater volume of r::tags=)),
 (904521,
  4chanPost(4chan::Anonymous::2021-08-08 17:49:27::Can't wait for the MAGAt tears about this.::tags=))]

In [14]:
# Conversations can be segmented into disjoint reply trees
# (useful when a large collection may be missing posts or may contain unrelated threads).
# Here this is likely to return a single conversation (a copy of what we built),
# since we queried a single thread directly using the API.
# Segmentation is more relevant when ingesting a heterogeneous collection of posts.
segs = conv.segment()

len(segs)

1
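
Conceptually, splitting a collection into disjoint conversations amounts to finding connected components in the reply graph. The sketch below illustrates that idea on plain dictionaries; it is not how `Conversation.segment` is implemented, and the helper `segment_by_replies` and the toy post ids are hypothetical.

```python
from collections import defaultdict


def segment_by_replies(reply_map):
    """Hypothetical helper: group post ids into connected reply components.

    reply_map maps each post id to the list of post ids it replies to.
    """
    # build an undirected adjacency list; references to missing posts are ignored
    neighbors = defaultdict(set)
    for uid, parents in reply_map.items():
        neighbors[uid]  # make sure isolated posts get an entry too
        for parent in parents:
            if parent in reply_map:
                neighbors[uid].add(parent)
                neighbors[parent].add(uid)

    seen, segments = set(), []
    for start in reply_map:
        if start in seen:
            continue
        # depth-first traversal collects one component
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        segments.append(sorted(component))
    return segments


# two reply chains plus one post whose parent (99) was never collected
posts = {1: [], 2: [1], 3: [2], 10: [], 11: [10], 20: [99]}
print(segment_by_replies(posts))  # [[1, 2, 3], [10, 11], [20]]
```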

### (Detected) Language Distribution

In [15]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 44, 'da': 1, 'fr': 1, 'und': 6})
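
Raw counts are easier to compare as shares of the whole thread. The following sketch re-creates the counter from the output above and prints each language's proportion. `und` is the ISO 639-2 code for an undetermined language; reading it as "detection failed" for these posts is an assumption.

```python
from collections import Counter

# counts copied from the output above
lang_dist = Counter({'en': 44, 'da': 1, 'fr': 1, 'und': 6})
total = sum(lang_dist.values())

# most_common() sorts from the most to the least frequent language
for lang, count in lang_dist.most_common():
    print(f'{lang}: {count} posts ({count / total:.1%})')
```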

### Saving and Loading from the universal format

In [17]:
import json

In [18]:
# saving a conversation to disk
# alternatively: save as a JSON Lines file, where each line is a conversation!
j = conv.to_json()
with open('4chan_conv.json', 'w') as fp:  # the context manager closes (and flushes) the file
    json.dump(j, fp)

In [19]:
# reloading directly from the JSON
with open('4chan_conv.json') as fp:
    conv_reloaded = Conversation.from_json(json.load(fp))
len(conv_reloaded.posts)

52
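
For collections of many conversations, the JSON Lines layout mentioned in the saving cell stores one conversation per line. The sketch below shows the pattern with plain dictionaries standing in for the objects returned by `conv.to_json()`; the keys `convo_id` and `posts` are made up and do not reflect the real schema.

```python
import json
import os
import tempfile

# stand-ins for the objects returned by `conv.to_json()`; the keys are made up
conversations = [
    {'convo_id': 'a', 'posts': [1, 2, 3]},
    {'convo_id': 'b', 'posts': [4, 5]},
]

path = os.path.join(tempfile.mkdtemp(), 'convs.jsonl')

# write one JSON document per line
with open(path, 'w', encoding='utf-8') as fp:
    for conv_json in conversations:
        fp.write(json.dumps(conv_json) + '\n')

# read the file back, one conversation per line, skipping blank lines
with open(path, encoding='utf-8') as fp:
    reloaded = [json.loads(line) for line in fp if line.strip()]

print(len(reloaded))  # 2
```

With real data, each reloaded object would then be passed to `Conversation.from_json`, as in the reloading cell above.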

### Feature Extraction

The remainder of this notebook demonstrates basic feature vectorization for the conversation, its posts, and its users.
For more information, see the documentation for PyConversations.

In [20]:
from pyconversations.feature_extraction import ConversationVectorizer
from pyconversations.feature_extraction import PostVectorizer
from pyconversations.feature_extraction import UserVectorizer

In [21]:
convs = True
# convs = False

users = True
# users = False

posts = True
# posts = False

# normalization = None
# normalization = 'minmax'
# normalization = 'mean'
normalization = 'standard'

# cv = ConversationVectorizer(normalization=normalization, agg_user_fts=users, agg_post_fts=posts, include_source_user=True)
pv = PostVectorizer(normalization=normalization, include_conversation=convs, include_user=users)
# uv = UserVectorizer(normalization=normalization, agg_post_fts=posts)

In [22]:
# cv.fit(conv=conv_reloaded)
pv.fit(conv=conv_reloaded)
# uv.fit(conv=conv_reloaded)

<pyconversations.feature_extraction.extractors.PostVectorizer at 0x1203ba9d0>

In [23]:
# cxs = cv.transform(conv=conv_reloaded)
pxs = pv.transform(conv=conv_reloaded)
# uxs = uv.transform(conv=conv_reloaded)

# pprint(cxs.shape)
pprint(pxs.shape)
# pprint(uxs.shape)

(52, 3317)
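
The resulting matrix has one row per post (52). The `normalization` options are not defined in this notebook, so the sketch below only shows the conventional definitions of min-max and standard (z-score) scaling on a single feature column. Whether `'minmax'` and `'standard'` in PyConversations match these definitions exactly is an assumption, and the helper names and sample column are made up.

```python
from statistics import fmean, pstdev


def minmax_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # a constant column has no spread, so map it to zeros
    return [(v - lo) / span if span else 0.0 for v in values]


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# made-up feature column with mean 5 and population standard deviation 2
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(minmax_scale(column))
print(standard_scale(column))
```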
