# Measure whether language norms were expressed implicitly in community guidelines


- Using labelled dataset of comments + posts (excluding any banned (?), stickied(?), or moderator posts) train a text classifier to predict the subreddit a post or comment came from. 

- Measure of implicit language style of interface: accuracy of trained classifier on interface text of subreddit

- Classification can be *(subreddit vs. rest of dataset)* , *(subreddit vs. rest of interfaces)* or *multilabel classification* with all subreddits. -- Start with *(subreddit vs. rest of dataset)*

- Can also explore two types of interface text: just the public description, including guidelines and rules, just the moderator and stickied posts, and the two of them together. 

- Use SoPa as classifier, manually inspect patters


The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
Q0S1WHXyaGC9sg


In [4]:
%load_ext dotenv
%dotenv

import numpy as np
import csv as csv
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import math
import json
from scipy import stats
from datetime import datetime

from nltk import pos_tag
from nltk.util import pad_sequence
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from termcolor import colored

import praw
import requests
import json
import os

client_id = os.environ.get("client_id")
client_secret = os.environ.get("client_secret")
user_agent = os.environ.get("user_agent")

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent)

### Test subreddits
### Test some subreddits randomly taken from the identity paper
tested_subs = ['science', 'politics', 'economics', 'depression', 'Cooking', 'pics', 'Naruto', 'BabyBumps']
subreddit_label_dict = {s:i for i, s in enumerate(tested_subs)}
rows = []
for s in tested_subs:
    sub = reddit.subreddit(s)
    rows.append({'subreddit': s , 'descr':sub.description, 'public_descr':sub.public_description})
    
df_sub = pd.DataFrame(rows)


df_sub['full_descr'] = df_sub['descr'] + df_sub['public_descr']
df_sub

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


Unnamed: 0,descr,public_descr,subreddit,full_descr
0,# [Submission Rules](https://www.reddit.com/r/...,This community is a place to share and discuss...,science,# [Submission Rules](https://www.reddit.com/r/...
1,## **Welcome to /r/Politics! Please read [the ...,/r/Politics is for news and discussion about U...,politics,## **Welcome to /r/Politics! Please read [the ...
2,### Subreddit Rules\n\n--\n\nI. **Discipline-S...,"News and discussion about economics, from the ...",economics,### Subreddit Rules\n\n--\n\nI. **Discipline-S...
3,##A supportive space for anyone struggling wit...,Peer support for anyone struggling with depres...,depression,##A supportive space for anyone struggling wit...
4,#####Please read these\n\n1. All posts must be...,/r/Cooking is a place for the cooks of reddit ...,Cooking,#####Please read these\n\n1. All posts must be...
5,A place to share photographs and pictures. Fee...,A place to share photographs and pictures.,pics,A place to share photographs and pictures. Fee...
6,1. [Boruto Chapter 30 Free (VIZ)](https://www....,Everything related to Naruto and Boruto goes h...,Naruto,1. [Boruto Chapter 30 Free (VIZ)](https://www....
7,\n###All Bump Photos belong in our Stickied D...,"A place for pregnant redditors, those who have...",BabyBumps,\n###All Bump Photos belong in our Stickied D...


In [10]:
### Get comments/posts

df_comments_1 = pd.read_csv('data/reddit_comments_2016_01/comments_reddit_comments_2016_01_000000000000.csv')
df_comments_2 = pd.read_csv('data/reddit_comments_2016_01/comments_reddit_comments_2016_01_000000000001.csv')
df_comments_3 = pd.read_csv('data/reddit_comments_2016_01/comments_reddit_comments_2016_01_000000000002.csv')

# group them all together
print(len(df_comments_1), len(df_comments_2), len(df_comments_3))
df_comments = df_comments_1.append(df_comments_2.append(df_comments_3))
print(len(df_comments))

df_posts_1 = pd.read_csv('data/reddit_posts_2016_01/posts_reddit_posts_2016_01_000000000000.csv')
df_posts_2 = pd.read_csv('data/reddit_posts_2016_01/posts_reddit_posts_2016_01_000000000001.csv')
df_posts_3 = pd.read_csv('data/reddit_posts_2016_01/posts_reddit_posts_2016_01_000000000002.csv')

# group them all together
print(len(df_posts_1), len(df_posts_2), len(df_posts_3))
df_posts = df_posts_1.append(df_posts_2.append(df_posts_3))
print(len(df_posts))


### Get only the ones in the current tested subreddits
df_comments = df_comments[df_comments.apply(lambda x: x['subreddit'] in tested_subs, axis=1)]
df_posts = df_posts[df_posts.apply(lambda x: x['subreddit'] in tested_subs, axis=1)]
print('comments:', len(df_comments), ' and ', len(df_posts), ' posts in', tested_subs)


### Remove all deleted and moderator posts
df_posts_no_removed = df_posts[df_posts['selftext'].apply(lambda x: x not in ['[deleted]', '[removed]'])]
df_comments_no_removed = df_comments[df_comments['author'].apply(lambda x: x not in ['[deleted]', 'AutoModerator'])]

df_posts_no_removed_mod = df_posts_no_removed[df_posts_no_removed['distinguished'] != 'moderator']
df_comments_no_removed_mod = df_comments_no_removed[df_comments_no_removed['distinguished'] != 'moderator']

print('cleaned comments:', len(df_comments), ' and ', len(df_posts), ' cleaned posts in', tested_subs)



428076 431016 431583
1290675


  interactivity=interactivity, compiler=compiler, result=result)


112632 112984 112810
338426
comments: 33430  and  3981  posts in ['science', 'politics', 'economics', 'depression', 'Cooking', 'pics', 'Naruto', 'BabyBumps']


In [11]:
print('cleaned comments:', len(df_comments), ' and ', len(df_posts), ' cleaned posts in', tested_subs)


cleaned comments: 33430  and  3981  cleaned posts in ['science', 'politics', 'economics', 'depression', 'Cooking', 'pics', 'Naruto', 'BabyBumps']


In [4]:
print('posts:', df_posts.columns)
print('comments:', df_comments.columns)


posts: Index(['created_utc', 'subreddit', 'author', 'domain', 'url', 'num_comments',
       'score', 'ups', 'downs', 'title', 'selftext', 'saved', 'id',
       'from_kind', 'gilded', 'from', 'stickied', 'retrieved_on', 'over_18',
       'thumbnail', 'subreddit_id', 'hide_score', 'link_flair_css_class',
       'author_flair_css_class', 'archived', 'is_self', 'from_id', 'permalink',
       'name', 'author_flair_text', 'quarantine', 'link_flair_text',
       'distinguished'],
      dtype='object')
comments: Index(['body', 'score_hidden', 'archived', 'name', 'author',
       'author_flair_text', 'downs', 'created_utc', 'subreddit_id', 'link_id',
       'parent_id', 'score', 'retrieved_on', 'controversiality', 'gilded',
       'id', 'subreddit', 'ups', 'distinguished', 'author_flair_css_class'],
      dtype='object')


{'BabyBumps': 7,
 'Cooking': 4,
 'Naruto': 6,
 'depression': 3,
 'economics': 2,
 'pics': 5,
 'politics': 1,
 'science': 0}

In [106]:
### make dev.data and dev.labels and train.data, train.labels
### .data = newline seperated lines of text, 1 = politics, 0 = otherwise
### .labels = corrosponding labels, will do 1 = politics, 0 = otherwise
### 



In [107]:
df_posts_no_removed['selftext'].copy()

9                        Don't let your Dreams be Dreams. 
10       now, I could say a lot about these candidates ...
11       Is there any benefit to being a registered ind...
12       It makes no sense for a handful of reasons. he...
13       I have only seen her proposals, but not actual...
927      I read this in a compilation of uses of baking...
928      Hi all, thought I would get some advice on wha...
929      I want to impress some people and stupidly sai...
930      Alright chefs, foodies, and eaters of /r/cooki...
931      Here's a photo of [the three soups](http://i.i...
932      I'm a Chicago native currently living in Nashv...
933      i am from india.. :/\n\nhere you get two optio...
934      I've diced at least a few thousand onions in m...
935      Hi! Does anyone have any recommendations/ reci...
936      I really like making dishes and i want it to b...
937      I have a pork twnderloin that came cut in half...
938      Hello, \n\nCan anyone guide me to a mouthwater.

In [108]:
df_posts_no_removed['full_text'] = df_posts_no_removed['selftext'] + df_posts_no_removed['title']

df_posts_no_removed['body'] = df_posts_no_removed['selftext'].copy()

df_text = df_comments_no_removed[['body', 'subreddit']].append(df_posts_no_removed[['body', 'full_text', 'title', 'subreddit']])

df_text

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,body,full_text,subreddit,title
34,&gt;Wait until you take a look at the hordes o...,,politics,
135,Cut it in half again and you have potato wedge...,,pics,
142,In the medical field that's like THE worst thi...,,pics,
197,I'm Canadian... Nice pics of a warm summer rac...,,pics,
208,Lady in back must not agree.,,pics,
272,"I like Ron and Rand Paul, now I support Bernie...",,politics,
300,You like Herr’s Buffalo Blue Cheese Flavored C...,,pics,
439,&gt; more practical and safe options when it c...,,politics,
519,Thank you for this information it really helps...,,BabyBumps,
578,OK cool but that all seems basically pointless...,,science,


In [109]:

df_comments_no_removed_test = df_comments_no_removed[['body', 'subreddit']]

df_comments_no_removed_test['politics_label'] = df_comments_no_removed['subreddit'].apply(lambda x: x == 'politics').astype(int)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [110]:
import sklearn

# just doing comments for the time being
comment_train, comment_dev = sklearn.model_selection.train_test_split(df_comments_no_removed_test, test_size=.2, random_state=2239)


In [111]:
comment_train.body.to_csv('data/sopa/data/train.data', index=False)
comment_train.politics_label.to_csv('data/sopa/data/train.label', index=False)

comment_dev.body.to_csv('data/sopa/data/dev.data', index=False)
comment_dev.politics_label.to_csv('data/sopa/data/dev.label', index=False)

In [112]:
def check_value_counts(series):
    v_c = series.value_counts()
    print(v_c)
    return v_c

def return_df(df, column, criteria):
    return df[df[column] == criteria]
    

check_value_counts(df_posts.link_flair_text)



Already Submitted                                     84
Unacceptable Title                                    55
Removed: R1                                           54
Off-Topic                                             53
R1: Text/Comic/Infographic                            48
Rehosted Content                                      41
No Images/Memes                                       19
Health                                                19
No Link Shorteners                                    18
No Social Media                                       15
Biology                                               14
Environment                                           13
No ALL CAPS                                           12
Unacceptable Source                                   11
Engineering                                           10
Rant/Vent                                             10
Animal Science                                        10
Medicine                       

Already Submitted                                     84
Unacceptable Title                                    55
Removed: R1                                           54
Off-Topic                                             53
R1: Text/Comic/Infographic                            48
Rehosted Content                                      41
No Images/Memes                                       19
Health                                                19
No Link Shorteners                                    18
No Social Media                                       15
Biology                                               14
Environment                                           13
No ALL CAPS                                           12
Unacceptable Source                                   11
Engineering                                           10
Rant/Vent                                             10
Animal Science                                        10
Medicine                       

In [9]:
df_comments_mod = df_comments[df_comments['distinguished'] == 'moderator']
df_posts_mod = df_posts[df_posts['distinguished'] == 'moderator']


df_posts_mod

Unnamed: 0,created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,...,author_flair_css_class,archived,is_self,from_id,permalink,name,author_flair_text,quarantine,link_flair_text,distinguished
12307,1452332616,BabyBumps,MaeBeWeird,self.BabyBumps,https://www.reddit.com/r/BabyBumps/comments/40...,13,3,3,0,Saturday Discussion Thread,...,purple,False,True,,/r/BabyBumps/comments/405rdk/saturday_discussi...,t3_405rdk,Momma Yoda,False,Discussion,moderator
85,1452573669,pics,allthefoxes,self.pics,https://www.reddit.com/r/pics/comments/40kxfw/...,16,94,94,0,New addition to rule 5: No photoshop request p...,...,fox,False,True,,/r/pics/comments/40kxfw/new_addition_to_rule_5...,t3_40kxfw,test,False,,moderator
