# Overview
Subreddits are interesting but it's also cool to learn more about the people in them.  
- What do they like to do, 
- Where do they go on the web
- What subreddit's do they interact with  
In any group people are there for different reasons. Some are just learning, some are looking to hangout with similar people, and some are looking a chance to teach people. Plenty of other reasons too. These are more specific to /r/learnpython.
  
If we can group users together we can better understand where the learnpython beginners hang out and how that differs from some of the more experienced people.  
  
It could help you find more resources to learn about python, other cool subreddits, and give you a glimpse into the hobbies of other programmers.  
  

  
In this notebook, I'll go over some basic analysis on user accounts and give you some ideas for deeper work. 

## Ethical Considerations
So now that I want to show you how to analyze reddit accounts what is fair game? Which account(s) should I use examples for here?  
  
I did a search for post about privacy and found 2 redditors that indicated not caring about privacy on the internet and that they've accepted that using social media gives up their privacy. So to me these are fair game, if you are one of those 2 users and want to be removed from this example please feel free to message me.  
  
More on ethics.. Ethics are a funny concept. I think it's up to you to decide what is ethical. In my view ethics always favor the dominant parties in a group in control of the thought leadership. So I'm not the biggest fan of not doing things because someone says they are considered by some to be unethical. In this exercise I did attempt to follow status quo ethics but the choice is yours in your own research.  
  
I extend these my views on ethics to other things as well. If the API for something like Reddit doesn't work, there is no harm in accessing it by other means and extracting the information you want. Sites like reddit or Facebook use the internet but try as they might they don't control it. One of the favorite techniques used by these sites is obfuscated crap like that mentioned in this [article](https://css-tricks.com/how-facebook-avoids-ad-blockers/).   
 

# Lets Begin!
Just like with traversing "CommentTrees" in posts we'll need to have code to traverse all of a user's comments and posts.

In [1]:
import tqdm # Handy for showing progress on longer running jobs
from utils import * #Load the utilities we created in other notebooks

users=["BojanglesDeloria","Namisaur","Net_User","Hotwater3","ExtremelyBeige","Shannnnnnn"]
user=users[0]
user

'BojanglesDeloria'

In [2]:
redditor=reddit.redditor(user)

#NOTE These might be slow for redditors with big accounts, rather then make them a list, keeping them in "generator" form
#  may be a good idea
posts=[post for post in redditor.submissions.new()]
comments=[comment for comment in redditor.comments.new()]


In [3]:
import pandas as pd

In [4]:
rows=[]
for user in users:
    redditor=reddit.redditor(user)

    #NOTE These might be slow for redditors with big accounts, rather then make them a list, keeping them in "generator" form
    #  may be a good idea
    posts=[post for post in redditor.submissions.new()]
    comments=[comment for comment in redditor.comments.new()]
    for c in posts+comments:
        row={
            "subreddit_name": c.subreddit.display_name,
            "user": user
        }
        rows.append(row)
users_df=pd.DataFrame(rows)
users_df

Unnamed: 0,subreddit_name,user
0,Delco,BojanglesDeloria
1,modernwarfare,BojanglesDeloria
2,EscapefromTarkov,BojanglesDeloria
3,LivestreamFail,BojanglesDeloria
4,GearsOfWar,BojanglesDeloria
...,...,...
1040,ContagiousLaughter,Shannnnnnn
1041,Destiny,Shannnnnnn
1042,suspiciouslyspecific,Shannnnnnn
1043,unpopularopinion,Shannnnnnn


In [5]:
rows=[]
for user, user_df in users_df.groupby("user"):
    redditor=reddit.redditor(user)
    row={"user": user,
            "comment_karma":redditor.comment_karma,
            "post_karma": redditor.awardee_karma,
            "total_karma": redditor.total_karma,
             "cake_day": pd.to_datetime(redditor.created_utc*1e9),
        }
    for subreddit,count in user_df.groupby('subreddit_name').count().iterrows():
        row[f'interacted_in-{subreddit}']=count['user']#count names columns a bit different
    rows.append(row)
user_profile=pd.DataFrame(rows)
user_profile

Unnamed: 0,user,comment_karma,post_karma,total_karma,cake_day,interacted_in-AskReddit,interacted_in-AvoidingThePuddle,interacted_in-BlackPeopleTwitter,interacted_in-Blackops4,interacted_in-Cigarettes,...,interacted_in-suspiciouslyspecific,interacted_in-telescopes,interacted_in-trashy,interacted_in-u_Shannnnnnn,interacted_in-ultimaonline,interacted_in-valheim,interacted_in-wow,interacted_in-wowaddons,interacted_in-yesyesyesno,interacted_in-yugioh
0,BojanglesDeloria,30454,15,34957,2013-12-31 14:49:45,9.0,1.0,1.0,2.0,2.0,...,,,,,,,,,,
1,ExtremelyBeige,25578,88,32265,2018-02-20 01:08:33,16.0,,,,,...,,,,,,,,,,
2,Hotwater3,7908,15,12116,2014-01-02 00:26:01,,,,,,...,,,,,,,,,,
3,Namisaur,33651,329,37457,2014-05-27 17:40:41,2.0,,,,,...,,,,,,,,,,
4,Net_User,8127,61,11746,2013-04-10 01:53:53,26.0,,,,,...,,,,,,,,,,
5,Shannnnnnn,8176,549,15480,2017-08-09 11:10:29,,,,,,...,2.0,3.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0


## What do we do with this data?

Cool! We've rolled up our data so we can now compare it. Looks now make another dataframe that is focused on how many users share interest in a subreddit so we can begin to categorize it

In [6]:
interacted_in_metrics=user_profile.filter(regex="interacted_in").describe().T#transpose so we can more easily query on count
interacted_in_metrics[interacted_in_metrics['count']>1].sort_values(by=["count","mean"],ascending=[False,False])

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
interacted_in-unpopularopinion,4.0,23.5,19.139836,6.0,7.5,23.0,39.0,42.0
interacted_in-AskReddit,4.0,13.25,10.242884,2.0,7.25,12.5,18.5,26.0
interacted_in-PublicFreakout,3.0,12.0,13.856406,4.0,4.0,4.0,16.0,28.0
interacted_in-NoStupidQuestions,3.0,6.333333,5.507571,1.0,3.5,6.0,9.0,12.0
interacted_in-politics,3.0,2.666667,2.081666,1.0,1.5,2.0,3.5,5.0
interacted_in-television,3.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
interacted_in-Destiny,2.0,10.5,13.435029,1.0,5.75,10.5,15.25,20.0
interacted_in-LivestreamFail,2.0,8.5,10.606602,1.0,4.75,8.5,12.25,16.0
interacted_in-news,2.0,7.0,7.071068,2.0,4.5,7.0,9.5,12.0
interacted_in-TooAfraidToAsk,2.0,4.5,2.12132,3.0,3.75,4.5,5.25,6.0


In data science it's common to have data that comes back in terms of a list. If we wanted to compare users using subreddit activity data we would have a mess on our hands. In general you have to take this data and transform it into user attributes.  
  
Think about this for a little. Knowing a user is active in "r/cocaine" is interesting but it might be more informativce to create categories from this data. Things like US illegal drug interest (cocaine, opium, peyote, benzo, etc) and drug interest (trees, cigarettes, as well as anything in the US illegal drug interest category).  
  
By creating a category we will hit on underlying user behavior, a disregard for conventional laws in the US, modeled by interest in certain categories.   

In [9]:
#Lets add categories for all of these subreddits!
interacted_in_metrics[interacted_in_metrics['count']>1].index

Index(['interacted_in-AskReddit', 'interacted_in-Destiny',
       'interacted_in-LivestreamFail', 'interacted_in-StarWars',
       'interacted_in-Tinder', 'interacted_in-cyberpunkgame',
       'interacted_in-movies', 'interacted_in-news',
       'interacted_in-nextfuckinglevel', 'interacted_in-politics',
       'interacted_in-television', 'interacted_in-AmItheAsshole',
       'interacted_in-Coronavirus', 'interacted_in-NoStupidQuestions',
       'interacted_in-PublicFreakout', 'interacted_in-Showerthoughts',
       'interacted_in-TooAfraidToAsk', 'interacted_in-unpopularopinion',
       'interacted_in-Conservative', 'interacted_in-legaladvice',
       'interacted_in-AskDocs', 'interacted_in-anime', 'interacted_in-gaming'],
      dtype='object')

In [10]:
#Quick list comprehension to get the names we need
to_categorize=[x.replace("interacted_in-","") for x in interacted_in_metrics[interacted_in_metrics['count']>1].index]

In [11]:
to_categorize=['AskReddit',
 'Destiny',
 'LivestreamFail',
 'StarWars',
 'Tinder',
 'cyberpunkgame',
 'movies',
 'news',
 'nextfuckinglevel',
 'politics',
 'television',
 'AmItheAsshole',
 'Coronavirus',
 'NoStupidQuestions',
 'PublicFreakout',
 'Showerthoughts',
 'TooAfraidToAsk',
 'unpopularopinion',
 'Conservative',
 'legaladvice',
 'AskDocs',
 'anime',
 'gaming']

In [12]:
interest_mappings={
    "gaming": ["games","pcgaming","modern_warfare","Destiny","cyberpunkgame","gaming"],
    "sci-fi": ["StarWars"],
    "etnertainment": ["movies","television","StarWars"],
    "shooters": ["modern_warfare","GearsOfWar","BlackOps4"],
    "outdoor_activity": ["skateboarding","running"],#ADD more to categories as you see them!
    "us_illegal_drugs": ["cocaine","benzodiazepines"],
    "programming": ["Terraform"],
    "anime": ["anime"],
    "asian_culture": ["anime'"],#ASIA IS HUGE AND DIVERSE I'M SORRY FOR LUMPING ALL YALL IN ONE BUCKET RIGHT NOW
    "investing": [],
    "real_estate": [],
    "apple_products": [],
    "judgemental": ["AmItheAsshole", "LivestreamFail","PublicFreakout"],
    "childfree": ["Vasectomy"],
    "conservative": ["AdamCarolla",#could be good for age discrimination (YES THIS CODE IS FOR DISCRIMINATION?!?!?!?!!!!)
                    "Conservative", 
                    ],
    "confessions": ["TooAfraidToAsk","unpopularopinion","NoStupidQuestions","Showerthoughts"],
    "medicine": ["AskDocs"],
    "current_events": ["Coronavirus","politics","news",],
    "self-help": ['AskDocs',"legaladvice",],
    "dating": ["Marriage","Tinder"],
    "bbq": ["smoking"], 
    "reddit_default": ["AskReddit"],
    "funny": ["LivestreamFail","PublicFreakout"],
    "adrenaline": ["nextfuckinglevel"],
}
interest_mappings['drugs']=interest_mappings['us_illegal_drugs']+[
    "Cigarettes","trees"
]
#so we can easily see what we haven't done yet
not_done_yet=set(to_categorize)-set([x for l in interest_mappings.values() for x in l ])
not_done_yet

set()

In [14]:
#function to check if a a subreddit matches of list of subreddits
def check_interests(collection, matches, match_rule="ignore_case"):
    #match_rule exists in case you want to extend this to checking for common phrases used in comment or post body text.
    if match_rule=="ignore_case":
        count=len([c for c in collection if any(c.casefold()==m.casefold() for m in matches)])
    else:
        raise Exception("Unsupported match_rule")
    return count
row={}
for category, matches in interest_mappings.items():
    row["interest_in-"+category]=check_interests(user_df['subreddit_name'], matches)
row

{'interest_in-gaming': 23,
 'interest_in-sci-fi': 0,
 'interest_in-etnertainment': 0,
 'interest_in-shooters': 0,
 'interest_in-outdoor_activity': 0,
 'interest_in-us_illegal_drugs': 0,
 'interest_in-programming': 2,
 'interest_in-anime': 1,
 'interest_in-asian_culture': 0,
 'interest_in-investing': 0,
 'interest_in-real_estate': 0,
 'interest_in-apple_products': 0,
 'interest_in-judgemental': 29,
 'interest_in-childfree': 0,
 'interest_in-conservative': 8,
 'interest_in-confessions': 8,
 'interest_in-medicine': 1,
 'interest_in-current_events': 0,
 'interest_in-self-help': 1,
 'interest_in-dating': 0,
 'interest_in-bbq': 0,
 'interest_in-reddit_default': 0,
 'interest_in-funny': 29,
 'interest_in-adrenaline': 0,
 'interest_in-drugs': 0}

In [15]:
#add concepts
rows=[]
for user, user_df in users_df.groupby("user"):
    redditor=reddit.redditor(user)
    row={"user": user,
            "comment_karma":redditor.comment_karma,
            "post_karma": redditor.awardee_karma,
            "total_karma": redditor.total_karma,
             "total_interactions": [len(user_df)],
             "cake_day": pd.to_datetime(redditor.created_utc*1e9),
        }
    for category, matches in interest_mappings.items():
        row["interest_in-"+category]=check_interests(user_df['subreddit_name'], matches)
    for subreddit,count in user_df.groupby('subreddit_name').count().iterrows():
        row[f'interacted_in-{subreddit}']=count['user']#count names columns a bit different
    rows.append(row)
user_profile=pd.DataFrame(rows).replace(0, pd.np.nan)
user_profile

  user_profile=pd.DataFrame(rows).replace(0, pd.np.nan)


Unnamed: 0,user,comment_karma,post_karma,total_karma,total_interactions,cake_day,interest_in-gaming,interest_in-sci-fi,interest_in-etnertainment,interest_in-shooters,...,interacted_in-suspiciouslyspecific,interacted_in-telescopes,interacted_in-trashy,interacted_in-u_Shannnnnnn,interacted_in-ultimaonline,interacted_in-valheim,interacted_in-wow,interacted_in-wowaddons,interacted_in-yesyesyesno,interacted_in-yugioh
0,BojanglesDeloria,30454,15,34957,[200],2013-12-31 14:49:45,8.0,1.0,4.0,3.0,...,,,,,,,,,,
1,ExtremelyBeige,25578,88,32265,[120],2018-02-20 01:08:33,,,1.0,,...,,,,,,,,,,
2,Hotwater3,7908,15,12116,[200],2014-01-02 00:26:01,,,3.0,,...,,,,,,,,,,
3,Namisaur,33651,329,37457,[125],2014-05-27 17:40:41,1.0,,,,...,,,,,,,,,,
4,Net_User,8127,61,11746,[200],2013-04-10 01:53:53,,1.0,1.0,,...,,,,,,,,,,
5,Shannnnnnn,8176,549,15480,[200],2017-08-09 11:10:29,23.0,,,,...,2.0,3.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0


In [16]:
interest_metrics=user_profile.filter(regex="interest_in").describe().T#transpose so we can more easily query on count
interest_metrics[interest_metrics['count']>1].sort_values(by=["count","mean"],ascending=[False,False])

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
interest_in-judgemental,5.0,11.4,11.148991,3.0,4.0,5.0,16.0,29.0
interest_in-confessions,4.0,31.0,25.126347,8.0,10.25,29.5,50.25,57.0
interest_in-reddit_default,4.0,13.25,10.242884,2.0,7.25,12.5,18.5,26.0
interest_in-funny,4.0,13.25,11.92686,4.0,4.0,10.0,19.25,29.0
interest_in-etnertainment,4.0,2.25,1.5,1.0,1.0,2.0,3.25,4.0
interest_in-self-help,4.0,1.25,0.5,1.0,1.0,1.0,1.25,2.0
interest_in-gaming,3.0,10.666667,11.23981,1.0,4.5,8.0,15.5,23.0
interest_in-current_events,3.0,8.0,5.0,3.0,5.5,8.0,10.5,13.0
interest_in-dating,3.0,3.333333,1.154701,2.0,3.0,4.0,4.0,4.0
interest_in-conservative,2.0,5.0,4.242641,2.0,3.5,5.0,6.5,8.0


# Wrapping Up
We've gone through collecting interest data of reddit users. You are now equpped to do your own analysis using the data we've collected and categorized. In the next tutorial I'll go over comparing the interests of a few different subreddits, moving ones like NYC and austin, as well as conservative vs liberal.  
  
If you're itching to get going on your own by all means please do. I've got a lot of project ideas listed below.

# Further Work  
At the end of the day information is only as good as what you can use it for.   
Think before you embark on a big programming task. What will this new data piece enable to you to do and is it worth the effort? That being said, here's some ideas I had:    
## Analysis  
- Inferring age from interests
    - On reddit some people may post their age in common phrases like "I'm only 23" or "Us millenials" etc. You can create a model that takes their attributes and predicts age using those that stated their age bands as "training data".   
- Inferring other things
    - Once a user has stated something, I'm from Chicago, I like surfing and do it every day, etc. you can do they same modeling exercise. Come up with a likelihood model that someone is also from Chicago.
- Should I move to "XXX"
    - Analyze followers of various city/country subreddits to see how likely you are to have similar interests to redditors in the place you are considering. 
    - Apply this to any commonalities you might be interested in seeing. Comapre your account to the subreddit's typical users.
- Deviation from the  norm
    - People on reddit have one major thing in common, they've chosen to user reddit. This is very different from people who have not. It also means they are on the internet, may have relatively stable internet access, among other things.  
    - In doing analysis you may want to account for how a user differs from the average reddit user, I'd reckon the average reddit user values formal education and liberal idealogy more than the real world. If you are trying to make sense of people as a whole you may want to temper your expectations a bit in terms of how prevalent certain leaning sites and dogmas may be
  
Another type of analysis commonly done on users is cluster analysis. What types of users make up this subreddit? There isn't necessarily one typical user, just a most common. Many subgroups and cliques may follow a certain subreddit. Starwars followers would be a good place to see the difference between old and young users.   
  
## Cool Project Idea
Tie in some of the analysis from before and create a reddit plugin (can be browser based) that gives you details of the reddit account you are reading. Show their karma, whether they comment in posts they create, where they are from, their estimated age, etc.  