# Overview
Subreddits are interesting, but it's also cool to learn more about the people in them:  
- What do they like to do?
- Where do they go on the web?
- What subreddits do they interact with?

In any group, people are there for different reasons. Some are just learning, some are looking to hang out with similar people, and some are looking for a chance to teach people. There are plenty of other reasons too, but these are the ones more specific to /r/learnpython.
  
If we can group users together we can better understand where the learnpython beginners hang out and how that differs from some of the more experienced people.  
  
It could help you find more resources for learning Python, discover other cool subreddits, and get a glimpse into the hobbies of other programmers.  
  

  
In this notebook, I'll go over some basic analysis on user accounts and give you some ideas for deeper work. 

## Ethical Considerations
So now that I want to show you how to analyze reddit accounts, what is fair game? Which account(s) should I use as examples here?  
  
I did a search for posts about privacy and found 2 redditors who indicated that they don't care about privacy on the internet and have accepted that using social media means giving up their privacy. So to me these are fair game. If you are one of those 2 users and want to be removed from this example, please feel free to message me.  
  
More on ethics: ethics are a funny concept. I think it's up to you to decide what is ethical. In my view, ethics always favor the dominant parties in a group, the ones in control of the thought leadership. So I'm not the biggest fan of avoiding things just because some people consider them unethical. In this exercise I did attempt to follow status quo ethics, but the choice is yours in your own research.  
  
I extend these views on ethics to other things as well. If the API for something like Reddit doesn't work, there is no harm in accessing it by other means and extracting the information you want. Sites like reddit or Facebook use the internet but, try as they might, they don't control it. One of the favorite techniques used by these sites is obfuscated crap like that described in this [article](https://css-tricks.com/how-facebook-avoids-ad-blockers/).   
 

# Let's Begin!
Just like with traversing "CommentTrees" in posts, we'll need code to traverse all of a user's comments and posts.

In [10]:
import tqdm # Handy for showing progress on longer running jobs
from utils import * #Load the utilities we created in other notebooks

users=["BojanglesDeloria","Namisaur","Net_User","Hotwater3","ExtremelyBeige","Shannnnnnn"]
user=users[0]
user

'BojanglesDeloria'

In [11]:
redditor=reddit.redditor(user)

#NOTE These might be slow for redditors with big accounts; rather than making them lists, keeping them in "generator" form
#  may be a good idea
posts=[post for post in redditor.submissions.new()]
comments=[comment for comment in redditor.comments.new()]
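
The "generator form" mentioned in the note above could look roughly like this. It is only a sketch: `subreddit_names` and `fake_listing` are hypothetical helpers, and `fake_listing` stands in for a real listing such as `redditor.comments.new()` so the example runs without a Reddit connection.

```python
from itertools import islice
from types import SimpleNamespace


def subreddit_names(items, max_items=None):
    """Lazily yield the subreddit name of each post or comment.

    `items` can be any iterable, including a generator, so nothing is
    fetched or held in memory until it is actually needed.
    """
    for item in islice(items, max_items):
        yield item.subreddit.display_name


def fake_listing(names):
    """Hypothetical stand-in for a listing of posts or comments."""
    for name in names:
        yield SimpleNamespace(subreddit=SimpleNamespace(display_name=name))


listing = fake_listing(["Delco", "modernwarfare", "GearsOfWar", "LivestreamFail"])

# Only the first two items are pulled from the generator.
first_two = list(subreddit_names(listing, max_items=2))
print(first_two)
```

Passing `max_items=None` walks the whole listing, which is the same work as the list comprehensions above but without building the intermediate list.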


In [12]:
import pandas as pd

In [13]:
rows=[]
for c in posts+comments:
    row={
        "subreddit_name": c.subreddit.display_name
    }
    rows.append(row)
user_df=pd.DataFrame(rows)
user_df

Unnamed: 0,subreddit_name
0,Delco
1,modernwarfare
2,EscapefromTarkov
3,LivestreamFail
4,GearsOfWar
...,...
195,LivestreamFail
196,LivestreamFail
197,LivestreamFail
198,LivestreamFail


## What do we do with this data?
In data science it's common to have data that comes back as a list. If we wanted to compare users using raw subreddit activity data, we would have a mess on our hands. In general you have to take this data and transform it into user attributes.  
  
Think about this for a little. Knowing a user is active in "r/cocaine" is interesting, but it might be more informative to create categories from this data: things like US illegal drug interest (cocaine, opium, peyote, benzo, etc.) and drug interest (trees, cigarettes, as well as anything in the US illegal drug interest category).  
  
By creating a category we hit on underlying user behavior (here, a disregard for conventional laws in the US), modeled by interest in certain categories.   
  


In [14]:
interest_mappings={
    "gaming": ["games","pcgaming","modern_warfare"],
    "shooters": ["modern_warfare","GearsOfWar","BlackOps4"],
    "outdoor_activity": ["skateboarding","running"],#ADD more to categories as you see them!
    "us_illegal_drugs": ["cocaine","benzodiazepines"],
}
interest_mappings['drugs']=interest_mappings['us_illegal_drugs']+[
    "Cigarettes","trees"
]
interest_mappings

{'gaming': ['games', 'pcgaming', 'modern_warfare'],
 'shooters': ['modern_warfare', 'GearsOfWar', 'BlackOps4'],
 'outdoor_activity': ['skateboarding', 'running'],
 'us_illegal_drugs': ['cocaine', 'benzodiazepines'],
 'drugs': ['cocaine', 'benzodiazepines', 'Cigarettes', 'trees']}

Not too shabby as a POC, but in actual coding there are better ways of storing and accessing that data; that "nested" category for us_illegal_drugs isn't very readable. I would opt for something like a tree where each "node" outputs a category and counts all of its children, similar to the comment forest in tutorial 1.  
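
A tree along those lines might be sketched like this. `CategoryNode` is a hypothetical name, and the matching rule (case-insensitive equality) is assumed to mirror the one used elsewhere in this notebook; treat it as one possible layout rather than the way to do it.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryNode:
    """One interest category; its count includes every child category."""
    name: str
    subreddits: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def all_subreddits(self):
        """Return casefolded subreddit names for this node and its descendants."""
        names = {s.casefold() for s in self.subreddits}
        for child in self.children:
            names |= child.all_subreddits()
        return names

    def count(self, collection):
        """Count how many entries in `collection` fall under this category."""
        names = self.all_subreddits()
        return sum(1 for c in collection if c.casefold() in names)


# "drugs" owns its own subreddits and inherits everything under "us_illegal_drugs".
illegal = CategoryNode("us_illegal_drugs", ["cocaine", "benzodiazepines"])
drugs = CategoryNode("drugs", ["Cigarettes", "trees"], children=[illegal])

activity = ["trees", "Cocaine", "running", "trees"]
print(illegal.count(activity))  # 1
print(drugs.count(activity))    # 3
```

The parent no longer needs a copy of the child's list, so adding a subreddit to `us_illegal_drugs` automatically shows up in `drugs` as well.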
  
Anyways, let's move on

In [22]:
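def check_interests(collection, matches, match_rule="ignore_case"):
    """Count the entries in collection that equal any of the names in matches."""
    if match_rule=="ignore_case":
        # casefold() makes the comparison case-insensitive
        count=len([c for c in collection if any(c.casefold()==m.casefold() for m in matches)])
    else:
        raise ValueError("Unsupported match_rule")
    return count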
row={}
for category, matches in interest_mappings.items():
    row["interest_in-"+category]=check_interests(user_df['subreddit_name'], matches)
row

{'interest_in-gaming': 0,
 'interest_in-shooters': 0,
 'interest_in-outdoor_activity': 0,
 'interest_in-us_illegal_drugs': 0,
 'interest_in-drugs': 0}
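
When every count comes back as zero, it helps to see which subreddits are not covered by any category yet. The helper below is a hypothetical sketch of that check, run on made-up example data rather than on the real `user_df`.

```python
from collections import Counter


def unmatched_subreddits(collection, mappings):
    """Count the subreddits in `collection` that no category covers."""
    known = {m.casefold() for matches in mappings.values() for m in matches}
    return Counter(c for c in collection if c.casefold() not in known)


# Made-up example data, not taken from a real account.
example_mappings = {
    "gaming": ["games", "pcgaming"],
    "shooters": ["GearsOfWar"],
}
example_activity = ["Destiny", "gearsofwar", "Destiny", "java"]

leftovers = unmatched_subreddits(example_activity, example_mappings)
print(leftovers.most_common())  # [('Destiny', 2), ('java', 1)]
```

The most common leftovers are good candidates for new entries in the category mapping.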

In [23]:
user_df['subreddit_name'].value_counts()

Destiny                 30
PublicFreakout          22
BABYMETAL               16
unpopularopinion        10
newworldgame             8
AZURE                    8
Unexpected               7
discordapp               5
CoOpGaming               4
instantkarma             4
java                     4
PowerShell               4
valheim                  3
netflix                  3
blackmagicfuckery        3
telescopes               3
GMail                    3
dayz                     2
LivestreamFail           2
Steam                    2
FateWinxSaga             2
cyberpunkgame            2
FF06B5                   2
gifsthatkeepongiving     2
Terraform                2
Reol                     2
suspiciouslyspecific     2
learnprogramming         2
SuperMarioOdyssey        1
wowaddons                1
gog                      1
androidapps              1
AskDocs                  1
outside                  1
Twitch                   1
asiangirlsbeingcute      1
cpop                     1
...

In [17]:
user

'BojanglesDeloria'

In [18]:
#Now compare multiple users who don't care about privacy
rows=[]
for user in users:
    redditor=reddit.redditor(user)

    #NOTE These might be slow for redditors with big accounts; rather than making them lists, keeping them in "generator" form
    #  may be a good idea
    posts=[post for post in redditor.submissions.new()]
    comments=[comment for comment in redditor.comments.new()]
    for c in posts+comments:
        row={
            "subreddit_name": c.subreddit.display_name,
            "user": user
        }
        rows.append(row)
users_df=pd.DataFrame(rows)
users_df

Unnamed: 0,subreddit_name,user
0,Delco,BojanglesDeloria
1,modernwarfare,BojanglesDeloria
2,EscapefromTarkov,BojanglesDeloria
3,LivestreamFail,BojanglesDeloria
4,GearsOfWar,BojanglesDeloria
...,...,...
1039,PublicFreakout,Shannnnnnn
1040,PublicFreakout,Shannnnnnn
1041,unpopularopinion,Shannnnnnn
1042,Destiny,Shannnnnnn


In [19]:
rows=[]
for user, user_df in users_df.groupby("user"):
    redditor=reddit.redditor(user)
    row={
        "user": user,
        "comment_karma": redditor.comment_karma,
        # NOTE: awardee_karma is karma from awards received, so that is what this
        #  column reports; redditor.link_karma is the attribute that holds post karma
        "post_karma": redditor.awardee_karma,
        "total_karma": redditor.total_karma,
        # created_utc is a Unix timestamp in seconds
        "cake_day": pd.to_datetime(redditor.created_utc, unit="s"),
    }
    for category, matches in interest_mappings.items():
        row["interest_in-"+category]=check_interests(user_df['subreddit_name'], matches)
    rows.append(row)
user_profile=pd.DataFrame(rows)
user_profile

Unnamed: 0,user,comment_karma,post_karma,total_karma,cake_day,interest_in-gaming,interest_in-shooters,interest_in-outdoor_activity,interest_in-us_illegal_drugs,interest_in-drugs
0,BojanglesDeloria,30304,15,34807,2013-12-31 14:49:45,5,3,1,9,21
1,ExtremelyBeige,25578,88,32265,2018-02-20 01:08:33,0,0,0,0,0
2,Hotwater3,7818,15,12026,2014-01-02 00:26:01,0,0,0,0,0
3,Namisaur,33407,329,37213,2014-05-27 17:40:41,0,0,0,0,0
4,Net_User,8127,61,11747,2013-04-10 01:53:53,0,0,0,0,0
5,Shannnnnnn,8162,549,15463,2017-08-09 11:10:29,0,0,0,0,0


Nothing in common! We need to come up with a few more categories and consider extending our analysis to include comment and post text.
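
One way to come up with new categories is to look for subreddits that more than one user is active in. This is a rough sketch with a hypothetical helper and made-up rows; the rows use the same two keys as `users_df`, so something like `users_df.to_dict("records")` should produce a compatible input.

```python
from collections import defaultdict


def shared_subreddits(rows, min_users=2):
    """Map each subreddit to its users, keeping those with at least `min_users`."""
    users_by_subreddit = defaultdict(set)
    for row in rows:
        users_by_subreddit[row["subreddit_name"].casefold()].add(row["user"])
    return {
        name: sorted(users)
        for name, users in users_by_subreddit.items()
        if len(users) >= min_users
    }


# Made-up rows with placeholder user names.
example_rows = [
    {"subreddit_name": "unpopularopinion", "user": "user_a"},
    {"subreddit_name": "unpopularopinion", "user": "user_b"},
    {"subreddit_name": "Destiny", "user": "user_a"},
    {"subreddit_name": "destiny", "user": "user_a"},
]

shared = shared_subreddits(example_rows)
print(shared)  # {'unpopularopinion': ['user_a', 'user_b']}
```

Repeated activity by the same user in one subreddit counts once, because users are collected in a set.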

In [20]:
interests_mapping={
    "programming": ["Terraform"],
    "anime": [],
    "asian_culture": [],#ASIA IS HUGE AND DIVERSE I'M SORRY FOR LUMPING ALL YALL IN ONE BUCKET RIGHT NOW
    "investing": [],
    "real_estate": [],
    "apple_products": [],
    "judgemental": ["AmItheAsshole"],
    "childfree": ["Vasectomy"],
    "conservative": ["AdamCarolla"],#could be good for age discrimination (YES THIS CODE IS FOR DISCRIMINATION?!?!?!?!!!!)
    "bbq": ["smoking"],
    #add overwatch
}

In [21]:
pd.options.display.max_rows=99
users_df[users_df['user']=='Hotwater3']['subreddit_name'].value_counts()

unpopularopinion        42
VinylReleases           26
saltierthancrait        19
marvelstudios           13
jobs                    13
Parenting                8
AskMenOver30             7
NoStupidQuestions        6
quit_vaping              5
Marriage                 4
applehelp                3
Blink182                 3
recruitinghell           3
sysadmin                 3
AskScienceFiction        3
boxoffice                3
LockdownSkepticism       2
movies                   2
spotify                  2
HomeImprovement          2
electronic_cigarette     2
mxpx                     2
personalfinance          2
RealEstate               2
povertyfinance           2
Piracy                   1
Atlanta                  1
suits                    1
television               1
landscaping              1
TikTokCringe             1
AdamCarolla              1
hotones                  1
smoking                  1
youtubetv                1
Conservative             1
alexa                    1
...

## Let's also add in sites users support
Just like subreddits, we can look up the base domains of the links redditors post. We may also be able to use this to find common ideology. People who like veganism may link to the same resources, like a site on animal cruelty or a documentary. By analyzing the links users post we should be able to find that, and then connect them to others posting the same link even if they don't follow veganism.     
  
Keep in mind a link isn't always an endorsement. Someone may post something to make fun of it. Adding that [natural language processing/NLP](https://en.wikipedia.org/wiki/Natural_language_processing) understanding to the code is beyond the scope of this tutorial but I'm happy to go into that more in a different tutorial. 
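
A minimal sketch of the domain lookup, using only the standard library. It assumes you have already collected the link URLs as plain strings; `base_domain` and `domain_counts` are hypothetical helper names, and the URLs are placeholders.

```python
from collections import Counter
from urllib.parse import urlparse


def base_domain(url):
    """Return the host of `url`, lowercased and without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls):
    """Count how often each domain appears, skipping strings with no host."""
    return Counter(d for d in map(base_domain, urls) if d)


# Placeholder URLs for illustration only.
example_urls = [
    "https://www.example.com/a",
    "http://example.com/b?x=1",
    "https://blog.example.org/post",
    "not a url",
]

counts = domain_counts(example_urls)
print(counts)  # Counter({'example.com': 2, 'blog.example.org': 1})
```

Note that this keeps subdomains such as `blog.` as they are. Collapsing them down to the registered domain needs a public suffix list, which is outside this sketch.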

# Further Work  
At the end of the day information is only as good as what you can use it for.   
Think before you embark on a big programming task. What will this new piece of data enable you to do, and is it worth the effort? That being said, here are some ideas I had:    
## Analysis  
- Inferring age from interests
    - On reddit some people post their age in common phrases like "I'm only 23" or "Us millennials", etc. You can create a model that takes users' attributes and predicts age, using those who stated their age or age band as "training data".   
- Inferring other things
    - Once some users have stated something ("I'm from Chicago", "I like surfing and do it every day", etc.) you can do the same modeling exercise. Come up with a model of how likely it is that someone else is also from Chicago.
- Should I move to "XXX"
    - Analyze followers of various city/country subreddits to see how likely you are to have similar interests to redditors in the place you are considering. 
    - Apply this to any commonalities you might be interested in seeing. Compare your account to the subreddit's typical users.
- Deviation from the norm
    - People on reddit have one major thing in common: they've chosen to use reddit. This makes them very different from people who have not. It also means they are on the internet and may have relatively stable internet access, among other things.  
    - In doing analysis you may want to account for how a user differs from the average reddit user. I'd reckon the average reddit user values formal education and liberal ideology more than the real world does. If you are trying to make sense of people as a whole, you may want to temper your expectations a bit in terms of how prevalent sites and dogmas with a certain leaning may be.
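
For the age idea above, the labelling step could start with a simple pattern match. This is an illustrative sketch only: the pattern is an assumption, it covers just the "I'm 23" / "I am only 23" style of phrasing, and it will produce false positives on text like "I'm 50 miles away".

```python
import re

# Illustrative pattern; real comments phrase ages in many more ways.
AGE_PATTERN = re.compile(r"\bI(?:'m| am)(?: only)? (\d{2})\b", re.IGNORECASE)


def stated_ages(comments):
    """Return the first age stated in each comment, or None when absent."""
    ages = []
    for text in comments:
        match = AGE_PATTERN.search(text)
        ages.append(int(match.group(1)) if match else None)
    return ages


# Made-up comment text.
examples = [
    "I'm only 23 and still learning",
    "i am 41, started late",
    "No age here",
    "I'm 100% sure about this",
]

ages = stated_ages(examples)
print(ages)  # [23, 41, None, None]
```

Users with a non-`None` age become the "training data"; everyone else is who the model would predict for.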
  
Another type of analysis commonly done on users is cluster analysis. What types of users make up this subreddit? There isn't necessarily one typical user, just a most common one. Many subgroups and cliques may follow a certain subreddit. Star Wars followers would be a good place to see the difference between old and young users.   
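
Most clustering methods need some measure of how similar two users are. The sketch below shows one common choice, cosine similarity over subreddit counts. It is a building block rather than a full cluster analysis, and the counts are made up for illustration.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two subreddit-count Counters (0.0 to 1.0)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Made-up activity counts for three placeholder users.
user_a = Counter({"Destiny": 3, "java": 1})
user_b = Counter({"Destiny": 6, "java": 2})
user_c = Counter({"VinylReleases": 2})

sim_ab = cosine_similarity(user_a, user_b)  # same proportions, so close to 1.0
sim_ac = cosine_similarity(user_a, user_c)  # no shared subreddits, so 0.0
print(sim_ab, sim_ac)
```

Because the measure depends on proportions rather than raw counts, a very active user and a casual user with the same tastes still come out as similar. The same function can compare your own account against the combined counts of a subreddit's users.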
  
## Cool Project Idea
Tie in some of the analysis from before and create a reddit plugin (can be browser based) that gives you details of the reddit account you are reading. Show their karma, whether they comment in posts they create, where they are from, their estimated age, etc.  