# Week 1

## Overview

As explained in the [*Before week 1* notebook](https://nbviewer.jupyter.org/github/lalessan/comsocsci2021/blob/master/lectures/Before_week_1.ipynb), each week of this class is a Jupyter notebook like this one. **_In order to follow the class, you simply start reading from the top_**, following the instructions.

**Hint**: And you can ask me for help at any point if you get stuck!

## Today

This first lecture will go over a few different topics to get you started:

* First, we will learn about Computational Social Science.
* Second, we will talk a bit about APIs and how they work.
* Third, we'll use an API to download Reddit data from the _r/wallstreetbets_ subreddit.



## Part 1: Computational Social Science


But _what is Computational Social Science_? Watch the video below, where I give a short introduction to the topic.


> **_Video lecture_**: Watch the video below about Computational Social Science

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo("qoPk_C3buD8",width=600, height=337.5)


Now that you have learnt what Computational Social Science is, read about the advantages and challenges of using _"Big Data"_ for Social Science Research in Sections 2.1 to 2.3 of the book Bit by Bit.

> _Reading_: [Bit by Bit, sections 2.1 to 2.3](https://www.bitbybitbook.com/en/1st-ed/observing-behavior/observing-intro/) Read sections 2.1 and 2.2, then skim through section 2.3. The idea is for you to understand, in general terms, the advantages and challenges of large observational datasets (a.k.a. Big Data) for social studies.

> *Exercise 1*: This year, lockdowns have helped governments contain the pandemic. But they also negatively impacted our wellbeing. Imagine you had to study the following question: "_What are some of the strategies people adopt to preserve their mental and physical wellbeing during lockdown?_"


> * Write in a couple of lines: 
>> * Which data would you collect to study this topic? 
>>> * Ideally, observational data from a camera watching people in their homes, but this cannot be done for ethical and legal reasons. 
>>> * Big data such as internet traces could be used to see whether people look up strategies on the internet and which strategies they might prefer.
>> * How would you collect it?
> * Describe the data you would need in more detail (also by writing down a couple of lines): 
>> * How big is the data (number of users/number of data points)? 
>> * Which variables does it contain? 


## Part 2: Using APIs to download Reddit data

But what is an API? Find the answer in the short video below, where we get familiar with APIs to access Reddit data. 


> **_Video lecture_**: Watch the video below about the Reddit API

In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo("eqBIFua00O4",width=600, height=337.5)


In [3]:
# !pip install psaw
# !pip install pandas

import pandas as pd
import numpy as np
import datetime

In [4]:
from psaw import PushshiftAPI

api = PushshiftAPI()

It's time for you to get to work. Take a look at the two texts below - just to get a sense of a more technical description of how the Pushshift API works.


> _Reading_ (just skim): [New to Pushshift? Read this! FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)  
> _Reading_ (just skim): [Pushshift Github Repository](https://github.com/pushshift/api)
> 
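
To make the idea of an API call concrete, here is a minimal sketch of how a single Pushshift search request could be assembled by hand. It assumes the submission-search endpoint and the `subreddit`, `q`, `after`, `before` and `size` parameters described in the Pushshift repository linked above; `build_query_url` is a hypothetical helper written for illustration, and the code only builds the URL without sending any request.

```python
from urllib.parse import urlencode

# Assumption: the submission-search endpoint documented in the Pushshift repository
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query_url(subreddit, query, after, before, size=100):
    """Hypothetical helper: return the URL of one Pushshift submission search."""
    params = {
        "subreddit": subreddit,  # subreddit name, without the "r/" prefix
        "q": query,              # search terms; "|" separates alternatives
        "after": after,          # start of the period (Unix timestamp)
        "before": before,        # end of the period (Unix timestamp)
        "size": size,            # number of results requested
    }
    return BASE_URL + "?" + urlencode(params)


# 2020-01-01 and 2021-01-25 at midnight UTC, expressed as Unix timestamps
url = build_query_url("wallstreetbets", "GME|Gamestop", 1577836800, 1611532800)
print(url)
```

Wrappers such as psaw take care of building and sending requests of this kind, so in the exercises you only pass the parameters.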

## Prelude to part 3: Pandas Dataframes


Before starting, we will also learn a bit about [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), a very user-friendly data structure that you can use to manipulate tabular data. Pandas dataframes are implemented within the [pandas package](https://pandas.pydata.org/).

Pandas dataframes should be intuitive to use. **I suggest you go through the [10 minutes to Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) to learn what you need to solve the next exercise.**
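
As a quick taste of what the tutorial covers, the sketch below builds a small DataFrame and applies three common operations: filtering rows, aggregating by group, and adding a column. The data and column names are invented purely for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "author": ["ann", "bob", "ann", "cat"],
    "score": [3, 10, 7, 1],
})

# Keep only the rows whose score is above 2 (boolean filtering)
high = df[df["score"] > 2]

# Sum the scores of each author (group-by aggregation)
per_author = df.groupby("author")["score"].sum()

# Add a new column computed from an existing one
df["score_doubled"] = df["score"] * 2

print(high)
print(per_author)
print(df)
```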

## Part 3: Getting data from the _r/wallstreetbets_ subreddit

There has been a lot of interest in the social platform Reddit this week, after investors from the [_r/wallstreetbets_](https://www.reddit.com/r/wallstreetbets/) subreddit managed to [give a huge boost](https://www.google.com/search?q=GME+price&oq=GME+price&aqs=chrome..69i57.1261j0j4&sourceid=chrome&ie=UTF-8) to the shares of the video game retailer GameStop (traded as "_GME_"), causing massive losses to professional investors and established hedge funds.

There is so much buzz about _Gamestop_ because it is really something unprecedented! Online discussions about stocks on social media have fuelled massive price moves that cannot be explained by traditional valuation metrics and can seriously destabilize the established market. Many ordinary investors on Reddit have coordinated to buy shares of a stock that had been losing value for a long time. __But how did this all happen?__ 


Today and in the following classes, we will try to answer precisely this question, by studying the social network of Redditors of _r/wallstreetbets_ throughout last year. 

The starting point will be to understand how to download data from Reddit using APIs. But before we start getting our hands dirty, if you feel like you don't know much about Gamestop, I suggest watching this short video summarizing the latest events. If you already know everything about it, feel free to skip it. 

> 
> **_Video_**: [Stocks explained: What's going on with GameStop?](https://www.bbc.com/news/av/technology-55864312)
> 

> *Exercise 2*: __Download submissions of the [_r/wallstreetbets_](https://www.reddit.com/r/wallstreetbets/) subreddit using the [Pushshift API](https://github.com/pushshift/api)__
> 1. Use the [psaw Python library](https://pypi.org/project/psaw/) (a wrapper for the Pushshift API) to find all the submissions in the subreddit _r/wallstreetbets_ related to either "_GME_" or "_Gamestop_" (**Hint**: Use the [``q``](https://github.com/pushshift/api) parameter to search text. To search for multiple words, separate them with the character "|"). Focus on the period __between Jan 1st, 2020 and Jan 25th, 2021__, where time must be provided as a [Unix Timestamp](https://www.unixtimestamp.com/). _Note: The Pushshift API returns at most 100 results per query, so you may need to divide your entire time period into small enough sub-periods._ 
> 2. For each submission, find the following information: __title, id, score, date of creation, author, and number of comments__ (**Hint**: access the dictionary with all attributes by typing ``my_submission.d_``). Store this data in a pandas DataFrame and save it to a file. (Downloading took me 30 minutes using two cores. While you wait for the results, you can start thinking about _Exercise 3_).
> 3. Create a figure using [``matplotlib``](https://matplotlib.org/) and plot the total number of submissions per day (**Hint**: You can use the function [``datetime.datetime.utcfromtimestamp``](https://docs.python.org/3/library/datetime.html) to convert a timestamp into a date, and you can use the method [``DataFrame.resample``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html) to aggregate by day). What do you observe? 
> 4. How many submissions have you downloaded in total? How many unique authors? 
> 5. _Optional_: How many unique authors are there each week in the period under study? 
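
The note in step 1 suggests dividing the period into sub-periods. A minimal sketch of that idea, using only the standard library, is shown below. The helper name `time_windows` and the 7-day window length are assumptions made for illustration; in practice the window should be short enough that each query stays under the result limit. Each resulting pair could then be passed as the `after` and `before` arguments of a separate search.

```python
from datetime import datetime, timedelta, timezone


def time_windows(start, end, days=7):
    """Hypothetical helper: split [start, end) into consecutive windows of at
    most `days` days, returned as (after, before) integer Unix timestamps."""
    windows = []
    cursor = start
    while cursor < end:
        # The last window is cut short so it never goes past `end`
        nxt = min(cursor + timedelta(days=days), end)
        windows.append((int(cursor.timestamp()), int(nxt.timestamp())))
        cursor = nxt
    return windows


# Timezone-aware datetimes, so the timestamps do not depend on the local clock
start = datetime(2020, 1, 1, tzinfo=timezone.utc)
end = datetime(2021, 1, 25, tzinfo=timezone.utc)

windows = time_windows(start, end, days=7)
print(len(windows), windows[0], windows[-1])
```

Because each window starts exactly where the previous one ends, the windows cover the whole period without gaps.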


> *Exercise 3*: __Download comments from the [_r/wallstreetbets_](https://www.reddit.com/r/wallstreetbets/) subreddit.__ The second task for today is to download the comments associated with each submission, which we will use to build the social network of Redditors.
> 1. For each submission you found in _Exercise 2_, download all the comments (*Hint*: Use the [``search_comments``](https://github.com/pushshift/api) function to search comments. You can specify the parameter ``link_id``, which corresponds to the _id_ of the submission for which you require comments).  
> 2. For each comment, store the following information: __id, submission, score, date of creation, author, parent_id__. Note that the _submission id_ can be retrieved by accessing the _link_id_ attribute. Store this in a pandas DataFrame and save it into a file. We will use it in the next classes.
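
The `link_id` and `parent_id` fields hold Reddit "fullnames": an id preceded by a type prefix, where `t3_` marks a submission and `t1_` marks a comment. The sketch below shows one way to split such a value; `split_fullname` is a hypothetical helper, not part of psaw.

```python
def split_fullname(fullname):
    """Hypothetical helper: split a Reddit fullname such as 't3_l3y4mp'
    into (kind, base_id)."""
    prefix, _, base_id = fullname.partition("_")
    # Only the two prefixes relevant to this exercise are mapped here
    kind = {"t1": "comment", "t3": "submission"}.get(prefix, "unknown")
    return kind, base_id


# A parent_id starting with t3_ means the comment replies directly to the
# submission; one starting with t1_ means it replies to another comment.
print(split_fullname("t3_l3y4mp"))
print(split_fullname("t1_gkjja9s"))
```

This distinction matters when building the network later on: the parent of a comment is either the submission itself or another comment.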



> __Note__: It took me about a night to get the data for _Exercise 3_. I guess Pushshift servers are under increasing stress due to the rising interest in the Gamestop saga. If you experience extremely slow download times, reach out to me! If you are brave, you can also check out the Reddit API, which is wrapped by [praw](https://praw.readthedocs.io/en/latest/tutorials/comments.html). It functions very much like psaw, but it requires you to first get credentials [here](https://www.reddit.com/prefs/apps) (click on _Create another app_).

In [5]:
### Exercise 2

## 1. Use the psaw Python library (a wrapper for the Pushshift API) to find all the
##    submissions in subreddit r/wallstreetbets related to either "GME" or "Gamestop",
##    between Jan 1st, 2020 and Jan 25th, 2021 (times given as Unix timestamps).
##    Hint: use the q parameter to search text; separate multiple words with "|".
##    Note: the Pushshift API returns at most 100 results per query, so the period
##    may need to be divided into small enough sub-periods.
## 2. For each submission, find: title, id, score, date of creation, author, and
##    number of comments (Hint: access the attribute dictionary via my_submission.d_).
##    Store the data in a pandas DataFrame and save it to a file.

# Define API
api = PushshiftAPI()

# Define subreddit
my_subreddit = "wallstreetbets"

# Define time interval (naive datetimes, so timestamps are computed in the machine's local timezone)
date1 = int(datetime.datetime(2020,1,1).timestamp())
date2 = int(datetime.datetime(2021,1,25).timestamp())

# Define query
query = "GME|Gamestop"

# Call API
gen = api.search_submissions(subreddit=my_subreddit,
                             after=date1,
                             before=date2,
                             q=query,
                             filter=['author', 'title', 'id', 'score', 'created_utc', 'num_comments'])

# Convert api call to list
results = list(gen)

In [6]:
len(list(gen))  # Why is this just 0? The generator was already consumed by list(gen) in the previous cell; use len(results) for the count.

0
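
The `0` above is a general property of Python generators rather than something specific to the API: once a generator has been iterated to the end, iterating it again yields nothing. A minimal sketch with a toy generator (standing in for the API result generator) illustrates this.

```python
def numbers():
    """Toy generator standing in for the API result generator."""
    yield from range(3)


toy_gen = numbers()
first = list(toy_gen)   # consumes every item of the generator
second = list(toy_gen)  # the generator is exhausted, so this list is empty

print(len(first), len(second))
```

This is why the results are stored in a list once and reused afterwards.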

In [7]:
# Convert list to dataframe
df_res = pd.DataFrame([(p.d_["title"], 
                        p.d_["id"], 
                        p.d_["score"], 
                        p.d_["created_utc"],
                        datetime.datetime.utcfromtimestamp(p.d_["created_utc"]).strftime("%Y-%m-%d"),
                        p.d_["author"],
                        p.d_["num_comments"]) for p in results], columns = ["title", "id", "score", "created_utc", "creation_date", "author", "num_comments"])


# Write dataframe to .csv file
# display(df_res)  # shows the dataframe in a nicer view than print()
df_res.to_csv(path_or_buf="Data/wallstreebets_GME_Katrine_submissions.csv",index=False)
df_res = pd.read_csv("Data/wallstreebets_GME_Katrine_submissions.csv")
df_submissions = df_res
df_submissions

Unnamed: 0,title,id,score,created_utc,creation_date,author,num_comments
0,I am finally buying GME &amp; BB tomorrow.,l49xif,1,1611529173,2021-01-24,mricecream429,1
1,Something that will help you autists in GME.,l49x88,1,1611529149,2021-01-24,SethEllis,0
2,Holy shit you guys https://hard-money.net/cath...,l49we9,1,1611529073,2021-01-24,steeej92,0
3,New member here. WHERE THE FUCK DO I BUY SOME ...,l49ve6,1,1611528981,2021-01-24,krasaa,1
4,Realistically is it too late to get in on GME ...,l49vc5,1,1611528976,2021-01-24,Biverrarton,0
5,My retard meal before $GME goes TO THE MOON! 🚀...,l49ubq,1,1611528890,2021-01-24,maskedmurader,0
6,GME option chain just got upped to $115.,l49tsl,1,1611528845,2021-01-24,nicoleandjimok,0
7,Thanks to the squad here. The money I made wit...,l49qxz,1,1611528599,2021-01-24,WineandWeight,1
8,Is gamestop still a buy. Will it rise on monday?,l49qhc,1,1611528560,2021-01-24,accoubt2468,0
9,Is it too late to buy GME,l49phl,1,1611528476,2021-01-24,rooh62,0


# Submissions per day

In [8]:
## 3. Create a figure using matplotlib and plot the total number of submissions per day 
   # What do you observe?
    # We see the spike around 

import matplotlib.pyplot as plt

df_res['creation_date'] = pd.to_datetime(df_res['creation_date'])
df_plot = df_res.resample('1D', on = 'creation_date').count()

del df_plot["creation_date"]
df_plot = df_plot.reset_index()

display(df_plot) #390 rows × 7 columns

df_plot.plot.line(x='creation_date',y='id')
plt.show()

Unnamed: 0,creation_date,title,id,score,created_utc,author,num_comments
0,2020-01-01,1,1,1,1,1,1
1,2020-01-02,1,1,1,1,1,1
2,2020-01-03,0,0,0,0,0,0
3,2020-01-04,0,0,0,0,0,0
4,2020-01-05,0,0,0,0,0,0
5,2020-01-06,1,1,1,1,1,1
6,2020-01-07,0,0,0,0,0,0
7,2020-01-08,0,0,0,0,0,0
8,2020-01-09,1,1,1,1,1,1
9,2020-01-10,0,0,0,0,0,0


<Figure size 640x480 with 1 Axes>

# Unique authors

How many unique authors are there each week in the period of this study?
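
One possible way to answer this with pandas is sketched below on invented toy data: resample by week on the date column and count the distinct authors in each week. The column names mirror those used in the submissions DataFrame, and `"W"` is the pandas alias for weekly bins ending on Sunday.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "creation_date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-08", "2020-01-09"]
    ),
    "author": ["ann", "bob", "ann", "ann", "cat"],
})

# Number of distinct authors in each weekly bin
weekly_authors = toy.resample("W", on="creation_date")["author"].nunique()

# Number of distinct authors over the whole period
total_authors = toy["author"].nunique()

print(weekly_authors)
print(total_authors)
```

Note that an author active in several weeks is counted once in each of those weeks, so the weekly counts do not add up to the overall total.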

### The total number of submissions
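`len(df_res)` gives the total number of submissions: 11552 + 1 in an earlier run and 8018 + 1 now (the last row index plus one, because Python is zero-indexed).

`len(np.unique(df_submissions["id"]))` returns 15014. Note that this counts unique submission ids; the number of unique authors would come from the `author` column instead.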


# Comments


In [10]:
from tqdm import tqdm

df_submissions = pd.read_csv("Data/wallstreebets_GME_Katrine.csv")

my_subreddit = "wallstreetbets"
date1 = int(datetime.datetime(2020, 1, 1).timestamp())
date2 = int(datetime.datetime(2021, 1, 25).timestamp())
query = "GME|Gamestop"


# get comments through a for loop
comments = []
sub_ids = list(df_submissions[df_submissions["num_comments"] > 0]["id"])
N = 50 #split api call in N bits
step = len(sub_ids)/N
for i in tqdm(range(N)):
    ids = sub_ids[int(round(i*step)):int(round((i+1)*step))]
    gen_comments = api.search_comments(subreddit = my_subreddit,
                                       after = date1,
                                       before = date2,
                                       link_id = ids)
    comments.extend(list(gen_comments))


100%|███████████████████████████████████████████████████████████████████████████████| 50/50 [4:12:06<00:00, 297.64s/it]


In [11]:
df_comments = pd.DataFrame([(p.d_["id"],
                             p.d_["link_id"],
                             p.d_["score"],
                             p.d_["created_utc"],
                             datetime.datetime.utcfromtimestamp(p.d_["created_utc"]).strftime("%Y-%m-%d"),
                             p.d_["author"],
                             p.d_["parent_id"]) for p in comments], columns = [ "id", "submission_id", "score", "created_utc", "creation_date", "author", "parent_id"])

df_comments.to_csv("Data/wallstreebets_GME_Katrine_ALL_comments.csv", index = False)
df_comments

Unnamed: 0,id,submission_id,score,created_utc,creation_date,author,parent_id
0,gkjkq9i,t3_l3y4mp,82,1611489991,2021-01-24,skinfather11216,t3_l3y4mp
1,gkjkjmx,t3_l3y4mp,9,1611489944,2021-01-24,DivingDeep21,t1_gkjja9s
2,gkjkiyd,t3_l3y4mp,28,1611489939,2021-01-24,je_veux_sentir,t1_gkjjk3k
3,gkjkiki,t3_l3y4mp,18,1611489936,2021-01-24,BlazingLeo,t1_gkjjk3k
4,gkjkdnm,t3_l3y4mp,11,1611489901,2021-01-24,Anon-1400secret,t1_gkjjzr8
5,gkjkbg4,t3_l3y4mp,34,1611489885,2021-01-24,je_veux_sentir,t1_gkjjcjj
6,gkjkax0,t3_l3xy7m,9,1611489882,2021-01-24,bschug,t1_gkji91v
7,gkjk9ke,t3_l3xmhk,15,1611489872,2021-01-24,WesternBenefit,t3_l3xmhk
8,gkjk6wp,t3_l3y4mp,13,1611489851,2021-01-24,je_veux_sentir,t1_gkjja9s
9,gkjk6q0,t3_l3vqn9,1,1611489850,2021-01-24,[deleted],t3_l3vqn9


In [12]:
# Scratch test of the chunking logic used above, on a toy list of 20 ids
sub_ids = [i for i in range(20)]
n = 12  # number of chunks
s = len(sub_ids)/n  # (fractional) chunk size
print("s = ", round(s))
for i in range(0, n):
    # Consecutive chunks share their rounded boundaries, so nothing overlaps or is skipped
    k = sub_ids[int(round(i*s)):int(round((i+1)*s))]
    print(len(k))
    print(k[:10])

s =  2
2
[0, 1]
1
[2]
2
[3, 4]
2
[5, 6]
1
[7]
2
[8, 9]
2
[10, 11]
1
[12]
2
[13, 14]
2
[15, 16]
1
[17]
2
[18, 19]
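
The same splitting logic can be wrapped in a function and checked directly, instead of inspecting the printed chunks by eye. This is a minimal sketch; `split_into_chunks` is a hypothetical helper name, and the check simply confirms that concatenating the chunks gives back the original list.

```python
def split_into_chunks(items, n):
    """Hypothetical helper: split `items` into n consecutive chunks of nearly
    equal size, using the same rounded boundaries as the loop above."""
    step = len(items) / n
    return [items[int(round(i * step)):int(round((i + 1) * step))]
            for i in range(n)]


chunks = split_into_chunks(list(range(20)), 12)

# Concatenating the chunks should reproduce the original list exactly
flattened = [x for chunk in chunks for x in chunk]
print(len(chunks), flattened == list(range(20)))
```

Since every chunk ends at the index where the next one begins, no id is requested twice and none is left out.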


In [13]:
len(np.unique(df_submissions["id"]))

15014

### Final comments

The correct csv file for all the submissions is wallstreebets_GME_Katrine.csv, and the correct csv file for all the comments is wallstreebets_GME_Katrine_ALL_comments.csv.