# Data collection: Twitch API
Now that we've identified the top 2k streamers and collected their data from Twitch Tracker, we use the Twitch API to get user-streamer following pairs. Specifically, we take the top 200 of the 2k streamers and get the 100 most recent followers of each, for a total of 20k users. Then, for each of these 20k users, we get the list of streamers they follow.

# Getting started with the Twitch API
The Twitch API developer docs are at https://dev.twitch.tv/docs/api; the [/reference](https://dev.twitch.tv/docs/api/reference) page describes each endpoint and how to query it.

To access the Twitch API you will need to register an application by following the steps at https://dev.twitch.tv/docs/api. Choose a redirect_uri (I just used http://localhost:8888 since I was working in Jupyter) and then modify `config.py` to reflect your specific client_id and redirect_uri.
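For reference, a `config.py` along the following lines would work with the cells in this notebook. The three variable names are the ones the notebook reads; the values are placeholders, and the real file may well contain more than this.

```python
# config.py (sketch): placeholder values, replace with your own
client_id = "YOUR_CLIENT_ID"            # issued when you register the application
redirect_uri = "http://localhost:8888"  # must match the URI registered for the application
bearer_token = "YOUR_ACCESS_TOKEN"      # filled in after the authentication step
```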

In [1]:
# data processing
import json

# Requests & web scraping
import requests

# miscellaneous
from glob import glob
import time
import numpy as np
from collections import Counter

# settings (API keys)
import config

## Authentication

Before making API requests, we need to get a bearer token. Make sure your `config.py` file has been modified and then run the code below.

In [2]:
client_id = config.client_id
redirect_uri = config.redirect_uri
response_type = "token"
scope = "user:read:email"

oauth_url = "https://id.twitch.tv/oauth2/authorize?client_id=%s&redirect_uri=%s&response_type=%s&scope=%s&force_verify=true" \
             % (client_id, redirect_uri, response_type, scope)

print(oauth_url)

https://id.twitch.tv/oauth2/authorize?client_id=YOUR_CLIENT_ID&redirect_uri=YOUR_REDIRECT_URI&response_type=token&scope=user:read:email&force_verify=true
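The `%`-formatted URL works for simple values, but a redirect URI contains characters such as `:` and `/` that are normally percent-encoded in a query string. A minimal sketch of the same URL built with `urllib.parse.urlencode` follows; the helper name `build_oauth_url` is made up for illustration and is not part of the original code.

```python
from urllib.parse import urlencode

def build_oauth_url(client_id, redirect_uri, scope="user:read:email"):
    """Build the authorization URL with percent-encoded query parameters."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "token",
        "scope": scope,
        "force_verify": "true",
    }
    return "https://id.twitch.tv/oauth2/authorize?" + urlencode(params)

print(build_oauth_url("YOUR_CLIENT_ID", "http://localhost:8888"))
```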


To get your access token, click the link above and authorize the application. Twitch will redirect to `<redirect_uri>#access_token=<an access token>`. Reopen `config.py` and set `bearer_token` to your token.
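If you would rather not copy the token out of the address bar by hand, you can paste the full redirected URL into a small parser. This is only a sketch: the function name `extract_access_token` and the example URL are invented for illustration.

```python
from urllib.parse import urlparse, parse_qs

def extract_access_token(redirected_url):
    """Return the access_token found in the URL fragment, or None if absent."""
    fragment = urlparse(redirected_url).fragment
    values = parse_qs(fragment).get("access_token")
    return values[0] if values else None

# Made-up example of what the browser address bar might contain
example = "http://localhost:8888/#access_token=abc123&scope=user%3Aread%3Aemail&token_type=bearer"
print(extract_access_token(example))
```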

In [4]:
bearer_token = config.bearer_token

# Data collection

**Please see** `get_follows_data.py` **for more recent and better documented versions of the functions in this section.** We used Jupyter for developing and debugging these functions, but eventually moved to a `.py` script where we parallelized data collection using `concurrent.futures`. 
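To give a rough idea of what parallel collection with `concurrent.futures` can look like, here is a minimal sketch. It is not the code from the script: `fetch_follows_stub` is a hypothetical stand-in for a per-user API call, and `collect_parallel` is a name chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_follows_stub(user_id):
    """Hypothetical stand-in for a per-user API call such as get_all_follow_data."""
    return {user_id: {"total": 0, "following": []}}

def collect_parallel(user_ids, fetch, max_workers=4):
    """Run `fetch` for each user ID in a thread pool and merge the per-user dicts."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(fetch, user_ids):
            merged.update(result)
    return merged

data = collect_parallel([94753024, 71092938], fetch_follows_stub)
print(sorted(data))
```

Threads suit this workload because each call spends most of its time waiting on the network. The number of workers still has to stay within the API's rate limits.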

## Get streamer ID corresponding to username

According to the Twitch API reference, Get Users Follows requires `from_id` or `to_id` and cannot take a username in either field. Thus, before each Get Users Follows query, we need to translate usernames to IDs.

Below is an example of what this looks like.

In [6]:
url = "https://api.twitch.tv/helix/users?login=%s" % "tysonpo"
response = requests.get(url, headers={"client-id":client_id, "authorization":"Bearer %s" % bearer_token})

response.text

'{"data":[{"id":"47128387","login":"tysonpo","display_name":"tysonpo","type":"","broadcaster_type":"","description":"","profile_image_url":"https://static-cdn.jtvnw.net/user-default-pictures-uv/ce57700a-def9-11e9-842d-784f43822e80-profile_image-300x300.png","offline_image_url":"","view_count":42,"email":"pondtyson@gmail.com","created_at":"2013-08-04T22:06:39.878613Z"}]}'

We can extract the ID with `json.loads(response.text)["data"][0]["id"]`. Let's define a function for this:

In [7]:
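def get_user_id(user):
    # look up a single login name; assumes the login exists and the token is valid
    url = "https://api.twitch.tv/helix/users?login=%s" % user
    response = requests.get(url, headers={"client-id":client_id, "authorization":"Bearer %s" % bearer_token})
    # IDs come back as strings, so convert to int for the "%i" URL formatting used later
    return int(json.loads(response.text)["data"][0]["id"])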

In [8]:
# test
ID = get_user_id("xqcow")
ID

71092938
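Looking up one username per request gets slow for long lists. The Get Users endpoint accepts several `login` parameters in one request (up to 100, per the API reference), so lookups can be batched. The sketch below shows only the offline parts: chunking, parameter building, and response parsing. All three helper names are made up for this example.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_login_params(logins):
    """Build repeated `login` query parameters for one Get Users request."""
    return [("login", name) for name in logins]

def parse_user_ids(payload):
    """Map login -> integer ID from a parsed Get Users response."""
    return {u["login"]: int(u["id"]) for u in payload.get("data", [])}

# Offline example with a payload shaped like the response shown above
payload = {"data": [{"id": "71092938", "login": "xqcow"},
                    {"id": "47128387", "login": "tysonpo"}]}
print(parse_user_ids(payload))
```

Each batch from `chunked` could then be sent with `requests.get("https://api.twitch.tv/helix/users", params=build_login_params(batch), headers=...)`, using the same headers as `get_user_id`.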

## Get streamer followers
Now we can get a user's followers/following list. The example below uses the ID above and gets xQcOW's following list by passing `from_id` as a query parameter. We limit the results to 10 here, but the API can return up to 100 results per request.

In [14]:
max_results = 10
url = "https://api.twitch.tv/helix/users/follows?from_id=%i&first=%i" % (ID, max_results)
response = requests.get(url, headers={"client-id":client_id, "authorization":"Bearer %s" % bearer_token})

response.text

'{"total":164,"data":[{"from_id":"71092938","from_name":"xQcOW","to_id":"175870676","to_name":"mbovosumo","followed_at":"2020-11-12T08:29:48Z"},{"from_id":"71092938","from_name":"xQcOW","to_id":"7236692","to_name":"DansGaming","followed_at":"2020-10-25T12:19:11Z"},{"from_id":"71092938","from_name":"xQcOW","to_id":"39195074","to_name":"TheAlbertChang","followed_at":"2020-10-10T03:56:48Z"},{"from_id":"71092938","from_name":"xQcOW","to_id":"443164500","to_name":"jamescharles","followed_at":"2020-10-09T21:39:01Z"},{"from_id":"71092938","from_name":"xQcOW","to_id":"519025175","to_name":"UltraGearGaming","followed_at":"2020-10-03T21:56:23Z"},{"from_id":"71092938","from_name":"xQcOW","to_id":"72256775","to_name":"VADIKUS007","followed_at":"2020-10-01T10:29:37Z"},{"from_id":"71092938","from_name":"xQcOW","to_id":"52795976","to_name":"GFuelEnergy","followed_at":"2020-09-25T18:51:53Z"},{"from_id":"71092938","from_name":"xQcOW","to_id":"41939266","to_name":"Gosu","followed_at":"2020-09-08T11:56:3

We can extract just the following list:

In [15]:
[x["to_name"] for x in json.loads(response.text)["data"]]

['mbovosumo',
 'DansGaming',
 'TheAlbertChang',
 'jamescharles',
 'UltraGearGaming',
 'VADIKUS007',
 'GFuelEnergy',
 'Gosu',
 'Chess',
 'souljaboy']

Below we define a general function for getting user followers/following data.

In [16]:
def get_follow_data(ID, kind="followers", cursor=None, return_pagination=False, max_results=100, fmt=None):
    if kind == "followers":
        url_field, opp = "to", "from"
    else: # following
        url_field, opp = "from", "to"
    if cursor:
        url = "https://api.twitch.tv/helix/users/follows?%s_id=%i&first=%i&after=%s" % (url_field, ID, max_results, cursor)
    else:
        url = "https://api.twitch.tv/helix/users/follows?%s_id=%i&first=%i" % (url_field, ID, max_results)
        
    response = requests.get(url, headers={"client-id":client_id, "authorization":"Bearer %s" % bearer_token})
    data = json.loads(response.text)
    
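    if not data or "data" not in data:
        # empty or error response: keep the (data, cursor) shape when pagination is requested
        return ("NA", None) if return_pagination else "NA"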

    # FORMAT
    # 1. Get streamer followers
    # {"ID":streamer_ID, "followers":[follower_id1, follower_id2, ...]}
    # 2. Get user follows
    # {user_ID: {"total":n_follows, "following":[[streamer_name1, time_followed1] , [streamer_name2, time_followed2] , ...]}}
    # 3. Generic get list of follows
    # [follow_id1, follow_id2, ...]
    if fmt == 1:
        follow_data = {"ID":ID, "followers":[x["%s_id" % opp] for x in data["data"]]}
    elif fmt == 2:
        follow_data = {ID: {"total":data["total"], "following": [[x["%s_name" % opp], x["followed_at"]] for x in data["data"]]}}
    else:
        follow_data = [x["%s_id" % opp] for x in data["data"]]
    
    if return_pagination:
        pagination = data["pagination"]
        if "cursor" in pagination:
            pagination = pagination["cursor"]
        return follow_data, pagination
        
    return follow_data

def get_all_follow_data(ID, kind="followers", cursor=None, fmt=None, sleep_time=1):
    
    # make 1 query and return cursor (in case no cursor was specified)
    follow_data, cursor = get_follow_data(ID, kind, cursor, return_pagination=True, max_results=100, fmt=fmt)
        
    # if there is less than one full page of results, we can stop here
    if not cursor:
        return follow_data
    
    # otherwise we fetch the remaining pages one at a time
    count = 100 # we will cap results at 1000
    while cursor and count < 1000:
        results, cursor = get_follow_data(ID, kind, cursor, return_pagination=True, max_results=100, fmt=fmt)
        if fmt == 1:
            follow_data["followers"] = follow_data["followers"] + results["followers"]
        elif fmt == 2:
            follow_data[ID]["following"] = follow_data[ID]["following"] + results[ID]["following"]
        else:
            follow_data.extend(results)
        time.sleep(sleep_time)
        
        count += 100
        
    return follow_data
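The cursor-based loop above is easier to follow without the network calls. The sketch below reproduces the same idea (fetch a page, follow the cursor, stop at a cap) against a fake in-memory API. `paginate` and `make_fake_api` are illustrative names, and the fake cursor is simply the next start offset, which is not how Twitch encodes its cursors.

```python
def paginate(fetch_page, page_size=100, cap=1000):
    """Collect items page by page until the cursor runs out or `cap` is reached."""
    items, cursor = [], None
    while len(items) < cap:
        page, cursor = fetch_page(cursor, page_size)
        items.extend(page)
        if not cursor:
            break
    return items[:cap]

def make_fake_api(total):
    """Hypothetical in-memory API: the cursor is just the next start offset."""
    def fetch_page(cursor, page_size):
        start = cursor or 0
        end = min(start + page_size, total)
        next_cursor = end if end < total else None
        return list(range(start, end)), next_cursor
    return fetch_page

print(len(paginate(make_fake_api(250))))   # 250: three pages, then the cursor runs out
print(len(paginate(make_fake_api(5000))))  # 1000: stopped by the cap
```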

## For each user, get their follow data

In [20]:
user_list = [94753024, 71092938]
data = {}
for user in user_list:
    # merge each user's {ID: {...}} dict so earlier results are not overwritten
    data.update(get_all_follow_data(user, kind="following", cursor=None, fmt=2, sleep_time=1))

In [18]:
data

{94753024: {'total': 333,
  'following': [['SteveAoki', '2020-11-16T00:55:55Z'],
   ['Twitch', '2020-11-13T19:47:50Z'],
   ['IslandGrown', '2020-11-09T07:50:05Z'],
   ['sashagrey', '2020-11-05T20:56:03Z'],
   ['lilyachty', '2020-11-04T10:56:35Z'],
   ['MK64MR', '2020-11-02T02:57:01Z'],
   ['artesianbuilds', '2020-10-29T19:19:56Z'],
   ['TheEret', '2020-10-27T02:00:09Z'],
   ['Masayoshi', '2020-10-23T03:15:47Z'],
   ['crystalboyisland', '2020-10-22T03:49:31Z'],
   ['Nihachu', '2020-10-21T20:19:23Z'],
   ['AOC', '2020-10-21T00:56:24Z'],
   ['lukeafkfan', '2020-10-19T21:34:25Z'],
   ['TheAlbertChang', '2020-10-10T03:45:52Z'],
   ['Punz', '2020-10-06T23:03:54Z'],
   ['Natsumiii', '2020-10-05T06:46:01Z'],
   ['Sapnap', '2020-10-04T23:43:54Z'],
   ['Tubbo', '2020-10-04T20:38:20Z'],
   ['Lacari', '2020-10-04T04:57:45Z'],
   ['eaJParkOfficial', '2020-10-02T23:45:20Z'],
   ['5uppp', '2020-09-26T09:53:03Z'],
   ['illumina1337', '2020-09-18T10:39:00Z'],
   ['jacksepticeye', '2020-09-16T21:37:36Z'

## Data inspection
After collecting the data using `get_follow_data.py`, let's take a look at what we have.

In [37]:
total = 0
follows = []
with open("data/user_follows.json", "r") as f:
    data = json.load(f)
    for d in data.values():
        total += d["total"]
        follows.extend([x[0] for x in d["following"]])
        
print("Total number of users:", len(data)) # i.e. number of unique users
print("Total follows possible:", total) # total amount of data if we did not cap at 1k streamers/user
print("Actual follows collected:", len(follows)) # actual amount of data due to capping at 1k streamers/user
print("Number of unique follows:", len(set(follows))) # i.e. number of unique streamers

Total number of users: 15997
Total follows possible: 1110148
Actual follows collected: 1023790
Number of unique follows: 264744


In [55]:
counter = Counter(follows).most_common()
counts = [count for name,count in counter]
print("Avg streamer follows per user", len(follows)/len(data))
print("Avg users following per streamer", np.mean(counts))
print("Number of 'streamers' with > 4 follows", len([x for x in counts if x > 4]) )

Avg streamer follows per user 63.99887478902294
Avg users following per streamer 3.8670942495391776
Number of 'streamers' with > 4 follows 26769


To summarize:
- 16.0k Twitch users (15,997)
- 1.02m retrieved follows ("ratings"); this is 92.22% of all 1.11m follows, so capping at 1k per user costs us little
- 264.7k unique follows ("items")

Considerations:
- Most "streamers" are followed only once or twice in our data; only 26.8k have more than 4 followers. This suggests most of the 264.7k unique streamers are either (i) not popular or (ii) not actually streamers, but rather Twitch users that people follow. In my mind, this is an issue with Twitch: I have 6 followers but I've never streamed (why do they follow me?).
- Our user-item matrix is 16.0k x 264.7k, so we need to reduce its size. If we take streamers with >= 5 followers, this still leaves us with a 16.0k x 26.8k matrix. Ideally we want more users than items for collaborative filtering to work well. We will later solve this issue by taking only streamers that fall into the top 2k (as determined in the next section).
  - This method will result in a 16.0k x 2k matrix with 288k non-null entries (0.9% full), which is far more suitable for model building.
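To make the filtering step concrete, here is a toy sketch of restricting follows to a set of top streamers and measuring how full the resulting user-item matrix is. The data and the helper names (`filter_to_top_items`, `density`) are made up, and "top" here means most followed within the toy data, which only stands in for the top-2k list.

```python
from collections import Counter

def filter_to_top_items(user_follows, keep):
    """Keep only follows whose streamer is in `keep` (users are all retained)."""
    return {user: [s for s in streamers if s in keep]
            for user, streamers in user_follows.items()}

def density(user_follows, n_items):
    """Fraction of the users x items matrix that is non-null."""
    n_entries = sum(len(set(s)) for s in user_follows.values())
    return n_entries / (len(user_follows) * n_items)

# Toy data (made up for illustration)
follows = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "d"],
    "u3": ["a", "b"],
    "u4": ["e"],
}
counts = Counter(s for streamers in follows.values() for s in streamers)
top = {name for name, _ in counts.most_common(2)}  # stand-in for the top 2k
small = filter_to_top_items(follows, top)
print(sorted(top), density(small, len(top)))
```

The same arithmetic on the real numbers gives 288k / (15,997 x 2,000), or about 0.9%, matching the figure above.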