# Twitter scraping at scale - leverage multiprocessing to speed up the follower retrieval process

- toc: true 
- badges: true
- comments: true
- categories: [tutorial, multiprocessing]

In this tutorial, we show how to use Python's `multiprocessing` package to speed up scraping data from Twitter. As a scraper, we use the twint package, available [here](https://github.com/twintproject/twint). Because Twitter changes frequently, we recommend installing the package directly from GitHub by running:

```bash
git clone --depth=1 https://github.com/twintproject/twint.git

cd twint

pip install . -r requirements.txt
```

We will show how to handle a case where you need to run multiple processes concurrently. As an example, we will concurrently download the following lists (the accounts a user follows) for multiple accounts.

# Imports

In [1]:
from functools import partial
import multiprocessing
import os
from os.path import join, realpath, dirname

import nest_asyncio
import pandas as pd
from tqdm.auto import tqdm
import twint

In [2]:
nest_asyncio.apply()

# Users whose following lists we want to gather

We will download the list of followed accounts for users who posted a tweet with the hashtag #BLM on 27 October 2020. The dataset was also downloaded using twint and can be found [here](https://github.com/tugot17/data-science-blog/tree/master/_notebooks/twint_multiprocessing/blm.csv).

In [3]:
path = join("twint_multiprocessing", "blm.csv")

df = pd.read_csv(path)

df

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1321138727738085381,1321128813892349955,2020-10-27 18:16:32 CET,2020-10-27,18:16:32,100,3169905249,trapmasterrick,Rick Spartan ♨️,,...,,,,,,"[{'screen_name': 'dstiddypop43', 'name': 'Dere...",,,,
1,1321138726278369280,1321138726278369280,2020-10-27 18:16:32 CET,2020-10-27,18:16:32,100,589448567,demhugh,⚖ Fightin for America!🐝💛,,...,,,,,,[],,,,
2,1321138714786074636,1321138714786074636,2020-10-27 18:16:29 CET,2020-10-27,18:16:29,100,771361727166423044,stephenritz,Stephen Ritz,,...,,,,,,[],,,,
3,1321138682129072128,1321138682129072128,2020-10-27 18:16:22 CET,2020-10-27,18:16:22,100,1282523269728370688,melsddd,MTKA,,...,,,,,,[],,,,
4,1321138663955132417,1321138663955132417,2020-10-27 18:16:17 CET,2020-10-27,18:16:17,100,1278050664677953536,owlsi1,OWLSI- CA equality chick--SHARE my creations PLZ😁,,...,,,,,,[],,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180,1321131950871060483,1321131950871060483,2020-10-27 17:49:37 CET,2020-10-27,17:49:37,100,21678187,ziggymarly,nasdog,,...,,,,,,[],,,,
181,1321131943908487170,1321131943908487170,2020-10-27 17:49:35 CET,2020-10-27,17:49:35,100,373985118,wearewhispertv,Whisper,,...,,,,,,[],,,,
182,1321131928834109441,1321126626298634240,2020-10-27 17:49:31 CET,2020-10-27,17:49:31,100,857472736725155844,eugjhawk,Billy D Why is it OK for tRump to kill milllions?,,...,,,,,,"[{'screen_name': 'realDonaldTrump', 'name': 'D...",,,,
183,1321131864136974336,1321126626298634240,2020-10-27 17:49:16 CET,2020-10-27,17:49:16,100,857472736725155844,eugjhawk,Billy D Why is it OK for tRump to kill milllions?,,...,,,,,,"[{'screen_name': 'realDonaldTrump', 'name': 'D...",,,,


In [4]:
df[["username"]]

Unnamed: 0,username
0,trapmasterrick
1,demhugh
2,stephenritz
3,melsddd
4,owlsi1
...,...
180,ziggymarly
181,wearewhispertv
182,eugjhawk
183,eugjhawk


In [5]:
users = df.username.unique()

f"{len(users)} users, 5 first: {users[:5]}"

"163 users, 5 first: ['trapmasterrick' 'demhugh' 'stephenritz' 'melsddd' 'owlsi1']"

# Scraping method

Let's define a simple function that downloads the list of accounts a selected user follows:

In [6]:
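def generate_csv_with_user_followings(username, save_dir=""):
    """
    Download the accounts a user follows (the "following" list) and save them as a .csv file.
    :param username: plain string, e.g. user2137 (not @user)
    :param save_dir: path to the folder where the csv should be saved, e.g. ./data/
    :return: None
    """
    # Create the output folder if needed; an empty save_dir means the current directory
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)

    c = twint.Config()
    c.Username = username

    c.Store_csv = True
    # The file is named "_followers.csv", but it holds the accounts the user follows
    c.Output = f"{join(save_dir, username)}_followers.csv"
    c.Hide_output = True

    # Following = accounts the user follows (twint.run.Followers would fetch the followers)
    twint.run.Following(c)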
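
When a function like this runs for many users, a single failing account can interrupt the whole job, because an exception raised in a worker is re-raised in the parent process when its result is fetched. One possible safeguard is a small wrapper that records the error instead of raising it. The sketch below is an illustration only: `safe_call` and `flaky_scrape` are hypothetical names, and `flaky_scrape` is a stand-in for the real scraping function so that the example runs without network access.

```python
from functools import partial


def safe_call(func, username):
    """Run func(username) and report the outcome instead of raising.

    Returns (username, None) on success or (username, error_text) on failure.
    """
    try:
        func(username)
        return username, None
    except Exception as exc:  # broad on purpose: any scraping error is recorded
        return username, f"{type(exc).__name__}: {exc}"


def flaky_scrape(username):
    """Hypothetical stand-in for the real scraping function."""
    if username.startswith("bad"):
        raise ValueError("profile could not be loaded")


# partial() keeps the wrapper a one-argument callable, as Pool.imap* expects
safe_scrape = partial(safe_call, flaky_scrape)

results = [safe_scrape(name) for name in ["alice", "bad_user", "bob"]]
failed = [name for name, error in results if error is not None]
print(failed)  # prints ['bad_user']
```

Because `safe_call` is a module-level function and `partial` objects can be pickled, the same wrapper could be handed to a process pool, and the list of failed usernames could be retried afterwards.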

# Multiprocessing

Now we can run the function defined above concurrently for multiple users with `multiprocessing`.
To show the progress, we will use `tqdm`. The estimated time may not be very accurate, but it gives at least a rough view of the progress.

**Warning:** Using too many processes may severely slow down your computer
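
Scraping is mostly network-bound, so using more processes than CPU cores can still pay off, but only up to a point. As a rough sketch of one way to pick the pool size, the helper below scales with the number of cores and never exceeds the number of tasks. The function name, the `per_core` multiplier, and the `hard_cap` value are assumptions made for illustration, not recommendations from twint or from the `multiprocessing` documentation.

```python
import os


def choose_number_of_processes(n_tasks, per_core=4, hard_cap=30, cpu_count=None):
    """Pick a pool size bounded by the task count, the core count, and a hard cap.

    cpu_count can be passed explicitly (useful for testing); otherwise the
    value reported by the operating system is used.
    """
    cores = cpu_count or os.cpu_count() or 1
    return max(1, min(n_tasks, cores * per_core, hard_cap))


# 163 users on a hypothetical 4-core machine: limited by cores * per_core
print(choose_number_of_processes(163, cpu_count=4))  # prints 16
```

If the machine becomes unresponsive or Twitter starts refusing requests, lowering `per_core` or `hard_cap` is the first thing to try.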

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [8]:
save_dir = join("twint_multiprocessing", "followings")

number_of_processes = 30
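
For long runs it can help to skip users whose output file already exists, so that an interrupted job can be restarted without scraping everything again. The following is a minimal sketch under two assumptions: output files follow the `<username>_followers.csv` naming used in this tutorial, and a non-empty file means the download finished. `pending_users` is a hypothetical helper, and the temporary directory only simulates a previous partial run.

```python
import tempfile
from os.path import getsize, isfile, join


def pending_users(usernames, save_dir):
    """Return the usernames that do not yet have a non-empty output file."""
    pending = []
    for username in usernames:
        path = f"{join(save_dir, username)}_followers.csv"
        if not (isfile(path) and getsize(path) > 0):
            pending.append(username)
    return pending


# Simulate a previous run: one finished file, one empty file, one missing file
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(join(tmp_dir, "alice_followers.csv"), "w") as handle:
        handle.write("username\nbob\n")
    open(join(tmp_dir, "empty_followers.csv"), "w").close()
    todo = pending_users(["alice", "empty", "carol"], tmp_dir)

print(todo)  # prints ['empty', 'carol']
```

Note the limitation: a file that was only partially written before an interruption is non-empty and would be treated as finished, so this check is a convenience rather than a guarantee of completeness.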

In [10]:
# Bind save_dir with partial, because imap_unordered passes a single argument to the function
user_csv_generation = partial(generate_csv_with_user_followings, save_dir=save_dir)

with multiprocessing.Pool(processes=number_of_processes) as p:
    with tqdm(total=len(users)) as pbar:
        # Results arrive in completion order; we only need them to advance the progress bar
        for _ in p.imap_unordered(user_csv_generation, users):
            pbar.update()
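
To experiment with the pattern above without sending requests to Twitter, the scraping function can be swapped for a stand-in. The sketch below assumes nothing about twint: `fake_download` is a hypothetical placeholder that sleeps briefly and returns the path it would have written, and `run_pool` is an illustrative helper that combines `partial` with `imap_unordered` in the same way as the cell above.

```python
import multiprocessing
import time
from functools import partial
from os.path import join


def fake_download(username, save_dir=""):
    """Hypothetical stand-in for the scraper: wait briefly, return the would-be path."""
    time.sleep(0.05)
    return f"{join(save_dir, username)}_followers.csv"


def run_pool(usernames, save_dir, number_of_processes=4):
    """Run fake_download concurrently and return the resulting paths, sorted."""
    worker = partial(fake_download, save_dir=save_dir)
    with multiprocessing.Pool(processes=number_of_processes) as pool:
        # imap_unordered yields results in completion order, so sort for stable output
        return sorted(pool.imap_unordered(worker, usernames))


if __name__ == "__main__":
    print(run_pool(["alice", "bob", "carol"], "out"))
```

The `if __name__ == "__main__":` guard matters when the code runs as a script on platforms that start worker processes by re-importing the main module; without it, each worker would try to create its own pool.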

## Example following list

In [12]:
path = join(save_dir, f"{df.username.iloc[0]}_followers.csv")

pd.read_csv(path)

Unnamed: 0,username
0,jimgaffigan
1,kotsiebader
2,eshaenic
3,ladywardog
4,ruth_a_buzzi
...,...
572,archaeologynews
573,fernbankmuseum
574,carlosmuseum
575,britishmuseum
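
Once many files have been written, a quick sanity check is to count how many accounts each scraped user follows. The sketch below uses only the standard library and assumes each file has a header row followed by one followed account per row, as in the example output above. `count_followings` is a hypothetical helper, and the temporary directory with two small files exists only to make the example runnable.

```python
import csv
import glob
import tempfile
from os.path import basename, join


def count_followings(save_dir, suffix="_followers.csv"):
    """Map each scraped username to the number of data rows in its CSV file."""
    counts = {}
    for path in sorted(glob.glob(join(save_dir, f"*{suffix}"))):
        owner = basename(path)[: -len(suffix)]  # strip the suffix to recover the username
        with open(path, newline="", encoding="utf-8") as handle:
            counts[owner] = sum(1 for _ in csv.DictReader(handle))
    return counts


# Build two tiny example files so the sketch runs without any scraped data
with tempfile.TemporaryDirectory() as tmp_dir:
    for owner, followed in {"alice": ["bob", "carol"], "bob": ["alice"]}.items():
        out_path = join(tmp_dir, f"{owner}_followers.csv")
        with open(out_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["username"])
            writer.writerows([name] for name in followed)
    counts = count_followings(tmp_dir)

print(counts)  # prints {'alice': 2, 'bob': 1}
```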


# Summary

We have shown how to use `multiprocessing` for tasks that can run concurrently. The presented method is by no means limited to twint; nevertheless, in our opinion, scraping is an interesting and intuitive use case for the `multiprocessing` package.