# Dataset Collection

The task is to find out which problems in India people are discussing on social media. To this end, we will build a social-media dataset, perform an exploratory analysis, identify salient topics of discussion, and eventually train a model that predicts the topic of a given social media post.

In this notebook, our task is to collect data to form our dataset.

Before we start collecting data, let us ask certain questions about how we wish to collect data.

Q1) **What platforms can we use?**

Reddit and Twitter are two popular platforms where people voice opinions and discuss problems. Reddit posts tend to be longer and attract more focused discussion, whereas Twitter consists mostly of short-form opinions without much follow-up discussion. I tried extracting data from both, and a preliminary look at data quality suggested that it is best to stick with a **purely Reddit dataset**.

Q2) **How do we ensure that the posts we collect discuss an Indian problem?**

First, to make sure that the broader topic of discussion is India, we only look at India-focused subreddits such as **r/India** and **r/IndiaSpeaks**. To further ensure that a problem is being discussed, we first look for a flair (a category label on a subreddit's posts) that relates to problems.

For example, r/IndiaSpeaks applies the flair #Social-Issues to posts discussing problems.

However, r/India has no such flair. There, we instead keep only posts that contain our keywords: ("Social" and "Problem") or ("Social" and "Issue").
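The keyword rule above can be sketched as a small local filter. This is only an illustration: in this notebook the matching for r/India is delegated to Pushshift's `q` search parameter, and the function name `mentions_social_problem` and the handling of plural forms are assumptions made for the example.

```python
import re

def mentions_social_problem(text):
    """Return True if text contains "social" together with "problem" or "issue".

    Hypothetical helper: matching is case-insensitive and on whole words,
    and plural forms ("problems", "issues") are accepted as an assumption.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_social = "social" in words
    has_problem = bool(words & {"problem", "problems", "issue", "issues"})
    return has_social and has_problem
```

Because it matches whole words, a title such as "antisocial behaviour is a problem" would not pass, which may or may not be the desired behaviour for a given dataset.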

Q3) **How do we collect Reddit posts?**

Reddit's official API is usually accessed from Python through the PRAW wrapper. However, it offers little customizability for this kind of bulk collection and has strict rate limits. A good alternative is the Pushshift API, which supports searching posts within a date range, by particular search terms, and so on.

I am using PMAW, a wrapper built around Pushshift, which makes data collection easier.

Other details, such as the date range to enforce and the number of posts to collect, are discussed further below.

In [None]:
import pandas as pd

In [None]:
import requests
from datetime import datetime
import traceback

In [None]:
!pip install pmaw

Collecting pmaw
  Downloading pmaw-2.1.3-py3-none-any.whl (25 kB)
Collecting praw
  Downloading praw-7.5.0-py3-none-any.whl (176 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.3.2-py3-none-any.whl (54 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw, pmaw
Successfully installed pmaw-2.1.3 praw-7.5.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.3.2


In [None]:
def get_update_array(post):
    """Flatten a Pushshift submission dict into a fixed-order list of fields."""
    # Fields expected on every submission; a missing key raises KeyError.
    title = post['title']
    author = post['author']
    created_utc = post['created_utc']
    self_post = post['is_self']
    score = post['score']
    over_18 = post['over_18']
    num_comments = post['num_comments']

    # Optional fields: fall back to a default when the key is absent.
    is_original_content = post.get('is_original_content')
    self_text = post.get('selftext', "")
    flair = post.get('link_flair_text')

    return [title, flair, score, num_comments, author, is_original_content,
            created_utc, self_post, self_text, over_18]

In [None]:
import time

# Query template: top-scoring r/India submissions matching a term in a time window.
url = "https://api.pushshift.io/reddit/submission/search/?q={}&score=>0&before={}&after={}&sort_type=score&sort=desc&subreddit=India&limit=1000"
dataset = []
year = 365*24*60*60
epoch = int(time.time())  # current Unix time; upper bound of the first window
epoch_prev = epoch - year
search_terms = ["Social Issues", "Social Issue", "Social Problem", "Social Problems"]
post_counts = 0
total_years = 5
request_failed = False
for every_year in range(total_years):
    for term in search_terms:
        final_url = url.format(term, epoch, epoch_prev)
        response = requests.get(final_url, headers={'User-Agent': "test reddit app"})
        print(response)
        if response.status_code != 200:
            # e.g. 429 Too Many Requests: do not try to parse the body as JSON.
            request_failed = True
            break
        for post in response.json()['data']:
            post_counts = post_counts + 1
            dataset.append(get_update_array(post))
        print(post_counts)
    if request_failed:
        break
    # Move the window back one year only after every term has been queried.
    epoch = epoch_prev
    epoch_prev = epoch - year

<Response [200]>
100
<Response [200]>
174
<Response [200]>
249
<Response [200]>
271
<Response [200]>
333
<Response [200]>
371
<Response [200]>
434
<Response [200]>
447
<Response [200]>
452
<Response [429]>


JSONDecodeError: ignored

## r/India Extraction

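My first attempt at querying Pushshift directly failed almost immediately: after nine successful requests, the API responded with HTTP 429 (Too Many Requests), and parsing that response as JSON raised the `JSONDecodeError` shown above. This led me to [pmaw](https://github.com/mattpodolak/pmaw), which implements intelligent rate limiting.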
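To illustrate what handling a rate limit can look like, here is a minimal sketch of retrying with exponential backoff whenever a request returns HTTP 429. It is not PMAW's actual implementation; the function name `fetch_with_backoff`, its parameters, and the `(status_code, payload)` convention are all assumptions made for this example.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429, waiting longer each time.

    fetch is any zero-argument callable returning (status_code, payload).
    sleep is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status != 429:
            return status, payload
        if attempt < max_retries:
            # Double the wait after every rate-limited attempt: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after all retries: return the last response as-is.
    return status, payload
```

With `requests`, `fetch` could be a small function that performs the GET and returns the status code together with the parsed JSON only when the status is 200.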

From the **r/India** subreddit, I use the search terms "social issue" and "social problem" and request up to 20,000 of the top-scoring posts that contain either term. The results are collected in `all_post_list`.

In [None]:
from pmaw import PushshiftAPI

all_post_list = []
api = PushshiftAPI()
search_terms = ["Social Issue", "Social Problem"]
for term in search_terms:
    # Request up to 20,000 of the highest-scoring r/India posts matching the term.
    posts = api.search_submissions(subreddit='India', q=term, sort_type="score", sort="desc", limit=20000)
    all_post_list.extend(posts)
    print(len(all_post_list))  # running total across both terms

Not all PushShift shards are active. Query results may be incomplete.

496


Not all PushShift shards are active. Query results may be incomplete.

1015


In [None]:
df_india = pd.DataFrame(all_post_list)
df_india.shape

(1015, 92)

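Pushshift returned only 1,015 posts containing those keywords, far fewer than the requested limit. That's alright.

Next, we check for duplicates, for example posts that contain both "social issue" and "social problem" and were therefore returned by both searches. We use `created_utc` as the key, treating rows with the same creation timestamp as the same post.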

In [None]:
df_india['created_utc'].value_counts()

1579245405    6
1636027326    6
1579519599    5
1636268366    5
1533407177    5
             ..
1524470944    1
1523774245    1
1524320778    1
1523746202    1
1424591869    1
Name: created_utc, Length: 857, dtype: int64

In [None]:
df_dup_rem_ind = df_india.drop_duplicates(subset="created_utc")
df_dup_rem_ind.shape

(857, 92)

Indeed, there were. After removing duplicates, we have 857 posts from the r/India subreddit.
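Deduplicating on `created_utc` assumes that no two distinct posts were created in the same second. A stricter alternative, sketched below, keys on the submission's `id` instead. This assumes each Pushshift record carries an `id` field, and the helper name `dedupe_by_id` is hypothetical; the counts reported in this notebook come from the `created_utc` approach, not from this sketch.

```python
def dedupe_by_id(posts):
    """Keep the first occurrence of each post, keyed on its 'id' field.

    Posts without an 'id' are kept, since they cannot be compared.
    """
    seen = set()
    unique = []
    for post in posts:
        post_id = post.get("id")
        if post_id is not None:
            if post_id in seen:
                continue  # already collected under another search term
            seen.add(post_id)
        unique.append(post)
    return unique
```

On a DataFrame, the equivalent would be `drop_duplicates(subset="id")`, provided the `id` column is present.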

In [None]:
df_dup_rem_ind.to_pickle("df_india.pkl")

## r/IndiaSpeaks Extraction

The next part of the code extracts data from the r/IndiaSpeaks subreddit.

This is done as follows:

1. We split the 18 months before 1 May 2022 into nine two-month windows.
2. In each window, we request up to 10,000 of the top-scoring posts.
3. From these posts, we keep those whose flair is "#Social-Issues".

This method was chosen after quite a bit of trial and error. There are barely any posts with the Social-Issues flair older than 18 months. Additionally, given how sparse such posts are, we keep the limit for each two-month window at the top 10,000 posts.
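The windowing in step 1 can be sketched on its own, separate from any API calls. The generator name `two_month_windows` is hypothetical, a month is approximated as 30 days, and the end date is pinned to UTC here so the timestamps do not depend on the machine's time zone.

```python
from datetime import datetime, timezone

def two_month_windows(end, count=9, days_per_month=30):
    """Yield (before, after) Unix-epoch pairs, newest window first."""
    span = 2 * days_per_month * 24 * 60 * 60  # two 30-day months in seconds
    before = int(end.timestamp())
    for _ in range(count):
        after = before - span
        yield before, after
        before = after  # the next window ends where this one began

windows = list(two_month_windows(datetime(2022, 5, 1, tzinfo=timezone.utc)))
```

Each `(before, after)` pair can then be passed as the `before` and `after` arguments of a search call, one window at a time.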

In [None]:
def flair_filter(item):
    """Keep only submissions whose flair is exactly the Social-Issues flair."""
    return item.get("link_flair_text") == "#Social-Issues 🗨️"

In [None]:
from pmaw import PushshiftAPI
import datetime as dt
all_post_list = []
api = PushshiftAPI()
before = int(dt.datetime(2022,5,1,0,0).timestamp())
month = 30*24*60*60
after = before - month*2
for i in range(9):
    print(before)
    posts = api.search_submissions(subreddit='IndiaSpeaks', before=before, after=after, sort_type="score", sort="desc", limit=10000, filter_fn=flair_filter)
    all_post_list.extend([post for post in posts])
    print(len(all_post_list))
    before = after
    after = after - month*2

1651363200
288
1646179200
677
1640995200
1122
1635811200
1536
1630627200
1791
1625443200
2043
1620259200
2105
1615075200
2105
1609891200


Not all PushShift shards are active. Query results may be incomplete.


2105


In [None]:
df_all_post = pd.DataFrame(all_post_list)

In [None]:
df_all_post.shape

(2105, 84)

In [None]:
df_all_post['link_flair_text'].value_counts()

#Social-Issues 🗨️    2105
Name: link_flair_text, dtype: int64

In [None]:
df_all_post['created_utc'].value_counts()

1644137529    9
1644126217    9
1644115796    8
1644136176    8
1619347455    7
             ..
1640589828    1
1640617524    1
1640623727    1
1640777420    1
1619234227    1
Name: created_utc, Length: 1216, dtype: int64

In [None]:
df_dup_rem = df_all_post.drop_duplicates(subset="created_utc")

In [None]:
df_dup_rem.shape

(1216, 84)

In [None]:
df_dup_rem.to_pickle("df_ispeaks.pkl")

Finally, after removing duplicates from the IndiaSpeaks data as well, we are left with 1,216 posts, which we have also saved as a pickle file. We will preprocess both datasets in the next part of our task!