# Notebook for Scraping Twitter Tweets Using snscrape's Python Wrapper

Install required modules for scraping

Dependencies:

Your Python version must be 3.8 or higher. The development version of snscrape will not work with Python 3.7 or lower. You can download the latest Python version here.

Development version of snscrape, uncomment the pip install line in the below cell to pip install in the notebook if you don't already have it.

Pandas, the dataframes allows easy manipulation and indexing of data, this is more of a preference but is what I follow in this notebook.

In [None]:
!pip install snscrape
!pip install pandas
!pip install pymongo

Import the required modules

In [19]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
import datetime
import pymongo
import time 

Database connection is established to store the scraped data

Keyword, Start date, End date and Number of tweets are declared


In [20]:
# REQUIRED VARIABLES
client = pymongo.MongoClient("mongodb://localhost:27017/")  # To connect to MONGODB
mydb = client["Twitter_Database"]    # To create a DATABASE
word = 'LIC Policy'
end = datetime.date.today()
d = datetime.timedelta(days = 100)
start = end - d
tweet_c = 1000
tweets_list = []
print('KeyWord to be searched: ', word)
print('Starting date: ', start)
print('Ending date:', end)
print('Total Number of tweets to be scraped:',tweet_c)

KeyWord to be searched:  LIC Policy
Starting date:  2022-10-15
Ending date: 2023-01-23
Total Number of tweets to be scraped: 1000


## Query by Text Search

Using TwitterSearchScraper from snscrape, data will be scraped by providing Keyword, Start date, End date and Number of tweets and stored in a DataFrame

In [31]:
for i,tweet in enumerate(sntwitter.TwitterSearchScraper(f'{word} + since:{start} until:{end}').get_items()):
    if i>=tweet_c:
        break
    tweets_list.append([ tweet.id, tweet.date,  tweet.content, tweet.lang, tweet.user.username, tweet.replyCount, tweet.retweetCount,tweet.likeCount, tweet.source, tweet.url ])
tweets_df = pd.DataFrame(tweets_list, columns=['ID','Date','Content', 'Language', 'Username', 'ReplyCount', 'RetweetCount', 'LikeCount','Source', 'Url'])

  tweets_list.append([ tweet.id, tweet.date,  tweet.content, tweet.lang, tweet.user.username, tweet.replyCount, tweet.retweetCount,tweet.likeCount, tweet.source, tweet.url ])


Lets look at the dataframe, a samle of 25 records are displayed out of 1000 records

In [32]:
print(tweets_df.shape)
print(tweets_df.head(25))

(3001, 10)
                     ID                      Date  \
0   1617248306635362304 2023-01-22 19:50:17+00:00   
1   1617248167631667200 2023-01-22 19:49:44+00:00   
2   1617214223414874114 2023-01-22 17:34:51+00:00   
3   1617174754905182209 2023-01-22 14:58:01+00:00   
4   1617150270248742915 2023-01-22 13:20:44+00:00   
5   1617143358803292166 2023-01-22 12:53:16+00:00   
6   1617099461733539848 2023-01-22 09:58:50+00:00   
7   1617098030913835008 2023-01-22 09:53:09+00:00   
8   1617083043663040513 2023-01-22 08:53:36+00:00   
9   1617080594403594241 2023-01-22 08:43:52+00:00   
10  1617078339738669064 2023-01-22 08:34:54+00:00   
11  1617073978342002688 2023-01-22 08:17:34+00:00   
12  1617043594837393409 2023-01-22 06:16:50+00:00   
13  1617030820736163840 2023-01-22 05:26:05+00:00   
14  1617021339163922434 2023-01-22 04:48:24+00:00   
15  1617013090545025025 2023-01-22 04:15:38+00:00   
16  1617000208034066433 2023-01-22 03:24:26+00:00   
17  1616997291109552128 2023-01-22 

Lets store the scraped data in to Database, by adding an additional field KeyWord_or_Hashtag_Timestamp

In [16]:
coll=word
coll=coll.replace(' ','_')+'_Tweets' # collection name is same as Keyword with spaces replaced as _
mycoll=mydb[coll]         # New collection is created
dic=tweets_df.to_dict('records')  # Tweets dataframe is concerted into Dictionary  
mycoll.insert_many(dic)  # All the Tweets are inserted
ts = time.time()         # A new column is added KeyWord_or_Hashtag_Timestamp
mycoll.update_many({}, {"$set": {"KeyWord_or_Hashtag": word+str(ts)}}, upsert=False, array_filters=None)

<pymongo.results.UpdateResult at 0x1351d882730>

This is just a step by step demo for Twitter data scraping and uploading the data into database, In the next file I have implemented a solution that will be able to scrape the twitter data and store that in the database and allow the user to download the data with multiple data formats using streamlit app


All the tweets are inserted into database, lets have a look at documents in the collection

In [15]:
for i in mycoll.find():
    print(i)
    print()

{'_id': ObjectId('63ceb8c0460a98917f0c91fb'), 'ID': 1617248306635362304, 'Date': datetime.datetime(2023, 1, 22, 19, 50, 17), 'Content': 'Lic new policy Launched more details contact me\n9908638616 https://t.co/zhAxtsh52a', 'Language': 'en', 'Username': 'sahukariNarasi1', 'ReplyCount': 0, 'RetweetCount': 0, 'LikeCount': 0, 'Source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'Url': 'https://twitter.com/sahukariNarasi1/status/1617248306635362304', 'KeyWord_or_Hashtag': 'LIC Policy1674492096.425073'}

{'_id': ObjectId('63ceb8c0460a98917f0c91fc'), 'ID': 1617248167631667200, 'Date': datetime.datetime(2023, 1, 22, 19, 49, 44), 'Content': '@MSNBC If police come in contact call into dispatch &amp; supervisor veh lic plate description and location and call for immediate backup &amp; follow your police dept policy on felony traffic stops (any normal traffic stop tell person why being stopped,it saves lives)', 'Language': 'en', 'Username': 'HobbsLolita

In [24]:
for coll in mydb.list_collection_names():
    print(coll)
    coll=mydb[coll]
    coll.drop()