Userprofile fixes (#173)
* move scripts with example code to get twitter user data to examples folder

* user.py: change default values of __init__

* query.py: remove redundant return

* update setup.py, changelog, LICENSE, version number

* README: add section for userprofile scraping info

* main.py: add command line argument for user profile scraping
taspinar committed Jun 15, 2019
1 parent 911f682 commit 0b37442
Showing 11 changed files with 71 additions and 18 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2016 by Ahmet Taspinar (taspinar@gmail.com)
Copyright (c) 2016-2019 by Ahmet Taspinar (taspinar@gmail.com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
39 changes: 33 additions & 6 deletions README.rst
@@ -37,9 +37,27 @@ access Tweets written in the **past 7 days**. This is a major bottleneck
for anyone looking for older past data to make a model from. With
TwitterScraper there is no such limitation.

Per Tweet it scrapes the following information: + Username and Full Name
+ Tweet-id + Tweet-url + Tweet text + Tweet html + Tweet timestamp + No. of likes +
No. of replies + No. of retweets
Per Tweet it scrapes the following information:
+ Tweet-id
+ Tweet-url
+ Tweet text
+ Tweet html
+ Tweet timestamp
+ Tweet No. of likes
+ Tweet No. of replies
+ Tweet No. of retweets
+ Username
+ User Full Name
+ User ID
+ Date user joined
+ User location (if filled in)
+ User blog (if filled in)
+ User No. of tweets
+ User No. of following
+ User No. of followers
+ User No. of likes
+ User No. of lists


2. Installation and Usage
=========================
@@ -118,7 +136,6 @@ Below is an example of how twitterscraper can be used:

``twitterscraper Trump -l 100 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``

``twitterscraper realDonaldTrump -u -o tweets_username.json``


2.2.2 Examples of advanced queries
@@ -149,14 +166,17 @@ Also see `Twitter's Standard operators <https://developer.twitter.com/en/docs/tw
2.2.3 Examples of scraping user pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also scraped all tweets written by retweetet by a specific user. This can be done by adding the boolean argument ``-u / --user`` argument to the query.
You can also scrape all tweets written or retweeted by a specific user. This can be done by adding the boolean ``-u / --user`` argument to the query.
If this argument is used, the query should be equal to the username.

Here is an example of scraping a specific user:

``twitterscraper realDonaldTrump -u -o tweets_username.json``

This does not work in combination with ``-p``, ``-bd``, or ``-ed`` but it is the only way to scrape for retweets.
This does not work in combination with ``-p``, ``-bd``, or ``-ed``.

The main difference from the example "search for tweets from a specific user" in section 2.2.2 is that this method really scrapes
all tweets from a profile page (including retweets). The example in 2.2.2 scrapes the results from the search page (excluding retweets).
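
From within Python, the same profile-page scraping is exposed as ``query_tweets_from_user`` (exported in ``twitterscraper/__init__.py``, see below). A minimal sketch; the ``limit`` keyword argument is an assumption, so check ``query.py`` for the authoritative signature::

    from twitterscraper import query_tweets_from_user

    # Scrape tweets *and* retweets from one profile page; ``limit`` is
    # assumed to cap the number of tweets returned.
    tweets = query_tweets_from_user("realDonaldTrump", limit=100)
    for tweet in tweets:
        print(tweet.user, tweet.text)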


2.3 From within Python
@@ -188,6 +208,13 @@ You can easily use TwitterScraper from within python:
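
A minimal sketch of such usage, based on the ``query_tweets`` signature visible in ``twitterscraper/query.py`` further down this diff; the ``timestamp`` attribute name is an assumption taken from the field list in this README::

    import datetime as dt
    from twitterscraper import query_tweets

    # All keyword arguments are optional; the defaults are taken from
    # the signature of query_tweets() in query.py.
    tweets = query_tweets("Trump", limit=100,
                          begindate=dt.date(2017, 1, 1),
                          enddate=dt.date(2017, 6, 1),
                          poolsize=20, lang='en')
    for tweet in tweets:
        print(tweet.timestamp, tweet.text)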
A regular search within Twitter will not show you any retweets. The twitterscraper output therefore does not contain any retweets either. To give an example: if user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet, a search for ``#trump2020`` will only show the original tweet. The only way you can scrape for retweets is if you scrape for all tweets of a specific user with the ``-u / --user`` argument.


2.5 Scraping for User Profile information
------------------------------------------
By adding the argument ``--profiles``, twitterscraper will, in addition to the tweets, also scrape the profile information of the users who have written these tweets.
The results will be saved in the file "userprofiles_<filename>".
Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :)
It is also possible to scrape for profile information without scraping for tweets. Examples of this can be found in the examples folder.
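
For example::

    twitterscraper Trump --profiles -o tweets.json

writes the scraped tweets to ``tweets.json`` and the corresponding profiles to ``userprofiles_tweets.json`` (the filename pattern used by ``main.py`` below). From within Python, the per-user lookup is exposed as ``query_user_info``; a minimal sketch, assuming it takes a username and returns a ``User`` object or ``None``::

    from twitterscraper import query_user_info

    for username in ["realDonaldTrump", "BarackObama"]:
        info = query_user_info(username)
        if info:  # assumed to be None when the profile page cannot be parsed
            print(info.user, info.followers, info.date_joined)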


3. Output
=========
8 changes: 8 additions & 0 deletions changelog.txt
@@ -1,5 +1,13 @@
# twitterscraper changelog

# 1.0.0 ( 2019-02-04 )
### Added
- PR #159: scrapes user profile pages for additional information.
### Fixed
- Moved example scripts demonstrating use of get_user_info() functionality to examples folder
- Removed screenshot demonstrating get_user_info() works
- Added command line argument to main.py which calls get_user_info() for all users in list of scraped tweets.

# 0.9.3 ( 2018-11-04 )
### Fixed
- PR #143: cancels query if end-date is earlier than begin-date.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion setup.py
@@ -8,7 +8,7 @@

setup(
name='twitterscraper',
version='0.9.3',
version='1.0.0',
description='Tool for scraping Tweets',
url='https://github.com/taspinar/twitterscraper',
author=['Ahmet Taspinar', 'Lasse Schuirmann'],
7 changes: 5 additions & 2 deletions twitterscraper/__init__.py
@@ -1,15 +1,18 @@
# TwitterScraper
# Copyright 2016-2018 Ahmet Taspinar
# Copyright 2016-2019 Ahmet Taspinar
# See LICENSE for details.
"""
Twitter Scraper tool
"""

__version__ = '0.9.3'
__version__ = '1.0.0'
__author__ = 'Ahmet Taspinar'
__license__ = 'MIT'


from twitterscraper.query import query_tweets
from twitterscraper.query import query_tweets_from_user
from twitterscraper.query import query_user_info
from twitterscraper.tweet import Tweet
from twitterscraper.user import User
from twitterscraper.ts_logger import logger as ts_logger
16 changes: 14 additions & 2 deletions twitterscraper/main.py
@@ -7,7 +7,9 @@
import collections
import datetime as dt
from os.path import isfile
from twitterscraper.query import query_tweets, query_tweets_from_user
from twitterscraper.query import query_tweets
from twitterscraper.query import query_tweets_from_user
from twitterscraper.query import query_user_info
from twitterscraper.ts_logger import logger


@@ -58,6 +60,10 @@ def main():
parser.add_argument("-u", "--user", action='store_true',
help="Set this flag to if you want to scrape tweets from a specific user"
"The query should then consist of the profilename you want to scrape without @")
parser.add_argument("--profiles", action='store_true',
help="Set this flag to if you want to scrape profile info of all the users where you"
"have previously scraped from. After all of the tweets have been scraped it will start"
"a new process of scraping profile pages.")
parser.add_argument("--lang", type=str, default=None,
help="Set this flag if you want to query tweets in \na specific language. You can choose from:\n"
"en (English)\nar (Arabic)\nbn (Bengali)\n"
@@ -112,5 +118,11 @@ def main():
x.text, x.html])
else:
json.dump(tweets, output, cls=JSONEncoder)
if args.profiles and tweets:
list_users = list(set([tweet.user for tweet in tweets]))
list_users_info = [query_user_info(elem) for elem in list_users]
filename = 'userprofiles_' + args.output
with open(filename, "w", encoding="utf-8") as output:
json.dump(list_users_info, output, cls=JSONEncoder)
except KeyboardInterrupt:
logger.info("Program interrupted by user. Quitting...")
7 changes: 5 additions & 2 deletions twitterscraper/query.py
@@ -180,6 +180,10 @@ def query_tweets_once(*args, **kwargs):

def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang=''):
no_days = (enddate - begindate).days

if(no_days < 0):
sys.exit('Begin date must occur before end date.')

if poolsize > no_days:
# Since we are assigning each pool a range of dates to query,
# the number of pools should not exceed the number of dates.
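
An inverted date range now aborts immediately instead of spawning query pools. A minimal illustration of the new guard (hypothetical call)::

    import datetime as dt
    from twitterscraper import query_tweets

    # (enddate - begindate).days is negative here, so this call exits
    # with 'Begin date must occur before end date.'
    query_tweets("Trump",
                 begindate=dt.date(2019, 6, 1),
                 enddate=dt.date(2019, 1, 1))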
@@ -253,8 +257,7 @@ def query_user_page(url, retry=10):
response = requests.get(url, headers=HEADER)
html = response.text or ''

user = User()
user_info = user.from_html(html)
user_info = User.from_html(html)
if not user_info:
return None

Binary file not shown.
8 changes: 4 additions & 4 deletions twitterscraper/user.py
@@ -2,7 +2,7 @@


class User:
def __init__(self, user=None, full_name="", location="", blog="", date_joined=None, id=None, tweets=0,
def __init__(self, user="", full_name="", location="", blog="", date_joined="", id="", tweets=0,
following=0, followers=0, likes=0, lists=0):
self.user = user
self.full_name = full_name
@@ -15,8 +15,8 @@ def __init__(self, user=None, full_name="", location="", blog="", date_joined=None,
self.followers = followers
self.likes = likes
self.lists = lists


@classmethod
def from_soup(self, tag_prof_header, tag_prof_nav):
"""
Returns the scraped user data from a twitter user page.
@@ -85,7 +85,7 @@ def from_soup(self, tag_prof_header, tag_prof_nav):
self.lists = int(lists)
return(self)


@classmethod
def from_html(self, html):
soup = BeautifulSoup(html, "lxml")
user_profile_header = soup.find("div", {"class":'ProfileHeaderCard'})
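
Since ``from_soup`` and ``from_html`` are now classmethods, no ``User`` instance is needed any more, matching the updated call site in ``query_user_page`` above. A minimal sketch (the saved HTML file is hypothetical)::

    from twitterscraper.user import User

    # Parse a previously saved Twitter profile page.
    with open("profile_page.html", encoding="utf-8") as f:
        user_info = User.from_html(f.read())

    if user_info:  # from_html returns a falsy value when parsing fails
        print(user_info.full_name, user_info.followers, user_info.lists)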
