Update search to use V1.1 API methods and results #248

Closed
@inactivist wants to merge 8 commits

Initial changes required to make search work with the Twitter v1.1 API. Please refer to the Twitter 1.1 API search/tweets documentation and errata for implementation details.

Please review and comment. This works in my limited testing, but I cannot run the full test suite in my environment.

  • search(): fix path (URL) to use the 1.1 API URL
  • search(): allowed_param changes: rpp -> count, add include_entities; allowed_param entries are now in alphabetical order.
  • search_api flag is no longer needed or used.
  • SearchResults class: pull list results from ['statuses']; use parse_datetime() rather than parse_search_datetime(), since the date format is now standardized (see the format comparison below).
  • utils.parse_search_datetime() no longer used and has been deleted.
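
For reference, the two created_at formats side by side (example timestamps are illustrative):

    import time

    # Old search API timestamp, handled by the now-deleted parse_search_datetime():
    time.strptime('Mon, 25 Feb 2013 18:23:17 +0000', '%a, %d %b %Y %H:%M:%S +0000')

    # v1.1 search results use the same created_at format as the rest of the API,
    # so the shared parse_datetime() suffices:
    time.strptime('Mon Feb 25 18:23:17 +0000 2013', '%a %b %d %H:%M:%S +0000 %Y')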

Cursor support

search does not support reverse Cursor pagination at this time.

Breaking changes

  • The parsed SearchResult structures have changed due to changes in the underlying API. models.SearchResult is now a subclass of models.Status -- the results list is now parsed as a list of models.Status objects, and the search metadata is contained in the search_metadata attribute of API.search()'s returned value.
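
A sketch of the new usage (credentials and query values are placeholders):

    import tweepy

    auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
    api = tweepy.API(auth)

    results = api.search(q='tweepy', count=5)
    for status in results:
        # Each item is now a models.Status (SearchResult subclasses it).
        print(status.text)

    # Search metadata moved off the individual results onto the returned list.
    print(results.search_metadata.completed_in)
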
inactivist added some commits
@inactivist inactivist Fixes for issue #225 and #247
 - First pass works in limited testing.
 - search() fix path (URL) to use 1.1 API url
 - search() allowed_param changes: rpp -> count, adding include_entities
 - search_api flag no longer needed or used.
 - SearchResults class: Pull list results from ['statuses'], use parse_datetime() rather than parse_search_datetime() since the date format is now standardized.
 - utils.parse_search_datetime() no longer used and has been deleted.
60be72e
@inactivist inactivist Adding search_metadata parsing and basic unit tests for search() method. 39b774f
@inactivist inactivist 'page' is no longer a valid parameter for V1.1 search API. 1e374a7
@inactivist inactivist Adding SearchResultsIterator for v1.1 API Cursor capability 17ceab8
@inactivist inactivist search_host is no longer used with v1.1 API support. 23ba7e5
@inactivist inactivist Simplifying search() forward paging. f519aa3
@inactivist inactivist Fix typo in previous commit. 5b00921
@inactivist inactivist Move page limit test (saves an extra invocation at end.) b6f56c7
@inactivist

I think this is RTBC pending testing. search() now supports Cursor functionality (forward only) and works well in my testing. I've simplified the paging mechanism in commits f519aa3 and 5b00921 -- Tweepy can now perform 'endless' pagination through search results (subject to API rate limits, of course).
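
For example (a sketch, assuming an authenticated api object; not taken from the test suite):

    from tweepy import Cursor

    # Forward-only paging: iteration keeps requesting older tweets until the
    # API returns an empty page (or the rate limit intervenes).
    for status in Cursor(api.search, q='python').items(200):
        print(status.text)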

Question: do I need to implement reverse paging (using the prev iterator method)?

I can see a need to go back, but I'm not sure if anyone else does. Are there any code samples demonstrating reverse pagination, or is it just a side effect of reverse iteration through the Cursor's result iterable? I'd like to assemble a few test cases...

@rehandalal

@joshthecoder I really need this!

@Khutuck

Does search() currently support API 1.1? I can't find any information about it online. I'm a newbie, and I believe I'm getting API 1.0 results.

@joshthecoder
tweepy member

Would it be possible to create a more generic iterator for max_id and since_id?
This way we could use it with other endpoints such as the timelines.
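
Something like this, perhaps (a rough standalone sketch, not tweepy's actual iterator classes; method is any bound API call that accepts max_id):

    def max_id_pages(method, **kwargs):
        """Yield successive pages of results, walking backwards through ids.

        Would work for any endpoint that accepts max_id (search, timelines, ...).
        """
        while True:
            items = method(**kwargs)
            if not items:
                return
            yield items
            # Ask only for tweets strictly older than the oldest one seen so far.
            kwargs['max_id'] = min(item.id for item in items) - 1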

@inactivist

@joshthecoder I suppose so but I'm not sure what I need to do here. Can you provide guidance?

@nirg

@joshthecoder @inactivist I created the more generic max_id iterator in pull request #282. In particular, check out MaxIdIterator in cursor.py.

@rehandalal

@joshthecoder @inactivist is there anything I can help with here to speed this along?

@nirg

@rehandalal if you need a solution to this asap then check out my pull request #282 where I solved this issue already.

@domino14

Has this been merged in? I don't see it in master, and search is still hitting v1; I don't understand why it's still working.

@danielsamuels

With the v1 API being turned off next Tuesday (11th June), this fix really needs to be merged into the master branch. Search is currently broken for all users of tweepy.

@rehandalal

@joshthecoder this has now become pretty urgent!

@inactivist

@domino14 @danielsamuels @rehandalal : I think @joshthecoder is waiting for me to make additional changes, though I don't have much time to deal with it right now.

@nirg has a pull request which claims to resolve everything, but I've not tried it.

In any case, my fork's fix-v2-search branch seems to work in all my test cases; I've been using it for quite a while but I'm not using Cursor features much. (Here's a live test site I've built using my fork, and it uses the v1.1 search API.)

@danielsamuels

@inactivist Unfortunately for me and my organisation, it's a little too late now. The API is changing over next week and we have sites to maintain. I've had to make the decision to switch to Twython, which is a shame because I liked tweepy.

@inactivist

@danielsamuels I understand.

@nirg

@danielsamuels if you still want to use tweepy, you can use my forked version until it gets pulled into the main repo by @joshthecoder. Search v1.1 is working there with proper cursoring/pagination (and everything is tested).

@nucflash

@nirg when I use your version and run: for item in Cursor(api.search, q="the").items(100), I get:
TypeError: argument of type 'SearchResult' is not iterable

I get the same error with the search example in your code. Any idea how to fix this?

@Oire

@danielsamuels, do you believe it's possible to quickly switch to Twython? Thanks!

@danielsamuels

@Oire Yes, took me 10 mins max earlier.

@domino14

Any word on when this can be merged in (hopefully before Tuesday)? Any way we can help?

@btipling

Switching to @inactivist's fork. Old search API has stopped working today.

@nirg

@nucflash Check it out: I just fixed a minor issue in my fork of tweepy and tested it; search should work properly with API v1.1.

@nucflash

@nirg, many thanks! It works, and I'm using it.

@Oire

@nirg, I don't know if you are subscribed to the Google group. Josh said there that he would transfer ownership to someone willing to continue developing Tweepy. Would you consider taking this on?

@Khutuck

@nirg I've been using your version for a few weeks, and it seems to be working. I don't use cursor methods, though.

@nirg

@Khutuck glad to hear that you find my fork useful.

@Oire just joined the Google group; sorry to see Josh retiring. As I said there, I'll continue contributing to the project, but I can't take on its entire maintenance.

@joshthecoder
tweepy member

I should have some time this weekend to fix the search issue. I don't imagine it will take that long.
While I am stepping down as maintainer, I do hope someone or a group of people can step up in my place.

@drevicko drevicko commented on the diff
tweepy/api.py
((6 lines not shown))
cache=None, secure=True, api_root='/1.1', search_root='',
retry_count=0, retry_delay=0, retry_errors=None,
parser=None):
self.auth = auth_handler
self.host = host
- self.search_host = search_host
self.api_root = api_root
self.search_root = search_root
@drevicko
drevicko added a note

No need for search_root now I think.

@drevicko drevicko commented on the diff
tweepy/api.py
((7 lines not shown))
payload_type = 'search_result', payload_list = True,
- allowed_param = ['q', 'lang', 'locale', 'rpp', 'page', 'since_id', 'geocode', 'show_user', 'max_id', 'since', 'until', 'result_type']
+ allowed_param = ['count', 'cursor', 'geocode', 'include_entities', 'lang', 'locale', 'max_id', 'q', 'result_type', 'show_user', 'since', 'since_id', 'until'],
@drevicko
drevicko added a note

I've been using the same set of allowed_param in the hack I put together (which is working well, at least with my app) except for these:

  • "show_user" : have you tried to see if it actually does anything? I couldn't see it mentioned in the api docs - perhaps it's ignored and perhaps at a later date it'll generate an error??
  • "cursor" : I think tweepy complains if you include this in a request (and it's not mentioned in the 1.1 search api docs).
  • 'since' : the api docs don't mention it, but maybe it works? has someone tried it?
@drevicko
drevicko added a note

I just tried it out - "show_user" and "cursor" are ignored (users are always shown).

"since" is more interesting - since:2014-06-04 can be added to the query string q. Using since=2014-06-04 has the same effect. That's probably mentioned in the api docs somewhere, but I havn't seen it there yet (admittedly I'm not known to be thorough with reading the docs ;)

@drevicko drevicko commented on the diff
tweepy/cursor.py
((17 lines not shown))
+ self.oldest_id = 0
+
+ def next(self):
+ self.current_page += 1
+
+ if (self.limit > 0 and self.current_page > self.limit):
+ raise StopIteration
+
+ if self.current_page > 1:
+ self.kargs['max_id'] = self.oldest_id
+
+ items = self.method(*self.args, **self.kargs)
+ if len(items) == 0:
+ raise StopIteration
+ # Stash last result's oldest id for next page access
+ self.oldest_id = items[-1].id - 1
@drevicko
drevicko added a note

There is an assumption here that the search results are sorted by id - this may not always be true. It's probably better to use the "next_results" query string in the "search_metadata" than to rely on sorted results. Alternatively, we could extract the list of ids and use its max().

Per this discussion on dev.twitter.com -- next_results isn't always reliable.

(Yeah, I know the pull request is closed, but wanted to document my reasons for not relying on next_results.)
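
The alternative as a sketch (strictly, the next max_id needs the minimum id on the page, since paging moves toward older tweets):

    def next_max_id(statuses):
        """Oldest id on the page, minus one, regardless of result ordering."""
        return min(status.id for status in statuses) - 1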

@drevicko drevicko commented on the diff
tweepy/cursor.py
((11 lines not shown))
+ We'll use a simplistic forward paging scheme, by storing the oldest id
+ returned from the last search and using that in the max_id on the next
+ page request.
+ """
+ def __init__(self, method, args, kargs):
+ PageIterator.__init__(self, method, args, kargs)
+ self.oldest_id = 0
+
+ def next(self):
+ self.current_page += 1
+
+ if (self.limit > 0 and self.current_page > self.limit):
+ raise StopIteration
+
+ if self.current_page > 1:
+ self.kargs['max_id'] = self.oldest_id
@drevicko
drevicko added a note

Twitter provides "next_results" in the returned "search_metadata" - it may be a better place to get the next page's 'max_id'.

Per this discussion on dev.twitter.com -- next_results isn't always reliable.

(Yeah, I know the pull request is closed, but wanted to document my reasons for not relying on next_results.)

@drevicko drevicko commented on the diff
tweepy/models.py
((22 lines not shown))
@classmethod
def parse_list(cls, api, json_list, result_set=None):
results = ResultSet()
- results.max_id = json_list.get('max_id')
- results.since_id = json_list.get('since_id')
- results.refresh_url = json_list.get('refresh_url')
- results.next_page = json_list.get('next_page')
- results.results_per_page = json_list.get('results_per_page')
- results.page = json_list.get('page')
- results.completed_in = json_list.get('completed_in')
- results.query = json_list.get('query')
-
- for obj in json_list['results']:
+ search_metadata = json_list.get('search_metadata')
+ if search_metadata:
@drevicko
drevicko added a note

Twitter says it'll always be there. If we follow my suggestion above and use "next_results" to get the 'max_id' for the next page, we'd best throw an exception if 'search_metadata' isn't present...

Per this discussion on dev.twitter.com -- next_results isn't always reliable.
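
A sketch of the stricter variant (using TweepError from tweepy.error):

    from tweepy.error import TweepError

    def require_search_metadata(json_response):
        """Fail loudly, rather than silently, when the metadata is absent."""
        try:
            return json_response['search_metadata']
        except KeyError:
            raise TweepError('search_metadata missing from search response')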

@drevicko drevicko commented on the diff
tweepy/models.py
((27 lines not shown))
- results.since_id = json_list.get('since_id')
- results.refresh_url = json_list.get('refresh_url')
- results.next_page = json_list.get('next_page')
- results.results_per_page = json_list.get('results_per_page')
- results.page = json_list.get('page')
- results.completed_in = json_list.get('completed_in')
- results.query = json_list.get('query')
-
- for obj in json_list['results']:
+ search_metadata = json_list.get('search_metadata')
+ if search_metadata:
+ # Convert smd dict to object with properties. Use Model class
+ # for convenience but this could be any generic object.
+ t = Model()
+ t.__dict__.update(search_metadata)
+ results.search_metadata = t
@drevicko
drevicko added a note

Just to put it out here for thought: in my version, I put the attributes of 'search_metadata' directly into the results object, in keeping with the way the previous SearchResult class did it. This made updating my application code a little easier...
To be honest, though, I think I prefer keeping a structure similar to the Twitter result JSON (i.e. the way it's done here).
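
The two shapes side by side (illustrative attribute access, assuming results came from api.search()):

    # Nested, as in this pull request:
    results.search_metadata.completed_in

    # Flattened onto the result set, as the old SearchResult parse_list did:
    results.completed_in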

@drevicko drevicko commented on the diff
tweepy/api.py
((11 lines not shown))
)
- search.pagination_mode = 'page'
+ search.pagination_mode = 'search'
@drevicko
drevicko added a note

While we're changing things, perhaps it'd be better to set the pagination mode in bind_api(), since that's where it gets set in all other cases. We could pass search_api = True or pagination_mode = "search" to bind_api(). We'd best be careful if "cursor" remains in allowed_param if we do this (as bind_api() will set pagination_mode = "cursor" in that case).

@nirg
nirg added a note
@drevicko
drevicko added a note

Fair cop - MaxIdIterator makes more sense as a name (and could be used in more contexts too).
So we should pass pagination_mode = "max_id" to bind_api() and set the pagination mode there.

@inactivist

@nirg, I'd prefer that your pull request #282 dealt with a single issue (search v1.1 support, for example). Multiple mods covering multiple subjects tend to make it harder to evaluate the proposed changes.

I prefer to take a separate branch for each major change.

@joshthecoder
tweepy member

I'd like to just get a simple fix in place first and we can worry about supporting Cursors later.
Here's my pull request on changing the endpoint to version 1.1: #296

@joshthecoder
tweepy member

Merged a fix switching search to the v1.1 endpoint and added a new ID-based cursor. Thank you all for your patches; they made the changes very easy. :)

@joshthecoder
tweepy member

And if you notice any bugs with search please let me know and I'll get them addressed.

Showing with 73 additions and 50 deletions.
  1. +16 −1 tests.py
  2. +5 −6 tweepy/api.py
  3. +2 −8 tweepy/binder.py
  4. +35 −0 tweepy/cursor.py
  5. +15 −23 tweepy/models.py
  6. +0 −12 tweepy/utils.py
tests.py
@@ -298,7 +298,15 @@ def testsavedsearches(self):
self.api.destroy_saved_search(s.id)
def testsearch(self):
- self.api.search('tweepy')
+ count = 5
+ q = 'tweepy'
+ s = self.api.search(q=q, count=count)
+ self.assertEqual(len(s), count)
+ self.assertIsNotNone(getattr(s, 'search_metadata'))
+ self.assertEqual(s.search_metadata.count, count)
+ self.assertEqual(s.search_metadata.query, q)
+ # TODO: Test paging?
+ # TODO: Test other search_metadata attributes?
def testgeoapis(self):
def place_name_in_list(place_name, place_list):
@@ -350,6 +358,13 @@ def testcursorcursorpages(self):
pages = list(Cursor(self.api.followers_ids, 'twitter').pages(5))
self.assert_(len(pages) == 5)
+ def testcursorsearch(self):
+ items = list(Cursor(self.api.search, q='twitter').items(30))
+ self.assert_(len(items) == 30)
+
+ pages = list(Cursor(self.api.search, q='twitter').pages(5))
+ self.assert_(len(pages) == 5)
+
class TweepyAuthTests(unittest.TestCase):
def testoauth(self):
tweepy/api.py
@@ -15,13 +15,12 @@ class API(object):
"""Twitter API"""
def __init__(self, auth_handler=None,
- host='api.twitter.com', search_host='search.twitter.com',
+ host='api.twitter.com',
cache=None, secure=True, api_root='/1.1', search_root='',
retry_count=0, retry_delay=0, retry_errors=None,
parser=None):
self.auth = auth_handler
self.host = host
- self.search_host = search_host
self.api_root = api_root
self.search_root = search_root
self.cache = cache
@@ -620,12 +619,12 @@ def test(self):
""" search """
search = bind_api(
- search_api = True,
- path = '/search.json',
+ path = '/search/tweets.json',
payload_type = 'search_result', payload_list = True,
- allowed_param = ['q', 'lang', 'locale', 'rpp', 'page', 'since_id', 'geocode', 'show_user', 'max_id', 'since', 'until', 'result_type']
+ allowed_param = ['count', 'cursor', 'geocode', 'include_entities', 'lang', 'locale', 'max_id', 'q', 'result_type', 'show_user', 'since', 'since_id', 'until'],
+ require_auth = True
)
- search.pagination_mode = 'page'
+ search.pagination_mode = 'search'
""" trends/daily """
trends_daily = bind_api(
tweepy/binder.py
@@ -42,10 +42,7 @@ def __init__(self, api, args, kargs):
self.build_parameters(args, kargs)
# Pick correct URL root to use
- if self.search_api:
- self.api_root = api.search_root
- else:
- self.api_root = api.api_root
+ self.api_root = api.api_root
# Perform any path variable substitution
self.build_path()
@@ -55,10 +52,7 @@ def __init__(self, api, args, kargs):
else:
self.scheme = 'http://'
- if self.search_api:
- self.host = api.search_host
- else:
- self.host = api.host
+ self.host = api.host
# Manually set Host header to fix an issue in python 2.5
# or older where Host is set including the 443 port.
tweepy/cursor.py
@@ -11,6 +11,8 @@ def __init__(self, method, *args, **kargs):
if hasattr(method, 'pagination_mode'):
if method.pagination_mode == 'cursor':
self.iterator = CursorIterator(method, args, kargs)
+ elif method.pagination_mode == 'search':
+ self.iterator = SearchResultsIterator(method, args, kargs)
else:
self.iterator = PageIterator(method, args, kargs)
else:
@@ -126,3 +128,36 @@ def prev(self):
self.count -= 1
return self.current_page[self.page_index]
+
+class SearchResultsIterator(PageIterator):
+ """ Paginate through search results.
+
+ GET Search results do not explicitly support pagination.
+ See: https://dev.twitter.com/issues/513
+
+ We'll use a simplistic forward paging scheme, by storing the oldest id
+ returned from the last search and using that in the max_id on the next
+ page request.
+ """
+ def __init__(self, method, args, kargs):
+ PageIterator.__init__(self, method, args, kargs)
+ self.oldest_id = 0
+
+ def next(self):
+ self.current_page += 1
+
+ if (self.limit > 0 and self.current_page > self.limit):
+ raise StopIteration
+
+ if self.current_page > 1:
+ self.kargs['max_id'] = self.oldest_id
+
+ items = self.method(*self.args, **self.kargs)
+ if len(items) == 0:
+ raise StopIteration
+ # Stash last result's oldest id for next page access
+ self.oldest_id = items[-1].id - 1
+ return items
+
+ def prev(self):
+ raise TweepError('search does not support reverse pagination.')
tweepy/models.py
@@ -4,7 +4,7 @@
from tweepy.error import TweepError
from tweepy.utils import parse_datetime, parse_html_value, parse_a_href, \
- parse_search_datetime, unescape_html
+ unescape_html
class ResultSet(list):
@@ -209,33 +209,25 @@ def destroy(self):
return self._api.destroy_saved_search(self.id)
-class SearchResult(Model):
+class SearchResult(Status):
+ """ Search results in V1.1 are now the same as any other status
+ data structure. Therefore we'll just derive from Status class.
- @classmethod
- def parse(cls, api, json):
- result = cls()
- for k, v in json.items():
- if k == 'created_at':
- setattr(result, k, parse_search_datetime(v))
- elif k == 'source':
- setattr(result, k, parse_html_value(unescape_html(v)))
- else:
- setattr(result, k, v)
- return result
+ No need to define a special parse() method.
+ """
@classmethod
def parse_list(cls, api, json_list, result_set=None):
results = ResultSet()
- results.max_id = json_list.get('max_id')
- results.since_id = json_list.get('since_id')
- results.refresh_url = json_list.get('refresh_url')
- results.next_page = json_list.get('next_page')
- results.results_per_page = json_list.get('results_per_page')
- results.page = json_list.get('page')
- results.completed_in = json_list.get('completed_in')
- results.query = json_list.get('query')
-
- for obj in json_list['results']:
+ search_metadata = json_list.get('search_metadata')
+ if search_metadata:
+ # Convert smd dict to object with properties. Use Model class
+ # for convenience but this could be any generic object.
+ t = Model()
+ t.__dict__.update(search_metadata)
+ results.search_metadata = t
+
+ for obj in json_list['statuses']:
results.append(cls.parse(api, obj))
return results
tweepy/utils.py
@@ -34,18 +34,6 @@ def parse_a_href(atag):
return atag[start:end]
-def parse_search_datetime(string):
- # Set locale for date parsing
- locale.setlocale(locale.LC_TIME, 'C')
-
- # We must parse datetime this way to work in python 2.4
- date = datetime(*(time.strptime(string, '%a, %d %b %Y %H:%M:%S +0000')[0:6]))
-
- # Reset locale back to the default setting
- locale.setlocale(locale.LC_TIME, '')
- return date
-
-
def unescape_html(text):
"""Created by Fredrik Lundh (http://effbot.org/zone/re-sub.htm#unescape-html)"""
def fixup(m):