# Twitter Scraping
The aim of this notebook is to extract tweets using the [twitter][1] and [searchtweets][2] packages according to the needs of this project. The only prerequisite for the first part is a Twitter developer account. The second part requires a premium developer account.

Later in this project we analyse these tweets with respect to time. Therefore, our primary concern is to acquire the population of tweets in a specific time period, while the frequency and the volume of the data are of minor importance.

[1]: https://pypi.org/project/python-twitter/
[2]: https://github.com/twitterdev/search-tweets-python

### 1 Tweet Extraction from Profiles
In this section we focus on acquiring tweets from various profiles. The twitter module offers a comprehensive framework that works with a free developer account. First, we need to import the necessary packages.

#### 1.1 Import Packages
We use the pandas package to save our results in `.json` format, although the [json][1] package can be used instead.

[1]: https://docs.python.org/3/library/json.html
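
As a rough illustration of the json alternative, the sketch below writes a list of dictionaries to disk and reads it back. The records and the helper names `save_records` and `load_records` are made up for this example and are not part of the notebook.

```python
import json

# Stand-in records shaped like the dictionaries built later in this notebook.
records = [
    {'id': 1, 'text': 'first tweet', 'hashtags': ['a', 'b'], 'media': False},
    {'id': 2, 'text': 'second tweet', 'hashtags': [], 'media': True},
]

def save_records(records, file_name):
    """Write a list of dictionaries to a JSON file."""
    with open(file_name, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

def load_records(file_name):
    """Read the list of dictionaries back from disk."""
    with open(file_name, encoding='utf-8') as f:
        return json.load(f)
```

Calling `save_records(records, 'example.json')` followed by `load_records('example.json')` should return the same list.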

In [11]:
import twitter
import pandas as pd

#### 1.2 Account Authentication
After importing the necessary packages, we store our Twitter developer credentials. Note that the following credentials are not valid; they are shown only to illustrate the format.

In [14]:
ACCESS_TOKEN    = '479738713-91JwRgHsr7wmqLCJQIkmgE04TiaiJHoEKhozgOwx'
ACCESS_SECRET   = 'cDPoalGJx2wD6iq7SmBSz9EG2o2zjhL1M6qRBjaB5HwVk'
CONSUMER_KEY    = 'TMLKPLjFsIxhT9smhiyAKA14p'
CONSUMER_SECRET = 'onJDsMZBfNy0SF7VMMX1SHY6HPrw5MuOqimkaAtQzHGdUVnJZq'

t = twitter.Api(consumer_key=CONSUMER_KEY,
                consumer_secret=CONSUMER_SECRET,
                access_token_key=ACCESS_TOKEN,
                access_token_secret=ACCESS_SECRET,
                tweet_mode='extended')
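
If you prefer not to keep real credentials in the notebook, one option is to read them from environment variables. The sketch below assumes four hypothetical variable names (`TWITTER_CONSUMER_KEY` and so on); the keyword names it returns are the ones passed to `twitter.Api` above.

```python
import os

# Hypothetical environment variable names, mapped to the twitter.Api keywords.
ENV_TO_KWARG = {
    'TWITTER_CONSUMER_KEY': 'consumer_key',
    'TWITTER_CONSUMER_SECRET': 'consumer_secret',
    'TWITTER_ACCESS_TOKEN': 'access_token_key',
    'TWITTER_ACCESS_SECRET': 'access_token_secret',
}

def credentials_from_env(env=None):
    """Collect the four credentials, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}
```

With the variables set, the client could then be created as `twitter.Api(tweet_mode='extended', **credentials_from_env())`.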

#### 1.3 Extract Maximum ID
Once the developer credentials are set, we are ready to extract the data we need. As mentioned above, the goal is to acquire data from a *number* of profiles in a *certain* time period. The first is achieved by looping over the profile names of interest, while the second is achieved by retrieving the tweet ID at the desired end point in time. The example below can be used to obtain that end point.

In [None]:
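example = t.GetUserTimeline(screen_name='microsoft', count=100)
# GetUserTimeline returns a list of tweets (newest first), so print each ID with its date
for status in example:
    print(status.id, status.created_at)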

#### 1.4 Extract our Data
It's worth noting that tweets are sorted from the most recent to the oldest. Also, sandbox (free) accounts have a cap of 100 tweets per request. However, whenever we set the limit to 100, the twitter module functions tended to skip tweets upon retrieval. Hence, we set the limit to 50 to ensure that we get all the data we need.
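
The paging idea can be tried without calling the API. The sketch below uses a made-up list of integer IDs and a hypothetical `fake_get_timeline` function in place of `GetUserTimeline`. It only shows how `max_id` moves below the oldest ID seen on each page.

```python
# Stand-in for a profile timeline: tweet IDs sorted from newest to oldest.
FAKE_TIMELINE = list(range(120, 0, -1))

def fake_get_timeline(max_id, count):
    """Hypothetical fetcher: return up to `count` IDs not greater than max_id."""
    return [i for i in FAKE_TIMELINE if i <= max_id][:count]

def collect_ids(max_id, stop_id, count=50):
    """Page backwards from max_id until stop_id is passed or nothing is left."""
    collected = []
    while True:
        page = fake_get_timeline(max_id, count)
        if not page:                      # timeline exhausted
            break
        for tweet_id in page:
            if tweet_id < stop_id:        # passed the end point
                return collected
            collected.append(tweet_id)
        max_id = page[-1] - 1             # next page starts below the oldest ID seen
    return collected

ids = collect_ids(max_id=100, stop_id=11)
print(len(ids), ids[0], ids[-1])  # 90 100 11
```

Subtracting one from the oldest ID keeps consecutive pages from overlapping, and the empty-page check keeps the loop from running forever once a timeline has no older tweets.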

Now we are ready to run the algorithm to extract and save the profile tweets in `.json` format.

*Note*: In this case we retrieve data back to the start of 2018, so we can use the `str.endswith()` method to simplify the process. However, it is preferable to parse the date with the `datetime` class from the `datetime` module (`datetime.strptime()`) and compare dates directly.
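
A minimal sketch of the datetime approach, assuming `created_at` strings look like `'Wed Jan 03 13:21:00 +0000 2018'`. The helper names and the `start` value are illustrative only.

```python
from datetime import datetime, timezone

# Assumed layout of the created_at string, e.g. 'Wed Jan 03 13:21:00 +0000 2018'.
CREATED_AT_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def parse_created_at(created_at):
    """Turn the created_at string into a timezone-aware datetime."""
    return datetime.strptime(created_at, CREATED_AT_FORMAT)

def before_start(created_at, start):
    """True when the tweet is older than the start of the period of interest."""
    return parse_created_at(created_at) < start

start = datetime(2018, 1, 1, tzinfo=timezone.utc)
print(before_start('Sun Dec 31 23:59:59 +0000 2017', start))  # True
print(before_start('Wed Jan 03 13:21:00 +0000 2018', start))  # False
print(parse_created_at('Wed Jan 03 13:21:00 +0000 2018').strftime('%Y-%m-%d'))  # 2018-01-03
```

Comparing datetimes makes it possible to stop at any date, not only at a year boundary.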

In [157]:
profile_names = ['intel', 'amd', 'nvidia']

# delete data var in case we run
# the algorithm more than once
try:
    del data
except:
    pass

# loops over profiles
for name in profile_names:
    
    goal = True                  # def goal
    myDict = list()              # one dictionary per tweet
    max_id = 1148939992389079040 # starting tweet
    goal_year = '2017'           # ending point

    
    # loops until desired year
    while goal:
        timeline = t.GetUserTimeline(screen_name=name,
                                     exclude_replies=True,
                                     max_id=max_id, count=50)

        # stop when the timeline has no older tweets
        if not timeline:
            break

        # creates a list of dictionaries
        for tweet in timeline:
            # lists hashtags
            if tweet.hashtags:
                if len(tweet.hashtags) > 1:
                    hashtags = list()
                    for tag in tweet.hashtags:
                        hashtags.append(tag.text)
                else:
                    hashtags = tweet.hashtags[0].text
            else:
                hashtags = 'null'

            # lists media type (e.g. photo, video)
            if tweet.media:
                media = True
            else:
                media = False

            myDict.append({'id': tweet.id,
                           'created_sec': tweet.created_at_in_seconds,
                           'text': tweet.full_text,
                           'hashtags': hashtags,
                           'media': media,
                           'retweets': tweet.retweet_count,
                           'created_at': tweet.created_at
                          })

            goal = not tweet.created_at.endswith(goal_year)
            max_id = tweet.id-1
        
    data = pd.DataFrame(myDict)
    file_name = '%s.json' % name
    
    # write one JSON file per profile
    data.to_json(file_name)
    print('File <%s> is written successfully' % file_name)

File <intel.json> is written successfully
File <amd.json> is written successfully
File <nvidia.json> is written successfully


### 2 Tweet Extraction from Custom Queries
In this section we retrieve a series of tweets using predefined queries. To have full access to historical data, we needed to upgrade the developer account to Premium. The Premium account provides full access to tweets, but it imposes monthly limits on both *requests* and *tweet usage*.

#### 2.1 Import Packages
When we changed to a Premium account, we faced a couple of authentication problems with the twitter package. Hence, we decided to use searchtweets, the package suggested by Twitter, to extract the data. As in the first section, pandas can be replaced by json. Finally, we import the time package to pause between requests, which gives us greater control in case of errors.

In [2]:
from searchtweets import load_credentials, gen_rule_payload, ResultStream, collect_results
import pandas as pd
import time

#### 2.2 Account Authentication
Unlike twitter, searchtweets can read the developer credentials from a YAML file, which keeps them out of the notebook. YAML is a data serialization format, and its files use the `.yaml` extension. Below is an example of how to enter your credentials in a YAML file. If you use Jupyter, you can simply create a `.txt` file and rename its extension to `.yaml`.


```yaml
search_tweets_api:
    account_type: premium
    endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/example.json
    consumer_key: TMLKPLjFsIxhT9smhiyAKA14p
    consumer_secret: onJDsMZBfNy0SF7VMMX1SHY6HPrw5MuOqimkaAtQzHGdUVnJZq
```

Now we are ready to return to Jupyter and load our credentials!

In [33]:
premium_search_args = load_credentials("twitter_keys.yaml",
                                       yaml_key="search_tweets_api",
                                       env_overwrite=False)

Grabbing bearer token from OAUTH


#### 2.3 Queries & Format
The query format follows Twitter's advanced search rules. It is important to be **EXTRA** careful with the queries. First, assess the total number of requests your queries will need by trying them directly in Twitter search. Otherwise, you risk running out of requests for the month.

In this project we pull tweets that come from 111 financial accounts and include the keywords INTC, Intel, AMD, NVDA or Nvidia within a specific timeframe. We estimate the number of requests by calculating a "generous" average of tweets per day using Twitter's search.

However, let's first see how we created the queries. We retrieved the [top 10][1] accounts from NASDAQ and the [top 100][2] from Forbes, and then added the NASDAQ Twitter account as well. The NASDAQ list was short enough to enter by hand, while the Forbes list was retrieved using Python. Luckily, the list was in an [HTML Table][3], so we used the `%%html` magic command to get the table locally and saved it in a `.txt` file. From there, creating a list of the accounts of interest was simple.
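
The estimate is simple arithmetic. In the sketch below, the dates and the page size of 500 are the ones used later in this notebook, while the average of 3 tweets per day is a made-up figure and `estimate_requests` is a hypothetical helper.

```python
from datetime import date
import math

def estimate_requests(tweets_per_day, start, end, results_per_call=500):
    """Rough number of paged requests for one query between start and end."""
    days = (end - start).days
    expected_tweets = tweets_per_day * days
    return max(1, math.ceil(expected_tweets / results_per_call))

# tweets_per_day=3 is an invented "generous" average for a single query.
requests_needed = estimate_requests(3, date(2018, 1, 3), date(2019, 7, 9))
print(requests_needed)  # 4
```

Here 552 days at 3 tweets per day gives 1656 tweets, which needs 4 calls of 500 results each. Multiplying by the number of queries gives a rough monthly budget.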

[1]: https://www.nasdaq.com/article/10-twitter-feeds-investors-need-to-follow-cm522728
[2]: https://www.forbes.com/sites/alapshah/2017/11/16/the-100-best-twitter-accounts-for-finance/#130a10347ea0
[3]: #html





In [3]:
# Query Keywords
keywords = ['INTC','Intel','AMD','NVDA','Nvidia']

# NASDAQ List
accounts = ['FinancialTimes','business','cnbc',
            'stockTwits','WSJMoneyBeat','stlouisfed',
            'Carl_C_Icahn','NASDAQ','carney',
            'CGasparino','ZacksResearch']

# Load Forbes List with pandas
forbes_lst = pd.read_html('top_100.txt', header=0)[0]['Twitter Handle'].to_list()

# Merging Lists
accounts = accounts+forbes_lst

Now we only need to change the format to be readable by Twitter, so we defined a simple query handler function according to our needs.

In [3]:
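def query_handler(var, prefix='', space=False):
    # turn an operator name such as 'from' into the 'from:' prefix
    if prefix:
        prefix = prefix + ':'

    if isinstance(var, list):
        # join the terms with OR and wrap the group in parentheses
        var = '(' + ' OR '.join(prefix + i for i in var) + ')'
    else:
        var = prefix + var

    # optionally add a trailing space so that groups can be concatenated
    return var + ' ' if space else var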

However, there is one more problem with search queries in Twitter. There is a limit on query length, so we need to separate our query into a list of queries, each with at most 30 terms. Below is the algorithm we used to enforce this limit per query.

In [None]:
# each query holds at most 30 terms: all keywords plus one chunk of accounts
step = 30 - len(keywords)
queries = list()

# walk through the accounts in consecutive, non-overlapping chunks
for i in range(0, len(accounts), step):
    queries.append(query_handler(keywords, space=True)
                   + query_handler(accounts[i:i+step], 'from'))
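
A quick way to check a split like this is to confirm that every account ends up in exactly one query. The sketch below is self-contained: `or_group` and `build_queries` are hypothetical stand-ins for the notebook's helpers, the account names are invented, and the `demo_` prefix avoids overwriting the notebook's variables.

```python
def or_group(terms, prefix=''):
    """Join terms with OR inside parentheses, e.g. (from:a OR from:b)."""
    return '(' + ' OR '.join(prefix + term for term in terms) + ')'

def build_queries(keywords, accounts, max_terms=30):
    """Split accounts into chunks so each query holds at most max_terms terms."""
    step = max_terms - len(keywords)
    if step < 1:
        raise ValueError('too many keywords for the term limit')
    return [or_group(keywords) + ' ' + or_group(accounts[i:i + step], 'from:')
            for i in range(0, len(accounts), step)]

# Made-up account names, only to exercise the split.
demo_keywords = ['INTC', 'Intel', 'AMD', 'NVDA', 'Nvidia']
demo_accounts = ['account%d' % n for n in range(111)]
demo_queries = build_queries(demo_keywords, demo_accounts)

# Pull the account names back out of the query strings.
seen = [term[5:]
        for q in demo_queries
        for term in q.replace('(', ' ').replace(')', ' ').split()
        if term.startswith('from:')]

# Every account should appear exactly once across all queries.
assert sorted(seen) == sorted(demo_accounts)
print(len(demo_queries))  # 5
```

With 5 keywords, each query has room for 25 accounts, so 111 accounts are spread over 5 queries.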

#### 2.4 Extract our Data
Unlike twitter requests, which are free, here it is possible to run out of requests if the algorithm gets stuck and keeps looping pointlessly. Therefore, we mitigate this risk by saving every milestone in `.json` format and by adding several flow-control checks and error handlers to our algorithm.

The following cell includes the variables we need to set before we run the algorithm for the first time.
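
The checkpoint idea can also be written as a small pair of helpers that keep the progress in a single JSON file. This is only a sketch under assumptions: the function names and the state layout (`query_index`, `until`, `rows`) are invented here and differ from the variables used in the notebook.

```python
import json
import os

def save_checkpoint(path, query_index, until, rows):
    """Persist the progress so a later run can pick up where this one stopped."""
    state = {'query_index': query_index, 'until': until, 'rows': rows}
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w', encoding='utf-8') as f:
        json.dump(state, f)
    os.replace(tmp_path, path)   # swap in the new file only once it is fully written

def load_checkpoint(path, default_until):
    """Return the saved progress, or a fresh state when no checkpoint exists."""
    if not os.path.exists(path):
        return {'query_index': 0, 'until': default_until, 'rows': []}
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

Writing to a temporary file first means an interrupted run leaves the previous checkpoint untouched.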

In [269]:
since     = "2018-01-03"
to_date   = "2019-07-09"
path      = 'tweet_news/'
file_name = 'query1'
df        = pd.DataFrame()

stuck     = False
until     = to_date

Now we are ready to run our algorithm!

In [269]:
for query in queries:
    
    try: # flow control
        if not stuck: # if not error
            df = pd.DataFrame()
            file_name = query.split()[0]
            until = to_date
        else: # if error - skip queries and start from checkpoint
            if query != stuck_query:
                continue
            else:
                file_name = query.split()[0]
                until = stuck_date
    except: # if run for first time - stuck not defined
        df = pd.DataFrame()
        file_name = query.split()[0]
        until = to_date

    while since != until: # query paging loop

        # query rules
        rule = gen_rule_payload(query,
                        results_per_call=500,
                        from_date=since,
                        to_date=until)

        try: # flow control

            # tweet request
            tweets = collect_results(rule,
                             max_results=500,
                             result_stream_args=premium_search_args)
            stuck = False

            for tweet in tweets: # import values to dataframe

                if tweet.in_reply_to_user_id: # boolean if it is reply
                    reply = True
                else:
                    reply = False

                if tweet.lang == 'en': # boolean if it is english
                    english = True
                else:
                    english = False

                if tweet.media_urls: # list media_urls
                    media_url = tweet.media_urls
                else:
                    media_url = None

                if tweet.hashtags: # list hashtags
                    hashtags = tweet.hashtags
                else:
                    hashtags = None

                if tweet.user_mentions: # boolean if there are mentions
                    mention = True
                else:
                    mention = False

                date = tweet.created_at_datetime.strftime('%Y-%m-%d %H:%M')

                # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame
                row = {'id': tweet.id,
                       'user': tweet.user_id,
                       'created_sec': tweet.created_at_seconds,
                       'text': tweet.all_text,
                       'hashtags': hashtags,
                       'english': english,
                       'followers': tweet.follower_count,
                       'favorite': tweet.favorite_count,
                       'media': media_url,
                       'retweets': tweet.retweet_count,
                       'quotes': tweet.quote_count,
                       'type': tweet.tweet_type,
                       'date': date,
                       'time': tweet.created_at_string[11:19],
                       'full_text': tweet.all_text
                      }
                df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

            # write checkpoint
            df.to_json(path+file_name+'.json')
            if until != date:
                until = date
                print(until)
                time.sleep(1)
            else:
                print(date)
                break

        except: # error handler
            stuck = True
            stuck_query = query
            stuck_date = until
            print('WARNING LOOP BROKEN ON DATE:\t', stuck_date)
            break

    if stuck:
        break

    print('\nquery <', query, '>:\tFinished',
          '\nFile name:\t', file_name,
          '\nNumber of observations:\t', len(df),'\n')

2018-10-10 11:45
2018-03-29 21:37
2018-01-03 13:21
2018-01-03 13:21

query < (INTC OR Intel OR AMD OR NVDA OR Nvidia) (from:FinancialTimes OR from:business OR from:cnbc OR from:stockTwits OR from:WSJMoneyBeat OR from:stlouisfed OR from:Carl_C_Icahn OR from:NASDAQ OR from:carney OR from:CGasparino OR from:ZacksResearch OR from:John_Hempton OR from:BarbarianCap OR from:muddywatersre OR from:AlderLaneeggs OR from:CitronResearch OR from:BrattleStCap OR from:KerrisdaleCap OR from:modestproposal1 OR from:marketfolly OR from:EventDrivenMgr OR from:ActivistShorts OR from:Carl_C_Icahn OR from:LongShortTrader) >:	Finished 
File name:	 query1 
Number of observations:	 1234 
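
If a request fails partway through and the same window is fetched again after a restart, some rows may be collected twice. Assuming that can happen, the sketch below drops repeated tweet IDs from a list of row dictionaries. The sample rows and the helper name are made up.

```python
def drop_duplicate_ids(rows):
    """Keep the first occurrence of each tweet id, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row['id'] not in seen:
            seen.add(row['id'])
            unique.append(row)
    return unique

rows = [{'id': 3, 'text': 'c'}, {'id': 2, 'text': 'b'}, {'id': 3, 'text': 'c'}]
print(drop_duplicate_ids(rows))  # [{'id': 3, 'text': 'c'}, {'id': 2, 'text': 'b'}]
```

On the DataFrame itself, `df.drop_duplicates(subset='id')` does the same job.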



---
### Appendix 1: HTML Table
The HTML code is pretty long, so we present a sample to show what it looks like:
```python
%%html # IPython magic for html
<table> 
    <tbody> 
        <tr> 
            <td><span style="font-weight: 400;">Rank</span></td> 
            <td><span style="font-weight: 400;">Twitter Handle</span></td> 
            <td><span style="font-weight: 400;">Popularity Rating</span></td> 
            <td><span style="font-weight: 400;">Total Followers</span></td> 
            <td><span style="font-weight: 400;">% of Total Followers</span></td>
        </tr>
        <tr> 
            <td><span style="font-weight: 400;">1</span></td> 
            <td><span style="font-weight: 400;">John_Hempton</span></td> 
            <td><span style="font-weight: 400;">79</span></td> 
            <td><span style="font-weight: 400;">24,400</span></td> 
            <td><span style="font-weight: 400;">0.32%</span></td> 
        </tr> 
    </tbody> 
</table>
```
<a id='html'></a>

| Rank | Twitter Handle | Popularity Rating | Total Followers | % of Total Followers |
| --- | --- | --- | --- | --- |
| 1 | John_Hempton | 79 | 24400 | 0.32% |
| 2 | BarbarianCap | 76 | 21100 | 0.36% |
| 3 | muddywatersre | 73 | 47600 | 0.15% |
| 4 | AlderLaneeggs | 71 | 16400 | 0.43% |
| 5 | CitronResearch | 68 | 60200 | 0.11% |
| 6 | BrattleStCap | 67 | 18700 | 0.36% |
| 7 | KerrisdaleCap | 65 | 20200 | 0.32% |
| 8 | modestproposal1 | 65 | 22700 | 0.29% |
| 9 | marketfolly | 65 | 48200 | 0.13% |
