# <center>Web Scraping by API </center>

## 1. Scrape data through APIs 
- Online content providers usually provide APIs for you to access data. Two types of APIs:
   * Python packages: e.g. the tweepy package for the Twitter API
   * REST APIs: e.g. OMDB APIs (http://www.omdbapi.com), or TMDB (https://developers.themoviedb.org/3/getting-started)
- You need to read documentation of APIs to figure out how to access data

## 2. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses `HTTP` requests to `GET`, `PUT`, `POST` and `DELETE` data
- Example:
    - https://groceries.asda.com/api/items/search<font color="blue"><b>?</b></font><font color='green'><b>keyword</b></font>=<font color='red'><b>yogurt</b></font><font color='purple'><b>&</b></font><font color='green'><b>r</b></font>=<font color='red'><b>json</b></font>, where
        - `?`: separates the API endpoint `https://groceries.asda.com/api/items/search` from the parameters
        - `keyword=yogurt`: sets the parameter `keyword` to `yogurt`, i.e. searches for yogurt
        - `&`: separates one parameter from the next
        - `r=json`: the result is returned in JSON format
    - You can paste the above URL directly into your browser
    - Or issue the API call using the `requests` package
- You need to read API documentation to understand how to specify parameters
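
The URL anatomy described above can be reproduced with the standard library. The following is a minimal sketch that reuses the endpoint and parameters from the example; it sends no request and only shows how `?` and `&` are assembled and parsed back:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "https://groceries.asda.com/api/items/search"
params = {"keyword": "yogurt", "r": "json"}

# urlencode joins name=value pairs with "&"; "?" separates endpoint from parameters
url = endpoint + "?" + urlencode(params)
print(url)

# going the other way: split a URL back into endpoint and parameters
parts = urlsplit(url)
parsed_endpoint = parts.scheme + "://" + parts.netloc + parts.path
parsed_params = parse_qs(parts.query)  # each value comes back as a list
print(parsed_endpoint)
print(parsed_params)
```

`urlencode` also escapes characters such as spaces, which plain string concatenation does not.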

In [18]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
import json
import pandas as pd

In [19]:
import requests
import json

keyword = 'yogurt'


url="https://groceries.asda.com/api/items/search?keyword=" + keyword + "&r=json"

print(url)

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    # json.dumps() converts a Python object into a JSON string
    result = r.json()
    print (json.dumps(result, indent=4))



https://groceries.asda.com/api/items/search?keyword=yogurt&r=json
{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "435",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "8",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_613f0638f04fd127608474cd^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a34::false",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "7368400",
  

In [20]:
# Exercise 2.2.  Another way to pass parameters

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# in case authentication is needed, use
# r = requests.get('https://api.github.com/user', \
# auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    print (json.dumps(r.json(), indent=4))



{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "isHookLogicInsert": "false",
    "totalResult": "435",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "8",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_613f0638f04fd127608474cd^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "10::for::\u00a34::false",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "7368400",
            "promoDetailFull": "10 for \u00a34",
            "availa

## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "Self-describing" and easy to understand
- JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in **name/value** pairs separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### In Python, JSON data is typically loaded as:
- **a dictionary** (from a JSON object), or
- a **list of dictionaries** (from a JSON array of objects)

### Useful JSON functions
- `json.dumps`: serialize a Python object to a JSON string
- `json.dump`: write a Python object to a file as JSON
- `json.loads`: parse a JSON string into a Python object
- `json.load`: parse JSON from a file into a Python object
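
The four functions can be tried without calling any API. This is a small sketch; the `record` dictionary and the temporary file name are made up for illustration:

```python
import json
import os
import tempfile

# illustrative record, not taken from a real API response
record = {"itemName": "Vanilla Yogurt", "price": "0.70",
          "onSale": False, "tags": ["dairy", "dessert"], "wasPrice": None}

# dumps: Python object -> JSON string (False becomes false, None becomes null)
s = json.dumps(record)
print(s)

# loads: JSON string -> Python object
restored = json.loads(s)
print(type(restored), restored["tags"])

# dump / load do the same through a file object
path = os.path.join(tempfile.gettempdir(), "example_record.json")
with open(path, "w") as f:
    json.dump(record, f)
with open(path) as f:
    from_file = json.load(f)
print(from_file == record)
```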

In [21]:
# Exercise 3.1 API returns a JSON object 

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# if the API call returns a successful response
if r.status_code==200:
    result = r.json()
    
    df = pd.DataFrame(result["items"])
    df.head()
    

Unnamed: 0,shelfId,shelfName,deptId,deptName,isBundle,meatStickerDetails,extraLargeImageURL,bundledItemCount,scene7Host,cin,...,avgWeight,iconDetails,maxQty,pricePerWt,productURL,pricePerUOM,searchTuningScore,onSale,salePrice,positionChngByMargin
0,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,7368400,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,4390417.0,False,,0
1,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,7368408,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,4268856.5,False,,0
2,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,7368402,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,3510444.8,False,,0
3,1215639170517,Chocolate & Caramel Desserts,1215341888021,Yogurts & Desserts,False,3::for::£2::false,,0,https://ui.assets-asda.com:443/dm/,7367121,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,1711667.0,False,,0
4,910000977060,"Diet, Low Fat & No Added Sugar",1215341888021,Yogurts & Desserts,False,10::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,6346354,...,,"{'promotionalIcons': ['59600051'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,1584629.1,False,,0


In [22]:
# Exercise 3.2. Parse JSON object (a dictionary)

# convert the first 2 items to string
s = json.dumps(result["items"][0:2], indent=4)
print(s)

# load back from a string
items = json.loads(s)
items

# save to file (the with-block closes the file when done)
with open("items.json", "w") as f:
    json.dump(result["items"], f)

# load back from file
with open("items.json", "r") as f:
    items = json.load(f)
print("test loaded data\n")
len(items)
items[0]

[
    {
        "shelfId": "1215286383583",
        "shelfName": "Corners",
        "deptId": "1215341888021",
        "deptName": "Yogurts & Desserts",
        "isBundle": "false",
        "meatStickerDetails": "10::for::\u00a34::false",
        "extraLargeImageURL": "",
        "bundledItemCount": "0",
        "scene7Host": "https://ui.assets-asda.com:443/dm/",
        "cin": "7368400",
        "promoDetailFull": "10 for \u00a34",
        "availability": "A",
        "totalReviewCount": "18",
        "asdaSuggest": "",
        "itemName": "Corner Vanilla Yogurt with Chocolate Balls",
        "price": "\u00a30.70",
        "imageURL": "",
        "aisleName": "Yogurts & Fromage Frais",
        "id": "1000377031656",
        "promoId": "ls92759",
        "isFavourite": "false",
        "hasAlternates": "false",
        "wasPrice": "",
        "brandName": "Muller",
        "promoType": "No Promo",
        "weight": "124g      ",
        "promoOfferTypeCode": "15",
        "promoQty": "

[{'shelfId': '1215286383583',
  'shelfName': 'Corners',
  'deptId': '1215341888021',
  'deptName': 'Yogurts & Desserts',
  'isBundle': 'false',
  'meatStickerDetails': '10::for::£4::false',
  'extraLargeImageURL': '',
  'bundledItemCount': '0',
  'scene7Host': 'https://ui.assets-asda.com:443/dm/',
  'cin': '7368400',
  'promoDetailFull': '10 for £4',
  'availability': 'A',
  'totalReviewCount': '18',
  'asdaSuggest': '',
  'itemName': 'Corner Vanilla Yogurt with Chocolate Balls',
  'price': '£0.70',
  'imageURL': '',
  'aisleName': 'Yogurts & Fromage Frais',
  'id': '1000377031656',
  'promoId': 'ls92759',
  'isFavourite': 'false',
  'hasAlternates': 'false',
  'wasPrice': '',
  'brandName': 'Muller',
  'promoType': 'No Promo',
  'weight': '124g      ',
  'promoOfferTypeCode': '15',
  'promoQty': '10',
  'promoValue': '£4',
  'productAttribute': '',
  'scene7AssetId': '4025500277031',
  'promoDetail': '10 for £4',
  'bundleDiscount': '0.00',
  'avgStarRating': '4.7222',
  'name': 'Mull

test loaded data



60

{'shelfId': '1215286383583',
 'shelfName': 'Corners',
 'deptId': '1215341888021',
 'deptName': 'Yogurts & Desserts',
 'isBundle': 'false',
 'meatStickerDetails': '10::for::£4::false',
 'extraLargeImageURL': '',
 'bundledItemCount': '0',
 'scene7Host': 'https://ui.assets-asda.com:443/dm/',
 'cin': '7368400',
 'promoDetailFull': '10 for £4',
 'availability': 'A',
 'totalReviewCount': '18',
 'asdaSuggest': '',
 'itemName': 'Corner Vanilla Yogurt with Chocolate Balls',
 'price': '£0.70',
 'imageURL': '',
 'aisleName': 'Yogurts & Fromage Frais',
 'id': '1000377031656',
 'promoId': 'ls92759',
 'isFavourite': 'false',
 'hasAlternates': 'false',
 'wasPrice': '',
 'brandName': 'Muller',
 'promoType': 'No Promo',
 'weight': '124g      ',
 'promoOfferTypeCode': '15',
 'promoQty': '10',
 'promoValue': '£4',
 'productAttribute': '',
 'scene7AssetId': '4025500277031',
 'promoDetail': '10 for £4',
 'bundleDiscount': '0.00',
 'avgStarRating': '4.7222',
 'name': 'Muller Corner Vanilla Yogurt with Choco

## 4. Get Tweets

Reference: 
- https://github.com/scalto/snscrape-by-location/blob/main/snscrape_by_location_tutorial.ipynb
- https://medium.com/swlh/how-to-scrape-tweets-by-location-in-python-using-snscrape-8c870fa6ec25

Note: User object is not exposed by TwitterSearchScraper any more.

In [23]:
import pandas as pd
import snscrape.modules.twitter as sntwitter
import itertools


In [24]:
#  search by keywords + time
# TwitterSearchScraper returns an iterator; islice takes its first 500 items

df = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
    '"blockchain + since:2021-12-1 until:2022-1-31"').get_items(), 500))

print(len(df))
df.head()

500


Unnamed: 0,url,date,content,id,username,outlinks,outlinksss,tcooutlinks,tcooutlinksss
0,https://twitter.com/CryptoAnomalous/status/148...,2022-01-30 23:59:59+00:00,@BradSherman do not get in the way of US techn...,1487938677535907845,CryptoAnomalous,[],,[],
1,https://twitter.com/gerard_dache/status/148793...,2022-01-30 23:59:55+00:00,"Congrats to Bill Rockwood Jr, Esq., MBA, J.D.,...",1487938658023919619,gerard_dache,"[https://lnkd.in/dqGwujW6, https://lnkd.in/dRn...",https://lnkd.in/dqGwujW6 https://lnkd.in/dRnUC_Yw,"[https://t.co/jhxRK9jjly, https://t.co/AsO08Xc...",https://t.co/jhxRK9jjly https://t.co/AsO08XcIEm
2,https://twitter.com/TrustCheckxyz/status/16212...,2023-02-02 23:27:56+00:00,TrustCheck is your partner in crypto protectio...,1621289346040205312,TrustCheckxyz,[],,[],
3,https://twitter.com/Rekttrading8/status/148793...,2022-01-30 23:59:54+00:00,@kararesurrect You are still announcing it on ...,1487938655159140353,Rekttrading8,[],,[],
4,https://twitter.com/KeanuBelieves/status/14879...,2022-01-30 23:59:50+00:00,@WatcherGuru @AffinityBSC. Being listing on a ...,1487938639820693508,KeanuBelieves,[],,[],


In [25]:
df.content[0:20]


0     @BradSherman do not get in the way of US techn...
1     Congrats to Bill Rockwood Jr, Esq., MBA, J.D.,...
2     TrustCheck is your partner in crypto protectio...
3     @kararesurrect You are still announcing it on ...
4     @WatcherGuru @AffinityBSC. Being listing on a ...
5                            @mine_blockchain scam mnet
6     46150 Well, who would have thought that? Minin...
7     The Stellar network enables change. Change in ...
8     👋 Hey! Wait! help me, help you, help us? Who's...
9     🐳 #Cardano $ADA Whale ❤️laced!\n💰 Transaction ...
10    TrustCheck is your partner in crypto protectio...
11    @TheVunderkind I still think we are yet to app...
12    63208 Well, who would have thought that? Minin...
13    @NathanielBandy1 Nintendo has a hard time putt...
14    39313 Well, who would have thought that? Minin...
15    The Stellar network enables change. Change in ...
16    Asian Company Uses Blockchain Technology to Pr...
17                         blockchain the piece 

In [26]:
# search by user

df = pd.DataFrame(itertools.islice(sntwitter.TwitterUserScraper(
    '"zawphyowai199"').get_items(), 500))

print(len(df))
df.tail()


500


Unnamed: 0,url,date,content,id,username,outlinks,outlinksss,tcooutlinks,tcooutlinksss
495,https://twitter.com/zawphyowai199/status/13652...,2021-02-26 11:48:41+00:00,@cryptomasters07 @FredrickNwa @eth_rift @LiveN...,1365267538259623937,zawphyowai199,[],,[],
496,https://twitter.com/zawphyowai199/status/13648...,2021-02-25 11:13:11+00:00,@MidasDollar @BinanceChain @cz_binance @BearnF...,1364896218338390020,zawphyowai199,[],,[],
497,https://twitter.com/zawphyowai199/status/13644...,2021-02-24 04:20:54+00:00,@1MillionTokens @mma728122 \n@mst5792 \n@wt7276,1364430076763246592,zawphyowai199,[],,[],
498,https://twitter.com/zawphyowai199/status/13644...,2021-02-24 04:17:08+00:00,@phoswapofficial $PHO #BinanceSmartChain👍 \n#Y...,1364429127122493441,zawphyowai199,[],,[],
499,https://twitter.com/zawphyowai199/status/13642...,2021-02-23 17:44:14+00:00,@bakery_swap @ape_swap #BananaBake\n#Bake\n#Nf...,1364269854350446592,zawphyowai199,[],,[],


## 5. Tweepy
- Tweepy is a Python library for accessing the Twitter API. 
- `pip install tweepy`
- The Tweepy documentation has detailed explanations: https://docs.tweepy.org/en/stable/
- You need to apply for a developer account from here: https://developer.twitter.com/en/apply-for-access

In [1]:
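import tweepy
import csv
import datetime
import json          # used below to pretty-print a tweet
import pandas as pd  # used below to read the saved CSV files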

# https://docs.tweepy.org/en/stable/auth_tutorial.html

CONSUMER_KEY=''
CONSUMER_SECRET=''
ACCESS_KEY=''
ACCESS_SECRET=''

auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
api=tweepy.API(auth)

In [2]:
# Take a look at the public tweets from your account's home timeline 

public_tweets = api.home_timeline()
print(len(public_tweets))
for tweet in public_tweets[:2]:
    print(tweet.text)

20
Public Storage makes $11 billion hostile bid for Life Storage https://t.co/CIgJuBnSTy https://t.co/vZOg6ojHdr
Firefighters battled dozens of raging wildfires in Chile, seeking to gain control of one of the country's worst nat… https://t.co/fACQjuZSzz


In [7]:
# is this useful information?
# Let's take a close look at ONE tweet json

public_tweets[0]
# the raw Status object is hard to read in this form


Status(_api=<tweepy.api.API object at 0x7fdeb726cd90>, _json={'created_at': 'Mon Feb 06 02:35:17 +0000 2023', 'id': 1622423655387959303, 'id_str': '1622423655387959303', 'text': 'Public Storage makes $11 billion hostile bid for Life Storage https://t.co/CIgJuBnSTy https://t.co/vZOg6ojHdr', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/CIgJuBnSTy', 'expanded_url': 'http://reut.rs/3Yosrcp', 'display_url': 'reut.rs/3Yosrcp', 'indices': [62, 85]}], 'media': [{'id': 1622423653555097601, 'id_str': '1622423653555097601', 'indices': [86, 109], 'media_url': 'http://pbs.twimg.com/media/FoQBt4lX0AEbRGs.jpg', 'media_url_https': 'https://pbs.twimg.com/media/FoQBt4lX0AEbRGs.jpg', 'url': 'https://t.co/vZOg6ojHdr', 'display_url': 'pic.twitter.com/vZOg6ojHdr', 'expanded_url': 'https://twitter.com/Reuters/status/1622423655387959303/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 120

In [8]:
# make it look better
# convert to string
json_str = json.dumps(public_tweets[0]._json)

# deserialise string into python object
parsed = json.loads(json_str)


print(json.dumps(parsed, indent=4, sort_keys=True))
# Now the nested structure of the json object is easier to see

{
    "contributors": null,
    "coordinates": null,
    "created_at": "Mon Feb 06 02:35:17 +0000 2023",
    "entities": {
        "hashtags": [],
        "media": [
            {
                "display_url": "pic.twitter.com/vZOg6ojHdr",
                "expanded_url": "https://twitter.com/Reuters/status/1622423655387959303/photo/1",
                "id": 1622423653555097601,
                "id_str": "1622423653555097601",
                "indices": [
                    86,
                    109
                ],
                "media_url": "http://pbs.twimg.com/media/FoQBt4lX0AEbRGs.jpg",
                "media_url_https": "https://pbs.twimg.com/media/FoQBt4lX0AEbRGs.jpg",
                "sizes": {
                    "large": {
                        "h": 628,
                        "resize": "fit",
                        "w": 1200
                    },
                    "medium": {
                        "h": 628,
                        "resize": "fit",
           

### 5.1. Get tweets from users' timeline
- A timeline call can retrieve at most the 3,200 most recent tweets of a user (a limit set by Twitter).
    - Note: the time range you get depends on how often the user posts tweets. 
- Parameters for the timeline call
    - `count`: the number of results to try and retrieve per page. Maximum is 200. 
    - Since each call returns at most 200 tweets, make multiple calls (using `max_id`) to retrieve all 3,200 tweets. 
    - `tweet_mode`: when set to `"extended"`, replaces the `text` attribute with `full_text` and prevents a primary tweet longer than 140 characters from being truncated.
- Variables of tweet objects
    - https://docs.tweepy.org/en/stable/api.html#tweepy-api-twitter-api-wrapper
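
The paging idea (repeat the call, each time asking only for tweets older than the last one seen) can be sketched without credentials. In this sketch `fake_user_timeline` and `FAKE_IDS` are hypothetical stand-ins for the real timeline call and its results; only the `count` and `max_id` logic mirrors the parameters described above:

```python
# Stand-in data: 450 fake tweet ids, newest first (1450, 1449, ..., 1001)
FAKE_IDS = list(range(1450, 1000, -1))

def fake_user_timeline(count=200, max_id=None):
    """Hypothetical stand-in: return up to `count` ids that are <= max_id."""
    ids = [i for i in FAKE_IDS if max_id is None or i <= max_id]
    return ids[:count]

all_ids = []
page = fake_user_timeline(count=200)
while page:                        # an empty page means nothing older is left
    all_ids.extend(page)
    oldest = page[-1] - 1          # ask only for tweets older than the last one seen
    page = fake_user_timeline(count=200, max_id=oldest)

print(len(all_ids))
```

Subtracting one from the oldest id avoids fetching the same tweet twice, because `max_id` is inclusive.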

In [10]:
# Get the five most recent tweets of a user.
timeline = api.user_timeline(screen_name="KelloggCompany",count=5,tweet_mode="extended")

for status in timeline:
    print (status.id)
    print (status.full_text)

1586056988315844610
We're proud to be ranked #2 in the @Access to Nutrition Initiative 2022 US Access to Nutrition Index. We remain committed to providing access to nutritious foods to create #betterdays for people in the U.S. and across the world. #LifeAtK #ESG #ATNIUSIndex https://t.co/AMHkEMwiO8 https://t.co/0vrgkkkojK
1585399725930729472
One easy way to help kids get excited about going to school? Serve school breakfast!
 
According to new research, children who attend breakfast programs are excited about going to school.
 
Learn about this and other social benefits: https://t.co/9Dx1ggxfDA https://t.co/qSWUGGd8eC
1585240451036381184
RT @KelloggCompany: Webcast alert! Kellogg &amp; @foodbanking will host a discussion about new research on the benefits of school breakfast pro…
1584897978136944641
Webcast alert! Kellogg &amp; @foodbanking will host a discussion about new research on the benefits of school breakfast programs &amp; the role that food banks play in supporting these prog

In [11]:
# Step 1: get a list of tweets 
# Step 2: extract the variables you want

def get_all_tweets(user_name):
    auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
    api=tweepy.API(auth)
    
    # initialize the first call
    alltweets=[]
    new_tweets=api.user_timeline(screen_name=user_name, count=200)
    alltweets.extend(new_tweets)
    oldest=alltweets[-1].id-1  #next time start from the oldest one minus one 
    
    # continue to get tweets
    while len(new_tweets)>0:  
        print ("getting tweets before", oldest)
        new_tweets = api.user_timeline(screen_name=user_name,count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest=alltweets[-1].id-1
        print("...{} tweets downloaded so far".format(len(alltweets)))
    
    # extract the variables you want
    # note: .encode("utf-8") turns the text into bytes, so the CSV stores it as b'...'
    outtweets = [[tweet.id_str, tweet.user.name, tweet.created_at, tweet.user.followers_count,
                  tweet.text.encode("utf-8")] for tweet in alltweets]
            
    # write out your variables (newline='' is what the csv module expects)
    with open('%s_tweet.csv' % user_name,'w', newline='') as outputfile: 
        writer=csv.writer(outputfile)
        writer.writerow(["id","user_name","created_at","followers","text"])
        writer.writerows(outtweets)

# use your function
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KelloggCompany")
    

getting tweets before 1436313521776701471
...400 tweets downloaded so far
getting tweets before 1372185589525717001
...600 tweets downloaded so far
getting tweets before 1310917338070429695
...800 tweets downloaded so far
getting tweets before 1235602518031372287
...1000 tweets downloaded so far
getting tweets before 1171171132633997314
...1200 tweets downloaded so far
getting tweets before 1097565742130712576
...1400 tweets downloaded so far
getting tweets before 1007004846355083263
...1599 tweets downloaded so far
getting tweets before 930802267111882751
...1799 tweets downloaded so far
getting tweets before 871796275842138111
...1999 tweets downloaded so far
getting tweets before 798912118175080448
...2199 tweets downloaded so far
getting tweets before 751417169762717696
...2396 tweets downloaded so far
getting tweets before 704673280029020160
...2594 tweets downloaded so far
getting tweets before 656850808840003583
...2793 tweets downloaded so far
getting tweets before 592687275378

In [15]:
# Take a look at the table you got
df= pd.read_csv('KelloggCompany_tweet.csv', header=0)
df.head()

# how many tweets did we get?
len(df)

# The following tweet is a retweet. Take a look at the text.
# The index of this retweet is based on the table generated previously
# if you run at a different time, you will see a different tweet
# try to find a retweet and compare the text with the actual tweet
# You can find each tweet on Twitter by its ID.
df.text[0]


Unnamed: 0,id,user_name,created_at,followers,text
0,1586056988315844610,Kellogg Company,2022-10-28 18:07:08,77052,"b""We're proud to be ranked #2 in the @Access t..."
1,1585399725930729472,Kellogg Company,2022-10-26 22:35:24,77052,b'One easy way to help kids get excited about ...
2,1585240451036381184,Kellogg Company,2022-10-26 12:02:30,77052,b'RT @KelloggCompany: Webcast alert! Kellogg &...
3,1584897978136944641,Kellogg Company,2022-10-25 13:21:38,77052,b'Webcast alert! Kellogg &amp; @foodbanking wi...
4,1583159737117724673,Kellogg Company,2022-10-20 18:14:29,77052,b'RT @frootloops: Pick-up your limited-edition...


3186

'b"We\'re proud to be ranked #2 in the @Access to Nutrition Initiative 2022 US Access to Nutrition Index. We remain com\\xe2\\x80\\xa6 https://t.co/Xo15b3andL"'

### 5.2. Deal with truncated text
- For text mining on Twitter, it is important to get the full text. 
    - Full text would be essential for topic modeling and sentiment analysis.
    - Full text is also important for extracting mention networks (note the previous example). 
- Use `tweet_mode="extended"` when calling a user's timeline.
    - In extended mode, the `text` attribute of the returned Status objects is replaced by a `full_text` attribute, which contains the entire untruncated text of the Tweet. 
- Full text for tweets that are retweets.
    - If the tweet is a retweet, its `full_text` may still be truncated. 
    - We need to access the full text through the `retweeted_status` attribute, which is itself a Status object. 
- For reference: https://docs.tweepy.org/en/stable/extended_tweets.html
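
The retweet rule above fits in a small helper. This is a sketch only: `get_full_text` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that mimic just the `full_text` and `retweeted_status` attributes of a Status object in extended mode:

```python
from types import SimpleNamespace

def get_full_text(status):
    """Return untruncated text; for a retweet, read it from retweeted_status."""
    if hasattr(status, "retweeted_status"):
        return status.retweeted_status.full_text
    return status.full_text

# stand-in objects, not real API results
original = SimpleNamespace(
    full_text="Webcast alert! A long original tweet that is not cut off.")
retweet = SimpleNamespace(
    full_text="RT @someone: Webcast alert! A long original tw…",
    retweeted_status=original)

print(get_full_text(original))
print(get_full_text(retweet))
```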

In [16]:
# Let's deal with retweets

def get_all_tweets(user_name):
    auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
    api=tweepy.API(auth,wait_on_rate_limit=True)

    alltweets=[]
    new_tweets=api.user_timeline(screen_name=user_name, count=200,tweet_mode="extended")
    alltweets.extend(new_tweets)
    oldest=alltweets[-1].id-1  
    
    # set date condition
    startDate=datetime.datetime(2022, 10, 1, 0, 0, 0)
    # stop when a page comes back empty or reaches tweets older than startDate
    while new_tweets and new_tweets[-1].created_at > startDate:
        print ("getting tweets before", oldest)
        new_tweets = api.user_timeline(screen_name=user_name,count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest=alltweets[-1].id-1
        print("...{} tweets downloaded so far".format(len(alltweets)))
        
    # check if it's a retweet
    # When using extended mode with a Retweet, the full_text attribute of the Status object may be truncated    
    # However, since the retweeted_status attribute (of a Status object that is a Retweet) is itself a Status object
    # the full_text attribute of the Retweeted Status object can be used instead.
    
    outtweets_all=[]
    for tweet in alltweets:
        status = api.get_status(tweet.id, tweet_mode="extended")
        
        if hasattr(status, "retweeted_status"):  # is a retweet
            full_text=status.retweeted_status.full_text.encode("utf-8")
            
            outtweets=[
            # tweet content
            tweet.id_str, tweet.created_at,full_text,
            # user features
            tweet.user.name, tweet.user.screen_name, tweet.user.followers_count, 
            # retweet features
            tweet.retweeted_status.user.name,tweet.retweeted_status.user.screen_name,tweet.retweeted_status.user.description]
            outtweets_all.append(outtweets)
  
        else: # not a retweet
            full_text=status.full_text.encode("utf-8")
                    
            outtweets=[
            # tweet content
            tweet.id_str, tweet.created_at,full_text,
            # user features
            tweet.user.name, tweet.user.screen_name, tweet.user.followers_count, 
            # retweet features
            "no value","no value","no value"]
            outtweets_all.append(outtweets)

    with open('%s_full_tweet.csv' % user_name,'w') as outputfile: 
        writer=csv.writer(outputfile)
        writer.writerow(["id","created_at","full_text",
                        "user.name","user.screen_name","user.followers_count",
                        "retweeted_status.user.name","retweeted_status.user.screen_name","retweeted_status.user.description"])
        writer.writerows(outtweets_all)

        
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KelloggCompany")


getting tweets before 1436313521776701471
...400 tweets downloaded so far
getting tweets before 1372185589525717001
...600 tweets downloaded so far


In [17]:
df= pd.read_csv('KelloggCompany_full_tweet.csv', header=0)
df.head()
len(df)
# please compare this full text with the above truncated text, what differences can you find?
df.full_text[0]

Unnamed: 0,id,created_at,full_text,user.name,user.screen_name,user.followers_count,retweeted_status.user.name,retweeted_status.user.screen_name,retweeted_status.user.description
0,1586056988315844610,2022-10-28 18:07:08,"b""We're proud to be ranked #2 in the @Access t...",Kellogg Company,KelloggCompany,77052,no value,no value,no value
1,1585399725930729472,2022-10-26 22:35:24,b'One easy way to help kids get excited about ...,Kellogg Company,KelloggCompany,77052,no value,no value,no value
2,1585240451036381184,2022-10-26 12:02:30,b'Webcast alert! Kellogg &amp; @foodbanking wi...,Kellogg Company,KelloggCompany,77052,Kellogg Company,KelloggCompany,https://t.co/ejl5pSUP0e
3,1584897978136944641,2022-10-25 13:21:38,b'Webcast alert! Kellogg &amp; @foodbanking wi...,Kellogg Company,KelloggCompany,77052,no value,no value,no value
4,1583159737117724673,2022-10-20 18:14:29,b'Pick-up your limited-edition Rise &amp; Kind...,Kellogg Company,KelloggCompany,77052,Froot Loops,frootloops,It’s spelled Froot. Play now on Roblox 🎮


600

'b"We\'re proud to be ranked #2 in the @Access to Nutrition Initiative 2022 US Access to Nutrition Index. We remain committed to providing access to nutritious foods to create #betterdays for people in the U.S. and across the world. #LifeAtK #ESG #ATNIUSIndex https://t.co/AMHkEMwiO8 https://t.co/0vrgkkkojK"'

### 5.3. Build Twitter networks
- **Follower-followee network**
    - If you have a list of user accounts, you can retrieve, for each pair of accounts, a boolean value indicating whether one follows the other. 
    - Parameters of `api.show_friendship`
        * `source_id` – The user_id of the subject user.
        * `source_screen_name` – The screen_name of the subject user.
        * `target_id` – The user_id of the target user.
        * `target_screen_name` – The screen_name of the target user.
- **Retweet network**
    - Retweeted accounts can be extracted while scraping the API. 
    - Or retweeted accounts can be extracted from the text. 
- **Mention network**
    - Can be extracted from the full text. 
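
As a rough illustration of the mention network, the sketch below pulls `@account` mentions out of tweet text with a regular expression and counts author-to-mentioned-account edges. The sample tweets are made up, `mention_edges` is a hypothetical helper, and the simple pattern `@(\w+)` is an assumption that may also match non-account strings such as parts of e-mail addresses:

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def mention_edges(author, text):
    """Return (author, mentioned_account) pairs found in one tweet's text."""
    return [(author, m) for m in MENTION.findall(text)]

# illustrative (author, full_text) pairs, not real API results
tweets = [
    ("KelloggCompany", "Webcast alert! Kellogg & @foodbanking will host a discussion"),
    ("KelloggCompany", "Thanks @foodbanking and @frootloops"),
]

edges = Counter()
for author, text in tweets:
    edges.update(mention_edges(author, text))

# each key is a directed edge, each value is how often it occurs
print(edges)
```

Truncated text would cut off mentions near the end of a tweet, which is why the full text matters here.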

In [14]:
# How to scrape the follower-followee network?
# we can directly retrieve a boolean value 

dog="Microsoft"
cat="Oracle"

is_following = api.show_friendship(source_screen_name=cat,target_screen_name=dog)
print(is_following[1].following)

# Question: how to get the adjacency matrix of a follower-followee network?

False
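
One possible approach to the adjacency-matrix question, as a sketch: loop over all ordered pairs of accounts and record one boolean lookup per pair. Here `follows` and `KNOWN_FOLLOWS` are hypothetical stand-ins with made-up relations; in practice each lookup would be one friendship call to the API, subject to rate limits:

```python
accounts = ["Microsoft", "Oracle", "IBM"]

# made-up (source, target) relations standing in for API lookups
KNOWN_FOLLOWS = {("Microsoft", "IBM"), ("IBM", "Microsoft"), ("Oracle", "IBM")}

def follows(source, target):
    """Hypothetical stand-in for one friendship lookup."""
    return (source, target) in KNOWN_FOLLOWS

# adj[i][j] == 1 means accounts[i] follows accounts[j]; the diagonal stays 0
adj = [[1 if s != t and follows(s, t) else 0 for t in accounts]
       for s in accounts]

for name, row in zip(accounts, adj):
    print(name, row)
```

With n accounts this needs n*(n-1) lookups, so the approach is only practical for small lists.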


### 5.4 Keywords search


In [13]:
# Define the search term and the date_since date as variables
search_words = "blockchain"
date_since = "2021-11-01"

# Collect tweets
tweets = tweepy.Cursor(api.search,
              q=search_words,
              lang="en",
              since=date_since).items(5)


# Iterate and print tweets
for tweet in tweets:
    print(tweet.text)

RT @chiqshoes: Make this egg the most liked post on Twitter🔥

@MandoxCreate @Wire_Blockchain #Egg #NFT #NFTs https://t.co/yLllS767Zh
RT @lovetr33_xyz: LoveTr33 is an experimental project designed to allow people to express and celebrate love on Valentine's day in a web3 a…
RT @ReautyDao: #Reautydao #Fridaybrews: Beauty brands are taking their operations to the next level with web3,blockchain &amp; AI!From improved…
RT @Ronald_vanLoon: What Is Contract Intelligence?
by @monishdarda @Forbes

Read more: https://t.co/nafLmwTZ0m

#AI #BigData #ArtificialInt…
RT @EOSnFoundation: The #EOS EVM is a major catalyst for the $EOS blockchain on the path to mass adoption 🔥

This recent feature with @cryp…


##### Twitter data resources
https://github.com/echen102/us-pres-elections-2020 <br>
https://github.com/echen102/COVID-19-TweetIDs