# Scraping Data

Scraping data means extracting data from an online platform. There are several ways to scrape data, and they all boil down to 2 types:

1. Using APIs to get access to the data from the database
2. Viewing the source code of the platform (reading the website)

Let me explain each a bit.

### 1. Using APIs

* Here the online platform you plan to access provides a gateway for developers who wish to use its data for learning purposes or to create new solutions.

* Companies give developers this leeway so that they can increase user traffic or get innovative new solutions they can take to their users.
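
As a rough illustration of this route, an API usually answers a request with structured JSON rather than a web page. The sketch below parses such a response with the standard library only. The payload, its field names (`data`, `id`, `text`) and the helper `extract_posts` are made up for illustration and do not belong to any real platform.

```python
import json

# Hypothetical JSON body, standing in for what an API endpoint might return
SAMPLE_RESPONSE = '{"data": [{"id": "1", "text": "Hello"}, {"id": "2", "text": "Network is down"}]}'


def extract_posts(body):
    """Return (id, text) pairs from a JSON response body."""
    payload = json.loads(body)
    return [(item["id"], item["text"]) for item in payload.get("data", [])]


posts = extract_posts(SAMPLE_RESPONSE)
print(posts)
```

Because the data already arrives with named fields, no layout-specific parsing is needed.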

### 2. Extracting platform source code

* Here you simply take the HTML code of a website and parse it to get the data you want.

* This gets a little tough if you have more than one website to scrape data from, especially when the developers are different and structure their code differently.
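
To make the parsing step concrete, here is a minimal sketch that uses Python's built-in `html.parser`. The page source and the `class="post"` convention are assumptions invented for this example; a real site will use its own tags and class names, which is exactly why each website needs its own parsing rules.

```python
from html.parser import HTMLParser

# Hypothetical page source; a real page would be downloaded first
SAMPLE_HTML = """
<html><body>
  <p class="post">First post</p>
  <p class="ad">Buy now</p>
  <p class="post">Second post</p>
</body></html>
"""


class PostParser(HTMLParser):
    """Collect the text inside every <p class="post"> element."""

    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "p" and ("class", "post") in attrs:
            self.in_post = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_post = False

    def handle_data(self, data):
        if self.in_post:
            self.posts[-1] += data


parser = PostParser()
parser.feed(SAMPLE_HTML)
print(parser.posts)
```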



### Scraping Twitter

* Here we will be using APIs to access Twitter data.
* Twitter provides several APIs to access different kinds of data. To use these APIs you need access tokens and keys for authentication, which is why you need an approved Twitter developer account.
* There are several libraries you can use to work with the Twitter APIs. I have used 2 of them:
    1. Tweepy
    2. GetOldTweets3
* I worked with both because each has limitations that make it hard to gather the kind of data you are looking for.
* For Tweepy the limitations are:
    1. You can only get data from the last 30 days.
    2. You can't get more than 300 tweets per query, so you need to query for 300 and wait 15 minutes before you query again.
* For GetOldTweets3 the limitation is:
    1. Though you are able to get a lot of tweets, including old ones, there are specific attributes you can't get from the tweet objects it returns.
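
The query-then-wait pattern described for Tweepy can be sketched as below. This is only an illustration of the idea: `collect_in_batches` and `fetch_batch` are hypothetical names, and the batch size and waiting time are simply the figures quoted in the list above. The `sleep` parameter is injectable so that the example runs instantly.

```python
import time


def collect_in_batches(fetch_batch, total, batch_size=300, wait_seconds=15 * 60, sleep=time.sleep):
    """Call fetch_batch(n) repeatedly until `total` items are collected.

    fetch_batch is a hypothetical callable returning a list of at most n items.
    """
    collected = []
    while len(collected) < total:
        batch = fetch_batch(min(batch_size, total - len(collected)))
        if not batch:
            break  # nothing left to fetch
        collected.extend(batch)
        if len(collected) < total:
            sleep(wait_seconds)  # wait for the rate-limit window to reset
    return collected


def fake_fetch(n):
    """Stand-in fetcher that always returns n placeholder tweets."""
    return ["tweet"] * n


# A no-op sleep keeps the demonstration fast
tweets = collect_in_batches(fake_fetch, total=700, sleep=lambda seconds: None)
print(len(tweets))
```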

***Import Libraries to use***

In [1]:
# Import Twitter API libraries
import tweepy as tp
import GetOldTweets3 as got

# Import libraries for data reading
import pandas as pd

# For reading the file that holds the secret access keys and tokens
import yaml

***Read access keys and tokens to authenticate with the Twitter API***

In [3]:
# The Twitter API consumer key, access token, and their secrets are read from a YAML file.
# Keep the secret keys private; never publish them.
with open(r"secret.yml") as file:
    secret_list = yaml.load(file, Loader=yaml.FullLoader)
    
#Access the Twitter API
auth = tp.OAuthHandler(secret_list["consumer_key"], secret_list["consumer_secret"])
auth.set_access_token(secret_list["access_token"], secret_list["access_secret"])
api = tp.API(auth, wait_on_rate_limit=True)
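
Authentication fails with a less obvious error if one of the four entries is missing from the secrets file, so it can help to check the loaded dictionary first. The sketch below assumes the same four key names used in the cell above; the helper `missing_keys` and the example dictionary are hypothetical.

```python
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")


def missing_keys(secrets):
    """Return the required credential names that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not secrets.get(key)]


# Hypothetical contents of a loaded secrets file with one entry missing
example = {"consumer_key": "abc", "consumer_secret": "def", "access_token": "ghi"}
print(missing_keys(example))
```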

***Set up tweet query with GetOldTweets3***

In [4]:
tweet_query = "@AIRTEL_KE"
count = 200000

In [5]:
#Set the criteria for searching the tweets
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(tweet_query)\
                                            .setSince("2020-01-01")

#Query for the tweets
tweets = got.manager.TweetManager.getTweets(tweetCriteria)


In [None]:
# Create a list holding lists with tweet details we want
tweets_lst = [[tw.id, tw.date, tw.text, tw.username, tw.retweets, tw.favorites, tw.geo, tw.mentions, tw.hashtags] for tw in tweets]

In [None]:
# Check how many tweets the query returned
len(tweets_lst)

136766

In [None]:
# Create a dataframe of the tweets we queried
tweets_df = pd.DataFrame(tweets_lst, columns=["ID", "Date", "Post", "Username","Retweets", "Favorites", "Geo", "Mentions", "Hashtags"])
tweets_df.sample(10)

Unnamed: 0,ID,Date,Post,Username,Retweets,Favorites,Geo,Mentions,Hashtags
95494,1246075645395963904,2020-04-03 14:02:39+00:00,#BeSmartBeSafe ^Caro,AIRTEL_KE,0,0,,,#BeSmartBeSafe
40560,1277336948709998592,2020-06-28 20:23:55+00:00,Na tusisahau @AIRTEL_KE banaa tho hao wataona ...,KamauTheSecond,0,2,,@AIRTEL_KE,
62755,1264964584009617410,2020-05-25 17:00:33+00:00,,SirJeremyKE,0,0,,,
16469,1288361388860211200,2020-07-29 06:31:06+00:00,Okay.Please share via dm the disconnected numb...,AIRTEL_KE,0,0,,,
95173,1246391239362174980,2020-04-04 10:56:42+00:00,@AIRTEL_KE Bought your mifi last saturday at 4...,nancy_mwongeli,0,0,,@AIRTEL_KE,
12591,1290212818944311296,2020-08-03 09:08:01+00:00,Checking. ^Jamo,AIRTEL_KE,0,0,,,
133214,1214957281747652609,2020-01-08 17:09:23+00:00,"Hello Thomas, Amazing bundles and unliminet bu...",AIRTEL_KE,0,0,,,
43789,1294543754511020033,2020-08-15 07:57:37+00:00,"JLCPCB Prototype For $2/5pcs, 24 Hours Quick T...",JLCPCB,43,375,,,
103530,1239965622563307521,2020-03-17 17:23:36+00:00,Always here to assist.^Caro,AIRTEL_KE,0,0,,,
89048,1250885624216903683,2020-04-16 20:35:47+00:00,"The number is in excess of 1 digit,please dial...",AIRTEL_KE,0,0,,,


In [None]:
# Filter the tweets that mention @AIRTEL_KE since those are the tweets with questions and queries.
airtel_mention_df = tweets_df[tweets_df["Mentions"].str.contains("@AIRTEL_KE") | tweets_df["Mentions"].str.contains("@airtel_ke")]
print(airtel_mention_df.shape)

# To avoid having to repeat the querying process again, we save the results we got
airtel_mention_df.to_csv(path_or_buf="AirtelMentions1.csv")
airtel_mention_df.sample(20)

(33677, 9)


Unnamed: 0,ID,Date,Post,Username,Retweets,Favorites,Geo,Mentions,Hashtags
104885,1239202406522634242,2020-03-15 14:50:51+00:00,Re: @Safaricom @JTLKenya @AIRTEL_KE who price ...,Muriu,0,1,,@safaricom @JTLKenya @AIRTEL_KE,
124277,1222932631286841346,2020-01-30 17:20:34+00:00,Try @AIRTEL_KE. You'll never regret. #SwitchTo...,Njokiwainaina3,0,0,,@AIRTEL_KE,#SwitchToAirtel
104879,1239206781559222272,2020-03-15 15:08:14+00:00,Hello @AIRTEL_KE mbona network inashinda ikika...,Niqy_Steamerman,0,0,,@AIRTEL_KE,
18921,1287091657456914432,2020-07-25 18:25:38+00:00,@AIRTEL_KE i was at your main office today in ...,denisyulempole,0,0,,@AIRTEL_KE,
91293,1249349397864947712,2020-04-12 14:51:22+00:00,@AIRTEL_KE please your network is very poor,Shadrackmwanza,0,0,,@AIRTEL_KE,
95416,1246131144858521601,2020-04-03 17:43:11+00:00,"@AIRTEL_KE your network is very poor, sasa tut...",fredie_wambua,0,1,,@AIRTEL_KE,#ukweliusemwe
39897,1277824520921833475,2020-06-30 04:41:21+00:00,@AIRTEL_KE what's wrong with your network sinc...,edochomo,0,0,,@AIRTEL_KE,
120199,1226177428512542726,2020-02-08 16:14:14+00:00,@AIRTEL_KE I am seriously having issues with m...,georgekamotho,0,0,,@AIRTEL_KE,#ShittyService
130518,1217054115953553409,2020-01-14 12:01:27+00:00,@AIRTEL_KE the way you guys are eating my bund...,NixonLumbugu,0,0,,@AIRTEL_KE,
82312,1254460245742686208,2020-04-26 17:20:03+00:00,@AIRTEL_KE Hey please refresh my line havin ne...,Digneez,0,0,,@AIRTEL_KE,
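
The filter above checks two exact spellings of the handle. A case-insensitive check would also catch spellings such as `@Airtel_KE`. Below is a small sketch of that idea in plain Python; `mentions_handle` is a hypothetical helper, and unlike `str.contains` it matches whole handles rather than substrings.

```python
def mentions_handle(mentions, handle="@AIRTEL_KE"):
    """Return True if the space-separated mentions string contains the handle, ignoring case."""
    if not isinstance(mentions, str):
        return False  # missing values (e.g. NaN) never match
    return handle.lower() in (m.lower() for m in mentions.split())


print(mentions_handle("@safaricom @JTLKenya @AIRTEL_KE"))
print(mentions_handle("@Airtel_Ke"))
print(mentions_handle(""))
```

With a dataframe, such a helper could be applied through `tweets_df["Mentions"].apply(mentions_handle)` to build the boolean mask.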


In [None]:
# Get the list we already created from the earlier query.
airtel_mention_df = pd.read_csv("AirtelMentions1.csv")
airtel_mention_df.drop(columns=['Unnamed: 0'], inplace=True)
airtel_mention_df.sample(20)

Unnamed: 0,ID,Date,Post,Username,Retweets,Favorites,Geo,Mentions,Hashtags
22314,1244577940446367744,2020-03-30 10:51:18+00:00,@AIRTEL_KE Hello work on your network strength...,Manu_Onyango,0,0,,@AIRTEL_KE,
6617,1282456879164198914,2020-07-12 23:28:41+00:00,@AIRTEL_KE Okay your bundles are depleted so f...,Chep15_,0,0,,@AIRTEL_KE,
14230,1265217098990735361,2020-05-26 09:43:57+00:00,"Dear @AIRTEL_KE, you claim to be competing wit...",Chipmunk254,0,2,,@AIRTEL_KE @SafaricomPLC,
19015,1252908401752973312,2020-04-22 10:33:35+00:00,Pigia @AIRTEL_KE customer care,Stevemulwa9,0,0,,@AIRTEL_KE,
9624,1277351813570793483,2020-06-28 21:22:59+00:00,Been a while haven't used @AIRTEL_KE for inter...,otikenne,0,0,,@AIRTEL_KE,
7911,1280828066260975631,2020-07-08 11:36:22+00:00,@AIRTEL_KE kwani izo night data yenu zinafanya...,SimonChae2,0,0,,@AIRTEL_KE,
3446,1289167680201789442,2020-07-31 11:55:01+00:00,@AIRTEL_KE hey Airtel. Network seems to be unu...,RitaOgada,0,0,,@AIRTEL_KE,
13857,1266043770337927169,2020-05-28 16:28:51+00:00,"Hey @AIRTEL_KE , I attribute this package with...",chumba_boaz,0,0,,@AIRTEL_KE,
11778,1271145042221113347,2020-06-11 18:19:29+00:00,@AIRTEL_KE @AIRTEL_KE check on my 2g data conn...,Antony23832972,0,0,,@AIRTEL_KE @AIRTEL_KE,
2717,1290711301203910656,2020-08-04 18:08:49+00:00,What's the customer service number for @AIRTEL...,UncleJayDwayne,0,0,,@AIRTEL_KE,


In [None]:
# This searches for replies to a tweet: it takes the username and the tweet ID, then looks through all tweets sent to that username after that tweet ID

# airtel_replies=[] # This holds all our posts with their replies in form of dictionaries per each reply

# We loop through our dataframe of tweets getting the value of ID for each row which is the tweet ID as well as get the current number of the loop
# for x, Id in enumerate(airtel_mention_df["ID"]):
#     tweet_id = Id
#     name = airtel_mention_df.Username.iloc[x] # We get the username of the current tweet
#     replies = [] # List of tweets that have the "in_reply_to_status_id_str" attribute equal to the value of our current tweet ID
#     print(x)

      # We retrieve all tweets mentioning our username that were posted after our tweet
#     for tweet in tp.Cursor(api.search,q='to:'+name, since_id = tweet_id, timeout=999999).items():

        # Iterate through the retrieved tweets to find those that reply to our tweet ID
#         if hasattr(tweet, 'in_reply_to_status_id_str'):
#             if (tweet.in_reply_to_status_id_str==tweet_id):
#                 replies.append(tweet)
            # Loop through our list of tweet replies to create a dictionary that has both the tweet and its replies
#             for tweet in replies:
#                 row = {'ID':tweet_id, 'Date': airtel_mention_df.Date.iloc[x], 'Username':name, 
#                         'Post': airtel_mention_df.Post.iloc[x],  'Replier': tweet.user.screen_name, 
#                         'Mentions': airtel_mention_df.Mentions.iloc[x],  'Hashtags': airtel_mention_df.Hashtags.iloc[x],  
#                         'Reply_date':tweet.created_at, 'Reply': tweet.text.replace('\n', ' '), 
#                         'Reply_mentions':' '.join(x['screen_name'] for x in tweet.entities['user_mentions']), 
#                         'Reply_Hashtags':' '.join(x['text'] for x in tweet.entities['hashtags'])}
#                 airtel_replies.append(row)


In [None]:
def retriver(name, tweet_id, tweetsData):
    """Find the tweets posted by AIRTEL_KE since the customer tweeted
    the question (tweet_id) and add them to tweetsData, avoiding duplicates.
    `name` is not used, but it is kept so existing calls still work."""
    try:
        # The Cursor is lazy: requests are only sent while iterating, so the
        # iteration must happen inside the try block for errors to be caught.
        tweet_data = list(tp.Cursor(api.user_timeline, id='AIRTEL_KE', since_id=tweet_id, timeout=999999).items())
    except Exception:
        print('failed to get data')
        tweet_data = []
    for tweet in tweet_data:
        if tweet not in tweetsData:
            tweetsData.append(tweet)

    return tweetsData
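
The `tweet not in tweetsData` check in `retriver` scans the whole list for every new tweet, which gets slow as the list grows into the thousands. One possible alternative is to track the IDs already seen in a set. The sketch below uses stand-in objects with only an `id` attribute instead of real tweet objects, and `merge_new_tweets` is a hypothetical name.

```python
from types import SimpleNamespace


def merge_new_tweets(existing, incoming):
    """Append tweets from `incoming` whose id is not already in `existing`."""
    seen_ids = {tweet.id for tweet in existing}
    for tweet in incoming:
        if tweet.id not in seen_ids:
            existing.append(tweet)
            seen_ids.add(tweet.id)
    return existing


# Stand-in objects with only an `id` attribute, in place of real tweet objects
stored = [SimpleNamespace(id=1), SimpleNamespace(id=2)]
fetched = [SimpleNamespace(id=2), SimpleNamespace(id=3), SimpleNamespace(id=3)]
merged = merge_new_tweets(stored, fetched)
print([t.id for t in merged])
```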

In [None]:
# Testing our function to make sure it returns what we expect
# Data_tweets=[]
# ts = retriver('ntvkenya', '1294919890839773184', Data_tweets)
# ts[0].id

In [None]:
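def get_replies(Data_tweets, df, tweet_id):
    """Return one dictionary per reply to tweet_id found in Data_tweets,
    combining the original tweet's details (looked up in df) with the reply."""
    airtel_replies = []
    # in_reply_to_status_id_str is a string, while the ID column may hold
    # integers after being read back from CSV, so compare IDs as strings.
    tweet_id = str(tweet_id)
    replies = [tweet for tweet in Data_tweets
               if getattr(tweet, 'in_reply_to_status_id_str', None) == tweet_id]
    # Look up the original tweet by its ID instead of relying on the
    # global loop variables x and name.
    matches = df[df["ID"].astype(str) == tweet_id]
    if not replies or matches.empty:
        return airtel_replies
    original = matches.iloc[0]
    for tweet in replies:
        row = {'ID': tweet_id, 'Date': original["Date"], 'Username': original["Username"],
               'Post': original["Post"], 'Replier': tweet.user.screen_name,
               'Mentions': original["Mentions"], 'Hashtags': original["Hashtags"],
               'Reply_date': tweet.created_at, 'Reply': tweet.text.replace('\n', ' '),
               'Reply_mentions': ' '.join(m['screen_name'] for m in tweet.entities['user_mentions']),
               'Reply_Hashtags': ' '.join(h['text'] for h in tweet.entities['hashtags'])}
        airtel_replies.append(row)
    return airtel_replies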
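
`get_replies` walks through every retrieved tweet once per customer tweet. If that becomes a bottleneck, the retrieved tweets could be grouped once by the ID they reply to and then looked up directly. This is a sketch of that idea only, using stand-in objects instead of real tweet objects; `index_replies` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace


def index_replies(tweets):
    """Group tweets by the ID of the tweet they reply to."""
    by_parent = defaultdict(list)
    for tweet in tweets:
        parent_id = getattr(tweet, "in_reply_to_status_id_str", None)
        if parent_id is not None:
            by_parent[parent_id].append(tweet)
    return by_parent


# Stand-in objects carrying only the attributes used here
timeline = [
    SimpleNamespace(id_str="10", in_reply_to_status_id_str="1"),
    SimpleNamespace(id_str="11", in_reply_to_status_id_str=None),
    SimpleNamespace(id_str="12", in_reply_to_status_id_str="1"),
]
replies = index_replies(timeline)
print([t.id_str for t in replies["1"]])
```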

In [None]:
repliesData = []
sort_df = airtel_mention_df.sort_values(by = 'ID')
Data_tweets=[]

In [None]:
# Loop through the sorted dataframe to get replies for each tweet, starting with the oldest
for x, Id in enumerate(sort_df["ID"]):
    # in_reply_to_status_id_str is a string, so compare IDs as strings
    tweet_id = str(Id)
    name = sort_df.Username.iloc[x]
    print(x)
    print(len(Data_tweets))
    # Check whether a reply to this tweet has already been retrieved
    present = any(tw.in_reply_to_status_id_str == tweet_id for tw in Data_tweets)
    print(present)
    try:
        if present:
            print('good')
        else:
            Data_tweets = retriver(name, tweet_id, Data_tweets)
            print("Run retriver")
        repliesData.extend(get_replies(Data_tweets, sort_df, tweet_id))
    except Exception:
        print('failed')
    # Save the data scraped so far after every tweet to prevent loss in case the code crashes
    airtelData_df = pd.DataFrame(repliesData)

    airtelData_df.to_csv(path_or_buf="AirtelData.csv")

0
0
False
Run retriver
1
3242
False
Run retriver
2
3242
False
Run retriver
3
3242
False
Run retriver
4
3242
False
Run retriver
5
3242
False
Run retriver
6
3242
False
Run retriver
7
3242
False
Run retriver
8
3242
False
Run retriver
9
3242
False
Run retriver
10
3242
False
Run retriver
11
3242
False
Run retriver
12
3242
False
Run retriver
13
3242
False
Run retriver
14
3242
False
Run retriver
15
3242
False
Run retriver
16
3242
False
Run retriver
17
3242
False
Run retriver
18
3242
False
Run retriver
19
3242
False
Run retriver
20
3242
False
Run retriver
21
3242
False
Run retriver
22
3242
False
Run retriver
23
3242
False
Run retriver
24
3242
False
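
The loop above rewrites the whole CSV file after every tweet, which protects against crashes but repeats more work as the data grows. Another way to checkpoint is to append only the new rows. The sketch below uses the standard `csv` module; `append_rows`, the shortened column list and the temporary file are assumptions made for the example.

```python
import csv
import os
import tempfile

FIELDS = ["ID", "Reply"]  # shortened column list for the example


def append_rows(path, rows, fieldnames=FIELDS):
    """Append rows (dictionaries) to a CSV file, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Write two checkpoints to a temporary file, then read everything back
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")
append_rows(checkpoint_path, [{"ID": "1", "Reply": "Checking."}])
append_rows(checkpoint_path, [{"ID": "2", "Reply": "Always here to assist."}])
with open(checkpoint_path, newline="", encoding="utf-8") as handle:
    saved = list(csv.DictReader(handle))
print(len(saved))
```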


In [None]:
airtelData_df = pd.DataFrame(repliesData)

airtelData_df.to_csv(path_or_buf="AirtelData.csv")

In [None]:
airtelData_df.sample(20)

In [None]:
# test = api.get_status('1243081255102615552')
# test.entities

In [23]:
tC = got.manager.TweetCriteria().setUsername("barackobama").setSince("2015-09-10")\
                                            .setMaxTweets(1)
twts = got.manager.TweetManager.getTweets(tC)
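# A Tweet object is not iterable (looping over it raises
# "TypeError: 'Tweet' object is not iterable"), so list its attributes with vars()
for twe in twts:
    for attr, value in vars(twe).items():
        print(attr, value)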

In [29]:
print(sort_df.ID.head(1))
print(sort_df.ID.tail(1))

199993    1169605566479687680
Name: ID, dtype: object
3    1295448699070554119
Name: ID, dtype: object


In [25]:
# A view of how a tweet object looks and its attributes
api.get_status('1169605566479687680')

Status(_api=<tweepy.api.API object at 0x0000022441941B00>, _json={'created_at': 'Thu Sep 05 13:37:51 +0000 2019', 'id': 1169605566479687680, 'id_str': '1169605566479687680', 'text': 'Please clarify this because I have visited your shop in Narok and they are saying the lines are not working… https://t.co/PNmw0GfW7G', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/PNmw0GfW7G', 'expanded_url': 'https://twitter.com/i/web/status/1169605566479687680', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [109, 132]}]}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 3026331367, 'id_str': '3026331367', 'name': 'Ronoh K Clinton 🇰🇪', 'screen_name': 'RonohClinton', 'location': 'Kenya', 'descrip

In [30]:
1295448699070554119 - 1169605566479687680

125843132590866439
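
The subtraction above gives roughly 1.26e17 possible IDs between the two tweets. As far as I know, tweet IDs of this size are "snowflake" IDs: the bits above the lowest 22 hold the creation time in milliseconds counted from a Twitter-specific epoch, so the gap between two IDs reflects elapsed time rather than the number of tweets in between. The sketch below decodes that timestamp. The epoch constant and the 22-bit shift are taken from Twitter's published snowflake format, and `snowflake_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # offset used by Twitter's snowflake IDs


def snowflake_to_datetime(tweet_id):
    """Return the UTC creation time encoded in a snowflake tweet ID."""
    timestamp_ms = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


oldest = snowflake_to_datetime(1169605566479687680)
print(oldest.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-09-05 13:37:51
```

The decoded time agrees with the `created_at` field of the status shown earlier (Thu Sep 05 13:37:51 +0000 2019).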

In [None]:
# This is the code I need you to run on a stronger, faster machine with a stable internet connection.
# It goes through every possible tweet ID between the oldest and the newest tweet in our list.
# Note: the range covers about 1.26e17 IDs, and most of them do not belong to any tweet.
# Store the IDs as strings because in_reply_to_status_id_str is a string
status_id_lst = set(airtel_mention_df["ID"].astype(str))
dataAirtel = []

for status_id in range(1169605566479687680, 1295448699070554119):
    try:
        tweet = api.get_status(str(status_id))
    except Exception:
        # get_status raises an error when no tweet exists with this ID
        continue
    if tweet.in_reply_to_status_id_str in status_id_lst:
        df = airtel_mention_df.loc[airtel_mention_df['ID'].astype(str) == tweet.in_reply_to_status_id_str]
        for rw in df.values.tolist():
            row = {'ID': rw[0], 'Date': rw[1], 'Username': rw[3],
                   'Post': rw[2], 'Mentions': rw[7], 'Hashtags': rw[8],
                   'Replier': tweet.user.screen_name,
                   'Reply_date': tweet.created_at, 'Reply': tweet.text.replace('\n', ' '),
                   'Reply_mentions': ' '.join(m['screen_name'] for m in tweet.entities['user_mentions']),
                   'Reply_Hashtags': ' '.join(h['text'] for h in tweet.entities['hashtags'])}
            dataAirtel.append(row)
    
