<a href="https://colab.research.google.com/github/tsparaskevas/ML_EDDE2/blob/main/Final%20essay/Scraping_tweets_for_academic_research.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a



###**1 --- Import libraries**

In [None]:
# For sending GET requests to the API
import requests
# For saving access tokens and for file management when creating and adding to the dataset
import os
# For dealing with json responses we receive from the API
import json
# For displaying the data afterwards as a DataFrame
import pandas as pd
# For saving the response data in CSV format
import csv
# For parsing the dates received from twitter in readable formats
import datetime
# For the current date if we need it
from datetime import date
import dateutil.parser
import unicodedata
# To add wait time between requests
import time
# For number utilities
import numpy as np 
# For searching in gdrive folders
from pathlib import Path
# For saving csvs to gdrive
from google.colab import drive 
drive.mount('gdrive', force_remount=True)

Mounted at gdrive


###**2 --- Define Variables and Preferences**

####**Twitter Developer | My Keys & Tokens**
Fill in your bearer token (listed under Keys & Tokens in the Twitter Developer portal).

In [None]:
os.environ['TOKEN'] = ''
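
A token typed directly into the cell is easy to leak when the notebook is shared. As a minimal sketch, assuming you would rather be prompted for it, the helper below falls back to `getpass.getpass` when the `TOKEN` environment variable is empty. `load_token` and its `env` and `prompt` parameters are hypothetical names, not part of the original notebook.

```python
import os
import getpass


def load_token(env=os.environ, prompt=getpass.getpass):
    """Return the bearer token from env['TOKEN'], prompting if it is missing or empty.

    Hypothetical helper: `env` and `prompt` are parameters only so the
    function can be exercised without real user input.
    """
    token = env.get('TOKEN', '').strip()
    if not token:
        token = prompt('Twitter bearer token: ').strip()
        env['TOKEN'] = token  # cache it so auth() can find it later
    return token
```

Calling `load_token()` once before `auth()` leaves the rest of the notebook unchanged, since `auth()` still reads `TOKEN` from the environment.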

###**3 --- Functions**

In [None]:
# retrieve the token from the environment
def auth():
    return os.getenv('TOKEN')

In [None]:
# take the bearer token, pass it for authorization and return headers we will use to access the API
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

In [None]:
# build the request for the endpoint we are going to use and the parameters we want to pass
# here we use the full-archive search endpoint
def create_url(keyword, start_date, end_date, max_results = 10):
    
    # Change to the endpoint you want to collect data from
    search_url = "https://api.twitter.com/2/tweets/search/all" # Twitter’s API full list of different endpoints: https://developer.twitter.com/en/docs/twitter-api/early-access

    # change params based on the endpoint you are using
    # a query can be customized using search operators
    # full list of operators:
    # https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query
    query_params = {'query': keyword, # there may be combinations (e.g. “(xbox europe) OR (xbox usa)”)
                    'start_time': start_date,
                    'end_time': end_date,
                    'max_results': max_results,
                    'expansions': 'author_id,in_reply_to_user_id,geo.place_id',
                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': None} # placeholder; connect_to_endpoint sets the real value
    return (search_url, query_params)

In [None]:
# put it all together and connect to the endpoint
def connect_to_endpoint(url, headers, params, next_token = None): # next_token is set to none because we only care about it if it exists 
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
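
`connect_to_endpoint` raises on any non-200 status, including the 429 returned when the rate limit is hit. The following is a minimal sketch of one way to wait and retry instead, assuming the response carries the `x-rate-limit-reset` header (epoch seconds). `call_with_rate_limit_retry` and its parameters are hypothetical; `send` is any zero-argument callable that returns a response object.

```python
import time


def call_with_rate_limit_retry(send, max_retries=3, sleep=time.sleep, now=time.time):
    """Call send() and retry after a pause whenever the API answers 429.

    Hypothetical helper: `sleep` and `now` are injectable so the logic can be
    tried out without actually waiting.
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429 and attempt < max_retries:
            # Assumption: fall back to a 60-second wait if the header is missing
            reset_at = float(response.headers.get('x-rate-limit-reset', now() + 60))
            sleep(max(reset_at - now(), 1))
            continue
        # Any other error, or 429 after the last retry, is raised as before
        raise Exception(response.status_code, response.text)
```

With the functions above, a call could look like `call_with_rate_limit_retry(lambda: requests.request("GET", url[0], headers=headers, params=url[1]))`.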

###**4 --- Testing the desired query**

**Setup inputs**

In [None]:
# Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = "(δολοφονία OR δολοφόνος OR συζυγοκτόνος OR συζυγοκτονία OR γυναικοκτόνος OR γυναικοκτονία OR ανθρωποκτονία OR ανθρωποκτόνος) (-is:retweet)"# (place_country:GR) (lang:el) (-is:retweet)"
start_time = "2020-03-01T00:00:00.000Z"
end_time = "2020-03-31T00:00:00.000Z"
max_results = 15
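
The `keyword` string above is written by hand as a group of terms joined with `OR`, followed by operators such as `(-is:retweet)`. A minimal sketch of assembling the same shape from a list of terms is shown below. `build_query` is a hypothetical helper, not part of the original notebook.

```python
def build_query(terms, exclude_retweets=True, lang=None):
    """Join search terms with OR and append optional operators.

    Hypothetical helper that mirrors the shape of the hand-written query:
    "(term1 OR term2) (lang:xx) (-is:retweet)".
    """
    query = "(" + " OR ".join(terms) + ")"
    if lang:
        query += f" (lang:{lang})"
    if exclude_retweets:
        query += " (-is:retweet)"
    return query


# Two of the terms used in this notebook
print(build_query(["δολοφονία", "γυναικοκτονία"]))
# (δολοφονία OR γυναικοκτονία) (-is:retweet)
```

Keeping the terms in a list makes it easier to reuse the same operators for a different topic.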

**Make request and get response**

In [None]:
# create the URL and get the response from the API
url = create_url(keyword, start_time, end_time, max_results)
json_response = connect_to_endpoint(url[0], headers, url[1])

Endpoint Response Code: 200


In [None]:
# Print the json_response
print(json.dumps(json_response, indent=4, sort_keys=True, ensure_ascii=False))

In [None]:
# Create df from the json_response
df = pd.DataFrame(json_response['data'])

In [None]:
# Preview the df
df
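
The DataFrame above holds only the `data` part of the response. Because the request sets `expansions` and `user.fields`, the user objects arrive separately under `includes` → `users`. Here is a minimal sketch of joining usernames onto the tweets. `attach_usernames` is a hypothetical helper and `sample_response` contains made-up values.

```python
def attach_usernames(json_response):
    """Return the tweets as dicts with a 'username' key looked up from includes.users."""
    users = json_response.get('includes', {}).get('users', [])
    username_by_id = {user['id']: user['username'] for user in users}
    rows = []
    for tweet in json_response.get('data', []):
        row = dict(tweet)  # copy so the original response is left untouched
        row['username'] = username_by_id.get(tweet['author_id'])  # None if not included
        rows.append(row)
    return rows


# Made-up response in the shape returned by the search endpoint
sample_response = {
    'data': [{'id': '1', 'author_id': '10', 'text': 'hello'},
             {'id': '2', 'author_id': '11', 'text': 'world'}],
    'includes': {'users': [{'id': '10', 'username': 'alice'}]},
}
rows = attach_usernames(sample_response)
```

The result can be passed to `pd.DataFrame(rows)` in the same way as `json_response['data']`.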

**Save data to csv file**

In [None]:
# Create file
#csvFile = open(file_path_and_name, "a", newline="", encoding='utf-8')
#csvWriter = csv.writer(csvFile)

# Create headers for the data you want to save, in this example, we only want to save these columns in our dataset
#csvWriter.writerow(['author id', 'created_at', 'geo', 'id','lang', 'like_count', 'quote_count', 'reply_count','retweet_count','source','text'])
#csvFile.close()

In [None]:
def append_to_csv(json_response, fileName):

    # A counter variable
    counter = 0

    # Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)

    #Loop through each tweet
    for tweet in json_response['data']:
        
        # We will create a variable for each since some of the keys might not exist for some tweets
        # So we will account for that

        # 1. Author ID
        author_id = tweet['author_id']

        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])

        # 3. Geolocation (the 'geo' object may be absent or may lack 'place_id')
        geo = tweet.get('geo', {}).get('place_id', " ")

        # 4. Tweet ID
        tweet_id = tweet['id']

        # 5. Language
        lang = tweet['lang']

        # 6. Tweet metrics
        retweet_count = tweet['public_metrics']['retweet_count']
        reply_count = tweet['public_metrics']['reply_count']
        like_count = tweet['public_metrics']['like_count']
        quote_count = tweet['public_metrics']['quote_count']

        # 7. Source (may be missing for some tweets, so fall back to a blank)
        source = tweet.get('source', " ")

        # 8. Tweet text
        text = tweet['text']
        
        # Assemble all data in a list
        res = [author_id, created_at, geo, tweet_id, lang, like_count, quote_count, reply_count, retweet_count, source, text]
        
        # Append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1

    # When done, close the CSV file
    csvFile.close()

    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter) 

**Save csv**

In [None]:
# Run the function and save to csv
#append_to_csv(json_response, file_path_and_name)

###**5 --- Set file path and name**

In [None]:
# Set file path
file_path = "/content/gdrive/MyDrive/Colab Notebooks/Scraping/Tweets/Topics"
# Set file's folder name
file_folder = "ergasiaEDDE2" 
# Set file name
file_name = "homicide2_2020"
# AUTO Generate file path
file_path_and_folder = file_path + "/" + file_folder
# AUTO Generate file path and name
file_path_and_name = file_path_and_folder + "/" + file_name + ".csv"

In [None]:
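# Check for the folder (not the file), so an existing folder without the csv does not raise an error
if not os.path.exists(file_path_and_folder):
  os.makedirs(file_path_and_folder, exist_ok=True)
  print(f"{file_folder}'s folder is now available in gdrive")
else:
  print(f"{file_folder}'s folder already exists in gdrive")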

ergasiaEDDE2's folder is now available in gdrive
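
`Path` is imported in section 1, so the same path handling can also be written with `pathlib`. This is a minimal sketch: `prepare_csv_path` is a hypothetical helper, and a temporary directory stands in for the Drive folder.

```python
import tempfile
from pathlib import Path


def prepare_csv_path(base_dir, folder, name):
    """Create base_dir/folder if needed and return the path of name.csv inside it."""
    target_dir = Path(base_dir) / folder
    # parents=True creates missing parent folders; exist_ok=True makes re-runs safe
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / f"{name}.csv"


base = tempfile.mkdtemp()  # stand-in for the mounted Drive folder
csv_path = prepare_csv_path(base, "ergasiaEDDE2", "homicide2_2020")
print(csv_path.name)  # homicide2_2020.csv
```

`str(csv_path)` gives a plain string that can be used wherever `file_path_and_name` is expected.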


###**6 --- Put to production**

In [None]:
#Inputs for tweets
bearer_token = auth()
headers = create_headers(bearer_token)
# keyword is reused from the test query in section 4
start_list =    ['2020-01-01T00:00:00.000Z',
                 '2020-02-01T00:00:00.000Z',
                 '2020-03-01T00:00:00.000Z',
                 '2020-04-01T00:00:00.000Z',
                 '2020-05-01T00:00:00.000Z',
                 '2020-06-01T00:00:00.000Z',
                 '2020-07-01T00:00:00.000Z',
                 '2020-08-01T00:00:00.000Z',
                 '2020-09-01T00:00:00.000Z',
                 '2020-10-01T00:00:00.000Z',
                 '2020-11-01T00:00:00.000Z',
                 '2020-12-01T00:00:00.000Z']

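# Note: the API treats end_time as exclusive, so with these values tweets posted
# on the last day of each month are not collected
end_list =      ['2020-01-31T00:00:00.000Z',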
                 '2020-02-29T00:00:00.000Z',
                 '2020-03-31T00:00:00.000Z',
                 '2020-04-30T00:00:00.000Z',
                 '2020-05-31T00:00:00.000Z',
                 '2020-06-30T00:00:00.000Z',
                 '2020-07-31T00:00:00.000Z',
                 '2020-08-31T00:00:00.000Z',
                 '2020-09-30T00:00:00.000Z',
                 '2020-10-31T00:00:00.000Z',
                 '2020-11-30T00:00:00.000Z',
                 '2020-12-31T00:00:00.000Z']
max_results = 500

#Total number of tweets we collected from the loop
total_tweets = 0

# Create the file and write the header row for the columns we want to save in our dataset.
# The header is written only when the file is new or empty, so re-running this cell does not add a second header row.
write_header = not os.path.exists(file_path_and_name) or os.path.getsize(file_path_and_name) == 0
with open(file_path_and_name, "a", newline="", encoding='utf-8') as csvFile:
    if write_header:
        csv.writer(csvFile).writerow(['author id', 'created_at', 'geo', 'id','lang', 'like_count', 'quote_count', 'reply_count','retweet_count','source','tweet'])

# The for-loop goes over the months/weeks/days we want to cover
for i in range(0,len(start_list)):

    # Inputs
    count = 0 # Counting tweets per time period
    max_count = 8000 # Max tweets per time period
    flag = True
    next_token = None
    
    # Check if flag is true
    while flag:
        # Check if max_count reached
        if count >= max_count:
            break
        print("-------------------")
        print("Token: ", next_token)
        url = create_url(keyword, start_list[i],end_list[i], max_results)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']

        if 'next_token' in json_response['meta']:
            # Save the token to use for next call
            next_token = json_response['meta']['next_token']
            print("Next Token: ", next_token)
            if result_count is not None and result_count > 0 and next_token is not None:
                print("Start Date: ", start_list[i])
                append_to_csv(json_response, file_path_and_name)
                count += result_count
                total_tweets += result_count
                print("Total # of Tweets added: ", total_tweets)
                print("-------------------")
                time.sleep(5)                
        # If no next token exists
        else:
            if result_count is not None and result_count > 0:
                print("-------------------")
                print("Start Date: ", start_list[i])
                append_to_csv(json_response, file_path_and_name)
                count += result_count
                total_tweets += result_count
                print("Total # of Tweets added: ", total_tweets)
                print("-------------------")
                time.sleep(5)
            
            #Since this is the final request, turn flag to false to move to the next time period.
            flag = False
            next_token = None
        time.sleep(5)
print("Total number of results: ", total_tweets)

-------------------
Token:  None
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fo71enc1ew1o23o4jfv5vd24simqv1
Start Date:  2020-01-01T00:00:00.000Z
# of Tweets added from this response:  495
Total # of Tweets added:  495
-------------------
-------------------
Token:  b26v89c19zqg8o3fo71enc1ew1o23o4jfv5vd24simqv1
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fo71dt53l3k3yg48hmq4138i5vis8t
Start Date:  2020-01-01T00:00:00.000Z
# of Tweets added from this response:  493
Total # of Tweets added:  988
-------------------
-------------------
Token:  b26v89c19zqg8o3fo71dt53l3k3yg48hmq4138i5vis8t
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fo6yhtlg1nalk0188haw3srdv36v0d
Start Date:  2020-01-01T00:00:00.000Z
# of Tweets added from this response:  489
Total # of Tweets added:  1477
-------------------
-------------------
Token:  b26v89c19zqg8o3fo6yhtlg1nalk0188haw3srdv36v0d
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fo6ygzp350y1tygiaukww4nxmrzl31
Sta
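
The loop above keeps requesting pages while the response's `meta` contains a `next_token` and the per-period limit has not been reached. A minimal sketch of that logic in isolation is given below. `paginate` and `fetch_page` are hypothetical names, and `fake_pages` is made-up data that stands in for real API responses.

```python
def paginate(fetch_page, max_count=8000):
    """Yield non-empty response pages until next_token is absent or max_count is reached.

    `fetch_page(next_token)` is assumed to return one response dict
    with a 'meta' object, like the search endpoint does.
    """
    count = 0
    next_token = None
    while count < max_count:
        page = fetch_page(next_token)
        meta = page.get('meta', {})
        if meta.get('result_count', 0) > 0:
            count += meta['result_count']
            yield page
        next_token = meta.get('next_token')
        if next_token is None:
            break  # this was the last page for the time period


# Made-up pages keyed by the token that requests them
fake_pages = {
    None: {'data': [{'id': '1'}, {'id': '2'}], 'meta': {'result_count': 2, 'next_token': 'abc'}},
    'abc': {'data': [{'id': '3'}], 'meta': {'result_count': 1}},
}
pages = list(paginate(lambda token: fake_pages[token]))
print(sum(len(p['data']) for p in pages))  # 3
```

In the notebook, `fetch_page` could be `lambda token: connect_to_endpoint(url[0], headers, url[1], token)`. The sketch leaves out the `time.sleep` calls, which are still needed between real requests.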

**Check csv**

In [None]:
df = pd.read_csv('gdrive/MyDrive/Colab Notebooks/Scraping/Tweets/Topics/ergasiaEDDE2/troxaio_2020.csv')
df

Unnamed: 0,author id,created_at,geo,id,lang,like_count,quote_count,reply_count,retweet_count,source,tweet
0,1124027292517376000,2020-01-30 22:45:51+00:00,,1.223014e+18,el,7.0,0.0,0.0,0.0,Twitter for Android,"Μεγαλυτερη μαλακια απ το ""εμεις φτιαχνουμε τη..."
1,790784018530897920,2020-01-30 21:55:06+00:00,,1.223002e+18,el,0.0,0.0,0.0,0.0,CyprusTodayNews,Πέθανε ο ποδοσφαιριστής Γρηγόρης Πικής – Είχε ...
2,3288531958,2020-01-30 21:05:59+00:00,,1.222989e+18,el,7.0,0.0,1.0,0.0,Twitter for Android,@athaneziak Είχε τροχαίο το παιδί πριν 20 μέρε...
3,721750076838789121,2020-01-30 20:30:48+00:00,,1.222981e+18,el,0.0,0.0,0.0,0.0,WordPress.com,Έχασα τον άντρα μου σε τροχαίο και παντρεύτηκα...
4,312531855,2020-01-30 20:01:36+00:00,,1.222973e+18,el,0.0,0.0,0.0,0.0,Twitter Web Client,Καστοριά – Τροχαίο στο Δισπηλιό https://t.co/o...
...,...,...,...,...,...,...,...,...,...,...,...
35813,1273585048109944834,2020-12-01 07:25:36+00:00,,1.333674e+18,el,0.0,0.0,0.0,0.0,Twitter Web App,Μάνη: Αγριογούρουνο σκότωσε οδηγό μηχανής σε φ...
35814,604942517,2020-12-01 07:24:42+00:00,,1.333673e+18,el,0.0,0.0,0.0,0.0,WordPress.com,Μάνη: Αγριογούρουνο σκότωσε οδηγό μηχανής σε φ...
35815,194941135,2020-12-01 07:15:27+00:00,,1.333671e+18,el,0.0,0.0,0.0,0.0,dete autopost,Πάτρα: Τροχαίο με τραυματία στην Έλληνος Στρατ...
35816,1913769302,2020-12-01 06:42:04+00:00,,1.333663e+18,el,0.0,0.0,0.0,0.0,leenk.me,Τουλάχιστον 3 εμπλέκονται στο θανατηφόρο τροχ...
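
In the preview above, the `id` column is shown as floats such as `1.223014e+18`. Tweet IDs are 64-bit integers, which a float cannot always represent exactly, so IDs read as floats may be rounded. Here is a minimal sketch of the effect and of keeping IDs as text when reading a CSV. The ID and `sample_csv` are made-up values for illustration.

```python
import csv
import io

tweet_id = 1223014000000000001  # hypothetical ID, for illustration only
# Converting to float and back does not return the same integer
print(int(float(tweet_id)) == tweet_id)  # False

# Reading with the csv module keeps every field as a string, so nothing is rounded
sample_csv = "id,lang\n1223014000000000001,el\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(rows[0]['id'])  # 1223014000000000001
```

With pandas, passing `dtype={'id': str, 'author id': str}` to `pd.read_csv` has a similar effect.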
