# Demo for collecting a set of tweets with twarc

You can, of course, create a virtual environment and install the following libraries using pip.

However, this demo aims for the easiest possible approach to collecting tweets with Python and twarc.

To run this notebook, install Anaconda, download the notebook into a folder, and open it in Anaconda.

In [None]:
!pip install twarc
!pip install pandas

In [37]:
import pandas as pd

Run twarc2 configure on the command line - you only need to do this once per computer:

    twarc2 configure  

Here, we are running twarc on the command line. For more elaborate functionality, consider using the [Python client](https://twarc-project.readthedocs.io/en/latest/api/client2/) instead.

Once configured, however, the command line is a handy way to rerun the search and export the data.

In [43]:
!twarc2 search --archive --start-time '2021-01-01T00:00:00' --end-time '2021-05-01T00:00:00' --flatten 'rajapinta lang:fi' rajapinta.jsonl

The data is now available in [rajapinta.jsonl](rajapinta.jsonl) following the [JSON Lines](https://jsonlines.org/) format.

Reading the data and creating a Pandas dataframe:

In [44]:
import json
with open('rajapinta.jsonl') as f:
    tweets_jsonl = f.read()
    
tweets = [json.loads(jline) for jline in tweets_jsonl.splitlines()]
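For larger collections, reading the whole file into memory before parsing can be wasteful. The following is a minimal sketch of a line-by-line reader; `read_jsonl` is a hypothetical helper name, not part of twarc or pandas.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

With the file collected above, `tweets = list(read_jsonl('rajapinta.jsonl'))` should give the same list as the cell above.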

In [45]:
df = pd.DataFrame(tweets)

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 433 entries, 0 to 432
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   lang                 433 non-null    object
 1   text                 433 non-null    object
 2   conversation_id      433 non-null    object
 3   author_id            433 non-null    object
 4   possibly_sensitive   433 non-null    bool  
 5   referenced_tweets    361 non-null    object
 6   reply_settings       433 non-null    object
 7   created_at           433 non-null    object
 8   source               433 non-null    object
 9   public_metrics       433 non-null    object
 10  in_reply_to_user_id  163 non-null    object
 11  entities             415 non-null    object
 12  id                   433 non-null    object
 13  author               433 non-null    object
 14  in_reply_to_user     163 non-null    object
 15  __twarc              433 non-null    object
 16  context_

Note that some columns hold nested data. For example, each value in the author column is a dictionary with its own keys:

In [33]:
df.loc[0].author.keys()

dict_keys(['id', 'public_metrics', 'created_at', 'description', 'url', 'username', 'verified', 'name', 'profile_image_url', 'protected'])

A shorthand for picking out an individual value:

In [21]:
df.loc[0].author.get('created_at')

'2012-08-15T09:36:41.000Z'
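If you need many nested values at once, picking them out one by one gets tedious. As a sketch, `pd.json_normalize` can expand nested dictionaries into separate columns. The two records below are made up for illustration and only mimic the shape of the real data.

```python
import pandas as pd

# Two made-up records that mimic the nested shape of flattened twarc output.
sample_tweets = [
    {"id": "1", "text": "first",
     "author": {"username": "alice", "verified": False},
     "public_metrics": {"like_count": 3}},
    {"id": "2", "text": "second",
     "author": {"username": "bob", "verified": True},
     "public_metrics": {"like_count": 0}},
]

# json_normalize expands nested dictionaries into dotted column names,
# such as 'author.username' and 'public_metrics.like_count'.
flat = pd.json_normalize(sample_tweets)
print(sorted(flat.columns))
```

With the real data, the same call could be applied to the `tweets` list read earlier. Values that are lists are left as they are.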

Serializing the data (i.e., writing it to a file) as CSV and Excel:

In [47]:
df.to_csv('rajapinta.csv', encoding='utf-8')

In [49]:
!pip install openpyxl

Collecting openpyxl
  Using cached openpyxl-3.0.7-py2.py3-none-any.whl (243 kB)
Collecting et-xmlfile
  Using cached et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.7


In [50]:
df.to_excel('rajapinta.xlsx')
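One thing to watch for when exporting: nested columns are written using their Python text representation, which other tools may not be able to parse. A possible workaround, sketched below, is to encode those cells as JSON strings first. `jsonify_nested` is a hypothetical helper and the sample frame is made up.

```python
import json

import pandas as pd


def jsonify_nested(frame):
    """Return a copy where dict and list cells are encoded as JSON strings."""
    def encode(value):
        if isinstance(value, (dict, list)):
            return json.dumps(value, ensure_ascii=False)
        return value

    out = frame.copy()
    for col in out.columns:
        out[col] = out[col].map(encode)
    return out


# Made-up sample with one nested column.
sample = pd.DataFrame([
    {"id": "1", "author": {"username": "alice"}},
    {"id": "2", "author": {"username": "bob"}},
])

# Without a path, to_csv returns the CSV content as a string.
csv_text = jsonify_nested(sample).to_csv(index=False)
print(csv_text)
```

For the data collected here, something like `jsonify_nested(df).to_csv('rajapinta.csv', index=False)` would follow the same idea.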

## To be continued: running twarc in native Python

In [40]:
from getpass import getpass
bearer_token = getpass('Enter the secret value: ')

Enter the secret value: ········


In [41]:
from twarc import Twarc2 

client = Twarc2(bearer_token=bearer_token)

Note that twarc does not accept a trailing Z in timestamp strings, and it does not seem to support timezone offsets at all. Let us know if you find otherwise.

In [None]:
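import datetime

# Assumption: the Python client expects datetime objects (treated as UTC), not strings.
tweets = client.search_all(
    query='rajapinta lang:fi',
    start_time=datetime.datetime(2021, 1, 1, tzinfo=datetime.timezone.utc),
)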
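Given the caveat about the trailing Z, timestamps that come from elsewhere (for instance a tweet's created_at value) may need to be normalized before they are passed to twarc on the command line. Below is a minimal sketch that produces the same format as the search command earlier in this notebook. `to_twarc_timestamp` is a hypothetical helper, and it assumes that timestamps without an offset are already in UTC.

```python
from datetime import datetime, timezone


def to_twarc_timestamp(value):
    """Convert an ISO 8601 string to 'YYYY-MM-DDTHH:MM:SS' in UTC, with no suffix."""
    # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is not None:
        # Shift to UTC, then drop the offset.
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    # Assumption: a timestamp without an offset is already in UTC.
    return parsed.strftime("%Y-%m-%dT%H:%M:%S")


print(to_twarc_timestamp("2021-01-01T02:00:00+02:00"))
```

The result can then be used as the value for options such as `--start-time`.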

