# Code to Filter Tweets by Language

This notebook filters the archive.org Twitter dumps by language and stores the result in two tsv files: one with all of the Tweets and one that excludes Retweets. To use it, extract the tar file for a given day; a folder named with the day number `XX` will appear.

This notebook must be in the same directory as that folder. Inside the folder there is one subfolder for each hour of the day, and inside each of those there is a json.bz2 file for each minute of the hour. These must be extracted so that the json file for each minute sits in the corresponding hour directory.
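The extraction step can also be done from Python. The following is only a minimal sketch, assuming the `XX/HH/MM.json.bz2` layout described above; `extract_day` is a hypothetical helper that is not part of the original notebook.

```python
import bz2
import shutil
from pathlib import Path

def extract_day(day_dir):
    """Decompress every HH/MM.json.bz2 under day_dir next to its archive.

    Hypothetical helper; assumes the XX/HH/MM.json.bz2 layout.
    """
    extracted = []
    for archive in sorted(Path(day_dir).glob("*/*.json.bz2")):
        target = archive.with_suffix("")  # "30.json.bz2" -> "30.json"
        with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    return extracted
```

Calling `extract_day("02")` would leave the json files in the hour directories, which is where the import cell further down expects them.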

This piece of code imports the necessary packages.

In [1]:
import json
import pandas as pd

Change these according to the name of the day folder and to the language that you want to filter.

In [2]:
day_dir  = "02"
language = "es"

The following code imports the Tweets and filters them by language.

If `Not Found` is displayed, the json file was most likely not extracted correctly. Some of the json files from the beginning of the 00 hour might be missing, which is expected. Any other missing file is not.

The messages `Tab character found!` and `Backslash found` come from the checks that are commented out in the code. If you enable them, they tell you to be careful with the output so that no rogue character breaks the final tsv file.

In [2]:
#path = "./" + day_dir + "/"
language = "en"
path = "../../train_data/twitter_en/2018/10/01/"

tweets = []

for a in range(0,3):
    for b in range(0,10):
        if a*10 + b > 23:
            continue
        hour = str(a) + str(b)
        print("\nImporting data from hour", hour)
        for d in range(0,6):
            for u in range(0,10):
                minute = str(d) + str(u)
                file = path + hour + "/" + minute + ".json"
                try:
                    # Open with a context manager so each file is closed after reading.
                    with open(file, 'r') as handle:
                        for line in handle:
                            tweet = json.loads(line)
                            # Entries without a "lang" field are skipped.
                            if tweet.get("lang") == language:
                                tweets.append(tweet)
#                                 if ("\t" in tweet["text"]):
#                                     print("Tab character found!")
#                                 if ("\\" in tweet["text"]):
#                                     print("Backslash found")
                except FileNotFoundError:
                    print("Not Found: Hour: {} Minute: {}".format(hour, minute))

print("\nTotal Tweets found:")
print(len(tweets))


Importing data from hour 00
Not Found: Hour: 00 Minute: 00
Not Found: Hour: 00 Minute: 01
Not Found: Hour: 00 Minute: 02
Not Found: Hour: 00 Minute: 03
Not Found: Hour: 00 Minute: 04
Not Found: Hour: 00 Minute: 05
Not Found: Hour: 00 Minute: 06
Not Found: Hour: 00 Minute: 07
Not Found: Hour: 00 Minute: 08
Not Found: Hour: 00 Minute: 09
Not Found: Hour: 00 Minute: 10
Not Found: Hour: 00 Minute: 11
Not Found: Hour: 00 Minute: 12
Not Found: Hour: 00 Minute: 13
Not Found: Hour: 00 Minute: 14
Not Found: Hour: 00 Minute: 15
Not Found: Hour: 00 Minute: 16
Not Found: Hour: 00 Minute: 17
Not Found: Hour: 00 Minute: 18
Not Found: Hour: 00 Minute: 19
Not Found: Hour: 00 Minute: 20
Not Found: Hour: 00 Minute: 21
Not Found: Hour: 00 Minute: 22
Not Found: Hour: 00 Minute: 23
Not Found: Hour: 00 Minute: 24
Not Found: Hour: 00 Minute: 25
Not Found: Hour: 00 Minute: 26
Not Found: Hour: 00 Minute: 27
Not Found: Hour: 00 Minute: 28

Importing data from hour 01

Importing data from hour 02

Importing dat
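
The hour and minute loops above can be written more compactly with zero-padded format strings. This is only a sketch of the same idea, not the notebook's own code: `load_tweets` is a hypothetical helper, and it collects the missing files in a list instead of printing them.

```python
import json
from pathlib import Path

def load_tweets(day_path, language):
    """Return (tweets, missing) for every HH/MM.json file under day_path.

    Hypothetical helper; `missing` holds (hour, minute) strings of absent files.
    """
    tweets, missing = [], []
    for hour in range(24):
        for minute in range(60):
            hh, mm = f"{hour:02d}", f"{minute:02d}"
            file = Path(day_path) / hh / (mm + ".json")
            try:
                with open(file, "r") as handle:
                    for line in handle:
                        tweet = json.loads(line)
                        if tweet.get("lang") == language:
                            tweets.append(tweet)
            except FileNotFoundError:
                missing.append((hh, mm))
    return tweets, missing
```

Returning the missing files makes it easier to check that only the first minutes of hour 00 are absent.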

This part checks whether all metadata is accounted for. The `keep` list holds the fields I consider useful, and the ones in `special` could be useful either to further prune the data or to actually keep the proper fields. What you do with these is up to you. An output other than `[]` means that at least one of your entries has metadata that hadn't appeared in any of my runs of the notebook.

In [3]:
nokeep = ["user", "geo", "coordinates", "quote_count", "contributors","reply_count","retweet_count", "favorited",
          "retweeted", "in_reply_to_status_id", "in_reply_to_status_id_str", "id_str", "created_at", "favorite_count",
          "in_reply_to_user_id", "in_reply_to_user_id_str", "in_reply_to_screen_name", "display_text_range", "source",
          "timestamp_ms", "retweeted_status", "entities", "extended_entities", "delete", "truncated", "is_quote_status",
          "extended_tweet", "filter_level", "possibly_sensitive", "quoted_status_id", "quoted_status_id_str",
          "quoted_status", "quoted_status_permalink", "TR", "DE", "withheld_in_countries"]

special = ["truncated", "is_quote_status", "extended_tweet"]

keep = ["id", "text", "lang", "place"]

other = []

for tweet in tweets:
    for key in tweet.keys():
        if key not in nokeep+keep:
            check = "withheld_in_countries"
            if key==check and tweet[check]!=False:
                print(tweet[check])
            if key not in other:
                other.append(key)
            
print(other)

[]
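
The same check can be expressed with set operations. This is a sketch under the assumption that each tweet is a plain dictionary; `unexpected_keys` is a hypothetical helper, and it leaves out the `withheld_in_countries` printout from the cell above.

```python
def unexpected_keys(tweets, known_keys):
    """Return the sorted metadata keys that are not listed in known_keys."""
    known = set(known_keys)
    found = set()
    for tweet in tweets:
        # dict.keys() supports set difference directly.
        found.update(tweet.keys() - known)
    return sorted(found)

sample = [
    {"id": 1, "text": "hola", "lang": "es"},
    {"id": 2, "text": "adios", "lang": "es", "new_field": 0},
]
print(unexpected_keys(sample, ["id", "text", "lang", "place"]))
```

With the notebook's lists it would be called as `unexpected_keys(tweets, nokeep + keep)`.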


Here we save the Tweets in a `XX.tsv` file.

In [4]:
df = pd.DataFrame(tweets)
df = df.set_index("id")    
df = df.fillna("")
df.to_csv(path+"01.tsv", sep="\t")
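
If the tsv file will be read by a tool that simply splits each line on tabs, tab, newline and backslash characters inside the Tweet text can break it. One possible way to guard against that is to escape them before saving. This is a sketch, not part of the original notebook: `escape_for_tsv` is a hypothetical helper, and the escaping convention is an assumption that whatever reads the file would have to share.

```python
def escape_for_tsv(text):
    """Escape characters that could break a naive tab-separated reader."""
    return (
        text.replace("\\", "\\\\")  # escape backslashes first
            .replace("\t", "\\t")
            .replace("\n", "\\n")
            .replace("\r", "\\r")
    )

print(escape_for_tsv("one\ttwo\nthree"))
```

It could be applied with `df["text"] = df["text"].map(escape_for_tsv)` before calling `to_csv`.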

This other part filters out Retweets and then saves the remaining Tweets into a `XX_clean.tsv` file.

In [6]:
# After fillna(""), Tweets that are not Retweets have an empty "retweeted_status".
df_clean = df.loc[df["retweeted_status"]==""]
df_clean.to_csv(path+"01_clean.tsv", sep="\t")

If both numbers are equal, that means that there were no Retweets in that day's data. Depending on the language that you are dealing with, that might be highly unlikely.

In [7]:
print(df.shape[0])
print(df_clean.shape[0])

1026894
418106
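
The Retweets can also be dropped from the raw tweet dictionaries before the DataFrame is built. This is an alternative sketch rather than the notebook's method: it assumes that a Retweet is any entry carrying a `retweeted_status` field, and `drop_retweets` is a hypothetical helper.

```python
def drop_retweets(tweets):
    """Keep only the tweets that have no "retweeted_status" field."""
    return [tweet for tweet in tweets if "retweeted_status" not in tweet]

sample = [
    {"id": 1, "text": "original"},
    {"id": 2, "text": "RT original", "retweeted_status": {"id": 1}},
]
originals = drop_retweets(sample)
print(len(sample), len(originals))
```

Comparing `len(tweets)` with `len(drop_retweets(tweets))` gives the same kind of sanity check as the two numbers above.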
