# Filtering

This notebook filters the raw Twitter data.

It implements the filtering stage described in the Design Document.

## input:
- raw JSON Twitter data (gzip-compressed files)

## output:
- a JSON Lines file with user name, screen name, and tweet language
- (later on, maybe include other metadata such as verified status and creation date)
- names that are obviously wrong, such as names consisting only of emojis, are filtered out


In [1]:
# the language we want to capture
language = 'es'

In [2]:
# rows (one dict per tweet) that will be turned into the DataFrame
df_list = []

# directory containing the gzip-compressed Twitter data
data_path = "../data/"

In [3]:
%%time
import gzip
import json
import os

# read every .gz file in data_path and collect the fields we need from each tweet
for file_name in os.listdir(data_path):
    if file_name.endswith(".gz"):
        file_path = os.path.join(data_path, file_name)
        print(file_path)
        with gzip.open(file_path) as f:
            for line in f:
                json_line = json.loads(line)
                filtered_dict = {
                    "screen_name": json_line["user"]["screen_name"],
                    "username": json_line["user"]["name"],
                    "language": json_line["lang"]
                }
                df_list.append(filtered_dict)

../data/stream-2021-03-11T14_18_25.811596.gz
../data/stream-2021-03-11T13_43_30.827953.gz
../data/stream-2021-03-11T13_08_08.830007.gz
CPU times: user 47.5 s, sys: 545 ms, total: 48 s
Wall time: 48.5 s
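
The loop above indexes `json_line["user"]` directly, which raises `KeyError` on any record without a `user` object. Streaming captures can contain such records (for example delete notices), although the files used here evidently did not. A minimal sketch of a more defensive extraction step follows; `extract_user_fields` is a hypothetical helper, not part of the original notebook, and it assumes that skipping non-tweet records is acceptable.

```python
import json


def extract_user_fields(raw_line):
    """Return the three fields used in this notebook, or None for non-tweet records.

    Hypothetical helper: assumes records without a "user" object can be skipped.
    """
    try:
        record = json.loads(raw_line)
    except ValueError:  # malformed or empty line
        return None
    if not isinstance(record, dict):
        return None
    user = record.get("user")
    if not isinstance(user, dict):
        return None
    return {
        "screen_name": user.get("screen_name"),
        "username": user.get("name"),
        "language": record.get("lang"),
    }
```

Inside the reading loop, a row would then be appended only when the helper returns something other than `None`.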


In [4]:
# build a DataFrame from the collected rows
import pandas as pd
twitter_data = pd.DataFrame(df_list)

In [5]:
# keep only the tweets written in the target language
language_dataframe = twitter_data[twitter_data["language"] == language]

In [6]:
language_dataframe.head(10)

index,language,screen_name,username
1,es,iREKINISTA,ᷥᤣ🍒ᬼૢ ཻུ۪۪˚⁺TAREASTAREASTAREAS♿- REKIBESTB0Y .
14,es,moonslightz,yas | cr: o triste fim de policarpo quaresma
29,es,nataliaaaaaaamm,Natssss
31,es,vxntegogh,❥•MAFERᴮᴱ🐯🎨
47,es,indanitem_,INDANI
54,es,AlexaNioDelSol1,Vane Alexa 👑💔🚪🏃🏃mi papá no me quiere 😭
86,es,mauriciopera96,Mauricio Peralta
92,es,giawtteoyl,︎ ︎︎tini
94,es,Virgini05227696,Queen
120,es,cattrina23,catrinaa🥀


In [7]:
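# remove emojis
import re

# The ranges are written as UTF-16 surrogate pairs, so they match only where
# strings are stored as UTF-16 code units (e.g. narrow Python 2 builds).
# On Python 3 an emoji is a single code point and this pattern matches nothing.
EMOJI_PATTERN = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"  # emoticons
    u"(\ud83c[\udf00-\uffff])|"  # symbols & pictographs (1 of 2)
    u"(\ud83d[\u0000-\uddff])|"  # symbols & pictographs (2 of 2)
    u"(\ud83d[\ude80-\udeff])|"  # transport & map symbols
    u"(\ud83c[\udde0-\uddff])"  # flags (iOS)
    "+", flags=re.UNICODE)

def deEmojify(text):
    return EMOJI_PATTERN.sub(u'', text)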

# assign() returns a new DataFrame, so nothing is written into a slice of twitter_data
language_dataframe = language_dataframe.assign(
    username=language_dataframe["username"].apply(deEmojify)
)
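
The pattern in `deEmojify` is written as UTF-16 surrogate pairs, so whether it matches depends on how the interpreter stores characters outside the Basic Multilingual Plane. On Python 3 each emoji is a single code point, and the same five blocks can be written as code point ranges instead. A minimal sketch, assuming Python 3; `EMOJI_CODEPOINTS` and `strip_emoji` are hypothetical names that do not appear in the original notebook.

```python
import re

# The same blocks as the surrogate-pair pattern, expressed as code points.
EMOJI_CODEPOINTS = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "]+"
)


def strip_emoji(text):
    """Remove every character that falls in the blocks listed above."""
    return EMOJI_CODEPOINTS.sub("", text)
```

Like the original pattern, this covers only these blocks: characters elsewhere, such as ♿ (U+267F) or 🥀 (U+1F940), are left in place.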


In [8]:
# drop rows whose username is now empty, i.e. names that consisted only of emojis
language_dataframe = language_dataframe[language_dataframe.username != '']
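
Comparing against the empty string only catches names that were nothing but removed emojis. A name such as two emojis separated by a space becomes `' '`, which still passes the check. A stricter test, sketched here under the assumption that a usable name must contain at least one letter, could look like this; `has_letters` is a hypothetical helper, not part of the original notebook.

```python
def has_letters(name):
    """Return True if the name contains at least one alphabetic character.

    Hypothetical helper: treats whitespace-only, digit-only, and
    symbol-only names as unusable.
    """
    return any(ch.isalpha() for ch in name)
```

The helper could be used as a boolean mask, for example `language_dataframe[language_dataframe["username"].apply(has_letters)]`.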

In [9]:
language_dataframe["username"].head(1)

1    ᷥᤣᬼૢ ཻུ۪۪˚⁺TAREASTAREASTAREAS♿- REKIBESTB0Y .
Name: username, dtype: object

In [10]:
# drop_duplicates() returns a new DataFrame, so its result has to be assigned back
language_dataframe = (
    language_dataframe.dropna().drop_duplicates().reset_index(drop=True)
)

In [11]:
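# save the filtered data to disk as JSON Lines (one record per line)
try:
    os.mkdir('filtered')
except OSError:
    # only swallow the error when the folder really exists
    if not os.path.isdir('filtered'):
        raise
    print('filtered folder already created')
language_dataframe.to_json(
    os.path.join('filtered', language + '_language_filtered.json'),
    orient="records", lines=True)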

filtered folder already created
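
With `orient="records"` and `lines=True`, the saved file holds one JSON object per line. A later stage could read it back without pandas along these lines; this is a sketch assuming Python 3, and `read_json_lines` is a hypothetical helper, not part of the original notebook.

```python
import json


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `list(read_json_lines('filtered/es_language_filtered.json'))` would return the filtered records as a list of dicts.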
