# Converting JSON data to txt

We start by importing the libraries and setting the pandas options we want.

In [92]:
import glob
import json
import pandas as pd
from datetime import datetime as dt
import re
from IPython.display import Image

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None

The JSON file produced by twarc2 contains multiple items, one per 100 tweets, each stored on its own line. In this case there are two; we read them and put them in a list.

In [93]:
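tweets_data = []

# Each line of the file is one JSON item (a page of up to 100 tweets).
# The with-block closes the file even if parsing fails.
with open(r'felizmiercoles.json1', encoding='utf-8') as tweets_file:
    for line in tweets_file:
        tweet = json.loads(line)
        tweets_data.append(tweet)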

Let's check that our JSON has 2 items, since we collected 200 tweets.

In [94]:
len(tweets_data)

2
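
Counting items only tells us how many pages we loaded. A minimal sketch of how the tweets themselves could be counted, using a made-up stand-in for `tweets_data` (the name `sample_pages` and its contents are assumptions for illustration):

```python
# Made-up stand-in for tweets_data: two pages holding 3 and 2 tweets.
sample_pages = [
    {"data": [{"id": "1"}, {"id": "2"}, {"id": "3"}]},
    {"data": [{"id": "4"}, {"id": "5"}]},
]

# Number of tweets in each page; a page without 'data' counts as empty.
tweets_per_page = [len(page.get("data", [])) for page in sample_pages]
total_tweets = sum(tweets_per_page)

print(tweets_per_page, total_tweets)
```

Running the same two lines over `tweets_data` would show whether every page is full or the last one is shorter.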

### Example of the JSON structure

For this example I took a single tweet from the hashtag #estoesunaprueba123 to make it easier to visualize. If we had more than 100 tweets, the dictionary opened by the first bracket would contain a list of items.

In [95]:
Image(url="Imagenes/json_tree.png", width=600,height=600)

### Extracting the data

We loop over the list of items in the JSON, pull each piece of information out of every tweet, and store it in a variable. At the end of each iteration the loop appends everything to `list_tweets_flattered`, which starts as an empty list and gets filled with other lists, one per tweet.

In [96]:
list_tweets_flattered = []

for item in tweets_data:
    list_tweets = item['data']
    for tweet in list_tweets:
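        # Defaults for an original tweet. Resetting them on every iteration
        # keeps replies and quotes from inheriting the previous tweet's
        # retweeted user.
        type = 'tweet'
        user_retweeted = None
        if 'referenced_tweets' in tweet:
            type = tweet['referenced_tweets'][0]['type']
            if type == 'retweeted':
                # The text of a retweet starts with "RT @user", so the first
                # mention is taken as the retweeted account.
                user_retweeted = tweet['entities']['mentions'][0]['username']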
        id_user = tweet['author_id']
        date = tweet['created_at']
        source = tweet['source']
        text = tweet['text']
        reply_settings = tweet['reply_settings']
        id_tweet = tweet['id']
        retweet_count = tweet['public_metrics']['retweet_count']
        reply_count = tweet['public_metrics']['reply_count']
        like_count = tweet['public_metrics']['like_count']
        quote_count = tweet['public_metrics']['quote_count']
        lang = tweet['lang']

        list_tweets_flattered.append([type,user_retweeted,id_user,date,source,text,reply_settings,id_tweet,retweet_count,
                                        reply_count,like_count,quote_count,lang])
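
As an alternative to assigning every nested field by hand, pandas can flatten nested dictionaries on its own. The following is only a sketch on made-up tweets (`sample_tweets` and its values are assumptions); list-valued fields such as `referenced_tweets` are not expanded by `pd.json_normalize`, so the relation logic above would still need the loop.

```python
import pandas as pd

# Two made-up tweets with the same nesting as the twarc2 output.
sample_tweets = [
    {"id": "1", "lang": "es", "public_metrics": {"retweet_count": 43, "like_count": 0}},
    {"id": "2", "lang": "und", "public_metrics": {"retweet_count": 7, "like_count": 1}},
]

# Nested dictionaries become dotted column names, e.g. 'public_metrics.retweet_count'.
flat = pd.json_normalize(sample_tweets)

# Rename the dotted columns to the names used in this notebook.
flat = flat.rename(columns={
    "public_metrics.retweet_count": "retweet count",
    "public_metrics.like_count": "like count",
})

print(sorted(flat.columns))
```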

We convert that list of lists into a dataframe.

In [97]:
tweets_df = pd.DataFrame(list_tweets_flattered,columns=['relation','user retweeted','id user','date','app','text','reply settings',
'id tweet','retweet count','reply count','like count','quote count','lang'])
tweets_df.head(2)

Unnamed: 0,relation,user retweeted,id user,date,app,text,reply settings,id tweet,retweet count,reply count,like count,quote count,lang
0,retweeted,Elangel_ven0,1280509194534346752,2022-10-19T15:15:26.000Z,Twitter Web App,RT @Elangel_ven0: YO♥️CUMANA\n🚩 Tras las lluvias acaecidas en #Cumana el alcalde 👉 .@lossifontes realizo el acercamiento a la comunidad de E…,everyone,1582752287495458817,43,0,0,0,es
1,retweeted,sergiofimbres,795089089,2022-10-19T15:15:25.000Z,Twitter for Android,RT @sergiofimbres: 📰 Segob vs ego (@osvaldomonos) \n#FelizMiercoles https://t.co/mzLhRFzLew,everyone,1582752285595029505,7,0,0,0,es


We do the same with the user information.

In [98]:
list_users_flattered = []

for item in tweets_data:
    list_users = item['includes']['users']
    for user in list_users:
        # 'location' is optional; .get() returns None when it is missing.
        location = user.get('location')
        username = user['username']
        verified = user['verified']
        name = user['name']
        id_user = user['id']
        profile_image_url = user['profile_image_url']
        followers_count = user['public_metrics']['followers_count']
        following_count = user['public_metrics']['following_count']
        tweet_count = user['public_metrics']['tweet_count']
        listed_count = user['public_metrics']['listed_count']
        bio = user['description']
        created_at = user['created_at']
        # Expanded profile URLs, when the user has set one.
        if 'url' in user:
            url_list = user['entities']['url']['urls']
            url_end = [url_unique['expanded_url'] for url_unique in url_list]
        else:
            url_end = None

        list_users_flattered.append([location,username,verified,name,id_user,profile_image_url,
        followers_count,following_count,tweet_count,listed_count,bio,created_at,url_end])

In [99]:
users_df = pd.DataFrame(list_users_flattered,columns=['location','author','verified','name',
'id user','profile image url','followers','following','tweets','listed',
'description','created_at','urls'])
users_df.head(1)

Unnamed: 0,location,author,verified,name,id user,profile image url,followers,following,tweets,listed,description,created_at,urls
0,Apure,ShavelaHermosa,False,♥️⋆ 𝓒𝓱𝓪𝓿𝓮𝓵𝓪 𝓐𝓷𝓭𝓻𝓮𝓲𝓷𝓪 ⋆ ♥️,1280509194534346752,https://pbs.twimg.com/profile_images/1539401404896378880/X5TJ6zo2_normal.jpg,4402,3183,73680,3,"Llanera, me gusta viajar y conocer mi país 🇻🇪\n♫ ..En el fogón de tus brazos cocíname a fuego \nlento pàque mi alma se sancoche\n con el calor de tu cuerpo.♫",2020-07-07T14:29:32.000Z,


We do the same with the date on which the tweets were retrieved.

In [100]:
retrieved_at = tweets_data[0]['__twarc']['retrieved_at']

We add the user information and the retrieval date to the tweets dataframe.

In [101]:
merged = tweets_df.merge(users_df,how='left',on='id user')
merged['retrieved at'] = retrieved_at
merged.head(1)

Unnamed: 0,relation,user retweeted,id user,date,app,text,reply settings,id tweet,retweet count,reply count,like count,quote count,lang,location,author,verified,name,profile image url,followers,following,tweets,listed,description,created_at,urls,retrieved at
0,retweeted,Elangel_ven0,1280509194534346752,2022-10-19T15:15:26.000Z,Twitter Web App,RT @Elangel_ven0: YO♥️CUMANA\n🚩 Tras las lluvias acaecidas en #Cumana el alcalde 👉 .@lossifontes realizo el acercamiento a la comunidad de E…,everyone,1582752287495458817,43,0,0,0,es,Apure,ShavelaHermosa,False,♥️⋆ 𝓒𝓱𝓪𝓿𝓮𝓵𝓪 𝓐𝓷𝓭𝓻𝓮𝓲𝓷𝓪 ⋆ ♥️,https://pbs.twimg.com/profile_images/1539401404896378880/X5TJ6zo2_normal.jpg,4402,3183,73680,3,"Llanera, me gusta viajar y conocer mi país 🇻🇪\n♫ ..En el fogón de tus brazos cocíname a fuego \nlento pàque mi alma se sancoche\n con el calor de tu cuerpo.♫",2020-07-07T14:29:32.000Z,,2022-10-19T15:15:40+00:00
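
One plausible source of duplicated rows after this merge is that the same user can appear more than once in the users dataframe, for example once per page. The sketch below uses made-up frames (`tweets`, `users` and their values are assumptions) to show the effect, and how dropping repeated users first keeps one row per tweet.

```python
import pandas as pd

# Made-up data: user u1 is listed twice, as if it appeared in two pages.
tweets = pd.DataFrame({"id tweet": ["t1", "t2"], "id user": ["u1", "u2"]})
users = pd.DataFrame({"id user": ["u1", "u2", "u1"], "author": ["ana", "luis", "ana"]})

# Merging against the repeated users duplicates tweet t1.
naive = tweets.merge(users, how="left", on="id user")

# Keeping one row per user avoids the duplication; validate raises an
# error if the right-hand keys are not unique.
unique_users = users.drop_duplicates(subset="id user", keep="first")
clean = tweets.merge(unique_users, how="left", on="id user", validate="many_to_one")

print(len(naive), len(clean))
```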


We clean up the date columns.

In [102]:
# Turn '2022-10-19T15:15:26.000Z' into '2022-10-19 15:15:26'.
# regex=False makes every pattern a literal string, so '.' is not read as a regex wildcard.
merged['date'] = merged['date'].astype(str)
merged['date'] = merged['date'].str.replace('T', ' ', regex=False)
merged['date'] = merged['date'].str.replace('.000Z', '', regex=False)


In [103]:
# Same clean-up for the account creation date.
merged['created_at'] = merged['created_at'].astype(str)
merged['created_at'] = merged['created_at'].str.replace('T', ' ', regex=False)
merged['created_at'] = merged['created_at'].str.replace('.000Z', '', regex=False)


In [104]:
# Replace the 'T' separator and drop the trailing '+00:00' offset (6 characters).
merged['retrieved at'] = merged['retrieved at'].str.replace('T', ' ', regex=False).str[:-6]

We then convert the 3 date columns to datetime objects (date and time).

In [105]:
merged['date'] = merged['date'].apply(lambda x:dt.strptime(x, '%Y-%m-%d %H:%M:%S'))
merged['created_at'] = merged['created_at'].apply(lambda x:dt.strptime(x, '%Y-%m-%d %H:%M:%S'))
merged['retrieved at'] = merged['retrieved at'].apply(lambda x:dt.strptime(x, '%Y-%m-%d %H:%M:%S'))
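
The string clean-up plus `strptime` could also be replaced by a single call, since `pd.to_datetime` understands the raw ISO 8601 strings directly. This is a sketch on made-up raw values (the Series names are assumptions), not the approach used in this notebook:

```python
import pandas as pd

# Raw timestamps in the two shapes seen in the data (values for illustration).
raw_dates = pd.Series(["2022-10-19T15:15:26.000Z", "2011-05-11T02:05:17.000Z"])
raw_retrieved = pd.Series(["2022-10-19T15:15:40+00:00"])

# Parse as UTC, then drop the timezone to get naive datetimes.
dates = pd.to_datetime(raw_dates, utc=True).dt.tz_localize(None)
retrieved = pd.to_datetime(raw_retrieved, utc=True).dt.tz_localize(None)

print(dates.iloc[0], retrieved.iloc[0])
```

Each column is parsed separately because the two timestamp shapes differ.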

We remove the line breaks from the long text columns.

In [106]:
merged['text'] = merged['text'].str.replace('\n',' ')
merged['description'] = merged['description'].str.replace('\n',' ')
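
Because the final file is tab-separated, tabs and carriage returns inside a text can be as awkward as line breaks. A hedged sketch of a broader clean-up that collapses any run of whitespace into one space, shown on made-up strings (`texts` is an assumption):

```python
import pandas as pd

# Made-up texts containing a newline, a tab, a Windows line ending and a missing value.
texts = pd.Series(["line one\nline two", "col\tsep\r\nend", None])

# \s+ matches any run of whitespace; missing values stay missing.
cleaned = texts.str.replace(r"\s+", " ", regex=True).str.strip()

print(cleaned.tolist())
```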

We sort the dataframe from the oldest tweets to the most recent.

In [107]:
merged = merged.sort_values(by='date',ascending=True)
merged.head(2)

Unnamed: 0,relation,user retweeted,id user,date,app,text,reply settings,id tweet,retweet count,reply count,like count,quote count,lang,location,author,verified,name,profile image url,followers,following,tweets,listed,description,created_at,urls,retrieved at
215,tweet,,296592711,2022-10-19 15:06:43,Twitter Web App,12 #FelizMiercoles https://t.co/2k3ETNfzng,everyone,1582750095019499520,0,0,0,0,und,CDMX,jramiroMX,False,CHILANGO,https://pbs.twimg.com/profile_images/1311646474984476672/LjlH6TKT_normal.jpg,188356,209256,525227,345,"Que nadie manipule tu criterio, ¿como puedes evitarlo?, solo adquiriendo el buen hábito de estar bien informado, no existe otra forma",2011-05-11 02:05:17,,2022-10-19 15:15:40
214,retweeted,INAMEH,1582023931250282497,2022-10-19 15:06:50,Twitter for Android,"RT @INAMEH: así mismo, nubosidad fragmentada en áreas del Táchira, Barinas, Lara, Yaracuy, este de Miranda, Anzoátegui y Monagas; el resto…",everyone,1582750125528875009,127,0,0,0,es,,carmen21463660,False,carmen,https://pbs.twimg.com/profile_images/1582024161647550466/FOKc72l5_normal.jpg,0,40,533,0,,2022-10-17 15:01:30,,2022-10-19 15:15:40


We arrange the columns to our liking.

In [108]:
merged = merged[['id tweet','date','author','text','app','id user','followers','following','tweets','location',
'urls','name','description','relation','user retweeted','lang','created_at']]
merged.head(1)

Unnamed: 0,id tweet,date,author,text,app,id user,followers,following,tweets,location,urls,name,description,relation,user retweeted,lang,created_at
215,1582750095019499520,2022-10-19 15:06:43,jramiroMX,12 #FelizMiercoles https://t.co/2k3ETNfzng,Twitter Web App,296592711,188356,209256,525227,CDMX,,CHILANGO,"Que nadie manipule tu criterio, ¿como puedes evitarlo?, solo adquiriendo el buen hábito de estar bien informado, no existe otra forma",tweet,,und,2011-05-11 02:05:17


We remove any duplicates that may have been generated.

In [109]:
merged = merged.drop_duplicates(subset='id tweet',keep='first')

We save it as a tab-separated .txt file, which is the format t-hoarder uses.

In [110]:
merged.to_csv('felizmiercoles.txt',index=False,sep='\t')
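
To check that a tab-separated file reads back as expected, a round trip through an in-memory buffer is enough. The sketch below uses a one-row made-up frame (`sample` is an assumption) and reads the tweet id back as text, so that the long identifier is kept exactly as written.

```python
import io
import pandas as pd

# One made-up row with a long tweet id stored as a string.
sample = pd.DataFrame({
    "id tweet": ["1582750095019499520"],
    "text": ["12 #FelizMiercoles"],
    "followers": [188356],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False, sep="\t")
buffer.seek(0)

# Read the id column back as str; other columns are inferred.
restored = pd.read_csv(buffer, sep="\t", dtype={"id tweet": str})

print(restored.dtypes)
```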