In [1]:
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import os
from IPython.display import clear_output

------
<a id="indice"></a>

# Table of Contents

1. **[Gathering](#coleta)**
    * [Twitter Archive](#coleta:twitter-archive)
    * [Image Prediction](#coleta:image-prediction)
    * [Twitter API](#coleta:twitter-api)


2. **[Assessment](#avaliacao)**
    * [Twitter Archive](#avaliacao:twitter-archive)
    * [Image Prediction](#avaliacao:image-prediction)
    * [Twitter API](#avaliacao:twitter-api)
    * [Assessment notes](#avaliacao:anotacoes)


3. **[Cleaning](#limpeza)**
    * [Definition](#limpeza:definicao)
    * [Twitter Archive](#limpeza:twitter-archive)
    * [Image Prediction](#limpeza:image-prediction)
    * [Twitter API](#limpeza:twitter-api)


4. **[Storage](#armazenamento)**

------
<a id="coleta"></a>

# Gathering

<a id="coleta:twitter-archive"></a>

## Gathering: Twitter archive

In [2]:
df_twitter_arc = pd.read_csv('data/twitter-archive-enhanced.csv')
df_twitter_arc.sample(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2043,671536543010570240,,,2015-12-01 03:49:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Reginald. He's pondering what life wou...,,,,https://twitter.com/dog_rates/status/671536543...,9,10,Reginald,,,,
745,780092040432480260,,,2016-09-25 17:10:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Hank. He's mischievous ...,7.533757e+17,4196984000.0,2016-07-13 23:48:51 +0000,https://twitter.com/dog_rates/status/753375668...,8,10,Hank,,,,


<a id="coleta:image-prediction"></a>

## Gathering: Image prediction

In [3]:
from io import StringIO

df_prediction = None

r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

# Compare with ==, not `is`: identity checks on integers are not reliable
if r.status_code == 200:
    # io.StringIO replaces pd.compat.StringIO, which recent pandas releases no longer provide
    df_prediction = pd.read_csv(StringIO(r.text), sep='\t')
else:
    print('ERROR: Image prediction request returned {status_code} status code.'.format(status_code=r.status_code))

In [4]:
df_prediction.sample(2)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2044,886258384151887873,https://pbs.twimg.com/media/DEyfTG4UMAE4aE9.jpg,1,pug,0.943575,True,shower_cap,0.025286,False,Siamese_cat,0.002849,False
1058,714957620017307648,https://pbs.twimg.com/media/CewKKiOWwAIe3pR.jpg,1,Great_Pyrenees,0.251516,True,Samoyed,0.139346,True,kuvasz,0.129005,True


<a id="coleta:twitter-api"></a>

## Gathering: Twitter API

In [5]:
# Load the Twitter App configuration
with open('twitter_config.json', 'r', encoding='utf-8') as file:
    app_config = json.load(file)

In [6]:
# Assign the configuration values to local variables
api_key = app_config['api_key']
api_secret = app_config['api_secret']
access_token = app_config['access_token']
access_secret = app_config['access_secret']

In [7]:
# Authenticate with the Twitter App
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

Fetch the tweet history data from the API

In [8]:
json_path = 'data/tweet_json.txt'
error_log = 'data/tweet_error.log'

# Save tweets locally
if not os.path.isfile(json_path):
    print('Please wait. Given the number of requests, gathering the tweets may take a few minutes.')
    total = df_twitter_arc.tweet_id.size
    # `++tweet_count` does not increment in Python, so enumerate keeps the counter instead
    for tweet_count, tweet_id in enumerate(df_twitter_arc.tweet_id, start=1):
        print('{percent}%'.format(percent=int((tweet_count / total) * 100)))
        try:
            status = api.get_status(tweet_id)
            with open(json_path, 'a', newline='\n') as file:
                file.write(f'{json.dumps(status._json)}\n')

        except Exception as err:
            # Log the failed ID and keep going with the remaining tweets
            with open(error_log, 'a', newline='\n') as log:
                log.write(f'{tweet_id}: {err.args[0]}\n')
            print(f'{tweet_id}: {err.args[0]}')

        clear_output(wait=True)
else:
    print('Data already saved to disk; no new requests will be made to the Twitter API.')

Data already saved to disk; no new requests will be made to the Twitter API.


In [9]:
# Build a list of dictionaries from the loaded tweets
tweets = []

with open(json_path, 'r') as file:
    for line in file:
        try:
            tweet = json.loads(line)
            
            if(tweet.get('entities', False)):
                if(tweet['entities'].get('media', False) and tweet['entities']['media'][0].get('media_url', False)):
                    tweets.append({
                        'id': tweet['id'],
                        'created_at': tweet['created_at'],
                        'in_reply_to_status_id': tweet['in_reply_to_status_id'],
                        'in_reply_to_user_id': tweet['in_reply_to_user_id'],
                        'is_quote_status': tweet['is_quote_status'],
                        'retweet_count': tweet['retweet_count'],
                        'favorite_count': tweet['favorite_count'],
                        'media_url': tweet['entities']['media'][0]['media_url'],
                        'retweeted': tweet['retweeted'],
                        'favorited': tweet['favorited']
                    })
            
            
        except Exception as e:
            print(e) 
            
        

In [10]:
# Create a dataframe from the tweets fetched through the API
columns = tweets[0].keys()
df_tweets_api = pd.DataFrame(tweets, columns = columns)

------
<a id="avaliacao"></a>

# Assessment

<a id="avaliacao:twitter-archive"></a>

## Assessment: Twitter data archive

In [11]:
 df_twitter_arc.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2164,669371483794317312,,,2015-11-25 04:26:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oliviér. He's a Baptist Hindquarter. A...,,,,https://twitter.com/dog_rates/status/669371483...,10,10,Oliviér,,,,
2122,670403879788544000,,,2015-11-28 00:48:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Nigel. He accidentally popped his ball...,,,,https://twitter.com/dog_rates/status/670403879...,10,10,Nigel,,,,
377,828011680017821696,,,2017-02-04 22:45:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Brutus and Jersey. They think the...,,,,https://twitter.com/dog_rates/status/828011680...,11,10,Brutus,,,,
2131,670086499208155136,,,2015-11-27 03:47:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""Hi yes this is dog. I can't help with that s-...",,,,https://twitter.com/dog_rates/status/670086499...,10,10,,,,,
1062,741099773336379392,,,2016-06-10 02:48:49 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",This is Ted. He's given up. 11/10 relatable af...,,,,https://vine.co/v/ixHYvdxUx1L,11,10,Ted,,,,


In [12]:
df_twitter_arc.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [13]:
df_twitter_arc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

<a id="twitter-arc:info"></a>
* Incomplete data: only 2356 of the 5000 advertised records are present
* `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id` and `retweeted_status_user_id` are stored as `float`
* `timestamp` and `retweeted_status_timestamp` are stored as `string`
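
The `float` storage noted above is more than a cosmetic issue: tweet IDs are larger than 2**53, the limit up to which `float64` can represent every integer exactly. A minimal sketch of the effect, using a tweet ID that appears in the image-prediction sample earlier in this notebook:

```python
# Tweet IDs are 18-digit integers, beyond the exact-integer range of float64
tweet_id = 886258384151887873

as_float = float(tweet_id)     # what a float64 column would hold
round_trip = int(as_float)     # converting back does not recover the ID

print(2 ** 53)                 # largest N such that all integers up to N are exact in float64
print(round_trip == tweet_id)  # False: the low-order digits were rounded away
```

This suggests that ID columns with missing values would ideally be read as strings or integers from the start, since a later cast from `float` cannot restore the lost digits.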

In [14]:
# Check for duplicate IDs
df_twitter_arc[df_twitter_arc.tweet_id.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


<a id="twitter-arc:source"></a>

In [15]:
# Show sample values from the `source` column
df_twitter_arc.loc[:, 'source'].sample(10)

2034    <a href="http://twitter.com/download/iphone" r...
1476    <a href="http://twitter.com/download/iphone" r...
1292    <a href="http://twitter.com/download/iphone" r...
910     <a href="http://twitter.com/download/iphone" r...
356     <a href="http://twitter.com/download/iphone" r...
1750    <a href="http://vine.co" rel="nofollow">Vine -...
1689    <a href="http://twitter.com/download/iphone" r...
552     <a href="http://twitter.com/download/iphone" r...
1451    <a href="http://twitter.com/download/iphone" r...
2158    <a href="http://twitter.com/download/iphone" r...
Name: source, dtype: object

The `source` variable is a link to the application used to post the tweet; it adds no value to the observational unit

<a id="twitter-arc:name"></a>

In [16]:
# Show sample values from the `name` column
df_twitter_arc.name.sample(10)

1665      Taco
517     Hunter
1413      None
1371      None
1673      Todo
1044     Stark
152       Dave
498       None
1065      None
2154     Penny
Name: name, dtype: object

In [17]:
df_twitter_arc[df_twitter_arc.name == 'None'].name.count()

745

* Names (`name`) are filled with the literal string `'None'`, presumably standing for a missing value

<a id="twitter-arc:category"></a>

In [18]:
# Show sample values from the columns 'doggo' through 'puppo'
df_twitter_arc.loc[:,'doggo':].sample(10)

Unnamed: 0,doggo,floofer,pupper,puppo
2142,,,,
1267,,,,
914,doggo,,,
1685,,,pupper,
952,,,,
1588,,,,
2092,,,,
420,,,,
1065,,,,
1432,,,,


* Missing values are stored as the literal string `'None'`

One variable is spread across several columns

In [19]:
# Check the pattern used to fill in the classifications
df_twitter_arc.loc[:,'doggo':].nunique()

doggo      2
floofer    2
pupper     2
puppo      2
dtype: int64

In [20]:
df_twitter_arc.loc[:, 'doggo':'puppo'].isna().sum()

doggo      0
floofer    0
pupper     0
puppo      0
dtype: int64

<a id="twitter-arc:expanded_urls"></a>

Sample values from the `expanded_urls` column

In [21]:
df_twitter_arc.loc[:, 'expanded_urls'].sample(20)

1117    https://twitter.com/dog_rates/status/732375214...
409                                                   NaN
929     https://twitter.com/dog_rates/status/754482103...
372     https://twitter.com/dog_rates/status/828381636...
1669    https://twitter.com/dog_rates/status/682429480...
2306    https://twitter.com/dog_rates/status/666835007...
1316    https://twitter.com/dog_rates/status/706644897...
2349    https://twitter.com/dog_rates/status/666051853...
601     https://twitter.com/dog_rates/status/671896809...
1066    https://twitter.com/dog_rates/status/740699697...
2205    https://twitter.com/dog_rates/status/668633411...
1402    https://twitter.com/dog_rates/status/699423671...
1582    https://twitter.com/dog_rates/status/687109925...
1080                                                  NaN
212     https://twitter.com/eddie_coe98/status/8482893...
1252    https://twitter.com/dog_rates/status/710844581...
2284    https://twitter.com/dog_rates/status/667192066...
1438    https:

In [22]:
df_twitter_arc.loc[:, 'expanded_urls'].isna().value_counts()

False    2297
True       59
Name: expanded_urls, dtype: int64

* The `expanded_urls` values are URLs to the original tweets. They add no value to the observational unit

------
<a id="avaliacao:image-prediction"></a>

## Assessment: Image prediction

In [23]:
df_prediction.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


<a id="image-prediction:info"></a>

In [24]:
df_prediction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [25]:
df_prediction.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1140,729823566028484608,https://pbs.twimg.com/media/CiDap8fWEAAC4iW.jpg,1,kelpie,0.218408,True,Arabian_camel,0.114368,False,coyote,0.096409,False
1269,749774190421639168,https://pbs.twimg.com/media/Cme7pg2XEAATMnP.jpg,1,Pekinese,0.879012,True,Chihuahua,0.054855,True,Blenheim_spaniel,0.021041,True
1932,859196978902773760,https://pbs.twimg.com/ext_tw_video_thumb/85919...,1,Angora,0.224218,False,malamute,0.216163,True,Persian_cat,0.128383,False
1851,840370681858686976,https://pbs.twimg.com/media/C6mYrK0UwAANhep.jpg,1,teapot,0.981819,False,cup,0.014026,False,coffeepot,0.002421,False
2038,884876753390489601,https://pbs.twimg.com/media/DEe2tZXXkAAwyX3.jpg,1,chow,0.822103,True,Norwich_terrier,0.106075,True,Norfolk_terrier,0.037348,True
826,693280720173801472,https://pbs.twimg.com/media/CZ8HIsGWIAA9eXX.jpg,1,Labrador_retriever,0.340008,True,bull_mastiff,0.175316,True,box_turtle,0.164337,False
1875,845677943972139009,https://pbs.twimg.com/media/C7xzmngWkAAAp9C.jpg,1,chow,0.808681,True,groenendael,0.123141,True,Newfoundland,0.022143,True
2066,890609185150312448,https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg,1,Irish_terrier,0.487574,True,Irish_setter,0.193054,True,Chesapeake_Bay_retriever,0.118184,True
1375,763103485927849985,https://pbs.twimg.com/media/CpcWknPXYAAeLP9.jpg,2,seat_belt,0.685821,False,ice_bear,0.081597,False,chow,0.039085,True
1244,747461612269887489,https://pbs.twimg.com/media/Cl-EXHSWkAE2IN2.jpg,1,binoculars,0.192717,False,barbershop,0.085838,False,ballplayer,0.084672,False


<a id="image-prediction:p-values"></a>

The values of the variables `p1`, `p2` and `p3` are not standardized

In [26]:
# Show sample values from the columns `p1`, `p2` and `p3`
df_prediction.loc[:, ['p1', 'p2', 'p3']].sample(5)

Unnamed: 0,p1,p2,p3
1687,dalmatian,boxer,American_Staffordshire_terrier
1453,Chihuahua,dalmatian,toy_terrier
1765,French_bulldog,Siamese_cat,cougar
1630,shield,barrel,sundial
70,Rottweiler,miniature_pinscher,black-and-tan_coonhound


<a id="image-prediction:duplicated-urls"></a>

Check for duplicate tweet IDs (`tweet_id`) and images (`jpg_url`)

In [27]:
# Check for duplicate IDs
df_prediction.tweet_id.nunique()

2075

Check for duplicates in the `jpg_url` column

In [28]:
df_prediction.jpg_url.duplicated().value_counts()

False    2009
True       66
Name: jpg_url, dtype: int64

There are 66 duplicated images
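
To see which rows share an image, rather than only counting them, `duplicated(keep=False)` can be used to flag every member of a duplicated group. The sketch below runs on a toy frame standing in for `df_prediction`; the URLs are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for df_prediction (hypothetical URLs)
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'jpg_url': ['a.jpg', 'b.jpg', 'a.jpg', 'c.jpg'],
})

# keep=False marks every row of a duplicated group, not only the repeats
dupes = toy[toy.jpg_url.duplicated(keep=False)].sort_values(['jpg_url', 'tweet_id'])
print(dupes)
```

Listing the groups side by side makes it easier to check whether the repeated images come from retweets or from distinct tweets.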

In [29]:
# Validate the prediction confidence values
df_prediction.loc[:, ['p1_conf', 'p2_conf', 'p3_conf']].max()

p1_conf    1.000000
p2_conf    0.488014
p3_conf    0.273419
dtype: float64

------
<a id="avaliacao:twitter-api"></a>

## Assessment: Twitter API Requests

In [30]:
df_tweets_api.describe()

Unnamed: 0,id,in_reply_to_status_id,in_reply_to_user_id,retweet_count,favorite_count
count,1820.0,22.0,22.0,1820.0,1820.0
mean,7.23727e+17,6.992047e+17,4196984000.0,2512.144505,6820.872527
std,5.777841e+16,4.409222e+16,0.0,4891.076169,11900.984651
min,6.660209e+17,6.671522e+17,4196984000.0,11.0,0.0
25%,6.747671e+17,6.747625e+17,4196984000.0,536.0,1363.5
50%,7.008223e+17,6.799651e+17,4196984000.0,1118.0,3153.0
75%,7.617466e+17,7.032024e+17,4196984000.0,2526.25,7463.25
max,8.924206e+17,8.558181e+17,4196984000.0,82905.0,163034.0


<a id="tweets-api:info"></a>

In [31]:
df_tweets_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1820 entries, 0 to 1819
Data columns (total 10 columns):
id                       1820 non-null int64
created_at               1820 non-null object
in_reply_to_status_id    22 non-null float64
in_reply_to_user_id      22 non-null float64
is_quote_status          1820 non-null bool
retweet_count            1820 non-null int64
favorite_count           1820 non-null int64
media_url                1820 non-null object
retweeted                1820 non-null bool
favorited                1820 non-null bool
dtypes: bool(3), float64(2), int64(3), object(2)
memory usage: 104.9+ KB


In [32]:
df_tweets_api.sample(5)

Unnamed: 0,id,created_at,in_reply_to_status_id,in_reply_to_user_id,is_quote_status,retweet_count,favorite_count,media_url,retweeted,favorited
1443,673148804208660480,Sat Dec 05 14:35:56 +0000 2015,,,False,645,1706,http://pbs.twimg.com/media/CVeBQwiUsAAqhLw.jpg,False,False
505,755206590534418437,Tue Jul 19 01:04:16 +0000 2016,,,False,5762,17264,http://pbs.twimg.com/media/CnsIT0WWcAAul8V.jpg,False,False
1046,689623661272240129,Wed Jan 20 01:41:08 +0000 2016,,,False,678,2318,http://pbs.twimg.com/media/CZIJD2SWIAMJgNI.jpg,False,False
1036,690021994562220032,Thu Jan 21 04:03:58 +0000 2016,,,False,1095,2902,http://pbs.twimg.com/media/CZNzV6cW0AAsX7p.jpg,False,False
628,740214038584557568,Tue Jun 07 16:09:13 +0000 2016,,,False,2076,6945,http://pbs.twimg.com/media/CkXEu2OUoAAs8yU.jpg,False,False


* `created_at` is stored as a `string`
* `in_reply_to_status_id` and `in_reply_to_user_id` are stored as `float`

<a id="tweets-api:duplicated"></a>
Check for duplicate values

In [33]:
# Check for duplicate IDs
df_tweets_api.id.duplicated().value_counts()

False    1820
Name: id, dtype: int64

In [34]:
# Check for duplicate images
df_tweets_api.media_url.duplicated().value_counts()

False    1759
True       61
Name: media_url, dtype: int64

Some tweets reference the same image

------

<a id="avaliacao:anotacoes"></a>

## Assessment notes
### Quality

#### `df_twitter_arc` - Twitter archive
1. [Incomplete dataset: only **2356** of the **5000** advertised records are present](#twitter-arc:info)
2. [Column `source` adds no value to the observational unit](#twitter-arc:source)
3. [Missing names (`name`) are stored as the literal string `'None'`](#twitter-arc:name)
4. [Column `expanded_urls` adds no value to the observational unit](#twitter-arc:expanded_urls)
5. [Categories \[`doggo`, `floofer`, `pupper`, `puppo`\] use the literal string `'None'`](#twitter-arc:category)


#### `df_prediction` - Image prediction
1. [Predictions \[`p1`, `p2`, `p3`\] have non-standardized names](#image-prediction:p-values)
2. [Duplicate values in the `jpg_url` column](#image-prediction:duplicated-urls)


#### `df_tweets_api` - Twitter API
1. [Different tweets reference the same image (duplicates)](#tweets-api:duplicated)


### Tidiness

#### `df_twitter_arc` - Twitter archive
1. [The columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` are stored as `float64`](#twitter-arc:info)
2. [The columns `timestamp` and `retweeted_status_timestamp` are stored as `string`](#twitter-arc:info)
3. [One variable spread across several columns: `doggo`, `floofer`, `pupper` and `puppo`](#twitter-arc:category)


#### `df_prediction` - Image prediction
1. [Variables stored as column names in the prediction columns](#image-prediction:info)


#### `df_tweets_api` - Twitter API
1. [Creation date (`created_at`) is stored as a `string`](#tweets-api:info)
2. [The ID columns `in_reply_to_status_id` and `in_reply_to_user_id` are stored as `float64`](#tweets-api:info)

------

<a id="limpeza"></a>

# Cleaning

<a id="limpeza:definicao"></a>

## Cleaning: Definition

#### `df_twitter_arc` - Twitter archive
1. Incomplete dataset: of the **5000** advertised records, only **2356** are available
    * OK - nothing to be done
2. [Column `source` adds no value to the observational unit](#limpeza:twitter-arc:source)
    * The content is a download link for the Twitter app, so it adds no value to the observational unit; the column can be removed
3. [Missing names (`name`) are stored as the literal string `'None'`](#limpeza:twitter-arc:name)
    * Replace `'None'` values with `np.nan`
4. [Column `expanded_urls` adds no value to the observational unit](#limpeza:twitter-arc:expanded_urls)
    * The values are URLs to the tweets themselves and add no value to the observational unit. Remove the `expanded_urls` column.
5. [Categories \[doggo, floofer, pupper, puppo\] use the literal string `'None'`](#limpeza:twitter-arc:category)
    * Replace values equal to `'None'` with `np.nan`
6. [The columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` are stored as `float64`](#limpeza:twitter-arc:info)
    * Either `retweeted_status_id` or `retweeted_status_user_id` can be converted to a boolean flag to help identify original tweets
    * Once a column indicating whether the tweet is a retweet exists, the retweet columns `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` can be dropped
    * The columns `in_reply_to_status_id` and `in_reply_to_user_id` will be handled in the `df_tweets_api` dataframe
7. [The columns `timestamp` and `retweeted_status_timestamp` are stored as `string` and should be of type `datetime`](#limpeza:twitter-arc:datetime)
    * Convert the values to `datetime`
8. [One variable spread across several columns: `doggo`, `floofer`, `pupper` and `puppo`](#limpeza:twitter-arc:category-merge)
    * Reduce the variable to a single column, `category`


#### `df_prediction` - Image prediction
1. [Predictions \[`p1`, `p2`, `p3`\] have non-standardized names](#image-prediction:p-values)
    * Convert the values to lower case and replace '_' with spaces
2. [Repeated predictions for the same image](#image-prediction:duplicated-urls)
    * Drop predictions for repeated images
3. [Variables stored as column names in the prediction columns](#image-prediction:info)
    * Transpose the variables held in the columns into rows
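
The name standardization and the column-to-row reshape described in this list can be sketched as follows. This is only an illustration under assumptions: the toy frame copies a few values from the samples shown earlier, and the output column names (`prediction`, `confidence`, `is_dog`, `rank`) are hypothetical, not names defined elsewhere in this notebook.

```python
import pandas as pd

# Toy rows shaped like df_prediction, trimmed to two prediction ranks
toy = pd.DataFrame({
    'tweet_id': [886258384151887873, 714957620017307648],
    'p1': ['pug', 'Great_Pyrenees'], 'p1_conf': [0.943575, 0.251516], 'p1_dog': [True, True],
    'p2': ['shower_cap', 'Samoyed'], 'p2_conf': [0.025286, 0.139346], 'p2_dog': [False, True],
})

# Reshape: one row per (tweet, prediction rank) instead of p1/p2 column groups
frames = []
for rank in (1, 2):
    part = toy[['tweet_id', f'p{rank}', f'p{rank}_conf', f'p{rank}_dog']].copy()
    part.columns = ['tweet_id', 'prediction', 'confidence', 'is_dog']  # hypothetical names
    part['rank'] = rank
    frames.append(part)
long_pred = pd.concat(frames, ignore_index=True)

# Standardize names: lower case, underscores replaced by spaces
long_pred['prediction'] = (
    long_pred['prediction'].str.lower().str.replace('_', ' ', regex=False)
)
print(long_pred.sort_values(['tweet_id', 'rank']))
```

With the real data the loop would run over ranks 1 to 3, and dropping repeated images would be a separate step on `jpg_url` before the reshape.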


#### `df_tweets_api` - Twitter API
1. [Creation date (`created_at`) is stored as a `string`](#tweets-api:info)
    * Convert the date strings to `datetime`
2. [The ID columns `in_reply_to_status_id` and `in_reply_to_user_id` are stored as `float`](#tweets-api:info)
    * Convert the columns `in_reply_to_status_id` and `in_reply_to_user_id` to integer, assigning 0 where no value is present

<a id="limpeza:twitter-archive"></a>

## Cleaning: Twitter archive

In [120]:
# Create a copy of the dataframe for cleaning
df_arch_clean = df_twitter_arc.copy()

<a id="limpeza:twitter-arc:source"></a>

#### 2. [Column `source` adds no value to the observational unit](#limpeza:twitter-arc:source)
* The content is a download link for the Twitter app, so it adds no value to the observational unit; the column can be removed

In [121]:
# Remove the `source` column
df_arch_clean.drop('source', axis='columns', inplace=True)

#### Test: Removal of the `source` column

In [122]:
# Confirm that the `source` column was removed
df_arch_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

<a id="limpeza:twitter-arc:name"></a>

#### 3. [Missing names (`name`) are stored as the literal string `'None'`](#limpeza:twitter-arc:name)
* Replace `'None'` values with `np.nan`

In [123]:
# Count the names equal to 'None'
df_arch_clean[df_arch_clean.name.str.lower() == 'none'].name.count()

745

In [124]:
# Function that replaces the literal value 'None' with np.nan
def set_nan_at_none(value):
    if str(value).lower() == 'none':
        return np.nan
    else:
        return value

In [125]:
# Apply `np.nan` to the names equal to 'None'
df_arch_clean.name = df_arch_clean.name.apply(set_nan_at_none)
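
As an alternative to applying a Python function element by element, the same substitution can be sketched with `replace`, which handles several columns in one call. The toy frame below stands in for `df_arch_clean`. One difference to keep in mind: `replace` matches the exact string `'None'`, while `set_nan_at_none` is case-insensitive.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_arch_clean
toy = pd.DataFrame({
    'name': ['Taco', 'None', 'Penny'],
    'doggo': ['None', 'doggo', 'None'],
})

# Swap exact matches of the literal string 'None' for NaN in the chosen columns
cols = ['name', 'doggo']
toy[cols] = toy[cols].replace('None', np.nan)
print(toy.isna().sum())
```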

#### Test: `np.nan` applied to the names equal to 'None'

In [126]:
df_arch_clean.name.isnull().sum()

745

In [127]:
df_arch_clean[df_arch_clean.name.str.lower() == 'none'].name.count()

0

In [128]:
df_arch_clean.name.sample(10)

1576      Kramer
45         Bella
2145         NaN
987     Dietrich
2269         NaN
1001         NaN
144        Albus
1701       Alice
1687      Apollo
2247         NaN
Name: name, dtype: object

<a id="limpeza:twitter-arc:expanded_urls"></a>

#### 4. [Column `expanded_urls` adds no value to the observational unit](#limpeza:twitter-arc:expanded_urls)
* The values are URLs to the tweets themselves and add no value to the observational unit. Remove the `expanded_urls` column

In [129]:
df_arch_clean.drop('expanded_urls', axis='columns', inplace=True)

#### Test: Removal of the `expanded_urls` column

In [130]:
df_arch_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'rating_numerator', 'rating_denominator',
       'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

<a id="limpeza:twitter-arc:category"></a>

#### 5. [Categories \[`doggo`, `floofer`, `pupper`, `puppo`\] use the literal string `'None'`](#limpeza:twitter-arc:category)
* Replace values equal to `'None'` with `np.nan`

In [131]:
# Replace 'None' with np.nan in the columns `doggo`, `floofer`, `pupper`, `puppo`
df_arch_clean.doggo = df_arch_clean.doggo.apply(set_nan_at_none)
df_arch_clean.doggo.isna().value_counts()

True     2259
False      97
Name: doggo, dtype: int64

In [132]:
df_arch_clean.floofer = df_arch_clean.floofer.apply(set_nan_at_none)
df_arch_clean.floofer.isna().value_counts()

True     2346
False      10
Name: floofer, dtype: int64

In [133]:
df_arch_clean.pupper = df_arch_clean.pupper.apply(set_nan_at_none)
df_arch_clean.pupper.isna().value_counts()

True     2099
False     257
Name: pupper, dtype: int64

In [134]:
df_arch_clean.puppo = df_arch_clean.puppo.apply(set_nan_at_none)
df_arch_clean.puppo.isna().value_counts()

True     2326
False      30
Name: puppo, dtype: int64

In [135]:
# Sum of filled-in values across the classification columns
sum_categories = 0
sum_categories += df_arch_clean.doggo.notna().sum()
sum_categories += df_arch_clean.floofer.notna().sum()
sum_categories += df_arch_clean.pupper.notna().sum()
sum_categories += df_arch_clean.puppo.notna().sum()

sum_categories

394

Only 394 classification labels are filled in across the 2356 records (this counts labels, not records)

<a id="limpeza:twitter-arc:info"></a>

#### 6. [The columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` are stored as `float64`](#limpeza:twitter-arc:info)
* Rows with a value in `retweeted_status_id` or `retweeted_status_user_id` can be dropped because they indicate retweets, and the goal is to analyze original tweets only
* After the retweets are removed, the columns `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` can be dropped
* The columns `in_reply_to_status_id` and `in_reply_to_user_id` will be handled in the `df_tweets_api` dataframe

#### Identifying the original tweets

In [136]:
df_arch_clean.retweeted_status_id.size

2356

In [137]:
df_arch_clean.loc[:,'retweeted_status_id'].isna().value_counts()

True     2175
False     181
Name: retweeted_status_id, dtype: int64

Of the 2356 records, 181 are retweets

In [138]:
# Remove the retweets
df_arch_clean.drop(df_arch_clean[df_arch_clean.retweeted_status_id.notna()].index, axis=0, inplace=True)

#### Test: Removal of retweets

In [139]:
# Check whether any rows still have `retweeted_status_id` filled in
df_arch_clean.loc[:,'retweeted_status_id'].isna().value_counts()

True    2175
Name: retweeted_status_id, dtype: int64

Remove the retweet-related columns

In [140]:
df_arch_clean.drop(['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis='columns', inplace=True)

#### Test: Verify that the retweet-related columns were removed

In [141]:
df_arch_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'text', 'rating_numerator', 'rating_denominator', 'name', 'doggo',
       'floofer', 'pupper', 'puppo'],
      dtype='object')

<a id="limpeza:twitter-arc:datetime"></a>

#### 7. [The columns `timestamp` and `retweeted_status_timestamp` are stored as `string` and should be of type `datetime`](#limpeza:twitter-arc:datetime)
* Convert the values to `datetime`

Since the `retweeted_status_timestamp` column was removed in the previous step, it does not need to be converted

In [142]:
df_arch_clean.loc[:, ['timestamp']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 1 columns):
timestamp    2175 non-null object
dtypes: object(1)
memory usage: 34.0+ KB


In [143]:
df_arch_clean.timestamp = pd.to_datetime(df_arch_clean.timestamp)

#### Test: Verify the conversion from `string` to `datetime`

In [144]:
df_arch_clean.loc[:, ['timestamp']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 1 columns):
timestamp    2175 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 34.0 KB


<a id="limpeza:twitter-arc:category-merge"></a>

#### 8. [One variable spread across several columns: `doggo`, `floofer`, `pupper` and `puppo`](#limpeza:twitter-arc:category-merge)
* Reduce the variable to a single column, `category`

In [145]:
df_arch_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id                 2175 non-null int64
in_reply_to_status_id    78 non-null float64
in_reply_to_user_id      78 non-null float64
timestamp                2175 non-null datetime64[ns, UTC]
text                     2175 non-null object
rating_numerator         2175 non-null int64
rating_denominator       2175 non-null int64
name                     1495 non-null object
doggo                    87 non-null object
floofer                  10 non-null object
pupper                   234 non-null object
puppo                    25 non-null object
dtypes: datetime64[ns, UTC](1), float64(2), int64(3), object(6)
memory usage: 220.9+ KB


In [146]:
df_arch_clean.loc[:, 'doggo':].notna().sum()

doggo       87
floofer     10
pupper     234
puppo       25
dtype: int64

In [147]:
df_arch_clean.loc[:, 'doggo':].notna().sum().sum()

356

Across the **2175** records, only **356** classification labels are filled in

Merge the category columns into a single series, giving priority to the leftmost column: `combine_first` only fills the null values left by the columns before it.

In [148]:
category = df_arch_clean.doggo
category = category.combine_first(df_arch_clean.floofer)
category = category.combine_first(df_arch_clean.pupper)
category = category.combine_first(df_arch_clean.puppo)

df_arch_clean['category'] = category

In [149]:
df_arch_clean.category.notna().sum()

344

After merging the classification columns, only **344** records end up classified, fewer than the initial **356** labels. This indicates that some records had more than one classification, of which only the leftmost was kept.
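
If the extra classifications should be preserved instead of discarded, one option is to join every non-null label per row. The sketch below is a hedged alternative to the `combine_first` chain, not the approach this notebook uses; the toy frame stands in for `df_arch_clean` before the four columns are dropped.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_arch_clean before the category columns are dropped
toy = pd.DataFrame({
    'doggo':   ['doggo', np.nan, np.nan],
    'floofer': [np.nan, np.nan, np.nan],
    'pupper':  ['pupper', 'pupper', np.nan],
    'puppo':   [np.nan, np.nan, np.nan],
})
stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Number of labels per row; values above 1 are the multi-label records
label_count = toy[stages].notna().sum(axis=1)

# Join every non-null label so that no classification is dropped
toy['category'] = toy[stages].apply(lambda row: ', '.join(row.dropna()), axis=1)
toy['category'] = toy['category'].replace('', np.nan)
print(toy[['category']].assign(labels=label_count))
```

Filtering on `label_count > 1` in the real data would also show exactly which records account for the gap between 356 and 344.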

In [150]:
df_arch_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis='columns', inplace=True)

#### Test: Verify that the `category` column was created and that the columns `doggo`, `floofer`, `pupper` and `puppo` were removed

In [151]:
df_arch_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 9 columns):
tweet_id                 2175 non-null int64
in_reply_to_status_id    78 non-null float64
in_reply_to_user_id      78 non-null float64
timestamp                2175 non-null datetime64[ns, UTC]
text                     2175 non-null object
rating_numerator         2175 non-null int64
rating_denominator       2175 non-null int64
name                     1495 non-null object
category                 344 non-null object
dtypes: datetime64[ns, UTC](1), float64(2), int64(3), object(3)
memory usage: 169.9+ KB


------
<a id="limpeza:image-prediction"></a>

## Cleaning: Image prediction

------
<a id="limpeza:twitter-api"></a>

## Cleaning: Twitter API

------
<a id="armazenamento"></a>

# Storage

# Reports

* Data wrangling efforts
* Analyses and visualizations