# Data Wrangling Template

## Gather

First, we import the libraries we'll be using and load some environment variables that contain the authentication information for the Twitter API.

In [1]:
import pandas as pd
import requests
import tweepy
import json
import time
import shutil

consumer_key = %env TWITTER_CON_KEY
consumer_secret = %env TWITTER_CON_SECRET
access_token = %env TWITTER_ACC_TOKEN
access_secret = %env TWITTER_ACC_SECRET

Load up the enhanced Twitter archive, which we have on hand

In [2]:
tw_arch = pd.read_csv('data/twitter-archive-enhanced.csv')

Download the image predictions TSV file

In [4]:
tsv_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

with requests.get(tsv_url, stream=True) as r:
    with open('data/image_predictions.tsv', 'wb') as f:
        shutil.copyfileobj(r.raw, f)

Check the headers of the Twitter archive, and confirm that the relevant content contains only unique data

In [5]:
tw_arch.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [8]:
tw_arch.tweet_id.nunique() == len(tw_arch)

True
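Had the check above returned False, the next question would be which IDs repeat. A minimal sketch on a small made-up frame (`demo` is hypothetical data standing in for `tw_arch`, not the archive itself):

```python
import pandas as pd

# Hypothetical stand-in for tw_arch with one repeated tweet_id.
demo = pd.DataFrame({'tweet_id': [101, 102, 102, 103],
                     'text': ['a', 'b', 'b again', 'c']})

# Series.is_unique gives the same answer as nunique() == len() when there are no NaNs.
all_unique = demo.tweet_id.is_unique

# keep=False marks every row that shares its tweet_id with another row.
dupes = demo[demo.tweet_id.duplicated(keep=False)]

print(all_unique)
print(dupes)
```

Listing the offending rows, rather than only counting them, makes it easier to decide which copy to keep.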

Set up our client, then use tweet IDs from the archive to query the Twitter API

In [6]:
tweet_ids = tw_arch.tweet_id.tolist()
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [8]:
failed = []
passed = []
with open('data/tweet-json.txt', mode='w') as f:
    for tid in tweet_ids:
        try:
            data = api.get_status(tid, tweet_mode='extended')
        except tweepy.TweepError:  # tweepy 3.x error class; a bare except would also swallow KeyboardInterrupt
            print("Failure to retrieve tweet with tid: {0}".format(tid))
            failed.append(tid)
            continue
        passed.append(tid)
        print(json.dumps(data._json), file=f)
        time.sleep(0.3)

Failure to retrieve tweet with tid: 888202515573088257
Failure to retrieve tweet with tid: 873697596434513921
Failure to retrieve tweet with tid: 872668790621863937
Failure to retrieve tweet with tid: 872261713294495745
Failure to retrieve tweet with tid: 869988702071779329
Failure to retrieve tweet with tid: 866816280283807744
Failure to retrieve tweet with tid: 861769973181624320
Failure to retrieve tweet with tid: 856602993587888130
Failure to retrieve tweet with tid: 851953902622658560
Failure to retrieve tweet with tid: 845459076796616705
Failure to retrieve tweet with tid: 844704788403113984
Failure to retrieve tweet with tid: 842892208864923648
Failure to retrieve tweet with tid: 837366284874571778
Failure to retrieve tweet with tid: 837012587749474308
Failure to retrieve tweet with tid: 829374341691346946
Failure to retrieve tweet with tid: 827228250799742977
Failure to retrieve tweet with tid: 812747805718642688
Failure to retrieve tweet with tid: 802247111496568832
Failure to

Rate limit reached. Sleeping for: 346


Failure to retrieve tweet with tid: 754011816964026368
Failure to retrieve tweet with tid: 680055455951884288


Rate limit reached. Sleeping for: 447
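
The loop above records which IDs failed but not why. A minimal sketch of one way to keep the error message per ID follows. The names `fetch_all` and `fake_fetch` are hypothetical; in this notebook the `fetch` argument would presumably wrap `api.get_status(tid, tweet_mode='extended')._json`, and the broad `except Exception` would be narrowed to the client's own error class.

```python
import io
import json

def fetch_all(ids, fetch, out):
    """Call fetch(tid) for each ID, write one JSON document per line to out,
    and return (passed, failed), where failed maps tid -> error message."""
    passed, failed = [], {}
    for tid in ids:
        try:
            payload = fetch(tid)
        except Exception as exc:  # assumption: narrow this to the API client's error class in real use
            failed[tid] = str(exc)
            continue
        out.write(json.dumps(payload) + '\n')
        passed.append(tid)
    return passed, failed

# Hypothetical stand-in for the API call: ID 2 simulates a deleted tweet.
def fake_fetch(tid):
    if tid == 2:
        raise LookupError('No status found with that ID.')
    return {'id': tid}

buf = io.StringIO()
passed, failed = fetch_all([1, 2, 3], fake_fetch, buf)
print(passed, failed)
```

Keeping the message alongside the ID helps separate deleted tweets from transient network errors that are worth retrying.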



## Assess

We'll start by looking again at the columns and at the head of the Twitter archive.

In [3]:
tw_arch.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [4]:
tw_arch.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [5]:
tw_arch[ tw_arch.in_reply_to_status_id.notnull() ].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3105441000.0,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
113,870726314365509632,8.707262e+17,16487760.0,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
148,863427515083354112,8.634256e+17,77596200.0,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,


In [6]:
type(tw_arch.iloc[1].timestamp)

str

In [7]:
tw_arch.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [8]:
tw_arch.sample(10).text.values.tolist()

["Meet Vinnie. He's having fun while being safe. Well not a lot of fun, but definitely safe, and that's important 8/10 https://t.co/vZYtynZZlH",
 "This is Bloo. He's a Westminster Cîroc. Doesn't think Bart deserves legs. Nice flowers. 8/10 https://t.co/IAc1QCczMc",
 "Say hello to Griffin. He's upset because his costume for Halloween didn't arrive until today. 9/10 cheer up pup https://t.co/eoBCjSFajX",
 'This is Samsom. He is sexually confused. Really wants to be a triceratops. 9/10 just a great guy https://t.co/HPoce45SI3',
 'Heartwarming scene here. Son reuniting w father after coming home from deployment. Very moving. 10/10 for both pups https://t.co/95JJevQOWW',
 '12/10 gimme now https://t.co/QZAnwgnOMB',
 "Meet Vincent. He's a wild Adderall Cayenne. Shipped for free. Always fresh. Never frozen. 10/10 great purchase https://t.co/ZfS7chSsi7",
 "Meet Clarkus. He's a Skinny Eastern Worcestershire. Can tie own shoes (impressive af) 10/10 would put on track team https://t.co/XP5o7zGn0E"

In [9]:
type(tw_arch[ tw_arch.retweeted_status_timestamp.notnull() ].iloc[0].retweeted_status_timestamp)

str

In [10]:
tw_arch[ tw_arch.retweeted_status_id.notnull() ].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4196984000.0,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4196984000.0,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4196984000.0,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,


In [13]:
tw_arch.sample(10).expanded_urls.values.tolist()

['https://twitter.com/dog_rates/status/667012601033924608/photo/1',
 'https://vine.co/v/eEZXZI1rqxX,https://vine.co/v/eEZXZI1rqxX',
 'https://twitter.com/dog_rates/status/676470639084101634/photo/1',
 'https://vine.co/v/5QWd3LZqXxd',
 'https://twitter.com/dog_rates/status/752682090207055872/photo/1,https://twitter.com/dog_rates/status/752682090207055872/photo/1',
 'https://www.gofundme.com/meeko-needs-heart-surgery,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1',
 'https://twitter.com/dog_rates/status/701981390485725185/photo/1',
 'https://twitter.com/dog_rates/status/813910438903693312/photo/1',
 'https://twitter.com/dog_rates/status/889665388333682689/photo/1',
 'https://twitter.com/dog_rates/status/738166403467907072/photo/1,https://twitter.com/dog_rates/status/7381664034679070

In [15]:
len(tw_arch[ tw_arch.expanded_urls.str.contains(',', na=False) ])

639

606

### Quality
* in_reply_to_status_id/user_id stored as floats, mostly NaN
* timestamp stored as a string instead of datetime
* source holds the client name wrapped in an HTML anchor tag
* retweeted_status_id/user_id stored as floats
* retweeted_status_timestamp stored as string
* expanded_urls contains comma-separated values, often the same URL repeated
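
The issues listed above could be addressed along the following lines. This is a sketch on made-up rows shaped like the archive, not the cleaning code for this project; the regular expression, the choice of list-valued URLs, and the sample ID are all assumptions. One caveat on the float-typed ID columns: tweet IDs are larger than 2^53, so a float64 cannot hold them exactly, and the safer route is to re-read those columns as strings rather than cast the floats back to integers.

```python
import io
import pandas as pd

# Two made-up rows shaped like the archive (values are illustrative only).
demo = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-07-15 16:51:35 +0000'],
    'source': ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
               '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>'],
    'expanded_urls': ['https://example.com/a,https://example.com/a',
                      'https://example.com/b,https://example.com/c'],
})

# Parse the timestamp strings into timezone-aware (UTC) datetimes.
demo['timestamp'] = pd.to_datetime(demo['timestamp'], utc=True)

# Keep only the text between the opening and closing anchor tags.
demo['source'] = demo['source'].str.extract(r'>([^<]+)</a>', expand=False)

# Split on commas and drop repeated URLs while preserving their order.
demo['expanded_urls'] = demo['expanded_urls'].apply(
    lambda s: list(dict.fromkeys(s.split(','))) if isinstance(s, str) else [])

# Read ID columns as strings so large IDs are never routed through float64.
csv_text = ('tweet_id,in_reply_to_status_id\n'
            '886267009285017600,886266357075128321\n'  # reply ID is made up
            '892420643555336193,\n')
ids = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(demo.dtypes)
print(ids)
```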

In [5]:
img_prd = pd.read_csv('data/image_predictions.tsv', sep='\t')

In [7]:
img_prd.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [9]:
twitter_json = []
with open('data/udacity.tweet-json.txt', 'r') as f:
    for line in f:
        twitter_json.append(json.loads(line))
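
Once each line is parsed, the fields of interest can be collected into a DataFrame. A minimal sketch on two made-up JSON lines follows. `retweet_count` and `favorite_count` are standard fields on Twitter status objects, but they are not visible in this notebook's truncated output, so treat their use here as an assumption.

```python
import io
import json
import pandas as pd

# Two made-up lines in the same one-JSON-document-per-line layout.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 20}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
)

# Skip blank lines so a trailing newline does not raise a decode error.
records = [json.loads(line) for line in sample if line.strip()]

# Keep only the fields needed for analysis; .get() tolerates missing keys.
tweets = pd.DataFrame(
    [{'tweet_id': r['id'],
      'retweet_count': r.get('retweet_count'),
      'favorite_count': r.get('favorite_count')} for r in records]
)
print(tweets)
```

Selecting a handful of keys up front keeps the frame small and avoids flattening the deeply nested `entities` structure.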

In [10]:
twitter_json

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'truncated': False,
 'display_text_range': [0, 85],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'large': {'w': 540, 'h': 528, 'resize': 'fit'},
     'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'medium': {'w': 

## Clean

#### Define

#### Code

#### Test