# Wrangle and Analyze Data
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Gathering data</a></li>
<li><a href="#assessing">Assessing data</a></li>
<li><a href="#cleaning">Cleaning data</a></li>
<li><a href="#storing">Storing, analyzing and visualizing wrangled data</a></li>
</ul>

<a id='intro'></a>
## Introduction

In the Wrangle and Analyze Data project we will walk through each of the steps needed to properly wrangle the WeRateDogs Twitter data and create interesting and trustworthy analyses and visualizations.

In the **Gathering data** section, we will collect three pieces of information:
1. The WeRateDogs Twitter archive, which was made available for manual download from Udacity's Project Resources
2. The image-predictions.tsv file, which we will download programmatically from Udacity's servers
3. The retweet count and favorite count of each tweet, which are not available in the two sources listed above. We will therefore create a Twitter API object with Tweepy and use the tweet IDs in the WeRateDogs Twitter archive to query the API and collect the missing data

The **Assessing data** section is reserved for assessing the gathered data visually and programmatically for quality and tidiness issues. We will detect and document at least eight (8) quality issues and two (2) tidiness issues in the wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

Throughout the **Cleaning data** section we will clean each of the issues documented while assessing. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

Finally, in the last section, **Storing, analyzing and visualizing wrangled data**, we will store the clean DataFrame(s) in a CSV file, with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, they will be named appropriately. The cleaned data may additionally be stored in a SQLite database, which must then be submitted as well. We will also analyze and visualize the wrangled data in this Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

Let's start the work.

<a id='gathering'></a>
## Gathering data

In [1]:
import pandas as pd
import requests
import os
import tweepy
import json
import time

In [2]:
# Read the WeRateDogs Twitter archive into a DataFrame
df_twitter_arch = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# Download image-predictions.tsv file programmatically
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)

In [4]:
# Read image-predictions.tsv file into a DataFrame
df_image_pred = pd.read_csv('image-predictions.tsv', sep='\t')

In [5]:
# Create Tweepy API object
# The credentials are read from environment variables so that no secret is
#   stored in the notebook (the variable names below are an assumption)
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
# Read the tweet_ids from the image predictions DataFrame
tweet_id_list = df_image_pred['tweet_id']

# Query the Twitter API for each tweet's JSON data using Tweepy library 
#   and store it in the tweet_json.txt file
data = {}
id_errors = []

start = time.time()

with open('tweet_json.txt', 'w') as file:
    for tweet_id in tweet_id_list:
        try:
            data = api.get_status(tweet_id, tweet_mode='extended')
            file.write(json.dumps(data._json) + '\n')
        except Exception as e:
            # Report the failure and keep the ID itself for later inspection
            print('No status found with ID {}'.format(tweet_id))
            id_errors.append(tweet_id)

end = time.time()
print(end - start)

No status found with ID 680055455951884288


Rate limit reached. Sleeping for: 613


No status found with ID 754011816964026368
No status found with ID 759566828574212096
No status found with ID 771004394259247104
No status found with ID 779123168116150273
No status found with ID 802247111496568832
No status found with ID 829374341691346946


Rate limit reached. Sleeping for: 629


No status found with ID 831911600680497154
No status found with ID 837012587749474308
No status found with ID 837366284874571778
No status found with ID 842892208864923648
No status found with ID 844704788403113984
No status found with ID 851953902622658560
No status found with ID 861769973181624320
No status found with ID 872261713294495745
No status found with ID 873697596434513921
No status found with ID 888202515573088257
2800.0925800800323


In [7]:
# Read the tweet_json.txt file into a pandas DataFrame 
json_list = []

with open('tweet_json.txt') as json_file:
    for line in json_file:
        data = json.loads(line)
        tweet_id = data['id']
        favorite_counts = data['favorite_count']
        retweet_counts = data['retweet_count']
        
        json_list.append({'tweet_id': tweet_id,
                          'favorite_counts': favorite_counts,
                          'retweet_counts': retweet_counts})
        
df_retweet_fav = pd.DataFrame(json_list, columns=['tweet_id', 'favorite_counts', 'retweet_counts'])

<a id='assessing'></a>
## Assessing data

In [8]:
# Display the WeRateDogs twitter archive table
df_twitter_arch

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [9]:
# Display the image predictions table
df_image_pred

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [10]:
# Display the Retweet and Favorites count table
df_retweet_fav

Unnamed: 0,tweet_id,favorite_counts,retweet_counts
0,666020888022790149,2432,465
1,666029285002620928,121,42
2,666033412701032449,113,41
3,666044226329800704,274,133
4,666049248165822465,99,41
...,...,...,...
2053,891327558926688256,37991,8548
2054,891689557279858688,39794,7919
2055,891815181378084864,23666,3801
2056,892177421306343426,31424,5747
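
Beyond the visual inspection above, the programmatic part of the assessment can be sketched as a small helper that counts a few typical problems. This is only a minimal sketch: the `assess` function and the toy `sample` rows are hypothetical, and the two content checks (all-lowercase names such as the `a` values visible in the archive output, and denominators other than 10) are assumptions about what may be worth flagging, not the documented list of issues.

```python
import pandas as pd

# Toy stand-in for the archive table; these rows are invented for illustration.
sample = pd.DataFrame({
    'tweet_id': [1, 2, 2, 3],
    'rating_denominator': [10, 10, 10, 7],
    'name': ['Phineas', 'a', 'a', None],
})


def assess(df):
    """Return a few summary checks that hint at quality issues."""
    names = df['name'].dropna()
    return {
        'n_rows': len(df),
        # Repeated tweet IDs would point at duplicated observations
        'duplicate_ids': int(df['tweet_id'].duplicated().sum()),
        # Missing values per column
        'missing_per_column': df.isnull().sum().to_dict(),
        # All-lowercase "names" such as 'a' are probably not real names
        'lowercase_names': int(names.str.islower().sum()),
        # Assumption: ratings are expected to be out of 10
        'odd_denominators': int((df['rating_denominator'] != 10).sum()),
    }


report = assess(sample)
for check, result in report.items():
    print(check, result)
```

The same function could be called on `df_twitter_arch`, provided it has the three columns used here; `df.info()` and `df.describe()` give a complementary overview of data types and value ranges.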


<a id='cleaning'></a>
## Cleaning data
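
The archive table shown in the assessment spreads the dog stage over four columns (`doggo`, `floofer`, `pupper`, `puppo`), stores `timestamp` as text, and contains names such as `a`. The sketch below shows how such issues might be cleaned on a copy of the data. It is a hedged illustration only: the `clean` function, the `stage` column name, and the toy `raw` rows are hypothetical, and it assumes that an absent stage is stored as a missing value, as the displayed output suggests.

```python
import pandas as pd

STAGES = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy stand-in for the archive table; stage values are invented for illustration.
raw = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'timestamp': ['2017-08-01 16:23:56 +0000',
                  '2017-08-01 00:17:27 +0000',
                  '2015-11-16 00:04:52 +0000'],
    'name': ['Phineas', 'Tilly', 'a'],
    'doggo': [None, 'doggo', None],
    'floofer': [None, None, None],
    'pupper': [None, None, 'pupper'],
    'puppo': [None, None, None],
})


def clean(df):
    """Return a cleaned copy; the input DataFrame is left untouched."""
    out = df.copy()

    # Quality: parse the timestamp strings into timezone-aware datetimes
    out['timestamp'] = pd.to_datetime(out['timestamp'], utc=True)

    # Quality: treat all-lowercase "names" such as 'a' as missing
    bad_name = out['name'].map(lambda n: isinstance(n, str) and n.islower())
    out.loc[bad_name, 'name'] = None

    # Tidiness: collapse the four stage columns into a single column;
    # rows with several stages keep all of them, comma-separated
    out['stage'] = out[STAGES].apply(
        lambda row: ','.join(row.dropna()) or None, axis=1)
    return out.drop(columns=STAGES)


cleaned = clean(raw)
print(cleaned)
```

Working on a copy keeps the gathered DataFrames available for comparison, which makes it easy to test each cleaning step against the original.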

<a id='storing'></a>
## Storing, analyzing and visualizing wrangled data
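
As outlined in the introduction, the cleaned master table is stored as twitter_archive_master.csv and, optionally, in a SQLite database. The following is a minimal sketch of that step. The `master` DataFrame is a two-row stand-in built from the first rows of the retweet and favorite count table, the temporary output directory is used only to keep the example self-contained, and the database file name and table name are assumptions.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd

# Two-row stand-in for the cleaned master DataFrame
master = pd.DataFrame({
    'tweet_id': [666020888022790149, 666029285002620928],
    'favorite_counts': [2432, 121],
    'retweet_counts': [465, 42],
})

# A temporary directory keeps the sketch from touching the project folder
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'twitter_archive_master.csv')
db_path = os.path.join(out_dir, 'twitter_archive_master.db')  # assumed name

# Store as CSV; index=False avoids writing the row index as an extra column
master.to_csv(csv_path, index=False)

# Optionally store the same table in a SQLite database
with closing(sqlite3.connect(db_path)) as conn:
    master.to_sql('twitter_archive_master', conn,
                  if_exists='replace', index=False)
    restored_sql = pd.read_sql_query(
        'SELECT * FROM twitter_archive_master', conn)

# Read the CSV back to confirm that the round trip preserved the data
restored_csv = pd.read_csv(csv_path)
print(restored_csv)
```

Writing without the index matters here: otherwise the file gains an unnamed first column when it is read back.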