# We rate dogs data wrangling project

1. The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: [```twitter_archive_enhanced.csv```](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv)

1. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (```image_predictions.tsv```) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

1. Each tweet's *retweet count* and *favorite ("like") count* at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's [Tweepy](http://www.tweepy.org/) library and store each tweet's entire set of JSON data in a file called ```tweet_json.txt``` file. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

Note: do not include your Twitter API keys, secrets, and tokens in your project submission.


## Assessing Data for this Project

After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least **eight (8) quality issues** and **two (2) tidiness issues** in your ```wrangle_act.ipynb``` Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

## Cleaning Data for this Project

Clean each of the issues you documented while assessing. Perform this cleaning in ```wrangle_act.ipynb``` as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

## Storing, Analyzing, and Visualizing Data for this Project

Store the clean DataFrame(s) in a CSV file with the main one named ```twitter_archive_master.csv```. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).

Analyze and visualize your wrangled data in your ```wrangle_act.ipynb``` Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

## Reporting for this Project

Create a 300-600 word written report called ```mwrangle_report.pdf``` or ```wrangle_report.html``` that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called ```act_report.pdf``` or ```act_report.html``` that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using the [Markdown functionality](http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files (see image below). You might prefer to use a word processor like Google Docs or Microsoft Word, however.


In [68]:
import os
import tweepy
#import json
import pandas as pd
import numpy as np

In [43]:
df_twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
df_image_predictions = pd.read_csv('image-predictions.tsv', sep='\t') # was downloaded with request.get(..., stream=True)

In [30]:
consumer_key = os.environ.get("TWITTER_API")
consumer_secret = os.environ.get("TWITTER_API_SECRET")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN")
access_secret = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET")

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In [12]:
df_twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [70]:
retweet_count[606]


645

In [95]:
def get_tweet_coverate():
    if not os.path.exists('tweet_coverage.csv'):
        retweet_count = []
        favorite_count = []
        favorited = []
        retweeted = []

        for index, row in df_twitter_archive.iterrows():
            try:
                tweet_status = api.get_status(row.tweet_id, tweet_mode='extended')._json
                retweet_count.append(tweet_status['retweet_count'])
                favorite_count.append(tweet_status['favorite_count'])
                favorited.append(tweet_status['favorited'])
                retweeted.append(tweet_status['retweeted'])
            except:
                retweet_count.append(np.nan)
                favorite_count.append(np.nan)
                favorited.append(np.nan)
                retweeted.append(np.nan)

            if index % 100 == 0:
                print('Index is %d' % (index))
        print('ready')
        df_tweet_coverage = pd.DataFrame(data = list(zip(df_image_predictions['tweet_id'].values, retweet_count,
                                                         favorite_count, favorited, retweeted)),
                                         columns=['tweet_id', 'retweet_count', 
                                                  'favorite_count', 'favorited', 'retweeted'])
        df_tweet_coverage.to_csv('tweet_coverage.csv', index=False)
    else:
        df_tweet_coverage = pd.read_csv('tweet_coverage.csv')
        
    return df_tweet_coverage



In [96]:
df_tweet_coverage = get_tweet_coverate()
df_tweet_coverage.sample(10)

Index is 0
Index is 100
Index is 200
Index is 300
Index is 400
Index is 500
Index is 600
Index is 700
Index is 800
Index is 900
Index is 1000
Index is 1100
Index is 1200
Index is 1300
Index is 1400
Index is 1500
Index is 1600
Index is 1700
Index is 1800
Index is 1900
Index is 2000
Index is 2100
Index is 2200
Index is 2300
ready


Unnamed: 0,tweet_id,retweet_count,favorite_count,favorited,retweeted
750,687826841265172480,,,,
1392,767122157629476866,,,,
1807,832397543355072512,,,,
250,670691627984359425,7584.0,0.0,False,False
294,671355857343524864,443.0,3858.0,False,False
1833,836753516572119041,,,,
84,667502640335572993,2180.0,17700.0,False,False
1060,715009755312439296,,,,
1849,839990271299457024,,,,
176,669203728096960512,10178.0,33185.0,False,False
