# Table of Contents
---

* [Introduction](#introduction)
* [Gathering](#gathering)
* [Assessing](#assessing)

<a id='introduction'></a>

# Introduction
---

> The goal of this project is to wrangle WeRateDogs Twitter data into interesting and trustworthy analyses and visualizations. The Twitter archive is a good start, but it contains only very basic tweet information. Additional gathering, followed by assessing and cleaning, is required for "Wow!"-worthy analyses and visualizations.

<a id='gathering'></a>

# Gathering
---


> **Enhanced Twitter Archive:** 
>
> Provided by Udacity

> **Additional Data via the Twitter API:**
>
> Using the tweet IDs in the WeRateDogs Twitter archive, queried the Twitter API for each tweet's JSON data with Python's Tweepy library and stored each tweet's entire set of JSON data, one JSON object per line, in a file called tweet_json.txt


```python
import json

import tweepy
from tqdm import tqdm

# Load the API credentials from a local JSON file
with open('twitter-credential.json') as f:
    credentials = json.load(f)
auth = tweepy.OAuthHandler(credentials['consumer_key'], credentials['consumer_secret'])
auth.set_access_token(credentials['access_token'], credentials['access_token_secret'])
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Query the full JSON of every tweet ID in the archive
tweet_ids = df_wrd_twitter_old['tweet_id']
statuses = []
for tweet_id in tqdm(tweet_ids):
    try:
        status = api.get_status(tweet_id, tweet_mode='extended')
        statuses.append(status._json)
    except tweepy.TweepError:
        # Skip tweets that can no longer be fetched (e.g. deleted tweets)
        continue

# Write one JSON object per line
with open(json_filepath, 'w') as outfile:
    for status in statuses:
        json.dump(status, outfile)
        outfile.write("\n")
```
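
> Since the file holds one JSON object per line, it can be read back line by line. The following is a minimal sketch, not the project's actual loading code: the helper name `read_tweet_json` and the choice of fields are assumptions made for illustration, and the two sample lines are made up. `id_str` is used so that tweet IDs stay strings.

```python
import json


def read_tweet_json(lines):
    """Parse one JSON object per line and keep a few fields of interest."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        status = json.loads(line)
        records.append({
            "tweet_id": status["id_str"],  # string form of the tweet ID
            "retweet_count": status["retweet_count"],
            "favorite_count": status["favorite_count"],
        })
    return records


# Two hypothetical lines in the shape that json.dump() writes, one object per line
sample = [
    '{"id_str": "1", "retweet_count": 10, "favorite_count": 25}\n',
    '{"id_str": "2", "retweet_count": 3, "favorite_count": 7}\n',
]
records = read_tweet_json(sample)
print(records)
```

> An open file handle can be passed in place of `sample`, since iterating over a text file yields its lines. The resulting list of dicts can then be handed to `pandas.DataFrame`.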


> **Image Predictions File:**
>
> Downloaded programmatically using the Requests library and the following URL: [https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)

```python
import requests

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url, allow_redirects=True)
response.raise_for_status()  # fail early instead of saving an error page
with open(image_predictions_filepath, 'wb') as f:
    f.write(response.content)
```

<a id='assessing'></a>

# Assessing
---

> **Key Points:**
> * Only original ratings (no retweets) that have images are wanted
> * Tweets posted after August 1st, 2017 do not need to be gathered
> * Cleaning includes merging individual pieces of data according to the rules of tidy data
> * Rating numerators that are greater than their denominators do not need to be cleaned
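
> The first two key points can be expressed as boolean masks in pandas. This is a sketch under stated assumptions rather than the project's cleaning code: the tiny `archive` frame is made up, the column `retweeted_status_id` and the set `image_ids` (tweet IDs that have an image prediction) are assumed names, and the helper `keep_original_ratings` is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the archive; the real column names may differ
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "timestamp": [
        "2017-07-31 00:17:27 +0000",
        "2017-08-02 12:00:00 +0000",
        "2017-07-15 02:45:48 +0000",
        "2017-06-01 10:00:00 +0000",
    ],
    "retweeted_status_id": [None, None, "999", None],
})
# Assumed: IDs of tweets that appear in the image predictions file
image_ids = {"1", "2", "3"}


def keep_original_ratings(archive, image_ids, last_day="2017-08-01"):
    """Keep original tweets with images posted on or before last_day (UTC)."""
    timestamps = pd.to_datetime(archive["timestamp"], utc=True)
    # Exclusive upper bound: midnight at the start of the day after last_day
    upper = pd.Timestamp(last_day, tz="UTC") + pd.Timedelta(days=1)
    is_original = archive["retweeted_status_id"].isna()  # no retweet metadata
    has_image = archive["tweet_id"].isin(image_ids)
    in_range = timestamps < upper
    return archive[is_original & has_image & in_range]


kept = keep_original_ratings(archive, image_ids)
print(kept["tweet_id"].tolist())
```

> In this made-up sample, tweet 2 falls after the cutoff, tweet 3 is a retweet, and tweet 4 has no image, so only tweet 1 remains.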

> **Quality Issues:**
> * IDs are found to have wrong types
> * Data contains retweets
> * Variable **retweeted** indicates whether the authenticating user has retweeted this tweet, while **retweet_count** indicates the number of times the tweet has been retweeted. The maximum of retweet_count reaches 77154, yet the only value found in retweeted is False
> * The same issue is found between **favorited** and **favorite_count**
> * Misleading column names in df_image_predictions_old: p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog
> * Column name **text** is not descriptive
> * Issues with columns p1, p2, p3
>   * Not all words start with a capital letter
>   * Some use "_" and some use "-"
> * **timestamp** is not of type DateTime; the same applies to **retweeted_status_timestamp**
> * The **name** column in **df_wrd_twitter_old** contains words other than dog names
> * The **rating_denominator** column contains a 0 whose corresponding **rating_numerator** is 960
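
> Two of the issues above, the inconsistent labels in p1, p2, p3 and the string timestamps, lend themselves to small helper functions. The sketch below is illustrative only: `normalize_label` and `parse_timestamp` are hypothetical helpers, replacing both "_" and "-" with spaces is one possible convention, and the timestamp format is an assumption about how the archive stores its values.

```python
from datetime import datetime


def normalize_label(label):
    """Replace '_' and '-' with spaces and capitalize each word."""
    words = label.replace("_", " ").replace("-", " ").split()
    return " ".join(word.capitalize() for word in words)


def parse_timestamp(text):
    """Parse a string such as '2017-08-01 16:23:56 +0000' (assumed format)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S %z")


# Made-up labels covering the separator and capitalization variants
labels = [normalize_label(s) for s in
          ["golden_retriever", "black-and-tan_coonhound", "Chihuahua"]]
ts = parse_timestamp("2017-08-01 16:23:56 +0000")
print(labels)
print(ts.isoformat())
```

> In pandas, the same functions could be applied with `Series.map`, or the timestamp columns could be converted in one step with `pd.to_datetime`.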