# Table of Contents
---

* [Introduction](#introduction)
* [Gathering](#gathering)
* [Assessing](#assessing)
* [Cleaning](#cleaning)

<a id='introduction'></a>

# Introduction
---

> The goal of this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, followed by assessing and cleaning, is required for "Wow!"-worthy analyses and visualizations.

<a id='gathering'></a>

# Gathering
---


> **Enhanced Twitter Archive** 
>
> Provided by Udacity


> **Image Predictions File**
>
> Downloaded programmatically using the Requests library and the following URL: [https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)

```python
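import requests

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url, allow_redirects=True)
response.raise_for_status()  # stop here if the download failed
# image_predictions_filepath is defined elsewhere in the project
with open(image_predictions_filepath, 'wb') as f:
    f.write(response.content)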
```

> **Additional Data via the Twitter API**
>
> Using the tweet IDs in the WeRateDogs Twitter archive, queried the Twitter API for each tweet's JSON data with Python's Tweepy library and stored each tweet's entire set of JSON data in a file called tweet_json.txt


```python
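import json

import tweepy
from tqdm import tqdm

# Load API credentials from a local JSON file
with open('twitter-credential.json') as f:
    credentials = json.load(f)
auth = tweepy.OAuthHandler(credentials['consumer_key'], credentials['consumer_secret'])
auth.set_access_token(credentials['access_token'], credentials['access_token_secret'])
# wait_on_rate_limit_notify is a Tweepy 3.x argument
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# df_wrd_twitter_old and json_filepath are defined elsewhere in the project
tweet_ids = df_wrd_twitter_old['tweet_id']
statuses = []
for tweet_id in tqdm(tweet_ids):
    try:
        status = api.get_status(tweet_id, tweet_mode='extended')
        statuses.append(status._json)
    except tweepy.TweepError:
        # Skip tweets that are deleted or otherwise unavailable
        continue

# Write one JSON object per line
with open(json_filepath, 'w') as outfile:
    for status in statuses:
        json.dump(status, outfile)
        outfile.write('\n')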
```

> **Dog Breed Labels**
> * Downloaded from Kaggle
> * Not downloaded programmatically, since the zip archive contains unnecessary files
> * **URL**: https://www.kaggle.com/c/dog-breed-identification/data
> * Only `labels.csv` is used in this project

<a id='assessing'></a>

# Assessing
---

> **Key Points**
> * Only original ratings (no retweets) that have images are wanted
> * Tweets posted after August 1st, 2017 do not need to be gathered
> * Cleaning includes merging individual pieces of data according to the rules of tidy data
> * Rating numerators that are greater than their denominators do not need to be cleaned

### Summary

> **Enhanced Twitter Archive**
> * name, doggo, floofer, pupper and puppo columns contain "None"
> * name contains "a" as a value and may contain other nonsense names
> * tweet information and dog information in the same table
> * name (refers to dog name) is not very descriptive since tweet information and dog information are in the same table
> * text (refers to status update) is not very descriptive since tweet information and dog information are in the same table
> * All IDs (tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id) have a numeric data type instead of string
> * timestamp is of string type
> * Some columns have null values, which is reasonable for those fields
> * doggo, floofer, pupper and puppo can be combined into a single column
> * Columns that may be ignored: 
>   in_reply_to_status_id, in_reply_to_user_id, source, expanded_urls
> * name has invalid values other than 'None', e.g. a, an, the, one. All of them start with a lowercase character.
> * The name column has duplicated entries
> * There are 7 rows in df_wrd_twitter_old with wrong denominators and numerators
> * There are 2 rows where a time or date in the text was parsed as a rating: 1) 24/7 2) 11/15/15
> * name, doggo, pupper, puppo and floofer have "None" values as expected
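
> The name and rating observations above can also be checked programmatically. The following is a minimal sketch, assuming the archive is loaded into a DataFrame with `name`, `text`, `rating_numerator` and `rating_denominator` columns. The sample rows are made up for illustration, and treating a denominator other than 10 as suspicious is an assumption, not a rule stated in this report.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real archive
df = pd.DataFrame({
    'name': ['Charlie', 'a', 'None', 'the'],
    'text': ['This is Charlie. 13/10', 'Here is a pupper 12/10',
             'Open 24/7 but still 11/10', 'He is the best 10/10'],
    'rating_numerator': [13, 12, 24, 10],
    'rating_denominator': [10, 10, 7, 10],
})

# Names starting with a lowercase letter are likely parsing mistakes
lowercase_names = df[df['name'].str.match(r'^[a-z]')]

# Assumption: a denominator other than 10 deserves a manual look at the text
odd_denominators = df[df['rating_denominator'] != 10]

print(lowercase_names['name'].tolist())
print(odd_denominators[['text', 'rating_numerator', 'rating_denominator']])
```

> Rows flagged this way still need to be read by eye, since the correct rating can only be confirmed from the tweet text.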

> **Image Predictions File**
>
> * img_num, p1, p1_conf, p1_dog, etc. are not very descriptive
> * Values of p1, p2 and p3 have formatting issues. Some start with an uppercase character while others start with a lowercase character. In addition, some use "-" as a delimiter while others use "_"
> * The data type of tweet_id is integer
> * No missing values in this table
> * There are rows where p1_dog, p2_dog or p3_dog is true, but the corresponding prediction is not a dog breed

> **Additional Data via the Twitter API**
>
> * Each type of ID is stored in two columns that hold the same piece of information with the same data type
> * Columns that may be ignored: 
>   * created_at, truncated, in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_user_id
>   * in_reply_to_user_id_str, in_reply_to_screen_name, is_quote_status, possibly_sensitive
>   * possibly_sensitive_appealable, lang, quoted_status_id, quoted_status_id_str, quoted_status_permalink
>   * quoted_status, geo, coordinates, place
> * id is of numeric type

> **Dog Breed Labels**
> * There are duplicates under the breed column of df_dog_breeds
> * Formatting issues: both "-" and "_" are used as separators

### Quality Issues

> * All ID-related fields have the wrong data type
> * timestamp has the wrong data type
> * There are rows where rating_denominator and rating_numerator do not match the text
> * Duplicated columns hold the same pieces of data with the same data types
> * Some column names are not descriptive
> * Some column values have formatting issues
> * Some column values have duplicates
> * There are rows where p1_dog, p2_dog or p3_dog is true, but the corresponding prediction is not a dog breed

### Tidiness Issues

> * tweet information and dog information in the same table
> * doggo, floofer, pupper and puppo can be combined into a single column
> * tweet information is separated into three tables
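
> One way to address the last tidiness issue is to join the three tweet-level tables on the tweet ID. This is a sketch only, assuming the IDs have already been converted to strings and that the API table's `id` column is renamed to `tweet_id`. The frame names and sample values are illustrative.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                        'rating_numerator': [13, 12, 11]})
predictions = pd.DataFrame({'tweet_id': ['1', '2'],
                            'p1': ['pug', 'beagle']})
api_data = pd.DataFrame({'id': ['1', '2', '3'],
                         'retweet_count': [5, 7, 9]})

# Inner joins keep only tweets present in every table,
# which also drops tweets without an image prediction
tweets = (
    archive
    .merge(predictions, on='tweet_id', how='inner')
    .merge(api_data.rename(columns={'id': 'tweet_id'}),
           on='tweet_id', how='inner')
)

print(tweets)
```

> An inner join fits the requirement that only ratings with images are wanted. A left join would keep tweets without predictions instead.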

<a id='cleaning'></a>

# Cleaning
---

> **Dog Breed Labels**
> * Find the duplicated rows, get their indices and remove them by row index
> * Fix formatting issues in the breed column
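
> The two steps above can be sketched as follows. This version normalises the formatting first and then uses `drop_duplicates` rather than removing rows by index, which is a swapped-in shortcut and not necessarily what the project code does. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample of the breed column from labels.csv
df_dog_breeds = pd.DataFrame({
    'breed': ['boston_bull', 'dingo', 'boston_bull', 'wire-haired_fox_terrier'],
})

# Lowercase and use "_" as the only separator
df_dog_breeds['breed'] = (
    df_dog_breeds['breed'].str.lower().str.replace('-', '_', regex=False)
)

# Keep the first occurrence of each breed and renumber the rows
df_dog_breeds = (
    df_dog_breeds.drop_duplicates(subset='breed').reset_index(drop=True)
)

print(df_dog_breeds['breed'].tolist())
```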

> **Enhanced Twitter Archive**
> * Filter out retweets by removing rows with a non-null retweeted_status_id
> * Drop unnecessary columns
> * Update and remove rows with inaccurate ratings
>   * Get all texts where the denominator and numerator are wrong
>   * Get all rows in the dataframe by matching texts
>   * Remove these rows from the dataframe by index
> * Change data types
>   * Change the data type of tweet_id to object (string) using `.astype(str)`
>   * Change the data type of timestamp to datetime using `pd.to_datetime` and `utc=True`
> * Filter out tweets beyond August 1st, 2017
>   * Create a new dataframe with shape (the number of rows in df_wrd_twitter, 1)
>   * Fill it with value = '2017-08-01'
>   * Convert both series using code like `pd.to_datetime(time_df['timestamp'], utc=True)`
>   * Compare the two series, get the rows with a timestamp later than 2017-08-01 and remove them by row index
> * Rename columns using `df.rename`
>   * name -> dog_name
>   * text -> status_update
> * Remove rows with invalid names
>   * Find rows containing invalid names: "None" and words starting with lowercase characters
>   * Delete rows using row indices
> * Combine four dog "states" into a single column
>   * Find the indices of rows with a specific dog state, e.g. 'doggo'
>   * Set the "dog_state" of these rows to a value that represents this dog state
>   * Repeat for all states
> * Remove doggo, floofer, pupper, puppo columns
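
> Most of the steps above can be sketched in one function. `clean_archive` is a hypothetical helper and the sample rows are made up. The date filter compares against a single timezone-aware cutoff instead of a second, constant-filled dataframe, which is a simplification of the approach described above. Removing rows with inaccurate ratings is left out because it depends on texts identified by manual inspection.

```python
import pandas as pd

def clean_archive(df):
    """Hypothetical helper applying the listed cleaning steps to a copy."""
    # Retweets carry a non-null retweeted_status_id
    df = df[df['retweeted_status_id'].isnull()].copy()
    df = df.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id',
                          'source', 'expanded_urls'], errors='ignore')

    # IDs are labels, not quantities, so store them as strings
    df['tweet_id'] = df['tweet_id'].astype(str)
    df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)

    # Keep tweets up to the cutoff date
    cutoff = pd.Timestamp('2017-08-01', tz='UTC')
    df = df[df['timestamp'] <= cutoff]

    df = df.rename(columns={'name': 'dog_name', 'text': 'status_update'})

    # Invalid names are "None" or words starting with a lowercase letter
    invalid = (df['dog_name'] == 'None') | df['dog_name'].str.match(r'^[a-z]', na=False)
    df = df[~invalid].copy()

    # Collapse the four state columns into one; if a row has several
    # states, the last one in this list wins
    states = ['doggo', 'floofer', 'pupper', 'puppo']
    df['dog_state'] = None
    for state in states:
        df.loc[df[state] == state, 'dog_state'] = state
    return df.drop(columns=states)

# Hypothetical sample rows
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'timestamp': ['2017-07-01 00:00:00 +0000', '2017-09-01 00:00:00 +0000',
                  '2016-01-01 00:00:00 +0000', '2016-02-01 00:00:00 +0000'],
    'retweeted_status_id': [None, None, 5.0, None],
    'name': ['Charlie', 'Bella', 'Max', 'a'],
    'text': ['t1', 't2', 't3', 't4'],
    'doggo': ['doggo', 'None', 'None', 'None'],
    'floofer': ['None'] * 4,
    'pupper': ['None'] * 4,
    'puppo': ['None'] * 4,
})
cleaned = clean_archive(sample)
print(cleaned)
```

> Rows that mention more than one state would need a separate decision, for example a combined label, which this sketch does not attempt.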

> **Image Predictions File**
> * Change data type of tweet_id to object
> * Fix formatting issues of p1, p2 and p3. Final format: starts with a lowercase character and is separated by "_"
> * Remove rows where p1_dog, p2_dog or p3_dog is true but the corresponding prediction is not a dog breed
> * Rename p1, p2, p3, p1_conf, p2_conf, p3_conf, p1_dog, p2_dog, p3_dog
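
> A possible shape for these steps is shown below, assuming a set of known breeds taken from the cleaned breed labels. The sample rows cover only p1 and p2 to stay short, and the new column names are hypothetical because this report does not state which names were chosen.

```python
import pandas as pd

# Hypothetical sample rows
df_images = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'p1': ['Welsh_springer_spaniel', 'black-and-tan_coonhound', 'paper_towel'],
    'p1_dog': [True, True, True],
    'p2': ['collie', 'Redbone', 'seat_belt'],
    'p2_dog': [True, True, False],
})
# Assumed to come from the cleaned breed column of the labels table
known_breeds = {'welsh_springer_spaniel', 'black_and_tan_coonhound',
                'collie', 'redbone'}

df_images['tweet_id'] = df_images['tweet_id'].astype(str)

for n in (1, 2):  # the full table also has p3
    name_col, dog_col = f'p{n}', f'p{n}_dog'
    # Lowercase and use "_" as the only separator
    df_images[name_col] = (
        df_images[name_col].str.lower().str.replace('-', '_', regex=False)
    )
    # Drop rows flagged as a dog whose label is not a known breed
    mismatch = df_images[dog_col] & ~df_images[name_col].isin(known_breeds)
    df_images = df_images[~mismatch].copy()

# Hypothetical descriptive names
df_images = df_images.rename(columns={
    'p1': 'prediction_1', 'p1_dog': 'prediction_1_is_dog',
    'p2': 'prediction_2', 'p2_dog': 'prediction_2_is_dog',
})

print(df_images)
```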

> **Additional Data via the Twitter API**
> * Drop unnecessary columns
> * Change the data type of id to string
> * Filter out retweets by removing rows with non-null retweeted_status
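
> These three steps might look like the following, assuming the JSON lines were loaded into a DataFrame. The sample rows and the exact list of dropped columns are illustrative. Dropping `id_str` and `retweeted_status` after use is an assumption rather than something stated above.

```python
import pandas as pd

# Hypothetical sample rows from the tweet JSON
df_api = pd.DataFrame({
    'id': [1, 2],
    'id_str': ['1', '2'],
    'lang': ['en', 'en'],
    'retweet_count': [10, 20],
    'favorite_count': [100, 0],
    'retweeted_status': [None, {'id': 99}],
})

# Retweets embed the original tweet under retweeted_status
df_api = df_api[df_api['retweeted_status'].isnull()].copy()

# errors='ignore' tolerates columns that are absent from this sample
unneeded = ['id_str', 'lang', 'retweeted_status', 'truncated',
            'geo', 'coordinates', 'place']
df_api = df_api.drop(columns=unneeded, errors='ignore')

# Store the ID as a string, matching the other tables
df_api['id'] = df_api['id'].astype(str)

print(df_api)
```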