# Wrangle Report

## Sourcing

The WeRateDogs data is split over three sources:

1. A Twitter archive (CSV file)
2. Neural network predictions on the tweet images (online TSV file)
3. Tweet information (Twitter API)

As such, it was necessary to read in from all of these locations. The local CSV file is easily read in by pandas. Using requests, we can download the TSV file from the server and store it as a local TSV file, to be read in by pandas.
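
The download-then-read step can be sketched roughly as follows. This is a minimal sketch, not the exact code used for the report: the helper names `download_file` and `read_tsv` are hypothetical, and the URL and file paths are left to the caller.

```python
from pathlib import Path

import pandas as pd
import requests


def download_file(url: str, destination: Path) -> Path:
    """Download `url` and save the raw response bytes to `destination`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    destination.write_bytes(response.content)
    return destination


def read_tsv(path: Path) -> pd.DataFrame:
    """Read a tab-separated file into a DataFrame."""
    return pd.read_csv(path, sep="\t")
```

The local CSV archive needs no download step and can be read directly with `pd.read_csv`.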

The tweet information was collected through the Twitter API using tweepy for Python. Collecting the data for each tweet individually proved to be very slow and exceeded the rate limit for the API. Instead, using `GET statuses/lookup`, we can pull the information in batches of 100 tweets. This reduced the time to download tweets from the predicted 20-30 minutes to around 13 seconds (24 batch requests vs. 2356 individual requests). The API configuration was also set to use a JSON parser, so that the tweets could more easily be stored in the desired format, a JSON .txt file.
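
The batching idea can be illustrated without touching the API. In the sketch below, `fetch_batch` is a hypothetical stand-in for whichever tweepy call wraps `GET statuses/lookup`, and writing one JSON object per line is an assumption about the file layout rather than something the report specifies.

```python
import json
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[int], size: int = 100) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids` holding at most `size` items each."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_all(
    ids: Sequence[int],
    fetch_batch: Callable[[List[int]], List[dict]],
    size: int = 100,
) -> List[dict]:
    """Call `fetch_batch` once per chunk and collect every returned tweet."""
    tweets: List[dict] = []
    for batch in chunked(ids, size):
        tweets.extend(fetch_batch(batch))
    return tweets


def save_tweets(tweets: List[dict], path: str) -> None:
    """Write one JSON object per line (assumed layout of the .txt file)."""
    with open(path, "w", encoding="utf-8") as handle:
        for tweet in tweets:
            handle.write(json.dumps(tweet) + "\n")
```

With 2356 tweet IDs and a batch size of 100, `chunked` yields 24 batches, which matches the request count quoted above.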

## Assessing

In assessing, we looked over the data table by table, both visually and programmatically.

Visual observations include noting that `text` contains a shortened URL, and that a dog named "O'Malley" had his name incorrectly read as just "O".

Programmatic observations include missing tweet data, the presence of retweets and replies, and incorrect data formats.
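
As an illustration, checks of this kind might look like the sketch below. The toy rows and the column names (`in_reply_to_status_id`, `retweeted_status_id`) are assumptions standing in for the real `archive` table.

```python
import pandas as pd

# Toy rows standing in for the archive table (column names are assumptions).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "in_reply_to_status_id": [None, 10.0, None, None],
    "retweeted_status_id": [None, None, 20.0, None],
    "name": ["Phineas", "None", "a", "O"],
})

# Retweets and replies carry a non-null reference to another status.
n_retweets = archive["retweeted_status_id"].notna().sum()
n_replies = archive["in_reply_to_status_id"].notna().sum()

# A timestamp read from CSV is still text, not a datetime.
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(archive["timestamp"])

# Lowercase "names" are usually ordinary words picked up by mistake.
lowercase_names = archive.loc[archive["name"].str.islower(), "name"].tolist()
```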

A summary of the observations acted upon is listed below, split into quality and tidiness issues.

#### Quality

##### `archive` table

- Retweets and replies are included
- Incorrect data format for `timestamp`
- Incorrect rating data
    - Decimal numerators are split
    - Rating is pulled from the first fraction in the text, which may not be the rating
- Ratings should be normalised to a score out of 10
    - Some ratings are for groups of dogs
- Incorrect name data
    - O'Malley
    - Incorrect words taken as names (lowercase)
- Missing data should be NaNs
    - `dog type` and `name`
- Tweet source unintuitive

##### `predictions` table

- No apparent quality issues

##### `tweet_json` table

- Missing tweets (2346 vs 2355), owing to deleted tweets that are no longer available
    - As seen, these tweets are retweets/replies, so they will be dropped
- Retweets and replies are included
- Empty columns


#### Tidiness

- `id_str` is a duplicate of `id` in `tweet_json`
- Two variables in one column (`text` in `archive`): the shortened URL and the tweet text
- One variable (dog type) spread over four columns in the `archive` table
- Retweet count and favourite count from `tweet_json` should be attached to `archive`
- Predictions from `predictions` should be attached to `archive`

## Cleaning

In cleaning, we first removed incorrect data, such as retweets and replies, and missing data (namely empty columns), so as not to run into issues in the later cleaning steps.
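
A minimal sketch of this first step, assuming retweets and replies are flagged by non-null `retweeted_status_id` and `in_reply_to_status_id` columns (assumed names, shown here on toy data):

```python
import pandas as pd

# Toy rows; the column names are assumptions about the archive table.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [None, 10.0, None],
    "retweeted_status_id": [None, None, 20.0],
    "text": ["original", "a reply", "a retweet"],
})

# Keep only original tweets: neither a retweet nor a reply.
is_original = (
    archive["retweeted_status_id"].isna()
    & archive["in_reply_to_status_id"].isna()
)

# Once the retweets and replies are gone, their reference columns are
# entirely empty, so dropping all-NaN columns removes them too.
clean = archive[is_original].dropna(axis="columns", how="all")
```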

We then worked on the tidiness issues, such as splitting the shortened URL from the `text` column and merging the four dog type columns into one. We also merged the desired columns from the `tweet_json` and `predictions` tables into the `archive` table.
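
These tidiness steps could be sketched as below. The stage column names (`doggo`, `floofer`, `pupper`, `puppo`), the `"None"` placeholder string, the `t.co` URL pattern, and the `favorite_count` field name are all assumptions for illustration. Merging `predictions` would follow the same pattern as the `tweet_json` merge.

```python
import pandas as pd

# Toy tables; every column name here is an assumption.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": [
        "This is Bo. 13/10 https://t.co/abc",
        "Meet Sam. 12/10 https://t.co/def",
        "No link here 11/10",
    ],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
tweet_json = pd.DataFrame({
    "id": [1, 2, 3],
    "retweet_count": [5, 7, 9],
    "favorite_count": [50, 70, 90],
})

# 1. Split the trailing shortened URL out of the text column.
url_pattern = r"\s*(https://t\.co/\w+)\s*$"
archive["shortened_url"] = archive["text"].str.extract(url_pattern, expand=False)
archive["text"] = archive["text"].str.replace(url_pattern, "", regex=True)

# 2. Collapse the four stage columns into a single dog_type column.
stage_columns = ["doggo", "floofer", "pupper", "puppo"]
stages = archive[stage_columns].replace("None", "")
dog_type = stages.apply(lambda row: ",".join(v for v in row if v), axis=1)
archive["dog_type"] = dog_type.where(dog_type != "")  # empty string -> NaN
archive = archive.drop(columns=stage_columns)

# 3. Attach the retweet and favourite counts to the archive table.
merged = archive.merge(
    tweet_json[["id", "retweet_count", "favorite_count"]],
    left_on="tweet_id",
    right_on="id",
    how="left",
).drop(columns="id")
```

A left merge keeps every row of `archive`, so tweets with no match in `tweet_json` would show up with missing counts rather than silently disappearing.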

Moving on from this, the remaining quality issues were addressed, generally in the order listed above. Where efficient, the corrections were performed programmatically, such as fixing the missing data. In other cases, it was faster to manually correct one-off issues such as "O'Malley".
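
Two of these fixes are sketched below on toy data: a decimal-aware rating extraction normalised to a score out of 10, and the name clean-up. The regular expression and the output column name `rating_out_of_10` are assumptions. The sketch still takes the first fraction in the text, so tweets where the first fraction is not the rating would need separate handling.

```python
import pandas as pd

# Toy rows; the texts and column names are made up for illustration.
archive = pd.DataFrame({
    "text": [
        "This is Bella. 13.5/10 would pet",
        "Here are 5 pups. 60/50 overall",
        "This is O'Malley. 10/10",
    ],
    "name": ["Bella", "a", "O"],
})

# Capture an optional decimal part so "13.5/10" is not read as "5/10".
rating = archive["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)").astype(float)

# Normalise to a score out of 10 (a 60/50 group rating becomes 12.0).
archive["rating_out_of_10"] = rating[0] * 10 / rating[1]

# Programmatic fix: lowercase words and "None" placeholders become NaN.
is_word = archive["name"].str.match(r"[a-z]", na=False)
is_placeholder = archive["name"].eq("None")
archive["name"] = archive["name"].mask(is_word | is_placeholder)

# Manual one-off fix for the truncated name.
is_omalley = archive["text"].str.contains("O'Malley", regex=False)
archive.loc[is_omalley, "name"] = "O'Malley"
```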