## Twitter data wrangling report

### Background

Goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

Specific Notes: 
- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

### Dataset Introduction

1. Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

2. Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least.

3. Image Predictions File

Every image in the WeRateDogs Twitter archive was run through a neural network that can classify breeds of dogs. The result is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4, since tweets can have up to four images).

### Data Wrangling Steps

### Step 1: Gathering

- `Gathering data source 1 - image prediction`:
    Downloaded the tweet image predictions file programmatically from its URL.

- `Gathering data source 2 - twitter_archive_enhanced`:
    This file was provided, so it is simply read from the working directory.

- `Gathering data source 3 - tweet_json`:
    I did not connect to the Twitter API. Instead, I used the tweet_json.txt file directly and copied the API extraction code for reference. The same data can, however, be accessed via the API.
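
As a rough illustration of the gathering step, the sketch below shows one way to download a file programmatically and to read tweet data from a text file. It assumes that tweet_json.txt holds one JSON object per line and that each object carries `id`, `retweet_count`, and `favorite_count` keys; the helper names `download_file` and `parse_tweet_json` are hypothetical and not taken from the project code.

```python
import json
import urllib.request


def download_file(url, dest_path):
    """Download `url` and save the raw bytes to `dest_path` (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(dest_path, "wb") as fh:
        fh.write(data)


def parse_tweet_json(lines):
    """Turn line-delimited tweet JSON into a list of flat records.

    Assumes one JSON object per line with `id`, `retweet_count`,
    and `favorite_count` keys.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Made-up sample lines standing in for the contents of tweet_json.txt.
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
records = parse_tweet_json(sample_lines)
```

With the real file, the same function could be fed an open file handle, for example `parse_tweet_json(open("tweet_json.txt", encoding="utf-8"))`, and the resulting list passed to `pandas.DataFrame`.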

### Step 2: Assess

Assessing the datasets involved three steps:
1. Visual assessment (inspecting the data by eye)
2. Programmatic assessment (reviewing the dataset info and the basic descriptive statistics of each column)
3. Documenting the issues found in each dataset
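
The programmatic part of the assessment can be sketched roughly as follows. This is a minimal example on a made-up table, not the project's actual notebook code; the `assess` helper and the toy column values are assumptions for illustration only.

```python
import pandas as pd


def assess(df):
    """Collect a few programmatic assessment summaries for a DataFrame."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "describe": df.describe(),  # basic statistics of numeric columns
    }


# Tiny stand-in table; the real assessment would run on each of the three datasets.
toy_table = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "rating_denominator": [10, 10, 10, 50],
    "name": ["Rex", "Bo", "Bo", None],
})
summary = assess(toy_table)
```

Here the summary would flag one fully duplicated row, one missing name, and a maximum denominator of 50, which are the kinds of findings that feed the issue list.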

#### I found the following issues across the three datasets:

### Data Issue Summary

#### Quality Issues:
`twitter_archive_enhanced`:
- Contains extra rows that tweet_json does not have, which are not useful for further analysis
- Contains unnecessary columns with many missing values, plus retweet rows that should be removed, since retweets are essentially duplicates of the original tweets
- timestamp should be converted to datetime format
- "None" values in dog names should be changed to "unknown"
- Tweets posted after 2017-08-01 should be filtered out
- Incorrect ratings (those whose denominator is not 10) should be removed
- Ratings with decimal values were extracted incorrectly
- Multiple dog stages should be merged, and rows with more than one stage deleted
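
Several of the archive issues above could be handled along the lines of the sketch below. It is not the project's actual cleaning code: the column names `retweeted_status_id`, `timestamp`, `text`, `name`, `rating_numerator`, and `rating_denominator` are assumed, as is the convention that missing dog names are stored as the string "None". The regular expression is one possible way to keep decimal ratings intact and only looks at the first rating-like pattern in the text.

```python
import pandas as pd

# Matches e.g. "13.5/10" or "12/10"; the decimal part is optional.
RATING_PATTERN = r"(\d+(?:\.\d+)?)/(\d+)"


def clean_archive(archive, cutoff="2017-08-01"):
    """Apply a handful of quality fixes to a copy of the archive."""
    # Retweets carry a non-null retweeted_status_id; keep original tweets only.
    df = archive[archive["retweeted_status_id"].isnull()].copy()

    # Convert timestamps and keep tweets posted on or before the cutoff date.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    end = pd.Timestamp(cutoff, tz="UTC") + pd.Timedelta(days=1)
    df = df[df["timestamp"] < end].copy()

    # Replace placeholder or missing dog names.
    df["name"] = df["name"].replace("None", "unknown").fillna("unknown")

    # Re-extract ratings from the text so decimals such as 13.5/10 survive.
    extracted = df["text"].str.extract(RATING_PATTERN)
    df["rating_numerator"] = extracted[0].astype(float)
    df["rating_denominator"] = extracted[1].astype(float)

    # Keep only ratings out of 10.
    df = df[df["rating_denominator"] == 10]
    return df.reset_index(drop=True)


# Made-up rows covering each case.
toy_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "timestamp": [
        "2017-07-01 10:00:00 +0000",
        "2017-07-02 10:00:00 +0000",
        "2017-07-03 10:00:00 +0000",
        "2017-09-01 10:00:00 +0000",
        "2017-07-05 10:00:00 +0000",
    ],
    "retweeted_status_id": [None, 99.0, None, None, None],
    "text": [
        "This is Bella. 13.5/10",
        "RT 12/10",
        "Happy 4/20 everyone",
        "Late tweet 11/10",
        "Good pupper 12/10",
    ],
    "name": ["Bella", "None", "None", "None", "None"],
})
cleaned_archive = clean_archive(toy_archive)
```

In this toy run, tweet 2 is dropped as a retweet, tweet 3 for its denominator of 20, and tweet 4 for falling after the cutoff, leaving tweets 1 and 5.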

`image_predictions`:
- img_num should be changed to the category data type
- The dataset contains duplicated rows that need to be removed
- A final prediction column should be added, based on the p1, p2, and p3 predictions and their confidence levels
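
One possible reading of these three fixes is sketched below. The exact rule used for the final prediction is not spelled out in this report, so the rule here (take the most confident of the three predictions that is flagged as a dog, otherwise "unknown") is an assumption, as are the column names `p1_conf`, `p1_dog`, and their p2/p3 counterparts and the name `final_prediction`.

```python
import pandas as pd


def pick_breed(row):
    """Assumed rule: the most confident prediction that is flagged as a dog."""
    best_name, best_conf = "unknown", -1.0
    for i in (1, 2, 3):
        if row[f"p{i}_dog"] and row[f"p{i}_conf"] > best_conf:
            best_name, best_conf = row[f"p{i}"], row[f"p{i}_conf"]
    return best_name


def clean_predictions(preds):
    """Deduplicate, add a final prediction column, and retype img_num."""
    df = preds.drop_duplicates().copy()
    df["final_prediction"] = df.apply(pick_breed, axis=1)
    df["img_num"] = df["img_num"].astype("category")
    return df.reset_index(drop=True)


# Made-up rows: the first two are exact duplicates.
toy_predictions = pd.DataFrame({
    "tweet_id": [1, 1, 2, 3],
    "img_num": [1, 1, 2, 1],
    "p1": ["pug", "pug", "seat_belt", "banana"],
    "p1_conf": [0.9, 0.9, 0.6, 0.5],
    "p1_dog": [True, True, False, False],
    "p2": ["bull_mastiff", "bull_mastiff", "golden_retriever", "orange"],
    "p2_conf": [0.05, 0.05, 0.3, 0.2],
    "p2_dog": [True, True, True, False],
    "p3": ["chow", "chow", "Labrador_retriever", "bagel"],
    "p3_conf": [0.01, 0.01, 0.05, 0.1],
    "p3_dog": [True, True, True, False],
})
cleaned_predictions = clean_predictions(toy_predictions)
```

A confidence threshold could be added to `pick_breed` if low-confidence dog predictions should also fall back to "unknown".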


#### Tidiness Issues:
- The dog stage columns are not usable as they are and need to be transformed into another format (using pd.melt)
- The three datasets need to be merged
- Unnecessary columns need to be deleted
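
The tidiness fixes might look roughly like the sketch below. It assumes the archive has four stage columns named `doggo`, `floofer`, `pupper`, and `puppo` that hold either the stage name or the string "None", and that all three tables share a `tweet_id` key; the helper names and the choice of inner joins are illustrative assumptions rather than the project's confirmed approach.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def tidy_stages(archive):
    """Collapse the four stage columns into one `stage` column."""
    id_cols = [c for c in archive.columns if c not in STAGES]
    long = pd.melt(archive, id_vars=id_cols, value_vars=STAGES,
                   var_name="stage_column", value_name="stage")
    named = long[long["stage"] != "None"]

    # Tweets that mention more than one stage are dropped entirely.
    counts = named.groupby("tweet_id")["stage"].count()
    multi_ids = counts[counts > 1].index

    single = named[~named["tweet_id"].isin(multi_ids)][["tweet_id", "stage"]]
    base = archive.loc[~archive["tweet_id"].isin(multi_ids), id_cols]
    tidy = base.merge(single, on="tweet_id", how="left")
    tidy["stage"] = tidy["stage"].fillna("none")  # tweets without any stage
    return tidy


def merge_all(archive, tweet_counts, predictions):
    """Inner-join the three tables so only tweets present in all remain."""
    merged = archive.merge(tweet_counts, on="tweet_id", how="inner")
    return merged.merge(predictions, on="tweet_id", how="inner")


# Made-up inputs.
toy_stage_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
toy_counts = pd.DataFrame({
    "tweet_id": [1, 2, 4],
    "retweet_count": [10, 3, 8],
    "favorite_count": [25, 7, 30],
})
toy_breeds = pd.DataFrame({
    "tweet_id": [1, 3],
    "final_prediction": ["pug", "chow"],
})

tidy_archive = tidy_stages(toy_stage_archive)
master = merge_all(tidy_archive, toy_counts, toy_breeds)
```

Inner joins are used here so that tweets without API data or without an image prediction fall away, which matches the goal of keeping only rated tweets that have images.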

### Step 3: Data Cleaning

Cleaning the data involved three steps:
1. Define the issue
2. Write code to clean the data
3. Test to make sure the code works as intended
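
The test step can be as simple as a set of checks run against the cleaned table. The sketch below is a hypothetical example of such checks, assuming the column names `tweet_id`, `timestamp`, `retweeted_status_id`, and `rating_denominator`; it is not the set of tests actually used in this project.

```python
import pandas as pd


def failed_checks(df):
    """Return the names of the checks that the DataFrame does not pass."""
    checks = {
        "no retweets": df["retweeted_status_id"].isnull().all(),
        "denominator is 10": (df["rating_denominator"] == 10).all(),
        "timestamp is datetime": pd.api.types.is_datetime64_any_dtype(df["timestamp"]),
        "unique tweet_id": df["tweet_id"].is_unique,
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up tables: one that passes every check and one that does not.
good_table = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": pd.to_datetime(["2017-07-01", "2017-07-02"]),
    "retweeted_status_id": [None, None],
    "rating_denominator": [10, 10],
})
bad_table = good_table.assign(
    rating_denominator=[10, 50],
    timestamp=["2017-07-01", "2017-07-02"],  # left as plain strings
)

good_result = failed_checks(good_table)
bad_result = failed_checks(bad_table)
```

An empty list means every check passed; any names returned point back to the issue whose cleaning code needs another look.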

For this study, all of the data quality and tidiness issues were addressed and tested successfully.