# WeRateDogs Project
## Data Wrangling Internal Report

Author: Rahma Ali

12-Aug-2020

This report summarizes the data wrangling performed on the WeRateDogs Twitter account data. It was produced as part of my second project submission in the Udacity professional track in data analysis.



## Introduction
This project explores data from the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a funny comment about each dog. Each rating consists of a numerator, which is almost always greater than 10, and a denominator, which is usually 10. This unusual rating system is deliberately biased in favor of dogs, who are indeed man's best friend and also make lovely pets! WeRateDogs currently has over 8 million followers and has received international media attention.

## Sources of data
This project uses three datasets from three different sources:

1- `twitter-archive-enhanced.csv`: contains basic tweet data for all 5000+ tweets of the WeRateDogs Twitter account, provided in the course resources.

2- `image_predictions.tsv`: contains a neural network's predictions of the dog breed (or other object, animal, etc.) shown in each WeRateDogs tweet image. The file is hosted on Udacity's server.

3- `tweet-json.txt`: contains retweet and favorite counts for the WeRateDogs account's tweets, obtained from `twitter_api.py`.

## Data wrangling
This section goes through the 3 steps of data wrangling that were performed on the data.

### 1- Gather
In this step, the datasets are gathered from three sources and imported. I will first describe how I gathered each dataset in narrative form, followed by code.

1- `twitter-archive-enhanced.csv` was downloaded directly from the course resources and read into Python using the pandas `read_csv` function.

2- `image_predictions.tsv` was downloaded programmatically from a [URL](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv) on Udacity's server.

3- `tweet-json.txt` was also downloaded directly from the course resources and read programmatically. My Twitter developer account had not been approved by the time of submission, so I could not retrieve this data from the Twitter API myself.
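
The three gathering steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the helper names (`read_archive`, `read_predictions`, `download_file`, `read_tweet_json`) are hypothetical, and it assumes `tweet-json.txt` holds one JSON object per line with `id`, `retweet_count` and `favorite_count` keys. The sample data is made up so the sketch runs offline; `download_file` relies on the third-party `requests` package and is defined but not called here.

```python
import io
import json

import pandas as pd


def read_archive(path_or_buffer):
    """Read the enhanced archive CSV into a dataframe."""
    return pd.read_csv(path_or_buffer)


def read_predictions(path_or_buffer):
    """Read the tab-separated image predictions file."""
    return pd.read_csv(path_or_buffer, sep="\t")


def download_file(url, dest_path):
    """Download a file over HTTP and save it to disk (hypothetical helper)."""
    import requests  # imported lazily so the rest of the sketch runs offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(dest_path, "wb") as handle:
        handle.write(response.content)
    return dest_path


def read_tweet_json(lines):
    """Parse one JSON object per line, keeping only the id and the counts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append(
            {
                "tweet_id": tweet["id"],  # assumed key names
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    return pd.DataFrame(records, columns=["tweet_id", "retweet_count", "favorite_count"])


# Tiny in-memory stand-ins for the real files (made-up values).
sample_csv = io.StringIO("tweet_id,text\n1,This is Bella 12/10\n2,Meet Max 13/10\n")
sample_tsv = io.StringIO("tweet_id\tjpg_url\n1\thttp://example.com/a.jpg\n")
sample_json = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 20}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 31}\n'
)

archive_df = read_archive(sample_csv)
predictions_df = read_predictions(sample_tsv)
tweet_json_df = read_tweet_json(sample_json)
```

With the real files, the same helpers would take file paths (or an open file handle for `read_tweet_json`) instead of the in-memory buffers.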

### 2- Assess
After importing all the needed dataframes, I explored all three both visually and programmatically. The following issues emerged from the data, which I categorized into quality and tidiness issues:

#### 2.1 Quality issues
- Missing dog classification observations are stored as the string `None`
- The 4 dummy dog classification variables (`doggo`, `floofer`, `pupper`, `puppo`) are not mutually exclusive
- `tweet_id` variable is stored as integer instead of string
- `source` variable URLs are surrounded by HTML tags
- `timestamp` variable is stored as string instead of datetime
- `name` variable contains a number of wrong entries (`all`, `some`, etc.)
- `name` variable contains the value `None` to represent missing values
- Data contains retweets and replies, not just original tweets
- Data contains tweets with no images
- `rating_numerator` column contains values that need fixing
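
Several of the issues above can be surfaced with a few programmatic checks. The following is a minimal sketch on made-up rows, not the checks actually run for this report; it assumes missing stages and missing names are both stored as the literal string `None`.

```python
import pandas as pd

# Made-up sample rows that mimic the issues listed above.
archive_df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "name": ["Bella", "all", "None", "some"],
        "doggo": ["None", "doggo", "None", "None"],
        "floofer": ["None", "None", "None", "None"],
        "pupper": ["None", "pupper", "pupper", "None"],
        "puppo": ["None", "None", "None", "None"],
    }
)

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# tweet_id is read as an integer unless told otherwise.
id_is_integer = pd.api.types.is_integer_dtype(archive_df["tweet_id"])

# Entries that are entirely lowercase are likely ordinary words, not names.
suspect_names = archive_df.loc[archive_df["name"].str.islower(), "name"].tolist()

# Count how many stage columns are filled in per row; more than one means overlap.
stages_per_row = (archive_df[stage_cols] != "None").sum(axis=1)
overlapping_ids = archive_df.loc[stages_per_row > 1, "tweet_id"].tolist()

# Missing names stored as the literal string "None" (assumption).
n_missing_names = int((archive_df["name"] == "None").sum())
```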

#### 2.2 Tidiness issues
- Dog classification values are present in column headers
- Image prediction algorithm values are present in the column headers
- Tweet observations are stored in multiple tables
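
To illustrate the second tidiness issue, here is a hedged sketch of moving values that live in column headers into a regular column. The column names (`p1`, `p1_conf`, `p1_dog`, and so on) and the output names (`prediction_rank`, `prediction`, `confidence`, `is_dog`) are assumptions for illustration, and the rows are made up; the actual cleaning in this project may have been done differently.

```python
import pandas as pd

# Made-up wide table: the prediction rank (1, 2) is encoded in the headers.
wide_df = pd.DataFrame(
    {
        "tweet_id": ["1", "2"],
        "p1": ["golden_retriever", "paper_towel"],
        "p1_conf": [0.80, 0.60],
        "p1_dog": [True, False],
        "p2": ["labrador_retriever", "pug"],
        "p2_conf": [0.15, 0.30],
        "p2_dog": [True, True],
    }
)

frames = []
for rank in (1, 2):
    # Take the three columns for this rank and give them rank-free names.
    part = wide_df[["tweet_id", f"p{rank}", f"p{rank}_conf", f"p{rank}_dog"]].copy()
    part.columns = ["tweet_id", "prediction", "confidence", "is_dog"]
    # Store the rank that used to live in the header as a value.
    part.insert(1, "prediction_rank", rank)
    frames.append(part)

# Stack the per-rank pieces into one long table, one row per prediction.
long_df = (
    pd.concat(frames, ignore_index=True)
    .sort_values(["tweet_id", "prediction_rank"])
    .reset_index(drop=True)
)
```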

### 3- Clean
After assessing all the data issues, I started acting on them. The following list describes the actions taken to fix the quality and tidiness issues in the data:
- Store the values held in the column headers of `image_predictions_df` in variables
- Replace `None` values with `NaN` or empty strings in dog classification variables
- Combine the 4 dummy dog classification variables (`doggo`, `floofer`, `pupper`, `puppo`) into one `dog_class` variable and fix rows with multiple dog classifications
- Convert `tweet_id` variable type to string in all dataframes to prepare for data merge
- Remove extra tags in `source` variable
- Fix non-capitalized and erroneous entries in `name` variable
- Merge all three datasets on `tweet_id` and make a clean copy of the dataframe
- Identify and remove retweets and tweets without photos from the master dataframe
- Convert `timestamp` variable type to `datetime`
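
Some of the cleaning actions above can be sketched as follows. This is an illustrative sketch on made-up rows, not the project's actual code. It makes several assumptions: rows with more than one classification are resolved by joining the stages with a comma, the `source` tag is an HTML anchor whose `href` holds the URL, and lowercase or `None` names are treated as missing. The names `counts_df`, `combine_stages` and `master_df` are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the archive and the retweet/favorite counts.
archive_df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-07-31 00:18:03 +0000",
            "2017-07-30 15:58:51 +0000",
        ],
        "source": ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'] * 3,
        "name": ["Bella", "all", "None"],
        "doggo": ["None", "doggo", "None"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "pupper", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)
counts_df = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "retweet_count": [5, 7, 2], "favorite_count": [20, 31, 9]}
)

stage_cols = ["doggo", "floofer", "pupper", "puppo"]
clean_df = archive_df.copy()


def combine_stages(row):
    """Join every stage present in a row; return None when no stage is set."""
    found = [value for value in row if value != "None"]
    return ", ".join(found) if found else None


# Collapse the four dummy columns into a single dog_class column.
clean_df["dog_class"] = clean_df[stage_cols].apply(combine_stages, axis=1)
clean_df = clean_df.drop(columns=stage_cols)

# Keep only the URL inside the anchor tag.
clean_df["source"] = clean_df["source"].str.extract(r'href="([^"]+)"', expand=False)

# Treat lowercase words and the literal "None" as missing names.
bad_name = clean_df["name"].str.islower() | (clean_df["name"] == "None")
clean_df["name"] = clean_df["name"].mask(bad_name)

# Fix column types, then merge on the string key.
clean_df["tweet_id"] = clean_df["tweet_id"].astype(str)
counts_df["tweet_id"] = counts_df["tweet_id"].astype(str)
clean_df["timestamp"] = pd.to_datetime(clean_df["timestamp"], utc=True)

master_df = clean_df.merge(counts_df, on="tweet_id", how="inner")
```

Converting `tweet_id` to string in every dataframe before merging keeps the key types consistent, since the IDs are identifiers rather than quantities.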

## Data storing and analysis
I saved the master dataframe that resulted from the data cleaning phase as `twitter_archive_master.csv`, which I have attached to my project submission. I then explored the data. You will find my insights in the `wrangle_act.html` report, also attached to my project submission.
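
Saving the master dataframe can be sketched as below. This is a minimal sketch with made-up IDs, written to a temporary directory so it leaves no files behind; passing `dtype` when reading the file back is a suggestion of mine, since pandas would otherwise infer `tweet_id` as an integer again.

```python
import os
import tempfile

import pandas as pd

# Made-up master dataframe with string tweet IDs.
master_df = pd.DataFrame(
    {
        "tweet_id": ["100000000000000001", "100000000000000002"],
        "favorite_count": [20, 31],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "twitter_archive_master.csv")
    # index=False avoids writing the row index as an extra column.
    master_df.to_csv(out_path, index=False)
    # Read tweet_id back as a string to keep it an identifier, not a number.
    reloaded_df = pd.read_csv(out_path, dtype={"tweet_id": str})
```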