# Wrangle & Analyze WeRateDogs Data

<hr>

Real-world data rarely comes clean. Using Python and its libraries, I will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. I will document my wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python.

The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

<p align="center">
  <img src="img/dog-rates-social.jpg" width="600">
</p>

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. It does contain each tweet's text, which I used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have kept only those with ratings (there are 2356).

<p align="center">
  <img src="img/data.png" width="1300">
</p>

Retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered from Twitter's API, which I will do.

In [1]:
# import libs
import numpy as np
import pandas as pd
import requests
import json
import matplotlib.pyplot as plt
from stop_words import get_stop_words

# pandas settings: show full column contents (None replaces the deprecated -1)
pd.set_option('display.max_colwidth', None)


## Gathering Data

Gather data from various sources and a variety of file formats.

<hr>

* [Enhanced Twitter Archive](#Enhanced-Twitter-Archive)  
* [Image Predictions File](#Image-Predictions-File)  
* [Twitter API File](#Twitter-API-File)

### Enhanced Twitter Archive

This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

In [None]:
# load twitter archive
twitter_arch = pd.read_csv("data/twitter-archive-enhanced.csv")
# use tweet id column as index
twitter_arch.set_index("tweet_id", inplace = True)
# display few lines
twitter_arch.head(3)

### Image Predictions File

This file contains the top three dog-breed predictions for each dog image from the WeRateDogs archive. The table contains the top three predictions, the tweet ID, the image URL, and the number of the image that produced the most confident prediction (numbered 1 to 4, since tweets can have up to four images).

In [None]:
# download the file with the image predictions
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
# fail early instead of saving an error page to disk
response.raise_for_status()
with open('data/image-predictions.tsv', 'wb') as file:
    file.write(response.content)

# load image predictions
image_pred = pd.read_csv('data/image-predictions.tsv', sep = '\t')
# use tweet id column as index
image_pred.set_index("tweet_id", inplace = True)
# display few lines
image_pred.head(3)

### Twitter API File

Retweet count and favorite count are two notable columns missing from the Twitter archive. Fortunately, this additional data can be gathered from Twitter's API. The Twitter API file contains the tweet ID, favorite count, and retweet count.

In [None]:
# load twitter API data
with open('data/tweet-json.txt') as f:
    twitter_api = pd.DataFrame((json.loads(line) for line in f), columns = ['id', 'favorite_count', 'retweet_count'])

# change column names
twitter_api.columns = ['tweet_id', 'favorites', 'retweets']
# use tweet id column as index
twitter_api.set_index('tweet_id', inplace = True)
# display few lines
twitter_api.head(3)

## Assessing Data

Assess data visually and programmatically for quality and tidiness issues using pandas.

<hr>

* [Twitter Archive Data](#Twitter-Archive-Data)  
* [Image Predictions](#Image-Predictions)  
* [Twitter API Data](#Twitter-API-Data)

### Twitter Archive Data

In [None]:
# display sample of data
twitter_arch.sample(3)

In [None]:
# print a summary of the DataFrame
twitter_arch.info()

In [None]:
# check if ids are unique
twitter_arch.index.is_unique

In [None]:
# check number of replies
np.isfinite(twitter_arch.in_reply_to_status_id).sum()

In [None]:
# check values in sources
twitter_arch.source.value_counts()

In [None]:
# check quality of text
twitter_arch.text.sample(3)

In [None]:
# check number of retweets
np.isfinite(twitter_arch.retweeted_status_id).sum()

In [None]:
# check expanded urls
twitter_arch[~twitter_arch.expanded_urls.str.startswith(('https://twitter.com','http://twitter.com', 'https://vine.co'), na=False)].sample(3)

In [None]:
# check for two or more urls in the expanded urls
twitter_arch[twitter_arch.expanded_urls.str.contains(',', na=False)].expanded_urls.count()

In [None]:
# check rating denominator
twitter_arch.rating_denominator.value_counts()

In [None]:
# check rating numerator
twitter_arch.rating_numerator.value_counts()

In [None]:
# check for any float ratings in the text column
twitter_arch[twitter_arch.text.str.contains(r'\d+\.\d+\/\d+')][['text','rating_denominator', 'rating_numerator']]
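
The check above flags tweets whose rating has a decimal numerator. A minimal sketch of how such ratings could be re-extracted, assuming a float-aware pattern and a toy DataFrame (the sample tweets and the names `RATING_PATTERN` and `sample` are illustrative, not taken from the archive):

```python
import pandas as pd

# Toy tweets standing in for the archive's text column (illustrative only)
sample = pd.DataFrame({
    "text": [
        "This is Bella. She hopes her smile made you smile. 13.5/10",
        "Meet Logan. He is a good boy. 12/10",
    ]
})

# The numerator may contain a decimal part; the denominator is an integer
RATING_PATTERN = r"(\d+(?:\.\d+)?)/(\d+)"

# str.extract returns one column per capture group (first match per row)
ratings = sample["text"].str.extract(RATING_PATTERN)
sample["rating_numerator"] = ratings[0].astype(float)
sample["rating_denominator"] = ratings[1].astype(int)

print(sample[["rating_numerator", "rating_denominator"]])
```

An integer-only pattern such as `(\d+)/(\d+)` would read the first tweet as 5/10, which is the kind of misread this check is looking for.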

In [None]:
# check name of dog
twitter_arch.name.value_counts()

In [None]:
# check for stop words in dog name
# https://stackoverflow.com/a/5486535/7382214

stop_words = set(get_stop_words('en'))

count=0
for word in twitter_arch.name:
    if word.lower() in stop_words:
        count += 1
print('Rows with stop words:', count)
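
The loop above can also be written as a vectorized pandas operation. A minimal sketch, assuming a small hand-picked stop-word set in place of the `stop_words` package (the set and the sample names are illustrative):

```python
import pandas as pd

# Illustrative stand-ins: a tiny stop-word set and a few names
stop_words = {"a", "an", "the", "such", "very"}
names = pd.Series(["Charlie", "a", "None", "the", "Lucy", "An"])

# Lowercase each name, then test membership in the stop-word set
is_stop_word = names.str.lower().isin(stop_words)

print("Rows with stop words:", is_stop_word.sum())
print(names[is_stop_word].value_counts())
```

Keeping the boolean mask around also makes it easy to inspect which names were flagged, not just how many.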

In [None]:
# check if dogs have more than one category assigned
categories = ['doggo', 'floofer', 'pupper', 'puppo']

# flag each stage column: 0 if 'None', 1 otherwise
for category in categories:
    twitter_arch[category] = twitter_arch[category].apply(lambda x: 0 if x=='None' else 1)

# select the stage columns by name rather than by position
twitter_arch['number_categories'] = twitter_arch[categories].sum(axis=1)

In [None]:
# dogs categories
twitter_arch['number_categories'].value_counts()

#### Quality & Tidiness Issues

- in_reply_to_status_id and in_reply_to_user_id have an incorrect datatype - they are floats, should be ints;
- some of the gathered tweets are replies;
- the timestamp has an incorrect datatype - it is an object, should be DateTime;
- source is an HTML element - its text should be extracted;
- some rows in the text column begin with 'RT @dog_rates:';
- some rows in the text column have leading and/or trailing whitespace;
- some of the gathered tweets are retweets;
- retweeted_status_id and retweeted_status_user_id have an incorrect datatype - they are floats, should be ints;
- the retweeted timestamp has an incorrect datatype - it is an object, should be DateTime;
- 59 expanded urls are missing;
- 639 expanded urls contain more than one url address;
- the denominator of some ratings is not 10;
- the numerator of most ratings is greater than 10;
- float ratings have been read incorrectly from the tweet text;
- 'None' in the name column should be converted to NaN;
- the name column contains stop words;
- the dog 'stage' classification (doggo, floofer, pupper or puppo) should be one column;
- some dogs have more than one category assigned;
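
Several of the issues above map onto short pandas operations. The following is a sketch on a two-row toy frame, not the cleaning code itself; the sample values and the `stage` column name are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the archive's layout (values are illustrative)
df = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    ],
    "name": ["Phineas", "None"],
    "doggo": ["None", "doggo"],
    "floofer": ["None", "None"],
    "pupper": ["None", "pupper"],
    "puppo": ["None", "None"],
})

# object -> timezone-aware datetime
df["timestamp"] = pd.to_datetime(df["timestamp"])

# keep only the text between the opening and closing anchor tags
df["source"] = df["source"].str.extract(r">([^<]+)<", expand=False)

# the string 'None' marks a missing name
df["name"] = df["name"].replace("None", np.nan)

# collapse the four stage columns into one; multiple stages are joined with '-'
stages = ["doggo", "floofer", "pupper", "puppo"]
df["stage"] = (
    df[stages]
    .replace("None", "")
    .apply(lambda row: "-".join(s for s in row if s), axis=1)
    .replace("", np.nan)
)
df = df.drop(columns=stages)

print(df)
```

Joining multiple stages with a separator is only one option; rows with more than one stage could instead be reviewed by hand.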

### Image Predictions

In [None]:
# display sample of data
image_pred.sample(3)

In [None]:
# print a summary of the DataFrame
image_pred.info()

In [None]:
# check if ids are unique
image_pred.index.is_unique

In [None]:
# check jpg_url
image_pred[~image_pred.jpg_url.str.endswith(('.jpg', '.png'), na=False)].jpg_url.count()

In [None]:
# check image number
image_pred.img_num.value_counts()

In [None]:
# check 1st prediction
image_pred.p1.sample(3)

In [None]:
# check dog predictions: sum() counts the True values, while count() only counts non-null rows
image_pred.p1_dog.sum()

#### Quality & Tidiness Issues

- the dataset has 2075 entries, while twitter archive dataset has 2356 entries;
- column names are confusing and do not give much information about the content;
- dog breed names contain underscores and inconsistent capitalization;
- top predictions where p1_dog is False are not dog breeds;
- dataset should be merged with the twitter archive dataset;
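
The naming issues above could be addressed roughly as follows. This is a sketch on toy rows; the new column names (`breed_1`, `confidence_1`, `is_dog_1`) are hypothetical choices, not names used elsewhere in this notebook:

```python
import pandas as pd

# Toy rows mimicking the first-prediction columns (values are illustrative)
preds = pd.DataFrame({
    "p1": ["Welsh_springer_spaniel", "redbone", "paper_towel"],
    "p1_conf": [0.47, 0.51, 0.17],
    "p1_dog": [True, True, False],
})

# Give the columns self-explanatory names
preds = preds.rename(columns={
    "p1": "breed_1",
    "p1_conf": "confidence_1",
    "p1_dog": "is_dog_1",
})

# Normalize formatting: underscores -> spaces, consistent title case
preds["breed_1"] = preds["breed_1"].str.replace("_", " ").str.title()

# Number of rows whose top prediction is flagged as a dog
print("Top predictions that are dogs:", preds["is_dog_1"].sum())
print(preds)
```

The same renaming and formatting would apply to the second and third prediction columns.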

### Twitter API Data

In [None]:
# display sample of data
twitter_api.sample(3)

In [None]:
# print a summary of the DataFrame
twitter_api.info()

In [None]:
# check if ids are unique
twitter_api.index.is_unique

#### Quality & Tidiness Issues

- twitter archive dataset has 2356 entries, while twitter API data has 2354;
- dataset should be merged with the twitter archive dataset;
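
Both issue lists call for merging with the archive. A minimal sketch of an index-based join, assuming toy frames that share a `tweet_id` index (the IDs and values are made up for illustration):

```python
import pandas as pd

# Toy frames indexed by tweet_id (IDs and values are illustrative)
archive = pd.DataFrame(
    {"rating_numerator": [13, 12, 11]},
    index=pd.Index([1, 2, 3], name="tweet_id"),
)
api = pd.DataFrame(
    {"favorites": [100, 200], "retweets": [10, 20]},
    index=pd.Index([1, 2], name="tweet_id"),
)
images = pd.DataFrame(
    {"p1": ["pug", "chow"]},
    index=pd.Index([2, 3], name="tweet_id"),
)

# Inner joins keep only tweets present in all three sources
merged = archive.join(api, how="inner").join(images, how="inner")

print(merged)
```

With `how="left"` instead, every archive row would be kept and the missing counts or predictions would appear as NaN.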

## Cleaning Data

Using pandas, clean the quality and tidiness issues identified in the [Assessing Data](#Assessing-Data) section.

<hr>

* [Twitter Archive Data](#Twitter-Archive-Data)  
* [Image Predictions](#Image-Predictions)  
* [Twitter API Data](#Twitter-API-Data)

## Analyze Data

Analyze and visualize data using matplotlib.

<hr>

* [Twitter Archive Data](#Twitter-Archive-Data)  
* [Image Predictions](#Image-Predictions)  
* [Twitter API Data](#Twitter-API-Data)