## Data Wrangling and Analysis via Twitter

**Objective:** Wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.

We want to accomplish the following: 
- Gather data from 3 different sources and formats.
- Identify and clean 8 quality issues.
- Identify and clean 2 tidiness issues.
- Analyze data by producing 3 insights and 1 visualization.

## Table of Contents 
- [Introduction](#intro)
- [Gather](#gather)
- [Assess](#assess)
- [Clean](#clean)

<a id='intro'></a>
### Introduction

**Dogtionary:** There are various stages of dog, which will be part of the observed data: doggo, pupper, puppo, and floof(er).

**Data Information:** 

_Twitter Archive_

The WeRateDogs Twitter archive is a file called `twitter-archive-enhanced.csv` that contains basic tweet data for all 5000+ of their tweets. The Twitter archive data has been filtered for tweets with ratings only, bringing the total record count to 2356.
    
_Twitter API_

Additional data can be gathered by anyone using the Twitter API. The queried data is stored in a file called `tweet_json.txt` and includes each tweet's retweet count and favorite count.

_Image Predictions_ 

A neural network classified the breed of dog in each tweet's image. The resulting predictions file is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv


In [52]:
# import all libraries

# Requests
import requests
import os

# Twitter API
import tweepy

# Read files
from bs4 import BeautifulSoup
import glob

# Data analysis
import pandas as pd
import numpy as np

# Secrets
import getpass

<a id='gather'></a>
### Gather

First, we will read the Twitter archive file and identify issues that need to be cleaned.

In [15]:
# load the dataset Twitter Archive from a flat file
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

twitter_archive.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


In [16]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [17]:
twitter_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [32]:
twitter_archive.name.unique()

array(['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', 'None', 'Jax',
       'Zoey', 'Cassie', 'Koda', 'Bruno', 'Ted', 'Stuart', 'Oliver',
       'Jim', 'Zeke', 'Ralphus', 'Canela', 'Gerald', 'Jeffrey', 'such',
       'Maya', 'Mingus', 'Derek', 'Roscoe', 'Waffles', 'Jimbo', 'Maisey',
       'Lilly', 'Earl', 'Lola', 'Kevin', 'Yogi', 'Noah', 'Bella',
       'Grizzwald', 'Rusty', 'Gus', 'Stanley', 'Alfy', 'Koko', 'Rey',
       'Gary', 'a', 'Elliot', 'Louis', 'Jesse', 'Romeo', 'Bailey',
       'Duddles', 'Jack', 'Emmy', 'Steven', 'Beau', 'Snoopy', 'Shadow',
       'Terrance', 'Aja', 'Penny', 'Dante', 'Nelly', 'Ginger', 'Benedict',
       'Venti', 'Goose', 'Nugget', 'Cash', 'Coco', 'Jed', 'Sebastian',
       'Walter', 'Sierra', 'Monkey', 'Harry', 'Kody', 'Lassie', 'Rover',
       'Napolean', 'Dawn', 'Boomer', 'Cody', 'Rumble', 'Clifford',
       'quite', 'Dewey', 'Scout', 'Gizmo', 'Cooper', 'Harold', 'Shikha',
       'Jamesy', 'Lili', 'Sammy', 'Meatball', 'Paisley', 'Albus',
       'Nept

The Twitter archive file has 2356 records and 17 columns.

**Quality issues:**
* `timestamp` should be converted from type `object` to `datetime`.
* Every `rating_denominator` should be 10, but some records have other values (the minimum is 0 and the maximum is 170).
* There are 59 missing `expanded_urls` (2297 of 2356 records are non-null).
* `name` contains invalid values: the string "None" (should be null) and lowercase words such as "a" or "the" (should be capitalized if they are names, otherwise null).
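
The quality checks listed above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the cleaning code used later in the notebook. The helper names (`clean_name`, `parse_timestamp`, `has_standard_denominator`) are hypothetical, and the sketch assumes that lowercase words in `name` are nulled rather than capitalized.

```python
from datetime import datetime

# Placeholder strings that stand in for a missing name (assumption: only "None").
NOT_A_NAME = {"None"}


def clean_name(raw):
    """Return the name, or None for a placeholder or a lowercase word."""
    if not raw or raw in NOT_A_NAME or raw[0].islower():
        return None
    return raw


def parse_timestamp(raw):
    """Parse an archive timestamp such as '2017-08-01 16:23:56 +0000'."""
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z")


def has_standard_denominator(denominator):
    """Flag whether a rating uses the expected denominator of 10."""
    return denominator == 10


# Values taken from the outputs shown above.
names = ["Phineas", "None", "a", "such", "Tilly"]
cleaned = [clean_name(n) for n in names]
print(cleaned)  # ['Phineas', None, None, None, 'Tilly']

ts = parse_timestamp("2017-08-01 16:23:56 +0000")
print(ts.isoformat())  # 2017-08-01T16:23:56+00:00

print(has_standard_denominator(170))  # False
```

On the full DataFrame, the same ideas would likely be applied with `twitter_archive["name"].map(clean_name)` and `pd.to_datetime(twitter_archive["timestamp"])`.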

**Tidiness issues:**
* `doggo`, `floofer`, `pupper`, and `puppo` are 4 separate columns but can be combined into 1 `dog_stage` column.
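
One possible way to combine the four stage columns is sketched below. It assumes each stage column holds its own name when the stage applies and some placeholder (such as "None") otherwise, and that a record may carry more than one stage. The function name `combine_stages` and the sample rows are made up for illustration.

```python
STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Collapse the four stage columns of one record into a single value."""
    # A stage applies when the column contains its own name (assumption).
    stages = [col for col in STAGE_COLUMNS if row[col] == col]
    # Join multiple stages so no information is lost; None means no stage.
    return ", ".join(stages) if stages else None


# Made-up rows standing in for archive records.
rows = [
    {"doggo": "None", "floofer": "None", "pupper": "None", "puppo": "None"},
    {"doggo": "doggo", "floofer": "None", "pupper": "None", "puppo": "None"},
    {"doggo": "doggo", "floofer": "None", "pupper": "pupper", "puppo": "None"},
]
dog_stages = [combine_stages(row) for row in rows]
print(dog_stages)  # [None, 'doggo', 'doggo, pupper']
```

Because the function only indexes the row by column name, it should also work row by row on the DataFrame, for example `twitter_archive.apply(combine_stages, axis=1)`.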

Next, we will use Twitter's API to get each tweet's JSON data and store it in `tweet_json.txt`, with each tweet's JSON data written to its own line.

In [None]:
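consumer_key = getpass.getpass('Consumer key: ')
consumer_secret = getpass.getpass('Consumer secret: ')
access_token = getpass.getpass('Access token: ')
access_token_secret = getpass.getpass('Access token secret: ')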

In [None]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

In [None]:
# Confirm a key was entered without displaying the secret
bool(consumer_key)
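
The one-tweet-per-line storage format described above could look roughly like the sketch below. It does not call the API. The sample dictionaries are made up and stand in for API responses, the helper names `write_tweets` and `read_counts` are hypothetical, and the keys `id`, `retweet_count`, and `favorite_count` are assumed to match the fields in the returned JSON.

```python
import json
import os
import tempfile


def write_tweets(tweets, path):
    """Write each tweet dictionary as one JSON document per line."""
    with open(path, mode="w", encoding="utf-8") as file:
        for tweet in tweets:
            file.write(json.dumps(tweet) + "\n")


def read_counts(path):
    """Read the file back, keeping only the ID and the two counts."""
    records = []
    with open(path, encoding="utf-8") as file:
        for line in file:
            tweet = json.loads(line)
            records.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return records


# Made-up payloads standing in for API responses.
sample_tweets = [
    {"id": 1, "retweet_count": 5, "favorite_count": 20, "lang": "en"},
    {"id": 2, "retweet_count": 0, "favorite_count": 3, "lang": "en"},
]

# Write to a temporary directory so the sketch leaves no file behind in the project.
path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")
write_tweets(sample_tweets, path)
counts = read_counts(path)
print(counts)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(counts)` for merging with the archive on `tweet_id`.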

Finally, we will programmatically download and read the image predictions hosted on Udacity's servers. They will be stored in `image-preds.tsv`.

In [34]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

In [51]:
# Write the downloaded content to a .tsv file
with open("image-preds.tsv", mode='wb') as file:
    file.write(response.content)
    
# Read .tsv file
image_preds = pd.read_csv('image-preds.tsv', sep="\t")

image_preds.head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True


<a id='assess'></a>
### Assess

We will investigate each dataset to identify what needs to be cleaned.