# Wrangling and Analyzing WeRateDogs Twitter Archive
Real-world data rarely comes clean. Using Python and its libraries, I will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, and then clean it. I will document my wrangling efforts in a Jupyter Notebook and showcase them through analyses and visualizations in Python.

The dataset that I will be wrangling, analyzing, and visualizing is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10: 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

I will follow the Gather, Assess, and Clean model for wrangling this data. With the clean data, I will then explore what insights the WeRateDogs Twitter archive offers.

## Gather

First, let's import most of the libraries we will need.

In [2]:
import pandas as pd
import numpy as np 
import os 
import glob
import json
import requests

Read the enhanced Twitter archive into a DataFrame.

In [38]:
df = pd.read_csv('C:/Users/sethb/OneDrive/Documents/Udacity_Real/Data_Wrangling/wrangling_project/Wrangle-and-Analyze-Data/twitter-archive-enhanced.csv')
df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,8.92421e+17,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,8.92177e+17,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,8.91815e+17,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,8.9169e+17,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,8.91328e+17,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
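The `tweet_id` column above prints in scientific notation (`8.92421e+17`), which suggests pandas parsed it as a float. A 64-bit float holds only about 15 to 17 significant decimal digits, so 18-digit tweet IDs can be silently rounded, and a rounded ID no longer points at the original tweet. Below is a minimal sketch of one way to avoid this, assuming the CSV stores the full integer IDs. The inline two-row sample and its ID values are made up for illustration and are not taken from the archive.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real archive CSV.
# The IDs are invented 18-digit numbers, not real tweet IDs.
sample_csv = io.StringIO(
    "tweet_id,rating_numerator,rating_denominator\n"
    "123456789012345679,13,10\n"
    "123456789012345681,12,10\n"
)

# Reading the ID column as a string keeps every digit intact.
sample = pd.read_csv(sample_csv, dtype={"tweet_id": str})
print(sample["tweet_id"].tolist())

# Round-tripping an 18-digit ID through float shows the rounding
# that the string dtype avoids.
original_id = 123456789012345679
as_float = float(original_id)
print(int(as_float) == original_id)
```

Keeping IDs as strings also makes later joins on `tweet_id` safer, since no arithmetic is ever done on an identifier.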


Install Tweepy, a Python library for accessing the Twitter API.

In [3]:
pip install tweepy

Note: you may need to restart the kernel to use updated packages.
Collecting tweepy
  Downloading tweepy-3.9.0-py2.py3-none-any.whl (30 kB)
Collecting requests-oauthlib>=0.7.0
  Downloading requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting oauthlib>=3.0.0
  Downloading oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.1.0 requests-oauthlib-1.3.0 tweepy-3.9.0



In [16]:
pip install twython




Create an API object that we can use to gather Twitter data.

In [51]:
import tweepy
from tweepy import OAuthHandler
from timeit import default_timer as timer

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

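# Authenticate with the application keys and the user access tokens
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Return parsed JSON (dicts) and pause automatically when rate limited
api = tweepy.API(auth, parser=tweepy.parsers.JSONParser(), wait_on_rate_limit=True, wait_on_rate_limit_notify=True)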
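The four credentials are left blank in the cell above. One common way to keep them out of a shared notebook is to read them from environment variables. The sketch below is only an illustration: the helper `load_credentials` and the variable names such as `TWITTER_CONSUMER_KEY` are hypothetical and are not part of Tweepy or of this project.

```python
import os

# Hypothetical environment variable names; any naming scheme would work.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_credentials(env=os.environ):
    """Return the four credentials, raising if any is missing or empty."""
    creds = {name: env.get(var, "") for name, var in CREDENTIAL_VARS.items()}
    missing = [CREDENTIAL_VARS[name] for name, value in creds.items() if not value]
    if missing:
        raise RuntimeError("Missing Twitter credentials: " + ", ".join(missing))
    return creds


# Demonstration with a made-up mapping instead of the real environment.
demo_env = {var: "placeholder" for var in CREDENTIAL_VARS.values()}
demo_creds = load_credentials(demo_env)
print(sorted(demo_creds))
```

In the notebook, the returned values would be passed to `OAuthHandler` and `set_access_token` in place of the empty strings. Failing early on an empty key also makes it easier to tell an authentication problem apart from a problem with an individual tweet.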

In [54]:
tweet_ids = df.tweet_id.values
tweet_ids

array([8.92421e+17, 8.92177e+17, 8.91815e+17, ..., 6.66033e+17,
       6.66029e+17, 6.66021e+17])

Query the Twitter API for each tweet ID in the archive and save each tweet's JSON to `tweet_json.txt`, one tweet per line.

In [56]:
count = 0
fail = {}  # maps each failed tweet ID to the error it raised
start = timer()

# Save the JSON of each tweet as one line of tweet_json.txt
with open('tweet_json.txt', 'w') as outfile:
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            outfile.write(json.dumps(tweet) + '\n')
        except tweepy.TweepError as e:
            # Record the failure and move on to the next tweet
            print("Fail")
            fail[tweet_id] = e
end = timer()
print(end - start)  # elapsed time in seconds
print(fail)

1: 8.92421e+17
Fail
2: 8.92177e+17
Fail
3: 8.91815e+17
Fail
4: 8.9169e+17
Fail
5: 8.91328e+17
Fail
6: 8.91088e+17
Fail
7: 8.90972e+17
Fail
8: 8.90729e+17
Fail
9: 8.90609e+17
Fail
10: 8.9024e+17
Fail
11: 8.90007e+17
Fail
12: 8.89881e+17
Fail
13: 8.89665e+17
Fail
14: 8.89639e+17
Fail
15: 8.89531e+17
Fail
16: 8.89279e+17
Fail
17: 8.88917e+17
Fail
18: 8.88805e+17
Fail
19: 8.88555e+17


KeyboardInterrupt: 
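The loop above is written to store one JSON object per line in `tweet_json.txt`. Below is a minimal sketch of how such a file could later be read back into a DataFrame. The helper name `load_tweet_json`, the choice of fields to keep, and the two sample records are assumptions for illustration; the records are made up and written to a temporary file so the sketch does not depend on the API calls succeeding.

```python
import json
import os
import tempfile

import pandas as pd

FIELDS = ["tweet_id", "retweet_count", "favorite_count"]


def load_tweet_json(path):
    """Read one JSON object per line and keep a few fields per tweet."""
    rows = []
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id_str"],  # string form of the ID
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(rows, columns=FIELDS)


# Made-up records standing in for real API responses.
sample_records = [
    {"id_str": "123456789012345679", "retweet_count": 5, "favorite_count": 20},
    {"id_str": "123456789012345681", "retweet_count": 7, "favorite_count": 31},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "tweet_json_sample.txt")
    with open(sample_path, "w", encoding="utf-8") as outfile:
        for record in sample_records:
            outfile.write(json.dumps(record) + "\n")
    tweets = load_tweet_json(sample_path)

print(tweets)
```

Reading `id_str` rather than `id` keeps the identifier as text, which avoids any numeric rounding when the table is later merged with the archive.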