# Wrangling and Analyzing WeRateDogs Twitter Archive
Real-world data rarely comes clean. Using Python and its libraries, I will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. I will document my wrangling efforts in a Jupyter Notebook and showcase them through analyses and visualizations in Python. The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

I will follow the Gather, Assess, and Clean model for wrangling this data. With the clean data, I will see what insights can be drawn from the WeRateDogs Twitter archive.

## Gather

First, let's import most of the libraries we will need.

In [11]:
import pandas as pd
import numpy as np 
import os 
import glob
import json
import requests

Read the Twitter Archive Enhanced CSV file into a DataFrame.

In [70]:
df = pd.read_csv('C:/Users/sethb/OneDrive/Documents/Udacity_Real/Data_Wrangling/wrangling_project/Wrangle-and-Analyze-Data/twitter-archive-enhanced.csv')
df

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,8.924210e+17,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,8.921770e+17,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,8.918150e+17,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,8.916900e+17,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,8.913280e+17,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,6.660490e+17,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,6.660440e+17,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,6.660330e+17,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,6.660290e+17,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
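
The table above shows tweet_id in scientific notation (8.924210e+17), which means pandas parsed it as a float. A float64 holds only about 15 to 17 significant digits, so an 18-digit ID can lose its trailing digits. One possible way to avoid this, assuming the CSV file itself stores the full integer ID, is to ask `read_csv` to keep the column as text. The two-row sample below is a stand-in for the real file, not the author's data load.

```python
import io
import pandas as pd

# Two-row stand-in for twitter-archive-enhanced.csv (assumption: the real
# file stores tweet_id as a full integer, not already rounded).
sample_csv = io.StringIO(
    "tweet_id,rating_numerator,rating_denominator,name\n"
    "892420643555336193,13,10,Phineas\n"
    "892177421306343426,13,10,Tilly\n"
)

# dtype={'tweet_id': str} keeps every digit of the ID instead of parsing it
# as a number.
archive_sample = pd.read_csv(sample_csv, dtype={'tweet_id': str})
print(archive_sample['tweet_id'].tolist())
```

The same `dtype` argument could be passed when reading the real file, which would also make the later conversion of tweet_id to str unnecessary.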


Store the URL of the image predictions file hosted on Udacity's servers.

In [23]:
image_predictions_url = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv']

Make a directory to store the predictions.

In [55]:
folder_name = 'predictions'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

Use the requests and os libraries to download each URL in image_predictions_url and write the response to the predictions folder. The file name is taken from the end of the URL, so the file is saved as 'image-predictions.tsv'.

In [56]:
for url in image_predictions_url:
    response = requests.get(url)
    with open(os.path.join(folder_name, url.split('/')[-1]), mode ='wb') as file:
        file.write(response.content)
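
The loop above writes whatever the server returns, including an error page if the request fails. A minimal sketch of a more defensive version is shown below. The function name `download_file` and its `fetch` parameter are hypothetical and exist only so the function can be exercised without a network connection; `timeout` and `raise_for_status()` are standard parts of the requests library.

```python
import os
import requests


def download_file(url, folder_name, fetch=requests.get):
    """Download url into folder_name and return the path of the saved file.

    fetch is a hypothetical hook (defaults to requests.get) that makes the
    function easy to test offline.
    """
    os.makedirs(folder_name, exist_ok=True)
    response = fetch(url, timeout=30)
    # Raise an exception on 4xx/5xx instead of silently saving an error page.
    response.raise_for_status()
    # Name the file after the last segment of the URL.
    path = os.path.join(folder_name, url.split('/')[-1])
    with open(path, mode='wb') as file:
        file.write(response.content)
    return path
```

With the variables defined in the cells above, a call might look like `download_file(image_predictions_url[0], folder_name)`.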

Read in the tweet_json.txt file provided by Udacity, which holds the tweet data. I first spent five hours trying to gather the tweets the other way (querying Twitter directly), but every request failed and no tweets came back.

In [49]:
df_tweet_json = pd.read_fwf('C:/Users/sethb/OneDrive/Documents/Udacity_Real/Data_Wrangling/wrangling_project/Wrangle-and-Analyze-Data/tweet_json.txt')

In [57]:
df_tweet_json.head()

Unnamed: 0,"{""created_at"":","""Tue",Aug,01,16:23:56,+0000,"2017"",","""id"":","892420643555336193,","""id_str"":",...,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68
0,"{""created_at"":","""Tue",Aug,1,00:17:27,0,"2017"",","""id"":",892177421306343426,"""id_str"":",...,,,,,,,,,,
1,"{""created_at"":","""Mon",Jul,31,00:18:03,0,"2017"",","""id"":",891815181378084864,"""id_str"":",...,,,,,,,,,,
2,"{""created_at"":","""Sun",Jul,30,15:58:51,0,"2017"",","""id"":",891689557279858688,"""id_str"":",...,,,,,,,,,,
3,"{""created_at"":","""Sat",Jul,29,16:00:24,0,"2017"",","""id"":",891327558926688256,"""id_str"":",...,,,,,,,,,,
4,"{""created_at"":","""Sat",Jul,29,00:08:17,0,"2017"",","""id"":",891087950875897856,"""id_str"":",...,,,,,,,,,,


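read_fwf splits each JSON line on whitespace, so df_tweet_json is not usable as is. Instead, parse tweet_json.txt line by line as JSON and build a new DataFrame with only the tweet_id, the retweet_count, and the favorite_count.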

In [52]:
# Parse each line of tweet_json.txt as one JSON object and keep three fields.
columns = ['tweet_id', 'retweet_count', 'favorite_count']
frames = []
with open('tweet_json.txt') as f:
    for line in f:
        status = json.loads(line)
        row = [status['id_str'], status['retweet_count'], status['favorite_count']]
        frames.append(pd.DataFrame([row], columns=columns))
# DataFrame.append was removed in pandas 2.0; a single pd.concat replaces it.
# Without ignore_index=True every row keeps index 0, as in the output below.
df_2 = pd.concat(frames)

In [53]:
df_2

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8853,39467
0,892177421306343426,6514,33819
0,891815181378084864,4328,25461
0,891689557279858688,8964,42908
0,891327558926688256,9774,41048
...,...,...,...
0,666049248165822465,41,111
0,666044226329800704,147,311
0,666033412701032449,47,128
0,666029285002620928,48,132
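
Every row in the output above carries index 0 because each tweet was added as its own one-row DataFrame. A possible alternative, sketched here with a hypothetical helper named `tweets_to_frame`, collects plain dictionaries first and builds the DataFrame once, which yields a clean 0 to n-1 index. The two sample lines reuse values shown above but keep only the three fields of interest; real lines in tweet_json.txt contain many more keys.

```python
import json
import pandas as pd


def tweets_to_frame(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        status = json.loads(line)
        rows.append({
            'tweet_id': status['id_str'],  # keep the ID as a string
            'retweet_count': status['retweet_count'],
            'favorite_count': status['favorite_count'],
        })
    return pd.DataFrame(rows, columns=['tweet_id', 'retweet_count', 'favorite_count'])


# Trimmed stand-ins for two lines of tweet_json.txt.
sample_lines = [
    '{"id_str": "892420643555336193", "retweet_count": 8853, "favorite_count": 39467}',
    '{"id_str": "892177421306343426", "retweet_count": 6514, "favorite_count": 33819}',
]
tweet_counts = tweets_to_frame(sample_lines)
print(tweet_counts)
```

Because an open file is itself an iterable of lines, the helper could be called as `tweets_to_frame(f)` inside a `with open('tweet_json.txt') as f:` block.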


## Assess

#### Quality 
- timestamp is stored as object instead of datetime
- tweet_id is stored as float64 instead of str
- Columns that are not needed for the analysis should be removed
- rating_denominator is 10 in most rows but varies in some
- Some values in the name column are None or are not real names (for example, "a")
- 1399 rows in the name column repeat a value that appears earlier
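
Two of the quality issues above can be checked and addressed with a few lines of pandas. The sketch below uses a small stand-in DataFrame rather than the real archive: the first two rows reuse values shown earlier, and the third row is made up purely to illustrate a denominator other than 10.

```python
import pandas as pd

sample = pd.DataFrame({
    'timestamp': [
        '2017-08-01 16:23:56 +0000',
        '2015-11-16 00:24:50 +0000',
        '2016-01-01 00:00:00 +0000',  # hypothetical row, not from the archive
    ],
    'rating_numerator': [13, 5, 45],
    'rating_denominator': [10, 10, 50],
})

# Convert the timestamp strings to timezone-aware datetimes.
sample['timestamp'] = pd.to_datetime(sample['timestamp'], utc=True)

# List the rows whose denominator differs from the usual 10.
odd_denominators = sample[sample['rating_denominator'] != 10]
print(sample.dtypes)
print(odd_denominators)
```

On the real data, the same filter would show exactly which tweets need a closer look before their ratings are trusted.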




#### Tidiness 
- df_2 (tweet_json.txt) should be merged into df (twitter archive enhanced)
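
The tidiness fix could be sketched as a left merge on tweet_id. This is an illustration on two tiny stand-in frames, not the cleaning step itself, and it assumes both tweet_id columns already share the same dtype (str here). In the notebook, df stores tweet_id as float64 while df_2 stores it as a string, so one of them would have to be converted first.

```python
import pandas as pd

# Stand-ins for df (archive) and df_2 (tweet counts), with tweet_id as str.
archive = pd.DataFrame({
    'tweet_id': ['892420643555336193', '892177421306343426'],
    'name': ['Phineas', 'Tilly'],
})
counts = pd.DataFrame({
    'tweet_id': ['892420643555336193'],
    'retweet_count': [8853],
    'favorite_count': [39467],
})

# how='left' keeps every archive row; tweets without counts get NaN.
merged = archive.merge(counts, on='tweet_id', how='left')
print(merged)
```

A left merge is used so that no archive row is silently dropped; an inner merge would instead keep only the tweets present in both tables.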

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   float64
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [77]:
df.name.duplicated().value_counts()

True     1399
False     957
Name: name, dtype: int64
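
Note that `duplicated()` flags every occurrence of a value after the first one, so the 1399 above counts rows, not distinct names, and the 957 is the number of distinct values. To see which names actually repeat, `value_counts()` is a useful companion. The sketch below runs on a small stand-in Series built from names that appear in the table earlier.

```python
import pandas as pd

names = pd.Series(['Phineas', 'Tilly', 'a', 'a', 'a'], name='name')

# Rows whose value already appeared earlier in the Series.
repeat_rows = names.duplicated().sum()

# How often each value occurs, filtered to values that occur more than once.
name_counts = names.value_counts()
repeated_names = name_counts[name_counts > 1]

print(repeat_rows)
print(repeated_names)
```

Run against `df.name`, this kind of check would reveal whether the repeats come from popular dog names or from placeholder values such as "a".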