# **Udacity WeRateDogs Project**

## **1. Introduction**
WeRateDogs is a Twitter account that rates people's dogs. In this project, Twitter API calls will be used to gather tweet information. An additional TSV file hosted on Udacity's server will be downloaded programmatically.

The data will then be assessed and cleaned to produce a tidy, high-quality dataset, which will be analyzed to find trends.

## **2. Data Gathering**
The WeRateDogs Twitter account has provided basic information (tweet text, tweet ID, etc.) for 2,356 tweets. This data is in CSV format and will be downloaded from Udacity's website. An additional TSV file on Udacity's server will also be used. <br>
The tweet IDs will be used to retrieve the retweet and favorite counts for those tweets through Twitter's API.

In [1]:
# importing the necessary libraries and packages
import pandas as pd
import numpy as np
import requests
import os
import matplotlib.pyplot as plt
import seaborn as sns
import tweepy
import json
import sys
import time

In [4]:
# reading the CSV file containing the WeRateDogs archive from the local hard drive

dog = pd.read_csv(r'C:\Users\srini\Downloads\twitter-archive-enhanced.csv')
dog.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


The TSV file on Udacity's server will be downloaded programmatically using the requests package.

In [5]:
# downloading the dog_breed data containing tsv file from the Udacity server

response = requests.get(r'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
response

<Response [200]>

In [6]:
# storing the tsv file into the local hard drive

folder_path = r'C:\Users\srini\OneDrive\Documents\Udacity'
file_name = 'dog_breeds.txt'
with open (os.path.join(folder_path, file_name), 'wb') as file:
    file.write(response.content)

In [7]:
# inputting the tsv file as a dataframe

dog_breed = pd.read_csv(os.path.join(folder_path, file_name), sep= '\t')
dog_breed.head(2)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
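If the copy on disk is not needed, the downloaded bytes can also be parsed directly. The following is a minimal sketch under that assumption; `tsv_bytes_to_frame` is a hypothetical helper name, and the sample bytes are made-up rows that only imitate the first columns of the predictions file.

```python
import io

import pandas as pd


def tsv_bytes_to_frame(content):
    """Parse tab-separated bytes (for example response.content) into a DataFrame."""
    return pd.read_csv(io.BytesIO(content), sep='\t')


# Made-up sample in the same layout as the first three columns of the file
sample = (b"tweet_id\tjpg_url\timg_num\n"
          b"1\thttps://example.com/a.jpg\t1\n"
          b"2\thttps://example.com/b.jpg\t2\n")

frame = tsv_bytes_to_frame(sample)
print(frame.shape)  # (2, 3)
```

With the real download, the call would be `tsv_bytes_to_frame(response.content)`, ideally after checking that the response status is 200.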


The tweet IDs will be used to gather tweet-level information, such as favorite counts, through Tweepy API calls.

In [8]:
# Setting up Twitter API authentication; the credentials are left blank here

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In [28]:
# Using the twitter API to get the json file for each tweet made by WeRateDogs.

# Using the tweet-ids present in the dog dataframe
start_time = time.time()
tweet_id_list = dog.tweet_id

# Lists to store the JSON output and the errors (if any) that might be generated
temp_list= []
my_errors = []

# Using the twitter API to get the tweet information
for i in tweet_id_list:
    try:
        tweet= api.get_status(i, tweet_mode= 'extended', wait_on_rate_limit= True, wait_on_rate_limit_notify = True)
        temp_list.append(tweet._json)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        my_errors.append(sys.exc_info()[0])
        
end_time = time.time()
print('Total Time ', end_time- start_time, ' in s')

Rate limit reached. Sleeping for: 588
Rate limit reached. Sleeping for: 587


Total Time  1998.942358970642  in s


In [38]:
# Counting the errors that were raised during the API calls
my_errors
len(my_errors)

25

So a total of 25 tweets could not be retrieved. These are probably tweets that have since been deleted.
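Because `sys.exc_info()[0]` keeps only the exception class, the list of errors does not say which tweet IDs failed. A minimal sketch of one way to record the ID next to the error follows; `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for `api.get_status` so that the example runs without credentials.

```python
def fetch_all(tweet_ids, fetch_tweet):
    """Call fetch_tweet for every ID, keeping results and failures apart."""
    fetched, failed = [], {}
    for tweet_id in tweet_ids:
        try:
            fetched.append(fetch_tweet(tweet_id))
        except Exception as error:
            # Remember which ID failed and the kind of error it raised
            failed[tweet_id] = type(error).__name__
    return fetched, failed


def fake_fetch(tweet_id):
    """Hypothetical stand-in for api.get_status: ID 2 behaves like a deleted tweet."""
    if tweet_id == 2:
        raise LookupError('No status found with that ID.')
    return {'id': tweet_id, 'retweet_count': 0, 'favorite_count': 0}


fetched, failed = fetch_all([1, 2, 3], fake_fetch)
print(failed)  # {2: 'LookupError'}
```

Keeping the failed IDs makes it possible to retry only those tweets later, or to confirm by hand that they were deleted.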

Next, we will dump these JSON objects into a text file and then load that file into a pandas DataFrame.

In [30]:
# Declaring the file name for storing the json output and dumping the list of json tweets in it
json_file = 'json_file2.txt'

with open(os.path.join(folder_path, json_file), 'w') as outfile:
    json.dump(temp_list, outfile)

In [31]:
# Loading the json text file for putting it into a dataframe format
with open (os.path.join(folder_path, json_file)) as file:
    data = json.load(file)

In [32]:
# making a dataframe from the list of JSON objects
twitter_id = []
my_retweet_count = []
my_fav_count = []

for i in range(len(data)):
    twitter_id.append(data[i]['id'])
    my_retweet_count.append(data[i]['retweet_count'])
    my_fav_count.append(data[i]['favorite_count'])

json_df = pd.DataFrame({'tweet_id': twitter_id, 'retweet_count': my_retweet_count, 'favorite_count': my_fav_count})
json_df.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7755,36465
1,892177421306343426,5747,31427
2,891815181378084864,3802,23670
3,891689557279858688,7921,39803
4,891327558926688256,8548,37993


In [39]:
json_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
tweet_id          2331 non-null int64
retweet_count     2331 non-null int64
favorite_count    2331 non-null int64
dtypes: int64(3)
memory usage: 54.8 KB


## **3. Data Assessing and Cleaning**
We will now explore the data to see whether it has any tidiness or quality issues. <br>

**1. Tidiness Issue: Separate tables (dog and json_df) holding information about the same tweets** <br>
Two separate dataframes, dog and json_df, hold information about the same tweets. They should be merged to give a tidy dataset.

In [40]:
dog_merged = pd.merge(dog, json_df, on='tweet_id', how='left')
dog_merged.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,retweet_count,favorite_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,7755.0,36465.0
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,5747.0,31427.0
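The counts now print as 7755.0 and 36465.0 because the left merge fills the tweets that have no match in json_df with NaN, which turns the integer columns into floats. A small sketch on toy data, not the real archive, shows how the unmatched rows could be listed with `indicator=True` and how the counts could be kept as integers. The nullable `'Int64'` dtype is an assumption about the installed pandas version.

```python
import pandas as pd

# Toy stand-ins for dog and json_df: tweet 2 has no counts, like a deleted tweet
archive = pd.DataFrame({'tweet_id': [1, 2, 3]})
counts = pd.DataFrame({'tweet_id': [1, 3],
                       'retweet_count': [10, 30],
                       'favorite_count': [100, 300]})

# indicator=True adds a '_merge' column saying where each row came from
merged = pd.merge(archive, counts, on='tweet_id', how='left', indicator=True)
missing_ids = merged.loc[merged['_merge'] == 'left_only', 'tweet_id'].tolist()
print(missing_ids)  # [2]

# The NaN forced the counts to float; the nullable integer dtype restores integers
merged['retweet_count'] = merged['retweet_count'].astype('Int64')
merged['favorite_count'] = merged['favorite_count'].astype('Int64')
```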


In [43]:
dog_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 19 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
retweet_count                 23

**2. Tidiness Issue: Dog category spread over separate columns** <br>
The dog categories pupper, puppo, floofer and doggo should be in a single column, since they are all values of the same variable: the category of the dog.

In [63]:
# checking whether any row falls into more than one dog category

temp_series= (dog_merged.doggo!='None').astype('int') + (dog_merged.pupper!='None').astype('int') + \
(dog_merged.puppo!='None').astype('int') + (dog_merged.floofer !='None').astype('int')

temp_series.unique()

array([0, 1, 2], dtype=int64)

In [68]:
# Finding which dogs have 2 categories associated with them
dog_merged[temp_series==2]
(temp_series==2).sum()

14

So there are 14 rows that have 2 dog categories associated with them. 12 of the 14 dogs are both **doggo and pupper**, while 1 is doggo and floofer and 1 is doggo and puppo. :) <br>
We will make a separate category for these 14 dogs called **'doggo and other'**.
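The breakdown of category pairs can be checked by joining the categories present in each row and counting the results. This is a sketch on toy rows rather than the real dog_merged; `stage_cols` and `stage_combo` are hypothetical names, and `'None'` marks an absent category as in the archive.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows standing in for dog_merged
toy = pd.DataFrame({
    'doggo':   ['None', 'doggo', 'doggo', 'doggo', 'doggo'],
    'floofer': ['None', 'None', 'None', 'floofer', 'None'],
    'pupper':  ['pupper', 'None', 'pupper', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None', 'None', 'None'],
})


def stage_combo(row):
    """Join the categories present in one row, e.g. 'doggo and pupper'."""
    present = [col for col in stage_cols if row[col] != 'None']
    return ' and '.join(present) if present else 'none'


combo_counts = toy[stage_cols].apply(stage_combo, axis=1).value_counts()
print(combo_counts)
```

On the real data, the same call on `dog_merged[stage_cols]` would list how many rows carry each single category and each pair.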

In [73]:
# combining the dog categories into another column
temp =dog_merged.melt(id_vars=['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator','rating_denominator', 'name', 'retweet_count', 'favorite_count'], \
       value_vars= [ 'doggo', 'floofer', 'pupper', 'puppo'], var_name='category')
temp.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,retweet_count,favorite_count,category,value
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,7755.0,36465.0,doggo,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,5747.0,31427.0,doggo,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,3802.0,23670.0,doggo,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,7921.0,39803.0,doggo,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,8548.0,37993.0,doggo,
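Note that `melt` stacks the four category columns, so every tweet now appears four times, once per category; this is why the first rows all show `doggo`. The toy sketch below (made-up rows, not the real archive) illustrates the row count and what happens when the `'None'` rows are filtered out afterwards.

```python
import pandas as pd

# Tweet 1 has no category, tweet 2 has one, tweet 3 has two
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['None', 'doggo', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

long = toy.melt(id_vars=['tweet_id'],
                value_vars=['doggo', 'floofer', 'pupper', 'puppo'],
                var_name='category')
kept = long[long['value'] != 'None']

print(len(long), len(kept))      # 12 3
print(sorted(kept['tweet_id']))  # [2, 3, 3]
```

After filtering, the tweet without a category disappears and the tweet with two categories appears twice, so both cases need separate handling if one row per tweet is wanted.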


In [75]:
temp.category.unique()

array(['doggo', 'floofer', 'pupper', 'puppo'], dtype=object)

In [77]:
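# Keeping only the rows where the melted 'value' column holds an actual category.
# drop() expects index or column labels, so the rows are filtered with a boolean mask instead.
temp = temp[temp['value'] != 'None']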

In [71]:
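# Building one 'category' column: the single category where there is one,
# and 'doggo and other' for the 14 rows that have two (temp_series == 2)
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
stages = dog_merged[stage_cols].replace('None', np.nan)
dog_merged['category'] = stages.bfill(axis=1).iloc[:, 0]  # first non-missing category per row
dog_merged.loc[temp_series == 2, 'category'] = 'doggo and other'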

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'retweet_count', 'favorite_count'],
      dtype='object')