# Data Wrangling

## Dataset - WeRateDogs&trade; Twitter Archive

***By: Kartik Nanduri***<br>
**Date: 21st Nov, 2018.**

## Let's Gather

In [1]:
# importing all the necessary libraries
import os
import pandas as pd
import requests as req

***<span style="color: red">Important! Uncomment the following lines to run the notebook without errors.</span>***

In [2]:
# resetting the folder structure.
#os.rename('dataset/twitter-archive-enhanced.csv', 'twitter-archive-enhanced.csv')
#import shutil
#shutil.rmtree('dataset')

***<span style="color: green">Important! Once done, please comment them out again.</span>***

1. [x] **The file given at hand `twitter-archive-enhanced.csv`**

In [3]:
# all the required files for this project are in the list files_list
files_list = ['twitter-archive-enhanced.csv', 'image-predictions.tsv', 'tweet_json.txt']

In [4]:
# reading the twitter archive file
archive = pd.read_csv(files_list[0])

# checking the dimensions (rows, columns) of the archive file
archive.shape

(2356, 17)

2. [x] **Fetching the data from url and saving it to local drive - `image-predictions.tsv`**

In [5]:
# reading the file from internet using the requests library
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
res = req.get(url)

with open(url.split('/')[-1], mode = "wb") as op_file:
    op_file.write(res.content)

In [6]:
# checking if we fetched the data the right way
img_pre_test = pd.read_csv(files_list[1], delimiter = "\t", encoding = 'utf-8')
img_pre_test.sample(2)

# we did it the right way, Yay! it worked.

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
282,671134062904504320,https://pbs.twimg.com/media/CVBY3e7XIAAAE4Y.jpg,1,Shih-Tzu,0.18038,True,golden_retriever,0.180194,True,Labrador_retriever,0.173656,True
78,667443425659232256,https://pbs.twimg.com/media/CUM8QZwW4AAVsBl.jpg,1,goose,0.980815,False,drake,0.006918,False,hen,0.005255,False
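
The download step above derives the local file name from the URL and writes the raw bytes to disk. A minimal sketch of that step as reusable helpers follows; `filename_from_url` and `save_content` are hypothetical names introduced here, not part of the notebook. It assumes the response has already been checked, for example with `res.raise_for_status()` from `requests`.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return os.path.basename(urlparse(url).path)


def save_content(content, url, folder="."):
    """Write raw bytes to `folder`, named after the URL's last path component."""
    path = os.path.join(folder, filename_from_url(url))
    with open(path, mode="wb") as op_file:
        op_file.write(content)
    return path
```

With the objects from the cell above, this would be called as `save_content(res.content, url)`.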


3. [x] **Getting data from Twitter&trade;**

In [7]:
# importing all the necessary libraries for accessing Twitter via API
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

In [46]:
# setting up all the necessary placeholders for API
consumer_key = 'xxx.xxx.xxx.xxx'
consumer_secret = 'xxx.xxx.xxx.xxx'
access_token = 'xxx.xxx.xxx.xxx'
access_secret = 'xxx.xxx.xxx.xxx'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth_handler = auth,
                 parser = tweepy.parsers.JSONParser(),
                 wait_on_rate_limit = True, 
                 wait_on_rate_limit_notify = True)

In [9]:
def fetch_and_save(ids, api_ins, one_id = None):
    '''
    Fetch tweet data for the ids in `ids`, or for a single id if `one_id` is given.
    ids (List Object): a list of tweet ids
    api_ins (Tweepy Object): API object instance, used to query Twitter for data
    one_id (int): use when you want to query only one tweet
    Returns: in list mode, failed_ids (List Object), so that this function can be called again on those ids;
             in single-id mode, a one-element list holding either the tweet record (dict) or the failed id (int)
    '''
    new_file_name = ''; failed_ids = []; tweet_df = []
    
    # checking if file exists
    if os.path.exists(files_list[2]):
        temp = [s for s in os.listdir() if "tweet_json" in s]
        new_file_name = files_list[2].split('.')[0] + "_" + str(len(temp)) + ".txt"
    else:
        new_file_name = files_list[2]
    
    # querying a list of ids
    if one_id is None:
        with open(new_file_name, mode = 'w') as outfile:
            for one_id in ids:
                try:
                    content = api_ins.get_status(one_id, tweet_mode='extended')
                    json.dump(content, outfile)
                    outfile.write('\n')
                
                except Exception as e:
                    print("Error for: " + str(one_id) + " - " + str(e))
                    failed_ids.append(one_id)
    
    # querying a single id
    else:
        try:
            content = api_ins.get_status(one_id, include_entities = True)
            favorites = content['favorite_count']
            retweets = content['retweet_count']
            
            tweet_df.append({'tweet_id': int(one_id),
                        'favorites': int(favorites),
                        'retweets': int(retweets)})
            
            return tweet_df
                           
        except Exception as e:
            print("Error for: " + str(one_id) + " - " + str(e))
            failed_ids.append(one_id)

    return failed_ids
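
The file-naming branch at the top of `fetch_and_save` keeps each batch from overwriting the previous one. A minimal sketch of that logic as a pure function, which is easier to check in isolation; `next_output_name` is a hypothetical helper, and it uses `os.path.splitext` instead of splitting on the first `.`.

```python
import os


def next_output_name(base_name, existing_files):
    """Pick an output file name that does not overwrite earlier batches."""
    if base_name not in existing_files:
        return base_name
    stem, ext = os.path.splitext(base_name)
    # count the files written by earlier batches, e.g. tweet_json.txt, tweet_json_1.txt
    earlier = [name for name in existing_files if stem in name]
    return "{}_{}{}".format(stem, len(earlier), ext)


print(next_output_name('tweet_json.txt', []))
# tweet_json.txt
print(next_output_name('tweet_json.txt', ['tweet_json.txt', 'tweet_json_1.txt']))
# tweet_json_2.txt
```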

In [10]:
# passing the list of ids to the function fetch_and_save(), but in batches
# given that we can make 900 requests per 15-minute window, let's break our ids into three sets
tweet_ids = archive['tweet_id'].tolist()

# set_one, two and three
set_one = tweet_ids[0:900]; set_two = tweet_ids[900:1800]; set_three = tweet_ids[1800:]

# checking the lengths so that we send 900 ids/requests.
print(len(set_one), len(set_two), len(set_three), len(set_one)+len(set_two)+len(set_three))
print(len(set_one)+len(set_two)+len(set_three) == len(tweet_ids))

900 900 556 2356
True
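
The three hand-written slices above could also be produced by a small generator, which adapts if the number of ids or the rate limit changes. This is a sketch only; `chunks` is a hypothetical helper and the id list is a stand-in for `archive['tweet_id'].tolist()`.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


tweet_ids = list(range(2356))  # stand-in for archive['tweet_id'].tolist()
batches = list(chunks(tweet_ids, 900))
print([len(batch) for batch in batches])
# [900, 900, 556]
```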


In [11]:
# fetching data 1st iteration
# starting the timer
start = timer()

# querying
test_one = fetch_and_save(set_one, api)

# ending the timer
end = timer()

# calculating the runtime for fetch_and_save
print("That took about {} mins.".format(round((end - start)/60, 1)))

Error for: 888202515573088257 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 873697596434513921 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 872668790621863937 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 869988702071779329 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 866816280283807744 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 861769973181624320 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 845459076796616705 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 842892208864923648 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 837012587749474308 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 827228250799742977 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 802247111496568832 - [{'code': 144, 'message': 'No status found with 

In [12]:
# no of erroneous ids
print("we have about {} failed requests.".format(len(test_one)))

we have about 13 failed requests.


In [13]:
# fetching data 2nd iteration
# starting the timer
start = timer()

# querying
test_two = fetch_and_save(set_two, api)

# ending the timer
end = timer()

# calculating the runtime for fetch_and_save
print("That took about {} mins.".format(round((end - start)/60, 1)))

Rate limit reached. Sleeping for: 253


Error for: 758740312047005698 - Failed to send request: ('Connection aborted.', OSError("(10054, 'WSAECONNRESET')"))
Error for: 754011816964026368 - [{'code': 144, 'message': 'No status found with that ID.'}]
That took about 11.2 mins.


In [14]:
# no of erroneous ids
print("we have about {} failed requests.".format(len(test_two)))

we have about 2 failed requests.


In [15]:
# fetching data 3rd iteration
# starting the timer
start = timer()

# querying
test_three = fetch_and_save(set_three, api)

# ending the timer
end = timer()

# calculating the runtime for fetch_and_save
print("That took about {} mins.".format(round((end - start)/60, 1)))

Rate limit reached. Sleeping for: 249


Error for: 676957860086095872 - Failed to send request: ('Connection aborted.', OSError("(10054, 'WSAECONNRESET')"))
That took about 8.7 mins.


In [16]:
# no of erroneous ids
print("we have about {} failed requests.".format(len(test_three)))

we have about 1 failed requests.


In [17]:
# lets save the failed ids into one master list
failed_ids = test_one + test_two + test_three
print("Total failed request are: {}. \n".format(len(failed_ids)))

# ids that failed and the ones that passed
indi_fail = []; success = []

#for each failed id, lets try to fetch status individually.
for failed_id in failed_ids:
    temp = fetch_and_save(ids = None, api_ins = api, one_id = failed_id)
    indi_fail.append(temp[0])

# separating the recovered records (dicts) from the ids that failed again (ints)
success = [x for x in indi_fail if not isinstance(x, (int))]
indi_fail = [x for x in indi_fail if isinstance(x, (int))]

# checking if there is change
print("\nWe were able to retrieve {} records, others failed.".format(len(failed_ids) - len(indi_fail)))

Total failed request are: 16. 

Error for: 888202515573088257 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 873697596434513921 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 872668790621863937 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 869988702071779329 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 866816280283807744 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 861769973181624320 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 845459076796616705 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 842892208864923648 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 837012587749474308 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 827228250799742977 - [{'code': 144, 'message': 'No status found with that ID.'}]
Error for: 802247111496568832 - [{'code': 144, '

In [18]:
success

[{'tweet_id': 758740312047005698, 'favorites': 6081, 'retweets': 1742},
 {'tweet_id': 676957860086095872, 'favorites': 2304, 'retweets': 857}]

4. [x] **Okay, let's combine the successful jsons into one file, called the `tweet_master.txt`**

In [20]:
# combining all successful jsons into one master file
json_1 = pd.read_json('tweet_json.txt', lines = True, encoding = 'utf-8')
json_2 = pd.read_json('tweet_json_1.txt', lines = True, encoding = 'utf-8')
json_3 = pd.read_json('tweet_json_2.txt', lines = True, encoding = 'utf-8')

# total rows that we need to have in our resulting dataframe
print(json_1.shape[0] + json_2.shape[0] + json_3.shape[0])

2340


In [22]:
json_master = pd.concat([json_1, json_2, json_3], ignore_index = True, join = 'outer', sort = True)
json_master.to_json('tweet_master.txt', orient = 'records', lines = True)
json_master.shape

(2340, 32)
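
Because each batch file holds one JSON object per line, the files can also be merged without building a dataframe first. The sketch below uses only the standard library; `merge_json_lines` is a hypothetical helper. Unlike `pd.concat`, it does not align columns across records, it simply copies every valid line.

```python
import json


def merge_json_lines(paths, out_path):
    """Concatenate JSON-lines files into `out_path`; return the record count."""
    count = 0
    with open(out_path, mode="w", encoding="utf-8") as outfile:
        for path in paths:
            with open(path, encoding="utf-8") as infile:
                for line in infile:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    json.loads(line)  # raises ValueError on a malformed record
                    outfile.write(line + "\n")
                    count += 1
    return count
```

A call such as `merge_json_lines(['tweet_json.txt', 'tweet_json_1.txt', 'tweet_json_2.txt'], 'tweet_master.txt')` would return the number of records written, which can be compared with the row count printed above.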

In [23]:
# removing objects that are not required.
del archive, img_pre_test
del json_1, json_2, json_3, json_master
del indi_fail, end, start, test_one, test_two, test_three, set_one, set_two, set_three
del consumer_key, consumer_secret, access_token, access_secret, auth, api

# we are not removing success and files_list, making sure we stick to good programming practices - reusability.

5. [x] **Last thing to do is to tidy up our folder, let's get going.**

In [24]:
# moving all data files under one folder - dataset
# removing the temporary files, that acted as placeholders

# creating the folder
folder = 'dataset'
if not os.path.exists(folder):
    os.mkdir(folder)

# we know that our master datasets for this project are
# 1. twitter-archive-enhanced.csv
# 2. image-predictions.tsv
# 3. tweet_master.txt
# let us move these files

# updating our files_list
files_list[-1] = 'tweet_master.txt'

# moving only required files
for file in files_list:
    if os.path.exists(file):
        os.rename(file, folder+'/'+file)

# removing the intermediate tweet_json*.txt files as they are not required anymore
for file in [s for s in os.listdir() if "tweet_json" in s]:
    if os.path.exists(file):
        os.remove(file)
    
# listing the current directory
os.listdir()

# clean and neat, let's get on with assessing and cleaning

['.git',
 '.ipynb_checkpoints',
 'dataset',
 'error.png',
 'New Text Document.txt',
 'README.md',
 'wrangle_act.ipynb']

In [25]:
# renaming files_list
for i in range(3):
    files_list[i] = folder + '/'+ files_list[i]

6. [x] **Finally, extract `retweet_count` and `favorite_count` from `tweet_master.txt`, saving the result as line-delimited JSON**

In [33]:
# appending success to the master dataset.
tweet_json = pd.read_json(files_list[2], lines = True, encoding = 'utf-8')
tweet_json = tweet_json[['id', 'retweet_count','favorite_count']]
tweet_json.rename(index = str,
                  columns={'id' : 'tweet_id', 'retweet_count': 'retweets','favorite_count': 'favorites'},
                  inplace = True)
tweet_json = pd.concat([tweet_json, pd.DataFrame.from_dict(success)],
                       ignore_index = True, sort = True)
tweet_json.shape

(2342, 3)

In [34]:
# checking if we did it right
print(tweet_json.loc[tweet_json['tweet_id'].isin([758740312047005698, 676957860086095872])])

      favorites  retweets            tweet_id
2340       6081      1742  758740312047005698
2341       2304       857  676957860086095872


In [35]:
# changing the column ordering
tweet_json = tweet_json[['tweet_id', 'retweets', 'favorites']]
tweet_json.sample(3)

Unnamed: 0,tweet_id,retweets,favorites
2009,671882082306625538,1419,3536
1934,673688752737402881,516,1294
847,763167063695355904,3283,0


In [36]:
# saving the dataframe into master file
tweet_json.to_json(files_list[2], orient = 'records', lines = True)
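
The extraction in this step keeps three fields per tweet and renames them. As a plain-Python sketch of the same idea, working directly on JSON lines; `extract_counts` is a hypothetical helper, and the sample line is trimmed to the fields used here plus an illustrative `lang` field.

```python
import json


def extract_counts(json_lines):
    """Pull the id, retweet and favorite counts out of raw tweet JSON lines."""
    rows = []
    for line in json_lines:
        tweet = json.loads(line)
        rows.append({'tweet_id': tweet['id'],
                     'retweets': tweet['retweet_count'],
                     'favorites': tweet['favorite_count']})
    return rows


sample = ['{"id": 892420643555336193, "retweet_count": 8364, "favorite_count": 38186, "lang": "en"}']
print(extract_counts(sample))
# [{'tweet_id': 892420643555336193, 'retweets': 8364, 'favorites': 38186}]
```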

## Summary - Gathering

- Gathering is the first step in data wrangling.
- We successfully gathered data from three different sources, each with a different technique:
    - Data given at hand.
    - A flat file fetched from a server.
    - The Twitter API.

- There are a total of 14 missing data points. We tried different ways of retrieving them, using the API as well as `twurl`, a Ruby-based command-line tool, but they were not to be found, as shown in the highlighted section below.

***<span style="color: #6c6cff">So let's start with assessing the data.</span>***

![error](error.png)

## Assessing

In [37]:
# let's load up the datasets and start assessing them.

archive =  pd.read_csv(files_list[0], encoding = 'utf-8')
img_pre = pd.read_csv(files_list[1], sep = '\t', encoding = 'utf-8')
retweets_fav = pd.read_json(files_list[2], lines = True, encoding =  'utf-8')

### Issues to sort!

In [38]:
# printing out archive - visual assessment
archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [39]:
# Programmatic Assessment 1 - Information
archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [40]:
# Programmatic Assessment 2 - Describe
archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [41]:
# checking for duplicates - tweet_ids
sum(archive.tweet_id.duplicated())

0

1. **`twitter-archive-enhanced.csv`** table

***1 Content Issues:***

**1.1 Visual Assessment:**
- `rating_numerator`: has values such as 1, 3, etc. - **Data Quality Dimension - `Consistency`**.
- `rating_denominator`: has values less than 10; for example, the tweet_id 666287406224695296 has 2 as its value - **Data Quality Dimension - `Consistency`**.
- Articles such as `a`, `an`, `the`, as well as words such as `such` and `quite`, have been used as dog names - **Data Quality Dimension - `Validity`**.
- There are instances where the names of dogs are in lowercase - **Data Quality Dimension - `Consistency`**.
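
The name issues noted above can be checked programmatically. The sketch below assumes that genuine dog names are capitalised, so any value starting with a lowercase letter is worth a second look; `suspicious_names` is a hypothetical helper and the sample list is illustrative.

```python
def suspicious_names(names):
    """Return the distinct names that start with a lowercase letter."""
    return sorted({name for name in names if name[:1].islower()})


sample = ['Phineas', 'Tilly', 'a', 'an', 'the', 'such', 'quite', 'a']
print(suspicious_names(sample))
# ['a', 'an', 'quite', 'such', 'the']
```

On the real data this could be fed with the values of the `name` column.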

**1.2 Programmatic Assessment:**
- `rating_numerator`: has a maximum value of 1776 - **Data Quality Dimension - `Consistency`**.
- `rating_denominator`: has a maximum value of 170 - **Data Quality Dimension - `Consistency`**.
- All in all, this dataset appears to be clean, except for `expanded_urls`, where 59 values are missing - **Data Quality Dimension - `Completeness`**.

***2 Structural Issues:***

**2.1 Visual Assessment:**
- We can see that there are four classes of dogs, `doggo`, `floofer`, `puppo`, `pupper`; these should be part of one unit - `dog_class` - **Data Quality Dimension - `Consistency`**.
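
One way to fold the four stage columns into a single `dog_class` value is sketched below. It assumes a missing stage is stored as an empty string, the text `'None'`, or a non-string value such as NaN; that assumption should be checked against the data. `STAGES` and `dog_class` (as a function) are hypothetical names.

```python
STAGES = ['doggo', 'floofer', 'pupper', 'puppo']


def dog_class(row):
    """Join the stage columns that are set in `row` into a single label."""
    found = []
    for stage in STAGES:
        value = row.get(stage)
        # only non-empty strings other than the placeholder 'None' count as set
        if isinstance(value, str) and value not in ('', 'None'):
            found.append(stage)
    return '-'.join(found) if found else None


print(dog_class({'doggo': 'doggo', 'floofer': 'None', 'pupper': 'None', 'puppo': 'None'}))
# doggo
print(dog_class({'doggo': 'doggo', 'floofer': 'None', 'pupper': 'pupper', 'puppo': 'None'}))
# doggo-pupper
```

Since the function only relies on `.get`, it could be applied row by row with `archive.apply(dog_class, axis=1)`.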

**2.2 Programmatic Assessment:**
- `in_reply_to_status_id`, `retweeted_status_id`, `retweeted_status_user_id`, `in_reply_to_user_id` of type float64 must be converted into int - **Data Quality Dimension - `Validity`**.
- `timestamp`, `retweeted_status_timestamp` of type object must be converted into datetime - **Data Quality Dimension - `Validity`**.
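
Two details are worth keeping in mind for these conversions. A float64 carries only 53 bits of precision, so an 18-digit id that has passed through a float may no longer be exact, and casting it back to int would not restore it; reading the id columns as text (for example through the `dtype` argument of `pd.read_csv`) is one possible way around this. The sketch below illustrates the effect with the standard library, and also parses the timestamp format seen in the archive.

```python
from datetime import datetime

tweet_id = 892420643555336193

# an 18-digit id does not survive a round trip through float
print(int(float(tweet_id)) == tweet_id)
# False

# the same id read as text converts back exactly
print(int(str(tweet_id)) == tweet_id)
# True

# parsing the archive's timestamp format into a timezone-aware datetime
stamp = datetime.strptime('2017-08-01 16:23:56 +0000', '%Y-%m-%d %H:%M:%S %z')
print(stamp.isoformat())
# 2017-08-01T16:23:56+00:00
```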

In [42]:
# assessing img_predictions dataset
img_pre

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [43]:
# Programmatic Assessment - Information
img_pre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [49]:
# checking for duplicates
img_pre[img_pre['jpg_url'].duplicated(keep = False)].sort_values(by = 'jpg_url')[['tweet_id', 'jpg_url']]

Unnamed: 0,tweet_id,jpg_url
480,675354435921575936,https://pbs.twimg.com/ext_tw_video_thumb/67535...
1297,752309394570878976,https://pbs.twimg.com/ext_tw_video_thumb/67535...
1864,842892208864923648,https://pbs.twimg.com/ext_tw_video_thumb/80710...
1641,807106840509214720,https://pbs.twimg.com/ext_tw_video_thumb/80710...
1703,817181837579653120,https://pbs.twimg.com/ext_tw_video_thumb/81596...
1691,815966073409433600,https://pbs.twimg.com/ext_tw_video_thumb/81596...
1705,817423860136083457,https://pbs.twimg.com/ext_tw_video_thumb/81742...
1858,841833993020538882,https://pbs.twimg.com/ext_tw_video_thumb/81742...
1715,819004803107983360,https://pbs.twimg.com/media/C12whDoVEAALRxa.jpg
1718,819015337530290176,https://pbs.twimg.com/media/C12whDoVEAALRxa.jpg
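
The duplicate check above lists rows whose `jpg_url` appears more than once. To see which tweets share each image, the rows can be grouped by URL, as in this sketch; `shared_urls` is a hypothetical helper, and the sample rows are taken from the outputs shown in this notebook.

```python
from collections import defaultdict


def shared_urls(rows):
    """Map each jpg_url used by more than one tweet to its tweet_ids."""
    by_url = defaultdict(list)
    for tweet_id, jpg_url in rows:
        by_url[jpg_url].append(tweet_id)
    return {url: ids for url, ids in by_url.items() if len(ids) > 1}


rows = [(819004803107983360, 'https://pbs.twimg.com/media/C12whDoVEAALRxa.jpg'),
        (819015337530290176, 'https://pbs.twimg.com/media/C12whDoVEAALRxa.jpg'),
        (666020888022790149, 'https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg')]
print(shared_urls(rows))
# {'https://pbs.twimg.com/media/C12whDoVEAALRxa.jpg': [819004803107983360, 819015337530290176]}
```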


2. **`image-predictions.tsv`** table

***1 Content Issues:***

**1.1 Visual Assessment:**
- We have a few dog breeds that are represented in lowercase.

**1.2 Programmatic Assessment:**
- We have about 281 images in total that are missing with respect to our `twitter-archive-enhanced.csv` file - **Data Quality Dimension - `Completeness`**.
- We can see that we have about `66` duplicates, that is, pairs of tweets pointing to the same *`jpg_url`* - **Data Quality Dimension - `Accuracy`**.
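
The figure of 281 follows from the row counts (2356 - 2075). That difference equals the number of missing tweets only if every id in the predictions table also appears in the archive, so a set difference is a safer check. A minimal sketch, with `missing_ids` as a hypothetical helper and small stand-in lists in place of the real columns:

```python
def missing_ids(reference_ids, other_ids):
    """Return ids present in `reference_ids` but absent from `other_ids`."""
    return sorted(set(reference_ids) - set(other_ids))


archive_ids = [1, 2, 3, 4, 5]   # stand-in for archive['tweet_id']
prediction_ids = [2, 4, 5]      # stand-in for img_pre['tweet_id']
print(missing_ids(archive_ids, prediction_ids))
# [1, 3]
```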

***2 Structural Issues:***

**2.1 Visual Assessment:**
- None.

**2.2 Programmatic Assessment:**
- None.

In [45]:
# assessing tweet_master dataset
retweets_fav

Unnamed: 0,favorites,retweets,tweet_id
0,38186,8364,892420643555336193
1,32755,6176,892177421306343426
2,24660,4088,891815181378084864
3,41515,8495,891689557279858688
4,39708,9201,891327558926688256
5,19933,3060,891087950875897856
6,11661,2028,890971913173991426
7,64437,18557,890729181411237888
8,27387,4204,890609185150312448
9,31434,7264,890240255349198849


In [47]:
retweets_fav.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2342 entries, 0 to 2341
Data columns (total 3 columns):
favorites    2342 non-null int64
retweets     2342 non-null int64
tweet_id     2342 non-null int64
dtypes: int64(3)
memory usage: 55.0 KB


3. **`tweet_master.txt`** table

***1 Content Issues:***

**1.1 Visual Assessment:**
- None.

**1.2 Programmatic Assessment:**
- We have about 14 missing records - **Data Quality Dimension - `Completeness`**.

***2 Structural Issues:***

**2.1 Visual Assessment:**
- None.

**2.2 Programmatic Assessment:**
- None.

## Summary - Assessing

- Completed the second step.
- The following are the insights:
    - from the `twitter-archive-enhanced.csv` dataset, the `rating_numerator` and `rating_denominator` need to be fixed.
    - the dataset also represents row values as columns, which needs to be fixed.
    - the dataset also has structural issues such as wrong datatypes assigned to columns.
    - from the `image-predictions.tsv` dataset, there is a consistency issue with the naming of dog breeds.
    - the dataset isn't complete when compared to `twitter-archive-enhanced.csv`; we have about 281 missing tweets.
    - we also have `jpg_url` values that are shared by a pair of tweets.
    - the `tweet_master.txt` dataset has about 14 missing records.
    - the dataset alone holds the information about retweets and favourites - bad form of schema normalization.