# Project: Wrangling and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [135]:
import pandas as pd
import numpy as np
import tweepy
# import matplotlib.pyplot

df = pd.read_csv(r"C:\Users\User\Documents\Udacity\Project II\twitter-archive-enhanced.csv")
# Displaying the first 5 rows
df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [136]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

col_names = ("tweet_id", "tjpg_url", "timg_num", "tp1", "tp1_conf", "tp1_dog", "tp2", "tp2_conf", "tp2_dog", "tp3", "tp3_conf", "tp3_dog")

In [137]:
import json
import requests
r = requests.get(url, auth=('user', 'pass'))
r.status_code

200

In [138]:
r.headers['content-type']

'text/tab-separated-values; charset=utf-8'

In [139]:
r.encoding

'utf-8'

In [140]:
# Viewing the whole data
r.text

"tweet_id\tjpg_url\timg_num\tp1\tp1_conf\tp1_dog\tp2\tp2_conf\tp2_dog\tp3\tp3_conf\tp3_dog\n666020888022790149\thttps://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg\t1\tWelsh_springer_spaniel\t0.465074\tTrue\tcollie\t0.156665\tTrue\tShetland_sheepdog\t0.0614285\tTrue\n666029285002620928\thttps://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg\t1\tredbone\t0.506826\tTrue\tminiature_pinscher\t0.07419169999999999\tTrue\tRhodesian_ridgeback\t0.07201\tTrue\n666033412701032449\thttps://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg\t1\tGerman_shepherd\t0.596461\tTrue\tmalinois\t0.13858399999999998\tTrue\tbloodhound\t0.11619700000000001\tTrue\n666044226329800704\thttps://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg\t1\tRhodesian_ridgeback\t0.408143\tTrue\tredbone\t0.360687\tTrue\tminiature_pinscher\t0.222752\tTrue\n666049248165822465\thttps://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg\t1\tminiature_pinscher\t0.560311\tTrue\tRottweiler\t0.243682\tTrue\tDoberman\t0.154629\tTrue\n666050758794694657\thttps://pbs.twimg.com/m

As shown above, we can rearrange the data to look desirable for analysis by using the write() method and using the tab separater, i.e., sep="\t

In [141]:
with open('image_predictions.tsv', 'wb') as f:
    f.write(r.content)

In [142]:
df_image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')

# Viewing part of the data
df_image_predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [143]:
# Using the available twitter information we have
# First, importing tweepy
import tweepy

CONSUMER_KEY = '0mS0LYZxlawCGGBEY1abVmQ4o'
CONSUMER_SECRET = 'oyUFYOq3X3Zt3g0bnWBbwRQdGkRyAjuBZmRrgoB8RJwiYSjNtx'
ACCESS_TOKEN = '1412882069877936132-tPjcUYAvsKXd14P9Ib9k2lMREJ5V8e'
ACCESS_SECRET = 'Sf76vI6s4iNxLARbFQxvFGPXDnK2qBSOPIA8ROHbURiIt'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

auth = tweepy.API(auth, wait_on_rate_limit = True)

In [144]:
for tweet_id in df['tweet_id']:
    try:
        tweet = api.get_status(tweet_id, tweet_mode = 'extended')
        with open('tweet_json.txt', 'a+') as file:
            json.dump(tweet._json, file)
            file.write('\n')
        print ('for id = ' , tweet_id, 'Successfull')
    except:
        print ('for id = ' , tweet_id, 'Failure')

for id =  892420643555336193 Failure
for id =  892177421306343426 Failure
for id =  891815181378084864 Failure
for id =  891689557279858688 Failure
for id =  891327558926688256 Failure
for id =  891087950875897856 Failure
for id =  890971913173991426 Failure
for id =  890729181411237888 Failure
for id =  890609185150312448 Failure
for id =  890240255349198849 Failure
for id =  890006608113172480 Failure
for id =  889880896479866881 Failure
for id =  889665388333682689 Failure
for id =  889638837579907072 Failure
for id =  889531135344209921 Failure
for id =  889278841981685760 Failure
for id =  888917238123831296 Failure
for id =  888804989199671297 Failure
for id =  888554962724278272 Failure
for id =  888202515573088257 Failure
for id =  888078434458587136 Failure
for id =  887705289381826560 Failure
for id =  887517139158093824 Failure
for id =  887473957103951883 Failure
for id =  887343217045368832 Failure
for id =  887101392804085760 Failure
for id =  886983233522544640 Failure
f

for id =  801538201127157760 Failure
for id =  801285448605831168 Failure
for id =  801167903437357056 Failure
for id =  801127390143516673 Failure
for id =  801115127852503040 Failure
for id =  800859414831898624 Failure
for id =  800855607700029440 Failure
for id =  800751577355128832 Failure
for id =  800513324630806528 Failure
for id =  800459316964663297 Failure
for id =  800443802682937345 Failure
for id =  800388270626521089 Failure
for id =  800188575492947969 Failure
for id =  800141422401830912 Failure
for id =  800018252395122689 Failure
for id =  799774291445383169 Failure
for id =  799757965289017345 Failure
for id =  799422933579902976 Failure
for id =  799308762079035393 Failure
for id =  799297110730567681 Failure
for id =  799063482566066176 Failure
for id =  798933969379225600 Failure
for id =  798925684722855936 Failure
for id =  798705661114773508 Failure
for id =  798701998996647937 Failure
for id =  798697898615730177 Failure
for id =  798694562394996736 Failure
f

for id =  709158332880297985 Failure
for id =  709042156699303936 Failure
for id =  708853462201716736 Failure
for id =  708845821941387268 Failure
for id =  708834316713893888 Failure
for id =  708810915978854401 Failure
for id =  708738143638450176 Failure
for id =  708711088997666817 Failure
for id =  708479650088034305 Failure
for id =  708469915515297792 Failure
for id =  708400866336894977 Failure
for id =  708356463048204288 Failure
for id =  708349470027751425 Failure
for id =  708149363256774660 Failure
for id =  708130923141795840 Failure
for id =  708119489313951744 Failure
for id =  708109389455101952 Failure
for id =  708026248782585858 Failure
for id =  707995814724026368 Failure
for id =  707983188426153984 Failure
for id =  707969809498152960 Failure
for id =  707776935007539200 Failure
for id =  707741517457260545 Failure
for id =  707738799544082433 Failure
for id =  707693576495472641 Failure
for id =  707629649552134146 Failure
for id =  707610948723478529 Failure
f

for id =  674008982932058114 Failure
for id =  673956914389192708 Failure
for id =  673919437611909120 Failure
for id =  673906403526995968 Failure
for id =  673887867907739649 Failure
for id =  673716320723169284 Failure
for id =  673715861853720576 Failure
for id =  673711475735838725 Failure
for id =  673709992831262724 Failure
for id =  673708611235921920 Failure
for id =  673707060090052608 Failure
for id =  673705679337693185 Failure
for id =  673700254269775872 Failure
for id =  673697980713705472 Failure
for id =  673689733134946305 Failure
for id =  673688752737402881 Failure
for id =  673686845050527744 Failure
for id =  673680198160809984 Failure
for id =  673662677122719744 Failure
for id =  673656262056419329 Failure
for id =  673636718965334016 Failure
for id =  673612854080196609 Failure
for id =  673583129559498752 Failure
for id =  673580926094458881 Failure
for id =  673576835670777856 Failure
for id =  673363615379013632 Failure
for id =  673359818736984064 Failure
f

In [145]:
for tweet_id in df['tweet_id']:
    try:
        tweet = api.get_status(tweet_id, tweet_mode = 'extended')
        with open('tweet_json.txt', 'a+') as file:
            json.dump(tweet._json, file)
            file.write('\n')
        print (tweet_id, 'success')
    except:
        print (tweet_id, 'Failed')

892420643555336193 Failed
892177421306343426 Failed
891815181378084864 Failed
891689557279858688 Failed
891327558926688256 Failed
891087950875897856 Failed
890971913173991426 Failed
890729181411237888 Failed
890609185150312448 Failed
890240255349198849 Failed
890006608113172480 Failed
889880896479866881 Failed
889665388333682689 Failed
889638837579907072 Failed
889531135344209921 Failed
889278841981685760 Failed
888917238123831296 Failed
888804989199671297 Failed
888554962724278272 Failed
888202515573088257 Failed
888078434458587136 Failed
887705289381826560 Failed
887517139158093824 Failed
887473957103951883 Failed
887343217045368832 Failed
887101392804085760 Failed
886983233522544640 Failed
886736880519319552 Failed
886680336477933568 Failed
886366144734445568 Failed
886267009285017600 Failed
886258384151887873 Failed
886054160059072513 Failed
885984800019947520 Failed
885528943205470208 Failed
885518971528720385 Failed
885311592912609280 Failed
885167619883638784 Failed
884925521741

828046555563323392 Failed
828011680017821696 Failed
827933404142436356 Failed
827653905312006145 Failed
827600520311402496 Failed
827324948884643840 Failed
827228250799742977 Failed
827199976799354881 Failed
826958653328592898 Failed
826848821049180160 Failed
826615380357632002 Failed
826598799820865537 Failed
826598365270007810 Failed
826476773533745153 Failed
826240494070030336 Failed
826204788643753985 Failed
826115272272650244 Failed
825876512159186944 Failed
825829644528148480 Failed
825535076884762624 Failed
825147591692263424 Failed
825120256414846976 Failed
825026590719483904 Failed
824796380199809024 Failed
824775126675836928 Failed
824663926340194305 Failed
824325613288833024 Failed
824297048279236611 Failed
824025158776213504 Failed
823939628516474880 Failed
823719002937630720 Failed
823699002998870016 Failed
823581115634085888 Failed
823333489516937216 Failed
823322678127919110 Failed
823269594223824897 Failed
822975315408461824 Failed
822872901745569793 Failed
822859134160

777953400541634568 Failed
777885040357281792 Failed
777684233540206592 Failed
777641927919427584 Failed
777621514455814149 Failed
777189768882946048 Failed
776819012571455488 Failed
776813020089548800 Failed
776477788987613185 Failed
776249906839351296 Failed
776218204058357768 Failed
776201521193218049 Failed
776113305656188928 Failed
776088319444877312 Failed
775898661951791106 Failed
775842724423557120 Failed
775733305207554048 Failed
775729183532220416 Failed
775364825476165632 Failed
775350846108426240 Failed
775096608509886464 Failed
775085132600442880 Failed
774757898236878852 Failed
774639387460112384 Failed
774314403806253056 Failed
773985732834758656 Failed
773922284943896577 Failed
773704687002451968 Failed
773670353721753600 Failed
773547596996571136 Failed
773336787167145985 Failed
773308824254029826 Failed
773247561583001600 Failed
773191612633579521 Failed
772877495989305348 Failed
772826264096874500 Failed
772615324260794368 Failed
772581559778025472 Failed
772193107915

701805642395348998 Failed
701601587219795968 Failed
701570477911896070 Failed
701545186879471618 Failed
701214700881756160 Failed
700890391244103680 Failed
700864154249383937 Failed
700847567345688576 Failed
700796979434098688 Failed
700747788515020802 Failed
700518061187723268 Failed
700505138482569216 Failed
700462010979500032 Failed
700167517596164096 Failed
700151421916807169 Failed
700143752053182464 Failed
700062718104104960 Failed
700029284593901568 Failed
700002074055016451 Failed
699801817392291840 Failed
699788877217865730 Failed
699779630832685056 Failed
699775878809702401 Failed
699691744225525762 Failed
699446877801091073 Failed
699434518667751424 Failed
699423671849451520 Failed
699413908797464576 Failed
699370870310113280 Failed
699323444782047232 Failed
699088579889332224 Failed
699079609774645248 Failed
699072405256409088 Failed
699060279947165696 Failed
699036661657767936 Failed
698989035503689728 Failed
698953797952008193 Failed
698907974262222848 Failed
698710712454

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issue**. You must use **both** visual assessment
programmatic assessement to assess the data.

**Note:** pay attention to the following key points when you access the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



In [146]:
# Using describe to get an overview of the dataframe and its variable's descriptrive statistics
df.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [147]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [148]:
# Using sample() method to get a better overview of the dataframe
df.sample(30)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
809,771500966810099713,,,2016-09-02 00:12:18 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Dakota. He's just saying hi. That's al...,,,,https://twitter.com/dog_rates/status/771500966...,12,10,Dakota,,,,
2011,672245253877968896,,,2015-12-03 02:45:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Snickers. He's adorable. Also comes in t-...,,,,https://twitter.com/dog_rates/status/672245253...,12,10,Snickers,,,,
42,884247878851493888,,,2017-07-10 03:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",OMG HE DIDN'T MEAN TO HE WAS JUST TRYING A LIT...,,,,https://twitter.com/kaijohnson_19/status/88396...,13,10,,,,,
489,813910438903693312,,,2016-12-28 00:52:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Chubbs. He dug a hole and now he's stu...,,,,https://twitter.com/dog_rates/status/813910438...,11,10,Chubbs,doggo,,,
1748,679062614270468097,,,2015-12-21 22:15:18 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Chompsky. He lives up to his name. 11/...,,,,https://twitter.com/dog_rates/status/679062614...,11,10,Chompsky,,,,
623,796031486298386433,,,2016-11-08 16:47:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Daisy. She's here to make your day bet...,,,,https://twitter.com/dog_rates/status/796031486...,13,10,Daisy,,,,
1364,702899151802126337,,,2016-02-25 16:53:11 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Say hello to Luna. Her tongue is malfunctionin...,,,,https://vine.co/v/i6iIrBwnTFI,12,10,Luna,,,,
1702,680934982542561280,,,2015-12-27 02:15:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Sadie. She's a Tortellini Sidewin...,,,,https://twitter.com/dog_rates/status/680934982...,10,10,Sadie,,,,
281,839290600511926273,,,2017-03-08 01:44:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @alexmartindawg: THE DRINK IS DR. PUPPER 10...,8.392899e+17,41198420.0,2017-03-08 01:41:24 +0000,https://twitter.com/alexmartindawg/status/8392...,10,10,,,,pupper,
82,876838120628539392,,,2017-06-19 16:24:33 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Ginger. She's having a ruff Monday. To...,,,,https://twitter.com/dog_rates/status/876838120...,12,10,Ginger,,,pupper,


In [149]:
# checking for posible duplicates
df.duplicated().sum()

0

## Checking the df_image_predictions dataframe

In [150]:
df_image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [151]:
# checking for posible duplicates
df_image_predictions.duplicated().sum()

0

In [152]:
# Using describe to examine descriptive statistics for the dataframe
df_image_predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [153]:
# Checking if there are any duplicate ids in the tweets recorded
df_image_predictions['tweet_id'].duplicated().sum()

0

### Examining the tweet_json dataframe

In [154]:
with open('tweet_json.txt') as f:
   tweet_data = f.readlines()
tweet_data

['{"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id": 892420643555336193, "id_str": "892420643555336193", "full_text": "This is Phineas. He\'s a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", "truncated": false, "display_text_range": [0, 85], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 892420639486877696, "id_str": "892420639486877696", "indices": [86, 109], "media_url": "http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "url": "https://t.co/MgUWQ76dJU", "display_url": "pic.twitter.com/MgUWQ76dJU", "expanded_url": "https://twitter.com/dog_rates/status/892420643555336193/photo/1", "type": "photo", "sizes": {"large": {"w": 540, "h": 528, "resize": "fit"}, "thumb": {"w": 150, "h": 150, "resize": "crop"}, "small": {"w": 540, "h": 528, "resize": "fit"}, "medium": {"w": 540, "h": 528, "resize": "fit"}}}]}, "extended_entities": {"me

### Quality issues

1. The source for the Tweets are too long, which can be grouped into three main areas for clarity and analysis

<!-- In image descriptions, some names have underscores likely confusing, e.g., "Rhodesian_ridgeback" for "Rhodesian ridgeback" -->

2. Not all tweets have images, i,e., expanded_url's in them, and for consistency, this can be removed for respective dogs.

3. Mix ups in representing NaN values, with some rows indicated as "None" and others using the word "NaN"

4. Varying rating denominators and numerators. For consistency, this can be changed to be uniform for denominator and numerator

5. The "doggo", "floofer", "pupper" and "puppo" depict the same dog aspect, which can be combined into a single column

6. Time stamp challenges, in that it is represented both as date and clock times on same column

7. Different irrelevant or non-clear names for the doggs, e.g., "a", "an" and "the" for a dog names is not very clear

8. None relevant information or columns in the dataframe, which can be removed in the analysis, e.g.,
    - 'in_reply_to_status_id'
    - 'in_reply_to_user_id'
    - 'retweeted_status_id', and
    - 'retweeted_status_user_id'

### Tidiness issues

1. Dog phases/levels are set on their own column than being combined into one for all, i.e., doggo,floofer,pupper,puppo

2. Inconsistencies in image_predictions with uppper and lowercases, e.g., beagle => Beagle, malamute => Malamute, chow ==> Chow

3. There are variation in some links on the source of the tweets. To ensure tidyness, consolidating this differences is needed

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [155]:
# Make copies of original pieces of data
# Copy of twitter_archive_df
twitter_archive_copy = df.copy()

# Copy of tweet_json_df
# tweet_json_copy = tweet_json.copy()

# Copy of image_predictions
df_image_predictions_copy = df_image_predictions.copy()


### Issue #1: Adjusting Tweet's Sources

#### Define: Readjusting Tweet Sources into Three (3) Groupings, i.e., "Twitter for iPhone", "Vine - Make a Scene" and "TweetDeck"

#### Code

In [156]:
# As shown below, there are four (4) core sources
twitter_archive_copy['source'].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [157]:
# Replacing the source naming with simpler information,
# That is, Twitter for iPhone, TweetDeck, Twitter Wen Client and Vine - Make a Scenario

twitter_archive_copy['source'] = twitter_archive_copy['source'].str.replace('<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'Twitter for iPhone')
twitter_archive_copy['source'] = twitter_archive_copy['source'].str.replace('<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'Twitter Web Client')
twitter_archive_copy['source'] = twitter_archive_copy['source'].str.replace('<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>', 'Vine - Make a Scene')
twitter_archive_copy['source'] = twitter_archive_copy['source'].str.replace('<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'TweetDeck')

  twitter_archive_copy['source'] = twitter_archive_copy['source'].str.replace('<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'Twitter for iPhone')
  twitter_archive_copy['source'] = twitter_archive_copy['source'].str.replace('<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'Twitter Web Client')
  twitter_archive_copy['source'] = twitter_archive_copy['source'].str.replace('<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>', 'Vine - Make a Scene')
  twitter_archive_copy['source'] = twitter_archive_copy['source'].str.replace('<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'TweetDeck')


#### Test

In [158]:
# Checking the new dataframe with replacmenets
twitter_archive_copy.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,Twitter for iPhone,This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,Twitter for iPhone,This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,Twitter for iPhone,This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [159]:
# Checking 'source' using .value_counts() method
twitter_archive_copy['source'].value_counts()

Twitter for iPhone     2221
Vine - Make a Scene      91
Twitter Web Client       33
TweetDeck                11
Name: source, dtype: int64

### Issue #2: Dropping Tweets without images, i.e., expanded URLS

#### Define: In this section, the goal is using Python to drop Tweets without expanded URLs for uniformity in the dataframe

#### Code

In [160]:
# Checking their scope using value_counts() method
twitter_archive_copy['expanded_urls'].value_counts().sum()

2297

In [161]:
# Checking the total number of rows without expanded_urls
twitter_archive_copy['expanded_urls'].isnull().sum()

59

- As shown from above, only 59 rows have no expanded_urls
- This can be dropped for consistency

In [162]:
twitter_archive_copy['expanded_urls'].dropna(inplace=True)
twitter_archive_copy.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,Twitter for iPhone,This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,Twitter for iPhone,This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,Twitter for iPhone,This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [163]:
twitter_archive_copy = twitter_archive_copy[~twitter_archive_copy[
'expanded_urls'].isnull()]
twitter_archive_copy.tail()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,Twitter for iPhone,Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,Twitter for iPhone,This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,Twitter for iPhone,Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,Twitter for iPhone,This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,Twitter for iPhone,Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,


#### Test

In [164]:
# Checking the new dataframe, upon dropping the null values in 'expanded_urls'
# Checking the total number of rows without expanded_urls

In [165]:
# Checking if the expanded_urls, i.e., the 59 are in the new cleaned dataframe
# As shown in the table below, nothing is returned based on .isnull() method
twitter_archive_copy[twitter_archive_copy['expanded_urls'].isnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [166]:
# If we check counts, it is now zero(0)
twitter_archive_copy['expanded_urls'].isnull().sum()

0

### Issue 3: Mix up in representing NaN values for some rows and others "None" "NaN"

### Define

- In the dataset, there are sections where NaN is used while others have None
- For consistency, all the NaN items in the dataframe can be replaced with "None"

#### Code

In [219]:
twitter_archive_copy.replace(np.NaN, "None", inplace=True)

- As shown from above, for any NaN, there is replacing them with "None"

#### Test

In [220]:
# Getting a sample of the data to see the changes
twitter_archive_copy.sample(50)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1138,728035342121635841,,,2016-05-05 01:35:26 +0000,Twitter for iPhone,This is all I want in my life. 12/10 for super...,,,,https://twitter.com/dog_rates/status/728035342...,10,10,all,,,pupper,
37,885167619883638784,,,2017-07-12 16:03:00 +0000,Twitter for iPhone,Here we have a corgi undercover as a malamute....,,,,https://twitter.com/dog_rates/status/885167619...,10,10,Unspecified Dog,,,,
2342,666082916733198337,,,2015-11-16 02:38:37 +0000,Twitter for iPhone,Here we have a well-established sunblockerspan...,,,,https://twitter.com/dog_rates/status/666082916...,10,10,Unspecified Dog,,,,
645,793165685325201412,,,2016-10-31 19:00:10 +0000,Twitter for iPhone,This is Benji. He's Air Bud. It's a low effort...,,,,https://twitter.com/dog_rates/status/793165685...,10,10,Benji,,,,
412,822975315408461824,,,2017-01-22 01:12:59 +0000,Twitter for iPhone,This is Albus. He's soaked as h*ck. Seems to h...,,,,https://twitter.com/dog_rates/status/822975315...,10,10,Albus,,,,
1907,674447403907457024,,,2015-12-09 04:36:06 +0000,Twitter for iPhone,This pupper just wants a belly rub. This puppe...,,,,https://twitter.com/dog_rates/status/674447403...,10,10,Unspecified Dog,,,pupper,
687,788039637453406209,,,2016-10-17 15:31:05 +0000,Twitter for iPhone,Did... did they pick out that license plate? 1...,,,,https://twitter.com/dog_rates/status/788039637...,10,10,Unspecified Dog,,,,
428,821149554670182400,,,2017-01-17 00:18:04 +0000,Twitter for iPhone,This is Luca. He got caught howling. H*ckin em...,,,,https://twitter.com/dog_rates/status/821149554...,10,10,Luca,,,,
646,793150605191548928,,,2016-10-31 18:00:14 +0000,Twitter for iPhone,This is Nida. She's a free elf. Waited so long...,,,,https://twitter.com/dog_rates/status/793150605...,10,10,Nida,,,,
942,752917284578922496,,,2016-07-12 17:27:23 +0000,Twitter for iPhone,This is Grizzie. She's a semi-submerged Bahrai...,,,,https://twitter.com/dog_rates/status/752917284...,10,10,Grizzie,,,,


### Issue #4: Adjusting Rating in Numerator and Denominator for Consistency

#### Define

The goal is realizing consistency in the dataset, which can be attained by changing the ratings to ten (10), which is the most common rating value

#### Code

In [167]:
# Finding the number of ratings for all the data
twitter_archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2297 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2297 non-null   int64  
 1   in_reply_to_status_id       23 non-null     float64
 2   in_reply_to_user_id         23 non-null     float64
 3   timestamp                   2297 non-null   object 
 4   source                      2297 non-null   object 
 5   text                        2297 non-null   object 
 6   retweeted_status_id         180 non-null    float64
 7   retweeted_status_user_id    180 non-null    float64
 8   retweeted_status_timestamp  180 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2297 non-null   int64  
 11  rating_denominator          2297 non-null   int64  
 12  name                        2297 non-null   object 
 13  doggo                       2297 

### Part 1: Ratings for Numerator

In [168]:
# Finding the number of ratings for all the data
twitter_archive_copy['rating_numerator']

0       13
1       13
2       12
3       13
4       12
        ..
2351     5
2352     6
2353     9
2354     7
2355     8
Name: rating_numerator, Length: 2297, dtype: int64

- As shown above, some ratings are 13, 12, 9, 8, 7, 6, and 5

In [169]:
# Finding the number of ratings for numerator that are not 10 (!10)
twitter_archive_copy[twitter_archive_copy['rating_numerator'] != 10]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,Twitter for iPhone,This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,Twitter for iPhone,This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,Twitter for iPhone,This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,Twitter for iPhone,Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,Twitter for iPhone,This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,Twitter for iPhone,Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,Twitter for iPhone,This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


- The code above shows that 1842 rows have a rating that is not 10 (!10)

In [170]:
# Identifying and recording the indexes of all rating numerator not equal to 10 using the query method
none_10_rating_numerator = twitter_archive_copy.query("rating_numerator != 10")['rating_numerator'].index

In [171]:
# Changing the rating numerators from other values to 10
for index in none_10_rating_numerator:
    twitter_archive_copy.at[index,'rating_numerator'] = 10

#### Test

In [172]:
# Using the previous method, to now check the proportion of rows with none-10 as the numerator
twitter_archive_copy[twitter_archive_copy['rating_numerator'] != 10]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


- As shown above, the result brings an empty dataframe where columns now have no values
- We can recheck the result by using the value_counts() method to know how many of the rows have 10 as the rating
- The response should show all the rows in the dataframe, i.e., 2297, which means now all rows have a 10 as the rating

In [173]:
# Finding the number of counts for each of the ratings in the dataframe
twitter_archive_copy['rating_numerator'].value_counts()

10    2297
Name: rating_numerator, dtype: int64

### Part 2: Ratings for Denominator

In [174]:
# Finding the number of denominator ratings for all the data
twitter_archive_copy['rating_denominator'].value_counts()

10     2278
11        3
50        3
80        2
70        1
7         1
150       1
170       1
20        1
90        1
40        1
130       1
110       1
120       1
2         1
Name: rating_denominator, dtype: int64

- As shown above, for the denominator ratings, the most common value is ten (10)
- For consistency, we well change the other ratings to 10 for use in the data frame
- This is as shown below

In [175]:
# Finding and showing some of the rows where the number of rating denominator are not 10 (!10)
twitter_archive_copy[twitter_archive_copy['rating_denominator'] != 10]


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
433,820690176645140481,,,2017-01-15 17:52:40 +0000,Twitter for iPhone,The floofs have been released I repeat the flo...,,,,https://twitter.com/dog_rates/status/820690176...,10,70,,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,Twitter for iPhone,Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,,"https://www.gofundme.com/sams-smile,https://tw...",10,7,Sam,,,,
784,775096608509886464,,,2016-09-11 22:20:06 +0000,Twitter for iPhone,"RT @dog_rates: After so many requests, this is...",7.403732e+17,4196984000.0,2016-06-08 02:41:38 +0000,https://twitter.com/dog_rates/status/740373189...,10,11,,,,,
902,758467244762497024,,,2016-07-28 01:00:57 +0000,Twitter for iPhone,Why does this never happen at my front door......,,,,https://twitter.com/dog_rates/status/758467244...,10,150,,,,,
1068,740373189193256964,,,2016-06-08 02:41:38 +0000,Twitter for iPhone,"After so many requests, this is Bretagne. She ...",,,,https://twitter.com/dog_rates/status/740373189...,10,11,,,,,
1120,731156023742988288,,,2016-05-13 16:15:54 +0000,Twitter for iPhone,Say hello to this unbelievably well behaved sq...,,,,https://twitter.com/dog_rates/status/731156023...,10,170,this,,,,
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,Twitter for iPhone,Happy 4/20 from the squad! 13/10 for all https...,,,,https://twitter.com/dog_rates/status/722974582...,10,20,,,,,
1202,716439118184652801,,,2016-04-03 01:36:11 +0000,Twitter for iPhone,This is Bluebert. He just saw that both #Final...,,,,https://twitter.com/dog_rates/status/716439118...,10,50,Bluebert,,,,
1228,713900603437621249,,,2016-03-27 01:29:02 +0000,Twitter for iPhone,Happy Saturday here's 9 puppers on a bench. 99...,,,,https://twitter.com/dog_rates/status/713900603...,10,90,,,,,
1254,710658690886586372,,,2016-03-18 02:46:49 +0000,Twitter for iPhone,Here's a brigade of puppers. All look very pre...,,,,https://twitter.com/dog_rates/status/710658690...,10,80,,,,,


In [176]:
# Identifying and recording the indexes of all rating denominator not equal to 10 using the query method
none_10_rating_numerator = twitter_archive_copy.query("rating_denominator != 10")['rating_denominator'].index

In [177]:
# Changing the rating denominators from other values to 10
for index in none_10_rating_numerator:
    twitter_archive_copy.at[index,'rating_denominator'] = 10

#### Test

In [178]:
# Using the previous method, to now check the proportion of rows with none-10 as the numerator
twitter_archive_copy[twitter_archive_copy['rating_denominator'] != 10]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [179]:
# Finding the number of counts of rating denominator for each of the ratings in the dataframe
twitter_archive_copy['rating_denominator'].value_counts()

10    2297
Name: rating_denominator, dtype: int64

- As shown from above, on checking, the result brings an empty dataframe where no values are shown
- All the denominator ratings not equal to 10 have been replaced

### Issue # 5: Agglomerating dog phases of "doggo", "floofer", "pupper" & "puppo" into a single column

#### Define:

- As part of the identified issues, the last three columns can be combined into forming a single column
- This will represent the dog's life phase, divided into "doggo", "floofer", "pupper" & "puppo"

#### Code

In [180]:
# Checking unique values in the dataset for doggo
twitter_archive_copy['doggo'].unique()


array(['None', 'doggo'], dtype=object)

In [181]:
# Checking unique values in the dataset for floofer
twitter_archive_copy['floofer'].unique()

array(['None', 'floofer'], dtype=object)

In [182]:
# Checking unique values in the dataset for pupper
twitter_archive_copy['pupper'].unique()

array(['None', 'pupper'], dtype=object)

In [183]:
# Checking unique values in the dataset for puppo
twitter_archive_copy['puppo'].unique()

array(['None', 'puppo'], dtype=object)

In [184]:
# Taking all the "doggo", "floofer", "pupper" & "puppo" and storing them into one column called "dog_phase" aspect
# To combine the columns into a single column, we adopt the melt() method as below


### Issue 6: Time stamp challenges, in that it is represented both as date and clock times on same column

#### Define

- This section focuses on aligning the timestamp column, so that there is consistency in the recorded timeline

#### Code

### Test

In [None]:
# Getting an overview of the whole dataframe, to see the new timestamp column

### Issue 7: Name inconsistencies across the column

### Define

- In the name column, there are some irrelevant or non-clear names for the dogs, e.g., "a", "an" and "the", etc.

- As a means in improving the dataframe's quality, adjusting this information is useful for subsequent analysis

#### Code

In [185]:
# Checking the name column and listing some of its contents for better assessment
# The below list shows some names, and others which are hard to understand
twitter_archive_copy['name']

0        Phineas
1          Tilly
2         Archie
3          Darla
4       Franklin
          ...   
2351        None
2352           a
2353           a
2354           a
2355        None
Name: name, Length: 2297, dtype: object

In [186]:
# Using the sample() method, we can get a randomized better overview of the dataset as below
twitter_archive_copy['name'].sample(30)

713     Reginald
192         None
641         Juno
2248      Shaggy
2181        None
2034           a
497         None
1546     Chipson
1156        None
1045      Harold
869         None
670        Happy
152         Dave
1563      Lizzie
1162    Reginald
1998       Darby
1853           a
1094    Chadrick
1486        None
258         Hank
227         Odin
1035        Abby
947         None
1247        Gary
1795       Tassy
1393       Coops
2128           a
142     Meatball
678       Stella
390         None
Name: name, dtype: object

In [187]:
# Using for loop, we can list the dog names, and check those that are suspicious as shown below
for x in twitter_archive_copy['name']:
    print(x)

Phineas
Tilly
Archie
Darla
Franklin
None
Jax
None
Zoey
Cassie
Koda
Bruno
None
Ted
Stuart
Oliver
Jim
Zeke
Ralphus
Canela
Gerald
Jeffrey
such
Canela
None
None
Maya
Mingus
Derek
Roscoe
Waffles
None
Jimbo
Maisey
None
Lilly
None
Earl
Lola
Kevin
None
None
Yogi
Noah
Bella
Grizzwald
None
Rusty
Gus
Stanley
Alfy
Koko
Rey
Gary
a
Elliot
Louis
None
Bella
Jesse
None
Romeo
Bailey
Duddles
Jack
Emmy
Steven
Beau
Snoopy
None
Shadow
Terrance
Shadow
Emmy
Aja
None
Penny
Dante
Nelly
Ginger
None
Benedict
Venti
Goose
Nugget
None
None
Cash
Coco
Jed
None
Sebastian
Walter
None
Sierra
Sierra
None
None
None
Monkey
None
Harry
Kody
Lassie
Rover
Napolean
Dawn
None
Boomer
None
Cody
Zoey
Rumble
Clifford
quite
Dewey
Stanley
Scout
Gizmo
Walter
Cooper
None
Cooper
None
Harold
Shikha
None
None
Jamesy
None
Lili
Jamesy
Coco
None
Boomer
Sammy
Nelly
None
Meatball
Paisley
Albus
Neptune
Quinn
Belle
None
Quinn
Zooey
Dave
Jersey
None
None
Hobbes
None
Burt
Lorenzo
None
Lorenzo
Carl
Jordy
None
None
Milky
Trooper
None
quite
None
Winsto

- From above, we can document a list of names which look suspecious or not informative
- Based on personal judgment as a data analyst by looking at the sample, the below names are not clear

In [189]:
unclear_name_list = ['None', 'a', 'one', 'mad', 'the', 'not', 'quite', 'just', 'an', 'space', 'my', 'life', 'this', 'very']

# From the dataset, it was noticeable that most of the suspicious names were in lower-case, except "None"

In [207]:
# Next, for any of the listed names above, the best option can be replacing the names with a common understandable name
# In this case, all unclear names will be replaced with "Unspecified Dog"
# The code below can be used to replace the 14 unclear names with "Unspecified Dog" using replace
# Using for loop, and the list of unclear names, we can replace in the name column as below with "Unspecified Dog"
for x in unclear_name_list:
    twitter_archive_copy['name'].replace(x, 'Unspecified Dog', inplace = True)

- The Python code above iterates through the name column, replacing all names in the unclear list

#### Test

In [214]:
# Using value counts to get an overview of each dog's name
twitter_archive_copy['name'].value_counts()

Unspecified Dog    781
Charlie             12
Oliver              11
Cooper              11
Lucy                11
                  ... 
Gert                 1
Dex                  1
Ace                  1
Tayzie               1
Christoper           1
Name: name, Length: 944, dtype: int64

In [216]:
# Running code to check the list of names in the dataframe
twitter_archive_copy['name'].sample(50)

1059               Aqua
714               Balto
284                Arlo
1744              Rubio
85                Venti
1698              Aspen
1565               Kilo
2304    Unspecified Dog
1047               Beya
1827               Duke
765              Wesley
1072             Nollie
1406              Charl
842               Rufus
687     Unspecified Dog
2244               Erik
428                Luca
1655               Jack
383             Charlie
25      Unspecified Dog
27               Mingus
1767    Unspecified Dog
1084    Unspecified Dog
1225    Unspecified Dog
791                Loki
333              Willow
353             Scooter
560              Marley
1384             Daniel
757               Penny
540             Waffles
1381               Chet
1045             Harold
506               Sammy
1633                Obi
468               Chloe
182                Luna
876                  Bo
2026               Brad
2151              Louie
699               Arnie
1875            

- As shown from above, none of the listed suspicious names can be seen in the new dataframe

### Issue 8: Removing Potentially Irrelevant Columns from the Data Frame

### Define

- In this section, there exists some columns in the dataframe which can be eliminated to have a robust dataframe
- For proper representation, the below columns are removed, as they do not offer any usable analysis information
- These are:
    1. - 'in_reply_to_status_id'
    2. 'in_reply_to_user_id'
    3. 'retweeted_status_id', and
    4. 'retweeted_status_user_id'

#### Code

In [62]:
# Checking the column names in the dataset
twitter_archive_copy.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [63]:
# To remove the columns, we adopt the drop method to the dataframe
# First, creating a variable list of the pre-chosen columns to remove
to_remove_columns = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id']

In [64]:
# Using drop() with replace to remove the columns
twitter_archive_copy.drop(to_remove_columns, axis=1, inplace=True)

#### Test

In [65]:
# Checking the dataframe after dropping the columns
# The below gives the list of columns, without the above removed/drooped
twitter_archive_copy.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'retweeted_status_timestamp',
       'expanded_urls', 'rating_numerator', 'rating_denominator', 'name',
       'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization

### References
1. Python Community (2022). [requests 2.28.1](https://pypi.org/project/requests/)