# Wrangle and Analyze Data
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Gathering data</a></li>
<li><a href="#assessing">Assessing data</a></li>
<li><a href="#cleaning">Cleaning data</a></li>
<li><a href="#storing">Storing, analyzing and visualizing wrangled data</a></li>
</ul>

<a id='intro'></a>
## Introduction

In the Wrangle and Analyze Data project, we will walk through each of the steps needed to properly wrangle the WeRateDogs Twitter data and create interesting and trustworthy analyses and visualizations.

In the **Gathering data** section, we will collect 3 pieces of information:
1. The WeRateDogs Twitter archive, which was made available for manual download from Udacity's Project Resources
2. The image-predictions.tsv file, which we will download programmatically from Udacity's servers
3. Retweet and favorite counts, which are not available in the two sources above; we will create a Tweepy API object and use the tweet IDs from the files above to query the Twitter API and collect the missing data

The **Assessing data** section is reserved for assessing the data visually and programmatically for quality and tidiness issues. At least eight (8) quality issues and two (2) tidiness issues must be detected and documented in the wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

Throughout the **Cleaning data** section we will clean each of the issues documented while assessing. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

Finally, in the last section, **Storing, analyzing and visualizing wrangled data**, we will store the clean DataFrame(s) in a CSV file, with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, they will be named appropriately. Additionally, the cleaned data may be stored in a SQLite database (which is then submitted as well). We will also analyze and visualize the wrangled data in this Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

Let's start the work.

<a id='gathering'></a>
## Gathering data

In [94]:
import pandas as pd
import requests
import os
import tweepy
import json
import time
from datetime import datetime
import numpy as np

In [2]:
# Read the WeRateDogs Twitter archive into a DataFrame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# Download image-predictions.tsv file programmatically
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

response = requests.get(url)
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)

In [4]:
# Read image-predictions.tsv file into a DataFrame
image_predictions = pd.read_csv('image-predictions.tsv', sep='\t')

In [5]:
# Create Tweepy API object
# Credentials are read from environment variables instead of being hard-coded;
# the variable names below are placeholders chosen for this notebook
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
# Read tweet_ids from the image_predictions DataFrame
tweet_id_list = image_predictions['tweet_id']

# Query the Twitter API for each tweet's JSON data using Tweepy library 
#   and store it in the tweet_json.txt file
data = {}
id_errors = []

start = time.time()

with open('tweet_json.txt', 'w') as file:
    for tweet_id in tweet_id_list:
        try:
            data = api.get_status(tweet_id, tweet_mode='extended')
            file.write(json.dumps(data._json) + '\n')
        except Exception as e:
            # Report the failure and record the ID itself (print() returns None)
            print('No status found with ID {}'.format(tweet_id))
            id_errors.append(tweet_id)

end = time.time()
print(end - start)

No status found with ID 677700003327029250
No status found with ID 677716515794329600
No status found with ID 680055455951884288
No status found with ID 680801747103793152


Rate limit reached. Sleeping for: 231


No status found with ID 754011816964026368
No status found with ID 759566828574212096
No status found with ID 771004394259247104
No status found with ID 779123168116150273
No status found with ID 802247111496568832
No status found with ID 829374341691346946


Rate limit reached. Sleeping for: 605


No status found with ID 831939777352105988
No status found with ID 837012587749474308
No status found with ID 837366284874571778
No status found with ID 842892208864923648
No status found with ID 844704788403113984
No status found with ID 851953902622658560
No status found with ID 861769973181624320
No status found with ID 872261713294495745
No status found with ID 873697596434513921
No status found with ID 888202515573088257
1951.4253249168396


In [7]:
# Read the tweet_json.txt file into a pandas DataFrame 
json_list = []

with open('tweet_json.txt') as json_file:
    for line in json_file:
        data = json.loads(line)
        tweet_id = data['id']
        favorite_counts = data['favorite_count']
        retweet_counts = data['retweet_count']
        
        json_list.append({'tweet_id': tweet_id,
                          'favorite_counts': favorite_counts,
                          'retweet_counts': retweet_counts})
        
rt_fav_counts = pd.DataFrame(json_list, columns=['tweet_id', 'favorite_counts', 'retweet_counts'])

<a id='assessing'></a>
## Assessing data

In [8]:
# Display the WeRateDogs twitter archive table
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [9]:
# Display the image predictions table
image_predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [10]:
# Display the Retweet and Favorites count table
rt_fav_counts

Unnamed: 0,tweet_id,favorite_counts,retweet_counts
0,666020888022790149,2430,465
1,666029285002620928,121,42
2,666033412701032449,113,41
3,666044226329800704,274,133
4,666049248165822465,98,41
...,...,...,...
2050,891327558926688256,37968,8540
2051,891689557279858688,39758,7913
2052,891815181378084864,23653,3799
2053,892177421306343426,31395,5740


In [11]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [12]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [13]:
rt_fav_counts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2055 entries, 0 to 2054
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   tweet_id         2055 non-null   int64
 1   favorite_counts  2055 non-null   int64
 2   retweet_counts   2055 non-null   int64
dtypes: int64(3)
memory usage: 48.3 KB


In [16]:
twitter_archive['name']

0        Phineas
1          Tilly
2         Archie
3          Darla
4       Franklin
          ...   
2351        None
2352           a
2353           a
2354           a
2355        None
Name: name, Length: 2356, dtype: object

In [17]:
twitter_archive.sample(20)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
590,799297110730567681,,,2016-11-17 17:04:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jeffrey. He's quite the jokester. Take...,,,,https://twitter.com/dog_rates/status/799297110...,11,10,Jeffrey,,,,
593,798925684722855936,,,2016-11-16 16:28:21 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Shadow. He's a firm believer that they...,,,,https://twitter.com/dog_rates/status/798925684...,11,10,Shadow,,,,
1437,697242256848379904,,,2016-02-10 02:14:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oakley. He has a massive tumor growing...,,,,https://twitter.com/dog_rates/status/697242256...,10,10,Oakley,,,,
1415,698703483621523456,,,2016-02-14 03:01:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Rusty. He has no respect for POULTRY p...,,,,https://twitter.com/dog_rates/status/698703483...,7,10,Rusty,,,,
2089,670789397210615808,,,2015-11-29 02:20:29 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Two obedient dogs here. Left one has extra leg...,,,,https://twitter.com/dog_rates/status/670789397...,9,10,,,,,
1661,683030066213818368,,,2016-01-01 21:00:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Lulu. She's contemplating all her unre...,,,,https://twitter.com/dog_rates/status/683030066...,10,10,Lulu,,,,
1431,697516214579523584,,,2016-02-10 20:23:19 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Say hello to Pherb. He does parkour. 9/10 http...,,,,https://vine.co/v/i1LriMBmX6W,9,10,Pherb,,,,
2101,670691627984359425,,,2015-11-28 19:51:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Ester. He has a cocaine problem. This ...,,,,https://twitter.com/dog_rates/status/670691627...,8,10,Ester,,,,
717,783695101801398276,,,2016-10-05 15:47:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Loomis. He's the leader of the Kenneth...,,,,https://twitter.com/dog_rates/status/783695101...,12,10,Loomis,,,,
93,874057562936811520,,,2017-06-12 00:15:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...","I can't believe this keeps happening. This, is...",,,,https://twitter.com/dog_rates/status/874057562...,12,10,,,,,


In [18]:
twitter_archive.duplicated().sum()

0

In [19]:
image_predictions.duplicated().sum()

0

In [20]:
rt_fav_counts.duplicated().sum()

0
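
A full-row `duplicated()` count of zero does not rule out repeated values within a single key column. The sketch below, on made-up rows (the toy frame is hypothetical, not the project data), shows how a column subset could be checked as well:

```python
import pandas as pd

# Hypothetical toy rows: two different tweets pointing at the same image URL
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'jpg_url': ['a.jpg', 'b.jpg', 'a.jpg'],
})

# No row is an exact copy of another row
full_row_dupes = toy.duplicated().sum()

# One URL is repeated when only jpg_url is considered
url_dupes = toy.duplicated(subset=['jpg_url']).sum()

print(full_row_dupes, url_dupes)
```

The same `subset` argument could be applied to `tweet_id` or `jpg_url` in the real tables.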

### Quality

`twitter_archive` table:
- Trailing +0000 in timestamp column
- Timestamp data type as object, not datetime
- IDs stored as float, not string (columns `in_reply_to_status_id` and `in_reply_to_user_id`)
- `tweet_id` column as integer, not string
- Nulls represented as "None" in `doggo`, `floofer`, `pupper`, `puppo`
- Replies to other users often do not contain an image (missing `expanded_urls`)
- 'this' as name in `name` column for tweet_id 731156023742988288
- 'a' as name in `name` column for multiple tweet_ids
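
Invalid names such as 'a' and 'this' can also be found programmatically. This is a minimal sketch, assuming that genuine dog names start with an uppercase letter (an assumption, not something verified against the full archive); the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical sample of the `name` column
names = pd.Series(['Phineas', 'a', 'this', 'None', 'Tilly'])

# Assumption: genuine names start with an uppercase letter,
# so anything starting in lowercase is treated as suspect
suspect = names[names.str.match(r'^[a-z]')]

print(suspect.tolist())
```

Note that the "None" placeholder starts with an uppercase letter, so it is not caught by this pattern and needs separate handling.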

`image_predictions` table:
- tweet_id column as integer, not string
- Names in inconsistent patterns in columns `p1`, `p2` and `p3` (words separated by _ or -, sometimes capitalized, sometimes lower case)
- Different number of digits after the decimal point (columns `p1_conf`, `p2_conf` and `p3_conf`)

`rt_fav_counts` table:
- tweet_id column as integer, not string


### Tidiness

`twitter_archive` table:
- One variable (dog stage) spread across four columns (`doggo`, `floofer`, `pupper`, `puppo`)
- Retweets included in the table which may cause duplicates in tweets
- `text` column should be split in two columns, `text` and `URL`
- Columns `source`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`, and `expanded_urls` not needed
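
One way the four stage columns could be collapsed into a single variable is sketched below. The column name `stage`, the helper `combine_stages` and the toy rows are all hypothetical; the sketch assumes the "None" placeholder is still present in the data:

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical rows using the archive's 'None' placeholder
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':   ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[col] for col in stage_cols if row[col] != 'None']
    return ', '.join(found) if found else None

# Build the single `stage` column, then drop the four original columns
toy['stage'] = toy[stage_cols].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_cols)

print(toy)
```

Joining with a comma keeps tweets that mention more than one stage instead of silently picking one.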

`image_predictions` table:
- Columns `p1`, `p1_conf`, `p1_dog`, etc. should be converted to rows (i.e. unpivoted) and the resulting columns renamed
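
The unpivoting could look roughly like the following sketch. The output column names (`prediction`, `confidence`, `is_dog`, `rank`) are hypothetical choices, and the toy frame keeps only two predictions per tweet for brevity:

```python
import pandas as pd

# Hypothetical slice of image_predictions (p3 omitted for brevity)
toy = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'p1': ['collie', 'pug'], 'p1_conf': [0.6, 0.5], 'p1_dog': [True, True],
    'p2': ['redbone', 'spatula'], 'p2_conf': [0.2, 0.1], 'p2_dog': [True, False],
})

frames = []
for rank in (1, 2):
    # Take the three columns of one prediction and give them common names
    part = toy[['tweet_id', f'p{rank}', f'p{rank}_conf', f'p{rank}_dog']].copy()
    part.columns = ['tweet_id', 'prediction', 'confidence', 'is_dog']
    part['rank'] = rank
    frames.append(part)

# Stack the per-rank frames: one row per (tweet, prediction rank)
long_predictions = pd.concat(frames, ignore_index=True)

print(long_predictions)
```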


`rt_fav_counts` table:
- Columns `favorite_counts` and `retweet_counts` should be part of the `twitter_archive` table
- Once its columns are merged, the separate table is not needed for the analysis
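
Moving the count columns into the archive table amounts to a join on `tweet_id`. A minimal sketch on hypothetical rows, assuming both key columns already share the same dtype (strings here):

```python
import pandas as pd

# Hypothetical stand-ins for the archive and the counts table
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                        'rating_numerator': [13, 12, 10]})
counts = pd.DataFrame({'tweet_id': ['1', '3'],
                       'favorite_counts': [100, 50],
                       'retweet_counts': [10, 5]})

# A left join keeps every archive row; tweets missing from `counts` get NaN
merged = archive.merge(counts, on='tweet_id', how='left')

print(merged)
```

A left join is used so that tweets the API could not return are kept and show up as missing counts, which can then be handled explicitly.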

<a id='cleaning'></a>
## Cleaning data

In [319]:
twitter_archive_clean = twitter_archive.copy()
image_predictions_clean = image_predictions.copy()
rt_fav_counts_clean = rt_fav_counts.copy()

**Trailing +0000 in timestamp column**

In [320]:
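# Remove the trailing UTC offset; an anchored regex is used because
# str.rstrip would treat its argument as a set of characters, not a pattern
twitter_archive_clean['timestamp'] = twitter_archive_clean['timestamp'].str.replace(r'\s*\+0000$', '', regex=True)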
twitter_archive_clean.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
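
When removing a fixed suffix such as `+0000`, it helps to remember that `str.rstrip` takes a set of characters rather than a pattern. The sketch below compares it with an anchored regular expression on hypothetical sample strings:

```python
import pandas as pd

# rstrip removes any trailing characters found in the given set
with_space = '16:23:56 +0000'.rstrip('^+0000$')   # stops at the space
no_space = '16:23:50+0000'.rstrip('^+0000$')      # also strips the time's final 0
print(repr(with_space), repr(no_space))

# An anchored regex removes only the literal offset (and the space before it)
ts = pd.Series(['2017-08-01 16:23:56 +0000', '2015-11-16 00:24:50 +0000'])
cleaned = ts.str.replace(r'\s*\+0000$', '', regex=True)
print(cleaned.tolist())
```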


**Timestamp data type as object, not datetime**


In [321]:
twitter_archive_clean['timestamp'] = pd.to_datetime(twitter_archive_clean['timestamp'])
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   tweet_id                    2356 non-null   int64         
 1   in_reply_to_status_id       78 non-null     float64       
 2   in_reply_to_user_id         78 non-null     float64       
 3   timestamp                   2356 non-null   datetime64[ns]
 4   source                      2356 non-null   object        
 5   text                        2356 non-null   object        
 6   retweeted_status_id         181 non-null    float64       
 7   retweeted_status_user_id    181 non-null    float64       
 8   retweeted_status_timestamp  181 non-null    object        
 9   expanded_urls               2297 non-null   object        
 10  rating_numerator            2356 non-null   int64         
 11  rating_denominator          2356 non-null   int64       

**Ids as numeric type not string:**
- column tweet_id as `int` not `string`
- columns in_reply_to_status_id and in_reply_to_user_id as `float` not `string`

In [322]:
twitter_archive_clean['tweet_id'] = twitter_archive_clean['tweet_id'].astype(str)
# Note: astype(str) on a float column turns NaN into the string 'nan',
# which is why both reply columns report 2356 non-null values below
twitter_archive_clean['in_reply_to_status_id'] = twitter_archive_clean['in_reply_to_status_id'].astype(str)
twitter_archive_clean['in_reply_to_user_id'] = twitter_archive_clean['in_reply_to_user_id'].astype(str)
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   tweet_id                    2356 non-null   object        
 1   in_reply_to_status_id       2356 non-null   object        
 2   in_reply_to_user_id         2356 non-null   object        
 3   timestamp                   2356 non-null   datetime64[ns]
 4   source                      2356 non-null   object        
 5   text                        2356 non-null   object        
 6   retweeted_status_id         181 non-null    float64       
 7   retweeted_status_user_id    181 non-null    float64       
 8   retweeted_status_timestamp  181 non-null    object        
 9   expanded_urls               2297 non-null   object        
 10  rating_numerator            2356 non-null   int64         
 11  rating_denominator          2356 non-null   int64       
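
IDs that have already been parsed as floats may no longer be exact, because float64 cannot represent every 18-digit integer. As a sketch of one possible alternative (not what this notebook does), ID columns could be read as strings from the start; the two-row CSV below is hypothetical:

```python
import io
import pandas as pd

tweet_id = 892420643555336193

# float64 has 53 bits of precision, so this ID does not survive a round trip
round_trip_ok = int(float(tweet_id)) == tweet_id
print(round_trip_ok)

# Hypothetical two-row CSV with one missing reply ID
csv_text = 'tweet_id,in_reply_to_status_id\n1,892420643555336193\n2,\n'

# Reading ID columns as str keeps every digit and leaves missing values as NaN
df = pd.read_csv(io.StringIO(csv_text),
                 dtype={'tweet_id': str, 'in_reply_to_status_id': str})
print(df['in_reply_to_status_id'].tolist())
```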

**Nulls represented as "None" in `doggo`, `floofer`, `pupper`, `puppo`**


In [323]:
# Replace the 'None' placeholder with NaN in the stage columns and in `name`;
# assigning the result back avoids relying on inplace changes to a column selection
none_cols = ['doggo', 'floofer', 'pupper', 'puppo', 'name']
twitter_archive_clean[none_cols] = twitter_archive_clean[none_cols].replace('None', np.nan)

In [324]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   tweet_id                    2356 non-null   object        
 1   in_reply_to_status_id       2356 non-null   object        
 2   in_reply_to_user_id         2356 non-null   object        
 3   timestamp                   2356 non-null   datetime64[ns]
 4   source                      2356 non-null   object        
 5   text                        2356 non-null   object        
 6   retweeted_status_id         181 non-null    float64       
 7   retweeted_status_user_id    181 non-null    float64       
 8   retweeted_status_timestamp  181 non-null    object        
 9   expanded_urls               2297 non-null   object        
 10  rating_numerator            2356 non-null   int64         
 11  rating_denominator          2356 non-null   int64       

**Replies to other users often do not contain an image**


In [325]:
twitter_archive_clean.dropna(subset=['expanded_urls'], inplace=True)

In [326]:
twitter_archive_clean.expanded_urls.isnull().sum()

0
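
The retweets noted in the assessment could be filtered in a similar way. This is a sketch on hypothetical rows, assuming a non-null `retweeted_status_id` marks a retweet:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: tweet 2 is a retweet, tweet 3 has no URL
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'retweeted_status_id': [np.nan, 8.7e17, np.nan],
    'expanded_urls': ['https://example.com/1', 'https://example.com/2', None],
})

# Keep original tweets only, then drop rows without any URL
originals = toy[toy['retweeted_status_id'].isna()]
with_urls = originals.dropna(subset=['expanded_urls'])

print(with_urls['tweet_id'].tolist())
```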

**'this' as name in `name` column for tweet_id 731156023742988288**

In [327]:
twitter_archive_clean.loc[twitter_archive_clean['tweet_id'] == '731156023742988288', ['name']] = np.nan

In [328]:
twitter_archive_clean.query('tweet_id == "731156023742988288"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1120,731156023742988288,,,2016-05-13 16:15:54,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to this unbelievably well behaved sq...,,,,https://twitter.com/dog_rates/status/731156023...,204,170,,,,,


**'a' as name in `name` column for multiple tweet_ids**


In [329]:
twitter_archive_clean.loc[twitter_archive_clean['name'] == 'a', ['name']] = np.nan

In [330]:
twitter_archive_clean[twitter_archive_clean['name'] == 'a']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


`image_predictions` table

**tweet_id column as integer, not string**

In [331]:
image_predictions_clean['tweet_id'] = image_predictions_clean['tweet_id'].astype(str)

In [332]:
image_predictions_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   object 
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB


**Names in different patterns (separated by _ or - and sometimes in capitals, sometimes in lower case in columns p1, p2 and p3)**

In [339]:
replace_underscore = lambda x: x.str.replace('_', ' ')
replace_dash = lambda x: x.str.replace('-', ' ')
capitalize_first = lambda x: x.str.title()

In [340]:
cols = ['p1', 'p2', 'p3']

image_predictions_clean[cols] = image_predictions_clean[cols].apply(replace_underscore)
image_predictions_clean[cols] = image_predictions_clean[cols].apply(replace_dash)
image_predictions_clean[cols] = image_predictions_clean[cols].apply(capitalize_first)

In [341]:
image_predictions_clean

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh Springer Spaniel,0.465074,True,Collie,0.156665,True,Shetland Sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,Redbone,0.506826,True,Miniature Pinscher,0.074192,True,Rhodesian Ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German Shepherd,0.596461,True,Malinois,0.138584,True,Bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian Ridgeback,0.408143,True,Redbone,0.360687,True,Miniature Pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,Miniature Pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,Basset,0.555712,True,English Springer,0.225770,True,German Short Haired Pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,Paper Towel,0.170278,False,Labrador Retriever,0.168086,True,Spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,Malamute,0.078253,True,Kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,Papillon,0.068957,True


**Different number of digits after the decimal point (columns p1_conf, p2_conf and p3_conf)**

In [289]:
image_predictions_clean = image_predictions_clean.round(2)

In [290]:
image_predictions_clean

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.47,True,collie,0.16,True,Shetland_sheepdog,0.06,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.51,True,miniature_pinscher,0.07,True,Rhodesian_ridgeback,0.07,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.60,True,malinois,0.14,True,bloodhound,0.12,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.41,True,redbone,0.36,True,miniature_pinscher,0.22,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.56,True,Rottweiler,0.24,True,Doberman,0.15,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.56,True,English_springer,0.23,True,German_short-haired_pointer,0.18,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.17,False,Labrador_retriever,0.17,True,spatula,0.04,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.72,True,malamute,0.08,True,kelpie,0.03,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.32,True,Pekinese,0.09,True,papillon,0.07,True


`rt_fav_counts` table:

**tweet_id column as integer, not string**

In [291]:
rt_fav_counts_clean['tweet_id'] = rt_fav_counts_clean['tweet_id'].astype(str)

In [292]:
rt_fav_counts_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2055 entries, 0 to 2054
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   tweet_id         2055 non-null   object
 1   favorite_counts  2055 non-null   int64 
 2   retweet_counts   2055 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 48.3+ KB


<a id='storing'></a>
## Storing, analyzing and visualizing wrangled data