## Project overview

The dataset I wrangle, analyze, and visualize below is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog. It has over 4 million followers and has received international media coverage. WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for use in this project.

### Project goals

#### Assessing Data for this Project
After gathering all the data, assess it visually and programmatically for quality and tidiness issues. Detect and document at least **eight (8)** quality issues and **two (2)** tidiness issues in a Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation must be assessed.

#### Cleaning Data for this Project
Clean each of the issues you documented while assessing. Perform this cleaning in the Jupyter Notebook as well. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

#### Storing, Analyzing, and Visualizing Data for this Project
Store the clean DataFrame(s) in a CSV file with the main one named **twitter_archive_master.csv**. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, **you may store the cleaned data in a SQLite database** (which is to be submitted as well if you do).
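
The storage step could look roughly like the sketch below. The one-row `master` frame, the temporary output folder, and the database file name `twitter_archive_master.db` are assumptions made for illustration; only the CSV file name comes from the requirements above.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned master DataFrame.
master = pd.DataFrame({"tweet_id": ["892420643555336193"], "retweet_count": [7729]})

out_dir = tempfile.mkdtemp()  # stand-in for the project folder
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
master.to_csv(csv_path, index=False)

# Reading tweet_id back as str keeps the 18-digit IDs as text.
from_csv = pd.read_csv(csv_path, dtype={"tweet_id": str})

# Optional: store the same table in a SQLite database (file name is an assumption).
conn = sqlite3.connect(os.path.join(out_dir, "twitter_archive_master.db"))
master.to_sql("twitter_archive_master", conn, index=False, if_exists="replace")
from_db = pd.read_sql("SELECT * FROM twitter_archive_master", conn)
conn.close()
```

Passing `index=False` in both calls avoids writing the row index as an extra unnamed column.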

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least **three (3) insights and one (1) visualization** must be produced.

#### Reporting for this Project
Create a 300-600 word written report called **wrangle_report.pdf or wrangle_report.html** that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called **act_report.pdf or act_report.html** that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files. 

### Dirty and messy data (reminder)

dirty data = low quality data = content issues

---

untidy data = messy data = structural issues. Tidy data meets three requirements:
- each variable forms a column
- each observation forms a row
- each observational unit forms a table
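
As a small illustration of the first two requirements, the sketch below reshapes a table that spreads one variable across several columns. The three-row frame is made up for this example and is not taken from the archive.

```python
import pandas as pd

# Hypothetical untidy table: one column per dog stage instead of a single 'stage' variable.
untidy = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", None, None],
    "pupper": [None, "pupper", None],
})

# Melt the stage columns into rows, then keep only the rows that carry a value.
tidy = (
    untidy.melt(id_vars="tweet_id", value_name="stage")
          .dropna(subset=["stage"])
          .drop(columns="variable")
          .reset_index(drop=True)
)
print(tidy)
```

Tweets without any stage (here tweet `3`) drop out of the melted table, so a left merge back onto the original IDs would be needed to keep them.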

In [1]:
# import all necessary packages
import os
import io
import re
import json
import tweepy
import requests

import pandas as pd
import numpy as np

from tweepy import OAuthHandler
from bs4 import BeautifulSoup

### Load csv file provided by Udacity

In [2]:
# slurp in csv-file that was provided by Udacity
df1 = pd.read_csv('twitter-archive-enhanced.csv')

### Download tsv file provided by Udacity

In [3]:
# download tsv-file containing image predictions
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [4]:
# retrieve encoding that was used
response.encoding

'utf-8'

In [5]:
df2 = pd.read_csv(io.StringIO(response.content.decode('utf-8')), sep='\t')
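
The decode-and-parse step can be isolated into a small helper, sketched below. The helper name `tsv_bytes_to_frame` and the one-row sample payload are assumptions standing in for `response.content`. In the notebook itself, calling `response.raise_for_status()` before parsing would surface HTTP errors early.

```python
import io

import pandas as pd


def tsv_bytes_to_frame(raw: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Decode raw response bytes and parse them as a tab-separated table."""
    return pd.read_csv(io.StringIO(raw.decode(encoding)), sep="\t")


# Hypothetical payload standing in for response.content.
sample = b"tweet_id\tp1\tp1_conf\n666020888022790149\tWelsh_springer_spaniel\t0.465074\n"
frame = tsv_bytes_to_frame(sample)
```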

### Retrieve tweets through twitter's API

In [6]:
# consumer_key = 'HIDDEN'
# consumer_secret = 'HIDDEN'
# access_token = 'HIDDEN'
# access_secret = 'HIDDEN'

# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)

# api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [7]:
# fails = {}
# tweet_ids = df1.tweet_id

# with open('tweet_json.txt', 'w') as file:
#     for tweet_id in tweet_ids:
#         try:
#             tweet = api.get_status(tweet_id, tweet_mode='extended')
#             json.dump(tweet._json, file)
#             file.write('\n')
#         except tweepy.TweepError as exception:
#             fails[tweet_id] = exception
#             pass

In [8]:
df3 = []
with open('tweet_json.txt', 'r') as file:
    for l in file:
        js = json.loads(l)
        df3.append({'id': str(js['id']),
                    'retweet_count': js['retweet_count'],
                    'favorite_count': js['favorite_count']})
df3 = pd.DataFrame(df3, columns = ['id','retweet_count','favorite_count'])
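
The same line-by-line parsing can be written as a reusable function that also tolerates blank lines. This is a sketch: the function name `read_tweet_counts` is hypothetical, and the in-memory sample stands in for `tweet_json.txt`.

```python
import io
import json


def read_tweet_counts(lines):
    """Collect id, retweet_count and favorite_count from line-delimited JSON."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines instead of failing in json.loads
            continue
        js = json.loads(line)
        records.append({
            "id": str(js["id"]),
            "retweet_count": js["retweet_count"],
            "favorite_count": js["favorite_count"],
        })
    return records


# In-memory stand-in for the file; the blank line in the middle is deliberate.
sample = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 7729, "favorite_count": 36314}\n'
    "\n"
    '{"id": 666020888022790149, "retweet_count": 463, "favorite_count": 2423}\n'
)
records = read_tweet_counts(sample)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` exactly as in the cell above.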

### Scrape dog breed names from Wikipedia

In [9]:
url = 'https://en.wikipedia.org/wiki/List_of_dog_breeds'
response = requests.get(url)
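# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, 'html.parser')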

In [10]:
df4 = []

# drill down to bulleted list <ul>
uls = soup.find_all('ul')[3:7]

# loop through each bullet list
for ul in uls:
    # retrieve all instances of <a>
    tags = ul.find_all('a')
    for tag in tags:
        if tag.has_attr('title') and tag.has_attr('href'):
            df4.append({'dog_breed': tag['title'].lower().replace(' (dog)', ''),
                        'wiki_link': tag['href'].replace('/wiki','https://en.wikipedia.org/wiki')})
df4 = pd.DataFrame(df4, columns = ['dog_breed', 'wiki_link'])
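
The per-link clean-up can also be expressed with `urllib.parse.urljoin`, which resolves a relative `href` against a base URL instead of replacing a substring. The sketch below assumes the links are site-relative paths such as `/wiki/Afghan_Hound`. The helper name `breed_record` and the sample inputs are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/"


def breed_record(title: str, href: str) -> dict:
    """Normalize one link into a lowercase breed name and an absolute URL."""
    return {
        "dog_breed": title.lower().replace(" (dog)", ""),
        "wiki_link": urljoin(BASE_URL, href),
    }


record = breed_record("Afghan Hound", "/wiki/Afghan_Hound")
print(record)
```

Unlike `str.replace('/wiki', ...)`, `urljoin` leaves an already absolute URL untouched, which may matter if the page mixes relative and absolute links.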

## Assess

### df1

Let's first get an overview of the quality of this dataframe. I also look at the first and last few rows. In rare cases, the extracted rating does not match the rating given in the tweet text.

In [11]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [12]:
df1.head(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [13]:
df1.tail(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,


#### names

In [14]:
df1.name.value_counts()

None        745
a            55
Charlie      12
Lucy         11
Oliver       11
           ... 
Flurpson      1
Stewie        1
Skittle       1
Marq          1
Jarod         1
Name: name, Length: 957, dtype: int64
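
One way to surface entries such as `a` is to flag names that are entirely lowercase. This rests on the assumption that real dog names are capitalized, which may not hold for every tweet. The short series below is illustrative, not the actual column.

```python
import pandas as pd

# Illustrative sample of the name column.
names = pd.Series(["Phineas", "a", "None", "Tilly", "a"])

# Lowercase-only entries are treated as likely extraction errors (an assumption).
suspicious = names[names.str.islower()]
counts = suspicious.value_counts()
print(counts)
```

The string `None` is not caught by this check because it starts with a capital letter, so it needs separate handling.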

#### stages

In [15]:
df1.doggo.value_counts()

None     2259
doggo      97
Name: doggo, dtype: int64

In [16]:
df1.floofer.value_counts()

None       2346
floofer      10
Name: floofer, dtype: int64

In [17]:
df1.pupper.value_counts()

None      2099
pupper     257
Name: pupper, dtype: int64

In [18]:
df1.puppo.value_counts()

None     2326
puppo      30
Name: puppo, dtype: int64

#### ratings

In [19]:
# flag tweets whose text does not contain ' <numerator>/<denominator>' as stored in the rating columns
rating_check = [f' {num}/{den}' not in text
                for num, den, text in zip(df1.rating_numerator, df1.rating_denominator, df1.text)]

In [20]:
df1[pd.Series(rating_check)].text

45      This is Bella. She hopes her smile made you sm...
113     @ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...
274     @0_kelvin_0 &gt;10/10 is reserved for puppos s...
340     RT @dog_rates: This is Logan, the Chow who liv...
387     I was going to do 007/10, but the joke wasn't ...
                              ...                        
2260    RT @dogratingrating: Unoriginal idea. Blatant ...
2264    This is a southwest Coriander named Klint. Hat...
2301              12/10 gimme now https://t.co/QZAnwgnOMB
2307    12/10 simply brilliant pup https://t.co/V6ZzG4...
2321    "Can you behave? You're ruining my wedding day...
Name: text, Length: 67, dtype: object

In [21]:
df1.iloc[45].text

'This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948'

In [22]:
df1.iloc[45].rating_numerator

5

In [23]:
df1.iloc[340].text

"RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu…"

In [24]:
df1.iloc[340].rating_numerator

75
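
Both examples fail because the stored numerator keeps only the digits after the decimal point. A pattern that allows an optional decimal part recovers the full value, as sketched below with the standard `re` module. The function name `extract_rating` is hypothetical, and the function returns only the first match, so tweets that contain other fractions may still need manual review.

```python
import re

# Numerator with an optional decimal part, a slash, then the denominator.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) as floats, or None if no rating is found."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


bella = ("This is Bella. She hopes her smile made you smile. If not, she is also "
         "offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948")
print(extract_rating(bella))
```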

### df2

In [25]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [26]:
df2.head(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [27]:
df2.tail(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


In [50]:
df2.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


### df3

In [53]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              2331 non-null   object
 1   retweet_count   2331 non-null   int64 
 2   favorite_count  2331 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 54.8+ KB


In [54]:
df3.head(5)

Unnamed: 0,id,retweet_count,favorite_count
0,892420643555336193,7729,36314
1,892177421306343426,5721,31300
2,891815181378084864,3786,23576
3,891689557279858688,7883,39608
4,891327558926688256,8509,37803


In [55]:
df3.tail(5)

Unnamed: 0,id,retweet_count,favorite_count
2326,666049248165822465,40,96
2327,666044226329800704,132,272
2328,666033412701032449,41,112
2329,666029285002620928,42,121
2330,666020888022790149,463,2423


### df4

In [56]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   dog_breed  482 non-null    object
 1   wiki_link  482 non-null    object
dtypes: object(2)
memory usage: 7.7+ KB


In [57]:
df4.head(5)

Unnamed: 0,dog_breed,wiki_link
0,affenpinscher,https://en.wikipedia.org/wiki/Affenpinscher
1,afghan hound,https://en.wikipedia.org/wiki/Afghan_Hound
2,aidi,https://en.wikipedia.org/wiki/Aidi
3,airedale terrier,https://en.wikipedia.org/wiki/Airedale_Terrier
4,akbash,https://en.wikipedia.org/wiki/Akbash


In [58]:
df4.tail(5)

Unnamed: 0,dog_breed,wiki_link
477,wirehaired vizsla,https://en.wikipedia.org/wiki/Wirehaired_Vizsla
478,xiasi dog,https://en.wikipedia.org/wiki/Xiasi_Dog
479,mexican hairless dog,https://en.wikipedia.org/wiki/Mexican_Hairless...
480,yakutian laika,https://en.wikipedia.org/wiki/Yakutian_Laika
481,yorkshire terrier,https://en.wikipedia.org/wiki/Yorkshire_Terrier


#### Quality

##### df1 table
- tweet_id is an integer, not a string
- timestamp is a string, not a datetime object
- name contains incorrect strings (e.g. 'a')
- missing values in the columns name, doggo, floofer, pupper, and puppo are stored as the string 'None' rather than as null values
- rating_numerator sometimes does not match the rating given in text (e.g. 5 instead of 13.5)
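
The type-related issues in this list could be addressed along the following lines. This is a sketch on a two-row illustrative frame, not the cleaning code used in this notebook.

```python
import pandas as pd

# Illustrative rows modeled on the archive's columns.
archive = pd.DataFrame({
    "tweet_id": [892420643555336193, 666020888022790149],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2015-11-15 22:32:08 +0000"],
    "name": ["Phineas", "None"],
})

# IDs are labels, not quantities, so store them as strings.
archive["tweet_id"] = archive["tweet_id"].astype(str)

# Parse the timestamps into timezone-aware datetimes.
archive["timestamp"] = pd.to_datetime(archive["timestamp"], utc=True)

# Turn the placeholder string 'None' into a real missing value.
archive["name"] = archive["name"].mask(archive["name"] == "None")
```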

##### df2 table
- tweet_id is an integer, not a string
- predicted dog breed names mix lowercase and uppercase letters
- predicted dog breed names contain underscores
- predictions include items that are not dogs
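
A possible treatment of the breed-name issues is sketched below: keep only rows whose top prediction is a dog, then normalize case and underscores. The three sample rows reuse prediction values shown in the output above. The column name `breed` is an assumption.

```python
import pandas as pd

# Sample rows modeled on the p1 and p1_dog columns.
predictions = pd.DataFrame({
    "p1": ["Welsh_springer_spaniel", "paper_towel", "Chihuahua"],
    "p1_dog": [True, False, True],
})

# Keep dog predictions only, then lowercase and replace underscores with spaces.
dogs_only = predictions[predictions["p1_dog"]].copy()
dogs_only["breed"] = dogs_only["p1"].str.lower().str.replace("_", " ", regex=False)
print(dogs_only)
```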

#### Tidiness
- columns with stages of dogs (i.e. doggo, pupper, puppo, floofer) should be combined into one column
- in df1 drop columns that are data-sparse and that will not be used in further analysis
- in df2 drop columns containing p2 and p3 prediction information
- all 4 dataframes should be combined into one master dataframe
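
Combining the tables comes down to merging on the tweet ID. The sketch below uses two small made-up frames. The `validate` argument is optional and simply makes pandas raise an error if an ID appears more than once on either side.

```python
import pandas as pd

# Made-up frames; note that both key columns share the same dtype (str).
archive = pd.DataFrame({"tweet_id": ["1", "2", "3"], "rating_numerator": [13, 12, 5]})
counts = pd.DataFrame({"id": ["1", "2"], "retweet_count": [7729, 5721]})

# Inner merge keeps only tweets present in both tables; drop the duplicate key afterwards.
master = archive.merge(
    counts, left_on="tweet_id", right_on="id", how="inner", validate="one_to_one"
).drop(columns="id")
print(master)
```

The key columns need matching dtypes before merging, which is one reason to settle the `tweet_id` type first.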

## Clean

In [None]:
df1_clean = df1.copy()
df2_clean = df2.copy()
df3_clean = df3.copy()
df4_clean = df4.copy()

In [50]:
df1[['name','doggo','floofer','pupper','puppo']] = df1[['name','doggo','floofer','pupper','puppo']].replace({'None': None})

In [52]:
df1['stage'] = df1[['doggo','floofer','pupper','puppo']].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)

In [64]:
len(df1[df1.stage.str.find(',') > 0])

14

In [13]:
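# extract name, stage, and rating from the full tweet text with regular expressions
# (requires a `full_text` column in df3, which the loading cell above does not extract)
name = df3.full_text.str.extract(r'(?:[Tt]his is |[Mm]eet | named |[Ss]ay hello to |[Hh]ere is )((?:[A-Z]\w+)(?: (?:&amp;|and) [A-Z]\w+)?)', expand=False)
stage = df3.full_text.str.extract(r'[^\w](doggo|pupper|puppo|floofer)[^\w]', flags=re.I, expand=False)
rating = df3.full_text.str.extract(r'((?:\d+\.?\d+)|(?:\d+))/(\d{2,})', expand=False)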

In [14]:
df3['name'] = name
df3['stage'] = stage
df3[['rating_num','rating_denom']] = rating

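# cast to a 64-bit integer explicitly: tweet IDs overflow 32-bit integers, and df1/df2 store tweet_id as int64
df3['id'] = df3['id'].astype('int64')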

In [18]:
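dog_breeds = set(df4.dog_breed)  # assumption: dog_breeds is the set of breed names scraped into df4
df2['conf'] = [i if i in dog_breeds else None for i in df2.p1.str.lower().str.replace('_', ' ')]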

In [19]:
df = df1.merge(df2, on='tweet_id')
df = df.merge(df3, left_on='tweet_id', right_on='id')

In [20]:
df.to_clipboard()
