## Gather

Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

#### Assessing Data for this Project
After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least **eight (8)** quality issues and **two (2)** tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.
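
A programmatic assessment can be as simple as a handful of pandas checks. The sketch below runs on a small made-up DataFrame (the values are hypothetical; only the column names mirror the archive) and looks for duplicated IDs, missing values, unusual denominators, and names that are probably not names.

```python
import pandas as pd

# Hypothetical sample that mimics a few archive columns
sample = pd.DataFrame({
    'tweet_id': [1, 2, 2, 3],
    'rating_denominator': [10, 10, 10, 50],
    'name': ['Bella', 'a', 'a', None],
})

# Quality check: duplicated tweet IDs
n_duplicates = int(sample['tweet_id'].duplicated().sum())

# Quality check: missing values per column
n_missing = sample.isna().sum()

# Quality check: denominators that differ from the usual 10
odd_denominators = sample.loc[sample['rating_denominator'] != 10, 'tweet_id'].tolist()

# Quality check: all-lowercase "names" are likely extraction errors
lowercase_names = sample['name'].dropna().loc[lambda s: s.str.islower()].unique().tolist()

print(n_duplicates, odd_denominators, lowercase_names)
```

Checks like these complement, rather than replace, scrolling through the data visually.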

#### Cleaning Data for this Project
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

#### Storing, Analyzing, and Visualizing Data for this Project
Store the clean DataFrame(s) in a CSV file with the main one named **twitter_archive_master.csv**. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, **you may store the cleaned data in a SQLite database** (which is to be submitted as well if you do).
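
As a rough sketch of this storage step, the snippet below writes a stand-in DataFrame to CSV and to a SQLite database. The DataFrame contents, the temporary output directory, the `.db` file name, and the table name `master` are all assumptions made for illustration; only the CSV file name comes from the project brief.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned master DataFrame
master = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'rating_numerator': [12, 13, 11],
    'rating_denominator': [10, 10, 10],
})

# Write to a temporary directory so the sketch has no side effects
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'twitter_archive_master.csv')
db_path = os.path.join(out_dir, 'twitter_archive_master.db')

# index=False keeps the RangeIndex out of the CSV file
master.to_csv(csv_path, index=False)

# Optional: store the same table in SQLite and read it back
conn = sqlite3.connect(db_path)
master.to_sql('master', conn, if_exists='replace', index=False)
restored = pd.read_sql('SELECT * FROM master', conn)
conn.close()
```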

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least **three (3) insights and one (1) visualization** must be produced.

#### Reporting for this Project
Create a 300-600 word written report called **wrangle_report.pdf or wrangle_report.html** that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called **act_report.pdf or act_report.html** that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both documents can be written in separate Jupyter Notebooks using Markdown cells and then downloaded as PDF or HTML files. Alternatively, you might prefer a word processor such as Google Docs or Microsoft Word.

## Dirty and messy data

dirty data = low quality data = content issues

untidy data = messy data = structural issues. Tidy data meets three requirements:
- each variable forms a column
- each observation forms a row
- each observational unit forms a table
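
To make the structural side concrete, here is a minimal sketch that reshapes a wide table into a long one with `melt`. The data is made up, and whether a given table actually needs this reshaping is a judgment call; the point is only to show a variable (the prediction rank) moving from column headers into a column of its own.

```python
import pandas as pd

# Hypothetical untidy table: the prediction rank is hidden in the column headers
untidy = pd.DataFrame({
    'tweet_id': [1, 2],
    'p1': ['pug', 'beagle'],
    'p2': ['boxer', 'basset'],
})

# Each (tweet, rank) pair becomes one row; rank and breed become columns
tidy = (untidy
        .melt(id_vars='tweet_id', var_name='rank', value_name='breed')
        .sort_values(['tweet_id', 'rank'])
        .reset_index(drop=True))

print(tidy)
```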

In [1]:
# import all necessary packages
import os
import io
import re
import json
import tweepy
import requests

import pandas as pd
import numpy as np

from tweepy import OAuthHandler
from bs4 import BeautifulSoup

### Load CSV file provided by Udacity

In [2]:
# load the CSV file that was provided by Udacity
df1 = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

### Download TSV file provided by Udacity

In [4]:
# download tsv-file containing image predictions
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
# retrieve encoding that was used
response.encoding

'utf-8'

In [5]:
df2 = pd.read_csv(io.StringIO(response.content.decode('utf-8')), sep='\t')

In [6]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


### Retrieve tweets through Twitter's API

In [7]:
# consumer_key = 'HIDDEN'
# consumer_secret = 'HIDDEN'
# access_token = 'HIDDEN'
# access_secret = 'HIDDEN'

# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)

# api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [8]:
# fails = {}
# tweet_ids = df1.tweet_id

# with open('tweet_json.txt', 'w') as file:
#     for tweet_id in tweet_ids:
#         try:
#             tweet = api.get_status(tweet_id, tweet_mode='extended')
#             json.dump(tweet._json, file)
#             file.write('\n')
#         except tweepy.TweepError as exception:
#             fails[tweet_id] = exception
#             pass

In [9]:
with open('tweet_json.txt', 'r') as file:
    for l in file:
        js = json.loads(l)
        break

In [10]:
js.keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'extended_entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'possibly_sensitive_appealable', 'lang'])

In [11]:
js['extended_entities']['media'][0]['media_url']

'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg'

In [12]:
# collect one dict per tweet, then build the DataFrame in a single step
tweets = []
with open('tweet_json.txt', 'r') as file:
    for line in file:
        js = json.loads(line)
        tweets.append({'id': str(js['id']),
                       'timestamp': pd.to_datetime(js['created_at']),
                       'source': js['source'],
                       # listed in the columns below, so it has to be collected here as well
                       'in_reply_to_status_id': js['in_reply_to_status_id'],
                       'full_text': js['full_text'],
                       'retweet_count': js['retweet_count'],
                       'favorite_count': js['favorite_count']})
df3 = pd.DataFrame(tweets, columns=['id', 'timestamp', 'source', 'in_reply_to_status_id', 'retweet_count', 'favorite_count', 'full_text'])

In [13]:
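# name: capitalized word (optionally "X and Y") after a typical introduction phrase; raw strings keep \w intact
name = df3.full_text.str.extract(r'(?:[Tt]his is |[Mm]eet | named |[Ss]ay hello to |[Hh]ere is )((?:[A-Z]\w+)(?: (?:&amp;|and) [A-Z]\w+)?)', expand=False)
# stage: first stage keyword surrounded by non-word characters, case-insensitive
stage = df3.full_text.str.extract(r'[^\w](doggo|pupper|puppo|floofer)[^\w]', flags=re.I, expand=False)
# rating: numerator (integer or decimal) and a denominator of at least two digits
rating = df3.full_text.str.extract(r'((?:\d+\.?\d+)|(?:\d+))/(\d{2,})', expand=False)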

In [14]:
df3['name'] = name
df3['stage'] = stage
df3[['rating_num','rating_denom']] = rating

# tweet IDs exceed the 32-bit range, so request 64-bit integers explicitly
df3['id'] = df3['id'].astype('int64')
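
To see what the rating pattern above captures, it can help to run it on a few strings in isolation. The sketch below uses made-up tweet texts (not taken from the archive) and the same regular expression.

```python
import pandas as pd

# Hypothetical tweet texts for illustration only
texts = pd.Series([
    'This is Bella. She is a good girl. 13.5/10',
    'Meet Sam. 12/10 would pet',
    'No rating here',
])

# Same pattern as in the cell above: numerator, slash, denominator (2+ digits)
pattern = r'((?:\d+\.?\d+)|(?:\d+))/(\d{2,})'

# Two capture groups yield a DataFrame with one column per group
rating = texts.str.extract(pattern, expand=False)
rating.columns = ['rating_num', 'rating_denom']
rating = rating.astype(float)

print(rating)
```

Two limitations follow directly from the pattern: a single-digit denominator never matches, and a string such as `9/11` is captured as if it were a rating, so extracted values still need to be assessed.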

### Download a list of dog breeds from Wikipedia

In [15]:
url = 'https://en.wikipedia.org/wiki/List_of_dog_breeds#A–C'
response = requests.get(url)

In [16]:
# name the parser explicitly to avoid a warning and environment-dependent parsing
soup = BeautifulSoup(response.content, 'html.parser')

In [17]:
dog_breeds = []

# drill down to bulleted list <ul>
uls = soup.find_all('ul')[3:7]

for ul in uls:
    # retrieve all instances of <a>
    all_a = ul.find_all('a')
    for a in all_a:
        if a.has_attr('title'):
            dog_breeds.append(a.contents[0].lower())

In [18]:
df2['conf'] = [i if i in dog_breeds else None for i in df2.p1.str.lower().str.replace('_', ' ')]

In [19]:
df = df1.merge(df2, on='tweet_id')
df = df.merge(df3, left_on='tweet_id', right_on='id')
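
Both merges above use the default inner join, which silently drops tweets that are missing from either table. One way to quantify that, sketched here on two tiny made-up tables, is an outer merge with `indicator=True`.

```python
import pandas as pd

# Hypothetical stand-ins for the archive and the image predictions
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
predictions = pd.DataFrame({'tweet_id': [1, 3, 4], 'p1': ['pug', 'beagle', 'boxer']})

# The _merge column records where each row came from
check = archive.merge(predictions, on='tweet_id', how='outer', indicator=True)
counts = check['_merge'].value_counts()

# The default inner join keeps only the rows marked 'both'
inner = archive.merge(predictions, on='tweet_id')

print(counts)
```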

In [20]:
df.to_clipboard()

In [21]:
dog_breeds

['affenpinscher',
 'afghan hound',
 'aidi',
 'airedale terrier',
 'akbash',
 'akita',
 'alano español',
 'alaskan husky',
 'alaskan klee kai',
 'alaskan malamute',
 'alpine dachsbracke',
 'american bulldog',
 'american bully',
 'american cocker spaniel',
 'american english coonhound',
 'american eskimo dog',
 'american foxhound',
 'american hairless terrier',
 'american pit bull terrier',
 'american staffordshire terrier',
 'american water spaniel',
 'anatolian shepherd dog',
 'andalusian hound',
 'anglo-français de petite vénerie',
 'appenzeller sennenhund',
 'ariegeois',
 'armant',
 'armenian gampr dog',
 'artois hound',
 'australian cattle dog',
 'australian kelpie',
 'australian shepherd',
 'australian stumpy tail cattle dog',
 'australian terrier',
 'austrian black and tan hound',
 'austrian pinscher',
 'azawakh',
 'bakharwal dog',
 'banjara hound',
 'barbado da terceira',
 'barbet',
 'basenji',
 'basque shepherd dog',
 'basset artésien normand',
 'basset bleu de gascogne',
 'bass

## Assess

#### Quality
##### `df1` table
- timestamp is stored as a string (object dtype) rather than a datetime

- Zip code is a float not a string
- Zip code has four digits sometimes
- Tim Neudorf height is 27 in instead of 72 in
- Full state names sometimes, abbreviations other times
- Dsvid Gustafsson
- Missing demographic information (address - contact columns) ***(can't clean yet)***
- Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns)
- Multiple phone number formats
- Default John Doe data
- Multiple records for Jakobsen, Gersten, Taylor
- kgs instead of lbs for Zaitseva weight

##### `treatments` table
- Missing HbA1c changes
- The letter 'u' in starting and ending doses for Auralin and Novodra
- Lowercase given names and surnames
- Missing records (280 instead of 350)
- Erroneous datatypes (auralin and novodra columns)
- Inaccurate HbA1c changes (leading 4s mistaken as 9s)
- Nulls represented as dashes (-) in auralin and novodra columns

##### `adverse_reactions` table
- Lowercase given names and surnames

#### Tidiness
- Dog stage columns (doggo, floofer, pupper, puppo) should be combined into a single column

- Contact column in `patients` table should be split into phone number and email
- Three variables in two columns in `treatments` table (treatment, start dose and end dose)
- Adverse reaction should be part of the `treatments` table
- Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables
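
The dog stage item above could be addressed along these lines. This is only a sketch on made-up rows, and it assumes that a missing stage is stored as the string `'None'`; that assumption should be verified against the actual archive before reuse.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; assumes missing stages are stored as the string 'None'
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'doggo':   ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None', 'None'],
})
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Turn the 'None' placeholders into real missing values
stages = sample[stage_cols].where(sample[stage_cols] != 'None')

# Join every stage present in a row, so tweets with two stages are not lost;
# rows without any stage become NaN
sample['stage'] = stages.apply(lambda row: ', '.join(row.dropna()) or np.nan, axis=1)

# Drop the four original columns once the combined column exists
tidy = sample.drop(columns=stage_cols)

print(tidy)
```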

## Clean

In [None]:
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()

### Quality

#### `treatments`: Missing records (280 instead of 350)

##### Define
Import the cut treatments into a DataFrame and concatenate it with the original treatments DataFrame.

##### Code
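
One possible shape for this step, shown on small in-memory stand-ins. The two DataFrames and their contents are hypothetical; in practice the cut records would be read from their own file first.

```python
import pandas as pd

# Hypothetical stand-ins for the existing and the cut treatment records
treatments_clean = pd.DataFrame({
    'given_name': ['ann', 'bob'],
    'surname': ['lee', 'ray'],
})
treatments_cut = pd.DataFrame({
    'given_name': ['cat'],
    'surname': ['fox'],
})

# Stack the two tables; ignore_index=True rebuilds a clean 0..n-1 index
treatments_clean = pd.concat([treatments_clean, treatments_cut], ignore_index=True)

print(treatments_clean)
```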

##### Test

### Tidiness

#### Contact column in `patients` table contains two variables: phone number and email

##### Define
Extract the *phone number* and *email* variables from the *contact* column using regular expressions and pandas' `str.extract` method. Drop the *contact* column when done.

##### Code
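
A minimal sketch of the extraction, assuming the phone number and email are simply concatenated in either order. The sample values and both regular expressions are illustrative and would need checking against the real *contact* column.

```python
import pandas as pd

# Hypothetical contact strings: phone and email concatenated in either order
patients_clean = pd.DataFrame({
    'contact': [
        '555-010-9170zoe@example.com',
        'pat@example.org+1 (555) 010-4421',
    ],
})

# Phone: optional country code, then 3-3-4 digits with common separators
phone_pattern = r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4})'
# Email: starts with a letter, contains @, ends with a dotted domain
email_pattern = r'([a-zA-Z][a-zA-Z0-9._-]*@[a-zA-Z0-9.-]+\.[a-zA-Z]+)'

patients_clean['phone_number'] = patients_clean['contact'].str.extract(phone_pattern, expand=False)
patients_clean['email'] = patients_clean['contact'].str.extract(email_pattern, expand=False)

# Drop the combined column once both variables have been extracted
patients_clean = patients_clean.drop(columns='contact')

print(patients_clean)
```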

##### Test