# DATA WRANGLING DOCUMENTATION

## Packages
- Pandas
- NumPy
- Requests
- os
- tweepy
- json

## GATHER
Data for this project comes from three sources: the WeRateDogs Twitter archive, image predictions, and the Twitter API.

### WeRateDogs Twitter Archive
Udacity provides this dataset as a downloadable CSV file. It was loaded with the pandas library.

### Image Predictions
This dataset is hosted on Udacity's servers. It was downloaded programmatically with the Requests library and loaded with pandas.

### Twitter API
The retweet and favorite counts for each tweet in the WeRateDogs Twitter archive would normally be gathered by querying the Twitter API with Tweepy. Since I wasn't granted a Twitter developer account, I used the copy of this data provided by Udacity instead. It was downloaded programmatically with the Requests library and loaded with pandas.
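
As a rough illustration of the loading step, the sketch below pulls the two counts out of tweet JSON. It assumes the provided file stores one JSON object per line with `id`, `retweet_count`, and `favorite_count` keys; that layout and the helper name `parse_tweet_lines` are assumptions, not details confirmed above.

```python
import json


def parse_tweet_lines(lines):
    """Return one dict per tweet with only the fields needed downstream."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return rows


# Made-up sample lines standing in for the real file (assumed layout).
sample = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
rows = parse_tweet_lines(sample)
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`.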

## ASSESS
Each of the three datasets was assessed for quality and tidiness. Findings are listed below:

### WeRateDogs Twitter Archive
>**QUALITY**
- Incorrect data types for the `tweet_id` and `timestamp` columns
- Missing names, incorrect name tagged
- Replies and retweets included in dataset
- `text` includes link
- Incorrect extracted ratings (416, 878, 1407, 2054)
- Maximum `rating_denominator` is 170
- Missing values for `doggo`, `floofer`, `pupper`, `puppo` represented as string '*None*'

>**TIDINESS**
- Separate columns for dog levels (`doggo`, `pupper`, `puppo`, `floofer`)

### Image Predictions
>**QUALITY**
- Incorrect data type for `tweet_id`
- Dog breed capitalization

>**TIDINESS**
- Multiple columns for breed predictions
- Can be merged with twitter archive dataset

### Twitter API
>**QUALITY**
- Incorrect data type for `tweet_id`

>**TIDINESS**
- Can be merged with twitter archive dataset

## CLEAN

### Missing Values
**Missing names, incorrect name tagged**

>Repopulate the `name` column using a for loop, if statements, and a combination of string methods to extract the name. The strings *'name is', 'named', 'This is', 'Meet', and 'hello to'* are typically used just before the name is introduced. Rows whose text does not match any of these are set to null. (*Missing names issue addressed here*)
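
The extraction described above could look roughly like the following sketch. The function name `extract_name` and the rule that a name must start with a capital letter are illustrative assumptions, not necessarily what the project code does.

```python
def extract_name(text):
    """Return the word after a known introduction phrase, or None."""
    phrases = ["name is ", "named ", "This is ", "Meet ", "hello to "]
    for phrase in phrases:
        start = text.find(phrase)
        if start == -1:
            continue  # phrase not present, try the next one
        words = text[start + len(phrase):].split()
        if not words:
            continue
        candidate = words[0].strip(".,!?")
        # Assumption: keep only capitalized words, so that
        # "This is a dog" does not yield the name "a".
        if candidate[:1].isupper():
            return candidate
    return None


print(extract_name("This is Bella. She loves snow. 12/10"))
print(extract_name("This is a very fluffy dog. 11/10"))
```

Applied to a whole column, this would be something like `df["name"] = df["text"].apply(extract_name)`.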

**Missing values for `doggo`, `floofer`, `pupper`, `puppo` represented as string '*None*'**

>Extract the dog stage from the `text` column by searching for *doggo, Doggo, puppo, Puppo, pupper, floofer, Floofer*. Return the corresponding dog stage if found; otherwise, return a null value.
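
A minimal sketch of this lookup is shown below. It lowercases the text instead of listing each capitalized variant, which also matches spellings such as "Pupper" that are not in the list above; that choice and the name `extract_stages` are assumptions made for illustration.

```python
STAGES = ("doggo", "floofer", "pupper", "puppo")


def extract_stages(text):
    """Map each stage to its own name if mentioned in `text`, else None."""
    lowered = text.lower()  # covers "Doggo", "Puppo", "Floofer", ...
    return {stage: (stage if stage in lowered else None) for stage in STAGES}


result = extract_stages("Here is a Doggo and her pupper. 12/10")
print(result)
```

Each key of the returned dict corresponds to one of the four stage columns, with `None` standing in for the null value.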

### Tidiness
**Separate columns for dog levels (`doggo`, `pupper`, `puppo`, `floofer`)**

>Create a new column, `dog_stage`, that identifies the dog level for each row from the `doggo`, `pupper`, `puppo`, and `floofer` columns. If a row has two classifications, return the string "*multiple*". Drop the `doggo`, `pupper`, `puppo`, and `floofer` columns once the new column is created.
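
One way to collapse the four columns is sketched below on a small made-up frame. It assumes missing stages are already stored as nulls rather than the string 'None'; the helper name `combine_stages` is hypothetical.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Collapse the four stage columns of one row into a single label."""
    found = [row[col] for col in STAGE_COLUMNS if pd.notna(row[col])]
    if len(found) > 1:
        return "multiple"
    return found[0] if found else None


# Made-up rows: one doggo, one pupper, one with two stages, one with none.
df = pd.DataFrame({
    "doggo": ["doggo", None, "doggo", None],
    "floofer": [None, None, None, None],
    "pupper": [None, "pupper", "pupper", None],
    "puppo": [None, None, None, None],
})
df["dog_stage"] = df.apply(combine_stages, axis=1)
df = df.drop(columns=STAGE_COLUMNS)
print(df)
```

Row-wise `apply` is slow on large frames, but it keeps the "multiple" rule easy to read for a dataset of this size.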

**Multiple columns for breed predictions**

>Create one column for the predicted breed. For each tweet, choose the most confident prediction that is a breed of dog, using a for loop and if statements. Tweets whose predictions are all 'False' (not a dog) are tagged as NaN.
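
The selection rule can be sketched as follows. The column names `p1`, `p1_dog`, and so on, and the assumption that predictions are already ordered from most to least confident, are illustrative; they are not listed in this document.

```python
import numpy as np
import pandas as pd


def pick_breed(row):
    """Return the first prediction flagged as a dog, else NaN."""
    for rank in (1, 2, 3):  # assumed ordered from most to least confident
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"]
    return np.nan


# Made-up predictions using the assumed column layout.
predictions = pd.DataFrame({
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_dog": [True, False, False],
    "p2": ["labrador_retriever", "pug", "hamster"],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "chihuahua", "teddy"],
    "p3_dog": [True, True, False],
})
predictions["predicted_breed"] = predictions.apply(pick_breed, axis=1)
print(predictions["predicted_breed"])
```

The second row falls through to the second prediction because the first is not a dog, and the third row ends up as NaN.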

**Merge all three datasets**

>Merge all three tables on the unique identifier, `tweet_id`, using the pd.merge() function.
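
A small sketch of the merge, using made-up frames. The inner join (`how="inner"`) is an assumption; the join type is not stated above.

```python
import pandas as pd

# Made-up stand-ins for the three cleaned tables.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 10]})
predictions = pd.DataFrame({"tweet_id": [1, 2], "predicted_breed": ["pug", "chow"]})
counts = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [5, 8, 2],
    "favorite_count": [20, 30, 9],
})

# Assumption: an inner join keeps only tweets present in every table.
merged = pd.merge(archive, predictions, on="tweet_id", how="inner")
merged = pd.merge(merged, counts, on="tweet_id", how="inner")
print(merged.shape)  # (2, 5)
```

With an inner join, tweet 3 is dropped because it has no image prediction. A left join would keep it with a null `predicted_breed`.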

### Quality
**Incorrect data types**
>Convert the following columns to correct data types:
- `tweet_id`, `dog_stage`, `predicted_breed` to category
- `timestamp` to datetime
- `retweet_count`, `favorite_count` to int
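
The conversions listed above might be written as in this sketch, which uses a made-up two-row frame with the same column names.

```python
import pandas as pd

# Made-up rows; only the column names follow the document.
df = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": ["2017-08-01 16:23:56", "2017-08-01 00:17:27"],
    "dog_stage": ["doggo", None],
    "predicted_breed": ["pug", None],
    "retweet_count": [8.0, 6.0],
    "favorite_count": [39.0, 33.0],
})

df = df.astype({
    "tweet_id": "category",
    "dog_stage": "category",
    "predicted_breed": "category",
    "retweet_count": "int64",
    "favorite_count": "int64",
})
df["timestamp"] = pd.to_datetime(df["timestamp"])
print(df.dtypes)
```

Note that category and datetime dtypes are not stored in a CSV file, so they revert to `int64` or `object` when the saved file is read back with `pd.read_csv` unless they are converted again.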

**Replies and retweets included in dataset**
>Drop retweets and replies by keeping only the rows where `in_reply_to_status_id`, `in_reply_to_user_id`, and `retweeted_status_id` are null.
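
A minimal sketch of this filter on a made-up frame: a row counts as a reply or a retweet when the corresponding ID column is non-null.

```python
import numpy as np
import pandas as pd

# Made-up rows: tweet 2 is a reply, tweet 3 is a retweet.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [np.nan, 55.0, np.nan, np.nan],
    "in_reply_to_user_id": [np.nan, 77.0, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 99.0, np.nan],
})

is_reply = archive["in_reply_to_status_id"].notna()
is_retweet = archive["retweeted_status_id"].notna()

# Keep only original tweets.
originals = archive[~is_reply & ~is_retweet].reset_index(drop=True)
print(originals["tweet_id"].tolist())  # [1, 4]
```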

**`text` column includes link**
>Locate the string '*https:*' with the .find() method and keep only the text before it, looping over the rows.
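
For a single tweet, the step above can be sketched like this. The function name `strip_link` and the sample URL are made up for illustration.

```python
def strip_link(text):
    """Return `text` without the trailing link, if one is present."""
    position = text.find("https:")
    if position == -1:
        return text  # no link found, leave the text unchanged
    return text[:position].rstrip()


print(strip_link("This is Bella. 12/10 https://t.co/abc123"))
# This is Bella. 12/10
```

Checking for `-1` matters here: without it, `text[:-1]` would silently drop the last character of every tweet that has no link.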

**Visual Assessment: Incorrect extracted ratings (416, 878, 969, 1407, 2054)**
>Ratings for rows 416, 878, 1407, 2054 are incorrect. Check the text for each row and update ratings.

**Maximum observation for `rating_denominator` is 170**
>A `rating_denominator` greater than 10 usually means the tweet rates more than one dog, with the rating multiplied by the number of dogs. To standardize, we'll divide the rating by the number of dogs.
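
The arithmetic can be sketched as below, assuming each dog contributes 10 to the denominator so that the number of dogs is `denominator / 10`. The function name and the sample numerator are illustrative.

```python
def normalize_rating(numerator, denominator):
    """Scale a group rating back to a per-dog rating out of 10."""
    if denominator <= 10:
        return numerator, denominator  # single-dog rating, unchanged
    dogs = denominator / 10  # assumption: 10 points of denominator per dog
    return numerator / dogs, 10


print(normalize_rating(204, 170))  # (12.0, 10)
print(normalize_rating(13, 10))    # (13, 10)
```

A denominator of 170 is treated as 17 dogs, so an illustrative group rating of 204/170 becomes 12/10 per dog.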

**Drop irrelevant columns**
>Drop columns `in_reply_to_status_id`, `in_reply_to_user_id`, `source`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`.

## FINAL RESULT
The combined and cleaned dataset now consists of 1971 rows with 12 columns.

```python
import pandas as pd
df = pd.read_csv('twitter_archive_master.csv')
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1971 entries, 0 to 1970
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   tweet_id            1971 non-null   int64
 1   timestamp           1971 non-null   object
 2   text                1971 non-null   object
 3   expanded_urls       1971 non-null   object
 4   rating_numerator    1971 non-null   int64
 5   rating_denominator  1971 non-null   int64
 6   name                1374 non-null   object
 7   dog_stage           336 non-null    object
 8   jpg_url             1971 non-null   object
 9   predicted_breed     1666 non-null   object
 10  retweet_count       1971 non-null   int64
 11  favorite_count      1971 non-null   int64
dtypes: int64(5), object(7)
memory usage: 184.9+ KB
```
