# Wrangle and Analyze Data

## Gathering Data

To gather the data, I used 3 different sources: 

1. The WeRateDogs Twitter archive. I downloaded this file manually from the following [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv).

2. The tweet image predictions. I downloaded this file programmatically from this [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv).

3. The Twitter API. Using Tweepy, I downloaded the tweet data (based on the `tweet_id` values from the WeRateDogs Twitter archive), saved it to a text file called `tweet_json.txt`, and then created a dataframe with the following variables:
    - retweet count 
    - favorite count 
    - Created at
    - Full text
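
As a rough sketch of the last step, the snippet below turns the saved tweet JSON into flat records with the four variables above plus the tweet ID. It assumes `tweet_json.txt` holds one JSON object per line and that the tweets were fetched in extended mode (so `full_text` is present); the function name `parse_tweet_lines` is hypothetical.

```python
import json


def parse_tweet_lines(lines):
    """Turn JSON-lines text (one tweet object per line) into flat records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
            "created_at": tweet["created_at"],
            # Assumption: only present when tweets were fetched in extended mode.
            "full_text": tweet.get("full_text"),
        })
    return records


# Made-up sample line standing in for the contents of tweet_json.txt.
sample = [
    '{"id": 1, "retweet_count": 2, "favorite_count": 3, '
    '"created_at": "Tue Aug 01 16:23:56 +0000 2017", "full_text": "hi"}'
]
records = parse_tweet_lines(sample)
```

With the real file, passing the open file handle to `parse_tweet_lines` and then the result to `pd.DataFrame` would give the dataframe described above.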

## Assessing Data

### Data Quality

#### **1. Missing Data**

**Tweet Archive Dataset**
- `name`: 715 tweets have *None* as the name.

#### **2. Duplicated Data**

**Tweet API dataset**
- 3 duplicated tweetIDs

#### **3. Data Type Issues**

**Image Prediction Dataset**:
- `tweet_id` should be string.

**Tweet API dataset**
- `tweet_id`: The current type is int64. I will convert it to string since no calculation is required.
- `created_at`: The current type is object (str) rather than datetime. I will change it to datetime.
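
A minimal sketch of these two conversions, using made-up IDs and timestamps. The `format` string assumes `created_at` follows the usual Twitter API layout (for example `Tue Aug 01 16:23:56 +0000 2017`).

```python
import pandas as pd

# Made-up rows standing in for the Tweet API dataset.
tweet_df = pd.DataFrame({
    "tweet_id": [100000000000000001, 100000000000000002],
    "created_at": [
        "Tue Aug 01 16:23:56 +0000 2017",
        "Mon Jul 31 00:18:03 +0000 2017",
    ],
})

# IDs are labels, not quantities, so store them as strings.
tweet_df["tweet_id"] = tweet_df["tweet_id"].astype(str)

# Parse the timestamp text into timezone-aware datetimes.
tweet_df["created_at"] = pd.to_datetime(
    tweet_df["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)
```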

#### **4. Content Issues**

**Tweet Archive Dataset**:
- `name`: has invalid values, such as: None, a, an, the.
- Ratings include erroneous values. For example, there is a tweet where the denominator is 0. Also, there are tweets where the text includes more than one rating pattern (##/##).
- I found 1 dog that is both *doggo* and *floofer*.
- I found 12 dogs that are both *doggo* and *pupper*.
- I found 1 dog that is both *doggo* and *puppo*.
- `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`: I found 181 retweets. The retweets and these columns will be removed from the dataset.
- `in_reply_to_status_id`, `in_reply_to_user_id`: columns to be dropped.
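
One way to surface the rating problems listed above is to scan the tweet text for every `##/##` pattern and flag tweets that have more than one match or a zero denominator. This is only an illustrative sketch; the helper names `find_ratings` and `is_suspicious` are hypothetical, and the pattern will also match dates or fractions such as `24/7`, so flagged tweets still need a manual look.

```python
import re

# Numerator may contain a decimal part; denominator is an integer.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def find_ratings(text):
    """Return every numerator/denominator pair found in the text."""
    return [(float(n), int(d)) for n, d in RATING_RE.findall(text)]


def is_suspicious(text):
    """Flag text with zero or several rating patterns, or a zero denominator."""
    ratings = find_ratings(text)
    return len(ratings) != 1 or ratings[0][1] == 0


pairs = find_ratings("Meet Sam. 9/10 would pet. His brother is 11/10")
```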

**Image Prediction Dataset**:
- `p1`, `p2`, `p3`: Some entries are lower case, and some contain underscores.


### Data Tidiness:

- Merge the three datasets.
- Delete the `text` column from the merged dataset as I included the `full_text` column in the `tweet_df` dataframe.
- Delete the `timestamp` column from the merged dataset as I included the `created_at` column in the `tweet_df` dataframe. 
- Create separate columns for date and time.
- Parse the column `source` to show the direct source (for example: iphone, tweetdeck, etc.) rather than the HTML statement.
- Create a new column called `rating_ratio` which is `rating_numerator` divided by `rating_denominator`.
- `full_text`: Some tweets reference more than one dog and include multiple stages and ratings. I will create a new column called `stage` to hold the dog stage and delete the four separate stage columns. For tweets that reference multiple stages, `stage` will hold combined values, such as: doggo|pupper, doggo|floofer, and doggo|puppo.
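
The `stage` column described in the last item could be built roughly as follows. The sketch assumes the four stage columns are named `doggo`, `floofer`, `pupper`, and `puppo`, and that each cell holds either the stage name or the string `None`; the dataframe `df` and the helper `combine_stages` are hypothetical.

```python
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage present in the row with '|', or return None."""
    found = [col for col in STAGE_COLS if row[col] == col]
    return "|".join(found) if found else None


# Made-up rows: one doggo, one pupper, one with both, one with no stage.
df = pd.DataFrame({
    "doggo":   ["doggo", "None", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "pupper", "pupper", "None"],
    "puppo":   ["None", "None", "None", "None"],
})

df["stage"] = df[STAGE_COLS].apply(combine_stages, axis=1)
df = df.drop(columns=STAGE_COLS)          # the four old columns are no longer needed
df["stage"] = df["stage"].astype("category")
```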

## Cleaning Data

**1. Merged the 3 datasets using INNER join**.
- The number of rows of the merged dataset is 2074.
- I created a new dataframe called `tweet_full_df_clean` to work on and kept the original dataframe for reference.
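
A minimal sketch of an inner join across three frames, using made-up rows. The names `archive_df`, `image_df`, and `tweet_full_df` are hypothetical; `tweet_df` and `tweet_full_df_clean` follow the names used in this report. The join key needs the same dtype in all three frames.

```python
import pandas as pd

# Made-up miniature versions of the three datasets.
archive_df = pd.DataFrame({"tweet_id": ["1", "2", "3"], "name": ["Rex", "Bo", "Max"]})
image_df = pd.DataFrame({"tweet_id": ["1", "2"], "p1": ["pug", "collie"]})
tweet_df = pd.DataFrame({"tweet_id": ["2", "3"], "retweet_count": [10, 20]})

# An inner join keeps only tweets present in all three datasets.
tweet_full_df = (
    archive_df
    .merge(image_df, on="tweet_id", how="inner")
    .merge(tweet_df, on="tweet_id", how="inner")
)

# Work on a copy so the merged original stays available for reference.
tweet_full_df_clean = tweet_full_df.copy()
```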

**2. Deleted retweets from the dataset**

**3. Deleted the extra columns**: 

- I deleted the following columns: `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`, `timestamp`,
`text`, `in_reply_to_status_id`, `in_reply_to_user_id`
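
Steps 2 and 3 might look like the sketch below. It assumes a row is a retweet exactly when `retweeted_status_id` is populated; the dataframe `df` and its rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweeted_status_id": [None, 555.0, None],
    "retweeted_status_user_id": [None, 42.0, None],
    "retweeted_status_timestamp": [None, "2017-01-01 00:00:00 +0000", None],
    "in_reply_to_status_id": [None, None, None],
    "in_reply_to_user_id": [None, None, None],
    "timestamp": ["t1", "t2", "t3"],
    "text": ["a", "b", "c"],
    "full_text": ["a", "b", "c"],
})

# Keep only original tweets (rows with no retweeted status).
df = df[df["retweeted_status_id"].isna()]

extra_cols = [
    "retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp",
    "timestamp", "text", "in_reply_to_status_id", "in_reply_to_user_id",
]
df = df.drop(columns=extra_cols).reset_index(drop=True)
```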

**4. Created separate columns for date and time based on the `created_at` column**.
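
A short sketch of the split, assuming `created_at` is already a datetime column; the new column names `date` and `time` are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"created_at": pd.to_datetime(["2017-08-01 16:23:56"])})

# The .dt accessor exposes the date and time parts separately.
df["date"] = df["created_at"].dt.date
df["time"] = df["created_at"].dt.time
```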

**5. Parsed the column `source` to show the direct source (for example: Twitter for iPhone) rather than the full link**.
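
As an illustration, assuming each `source` value is an HTML anchor tag whose visible text is the client name, a regular expression can pull that text out. The helper `extract_source` is hypothetical.

```python
import re

# Capture the text between the end of the opening tag and the closing </a>.
ANCHOR_TEXT_RE = re.compile(r">([^<]*)</a>")


def extract_source(html):
    """Return the anchor text, or the input unchanged if it is not an anchor."""
    match = ANCHOR_TEXT_RE.search(html)
    return match.group(1) if match else html


raw = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
source = extract_source(raw)
```

On a dataframe column, the same pattern can be applied with `Series.str.extract(r">([^<]*)</a>", expand=False)`.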

**6. Converted `tweet_id` to string**.

**7. Dropped duplicates**.

**8. Re-formatted the content of columns `p1`, `p2`, `p3`**.
- Converted to upper case and removed underscores.
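
A sketch of this re-formatting on made-up predictions. It assumes "removed underscores" means replacing each underscore with a space; deleting them outright would be a one-character change.

```python
import pandas as pd

df = pd.DataFrame({
    "p1": ["golden_retriever", "Chihuahua"],
    "p2": ["Labrador_retriever", "toy_terrier"],
    "p3": ["kuvasz", "Italian_greyhound"],
})

for col in ["p1", "p2", "p3"]:
    # Literal replacement (regex=False), then normalize the case.
    df[col] = df[col].str.replace("_", " ", regex=False).str.upper()
```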

**9. Isolated the erroneous values in the `rating_denominator` and `rating_numerator` columns and fixed them programmatically**.

**10. Created a new column called `rating_ratio` which is `rating_numerator` divided by `rating_denominator`**.
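
For illustration, with made-up ratings the ratio is a single vectorized division. It relies on the denominators having been fixed in the previous step, since dividing by zero in pandas yields `inf` or `NaN` rather than raising an error.

```python
import pandas as pd

df = pd.DataFrame({
    "rating_numerator": [13, 84, 5],
    "rating_denominator": [10, 70, 10],
})

# Element-wise division; a group rating such as 84/70 becomes comparable to 12/10.
df["rating_ratio"] = df["rating_numerator"] / df["rating_denominator"]
```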

**11. Corrected erroneous values in the `name` column and converted the (None) values to `Null`**.
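
A minimal sketch of the null conversion, assuming the invalid names are limited to the examples listed in the assessment (None, a, an, the). The real set may be larger, and the name `INVALID_NAMES` is hypothetical.

```python
import pandas as pd

# Assumption: only the invalid values named in the assessment section.
INVALID_NAMES = {"None", "a", "an", "the"}

df = pd.DataFrame({"name": ["Charlie", "None", "a", "Lucy"]})

# Keep valid names; everything in INVALID_NAMES becomes a missing value.
df["name"] = df["name"].where(~df["name"].isin(INVALID_NAMES))
```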

**12. Created a new categorical column called `stage` to include the dog stages and deleted the old 4 columns**.

**13. Reordered the dataframe columns**.

**14. Saved the dataframe to a file called `twitter_archive_master.csv`**.