# Data Wrangling
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Gathering Data</a></li>
<li><a href="#assessing">Assessing Data</a></li>
<li><a href="#cleaning">Cleaning Data</a></li>
<li><a href="#next">Next Step</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this data wrangling project, I will first gather three datasets related to the WeRateDogs Twitter account. Then I will assess the data both visually and programmatically, and clean it based on that assessment.

<a id='gathering'></a>
## Gathering Data

The first dataset is `The WeRateDogs Twitter archive`, which I uploaded as a CSV file called 'twitter-archive-enhanced.csv'. I'll need its tweet IDs later.
Then I used the requests module to download the tweet image predictions data from the URL 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'.
The last dataset I gathered was additional data from the Twitter API. To use the Twitter API, I created my first Twitter account and applied for a developer account. Luckily, my application was approved! With a developer account, I created a project and got a token and secret, which I used for authentication. I was then able to use the tweepy module to query Twitter's API for the JSON data of each tweet ID in `The WeRateDogs Twitter archive`. The fields I extracted via the API were tweet_id, favorite_count and retweet_count.
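
The extraction step can be sketched roughly as follows. This is a minimal sketch, not the code used in the project: it assumes each tweet's JSON was saved as one object per line, and that the fields are named `id`, `favorite_count` and `retweet_count`, as in Twitter's v1.1 tweet objects. The function name `extract_counts` and the sample rows are hypothetical.

```python
import json


def extract_counts(json_lines):
    """Pull tweet_id, favorite_count and retweet_count out of tweet JSON lines.

    Assumes one JSON object per line, using the field names of Twitter's
    v1.1 tweet objects (an assumption, not confirmed by this report).
    """
    records = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": str(tweet["id"]),  # keep IDs as strings, not ints
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return records


# Made-up sample data, for illustration only
sample = [
    '{"id": 1001, "favorite_count": 25, "retweet_count": 7, "lang": "en"}',
    '',
    '{"id": 1002, "favorite_count": 3, "retweet_count": 0, "lang": "en"}',
]
print(extract_counts(sample))
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame` to build the third table.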

<a id='assessing'></a>
## Assessing Data

I first used visual assessment, both here and in Excel, and then programmatic assessment to find the following data quality and tidiness issues.

#### Quality
##### `Enhanced Twitter Archive` 
* Tweet_id column dtype is int; it should be string.
* Timestamp column dtype is not datetime; it should be.
* There are 181 retweets in this dataset that need to be removed.
* There are 78 replies in this dataset that need to be removed.
* Not every tweet has an identified dog stage.
* Some of the rating denominators are not 10.
* Some ratings were extracted incorrectly, for example *'Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer.'*, where the rating numerator and denominator are recorded as 24 and 7, and *'This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv'*, where they are recorded as 1 and 2.
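
One possible way to catch the two wrong ratings quoted above is sketched below. It is an illustration under a simplifying assumption, not the rule used in the project: it only accepts fractions whose denominator is 10 and takes the last one in the text, so any legitimate rating with a different denominator would need separate handling. The names `RATING_PATTERN` and `extract_rating` are hypothetical.

```python
import re

# Matches "numerator/denominator", allowing a decimal numerator such as 9.5/10
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) for the last fraction whose
    denominator is 10, or None if the text contains no such fraction."""
    candidates = [
        (float(num), int(den))
        for num, den in RATING_PATTERN.findall(text)
        if int(den) == 10  # simplifying assumption: real ratings are out of 10
    ]
    return candidates[-1] if candidates else None


sam = "Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer."
albanian = (
    "This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished "
    "hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv"
)

print(extract_rating(sam))       # no fraction out of 10 in this tweet
print(extract_rating(albanian))  # skips "1/2" and keeps "9/10"
```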

##### `image` Table
* Tweet_id column dtype is int; it should be string.
* There is more than one prediction for the dog in each tweet.
* There are non-dog predictions.
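
To illustrate how the last two issues could be handled together, here is a minimal sketch that keeps only the highest-ranked prediction that is actually a dog. It assumes the predictions table has columns `p1`, `p2`, `p3` (labels, best first) and `p1_dog`, `p2_dog`, `p3_dog` (boolean flags); these column names are an assumption, since the report does not list them. The function name and sample rows are hypothetical.

```python
def first_dog_prediction(row):
    """Return the highest-ranked prediction flagged as a dog, or None.

    Assumes the row exposes p1/p2/p3 (predicted labels, best first) and
    p1_dog/p2_dog/p3_dog (True when the label is a dog breed).
    """
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"]
    return None  # none of the three predictions is a dog


# Made-up rows, for illustration only
rows = [
    {"p1": "pug", "p1_dog": True, "p2": "bull_mastiff", "p2_dog": True,
     "p3": "teddy", "p3_dog": False},
    {"p1": "paper_towel", "p1_dog": False, "p2": "chow", "p2_dog": True,
     "p3": "pug", "p3_dog": True},
    {"p1": "banana", "p1_dog": False, "p2": "orange", "p2_dog": False,
     "p3": "bagel", "p3_dog": False},
]
print([first_dog_prediction(r) for r in rows])
```

Because the function only indexes the row by column name, it should also work with `DataFrame.apply(first_dog_prediction, axis=1)`.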

##### `Data` Table
* Count columns have object type instead of integer type.
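
The dtype issues listed for the three tables could be fixed along these lines. This is a sketch on made-up rows, assuming the column names `tweet_id`, `timestamp`, `favorite_count` and `retweet_count`; the timestamp format shown is also an assumption.

```python
import pandas as pd

# Made-up rows standing in for the real tables
archive = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})
counts = pd.DataFrame({
    "tweet_id": ["1001", "1002"],
    "favorite_count": ["25", "3"],
    "retweet_count": ["7", "0"],
})

# IDs are labels, not quantities, so store them as strings
archive["tweet_id"] = archive["tweet_id"].astype(str)

# Parse the timestamp strings into datetimes
archive["timestamp"] = pd.to_datetime(archive["timestamp"])

# Convert the object-typed count columns to integers
for col in ["favorite_count", "retweet_count"]:
    counts[col] = counts[col].astype(int)

print(archive.dtypes)
print(counts.dtypes)
```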

#### Tidiness
##### `Enhanced Twitter Archive` 
* Dog stages are spread over several columns instead of one.
* There are 2 columns related to the rating; there should be just one.
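
As an illustration of the first tidiness issue, the sketch below collapses the dog stage columns into a single `dog_stage` column. It assumes the stages are stored in four columns named `doggo`, `floofer`, `pupper` and `puppo`, each holding either its own name or the string `"None"`; the report does not spell this layout out, so treat it as an assumption. The helper `combine_stages` and the new column name are hypothetical.

```python
import pandas as pd

# Assumed names of the dog stage columns
STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows, for illustration only
archive = pd.DataFrame({
    "tweet_id": ["1001", "1002", "1003"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return ", ".join(found) if found else None


# Build the single stage column, then drop the four original columns
archive["dog_stage"] = archive.apply(combine_stages, axis=1)
archive = archive.drop(columns=STAGES)
print(archive)
```

Tweets without an identified stage end up with a missing value, and a tweet that mentions two stages keeps both.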

##### `All three tables` 
* 3 datasets are not merged.

<a id='cleaning'></a>
## Cleaning Data
I defined the steps I would take to clean the datasets, wrote code to clean them, and finally tested whether my cleaning was complete.
During the cleaning process, I wrote some functions to make cleaning more efficient.
After the cleaning process, I merged the three datasets into a master dataset and exported it as `twitter_archive_master.csv`.
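
A condensed sketch of the define, code, test pattern followed by the final merge is shown below. It runs on made-up rows and rests on several assumptions: retweets and replies are identified by non-null `retweeted_status_id` and `in_reply_to_status_id` columns, all three tables share a `tweet_id` column, and an inner join is used. None of these details are confirmed by this report.

```python
import pandas as pd

# Made-up rows standing in for the three cleaned tables
archive = pd.DataFrame({
    "tweet_id": ["1001", "1002", "1003"],
    "retweeted_status_id": [None, "999", None],
    "in_reply_to_status_id": [None, None, "998"],
    "rating_numerator": [13, 12, 11],
})
predictions = pd.DataFrame({"tweet_id": ["1001", "1002"], "p1": ["pug", "corgi"]})
counts = pd.DataFrame({"tweet_id": ["1001", "1003"], "favorite_count": [25, 3]})

# Define: keep only original tweets (no retweets, no replies)
# Code:
is_original = (
    archive["retweeted_status_id"].isna()
    & archive["in_reply_to_status_id"].isna()
)
originals = archive[is_original].drop(
    columns=["retweeted_status_id", "in_reply_to_status_id"]
)
# Test: only tweet 1001 is neither a retweet nor a reply
assert originals["tweet_id"].tolist() == ["1001"]

# Merge the three tables on tweet_id (inner join is an assumption)
master = (
    originals
    .merge(predictions, on="tweet_id", how="inner")
    .merge(counts, on="tweet_id", how="inner")
)

# Export the master dataset without the index column
master.to_csv("twitter_archive_master.csv", index=False)
print(master)
```

With an inner join, a tweet that is missing from any one table is dropped from the master dataset; a left join on the archive would keep it with missing values instead.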

<a id='next'></a>
## Next Step
This wraps up my wrangling process! Next I will analyze the (rather) clean `twitter_archive_master.csv` master dataset and draw some insights from the WeRateDogs Twitter data.

