# Wrangle Report - Data Wrangling for Twitter - WeRateDogs

> This report documents my wrangling efforts for the Udacity DAND Term 2 Twitter WeRateDogs project.

The data wrangling process consists of three steps: gathering data, assessing data, and cleaning data.

### Step1- Gathering Data

The data for this project comes from three sources:<br>

(1) WeRateDogs' Twitter Archive:
I manually downloaded the WeRateDogs Twitter archive data from Udacity and read it with `Pandas`.

(2) Tweet Image Predictions:
- First, I used the `requests` library to get the data from the URL.
- Second, I used Python's file handling to write the content returned by `requests` to a file in the project folder.
- Third, I read the `.tsv` file using `Pandas`.
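
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the exact project code: the helper names `download_file`, `save_bytes` and `read_tsv` are hypothetical, and the URL and folder are left as parameters because the report does not state them.

```python
import os

import pandas as pd


def save_bytes(content, folder, filename):
    """Write raw bytes to folder/filename, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, mode="wb") as file:
        file.write(content)
    return path


def download_file(url, folder):
    """Fetch `url` with requests and store the response body in `folder`."""
    import requests  # imported here so the other helpers also work offline

    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error
    # Assumption: the last URL segment is a usable file name.
    return save_bytes(response.content, folder, url.split("/")[-1])


def read_tsv(path):
    """Read a tab-separated file into a DataFrame."""
    return pd.read_csv(path, sep="\t")
```

Calling `read_tsv(download_file(url, folder))` would then chain the download and the read.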

(3) WeRateDogs data from the API:
This part is a bit trickier.
- First, I created a `.txt` file and stored empty JSON data in it.
- Second, I wrote a `for` loop that runs through every tweet_id in the `Twitter Archive` data. For each tweet_id, I used `Tweepy` to get the status data and stored the values I was interested in.
- Third, I used the `read_json` function to read the JSON data from the `.txt` file gathered with `Tweepy`.
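
A sketch of this loop is shown below, under some assumptions. The lookup is passed in as a hypothetical `fetch_status` callable so the example runs without API credentials; with Tweepy it might wrap something like `api.get_status(tweet_id)._json`, but that wiring is not taken from the project. Writing one JSON object per line is also an assumption about the file layout.

```python
import json

import pandas as pd


def collect_statuses(tweet_ids, fetch_status, path):
    """Write one JSON object per line to `path`; return the ids that failed."""
    failed = []
    with open(path, mode="w", encoding="utf-8") as file:
        for tweet_id in tweet_ids:
            try:
                status = fetch_status(tweet_id)  # expected to return a dict
            except Exception:
                # For example a deleted tweet: note the id and keep going.
                failed.append(tweet_id)
                continue
            file.write(json.dumps(status) + "\n")
    return failed


def read_statuses(path):
    """Load the line-delimited JSON file into a DataFrame."""
    return pd.read_json(path, lines=True)
```

Keeping the failed ids makes it easy to see afterwards which tweets could not be retrieved.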

### Step2- Assessing Data

This part can be separated into two smaller parts: visual assessment and programmatic assessment.

- Visual Assessment <br>
In this part, I visually inspected each dataframe to see if anything looked unusual.
<br><br>
- Programmatic Assessment <br>
In this part, I used `Pandas` functions such as `.sample()`, `.info()`, `.value_counts()` and `.describe()` to look for data quality and tidiness issues.
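
As an illustration of the programmatic checks, here is a small sketch on a made-up dataframe (`sample_df` and its values are invented for the example, not taken from the archive):

```python
import pandas as pd

# Toy data standing in for the archive dataframe.
sample_df = pd.DataFrame({
    "name": ["a", "Charlie", "an", "a", "Lucy"],
    "rating_numerator": [12, 13, 5, 10, 11],
})

sample_df.info()                            # dtypes and non-null counts
print(sample_df.sample(3, random_state=0))  # a few random rows

name_counts = sample_df["name"].value_counts()  # frequency of each name
print(name_counts)

summary = sample_df["rating_numerator"].describe()  # min, max, quartiles
print(summary)
```

Counts like the ones in `name_counts` are what expose suspicious names such as "a" or "an".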

### Step3- Cleaning Data

Each data cleaning step can be separated into three parts: `define`, `code`, `test`.

First, I reviewed all the issues I found during the assessment process. They are listed below.

#### >> Quality Issues

##### wrd archive
- (1) There are 55 statuses with an incorrect name: "a" or "an".
- (2) Some entries have the wrong expanded_urls: "gofundme.com" instead of "twitter.com".
- (3) There are 59 statuses missing "expanded_urls".
- (4) Null values in the 'name', 'doggo', 'floofer', 'pupper', 'puppo' columns are not stored as NaN (e.g. the string 'None') and should be changed to NaN.
- (5) The rating_numerator contains ratings that should sometimes be decimals but were parsed incorrectly. This value should be a float.
- (6) Some ratings were incorrectly retrieved from other fractions in the text, e.g. row 2335 - 1/2.
- (7) ID columns should be strings instead of integers.

##### prediction
- (9) Some of the prediction results start with capital letters, e.g. Labrador_retriever.

#### >> Tidiness Issues

##### wrd archive
- (10) doggo, floofer, pupper and puppo should be combined into a single column, `type`. <br>

##### wrd_api
- (11) The dataframe needs to be transposed so that its columns become rows.

<hr>

### Wrangling Efforts in Quality Issues:

#### - Issues (1):
- I used the `df.query()` function to look at the text of the original posts.
- I then realized that these are posts that don't specify the pet's name.
- As a result, I used `df.apply()` to change all the incorrect names to NaN.
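
A minimal sketch of this fix on invented rows (the dataframe name `archive` and the tweet texts are made up for illustration):

```python
import numpy as np
import pandas as pd

archive = pd.DataFrame({
    "text": [
        "This is a pupper. 12/10",
        "Meet Charlie. 13/10",
        "Here is an odd dog. 9/10",
    ],
    "name": ["a", "Charlie", "an"],
})

# Inspect the posts whose extracted "name" is really an article.
print(archive.query('name == "a" or name == "an"')["text"].tolist())

# Replace the bogus names with NaN and keep every other name unchanged.
archive["name"] = archive["name"].apply(
    lambda name: np.nan if name in ("a", "an") else name
)
```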

#### - Issues (2) and (3)
- I used the `df.apply()` function to fix the incorrect and missing status URLs.
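
One way this could look is sketched below. It assumes that a correct URL can be rebuilt from `tweet_id` using the pattern `https://twitter.com/dog_rates/status/<tweet_id>`; that pattern, the helper `fix_url` and the sample rows are assumptions for illustration, not the project's confirmed approach.

```python
import numpy as np
import pandas as pd

archive = pd.DataFrame({
    "tweet_id": ["1001", "1002", "1003"],  # made-up ids
    "expanded_urls": [
        "https://twitter.com/dog_rates/status/1001/photo/1",
        "https://www.gofundme.com/example",  # wrong domain (made-up URL)
        np.nan,                              # missing URL
    ],
})


def fix_url(row):
    """Keep valid Twitter URLs; rebuild wrong or missing ones from tweet_id."""
    url = row["expanded_urls"]
    if isinstance(url, str) and "twitter.com" in url:
        return url
    return f"https://twitter.com/dog_rates/status/{row['tweet_id']}"


archive["expanded_urls"] = archive.apply(fix_url, axis=1)
```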

#### - Issues (4)
- I used `df.apply()` to change all the 'None' values to NaN in the 'doggo', 'floofer', 'pupper' and 'puppo' columns.
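
A small sketch of this replacement on invented rows, applying the same rule to each of the four columns:

```python
import numpy as np
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

archive = pd.DataFrame({
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

# Turn the placeholder string 'None' into a real missing value.
for column in stage_columns:
    archive[column] = archive[column].apply(
        lambda value: np.nan if value == "None" else value
    )
```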

#### - Issues (5) and (6)
- I used `.str.extract()` with a regular expression to extract the correct ratings from the `text` column.
- Then I used `df.apply()` to get the rating data for `rating_numerator` and `rating_denominator`.
- I used the `.astype()` function to convert `rating_numerator` to the correct data type.
- This is the regular expression I used: `r'(\d+\.?\d+?/\d+\.?\d+?)'`, which captures the decimals within the ratings.
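
The extraction can be sketched as below. Note that the pattern quoted above requires at least two digits on each side of the slash, so it would not match a rating such as 9/10. The sketch therefore swaps in a slightly different pattern, `RATING_PATTERN`, which is my own variant and not the one used in the project; the sample texts are invented.

```python
import pandas as pd

# Variant pattern (an assumption): optional decimal part on either side,
# with numerator and denominator captured as separate groups.
RATING_PATTERN = r"(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)"

tweets = pd.DataFrame({
    "text": [
        "This is Bella. 13.5/10 would pet",
        "Solid 9/10",
        "Such a good boy. 12/10",
    ],
})

# str.extract returns one column per capture group (first match only).
ratings = tweets["text"].str.extract(RATING_PATTERN).astype(float)
tweets["rating_numerator"] = ratings[0]
tweets["rating_denominator"] = ratings[1]
```

Because `.str.extract()` keeps only the first match, a tweet that mentions another fraction before the rating (issue 6) would still need to be checked separately.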

#### - Issues (7)
- I used the `.astype()` function to convert the `tweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id` and `retweeted_status_user_id` columns in the wrd_clean dataframe and the `tweet_id` column in the prediction_clean dataframe to strings.
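
For a column without missing values the conversion is a one-liner, as in this sketch with made-up ids. Columns that contain missing values are loaded by pandas as floats, so they may need extra handling that is not shown here.

```python
import pandas as pd

wrd_clean = pd.DataFrame({
    "tweet_id": [1000000000000000001, 1000000000000000002],  # made-up ids
})

# Ids are labels, not quantities, so store them as strings.
wrd_clean["tweet_id"] = wrd_clean["tweet_id"].astype(str)
```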

#### - Issues (9)
- I used the `df.apply()` function together with `.lower()` to change all the letters to lower case.
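
A short sketch of the lower-casing step; the column name `p1` is a placeholder for a prediction column and the values are examples only:

```python
import pandas as pd

prediction_clean = pd.DataFrame({
    "p1": ["Labrador_retriever", "golden_retriever", "Chihuahua"],
})

# Normalize the predicted breed names to lower case.
prediction_clean["p1"] = prediction_clean["p1"].apply(lambda name: name.lower())
```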

### Wrangling Efforts in Tidiness Issues:

#### - Issues (10):
- I created several temporary columns to store the dog type data using the `df.apply()` function.
- I joined all the temporary dataframes I had created.
- I dropped all the temporary dataframes.
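
The same result can be sketched more compactly with a single row-wise `df.apply()`. This is an alternative to the temporary-column approach described above, not a reproduction of it; the helper `combine_stages`, the comma separator for dogs with several types, and the sample rows are assumptions.

```python
import numpy as np
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

archive = pd.DataFrame({
    "tweet_id": ["1001", "1002", "1003"],
    "doggo": ["doggo", "doggo", np.nan],
    "floofer": [np.nan, np.nan, np.nan],
    "pupper": [np.nan, "pupper", np.nan],
    "puppo": [np.nan, np.nan, np.nan],
})


def combine_stages(row):
    """Join the stage labels present in a row; NaN when there are none."""
    stages = [
        row[column]
        for column in stage_columns
        if pd.notna(row[column]) and row[column] != "None"
    ]
    return ", ".join(stages) if stages else np.nan


archive["type"] = archive.apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_columns)  # the four columns are now redundant
```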

#### - Issues (11):
- First, I used the `df.copy()` function to make a copy before changing anything.
- I used the `.transpose()` function to transpose the whole dataframe.
- I used the `.rename()` function to give the columns proper names.
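
A sketch of these three steps on a made-up dataframe whose columns are tweet ids. The layout of `wrd_api`, the field names, and the extra `reset_index()` call (used to turn the ids into a regular column) are assumptions for illustration.

```python
import pandas as pd

# Toy layout: one column per tweet, one row per field.
wrd_api = pd.DataFrame({
    "1001": {"retweet_count": 5, "favorite_count": 20},
    "1002": {"retweet_count": 7, "favorite_count": 30},
})

wrd_api_clean = wrd_api.copy()             # leave the original untouched
wrd_api_clean = wrd_api_clean.transpose()  # tweets become rows

# Move the ids out of the index and give that column a proper name.
wrd_api_clean = wrd_api_clean.reset_index().rename(columns={"index": "tweet_id"})
```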

These are all the steps I took during the data wrangling process.