# Wrangle Data from WeRateDogs

## Gather Data

Data has been collected from three different sources:

* Main data:
    - Already provided by Udacity as a local csv file.
    - 2356 WeRateDogs tweets have been loaded into one dataframe.
    - Two other sources are needed to gather more data.
* Data relative to the breed of dog:
    - Image predictions have been programmatically downloaded from [cloudfront](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv).
    - From the tsv file, 2075 image predictions have been loaded into one dataframe.
* Data about retweets and favorites:
    - This data is available from the Twitter API. Accessing it requires creating a Twitter developer account in order to generate credentials that can be used to programmatically query the Twitter API.
    - 2337 records have been retrieved, then saved as a file `tweet_json.txt`.
  
The **gather** step concludes with three files stored locally, each one loaded into its own dataframe so that we can start assessing what we got.
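
The programmatic download of the image predictions file could look like the sketch below. It uses only the standard library; the helper names `local_name` and `download` are hypothetical, and the notebook may well rely on a different HTTP client.

```python
import os
import urllib.request
from urllib.parse import urlparse

URL = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/'
       '599fd2ad_image-predictions/image-predictions.tsv')


def local_name(url):
    """Return the file name found at the end of the URL path."""
    return os.path.basename(urlparse(url).path)


def download(url, folder='.'):
    """Download `url` into `folder` and return the local path (hypothetical helper)."""
    path = os.path.join(folder, local_name(url))
    with urllib.request.urlopen(url, timeout=30) as response, open(path, 'wb') as f:
        f.write(response.read())
    return path
```

Calling `download(URL)` would store the file under the name returned by `local_name(URL)` in the current folder.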

## Assess Data
We start by loading each data file into its own dataframe with three simple lines of code (after importing pandas):

```
import pandas as pd

df_twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
df_image_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
df_tweet_data = pd.read_csv('tweet_json.txt')
```

Then we perform two assessments:

* a visual one that gives a glimpse of quality and tidiness issues
* a programmatic one that confirms and completes what has been visually assessed.

### Visual assessment

Each dataset has been visually assessed by simply sampling it.

Here are the findings:

* specific to **Twitter Archive**:
    - The format of the values in columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` and `timestamp` indicates a type problem (to be clarified via programmatic assessment).
    - Column `name` has some values set to "None", which probably means that the dog's name is not known, so we should get null instead of "None" for those.
    - In columns `doggo`, `floofer`, `pupper` and `puppo`, 'None' is displayed for null and, when not null, the value is the same as the column name. This looks like a tidiness issue because it breaks the rule that each variable forms a column: here the dog stage variable is spread over four columns.
    - Column `source` contains markup, which makes it harder to read.
    - The presence of the columns `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`, and the fact that they sometimes contain non-null values, indicate that there are records related to retweeted tweets, which are not in the population we want to analyze.
* **General tidiness issue**:
    - The three datasets should be combined into one since they are all part of the same observation unit.

### Programmatic assessment

For this we use common methods like `pandas.DataFrame.info()`, `pandas.DataFrame.dtypes`, `pandas.DataFrame.describe()` and `pandas.Series.value_counts()`, plus more specific ones when needed.
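
As an illustration, these calls can be tried on a toy dataframe. The values below are invented for the example and are not taken from the real archive.

```python
import pandas as pd

# Toy stand-in for the archive; values are illustrative only.
df = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_status_id': [None, 10.0, None, None],
    'timestamp': ['2017-08-01 16:23:56 +0000'] * 4,
    'name': ['Phineas', 'None', 'a', 'None'],
})

df.info()                                # column types and non-null counts
dtypes = df.dtypes                       # one dtype per column
summary = df.describe()                  # descriptive statistics for numeric columns
name_counts = df['name'].value_counts()  # frequency of each name
```

Here `name_counts` makes the "None" placeholder stand out, and `dtypes` shows the ID column stored as `float64`.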

Assessing the **Twitter Archive**:

* **Type issues**:
    - The following columns should be `int64` or `string` instead of `float64`:
      - in_reply_to_status_id
      - in_reply_to_user_id
      - retweeted_status_id
      - retweeted_status_user_id
    - The following columns should be of type `datetime` instead of `object`:
      - timestamp
      - retweeted_status_timestamp
    - Column `source` should be a category.
* **Value issues**:
    - `rating_denominator` should always be 10.
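
Checks of this kind can be sketched on a small example. The dataframe below is invented for illustration; only the column names come from the archive.

```python
import pandas as pd

df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_status_id': [None, 8.862664e+17, None],
    'rating_numerator': [13, 84, 12],
    'rating_denominator': [10, 70, 10],
})

# IDs stored as float64 are displayed in scientific notation and may lose precision.
float_columns = df.select_dtypes(include='float64').columns.tolist()

# Rows whose denominator differs from the expected value of 10.
odd_denominators = df[df['rating_denominator'] != 10]
```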
    
Assessing the **prediction**:   

* 281 records are missing when compared to the Twitter archive.
* Some records concern retweets.

Assessing the **json**:

* 19 records are missing (those are the failed calls to the Twitter API while getting the tweet info).
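
One possible way to count such missing records is to compare identifiers between dataframes. The sketch below assumes a shared `tweet_id` column (an assumption, since the key is not named in this report) and uses toy data.

```python
import pandas as pd

archive = pd.DataFrame({'tweet_id': [1, 2, 3, 4, 5]})
predictions = pd.DataFrame({'tweet_id': [1, 2, 4]})

# Archive rows whose tweet_id has no counterpart in the other dataset.
missing = archive[~archive['tweet_id'].isin(predictions['tweet_id'])]
n_missing = len(missing)
```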

### Assessment summary

#### Quality issues

* Common to all datasets:
    - There are records related to retweeted tweets.
    - Records are missing in the **picture prediction** and **tweet data** datasets.
* **Twitter archive dataset**:
    - The following columns should be int64 or string instead of float64: `in_reply_to_status_id`, `in_reply_to_user_id`.
    - The following columns relate to retweets and so should be removed: `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`.
    - Column `timestamp` should be of type datetime instead of object.
    - Column `name` contains invalid values (e.g. "None" instead of `NaN` when null).
    - Column `source` should be a category.
    - Column `source` should be friendly (remove markup).
    - Column `rating_denominator` contains values other than "10".
    - Column `rating_numerator` has a few cases of missing decimals that can be retrieved from the column `text`.
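
To illustrate the last point, a decimal numerator could be recovered from `text` with a regular expression. This is only a sketch on invented tweets; the pattern is an assumption and may not cover every rating format found in the archive.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['This is Bella. 13.5/10 would pet', 'Meet Sam. 12/10'],
    'rating_numerator': [5, 12],  # 5 is what remains when the decimal part is lost
})

# Capture a decimal numerator such as "13.5" in "13.5/10".
decimal = df['text'].str.extract(r'(\d+\.\d+)/\d+', expand=False).astype(float)

# Keep the stored numerator where no decimal rating was found.
df['rating_numerator'] = decimal.fillna(df['rating_numerator'])
```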
    
    
#### Tidiness issues

* The three datasets should be combined into one since they are all part of the same observation unit.
* In the **Twitter archive**, the columns `doggo`, `floofer`, `pupper` and `puppo` are in fact the same variable and so should be combined into one column.



## Clean Data

We start by making a copy of each dataframe by running:
```
df_twitter_archive_clean = df_twitter_archive.copy()
df_image_predictions_clean = df_image_predictions.copy()
df_tweet_data_clean = df_tweet_data.copy()
``` 

Then we clean two tidiness issues and eight quality issues. For each one we follow the same pattern: define, code, then test. The details of those steps are in **wrangle_act.ipynb**.

### Tidiness

The main tidiness issue is that all three datasets correspond to the same observation unit. In fact, we only use the **prediction** and **tweet** info to enrich our dataset with information about breed and tweet popularity (retweet count and favorite count). `pandas.merge()` makes this easy to solve, and the test step is simply done by checking the output of `pandas.DataFrame.info()`.
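
A minimal sketch of that merge on toy data follows. The `tweet_id` key, the `p1` column and the `how='left'` choice are assumptions made for the example; the notebook may join differently.

```python
import pandas as pd

archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['pug', 'chow']})
tweet_data = pd.DataFrame({'tweet_id': [1, 3],
                           'retweet_count': [10, 30],
                           'favorite_count': [100, 300]})

# Enrich the archive with breed predictions, then with popularity counts.
df = pd.merge(archive, predictions, on='tweet_id', how='left')
df = pd.merge(df, tweet_data, on='tweet_id', how='left')

df.info()  # test step: one dataframe holding the columns of all three sources
```

With a left join, tweets without a prediction or without API data are kept and get null values in the added columns.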

The other tidiness issue is that the dog stage variable is spread over four columns when it should be one. `pandas.melt()` solves most of the issue; however, a few cases with multiple stages remain. In those cases the value in the newly melted column is updated by concatenating all the stages into one comma-separated string. By checking `pandas.DataFrame.info()` and `pandas.Series.value_counts()` on the newly created column, we can confirm that the issue is successfully fixed.
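
One way to sketch this on toy data is shown below. It melts the four columns, drops the 'None' placeholders, and joins the remaining stages per tweet. The `tweet_id` key and the `dog_stage` column name are assumptions, and the notebook may handle the multi-stage cases in a different order.

```python
import pandas as pd

stages = ['doggo', 'floofer', 'pupper', 'puppo']
df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

# Long format: one row per (tweet, stage column), keeping only real stages.
long = df.melt(id_vars='tweet_id', value_vars=stages, value_name='dog_stage')
long = long[long['dog_stage'] != 'None']

# Tweets with several stages get one comma-separated value.
dog_stage = long.groupby('tweet_id')['dog_stage'].agg(','.join).reset_index()

# Replace the four original columns with the single combined one.
df_clean = df.drop(columns=stages).merge(dog_stage, on='tweet_id', how='left')
```

Tweets with no stage at all end up with a null `dog_stage`, which `value_counts()` ignores by default.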

The reason I prefer to solve tidiness issues before quality issues is that I know I will have to remove the records related to retweeted tweets. It is better to perform that removal once than two or three times. Now that the observation unit issue is fixed by having one dataframe, I only have to remove the retweets once.

### Quality

Unsurprisingly, the first quality issue to fix is the removal of retweets. If **retweeted_status_id** is not null, the record is a retweet; that assertion is used at coding time to perform the removal and at test time to check the correction.
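
The code and test steps for this issue could be sketched as follows, on invented rows. Dropping the three retweet columns afterwards follows the assessment summary; the exact order of operations in the notebook is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [None, 8.87e17, None],
    'retweeted_status_user_id': [None, 4.2e9, None],
    'retweeted_status_timestamp': [None, '2017-07-19 00:47:34 +0000', None],
})
retweet_columns = ['retweeted_status_id', 'retweeted_status_user_id',
                   'retweeted_status_timestamp']

# Code: keep only the records that are not retweets.
originals = df[df['retweeted_status_id'].isnull()]

# Test: no retweet should remain.
remaining_retweets = originals['retweeted_status_id'].notnull().sum()

# The retweet columns are now entirely null and can be dropped.
df_clean = originals.drop(columns=retweet_columns)
```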

I won't go into the details of each issue. The "Define" step is mainly a rephrasing of the assessment summary. "Code" has been achieved easily with the pandas API, and "Test" uses common descriptive methods such as `pandas.DataFrame.info()` or `pandas.Series.value_counts()`. The only issue worth mentioning is cleaning the dog names. Programmatically this does not present any challenge; the difficulty is that there is no clean pattern to fix. The decision to remove names with a length of less than 2 characters is arbitrary: there are wrong names of 3 characters and there are also valid names of 2 characters (like 'Bo'). This should be mentioned along with any analytical conclusion drawn from the variable **name**.
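
The limits of a length-based rule can be seen in a small sketch. The names below are invented examples and `MIN_NAME_LENGTH` is a hypothetical constant standing for the arbitrary threshold.

```python
import pandas as pd

MIN_NAME_LENGTH = 2  # arbitrary threshold: shorter names are treated as invalid

names = pd.Series(['Bo', 'None', 'a', 'Charlie', 'the'], name='name')

# Keep a name only if it is not the "None" placeholder and is long enough;
# everything else becomes null.
clean = names.where((names != 'None') & (names.str.len() >= MIN_NAME_LENGTH))
```

In this sketch 'a' and 'None' become null, 'Bo' is kept, and an invalid word such as 'the' slips through, which is why the rule should be treated with caution.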

### Finally

Once everything is clean, the resulting dataset is saved in a file for analysis. As a first step of the analysis, we should keep in mind that saving to csv and then loading from it drops the type conversions we may have done (to datetime, for example).
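
The effect can be reproduced with a small round trip. The sketch below uses an in-memory buffer instead of a real file, and `parse_dates` is one possible way to restore the datetime type at load time.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'tweet_id': [1, 2],
    'timestamp': pd.to_datetime(['2017-08-01 16:23:56', '2017-08-01 00:17:27']),
})

# Save to csv (in memory here; a file path works the same way).
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain reload gives the timestamps back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Asking read_csv to parse the column restores the datetime type.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['timestamp'])
```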
