## Reporting: wrangle_report
* Create a **300-600 word written report** called "wrangle_report.pdf" or "wrangle_report.html" that briefly describes your wrangling efforts. This is to be framed as an internal document.

## Step 1: Gathering Data

In this step, you will gather all three pieces of data as described below in the "Data Gathering" section in the wrangle_act.ipynb notebook.

In the cells below, gather **all** three pieces of data for this project and load them in the notebook. 

**Note:** the methods required to gather each dataset are different.

### 1. The WeRateDogs Twitter archive

We downloaded the `twitter-archive-enhanced.csv` from the Udacity platform and loaded it as a dataframe.
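Loading the archive can be sketched as follows. The two sample rows are made up so the snippet runs without the real file; with the actual download, the file path would be passed to `pd.read_csv` instead of the in-memory buffer.

```python
import io

import pandas as pd

# A two-row stand-in for twitter-archive-enhanced.csv (values are made up).
sample_csv = io.StringIO(
    "tweet_id,in_reply_to_status_id,timestamp,text\n"
    "1001,,2017-08-01 16:23:56 +0000,This is Phineas. 13/10\n"
    "1002,1001,2017-08-01 00:17:27 +0000,This is Tilly. 13/10\n"
)

# With the real file: pd.read_csv("twitter-archive-enhanced.csv")
twitter_archive_enhanced = pd.read_csv(sample_csv)
print(twitter_archive_enhanced.shape)
```

Note that a numeric column containing missing values is read as float, which is one of the quality issues listed in the assessment step.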

### 2. The tweet image predictions

We downloaded the `image-predictions.tsv` from the Udacity platform with `requests` and loaded it as a dataframe.
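A minimal sketch of this step, assuming the file URL is supplied by the reader (it is not given here). The helper names `download_file` and `read_predictions` are hypothetical, and the sample columns `p1`, `p1_conf` and `p1_dog` are assumed for illustration.

```python
import io

import pandas as pd


def download_file(url, path, timeout=30):
    """Download `url` with requests and write the response body to `path`."""
    import requests  # imported here so the parsing helper works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(path, "wb") as handle:
        handle.write(response.content)


def read_predictions(source):
    """Read a tab-separated predictions file (path or file-like object)."""
    return pd.read_csv(source, sep="\t")


# In-memory stand-in for image-predictions.tsv (values are made up).
sample_tsv = io.StringIO(
    "tweet_id\tjpg_url\tp1\tp1_conf\tp1_dog\n"
    "1001\thttps://example.com/a.jpg\tgolden_retriever\t0.95\tTrue\n"
)
image_predictions = read_predictions(sample_tsv)
print(image_predictions.dtypes)
```

With the real data, `download_file(url, "image-predictions.tsv")` would run first, followed by `read_predictions("image-predictions.tsv")`.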

### 3. Additional data from the Twitter API

Because we could not obtain access to the Twitter API, we used the `tweet-json.txt` file provided by the Udacity platform and loaded it as a dataframe.
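The sketch below shows one way to load such a file, assuming it stores one JSON object per line and that each object carries the fields `id_str`, `created_at`, `full_text`, `retweet_count` and `favorite_count`. The helper name `read_tweet_json` is hypothetical and the sample tweets are made up.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines):
    """Build a dataframe from an iterable of JSON-encoded tweets, one per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id_str"],  # assumed field name, kept as a string
            "created_at": tweet["created_at"],
            "full_text": tweet.get("full_text"),
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(records)


# In-memory stand-in for tweet-json.txt.
sample_json = io.StringIO(
    '{"id_str": "1001", "created_at": "Tue Aug 01 16:23:56 +0000 2017", '
    '"full_text": "This is Phineas. 13/10", "retweet_count": 8, "favorite_count": 39}\n'
    '{"id_str": "1002", "created_at": "Tue Aug 01 00:17:27 +0000 2017", '
    '"full_text": "This is Tilly. 13/10", "retweet_count": 6, "favorite_count": 33}\n'
)
twitter_archive_from_api = read_tweet_json(sample_json)
print(twitter_archive_from_api.head())
```

With the real file, the same helper can be fed an open file handle: `with open("tweet-json.txt") as handle: read_tweet_json(handle)`.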

### 4. Displaying head and basic info of three dataframes

We displayed the first lines of each of the three dataframes from files `twitter-archive-enhanced.csv`, `image-predictions.tsv` and `tweet-json.txt`.

We also displayed basic information about the three dataframes to learn their column datatypes and to detect inconsistencies, both quality and tidiness issues.
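A first look at a dataframe can be sketched as below. `summarize` is a hypothetical helper that condenses what `info()` prints into a small table, and the sample frame is made up.

```python
import numpy as np
import pandas as pd


def summarize(df):
    """Return one row per column with its dtype and number of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })


sample = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "expanded_urls": ["https://example.com/1", None, "https://example.com/3"],
    "in_reply_to_status_id": [np.nan, np.nan, 1001.0],
})

print(sample.head())  # first rows
sample.info()         # column dtypes and non-null counts
summary = summarize(sample)
print(summary)
```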

## Step 2: Assessing Data

We detected several quality and tidiness issues.

### Quality issues

`twitter_archive_enhanced` dataframe:
1. `in_reply_to_status_id`, `in_reply_to_user_id` are floats, must be integers

2. `retweeted_status_id`, `retweeted_status_user_id` are floats, must be integers

3. columns with missing values: `expanded_urls`, `retweeted_status_id`, `retweeted_status_user_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_timestamp`

4. columns that must be in datetime format, not string: `timestamp`, `retweeted_status_timestamp`

`twitter_archive_from_api` dataframe:

5. `tweet_id` values are strings, must be integers

6. columns that must be in datetime format, not string: `created_at`



### Tidiness issues

`twitter_archive_enhanced` dataframe:

1. drop rows with non-existing `expanded_urls`

2. separate tables for existing and non-existing `expanded_urls`

3. separate tables for existing and non-existing `expanded_urls`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`

4. separate tables for existing and non-existing `expanded_urls`, `in_reply_to_status_id`, `in_reply_to_user_id`

`twitter_archive_from_api` dataframe:
1. drop rows with non-existing `expanded_urls`

2. missing rating columns (rating numerator and rating denominator); both values are in the text message

3. missing dog's name. The dog's name is in the text message and needs to be extracted.

Finally, a `master_twitter_archive` dataframe must be created from the two dataframes.

## Cleaning Data

We cleaned all the issues mentioned above, created a master dataset and saved it as `twitter_archive_master.csv`.

In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

### Quality Issue #1: `twitter_archive_enhanced` dataframe:
1. `in_reply_to_status_id`, `in_reply_to_user_id` are floats, must be integers

2. `retweeted_status_id`, `retweeted_status_user_id` are floats, must be integers

#### Define:
Convert datatype from float to integer
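Because these ID columns contain missing values, a plain `int64` cast would fail. A minimal sketch using pandas' nullable `Int64` dtype is shown below; the sample values are made up, and the choice of `Int64` is an assumption rather than the notebook's confirmed method.

```python
import numpy as np
import pandas as pd

ehd_twitter_clean = pd.DataFrame({
    "in_reply_to_status_id": [np.nan, 1001.0],
    "in_reply_to_user_id": [np.nan, 42.0],
    "retweeted_status_id": [2002.0, np.nan],
    "retweeted_status_user_id": [43.0, np.nan],
})

id_columns = [
    "in_reply_to_status_id",
    "in_reply_to_user_id",
    "retweeted_status_id",
    "retweeted_status_user_id",
]

# "Int64" (capital I) is the nullable integer dtype: missing values become <NA>.
ehd_twitter_clean[id_columns] = ehd_twitter_clean[id_columns].astype("Int64")
print(ehd_twitter_clean.dtypes)
```

One caveat: `float64` represents integers exactly only up to 2**53, so very large IDs may already have lost digits by the time they are read as floats. Reading such columns as strings at load time avoids that.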

### Quality Issue #2: `twitter_archive_enhanced` dataframe:

3. columns with missing values: `expanded_urls`, `retweeted_status_id`, `retweeted_status_user_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_timestamp`

#### Define
We cannot fill in the missing values, so we leave them as they are at this stage. They reflect a tidiness issue listed in the tidiness group and are handled in the tidiness solutions below.

### Quality Issue #3: `twitter_archive_enhanced` dataframe:

4. columns that must be in datetime format, not string: `timestamp`, `retweeted_status_timestamp`

#### Define
Convert `timestamp`, `retweeted_status_timestamp` datatype from object to datetime
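The conversion can be sketched with `pd.to_datetime`. The timestamp layout (`YYYY-MM-DD HH:MM:SS +0000`) is an assumption about the file, and the sample values are made up.

```python
import pandas as pd

ehd_twitter_clean = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
    "retweeted_status_timestamp": [None, "2017-07-19 00:47:34 +0000"],
})

for column in ["timestamp", "retweeted_status_timestamp"]:
    # Missing entries become NaT; utc=True yields timezone-aware values.
    ehd_twitter_clean[column] = pd.to_datetime(
        ehd_twitter_clean[column],
        format="%Y-%m-%d %H:%M:%S %z",  # assumed layout
        utc=True,
    )

print(ehd_twitter_clean.dtypes)
```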

### Quality Issue #4: `twitter_archive_from_api` dataframe:
5. `tweet_id` values are strings, must be integers

6. columns that must be in datetime format, not string: `created_at`

#### Define
Convert `tweet_id` datatype from object to integer

Convert `created_at` datatype from object to datetime
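Both conversions can be sketched as below. The `created_at` layout follows the classic Twitter API style (for example `Tue Aug 01 16:23:56 +0000 2017`), which is an assumption about the file; the sample values are illustrative.

```python
import pandas as pd

api_twitter_clean = pd.DataFrame({
    "tweet_id": ["892420643555336193", "892177421306343426"],
    "created_at": [
        "Tue Aug 01 16:23:56 +0000 2017",
        "Tue Aug 01 00:17:27 +0000 2017",
    ],
})

# IDs of this size fit in a 64-bit integer.
api_twitter_clean["tweet_id"] = api_twitter_clean["tweet_id"].astype("int64")

# An explicit format avoids ambiguous parsing of the day/month order.
api_twitter_clean["created_at"] = pd.to_datetime(
    api_twitter_clean["created_at"],
    format="%a %b %d %H:%M:%S %z %Y",  # assumed layout
)

print(api_twitter_clean.dtypes)
```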

### Tidiness Issue #1: `twitter_archive_enhanced` dataframe:

1. drop rows with non-existing `expanded_urls`

2. separate tables for existing and non-existing `expanded_urls`

3. separate tables for existing and non-existing `expanded_urls`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`

4. separate tables for existing and non-existing `expanded_urls`, `in_reply_to_status_id`, `in_reply_to_user_id`
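One way to read the separation described above is to split the rows with boolean masks. In the sketch below the table names `missing_expanded_urls`, `retweets` and `replies` are hypothetical, the sample rows are made up, and treating a non-null ID as "is a retweet" or "is a reply" is an assumption.

```python
import numpy as np
import pandas as pd

ehd = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "expanded_urls": ["https://example.com/1", None,
                      "https://example.com/3", "https://example.com/4"],
    "retweeted_status_id": [np.nan, np.nan, 10.0, np.nan],
    "in_reply_to_status_id": [np.nan, np.nan, np.nan, 20.0],
})

has_url = ehd["expanded_urls"].notna()
is_retweet = ehd["retweeted_status_id"].notna()
is_reply = ehd["in_reply_to_status_id"].notna()

# Rows without a URL go to their own table instead of being lost.
missing_expanded_urls = ehd[~has_url].copy()
retweets = ehd[has_url & is_retweet].copy()
replies = ehd[has_url & is_reply].copy()

# What remains: original tweets that have a URL.
ehd_twitter_clean = ehd[has_url & ~is_retweet & ~is_reply].copy()
print(ehd_twitter_clean)
```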


### Tidiness Issue #2: `twitter_archive_from_api` dataframe:

1. drop rows with non-existing `expanded_urls`

2. missing rating columns (rating numerator and rating denominator); both values are in the text message

3. missing dog's name. The dog's name is in the text message and needs to be extracted.

#### Define
Create a new dataframe `api_missing_expanded_urls` by slicing out the rows with missing `expanded_urls`

Drop rows with missing `expanded_urls`

Create `rating_numerator` and `rating_denominator` columns by extracting the rating from the `full_text` column
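These steps can be sketched as follows. The regular expression is an assumption: it takes the first `number/number` pattern in the text, so fractions that are not ratings (dates, for instance) would also match and may need manual review. Sample rows are made up.

```python
import pandas as pd

api = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "expanded_urls": ["https://example.com/1", None, "https://example.com/3"],
    "full_text": [
        "This is Phineas. 13/10 would pet",
        "No link on this one 12/10",
        "Meet Bella. She is 11.5/10 fluffy",
    ],
})

# Keep the rows without a URL in their own table, then drop them.
api_missing_expanded_urls = api[api["expanded_urls"].isna()].copy()
api_twitter_clean = api.dropna(subset=["expanded_urls"]).copy()

# Group 0 is the numerator (decimals allowed), group 1 the denominator.
ratings = api_twitter_clean["full_text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
api_twitter_clean["rating_numerator"] = ratings[0].astype(float)
api_twitter_clean["rating_denominator"] = ratings[1].astype(float)

print(api_twitter_clean[["tweet_id", "rating_numerator", "rating_denominator"]])
```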

### Tidiness Issue #3: Create a `master_twitter_archive` dataframe from the two dataframes `ehd_twitter_clean` and `api_twitter_clean`:

#### Define

1. Drop `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` in the `ehd_twitter_clean` dataframe.

2. Rename `timestamp` to `created_at` in the `ehd_twitter_clean` dataframe.

3. Rename `text` to `full_text` in the `ehd_twitter_clean` dataframe.

4. Keep only `tweet_id`, `retweet_count` and `favorite_count` in the `api_twitter_clean` dataframe and drop all other columns.

5. Merge the three dataframes `image_predictions`, `ehd_twitter_clean` and `api_twitter_clean` into one `master_twitter_archive` dataframe.
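The five steps above can be sketched as below. Joining on `tweet_id` with an inner join is an assumption (it keeps only tweets present in all three tables), and the sample data, including the `p1` and `p1_dog` columns, is made up.

```python
import numpy as np
import pandas as pd

ehd_twitter_clean = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "timestamp": ["2017-08-01", "2017-07-31", "2017-07-30"],
    "text": ["This is Phineas. 13/10", "This is Tilly. 13/10", "This is Archie. 12/10"],
    "in_reply_to_status_id": [np.nan] * 3,
    "in_reply_to_user_id": [np.nan] * 3,
    "retweeted_status_id": [np.nan] * 3,
    "retweeted_status_user_id": [np.nan] * 3,
    "retweeted_status_timestamp": [None] * 3,
})
api_twitter_clean = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "created_at": ["2017-08-01", "2017-07-31", "2017-07-30"],
    "retweet_count": [8, 6, 4],
    "favorite_count": [39, 33, 25],
})
image_predictions = pd.DataFrame({
    "tweet_id": [1001, 1003],
    "p1": ["golden_retriever", "pug"],
    "p1_dog": [True, True],
})

# Steps 1-3: drop the reply/retweet columns and align the column names.
ehd_twitter_clean = ehd_twitter_clean.drop(columns=[
    "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id",
    "retweeted_status_user_id", "retweeted_status_timestamp",
]).rename(columns={"timestamp": "created_at", "text": "full_text"})

# Step 4: keep only the key and the two count columns.
api_twitter_clean = api_twitter_clean[["tweet_id", "retweet_count", "favorite_count"]]

# Step 5: merge the three tables on tweet_id (inner join assumed).
master_twitter_archive = (
    ehd_twitter_clean
    .merge(api_twitter_clean, on="tweet_id", how="inner")
    .merge(image_predictions, on="tweet_id", how="inner")
)
print(master_twitter_archive)
```

A left join from `ehd_twitter_clean` would instead keep tweets without predictions, at the cost of new missing values.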

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
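Storing the result can be sketched as below. The snippet writes to a temporary folder so it leaves no file behind; in the notebook the path would simply be `"twitter_archive_master.csv"`. The sample frame is made up.

```python
import os
import tempfile

import pandas as pd

master_twitter_archive = pd.DataFrame({
    "tweet_id": [1001, 1003],
    "rating_numerator": [13.0, 12.0],
    "retweet_count": [8, 4],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "twitter_archive_master.csv")
    # index=False keeps the row index out of the file.
    master_twitter_archive.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded)
```

Datetime columns are written as text, so they need to be parsed again (for example with `parse_dates`) when the file is reloaded.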

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1. There are more tweets from iPhone than from other platforms.

2. The most frequent rating numerator is `12`.

3. There is a high correlation between `retweet_count` and `favorite_count`.

4. Most of the image predictions are flagged `True`.
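The kind of computation behind such insights can be sketched as below. The sample data is made up, so the numbers only illustrate the method and are not the project's results; the `source` and `p1_dog` column names are assumptions.

```python
import pandas as pd

master = pd.DataFrame({
    "source": ["Twitter for iPhone", "Twitter for iPhone",
               "Twitter Web Client", "Twitter for iPhone"],
    "rating_numerator": [12, 13, 12, 10],
    "retweet_count": [100, 200, 50, 400],
    "favorite_count": [300, 610, 140, 1250],
    "p1_dog": [True, True, False, True],
})

# 1. Tweets per platform.
source_counts = master["source"].value_counts()

# 2. Most frequent rating numerator.
most_common_numerator = master["rating_numerator"].mode().iloc[0]

# 3. Pearson correlation between retweets and favorites.
correlation = master["retweet_count"].corr(master["favorite_count"])

# 4. Share of predictions flagged True.
share_true = master["p1_dog"].mean()

print(source_counts, most_common_numerator, correlation, share_true, sep="\n")
```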

### Visualization

We used three different visualisations:
* Barplot
* Pieplot
* Heatmap