# Report for the Wrangle and Analyze Data project at [Udacity](https://www.udacity.com/)

### [Requirements:](https://review.udacity.com/#!/rubrics/1136/view)
- Create a **300-600 word written report** called `wrangle_report.pdf` or `wrangle_report.html` that briefly describes the wrangling efforts. It is framed as an internal document.

## Introductory steps that were done
- The necessary libraries `pandas`, `numpy`, `requests`, `tweepy`, `json` and `matplotlib.pyplot` were imported.
- The 'magic' command `%matplotlib inline` is used.

## Data Gathering

### 1. The file `twitter_archive_enhanced.csv` was downloaded from the [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv) that was provided by Udacity.

### 2. The `image_predictions.tsv` file was downloaded programmatically using the [Requests](https://2.python-requests.org//en/master/) library
#### The steps that were done:
- The `image_predictions_url` was provided by Udacity. It was saved.
- The response was obtained and saved using `requests.get()`.
- The response was checked using `response.status_code`. It was equal to 200, as expected.
- The `image-predictions.tsv` was opened using [`with open() as` statement](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement)
- The `OAuthHandler` was imported from `tweepy`.
- The `default_timer` was imported from `timeit`. It is a clock used to measure the execution time of small code snippets, see https://docs.python.org/3/library/timeit.html.
- The Twitter API was queried for each tweet in the Twitter archive and saved as JSON in a text file. 
- The actual keys and tokens were replaced by the word 'HIDDEN'.
- The `twitter-archive-enhanced.csv` was read to a dataframe using `read_csv()`.
- The Twitter API was queried for JSON data for each tweet ID in the Twitter archive using `with`, `try` and `except`.
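
The download and query steps above could look roughly like the following sketch. The helper names `download_file` and `save_tweets` and the parameters `get` and `fetch` are hypothetical and are not taken from the notebook. `fetch` stands in for whatever call returns one tweet as a dictionary (for instance a thin wrapper around `tweepy`), so no keys or tokens are needed here.

```python
import json
from timeit import default_timer as timer


def download_file(url, path, get=None):
    """Fetch `url` and write the raw response body to `path`."""
    if get is None:
        import requests  # imported lazily so the sketch can be exercised offline
        get = requests.get
    response = get(url)
    # Anything other than 200 means the download did not succeed.
    if response.status_code != 200:
        raise RuntimeError(f"unexpected status code: {response.status_code}")
    with open(path, mode="wb") as file:
        file.write(response.content)


def save_tweets(tweet_ids, fetch, path):
    """Write one JSON object per line; return (failed IDs, elapsed seconds)."""
    failed = {}
    start = timer()
    with open(path, mode="w", encoding="utf-8") as outfile:
        for tweet_id in tweet_ids:
            try:
                tweet = fetch(tweet_id)  # hypothetical: returns the tweet as a dict
            except Exception as error:   # e.g. the tweet no longer exists
                failed[tweet_id] = error
                continue
            json.dump(tweet, outfile)
            outfile.write("\n")
    return failed, timer() - start
```

Collecting the failed IDs instead of stopping at the first error keeps the loop running over the whole archive, and the timer gives a rough idea of how long the full query takes.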

### 3. The obtained `tweet_json.txt` file was read line by line 
**The steps that were done:**
- An empty list was created.
- The `tweet_json.txt` file was read using `with`, `readline()` in a `for` loop.
- The **tweet ID**, **retweet count**, and **favorite count** were read.
- The data were stored to the list.
- The list was converted into a pandas DataFrame.
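
A minimal sketch of these steps is shown below. It assumes that every line of `tweet_json.txt` holds one JSON object and that the fields are named `id`, `retweet_count` and `favorite_count`, as in Twitter's tweet objects. The function name `read_tweet_counts` is hypothetical, and the sketch iterates over the file object, which yields one line at a time much like repeated `readline()` calls.

```python
import json


def read_tweet_counts(path):
    """Return one dict per tweet with its ID, retweet count and favorite count."""
    rows = []  # start from an empty list
    with open(path, encoding="utf-8") as file:
        for line in file:  # each line is expected to hold one JSON object
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return rows
```

The returned list of dictionaries can then be passed to `pd.DataFrame(rows)` to obtain the DataFrame.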

## Assess Data

### [Requirements:](https://review.udacity.com/#!/rubrics/1136/view)
- Detect and document at least **eight (8) quality issues** and **two (2) tidiness issues**.
- To meet specifications, the issues that satisfy the Project Motivation (**see the Key Points** header in `wrangle_act` file) must be assessed.

**The steps that were done:**
- The DataFrames obtained from the files `twitter-archive-enhanced.csv`, `image_predictions.tsv` and `tweet_json.txt` were inspected using the `.head()`, `.info()`, `.value_counts()`, `.duplicated()` and `.islower()` methods.
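
As an illustration of this kind of inspection, the sketch below applies the same methods to a small toy DataFrame. The data are invented and the variable names are hypothetical; on a pandas Series the lower-case check is reached through the `.str` accessor.

```python
import pandas as pd

# Toy data standing in for the archive; the values are invented.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Charlie", "a", "a", "None"],
})

archive.info()                          # dtypes and non-null counts per column
print(archive.head())                   # first rows of the table
print(archive["name"].value_counts())   # how often each name occurs

# Number of rows whose tweet_id repeats an earlier one.
n_duplicates = archive["tweet_id"].duplicated().sum()

# Names written entirely in lower case, which are unlikely to be real names.
lowercase_names = archive.loc[archive["name"].str.islower(), "name"]
```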

## Quality issues that were detected
**in twitter-archive-enhanced:**
Only original ratings (no retweets) that have images are needed; however, the archive contains:

1. replies: 78,
2. retweets: 181,
3. entries without images = entries without URLs = 2356 entries - 2297 `expanded_urls` = 59.
4. The columns *in_reply_to_status_id*, *in_reply_to_user_id*, *retweeted_status_id*, *retweeted_status_user_id* and *retweeted_status_timestamp* are not needed.
5. The *timestamp* column has the `object` dtype, while it should be a datetime.
6. There are only four sources of images: *Twitter for iPhone*, *Vine - Make a Scene*, *Twitter Web Client*, and *TweetDeck*, but this is not visible because the full URL is given.
7. Lower-case entries in the *name* column are not real names.
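
One way to address the first five issues is sketched below on invented toy data. It keeps only rows that are neither replies nor retweets and that have a URL, drops reply and retweet columns, and converts the timestamp. This is a sketch under those assumptions, not necessarily the exact code used in the notebook.

```python
import pandas as pd

# Toy rows: 1 is an original tweet, 2 a reply, 3 a retweet, 4 has no URL.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "in_reply_to_status_id": [None, 10.0, None, None],
    "retweeted_status_id": [None, None, 20.0, None],
    "expanded_urls": ["https://example.com/1", "https://example.com/2",
                      "https://example.com/3", None],
})

is_original = (
    archive["in_reply_to_status_id"].isna()    # not a reply
    & archive["retweeted_status_id"].isna()    # not a retweet
    & archive["expanded_urls"].notna()         # has a URL, hence an image
)
originals = archive.loc[is_original].copy()

# The reply and retweet columns carry no information once the rows are filtered.
originals = originals.drop(columns=["in_reply_to_status_id", "retweeted_status_id"])

# Convert the timestamp strings to a proper datetime dtype.
originals["timestamp"] = pd.to_datetime(originals["timestamp"])
```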

**in img_predictions:**

8. The p1, p2, and p3 column names are unclear.
9. 543 images may not show dogs.
10. The underscore in the breed names is unnecessary.
11. The 619 breed names are lower case, but 1532-692 = 840 are upper case.
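
The two breed-name issues could be handled by one small normalisation step. The sketch below is one possible choice (spaces instead of underscores, title case throughout); the function name `normalize_breed` is hypothetical.

```python
def normalize_breed(name):
    """Replace underscores with spaces and capitalise every word."""
    return name.replace("_", " ").title()


print(normalize_breed("golden_retriever"))
print(normalize_breed("German_shepherd"))
```

With pandas, the same idea can be applied to a whole column through `.str.replace("_", " ").str.title()`. Note that `title()` also capitalises letters after hyphens, which may or may not be wanted for hyphenated breed names.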

### Tidiness issues
1. All three tables can be merged into a single table on the `tweet_id` column.

**in twitter-archive-enhanced.csv:**
2. The dog "stage" (i.e. doggo, floofer, pupper, and puppo) is a single variable, so it should be stored in one column instead of four.
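
Both tidiness issues can be sketched on toy data as follows. The sketch assumes that a missing stage is stored as the string `"None"` in each of the four stage columns; the helper `combine_stages` and the choice to join multiple stages with a hyphen are hypothetical.

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Toy tables with invented values.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
counts = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [10, 20, 30],
    "favorite_count": [100, 200, 300],
})


def combine_stages(row):
    """Collapse the four stage columns of one row into a single label."""
    found = [row[column] for column in stage_columns if row[column] != "None"]
    return "-".join(found) if found else "None"


archive["stage"] = archive[stage_columns].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_columns)

# Merge the tables on the shared tweet_id column.
merged = archive.merge(counts, on="tweet_id", how="inner")
```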

## The findings:
1. A manual check of the images shows that even when all three predictions are not dogs, there may still be a dog in the picture, see, for example, this [image](https://pbs.twimg.com/media/DBW35ZsVoAEWZUU.jpg). Cleaning this issue therefore requires a manual check or a better prediction algorithm, which is beyond the scope of this project.
2. The `retweet_count` and `favorite_count` are correlated with each other, with a correlation coefficient of 0.93. However, they are practically independent of `rating_numerator` and of the day of the week.
3. A small correlation with `favorite_count` only is observed for the length of the description `text`, `p1_dog`, `p2_dog` and `p1_conf`. The corresponding correlation coefficients round to 0.1.
4. Most tweets are unclassified: `None` - 1963 tweets, `pupper` - 209, `doggo` - 72, `puppo` - 23, `floofer` - 8.
5. The `None` and `pupper` distributions are non-Gaussian, with a heavy right tail, while the `doggo`, `puppo` and `floofer` distributions have too few entries to determine their shape, see the figures below.

<img src="retweet_count_distributions.png" width="100%"> <img src="favorite_count_distributions.png" width="100%">

6. The `None` distribution has two peaks. This may indicate that there is a group of unclassified images that are much more popular than the rest.

7. The average `favorite_count` is several times larger than the average `retweet_count` for every dog "stage", see the figures below.
<img src="retweet_count_boxplot.png" width="45%"> <img src="favorite_count_boxplot.png" width="45%">

8. The unclassified `None` tweets have similar popularity to `pupper`, while `doggo`, `puppo` and `floofer` are much more popular in both the number of retweets and the number of favorites.
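
The correlation coefficients and per-stage averages quoted in the findings can be obtained with standard pandas calls. The sketch below uses invented numbers, so its results do not reproduce the values reported above; it only illustrates the kind of calculation involved.

```python
import pandas as pd

# Toy data with invented values.
tweets = pd.DataFrame({
    "stage": ["None", "pupper", "None", "doggo", "doggo"],
    "retweet_count": [10, 20, 30, 40, 50],
    "favorite_count": [35, 50, 110, 120, 160],
    "rating_numerator": [12, 10, 13, 11, 12],
})

# Pairwise Pearson correlation between the numeric columns.
numeric_columns = ["retweet_count", "favorite_count", "rating_numerator"]
correlations = tweets[numeric_columns].corr()
r = correlations.loc["retweet_count", "favorite_count"]

# Average retweet and favorite counts for each dog stage.
stage_means = tweets.groupby("stage")[["retweet_count", "favorite_count"]].mean()
```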