## Project: Get Insights on Twitter Data "WeRateDogs"

### Introduction - About the Data

In this report I share some insights on the dataset `Twitter Data WeRateDogs`. This dataset consists of about 2000 tweets posted on Twitter between late 2015 and mid-2017. The Twitter account [WeRateDogs](https://twitter.com/dog_rates?lang=en) regularly posts dog pictures with funny comments and ratings.
To create these insights I merged and analyzed data from three different sources:
* a Twitter archive of tweets provided as a CSV file by [Udacity](https://www.udacity.com/)
* tweet data gathered via [Twitter's API](https://developer.twitter.com/en/docs)
* a dataset of image prediction data provided by [Udacity](https://www.udacity.com/)  
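
Combining the three sources could look roughly like the following sketch. It assumes that every source carries a shared key column, here called `tweet_id` (a hypothetical name not stated in this report), and it uses tiny made-up frames in place of the real files.

```python
import pandas as pd

# Tiny made-up stand-ins for the three sources. In the real project they
# would come from the CSV archive, the API results and the prediction file.
# The key column name `tweet_id` is an assumption for this sketch.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating_numerator": [13.0, 12.0, 14.0],
})
api_data = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [7000, 150, 2300],
    "favorite_count": [7526, 900, 5100],
})
predictions = pd.DataFrame({
    "tweet_id": [1, 3],
    "p1": ["golden retriever", "Samoyed"],
    "p1_dog": [True, True],
})

# Inner joins keep only tweets that appear in all three sources.
tweets = (
    archive
    .merge(api_data, on="tweet_id", how="inner")
    .merge(predictions, on="tweet_id", how="inner")
)
print(tweets)
```

An inner join drops tweets that lack an image prediction. Whether that is desirable depends on the question being asked; a left join would keep them with missing prediction values.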
  
For the insights presented in this report I used the following data:

|field name|description|example              |
|:---|:---|:---|
|`rating_numerator` | The numerator X of the X/10 rating that WeRateDogs gives in the tweet (often > 10), stored as float | 14.0 |
|`retweet_count` | How many retweets the tweet got | 7000 |
|`favorite_count` | How often the tweet was marked as favorite | 7526 |
|`p1` | The algorithm's `#1` (most likely) prediction for the image in the tweet | golden retriever |
|`p1_dog` | Whether or not the `#1` prediction is a breed of dog | `TRUE` |
|`p2` | The algorithm's second most likely prediction | Labrador retriever |
|`p2_dog` | Whether or not the `#2` prediction is a breed of dog | `TRUE` |
|`p3` | The algorithm's third most likely prediction | Samoyed |
|`p3_dog` | Whether or not the `#3` prediction is a breed of dog | `TRUE` |


In addition to the original tweet data from Twitter, the analyzed dataset includes image predictions made by a machine learning algorithm. Based on this data I focused on the following questions:
  
* Which dog breeds (recognized by the machine learning algorithm) are rated highest by WeRateDogs?
* Which dog breeds have the highest favorite counts or retweet counts?
* Is there a correlation between rating and favorite count?

## Dataset insights

First I wanted to know how many and which dog breeds were recognized by the algorithm. For each image I checked whether any of the predictions `p1` to `p3` was a dog breed, and I added a column `most_probable_breed` that holds the breed predicted as most probable.  
In around __85%__ of all tweets the image recognition algorithm predicted at least one dog breed. In total, __113__ different dog breeds were predicted as `most_probable_breed`.  
I then asked which breeds are most prominent in the dataset, that is, which are the top 10 predictions.  

Here's the result (in absolute numbers):

<div style="width: 250px; height: 280px;">
    <img src='./plots/Top_10_Dogs_in_dataset.png' width="200%" height="200%">
</div>
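
The breed selection and the top 10 count described above can be sketched as follows. This is only one plausible reading of the step: it assumes that `most_probable_breed` is the highest-ranked of `p1` to `p3` that is flagged as a dog, and the sample rows are made up.

```python
import pandas as pd

# Made-up prediction rows using the column names from the table above.
predictions = pd.DataFrame({
    "p1": ["golden retriever", "seat belt", "web site"],
    "p1_dog": [True, False, False],
    "p2": ["Labrador retriever", "pug", "desk"],
    "p2_dog": [True, True, False],
    "p3": ["Samoyed", "Chihuahua", "mouse"],
    "p3_dog": [True, True, False],
})

def most_probable_breed(row):
    """Return the highest-ranked prediction that is a dog breed, else None."""
    for rank in ("p1", "p2", "p3"):
        if row[f"{rank}_dog"]:
            return row[rank]
    return None

predictions["most_probable_breed"] = predictions.apply(most_probable_breed, axis=1)

# Share of tweets with at least one dog prediction, and the most frequent breeds.
share_with_breed = predictions["most_probable_breed"].notna().mean()
top_breeds = predictions["most_probable_breed"].value_counts().head(10)
print(share_with_breed)
print(top_breeds)
```

Rows without any dog prediction get a missing value, so `value_counts()` skips them and they do not appear in the top 10.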

Based on this I looked at the ratings, favorite counts, and retweet counts for each of these top 10 dog breeds.

<div>
<img src="./plots/average_rating_top10_breeds.png" width="800"/>
</div>

<div>
<img src="./plots/favorite_counts_top10_breeds.png" width="800"/>
</div>

<div>
<img src="./plots/retweets_counts_top10_breeds.png" width="800"/>
</div>
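
A per-breed comparison like the one in the three plots can be computed with a group-by. The sketch below assumes that the mean is the aggregation of interest for all three fields (the plots may use a different one) and works on made-up rows with a top 2 instead of a top 10.

```python
import pandas as pd

# Made-up tweets; column names follow the table in the introduction.
tweets = pd.DataFrame({
    "most_probable_breed": ["pug", "pug", "Samoyed", "Samoyed", "Chihuahua"],
    "rating_numerator": [10.0, 12.0, 13.0, 11.0, 9.0],
    "favorite_count": [1000, 3000, 9000, 7000, 500],
    "retweet_count": [300, 700, 2500, 1500, 100],
})

# Restrict to the most frequent breeds (top 2 here, top 10 in the report).
top_n = 2
top_breeds = tweets["most_probable_breed"].value_counts().head(top_n).index

# Mean rating, favorites and retweets per breed, highest favorites first.
breed_means = (
    tweets[tweets["most_probable_breed"].isin(top_breeds)]
    .groupby("most_probable_breed")[["rating_numerator", "favorite_count", "retweet_count"]]
    .mean()
    .sort_values("favorite_count", ascending=False)
)
print(breed_means)
```

With skewed counts such as favorites, the median can be a more robust choice than the mean; swapping `.mean()` for `.median()` is enough to compare the two.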

Not surprisingly, there is a clear correlation between retweets and favorites. But how does the rating relate to favorites or retweets? The scatter plot below shows favorite counts against ratings.

<div>
<img src="./plots/favorite_counts_based_on_rating.png" width="800"/>
</div>

I ran an OLS regression to analyze the significance of the correlation. Here is the output:
<div>
<img src="./plots/OLS_regression_output.png" width="1000"/>
</div>


As the output shows, the relationship between the two variables is statistically significant (the p-value is reported as 0.00).
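
A small p-value indicates that a relationship is unlikely to be due to chance, but it does not say how strong the relationship is. The slope and R² of the fit help with that. The sketch below is not the code behind the output shown above: it fits a simple least-squares line with NumPy on made-up numbers, assuming favorite count as the dependent variable and rating as the single predictor, and it does not compute a p-value.

```python
import numpy as np

# Made-up values for illustration only; they are not taken from the dataset.
rating = np.array([8.0, 10.0, 11.0, 12.0, 13.0, 14.0])
favorites = np.array([900.0, 2100.0, 4000.0, 6500.0, 11000.0, 20000.0])

# Design matrix with an intercept column:
# favorites = intercept + slope * rating + error
X = np.column_stack([np.ones_like(rating), rating])
coef, *_ = np.linalg.lstsq(X, favorites, rcond=None)
intercept, slope = coef

# R^2 = 1 - SS_res / SS_tot: share of variance explained by the fit.
fitted = X @ coef
ss_res = np.sum((favorites - fitted) ** 2)
ss_tot = np.sum((favorites - favorites.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson correlation; with one predictor, r squared equals R^2.
r = np.corrcoef(rating, favorites)[0, 1]

print(f"slope={slope:.1f}, intercept={intercept:.1f}, R^2={r_squared:.3f}, r={r:.3f}")
```

Reporting R² next to the p-value makes it easier to judge whether a significant correlation is also a practically relevant one.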

## Conclusion

We looked at the recognized dog breeds and saw which breeds get higher ratings, favorite counts, and retweet counts.
Additionally, we looked at the correlation between rating and favorite count.
Based on the data, there is a statistically significant correlation between the two.

Limitations:
* As there are many different dog breeds, the number of tweets per breed is relatively small (33-155 individual tweets per dog breed), so all observations are limited accordingly.
* A central assumption concerning the dog breeds is that the image recognition algorithm is reliable. I didn't verify this, so it may also have a significant impact on the results.