# Exploratory Data Analysis Report for the @WeRateDogs Data Wrangling Project
<br><br>





## Introduction:
After wrangling the Data and producing a clean and tidy master Dataset of the @WeRateDogs Twitter Archive, I came up with the following Questions that I thought were worth investigating:
- What does the Distribution of the Ratings look like?
- How are the Ratings, the Favorite Count and the Retweet Count related to each other?
- Which Factors in the Dataset might be influencing these Outcomes? In particular: do the Dog Stages and the Dog Breeds influence the Popularity of the Tweets?



### A Glance at the Dataset and Variables to investigate:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# importing the cleaned Dataset:
df_dogs = pd.read_csv('twitter_archive_master.csv')
df_dogs.head(3)
```

| Unnamed: 0 | tweet_id | timestamp | text | expanded_urls | rating_numerator | name | favorite_count | retweet_count | dog_stage | best_predicted_breed | jpg_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892177421306343426 | 2017-08-01 00:17:27 | This is Tilly. She's just checking pup on you.... | https://twitter.com/dog_rates/status/892177421... | 13.0 | Tilly | 32382 | 6075 | | Chihuahua | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg |
| 1 | 891815181378084864 | 2017-07-31 00:18:03 | This is Archie. He is a rare Norwegian Pouncin... | https://twitter.com/dog_rates/status/891815181... | 12.0 | Archie | 24381 | 4017 | | Chihuahua | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg |
| 2 | 891689557279858688 | 2017-07-30 15:58:51 | This is Darla. She commenced a snooze mid meal... | https://twitter.com/dog_rates/status/891689557... | 13.0 | Darla | 41016 | 8370 | | Labrador_retriever | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg |


To help answer my Questions and to better see possible Correlations between the Ratings and other Variables, I chose to add a categorical Variable "rating_level", grouping the Ratings into 5 ordered bins: "low (<=9)", "medium (10 - 11)", "intermediate (12)", "high (13)" and "highest (14)".
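
The binning described above could be sketched as follows. This is a minimal illustration, assuming `pd.cut` with right-closed intervals; the bin edges and the toy Ratings are my assumptions and are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy Ratings for illustration only (not taken from the real Dataset).
toy = pd.DataFrame({"rating_numerator": [7.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]})

# Assumed bin edges, right-closed: (-inf, 9], (9, 11], (11, 12], (12, 13], (13, 14].
bins = [-np.inf, 9, 11, 12, 13, 14]
labels = ["low (<=9)", "medium (10 - 11)", "intermediate (12)", "high (13)", "highest (14)"]

# With a list of labels, pd.cut returns an ordered categorical column.
toy["rating_level"] = pd.cut(toy["rating_numerator"], bins=bins, labels=labels)
print(toy)
```

Note that with these edges a fractional Rating such as 9.5 would fall into the "medium" bin, and any Rating above 14 would become `NaN`.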

### How are the Ratings distributed?

<img src="images/rating_side.png">

The Tweets are meant to be "generously" rated: the overall average Rating is 10.8 and <b>68.4%</b> of the Ratings are between 10 and 12. Even within the "low" rated Proportion of the Tweets (<b>16.5%</b> are <= 9/10), <b>73.1%</b> are between 8 and 9.<br>
A smaller Proportion of <b>15.2%</b> of the Tweets stands out with the highest Ratings of 13 or 14 (14 being the highest Rating given, which only 25 Tweets received).
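
Proportions like these can be computed with boolean masks. The sketch below uses made-up Ratings, so its numbers do not reproduce the figures above; it only shows the technique.

```python
import pandas as pd

# Toy Ratings for illustration only (not taken from the real Dataset).
ratings = pd.Series([8, 9, 10, 11, 12, 12, 13, 14], name="rating_numerator")

mean_rating = ratings.mean()

# The mean of a boolean mask is the share of True values.
share_10_to_12 = ratings.between(10, 12).mean() * 100  # both bounds inclusive
share_low = (ratings <= 9).mean() * 100
share_13_plus = (ratings >= 13).mean() * 100

print(f"mean: {mean_rating:.1f}")
print(f"10 to 12: {share_10_to_12:.1f}%, <= 9: {share_low:.1f}%, >= 13: {share_13_plus:.1f}%")
```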

### Inspect the Retweet Count and Favorite Count for each Rating Level:

<img src="images/fav_retweet_side.png">

As we would expect, both Popularity Measures, the Favorite Count and the Retweet Count, seem to have a very similar positive Correlation with the Ratings.
The higher the Rating, the more often the Tweets were liked and retweeted. Both curves become steeper when the Ratings are high: on average, a Tweet with a Rating of 13 or more is favorited about <b>5</b> times more often and retweeted about <b>4.5</b> times more often than a Tweet with a Rating between 10 and 11.  
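
A comparison of this kind can be obtained by averaging both counts per Rating Level and dividing the group means. The following is a sketch with invented counts and simplified level names, not the code or data behind the plot.

```python
import pandas as pd

# Toy data for illustration only; level names and counts are invented.
toy = pd.DataFrame({
    "rating_level": ["medium", "medium", "high", "high"],
    "favorite_count": [1000, 3000, 6000, 10000],
    "retweet_count": [400, 600, 1000, 2000],
})

# Mean Favorite and Retweet Count per Rating Level.
level_means = toy.groupby("rating_level")[["favorite_count", "retweet_count"]].mean()

# How many times larger the means of the "high" level are than those of "medium".
ratio = level_means.loc["high"] / level_means.loc["medium"]
print(level_means)
print(ratio)
```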


### How do the Favorite and Retweet Counts relate to one another?

<img src="images/scatter.png">

We can see that there seems to be a positive Correlation between the 2 Variables. In fact, the "Favorite Count" seems to be, on average, about 3 times as large as the "Retweet Count". It may be that the number of times a Tweet is marked as "favorite" (i.e. "liked") affects how often the Tweet is retweeted. It may also be that both Variables are affected by the same Factors in a similar way. 
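
To put numbers on such a relationship, one option is the Pearson Correlation together with the typical ratio between the two counts. This sketch uses invented, perfectly proportional counts, so its output is not a result from the Dataset.

```python
import pandas as pd

# Toy counts for illustration only (not taken from the real Dataset).
toy = pd.DataFrame({
    "favorite_count": [200, 800, 1800, 3200],
    "retweet_count": [100, 400, 900, 1600],
})

# Pearson Correlation between the two counts (the default method of Series.corr).
corr = toy["favorite_count"].corr(toy["retweet_count"])

# Ratio of Favorites to Retweets per Tweet; rows with 0 Retweets are skipped
# to avoid dividing by zero.
nonzero = toy[toy["retweet_count"] > 0]
median_ratio = (nonzero["favorite_count"] / nonzero["retweet_count"]).median()

print(f"correlation: {corr:.2f}, median favorites per retweet: {median_ratio:.1f}")
```

A Correlation computed this way describes how the two counts move together; it does not tell which of the two explanations above applies.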

### Does the "dog stage" influence the Popularity of the Tweets?

By using the Pandas Function ```groupby('dog_stage')``` on the Dataset and inspecting the median values of the "Popularity" Variables (Ratings, Favorite Count and Retweet Count), I can observe the following:

It seems that all three Variables are affected by the Dog Stage in a similar way:
- Tweets that have a dog stage of either "Floofer", "Puppo", "Doggo" or "Multiple" tend to have a higher Rating (12 or 13 on average) than Tweets without a dog stage ('None' value) or marked as "Pupper" (11 on average).
- Tweets with a Dog Stage in the decreasing order of ```['Floofer', 'Puppo', 'Doggo', 'Multiple', 'None', 'Pupper']``` show, on average, a decreasing number of Favorites and Retweets.<br>

This seems to reveal a "Preference Tendency" for Tweets with a dog stage in the following decreasing order of Preference: ```['Floofer', 'Puppo', 'Doggo', 'Multiple', 'None', 'Pupper']```.
There is a strong Limitation to consider here: the number of Tweets that have a dog stage value at all (268) is very small compared to the total number of Tweets (1671). In particular, there are only 3 Records with a dog stage of "Floofer", which makes it difficult to establish any solid Tendency.
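
The grouping step, together with the group sizes that matter for the Limitation above, could look like the sketch below. The toy rows and the `n_tweets` column name are my own; only `groupby('dog_stage')` and the use of medians come from the analysis.

```python
import pandas as pd

# Toy data for illustration only (not taken from the real Dataset).
toy = pd.DataFrame({
    "dog_stage": ["Puppo", "Puppo", "Pupper", "Pupper", "Pupper", None],
    "rating_numerator": [13, 12, 11, 10, 11, 11],
    "favorite_count": [9000, 7000, 3000, 2000, 4000, 3500],
    "retweet_count": [3000, 2000, 900, 700, 1000, 1100],
})

# Treat a missing dog stage as its own group, labelled "None".
stage = toy["dog_stage"].fillna("None")
cols = ["rating_numerator", "favorite_count", "retweet_count"]

# Median of each Popularity Variable per dog stage, plus the number of Tweets per group.
medians = toy.groupby(stage)[cols].median()
medians["n_tweets"] = toy.groupby(stage).size()
medians = medians.sort_values("favorite_count", ascending=False)
print(medians)
```

Printing the group size next to each median makes it easier to see which groups are too small to support a Tendency.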

### Does the "dog breed" influence the Popularity of the Tweets?

To find out whether there is a Tendency of Preferences for specific dog breeds, we would need a sufficient number of Records for these Breeds to base our Observations on.<br>As an Example: among the Breeds that have the highest average Rating of 13/10, there are only 3 "Afghan Hounds", 4 "Saluki" and just 1 "Bouvier des Flandres".<br>Therefore, I chose to inspect the most common Breeds in the Dataset ("Golden Retriever", "Labrador Retriever", "Pembroke" and "Chihuahua") and see if I could observe any Tendency of Preferences among these Breeds. 

<img src="images/breeds_mosaic.png">

By comparing the Median Values of the Ratings and the respective Proportions of Rating Levels for the four most common Breeds, I can observe the following:<br>

1) There seem to be slightly more high Ratings for the "Golden Retriever" than for the "Pembroke". Both Breeds have the same Median Rating of 12/10, but the "Golden Retriever" has a Proportion of <b>94.9%</b> of Ratings that are equal to or higher than 10/10, versus <b>93.6%</b> for the "Pembroke". Also, the "Golden Retriever" has a Median value of 9/10 in the "low" Rating Level, versus 6.5/10 for the "Pembroke".<br>
2) The "Pembroke" has a Median Rating of 12/10, versus 11/10 for the "Labrador Retriever".<br>
3) The "Labrador Retriever" and the "Chihuahua" both have a Median Rating of 11/10. The "Labrador Retriever" has a Proportion of <b>91.4%</b> of Ratings that are equal to or higher than 10/10, versus <b>77%</b> for the "Chihuahua", and the Median Rating in the "low" Rating Level is 9/10 for the "Labrador Retriever", versus 8/10 for the "Chihuahua". So there seem to be slightly more high Ratings for the "Labrador Retriever" than for the "Chihuahua".<br>

I can observe a Tendency of Preference among these four dog breeds in the following Order:
```["Golden Retriever", "Pembroke", "Labrador Retriever", "Chihuahua"]```
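
The per-breed figures used in this comparison (number of Records, Median Rating, share of Ratings of at least 10/10) could be derived along the following lines. This is a sketch on invented rows that keeps only the 2 most common Breeds of the toy data instead of 4; the result column names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only (not taken from the real Dataset).
toy = pd.DataFrame({
    "best_predicted_breed": ["golden_retriever"] * 3 + ["Pembroke"] * 2 + ["Saluki"],
    "rating_numerator": [13, 12, 9, 12, 10, 13],
})

# Keep only the most common Breeds (2 here; the analysis above uses 4).
n_top = 2
top_breeds = toy["best_predicted_breed"].value_counts().head(n_top).index
subset = toy[toy["best_predicted_breed"].isin(top_breeds)]

# Per Breed: number of Tweets, Median Rating and share of Ratings >= 10.
per_breed = subset.groupby("best_predicted_breed")["rating_numerator"].agg(
    n_tweets="count",
    median_rating="median",
    pct_at_least_10=lambda s: (s >= 10).mean() * 100,
)
print(per_breed)
```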


## Conclusion and Limitations

### Conclusion:
I could make the following Observations:
- the Ratings are strongly positively correlated with both the Favorite Count and the Retweet Count
- the Favorite Count and the Retweet Count are positively correlated to one another
- the "dog stage" seems to play a role in the Popularity of the Tweets in the following Order of Preference: ```['Floofer', 'Puppo', 'Doggo', 'Multiple', 'None', 'Pupper']```
- the "dog breed" also seems to play a role in the Popularity of the Tweets for the four most common Breeds, in the following Order of Preference: ```["Golden Retriever", "Pembroke", "Labrador Retriever", "Chihuahua"]```

### Limitation:
The Observations we could make for the two categorical independent Variables "dog_breed" and "dog_stage" are limited by the fact that the number of Records concerned is small compared to the total number of Records. In both cases, we were only able to observe a Tendency, which we could not show to be statistically significant.<br>
 