# Reporting: act_report 
> This report summarizes the wrangling effort carried out for the WeRateDogs project. 

![image.png](attachment:image.png)

## Table Of Contents
<ul>
<li><a href="#PROJECT_OVERVIEW">1.0 Project Overview</a></li>
<li><a href="#PROJECT_OBJECTIVES">2.0 Project Objectives</a></li>
<li><a href="#DATA_QUALITY_ISSUES">3.0 Data Quality Issues</a></li>
<li><a href="#DATA_CLEANING">4.0 Data Cleaning</a></li>
<li><a href="#DATA_ANALYSIS_VISUALIZATION">5.0 Data Analysis and Visualization</a></li>
<li><a href="#CONCLUSION">6.0 Conclusion</a></li>
</ul>
 

<a id='PROJECT_OVERVIEW'></a>
## 1.0 Project Overview

This project uses a dataset of dog ratings posted by the WeRateDogs Twitter account. The aim of the project is to wrangle, analyze and visualize the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. These ratings almost always have a denominator of 10. The numerators, on the other hand, are almost always greater than 10: 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent."

The data come in three sets, each gathered in this notebook in a different way. The first dataset is read with read_csv, the second is downloaded with a GET request, and the third is queried from Twitter using the tweepy library.
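A minimal sketch of the three gathering steps, under stated assumptions: the small inline samples stand in for the real files, their column names are only illustrative, the TSV bytes would in practice come from `requests.get(url).content`, and the tweepy results are assumed to be stored as one JSON object per line.

```python
import io
import json

import pandas as pd

# Illustrative stand-ins for the real sources (assumed shapes, not the real data).
archive_csv = "tweet_id,rating_numerator\n1,12\n2,13\n"
predictions_tsv = b"tweet_id\tp1\tp1_conf\n1\tgolden_retriever\t0.95\n"
tweet_json_lines = ['{"id": 1, "favorite_count": 100, "retweet_count": 10}']

# 1. Local file: read_csv accepts a path or any file-like object.
df_1 = pd.read_csv(io.StringIO(archive_csv))

# 2. Downloaded file: the bytes would come from requests.get(url).content.
df_2 = pd.read_csv(io.BytesIO(predictions_tsv), sep="\t")

# 3. API data: one JSON object per line (assumed storage format).
df_3 = pd.DataFrame([json.loads(line) for line in tweet_json_lines])
```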

The next step is assessing the data to identify quality and tidiness issues for cleaning. After cleaning, we'll save the data in a new CSV file, then perform visualization and draw insights from it.

<a id='PROJECT_OBJECTIVES'></a>
## 2.0 Project Objectives
### 2.1 Main Objective
The main objective is to wrangle, analyze and visualize the data to produce at least **three (3) insights and one (1) visualization.**

### 2.2 Specific Objectives:
1. To find out if there's any relation between the img_num and prediction confidence.

2. To find out if there's any relation between dog_stages_type, favorite_count, and prediction_confidence.

3. To find out the dog stages with the highest number of predictions per confidence level.

4. To check which day, month or year had the most tweets.

<a id='DATA_QUALITY_ISSUES'></a>
## 3.0 Data Quality Issues
In this section, the main goal was to detect and document at least **eight (8) quality issues and two (2) tidiness issues**. 

<a id='QUALITY_ISSUES'></a>
### 3.1.0 Quality Issues
>1. The in_reply_to_status_id and in_reply_to_user_id columns in the twitter_archived_enhanced.csv have lots of missing data.

>2. The in_reply_to_status_id and in_reply_to_user_id columns in the twitter_archived_enhanced.csv (df) had their values in exponential notation.

>3. The timestamp column in twitter_archived_enhanced.csv (df) was in object instead of datetime format.

>4. The names of dog breeds in the predictions df are not consistent. Some names have underscores, hyphens or upper case letters.

>5. From the visual assessment, the rating denominator should be 10 but there are other values, e.g. 0, 15, 70, 7, 11, 150, 170, 20, 50, 90, 80, 40, 130, 110, 16, 120, 2.

>6. Some columns in the df dataframe are not necessary, such as name, expanded urls, text and source, among others. The name column holds the names of the dogs and not their breed.

>7. The column names in predictions_df are not descriptive, i.e. img_num, p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog. It is difficult to understand what the columns represent.

>8. Some names listed as dog breeds are not correct. Examples are wooden spoon, pencil box, envelope, entertainment centre, lifeboat, cockroach, crate and water_bottle, among others.
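Several of these issues can also be confirmed programmatically. The sketch below uses a small made-up sample (all values are hypothetical) to show the kind of checks involved: counting missing values, inspecting a data type and listing denominators other than 10.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the archive dataframe.
df = pd.DataFrame({
    "in_reply_to_status_id": [np.nan, np.nan, 6.7e17],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 3,
    "rating_denominator": [10, 70, 10],
})

# Missing values per column.
missing_counts = df.isna().sum()

# A text dtype here (object or str) means the timestamps are not datetimes yet.
timestamp_dtype = df["timestamp"].dtype

# Denominators that differ from the expected value of 10.
odd_denominators = df.loc[df["rating_denominator"] != 10, "rating_denominator"].unique()
```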

<a id='TIDY_ISSUES'></a>
### 3.2.0 Tidy Issues
>1. In the third dataframe (df_3) the dog breeds p1, p2 and p3 should be in one column that captures the breed with the highest confidence level. This column could be called type of breed.

>2. The tweet id is called tweet_id in dataframes one and two, while in the third dataframe it is referred to as id. The dataframes should be merged into one so that there's only one tweet id.

>3. The dog stages in the first dataframe (df) are captured in separate columns. These should be in one column called dog_stages holding the stage of the dog, i.e. doggo, floofer, pupper, or puppo. More information is found in the [#WeRateDogs book](https://www.amazon.com/WeRateDogs-Most-Hilarious-Adorable-Youve/dp/1510717145) on Amazon.

<a id='DATA_CLEANING'></a>
## 4.0 Data Cleaning
In this section, I cleaned **all** of the issues documented earlier during assessment. 

**Note:** Before cleaning, I made copies of the original dataframes. Cleaning includes merging the individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). 

<a id='COPY_DATA'></a>
### 4.1.0 Copy Data

> Before performing any cleaning, I made copies of the dataframes. I named them df_1_copy, df_2_copy and df_3_copy.

<a id='CQUALITY_ISSUES'></a>
### 4.2.0 Data Cleaning: Quality Issues

### <font color='blue'>Issue #1: 
><font color='blue'>**The in_reply_to_status_id and in_reply_to_user_id columns in the twitter_archived_enhanced.csv have lots of missing data.**

**output**
> Drop these columns using the drop function, then preview the column names with <code>df_1_copy.columns</code> to confirm that the columns are dropped.
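As a rough sketch of this step, assuming a tiny made-up stand-in for df_1_copy (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df_1_copy.
df_1_copy = pd.DataFrame({
    "tweet_id": [1, 2],
    "in_reply_to_status_id": [np.nan, 6.7e17],
    "in_reply_to_user_id": [np.nan, 4.2e9],
})

# Drop the two sparse reply columns.
df_1_copy = df_1_copy.drop(columns=["in_reply_to_status_id", "in_reply_to_user_id"])

# Preview the remaining column names to confirm the drop.
remaining = list(df_1_copy.columns)
```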

### <font color='blue'>Issue #2:
><font color='blue'>**a. The in_reply_to_status_id and in_reply_to_user_id columns in the twitter_archived_enhanced.csv (df) had their values in exponential notation.**

#### Note
> These columns have already been dropped. Otherwise, to remove the exponential notation in Python, I would have used a lambda function to suppress the scientific notation. I'll make a copy of the original dataframe and perform this step.
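One possible way to do this with a lambda, sketched on hypothetical float IDs, is to format each value as a whole number. The result is a string column, which is an assumption about what is acceptable here.

```python
import pandas as pd

# Hypothetical IDs stored as floats, which pandas displays in exponential form.
ids = pd.Series([6.7e17, 8.9e17], name="in_reply_to_status_id")

# Format each float as a plain whole-number string (no exponent).
plain_ids = ids.apply(lambda x: f"{x:.0f}")
```

Note that IDs this large can lose precision once they are stored as floats, so reading them as strings from the start is the safer route.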

><font color='blue'>**b. Rename the column names ('doggo', 'floofer', 'pupper', 'puppo') in twitter_archived_enhanced.csv.**
    
> After renaming the columns, preview them using df_1_copy.columns to confirm that they have been renamed.

### <font color='blue'>Issue #3:
><font color='blue'>**The timestamp column in twitter_archived_enhanced.csv (df) was in object instead of datetime format.**

**Define**
>Change the timestamp column from object to datetime datatype using the to_datetime method.
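A minimal sketch of the conversion, assuming the timestamps look like the hypothetical strings below:

```python
import pandas as pd

# Hypothetical sample; the timestamp format is assumed.
df_1_copy = pd.DataFrame({"timestamp": ["2017-08-01 16:23:56 +0000",
                                        "2015-11-15 22:32:08 +0000"]})

# Convert the text column to a datetime column.
df_1_copy["timestamp"] = pd.to_datetime(df_1_copy["timestamp"])

# Datetime columns expose parts such as the year through the .dt accessor.
years = df_1_copy["timestamp"].dt.year.tolist()
```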

### <font color='blue'>Issue #4:
><font color='blue'>**The names of dog breeds in the predictions df are not consistent. Some names have underscores, hyphens or upper case letters.**

**Define**
>Convert the values in columns p1, p2 and p3 to lower case using str.lower(), and replace the hyphens and underscores using .replace(). 

**Note:**
>Considering that Python is case sensitive, I'll lower-case all the dog breeds in columns p1, p2 and p3 and thereafter replace the underscores and hyphens with spaces.
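The cleaning could look roughly like this. The sample labels are hypothetical, and using one regular expression for both characters is just one of several ways to do the replacement.

```python
import pandas as pd

# Hypothetical sample of raw prediction labels.
predictions = pd.DataFrame({
    "p1": ["Golden_retriever", "black-and-tan_coonhound"],
    "p2": ["Labrador_retriever", "Shih-Tzu"],
    "p3": ["kuvasz", "German_shepherd"],
})

for col in ["p1", "p2", "p3"]:
    predictions[col] = (
        predictions[col]
        .str.lower()                            # consistent lower case
        .str.replace(r"[_-]", " ", regex=True)  # underscores and hyphens to spaces
    )
```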

**Observation:**
> From the above output, it is evident that the dog breeds are in lower case and that the underscores and hyphens have been removed.

### <font color='blue'>Issue #5:
><font color='blue'>**From the visual assessment, the rating denominator should be 10 but there are other values, e.g. 0, 15, 70, 7, 11, 150, 170, 20, 50, 90, 80, 40, 130, 110, 16, 120, 2.**

**Define**
>Drop the rating denominator column using .drop(). This column isn't of much use during the analysis.

### <font color='blue'>Issue #6:
><font color='blue'>**Some columns in the first dataframe (df_1_copy) are not necessary, such as name, expanded urls, text and source, among others. The name column holds the names of the dogs and not their breed.**

**Define**
>Use the drop function to remove columns that won't be used during analysis.

### <font color='blue'>Issue #7:
><font color='blue'> **The column names in predictions_df are not descriptive, i.e. img_num, p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog. It is difficult to understand what the columns represent.**   

**Define**
>Rename the columns p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog to prediction_1, prediction_1_confidence, prediction_1_status, prediction_2, prediction_2_confidence, prediction_2_status, prediction_3, prediction_3_confidence, prediction_3_status. 
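A sketch of the rename on a hypothetical one-row sample. Building the mapping in a loop is a convenience I chose here; writing the dictionary out by hand works just as well.

```python
import pandas as pd

# Hypothetical one-row sample with the original column names.
predictions = pd.DataFrame(
    [["golden retriever", 0.95, True, "labrador retriever", 0.03, True, "kuvasz", 0.01, True]],
    columns=["p1", "p1_conf", "p1_dog", "p2", "p2_conf", "p2_dog", "p3", "p3_conf", "p3_dog"],
)

# Build the old -> new name mapping for the three predictions.
new_names = {}
for i in (1, 2, 3):
    new_names[f"p{i}"] = f"prediction_{i}"
    new_names[f"p{i}_conf"] = f"prediction_{i}_confidence"
    new_names[f"p{i}_dog"] = f"prediction_{i}_status"

predictions = predictions.rename(columns=new_names)
```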

<a id='CTIDY_ISSUES'></a>
### 4.3.0 Data Cleaning: Tidiness Issues

### <font color='blue'>Issue #8:
><font color='blue'>**Some names listed as dog breeds are not correct. Examples are wooden spoon, pencil box, envelope, entertainment centre, lifeboat, cockroach, crate and water_bottle, among others.**

**Define**
>* Of the three prediction confidence levels, I'll keep only the first, because the first confidence (prediction confidence), with a level of 95%, is the most reliable of the three. Before dropping anything, I'll create a new column (max_confidence) that captures the highest confidence level in every entry. I'll then drop prediction_2, prediction_2_confidence, prediction_2_status, prediction_3, prediction_3_confidence, prediction_3_status and max_confidence.

>* Bearing in mind that the project focuses on dogs, the next step is to drop the rows where prediction_1_status is False, as they are not dogs. 

**Note:**
>Check whether all the values in the max_confidence column come from prediction_1_confidence, as this is the prediction with 95% confidence, compared with the other two confidence levels (p2 and p3).
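One way to run this check, sketched on made-up confidence values (the numbers are hypothetical): take the row-wise maximum of the three confidence columns and compare it with prediction_1_confidence.

```python
import pandas as pd

# Hypothetical confidences for three tweets.
predictions = pd.DataFrame({
    "prediction_1_confidence": [0.95, 0.60, 0.40],
    "prediction_2_confidence": [0.03, 0.30, 0.35],
    "prediction_3_confidence": [0.01, 0.05, 0.20],
})
conf_cols = list(predictions.columns)

# Row-wise maximum across the three confidence columns.
predictions["max_confidence"] = predictions[conf_cols].max(axis=1)

# True only if the maximum always comes from the first prediction.
all_from_first = (predictions["max_confidence"] == predictions["prediction_1_confidence"]).all()
```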

#### Option 1: Check if all the values in the max_confidence column belong to prediction_1_confidence.


**Observation**
> From the above output, it's evident that all the values in the max_confidence column are from prediction_1_confidence. To prove this further, I'll check the shape of these true values.

#### Option 2: Check if all the values in the max_confidence column belong to prediction_1_confidence.


#### Option 3: Check if all the values in the max_confidence column belong to prediction_1_confidence.

**Observation**
> It was evident that the highest predictions were captured in prediction_1. Therefore, I dropped the details for prediction_2 and prediction_3, namely prediction_2, prediction_2_confidence, prediction_2_status, prediction_3, prediction_3_confidence and prediction_3_status, together with max_confidence.

Next, I'll check for False values in prediction_1_status and drop those rows too, because we only need to retain the dogs.

> Looking at the prediction confidence levels, some entries had a high confidence (e.g. 0.949591) even though the predicted label was web site and the prediction status was False. Dropping these entries is therefore justified, because the predictions are not dogs.
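The filter could be sketched as follows. The sample rows are hypothetical, with the second row mirroring the web site case described above.

```python
import pandas as pd

# Hypothetical sample; the second row is a confident non-dog prediction.
predictions = pd.DataFrame({
    "prediction_1": ["golden retriever", "web site", "pug"],
    "prediction_1_confidence": [0.90, 0.949591, 0.75],
    "prediction_1_status": [True, False, True],
})

# Keep only the rows whose top prediction is a dog.
dogs_only = predictions[predictions["prediction_1_status"]].reset_index(drop=True)
```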

### <font color='blue'>Issue #9:
><font color='blue'>**In the second dataframe (prediction_df) the breeds of dog p1, p2 and p3 should be in one column that captures the breed with the highest confidence level. This column should capture the dog breed.**

This issue was already resolved while cleaning Issue #8.

**Notes**
>Considering that the prediction status column now has only True values, I dropped this column.

### <font color='blue'>Issue #10:
><font color='blue'>**The dog stages in the first dataframe (df) are captured in separate columns. This should be in one column called dog_stages with the details of the stages of the dog i.e. doggo, floofer, pupper, and puppo.**

**Note**

> I used melt() to put the dog stages in one column, then checked the counts of the unique values in the four columns to ensure that the right values were obtained after melting. To perform this step, I created a function for value counts.

>Then, to condense the four columns doggo, floofer, pupper and puppo into one column, I began by replacing None with 'missing', then used `melt()` to put the four dog stages in one column called dog_stages. Without replacing None, the empty entries would be treated as dog stage values.



> After replacing None with the word missing, I used the value count function to return the number of unique values in the floofer, doggo, pupper and puppo columns. Having no other missing values, I saved the output to a new dataframe.
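The replace-then-melt steps can be sketched as below. The sample rows and the stage_column helper name are hypothetical, and the final step that keeps one row per tweet is my own assumption about how the melted rows are reduced; a tweet with two stages would keep only one of them under this approach.

```python
import pandas as pd

# Hypothetical sample: each stage column holds the stage name or the string "None".
df_1_copy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Replace the "None" placeholder so it is not mistaken for a stage.
df_1_copy[stage_cols] = df_1_copy[stage_cols].replace("None", "missing")

# Melt the four columns into one long dog_stages column.
melted = df_1_copy.melt(id_vars="tweet_id", value_vars=stage_cols,
                        var_name="stage_column", value_name="dog_stages")

# Keep one row per tweet: sorting puts real stages before "missing".
melted["is_missing"] = melted["dog_stages"] == "missing"
stages = (melted.sort_values(["tweet_id", "is_missing"])
                .drop_duplicates("tweet_id")[["tweet_id", "dog_stages"]]
                .reset_index(drop=True))
```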

### <font color = 'Blue'>Issue #11:
><font color = 'Blue'>**The tweet id are in dataframe one and two while in the third dataframe, it is referred to as id. This should be merged to one dataframe so that there's only one tweet id.**

**Define**
> Here, I renamed the id column in the third dataframe to tweet_id, then merged the three dataframes.
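A sketch of the rename and merge on miniature, made-up dataframes. The inner join is an assumption on my part; it keeps only the tweets present in all three dataframes.

```python
import pandas as pd

# Hypothetical miniature versions of the three dataframes.
df_1_copy = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 11]})
df_2_copy = pd.DataFrame({"tweet_id": [1, 2], "prediction_1": ["pug", "kuvasz"]})
df_3_copy = pd.DataFrame({"id": [1, 2, 3], "favorite_count": [100, 250, 40]})

# Align the key name, then merge on it.
df_3_copy = df_3_copy.rename(columns={"id": "tweet_id"})
cleaned_data = (df_1_copy
                .merge(df_2_copy, on="tweet_id", how="inner")
                .merge(df_3_copy, on="tweet_id", how="inner"))
```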

#### Additional Cleaning of Quality Issues
> I dropped the retweet column in the third dataset and all rows that had retweet data captured in the first dataset. 

> With only original tweets remaining, I dropped the retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp columns, as their values are all NaN.
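The retweet cleanup in the first dataset might look like this on a hypothetical sample, where a non-missing retweeted_status_id marks a retweet:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second row is a retweet.
df_1_copy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 8.8e17, np.nan],
    "retweeted_status_user_id": [np.nan, 4.2e9, np.nan],
    "retweeted_status_timestamp": [None, "2017-07-01 00:00:00 +0000", None],
})

# Original tweets have no retweeted_status_id.
df_1_copy = df_1_copy[df_1_copy["retweeted_status_id"].isna()]

# Confirm the retweet columns are now entirely empty, then drop them.
retweet_cols = ["retweeted_status_id", "retweeted_status_user_id",
                "retweeted_status_timestamp"]
all_empty = df_1_copy[retweet_cols].isna().all().all()
df_1_copy = df_1_copy.drop(columns=retweet_cols)
```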

### Merging the three DataFrames.
> I then merged the three dataframes into cleaned_data.

Finally, I saved the gathered, assessed and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

<a id='DATA_ANALYSIS_VISUALIZATION'></a>
## 5.0 Data Analysis and Visualization
In this section, I analyze and visualize the wrangled data to produce at least **three (3) insights and one (1) visualization.**

### Insights:
1. Find out if there's any relation between the img_num and prediction confidence.

2. Find out if there's any relation between dog_stages_type, favorite_count, and prediction_confidence.

3. Find out the dog stages with the highest number of prediction per confidence level.

4. Check which day, month or year had the highest tweets.

### Visualization

### 1. Find out if there's any relation between the img_num and prediction confidence.

![image_2.png](attachment:image_2.png)

![image_1.jpg](attachment:image_1.jpg)

**Observation**
> From the above two charts, it's evident that entries with the highest image number (img_num) had the highest prediction confidence level. The prediction confidence decreased as the image number decreased: entries with an img_num of four achieved a prediction confidence of 87%, while entries with an img_num of one had 59%.

> It's also noted that there's no correlation between image number and prediction confidence level, although the bar chart indicated that entries with a high image number had high prediction confidence. This is a reminder that correlation is not causation, and more analysis is needed before drawing conclusions.
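The two views compared here, group means and a row-level correlation, can be computed as in the sketch below. The sample values are made up and do not reproduce the figures reported above.

```python
import pandas as pd

# Hypothetical sample; the real analysis uses the cleaned master dataset.
df = pd.DataFrame({
    "img_num": [1, 1, 1, 2, 2, 4],
    "prediction_confidence": [0.4, 0.6, 0.5, 0.7, 0.9, 0.8],
})

# Mean confidence per image number (what a bar chart would summarize).
mean_conf = df.groupby("img_num")["prediction_confidence"].mean()

# Row-level Pearson correlation between the two columns.
pearson_r = df["img_num"].corr(df["prediction_confidence"])
```

Group means hide the spread within each group, which is one reason a bar chart and a correlation coefficient can appear to tell different stories.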

### 2. Find out if there's any relation between tweet_id, rating_numerator, img_num, favorite_count, and prediction_1_confidence.

![image_4.png](attachment:image_4.png)

**Observation**
> The heatmap above indicates that there is a positive correlation between favorite count and tweet id, with a value of 0.62. Prediction confidence had the weakest positive correlation (0.015), followed by rating numerator and image number with a weak positive correlation of 0.043, then favorite count and prediction confidence level with a weak correlation of 0.058.

### 3. Find out the dog stages with the highest number of prediction per confidence level.

![image_3.png](attachment:image_3.png)

![image_5.png](attachment:image_5.png)

**Observation**
> The puppo dog stage had the highest prediction confidence, 97%, with an average of 3 images. The scatter plot also indicates there's no correlation between these variables.

### 4. Which Dog Stage had the highest Tweet per day?

![image_6.png](attachment:image_6.png)

**Observation**
> Pupper had the highest number of tweets per day followed by Doggo, Puppo then Floofer.

### 5. Which Dog Stage had the Highest Favorite Count?

![image_7.png](attachment:image_7.png)

**Observation**
> The puppo dog stage had the highest favorite count, with a value of 24000.
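A comparison like this can be produced with a groupby, sketched here on made-up values. Whether the chart shows the average or the total favorite count per stage is an assumption; the sketch uses the average.

```python
import pandas as pd

# Hypothetical sample of the master dataset.
df = pd.DataFrame({
    "dog_stages": ["pupper", "pupper", "puppo", "doggo", "floofer"],
    "favorite_count": [500, 700, 2000, 1500, 900],
})

# Average favorite count per dog stage, highest first.
mean_favorites = (df.groupby("dog_stages")["favorite_count"]
                    .mean()
                    .sort_values(ascending=False))
top_stage = mean_favorites.index[0]
```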

### 6. Find out which dog prediction had the highest prediction percentage.


![image_8.png](attachment:image_8.png)

**Observation**
> The Golden Retriever had the highest prediction percentage as shown in the pie chart above.
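The percentages behind a pie chart like this can be computed with value_counts. The sketch uses a handful of hypothetical predictions rather than the real data.

```python
import pandas as pd

# Hypothetical sample of top predictions.
breeds = pd.Series(
    ["golden retriever", "golden retriever", "pug", "labrador retriever"],
    name="prediction_1",
)

# Share of each breed as a percentage of all predictions, largest first.
breed_share = breeds.value_counts(normalize=True) * 100
top_breed = breed_share.index[0]
```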

<a id='CONCLUSION'></a>
## 6.0 Conclusion

> It was noted that Pupper had the highest number of tweets per day but the lowest favorite count (8000), while Puppo had the highest favorite count but the second lowest number of tweets per day. This indicates that although Pupper had the highest number of tweets, it seemed not to be the favorite dog stage, and there is a need to investigate why this is so. The investigation could consider other variables and build machine learning models that could help uncover further insights.