## WeRateDogs - Twitter Data

### I. Gather Data

I followed the instructions given by the Udacity team on how to gather data for this data wrangling analysis.

- First, I downloaded the provided CSV file, **twitter-archive-enhanced.csv**.
- Next, I created my Twitter developer account and used the API to create the JSON file **tweet_json.txt**.
- Finally, I downloaded the image predictions file, which is in TSV format.

Once I had all three files, I loaded them into three separate dataframes, described below.

- *archive_df* - built from "twitter-archive-enhanced.csv"; it holds the basic tweet data.

- *tweets_info_df* - holds tweet_id, the number of retweets, the number of favorites, and similar counts for each tweet.

- *image_predictions_df* - holds the predictions made for the image in each tweet.
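
The loading step can be sketched roughly as follows. This is a minimal sketch, not necessarily the code used in the notebook: the helper names are hypothetical, and it assumes that **tweet_json.txt** holds one tweet JSON object per line with the usual `id`, `retweet_count`, `favorite_count`, and `user` fields.

```python
import json

import pandas as pd


def load_archive(path):
    """Read the enhanced archive CSV into a DataFrame."""
    return pd.read_csv(path)


def load_image_predictions(path):
    """Read the tab-separated image predictions file."""
    return pd.read_csv(path, sep="\t")


def load_tweets_info(path):
    """Build one row per tweet from a file with one JSON object per line (assumed layout)."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            user = tweet.get("user", {})
            rows.append({
                "tweet_id": tweet["id"],
                "retweets": tweet["retweet_count"],
                "favorites": tweet["favorite_count"],
                "followers": user.get("followers_count"),
                "friends": user.get("friends_count"),
            })
    columns = ["tweet_id", "retweets", "favorites", "followers", "friends"]
    return pd.DataFrame(rows, columns=columns)
```

With these helpers, `archive_df = load_archive("twitter-archive-enhanced.csv")` and `tweets_info_df = load_tweets_info("tweet_json.txt")` would produce two of the three dataframes; the TSV file is loaded the same way with `load_image_predictions`.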


### II. Assessing the data

Each column of each table in this Twitter dataset is described below.
To see the table that goes hand in hand with these descriptions, I displayed each table in its entirety by displaying the pandas DataFrame that it was gathered into.
This task is the mechanical part of visual assessment in pandas.

#### 1. `Enhanced Twitter Archive`

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which was used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, the archive has been filtered to tweets with ratings only (there are 2356).

`archive_df` columns and their description:
    
- **tweet_id**: the unique identifier of each tweet
- **in_reply_to_status_id**: if the tweet is a reply, the ID of the tweet it replies to
- **in_reply_to_user_id**: if the tweet is a reply, the ID of the user who wrote the tweet it replies to
- **timestamp**: date and time the tweet was created, in Excel-friendly format
- **source**: the application used to post the tweet, given as a web link
- **text**: the text of the tweet
- **retweeted_status_id**: if the tweet is a retweet, the ID of the original tweet
- **retweeted_status_user_id**: if the tweet is a retweet, the ID of the user who posted the original tweet
- **retweeted_status_timestamp**: if the tweet is a retweet, date and time the original tweet was created, in Excel-friendly format
- **expanded_urls**: expanded version of the URL entered by the user and displayed in Twitter. Note that the user-entered URL may itself be a shortened URL, e.g. from bit.ly.
- **rating_numerator**: the numerator of the rating given in the tweet
- **rating_denominator**: the denominator of the rating given in the tweet, i.e. the reference value for the rating
- **name**: the dog's name
- **doggo**, **floofer**, **pupper**, **puppo**: the stage of the dog

The descriptions of the Twitter data columns can be found [here](https://sfm.readthedocs.io/en/1.4.3/data_dictionary.html).

#### `Quality - archive_df`

1. Missing values in the columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and expanded_urls

2. rating_numerator and rating_denominator have some inconsistent values (some as high as 1776 and 170, respectively)

3. Tweet ID 835246439529840640 has a rating_denominator of 0

4. Implausible dog names found - **'infuriating', 'just', 'life', 'light', 'mad', 'my', 'not', 'officially', 'old', 'one', 'quite', 'space', 'such', 'the',  'this', 'unacceptable', 'very'**

5. timestamp and retweeted_status_timestamp should be of type datetime instead of object

6. in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id should be **integers** instead of **float** (they hold IDs similar to tweet_id)

7. Missing values in the doggo, floofer, pupper, and puppo columns are stored as the string **None** instead of **NaN**

8. The text column is displayed truncated to 50 characters; anything in excess is replaced with an ellipsis
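
Several of these issues can also be confirmed programmatically. The following is a minimal sketch on a tiny invented stand-in for `archive_df` (the values are made up for illustration); the lowercase-name check is an assumption about how the implausible names can be spotted, since real dog names are usually capitalized.

```python
import pandas as pd

# Tiny invented stand-in for archive_df; the values are illustrative only.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [None, 10.0, None],
    "rating_numerator": [13, 1776, 960],
    "rating_denominator": [10, 10, 0],
    "name": ["Charlie", "a", "very"],
    "doggo": ["None", "doggo", "None"],
})

# Issue 1: count missing values per column.
missing_per_column = sample.isnull().sum()

# Issues 5 and 6: inspect the stored dtypes (float IDs, object timestamps, ...).
column_types = sample.dtypes

# Issue 3: tweets whose rating denominator is zero.
zero_denominator_ids = sample.loc[
    sample["rating_denominator"] == 0, "tweet_id"
].tolist()

# Issue 4: names that are entirely lowercase are likely ordinary words.
lowercase_names = sorted(sample.loc[sample["name"].str.islower(), "name"].unique())

# Issue 7: missing stages stored as the string "None" rather than NaN.
none_strings = (sample["doggo"] == "None").sum()
```

Running the same checks against the real `archive_df` is how a visual impression from scrolling through the table can be turned into a count.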


  

#### 2. `tweets_info_df`

`tweets_info_df` columns and their description:

- **tweet_id**: the unique identifier of each tweet
- **retweets**: the number of times the tweet was retweeted
- **favorites**: the number of times the tweet was favorited
- **followers**: the number of followers of the account
- **friends**: the number of friends of the account (accounts that it follows)

#### `Quality - tweets_info_df` table

- Information for 14 tweet IDs is missing
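
One way to find which tweet IDs are missing is to compare the IDs in the two tables. This is a minimal sketch with invented IDs; on the real data the two Series would be `archive_df["tweet_id"]` and `tweets_info_df["tweet_id"]`.

```python
import pandas as pd

# Invented IDs standing in for the tweet_id columns of the two tables.
archive_ids = pd.Series([101, 102, 103, 104], name="tweet_id")
info_ids = pd.Series([101, 103], name="tweet_id")

# IDs present in the archive but absent from the API results.
missing_ids = archive_ids[~archive_ids.isin(info_ids)].tolist()
```

The length of `missing_ids` gives the number of tweets for which the API returned no information.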



#### 3. `image_predictions_df`

The WeRateDogs Twitter archive was run through a neural network that can classify breeds of dogs. The result is a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

`image_predictions_df` columns and their description:
    
- **tweet_id**: the last part of the tweet URL after "status/"
- **jpg_url**: the image link or URL
- **img_num**: the image number
- **p1**: the algorithm's #1 prediction for the image in the tweet
- **p1_conf**: how confident the algorithm is in its #1 prediction
- **p1_dog**: whether or not the #1 prediction is a breed of dog
- **p2**: the algorithm's second most likely prediction
- **p2_conf**: how confident the algorithm is in its #2 prediction
- **p2_dog**: whether or not the #2 prediction is a breed of dog
- **p3**: the algorithm's #3 prediction for the image in the tweet
- **p3_conf**: how confident the algorithm is in its #3 prediction
- **p3_dog**: whether or not the #3 prediction is a breed of dog


#### `Quality - image_predictions_df` table

- Only 2075 tweet IDs have images


### III. Cleaning

To clean the three dataframes, I followed the steps below, some before and some after joining the dataframes.


* Convert the datatype of "tweet_id" to string

* Create a single combined dataset by joining all the dataframes on tweet_id

* Combine the dog stage columns into one column instead of multiple columns

This last step introduced 334 duplicated rows: the row count increased from 2068 to 2402.
These duplicated rows most likely occurred because some tweets are tagged with more than one dog stage.
Next, I cleaned the data so that each row has only one dog_status value.
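
The effect described above can be reproduced on a small example. The sketch below uses invented data, assumes an inner join, and assumes the reshaping was done with a melt; it is not necessarily the exact code used. It also shows a row-wise alternative that keeps one row per tweet by joining multiple stages into one label.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Invented stand-ins for archive_df and tweets_info_df.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
info = pd.DataFrame({"tweet_id": [1, 2, 3], "retweets": [5, 7, 9]})

# Treat IDs as labels rather than numbers before joining.
archive["tweet_id"] = archive["tweet_id"].astype(str)
info["tweet_id"] = info["tweet_id"].astype(str)

# Assumption: an inner join, which keeps only tweets present in both tables.
merged = archive.merge(info, on="tweet_id", how="inner")

# Melt-based reshape: a tweet tagged with two stages yields two rows.
melted = merged.melt(
    id_vars=["tweet_id", "retweets"], value_vars=STAGES, value_name="dog_stage"
).drop(columns="variable")
tagged = melted[melted["dog_stage"] != "None"]
untagged = merged.loc[
    ~merged["tweet_id"].isin(tagged["tweet_id"]), ["tweet_id", "retweets"]
].assign(dog_stage="None")
long_form = pd.concat([tagged, untagged], ignore_index=True)


def combine_stages(row):
    """Join every stage present in the row into one label."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return ", ".join(found) if found else "None"


# Row-wise alternative: exactly one row per tweet, no duplicates to remove.
one_row_per_tweet = merged.drop(columns=STAGES).assign(
    dog_stage=merged[STAGES].apply(combine_stages, axis=1)
)
```

Here `long_form` has four rows for three tweets because tweet "1" is tagged as both doggo and pupper, while `one_row_per_tweet` keeps three rows and records that tweet as "doggo, pupper".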

* Remove the ambiguity between dog stages for tweets tagged with more than one stage

* in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id - convert all of these to object (string)

* retweeted_status_timestamp - convert this variable to datetime format

* The text column is displayed truncated to 50 characters, with anything in excess replaced by an ellipsis.
  Increase the display width so that the full text is shown.
    
* Implausible dog names found - 'infuriating', 'just', 'life', 'light', 'mad', 'my',
  'not', 'officially', 'old', 'one', 'quite', 'space', 'such', 'the', 'this', 'unacceptable', 'very'. Replace each with the correct name by looking at the tweet text.
    
* rating_numerator and rating_denominator have some inconsistent values (some as high as 1776 and 170, respectively).
    For one tweet_id, the denominator is 0.
    I am assuming the ratings are given on a purely humorous basis, which explains the wide range of values.
    I am therefore leaving the ratings unchanged.
    
* retweeted_status_timestamp - has null values, so I dropped this variable
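
Several of the steps above can be sketched in pandas as follows. This is a minimal sketch on invented data rather than the exact code used: converting the float IDs through the nullable `Int64` type is my assumption for avoiding a trailing ".0" in the resulting strings, and the lowercase-name filter only flags rows for manual review.

```python
import pandas as pd

# Invented stand-in for the combined dataframe.
df = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "in_reply_to_status_id": [6.7e17, None],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "retweeted_status_timestamp": [None, None],
    "name": ["Bella", "very"],
})

# Float IDs -> nullable integers -> strings, so missing values stay missing.
df["in_reply_to_status_id"] = (
    df["in_reply_to_status_id"].astype("Int64").astype("string")
)

# Parse timestamp strings into timezone-aware datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Show full tweet text instead of truncating it (None needs pandas >= 1.0).
pd.set_option("display.max_colwidth", None)

# Flag tweets whose extracted name is a lowercase word, for manual review.
needs_name_review = df.loc[df["name"].str.islower(), "tweet_id"].tolist()

# Drop the column that carries no usable values.
df = df.drop(columns=["retweeted_status_timestamp"])
```

The same `astype` chain applies to the other three ID columns listed above.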
    






### IV. Store

I stored the final dataframe in a CSV file named **twitter_archive_master.csv**; the final data has 2055 rows and 30 columns.
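
Storing and reloading can be sketched as below. This is a minimal sketch with an invented one-row frame, written to a temporary directory so that it does not overwrite the real file; reading `tweet_id` back with `dtype=str` is my assumption for keeping the IDs as strings after a round trip.

```python
import os
import tempfile

import pandas as pd

# Invented one-row stand-in for the final dataframe.
final_df = pd.DataFrame({
    "tweet_id": ["892420643555336193"],
    "dog_stage": ["pupper"],
})

# Write without the index so it does not become an extra column.
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
final_df.to_csv(out_path, index=False)

# Read the IDs back as strings rather than letting pandas infer integers.
reloaded = pd.read_csv(out_path, dtype={"tweet_id": str})
```

Checking `reloaded.shape` against the expected row and column counts is a quick way to confirm the file was written completely.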