# "WeRateDogs" Twitter Data Wrangling Steps
This document briefly describes my wrangling efforts: Gathering, Assessing, and Cleaning.

<h2 style="color:orange">1. Gather</h2>

I used three data sources to wrangle and analyze the "WeRateDogs" Twitter data.<br/>
1. Twitter Archive Enriched data<br/>I read the `twitter-archive-enriched.csv` file into a Pandas dataframe, `df_tweet`. The raw data has 2356 observations and 17 columns.
2. Image Prediction data<br/>I read the `image_predictions.tsv` file into a Pandas dataframe, `df_img`. It has 2075 observations and 12 columns.
3. Twitter Additional data<br/>I called the Twitter API to fetch the properties of each tweet_id found in `df_tweet` and stored the responses in `tweet_json.txt`. I then read `tweet_json.txt` line by line to load tweet_id, favorite_num, and retweet_num into a Pandas dataframe, `df_addition`. It has 2331 observations and 3 columns.
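
The line-by-line read in step 3 could look roughly like the sketch below. It assumes each line of `tweet_json.txt` is one JSON object and that the fields carry the Twitter API names `id`, `favorite_count`, and `retweet_count`; the helper `load_tweet_json` and the sample values are made up for illustration.

```python
import io
import json

import pandas as pd


def load_tweet_json(lines):
    """Build a dataframe from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "id": tweet["id"],
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return pd.DataFrame(rows, columns=["id", "favorite_count", "retweet_count"])


# Tiny in-memory stand-in for tweet_json.txt (made-up values).
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1}\n'
)
df_addition = load_tweet_json(sample)
print(df_addition)

# With the real files on disk, the three sources would be read roughly as
# (assuming the .tsv file is tab-separated):
# df_tweet = pd.read_csv("twitter-archive-enriched.csv")
# df_img = pd.read_csv("image_predictions.tsv", sep="\t")
# with open("tweet_json.txt", encoding="utf-8") as f:
#     df_addition = load_tweet_json(f)
```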

<h2 style="color:orange">2. Assess</h2>
<p>In this step, I assessed data quality (values) and tidiness (structure). Here is a summary of my findings for the three datasets.</p>
<h3 style="color:brown">2.1 Summary of Quality Issues</h3>
<p>Twitter Archive "Enriched" Data (df_tweet) :</p>
<ol>
<li>The dataset still contains retweets and replies: 77 observations are replies and 181 are retweets.</li>
<li><b>timestamp</b> column has the <u>wrong data type</u> (string). It should be datetime.</li>
<li><b>source</b> column is <u>dirty</u> because its values contain HTML tags.</li>
<li><b>rating_denominator</b> column has <u>invalid data</u>: some values are greater than or less than 10. The causes are: (1) the score was extracted incorrectly from the text, (2) some tweets rate more than one dog, (3) the tweet did not mention a score.</li>
<li><b>rating_numerator</b> column has <u>invalid data</u> for the same reasons: (1) the score was extracted incorrectly from the text, (2) some tweets rate more than one dog, (3) the tweet did not mention a score.</li>
<li><b>name</b> column has <u>invalid data</u>: it either (1) contains the string 'None' or (2) was extracted incorrectly from the text.</li>
<li><b>doggo</b> column holds either the string 'None' or its own column name. It should be stored as a boolean column.</li>
<li><b>floofer</b> column holds either the string 'None' or its own column name. It should be stored as a boolean column.</li>
<li><b>pupper</b> column holds either the string 'None' or its own column name. It should be stored as a boolean column.</li>
<li><b>puppo</b> column holds either the string 'None' or its own column name. It should be stored as a boolean column.</li>
</ol>
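
Checks like the ones behind these findings can be sketched as follows. The column names `in_reply_to_status_id` and `retweeted_status_id` are assumptions about how the archive marks replies and retweets, and the rows are made-up stand-ins for `df_tweet`.

```python
import pandas as pd

# Made-up rows that mimic the relevant archive columns.
df_tweet = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [None, 111.0, None, None],  # assumed column name
    "retweeted_status_id": [None, None, 222.0, None],    # assumed column name
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "rating_denominator": [10, 10, 10, 50],
    "doggo": ["None", "doggo", "None", "None"],
})

# A non-null reply or retweet id marks a row that is not an original tweet.
n_replies = df_tweet["in_reply_to_status_id"].notna().sum()
n_retweets = df_tweet["retweeted_status_id"].notna().sum()

# Denominators other than 10 point at extraction problems.
n_bad_denominator = (df_tweet["rating_denominator"] != 10).sum()

# The timestamp is still stored as text, not as a datetime.
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(df_tweet["timestamp"])

# Stage columns hold only the string 'None' or the column name itself.
doggo_values = set(df_tweet["doggo"].unique())

print(n_replies, n_retweets, n_bad_denominator, timestamp_is_datetime, doggo_values)
```
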
<p>Twitter API Data (df_addition) :</p>
<ol>
<li><b>favorite_count</b> has the wrong data type (float). It should be integer.</li>
<li><b>retweet_count</b> has the wrong data type (float). It should be integer.</li>
<li><b>id</b> (tweet_id) has the wrong data type (float). It should be integer.</li>
</ol>

<h3 style="color:brown">2.2 Summary of Tidiness Issues</h3>
<p>Image Predictions (df_img) :</p>
<ol>
<li>Prediction results are <u>messily stored</u>. p1, p2, and p3 hold the same variable, as do p1_conf, p2_conf, and p3_conf, and p1_dog, p2_dog, and p3_dog. <br/>We should create a separate dataset (observational unit = prediction result) by turning those columns into observations and keeping only 4 columns: tweet_id, prediction_result, prediction_conf, prediction_dog.</li>
<li>Each type of observational unit should form its own table, so the jpg_url and img_num columns from this dataset should be merged into df_tweet.</li>
</ol>
<p>Twitter API Data (df_addition) :</p>
<ol>
<li>It has the same observational unit as df_tweet, so it should be merged with df_tweet to get tidy data.</li>
</ol>

<h2 style="color:orange">3. Clean</h2>

In this step, I defined cleaning rules to fix the quality and tidiness issues listed in the Assess section. I then wrote code to clean the data and tested it. This step produces 2 final cleaned datasets.
<ol>
<li><b>df_tweet_clean</b> contains data for 1971 original tweets (no retweets, no replies), where 1 row = 1 tweet. <br/>I saved it as `twitter_archive_master.csv`.</li>
<li><b>df_img_clean</b> contains the top 3 image prediction results for the 1971 original tweets, where 1 row = 1 image prediction. <br/>I saved it as `image_prediction_master.csv`.</li>
</ol>

<h3 style="color:brown">3.1 Twitter Archive Dataset (df_tweet_clean)</h3>

Here are the cleaning rules I defined to produce the cleaned Twitter Archive Dataset:
1. Keep only original ratings (no retweets, no replies) that have images.
2. Convert df_tweet_clean.**timestamp** from string to datetime.
3. Replace df_tweet_clean.**source** values by applying a function that extracts the source URL.
4. Correct rating_numerator and rating_denominator by re-extracting them from the text and standardizing the denominator to a scale of 10.
5. Correct the dog name by re-extracting it from the text and replacing the 'None' string with None.
6. Clean the **doggo, floofer, pupper, and puppo** columns by replacing the 'None' string with None and converting the columns to boolean.
7. Copy df_addition to df_addition_clean and convert all of its columns to integer.
8. Left-merge df_addition_clean into df_tweet_clean.
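
Rules 2, 3, 4, 6, 7, and 8 could be sketched as below. This is one possible reading of the rules, not the exact code from the notebook: the regular expressions, the rescaling of the numerator to a denominator of 10, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Made-up stand-ins for df_tweet and df_addition.
df_tweet = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "source": ['<a href="http://twitter.com/download/iphone" '
               'rel="nofollow">Twitter for iPhone</a>'] * 3,
    "text": ["This is Phineas. 13/10", "Meet Tilly. 9.75/10", "No rating here"],
    "doggo": ["None", "doggo", "None"],
})
df_addition = pd.DataFrame({
    "id": [1.0, 2.0],
    "favorite_count": [10.0, 5.0],
    "retweet_count": [2.0, 1.0],
})

df_tweet_clean = df_tweet.copy()

# Rule 2: string -> datetime.
df_tweet_clean["timestamp"] = pd.to_datetime(df_tweet_clean["timestamp"])

# Rule 3: keep only the URL inside the href attribute of the HTML tag.
df_tweet_clean["source"] = df_tweet_clean["source"].str.extract(
    r'href="([^"]+)"', expand=False
)

# Rule 4: re-extract "numerator/denominator" (decimals allowed in the
# numerator), then rescale the numerator so the denominator is always 10.
rating = df_tweet_clean["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
rating = rating.apply(pd.to_numeric)
df_tweet_clean["rating_numerator"] = rating[0] * 10 / rating[1]
df_tweet_clean["rating_denominator"] = rating[1].where(rating[1].isna(), 10)

# Rule 6: a stage column is True only when it holds its own column name.
df_tweet_clean["doggo"] = df_tweet_clean["doggo"] == "doggo"

# Rule 7: copy the API data and cast every column to integer.
df_addition_clean = df_addition.copy().astype("int64")

# Rule 8: left merge keeps every tweet, even those without API data.
df_tweet_clean = df_tweet_clean.merge(
    df_addition_clean, how="left", left_on="tweet_id", right_on="id"
).drop(columns="id")

print(df_tweet_clean)
```

Tweets with no match in `df_addition_clean` get missing counts after the left merge, which turns those two columns back into floats unless the gaps are filled or a nullable integer type is used.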

<h3 style="color:brown">3.2 Image Prediction Dataset (df_img_clean)</h3>
<p>Here are the cleaning rules I defined to produce the cleaned Image Prediction Dataset, with 1 prediction per row.</p>
<ul>
<li>Concatenate the 1st prediction results from df_img with the 2nd prediction results, then concatenate the 3rd prediction results.</li>
<li>Filter to keep only original tweets.</li>
<li>Copy the dataset.</li>
</ul>
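
The concatenate-and-filter steps could be sketched like this. The sample predictions are made up, and `original_ids` is a hypothetical stand-in for the tweet ids kept in `df_tweet_clean`.

```python
import pandas as pd

# Made-up stand-in for df_img with the wide p1/p2/p3 layout.
df_img = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["golden_retriever", "pug"],
    "p1_conf": [0.90, 0.60],
    "p1_dog": [True, True],
    "p2": ["labrador", "bull_mastiff"],
    "p2_conf": [0.05, 0.20],
    "p2_dog": [True, True],
    "p3": ["kuvasz", "paper_towel"],
    "p3_conf": [0.01, 0.10],
    "p3_dog": [True, False],
})

new_columns = ["tweet_id", "prediction_result", "prediction_conf", "prediction_dog"]

# Take each prediction's three columns, give them common names, and stack them.
parts = []
for rank in (1, 2, 3):
    part = df_img[["tweet_id", f"p{rank}", f"p{rank}_conf", f"p{rank}_dog"]].copy()
    part.columns = new_columns
    parts.append(part)
df_img_long = pd.concat(parts, ignore_index=True)

# Keep only predictions that belong to original tweets, then copy.
original_ids = [1]  # hypothetical: would come from df_tweet_clean["tweet_id"]
df_img_clean = df_img_long[df_img_long["tweet_id"].isin(original_ids)].copy()

print(df_img_clean)
```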