# Data Wrangling Project - WeRateDogs 

### Gather Data

In this project, I first gathered data from 3 sources:

1. The WeRateDogs Twitter archive. I downloaded **twitter_archive_enhanced.csv** directly and read it as `twitter`.
2. The tweet image predictions. I downloaded **image_predictions.tsv** programmatically using the *Requests* library and read it as `image`.
3. Each tweet's retweet count and favorite count. Using the tweet IDs from **twitter_archive_enhanced.csv**, I queried the Twitter API for each tweet's JSON data with *Tweepy* and stored the JSON data in **tweet_json.txt**. Then I read this file line by line into a pandas DataFrame with tweet ID, retweet count, and favorite count as `tweet`.
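The line-by-line read in step 3 can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: the helper name `read_tweet_json` and the two sample records are made up, and it assumes each line of **tweet_json.txt** holds one tweet's JSON object with the keys `id`, `retweet_count`, and `favorite_count`.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        status = json.loads(line)
        rows.append({
            "tweet_id": status["id"],
            "retweet_count": status["retweet_count"],
            "favorite_count": status["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Two made-up records standing in for the contents of tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
tweet = read_tweet_json(sample)
```

With the real file, the same helper could be called on an open file handle, for example `with open("tweet_json.txt") as f: tweet = read_tweet_json(f)`.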

### Assess Data

The next step was to assess the data.

I first looked through the data visually to identify quality and tidiness issues. Then I used the methods *info()*, *describe()*, *sort_values()*, *unique()*, *value_counts()*, and *duplicated()* to assess the data programmatically. I checked for data completeness, validity, accuracy, consistency and tidiness. The problems I identified are listed below:
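As a rough sketch of how these assessment methods fit together, the snippet below runs them on a small made-up DataFrame. The rows are invented for illustration and are not taken from the real archive.

```python
import pandas as pd

# Made-up rows that mimic a few typical issues (invalid names, outliers, duplicates)
twitter = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "name": ["Charlie", "a", "the", "the"],
    "rating_numerator": [12, 13, 1776, 1776],
    "rating_denominator": [10, 10, 10, 10],
})

twitter.info()                                 # dtypes and non-null counts
summary = twitter.describe()                   # summary statistics for numeric columns
top = twitter.sort_values("rating_numerator", ascending=False).head()
names = twitter["name"].unique()               # distinct values in the name column
name_counts = twitter["name"].value_counts()   # frequency of each name
n_duplicates = twitter.duplicated().sum()      # number of fully duplicated rows
```

Sorting by `rating_numerator` surfaces unusually high ratings, while `value_counts()` on `name` makes entries such as "a" or "the" easy to spot.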

**Quality**

`twitter` table
 - Dog stage columns pupper, puppo, doggo, floofer have missing data.
 - Data type of columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id should be string.
 - Data type of columns timestamp and retweeted_status_timestamp should be date/time.
 - Data type of dog stage column should be category.
 - Some values in name column are not valid dog names, like "a", "the", "an", etc.
 - Some values in rating_denominator column are not equal to 10.
 - Some values in rating_numerator column are very high compared to others.
 
`image` table
 - Some predictions are not accurate (not a dog breed). 
 - Some dog breed names are capitalized but some are not.

`tweet` table
 - None

**Tidiness**

 - Dog stage columns (pupper, puppo, doggo, floofer) in `twitter` can be combined into a single column holding the four stage values.
 - The correct prediction column in `image` should be merged into `twitter` for later analysis.
 - Data in `tweet` can be merged into `twitter`.
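One way to address the first tidiness issue is sketched below. It is only an illustration under stated assumptions: the rows are made up, a missing stage is assumed to be stored as the string "None", and the helper `combine_stages` and the new column name `stage` are hypothetical. Tweets that mention more than one stage are kept as a comma-separated value.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; assumes a missing stage is stored as the string "None"
twitter = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "doggo":   ["None", "doggo", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["pupper", "None", "None", "pupper"],
    "puppo":   ["None", "None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[s] for s in stages if row[s] != "None"]
    return ", ".join(found) if found else None


twitter["stage"] = twitter[stages].apply(combine_stages, axis=1)
twitter = twitter.drop(columns=stages)                  # the four wide columns are now redundant
twitter["stage"] = twitter["stage"].astype("category")  # missing stages become NaN
```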

### Clean Data

After making copies of the 3 datasets, I started to clean the data. I used:

 - *drop()* and *query()* to remove irrelevant columns and rows;
 - *to_datetime()* and *astype()* to change data types;
 - *replace()*, *map()* and a *for* loop to change values;
 - a custom function with *apply()* to get the values I need;
 - *str.title()* and *round()* to adjust the format;
 - *merge()* to combine data from 2 datasets.
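A few of these cleaning steps can be sketched on toy data as follows. The rows are made up, and the column names `timestamp` and `p1` are assumed to match the archive and prediction files; the snippet shows the general pattern rather than the project's actual cleaning code.

```python
import pandas as pd

# Made-up stand-ins for the three gathered tables
twitter = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})
image = pd.DataFrame({"tweet_id": [1, 2], "p1": ["golden_retriever", "Chihuahua"]})
tweet = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [10, 3], "favorite_count": [25, 7]})

# Work on copies so the originals stay available for comparison
twitter_clean = twitter.copy()
image_clean = image.copy()
tweet_clean = tweet.copy()

# Fix the data type of the timestamp column
twitter_clean["timestamp"] = pd.to_datetime(twitter_clean["timestamp"])

# Make breed capitalization consistent
image_clean["p1"] = image_clean["p1"].str.title()

# Bring the retweet and favorite counts into the main table
twitter_clean = twitter_clean.merge(tweet_clean, on="tweet_id", how="left")
```

A left merge keeps every row of the archive even when a tweet has no matching counts, which leaves those counts as missing values rather than dropping the tweet.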

### Store Data 

After cleaning, I stored the clean data in a file and in a database.
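The file format and database engine are not named here, so the following is only one possible sketch: it assumes a CSV file and a SQLite database, and the file and table names are hypothetical. A temporary directory is used so the example does not overwrite anything.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Made-up cleaned data
twitter_clean = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [10, 3]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")  # hypothetical file name
db_path = os.path.join(out_dir, "twitter_archive_master.db")    # hypothetical file name

# Store as a flat file
twitter_clean.to_csv(csv_path, index=False)

# Store in a SQLite database and read it back to confirm the round trip
conn = sqlite3.connect(db_path)
twitter_clean.to_sql("twitter_clean", conn, index=False, if_exists="replace")
restored = pd.read_sql("SELECT * FROM twitter_clean", conn)
conn.close()
```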

### Analyze and Visualize Data

I imported *matplotlib.pyplot* and *seaborn* for data visualization. The 3 questions I explored are listed in **act_report.html**. I used *query()* to generate subsets of the data and *groupby()* to group the data and compute summary statistics. I used scatter plots to see correlations between two quantitative variables and used bar charts to visualize results of a qualitative variable against a quantitative variable.
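The subsetting and grouping described above might look like the sketch below. The data and the `stage` column are made up for illustration, and the questions in **act_report.html** are not reproduced here.

```python
import pandas as pd

# Made-up cleaned data
twitter_clean = pd.DataFrame({
    "stage": ["pupper", "pupper", "doggo", "doggo", "puppo"],
    "retweet_count": [100, 300, 500, 700, 900],
    "favorite_count": [1000, 2000, 3000, 5000, 8000],
})

# Subset of tweets with at least 300 retweets
popular = twitter_clean.query("retweet_count >= 300")

# Mean favorite count per dog stage (qualitative vs. quantitative)
mean_by_stage = twitter_clean.groupby("stage")["favorite_count"].mean()

# Correlation between two quantitative variables
corr = twitter_clean["retweet_count"].corr(twitter_clean["favorite_count"])
```

With matplotlib installed, `mean_by_stage.plot(kind="bar")` would draw the grouped result as a bar chart, and `twitter_clean.plot.scatter(x="retweet_count", y="favorite_count")` would show the relationship behind `corr`.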