# Report on wrangle_report.ipynb

![WeRateDogs](./images/weratedogs.jpg)

This report outlines the process followed when wrangling data from the Twitter account [WeRateDogs](https://twitter.com/dog_rates). WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog. These ratings almost always have a denominator of 10.

The data wrangling process was conducted in a Jupyter notebook and involved four steps:
1. Gathering the data

2. Assessing the data

3. Cleaning the data

4. Storing the data

### 1. Gathering the data



The data was obtained using three methods.


<div><img src="./images/download.png" width="50" height="35" align="right"/></div>

- The first method involved downloading the WeRateDogs `twitter archive` data as a csv file and reading it into a Pandas dataframe as shown below.

```Python 
df_tweet_archive_enhanced = pd.read_csv('./data/twitter-archive-enhanced.csv')
```

<div><img src="./images/requests.jpg" width="50" height="50" align="right"></div>

- The second method involved using the `Requests` library to download the `tweet image` data. This was achieved by creating a function that takes the `url` of the file and downloads the file to a previously specified data folder, as illustrated below.

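```Python
import os
import requests

def download_file(file_link):
    # Fetch the file and save it under folder_name (defined earlier in the notebook)
    response = requests.get(file_link)
    with open(os.path.join(folder_name, file_link.split('/')[-1]), mode='wb') as file:
        file.write(response.content)
```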
        
<div><img src="./images/twitterAPI.jpg" width="150" height="150" align="right"/></div>

- The third method involved querying the Twitter API using the `tweepy` library to fetch additional tweet data such as `retweet_count` and `favourite_count`. Tweet ids were used to request the data for each tweet as JSON, and the results were stored in a file named `tweet_json.txt`.
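
As a rough illustration of how such a file could be read back, the sketch below parses JSON lines into plain records. It assumes that `tweet_json.txt` holds one JSON object per line and that each object carries `id`, `retweet_count` and `favorite_count` keys (the spelling the Twitter API uses for the favourite count). The helper name `parse_tweet_lines` and the sample data are hypothetical and not taken from the notebook.

```python
import json

def parse_tweet_lines(lines):
    """Turn an iterable of JSON lines into a list of small dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            'tweet_id': tweet['id'],
            'retweet_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        })
    return records

# Two made-up tweets standing in for the contents of the file
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
records = parse_tweet_lines(sample_lines)
```

With a real file, the same function could be given the open file object instead of `sample_lines`, and the resulting list passed to `pd.DataFrame`.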



### 2. Assessing the data

The second step in the data wrangling process was assessing the data that had been gathered. The data was assessed both `visually` and `programmatically` in order to identify `quality` and `tidiness` issues.


- `visual` assessment was done by sampling the data and displaying it in the Jupyter notebook.


- `programmatic` assessment was done by using several Pandas methods and attributes, such as `.info()` to display information about the data, `.describe()` to summarise the numeric columns, `.isna()` to check for missing values and `.shape` to get the size of the data.
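
The programmatic checks listed above can be sketched as follows. The small dataframe is made up for illustration and only stands in for the gathered tables; its column names are assumptions.

```python
import pandas as pd

# Made-up stand-in for one of the gathered tables
df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'rating_numerator': [12, 13, None],
    'name': ['Bella', None, 'Max'],
})

df.info()                    # column dtypes and non-null counts
summary = df.describe()      # statistics for the numeric columns
missing = df.isna().sum()    # number of missing values per column
n_rows, n_cols = df.shape    # shape is an attribute, not a method
```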

From the assessment, 10 `quality` and 4 `tidiness` issues were identified.

### 3. Cleaning the data

All the `quality` and `tidiness` issues identified in the assessment phase were cleaned in accordance with the rules of `tidy data`.

The first step was to make copies of the data, which was achieved using the Pandas `.copy()` method.

The data was then cleaned using the `Define`, `Code` and `Test` approach, which involved:
- `defining` a cleaning task.
- writing `code` to perform the cleaning task.
- `testing` the data with code to ensure the cleaning task was successful.
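
One pass through the steps above might look like the sketch below. The data and the cleaning task (converting a string column to datetime) are made up for illustration; the issues actually cleaned in the notebook may differ.

```python
import pandas as pd

# Made-up data; the notebook's actual columns and issues may differ
df = pd.DataFrame({
    'tweet_id': [1, 2],
    'timestamp': ['2017-08-01 16:23:56', '2017-07-31 00:18:03'],
})
df_clean = df.copy()  # clean a copy so the original stays untouched

# Define: convert the timestamp column from string to datetime

# Code
df_clean['timestamp'] = pd.to_datetime(df_clean['timestamp'])

# Test
assert pd.api.types.is_datetime64_any_dtype(df_clean['timestamp'])
assert not pd.api.types.is_datetime64_any_dtype(df['timestamp'])
```

Working on a copy means the test step can compare the cleaned column against the untouched original.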

### 4. Storing the data

The final stage in the wrangling process was storing the cleaned data in a single csv file. This was done by using the `.to_csv()` method of the Pandas DataFrame as shown below:

```Python
df_final.to_csv('./data/twitter_archive_master.csv', index=False)
```
