# Gather

For this project, I will gather data from three different sources.

## The first source
The first source is `twitter-archive-enhanced.csv`, which *Udacity* provided for this project.

This file contains the Twitter archive of the [WeRateDogs Twitter account](https://twitter.com/dog_rates?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor), which describes itself as the "only source for professional dog ratings".

I downloaded this file manually and saved it in the project folder under its original name.

![](./Slides/Manual_Download_Proof.png)

##### Storing downloaded data into a pandas DataFrame

In [None]:
import pandas as pd
tweet_archive_df = pd.read_csv("twitter-archive-enhanced.csv")

In [None]:
tweet_archive_df.head()

## The second source

The second source is `image-predictions.tsv`, which is hosted on *Udacity* servers. This file contains the tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) a neural network detected in each tweet's image.

##### Downloading the file programmatically using the *requests* Python module

In [None]:
import requests

url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
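# Derive the local file name from the last path segment of the URL
predictions_file_name = url.rsplit("/", 1)[-1]

# Download the file and fail loudly on an HTTP error status
response = requests.get(url)
response.raise_for_status()

# Write the raw bytes to disk under the original file name
with open(predictions_file_name, 'wb') as saved_file:
    saved_file.write(response.content)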

![](./Slides/Programmatic_Download_Proof.png)
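
The local file name in the download step is taken from the last segment of the URL. As a minimal sketch of the same idea using only the standard library, the helper below (the name `file_name_from_url` is hypothetical, not part of the project) parses the URL first, so that a query string or fragment would not end up in the file name:

```python
import posixpath
from urllib.parse import urlsplit


def file_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    return posixpath.basename(path)


url = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")
predictions_file_name = file_name_from_url(url)
print(predictions_file_name)  # prints: image-predictions.tsv
```

For this particular URL the result is the same as splitting on `/`; the parsing step only matters for URLs that carry extra parameters.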

##### Storing the collected data in a pandas DataFrame

In [None]:
img_pred_df = pd.read_csv(predictions_file_name, sep="\t")
img_pred_df.head()

## The third source

Using the `tweet_id` column of the WeRateDogs Twitter archive (`twitter-archive-enhanced.csv`), I will query the Twitter API for each tweet's JSON data with Python's *Tweepy* module and store each tweet's entire set of JSON data in a file called `tweet-json.txt`. Each tweet's JSON data will be written to its own line.
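
To illustrate the "one JSON object per line" layout described above, here is a small sketch that writes records to an in-memory buffer instead of a real file. The sample records and the helper name `write_json_lines` are made up for illustration; real tweet payloads contain many more fields.

```python
import io
import json

# Hypothetical stand-ins for tweet payloads (real ones have many more fields)
sample_tweets = [
    {"id": 1, "full_text": "Good dog.\n13/10", "retweet_count": 10},
    {"id": 2, "full_text": "Also a good dog.", "retweet_count": 3},
]


def write_json_lines(records, handle):
    """Write each record as a single JSON document followed by a newline."""
    for record in records:
        # json.dumps escapes newlines inside strings, so one record
        # always occupies exactly one line of output
        handle.write(json.dumps(record) + "\n")


buffer = io.StringIO()
write_json_lines(sample_tweets, buffer)
lines = buffer.getvalue().splitlines()
print(len(lines))  # prints: 2
```

Because every line is a complete JSON document, the file can later be parsed line by line without loading everything at once.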

##### Loading secret API keys from environment

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()

CONSUMER_KEY = os.getenv('CONSUMER_KEY')
CONSUMER_SECRET = os.getenv('CONSUMER_SECRET')
ACCESS_TOKEN = os.getenv('ACCESS_TOKEN')
ACCESS_SECRET = os.getenv('ACCESS_SECRET')
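
`os.getenv` returns `None` when a variable is not set, so a missing key would otherwise only surface later as an authentication error. The following is a minimal sketch of an early check; the helper name `missing_credentials` is an assumption for illustration, not part of the project.

```python
import os

REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET",
                 "ACCESS_TOKEN", "ACCESS_SECRET")


def missing_credentials(environ, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Report any credentials that were not found in the environment
missing = missing_credentials(os.environ)
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Passing the mapping in as an argument keeps the check easy to try with a plain dictionary instead of the real environment.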

##### Querying the Twitter API through Tweepy to collect tweet data

In [None]:
import tweepy
import json

# Authentication
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

tweet_ids = pd.read_csv("twitter-archive-enhanced.csv").tweet_id

with open("tweet-json.txt", "w") as txt_file:
    # For every tweet ID in the archive
    for tweet_id in tweet_ids:
        # Try to get the tweet information
        try:
            tweet = api.get_status(tweet_id, tweet_mode="extended")
        except tweepy.TweepError:
            # Skip tweets that could not be retrieved
            continue

        # A Status object is not JSON serializable, so dump its raw
        # JSON payload (tweet._json), one tweet per line
        tweet_json_string = json.dumps(tweet._json)
        txt_file.write(f"{tweet_json_string}\n")

![](./Slides/API_Querying_proof.png)
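
The querying loop skips any tweet that cannot be retrieved, which leaves no record of which IDs were dropped. One possible variation, sketched below under assumptions, keeps a dictionary of failures alongside the results. Here `fetch` is a hypothetical callable standing in for the API call, and `fake_fetch` simulates a tweet that is no longer available; neither is part of Tweepy.

```python
def collect(ids, fetch):
    """Call fetch for every ID and separate successes from failures."""
    collected = []
    failed = {}
    for item_id in ids:
        try:
            collected.append(fetch(item_id))
        except Exception as error:
            # Remember why this ID was skipped instead of dropping it silently
            failed[item_id] = str(error)
    return collected, failed


def fake_fetch(item_id):
    """Hypothetical stand-in for the API call; ID 2 simulates a missing tweet."""
    if item_id == 2:
        raise LookupError("No status found with that ID")
    return {"id": item_id}


collected, failed = collect([1, 2, 3], fake_fetch)
print(len(collected), failed)
```

In the real loop, the `except` clause would catch the library's own error type rather than the broad `Exception` used in this sketch.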

##### Loading the collected tweet data into a pandas DataFrame

In [None]:
# Each line holds one tweet's JSON object, so read the file in JSON Lines mode
with open('tweet-json.txt', 'r') as txt_file:
    tweet_data_df = pd.read_json(txt_file, lines=True)
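
Reading the whole file produces one column for every top-level field in the tweet JSON. If only a few fields are of interest, the file can also be parsed line by line with the standard library, as in the sketch below. The choice of `id`, `retweet_count` and `favorite_count` is an assumption for illustration, the sample lines are made up, and `read_selected_fields` is a hypothetical helper.

```python
import io
import json

# Assumed fields of interest; adjust to whatever the analysis needs
FIELDS = ("id", "retweet_count", "favorite_count")


def read_selected_fields(handle, fields=FIELDS):
    """Parse a JSON Lines stream and keep only the requested fields."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        tweet = json.loads(line)
        rows.append({field: tweet.get(field) for field in fields})
    return rows


# Made-up sample standing in for the contents of the JSON file
sample_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"}\n'
)
rows = read_selected_fields(sample_file)
print(rows[0])  # prints: {'id': 1, 'retweet_count': 10, 'favorite_count': 25}
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` to obtain a narrower table.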

## Wrap-up

In this section, I gathered data from 3 different sources and saved them to 