# APIs, Big Data, and Databases

**Why learn to work with APIs?**

Data is everywhere. As a data analyst, knowing how to go and get it is a useful skill to have in your toolbelt. Why? Because it lets you augment your internal data with external data and, through techniques like correlation analysis, better understand what may be driving the trends you see.

Say, for example, you are a cafe owner. Predictability is your best friend. You have an idea of what drives demand for your products: weather, public holidays, local events and marketing promotions all play a role. But it would be great to have more certainty about when you're going to sell more or less of each product.

One potentially significant determinant of cafe product demand is weather. Below is a tutorial that shows how to access weather data for a specified longitude and latitude (I have chosen Melbourne) via the OpenWeather API: https://openweathermap.org/api. I show how to read from the API and write the data into a few different data stores, as described below.

### Databases we cover

When we access data via an API, it typically returns a JSON object. We then need to make a decision: do we store the data as JSON, or do we transform it into a structured (tabular) form?

#### Storing in JSON form (semi-structured)

- **MongoDB via the MongoDB API:** 

    A NoSQL database that lets us store data as JSON-like documents. This flexibility is pretty powerful, as it means we don't have to design a rigid SQL schema for our data to fill. For developers, this means more time building applications and cool features, and less time worrying about the back end.

#### Storing in Tabular form (structured)

- **Google Sheets via the Google Sheets API:** 

    This API lets us read data from and write data to Google Sheets. The code here looks less Pythonic because the client library uses camel case, the Java/JavaScript/TypeScript convention.

- **CSV:** 

    Writing data to a CSV file.

## Jumping in

**Writing a function to retrieve data from the Open Weather API**

In [22]:
from api_utils.apikey import API_KEY # API key kept in a separate module, out of the notebook
import requests # HTTP client used to call the API
import json # Parses the JSON response body

longitude = 144.9578 # Define the longitude of Melbourne
latitude = -37.8082 # Define the latitude of Melbourne
apiKey = API_KEY # Import hidden API key for best-practice security
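Keeping the key in a separate module is one way to hide it. Another common option is an environment variable. The sketch below is only an illustration: `load_secret`, the variable name `OPENWEATHER_API_KEY` and the `demo-key` placeholder are all assumptions, not part of this notebook.

```python
import os


def load_secret(name, default=None):
    """Read a secret from an environment variable (hypothetical helper)."""
    value = os.environ.get(name, default)
    if value is None:
        # Fail loudly rather than sending requests with a missing key
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# 'OPENWEATHER_API_KEY' is an assumed name; 'demo-key' is a placeholder fallback
api_key = load_secret("OPENWEATHER_API_KEY", default="demo-key")
```

Either way, the point is the same: the secret never appears in the code you share.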

In [20]:
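def retrieve_data():
    """Retrieve the daily weather forecast for Melbourne from the OpenWeather One Call API.

    Returns:
        list[dict]: One dictionary per forecast day, taken from the 'daily' key of the response.
    """
    url = f"https://api.openweathermap.org/data/2.5/onecall?lat={latitude}&lon={longitude}&exclude=hourly,minutely&units=metric&appid={apiKey}"
    data = requests.get(url=url)
    data.raise_for_status() # Fail early on an HTTP error (e.g. an invalid API key)
    info = json.loads(data.text)

    to_store = info['daily']

    return to_store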

In [21]:
data_dict = retrieve_data()
data_dict[0]

{'dt': 1658800800,
 'sunrise': 1658784350,
 'sunset': 1658820435,
 'moonrise': 1658778180,
 'moonset': 1658811300,
 'moon_phase': 0.92,
 'temp': {'day': 10.42,
  'min': 7.99,
  'max': 11.49,
  'night': 8.65,
  'eve': 9.63,
  'morn': 7.99},
 'feels_like': {'day': 9.45, 'night': 6.24, 'eve': 6.96, 'morn': 5.45},
 'pressure': 1015,
 'humidity': 74,
 'dew_point': 5.89,
 'wind_speed': 6.55,
 'wind_deg': 255,
 'wind_gust': 11.88,
 'weather': [{'id': 500,
   'main': 'Rain',
   'description': 'light rain',
   'icon': '10d'}],
 'clouds': 100,
 'pop': 0.68,
 'rain': 0.66,
 'uvi': 2.03}
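To get a feel for the structure of one daily record, a small sketch like the following pulls out a few fields. The `sample` dictionary is trimmed from the output above; `summarise_day` is a hypothetical helper, and treating `rain` as optional is an assumption.

```python
# Trimmed copy of the first record shown above
sample = {
    "dt": 1658800800,
    "temp": {"day": 10.42, "min": 7.99, "max": 11.49},
    "weather": [{"id": 500, "main": "Rain", "description": "light rain", "icon": "10d"}],
    "pop": 0.68,
    "rain": 0.66,
}


def summarise_day(record):
    """Return a small flat summary of one daily forecast record."""
    conditions = record.get("weather") or [{}]
    return {
        "dt": record["dt"],
        "temp_max": record["temp"]["max"],
        "temp_min": record["temp"]["min"],
        "condition": conditions[0].get("main"),
        # Assumption: 'rain' may be missing on some days, so default to 0.0
        "rain": record.get("rain", 0.0),
    }


print(summarise_day(sample))
```

Note that `temp` is a nested dictionary and `weather` is a list of dictionaries; both matter later when we flatten the data into a table.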

**Cleaning the datetime values**

You may have noticed the datetime values look weird. This is because they are Unix timestamps: the number of seconds since 1 January 1970 (UTC). We therefore need to write a small function that converts these values to AEST (UTC+10) and replaces them in each record.

In [26]:
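from datetime import datetime, timedelta, timezone

In [27]:
def convert_datetime(data_dict):
    """Convert the Unix timestamp fields of each record to AEST strings, in place.

    Args:
        data_dict (list[dict]): Daily records as returned by retrieve_data().
    """
    aest = timezone(timedelta(hours=10)) # Fixed UTC+10 offset; does not adjust for daylight saving
    dt_to_update = ['dt', 'sunrise', 'sunset', 'moonrise', 'moonset']
    for record in data_dict:
        for field in dt_to_update:
            record[field] = datetime.fromtimestamp(record[field], tz=aest).strftime('%Y-%m-%d %H:%M:%S')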
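As a quick check of the conversion, here is a minimal standalone sketch applied to the `dt` and `sunrise` values from the first record. `unix_to_aest` is a hypothetical helper, and it assumes a fixed UTC+10 offset (no daylight saving).

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+10 offset for AEST; daylight saving is not handled
AEST = timezone(timedelta(hours=10), name="AEST")


def unix_to_aest(ts):
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to an AEST string."""
    return datetime.fromtimestamp(ts, tz=AEST).strftime("%Y-%m-%d %H:%M:%S")


print(unix_to_aest(1658800800))  # 2022-07-26 12:00:00
print(unix_to_aest(1658784350))  # 2022-07-26 07:25:50
```

These match the `dt` and `sunrise` values that appear in the converted record later in this notebook.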

## Storing in JSON format

### JSON form with the MongoDB API

In [23]:
import pymongo
from api_utils.mongo_db_password import MONGO_DB_PASSWORD

In [24]:
# Writes the list of JSON-style records to a MongoDB Atlas cluster
def write_mongo_local(data_dict):
    """Insert the daily weather records into the 'melb_hourly' collection of the 'weather' database.

    Args:
        data_dict (list[dict]): Records to insert. pymongo adds an '_id' key to each dictionary in place.

    Returns:
        list: The ObjectIds of the inserted documents.
    """
    myclient = pymongo.MongoClient(f"mongodb+srv://woz:{MONGO_DB_PASSWORD}@clusterinitial.mdrttii.mongodb.net/?retryWrites=true&w=majority")
    mydb = myclient["weather"]
    mycol = mydb["melb_hourly"]
    x = mycol.insert_many(data_dict)

    return x.inserted_ids

In [25]:
write_mongo_local(data_dict) # Write test data to MongoDB Atlas cluster

[ObjectId('62dfd1c2d5667b8de376e0b8'),
 ObjectId('62dfd1c2d5667b8de376e0b9'),
 ObjectId('62dfd1c2d5667b8de376e0ba'),
 ObjectId('62dfd1c2d5667b8de376e0bb'),
 ObjectId('62dfd1c2d5667b8de376e0bc'),
 ObjectId('62dfd1c2d5667b8de376e0bd'),
 ObjectId('62dfd1c2d5667b8de376e0be'),
 ObjectId('62dfd1c2d5667b8de376e0bf')]

## Storing in Tabular format

Tabular form, structured form, panel form: in other words, like an Excel spreadsheet.

Pandas has a neat function for converting JSON-style data into such a form: **pd.json_normalize**, so named because it normalises (flattens) nested data into columns.
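To see what flattening means, here is a rough pure-Python sketch of the idea. It is not how pandas implements it; `flatten` is a hypothetical helper that joins nested keys with a dot, which is the column naming style `pd.json_normalize` produces (e.g. `temp.day`).

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along
            flat.update(flatten(value, new_key, sep))
        else:
            # Lists and scalar values are kept as they are
            flat[new_key] = value
    return flat


nested = {"pressure": 1015, "temp": {"day": 10.42, "min": 7.99}}
print(flatten(nested))
# {'pressure': 1015, 'temp.day': 10.42, 'temp.min': 7.99}
```

Lists are left untouched here, which is also why the `weather` field (a list of dictionaries) ends up as a single column rather than several.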

**Refreshing our memory as to what JSON format looks like**

In [9]:
data_dict[0] # How the data currently looks. This was fine for MongoDB (a NoSQL database), but it won't work for the highly structured Google Sheets and CSV

{'dt': '2022-07-26 12:00:00',
 'sunrise': '2022-07-26 07:25:50',
 'sunset': '2022-07-26 17:27:15',
 'moonrise': '2022-07-26 05:43:00',
 'moonset': '2022-07-26 14:55:00',
 'moon_phase': 0.92,
 'temp': {'day': 10.42,
  'min': 7.99,
  'max': 11.46,
  'night': 8.78,
  'eve': 9.88,
  'morn': 7.99},
 'feels_like': {'day': 9.45, 'night': 6.4, 'eve': 7.28, 'morn': 5.45},
 'pressure': 1015,
 'humidity': 74,
 'dew_point': 5.89,
 'wind_speed': 6.55,
 'wind_deg': 255,
 'wind_gust': 11.88,
 'weather': [{'id': 500,
   'main': 'Rain',
   'description': 'light rain',
   'icon': '10d'}],
 'clouds': 100,
 'pop': 0.68,
 'rain': 0.66,
 'uvi': 2.03,
 '_id': ObjectId('62dfb65ed5667b8de376e0af')}

**Converting our data from JSON to structured form**

In [10]:
import pandas as pd # Dataframe library
pd.set_option('display.max_columns', None) # Show every column when printing a dataframe

df_weather = pd.json_normalize(data_dict) # Normalise our data_dict with pd.json_normalize and store the dataframe in the variable df_weather

df_weather.head(3) # Print the top 3 rows of our newly normalised dataset

Unnamed: 0,dt,sunrise,sunset,moonrise,moonset,moon_phase,pressure,humidity,dew_point,wind_speed,wind_deg,wind_gust,weather,clouds,pop,rain,uvi,_id,temp.day,temp.min,temp.max,temp.night,temp.eve,temp.morn,feels_like.day,feels_like.night,feels_like.eve,feels_like.morn
0,2022-07-26 12:00:00,2022-07-26 07:25:50,2022-07-26 17:27:15,2022-07-26 05:43:00,2022-07-26 14:55:00,0.92,1015,74,5.89,6.55,255,11.88,"[{'id': 500, 'main': 'Rain', 'description': 'l...",100,0.68,0.66,2.03,62dfb65ed5667b8de376e0af,10.42,7.99,11.46,8.78,9.88,7.99,9.45,6.4,7.28,5.45
1,2022-07-27 12:00:00,2022-07-27 07:25:03,2022-07-27 17:28:03,2022-07-27 06:34:00,2022-07-27 15:48:00,0.95,1021,84,9.89,4.42,259,9.78,"[{'id': 500, 'main': 'Rain', 'description': 'l...",91,0.53,1.64,1.97,62dfb65ed5667b8de376e0b0,12.66,8.34,13.38,9.84,11.18,9.77,12.17,8.43,10.52,8.52
2,2022-07-28 12:00:00,2022-07-28 07:24:15,2022-07-28 17:28:51,2022-07-28 07:19:00,2022-07-28 16:45:00,0.98,1025,53,2.78,4.37,239,8.6,"[{'id': 500, 'main': 'Rain', 'description': 'l...",100,0.54,1.0,2.15,62dfb65ed5667b8de376e0b1,12.38,8.87,12.75,8.96,9.73,9.62,11.06,6.87,8.61,7.97


In [11]:
# Drop the 'weather' column, as I am not demoing how to unbundle a list of dictionaries here (observe the weather column above).
# Also drop the '_id' column that was added when we inserted the records into MongoDB.
df_weather.drop(
    columns=['weather', '_id'],
    inplace=True
    )

# Fill NaNs with 0, as Google Sheets doesn't like us trying to give it NaN values
df_weather = df_weather.fillna(0)

# Print the top 3 rows to see how we've changed the dataframe
df_weather.head(3)

Unnamed: 0,dt,sunrise,sunset,moonrise,moonset,moon_phase,pressure,humidity,dew_point,wind_speed,wind_deg,wind_gust,clouds,pop,rain,uvi,temp.day,temp.min,temp.max,temp.night,temp.eve,temp.morn,feels_like.day,feels_like.night,feels_like.eve,feels_like.morn
0,2022-07-26 12:00:00,2022-07-26 07:25:50,2022-07-26 17:27:15,2022-07-26 05:43:00,2022-07-26 14:55:00,0.92,1015,74,5.89,6.55,255,11.88,100,0.68,0.66,2.03,10.42,7.99,11.46,8.78,9.88,7.99,9.45,6.4,7.28,5.45
1,2022-07-27 12:00:00,2022-07-27 07:25:03,2022-07-27 17:28:03,2022-07-27 06:34:00,2022-07-27 15:48:00,0.95,1021,84,9.89,4.42,259,9.78,91,0.53,1.64,1.97,12.66,8.34,13.38,9.84,11.18,9.77,12.17,8.43,10.52,8.52
2,2022-07-28 12:00:00,2022-07-28 07:24:15,2022-07-28 17:28:51,2022-07-28 07:19:00,2022-07-28 16:45:00,0.98,1025,53,2.78,4.37,239,8.6,100,0.54,1.0,2.15,12.38,8.87,12.75,8.96,9.73,9.62,11.06,6.87,8.61,7.97
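For completeness, the `weather` column that was dropped holds a list of dictionaries. One possible way to keep that information is to lift the first entry into top-level keys before normalising. This is only a sketch: `unbundle_weather` is a hypothetical helper, and it assumes only the first entry of the list matters.

```python
def unbundle_weather(record):
    """Return a copy of record with the first 'weather' entry lifted to top-level keys."""
    flat = {key: value for key, value in record.items() if key != "weather"}
    # Fall back to an empty entry if 'weather' is missing or empty
    entries = record.get("weather") or [{}]
    for key, value in entries[0].items():
        flat[f"weather.{key}"] = value
    return flat


sample = {
    "clouds": 100,
    "weather": [{"id": 500, "main": "Rain", "description": "light rain", "icon": "10d"}],
}
print(unbundle_weather(sample))
# {'clouds': 100, 'weather.id': 500, 'weather.main': 'Rain', 'weather.description': 'light rain', 'weather.icon': '10d'}
```

The original record is not modified, so the raw JSON form stays available.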


Now we have a dataframe that we are able to store in Google Sheets, a CSV, or a SQL database.

There are stacks of different ways to do what I just did, so if you find another way and it works, go for it (unless computation is an issue, in which case be critical).

This quickly becomes the domain of Data Engineering, so if you want to work with datasets of 5-6 million rows or more, I recommend looking into Spark or PySpark.

### Google Sheets API

To work with the Google Sheets API, we must install a few packages first. More details can be found at the link below.

Google Sheets Developer Documentation: https://developers.google.com/sheets/api/quickstart/python

In [12]:
from api_utils.gsheets_spreadsheet import GSHEETS_ID
from googleapiclient.discovery import build
from google.oauth2 import service_account

**Writing a function to write to a Google Sheets spreadsheet**

Now that we have the requisite packages installed, we need to build our connection.

Below is a function that, when called, accesses my specified Google Sheets spreadsheet (given by the SAMPLE_SPREADSHEET_ID variable, which I have imported from another file so nobody can stitch me up) and appends my data beneath the last row.

In [13]:
def appendValues(data):
    """Append rows of data beneath the last row of 'Sheet2' in the target spreadsheet.

    Args:
        data (list[list]): A 2-dimensional array, one inner list per row.

    Returns:
        dict: The API response describing the append.
    """
    SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
    SERVICE_ACCOUNT_FILE = 'keys.json' # Service account key file
    creds = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES)
    # The ID of the target spreadsheet.
    SAMPLE_SPREADSHEET_ID = GSHEETS_ID
    service = build('sheets', 'v4', credentials=creds)
    # Call the Sheets API
    sheet = service.spreadsheets()
    result = sheet.values().append(
        spreadsheetId=SAMPLE_SPREADSHEET_ID,
        range="Sheet2!A1", valueInputOption="USER_ENTERED",
        insertDataOption="INSERT_ROWS", body={"values": data}
        ).execute()

    return result

**Writing a function to turn each row into a list and append each list into another list**

The Google Sheets API documentation is clear that we must pass a 2-dimensional array to our connection. This means we can't simply try to move the whole dataframe at once.

Instead, we must turn each row of data into a list, and then append each list to another list. The result is a list of lists: in other words, a 2-dimensional array.

https://stackoverflow.com/questions/54610707/invalid-values-error-when-attempting-to-use-append-in-google-sheets-api
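To make the shape concrete, here is a small pandas-free sketch that turns a list of flat records into a list of lists. `to_rows` is a hypothetical helper, and the two records are trimmed from the dataframe shown earlier.

```python
# Two rows trimmed from the normalised data above
records = [
    {"dt": "2022-07-26 12:00:00", "pressure": 1015, "humidity": 74},
    {"dt": "2022-07-27 12:00:00", "pressure": 1021, "humidity": 84},
]


def to_rows(records, include_header=True):
    """Convert a list of flat dictionaries into a 2-dimensional list (list of lists)."""
    columns = list(records[0].keys())
    # One inner list per record, with values in a consistent column order
    rows = [[record.get(column, "") for column in columns] for record in records]
    return [columns] + rows if include_header else rows


print(to_rows(records))
# [['dt', 'pressure', 'humidity'], ['2022-07-26 12:00:00', 1015, 74], ['2022-07-27 12:00:00', 1021, 84]]
```

With a dataframe, `df_weather.values.tolist()` produces the same kind of structure in one step.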


In [14]:
list_of_lists = [] # Collects one list per dataframe row
for index, row in df_weather.iterrows():
    to_update = row.to_list()
    list_of_lists.append(to_update)
    # print(to_update)

appendValues(list_of_lists) # Append all rows in a single API call

### CSV

In [15]:
df_weather.to_csv('weather_data.csv')
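If pandas isn't available, the standard library can write the same kind of file. The sketch below is one possible approach: `records_to_csv` is a hypothetical helper, the records are trimmed from the data above, and it writes to an in-memory buffer rather than to `weather_data.csv`.

```python
import csv
import io

# Two rows trimmed from the normalised data above
records = [
    {"dt": "2022-07-26 12:00:00", "pressure": 1015, "humidity": 74},
    {"dt": "2022-07-27 12:00:00", "pressure": 1021, "humidity": 84},
]


def records_to_csv(records):
    """Serialise a list of flat dictionaries to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(records_to_csv(records))
```

On the pandas side, passing `index=False` to `to_csv` leaves out the row index, which otherwise comes back as an `Unnamed: 0` column when the file is read again.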