# APIs, Big Data, and Databases

**Purpose**

To demonstrate how to access data via API calls with Python, and to show certain options around transforming and storing data in different databases for later use.

**Why learn to work with APIs?**

Data is everywhere. If you work with data, knowing how to go and get it is a useful skill to have in your toolbelt. Why? Because it lets you augment your internal data with external data, and techniques like correlation analysis can then surface new insights into what might be driving a trend.

**What's the difference between a JSON file and Python Dictionary?**

When you access data via an API, it almost always returns the data as JSON. This is also known as "semi-structured" data, because it's not your typical rows and columns. JSON arrives as text, so we need to convert it into an object that Python can work with, and that object is a Python Dictionary (or a list of them). For our purposes, they are the same.
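As a small sketch of that conversion, the snippet below decodes a JSON string into a dictionary with the standard library's `json` module and encodes it back again. The `raw` string is a made-up sample shaped like the weather data used later, not a real API response.

```python
import json

# A JSON document as it might arrive over the network: just a string.
raw = '{"city": "Melbourne", "temp": {"day": 12.13, "min": 7.98}, "rain": null, "sunny": false}'

record = json.loads(raw)        # JSON text -> Python dict
print(type(raw).__name__)       # str
print(type(record).__name__)    # dict
print(record["temp"]["day"])    # nested values are reached key by key

# JSON null/false become Python None/False
print(record["rain"], record["sunny"])

round_trip = json.dumps(record)  # Python dict -> JSON text again
```

So the two are not literally the same thing (one is text, the other is an in-memory object), but they map onto each other closely enough that we can treat them as interchangeable here.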

You can learn more about working with JSON files using Python here: https://www.youtube.com/watch?v=YgO5ff9sp7A (chuck it on 2x)

**Traditional SQL vs Modern NoSQL**

| SQL | NoSQL | 
| --- | --- | 
| Structured data | Semi-structured data |
| Database | Database |
| Tables | Collections |
| Rows | Documents |
| Columns | Fields |
| Index | Index |
| Join | Embedding and Linking |
| Group By | Aggregation |
| Primary Key | _id Field | 

**Deciding how to store our data**

Once we've retrieved our data as a JSON file and decoded it into a Python Dictionary, we need to decide how we want to store it. We have two options:
1. Semi-structured form: involves storing it in its current JSON/Dictionary format with no transformations, in a NoSQL database (Not Only SQL)
2. Structured form: involves transforming the data from the semi-structured JSON/Dictionary form into structured rows and columns

Traditionally, everybody would transform this data into a structured form and then store it in a SQL database, Google Sheets, a CSV file, or some other structured store. However, with the advent of NoSQL databases like MongoDB, developers have more options for designing backends with less rigidity. As a result, I will demonstrate storing our data in both forms...

### Storing as semi-structured data

**MongoDB Atlas via the MongoDB API:** 

MongoDB Atlas is a hosted NoSQL database that allows us to store data in JSON-like form. The flexibility of such a database is pretty powerful, as it means we don't have to architect a rigid SQL schema for our data to fill. For developers, this means they can spend more time building their applications and cool features, and less time worrying about the back-end.

    Example of semi-structured data

In [119]:
data_dict[0]

{'dt': '2022-07-28 12:00:00',
 'sunrise': '2022-07-28 07:24:15',
 'sunset': '2022-07-28 17:28:51',
 'moonrise': '2022-07-28 07:19:00',
 'moonset': '2022-07-28 16:45:00',
 'moon_phase': 0.98,
 'temp': {'day': 12.13,
  'min': 7.98,
  'max': 12.33,
  'night': 7.98,
  'eve': 9.15,
  'morn': 8.6},
 'feels_like': {'day': 10.86, 'night': 6.2, 'eve': 7.15, 'morn': 6.96},
 'pressure': 1025,
 'humidity': 56,
 'dew_point': 3.4,
 'wind_speed': 4.12,
 'wind_deg': 261,
 'wind_gust': 9.78,
 'weather': [{'id': 500,
   'main': 'Rain',
   'description': 'light rain',
   'icon': '10d'}],
 'clouds': 100,
 'pop': 0.99,
 'rain': 2.46,
 'uvi': 2.04,
 '_id': ObjectId('62e26181d5667b8de376e0d6')}

### Storing as structured data

**Google Sheets via the Google Sheets API:** 

This API allows us to read data from and write data to a Google Sheets spreadsheet from Python.
    
**CSV:** 

Writing data to a CSV file.

    Example of structured data

In [118]:
df_weather.head(2)

Unnamed: 0,dt,sunrise,sunset,moonrise,moonset,moon_phase,pressure,humidity,dew_point,wind_speed,wind_deg,wind_gust,weather,clouds,pop,rain,uvi,_id,temp.day,temp.min,temp.max,temp.night,temp.eve,temp.morn,feels_like.day,feels_like.night,feels_like.eve,feels_like.morn
0,2022-07-28 12:00:00,2022-07-28 07:24:15,2022-07-28 17:28:51,2022-07-28 07:19:00,2022-07-28 16:45:00,0.98,1025,56,3.4,4.12,261,9.78,"[{'id': 500, 'main': 'Rain', 'description': 'l...",100,0.99,2.46,2.04,62e26181d5667b8de376e0d6,12.13,7.98,12.33,7.98,9.15,8.6,10.86,6.2,7.15,6.96
1,2022-07-29 12:00:00,2022-07-29 07:23:25,2022-07-29 17:29:40,2022-07-29 07:58:00,2022-07-29 17:46:00,0.0,1029,59,2.68,2.86,208,5.26,"[{'id': 803, 'main': 'Clouds', 'description': ...",76,0.23,,1.37,62e26181d5667b8de376e0d7,10.4,5.71,11.57,5.84,8.48,5.71,9.04,4.33,7.64,4.29


## Jumping in

**Writing a function to retrieve data from the Open Weather API**

In [19]:
from api_utils.apikey import API_KEY # importing my API Key from another file as per best-practice
import requests # the requests module allows you to send HTTP requests using Python.
import json # JSON is a syntax for storing and exchanging data. This library is used for decoding JSON files

longitude = 144.9578 # define the longitude of Melbourne
latitude = -37.8082 # define the latitude of Melbourne
apiKey = API_KEY # use the hidden API key imported above for best-practice security

In [20]:
def retrieve_data():
        """Makes a GET request to the Open Weather API, decodes the JSON response into a Python dictionary, and accesses the 'daily' key

        Returns:
            list: one dictionary per daily forecast
        """
    
        url = f"https://api.openweathermap.org/data/2.5/onecall?lat={latitude}&lon={longitude}&exclude=hourly,minutely&units=metric&appid={apiKey}"
        data = requests.get(url=url)
        info = json.loads(data.text) # 'json.loads' deserializes a str, bytes or bytearray instance containing a JSON document to a Python object.
        keys = info.keys()
        
        print(f'{data=}\n'
              'A 200 response means the server received our GET request and returned the data successfully.\n'
              'You may be familiar with a 404 response, which means the server could not find anything at the requested URL.')
        print(f'\nThe keys we can access in the JSON file are: {keys}')
        
        to_store = info['daily'] # accesses the 'daily' key, containing daily weather data

        return to_store # returns the list stored under info['daily'] when the function is called 

In [21]:
data_dict = retrieve_data() # call function and store result in 'data_dict'
data_dict[0] # display the first element (record) in the result using indexing

data=<Response [200]>
A 200 response means the server received our GET request and returned the data successfully.
You may be familiar with a 404 response, which means the server could not find anything at the requested URL.

The keys we can access in the JSON file are: dict_keys(['lat', 'lon', 'timezone', 'timezone_offset', 'current', 'daily'])


{'dt': 1659060000,
 'sunrise': 1659043405,
 'sunset': 1659079780,
 'moonrise': 1659045480,
 'moonset': 1659080760,
 'moon_phase': 0,
 'temp': {'day': 11.2,
  'min': 6.21,
  'max': 12.55,
  'night': 6.21,
  'eve': 11.21,
  'morn': 6.8},
 'feels_like': {'day': 9.97, 'night': 4.88, 'eve': 10.08, 'morn': 5.29},
 'pressure': 1029,
 'humidity': 61,
 'dew_point': 3.97,
 'wind_speed': 3.03,
 'wind_deg': 204,
 'wind_gust': 5.91,
 'weather': [{'id': 800,
   'main': 'Clear',
   'description': 'clear sky',
   'icon': '01d'}],
 'clouds': 8,
 'pop': 0.34,
 'uvi': 2.54}
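The request URL above is assembled with an f-string. As an alternative sketch, the query string can be built from a dictionary with the standard library's `urllib.parse.urlencode`, which keeps the parameters readable and URL-encodes the values. The `build_url` helper and the `"YOUR_API_KEY"` placeholder are illustrative names, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openweathermap.org/data/2.5/onecall"  # endpoint used above


def build_url(lat, lon, api_key):
    """Return the request URL with every query parameter URL-encoded."""
    params = {
        "lat": lat,
        "lon": lon,
        "exclude": "hourly,minutely",  # skip the parts of the response we don't need
        "units": "metric",
        "appid": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Placeholder key for illustration only; a real key should stay in a separate file.
url = build_url(-37.8082, 144.9578, "YOUR_API_KEY")
print(url)
```

The `requests` library can also take such a dictionary directly through the `params` argument of `requests.get`, and `raise_for_status()` on the response raises an exception for 4xx/5xx codes instead of letting a failed call pass silently.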

**Cleaning the datetime values**

You may have noticed the datetime values look weird. This is because they are Unix timestamps: the number of seconds since 1 January 1970 (UTC). We thus need to write a small function that converts these values to AEST (UTC+10) and replaces them in place.

In [22]:
from datetime import datetime, timedelta # import datetime library used for working with dates and times in Python

In [23]:
def convert_datetime(data_dict):
        """Takes in a list of Python Dictionaries and, for each key listed in dt_to_update, converts the Unix timestamp to an AEST datetime string in place

        Args:
            data_dict (list): list of Dictionaries, one per daily record
        """
        dt_to_update = ['dt', 'sunrise', 'sunset', 'moonrise', 'moonset']
        for record in data_dict:
                for field in dt_to_update:
                        record[field] = (datetime.utcfromtimestamp(record[field])+timedelta(hours=10)).strftime('%Y-%m-%d %H:%M:%S')

In [24]:
convert_datetime(data_dict) # call function on our data 
data_dict[0] # print first element in the result using indexing

{'dt': '2022-07-29 12:00:00',
 'sunrise': '2022-07-29 07:23:25',
 'sunset': '2022-07-29 17:29:40',
 'moonrise': '2022-07-29 07:58:00',
 'moonset': '2022-07-29 17:46:00',
 'moon_phase': 0,
 'temp': {'day': 11.2,
  'min': 6.21,
  'max': 12.55,
  'night': 6.21,
  'eve': 11.21,
  'morn': 6.8},
 'feels_like': {'day': 9.97, 'night': 4.88, 'eve': 10.08, 'morn': 5.29},
 'pressure': 1029,
 'humidity': 61,
 'dew_point': 3.97,
 'wind_speed': 3.03,
 'wind_deg': 204,
 'wind_gust': 5.91,
 'weather': [{'id': 800,
   'main': 'Clear',
   'description': 'clear sky',
   'icon': '01d'}],
 'clouds': 8,
 'pop': 0.34,
 'uvi': 2.54}
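The conversion above adds ten hours to a UTC datetime by hand. A minimal alternative sketch, assuming a fixed UTC+10 offset is all we need (so no daylight saving), is to attach the offset as a timezone object; this also avoids `datetime.utcfromtimestamp`, which is deprecated from Python 3.12. The `unix_to_aest` helper name is mine, not from the notebook.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+10 offset; AEST does not observe daylight saving.
AEST = timezone(timedelta(hours=10), name="AEST")


def unix_to_aest(ts):
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to an AEST string."""
    return datetime.fromtimestamp(ts, tz=AEST).strftime('%Y-%m-%d %H:%M:%S')


print(unix_to_aest(1659060000))  # 2022-07-29 12:00:00
print(unix_to_aest(1659043405))  # 2022-07-29 07:23:25
```

Both values match the `dt` and `sunrise` fields shown in the converted record above.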

### Storing as semi-structured data

**JSON form with the MongoDB API**

Read more about working with MongoDB here: https://pymongo.readthedocs.io/en/stable/

In [25]:
import pymongo # PyMongo is a Python distribution containing tools for working with MongoDB. These tools allow us to connect to and work with MongoDB Atlas, MongoDB's hosted NoSQL database offering
from api_utils.mongo_db_password import MONGO_DB_PASSWORD # import hidden MongoDB cluster instance password for best-practice security

In [26]:
def write_mongo_local(data_dict):
        """Creates a connection to my MongoDB Atlas cluster, accesses (or creates on first insert) a database called 'weather' and a collection in that database called
        'melb_hourly', and inserts my data into that collection

        Args:
            data_dict (list): list of Python Dictionaries, one document per record

        """

        myclient = pymongo.MongoClient(f"mongodb+srv://woz:{MONGO_DB_PASSWORD}@clusterinitial.mdrttii.mongodb.net/?retryWrites=true&w=majority")
        mydb = myclient["weather"]
        mycol = mydb["melb_hourly"]
        x = mycol.insert_many(data_dict)
        
        return x

In [27]:
write_mongo_local(data_dict) # Write test data to MongoDB Atlas cluster

<pymongo.results.InsertManyResult at 0x14f7fa1ddf0>

### Storing as structured data

Tabular form.. structured form.. panel form.. in other words, like an Excel spreadsheet. 

Pandas has a neat function to convert JSON-style data into such a form. Naturally, we access it with **pd.json_normalize**.. because we're normalising the dataset.
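To make the idea of normalising concrete, here is a rough pure-Python sketch of the flattening step: nested dictionaries are collapsed into a single level, with the parent and child keys joined by a dot (which is where column names like `temp.day` come from). This is an illustration of the concept, not how pandas implements it, and the `flatten` helper is a name I've made up.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the parent key along.
            flat.update(flatten(value, new_key, sep))
        else:
            # Scalars and lists are kept as-is (lists are not unpacked).
            flat[new_key] = value
    return flat


sample = {
    "dt": "2022-07-29 12:00:00",
    "temp": {"day": 11.2, "min": 6.21},
    "weather": [{"id": 800, "main": "Clear"}],
}
row = flatten(sample)
print(row)
```

Each flattened dictionary corresponds to one row, and its keys to the column names. Note that the `weather` list survives untouched, which is why that column still holds a list of dictionaries after normalising.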

**Refreshing our memory as to what JSON format looks like**

In [28]:
data_dict[0] 

{'dt': '2022-07-29 12:00:00',
 'sunrise': '2022-07-29 07:23:25',
 'sunset': '2022-07-29 17:29:40',
 'moonrise': '2022-07-29 07:58:00',
 'moonset': '2022-07-29 17:46:00',
 'moon_phase': 0,
 'temp': {'day': 11.2,
  'min': 6.21,
  'max': 12.55,
  'night': 6.21,
  'eve': 11.21,
  'morn': 6.8},
 'feels_like': {'day': 9.97, 'night': 4.88, 'eve': 10.08, 'morn': 5.29},
 'pressure': 1029,
 'humidity': 61,
 'dew_point': 3.97,
 'wind_speed': 3.03,
 'wind_deg': 204,
 'wind_gust': 5.91,
 'weather': [{'id': 800,
   'main': 'Clear',
   'description': 'clear sky',
   'icon': '01d'}],
 'clouds': 8,
 'pop': 0.34,
 'uvi': 2.54,
 '_id': ObjectId('62e37339fe35ace267ee1e2d')}

**Converting our data from JSON to structured form**

Before we can store our data in a structured form (rows and columns), we need to convert it from a list of Python Dictionaries into a dataframe.

In [30]:
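import pandas as pd # Pandas is a library for working with tabular data (dataframes)
pd.set_option('display.max_columns', None) # show every column when displaying a dataframe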

In [31]:
df_weather = pd.json_normalize(data_dict) # normalise our data_dict with the Pandas function pd.json_normalize and store the dataframe in the variable df_weather

df_weather.head(3) # print the top 3 rows of our newly normalised dataset

Unnamed: 0,dt,sunrise,sunset,moonrise,moonset,moon_phase,pressure,humidity,dew_point,wind_speed,wind_deg,wind_gust,weather,clouds,pop,uvi,_id,temp.day,temp.min,temp.max,temp.night,temp.eve,temp.morn,feels_like.day,feels_like.night,feels_like.eve,feels_like.morn,rain
0,2022-07-29 12:00:00,2022-07-29 07:23:25,2022-07-29 17:29:40,2022-07-29 07:58:00,2022-07-29 17:46:00,0.0,1029,61,3.97,3.03,204,5.91,"[{'id': 800, 'main': 'Clear', 'description': '...",8,0.34,2.54,62e37339fe35ace267ee1e2d,11.2,6.21,12.55,6.21,11.21,6.8,9.97,4.88,10.08,5.29,
1,2022-07-30 12:00:00,2022-07-30 07:22:33,2022-07-30 17:30:29,2022-07-30 08:31:00,2022-07-30 18:47:00,0.04,1024,46,0.26,6.49,354,13.93,"[{'id': 800, 'main': 'Clear', 'description': '...",0,0.0,2.71,62e37339fe35ace267ee1e2e,11.46,3.84,12.62,9.63,10.23,3.9,9.86,6.74,8.85,1.45,
2,2022-07-31 12:00:00,2022-07-31 07:21:40,2022-07-31 17:31:19,2022-07-31 09:00:00,2022-07-31 19:48:00,0.07,1014,51,2.45,8.41,2,16.96,"[{'id': 804, 'main': 'Clouds', 'description': ...",100,0.57,2.34,62e37339fe35ace267ee1e2f,12.47,9.57,13.74,10.34,11.97,9.57,11.1,9.49,10.89,6.44,


In [32]:
# drop the 'weather' column as I am not demo'ing how to unbundle a list of dictionaries here (observe weather column)
# also drop '_id', the identifier MongoDB added to each record, as it isn't needed in the structured copy
df_weather.drop(
    columns=['weather', '_id'], 
    inplace=True
    )

df_weather = df_weather.fillna(0) # fill NaNs with 0, as the Google Sheets API doesn't accept NaN as a cell value

df_weather.head(3) # print top 3 rows to see how we've changed the dataframe

Unnamed: 0,dt,sunrise,sunset,moonrise,moonset,moon_phase,pressure,humidity,dew_point,wind_speed,wind_deg,wind_gust,clouds,pop,uvi,temp.day,temp.min,temp.max,temp.night,temp.eve,temp.morn,feels_like.day,feels_like.night,feels_like.eve,feels_like.morn,rain
0,2022-07-29 12:00:00,2022-07-29 07:23:25,2022-07-29 17:29:40,2022-07-29 07:58:00,2022-07-29 17:46:00,0.0,1029,61,3.97,3.03,204,5.91,8,0.34,2.54,11.2,6.21,12.55,6.21,11.21,6.8,9.97,4.88,10.08,5.29,0.0
1,2022-07-30 12:00:00,2022-07-30 07:22:33,2022-07-30 17:30:29,2022-07-30 08:31:00,2022-07-30 18:47:00,0.04,1024,46,0.26,6.49,354,13.93,0,0.0,2.71,11.46,3.84,12.62,9.63,10.23,3.9,9.86,6.74,8.85,1.45,0.0
2,2022-07-31 12:00:00,2022-07-31 07:21:40,2022-07-31 17:31:19,2022-07-31 09:00:00,2022-07-31 19:48:00,0.07,1014,51,2.45,8.41,2,16.96,100,0.57,2.34,12.47,9.57,13.74,10.34,11.97,9.57,11.1,9.49,10.89,6.44,0.0


Now we have a dataframe that we are able to store in Google Sheets, a CSV, or a SQL database. 

There are stacks of different ways to do what I just did, so if you find another way and it works, go for it (unless computation is an issue, then be critical). 

This quickly becomes the domain of Data Engineering, so if you want to work with datasets of 5-6 million rows or more, I recommend looking into learning Spark or PySpark. 
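The SQL database option mentioned above isn't demonstrated in this notebook. As a rough sketch of what it could look like, the snippet below stores two structured rows in an in-memory SQLite database using Python's built-in `sqlite3` module. The table name, the reduced set of columns, and the `temp_day` spelling (dots in column names are awkward in SQL) are my assumptions for illustration.

```python
import sqlite3

# Two sample rows, using values from the normalised dataframe above.
rows = [
    {"dt": "2022-07-29 12:00:00", "temp_day": 11.2, "humidity": 61},
    {"dt": "2022-07-30 12:00:00", "temp_day": 11.46, "humidity": 46},
]

conn = sqlite3.connect(":memory:")  # in-memory database; pass a file path to persist it
conn.execute(
    "CREATE TABLE weather (dt TEXT PRIMARY KEY, temp_day REAL, humidity INTEGER)"
)
# Named placeholders (:dt, ...) are filled from each dictionary in `rows`.
conn.executemany("INSERT INTO weather VALUES (:dt, :temp_day, :humidity)", rows)
conn.commit()

stored = conn.execute(
    "SELECT dt, temp_day, humidity FROM weather ORDER BY dt"
).fetchall()
print(stored)
conn.close()
```

Unlike the MongoDB example, the schema has to be declared up front, which is the rigidity trade-off described earlier. Pandas also offers `DataFrame.to_sql` for writing a dataframe to a SQL table directly.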

**Google Sheets via the Google Sheets API**

To work with the Google Sheets API, we must install a few packages first. More details can be found at the link below.

Google Sheets Developer Documentation: https://developers.google.com/sheets/api/quickstart/python

In [45]:
from api_utils.gsheets_spreadsheet import GSHEETS_ID
from googleapiclient.discovery import build
from google.oauth2 import service_account

**Writing a function to write to a Google Sheets spreadsheet**

Now that we have the requisite packages installed, we need to build our connection. 

Below is a function that, when called, accesses my specified Google Sheets spreadsheet (given by the SAMPLE_SPREADSHEET_ID variable, whose value I import from another file as GSHEETS_ID so nobody can stitch me up) and appends my data beneath the last row.  

In [46]:
def appendValues(data):
        """ Creates connection to my Google Sheets spreadsheet and appends my data beneath the bottom row 

        Args:
            data (list): 2-dimensional array (list of lists), one inner list per row

        Returns:
            dict: the API response describing the rows that were appended
        """
        SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
        SERVICE_ACCOUNT_FILE = 'keys.json'
        creds = None
        creds = service_account.Credentials.from_service_account_file(
                SERVICE_ACCOUNT_FILE, scopes=SCOPES)
        # The ID and range of a sample spreadsheet.
        SAMPLE_SPREADSHEET_ID = GSHEETS_ID
        service = build('sheets', 'v4', credentials=creds)
        # Call the Sheets API
        sheet = service.spreadsheets()
        result = sheet.values().append(
                spreadsheetId=SAMPLE_SPREADSHEET_ID,
                range ="Sheet2!A1", valueInputOption="USER_ENTERED",
                insertDataOption="INSERT_ROWS", body={"values": data}
                ).execute()

        return result

**Writing a function to turn each row into a list and append each list into another list**

The Google Sheets API documentation is clear that we must pass a 2-dimensional array to our connection. This means we can't simply pass the dataframe itself. 

Instead, we must turn each row of data into a list, and then append each list to another list. The result is a list of lists, in other words a 2-dimensional array.

https://stackoverflow.com/questions/54610707/invalid-values-error-when-attempting-to-use-append-in-google-sheets-api
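As a rough sketch of the list-of-lists shape described above, the snippet below builds one from plain dictionaries rather than from the dataframe. Missing values are replaced with 0, mirroring the `fillna(0)` step earlier. The `to_rows` helper is a hypothetical name, and the sample records reuse a few fields from the weather data shown earlier.

```python
def to_rows(records, columns, fill=0):
    """Build a 2-D list: one inner list per record, ordered by `columns`.

    A key that is missing (or None) in a record is replaced with `fill`.
    """
    rows = []
    for record in records:
        rows.append(
            [fill if record.get(col) is None else record[col] for col in columns]
        )
    return rows


records = [
    {"dt": "2022-07-28 12:00:00", "clouds": 100, "pop": 0.99, "rain": 2.46},
    {"dt": "2022-07-29 12:00:00", "clouds": 8, "pop": 0.34},  # no 'rain' key that day
]
columns = ["dt", "clouds", "pop", "rain"]

values = to_rows(records, columns)
print(values)
```

Each inner list is one spreadsheet row and each position within it is one cell, which is the structure the `values` field of the request body expects.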


In [48]:
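list_of_lists = [] # will hold one inner list per dataframe row
for index, row in df_weather.iterrows():
        to_update = row.to_list() # convert the row into a plain Python list
        list_of_lists.append(to_update)
        # print(to_update)

appendValues(list_of_lists) # append all rows with a single API call rather than one call per row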

### CSV

In [49]:
df_weather.to_csv('weather_data.csv') # write the dataframe to a CSV file in the current working directory
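For comparison, the same kind of output can be produced without pandas. The sketch below uses the standard library's `csv.DictWriter` to write a list of flat dictionaries, and writes to an in-memory buffer so it runs without touching the disk. The sample rows reuse a few columns from the dataframe above.

```python
import csv
import io

rows = [
    {"dt": "2022-07-29 12:00:00", "temp.day": 11.2, "rain": 0.0},
    {"dt": "2022-07-30 12:00:00", "temp.day": 11.46, "rain": 0.0},
]

# In-memory stand-in for open('weather_data.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["dt", "temp.day", "rain"])
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

The dictionary keys become the header row and each dictionary becomes one line, which is the same rows-and-columns structure the dataframe holds.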