# APIs and Multiprocessing

*This notebook includes adapted content from [Melanie Walsh's chapter on Data Collection](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/00-Data-Collection.html).*

In this lab, we'll introduce a useful way to extract data from online sources, and then look at how to run Python code outside of a notebook and across multiple processes. We'll go over the following topics:

- Accessing an API
- API Wrappers
- Python Scripts
- Multiprocessing

# APIs

It seems only natural that we should be able to extract any data from the internet by programmatically "visiting" each website we're interested in and logging the information we find. (In fact, doing so is [generally legal](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/01-User-Ethics-Legal-Concerns.html), though there are ethical and legal caveats.) One way to do this is [web scraping](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/02-Web-Scraping-Part1.html), where you write a program that parses website content, logs data, and loops through several HTML web pages. But this method has become less effective over the years: websites are far more complex (and harder to scrape), and many organizations now make their data more easily accessible (and more tightly controlled) through an Application Programming Interface (API).

## What is an API?

**An API allows you to programmatically extract and interact with the data that drives an organization's website.** In this way, social networks, museums, foundations, research labs, applications, and projects can make their data publicly available, allowing developers to use the data to build applications and tools (e.g., for your phone, computer, or refrigerator) that can be used by the general populace. For example, you can access Google Maps on your phone because developers used the Google Maps API to build that functionality.

Of course, there are plenty of companies or foundations which will likely never use APIs to store/access their data. In these cases though, you can usually find an API that is *related* to that website, or someone may have built (or, they are building) a third-party API for that purpose. Web scraping should typically be a last resort, so we do not teach it in this class.

<span style="color:red">**Caveat:** People typically design their APIs such that they decide exactly which kinds of data they want to share. So, they often choose not to share their most lucrative and desirable data. In those cases, you are usually asked to pay some fee.</span>

## Using Environment Variables

We will discuss environment variables more in a future lesson, but to use APIs properly, we need to have at least a basic understanding of what environment variables are.

When working on any data science project (e.g., the web app you'll build later in this course), you will likely track your progress using Git/GitHub. But keys and secret strings (like the ones we will use to access an API) should never be pushed to GitHub. Instead, it's a best practice to use environment variables when dealing with this kind of data. In short, environment variables are values stored in a special file on your local computer, or on the cloud where your project may be hosted. In this way, those variables are only accessible to agents with access to that file (e.g., your Python interpreter, or the one on the cloud).

In this class, we will use [dotenv](https://github.com/theskumar/python-dotenv#getting-started) to manage environment variables for APIs. You'll need to `pip` install it, as directed in the instructions, then create a file with the name *.env* (notice the period) in your project directory to hold any keys or secrets. Since we're going to use this package in this notebook, we'll import the library here. *Note: If you're using Git/GitHub, make sure ".env" is added to your [.gitignore file](https://www.atlassian.com/git/tutorials/saving-changes/gitignore)*.

In [18]:
from dotenv import load_dotenv

## Accessing an API

The steps to access *any* API are about the same, no matter the API. So, in this lesson, we're going to use the [Genius](https://genius.com/) API to access data about songs.

### Step 1: Client Access Token

Typically, to use an API, you need a special API key usually called a "Client Access Token", which is kind of like a password. Many APIs require authentication keys to gain access to them. To get your necessary Genius API keys, follow these steps:

1. Navigate to the [api-clients page](https://genius.com/api-clients) (which will prompt you to [sign up for an account](https://genius.com/signup_or_login) if you haven't already). Then, click the button that says **"New API Client"**.
2. Remember, APIs are expecting *developers* to use their APIs to build applications (e.g., for your phone, computers, etc.). But, since we're only doing data analysis for a college course in informatics, we only need to fill in the fields for "App Name" (e.g., *"Song Lyrics Project"*), and "App Website URL" (e.g., *"https://github.com/leontoddjohnson/myrepo"*). Then, click **Save**.
3. When you click "Save," you'll be given a series of API Keys: a "Client ID" and a "Client Secret." **Copy/Paste these values into your *.env* file** without quotations, as instructed in the dotenv documentation. For example, my *.env* file looks something like this:
   
```
CLIENT_ID=asdfghjkl;123456789
CLIENT_SECRET=qwertyuiop098765432   
```
    
4. To generate your "Client Access Token," which is the API key that we'll be using in this notebook, you need to click "Generate Access Token". Place that in your *.env* file as you did the other variables, maybe under the variable name ACCESS_TOKEN.

We can access our `ACCESS_TOKEN` by using *dotenv* to load our environment variables into the current environment, and then using the built-in Python *os* library to read it.

In [19]:
load_dotenv()

False

In [20]:
import os

# do not print this variable anywhere if the notebook is going on GitHub
ACCESS_TOKEN = os.environ['ACCESS_TOKEN']

KeyError: 'ACCESS_TOKEN'
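The `False` and `KeyError` outputs above are what these cells typically produce when no *.env* file (or no `ACCESS_TOKEN` entry in it) was found. As a minimal sketch, assuming you would rather see a readable message than a bare `KeyError`, the lookup could be wrapped in a small helper. The name `get_required_env` is hypothetical (it is not part of dotenv), and the example uses a stand-in dictionary rather than your real environment.

```python
import os


def get_required_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise a readable error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line like {name}=<your token> to your "
            ".env file, then call load_dotenv() again."
        )
    return value


# Demonstration with a stand-in dictionary instead of the real environment
fake_environ = {"ACCESS_TOKEN": "not-a-real-token"}
token = get_required_env("ACCESS_TOKEN", environ=fake_environ)
```

In the notebook itself, you would call `get_required_env("ACCESS_TOKEN")` after `load_dotenv()`, so the default `os.environ` is used.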

### Step 2: Making an API Request

Making an API request is very similar to accessing a URL in your browser. But, instead of getting a rendered HTML web page in return, you get some data in return.

There are a few different ways that we can [query the Genius API](https://docs.genius.com/#songs-h2), but here we'll use [the basic search](https://docs.genius.com/#search-h2), which allows you to get a bunch of Genius data about any artist or songs that you search for:

`https://api.genius.com/search?q={search_term}&access_token={client_access_token}`

First we're going to assign the string "Missy Elliott" to the variable `search_term`. Then we're going to make an f-string URL that contains the variables we'd like to include in our query.

In [None]:
search_term = "Missy Elliott"

In [None]:
genius_search_url = f"https://api.genius.com/search?q={search_term}&access_token={ACCESS_TOKEN}"

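You can see the data we'll be requesting from this API by printing the `genius_search_url` and pasting it into your browser. (The printed URL contains your access token, so avoid leaving that output in a notebook you share.)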

In [None]:
# print(genius_search_url)
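Search terms often contain spaces or punctuation, which should be encoded before they go into a URL. The sketch below shows one way to build the same search URL with the standard library's `urlencode`; the helper name `build_search_url` and the placeholder token are hypothetical, and the `https` form of the endpoint is assumed. The `requests` library can do the same encoding if you pass a dictionary through its `params` argument.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.genius.com/search"


def build_search_url(search_term, access_token):
    """Build a search URL with properly encoded query parameters."""
    query = urlencode({"q": search_term, "access_token": access_token})
    return f"{BASE_URL}?{query}"


# The space in the search term is encoded as "+"
example_url = build_search_url("Missy Elliott", "PLACEHOLDER_TOKEN")
print(example_url)
```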

The data you might see when you navigate to your URL is in [JSON](https://www.w3schools.com/whatis/whatis_json.asp) format. JSON is an acronym for JavaScript Object Notation, and it is a data format commonly used by APIs. JSON data can be nested, and contains key-value pairs, much like a Python dictionary.
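To make the comparison with dictionaries concrete, here is a small sketch that parses a JSON string with the built-in `json` module. The string is a heavily trimmed, hand-written stand-in for a Genius response, not real API output.

```python
import json

# A trimmed, hand-written stand-in for an API response
raw_text = (
    '{"meta": {"status": 200}, '
    '"response": {"hits": [{"type": "song", '
    '"result": {"id": 4176, "title": "Work It"}}]}}'
)

# json.loads turns JSON text into nested Python dicts and lists
parsed = json.loads(raw_text)

# Nested values are reached by chaining keys and list indices
first_title = parsed["response"]["hits"][0]["result"]["title"]
print(type(parsed).__name__, first_title)
```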

We can access this JSON directly in Python using the [`requests` library](https://requests.readthedocs.io/en/latest/) to send HTTP requests to a remote server. If you like, you can read more about what a "request" is [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview), but it suffices to say that it represents an online communication between your computer and the server storing the data you want.

In [None]:
import requests
import json

In [None]:
# here, we make a "GET" request to the Genius server
response = requests.get(genius_search_url)
json_data = response.json()

In [None]:
json_data

{'meta': {'status': 200},
 'response': {'hits': [{'highlights': [],
    'index': 'song',
    'type': 'song',
    'result': {'annotation_count': 34,
     'api_path': '/songs/4176',
     'artist_names': 'Missy Elliott',
     'full_title': 'Work It by\xa0Missy\xa0Elliott',
     'header_image_thumbnail_url': 'https://images.genius.com/a54a36ef386a84ff0e689dc6d9cbdd18.300x300x1.jpg',
     'header_image_url': 'https://images.genius.com/a54a36ef386a84ff0e689dc6d9cbdd18.953x953x1.jpg',
     'id': 4176,
     'lyrics_owner_id': 6654,
     'lyrics_state': 'complete',
     'path': '/Missy-elliott-work-it-lyrics',
     'pyongs_count': 42,
     'relationships_index_url': 'https://genius.com/Missy-elliott-work-it-sample',
     'release_date_components': {'year': 2002, 'month': 9, 'day': 1},
     'release_date_for_display': 'September 1, 2002',
     'release_date_with_abbreviated_month_for_display': 'Sep. 1, 2002',
     'song_art_image_thumbnail_url': 'https://images.genius.com/a54a36ef386a84ff0e689dc

In [None]:
json_data.keys()

dict_keys(['meta', 'response'])

In [None]:
json_data['response'].keys()

dict_keys(['hits'])

Genius places all of its search results into the "hits" element. By default, it looks like it returns at most 10 search results for any request.

In [None]:
len(json_data['response']['hits'])

10

According to the documentation (see, e.g., the [referents](https://docs.genius.com/#referents-h2) section), requests accept a `per_page` parameter, which we can use to raise that number to a maximum of 20 results per request. With this slight adjustment added, let's consolidate our request into a single function to use again later.

In [None]:
def genius(search_term, per_page=15):
    '''
    Collect data from the Genius API by searching for `search_term`.
    
    **Assumes ACCESS_TOKEN is loaded in environment.**
    '''
    genius_search_url = f"https://api.genius.com/search?q={search_term}&" + \
                        f"access_token={ACCESS_TOKEN}&per_page={per_page}"
    
    response = requests.get(genius_search_url)
    json_data = response.json()
    
    return json_data['response']['hits']

In [None]:
json_data = genius("The Beatles")
len(json_data)

15
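The `genius` function above indexes straight into `json_data['response']['hits']`, which would raise a `KeyError` if the request did not succeed. A minimal sketch of a more defensive version of that last step follows. The helper name `extract_hits` is hypothetical, and the shape of a failed response (a `meta` block with a non-200 `status` and a `message` field) is an assumption based on the successful response shown earlier.

```python
def extract_hits(json_data):
    """Return the list of hits, or raise if the API reported a problem."""
    meta = json_data.get("meta", {})
    status = meta.get("status")
    if status != 200:
        # "message" is an assumed field name for the error description
        message = meta.get("message", "no message provided")
        raise RuntimeError(f"Genius API returned status {status}: {message}")
    return json_data.get("response", {}).get("hits", [])


# Hand-written stand-ins for a successful and a failed response
ok_payload = {"meta": {"status": 200}, "response": {"hits": [{"type": "song"}]}}
bad_payload = {"meta": {"status": 401, "message": "invalid token"}}

hits = extract_hits(ok_payload)
```

Calling `extract_hits(bad_payload)` raises a `RuntimeError` that includes the status code, which is easier to act on than a `KeyError`.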

## Loading JSON Data Into a DataFrame

For us to work efficiently with the JSON data, we need to load it into a DataFrame. Conveniently, our "JSON" data is now stored in a Python list of dictionaries, which is one of the main formats that `pd.DataFrame` [expects](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas-dataframe).

*Note: JSON data is often saved in files, much like CSVs. In these cases, you can use `pd.read_json` to read in those files.*

In [None]:
import pandas as pd

In [None]:
json_data[0]

{'highlights': [],
 'index': 'song',
 'type': 'song',
 'result': {'annotation_count': 8,
  'api_path': '/songs/2236',
  'artist_names': 'The Beatles',
  'full_title': 'Yesterday by\xa0The\xa0Beatles',
  'header_image_thumbnail_url': 'https://images.genius.com/67d46a92276344c6a8684f9c7d27ef80.300x169x1.jpg',
  'header_image_url': 'https://images.genius.com/67d46a92276344c6a8684f9c7d27ef80.1000x563x1.jpg',
  'id': 2236,
  'lyrics_owner_id': 7,
  'lyrics_state': 'complete',
  'path': '/The-beatles-yesterday-lyrics',
  'pyongs_count': 95,
  'relationships_index_url': 'https://genius.com/The-beatles-yesterday-sample',
  'release_date_components': {'year': 1965, 'month': 9, 'day': 13},
  'release_date_for_display': 'September 13, 1965',
  'release_date_with_abbreviated_month_for_display': 'Sep. 13, 1965',
  'song_art_image_thumbnail_url': 'https://images.genius.com/f9bfd62a8c651caab16f631039a9a0b6.300x300x1.jpg',
  'song_art_image_url': 'https://images.genius.com/f9bfd62a8c651caab16f631039a9

When we look at any of the hits, we see the data we're interested in is contained in the `"result"` element. We can consolidate all of the "result" elements for each "hit" using a list comprehension.

**Looking ahead:** Notice that the `"stats"` and the `"primary_artist"` elements contain *dictionaries* of interesting data that we'll need to unpack once we have our data into a DataFrame.

In [None]:
hits = [hit['result'] for hit in json_data]
df = pd.DataFrame(hits)

In [None]:
df.head()

| | annotation_count | api_path | artist_names | full_title | header_image_thumbnail_url | header_image_url | id | lyrics_owner_id | lyrics_state | path | ... | release_date_for_display | release_date_with_abbreviated_month_for_display | song_art_image_thumbnail_url | song_art_image_url | stats | title | title_with_featured | url | featured_artists | primary_artist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | /songs/2236 | The Beatles | Yesterday by The Beatles | https://images.genius.com/67d46a92276344c6a868... | https://images.genius.com/67d46a92276344c6a868... | 2236 | 7 | complete | /The-beatles-yesterday-lyrics | ... | September 13, 1965 | Sep. 13, 1965 | https://images.genius.com/f9bfd62a8c651caab16f... | https://images.genius.com/f9bfd62a8c651caab16f... | {'unreviewed_annotations': 4, 'concurrents': 1... | Yesterday | Yesterday | https://genius.com/The-beatles-yesterday-lyrics | [] | {'api_path': '/artists/586', 'header_image_url... |
| 1 | 9 | /songs/1575 | The Beatles | Let It Be by The Beatles | https://images.genius.com/92f06c735acd852cb7f6... | https://images.genius.com/92f06c735acd852cb7f6... | 1575 | 7 | complete | /The-beatles-let-it-be-lyrics | ... | May 8, 1970 | May. 8, 1970 | https://images.genius.com/38df3b59f231f4babd59... | https://images.genius.com/38df3b59f231f4babd59... | {'unreviewed_annotations': 1, 'concurrents': 4... | Let It Be | Let It Be | https://genius.com/The-beatles-let-it-be-lyrics | [] | {'api_path': '/artists/586', 'header_image_url... |
| 2 | 23 | /songs/82381 | The Beatles | Hey Jude by The Beatles | https://images.genius.com/d3ed7c6e723c41aa6741... | https://images.genius.com/d3ed7c6e723c41aa6741... | 82381 | 25711 | complete | /The-beatles-hey-jude-lyrics | ... | August 26, 1968 | Aug. 26, 1968 | https://images.genius.com/537342a11e2455300f30... | https://images.genius.com/537342a11e2455300f30... | {'unreviewed_annotations': 3, 'concurrents': 3... | Hey Jude | Hey Jude | https://genius.com/The-beatles-hey-jude-lyrics | [] | {'api_path': '/artists/586', 'header_image_url... |
| 3 | 20 | /songs/56218 | The Beatles | Come Together by The Beatles | https://images.genius.com/5a6f82f01d02914d41eb... | https://images.genius.com/5a6f82f01d02914d41eb... | 56218 | 29141 | complete | /The-beatles-come-together-lyrics | ... | September 26, 1969 | Sep. 26, 1969 | https://images.genius.com/04df901371547072bab6... | https://images.genius.com/04df901371547072bab6... | {'unreviewed_annotations': 7, 'concurrents': 2... | Come Together | Come Together | https://genius.com/The-beatles-come-together-l... | [] | {'api_path': '/artists/586', 'header_image_url... |
| 4 | 9 | /songs/71861 | The Beatles | In My Life by The Beatles | https://images.genius.com/1a5e9183169bb70366f9... | https://images.genius.com/1a5e9183169bb70366f9... | 71861 | 11524 | complete | /The-beatles-in-my-life-lyrics | ... | December 3, 1965 | Dec. 3, 1965 | https://images.genius.com/1a5e9183169bb70366f9... | https://images.genius.com/1a5e9183169bb70366f9... | {'unreviewed_annotations': 4, 'concurrents': 5... | In My Life | In My Life | https://genius.com/The-beatles-in-my-life-lyrics | [] | {'api_path': '/artists/586', 'header_image_url... |


Recall that `"stats"` and `"primary_artist"` contain dictionaries which we want to unpack. After a bit of StackOverflow searching (say), we find that we can [use](https://stackoverflow.com/a/38231651) `.apply(pd.Series)` on each column and then `pd.concat` to expand these into columns. We'll add a prefix to the new column names to avoid duplicates (e.g., both the song and the artist have an `id`).

In [None]:
df_stats = df['stats'].apply(pd.Series)
df_stats.rename(columns={c:'stat_' + c for c in df_stats.columns},
                inplace=True)

df_stats.head()

| | stat_unreviewed_annotations | stat_concurrents | stat_hot | stat_pageviews |
|---|---|---|---|---|
| 0 | 4 | 11.0 | False | 2525012 |
| 1 | 1 | 4.0 | False | 1754275 |
| 2 | 3 | 3.0 | False | 1316536 |
| 3 | 7 | 2.0 | False | 1290747 |
| 4 | 4 | 5.0 | False | 1206816 |


In [None]:
df_primary = df['primary_artist'].apply(pd.Series)
df_primary.rename(columns={c:'primary_artist_' + c for c in df_primary.columns},
                  inplace=True)
df_primary.head()

| | primary_artist_api_path | primary_artist_header_image_url | primary_artist_id | primary_artist_image_url | primary_artist_is_meme_verified | primary_artist_is_verified | primary_artist_name | primary_artist_url |
|---|---|---|---|---|---|---|---|---|
| 0 | /artists/586 | https://images.genius.com/817d7fb288bb1c845614... | 586 | https://images.genius.com/2a7afa0442a3805371b0... | False | False | The Beatles | https://genius.com/artists/The-beatles |
| 1 | /artists/586 | https://images.genius.com/817d7fb288bb1c845614... | 586 | https://images.genius.com/2a7afa0442a3805371b0... | False | False | The Beatles | https://genius.com/artists/The-beatles |
| 2 | /artists/586 | https://images.genius.com/817d7fb288bb1c845614... | 586 | https://images.genius.com/2a7afa0442a3805371b0... | False | False | The Beatles | https://genius.com/artists/The-beatles |
| 3 | /artists/586 | https://images.genius.com/817d7fb288bb1c845614... | 586 | https://images.genius.com/2a7afa0442a3805371b0... | False | False | The Beatles | https://genius.com/artists/The-beatles |
| 4 | /artists/586 | https://images.genius.com/817d7fb288bb1c845614... | 586 | https://images.genius.com/2a7afa0442a3805371b0... | False | False | The Beatles | https://genius.com/artists/The-beatles |


In [None]:
# notice, we maintain the original column here
df = pd.concat((df, df_stats, df_primary), axis=1)
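To see what this expansion does to a single row, here is a plain-Python sketch that performs the same prefix-and-flatten step on one dictionary, without pandas. The helper name `flatten_columns` is hypothetical, and the record is a trimmed stand-in built from values shown above. pandas also ships `pd.json_normalize`, which may be worth trying for the same job.

```python
def flatten_columns(record, nested_keys):
    """Copy `record`, adding prefixed keys for each nested dictionary."""
    flat = dict(record)  # keep the original keys, as the concat above does
    for key, prefix in nested_keys.items():
        for inner_key, inner_value in record.get(key, {}).items():
            flat[prefix + inner_key] = inner_value
    return flat


# A trimmed stand-in for one "result" dictionary
song = {
    "id": 2236,
    "title": "Yesterday",
    "stats": {"hot": False, "pageviews": 2525012},
    "primary_artist": {"id": 586, "name": "The Beatles"},
}

flat_song = flatten_columns(
    song, {"stats": "stat_", "primary_artist": "primary_artist_"}
)
print(sorted(flat_song))
```

The prefixes are what keep the song's `id` and the artist's `id` from colliding once they sit side by side.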

Let's consolidate all of this into one function which creates a DataFrame of Genius data given a search term.

In [None]:
def genius_to_df(search_term, n_results_per_term=10):
    json_data = genius(search_term, per_page=n_results_per_term)
    hits = [hit['result'] for hit in json_data]
    df = pd.DataFrame(hits)

    # expand dictionary elements
    df_stats = df['stats'].apply(pd.Series)
    df_stats.rename(columns={c:'stat_' + c for c in df_stats.columns},
                    inplace=True)
    
    df_primary = df['primary_artist'].apply(pd.Series)
    df_primary.rename(columns={c:'primary_artist_' + c for c in df_primary.columns},
                      inplace=True)
    
    df = pd.concat((df, df_stats, df_primary), axis=1)
    
    return df

### Collecting Multiple API Calls

We are going to want to perform analysis on more than one artist, so let's use what we've written above to collect data from multiple API calls by looping through multiple search terms. When we loop through each search term, we use the [tqdm package](https://pypi.org/project/tqdm/) to help us visualize our progress (you may need to use `pip` to install it). This kind of thing is helpful when we're running multiple API calls, and we don't know how long it will take.

In [None]:
from tqdm import tqdm

In [None]:
search_terms = ['The Beatles', 'Missy Elliott', 'Andy Shauf', 'Slowdive', 'Men I Trust']
n = 10

dfs = []

# loop through search_terms in question
for search_term in tqdm(search_terms):
    df = genius_to_df(search_term, n_results_per_term=n)
    
    # add to list of DataFrames
    dfs.append(df)

100%|████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.33it/s]


In [None]:
df_genius = pd.concat(dfs)

In [None]:
df_genius.shape

(50, 35)

In [None]:
df_genius.columns

Index(['annotation_count', 'api_path', 'artist_names', 'full_title',
       'header_image_thumbnail_url', 'header_image_url', 'id',
       'lyrics_owner_id', 'lyrics_state', 'path', 'pyongs_count',
       'relationships_index_url', 'release_date_components',
       'release_date_for_display',
       'release_date_with_abbreviated_month_for_display',
       'song_art_image_thumbnail_url', 'song_art_image_url', 'stats', 'title',
       'title_with_featured', 'url', 'featured_artists', 'primary_artist',
       'stat_unreviewed_annotations', 'stat_concurrents', 'stat_hot',
       'stat_pageviews', 'primary_artist_api_path',
       'primary_artist_header_image_url', 'primary_artist_id',
       'primary_artist_image_url', 'primary_artist_is_meme_verified',
       'primary_artist_is_verified', 'primary_artist_name',
       'primary_artist_url'],
      dtype='object')

In [None]:
df_genius.sample(3)

| | annotation_count | api_path | artist_names | full_title | header_image_thumbnail_url | header_image_url | id | lyrics_owner_id | lyrics_state | path | ... | stat_hot | stat_pageviews | primary_artist_api_path | primary_artist_header_image_url | primary_artist_id | primary_artist_image_url | primary_artist_is_meme_verified | primary_artist_is_verified | primary_artist_name | primary_artist_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 9 | /songs/71861 | The Beatles | In My Life by The Beatles | https://images.genius.com/1a5e9183169bb70366f9... | https://images.genius.com/1a5e9183169bb70366f9... | 71861 | 11524 | complete | /The-beatles-in-my-life-lyrics | ... | False | 1203075 | /artists/586 | https://images.genius.com/817d7fb288bb1c845614... | 586 | https://images.genius.com/2a7afa0442a3805371b0... | False | False | The Beatles | https://genius.com/artists/The-beatles |
| 2 | 6 | /songs/3311767 | Men I Trust | I Hope to Be Around by Men I Trust | https://images.genius.com/7014d337d1aa8c8c9d9d... | https://images.genius.com/7014d337d1aa8c8c9d9d... | 3311767 | 3612583 | complete | /Men-i-trust-i-hope-to-be-around-lyrics | ... | False | 54640 | /artists/655044 | https://images.genius.com/6b4b452133dd16b46e89... | 655044 | https://images.genius.com/d6e6ba4e7f1d3c182bc7... | False | False | Men I Trust | https://genius.com/artists/Men-i-trust |
| 0 | 20 | /songs/203259 | Slowdive | When the Sun Hits by Slowdive | https://images.genius.com/613951a3d80e01e9c621... | https://images.genius.com/613951a3d80e01e9c621... | 203259 | 208752 | complete | /Slowdive-when-the-sun-hits-lyrics | ... | False | 177529 | /artists/65120 | https://images.genius.com/6132fdffab2b30a0b70e... | 65120 | https://images.genius.com/af1be5ab88d7e05df203... | False | False | Slowdive | https://genius.com/artists/Slowdive |


## Using an API Wrapper

More often than not, someone has built an "API Wrapper" for the API you are working with. An API wrapper makes an API easier to use, and it often extends the API itself. It will typically consist of classes and functions similar to the ones we've built above, but covering a much wider range of the API's functionality. For example, John Miller's [LyricsGenius](https://github.com/johnwmillr/LyricsGenius) gives us access to nearly everything on the Genius website, and it even uses web scraping to collect the song lyrics themselves.

<span style="color: darkblue">**If ever you're working with an API, do some Googling to make sure there isn't a wrapper you can use to make things easier on you!**</span>

# Python Scripts

Suppose we want to extract a good bit of data from the Genius API (or any other API for that matter), but we want to run the script on our computer while we do other work in Jupyter. Or, consider having several Python-based "jobs" we want to run on a scheduled basis, where we don't want the responsibility of opening Jupyter and typing "Shift-Enter" every time. In these cases, we'll want to run executable code in the form of a [Python Script](https://realpython.com/run-python-scripts/).

In short, a Python script is a Python file that runs a particular routine on a [thread](https://www.liquidweb.com/blog/difference-cpu-cores-thread/) of your CPU, line by line until it's over. You can run these Python files in several different ways, including:

- **from the terminal**
- from an interactive Python shell (e.g., a notebook)
- as a periodically *scheduled* "job"

In this class, we'll talk about running Python files as scripts from the terminal. Running scripts from IPython can cause issues, not the least of which is that the script output is relegated to a notebook cell, where it can take up unnecessary space. Periodically scheduled jobs, on the other hand, are best run on virtual instances hosted in a cloud computing environment, which is outside the scope of this class.

## Creating Python Scripts

All we need to do to create a script is to make a .py file, and place our executable code within the file. It's best to keep these files in the "root" directory of your project (e.g., the same directory-level as this notebook), and **use short names with the `snake_case` naming convention.**

For example, notice that the functions we've defined above are all also defined within the *genius_api.py* file in this directory. In that file, we do our best to follow the [PEP-8](https://pep8.org/) guidelines.

### `**kwargs`

You'll notice the use of `**kwargs` in the Python file. This refers to "keyword arguments", which can be passed from a function to a "sub"-function. Technically, the `**` prefix unpacks dictionaries (a single `*` does the same for lists, tuples, and sets):

In [None]:
my_dict_1 = {'a': 1, 'b': 2}
my_dict_2 = {'c': 3, 'd': 4}

In [None]:
{**my_dict_1, **my_dict_2}

{'a': 1, 'b': 2, 'c': 3, 'd': 4}

But, it can also be used to pass arguments:

In [None]:
def called(y=3, z=5):
    print(y + z)

In [None]:
def calling(x, **kwargs):
    print("the 'x' value is ", x)
    print("\nand the sum of interest is:")
    called(**kwargs)

In [None]:
calling(4, y=5)

the 'x' value is  4

and the sum of interest is:
10
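Inside the function, `kwargs` is an ordinary dictionary holding whatever keyword arguments the caller supplied. The sketch below is a variation on the two functions above that returns values instead of printing them, so the dictionary is visible. It also shows what happens when a keyword is forwarded that the inner function does not accept.

```python
def called(y=3, z=5):
    return y + z


def calling(x, **kwargs):
    # kwargs collects any extra keyword arguments into a dict
    return x, kwargs, called(**kwargs)


result = calling(4, y=5)
print(result)

# Forwarding a keyword that `called` does not define raises a TypeError
try:
    calling(4, w=1)
except TypeError as error:
    message = str(error)
    print(message)
```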


### Modules

**<span style='color:darkblue'>Note: modules are discussed in-depth during a later week.</span>**

Functions exist as a sort of short-hand: reusable Python structures which help us to avoid writing repetitive code blocks. I.e., instead of writing the same block of code multiple times, we simply call a single function — it saves room and keeps things clean. In the same way that functions (and classes) can be imported from third-party (or built-in) packages, they can also be imported from modules we build ourselves!

In its most basic form, a [Python module](https://realpython.com/python-modules-packages/) is a Python file used to maintain Python objects (e.g., functions, variables, etc.). You can "import" your reusable code using the `import` statement as you would any other Python module.

**It is a good practice to build and test out your functions in Jupyter, and when you feel comfortable with them, *move* them into your Python module.** Use Jupyter notebook *as a notebook*, and keep any reusable code (e.g., functions) in the Python file. Whenever you're working on a project, keep open your Jupyter notebook for building and testing code, and keep your IDE (e.g., VS Code) open for updating your functions.

In [None]:
from genius_api import testing

In [None]:
testing()

Test.


Notice that if we make changes to the file, calling this function again will not reflect them. This is because the code that is "imported" into the current Python interpreter is loaded once, and later `import` statements reuse that cached copy instead of re-reading the file. We can either restart the kernel and re-import the module, or we can use the `reload` function from `importlib`.

In [None]:
from importlib import reload

In [None]:
# we need to import the module itself first
import genius_api

In [None]:
# run this cell after each change
reload(genius_api)
from genius_api import testing

In [None]:
testing()

Testing testing, 1, 2, 3.


Alternatively, you can use the [autoreload](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html) functionality in Jupyter to make sure updates to our code are reflected in the notebook. *Note: this implementation works most of the time, but not all the time ... using `reload` is less convenient, but more certain.*

In [None]:
%load_ext autoreload
%autoreload 2
from genius_api import testing

In [None]:
testing()

Testing testing, 1, 2, 3.
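The caching behavior described in this section can be reproduced without touching *genius_api.py*. The sketch below writes a throwaway module to a temporary folder, imports it, edits the file, and compares a plain re-import with `importlib.reload`. The module name `reload_demo_module` is hypothetical, and bytecode caching is switched off here only to keep the demo simple.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip writing .pyc files so the demo always re-reads the source file
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as folder:
    module_file = Path(folder) / "reload_demo_module.py"
    module_file.write_text("def testing():\n    return 'Test.'\n")

    sys.path.insert(0, folder)
    importlib.invalidate_caches()
    import reload_demo_module

    before_edit = reload_demo_module.testing()

    # Simulate editing the file in an IDE
    module_file.write_text(
        "def testing():\n    return 'Testing testing, 1, 2, 3.'\n"
    )

    # A plain import is a no-op here: the module is already in sys.modules
    import reload_demo_module
    after_reimport = reload_demo_module.testing()

    # reload() re-executes the file and updates the module object
    importlib.reload(reload_demo_module)
    after_reload = reload_demo_module.testing()

    sys.path.remove(folder)

print(before_edit, after_reimport, after_reload, sep=" | ")
```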


### `__name__`

You'll notice at the bottom of the Python file, there is a reference to `__name__` and the string "`__main__`". [In short](https://realpython.com/if-name-main-python/), every time a Python module is accessed, it is given a name. And, according to the [Python documentation](https://docs.python.org/3/library/__main__.html):

*`__main__` is the name of the environment where top-level code is run. "Top-level code" is the first user-specified Python module that starts running. It’s "top-level" because it imports all other modules that the program needs. Sometimes “top-level code” is called an entry point to the application.*

In fact, the `__name__` is stored as a global variable for the environment. For example:

In [None]:
# the name of the current environment
__name__

'__main__'

This can happen when Python is called from *"the scope of an interactive prompt"* (e.g., in Jupyter notebook), and it can also happen when a Python file is called from your computer's terminal using `python filename.py`.

*Note: this is **not** the same as the `!`-defined terminal in a Jupyter notebook*.

In [None]:
# uncomment the ... print("__name__ is" ... line in the module
# !python genius_api.py

*Note: you may find that there are package install errors, since the `!` is commanding from the environment that Jupyter was called from, which might not be the same as the notebook kernel.*

Otherwise, `__name__` is the name of the module itself.

In [None]:
# the imported module
genius_api.__name__

'genius_api'

In [None]:
from genius_api import NAME_DEMO

In [None]:
NAME_DEMO

'genius_api'
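Putting these pieces together, the usual pattern at the bottom of a script is a guard that checks `__name__`. The sketch below is a generic illustration, not the contents of *genius_api.py*; the function names `main` and `describe` are hypothetical.

```python
def main():
    """Work that should only happen when this file is run as a script."""
    return "running as a script"


def describe(module_name):
    """Explain how a file with this __name__ value was started."""
    if module_name == "__main__":
        return "top-level code (run directly)"
    return f"imported as the module '{module_name}'"


print(describe(__name__))

if __name__ == "__main__":
    # Skipped when this file is imported from another module or a notebook
    print(main())
```

With this guard in place, importing the file gives you its functions without triggering the work inside `main`.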

## Multiprocessing

Consider the fact that your computer likely has multiple "cores". I.e., when you buy a "quad-core" computer, this means that your CPU contains 4 *separate* processing units. **Each processing unit (core) can run Python routines independently.** This means that you can parallelize big operations across all your cores if you can harness each one individually. Python does not automatically dole out sub-processes based on our code, so we need to direct the computer to do so explicitly. We do this with multiprocessing.

The [multiprocessing Pool class](https://docs.python.org/3/library/multiprocessing.html) is the most straightforward way to parallelize computation. The simplest example (illustrated in the first code block of the documentation linked above) instantiates a **pool of workers**, where each worker completes an assigned task and returns the results. You'll notice that the pool is created inside an `if __name__ == '__main__':` block: worker processes may import the file themselves, so only top-level code should create the pool.
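A minimal sketch of a worker pool, in the spirit of the documentation's first example, might look like the following. The function name `square_with_pid` is hypothetical (it is not the `job_test` function from *genius_api.py*), and the choice of two workers is arbitrary. This is best saved to a .py file and run from the terminal, since pools created from notebook cells may not work on every platform.

```python
import os
from multiprocessing import Pool


def square_with_pid(number):
    """Return the worker's process ID along with the squared input."""
    return os.getpid(), number * number


if __name__ == "__main__":
    # Two workers share five tasks; map() returns results in input order
    with Pool(processes=2) as pool:
        results = pool.map(square_with_pid, range(5))

    for pid, value in results:
        print(f"process {pid} computed {value}")
```

The printed process IDs show which worker handled each task; with two workers you should see at most two distinct IDs.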

As a simple example, take a look at the `SIMPLE MULTIPROCESSING EXAMPLE` at the bottom of the *genius_api.py* file. You can uncomment it, and you'll need to run this from your terminal (within the proper environment):

```bash
python genius_api.py
```

You'll notice that there are several different processes (identified by their Process ID, `pid`), and each one runs the `job_test` function separately. The operating system schedules each process onto one of your CPU's **threads** (in a way, the cores are "multitasking" across threads ...). If you want to know how many threads are available on your machine, you can use `psutil` or `os`.

In [None]:
import psutil
import os

In [None]:
# physical cores
print("core count:\t", psutil.cpu_count(logical=False))

# threads (logical cores) available
print("thread count:\t", psutil.cpu_count(logical=True))

# threads available (same thing)
print("thread count*:\t", os.cpu_count())

core count:	 10
thread count:	 10
thread count*:	 10


Lastly, notice in the last example (`API MULTIPROCESSING EXAMPLE`) how you can use multiprocessing to speed up pulling data from an API.

## Running a Background Process

In many cases, you'll want a script to run on your local computer "in the background" so that it keeps running even if you close the terminal. (If your computer goes to sleep, its processes are usually paused until it wakes up.)

### Run a program

```bash
nohup python <script_name>.py &
```

Don't forget the `&` at the end. This directs the process to be run in the background. The `nohup` command keeps the process running after you close the terminal, and by default it appends the output to a *nohup.out* file in the working directory.

### Check Running Processes

```bash
ps ax
ps ax | grep <script_name> 
```

The former will list the running processes, and the latter will list only those with `<script_name>` in their command line.

### Kill a Process

If you're done with the script, run

```bash
kill <process_id>
```

to kill the process whose `<process_id>` appears in the first column of the `ps` output above. **Be careful with this!**

# Exercise

Write a Python script (i.e., a Python .py file) to retrieve data for a *list* of search terms from the Genius API.

- Implement error handling using `try`-`except` blocks to handle potential issues such as API rate limits, response errors, or network errors.
- Build your script to save the results to a CSV.

Once you have this working, try to incorporate multiprocessing, and save your data to a CSV file (or separate CSV files).

In [26]:
# this import only works if genius_to_df is defined in genius_api.py
from genius_api import genius_to_df

# example usage
search_term = "The Beatles"
results = genius_to_df(search_term)
print(results)


ImportError: cannot import name 'genius_to_df' from 'genius_api' (/Users/yuritziavila-robledo/Desktop/Data Science/i501_informatics/i501-labs/genius_api.py)