<a href="https://colab.research.google.com/github/shstreuber/Data-Mining/blob/master/Module13_WebscrapingAPI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 13: Natural Language Processing and APIs**

Up to this point, we have mainly learned how the most popular data analytics algorithms work, how to preprocess our data to get them to work, and how to configure these algorithms to get them to work faster and better. For that, we always had "canned" data available, meaning datasets that were already in some sort of .csv format, nicely tabulated, and ready for analysis.

But real life is harsh. Data doesn't usually show up in neat little (or big) .csv packaging. It is messy, crazy, and unstructured. Think, for example, about product ratings on Amazon.com, comments on YouTube or Instagram, or threads of tweets on Twitter, video responses to other videos on TikTok, likes/ dislikes and donations on Discord or Twitch, and so on. There's a whole lot of data there, and it can be incredibly useful. But 1. how do you get to it and 2. how do you analyze it? That's what this module is all about.

At the end of this module, you will be able to:
* Acquire data from a webpage
* Clean data obtained from a webpage
* Acquire data from an API
* Build a dataframe that's ready for analysis

Let's go.

# **0. Preparation and Setup**
Well, we need our libraries again, this time for webscraping and for textual analysis, which we will do in the second half of this file.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# **1. ACQUIRING AND PROCESSING THE DATA (with webscraping)**
Web scraping is the process of using code to extract content and data from a website. It's kind of like building your own web search and using it to scour the internet for appropriate data--for example, sites like LinkedIn or Indeed for the title of the job you want after you graduate, or sites like [catster.com](https://www.catster.com/), Youtube.com or Reddit's [CatAdvice subreddit](https://www.reddit.com/r/CatAdvice/) for, say, information about healthy cat food.

To perform web scraping, we will import the libraries shown below. The [urllib.request](https://docs.python.org/3/library/urllib.request.html) module is used to open URLs. The [Beautiful Soup package](https://pypi.org/project/beautifulsoup4/) is used to extract data from html files. The Beautiful Soup library's name is bs4 which stands for Beautiful Soup, version 4.

This is an amended copy of the [Datacamp Tutorial on Web Scraping](https://www.datacamp.com/community/tutorials/web-scraping-using-python).

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

Now we specify the URL containing the dataset and pass it to urlopen() to get the html of the page. The example below references the [results from a 2017 10k race as published on the Huber Timing website](https://www.hubertiming.com/results/2017GPTR10K).

**NOTE** that some pages will produce the following error: `HTTPError: HTTP Error 403: Forbidden`; the developers built in anti-scraping security code.
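
One common reason for a 403 is that the request does not identify itself the way a browser would. The sketch below is not part of the original tutorial: it sends a `User-Agent` header using the standard library's `urllib.request.Request`. The helper names `build_request` and `fetch_html` and the header value are assumptions made for illustration. Whether a site accepts such a request is up to the site, so check its terms of use and robots.txt before scraping.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0 (compatible; course-notebook)"):
    """Wrap the URL in a Request object that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return the raw page bytes, or None if the server answers with an HTTP error."""
    try:
        with urlopen(build_request(url), timeout=10) as response:
            return response.read()
    except HTTPError as err:
        print(f"Request failed: HTTP {err.code} {err.reason}")
        return None


# Usage (needs network access):
# html = fetch_html("https://www.hubertiming.com/results/2017GPTR10K")
```

The bytes returned by `fetch_html` can be passed to BeautifulSoup() in the same way as the object returned by urlopen().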

In [None]:
url = "https://www.hubertiming.com/results/2017GPTR10K"
# url = "http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm"
# url = "https://www.zyte.com/learn/what-is-web-scraping/"
html = urlopen(url)

Getting the html of the page is just the first step. The next step is to create a Beautiful Soup object from the html. This is done by passing the html to the BeautifulSoup() function. The Beautiful Soup package is used to parse the html, that is, to take the raw html text and break it into Python objects. The second argument, 'lxml', names the html parser that assigns the Python objects to the appropriate tag delimiters.

In [None]:
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup

Now we use the soup object to extract interesting information about the website we are scraping such as getting the title of the page as shown below.

In [None]:
# Get the title
title = soup.title
print(title)

<title>Race results for the 2017 Intel Great Place to Run \ Urban Clash Games!</title>


You can also get the text of the webpage and quickly print it out to check if it is what you expect.

In [None]:
# Print out the text
text = soup.get_text()
print(text)

Now, open a new tab on your web browser and go directly to the website you are scraping. Right-click into the website and, from the popup menu, select "Inspect." If you are in Chrome, this will open a developer view with many tabs on the right side of your screen. This will show you the code of the webpage (although you may have to open and close a number of expanders to see any actual HTML tags).

<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/webscraping_HTML_example.png" width="500">
</center>
</div>

You can use the find_all() method of soup to extract useful html tags within a webpage. Examples of useful tags include < a > for hyperlinks, < table > for tables, < tr > for table rows, < th > for table headers, and < td > for table cells. The code below shows how to extract all the hyperlinks within the webpage.


In [None]:
soup.find_all('a')

As you can see from the output above, html tags sometimes come with attributes such as class, src, etc. These attributes provide additional information about html elements. You can use a for loop and the get("href") method to extract and print out only the hyperlinks.

In [None]:
all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))

mailto:timing@hubertiming.com
https://www.hubertiming.com
/results/2017GPTR
/results/team/2017GPTR
/results/team/2017GPTR10K
/results/summary/2017GPTR10K
None
#tabs-1
https://www.hubertiming.com/
https://facebook.com/hubertiming/
None
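
The output mixes absolute URLs, relative paths, in-page anchors, an e-mail link, and `None` for < a > tags without an href. As a small sketch (not part of the original tutorial), the standard library's `urljoin` can turn the relative paths into full URLs while the unusable entries are skipped. The `hrefs` list is a hand-copied subset of the output above, and the variable names are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://www.hubertiming.com/results/2017GPTR10K"

# A few href values copied from the output above; None stands for an <a> tag without href
hrefs = ["mailto:timing@hubertiming.com", "/results/2017GPTR", None, "#tabs-1",
         "https://facebook.com/hubertiming/"]

page_links = []
for href in hrefs:
    if href is None or href.startswith(("mailto:", "#")):
        continue  # skip missing hrefs, e-mail links, and in-page anchors
    page_links.append(urljoin(base_url, href))  # relative paths become absolute URLs

print(page_links)
```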


To print out table rows only, pass the 'tr' argument in soup.find_all().

In [None]:
# Print the first 7 rows for sanity check
rows = soup.find_all('tr')
print(rows[:7])

## **1.1. Preprocessing--THE ALL-IMPORTANT FIRST STEP!!!**
Our goal here is to convert the data from the webpage into a dataframe so we can do our data magic with it. To get there, we need to get all table rows in list form first and then convert that list into a dataframe. Below is a for loop that iterates through the table rows and collects the cells of each row. Because the print() call sits outside the loop, only the cells of the last row are printed.

### **1.1.1 Extracting data from table rows**

In [None]:
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)

### **1.1.2. Cleaning Data: Removing HTML Tags**

The output above shows that each row is printed with html tags embedded in it. This is not what you want. You can remove the html tags using Beautiful Soup or regular expressions.

The easiest way to remove html tags is to use Beautiful Soup, and it takes just one line of code to do this. Pass the string of interest into BeautifulSoup() and use the get_text() method to extract the text without html tags.

In [None]:
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

[577, 443, 

                    LIBBY B MITCHELL

                , F, HILLSBORO, OR, 1:41:18, 1:42:10, ]
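
As an alternative sketch, get_text() can also be called on each cell separately, which yields a clean Python list per row instead of one long string. The snippet below runs on a small hand-written HTML fragment modeled on the race table, so it needs no network access; `sample_html`, `soup_demo`, `header`, and `records` are illustrative names, and the built-in "html.parser" is used here only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the structure of the results table
sample_html = """
<table>
  <tr><th>Place</th><th>Bib</th><th>Name</th></tr>
  <tr><td>577</td><td>443</td><td>
        LIBBY B MITCHELL
  </td></tr>
</table>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

# strip=True removes the surrounding whitespace and line breaks of each cell
header = [th.get_text(strip=True) for th in soup_demo.find_all("th")]

records = []
for tr in soup_demo.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells and is skipped
        records.append(cells)

print(header)
print(records)
```

A list of lists like `records` can be handed to pd.DataFrame() together with `header` as column names, which would avoid the later string splitting; the rest of this section continues with the string-based approach.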


The code below shows how to build a regular expression that finds every html tag (such as < td >) in a table row and replaces it with an empty string. First, you compile a regular expression by passing the pattern to re.compile(). The pattern `<.*?>` matches an opening angle bracket, followed by anything, followed by a closing angle bracket. The dot, star, and question mark (.*?) make the match non-greedy, that is, it matches the shortest possible string. If you omit the question mark, it will match all the text between the first opening angle bracket and the last closing angle bracket. After compiling the regular expression, you can use re.sub() to find all the substrings that match and replace them with an empty string. The full code below creates an empty list, strips the html tags from each row, and appends the cleaned text to the list.

In [None]:
import re

clean = re.compile('<.*?>')  # matches an opening angle bracket followed by anything and followed by a closing angle bracket

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean2 = re.sub(clean, '', str_cells)  # remove every html tag from the row
    list_rows.append(clean2)
print(clean2)  # shows the last row only
type(clean2)

[577, 443, 

                    LIBBY B MITCHELL

                , F, HILLSBORO, OR, 1:41:18, 1:42:10, ]


str
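
To see the difference between the non-greedy and the greedy pattern in isolation, here is a minimal sketch on a made-up two-cell string (the `sample` value is an illustration, not scraped data):

```python
import re

sample = "<td>577</td><td>443</td>"

non_greedy = re.sub(r"<.*?>", "", sample)  # each tag is matched separately
greedy = re.sub(r"<.*>", "", sample)       # one match from the first < to the last >

print(repr(non_greedy))
print(repr(greedy))
```

With the question mark, only the tags disappear and the cell contents survive. Without it, the single greedy match swallows the cell contents as well.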

## **1.2 Converting Data to Dataframe----THE ALL-IMPORTANT SECOND STEP!!!**

The next step is to convert the list into a dataframe and get a quick view of the first 10 rows using Pandas.

In [None]:
df = pd.DataFrame(list_rows)
df.head(10)

Unnamed: 0,0
0,[]
1,"[Finishers:, 577]"
2,"[Male:, 414]"
3,"[Female:, 163]"
4,[]
5,"[1, 814, \r\n\r\n JARED WIL..."
6,"[2, 573, \r\n\r\n NATHAN A ..."
7,"[3, 687, \r\n\r\n FRANCISCO..."
8,"[4, 623, \r\n\r\n PAUL MORR..."
9,"[5, 569, \r\n\r\n DEREK G O..."


### **1.2.1 Cleaning Data: Formatting the Dataframe**
The dataframe is not in the format we want. To clean it up, you should split the "0" column into multiple columns at the comma position. This is accomplished by using the str.split() method.

In [None]:
df1 = df[0].str.split(',', expand=True)
df1.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,[],,,,,,,,
1,[Finishers:,577],,,,,,,
2,[Male:,414],,,,,,,
3,[Female:,163],,,,,,,
4,[],,,,,,,,
5,[1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,OR,36:21,36:24,]
6,[2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,OR,36:42,36:45,\n\r\n INTEL TEAM ...
7,[3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,OR,37:44,37:48,]
8,[4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,OR,38:34,38:37,]
9,[5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,OR,39:21,39:24,\n\r\n INTEL TEAM ...


This looks much better, but there is still work to do. The dataframe has unwanted square brackets surrounding each row. You can use the strip() method to remove the opening square bracket on column "0."

In [None]:
df1[0] = df1[0].str.strip('[')
df1.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,],,,,,,,,
1,Finishers:,577],,,,,,,
2,Male:,414],,,,,,,
3,Female:,163],,,,,,,
4,],,,,,,,,
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,OR,36:21,36:24,]
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,OR,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,OR,37:44,37:48,]
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,OR,38:34,38:37,]
9,5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,OR,39:21,39:24,\n\r\n INTEL TEAM ...
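
Note that str.strip() removes the listed characters from both ends of a string, so both brackets can be handled in one call before splitting. The following sketch shows the idea on two made-up rows that imitate the strings in `list_rows`; `demo` and `parts` are illustrative names, and the example is a variation on the steps above rather than the tutorial's own sequence.

```python
import pandas as pd

# Two strings shaped like the cleaned rows above (values shortened for the example)
demo = pd.DataFrame({0: ["[1, 814, JARED WILSON, M]", "[Finishers:, 577]"]})

# Remove both brackets first, then split at the commas
parts = demo[0].str.strip("[]").str.split(",", expand=True)

# Trim the space that follows each comma
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

Rows with fewer cells are padded with missing values, just as in the output above.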


### **1.2.2 Building Table Headers**

The table is missing table headers. You can use the find_all() method to get the table headers.

In [None]:
col_labels = soup.find_all('th')

Just like what we did with the table rows, you can use Beautiful Soup to extract text in between html tags for table headers.

In [None]:
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()
all_header.append(cleantext2)
print(all_header)

['[Place, Bib, Name, Gender, City, State, Time, Gun Time, Team]']


You can then convert the list of headers into a pandas dataframe.

In [None]:
df2 = pd.DataFrame(all_header)
df2.head()

Unnamed: 0,0
0,"[Place, Bib, Name, Gender, City, State, Time, ..."


Similarly, you can split column "0" into multiple columns at the comma position for all rows.

In [None]:
df3 = df2[0].str.split(',', expand=True)
df3.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,[Place,Bib,Name,Gender,City,State,Time,Gun Time,Team]


Now we can concatenate the two dataframes into one using the concat() method as illustrated below.

In [None]:
frames = [df3, df1]

df4 = pd.concat(frames)
df4.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,[Place,Bib,Name,Gender,City,State,Time,Gun Time,Team]
0,],,,,,,,,
1,Finishers:,577],,,,,,,
2,Male:,414],,,,,,,
3,Female:,163],,,,,,,
4,],,,,,,,,
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,OR,36:21,36:24,]
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,OR,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,OR,37:44,37:48,]
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,OR,38:34,38:37,]


The code below assigns the first row as the table header.

In [None]:
df5 = df4.rename(columns=df4.iloc[0])
df5.head()

Unnamed: 0,[Place,Bib,Name,Gender,City,State,Time,Gun Time,Team]
0,[Place,Bib,Name,Gender,City,State,Time,Gun Time,Team]
0,],,,,,,,,
1,Finishers:,577],,,,,,,
2,Male:,414],,,,,,,
3,Female:,163],,,,,,,


At this point, the table is almost properly formatted. For analysis, you can start by getting an overview of the data as shown below.

In [None]:
df5.info()
df5.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 583 entries, 0 to 581
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   [Place     583 non-null    object
 1    Bib       581 non-null    object
 2    Name      578 non-null    object
 3    Gender    578 non-null    object
 4    City      578 non-null    object
 5    State     578 non-null    object
 6    Time      578 non-null    object
 7    Gun Time  578 non-null    object
 8    Team]     578 non-null    object
dtypes: object(9)
memory usage: 45.5+ KB


(583, 9)

The table has 583 rows and 9 columns. You can drop all rows with any missing values.

In [None]:
df6 = df5.dropna(axis=0, how='any')

Also, notice how the table header is replicated as the first row in df5. It can be dropped using the following line of code.

In [None]:
df7 = df6.drop(df6.index[0])
df7.head()

Unnamed: 0,[Place,Bib,Name,Gender,City,State,Time,Gun Time,Team]
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,OR,36:21,36:24,]
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,OR,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,OR,37:44,37:48,]
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,OR,38:34,38:37,]
9,5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,OR,39:21,39:24,\n\r\n INTEL TEAM ...


You can perform more data cleaning by renaming the '[Place' and ' Team]' columns. Column names must match exactly, including whitespace, so make sure you include the space after the opening quotation mark in ' Team]'.

In [None]:
df7.rename(columns={'[Place': 'Place'},inplace=True)
df7.rename(columns={' Team]': 'Team'},inplace=True)
df7.head()

Unnamed: 0,Place,Bib,Name,Gender,City,State,Time,Gun Time,Team
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,OR,36:21,36:24,]
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,OR,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,OR,37:44,37:48,]
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,OR,38:34,38:37,]
9,5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,OR,39:21,39:24,\n\r\n INTEL TEAM ...
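
The remaining column names still carry a leading space (for example ' Bib'). As a hedged sketch, all column names can be cleaned in one step with the string methods of the column index; the `demo` frame below is a made-up miniature of the table, not the scraped data.

```python
import pandas as pd

# Miniature frame with the same kind of messy column names as the table above
demo = pd.DataFrame([["1", "36:24", "]"]], columns=["[Place", " Gun Time", " Team]"])

# Strip spaces and square brackets from both ends of every column name
demo.columns = demo.columns.str.strip(" []")

print(list(demo.columns))
```

Only the ends of each name are touched, so the space inside 'Gun Time' is preserved.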


## **1.3 Data Cleaning: Removing White Space and Special Characters--THE ALL-IMPORTANT THIRD STEP!!!**

The final data cleaning steps involve removing the closing bracket for cells in the "Team" column, the white space, and the new line characters.

In [None]:
df7['Team'] = df7['Team'].str.strip(']')
df7.head()

Unnamed: 0,Place,Bib,Name,Gender,City,State,Time,Gun Time,Team
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,OR,36:21,36:24,
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,OR,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,OR,37:44,37:48,
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,OR,38:34,38:37,
9,5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,OR,39:21,39:24,\n\r\n INTEL TEAM ...


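Next, remove all white space. Note that the pattern `\s` matches spaces as well as line breaks, so the spaces inside names and team names disappear, too ("JARED WILSON" becomes "JAREDWILSON").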

In [None]:
df7.replace(r'\s', '', regex = True, inplace = True)

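Finally, replace any leftover literal `\n` sequences (a backslash followed by the letter n). Actual line breaks have already been removed by the previous step.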

In [None]:
df8 = df7.replace(r'\\n',' ', regex=True)
df8.head()

Unnamed: 0,Place,Bib,Name,Gender,City,State,Time,Gun Time,Team
5,1,814,JAREDWILSON,M,TIGARD,OR,36:21,36:24,
6,2,573,NATHANASUSTERSIC,M,PORTLAND,OR,36:42,36:45,INTELTEAMF
7,3,687,FRANCISCOMAYA,M,PORTLAND,OR,37:44,37:48,
8,4,623,PAULMORROW,M,BEAVERTON,OR,38:34,38:37,
9,5,569,DEREKGOSBORNE,M,HILLSBORO,OR,39:21,39:24,INTELTEAMF
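
If you would rather keep the spaces between first and last names, one option is to collapse runs of white space instead of deleting them. This is a sketch of that alternative, not the tutorial's own method; `normalize_whitespace` is a hypothetical helper, and the sample string imitates a raw "Name" cell.

```python
import re


def normalize_whitespace(value):
    # Replace every run of spaces, tabs, and line breaks with a single space,
    # then trim the ends
    return re.sub(r"\s+", " ", value).strip()


raw_name = "\r\n\r\n                    JARED WILSON\r\n\r\n                "
print(normalize_whitespace(raw_name))
```

On a dataframe column, the same pattern could be applied with the pandas string methods, for example `df7['Name'].str.replace(r'\s+', ' ', regex=True).str.strip()`.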


It took a while to get here, but at this point, the dataframe is in the desired format.

If you would like to read about another webscraping project, take a look at [this blog post about scraping a job portal](https://realpython.com/beautiful-soup-web-scraper-python/). This is about getting data from the [Fake Python Jobs](https://realpython.github.io/fake-jobs/) site (**NOTE**: These are **FAKE** posts; the jobs **don't exist**; this is a site built exclusively for static HTML-based web scraping). How is that for hunting for your dream job?

# **2. WORKING WITH AN API**
Getting information directly from webpages is one thing--in fact, a lot of online marketing companies specialize in scraping and cleaning data and then sell these data to other businesses for further analysis. As you may already guess, the approach shown above works only with static HTML sites, that is, with sites that send your browser complete webpages, not just shells of CSS and JavaScript that fill in their content through Ajax calls to a database. And, as you have seen, this can be very painful.

That's why some website providers offer application programming interfaces (APIs) that allow you to access their data in a predefined manner. With APIs, you can avoid parsing HTML. Instead, you can access the data directly using formats like JSON and XML.

When you use an API, the process is generally more stable than gathering the data through web scraping. That’s because developers create APIs to be consumed by programs rather than by human eyes.

## **2.1 Setting up the Data Source**
In order to work with an API, the first step is always to obtain the required login credentials for the source of your data and store these in an **APP** (yes, I said app because that's what this is called--no relation to whatever you have running on your cellphone). Think of this as getting the key to your house or apartment from your landlord or realtor.


### **2.1.1 API #1: New York Times**


Let's assume we work with the **New York Times** API:
1. Go to [the New York Times API website](https://developer.nytimes.com/get-started) and sign up for an account.
2. Once you have completed the email verification process, log into the website and start setting your API key up under Get Started (see below). This is the access tool that your Python code needs in order to download data from the API.
<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/nytapi.JPG" width="500">
</center>
</div>

3. The instructions ask you to create an app. No worries: That is API-speak for a set of dedicated access keys that allow you to use the API. After you log in, go to Get Started and follow the app generation procedure. Enable the Article Search, the Community API, and the Top Stories API and don't forget to hit "save."
4. This will give you an app ID and a set of keys.
<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/nytapi_keys.JPG" width="500">
</center>
</div>

5. Once you know that you have these available, go to the APIs and learn about how to connect to each of them. Be sure that the API you wish to use is **AUTHORIZED** in your app.

Now you have set up ONE side of the puzzle--**the API side.**

To learn more about how to connect into the New York Times API, check out this [blog post](https://dlab.berkeley.edu/blog/scraping-new-york-times-articles-python-tutorial) or [this (somewhat older) notebook](https://github.com/nilmolne/Text-Mining-The-New-York-Times-Articles/blob/master/Code/HowToUse.ipynb) or this [notebook about COVID-related articles](https://github.com/brienna/coronavirus-news-analysis/blob/master/2020_05_01_get_data_from_NYT.ipynb).


### **2.1.2 API #2: Reddit**

The principle here is the same as before: Build a user account and configure your app, then write down your key(s) because your code will need them when it connects to the data source. Here is how this works on Reddit, complete with demonstration:






In [None]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/FdjVoOf9HN4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

## **2.2 Setting up Python**
Now that you have set up the API of your choice, the second part of the project is to connect to it and retrieve the data. This means that, now that you have the key(s) available, you write your Colab or Jupyter Notebook code to use these keys to log into the API. This comes in different flavors, as well.



### **2.2.1 New York Times**
So, to connect to the **New York Times**, the simplest code can look like what you see below, using the requests, os, and pprint libraries. Note that NYTIMES_APIKEY is the name of the environment variable that holds the key you received when signing up for the New York Times APIs (see screenshot above). Here you see the Top Stories API in action:

In [None]:
import requests
import os
from pprint import pprint

apikey = os.getenv('NYTIMES_APIKEY', '...')

# Top Stories:
# https://developer.nytimes.com/docs/top-stories-product/1/overview
section = "science"
query_url = f"https://api.nytimes.com/svc/topstories/v2/{section}.json?api-key={apikey}"

r = requests.get(query_url)
pprint(r.json())

The snippet above is very straightforward. We run a GET request against the topstories/v2 endpoint, supplying the section name and our API key.

The output comes in JSON format and looks like this:
```
{ 'last_updated': '2020-08-09T08:07:44-04:00',
 'num_results': 25,
 'results': [{'abstract': 'New Zealand marked 100 days with no new reported '
                          'cases of local coronavirus transmission. France '
                          'will require people to wear masks in crowded '
                          'outdoor areas.',
              'byline': '',
              'created_date': '2020-08-09T08:00:12-04:00',
              'item_type': 'Article',
              'multimedia': [{'caption': '',
                              'copyright': 'The New York Times',
                              'format': 'superJumbo',
                              'height': 1080,
                              'subtype': 'photo',
                              'type': 'image',
                              'url': 'https://static01.nyt.com/images/2020/08/03/us/us-briefing-promo-image-print/us-briefing-promo-image-superJumbo.jpg',
                              'width': 1920},
                             ],
              'published_date': '2020-08-09T08:00:12-04:00',
              'section': 'world',
              'short_url': 'https://nyti.ms/3gH9NXP',
              'title': 'Coronavirus Live Updates: DeWine Stresses Tests’ '
                       'Value, Even After His False Positive',
              'uri': 'nyt://article/27dd9f30-ad63-52fe-95ab-1eba3d6a553b',
              'url': 'https://www.nytimes.com/2020/08/09/world/coronavirus-covid-19.html'},
             ]
 }

```
That is the shortest API call. The Article Search API gives you more filtering options. The only mandatory field is q (query), which is the search term. Beyond that, you can mix and match filter query, date range (begin_date, end_date), page number, sort order, and facet fields.
```
# Article Search:
# https://api.nytimes.com/svc/search/v2/articlesearch.json?q=<QUERY>&api-key=<APIKEY>
# Use - https://developer.nytimes.com/docs/articlesearch-product/1/routes/articlesearch.json/get to explore API

query_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {
    "q": "politics",
    "api-key": apikey,
    "begin_date": "20200701",  # YYYYMMDD
    # Lucene syntax: http://www.lucenetutorial.com/lucene-query-syntax.html
    "fq": 'body:("Trump") AND glocations:("WASHINGTON")',
    "page": 0,  # <0-100>
    "sort": "relevance",  # or "newest", "oldest"
}

# Passing params lets requests URL-encode the quotes, spaces, and parentheses
r = requests.get(query_url, params=params)
pprint(r.json())
```
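
Filter queries contain quotes, spaces, and parentheses, which have to be percent-encoded before they can travel inside a URL. The sketch below uses only the standard library to show what that encoding looks like and that it can be reversed; the API key is a placeholder, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {
    "q": "politics",
    "begin_date": "20200701",
    "fq": 'body:("Trump") AND glocations:("WASHINGTON")',
    "api-key": "YOUR_KEY_HERE",  # placeholder, not a real key
}

# urlencode() escapes every character that is not safe inside a query string
encoded_url = f"{base_url}?{urlencode(params)}"
print(encoded_url)

# Decoding the query string gives back the original filter query
decoded = parse_qs(urlsplit(encoded_url).query)
print(decoded["fq"][0])
```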
The final challenge is to transform the JSON content into a file that can serve as a data frame.
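
One way to approach this, sketched under the assumption that you only need a handful of fields per article: pick the fields you want from each entry in 'results' and hand the resulting list of dictionaries to pandas. The `payload` below is a trimmed, hand-copied version of the Top Stories output shown earlier, and `wanted`, `records`, and `stories` are illustrative names.

```python
import pandas as pd

# Trimmed copy of the Top Stories response shown earlier (one result only)
payload = {
    "num_results": 1,
    "results": [
        {
            "title": "Coronavirus Live Updates: DeWine Stresses Tests’ Value, "
                     "Even After His False Positive",
            "section": "world",
            "published_date": "2020-08-09T08:00:12-04:00",
            "url": "https://www.nytimes.com/2020/08/09/world/coronavirus-covid-19.html",
            "multimedia": [{"format": "superJumbo", "height": 1080, "width": 1920}],
        }
    ],
}

wanted = ["title", "section", "published_date", "url"]

# One flat dictionary per article; nested fields such as 'multimedia' are left out
records = [{key: item.get(key) for key in wanted} for item in payload["results"]]

stories = pd.DataFrame(records)
print(stories)

# To keep the result as a file: stories.to_csv("top_stories.csv", index=False)
```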

To learn more about newspaper APIs, check out [Martin Heinz' blog post](https://martinheinz.dev/blog/31) about how to connect to them.


### **EXAMPLE: Using the New York Times API**

The New York Times API enables you to search the New York Times archives for specific keywords.

**To connect your Colab Notebook to The New York Times archive, you will need to create an application on the NYT website (aka a login/ password profile)**, even if you're building something for personal use. Here is how to do that:

1. Create an NYT account: If you don't have one already, visit the New York Times website (www.nytimes.com) and sign up for a new account. Make sure to verify your email address if prompted.

2. Join the New York Times Developer Network: Go to the New York Times Developer Network website (developer.nytimes.com) and sign in using your NYT account credentials.

3. Create a new application: Once you're signed in, navigate to the "My Apps" section of the Developer Network website. Click on "Create a New App" or a similar button to start the application creation process.

4. Provide application details: Fill in the required details for your application, such as the name, description, and the intended use of the API. Be sure to read and understand the terms and conditions provided by the New York Times.

5. Select API access: Choose the specific APIs you want to use in your application. The New York Times offers various APIs for different purposes, such as article search, top stories, and more. Select the ones that suit your project requirements.

6. Agree to terms and conditions: Review the terms and conditions associated with the New York Times APIs and agree to them if you accept. Ensure that your project complies with the usage guidelines provided by the New York Times.

7. Obtain your API key: Once you've completed the application creation process, the New York Times Developer Network should provide you with an API key. This key will be unique to your application and will allow you to authenticate your requests to the NYT API.

Make sure to store your API key securely and use it responsibly, adhering to the terms and guidelines specified by the New York Times. Additionally, be aware that the steps or requirements for obtaining an NYT API key may change without notice. It's advisable to visit the New York Times Developer Network website or contact their support for the most up-to-date instructions.

### **CODE: Using the New York Times API**

**NOTE**: The code stub below hardcodes keys, logins, and passwords. You should ONLY ever do that if you are the only person ever seeing your code. This is NOT PERMISSIBLE in a professional environment, in which you'll want to set your login information as environment variables instead.
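
A minimal sketch of the environment-variable approach, assuming the key is stored under the name NYTIMES_APIKEY (the name used earlier in this module); `load_api_key` is a hypothetical helper written for this example.

```python
import os


def load_api_key(name="NYTIMES_APIKEY"):
    """Read the key from an environment variable and fail loudly if it is missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running this cell.")
    return key


# Usage, once the variable has been set outside of the notebook's saved code:
# apikey = load_api_key()
```

Raising an error early is a deliberate choice here: a missing key is easier to diagnose at this point than as a failed request later on.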

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import requests
import os
from pprint import pprint

In [None]:
# 1. VALIDATE AUTHENTICATION AND TEST SEARCH TERM "DOG"
# See instructions for installing Requests module for Python
# https://requests.readthedocs.io/en/master/user/install/#install

def execute():
  requestUrl = "https://api.nytimes.com/svc/search/v2/articlesearch.json?fq=dog&q=dog&api-key=enter here your own API key"
  requestHeaders = {
    "Accept": "application/json"
  }

  response = requests.get(requestUrl, headers=requestHeaders)

  print(response.text)
  print(response.status_code)

if __name__ == "__main__":
  execute()

In [None]:
# 2. DEFINE VARIABLES TO HOLD KEYS AND FILTER QUERY ETC.
# Article Search:
# https://api.nytimes.com/svc/search/v2/articlesearch.json?q=<QUERY>&api-key=<APIKEY>
# Use - https://developer.nytimes.com/docs/articlesearch-product/1/routes/articlesearch.json/get to explore API

# Never paste a real key into a shared notebook; read it from the environment instead
apikey = os.getenv('NYTIMES_APIKEY', 'enter here your own API key')

query = "dog"
begin_date = "20200701"  # YYYYMMDD
# Lucene syntax (http://www.lucenetutorial.com/lucene-query-syntax.html);
# 'headline' is the Article Search field that holds the title
filter_query = 'body:("dog") OR headline:("dog")'
page = 0  # <0-100>
sort = "relevance"  # newest, oldest

query_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {
    "q": query,
    "api-key": apikey,
    "begin_date": begin_date,
    "fq": filter_query,
    "page": page,
    "sort": sort,
}

# requests URL-encodes the parameters for us
r = requests.get(query_url, params=params)
pprint(r.json())

In [None]:
# 3. FORMAT THE JSON OUTPUT

import json

def jprint(obj):
    # create a formatted string of the Python JSON object
    formatted = json.dumps(obj, sort_keys=True, indent=7)
    print(formatted)

jprint(r.json())
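
The page parameter lets you walk through a longer result list one page at a time. The sketch below shows one possible loop for that. It assumes the articles sit under response → docs and that each one has a headline → main field, so check this against your own output. `collect_docs` and `fake_fetch` are hypothetical helpers; `fake_fetch` stands in for a real request so that the example runs without a key or network access.

```python
import time


def collect_docs(fetch_page, max_pages=3, pause_seconds=0.0):
    """Call fetch_page(0), fetch_page(1), ... and gather the 'docs' lists."""
    docs = []
    for page in range(max_pages):
        payload = fetch_page(page)
        batch = payload.get("response", {}).get("docs", [])
        if not batch:  # an empty page means there is nothing left
            break
        docs.extend(batch)
        time.sleep(pause_seconds)  # pause between calls to respect the API's rate limit
    return docs


# Stand-in for a real request; the headlines are made up
def fake_fetch(page):
    pages = [
        {"response": {"docs": [{"headline": {"main": "Dog story A"}}]}},
        {"response": {"docs": [{"headline": {"main": "Dog story B"}}]}},
        {"response": {"docs": []}},
    ]
    return pages[page]


headlines = [doc["headline"]["main"] for doc in collect_docs(fake_fetch)]
print(headlines)
```

In a real run, `fetch_page` would wrap requests.get() with the page number added to the query, and `pause_seconds` would be set according to the rate limits listed for your app.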

### **2.2.2 Reddit**
Connecting with **Reddit** is explained really well in [this article](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c), which the code in this section summarizes. Also, you need more than just one token this time around:
1. Your Reddit Login and password
2. The personal use script key you receive when you sign up for your app
3. The secret key that you receive when you sign up for your app

Once you have these, you will need to set up your OAuth configuration. This will assign to you a token that expires every 2 hours:

```
import requests

# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'secret'
auth = requests.auth.HTTPBasicAuth('<CLIENT_ID>', '<SECRET_TOKEN>')

# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': '<USERNAME>',
        'password': '<PASSWORD>'}

# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'MyBot/0.0.1'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# while the token is valid (~2 hours) we just add headers=headers to our requests
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)
```
That is the first part. Now the API trusts us, and we can retrieve text. First, we'll look at the most popular posts:
```
res = requests.get("https://oauth.reddit.com/r/python/hot",
                   headers=headers)

print(res.json())  # This will pull out all the "hot" posts
```
The result will be a big ugly JSON-formatted set. Think about it as a bag of words.
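
The JSON is easier to handle once you know how it is nested. The following sketch uses a hand-written miniature of a listing response and assumes the usual Reddit layout, in which the posts sit in a list under data → children and each post keeps its fields under its own data key. The titles and scores are made up.

```python
# Hand-written miniature of a listing response (titles and scores are made up)
listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "score": 120}},
            {"kind": "t3", "data": {"title": "Second post", "score": 45}},
        ]
    },
}

# Walk down the nesting: listing -> data -> children -> (each post) -> data -> title
titles = [child["data"]["title"] for child in listing["data"]["children"]]
print(titles)
```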


---


Let's inspect this bag of words in order to find the most interesting posts about--what else?--cats. First, we'll explore the titles of the retrieved posts:
```
for post in res.json()['data']['children']:
    print(post['data']['title'])
```
There is always enough material on Reddit about cats! So, let's point our data retrieval query at the r/cats subreddit **AND** collect the output in a dataframe (which we will name "fluffy"):

```
# make a request for the trending posts in /r/cats
res = requests.get("https://oauth.reddit.com/r/cats/hot",
                   headers=headers)

posts = []  # one dictionary per post

# loop through each post retrieved from GET request
for post in res.json()['data']['children']:
    # here, we collect the relevant data of each post
    posts.append({
        'subreddit': post['data']['subreddit'],
        'title': post['data']['title'],
        'selftext': post['data']['selftext'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'ups': post['data']['ups'],
        'downs': post['data']['downs'],
        'score': post['data']['score']
    })

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
fluffy = pd.DataFrame(posts)
```
---
Afterwards, inspect the dataframe with fluffy.head(), and you'll see the retrieved posts already in a dataframe. Follow the steps in part 1 of this workbook to clean your data up a little, including removing URLs and any special characters, and your data is ready for analysis!
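
As a rough sketch of that cleanup step, the hypothetical helper `clean_text` below removes URLs, replaces everything that is not a letter, digit, or white space, and collapses the leftover gaps. The patterns are deliberately simple and the sample sentence is made up; adapt them to what your analysis needs (for example, you may want to keep apostrophes).

```python
import re


def clean_text(text):
    text = re.sub(r"https?://\S+", " ", text)     # drop URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # drop punctuation, emoji, other symbols
    return re.sub(r"\s+", " ", text).strip()      # collapse leftover white space


sample = "Best food for my cat?! See https://example.com/cat-food :)"
print(clean_text(sample))
```

On a dataframe, the helper could be applied to a text column with `fluffy['title'].apply(clean_text)`.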


### **EXAMPLE: Using the Reddit API**
Reddit is one of the oldest conversation forums that is still very active.

**To connect your Colab Notebook to Reddit, you will need to create an application on Reddit (aka a login/ password profile), even if you're building something for personal use.** Here is how to do that:

1. Log in to your Reddit account and visit the Reddit App Preferences page (https://www.reddit.com/prefs/apps).

2. Create a new application: On the App Preferences page, scroll down to the "Developed Applications" section. Click on the "Create App" button.

3. Configure your application: Provide a name for your application (it can be anything) and select the appropriate app type. For most use cases, including this notebook, the "script" option is suitable. Enter a short description and a redirect URI. A script app never actually redirects anywhere, so if the form requires a value, a placeholder such as http://localhost:8080 is commonly used.

4. Skip what you don't need: Reddit's form has no separate permissions section, and the "about url" field can stay blank. Save your application with the "create app" button.

5. Note your client ID and secret: After saving your application configuration, Reddit will provide you with a unique client ID and a client secret. These are essential for making API requests, so keep them secure and do not share them publicly.

6. Understand the usage guidelines: Take a moment to familiarize yourself with Reddit's API usage guidelines (https://www.reddit.com/wiki/api). Ensure that your project adheres to the terms and conditions outlined by Reddit.

That's it! You now have a Reddit API key in the form of a client ID and client secret. With these credentials, you can authenticate your requests and access the Reddit API. Make sure to handle your credentials securely and follow Reddit's guidelines while using their API.

### **CODE: Using the Reddit API**

**NOTE**: The code stub below hardcodes keys, logins, and passwords. You should ONLY ever do that if you are the only person ever seeing your code. This is NOT PERMISSIBLE in a professional environment, in which you'll want to set your login information as environment variables instead.
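
For reference, reading credentials from environment variables could look like the sketch below. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_credentials` are hypothetical; use whatever names you set in your own environment:

```python
import os

# Hypothetical variable names; adjust them to match your own environment
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_SECRET", "REDDIT_USERNAME", "REDDIT_PASSWORD")

def load_credentials(env=None):
    """Read the Reddit credentials from environment variables, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}

# Demonstration with a plain dictionary standing in for os.environ
fake_env = {"REDDIT_CLIENT_ID": "abc", "REDDIT_SECRET": "def",
            "REDDIT_USERNAME": "me", "REDDIT_PASSWORD": "hunter2"}
creds = load_credentials(fake_env)
print(sorted(creds))  # prints the variable names only, never the values
```

The returned values could then replace the hardcoded strings, for example `requests.auth.HTTPBasicAuth(creds["REDDIT_CLIENT_ID"], creds["REDDIT_SECRET"])`.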

In [None]:
import requests
import pandas as pd

In [None]:
# 1. CONFIGURING OUR AUTHENTICATION AND CONNECTION PARAMETERS

# Note that the client ID is the string shown under 'personal use script' and the client secret is the value labeled 'secret'
auth = requests.auth.HTTPBasicAuth('enter here your personal use script', 'enter here your token')

# Here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': 'enter here your username',
        'password': 'enter here your password'}

# Setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'MyBot/0.0.1'}

In [None]:
# 2. AUTHENTICATING OUR CONNECTION VIA A REQUEST FOR AN OAuth TOKEN
res = requests.post('https://www.reddit.com/api/v1/access_token', auth=auth, data=data, headers=headers)

In [None]:
# 3. CONFIGURING THE DESIRED RESPONSE

# Convert response to JSON & add access_token value
TOKEN = res.json()['access_token']

# Add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# While the token is valid (~2 hours) we just add headers=headers to our requests
requests.get('https://oauth.reddit.com/api/v1/me/', headers=headers)
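
Step 3 assumes that the token request succeeded. If the login fails, `res.json()['access_token']` raises a `KeyError` that says little about the cause. The sketch below checks for the token first; the helper name `build_auth_headers` and both sample payloads are invented, and the exact shape of a real error response is an assumption:

```python
def build_auth_headers(token_payload, base_headers):
    """Return a copy of base_headers with the bearer token added, or raise if no token came back."""
    if "access_token" not in token_payload:
        raise ValueError(f"No access token in response: {token_payload}")
    return {**base_headers, "Authorization": f"bearer {token_payload['access_token']}"}

# Invented payloads standing in for res.json()
good = {"access_token": "abc123", "token_type": "bearer"}
bad = {"error": "invalid_grant"}

print(build_auth_headers(good, {"User-Agent": "MyBot/0.0.1"}))
try:
    build_auth_headers(bad, {"User-Agent": "MyBot/0.0.1"})
except ValueError as err:
    print("Authentication failed:", err)
```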

In [None]:
# 4. GETTING THE DATA FROM THE DOGS SUBREDDIT

res = requests.get("https://oauth.reddit.com/r/dogs/hot", headers=headers)  # Here we are getting the "hot" posts from the dogs subreddit
print(res.json())  # This prints the raw JSON for the retrieved posts

In [None]:
# 4b. LOOKING AT JUST THE TITLES TO GET A BETTER SENSE OF THE DATA WE ACQUIRED

for post in res.json()['data']['children']:
    print(post['data']['title'])

In [None]:
# 5. BUILDING A DATA FRAME FROM THE ACQUIRED DATA

rows = []  # collecting one dictionary per post

# Loop through each post retrieved from GET request
for post in res.json()['data']['children']:
    # here, we collect the relevant fields; DataFrame.append was removed in pandas 2.0, so we build the dataframe from a list instead
    rows.append({
        'subreddit': post['data']['subreddit'],
        'title': post['data']['title'],
        'selftext': post['data']['selftext'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'ups': post['data']['ups'],
        'downs': post['data']['downs'],
        'score': post['data']['score']
    })

df = pd.DataFrame(rows)  # building our dataframe in one step

# Look at the dataframe
df.head()

### **2.2.3 Twitter**
Connecting to any high-visibility social media platform is really involved these days due to data privacy concerns (thanks to the Cambridge Analytica scandal), even if the data is as public as on Twitter. Getting the Facebook or Instagram APIs set up and calls working can take days or weeks, mostly because of approval turnarounds.

BUT, assuming that you have your logins and your tokens available, the Twitter API is easily managed. Twitter has published [a treasure trove of code stubs](https://github.com/twitterdev/Twitter-API-v2-sample-code) on GitHub for that purpose.

The most interesting code snippets for our purpose are:
* [Full archive search](https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/main/Full-Archive-Search/full-archive-search.py)
* [User Tweet timeline](https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/main/User-Tweet-Timeline/user_tweets.py)

Twitter also allows you to connect with [Postman](https://developer.twitter.com/en/docs/tools-and-libraries/using-postman), a tool that sends HTTP requests and displays the results through a GUI (of course, any true hacker wouldn't be caught dead using a GUI).

If you already have Twitter API access, I encourage you to explore the options here. If not, the NYT or the Reddit APIs might be easier for you to work with.


## Your Turn (OPTIONAL)
In this workbook, you have encountered two major methods of obtaining data: direct webscraping (section 1) and the use of an API (section 2). One of the big takeaways here is that all APIs behave slightly differently, but the process has several steps in common:
1. Register for a user account on the website
2. Build your app
3. Note your keys, which you will need to connect
4. Pivot to your notebook
5. Import at a minimum the requests package and the os module (os ships with Python, so there is nothing to install for it)
6. Set up variables to hold your keys
7. Test the authentication method
8. Write your query code--test
9. Edit your query code to pull data into a dataframe--test
10. Clean and format the data

There are several APIs with which you can work and ways in which you can make this exercise **useful for you personally**. I always suggest checking out job search websites like Indeed, LinkedIn, etc. (the Dice API was just recently shut down), to build your own job search agent. Here are some API portals that you might find useful:
* [Indeed](https://developer.indeed.com/)
* [LinkedIn](https://www.linkedin.com/help/linkedin/answer/a526048/accessing-linkedin-apis?lang=en)
* [Monster](https://developer.monster.io/)
* and [several more](https://rapidapi.com/collection/ziprecruiter-api)

NOTE that these APIs and their requirements change quickly and frequently. Pick one of the methods or one of the APIs with which you want to work. Then see what data you can pull. Try formatting the data. If you would like more help converting JSON output to a pandas dataframe than you are seeing above, [this article](https://towardsdatascience.com/how-to-convert-json-into-a-pandas-dataframe-100b2ae1e0d8) will walk you through the individual steps.
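
Many API responses nest dictionaries inside dictionaries, which does not map cleanly onto dataframe columns. The sketch below shows one way to flatten such a record by hand, so that you can see what is going on; the helper name `flatten` and the sample job posting are invented for illustration:

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level, joining nested keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))  # recurse into nested dictionaries
        else:
            flat[full_key] = value
    return flat

# Invented job-posting record, shaped like a typical nested API response
job = {"title": "Data Analyst",
       "company": {"name": "Acme", "location": {"city": "Chicago"}},
       "remote": True}
flat_job = flatten(job)
print(flat_job)
```

A list of such flattened records can be passed straight to `pd.DataFrame`. pandas also offers `pd.json_normalize`, which performs a similar flattening and returns a dataframe directly.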