*Francisco Pereira [camara@dtu.dk], DTU Management*



# Advanced Business Analytics

## Lecture 1 - Web Data Mining - Part 1: News streams

A lot of valuable data can be collected from online sources. It can be used to gain insights, make predictions, and so on. In this notebook, we're going to practice collecting data from online sources using APIs. In particular, we will collect news data from the New York Times and analyse it using a Sentiment Analysis API.

**note: We use the NYT and Sentiment APIs because they are both free and straightforward to use. Many others exist that are free but require privacy sensitive data (e.g. credit card) for the registration. We strongly encourage you to try those, which are often more interesting, but be aware of the risks.** 

### 1. Accessing New York Times (NYT) data 


**! WARNING !**

**To create a developer account, you will need to provide your email address. If you prefer not to register, you can go through the solutions notebook directly.**

#### 1.1. Creating developer account

In order to use NYT’s API, you have to create a developer account on the NYT developers website.

1. Sign in or create an NYT developer account at http://developer.nytimes.com.

2. Create a new app
     1. Go to your profile->Apps->New App
     2. App name (required, e.g., *testappxxx12345*)
     3. For this notebook specifically, you will need to enable the following APIs: Times Wire, Top Stories and Community
     
3. Once your App has been created, copy your API key. You will need it later on
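Rather than pasting the key into every cell, one option is to keep it in an environment variable and read it once. The sketch below assumes a variable named `NYT_API_KEY`, which is a hypothetical name chosen for this example, and falls back to the `"API_Key"` placeholder used throughout this notebook.

```python
import os

# NYT_API_KEY is a hypothetical environment variable name chosen for this sketch.
def load_api_key(var_name="NYT_API_KEY", default="API_Key"):
    """Return the key stored in the environment, or a placeholder if it is unset."""
    return os.environ.get(var_name, default)

api_key = load_api_key()

# The NYT calls in this notebook pass the key as the 'api-key' query parameter.
params = {"api-key": api_key}
```

This keeps the key out of the notebook itself, which helps if you later share or publish it.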


#### 1.2. Install the requests library
Requests is a Python library for accessing REST API services through HTTP calls. You can install it in one of the following ways:
- Anaconda-Navigator: add the channel ```conda-forge```, press "update index...", search for ```requests```, install
- pip: ```pip install requests```

#### 1.3. The anatomy of a GET request
Now that we have the API key, we can start playing with our API! We will work with basic GET requests (and, further below, POST requests). If you understand these well, you are ready for pretty much any API out there. The opportunities are immense! :-)

First, let’s import requests

In [3]:
import requests


In practice, a GET request consists of a single HTTP call: the first part identifies the API server and service (often called the "url"), and the second part carries the parameters of your call (usually called "params"). The code below illustrates the idea.
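To make the two parts concrete, here is a minimal sketch that assembles the final address by hand with the standard library. It only builds and prints the string and does not contact the server. The key value is the same placeholder used elsewhere in this notebook.

```python
from urllib.parse import urlencode

# Part 1: the server and service
url = "https://api.nytimes.com/svc/news/v3/content/nyt/world.json"

# Part 2: the parameters of the call (placeholder key)
params = {"api-key": "API_Key"}

# A GET request appends the encoded parameters after a '?'
full_url = url + "?" + urlencode(params)
print(full_url)
```

When you pass `params=` to requests, it performs an equivalent encoding step for you, so you rarely need to build this string yourself.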

#### 1.4. Example 1: Times Newswire API
In the first example, we’ll be pulling the 20 most recent news from NYT. We’ll do this by using the API object’s 


In [1]:
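import requests

# Times Newswire endpoint for the "world" section
url = "https://api.nytimes.com/svc/news/v3/content/nyt/world.json"

# The API key is sent as a query parameter, not as an HTTP header
params = {
    'api-key': "API_Key"  # replace with your own key
    }

response = requests.get(url, params=params)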



The result should be a list of recent news items, each with its own URL. Following the link will usually bring you to the news article.

**We recommend that you manually inspect the "response" object. Among other things, it has a json function that transforms the returned data into a Python dictionary.**

In [2]:
result=response.json()

An important field of the response object is the **status_code**. Examples are 200 (the request went OK), 400 (bad request, e.g. a malformed parameter), 401 (not authorized; you may have used a wrong API key), and 404 (the service was not found at all). For a full list, check https://www.w3.org/Protocols/HTTP/HTRESP.html

In [3]:
response.status_code

200
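A small helper can make these codes easier to act on in your own scripts. The sketch below is only an illustration: `describe_status` is a hypothetical name, and it covers just the codes mentioned above.

```python
def describe_status(status_code):
    """Map an HTTP status code to a short hint (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "ok"
    hints = {
        400: "bad request: check the URL and parameters",
        401: "not authorized: check the API key",
        404: "not found: check the service URL",
    }
    return hints.get(status_code, "unexpected status")

# Usage with a real call would be: describe_status(response.status_code)
print(describe_status(200))
print(describe_status(401))
```

The requests library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses if you prefer failing loudly.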

**Inspect the "result" object. You'll see it is full of relevant data!**

Let's take a look at the whole dictionary at once (well, only the titles and URLs)

In [4]:
print("There are %d different news retrieved"%len(result['results']))

for res in result['results']:
    print("Title: %s, Source: %s"%(res['title'], res['url']))

There are 20 different news retrieved
Title: As diplomatic push continues, Macron will travel to Russia and Ukraine next week., Source: https://www.nytimes.com/live/2022/02/03/world/russia-ukraine-xi-putin/as-diplomatic-push-continues-macron-will-travel-to-russia-and-ukraine-next-week
Title: Pandemic-era tests could help efforts to eliminate hepatitis C., Source: https://www.nytimes.com/live/2022/02/04/world/covid-test-vaccine-cases/pandemic-era-tests-could-help-efforts-to-eliminate-hepatitis-c
Title: Jens Stoltenberg will head Norway’s central bank after his NATO term ends., Source: https://www.nytimes.com/live/2022/02/03/world/russia-ukraine-xi-putin/jens-stoltenberg-will-head-norways-central-bank-after-his-nato-term-ends
Title: Captain Cook’s Ship Caught in Center of a Maritime Rift, Source: https://www.nytimes.com/2022/02/04/world/australia/captain-james-cook-hmb-endeavour.html
Title: Is China’s ‘zero-Covid’ policy sustainable?, Source: https://www.nytimes.com/live/2022/02/04/world

In [None]:
print("There are %d different news retrieved"%len(result['results']))

for res in result['results']:
    print("Section: %s"%(res['section']))

#### 1.5. Exercise: NYT top stories from last 30 days...

It is now time for you to build your own REST API GET call. We want you to get the list of top stories from the last 30 days.

First, you’ll have to examine the NYT documentation to find the right API (do you have it enabled?): https://developer.nytimes.com/apis. Hint: the solution below calls the Most Popular API (`/svc/mostpopular/v2/viewed/30.json`), so make sure that one is enabled for your app too.

Then you'll have to understand how to really call it. What is the URL? What are the parameters?



In [4]:
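import requests

# Most Popular endpoint: most viewed articles over the last 30 days
url = "https://api.nytimes.com/svc/mostpopular/v2/viewed/30.json"

# The API key is sent as a query parameter
params = {
    'api-key': "API_Key"  # replace with your own key
    }

response = requests.get(url, params=params)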

In [5]:
result=response.json()
response.status_code

200

In [6]:
print("There are %d top stories in the last 30 days"%len(result['results']))

for res in result['results']:
    print("Section: %s \t Title: %s"%(res['section'],res['title']))

There are 20 top stories in the last 30 days
Section: Health 	 Title: Bob Saget’s Autopsy Report Describes Severe Skull Fractures
Section: Business 	 Title: The New York Times Buys Wordle
Section: Style 	 Title: Anna Sorokin on ‘Inventing Anna’ and Life After Rikers
Section: U.S. 	 Title: G.O.P. Declares Jan. 6 Attack ‘Legitimate Political Discourse’
Section: World 	 Title: Effort to Rescue a 5-Year-Old Transfixes Morocco, Only to End Sadly
Section: U.S. 	 Title: Former Miss USA Cheslie Kryst Dies at 30
Section: Your Money 	 Title: Bernie Madoff’s Sister and Her Husband Are Found Dead in Florida
Section: Technology 	 Title: Wordle Is a Love Story
Section: Health 	 Title: Got a Covid Booster? You Probably Won’t Need Another for a Long Time
Section: Sports 	 Title: Kamila Valieva’s sample included three substances sometimes used to help the heart. Only one is banned.
Section: Technology 	 Title: Who Is Behind QAnon? Linguistic Detectives Find Fingerprints
Section: New York 	 Title: Accou
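Once the stories are in a list of dictionaries, simple summaries become one-liners. As an illustration, the sketch below counts stories per section with `collections.Counter`. It uses a small sample copied by hand from the output above, so it runs without an API key. With live data you would iterate over `result['results']` instead.

```python
from collections import Counter

# Small hand-copied sample in the same shape as result['results']
sample_results = [
    {"section": "Health", "title": "Bob Saget’s Autopsy Report Describes Severe Skull Fractures"},
    {"section": "Business", "title": "The New York Times Buys Wordle"},
    {"section": "Technology", "title": "Wordle Is a Love Story"},
    {"section": "Health", "title": "Got a Covid Booster? You Probably Won’t Need Another for a Long Time"},
    {"section": "Technology", "title": "Who Is Behind QAnon? Linguistic Detectives Find Fingerprints"},
]

# Count how many stories fall in each section
section_counts = Counter(item["section"] for item in sample_results)

for section, n in section_counts.most_common():
    print("%s: %d" % (section, n))
```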

#### 1.6. Exercise: NYT comments section

For better or for worse, the comments section of a news website is often the most lively one... and the NYT API allows us to access that too. Select one of the top news articles above (from the last 30 days) and check its comments.

In [9]:
#Gets an article and the url of that article
import requests

article = result['results'][2]
article_url = article['url']
print(article_url)

# Community API endpoint that returns the comments for a given article URL
url = "https://api.nytimes.com/svc/community/v3/user-content/url.json"

params = {
    'api-key': "API_Key",  # replace with your own key
    'offset' : 0,          # index of the first comment to return
    'url': article_url
    }

response = requests.get(url, params=params)

# Note: this overwrites the top-stories "result" from the previous cell
result=response.json()

https://www.nytimes.com/2022/02/14/style/anna-delvey-sorokin-interview.html


In [10]:
for count, res in enumerate(result['results']['comments']):
    print("Comment: %d, user: %s, length: %d, recommendations: %d"%(count, res['userDisplayName'], len(res['commentBody']), res['recommendations']))

Comment: 0, user: Sally, length: 36, recommendations: 53
Comment: 1, user: Maggi, length: 619, recommendations: 603
Comment: 2, user: PM, length: 340, recommendations: 245
Comment: 3, user: H Munro, length: 502, recommendations: 112
Comment: 4, user: Sally, length: 210, recommendations: 184
Comment: 5, user: A., length: 332, recommendations: 203
Comment: 6, user: prairie gal, length: 341, recommendations: 101
Comment: 7, user: Chris Borman, length: 166, recommendations: 129
Comment: 8, user: Cinderella, length: 328, recommendations: 117
Comment: 9, user: Jules, length: 77, recommendations: 151
Comment: 10, user: Susan, length: 149, recommendations: 72
Comment: 11, user: Venus Transit, length: 163, recommendations: 25
Comment: 12, user: Linn Anders, length: 201, recommendations: 135
Comment: 13, user: Amy Williams, length: 1584, recommendations: 45
Comment: 14, user: Jack, length: 208, recommendations: 53
Comment: 15, user: Stefon, length: 181, recommendations: 458
Comment: 16, user: Mi

Print the comment with the highest number of upvotes ("recommendations").

In [41]:
#your code here
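One possible approach, sketched on a tiny hand-made sample so it runs without an API key: the user names and recommendation counts are taken from the output above, while the comment bodies are placeholders. With live data you would pass `result['results']['comments']` to `max` instead.

```python
# Hand-made sample shaped like result['results']['comments'];
# the commentBody values are placeholders, not real comments.
sample_comments = [
    {"userDisplayName": "Sally", "commentBody": "(placeholder text 1)", "recommendations": 53},
    {"userDisplayName": "Maggi", "commentBody": "(placeholder text 2)", "recommendations": 603},
    {"userDisplayName": "PM", "commentBody": "(placeholder text 3)", "recommendations": 245},
]

# max() with a key function returns the dictionary with the largest count
top_comment = max(sample_comments, key=lambda c: c["recommendations"])

print("Most recommended (%d) by %s:" % (top_comment["recommendations"], top_comment["userDisplayName"]))
print(top_comment["commentBody"])
```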

Ok, you should now be ready to explore other services from this API and make a few calls just to see the results. Notice that the mechanism is always the same: find the service URL, get the right parameters, interpret the results... 

**note: for some existing APIs (including the NYT) there are Python packages that make your life even easier, with functions that directly get the data, preprocess it, combine it with other sources and so on. Check, for example, "pynytimes".**

### 2. Combining services from different APIs. POST requests with the Sentiment Analysis API

As you'll quickly find out, there are thousands of APIs available online, and even aggregators of such APIs. An example is http://rapidapi.com. Go there and create your account.

Feel free to browse the "API Hub", you may find services that you really like and can use later in your project! :-)

What we want to do now is to use another API, the Text Sentiment API (https://rapidapi.com/fyhao/api/text-sentiment-analysis-method/), to do a simple analysis on the comments we obtained from the NYT website.

#### 2.1. The anatomy of a POST request

GET and POST are two different types of HTTP requests. GET is used for reading something without changing it, while POST is used for sending data to the server, which typically creates or changes something. For example, the NYT API only allows GET calls, since it only provides news. If it allowed us to submit comments, for example, that would have to go through POST calls.

In our case, the Text Sentiment API expects us to upload our text through a POST request called "analyze".
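The main practical difference from GET is where the data travels: in a POST request it goes in the request body rather than in the URL. The sketch below uses only the standard library to show what a form-encoded body looks like. It does not send anything, and the field name `text` is the one used by the example call in this notebook.

```python
from urllib.parse import urlencode, parse_qs

# The form field we want to upload
form_fields = {"text": "I am so happy today... hmm, no, I'm actually quite sad... wait!"}

# Form encoding (application/x-www-form-urlencoded) escapes spaces and punctuation
body = urlencode(form_fields)
print(body)

# The server decodes the body back into the original field
decoded = parse_qs(body)
print(decoded["text"][0])
```

Passing a dictionary as `data=` to requests applies this kind of encoding automatically, which avoids problems when the text contains characters such as `&` or `=`.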

#### 2.2. Example 2: A POST request to Text Sentiment

Take a look below, at the different components of the call. 

In [19]:
url = "https://text-sentiment.p.rapidapi.com/analyze"

payload = "text=I am so happy today... hmm, no, I'm actually quite sad... wait!"
headers = {
    'content-type': "application/x-www-form-urlencoded",
    'x-rapidapi-host': "text-sentiment.p.rapidapi.com",
    'x-rapidapi-key': "API_Key"
    }
response = requests.request("POST", url, data=payload, headers=headers).json()



In [20]:
response

{'text': "I am so happy today... hmm, no, I'm actually quite sad... wait!",
 'totalLines': 3,
 'pos': 1,
 'neg': 1,
 'mid': 1,
 'pos_percent': '33.33333333333333%',
 'neg_percent': '33.33333333333333%',
 'mid_percent': '33.33333333333333%',
 'lang': 'ENGLISH'}

#### 2.2. Assigning sentiment to the NYT comments

Make a function called "get_sentiment" that receives the text of a comment and returns a sentiment value between 0 and 100, where 100 is the most positive sentiment and 0 is the most negative.

In [24]:
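def get_sentiment(comment):
    """Return a sentiment score from 0 (most negative) to 100 (most positive)."""
    url = "https://text-sentiment.p.rapidapi.com/analyze"

    # Passing a dict lets requests URL-encode the text (safe for '&', '=', etc.)
    payload = {'text': comment}
    headers = {
        'content-type': "application/x-www-form-urlencoded",
        'x-rapidapi-host': "text-sentiment.p.rapidapi.com",
        'x-rapidapi-key': "API_Key"
        }
    try:
        response = requests.request("POST", url, data=payload, headers=headers).json()
        # Weighted share of lines: positive=1, neutral=0.5, negative=0
        sentiment=(response['pos']+response['mid']*.5+response['neg']*0)/(response['pos']+response['mid']+response['neg'])
    except Exception:
        # Fall back to a neutral score if the call fails or the response is malformed
        sentiment=.5
    
    return sentiment*100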
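To see how the score behaves without calling the API, the formula can be isolated in a small pure function. This is only a sketch: `sentiment_score` is a hypothetical helper name, and it applies the same weighting as above (positive lines count 1, neutral lines 0.5, negative lines 0) to the `pos`, `mid` and `neg` counts returned by the service.

```python
def sentiment_score(pos, mid, neg):
    """Score from 0 to 100 given counts of positive, neutral and negative lines."""
    total = pos + mid + neg
    if total == 0:
        # No lines were classified: treat as neutral
        return 50.0
    return 100.0 * (pos + 0.5 * mid) / total

# The example response above had pos=1, mid=1, neg=1
print(sentiment_score(1, 1, 1))
print(sentiment_score(3, 0, 0))
print(sentiment_score(0, 0, 2))
```

For the example response shown earlier, one line of each kind gives (1 + 0.5) / 3 = 0.5, i.e. a score of 50.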

Can you now go through the same set of NYT comments from above (or something else of your choice...) and print the sentiment associated with each one? Also, for the most polarized ones (i.e. those scoring either 0 or 100), can you print the comment text too?

In [25]:
# Community API endpoint that returns the comments for a given article URL
url = "https://api.nytimes.com/svc/community/v3/user-content/url.json"

params = {
    'api-key': "API_Key",  # replace with your own key
    'offset' : 0,
    'url':'https://www.nytimes.com/2022/02/14/style/anna-delvey-sorokin-interview.html'
    }

response = requests.get(url, params=params)

result=response.json()

In [26]:
#your code here
count=0
for res in result['results']['comments']:
    count+=1
    sentiment=get_sentiment(res['commentBody'])
    print("Comment: %d, user: %s, length: %d, recommendations: %d, sentiment:%d"%(count, res['userDisplayName'], len(res['commentBody']), res['recommendations'], sentiment))
    if sentiment==100 or sentiment==0:
        print("Polarized!")
        print(res['commentBody'])

Comment: 1, user: Sally, length: 36, recommendations: 53, sentiment:50
Comment: 2, user: Maggi, length: 619, recommendations: 603, sentiment:50
Comment: 3, user: PM, length: 340, recommendations: 245, sentiment:50
Comment: 4, user: H Munro, length: 502, recommendations: 112, sentiment:100
Polarized!
In the last two episodes,was there some importance or meaning to the narrative decision to shift sympathy and identification to the woman who committed fraud? Why choose to align the majority of major characters with the sorokin character? What was the point of depicting her friend's misfortune as some kind of deserved comeuppance for allowing sorokin to buy things for her? Was this to distinguish and give moral superiority to the other two friends who did the exact thing? what a waste of my time.
Comment: 5, user: Sally, length: 210, recommendations: 184, sentiment:58
Comment: 6, user: A., length: 332, recommendations: 203, sentiment:25
Comment: 7, user: prairie gal, length: 341, recommend

#### 2.3. So many other things to do!...

The idea of this notebook was to introduce you to a powerful Web Mining tool: HTTP GET/POST API services. Now it's your turn to explore. Some ideas (that may require you to register for further services):
- Use other text analysis tools (e.g. Google Translate, GrammarBot, TextToImage...)
- Extract the users from the NYT Community API and create their influencer graph
- Inspect other APIs you may like (e.g. financial news, stock market, weather, environment, transport, ...)
