In [None]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

# Detour: Exception Handling

Sometimes the interpreter will generate an error that will interrupt the execution of your program.  These are called exceptions and can be handled programmatically.

This part of the content is from [Chapter 10](https://automatetheboringstuff.com/chapter10/) of your ABSP textbook.

## `try` and `except` statements

In [None]:
10/0

In [None]:
a = int(input("Number: "))

In [None]:
try:
    a = int(input("Number 1: "))
    b = int(input("Number 2: "))
    print(a/b)
except ValueError:
    print("Whoa, that's not an integer")
except ZeroDivisionError:
    print("Whoa, you can't divide by zero")
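
The cell above waits for keyboard input. The following is a minimal sketch of the same idea that runs without input; the function name `safe_divide` and the returned messages are made up for illustration.

```python
def safe_divide(a_text, b_text):
    """Parse two strings as integers and divide them, reporting common errors."""
    try:
        a = int(a_text)
        b = int(b_text)
        return a / b
    except ValueError:
        # int() raises ValueError when the text is not a valid integer
        return "not an integer"
    except ZeroDivisionError:
        # a / b raises ZeroDivisionError when b is 0
        return "cannot divide by zero"


print(safe_divide("10", "4"))    # 2.5
print(safe_divide("ten", "4"))   # not an integer
print(safe_divide("10", "0"))    # cannot divide by zero
```

Each `except` clause handles only the exception type it names, so the two kinds of failure get different messages.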

### `raise` statement

In [None]:
try:
    a = int(input("Number 1: "))
    if a<0:
        raise ValueError("Entered a Negative")
except ValueError:
    print("Whoa, that's not a valid non-negative integer")
    raise
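
As a sketch of the two forms of `raise` used above (the helper names `parse_non_negative` and `parse_and_report` are hypothetical): `raise ValueError(...)` creates a new exception with a message, while a bare `raise` inside an `except` block passes the current exception on to the caller.

```python
def parse_non_negative(text):
    """Return text as a non-negative int, or raise ValueError."""
    number = int(text)  # int() itself raises ValueError for text such as "abc"
    if number < 0:
        raise ValueError(f"Entered a negative number: {number}")
    return number


def parse_and_report(text):
    try:
        return parse_non_negative(text)
    except ValueError as err:
        print("Problem:", err)
        raise  # a bare raise re-raises the exception being handled


try:
    parse_and_report("-5")
except ValueError as err:
    # The message given to raise is available through str(err)
    message = str(err)

print(message)
```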

### `assert` statement

An assertion is a sanity check to make sure your code isn't doing something obviously wrong. These sanity checks are performed by `assert` statements. If the sanity check fails, then an `AssertionError` exception is raised.

In [None]:
instructorName = 'Sriram'

assert instructorName == 'Sriram', "Wow! The instructor has to be Sriram!"
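
The assertion above always passes. The sketch below shows a failing one; the function `average` is an illustrative example, not part of the textbook chapter.

```python
def average(values):
    # Sanity check: an empty list would cause a division by zero below
    assert len(values) > 0, "values must not be empty"
    return sum(values) / len(values)


print(average([2, 4, 6]))  # 4.0

try:
    average([])
except AssertionError as err:
    # The text after the comma in the assert statement becomes the message
    failure = str(err)
    print("AssertionError:", failure)
```

Assertions are skipped when Python runs with the `-O` option, so they suit checks for programmer mistakes rather than validation of user input.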

# Web Scraping

Web scraping is a large topic that involves a deep understanding of how websites are created and managed. You will also need to know some fundamentals of HTML. In this section we cover only the basics of extracting data from websites.

### `pd.read_html()`

Using the pandas package, you can read the tables that appear on a website. `pd.read_html()` reads all the tables that are available on the webpage.

The following example extracts the NBA 2019 draft data set from the [Sports Reference](https://www.basketball-reference.com/draft/NBA_2019.html) website.

In [None]:
nba_data_list = pd.read_html("https://www.basketball-reference.com/draft/NBA_2019.html") 
type(nba_data_list)

You will notice that `read_html()` returns a list. A webpage can contain multiple tables, so `read_html()` returns a list of tables, each one a DataFrame. This webpage has only one table, so you can access it as the element at index 0.

In [None]:
nba_df = nba_data_list[0]
nba_df

Information on web pages is not always clean. In this case you might have observed that the column names form a multilevel index. You can replace them with the column names shown on the website by assigning a new list of names.

In [None]:
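# The "Per Game" columns get a _per_G suffix so that no two columns share a name
nba_df.columns = ['Rk', 'Pk', 'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB', 'AST', 'FG%',
                  '3P%', 'FT%', 'MP_per_G', 'PTS_per_G', 'TRB_per_G', 'AST_per_G', 'WS', 'WS/48', 'BPM', 'VORP']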

nba_df.head()
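
Typing the names by hand works for a single table. As an alternative sketch, the multilevel column names can also be flattened in code; the toy DataFrame below is made up to mimic a two-level header and is not the real draft table.

```python
import pandas as pd

# Toy table with two header levels, similar in shape to what read_html can return
columns = pd.MultiIndex.from_tuples([
    ("Info", "Player"),
    ("Totals", "PTS"),
    ("Per Game", "PTS"),
])
toy_df = pd.DataFrame([["A", 100, 10.0], ["B", 50, 5.0]], columns=columns)

# Option 1: look at only the lowest header level (names may repeat)
lowest = list(toy_df.columns.get_level_values(1))
print(lowest)

# Option 2: join both levels so that repeated names stay distinct
toy_df.columns = [f"{top}_{bottom}" for top, bottom in toy_df.columns]
print(list(toy_df.columns))
```

Joining the levels keeps the total and per-game versions of a statistic apart, at the cost of longer names.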

#### Clean the data

Data downloaded from webpages almost always needs to be cleaned. The following is a simple example of deleting unnecessary data.

You will notice that internet data is **messy**. For example, if you look at rows 28 through 34, you will see that the rows with index 30 and 31 contain data that is not required. Look at the [website](https://www.basketball-reference.com/draft/NBA_2019.html): the table has a break, so the DataFrame has unnecessary information.

In [None]:
nba_df.loc[28:34]

In [None]:
# Drop the two rows with those index labels; inplace=True modifies nba_df directly instead of returning a new DataFrame.
nba_df.drop([30, 31], axis=0, inplace=True)
nba_df.loc[28:34]
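
Hard-coded index labels only work for this one table. A more general sketch, assuming the unwanted rows are the ones whose pick number is not numeric, filters them with a boolean mask; the toy data below is made up for illustration.

```python
import pandas as pd

# Toy data: the rows at index 2 and 3 are header lines repeated inside the table
toy_df = pd.DataFrame({
    "Pk": ["1", "2", None, "Pk", "3"],
    "Player": ["A", "B", "Round 2", "Player", "C"],
})

# errors="coerce" turns anything that is not a number into NaN
pk_numeric = pd.to_numeric(toy_df["Pk"], errors="coerce")

# Keep only the rows whose pick number is a real number, then renumber the index
clean_df = toy_df[pk_numeric.notna()].reset_index(drop=True)
print(clean_df)
```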

### Activity

* Use `pd.read_html()` to download the information on all the states from the [Wikipedia](https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals) page. 
    * Do the column names appear appropriately? Make sure you set the column names appropriately. 
    * Do you see any redundant rows appearing? Remove them from the DataFrame. 

### Activity

* Download the list of top 250 movies from [IMDB](http://www.imdb.com/chart/top?ref_=nv_wl_img_3).

* Clean the data and remove unnecessary rows and columns.

* Which movie released in 2014 has the highest IMDb rating?

# Packages for web scraping

* urllib
* requests
* **BeautifulSoup**
* mechanize

Using these packages requires some fundamentals of HTML, the language used to display webpages in the browser.

In [None]:
import urllib
import requests
from bs4 import BeautifulSoup

In [None]:
req = requests.get("https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals")
page = req.text

page_soup = BeautifulSoup(page, 'html.parser')


In [None]:
page_soup

You can print the actual webpage and its contents. 

**Warning**: The contents of a webpage are messy and may not make sense at first glance. However, if you want to scrape any website, you will have to be patient and look through the contents to extract the information.

In [None]:
print(page_soup.prettify())

In [None]:
page_soup.title

In [None]:
page_soup.title.string

### Searching in the webpage

You can programmatically search through a webpage to find the tables that are available on it. You can do that by using the **`find_all()`** method, which returns a list of all matching tags.

In [None]:
states_table = page_soup.find_all("table")
states_table
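
To see how `find_all()` can turn a table into rows of text, here is a small sketch on a hand-written HTML snippet rather than the live Wikipedia page. The snippet and the variable names are made up for illustration; real pages have far more markup.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Indiana</td><td>Indianapolis</td></tr>
  <tr><td>Ohio</td><td>Columbus</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so take the first (and only) table
table = soup.find_all("table")[0]

rows = []
for tr in table.find_all("tr"):
    # Header cells are <th> tags and data cells are <td> tags
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```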

# Web Scraping through an Application Programming Interface (API)

Many websites provide APIs. You can use these APIs to retrieve data from services like Twitter, Google Trends, etc.

In this section, we will use a simple API from [Open Notify](http://open-notify.org/), a project that exposes some of NASA's data, to retrieve data about the International Space Station (ISS).

Some of the content presented here is based on [dataquest](https://www.dataquest.io/blog/python-api-tutorial/).

#### Current ISS position

In [None]:
import requests
response = requests.get("http://api.open-notify.org/iss-now.json")

print(response.status_code)

In [None]:
response.content

There are various status codes that you can get when you request a website. [This list](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) describes them in more detail.
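
As a rough sketch, Python's standard library can translate a status code into its standard phrase. The helper `describe_status` and its category labels are made up for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirection"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With a real request you could pass `response.status_code` to this helper.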

In [None]:
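from io import StringIO  # newer pandas versions expect a file-like object rather than raw JSON
response = requests.get("http://api.open-notify.org/iss-now.json")
pd.read_json(StringIO(response.text))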

#### Current Number of People In Space

In [None]:
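from io import StringIO  # newer pandas versions expect a file-like object rather than raw JSON
response = requests.get("http://api.open-notify.org/astros.json")
pd.read_json(StringIO(response.text))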
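
The sketch below shows one way to turn this kind of response into a table without a network connection. It assumes the response keeps the `message`, `number`, and `people` layout; the payload is a placeholder and the names in it are made up.

```python
import json
import pandas as pd

# Placeholder payload in the shape of astros.json; the names are NOT real data
sample_text = """
{"message": "success", "number": 2,
 "people": [{"name": "Person A", "craft": "ISS"},
            {"name": "Person B", "craft": "ISS"}]}
"""

# json.loads turns the JSON text into nested Python dictionaries and lists
payload = json.loads(sample_text)

# The list under "people" holds one dictionary per person, so it maps to one row each
people_df = pd.DataFrame(payload["people"])

print(payload["number"])
print(people_df)
```

With a live request, `response.json()` returns the same kind of dictionary that `json.loads` produces here.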

# Global Database of Events, Language, and Tone (GDELT) API

[GDELT](https://www.gdeltproject.org/about.html) describes itself as the largest, most comprehensive, and highest resolution open database of human society ever created. If you have never seen it, you should explore their open database. It is unusual and offers many opportunities for data analysis.

### Install the package

To access their database through an API, you need to install the `gdelt` package.

Run the following command in a cell of your Jupyter notebook.

`!pip install --user gdelt`

This installs the `gdelt` package that we use here.

**Important Notes**

1. **You should be able to install any package this way on your computer**

2. **You might have to restart the Kernel to use the installed package**


In [None]:
!pip install --user gdelt

In [None]:
import gdelt

In [None]:
gd = gdelt.gdelt(version=2)

In [None]:
results = gd.Search(['2021-11-10'],table='events', coverage = True)

In [None]:
results.columns

**NOTE**: If you want to know more about the columns, you can look at the [codebook](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf) for more information.

In [None]:
results.tail(10)

## Activity:

1. Select only the rows from the results that are in the US, that is, where 'ActionGeo_CountryCode' is 'US', 'Actor1Name' is 'UNIVERSITY', and 'ActionGeo_ADM1Code' is 'USIN'.
2. Find any interesting news articles based on 'SOURCEURL'.