In [1]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

# Detour: Exception Handling

Sometimes the interpreter will generate an error that will interrupt the execution of your program.  These are called exceptions and can be handled programmatically.

This part of the content is from [Chapter 10](https://automatetheboringstuff.com/chapter10/) of your ABSP textbook.

## `try` and `except` statements

In [2]:
10/0

ZeroDivisionError: division by zero

In [4]:
a = int(input("Number: "))

Number: 6t


ValueError: invalid literal for int() with base 10: '6t'

In [8]:
try:
    a = int(input("Number 1: "))
    b = int(input("Number 2: "))
    print(a/b)
except ValueError:
    print("Whoa, that's not an integer")
except ZeroDivisionError:
    print("Whoa, you can't divide by zero")

Number 1: 6
Number 2: 4r
Whoa, that's not an integer
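
The same pattern can be wrapped in a function so that it can be tried without typing input each time. This is a minimal sketch; the name `safe_divide` and the returned messages are illustrative assumptions, not part of the textbook.

```python
def safe_divide(a_text, b_text):
    """Parse two strings as integers and divide them, reporting errors as text."""
    try:
        a = int(a_text)   # may raise ValueError
        b = int(b_text)   # may raise ValueError
        result = a / b    # may raise ZeroDivisionError
    except ValueError:
        return "not an integer"
    except ZeroDivisionError:
        return "cannot divide by zero"
    else:
        # The else branch runs only when no exception was raised.
        return result


print(safe_divide("6", "4"))
print(safe_divide("6", "4r"))
print(safe_divide("6", "0"))
```

Each `except` clause handles one kind of failure, so the caller can tell bad input apart from a division by zero.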


### `raise` statement

In [10]:
try:
    a = int(input("Number 1: "))
    if a<0:
        raise ValueError("Entered a Negative")
except ValueError:
    print("Whoa, that's not an integer")
    raise

Number 1: 5t
Whoa, that's not an integer


ValueError: invalid literal for int() with base 10: '5t'
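
To see both ways a `ValueError` can arise here, the sketch below separates the check into a helper and reports the message carried by the exception. The helper name `parse_non_negative` is a hypothetical choice for illustration.

```python
def parse_non_negative(text):
    """Convert text to an int and reject negative values."""
    number = int(text)  # raises ValueError for text such as '5t'
    if number < 0:
        # Raise the same exception type ourselves for a rule Python does not enforce.
        raise ValueError(f"Entered a negative: {number}")
    return number


for text in ["5", "-3", "5t"]:
    try:
        print(parse_non_negative(text))
    except ValueError as err:
        # 'err' holds the message from whichever line raised the exception.
        print(f"Rejected {text!r}: {err}")
```

Binding the exception with `as err` lets the handler print the specific reason instead of one fixed message.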

### `assert` statement

An assertion is a sanity check to make sure your code isn’t doing something obviously wrong. These sanity checks are performed by `assert` statements. If the sanity check fails, then an `AssertionError` exception is raised.

In [12]:
instructorName = 'Sean'

assert instructorName == 'Sriram', "Wow! The instructor has to be Sriram!"

AssertionError: Wow! The instructor has to be Sriram!
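
Assertions are most useful inside functions, where they document what the code expects. The following is a small sketch with a hypothetical `average` function; note that Python skips `assert` statements when run with the `-O` option, so they should not replace normal input validation.

```python
def average(values):
    """Return the mean of a non-empty list of numbers."""
    # Sanity check: an empty list would otherwise cause a ZeroDivisionError below.
    assert len(values) > 0, "average() needs at least one value"
    return sum(values) / len(values)


print(average([2, 4, 6]))

try:
    average([])
except AssertionError as err:
    print("Assertion failed:", err)
```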

# Web Scraping

Web scraping is a large topic and involves a deep understanding of how websites are created and managed. You will also need to know some fundamentals of HTML. In this section we will cover only the basics of extracting data from websites.

### `pd.read_html()`

Using the pandas package, you can read the tables that appear on a website. `pd.read_html()` reads all the tables that are available on the webpage.

The following example extracts the NBA 2019 draft data set from the [Sports Reference](https://www.basketball-reference.com/draft/NBA_2019.html) website.

In [13]:
nba_data_list = pd.read_html("https://www.basketball-reference.com/draft/NBA_2019.html") 
type(nba_data_list)

list

You will notice that `read_html()` returns a list. A webpage can contain multiple tables, so `read_html()` returns a list of DataFrames, one per table. This webpage has only one table, so you can access it as the element at index 0.

In [14]:
nba_df = nba_data_list[0]
nba_df

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Round 1,Round 1,Unnamed: 5_level_0,Totals,Totals,Totals,Totals,...,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Advanced,Advanced,Advanced,Advanced
Unnamed: 0_level_1,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP,PTS,TRB,AST,WS,WS/48,BPM,VORP
0,1,1,NOP,Zion Williamson,Duke,2,85,2694,2187,591,...,.333,.683,31.7,25.7,7.0,3.2,10.6,.189,4.6,4.5
1,2,2,MEM,Ja Morant,Murray State,3,141,4515,2689,574,...,.320,.753,32.0,19.1,4.1,7.3,8.0,.085,0.0,2.3
2,3,3,NYK,RJ Barrett,Duke,3,140,4624,2270,761,...,.367,.685,33.0,16.2,5.4,2.8,3.8,.040,-2.8,-0.9
3,4,4,LAL,De'Andre Hunter,Virginia,3,96,2981,1239,424,...,.354,.782,31.1,12.9,4.4,1.7,1.8,.029,-3.5,-1.1
4,5,5,CLE,Darius Garland,Vanderbilt,3,123,3948,1827,262,...,.378,.857,32.1,14.9,2.1,5.1,0.5,.006,-3.3,-1.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,56,56,LAC,Jaylen Hands,UCLA,,,,,,...,,,,,,,,,,
58,57,57,NOP,Jordan Bone,Tennessee,2,24,249,68,28,...,.286,,10.4,2.8,1.2,1.1,0.1,.010,-5.9,-0.2
59,58,58,GSW,Miye Oni,Yale,3,68,636,140,102,...,.346,.818,9.4,2.1,1.5,0.5,1.0,.078,-2.6,-0.1
60,59,59,TOR,Dewan Hernandez,Miami (FL),1,6,28,14,14,...,.500,.600,4.7,2.3,2.3,0.5,0.0,.043,-9.6,-0.1


Information on web pages is not always clean. In this case you might have observed that the column names form a multilevel index. You can replace them with the single-level names shown on the website by assigning a new list of column names.
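
The multilevel header described above can also be flattened programmatically rather than typed out. This is a sketch on a tiny hand-built DataFrame that mimics the structure of the scraped table; the rule for which top-level labels to keep is an assumption made for this example.

```python
import pandas as pd

# A small stand-in for the scraped table: two header levels, one data row.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Round 1", "Player"),
    ("Totals", "PTS"),
    ("Per Game", "PTS"),
])
demo = pd.DataFrame([[1, "Zion Williamson", 2187, 25.7]], columns=columns)

flat = demo.copy()
# Keep only the lower label when the top label carries no information,
# otherwise join both levels so that 'Totals PTS' and 'Per Game PTS' stay distinct.
flat.columns = [
    lower if top.startswith(("Unnamed", "Round")) else f"{top} {lower}"
    for top, lower in demo.columns
]
print(flat.columns.tolist())
```

Joining the two levels avoids ending up with two columns that are both called `PTS`.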

In [15]:
nba_df.columns = ['Rk', 'Pk', 'Tm','Player','College', 'Yrs','G', 'MP', 'PTS','TRB','AST','FG%', 
                    '3P%', 'FT%', 'MP', 'PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP']

nba_df.head()

Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST,WS,WS/48,BPM,VORP
0,1,1,NOP,Zion Williamson,Duke,2,85,2694,2187,591,...,0.333,0.683,31.7,25.7,7.0,3.2,10.6,0.189,4.6,4.5
1,2,2,MEM,Ja Morant,Murray State,3,141,4515,2689,574,...,0.32,0.753,32.0,19.1,4.1,7.3,8.0,0.085,0.0,2.3
2,3,3,NYK,RJ Barrett,Duke,3,140,4624,2270,761,...,0.367,0.685,33.0,16.2,5.4,2.8,3.8,0.04,-2.8,-0.9
3,4,4,LAL,De'Andre Hunter,Virginia,3,96,2981,1239,424,...,0.354,0.782,31.1,12.9,4.4,1.7,1.8,0.029,-3.5,-1.1
4,5,5,CLE,Darius Garland,Vanderbilt,3,123,3948,1827,262,...,0.378,0.857,32.1,14.9,2.1,5.1,0.5,0.006,-3.3,-1.4


#### Clean the data

Data downloaded from webpages almost always needs to be cleaned. The following is a simple example of deleting unnecessary data.

You will notice that data from the internet is **messy**. For example, if you look at rows 28 through 34, you will see that the rows at index 30 and 31 contain data that is not required. Look at the [website](https://www.basketball-reference.com/draft/NBA_2017.html): the table has a break where the header is repeated, so the DataFrame picks up unnecessary rows.

In [16]:
nba_df.loc[28:34]

Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST,WS,WS/48,BPM,VORP
28,29,29,SAS,Keldon Johnson,Kentucky,3,97,2602,1205,539,...,.341,.735,26.8,12.4,5.6,1.6,4.9,.090,-1.7,0.2
29,30,30,MIL,Kevin Porter Jr.,USC,3,87,2346,1070,306,...,.320,.721,27.0,12.3,3.5,3.8,-0.1,-.002,-4.2,-1.3
30,,,,Round 2,Round 2,,Totals,Totals,Totals,Totals,...,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Advanced,Advanced,Advanced,Advanced
31,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP,PTS,TRB,AST,WS,WS/48,BPM,VORP
32,31,31,BRK,Nicolas Claxton,Georgia,3,51,857,304,232,...,.167,.489,16.8,6.0,4.5,0.9,2.5,.141,-0.4,0.4
33,32,32,PHO,KZ Okpala,Stanford,3,47,491,102,76,...,.250,.529,10.4,2.2,1.6,0.4,0.2,.024,-6.6,-0.6
34,33,33,PHI,Carsen Edwards,Purdue,2,68,627,244,73,...,.302,.750,9.2,3.6,1.1,0.6,0.4,.030,-4.7,-0.4


In [17]:
# Drop the rows at index 30 and 31. inplace=True modifies nba_df directly instead of returning a new DataFrame.
nba_df.drop([30,31], axis=0, inplace= True)
nba_df.loc[28:34]

Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST,WS,WS/48,BPM,VORP
28,29,29,SAS,Keldon Johnson,Kentucky,3,97,2602,1205,539,...,0.341,0.735,26.8,12.4,5.6,1.6,4.9,0.09,-1.7,0.2
29,30,30,MIL,Kevin Porter Jr.,USC,3,87,2346,1070,306,...,0.32,0.721,27.0,12.3,3.5,3.8,-0.1,-0.002,-4.2,-1.3
32,31,31,BRK,Nicolas Claxton,Georgia,3,51,857,304,232,...,0.167,0.489,16.8,6.0,4.5,0.9,2.5,0.141,-0.4,0.4
33,32,32,PHO,KZ Okpala,Stanford,3,47,491,102,76,...,0.25,0.529,10.4,2.2,1.6,0.4,0.2,0.024,-6.6,-0.6
34,33,33,PHI,Carsen Edwards,Purdue,2,68,627,244,73,...,0.302,0.75,9.2,3.6,1.1,0.6,0.4,0.03,-4.7,-0.4
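
Dropping rows by index works when you know where the break is. When you do not, one option is to keep only the rows whose rank parses as a number. This is a sketch on a small hand-built DataFrame shaped like rows 29 to 32 above, not on the live table.

```python
import pandas as pd

# Stand-in for the scraped rows: two players with a repeated header in between.
demo = pd.DataFrame({
    "Rk": ["30", None, "Rk", "31"],
    "Player": ["Kevin Porter Jr.", "Round 2", "Player", "Nicolas Claxton"],
    "PTS": ["1070", "Totals", "PTS", "304"],
})

# errors="coerce" turns anything that is not a number into NaN,
# so header rows and the missing value are flagged as False.
is_player_row = pd.to_numeric(demo["Rk"], errors="coerce").notna()

clean = demo[is_player_row].copy()
# With the header text gone, the numeric columns can be converted from strings.
clean["Rk"] = clean["Rk"].astype(int)
clean["PTS"] = clean["PTS"].astype(int)
print(clean)
```

A side effect of the stray header rows is that every column is read as text, which is why the conversion step is needed.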


### Activity

* Use `pd.read_html()` to download the information on all the states from the [Wikipedia](https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals) page. 
    * Do the column names appear appropriately? Make sure you set the column names appropriately. 
    * Do you see any redundant rows appearing? Remove them from the DataFrame. 

In [20]:
state_caps_list = pd.read_html("https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals")
state_caps = state_caps_list[0]
state_caps.head()

Unnamed: 0_level_0,State,Abr.,State-hood,Capital,Capital since,Area (mi²),Population (2018),Population (2018),Population (2018),Population (2018),Notes
Unnamed: 0_level_1,State,Abr.,State-hood,Capital,Capital since,Area (mi²),City,Metropolitan,Rank in state,Rank in US,Notes
0,Alabama,AL,1819,Montgomery,1846,159.8,198218,373903.0,2,119.0,
1,Alaska,AK,1959,Juneau,1906,2716.7,31275,,2,,Largest capital by municipal land area.
2,Arizona,AZ,1912,Phoenix,1889,517.6,1660272,4857962.0,1,5.0,Largest capital by population.
3,Arkansas,AR,1836,Little Rock,1821,116.2,193524,699757.0,1,117.0,
4,California,CA,1850,Sacramento,1854,97.9,508529,2345210.0,6,35.0,Largest capital by population to not be the mo...


In [23]:
state_caps.columns = ['State', 'Abr.', 'State-hood', 'Capital', 'Capital since', 'Area', 'City', 'Metropolitan', 
                     'Rank in state', 'Rank in US', 'Notes']
state_caps.head()

Unnamed: 0,State,Abr.,State-hood,Capital,Capital since,Area,City,Metropolitan,Rank in state,Rank in US,Notes
0,Alabama,AL,1819,Montgomery,1846,159.8,198218,373903.0,2,119.0,
1,Alaska,AK,1959,Juneau,1906,2716.7,31275,,2,,Largest capital by municipal land area.
2,Arizona,AZ,1912,Phoenix,1889,517.6,1660272,4857962.0,1,5.0,Largest capital by population.
3,Arkansas,AR,1836,Little Rock,1821,116.2,193524,699757.0,1,117.0,
4,California,CA,1850,Sacramento,1854,97.9,508529,2345210.0,6,35.0,Largest capital by population to not be the mo...


In [24]:
state_caps.shape

(50, 11)

### Activity

* Download the list of the top 250 movies from [IMDb](http://www.imdb.com/chart/top?ref_=nv_wl_img_3).

* Clean the data and remove unnecessary rows and columns.

* Which movie released in 2014 has the highest IMDb rating?

In [29]:
movies = pd.read_html("https://www.imdb.com/chart/top?ref_=nv_wl_img_3")[0]
movies.head()

Unnamed: 0.1,Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,Unnamed: 4
0,,1. The Shawshank Redemption (1994),9.2,12345678910 NOT YET RELEASED Seen,
1,,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,
2,,3. The Godfather: Part II (1974),9.0,12345678910 NOT YET RELEASED Seen,
3,,4. The Dark Knight (2008),9.0,12345678910 NOT YET RELEASED Seen,
4,,5. 12 Angry Men (1957),8.9,12345678910 NOT YET RELEASED Seen,


In [30]:
movies.columns = ['Image', 'Rank and Title', 'Rating', 'Your Rating', 'Star']
movies.head()

Unnamed: 0,Image,Rank and Title,Rating,Your Rating,Star
0,,1. The Shawshank Redemption (1994),9.2,12345678910 NOT YET RELEASED Seen,
1,,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,
2,,3. The Godfather: Part II (1974),9.0,12345678910 NOT YET RELEASED Seen,
3,,4. The Dark Knight (2008),9.0,12345678910 NOT YET RELEASED Seen,
4,,5. 12 Angry Men (1957),8.9,12345678910 NOT YET RELEASED Seen,


In [31]:
movies.drop(['Image', 'Your Rating', 'Star'], axis = 1, inplace=True)
movies.head()

Unnamed: 0,Rank and Title,Rating
0,1. The Shawshank Redemption (1994),9.2
1,2. The Godfather (1972),9.1
2,3. The Godfather: Part II (1974),9.0
3,4. The Dark Knight (2008),9.0
4,5. 12 Angry Men (1957),8.9


In [33]:
shaw = '1. The Shawshank Redemption (1994)'
shaw[-5:-1]

'1994'

In [35]:
movies['year'] = movies['Rank and Title'].str[-5:-1]

In [36]:
movies.head()

Unnamed: 0,Rank and Title,Rating,year
0,1. The Shawshank Redemption (1994),9.2,1994
1,2. The Godfather (1972),9.1,1972
2,3. The Godfather: Part II (1974),9.0,1974
3,4. The Dark Knight (2008),9.0,2008
4,5. 12 Angry Men (1957),8.9,1957
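
Slicing with `.str[-5:-1]` relies on the year always sitting in the last six characters. A regular expression is a slightly more defensive alternative that also splits out the rank and title. This is a sketch on three titles copied from the table above; the group names `rank`, `title`, and `year` are choices made for this example.

```python
import pandas as pd

titles = pd.Series([
    "1. The Shawshank Redemption (1994)",
    "2. The Godfather (1972)",
    "27. Interstellar (2014)",
])

# Named groups become column names in the DataFrame returned by str.extract().
pattern = r"^(?P<rank>\d+)\.\s+(?P<title>.+)\s+\((?P<year>\d{4})\)$"
parts = titles.str.extract(pattern)

# The extracted pieces are strings, so convert the numeric ones.
parts["rank"] = parts["rank"].astype(int)
parts["year"] = parts["year"].astype(int)
print(parts)
```

Storing the year as an integer lets you filter with `parts["year"] == 2014` instead of comparing against the string `'2014'`.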


In [37]:
movies_2014 = movies[(movies['year'] == '2014')]
movies_2014

Unnamed: 0,Rank and Title,Rating,year
26,27. Interstellar (2014),8.5,2014
42,43. Whiplash (2014),8.5,2014
184,185. Wild Tales (2014),8.1,2014
192,193. The Grand Budapest Hotel (2014),8.1,2014
200,201. Gone Girl (2014),8.1,2014


In [38]:
movies_2014.sort_values('Rating', ascending=False)

Unnamed: 0,Rank and Title,Rating,year
26,27. Interstellar (2014),8.5,2014
42,43. Whiplash (2014),8.5,2014
184,185. Wild Tales (2014),8.1,2014
192,193. The Grand Budapest Hotel (2014),8.1,2014
200,201. Gone Girl (2014),8.1,2014


# Packages for web scraping

* urllib
* requests
* **BeautifulSoup**
* mechanize

This will require some fundamentals of HTML, the language used to display webpages in the browser.

In [39]:
import urllib
import requests
from bs4 import BeautifulSoup

In [40]:
req = requests.get("https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals")
page = req.text

page_soup = BeautifulSoup(page, 'html.parser')


In [41]:
page_soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6916f457-82b0-4c76-92ba-0ed9d3291404","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_U.S._state_capitals","wgTitle":"List of U.S. state capitals","wgCurRevisionId":7837958,"wgRevisionId":7837958,"wgArticleId":18635,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["State capitals in the United States","Lists of cities in the United States"],"wgPageContentLan

You can print the actual webpage and its contents. 

**Warning**: The contents of a webpage are messy and may not make sense at first glance. However, if you want to scrape any website, you will have to be patient and look through the contents to find the information you need.

In [42]:
print(page_soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6916f457-82b0-4c76-92ba-0ed9d3291404","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_U.S._state_capitals","wgTitle":"List of U.S. state capitals","wgCurRevisionId":7837958,"wgRevisionId":7837958,"wgArticleId":18635,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["State capitals in the United States","Lists of cities in the United States"],

In [43]:
page_soup.title

<title>List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia</title>

In [44]:
page_soup.title.string

'List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia'

### Searching in the webpage

You can programmatically search through a webpage to find the tables it contains. You can do that by using the **`find_all()`** method, which returns a list of every matching tag.

In [45]:
states_table = page_soup.find_all("table")
states_table

[<table class="wikitable sortable">
 <caption>State capitals of the United States
 </caption>
 <tbody><tr>
 <th rowspan="2">State</th>
 <th rowspan="2">Abr.</th>
 <th rowspan="2">State-hood</th>
 <th rowspan="2">Capital</th>
 <th rowspan="2">Capital since</th>
 <th rowspan="2">Area (mi²)</th>
 <th colspan="4">Population (2018)</th>
 <th rowspan="2">Notes
 </th></tr>
 <tr>
 <th><a href="/wiki/List_of_United_States_cities_by_population" title="List of United States cities by population">City</a>
 </th>
 <th>Metropolitan
 </th>
 <th>Rank in state
 </th>
 <th>Rank in US
 </th></tr>
 <tr>
 <td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
 <td>AL</td>
 <td align="center">1819</td>
 <td><a href="/wiki/Montgomery,_Alabama" title="Montgomery, Alabama">Montgomery</a></td>
 <td align="center">1846</td>
 <td align="right">159.8</td>
 <td align="right">198,218</td>
 <td align="right">373,903</td>
 <td align="center">2</td>
 <td align="center">119</td>
 <td>
 </td></tr>
 <tr>
 <td><a hre
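
Once a table tag has been found, its rows and cells can be walked with further `find_all()` calls. The sketch below uses a short hand-written HTML fragment that imitates the structure of the Wikipedia table (it is abridged and not the real page source), so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Abridged, hand-written fragment imitating the state capitals table.
html = """
<table class="wikitable sortable">
<tr><th>State</th><th>Abr.</th><th>Capital</th></tr>
<tr><td><a href="/wiki/Alabama">Alabama</a></td><td>AL</td><td>Montgomery</td></tr>
<tr><td>Alaska</td><td>AK</td><td>Juneau</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ matches when 'wikitable' is one of the tag's CSS classes.
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    # Header cells are <th>, data cells are <td>; get_text() drops nested tags such as <a>.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

The first inner list holds the header and the remaining lists hold the data rows, which is the shape needed to build a DataFrame by hand.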

# Web Scraping through Application Programming Interface (API)

Many websites provide APIs. You can use these APIs to retrieve data from services such as Twitter, Google Trends, etc.

In this section, we will use a simple API from [Open Notify](http://open-notify.org/), an open source project that serves NASA data, to retrieve data about the International Space Station (ISS). 

Some of the content presented here is based on [Dataquest](https://www.dataquest.io/blog/python-api-tutorial/). 

#### Current ISS position

In [46]:
import requests
response = requests.get("http://api.open-notify.org/iss-now.json")

print(response.status_code)

200


In [48]:
response.content

b'{"iss_position": {"longitude": "-134.9620", "latitude": "-48.0068"}, "message": "success", "timestamp": 1636660338}'

A request returns a status code that tells you whether it succeeded; 200 means the request was successful. [This list](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) describes the codes in more detail.
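
Status codes can be looked up from Python's standard library, which helps when deciding how to react to a response. This is a small sketch; the helper `describe_status` and the category labels are assumptions made for illustration.

```python
from http import HTTPStatus

# The first digit of a status code gives its general category.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for codes the library does not know
    return f"{code} {status.phrase} ({CATEGORIES[code // 100]})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` raises an exception for 4xx and 5xx codes, which combines well with the `try` and `except` statements from the detour at the start.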

In [52]:
response = requests.get("http://api.open-notify.org/iss-now.json")
pd.read_json(response.content)

Unnamed: 0,iss_position,message,timestamp
latitude,-48.5755,success,2021-11-11 19:52:42
longitude,-132.8465,success,2021-11-11 19:52:42
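
The response body is plain JSON, so it can also be decoded into a dictionary instead of a DataFrame. This sketch reuses the bytes shown in the `response.content` output above so that it runs offline; with a live response, `response.json()` performs the same decoding.

```python
import json
from datetime import datetime, timezone

# Bytes copied from the response.content output shown earlier.
raw = (b'{"iss_position": {"longitude": "-134.9620", "latitude": "-48.0068"}, '
       b'"message": "success", "timestamp": 1636660338}')

payload = json.loads(raw)

# The coordinates arrive as strings, so convert them before doing arithmetic.
latitude = float(payload["iss_position"]["latitude"])
longitude = float(payload["iss_position"]["longitude"])

# The timestamp is in seconds since 1970-01-01 UTC.
when = datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc)

print(latitude, longitude, when.isoformat())
```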


#### Current Number of People In Space

In [53]:
response = requests.get("http://api.open-notify.org/astros.json")
pd.read_json(response.content)

Unnamed: 0,message,people,number
0,success,"{'name': 'Mark Vande Hei', 'craft': 'ISS'}",10
1,success,"{'name': 'Pyotr Dubrov', 'craft': 'ISS'}",10
2,success,"{'name': 'Anton Shkaplerov', 'craft': 'ISS'}",10
3,success,"{'name': 'Zhai Zhigang', 'craft': 'Shenzhou 13'}",10
4,success,"{'name': 'Wang Yaping', 'craft': 'Shenzhou 13'}",10
5,success,"{'name': 'Ye Guangfu', 'craft': 'Shenzhou 13'}",10
6,success,"{'name': 'Raja Chari', 'craft': 'ISS'}",10
7,success,"{'name': 'Tom Marshburn', 'craft': 'ISS'}",10
8,success,"{'name': 'Kayla Barron', 'craft': 'ISS'}",10
9,success,"{'name': 'Matthias Maurer', 'craft': 'ISS'}",10
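
In the table above each `people` cell still holds a dictionary. `pd.json_normalize()` can expand such a list of dictionaries into proper columns. This sketch uses an abridged, hand-copied payload with three of the people listed above, so the `number` field is 3 here rather than 10.

```python
import pandas as pd

# Abridged payload in the same shape as the astros.json response.
payload = {
    "message": "success",
    "number": 3,
    "people": [
        {"name": "Mark Vande Hei", "craft": "ISS"},
        {"name": "Zhai Zhigang", "craft": "Shenzhou 13"},
        {"name": "Raja Chari", "craft": "ISS"},
    ],
}

# Each dictionary becomes a row and each key becomes a column.
people = pd.json_normalize(payload["people"])
print(people)

# Count how many people are aboard each craft.
print(people["craft"].value_counts())
```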


# Global Database of Events, Language, and Tone (GDELT) API

[GDELT](https://www.gdeltproject.org/about.html) describes itself as the largest, most comprehensive, and highest-resolution open database of human society ever created. If you have never seen it, you should explore this open database. It is unusual in scope and offers many opportunities for data analysis.

### Install the package

To access the database through an API, you need to install the `gdelt` package. 

In a cell in your Jupyter notebook, run the following command.  

`!pip install --user gdelt`

This should install the `gdelt` package that we use here. 

**Important Notes**

1. **You should be able to install any package this way on your computer**

2. **You might have to restart the Kernel to use the installed package**


In [54]:
!pip install --user gdelt



In [1]:
import gdelt

In [2]:
gd = gdelt.gdelt(version=2)

In [3]:
results = gd.Search(['2021-11-10'],table='events', coverage = True)

In [4]:
results.columns

Index(['GLOBALEVENTID', 'SQLDATE', 'MonthYear', 'Year', 'FractionDate',
       'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor1KnownGroupCode',
       'Actor1EthnicCode', 'Actor1Religion1Code', 'Actor1Religion2Code',
       'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code', 'Actor2Code',
       'Actor2Name', 'Actor2CountryCode', 'Actor2KnownGroupCode',
       'Actor2EthnicCode', 'Actor2Religion1Code', 'Actor2Religion2Code',
       'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code', 'IsRootEvent',
       'EventCode', 'CAMEOCodeDescription', 'EventBaseCode', 'EventRootCode',
       'QuadClass', 'GoldsteinScale', 'NumMentions', 'NumSources',
       'NumArticles', 'AvgTone', 'Actor1Geo_Type', 'Actor1Geo_FullName',
       'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code', 'Actor1Geo_ADM2Code',
       'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor1Geo_FeatureID',
       'Actor2Geo_Type', 'Actor2Geo_FullName', 'Actor2Geo_CountryCode',
       'Actor2Geo_ADM1Code', 'Actor2Geo_ADM2Code

**NOTE**: If you want to know more about the columns, look at the [codebook](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf) for more information. 

In [5]:
results.tail(10)

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
135695,1013712485,20211110,202111,2021,2021.8493,USAMED,REUTERS,USA,,,...,1,"Vietnam, Republic Of",VM,VM,,16.166667,107.833333,VM,20211110234500,https://www.reuters.com/business/energy/us-rej...
135696,1013712486,20211110,202111,2021,2021.8493,USAMED,REUTERS,USA,,,...,1,"Vietnam, Republic Of",VM,VM,,16.166667,107.833333,VM,20211110234500,https://www.reuters.com/business/energy/us-rej...
135697,1013712487,20211110,202111,2021,2021.8493,USAOPP,AUSTIN,USA,,,...,3,"Fort Worth, Texas, United States",US,USTX,,32.7254,-97.3208,1380947,20211110234500,https://www.fwweekly.com/2021/11/10/poor-judgm...
135698,1013712488,20211110,202111,2021,2021.8493,USAOPP,AUSTIN,USA,,,...,2,"Texas, United States",US,USTX,,31.106,-97.6475,TX,20211110234500,https://www.fwweekly.com/2021/11/10/poor-judgm...
135699,1013712489,20211110,202111,2021,2021.8493,USAPRI,PUERTO RICAN,USA,,,...,3,"Washington, District of Columbia, United States",US,USDC,DC001,38.8951,-77.0364,531871,20211110234500,https://cnsnews.com/blog/megan-williams/levin-...
135700,1013712490,20211110,202111,2021,2021.8493,USAREL,UNITED STATES,USA,,,...,3,"Charlotte, North Carolina, United States",US,USNC,NC119,35.2271,-80.8431,1019610,20211110234500,https://www.wwaytv3.com/nc-congressional-map-t...
135701,1013712491,20211110,202111,2021,2021.8493,haz,HAZARA,,,haz,...,4,"Daykundi, Daykondi, Afghanistan",AF,AF41,100027,33.9167,65.9167,-3373590,20211110234500,https://www.wwaytv3.com/taliban-official-at-le...
135702,1013712492,20211110,202111,2021,2021.8493,haz,HAZARA,,,haz,...,4,"Kunduz, Kondoz, Afghanistan",AF,AF24,3642,36.729,68.857,-3381731,20211110234500,https://www.wwaytv3.com/taliban-official-at-le...
135703,1013712493,20211110,202111,2021,2021.8493,yor,YORUBA,,,yor,...,4,"Abuja, Abuja Federal Capital Territory, Nigeria",NI,NI11,191147,9.08333,7.53333,-1997013,20211110234500,https://punchng.com/agf-says-political-solutio...
135704,1013712494,20211110,202111,2021,2021.8493,yor,YORUBA,,,yor,...,4,"Abuja, Abuja Federal Capital Territory, Nigeria",NI,NI11,191147,9.08333,7.53333,-1997013,20211110234500,https://punchng.com/agf-says-political-solutio...


## Activity:

1. Select only those rows from the results that are in the US and in Indiana with a university as the first actor, that is, 'ActionGeo_CountryCode' is 'US', 'Actor1Name' is 'UNIVERSITY', and 'ActionGeo_ADM1Code' is 'USIN'. 
2. Find any interesting news articles based on 'SOURCEURL'

In [7]:
results_IN = results[ (results['ActionGeo_CountryCode'] == 'US') & 
                     (results['Actor1Name'] == 'UNIVERSITY') & 
                    (results['ActionGeo_ADM1Code'] == 'USIN') ]
results_IN.head()

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
5656,1013513276,20211110,202111,2021,2021.8493,EDU,UNIVERSITY,,,,...,3,"Purdue University, Indiana, United States",US,USIN,,40.4281,-86.9225,441652,20211110001500,http://www.wbiw.com/2021/11/09/lawrence-county...
50997,1013581230,20211110,202111,2021,2021.8493,EDUEDU,UNIVERSITY,,,,...,2,"Indiana, United States",US,USIN,,39.8647,-86.2604,IN,20211110100000,https://www.theledger.com/story/business/manuf...
66458,1013612243,20211110,202111,2021,2021.8493,EDU,UNIVERSITY,,,,...,3,"Purdue University, Indiana, United States",US,USIN,,40.4281,-86.9225,441652,20211110131500,https://www.tmnews.com/story/news/2021/11/10/i...
81329,1013635166,20211110,202111,2021,2021.8493,EDU,UNIVERSITY,,,,...,2,"Indiana, United States",US,USIN,,39.8647,-86.2604,IN,20211110153000,https://www.chronicle-tribune.com/eedition/pag...
87949,1013643702,20211110,202111,2021,2021.8493,EDU,UNIVERSITY,,,,...,2,"Indiana, United States",US,USIN,,39.8647,-86.2604,IN,20211110161500,https://beaver1003.com/mornings/a-cool-idea-fo...


In [9]:
list(results_IN['SOURCEURL'].iloc[:5])

['http://www.wbiw.com/2021/11/09/lawrence-county-commissioners-stress-importance-of-county-employees-to-complete-security-training/',
 'https://www.theledger.com/story/business/manufacturing/2021/11/09/florida-poly-tech-professor-sanna-siddiqui-lands-grant-study-3-d-printing-jet-rocket-parts/6087698001/',
 'https://www.tmnews.com/story/news/2021/11/10/indot-indiana-lawrence-county-commissioners-vote-contract-approved/6354648001/',
 'https://www.chronicle-tribune.com/eedition/page-a1/page_ef80ebd1-0531-5b02-a926-7fb01fa5b94d.html',
 'https://beaver1003.com/mornings/a-cool-idea-for-storing-your-make-up-brushes/']
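
GDELT often records several events for the same article, as the repeated URLs in the `results.tail(10)` output suggest, so it can help to remove duplicate links after filtering. This is a sketch on a small hand-built DataFrame; the `example.com` URLs are placeholders, not real GDELT records.

```python
import pandas as pd

# Stand-in for the GDELT results, keeping only the columns used in the filter.
events = pd.DataFrame({
    "Actor1Name": ["UNIVERSITY", "UNIVERSITY", "AUSTIN", "UNIVERSITY"],
    "ActionGeo_CountryCode": ["US", "US", "US", "US"],
    "ActionGeo_ADM1Code": ["USIN", "USIN", "USTX", "USIN"],
    "SOURCEURL": [
        "https://example.com/story-a",  # placeholder URLs
        "https://example.com/story-a",
        "https://example.com/story-b",
        "https://example.com/story-c",
    ],
})

# Each comparison yields a boolean Series; & combines them row by row.
mask = (
    (events["ActionGeo_CountryCode"] == "US")
    & (events["Actor1Name"] == "UNIVERSITY")
    & (events["ActionGeo_ADM1Code"] == "USIN")
)

# Select the URL column for matching rows and keep each article once.
urls = events.loc[mask, "SOURCEURL"].drop_duplicates().tolist()
print(urls)
```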