# Assignment Data Scraping
### Scrape and Analyse

* Beautiful Soup API documentation: [https://beautiful-soup-4.readthedocs.io/en/latest/](https://beautiful-soup-4.readthedocs.io/en/latest/)

In [1]:
pip install requests bs4 scrapy


In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Tasks
Scrape data from the website [http://www.nationmaster.com](http://www.nationmaster.com/), convert it into Pandas data frames and use pandas queries to answer the following questions: 

#### 1
Get the number of internet users per country, remove all NaN entries and return the top 10 countries with the highest absolute number of internet users. 
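The cleaning steps named in the task (drop NaN entries, then take the top rows) can be sketched on a small hand-made frame. The column name `Users` and all values below are hypothetical and are not taken from the scraped page.

```python
import pandas as pd

# Hypothetical sample rows standing in for a scraped table.
raw = pd.DataFrame({
    "Countries": ["A", "B", "C", "D"],
    "Users": ["1200", "n/a", "560", "980"],
})

# Cells that cannot be parsed as numbers become NaN ...
raw["Users"] = pd.to_numeric(raw["Users"], errors="coerce")
# ... and dropna() then removes those rows.
clean = raw.dropna(subset=["Users"])

# nlargest(n, column) returns the n rows with the highest values.
top = clean.nlargest(2, "Users")
print(top)
```

On the real table, `nlargest(10, ...)` would replace the `sort_values(...).head(10)` pair.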

In [3]:
import bs4 as bs
import urllib.request
import pandas as pd

url = "https://www.nationmaster.com/nmx/ranking/total-internet-users"
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source,'lxml')

table = soup.find_all('table')
df = pd.read_html(str(table))[0]

In [4]:
df.columns

Index(['#', '204 Countries', 'Units Per Hundred Persons', 'Last', 'YoY',
       '5‑years CAGR', 'Unnamed: 6'],
      dtype='object')

In [5]:
df.drop(['#', 'Last', 'YoY', '5‑years CAGR', 'Unnamed: 6'], axis=1, inplace=True)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 2 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   204 Countries              204 non-null    object
 1   Units Per Hundred Persons  204 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 3.3+ KB


In [7]:
df.rename(columns={'204 Countries':'Countries'}, inplace=True)

In [8]:
df.columns

Index(['Countries', 'Units Per Hundred Persons'], dtype='object')

In [9]:
df.Countries = df['Countries'].replace('#.* ', '', regex=True)

In [10]:
df.sort_values(by='Units Per Hundred Persons', ascending=False).head(10)

Unnamed: 0,Countries,Units Per Hundred Persons
0,Iceland,98
1,Islands,98
2,Bermuda,97
3,Norway,96
4,Denmark,96
5,Andorra,96
6,Liechtenstein,95
7,Luxembourg,95
8,Islands,95
9,Netherlands,93
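The two rows labelled `Islands` in the output above suggest that multi-word country names were cut down to their last word. One possible cause is the greedy pattern `'#.* '`, which removes everything up to the last space. The sketch below assumes the raw cells look like `#2 Faroe Islands`; that format and the sample names are assumptions, not confirmed from the page.

```python
import pandas as pd

# Hypothetical raw cells; the real table's format may differ.
names = pd.Series(["#1 Iceland", "#2 Faroe Islands", "#3 Bermuda"])

# Greedy pattern: ".*" runs up to the LAST space in the cell,
# so multi-word names lose everything but their final word.
greedy = names.replace("#.* ", "", regex=True)

# Anchored pattern: strips only a leading "#<digits>" and the
# whitespace that follows it.
anchored = names.str.replace(r"^#\d+\s+", "", regex=True)

print(greedy.tolist())
print(anchored.tolist())
```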


#### 2
Get the number of internet users per country, remove all NaN entries and return the top 10 countries with the highest number of internet users relative to the population. (Hint: you need to scrape the population number from another page.)
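If the user counts and the population figures come from two separate tables, as the hint suggests, the relative number can be derived by merging on the country name. This is a minimal sketch with made-up figures; the column names `users`, `population` and `users_pct` are hypothetical.

```python
import pandas as pd

# Hypothetical figures for illustration only.
users = pd.DataFrame({"Countries": ["A", "B", "C"],
                      "users": [50.0, 30.0, None]})
population = pd.DataFrame({"Countries": ["A", "B", "C"],
                           "population": [200.0, 40.0, 10.0]})

# Join the two tables on the country name and drop incomplete rows.
merged = users.merge(population, on="Countries").dropna()

# Share of the population that uses the internet, in percent.
merged["users_pct"] = 100 * merged["users"] / merged["population"]

top = merged.sort_values("users_pct", ascending=False).head(10)
print(top)
```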

In [11]:
url = 'https://www.nationmaster.com/nmx/ranking/individuals-using-the-internet'

source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source,'lxml')

table = soup.find_all('table')
df = pd.read_html(str(table))[0]

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   #                      205 non-null    int64 
 1   205 Countries          205 non-null    object
 2   Percent of Population  205 non-null    object
 3   Last                   205 non-null    int64 
 4   YoY                    195 non-null    object
 5   5‑years CAGR           201 non-null    object
 6   Unnamed: 6             205 non-null    object
dtypes: int64(2), object(5)
memory usage: 11.3+ KB


In [13]:
df.drop(['#', 'Last', 'YoY', '5‑years CAGR', 'Unnamed: 6'], axis=1, inplace=True)

In [14]:
df.rename(columns={'205 Countries':'Countries'}, inplace=True)
df.columns

Index(['Countries', 'Percent of Population'], dtype='object')

In [15]:
df.Countries = df['Countries'].replace('#.* ', '', regex=True)
# raw string so that \D and \s reach the regex engine unchanged
df['Percent of Population'] = df['Percent of Population'].replace(r'(\D|\s)*.%', '', regex=True).astype("float")

[regex expr](https://regex101.com/r/1rqIGu/1)
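An alternative to deleting the characters around the number is to extract the number itself. The sketch below assumes cells of the form `98.26 %`; the sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical percentage cells.
pct = pd.Series(["98.26 %", "7 %", "100.03 %"])

# Capture the first run of digits with an optional decimal part,
# then convert the captured text to float.
values = pct.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
print(values.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame.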

In [16]:
df.sort_values(by='Percent of Population', ascending=False).head(10)

Unnamed: 0,Countries,Percent of Population
0,Aruba,105.26
1,Liechtenstein,104.53
2,Bermuda,103.68
3,Islands,102.82
4,Monaco,101.91
5,Gibraltar,101.47
6,Iceland,100.79
7,Andorra,100.16
8,Luxembourg,100.03
9,Bahrain,99.7


#### 3
Compute the correlation between the crime rate (murders per 100k) and the education level. Compare this to the correlation between crime rate and poverty (relative GDP). Hint: use pandas' built-in correlation function: [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)
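As a rough illustration of the correlation function mentioned in the hint, the sketch below merges two toy tables and computes both the Pearson and the Spearman (rank) correlation. All values and the column names `murder_rate` and `education` are hypothetical.

```python
import pandas as pd

# Hypothetical values, chosen so the relation is monotonic but not linear.
df_a = pd.DataFrame({"COUNTRY": ["A", "B", "C", "D"],
                     "murder_rate": [1.0, 2.0, 3.0, 4.0]})
df_b = pd.DataFrame({"COUNTRY": ["A", "B", "C", "D"],
                     "education": [8.0, 7.0, 4.0, 1.0]})

merged = df_a.merge(df_b, on="COUNTRY")
numeric = merged[["murder_rate", "education"]]

# Pearson measures linear association (the default of DataFrame.corr).
pearson = numeric.corr()
# Spearman works on ranks and is less sensitive to outliers and skew.
spearman = numeric.corr(method="spearman")

print(pearson)
print(spearman)
```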

In [17]:
#url_education = 'https://www.nationmaster.com/nmx/ranking/education-expenditure'
url_pov = 'https://www.nationmaster.com/country-info/stats/Economy/Poverty-and-inequality/Multidimensional-poverty-index#-amount'
url_murder = 'https://www.nationmaster.com/country-info/stats/Crime/Violent-crime/Murder-rate'
urls = [url_pov, url_murder]

dfs = []
for url in urls:
    print(url)
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source, 'lxml')

    # collect all <table> elements and parse the first one into a DataFrame
    table = soup.find_all('table')
    dfs.append(pd.read_html(str(table))[0])

df_pov = dfs[0]
df_mur = dfs[1]

https://www.nationmaster.com/country-info/stats/Economy/Poverty-and-inequality/Multidimensional-poverty-index#-amount
https://www.nationmaster.com/country-info/stats/Crime/Violent-crime/Murder-rate


Prepare Poverty Dataset

In [18]:
df_pov.columns

Index(['#', 'COUNTRY', 'AMOUNT', 'DATE', 'GRAPH', 'HISTORY'], dtype='object')

In [19]:
df_pov.drop(columns=['#', 'DATE', 'GRAPH', 'HISTORY'], inplace=True)
df_pov.rename(columns={'AMOUNT':'poverty'}, inplace=True)
df_pov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   COUNTRY  103 non-null    object 
 1   poverty  103 non-null    float64
dtypes: float64(1), object(1)
memory usage: 1.7+ KB


Prepare Murder Data Set

In [20]:
df_mur.head(5)

Unnamed: 0,#,COUNTRY,AMOUNT,DATE,GRAPH,HISTORY
0,1,Brazil,40974.0,2010,,
1,2,India,40752.0,2009,,
2,3,Mexico,25757.0,2010,,
3,4,Ethiopia,20239.0,2008,,
4,5,Indonesia,18963.0,2008,,


In [21]:
df_mur.drop(columns=['#', 'DATE', 'GRAPH', 'HISTORY'], inplace=True)
df_mur.rename({'AMOUNT':'homicides'}, axis=1, inplace=True)
df_mur.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   COUNTRY    204 non-null    object 
 1   homicides  204 non-null    float64
dtypes: float64(1), object(1)
memory usage: 3.3+ KB


Combine both datasets

In [22]:
df = pd.merge(df_pov, df_mur, on='COUNTRY')

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   COUNTRY    103 non-null    object 
 1   poverty    103 non-null    float64
 2   homicides  103 non-null    float64
dtypes: float64(2), object(1)
memory usage: 3.2+ KB
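`pd.merge` performs an inner join by default, so countries that appear in only one table are dropped without notice. One way to check for that is an outer join with `indicator=True`. The frames below are toy data with hypothetical values.

```python
import pandas as pd

# Hypothetical toy tables with one non-matching country on each side.
left = pd.DataFrame({"COUNTRY": ["A", "B"], "poverty": [0.1, 0.2]})
right = pd.DataFrame({"COUNTRY": ["B", "C"], "homicides": [5.0, 7.0]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both".
check = pd.merge(left, right, on="COUNTRY", how="outer", indicator=True)

# Countries that would be lost in an inner join.
unmatched = check.loc[check["_merge"] != "both", "COUNTRY"].tolist()
print(unmatched)
```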


In [24]:
# select the numeric columns explicitly; COUNTRY holds strings
df[['poverty', 'homicides']].corr()

Unnamed: 0,poverty,homicides
poverty,1.0,0.094618
homicides,0.094618,1.0


### REST API
#### Using data from [https://www.energidataservice.dk](https://www.energidataservice.dk) 

We look at real time energy production: https://www.energidataservice.dk/tso-electricity/electricityprodex5minrealtime

In [25]:
import pandas as pd
import requests
from pandas import json_normalize

In [26]:
#get data from an open energy data service provider
url = 'https://api.energidataservice.dk/datastore_search?resource_id=electricityprodex5minrealtime&limit=500'

response = requests.get(url)
dictr = response.json() #parse json to dict
recs = dictr['result']['records'] 
df = json_normalize(recs) #flatten json files into data frame
df.head()

Unnamed: 0,_id,Minutes5UTC,Minutes5DK,PriceArea,ProductionLt100MW,ProductionGe100MW,OffshoreWindPower,OnshoreWindPower,SolarPower,ExchangeGreatBelt,ExchangeGermany,ExchangeNetherlands,ExchangeNorway,ExchangeSweden,BornholmSE4
0,1949,2018-05-27T22:00:00+00:00,2018-05-28T00:00:00,DK1,186.91,-5.14,585.6,814.6,0.0,-424.19,571.81,,-246.0,395.45,
1,1950,2018-05-27T22:00:00+00:00,2018-05-28T00:00:00,DK2,112.13,159.23,206.72,223.51,0.0,424.19,2.12,,,-66.93,-2.89
2,1951,2018-05-27T22:05:00+00:00,2018-05-28T00:05:00,DK1,191.5,-5.85,602.48,831.38,0.0,-478.59,534.25,,-253.0,441.05,
3,1952,2018-05-27T22:05:00+00:00,2018-05-28T00:05:00,DK2,111.44,176.53,223.09,229.23,0.0,478.59,2.21,,,-160.48,-1.9
4,1953,2018-05-27T22:10:00+00:00,2018-05-28T00:10:00,DK1,188.68,-5.82,612.03,842.72,0.0,-532.18,547.31,,-266.0,493.95,
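The flattening step can be tried without a network call. The sketch below uses a trimmed payload that mirrors the `result` / `records` nesting the cell above indexes into. The two records reuse a few fields from the output above; the payload as a whole is an assumption about the response shape, not a copy of it.

```python
import pandas as pd

# Trimmed, hand-written payload in the shape dictr['result']['records'].
payload = {
    "result": {
        "records": [
            {"_id": 1949, "Minutes5UTC": "2018-05-27T22:00:00+00:00",
             "PriceArea": "DK1", "OffshoreWindPower": 585.6},
            {"_id": 1950, "Minutes5UTC": "2018-05-27T22:00:00+00:00",
             "PriceArea": "DK2", "OffshoreWindPower": 206.72},
        ]
    }
}

# Each record (a dict) becomes one row; keys become columns.
records = payload["result"]["records"]
df_sample = pd.json_normalize(records)
print(df_sample)
```

For a live request, calling `response.raise_for_status()` before `response.json()` is a common way to surface HTTP errors early.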


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   _id                  500 non-null    int64  
 1   Minutes5UTC          500 non-null    object 
 2   Minutes5DK           500 non-null    object 
 3   PriceArea            500 non-null    object 
 4   ProductionLt100MW    500 non-null    float64
 5   ProductionGe100MW    500 non-null    float64
 6   OffshoreWindPower    500 non-null    float64
 7   OnshoreWindPower     500 non-null    float64
 8   SolarPower           500 non-null    float64
 9   ExchangeGreatBelt    500 non-null    float64
 10  ExchangeGermany      500 non-null    float64
 11  ExchangeNetherlands  0 non-null      object 
 12  ExchangeNorway       253 non-null    float64
 13  ExchangeSweden       500 non-null    float64
 14  BornholmSE4          247 non-null    float64
dtypes: float64(10), int64(1), object(4)

#### 4
Compute overview statistics (mean, variance, quantiles, counts,...) for all variables. Hint: there is a single pandas call to get this ...

In [27]:
df.describe()

Unnamed: 0,_id,ProductionLt100MW,ProductionGe100MW,OffshoreWindPower,OnshoreWindPower,SolarPower,ExchangeGreatBelt,ExchangeGermany,ExchangeNorway,ExchangeSweden,BornholmSE4
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,253.0,500.0,247.0
mean,66433.074,184.32804,193.20972,256.14898,386.65624,28.40066,-3.20298,388.8504,155.811344,161.31754,5.934939
std,227313.713959,87.553423,191.439448,162.159724,300.516856,64.60294,408.286726,649.952804,867.672642,397.920958,7.284122
min,1093.0,83.45,-6.36,14.64,41.52,0.0,-592.28,-1317.44,-1531.0,-740.58,-7.06
25%,1492.75,110.14,63.8325,106.9225,142.72,0.0,-389.2975,1.07,-540.0,-172.3,-1.165
50%,2007.5,197.07,150.6,241.68,301.755,0.08,0.0,2.22,129.84,127.13,6.21
75%,2132.25,219.5325,275.635,333.4475,595.78,21.2,372.9075,846.05,953.95,472.25,11.57
max,907901.0,610.62,1086.11,660.9,2329.14,437.21,592.28,1622.75,1333.51,919.54,22.62
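`describe()` reports the standard deviation but not the variance, and by default it skips non-numeric columns. A small sketch of how both gaps might be covered, on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy frame with one text and one numeric column.
sample = pd.DataFrame({
    "PriceArea": ["DK1", "DK2", "DK1", "DK2"],
    "SolarPower": [0.0, 2.0, 4.0, 6.0],
})

# include="all" adds unique/top/freq rows for non-numeric columns.
summary = sample.describe(include="all")

# Sample variance (ddof=1) of the numeric columns only.
variance = sample.select_dtypes("number").var()

print(summary)
print(variance)
```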


#### 5 
Compute the average ***OffshoreWindPower*** per day for the last 7 days.
* Hint: you need to check the API documentation to query the right data.

In [37]:
import datetime

In [47]:
#get data from an open energy data service provider
url = 'https://api.energidataservice.dk/datastore_search?resource_id=electricityprodex5minrealtime&limit=500'

response = requests.get(url)
dictr = response.json() #parse json to dict
recs = dictr['result']['records'] 
df = json_normalize(recs) #flatten json files into data frame
df.head()

Unnamed: 0,_id,Minutes5UTC,Minutes5DK,PriceArea,ProductionLt100MW,ProductionGe100MW,OffshoreWindPower,OnshoreWindPower,SolarPower,ExchangeGreatBelt,ExchangeGermany,ExchangeNetherlands,ExchangeNorway,ExchangeSweden,BornholmSE4
0,1949,2018-05-27T22:00:00+00:00,2018-05-28T00:00:00,DK1,186.91,-5.14,585.6,814.6,0.0,-424.19,571.81,,-246.0,395.45,
1,1950,2018-05-27T22:00:00+00:00,2018-05-28T00:00:00,DK2,112.13,159.23,206.72,223.51,0.0,424.19,2.12,,,-66.93,-2.89
2,1951,2018-05-27T22:05:00+00:00,2018-05-28T00:05:00,DK1,191.5,-5.85,602.48,831.38,0.0,-478.59,534.25,,-253.0,441.05,
3,1952,2018-05-27T22:05:00+00:00,2018-05-28T00:05:00,DK2,111.44,176.53,223.09,229.23,0.0,478.59,2.21,,,-160.48,-1.9
4,1953,2018-05-27T22:10:00+00:00,2018-05-28T00:10:00,DK1,188.68,-5.82,612.03,842.72,0.0,-532.18,547.31,,-266.0,493.95,


In [51]:
cols = [c for c in df.columns if c not in ['Minutes5UTC', 'OffshoreWindPower']]
df.drop(columns=cols, inplace=True)

In [52]:
df.head(1)

Unnamed: 0,Minutes5UTC,OffshoreWindPower
0,2018-05-27T22:00:00+00:00,585.6


[datetime str](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)

In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Minutes5UTC        500 non-null    object 
 1   OffshoreWindPower  500 non-null    float64
dtypes: float64(1), object(1)
memory usage: 7.9+ KB


In [55]:
# n must be passed by keyword in recent pandas versions
df[['Date', 'Time']] = df['Minutes5UTC'].str.split('T', n=1, expand=True)

In [56]:
df.head(2)

Unnamed: 0,Minutes5UTC,OffshoreWindPower,Date,Time
0,2018-05-27T22:00:00+00:00,585.6,2018-05-27,22:00:00+00:00
1,2018-05-27T22:00:00+00:00,206.72,2018-05-27,22:00:00+00:00


In [57]:
df['Date'] = pd.to_datetime(df['Date'])  # parse 'YYYY-MM-DD' strings into datetime64

In [65]:
averages = []
cache = []
last_date = None
counter = 0
max_ = 7
for i, data in df.sort_values(by='Minutes5UTC', ascending=False).iterrows():
    if last_date is None:
        last_date = data.Date
    elif last_date == data.Date:
        cache += [data.OffshoreWindPower]
    else:
        # save result
        averages += [sum(cache)/len(cache)]
        # new date
        last_date = data.Date
        cache += [data.OffshoreWindPower]
        counter += 1
        if counter == max_:
            break


In [66]:
averages

[261.76888888888885,
 293.4699999999999,
 285.26555555555547,
 300.9104999999999,
 225.66057142857136,
 222.42567567567562,
 214.28949999999995]
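Note that the loop above never clears `cache` between dates and does not store the value of the very first row, so each later average also includes values from earlier days. A grouped mean is one way to avoid this bookkeeping. The sketch below is a possible alternative, not the notebook's method, and the readings in it are hypothetical.

```python
import pandas as pd

# Hypothetical 5-minute readings spanning three UTC days.
sample = pd.DataFrame({
    "Minutes5UTC": [
        "2018-05-26T23:55:00+00:00",
        "2018-05-27T00:00:00+00:00",
        "2018-05-27T00:05:00+00:00",
        "2018-05-28T00:00:00+00:00",
    ],
    "OffshoreWindPower": [100.0, 200.0, 400.0, 50.0],
})

# Parse the ISO timestamps directly; no string splitting needed.
sample["Minutes5UTC"] = pd.to_datetime(sample["Minutes5UTC"], utc=True)

# Group by calendar day (UTC) and average within each day.
daily = sample.groupby(sample["Minutes5UTC"].dt.date)["OffshoreWindPower"].mean()

# Keep the seven most recent days that are present in the data.
last_days = daily.sort_index().tail(7)
print(last_days)
```

This averages over both price areas, as the loop does. Grouping by `PriceArea` as well would give one average per area and day.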