In [14]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Lab 2: Web scraping and API requests

In this lab exercise you will practice scraping data from a website and doing some preliminary analysis on it.

__Deadline: Friday, Feb 25 11:59__



## Part 1: Scraping Data From Wikipedia

We completed a similar task during lecture. You have to scrape a specific Wikipedia page and answer some questions about the data you have collected.
You have to get the data about different countries and their respective populations from the following page:
[https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population](https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population)

This page contains multiple tables for past and future population of countries. For the first part of this lab do the following:

1. Fetch the data from Wikipedia with the "requests" library
2. Parse the HTML data with the BeautifulSoup library
3. Use BeautifulSoup to extract specific tables
4. Combine the tables and convert the data into a dictionary
5. Make a pandas dataframe from the dictionary 
6. Answer some questions and do some basic visualization!



### 1.1 Get the data from Wikipedia (5 pts)

Use the "requests" library.

In [15]:
# Your code here 
import requests

# Download the raw HTML of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_future_population"
response = requests.get(url)

### 1.2 Parse html data with BeautifulSoup (10 pts)

Parse the data using BeautifulSoup. Remember that BeautifulSoup has many useful methods such as prettify(), find(attribute), and find_all(attribute). Check the documentation for more info: [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
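
As a quick illustration of these three methods, here is a minimal sketch on a small hand-written page. The HTML string and the names `demo_html` and `demo_soup` are made up for this example and are not part of the lab data. It assumes the built-in "html.parser" backend, which may behave slightly differently from html5lib on malformed pages.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used only for illustration
demo_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed
demo_soup = BeautifulSoup(demo_html, "html.parser")

# find() returns the first matching tag (or None if nothing matches)
title_text = demo_soup.find("title").text

# find_all() returns a list-like collection of every matching tag
paragraph_texts = [p.text for p in demo_soup.find_all("p")]

# prettify() returns the parsed document as an indented string
pretty_html = demo_soup.prettify()

print(title_text)
print(paragraph_texts)
```

Because `find()` can return `None`, it is usually worth checking the result before reading `.text` on a real page.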



#### 1.2.a Find the first title object and extract and print the string stored in it (5 pts)

In [16]:
# Your code here  
from bs4 import BeautifulSoup as bsoup

wiki_content = response.text

soup = bsoup(wiki_content,'html5lib')

soup.find_all('title')[0].text


'List of countries by past and projected future population - Wikipedia'

#### 1.2.b Find all the paragraphs, store them in a list, and print the first 10 (5 pts)

In [17]:
# Your code here  
paragraph_list = []
for i in soup.find_all('p'):
  paragraph_list.append(i.text)

paragraph_list[:10]

['All the figures shown here have been sourced from the International Data Base (IDB) Division of the United States Census Bureau. Every individual value has been rounded to the nearest thousand, to assure data coherence, particularly when adding up (sub)totals. Although data from specific statistical offices may be more accurate, the information provided here has the advantage of being homogeneous.\n',
 'Population estimates, as long as they are based on recent censuses, can be more easily projected into the near future than many macroeconomic indicators, such as GDP, which are much more sensitive to political and/or economic crises. This means that demographic estimates for the next five (or even ten) years can be more accurate than the projected evolution of GDP over the same time period (which may also be distorted by inflation).\n',
 'However, no projected population figures can be considered exact. As the IDB states, "figures beyond the years 2020-2025 should be taken with cautio

### 1.3 Extract the tables (10 pts)

We only care about the tables that contain historical population data. Extract all of them.
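
When selecting tables by CSS class, it helps to know how BeautifulSoup compares class names. The sketch below uses a made-up HTML snippet (the three tables and the name `demo_soup` are illustrative assumptions, not the real page) to show that a single class name matches any tag carrying that class, while a string containing several class names is compared against the exact attribute value, so the order of the names matters.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two population-style tables and one unrelated table
demo_html = """
<table class="wikitable sortable"><tr><td>population</td></tr></table>
<table class="sortable wikitable"><tr><td>population</td></tr></table>
<table class="infobox"><tr><td>other</td></tr></table>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# A single class name matches every table whose class list contains it
by_single_class = demo_soup.find_all("table", class_="wikitable")

# A string with several classes must equal the attribute value exactly,
# so "sortable wikitable" does not match class="wikitable sortable"
by_exact_string = demo_soup.find_all("table", class_="sortable wikitable")

print(len(by_single_class), len(by_exact_string))
```

If the page lists the classes in a different order than expected, searching for the single class `wikitable` is the more forgiving choice, at the cost of possibly returning tables you then need to filter out.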

In [18]:
# Your code here  
# You need to find all objects that include the CSS class "wikitable" within the soup object.

tables = soup.find_all("table", {"class": "sortable wikitable"})

In [19]:
# check the tables you extracted

from IPython.display import display, HTML
display(HTML(tables[0].prettify()))

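| Country (or dependent territory) | 1950 | 1955 | % | 1960 | %.1 | 1965 | %.2 | 1970 | %.3 | 1975 | %.4 | 1980 | %.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Afghanistan | 8151 | 8892 | 1.76 | 9830 | 2.03 | 10998 | 2.27 | 12431 | 2.48 | 14133 | 2.6 | 15045 | 1.26 |
| Albania | 1228 | 1393 | 2.56 | 1624 | 3.12 | 1884 | 3.02 | 2157 | 2.74 | 2402 | 2.17 | 2672 | 2.16 |
| Algeria | 8893 | 9842 | 2.05 | 10910 | 2.08 | 11964 | 1.86 | 13932 | 3.09 | 16141 | 2.99 | 18807 | 3.1 |
| American Samoa | 20 | 20 | 0.72 | 21 | 0.2 | 25 | 4.23 | 28 | 2.08 | 30 | 1.68 | 33 | 1.81 |
| Andorra | 7 | 7 | 0.04 | 9 | 6.28 | 14 | 10.17 | 20 | 7.49 | 27 | 6.32 | 34 | 4.81 |
| Angola | 4118 | 4424 | 1.44 | 4798 | 1.64 | 5135 | 1.37 | 5606 | 1.77 | 6051 | 1.54 | 7206 | 3.56 |
| Anguilla | 6 | 6 | 0.8 | 6 | 0.79 | 6 | 0.75 | 7 | 0.8 | 7 | 0.68 | 7 | 0.64 |
| Antigua and Barbuda | 46 | 52 | 2.19 | 55 | 1.32 | 60 | 1.7 | 66 | 2.05 | 69 | 0.73 | 69 | 0.15 |
| Argentina | 17151 | 18928 | 1.99 | 20617 | 1.72 | 22284 | 1.57 | 23963 | 1.46 | 26082 | 1.71 | 28370 | 1.7 |
| Armenia | 1356 | 1566 | 2.92 | 1869 | 3.61 | 2206 | 3.37 | 2520 | 2.7 | 2835 | 2.38 | 3134 | 2.03 |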


### 1.4 Convert the tables into a dictionary  (30 pts)

Looking at the tables, we only care about the population numbers over time. You want to associate each country with a series of population values, producing a proper time series table that you can use to analyze the population of a given country.

First, you need to clean the table cells of any footnotes, links, commas, or other garbage values.
Once your data is clean, make a dictionary and combine each country with its corresponding year/population values across all three tables. An entry in your final dictionary should look like this:


'Albania': {'1950': 1228,
            '1955': 1393,
            '1960': 1624,
            '1965': 1884,
            '1970': 2157,
            '1975': 2402,
            '1980': 2672,
            '1985': 2957,
            '1990': 3245,
            '1995': 3159,
            '2000': 3159,
            '2005': 3025,
            '2010': 2987,
            '2015': 3030,
            '2020': 3075,
            '2025': 3105,
            '2030': 3103,
            '2035': 3063,
            '2040': 2994,
            '2045': 2913,
            '2050': 2825},

One way to do it is:

1. First extract the header 
2. From your header, only store values that are numeric (you can use the isnumeric() string method; recall that we only care about year values and we don't want to store the columns represented by %)
3. Once you have all the relevant column names (column that correspond to a year value), you can go over every row of the table 
    * Create a dictionary key with the country name 
    * Collect and add values corresponding to one of your column names to the dictionary
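
The steps above can be sketched in plain Python, independent of how the cells were extracted. This is only one possible approach. The helper names `clean_cell` and `rows_to_dict` are hypothetical, the `[a]` footnote marker and the sample rows are made up for illustration (the Albania figures follow the example entry above), and the sketch assumes every year cell holds a whole number once cleaned.

```python
import re

def clean_cell(text):
    """Strip footnote markers such as [a] or [12], commas, and outer whitespace."""
    text = re.sub(r"\[[^\]]*\]", "", text)
    return text.replace(",", "").strip()

def rows_to_dict(header, rows, result=None):
    """Merge one table (a header plus rows of cell strings) into a country dictionary."""
    result = {} if result is None else result
    header = [clean_cell(h) for h in header]
    # Keep only the positions of columns whose name is a year, e.g. "1950"
    year_positions = [(i, name) for i, name in enumerate(header) if name.isnumeric()]
    for row in rows:
        cells = [clean_cell(c) for c in row]
        # The first cell is the country name; create its entry on first sight
        entry = result.setdefault(cells[0], {})
        for i, year in year_positions:
            entry[year] = int(cells[i])
    return result

# Illustrative sample rows shaped like two of the tables
header_1 = ["Country (or dependent territory)", "1950", "1955", "%"]
rows_1 = [["Albania[a]", "1,228", "1,393", "2.56"]]
header_2 = ["Country (or dependent territory)", "1985", "1990", "%"]
rows_2 = [["Albania", "2,957", "3,245", "1.88"]]

population = rows_to_dict(header_1, rows_1)
population = rows_to_dict(header_2, rows_2, population)
print(population)
```

Passing the same dictionary back in for each table is what merges the year columns of all tables under a single country key.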

In [20]:
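import io

import pandas as pd
import numpy as np

# Parse every table on the page, then drop the last one.
# flavor="html5lib" avoids the lxml dependency, which is not installed here.
df = pd.read_html(io.StringIO(wiki_content), header=0, flavor="html5lib")[:-1]

# Remove the growth-rate columns, whose names all start with "%"
for table in df:
  percent_columns = [c for c in table.columns if str(c).startswith('%')]
  table.drop(columns=percent_columns, inplace=True)

# Convert each table into a list of row dictionaries
all_dict_list = []
for table in df:
  all_dict_list.append(table.to_dict(orient='records'))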

In [None]:
# adding all the countries data in the final dictionary
country_column = 'Country (or dependent territory)'
country_dict = {}

for table_rows in all_dict_list:
  for row in table_rows:
    country_name = str(row[country_column])

    # Create the country's entry the first time it appears,
    # then merge in the year/population pairs from this table
    entry = country_dict.setdefault(country_name, {})
    entry.update({k: v for k, v in row.items() if k != country_column})


### 1.5 Create a dataframe from your dictionary (10 pts)

Now that all tables are stored in a dictionary, we can convert the dictionary into a pandas dataframe.

1. Remove the "World" row 
2. Replace 'NaN' values with 0
3. Display the first 8 rows


In [None]:
# Your code here 

# making a data frame from the country dictionary

population_dataframe_final = pd.DataFrame.from_dict(country_dict,orient='index')

In [None]:
# Your code here

# removing the "World" row and replacing NaN values with 0
population_dataframe_final = population_dataframe_final.drop('World',axis=0,errors='ignore')
population_dataframe_final = population_dataframe_final.fillna(0)

population_dataframe_final[0:8]

## Part 2: Exploring the data

Now let's look at the data at hand. 

### 2.1 Plotting population (10 pts)

Pick 6 countries of your choice and plot their population growth.

In [None]:
# Your code here
countries_of_choice = ['India','Japan','United States','New Zealand','South Korea','Iceland']

# Keep only the chosen countries that are actually present in the index
countries_found = [c for c in countries_of_choice if c in population_dataframe_final.index]

# Transpose so the years run along the x-axis and each country is one line
population_dataframe_final.loc[countries_found].T.plot()

### 2.2 Find 6 most populous countries (15 pts)

Find the 6 most populous countries in 1960. Then find their population in 1980, 2000, 2020, and 2040.
Plot their population changes. Are there countries that consistently remain the most populous throughout the years?

In [None]:
# Top 6 countries by 1960 population, with every year the question asks for
question_2_2 = population_dataframe_final.sort_values('1960',ascending=False)[0:6][['1960','1980','2000','2020','2040']]
question_2_2

In [None]:
question_2_2.T.plot()

China has remained the most populous country throughout the years; however, India is projected to be more populous than China in 2040.

### 2.3 Declining population (10 pts)

Check the population estimates between the years 2020 and 2050 and find 6 countries that are experiencing a decline in their population. Plot their population changes from 1960 to 2050.

In [None]:
# Your code here
# Countries whose projected 2050 population is below their 2020 population
declining = population_dataframe_final['2020'] > population_dataframe_final['2050']
question_2_3 = population_dataframe_final.loc[declining, ['2020','2050']][0:6]
question_2_3_country_list = question_2_3.index.to_list()

In [None]:
# Plot the selected countries from 1960 to 2050
population_dataframe_final.loc[question_2_3_country_list, '1960':'2050'].T.plot()