# <span style=color:blue>Read tabular data from a World Bank PDF report for analysis</span>

In [3]:
import numpy as np
import pandas as pd
from tabula import read_pdf
import matplotlib.pyplot as plt

#### Open the accompanying PDF file "WDI-2016" and browse through it quickly. It is an annual report from the World Bank on World Development Indicators (poverty, hunger, child mortality, social mobility, education, etc.).

#### Go to pages 68-72 to look at the tables we need to extract in this activity. They show various statistics for nations around the world.

### Define a list of page numbers to read

In [4]:
pages_to_read = [68,69,70,71,72]

### Create a list of column names. The PDF reader will not extract the table headers correctly, so we will assign these names manually later.

#### Look at the pages 68-72 and come up with these variable names. Use your own judgment.

In [6]:
column_names = ['Country','Population','Surface area','Population density','Urban pop %',
                'GNI Atlas Method (Billions)','GNI Atlas Method (Per capita)','Purchasing power (Billions)',
                'Purchasing power (Per capita)','GDP % growth', 'GDP per capita growth']

### Test PDF table extraction using the `read_pdf` function from Tabula

* **You can read details on this library here: https://github.com/chezou/tabula-py**
* **You may have to set `multiple_tables=True` in this case**

In [7]:
lst_tbl1=read_pdf("WDI-2016.pdf",pages=70,multiple_tables=True)

### If you have done the previous step correctly, you should get a simple list back. Check its length and contents. Do you see the table (as a Pandas DataFrame) in the list?

In [8]:
len(lst_tbl1)

2

In [9]:
lst_tbl1[0]

Unnamed: 0,0,1,2,3,4
0,Population,Surface,Population,Urban,Gross national income Gross domestic
1,,area,density,population,productAtlas method Purchasing power parity
2,,thousand,people,% of total,Per capita Per capita Per capita
3,millions,sq. km,per sq. km,population,$ billions $ $ billions $ % growth % growth
4,2014,2014,2014,2014,2014 2014 2014 2014 2013–14 2013–14


In [10]:
lst_tbl1[1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,Italy,60.8,301.3,207,69,2102.2,34580,2155.2,35450,–0.4,–1.4
1,Jamaica,2.7,11.0,251,55,14.0,5150,23.5,8640,0.7,0.5
2,Japan,127.1,378.0,349,93,5339.1,42000,4846.7,38120,–0.1,0.1
3,Jordan,6.6,89.3,74,83,34.1,5160,78.7,11910,3.1,0.8
4,Kazakhstan,17.3,2724.9,6,53,204.8,11850,375.3,21710,4.4,2.9
5,Kenya,44.9,580.4,79,25,58.1,1290,131.8,2940,5.3,2.6
6,Kiribati,0.1,0.8,136,44,0.3,2950,0.4a,"3,340a",3.7,1.9
7,"Korea, Dem. People’s Rep.",25.0,120.5,208,61,..,..j,..,..,..,..
8,"Korea, Rep.",50.4,100.3,517,82,1365.8,27090,1697.0,33650,3.3,2.9
9,Kosovo,1.8,10.9,167,..,7.3,3990,17.0a,"9,300a",1.2,0.9


### It looks like the 2nd element of the list is the table we want to extract. Let's assign it to a DataFrame and check the first few rows using the `head` method

In [11]:
df = lst_tbl1[1]
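
Picking the table by position works here because the data table happens to be the second element. As a hedged alternative, the sketch below selects the table by its column count instead (11, matching `column_names`). The helper name `pick_table` and the two stand-in DataFrames are illustrative assumptions, not part of Tabula; the stand-ins only mimic the shapes seen in the output above.

```python
import pandas as pd

def pick_table(tables, n_cols):
    """Return the first DataFrame in `tables` that has exactly `n_cols` columns."""
    for tbl in tables:
        if tbl.shape[1] == n_cols:
            return tbl
    raise ValueError("No table with {} columns found".format(n_cols))

# Stand-ins for the two list elements returned for page 70:
# a 5-column header fragment and the 11-column data table.
header_fragment = pd.DataFrame(
    [["Population", "Surface", "Population", "Urban", "Gross national income"]]
)
data_table = pd.DataFrame(
    [["Italy", "60.8", "301.3", "207", "69", "2102.2",
      "34580", "2155.2", "35450", "\u20130.4", "\u20131.4"]]
)

picked = pick_table([header_fragment, data_table], n_cols=11)
print(picked.shape)
```

With the real extraction result, the call would look like `pick_table(lst_tbl1, len(column_names))`.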

In [12]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,Italy,60.8,301.3,207,69,2102.2,34580,2155.2,35450,–0.4,–1.4
1,Jamaica,2.7,11.0,251,55,14.0,5150,23.5,8640,0.7,0.5
2,Japan,127.1,378.0,349,93,5339.1,42000,4846.7,38120,–0.1,0.1
3,Jordan,6.6,89.3,74,83,34.1,5160,78.7,11910,3.1,0.8
4,Kazakhstan,17.3,2724.9,6,53,204.8,11850,375.3,21710,4.4,2.9


### You should observe that the column headers are just numbers. Here, we need the list of column names we defined earlier. Assign that list as the column names of this DataFrame.

In [13]:
df.columns = column_names

In [14]:
df.head()

Unnamed: 0,Country,Population,Surface area,Population density,Urban pop %,GNI Atlas Method (Billions),GNI Atlas Method (Per capita),Purchasing power (Billions),Purchasing power (Per capita),GDP % growth,GDP per capita growth
0,Italy,60.8,301.3,207,69,2102.2,34580,2155.2,35450,–0.4,–1.4
1,Jamaica,2.7,11.0,251,55,14.0,5150,23.5,8640,0.7,0.5
2,Japan,127.1,378.0,349,93,5339.1,42000,4846.7,38120,–0.1,0.1
3,Jordan,6.6,89.3,74,83,34.1,5160,78.7,11910,3.1,0.8
4,Kazakhstan,17.3,2724.9,6,53,204.8,11850,375.3,21710,4.4,2.9


### Next, write a loop to create such DataFrames by reading the data tables from pages 68-72 of the PDF file. You can store those DataFrames in a list for concatenating later.

In [16]:
# Empty list to store DataFrames
list_of_df = []
# Loop for reading tables from the PDF file page by page
for pg in pages_to_read:
    lst_tbl=read_pdf("WDI-2016.pdf",pages=pg,multiple_tables=True)
    df = lst_tbl[1]
    df.columns=column_names
    list_of_df.append(df)
    print("Finished processing page: {}".format(pg))

Finished processing page: 68
Finished processing page: 69
Finished processing page: 70
Finished processing page: 71
Finished processing page: 72


### Examine individual DataFrames from the list. Does the last DataFrame look alright?

In [18]:
list_of_df[4]

Unnamed: 0,Country,Population,Surface area,Population density,Urban pop %,GNI Atlas Method (Billions),GNI Atlas Method (Per capita),Purchasing power (Billions),Purchasing power (Per capita),GDP % growth,GDP per capita growth
0,Tanzania,51.8,947.3,59,31,46.4t,920t,126.3t,"2,510t",7.0t,3.6t
1,Thailand,67.7,513.1,133,49,391.7,5780,1006.9,14870,0.9,0.5
2,Timor-Leste,1.2,14.9,82,32,3.2,2680,6.2a,"5,080a",7.0,4.2
3,Togo,7.1,56.8,131,39,4.0,570,9.2,1290,5.7,2.9
4,Tonga,0.1,0.8,147,24,0.4,4260,0.6a,"5,270a",2.1,1.7
5,Trinidad and Tobago,1.4,5.1,264,9,27.2,20070,43.3,31970,0.8,0.4
6,Tunisia,11.0,163.6,71,67,46.5,4230,121.2,11020,2.7,1.7
7,Turkey,75.9,783.6,99,73,822.4,10830,1485.2,19560,2.9,1.7
8,Turkmenistan,5.3,488.1,11,50,42.5,8020,77.1a,"14,520a",10.3,8.9
9,Turks and Caicos Islands,0.0k,1.0,36,92,..,..e,..,..,..,..


### Concatenate all the DataFrames in the list into a single DataFrame so that we can use it for further wrangling and analysis.

* Check the shape of the DataFrame. It should show 226 entries in total with 11 columns.

In [22]:
df = pd.concat(list_of_df,axis=0)

In [23]:
df.shape

(226, 11)

In [24]:
df.head()

Unnamed: 0,Country,Population,Surface area,Population density,Urban pop %,GNI Atlas Method (Billions),GNI Atlas Method (Per capita),Purchasing power (Billions),Purchasing power (Per capita),GDP % growth,GDP per capita growth
0,Afghanistan,31.6,652.9,48,26,21.4,680,63.2a,"2,000a",1.3,–1.7
1,Albania,2.9,28.8,106,56,12.9,4450,31.8,10980,2.2,2.3
2,Algeria,38.9,2381.7,16,70,213.8,5490,540.5,13880,3.8,1.8
3,American Samoa,0.1,0.2,277,87,..,..b,..,..,..,..
4,Andorra,0.1,0.5,155,86,3.3,43270,..,..,–0.1,4.4
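
One detail worth knowing: each page's DataFrame carries its own 0-based index, and `pd.concat` keeps those labels by default, so index values repeat in the combined frame. The sketch below uses two small made-up frames (not the real page tables) to show the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Two tiny stand-ins for per-page DataFrames
page_a = pd.DataFrame({"Country": ["Afghanistan", "Albania"],
                       "Population": ["31.6", "2.9"]})
page_b = pd.DataFrame({"Country": ["Italy", "Jamaica"],
                       "Population": ["60.8", "2.7"]})

# Default behaviour: the original row labels are preserved
kept = pd.concat([page_a, page_b], axis=0)

# ignore_index=True discards them and builds a fresh 0..n-1 index
fresh = pd.concat([page_a, page_b], axis=0, ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```

A unique index makes label-based lookups such as `df.loc[3]` return a single row rather than one row per page.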


### Is the dataset clean and ready to be analyzed?
* **Are there missing entries? How should we handle them?**
* **Are there entries for regions rather than individual countries? Do we need them here, or can they be moved to another dataset?**
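
For the second question, one possible approach is to split the rows by name once you have identified which labels are aggregates. The sketch below is only an illustration: the helper `split_by_name` is hypothetical, and the list `aggregate_names` is a placeholder (the label "World" is assumed here for demonstration) that you would replace with the labels you actually find in the table.

```python
import pandas as pd

def split_by_name(df, names, column="Country"):
    """Split `df` into (rows whose `column` is in `names`, all remaining rows)."""
    mask = df[column].isin(names)
    matched = df[mask].reset_index(drop=True)
    remaining = df[~mask].reset_index(drop=True)
    return matched, remaining

# Placeholder list: fill in after inspecting the concatenated DataFrame
aggregate_names = ["World"]

# Minimal stand-in with only the name column
sample = pd.DataFrame({"Country": ["Italy", "World", "Japan"]})

aggregates, countries = split_by_name(sample, aggregate_names)
print(aggregates["Country"].tolist())
print(countries["Country"].tolist())
```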

#### As with any real-world example, this dataset needs further wrangling and cleaning before it can be used in an analytics pipeline. Those steps are not discussed here, but you can use your data wrangling skills to try extracting plots and insights from this dataset on your own!
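
As one possible starting point for that exercise, the sketch below converts the raw string cells seen in the outputs above into numbers. It assumes three patterns observed in those outputs: `..` marks a missing value, a trailing lowercase letter is a footnote marker (as in `3,340a` or `..j`), and negative numbers use an en dash (`–`) instead of a minus sign. The helper `to_number` is a hypothetical name, and the sample rows are copied from the tables shown earlier.

```python
import numpy as np
import pandas as pd

def to_number(value):
    """Convert one raw cell such as '3,340a', '–0.4' or '..' to a float (NaN if missing)."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    text = text.replace("\u2013", "-")  # en dash used as a minus sign
    text = text.replace(",", "")        # thousands separator
    text = text.rstrip("abcdefghijklmnopqrstuvwxyz")  # trailing footnote letters
    if text in ("", ".."):
        return np.nan
    return float(text)

# A few cells taken from the extracted tables
sample = pd.DataFrame({
    "Country": ["Kiribati", "Kosovo", "Italy"],
    "Purchasing power (Per capita)": ["3,340a", "9,300a", "35450"],
    "GDP % growth": ["3.7", "1.2", "\u20130.4"],
    "Urban pop %": ["44", "..", "69"],
})

# Clean every column except the country name
cleaned = sample.copy()
for col in cleaned.columns:
    if col != "Country":
        cleaned[col] = cleaned[col].map(to_number)

print(cleaned.dtypes)
```

Missing entries become `NaN`, so you can then decide whether to drop or impute them with the usual pandas tools such as `dropna` or `fillna`.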