This notebook demonstrates simple steps to deal with a large dataset.

In [11]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os
print(os.listdir("../input"))

In [2]:
data = pd.read_csv('../input/data.csv')
data.shape

It has close to 1 million rows and 62 columns.
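With a file this size, it can help to peek at a few rows or load only some columns before reading everything. The sketch below is not part of the original workflow; it uses a small made-up CSV string in place of `../input/data.csv`, and the column names and values are placeholders.

```python
import io
import pandas as pd

# Made-up miniature of the real file; names and values are placeholders.
csv_text = """Country Name,Country Code,Indicator Name,Indicator Code,2010,2011
Hong Kong,HKG,Indicator A,IND.A,1.0,
Hong Kong,HKG,Indicator B,IND.B,,
United States,USA,Indicator A,IND.A,3.0,5.0
"""

# nrows reads only the first rows: a cheap way to inspect the layout.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)

# usecols loads only the listed columns, which reduces memory use.
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Country Code", "Indicator Code", "2010"],
)

print(preview.shape)
print(subset.shape)
```

On the real file, the same two parameters can be passed to `pd.read_csv` with the file path instead of the `StringIO` object.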

In [3]:
data.head()

In [4]:
data.loc[:,'Country Name':'Indicator Code'].describe()

The dataset contains 254 countries and 3617 indicators!

We can create two tables of countries and indicators for easier reference.

In [73]:
countries = data.loc[:,['Country Name','Country Code']].drop_duplicates()
indicators = data.loc[:,['Indicator Name','Indicator Code']].drop_duplicates()
print(countries.shape, indicators.shape)

In [74]:
countries.head()

Some "Countries" are in fact not countries, but regions.

In [24]:
indicators.head()
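To illustrate how such reference tables can be used, here is a minimal sketch on a toy frame (the rows are made up). It repeats the `drop_duplicates()` step and then turns the indicator table into a code-to-name lookup; the names `countries_toy`, `indicators_toy` and `indicator_names` are only for this example.

```python
import pandas as pd

# Toy stand-in for `data`; the rows are made up for illustration.
toy = pd.DataFrame({
    "Country Name": ["Hong Kong", "Hong Kong", "Arab World"],
    "Country Code": ["HKG", "HKG", "ARB"],
    "Indicator Name": ["Indicator A", "Indicator B", "Indicator A"],
    "Indicator Code": ["IND.A", "IND.B", "IND.A"],
})

# Same idea as above: one row per distinct (name, code) pair.
countries_toy = toy[["Country Name", "Country Code"]].drop_duplicates()
indicators_toy = toy[["Indicator Name", "Indicator Code"]].drop_duplicates()

# A Series indexed by code works as a simple code -> name lookup.
indicator_names = indicators_toy.set_index("Indicator Code")["Indicator Name"]

print(countries_toy.shape, indicators_toy.shape)
print(indicator_names["IND.B"])
```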

Next, we look at how much data is available (not missing) in each year column.

In [33]:
# Share of rows with a non-null value in each column, as a percentage
present = data.loc[:,'1977':'2016'].notnull().sum()/len(data)*100
future = data.loc[:,'2020':].notnull().sum()/len(data)*100
plt.figure(figsize=(10,7))
plt.subplot(121)
present.plot(kind='barh', color='green')
plt.title('Available Data (% of Data Rows)')
plt.ylabel('Column')
plt.subplot(122)
future.plot(kind='barh', color='limegreen')
plt.title('Available Data (% of Data Rows)')
plt.show()

Findings:
- 2010 is the year with the most available data, but only slightly above 25% of all data rows;
- 2016 has exceptionally little data;
- Future forecast data is available for about 5% of the data rows;
- The last column 'Unnamed: 61' is likely to be an error. It should be dropped.
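Before dropping a suspicious column, it may be worth confirming that it really is empty. The following is a small sketch on a toy frame with made-up values: it computes the share of available values per column and lists the columns that contain no data at all. A column named `Unnamed: N` often comes from a trailing comma at the end of each CSV line, though that is only an assumption about this file.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; "Unnamed: 3" mimics an empty trailing column.
toy = pd.DataFrame({
    "Indicator Code": ["IND.A", "IND.B", "IND.C", "IND.D"],
    "2010": [1.0, np.nan, 3.0, 4.0],
    "2016": [np.nan, np.nan, np.nan, 2.0],
    "Unnamed: 3": [np.nan] * 4,
})

# The mean of a boolean mask is the fraction of True values,
# so this is the percentage of rows with a value in each column.
available_pct = toy.notnull().mean() * 100

# Columns in which every single value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()

print(available_pct)
print(empty_cols)
```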

In [34]:
data = data.drop(['Unnamed: 61'], axis=1)
data.head()

## If We Want to Look at One Country...

We can use **.str.contains()** to look for certain countries. Here is an example:

In [36]:
countries[countries['Country Name'].str.contains('Hong')]
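By default, `.str.contains()` is case-sensitive and returns a missing value for rows where the name itself is missing. A minimal sketch on a made-up table shows two optional parameters that make the search more forgiving; the rows and the codes `XXX` and `YYY` are hypothetical.

```python
import pandas as pd

# Made-up table; "XXX" and "YYY" are hypothetical codes for illustration.
countries_toy = pd.DataFrame({
    "Country Name": ["Hong Kong", "China", "HONG KONG (upper case)", None],
    "Country Code": ["HKG", "CHN", "XXX", "YYY"],
})

# case=False ignores letter case; na=False treats missing names as non-matches,
# so the result is a clean boolean mask that can be used for filtering.
mask = countries_toy["Country Name"].str.contains("hong", case=False, na=False)
matches = countries_toy[mask]

print(matches["Country Code"].tolist())
```

The pattern is interpreted as a regular expression by default; passing `regex=False` searches for the literal text instead.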

Let's look at what happened to education in Hong Kong:

In [52]:
hk = data.loc[data['Country Code']=='HKG']
hk.head()

In [58]:
hk.shape

Then we remove the rows of any indicator for which all values are unavailable:

In [62]:
hkx = hk.dropna(axis='index', thresh=5) # Keep rows with at least 5 non-null values: the first 4 columns plus at least 1 data value
hkx.shape

In [76]:
hkx.head()
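The `thresh` argument counts non-null values per row, so the filter above relies on the four identifier columns always being filled. The sketch below, on a toy frame with made-up values, shows that behaviour next to an alternative that checks only the year columns with `how='all'` and `subset`; `year_cols` is a name introduced for this example.

```python
import numpy as np
import pandas as pd

# Toy frame: 4 identifier columns plus 2 year columns, values made up.
toy = pd.DataFrame({
    "Country Name": ["Hong Kong"] * 3,
    "Country Code": ["HKG"] * 3,
    "Indicator Name": ["Indicator A", "Indicator B", "Indicator C"],
    "Indicator Code": ["IND.A", "IND.B", "IND.C"],
    "2010": [1.0, np.nan, np.nan],
    "2011": [np.nan, np.nan, 2.0],
})

# Keep rows with at least 5 non-null values (4 identifiers + 1 data value).
by_thresh = toy.dropna(axis="index", thresh=5)

# Alternative: drop rows only when every year column is missing.
year_cols = ["2010", "2011"]
by_subset = toy.dropna(axis="index", how="all", subset=year_cols)

print(by_thresh["Indicator Code"].tolist())
print(by_thresh.equals(by_subset))
```

Both approaches give the same result here. The `subset` version would also behave correctly if an identifier column had gaps.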

As the names of the indicators are quite long, they are easier to view in a spreadsheet.  
The following code saves the table of indicators as a CSV file:

In [75]:
indicators.to_csv('indicators.csv', index=False)

Stay tuned!