# [Reading Tabular Data into DataFrames](http://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/index.html)

In [1]:
import pandas as pd

data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)

       country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Australia     10039.59564     10949.64959     12217.22686   
1  New Zealand     10556.57566     12247.39532     13175.67800   

   gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
0     14526.12465     16788.62948     18334.19751     19477.00928   
1     14463.91893     16046.03728     16233.71770     17632.41040   

   gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
0     21888.88903     23424.76683     26997.93657     30687.75473   
1     19007.19129     18363.32494     21050.41377     23189.80135   

   gdpPercap_2007  
0     34435.36744  
1     25185.00911  


By default the row headings are integers (0 and 1 in the output above). To index the rows by country instead, pass the column name `country` to `pandas.read_csv()` as its `index_col` parameter.

In [2]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)

             gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
country                                                                       
Australia       10039.59564     10949.64959     12217.22686     14526.12465   
New Zealand     10556.57566     12247.39532     13175.67800     14463.91893   

             gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
country                                                                       
Australia       16788.62948     18334.19751     19477.00928     21888.88903   
New Zealand     16046.03728     16233.71770     17632.41040     19007.19129   

             gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                      
Australia       23424.76683     26997.93657     30687.75473     34435.36744  
New Zealand     18363.32494     21050.41377     23189.80135     25185.00911  
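
To see the effect of `index_col` without the data file, here is a minimal sketch that reads a small in-memory CSV. The two-row, two-column text is an illustrative excerpt built from the values printed above, and `csv_text`, `default_index` and `country_index` are names made up for this example.

```python
import io

import pandas as pd

# Illustrative stand-in for data/gapminder_gdp_oceania.csv (assumption:
# only two of the year columns are reproduced here).
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Australia,10039.59564,10949.64959
New Zealand,10556.57566,12247.39532
"""

# Without index_col, rows get integer labels and 'country' is a data column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the 'country' column becomes the row labels.
country_index = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(list(default_index.index))
print(list(country_index.index))
print(country_index.loc['Australia', 'gdpPercap_1952'])
```

Once the countries are the row labels, `country` no longer appears among the columns, so rows can be looked up by name.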


The `DataFrame.columns` attribute stores the column labels as a pandas `Index`.

In [3]:
print(data.columns)

Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')
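
The column labels behave much like a sequence. The following sketch shows a few common operations on them, using a small hand-built frame (an assumption for illustration, with values copied from the output above) rather than the full file.

```python
import pandas as pd

# Illustrative frame; values copied from the Oceania output above.
data = pd.DataFrame(
    {'gdpPercap_1952': [10039.59564, 10556.57566],
     'gdpPercap_1957': [10949.64959, 12247.39532]},
    index=['Australia', 'New Zealand'],
)

print(type(data.columns).__name__)        # the labels are held in an Index
print(len(data.columns))                  # number of columns
print('gdpPercap_1952' in data.columns)   # membership test on the labels
print(list(data.columns))                 # convert to a plain list if needed
```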


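The `DataFrame.T` attribute returns the transpose of the dataframe, with rows and columns swapped. When all columns share one dtype, as here, `.T` does not need to copy the data and just changes the program's view of it; frames with mixed dtypes are copied.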

In [4]:
print(data.T)

country           Australia  New Zealand
gdpPercap_1952  10039.59564  10556.57566
gdpPercap_1957  10949.64959  12247.39532
gdpPercap_1962  12217.22686  13175.67800
gdpPercap_1967  14526.12465  14463.91893
gdpPercap_1972  16788.62948  16046.03728
gdpPercap_1977  18334.19751  16233.71770
gdpPercap_1982  19477.00928  17632.41040
gdpPercap_1987  21888.88903  19007.19129
gdpPercap_1992  23424.76683  18363.32494
gdpPercap_1997  26997.93657  21050.41377
gdpPercap_2002  30687.75473  23189.80135
gdpPercap_2007  34435.36744  25185.00911
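
A small sketch of what transposing does to the labels and the shape. The frame is hand-built for illustration from three of the year columns shown earlier.

```python
import pandas as pd

# Illustrative frame: 2 countries (rows) by 3 years (columns).
data = pd.DataFrame(
    {'gdpPercap_1952': [10039.59564, 10556.57566],
     'gdpPercap_1957': [10949.64959, 12247.39532],
     'gdpPercap_1962': [12217.22686, 13175.67800]},
    index=['Australia', 'New Zealand'],
)

transposed = data.T

print(data.shape, transposed.shape)     # rows and columns trade places
print(list(transposed.columns))         # countries are now the columns
print(transposed.loc['gdpPercap_1957', 'Australia'])
```

Any cell can be reached in either frame by swapping the order of the row and column labels.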


Use `DataFrame.describe()` to get summary statistics.

In [5]:
print(data.T.describe())

country     Australia   New Zealand
count       12.000000     12.000000
mean     19980.595634  17262.622813
std       7815.405220   4409.009167
min      10039.595640  10556.575660
25%      13948.900203  14141.858697
50%      18905.603395  16933.064050
75%      24318.059265  19517.996910
max      34435.367440  25185.009110
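
`describe()` summarizes each column, which is why the cell above transposes first: it yields statistics per country rather than per year. The sketch below contrasts the two on a small hand-built frame (an illustrative assumption, three years only).

```python
import pandas as pd

# Illustrative frame: 2 countries (rows) by 3 years (columns).
data = pd.DataFrame(
    {'gdpPercap_1952': [10039.59564, 10556.57566],
     'gdpPercap_1957': [10949.64959, 12247.39532],
     'gdpPercap_1962': [12217.22686, 13175.67800]},
    index=['Australia', 'New Zealand'],
)

per_year = data.describe()        # one column of statistics per year
per_country = data.T.describe()   # one column of statistics per country

print(per_year.columns.tolist())
print(per_country.columns.tolist())
print(per_country.loc['mean', 'Australia'])
```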


# [Pandas DataFrames](http://swcarpentry.github.io/python-novice-gapminder/08-data-frames/index.html)

A `DataFrame` is a collection of `Series`: the DataFrame is how Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the `NumPy` library, which in practice means that most of the methods defined for NumPy arrays apply to Pandas Series and DataFrames.
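
A minimal sketch of both points, assuming a small hand-built frame: selecting one column gives a `Series`, and a NumPy function such as `np.log10` can be applied to it directly while the row labels are kept.

```python
import numpy as np
import pandas as pd

# Illustrative frame; values copied from the Oceania output above.
data = pd.DataFrame(
    {'gdpPercap_1952': [10039.59564, 10556.57566],
     'gdpPercap_1957': [10949.64959, 12247.39532]},
    index=['Australia', 'New Zealand'],
)

column = data['gdpPercap_1952']
print(type(column).__name__)       # a single column is a Series

log_gdp = np.log10(column)         # NumPy function applied element by element
print(type(log_gdp).__name__)      # the result is still a Series
print(log_gdp.index.tolist())      # with the same row labels

values = data.to_numpy()           # the underlying values as a NumPy array
print(type(values).__name__, values.shape)
```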

In [6]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])

1601.056136


Use `:` on its own to mean all columns or rows.

In [7]:
print(data.loc["Albania",:].head(n=5))

gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
Name: Albania, dtype: float64


In [8]:
print(data.loc[:,"gdpPercap_1952"].tail(n=6))

country
Slovenia           4215.041741
Spain              3834.034742
Sweden             8527.844662
Switzerland       14734.232750
Turkey             1969.100980
United Kingdom     9979.508487
Name: gdpPercap_1952, dtype: float64
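
The three forms of `.loc` used so far (single cell, whole row, whole column) can be compared side by side. This sketch uses a two-country frame built by hand from values printed in this notebook; the variable names are made up for the example.

```python
import pandas as pd

# Illustrative frame with two countries and two years.
data = pd.DataFrame(
    {'gdpPercap_1952': [1601.056136, 9692.385245],
     'gdpPercap_1957': [1942.284244, 11099.659350]},
    index=pd.Index(['Albania', 'Denmark'], name='country'),
)

one_value = data.loc['Albania', 'gdpPercap_1952']   # row label, column label
one_row = data.loc['Albania', :]                    # ':' selects every column
one_column = data.loc[:, 'gdpPercap_1952']          # ':' selects every row

print(one_value)
print(one_row.name, one_row.index.tolist())
print(one_column.name, one_column.index.tolist())
```

A row or column selected this way is a `Series` whose name is the label that was fixed and whose index is the axis that was left open.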


A comparison is applied element by element and returns a dataframe of the same shape containing `True` or `False`.

In [9]:
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
mask = subset > 10000
print('type(mask): ', type(mask), '\n')
print(mask)

type(mask):  <class 'pandas.core.frame.DataFrame'> 

             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy                 False            True            True
Montenegro            False           False           False
Netherlands            True            True            True
Norway                 True            True            True
Poland                False           False           False


A dataframe full of Booleans is sometimes called a *mask* because of how it can be used. Indexing with the mask gives the original value where the mask is `True`, and `NaN` (Not a Number) where it is `False`.

In [10]:
print(subset[mask])

             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy                   NaN     10022.40131     12269.27378
Montenegro              NaN             NaN             NaN
Netherlands     12790.84956     15363.25136     18794.74567
Norway          13450.40151     16361.87647     18965.05551
Poland                  NaN             NaN             NaN
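
The sketch below repeats the masking step on a toy frame and shows two follow-ups: `DataFrame.where` gives the same result as indexing with the mask, and `mean()` skips the `NaN` entries by default. The labels and numbers are invented for illustration.

```python
import pandas as pd

# Toy data (made up): two rows, two columns.
subset = pd.DataFrame(
    {'y1962': [8000.0, 12000.0],
     'y1967': [10500.0, 15000.0]},
    index=['A', 'B'],
)

mask = subset > 10000          # True where the value exceeds the threshold
masked = subset[mask]          # values kept where True, NaN elsewhere

print(masked)
print(masked.equals(subset.where(mask)))   # same result via DataFrame.where
print(masked.mean())                       # NaN entries are ignored
```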


## Group By

For each country, we compute the fraction of surveyed years in which its GDP per capita was above the mean across the countries in the table for that year.

In [11]:
mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
wealth_score

country
Albania                   0.000000
Austria                   1.000000
Belgium                   1.000000
Bosnia and Herzegovina    0.000000
Bulgaria                  0.000000
Croatia                   0.000000
Czech Republic            0.500000
Denmark                   1.000000
Finland                   1.000000
France                    1.000000
Germany                   1.000000
Greece                    0.333333
Hungary                   0.000000
Iceland                   1.000000
Ireland                   0.333333
Italy                     0.500000
Montenegro                0.000000
Netherlands               1.000000
Norway                    1.000000
Poland                    0.000000
Portugal                  0.000000
Romania                   0.000000
Serbia                    0.000000
Slovak Republic           0.000000
Slovenia                  0.333333
Spain                     0.333333
Sweden                    1.000000
Switzerland               1.000000
Turkey      

Then, for each group of countries that share the same `wealth_score`, we sum the GDP per capita values in each surveyed year.

In [12]:
print(data.groupby(wealth_score).sum())

          gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
0.000000    36916.854200    46110.918793    56850.065437    71324.848786   
0.333333    16790.046878    20942.456800    25744.935321    33567.667670   
0.500000    11807.544405    14505.000150    18380.449470    21421.846200   
1.000000   104317.277560   127332.008735   149989.154201   178000.350040   

          gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
0.000000    88569.346898   104459.358438   113553.768507   119649.599409   
0.333333    45277.839976    53860.456750    59679.634020    64436.912960   
0.500000    25377.727380    29056.145370    31914.712050    35517.678220   
1.000000   215162.343140   241143.412730   263388.781960   296825.131210   

          gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
0.000000    92380.047256   103772.937598   118590.929863   149577.357928  
0.333333    67918.093220    80876.051580   102086.795210   122803.729520  
0.500000    3
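
The two steps above are easier to follow on a tiny table where the arithmetic can be done by hand. This sketch uses invented numbers for four placeholder countries, and it calls `.sum(axis=1)` in place of `.aggregate('sum', axis=1)`, which should give the same row totals.

```python
import pandas as pd

# Toy data (made up): four countries, two years.
data = pd.DataFrame(
    {'y1': [1.0, 2.0, 9.0, 10.0],
     'y2': [1.0, 8.0, 3.0, 10.0]},
    index=pd.Index(['A', 'B', 'C', 'D'], name='country'),
)

# True where a country is above that year's mean (5.5 in both columns).
mask_higher = data > data.mean()

# Fraction of years in which each country is above the mean.
wealth_score = mask_higher.sum(axis=1) / len(data.columns)
print(wealth_score)

# Group the rows by their score and add up each year's values per group.
grouped = data.groupby(wealth_score).sum()
print(grouped)
```

Here `B` and `C` are each above the mean in one year out of two, so they land in the same group and their values are added together.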

## Exercises

GDP per capita for all countries in 1982.

In [13]:
print(data.loc[:,"gdpPercap_1982"])

country
Albania                    3630.880722
Austria                   21597.083620
Belgium                   20979.845890
Bosnia and Herzegovina     4126.613157
Bulgaria                   8224.191647
Croatia                   13221.821840
Czech Republic            15377.228550
Denmark                   21688.040480
Finland                   18533.157610
France                    20293.897460
Germany                   22031.532740
Greece                    15268.420890
Hungary                   12545.990660
Iceland                   23269.607500
Ireland                   12618.321410
Italy                     16537.483500
Montenegro                11222.587620
Netherlands               21399.460460
Norway                    26298.635310
Poland                     8451.531004
Portugal                  11753.842910
Romania                    9605.314053
Serbia                    15181.092700
Slovak Republic           11348.545850
Slovenia                  17866.721750
Spain            

GDP per capita for Denmark for all years.

In [14]:
print(data.loc["Denmark",:])

gdpPercap_1952     9692.385245
gdpPercap_1957    11099.659350
gdpPercap_1962    13583.313510
gdpPercap_1967    15937.211230
gdpPercap_1972    18866.207210
gdpPercap_1977    20422.901500
gdpPercap_1982    21688.040480
gdpPercap_1987    25116.175810
gdpPercap_1992    26406.739850
gdpPercap_1997    29804.345670
gdpPercap_2002    32166.500060
gdpPercap_2007    35278.418740
Name: Denmark, dtype: float64


GDP per capita for all countries for years after 1985.

In [15]:
print(data.loc[:,"gdpPercap_1985":])

                        gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  \
country                                                                  
Albania                    3738.932735     2497.437901     3193.054604   
Austria                   23687.826070    27042.018680    29095.920660   
Belgium                   22525.563080    25575.570690    27561.196630   
Bosnia and Herzegovina     4314.114757     2546.781445     4766.355904   
Bulgaria                   8239.854824     6302.623438     5970.388760   
Croatia                   13822.583940     8447.794873     9875.604515   
Czech Republic            16310.443400    14297.021220    16048.514240   
Denmark                   25116.175810    26406.739850    29804.345670   
Finland                   21141.012230    20647.164990    23723.950200   
France                    22066.442140    24703.796150    25889.784870   
Germany                   24639.185660    26505.303170    27788.884160   
Greece                    16120.528390
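
There is no `gdpPercap_1985` column; the slice above still works because the column labels are in sorted order, so pandas starts at the first label that sorts after it. A more explicit alternative, sketched here on a one-row frame built from the Denmark values, is to parse the year out of each label and compare numbers. The helper name `is_after_1985` is made up for this example.

```python
import pandas as pd

# Illustrative one-row frame using Denmark's values from this notebook.
data = pd.DataFrame(
    [[21688.040480, 25116.175810, 26406.739850]],
    index=['Denmark'],
    columns=['gdpPercap_1982', 'gdpPercap_1987', 'gdpPercap_1992'],
)

# Label slice: relies on the column labels being sorted.
by_slice = data.loc[:, 'gdpPercap_1985':]

# Explicit alternative: take the year after the underscore and compare it.
is_after_1985 = [int(name.split('_')[1]) > 1985 for name in data.columns]
by_year = data.loc[:, is_after_1985]

print(list(by_slice.columns))
print(by_slice.equals(by_year))
```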

GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952. **YET TO BE FINISHED**

In [16]:
print(data.loc[:,"gdpPercap_2007"])

country
Albania                    5937.029526
Austria                   36126.492700
Belgium                   33692.605080
Bosnia and Herzegovina     7446.298803
Bulgaria                  10680.792820
Croatia                   14619.222720
Czech Republic            22833.308510
Denmark                   35278.418740
Finland                   33207.084400
France                    30470.016700
Germany                   32170.374420
Greece                    27538.411880
Hungary                   18008.944440
Iceland                   36180.789190
Ireland                   40675.996350
Italy                     28569.719700
Montenegro                 9253.896111
Netherlands               36797.933320
Norway                    49357.190170
Poland                    15389.924680
Portugal                  20509.647770
Romania                   10808.475610
Serbia                     9786.534714
Slovak Republic           18678.314350
Slovenia                  25768.257590
Spain            
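
One possible way to finish this exercise is to divide the 2007 column by the 1952 column; pandas aligns the two `Series` on the country labels and divides element by element. The sketch below assumes a two-country frame built from values printed earlier in this notebook, and `growth_multiple` is a name made up here.

```python
import pandas as pd

# Illustrative frame: Albania and Denmark, first and last surveyed years.
data = pd.DataFrame(
    {'gdpPercap_1952': [1601.056136, 9692.385245],
     'gdpPercap_2007': [5937.029526, 35278.418740]},
    index=pd.Index(['Albania', 'Denmark'], name='country'),
)

# GDP per capita in 2007 as a multiple of the 1952 value, per country.
growth_multiple = data['gdpPercap_2007'] / data['gdpPercap_1952']
print(growth_multiple.round(2))
```

On the full European table the same expression would be `data.loc[:, 'gdpPercap_2007'] / data.loc[:, 'gdpPercap_1952']`.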