### Importing Excel files

In [2]:
import pandas as pd

Read the data from the GDP per capita worksheet.

- We select the worksheet that holds the data we need and skip the rows and columns we do not want.
- The `sheet_name` parameter specifies the worksheet.
- `skiprows=4` and `skipfooter=1` skip the first four rows (the first row is hidden) and the last row.
- `usecols` restricts the import to **column A** and **columns C through T**.
- `head` shows the first rows of the dataset.

In [3]:
percapitaGDP = pd.read_excel('data/GDPpercapita.xlsx',
    sheet_name='OECD.Stat export',
    skiprows=4,
    skipfooter=1,
    usecols='A, C:T')

In [4]:
percapitaGDP.head()

Unnamed: 0,Year,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Metropolitan areas,,,,,,,,,,,,,,,,,,
1,AUS: Australia,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..
2,AUS01: Greater Sydney,43313,44008,45424,45837,45423,45547,45880,45225,45900,45672,46535,47350,47225,48510,50075,50519,50578,49860
3,AUS02: Greater Melbourne,40125,40894,41602,42188,41484,41589,42316,40975,41384,40943,41165,41264,41157,42114,42928,42671,43025,42674
4,AUS03: Greater Brisbane,37580,37564,39080,40762,42976,44475,44635,46192,43507,42774,44166,43764,43379,43754,44388,45723,46876,46640


We use the `info` method to see the data types and the count of non-null values in each column.

In [5]:
percapitaGDP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Year    702 non-null    object
 1   2001    701 non-null    object
 2   2002    701 non-null    object
 3   2003    701 non-null    object
 4   2004    701 non-null    object
 5   2005    701 non-null    object
 6   2006    701 non-null    object
 7   2007    701 non-null    object
 8   2008    701 non-null    object
 9   2009    701 non-null    object
 10  2010    701 non-null    object
 11  2011    701 non-null    object
 12  2012    701 non-null    object
 13  2013    701 non-null    object
 14  2014    701 non-null    object
 15  2015    701 non-null    object
 16  2016    701 non-null    object
 17  2017    701 non-null    object
 18  2018    701 non-null    object
dtypes: object(19)
memory usage: 104.3+ KB



- We rename the `Year` column to `metro` and strip the surrounding whitespace from its values.
- Some values in the `metro` column have extra spaces before or after the text.
- We can test for leading spaces with `startswith(' ')` and then use `any` to check whether at least one value begins with a blank character.
- Likewise, `endswith(' ')` checks for trailing spaces.
- `strip` removes both the leading and the trailing spaces.

In [6]:
percapitaGDP.rename(columns={'Year':'metro'}, inplace=True)

In [7]:
percapitaGDP.metro.str.startswith(' ').any()

True

In [8]:
percapitaGDP.metro.str.endswith(' ').any()

True

In [9]:
percapitaGDP.metro = percapitaGDP.metro.str.strip()
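
The whitespace check and cleanup above can also be tried on a small stand-alone Series. This is only a sketch: the `metro` Series below is a made-up stand-in for the real column, and the variable names `has_leading`, `has_trailing`, and `cleaned` are introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the `metro` column; the padding is added on purpose.
metro = pd.Series([
    ' AUS01: Greater Sydney',      # leading space
    'AUS02: Greater Melbourne ',   # trailing space
    'AUS03: Greater Brisbane',     # already clean
])

# `any` collapses the element-wise booleans into a single answer.
has_leading = metro.str.startswith(' ').any()
has_trailing = metro.str.endswith(' ').any()

# `strip` removes whitespace on both sides of every value.
cleaned = metro.str.strip()

print(has_leading, has_trailing)
print(cleaned.tolist())
```

Running the same two checks on `cleaned` should report no remaining leading or trailing spaces.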

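Converting the year columns to numeric.

- We iterate over all the GDP year columns (2001-2018), convert their data type from `object` to `float`, and prefix each column name with `pcGDP`.
- With `errors='coerce'`, values that cannot be parsed as numbers, such as the `..` placeholders, become `NaN`.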

In [11]:
for col in percapitaGDP.columns[1:]:
    percapitaGDP[col] = pd.to_numeric(percapitaGDP[col], errors='coerce')
    percapitaGDP.rename(columns={col:'pcGDP'+col}, inplace=True)
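
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a two-row toy frame. The frame `toy` is an assumption built for illustration (its values are copied from the output above), and the rename is done in one call after the loop rather than inside it, which is just an alternative way to write the same step.

```python
import pandas as pd

# Toy frame mimicking the imported data: '..' marks a missing value,
# so the year columns start out with dtype `object`.
toy = pd.DataFrame({
    'metro': ['AUS: Australia', 'AUS01: Greater Sydney'],
    '2001': ['..', 43313],
    '2002': ['..', 44008],
})

year_cols = toy.columns[1:]

# Unparseable entries such as '..' are replaced by NaN instead of raising.
for col in year_cols:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

# Add the prefix to every year column in a single rename call.
toy = toy.rename(columns={col: 'pcGDP' + col for col in year_cols})

print(toy.dtypes)
print(toy)
```

Without `errors='coerce'`, `pd.to_numeric` would raise on the `..` strings, so the coercion is what lets the loop run over every column.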

In [12]:
percapitaGDP.head()

Unnamed: 0,metro,pcGDP2001,pcGDP2002,pcGDP2003,pcGDP2004,pcGDP2005,pcGDP2006,pcGDP2007,pcGDP2008,pcGDP2009,pcGDP2010,pcGDP2011,pcGDP2012,pcGDP2013,pcGDP2014,pcGDP2015,pcGDP2016,pcGDP2017,pcGDP2018
0,Metropolitan areas,,,,,,,,,,,,,,,,,,
1,AUS: Australia,,,,,,,,,,,,,,,,,,
2,AUS01: Greater Sydney,43313.0,44008.0,45424.0,45837.0,45423.0,45547.0,45880.0,45225.0,45900.0,45672.0,46535.0,47350.0,47225.0,48510.0,50075.0,50519.0,50578.0,49860.0
3,AUS02: Greater Melbourne,40125.0,40894.0,41602.0,42188.0,41484.0,41589.0,42316.0,40975.0,41384.0,40943.0,41165.0,41264.0,41157.0,42114.0,42928.0,42671.0,43025.0,42674.0
4,AUS03: Greater Brisbane,37580.0,37564.0,39080.0,40762.0,42976.0,44475.0,44635.0,46192.0,43507.0,42774.0,44166.0,43764.0,43379.0,43754.0,44388.0,45723.0,46876.0,46640.0


In [13]:
percapitaGDP.dtypes

metro         object
pcGDP2001    float64
pcGDP2002    float64
pcGDP2003    float64
pcGDP2004    float64
pcGDP2005    float64
pcGDP2006    float64
pcGDP2007    float64
pcGDP2008    float64
pcGDP2009    float64
pcGDP2010    float64
pcGDP2011    float64
pcGDP2012    float64
pcGDP2013    float64
pcGDP2014    float64
pcGDP2015    float64
pcGDP2016    float64
pcGDP2017    float64
pcGDP2018    float64
dtype: object

We use the `describe` method to generate summary statistics for all the numeric data in the dataframe.

In [14]:
percapitaGDP.describe()

Unnamed: 0,pcGDP2001,pcGDP2002,pcGDP2003,pcGDP2004,pcGDP2005,pcGDP2006,pcGDP2007,pcGDP2008,pcGDP2009,pcGDP2010,pcGDP2011,pcGDP2012,pcGDP2013,pcGDP2014,pcGDP2015,pcGDP2016,pcGDP2017,pcGDP2018
count,424.0,440.0,440.0,440.0,447.0,447.0,447.0,455.0,471.0,471.0,480.0,480.0,480.0,480.0,480.0,480.0,445.0,441.0
mean,41263.658019,41015.070455,41553.361364,42473.022727,42881.143177,43987.762864,44786.760626,44533.958242,42724.316348,43433.511677,43946.966667,44075.9375,44302.154167,44942.49375,45802.220833,46243.666667,47489.089888,48032.668934
std,11877.960193,12536.516772,12456.583153,12621.90115,13172.229181,13450.431995,13693.732714,14082.871703,13602.723246,13896.77508,14018.472002,14170.164166,14251.392256,14421.619849,14948.683819,14938.54938,15463.803389,15719.725615
min,10988.0,11435.0,11969.0,12777.0,13062.0,13855.0,13937.0,2236.0,2202.0,2227.0,2363.0,2572.0,2700.0,2683.0,2761.0,2796.0,2745.0,2832.0
25%,33139.25,32636.0,33284.75,33864.5,33735.5,34540.0,35226.0,35094.0,33730.0,34294.5,34582.75,34808.75,35113.75,35766.25,36128.5,36584.75,37316.0,37908.0
50%,39543.5,39683.5,40390.5,41200.5,41609.0,42929.0,43461.0,43287.0,41250.0,41627.0,42345.0,42131.5,42154.5,42777.5,43237.5,43931.5,45385.0,46057.0
75%,47971.75,48611.0,49354.75,50468.25,51025.0,52304.0,53043.5,53132.0,50739.5,51428.5,52568.75,52569.75,53087.5,53737.25,54134.25,54449.75,56023.0,56638.0
max,91488.0,93566.0,98123.0,96242.0,101084.0,121053.0,122897.0,120158.0,114486.0,119658.0,119965.0,117348.0,123709.0,121011.0,121623.0,117879.0,122242.0,127468.0


We remove the rows where all the per capita GDP values are missing.

- The `subset` parameter of `dropna` restricts the check to the columns from the second through the last.
- `how='all'` specifies that a row is dropped only if every column listed in `subset` is missing.
- `shape` shows the number of rows and columns in the resulting dataframe.

In [15]:
percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how='all', inplace=True)
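
The difference between `how='all'` and the default `how='any'` is easiest to see on a small example. The sketch below uses a made-up frame; the row labelled `XXX01: Hypothetical` is invented to show a partially missing record and does not come from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'metro': ['Metropolitan areas', 'AUS01: Greater Sydney', 'XXX01: Hypothetical'],
    'pcGDP2001': [np.nan, 43313.0, np.nan],   # third row: missing in 2001 only
    'pcGDP2002': [np.nan, 44008.0, 12345.0],
})

gdp_cols = toy.columns[1:]

# how='all': drop a row only when every GDP column is missing.
kept_all = toy.dropna(subset=gdp_cols, how='all')

# how='any' (the default): drop a row when at least one GDP column is missing.
kept_any = toy.dropna(subset=gdp_cols, how='any')

print(kept_all.shape, kept_any.shape)
```

With `how='all'` the partially filled row survives, which matches the goal here of removing only the header-like rows that carry no GDP values at all.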

In [16]:
percapitaGDP.describe()

Unnamed: 0,pcGDP2001,pcGDP2002,pcGDP2003,pcGDP2004,pcGDP2005,pcGDP2006,pcGDP2007,pcGDP2008,pcGDP2009,pcGDP2010,pcGDP2011,pcGDP2012,pcGDP2013,pcGDP2014,pcGDP2015,pcGDP2016,pcGDP2017,pcGDP2018
count,424.0,440.0,440.0,440.0,447.0,447.0,447.0,455.0,471.0,471.0,480.0,480.0,480.0,480.0,480.0,480.0,445.0,441.0
mean,41263.658019,41015.070455,41553.361364,42473.022727,42881.143177,43987.762864,44786.760626,44533.958242,42724.316348,43433.511677,43946.966667,44075.9375,44302.154167,44942.49375,45802.220833,46243.666667,47489.089888,48032.668934
std,11877.960193,12536.516772,12456.583153,12621.90115,13172.229181,13450.431995,13693.732714,14082.871703,13602.723246,13896.77508,14018.472002,14170.164166,14251.392256,14421.619849,14948.683819,14938.54938,15463.803389,15719.725615
min,10988.0,11435.0,11969.0,12777.0,13062.0,13855.0,13937.0,2236.0,2202.0,2227.0,2363.0,2572.0,2700.0,2683.0,2761.0,2796.0,2745.0,2832.0
25%,33139.25,32636.0,33284.75,33864.5,33735.5,34540.0,35226.0,35094.0,33730.0,34294.5,34582.75,34808.75,35113.75,35766.25,36128.5,36584.75,37316.0,37908.0
50%,39543.5,39683.5,40390.5,41200.5,41609.0,42929.0,43461.0,43287.0,41250.0,41627.0,42345.0,42131.5,42154.5,42777.5,43237.5,43931.5,45385.0,46057.0
75%,47971.75,48611.0,49354.75,50468.25,51025.0,52304.0,53043.5,53132.0,50739.5,51428.5,52568.75,52569.75,53087.5,53737.25,54134.25,54449.75,56023.0,56638.0
max,91488.0,93566.0,98123.0,96242.0,101084.0,121053.0,122897.0,120158.0,114486.0,119658.0,119965.0,117348.0,123709.0,121011.0,121623.0,117879.0,122242.0,127468.0


In [17]:
percapitaGDP.head()

Unnamed: 0,metro,pcGDP2001,pcGDP2002,pcGDP2003,pcGDP2004,pcGDP2005,pcGDP2006,pcGDP2007,pcGDP2008,pcGDP2009,pcGDP2010,pcGDP2011,pcGDP2012,pcGDP2013,pcGDP2014,pcGDP2015,pcGDP2016,pcGDP2017,pcGDP2018
2,AUS01: Greater Sydney,43313.0,44008.0,45424.0,45837.0,45423.0,45547.0,45880.0,45225.0,45900.0,45672.0,46535.0,47350.0,47225.0,48510.0,50075.0,50519.0,50578.0,49860.0
3,AUS02: Greater Melbourne,40125.0,40894.0,41602.0,42188.0,41484.0,41589.0,42316.0,40975.0,41384.0,40943.0,41165.0,41264.0,41157.0,42114.0,42928.0,42671.0,43025.0,42674.0
4,AUS03: Greater Brisbane,37580.0,37564.0,39080.0,40762.0,42976.0,44475.0,44635.0,46192.0,43507.0,42774.0,44166.0,43764.0,43379.0,43754.0,44388.0,45723.0,46876.0,46640.0
5,AUS04: Greater Perth,45713.0,47371.0,48719.0,51020.0,55278.0,60142.0,62551.0,63899.0,63616.0,70111.0,73715.0,72679.0,76153.0,70395.0,66544.0,66032.0,66424.0,70390.0
6,AUS05: Greater Adelaide,36505.0,37194.0,37634.0,37399.0,37604.0,38151.0,39049.0,38502.0,39538.0,39309.0,39223.0,39812.0,39855.0,40306.0,40295.0,39737.0,40115.0,39924.0


In [19]:
percapitaGDP.shape

(480, 19)

We confirm that the `metro` column has 480 valid values and 480 unique values before setting it as the index.

In [20]:
percapitaGDP.metro.count()

480

In [21]:
percapitaGDP.metro.nunique()

480

In [22]:
percapitaGDP.set_index('metro', inplace=True)
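
As a possible variation on the manual `count` and `nunique` checks, pandas can enforce uniqueness while setting the index. The following is a sketch on a toy frame, not the notebook's own approach; `toy`, `all_present`, `all_unique`, and `indexed` are names introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'metro': ['AUS01: Greater Sydney', 'AUS02: Greater Melbourne'],
    'pcGDP2001': [43313.0, 40125.0],
})

# Same idea as comparing count() and nunique() with the number of rows.
all_present = toy['metro'].count() == len(toy)
all_unique = toy['metro'].is_unique

# verify_integrity=True makes set_index raise a ValueError if duplicates exist.
indexed = toy.set_index('metro', verify_integrity=True)

print(all_present, all_unique)
print(indexed.loc['AUS02: Greater Melbourne', 'pcGDP2001'])
```

A unique index matters because `loc` with a duplicated label returns several rows instead of one.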

In [24]:
percapitaGDP.head()

metro,pcGDP2001,pcGDP2002,pcGDP2003,pcGDP2004,pcGDP2005,pcGDP2006,pcGDP2007,pcGDP2008,pcGDP2009,pcGDP2010,pcGDP2011,pcGDP2012,pcGDP2013,pcGDP2014,pcGDP2015,pcGDP2016,pcGDP2017,pcGDP2018
AUS01: Greater Sydney,43313.0,44008.0,45424.0,45837.0,45423.0,45547.0,45880.0,45225.0,45900.0,45672.0,46535.0,47350.0,47225.0,48510.0,50075.0,50519.0,50578.0,49860.0
AUS02: Greater Melbourne,40125.0,40894.0,41602.0,42188.0,41484.0,41589.0,42316.0,40975.0,41384.0,40943.0,41165.0,41264.0,41157.0,42114.0,42928.0,42671.0,43025.0,42674.0
AUS03: Greater Brisbane,37580.0,37564.0,39080.0,40762.0,42976.0,44475.0,44635.0,46192.0,43507.0,42774.0,44166.0,43764.0,43379.0,43754.0,44388.0,45723.0,46876.0,46640.0
AUS04: Greater Perth,45713.0,47371.0,48719.0,51020.0,55278.0,60142.0,62551.0,63899.0,63616.0,70111.0,73715.0,72679.0,76153.0,70395.0,66544.0,66032.0,66424.0,70390.0
AUS05: Greater Adelaide,36505.0,37194.0,37634.0,37399.0,37604.0,38151.0,39049.0,38502.0,39538.0,39309.0,39223.0,39812.0,39855.0,40306.0,40295.0,39737.0,40115.0,39924.0


In [25]:
percapitaGDP.loc['AUS02: Greater Melbourne']

pcGDP2001    40125.0
pcGDP2002    40894.0
pcGDP2003    41602.0
pcGDP2004    42188.0
pcGDP2005    41484.0
pcGDP2006    41589.0
pcGDP2007    42316.0
pcGDP2008    40975.0
pcGDP2009    41384.0
pcGDP2010    40943.0
pcGDP2011    41165.0
pcGDP2012    41264.0
pcGDP2013    41157.0
pcGDP2014    42114.0
pcGDP2015    42928.0
pcGDP2016    42671.0
pcGDP2017    43025.0
pcGDP2018    42674.0
Name: AUS02: Greater Melbourne, dtype: float64