***
## <font color=blue size=6>Statistics course - PART 1</font>
***

In [7]:
# checking pandas version
import pandas as pd
pd.__version__

'0.25.1'

# <font color=blue>1 Knowing the data</font>
***

## <font color=#0080ff>1.1 Project dataset</font>


### National Household Sample Survey - 2015

The <b>National Household Sample Survey (*PNAD in Portuguese*)</b> investigates, annually and on a permanent basis, general characteristics of the population, education, work, income, and housing. Other topics, such as migration, fertility, nuptiality, health, and food security, are covered with variable periodicity, according to the country's information needs.

Over the 49 years of the survey, these statistics have been an important instrument for formulating, validating, and evaluating policies aimed at socioeconomic development and the improvement of living conditions in Brazil.

### Data Source

https://ww2.ibge.gov.br/home/estatistica/populacao/trabalhoerendimento/pnad2015/microdados.shtm

### Variables considered

> ### Income
> Monthly income from main job for persons 10 years of age and over.

> ### Age
> Age of the resident at the reference date in years.

> ### Height (own elaboration)
> Height of the resident in meters.

> ### UF
> Federative Unit (*Unidade Federativa in Portuguese*) - a code that identifies a state in Brazil

|Code|Description|
|---|---|
|11|Rondônia|
|12|Acre|
|13|Amazonas|
|14|Roraima|
|15|Pará|
|16|Amapá|
|17|Tocantins|
|21|Maranhão|
|22|Piauí|
|23|Ceará|
|24|Rio Grande do Norte|
|25|Paraíba|
|26|Pernambuco|
|27|Alagoas|
|28|Sergipe|
|29|Bahia|
|31|Minas Gerais|
|32|Espírito Santo|
|33|Rio de Janeiro|
|35|São Paulo|
|41|Paraná|
|42|Santa Catarina|
|43|Rio Grande do Sul|
|50|Mato Grosso do Sul|
|51|Mato Grosso|
|52|Goiás|
|53|Distrito Federal|

> ### Gender

|Code|Description|
|---|---|
|0|Male|
|1|Female|

> ### Years of study

|Code|Description|
|---|---|
|1|No schooling or less than 1 year|
|2|1 year|
|3|2 years|
|4|3 years|
|5|4 years|
|6|5 years|
|7|6 years|
|8|7 years|
|9|8 years|
|10|9 years|
|11|10 years|
|12|11 years|
|13|12 years|
|14|13 years|
|15|14 years|
|16|15 years or more|
|17|Not determined|
||Not applicable|

> ### Skin color

|Code|Description|
|---|---|
|0|Indigenous|
|2|White|
|4|Black|
|6|Yellow|
|8|Brown|
|9|Not declared|

#### <font color='red'>Important</font>
***
> The following treatments were applied to the original data:
> 1. Records with an invalid <b>Income</b> value (999 999 999 999) were dropped from the dataset;
> 2. Records with missing <b>Income</b> were dropped from the dataset;
> 3. Only the records of the <b>Reference Person</b> of each household (the person responsible for the household) were kept.

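The three treatments listed above can be sketched as a single boolean filter. This is only an illustration on a toy DataFrame: the `reference_person` column is a hypothetical flag, and the layout does not reproduce the actual PNAD microdata files.

```python
import numpy as np
import pandas as pd

# Sentinel used for invalid income, as described in the note above
INVALID_INCOME = 999_999_999_999

# Toy data; "reference_person" is a hypothetical column name (assumption)
raw = pd.DataFrame({
    "Renda": [800, INVALID_INCOME, np.nan, 1150, 3500],
    "reference_person": [True, True, True, False, True],
})

cleaned = raw[
    (raw["Renda"] != INVALID_INCOME)  # 1. drop the invalid-income sentinel
    & raw["Renda"].notna()            # 2. drop missing income
    & raw["reference_person"]         # 3. keep household reference persons only
].reset_index(drop=True)

print(cleaned)
```

Only the rows that pass all three conditions remain, which is why the project dataset contains no sentinel or missing incomes.
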
### Importing pandas and reading the project dataset

https://pandas.pydata.org/

In [8]:
import pandas as pd

In [9]:
dados = pd.read_csv("dados.csv")

In [10]:
dados.head()

|Unnamed: 0|UF|Sexo|Idade|Cor|Anos de Estudo|Renda|Altura|
|---|---|---|---|---|---|---|---|
|0|11|0|23|8|12|800|1.603808|
|1|11|1|23|2|12|1150|1.73979|
|2|11|1|35|8|15|880|1.760444|
|3|11|0|46|2|6|3500|1.783158|
|4|11|1|47|8|9|150|1.690631|

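The coded columns are easier to read once the codes are mapped to the labels from the tables above. A minimal sketch, rebuilding a few rows inline so it runs without `dados.csv`; the English label strings are illustrative translations, not part of the dataset.

```python
import pandas as pd

# A few rows with the same coded columns as dados.head()
sample = pd.DataFrame({
    "UF": [11, 11, 11, 11, 11],
    "Sexo": [0, 1, 1, 0, 1],
    "Cor": [8, 2, 8, 2, 8],
})

# Code-to-label dictionaries taken from the tables above (labels are assumptions)
sex_labels = {0: "Male", 1: "Female"}
color_labels = {0: "Indigenous", 2: "White", 4: "Black",
                6: "Yellow", 8: "Brown", 9: "Not declared"}

# Series.map replaces each code with its label; unmapped codes would become NaN
labeled = sample.assign(
    Sexo=sample["Sexo"].map(sex_labels),
    Cor=sample["Cor"].map(color_labels),
)

print(labeled)
```

The same pattern would apply to `dados` itself, for example `dados["Sexo"].map(sex_labels)`.
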

## <font color=#0080ff>1.2 Data types</font>


1 - **Qualitative data** describe attributes or categories of a variable

2 - **Quantitative data** are counts or measurements that can be aggregated or summed

### Ordinal qualitative variables

► Variables that can be sorted or hierarchized

In [11]:
# Two years of study is better than only one year, and so on
sorted(dados['Anos de Estudo'].unique())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

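One way to make the ordering explicit in pandas is an ordered categorical. The sketch below uses a handful of `Anos de Estudo` codes typed inline rather than the full dataset. Note that treating code 17 ("not determined") as the highest level is a simplification of this example, since that code does not really rank above 16.

```python
import pandas as pd

# A few "Anos de Estudo" codes, typed inline for a self-contained example
codes = pd.Series([12, 12, 15, 6, 9], name="Anos de Estudo")

# Declare the full code range 1..17 as ordered categories
study = pd.Categorical(codes, categories=list(range(1, 18)), ordered=True)

# Ordered categoricals support min, max, and comparisons against a category
lowest = study.min()
highest = study.max()
above_code_9 = study > 9

print(lowest, highest, above_code_9)
```

Because the categories are ordered, comparisons such as `study > 9` are meaningful, which is exactly what distinguishes an ordinal variable from a nominal one.
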
### Nominal qualitative variables

► Variables that cannot be sorted or hierarchized

*The values of the following variables cannot be said to be larger than one another, nor can they be ranked*

In [13]:
dados.UF.unique()

array([11, 12, 13, 14, 15, 16, 17, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31,
       32, 33, 35, 41, 42, 43, 50, 51, 52, 53])

In [14]:
dados.Sexo.unique()

array([0, 1])

In [15]:
dados.Cor.unique()

array([8, 2, 4, 6, 0])

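Although these codes are stored as integers, their numeric order carries no meaning. A small sketch, using a few `Cor` codes typed inline, of how an unordered categorical keeps that explicit; counting frequencies and taking the mode remain valid operations for nominal data.

```python
import pandas as pd

# A few "Cor" codes, typed inline for a self-contained example
cor = pd.Series([8, 2, 8, 2, 8], name="Cor").astype("category")

# The category dtype is unordered by default, matching a nominal variable
is_ordered = cor.cat.ordered

# Frequencies and the most frequent category are meaningful for nominal data
counts = cor.value_counts().to_dict()
most_frequent = cor.mode().iloc[0]

print(is_ordered, counts, most_frequent)
```
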
### Discrete quantitative variables

► Variables that represent a count, where the possible values form a finite or countable set.

In [16]:
print(f"From {dados.Idade.min()} until {dados.Idade.max()} years old")

From 13 until 99 years old


#### <font color='red'>Important</font>
***
> The `age` variable can be classified in three distinct ways:
> 1. <b>DISCRETE QUANTITATIVE</b> - when it represents complete years (integers);
> 2. <b>CONTINUOUS QUANTITATIVE</b> - when it represents exact age (including fractions of a year);
> 3. <b>ORDINAL QUALITATIVE</b> - when it represents age ranges

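The third case, age as an ordinal qualitative variable, can be sketched with `pd.cut`, which turns integer ages into ordered ranges. The bin edges and labels below are arbitrary choices for illustration, not part of the original material.

```python
import pandas as pd

# A few "Idade" values, typed inline for a self-contained example
ages = pd.Series([23, 23, 35, 46, 47], name="Idade")

# Hypothetical bin edges and labels (assumptions for this example)
bins = [0, 17, 29, 44, 59, 120]
labels = ["0-17", "18-29", "30-44", "45-59", "60+"]

# With the default right=True, each bin includes its right edge: (17, 29], (29, 44], ...
age_range = pd.cut(ages, bins=bins, labels=labels)

print(age_range)
```

The result is an ordered categorical, so the ranges can be sorted and compared even though the individual ages are no longer available.
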
### Continuous quantitative variables

► Variables that represent counts or measurements taking values on a continuous scale (real numbers).

In [18]:
print(f"From {dados.Altura.min()} until {dados.Altura.max()} meters")

From 1.339244614 until 2.028496765 meters


### Classification of a variable
*Labels in Portuguese*
<img src='https://caelum-online-public.s3.amazonaws.com/1177-estatistica-parte1/01/img001.png' width='70%'>