<a href="https://colab.research.google.com/github/strawndri/python-ds-pandas-io/blob/main/Projeto_Python_Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python for Data Science

This notebook covers different ways of importing and exporting files with the [Pandas library](https://pandas.pydata.org/docs/) for Python.

The whole study is based on the content of the Alura course [Pandas I/O: trabalhando com diferentes formatos de arquivos](https://www.alura.com.br/curso-online-pandas-io-trabalhando-diferentes-formatos-arquivos).

# 1. Reading CSV files

## 1.1 Reading CSV files

The `read_csv` function reads data from a CSV file and returns a *DataFrame*.

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [1]:
import pandas as pd

In [12]:
url = 'https://raw.githubusercontent.com/strawndri/python-ds-pandas-io/main/dados/superstore_data.csv'

In [13]:
dados_mercado = pd.read_csv(url)

In [14]:
dados_mercado.head()

Unnamed: 0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Complain
0,1826,1970,Graduation,Divorced,84835.0,0,0,6/16/2014,0,189,...,111,189,218,1,4,4,6,1,1,0
1,1,1961,Graduation,Single,57091.0,0,0,6/15/2014,0,464,...,7,0,37,1,7,3,7,5,1,0
2,10476,1958,Graduation,Married,67267.0,0,1,5/13/2014,0,134,...,15,2,30,1,3,2,5,2,0,0
3,1386,1967,Graduation,Together,32474.0,1,1,11/5/2014,0,10,...,0,0,0,1,1,0,2,7,0,0
4,5371,1989,Graduation,Single,21474.0,1,0,8/4/2014,0,6,...,11,0,34,2,3,1,2,7,1,0

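`read_csv` also accepts file-like objects, which makes it easy to experiment without downloading anything. The sketch below is only an illustration: the three-row `csv_text` sample and the name `sample` are assumptions made for this example, not part of the course material.

```python
import io

import pandas as pd

# Hypothetical sample that mimics a few columns of the dataset above.
csv_text = (
    "Id,Year_Birth,Education,Income\n"
    "1826,1970,Graduation,84835\n"
    "1,1961,Graduation,57091\n"
    "10476,1958,Graduation,67267\n"
)

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred by pandas
```

The first line of the text is used as the header, so the three remaining lines become the rows of the *DataFrame*.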

In [18]:
url2 = 'https://raw.githubusercontent.com/strawndri/python-ds-pandas-io/main/dados/superstore_data_ponto_virgula.csv'

In [19]:
dados_mercado_ponto_virgula = pd.read_csv(url2)

In [20]:
dados_mercado_ponto_virgula.head()

Unnamed: 0,Id;Year_Birth;Education;Marital_Status;Income;Kidhome;Teenhome;Dt_Customer;Recency;MntWines;MntFruits;MntMeatProducts;MntFishProducts;MntSweetProducts;MntGoldProds;NumDealsPurchases;NumWebPurchases;NumCatalogPurchases;NumStorePurchases;NumWebVisitsMonth;Response;Complain
0,1826;1970;Graduation;Divorced;84835;0;0;6/16/2...
1,1;1961;Graduation;Single;57091;0;0;6/15/2014;0...
2,10476;1958;Graduation;Married;67267;0;1;5/13/2...
3,1386;1967;Graduation;Together;32474;1;1;11/5/2...
4,5371;1989;Graduation;Single;21474;1;0;8/4/2014...

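The table above has a single column whose name is the entire header line. This is the typical symptom of a file whose delimiter differs from the one `read_csv` expects. A minimal sketch of how to spot the problem, assuming a small hypothetical semicolon-separated sample (`sample_text`) in place of the real file:

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample (not the real file).
sample_text = "Id;Year_Birth;Income\n1826;1970;84835\n1;1961;57091\n"

# With the default comma delimiter, each line is read as one field.
misread = pd.read_csv(io.StringIO(sample_text))

print(misread.shape)             # a single column signals a delimiter mismatch
print(misread.columns.tolist())  # the whole header line becomes one column name
```

Checking `shape` right after loading is a cheap way to catch this before any analysis is done on the data.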

## 1.2 Parameters of the `read_csv` function

### `sep`

The `sep` parameter of `read_csv` specifies the **delimiter that separates the fields in the CSV file**, that is, the character that marks where one column ends and the next begins.

By default, `read_csv` assumes the delimiter is a comma (`,`), but other delimiters can be used, such as a semicolon (`;`) or a tab (`\t`).

In [24]:
dados_sem_virgula = pd.read_csv(url2, sep=';')

In [25]:
dados_sem_virgula.head()

Unnamed: 0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Complain
0,1826,1970,Graduation,Divorced,84835.0,0,0,6/16/2014,0,189,...,111,189,218,1,4,4,6,1,1,0
1,1,1961,Graduation,Single,57091.0,0,0,6/15/2014,0,464,...,7,0,37,1,7,3,7,5,1,0
2,10476,1958,Graduation,Married,67267.0,0,1,5/13/2014,0,134,...,15,2,30,1,3,2,5,2,0,0
3,1386,1967,Graduation,Together,32474.0,1,1,11/5/2014,0,10,...,0,0,0,1,1,0,2,7,0,0
4,5371,1989,Graduation,Single,21474.0,1,0,8/4/2014,0,6,...,11,0,34,2,3,1,2,7,1,0

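The same parameter handles tab-separated data. The sketch below assumes a small hypothetical sample (`tsv_text`) rather than a file from the course repository:

```python
import io

import pandas as pd

# Hypothetical tab-separated sample.
tsv_text = "Id\tYear_Birth\tIncome\n1826\t1970\t84835\n1\t1961\t57091\n"

# sep="\t" tells read_csv to split fields on tab characters.
tab_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(tab_data.columns.tolist())
print(tab_data.shape)
```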

### `nrows`

The `nrows` parameter of `read_csv` specifies the **number of rows to read from the CSV file**.

If it is not specified, every row in the file is read.

In [26]:
dados_primeiras_linhas = pd.read_csv(url, nrows=5)

In [27]:
dados_primeiras_linhas

Unnamed: 0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Complain
0,1826,1970,Graduation,Divorced,84835,0,0,6/16/2014,0,189,...,111,189,218,1,4,4,6,1,1,0
1,1,1961,Graduation,Single,57091,0,0,6/15/2014,0,464,...,7,0,37,1,7,3,7,5,1,0
2,10476,1958,Graduation,Married,67267,0,1,5/13/2014,0,134,...,15,2,30,1,3,2,5,2,0,0
3,1386,1967,Graduation,Together,32474,1,1,11/5/2014,0,10,...,0,0,0,1,1,0,2,7,0,0
4,5371,1989,Graduation,Single,21474,1,0,8/4/2014,0,6,...,11,0,34,2,3,1,2,7,1,0

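In the output above, `Income` appears as integers, whereas the full read shows it as floats. A likely reason is that pandas infers each column's type only from the rows it actually reads, and a missing value further down the file would force a float column. The sketch below reproduces that effect on a hypothetical four-row sample; the data and the variable names are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample: the third data row has no Income value.
csv_text = "Id,Income\n1826,84835\n1,57091\n10476,\n1386,32474\n"

full = pd.read_csv(io.StringIO(csv_text))
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# nrows counts data rows; the header line is not included in the count.
print(len(full), full["Income"].dtype)
print(len(first_two), first_two["Income"].dtype)
```

Because of this, a quick preview with `nrows` may show different column types than a full read of the same file.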

### `usecols`

The `usecols` parameter of `read_csv` **selects a specific subset of columns** from the CSV file while it is being read. Columns can be given either by name or by position.

In [28]:
dados_selecao = pd.read_csv(url, usecols=['Id', 'Year_Birth', 'Income'])

In [29]:
dados_selecao

Unnamed: 0,Id,Year_Birth,Income
0,1826,1970,84835.0
1,1,1961,57091.0
2,10476,1958,67267.0
3,1386,1967,32474.0
4,5371,1989,21474.0
...,...,...,...
2235,10142,1976,66476.0
2236,5263,1977,31056.0
2237,22,1976,46310.0
2238,528,1978,65819.0


In [30]:
dados_selecao = pd.read_csv(url, usecols=[0, 1, 4])

In [31]:
dados_selecao

Unnamed: 0,Id,Year_Birth,Income
0,1826,1970,84835.0
1,1,1961,57091.0
2,10476,1958,67267.0
3,1386,1967,32474.0
4,5371,1989,21474.0
...,...,...,...
2235,10142,1976,66476.0
2236,5263,1977,31056.0
2237,22,1976,46310.0
2238,528,1978,65819.0

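Both calls above return the same columns because positions are counted from zero in the order the columns appear in the file. A minimal sketch on a hypothetical two-row sample (the data and variable names are assumptions for this example):

```python
import io

import pandas as pd

# Hypothetical sample with the first five columns of the dataset.
csv_text = (
    "Id,Year_Birth,Education,Marital_Status,Income\n"
    "1826,1970,Graduation,Divorced,84835\n"
    "1,1961,Graduation,Single,57091\n"
)

by_name = pd.read_csv(io.StringIO(csv_text), usecols=["Id", "Year_Birth", "Income"])
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1, 4])

# Selecting by name or by position yields the same DataFrame.
print(by_name.equals(by_position))

# The order of the list is ignored: columns keep the order they have in the file.
reordered = pd.read_csv(io.StringIO(csv_text), usecols=["Income", "Id"])
print(reordered.columns.tolist())
```

To get the columns in a different order, reorder them after reading, for example with `reordered[["Income", "Id"]]`.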

## 1.3 Writing CSV files

The `to_csv` method saves a *DataFrame* to a CSV file.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

In [40]:
dados_selecao.to_csv('clientes_mercado.csv')

In [41]:
clientes_mercado = pd.read_csv('clientes_mercado.csv')  # same relative path used by to_csv above

In [42]:
clientes_mercado.head()

Unnamed: 0.1,Unnamed: 0,Id,Year_Birth,Income
0,0,1826,1970,84835.0
1,1,1,1961,57091.0
2,2,10476,1958,67267.0
3,3,1386,1967,32474.0
4,4,5371,1989,21474.0


### `index`

The `index` parameter of `to_csv` controls whether the *DataFrame* index is written to the CSV file.

* `index=True` (default): the *DataFrame* index is written as the first column;
* `index=False`: the index is left out of the file.

In [43]:
dados_selecao.to_csv('clientes_mercado.csv', index=False)

In [44]:
clientes_mercado = pd.read_csv('clientes_mercado.csv')  # same relative path used by to_csv above

In [45]:
clientes_mercado.head()

Unnamed: 0,Id,Year_Birth,Income
0,1826,1970,84835.0
1,1,1961,57091.0
2,10476,1958,67267.0
3,1386,1967,32474.0
4,5371,1989,21474.0

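The effect of `index` can be inspected without touching the disk, since `to_csv` returns the CSV text as a string when no path is given. The sketch below uses a hypothetical two-row *DataFrame* (`df`) to compare both settings and to show where an `Unnamed: 0` column comes from when the file is read back.

```python
import io

import pandas as pd

# Hypothetical two-row DataFrame.
df = pd.DataFrame({"Id": [1826, 1], "Income": [84835.0, 57091.0]})

# With no path argument, to_csv returns the CSV content as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The header written with the index starts with an empty field.
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading that text back turns the unnamed index field into a regular column.
reread = pd.read_csv(io.StringIO(with_index))
print(reread.columns.tolist())
```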