# Data Wrangling I

Data Wrangling é um processo manual para transformar dados brutos em um formato adequado para análise.
ETL é um processo automatizado projetado para integrar, limpar e preencher dados em um repositório, normalmente um data warehouse. Embora a organização de dados seja exploratória e iterativa, o ETL é sistemático e definido.

## Coleta de Dados

A partir da lista de ativos da API Alpha Vantage em conjunto com a biblioteca Yahoo Finance, vamos criar uma nova base de dados com as ações, informando os dividendos dos últimos 5 anos. 


In [1]:
import pandas as pd

df = pd.read_csv('listing_status.csv', header=0)
df.head(10)

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
0,A,Agilent Technologies Inc,NYSE,Stock,1999-11-18,,Active
1,AA,Alcoa Corp,NYSE,Stock,2016-10-18,,Active
2,AAA,ALTERNATIVE ACCESS FIRST PRIORITY CLO BOND ETF,NYSE ARCA,ETF,2020-09-09,,Active
3,AAAU,Goldman Sachs Physical Gold ETF,BATS,ETF,2018-08-15,,Active
4,AACG,ATA Creativity Global,NASDAQ,Stock,2008-01-29,,Active
5,AACT,Ares Acquisition Corporation II - Class A,NYSE,Stock,2023-06-12,,Active
6,AACT-U,Ares Acquisition Corporation II - Units (1 Ord...,NYSE,Stock,2023-04-21,,Active
7,AACT-WS,Ares Acquisition Corporation II - Warrants (01...,NYSE,Stock,2023-06-12,,Active
8,AADI,Aadi Bioscience Inc,NASDAQ,Stock,2017-08-08,,Active
9,AADR,ADVISORSHARES DORSEY WRIGHT ADR ETF,NASDAQ,ETF,2010-07-21,,Active


In [2]:
#! pip install yfinance
import yfinance as yf

def get_finance_data_name(symbol):
    try:
        return yf.Ticker(symbol).info['shortName']
    except:
        return f''

## Limpeza e Transformação dos Dados

Faremos um processo mais minucioso de limpeza dos dados, pois sabemos que há muitas inconsistências de tipos:

In [3]:
df.dtypes

symbol            object
name              object
exchange          object
assetType         object
ipoDate           object
delistingDate    float64
status            object
dtype: object

E também de dados ausentes:

In [4]:
df.isna().sum()

symbol               1
name                34
exchange             0
assetType            0
ipoDate              0
delistingDate    11628
status               0
dtype: int64

In [5]:
df.loc[df.name.isna()]

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
456,AMEH,,NASDAQ,Stock,2024-02-26,,Active
487,AMRS,,NASDAQ,Stock,2023-08-18,,Active
896,AVRO,,NASDAQ,Stock,2024-06-21,,Active
2113,CLVS,,NASDAQ,Stock,2023-01-03,,Active
2419,CTEST,,NYSE,Stock,2019-07-25,,Active
2632,DEC,,NYSE,Stock,2023-12-18,,Active
2676,DFFN,,NASDAQ,Stock,2023-08-17,,Active
4135,FWP,,NASDAQ,Stock,2022-12-27,,Active
6994,MTEST,,NYSE,Stock,2019-10-09,,Active
7401,NTEST-G,,NYSE,Stock,2019-07-17,,Active


In [6]:
df.loc[df.symbol.isna()]

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
7073,,Nano Labs Ltd,NASDAQ,Stock,2022-07-12,,Active


Vamos corrigir esses tipos para lidar melhor com essas informações:

In [7]:
df['ipoDate'] = pd.to_datetime(df['ipoDate'], format='%Y-%m-%d')
df['delistingDate'] = pd.to_datetime(df['delistingDate'], format='%Y-%m-%d')
df['assetType'] = df['assetType'].astype('category')
df['exchange'] = df['exchange'].astype('category')
df['name'] = df['name'].astype('string')

In [8]:
df.dtypes

symbol                   object
name             string[python]
exchange               category
assetType              category
ipoDate          datetime64[ns]
delistingDate    datetime64[ns]
status                   object
dtype: object

## Integração dos Dados

Para cada linha com nome nulo, vamos recuperar essa informação com a função do Yahoo Finance. Para as que não forem possíveis de recuperar, vamos excluir essas linhas:

In [9]:
df['name'] = df.apply(lambda x: get_finance_data_name(x.symbol) if pd.isna(x['name']) or x['name'] == '' else x.name, axis=1)

df.dropna(subset=['name', 'symbol'], inplace=True)
df.drop(df[df.name == ''].index, inplace=True)

df.shape


404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/AMEH?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=AMEH&crumb=P%2FStuqaRma7
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/DFFN?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=DFFN&crumb=P%2FStuqaRma7
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/NTEST-G?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=NTEST-G&crumb=P%2FStuqaRma7
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/NTEST-J?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDoma

(11595, 7)

Vamos também excluir as duplicatas de nomes:

In [10]:
df.drop_duplicates(subset=['name'], inplace=True)
df.shape

(11595, 7)

Agora vamos integrar à nossa base de dados informações sobre os dividendos pagos pelas empresas e ETF's nos últimos 5 anos:

In [13]:
def get_dividends_by_period(period, symbol):
    try: 
        return yf.Ticker(symbol).history(period=period).Dividends.sum() # 1d, 1w, 1m, 3m, 6m, 5y, 10y, ytd, max
    except:
        return 0
    
get_dividends_by_period('5y', 'AAPL')

np.float64(4.58)

In [12]:
def get_history_by_period(period, symbol):
    try:
        return yf.Ticker(symbol).history(period=period)
    except:
        return pd.DataFrame()

get_history_by_period('5y', 'AAPL')

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2020-03-19 00:00:00-04:00,60.036366,61.358966,58.876360,59.402973,271857200,0.0,0.0
2020-03-20 00:00:00-04:00,59.985399,61.113859,55.330817,55.631741,401693200,0.0,0.0
2020-03-23 00:00:00-04:00,55.350235,55.452160,51.595990,54.449894,336752800,0.0,0.0
2020-03-24 00:00:00-04:00,57.359606,60.109159,56.859688,59.912590,287531200,0.0,0.0
2020-03-25 00:00:00-04:00,60.851764,62.671857,59.286485,59.582554,303602000,0.0,0.0
...,...,...,...,...,...,...,...
2025-03-12 00:00:00-04:00,220.139999,221.750000,214.910004,216.979996,62547500,0.0,0.0
2025-03-13 00:00:00-04:00,215.949997,216.839996,208.419998,209.679993,61368300,0.0,0.0
2025-03-14 00:00:00-04:00,211.250000,213.949997,209.580002,213.490005,60107600,0.0,0.0
2025-03-17 00:00:00-04:00,213.309998,215.220001,209.970001,214.000000,48073400,0.0,0.0


## Redução e Validação dos Dados

Vamos extrair uma amostra aleatória para iniciar a comparação do desempenho nominal dos dividendos dos últimos 5 anos entre as bolsas.

In [14]:
sample_df = df.sample(n=25)
sample_df

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
6945,MSCI,6945,NYSE,Stock,2007-11-15,NaT,Active
7918,PBNV,7918,BATS,ETF,2024-05-22,NaT,Active
6198,LEGR,6198,NASDAQ,ETF,2018-01-25,NaT,Active
3823,FLC,3823,NYSE,ETF,2003-10-09,NaT,Active
8328,PRZO,8328,NASDAQ,Stock,2023-07-27,NaT,Active
8915,RNST,8915,NYSE,Stock,1992-04-24,NaT,Active
2224,COHN,2224,NYSE MKT,Stock,2004-05-06,NaT,Active
4125,FUTY,4125,NYSE ARCA,ETF,2013-10-24,NaT,Active
6075,KSTR,6075,NYSE ARCA,ETF,2021-01-27,NaT,Active
6836,MNTL,6836,NASDAQ,ETF,2024-01-23,NaT,Active


In [15]:
sample_df['dividends_last_5_years'] = sample_df.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
sample_df

$PGRU: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$RITM-P-C: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
6945,MSCI,6945,NYSE,Stock,2007-11-15,NaT,Active,24.18
7918,PBNV,7918,BATS,ETF,2024-05-22,NaT,Active,0.0
6198,LEGR,6198,NASDAQ,ETF,2018-01-25,NaT,Active,4.202
3823,FLC,3823,NYSE,ETF,2003-10-09,NaT,Active,6.609
8328,PRZO,8328,NASDAQ,Stock,2023-07-27,NaT,Active,0.0
8915,RNST,8915,NYSE,Stock,1992-04-24,NaT,Active,4.4
2224,COHN,2224,NYSE MKT,Stock,2004-05-06,NaT,Active,4.25
4125,FUTY,4125,NYSE ARCA,ETF,2013-10-24,NaT,Active,6.251
6075,KSTR,6075,NYSE ARCA,ETF,2021-01-27,NaT,Active,0.0
6836,MNTL,6836,NASDAQ,ETF,2024-01-23,NaT,Active,0.257


Como a amostra aleatória trouxe muitas empresas novas, sem histórico de dividendos mínimo consistente para a nossa análise, vamos tentar refinar nossa amostra. Antes, vamos verificar a proporção de ativos por exchange e por tipo:

In [16]:
rate_exchange = df.exchange.value_counts(normalize=True)
rate_exchange

exchange
NASDAQ       0.447176
NYSE         0.263131
NYSE ARCA    0.192842
BATS         0.068650
NYSE MKT     0.028202
Name: proportion, dtype: float64

In [17]:
rate_asset = df.assetType.value_counts(normalize=True)
rate_asset

assetType
Stock    0.644761
ETF      0.355239
Name: proportion, dtype: float64

Para fins de simplificação, vamos deixar somente as exchanges mais conhecidas e verificar a proporção novamente:

In [18]:
df.drop(df[df.exchange == 'NYSE ARCA'].index, inplace=True)
df.drop(df[df.exchange == 'NYSE MKT'].index, inplace=True)
df.drop(df[df.exchange == 'BATS'].index, inplace=True)

In [19]:
rate_exchange = df.exchange.value_counts(normalize=True)
rate_exchange

exchange
NASDAQ       0.629553
NYSE         0.370447
BATS         0.000000
NYSE ARCA    0.000000
NYSE MKT     0.000000
Name: proportion, dtype: float64

Agora vamos extrair uma amostra obedecendo essas proporções:

In [23]:
exchange_sample = df.groupby('exchange').apply(lambda x: x.sample(int(10 * rate_exchange[x.name]))).droplevel('exchange')
exchange_sample

  exchange_sample = df.groupby('exchange').apply(lambda x: x.sample(int(10 * rate_exchange[x.name]))).droplevel('exchange')
  exchange_sample = df.groupby('exchange').apply(lambda x: x.sample(int(10 * rate_exchange[x.name]))).droplevel('exchange')


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
11299,WWD,11299,NASDAQ,Stock,1996-05-30,NaT,Active
4444,GLUE,4444,NASDAQ,Stock,2021-06-24,NaT,Active
9558,SMPL,9558,NASDAQ,Stock,2017-07-10,NaT,Active
8167,PLMIW,8167,NASDAQ,Stock,2021-05-06,NaT,Active
1389,BNDW,1389,NASDAQ,ETF,2018-09-06,NaT,Active
11477,XWEL,11477,NASDAQ,Stock,2018-01-05,NaT,Active
5936,KEN,5936,NYSE,Stock,2015-01-14,NaT,Active
457,AMG,457,NYSE,Stock,1997-11-21,NaT,Active
1407,BNS,1407,NYSE,Stock,1999-09-13,NaT,Active


In [24]:
asset_sample = df.groupby('assetType').apply(lambda x: x.sample(int(10 * rate_asset[x.name]))).droplevel('assetType')
asset_sample

  asset_sample = df.groupby('assetType').apply(lambda x: x.sample(int(10 * rate_asset[x.name]))).droplevel('assetType')
  asset_sample = df.groupby('assetType').apply(lambda x: x.sample(int(10 * rate_asset[x.name]))).droplevel('assetType')


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
10955,VONG,10955,NASDAQ,ETF,2010-09-22,NaT,Active
5580,IUSB,5580,NASDAQ,ETF,2014-06-12,NaT,Active
9176,SCDS,9176,NASDAQ,ETF,2024-08-08,NaT,Active
7865,PANL,7865,NASDAQ,Stock,2013-12-19,NaT,Active
11346,XELAP,11346,NASDAQ,Stock,2022-03-23,NaT,Active
7905,PBH,7905,NYSE,Stock,2005-02-10,NaT,Active
1441,BOTJ,1441,NASDAQ,Stock,2000-08-10,NaT,Active
9357,SHAK,9357,NYSE,Stock,2015-01-30,NaT,Active
11094,WASH,11094,NASDAQ,Stock,1995-08-18,NaT,Active


Para comparar dividendos, vamos adicionar as informações como anteriormente, em cada sample:

In [25]:
exchange_sample['dividends_last_5_years'] = exchange_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
exchange_sample


$PLMIW: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
11299,WWD,11299,NASDAQ,Stock,1996-05-30,NaT,Active,3.815
4444,GLUE,4444,NASDAQ,Stock,2021-06-24,NaT,Active,0.0
9558,SMPL,9558,NASDAQ,Stock,2017-07-10,NaT,Active,0.0
8167,PLMIW,8167,NASDAQ,Stock,2021-05-06,NaT,Active,0.0
1389,BNDW,1389,NASDAQ,ETF,2018-09-06,NaT,Active,10.033
11477,XWEL,11477,NASDAQ,Stock,2018-01-05,NaT,Active,0.0
5936,KEN,5936,NYSE,Stock,2015-01-14,NaT,Active,24.43
457,AMG,457,NYSE,Stock,1997-11-21,NaT,Active,0.2
1407,BNS,1407,NYSE,Stock,1999-09-13,NaT,Active,14.219


In [26]:
asset_sample['dividends_last_5_years'] = asset_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
asset_sample

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
10955,VONG,10955,NASDAQ,ETF,2010-09-22,NaT,Active,2.59425
5580,IUSB,5580,NASDAQ,ETF,2014-06-12,NaT,Active,6.996
9176,SCDS,9176,NASDAQ,ETF,2024-08-08,NaT,Active,0.19
7865,PANL,7865,NASDAQ,Stock,2013-12-19,NaT,Active,1.325
11346,XELAP,11346,NASDAQ,Stock,2022-03-23,NaT,Active,0.835
7905,PBH,7905,NYSE,Stock,2005-02-10,NaT,Active,0.0
1441,BOTJ,1441,NASDAQ,Stock,2000-08-10,NaT,Active,1.56818
9357,SHAK,9357,NYSE,Stock,2015-01-30,NaT,Active,0.0
11094,WASH,11094,NASDAQ,Stock,1995-08-18,NaT,Active,10.81


Numa outra tentativa, vamos montar uma amostra considerando empresas listadas entre 5 e 10 anos:

In [33]:
five_year_df = df[df.ipoDate.dt.year.between(2014, 2024)]
five_year_df

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
1,AA,1,NYSE,Stock,2016-10-18,NaT,Active
5,AACT,5,NYSE,Stock,2023-06-12,NaT,Active
6,AACT-U,6,NYSE,Stock,2023-04-21,NaT,Active
7,AACT-WS,7,NYSE,Stock,2023-06-12,NaT,Active
8,AADI,8,NASDAQ,Stock,2017-08-08,NaT,Active
...,...,...,...,...,...,...,...
11618,ZVZZT,11618,NASDAQ,Stock,2017-09-22,NaT,Active
11620,ZWZZT,11620,NASDAQ,Stock,2017-09-22,NaT,Active
11621,ZXYZ-A,11621,NASDAQ,Stock,2016-01-19,NaT,Active
11623,ZYME,11623,NASDAQ,Stock,2017-04-28,NaT,Active


In [None]:
five_year_rate = five_year_df.ipoDate.dt.year.value_counts(normalize=True)
five_year_rate

ipoDate
2021    0.251346
2022    0.116697
2023    0.107919
2020    0.104329
2024    0.083982
2019    0.065430
2018    0.061839
2017    0.060642
2014    0.058448
2016    0.048873
2015    0.040495
Name: proportion, dtype: float64

In [56]:
five_year = five_year_rate.sample(5)
five_year

ipoDate
2024    0.083982
2018    0.061839
2017    0.060642
2014    0.058448
2021    0.251346
Name: proportion, dtype: float64

In [62]:
five_year_df.drop(five_year_df[five_year_df.ipoDate.dt.year.isin([2015,2016,2019,2020,2022,2023])].index, inplace=True)
five_year_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  five_year_df.drop(five_year_df[five_year_df.ipoDate.dt.year.isin([2015,2016,2019,2020,2022,2023])].index, inplace=True)


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
8,AADI,8,NASDAQ,Stock,2017-08-08,NaT,Active
13,AAM-U,13,NYSE,Stock,2024-08-01,NaT,Active
47,ABOS,47,NASDAQ,Stock,2021-07-01,NaT,Active
49,ABR-P-D,49,NYSE,Stock,2021-05-26,NaT,Active
50,ABR-P-E,50,NYSE,Stock,2021-08-05,NaT,Active
...,...,...,...,...,...,...,...
11613,ZURA,11613,NASDAQ,Stock,2021-09-03,NaT,Active
11614,ZVIA,11614,NYSE,Stock,2021-07-22,NaT,Active
11618,ZVZZT,11618,NASDAQ,Stock,2017-09-22,NaT,Active
11620,ZWZZT,11620,NASDAQ,Stock,2017-09-22,NaT,Active


In [63]:
five_year_rate = five_year_df.ipoDate.dt.year.value_counts(normalize=True)
five_year_rate

ipoDate
2021    0.486862
2024    0.162674
2018    0.119784
2017    0.117465
2014    0.113215
Name: proportion, dtype: float64

In [65]:
five_year_sample = five_year_df.groupby(five_year_df.ipoDate.dt.year).apply(lambda x: x.sample(int(30 * five_year[x.name]))).droplevel('ipoDate')
five_year_sample

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
9465,SKOR,9465,NASDAQ,ETF,2014-11-13,NaT,Active
8914,RNSC,8914,NASDAQ,ETF,2017-06-22,NaT,Active
619,AQST,619,NASDAQ,Stock,2018-07-25,NaT,Active
7153,NDACW,7153,NASDAQ,Stock,2021-04-22,NaT,Active
10677,USCB,10677,NASDAQ,Stock,2021-07-23,NaT,Active
6162,LCAAW,6162,NASDAQ,Stock,2021-05-11,NaT,Active
3283,EOCW,3283,NYSE,Stock,2021-08-19,NaT,Active
6570,MCAGU,6570,NASDAQ,Stock,2021-11-12,NaT,Active
8786,RF-P-E,8786,NYSE,Stock,2021-04-27,NaT,Active
3079,EDR,3079,NYSE,Stock,2021-04-28,NaT,Active


In [66]:
five_year_sample['dividends_last_5_years'] = five_year_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
five_year_sample

$NDACW: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$LCAAW: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$EOCW: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$RF-P-E: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
9465,SKOR,9465,NASDAQ,ETF,2014-11-13,NaT,Active,8.749
8914,RNSC,8914,NASDAQ,ETF,2017-06-22,NaT,Active,2.081576
619,AQST,619,NASDAQ,Stock,2018-07-25,NaT,Active,0.0
7153,NDACW,7153,NASDAQ,Stock,2021-04-22,NaT,Active,0.0
10677,USCB,10677,NASDAQ,Stock,2021-07-23,NaT,Active,0.3
6162,LCAAW,6162,NASDAQ,Stock,2021-05-11,NaT,Active,0.0
3283,EOCW,3283,NYSE,Stock,2021-08-19,NaT,Active,0.0
6570,MCAGU,6570,NASDAQ,Stock,2021-11-12,NaT,Active,0.0
8786,RF-P-E,8786,NYSE,Stock,2021-04-27,NaT,Active,0.0
3079,EDR,3079,NYSE,Stock,2021-04-28,NaT,Active,0.42
