# Data Wrangling I

Data Wrangling é um processo manual para transformar dados brutos em um formato adequado para análise.
ETL é um processo automatizado projetado para integrar, limpar e preencher dados em um repositório, normalmente um data warehouse. Embora a organização de dados seja exploratória e iterativa, o ETL é sistemático e definido.

## Coleta de Dados

A partir da lista de ativos da API Alpha Vantage em conjunto com a biblioteca Yahoo Finance, vamos criar uma nova base de dados com as ações, informando os dividendos dos últimos 5 anos. 


In [119]:
import pandas as pd

df = pd.read_csv('listing_status.csv', header=0)
df.head(10)

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
0,A,Agilent Technologies Inc,NYSE,Stock,1999-11-18,,Active
1,AA,Alcoa Corp,NYSE,Stock,2016-10-18,,Active
2,AAA,ALTERNATIVE ACCESS FIRST PRIORITY CLO BOND ETF,NYSE ARCA,ETF,2020-09-09,,Active
3,AAAU,Goldman Sachs Physical Gold ETF,BATS,ETF,2018-08-15,,Active
4,AACG,ATA Creativity Global,NASDAQ,Stock,2008-01-29,,Active
5,AACT,Ares Acquisition Corporation II - Class A,NYSE,Stock,2023-06-12,,Active
6,AACT-U,Ares Acquisition Corporation II - Units (1 Ord...,NYSE,Stock,2023-04-21,,Active
7,AACT-WS,Ares Acquisition Corporation II - Warrants (01...,NYSE,Stock,2023-06-12,,Active
8,AADI,Aadi Bioscience Inc,NASDAQ,Stock,2017-08-08,,Active
9,AADR,ADVISORSHARES DORSEY WRIGHT ADR ETF,NASDAQ,ETF,2010-07-21,,Active


In [120]:
#! pip install yfinance
import yfinance as yf

def get_finance_data_name(symbol):
    try:
        return yf.Ticker(symbol).info['shortName']
    except:
        return f''

## Limpeza e Transformação dos Dados

Faremos um processo mais minucioso de limpeza dos dados, pois sabemos que há muitas inconsistências de tipos:

In [121]:
df.dtypes

symbol            object
name              object
exchange          object
assetType         object
ipoDate           object
delistingDate    float64
status            object
dtype: object

E também de dados ausentes:

In [122]:
df.isna().sum()

symbol               1
name                34
exchange             0
assetType            0
ipoDate              0
delistingDate    11628
status               0
dtype: int64

In [105]:
df.loc[df.name.isna()]

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
456,AMEH,,NASDAQ,Stock,2024-02-26,,Active
487,AMRS,,NASDAQ,Stock,2023-08-18,,Active
896,AVRO,,NASDAQ,Stock,2024-06-21,,Active
2113,CLVS,,NASDAQ,Stock,2023-01-03,,Active
2419,CTEST,,NYSE,Stock,2019-07-25,,Active
2632,DEC,,NYSE,Stock,2023-12-18,,Active
2676,DFFN,,NASDAQ,Stock,2023-08-17,,Active
4135,FWP,,NASDAQ,Stock,2022-12-27,,Active
6994,MTEST,,NYSE,Stock,2019-10-09,,Active
7401,NTEST-G,,NYSE,Stock,2019-07-17,,Active


In [106]:
df.loc[df.symbol.isna()]

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
7073,,Nano Labs Ltd,NASDAQ,Stock,2022-07-12,,Active


Vamos corrigir esses tipos para lidar melhor com essas informações:

In [123]:
df['ipoDate'] = pd.to_datetime(df['ipoDate'], format='%Y-%m-%d')
df['delistingDate'] = pd.to_datetime(df['delistingDate'], format='%Y-%m-%d')
df['assetType'] = df['assetType'].astype('category')
df['exchange'] = df['exchange'].astype('category')
df['name'] = df['name'].astype('string')

In [108]:
df.dtypes

symbol                   object
name             string[python]
exchange               category
assetType              category
ipoDate          datetime64[ns]
delistingDate    datetime64[ns]
status                   object
dtype: object

## Integração dos Dados

Para cada linha com nome nulo, vamos recuperar essa informação com a função do Yahoo Finance. Para as que não forem possíveis de recuperar, vamos excluir essas linhas:

In [124]:
df['name'] = df.apply(lambda x: get_finance_data_name(x.symbol) if pd.isna(x['name']) or x['name'] == '' else x.name, axis=1)

df.dropna(subset=['name', 'symbol'], inplace=True)
df.drop(df[df.name == ''].index, inplace=True)

df.shape


404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/AMEH?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=AMEH&crumb=H1YgLak2O95
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/DFFN?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=DFFN&crumb=H1YgLak2O95
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/NTEST-G?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=NTEST-G&crumb=H1YgLak2O95
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/NTEST-J?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=fin

(11598, 7)

Vamos também excluir as duplicatas de nomes:

In [110]:
# df.drop_duplicates(subset=['name'], inplace=True)
# df.shape

(11598, 7)

Agora vamos integrar à nossa base de dados informações sobre os dividendos pagos pelas empresas e ETF's nos últimos 5 anos:

In [125]:
def get_dividends_by_period(period, symbol):
    try: 
        return yf.Ticker(symbol).history(period=period).Dividends.sum() # 1d, 1w, 1m, 3m, 6m, 5y, 10y, ytd, max
    except:
        return 0

In [112]:
def get_history_by_period(period, symbol):
    try:
        return yf.Ticker(symbol).history(period=period)
    except:
        return pd.DataFrame()

get_history_by_period('5y', 'AAPL')

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2019-09-18 00:00:00-04:00,53.476841,53.909864,53.084947,53.890511,101360000,0.0,0.0
2019-09-19 00:00:00-04:00,53.706659,54.130004,53.309925,53.452656,88242400,0.0,0.0
2019-09-20 00:00:00-04:00,53.554269,53.839722,52.608396,52.671291,221652400,0.0,0.0
2019-09-23 00:00:00-04:00,52.966417,53.181717,52.651931,52.910778,76662000,0.0,0.0
2019-09-24 00:00:00-04:00,53.469579,53.822771,52.540642,52.659176,124763200,0.0,0.0
...,...,...,...,...,...,...,...
2024-09-11 00:00:00-04:00,221.460007,223.089996,217.889999,222.660004,44587100,0.0,0.0
2024-09-12 00:00:00-04:00,222.500000,223.550003,219.820007,222.770004,37498200,0.0,0.0
2024-09-13 00:00:00-04:00,223.580002,224.039993,221.910004,222.500000,36766600,0.0,0.0
2024-09-16 00:00:00-04:00,216.539993,217.220001,213.919998,216.320007,59357400,0.0,0.0


## Redução e Validação dos Dados

Vamos extrair uma amostra aleatória para iniciar a comparação do desempenho nominal dos dividendos dos últimos 5 anos entre as bolsas.

In [126]:
sample_df = df.sample(n=25)
sample_df

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
3568,FAS,3568,NYSE ARCA,ETF,2008-11-19,NaT,Active
4083,FTIF,4083,NYSE ARCA,ETF,2023-03-14,NaT,Active
4198,GATEU,4198,NASDAQ,Stock,2021-10-01,NaT,Active
6543,MBI,6543,NYSE,Stock,1987-07-02,NaT,Active
7134,NCL,7134,NYSE MKT,Stock,2023-10-19,NaT,Active
8716,RDIV,8716,NYSE ARCA,ETF,2013-10-01,NaT,Active
7421,NTRA,7421,NASDAQ,Stock,2015-07-02,NaT,Active
2028,CIM,2028,NYSE,Stock,2007-11-16,NaT,Active
3981,FPRO,3981,BATS,ETF,2021-02-04,NaT,Active
5189,IDAT,5189,NYSE ARCA,ETF,2021-06-10,NaT,Active


In [127]:
sample_df['dividends_last_5_years'] = sample_df.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
sample_df

FTIF: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', '2y', 'ytd', 'max']
NCL: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', 'ytd', 'max']
BAMG: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', 'ytd', 'max']
$HGTY-WS: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
LRGC: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', 'ytd', 'max']
PMAX: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', 'ytd', 'max']
MBINM: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', '2y', 'ytd', 'max']


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
3568,FAS,3568,NYSE ARCA,ETF,2008-11-19,NaT,Active,4.091
4083,FTIF,4083,NYSE ARCA,ETF,2023-03-14,NaT,Active,0.0
4198,GATEU,4198,NASDAQ,Stock,2021-10-01,NaT,Active,0.0
6543,MBI,6543,NYSE,Stock,1987-07-02,NaT,Active,8.0
7134,NCL,7134,NYSE MKT,Stock,2023-10-19,NaT,Active,0.0
8716,RDIV,8716,NYSE ARCA,ETF,2013-10-01,NaT,Active,7.406
7421,NTRA,7421,NASDAQ,Stock,2015-07-02,NaT,Active,0.0
2028,CIM,2028,NYSE,Stock,2007-11-16,NaT,Active,17.21
3981,FPRO,3981,BATS,ETF,2021-02-04,NaT,Active,1.927
5189,IDAT,5189,NYSE ARCA,ETF,2021-06-10,NaT,Active,0.616


Como a amostra aleatória trouxe muitas empresas novas, sem histórico de dividendos mínimo consistente para a nossa análise, vamos tentar refinar nossa amostra. Antes, vamos verificar a proporção de ativos por exchange e por tipo:

In [128]:
rate_exchange = df.exchange.value_counts(normalize=True)
rate_exchange

exchange
NASDAQ       0.447319
NYSE         0.263063
NYSE ARCA    0.192792
BATS         0.068633
NYSE MKT     0.028195
Name: proportion, dtype: float64

In [129]:
rate_asset = df.assetType.value_counts(normalize=True)
rate_asset

assetType
Stock    0.644853
ETF      0.355147
Name: proportion, dtype: float64

Para fins de simplificação, vamos deixar somente as exchanges mais conhecidas e verificar a proporção novamente:

In [130]:
df.drop(df[df.exchange == 'NYSE ARCA'].index, inplace=True)
df.drop(df[df.exchange == 'NYSE MKT'].index, inplace=True)
df.drop(df[df.exchange == 'BATS'].index, inplace=True)

In [131]:
rate_exchange = df.exchange.value_counts(normalize=True)
rate_exchange

exchange
NASDAQ       0.629688
NYSE         0.370312
BATS         0.000000
NYSE ARCA    0.000000
NYSE MKT     0.000000
Name: proportion, dtype: float64

Agora vamos extrair uma amostra obedecendo essas proporções:

In [136]:
exchange_sample = df.groupby('exchange').apply(lambda x: x.sample(int(10 * rate_exchange[x.name]))).droplevel('exchange')
exchange_sample

  exchange_sample = df.groupby('exchange').apply(lambda x: x.sample(int(10 * rate_exchange[x.name]))).droplevel('exchange')
  exchange_sample = df.groupby('exchange').apply(lambda x: x.sample(int(10 * rate_exchange[x.name]))).droplevel('exchange')


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
2957,DVAX,2957,NASDAQ,Stock,2004-02-19,NaT,Active
3488,EWCZ,3488,NASDAQ,Stock,2021-08-05,NaT,Active
7265,NKLA,7265,NASDAQ,Stock,2018-06-11,NaT,Active
2232,COLM,2232,NASDAQ,Stock,1998-03-27,NaT,Active
8488,PYPD,8488,NASDAQ,Stock,2020-06-26,NaT,Active
6505,MAR,6505,NASDAQ,Stock,1993-10-13,NaT,Active
1989,CHN,1989,NYSE,ETF,1992-07-10,NaT,Active
9491,SLGN,9491,NYSE,Stock,1997-02-14,NaT,Active
1355,BMAC,1355,NYSE,Stock,2021-11-12,NaT,Active


In [133]:
asset_sample = df.groupby('assetType').apply(lambda x: x.sample(int(10 * rate_asset[x.name]))).droplevel('assetType')
asset_sample

  asset_sample = df.groupby('assetType').apply(lambda x: x.sample(int(10 * rate_asset[x.name]))).droplevel('assetType')
  asset_sample = df.groupby('assetType').apply(lambda x: x.sample(int(10 * rate_asset[x.name]))).droplevel('assetType')


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
10357,TQQQ,10357,NASDAQ,ETF,2010-02-11,NaT,Active
8348,PSCE,8348,NASDAQ,ETF,2010-04-07,NaT,Active
4795,HFRO,4795,NYSE,ETF,2017-11-06,NaT,Active
830,AUDC,830,NASDAQ,Stock,1999-05-28,NaT,Active
3266,ENSC,3266,NASDAQ,Stock,2018-02-26,NaT,Active
5331,IMKTA,5331,NASDAQ,Stock,1990-03-26,NaT,Active
7823,OXLCN,7823,NASDAQ,Stock,2014-06-02,NaT,Active
4464,GMRE,4464,NYSE,Stock,2016-06-29,NaT,Active
6279,LION,6279,NASDAQ,Stock,2024-05-14,NaT,Active


Para comparar dividendos, vamos adicionar as informações como anteriormente, em cada sample:

In [63]:
exchange_sample['dividends_last_5_years'] = exchange_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
exchange_sample


MKAM: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', '2y', 'ytd', 'max']
$VCXA: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$ADAL: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$ATH-P-A: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
6763,MKAM,MKAM ETF,NASDAQ,ETF,2023-04-12,NaT,Active,0.0
5838,JSML,JANUS HENDERSON SMALL CAP GROWTH ALPHA ETF,NASDAQ,ETF,2016-02-25,NaT,Active,1.253
3025,EBC,Eastern Bankshares Inc,NASDAQ,Stock,2020-10-15,NaT,Active,1.44
10803,VCXA,10X Capital Venture Acquisition Corp II - Class A,NASDAQ,Stock,2021-10-05,NaT,Active,0.0
122,ADAL,Anthemis Digital Acquisitions I Corp - Class A,NASDAQ,Stock,2021-12-29,NaT,Active,0.0
10362,TRDA,Entrada Therapeutics Inc,NASDAQ,Stock,2021-10-29,NaT,Active,0.0
11151,WES,Western Midstream Partners LP,NYSE,Stock,2012-12-10,NaT,Active,10.089
781,ATH-P-A,Athene Holding Ltd,NYSE,Stock,2019-06-06,NaT,Active,0.0
9039,RVTY,Revvity Inc,NYSE,Stock,1983-04-06,NaT,Active,1.4


In [35]:
asset_sample['dividends_last_5_years'] = asset_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
asset_sample

VSTEW: Period '5y' is invalid, must be one of ['1d', '5d']
ASTSW: Period '5y' is invalid, must be one of ['1d', '5d']


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
11075,VYMI,VANGUARD INTERNATIONAL HIGH DIVIDEND YIELD IND...,NASDAQ,ETF,2016-03-02,NaT,Active,13.541
5338,IMOM,ALPHA ARCHITECT INTERNATIONAL QUANTITATIVE MOM...,NASDAQ,ETF,2016-01-04,NaT,Active,2.971
5146,IBTF,ISHARES IBONDS DEC 2025 TERM TREASURY ETF,NASDAQ,ETF,2020-02-28,NaT,Active,2.354
11069,VXRT,Vaxart Inc,NASDAQ,Stock,2018-02-12,NaT,Active,0.0
8877,RM,Regional Management Corp,NYSE,Stock,2012-03-28,NaT,Active,4.45
4933,HRI,Herc Holdings Inc,NYSE,Stock,2006-11-16,NaT,Active,7.327
6418,LTRPA,Liberty TripAdvisor Holdings Inc - Series A,NASDAQ,Stock,2014-08-27,NaT,Active,0.0
11019,VSTEW,Vast Renewables Ltd - Warrants (01/07/2028),NASDAQ,Stock,2023-12-19,NaT,Active,0.0
755,ASTSW,AST SpaceMobile Inc - Warrants (06/04/2026),NASDAQ,Stock,2019-11-01,NaT,Active,0.0


Numa outra tentativa, vamos montar uma amostra considerando empresas listadas entre 5 e 10 anos:

In [137]:
five_year_df = df[df.ipoDate.dt.year.between(2014, 2019)]
five_year = five_year_df.ipoDate.dt.year.value_counts(normalize=True)

five_year_sample = five_year_df.groupby(five_year_df.ipoDate.dt.year).apply(lambda x: x.sample(int(25 * five_year[x.name]))).droplevel('ipoDate')
five_year_sample['dividends_last_5_years'] = five_year_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
five_year_sample


$CIO-P-A: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$AMH-P-G: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$NLY-P-I: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
3034,EBR-B,3034,NYSE,Stock,2014-09-22,NaT,Active,1.797644
7639,OEC,7639,NYSE,Stock,2014-07-25,NaT,Active,0.851
9314,SFBS,9314,NYSE,Stock,2014-05-14,NaT,Active,4.59
70,ACB,70,NASDAQ,Stock,2014-07-11,NaT,Active,0.0
2027,CIL,2027,NASDAQ,ETF,2015-08-20,NaT,Active,5.619
11526,YRD,11526,NYSE,Stock,2015-12-18,NaT,Active,0.0
1450,BOX,1450,NYSE,Stock,2015-01-23,NaT,Active,0.0
2040,CIO-P-A,2040,NYSE,Stock,2016-10-06,NaT,Active,0.0
10095,TCMD,10095,NASDAQ,Stock,2016-07-28,NaT,Active,0.0
11190,WINT,11190,NASDAQ,Stock,2016-01-04,NaT,Active,0.0
