# Data Wrangling I

Data wrangling is a typically manual process for transforming raw data into a format suitable for analysis.
ETL is an automated process designed to integrate, clean, and load data into a repository, usually a data warehouse. While data wrangling is exploratory and iterative, ETL is systematic and well defined.

## Data Collection

Starting from the list of assets provided by the Alpha Vantage API, together with the Yahoo Finance library (`yfinance`), we will build a new dataset of stocks that includes the dividends paid over the last 5 years.


In [2]:
import pandas as pd

df = pd.read_csv('listing_status.csv', header=0)
df.head(10)

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
0,A,Agilent Technologies Inc,NYSE,Stock,1999-11-18,,Active
1,AA,Alcoa Corp,NYSE,Stock,2016-10-18,,Active
2,AAA,ALTERNATIVE ACCESS FIRST PRIORITY CLO BOND ETF,NYSE ARCA,ETF,2020-09-09,,Active
3,AAAU,Goldman Sachs Physical Gold ETF,BATS,ETF,2018-08-15,,Active
4,AACG,ATA Creativity Global,NASDAQ,Stock,2008-01-29,,Active
5,AACT,Ares Acquisition Corporation II - Class A,NYSE,Stock,2023-06-12,,Active
6,AACT-U,Ares Acquisition Corporation II - Units (1 Ord...,NYSE,Stock,2023-04-21,,Active
7,AACT-WS,Ares Acquisition Corporation II - Warrants (01...,NYSE,Stock,2023-06-12,,Active
8,AADI,Aadi Bioscience Inc,NASDAQ,Stock,2017-08-08,,Active
9,AADR,ADVISORSHARES DORSEY WRIGHT ADR ETF,NASDAQ,ETF,2010-07-21,,Active


In [3]:
#! pip install yfinance
import yfinance as yf

def get_finance_data_name(symbol):
    """Return the Yahoo Finance short name for a symbol, or '' if the lookup fails."""
    try:
        return yf.Ticker(symbol).info['shortName']
    except Exception:
        return ''

## Data Cleaning and Transformation

We will clean the data more thoroughly, because several columns have types that do not match their content (the two date columns, for example):

In [4]:
df.dtypes

symbol            object
name              object
exchange          object
assetType         object
ipoDate           object
delistingDate    float64
status            object
dtype: object

There is also missing data:

In [5]:
df.isna().sum()

symbol               1
name                34
exchange             0
assetType            0
ipoDate              0
delistingDate    11628
status               0
dtype: int64

In [6]:
df.loc[df.name.isna()]

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
456,AMEH,,NASDAQ,Stock,2024-02-26,,Active
487,AMRS,,NASDAQ,Stock,2023-08-18,,Active
896,AVRO,,NASDAQ,Stock,2024-06-21,,Active
2113,CLVS,,NASDAQ,Stock,2023-01-03,,Active
2419,CTEST,,NYSE,Stock,2019-07-25,,Active
2632,DEC,,NYSE,Stock,2023-12-18,,Active
2676,DFFN,,NASDAQ,Stock,2023-08-17,,Active
4135,FWP,,NASDAQ,Stock,2022-12-27,,Active
6994,MTEST,,NYSE,Stock,2019-10-09,,Active
7401,NTEST-G,,NYSE,Stock,2019-07-17,,Active


In [7]:
df.loc[df.symbol.isna()]

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
7073,,Nano Labs Ltd,NASDAQ,Stock,2022-07-12,,Active


Let's fix these types so that the information is easier to work with:

In [8]:
df['ipoDate'] = pd.to_datetime(df['ipoDate'], format='%Y-%m-%d')
df['delistingDate'] = pd.to_datetime(df['delistingDate'], format='%Y-%m-%d')
df['assetType'] = df['assetType'].astype('category')
df['exchange'] = df['exchange'].astype('category')

In [9]:
df.dtypes

symbol                   object
name                     object
exchange               category
assetType              category
ipoDate          datetime64[ns]
delistingDate    datetime64[ns]
status                   object
dtype: object
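
The `float64` type originally reported for `delistingDate` is what pandas infers for a column that contains only empty values. As a minimal sketch of the same conversion on a made-up three-row frame, the example below also passes `errors='coerce'` (not used in the notebook) so that a malformed date becomes `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical miniature of the listing table; the last ipoDate is deliberately malformed
mini = pd.DataFrame({
    'ipoDate': ['1999-11-18', '2016-10-18', 'not-a-date'],
    'delistingDate': [float('nan')] * 3,  # an all-empty column is read as float64
    'exchange': ['NYSE', 'NYSE', 'NASDAQ'],
})

# errors='coerce' turns unparseable values into NaT instead of raising
mini['ipoDate'] = pd.to_datetime(mini['ipoDate'], format='%Y-%m-%d', errors='coerce')
mini['delistingDate'] = pd.to_datetime(mini['delistingDate'])
mini['exchange'] = mini['exchange'].astype('category')

print(mini.dtypes)
```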

## Data Integration

For each row with a null name, we will retrieve it with the Yahoo Finance function defined earlier. Rows whose name cannot be retrieved will be dropped:

In [33]:
# Use x['name'] for the column: on a row, x.name is the index label, not the 'name' value
df['name'] = df.apply(lambda x: get_finance_data_name(x.symbol) if pd.isna(x['name']) or x['name'] == '' else x['name'], axis=1)

df.dropna(subset=['name', 'symbol'], inplace=True)
df.drop(df[df.name == ''].index, inplace=True)

df.shape


404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/AMEH?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=AMEH&crumb=H1YgLak2O95
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/DFFN?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=DFFN&crumb=H1YgLak2O95
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/NTEST-J?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=NTEST-J&crumb=H1YgLak2O95
404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/NTEST-K?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=fin
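
An alternative to the row-wise `apply` is to build a boolean mask and call the lookup only for the rows that need it. This is a sketch, not the notebook's method: `fill_missing_names` and the `fake_names` dictionary are hypothetical, and the lambda stands in for `get_finance_data_name` so that the example runs without network access.

```python
import pandas as pd

def fill_missing_names(frame, lookup):
    """Return a copy where null or empty names are replaced by lookup(symbol)."""
    out = frame.copy()
    missing = out['name'].isna() | (out['name'] == '')
    out.loc[missing, 'name'] = out.loc[missing, 'symbol'].map(lookup)
    return out

# Hypothetical offline replacement for the Yahoo Finance lookup
fake_names = {'DEC': 'Example Co'}
demo = pd.DataFrame({
    'symbol': ['A', 'DEC', 'AMEH'],
    'name': ['Agilent Technologies Inc', None, None],
})

filled = fill_missing_names(demo, lambda s: fake_names.get(s, ''))
# Drop the rows whose name could not be recovered
filled = filled[filled['name'] != '']
print(filled)
```

With this shape, rows that already have a name never trigger a request.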

Let's also drop rows with duplicate names:

In [10]:
df.drop_duplicates(subset=['name'], inplace=True)
df.shape

(10917, 7)

Now let's add to our dataset the dividends paid by the companies and ETFs over the last 5 years:

In [11]:
def get_dividends_by_period(period, symbol):
    """Sum the dividends paid by a symbol over the period; return 0 if the request fails."""
    try:
        return yf.Ticker(symbol).history(period=period).Dividends.sum()
    except Exception:
        return 0

In [59]:
def get_history_by_period(period, symbol):
    # Return the full history frame (prices, volume, dividends, and stock splits)
    return yf.Ticker(symbol).history(period=period)

get_history_by_period('5y', 'AAPL')

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2019-09-17 00:00:00-04:00,53.210747,53.418791,53.007539,53.389759,73274800,0.0,0.0
2019-09-18 00:00:00-04:00,53.476845,53.909868,53.084950,53.890514,101360000,0.0,0.0
2019-09-19 00:00:00-04:00,53.706663,54.130007,53.309929,53.452660,88242400,0.0,0.0
2019-09-20 00:00:00-04:00,53.554265,53.839718,52.608392,52.671288,221652400,0.0,0.0
2019-09-23 00:00:00-04:00,52.966413,53.181713,52.651928,52.910774,76662000,0.0,0.0
...,...,...,...,...,...,...,...
2024-09-11 00:00:00-04:00,221.460007,223.089996,217.889999,222.660004,44587100,0.0,0.0
2024-09-12 00:00:00-04:00,222.500000,223.550003,219.820007,222.770004,37498200,0.0,0.0
2024-09-13 00:00:00-04:00,223.580002,224.039993,221.910004,222.500000,36766600,0.0,0.0
2024-09-16 00:00:00-04:00,216.539993,217.220001,213.919998,216.320007,59288400,0.0,0.0


## Data Reduction and Validation

Let's draw a random sample to start comparing the nominal dividend performance of the last 5 years across exchanges.

In [12]:
sample_df = df.sample(n=25)
sample_df

Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
1201,BHAT,Blue Hat Interactive Entertainment Technology,NASDAQ,Stock,2019-07-26,NaT,Active
9454,SJT,San Juan Basin Royalty Trust,NYSE,Stock,1987-10-29,NaT,Active
9945,SVIIR,Spring Valley Acquisition Corp II,NASDAQ,Stock,2022-10-28,NaT,Active
7319,NOC,Northrop Grumman Corp,NYSE,Stock,1981-12-31,NaT,Active
8250,PPI,AXS ASTORIA INFLATION SENSITIVE ETF,NYSE ARCA,ETF,2021-12-30,NaT,Active
543,AOGOW,Arogo Capital Acquisition Corp - Warrants (23/...,NASDAQ,Stock,2022-02-11,NaT,Active
9754,SPWH,Sportsman`s Warehouse Holdings Inc,NASDAQ,Stock,2014-04-17,NaT,Active
409,ALSAU,Alpha Star Acquisition Corp - Units (1 1 Right...,NASDAQ,Stock,2021-12-13,NaT,Active
7457,NUSB,NUVEEN ULTRA SHORT INCOME ETF,NASDAQ,ETF,2024-03-06,NaT,Active
9678,SPDW,SPDR(R) PORTFOLIO DEVELOPED WORLD EX-US ETF,NYSE ARCA,ETF,2007-04-26,NaT,Active


In [13]:
sample_df['dividends_last_5_years'] = sample_df.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
sample_df

SVIIR: Period '5y' is invalid, must be one of ['1d', '5d']
AOGOW: Period '5y' is invalid, must be one of ['1d', '5d']
NUSB: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', 'ytd', 'max']
BAMY: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', 'ytd', 'max']
ETHT: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', 'ytd', 'max']
PBNV: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', 'ytd', 'max']
VHAI: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', 'ytd', 'max']


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
1201,BHAT,Blue Hat Interactive Entertainment Technology,NASDAQ,Stock,2019-07-26,NaT,Active,0.0
9454,SJT,San Juan Basin Royalty Trust,NYSE,Stock,1987-10-29,NaT,Active,3.816
9945,SVIIR,Spring Valley Acquisition Corp II,NASDAQ,Stock,2022-10-28,NaT,Active,0.0
7319,NOC,Northrop Grumman Corp,NYSE,Stock,1981-12-31,NaT,Active,33.24
8250,PPI,AXS ASTORIA INFLATION SENSITIVE ETF,NYSE ARCA,ETF,2021-12-30,NaT,Active,0.733
543,AOGOW,Arogo Capital Acquisition Corp - Warrants (23/...,NASDAQ,Stock,2022-02-11,NaT,Active,0.0
9754,SPWH,Sportsman`s Warehouse Holdings Inc,NASDAQ,Stock,2014-04-17,NaT,Active,0.0
409,ALSAU,Alpha Star Acquisition Corp - Units (1 1 Right...,NASDAQ,Stock,2021-12-13,NaT,Active,0.0
7457,NUSB,NUVEEN ULTRA SHORT INCOME ETF,NASDAQ,ETF,2024-03-06,NaT,Active,0.0
9678,SPDW,SPDR(R) PORTFOLIO DEVELOPED WORLD EX-US ETF,NYSE ARCA,ETF,2007-04-26,NaT,Active,4.619
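
The `Period '5y' is invalid` messages come from symbols with less than five years of history, which then end up with a dividend total of 0. One option, sketched here under that assumption, is to request the whole history (`'max'` appears among the valid periods in the messages) and filter by date afterwards. `sum_dividends_since` is a hypothetical helper, and the frame below is a made-up stand-in for what `history()` returns.

```python
import pandas as pd

def sum_dividends_since(history, years=5, today=None):
    """Sum the Dividends column over the last `years` years of a history frame.

    `today` is an optional timezone-naive date (string or Timestamp).
    """
    if history.empty:
        return 0.0
    tz = history.index.tz
    today = pd.Timestamp.now(tz=tz) if today is None else pd.Timestamp(today, tz=tz)
    cutoff = today - pd.DateOffset(years=years)
    return float(history.loc[history.index >= cutoff, 'Dividends'].sum())

# Hypothetical history with a timezone-aware index, like the AAPL output
idx = pd.to_datetime(['2018-01-15', '2021-06-15', '2024-03-15']).tz_localize('America/New_York')
demo_history = pd.DataFrame({'Dividends': [1.0, 0.5, 0.25]}, index=idx)

total = sum_dividends_since(demo_history, years=5, today='2024-09-16')
print(total)
```

In the notebook, the input could come from `get_history_by_period('max', symbol)`.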


Because the random sample brought in many newly listed companies, without a dividend history long enough for our analysis, let's try to refine the sample. First, let's check the proportion of assets by exchange and by type:

In [14]:
rate_exchange = df.exchange.value_counts(normalize=True)
rate_exchange

exchange
NASDAQ       0.453788
NYSE         0.245031
NYSE ARCA    0.200513
BATS         0.072731
NYSE MKT     0.027938
Name: proportion, dtype: float64

In [15]:
rate_asset = df.assetType.value_counts(normalize=True)
rate_asset

assetType
Stock    0.625263
ETF      0.374737
Name: proportion, dtype: float64

For simplicity, let's keep only the best-known exchanges and check the proportions again:

In [16]:
df.drop(df[df.exchange == 'NYSE ARCA'].index, inplace=True)
df.drop(df[df.exchange == 'NYSE MKT'].index, inplace=True)
df.drop(df[df.exchange == 'BATS'].index, inplace=True)

In [61]:
rate_exchange = df.exchange.value_counts(normalize=True)
rate_exchange

exchange
NASDAQ       0.649364
NYSE         0.350636
BATS         0.000000
NYSE ARCA    0.000000
NYSE MKT     0.000000
Name: proportion, dtype: float64
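
The removed exchanges still appear with a proportion of 0.0 because `exchange` is a categorical column, and dropping rows does not drop categories. A minimal sketch on a four-row frame (rows taken from the head of the listing) shows a single-step filter with `isin`, followed by `cat.remove_unused_categories()`:

```python
import pandas as pd

demo = pd.DataFrame({
    'symbol': ['A', 'AAA', 'AAAU', 'AACG'],
    'exchange': pd.Categorical(['NYSE', 'NYSE ARCA', 'BATS', 'NASDAQ']),
})

# Keep only the two exchanges of interest in a single step
kept = demo[demo['exchange'].isin(['NASDAQ', 'NYSE'])].copy()

# Drop the now-empty categories so they no longer show up in value_counts
kept['exchange'] = kept['exchange'].cat.remove_unused_categories()
rates = kept['exchange'].value_counts(normalize=True)
print(rates)
```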

Now let's draw a sample that follows these proportions:

In [62]:
exchange_sample = df.groupby('exchange').apply(lambda x: x.sample(int(10 * rate_exchange[x.name]))).droplevel('exchange')
exchange_sample



Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
6763,MKAM,MKAM ETF,NASDAQ,ETF,2023-04-12,NaT,Active
5838,JSML,JANUS HENDERSON SMALL CAP GROWTH ALPHA ETF,NASDAQ,ETF,2016-02-25,NaT,Active
3025,EBC,Eastern Bankshares Inc,NASDAQ,Stock,2020-10-15,NaT,Active
10803,VCXA,10X Capital Venture Acquisition Corp II - Class A,NASDAQ,Stock,2021-10-05,NaT,Active
122,ADAL,Anthemis Digital Acquisitions I Corp - Class A,NASDAQ,Stock,2021-12-29,NaT,Active
10362,TRDA,Entrada Therapeutics Inc,NASDAQ,Stock,2021-10-29,NaT,Active
11151,WES,Western Midstream Partners LP,NYSE,Stock,2012-12-10,NaT,Active
781,ATH-P-A,Athene Holding Ltd,NYSE,Stock,2019-06-06,NaT,Active
9039,RVTY,Revvity Inc,NYSE,Stock,1983-04-06,NaT,Active


In [33]:
asset_sample = df.groupby('assetType').apply(lambda x: x.sample(int(10 * rate_asset[x.name]))).droplevel('assetType')
asset_sample



Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status
11075,VYMI,VANGUARD INTERNATIONAL HIGH DIVIDEND YIELD IND...,NASDAQ,ETF,2016-03-02,NaT,Active
5338,IMOM,ALPHA ARCHITECT INTERNATIONAL QUANTITATIVE MOM...,NASDAQ,ETF,2016-01-04,NaT,Active
5146,IBTF,ISHARES IBONDS DEC 2025 TERM TREASURY ETF,NASDAQ,ETF,2020-02-28,NaT,Active
11069,VXRT,Vaxart Inc,NASDAQ,Stock,2018-02-12,NaT,Active
8877,RM,Regional Management Corp,NYSE,Stock,2012-03-28,NaT,Active
4933,HRI,Herc Holdings Inc,NYSE,Stock,2006-11-16,NaT,Active
6418,LTRPA,Liberty TripAdvisor Holdings Inc - Series A,NASDAQ,Stock,2014-08-27,NaT,Active
11019,VSTEW,Vast Renewables Ltd - Warrants (01/07/2028),NASDAQ,Stock,2023-12-19,NaT,Active
755,ASTSW,AST SpaceMobile Inc - Warrants (06/04/2026),NASDAQ,Stock,2019-11-01,NaT,Active
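
Both samples have 9 rows rather than 10 because `int()` truncates each group size (for the exchanges, 6.49 becomes 6 and 3.51 becomes 3). A sketch of one way around this, using largest-remainder rounding, follows. The helpers `allocate` and `stratified_sample` are hypothetical, and the 20-row frame is made up for illustration.

```python
import pandas as pd

def allocate(proportions, total):
    """Split `total` across groups by proportion using largest-remainder rounding."""
    raw = proportions * total
    counts = raw.astype(int)  # truncation, as in the original int(...) call
    leftover = total - counts.sum()
    # Give the remaining slots to the groups with the largest fractional parts
    for key in (raw - counts).sort_values(ascending=False).index[:leftover]:
        counts[key] += 1
    return counts

def stratified_sample(frame, column, counts, seed=0):
    """Sample counts[key] rows from each group of `column`."""
    parts = [
        group.sample(n=min(int(counts[key]), len(group)), random_state=seed)
        for key, group in frame.groupby(column, observed=True)
    ]
    return pd.concat(parts)

# Proportions copied from the rate_exchange output
rate_exchange = pd.Series({'NASDAQ': 0.649364, 'NYSE': 0.350636})
counts = allocate(rate_exchange, 10)

# Made-up frame with enough rows per exchange
demo = pd.DataFrame({
    'symbol': [f'S{i}' for i in range(20)],
    'exchange': ['NASDAQ'] * 13 + ['NYSE'] * 7,
})
sample = stratified_sample(demo, 'exchange', counts)
print(sample['exchange'].value_counts())
```

Passing `observed=True` also limits the grouping to categories that actually occur, which matters when the column is categorical.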


To compare dividends, let's add the dividend information to each sample, as before:

In [63]:
exchange_sample['dividends_last_5_years'] = exchange_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
exchange_sample


MKAM: Period '5y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', '2y', 'ytd', 'max']
$VCXA: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$ADAL: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$ATH-P-A: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
6763,MKAM,MKAM ETF,NASDAQ,ETF,2023-04-12,NaT,Active,0.0
5838,JSML,JANUS HENDERSON SMALL CAP GROWTH ALPHA ETF,NASDAQ,ETF,2016-02-25,NaT,Active,1.253
3025,EBC,Eastern Bankshares Inc,NASDAQ,Stock,2020-10-15,NaT,Active,1.44
10803,VCXA,10X Capital Venture Acquisition Corp II - Class A,NASDAQ,Stock,2021-10-05,NaT,Active,0.0
122,ADAL,Anthemis Digital Acquisitions I Corp - Class A,NASDAQ,Stock,2021-12-29,NaT,Active,0.0
10362,TRDA,Entrada Therapeutics Inc,NASDAQ,Stock,2021-10-29,NaT,Active,0.0
11151,WES,Western Midstream Partners LP,NYSE,Stock,2012-12-10,NaT,Active,10.089
781,ATH-P-A,Athene Holding Ltd,NYSE,Stock,2019-06-06,NaT,Active,0.0
9039,RVTY,Revvity Inc,NYSE,Stock,1983-04-06,NaT,Active,1.4


In [35]:
asset_sample['dividends_last_5_years'] = asset_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
asset_sample

VSTEW: Period '5y' is invalid, must be one of ['1d', '5d']
ASTSW: Period '5y' is invalid, must be one of ['1d', '5d']


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
11075,VYMI,VANGUARD INTERNATIONAL HIGH DIVIDEND YIELD IND...,NASDAQ,ETF,2016-03-02,NaT,Active,13.541
5338,IMOM,ALPHA ARCHITECT INTERNATIONAL QUANTITATIVE MOM...,NASDAQ,ETF,2016-01-04,NaT,Active,2.971
5146,IBTF,ISHARES IBONDS DEC 2025 TERM TREASURY ETF,NASDAQ,ETF,2020-02-28,NaT,Active,2.354
11069,VXRT,Vaxart Inc,NASDAQ,Stock,2018-02-12,NaT,Active,0.0
8877,RM,Regional Management Corp,NYSE,Stock,2012-03-28,NaT,Active,4.45
4933,HRI,Herc Holdings Inc,NYSE,Stock,2006-11-16,NaT,Active,7.327
6418,LTRPA,Liberty TripAdvisor Holdings Inc - Series A,NASDAQ,Stock,2014-08-27,NaT,Active,0.0
11019,VSTEW,Vast Renewables Ltd - Warrants (01/07/2028),NASDAQ,Stock,2023-12-19,NaT,Active,0.0
755,ASTSW,AST SpaceMobile Inc - Warrants (06/04/2026),NASDAQ,Stock,2019-11-01,NaT,Active,0.0
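
With the dividend column in place, the comparison between exchanges can be sketched as a grouped aggregation. The values are copied from the `exchange_sample` output above; the choice of mean, median, and number of payers is only an assumption about which summary is useful, and a 9-row sample is too small to support conclusions.

```python
import pandas as pd

# Values copied from the exchange_sample output
exchange_sample = pd.DataFrame({
    'symbol': ['MKAM', 'JSML', 'EBC', 'VCXA', 'ADAL', 'TRDA', 'WES', 'ATH-P-A', 'RVTY'],
    'exchange': ['NASDAQ'] * 6 + ['NYSE'] * 3,
    'dividends_last_5_years': [0.0, 1.253, 1.44, 0.0, 0.0, 0.0, 10.089, 0.0, 1.4],
})

# One row per exchange: average, median, and number of assets that paid anything
summary = exchange_sample.groupby('exchange')['dividends_last_5_years'].agg(
    mean='mean',
    median='median',
    payers=lambda s: int((s > 0).sum()),
)
print(summary)
```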


As another attempt, let's build a sample of companies that have been listed for between 5 and 10 years:

In [79]:
five_year_df = df[df.ipoDate.dt.year.between(2014, 2019)]
five_year = five_year_df.ipoDate.dt.year.value_counts(normalize=True)
five_year

five_year_sample = five_year_df.groupby(five_year_df.ipoDate.dt.year).apply(lambda x: x.sample(int(25 * five_year[x.name]))).droplevel('ipoDate')
five_year_sample['dividends_last_5_years'] = five_year_sample.apply(lambda x: get_dividends_by_period('5y', x.symbol), axis=1)
five_year_sample


$CFMS: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")
$CODI-P-C: possibly delisted; no price data found  (period=5y) (Yahoo error = "No data found, symbol may be delisted")


Unnamed: 0,symbol,name,exchange,assetType,ipoDate,delistingDate,status,dividends_last_5_years
11205,WKHS,Workhorse Group Inc,NASDAQ,Stock,2014-08-07,NaT,Active,0.0
6159,LC,LendingClub Corp,NYSE,Stock,2014-12-11,NaT,Active,0.0
1316,BLBD,Blue Bird Corp,NASDAQ,Stock,2014-03-20,NaT,Active,0.0
2545,CZR,Caesars Entertainment Inc,NASDAQ,Stock,2014-09-22,NaT,Active,0.0
8891,RMNI,Rimini Street Inc,NASDAQ,Stock,2015-08-28,NaT,Active,0.0
1918,CFMS,Conformis Inc,NASDAQ,Stock,2015-07-01,NaT,Active,0.0
2166,CNCR,Range Cancer Therapeutics ETF,NASDAQ,ETF,2015-10-14,NaT,Active,2.141
7790,OTLK,Outlook Therapeutics Inc,NASDAQ,Stock,2016-06-14,NaT,Active,0.0
6410,LSXMK,Liberty Media Corp (New Liberty SiriusXM) Seri...,NASDAQ,Stock,2016-04-18,NaT,Active,0.25
1458,BPRN,Princeton Bancorp Inc,NASDAQ,Stock,2016-07-12,NaT,Active,4.26
