# **One-Hot Encoding**

In this step we apply one-hot encoding to turn categorical variables into numeric ones.


| Variable | Description | Action |
|----------|-------------|--------|
| Product | Financial product or service | Apply one-hot encoding. |
| Sub-product | Financial sub-product or service | Create the grouped variable `grouped_sub_product`, then apply one-hot encoding. |
| Issue | Problem reported by the consumer | Create the grouped variable `grouped_issue`, then apply one-hot encoding. |
| Sub-issue | Sub-problem reported by the consumer | Keep as is. |
| State | Consumer's state | Apply one-hot encoding. |
| Company public response | Company's public response | Apply one-hot encoding. |
| Tags | Tags related to the complaint | Apply one-hot encoding. |
| Consumer disputed? | Whether the consumer disputed the response | Apply one-hot encoding. |
| Company size, Company market, Company response time, Company response satisfaction, Zip average education, Zip bank services access, Zip crime rate, Zip unemployment rate | Categorical variables created in the data-enrichment step | Apply one-hot encoding. |
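
As a minimal sketch of what the "apply one-hot encoding" action does, the toy example below (the data is made up, not taken from the complaints dataset) shows `pd.get_dummies` replacing one categorical column with one indicator column per category. `dtype=int` is passed only to make the 0/1 values explicit.

```python
import pandas as pd

# Toy data, not taken from the complaints dataset.
toy = pd.DataFrame({"Tags": ["Servicemember", "Older American", "Servicemember"]})

# Each distinct category becomes its own indicator column,
# named "<original column>_<category>".
encoded = pd.get_dummies(toy, columns=["Tags"], dtype=int)
print(encoded)
```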

---

## **Importing the Pickle from the Previous Step**

In [29]:
import pandas as pd

df = pd.read_pickle('./pickle/df_enrichment_final.pkl')

df.shape

(4887, 28)

In [30]:
df.columns

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Company response to consumer',
       'Consumer disputed?', 'Complaint ID', 'Date received Day',
       'Date received Month', 'Date received Year', 'Company size',
       'Company market', 'Company response time',
       'Company response satisfaction', 'Zip average education',
       'Zip life expectancy', 'Zip average income', 'Zip average age',
       'Zip bank services access', 'Zip crime rate', 'Zip unemployment rate'],
      dtype='object')

----


## **One-Hot Encoding**

In [31]:
unique_values_count = df[['Product', 'Sub-product', 'Issue', 'Sub-issue', 'State', 'Company public response', 'Tags', 'Consumer disputed?']].nunique()

unique_values_count

Product                     11
Sub-product                 37
Issue                       48
Sub-issue                  158
State                       52
Company public response     10
Tags                         4
Consumer disputed?           3
dtype: int64
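
The unique-value counts above indicate how many indicator columns each variable should produce. The sketch below, on made-up data, illustrates two details that are easy to miss: `nunique()` and `get_dummies()` both skip missing values by default, and `drop_first=True` yields one column fewer per variable. The names `toy`, `n_dummies`, and `n_dummies_drop_first` are illustrative only.

```python
import pandas as pd

# Toy data with missing values, not taken from the complaints dataset.
toy = pd.DataFrame({
    "Consumer disputed?": ["Yes", "No", None, "No"],
    "Tags": ["Servicemember", None, None, "Older American"],
})

# nunique() ignores missing values by default.
n_categories = toy.nunique()

# get_dummies() also ignores missing values unless dummy_na=True is passed.
n_dummies = {
    col: pd.get_dummies(toy[col]).shape[1] for col in toy.columns
}

# With drop_first=True each variable yields one column fewer.
n_dummies_drop_first = {
    col: pd.get_dummies(toy[col], drop_first=True).shape[1] for col in toy.columns
}

print(n_categories.to_dict(), n_dummies, n_dummies_drop_first)
```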

### **Enriched Categorical Variables**

In [33]:
df = pd.get_dummies(df, columns=[
    'Company size', 'Company market', 'Company response time', 'Company response satisfaction',
    'Zip average education', 'Zip bank services access', 'Zip crime rate', 'Zip unemployment rate'], 
    drop_first=True)

In [34]:
df.columns

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Company response to consumer',
       'Consumer disputed?', 'Complaint ID', 'Date received Day',
       'Date received Month', 'Date received Year', 'Zip life expectancy',
       'Zip average income', 'Zip average age', 'Company size_low',
       'Company size_medium', 'Company market_low', 'Company market_medium',
       'Company response time_low', 'Company response time_medium',
       'Company response satisfaction_low',
       'Company response satisfaction_medium', 'Zip average education_low',
       'Zip average education_medium', 'Zip bank services access_low',
       'Zip bank services access_medium', 'Zip crime rate_low',
       'Zip crime rate_medium', 'Zip unemployment rate_low',
       'Zip unemployment rate_medium'],
      dtype='object')
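
Only `_low` and `_medium` columns appear above because `drop_first=True` drops the first level of each variable in sorted order. The sketch below assumes the three levels are `high`, `low`, and `medium` (inferred from the column suffixes, not confirmed by the notebook) and shows that the dropped level becomes the implicit baseline.

```python
import pandas as pd

# Toy data; the level names "high", "low", "medium" are an assumption
# inferred from the column suffixes shown above.
toy = pd.DataFrame({"Company size": ["low", "high", "medium", "high"]})

full = pd.get_dummies(toy, columns=["Company size"], dtype=int)
reduced = pd.get_dummies(toy, columns=["Company size"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# Rows whose remaining indicators are all zero belong to the dropped level.
is_baseline = reduced.sum(axis=1) == 0
print(is_baseline.tolist())
```

Note that the later cells in this notebook call `get_dummies` without `drop_first`, so those variables keep one column per category.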

### **Product**

In [35]:
# Apply one-hot encoding to the 'Product' column
df = pd.get_dummies(df, columns=['Product'])

# Count the columns whose names start with "Product_"
product_columns_count = len([col for col in df.columns if col.startswith('Product_')])

product_columns_count

11

### **Sub-product**

For the `Sub-product` variable, we first group similar values into broader categories and then apply one-hot encoding to the grouped column.

| **Category**                        | **Sub-products**                                              |
|-------------------------------------|---------------------------------------------------------------|
| **Credit Reporting and Consumer Reports** | Credit reporting, Other personal consumer report, I do not know |
| **Bank Accounts**                   | Checking account, Savings account, CD (Certificate of Deposit) |
| **Student Loans**                   | Federal student loan servicing, Private student loan, Federal student loan debt, Private student loan debt, Non-federal student loan, Federal student loan |
| **Mortgages**                       | Conventional home mortgage, FHA mortgage, VA mortgage, USDA mortgage, Reverse mortgage, Other type of mortgage, Mortgage, Mortgage debt |
| **Credit Cards**                    | General-purpose credit card or charge card, Store credit card, Credit card debt, Credit card |
| **Prepaid and Gift Cards**          | General-purpose prepaid card, Government benefit card, Payroll card, Gift card, Student prepaid card |
| **Personal Loans**                  | Personal line of credit, Installment loan, Title loan, Loan    |
| **Auto Loans and Debts**            | Auto debt, Auto                                                |
| **Other Debts**                     | Telecommunications debt, Other debt, Payday loan debt, Medical debt, Rental debt, Medical, Lease |
| **Specialized Mortgages and Loans** | Home equity loan or line of credit (HELOC), Manufactured home loan, Payday loan, Pawn loan |
| **Other Financial Products**        | Other banking product or service, Other (i.e. phone, health club, etc.) |


In [36]:
# Map each sub-product to its new grouped category
sub_product_mapping = {
    'Credit reporting': 'Credit Reporting and Consumer Reports',
    'Other personal consumer report': 'Credit Reporting and Consumer Reports',
    'I do not know': 'Credit Reporting and Consumer Reports',
    'Checking account': 'Bank Accounts',
    'Savings account': 'Bank Accounts',
    'CD (Certificate of Deposit)': 'Bank Accounts',
    'Federal student loan servicing': 'Student Loans',
    'Private student loan': 'Student Loans',
    'Federal student loan debt': 'Student Loans',
    'Private student loan debt': 'Student Loans',
    'Non-federal student loan': 'Student Loans',
    'Federal student loan': 'Student Loans',
    'Conventional home mortgage': 'Mortgages',
    'FHA mortgage': 'Mortgages',
    'VA mortgage': 'Mortgages',
    'USDA mortgage': 'Mortgages',
    'Reverse mortgage': 'Mortgages',
    'Other type of mortgage': 'Mortgages',
    'Mortgage': 'Mortgages',
    'Mortgage debt': 'Mortgages',
    'General-purpose credit card or charge card': 'Credit Cards',
    'Store credit card': 'Credit Cards',
    'Credit card debt': 'Credit Cards',
    'Credit card': 'Credit Cards',
    'General-purpose prepaid card': 'Prepaid and Gift Cards',
    'Government benefit card': 'Prepaid and Gift Cards',
    'Payroll card': 'Prepaid and Gift Cards',
    'Gift card': 'Prepaid and Gift Cards',
    'Student prepaid card': 'Prepaid and Gift Cards',
    'Personal line of credit': 'Personal Loans',
    'Installment loan': 'Personal Loans',
    'Title loan': 'Personal Loans',
    'Loan': 'Personal Loans',
    'Auto debt': 'Auto Loans and Debts',
    'Auto': 'Auto Loans and Debts',
    'Telecommunications debt': 'Other Debts',
    'Other debt': 'Other Debts',
    'Payday loan debt': 'Other Debts',
    'Medical debt': 'Other Debts',
    'Rental debt': 'Other Debts',
    'Medical': 'Other Debts',
    'Lease': 'Other Debts',
    'Home equity loan or line of credit (HELOC)': 'Specialized Mortgages and Loans',
    'Manufactured home loan': 'Specialized Mortgages and Loans',
    'Payday loan': 'Specialized Mortgages and Loans',
    'Pawn loan': 'Specialized Mortgages and Loans',
    'Other banking product or service': 'Other Financial Products',
    'Other (i.e. phone, health club, etc.)': 'Other Financial Products'
}

# Apply the mapping to create the new 'grouped_sub_product' column
df['grouped_sub_product'] = df['Sub-product'].map(sub_product_mapping)
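
A flat dictionary like the one above can also be derived from a grouped dictionary that mirrors the category table, which keeps the table and the code in one place. This is only a sketch: `groups` and `invert_groups` are hypothetical names, and only two categories are reproduced for brevity.

```python
# Only two groups are reproduced here for brevity.
groups = {
    "Bank Accounts": ["Checking account", "Savings account", "CD (Certificate of Deposit)"],
    "Auto Loans and Debts": ["Auto debt", "Auto"],
}


def invert_groups(groups):
    """Return a flat {sub_product: category} dict, rejecting duplicates."""
    flat = {}
    for category, members in groups.items():
        for member in members:
            if member in flat:
                raise ValueError(f"{member!r} is assigned to more than one category")
            flat[member] = category
    return flat


mapping = invert_groups(groups)
print(mapping["Auto"])
```

The duplicate check guards against a value being listed under two categories, which a plain dict literal would silently resolve by keeping the last entry.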

In [37]:
df['grouped_sub_product'].value_counts()

grouped_sub_product
Credit Reporting and Consumer Reports    2078
Credit Cards                             1341
Bank Accounts                            1014
Other Debts                               103
Prepaid and Gift Cards                     99
Other Financial Products                   90
Personal Loans                             76
Student Loans                              39
Mortgages                                  32
Auto Loans and Debts                       14
Specialized Mortgages and Loans             1
Name: count, dtype: int64

In [38]:
# Apply one-hot encoding to the 'grouped_sub_product' column
df = pd.get_dummies(df, columns=['grouped_sub_product'])

# Count the columns whose names start with "grouped_sub_product_"
grouped_sub_product_columns_count = len([col for col in df.columns if col.startswith('grouped_sub_product_')])

grouped_sub_product_columns_count

11

### **Issue**

| **Grouped Issue**                      | **Original Issues**                                                                                                            |
|----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| **Credit Report Issues**               | Incorrect information on your report, Improper use of your report, Unable to get your credit report or credit score, Problem with a credit reporting company's investigation into an existing problem |
| **Account Management**                 | Managing an account, Closing an account, Opening an account, Closing your account, Problem with a purchase shown on your statement, Problem getting a card or closing an account |
| **Loan and Mortgage Issues**           | Applying for a mortgage or refinancing an existing mortgage, Struggling to pay mortgage, Closing on a mortgage, Dealing with your lender or servicer, Managing the loan or lease, Problems at the end of the loan or lease, Struggling to repay your loan, Getting a loan or lease, Struggling to pay your loan, Struggling to pay your bill, Repossession, Issue where my lender is my school, Can't repay my loan, Issue with income share agreement, Problem with overdraft, Problem with an overdraft |
| **Debt Collection Issues**             | Attempts to collect debt not owed, Written notification about debt, Communication tactics, False statements or representation, Threatened to contact someone or share information improperly, Taking/threatening an illegal action, Disclosure verification of debt, Improper contact or sharing of info, Cont'd attempts collect debt not owed |
| **Payment Issues**                     | Trouble during payment process, Problem when making payments, Problem caused by your funds being low, Fees or interest |
| **Credit Card Issues**                 | Trouble using the card, Trouble using your card, Problem with a purchase or transfer, Getting a credit card, Problem with a lender or other company charging your account |
| **Identity Theft and Monitoring Services** | Credit monitoring or identity theft protection services, Identity theft protection or other monitoring services |
| **Advertising and Marketing**          | Advertising and marketing, including promotional offers, Advertising |
| **Legal and Threats Issues**           | Took or threatened to take negative or legal action, Problem with a company's investigation into an existing problem, Problem with a company's investigation into an existing issue |
| **Electronic and Communication Issues** | Electronic communications |


In [39]:
unique_issue = df['Issue'].unique().tolist()
unique_issue

['Problem with a purchase shown on your statement',
 'Other features, terms, or problems',
 'Closing an account',
 'Problem with a lender or other company charging your account',
 'Managing an account',
 'Problem when making payments',
 'Problem with a purchase or transfer',
 "Problem with a credit reporting company's investigation into an existing problem",
 'Dealing with my lender or servicer',
 'Attempts to collect debt not owed',
 'Fees or interest',
 'Credit monitoring or identity theft protection services',
 'Dealing with your lender or servicer',
 'Struggling to pay your bill',
 'Advertising and marketing, including promotional offers',
 'Incorrect information on your report',
 'Problem caused by your funds being low',
 'Getting a credit card',
 'Opening an account',
 'Written notification about debt',
 'Trouble using the card',
 'Closing your account',
 'Taking/threatening an illegal action',
 'Trouble using your card',
 'Unable to get your credit report or credit score',
 'Tro

In [40]:
# Map each issue to its new grouped category
issue_mapping = {
    'Incorrect information on your report': 'Credit Report Issues',
    'Improper use of your report': 'Credit Report Issues',
    "Problem with a credit reporting company's investigation into an existing problem": 'Credit Report Issues',
    'Unable to get your credit report or credit score': 'Credit Report Issues',
    'Managing an account': 'Account Management',
    'Closing an account': 'Account Management',
    'Opening an account': 'Account Management',
    'Closing your account': 'Account Management',
    'Problem with a purchase shown on your statement': 'Account Management',
    'Problem getting a card or closing an account': 'Account Management',
    'Applying for a mortgage or refinancing an existing mortgage': 'Loan and Mortgage Issues',
    'Struggling to pay mortgage': 'Loan and Mortgage Issues',
    'Closing on a mortgage': 'Loan and Mortgage Issues',
    'Dealing with your lender or servicer': 'Loan and Mortgage Issues',
    'Managing the loan or lease': 'Loan and Mortgage Issues',
    'Problems at the end of the loan or lease': 'Loan and Mortgage Issues',
    'Struggling to repay your loan': 'Loan and Mortgage Issues',
    'Getting a loan or lease': 'Loan and Mortgage Issues',
    'Struggling to pay your loan': 'Loan and Mortgage Issues',
    'Struggling to pay your bill': 'Loan and Mortgage Issues',
    'Repossession': 'Loan and Mortgage Issues',
    'Issue where my lender is my school': 'Loan and Mortgage Issues',
    "Can't repay my loan": 'Loan and Mortgage Issues',
    'Issue with income share agreement': 'Loan and Mortgage Issues',
    'Problem with overdraft': 'Loan and Mortgage Issues',
    'Problem with an overdraft': 'Loan and Mortgage Issues',
    'Attempts to collect debt not owed': 'Debt Collection Issues',
    'Written notification about debt': 'Debt Collection Issues',
    'Communication tactics': 'Debt Collection Issues',
    'False statements or representation': 'Debt Collection Issues',
    'Threatened to contact someone or share information improperly': 'Debt Collection Issues',
    'Taking/threatening an illegal action': 'Debt Collection Issues',
    'Disclosure verification of debt': 'Debt Collection Issues',
    'Improper contact or sharing of info': 'Debt Collection Issues',
    "Cont'd attempts collect debt not owed": 'Debt Collection Issues',
    'Trouble during payment process': 'Payment Issues',
    'Problem when making payments': 'Payment Issues',
    'Problem caused by your funds being low': 'Payment Issues',
    'Fees or interest': 'Payment Issues',
    'Trouble using the card': 'Credit Card Issues',
    'Trouble using your card': 'Credit Card Issues',
    'Problem with a purchase or transfer': 'Credit Card Issues',
    'Getting a credit card': 'Credit Card Issues',
    'Problem with a lender or other company charging your account': 'Credit Card Issues',
    'Credit monitoring or identity theft protection services': 'Identity Theft and Monitoring Services',
    'Identity theft protection or other monitoring services': 'Identity Theft and Monitoring Services',
    'Advertising and marketing, including promotional offers': 'Advertising and Marketing',
    'Advertising': 'Advertising and Marketing',
    'Took or threatened to take negative or legal action': 'Legal and Threats Issues',
    "Problem with a company's investigation into an existing problem": 'Legal and Threats Issues',
    "Problem with a company's investigation into an existing issue": 'Legal and Threats Issues',
    'Electronic communications': 'Electronic and Communication Issues',
}

# Apply the mapping to create the new 'grouped_issue' column
df['grouped_issue'] = df['Issue'].map(issue_mapping)


In [41]:
df['grouped_issue'].value_counts()

grouped_issue
Credit Report Issues                      1893
Account Management                        1443
Payment Issues                             485
Credit Card Issues                         286
Debt Collection Issues                     259
Legal and Threats Issues                   127
Loan and Mortgage Issues                   116
Identity Theft and Monitoring Services      65
Advertising and Marketing                   60
Name: count, dtype: int64
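
The counts above sum to 4,734, fewer than the 4,887 rows loaded earlier, which suggests that some `Issue` values were not covered by `issue_mapping` and came back as `NaN` from `.map()`. For example, 'Other features, terms, or problems' and 'Dealing with my lender or servicer' appear in the unique list but not among the mapping keys. The sketch below, on toy stand-ins for `df['Issue']` and `issue_mapping`, shows one way to list such values. The fallback label 'Other Issues' is an assumption for illustration, not a category defined in this notebook.

```python
import pandas as pd

# Toy stand-ins for df['Issue'] and issue_mapping.
issues = pd.Series([
    "Managing an account",
    "Other features, terms, or problems",
    "Fees or interest",
    "Dealing with my lender or servicer",
])
toy_mapping = {
    "Managing an account": "Account Management",
    "Fees or interest": "Payment Issues",
}

grouped = issues.map(toy_mapping)

# Values absent from the mapping come back as NaN.
unmapped = issues[grouped.isna()].value_counts()
print(unmapped)

# Hypothetical fallback so that no row ends up with all-zero indicators.
grouped_filled = grouped.fillna("Other Issues")
print(grouped_filled.tolist())
```

Rows left as `NaN` receive zeros in every `grouped_issue_*` column after `get_dummies`, so it is worth deciding explicitly whether that is the intended encoding.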

In [42]:
# Apply one-hot encoding to the 'grouped_issue' column
df = pd.get_dummies(df, columns=['grouped_issue'])

# Count the columns whose names start with "grouped_issue_"
grouped_issue_columns_count = len([col for col in df.columns if col.startswith('grouped_issue_')])

grouped_issue_columns_count

9

### **Remaining Categorical Variables**

Convert the variables State, Company public response, Tags, and Consumer disputed? into numeric variables.

In [43]:
# Apply one-hot encoding
df = pd.get_dummies(df, columns=['State', 'Company public response', 'Tags', 'Consumer disputed?'])


In [44]:
state_columns_count = len([col for col in df.columns if col.startswith('State_')])
company_columns_count = len([col for col in df.columns if col.startswith('Company public response_')])
tags_columns_count = len([col for col in df.columns if col.startswith('Tags_')])
consumer_disputed_columns_count = len([col for col in df.columns if col.startswith('Consumer disputed?_')])

print(f'The number of columns created for the variable State is {state_columns_count}.')
print(f'The number of columns created for the variable Company public response is {company_columns_count}.')
print(f'The number of columns created for the variable Tags is {tags_columns_count}.')
print(f'The number of columns created for the variable Consumer disputed? is {consumer_disputed_columns_count}.')

The number of columns created for the variable State is 52.
The number of columns created for the variable Company public response is 10.
The number of columns created for the variable Tags is 4.
The number of columns created for the variable Consumer disputed? is 3.
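
One caveat on "numeric": in pandas 2.0 and later, `get_dummies` returns boolean columns by default, while older versions returned `uint8`. If integer 0/1 columns are required downstream, a sketch like the following can be used. The toy frame is made up, and whether this notebook needs integer columns is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({"State": ["CA", "NY", "CA"]})

# Default dtype depends on the pandas version (bool on 2.0+, uint8 before).
default_dummies = pd.get_dummies(toy, columns=["State"])

# Requesting integers explicitly gives 0/1 columns on any version.
int_dummies = pd.get_dummies(toy, columns=["State"], dtype=int)

print(default_dummies.dtypes)
print(int_dummies.dtypes)

# An already-encoded frame can also be converted afterwards.
converted = default_dummies.astype(int)
```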


---

# **Exporting the Generated Files**

In [45]:
df.to_excel('./excel/df_onehot.xlsx', index=False)

In [46]:
df.to_pickle("./pickle/df_onehot.pkl")
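
Since the next step presumably reloads this pickle, a quick round-trip check can confirm that values, dtypes, and column order survive the export. The sketch below uses a toy frame and a temporary directory rather than the notebook's `./pickle/` path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the encoded DataFrame.
toy = pd.DataFrame({"Complaint ID": [1, 2], "State_CA": [True, False]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "df_onehot.pkl"
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, or column order differ.
pd.testing.assert_frame_equal(toy, restored)
```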

In [47]:
print("One-Hot Encoding notebook completed")

One-Hot Encoding notebook completed
