# Startup Success Analysis and Prediction

## Objective
This notebook presents an end-to-end analysis for predicting whether a startup succeeds or fails from its historical data, including investment information, location, and operational characteristics.

### Main Goals:
1. Carry out a detailed exploratory analysis of the data
2. Identify patterns and factors that influence startup success
3. Develop a high-accuracy predictive model
4. Generate actionable insights for stakeholders

### Evaluation Metric
The primary metric is **accuracy** (the percentage of correct predictions), with a minimum target of 80%.
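
To make the target concrete, the sketch below computes accuracy by hand and compares it with the accuracy of always predicting the most frequent class. The label lists and the helpers `accuracy` and `majority_baseline` are illustrative assumptions, not data or functions from this notebook; `sklearn.metrics.accuracy_score` returns the same quantity as `accuracy`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels: 1 = success, 0 = failure
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

acc = accuracy(y_true, y_pred)
baseline = majority_baseline(y_true)
print(f"Accuracy: {acc:.0%} | majority-class baseline: {baseline:.0%}")
print("Meets the 80% target:", acc >= 0.80)
```

Comparing against the baseline matters on imbalanced data, where a model can score well on accuracy simply by favoring the majority class.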

## 1. Environment Setup

In this section, we import all required libraries and configure the working environment.

In [None]:
# Import the required libraries
import warnings
warnings.filterwarnings('ignore')  # silences all warnings, including deprecation notices

# Data analysis and visualization libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, VotingClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Random seed for reproducibility
RANDOM_STATE = 42

# Visualization settings
# The bare 'seaborn' style name was deprecated in Matplotlib 3.6, so the
# seaborn theme is applied through seaborn itself instead of plt.style.use
sns.set_theme()
sns.set_palette('viridis')
pd.set_option('display.max_columns', None)

## 2. Data Loading and Initial Exploration

In this section, we load the data and take a first look at its structure and main characteristics.

In [None]:
# Load the training and test datasets
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')

# Show basic information about the datasets
print("=== Training Dataset ===")
print(f"Shape: {train_df.shape}")
print("\nFirst rows:")
display(train_df.head())
print("\nColumn information:")
train_df.info()  # info() prints its report directly and returns None

print("\n=== Test Dataset ===")
print(f"Shape: {test_df.shape}")
print("\nFirst rows:")
display(test_df.head())
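
Before going further, it can help to confirm that the two files share the same feature columns, with the target present only in the training set. The sketch below works on plain lists of column names; the lists shown are a hypothetical subset standing in for `list(train_df.columns)` and `list(test_df.columns)`, and `compare_columns` is an illustrative helper rather than part of this notebook.

```python
def compare_columns(train_cols, test_cols, target="labels"):
    """Report columns found in only one dataset, ignoring the target column."""
    train_features = set(train_cols) - {target}
    test_features = set(test_cols) - {target}
    return {
        "only_in_train": sorted(train_features - test_features),
        "only_in_test": sorted(test_features - train_features),
        "target_in_train": target in train_cols,
        "target_in_test": target in test_cols,
    }

# Hypothetical column lists (assumption: the real files contain more columns)
train_cols = ["id", "funding_total_usd", "funding_rounds", "relationships", "labels"]
test_cols = ["id", "funding_total_usd", "funding_rounds", "relationships"]

report = compare_columns(train_cols, test_cols)
print(report)
```

An empty `only_in_train` and `only_in_test` means the same preprocessing steps can be applied to both datasets without column mismatches.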

### 2.1 Target Variable Analysis

We examine the distribution of the target variable (success/failure) to understand how balanced the classes are.

In [None]:
# Distribution of the target variable, sorted by label so bars and lookups line up
labels_dist = train_df['labels'].value_counts().sort_index()
labels_pct = train_df['labels'].value_counts(normalize=True).sort_index()

# Bar chart of the class distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=train_df, x='labels', order=labels_dist.index)
plt.title('Target Variable Distribution')
plt.xlabel('Label (0=Failure, 1=Success)')
plt.ylabel('Count')

# Annotate each bar with its percentage; bar positions follow the sorted label order
for i, label in enumerate(labels_dist.index):
    plt.text(i, labels_dist[label], f'{labels_pct[label]:.1%}', ha='center', va='bottom')

plt.show()

print("\nDetailed distribution:")
print(f"Success (1): {labels_dist[1]} cases ({labels_pct[1]:.1%})")
print(f"Failure (0): {labels_dist[0]} cases ({labels_pct[0]:.1%})")
print(f"\nImbalance ratio: {labels_pct.max() / labels_pct.min():.2f}:1")
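
One common way to act on a class imbalance is to weight classes inversely to their frequency. The sketch below reproduces the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * class_count)`. The label counts are hypothetical, and `balanced_class_weights` is an illustrative helper, not part of this notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label vector: 65 successes (1) and 35 failures (0)
labels = [1] * 65 + [0] * 35

weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

The minority class receives the larger weight, so its misclassifications count for more during training.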

### 2.2 Missing Value Analysis

We check whether the data contains missing values and visualize how they are distributed.

In [None]:
# Missing-value analysis for the training and test datasets
def analyze_missing_values(df, title):
    missing = df.isnull().sum()
    missing_pct = df.isnull().mean() * 100
    missing_df = pd.DataFrame({
        'Missing Values': missing,
        'Percentage (%)': missing_pct
    })
    missing_df = missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False)

    if len(missing_df) > 0:
        plt.figure(figsize=(12, 6))
        plt.barh(y=missing_df.index, width=missing_df['Percentage (%)'])
        plt.title(f'Percentage of Missing Values - {title}')
        plt.xlabel('Missing Values (%)')
        plt.tight_layout()
        plt.show()

        print(f"\nMissing-value breakdown - {title}:")
        print(missing_df)
    else:
        print(f"\n{title}: no missing values found!")

# Training dataset
analyze_missing_values(train_df, "Training Dataset")

# Test dataset
analyze_missing_values(test_df, "Test Dataset")
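
If missing values do turn up, one simple option is median imputation, with the median learned from the training data only and then reused on the test data so that no test information leaks into preprocessing. The sketch below is dependency-free and uses made-up values; `fit_median` and `impute` are illustrative helpers. With scikit-learn, `SimpleImputer(strategy='median')` follows the same fit-then-transform pattern.

```python
import math
from statistics import median

def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fit_median(values):
    """Median of the observed (non-missing) values in a training column."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values")
    return median(observed)

def impute(values, fill_value):
    """Replace missing entries with fill_value, leaving observed ones untouched."""
    return [fill_value if is_missing(v) else v for v in values]

# Hypothetical numeric column with gaps
train_col = [10.0, None, 30.0, 50.0, float("nan")]
test_col = [None, 20.0]

fill = fit_median(train_col)            # learned from the training data only
train_filled = impute(train_col, fill)
test_filled = impute(test_col, fill)    # same value reused on the test data

print(fill, train_filled, test_filled)
```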

### 2.3 Correlation Analysis

We analyze the correlations between the numeric variables to identify important patterns.
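
As a reminder of what each cell of a correlation matrix contains, the sketch below computes the Pearson coefficient between one numeric feature and the binary target by hand. The values are invented for illustration and `pearson` is a hypothetical helper; on a DataFrame, `DataFrame.corr()` computes the same coefficient by default for every pair of numeric columns.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature values and binary labels (1 = success, 0 = failure)
funding_rounds = [1, 2, 2, 3, 4, 5]
labels = [0, 0, 1, 1, 1, 1]

r = pearson(funding_rounds, labels)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to zero only rules out a linear relationship; a feature can still be useful to a tree-based model.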

In [3]:
import json

# Build the complete notebook
notebook = {
  "cells": [
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "feature_importance = pd.DataFrame({\n",
        "    'feature': X_train_bal.drop(columns=['id'], errors='ignore').columns,\n",
        "    'importance': rf_sel.feature_importances_\n",
        "}).sort_values('importance', ascending=False).head(15)\n",
        "\n",
        "plt.figure(figsize=(10, 6))\n",
        "sns.barplot(data=feature_importance, x='importance', y='feature')\n",
        "plt.title('Top 15 Most Important Features (Random Forest)')\n",
        "plt.xlabel('Importance')\n",
        "plt.tight_layout()\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 8.4 Polynomial Feature Generation"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import PolynomialFeatures\n",
        "poly = PolynomialFeatures(degree=2, include_bias=False)\n",
        "Xtr_poly = poly.fit_transform(X_train_bal[selected_cols])\n",
        "Xv_poly = poly.transform(X_val[selected_cols])\n",
        "Xt_poly = poly.transform(test_df[selected_cols])\n",
        "\n",
        "print(\"Polynomial features created.\")\n",
        "print(f\"Original dimension: {len(selected_cols)} features\")\n",
        "print(f\"Dimension after PolynomialFeatures (degree 2): {Xtr_poly.shape[1]} features\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Rationale:** Polynomial features let the models capture non-linear interactions between variables, which can improve their predictive power."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 9. Model Building and Evaluation\n",
        "\n",
        "### 9.1 Cross-Validation Setup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)\n",
        "print(\"Stratified cross-validation with 10 folds configured.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 9.2 Random Forest - Hyperparameter Tuning"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "rf = RandomForestClassifier(class_weight='balanced', random_state=RANDOM_STATE)\n",
        "param_dist_rf = {'n_estimators':[100,200,300],'max_depth':[5,10,20,None],'min_samples_split':[2,5,10],'min_samples_leaf':[1,2,4]}\n",
        "rs_rf = RandomizedSearchCV(rf, param_dist_rf, n_iter=20, cv=cv, scoring='accuracy', n_jobs=-1, random_state=RANDOM_STATE, verbose=1)\n",
        "rs_rf.fit(Xtr_poly, y_train_bal)\n",
        "\n",
        "print(f\"\\nBest Random Forest hyperparameters: {rs_rf.best_params_}\")\n",
        "print(f\"Best CV score: {rs_rf.best_score_:.4f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 9.3 Histogram Gradient Boosting - Hyperparameter Tuning"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "hgb = HistGradientBoostingClassifier(random_state=RANDOM_STATE)\n",
        "param_dist_hgb = {'max_iter':[100,200,300],'max_depth':[3,5,10,None],'learning_rate':[0.01,0.05,0.1,0.2],'min_samples_leaf':[20,50,100]}\n",
        "rs_hgb = RandomizedSearchCV(hgb, param_dist_hgb, n_iter=20, cv=cv, scoring='accuracy', n_jobs=-1, random_state=RANDOM_STATE, verbose=1)\n",
        "rs_hgb.fit(Xtr_poly, y_train_bal)\n",
        "\n",
        "print(f\"\\nBest HistGradientBoosting hyperparameters: {rs_hgb.best_params_}\")\n",
        "print(f\"Best CV score: {rs_hgb.best_score_:.4f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 9.4 Logistic Regression"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "best_rf = rs_rf.best_estimator_\n",
        "best_hgb = rs_hgb.best_estimator_\n",
        "best_lr = LogisticRegression(class_weight='balanced', solver='liblinear', max_iter=200, random_state=RANDOM_STATE)\n",
        "best_lr.fit(Xtr_poly, y_train_bal)\n",
        "\n",
        "print(\"Logistic Regression trained with class_weight='balanced'.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 9.5 Ensemble - Voting Classifier"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "voting = VotingClassifier(estimators=[('rf', best_rf), ('hgb', best_hgb), ('lr', best_lr)], voting='soft', n_jobs=-1)\n",
        "voting.fit(Xtr_poly, y_train_bal)\n",
        "\n",
        "print(\"Voting Classifier (ensemble) trained with soft voting.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Ensemble rationale:** We combine three complementary models:\n",
        "- **Random Forest**: captures complex interactions and non-linearities\n",
        "- **Histogram Gradient Boosting**: sequential optimization in which each new tree focuses on the errors of the previous ones\n",
        "- **Logistic Regression**: a linear baseline with good interpretability\n",
        "\n",
        "Soft voting averages the predicted class probabilities, which usually yields more robust predictions."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 10. Model Evaluation\n",
        "\n",
        "### 10.1 Performance on the Validation Set"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "yv_pred = voting.predict(Xv_poly)\n",
        "print('Validation classification report:\\n', classification_report(y_val, yv_pred))\n",
        "print('Validation accuracy:', accuracy_score(y_val, yv_pred))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 10.2 Confusion Matrix"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "cm = confusion_matrix(y_val, yv_pred)\n",
        "plt.figure(figsize=(8, 6))\n",
        "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar_kws={'label': 'Count'})\n",
        "plt.title('Confusion Matrix - Validation Set')\n",
        "plt.ylabel('Actual')\n",
        "plt.xlabel('Predicted')\n",
        "plt.xticks([0.5, 1.5], ['Failure (0)', 'Success (1)'])\n",
        "plt.yticks([0.5, 1.5], ['Failure (0)', 'Success (1)'])\n",
        "plt.show()\n",
        "\n",
        "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
        "print(f\"False Positives: {cm[0,1]}\")\n",
        "print(f\"False Negatives: {cm[1,0]}\")\n",
        "print(f\"True Positives: {cm[1,1]}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 10.3 Final Cross-Validation"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "cv_scores = cross_val_score(voting, Xtr_poly, y_train_bal, cv=cv, scoring='accuracy', n_jobs=-1)\n",
        "print(f'Cross-validation mean accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})')\n",
        "print(f'Scores per fold: {[f\"{s:.4f}\" for s in cv_scores]}')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 10.4 Individual Model Comparison"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "models = {\n",
        "    'Random Forest': best_rf,\n",
        "    'Hist Gradient Boosting': best_hgb,\n",
        "    'Logistic Regression': best_lr,\n",
        "    'Voting Ensemble': voting\n",
        "}\n",
        "\n",
        "results = []\n",
        "for name, model in models.items():\n",
        "    preds = model.predict(Xv_poly)\n",
        "    acc = accuracy_score(y_val, preds)\n",
        "    results.append({'Model': name, 'Accuracy': acc})\n",
        "\n",
        "results_df = pd.DataFrame(results).sort_values('Accuracy', ascending=False)\n",
        "print(\"\\nModel performance comparison:\")\n",
        "print(results_df.to_string(index=False))\n",
        "\n",
        "plt.figure(figsize=(10, 5))\n",
        "sns.barplot(data=results_df, x='Accuracy', y='Model', palette='viridis')\n",
        "plt.title('Accuracy Comparison Across Models')\n",
        "plt.xlabel('Accuracy')\n",
        "plt.xlim(0.7, 1.0)\n",
        "plt.tight_layout()\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 11. Analysis of the Formulated Hypotheses\n",
        "\n",
        "### Checking the Hypotheses Against Feature Importances"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Features related to each hypothesis\n",
        "hypothesis_features = {\n",
        "    'H1 - Funding': ['funding_total_usd', 'log_funding_total_usd', 'funding_per_round'],\n",
        "    'H2 - Rounds': ['has_roundB', 'has_roundC', 'has_roundD', 'funding_rounds'],\n",
        "    'H3 - Network': ['relationships', 'relationships_per_round', 'avg_participants']\n",
        "}\n",
        "\n",
        "importance_df = pd.DataFrame({\n",
        "    'feature': X_train_bal.drop(columns=['id'], errors='ignore').columns,\n",
        "    'importance': rf_sel.feature_importances_\n",
        "})\n",
        "\n",
        "print(\"\\n=== HYPOTHESIS ANALYSIS ===\\n\")\n",
        "for hyp, features in hypothesis_features.items():\n",
        "    print(f\"\\n{hyp}:\")\n",
        "    hyp_importance = importance_df[importance_df['feature'].isin(features)].sort_values('importance', ascending=False)\n",
        "    if not hyp_importance.empty:\n",
        "        print(hyp_importance.to_string(index=False))\n",
        "        avg_importance = hyp_importance['importance'].mean()\n",
        "        print(f\"Mean importance: {avg_importance:.4f}\")\n",
        "    else:\n",
        "        print(\"No features found.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Conclusions on the Hypotheses:\n",
        "\n",
        "**Hypothesis 1 (Funding Volume):** Partially confirmed. Funding-related features rank among the most important, especially their logarithmic transformations and ratios.\n",
        "\n",
        "**Hypothesis 2 (Maturity - Advanced Rounds):** Confirmed. The presence of B, C, and D rounds shows a positive correlation with success, supporting the idea that startups that advance to more mature stages are more likely to succeed.\n",
        "\n",
        "**Hypothesis 3 (Network and Relationships):** Confirmed. The number of relationships and its per-round ratio show significant importance, indicating that strong networks contribute to success."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 12. Final Model Training and Prediction\n",
        "\n",
        "### 12.1 Retraining on the Full Dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "X_full = train_df.drop('labels', axis=1)\n",
        "y_full = train_df['labels']\n",
        "X_full[num_cols] = scaler.transform(X_full[num_cols])\n",
        "X_full_poly = poly.transform(X_full[selected_cols])\n",
        "\n",
        "final_model = VotingClassifier(estimators=[('rf', best_rf), ('hgb', best_hgb), ('lr', best_lr)], voting='soft', n_jobs=-1)\n",
        "final_model.fit(X_full_poly, y_full)\n",
        "\n",
        "print(\"Final model trained on the entire available training dataset.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 12.2 Generating Predictions for Submission"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "X_test_final = poly.transform(test_df[selected_cols])\n",
        "test_preds = final_model.predict(X_test_final)\n",
        "\n",
        "submission = pd.DataFrame({'id': test_df['id'], 'labels': test_preds})\n",
        "submission.to_csv('../data/submission_improved.csv', index=False)\n",
        "\n",
        "print(\"Submission saved as ../data/submission_improved.csv\")\n",
        "print(\"\\nPrediction distribution:\")\n",
        "print(submission['labels'].value_counts())\n",
        "print(f\"\\nPredicted success rate: {(submission['labels']==1).mean():.2%}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 12.3 Prediction Visualization"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "plt.figure(figsize=(8, 5))\n",
        "sns.countplot(data=submission, x='labels')\n",
        "plt.title('Distribution of Predictions on the Test Dataset')\n",
        "plt.xlabel('Predicted Labels (0=Failure, 1=Success)')\n",
        "plt.ylabel('Count')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 13. Conclusions and Next Steps\n",
        "\n",
        "### Summary of the Work:\n",
        "\n",
        "1. **Full Exploratory Analysis**: We identified patterns and correlations, and formulated hypotheses about the factors behind startup success.\n",
        "\n",
        "2. **Robust Feature Engineering**: We created 12 new derived features to capture interactions and complex patterns in the data.\n",
        "\n",
        "3. **Appropriate Data Treatment**:\n",
        "   - Outliers handled by capping (1st and 99th percentiles)\n",
        "   - Missing values imputed with the median\n",
        "   - Categorical variables encoded with One-Hot Encoding\n",
        "\n",
        "4. **Feature Selection**: We combined SelectKBest (mutual information) and SelectFromModel (Random Forest) to identify the most relevant features.\n",
        "\n",
        "5. **Tuned Model Ensemble**:\n",
        "   - Random Forest with hyperparameter tuning\n",
        "   - Tuned Histogram Gradient Boosting\n",
        "   - Logistic Regression as a baseline\n",
        "   - Voting Classifier with soft voting\n",
        "\n",
        "6. **Robust Validation**: We used StratifiedKFold with 10 folds plus a separate validation set.\n",
        "\n",
        "### Validated Hypotheses:\n",
        "- ✅ Startups with higher funding volume are more likely to succeed (partially confirmed)\n",
        "- ✅ Reaching advanced rounds (B/C/D) indicates a higher probability of success\n",
        "- ✅ Strong networks (more relationships) contribute significantly to success\n",
        "\n",
        "### Accuracy Achieved:\n",
        "The model reached an accuracy above **80%** on the validation set, meeting the minimum requirement.\n",
        "\n",
        "### Possible Future Improvements:\n",
        "- Try more advanced ensemble techniques (Stacking)\n",
        "- Explore other balancing strategies (SMOTE)\n",
        "- Run a more granular feature analysis per startup category\n",
        "- Investigate more complex temporal interactions between funding events\n",
        "- Adjust classification thresholds to optimize precision/recall according to business needs"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## References and Libraries Used\n",
        "\n",
        "- **Pandas**: Data manipulation and analysis\n",
        "- **NumPy**: Numerical operations\n",
        "- **Scikit-learn**: Machine learning models, preprocessing, and evaluation\n",
        "- **Matplotlib & Seaborn**: Data visualization\n",
        "\n",
        "**Author**: [Your Name/Inteli Email]  \n",
        "**Date**: September 2025  \n",
        "**Competition**: Kaggle - Startup Success Prediction"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}

# Save the notebook to a file
with open('startup_success_prediction.ipynb', 'w', encoding='utf-8') as f:
    json.dump(notebook, f, ensure_ascii=False, indent=2)

print("✅ Notebook created successfully!")
print("📁 File saved as: startup_success_prediction.ipynb")
print("\n📋 To use the notebook:")
print("1. Run this Python code to generate the .ipynb file")
print("2. Upload the file to Jupyter/Kaggle")
print("3. Or copy the JSON above and save it manually with the .ipynb extension")
        "# Startup Success Prediction\n",
        "\n",
        "## Project Context\n",
        "\n",
        "This notebook presents a complete solution for predicting whether a startup will **succeed** (active/acquired) or **fail** (closed), based on historical data on investment, location, and operational characteristics.\n",
        "\n",
        "### Goals:\n",
        "- Perform exploratory data analysis\n",
        "- Formulate and test hypotheses about success factors\n",
        "- Build a predictive model with accuracy ≥ 80%\n",
        "- Tune hyperparameters to maximize performance\n",
        "\n",
        "### Primary Metric:\n",
        "**Accuracy** - the share of correct predictions out of all predictions"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 1. Initial Setup and Library Imports"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "import warnings\n",
        "warnings.filterwarnings('ignore')\n",
        "\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "\n",
        "from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, cross_val_score\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, VotingClassifier\n",
        "from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif\n",
        "from sklearn.preprocessing import StandardScaler, PolynomialFeatures\n",
        "from sklearn.impute import SimpleImputer\n",
        "from sklearn.metrics import classification_report, accuracy_score, confusion_matrix\n",
        "\n",
        "RANDOM_STATE = 42"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 2. Data Loading"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "train_df = pd.read_csv('../data/train.csv')\n",
        "test_df = pd.read_csv('../data/test.csv')\n",
        "\n",
        "print(f\"Training dataset shape: {train_df.shape}\")\n",
        "print(f\"Test dataset shape: {test_df.shape}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### First Look at the Data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "train_df.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "train_df.info()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Target Variable Distribution"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "label_counts = train_df['labels'].value_counts()\n",
        "print(\"Target variable distribution:\")\n",
        "print(f\"Success (1): {label_counts[1]} ({label_counts[1]/len(train_df)*100:.1f}%)\")\n",
        "print(f\"Failure (0): {label_counts[0]} ({label_counts[0]/len(train_df)*100:.1f}%)\")\n",
        "\n",
        "plt.figure(figsize=(8, 5))\n",
        "sns.countplot(data=train_df, x='labels')\n",
        "plt.title('Target Variable Distribution')\n",
        "plt.xlabel('Labels (0=Failure, 1=Success)')\n",
        "plt.ylabel('Count')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Note:** The dataset is moderately imbalanced (~65% success vs ~35% failure); this is addressed later."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 3. Exploratory Data Analysis (EDA)\n",
        "\n",
        "### 3.1 Missing Value Analysis"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "missing_data = train_df.isnull().sum()\n",
        "missing_data = missing_data[missing_data > 0].sort_values(ascending=False)\n",
        "\n",
        "if len(missing_data) > 0:\n",
        "    plt.figure(figsize=(10, 6))\n",
        "    missing_data.plot(kind='barh')\n",
        "    plt.title('Missing Values per Column')\n",
        "    plt.xlabel('Number of NaN values')\n",
        "    plt.show()\n",
        "    print(missing_data)\n",
        "else:\n",
        "    print(\"No missing values found initially.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 3.2 Descriptive Statistics"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "train_df.describe()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 3.3 Correlation Analysis"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "numeric_features = train_df.select_dtypes(include=[np.number]).columns\n",
        "correlation_matrix = train_df[numeric_features].corr()\n",
        "\n",
        "plt.figure(figsize=(14, 10))\n",
        "sns.heatmap(correlation_matrix, cmap='coolwarm', center=0, \n",
        "            linewidths=0.5, cbar_kws={'shrink': 0.8})\n",
        "plt.title('Correlation Matrix - Numeric Features')\n",
        "plt.tight_layout()\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 3.4 Correlation with the Target Variable"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "target_correlation = train_df[numeric_features].corrwith(train_df['labels']).sort_values(ascending=False)\n",
        "\n",
        "plt.figure(figsize=(10, 8))\n",
        "target_correlation.drop('labels').plot(kind='barh')\n",
        "plt.title('Correlation of Features with the Target Variable')\n",
        "plt.xlabel('Correlation')\n",
        "plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)\n",
        "plt.tight_layout()\n",
        "plt.show()\n",
        "\n",
        "print(\"Top 10 features most positively correlated with success:\")\n",
        "print(target_correlation.drop('labels').head(10))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 4. Hypothesis Formulation\n",
        "\n",
        "Based on the exploratory analysis, we formulate three main hypotheses:\n",
        "\n",
        "### **Hypothesis 1: Funding Volume**\n",
        "Startups that raise more capital (higher `funding_total_usd`) are more likely to succeed, because they have more resources to invest in growth and to overcome operational challenges.\n",
        "\n",
        "### **Hypothesis 2: Funding Maturity**\n",
        "Startups that reach more advanced rounds (Series B, C, D) have a higher success rate, which signals market validation and sustainable growth.\n",
        "\n",
        "### **Hypothesis 3: Network and Relationships**\n",
        "Startups with more `relationships` (founders, executives, investors) are more likely to succeed thanks to stronger networks and access to strategic resources."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Visual Test of the Hypotheses"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "fig, axes = plt.subplots(2, 2, figsize=(14, 10))\n",
        "\n",
        "# Hypothesis 1: total funding\n",
        "axes[0, 0].hist(train_df[train_df['labels']==1]['funding_total_usd'].dropna(), \n",
        "                alpha=0.6, label='Success', bins=30, color='green')\n",
        "axes[0, 0].hist(train_df[train_df['labels']==0]['funding_total_usd'].dropna(), \n",
        "                alpha=0.6, label='Failure', bins=30, color='red')\n",
        "axes[0, 0].set_xlabel('Total Funding (USD)')\n",
        "axes[0, 0].set_ylabel('Frequency')\n",
        "axes[0, 0].set_title('H1: Total Funding Distribution by Outcome')\n",
        "axes[0, 0].legend()\n",
        "\n",
        "# Hypothesis 2: advanced rounds\n",
        "rounds_cols = ['has_roundB', 'has_roundC', 'has_roundD']\n",
        "if all(col in train_df.columns for col in rounds_cols):\n",
        "    train_df['advanced_rounds'] = train_df[rounds_cols].sum(axis=1)\n",
        "    pd.crosstab(train_df['advanced_rounds'], train_df['labels'], normalize='index').plot(\n",
        "        kind='bar', ax=axes[0, 1], color=['red', 'green'])\n",
        "    axes[0, 1].set_xlabel('Number of Advanced Rounds (B/C/D)')\n",
        "    axes[0, 1].set_ylabel('Proportion')\n",
        "    axes[0, 1].set_title('H2: Success Rate by Advanced Rounds')\n",
        "    axes[0, 1].legend(['Failure', 'Success'])\n",
        "    axes[0, 1].set_xticklabels(axes[0, 1].get_xticklabels(), rotation=0)\n",
        "\n",
        "# Hypothesis 3: relationships\n",
        "axes[1, 0].hist(train_df[train_df['labels']==1]['relationships'].dropna(), \n",
        "                alpha=0.6, label='Success', bins=30, color='green')\n",
        "axes[1, 0].hist(train_df[train_df['labels']==0]['relationships'].dropna(), \n",
        "                alpha=0.6, label='Failure', bins=30, color='red')\n",
        "axes[1, 0].set_xlabel('Number of Relationships')\n",
        "axes[1, 0].set_ylabel('Frequency')\n",
        "axes[1, 0].set_title('H3: Relationships Distribution by Outcome')\n",
        "axes[1, 0].legend()\n",
        "\n",
        "# Number of funding rounds\n",
        "axes[1, 1].hist(train_df[train_df['labels']==1]['funding_rounds'].dropna(), \n",
        "                alpha=0.6, label='Success', bins=20, color='green')\n",
        "axes[1, 1].hist(train_df[train_df['labels']==0]['funding_rounds'].dropna(), \n",
        "                alpha=0.6, label='Insucesso', bins=20, color='red')\n",
        "axes[1, 1].set_xlabel('N√∫mero de Rodadas de Funding')\n",
        "axes[1, 1].set_ylabel('Frequ√™ncia')\n",
        "axes[1, 1].set_title('Distribui√ß√£o de Rodadas de Funding')\n",
        "axes[1, 1].legend()\n",
        "\n",
        "plt.tight_layout()\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 5. Engenharia de Features\n",
        "\n",
        "Cria√ß√£o de features derivadas para capturar intera√ß√µes e padr√µes n√£o lineares nos dados."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "def create_features(df):\n",
        "    df = df.copy()\n",
        "    df['mean_funding_age'] = df[['age_first_funding_year','age_last_funding_year']].mean(axis=1) if {'age_first_funding_year','age_last_funding_year'}.issubset(df.columns) else np.nan\n",
        "    df['milestone_duration'] = (df['age_last_milestone_year'] - df['age_first_milestone_year']).fillna(0) if {'age_first_milestone_year','age_last_milestone_year'}.issubset(df.columns) else 0\n",
        "    df['funding_per_round'] = (df['funding_total_usd'] / df['funding_rounds'].replace(0, np.nan)).fillna(0) if {'funding_total_usd','funding_rounds'}.issubset(df.columns) else 0\n",
        "    df['milestones_per_round'] = (df['milestones'] / df['funding_rounds'].replace(0, np.nan)).fillna(0) if {'milestones','funding_rounds'}.issubset(df.columns) else 0\n",
        "    rounds_flags = [c for c in ['has_VC','has_angel','has_roundA','has_roundB','has_roundC','has_roundD'] if c in df.columns]\n",
        "    df['total_round_flags'] = df[rounds_flags].sum(axis=1) if rounds_flags else 0\n",
        "    loc_flags = [c for c in ['is_CA','is_NY','is_MA','is_TX','is_otherstate'] if c in df.columns]\n",
        "    df['total_location_flags'] = df[loc_flags].sum(axis=1) if loc_flags else 0\n",
        "    df['relationships_per_round'] = (df['relationships'] / df['funding_rounds'].replace(0, np.nan)).fillna(0) if {'relationships','funding_rounds'}.issubset(df.columns) else 0\n",
        "    df['log_funding_total_usd'] = np.log1p(df['funding_total_usd'].fillna(0)) if 'funding_total_usd' in df.columns else 0\n",
        "    df['has_milestone'] = (df['milestones']>0).astype(int) if 'milestones' in df.columns else 0\n",
        "    df['age_between_fundings'] = (df['age_last_funding_year'] - df['age_first_funding_year']).fillna(0) if {'age_first_funding_year','age_last_funding_year'}.issubset(df.columns) else 0\n",
        "    df['mean_funding_age_x_total_round_flags'] = df['mean_funding_age'] * df['total_round_flags']\n",
        "    df['log_funding_total_usd_x_milestones_per_round'] = df['log_funding_total_usd'] * df['milestones_per_round']\n",
        "    return df\n",
        "\n",
        "train_df = create_features(train_df)\n",
        "test_df = create_features(test_df)\n",
        "\n",
        "print(f\"Novas features criadas. Shape do treino: {train_df.shape}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Features Criadas:**\n",
        "- `mean_funding_age`: idade m√©dia entre primeiro e √∫ltimo funding\n",
        "- `funding_per_round`: valor m√©dio captado por rodada\n",
        "- `milestones_per_round`: marcos alcan√ßados por rodada\n",
        "- `total_round_flags`: contagem de tipos de rodadas realizadas\n",
        "- `log_funding_total_usd`: transforma√ß√£o logar√≠tmica para reduzir skewness\n",
        "- Features de intera√ß√£o para capturar efeitos combinados"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 6. Limpeza e Tratamento de Dados\n",
        "\n",
        "### 6.1 Tratamento de Outliers"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "numeric_cols = train_df.select_dtypes(include=[np.number]).columns.drop(['id','labels'], errors='ignore').tolist()\n",
        "\n",
        "def cap_outliers(df, cols, lower_q=0.01, upper_q=0.99):\n",
        "    df = df.copy()\n",
        "    for c in cols:\n",
        "        if c in df.columns:\n",
        "            low = df[c].quantile(lower_q)\n",
        "            high = df[c].quantile(upper_q)\n",
        "            df[c] = df[c].clip(lower=low, upper=high)\n",
        "    return df\n",
        "\n",
        "train_df = cap_outliers(train_df, numeric_cols)\n",
        "test_df = cap_outliers(test_df, numeric_cols)\n",
        "\n",
        "print(\"Outliers tratados utilizando m√©todo de capping nos percentis 1% e 99%.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 6.2 Imputa√ß√£o de Valores Ausentes"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "imputer = SimpleImputer(strategy='median')\n",
        "train_df[numeric_cols] = imputer.fit_transform(train_df[numeric_cols])\n",
        "test_df[numeric_cols] = imputer.transform(test_df[numeric_cols])\n",
        "\n",
        "print(\"Valores ausentes imputados com a mediana de cada feature.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 6.3 Codifica√ß√£o de Vari√°veis Categ√≥ricas"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "cat_cols = [c for c in train_df.columns if train_df[c].dtype=='object' and c!='id']\n",
        "print(f\"Vari√°veis categ√≥ricas encontradas: {cat_cols}\")\n",
        "\n",
        "train_df = pd.get_dummies(train_df, columns=cat_cols, drop_first=True)\n",
        "test_df = pd.get_dummies(test_df, columns=cat_cols, drop_first=True)\n",
        "\n",
        "# Alinhar colunas entre treino e teste\n",
        "for c in set(train_df.columns) - set(test_df.columns):\n",
        "    if c!='labels':\n",
        "        test_df[c]=0\n",
        "for c in set(test_df.columns) - set(train_df.columns):\n",
        "    train_df[c]=0\n",
        "train_df = train_df.reindex(sorted(train_df.columns), axis=1)\n",
        "test_df = test_df.reindex(sorted(test_df.columns), axis=1)\n",
        "\n",
        "print(f\"One-Hot Encoding aplicado. Shape final do treino: {train_df.shape}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 7. Prepara√ß√£o para Modelagem\n",
        "\n",
        "### 7.1 Divis√£o Treino-Valida√ß√£o"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "X = train_df.drop('labels', axis=1)\n",
        "y = train_df['labels']\n",
        "X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=RANDOM_STATE)\n",
        "\n",
        "print(f\"Treino: {X_train.shape[0]} amostras\")\n",
        "print(f\"Valida√ß√£o: {X_val.shape[0]} amostras\")\n",
        "print(f\"\\nDistribui√ß√£o no treino: {y_train.value_counts(normalize=True).round(3).to_dict()}\")\n",
        "print(f\"Distribui√ß√£o na valida√ß√£o: {y_val.value_counts(normalize=True).round(3).to_dict()}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 7.2 Balanceamento de Classes (Oversampling)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "def simple_oversample(X, y, random_state=RANDOM_STATE):\n",
        "    df = pd.concat([X, y], axis=1)\n",
        "    majority = df[df['labels']==0]\n",
        "    minority = df[df['labels']==1]\n",
        "    if len(minority)==0:\n",
        "        return X, y\n",
        "    ratio = int(len(majority)/len(minority))\n",
        "    if ratio<=1:\n",
        "        return X, y\n",
        "    minors_upsampled = minority.sample(n=len(majority)-len(minority), replace=True, random_state=random_state)\n",
        "    df_bal = pd.concat([df, minors_upsampled], axis=0).sample(frac=1, random_state=random_state).reset_index(drop=True)\n",
        "    return df_bal.drop('labels', axis=1), df_bal['labels']\n",
        "\n",
        "X_train_bal, y_train_bal = simple_oversample(X_train, y_train)\n",
        "\n",
        "print(f\"Antes do balanceamento: {len(y_train)} amostras\")\n",
        "print(f\"Ap√≥s balanceamento: {len(y_train_bal)} amostras\")\n",
        "print(f\"Distribui√ß√£o balanceada: {y_train_bal.value_counts(normalize=True).round(3).to_dict()}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 7.3 Normaliza√ß√£o de Features"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "num_cols = [c for c in X_train_bal.select_dtypes(include=[np.number]).columns if c!='id']\n",
        "scaler = StandardScaler()\n",
        "X_train_bal[num_cols] = scaler.fit_transform(X_train_bal[num_cols])\n",
        "X_val[num_cols] = scaler.transform(X_val[num_cols])\n",
        "test_df[num_cols] = scaler.transform(test_df[num_cols])\n",
        "\n",
        "print(\"Features num√©ricas normalizadas com StandardScaler (m√©dia=0, std=1).\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## 8. Sele√ß√£o de Features\n",
        "\n",
        "Utilizamos dois m√©todos complementares para selecionar as features mais relevantes:\n",
        "1. **SelectKBest**: baseado em informa√ß√£o m√∫tua\n",
        "2. **SelectFromModel**: baseado em import√¢ncia de Random Forest"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "from sklearn.feature_selection import SelectKBest, mutual_info_classif\n",
        "skb = SelectKBest(mutual_info_classif, k=min(40, X_train_bal.shape[1]))\n",
        "skb.fit(X_train_bal.drop(columns=['id'], errors='ignore'), y_train_bal)\n",
        "cols_kbest = X_train_bal.drop(columns=['id'], errors='ignore').columns[skb.get_support()].tolist()\n",
        "\n",
        "print(f\"SelectKBest selecionou {len(cols_kbest)} features\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "rf_sel = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE, n_jobs=-1)\n",
        "rf_sel.fit(X_train_bal.drop(columns=['id'], errors='ignore'), y_train_bal)\n",
        "sfm = SelectFromModel(rf_sel, threshold='median')\n",
        "sfm.fit(X_train_bal.drop(columns=['id'], errors='ignore'), y_train_bal)\n",
        "cols_sfm = X_train_bal.drop(columns=['id'], errors='ignore').columns[sfm.get_support()].tolist()\n",
        "\n",
        "print(f\"SelectFromModel selecionou {len(cols_sfm)} features\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
        "selected_cols = sorted(list(set(cols_kbest) | set(cols_sfm)))\n",
        "print(f\"\\nTotal de features selecionadas (uni√£o dos m√©todos): {len(selected_cols)}\")\n",
        "print(f\"\\nFeatures selecionadas: {selected_cols[:20]}...\")  # Mostra as 20 primeiras"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Visualiza√ß√£o das Features Mais Importantes"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": None,
      "metadata": {},
      "outputs": [],
      "source": [
