# Kaggle Introvert vs Extrovert Classification - Google Colab Version

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-username/your-repo/blob/main/notebooks/colab_complete_analysis.ipynb)

This notebook provides a complete analysis and modeling pipeline for the Kaggle Introvert vs Extrovert classification competition, optimized for Google Colab.

## Table of Contents
1. [Setup and Installation](#setup)
2. [Data Upload and Loading](#data-loading)
3. [Exploratory Data Analysis](#eda)
4. [Feature Engineering](#feature-engineering)
5. [Model Training and Evaluation](#model-training)
6. [Final Predictions](#predictions)


## 1. Setup and Installation

First, let's install the required packages and set up the environment.

In [None]:
# Install required packages
!pip install -q pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm
!pip install -q kaggle

print('Packages installed successfully!')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import os
import zipfile
from google.colab import files
from google.colab import drive
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print('Libraries imported successfully!')

## 2. Data Upload and Loading

Choose one of the following methods to load your data:

### Method 1: Upload files directly (Recommended for small files)

Run the cell below and upload your `train.csv`, `test.csv`, and `sample_submission.csv` files.

In [None]:
# Method 1: Direct file upload
print('Please upload your CSV files (train.csv, test.csv, sample_submission.csv):')
uploaded = files.upload()

# List uploaded files
print('\nUploaded files:')
for filename in uploaded.keys():
    print(f'- {filename} ({len(uploaded[filename])} bytes)')
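
Before moving on, it can help to confirm that all three expected files were actually uploaded. The helper below is a minimal sketch: `find_missing_files` and `REQUIRED_FILES` are hypothetical names introduced here for illustration, and `example_upload` stands in for the `uploaded` dict returned by `files.upload()`.

```python
# Hypothetical helper (not part of google.colab): report which required files are absent.
REQUIRED_FILES = ('train.csv', 'test.csv', 'sample_submission.csv')

def find_missing_files(uploaded_names, required=REQUIRED_FILES):
    """Return the required file names that do not appear in uploaded_names."""
    present = set(uploaded_names)
    return [name for name in required if name not in present]

# Stand-in for the dict returned by files.upload(); the byte contents are placeholders.
example_upload = {'train.csv': b'', 'test.csv': b''}
missing = find_missing_files(example_upload.keys())
print(f'Missing files: {missing}')
```

In the notebook itself, the call would be `find_missing_files(uploaded.keys())`.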

### Method 2: Download from Kaggle API (Alternative)

If you prefer to download directly from Kaggle, upload your `kaggle.json` file first.

In [None]:
# Method 2: Kaggle API (uncomment if using)
# print('Upload your kaggle.json file:')
# uploaded = files.upload()
# 
# # Setup Kaggle API
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json
# 
# # Download competition data
# !kaggle competitions download -c playground-series-s5e7
# !unzip -o playground-series-s5e7.zip
# !rm playground-series-s5e7.zip

print('Kaggle setup ready (if needed)')

In [None]:
# Load the data
try:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    sample_submission = pd.read_csv('sample_submission.csv')
    
    print('Data loaded successfully!')
    print(f'Training data shape: {train_df.shape}')
    print(f'Test data shape: {test_df.shape}')
    print(f'Sample submission shape: {sample_submission.shape}')
except FileNotFoundError as e:
    print(f'Error loading data: {e}')
    print('Please make sure you have uploaded the CSV files or downloaded them via the Kaggle API.')
    raise  # Stop here so later cells do not fail with a confusing NameError

## 3. Exploratory Data Analysis

In [None]:
# Basic dataset information
print('=== Dataset Overview ===')
print('\nTraining Data Info:')
print(train_df.info())

print('\nFirst few rows of training data:')
display(train_df.head())

print('\nBasic statistics:')
display(train_df.describe())

In [None]:
# Missing values analysis
print('=== Missing Values Analysis ===')
missing_train = train_df.isnull().sum()
missing_test = test_df.isnull().sum()

print('\nMissing values in training data:')
print(missing_train[missing_train > 0])

print('\nMissing values in test data:')
print(missing_test[missing_test > 0])

# Visualize missing values
if missing_train.sum() > 0 or missing_test.sum() > 0:
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Training data missing values
    missing_train_pct = (missing_train / len(train_df)) * 100
    missing_train_pct = missing_train_pct[missing_train_pct > 0].sort_values(ascending=False)
    if len(missing_train_pct) > 0:
        missing_train_pct.plot(kind='bar', ax=axes[0])
        axes[0].set_title('Missing Values in Training Data (%)')
        axes[0].set_ylabel('Percentage')
    
    # Test data missing values
    missing_test_pct = (missing_test / len(test_df)) * 100
    missing_test_pct = missing_test_pct[missing_test_pct > 0].sort_values(ascending=False)
    if len(missing_test_pct) > 0:
        missing_test_pct.plot(kind='bar', ax=axes[1])
        axes[1].set_title('Missing Values in Test Data (%)')
        axes[1].set_ylabel('Percentage')
    
    plt.tight_layout()
    plt.show()
else:
    print('No missing values found in the datasets!')

In [None]:
# Target variable analysis
target_col = None
for col in train_df.columns:
    if col.lower() in ['personality', 'target', 'label', 'class']:
        target_col = col
        break

if target_col:
    print(f'=== Target Variable Analysis: {target_col} ===')
    print('\nValue counts:')
    print(train_df[target_col].value_counts())
    
    print('\nPercentage distribution:')
    print(train_df[target_col].value_counts(normalize=True) * 100)
    
    # Plot target distribution
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Count plot
    train_df[target_col].value_counts().plot(kind='bar', ax=axes[0])
    axes[0].set_title(f'Distribution of {target_col}')
    axes[0].set_ylabel('Count')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Pie chart
    train_df[target_col].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%')
    axes[1].set_title(f'Percentage Distribution of {target_col}')
    axes[1].set_ylabel('')
    
    plt.tight_layout()
    plt.show()
else:
    print('Target column not found!')
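
The class distribution also gives a reference point for the accuracy scores reported later: a model that always predicts the most frequent class already reaches the majority-class share. A minimal sketch, assuming toy labels (`majority_baseline_accuracy` is a hypothetical helper, and the label values are illustrative rather than taken from the competition data):

```python
import pandas as pd

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    # value_counts sorts by frequency in descending order, so the first entry is the majority share
    return pd.Series(labels).value_counts(normalize=True).iloc[0]

# Toy labels for illustration only
toy_labels = ['Extrovert', 'Extrovert', 'Extrovert', 'Introvert']
baseline = majority_baseline_accuracy(toy_labels)
print(f'Majority-class baseline accuracy: {baseline:.2f}')
```

In the notebook, `majority_baseline_accuracy(train_df[target_col])` would give the figure to beat.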

In [None]:
# Feature analysis
print('=== Feature Analysis ===')

# Separate numeric and categorical features
numeric_features = train_df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = train_df.select_dtypes(include=['object']).columns.tolist()

# Remove target from features if it's in the lists
if target_col in numeric_features:
    numeric_features.remove(target_col)
if target_col in categorical_features:
    categorical_features.remove(target_col)

print(f'Numeric features ({len(numeric_features)}): {numeric_features}')
print(f'Categorical features ({len(categorical_features)}): {categorical_features}')

# Analyze numeric features
if len(numeric_features) > 0:
    print('\nNumeric features statistics:')
    display(train_df[numeric_features].describe())

# Analyze categorical features
if len(categorical_features) > 0:
    print('\nCategorical features info:')
    for feature in categorical_features:
        unique_count = train_df[feature].nunique()
        print(f'{feature}: {unique_count} unique values')
        if unique_count <= 10:
            print(f'  Values: {train_df[feature].unique().tolist()}')
        print()

In [None]:
# Plot feature distributions
if len(numeric_features) > 0:
    print('=== Numeric Feature Distributions ===')
    
    # Calculate number of rows and columns for subplots
    n_features = len(numeric_features)
    n_cols = min(3, n_features)
    n_rows = (n_features + n_cols - 1) // n_cols
    
    # squeeze=False always returns a 2D array of Axes, so flatten() works for any grid size
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows), squeeze=False)
    axes = axes.flatten()
    
    for i, feature in enumerate(numeric_features):
        train_df[feature].hist(bins=30, ax=axes[i], alpha=0.7)
        axes[i].set_title(f'Distribution of {feature}')
        axes[i].set_xlabel(feature)
        axes[i].set_ylabel('Frequency')
    
    # Hide empty subplots
    for i in range(n_features, len(axes)):
        axes[i].set_visible(False)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Correlation analysis
if len(numeric_features) > 1:
    print('=== Correlation Analysis ===')
    
    # Calculate correlation matrix
    correlation_matrix = train_df[numeric_features].corr()
    
    # Plot correlation heatmap
    plt.figure(figsize=(12, 10))
    mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
    sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', 
                center=0, square=True, fmt='.2f')
    plt.title('Feature Correlation Matrix')
    plt.tight_layout()
    plt.show()
    
    # Find highly correlated features
    high_corr_pairs = []
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            corr_val = correlation_matrix.iloc[i, j]
            if abs(corr_val) > 0.7:
                high_corr_pairs.append((
                    correlation_matrix.columns[i], 
                    correlation_matrix.columns[j], 
                    corr_val
                ))
    
    if high_corr_pairs:
        print('\nHighly correlated feature pairs (|correlation| > 0.7):')
        for feat1, feat2, corr in high_corr_pairs:
            print(f'{feat1} - {feat2}: {corr:.3f}')
    else:
        print('\nNo highly correlated feature pairs found.')
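
The nested loop above can also be written without explicit indexing by masking the correlation matrix to its strict upper triangle. This is an alternative sketch, not a change to the cell above; `high_corr_pairs_vectorized` and the toy columns are names made up for the example.

```python
import numpy as np
import pandas as pd

def high_corr_pairs_vectorized(df, threshold=0.7):
    """Return a Series of correlations, indexed by feature pair, with |r| above threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is excluded
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack().dropna()
    return pairs[pairs.abs() > threshold]

# Toy data: b is an exact multiple of a, c is only weakly related to either
toy_df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 1, 4, 2, 3],
})
toy_pairs = high_corr_pairs_vectorized(toy_df)
print(toy_pairs)
```

In the notebook, the same function could be applied to `train_df[numeric_features]`.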

## 4. Feature Engineering

Since we're in Colab, we'll implement the feature engineering directly in this notebook.

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

class FeatureEngineer:
    def __init__(self):
        self.scaler = StandardScaler()
        self.label_encoder = LabelEncoder()
        self.preprocessor = None
        self.feature_selector = None
        self.feature_names = None
        
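    def preprocess_data(self, X, y=None):
        """Fit the preprocessor on X and return the transformed features.

        If y is given, also return the encoded target (label-encoded when it is of object dtype).
        """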
        # Separate numeric and categorical features
        numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
        categorical_features = X.select_dtypes(include=['object']).columns.tolist()
        
        # Create preprocessing pipelines
        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ])
        
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ])
        
        # Combine preprocessing steps
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_features),
                ('cat', categorical_transformer, categorical_features)
            ])
        
        # Fit and transform features
        X_processed = self.preprocessor.fit_transform(X)
        
        # Get feature names
        feature_names = []
        feature_names.extend(numeric_features)
        if categorical_features:
            cat_feature_names = self.preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)
            feature_names.extend(cat_feature_names)
        self.feature_names = feature_names
        
        # Process target variable if provided
        if y is not None:
            if y.dtype == 'object':
                y_processed = self.label_encoder.fit_transform(y)
            else:
                y_processed = y.values
            return X_processed, y_processed
        
        return X_processed
    
    def transform_features(self, X):
        """Transform features using fitted preprocessor"""
        if self.preprocessor is None:
            raise ValueError("Preprocessor not fitted. Call preprocess_data first.")
        return self.preprocessor.transform(X)
    
    def select_features(self, X, y, k=50):
        """Select top k features"""
        self.feature_selector = SelectKBest(score_func=f_classif, k=min(k, X.shape[1]))
        X_selected = self.feature_selector.fit_transform(X, y)
        return X_selected

print('FeatureEngineer class defined successfully!')

In [None]:
# Initialize feature engineer
feature_engineer = FeatureEngineer()

# Preprocess the data
print('=== Feature Engineering ===')

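# Separate features and target
if target_col:
    # Exclude a row-identifier column, if present, so it is not scaled and used as a numeric feature.
    # test_df can keep it: the ColumnTransformer selects columns by name and ignores the rest.
    id_cols = [col for col in train_df.columns if col.lower() == 'id']
    X = train_df.drop(columns=[target_col] + id_cols)
    y = train_df[target_col]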
    
    print(f'Original feature shape: {X.shape}')
    print(f'Target shape: {y.shape}')
    
    # Apply feature engineering
    X_processed, y_processed = feature_engineer.preprocess_data(X, y)
    
    print(f'Processed feature shape: {X_processed.shape}')
    print(f'Processed target shape: {y_processed.shape}')
    
    # Process test data
    X_test_processed = feature_engineer.transform_features(test_df)
    print(f'Processed test shape: {X_test_processed.shape}')
    
    # Feature selection
    X_selected = feature_engineer.select_features(X_processed, y_processed, k=50)
    print(f'Selected features shape: {X_selected.shape}')
else:
    print('Cannot proceed without target column!')

## 5. Model Training and Evaluation

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
import lightgbm as lgb

print('Model libraries imported successfully!')

In [None]:
# Split data for training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_selected, y_processed, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_processed
)

print(f'Training set shape: {X_train.shape}')
print(f'Validation set shape: {X_val.shape}')
print('Training target distribution:')
print(pd.Series(y_train).value_counts(normalize=True))
print('\nValidation target distribution:')
print(pd.Series(y_val).value_counts(normalize=True))

In [None]:
# Define models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
    'LightGBM': lgb.LGBMClassifier(n_estimators=100, random_state=42, verbose=-1),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000)
}

print('Models defined successfully!')

In [None]:
# Perform cross-validation
print('=== Cross-Validation Results ===')
cv_results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    print(f'\nEvaluating {name}...')
    scores = cross_val_score(model, X_selected, y_processed, cv=cv, scoring='accuracy')
    cv_results[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }
    print(f'{name}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})')

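# Display results
results_df = pd.DataFrame(cv_results).T
# The transposed frame has object dtype (it also holds the per-fold score arrays), so cast before sorting
results_df[['mean', 'std']] = results_df[['mean', 'std']].astype(float)
results_df = results_df.sort_values('mean', ascending=False)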
print('\n=== Final CV Results ===')
display(results_df[['mean', 'std']])
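
One caveat about the scores above: scaling and feature selection were fitted on the full training set before cross-validation, so each validation fold has already influenced the preprocessing, which can make the estimates slightly optimistic. A common remedy is to wrap those steps in a `Pipeline` so they are refitted inside every fold. The sketch below uses synthetic data from `make_classification` rather than the competition data, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=42
)

# Scaling and selection are refitted on the training part of each fold
leak_free_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('model', LogisticRegression(max_iter=1000)),
])

toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy_scores = cross_val_score(leak_free_pipeline, X_toy, y_toy, cv=toy_cv, scoring='accuracy')
print(f'Leak-free CV accuracy: {toy_scores.mean():.4f} (+/- {toy_scores.std() * 2:.4f})')
```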

In [None]:
# Train the best model
best_model_name = results_df.index[0]
best_model = models[best_model_name]

print(f'=== Training Best Model: {best_model_name} ===')
best_model.fit(X_train, y_train)

# Make predictions on validation set
y_pred = best_model.predict(X_val)

# Evaluate performance
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation Accuracy: {accuracy:.4f}')

print('\nClassification Report:')
print(classification_report(y_val, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_val, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

In [None]:
# Create ensemble model
print('=== Creating Ensemble Model ===')

# Select top 3 models
top_3_models = results_df.head(3)
ensemble_models = []

for model_name in top_3_models.index:
    ensemble_models.append((model_name.lower().replace(' ', '_'), models[model_name]))

# Create voting classifier
ensemble_model = VotingClassifier(
    estimators=ensemble_models,
    voting='hard'
)

# Train ensemble
ensemble_model.fit(X_train, y_train)

# Evaluate ensemble on validation set
y_pred_ensemble = ensemble_model.predict(X_val)
ensemble_accuracy = accuracy_score(y_val, y_pred_ensemble)

print(f'Ensemble Validation Accuracy: {ensemble_accuracy:.4f}')
print(f'Best Single Model Accuracy: {accuracy:.4f}')
print(f'Improvement: {ensemble_accuracy - accuracy:.4f}')
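
The ensemble above uses hard voting, where each model casts one vote per sample. A possible variation is soft voting, which averages the predicted class probabilities instead and therefore requires every estimator to implement `predict_proba`. The sketch below runs on synthetic data with three scikit-learn models chosen for illustration; it is not the notebook's selected top three, and whether soft voting helps on the competition data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed training data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_demo_train, X_demo_val, y_demo_train, y_demo_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

soft_ensemble = VotingClassifier(
    estimators=[
        ('random_forest', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('gradient_boosting', GradientBoostingClassifier(n_estimators=50, random_state=42)),
        ('logistic_regression', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',  # average predicted probabilities instead of counting votes
)
soft_ensemble.fit(X_demo_train, y_demo_train)

demo_proba = soft_ensemble.predict_proba(X_demo_val)
demo_pred = soft_ensemble.predict(X_demo_val)
print(f'Soft-voting validation accuracy: {(demo_pred == y_demo_val).mean():.4f}')
```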

## 6. Final Predictions

In [None]:
# Make final predictions on test set
print('=== Generating Final Predictions ===')

# Use the better performing model
final_model = ensemble_model if ensemble_accuracy > accuracy else best_model
model_name = 'Ensemble' if ensemble_accuracy > accuracy else f'Best Single Model ({best_model_name})'

print(f'Using {model_name} for final predictions')

# Apply feature selection to test data
X_test_selected = feature_engineer.feature_selector.transform(X_test_processed)

# Generate predictions
test_predictions = final_model.predict(X_test_selected)

# Convert predictions back to original labels if needed
if hasattr(feature_engineer.label_encoder, 'classes_'):
    test_predictions_labels = feature_engineer.label_encoder.inverse_transform(test_predictions)
else:
    test_predictions_labels = test_predictions

print(f'Generated {len(test_predictions_labels)} predictions')
print('Prediction distribution:')
print(pd.Series(test_predictions_labels).value_counts())

In [None]:
# Create submission file
submission = sample_submission.copy()
submission[target_col] = test_predictions_labels

print('Submission preview:')
display(submission.head(10))

print(f'\nSubmission shape: {submission.shape}')
print('\nSubmission target distribution:')
print(submission[target_col].value_counts(normalize=True))

# Save submission
submission.to_csv('submission.csv', index=False)
print('\nSubmission saved as submission.csv')

# Download the submission file
files.download('submission.csv')
print('Submission file downloaded!')
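
A quick sanity check against the sample submission can catch formatting problems before uploading to Kaggle. The function below is a sketch: `check_submission` is a hypothetical helper, and the `id` and `Personality` column names in the toy frames are assumptions made for the example.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of problems found when comparing a submission to the sample file."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append('column names or order differ from the sample')
    if len(submission) != len(sample):
        problems.append('row count differs from the sample')
    if submission.isnull().any().any():
        problems.append('submission contains missing values')
    return problems

# Toy frames for illustration only
toy_sample = pd.DataFrame({'id': [101, 102, 103], 'Personality': ['Extrovert'] * 3})
good_submission = toy_sample.copy()
good_submission['Personality'] = ['Introvert', 'Extrovert', 'Extrovert']
bad_submission = good_submission.iloc[:2]  # one row short

print(check_submission(good_submission, toy_sample))
print(check_submission(bad_submission, toy_sample))
```

In the notebook, `check_submission(submission, sample_submission)` should return an empty list before the file is saved.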

## Summary

This Google Colab notebook provided a complete analysis pipeline for the Kaggle Introvert vs Extrovert classification competition:

### Key Features:
- **Cloud-optimized**: Designed for the Google Colab environment
- **Data Upload**: Two ways to load the competition data (direct upload or the Kaggle API)
- **Complete EDA**: Missing-value, target, distribution, and correlation analysis with visualizations
- **Feature Engineering**: Built-in imputation, scaling, one-hot encoding, and feature selection
- **Model Training**: Five algorithms compared with 5-fold stratified cross-validation
- **Ensemble Methods**: Hard-voting ensemble of the top three models, used only if it beats the best single model on the validation set
- **Easy Download**: Automatic download of the submission file

### Results:
- Trained and evaluated multiple machine learning models
- Built a voting ensemble and compared it against the best single model
- Generated final predictions ready for Kaggle submission

### Next Steps:
1. Download the `submission.csv` file
2. Upload to Kaggle competition
3. Experiment with different hyperparameters or feature engineering techniques
4. Try advanced ensemble methods or neural networks
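
As a starting point for the hyperparameter experiments mentioned in step 3, a small grid search could look like the sketch below. It runs on synthetic data, and the grid values are arbitrary placeholders rather than recommended settings for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the processed training data
X_tune, y_tune = make_classification(n_samples=200, n_features=8, random_state=42)

# Placeholder grid: two values per parameter gives four candidate settings
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring='accuracy',
)
search.fit(X_tune, y_tune)

print(f'Best parameters: {search.best_params_}')
print(f'Best CV accuracy: {search.best_score_:.4f}')
```

With `refit=True` (the default), `search.best_estimator_` is retrained on all the data passed to `fit` and can be used directly for prediction.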

**Happy Kaggling! 🚀**