# Give Me Some Credit — Baseline Submission Tutorial

This notebook demonstrates how to create a **minimal baseline** submission for the [Give Me Some Credit](https://www.kaggle.com/c/GiveMeSomeCredit) Kaggle competition.

**Steps:**
1. Load the training and test data
2. Apply minimal preprocessing (fill missing values)
3. Train a single Random Forest model
4. Predict on test data
5. Create a submission CSV file

## 0. Setup

Download the dataset from [Kaggle](https://www.kaggle.com/c/GiveMeSomeCredit/data) and place `cs-training.csv` and `cs-test.csv` in the same directory as this notebook (or update the paths below).

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
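Before loading anything, it can help to confirm that both CSV files are where the notebook expects them. The snippet below is a minimal sketch: `DATA_DIR` and `missing_data_files` are hypothetical names introduced here for illustration, and the directory is assumed to be the notebook's own folder.

```python
from pathlib import Path

# Hypothetical data directory; "." means the folder this notebook runs in.
DATA_DIR = Path(".")
REQUIRED_FILES = ["cs-training.csv", "cs-test.csv"]


def missing_data_files(data_dir, names=REQUIRED_FILES):
    """Return the required file names that are not present in data_dir."""
    return [name for name in names if not (Path(data_dir) / name).is_file()]


missing = missing_data_files(DATA_DIR)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All data files found.")
```

If a file is reported missing, either move it next to the notebook or point `DATA_DIR` at the download folder and build the `pd.read_csv` paths from it.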

## 1. Load Data

In [2]:
train = pd.read_csv('cs-training.csv', index_col=0)
test = pd.read_csv('cs-test.csv', index_col=0)

print('Training shape:', train.shape)
print('Test shape:', test.shape)
train.head()

Training shape: (150000, 11)
Test shape: (101503, 11)


| | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 3 | 0 | 0.65818 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 4 | 0 | 0.23381 | 30 | 0 | 0.03605 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |


In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 150000 entries, 1 to 150000
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      150000 non-null  int64  
 1   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 2   age                                   150000 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 4   DebtRatio                             150000 non-null  float64
 5   MonthlyIncome                         120269 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 7   NumberOfTimes90DaysLate               150000 non-null  int64  
 8   NumberRealEstateLoansOrLines          150000 non-null  int64  
 9   NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 10  NumberOfDependents                    146076 non-null  float64
dtypes: float64(4), int64(7)

In [4]:
# Check missing values
train.isnull().sum()

SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3924
dtype: int64
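Raw counts are easier to judge as percentages: 29,731 of 150,000 rows is about 19.8% missing for `MonthlyIncome`, and 3,924 rows is about 2.6% for `NumberOfDependents`. The sketch below shows one way to compute such shares; it uses a small made-up frame (`toy`) in place of `train` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "MonthlyIncome": [9120.0, np.nan, 3042.0, np.nan],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0],
    "age": [45, 40, 38, 30],
})

# isnull().mean() is the fraction of missing rows per column.
missing_pct = (toy.isnull().mean() * 100).round(1)

# Show only the columns that actually have gaps, largest share first.
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```

Replacing `toy` with `train` should give the per-column percentages for the real data.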

## 2. Minimal Preprocessing

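We fill missing values with each column's median. The medians are computed on the training data and reused for the test set, so both sets are imputed consistently. There is no feature engineering and no outlier handling; this is intentionally minimal.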

In [5]:
target = 'SeriousDlqin2yrs'
features = [col for col in train.columns if col != target]

X = train[features].copy()
y = train[target].copy()
X_test = test[features].copy()

# Fill missing values with the training-set median (applied to both train and test)
for col in features:
    median_val = X[col].median()
    X[col] = X[col].fillna(median_val)
    X_test[col] = X_test[col].fillna(median_val)

print('Missing values in X after filling:', X.isnull().sum().sum())
print('Missing values in X_test after filling:', X_test.isnull().sum().sum())

Missing values in X after filling: 0
Missing values in X_test after filling: 0
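The same "fit on train, apply to test" pattern can also be written with scikit-learn's `SimpleImputer`. This is an alternative sketch, not what the notebook does above; the frames `toy_train` and `toy_test` are made up so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-ins for X and X_test.
toy_train = pd.DataFrame({
    "MonthlyIncome": [1000.0, 3000.0, np.nan, 5000.0],
    "NumberOfDependents": [0.0, np.nan, 2.0, 1.0],
})
toy_test = pd.DataFrame({
    "MonthlyIncome": [np.nan, 4000.0],
    "NumberOfDependents": [np.nan, 3.0],
})

# Learn the medians from the training frame only.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(toy_train),
    columns=toy_train.columns, index=toy_train.index,
)

# Reuse the training medians on the test frame.
test_filled = pd.DataFrame(
    imputer.transform(toy_test),
    columns=toy_test.columns, index=toy_test.index,
)

print("Learned medians:", imputer.statistics_)
```

An imputer object is convenient when the fill step later needs to sit inside a scikit-learn `Pipeline`, for example so that cross-validation recomputes the medians per fold.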


## 3. Train a Simple Model

We train a Random Forest with 100 trees and otherwise default hyperparameters. First, we hold out a stratified 20% validation split for a quick local estimate of AUC-ROC, the metric this competition is scored on.

In [6]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Evaluate on validation set
val_proba = model.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_proba)
print(f'Validation AUC-ROC: {auc:.4f}')

Validation AUC-ROC: 0.8418
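A single 80/20 split gives one AUC estimate, which can move with the random seed. A common complement is stratified k-fold cross-validation. The sketch below assumes an imbalanced synthetic dataset from `make_classification` in place of `X` and `y`, so its score says nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for (X, y); not the competition data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=42
)

# Five stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_cv = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# One AUC-ROC value per fold.
scores = cross_val_score(model_cv, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"CV AUC-ROC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Passing the notebook's `X` and `y` instead of the synthetic arrays should give a mean and spread to compare against the single-split estimate.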


## 4. Retrain on Full Training Data and Predict

In [7]:
# Retrain on all training data for the final submission
model_full = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model_full.fit(X, y)

# Predict probabilities on the test set
test_proba = model_full.predict_proba(X_test)[:, 1]
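The `[:, 1]` index relies on the positive class being the second column of `predict_proba`. The columns follow the order of the fitted model's `classes_` attribute, so this can be checked rather than assumed. A small sketch on made-up data (the names `clf`, `X_demo`, and `y_demo` are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up binary problem with labels 0 and 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# predict_proba columns are ordered like clf.classes_.
pos_col = list(clf.classes_).index(1)
proba = clf.predict_proba(X_demo[:5])
pos_proba = proba[:, pos_col]

print("classes_:", clf.classes_, "-> positive class column:", pos_col)
print("P(class 1) for the first rows:", pos_proba)
```

With 0/1 labels such as `SeriousDlqin2yrs`, the positive column is expected at index 1, which matches the slicing used above.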

## 5. Create Submission File

The Kaggle competition expects a CSV with two columns: `Id` and `Probability`, where `Probability` is the predicted probability that `SeriousDlqin2yrs` is 1.

In [8]:
submission = pd.DataFrame({
    'Id': test.index,
    'Probability': test_proba
})

submission.to_csv('submission.csv', index=False)
print('Submission file created: submission.csv')
submission.head(10)

Submission file created: submission.csv


| | Id | Probability |
|---|---|---|
| 0 | 1 | 0.02 |
| 1 | 2 | 0.04 |
| 2 | 3 | 0.0 |
| 3 | 4 | 0.04 |
| 4 | 5 | 0.18 |
| 5 | 6 | 0.04 |
| 6 | 7 | 0.08 |
| 7 | 8 | 0.11 |
| 8 | 9 | 0.0 |
| 9 | 10 | 0.344762 |
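A quick format check before uploading can catch a wrong column name or a stray NaN. The helper below is a hypothetical sketch (`check_submission` is not part of any library); it is demonstrated on a tiny made-up frame, and for the real file the expected row count would be 101,503, matching the test shape printed earlier.

```python
import pandas as pd


def check_submission(df, expected_rows):
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    if list(df.columns) != ["Id", "Probability"]:
        # Without the right columns the remaining checks cannot run.
        return [f"unexpected columns: {list(df.columns)}"]
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["Id"].duplicated().any():
        problems.append("duplicate Id values")
    if df["Probability"].isnull().any():
        problems.append("missing Probability values")
    if not df["Probability"].dropna().between(0, 1).all():
        problems.append("Probability values outside [0, 1]")
    return problems


# Made-up frame in the same layout as the real submission.
demo = pd.DataFrame({"Id": [1, 2, 3], "Probability": [0.02, 0.04, 0.0]})
print(check_submission(demo, expected_rows=3) or "No problems found.")
```

For the real file, one option is `check_submission(pd.read_csv('submission.csv'), expected_rows=len(test))`.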


## 6. Submit to Kaggle — Step-by-Step Guide

Now upload `submission.csv` to the Kaggle competition page.

### Step 1: Click "Late Submission"

Go to [https://www.kaggle.com/c/GiveMeSomeCredit](https://www.kaggle.com/c/GiveMeSomeCredit) and click the **"Late Submission"** button in the top-right corner (since the competition has ended):

![Step 1 — Click Late Submission](ls0.PNG)

### Step 2: Upload your submission file

In the dialog that appears, drag and drop your `submission.csv` file (or click **Browse Files**), then click **Submit**:

![Step 2 — Upload and submit](ls1.PNG)

### Step 3: Check your score

After a few seconds, your **Private Score** and **Public Score** will appear on the Submissions page. Take a **screenshot** of this page for your project submission:

![Step 3 — Check your score](ls2.PNG)

> **Baseline result:** Private Score **0.84337**, Public Score **0.83485**

That's it! You can improve upon this baseline by:
- Handling missing values and outliers more carefully
- Engineering new features
- Trying different models (XGBoost, LightGBM, SVM, etc.)
- Tuning hyperparameters
- Stacking or ensembling several models