### Initializing the dataset and taking a peek at the contents
We want to load the data from Kaggle and then view its contents.
`kagglehub.dataset_download()` (from the kagglehub library) downloads and caches datasets locally using a consistent folder structure, but the cache location is a bit hidden by default.
So we will move the files to our working directory instead.

Later this will make it easier to upload the dataset and the notebook files to GitHub.

#### Information about the dataset
I found the current dataset on Kaggle, and I am not sure about the full context behind it. It appears to contain salary and demographic data for individuals. The general steps in this analysis can be reproduced on another dataset if I find one with similar data points and a better description.

In [42]:
import kagglehub
import shutil
import os
import pandas as pd
import numpy as np
import json
from ydata_profiling import ProfileReport
#modeling and visualization libraries 
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error


In [8]:
with open("config.json") as f:
    config = json.load(f)

In [9]:
# === CONFIG ===
dataset_id = config["dataset"]["source"]
destination_dir = os.getcwd()
expected_filename = "adult11.csv"  
expected_path = os.path.join(destination_dir, expected_filename)

# === Check if file already exists ===
if os.path.exists(expected_path):
    print(f"✅ Dataset already exists at: {expected_path}")
else:
    print("📥 Dataset not found. Downloading...")

    # Download dataset
    source_path = kagglehub.dataset_download(dataset_id)
    print("Downloaded to:", source_path)

    # Move the first .csv file
    for file in os.listdir(source_path):
        if file.endswith(".csv"):
            full_source = os.path.join(source_path, file)
            shutil.move(full_source, expected_path)
            print(f"Moved {file} to {destination_dir}")
            break

# === Load the CSV ===
income_df = pd.read_csv(expected_path)
print("✅ Data loaded.")

✅ Dataset already exists at: C:\Users\shaha\Data-Science\adult11.csv
✅ Data loaded.


In [10]:
income_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,salary
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [11]:
#Check the data type of each column.
income_df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
gender            object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object
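
The `head()` output above shows `?` in the workclass and occupation columns, which suggests that missing values are stored as a `?` placeholder rather than as NaN. If that is the case, `isna()` alone would report no missing values. A minimal sketch of counting both, using a small toy frame (`sample_df`, illustrative values only) in place of `income_df`:

```python
import pandas as pd

# Toy frame standing in for income_df; the values are illustrative only.
sample_df = pd.DataFrame({
    "age": [25, 38, 18],
    "workclass": ["Private", "Private", "?"],
    "occupation": ["Machine-op-inspct", "Farming-fishing", "?"],
})

# Count true NaN values per column.
na_counts = sample_df.isna().sum()

# Count "?" placeholders per column (assumed here to mark missing values).
placeholder_counts = sample_df.eq("?").sum()

print(na_counts)
print(placeholder_counts)
```

Running the same two lines on `income_df` would show whether any columns need the placeholder converted to NaN before modeling.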

In [12]:
profile = ProfileReport(income_df, title="EDA Report", explorative=True)
profile.to_notebook_iframe()


#### Insights for EDA and Modeling direction

From the EDA I see a few problems in our data that may impact modeling.
A few variables are highly correlated with each other.
The native-country variable is highly imbalanced.
Two variables, capital-gain and capital-loss, are mostly zeros.
There are also a few duplicate rows, about 0.1% of the data, which I will drop.
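
These observations come from the profiling report, but they can also be checked directly with pandas. The sketch below uses a small made-up frame (`toy_df`, not the real data) to show one way to measure the share of duplicate rows, the share of zeros, and the class imbalance:

```python
import pandas as pd

# Made-up rows for illustration; row 4 repeats row 1.
toy_df = pd.DataFrame({
    "education-num": [7, 9, 12, 10, 9, 10],
    "capital-gain": [0, 0, 0, 7688, 0, 0],
    "capital-loss": [0, 0, 0, 0, 0, 0],
    "native-country": ["United-States"] * 5 + ["Mexico"],
})

# Share of rows that exactly duplicate an earlier row.
duplicate_share = toy_df.duplicated().mean()

# Share of zeros in each capital column.
zero_share = (toy_df[["capital-gain", "capital-loss"]] == 0).mean()

# Relative frequency of each country, largest first.
country_share = toy_df["native-country"].value_counts(normalize=True)

print(f"duplicate share: {duplicate_share:.3f}")
print(zero_share)
print(country_share)
```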

There will be some steps needed to preprocess the data before modeling:

- First, I will remove one of the highly correlated variables.
- Second, I will take the log (as log(1 + x), so that zeros stay defined) of the capital-gain, capital-loss, fnlwgt, hours-per-week, and age variables.
- Lastly, I will turn the native-country feature into a binary feature categorizing the countries as US or Other.
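
For the first step, the column dropped later in the notebook is education-num, so the correlated pair is presumably education and education-num. One way to confirm that one of them is redundant is to check that each education level maps to exactly one code and vice versa. A minimal sketch, using the five rows shown in the `head()` output as toy data:

```python
import pandas as pd

# Values copied from the head() output above.
toy_df = pd.DataFrame({
    "education": ["11th", "HS-grad", "Assoc-acdm", "Some-college", "Some-college"],
    "education-num": [7, 9, 12, 10, 10],
})

# Number of distinct codes per education level, and levels per code.
codes_per_level = toy_df.groupby("education")["education-num"].nunique()
levels_per_code = toy_df.groupby("education-num")["education"].nunique()

# The mapping is one-to-one if both counts are always 1.
is_one_to_one = bool((codes_per_level == 1).all() and (levels_per_code == 1).all())
print(is_one_to_one)
```

If the same check holds on the full dataset, dropping either column loses no information.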

I will develop the model first with tree-based classifiers and see how they perform. I will try the LightGBM (LGBM) and Random Forest models and compare their performance. Tree-based models tend to cope well with sparse, mostly-zero features and with categorical variables.


In [23]:
income_df = income_df.drop_duplicates()
income_df = income_df.drop(columns=['education-num'])

In [38]:
cols_to_transform = ['capital-gain', 'capital-loss', 'age', 'fnlwgt','hours-per-week']

for col in cols_to_transform:
    # log1p computes log(1 + x), which is defined for the zero-heavy columns
    income_df[f'{col}_log'] = np.log1p(income_df[col])
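
The last preprocessing step in the list above, turning native-country into a US/Other flag, could be sketched as follows. The column name `native-country_is_us` is a hypothetical choice, and the toy frame stands in for `income_df`. Note that under this approach any `?` placeholder falls into the Other group, which may or may not be what is wanted.

```python
import pandas as pd

# Toy frame standing in for income_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "native-country": ["United-States", "Mexico", "?", "United-States"],
})

# 1 for United-States, 0 for every other value (including the "?" placeholder).
# The new column name is an assumption made for this sketch.
toy_df["native-country_is_us"] = (
    toy_df["native-country"] == "United-States"
).astype(int)

print(toy_df)
```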