# Exploratory Data Analysis and Preprocessing of US Census Data

## Table of Contents
This notebook covers the creation of the ETL pipeline and the resulting cleaned dataset.

**Part 1: ETL (Extract-Transform-Load):**<br>
* [1. Data Wrangling](#1.) - import libraries, load the dataset and perform first cleaning steps
* [2. EDA](#2.) - data exploration with statistics and visualisations, plus initial general feature engineering
* [3. Further Feature Engineering](#3.) - additional feature preparation required for our classification task on the US census data
* [4. Load](#4.) - we store the modified dataset as a _preproc_census.csv_ file. The file itself is not committed to the associated GitHub repository, where DVC is used for versioning.

---

## Part 1: ETL

<a id='1.'></a>
### 1. Data Wrangling
In this section, after the libraries are imported, the dataset is loaded, a brief overview of its structure is given, and its general properties are checked. Cleaning starts where necessary.

In [61]:
#
#  imports
#
import re
import sys
import numpy as np
import pandas as pd
import logging
import logging.config
import yaml

from ydata_profiling import ProfileReport
import scipy.stats as st

# for ETL part
import sklearn
from sklearn import set_config  # necessary to display estimator pipeline diagrams
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder
from sklearn import metrics


# for visualisations
from IPython.display import display # Allows use of display() for DataFrames
import seaborn as sns
sns.set('notebook')
import matplotlib.pyplot as plt
# plotted graphics appear in the notebook just after current cell
%matplotlib inline  

# suppress matplotlib user warnings
import warnings
warnings.filterwarnings("ignore", 
                        category = Warning,
                        module = "matplotlib")

2023-08-21 13:28:45 - matplotlib.pyplot - DEBUG - Loaded backend module://matplotlib_inline.backend_inline version unknown.


In [55]:
# 
# config file
#
CONFIG_FILE = '../../config.yml'

In [56]:
# set logging properties
with open(CONFIG_FILE, 'r') as f:
    try:
        config = yaml.safe_load(f.read())
        logging.config.dictConfig(config['logging'])
    except yaml.YAMLError as exc:
        print(f'Cannot create logger: {exc}')

# get logger (we use staging, knowing that file&console concepts work)
logger = logging.getLogger('staging')
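
The content of `config.yml` is not shown in this notebook. As a rough sketch, the `logging` section passed to `dictConfig` could look like the dictionary below. The handler and formatter names are assumptions, the format string is only inferred from the log lines printed in this notebook, and the file handler that the real configuration apparently contains is left out so that the sketch does not write any files.

```python
import logging
import logging.config

# Hypothetical stand-in for the 'logging' section of config.yml.
# Only the logger name 'staging' is taken from the notebook;
# everything else is an assumption for illustration.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {
            "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
            "datefmt": "%Y-%m-%d %H:%M:%S",
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": "DEBUG",
            "formatter": "simple",
            "stream": "ext://sys.stdout",
        }
    },
    "loggers": {
        "staging": {
            "level": "DEBUG",
            "handlers": ["console"],
            "propagate": False,
        }
    },
}

# Apply the configuration and fetch the named logger
logging.config.dictConfig(LOGGING_CONFIG)
sketch_logger = logging.getLogger("staging")
sketch_logger.info("logger configured")
```

In a YAML file the same structure would be nested under a top-level `logging:` key, which is what `config['logging']` reads.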

In [45]:
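# If necessary during coding:
# which versions of pandas and other libraries are we using?
# which encoding is in use?
# Note: pd.show_versions() returns None, so the log message below ends in "None".
# Given a string argument, it writes the version info as JSON to that path.
logger.info(f"installed system versions: \
            {pd.show_versions(str(config['logging']['handlers']['file']['filename']))}")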

2023-08-21 11:26:30 - staging - INFO - installed system versions: None
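
The log line above ends in `None` because `pd.show_versions()` does not return the version information as a string. If the versions themselves should appear in the log, one possible alternative is to collect them with the standard library. This is a minimal sketch; the helper name `installed_versions` and the package list are assumptions based on the imports of this notebook.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Return a {name: version} dict; missing packages map to 'not installed'."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Package list is an assumption derived from the notebook's import cell
versions = installed_versions(["pandas", "numpy", "scikit-learn", "ydata-profiling"])
summary = ", ".join(f"{name} {version}" for name, version in versions.items())
print(f"installed system versions: {summary}")
```

The `summary` string could be passed to `logger.info` instead of `print`.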


In [50]:
# load the original US census data from its CSV file (comma separator)
try:
    df_census_orig = pd.read_csv(str(config['etl']['orig_census']))
    logger.info("\n'US census' orig dataset has {} samples with {} features each.".format(*df_census_orig.shape))
except (OSError, pd.errors.ParserError) as exc:
    # catch only load failures instead of using a bare except, and keep the cause
    logger.error(f"US census orig dataset could not be loaded. Is it missing? ({exc})")
    raise

2023-08-21 11:38:27 - staging - INFO - 
'US census' orig dataset has 32561 samples with 15 features each.


In [51]:
df_census_orig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1    workclass       32561 non-null  object
 2    fnlgt           32561 non-null  int64 
 3    education       32561 non-null  object
 4    education-num   32561 non-null  int64 
 5    marital-status  32561 non-null  object
 6    occupation      32561 non-null  object
 7    relationship    32561 non-null  object
 8    race            32561 non-null  object
 9    sex             32561 non-null  object
 10   capital-gain    32561 non-null  int64 
 11   capital-loss    32561 non-null  int64 
 12   hours-per-week  32561 non-null  int64 
 13   native-country  32561 non-null  object
 14   salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
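
The `info()` output shows that every column name except `age` starts with a blank, which suggests the file uses `, ` (comma plus space) as separator. The string values may carry the same leading blank. A minimal sketch of two ways to handle this, using a tiny hypothetical sample rather than the real census file:

```python
import io

import pandas as pd

# Tiny hypothetical sample that mimics a ', ' separated layout
raw = "age, workclass, salary\n39, State-gov, <=50K\n50, Private, >50K\n"

# Option 1: let the parser skip the blank that follows each delimiter
df_a = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Option 2: load as-is, then strip column names and string values
df_b = pd.read_csv(io.StringIO(raw))
df_b.columns = df_b.columns.str.strip()
for col in ["workclass", "salary"]:
    df_b[col] = df_b[col].str.strip()

print(df_a.columns.tolist())
print(df_b.columns.tolist())
```

Option 1 is the shorter route when the file is read only once; option 2 also works on a DataFrame that has already been loaded, such as `df_census_orig`.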


In [52]:
df_census_orig.head(10)

| | age | workclass | fnlgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |


In [57]:
# create profiling report of original census data
profile = ProfileReport(df_census_orig,
                        title=str(config['eda']['orig_report_title']))

In [62]:
profile.to_file(str(config['eda']['orig_census_report']))
logger.info(f"EDA report stored as {config['eda']['orig_census_report']}.")

2023-08-21 13:29:09 - staging - INFO - EDA report stored as ./eda_census_orig_report.html.


In [63]:
df_census_orig.duplicated().sum()

24
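
The count above only tells us how many rows repeat an earlier row. A minimal sketch of how such rows can be inspected and removed, shown on a small hypothetical frame instead of `df_census_orig`:

```python
import pandas as pd

# Toy data for illustration; row 2 repeats row 0
toy = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "workclass": ["State-gov", "Private", "State-gov", "Private"],
})

# Number of rows that are identical to an earlier row
n_dupes = int(toy.duplicated().sum())

# All members of each duplicate group, useful for a visual check
dupe_rows = toy[toy.duplicated(keep=False)]

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(dupe_rows), len(deduped))
```

Whether identical rows should be dropped is a judgment call: in census-style data, two rows with the same values may describe two different people, so inspecting `dupe_rows` before calling `drop_duplicates` is worthwhile.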