# [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/Adult)

The task is to predict whether a person's income exceeds $50K/yr based on census data.
The dataset is also known as the "Census Income" dataset.

**Attributes**

- age
- workclass
  - Represents the employment status of an individual
- fnlwgt
  - Final weight: the number of people the census believes the entry represents
  - People with similar demographic characteristics should have similar weights
    - This only applies within a state.
- education
  - The highest level of education achieved by an individual
- education-num
  - The highest level of education achieved in numerical form
- marital-status
- occupation
- relationship
  - Represents this individual's relationship to others (for example, `Husband` or `Wife`)
- race
- sex
- capital-gain
- capital-loss
- hours-per-week
  - The number of hours an individual reported working per week
- native-country
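To make the meaning of `fnlwgt` concrete, here is a minimal sketch that compares a plain mean with a mean weighted by `fnlwgt`. It uses only the first five rows of the dataset (the ones shown by `adult_df.head()` later in this notebook). Because similar weights are only guaranteed within a state, weighting the whole sample like this is an illustration of the idea, not a recommended estimate.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, restricted to the two columns we need.
sample = pd.DataFrame({
    "fnlwgt": [77516, 83311, 215646, 234721, 338409],
    "hours_per_week": [40, 13, 40, 40, 40],
})

# Plain mean: every row counts once.
unweighted = sample["hours_per_week"].mean()

# Weighted mean: every row counts in proportion to its final weight.
weighted = np.average(sample["hours_per_week"], weights=sample["fnlwgt"])

print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
# unweighted: 34.60, weighted: 37.63
```

The row with 13 hours has a comparatively small weight, so it pulls the weighted mean down less than the plain mean.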

The dataset is provided as a CSV file.

![Adult Data CSV Capture](../image/data_csv_capture.png)

# Data Loading

In [None]:
import pandas as pd

In [None]:
adult_df = pd.read_csv(
    "../data/adult_data.csv"
)

In [None]:
adult_df.head()

<div class="alert alert-block alert-danger">
Something went wrong: the column names are not correct.
</div>

Ah, the raw data has no header row. Let's pass the column names explicitly.

In [14]:
column_names = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education_num',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country',
    'income',
]

adult_df = pd.read_csv(
    "../data/adult_data.csv",
    names=column_names
)

In [15]:
adult_df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
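The header problem can be reproduced without the data file. The sketch below feeds `pd.read_csv()` a shortened, in-memory sample (only the first three fields of the first two rows, an assumption made to keep the example small) and compares the default behaviour with `header=None` and with `names=`.

```python
import io

import pandas as pd

# Two header-less lines, shortened to the first three fields of each row.
raw = (
    "39, State-gov, 77516\n"
    "50, Self-emp-not-inc, 83311\n"
)

# Default: the first line is consumed as the header, so one data row is lost.
without_names = pd.read_csv(io.StringIO(raw))
print(list(without_names.columns), len(without_names))

# header=None keeps both rows but labels the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(list(no_header.columns), len(no_header))

# names=... keeps both rows and gives the columns readable labels.
with_names = pd.read_csv(io.StringIO(raw), names=["age", "workclass", "fnlwgt"])
print(list(with_names.columns), len(with_names))
```

Passing `names` without a `header` argument makes pandas treat the file as header-less, which is why the notebook does not need `header=None` as well.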


# Essential Check

In [16]:
adult_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [None]:
adult_df.describe()

In [None]:
adult_df.describe(exclude='number')

In [None]:
adult_df['workclass'].unique()

<div class="alert alert-block alert-danger">
We encountered another problem: the values in string columns start with an unnecessary space.
</div>

We can fix it with the `skipinitialspace` parameter of the `pd.read_csv()` function, which skips spaces that follow the delimiter.

In [None]:
adult_df = pd.read_csv(
    "../data/adult_data.csv",
    names=column_names,
    skipinitialspace=True
)

adult_df['workclass'].unique()
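As a small self-contained check, the sketch below reproduces the leading-space problem on an in-memory sample (again shortened to three fields per row, which is an assumption for brevity) and shows two ways to remove it: `skipinitialspace=True` at load time, or `str.strip()` on the non-numeric columns afterwards.

```python
import io

import pandas as pd

# Shortened sample in the file's layout: a space follows every comma.
raw = "39, State-gov, 77516\n50, Self-emp-not-inc, 83311\n"
names = ["age", "workclass", "fnlwgt"]

# Without any option, the space stays attached to the string values.
padded = pd.read_csv(io.StringIO(raw), names=names)
print(padded["workclass"].tolist())  # [' State-gov', ' Self-emp-not-inc']

# Option 1: drop the space while parsing.
clean = pd.read_csv(io.StringIO(raw), names=names, skipinitialspace=True)
print(clean["workclass"].tolist())  # ['State-gov', 'Self-emp-not-inc']

# Option 2: strip the text columns after loading.
stripped = padded.copy()
for col in stripped.columns:
    if not pd.api.types.is_numeric_dtype(stripped[col]):
        stripped[col] = stripped[col].str.strip()
print(stripped["workclass"].tolist())  # ['State-gov', 'Self-emp-not-inc']
```

Option 1 is shorter when the file is read only once; option 2 is useful when the DataFrame is already loaded or also contains trailing spaces.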

# Notebook Refactoring Recommendations

- Put `import` statements at the top of the notebook
- Keep the cells in order, so that the notebook can be executed from top to bottom without errors
- Follow [PEP 8 - Style Guide for Python Code](https://peps.python.org/pep-0008/)
- Try to split one big notebook into several smaller ones along logical boundaries
- **Extract repeated code to functions**

# Refactoring example: extract repeated code to functions



Original code:

```python
column_names = [
    'age',
    'workclass',
    'fnlwgt',
    ...
]

adult_df = pd.read_csv(
    "../data/adult_data.csv",
    names=column_names,
    skipinitialspace=True
)
```

Improvements:

- Extract the file to read into a `data_file` argument to provide flexibility
- Provide a default file path
- Change `column_names` to a tuple and rename it in all capital letters to mark it as a constant

In [None]:
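def load_adult_data(data_file='../data/adult_data.csv'):
    """Load the Adult data set from `data_file` into a DataFrame."""
    # The raw file has no header row, so the column names are supplied here.
    COLUMN_NAMES = (
        'age',
        'workclass',
        'fnlwgt',
        'education',
        'education_num',
        'marital_status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'capital_gain',
        'capital_loss',
        'hours_per_week',
        'native_country',
        'income',
    )

    # skipinitialspace drops the space that follows each comma in the file.
    return pd.read_csv(
        data_file,
        names=COLUMN_NAMES,
        skipinitialspace=True
    )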

In [None]:
new_adult_df = load_adult_data()
new_adult_df.head()
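One benefit of the `data_file` argument is that the function can be tried out without the real file, because `pd.read_csv()` accepts file-like objects as well as paths. The sketch below repeats the function so that it runs on its own and passes it an `io.StringIO` buffer holding the first row of the dataset; using a buffer this way is an illustration, not part of the original notebook.

```python
import io

import pandas as pd


def load_adult_data(data_file='../data/adult_data.csv'):
    """Load the Adult data set from a path or file-like object."""
    COLUMN_NAMES = (
        'age', 'workclass', 'fnlwgt', 'education', 'education_num',
        'marital_status', 'occupation', 'relationship', 'race', 'sex',
        'capital_gain', 'capital_loss', 'hours_per_week',
        'native_country', 'income',
    )
    return pd.read_csv(data_file, names=COLUMN_NAMES, skipinitialspace=True)


# First row of the dataset, written the way it appears in the raw file.
first_row = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)

sample_df = load_adult_data(io.StringIO(first_row))
print(sample_df.shape)                 # (1, 15)
print(sample_df.loc[0, 'workclass'])   # State-gov
```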