## Data Exploration
In this notebook, we explore the 'Adult' dataset from the UCI Machine Learning Repository. We examine the structure of the dataset, look at the first few rows, check for missing values, and review basic summary statistics.

### Import necessary libraries
Before we start, let's import the libraries we will need.

In [1]:
import pandas as pd

### Load the dataset
Now, let's load the dataset using pandas.

In [12]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
data = pd.read_csv(url, names=columns, header=None, na_values="?", skipinitialspace=True)
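The `na_values` and `skipinitialspace` options matter because each comma in the raw file is followed by a space and unknown values are written as `?`. A minimal sketch on two hand-written rows in the same layout (the rows and the shortened column list are illustrative assumptions, not copied from the file):

```python
import io

import pandas as pd

# Two illustrative rows in the ", "-separated layout; "?" marks a missing value.
raw = (
    "39, State-gov, Bachelors, <=50K\n"
    "54, ?, Some-college, >50K\n"
)
cols = ["age", "workclass", "education", "income"]  # shortened for the sketch

sample = pd.read_csv(
    io.StringIO(raw),
    names=cols,
    header=None,
    na_values="?",          # treat "?" as missing
    skipinitialspace=True,  # drop the space that follows each comma
)

print(sample)
print(sample["workclass"].isna().sum())  # number of missing workclass values
```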

### Initial Data Exploration

Let's start by looking at the first few rows of the dataset. Then, we'll check the dataset's structure, and finally, check for missing values.

In [13]:
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [15]:
data.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64
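Raw counts are easier to judge as a share of rows. A minimal sketch, assuming a hypothetical helper `missing_summary` and a small toy frame standing in for `data`:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(df) * 100).round(2)}
    )
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame for illustration only; in the notebook, pass `data` instead.
toy = pd.DataFrame(
    {
        "age": [39, 50, 38, 53],
        "workclass": ["State-gov", np.nan, "Private", np.nan],
        "occupation": ["Adm-clerical", np.nan, "Sales", "Sales"],
    }
)

print(missing_summary(toy))
# Share of rows with at least one missing value
print(toy.isna().any(axis=1).mean())
```

Applied to the counts above, 'workclass' and 'occupation' are each missing in roughly 5.6% of the 32,561 rows, and 'native-country' in about 1.8%.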

### Basic Statistical Details

Now, let's look at summary statistics for the six numeric columns.

In [16]:
data.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |
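By default, `describe()` summarizes only numeric columns, so the nine text columns do not appear in its output. A minimal sketch of the categorical counterpart, using a toy frame in place of `data`:

```python
import pandas as pd

# Toy frame for illustration only; in the notebook, use `data` instead.
toy = pd.DataFrame(
    {
        "age": [39, 50, 38, 53],
        "workclass": ["State-gov", "Private", "Private", "Private"],
        "sex": ["Male", "Male", "Female", "Male"],
    }
)

# Excluding numeric columns leaves the text columns, for which describe()
# reports count, unique, top (most frequent value), and freq.
categorical_summary = toy.describe(exclude=["number"])
print(categorical_summary)
```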


### Check Unique Values for Categorical Variables

Let's check the unique values for each of the categorical variables in the dataset. This will help us understand what categories exist within these variables.

In [17]:
for column in data.select_dtypes(include=['object']).columns:
    print("Column: ", column)
    print(data[column].unique())

Column:  workclass
['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' nan
 'Self-emp-inc' 'Without-pay' 'Never-worked']
Column:  education
['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']
Column:  marital-status
['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated' 'Married-AF-spouse' 'Widowed']
Column:  occupation
['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' nan
 'Protective-serv' 'Armed-Forces' 'Priv-house-serv']
Column:  relationship
['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']
Column:  race
['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']
Column:  sex
['Male' 'Female']
Column:  native-country
['United-States' 'Cuba' 'Jamaica' 'I

### Check the Balance of Categorical Variables

Let's check the balance of our categorical variables. This can help us understand if there is any significant imbalance that might affect our model.

In [20]:
for column in data.select_dtypes(include=['object']).columns:
    print("Column: ", column)
    print(data[column].value_counts())

Column:  workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
Column:  education
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64
Column:  marital-status
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital-status, dtype: int64
Column:  occupation
Prof-specialty       4140
Craft-repair        

### Check the Distribution of the Target Variable

Next, we should check the distribution of the target variable ('income'). This will help us understand how balanced the classes are in our classification task.

In [18]:
data['income'].value_counts()

<=50K    24720
>50K      7841
Name: income, dtype: int64
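Proportions make the imbalance easier to read than raw counts. A minimal sketch that rebuilds the counts above as a Series (so it runs without downloading the file) and converts them to shares; on the full frame, `data['income'].value_counts(normalize=True)` should give the same figures:

```python
import pandas as pd

# Counts copied from the output above.
income_counts = pd.Series({"<=50K": 24720, ">50K": 7841}, name="income")

# Convert counts to class shares.
income_share = income_counts / income_counts.sum()
print(income_share.round(4))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = income_share.max()
print(round(majority_baseline, 4))
```

Roughly 76% of rows are `<=50K`, so a classifier should be compared against that majority-class baseline rather than against 50%.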

### Display Number of Unique Values for Each Column
Finally, let's display the number of unique values for each column in the dataset. This is especially useful when dealing with many categorical variables. Note that `nunique()` ignores missing values by default, so `nan` is not counted as a category.

In [19]:
data.nunique()

age                  73
workclass             8
fnlwgt            21648
education            16
education-num        16
marital-status        7
occupation           14
relationship          6
race                  5
sex                   2
capital-gain        119
capital-loss         92
hours-per-week       94
native-country       41
income                2
dtype: int64
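'education' and 'education-num' both have 16 unique values, which suggests (but does not prove) that one is a numeric encoding of the other. A minimal sketch of a one-to-one check, with a hypothetical helper `is_one_to_one` and toy rows modeled on the `head()` output:

```python
import pandas as pd


def is_one_to_one(df: pd.DataFrame, a: str, b: str) -> bool:
    """True if each value of `a` pairs with exactly one value of `b`, and vice versa."""
    pairs = df[[a, b]].drop_duplicates()
    return bool(pairs[a].is_unique and pairs[b].is_unique)


# Toy rows for illustration; in the notebook, pass `data` instead.
toy = pd.DataFrame(
    {
        "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
        "education-num": [13, 13, 9, 7, 13],
    }
)

print(is_one_to_one(toy, "education", "education-num"))
```

If the check also holds on the full data, one of the two columns could be dropped before modeling.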

## Business Insights

Based on our initial exploration of the Adult dataset, here are a few potential business insights:

1. **Distribution of the target variable**: Only 7,841 of the 32,561 individuals (about 24%) earn above 50K. This class imbalance could be significant for various business applications, such as income forecasting, targeted marketing, or social policy planning.

2. **Missing values**: Missing values are confined to three columns: 'workclass' (1,836), 'occupation' (1,843), and 'native-country' (583). This could point to gaps in the data collection process, and incomplete data may lead to less accurate predictions or insights.

3. **Diversity of occupations**: The data contains 14 distinct occupation categories. This could suggest a diversity of skills in the population, which could be of interest to recruitment, training, or education companies.

4. **Correlations**: In future analysis, we will look for correlations between different features and income. For instance, there may be a relationship between education level and income, which could have implications for education policy.

5. **Gender balance**: There are more males than females in our dataset. This imbalance may affect our ability to accurately model income disparities between genders and could have implications for businesses interested in gender wage gap analysis.

6. **Age distribution**: Understanding the age distribution in our dataset could be vital for businesses that cater to specific age demographics.

7. **Hours per week**: The number of hours individuals work per week might be positively correlated with their income. This could be of interest to employers, labor unions, or policy makers.

8. **Capital gain and loss**: These financial variables could be particularly interesting for economic analysts or financial institutions. A deeper understanding of these features might be useful in financial planning and investment strategies.

9. **Education level**: The number of years of education could play a significant role in determining one's income. This insight might be beneficial for educational institutions or policy makers focusing on education reforms.

10. **Native country**: The native country of individuals could be valuable for studies involving immigration and its impact on income levels. Such data could be important for social scientists or policy makers.

This summary provides a high-level overview of the potential business insights that we can derive from the Adult dataset based on our initial exploration. It sets the stage for the more detailed analysis and machine learning model building that will follow in the subsequent notebooks.
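As a starting point for the feature-versus-income questions raised in the list above, here is a minimal sketch that computes the share of high earners per category. The helper name `high_income_rate` and the toy rows are illustrative assumptions, not results from the dataset:

```python
import pandas as pd


def high_income_rate(df: pd.DataFrame, by: str, target: str = "income") -> pd.Series:
    """Share of rows labeled '>50K' within each group of `by`."""
    is_high = df[target].eq(">50K")
    return is_high.groupby(df[by]).mean().sort_values(ascending=False)


# Toy rows for illustration only; they do not reflect the real distribution.
toy = pd.DataFrame(
    {
        "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "HS-grad", "Masters"],
        "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
    }
)

print(high_income_rate(toy, by="education"))
```

The same call with `by="sex"` or `by="native-country"` on the full frame would give a first look at the other categorical features.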

## Summary

In this notebook, we performed the initial exploration of the Adult dataset. We did the following:

1. **Imported necessary libraries** - We imported pandas, which is crucial for data manipulation and analysis.

2. **Loaded the dataset** - We loaded the Adult dataset from the UCI Machine Learning Repository.

3. **Performed initial data exploration** - We displayed the first few rows of the dataset, checked its structure, and identified the presence of missing values.

4. **Checked basic statistical details** - We observed the count, mean, standard deviation, minimum, quartiles, and maximum for the numeric features.

5. **Checked unique values for categorical variables** - This helped us understand what categories exist within these variables.

6. **Checked the balance of categorical variables** - This helped us understand if there is any significant imbalance that might affect our model.

7. **Checked the distribution of the target variable** - We found out how balanced the classes are in our classification task.

8. **Displayed the number of unique values for each column** - This was especially useful in understanding the categorical variables.

From the exploration, we understood more about the data structure, its completeness, and the type of cleaning and preprocessing that might be required in the next steps.

In the next notebook, we will focus on data preprocessing, including dealing with missing values, outliers, and more.