# **Analyzing "Census Adult Income" Dataset**

# Step 1: Import Libraries and Explore the Dataset

## **Objectives**
- Load the dataset and examine its structure.
- Check the first few rows of the data.
- Assess the column data types and summary statistics.
- Identify any missing values.

## **Code**


In [16]:
# Import required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Initialize the BigQuery client used to query the public dataset
from google.cloud import bigquery
client = bigquery.Client(project="central-catcher-444011-s7")

In [17]:
# Select the 14 columns used in this analysis
query = """
    SELECT 
        age,
        workclass,
        education,
        education_num,
        marital_status,
        occupation,
        relationship,
        race,
        sex,
        capital_gain,
        capital_loss,
        hours_per_week,
        native_country,
        income_bracket
    FROM `bigquery-public-data.ml_datasets.census_adult_income`
"""
# Submit the query; result() waits for the job to finish and returns an iterator over the rows
query_job = client.query(query)
results = query_job.result()
# Convert each row to a dict and load the rows into a pandas DataFrame
df = pd.DataFrame([dict(row) for row in results])
print(df.head())


   age workclass education  education_num       marital_status  \
0   39   Private       9th              5   Married-civ-spouse   
1   77   Private       9th              5   Married-civ-spouse   
2   38   Private       9th              5   Married-civ-spouse   
3   28   Private       9th              5   Married-civ-spouse   
4   37   Private       9th              5   Married-civ-spouse   

           occupation relationship    race      sex  capital_gain  \
0       Other-service         Wife   Black   Female          3411   
1     Priv-house-serv         Wife   Black   Female             0   
2       Other-service         Wife   Black   Female             0   
3     Protective-serv         Wife   Black   Female             0   
4   Machine-op-inspct         Wife   Black   Female             0   

   capital_loss  hours_per_week  native_country income_bracket  
0             0              34   United-States          <=50K  
1             0              10   United-States          <

In [None]:
# Check the structure of the DataFrame (df.info() prints its report directly and returns None)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   education       32561 non-null  object
 3   education_num   32561 non-null  int64 
 4   marital_status  32561 non-null  object
 5   occupation      32561 non-null  object
 6   relationship    32561 non-null  object
 7   race            32561 non-null  object
 8   sex             32561 non-null  object
 9   capital_gain    32561 non-null  int64 
 10  capital_loss    32561 non-null  int64 
 11  hours_per_week  32561 non-null  int64 
 12  native_country  32561 non-null  object
 13  income_bracket  32561 non-null  object
dtypes: int64(5), object(9)
memory usage: 3.5+ MB
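
The non-null counts show no `NaN` values in any column, but that alone does not settle the objective of identifying missing values. The original UCI Adult data marks unknown entries with a `?` placeholder, and string fields can carry stray whitespace; whether this BigQuery copy does the same is an assumption to verify, not something the output above confirms. A minimal sketch on a small hypothetical sample (`sample` stands in for `df`, and `count_placeholders` is an illustrative helper, not part of pandas):

```python
import pandas as pd

# Hypothetical rows standing in for `df`. The " ?" entries are an assumed
# placeholder convention, not confirmed output from the query above.
sample = pd.DataFrame({
    "age": [39, 77, 38, 28],
    "workclass": [" Private", " ?", " Private", " State-gov"],
    "occupation": [" Other-service", " ?", " Sales", " ?"],
})


def count_placeholders(frame, placeholder="?"):
    """Count cells equal to `placeholder` in each non-numeric column,
    ignoring leading and trailing whitespace."""
    text_cols = frame.select_dtypes(exclude="number")
    stripped = text_cols.apply(lambda col: col.str.strip())
    return (stripped == placeholder).sum()


# True NaN values per column
nan_counts = sample.isna().sum()
# Placeholder-encoded missing values per text column
placeholder_counts = count_placeholders(sample)

print(nan_counts)
print(placeholder_counts)
```

Running the same two checks on `df` would show whether any column hides missing values behind a placeholder string even though `df.info()` reports it as fully populated.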


In [18]:
# Summary statistics for the numeric columns
print(df.describe())

                age  education_num  capital_gain  capital_loss  hours_per_week
count  32561.000000   32561.000000  32561.000000  32561.000000    32561.000000
mean      38.581647      10.080679   1077.648844     87.303830       40.437456
std       13.640433       2.572720   7385.292085    402.960219       12.347429
min       17.000000       1.000000      0.000000      0.000000        1.000000
25%       28.000000       9.000000      0.000000      0.000000       40.000000
50%       37.000000      10.000000      0.000000      0.000000       40.000000
75%       48.000000      12.000000      0.000000      0.000000       45.000000
max       90.000000      16.000000  99999.000000   4356.000000       99.000000
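
In the summary above, `capital_gain` and `capital_loss` have a 75th percentile of 0 but maxima of 99999 and 4356, which suggests that most rows are zero and a few are very large. One way to quantify that is to compute the share of zero values per column. The sketch below uses a small hypothetical sample in place of `df[["capital_gain", "capital_loss"]]`, so the fractions it prints are illustrative only:

```python
import pandas as pd

# Hypothetical values standing in for the two capital columns of `df`
sample = pd.DataFrame({
    "capital_gain": [0, 0, 3411, 0, 99999, 0, 0, 0],
    "capital_loss": [0, 0, 0, 0, 0, 0, 4356, 0],
})

# Comparing to 0 gives a boolean frame; the column mean is the fraction of zeros
zero_share = (sample == 0).mean()
print(zero_share)
```

On the real data, `(df[["capital_gain", "capital_loss"]] == 0).mean()` would give the corresponding fractions.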


In [19]:
# Count the number of unique values in each column
unique_counts = df.nunique()

# Print the unique counts for each column
print(unique_counts)

age                73
workclass           9
education          16
education_num      16
marital_status      7
occupation         15
relationship        6
race                5
sex                 2
capital_gain      119
capital_loss       92
hours_per_week     94
native_country     42
income_bracket      2
dtype: int64
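
`education` and `education_num` both have 16 unique values, which hints that one may simply be a numeric encoding of the other. Equal counts do not prove a one-to-one mapping, so it is worth checking directly. A minimal sketch, using a hypothetical sample in place of `df`:

```python
import pandas as pd

# Hypothetical rows standing in for `df`; the label/code pairs are illustrative
sample = pd.DataFrame({
    "education": ["9th", "9th", "HS-grad", "Bachelors", "HS-grad"],
    "education_num": [5, 5, 9, 13, 9],
})

# Distinct numeric codes seen for each education label
codes_per_label = sample.groupby("education")["education_num"].nunique()
# Distinct education labels seen for each numeric code
labels_per_code = sample.groupby("education_num")["education"].nunique()

# The mapping is one-to-one only if every label has one code and vice versa
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

If the same check holds on `df`, the two columns carry the same information and one of them could be dropped before modeling.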
