# Demographic Data Analyzer

We will use Pandas to analyze demographic data extracted from the 1994 Census database and answer questions such as:
- How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)
- What is the average age of men?
- What is the percentage of people who have a Bachelor's degree?
- What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
- What percentage of people without advanced education make more than 50K?
- What is the minimum number of hours a person works per week?
- What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
- What country has the highest percentage of people that earn >50K and what is that percentage?
- Identify the most popular occupation for those who earn >50K in India.

First, import the pandas library and load the dataset.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(r"C:\Users\*****\Downloads\adult.data.csv")

How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
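
No cell for this question appears in the notebook. A minimal sketch of one way to answer it, using a small made-up frame in place of the census file (the `toy` data and the `race_count` name are assumptions, only the `race` column name comes from the dataset):

```python
import pandas as pd

# Hypothetical toy rows standing in for the census data.
toy = pd.DataFrame({"race": ["White", "Black", "White", "Asian-Pac-Islander", "White"]})

# value_counts() returns a Series indexed by the distinct values,
# sorted by count in descending order.
race_count = toy["race"].value_counts()
print(race_count)
```

Applied to the notebook's `df`, the same call would be `df["race"].value_counts()`.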

What is the average age of men?

In [3]:
average_age_men = df[df["sex"] == "Male"]["age"].mean().round(1)
print('Average age of men:', average_age_men)

Average age of men: 39.4
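
If the average for every value of `sex` is wanted at once, a `groupby` is one option. The sketch below uses made-up rows (the `toy` frame and `average_age_by_sex` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the dataset.
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [30, 25, 41]})

# Mean age per sex, rounded to one decimal place.
average_age_by_sex = toy.groupby("sex")["age"].mean().round(1)
print(average_age_by_sex)
```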


What is the percentage of people who have a Bachelor's degree?

In [4]:
bachelor = df['education'] == 'Bachelors'    # boolean mask: True where education is 'Bachelors'
bachelor_total = bachelor.sum()              # True counts as 1, so the sum is the number of matching rows
educated = df['education'].count()           # number of rows with a non-missing education value
percentage_bachelors = round(bachelor_total * 100 / educated, 1)
print('Percent of people with bachelors degree:',percentage_bachelors)

Percent of people with bachelors degree: 16.4
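
The percentage above is a count of matching rows divided by a total. Because a boolean Series averages to the fraction of `True` values, the same kind of figure can be sketched in one step. The data here is made up and the names `toy`, `is_bachelor`, and `pct` are illustrative:

```python
import pandas as pd

# Hypothetical toy rows: 2 of 5 people hold a Bachelors degree.
toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"]})

is_bachelor = toy["education"] == "Bachelors"

# mean() of a boolean mask is the share of True values.
pct = round(is_bachelor.mean() * 100, 1)
print(pct)
```

This assumes the column has no missing values; rows with a missing `education` would still count in the denominator.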


In [5]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
What percentage of people without advanced education make more than 50K?

In [6]:
# Boolean masks for the two education groups
higher_education = df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
lower_education = ~higher_education  # ~ negates the mask element-wise

rich = df['salary'] == '>50K'

# percentage with salary >50K
hi_ed_rich = (higher_education & rich).sum()
hi_ed_total = higher_education.sum()
higher_education_rich = round(hi_ed_rich * 100 / hi_ed_total, 1)

lo_ed_rich = (lower_education & rich).sum()
lo_ed_total = lower_education.sum()
lower_education_rich = round(lo_ed_rich * 100 / lo_ed_total, 1)

print('Higher education rich:', higher_education_rich)
print('Lower education rich:', lower_education_rich)

Higher education rich: 46.5
Lower education rich: 17.4
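
Both percentages can also be computed in a single pass by grouping on the education mask. This is a sketch on made-up rows, not the notebook's method; the `toy` frame and the group labels `"advanced"` and `"other"` are assumptions:

```python
import pandas as pd

# Hypothetical toy rows; column names follow the dataset.
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "11th", "Doctorate", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
group = advanced.map({True: "advanced", False: "other"})

# 1 for rows earning >50K, 0 otherwise, so the group mean is the share of earners.
is_rich = (toy["salary"] == ">50K").astype(int)
rich_share = (is_rich.groupby(group).mean() * 100).round(1)
print(rich_share)
```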


What is the minimum number of hours a person works per week?
What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

In [7]:
# What is the minimum number of hours a person works per week (hours-per-week feature)?
min_work_hours = df['hours-per-week'].min()
print('MIN WORK HOURS:', min_work_hours)

# What percentage of the people who work the minimum number of hours per week have a salary of >50K?
min_workers = df['hours-per-week'] == min_work_hours   # mask of people working the minimum hours
num_min_workers = (min_workers & (df['salary'] == '>50K')).sum()
print('MIN WORKERS RICH:', num_min_workers)

rich_percentage = round(num_min_workers * 100 / min_workers.sum(), 1)
print('RICH PERCENT:', rich_percentage)

MIN WORK HOURS: 1
MIN WORKERS RICH: 2
RICH PERCENT: 10.0
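
When a comparison is combined with `&`, note that `&` binds more tightly than `==` in Python, so each comparison needs its own parentheses. The sketch below shows the difference on made-up values. The 0-hour row is hypothetical and chosen only to expose the problem:

```python
import pandas as pd

# Hypothetical values; a 0-hour row is included only to expose the pitfall.
hours = pd.Series([1, 1, 40, 0])
rich = pd.Series([True, False, True, False])

# Intended meaning: works exactly 1 hour AND earns >50K.
correct = (hours == 1) & rich

# Without parentheses this parses as hours == (1 & rich).
loose = hours == 1 & rich

print(correct.tolist())
print(loose.tolist())
```

The two masks disagree on the last row, which is why parenthesizing every comparison is the safer habit.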


What country has the highest percentage of people that earn >50K and what is that percentage?

In [8]:
# What country has the highest percentage of people that earn >50K?
rich_pop_by_country = df.loc[df['salary'] == '>50K', 'native-country'].value_counts()
country_population = df['native-country'].value_counts()

# Dividing the two Series aligns them on country name
rich_percent_by_country = round(rich_pop_by_country * 100 / country_population, 2)
highest_earning_country = rich_percent_by_country.idxmax()
highest_earning_country_percentage = round(rich_percent_by_country.max(), 1)
print('Highest percentage of people earning >50K in a single country:', highest_earning_country_percentage)

Highest percentage of people earning >50K in a single country: 41.9
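
Dividing two `value_counts()` results aligns them on the country name, so a country with no >50K earners is absent from the numerator and comes out as `NaN`. `idxmax()` skips `NaN` by default, so the answer is unaffected. A grouped mean gives such countries 0 instead. A minimal sketch on made-up rows (the country labels `A`, `B`, `C` and the names `share`, `top_country`, `top_pct` are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows; country "C" has no >50K earners.
toy = pd.DataFrame({
    "native-country": ["A", "A", "B", "B", "B", "C"],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# 1 for rows earning >50K, 0 otherwise; the per-country mean is the share of earners.
is_rich = (toy["salary"] == ">50K").astype(int)
share = (is_rich.groupby(toy["native-country"]).mean() * 100).round(1)

top_country = share.idxmax()   # index label of the largest share
top_pct = share.max()          # the largest share itself
print(top_country, top_pct)
```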


Identify the most popular occupation for those who earn >50K in India.

In [9]:
# Identify the most popular occupation for those who earn >50K in India.
india = df['native-country'] == 'India'
india_rich = df.loc[india & (df['salary'] == '>50K'), 'occupation'].value_counts()
#print('INDIA RICH', india_rich)
top_IN_occupation = india_rich.idxmax()
#print('TOP OCCUPATION', top_IN_occupation)
print('Top occupation in India:', top_IN_occupation)

Top occupation in India: Prof-specialty
