# Demographic Data Analyzer

This project is based on the freeCodeCamp Data Analysis with Python project:
https://www.freecodecamp.org/learn/data-analysis-with-python/data-analysis-with-python-projects/demographic-data-analyzer

In [168]:
import pandas as pd
df = pd.read_csv('adult.data.csv')
pd.options.display.float_format = '{:.2f}'.format
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [152]:
print(df.shape)

(32561, 15)


## How many people of each race are represented in this dataset?
The result should be a pandas Series with race names as the index labels (use the `race` column).

In [177]:
race_count = df.groupby('race')['race'].count().sort_values(axis=0, ascending=False)
print("Number of each race:\n", race_count) 

Number of each race:
 race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
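
The same count can also be obtained with `Series.value_counts()`, which counts each distinct value and sorts the result by count in descending order. A minimal sketch on a made-up frame (the `toy` data below is an assumption for illustration, not the census file):

```python
import pandas as pd

# Toy stand-in for the census data; these rows are made up for illustration.
toy = pd.DataFrame({"race": ["White", "Black", "White", "Other", "White", "Black"]})

# value_counts() returns a Series indexed by the distinct values,
# sorted from the most frequent to the least frequent.
race_count = toy["race"].value_counts()
print(race_count)
```

This avoids the explicit `groupby` and `sort_values` steps while producing a Series with the race names as index labels.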


## What is the average age of men?

In [184]:
avg_age_men = df[df['sex'] == 'Male']['age'].mean()
print("Average age of men:", round(avg_age_men,1))

Average age of men: 39.4


## What is the percentage of people who have a Bachelor's degree?

In [189]:
#Scale to percent before rounding so that one decimal place is kept
percentage_bachelors = round(df[df['education'] == 'Bachelors'].shape[0] / df.shape[0] * 100, 1)
print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")

Percentage with Bachelors degrees: 16.4%


## What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?

In [156]:
higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
higher_education_rich = round(higher_education[higher_education['salary'] == '>50K']['salary'].count() / higher_education.shape[0],3)
print(f"Percentage with higher education that earn >50K: {higher_education_rich*100}%")

Percentage with higher education that earn >50K: 46.5%


## What percentage of people without advanced education make more than 50K?

In [157]:
lower_education_rich = round(lower_education[lower_education['salary'] == '>50K']['salary'].count() / lower_education.shape[0],3)
print(f"Percentage without higher education that earn >50K: {lower_education_rich*100}%")

Percentage without higher education that earn >50K: 17.4%
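
Both education questions can also be answered with boolean masks, because the mean of a boolean Series is the fraction of `True` values. A minimal sketch on made-up rows (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Made-up sample rows; not taken from adult.data.csv.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad", "Doctorate", "11th"],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# One mask per condition; both are boolean Series aligned with toy's index.
advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = toy["salary"] == ">50K"

# Mean of a boolean Series = share of True values, so scale by 100 for a percentage.
higher_pct = round(rich[advanced].mean() * 100, 1)
lower_pct = round(rich[~advanced].mean() * 100, 1)

print(f"With advanced education: {higher_pct}%")
print(f"Without advanced education: {lower_pct}%")
```

Scaling by 100 before rounding keeps the requested number of decimals in the final percentage.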


## What is the minimum number of hours a person works per week?

In [158]:
min_work_hours = df['hours-per-week'].min()
print(f"Min work time: {min_work_hours} hours/week")

Min work time: 1 hours/week


## What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

In [159]:
#Use min_work_hours from the previous cell instead of hard-coding the value 1
num_min_workers = df[df['hours-per-week'] == min_work_hours]['hours-per-week'].count()
rich_percentage = df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')].shape[0] / num_min_workers
print(f"Percentage of rich among those who work fewest hours: {rich_percentage*100}%")

Percentage of rich among those who work fewest hours: 10.0%


## What country has the highest percentage of people that earn >50K and what is that percentage?

In [160]:
#Filter out rows with an unknown native country ('?'); there are 583 of them
total = df[df['native-country'] != '?']
print(total.shape)

(31978, 15)


In [161]:
#Group the data per country and count the salary entries
total = total[['native-country', 'salary']].groupby(['native-country']).count().sort_values(by=['salary'], ascending=False)
total.head()

native-country,salary
United-States,29170
Mexico,643
Philippines,198
Germany,137
Canada,121


In [162]:
#Create a new dataset of rows with a valid country and a salary above 50K
high_earner = df[(df["salary"] == '>50K') & (df["native-country"] != '?')]
print(high_earner.shape)

(7695, 15)


In [163]:
#Group and sort the high earners to get the number of high earners per country
high_earner = high_earner[['native-country', 'salary']].groupby(['native-country']).count().sort_values(by=['salary'], ascending=False)
high_earner.head()

native-country,salary
United-States,7171
Philippines,61
Germany,44
India,40
Canada,39


In [169]:
#Combine the 2 datasets for calculating the percentage and easy viewing
combine = pd.concat([total, high_earner], axis=1)
#Rename both columns explicitly: first the total count, then the high-earner count
combine.columns = ['total', 'salary']
combine.head()

Unnamed: 0,total,salary
United-States,29170,7171.0
Mexico,643,33.0
Philippines,198,61.0
Germany,137,44.0
Canada,121,39.0


In [173]:
#Calculate the percentage
combine['percentage'] = ((combine['salary'] / combine['total']) * 100).round(2)
#Note: combine.style.format({'percentage': '{:.2f}%'}) returns a Styler object used for rendering only.
#It does not modify combine, so it has no visible effect unless it is the last expression in the cell.
combine = combine.sort_values(by=['percentage'], ascending=False)
combine.head()

Unnamed: 0,total,salary,percentage
Iran,43,18.0,41.86
France,29,12.0,41.38
India,100,40.0,40.0
Taiwan,51,20.0,39.22
Japan,62,24.0,38.71
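
Display formatting can be kept separate from the numeric values. `DataFrame.style.format` returns a `Styler` object and leaves the frame untouched, so one alternative is to build a display-only copy whose percentage column holds formatted strings. A minimal sketch using the Iran and France rows from the table above (the names `toy` and `shown` are hypothetical):

```python
import pandas as pd

# Two rows copied from the output above: total people and high earners per country.
toy = pd.DataFrame(
    {"total": [43, 29], "salary": [18.0, 12.0]},
    index=["Iran", "France"],
)

# Keep the numeric percentage for sorting and further calculation.
toy["percentage"] = (toy["salary"] / toy["total"] * 100).round(2)

# Display-only copy: the percentage column becomes strings such as "41.86%".
shown = toy.assign(percentage=toy["percentage"].map("{:.2f}%".format))
print(shown)
```

Sorting should be done on the numeric column in `toy`, since the strings in `shown` would sort lexicographically.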


In [183]:
print("Country with highest percentage of rich:", combine.index[0])
print(f"Highest percentage of rich people in country: {combine.iloc[0,2]}%")

Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.86%
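
The per-country calculation above can be condensed into a single grouped mean of a boolean Series. This is only a sketch on made-up rows (the `toy` frame and the country labels `A`, `B`, `C` are assumptions), not the approach used in the cells above:

```python
import pandas as pd

# Made-up rows; "?" marks an unknown country, as in the census data.
toy = pd.DataFrame({
    "native-country": ["A", "A", "A", "B", "B", "?", "C"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"],
})

# Drop rows with an unknown country before grouping.
valid = toy[toy["native-country"] != "?"]

# Group the boolean ">50K" flag by country; its mean is the share of high earners.
rich_share = (
    (valid["salary"] == ">50K")
    .groupby(valid["native-country"])
    .mean()
    .mul(100)
    .round(2)
    .sort_values(ascending=False)
)

top_country = rich_share.index[0]
top_pct = rich_share.iloc[0]
print(f"{top_country}: {top_pct}%")
```

One difference from the `concat` approach is that a country with no high earners gets 0.0 here instead of a missing value.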


## Identify the most popular occupation for those who earn >50K in India.

In [167]:
top_IN_occupation = df[(df['salary'] == '>50K') & (df['native-country'] == 'India')]['occupation'].value_counts().head(1)
print("Top occupations in India:", top_IN_occupation)

Top occupations in India: Prof-specialty    25
Name: occupation, dtype: int64
