### This project, "Demographic Data Analyzer", is part of the FreeCodeCamp Data Analysis in Python course. 
### The aim of this project is to analyze demographic data using the Pandas library. The dataset is extracted from the 1994 Census database.

In this challenge, we are required to analyze the provided demographic dataset to answer the following questions:

1. How many people of each race are represented in this dataset?
2. What is the average age of men?
3. What is the percentage of people who have a Bachelor's degree?
4. What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
5. What percentage of people without advanced education make more than 50K?
6. What is the minimum number of hours a person works per week?
7. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
8. What country has the highest percentage of people that earn >50K and what is that percentage?
9. Identify the most popular occupation for those who earn >50K in India.

In [24]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/adultdatademog/adult.data.csv


In [25]:
df = pd.read_csv('/kaggle/input/adultdatademog/adult.data.csv')
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


### How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.

In [27]:
df['race'].value_counts()

race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64

### What is the average age of men?

In [28]:
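# Keep only rows where sex is 'Male' before averaging; df['age'].mean() alone averages everyone
avg_age_men = round(df.loc[df['sex'] == 'Male', 'age'].mean(), 1)
avg_age_men

39.4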
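
A question about a subgroup means filtering the rows first and aggregating second. The sketch below shows the pattern on a small made-up frame; the `toy` data and the variable names are invented for illustration and are not taken from the census file.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the census file).
toy = pd.DataFrame({
    "age": [30, 40, 50, 20],
    "sex": ["Male", "Female", "Male", "Female"],
})

# Boolean mask that is True only for the rows where sex is 'Male'.
is_male = toy["sex"] == "Male"

# .loc[mask, column] filters the rows and picks the column in one step.
mean_age_men = toy.loc[is_male, "age"].mean()

# Without the mask, every row contributes to the average.
mean_age_all = toy["age"].mean()

print(mean_age_men)  # 40.0
print(mean_age_all)  # 35.0
```

The two results differ because the mask changes which rows enter the mean.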

### What is the percentage of people who have a Bachelor's degree?

In [29]:
# Share of rows with education == 'Bachelors', scaled from a fraction to a percentage
bachelor_pct = round((df['education'] == 'Bachelors').mean() * 100, 1)
bachelor_pct

16.4
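
As a side note, `value_counts(normalize=True)` returns proportions directly, which avoids dividing by the row count by hand. A minimal sketch on made-up data (the `toy` frame and its values are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "HS-grad", "Masters"]})

# normalize=True yields each category's share of all rows (fractions summing to 1).
shares = toy["education"].value_counts(normalize=True)

# Multiply by 100 to turn the fraction into a percentage.
bachelors_pct = shares["Bachelors"] * 100

print(bachelors_pct)  # 25.0
```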

### What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?

In [30]:
higher_education = ['Bachelors', 'Masters','Doctorate']
lower_education = ['HS-grad','Some-college', 'Assoc-voc','Prof-school', 'Assoc-acdm', '7th-8th', '12th','10th', '11th', '9th', '5th-6th', '1st-4th']
greater_salary = df[df['salary']== '>50K']
lower_salary = df[df['salary']== '<=50K']
h1 = greater_salary['education'].value_counts()[higher_education].sum()
h2 = greater_salary['education'].value_counts()[lower_education].sum()
print(h1+h2)
print(greater_salary.shape[0])

7841
7841


In [31]:
# The denominator is everyone with advanced education, not everyone earning >50K
adv_ed = df[df['education'].isin(higher_education)]
per_high_salary_adv_ed = round((adv_ed['salary'] == '>50K').mean() * 100, 1)
per_high_salary_adv_ed

46.5

### What percentage of people without advanced education make more than 50K?

In [32]:
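# ~isin covers every education level outside higher_education, whether or not it appears in lower_education
low_ed = df[~df['education'].isin(higher_education)]
per_high_salary_low_ed = round((low_ed['salary'] == '>50K').mean() * 100, 1)
per_high_salary_low_ed

17.4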
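
The direction of a conditional percentage matters here: "what share of advanced-education people earn >50K" and "what share of >50K earners have advanced education" use different denominators and generally give different numbers. A small sketch on made-up data (the `toy` frame and all variable names are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only (not from the census file).
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "Bachelors", "Masters",
                  "HS-grad", "HS-grad", "HS-grad", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", ">50K",
               ">50K", "<=50K", "<=50K", "<=50K"],
})

advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = toy["salary"] == ">50K"

# Among people WITH advanced education, the share earning >50K (2 of 4).
pct_rich_given_advanced = rich[advanced].mean() * 100

# Among people WITHOUT advanced education, the share earning >50K (1 of 4).
pct_rich_given_basic = rich[~advanced].mean() * 100

# The reverse question: among >50K earners, the share with advanced education (2 of 3).
pct_advanced_given_rich = advanced[rich].mean() * 100

print(pct_rich_given_advanced)            # 50.0
print(pct_rich_given_basic)               # 25.0
print(round(pct_advanced_given_rich, 1))  # 66.7
```

Taking the mean of a boolean Series gives the fraction of `True` values, so the mask used for indexing determines the denominator.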

### What is the minimum number of hours a person works per week (hours-per-week feature)?

In [40]:
min_hours_week = df['hours-per-week'].min()
min_hours_week

1

### What percentage of the people who work the minimum number of hours per week have a salary of >50K?

In [64]:
# Select everyone who works the minimum number of hours per week
min_workers = df[df['hours-per-week'] == min_hours_week]

In [66]:
# Share of those minimum-hours workers who earn >50K, as a percentage
rich_percentage = round((min_workers['salary'] == '>50K').mean() * 100, 1)
rich_percentage

10.0
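
Note that `.min()` returns the smallest value in a column, not the number of rows that share it. A short sketch on made-up data (the `toy` frame and variable names are invented for illustration) separates the two ideas:

```python
import pandas as pd

# Toy data, invented for illustration only (not from the census file).
toy = pd.DataFrame({
    "hours-per-week": [1, 1, 1, 1, 40, 40],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# The smallest value in the column: a number of hours, not a head count.
min_hours = toy["hours-per-week"].min()

# The rows that actually work that minimum; len() gives the head count.
at_min = toy[toy["hours-per-week"] == min_hours]
n_at_min = len(at_min)

# Share of those minimum-hours rows with salary '>50K'.
pct_rich_at_min = (at_min["salary"] == ">50K").mean() * 100

print(min_hours)        # 1
print(n_at_min)         # 4
print(pct_rich_at_min)  # 25.0
```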

### What country has the highest percentage of people that earn >50K?

In [89]:
# Rank countries by the share of their people earning >50K; by raw count, United-States leads with 7171
country_pct = (df['salary'] == '>50K').groupby(df['native-country']).mean() * 100
country = country_pct.idxmax()
per_max_coun = round(country_pct.max(), 1)
print(country, per_max_coun)

Iran 41.9
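
The country with the most >50K earners by count is not necessarily the country with the highest percentage of them, because large groups dominate raw counts. A minimal sketch on made-up data (the country labels `Country-A` and `Country-B` and all values are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only; the country labels are hypothetical.
toy = pd.DataFrame({
    "native-country": ["Country-A"] * 6 + ["Country-B"] * 2,
    "salary": [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K",
               ">50K", "<=50K"],
})

rich = toy["salary"] == ">50K"

# Number of >50K earners per country (sum of True values in each group).
rich_count = rich.groupby(toy["native-country"]).sum()

# Percentage of each country's rows that earn >50K (mean of True values in each group).
rich_rate = rich.groupby(toy["native-country"]).mean() * 100

top_by_count = rich_count.idxmax()
top_by_rate = rich_rate.idxmax()

print(top_by_count)  # Country-A
print(top_by_rate)   # Country-B
```

Grouping the boolean Series by the country column keeps the per-country denominator correct without building intermediate frames.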

### Identify the most popular occupation for those who earn >50K in India.

In [112]:
Indian_rich = greater_salary[greater_salary['native-country'] == 'India']
occupation = Indian_rich['occupation'].value_counts()
occupation_pop_ind = occupation.idxmax()
occupation_pop_ind

'Prof-specialty'