# CMPSC 448: Homework #1
# Exploratory Data Analysis with `pandas`

## Objectives

In this assignment, you are asked to analyze the UCI Adult data set, which contains demographic information about US residents. The data was extracted from the Census Bureau database found at

http://www.census.gov/ftp/pub/DES/www/welcome.html

The features and the possible values of each feature are listed below:

| Feature Name| Possible Values  |
|------|------|
| age | continuous|
| workclass| Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked|
| fnlwgt| continuous|
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool|
|education-num | continuous|
|marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse|
|occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces|
|relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
|race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|sex | Female, Male|
|capital-gain| continuous|
|capital-loss | continuous|
|hours-per-week | continuous |
|native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
|salary | >50K,<=50K |


Please complete the tasks in the Jupyter notebook by answering the following 8 questions.

In [2]:
import numpy as np
import pandas as pd
from numpy import linalg as LA
pd.set_option('display.max_columns', 100)
# to draw pictures in jupyter notebook
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')



In [3]:
data = pd.read_csv('adult.data.csv')
print("\n".join(data.columns))

age
 workclass
 fnlwgt
 education
 education-num
 marital-status
 occupation
 relationship
 race
 sex
 capital-gain
 capital-loss
 hours-per-week
 native-country
 salary


In [4]:
data.shape

(32561, 15)

In [5]:
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
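
The column listing above shows that every name after `age` carries a leading space, and the string values in this file are padded the same way. A minimal sketch of one way to normalize both is given below. The helper `strip_whitespace` and the miniature frame `toy` are hypothetical names used only for illustration; they are not part of the assignment.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the padding seen in adult.data.csv
toy = pd.DataFrame({
    "age": [39, 50],
    " sex": [" Male", " Female"],
    " salary": [" <=50K", " >50K"],
})


def strip_whitespace(df):
    """Return a copy whose column names and string cells have no surrounding spaces."""
    out = df.copy()
    out.columns = out.columns.str.strip()
    for col in out.columns:
        # Only text columns have the .str accessor; numeric columns are left alone
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out


clean = strip_whitespace(toy)
print(list(clean.columns))
print(clean["sex"].tolist())
```

After such a cleanup, comparisons like `clean['sex'] == 'Female'` no longer need the leading space in the literal.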


### 1. How many men and women (sex feature) are represented in this dataset?

In [6]:
# Your answer (code + results)
data.columns = data.columns.str.strip()  # column names carry a leading space in the CSV
data['sex'].value_counts()  # number of rows for each sex

 Male      21790
 Female    10771
Name: sex, dtype: int64

### 2. What is the average age (age feature) of women?

In [32]:
# Your answer (code + results)
# String values keep a leading space from the CSV, hence ' Female'
women_ages = data.loc[data['sex'] == ' Female', 'age']    # ages of all women
print('Average age of women is: ', round(women_ages.mean(), 2))

Average age of women is:  36.86


### 3. What is the percentage of German citizens (native-country feature)?


In [56]:
# Your answer (code + results)
num_German = len(data[data['native-country'] == ' Germany'])  # number of people whose native country is Germany
all_citizens = len(data)  # total number of people in the data set
print("Percentage of German citizens is", round(float(100 * num_German / all_citizens), 3),"%")

Percentage of German citizens is 0.421 %
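
The same percentage can also be read off a boolean mask, because the mean of a boolean Series is the fraction of `True` entries. The sketch below uses a hypothetical toy Series, `countries`, in place of `data['native-country']`.

```python
import pandas as pd

# Hypothetical toy column; the notebook itself would use data['native-country']
countries = pd.Series([" Germany", " United-States", " Japan", " United-States"])

# Strip the padding first so the comparison literal needs no leading space
is_german = countries.str.strip() == "Germany"

# Mean of a boolean mask = share of True values
pct_german = 100 * is_german.mean()
print(round(pct_german, 3))
```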


###  4. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?

In [78]:
# Your answer (code + results)
# np.std defaults to ddof=0, i.e. the population standard deviation
more_than50 = data[data['salary'] == ' >50K']['age']  # ages of those earning >50K
print('Mean of those whose salary is more than 50K: ',round(np.mean(more_than50),2))
print('Standard Deviation of those whose salary is more than 50K: ',round(np.std(more_than50),2))

less_than50 = data[data['salary'] == ' <=50K']['age']  # ages of those earning <=50K
print('\nMean of those whose salary is less than 50K: ',round(np.mean(less_than50),2))
print('Standard Deviation of those whose salary is less than 50K: ',round(np.std(less_than50),2))

Mean of those whose salary is more than 50K:  44.25
Standard Deviation of those whose salary is more than 50K:  10.52

Mean of those whose salary is less than 50K:  36.78
Standard Deviation of those whose salary is less than 50K:  14.02
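
Both statistics can also be computed in one pass with `groupby`. The sketch below uses a hypothetical toy frame and shows the one detail worth watching: pandas' `std` defaults to the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`), so the two can differ slightly.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'salary' and 'age' columns
toy = pd.DataFrame({
    "salary": [">50K", ">50K", "<=50K", "<=50K", "<=50K"],
    "age": [40, 50, 20, 30, 40],
})

# Mean and sample standard deviation (ddof=1, the pandas default) per salary group
sample_stats = toy.groupby("salary")["age"].agg(["mean", "std"])

# Population standard deviation (ddof=0), which is what np.std computes by default
population_std = toy.groupby("salary")["age"].std(ddof=0)

print(sample_stats)
print(population_std)
```

On a data set with tens of thousands of rows, the two conventions agree to the two decimals reported above, but on small groups the gap is visible.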


### 5. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

In [81]:
# Your answer (code + results)
bachelors = 0
profSchool = 0
assocAcdm = 0
assocVoc = 0
masters = 0
doctorate = 0
highSchool = 0
someCollege = 0
count = 0 # total number of people

for i in range (0, len(data)):
    if (data.iloc[i][14] == ' >50K'): # data is gathered from people who make more than 50K
        if (data.iloc[i][3] == ' Bachelors'):
            bachelors += 1
        elif (data.iloc[i][3] == ' Prof-school'):
            profSchool += 1
        elif (data.iloc[i][3] == ' Assoc-acdm'):
            assocAcdm += 1
        elif (data.iloc[i][3] == ' Assoc-voc'):
            assocVoc += 1
        elif (data.iloc[i][3] == ' Masters'):
            masters += 1
        elif (data.iloc[i][3] == ' Doctorate'):
            doctorate += 1
        elif (data.iloc[i][3] == ' HS-grad'):
            highSchool += 1
        elif (data.iloc[i][3] == ' Some-college'):
            someCollege += 1
        count += 1
print("Number of people that earn more than 50K: ", count)
print("the number of people with at least a high school education: ", bachelors + profSchool + assocAcdm + assocVoc + masters + doctorate + highSchool + someCollege)
print("the number of people with education higher than high school: ", bachelors + profSchool + assocAcdm + assocVoc + masters + doctorate)

# 7841 - 7597 = 244 high earners did not finish high school, so the claim is false
print("in conclusion, no")


Number of people that earn more than 50K:  7841
the number of people with at least a high school education:  7597
the number of people with education higher than high school:  4535
in conclusion, no
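
The row-by-row loop above can be expressed with a membership test instead. The following is a sketch on a hypothetical toy frame. The list `hs_or_above` mirrors the eight education levels counted in the loop and is an assumption about what "at least high school" means here.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'education' and 'salary' columns
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "10th", "Masters"],
    "salary": [">50K", ">50K", ">50K", "<=50K"],
})

# Education levels treated as "at least high school" (assumption, same set as the loop)
hs_or_above = ["HS-grad", "Some-college", "Assoc-acdm", "Assoc-voc",
               "Bachelors", "Masters", "Prof-school", "Doctorate"]

high_earners = toy[toy["salary"] == ">50K"]
has_hs = high_earners["education"].isin(hs_or_above)

all_have_hs = bool(has_hs.all())       # True only if every high earner qualifies
num_without_hs = int((~has_hs).sum())  # high earners below high school level
print(all_have_hs, num_without_hs)
```

A single counterexample is enough to answer the question with "no", which is what `all_have_hs` captures.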


### 6.  Display age statistics for each race (race feature) and each gender (sex feature). 

Hint: Use `groupby()` and `describe()` functions of DataFrame. Find the maximum age of men of Amer-Indian-Eskimo race.

In [64]:
# Your answer (code + results)
for (race, sex), sub_df in data.groupby(['race', 'sex']):
    print("Race: {0}, sex: {1}".format(race, sex))
    print(sub_df['age'].describe())

Race:  Amer-Indian-Eskimo, sex:  Female
count    119.000000
mean      37.117647
std       13.114991
min       17.000000
25%       27.000000
50%       36.000000
75%       46.000000
max       80.000000
Name: age, dtype: float64
Race:  Amer-Indian-Eskimo, sex:  Male
count    192.000000
mean      37.208333
std       12.049563
min       17.000000
25%       28.000000
50%       35.000000
75%       45.000000
max       82.000000
Name: age, dtype: float64
Race:  Asian-Pac-Islander, sex:  Female
count    346.000000
mean      35.089595
std       12.300845
min       17.000000
25%       25.000000
50%       33.000000
75%       43.750000
max       75.000000
Name: age, dtype: float64
Race:  Asian-Pac-Islander, sex:  Male
count    693.000000
mean      39.073593
std       12.883944
min       18.000000
25%       29.000000
50%       37.000000
75%       46.000000
max       90.000000
Name: age, dtype: float64
Race:  Black, sex:  Female
count    1555.000000
mean       37.854019
std        12.637197
min       
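
To pull a single figure out of the grouped statistics, such as the maximum age the hint asks for, `describe()` can be applied to the grouped column and indexed by the `(race, sex)` pair. This is a sketch on hypothetical toy data with already-stripped labels; with the padded values in this file the keys would need their leading spaces.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'race', 'sex' and 'age' columns
toy = pd.DataFrame({
    "race": ["Amer-Indian-Eskimo", "Amer-Indian-Eskimo", "Amer-Indian-Eskimo", "White"],
    "sex": ["Male", "Male", "Female", "Male"],
    "age": [30, 82, 25, 40],
})

# One row of summary statistics per (race, sex) pair
stats = toy.groupby(["race", "sex"])["age"].describe()

# Select the row by its MultiIndex key and the 'max' column
max_age = stats.loc[("Amer-Indian-Eskimo", "Male"), "max"]
print(max_age)
```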

### 7. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?


In [70]:
# Your answer (code + results)
max_hours = data['hours-per-week'].max()  # maximum hours worked per week
print("Max time = {0} hours/week.".format(max_hours))

num_workers = data[data['hours-per-week'] == max_hours].shape[0]  # number of people who work max hours per week
print("Total number of such workers: {0}".format(num_workers))

# Share of those workers who make >50K. Salary values carry a leading space in this CSV,
# so strip before comparing, and count rows with shape[0] (shape[1] would count columns).
rich_workers = float(data[(data['hours-per-week'] == max_hours) & (data['salary'].str.strip() == '>50K')].shape[0]) / num_workers
print("Percentage of workers who make >50k: {0}%".format(int(100 * rich_workers)))


Max time = 99 hours/week.
Total number of such workers: 85
Percentage of workers who make >50k: 29%
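
The three steps in this question (find the maximum, select the rows that reach it, take the share of high earners among them) can be sketched on a hypothetical toy frame as follows. The share is computed as the mean of a boolean mask over the selected rows.

```python
import pandas as pd

# Hypothetical toy data standing in for 'hours-per-week' and 'salary'
toy = pd.DataFrame({
    "hours-per-week": [99, 99, 40, 99, 60],
    "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

max_hours = toy["hours-per-week"].max()

# Rows (people) who work the maximum number of hours
hardest = toy[toy["hours-per-week"] == max_hours]
num_workers = len(hardest)

# Fraction of those rows with salary >50K
share_rich = (hardest["salary"] == ">50K").mean()
print(max_hours, num_workers, round(100 * share_rich, 1))
```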


### 8. Count the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What will these be for Japan?

In [85]:
# Your answer (code + results)
# Mean hours-per-week for every (native-country, salary) pair
for (country, salary), sub_df in data.groupby(['native-country', 'salary']):
    print(country, salary, round(sub_df['hours-per-week'].mean(), 2))

print("As shown above, for Japan, those who earn at most 50K work an average of 41.0 hours per week and those who earn more than 50K work an average of 47.96 hours per week.")

 ?  <=50K 40.16
 ?  >50K 45.55
 Cambodia  <=50K 41.42
 Cambodia  >50K 40.0
 Canada  <=50K 37.91
 Canada  >50K 45.64
 China  <=50K 37.38
 China  >50K 38.9
 Columbia  <=50K 38.68
 Columbia  >50K 50.0
 Cuba  <=50K 37.99
 Cuba  >50K 42.44
 Dominican-Republic  <=50K 42.34
 Dominican-Republic  >50K 47.0
 Ecuador  <=50K 38.04
 Ecuador  >50K 48.75
 El-Salvador  <=50K 36.03
 El-Salvador  >50K 45.0
 England  <=50K 40.48
 England  >50K 44.53
 France  <=50K 41.06
 France  >50K 50.75
 Germany  <=50K 39.14
 Germany  >50K 44.98
 Greece  <=50K 41.81
 Greece  >50K 50.62
 Guatemala  <=50K 39.36
 Guatemala  >50K 36.67
 Haiti  <=50K 36.33
 Haiti  >50K 42.75
 Holand-Netherlands  <=50K 40.0
 Honduras  <=50K 34.33
 Honduras  >50K 60.0
 Hong  <=50K 39.14
 Hong  >50K 45.0
 Hungary  <=50K 31.3
 Hungary  >50K 50.0
 India  <=50K 38.23
 India  >50K 46.48
 Iran  <=50K 41.44
 Iran  >50K 47.5
 Ireland  <=50K 40.95
 Ireland  >50K 48.0
 Italy  <=50K 39.62
 Italy  >50K 45.4
 Jamaica  <=50K 38.24
 Jamaica  >50K 41.1
 Jap
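
Rather than scanning the full printout for one country, the grouped means can be pivoted into a country-by-salary table and a single row selected. This is a sketch on hypothetical toy data with stripped labels; the numbers are made up and are not the Japan figures from the data set.

```python
import pandas as pd

# Hypothetical toy data standing in for the three columns used in this question
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Cuba"],
    "salary": ["<=50K", ">50K", ">50K", "<=50K"],
    "hours-per-week": [40, 50, 46, 38],
})

# Rows: countries, columns: salary brackets, cells: mean hours per week
table = toy.groupby(["native-country", "salary"])["hours-per-week"].mean().unstack()

# One row of the table holds both averages for a given country
japan = table.loc["Japan"]
print(japan)
```

Countries with no one in a bracket (Cuba and `>50K` in this toy frame) show up as `NaN` in the pivoted table.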

In [29]:
# HW #1 question 2
a = np.matrix([[3, 1, 1], [2, 4, 2], [-1, -1, 1]])
w, v = LA.eig(a)
w; v

matrix([[-0.40824829, -0.81174463, -0.45491293],
        [-0.81649658,  0.32969496, -0.35973232],
        [ 0.40824829,  0.48204967,  0.81464525]])
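
A quick numerical check of the factorization A = P D P^-1 is sketched below. It builds D directly from the eigenvalue array returned by `eig`, which keeps each eigenvalue in the same position as its eigenvector column in P. The variable names here are illustrative and use plain arrays rather than `np.matrix`.

```python
import numpy as np

A = np.array([[3, 1, 1],
              [2, 4, 2],
              [-1, -1, 1]], dtype=float)

# w[i] is the eigenvalue belonging to the eigenvector in column i of P
w, P = np.linalg.eig(A)

# Building D from w preserves the eigenvalue/eigenvector pairing
D = np.diag(w)

reconstructed = P @ D @ np.linalg.inv(P)
print(np.sort(w.real))
print(np.allclose(reconstructed, A))
```

If D is written by hand, its diagonal has to follow the order of `w`; permuting the eigenvalues without permuting the columns of P gives a product that no longer equals A.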

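In [None]:
   """
   A = [ 3,  1, 1]
       [ 2,  4, 2]
       [-1, -1, 1]

   The eigenvalues of A are 4, 2 and 2. Each diagonal entry of D must sit in the
   same position as its eigenvector column in P. The first column of P is
   proportional to (1, 2, -1), and A (1, 2, -1)^T = (4, 8, -4)^T, so it belongs to
   the eigenvalue 4. The entries of the other two columns sum to zero, so they
   belong to the eigenvalue 2.

   D = [4, 0, 0]
       [0, 2, 0]
       [0, 0, 2]

   P = [-0.40824829, -0.81174463, -0.45491293]
       [-0.81649658,  0.32969496, -0.35973232]
       [ 0.40824829,  0.48204967,  0.81464525]

P^-1 = [-1.22, -1.22, -1.22]
       [-1.44,  0.41, -0.62]
       [ 1.46,  0.37,  2.21]

   using A = PDP^-1
   =>
   [ 3,  1, 1]     [-0.40824829, -0.81174463, -0.45491293]     [4, 0, 0]     [-1.22, -1.22, -1.22]
   [ 2,  4, 2]  =  [-0.81649658,  0.32969496, -0.35973232]  *  [0, 2, 0]  *  [-1.44,  0.41, -0.62]
   [-1, -1, 1]     [ 0.40824829,  0.48204967,  0.81464525]     [0, 0, 2]     [ 1.46,  0.37,  2.21]

   result:
   With D ordered as (4, 2, 2), the product PDP^-1 reproduces A up to the rounding
   of P^-1 to two decimals. Placing the eigenvalues as (2, 2, 4) instead pairs them
   with the wrong eigenvector columns, and the product then differs significantly from A.
   """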