# US Income Prediction

## Introduction
This project uses census data from [UCI’s Machine Learning Repository](https://archive-beta.ics.uci.edu/ml/datasets/census+income).

We will train a **Random Forest** on this data to predict whether a person earns more than $50,000 a year. The target has two possible values (`<=50K` and `>50K`), so this is a binary classification problem.

## Import Required Libraries

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

## Load Data

In [3]:
income_data = pd.read_csv("income.csv")
income_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## Investigate Data

In [4]:
income_data.describe(include = "all")

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 32561.0 | 32561 | 32561.0 | 32561 | 32561.0 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561.0 | 32561.0 | 32561.0 | 32561 | 32561 |
| unique | | 9 | | 16 | | 7 | 15 | 6 | 5 | 2 | | | | 42 | 2 |
| top | | Private | | HS-grad | | Married-civ-spouse | Prof-specialty | Husband | White | Male | | | | United-States | <=50K |
| freq | | 22696 | | 10501 | | 14976 | 4140 | 13193 | 27816 | 21790 | | | | 29170 | 24720 |
| mean | 38.581647 | | 189778.4 | | 10.080679 | | | | | | 1077.648844 | 87.30383 | 40.437456 | | |
| std | 13.640433 | | 105550.0 | | 2.57272 | | | | | | 7385.292085 | 402.960219 | 12.347429 | | |
| min | 17.0 | | 12285.0 | | 1.0 | | | | | | 0.0 | 0.0 | 1.0 | | |
| 25% | 28.0 | | 117827.0 | | 9.0 | | | | | | 0.0 | 0.0 | 40.0 | | |
| 50% | 37.0 | | 178356.0 | | 10.0 | | | | | | 0.0 | 0.0 | 40.0 | | |
| 75% | 48.0 | | 237051.0 | | 12.0 | | | | | | 0.0 | 0.0 | 45.0 | | |


In [5]:
# Encode sex as an integer: 0 for "Male", 1 otherwise.
income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)
# Encode native-country as its rank by frequency (0 = most common country).
value_count_keys = income_data["native-country"].value_counts().keys()
income_data["country-int"] = income_data["native-country"].apply(lambda row: value_count_keys.get_loc(row))
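
The country encoding above replaces each country with its position in the frequency ranking. As a minimal sketch of the same idea on a toy column (the `toy` frame and its values are made up for illustration), the ranking can be built once as a dictionary and applied with `map`:

```python
import pandas as pd

# Toy stand-in for the native-country column (values are illustrative only).
toy = pd.DataFrame({
    "native-country": ["United-States", "Cuba", "United-States",
                       "Mexico", "Mexico", "United-States"]
})

# value_counts() sorts by descending frequency, so rank 0 is the most common value.
ranks = {country: rank
         for rank, country in enumerate(toy["native-country"].value_counts().index)}
toy["country-int"] = toy["native-country"].map(ranks)

print(toy)
```

Note that these integers only reflect how often a country appears in the data. They do not encode any inherent ordering between countries.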

## Missing Values

In [6]:
income_data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
sex-int           0
country-int       0
dtype: int64
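
A zero count from `isnull()` only rules out `NaN` values. The original UCI files mark unknown entries with a `?` string rather than leaving them empty. Whether this copy of `income.csv` still contains such placeholders is an assumption that would need checking. A minimal sketch of such a check, run here on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame standing in for income_data; the "?" entries are illustrative.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", " ?"],
    "age": [39, 50, 38, 53],
})

# Compare every cell as a stripped string so " ?" and "?" are both caught.
placeholder_counts = {
    col: int((toy[col].astype(str).str.strip() == "?").sum())
    for col in toy.columns
}

print(placeholder_counts)
```

Running the same dictionary comprehension over `income_data.columns` would show whether any column hides missing values behind a placeholder.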

## Creating labels and predictor variables

In [7]:
labels = income_data[["income"]]
data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex-int", "country-int"]]

## Splitting data into training and testing sets

In [8]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)
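
With no `test_size` given, `train_test_split` holds out 25% of the rows for testing. The two income classes are unbalanced, so one option (not used in this notebook) is to pass `stratify`, which keeps the class ratio the same in both splits. A sketch on synthetic labels, where the 76/24 ratio is chosen purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 76 labelled "<=50K" and 24 labelled ">50K".
X = np.arange(100).reshape(-1, 1)
y = np.array(["<=50K"] * 76 + [">50K"] * 24)

# stratify=y asks for the same class proportions in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y
)

print(len(X_train), len(X_test))
print((y_test == ">50K").mean())
```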

## Create & Fit Random Forest

In [9]:
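forest = RandomForestClassifier(random_state = 1)
# ravel() flattens the single-column label frame into the 1-D array that fit() expects.
forest.fit(train_data, train_labels.values.ravel())
# score() returns the mean accuracy on the held-out test set.
score = forest.score(test_data, test_labels)
score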

0.8216435327355361
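
To put this accuracy in context, it helps to compare it with a classifier that always predicts the most common class. The sketch below uses the counts from the `describe()` output (24720 of 32561 rows are `<=50K`). Those counts cover the full dataset while the score is measured on the test split, so the comparison is only approximate:

```python
# Counts taken from the describe() output of the full dataset.
n_rows = 32561
n_majority = 24720  # rows labelled "<=50K"

# Accuracy of always predicting the majority class.
baseline_accuracy = n_majority / n_rows
forest_accuracy = 0.8216435327355361  # test-set score reported by the forest

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
print(f"Improvement over baseline: {forest_accuracy - baseline_accuracy:.4f}")
```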

## Predicting income

In [14]:
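# One hypothetical person: age 52, no capital gain or loss, 45 hours per week,
# sex-int 1 (not "Male"), country-int 0 (the most common country).
# Reusing the training column names avoids a feature-name warning from scikit-learn.
X = pd.DataFrame([[52, 0, 0, 45, 1, 0]], columns = data.columns)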
predicted_income = forest.predict(X)
predicted_income_proba = forest.predict_proba(X)
predicted_income_classes = forest.classes_



In [15]:
print(f"Probability for income to be {predicted_income_classes[0]} is {predicted_income_proba[0][0]}")
print(f"Probability for income to be {predicted_income_classes[1]} is {predicted_income_proba[0][1]}")

Probability for income to be <=50K is 0.9713809523809525
Probability for income to be >50K is 0.02861904761904762
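
`predict_proba` returns one column per entry in `forest.classes_`, in the same order. For a random forest, each value is the mean of the class probabilities predicted by the individual trees. The following sketch illustrates this on synthetic data (not the census data), so the feature values and the rule that generates the labels are assumptions made only for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] + X[:, 1] > 0, ">50K", "<=50K")

forest = RandomForestClassifier(n_estimators=20, random_state=1).fit(X, y)

sample = X[:1]
proba = forest.predict_proba(sample)[0]

# Column i of predict_proba corresponds to forest.classes_[i].
for label, p in zip(forest.classes_, proba):
    print(f"P({label}) = {p:.3f}")

# The predicted label is the class with the highest averaged probability.
predicted = forest.classes_[np.argmax(proba)]
print(predicted)
```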
