## Income Prediction Based on Demographic and Socioeconomic Factors

### Summary

| Metric                                   | Value   | Description                                                                                                     |
|------------------------------------------|---------|-----------------------------------------------------------------------------------------------------------------|
| Accuracy                                 | 0.8524  | Proportion of test instances classified correctly (5551 of 6512, about 85.24%). |
| Precision (<=50K)                        | 0.88    | Of the instances predicted as <=50K, 88% actually are <=50K. |
| Recall (<=50K)                           | 0.93    | Of the instances that actually are <=50K, the model identifies 93%. Recall is also called sensitivity. |
| F1-Score (<=50K)                         | 0.90    | Harmonic mean of precision and recall for the <=50K class. |
| Precision (>50K)                         | 0.73    | Of the instances predicted as >50K, 73% actually are >50K. |
| Recall (>50K)                            | 0.63    | Of the instances that actually are >50K, the model identifies 63%. |
| F1-Score (>50K)                          | 0.68    | Harmonic mean of precision and recall for the >50K class. |
| Macro-Average F1-Score                   | 0.79    | Unweighted mean of the two per-class F1-scores; each class counts equally regardless of its size. |
| Weighted-Average F1-Score                | 0.85    | Mean of the per-class F1-scores, weighted by each class's support in the test set. |
| Confusion Matrix (actual <=50K, predicted <=50K) | 4549 | 4549 instances of <=50K were correctly predicted as <=50K (true positives for the <=50K class). |
| Confusion Matrix (actual <=50K, predicted >50K)  | 363  | 363 instances of <=50K were incorrectly predicted as >50K (false negatives for the <=50K class, false positives for the >50K class). |
| Confusion Matrix (actual >50K, predicted <=50K)  | 598  | 598 instances of >50K were incorrectly predicted as <=50K (false positives for the <=50K class, false negatives for the >50K class). |
| Confusion Matrix (actual >50K, predicted >50K)   | 1002 | 1002 instances of >50K were correctly predicted as >50K (true positives for the >50K class). |
| Support (<=50K)                          | 4912    | There are 4912 instances labeled as <=50K in the test set. The support represents the number of instances in each class. |
| Support (>50K)                           | 1600    | There are 1600 instances labeled as >50K in the test set. The support represents the number of instances in each class. |
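
The per-class figures in the table follow directly from the four confusion-matrix counts. A minimal sketch in plain Python, assuming the matrix rows are actual classes and the columns are predicted classes (scikit-learn's convention); the helper name `prf` is only illustrative:

```python
# Confusion-matrix counts from the table (rows = actual, columns = predicted).
tp_low, low_as_high = 4549, 363    # actual <=50K
high_as_low, tp_high = 598, 1002   # actual >50K

def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For <=50K: false positives are >50K rows predicted as <=50K (598),
# false negatives are <=50K rows predicted as >50K (363).
low = prf(tp_low, fp=high_as_low, fn=low_as_high)
# For >50K the roles of the two off-diagonal counts swap.
high = prf(tp_high, fp=low_as_high, fn=high_as_low)

print([round(v, 2) for v in low])   # [0.88, 0.93, 0.9]
print([round(v, 2) for v in high])  # [0.73, 0.63, 0.68]
```

The same two off-diagonal counts serve as false positives for one class and false negatives for the other, which is why the two classes have different precision and recall.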


### Importing Libraries

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


### Read Dataset

In [5]:
# Read the CSV data (read_csv treats the first line of the file as the header row by default)
df = pd.read_csv("adult.data.csv")
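
If the file has no header row, a default `pd.read_csv` call consumes the first record as column names. Whether `adult.data.csv` has a header is not stated here, so the following is only a hedged sketch on a two-line in-memory sample; the three column names are a shortened, illustrative subset:

```python
import io
import pandas as pd

# Two-record stand-in for a headerless, comma-plus-space separated file.
sample = "50, Self-emp-not-inc, <=50K\n38, Private, <=50K\n"

# Default call: the first record becomes the header, leaving one data row.
default_df = pd.read_csv(io.StringIO(sample))

# header=None keeps every record; names= supplies the column labels;
# skipinitialspace=True drops the space that follows each comma.
named_df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=["age", "workclass", "income_flag"],
    skipinitialspace=True,
)

print(len(default_df), len(named_df))  # 1 2
print(named_df["workclass"].tolist())  # ['Self-emp-not-inc', 'Private']
```

Passing `names=` at read time would also make a separate `df.columns = ...` assignment unnecessary.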

### Data Cleaning

In [6]:
# Column names for the dataset, in file order
columns_list = ["age", "workclass", "final_weight", "education", "education_num",
                "marital_status", "occupation", "relationship", "race", "sex",
                "capital_gain", "capital_loss", "hours_per_week", "native_country", "income_flag"]

# Assign column names to the DataFrame
df.columns = columns_list

### Data Profiling

In [7]:
# Count missing (NaN) values in each column
df.isnull().sum()

age               0
workclass         0
final_weight      0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income_flag       0
dtype: int64
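
All-zero counts from `isnull()` only show that no cell was parsed as NaN. Some datasets mark missing entries with a placeholder string such as `?` instead; whether this file does is an assumption here. A sketch of counting such placeholders on a small illustrative frame:

```python
import pandas as pd

# Illustrative frame; the "?" entries stand in for placeholder-coded gaps.
toy = pd.DataFrame({
    "workclass": ["Private", " ?", "Private"],
    "occupation": ["Sales", "?", "Tech-support"],
    "age": [38, 53, 28],
})

# isnull() reports nothing because "?" is an ordinary string.
nan_total = int(toy.isnull().sum().sum())
print(nan_total)  # 0

# Count placeholders per text column, ignoring surrounding whitespace.
placeholder_counts = {
    col: int((toy[col].str.strip() == "?").sum())
    for col in ["workclass", "occupation"]
}
print(placeholder_counts)  # {'workclass': 1, 'occupation': 1}
```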

In [50]:
df

| Unnamed: 0.1 | Unnamed: 0 | age | workclass | final_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32555 | 32555 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32556 | 32556 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32557 | 32557 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32558 | 32558 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |


### Update Dataset

In [8]:
# Save the updated data; index=False keeps the row index out of the file
df.to_csv("adult.updated.data.csv", index=False)
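
The `Unnamed: 0` columns in the frame displayed earlier are the kind of column that appears when a CSV written with its row index is read back without `index_col`. A small round-trip sketch on an in-memory buffer (the toy frame is illustrative):

```python
import io
import pandas as pd

toy = pd.DataFrame({"age": [50, 38], "income_flag": ["<=50K", "<=50K"]})

# Default to_csv writes the row index as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'age', 'income_flag']

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['age', 'income_flag']
```

Left in place, such index columns pass through `pd.get_dummies` as ordinary numeric features, so it is worth dropping them before training.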

### Classification Model

In [9]:
# Split the data into features (X) and the target variable (y)
X = df.drop("income_flag", axis=1)
y = df["income_flag"]

# Convert categorical variables into numerical using one-hot encoding
X = pd.get_dummies(X)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

RandomForestClassifier(random_state=42)
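
The test set holds roughly three <=50K instances for every >50K instance (4912 vs 1600), and a purely random split can shift that ratio slightly between train and test. `train_test_split` accepts a `stratify` argument that preserves it. A sketch on synthetic labels, not the notebook's data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 75 "<=50K" and 25 ">50K" labels with a dummy feature.
y_toy = ["<=50K"] * 75 + [">50K"] * 25
X_toy = [[i] for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Stratification keeps the 3:1 class ratio in both parts.
test_counts = Counter(y_te)    # 15 "<=50K" and 5 ">50K"
train_counts = Counter(y_tr)   # 60 "<=50K" and 20 ">50K"
print(test_counts, train_counts)
```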

In [10]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate a classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate a confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8524262899262899
Classification Report:
              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.90      4912
        >50K       0.73      0.63      0.68      1600

    accuracy                           0.85      6512
   macro avg       0.81      0.78      0.79      6512
weighted avg       0.85      0.85      0.85      6512

Confusion Matrix:
[[4549  363]
 [ 598 1002]]
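
The accuracy, macro-average, and weighted-average F1 rows of the report can be recomputed from the confusion matrix alone. A sketch in plain Python using the counts printed above, assuming rows are actual classes and columns are predicted classes:

```python
# Counts from the confusion matrix (rows = actual, columns = predicted).
matrix = [[4549, 363],
          [598, 1002]]

support = [sum(row) for row in matrix]       # [4912, 1600]
total = sum(support)                         # 6512
correct = matrix[0][0] + matrix[1][1]        # diagonal = correct predictions

# Per-class F1 via 2*TP / (2*TP + FP + FN).
f1 = []
for k in range(2):
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # rest of the actual-class row
    fp = sum(row[k] for row in matrix) - tp  # rest of the predicted-class column
    f1.append(2 * tp / (2 * tp + fp + fn))

accuracy = correct / total
macro_f1 = sum(f1) / len(f1)                                    # classes count equally
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total   # weighted by support

print(round(accuracy, 4), round(macro_f1, 2), round(weighted_f1, 2))
# 0.8524 0.79 0.85
```

The weighted average sits closer to the <=50K score because that class supplies about three quarters of the test instances.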


In [51]:
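# Test data: keys must match the training column names in columns_list, and
# category strings must match the training values exactly (including any
# leading whitespace); otherwise the reindex step zero-fills the feature
test_data = {
    'age': 26,
    'workclass': 'Private',
    'final_weight': 220000,
    'education': 'Bachelors',
    'education_num': 13,
    'marital_status': 'Never-married',
    'occupation': 'Exec-managerial',
    'relationship': 'Unmarried',
    'race': 'White',
    'sex': 'Female',
    'capital_gain': 1000,
    'capital_loss': 0,
    'hours_per_week': 40,
    'native_country': 'United-States'
}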

# Convert test_data into a DataFrame
test_df = pd.DataFrame([test_data])

# One-hot encode the categorical variables (workclass, education, etc.)
test_df = pd.get_dummies(test_df)

# Align test_df with the training columns: dummy columns absent from this
# single record are added as 0, and columns unseen in training are dropped
test_df = test_df.reindex(columns=X.columns, fill_value=0)

# Make predictions using the trained classifier
predicted_income_flag = rf_classifier.predict(test_df)

print("Predicted income_flag:", predicted_income_flag[0])

Predicted income_flag:  <=50K
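
`reindex` silently zero-fills any training column it cannot find, so a new record only carries information when its keys and category strings match the training data exactly. A toy sketch of that behaviour (the frame, the `encode` helper, and the misspelled key are illustrative):

```python
import pandas as pd

# Toy training frame and its one-hot encoded columns.
train = pd.DataFrame({"age": [50, 38], "workclass": ["Private", "State-gov"]})
train_cols = pd.get_dummies(train).columns
print(train_cols.tolist())
# ['age', 'workclass_Private', 'workclass_State-gov']

def encode(record):
    """One-hot encode a single record and align it with the training columns."""
    row = pd.get_dummies(pd.DataFrame([record]))
    return row.reindex(columns=train_cols, fill_value=0)

good = encode({"age": 26, "workclass": "Private"})
bad = encode({"age": 26, "work-class": "Private"})  # misspelled key

# The matching record sets its dummy column; the misspelled one stays at zero.
good_flag = int(good["workclass_Private"].iloc[0])
bad_flag = int(bad["workclass_Private"].iloc[0])
print(good_flag, bad_flag)  # 1 0
```

Because no error is raised in the mismatched case, checking that the encoded row contains at least one non-zero dummy column per categorical feature is a cheap sanity test before calling `predict`.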
