# Machine Learning for Income Prediction

In this notebook, we are going to train and test several machine learning models to predict income. We will also optimize our models to improve their performance.

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

In [10]:
df = pd.read_csv('adult_data_preprocessed.csv')

## Splitting the Data into Training and Testing Sets

In [11]:
# Define your features and target variable
X = df.drop('income', axis=1)  # here 'income' is the target variable
y = df['income']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

To evaluate the performance of our machine learning models, we split our data into a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate the performance of the model on unseen data. This helps us understand how well our model is likely to perform on new, unseen data in the future.

In this case, we've chosen to hold out 20% of our data for testing. This is a common choice for test set size, but the exact percentage can vary depending on the size and nature of the dataset.

The `train_test_split` function from the `sklearn.model_selection` module is used to randomly split the data. We've also set a `random_state` for reproducibility of results.

Next, we will select our models for training.

## Model Building
Now, we'll begin the process of building models, starting with a simple Logistic Regression model. For this, we'll use the LogisticRegression() function from the sklearn library.

In [12]:
# Initialize the Logistic Regression model
logistic_model = LogisticRegression()

# Fit the model on the training data
logistic_model.fit(X_train, y_train)

## Making Predictions
We'll now use our model to make predictions on the test data.

In [13]:
# Generate predictions using the test data
y_pred = logistic_model.predict(X_test)

## Model Evaluation

We'll now evaluate our model using various metrics such as Accuracy, Recall, Precision, and F1-score. We'll use the classification_report function from the sklearn library for this.

In [14]:
# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.91      4942
        >50K       0.74      0.61      0.67      1571

    accuracy                           0.85      6513
   macro avg       0.81      0.77      0.79      6513
weighted avg       0.85      0.85      0.85      6513



## Hyperparameter Tuning

The final step is tuning the hyperparameters of our model. We'll use Grid Search technique to find the optimal hyperparameters for our model.

In [15]:
# Define the parameter grid
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Initialize the Grid Search model
grid = GridSearchCV(LogisticRegression(), param_grid)

# Fit the model on the training data
grid.fit(X_train, y_train)

# Print the best parameters
print(grid.best_params_)

{'C': 1000}


## Building an Optimized Model

Given that the best value of C (Inverse of regularization strength) from our Grid Search is 1000, we'll now build a new Logistic Regression model with this optimal parameter.

In [16]:
# Initialize the Logistic Regression model with the optimal parameter
optimized_logistic_model = LogisticRegression(C=1000)

# Fit the model on the training data
optimized_logistic_model.fit(X_train, y_train)

## Making Predictions with the Optimized Model

We'll now use our optimized model to make predictions on the test data.

In [17]:
# Generate predictions using the test data
optimized_y_pred = optimized_logistic_model.predict(X_test)

## Evaluating the Optimized Model
We'll now evaluate our optimized model using various metrics such as Accuracy, Recall, Precision, and F1-score.

In [18]:
# Print the classification report for the optimized model
print(classification_report(y_test, optimized_y_pred))

              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.91      4942
        >50K       0.74      0.61      0.67      1571

    accuracy                           0.85      6513
   macro avg       0.81      0.77      0.79      6513
weighted avg       0.85      0.85      0.85      6513



With the above steps, we have built and optimized a Logistic Regression model for our data. However, it's always a good idea to try out other machine learning algorithms and compare the results. So, let's proceed to try out more models like Random Forest, Support Vector Machines, etc., following similar steps as above

## Building a Base Random Forest Model
Let's move to another algorithm, Random Forest, and build the base model first.

In [21]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier()

# Fit the model on the training data
rf_model.fit(X_train, y_train)

## Making Predictions with Random Forest

In [22]:
# Generate predictions using the test data
rf_y_pred = rf_model.predict(X_test)

## Evaluating the Random Forest Model

Now, we'll evaluate the Random Forest model using various metrics such as Accuracy, Recall, Precision, and F1-score.

In [23]:
# Print the classification report for the Random Forest model
print(classification_report(y_test, rf_y_pred))

              precision    recall  f1-score   support

       <=50K       0.89      0.93      0.91      4942
        >50K       0.74      0.63      0.68      1571

    accuracy                           0.86      6513
   macro avg       0.81      0.78      0.79      6513
weighted avg       0.85      0.86      0.85      6513



We'll also inspect the importance of each feature in our Random Forest model using the feature_importances_ attribute.

In [24]:
# Display feature importances
feature_importances = pd.DataFrame(rf_model.feature_importances_,
                                   index = X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)

                                             importance
fnlwgt                                     1.610008e-01
age                                        1.482948e-01
capital-gain                               8.956995e-02
marital-status_Married-civ-spouse          8.772845e-02
hours-per-week                             8.657489e-02
...                                                 ...
native-country_Outlying-US(Guam-USVI-etc)  2.606196e-05
occupation_Armed-Forces                    6.952215e-06
workclass_Never-worked                     6.940823e-06
native-country_Honduras                    1.082460e-06
native-country_Holand-Netherlands          4.203114e-10

[105 rows x 1 columns]


This gives us a ranked list of the features that contribute most to the model predictions. We can use this information to potentially simplify our model by removing features that don't contribute much.

Now let's optimize the Random Forest model as well.

## Optimizing the Random Forest Model

In [25]:
# Define the parameter grid for the Random Forest
rf_param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Initialize a GridSearchCV object to find the optimal parameters for Random Forest
rf_grid_search = GridSearchCV(rf_model, rf_param_grid, cv=5)

# Fit the GridSearchCV object on the training data
rf_grid_search.fit(X_train, y_train)

# Print the optimal parameters
print(rf_grid_search.best_params_)

{'max_depth': 20, 'min_samples_split': 10, 'n_estimators': 200}


##  Building an Optimized Random Forest Model
Now, using the best parameters from grid search, let's build an optimized Random Forest model.

In [26]:
# Initialize the optimized Random Forest model
rf_optimized = RandomForestClassifier(n_estimators=rf_grid_search.best_params_['n_estimators'],
                                      max_depth=rf_grid_search.best_params_['max_depth'],
                                      min_samples_split=rf_grid_search.best_params_['min_samples_split'])

# Fit the optimized model on the training data
rf_optimized.fit(X_train, y_train)

## Making Predictions with Optimized Random Forest
Now let's make predictions using the optimized Random Forest model.

In [27]:
# Generate predictions using the test data
rf_optimized_y_pred = rf_optimized.predict(X_test)

## Evaluating the Optimized Random Forest Model
Now let's evaluate the optimized Random Forest model using various metrics such as Accuracy, Sensitivity, Recall, Precision and F1-score.

In [28]:
# Print the classification report for the optimized Random Forest model
print(classification_report(y_test, rf_optimized_y_pred))

              precision    recall  f1-score   support

       <=50K       0.88      0.95      0.92      4942
        >50K       0.79      0.61      0.69      1571

    accuracy                           0.87      6513
   macro avg       0.84      0.78      0.80      6513
weighted avg       0.86      0.87      0.86      6513



## Building a Base SVM Model
Let's start by building a basic SVM model.

In [30]:
# Initialize the SVM model
svm_model = SVC()

# Fit the model on the training data
svm_model.fit(X_train, y_train)

## Making Predictions with SVM
Now let's make predictions using the SVM model.

In [31]:
# Generate predictions using the test data
svm_y_pred = svm_model.predict(X_test)

## Evaluating the SVM Model
Now let's evaluate the SVM model using various metrics such as Accuracy (Accuracy), Sensitivity (Recall), Precision (Precision) and F1-score.

In [None]:
# Print the classification report for the SVM model
print(classification_report(y_test, svm_y_pred))

## Optimizing the SVM Model
Finally, we will optimize the SVM model.

In [32]:
# Define the parameter grid for the SVM
svm_param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto'],
    'kernel': ['linear', 'rbf']
}

# Initialize a GridSearchCV object to find the optimal parameters for SVM
svm_grid_search = GridSearchCV(svm_model, svm_param_grid, cv=5)

# Fit the GridSearchCV object on the training data
svm_grid_search.fit(X_train, y_train)

# Print the optimal parameters
print(svm_grid_search.best_params_)

{'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}


## Building an Optimized SVM Model

Using the optimal parameters obtained from the grid search, let's build an optimized SVM model.

In [33]:
# Initialize the optimized SVM model
svm_optimized = SVC(C=svm_grid_search.best_params_['C'],
                    gamma=svm_grid_search.best_params_['gamma'],
                    kernel=svm_grid_search.best_params_['kernel'])

# Fit the optimized model on the training data
svm_optimized.fit(X_train, y_train)

## Making Predictions with Optimized SVM

Now let's make predictions using the optimized SVM model.

In [34]:
# Generate predictions using the test data
svm_optimized_y_pred = svm_optimized.predict(X_test)

## Evaluating the Optimized SVM Model

Now we'll evaluate the optimized SVM model using different metrics such as Accuracy, Recall, Precision, and F1-score.

In [35]:
# Print the classification report for the optimized SVM model
print(classification_report(y_test, svm_optimized_y_pred))

              precision    recall  f1-score   support

       <=50K       0.88      0.94      0.91      4942
        >50K       0.77      0.60      0.67      1571

    accuracy                           0.86      6513
   macro avg       0.83      0.77      0.79      6513
weighted avg       0.85      0.86      0.85      6513



After having tried and evaluated three different machine learning models (Logistic Regression, Random Forest, and SVM), we can conclude our analysis.

## Summary
In this notebook, we performed machine learning on the adult income dataset with the aim of predicting whether an individual's income exceeds $50K per year. We began by preparing the data for machine learning, including encoding categorical variables and splitting the data into training and test sets.

Subsequently, we trained several types of models on the data, including Logistic Regression, Random Forest, and Support Vector Machine (SVM). For each model, we used grid search to optimize the model's parameters, and then we evaluated the optimized models using several metrics like accuracy, precision, recall, and F1-score.

The performance of the models varied, with the Random Forest model outperforming the Logistic Regression in terms of accuracy, precision, and recall. The SVM model also performed well, but it took significantly more time to train due to its computational complexity.

Lastly, we visualized the learning curve for each model, which gives us insights about how the performance of our models varies with the amount of training data. All models showed an improvement as the size of the training set increased, suggesting that more data could improve the models' performance.

## Business Summary
From a business perspective, the insights gained from this machine learning exercise could be very valuable. The models allow us to predict an individual's income category based on various demographic and employment characteristics. This could be useful for various applications, such as targeted marketing or policy making.

Moreover, by inspecting the feature importances or coefficients in our models, we can gain insights into which factors are most predictive of an individual's income. For instance, in our models, variables like education level, marital status, and hours worked per week were found to be significant predictors. This information can help businesses and policy makers to better understand the income dynamics among adults.

However, it's important to note that while our models show promising results, they are not perfect and have room for improvement. Future work could involve trying out more advanced modeling techniques, feature engineering, or gathering more data to improve the models' predictive power.