# Import Packages

In [1]:
import os  # Interact with the operating system
import sys  # Read system parameters
import warnings  # Suppress warnings
from io import StringIO  # Read and write strings as files
from time import time  # Calculate training time
from urllib.parse import urlparse  # Parse a URL and extract its path component

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests  # Send HTTP requests
import seaborn as sns
import sklearn

# Models, metrics, and utilities from scikit-learn
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    RandomForestRegressor,
)
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings("ignore")

# Summarize software libraries used.
print("Libraries used in this project:")
print("- NumPy {}".format(np.__version__))
print("- pandas {}".format(pd.__version__))
print("- scikit-learn {}".format(sklearn.__version__))
print("- Matplotlib {}".format(matplotlib.__version__))
print("- Seaborn {}".format(sns.__version__))
print("- Python {}\n".format(sys.version))
print("\nSummary of software packages complete.\n")

Libraries used in this project:
- NumPy 1.22.3
- pandas 1.4.1
- scikit-learn 1.2.2
- Matplotlib 3.8.4
- Seaborn 0.13.0
- Python 3.9.20 (main, Oct  3 2024, 02:27:54) 
[Clang 14.0.6 ]


Summary of software packages complete.



# Load the Dataset

In [2]:
# Capture URL
url = "https://raw.githubusercontent.com/tyrantdavis/datasets/refs/heads/main/census-adult.data"
response = requests.get(url)
response.raise_for_status()  # Fail early if the download did not succeed

# LOAD DATA
# Save as data frame and copy
census = pd.read_csv(StringIO(response.text))
df = census.copy()

# Parse URL
parsed_url = urlparse(url)
path = parsed_url.path
filename = os.path.basename(path)


print(f"Loaded {len(df)} census records from {filename}.")

Loaded 32561 census records from census-adult.data.


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1    workclass       32561 non-null  object
 2    fnlwgt          32561 non-null  int64 
 3    education       32561 non-null  object
 4    education-num   32561 non-null  int64 
 5    marital-status  32561 non-null  object
 6    occupation      32561 non-null  object
 7    relationship    32561 non-null  object
 8    race            32561 non-null  object
 9    sex             32561 non-null  object
 10   capital-gain    32561 non-null  int64 
 11   capital-loss    32561 non-null  int64 
 12   hours-per-week  32561 non-null  int64 
 13   native-country  32561 non-null  object
 14   income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


**Features** 

- age: continuous
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
- sex: Female, Male
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week: continuous
- native-country: discrete
- income: discrete (>50K, <=50K)

**Spotlights** 

- The dataset comprises 32,561 rows and 15 columns.
- Of these columns, 6 hold integers (int64) and 9 hold strings (object).
- df.info() reports no null values in any column.
- Every column name except **age** carries a leading space, which will be stripped to make the columns easier to access.
- The **income** classes will be encoded as 0 and 1 instead of **<=50K** and **>50K**.
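
A non-null count only covers true nulls. The original UCI Adult data marks unknown values with a ? string, which pandas reads as ordinary text. Whether this copy keeps that convention is an assumption, so the sketch below only shows how such a check could look on a made-up toy frame (toy and placeholder_counts are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for df; the values are made up and keep the
# leading spaces seen in the raw column values.
toy = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": [" State-gov", " ?", " Private"],
        "native-country": [" United-States", " United-States", " ?"],
    }
)

# Keep only the non-numeric (text) columns, strip whitespace,
# then count "?" placeholders per column.
text_cols = toy.select_dtypes(exclude=["number"])
placeholder_counts = text_cols.apply(lambda s: s.str.strip().eq("?").sum())
print(placeholder_counts)
```

Running the same two lines against df would show whether any text column hides placeholder values.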

# Examine Distribution of Income

Before examining the income distribution, a bit of preparation is needed. Every column name except age has a leading space, which makes the columns awkward to access. The next cells strip that whitespace from the column names and from the text values.

In [4]:
df.columns

Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',
       ' marital-status', ' occupation', ' relationship', ' race', ' sex',
       ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',
       ' income'],
      dtype='object')

### Remove white space from column names

In [5]:
df.columns = df.columns.str.replace(" ", "")

# Clean columns by stripping extra whitespace for columns of type "object"
for c in df.select_dtypes(include=["object"]).columns:
    df[c] = df[c].str.strip()

print("White space removed from the column names.\n\n")
df.columns

White space removed from the column names.




Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')
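
As an alternative sketch, the whitespace can also be dropped while parsing: pd.read_csv() accepts skipinitialspace=True, which skips spaces that follow a delimiter. The tiny CSV string below is made up to mimic the file's layout, and raw and clean are illustrative names:

```python
from io import StringIO

import pandas as pd

# Made-up sample in the same "comma + space" layout as the census file.
raw = "age, workclass, income\n39, State-gov, <=50K\n50, Private, >50K\n"

# skipinitialspace=True drops the spaces after each comma,
# for the header row as well as the values.
clean = pd.read_csv(StringIO(raw), skipinitialspace=True)

print(list(clean.columns))
print(clean["workclass"].tolist())
```

This only handles spaces after a delimiter, so the explicit str.strip() pass remains the more general cleanup.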

In [6]:
print(df["income"].value_counts(normalize=True))

<=50K    0.75919
>50K     0.24081
Name: income, dtype: float64


**Spotlights** 

- Roughly 76% make 50k or less while roughly 24% make more than 50k.
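
This split also sets a floor for any classifier: always predicting the majority class would already be right about 76% of the time. A minimal sketch of that baseline with scikit-learn's DummyClassifier, using made-up labels that mirror the observed proportions (y_toy, X_toy, and baseline_acc are illustrative names):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels mirroring the observed split: 76 of 100 are class 0 (<=50K).
y_toy = np.array([0] * 76 + [1] * 24)
X_toy = np.zeros((len(y_toy), 1))  # This baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

Any model trained later is worth keeping only if it clearly beats this number.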

# Create a features dataframe X

X is built from only the attributes named in the features list. Categorical features are converted to dummy variables with pd.get_dummies(). Setting drop_first=True drops the first category of each categorical feature, since that category is implied whenever all of its remaining dummy columns are 0.

In [7]:
features = ["age", "capital-gain", "capital-loss", "hours-per-week", "sex", "race"]

"""Create feature dataframe X with feature columns and dummy variables for categorical features"""
X = pd.get_dummies(df[features], drop_first=True)

**Examine X Samples**

In [8]:
X.head()

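| | age | capital-gain | capital-loss | hours-per-week | sex_Male | race_Asian-Pac-Islander | race_Black | race_Other | race_White |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 2174 | 0 | 40 | 1 | 0 | 0 | 0 | 1 |
| 1 | 50 | 0 | 0 | 13 | 1 | 0 | 0 | 0 | 1 |
| 2 | 38 | 0 | 0 | 40 | 1 | 0 | 0 | 0 | 1 |
| 3 | 53 | 0 | 0 | 40 | 1 | 0 | 1 | 0 | 0 |
| 4 | 28 | 0 | 0 | 40 | 0 | 0 | 1 | 0 | 0 |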
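
To see what drop_first=True changes, the sketch below encodes a made-up three-row frame both ways and compares the resulting columns (toy, full, and reduced are illustrative names):

```python
import pandas as pd

# Made-up rows with one numeric and two categorical features.
toy = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "sex": ["Male", "Female", "Male"],
        "race": ["White", "Black", "Other"],
    }
)

full = pd.get_dummies(toy)                      # One column per category
reduced = pd.get_dummies(toy, drop_first=True)  # First category of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
```

Categories are ordered alphabetically, so sex_Female and race_Black are the columns dropped here, just as sex_Female and race_Amer-Indian-Eskimo are absent from X.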


# Create the Output Variable 'y'

This is a binary classification target. y is 0 if the income is at most 50K (<=50K) and 1 if it exceeds 50K (>50K).

In [9]:
"""Create y"""

y = np.where(df.income == "<=50K", 0, 1)

"""Examine y"""
y

array([0, 0, 0, ..., 0, 0, 1])

# Partition the Dataset

This step generates the training and test sets, holding out 20% of the records for testing. Setting random_state=1 makes the split reproducible.

In [10]:
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)
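
Because the classes are imbalanced, a purely random split can leave the two sets with slightly different class shares. train_test_split() accepts a stratify argument that keeps the proportions equal. The notebook does not use it; the sketch below only illustrates the option on made-up labels (X_toy, y_toy, and the split names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 75 samples of class 0 and 25 of class 1.
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the 75/25 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(f"Positive share in train: {y_tr.mean():.2f}")
print(f"Positive share in test:  {y_te.mean():.2f}")
```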

# Instantiate Random Forest Classifier

To begin, we will create an instance of the RandomForestClassifier() using its default settings. Next, we will train the model with the training data and then evaluate its accuracy on the test data, which will serve as a baseline for comparing the performance of other models.

In [11]:
"""Instantiate random forest classifier, fit and score with default parameters"""

rf = RandomForestClassifier()
rf.fit(x_train, y_train)
rf_accuracy = rf.score(x_test, y_test)  # Score once and reuse the result
print(f"Accuracy score for default random forest: {round(rf_accuracy * 100, 3)}%")

Accuracy score for default random forest: 82.005%


**Spotlights** 

The default random forest reaches roughly 82% accuracy on the test set, about 6 percentage points above the 76% share of the majority class.
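
With imbalanced classes, accuracy alone can hide how well the minority class is predicted. The sketch below shows how the confusion matrix, precision, and recall could be inspected. It runs on synthetic data from make_classification rather than the census data, so the numbers it prints say nothing about the model above, and all variable names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a similar class imbalance (about 76% / 24%).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=6, weights=[0.76, 0.24], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])
precision = precision_score(y_te, y_hat, zero_division=0)
recall = recall_score(y_te, y_hat, zero_division=0)

print(cm)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}")
```

In the notebook, the same three calls would take y_test and rf.predict(x_test) as arguments.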

# Build and Tune Random Forest Classifiers by Depth

Next, tune the max_depth hyperparameter. For each depth from 1 to 25, train a random forest and record its accuracy on both the training and test sets, so the results can be compared with the default baseline.

In [12]:
"""Tune the hyperparameter max_depth over a range from 1-25, save scores for test and train set"""

np.random.seed(0)
accuracy_train = []
accuracy_test = []
depths = range(1, 26)
for i in depths:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(x_train, y_train)
    accuracy_test.append(accuracy_score(y_test, rf.predict(x_test)))
    accuracy_train.append(accuracy_score(y_train, rf.predict(x_train)))
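
Plotting the two accuracy lists against depth makes overfitting visible: training accuracy typically keeps rising with depth while test accuracy levels off. The sketch below repeats a shortened version of the loop on synthetic data and draws both curves. The data, the reduced depth range, and names such as acc_train and acc_test are assumptions made for illustration:

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census features and labels.
X_syn, y_syn = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1
)

depths = range(1, 11)
acc_train, acc_test = [], []
for depth in depths:
    model = RandomForestClassifier(max_depth=depth, n_estimators=25, random_state=0)
    model.fit(X_tr, y_tr)
    acc_train.append(accuracy_score(y_tr, model.predict(X_tr)))
    acc_test.append(accuracy_score(y_te, model.predict(X_te)))

# One line per data set, with depth on the x-axis.
fig, ax = plt.subplots()
ax.plot(depths, acc_train, label="train")
ax.plot(depths, acc_test, label="test")
ax.set_xlabel("max_depth")
ax.set_ylabel("accuracy")
ax.set_title("Random forest accuracy by depth")
ax.legend()
```

Inside the notebook, the matplotlib.use() line can be dropped and the plotting lines applied directly to depths, accuracy_train, and accuracy_test.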

# Find the Largest Accuracy and its Depth

In [13]:
"""Find the best accuracy and at what depth that occurs"""

best_acc = np.max(accuracy_test)
best_depth = depths[np.argmax(accuracy_test)]
print(f"The highest test accuracy is achieved at depth: {best_depth}")
print(f"The highest accuracy on the test set is: {round(best_acc*100,3)}%")

The highest test accuracy is achieved at depth: 12
The highest accuracy on the test set is: 83.464%
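
Choosing the depth by test accuracy means the test set takes part in model selection, so the best score may be somewhat optimistic. A common alternative, not used in this notebook, is to pick the depth by cross-validation on the training data and score on the test set once at the end. A minimal sketch with GridSearchCV on synthetic data (the data, the grid values, and the variable names are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the census features and labels.
X_syn, y_syn = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1
)

# Candidate depths are compared with 3-fold cross-validation on the training set only.
param_grid = {"max_depth": [2, 4, 8, 12]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_depth_cv = search.best_params_["max_depth"]
test_acc = search.score(X_te, y_te)  # Single final check on held-out data

print(f"Depth chosen by cross-validation: {best_depth_cv}")
print(f"Test accuracy of the refit model: {test_acc:.3f}")
```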
