<a href="https://colab.research.google.com/github/yoosufcancode/machinelearningCW/blob/main/Untitled2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Loading the Dataset

In this first stage, we prepare the environment for loading the dataset used in our income classification task. We mount Google Drive using Google Colab's `drive` module, which lets us read files stored in Drive directly from this notebook. This makes it straightforward to load the Census Income dataset for preprocessing and analysis.



In [1]:
#Loading the dataset
from google.colab import drive
drive.mount('/content/drive/')


Mounted at /content/drive/


## Specifying Dataset and Metadata Paths

With Google Drive mounted, we now specify the paths to the dataset and metadata files. Defining the paths for the training data, test data, index, and metadata in one place gives us structured access to every required file. Each path points to the file's location within the Google Drive directory structure, so the data can be loaded in the following steps.
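As a minimal sketch, the same paths could be built from a single base directory with `pathlib` and checked before any loading is attempted. The helper name `find_missing` is hypothetical and not part of this notebook; the directory and file names are the ones used in the path definitions of this section.

```python
from pathlib import Path

# Base directory used in this notebook (only exists once Google Drive is mounted)
DATA_DIR = Path('/content/drive/My Drive/ML_CW_DATASET')

# File names taken from the notebook's path definitions
FILE_NAMES = ['adult.data', 'adult.names', 'adult.test', 'index', 'old.adult.names']


def find_missing(base_dir, file_names):
    """Return the file names that do not exist under base_dir (hypothetical helper)."""
    base_dir = Path(base_dir)
    return [name for name in file_names if not (base_dir / name).is_file()]


missing = find_missing(DATA_DIR, FILE_NAMES)
if missing:
    print('Missing files:', missing)
else:
    print('All dataset files found.')
```

Checking up front tends to give a clearer message than a `FileNotFoundError` raised later in the middle of loading.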



In [2]:
# Set the paths to the dataset and metadata files
train_data_path = '/content/drive/My Drive/ML_CW_DATASET/adult.data'

metadata1_path = '/content/drive/My Drive/ML_CW_DATASET/adult.names'

test_data_path = '/content/drive/My Drive/ML_CW_DATASET/adult.test'

index_path = '/content/drive/My Drive/ML_CW_DATASET/index'

metadata2_path = '/content/drive/My Drive/ML_CW_DATASET/old.adult.names'

## Importing Necessary Libraries

Before we start analyzing and processing our data, we need to import several key Python libraries:

- `pandas`: Essential for data manipulation and analysis.
- `sklearn.model_selection`: Contains functions like `train_test_split` to divide our data into training and test sets.
- `sklearn.preprocessing`: Provides the `OneHotEncoder` for converting categorical variables into a form that could be provided to ML algorithms.
- `sklearn.naive_bayes`: Includes the `GaussianNB` classifier which is the Naïve Bayes algorithm for classification tasks.
- `sklearn.ensemble`: From this module, we are using the `RandomForestClassifier` for building a more complex model than Naïve Bayes.
- `sklearn.metrics`: It provides functions to assess the accuracy and performance of our models such as `accuracy_score` and `classification_report`.

These libraries will provide the necessary tools to preprocess the data, train the classification models, and evaluate their performance.


In [3]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

## Defining Column Names and Loading the Data

Because the dataset has no header row, we begin by defining `column_names` from the metadata in the `adult.names` file. These names cover the attributes we will use to predict income, plus the target variable itself.

We then load the training data with `pandas`, reading the CSV file with `header=None` and assigning the headers from `column_names`. Finally, we call `df.head()` to display the first few rows and confirm that the data has loaded correctly and the columns are named as expected.
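The sketch below illustrates header-less loading on an in-memory sample instead of the real file. The two rows are copied from the output shown in this section and written with a space after each comma, which is how the raw file appears to store them (the encoded column names printed later, such as `'workclass_ Federal-gov'`, carry that space). The `skipinitialspace=True` variant is an alternative this notebook does not use.

```python
import io
import pandas as pd

# Two sample rows: comma-separated, a space after each comma, no header row
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'sex',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
    'income'
]

# Default parsing keeps the leading space in every string field
df_default = pd.read_csv(io.StringIO(raw), header=None, names=column_names)

# skipinitialspace=True drops the whitespace that follows each delimiter
df_stripped = pd.read_csv(io.StringIO(raw), header=None, names=column_names,
                          skipinitialspace=True)

print(repr(df_default.loc[0, 'workclass']))   # ' State-gov'
print(repr(df_stripped.loc[0, 'workclass']))  # 'State-gov'
```

This notebook loads the file with the default settings, so later column names such as `'income_ >50K'` contain the leading space. Switching to `skipinitialspace=True` would require updating those names accordingly.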



In [4]:
# column names from 'adult.names' file
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'sex',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
    'income'
]

# Loading the data
df = pd.read_csv(train_data_path, header=None, names=column_names)

# Checking the first few rows to ensure it's loaded correctly
df.head()


| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## Handling Categorical Columns and Missing Values

We identify the categorical columns by selecting the columns with the `object` dtype, which typically denotes string values in a pandas DataFrame. The qualitative values in these columns must be encoded numerically before our machine learning models can process them.

Once the categorical columns are identified, we handle their missing values. Any missing value is filled with the most frequent value (the mode) of its column. This is a common approach that lets us keep rows with missing data instead of dropping them. Filling the gaps also ensures that our models will not encounter nulls, which could cause errors during training.
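To make mode imputation concrete, here is a small sketch on a toy DataFrame. The data is invented for illustration, and the categorical column is listed explicitly instead of being discovered through its dtype.

```python
import numpy as np
import pandas as pd

# Toy data with one missing categorical value
toy = pd.DataFrame({
    'workclass': ['Private', 'Private', np.nan, 'State-gov'],
    'age': [39, 50, 38, 53],
})

for column in ['workclass']:
    # mode() returns the most frequent values; [0] picks the first of them
    most_frequent = toy[column].mode()[0]
    # Assign the result back instead of filling in place on a column selection
    toy[column] = toy[column].fillna(most_frequent)

print(toy['workclass'].tolist())
# ['Private', 'Private', 'Private', 'State-gov']
```

Assigning the filled column back to the DataFrame avoids the chained in-place pattern, which recent pandas versions warn about.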



In [5]:
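# Identifying categorical columns (string columns have the 'object' dtype)
categorical_columns = df.select_dtypes(include=['object']).columns

# Filling missing values with the mode of each categorical column.
# The result is assigned back because in-place fillna on a column selection
# is deprecated in recent pandas versions.
for column in categorical_columns:
    df[column] = df[column].fillna(df[column].mode()[0])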


## Data Cleaning and Preprocessing for Model Training

Placeholder entries in the dataset, such as '?', may indicate missing values. Here, we import `numpy` for its `NaN` definition and replace such placeholders with `NaN` to standardize how missing values are represented.
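One caveat is worth sketching. The encoded column names printed later in this notebook (for example `'workclass_ Federal-gov'`) carry a leading space, which suggests the raw placeholder may be stored as `' ?'`, not `'?'`. Assuming that is the case, an exact-match replacement would miss it. The toy data below is invented to show the difference; stripping whitespace first is one possible remedy, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy data: some placeholders carry a leading space, one does not
toy = pd.DataFrame({
    'workclass': [' Private', ' ?', ' State-gov'],
    'occupation': ['?', ' Sales', ' ?'],
})

# Exact-match replacement only catches cells that equal '?' exactly
exact = toy.replace('?', np.nan)

# Stripping surrounding whitespace first also catches ' ?'
stripped = toy.apply(lambda col: col.str.strip()).replace('?', np.nan)

print(int(exact.isna().sum().sum()))     # 1
print(int(stripped.isna().sum().sum()))  # 3
```

Checking `df.isnull().sum()` after the replacement is a quick way to see whether any placeholders were actually converted.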


After replacing the placeholders, we check the dataset for missing values again. We repeat the approach used earlier for categorical columns and fill any missing values with the mode. This ensures that our DataFrame is clean and ready for further processing.

With missing values handled, we turn to encoding the categorical variables using `OneHotEncoder` from `sklearn.preprocessing`. One-hot encoding converts each categorical column into a set of binary indicator columns, a numeric format that machine learning algorithms can work with. To avoid multicollinearity, we drop the first level of each categorical feature.

The encoded categorical features are stored in a new DataFrame called `encoded_df`. We then remove the original categorical columns from our dataset and append the newly encoded ones.

After cleaning and preprocessing our dataset, we can go on to the next stage, which is dividing it into training and test sets before training our machine learning models.



In [6]:
import numpy as np

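# Replacing placeholders like '?' with NaN for missing values.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
# Note: this matches only cells whose value is exactly '?'.
df.replace('?', np.nan, inplace=True)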

# Checking again for missing values
missing_values = df.isnull().sum()

# Choosing a strategy to handle missing values. One option is
# df.dropna(inplace=True), which drops every row containing a NaN value.
# Here we instead fill missing values with the mode of each categorical column.
for column in categorical_columns:
    df[column] = df[column].fillna(df[column].mode()[0])

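# Encoding the categorical variables
from sklearn.preprocessing import OneHotEncoder

# Initializing the OneHotEncoder with dense output.
# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False.
encoder = OneHotEncoder(sparse_output=False, drop='first')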

# Selecting categorical columns and fit the encoder
categorical_columns = df.select_dtypes(include=['object']).columns
encoded_data = encoder.fit_transform(df[categorical_columns])

# Creating a DataFrame with the encoded variables
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_columns))

# Dropping original categorical columns and concatenate the encoded ones
df = df.drop(categorical_columns, axis=1)
df = pd.concat([df, encoded_df], axis=1)

# df is ready for splitting into train and test sets and then for model training




## Displaying Dataset Columns

Following the preprocessing and encoding of categorical variables, it is crucial to confirm that our DataFrame is structured correctly. The names of the columns in our now-transformed dataset are printed out in this phase. By doing this, we make sure that our dataset is prepared for the next phases of machine learning model development and that all anticipated changes have been implemented successfully.



In [7]:
print(df.columns)


Index(['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week', 'workclass_ Federal-gov', 'workclass_ Local-gov',
       'workclass_ Never-worked', 'workclass_ Private',
       ...
       'native_country_ Puerto-Rico', 'native_country_ Scotland',
       'native_country_ South', 'native_country_ Taiwan',
       'native_country_ Thailand', 'native_country_ Trinadad&Tobago',
       'native_country_ United-States', 'native_country_ Vietnam',
       'native_country_ Yugoslavia', 'income_ >50K'],
      dtype='object', length=101)


## Splitting the Data into Features and Target

Now that the data is prepared, we separate it into features and a target variable. The features (denoted `X`) are all columns except `'income_ >50K'`. We use that column as our target variable (`y`) because it indicates whether an individual's income exceeds $50,000 per year.

Separating the dataset this way lays the groundwork for supervised learning, which aims to predict the target variable from the input features. It is also a prerequisite for the next phase, splitting the dataset into training and testing sets, so that our model can learn from one subset of the data and be evaluated on a different subset.



In [8]:
# Splitting the data into features and target
X = df.drop('income_ >50K', axis=1)  # Make sure to use the correct column name
y = df['income_ >50K']


## Preparing Training and Testing Sets

We now divide our dataset into training and testing sets. This separation lets our models be trained on one subset of the data (the training set) and then assessed on a different subset (the testing set) to see how well they perform on data they have not seen before. We use the `train_test_split` function from `sklearn.model_selection`, setting aside 20% of the data for testing.
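As a rough illustration of an 80/20 split, the sketch below uses synthetic stand-in data. The `stratify=y` argument is an addition that this notebook does not use; it keeps the class proportions identical in both subsets, which can help when one class is much rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features, 25% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 25 + [0] * 75)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # fixed seed for a reproducible split
    stratify=y,        # assumption: preserve class proportions
)

print(X_train.shape, X_test.shape)     # (80, 3) (20, 3)
print(int(y_train.sum()), int(y_test.sum()))  # 20 5
```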

We also make sure that the training and testing feature sets are encoded consistently by applying `pd.get_dummies` to both. Because every categorical column was already one-hot encoded with `OneHotEncoder` during preprocessing, the features are all numeric at this point and `pd.get_dummies` leaves them unchanged; the call acts as a safeguard in case any categorical column remains. In general, encoding the two sets separately can yield different numbers of columns, because a category may appear in one set but not in the other.

To preserve consistency, we align the training and testing sets so that they have exactly the same columns. Giving our models input data in a consistent format is essential for both training and evaluation.
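The sketch below shows why alignment matters when two frames are encoded separately. The toy frames are invented; `join='inner'` is the option used in this notebook, while the `reindex` variant is an alternative shown only for comparison.

```python
import pandas as pd

# Toy frames: each contains a category the other lacks
train = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private']})
test = pd.DataFrame({'workclass': ['Private', 'Never-worked']})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# join='inner' keeps only the columns present in both frames
train_inner, test_inner = train_enc.align(test_enc, join='inner', axis=1)
print(train_inner.columns.tolist())  # ['workclass_Private']

# Alternative: keep every training column and fill unseen ones with 0
test_reindexed = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_reindexed.columns.tolist())
# ['workclass_Private', 'workclass_State-gov']
```

An inner join silently discards columns that appear in only one of the sets, whereas reindexing against the training columns keeps everything the model was trained on. Which is preferable depends on the use case.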



In [11]:
from sklearn.model_selection import train_test_split

# Splitting the data into features and target
X = df.drop('income_ >50K', axis=1)  # 'income_ >50K' is the encoded target column
y = df['income_ >50K']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Applying One-Hot Encoding to the categorical variables in the feature set
X_train_encoded = pd.get_dummies(X_train)
X_test_encoded = pd.get_dummies(X_test)


# Making sure both training and testing sets have the same columns after encoding
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='inner', axis=1)

# Now, X_train_encoded and X_test_encoded are ready for model training and evaluation


## Implementing the Naïve Bayes Classifier

With the data prepared, we can now apply a machine learning algorithm to the problem. We start with the Naïve Bayes classifier, using the `GaussianNB` implementation from `sklearn.naive_bayes`. Naïve Bayes models are widely used for classification tasks because they are simple, efficient, and often perform surprisingly well, particularly in text classification problems.

First, we initialize the `GaussianNB` model. This model assumes that, within each class, every feature follows a normal (Gaussian) distribution.

Next, we use our encoded training datasets (`X_train_encoded` and `y_train`) to train the model. The `.fit()` method in scikit-learn is commonly used to train a model by providing the training data and the corresponding labels.

Lastly, we make predictions on our test dataset (`X_test_encoded`) using the trained Naïve Bayes model. For this, a set of predictions is generated using the `.predict()` method, which is based on the features of the test set. The performance of the model will then be determined by comparing these predictions to the true labels (`y_test`).
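The fit-and-predict workflow described above can be sketched on synthetic data. The two well-separated Gaussian clusters below are invented for illustration and are unrelated to the census data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data: class 0 is centered at 0, class 1 at 4 (two features each)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X1 = rng.normal(loc=4.0, scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB()
model.fit(X, y)

# theta_ holds the per-class mean of each feature learned during fit
print(model.theta_.round(1))

new_points = np.array([[0.0, 0.0], [4.0, 4.0]])
print(model.predict(new_points))  # [0 1]
# predict_proba returns one probability per class for each point
print(model.predict_proba(new_points).round(3))
```

One-hot encoded indicator columns are far from normally distributed, so the Gaussian assumption is only an approximation for data such as this notebook's.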



In [12]:
from sklearn.naive_bayes import GaussianNB

# Initialize the Naïve Bayes model
nb_classifier = GaussianNB()

# Train the model
nb_classifier.fit(X_train_encoded, y_train)

# Make predictions on the test set
nb_predictions = nb_classifier.predict(X_test_encoded)


## Utilizing the Random Forest Classifier

We now turn to the Random Forest classifier, a more sophisticated and generally more powerful algorithm than Naïve Bayes. This model, provided by the `sklearn.ensemble` module, is known for its high accuracy, its efficiency on large datasets, and its adaptability to many kinds of classification problems.

To produce a more reliable and accurate prediction, the Random Forest method builds several decision trees and combines their outputs. The individual trees are trained on different bootstrap samples of the dataset, a technique known as bootstrap aggregating (or bagging), and the resulting variety among the trees contributes to the robustness of the overall model.

To make our results reproducible, we initialize the `RandomForestClassifier` with a fixed `random_state`. We then train the model on our encoded training dataset. Fitting the model to the data is how it learns the relationships between the features and the target variable.

We use the encoded test set to generate predictions once the model has been trained. As with the Naïve Bayes model evaluation, the Random Forest model's predictions will be compared to the true labels in order to assess its performance.
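To illustrate how the forest combines its trees, the sketch below fits a small forest on synthetic data and reproduces its probability output by averaging the individual trees. The data and the choice of `n_estimators=50` are assumptions made for illustration. In scikit-learn the trees are combined by averaging their predicted class probabilities, not by a hard majority vote.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on the first of three features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X, y)

print(len(forest.estimators_))  # 50

# Average the per-tree class probabilities by hand for the first five rows
manual = np.mean([tree.predict_proba(X[:5]) for tree in forest.estimators_], axis=0)
print(np.allclose(manual, forest.predict_proba(X[:5])))

# Importances sum to 1; the first feature should dominate here
print(forest.feature_importances_.round(2))
```

The `feature_importances_` attribute offers a quick view of which inputs the forest relied on most.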



In [13]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
rf_classifier = RandomForestClassifier(random_state=42)

# Train the model
rf_classifier.fit(X_train_encoded, y_train)

# Make predictions on the test set
rf_predictions = rf_classifier.predict(X_test_encoded)


In [14]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate Naïve Bayes
nb_accuracy = accuracy_score(y_test, nb_predictions)
print("Naïve Bayes Accuracy:", nb_accuracy)
print(classification_report(y_test, nb_predictions))

# Evaluate Random Forest
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Random Forest Accuracy:", rf_accuracy)
print(classification_report(y_test, rf_predictions))


Naïve Bayes Accuracy: 0.7990173499155535
              precision    recall  f1-score   support

         0.0       0.81      0.95      0.88      4942
         1.0       0.68      0.32      0.43      1571

    accuracy                           0.80      6513
   macro avg       0.75      0.64      0.66      6513
weighted avg       0.78      0.80      0.77      6513

Random Forest Accuracy: 0.8582834331337326
              precision    recall  f1-score   support

         0.0       0.89      0.93      0.91      4942
         1.0       0.74      0.63      0.68      1571

    accuracy                           0.86      6513
   macro avg       0.82      0.78      0.80      6513
weighted avg       0.85      0.86      0.85      6513
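To help read the reports above, the following sketch derives precision, recall, and F1 for the positive class from a confusion matrix. The labels are a small invented example, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels: six negatives and four positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571

# The library functions give the same values
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Read this way, the Naïve Bayes recall of 0.32 for class 1.0 means that roughly a third of the 1,571 higher-income individuals in the test set were identified, compared with 0.63 for the Random Forest.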

