![alt text](https://www.nlab.org.uk/wp-content/uploads/nlab.png)
# ML Practical 3: Evaluation of multiple models

## The task.

Task: Predict whether a person makes over $50k per year from census data known about them.

Data set from the paper: Kohavi, Ron. "Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid." KDD. Vol. 96. 1996.
Data URL: We will be using modified versions of the publicly available data. Please download the data from the URLs provided.

**Output Feature:**

| Feature |     Type    |    Values   |
|:-------:|:-----------:|:-----------:|
|  salary | categorical | >50K, <=50K |

**Input features**

|     Feature    |     Type    |                                                                                                                                                                                                              Values                                                                                                                                                                                                             |
|:--------------:|:-----------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
|       age      |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                               |
|    workclass   | categorical |                                                                                                                                                              Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked                                                                                                                                                              |
|     fnlwgt     |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|    education   | categorical |                                                                                                                                      Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.                                                                                                                                     |
|  education-num |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| marital-status | categorical |                                                                                                                                                            Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.                                                                                                                                                           |
|   occupation   | categorical |                                                                                                    Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.                                                                                                    |
|  relationship  | categorical |                                                                                                                                                                               Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.                                                                                                                                                                               |
|      race      | categorical |                                                                                                                                                                                   White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.                                                                                                                                                                                  |
|       sex      | categorical |                                                                                                                                                                                                          Female, Male.                                                                                                                                                                                                          |
|  capital-gain  |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|  capital-loss  |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| hours-per-week |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| native-country | categorical | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. |


# To help you along, some of the basic data preparation has been done for you.
Read the code. Understand what has been done.

In [1]:
# Some basic imports
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV

# Read the data into a pandas DataFrame
data = pd.read_csv('https://drive.google.com/uc?export=download&id=1lBiNrYBk5KdfBllyjuELgRwYaK4yT2z9', header = 0, names = ['age','workclass','fnlwgt','education','education-num','matrial-status','occupation','relationship','race','sex','captial-gain','captial-loss','hours-per-week','salary'])

In [5]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,matrial-status,occupation,relationship,race,sex,captial-gain,captial-loss,hours-per-week,salary
0,25,Private,292058,HS-grad,9,Never-married,Other-service,Other-relative,White,Male,0,0,30,<=50K
1,28,Private,285294,Bachelors,13,Married-civ-spouse,Sales,Wife,Black,Female,15024,0,45,>50K
2,31,Private,113364,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,>50K
3,33,Federal-gov,29617,Some-college,10,Divorced,Other-service,Not-in-family,Black,Male,0,0,40,<=50K
4,34,Private,157289,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,<=50K


In [4]:
data.describe()

Unnamed: 0,age,fnlwgt,education-num,captial-gain,captial-loss,hours-per-week
count,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0
mean,38.453714,186912.8,10.181143,1121.488286,94.314286,40.262571
std,13.683679,105449.8,2.408376,7554.172001,420.050908,12.326367
min,17.0,19302.0,1.0,0.0,0.0,1.0
25%,28.0,115648.2,9.0,0.0,0.0,40.0
50%,37.0,175335.0,10.0,0.0,0.0,40.0
75%,47.0,233419.5,12.25,0.0,0.0,45.0
max,90.0,1033222.0,16.0,99999.0,3683.0,99.0


In [6]:
# Define our input features and our output feature
# Call our input features X and our output feature y (the sklearn standard)
# Note that we have categorical features.
X = data.drop( columns = 'salary' )
y = data.salary

In [8]:
X.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,matrial-status,occupation,relationship,race,sex,captial-gain,captial-loss,hours-per-week
0,25,Private,292058,HS-grad,9,Never-married,Other-service,Other-relative,White,Male,0,0,30
1,28,Private,285294,Bachelors,13,Married-civ-spouse,Sales,Wife,Black,Female,15024,0,45
2,31,Private,113364,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40
3,33,Federal-gov,29617,Some-college,10,Divorced,Other-service,Not-in-family,Black,Male,0,0,40
4,34,Private,157289,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40


In [9]:
# Now we need to encode our output feature as an integer, 0 or 1.
# This is a binary classification problem, and in order to use sklearn's built-in
# evaluation measures we need one class defined as 1 (target) and one as 0 (non-target).

# We could do this by using the LabelEncoder from sklearn. The LabelEncoder converts n distinct values
# to the integers 0, ..., n-1, which in this case gives us a 0/1 encoding. We assume that our training set
# contains both labels and that this mapping will be valid. However, the LabelEncoder assigns the integers
# in sorted label order, so we do not get to choose which value is represented by 1 and which by 0.
# Therefore it is easier (in terms of subsequent interpretation) to do this
# manually. Recall the problem: we want our target class (1) to be '>50K'.

# To do this (your variable y is a pandas.Series object, so use its replace method):
# 1) update all values '<=50K' within y to equal 0
# 2) update all values '>50K' within y to equal 1
# Note: the strings matched below include a leading space (' <=50K', ' >50K');
# replace only changes values that match exactly.

y.replace(to_replace = ' <=50K', value = 0, inplace = True)
y.replace(to_replace = ' >50K', value = 1, inplace = True)
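
The difference between the automatic and the manual encoding can be seen on a handful of labels. The following is a minimal sketch on a hypothetical toy Series (`y_raw`, `label_map` and the other names are illustrative and not part of the practical); it assumes the labels carry a leading space, as the `replace` calls above do, and uses `Series.map` with an explicit dictionary as one alternative way to fix the mapping by hand.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, written with the leading space that the replace() calls match.
y_raw = pd.Series([' <=50K', ' >50K', ' <=50K', ' >50K', ' >50K'])

# LabelEncoder assigns integers in sorted order of the class labels,
# so the mapping depends on how the strings happen to sort.
encoder = LabelEncoder()
y_auto = encoder.fit_transform(y_raw)
print(list(encoder.classes_))   # position in this list is the assigned integer

# Explicit mapping: strip whitespace, then state which class is the target (1).
label_map = {'<=50K': 0, '>50K': 1}
y_manual = y_raw.str.strip().map(label_map)
print(y_manual.tolist())
```

With an explicit dictionary, any label that is not listed becomes `NaN`, which makes unexpected values easy to spot with `y_manual.isna().sum()`.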



In [10]:
# The baseline classifier for you to use
lr_model = Pipeline([
    ('onehot',OneHotEncoder(handle_unknown='ignore',sparse_output = False)),  # inside a Pipeline this encodes every column it receives, numeric ones included (a ColumnTransformer could restrict it to the string columns)
    ('standardize', StandardScaler()), # will convert everything (can't specify which columns but all columns are fine after onehot)
    ('model',LogisticRegression(solver = 'liblinear') )
    ])



# Your turn. See the instructions in the slide deck...

In [11]:
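# Hold out 30% of the data as the final test set.
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,
    random_state=42
)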

In [12]:
# Hyperparameter values to try for the random forest
n_estimators_list = [50, 100]
max_depth_list = [10, 20, None]

list_of_models = []

# Build one (unfitted) pipeline per combination: 2 x 3 = 6 candidate models
for n in n_estimators_list:
    for depth in max_depth_list:

        model = Pipeline([
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
            ('rf', RandomForestClassifier(
                n_estimators=n,
                max_depth=depth,
                random_state=42
            ))
        ])

        list_of_models.append(model)

In [13]:
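# Split the training set again: 70% to fit the candidates, 30% to validate them.
# The test set is not touched during model selection.
X_subtrain, X_valid, y_subtrain, y_valid = train_test_split(
    X_train, y_train,
    test_size=0.3,
    stratify=y_train,
    random_state=42
)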
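
The two nested splits determine how many rows each stage sees. With the 3500 rows reported by `data.describe()`, a 30% test split leaves 2450 training rows and 1050 test rows, and a further 30% validation split leaves 1715 sub-training rows and 735 validation rows. The sketch below reproduces that arithmetic on hypothetical stand-in arrays (`X_demo`, `y_demo` and the `*_d` names are illustrative only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the practical's data set.
n_rows = 3500
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = (np.arange(n_rows) % 4 == 0).astype(int)   # arbitrary 25% positive class

# First split: training set vs final test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Second split: sub-training set vs validation set, taken from the training set only.
X_sub, X_val, y_sub, y_val = train_test_split(
    X_tr, y_tr, test_size=0.3, stratify=y_tr, random_state=42
)

print(len(X_tr), len(X_te))     # training / test sizes
print(len(X_sub), len(X_val))   # sub-training / validation sizes
```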

In [14]:
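results = []

# Fit each candidate on the sub-training set and record its accuracy on the validation set
for model in list_of_models:
    model.fit(X_subtrain, y_subtrain)
    score = model.score(X_valid, y_valid)
    results.append(score)

# Select the candidate with the highest validation accuracy
best_index = np.argmax(results)
best_model = list_of_models[best_index]

print("Validation Scores:", results)
print("Best Validation Score:", results[best_index])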

Validation Scores: [0.7795918367346939, 0.8258503401360544, 0.8244897959183674, 0.7768707482993197, 0.8217687074829932, 0.8312925170068027]
Best Validation Score: 0.8312925170068027
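
On a validation set of 735 rows, the gap between the best score (0.8313) and the runner-up (0.8259) corresponds to 4 rows, so a single hold-out split may not separate the candidates reliably. One common alternative is k-fold cross-validation over the same grid. The following is a minimal sketch using `GridSearchCV` and `StratifiedKFold` (both imported at the top of the notebook) on synthetic data from `make_classification`; the data, the smaller grid values and the names `pipe`, `param_grid` and `search` are assumptions made for illustration, and on the census data the one-hot step would come before the forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (numeric only), purely for illustration.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ('rf', RandomForestClassifier(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'rf__n_estimators': [10, 20],
    'rf__max_depth': [3, None],
}

# Each candidate is scored on 5 stratified folds instead of one validation split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='accuracy')
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best candidate:", search.best_score_)
```

By default `GridSearchCV` refits the best candidate on all the data passed to `fit`, so `search.best_estimator_` can be scored directly on a held-out test set.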


In [15]:
# Refit the selected model on the full training set, then score it once on the held-out test set
best_model.fit(X_train, y_train)
test_score = best_model.score(X_test, y_test)

print("Final Test Accuracy:", test_score)

Final Test Accuracy: 0.8276190476190476
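
Accuracy is only one of sklearn's built-in evaluation measures. Because the labels were encoded so that 1 means '>50K', measures such as precision and recall (which treat 1 as the positive class by default) describe the target class directly. The following is a small worked sketch on hypothetical labels and predictions, not on the model's actual output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = '>50K', 0 = '<=50K').
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
print("Accuracy: ", accuracy_score(y_true_demo, y_pred_demo))    # (3 + 3) / 8
print("Precision:", precision_score(y_true_demo, y_pred_demo))   # 3 / (3 + 1)
print("Recall:   ", recall_score(y_true_demo, y_pred_demo))      # 3 / (3 + 1)
```

On the practical's data the same calls would take `y_test` and `best_model.predict(X_test)` as arguments.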


In [16]:
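# Refit the selected model on all available data for deployment.
# fit returns the pipeline itself, so deploy_model and best_model are the same object.
deploy_model = best_model.fit(X, y)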
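
Since `fit` returns the estimator it was called on, refitting overwrites the previously fitted model. If the version that was scored on the test set should be kept alongside the deployed one, `sklearn.base.clone` gives an unfitted copy with the same hyperparameters. The following is a minimal sketch on a hypothetical toy problem (`selected`, `deployed_same` and `deployed_copy` are illustrative names).

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, purely for illustration.
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [0, 0, 1, 1]

selected = LogisticRegression()

# fit returns the same object, so both names point at one model.
deployed_same = selected.fit(X_demo, y_demo)
print(deployed_same is selected)

# clone creates a new, unfitted estimator with identical hyperparameters.
deployed_copy = clone(selected).fit(X_demo, y_demo)
print(deployed_copy is selected)
```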