# Homework 2: Classification Competition

#### COSC 410: Spring 2024, Colgate University

See HW2.pdf for more details. **Due Feb 26**

In [181]:
import pandas as pd
import sklearn
from sklearn import metrics
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
import random
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier

### ML Task Description

The `Lab3_train.csv` file contains 10 years' worth of daily weather observations from locations
across Australia, one row per day. The `RainTomorrow` column holds a binary label for each observation: `1` if it rained
on the following day, `0` if it did not. Your goal is to create an ML model that, given a
new weather observation, predicts whether it will rain on the day after
the observation. In other words, can you use machine learning to predict whether it will rain tomorrow
based on the weather today?

### Open Ended Questions

Answer the following questions (referencing your code in this notebook when appropriate).

1. Describe the data preprocessing steps your pipeline performs.



My pipeline fills missing values column by column: the mode for categorical features and the median for numerical features. It then converts each categorical feature to integer indices with `pd.factorize` and standardizes every feature (subtract the mean, divide by the standard deviation). I also tried a `KNNImputer` in place of the median/mode fill, but ran into a couple of issues, the biggest being runtime. With k=5 the program never finished; with k=1 it still took 10 minutes and the results were suboptimal. With a more powerful computer this might be the better method, but since I couldn't test different values of k, it wasn't practical.
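
The steps described above can be illustrated on a tiny frame. This is a minimal sketch, not the notebook's `preprocess` itself: the values are made up, and the `WindDir` column name is a hypothetical stand-in for a categorical feature.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; "WindDir" is a hypothetical categorical column.
toy = pd.DataFrame({
    "Humidity3pm": [40.0, np.nan, 60.0, 80.0],
    "WindDir": ["N", "S", None, "N"],
})

# Numerical columns: fill gaps with the column median.
for col in toy.select_dtypes(include="number").columns:
    toy[col] = toy[col].fillna(toy[col].median())

# Categorical columns: fill gaps with the mode, then map categories to integer indices.
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
    toy[col], _ = pd.factorize(toy[col])

filled = toy.copy()

# Standardize every column: x' = (x - mean) / sd
standardized = (filled - filled.mean()) / filled.std()
print(filled)
print(standardized)
```

The missing humidity value becomes the median of the observed values, and the missing direction becomes the most frequent category before encoding.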

2. What different models did you try to improve your performance? How did they perform relative to each other?

I tried several models: KNN, logistic regression, random forest, gradient boosting, and an SVM. Most of them scored an F1 of roughly 0.45-0.6. The exception was the SVM, which had very high precision but very low recall, and I had trouble tuning it to get good results. Overall, logistic regression performed the best, but only by a very small margin.
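
A comparison like this can be run in one loop. The sketch below is an illustration only: it uses synthetic, imbalanced data from `make_classification` as a stand-in for the weather data, and the hyperparameters are placeholder assumptions, so its scores say nothing about the numbers reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): roughly 20% positive class.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=0)

# Placeholder hyperparameters, not the tuned values from this notebook.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=15),
    "logreg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "svm": SVC(class_weight="balanced"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "grad_boost": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same split and record its validation F1.
f1_by_model = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    f1_by_model[name] = f1_score(y_valid, model.predict(X_valid))

for name, f1 in sorted(f1_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}: F1 = {f1:.3f}")
```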

3. What different hyperparameters did you attempt to optimize for the model you submitted? What values did you choose in the end?

I tinkered with just about every hyperparameter I could get my hands on. My current values are: `LogisticRegression(max_iter = 1000, multi_class = 'multinomial', C = 10, solver = 'saga', class_weight = 'balanced', penalty = 'elasticnet', l1_ratio = 1)`. Some of these parameters don't really make a difference, but removing them neither helped nor hurt, so I left them in for now. The values may still change before the final submission.
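
Manual tinkering of this kind can also be automated with a grid search. The following is a minimal sketch under stated assumptions: synthetic data instead of the weather data, and a small illustrative grid over `C` and `l1_ratio` rather than the values actually explored for this submission.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0)

base = LogisticRegression(
    solver="saga", penalty="elasticnet", class_weight="balanced", max_iter=2000
)

# Illustrative grid: l1_ratio=0 is pure L2, l1_ratio=1 is pure L1.
param_grid = {"C": [0.1, 1, 10, 100], "l1_ratio": [0.0, 0.5, 1.0]}

# 3-fold cross-validation, scored by F1 to match the competition metric.
search = GridSearchCV(base, param_grid, scoring="f1", cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring with `scoring="f1"` keeps the search aligned with the metric used in `score`, rather than the default accuracy.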

4. Did model selection or hyperparameter optimization make the most difference for improving model performance?

It's hard to say which made the bigger difference, since each model has its own hyperparameters. Both matter: choosing the right model comes first, and tuning its hyperparameters afterwards is what makes it succeed. I also gained some performance by changing the binary classification threshold from 0.5 to 0.64, which brought precision and recall to roughly the same value and increased the F1 score by 0.02.
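
One way to pick such a threshold systematically is to scan the precision-recall curve for the point with the highest F1. This is a sketch on synthetic data (an assumption), so the threshold it finds will not be 0.64; it only shows the mechanics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
proba = model.predict_proba(X_valid)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_valid, proba)
# The final precision/recall pair has no matching threshold, so drop it.
p, r = precision[:-1], recall[:-1]
# Guard against 0/0 when both precision and recall are zero.
f1 = 2 * p * r / np.maximum(p + r, 1e-12)
best_threshold = thresholds[np.argmax(f1)]

f1_default = f1_score(y_valid, proba >= 0.5)
f1_tuned = f1_score(y_valid, proba >= best_threshold)
print(f"threshold={best_threshold:.2f}  F1 at 0.5={f1_default:.3f}  F1 tuned={f1_tuned:.3f}")
```

Because the threshold is chosen on the same validation split it is scored on, the tuned F1 is an optimistic estimate; a separate held-out set would give a fairer number.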


### Preprocessing

Your initial task is to preprocess this dataset. This includes resolving missing features, encoding nominal features, and appropriately scaling all features. You'll implement the function `preprocess`. Blocks below point out some useful tricks for approaching this.

In [182]:
df = pd.read_csv('Lab3_train.csv')

In [183]:
def scale(df: pd.DataFrame) -> pd.DataFrame:
    """ x' = (x - mean)/sd 
    Args:
        df (pd.DataFrame): Dataframe to scale 
    Returns:
        pd.DataFrame having standardized features
        
    Note: Only apply after steps 1 and 2"""
    
    nonLabel = list(filter(lambda x: x != 'RainTomorrow', df.columns))
    
    # We don't want to scale our prediction 
    subset = df[nonLabel]
    # Map each feature to its mean and sd
    means = dict(subset.mean())
    sds = dict(subset.std())

    # Loop through and do the math
    for col in means:
        df[col] = (df[col] - means[col])/sds[col]
    return df

In [184]:
def preprocess(filename: str) -> pd.DataFrame: 
    """ Preprocess your data 

    Args:
        filename (str): Name of the csv file containing the data

    Returns: 
        pd.DataFrame: Dataframe with relevant preprocessing applied
    """
    
    df = pd.read_csv(filename)
    # Numerical features: fill missing values with the column median
    for col in df.select_dtypes(include=['float64', 'int64']).columns: 
        df[col] = df[col].fillna(df[col].median())
  
    # Categorical features: fill missing values with the mode, then encode as integer indices
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].fillna(df[col].mode()[0])
        df[col], _ = pd.factorize(df[col])

    df = scale(df)

    return df
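
Because `preprocess` is called separately on each file, the fill values, category indices, and scaling statistics are recomputed per file, so the same category can receive different indices in the train and test data. A minimal sketch of one alternative follows: learn everything from the training frame and reuse it on any other frame. The helper names `fit_preprocessor`, `apply_preprocessor`, and `_encode`, the `WindDir` column, and the choice to map unseen categories to `-1` are all assumptions for illustration.

```python
import pandas as pd

def _encode(features: pd.DataFrame, fill: dict, codes: dict) -> pd.DataFrame:
    """Fill missing values and integer-encode categorical columns."""
    out = features.fillna(fill)
    for col, mapping in codes.items():
        # Categories never seen in training map to -1 (an assumption).
        out[col] = out[col].map(mapping).fillna(-1).astype(int)
    return out

def fit_preprocessor(train: pd.DataFrame, label: str = "RainTomorrow") -> dict:
    """Learn fill values, category codes, and scaling statistics from training data."""
    features = train.drop(columns=[label], errors="ignore")
    numeric = list(features.select_dtypes(include="number").columns)
    categorical = [c for c in features.columns if c not in numeric]
    fill = {c: features[c].median() for c in numeric}
    fill.update({c: features[c].mode()[0] for c in categorical})
    codes = {
        c: {v: i for i, v in enumerate(features[c].fillna(fill[c]).unique())}
        for c in categorical
    }
    encoded = _encode(features, fill, codes)
    return {"fill": fill, "codes": codes, "mean": encoded.mean(), "std": encoded.std()}

def apply_preprocessor(df: pd.DataFrame, params: dict, label: str = "RainTomorrow") -> pd.DataFrame:
    """Apply training-time statistics to any frame; the label is left unscaled."""
    features = df.drop(columns=[label], errors="ignore")
    encoded = _encode(features, params["fill"], params["codes"])
    scaled = (encoded - params["mean"]) / params["std"]
    if label in df.columns:
        scaled[label] = df[label]
    return scaled

# Toy frames with made-up values.
train = pd.DataFrame({"Humidity3pm": [40.0, None, 60.0, 80.0],
                      "WindDir": ["N", "S", None, "N"],
                      "RainTomorrow": [0, 1, 0, 1]})
test = pd.DataFrame({"Humidity3pm": [None, 50.0],
                     "WindDir": ["S", "E"],
                     "RainTomorrow": [1, 0]})

params = fit_preprocessor(train)
test_scaled = apply_preprocessor(test, params)
print(test_scaled)
```

Here the test frame's missing humidity is filled with the training median, and `"S"` keeps the index it was given in training.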

In [185]:
data = preprocess('Lab3_train.csv')

## Train a Classifier

In [192]:
def fit_predict(train_fname: str, test_fname: str) -> np.array: 
    """ Fit a logistic regression model and return its predictions on test data 

    Args:
        train_fname (str): Name of the training file 
        test_fname (str): Name of the testing file
    Returns:
        np.array: Predictions of the model on test data

    Note: 
        Make sure you preprocess both your train and test data!"""
    #scaler = MinMaxScaler()
    train = preprocess(train_fname)
    test = preprocess(test_fname)
    #kept_features = ['Humidity3pm','Sunshine', 'Pressure3pm', 'Cloud3pm']
    #x_train = train[kept_features]
    x_train = train.drop('RainTomorrow', axis = 1)
    y_train = train['RainTomorrow']

    #x_test = test[kept_features]
    x_test = test.drop('RainTomorrow', axis = 1)
    model = LogisticRegression(max_iter = 1000, multi_class = 'multinomial', C = 100, solver = 'saga', class_weight = 'balanced', penalty = 'elasticnet', l1_ratio = 1)
    model.fit(x_train, y_train)
    probabilities = model.predict_proba(x_test)[:, 1] 
    y_pred = (probabilities >= .64)
    #y_pred = model.predict(x_test)

    return y_pred

In [187]:
def fit_predict_2(train_fname: str, test_fname: str) -> np.array: 
    """ Fit an SVM with a polynomial kernel and return its predictions on test data 

    Args:
        train_fname (str): Name of the training file 
        test_fname (str): Name of the testing file
    Returns:
        np.array: Predictions of the model on test data

    Note: 
        Make sure you preprocess both your train and test data!"""
    train = preprocess(train_fname)
    test = preprocess(test_fname)
    #kept_features = ['Humidity3pm','Sunshine', 'Pressure3pm', 'Cloud3pm']
    #x_train = train[kept_features]
    x_train = train.drop('RainTomorrow', axis = 1)
    y_train = train['RainTomorrow']

    #x_test = test[kept_features]
    x_test = test.drop('RainTomorrow', axis = 1)
    model = SVC(kernel = 'poly', class_weight = 'balanced', C = 100, max_iter = 3000)
    #model = LogisticRegression(max_iter = 1000)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)

    return y_pred

In [188]:
def fit_predict3(train_fname: str, test_fname: str) -> np.array: 
    """ Fit a k-nearest-neighbors classifier and return its predictions on test data 

    Args:
        train_fname (str): Name of the training file 
        test_fname (str): Name of the testing file
    Returns:
        np.array: Predictions of the model on test data

    Note: 
        Make sure you preprocess both your train and test data!"""
    train = preprocess(train_fname)
    x_train = train.drop('RainTomorrow', axis = 1)
    y_train = train['RainTomorrow']
    test = preprocess(test_fname)
    x_test = test.drop('RainTomorrow', axis = 1)
    model = KNeighborsClassifier(n_neighbors = 100)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)

    return y_pred

In [189]:
def fit_predict4(train_fname: str, test_fname: str) -> np.array: 
    """ Fit a random forest classifier and return its predictions on test data 

    Args:
        train_fname (str): Name of the training file 
        test_fname (str): Name of the testing file
    Returns:
        np.array: Predictions of the model on test data

    Note: 
        Make sure you preprocess both your train and test data!"""
    train = preprocess(train_fname)
    x_train = train.drop('RainTomorrow', axis = 1)
    y_train = train['RainTomorrow']
    test = preprocess(test_fname)
    x_test = test.drop('RainTomorrow', axis = 1)
    model = RandomForestClassifier(n_estimators = 100, random_state = 0)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)

    return y_pred

In [190]:
from sklearn.ensemble import GradientBoostingClassifier
def fit_predict5(train_fname: str, test_fname: str) -> np.array: 
    """ Fit a gradient boosting classifier and return its predictions on test data 

    Args:
        train_fname (str): Name of the training file 
        test_fname (str): Name of the testing file
    Returns:
        np.array: Predictions of the model on test data

    Note: 
        Make sure you preprocess both your train and test data!"""
    train = preprocess(train_fname)
    x_train = train.drop('RainTomorrow', axis = 1)
    y_train = train['RainTomorrow']
    test = preprocess(test_fname)
    x_test = test.drop('RainTomorrow', axis = 1)
    model = GradientBoostingClassifier(n_estimators = 100, random_state = 0, learning_rate = .1)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)

    return y_pred

In [193]:
def score(test_fname: str, Y_pred: np.array) -> tuple[float, float, float]:
    """ Return (precision, recall, F1) of Y_pred against the labels in test_fname """
    test = preprocess(test_fname)
    Y = test['RainTomorrow']

    precision = metrics.precision_score(Y, Y_pred)
    recall = metrics.recall_score(Y, Y_pred)
    f1 = metrics.f1_score(Y, Y_pred)

    return precision, recall, f1

Y_pred = fit_predict("Lab3_train.csv", "Lab3_valid.csv")
print(score('Lab3_valid.csv', Y_pred))

(0.6071495066095699, 0.6457425742574258, 0.6258516457153824)
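
As a quick sanity check on the tuple above, F1 is the harmonic mean of precision and recall, so the third value can be recomputed from the first two. The variable names below are illustrative.

```python
# Values copied from the output above: (precision, recall, F1).
precision, recall, f1_reported = 0.6071495066095699, 0.6457425742574258, 0.6258516457153824

# F1 = 2 * P * R / (P + R)
f1_from_pr = 2 * precision * recall / (precision + recall)
print(f1_from_pr, f1_reported)
```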
