All four classifiers performed similarly on the test set, with accuracy between about 0.47 and 0.49. Training accuracy was much higher: random forest and gradient boosting reached 1.0, while SVM and logistic regression scored about 0.72 and 0.66, respectively.
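
To put a test accuracy near 0.5 in context, it can help to compare it with a majority-class baseline. The sketch below uses scikit-learn's DummyClassifier on made-up labels; the label counts are hypothetical and are not the real Walc distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels on a 1-5 scale (assumption: NOT the real Walc counts).
y_toy = np.array([1] * 40 + [2] * 25 + [3] * 20 + [4] * 10 + [5] * 5)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

If the real baseline turned out to be close to the scores reported here, that would suggest the models are learning little beyond the class frequencies.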

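I handled missing values by dropping incomplete rows with dropna. I encoded the categorical features with one-hot encoding (OneHotEncoder) because they are nominal. I scaled the numeric features with RobustScaler, which rescales by the interquartile range rather than removing outliers, and then standardized them with StandardScaler.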
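
As a minimal sketch of the difference between robust scaling and actually removing outliers, the toy example below (made-up numbers, not taken from student-mat.csv) scales a column with RobustScaler and, separately, filters it with a 1.5 * IQR rule. The 1.5 multiplier is a common convention chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value (illustrative data only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler subtracts the median and divides by the IQR.
# Every row is kept; the extreme value is only rescaled.
scaled = RobustScaler().fit_transform(x)
rows_after_scaling = scaled.shape[0]

# An explicit IQR filter drops rows that fall outside the fences.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mask = ((x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)).ravel()
filtered = x[mask]

print(f"Rows after RobustScaler: {rows_after_scaling}")
print(f"Rows after IQR filter:   {filtered.shape[0]}")
```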

SVM and logistic regression performed relatively poorly on the test set and only moderately on the training set (about 0.72 and 0.66), possibly indicating too much bias.

Random forest and gradient boosting may have too much variance: both scored 1.0 on the training set but below 0.5 on the test set.
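
One way to probe the variance explanation is to limit model complexity and watch the gap between training and test accuracy. The sketch below does this on synthetic data from make_classification, not on the student data; `max_depth=3` is an arbitrary value picked for illustration, and the resulting numbers say nothing about how the real models would behave.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: unrelated to student-mat.csv).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gaps = {}
for depth in (None, 3):  # None = fully grown trees, 3 = shallow trees
    forest = RandomForestClassifier(max_depth=depth, random_state=0)
    forest.fit(X_tr, y_tr)
    train_acc = forest.score(X_tr, y_tr)
    test_acc = forest.score(X_te, y_te)
    gaps[depth] = train_acc - test_acc  # a large gap hints at overfitting
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```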


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

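# Load the data set.
df = pd.read_csv('student-mat.csv')

# Drop every row that contains at least one missing value.
df = df.dropna()

# Walc is the target; all remaining columns are features.
X = df.drop('Walc', axis=1)
y = df['Walc']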

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

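numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
numeric_transformer = Pipeline(steps=[
    # RobustScaler rescales by the IQR; it does not drop outlier rows.
    ('robust_scaler', RobustScaler(with_centering=False)),
    # Standardize to zero mean and unit variance.
    ('scaler', StandardScaler())
])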

categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

models = {
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM': SVC(),
    # Default max_iter=100; this is what triggers the convergence warning below.
    'Logistic Regression': LogisticRegression()
}


for name, model in models.items():
    model_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    model_pipeline.fit(X_train, y_train)
    accuracy = model_pipeline.score(X_test, y_test)
    print(f'{name} - Model accuracy: {accuracy}')


Random Forest - Model accuracy: 0.4810126582278481
Gradient Boosting - Model accuracy: 0.4936708860759494
SVM - Model accuracy: 0.46835443037974683
Logistic Regression - Model accuracy: 0.4810126582278481


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
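
The warning above suggests raising max_iter. A minimal sketch of that fix on synthetic data follows; `max_iter=1000` is an arbitrary illustrative value, and the data comes from make_classification rather than from the student data set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: unrelated to student-mat.csv).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Scale the features and give the solver a larger iteration budget.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = int(clf.named_steps["logisticregression"].n_iter_.max())
print(f"Solver iterations used: {iterations_used} (limit 1000)")
```

If the reported count stays below the limit, the solver stopped because it converged rather than because it ran out of iterations.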


In [6]:
from sklearn.metrics import accuracy_score

for name, model in models.items():
    model_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    model_pipeline.fit(X_train, y_train)
    

    y_train_pred = model_pipeline.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    
  
    y_test_pred = model_pipeline.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    
    print(f'{name} - Training accuracy: {train_accuracy}, Testing accuracy: {test_accuracy}')

Random Forest - Training accuracy: 1.0, Testing accuracy: 0.4810126582278481
Gradient Boosting - Training accuracy: 1.0, Testing accuracy: 0.4936708860759494
SVM - Training accuracy: 0.7183544303797469, Testing accuracy: 0.46835443037974683
Logistic Regression - Training accuracy: 0.6550632911392406, Testing accuracy: 0.4810126582278481


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


model = LinearRegression()
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', model)
])

model_pipeline.fit(X_train, y_train)

y_pred = model_pipeline.predict(X_test)

r_squared = r2_score(y_test, y_pred)

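# Take the square root of the MSE directly; the `squared` argument of
# mean_squared_error was deprecated in scikit-learn 1.4.
rmse = mean_squared_error(y_test, y_pred) ** 0.5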

mae = mean_absolute_error(y_test, y_pred)

print(f'R-squared: {r_squared}')
print(f'RMSE: {rmse}')
print(f'MAE: {mae}')

R-squared: 0.4407968437886879
RMSE: 0.9198895915493684
MAE: 0.7731037381329114
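
To make the three regression metrics above concrete, here is a small worked sketch that computes them by hand with NumPy. The four target and prediction values are made up for illustration and are not taken from the model output.

```python
import numpy as np

# Made-up targets and predictions (illustrative values only).
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.0, 5.0])

errors = y_true_toy - y_pred_toy

# MAE: mean of the absolute errors.
mae_toy = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error.
rmse_toy = np.sqrt(np.mean(errors ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

print(f"MAE: {mae_toy}, RMSE: {rmse_toy}, R-squared: {r2_toy}")
```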
