# **Experimentation - Report** <br>

#### ***Introduction to the test***


In this test we were instructed to select 3 of the following datasets:
- 2023 QS World University Rankings
- Student Marks
- Student performance 
- Wine quality
- Space Titanic

 **Created by Ron Ismaili**

#### ***Table of contents***

- Introduction to the test (include the table of contents)
- Table of contents
- My selection
- First look at the datasets
- Which algorithms am I going to use for which dataset
- Initial scores for the algorithms
- Data pre-processing (cleaning & scaling)
- Post-pre-processing scores
- Data analysis
- Final scores and conclusions

#### ***My selection***

I have chosen the following 3 datasets:
- Student Marks: students' marks together with their study time and number of courses.
- Wine Quality: different characteristics of red wines and their quality rating.
- Student Data: 33 different columns describing students, including their grades.

#### ***First look at the datasets***

We start the project as any other by importing all of the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt

We follow up by loading into our file the following 3 datasets:
- "Student_Marks.csv"
- "winequality-red.csv"
- "student_data.csv"

In [None]:
StudentMarks_df = pd.read_csv("Student_Marks.csv")
WineQuality_df = pd.read_csv("winequality-red.csv")
StudentData_df = pd.read_csv("student_data.csv")

Then we take a look at their **dataframes**, **shapes** & **keys**.

In [None]:
display(StudentMarks_df)
print("Data Keys:\n", StudentMarks_df.keys(), "\n\nData Shape:\n", StudentMarks_df.shape)

display(WineQuality_df)
print("Data Keys:\n", WineQuality_df.keys(), "\n\nData Shape:\n", WineQuality_df.shape)

display(StudentData_df)
print("Data Keys:\n", StudentData_df.keys(), "\n\nData Shape:\n", StudentData_df.shape)

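Besides shapes and keys, it can help to check column types and missing values at this stage. The sketch below is only an illustration: `example_df` is a made-up stand-in for one of the loaded dataframes, and `summarize` is a hypothetical helper that is not part of this notebook.

```python
import pandas as pd

# Made-up stand-in for one of the loaded dataframes (assumption, not real data)
example_df = pd.DataFrame({
    "number_courses": [3, 4, 6, 8],
    "time_study": [4.5, 0.1, 3.2, None],
    "Marks": [19.2, 7.7, 13.8, 40.0],
})

def summarize(df):
    """Return the shape, the dtype of each column and the missing values per column."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
    }

summary = summarize(example_df)
print(summary)
```

Columns that show up as `object` need encoding before training, and columns with missing values need cleaning.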
#### ***Which algorithms am I going to use for which dataset***

The following algorithms will be used:
- Student Marks: KNN regressor, Lasso, Decision Tree Regressor
- Wine Quality: KNN regressor, Ridge, Decision Tree Regressor
- Student Data: KNN regressor, Lasso, Decision Tree Regressor

#### ***Initial scores for the algorithms***

The Student Data set contains string columns that the models cannot be trained on, which is why we will create its models after the pre-processing step.

I left the algorithm parameters at their defaults on purpose (apart from n_neighbors = 2 for the KNN regressors). I will play around with alpha, max_depth, etc. later on, in the final scores section.

In [None]:
#I use the train_test_split function to split the data for the first 2 datasets
StudentMarks_features = StudentMarks_df.loc[:, "number_courses":"time_study"] #I get all of the features
StudentMarks_label = StudentMarks_df["Marks"] #I get the label (target value)
StudentMarks_features_train, StudentMarks_features_test, StudentMarks_label_train, StudentMarks_label_test = train_test_split(StudentMarks_features, StudentMarks_label, random_state = 69)

WineQuality_features = WineQuality_df.loc[:, "fixed acidity":"alcohol"] #I get all of the features
WineQuality_label = WineQuality_df["quality"] #I get the label (target value)
WineQuality_features_train, WineQuality_features_test, WineQuality_label_train, WineQuality_label_test = train_test_split(WineQuality_features, WineQuality_label, random_state = 69)

knn1 = KNeighborsRegressor(n_neighbors = 2).fit(StudentMarks_features_train , StudentMarks_label_train) 
lasso = Lasso().fit(StudentMarks_features_train, StudentMarks_label_train) 
tree1 = DecisionTreeRegressor().fit(StudentMarks_features_train, StudentMarks_label_train)

knn2 = KNeighborsRegressor(n_neighbors = 2).fit(WineQuality_features_train , WineQuality_label_train)
ridge = Ridge().fit(WineQuality_features_train, WineQuality_label_train)
tree2 = DecisionTreeRegressor().fit(WineQuality_features_train, WineQuality_label_train)

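The scoring functions in the next cell all follow the same pattern, so they could be collapsed into one helper. This is a minimal sketch under assumptions: `score_model` is a hypothetical name, and the data comes from `make_regression` instead of the real datasets so the example runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def score_model(model, X_train, y_train, X_test, y_test):
    """Fit the model and return (train score, test score)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# Synthetic stand-in data (assumption, not one of the report's datasets)
X, y = make_regression(n_samples=200, n_features=2, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "KNN Regressor": KNeighborsRegressor(n_neighbors=2),
    "Lasso": Lasso(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
}

# One loop replaces one scoring function and one print function per model
scores = {}
for name, model in models.items():
    scores[name] = score_model(model, X_train, y_train, X_test, y_test)
    print("{} - Training set score: {:.4f}".format(name, scores[name][0]))
    print("{} - Test set score: {:.4f}".format(name, scores[name][1]))
```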
In [None]:
def knn1Score(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test):
    train_score = knn1.score(StudentMarks_features_train, StudentMarks_label_train)
    test_score = knn1.score(StudentMarks_features_test, StudentMarks_label_test)
    return train_score, test_score

def lassoScore(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test):
    train_score = lasso.score(StudentMarks_features_train, StudentMarks_label_train)
    test_score = lasso.score(StudentMarks_features_test, StudentMarks_label_test)
    return train_score, test_score

def tree1Score(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test):
    train_score = tree1.score(StudentMarks_features_train, StudentMarks_label_train)
    test_score = tree1.score(StudentMarks_features_test, StudentMarks_label_test)
    return train_score, test_score

def knn2Score(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test):
    train_score = knn2.score(WineQuality_features_train, WineQuality_label_train)
    test_score = knn2.score(WineQuality_features_test, WineQuality_label_test)
    return train_score, test_score

def ridgeScore(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test):
    train_score = ridge.score(WineQuality_features_train, WineQuality_label_train)
    test_score = ridge.score(WineQuality_features_test, WineQuality_label_test)
    return train_score, test_score

def tree2Score(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test):
    train_score = tree2.score(WineQuality_features_train, WineQuality_label_train)
    test_score = tree2.score(WineQuality_features_test, WineQuality_label_test)
    return train_score, test_score

knn1_train_score, knn1_test_score = knn1Score(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test)
lasso_train_score, lasso_test_score = lassoScore(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test)
tree1_train_score, tree1_test_score = tree1Score(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test)

knn2_train_score, knn2_test_score = knn2Score(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test)
ridge_train_score, ridge_test_score = ridgeScore(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test)
tree2_train_score, tree2_test_score = tree2Score(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test)

In [None]:
def printKnn1(knn1_train_score, knn1_test_score):
    print("Student Marks KNN Regressor - Training set score: {:.4f}".format(knn1_train_score))
    print("Student Marks KNN Regressor - Test set score: {:.4f}".format(knn1_test_score))

def printLasso(lasso_train_score, lasso_test_score):
    print("\nStudent Marks Lasso - Training set score: {:.4f}".format(lasso_train_score))
    print("Student Marks Lasso - Test set score: {:.4f}".format(lasso_test_score))

def printTree1(tree1_train_score, tree1_test_score):
    print("\nStudent Marks Decision Tree Regressor - Training set score: {:.4f}".format(tree1_train_score))
    print("Student Marks Decision Tree Regressor - Test set score: {:.4f}".format(tree1_test_score))

def printKnn2(knn2_train_score, knn2_test_score):
    print("\nWine Quality KNN Regressor - Training set score: {:.4f}".format(knn2_train_score))
    print("Wine Quality KNN Regressor - Test set score: {:.4f}".format(knn2_test_score))

def printRidge(ridge_train_score, ridge_test_score):
    print("\nWine Quality Ridge - Training set score: {:.4f}".format(ridge_train_score))
    print("Wine Quality Ridge - Test set score: {:.4f}".format(ridge_test_score))

def printTree2(tree2_train_score, tree2_test_score):
    print("\nWine Quality Decision Tree Regressor - Training set score: {:.4f}".format(tree2_train_score))
    print("Wine Quality Decision Tree Regressor - Test set score: {:.4f}".format(tree2_test_score))

def printStudentMark():
    printKnn1(knn1_train_score, knn1_test_score)
    printLasso(lasso_train_score, lasso_test_score)
    printTree1(tree1_train_score, tree1_test_score)

def printWineQuality():
    printKnn2(knn2_train_score, knn2_test_score)
    printRidge(ridge_train_score, ridge_test_score)
    printTree2(tree2_train_score, tree2_test_score)

Student Marks scores (R², since `.score()` on a regressor returns the coefficient of determination rather than an accuracy)

In [None]:
printStudentMark()

Wine Quality scores (R², since `.score()` on a regressor returns the coefficient of determination rather than an accuracy)

In [None]:
printWineQuality()

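A note on the numbers printed above: for scikit-learn regressors, `.score()` returns R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data (the data and variable names are assumptions made for the illustration) and compares it with `Ridge.score` and `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic regression data (assumption, not one of the report's datasets)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = Ridge().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, model.score(X, y), r2_score(y, pred))
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible on a test set.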
#### ***Data pre-processing (cleaning & scaling)***

Student Marks pre-processing

In [None]:
display(StudentMarks_features_train)
print(StudentMarks_features_train.max(), "\n")

scaler = MinMaxScaler().fit(StudentMarks_features_train)
StudentMarks_features_train_scaled = scaler.transform(StudentMarks_features_train)

display(StudentMarks_features_train_scaled) #I will use the scaling at the end when we score the algorithms

Wine Quality pre-processing

In [None]:
display(WineQuality_features_train)
print(WineQuality_features_train.max(), "\n")

scaler = MinMaxScaler().fit(WineQuality_features_train)
WineQuality_features_train_scaled = scaler.transform(WineQuality_features_train)

display(WineQuality_features_train_scaled)

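The scaling cells above only transform the training split. When scaled data is later used for scoring, the test split has to go through the same fitted scaler. The sketch below shows this on made-up data (all names here are assumptions for the illustration) and also checks the min-max formula (x - min) / (max - min) by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up features on very different scales (assumption, not real data)
rng = np.random.default_rng(0)
features = rng.uniform(low=[0.0, 10.0], high=[8.0, 300.0], size=(40, 2))
label = rng.normal(size=40)

X_train, X_test, y_train, y_test = train_test_split(features, label, random_state=0)

# Fit on the training split only, then reuse the same scaler for the test split
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The same transformation written out by hand, per column
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```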
Student Data pre-processing

In [None]:
display(StudentData_df)
print(StudentData_df.keys())

#I take a look at all of the columns, what kind of values they have
columns_to_check = ["school", "address", "famsize", "Pstatus", "Mjob", "Fjob", "reason", "guardian", "traveltime",
                    "failures", "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic"]

def printUnique():
    print("\n")
    for column in columns_to_check:
        print(StudentData_df[column].unique())
    print("\n")

printUnique()

#Feature encoding: LabelEncoder replaces every distinct value of a column with an integer code
columns_to_encode = ["sex", "school", "address", "famsize", "Pstatus", "Mjob", "Fjob", "reason", "guardian", "traveltime",
                     "failures", "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic"]

le = preprocessing.LabelEncoder()
for column in columns_to_encode:
    StudentData_df[column] = le.fit_transform(StudentData_df[column])

printUnique()
display(StudentData_df)

#The target becomes the sum of the three grades, stored in the "G1" column
StudentData_df["G1"] = StudentData_df["G1"] + StudentData_df["G2"] + StudentData_df["G3"]
StudentData_df = StudentData_df.drop(["G2", "G3"], axis = 1)

display(StudentData_df)

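One thing to keep in mind with label encoding is that it gives nominal columns such as "Mjob" an arbitrary numeric order, which distance-based and linear models will treat as meaningful. A possible alternative, which is not what this notebook does, is one-hot encoding with `pd.get_dummies`. The small dataframe below is made up for the illustration.

```python
import pandas as pd

# Made-up excerpt with one nominal, one yes/no and one numeric column (assumption)
example = pd.DataFrame({
    "Mjob": ["teacher", "health", "other", "teacher"],
    "schoolsup": ["yes", "no", "no", "yes"],
    "age": [15, 16, 17, 18],
})

# A yes/no column only needs a 0/1 mapping
example["schoolsup"] = example["schoolsup"].map({"no": 0, "yes": 1})

# A nominal column gets one indicator column per category
encoded = pd.get_dummies(example, columns=["Mjob"], dtype=int)

print(encoded)
```

The price is more columns, which matters for KNN on a dataset that already has many features.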
#### ***Post-pre-processing scores***

Student Marks

In [None]:
#ReSplitting the dataset
StudentMarks_features_train, StudentMarks_features_test, StudentMarks_label_train, StudentMarks_label_test = train_test_split(StudentMarks_features, StudentMarks_label, random_state = 69)

#Scaling: the scaler is fitted on the training split only and then applied to both splits
scaler = MinMaxScaler().fit(StudentMarks_features_train)
StudentMarks_features_train_scaled = scaler.transform(StudentMarks_features_train)
StudentMarks_features_test_scaled = scaler.transform(StudentMarks_features_test)

#ReTraining the model
knn1 = KNeighborsRegressor(n_neighbors = 2).fit(StudentMarks_features_train_scaled , StudentMarks_label_train)
lasso = Lasso().fit(StudentMarks_features_train_scaled, StudentMarks_label_train)
tree1 = DecisionTreeRegressor().fit(StudentMarks_features_train_scaled, StudentMarks_label_train)

#Scoring the algorithms
knn1_train_score, knn1_test_score = knn1Score(StudentMarks_features_train_scaled, StudentMarks_label_train, StudentMarks_features_test_scaled, StudentMarks_label_test)
lasso_train_score, lasso_test_score = lassoScore(StudentMarks_features_train_scaled, StudentMarks_label_train, StudentMarks_features_test_scaled, StudentMarks_label_test)
tree1_train_score, tree1_test_score = tree1Score(StudentMarks_features_train_scaled, StudentMarks_label_train, StudentMarks_features_test_scaled, StudentMarks_label_test)

#Printing the results
printStudentMark()

#I Still won't change the variables inside of the algorithm, I will play with those variables further down below
#Min-max scaling does not change the splits of a decision tree, so small changes in its score between runs come from the unseeded DecisionTreeRegressor (no random_state)

Wine Quality

In [None]:
#ReSplitting the dataset
WineQuality_features_train, WineQuality_features_test, WineQuality_label_train, WineQuality_label_test = train_test_split(WineQuality_features, WineQuality_label, random_state = 69)

#Scaling: the scaler is fitted on the training split only and then applied to both splits
scaler = MinMaxScaler().fit(WineQuality_features_train)
WineQuality_features_train_scaled = scaler.transform(WineQuality_features_train)
WineQuality_features_test_scaled = scaler.transform(WineQuality_features_test)

#ReTraining the model
knn2 = KNeighborsRegressor(n_neighbors = 2).fit(WineQuality_features_train_scaled , WineQuality_label_train)
ridge = Ridge().fit(WineQuality_features_train_scaled, WineQuality_label_train)
tree2 = DecisionTreeRegressor().fit(WineQuality_features_train_scaled, WineQuality_label_train)

#Scoring the algorithms
knn2_train_score, knn2_test_score = knn2Score(WineQuality_features_train_scaled, WineQuality_label_train, WineQuality_features_test_scaled, WineQuality_label_test)
ridge_train_score, ridge_test_score = ridgeScore(WineQuality_features_train_scaled, WineQuality_label_train, WineQuality_features_test_scaled, WineQuality_label_test)
tree2_train_score, tree2_test_score = tree2Score(WineQuality_features_train_scaled, WineQuality_label_train, WineQuality_features_test_scaled, WineQuality_label_test)

#Printing the results
printWineQuality()

#I Still won't change the variables inside of the algorithm, I will play with those variables further down below
#Min-max scaling does not change the splits of a decision tree, so small changes in its score between runs come from the unseeded DecisionTreeRegressor (no random_state)

Student Data

In [None]:
StudentData_features = StudentData_df.loc[:, "school":"absences"] #Get all of the features
StudentData_label = StudentData_df["G1"] #Get all of the labels (target values)
StudentData_features_train, StudentData_features_test, StudentData_label_train, StudentData_label_test = train_test_split(StudentData_features, StudentData_label, random_state = 69)

knn3 = KNeighborsRegressor(n_neighbors = 2).fit(StudentData_features_train, StudentData_label_train) 
lasso3 = Lasso().fit(StudentData_features_train, StudentData_label_train) 
tree3 = DecisionTreeRegressor().fit(StudentData_features_train, StudentData_label_train)

In [None]:
def knn3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test):
    train_score = knn3.score(StudentData_features_train, StudentData_label_train)
    test_score = knn3.score(StudentData_features_test, StudentData_label_test)
    return train_score, test_score

def lasso3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test):
    train_score = lasso3.score(StudentData_features_train, StudentData_label_train)
    test_score = lasso3.score(StudentData_features_test, StudentData_label_test)
    return train_score, test_score

def tree3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test):
    train_score = tree3.score(StudentData_features_train, StudentData_label_train)
    test_score = tree3.score(StudentData_features_test, StudentData_label_test)
    return train_score, test_score

knn3_train_score, knn3_test_score = knn3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test)
lasso3_train_score, lasso3_test_score = lasso3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test)
tree3_train_score, tree3_test_score = tree3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test)

def printKnn3(knn3_train_score, knn3_test_score):
    print("Student Data KNN Regressor - Training set score: {:.4f}".format(knn3_train_score))
    print("Student Data KNN Regressor - Test set score: {:.4f}".format(knn3_test_score))

def printLasso3(lasso3_train_score, lasso3_test_score):
    print("\nStudent Data Lasso - Training set score: {:.4f}".format(lasso3_train_score))
    print("Student Data Lasso - Test set score: {:.4f}".format(lasso3_test_score))

def printTree3(tree3_train_score, tree3_test_score):
    print("\nStudent Data Decision Tree Regressor - Training set score: {:.4f}".format(tree3_train_score))
    print("Student Data Decision Tree Regressor - Test set score: {:.4f}".format(tree3_test_score))

def printStudentData():
    printKnn3(knn3_train_score, knn3_test_score)
    printLasso3(lasso3_train_score, lasso3_test_score)
    printTree3(tree3_train_score, tree3_test_score)

printStudentData()

#### ***Data analysis***

I will start off by looking at all of the heatmaps.

**Student Marks**

In [None]:
sns.heatmap(StudentMarks_df.corr())

Since we have only 2 features here, I don't think there is much to be learned from this heatmap beyond the strong correlation between marks and time studied.

**Wine Quality**

In [None]:
sns.heatmap(WineQuality_df.corr())

In this heatmap we can find some interesting bits of information, for example that quality is correlated with alcohol, sulphates, citric acid and fixed acidity.

**Student Data**

In [None]:
sns.heatmap(StudentData_df.corr())

This heatmap has a lot of features, so it's a bit hard to figure out what correlates with what.
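
When a heatmap gets this crowded, one option is to look only at the correlations with the target column and sort them by absolute value. The sketch below uses a made-up dataframe (the column names are borrowed from the dataset, the values are synthetic and the relationships are assumptions), so it says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic data: "G1" depends on two columns and not on the third (assumption)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "studytime": rng.normal(size=n),
    "failures": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["G1"] = 2.0 * df["studytime"] - 1.0 * df["failures"] + rng.normal(scale=0.5, size=n)

# Correlation of every feature with the target, strongest first
target_corr = df.corr()["G1"].drop("G1")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)

print(ranked)
```

On the real dataframe the same two lines would be applied to `StudentData_df` after the encoding step.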

#### ***Final scores and conclusions***

**Student Marks**

Now we test different values for the algorithms.

In [None]:
#KNN regressor: try n_neighbors from 1 to 20
StudentMarks_features_train, StudentMarks_features_test, StudentMarks_label_train, StudentMarks_label_test = train_test_split(StudentMarks_features, StudentMarks_label, random_state = 69)

neighbors = range(1,21)
arr_training = list()
arr_test = list()

for i in neighbors:
    knn1 = KNeighborsRegressor(n_neighbors=i).fit(StudentMarks_features_train , StudentMarks_label_train)
    arr_training.append(knn1.score(StudentMarks_features_train, StudentMarks_label_train))
    arr_test.append(knn1.score(StudentMarks_features_test, StudentMarks_label_test))

#Plot against the actual n_neighbors values so the x-axis starts at 1, not at 0
plt.plot(neighbors, arr_training)
plt.plot(neighbors, arr_test)
plt.legend(["Train score (R^2)","Test score (R^2)"], loc = 1)
plt.xlabel("n_neighbors")
plt.ylabel("R^2 score")
plt.show()

In [None]:
#Lasso: try alpha (the regularization strength) from 1 to 20
alphas = range(1,21)
arr_training = list()
arr_test = list()

for i in alphas:
    lasso = Lasso(alpha=i).fit(StudentMarks_features_train, StudentMarks_label_train)
    arr_training.append(lasso.score(StudentMarks_features_train, StudentMarks_label_train))
    arr_test.append(lasso.score(StudentMarks_features_test, StudentMarks_label_test))

plt.plot(alphas, arr_training)
plt.plot(alphas, arr_test)
plt.legend(["Train score (R^2)","Test score (R^2)"], loc = 1)
plt.xlabel("alpha")
plt.ylabel("R^2 score")
plt.show()

In [None]:
#Decision Tree Regressor

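The Decision Tree Regressor cell above is still empty. A sweep over max_depth, analogous to the KNN one, might look like the sketch below. It runs on synthetic data from `make_regression` (an assumption, so that it is self-contained), and `random_state` is fixed so repeated runs give the same tree. Picking the depth by test score, as done here for simplicity, gives an optimistic estimate; a separate validation split would be cleaner.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption, not one of the report's datasets)
X, y = make_regression(n_samples=300, n_features=2, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

depths = list(range(1, 21))
train_scores = []
test_scores = []

for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))

# Depth with the highest test score
best_depth = depths[test_scores.index(max(test_scores))]
print("Best max_depth:", best_depth)
```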
In [None]:
#ReTraining the models with the chosen parameter values
knn1 = KNeighborsRegressor(n_neighbors = 3).fit(StudentMarks_features_train , StudentMarks_label_train)
lasso = Lasso().fit(StudentMarks_features_train, StudentMarks_label_train)
#The tree is trained on the same unscaled features it is scored on below
tree1 = DecisionTreeRegressor().fit(StudentMarks_features_train, StudentMarks_label_train)

#Scoring the algorithms
knn1_train_score, knn1_test_score = knn1Score(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test)
lasso_train_score, lasso_test_score = lassoScore(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test)
tree1_train_score, tree1_test_score = tree1Score(StudentMarks_features_train, StudentMarks_label_train, StudentMarks_features_test, StudentMarks_label_test)

#Printing the results
printStudentMark()

**Wine Quality**

Now we test different values for the algorithms.

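One systematic way to test different values is a cross-validated grid search. The sketch below tunes the Ridge alpha with `GridSearchCV` on synthetic data. The data, the alpha grid and the 5 folds are assumptions chosen for the illustration, not values taken from this report.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in with 11 features, like the wine dataset (values are made up)
X, y = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate values for the regularization strength (hypothetical grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation on the training split only
search = GridSearchCV(Ridge(), param_grid, cv=5).fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
test_score = search.score(X_test, y_test)
print("Best alpha:", best_alpha)
print("Test set score: {:.4f}".format(test_score))
```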
In [None]:
#ReSplitting the dataset
WineQuality_features_train, WineQuality_features_test, WineQuality_label_train, WineQuality_label_test = train_test_split(WineQuality_features, WineQuality_label, random_state = 69)

#ReTraining the model
knn2 = KNeighborsRegressor(n_neighbors = 2).fit(WineQuality_features_train , WineQuality_label_train)
ridge = Ridge().fit(WineQuality_features_train, WineQuality_label_train)
tree2 = DecisionTreeRegressor().fit(WineQuality_features_train, WineQuality_label_train)

#Scoring the algorithms
knn2_train_score, knn2_test_score = knn2Score(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test)
ridge_train_score, ridge_test_score = ridgeScore(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test)
tree2_train_score, tree2_test_score = tree2Score(WineQuality_features_train, WineQuality_label_train, WineQuality_features_test, WineQuality_label_test)

#Printing the results
printWineQuality()

**Student Data**

Now we test different values for the algorithms.

In [None]:
#ReSplitting the dataset
StudentData_features_train, StudentData_features_test, StudentData_label_train, StudentData_label_test = train_test_split(StudentData_features, StudentData_label, random_state = 69)

#ReTraining the model
knn3 = KNeighborsRegressor(n_neighbors = 2).fit(StudentData_features_train, StudentData_label_train) 
lasso3 = Lasso().fit(StudentData_features_train, StudentData_label_train) 
tree3 = DecisionTreeRegressor().fit(StudentData_features_train, StudentData_label_train)

#Scoring the algorithms
knn3_train_score, knn3_test_score = knn3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test)
lasso3_train_score, lasso3_test_score = lasso3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test)
tree3_train_score, tree3_test_score = tree3Score(StudentData_features_train, StudentData_label_train, StudentData_features_test, StudentData_label_test)

#Printing the results
printStudentData()