# Predicting students' romantic relationships with kNN

In [1]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

Let's first take a look at the head of our dataset:

In [2]:
df = pd.read_csv('student-por.csv')
df = df.dropna() #first get rid of rows with empty cells
df.head(10)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13
5,GP,M,16,U,LE3,T,4,3,services,other,...,5,4,2,1,2,5,6,12,12,13
6,GP,M,16,U,LE3,T,2,2,other,other,...,4,4,4,1,1,3,0,13,12,13
7,GP,F,17,U,GT3,A,4,4,other,teacher,...,4,1,4,1,1,1,2,10,13,13
8,GP,M,15,U,LE3,A,3,2,services,other,...,4,2,2,1,1,1,0,15,16,17
9,GP,M,15,U,GT3,T,3,4,other,other,...,5,5,1,1,1,5,0,12,12,13


# Cleaning the data:

Here I chose 7 variables on which to build my predictive model. I want to predict what the romantic status of the students depends on. 'romantic' will be my dependent (target) variable, and sex, age, health, studytime, freetime and goout will be the independent variables (features), which in my opinion are reasonable predictors:

In [3]:
df1 = df[['sex','age', 'health','studytime', 'freetime', 'goout', 'romantic']]
df1=df1.dropna()
df1.head()

Unnamed: 0,sex,age,health,studytime,freetime,goout,romantic
0,F,18,3,2,3,4,no
1,F,17,3,2,3,3,no
2,F,15,3,2,3,2,no
3,F,15,5,3,2,2,yes
4,F,16,5,2,3,2,no


Legend for some of the variables:



health - current health status (numeric: from 1 - very bad to 5 - very good)

studytime - weekly study time (numeric: 1 - less than 2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, 4 - more than 10 hours)

freetime - free time after school (numeric: from 1 - very low to 5 - very high)

goout - going out with friends (numeric: from 1 - very low to 5 - very high)

romantic - with a romantic relationship (binary: yes or no)

In [4]:
df['romantic'].value_counts()

no     410
yes    239
Name: romantic, dtype: int64

I counted the values of my target variable to see how much relevant data I actually have. Only 239 of the 649 answers are 'yes', which is the class I am interested in.
Next I want to turn the text variable 'romantic' into numeric 0/1 columns, using the dummies function:

In [5]:
dummies = pd.get_dummies(df1['romantic']) #one 0/1 column per answer: 'no' and 'yes'
df1 = pd.concat([df1, dummies], axis=1) #append the dummy columns to the dataframe
df1 = df1.drop("romantic", axis=1) #drop the original text column
df1.head()

Unnamed: 0,sex,age,health,studytime,freetime,goout,no,yes
0,F,18,3,2,3,4,1,0
1,F,17,3,2,3,3,1,0
2,F,15,3,2,3,2,1,0
3,F,15,5,3,2,2,0,1
4,F,16,5,2,3,2,1,0
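
To make clear what the dummies function produces, here is a minimal pure-Python sketch of the same idea (one-hot encoding). The helper `one_hot` is a hypothetical name used only for illustration, not part of pandas, and the input is just the first five 'romantic' values shown above.

```python
def one_hot(values):
    """Return one 0/1 indicator list per distinct value, in sorted order."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# First five 'romantic' answers from the table above.
romantic = ["no", "no", "no", "yes", "no"]

encoded = one_hot(romantic)
print(encoded)  # {'no': [1, 1, 1, 0, 1], 'yes': [0, 0, 0, 1, 0]}
```

Each distinct answer becomes its own column, and every row has a 1 in exactly one of them, which matches the 'no' and 'yes' columns in the output above.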


Turning the variable 'sex' into numeric 0/1 columns with the dummies function. At the same time I drop the original column, leaving only the indicator columns F and M:

In [6]:
dummies = pd.get_dummies(df1['sex']) #one 0/1 column per value: 'F' and 'M'
df1 = pd.concat([df1, dummies], axis=1) #append the dummy columns to the dataframe
df1 = df1.drop("sex", axis=1) #drop the original text column
df1.head()

Unnamed: 0,age,health,studytime,freetime,goout,no,yes,F,M
0,18,3,2,3,4,1,0,1,0
1,17,3,2,3,3,1,0,1,0
2,15,3,2,3,2,1,0,1,0
3,15,5,3,2,2,0,1,1,0
4,16,5,2,3,2,1,0,1,0


# Splitting the data set into a training and test set

For 'y' I chose the 'yes' column (1 if the student is in a romantic relationship, 0 otherwise), because this is what I want to predict:

In [7]:
#This built-in function from scikit-learn splits the data set randomly into a train set and a test set
#By stating random_state = 1, we use one particular "random state" (we could use any number, it's a so-called "random seed").
#This means that if we run the code again, it will produce the same results, which can be handy.
#test_size = 0.3, so I'm splitting the data into 70% training data and 30% test data
y = df1['yes'] #The 'yes' dummy column is our y-variable (1 = in a relationship)
X = df1[['age', 'health', 'studytime', 'freetime', 'goout', 'F', 'M']] #this selects all rows and the feature columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
X_train.head() #The train data

Unnamed: 0,age,health,studytime,freetime,goout,F,M
358,18,1,1,2,4,1,0
74,16,5,2,3,3,1,0
640,18,3,1,4,3,0,1
423,16,5,1,3,3,1,0
61,16,5,1,5,5,1,0
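
The role of the random seed can be illustrated with a small standard-library sketch. This is not scikit-learn's implementation, and the rows it picks will differ from those of `train_test_split`; the helper `split_indices` is a hypothetical name. It only shows that a fixed seed makes a shuffled 70/30 split reproducible.

```python
import math
import random

def split_indices(n_rows, test_size=0.3, seed=1):
    """Shuffle row positions with a fixed seed and cut off the test share."""
    rng = random.Random(seed)            # same seed -> same shuffle every run
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]   # (train, test)

train_idx, test_idx = split_indices(649)   # 649 rows, as in the value counts above
print(len(train_idx), len(test_idx))       # 454 195

# Calling it again with the same seed reproduces the split exactly.
same_again = split_indices(649) == (train_idx, test_idx)
print(same_again)                          # True
```

The 195 test rows agree with the 'support' total that the classification report prints for the test set.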


# Model evaluation

Creating a KNeighborsClassifier and calculating its accuracy on the test data:

In [8]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need
knn = KNeighborsClassifier(n_neighbors=5) #create a KNN-classifier with 5 neighbors (default)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
knn.score(X_test, y_test) #calculate the accuracy on the test data

0.5743589743589743

The accuracy is 57%. Note that 239/(410+239) = 37% is only the share of 'yes' answers. A baseline model that always predicts the majority answer 'no' would be right for 410/(410+239) = 63% of all students, so the kNN model does not beat the majority-class baseline.
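
A majority-class baseline can be sketched from the value counts alone. This is a minimal illustration that uses the counts printed earlier (410 'no', 239 'yes') over the whole dataset; the variable names are my own.

```python
# Value counts of the target variable, as printed above.
counts = {"no": 410, "yes": 239}
total = sum(counts.values())

# A baseline that always predicts the most frequent answer.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Share of 'yes' answers, for comparison.
share_yes = counts["yes"] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"share of 'yes': {share_yes:.3f}")
```

On the test set alone, the same baseline would be right for the 128 'no' students out of 195.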

Confusion matrix for further testing:

In [9]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[92, 36],
       [47, 20]])

In [10]:
knn.classes_

array([0, 1], dtype=uint8)

0 is for 'not romantic' and 1 is for 'romantic'. Time to add text labels for the categories:

In [11]:
conf_matrix = pd.DataFrame(cm, index=['Not romantic (actual)', 'Romantic (actual)'], columns = ['Not romantic (predicted)', 'Romantic (predicted)']) #make a dataframe, put labels on rows (index) and columns 
conf_matrix

Unnamed: 0,Not romantic (predicted),Romantic (predicted)
Not romantic (actual),92,36
Romantic (actual),47,20


Checking accuracy:

In [12]:
(92+20)/(92+20+47+36)

0.5743589743589743

The accuracy is the same as the one calculated with knn.score.

Precision for 'yes' answers:

In [13]:
20/(36+20)

0.35714285714285715

The precision is low: only about 36% of the students the model predicts to be in a relationship actually are. Honestly, I don't know yet how to make it higher.

Recall for 'yes' answers:

In [14]:
20/(20+47)

0.29850746268656714

The recall is also very low: the model finds only about 30% of the students who actually are in a relationship.
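
The three hand calculations above can be collected in one place. The sketch below reads the four cells of the confusion matrix printed earlier and also derives the F1-score, which combines precision and recall; the variable names are my own.

```python
# Confusion matrix from above: rows are actual classes, columns are predicted.
cm = [[92, 36],   # actual 0 ("no"):  true negatives, false positives
      [47, 20]]   # actual 1 ("yes"): false negatives, true positives

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted "yes", how many are truly "yes"
recall = tp / (tp + fn)      # of the actual "yes", how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```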

# Parameter setting

In [15]:
from sklearn.metrics import classification_report

for i in range(1,6):
    knn_new = KNeighborsClassifier(n_neighbors = i) #make a new kNN model with i (1-5) neighbors
    knn_new = knn_new.fit(X_train, y_train) #fit new model on train data
    y_test_pred_new = knn_new.predict(X_test) #predict using new model, with test data
    print(f"With {i} neighbors the result is:")
    print(classification_report(y_test, y_test_pred_new)) #use a built-in function to print out accuracy, precision and recall

With 1 neighbors the result is:
              precision    recall  f1-score   support

           0       0.66      0.67      0.67       128
           1       0.35      0.34      0.35        67

    accuracy                           0.56       195
   macro avg       0.51      0.51      0.51       195
weighted avg       0.56      0.56      0.56       195

With 2 neighbors the result is:
              precision    recall  f1-score   support

           0       0.67      0.90      0.77       128
           1       0.46      0.16      0.24        67

    accuracy                           0.65       195
   macro avg       0.57      0.53      0.51       195
weighted avg       0.60      0.65      0.59       195

With 3 neighbors the result is:
              precision    recall  f1-score   support

           0       0.68      0.73      0.71       128
           1       0.40      0.34      0.37        67

    accuracy                           0.60       195
   macro avg       0.54      0.5

I don't know which indicator I should look at to choose the best result, but my guess is that it is the model with 5 neighbours.
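
One way to make the choice less of a guess is to fix a single indicator in advance and compare it across the values of k. The sketch below assumes that the F1-score of class 1 ('yes') is the indicator of interest, which is an assumption and not something this analysis settles. It uses only the scores printed above for 1, 2 and 3 neighbours, plus the F1 for 5 neighbours derived from the confusion matrix. The report for 4 neighbours is not shown, so it is left out.

```python
# F1-scores for class 1 ("yes") taken from the reports printed above.
# k=4 is omitted because its report is not shown.
f1_yes = {1: 0.35, 2: 0.24, 3: 0.37}

# For k=5, derive F1 from the confusion matrix: TP=20, FP=36, FN=47.
tp, fp, fn = 20, 36, 47
f1_yes[5] = round(2 * tp / (2 * tp + fp + fn), 2)

# Pick the k with the highest value of the chosen indicator.
best_k = max(f1_yes, key=f1_yes.get)

for k, score in sorted(f1_yes.items()):
    print(f"k={k}: F1 for class 1 = {score:.2f}")
print("Best k by this indicator:", best_k)
```

With a different indicator (for example accuracy, or recall for class 1) the preferred k can differ, so the indicator should follow from what the prediction is meant for. Strictly speaking, such a comparison is better done on a separate validation split than on the test set.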