# Training Split
In this notebook, I explore how the train/test split ratio affects a k-nearest neighbors (KNN) classifier.

## Feature Variables
* **PassengerId**: ID for each passenger (dropped before modeling)
* **Pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* **Sex**: 0 = male, 1 = female
* **Age**: Age of the passenger
* **SibSp**: Number of siblings/spouses of the passenger
* **Parch**: Number of parents/children of the passenger
* **Fare**: Price of the ticket
* **Embarked**: Port of departure, encoded as 'S': 0, 'C': 1, 'Q': 2

## Target Variable
* **Survived**: Whether the passenger survived (0 = no, 1 = yes)


In [2]:
import numpy as np
import pandas as pd

# Machine learning: KNN model, scaling, and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

import matplotlib.pyplot as plt
import os

# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [2]:
# Load the training data
train_data = pd.read_csv("train.csv")

# Preprocess the training data
train_data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)  # Drop columns not used as features
train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
train_data['Embarked'] = train_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Fill missing values with each column's mean (this also applies to the encoded Embarked column)
train_data.fillna(train_data.mean(), inplace=True)

train_data  # Display the preprocessed data

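| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.000000 | 1 | 0 | 7.2500 | 0.0 |
| 1 | 2 | 1 | 1 | 1 | 38.000000 | 1 | 0 | 71.2833 | 1.0 |
| 2 | 3 | 1 | 3 | 1 | 26.000000 | 0 | 0 | 7.9250 | 0.0 |
| 3 | 4 | 1 | 1 | 1 | 35.000000 | 1 | 0 | 53.1000 | 0.0 |
| 4 | 5 | 0 | 3 | 0 | 35.000000 | 0 | 0 | 8.0500 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | 0 | 27.000000 | 0 | 0 | 13.0000 | 0.0 |
| 887 | 888 | 1 | 1 | 1 | 19.000000 | 0 | 0 | 30.0000 | 0.0 |
| 888 | 889 | 0 | 3 | 1 | 29.699118 | 1 | 2 | 23.4500 | 0.0 |
| 889 | 890 | 1 | 1 | 0 | 26.000000 | 0 | 0 | 30.0000 | 1.0 |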
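
Filling every column with its mean also touches `Embarked`, which is a category code rather than a quantity. If that column has missing values (its float dtype in the output suggests it does), the mean can produce a fractional code that matches no port. The sketch below uses a small made-up frame to contrast mean filling with filling the category by its most frequent value. The `toy` data and the mode-based fill are assumptions for illustration, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) using the same Embarked encoding as above.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": [0.0, 1.0, np.nan, 0.0],  # 'S': 0, 'C': 1, 'Q': 2
})

# Mean imputation for every column, as in the preprocessing cell.
mean_filled = toy.fillna(toy.mean())

# Alternative (an assumption, not the notebook's method):
# mean for the numeric column, most frequent value for the category code.
mode_filled = toy.copy()
mode_filled["Age"] = mode_filled["Age"].fillna(mode_filled["Age"].mean())
mode_filled["Embarked"] = mode_filled["Embarked"].fillna(
    mode_filled["Embarked"].mode()[0]
)

print(mean_filled["Embarked"].tolist())  # contains a fractional code
print(mode_filled["Embarked"].tolist())  # contains only valid codes
```

Because KNN relies on distances between feature vectors, a fractional port code places those passengers between two ports, which may or may not be acceptable for the experiment.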


In [4]:
# Features and Target Variables
# Features = Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
X = train_data.drop(columns=['PassengerId', 'Survived'])
y = train_data['Survived']
print(X)
print(y)

     Pclass  Sex        Age  SibSp  Parch     Fare  Embarked
0         3    0  22.000000      1      0   7.2500       0.0
1         1    1  38.000000      1      0  71.2833       1.0
2         3    1  26.000000      0      0   7.9250       0.0
3         1    1  35.000000      1      0  53.1000       0.0
4         3    0  35.000000      0      0   8.0500       0.0
..      ...  ...        ...    ...    ...      ...       ...
886       2    0  27.000000      0      0  13.0000       0.0
887       1    1  19.000000      0      0  30.0000       0.0
888       3    1  29.699118      1      2  23.4500       0.0
889       1    0  26.000000      0      0  30.0000       1.0
890       3    0  32.000000      0      0   7.7500       2.0

[891 rows x 7 columns]
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64
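
Since `y` is a binary target, the share of each class in the test set can drift as the test size changes, which in turn moves the false positive and false negative rates. One option, not used in this notebook, is to pass `stratify` to `train_test_split` so both parts keep the same class ratio. The sketch below uses made-up arrays (`X_toy`, `y_toy` are hypothetical) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target (hypothetical): 60 zeros and 40 ones.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder single feature

# stratify=y_toy asks train_test_split to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=20, stratify=y_toy
)

# Fraction of positive labels in each part.
print(y_tr.mean(), y_te.mean())
```

With stratification, differences between split sizes are less likely to come from a test set that happens to contain unusually many or few survivors.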


In [4]:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# List of different test sizes
splits_arr = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

# Lists to store FPR and FNR
fpr_list = []
fnr_list = []

print("This is the 5-KNN neighbor. Using different test-training size splits, I will see how it affects FP, FN, and Accuracy.")

for split in splits_arr:
    # round() avoids truncation from floating-point error (e.g. 28.999... becoming 28)
    train_percentage = round((1 - split) * 100)
    test_percentage = round(split * 100)
    print(f"\nThis is a {test_percentage}% test and {train_percentage}% train split.")
    
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split, random_state=20)
    
    # Standardize the data
    scaler = StandardScaler()
    X_train_scale = scaler.fit_transform(X_train)
    X_test_scale = scaler.transform(X_test)  

    # Train the KNN model with a fixed number of neighbors (n_neighbors=5)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_scale, y_train)

    # Predict on the test set
    y_pred = knn.predict(X_test_scale)

    # Calculate Accuracy, False Positives (FP), False Negatives (FN)
    accuracy = accuracy_score(y_test, y_pred)
    c_matrix = confusion_matrix(y_test, y_pred)
    
    FP = c_matrix[0, 1] 
    FN = c_matrix[1, 0]
    TP = c_matrix[1, 1]
    TN = c_matrix[0, 0]

    # Calculate False Positive Rate (FPR) and False Negative Rate (FNR)
    FP_rate = FP / (FP + TN) if (FP + TN) > 0 else 0
    FN_rate = FN / (FN + TP) if (FN + TP) > 0 else 0

    # Store FPR and FNR
    fpr_list.append(FP_rate)
    fnr_list.append(FN_rate)

    # Output results
    print(f"Model Accuracy: {accuracy * 100:.2f}%")
    print(f"Confusion Matrix:\n{c_matrix}")
    print(f"False Positive Rate: {FP_rate:.4f}")
    print(f"False Negative Rate: {FN_rate:.4f}")

# Plotting FPR and FNR
plt.figure(figsize=(10, 6))
plt.plot(splits_arr, fpr_list, label="False Positive Rate", marker='o')
plt.plot(splits_arr, fnr_list, label="False Negative Rate", marker='o')

# Adding titles and labels
plt.title('The Effect of Test Size on False Positive and False Negative Rates')
plt.xlabel('Test Size Split')
plt.ylabel('Rate')
plt.legend()
plt.grid(True)
plt.show()




This is the 5-KNN neighbor. Using different test-training size splits, I will see how it affects FP, FN, and Accuracy.

This is a 5% test and 95% train split.


NameError: name 'X' is not defined
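
The false positive and false negative rates computed in the loop can be checked by hand on a tiny example. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical) and unpacks the confusion matrix with `ravel()`, which for binary labels `[0, 1]` yields the counts in the order TN, FP, FN, TP. This is the same layout the loop indexes as `c_matrix[0, 0]`, `c_matrix[0, 1]`, `c_matrix[1, 0]`, `c_matrix[1, 1]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels (hypothetical): four actual negatives, four actual positives.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

fpr = fp / (fp + tn)  # share of actual negatives predicted as positive
fnr = fn / (fn + tp)  # share of actual positives predicted as negative

print(tn, fp, fn, tp)
print(fpr, fnr)
```

Here one of the four actual negatives is predicted positive and two of the four actual positives are predicted negative, so the rates are 1/4 and 2/4.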