## About
_________

This notebook contains a very fast fundamental k-nearest neighbour example in Python.
	
This work is part of a series called [Machine learning in minutes - very fast fundamental examples in Python](https://www.kaggle.com/jamiemorales/machine-learning-in-minutes-very-fast-examples). 
	
The approach is designed to help grasp the applied machine learning lifecycle in minutes. It is not a substitute for taking the time to learn properly. Its aim is to help someone get started fast and gain an intuitive understanding of the typical steps early on.

## Step 0: Understand the problem
The goal here is to classify the conditions of orthopedic patients based on their biomechanical features.

## Step 1: Set-up and understand data
This step helps uncover issues that we will want to address in the next step and take into account when building and evaluating our model. We also want to find interesting relationships or patterns that we can possibly leverage in solving the problem we specified.

In [None]:
# Set-up libraries
import os
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [None]:
# Check data input source
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read-in data
df = pd.read_csv('../input/biomechanical-features-of-orthopedic-patients/column_3C_weka.csv')

In [None]:
# Look at some details
df.info()

In [None]:
# Look at some records
df.head()

In [None]:
# Check for missing values
df.isna().sum()

In [None]:
# Look at breakdown of label
# (print explicitly: only the last expression in a cell is displayed)
print(df['class'].value_counts())
sns.countplot(x='class', data=df)

In [None]:
# Explore data visually with multiple scatter plots
sns.pairplot(df, hue='class')

In [None]:
# Summarise
df.describe()

## Step 2: Preprocess data
This step typically takes the most time in the cycle, but most of the datasets chosen for this series are already clean.
	
Real-world datasets are often noisy and incomplete. The choices we make in this step to address data issues can impact downstream steps and the result itself. For example, it can be tricky to address missing data when we don't know why it's missing. Is it missing completely at random or not? It can also be tricky to address outliers if we do not understand the domain and problem context well enough.
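
As a rough illustration of the kind of choices involved, here is a minimal sketch on a hypothetical toy table (the column names and values are made up and are not from the dataset used in this notebook). It fills missing values with the column median and flags possible outliers with the 1.5 × IQR rule of thumb. Both are simple defaults, not recommendations, and whether they are appropriate depends on why the data is missing and on the domain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are for illustration only
toy = pd.DataFrame({
    'feature_a': [12.0, np.nan, 15.5, 14.0, 90.0],
    'feature_b': [1.0, 2.0, np.nan, 2.5, 3.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# One simple option: fill gaps with each column's median
toy_filled = toy.fillna(toy.median())

# Flag potential outliers with the 1.5 * IQR rule of thumb
q1 = toy_filled.quantile(0.25)
q3 = toy_filled.quantile(0.75)
iqr = q3 - q1
outlier_mask = (toy_filled < q1 - 1.5 * iqr) | (toy_filled > q3 + 1.5 * iqr)

print(missing_counts)
print(toy_filled)
print(outlier_mask)
```

Note that a flagged value is only a candidate for closer inspection. Deciding whether to keep, cap, or drop it still requires understanding the problem context.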

In [None]:
# Split dataset into 80% train and 20% validation
X = df.drop('class', axis=1)
y = df['class']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

## Step 3: Model and evaluate
This last step is three-fold.

We create the model and fit the model to the data we prepared for training.
	
We then use the model to classify the data we set aside for validation.
	
Lastly, we evaluate the model's performance with mainstream classification metrics. 
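
To build some intuition for what the classifier does when it predicts, here is a minimal from-scratch sketch of the k-nearest neighbours idea: find the k training points closest to a new point and take a majority vote of their labels. The function `knn_predict` and the toy arrays are hypothetical and only for illustration. This is not how scikit-learn implements `KNeighborsClassifier` internally, and it assumes plain Euclidean distance.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0])))
```

Because the vote depends entirely on distances, features measured on larger scales can dominate the result, which is worth keeping in mind when interpreting the metrics.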

In [None]:
# Build model and fit it to the training data
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

In [None]:
# Apply model to validation data
y_predict = classifier.predict(X_val)

In [None]:
# Compare actual and predicted values
actual_vs_predict = pd.DataFrame({'Actual': y_val,
                                  'Prediction': y_predict})
actual_vs_predict.head(10)

In [None]:
# Evaluate model
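# Concatenate rather than pass two arguments, so print() does not add a
# separator space that misaligns the first row of the report
print('Classification metrics:\n' + classification_report(y_val, y_predict))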

## Learn more
If you found this example interesting, you may also want to check out:

* [Machine learning in minutes - very fast fundamental examples in Python](https://www.kaggle.com/jamiemorales/machine-learning-in-minutes-very-fast-examples)
* [List of machine learning methods & datasets](https://www.kaggle.com/jamiemorales/list-of-machine-learning-methods-datasets)

Thanks for reading. Don't forget to upvote.