<h3> KNN (k - nearest neighbors ) </h3>

<h4>FORMULA TO CALCULATE:</h4>
<h3 style="color: blue">distance(x, y) = sqrt(sum(wi * (xi - yi)^2))</h3>
<p>where:</p>
<ul>
  <li>x and y are the two data points being compared</li>
  <li>xi and yi are the values of the ith feature (dimension) of x and y, respectively</li>
  <li>wi is the weight assigned to the ith feature</li>
  <li>sqrt() is the square root function</li>
</ul>


<h3>Disadvantages:</h3>
<p><em>In production:</em></p>
<p>We need to deploy all training sets on the server so that our model can calculate nearest distances between rows and fill in missing values, which can be costly.</p>


In [1]:
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.impute import KNNImputer 
from sklearn.impute import SimpleImputer  
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv("train.csv")

In [3]:
#extracting only required columns 
df = df[['Age' , 'Pclass' , 'Fare' , 'Survived']] 

In [14]:
df.sample(5)

Unnamed: 0,Age,Pclass,Fare,Survived
80,22.0,3,9.0,0
282,16.0,3,9.5,0
856,45.0,1,164.8667,1
826,,3,56.4958,0
761,41.0,3,7.125,0


In [15]:
#calculating percentages of null values 
df.isnull().mean()*100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [16]:

#train test split 
x_train , x_test , y_train , y_test = train_test_split(df.drop(columns = "Survived" ) , df['Survived'] , test_size = 0.2 , random_state = 2)

In [17]:
knn = KNNImputer(n_neighbors = 4 ) 
x_train_trf = knn.fit_transform(x_train) 
x_test_trf = knn.fit_transform(x_test) 


In [18]:
lr = LogisticRegression() 
lr.fit(x_train_trf , y_train ) 
y_pred = lr.predict(x_test_trf) 
accuracy_score(y_test,y_pred)


0.6983240223463687

In [19]:
knn = KNNImputer(n_neighbors=3,weights='distance')

x_train_dis = knn.fit_transform(x_train)
x_test_dis= knn.transform(x_test)

#applying liner regression model
lr_0 = LogisticRegression()
lr_0.fit(x_train_dis ,  y_train) 
y_pred = lr_0.predict(x_test_dis)
accuracy_score (y_test , y_pred)

#note this is the higest no of accurecy that model got ever 


0.7150837988826816

In [20]:
#from simpleimpute
from sklearn.impute import SimpleImputer
#creating instances of simpleimputer class 
si = SimpleImputer(strategy = 'mean') 
si.fit(x_train)
x_train_si_mean = si.transform(x_train)
x_test_si_mean =si.transform(x_test) 


Trying to fill same values with simple imputer 


In [21]:
#creating another logisticregression class instance 
lr_1 = LogisticRegression()


lr_1.fit(x_train_si_mean  , y_train) 
y_pred = lr_1.predict(x_test_si_mean ) 
accuracy_score (y_test , y_pred)

0.6927374301675978

<h2> Conc </h2>

<table>
  <caption>Conclusion</caption>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>KNN</th>
      <th>SimpleImputer</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Weight</td>
      <td>Uniform = 0.6983240223463687 </td>
      <td>null</td>
    </tr>
    <tr>
      <td></td>
      <td></td>
      <td>null</td>
    </tr>
    <tr>
      <td>Weight</td>
      <td>Distance = 0.7150837988826816 </td>
      <td>null</td>
    </tr>
    <tr>
      <td></td>
      <td></td>
      <td>null</td>
    </tr>
    <tr>
      <td>Strategy</td>
      <td>null</td>
      <td>Mean = 0.6927374301675978 </td>
    </tr>
    <tr>
      <td></td>
      <td>null</td>
      <td>   </td>
    </tr>
    <tr>
      <td>Strategy</td>
      <td>null</td>
      <td>Most Frequent = 0.664804469273743</td>
    </tr>
    <tr>
      <td></td>
      <td>null</td>
      <td>td>
    </tr>
  </tbody>
</table>


#  from this we know so knn accuracy is better then simpleimputer  