## Learning Objective

At the end of this experiment, you will be able to:

* Perform Data preprocessing

## Dataset

### Description

We will use district-wise demographic, enrollment, school, and teacher indicator data to predict whether the literacy rate in each district is high, medium, or low.

### Data Preprocessing

Data preprocessing is an important step in solving every machine learning problem. Most
datasets used for machine learning need to be processed, cleaned, or transformed before a
machine learning algorithm can be trained on them.

There are several steps involved in data preprocessing:

    1. Data Cleaning → In this step the primary focus is on
        - Handling missing data
        - Handling noisy data
        - Detecting and removing outliers
    
    2. Data Integration → This step is used when data is gathered from several sources and
    must be combined into one consistent dataset. The combined data, once cleaned, is used
    for analysis.
    
    3. Data Transformation → In this step we convert the raw data into the format required
    by the model we are building. Common transformation options include:
        - Normalization
        - Aggregation
        - Generalization
        
    4. Data Reduction → After transformation and scaling, redundancy within the data is
    removed and the data is organized efficiently.
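
The four steps above can be sketched end to end on a tiny example. The two tables, their column names, and the choice of median filling and min-max normalization are illustrative assumptions, not the exercise's actual data or required method.

```python
import pandas as pd

# Hypothetical toy tables standing in for two data sources
basic = pd.DataFrame({
    "distcd": [1, 2, 3],
    "population": [1000.0, None, 3000.0],   # one missing value
})
schools = pd.DataFrame({
    "distcd": [1, 2, 3],
    "totschools": [10, 20, 30],
    "year": ["2012-13"] * 3,                # constant column, carries no signal
})

# 1. Cleaning: fill the missing population with the column median
basic["population"] = basic["population"].fillna(basic["population"].median())

# 2. Integration: combine both sources on the shared key
data = basic.merge(schools, on="distcd", how="inner")

# 3. Transformation: min-max normalise the numeric features to [0, 1]
for col in ["population", "totschools"]:
    data[col] = (data[col] - data[col].min()) / (data[col].max() - data[col].min())

# 4. Reduction: drop columns that take a single value
data = data.loc[:, data.nunique() > 1]

print(data)
```

Each step here is a single line so that the order of operations is easy to follow; on real data every step usually needs more care.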



In [None]:
# Run this cell to download the data (you will get the zip file)
!wget https://cdn.talentsprint.com/aiml/Experiment_related_data/data-20190108T113429Z-001.zip

In [None]:
# Run this cell to unzip the data
!unzip data-20190108T113429Z-001.zip

In [None]:
!ls

In [None]:
%cd data

In [None]:
!ls

#### Exercise 1 
We have four different files:

* Districtwise_Basicdata.csv
* Districtwise_Enrollment_details_indicator.csv
* Districtwise_SchoolData.csv
* Districtwise_Teacher_indicator.csv

These files contain the necessary data to solve the problem.
Load all the files correctly after inspecting their header rows, data records, etc.

Hint : Use read_csv from pandas
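
As a minimal sketch of what "inspecting the header" can look like, the example below reads a small in-memory CSV whose first line is a title rather than the header. The file content is made up for illustration; whether the real files need `skiprows` is something to check by opening them.

```python
import io
import pandas as pd

# Hypothetical file content: a title line precedes the real header row
raw = io.StringIO(
    "District report 2012-13\n"
    "Year,Statecd,distcd,totschools\n"
    "2012-13,35,3501,212\n"
    "2012-13,35,3503,181\n"
)

# skiprows=1 skips the title line so the second line becomes the header
sample = pd.read_csv(raw, skiprows=1)

print(sample.shape)      # rows and columns actually parsed
print(sample.dtypes)     # confirm numeric columns were not read as text
print(sample.head())
```

If a numeric column shows up with dtype `object`, that is usually a sign that a stray header or text row was parsed as data.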

In [1]:

import pandas as pd
import numpy as np
df=pd.read_csv("D:/Districtwise_Basicdata.csv")
df1=pd.read_csv("D:/Districtwise_Enrollment_details_indicator.csv")
df2=pd.read_csv("D:/Districtwise_SchoolData.csv")
df3=pd.read_csv("D:/Districtwise_Teacher_indicator.csv")
df


Unnamed: 0,Year,Statecd,statename,distcd,distname,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,sexratio_06,growthrate,p_sc_pop,p_st_pop,overall_lit,female_lit
0,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,23616.05,55.89,874.0,980.0,13.97,0.00,1.72,High,84.52
1,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3,13,76,181,105539.0,11651.51,2.60,925.0,975.0,-0.07,0.00,0.72,High,79.39
2,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,3,8,42,58,36819.0,4226.82,0.00,778.0,961.0,-12.48,0.00,64.28,High,70.70
3,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,52,356,1576,4983,2737738.0,295675.70,27.68,1003.0,942.0,10.04,17.82,18.09,Low,51.99
4,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,63,564,929,5188,4083315.0,427114.75,28.09,977.0,927.0,12.16,14.29,3.78,Low,54.31
5,2012-13,28,ANDHRA PRADESH ...,2823,CHITTOOR ...,66,571,1404,6590,4170468.0,423302.50,29.47,1002.0,931.0,11.33,18.83,3.81,Medium,63.65
6,2012-13,28,ANDHRA PRADESH ...,2820,CUDDAPAH ...,51,398,863,4634,2884524.0,313547.76,34.10,984.0,919.0,10.87,16.16,2.63,Medium,57.26
7,2012-13,28,ANDHRA PRADESH ...,2814,EAST GODAVARI ...,60,414,1278,5892,5151549.0,492488.08,25.52,1005.0,969.0,5.10,18.34,4.14,Medium,67.82
8,2012-13,28,ANDHRA PRADESH ...,2817,GUNTUR ...,57,375,722,4935,4889230.0,466432.54,33.89,1003.0,948.0,9.50,19.59,5.06,Medium,60.64
9,2012-13,28,ANDHRA PRADESH ...,2805,HYDERABAD ...,16,92,60,3265,4010238.0,419470.89,100.00,943.0,938.0,4.71,6.29,1.24,High,78.42


In [2]:
# Your Code Here
#!cat Districtwise_Enrollment_details_indicator.csv

#### Exercise 2  

* Remove the unwanted columns, which are unlikely to contribute to the prediction of the overall literacy grade. What counts as an unwanted column depends mainly on how it affects your final accuracy (and very little on your domain understanding of the education sector in India; you are, however, encouraged to exercise some domain understanding too if you wish).

**Hint:** Use the pandas drop function to drop your choice of unwanted columns (if any).


* As the required data is spread across different files, we need to integrate all four into a single dataframe/dataset. To do so, create a unique identifier for each row in every dataframe so that the data in the different files can be matched correctly
* Join/integrate this data 

Example: the data for the district Anantapur in Andhra Pradesh, which is present in different files, should form a single row 

Hint : 
* Use the combination of year, state code, and district code as the unique identifier 

* Refer to the following link for merge, join and concat syntax:  

https://pandas.pydata.org/pandas-docs/stable/merging.html
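
The effect of the composite key can be seen on a small example. The two frames below are hypothetical miniatures; the column names `Year`, `Statecd`, and `distcd` follow the basic-data file, and it is an assumption that the other files use the same names.

```python
import pandas as pd

# Hypothetical miniature versions of two district files, two years each
basic = pd.DataFrame({
    "Year": ["2012-13", "2013-14"],
    "Statecd": [35, 35],
    "distcd": [3501, 3501],
    "totschools": [212, 211],
})
teachers = pd.DataFrame({
    "Year": ["2012-13", "2013-14"],
    "Statecd": [35, 35],
    "distcd": [3501, 3501],
    "tch_nontch": [519, 50],
})

# Joining on the district code alone pairs every year with every year
by_district = pd.merge(basic, teachers, on="distcd", how="inner")

# Joining on the full key keeps one row per district per year
keys = ["Year", "Statecd", "distcd"]
by_key = pd.merge(basic, teachers, on=keys, how="inner", validate="one_to_one")

print(len(by_district))  # 4
print(len(by_key))       # 2
```

The `validate="one_to_one"` argument makes pandas raise an error if the key is not unique on both sides, which is a cheap check that the identifier really is unique.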


In [3]:
# Your Code Here
df.dropna(axis=0,inplace=True)
df1.dropna(axis=0,inplace=True)
df2.dropna(axis=0,inplace=True)
df3.dropna(axis=0,inplace=True)

In [4]:
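# Note: joining on distcd alone matches every year of a district in one file with
# every year of that district in the next, so rows repeat in the result.
# The composite key suggested in the hint (year, state code, district code) avoids this.
df_merge = pd.merge(df,df1, on='distcd',how='inner')
df_merge1 = pd.merge(df_merge,df2, on='distcd',how='inner')
df_merge2= pd.merge(df_merge1,df3, on='distcd',how='inner')
df_merge2.head()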

Unnamed: 0,Year_x,Statecd_x,statename_x,distcd,distname_x,blocks,clusters,villages,totschools,totpopulation,...,trn_tch_f2,trn_tch_f3,trn_tch_f4,trn_tch_f5,trn_tch_f6,trn_tch_f7,prof_trn_tch_r,prof_trn_tch_p,days_nontch,tch_nontch
0,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,...,176,135,0,22,103,0,2968,228,12,519
1,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,...,79,82,0,8,35,0,2873,232,14,50
2,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,...,176,135,0,22,103,0,2968,228,12,519
3,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,...,79,82,0,8,35,0,2873,232,14,50
4,2013-14,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,211,237586.0,...,176,135,0,22,103,0,2968,228,12,519


In [5]:
df_merge2.isnull().sum()

Year_x            0
Statecd_x         0
statename_x       0
distcd            0
distname_x        0
blocks            0
clusters          0
villages          0
totschools        0
totpopulation     0
p_06_pop          0
p_urb_pop         0
sexratio          0
sexratio_06       0
growthrate        0
p_sc_pop          0
p_st_pop          0
overall_lit       0
female_lit        0
Year_y            0
Statecd_y         0
State Name _x     0
distname_y        0
Enr Govt1         0
Enr Govt2         0
Enr Govt3         0
Enr Govt4         0
Enr Govt5         0
Enr Govt6         0
Enr Govt7         0
                 ..
tch_st_m3         0
tch_st_m4         0
tch_st_m5         0
tch_st_m6         0
tch_st_m7         0
tch_st_f1         0
tch_st_f2         0
tch_st_f3         0
tch_st_f4         0
tch_st_f5         0
tch_st_f6         0
tch_st_f7         0
trn_tch_m1        0
trn_tch_m2        0
trn_tch_m3        0
trn_tch_m4        0
trn_tch_m5        0
trn_tch_m6        0
trn_tch_m7        0


Follow these steps to clean the data:

#### Exercise 3 

* overall_lit is our target variable, which we need to predict. Delete the rows in which overall_lit is missing
* Decide how to replace the missing values in every other column appropriately with the mean/median/mode
* Convert categorical values to numerical values

Example: if a feature contains categorical values such as dog, cat, mouse, etc., replace them with 1, 2, 3, etc., or use one-hot encoding (your judgement)

*Hint* :
* Use pandas fillna function to replace the missing values
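
The three cleaning steps can be sketched on a toy frame as follows. The column `area_type` is hypothetical and the choice of median for numbers and mode for categories is one reasonable option, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in the target, a numeric and a categorical column
toy = pd.DataFrame({
    "overall_lit": ["High", None, "Low", "Medium", "Low"],
    "female_lit": [84.5, 79.4, np.nan, 63.6, 52.0],
    "area_type": ["urban", "rural", None, "rural", "rural"],
})

# Rows without a target cannot be used for supervised training
toy = toy.dropna(subset=["overall_lit"]).copy()

# Numeric gap: the median is less sensitive to outliers than the mean
toy["female_lit"] = toy["female_lit"].fillna(toy["female_lit"].median())

# Categorical gap: use the most frequent value (mode)
toy["area_type"] = toy["area_type"].fillna(toy["area_type"].mode()[0])

# One-hot encode the categorical feature
toy = pd.get_dummies(toy, columns=["area_type"])
print(toy)
```

Note that calling `dropna` on the whole frame would also discard rows whose only gap is in a feature column, which throws away data that `fillna` could have kept.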

In [6]:
# Your Code Here
from sklearn.preprocessing import LabelEncoder 
  
le = LabelEncoder() 
  
df_merge2['overall_lit']= le.fit_transform(df_merge2['overall_lit']) 

In [7]:
df_merge2['overall_lit']

0       0
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      2
25      2
26      2
27      2
28      2
29      2
       ..
4530    0
4531    0
4532    0
4533    0
4534    0
4535    0
4536    0
4537    0
4538    0
4539    0
4540    2
4541    2
4542    2
4543    2
4544    2
4545    2
4546    2
4547    2
4548    0
4549    0
4550    0
4551    0
4552    0
4553    0
4554    0
4555    0
4556    1
4557    1
4558    1
4559    1
Name: overall_lit, Length: 4560, dtype: int32
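
To interpret the integer codes, it helps to know that `LabelEncoder` numbers the classes in sorted order, so the codes do not follow the natural Low < Medium < High ranking. The sketch below shows the mapping on a few sample labels and, as an alternative that is an assumption rather than part of the exercise, an explicit ordinal mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(["High", "Low", "Medium", "Low", "High"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the labels in the order of their integer codes
print(le.classes_.tolist())
print(encoded.tolist())

# Alternative: an explicit mapping that preserves Low < Medium < High
order = {"Low": 0, "Medium": 1, "High": 2}
ordinal = labels.map(order)
print(ordinal.tolist())
```

For a classification target the numbering does not affect the model, but it matters when reading a confusion matrix or when the same encoding is applied to a feature.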

#### Exercise 4 

Use the functions below to adjust the outliers.

The smooth_out function takes a pandas dataframe as input and calculates the mean and standard deviation of every column, then checks whether every value in that column lies within mean ± 2 × standard deviation.
Any value outside that range is moved onto the nearest boundary.

**Hint:** Should the index column be normalized too? 

<img src="https://cdn.talentsprint.com/aiml/Experiment_related_data/normal_dist.png">
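
The same clamping can be written without a Python loop over values. The function below is a hedged sketch, not the course's implementation: the name `smooth_out_vectorised` and the parameter `k` are made up here, and it deliberately leaves non-numeric columns untouched and returns a copy instead of modifying its input.

```python
import numpy as np
import pandas as pd

def smooth_out_vectorised(frame, k=2.0):
    """Clamp each numeric column to mean +/- k population standard deviations."""
    numeric = frame.select_dtypes(include="number")
    mean = numeric.mean()
    sd = numeric.std(ddof=0)          # ddof=0 matches np.std
    clipped = numeric.clip(lower=mean - k * sd, upper=mean + k * sd, axis=1)
    result = frame.copy()
    result[numeric.columns] = clipped
    return result

# Nine ordinary values and one extreme value
toy = pd.DataFrame({"value": [1.0] * 9 + [100.0], "name": list("abcdefghij")})
smoothed = smooth_out_vectorised(toy)
print(smoothed["value"].max())
```

Here the mean is 10.9 and the population standard deviation is 29.7, so the extreme value 100 is pulled down to the upper bound 10.9 + 2 × 29.7 = 70.3. Identifier columns such as district codes are numeric too, so in practice they would need to be excluded before clamping.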

In [8]:
# Function to clip and clamp the data
def clip_clamp(x, mean, sd):
    # If the value is below mean - 2*sd, move it up to that lower bound
    if x < mean - 2*sd :
        return mean - 2*sd
    # If the value is above mean + 2*sd, move it down to that upper bound
    elif x > mean + 2*sd :
        return mean + 2*sd
    # Otherwise the value is already inside the range, so return it unchanged
    else :
        return x

In [9]:
# Function to smooth the data
def smooth_out(Total_data):
    for i in Total_data.columns:
        # Calculating the mean value
        mean = np.mean(Total_data[i].values, axis=0)
        # Calculating the standard deviation value
        sd = np.std(Total_data[i].values, axis=0)
        # Calculating the corrected value using clip and clamp function
        corrected = np.array([clip_clamp(x, mean, sd) for x in Total_data[i].values])
        # Storing the data in form of series
        Total_data[i] = pd.Series(corrected, index=Total_data[i].index)
    return Total_data

In [15]:
# Your Code Here
smooth_out(df_merge2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()


Unnamed: 0,Statecd_x,distcd,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,...,trn_tch_f2,trn_tch_f3,trn_tch_f4,trn_tch_f5,trn_tch_f6,trn_tch_f7,prof_trn_tch_r,prof_trn_tch_p,days_nontch,tch_nontch
0,35,3501,3.000000,16.000000,83.000000,212.00000,2.375860e+05,23616.050000,55.89,874.000000,...,176.0,135.0,0.0,22.00000,103.0,0.000000,2968.000000,228.0,12.0,519.000000
1,35,3501,3.000000,16.000000,83.000000,212.00000,2.375860e+05,23616.050000,55.89,874.000000,...,79.0,82.0,0.0,8.00000,35.0,0.000000,2873.000000,232.0,14.0,50.000000
2,35,3501,3.000000,16.000000,83.000000,212.00000,2.375860e+05,23616.050000,55.89,874.000000,...,176.0,135.0,0.0,22.00000,103.0,0.000000,2968.000000,228.0,12.0,519.000000
3,35,3501,3.000000,16.000000,83.000000,212.00000,2.375860e+05,23616.050000,55.89,874.000000,...,79.0,82.0,0.0,8.00000,35.0,0.000000,2873.000000,232.0,14.0,50.000000
4,35,3501,3.000000,16.000000,83.000000,211.00000,2.375860e+05,23616.050000,55.89,874.000000,...,176.0,135.0,0.0,22.00000,103.0,0.000000,2968.000000,228.0,12.0,519.000000
5,35,3501,3.000000,16.000000,83.000000,211.00000,2.375860e+05,23616.050000,55.89,874.000000,...,79.0,82.0,0.0,8.00000,35.0,0.000000,2873.000000,232.0,14.0,50.000000
6,35,3501,3.000000,16.000000,83.000000,211.00000,2.375860e+05,23616.050000,55.89,874.000000,...,176.0,135.0,0.0,22.00000,103.0,0.000000,2968.000000,228.0,12.0,519.000000
7,35,3501,3.000000,16.000000,83.000000,211.00000,2.375860e+05,23616.050000,55.89,874.000000,...,79.0,82.0,0.0,8.00000,35.0,0.000000,2873.000000,232.0,14.0,50.000000
8,35,3503,3.000000,13.000000,76.000000,181.00000,1.055390e+05,11651.510000,2.60,925.000000,...,85.0,40.0,3.0,28.00000,60.0,0.000000,1249.000000,203.0,8.0,362.000000
9,35,3503,3.000000,13.000000,76.000000,181.00000,1.055390e+05,11651.510000,2.60,925.000000,...,36.0,8.0,0.0,24.00000,41.0,0.000000,1355.000000,218.0,6.0,78.000000


In [11]:
df_merge2.dtypes

Year_x             object
Statecd_x           int64
statename_x        object
distcd              int64
distname_x         object
blocks              int64
clusters            int64
villages            int64
totschools          int64
totpopulation     float64
p_06_pop          float64
p_urb_pop         float64
sexratio          float64
sexratio_06       float64
growthrate        float64
p_sc_pop          float64
p_st_pop          float64
overall_lit         int32
female_lit        float64
Year_y             object
Statecd_y           int64
State Name _x      object
distname_y         object
Enr Govt1           int64
Enr Govt2         float64
Enr Govt3           int64
Enr Govt4         float64
Enr Govt5           int64
Enr Govt6         float64
Enr Govt7           int64
                   ...   
tch_st_m3           int64
tch_st_m4           int64
tch_st_m5           int64
tch_st_m6           int64
tch_st_m7           int64
tch_st_f1           int64
tch_st_f2           int64
tch_st_f3   

In [33]:
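X=df_merge2
# Target labels used by the classifier cells below
# Note: X still includes the overall_lit column; drop it from X to avoid leaking the target into the features
y=df_merge2['overall_lit']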

In [34]:
df_merge2 = df_merge2.select_dtypes(exclude=['object'])
print(df_merge2)

      Statecd_x  distcd     blocks    clusters     villages  totschools  \
0            35    3501   3.000000   16.000000    83.000000   212.00000   
1            35    3501   3.000000   16.000000    83.000000   212.00000   
2            35    3501   3.000000   16.000000    83.000000   212.00000   
3            35    3501   3.000000   16.000000    83.000000   212.00000   
4            35    3501   3.000000   16.000000    83.000000   211.00000   
5            35    3501   3.000000   16.000000    83.000000   211.00000   
6            35    3501   3.000000   16.000000    83.000000   211.00000   
7            35    3501   3.000000   16.000000    83.000000   211.00000   
8            35    3503   3.000000   13.000000    76.000000   181.00000   
9            35    3503   3.000000   13.000000    76.000000   181.00000   
10           35    3503   3.000000   13.000000    76.000000   186.00000   
11           35    3503   3.000000   13.000000    76.000000   186.00000   
12           35    3502  

In [35]:
df_merge2.columns

Index(['Statecd_x', 'distcd', 'blocks', 'clusters', 'villages', 'totschools',
       'totpopulation', 'p_06_pop', 'p_urb_pop', 'sexratio',
       ...
       'trn_tch_f2', 'trn_tch_f3', 'trn_tch_f4', 'trn_tch_f5', 'trn_tch_f6',
       'trn_tch_f7', 'prof_trn_tch_r', 'prof_trn_tch_p', 'days_nontch',
       'tch_nontch'],
      dtype='object', length=595)

#### Exercise 5 

Use the function below (corr_features) to identify features that are not highly correlated with each other, and remove the remaining features.
* corr_features takes a pandas dataframe, the list of columns in the dataframe, and bar (the correlation coefficient threshold)

In [36]:
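# Function to find uncorrelated features
def corr_features(df,cols,bar=0.9):
    cols = list(cols)
    dropped = set()
    for c,i in enumerate(cols[:-1]):
        # A column that was already dropped does not need to be compared again
        if i in dropped:
            continue
        for j in cols[c+1:]:
            if j in dropped:
                continue
            score = df[i].corr(df[j])
            # Drop j when it is strongly correlated (positively or negatively) with i
            if abs(score)>bar:
                dropped.add(j)
    # Return the surviving columns in their original order
    return [col for col in cols if col not in dropped]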

In [37]:
#df_merge2.corr()
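
A related way to filter correlated features is to work from the full correlation matrix. The sketch below is an alternative under stated assumptions, not the function used in this exercise: the toy columns are synthetic, and the rule drops any column whose absolute correlation with an earlier column exceeds the threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3.0 * a + 1.0,        # perfectly correlated with "a"
    "a_negated": -a,                  # perfectly anti-correlated with "a"
    "b": rng.normal(size=200),        # independent noise
})

bar = 0.9
corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > bar).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Taking the absolute value handles strong negative correlations in the same pass. With several hundred columns the matrix is still small enough to compute in one call, which is usually faster than correlating pairs one at a time.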

In [38]:
# Your Code Here


#### Exercise 6 

Perform Mean Correction and Standard Scaling on the data feature/column wise.

**Hint:** To understand the idea behind the terms used above, you may refer to the following link: 

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
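
Mean correction subtracts each column's mean, and standard scaling then divides by each column's standard deviation, giving z = (x - mean) / sd per column. The sketch below, on a made-up 3 × 2 array, checks that doing this by hand agrees with `StandardScaler`, which uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

# Mean correction: subtract each column's mean
centred = X_toy - X_toy.mean(axis=0)
# Standard scaling: divide by each column's (population) standard deviation
manual = centred / X_toy.std(axis=0)

scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))   # True
print(manual.mean(axis=0), manual.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1 (up to floating-point error), so features measured in very different units contribute comparably to distance-based models such as k-nearest neighbours.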

In [39]:
# Your Code Here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Learn each column's mean and standard deviation, then apply the scaling
print(scaler.fit(df_merge2))
print(scaler.mean_)
print(scaler.transform(df_merge2))


StandardScaler(copy=True, with_mean=True, with_std=True)
[1.68026316e+01 1.69656579e+03 1.01256917e+01 1.21581130e+02
 9.07840162e+02 2.15136872e+03 1.83164729e+06 2.45916446e+05
 2.43115473e+01 9.42891936e+02 9.17818070e+02 1.76430178e+01
 1.50156079e+01 1.47971752e+01 6.47905515e+01 1.68026316e+01
 8.65468597e+04 5.28241581e+04 2.27381242e+03 1.64665807e+04
 8.42525653e+03 3.79933366e+03 4.69916497e+03 1.51168169e+00
 2.40294205e+04 2.74245942e+04 1.80380344e+04 4.92026650e+03
 7.14950281e+03 8.92577589e+03 4.23923549e+03 7.36315725e-01
 7.82037467e+04 4.67101418e+04 9.14328944e+02 1.51545226e+04
 6.07452098e+03 3.35656823e+03 3.86267650e+03 1.94060565e-01
 1.52785152e+04 1.42669597e+04 7.18487635e+03 3.98744067e+03
 4.39006623e+03 4.47858158e+03 2.40633362e+03 4.31608391e-01
 4.72291612e+04 4.26436031e+04 4.11219974e+04 3.85799884e+04
 3.63419580e+04 3.03665367e+04 2.82979574e+04 2.31039175e+04
 4.65808027e+04 4.24243316e+04 4.10959577e+04 3.91827124e+04
 3.73528637e+04 3.13239456e+

#### Exercise 7 **(Optional)**

You can apply different classifiers (from sklearn) to the preprocessed data.
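
One way to try a classifier while keeping preprocessing honest is to put the scaler and the model in a single pipeline, so the scaler is fitted on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the district features; the sizes, the random seeds, and the choice of k = 3 are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the preprocessed features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0, stratify=y_demo
)

# The scaler is fitted on the training split only, then reused on the test split
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
print(f"held-out accuracy: {accuracy:.3f}")
```

Swapping `KNeighborsClassifier` for another sklearn estimator only changes the last step of the pipeline. On the real data, the feature matrix should exclude the target column, and near-identical rows for the same district should not end up on both sides of the split, otherwise the test accuracy will look better than it is.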

In [40]:
df_merge2.columns

Index(['Statecd_x', 'distcd', 'blocks', 'clusters', 'villages', 'totschools',
       'totpopulation', 'p_06_pop', 'p_urb_pop', 'sexratio',
       ...
       'trn_tch_f2', 'trn_tch_f3', 'trn_tch_f4', 'trn_tch_f5', 'trn_tch_f6',
       'trn_tch_f7', 'prof_trn_tch_r', 'prof_trn_tch_p', 'days_nontch',
       'tch_nontch'],
      dtype='object', length=595)

In [41]:
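def callKnn(data,targets):
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    # Hold out a third of the rows, fit a 3-nearest-neighbour classifier, and return the test accuracy
    X_train, X_test, y_train, y_test = train_test_split(data, targets, test_size=0.33)
    neigh = KNeighborsClassifier(n_neighbors=3)
    neigh.fit(X_train, y_train)
    predicted_labels = neigh.predict(X_test)
    return accuracy_score(y_test,predicted_labels)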

In [47]:
# Your Code Here
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# The default KNeighborsClassifier uses n_neighbors=5
knn=KNeighborsClassifier()
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [43]:
print(X)



      Statecd_x  distcd     blocks    clusters     villages  totschools  \
0            35    3501   3.000000   16.000000    83.000000   212.00000   
1            35    3501   3.000000   16.000000    83.000000   212.00000   
2            35    3501   3.000000   16.000000    83.000000   212.00000   
3            35    3501   3.000000   16.000000    83.000000   212.00000   
4            35    3501   3.000000   16.000000    83.000000   211.00000   
5            35    3501   3.000000   16.000000    83.000000   211.00000   
6            35    3501   3.000000   16.000000    83.000000   211.00000   
7            35    3501   3.000000   16.000000    83.000000   211.00000   
8            35    3503   3.000000   13.000000    76.000000   181.00000   
9            35    3503   3.000000   13.000000    76.000000   181.00000   
10           35    3503   3.000000   13.000000    76.000000   186.00000   
11           35    3503   3.000000   13.000000    76.000000   186.00000   
12           35    3502  

In [44]:
y

0       0
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      2
25      2
26      2
27      2
28      2
29      2
       ..
4530    0
4531    0
4532    0
4533    0
4534    0
4535    0
4536    0
4537    0
4538    0
4539    0
4540    2
4541    2
4542    2
4543    2
4544    2
4545    2
4546    2
4547    2
4548    0
4549    0
4550    0
4551    0
4552    0
4553    0
4554    0
4555    0
4556    1
4557    1
4558    1
4559    1
Name: overall_lit, Length: 4560, dtype: int32

In [48]:
# score is a method: call it on the held-out data to get the mean accuracy
score=knn.score(X_test, y_test)
score

0.9694352159468439

In [50]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(accuracy_score(y_test, y_pred))



0.9694352159468439


In [51]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.98       673
           1       0.96      0.95      0.95       277
           2       0.96      0.96      0.96       555

    accuracy                           0.97      1505
   macro avg       0.97      0.96      0.97      1505
weighted avg       0.97      0.97      0.97      1505



In [52]:
print(confusion_matrix(y_test, y_pred))

[[665   0   8]
 [  2 263  12]
 [ 12  12 531]]
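
The figures in the classification report can be recomputed from the confusion matrix alone, which is a useful sanity check. The sketch below uses the matrix printed above; rows are true classes and columns are predicted classes, following sklearn's convention.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[665,   0,   8],
               [  2, 263,  12],
               [ 12,  12, 531]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

print(accuracy)
print(recall.round(2), precision.round(2))
```

The diagonal sums to 1459 out of 1505 test rows, which gives the accuracy of about 0.969 reported by `accuracy_score`, and the row sums 673, 277, and 555 match the support column of the report.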


In [54]:
y_pred

array([0, 2, 2, ..., 0, 0, 2])

In [55]:
y_test

853     0
1161    2
3178    2
3387    2
3521    0
1278    2
3482    0
2317    2
3922    2
3418    1
3383    2
4316    0
3480    0
4000    1
2000    2
2360    0
2740    0
486     1
1509    0
1770    0
3917    2
1594    2
984     2
3023    0
2581    0
2419    0
732     2
1202    2
1448    0
217     1
       ..
4240    2
794     0
4389    0
541     2
3161    2
3878    2
148     1
2225    2
2183    0
283     2
1832    2
3274    1
3594    0
198     1
1946    0
2411    0
347     0
3390    0
2361    0
1250    0
196     1
897     0
306     2
1913    0
2554    0
1650    1
2536    0
220     0
1319    0
4303    2
Name: overall_lit, Length: 1505, dtype: int32

In [57]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))

0.9694352159468439
