# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



## Learning Objective

At the end of this experiment, you will be able to:

* perform data preprocessing

In [None]:
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("Running in Local Jupyter")


#@title  Mini Hackathon Walkthrough
from IPython.display import HTML

HTML("""<video width="640" height="420" controls>
  <source src="https://cdn.talentsprint.com/talentsprint/archives/sc/aiml/aiml_batch_15/preview_videos/Mini_Hackathon_Data_Munging_Briefing.mp4" type="video/mp4">
</video>
""")

## Problem Statement

We will use district-wise demographic, enrollment, and teacher-indicator data to predict whether the literacy rate of each district is high, medium, or low.

### Data Preprocessing

Data preprocessing is an important step in solving every machine learning problem. Most of
the datasets used in machine learning problems need to be processed, cleaned, or transformed
before a machine learning algorithm can be trained on them.

Data preprocessing involves the following steps:

    1. Data Cleaning → In this step the primary focus is on
        - Handling missing data
        - Handling noisy data
        - Detection and removal of outliers
    
    2. Data Integration → Used when data is gathered from several sources; the sources are combined into a single, consistent dataset.
    The integrated data, once cleaned, is used for analysis.
    
    3. Data Transformation → Converts the raw data into the format required by the model being built.
    Common transformations include:
        - Normalization
        - Aggregation
        - Generalization
        
    4. Data Reduction → After transformation and scaling, redundant data is removed so that what remains is organized efficiently.
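
The four steps above can be sketched on a toy example. This is a minimal illustration, not the hackathon solution: the two small tables and their column names (`district`, `population`, `schools`, `schools_copy`) are hypothetical, and median filling and min-max normalisation are just one possible choice for each step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables; the column names are illustrative only.
basic = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "population": [1200.0, np.nan, 900.0, 1500.0],
})
schools = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "schools": [12, 8, 9, 15],
    "schools_copy": [12, 8, 9, 15],  # redundant duplicate of "schools"
})

# 1. Cleaning: fill the missing population with the column median.
basic["population"] = basic["population"].fillna(basic["population"].median())

# 2. Integration: combine both sources on the shared key.
merged = basic.merge(schools, on="district")

# 3. Transformation: min-max normalise the numeric columns to [0, 1].
numeric = merged.drop(columns="district")
normalised = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# 4. Reduction: drop columns that duplicate another column.
reduced = normalised.loc[:, ~normalised.T.duplicated()]
print(reduced)
```

After the last step only `population` and `schools` remain, because `schools_copy` is identical to `schools`.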



### Total Marks  = 20

In [None]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook="U1_MH1_Data_Munging" #name of the notebook

def setup():
    # Download and unpack the datasets with IPython's "sx" (shell execute) magic.
    # run_line_magic replaces the deprecated ipython.magic call.
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/B15_Data_Munging.zip")
    ipython.run_line_magic("sx", "unzip B15_Data_Munging.zip")
    print("Data downloaded successfully")

setup()

Data downloaded successfully


In [None]:
!ls

## Exercise 1 - Load and Explore the Data (3 Marks)
1. We have three different files

  * Districtwise_Basicdata.csv
  * Districtwise_Enrollment_details_indicator.csv
  * Districtwise_Teacher_indicator.csv

  These files contain the necessary data to solve the problem. <br>

2. Load the files based on the **team allocation** given below. While loading, inspect the header rows and the data records.
  
  Hint : Use read_csv from pandas with [skiprows or header](https://towardsdatascience.com/import-csv-files-as-pandas-dataframe-with-skiprows-skipfooter-usecols-index-col-and-header-fbf67a2f92a) options.

3. Read the columns of the dataset and rename them if required.

  Hint : Rename column names (if any) using the following [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).

Team allocation for dataset selection

    Team A = 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33
        Districtwise_Basicdata.csv
        Districtwise_Enrollment_details_indicator.csv

    Team B = 2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32
        Districtwise_Basicdata.csv
        Districtwise_Teacher_indicator.csv
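
As a hedged illustration of the two hints above, the sketch below reads an in-memory CSV that has one title line above the real header, then renames columns. The file contents and the original column names (`State Code`, `Dist Code`) are made up for the example; the real files may need a different `skiprows` value.

```python
import io
import pandas as pd

# Hypothetical file contents: one title line above the real header row.
raw = io.StringIO(
    "District report 2012-13\n"
    "Year,State Code,Dist Code\n"
    "2012-13,35,3501\n"
    "2012-13,28,2801\n"
)

# skiprows=1 discards the title line so the second line becomes the header.
toy = pd.read_csv(raw, skiprows=1)

# Rename columns to short, consistent identifiers.
toy = toy.rename(columns={"State Code": "statecd", "Dist Code": "distcd"})
print(toy.columns.tolist())
```

Checking `toy.head()` after loading is a quick way to confirm that the header row was picked up correctly.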

In [None]:
# Importing all the required packages and add necessary imports if required
import pandas as pd
import numpy as np

In [None]:
# YOUR CODE HERE for loading and exploring the datasets
df1 = pd.read_csv('Districtwise_Basicdata.csv',skiprows=1)
df2 = pd.read_csv('Districtwise_Teacher_indicator.csv',skiprows=3)

In [None]:
# First 5 rows
df1.head()

Unnamed: 0,Year,Statecd,statename,distcd,distname,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,sexratio_06,growthrate,p_sc_pop,p_st_pop,overall_lit,female_lit
0,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,23616.05,55.89,874.0,980.0,13.97,0.0,1.72,High,84.52
1,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3,13,76,181,105539.0,11651.51,2.6,925.0,975.0,-0.07,0.0,0.72,High,79.39
2,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,3,8,42,58,36819.0,4226.82,0.0,778.0,961.0,-12.48,0.0,64.28,High,70.7
3,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,52,356,1576,4983,2737738.0,295675.7,27.68,1003.0,942.0,10.04,17.82,18.09,Low,51.99
4,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,63,564,929,5188,4083315.0,427114.75,28.09,977.0,927.0,12.16,14.29,3.78,Low,54.31


In [None]:

# Shape of the data
print("Shape:", df1.shape)

Shape: (1324, 19)


In [None]:
# Column names
print("Columns:", df1.columns.tolist())

Columns: ['Year', 'Statecd', 'statename', 'distcd', 'distname', 'blocks', 'clusters', 'villages', 'totschools', 'totpopulation', 'p_06_pop', 'p_urb_pop', 'sexratio', 'sexratio_06', 'growthrate', 'p_sc_pop', 'p_st_pop', 'overall_lit', 'female_lit']


In [None]:
# Data types of each column
print(df1.dtypes)

Year              object
Statecd            int64
statename         object
distcd             int64
distname          object
blocks             int64
clusters           int64
villages           int64
totschools         int64
totpopulation    float64
p_06_pop         float64
p_urb_pop        float64
sexratio         float64
sexratio_06      float64
growthrate       float64
p_sc_pop         float64
p_st_pop         float64
overall_lit       object
female_lit       float64
dtype: object


In [None]:
# Summary statistics
print(df1.describe())

           Statecd       distcd       blocks     clusters     villages  \
count  1324.000000  1324.000000  1324.000000  1324.000000  1324.000000   
mean     17.108761  1727.391239    10.916918   126.652568   899.135952   
std       9.341604   933.187369     9.661577   100.247178   627.269611   
min       1.000000   101.000000     1.000000     1.000000     6.000000   
25%       9.000000   933.000000     5.000000    58.750000   394.750000   
50%      18.000000  1822.500000     8.000000   103.000000   821.500000   
75%      24.000000  2410.000000    13.000000   167.000000  1232.000000   
max      35.000000  3503.000000    66.000000   680.000000  3987.000000   

        totschools  totpopulation      p_06_pop    p_urb_pop     sexratio  \
count  1324.000000   1.268000e+03  1.266000e+03  1262.000000  1268.000000   
mean   2175.542296   1.899024e+06  2.511907e+05    24.819255   942.678233   
std    1434.679991   1.546865e+06  1.999103e+05    19.086172    62.391138   
min      31.000000   7.94

In [None]:
# Missing values
print(df1.isnull().sum()[df1.isnull().sum()>0])

totpopulation    56
p_06_pop         58
p_urb_pop        62
sexratio         56
sexratio_06      58
growthrate       56
p_sc_pop         68
p_st_pop         68
overall_lit      56
female_lit       56
dtype: int64


In [None]:
df2.head()

Unnamed: 0,statecd,statename,distcd,distname,ac_year,tch_govt1,tch_govt2,tch_govt3,tch_govt4,tch_govt5,...,trn_tch_f2,trn_tch_f3,trn_tch_f4,trn_tch_f5,trn_tch_f6,trn_tch_f7,prof_trn_tch_r,prof_trn_tch_p,days_nontch,tch_nontch
0,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,2012-13,329,429,1097,0,127,...,176,135,0,22,103,0,2968,228,12,519
1,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,2012-13,305,285,194,95,268,...,85,40,3,28,60,0,1249,203,8,362
2,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,2012-13,110,95,56,0,135,...,29,23,0,17,46,0,430,78,20,28
3,28,ANDHRA PRADESH ...,2801,ADILABAD ...,2012-13,4749,1788,38,0,22,...,267,0,0,0,8,248,16419,845,13,263
4,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,2012-13,5797,2879,209,8,6733,...,726,0,0,591,0,3,21487,676,14,1185


In [None]:
# Shape of the data
print("Shape:", df2.shape)

Shape: (1324, 181)


In [None]:
# Column names
print("Columns:", df2.columns.tolist())

Columns: ['statecd', 'statename', 'distcd', 'distname', 'ac_year', 'tch_govt1', 'tch_govt2', 'tch_govt3', 'tch_govt4', 'tch_govt5', 'tch_govt6', 'tch_govt7', 'tch_govt9', 'tch_pvt1', 'tch_pvt2', 'tch_pvt3', 'tch_pvt4', 'tch_pvt5', 'tch_pvt6', 'tch_pvt7', 'tch_pvt9', 'tch_un1', 'tch_un2', 'tch_un3', 'tch_un4', 'tch_un5', 'tch_un6', 'tch_un7', 'tch_un9', 'tch_bs1', 'tch_bs2', 'tch_bs3', 'tch_bs4', 'tch_bs5', 'tch_bs6', 'tch_bs7', 'tch_bs_p', 'tch_s1', 'tch_s2', 'tch_s3', 'tch_s4', 'tch_s5', 'tch_s6', 'tch_s7', 'tch_s_p', 'tch_hs1', 'tch_hs2', 'tch_hs3', 'tch_hs4', 'tch_hs5', 'tch_hs6', 'tch_hs7', 'tch_hs_p', 'tch_grad1', 'tch_grad2', 'tch_grad3', 'tch_grad4', 'tch_grad5', 'tch_grad6', 'tch_grad7', 'tch_grad_p', 'tch_pgrad1', 'tch_pgrad2', 'tch_pgrad3', 'tch_pgrad4', 'tch_pgrad5', 'tch_pgrad6', 'tch_pgrad7', 'tch_pgrad_p', 'tch_mph1', 'tch_mph2', 'tch_mph3', 'tch_mph4', 'tch_mph5', 'tch_mph6', 'tch_mph7', 'tch_mph_p', 'tch_pd1', 'tch_pd2', 'tch_pd3', 'tch_pd4', 'tch_pd5', 'tch_pd6', 'tch_

In [None]:

# Data types of each column
print(df2.dtypes)

statecd            int64
statename         object
distcd             int64
distname          object
ac_year           object
                   ...  
trn_tch_f7         int64
prof_trn_tch_r     int64
prof_trn_tch_p     int64
days_nontch        int64
tch_nontch         int64
Length: 181, dtype: object


In [None]:
# Summary statistics
print(df2.describe())


           statecd       distcd     tch_govt1     tch_govt2   tch_govt3  \
count  1324.000000  1324.000000   1324.000000   1324.000000  1324.00000   
mean     17.108761  1727.391239   3043.915408   1731.320242   155.31571   
std       9.341604   933.187369   2638.704948   2351.718141   469.74075   
min       1.000000   101.000000      0.000000      0.000000     0.00000   
25%       9.000000   933.000000   1300.000000     14.000000     0.00000   
50%      18.000000  1822.500000   2422.500000    486.500000    27.00000   
75%      24.000000  2410.000000   4031.500000   2694.250000   102.25000   
max      35.000000  3503.000000  23248.000000  12902.000000  5903.00000   

         tch_govt4     tch_govt5    tch_govt6    tch_govt7    tch_govt9  ...  \
count  1324.000000   1324.000000  1324.000000  1324.000000  1324.000000  ...   
mean    623.398036    777.830816   160.501511   410.455438     0.719033  ...   
std     848.384969   1722.935608   300.917995   941.320989    12.056540  ...   
min 

In [None]:
# Missing values
print(df2.isnull().sum()[df2.isnull().sum()>0])


Series([], dtype: int64)


## Exercise 2  - Data Integration (3 Marks)

As the required data is spread across different datasets, we need to **integrate them into a single dataframe/dataset**.
  * To integrate the datasets, create a unique identifier for each row in both dataframes so that rows can be matched across files.
   
    * Combine the year, state code, and district code columns into a new unique identifier column (see this [link](https://stackoverflow.com/questions/33098383/merge-multiple-column-values-into-one-column-in-python-pandas)).
    * Set the identifier column as the index for each dataframe.

    * Integrate the dataframes using the above index
     
     Hint: For merging or joining the datasets, refer to this [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

**Example:** The data for the district Anantapur in Andhra Pradesh, which is spread across different files, should form a single row after the datasets are integrated.
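
A minimal sketch of this integration on two miniature tables follows. The helper `add_uid` is a hypothetical name introduced for illustration, and the inner join is one possible choice; the four values used are taken from the `head()` outputs shown earlier.

```python
import pandas as pd

# Miniature stand-ins for the two files (two districts each).
left = pd.DataFrame({
    "Year": ["2012-13", "2012-13"],
    "Statecd": [28, 35],
    "distcd": [2822, 3501],
    "overall_lit": ["Low", "High"],
})
right = pd.DataFrame({
    "ac_year": ["2012-13", "2012-13"],
    "statecd": [35, 28],
    "distcd": [3501, 2822],
    "tch_govt1": [329, 5797],
})

def add_uid(frame, year_col, state_col, dist_col):
    """Return a copy indexed by a 'year_state_district' string key."""
    uid = (frame[year_col].astype(str) + "_"
           + frame[state_col].astype(str) + "_"
           + frame[dist_col].astype(str))
    return frame.assign(uid=uid).set_index("uid")

left = add_uid(left, "Year", "Statecd", "distcd")
right = add_uid(right, "ac_year", "statecd", "distcd")

# An inner join keeps only keys present in both tables; the suffixes
# disambiguate the shared column name "distcd".
merged = left.join(right, how="inner", lsuffix="_left", rsuffix="_right")
print(merged.loc["2012-13_28_2822", ["overall_lit", "tch_govt1"]])
```

The row order differs between the two tables, yet the Anantapur record (`2012-13_28_2822`) ends up as a single row because the join aligns on the index rather than on position.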


In [None]:
# YOUR CODE HERE for integrating the datasets
df1['uid'] = df1['Year'].astype(str) + '_' + df1['Statecd'].astype(str) + '_' + df1['distcd'].astype(str)
df2['uid'] = df2['ac_year'].astype(str) + '_' + df2['statecd'].astype(str) + '_' + df2['distcd'].astype(str)


print(df1.columns.size)
print(df2.columns.size)

df1.set_index('uid', inplace=True)
df2.set_index('uid', inplace=True)

df = df1.join(df2, lsuffix='_left', rsuffix='_right')


print(df.columns.size)
print(df1.columns.size)
print(df2.columns.size)
# df.columns.tolist()

df.index


20
182
200
19
181


Index(['2012-13_35_3501', '2012-13_35_3503', '2012-13_35_3502',
       '2012-13_28_2801', '2012-13_28_2822', '2012-13_28_2823',
       '2012-13_28_2820', '2012-13_28_2814', '2012-13_28_2817',
       '2012-13_28_2805',
       ...
       '2013-14_19_1906', '2013-14_19_1907', '2013-14_19_1910',
       '2013-14_19_1911', '2013-14_19_1920', '2013-14_19_1919',
       '2013-14_19_1914', '2013-14_19_1921', '2013-14_19_1918',
       '2013-14_19_1904'],
      dtype='object', name='uid', length=1324)

## Exercise 3 - Data Cleaning (3 Marks)

1.  **overall_lit** is our target variable. Delete the rows with a missing overall_lit value.

   Hint: Refer to the link [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).


2.  Convert categorical values to numerical values.

  For example, if a feature contains categorical values such as dog, cat, and mouse, replace them with 0, 1, 2, and so on, or use scikit-learn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

3. Replace the missing values in any other column appropriately with mean / median / mode.

  Hint: Use pandas [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) function to replace the missing values
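
The three cleaning steps can be sketched on a toy frame as follows. The values are illustrative rather than taken from the dataset, and filling with the median is only one of the options listed above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "overall_lit": ["High", "Low", None, "Medium"],
    "p_urb_pop": [55.9, np.nan, 10.0, 28.1],
})

# 1. Drop rows whose target is missing.
toy = toy.dropna(subset=["overall_lit"]).copy()

# 2. Encode the categorical target; classes are sorted alphabetically.
encoder = LabelEncoder()
toy["overall_lit"] = encoder.fit_transform(toy["overall_lit"])

# 3. Fill the remaining gap with the column median.
toy["p_urb_pop"] = toy["p_urb_pop"].fillna(toy["p_urb_pop"].median())

print({str(label): code for code, label in enumerate(encoder.classes_)})
print(toy)
```

Note that `LabelEncoder` assigns codes in alphabetical order (High, Low, Medium), so the integer codes do not reflect the natural low-to-high ordering of the classes.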




In [None]:
# YOUR CODE HERE for data cleaning
print(df.shape)
print(df['overall_lit'].isnull().sum())
# .copy() avoids SettingWithCopyWarning when columns are reassigned below
df = df.dropna(subset=['overall_lit']).copy()
print(df['overall_lit'].isnull().sum())

from sklearn.preprocessing import LabelEncoder

# Encode every non-numeric column as integer labels
cols_to_encode = df.select_dtypes(exclude=['int64','float64']).columns

# cols_to_encode = ['Year', 'statename_left', 'statename_right','distname_left','distname_right','overall_lit','ac_year']

# Use a fresh encoder per column so that no fitted state is shared between columns
df[cols_to_encode] = df[cols_to_encode].apply(lambda col: LabelEncoder().fit_transform(col))
df.isnull().sum()[df.isnull().sum()>0]



(1324, 200)
56
0


Unnamed: 0,0
p_06_pop,2
p_urb_pop,6
sexratio_06,2
p_sc_pop,12
p_st_pop,12


In [None]:
import pandas as pd

def fill_nulls_smart(df):
    df_filled = df.copy()

    for col in df.columns:
        if df[col].isnull().sum() == 0:
            continue  # Skip if no nulls

        # Categorical columns → Mode
        if df[col].dtype == 'object':
            fill_value = df[col].mode().iloc[0]
            strategy = 'mode'

        # Numerical columns
        else:
            series = df[col].dropna()
            skewness = series.skew()
            Q1 = series.quantile(0.25)
            Q3 = series.quantile(0.75)
            IQR = Q3 - Q1
            outliers = series[(series < Q1 - 1.5*IQR) | (series > Q3 + 1.5*IQR)]

            if abs(skewness) > 1 or len(outliers) > 0.1 * len(series):
                fill_value = series.median()
                strategy = 'median'
            else:
                fill_value = series.mean()
                strategy = 'mean'

        # Fill missing values
        # Fill missing values by assigning the result back; the chained
        # df_filled[col].fillna(..., inplace=True) form is deprecated in pandas
        df_filled[col] = df_filled[col].fillna(fill_value)

        print(f"Filled nulls in column '{col}' using {strategy} with value: {fill_value}")

    return df_filled

df = fill_nulls_smart(df)

print(df.isnull().sum()[df.isnull().sum()>0])



Filled nulls in column 'p_06_pop' using median with value: 204934.14
Filled nulls in column 'p_urb_pop' using median with value: 19.5
Filled nulls in column 'sexratio_06' using mean with value: 918.8135860979463
Filled nulls in column 'p_sc_pop' using mean with value: 14.83001592356688
Filled nulls in column 'p_st_pop' using median with value: 4.165
Series([], dtype: int64)


## Exercise 4 - (3 Marks)

1. Remove the unnecessary columns which are not contributing to the overall literacy rate

2. Verify if there are any duplicate columns and remove them.

  For example: the state name and district name columns carry the same information as the state code and district code columns.

3. Make sure that the final dataframe has no null or nan values. Delete the rows with missing values.

   Hint: Use df.isna() to check for NaN values in the dataframe.
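
One possible way to detect uninformative and duplicated columns is sketched below on a toy frame. The column names `statecd_again` and `constant` are hypothetical and exist only to trigger each check.

```python
import pandas as pd

toy = pd.DataFrame({
    "statecd": [28, 28, 35],
    "statecd_again": [28, 28, 35],   # exact duplicate of "statecd"
    "constant": [1, 1, 1],           # carries no information
    "blocks": [52, 63, 3],
})

# Keep only columns with more than one distinct value.
toy = toy.loc[:, toy.nunique() > 1]

# Keep the first of any group of identical columns.
toy = toy.loc[:, ~toy.T.duplicated()]

# Final check: no NaN values should remain.
assert not toy.isna().any().any()
print(toy.columns.tolist())
```

Only exact duplicates are caught this way. Columns that encode the same information differently, such as a name and its code, still have to be identified by inspection.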

In [None]:
# YOUR CODE HERE for cleaning the dataframe
print(df.shape)
df = df.drop(columns=['statename_left','Year','distname_left', 'Statecd','distcd_left','statecd','statename_right','distcd_right','distname_right','ac_year'])
print(df.shape)
df = df.dropna()
print(df.shape)
df = df.loc[:, df.nunique() > 1]
print(df.shape)
df = df.T.drop_duplicates().T
print(df.shape)
print(df.isna().sum()[df.isna().sum()>0])


(1268, 200)
(1268, 190)
(1268, 190)
(1268, 189)
(1268, 189)
Series([], dtype: int64)


## Exercise 5 - Apply Correlation Matrix (2 Marks)

Correlation is a statistical technique that shows whether, and how strongly, pairs of variables are related. More features do not imply better accuracy: irrelevant features add noise to the model and can reduce its accuracy.

*Highly correlated features carry nearly the same information, so we remove the redundant ones.*

**Function Description:**

`remove_Highly_Correlated()` removes highly correlated features from the dataframe.
- Creates the correlation matrix of all feature pairs
- Keeps only one triangle of the matrix (the code masks the upper triangle and the diagonal), because the matrix is symmetric and has the same values below and above the diagonal
- Removes columns whose correlation with another column exceeds the threshold value.

In [None]:
def remove_Highly_Correlated(df, bar=0.9):
  # Creates correlation matrix
  corr = df.corr()

  # Set Up Mask To Hide Upper Triangle
  mask = np.triu(np.ones_like(corr, dtype=bool))
  tri_df = corr.mask(mask)

  # Finding features with correlation value more than specified threshold value (bar=0.9)
  highly_cor_col = [col for col in tri_df.columns if any(tri_df[col] > bar )]
  print("length of highly correlated columns",len(highly_cor_col))

  # Drop the highly correlated columns
  reduced_df = df.drop(highly_cor_col, axis = 1)
  print("shape of data",df.shape,"shape of reduced data",reduced_df.shape)
  return reduced_df

In [None]:
# YOUR CODE HERE to remove highly correlated features from the dataframe by calling above function.
df_cleaned = remove_Highly_Correlated(df)

length of highly correlated columns 25
shape of data (1268, 189) shape of reduced data (1268, 164)
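
`remove_Highly_Correlated()` compares signed correlation values with the threshold, so a pair of features with a strong negative correlation is kept. The sketch below is a hypothetical variant, not part of the exercise, that compares absolute values instead; the function name `drop_correlated_abs` and the synthetic columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated_abs(frame, threshold=0.9):
    """Hypothetical variant: drop one column of each pair with |r| > threshold."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,      # perfectly positively correlated with "a"
    "a_negated": -a,                # perfectly negatively correlated with "a"
    "noise": rng.normal(size=200),  # independent of "a"
})
print(drop_correlated_abs(toy).columns.tolist())
```

Here both `a_scaled` and `a_negated` are dropped, while a signed comparison would have kept `a_negated`.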


## Exercise 6 - (3 Marks)

Perform standard scaling on the data, feature-wise (column by column).

**Hint:** In order to understand the idea behind the terms used above, you may refer to the following link:

[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
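
As a small illustration of what standard scaling does, the sketch below applies `StandardScaler` to a made-up matrix whose two columns are on very different scales. Each column is transformed independently to zero mean and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and unit (population) standard deviation.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, both columns contain the same values, since they differed only by a constant factor.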

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def find_optimal_k(df, target_col, max_k=160):
    X = df.drop(columns=[target_col])
    y = df[target_col]

    X = pd.get_dummies(X, drop_first=True)
    X = X.fillna(X.mean())

    scores = []
    k_range = range(5, min(max_k, X.shape[1]) + 1, 5)  # test in steps of 5

    for k in k_range:
        # The target is a class label, so features are scored with the ANOVA
        # F-test (f_classif) and evaluated with a classifier, not a regressor.
        # Selection and scaling sit inside the pipeline so that each CV fold
        # fits them on its own training part only.
        model = make_pipeline(
            SelectKBest(score_func=f_classif, k=k),
            StandardScaler(),
            LogisticRegression(max_iter=1000),
        )
        score = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
        scores.append(score)

    # Plotting
    plt.plot(k_range, scores, marker='o')
    plt.xlabel('Number of Features (k)')
    plt.ylabel('Cross-validated Accuracy')
    plt.title('Finding Best k')
    plt.grid(True)
    plt.show()

    best_k = k_range[np.argmax(scores)]
    print(f"✅ Best k = {best_k}, Accuracy = {max(scores):.4f}")
    return best_k


In [None]:
# The target column in this dataframe is 'overall_lit'
best_k = find_optimal_k(df, target_col='overall_lit', max_k=100)


In [None]:
# YOUR CODE HERE
from sklearn.preprocessing import StandardScaler
df_cleaned.to_csv('output.csv', index=False)

x = df_cleaned.drop(columns=['overall_lit'])
y = df_cleaned['overall_lit']

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=160)  # keep the top 160 features
x_new = selector.fit_transform(x, y)

# Get selected feature names
selected_features = x.columns[selector.get_support()]
# print("Top features:", selected_features.tolist())

print("shape of data",x.shape,"shape of reduced data",x_new.shape)


# Scale numeric features
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_new)






shape of data (1268, 163) shape of reduced data (1268, 160)


## Exercise 7 - (3 Marks)

Apply different classifiers to the preprocessed data and determine which classifier gives the best result.

* Split the data into train and test sets

* Fit each model on the training data and measure its accuracy on the test data
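
The split-fit-score loop described above can be sketched as follows. Synthetic data from `make_classification` stands in for the preprocessed features, and wrapping each model in a pipeline with `StandardScaler` is an assumption of this sketch: it fits the scaler on the training split only, which avoids leaking test-set statistics into training.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic three-class data as a stand-in for the preprocessed features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0)),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)  # the scaler is fitted on the training split only
    results[name] = accuracy_score(y_test, model.predict(X_test))

best = max(results, key=results.get)
print(results, "best:", best)
```

Fixing `random_state` on the split and on the tree-based models makes the comparison reproducible between runs.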

### Expected Accuracy is above 90%

In [None]:
# YOUR CODE HERE for applying different classifiers
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier


x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)


models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Support Vector Machine': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Bagging Classifier': BaggingClassifier(random_state=42)

}

for model_name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    y_predtrain = model.predict(x_train)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_train = accuracy_score(y_train, y_predtrain)

    print(f'{model_name} Accuracy: {accuracy}  Accuracy Train: {accuracy_train}')



Logistic Regression Accuracy: 0.8700787401574803  Accuracy Train: 0.9822485207100592
Decision Tree Accuracy: 0.9291338582677166  Accuracy Train: 1.0
Random Forest Accuracy: 0.9094488188976378  Accuracy Train: 1.0
Support Vector Machine Accuracy: 0.8346456692913385  Accuracy Train: 0.8954635108481263
K-Nearest Neighbors Accuracy: 0.7283464566929134  Accuracy Train: 0.7988165680473372
Bagging Classifier Accuracy: 0.9330708661417323  Accuracy Train: 0.995069033530572
