# COMP 551 Assignment 1: Getting Started With Machine Learning

### K-Nearest Neighbors Experiments

#### Group 1: Rudi Kischer, Ben Hepditch

# Setup

- Install the packages listed in `requirements.txt`, and run Jupyter Notebook from the matching virtual environment.

In [11]:


from ucimlrepo import fetch_ucirepo 
import pandas as pd

pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 3)

# Data

Dataset 1: NHANES age prediction.csv (National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset): https://archive.ics.uci.edu/dataset/887/national+health+and+nutrition+health+survey+2013-2014+(nhanes)+age+prediction+subset

Dataset 2: Breast Cancer Wisconsin (Original) dataset: https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original

### Load Data

In [1]:


# DATASET 1: NHANES age prediction.csv
national_health_and_nutrition_health_survey_2013_2014_nhanes_age_prediction_subset = fetch_ucirepo(id=887)
dataset_1 = national_health_and_nutrition_health_survey_2013_2014_nhanes_age_prediction_subset.data
X_1 = dataset_1.features
y_1 = dataset_1.targets

# DATASET 2: Breast Cancer Wisconsin
breast_cancer_wisconsin_original = fetch_ucirepo(id=15)
dataset_2 = breast_cancer_wisconsin_original.data

### Clean Data

- We remove every row that has a null value in either the features or the targets.

In [2]:
# Define Cleaning Function
def clean(dataset):
  X = dataset.features
  Y = dataset.targets
  missing_rows_features = X.isnull().any(axis=1)
  missing_rows_targets = Y.isnull().any(axis=1)
  missing_rows = missing_rows_features | missing_rows_targets
  
  print(f"features_missing: {missing_rows_features.sum()}")
  print(f"targets_missing: {missing_rows_targets.sum()}")

  # `~` inverts the boolean mask so that only complete rows are kept
  X_clean = X[~missing_rows]
  Y_clean = Y[~missing_rows]
  print(f'{missing_rows.sum()} rows deleted')
  dataset.features = X_clean
  dataset.targets = Y_clean

  return dataset


In [3]:
# Clean the DataSets
dataset_1 = clean(dataset_1)
dataset_2 = clean(dataset_2)

features_missing: 0
targets_missing: 0
0 rows deleted
features_missing: 16
targets_missing: 0
16 rows deleted
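
The masking step can be checked on a small toy frame. The sketch below is only an illustration: `X_toy`, `y_toy`, and their values are made up and are not taken from either dataset. It inverts the mask with `~`, the boolean NOT operator for a pandas Series.

```python
import numpy as np
import pandas as pd

# Toy features and targets (hypothetical values, for illustration only)
X_toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, 2.0, 3.0, np.nan]})
y_toy = pd.DataFrame({"label": [0, 1, np.nan, 1]})

# A row is flagged if any of its feature values or target values is missing
missing_rows = X_toy.isnull().any(axis=1) | y_toy.isnull().any(axis=1)

# `~` inverts the boolean mask, keeping only the complete rows
X_toy_clean = X_toy[~missing_rows]
y_toy_clean = y_toy[~missing_rows]

print(f"{missing_rows.sum()} rows deleted, {len(X_toy_clean)} kept")
```

Only the first row is complete in both frames, so it is the only one that survives.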


### Target Statistics

- We want summary statistics that relate each feature to the target: the mean of every feature within each target class, and the squared difference between the two class means.
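
As a reading of the code in this section, the squared difference for feature $j$ can be written as below. The symbols $\bar{x}_j^{(1)}$ and $\bar{x}_j^{(2)}$ are notation introduced here for the mean of feature $j$ in the first and second target groups (in the order returned by `groupby`); they do not appear in the code itself.

$$
d_j = \left(\bar{x}_j^{(1)} - \bar{x}_j^{(2)}\right)^2
$$

For example, using the rounded `LBXGLU` means printed in the Feature Means output, $(98.645 - 104.330)^2 \approx 32.32$, which agrees with the reported value for that feature.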

In [19]:
# Define mean
def grouped_target_means(dataset):
    # grouped by the target
    X = dataset.features
    Y = dataset.targets

    XY = pd.concat([X,Y], axis=1)
    XY_grouped = XY.groupby(Y.columns[0])
    XY_mean = XY_grouped.mean()
    return XY_mean

# Define Feature Distance
def grouped_feature_distance(dataset):
    XY_mean = grouped_target_means(dataset)

    sqr_diff = (XY_mean.iloc[0] - XY_mean.iloc[1]) ** 2
    df_sqr_diff = pd.DataFrame([sqr_diff], index=['squarred_diff'])
    return df_sqr_diff

# Print Col Ranking
def feature_ranking(dataset):
    df_sqr_diff = grouped_feature_distance(dataset)
    row = df_sqr_diff.iloc[0]
    sorted_row = row.sort_values(ascending=False)

    ranking_df = pd.DataFrame({
      'Feature': sorted_row.index,
      'Value': sorted_row.values,
      'Rank': range(1, len(sorted_row) + 1)
    })

    return ranking_df


##### Feature Means

In [17]:
# Get grouped means
print(f'Dataset 1 Feature Means:')
XY_1_bar = grouped_target_means(dataset_1)
print(XY_1_bar)

print(f'Dataset 2 Feature Means: ')
XY_2_bar = grouped_target_means(dataset_2)
print(XY_2_bar)


Dataset 1 Feature Means:
           RIAGENDR  PAQ605  BMXBMI   LBXGLU  DIQ010   LBXGLT   LBXIN
age_group                                                            
Adult         1.512   1.806  27.968   98.645   2.014  109.991  12.107
Senior        1.508   1.909  27.886  104.330   2.027  141.209  10.405
Dataset 2 Feature Means: 
       Clump_thickness  Uniformity_of_cell_size  Uniformity_of_cell_shape  Marginal_adhesion  Single_epithelial_cell_size  Bare_nuclei  Bland_chromatin  Normal_nucleoli  Mitoses
Class                                                                                                                                                                            
2                2.964                    1.306                     1.414              1.347                        2.108        1.347            2.083            1.261    1.065
4                7.188                    6.577                     6.561              5.586                        5.326        7.628 

##### Group Feature Distance

In [18]:
print('Dataset 1:')
XY_1_fd = grouped_feature_distance(dataset_1)
print(XY_1_fd)

print('Dataset 2:')
XY_2_fd = grouped_feature_distance(dataset_2)
print(XY_2_fd)

Dataset 1:
                RIAGENDR  PAQ605  BMXBMI  LBXGLU     DIQ010   LBXGLT  LBXIN
squarred_diff  1.425e-05   0.011   0.007  32.319  1.786e-04  974.576  2.895
Dataset 2:
               Clump_thickness  Uniformity_of_cell_size  Uniformity_of_cell_shape  Marginal_adhesion  Single_epithelial_cell_size  Bare_nuclei  Bland_chromatin  Normal_nucleoli  Mitoses
squarred_diff           17.845                   27.784                    26.484             17.969                       10.357       39.448           15.144           21.128    2.363


##### Features Ranked By Squared Difference

In [20]:

print("Dataset 1 Feature Ranking")
d1_feature_ranking = feature_ranking(dataset_1)
print(d1_feature_ranking)

print("Dataset 2 Feature Ranking")
d2_feature_ranking = feature_ranking(dataset_2)
print(d2_feature_ranking)


Dataset 1 Feature Ranking
    Feature      Value  Rank
0    LBXGLT  9.746e+02     1
1    LBXGLU  3.232e+01     2
2     LBXIN  2.895e+00     3
3    PAQ605  1.065e-02     4
4    BMXBMI  6.728e-03     5
5    DIQ010  1.786e-04     6
6  RIAGENDR  1.425e-05     7
Dataset 2 Feature Ranking
                       Feature   Value  Rank
0                  Bare_nuclei  39.448     1
1      Uniformity_of_cell_size  27.784     2
2     Uniformity_of_cell_shape  26.484     3
3              Normal_nucleoli  21.128     4
4            Marginal_adhesion  17.969     5
5              Clump_thickness  17.845     6
6              Bland_chromatin  15.144     7
7  Single_epithelial_cell_size  10.357     8
8                      Mitoses   2.363     9
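
When reading these rankings, it may help to keep in mind that a raw squared difference depends on the units of each feature. The sketch below uses made-up toy data (the column names `small_scale` and `large_scale` are hypothetical) to show how the ranking can change when the mean difference is first divided by the feature's overall standard deviation. This is one possible variant for comparison, not the method used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): two features measured on very different scales
df = pd.DataFrame({
    "small_scale": [0.0, 1.0, 0.0, 1.0, 1.0, 2.0, 1.0, 2.0],
    "large_scale": [100.0, 300.0, 100.0, 300.0, 110.0, 310.0, 110.0, 310.0],
    "target": ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# Per-class feature means, as in grouped_target_means
means = df.groupby("target").mean()
mean_diff = means.loc["A"] - means.loc["B"]

# Raw squared difference, as in grouped_feature_distance
raw_sq_diff = mean_diff ** 2

# Variant: scale the difference by each feature's overall standard deviation
stds = df.drop(columns="target").std()
std_sq_diff = (mean_diff / stds) ** 2

print(raw_sq_diff.sort_values(ascending=False))
print(std_sq_diff.sort_values(ascending=False))
```

On this toy data the raw measure ranks `large_scale` first, while the scaled measure ranks `small_scale` first, because the large-scale feature varies far more within each class than between classes.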


-  TODO: *Description goes here analyzing if the features that are strongly different are associated with the target*


# K-Nearest Neighbors