# AICrowd - Predict Diabetic Retinopathy

**Problem Statement:** \\
Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. Deep learning has given us tremendous power in the field of computer vision; in some vision tasks, computers can now see and perceive beyond human capabilities. But with great power comes responsibility. The task is to classify a patient's retina as diabetic or not diabetic, using the image features available in the dataset.

**Dataset:** \\
This dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy. The dataset has 20 attributes in total: the first 19 are descriptive features extracted from the images, and the last attribute, the label, is 1 if the image shows signs of diabetic retinopathy and 0 if it does not.

**Files:** \\
1. train.csv - (920 samples) File to be used for training. It contains, in CSV format, the feature representation of each image along with its binary label.
2. test.csv - (230 samples) File used for the actual leaderboard evaluation. It contains only the feature representation of each image, without the binary labels.

**Submission:** \\
Prepare a CSV file with the header `label` and one predicted value per row, 0 or 1, indicating whether or not the image shows signs of diabetic retinopathy.
The file should be named submission.csv. A sample submission format is available in sample_submission.csv.
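
The submission format described above can be sketched with the standard library alone. This is a minimal illustration, not the organisers' reference code: the helper name `write_submission` and the example predictions are hypothetical, and the sketch writes to an in-memory buffer rather than to submission.csv.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write 0/1 predictions under a single `label` header (hypothetical helper)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["label"])
    for value in predictions:
        value = int(value)
        # The submission only allows the digits 0 and 1.
        if value not in (0, 1):
            raise ValueError(f"prediction must be 0 or 1, got {value}")
        writer.writerow([value])


# Made-up predictions, for illustration only.
buffer = io.StringIO()
write_submission([1, 0, 1], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Validating the values before writing catches accidental probabilities or string labels before an upload is spent on them.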

**Evaluation Criteria:** \\
During evaluation, the F1 score will be used to measure the performance of the model.
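
For reference, the F1 score is the harmonic mean of precision $P$ and recall $R$. Assuming the leaderboard scores the positive class (label 1), which the statement above does not spell out, it can be written in terms of true positives $TP$, false positives $FP$ and false negatives $FN$:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```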

## Google Drive Mount and Imports

In [1]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


In [0]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import f1_score,precision_score,recall_score,accuracy_score

from xgboost import XGBClassifier

## Load data

In [3]:
train_df = pd.read_csv("/gdrive/My Drive/Dataset/AICrowd/PredictDiabeticRetinopathy/train.csv", header=None)
test_df = pd.read_csv("/gdrive/My Drive/Dataset/AICrowd/PredictDiabeticRetinopathy/test.csv", header=None)
train_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,1,1,75,63,60,55,48,35,13.195493,4.396967,0.10407,0.0,0.0,0.0,0.0,0.0,0.513092,0.123966,0,1
1,1,1,79,76,74,72,69,50,61.559348,28.959444,12.778104,2.045287,0.038016,0.0,0.0,0.0,0.527993,0.101884,0,1
2,1,1,41,41,40,40,38,35,6.090116,0.834492,0.02746,0.0,0.0,0.0,0.0,0.0,0.506881,0.091535,1,0
3,1,1,17,16,16,14,12,9,75.438535,20.3525,5.237412,0.206817,0.003884,0.000971,0.000971,0.000971,0.544614,0.089329,1,1
4,1,1,63,63,63,59,57,48,13.558211,5.366467,0.604079,0.051511,0.0,0.0,0.0,0.0,0.552941,0.112387,0,1


In [0]:
labels = train_df.iloc[:,19]
train_df.drop(labels=[train_df.columns[0], train_df.columns[1], train_df.columns[18], train_df.columns[19]], inplace=True, axis=1)

In [5]:
x_train, x_validation, y_train, y_validation = train_test_split(train_df, labels, test_size=0.20, stratify=labels)

print(F"Training data stats: {x_train.shape}, {y_train.shape}")
print(F"Validation data stats: {x_validation.shape}, {y_validation.shape}")

Training data stats: (736, 16), (736,)
Validation data stats: (184, 16), (184,)


In [0]:
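# Plot the distribution of each remaining feature column.
# (The file is read with header=None, so there is no column named '1.2' to use as an x axis.)
for col in train_df.columns:
    train_df[col].plot(kind='hist', bins=30, title=F"{col}")
    plt.show()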

## Classification

In [0]:
def get_stats(model, train_data, train_labels, validation_data, validation_labels):
    """Fit `model` on the training split and print its validation metrics."""
    model.fit(train_data, train_labels)
    y_pred = model.predict(validation_data)

    # With average='micro' on single-label data, precision and recall both
    # equal accuracy, which is why the three printed values are identical.
    precision = precision_score(validation_labels, y_pred, average='micro')
    recall = recall_score(validation_labels, y_pred, average='micro')
    accuracy = accuracy_score(validation_labels, y_pred)
    # Macro F1 is the unweighted mean of the per-class F1 scores.
    f1 = f1_score(validation_labels, y_pred, average='macro')

    print("Accuracy of the model is :" ,accuracy)
    print("Recall of the model is :" ,recall)
    print("Precision of the model is :" ,precision)
    print("F1 score of the model is :" ,f1)
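
`get_stats` reports precision and recall with `average='micro'`, and in the outputs further down those two numbers always match the accuracy. The sketch below, in plain Python on made-up labels, shows why: micro-averaging pools the counts of both classes, while the positive-class precision (what scikit-learn computes with its default `average='binary'`) can be quite different. The helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN) when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def micro_precision(y_true, y_pred, classes=(0, 1)):
    """Pool TP and FP over every class before dividing (micro-averaging)."""
    tp_total = fp_total = 0
    for c in classes:
        tp, fp, _ = confusion_counts(y_true, y_pred, positive=c)
        tp_total += tp
        fp_total += fp
    return tp_total / (tp_total + fp_total)


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro = micro_precision(y_true, y_pred)

tp, fp, fn = confusion_counts(y_true, y_pred, positive=1)
binary_precision = tp / (tp + fp)
binary_recall = tp / (tp + fn)
binary_f1 = 2 * tp / (2 * tp + fp + fn)

print("accuracy:", accuracy)
print("micro precision:", micro)
print("positive-class precision:", binary_precision)
print("positive-class recall:", binary_recall)
print("positive-class F1:", binary_f1)
```

If the goal is to track the competition metric, the positive-class values are probably the more informative ones to print.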

In [0]:
from sklearn.decomposition import PCA

dim_red = PCA(n_components=15)
x_train_reduced = dim_red.fit_transform(x_train)
x_validation_reduced = dim_red.transform(x_validation)
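
PCA is sensitive to feature scale, and the cell above applies it to unscaled features while keeping 15 of 16 components. A quick way to see what that does is to inspect `explained_variance_ratio_`. The sketch below uses synthetic data as a stand-in for the real feature columns (an assumption, since the actual distributions are not shown here), with one column deliberately given a much larger scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 16 feature columns; column 0 has a much larger scale.
X_demo = rng.normal(size=(200, 16))
X_demo[:, 0] *= 100.0

# Same PCA setting as the notebook, once on raw and once on standardized features.
raw_ratio = PCA(n_components=15).fit(X_demo).explained_variance_ratio_
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=15).fit(X_scaled).explained_variance_ratio_

print("first component share, raw features:   ", round(float(raw_ratio[0]), 3))
print("first component share, scaled features:", round(float(scaled_ratio[0]), 3))
print("cumulative variance kept, raw features: ", round(float(raw_ratio.sum()), 3))
```

On the raw synthetic data the first component is dominated by the large-scale column; after standardization the variance is spread far more evenly. Whether the same happens on the competition features would need to be checked on the real data.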

In [0]:
print("*"*100)
print("Logistic Regression")
print("*"*100)
classifier = LogisticRegression(solver = 'lbfgs', multi_class='auto', max_iter=500)
get_stats(classifier, x_train, y_train, x_validation, y_validation)

print("*"*100)
print("Logistic Regression + PCA")
print("*"*100)
classifier = LogisticRegression(solver = 'lbfgs', multi_class='auto', max_iter=500)
get_stats(classifier, x_train_reduced, y_train, x_validation_reduced, y_validation)

print("*"*100)
print("RF")
print("*"*100)
rf_classifier = RandomForestClassifier()
get_stats(rf_classifier, x_train, y_train, x_validation, y_validation)

print("*"*100)
print("RF + PCA")
print("*"*100)
rf_classifier = RandomForestClassifier()
get_stats(rf_classifier, x_train_reduced, y_train, x_validation_reduced, y_validation)

print("*"*100)
print("ET")
print("*"*100)
et_classifier = ExtraTreesClassifier()
get_stats(et_classifier, x_train, y_train, x_validation, y_validation)

print("*"*100)
print("ET + PCA")
print("*"*100)
et_classifier = ExtraTreesClassifier()
get_stats(et_classifier, x_train_reduced, y_train, x_validation_reduced, y_validation)

print("*"*100)
print("Bagging")
print("*"*100)
bagging = BaggingClassifier()
get_stats(bagging, x_train, y_train, x_validation, y_validation)

print("*"*100)
print("Bagging + PCA")
print("*"*100)
bagging = BaggingClassifier()
get_stats(bagging, x_train_reduced, y_train, x_validation_reduced, y_validation)

print("*"*100)
print("XGBoost")
print("*"*100)
xgboost = XGBClassifier()
get_stats(xgboost, x_train, y_train, x_validation, y_validation)

print("*"*100)
print("XGBoost + PCA")
print("*"*100)
xgboost = XGBClassifier()
get_stats(xgboost, x_train_reduced, y_train, x_validation_reduced, y_validation)

****************************************************************************************************
Logistic Regression
****************************************************************************************************


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy of the model is : 0.7880434782608695
Recall of the model is : 0.7880434782608695
Precision of the model is : 0.7880434782608695
F1 score of the model is : 0.7875351591413768
****************************************************************************************************
Logistic Regression + PCA
****************************************************************************************************


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy of the model is : 0.7880434782608695
Recall of the model is : 0.7880434782608695
Precision of the model is : 0.7880434782608695
F1 score of the model is : 0.7875351591413768
****************************************************************************************************
RF
****************************************************************************************************
Accuracy of the model is : 0.6521739130434783
Recall of the model is : 0.6521739130434783
Precision of the model is : 0.6521739130434783
F1 score of the model is : 0.6515151515151516
****************************************************************************************************
RF + PCA
****************************************************************************************************
Accuracy of the model is : 0.7989130434782609
Recall of the model is : 0.7989130434782609
Precision of the model is : 0.7989130434782609
F1 score of the model is : 0.7988595739651962
***********************************
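
The convergence warning printed for both logistic regression runs above suggests scaling the data, and `StandardScaler` is imported at the top of the notebook but never used. One way to act on the warning is to wrap the scaler and the classifier in a pipeline, so the scaler is fitted on the training split only. This is a sketch on synthetic data from `make_classification`, not a result on the competition data; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with 16 features, standing in for the real data.
X_demo, y_demo = make_classification(n_samples=400, n_features=16, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The scaler is fitted inside the pipeline, on the training split only.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scaled_logreg.fit(X_tr, y_tr)

val_pred = scaled_logreg.predict(X_val)
val_f1 = f1_score(y_val, val_pred)
print("validation F1 on synthetic data:", val_f1)
```

The same pipeline object could be passed to `get_stats` in place of the bare classifier, since it exposes `fit` and `predict`.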

In [10]:
# Random forest on the PCA-reduced features, evaluated with the get_stats helper defined earlier
model = RandomForestClassifier(bootstrap=True,
                               max_depth=60,
                               min_samples_leaf=3,
                               min_samples_split=5,
                               n_estimators=1000)
get_stats(model, x_train_reduced, y_train, x_validation_reduced, y_validation)

Accuracy of the model is : 0.7717391304347826
Recall of the model is : 0.7717391304347826
Precision of the model is : 0.7717391304347826
F1 score of the model is : 0.7710629221471739


In [0]:
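# Drop the same feature columns that were removed from the training data
# (the test file has no label column).
test_df.drop(labels=[test_df.columns[0], test_df.columns[1], test_df.columns[18]], inplace=True, axis=1)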

In [0]:
x_test = dim_red.transform(test_df)
predictions = model.predict(x_test)
predictions

array([1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 0])

In [0]:
pd.DataFrame({"label": predictions}).to_csv("/gdrive/My Drive/Dataset/AICrowd/PredictDiabeticRetinopathy/submission.csv", index=False)

In [14]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Number of trees in random forest
n_estimators = range(200, 2400, 400)

# Number of features to consider at every split
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': range(200, 2400, 400), 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}


In [15]:
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(x_train_reduced, y_train)
rf_random.best_params_

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  9.4min finished


{'bootstrap': True,
 'max_depth': 90,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 10,
 'n_estimators': 1400}

In [0]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': range(60, 120),
    'min_samples_leaf': [2, 3, 4, 5, 6],
    'min_samples_split': [7, 8, 9, 10, 11, 12, 13],
    'n_estimators': [1200, 1300, 1400, 1500, 1600]
}

# Create a base model
rf = RandomForestClassifier()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(x_train_reduced, y_train)
print(grid_search.best_params_)

Fitting 3 folds for each of 10500 candidates, totalling 31500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed: 13.6min
[Parallel(n_jobs=-1)]: Done 644 tasks      | elapsed: 24.3min
[Parallel(n_jobs=-1)]: Done 1009 tasks      | elapsed: 37.9min
[Parallel(n_jobs=-1)]: Done 1454 tasks      | elapsed: 54.4min
[Parallel(n_jobs=-1)]: Done 1981 tasks      | elapsed: 73.8min
[Parallel(n_jobs=-1)]: Done 2588 tasks      | elapsed: 96.3min
[Parallel(n_jobs=-1)]: Done 3277 tasks      | elapsed: 121.0min


In [0]:
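skf = StratifiedKFold(n_splits=5)

# X is the feature set and y is the target
X, y = train_df, labels

for train_index, val_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", val_index)
    # Use positional indexing, and fold-specific names so the earlier
    # x_train / y_train split is not overwritten.
    X_fold_train, X_fold_val = X.iloc[train_index], X.iloc[val_index]
    y_fold_train, y_fold_val = y.iloc[train_index], y.iloc[val_index]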
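
The fold loop above only produces the index splits. To get a per-fold score, one option is to hand the `StratifiedKFold` object to `cross_val_score` with `scoring="f1"`, which fits and scores a fresh copy of the model on each fold. This is a sketch on synthetic data; the model settings and variable names are illustrative, not the notebook's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem with 16 features, standing in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=16, random_state=0)

# Shuffled, stratified folds with a fixed seed so the split is reproducible.
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# One F1 score (positive class) per fold.
fold_f1 = cross_val_score(forest, X_demo, y_demo, cv=skf_demo, scoring="f1")

print("per-fold F1:", fold_f1.round(3))
print("mean F1:", round(float(fold_f1.mean()), 3))
```

With only 920 training samples, the spread across folds gives a better sense of how stable a score is than a single 80/20 split does.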


{'bootstrap': True, 'max_depth': 60, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 1000}
