<h1 style="text-align:center">   
      <font color = purple >
                Polycystic Ovary Syndrome (PCOS) Classification
        </font>    
</h1>   
<hr style="width:100%;height:5px;border-width:0;background-color:teal">
<center><img style = "height:550px;" src="https://i.hizliresim.com/QnIpYV.jpg"></center>
<br>
<center><h1>
    <font color = purple>Introduction</font> </h1></center>
<br>
<p>Polycystic ovary syndrome is a disorder involving infrequent, irregular or prolonged menstrual periods, and often excess male hormone (androgen) levels.</p>

<h2><font color = purple>Content:</font></h2>
<br>
 
1. [Import Libraries](#1)
1. [Load and Check Data](#2)
1. [Variable Description](#3)
    * [Univariate Variable Analysis](#4)
        * [Categorical Variable Analysis](#5)
        * [Numerical Variable Analysis](#6)
1. [Missing Values](#7)
1. [Data Analysis](#8)   
1. [Modeling](#9)
    * [Train - Test Split](#10)
    * [Simple Logistic Regression](#11)
    * [Hyperparameter Tuning -- Grid Search -- Cross Validation](#12)
    * [XGBRF and CatBoost Classifier](#13)
1. [Results](#14)

<a id = "1" ></a>
# <span style="color:purple;"> Import Libraries </span>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from collections import Counter
from mlxtend.plotting import plot_confusion_matrix
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost
import lightgbm
from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


<a id = "2" ></a>
# <span style="color:purple;"> Load and Check Data </span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;color:white;">If pd.read_excel fails with a "Missing optional dependency" error (for example for 'xlrd' or 'openpyxl'), install the missing package before calling it. For .xlsx files, pandas reads the workbook with openpyxl.</p>
</div>

In [None]:
!pip install openpyxl

In [None]:
#Load data
df_inf = pd.read_csv("/kaggle/input/polycystic-ovary-syndrome-pcos/PCOS_infertility.csv")
df_woinf = pd.read_excel("/kaggle/input/polycystic-ovary-syndrome-pcos/PCOS_data_without_infertility.xlsx",sheet_name="Full_new")

In [None]:
#Look at the data with infertile patients.
df_inf.head()

In [None]:
#Look at the data with non-infertile patients.
df_woinf.head()

In [None]:
#Look at the columns of data with non-infertile patients.
df_woinf.columns

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;color:white;">
    The data comes in two files: one for patients with infertility and one for patients without. Let's merge them on the patient file number, drop the duplicated features, and rename PCOS (Y/N) to Target.</p>
</div>
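
To make the merge step concrete, here is a minimal sketch on two toy frames. The values are made up for illustration, and the choice to drop every column ending in `_wo` is an assumption about what counts as a duplicate, not the notebook's exact column list.

```python
import pandas as pd

# Toy stand-ins for the two files (values are invented for illustration).
left = pd.DataFrame({
    "Patient File No.": [1, 2, 3],
    "PCOS (Y/N)": [0, 1, 0],
    "AMH(ng/mL)": [2.1, 6.3, 1.8],
})
right = pd.DataFrame({
    "Patient File No.": [1, 2],
    "PCOS (Y/N)": [0, 1],
    "AMH(ng/mL)": [2.1, 6.3],
})

# Left join on the patient number: every row of `left` is kept, and
# overlapping column names coming from `right` receive the "_wo" suffix.
merged = pd.merge(left, right, on="Patient File No.", how="left",
                  suffixes=("", "_wo"))

# Drop the suffixed duplicates and rename the label column.
duplicated = [c for c in merged.columns if c.endswith("_wo")]
merged = merged.drop(columns=duplicated).rename(columns={"PCOS (Y/N)": "Target"})

print(merged.columns.tolist())
print(len(merged))
```

Passing `suffixes` as an ordered tuple matters here: the first element is applied to the left frame and the second to the right frame.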

In [None]:
#Merge the files
# suffixes must be an ordered pair (left, right); a set has no guaranteed order
data = pd.merge(df_woinf, df_inf, on='Patient File No.', suffixes=('', '_wo'), how='left')
#Drop repeated features
data =data.drop(['Unnamed: 44', 'Sl. No_wo', 'PCOS (Y/N)_wo', '  I   beta-HCG(mIU/mL)_wo',
       'II    beta-HCG(mIU/mL)_wo', 'AMH(ng/mL)_wo'], axis=1)
#Change the title of the properties
data = data.rename(columns = {"PCOS (Y/N)":"Target"})
#Look at the merged data.
data.head() 

In [None]:
#Drop unnecessary features
data = data.drop(["Sl. No","Patient File No."],axis = 1)

In [None]:
data.info()

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;color:white;">
    Let's look at the columns whose dtype is object.</p>
</div>

In [None]:
data["AMH(ng/mL)"].head() 

In [None]:
data["II    beta-HCG(mIU/mL)"].head()

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;color:white;">
    As you can see, some numeric data is stored as strings: AMH(ng/mL) and II    beta-HCG(mIU/mL). Let's convert them to numeric values. </p>
</div>
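
As a small sketch of what `errors='coerce'` does, the example below converts a hypothetical string column. The sample strings are invented for illustration. It also lists the entries that could not be parsed, which is worth checking before imputing them later.

```python
import pandas as pd

# Hypothetical string column with two entries that are not valid numbers.
raw = pd.Series(["2.07", "1.53", "1.99.", "a", "6.63"])

# errors="coerce" replaces anything unparseable with NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# Keep track of which original strings were turned into NaN.
lost = raw[converted.isna()]

print(converted.dtype)
print(lost.tolist())
```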

In [None]:
#Converting
data["AMH(ng/mL)"] = pd.to_numeric(data["AMH(ng/mL)"], errors='coerce')
data["II    beta-HCG(mIU/mL)"] = pd.to_numeric(data["II    beta-HCG(mIU/mL)"], errors='coerce')

<a id = "3" ></a>
# <span style="color:purple;">Variable Description</span>

 <a id = "4" ></a>
 ## <span style="color:purple;">Univariate Variable Analysis</span>
* Categorical variables: Target, Pregnant(Y/N), Weight gain(Y/N), hair growth(Y/N), Skin darkening (Y/N), Hair loss(Y/N), Pimples(Y/N), Fast food (Y/N), Reg.Exercise(Y/N), Blood Group
* Numerical variables: Age (yrs), Weight (Kg), Marraige Status (Yrs)...

<a id = "5" ></a>
## <span style="color:purple;">Categorical Variable</span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<ul><p style="padding: 10px;color:white;" >Look at the value count </p>
        <li> <p style="padding: 10px;color:white;" > yes --> 1 </p> </li>
        <li > <p style="padding: 10px;color:white;" > no  --> 0 </p> </li>
</ul>
</div>

In [None]:
colors = ['#670067','#008080']

In [None]:
def bar_plot(variable):
    """
     input: variable example : Target
     output: bar plot & value count
     
    """
    #get feature
    var = data[variable]
    #count number of categorical variable(value/sample)
    varValue = var.value_counts()
    #visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index,varValue,color=colors)
    plt.xticks(varValue.index,varValue.index.values)
    plt.ylabel("Count")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
category = ["Target", "Pregnant(Y/N)", "Weight gain(Y/N)", "hair growth(Y/N)", "Skin darkening (Y/N)", "Hair loss(Y/N)", 
            "Pimples(Y/N)", "Fast food (Y/N)", "Reg.Exercise(Y/N)", "Blood Group"]
for c in category:
    bar_plot(c)

<a id = "6" ></a>
## <span style="color:purple;">Numerical Variable</span>

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(data[variable], bins = 50,color=colors[0])
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
numericVar = [" Age (yrs)", "Weight (Kg)","Marraige Status (Yrs)"]
for n in numericVar:
    plot_hist(n)

<a id = "7" ></a>
# <span style="color:purple;">Missing Values</span>

In [None]:
data.columns[data.isnull().any()]

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;color:white;">
    As you can see, four features have missing values: Marraige Status (Yrs), II    beta-HCG(mIU/mL), AMH(ng/mL) and Fast food (Y/N). Let's deal with them. </p>
</div>
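
One common way to handle such gaps is median imputation. The sketch below shows it on a toy frame. The column names `years` and `flag` are hypothetical stand-ins for a numeric and a 0/1 feature.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "years": [2.0, np.nan, 7.0, 4.0],   # numeric feature with one gap
    "flag": [1.0, 0.0, np.nan, 1.0],    # 0/1 feature with one gap
})

# Column medians are computed while ignoring NaN.
medians = toy.median()

# fillna with a Series fills each column with its own median.
filled = toy.fillna(medians)

print(filled)
```

For a 0/1 column the median is the majority class unless the two classes are tied, so median filling there behaves like mode filling in most cases.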

In [None]:
# Fill missing values with the median of each feature.
# Assigning the result back avoids in-place fillna on a column selection,
# which newer pandas versions warn about.
for col in ['Marraige Status (Yrs)', 'II    beta-HCG(mIU/mL)', 'AMH(ng/mL)', 'Fast food (Y/N)']:
    data[col] = data[col].fillna(data[col].median())


In [None]:
data.isnull().sum()

<a id = "8" ></a>
# <span style="color:purple;">Data Analysis</span>

In [None]:
data.describe()

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;color:white;">
Let's examine the correlation matrix of all features. </p>
</div>

In [None]:
corr_matrix= data.corr()
plt.subplots(figsize=(30,10))
sns.heatmap(corr_matrix,cmap="Set3", annot = True, fmt = ".2f");
plt.title("Correlation Between Features")
plt.show()

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;color:white;">
Let's look at the features whose absolute correlation with the target is greater than 0.25.</p>
</div>
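
The filtering idea can be sketched on synthetic data as follows. The columns `related` and `noise` are hypothetical and are generated only to show one feature that tracks the target and one that does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

toy = pd.DataFrame({
    "Target": target,
    "related": target + rng.normal(0, 0.5, size=n),  # built to follow the target
    "noise": rng.normal(0, 1, size=n),               # generated independently
})

# Pearson correlation of every column with the target.
corr_with_target = toy.corr()["Target"]

# Keep columns whose absolute correlation exceeds the threshold.
threshold = 0.25
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

print(corr_with_target.round(2))
print(selected)
```

The target itself always passes the filter because its correlation with itself is 1, which is why it shows up in the reduced heatmap as well.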

In [None]:
threshold = 0.25 
filtre = np.abs(corr_matrix["Target"]) > threshold 
corr_features = corr_matrix.columns[filtre].tolist()
plt.subplots(figsize=(10,7))
sns.heatmap(data[corr_features].corr(),cmap="Set3", annot = True, fmt = ".2f")
plt.title("Correlation Between Features with Corr Threshold 0.25")
plt.show()

<a id = "9" ></a>
# <span style="color:purple;">Modeling</span>

<a id = "10" ></a>
## <span style="color:purple;">Train - Test Split</span>

In [None]:
X= data.drop(labels = ["Target"],axis = 1)
y=data.Target

In [None]:
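# A fixed seed keeps the split, and therefore the reported scores, reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)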

In [None]:
print("X_train",len(X_train))
print("X_test",len(X_test))
print("y_train",len(y_train))
print("y_test",len(y_test))

<a id = "11" ></a>
## <span style="color:purple;">Simple Logistic Regression</span>

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
acc_log_train = round(logreg.score(X_train, y_train)*100,2) 
acc_log_test = round(logreg.score(X_test,y_test)*100,2)
print("Training Accuracy: % {}".format(acc_log_train))
print("Testing Accuracy: % {}".format(acc_log_test))

<a id = "12" ></a>
## <span style="color:purple;">Hyperparameter Tuning -- Grid Search -- Cross Validation</span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<ul style="padding: 10px;color:white;">
We will compare five machine learning classifiers and evaluate the mean accuracy of each with stratified cross-validation.
<li>Decision Tree</li>
<li>SVM</li>
<li>Random Forest</li>
<li>KNN</li>
<li>Logistic Regression</li></ul>
</div>
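
Before the grid search, it may help to see stratified cross-validation on its own. This is a minimal sketch on synthetic data from `make_classification`. The sample size, class weights and `max_iter` value are arbitrary choices for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary problem (about 70% / 30%).
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   weights=[0.7, 0.3], random_state=42)

# Each of the 10 folds keeps roughly the same class ratio as the full data.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# One accuracy value per fold; the mean is the cross-validated score.
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_toy, y_toy, cv=cv, scoring="accuracy")

print(len(scores))
print(round(scores.mean() * 100, 2))
```

`GridSearchCV` repeats this procedure for every parameter combination and reports the best mean score as `best_score_`.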

In [None]:
random_state = 42
classifier = [DecisionTreeClassifier(random_state = random_state),
             SVC(random_state = random_state),
             RandomForestClassifier(random_state = random_state),
             LogisticRegression(random_state = random_state),
             KNeighborsClassifier()]

dt_param_grid = {"min_samples_split" : range(10,500,20),
                "max_depth": range(1,20,2)}

svc_param_grid = {"kernel" : ["rbf"],
                 "gamma": [0.001, 0.01, 0.1, 1],
                 "C": [1,10,50,100,200,300,1000]}

# 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3
rf_param_grid = {"max_features": ['sqrt', 'log2'],
                "n_estimators":[300,500],
                "criterion":["gini"],
                'max_depth' : [4,5,6,7,8,9,10,12],}

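# liblinear supports both l1 and l2; the default lbfgs solver rejects the l1 penalty
logreg_param_grid = {"C":np.logspace(-3,3,7),
                    "penalty": ["l1","l2"],
                    "solver": ["liblinear"]}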

knn_param_grid = {"n_neighbors": np.linspace(1,19,10, dtype = int).tolist(),
                 "weights": ["uniform","distance"],
                 "metric":["euclidean","manhattan"]}


classifier_param = [dt_param_grid,
                   svc_param_grid,
                   rf_param_grid,
                   logreg_param_grid,
                   knn_param_grid]

In [None]:
cv_result = []
best_estimators = []
for i in range(len(classifier)):
    clf = GridSearchCV(classifier[i], param_grid=classifier_param[i], cv = StratifiedKFold(n_splits = 10), scoring = "accuracy", n_jobs = -1,verbose = 1)
    clf.fit(X_train,y_train)
    cv_result.append(round(clf.best_score_*100,2))
    best_estimators.append(clf.best_estimator_)
    print(cv_result[i])

In [None]:
best_estimators

In [None]:
dt = best_estimators[0]
svm = best_estimators[1]
rf = best_estimators[2]
lr = best_estimators[3]
knn = best_estimators[4]

<a id = "13" ></a>
## <span style="color:purple;">XGBRF and CatBoost Classifier</span>

In [None]:
# xgbrf classifier
xgb_clf = xgboost.XGBRFClassifier(max_depth=3, random_state=random_state)
xgb_clf.fit(X_train,y_train)
acc_xgb_clf_train = round(xgb_clf.score(X_train, y_train)*100,2) 
acc_xgb_clf_test = round(xgb_clf.score(X_test,y_test)*100,2)
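# Note: this is training accuracy, while the first five entries of cv_result
# are cross-validated scores, so the values are not directly comparable.
cv_result.append(acc_xgb_clf_train)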
print("Training Accuracy: % {}".format(acc_xgb_clf_train))
print("Testing Accuracy: % {}".format(acc_xgb_clf_test))

In [None]:
#CatBoost Classifier
cat_clf = CatBoostClassifier()
cat_clf.fit(X_train,y_train)
acc_cat_clf_train = round(cat_clf.score(X_train, y_train)*100,2) 
acc_cat_clf_test = round(cat_clf.score(X_test,y_test)*100,2)
# Note: training accuracy again, not a cross-validated score.
cv_result.append(acc_cat_clf_train)
print("Training Accuracy: % {}".format(acc_cat_clf_train))
print("Testing Accuracy: % {}".format(acc_cat_clf_test))

<a id = "14" ></a>
# <span style="color:purple;">Results</span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#008080;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;color:white;">
Let's look at the accuracy score and confusion matrix of each model.</p>
</div>
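
As a reminder of how to read a confusion matrix, here is a minimal sketch with hand-written labels, invented for illustration, where 1 stands for PCOS and 0 for not PCOS. Sensitivity and specificity are added because accuracy alone can hide errors on the smaller class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written toy labels: 1 = PCOS, 0 = not PCOS.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
sensitivity = tp / (tp + fn)   # share of true PCOS cases that were detected
specificity = tn / (tn + fp)   # share of non-PCOS cases correctly cleared

print(cm)
print(accuracy, sensitivity, round(specificity, 3))
```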

In [None]:
model_list = ['Decision Tree','SVC','RandomForest','Logistic Regression','KNearestNeighbours','XGBRF','CatBoostClassifier']

In [None]:
import plotly.graph_objects as go
# create trace1
trace1 = go.Bar(
                x = model_list,
                y = cv_result,
                marker = dict(color = 'rgb(0, 128, 128)',
                              line=dict(color='rgb(0,0,0)',width=1.5)))
layout = go.Layout(title = 'Accuracy of different Classifier Models' , xaxis = dict(title = 'Classifier Models'), yaxis = dict(title = '% of Accuracy'))
fig = go.Figure(data = [trace1], layout = layout)
fig.show()

In [None]:
model = [dt,svm,rf,lr,knn,xgb_clf,cat_clf]
predictions = []

In [None]:
# Predict on the test set with every fitted model
for m in model:
    predictions.append(m.predict(X_test))

# Plot one confusion matrix per model, pairing each name with its predictions
for name, pred in zip(model_list, predictions):
    cm = confusion_matrix(y_test, pred)
    plot_confusion_matrix(cm, figsize=(12,8), hide_ticks=True, cmap=plt.cm.Set3)
    plt.title("{} Confusion Matrix".format(name))
    plt.xticks(range(2), ["Not PCOS","PCOS"], fontsize=16)
    plt.yticks(range(2), ["Not PCOS","PCOS"], fontsize=16)
    plt.show()