<a href="https://imgur.com/T5FEMnP"><img src="https://i.imgur.com/T5FEMnP.png" title="source: imgur.com" /></a>

### Table of Contents
1. [Importing Libraries](#Importing-Libraries)
2. [Overview](#Overview)
3. [Loading the Data](#Loading-the-Data)
4. [Explore](#Explore)
5. [Visualization](#Visualization)
6. [Feature Importance](#Feature-Importance)
7. [Transforming the data](#Transforming-the-data)
8. [Training Models](#Training-Models)
9. [Evaluation](#Evaluation)
10. [Result](#Result)

## Importing Libraries

In [None]:
## Kaggle kernels ship an outdated version of the Seaborn library (0.10 at the time this notebook was uploaded); this notebook needs 0.11 or above to run smoothly.
## Run this cell if your Seaborn version is below 0.11.
!pip install seaborn --upgrade

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.__version__
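
The version requirement can also be checked in code rather than by eye. The sketch below uses a hypothetical helper, `version_at_least` (it is not part of Seaborn), which compares only the leading numeric components of a version string:

```python
import re

def version_at_least(version, minimum):
    """Hypothetical helper: True if `version` (e.g. '0.11.2') is >= `minimum` (e.g. (0, 11))."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)  # keep only the leading digits of each component
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts) >= tuple(minimum)

print(version_at_least("0.10.1", (0, 11)))
print(version_at_least("0.11.2", (0, 11)))
```

In the notebook this could be called as `version_at_least(sns.__version__, (0, 11))`.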

## Overview
### Description :  
**Polycystic Ovary Syndrome (PCOS)** is a medical condition that causes a hormonal disorder in women of childbearing age. The hormonal imbalance leads to a delayed or even absent menstrual cycle. Women with PCOS commonly suffer from excessive weight gain, facial hair growth, acne, hair loss, skin darkening and irregular periods, leading to infertility in rare cases. Existing methodologies and treatments are insufficient for early-stage detection and prediction. To address this problem, we propose a system that can help in the early detection and prediction of PCOS from an optimal and minimal set of parameters. To detect whether a woman has PCOS, five machine learning classifiers are used: Random Forest, SVM, Logistic Regression, Gaussian Naïve Bayes and K-Nearest Neighbours. Out of the 41 features in the dataset, the top 30 were identified using the chi-square method and used in the feature vector. We also compared the results of each classifier and observed that the Random Forest classifier achieved the highest and most reliable accuracy. The dataset is available on Kaggle and is owned by *Prasoon Kottarathil*.

### **Dataset link:** https://www.kaggle.com/prasoonkottarathil/polycystic-ovary-syndrome-pcos
### **Target:**  
`PCOS (Y/N)`: Whether the person has been diagnosed with PCOS.  

### **Features:**  
**1. Physical Parameters:**  
`
Age (yrs)  
Weight (Kg)  
Height(Cm)  
Blood Group  
Pulse rate(bpm)  
RR (breaths/min)  
Cycle(R/I)  
Cycle length(days)  
Marraige Status (Yrs)
Hip(inch)  
Waist(inch)  
Waist:Hip Ratio  
No. of abortions `  

**2. Physical Symptoms:**  
`Pregnant(Y/N)  
Weight gain(Y/N)  
hair growth(Y/N)  
Skin darkening (Y/N)  
Hair loss(Y/N)  
Pimples(Y/N)  
Fast food (Y/N)  
Reg.Exercise(Y/N)`.

**3. Medical Parameters:**  
`BMI`: Body mass index (BMI) is a measure of body fat based on height and weight that applies to adult men and women.  
`Hb(g/dl)`: Hemoglobin(a protein in your red blood cells).  
`I beta-HCG(mIU/mL), II beta-HCG(mIU/mL)`: Human chorionic gonadotropin is a hormone for the maternal recognition of pregnancy.  
`FSH(mIU/mL)`: FSH helps manage the menstrual cycle and stimulates the ovaries to produce eggs.  
`LH(mIU/mL)`: LH helps control the menstrual cycle. It also triggers the release of an egg from the ovary.  
`FSH/LH`: Ratio of FSH to LH. Both are gonadotropins, hormones that stimulate the gonads (the testes in males, the ovaries in females).  
`TSH(mIU/L)`: TSH stands for thyroid stimulating hormone. A TSH test is a blood test that measures this hormone.  
`AMH(ng/mL)`: Within the ovaries, AMH helps in the early development of follicles.  
`PRL(ng/mL)`:  PRL or lactogenic hormone. Prolactin is mainly used to help women produce milk after childbirth.  
`Vit D3(ng/mL)`: Vitamin D.  
`PRG(ng/mL)`: Progesterone is an endogenous steroid and progestogen sex hormone involved in the menstrual cycle and pregnancy.  
`RBS(mg/dl)`: Random blood sugar (RBS) measures blood glucose regardless of when you last ate.  
`BP_Systolic (mmHg)`: Systolic pressure, the force of the blood against the artery walls as your heart beats.  
`BP_Diastolic (mmHg)`: Diastolic pressure, the blood pressure between heartbeats.  
`Follicle No. (L), Follicle No. (R)`: Ovarian follicles are small sacs filled with fluid that are found inside a woman's ovaries.  
`Endometrium (mm)`: The endometrium is the innermost lining layer of the uterus.  
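
Two of the parameters listed above, `BMI` and `Waist:Hip Ratio`, are derived from other columns. As a rough sketch (the function names are illustrative and the sample values are made up, not taken from the dataset): BMI is weight in kilograms divided by the square of height in metres, and the waist:hip ratio is waist divided by hip measured in the same unit.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

def waist_hip_ratio(waist_inch, hip_inch):
    """Waist circumference divided by hip circumference (same unit for both)."""
    return waist_inch / hip_inch

# Made-up example values
print(round(bmi(64.0, 160.0), 1))
print(round(waist_hip_ratio(30.0, 36.0), 2))
```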

## Loading the Data

In [None]:
pcos = pd.read_csv('../input/pcos-dataset/PCOS_data.csv')

## Explore

In [None]:
pcos

Some columns were read with datatype "object" rather than as real numbers, so we convert them to the appropriate numeric datatype; values that cannot be parsed become NaN.  
Columns like `Sl. No, Patient File No., Unnamed: 44` carry no useful information for us, so we remove them.

In [None]:
for i in ['AMH(ng/mL)', 'II    beta-HCG(mIU/mL)']:
    pcos[i] = pd.to_numeric(pcos[i], errors='coerce')
pcos = pcos.drop(['Sl. No', 'Patient File No.', 'Unnamed: 44'], axis =1)
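
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up Series (the values are invented for illustration, not read from the dataset):

```python
import pandas as pd

# Made-up values: one entry is not a valid number, so pandas reads the column as object
raw = pd.Series(["2.07", "1.53", "a", "6.63"])

# errors='coerce' replaces anything unparseable with NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(raw.dtype, "->", converted.dtype)
print("Values turned into NaN:", converted.isna().sum())
```

The NaN values introduced this way are what the missing-value check later in the notebook picks up.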

In [None]:
target = pcos.columns[:1].to_list()
features = pcos.columns[1:].to_list()
print("Total number of Features:", len(features))

Let's check whether there are any missing values and, if so, remove them.

In [None]:
pcos.isnull().sum()

As the number of **missing values is very low** (negligible), we can simply drop the rows that contain them.

In [None]:
pcos = pcos.dropna()
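
Because `dropna()` removes a whole row whenever any of its columns is missing, it can be worth confirming how many rows are actually lost. A small sketch on a toy frame (the column names and values are placeholders, not the PCOS data):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values that fall in different rows
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [0.5, np.nan, 1.5, 2.5, 3.5],
})

rows_with_missing = int(toy.isnull().any(axis=1).sum())  # rows that dropna() will remove
cleaned = toy.dropna()
fraction_dropped = 1 - len(cleaned) / len(toy)

print(rows_with_missing, len(cleaned), fraction_dropped)
```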

## Visualization

The dataset contains columns with continuous as well as discrete observations. Let's see whether we can derive any useful insights from the columns that have continuous values.

In [None]:
continous=[
'PRL(ng/mL)', 'FSH/LH', 
'II    beta-HCG(mIU/mL)', '  I   beta-HCG(mIU/mL)',
'BP _Diastolic (mmHg)', 'BP _Systolic (mmHg)',
'Avg. F size (L) (mm)', 'Avg. F size (R) (mm)',
'TSH (mIU/L)', 'RBS(mg/dl)',
'Vit D3 (ng/mL)','Cycle length(days)'
]

f, axes = plt.subplots(6, 2, figsize=(16,25))
# One KDE plot per continuous column, split by PCOS status
for ax, col in zip(axes.flat, continous):
    sns.kdeplot(data=pcos, x=col, hue="PCOS (Y/N)", ax=ax)

We can see that the distributions for patients with PCOS follow much the same trends as those for patients without PCOS. These plots are therefore not very useful for finding features that differentiate a patient who is diagnosed with PCOS from one who isn't.  

So, in order to find important features, we will turn to statistics. 

## Feature Importance
As we have seen above, we have 41 features. We could use all of them, but some may not be useful, and using too many increases the chance of overfitting. We also saw that visualisation did not help us find important features. Hence we will use the chi-square method to determine which features matter.   
The chi-square method calculates a score for each feature; the higher the score, the stronger the dependence between that feature and the target. Note that scikit-learn's `chi2` requires non-negative feature values.  
We will keep the **top 30** highest-scoring features.  
We will use **[SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)** and **[chi-squared](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)** to find the feature importance.  

#### What is a Chi-Square statistic?
A chi-square (χ2) statistic is a test that measures how expectations compare to actual observed data (or model results). The data used in calculating a chi-square statistic must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample.

#### What does a Chi-Square statistic tell you?
There are two main kinds of chi-square tests: the test of independence, which asks a question of relationship, such as, **"Is there a relationship between gender and SAT scores?"**; and the goodness-of-fit test, which asks something like **"If a coin is tossed 100 times, will it come up heads 50 times and tails 50 times?"**
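
As a worked illustration of the test of independence, the sketch below computes the statistic by hand for a hypothetical 2x2 contingency table (the counts are invented). Expected counts are what we would see if the row and column variables were independent, and the statistic sums the squared deviations from them. scikit-learn's `chi2` scorer treats each non-negative feature as a frequency count and compares its per-class totals with the totals expected from class frequencies, so its scores are not identical to this hand calculation.

```python
import numpy as np

# Hypothetical 2x2 contingency table (rows: symptom absent/present,
# columns: condition absent/present); the counts are invented
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Expected counts if rows and columns were independent
expected = row_totals @ col_totals / total

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_square = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(round(chi_square, 2))
```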

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

num = 30

# Score every feature against the target with the chi-squared statistic
bestfeatures = SelectKBest(score_func=chi2, k=num)
fit = bestfeatures.fit(pcos[features], np.ravel(pcos[target]))

# Pair each score with its feature name. scores_ follows the order of `features`,
# not pcos.columns (which also contains the target and would shift every name by one)
featureScores = pd.DataFrame({'Feature': features, 'Score': fit.scores_})
featureScores = featureScores.sort_values(by='Score', ascending = False)
featureScores = featureScores.reset_index(drop = True)
featureScores[:num]

In [None]:
new_features = featureScores['Feature'].to_list()
new_features = new_features[:num]
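
An alternative way to read off the selected names is the selector's `get_support()` mask, which is aligned with the columns passed to `fit`. The sketch below uses synthetic stand-in data (the column names `informative`, `noise_a` and `noise_b` are made up) rather than the PCOS frame:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic stand-in data; chi2 requires non-negative feature values
y = rng.integers(0, 2, size=200)
X = pd.DataFrame({
    "informative": y * 5 + rng.integers(0, 3, size=200),  # depends on the target
    "noise_a": rng.integers(0, 10, size=200),             # unrelated to the target
    "noise_b": rng.integers(0, 10, size=200),             # unrelated to the target
})

selector = SelectKBest(score_func=chi2, k=1).fit(X, y)

# get_support() returns a boolean mask in the same order as X.columns
selected = X.columns[selector.get_support()].to_list()
print(selected)
```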

## Transforming the data
We will use a combination of [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to carry out the necessary transformations on our data.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, new_features)])
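
To see what this preprocessor does, here is a minimal sketch on a toy frame (column names and values are placeholders). Only the listed columns are standardised to zero mean and unit variance; with the default `remainder='drop'`, any column not listed is left out of the output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data; "ignored" is deliberately not passed to the transformer
toy = pd.DataFrame({
    "weight": [50.0, 60.0, 70.0, 80.0],
    "pulse": [70.0, 72.0, 74.0, 76.0],
    "ignored": [1.0, 2.0, 3.0, 4.0],
})

scaler = Pipeline(steps=[("scaler", StandardScaler())])
pre = ColumnTransformer(transformers=[("num", scaler, ["weight", "pulse"])])

scaled = pre.fit_transform(toy)
print(scaled.shape)  # only the two listed columns remain
print(scaled.mean(axis=0), scaled.std(axis=0))
```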

## Training Models  
### Classifiers used: 
1. [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
2. [Support Vector Machine(SVM)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
     1. SVM with Linear kernel.
     2. SVM with Radial kernel.
3. [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
4.  [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
5. [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

### Evaluation: accuracy under 5-fold cross-validation ([K-Folds cross-validator](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html))

In [None]:
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score, train_test_split

train, test = train_test_split(pcos, test_size = 0.2, random_state = 0)

observations = pd.DataFrame()

classifiers = [
    'Linear SVM', 
    'Radial SVM',
    'LogisticRegression', 
    'RandomForestClassifier', 
    'KNeighborsClassifier', 
    'Gaussian Naive Bayes'
]

models = [
    svm.SVC(kernel='linear'), 
    svm.SVC(kernel='rbf'), 
    LogisticRegression(), 
    RandomForestClassifier(n_estimators=200, random_state=0),
    KNeighborsClassifier(),
    GaussianNB()
]

# Use the same 5-fold split for every model so the scores are comparable
cv = KFold(n_splits=5, random_state=0, shuffle=True)
for name, model in zip(classifiers, models):
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])
    observations[name] = cross_val_score(pipe, train[new_features], np.ravel(train[target]), scoring='accuracy', cv=cv)
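
The same pattern can be tried in isolation. The sketch below runs 5-fold cross-validation for a single pipeline on synthetic data from `make_classification` (a stand-in, not the PCOS dataset). Keeping the scaler inside the pipeline means it is fitted only on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem standing in for the real data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("model", LogisticRegression())])
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold
scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=cv)
print(scores.round(3), round(float(scores.mean()), 3))
```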

## Evaluation
#### Here are the scores for each of the 5 folds, along with their mean.

In [None]:
mean = pd.DataFrame(observations.mean(), index= classifiers)
observations = pd.concat([observations,mean.T])
observations.index=['Fold 1','Fold 2','Fold 3','Fold 4','Fold 5','Mean']
observations.T.sort_values(by=['Mean'], ascending = False)

#### Let's see how our Random Forest model performed on the held-out test set, using a confusion matrix and an ROC curve.

## Result

In [None]:
from sklearn.metrics import confusion_matrix

ran_model = RandomForestClassifier(n_estimators=200, random_state=0)
ran_pipe = Pipeline(steps=[('preprocessor', preprocessor), ('model', ran_model)])
ran_pipe.fit(train[new_features], np.ravel(train[target]))
pred = ran_pipe.predict(test[new_features])

In [None]:
plt.figure(dpi = 100)
plt.title("Confusion Matrix")
cf_matrix = confusion_matrix(np.ravel(test[target]), pred)
cf_hm = sns.heatmap(cf_matrix, annot=True, cmap = 'rocket_r')
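
The four cells of a binary confusion matrix can be turned into rates directly. In scikit-learn's layout, rows are true classes and columns are predicted classes, so `ravel()` yields the counts in the order TN, FP, FN, TP. A sketch with made-up labels (not the model's actual predictions):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = positive class, 0 = negative class
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels the flattened matrix is ordered TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)    # share of predicted positives that are correct

print(tn, fp, fn, tp)
print(sensitivity, specificity, precision)
```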

In [None]:
import sklearn.metrics as metrics

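# An ROC curve needs continuous scores, so use the predicted probability of the positive class rather than hard labels
proba = ran_pipe.predict_proba(test[new_features])[:, 1]
fpr, tpr, threshold = metrics.roc_curve(np.ravel(test[target]), proba)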
roc_auc = metrics.auc(fpr, tpr)

plt.figure(dpi = 100)
plt.title('ROC curve for Random Forest Classifier')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate (sensitivity)')
plt.xlabel('False Positive Rate (1 - specificity)')
plt.show()
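
An ROC curve is traced by sweeping a decision threshold over continuous scores. The sketch below, on synthetic data (a stand-in, not the PCOS dataset), compares what `roc_curve` returns when it is given class probabilities with what it returns for hard 0/1 predictions, which can only produce a single operating point between the two corners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
labels = clf.predict(X_test)              # hard 0/1 predictions

fpr_s, tpr_s, _ = roc_curve(y_test, scores)
fpr_l, tpr_l, _ = roc_curve(y_test, labels)

# Probabilities give many thresholds; hard labels give at most three curve points
print(len(fpr_s), len(fpr_l))
auc_from_scores = roc_auc_score(y_test, scores)
print(round(auc_from_scores, 3))
```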

[Go to Table of Contents](#Table-of-Contents)