<a href="https://colab.research.google.com/github/shumphries22/PCOS-Identification-Using-Machine-Learning/blob/main/PCOSPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PCOS Prediction Model

This notebook trains a model to predict polycystic ovary syndrome (PCOS). <br>
<br>First, we load the PCOS dataset from Kaggle: https://www.kaggle.com/datasets/prasoonkottarathil/polycystic-ovary-syndrome-pcos/data
<br>


### Load/Import Dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [2]:
from google.colab import files
uploadedfiles = files.upload()
originalDataset = pd.read_csv("datasetOriginalCSV.csv")
#print (originalDataset.loc[0])
originalDataset.head()

Saving datasetOriginalCSV.csv to datasetOriginalCSV.csv


Unnamed: 0,SI. No,Patient File .No,PCOS (Y/N),Age (yrs),Weight (kg),Height (cm),BMI,Blood group,Pulse Rate (bpm),RR (breaths/min),...,Pimples (Y/N),Fast food (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm)
0,1,1,0,28,44.6,152.0,19.3,15,78,22,...,0,1.0,0,110,80,3,3,18.0,18.0,8.5
1,2,2,0,36,65.0,161.5,24.921163,15,74,20,...,0,0.0,0,120,70,3,5,15.0,14.0,3.7
2,3,3,1,33,68.8,165.0,25.270891,11,72,18,...,1,1.0,0,120,80,13,15,18.0,20.0,10.0
3,4,4,0,37,65.0,148.0,29.674945,13,72,20,...,0,0.0,0,120,70,2,2,15.0,14.0,7.5
4,5,5,0,25,52.0,161.0,20.060954,11,72,18,...,0,0.0,0,120,80,3,4,16.0,14.0,7.0


In [None]:
originalDataset.shape

(541, 44)

Next, we make sure the dataset contains only numerical values. Entries that cannot be parsed as numbers are converted to NaN, and any row containing a NaN is dropped.

In [None]:
originalDataset = originalDataset.apply(pd.to_numeric, errors='coerce')
originalDataset = originalDataset.dropna()
originalDataset.shape

(539, 44)
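The coercion step can be illustrated on a small frame. This is a minimal sketch with made-up values (the names `toy` and `cleaned` are not part of the notebook); it shows how one unparseable entry becomes NaN and how `dropna` then removes the whole row.

```python
import pandas as pd

# Hypothetical toy data: one cell ("x") cannot be parsed as a number
toy = pd.DataFrame({"a": ["1", "2", "x"], "b": ["1.5", "2.5", "3"]})

# errors='coerce' turns unparseable entries into NaN instead of raising
coerced = toy.apply(pd.to_numeric, errors="coerce")

# Count NaNs per column before dropping, to see what would be lost
print(coerced.isna().sum())

# dropna() removes every row that contains at least one NaN
cleaned = coerced.dropna()
print(cleaned.shape)
```

Checking `isna().sum()` before calling `dropna()` is a cheap way to see which columns account for the dropped rows.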

Next, `x` takes the 41 feature columns and `y` takes the target column, `PCOS (Y/N)`. The two leading ID columns are left out.

In [None]:
x = originalDataset.iloc[:, 3:44]
y = originalDataset.iloc[:, 2]
x.shape


(539, 41)
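Positional slicing depends on the column order staying fixed. As a hedged alternative, the same selection could be written with column labels. The sketch below uses a toy frame whose column names are copied from the `head()` output above; the names in the actual CSV may differ in spacing, so treat them as assumptions.

```python
import pandas as pd

# Toy frame reusing a few column names and values from the head() output
toy = pd.DataFrame({
    "SI. No": [1, 2, 3],
    "Patient File .No": [1, 2, 3],
    "PCOS (Y/N)": [0, 0, 1],
    "Age (yrs)": [28, 36, 33],
    "Weight (kg)": [44.6, 65.0, 68.8],
})

target_col = "PCOS (Y/N)"                  # target label
id_cols = ["SI. No", "Patient File .No"]   # identifiers, not predictive features

toy_y = toy[target_col]
toy_x = toy.drop(columns=id_cols + [target_col])
print(list(toy_x.columns))
```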

Split the dataset into training (70%) and test (30%) sets.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
#30% test, 70% train
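The cell above produces two partitions. If a separate validation set were also wanted, one possible approach is to call `train_test_split` twice. The sketch below uses synthetic arrays and an assumed 70/15/15 split with stratification; neither choice comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 2 features, imbalanced binary labels
demo_x = np.arange(200).reshape(100, 2)
demo_y = np.array([0] * 70 + [1] * 30)

# First split: hold out 30% of the samples
x_tr, x_hold, y_tr, y_hold = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=1, stratify=demo_y
)

# Second split: divide the held-out part evenly into validation and test
x_val, x_te, y_val, y_te = train_test_split(
    x_hold, y_hold, test_size=0.5, random_state=1, stratify=y_hold
)

print(len(x_tr), len(x_val), len(x_te))
```

Passing `stratify` keeps the class ratio similar across partitions, which can matter when one class is much rarer than the other.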


### Backward Elimination
Next, we apply backward elimination to the training set. Starting from all features, we repeatedly fit an ordinary least squares (OLS) model, remove the feature with the highest p-value, and refit, stopping once every remaining p-value is below the significance level (0.05). The features that remain are then used to train the machine learning model.

In [None]:
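def backwardElimination(x, y, sigLvl=0.05):
  """Drop the feature with the highest OLS p-value until all are below sigLvl."""
  xValues = x.copy()  # start with all features
  while True:
    # note: no intercept column is added to the design matrix
    model = sm.OLS(y, xValues).fit()
    pVals = model.pvalues
    maxPVal = pVals.max()

    # stop once every remaining p-value is below sigLvl
    if maxPVal < sigLvl:
      break

    # otherwise remove the least significant feature and refit
    badFeature = pVals.idxmax()
    xValues = xValues.drop(columns=[badFeature])
    print(f"{badFeature} was removed with p value of {maxPVal:.4f}")

  return xValues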

In [None]:
xTrainValues = backwardElimination(x_train, y_train)
xTestValues = x_test[xTrainValues.columns]

RBS(mg/dl) was removed with p value of 0.9106
Vit D3 (ng/mL) was removed with p value of 0.8882
FSH(mIU/mL) was removed with p value of 0.7831
PRL(ng/mL) was removed with p value of 0.7243
Endometrium (mm) was removed with p value of 0.7166
Age (yrs) was removed with p value of 0.6978
BP _Diastolic (mmHg) was removed with p value of 0.6741
Blood group was removed with p value of 0.6664
I beta-HCG(mIU/mL was removed with p value of 0.5905
FSH/LH was removed with p value of 0.5400
BP _Systolic (mmHg) was removed with p value of 0.5207
Hair loss(Y/N) was removed with p value of 0.5226
II beta-HCG(mIU/mL) was removed with p value of 0.4344
Marraige Status (Yrs) was removed with p value of 0.5164
RR (breaths/min) was removed with p value of 0.4472
Hb(g/dl) was removed with p value of 0.3675
Avg. F size (R) (mm) was removed with p value of 0.3625
Avg. F size (L) (mm) was removed with p value of 0.3632
No. of abortions was removed with p value of 0.2903
Reg.Exercise(Y/N) was removed with p va

In [None]:
print(xTrainValues.columns)

Index(['Weight (kg)', 'Height (cm) ', 'BMI', 'Cycle(R/I)', 'LH(mIU/mL)',
       'AMH(ng/mL)', 'Weight gain (Y/N)', 'hair growth(Y/N)',
       'Skin darkening (Y/N)', 'Fast food (Y/N)', 'Follicle No. (L)',
       'Follicle No. (R)'],
      dtype='object')


Backward elimination keeps only the features whose p-values fall below the 0.05 significance level, so the model is trained on features that show a statistically significant association with the target.
<br>
<br>These are the 12 features left after backward elimination, matching the printed column index:
*   Weight (kg)
*   Height (cm)
*   BMI
*   Cycle(R/I)
*   LH(mIU/mL)
*   AMH(ng/mL)
*   Weight gain (Y/N)
*   Hair growth (Y/N)
*   Skin darkening (Y/N)
*   Fast food (Y/N)
*   Follicle No. (L)
*   Follicle No. (R)




### Training SVM Model

In [None]:
svmModel = SVC(kernel='linear')
svmModel.fit(xTrainValues, y_train)

In [None]:
predictPCOS = svmModel.predict(xTestValues)
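`StandardScaler` is imported at the top of the notebook but is not applied in the cells shown. Because SVMs are sensitive to feature scale, one option is to wrap the scaler and classifier in a pipeline so that the scaling statistics are learned from the training data only. The sketch below uses synthetic data from `make_classification`; it does not reproduce the PCOS results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with one feature on a much larger scale
demo_x, demo_y = make_classification(n_samples=200, n_features=5, random_state=1)
demo_x[:, 0] *= 1000

x_tr, x_te, y_tr, y_te = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=1
)

# The pipeline fits the scaler on the training split, then the linear SVM
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_svm.fit(x_tr, y_tr)

# predict() applies the same training-set scaling to the test split
demo_pred = scaled_svm.predict(x_te)
print(demo_pred.shape)
```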

### Evaluating SVM Model

To judge whether the model is suitable for its intended purpose, we evaluate its predictions on the test set.



In [None]:
#create confusion matrix
conMat = confusion_matrix(y_test, predictPCOS)

conMatDF = pd.DataFrame(conMat, index=[f'Actual {label}' for label in np.unique(y)],
                     columns=[f'Predicted {label}' for label in np.unique(y)])


#compute and print the evaluation metrics
print("Confusion Matrix:")
print(conMatDF)

print("\n Accuracy:", accuracy_score(y_test, predictPCOS))
print("\n Precision:", precision_score(y_test, predictPCOS))
print("\n Recall:", recall_score(y_test, predictPCOS))
print("\n F1 Score:", f1_score(y_test, predictPCOS))
#print(classification_report(y_test, predictPCOS))


Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0          110            4
Actual 1            8           40

 Accuracy: 0.9259259259259259

 Precision: 0.9090909090909091

 Recall: 0.8333333333333334

 F1 Score: 0.8695652173913043
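The reported scores follow directly from the four cells of the confusion matrix. As a quick check, the sketch below recomputes them by hand, treating class 1 (PCOS) as the positive class.

```python
# Cells of the confusion matrix printed above
tn, fp = 110, 4   # actual 0: predicted 0, predicted 1
fn, tp = 8, 40    # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)           # correct / all predictions
precision = tp / (tp + fp)                           # predicted positives that are right
recall = tp / (tp + fn)                              # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.9259 precision=0.9091 recall=0.8333 f1=0.8696
```

The recall of about 0.83 means 8 of the 48 actual PCOS cases in the test set were missed, which accuracy alone does not show.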
